Lab Assignment
Problem 1. The file SpeedTrap.RData is an R data set that contains a data frame called SpeedTrap. This data frame consists of 184 observations (rows) and 7 variables (columns).
Each row corresponds to a town in the Chicago area. The variables are as follows:
Variable Name   Description
Res.Stop        The number of traffic stops made by police in 2014 where the motorist was from the same town as where the stop was made (resident stops)
Res.Ticket      The number of resident traffic stops where a ticket was issued
Out.Stop        The number of traffic stops made by police in 2014 where the motorist was not from the same town as where the stop was made (outsider stops)
Out.Ticket      The number of outsider traffic stops where a ticket was issued
Pop             Population of the town (in thousands), 2010 census
PPSQMI          Number of persons per square mile in the town, 2010 census
PPHU            Number of persons per housing unit in the town, 2010 census
PCI             Per capita income of the town (in thousands of dollars), 2010 census

In each community we want to compare the rate of ticketing outsiders who are stopped for a traffic violation to the rate of ticketing residents who are stopped. To do this we will use the odds ratio which is defined as follows:
$$\theta = \frac{\pi_{out} / (1 - \pi_{out})}{\pi_{res} / (1 - \pi_{res})} = \frac{\text{odds of outsider being ticketed}}{\text{odds of resident being ticketed}},$$

where π_{out} and π_{res} are the probabilities of being ticketed for outsiders and residents, respectively.
The odds ratio is used to compare probabilities between two populations. In statistical modeling it is often preferred to the straight difference π_{out} − π_{res}. An odds ratio of 1.0 implies that the two probabilities are equal. An odds ratio greater than 1.0 implies that π_{out} is greater than π_{res}. The odds ratio is estimated from the counts of successes and failures in each community by replacing π_{out} and π_{res} with sample estimates.
(a) Begin by calculating the estimated odds ratio for each community. Append this variable to the data frame (call it OddsRatio). The first three values should match the following:
> SpeedTrap[1:3,"OddsRatio"]
[1] 1.146857 1.201661 1.264754
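The estimate described above can be sketched as follows. This is a minimal illustration on made-up counts for three hypothetical towns (the data frame `toy` is an assumption, not the SpeedTrap data), treating a ticket as a "success" and a stop without a ticket as a "failure".

```r
# Toy counts for three hypothetical towns (NOT the SpeedTrap data).
toy <- data.frame(
  Res.Stop   = c(200, 150, 400),
  Res.Ticket = c( 80,  60, 100),
  Out.Stop   = c(300, 500, 250),
  Out.Ticket = c(150, 250,  50)
)

# Sample odds = tickets / stops that did not end in a ticket.
# This equals p / (1 - p) with p = tickets / stops.
odds_out <- toy$Out.Ticket / (toy$Out.Stop - toy$Out.Ticket)
odds_res <- toy$Res.Ticket / (toy$Res.Stop - toy$Res.Ticket)

# Estimated odds ratio, appended as a new column.
toy$OddsRatio <- odds_out / odds_res
print(toy$OddsRatio)
```

Applying the same two lines to the columns of SpeedTrap should give the OddsRatio column, provided the data contain no town with zero tickets or with every stop ticketed.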
(b) Fit a regression model using OddsRatio as the outcome variable and Pop, PPSQMI, PPHU, and PCI as predictor variables. Using diagnostic plots, describe how well the regression conforms to the assumptions of the normal, linear regression model. [3 pts]
(c) Identify those communities for which the leverage exceeds three times the average value. Rerun the regression with these communities removed from the data set. Describe how their removal affects the fitted model. [3 pts]
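One way to apply the leverage rule in (c) is sketched below on simulated data (the data frame `sim` and its columns are hypothetical). Leverages are the diagonal of the hat matrix, and their average is p/n, where p is the number of fitted coefficients including the intercept.

```r
set.seed(1)
n <- 50
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x1[1] <- 10                        # one deliberately extreme predictor value
sim$y <- 1 + 0.5 * sim$x1 - 0.3 * sim$x2 + rnorm(n)

fit <- lm(y ~ x1 + x2, data = sim)
h <- hatvalues(fit)                    # leverage of each observation
p <- length(coef(fit))                 # mean(h) equals p / n

# Keep rows whose leverage is at most three times the average.
# A logical index is used so that nothing is dropped when no row is flagged.
keep <- h <= 3 * mean(h)
print(which(!keep))                    # rows flagged as high leverage

fit_trim <- lm(y ~ x1 + x2, data = sim[keep, ])
print(cbind(all = coef(fit), trimmed = coef(fit_trim)))
```

Comparing the two coefficient columns (and the corresponding standard errors from `summary()`) is one simple way to describe how removal affects the fit.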
(d) Rerun the regression in (b), replacing each of the predictors by its logarithm. How does this change affect the presence of observations with high leverage? [2 pts]
(e) Using log-transformed predictors, find a Box-Cox transformation of the outcome variable that maximizes the likelihood. Refit the model with the transformed outcome variable. Does it better conform to the assumptions of the normal, linear regression model than the model that you fit originally? In what respects are the diagnostics still troublesome? [3 pts]
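A minimal sketch of the Box-Cox step, assuming the `boxcox()` function from the MASS package and a simulated positive outcome (the data frame `sim` and the helper `bc_transform` are hypothetical). The profile log-likelihood is evaluated on a grid of λ values and the maximizing value is used to transform the outcome.

```r
library(MASS)

set.seed(2)
n <- 100
sim <- data.frame(x = runif(n, 1, 10))
sim$y <- exp(0.2 + 0.3 * log(sim$x) + rnorm(n, sd = 0.2))   # strictly positive

# y = TRUE stores the response in the fit, which boxcox() uses.
fit <- lm(y ~ log(x), data = sim, y = TRUE)

# Profile log-likelihood over a grid of lambda values, without plotting.
bc <- boxcox(fit, lambda = seq(-2, 2, by = 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
print(lambda_hat)

# Box-Cox transform; lambda = 0 corresponds to the log.
bc_transform <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

sim$y_bc <- bc_transform(sim$y, lambda_hat)
fit_bc <- lm(y_bc ~ log(x), data = sim)
```

In practice a nearby "round" value of λ inside the likelihood-based confidence interval (for example 0 or 0.5) is often chosen for interpretability; whether that is appropriate here is a judgment call.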
(f) Produce and interpret a set of partial regression plots for the model that you fit in (e). Do the predictor variables appear to be treated appropriately in the model? [3 pts]
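A partial regression (added-variable) plot for one predictor can be built by hand from two auxiliary regressions, as sketched below on simulated data (all names are hypothetical). The slope of the line through the plot equals that predictor's coefficient in the full model, which is a useful check.

```r
set.seed(3)
n <- 80
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 2 + 1.5 * sim$x1 - 0.7 * sim$x2 + rnorm(n)

full <- lm(y ~ x1 + x2, data = sim)

# Remove the linear effect of the other predictor(s) from both y and x1.
e_y  <- resid(lm(y  ~ x2, data = sim))
e_x1 <- resid(lm(x1 ~ x2, data = sim))

# Regressing one set of residuals on the other recovers the full-model slope.
av_fit <- lm(e_y ~ e_x1)
print(c(partial = unname(coef(av_fit)["e_x1"]),
        full    = unname(coef(full)["x1"])))

# Draw the plot only in an interactive session.
if (interactive()) {
  plot(e_x1, e_y, xlab = "x1 | others", ylab = "y | others")
  abline(av_fit)
}
```

Curvature or a few isolated points dominating the slope in such a plot would suggest that the predictor is not entering the model appropriately.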
(g) Assuming that all necessary assumptions are met with the model that you fit in (e):
i. Test the null hypothesis that the coefficients on log(PPHU) and log(PPSQMI) are both zero. [2 pts]
ii. Give a 95% confidence interval for the coefficient on log(PCI). [2 pts]
iii. Give a 95% prediction interval for the estimated odds ratio in a community that has a population of 25,000; 4000 persons per square mile; 2.8 persons per housing unit, and a per capita income of $26,000. [2 pts]
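The three tasks in (g) map onto standard `lm` tools, sketched here on simulated data that only borrows the variable names (the values in `sim` are made up, and the outcome `y` stands in for a transformed odds ratio). Note the units: Pop and PCI are recorded in thousands, so a population of 25,000 and an income of $26,000 enter as 25 and 26.

```r
set.seed(4)
n <- 120
sim <- data.frame(
  Pop    = runif(n, 1, 100),        # thousands (hypothetical values)
  PPSQMI = runif(n, 500, 10000),
  PPHU   = runif(n, 2, 3.5),
  PCI    = runif(n, 15, 60)         # thousands of dollars
)
sim$y <- 0.5 + 0.1 * log(sim$Pop) - 0.2 * log(sim$PCI) + rnorm(n, sd = 0.3)

full    <- lm(y ~ log(Pop) + log(PPSQMI) + log(PPHU) + log(PCI), data = sim)
reduced <- lm(y ~ log(Pop) + log(PCI), data = sim)

# i. F-test of H0: coefficients on log(PPHU) and log(PPSQMI) are both zero.
f_test <- anova(reduced, full)
print(f_test)

# ii. 95% confidence interval for the coefficient on log(PCI).
ci_pci <- confint(full, "log(PCI)", level = 0.95)
print(ci_pci)

# iii. 95% prediction interval for a new community.
new_town <- data.frame(Pop = 25, PPSQMI = 4000, PPHU = 2.8, PCI = 26)
pi_new <- predict(full, newdata = new_town,
                  interval = "prediction", level = 0.95)
print(pi_new)
```

If the outcome was Box-Cox transformed, the interval from `predict()` is on the transformed scale; its endpoints would need to be back-transformed to give an interval for the odds ratio itself.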
(h) Conduct an outlier analysis on residuals from the regression in (e). Use a family-wide Type I error probability of α = .01. Which communities should be considered for removal from the regression? [3 pts]
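One common way to control the family-wide error rate is a Bonferroni correction on externally studentized residuals; the assignment does not say which procedure is intended, so the sketch below is an assumption, shown on simulated data with one planted outlier.

```r
set.seed(5)
n <- 60
sim <- data.frame(x = rnorm(n))
sim$y <- 1 + 2 * sim$x + rnorm(n)
sim$y[10] <- sim$y[10] + 8             # plant one gross outlier

fit <- lm(y ~ x, data = sim)
t_i <- rstudent(fit)                   # externally studentized residuals
p <- length(coef(fit))

# Two-sided Bonferroni critical value: each of the n tests uses alpha / n.
alpha <- 0.01
crit <- qt(1 - alpha / (2 * n), df = n - p - 1)

flagged <- which(abs(t_i) > crit)
print(crit)
print(flagged)
```

Each studentized residual follows a t distribution with n − p − 1 degrees of freedom under the model, which is why that value is used for `df`.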
Problem 2. ... n = 111. This vector was obtained from a regression of air quality measurements (ozone) taken on 111 consecutive days in New York City in 1973. Each entry of ozone is either −1 if the residual is negative or +1 if the residual is positive. We are interested in testing the null hypothesis that the residuals are not serially correlated versus the alternative hypothesis that the residuals are serially correlated. Using the Runs Test, report a p-value and state your conclusion at the .05 test level. [3 pts]
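A minimal sketch of a runs test on a ±1 sequence, using the large-sample normal approximation (the function `runs_test` and the short example vector are hypothetical; an exact or continuity-corrected version may be what the course intends). A run is a maximal stretch of identical signs; too few or too many runs relative to the expected number suggests serial correlation.

```r
runs_test <- function(s) {
  stopifnot(all(s %in% c(-1, 1)))
  n1 <- sum(s == 1)                    # number of positive residuals
  n2 <- sum(s == -1)                   # number of negative residuals
  n  <- n1 + n2

  # A new run starts wherever consecutive signs differ.
  runs <- 1 + sum(s[-1] != s[-n])

  # Mean and variance of the number of runs under independence.
  mu <- 2 * n1 * n2 / n + 1
  v  <- 2 * n1 * n2 * (2 * n1 * n2 - n) / (n^2 * (n - 1))

  z <- (runs - mu) / sqrt(v)
  list(runs = runs, expected = mu, z = z, p.value = 2 * pnorm(-abs(z)))
}

# Short made-up sequence, far too small for the approximation to be reliable.
example_signs <- c(1, 1, 1, -1, -1, 1, -1, -1, -1, 1)
res <- runs_test(example_signs)
print(res)
```

With n = 111 the normal approximation is more defensible than in this ten-element example, and the same function could be called directly on the sign vector.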
Problem 3. For this problem you will use the prostate data that is available in the faraway package. The outcome variable is lcavol, all other variables are predictors. We want to determine if a regression model behaves differently for younger (under age 65) subjects than for older (age 65 and over) subjects.
(a) To do this, introduce a new variable called Young to the data set as a factor that distinguishes younger from older men. Introduce it in a way that separate intercepts and slopes are applied to the two groups of men. Show a summary of your regression. Note: we will accept the validity of all regression assumptions in this exercise. [2 pts]
(b) Using the model in (a), conduct an F-test to see if you reject the null hypothesis that coefficients associated with Young are all equal to zero. Explain in practical terms what your results mean. [2 pts]
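The structure of (a) and (b) can be sketched with a factor interacted with every predictor, followed by a nested-model F-test. The example uses simulated data with hypothetical predictors `x1` and `x2` rather than the prostate data, and it leaves age itself out of the model for brevity.

```r
set.seed(6)
n <- 100
sim <- data.frame(age = round(runif(n, 45, 80)),
                  x1  = rnorm(n),
                  x2  = rnorm(n))
sim$Young <- factor(ifelse(sim$age < 65, "young", "old"))
sim$y <- 1 + 0.8 * sim$x1 + 0.3 * sim$x2 + rnorm(n)

# Young * (x1 + x2) expands to Young + x1 + x2 + Young:x1 + Young:x2,
# giving each group its own intercept and its own slopes.
full    <- lm(y ~ Young * (x1 + x2), data = sim)
reduced <- lm(y ~ x1 + x2, data = sim)
print(summary(full))

# F-test of H0: every coefficient involving Young is zero.
f_test <- anova(reduced, full)
print(f_test)
```

A large p-value in the last table would mean the data give little evidence that the regression differs between the two age groups; a small one would mean at least one intercept or slope differs.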