Multiple Linear Regression in R: Exercises and Solutions
In order to solve the tasks you need:
- RStudio
- Data files: data file1, data file2, data file3, Rmd file (right-click -> Save Link As).
Exercise 1
A personnel officer in a governmental agency administered three newly developed aptitude tests to a random sample of 25 applicants for entry-level positions in the agency. For the purpose of the study, all 25 applicants were accepted for positions irrespective of their test scores. After a probationary period, each applicant was rated for proficiency on the job.
The scores on the three tests (x1, x2, x3) and the job proficiency score (y) for the 25 employees are in the file jobprof.
(Based on an exercise from Applied Linear Statistical Models, 5th ed. by Kutner, Nachtsheim, Neter, & Li)
Part 1a
Create a scatterplot matrix and the correlation matrix for all of the variables in the data set.
Do any aptitude test scores appear to be linearly related to the proficiency score? Do any relationships appear to be quadratic? Do any aptitude scores appear to be linearly related to each other?
Answer 1a
library(readxl)
jobprof <- read_excel("jobprof.xlsx")
# scatterplot matrix and correlation matrix of all observations
plot(jobprof)
cor(jobprof)
## y x1 x2 x3
## y 1.0000000 0.5144107 0.497005711 0.474604323
## x1 0.5144107 1.0000000 0.102268932 0.187945571
## x2 0.4970057 0.1022689 1.000000000 -0.006265503
## x3 0.4746043 0.1879456 -0.006265503 1.000000000
# remove the outlier
ind <- which(jobprof$x3 < 20)
jobprof_new <- jobprof[-ind, ]
# scatterplot matrix and correlation matrix without the outlier
plot(jobprof_new)
cor(jobprof_new)
## y x1 x2 x3
## y 1.0000000 0.5084617 0.5461008 0.8961876
## x1 0.5084617 1.0000000 0.1426666 0.1714944
## x2 0.5461008 0.1426666 1.0000000 0.5659944
## x3 0.8961876 0.1714944 0.5659944 1.0000000
Due to the outlier in the original data it was difficult to see how the variables are related to each other, so the outlier (the observation with an x3 value below 20) was removed.
The x3 aptitude test scores appear to be strongly linearly related to the proficiency score (r ≈ 0.90). The x1 and x2 aptitude test scores also appear to have weaker linear relationships with the proficiency score. No relationships appear to be quadratic. The x2 and x3 aptitude scores appear to be moderately linearly related to each other (r ≈ 0.57).
Part 1b
Obtain the model summary for the model composed of the three first-order terms and the three cross-product interaction terms:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \epsilon\]
Also use R to compute the VIF for each term in the model. Are any of the VIFs over 10?
This model is suffering from the effects of collinearity, which inflates the standard errors of the estimated coefficients.
What do you notice about the overall model p-value (from the F-statistic) and the individual p-values for each term in the model? Does it make sense that the overall model shows statistical significance but no individual term does?
Answer 1b
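The code chunk for this output is not shown; a minimal sketch that would fit the model and compute the VIFs (the object name mod1 is an assumption; vif() comes from the car package):
library(car)   # provides vif()
# three first-order terms plus the three cross-product interaction terms
mod1 <- lm(y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3, data = jobprof_new)
summary(mod1)
vif(mod1)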
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3, data = jobprof_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.227 -3.368 -1.075 2.849 11.850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -45.885322 145.886108 -0.315 0.757
## x1 -0.594584 0.842030 -0.706 0.490
## x2 -0.235729 0.944371 -0.250 0.806
## x3 1.454096 1.533963 0.948 0.356
## x1:x2 0.004320 0.004536 0.952 0.354
## x1:x3 0.004784 0.009131 0.524 0.607
## x2:x3 -0.001823 0.008629 -0.211 0.835
##
## Residual standard error: 5.568 on 17 degrees of freedom
## Multiple R-squared: 0.9412, Adjusted R-squared: 0.9205
## F-statistic: 45.37 on 6 and 17 DF, p-value: 1.546e-09
## x1 x2 x3 x1:x2 x1:x3 x2:x3
## 222.5079 191.6002 141.7109 143.0503 362.3913 302.4157
All of the VIFs are over 10. The overall model p-value is very small, while the individual p-values for each term in the model all exceed 0.05. It does make sense that the overall model shows statistical significance even though no individual term does, because there are high levels of multicollinearity: the predictors jointly explain the response well, but collinearity inflates the standard errors of the individual coefficients.
Part 1c
Many times, collinearity can be alleviated by centering the predictor variables. Center the predictor variables x1, x2, and x3 and create new variables to hold them (call them cx1, cx2, and cx3). Furthermore, create a quadratic term for the centered x2 variable.
Answer 1c
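No code is shown for this part; a sketch of the centering step, assuming the new variables are added to jobprof_new (the quadratic term is named cx2sq to match the later output):
jobprof_new$cx1 <- jobprof_new$x1 - mean(jobprof_new$x1)
jobprof_new$cx2 <- jobprof_new$x2 - mean(jobprof_new$x2)
jobprof_new$cx3 <- jobprof_new$x3 - mean(jobprof_new$x3)
jobprof_new$cx2sq <- jobprof_new$cx2^2   # quadratic term for the centered x2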
Part 1d
Now obtain the model summary for the model composed of the three first-order terms and the three cross-product interaction terms using the centered variables:
\[y=\beta_0 + \beta_1 cx_1 + \beta_2 cx_2 + \beta_3 cx_3 + \beta_4 cx_1 cx_2 + \beta_5 cx_1 cx_3 + \beta_6 cx_2 cx_3 + \epsilon\]
Use R to compute the VIF for each term in the model. Have the VIF values decreased after the variables are centered? What can you say about the overall model p-value (from the F-statistic) and the individual p-values for each term in the model? Does this make more sense?
Answer 1d
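A sketch of the code behind this output (the object name mod_cent is an assumption; the formula matches the Call shown below):
mod_cent <- lm(y ~ cx1 + cx2 + cx3 + cx1:cx2 + cx1:cx3 + cx2:cx3, data = jobprof_new)
summary(mod_cent)
vif(mod_cent)   # vif() from the car package, loaded above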
##
## Call:
## lm(formula = y ~ cx1 + cx2 + cx3 + cx1:cx2 + cx1:cx3 + cx2:cx3,
## data = jobprof_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.227 -3.368 -1.075 2.849 11.850
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 92.382572 1.411257 65.461 < 2e-16 ***
## cx1 0.345382 0.060708 5.689 2.66e-05 ***
## cx2 0.028918 0.090583 0.319 0.753
## cx3 1.758126 0.159804 11.002 3.75e-09 ***
## cx1:cx2 0.004320 0.004536 0.952 0.354
## cx1:cx3 0.004784 0.009131 0.524 0.607
## cx2:cx3 -0.001823 0.008629 -0.211 0.835
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.568 on 17 degrees of freedom
## Multiple R-squared: 0.9412, Adjusted R-squared: 0.9205
## F-statistic: 45.37 on 6 and 17 DF, p-value: 1.546e-09
## cx1 cx2 cx3 cx1:cx2 cx1:cx3 cx2:cx3
## 1.156590 1.762814 1.537979 1.300745 1.425879 1.390737
The VIF values have decreased substantially after centering, and none of them now exceed 10. The overall model p-value and the individual p-values for cx1 and cx3 are very small, while the individual p-values for the other terms still show no statistical significance. This makes more sense than the previous model.
Part 1e
Test the significance of all three coefficients for the interaction terms as a subset. Use a 5% level of significance. State \(H_0\) and \(H_a\) and provide the R output as well as a written conclusion.
Look back and check the individual p-values for the interaction terms from the previous model; how do they compare to the p-value when the interaction terms are tested together as a subset?
Answer 1e
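One way to set up this joint test is linearHypothesis() from the car package (a sketch; the note in the output below indicates the original call also supplied a coefficient covariance matrix, which is not reproduced here):
# jointly test the three interaction coefficients
linearHypothesis(mod_cent, c("cx1:cx2 = 0", "cx1:cx3 = 0", "cx2:cx3 = 0"))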
## Linear hypothesis test
##
## Hypothesis:
## cx1:cx2 = 0
## cx1:cx3 = 0
## cx2:cx3 = 0
##
## Model 1: restricted model
## Model 2: y ~ cx1 + cx2 + cx3 + cx1:cx2 + cx1:cx3 + cx2:cx3
##
## Note: Coefficient covariance matrix supplied.
##
## Res.Df Df F Pr(>F)
## 1 20
## 2 17 3 1.6518 0.2149
\(H_0: \beta_4 = \beta_5 = \beta_6 = 0\)
\(H_a: \beta_4 \neq 0\) or \(\beta_5 \neq 0\) or \(\beta_6 \neq 0\)
Since the p-value = 0.2149 > 0.05, we cannot reject the null hypothesis. The coefficients for the interaction terms are not jointly significant.
The individual p-values for the interaction terms from the previous model (0.354, 0.607, and 0.835) are all higher than the p-value when the interaction terms are tested together as a subset (0.2149).
Part 1f
Drop the interaction terms from the model and fit the following model with the quadratic term for \(x_2\):
\[y = \beta_0 + \beta_1 cx_1 + \beta_2 cx_2 + \beta_3 cx_3 + \beta_4 cx_2^2 +\epsilon\]
Should the quadratic term be retained in the model at a 5% level of significance?
Answer 1f
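A sketch of the model fit for this part (the object name mod_quad is an assumption; the formula matches the Call shown below):
mod_quad <- lm(y ~ cx1 + cx2 + cx3 + cx2sq, data = jobprof_new)
summary(mod_quad)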
##
## Call:
## lm(formula = y ~ cx1 + cx2 + cx3 + cx2sq, data = jobprof_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.908 -2.995 -1.052 2.361 9.841
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 91.251997 1.825735 49.981 < 2e-16 ***
## cx1 0.339332 0.057514 5.900 1.11e-05 ***
## cx2 0.072882 0.089888 0.811 0.428
## cx3 1.816837 0.158203 11.484 5.42e-10 ***
## cx2sq 0.004798 0.005201 0.922 0.368
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.478 on 19 degrees of freedom
## Multiple R-squared: 0.9364, Adjusted R-squared: 0.923
## F-statistic: 69.94 on 4 and 19 DF, p-value: 4.247e-11
The quadratic term should not be retained in the model at a 5% level of significance because its p-value (0.368) is higher than 0.05.
Part 1g
Drop the quadratic term and fit the model with only the original uncentered variables:
\[y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon\]
Are there any other terms that should be dropped from the model using the criteria of a 5% level of significance?
Answer 1g
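A sketch of the model fit with the original uncentered predictors (the object name mod_first is an assumption):
mod_first <- lm(y ~ x1 + x2 + x3, data = jobprof_new)
summary(mod_first)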
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = jobprof_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.627 -3.305 -0.667 2.355 11.883
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -128.09700 13.30659 -9.627 5.98e-09 ***
## x1 0.34942 0.05625 6.211 4.57e-06 ***
## x2 0.03801 0.08125 0.468 0.645
## x3 1.78651 0.15417 11.588 2.51e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.458 on 20 degrees of freedom
## Multiple R-squared: 0.9336, Adjusted R-squared: 0.9236
## F-statistic: 93.67 on 3 and 20 DF, p-value: 6.016e-12
Using the criterion of a 5% level of significance, the term x2 should be dropped from the model because its p-value (0.645) is higher than 0.05.
Part 1h
Fit the final model for predicting the proficiency score for the population of all employees for this government agency.
Answer 1h
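A sketch of the final model fit (the object name mod_final matches the later diagnostic and prediction code):
mod_final <- lm(y ~ x1 + x3, data = jobprof_new)
summary(mod_final)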
##
## Call:
## lm(formula = y ~ x1 + x3, data = jobprof_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2056 -3.0803 -0.4918 3.0091 12.7156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -128.27318 13.05153 -9.828 2.62e-09 ***
## x1 0.35090 0.05511 6.367 2.59e-06 ***
## x3 1.82656 0.12580 14.519 2.01e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.355 on 21 degrees of freedom
## Multiple R-squared: 0.9328, Adjusted R-squared: 0.9264
## F-statistic: 145.8 on 2 and 21 DF, p-value: 4.847e-13
Part 1i
Obtain the residuals for your final model and evaluate the residual plots using the “plot” function. Does the regression line appear to be a good fit? Does a visual inspection indicate that the model assumptions appear to be satisfied? Comment.
Answer 1i
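A sketch of the diagnostics (res is the residual object referenced in the Shapiro-Wilk output in Part 1j):
res <- resid(mod_final)     # residuals of the final model
par(mfrow = c(2, 2))        # arrange the four diagnostic plots together
plot(mod_final)
par(mfrow = c(1, 1))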
The regression line appears to be a good fit, because the residuals are evenly spread around a horizontal line without distinct patterns in the Residuals vs Fitted plot. The residuals also appear to be normally distributed, as they follow the straight line quite well in the Normal Q-Q plot. The other plots also indicate that the model assumptions appear to be satisfied.
Part 1j
Perform a Shapiro-Wilk test for normality. Use \(\alpha=0.05\). Comment on the results.
Answer 1j
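A sketch of the test call, using the residuals stored in res:
shapiro.test(res)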
##
## Shapiro-Wilk normality test
##
## data: res
## W = 0.96635, p-value = 0.5783
The p-value is larger than \(\alpha\) (0.5783 > 0.05), so the null hypothesis that the data came from a normally distributed population cannot be rejected. The residuals can be treated as normally distributed.
Part 1k
Perform a Breusch-Pagan test for homogeneity of variance among the residuals. Use \(\alpha=0.05\). Comment on the results.
Answer 1k
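A sketch of the test call; bptest() comes from the lmtest package:
library(lmtest)   # provides bptest() and dwtest()
bptest(mod_final)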
##
## studentized Breusch-Pagan test
##
## data: mod_final
## BP = 0.17108, df = 2, p-value = 0.918
The p-value is larger than \(\alpha\) (0.918 > 0.05), so the null hypothesis of homoscedasticity cannot be rejected. There is homogeneity of variance among the residuals.
Part 1l
Perform a Durbin-Watson test for serial correlation among the residuals. Use \(\alpha=0.05\). Comment on the results.
Answer 1l
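A sketch of the test call; dwtest() is also from the lmtest package, here with a two-sided alternative to match the output below:
dwtest(mod_final, alternative = "two.sided")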
##
## Durbin-Watson test
##
## data: mod_final
## DW = 1.2862, p-value = 0.07421
## alternative hypothesis: true autocorrelation is not 0
The p-value is larger than \(\alpha\) (0.07421 > 0.05), so the null hypothesis that the autocorrelation of the disturbances is 0 cannot be rejected. There is no significant serial correlation among the residuals.
Part 1m
Obtain a 95% confidence interval for \(\beta_3\) and interpret it in the context of this problem.
Answer 1m
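A sketch of the interval computation:
confint(mod_final, "x3", level = 0.95)   # 95% CI for the coefficient of x3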
## 2.5 % 97.5 %
## x3 1.564941 2.088189
The 95% confidence interval for \(\beta_3\) is (1.565, 2.088). We are 95% confident that, holding the x1 score constant, the expected job proficiency score (y) increases by between 1.565 and 2.088 points for each one-point increase in the x3 aptitude test score.
Part 1n
Construct a 95% prediction interval for a randomly selected employee with aptitude scores of \(x_1 = 99, x_2 = 112\), and \(x_3 = 105\) to forecast their proficiency rating at the end of the probationary period. Write an interpretation for the interval in the context of this problem.
Answer 1n
newdata <- data.frame(x1 = 99, x2 = 112, x3 = 105)
predict(mod_final, newdata, interval = "prediction")
## fit lwr upr
## 1 98.25507 86.81697 109.6932
The 95% prediction interval of (86.817, 109.693) means that 95% of the time, the proficiency rating for a randomly selected employee with aptitude scores of \(x_1=99, x_2=112,\) and \(x_3=105\) will be between 86.817 and 109.693.
Exercise 2
Consider the scenario from Exercises 12.5 and 12.7 on page 725 of Ott’s textbook. There are two categorical variables (Method and Gender) and one quantitative variable (index of English proficiency prior to the program). See the textbook for details on how the qualitative variables are coded using indicator variables.
Part 2a
Use data in the file English to estimate the coefficients for the model in Exercise 12.5:
\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon\]
Obtain the estimated intercept and coefficients and state the estimated mean English proficiency scores for each of the 3 methods of teaching English as a second language.
Answer 2a
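A sketch of the code for this part, assuming the data are read into a data frame named english (it is used again in Part 2b) and the model object is named mod1_eng; the file name and extension are assumptions:
english <- read_excel("English.xlsx")   # file name assumed
mod1_eng <- lm(y ~ x1 + x2, data = english)
coef(mod1_eng)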
## (Intercept) x1 x2
## 44.75 61.40 3.95
Filling in the parameter estimates, the fitted model is
\(\hat{y} = 44.75 + 61.40 x_1 + 3.95 x_2\)
The estimated mean English proficiency scores for the 3 methods are:
Estimated mean for Method 1 = 44.75
Estimated mean for Method 2 = 44.75 + 61.40 = 106.15
Estimated mean for Method 3 = 44.75 + 3.95 = 48.70
Part 2b
Before fitting the model of Exercise 12.7, create a centered variable for x4 (call it cx4).
Fit the model for Exercise 12.7 using the centered variable cx4:
\[y = \beta_0 + \beta_1 cx_4 + \beta_2 x_1 + \beta_3 x_2 + \beta_5 x_1 cx_4 + \beta_6 x_2 cx_4 + \epsilon\]
Using the estimated coefficients, write three separate estimated models, one for each method, relating the scores after 3 months in the program (y) to the index score prior to starting the program (\(x_4\)).
Answer 2b
cx4 <- english$x4 - mean(english$x4)
english <- cbind(english, cx4)
mod2 <- lm(y ~ cx4 + x1 + x2 + x1:cx4 + x2:cx4, data = english)
coef(mod2)
## (Intercept) cx4 x1 x2 cx4:x1 cx4:x2
## 44.7601695 0.1220339 59.9318600 4.2308263 1.7796668 0.3038136
Method 1: \(\hat{y} = 44.76 + 0.12\,cx_4\)
Method 2: \(\hat{y} = (44.76 + 59.93) + (0.12 + 1.78)\,cx_4 = 104.69 + 1.90\,cx_4\)
Method 3: \(\hat{y} = (44.76 + 4.23) + (0.12 + 0.30)\,cx_4 = 48.99 + 0.42\,cx_4\)
Exercise 3
Ninety members (aged 18.1 to 23.4 years) of three Division I women's intercollegiate rowing teams (National Collegiate Athletic Association) within the Big Ten Conference volunteered to participate in a study to predict race time for female collegiate rowers from nineteen physical characteristics.
Data is in the file rowtime. The race times are in the variable named "racetime".
Part 3a
Load the data and use head(rowtime) to see the other variable names and the first 6 values of each.
Answer 3a
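A sketch of loading the data (the file name and extension are assumptions; the tibble output below suggests the data were read with readxl or readr):
rowtime <- read_excel("rowtime.xlsx")   # file name assumed
head(rowtime)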
## # A tibble: 6 x 20
## racetime tall weight armspan flexarm thighci calfcir tricep biceps thigh
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 470. 172. 86.7 172. 34.2 65.5 40.4 21 19 29
## 2 469. 168. 72.6 156. 31.2 59.4 39.5 24 11 34
## 3 509 169. 69.4 167 31 57.5 39 22 19 35
## 4 516 158. 58.6 158. 29.5 54 37 19 12 13
## 5 465 172 72.8 176. 33 55 38 21 7 23
## 6 480. 176. 71.7 171. 32.5 54.5 37 17 7 29
## # ... with 10 more variables: estffm <dbl>, estfm <dbl>, bestsnr <dbl>,
## # bestvj <dbl>, legpower <dbl>, endo <dbl>, meso <dbl>, ecto <dbl>,
## # expvarsity <dbl>, preexper <dbl>
Part 3b
Use the regsubsets function to find the “best” model for predicting the response variable racetime with up to 8 of the 19 predictor variables in the data set. Produce the summary and the plot for the best single models with up to 8 predictors according to \(R^2_{adj}\).
Which independent variables are in the best model with 8 predictors when the \(R^2_{adj}\) is the criterion for selection?
Answer 3b
library(leaps)
regsubsets.out <- regsubsets(racetime ~ ., data = rowtime, nvmax = 8)
plot(regsubsets.out, scale = "adjr2")
mod_regs <- lm(racetime ~ calfcir + biceps + bestvj + legpower + meso +
ecto + expvarsity + preexper, data = rowtime)
summary(mod_regs)
##
## Call:
## lm(formula = racetime ~ calfcir + biceps + bestvj + legpower +
## meso + ecto + expvarsity + preexper, data = rowtime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.206 -9.912 -1.412 9.776 33.826
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 505.5492 39.6802 12.741 < 2e-16 ***
## calfcir 2.7507 1.1668 2.357 0.02082 *
## biceps 1.5304 0.5148 2.973 0.00389 **
## bestvj 7.9155 1.0640 7.439 9.45e-11 ***
## legpower -2.0904 0.2258 -9.258 2.45e-14 ***
## meso -9.5530 3.2477 -2.941 0.00426 **
## ecto -15.8944 4.2064 -3.779 0.00030 ***
## expvarsity -18.2117 3.4750 -5.241 1.24e-06 ***
## preexper -12.1505 3.5743 -3.399 0.00105 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.11 on 81 degrees of freedom
## Multiple R-squared: 0.6507, Adjusted R-squared: 0.6162
## F-statistic: 18.86 on 8 and 81 DF, p-value: 1.156e-15
In the best model with 8 predictors, when \(R^2_{adj}\) is the criterion for selection, the variables are calfcir, biceps, bestvj, legpower, meso, ecto, expvarsity, and preexper.
Part 3c
Use the step function with backward selection to find the "best" model for predicting the response variable racetime. Recall that the formula structure y ~ . will produce a model using y as the response variable and all other variables in the data set as the predictors; in this data set racetime is the response variable and all other variables are potential predictors.
Which independent variables are in this model? What is the AIC value for this model?
Answer 3c
full <- lm(racetime ~ ., data = rowtime)
mod_back <- step(full, direction = "backward", trace = 0)
summary(mod_back)
##
## Call:
## lm(formula = racetime ~ tall + calfcir + biceps + estfm + bestvj +
## legpower + meso + expvarsity + preexper, data = rowtime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.392 -10.098 -1.056 8.676 33.046
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 797.8807 108.6409 7.344 1.54e-10 ***
## tall -2.2842 0.6347 -3.599 0.000553 ***
## calfcir 2.7636 1.1699 2.362 0.020592 *
## biceps 1.1183 0.5703 1.961 0.053377 .
## estfm 1.6653 1.2028 1.384 0.170056
## bestvj 5.1014 1.7652 2.890 0.004957 **
## legpower -1.1402 0.4546 -2.508 0.014161 *
## meso -9.7856 3.3879 -2.888 0.004979 **
## expvarsity -18.3696 3.4565 -5.315 9.42e-07 ***
## preexper -11.0965 3.5822 -3.098 0.002691 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.03 on 80 degrees of freedom
## Multiple R-squared: 0.6587, Adjusted R-squared: 0.6204
## F-statistic: 17.16 on 9 and 80 DF, p-value: 2.149e-15
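The AIC shown below can be obtained with extractAIC(), which returns the equivalent degrees of freedom followed by the AIC used by step() (a sketch):
extractAIC(mod_back)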
## [1] 10.0000 497.2245
In this model the variables are tall, calfcir, biceps, estfm, bestvj, legpower, meso, expvarsity, and preexper. The AIC value for this model is 497.2245.
Part 3d
Use the step function with forward selection to find the "best" model for predicting the response variable racetime.
Which independent variables are in the model selected? What is the AIC value for this model?
Answer 3d
null <- lm(racetime ~ 1, data = rowtime)
mod_forw <- step(null, scope = list(lower = null, upper = full),
direction = "forward", trace = 0)
summary(mod_forw)
##
## Call:
## lm(formula = racetime ~ estffm + expvarsity + tall + preexper +
## biceps + meso + calfcir + bestvj, data = rowtime)
##
## Residuals:
## Min 1Q Median 3Q Max
## -39.777 -11.034 -0.336 8.303 31.520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 867.7058 92.0113 9.430 1.12e-14 ***
## estffm -1.2986 0.6450 -2.013 0.04740 *
## expvarsity -17.5792 3.4968 -5.027 2.93e-06 ***
## tall -2.4040 0.6425 -3.742 0.00034 ***
## preexper -10.9832 3.6057 -3.046 0.00313 **
## biceps 1.1548 0.4651 2.483 0.01511 *
## meso -10.3890 3.3900 -3.065 0.00296 **
## calfcir 2.8099 1.1638 2.414 0.01802 *
## bestvj 1.0024 0.6918 1.449 0.15119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15.14 on 81 degrees of freedom
## Multiple R-squared: 0.6494, Adjusted R-squared: 0.6148
## F-statistic: 18.76 on 8 and 81 DF, p-value: 1.331e-15
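As in Part 3c, a sketch of the AIC computation:
extractAIC(mod_forw)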
## [1] 9.0000 497.6452
In this model the variables are estffm, expvarsity, tall, preexper, biceps, meso, calfcir, and bestvj. The AIC value for this model is 497.6452.
Part 3e
Compute the AIC for the best model with 8 predictors from the regsubsets function. How does it compare with the AIC for the two models produced by the backward and forward selection procedures?
Which model is the “best” according to the AIC? (remember, smaller is better for AIC)
Answer 3e
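A sketch of the AIC computation for the regsubsets model fit in Part 3b (mod_regs):
extractAIC(mod_regs)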
## [1] 9.0000 497.3196
The AIC value for the best model with 8 predictors from the regsubsets function is 497.3196. It is very similar to the AIC values for the two models produced by the backward and forward selection procedures. However, the "best" model according to the AIC is the one selected by the step function with backward selection (AIC = 497.2245).