Education Expenditure and SAT: Linear Regression in R Studio
The solution file can be downloaded here.
Problem 1
The package `faraway` in R contains the data frame `sat` with information collected to study the relationship between expenditure on public education and test results for all 50 of the United States. For the sake of this analysis, treat the data as a sample. The variable `total` contains the average total score on the SAT test for each state. We are interested in the relationship between the average total score on the SAT and the variables `expend`, `ratio`, `salary`, and `takers`. Use the help function to see a description of the variables in this data set.
- Use `pairs()` to obtain a matrix of scatter plots for the five variables mentioned above (omit `verbal` and `math`). Based on this graph, which predictor would you expect to explain the greatest portion of variability in the response (`total`)? Justify your answer.
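The scatterplot matrix itself is not reproduced here; a minimal call that generates it, assuming the `faraway` package is installed:

library(faraway)
# scatterplot matrix of the four predictors and the response
pairs(sat[, c("expend", "ratio", "salary", "takers", "total")])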
I would expect the predictor `takers` to explain the greatest portion of variability in the response (`total`), because the scatterplot of `total` vs. `takers` shows the strongest relationship (a clear negative association) among the four predictors.
- Use R to fit the linear model for `total`, using all four of the predictor variables. State the model for this regression and the fitted regression equation.
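The fitting call is not shown in the original output; the line below matches the printed `Call` and the object name `mod` used later with `predict()`:

mod <- lm(total ~ expend + ratio + salary + takers, data = sat)
mod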
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Coefficients:
## (Intercept) expend ratio salary takers
## 1045.972 4.463 -3.624 1.638 -2.904
Multiple regression model:
\(total = \beta_0 + \beta_1expend + \beta_2ratio + \beta_3salary + \beta_4takers + \epsilon\)
Fitted regression equation:
\(\widehat{total} = 1045.972 + 4.463expend - 3.624ratio + 1.638salary - 2.904takers\)
- Use the fitted regression equation to predict the total SAT score for a hypothetical state that has `expend = 9`, `ratio = 14`, `salary = 32`, and `takers = 36`. Does it seem reasonable to predict total SAT score for these values?
newdata <- data.frame(expend = 9, ratio = 14, salary = 32, takers = 36)
predict(mod, newdata = newdata)
## 1
## 983.2477
It seems reasonable to predict total SAT score for these values, because each of them lies within the observed range of the corresponding predictor, so the prediction does not require extrapolating beyond the sample (see the check below).
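A minimal range check, using the same `sat` data frame:

# observed min and max of each predictor; expend = 9, ratio = 14,
# salary = 32, and takers = 36 all fall inside these ranges
sapply(sat[, c("expend", "ratio", "salary", "takers")], range)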
- Compute the hat matrix using R. Print the first 10 rows and 4 columns of the hat matrix. Confirm that the hat matrix is of the correct dimensions.
X <- as.matrix(cbind(1, sat[, 1:4]))  # design matrix: intercept column plus the four predictors
hat_matrix <- X %*% solve(t(X) %*% X) %*% t(X)  # H = X (X'X)^(-1) X'
hat_matrix[1:10, 1:4]
## Alabama Alaska Arizona Arkansas
## Alabama 0.09537668 -0.03073763 0.02806512 0.060804724
## Alaska -0.03073763 0.18030612 -0.00140838 -0.024199931
## Arizona 0.02806512 -0.00140838 0.04931612 0.029281289
## Arkansas 0.06080472 -0.02419993 0.02928129 0.053828781
## California 0.04934236 0.04489829 0.08491826 0.013020789
## Colorado 0.02929069 0.02035841 0.03524397 0.025135448
## Connecticut 0.03703401 0.09181368 -0.02471065 -0.015641013
## Delaware -0.02523557 0.04366914 0.01509440 -0.013618402
## Florida -0.05248402 0.02889058 0.04787050 -0.005452524
## Georgia 0.01579371 -0.05615508 0.02889423 0.013104875
dim(hat_matrix)  # the hat matrix is n x n, with n = 50 states
## [1] 50 50
- Use R to confirm the properties of the hat matrix (symmetric, idempotent, positive semi-definite) for this study.
Symmetric:
## [1] TRUE
Idempotent:
## [1] TRUE
Positive semi-definite:
## [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0
The hat matrix is positive semi-definite, because all of the eigenvalues are >= 0.
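The calls behind these three checks are not shown in the output above; one way to reproduce them (rounding the eigenvalues for display is an assumption based on the printed values):

# symmetric: H equals its transpose
isSymmetric(hat_matrix)
# idempotent: HH = H (compared up to floating-point error)
all.equal(hat_matrix %*% hat_matrix, hat_matrix)
# positive semi-definite: all eigenvalues >= 0
# (an idempotent matrix has eigenvalues of only 0 and 1)
round(eigen(hat_matrix)$values)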
- Use the hat matrix found in (d) to compute the fitted Y-values. Compute the correlation between the Y-values and the fitted Y-values, and confirm that the square of the correlation coefficient is the same as the Multiple R-squared on the summary.
y <- as.matrix(sat$total)
# fitted Y-values
y_hat <- hat_matrix %*% y
# square of the correlation coefficient
cor(y, y_hat)^2
## [,1]
## [1,] 0.8245623
# Multiple R-squared from the model summary, for comparison (call inferred)
summary(mod)$r.squared
## [1] 0.8245623
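As an additional cross-check (not part of the original output), the hat-matrix fitted values can be compared against those returned by `lm()`:

# the fitted values H y should match lm()'s fitted values
all.equal(as.numeric(y_hat), as.numeric(fitted(mod)))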
- Use the hat matrix again to compute the sample residuals, then compute the residual standard error from those sample residuals. Check to see that it matches the residual standard error on the summary.
# sample residuals
res <- y - hat_matrix %*% y
# residual standard error: sqrt(SSE / (n - p)), with n - p = 50 - 5 = 45
sqrt(sum(res^2) / (length(res) - 5))
## [1] 32.70199
summary(mod)
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## ratio -3.6242 3.2154 -1.127 0.266
## salary 1.6379 2.3872 0.686 0.496
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
- Compute simultaneous confidence intervals for the coefficients of all predictors, with an overall significance level of 92%. Use the information from the summary to calculate these.
## 40.8 % 59.2 %
## (Intercept) 1033.596838 1058.346233
## expend 1.994073 6.931115
## ratio -4.376833 -2.871631
## salary 1.079158 2.196677
## takers -2.958609 -2.850352
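The calculation behind this table is not shown. The printed column labels imply each interval was computed at an individual level of 0.92/5 = 0.184, i.e., the stated 92% split across the five coefficients; a sketch that reproduces the table from the summary quantities (note that a standard Bonferroni adjustment would instead divide the significance level \(\alpha\) across the intervals):

# estimates and standard errors from the summary table
est <- coef(summary(mod))[, "Estimate"]
se <- coef(summary(mod))[, "Std. Error"]
# critical t value at the individual level implied by the column labels
t_crit <- qt(1 - (1 - 0.92 / 5) / 2, df = 45)
cbind(lower = est - t_crit * se, upper = est + t_crit * se)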
Problem 2
For the linear model in Problem 1, complete the general test for the whole model (all predictors). State each of the following:
- the full and reduced models
Full model:
\(total = \beta_0 + \beta_1expend + \beta_2ratio + \beta_3salary + \beta_4takers + \epsilon\)
Reduced model:
\(total = \beta_0 + \epsilon\)
- both hypotheses
\(H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\)
\(H_a: \beta_j \neq 0\) for at least one value of \(j = 1, 2, 3, 4\)
- the ANOVA table
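The table below can be reproduced by fitting the reduced (intercept-only) model and comparing the two fits; the object name `reduced` is an assumption:

reduced <- lm(total ~ 1, data = sat)
anova(mod, reduced)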
## Analysis of Variance Table
##
## Model 1: total ~ expend + ratio + salary + takers
## Model 2: total ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 48124
## 2 49 274308 -4 -226184 52.875 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- the value of the test statistic
F = 52.875
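This agrees with a hand calculation from the ANOVA table:

\(F = \dfrac{(SSE_{reduced} - SSE_{full})/(df_{reduced} - df_{full})}{SSE_{full}/df_{full}} = \dfrac{(274308 - 48124)/(49 - 45)}{48124/45} \approx 52.88\)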
- the p-value
p-value < 2.2e-16
- the decision based on the p-value
We reject the null hypothesis at the 5% level of significance because the p-value < 0.05.
- an interpretation of the decision
The full model fits the data significantly better than the intercept-only model; that is, at least one of the four predictors is useful for explaining the variability in `total`.