Education Expenditure and SAT: Linear Regression in R Studio
The solution file can be downloaded here.
Problem 1
The package `faraway` in R contains the data frame `sat` with information collected to study the relationship between expenditure on public education and test results for all 50 of the United States. For the sake of this analysis, treat the data as a sample. The variable `total` contains the average total score on the SAT test for each state. We are interested in the relationship between the average total score on the SAT and the variables `expend`, `ratio`, `salary`, and `takers`. Use the help function to see a description of the variables in this data set.
- Use `pairs()` to obtain a matrix of scatter plots for the five variables mentioned above (omit `verbal` and `math`). Based on this graph, which predictor would you expect to explain the greatest portion of variability in the response (`total`)? Justify your answer.
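The scatterplot matrix itself is not reproduced here; a minimal call that generates it, assuming the `faraway` package is installed:

library(faraway)
# scatterplot matrix of the four predictors and the response
pairs(sat[, c("expend", "ratio", "salary", "takers", "total")])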
I would expect the predictor `takers` to explain the greatest portion of variability in the response (`total`), because the scatterplot of `total` vs. `takers` shows the strongest relationship (a clear negative association) among the four predictors.
- Use R to fit the linear model for `total`, using all four of the predictor variables. State the model for this regression and the fitted regression equation.
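The fitting call is not shown in the original output; the line below matches the printed `Call` and the object name `mod` used later with `predict()`:

mod <- lm(total ~ expend + ratio + salary + takers, data = sat)
mod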
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Coefficients:
## (Intercept) expend ratio salary takers
## 1045.972 4.463 -3.624 1.638 -2.904
Multiple regression model:
\(total = \beta_0 + \beta_1expend + \beta_2ratio + \beta_3salary + \beta_4takers + \epsilon\)
Fitted regression equation:
\(\widehat{total} = 1045.972 + 4.463expend - 3.624ratio + 1.638salary - 2.904takers\)
- Use the fitted regression equation to predict the total SAT score for a hypothetical state that has `expend = 9`, `ratio = 14`, `salary = 32`, and `takers = 36`. Does it seem reasonable to predict total SAT score for these values?
newdata <- data.frame(expend = 9, ratio = 14, salary = 32, takers = 36)
predict(mod, newdata = newdata)
## 1
## 983.2477
It seems reasonable to predict total SAT score for these values, because each of them lies within the observed range of the corresponding predictor, so the prediction does not require extrapolating beyond the sample (see the check below).
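A minimal range check, using the same `sat` data frame:

# observed min and max of each predictor; expend = 9, ratio = 14,
# salary = 32, and takers = 36 all fall inside these ranges
sapply(sat[, c("expend", "ratio", "salary", "takers")], range)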
- Compute the hat matrix using R. Print the first 10 rows and 4 columns of the hat matrix. Confirm that the hat matrix is of the correct dimensions.
X <- as.matrix(cbind(1, sat[, 1:4]))  # design matrix: intercept column plus the four predictors
hat_matrix <- X %*% solve(t(X) %*% X) %*% t(X)  # H = X (X'X)^(-1) X'
hat_matrix[1:10, 1:4]
## Alabama Alaska Arizona Arkansas
## Alabama 0.09537668 -0.03073763 0.02806512 0.060804724
## Alaska -0.03073763 0.18030612 -0.00140838 -0.024199931
## Arizona 0.02806512 -0.00140838 0.04931612 0.029281289
## Arkansas 0.06080472 -0.02419993 0.02928129 0.053828781
## California 0.04934236 0.04489829 0.08491826 0.013020789
## Colorado 0.02929069 0.02035841 0.03524397 0.025135448
## Connecticut 0.03703401 0.09181368 -0.02471065 -0.015641013
## Delaware -0.02523557 0.04366914 0.01509440 -0.013618402
## Florida -0.05248402 0.02889058 0.04787050 -0.005452524
## Georgia 0.01579371 -0.05615508 0.02889423 0.013104875
dim(hat_matrix)  # the hat matrix is n x n, with n = 50 states
## [1] 50 50
- Use R to confirm the properties of the hat matrix (symmetric, idempotent, positive semi-definite) for this study.
Symmetric:
## [1] TRUE
Idempotent:
## [1] TRUE
Positive semi-definite:
## [1] 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [39] 0 0 0 0 0 0 0 0 0 0 0 0
The hat matrix is positive semi-definite, because all of the eigenvalues are >= 0.
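The calls behind these three checks are not shown in the output above; one way to reproduce them (rounding the eigenvalues for display is an assumption based on the printed values):

# symmetric: H equals its transpose
isSymmetric(hat_matrix)
# idempotent: HH = H (compared up to floating-point error)
all.equal(hat_matrix %*% hat_matrix, hat_matrix)
# positive semi-definite: all eigenvalues >= 0
# (an idempotent matrix has eigenvalues of only 0 and 1)
round(eigen(hat_matrix)$values)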
- Use the hat matrix found in (d) to compute the fitted Y-values. Compute the correlation between the Y-values and the fitted Y-values, and confirm that the square of the correlation coefficient is the same as the Multiple R-squared on the summary.
y <- as.matrix(sat$total)
# fitted Y-values
y_hat <- hat_matrix %*% y
# square of the correlation coefficient
cor(y, y_hat)^2
## [,1]
## [1,] 0.8245623
# Multiple R-squared from the model summary, for comparison (call inferred)
summary(mod)$r.squared
## [1] 0.8245623
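As an additional cross-check (not part of the original output), the hat-matrix fitted values can be compared against those returned by `lm()`:

# the fitted values H y should match lm()'s fitted values
all.equal(as.numeric(y_hat), as.numeric(fitted(mod)))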
- Use the hat matrix again to compute the sample residuals, then compute the residual standard error from those sample residuals. Check to see that it matches the residual standard error on the summary.
# sample residuals
res <- y - hat_matrix %*% y
# residual standard error: sqrt(SSE / (n - p)), with n - p = 50 - 5 = 45
sqrt(sum(res^2) / (length(res) - 5))
## [1] 32.70199
summary(mod)
##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## ratio -3.6242 3.2154 -1.127 0.266
## salary 1.6379 2.3872 0.686 0.496
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
- Compute simultaneous confidence intervals for the coefficients of all predictors, with an overall significance level of 92%. Use the information from the summary to calculate these.
## 40.8 % 59.2 %
## (Intercept) 1033.596838 1058.346233
## expend 1.994073 6.931115
## ratio -4.376833 -2.871631
## salary 1.079158 2.196677
## takers -2.958609 -2.850352
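The calculation behind this table is not shown. The printed column labels imply each interval was computed at an individual level of 0.92/5 = 0.184, i.e., the stated 92% split across the five coefficients; a sketch that reproduces the table from the summary quantities (note that a standard Bonferroni adjustment would instead divide the significance level \(\alpha\) across the intervals):

# estimates and standard errors from the summary table
est <- coef(summary(mod))[, "Estimate"]
se <- coef(summary(mod))[, "Std. Error"]
# critical t value at the individual level implied by the column labels
t_crit <- qt(1 - (1 - 0.92 / 5) / 2, df = 45)
cbind(lower = est - t_crit * se, upper = est + t_crit * se)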
Problem 2
For the linear model in Problem 1, complete the general test for the whole model (all predictors). State each of the following:
- the full and reduced models
Full model:
\(total = \beta_0 + \beta_1expend + \beta_2ratio + \beta_3salary + \beta_4takers + \epsilon\)
Reduced model:
\(total = \beta_0 + \epsilon\)
- both hypotheses
\(H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0\)
\(H_a: \beta_j \neq 0\) for at least one value of \(j = 1, 2, 3, 4\)
- the ANOVA table
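The table below can be reproduced by fitting the reduced (intercept-only) model and comparing the two fits; the object name `reduced` is an assumption:

reduced <- lm(total ~ 1, data = sat)
anova(mod, reduced)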
## Analysis of Variance Table
##
## Model 1: total ~ expend + ratio + salary + takers
## Model 2: total ~ 1
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 45 48124
## 2 49 274308 -4 -226184 52.875 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
- the value of the test statistic
F = 52.875
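This agrees with a hand calculation from the ANOVA table:

\(F = \dfrac{(SSE_{reduced} - SSE_{full})/(df_{reduced} - df_{full})}{SSE_{full}/df_{full}} = \dfrac{(274308 - 48124)/(49 - 45)}{48124/45} \approx 52.88\)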
- the p-value
p-value < 2.2e-16
- the decision based on the p-value
We reject the null hypothesis at the 5% level of significance because the p-value < 0.05.
- an interpretation of the decision
The full model fits the data significantly better than the intercept-only model; that is, at least one of the four predictors is useful for explaining the variability in `total`.