---
output:
  prettydoc::html_pretty:
    theme: architect
    highlight: vignette
---

# Public Education Expenditure and SAT: Linear Regression in R

## Problem 1

The package `faraway` in R contains the data frame `sat`, with information collected to study the relationship between expenditure on public education and test results for all 50 of the United States. For the sake of this analysis, treat the data as a sample. The variable `total` contains the average total score on the SAT test for each state. We are interested in the relationship between the average total SAT score and the variables `expend`, `ratio`, `salary`, and `takers`. Use the `help()` function to see a description of the variables in this data set.

a. Use `pairs()` to obtain a matrix of scatter plots for the five variables mentioned above (omit `verbal` and `math`). Based on this graph, which predictor would you expect to explain the greatest portion of variability in the response (`total`)? Justify your answer.

```{r}
library(faraway)
data("sat")
# drop the verbal and math columns (5 and 6)
pairs(sat[, -c(5, 6)])
```

I would expect the predictor `takers` to explain the greatest portion of variability in the response (`total`), because the scatterplot of `total` vs. `takers` shows the strongest relationship of the four: the two variables are strongly (negatively) correlated.

b. Use R to fit the linear model for `total`, using all four of the predictor variables. State the model for this regression and the fitted regression equation.

```{r}
mod <- lm(total ~ expend + ratio + salary + takers, data = sat)
mod
```

Multiple regression model:

$total = \beta_0 + \beta_1 expend + \beta_2 ratio + \beta_3 salary + \beta_4 takers + \epsilon$

Fitted regression equation:

$\widehat{total} = 1045.972 + 4.463\,expend - 3.624\,ratio + 1.638\,salary - 2.904\,takers$

c. Use the fitted regression equation to predict the total SAT score for a hypothetical state that has `expend` = 9, `ratio` = 14, `salary` = 32, and `takers` = 36. Does it seem reasonable to predict the total SAT score for these values?
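One way to judge whether such a prediction is reasonable is to check that the hypothetical values fall inside the observed ranges of the predictors, so the model is interpolating rather than extrapolating. A quick sketch (the object name `rng` is just for illustration):

```{r}
library(faraway)
data("sat")
# observed minimum (row 1) and maximum (row 2) of each predictor
rng <- sapply(sat[, c("expend", "ratio", "salary", "takers")], range)
rng
```

All four hypothetical values (9, 14, 32, 36) lie within the corresponding observed ranges.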
```{r}
newdata <- data.frame(expend = 9, ratio = 14, salary = 32, takers = 36)
predict(mod, newdata = newdata)
```

It seems reasonable to predict the total SAT score for these values, since each of them lies within the range of the observed data, so the prediction does not extrapolate beyond the sample.

d. Compute the hat matrix using R. Print the first 10 rows and 4 columns of the hat matrix. Confirm that the hat matrix is of the correct dimensions.

```{r}
# design matrix: a column of 1s followed by the four predictors
X <- as.matrix(cbind(1, sat[, 1:4]))
# H = X (X'X)^{-1} X'
hat_matrix <- X %*% solve(t(X) %*% X) %*% t(X)
hat_matrix[1:10, 1:4]
dim(hat_matrix)  # should be 50 x 50
```

e. Use R to confirm the properties of the hat matrix (symmetric, idempotent, positive semi-definite) for this study.

Symmetric:

```{r}
all.equal(hat_matrix, t(hat_matrix))
```

Idempotent:

```{r}
all.equal(hat_matrix %*% hat_matrix, hat_matrix)
```

Positive semi-definite:

```{r}
round(eigen(hat_matrix, symmetric = TRUE)$values, 10)
```

The hat matrix is positive semi-definite because all of its eigenvalues are >= 0.

f. Use the hat matrix found in (d) to compute the fitted Y-values. Compute the correlation between the Y-values and the fitted Y-values, and confirm that the square of the correlation coefficient is the same as the Multiple R-squared on the summary.

```{r}
y <- as.matrix(sat$total)
# fitted Y-values
y_hat <- hat_matrix %*% y
# correlation between observed and fitted values
cor(y, y_hat)
# square of the correlation coefficient
cor(y, y_hat)^2
# Multiple R-squared from the summary
summary(mod)$r.squared
```

g. Use the hat matrix again to compute the sample residuals, then compute the residual standard error from those sample residuals. Check to see that it matches the residual standard error on the summary.

```{r}
# sample residuals: e = y - Hy
res <- y - hat_matrix %*% y
# residual standard error: sqrt(SSE / (n - p)), with p = 5 estimated coefficients
sqrt(sum(res^2) / (length(res) - 5))
# residual standard error from the summary
summary(mod)$sigma
```

h. Compute simultaneous confidence intervals for the coefficients of all predictors, with an overall confidence level of 92%. Use the information from the summary to calculate these.

```{r}
# Bonferroni: with g = 4 intervals, each individual level is 1 - (1 - 0.92) / 4 = 0.98;
# drop the first row to report only the four predictor coefficients
confint(mod, level = 1 - (1 - 0.92) / 4)[-1, ]
```

## Problem 2

For the linear model in Problem 1, complete the general test for the whole model (all predictors).
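As a cross-check of (h), the same Bonferroni intervals can be computed by hand from the summary table, as the problem asks: each interval is $\hat\beta_j \pm t_{1 - \alpha/(2g),\, n-5} \cdot SE(\hat\beta_j)$ with $\alpha = 0.08$ and $g = 4$. A sketch (the names `s`, `tcrit`, and `bonf` are just for illustration):

```{r}
library(faraway)
data("sat")
mod <- lm(total ~ expend + ratio + salary + takers, data = sat)
s <- summary(mod)$coefficients                 # estimates and standard errors
# Bonferroni critical value: qt(1 - 0.08 / (2 * 4), df = n - 5)
tcrit <- qt(1 - (1 - 0.92) / (2 * 4), df = df.residual(mod))
bonf <- cbind(lower = s[-1, "Estimate"] - tcrit * s[-1, "Std. Error"],
              upper = s[-1, "Estimate"] + tcrit * s[-1, "Std. Error"])
bonf
```

These limits agree with `confint(mod, level = 0.98)[-1, ]`.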
State each of the following:

a. the full and reduced models

Full model: $total = \beta_0 + \beta_1 expend + \beta_2 ratio + \beta_3 salary + \beta_4 takers + \epsilon$

Reduced model: $total = \beta_0 + \epsilon$

b. both hypotheses

$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$

$H_a: \beta_j \neq 0$ for at least one value of $j = 1, 2, 3, 4$

c. the ANOVA table

```{r}
# compare the reduced (intercept-only) model to the full model
anova(lm(total ~ 1, data = sat), mod)
```

d. the value of the test statistic

F = 52.875

e. the p-value

p-value < 2.2e-16

f. the decision based on the p-value

We reject the null hypothesis at the 5% level of significance because the p-value < 0.05.

g. an interpretation of the decision

At least one of the four predictors is linearly related to `total`: the full model fits the data statistically significantly better than the intercept-only model.
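As a cross-check (a sketch, not part of the assignment), the overall F statistic can be recomputed from the coefficient of determination, $F = \dfrac{R^2/k}{(1 - R^2)/(n - k - 1)}$ with $k = 4$ predictors and $n = 50$ states, and compared with the value R reports:

```{r}
library(faraway)
data("sat")
mod <- lm(total ~ expend + ratio + salary + takers, data = sat)
r2 <- summary(mod)$r.squared
n <- nrow(sat)
k <- 4
# F statistic recomputed from R-squared
F_manual <- (r2 / k) / ((1 - r2) / (n - k - 1))
F_manual
# F statistic as reported by the summary
summary(mod)$fstatistic["value"]
```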