---
output:
prettydoc::html_pretty:
theme: architect
highlight: vignette
---
# Public Education Expenditure and SAT: Linear Regression in R
## Problem 1
The package `faraway` in R contains the data frame `sat` with information collected to study the relationship between expenditure on public education and test results for all 50 of the United States. For the sake of this analysis, treat the data as a sample. The variable `total` contains the average total score on the SAT test for each state. We are interested in the relationship between the average total score on the SAT and the variables `expend`, `ratio`, `salary` and `takers`. Use the help function to see a description of the variables in this data set.
a. Use `pairs()` to obtain a matrix of scatter plots for the five variables mentioned above (omit `verbal` and `math`). Based on this graph, which predictor would you expect to explain the greatest portion of variability in the response (`total`)? Justify your answer.
```{r}
library(faraway)
data("sat")
pairs(sat[, -c(5, 6)])
```
I would expect predictor `takers` to explain the greatest portion of variability in the response (`total`), because the scatterplot of `total` vs. `takers` shows the strongest relationship of the four: average total SAT score decreases markedly as the percentage of test takers increases.
b. Use R to fit the linear model for `total`, using all four of the predictor variables. State the model for this regression and the fitted regression equation.
```{r}
mod <- lm(total ~ expend + ratio + salary + takers, data = sat)
mod
```
Multiple regression model:
$total = \beta_0 + \beta_1expend + \beta_2ratio + \beta_3salary + \beta_4takers + \epsilon$
Fitted regression equation:
$\widehat{total} = 1045.972 + 4.463expend - 3.624ratio + 1.638salary - 2.904takers$
c. Use the fitted regression equation to predict the total SAT score for a hypothetical state that has `expend = 9`, `ratio = 14`, `salary = 32`, and `takers = 36`. Does it seem reasonable to predict total SAT score for these values?
```{r}
newdata <- data.frame(expend = 9, ratio = 14, salary = 32, takers = 36)
predict(mod, newdata = newdata)
```
Yes, it seems reasonable to predict the total SAT score for these values: each value lies within the observed range of the corresponding predictor (which can be checked with `summary(sat)`), so this prediction is an interpolation within the data rather than an extrapolation.
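As a cross-check of what `predict()` is doing, a prediction for a single new observation is just the dot product of the new predictor vector (with a leading 1 for the intercept) and the fitted coefficients. A minimal sketch on the built-in `mtcars` data, so the snippet runs on its own (the model and values here are illustrative, not from the `sat` analysis):

```{r}
# Sketch on built-in data: predict() for one new point equals x0 . beta-hat
fit <- lm(mpg ~ wt + hp, data = mtcars)
x0  <- c(1, 3, 150)  # intercept, wt = 3, hp = 150
sum(x0 * coef(fit))
predict(fit, newdata = data.frame(wt = 3, hp = 150))
```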
d. Compute the hat matrix using R. Print the first 10 rows and 4 columns of the hat matrix. Confirm that the hat matrix is of the correct dimensions.
```{r}
# design matrix: column of 1s for the intercept plus the four predictors
X <- as.matrix(cbind(1, sat[, 1:4]))
# H = X (X'X)^{-1} X'
hat_matrix <- X %*% solve(t(X) %*% X) %*% t(X)
hat_matrix[1:10, 1:4]
dim(hat_matrix)  # should be n x n = 50 x 50
```
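Explicitly inverting `t(X) %*% X` works here but can be numerically fragile for ill-conditioned designs; the same hat matrix comes out of the thin QR decomposition as $H = QQ^T$. A minimal sketch on simulated data (so it runs independently of `faraway`):

```{r}
# Sketch: hat matrix via QR, compared with the textbook formula
set.seed(1)
Xs <- cbind(1, matrix(rnorm(40), nrow = 10))  # 10 obs, intercept + 4 predictors
H1 <- Xs %*% solve(t(Xs) %*% Xs) %*% t(Xs)    # textbook formula
H2 <- tcrossprod(qr.Q(qr(Xs)))                # H = Q Q'
all.equal(H1, H2)
```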
e. Use R to confirm the properties of the hat matrix (symmetric, idempotent, positive semi-definite) for this study.
Symmetric:
```{r}
all.equal(hat_matrix, t(hat_matrix))
```
Idempotent:
```{r}
all.equal(hat_matrix %*% hat_matrix, hat_matrix)
```
Positive semi-definite:
```{r}
round(eigen(hat_matrix, symmetric = TRUE)$values, 10)
```
The hat matrix is positive semi-definite, because all of the eigenvalues are >= 0.
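In fact, because the hat matrix is symmetric and idempotent, its eigenvalues are not just nonnegative but exactly 0 or 1, and their sum (the trace) equals the number of estimated coefficients $p$. A self-contained sketch on simulated data:

```{r}
# Eigenvalues of a symmetric idempotent matrix are 0 or 1; trace = p
set.seed(2)
Xs <- cbind(1, matrix(rnorm(48), nrow = 12))  # 12 obs, p = 5 columns
H  <- Xs %*% solve(crossprod(Xs)) %*% t(Xs)
ev <- eigen(H, symmetric = TRUE)$values
table(round(ev, 8))  # only 0s and 1s
sum(ev)              # equals ncol(Xs) = 5
```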
f. Use the hat matrix found in (d) to compute the fitted Y-values. Compute the correlation between the Y-values and the fitted Y-values, and confirm that the square of the correlation coefficient is the same as the Multiple R-squared on the summary.
```{r}
y <- as.matrix(sat$total)
# fitted Y-values
y_hat <- hat_matrix %*% y
# square of the correlation coefficient
cor(y, y_hat)^2
# Multiple R-squared on the summary
summary(mod)$r.squared
```
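The identity used here, that R-squared equals the squared correlation between the observed and fitted responses, holds for any least-squares fit that includes an intercept. A one-line check on the built-in `mtcars` data (an illustrative model, not the `sat` fit):

```{r}
fit <- lm(mpg ~ wt + hp, data = mtcars)
all.equal(cor(mtcars$mpg, fitted(fit))^2, summary(fit)$r.squared)
```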
g. Use the hat matrix again to compute the sample residuals, then compute the residual standard error from those sample residuals. Check to see that it matches the residual standard error on the summary.
```{r}
# sample residuals: e = y - y_hat = (I - H) y
res <- y - hat_matrix %*% y
# residual standard error: sqrt(RSS / (n - p)), with p = 5 coefficients
sqrt(sum(res^2) / (length(res) - 5))
# matches the residual standard error reported by summary()
summary(mod)$sigma
```
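The degrees-of-freedom adjustment $n - p$ in the denominator can be verified on any fit: the residual standard error that `summary()` reports is exactly $\sqrt{RSS / (n - p)}$. A self-contained sketch on built-in data:

```{r}
fit <- lm(mpg ~ wt + hp, data = mtcars)
rss <- sum(residuals(fit)^2)
n   <- nrow(mtcars)
p   <- length(coef(fit))  # 3 coefficients: intercept, wt, hp
all.equal(sqrt(rss / (n - p)), summary(fit)$sigma)
```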
h. Compute simultaneous confidence intervals for the coefficients of all predictors, with an overall confidence level of 92%. Use the information from the summary to calculate these.
```{r}
# Bonferroni correction: with 4 intervals and overall level 0.92, each
# individual interval uses level 1 - (1 - 0.92)/4 = 0.98
confint(mod, parm = 2:5, level = 1 - (1 - 0.92) / 4)
```
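The same intervals can be built directly from the summary table: with $k$ intervals and overall level $1 - \alpha$, each interval uses the $t$ critical value at $1 - \alpha/(2k)$ on the residual degrees of freedom. A sketch on built-in data (interpreting "simultaneous" as a Bonferroni correction, which is an assumption):

```{r}
fit   <- lm(mpg ~ wt + hp, data = mtcars)
k     <- 2     # two predictor coefficients
alpha <- 0.08  # overall 92% confidence
est   <- coef(summary(fit))[-1, "Estimate"]
se    <- coef(summary(fit))[-1, "Std. Error"]
tcrit <- qt(1 - alpha / (2 * k), df = fit$df.residual)
cbind(lower = est - tcrit * se, upper = est + tcrit * se)
```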
## Problem 2
For the linear model in Problem 1, complete the general test for the whole model (all predictors). State each of the following:
a. the full and reduced models
Full model:
$total = \beta_0 + \beta_1expend + \beta_2ratio + \beta_3salary + \beta_4takers + \epsilon$
Reduced model:
$total = \beta_0 + \epsilon$
b. both hypotheses
$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$
$H_a: \beta_j \neq 0$ for at least one value of $j = 1, 2, 3, 4$
c. the ANOVA table
```{r}
anova(lm(total ~ 1, data = sat), mod)
```
d. the value of the test statistic
F = 52.875
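The overall F statistic can also be recovered from R-squared alone, via $F = \dfrac{R^2/k}{(1 - R^2)/(n - k - 1)}$ with $k$ predictors. A self-contained sketch on built-in data:

```{r}
fit <- lm(mpg ~ wt + hp, data = mtcars)
r2  <- summary(fit)$r.squared
k   <- 2
n   <- nrow(mtcars)
F_hand <- (r2 / k) / ((1 - r2) / (n - k - 1))
F_hand
unname(summary(fit)$fstatistic["value"])  # same value from summary()
```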
e. the p-value
p-value < 2.2e-16
f. the decision based on the p-value
We reject the null hypothesis at the 5% significance level, because the p-value is far below 0.05.
g. an interpretation of the decision
The full model fits the data statistically significantly better than the intercept-only model.