# Behind the Scenes of R lm() function: OLS in Matrix Form

The following exercises aim to compare simple linear regression results computed in matrix form with the built in R function lm().

Solution file can be obtained here.

Use matrix based OLS approach (do not use R) to fit a simple regression model for the following data:

xy
2.5-8
4.516
540
8.2115
9.3122

a) OLS estimation of $$\beta_0$$ and $$\beta_1$$.

$$Y = X\hat{\beta}$$

$$Y = \begin{bmatrix} -8 \\ 16 \\ 40 \\ 115 \\ 122 \end{bmatrix}$$

$$X = \begin{bmatrix} 1 & 2.5 \\ 1 & 4.5 \\ 1 & 5 \\ 1 & 8.2 \\ 1 & 9.3 \end{bmatrix}$$

$$X'X = \begin{bmatrix} 5 & 29.5 \\ 29.5 & 205.23 \end{bmatrix}$$

$$(X'X)^{-1} = \begin{bmatrix} 1.316 & -0.189 \\ -0.189 & 0.032 \end{bmatrix}$$

$$\hat{\beta} = (X'X)^{-1}X'Y = \begin{bmatrix} -65.636 \\ 20.786 \end{bmatrix}$$

$$\hat{\beta_0} = -65.636$$, $$\hat{\beta_1} = 20.786$$

b) SSE

$$SSE = Y'Y - \hat{\beta}X'Y = 30029 - 29716.81 = 312.19$$

c) MSE

$$MSE = SSE/(n-p) = SSE/(5-2) = 312.19/3 = 104.063$$

d) R-squared and adjusted R-squared. Is your model a good fit? Why or why not?

$$SST = \sum_{i=1}^{5} (y_i - \bar{y})^2 = 13784$$

$$R^2 = 1 - (SSE/SST) = 1 - (312.19/13784) = 0.977$$

$$R^2adj = 1 - ((1 - R^2)(n-1)/(n - p)) = 1 - ((1 - 0.977)*4/3) = 0.969$$

The model is a good fit because it explains about 97% of the variation in y.

e) $$\hat{y}$$, given $$x = 8.2$$. Compare with the observed response in the original data. Why are these two not equal?

$$\hat{y} = -65.636 + 20.786*8.2 = 104.8092$$

The observed response in the original data was 115. These two are not equal because the model does not fit the data perfectly.

Compare your results in 1. (a)-(e) with R function lm().

x <- c(2.5, 4.5, 5, 8.2, 9.3)
y <- c(-8, 16, 40, 115, 122)
mod <- lm(y ~ x)
summary(mod)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##       1       2       3       4       5
##   5.672 -11.900   1.707  10.193  -5.672
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -65.636     11.715  -5.603  0.01123 *
## x             20.786      1.829  11.368  0.00146 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.21 on 3 degrees of freedom
## Multiple R-squared:  0.9773, Adjusted R-squared:  0.9697
## F-statistic: 129.2 on 1 and 3 DF,  p-value: 0.001461
SSE <- sum(mod$residuals^2) SSE ##  312.7489 MSE <- sum(mod$residuals^2)/3
MSE
##  104.2496
predict(mod, newdata = data.frame(x = 8.2))
##        1
## 104.8072

$$\beta_0 = -65.636$$, $$\beta_1 = 20.786$$

$$SSE = 312.7489$$, $$MSE = 104.2496$$

$$R^2 = 0.9773$$, $$R^2adj = 0.9697$$.

Given $$x = 8.2$$, $$\hat{y} = 104.8072$$.

My results are very similar to R function lm() results and the differences are most likely due to rounding errors.

Use R matrix operations (Not lm()) and repeat 1. (a) and (b). Compare your results with lm() function results.

a) OLS estimation of $$\beta_0$$ and $$\beta_1$$.

X <- matrix(c(rep(1, 5), 2.5, 4.5, 5, 8.2, 9.3), nrow = 5)
Y <- matrix(c(-8, 16, 40, 115, 122), nrow = 5)
beta <- solve(t(X) %*% X) %*% t(X) %*% Y
beta
##           [,1]
## [1,] -65.63598
## [2,]  20.78576

$$\beta_0 = -65.636$$, $$\beta_1 = 20.786$$

b) SSE

t(Y) %*% Y - t(beta) %*% t(X) %*% Y
##          [,1]
## [1,] 312.7489

$$SSE = 312.7489$$.

The results are identical to lm() function results.