---
output:
  prettydoc::html_pretty:
    theme: architect
    highlight: vignette
---

# Multiple Linear Regression

## Exercise 1

A personnel officer in a governmental agency administered three newly developed aptitude tests to a random sample of 25 applicants for entry-level positions in the agency.  For the purpose of the study, all 25 applicants were accepted for positions irrespective of their test scores. After a probationary period, each applicant was rated for proficiency on the job.  

The scores on the three tests (x1, x2, x3) and the job proficiency score (y) for the 25 employees are in the file `jobprof`.

(Based on an exercise from Applied Linear Statistical Models, 5th ed. by Kutner, Nachtsheim, Neter, & Li)

### Part 1a

Create a scatterplot matrix and the correlation matrix for all of the variables in the data set. 

Do any aptitude test scores appear to be linearly related to the proficiency score? Do any relationships appear to be quadratic? Do any aptitude scores appear to be linearly related to each other?

### Answer 1a

```{r, warning = FALSE}
library(readxl)
jobprof <- read_excel("jobprof.xlsx")
# scatterplot and the correlation matrix of all observations
plot(jobprof)
cor(jobprof)
# remove the outlier
ind <- which(jobprof$x3 < 20)
jobprof_new <- jobprof[-ind, ]
# scatterplot and the correlation matrix without the outlier
plot(jobprof_new)
cor(jobprof_new)
```

Due to the outlier in the original data it was difficult to see how the variables are related with each other, so the outlier (observation with the value of `x3` smaller than 20) was removed.

It appears that the aptitude test `x3` scores are linearly related to the proficiency score. Aptitude test scores of `x1` and `x2` also appear to have a weak linear relationship with the proficiency score. No relationships appear to be quadratic. The aptitude scores of `x2` and `x3` appear to have a weak linear relationship.

### Part 1b

Obtain the model summary for the model composed of the three first-order terms and the three cross-product interaction terms:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_1 x_2 + \beta_5 x_1 x_3 + \beta_6 x_2 x_3 + \epsilon$$

Also use R to compute the VIF for each term in the model. Are any of the VIFs over 10?

This model is suffering from the effects of collinearity, which inflates the standard errors of the estimated coefficients.

What do you notice about the overall model p-value (from the F-statistic) and the individual p-values for each term in the model?  Does it make sense that the overall model shows statistical significance but no individual term does?  

### Answer 1b

```{r, warning = FALSE, message = FALSE}
library(car)
mod1 <- lm(y ~ x1 + x2 + x3 + x1:x2 + x1:x3 + x2:x3, data = jobprof_new)
summary(mod1)
vif(mod1)
```

All of the VIFs are over 10. The overall model p-value is very small, while the individual p-values for each term in the model all exceed 0.05. It does make sense that the overall model shows statistical significance but no individual term does, as ther are high levels of multicollinearity.

### Part 1c  

Many times, collinearity can be alleviated by centering the predictor variables.  Center the predictor variables x1, x2, and x3 and create new variables to hold them (call them cx1, cx2, and cx3).  Furthermore, create a quadratic term for the centered x2 variable.

### Answer 1c

```{r}
cx1 <- jobprof_new$x1 - mean(jobprof_new$x1)
cx2 <- jobprof_new$x2 - mean(jobprof_new$x2)
cx3 <- jobprof_new$x3 - mean(jobprof_new$x3)
cx2sq <- cx2^2
jobprof_new <- cbind(jobprof_new, cx1, cx2, cx3, cx2sq)
```

### Part 1d

Now obtain the model summary for the model composed of the three first-order terms and the three cross-product interaction terms using the centered variables:  

$$y=\beta_0 + \beta_1 cx_1 + \beta_2 cx_2 + \beta_3 cx_3 + \beta_4 cx_1 cx_2 + \beta_5 cx_1 cx_3 + \beta_6 cx_2 cx_3 + \epsilon$$

Use R to compute the VIF for each term in the model. Have the VIF values decreased after the variables are centered?  What can you about the overall model p-value (from the F-statistic) and the individual p-values for each term in the model?  Does this make more sense?

### Answer 1d

```{r}
mod2 <- lm(y ~ cx1 + cx2 + cx3 + cx1:cx2 + cx1:cx3 + cx2:cx3, data = jobprof_new)
summary(mod2)
vif(mod2)
```

The VIF values have significantly decreased after the variables are centered and now none of them exceed 10. The overall model p-value and the individual p-values for terms `cx1` and `cx3` are very small and the individual p-values for other terms still show no statistical significance. It makes more sense than the previous model.

### Part 1e

Test the significance of all three coefficients for the interaction terms as a subset. Use a 5% level of significance. State $H_0$ and $H_a$ and provide the R output as well as a written conclusion.

Look back and check the individual p-values for the interactions terms from the previous model, how do they compare to the p-value when the interaction terms are tested together as a subset?

### Answer 1e

```{r}
lht(mod2, c("cx1:cx2 = 0", "cx1:cx3 = 0", "cx2:cx3 = 0"), white.adjust = "hc1")
```

$H_0: \beta_4 = \beta_5 = \beta_6 = 0$

$H_a: \beta_4 \neq 0$ or $\beta_5 \neq 0$ or $\beta_6 \neq 0$

Since `p-value` = 0.2149 > 0.05, we cannot reject the null hypothesis. The coefficients for the interaction terms are not significant.

The individual p-values for the interactions terms from the previous model are all higher than the p-value when the interaction terms are tested together as a subset.

### Part 1f

Drop the interaction terms from the model and fit the following model with the quadratic term for $x_2$:

$$y = \beta_0 + \beta_1 cx_1 + \beta_2 cx_2 + \beta_3 cx_3 + \beta_4 cx_2^2 +\epsilon$$

Should the quadratic term be retained in the model at a 5% level of significance?

### Answer 1f

```{r}
mod3 <- lm(y ~ cx1 + cx2 + cx3 + cx2sq, data = jobprof_new)
summary(mod3)
```

The quadratic term should not be retained in the model at a 5% level of significance because the `p-value` for this term is higher than 0.05.

### Part 1g
    
Drop the quadratic term and fit the model with only the original uncentered variables:

$$y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$$

Are there any other terms that should be dropped from the model using the criteria of a 5% level of significance?  
 
### Answer 1g

```{r}
mod4 <- lm(y ~ x1 + x2 + x3, data = jobprof_new)
summary(mod4)
```

Using the criteria of a 5% level of significance, the term `x2` should be dropped from the model, because p-value for this term is higher than 0.05.

### Part 1h

Fit the final model for predicting the proficiency score for the population of all employees for this government agency.

### Answer 1h

```{r}
mod_final <- lm(y ~ x1 + x3, data = jobprof_new)
summary(mod_final)
```

### Part 1i

Obtain the residuals for your final model and evaluate the residual plots using the "plot" function.  Does the regression line appear to be a good fit?  Does a visual inspection indicate that the model assumptions appear to be satisfied?  Comment.

### Answer 1i

```{r}
res <- residuals(mod_final)
par(mfrow = c(2, 2))
plot(mod_final)
par(mfrow = c(1, 1))
```

The regression line appears to be a good fit, because the residuals are equally spread around a horizontal line without distinct patterns in the `Residuals vs Fitted` graph. Also, residuals appear to be normally distributed as they follow a straight line quite well in the `Normal Q-Q` plot. Other plots also indicate that the model assumptions appear to be satisfied.

### Part 1j

Perform a Shapiro-Wilk test for normality. Use $\alpha=0.05$. Comment on the results.

### Answer 1j

```{r}
shapiro.test(res)
```

The p-value is larger then $\alpha$ (0.5783 > 0.05) so the null hypothesis that the data came from a normally distributed population cannot be rejected. The residuals are normally distributed.

### Part 1k

Perform a Breusch-Pagan test for homogeneity of variance among the residuals. Use $\alpha=0.05$. Comment on the results.

### Answer 1k

```{r, warning = FALSE, message = FALSE}
library(lmtest)
bptest(mod_final)
```

The p-value is larger then $\alpha$ (0.918 > 0.05) so the null hypothesis of homoscedasticity cannot be rejected. There is homogeneity of variance among the residuals.

### Part 1l

Perform a Durbin-Watson test for serial correlation the residuals. Use $\alpha=0.05$. Comment on the results.

### Answer 1l

```{r}
dwtest(mod_final, alternative = "two.sided")
```

The p-value is larger then $\alpha$ (0.07421 > 0.05) so the null hypothesis that the autocorrelation of the disturbances is 0 cannot be rejected. The residuals are not correlated.

### Part 1m

Obtain a 95% confidence interval for $\beta_3$ and interpret it in the context of this problem.

### Answer 1m

```{r}
confint(mod_final, "x3", level = 0.95)
```

The 95% confidence interval for $\beta_3$ is (1.565, 2.088). It means that we are 95% confident that the expected job proficiency score (y) is between 1.565 and 2.088 points higher for each unit increase in the x3 aptitude test score.

### Part 1n

Construct a 95% prediction interval for a randomly selected employee with aptitude scores of $x_1 = 99, x_2 = 112$, and $x_3 = 105$ to forecast their proficiency rating at the end of the probationary period. Write an interpretation for the interval in the context of this problem.

### Answer 1n

```{r}
newdata <- data.frame(x1 = 99, x2 = 112, x3 = 105)
predict(mod_final, newdata, interval = "prediction")
```

The 95% prediction interval of (86.817, 109.693) means that 95% of the time, the proficiency rating for a randomly selected employee with aptitude scores of $x_1=99, x_2=112,$ and $x_3=105$ will be between 86.817 and 109.693.


## Exercise 2

Consider the scenario from Exercises 12.5 and 12.7 on page 725 of Ott's textbook. There are two categorical variables (Method and Gender) and one quantitative variable (index of English proficiency prior to the program). See the textbook for details on how the qualitative variables are coded using indicator variables.

### Part 2a

Use data in the file `English` to estimate the coefficients for the model in Exercise 12.5:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \epsilon$$ 

Obtain the estimated intercept and coefficients and state the estimated mean English proficiency scores for each of the 3 methods of teaching English as a second language.

### Answer 2a

```{r}
english <- read_excel("English.xlsx")
mod1 <- lm(y ~ x1 + x2, data = english)
coef(mod1)
```

Replace the ## symbols with the parameter estimates:

y = 44.75 + 61.40 $x_1$ + 3.95 $x_2$

State the estimated mean English proficiency scores for each of the 3 methods:

Estimated mean for Method 1 = 44.75

Estimated mean for Method 2 = 106.15

Estimated mean for Method 3 = 48.7

### Part 2b  

Before fitting the model of Exercise 12.7, create a centered variable for x4 (call it cx4).  

Fit the model for Exercise 12.7 using the centered variable x4c:

$$y = \beta_0 + \beta_1 cx_4 + \beta_2 x_1 + \beta_3 x_2 + \beta_5 x_1 cx_4 + \beta_6 x_2 cx_4 + \epsilon$$

Using the estimated coefficients, write three separate estimated models, one for each method, relating the scores after 3 months in the program (y) to the index score prior to starting the program ($x_4$).

### Answer 2b

```{r}
cx4 <- english$x4 - mean(english$x4)
english <- cbind(english, cx4)
mod2 <- lm(y ~ cx4 + x1 + x2 + x1:cx4 + x2:cx4, data = english)
coef(mod2)
```

Method 1: $y = 44.76 + 0.12 cx_4$

Method 2: $y = 104.69 + 1.9 cx_4$

Method3: $y = 48.99 + 0.42 cx_4$


## Exercise 3

Ninety members (aged = 18.1 – 23.4 years) of three Division I women’s intercollegiate rowing teams (National Collegiate Athletic Association) within the Big Ten Conference volunteered to participate in a study to predict race time for female collegiate rowers from nineteen physical characteristics.

Data is in the file `rowtime`. The race times are in the variable named "racetime".

### Part 3a

Load the data and use head(rowtime) to see the other variable names and the first 6 values of each.

### Answer 3a

```{r}
rowtime <- read_excel("rowtime.xlsx")
head(rowtime)
```

### Part 3b

Use the **regsubsets** function to find the "best" model for predicting the response variable racetime with up to 8 of the 19 predictor variables in the data set.  Produce the summary and the plot for the best single models with up to 8 predictors according to $R^2_{adj}$.

Which independent variables are in the best model with 8 predictors when the $R^2_{adj}$ is the criterion for selection?

### Answer 3b

```{r, warning = FALSE}
library(leaps)
regsubsets.out <- regsubsets(racetime ~ ., data = rowtime, nvmax = 8)
plot(regsubsets.out, scale = "adjr2")
mod_regs <- lm(racetime ~ calfcir + biceps + bestvj + legpower + meso +
                 ecto + expvarsity + preexper, data = rowtime)
summary(mod_regs)
```

In the best model with 8 predictors when the $R^2_{adj}$ is the criterion for selection the variables are `calfcir`, `biceps`, `bestvj`, `legpower`, `meso`, `ecto`, `expvarsity` and `preexper`.

### Part 3c

Use the **step** function with backward selection to find the "best" model for predicting the response variable rowtime. Recall that the formula structure `y ~ .` will produce the model using y as the response variable and all other variables in the data set as the predictors; in this set racetime is the response variable and all other variables are potential predictors.

Which independent variables are in this model?  What is the AIC value for this model? 

### Answer 3c

```{r}
full <- lm(racetime ~ ., data = rowtime)
mod_back <- step(full, data = rowtime, direction = "backward", trace = 0)
summary(mod_back)
extractAIC(mod_back)
```

In this model the variables are `tall`, `calfcir`, `biceps`, `estfm`, `bestvj`, `legpower`, `meso`, `expvarsity` and `preexper`. The AIC value for this model is 497.2245.

### Part 3d

Use the **step** function with forward selection to find the "best" model for predicting the response variable rowtime.  

Which independent variables are in the model selected?  What is the AIC value for this model? 

### Answer 3d

```{r}
null <- lm(racetime ~ 1, data = rowtime)
mod_forw <- step(null, scope = list(lower = null, upper = full),
                 direction = "forward", trace = 0)
summary(mod_forw)
extractAIC(mod_forw)
```

In this model the variables are `estffm`, `expvarsity`, `tall`, `preexper`, `biceps`, `meso`, `calfcir` and `bestvj`. The AIC value for this model is 497.6452.

### Part 3e

Compute the AIC for the the best model with 8 predictors from the **regsubsets** function.  How does it compare with the AIC for the two models produced by the backward and forward selection procedure?

Which model is the "best" according to the AIC?  (remember, smaller is better for AIC)

### Answer 3e

```{r}
extractAIC(mod_regs)
```

The AIC value for the best model with 8 predictors from the **regsubsets** function is 497.3196. It is very similar to the AIC values for the two models produced by the backward and forward selection procedure. However, the "best" model according to the AIC is the model selected by the **step** function with backward selection.