Linear Regression for College Tuition in R
Required files can be downlowded here.
The following exercises aim to use linear regression tools in R to investigate whether a linear relationship exists between tuition and other variables. The dataset for these tasks is called tuition.csv. The dataset has 1283 records of information about school tuition. The data table (d0
) are structured as the table below.
VARIABLES | DESCRIPTION | DATA TYPE |
---|---|---|
tuition | College tuition (“out-of-state” rate) | continuous |
pcttop25 | Percent of new students from the top 25% of high school class | continuous |
sf_ratio | Student to faculty ratio | continuous |
fac_comp | Average faculty compensation | continuous |
accrate | Fraction of applicants accepted for admission | continuous |
graduat | Percent of students who graduate | continuous |
pct_phd | Percent of faculty with Ph.D.’s | continuous |
fulltime | Percent of undergraduates who are full time students | continuous |
alumni | Percent of alumni who donate | continuous |
num_enrl | Number of new students enrolled | continuous |
public.private | Is the college a public or private institution? public=0, private=1 | discrete |
Task 1. Data imputation
a) Check the names and data types of variables (columns) in R. And compare the ones in above table. Take actions to make sure that there is no problem with the data types: Integer or Numeric for continuous variables; Factor for discrete variables.
d0 <- read.csv("tuition.csv")
d0$public.private <- factor(d0$public.private, levels = c(0, 1),
labels = c("public", "private"))
str(d0)
## 'data.frame': 1283 obs. of 11 variables:
## $ tuition : int 3400 5600 4440 3000 6300 2600 4104 11660 2970 8080 ...
## $ pcttop25 : int NA 27 78 NA 57 6 46 88 40 47 ...
## $ sf_ratio : num 14.3 32.8 18.9 18.7 16.7 16.5 25.3 14 19.4 11.4 ...
## $ fac_comp : int 42300 NA 47700 NA 54600 NA 54800 52300 50300 36600 ...
## $ accrate : num 0.682 0.928 0.66 0.705 0.9 1 0.735 0.73 0.646 0.855 ...
## $ graduat : int 40 55 51 15 69 36 50 72 76 44 ...
## $ pct_phd : int 53 52 72 48 85 64 94 74 62 63 ...
## $ fulltime : num 92.8 70.3 87.8 90.9 90.5 ...
## $ alumni : int NA NA 8 NA 18 NA 3 34 5 9 ...
## $ num_enrl : int 984 179 570 1278 3070 644 1686 287 625 127 ...
## $ public.private: Factor w/ 2 levels "public","private": 1 2 1 1 1 1 1 2 1 2 ...
b) Missing data appears to be a problem with this data set. Prepare a copy of the data set (d1
), where the missing values are each replaced with their column means. Try to use for
loop in R for this question.
c) Report on how this imputation has affected the summary statistics between d0 and d1. What do you think of this method of dealing with missing values? Can you suggest a better method?
## tuition pcttop25 sf_ratio fac_comp
## Min. : 1044 Min. : 6.00 Min. : 2.30 Min. : 26500
## 1st Qu.: 6114 1st Qu.: 37.00 1st Qu.:11.80 1st Qu.: 43600
## Median : 8670 Median : 50.00 Median :14.30 Median : 50900
## Mean : 9284 Mean : 52.28 Mean :14.87 Mean : 52680
## 3rd Qu.:11675 3rd Qu.: 65.00 3rd Qu.:17.60 3rd Qu.: 60100
## Max. :25750 Max. :100.00 Max. :91.80 Max. :107500
## NA's :196 NA's :2 NA's :162
## accrate graduat pct_phd fulltime
## Min. :0.1540 Min. : 8.00 Min. : 8.00 Min. :11.43
## 1st Qu.:0.6837 1st Qu.: 47.00 1st Qu.: 57.00 1st Qu.:68.59
## Median :0.7840 Median : 60.00 Median : 71.00 Median :83.42
## Mean :0.7581 Mean : 60.42 Mean : 68.72 Mean :78.79
## 3rd Qu.:0.8610 3rd Qu.: 74.00 3rd Qu.: 82.00 3rd Qu.:91.88
## Max. :1.0000 Max. :118.00 Max. :103.00 Max. :99.94
## NA's :11 NA's :95 NA's :31 NA's :27
## alumni num_enrl public.private
## Min. : 0.00 Min. : 18.0 public :457
## 1st Qu.:11.00 1st Qu.: 234.5 private:826
## Median :19.00 Median : 446.5
## Mean :20.92 Mean : 782.4
## 3rd Qu.:29.00 3rd Qu.: 984.2
## Max. :81.00 Max. :7425.0
## NA's :213 NA's :3
## tuition pcttop25 sf_ratio fac_comp
## Min. : 1044 Min. : 6.00 Min. : 2.30 Min. : 26500
## 1st Qu.: 6114 1st Qu.: 39.00 1st Qu.:11.80 1st Qu.: 44700
## Median : 8670 Median : 52.28 Median :14.30 Median : 52680
## Mean : 9284 Mean : 52.28 Mean :14.87 Mean : 52680
## 3rd Qu.:11675 3rd Qu.: 63.00 3rd Qu.:17.55 3rd Qu.: 58500
## Max. :25750 Max. :100.00 Max. :91.80 Max. :107500
## accrate graduat pct_phd fulltime
## Min. :0.1540 Min. : 8.00 Min. : 8.00 Min. :11.43
## 1st Qu.:0.6850 1st Qu.: 48.00 1st Qu.: 57.00 1st Qu.:68.91
## Median :0.7820 Median : 60.42 Median : 70.00 Median :82.89
## Mean :0.7581 Mean : 60.42 Mean : 68.72 Mean :78.79
## 3rd Qu.:0.8605 3rd Qu.: 72.00 3rd Qu.: 82.00 3rd Qu.:91.59
## Max. :1.0000 Max. :118.00 Max. :103.00 Max. :99.94
## alumni num_enrl public.private
## Min. : 0.00 Min. : 18.0 public :457
## 1st Qu.:12.00 1st Qu.: 235.5 private:826
## Median :20.92 Median : 449.0
## Mean :20.92 Mean : 782.4
## 3rd Qu.:26.00 3rd Qu.: 982.0
## Max. :81.00 Max. :7425.0
This method has changed the median values for most of the variables, however it did not change the mean. I think this method of dealing with missing values is quite good, however it can lead to biased estimators. I cannot suggest a better method.
Task 2. Relationships between Variables
a) Provide a scatter plot describing the relationship of each variable with tuition. Examine the plots and rank them in descending order according to the correlation between each variable and tuition.
Ranked scatter plots:
par(mfrow = c(1, 2))
plot(tuition ~ graduat, data = d1, las = 1)
plot(tuition ~ alumni, data = d1, las = 1)
b) Calculate the correlation values for each pair of relationship and validate your ranking.
## tuition graduat alumni pcttop25 fac_comp pct_phd fulltime
## 1.0000000 0.6040003 0.4995750 0.4726536 0.3937655 0.3739891 0.2584426
## num_enrl accrate sf_ratio
## -0.1396856 -0.2730378 -0.4424179
Task 3. Model Building
a) Fit a linear regression model (L1
) using all variables and data.
##
## Call:
## lm(formula = tuition ~ ., data = d0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9085.9 -1257.7 -18.3 1256.7 11017.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.542e+03 9.887e+02 -2.571 0.010328 *
## pcttop25 -3.048e-01 5.426e+00 -0.056 0.955208
## sf_ratio -1.749e+02 2.319e+01 -7.544 1.25e-13 ***
## fac_comp 1.288e-01 1.043e-02 12.340 < 2e-16 ***
## accrate 6.292e+00 6.357e+02 0.010 0.992105
## graduat 2.700e+01 5.651e+00 4.778 2.11e-06 ***
## pct_phd 3.136e+01 6.861e+00 4.572 5.61e-06 ***
## fulltime 1.259e+01 5.201e+00 2.420 0.015730 *
## alumni 4.483e+01 7.663e+00 5.850 7.17e-09 ***
## num_enrl -3.907e-01 1.120e-01 -3.490 0.000509 ***
## public.privateprivate 3.952e+03 2.487e+02 15.892 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2118 on 793 degrees of freedom
## (479 observations deleted due to missingness)
## Multiple R-squared: 0.7403, Adjusted R-squared: 0.737
## F-statistic: 226 on 10 and 793 DF, p-value: < 2.2e-16
b) Write out the estimated regression equation in mathematic formulation.
\(\widehat{tuition}\) = -2541.86 - 0.30pcttop25 - 174.91sf_ratio + 0.13fac_comp + 6.29accrate + 27.00graduat + 31.36pct_phd + 12.59fulltime + 44.83alumni - 0.39num_enrl + 3951.67public.private
c) Report RMSE, R-squared, and adjusted R-squared.
## [1] 2103.585
RMSE = 2103.585, R-squared = 0.7403, adjusted R-squared = 0.737.
Task 4. Model Examination
a) Investigate whether the regression assumptions hold. These assumptions are linearity, normality, and constant variance.
Linearity: The Residuals vs Fitted plot shows equally spread residuals around a horizontal line without distinct patterns, suggesting that the linearity assumption is satisfied.
Normality: The Normal Q-Q plot shows the points following the straight line quite well, suggesting that the normality assumption is satisfied,
Constant variance: The Scale-Location plot shows a horizontal line with equally (randomly) spread points, suggesting that the homoscedasticity (constant variance) assumption is satisfied.
b) Explain clearly and completely the meaning and interpretation of the regression coefficient for faculty compensation.
Holding other predictors in the model constant, every one unit increase in faculty compensation results in 0.13 increase in tuition.
c) Explain clearly and completely the meaning of the coincident for public.private.
Holding other predictors in the model constant, the tuition for private schools is about 3951.67 higher than the tuition for public schools.
Task 5. Model Improvement
a) Rebuild a linear regression model (L2
) by removing some variables that you think are insignificant.
L2 <- lm(tuition ~ sf_ratio + fac_comp + graduat + pct_phd + fulltime + alumni +
num_enrl + public.private, data = d0)
summary(L2)
##
## Call:
## lm(formula = tuition ~ sf_ratio + fac_comp + graduat + pct_phd +
## fulltime + alumni + num_enrl + public.private, data = d0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9350.2 -1236.5 7.6 1226.8 10838.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.622e+03 6.734e+02 -3.894 0.000106 ***
## sf_ratio -1.691e+02 2.176e+01 -7.771 2.18e-14 ***
## fac_comp 1.320e-01 9.021e-03 14.629 < 2e-16 ***
## graduat 2.433e+01 5.172e+00 4.705 2.95e-06 ***
## pct_phd 2.727e+01 6.177e+00 4.415 1.14e-05 ***
## fulltime 1.582e+01 4.780e+00 3.310 0.000971 ***
## alumni 4.629e+01 7.180e+00 6.447 1.89e-10 ***
## num_enrl -3.811e-01 1.041e-01 -3.660 0.000268 ***
## public.privateprivate 3.914e+03 2.322e+02 16.855 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2092 on 876 degrees of freedom
## (398 observations deleted due to missingness)
## Multiple R-squared: 0.7491, Adjusted R-squared: 0.7468
## F-statistic: 326.9 on 8 and 876 DF, p-value: < 2.2e-16
b) Construct a table to compare L1
and L2
models, variables included, R-squared, and pvalue of each variable. Which model do you prefer and why?
##
## ========================================================================
## Dependent variable:
## --------------------------------------------------
## tuition
## L1 L2
## (1) (2)
## ------------------------------------------------------------------------
## pcttop25 -0.305
## (5.426)
## sf_ratio -174.906*** -169.062***
## (23.186) (21.756)
## fac_comp 0.129*** 0.132***
## (0.010) (0.009)
## accrate 6.292
## (635.673)
## graduat 27.000*** 24.334***
## (5.651) (5.172)
## pct_phd 31.364*** 27.269***
## (6.861) (6.177)
## fulltime 12.588** 15.821***
## (5.201) (4.780)
## alumni 44.829*** 46.287***
## (7.663) (7.180)
## num_enrl -0.391*** -0.381***
## (0.112) (0.104)
## public.privateprivate 3,951.672*** 3,914.068***
## (248.658) (232.222)
## Constant -2,541.856** -2,621.899***
## (988.745) (673.357)
## ------------------------------------------------------------------------
## Observations 804 885
## R2 0.740 0.749
## Adjusted R2 0.737 0.747
## Residual Std. Error 2,118.124 (df = 793) 2,092.359 (df = 876)
## F Statistic 226.003*** (df = 10; 793) 326.915*** (df = 8; 876)
## ========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
I prefer the L2
model because it contains only significant variables and has higher adjusted R-squared value.
Task 6. Model Building with Data Imputation
a) Re-do task 3, by using data set d1
where the missing values are each replaced with their column means and build the linear regression model L3
.
##
## Call:
## lm(formula = tuition ~ ., data = d1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9367.0 -1386.0 76.3 1402.3 12179.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.735e+03 7.809e+02 -6.064 1.75e-09 ***
## pcttop25 9.382e+00 4.720e+00 1.988 0.047 *
## sf_ratio -1.011e+02 1.424e+01 -7.100 2.07e-12 ***
## fac_comp 1.095e-01 8.490e-03 12.893 < 2e-16 ***
## accrate 2.554e+02 5.041e+02 0.507 0.613
## graduat 4.209e+01 4.679e+00 8.995 < 2e-16 ***
## pct_phd 3.920e+01 4.994e+00 7.849 8.83e-15 ***
## fulltime 7.704e+00 4.411e+00 1.747 0.081 .
## alumni 4.193e+01 6.840e+00 6.130 1.17e-09 ***
## num_enrl -1.939e-01 9.969e-02 -1.945 0.052 .
## public.privateprivate 3.888e+03 1.954e+02 19.901 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2363 on 1272 degrees of freedom
## Multiple R-squared: 0.6827, Adjusted R-squared: 0.6802
## F-statistic: 273.7 on 10 and 1272 DF, p-value: < 2.2e-16
\(\widehat{tuition}\) = -4735.07 + 9.38pcttop25 - 101.11sf_ratio + 0.11fac_comp + 255.37accrate + 42.09graduat + 39.20pct_phd + 7.70fulltime + 41.93alumni - 0.19num_enrl + 3888.06public.private
## [1] 2352.389
RMSE = 2352.389, R-squared = 0.6827, adjusted R-squared = 0.6802.
b) Add the results of L3
to the previous table, showing variables included, R-squared, and p-value of each variable. Which model do you find more credible and why?
##
## ===================================================================================================
## Dependent variable:
## -----------------------------------------------------------------------------
## tuition
## L1 L2 L3
## (1) (2) (3)
## ---------------------------------------------------------------------------------------------------
## pcttop25 -0.305 9.382**
## (5.426) (4.720)
## sf_ratio -174.906*** -169.062*** -101.112***
## (23.186) (21.756) (14.242)
## fac_comp 0.129*** 0.132*** 0.109***
## (0.010) (0.009) (0.008)
## accrate 6.292 255.371
## (635.673) (504.103)
## graduat 27.000*** 24.334*** 42.087***
## (5.651) (5.172) (4.679)
## pct_phd 31.364*** 27.269*** 39.201***
## (6.861) (6.177) (4.994)
## fulltime 12.588** 15.821*** 7.704*
## (5.201) (4.780) (4.411)
## alumni 44.829*** 46.287*** 41.927***
## (7.663) (7.180) (6.840)
## num_enrl -0.391*** -0.381*** -0.194*
## (0.112) (0.104) (0.100)
## public.privateprivate 3,951.672*** 3,914.068*** 3,888.064***
## (248.658) (232.222) (195.374)
## Constant -2,541.856** -2,621.899*** -4,735.070***
## (988.745) (673.357) (780.879)
## ---------------------------------------------------------------------------------------------------
## Observations 804 885 1,283
## R2 0.740 0.749 0.683
## Adjusted R2 0.737 0.747 0.680
## Residual Std. Error 2,118.124 (df = 793) 2,092.359 (df = 876) 2,362.538 (df = 1272)
## F Statistic 226.003*** (df = 10; 793) 326.915*** (df = 8; 876) 273.723*** (df = 10; 1272)
## ===================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
I think model L2
gives more credible results because it has the largest adjusted R-squared and all variables are statistically significant.
Task 7. Regression Model Error
a) Split the original dataset into two parts, training and testing. Randomly select 70% rows of data as training part and the rest 30% as test part, using sample() R function.
set.seed(123)
trainRows <- sample(nrow(d0), 0.7 * nrow(d0))
mdataTrain <- d0[trainRows,]
mdataTest <- d0[-trainRows,]
b) Re-do task 3 by using training data only, and get a LR model (L4
).
##
## Call:
## lm(formula = tuition ~ ., data = mdataTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8964.3 -1305.9 -17.7 1245.1 10859.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.851e+03 1.157e+03 -1.599 0.110380
## pcttop25 -5.111e+00 6.462e+00 -0.791 0.429376
## sf_ratio -1.676e+02 2.793e+01 -6.000 3.54e-09 ***
## fac_comp 1.298e-01 1.231e-02 10.545 < 2e-16 ***
## accrate -5.721e+02 7.621e+02 -0.751 0.453195
## graduat 2.605e+01 6.695e+00 3.891 0.000112 ***
## pct_phd 3.090e+01 8.143e+00 3.795 0.000164 ***
## fulltime 1.250e+01 6.280e+00 1.991 0.046980 *
## alumni 4.513e+01 9.214e+00 4.898 1.27e-06 ***
## num_enrl -4.325e-01 1.339e-01 -3.229 0.001315 **
## public.privateprivate 3.989e+03 2.963e+02 13.466 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2156 on 559 degrees of freedom
## (328 observations deleted due to missingness)
## Multiple R-squared: 0.7363, Adjusted R-squared: 0.7316
## F-statistic: 156.1 on 10 and 559 DF, p-value: < 2.2e-16
\(\widehat{tuition}\) = -1850.72 - 5.11pcttop25 - 167.56sf_ratio + 0.13fac_comp - 572.06accrate + 26.05graduat + 30.90pct_phd + 12.50fulltime + 45.13alumni - 0.43num_enrl + 3989.36public.private
## [1] 2135.556
RMSE = 2135.556, R-squared = 0.7363, adjusted R-squared = 0.7316.
c) Apply L4
for training dataset, calculate the regression performance indicators, include the Sum of Square Errors (SSE), the Mean Squared Error (MSE), and the Mean Absolute Percentage Error (MAPE).
## [1] 2599541347
## [1] 4560599
MAPE <- mean(abs((mdataTrain$tuition - predict(L4, mdataTrain))/mdataTrain$tuition),
na.rm = TRUE)
MAPE
## [1] 0.1918862
d) Apply L4
for test dataset, calculate the same set of performance indicators as the last question. Compare them and report your observations.
## [1] 970654781
## [1] 4148097
MAPE <- mean(abs((mdataTest$tuition - predict(L4, mdataTest))/mdataTest$tuition),
na.rm = TRUE)
MAPE
## [1] 0.1880695
All performance indicators are lower for the test dataset.
e) Redo task 7 parts a, b, c, d for the datasets which randomly select 30% rows of data as training part and the rest 70% as test part, and get a LR model L5
.
set.seed(1234)
trainRows <- sample(nrow(d0), 0.3 * nrow(d0))
mdataTrain <- d0[trainRows,]
mdataTest <- d0[-trainRows,]
L5 <- lm(tuition ~ ., data = mdataTrain)
summary(L5)
##
## Call:
## lm(formula = tuition ~ ., data = mdataTrain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4753.6 -1217.3 -101.7 1475.6 9885.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.927e+03 1.833e+03 -1.597 0.111739
## pcttop25 1.550e+01 1.053e+01 1.472 0.142514
## sf_ratio -8.675e+01 4.010e+01 -2.163 0.031573 *
## fac_comp 1.544e-01 2.108e-02 7.322 4.34e-12 ***
## accrate -4.180e+02 1.125e+03 -0.371 0.710666
## graduat 2.980e+00 1.107e+01 0.269 0.788026
## pct_phd 6.734e+00 1.326e+01 0.508 0.612098
## fulltime 5.127e+00 8.746e+00 0.586 0.558321
## alumni 6.097e+01 1.670e+01 3.651 0.000325 ***
## num_enrl -2.182e-01 2.230e-01 -0.978 0.328884
## public.privateprivate 5.230e+03 4.845e+02 10.795 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2172 on 224 degrees of freedom
## (149 observations deleted due to missingness)
## Multiple R-squared: 0.7358, Adjusted R-squared: 0.724
## F-statistic: 62.4 on 10 and 224 DF, p-value: < 2.2e-16
\(\widehat{tuition}\) = -2926.89 + 15.50pcttop25 - 86.75sf_ratio + 0.15fac_comp - 417.96accrate + 2.98graduat + 6.73pct_phd + 5.13fulltime + 60.97alumni - 0.22num_enrl + 5230.45public.private
## [1] 2120.472
RMSE = 2120.472, R-squared = 0.7358, adjusted R-squared = 0.724.
## [1] 1056654006
## [1] 4496400
MAPE <- mean(abs((mdataTrain$tuition - predict(L5, mdataTrain))/mdataTrain$tuition),
na.rm = TRUE)
MAPE
## [1] 0.1820471
## [1] 2848081410
## [1] 5005415
MAPE <- mean(abs((mdataTest$tuition - predict(L5, mdataTest))/mdataTest$tuition),
na.rm = TRUE)
MAPE
## [1] 0.2078353
All performance indicators are lower for the training dataset.