Regression and Classification Trees: Predict Prices of Used Cars
The file ToyotaCorolla.csv contains data on used Toyota Corollas on sale during late summer of 2004 in the Netherlands. It has 1,436 records with details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications.
Data Preprocessing
Split the data into training (60%) and validation (40%) datasets.
library(rpart)
library(rpart.plot)
# read file
df <- read.csv("ToyotaCorolla.csv")
# training dataset size
size <- floor(0.6 * nrow(df))
# set seed to make partition reproducible
set.seed(123)
# indices for training set
train_ind <- sample(nrow(df), size = size)
# split data into training and validation datasets
train <- df[train_ind, ]
test <- df[-train_ind, ]
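As a quick sanity check (a minimal sketch), the partition sizes can be confirmed: with 1,436 records, 60% gives 861 training rows and 575 validation rows.
# number of records in the training and validation sets
nrow(train)  # 861
nrow(test)   # 575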
- Run a regression tree (RT) with outcome variable Price and predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Keep the minimum number of records in a terminal node to 1, the maximum number of tree levels to 100, and cp = 0.001, to make the run least restrictive.
# maxdepth = 30 is used instead of 100 because rpart.control caps maxdepth at 30,
# as explained in the rpart.control documentation
RT <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
data = train, method = "anova",
control = rpart.control(minbucket = 1, maxdepth = 30, cp = 0.001))
prp(RT)
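Optionally, the size of the full tree and a more detailed plot can be inspected; a minimal sketch using standard rpart/rpart.plot facilities:
# number of terminal nodes (leaves) in the full tree
sum(RT$frame$var == "<leaf>")
# plot with the number of training records added to each node label
prp(RT, type = 2, extra = 1, digits = 6)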
- Which appear to be the three or four most important car specifications for predicting the car's price?
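The ranking shown below was presumably obtained from the variable importance scores stored in the fitted model:
# variable importance scores of the regression tree
RT$variable.importance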
## Age_08_04 KM Automatic_airco Quarterly_Tax
## 9158111071 2717817030 2523223937 1555864580
## HP Guarantee_Period CD_Player Fuel_Type
## 955797822 716761790 470078859 109155834
## Airco Powered_Windows Doors Tow_Bar
## 88398185 63622243 54284221 9132287
## Sport_Model Mfr_Guarantee
## 5885715 2343215
Based on the variable importance scores, the three most important car specifications for predicting price are Age_08_04, KM, and Automatic_airco, with Quarterly_Tax a distant fourth.
- Compare the prediction errors of the training and validation sets by examining their RMS error and by plotting the two box plots. What is happening with the training set predictions? How does the predictive performance of the validation set compare to the training set? Why does this occur?
# prediction error for training set
pred_train <- predict(RT)
res_train <- train$Price - pred_train
# prediction error for validation set
pred_test <- predict(RT, test)
res_test <- test$Price - pred_test
# RMS error for training and validation sets
rmse_train <- sqrt(mean(res_train^2))
rmse_test <- sqrt(mean(res_test^2))
rmse_train
## [1] 974.5408
rmse_test
## [1] 1231.599
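The box plots referred to in the discussion below can be drawn, for example, as follows (a minimal sketch):
# box plots of the prediction errors for the training and validation sets
boxplot(list(Training = res_train, Validation = res_test),
        ylab = "Prediction error (actual - predicted)")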
The RMS error is 974.5408 for the training set and 1231.599 for the validation set. The box plot of the validation-set prediction errors also shows many more outliers than that of the training set. This happens because the tree is grown by minimising the squared prediction error on the training records, so it fits those records closely. The validation set contains observations the model never saw during training, some of them quite different from the training observations, so the predictive performance on the validation set is worse than on the training set.
- How can we achieve predictions for the training set that are not equal to the actual prices?
When training the model we tolerated some prediction error on the training set in exchange for a less complex tree: the depth limit, the minimum node size, and cp = 0.001 stop the splitting before every terminal node is pure, so each fitted value is the average price of the records in its terminal node rather than an individual car's actual price.
- Prune the full tree using the cross-validation error. Compared to the full tree, what is the predictive performance for the validation set?
# complexity parameter associated with minimum error
cp <- RT$cptable[which.min(RT$cptable[, "xerror"]), "CP"]
# prune the tree
pRT <- prune(RT, cp = cp)
# prediction error for validation set
pred_test_p <- predict(pRT, test)
res_test_p <- test$Price - pred_test_p
# RMS error for validation set
rmse_test_p <- sqrt(mean(res_test_p^2))
rmse_test
## [1] 1231.599
rmse_test_p
## [1] 1299.381
The RMS error of the pruned tree on the validation set is 1299.381, slightly higher than the 1231.599 obtained with the full tree. The predictive performance of the pruned tree on the validation set is therefore slightly worse than that of the full tree.
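To see where the cross-validation error reaches its minimum, and how large the pruned tree ends up being, the complexity-parameter table can be plotted and queried; a minimal sketch:
# cross-validated error as a function of cp / tree size
plotcp(RT)
# cp value selected above, and the number of splits retained after pruning
cp
pRT$cptable[nrow(pRT$cptable), "nsplit"]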
- Let us see the effect of turning the price variable into a categorical variable. First, create a new variable that categorizes price into 20 bins. Now repartition the data, keeping Binned_Price instead of Price. Run a classification tree with the same set of input variables as in the RT, and with Binned_Price as the output variable. Keep the minimum number of records in a terminal node to 1.
# categorize Price into 20 bins
Binned_Price <- cut(df$Price, breaks = 20)
# add Binned_Price to full dataset
df <- cbind(df, Binned_Price)
# split data into training and validation datasets
train <- df[train_ind, ]
test <- df[-train_ind, ]
# run classification tree
CT <- rpart(Binned_Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
data = train, method = "class", control = rpart.control(minbucket = 1))
prp(CT)
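Before interpreting the classification tree it can be useful to check the binning itself; a minimal sketch:
# the 20 equal-width price bins and the number of cars in each
nlevels(df$Binned_Price)
table(df$Binned_Price)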
- Compare the tree generated by the CT with the one generated by the RT. Are they different? (Look at the structure, the top predictors, the size of the tree, etc.) Why?
The trees generated by the CT and the RT are different (as can be seen from the figures above). The RT is larger and more complex than the CT, partly because the RT was grown with a very permissive cp of 0.001 while the CT kept the default cp of 0.01, and partly because binning the price discards the within-bin price information that the RT can still exploit. For both trees, however, the top predictors are Age_08_04 and KM.
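The variable importance scores of the CT, presumably extracted from the fitted model as below, confirm this ranking:
# variable importance scores of the classification tree
CT$variable.importance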
## Age_08_04 KM CD_Player Airco
## 146.857190 53.665182 28.972423 23.881571
## Automatic_airco Quarterly_Tax Sport_Model Powered_Windows
## 17.259249 14.815708 13.063749 8.212048
## Guarantee_Period Fuel_Type HP Tow_Bar
## 3.171908 1.926651 1.467326 1.321867
## Automatic
## 0.439766
- Predict the price, using the RT and the CT, of a used Toyota Corolla with the specifications listed in the table below.
Variable | Value |
---|---|
Age_08_04 | 77 |
KM | 117,000 |
Fuel_Type | Petrol |
HP | 110 |
Automatic | No |
Doors | 5 |
Quarterly_Tax | 100 |
Mfr_Guarantee | No |
Guarantee_Period | 3 |
Airco | Yes |
Automatic_airco | No |
CD_Player | No |
Powered_Windows | No |
Sport_Model | No |
Tow_Bar | Yes |
newdf <- data.frame(Age_08_04 = 77, KM = 117000, Fuel_Type = "Petrol", HP = 110,
Automatic = 0, Doors = 5, Quarterly_Tax = 100,
Mfr_Guarantee = 0, Guarantee_Period = 3, Airco = 1,
Automatic_airco = 0, CD_Player = 0, Powered_Windows = 0,
Sport_Model = 0, Tow_Bar = 1)
# prediction with RT
predict(RT, newdata = newdf)
## 1
## 7422.414
# prediction with CT
predict(CT, newdata = newdf, type = "class")
## 1
## (7.16e+03,8.57e+03]
## 20 Levels: (4.32e+03,5.76e+03] (5.76e+03,7.16e+03] ... (3.11e+04,3.25e+04]
- Compare the predictions in terms of the predictors that were used, the magnitude of the difference between the two predictions, and the advantages and disadvantages of the two methods.
printcp(RT)
##
## Regression tree:
## rpart(formula = Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic +
## Doors + Quarterly_Tax + Mfr_Guarantee + Guarantee_Period +
## Airco + Automatic_airco + CD_Player + Powered_Windows + Sport_Model +
## Tow_Bar, data = train, method = "anova", control = rpart.control(minbucket = 1,
## maxdepth = 30, cp = 0.001))
##
## Variables actually used in tree construction:
## [1] Age_08_04 Airco Automatic_airco Fuel_Type
## [5] HP KM Powered_Windows Quarterly_Tax
##
## Root node error: 1.1075e+10/861 = 12863139
##
## n= 861
##
## CP nsplit rel error xerror xstd
## 1 0.6489170 0 1.000000 1.00251 0.081299
## 2 0.1221376 1 0.351083 0.35427 0.026346
## 3 0.0300485 2 0.228945 0.23404 0.024672
## 4 0.0190477 3 0.198897 0.23335 0.024714
## 5 0.0178945 4 0.179849 0.20973 0.018759
## 6 0.0173926 5 0.161955 0.20476 0.018993
## 7 0.0073501 6 0.144562 0.16685 0.013798
## 8 0.0059295 7 0.137212 0.16335 0.012766
## 9 0.0055985 8 0.131283 0.15900 0.012646
## 10 0.0051837 9 0.125684 0.15723 0.012698
## 11 0.0041495 10 0.120501 0.15394 0.012624
## 12 0.0041011 11 0.116351 0.15710 0.013589
## 13 0.0039226 12 0.112250 0.15983 0.013827
## 14 0.0038129 13 0.108327 0.16013 0.013850
## 15 0.0033489 14 0.104514 0.15950 0.013847
## 16 0.0029336 15 0.101165 0.15055 0.013500
## 17 0.0028023 16 0.098232 0.14361 0.012940
## 18 0.0024706 18 0.092627 0.14361 0.012964
## 19 0.0023188 19 0.090157 0.14084 0.012829
## 20 0.0021020 20 0.087838 0.14039 0.012650
## 21 0.0020065 21 0.085736 0.14030 0.012650
## 22 0.0018770 22 0.083729 0.14025 0.012645
## 23 0.0012687 23 0.081852 0.14261 0.012902
## 24 0.0012400 24 0.080584 0.14077 0.012910
## 25 0.0012387 25 0.079344 0.14067 0.012912
## 26 0.0011848 26 0.078105 0.14529 0.014192
## 27 0.0010643 27 0.076920 0.14321 0.014073
## 28 0.0010156 28 0.075856 0.14395 0.014115
## 29 0.0010068 29 0.074840 0.14525 0.014185
## 30 0.0010000 30 0.073833 0.14541 0.014186
printcp(CT)
##
## Classification tree:
## rpart(formula = Binned_Price ~ Age_08_04 + KM + Fuel_Type + HP +
## Automatic + Doors + Quarterly_Tax + Mfr_Guarantee + Guarantee_Period +
## Airco + Automatic_airco + CD_Player + Powered_Windows + Sport_Model +
## Tow_Bar, data = train, method = "class", control = rpart.control(minbucket = 1))
##
## Variables actually used in tree construction:
## [1] Age_08_04 KM Powered_Windows
##
## Root node error: 608/861 = 0.70616
##
## n= 861
##
## CP nsplit rel error xerror xstd
## 1 0.085526 0 1.00000 1.00000 0.021984
## 2 0.047697 2 0.82895 0.82895 0.023776
## 3 0.031250 3 0.78125 0.78783 0.023977
## 4 0.016447 4 0.75000 0.76974 0.024039
## 5 0.013158 5 0.73355 0.76480 0.024053
## 6 0.010000 7 0.70724 0.77138 0.024034
The predictors actually used in the RT are Age_08_04, Airco, Automatic_airco, Fuel_Type, HP, KM, Powered_Windows, and Quarterly_Tax, whereas the CT uses only Age_08_04, KM, and Powered_Windows. The CT is therefore much simpler than the RT.
The two predictions agree closely: the RT predicts a price of 7422.41, which falls inside the (7.16e+03, 8.57e+03] interval predicted by the CT.
The RT has the advantage of producing a numeric price estimate, with splits chosen to minimise the squared prediction error directly. A disadvantage of the CT approach is that the price bins are fixed before the tree is grown, so information about the price within each bin is lost and the resulting splits need not be optimal for predicting the numeric price; ideally the splits would be chosen based on the relationship between the numeric outcome and the predictors, for example via mutual information or a similar criterion.