Regression and Classification Trees: Predict Prices of Used Cars

The file ToyotaCorolla.csv contains the data on used cars (Toyota Corolla) on sale during late summer of 2004 in the Netherlands. It has 1,436 records containing details on 38 attributes, including Price, Age, Kilometers, HP, and other specifications. The goal is to predict the price of a used Toyota Corolla based on its specifications.

Data Preprocessing

Split the data into training (60%) and validation (40%) datasets.

library(rpart)
library(rpart.plot)
# read file
df <- read.csv("ToyotaCorolla.csv")
# training dataset size
size <- floor(0.6 * nrow(df))
# set seed to make partition reproducible
set.seed(123)
# indices for training set
train_ind <- sample(nrow(df), size = size)
# split data into training and validation datasets
train <- df[train_ind, ]
test <- df[-train_ind, ]
  1. Run a regression tree (RT) with outcome variable Price and predictors Age_08_04, KM, Fuel_Type, HP, Automatic, Doors, Quarterly_Tax, Mfr_Guarantee, Guarantee_Period, Airco, Automatic_airco, CD_Player, Powered_Windows, Sport_Model, and Tow_Bar. Keep the minimum number of records in a terminal node at 1, the maximum number of tree levels at 100, and cp = 0.001, to make the run least restrictive.
# maxdepth = 30 rather than 100: per ?rpart.control, values greater than 30
# can give nonsense results, so 30 is the practical maximum
RT <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
              Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
              Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
            data = train, method = "anova",
            control = rpart.control(minbucket = 1, maxdepth = 30, cp = 0.001))
prp(RT)

Figure: the full (unpruned) regression tree plotted with prp(RT).

  2. Which appear to be the three or four most important car specifications for predicting the car's price?
RT$variable.importance
##        Age_08_04               KM  Automatic_airco    Quarterly_Tax 
##       9158111071       2717817030       2523223937       1555864580 
##               HP Guarantee_Period        CD_Player        Fuel_Type 
##        955797822        716761790        470078859        109155834 
##            Airco  Powered_Windows            Doors          Tow_Bar 
##         88398185         63622243         54284221          9132287 
##      Sport_Model    Mfr_Guarantee 
##          5885715          2343215

Based on the variable importance scores, the three most important car specifications for predicting price are Age_08_04 (age), KM (accumulated kilometers), and Automatic_airco (automatic air conditioning); Quarterly_Tax is a distant fourth.
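
As an optional visual check (a sketch, not required by the assignment), the importance scores can be rescaled to percentages and plotted, which makes the gap between the top few predictors and the rest easier to see.
# relative variable importance as a bar chart
imp <- RT$variable.importance
barplot(100 * imp / sum(imp), las = 2, cex.names = 0.7,
        ylab = "Relative importance (%)")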

  3. Compare the prediction errors of the training and validation sets by examining their RMS error and by plotting the two box plots. What is happening with the training set predictions? How does the predictive performance of the validation set compare to the training set? Why does this occur?
# prediction error for training set
pred_train <- predict(RT)
res_train <- train$Price - pred_train
# prediction error for validation set
pred_test <- predict(RT, test)
res_test <- test$Price - pred_test
# RMS error for training and validation sets
rmse_train <- sqrt(mean(res_train^2))
rmse_test <- sqrt(mean(res_test^2))
rmse_train
## [1] 974.5408
rmse_test
## [1] 1231.599
boxplot(res_train, res_test, names = c("Training set", "Validation set"))

Figure: box plots of the prediction errors for the training and validation sets.

The RMS error is 974.5408 for the training set and 1231.599 for the validation set, and the box plots show many more large outliers among the validation-set errors. This happens because the tree is grown with very loose settings (minbucket = 1, cp = 0.001) and its splits are chosen to minimise the squared error on the training data, so the training records are fitted very closely. The validation set contains records the tree never saw during fitting, so its errors are larger; the gap between the two sets is a sign of overfitting.
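
To make the overfitting even more visible (a sketch that goes slightly beyond the assignment), the tree can be grown with no complexity penalty at all (cp = 0): the training RMS error drops further, while the validation RMS error does not improve correspondingly.
fullRT <- rpart(Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
                  Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
                  Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
                data = train, method = "anova",
                control = rpart.control(minbucket = 1, maxdepth = 30, cp = 0))
# training error shrinks as the tree memorises the training data ...
sqrt(mean((train$Price - predict(fullRT))^2))
# ... but the error on unseen data does not shrink in the same way
sqrt(mean((test$Price - predict(fullRT, test))^2))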

  4. How can we achieve predictions for the training set that are not equal to the actual prices?

A training-set prediction equals the actual price only when a record ends up alone in a terminal node (or shares it with records of identical price). Because the stopping rules (minbucket, maxdepth, and especially cp) halt splitting before every record is isolated, each terminal node contains several records and its prediction is their average price. In other words, we tolerate some training error in exchange for a less complex tree.
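
A small sketch (assuming no training rows were dropped for missing values) shows this directly: each fitted value is simply the mean Price of the terminal node that the record falls into.
leaf <- RT$where                               # terminal node of each training record
node_mean <- tapply(train$Price, leaf, mean)   # average Price per terminal node
# the fitted value of a record equals its node's average price
head(cbind(fitted = predict(RT), node_avg = node_mean[as.character(leaf)]))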

  5. Prune the full tree using the cross-validation error. Compared to the full tree, what is the predictive performance of the pruned tree on the validation set?
# complexity parameter associated with minimum error
cp <- RT$cptable[which.min(RT$cptable[, "xerror"]), "CP"]
# prune the tree
pRT <- prune(RT, cp = cp)
# prediction error for validation set
pred_test_p <- predict(pRT, test)
res_test_p <- test$Price - pred_test_p
# RMS error for validation set
rmse_test_p <- sqrt(mean(res_test_p^2))
rmse_test
## [1] 1231.599
rmse_test_p
## [1] 1299.381

The RMS error of the pruned tree on the validation set is 1299.381, slightly higher than the 1231.599 obtained with the full tree, so the pruned tree's predictive performance on this validation set is slightly worse. This can happen because the cp value is chosen by cross-validation on the training data, which does not guarantee an improvement on any particular hold-out set; the pruned tree is, however, simpler.
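
An alternative, commonly used pruning choice (sketched here under the same train/validation split) is the 1-SE rule: pick the simplest tree whose cross-validation error is within one standard error of the minimum. plotcp(RT) visualises the same table used below.
plotcp(RT)                                           # xerror versus cp
i_min  <- which.min(RT$cptable[, "xerror"])
thresh <- RT$cptable[i_min, "xerror"] + RT$cptable[i_min, "xstd"]
cp_1se <- RT$cptable[which(RT$cptable[, "xerror"] <= thresh)[1], "CP"]
pRT_1se <- prune(RT, cp = cp_1se)
sqrt(mean((test$Price - predict(pRT_1se, test))^2))  # validation RMS error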

  6. Let us see the effect of turning the price variable into a categorical variable. First, create a new variable that categorizes price into 20 bins. Now repartition the data, keeping Binned_Price instead of Price. Run a classification tree (CT) with the same set of input variables as in the RT, and with Binned_Price as the output variable. Keep the minimum number of records in a terminal node at 1.
# categorize Price into 20 bins
Binned_Price <- cut(df$Price, breaks = 20)
# add Binned_Price to full dataset
df <- cbind(df, Binned_Price)
# split data into training and validation datasets
train <- df[train_ind, ]
test <- df[-train_ind, ]
# run classification tree
CT <- rpart(Binned_Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + Doors +
              Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + Airco +
              Automatic_airco + CD_Player + Powered_Windows + Sport_Model + Tow_Bar,
            data = train, method = "class", control = rpart.control(minbucket = 1))
prp(CT)

Figure: classification tree for Binned_Price plotted with prp(CT).

  7. Compare the tree generated by the CT with the one generated by the RT. Are they different? (Look at the structure, the top predictors, the size of the tree, etc.) Why?

The trees generated by the CT and the RT are clearly different (see the figures above). The RT is considerably larger and more complex than the CT and uses more predictors, although the top predictors of both trees are Age_08_04 and KM. Binning the price into 20 categories discards the information about price differences within each bin, and the classification tree optimises class purity rather than the squared error of a numeric outcome, so fewer splits turn out to be worthwhile. Note also that the CT was grown with rpart's default cp of 0.01, which is more restrictive than the cp = 0.001 used for the RT, so part of the size difference comes from the control settings.

CT$variable.importance
##        Age_08_04               KM        CD_Player            Airco 
##       146.857190        53.665182        28.972423        23.881571 
##  Automatic_airco    Quarterly_Tax      Sport_Model  Powered_Windows 
##        17.259249        14.815708        13.063749         8.212048 
## Guarantee_Period        Fuel_Type               HP          Tow_Bar 
##         3.171908         1.926651         1.467326         1.321867 
##        Automatic 
##         0.439766
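
The difference in size can be quantified directly from the fitted objects (a quick sketch) by counting the terminal nodes of each tree.
sum(RT$frame$var == "<leaf>")   # leaves in the regression tree
sum(CT$frame$var == "<leaf>")   # leaves in the classification tree
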
  8. Predict the price, using the RT and the CT, of a used Toyota Corolla with the specifications listed in the table below.
Variable | Value
--- | ---
Age_08_04 | 77
KM | 117,000
Fuel_Type | Petrol
HP | 110
Automatic | No
Doors | 5
Quarterly_Tax | 100
Mfr_Guarantee | No
Guarantee_Period | 3
Airco | Yes
Automatic_airco | No
CD_Player | No
Powered_Windows | No
Sport_Model | No
Tow_Bar | Yes
# new record with the specifications above (binary attributes coded 0 = No, 1 = Yes, as in the dataset)
newdf <- data.frame(Age_08_04 = 77, KM = 117000, Fuel_Type = "Petrol", HP = 110,
                    Automatic = 0, Doors = 5, Quarterly_Tax = 100,
                    Mfr_Guarantee = 0, Guarantee_Period = 3, Airco = 1,
                    Automatic_airco = 0, CD_Player = 0, Powered_Windows = 0,
                    Sport_Model = 0, Tow_Bar = 1)
# prediction with RT
predict(RT, newdata = newdf)
##        1 
## 7422.414
# prediction with CT
predict(CT, newdata = newdf, type = "class")
##                   1 
## (7.16e+03,8.57e+03] 
## 20 Levels: (4.32e+03,5.76e+03] (5.76e+03,7.16e+03] ... (3.11e+04,3.25e+04]
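
The CT can also report how confident it is in that bin (a sketch): with type = "prob", predict() returns the estimated probability of each price bin for the new record.
probs <- predict(CT, newdata = newdf, type = "prob")
round(sort(probs[1, ], decreasing = TRUE)[1:3], 3)   # three most likely bins
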
  9. Compare the predictions in terms of the predictors that were used, the magnitude of the difference between the two predictions, and the advantages and disadvantages of the two methods.
printcp(RT)
## 
## Regression tree:
## rpart(formula = Price ~ Age_08_04 + KM + Fuel_Type + HP + Automatic + 
##     Doors + Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + 
##     Airco + Automatic_airco + CD_Player + Powered_Windows + Sport_Model + 
##     Tow_Bar, data = train, method = "anova", control = rpart.control(minbucket = 1, 
##     maxdepth = 30, cp = 0.001))
## 
## Variables actually used in tree construction:
## [1] Age_08_04       Airco           Automatic_airco Fuel_Type      
## [5] HP              KM              Powered_Windows Quarterly_Tax  
## 
## Root node error: 1.1075e+10/861 = 12863139
## 
## n= 861 
## 
##           CP nsplit rel error  xerror     xstd
## 1  0.6489170      0  1.000000 1.00251 0.081299
## 2  0.1221376      1  0.351083 0.35427 0.026346
## 3  0.0300485      2  0.228945 0.23404 0.024672
## 4  0.0190477      3  0.198897 0.23335 0.024714
## 5  0.0178945      4  0.179849 0.20973 0.018759
## 6  0.0173926      5  0.161955 0.20476 0.018993
## 7  0.0073501      6  0.144562 0.16685 0.013798
## 8  0.0059295      7  0.137212 0.16335 0.012766
## 9  0.0055985      8  0.131283 0.15900 0.012646
## 10 0.0051837      9  0.125684 0.15723 0.012698
## 11 0.0041495     10  0.120501 0.15394 0.012624
## 12 0.0041011     11  0.116351 0.15710 0.013589
## 13 0.0039226     12  0.112250 0.15983 0.013827
## 14 0.0038129     13  0.108327 0.16013 0.013850
## 15 0.0033489     14  0.104514 0.15950 0.013847
## 16 0.0029336     15  0.101165 0.15055 0.013500
## 17 0.0028023     16  0.098232 0.14361 0.012940
## 18 0.0024706     18  0.092627 0.14361 0.012964
## 19 0.0023188     19  0.090157 0.14084 0.012829
## 20 0.0021020     20  0.087838 0.14039 0.012650
## 21 0.0020065     21  0.085736 0.14030 0.012650
## 22 0.0018770     22  0.083729 0.14025 0.012645
## 23 0.0012687     23  0.081852 0.14261 0.012902
## 24 0.0012400     24  0.080584 0.14077 0.012910
## 25 0.0012387     25  0.079344 0.14067 0.012912
## 26 0.0011848     26  0.078105 0.14529 0.014192
## 27 0.0010643     27  0.076920 0.14321 0.014073
## 28 0.0010156     28  0.075856 0.14395 0.014115
## 29 0.0010068     29  0.074840 0.14525 0.014185
## 30 0.0010000     30  0.073833 0.14541 0.014186
printcp(CT)
## 
## Classification tree:
## rpart(formula = Binned_Price ~ Age_08_04 + KM + Fuel_Type + HP + 
##     Automatic + Doors + Quarterly_Tax + Mfr_Guarantee + Guarantee_Period + 
##     Airco + Automatic_airco + CD_Player + Powered_Windows + Sport_Model + 
##     Tow_Bar, data = train, method = "class", control = rpart.control(minbucket = 1))
## 
## Variables actually used in tree construction:
## [1] Age_08_04       KM              Powered_Windows
## 
## Root node error: 608/861 = 0.70616
## 
## n= 861 
## 
##         CP nsplit rel error  xerror     xstd
## 1 0.085526      0   1.00000 1.00000 0.021984
## 2 0.047697      2   0.82895 0.82895 0.023776
## 3 0.031250      3   0.78125 0.78783 0.023977
## 4 0.016447      4   0.75000 0.76974 0.024039
## 5 0.013158      5   0.73355 0.76480 0.024053
## 6 0.010000      7   0.70724 0.77138 0.024034

The predictors actually used in the RT model are Age_08_04, Airco, Automatic_airco, Fuel_Type, HP, KM, Powered_Windows, and Quarterly_Tax.

The predictors actually used in the CT model are only Age_08_04, KM, and Powered_Windows; the CT model is therefore much simpler than the RT model.

The two predictions agree closely: the RT predicts a price of 7422.41, which falls inside the (7,160, 8,570] bin predicted by the CT.

The main advantage of the RT is that it keeps Price numeric and returns a point estimate, with split points chosen to minimise squared prediction error; its drawback is the larger, more complex tree. A disadvantage of the CT approach is that the price bins are fixed in advance (here, 20 equal-width intervals), so information about price differences within a bin is thrown away, the bin boundaries need not suit the data, and the prediction is only a range rather than a single value.
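
As a rough follow-up (not asked for in the assignment), the CT's out-of-sample quality can be summarised as the share of validation records assigned to their true price bin.
ct_class <- predict(CT, newdata = test, type = "class")
mean(ct_class == test$Binned_Price)   # classification accuracy on the validation set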