(a) regression tree model

I will use the rpart package to build a regression tree model. Because rpart performs 10-fold cross-validation internally by default, I don't need to split the data.

uscrime <- read.csv("./data 10.1/uscrime.txt",sep = "")

library(rpart)
## Warning: package 'rpart' was built under R version 4.1.3
set.seed(1)
tree<-rpart(formula=Crime~., data = uscrime, method = 'anova') # a default tree, 'anova' is for regression

plot(tree, margin = 0.1);text(tree,use.n = T)

So our default tree's root node splits on Po1 < 7.65, and the two resulting child nodes split on Pop < 22.5 and NW < 7.65, producing four terminal nodes.
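The base plot above is serviceable but hard to read. Assuming the rpart.plot package is installed (it is not used elsewhere in this write-up), the same tree can be drawn with the fitted mean and observation count shown in each node:

library(rpart.plot) # optional extra package, only for a cleaner picture of the same tree
rpart.plot(tree)    # displays fitted means and n in each node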

To reduce overfitting, we can apply a cost-complexity (cp) penalty to prune the tree. The larger the cp value, the larger the penalty on complexity and hence the smaller the tree; smaller cp values allow more complex trees. Our goal is to prune the tree so that it is simple and yet still has low prediction error on new data. Conveniently, the default rpart() settings already perform 10-fold CV over a spectrum of cp values. The printcp() function pulls up the CV results, referred to as the cp table.

# CP is the minimal cp value that prunes the tree down to the corresponding nsplit
# nsplit is the total number of splits in the tree
# rel error is the sum of squared errors (SSE) relative to the root node (training error)
# xerror is the CV error, averaged across folds, relative to the root node (CV estimate of test error)
# xstd is the standard error of xerror
printcp(tree)
## 
## Regression tree:
## rpart(formula = Crime ~ ., data = uscrime, method = "anova")
## 
## Variables actually used in tree construction:
## [1] NW  Po1 Pop
## 
## Root node error: 6880928/47 = 146403
## 
## n= 47 
## 
##         CP nsplit rel error  xerror    xstd
## 1 0.362963      0   1.00000 1.03039 0.25491
## 2 0.148143      1   0.63704 0.89007 0.21494
## 3 0.051732      2   0.48889 0.90970 0.23934
## 4 0.010000      3   0.43716 0.88930 0.23466

As the tree gets more complex (the CP value decreases, allowing more splits), the rel error (training error) goes down, as expected. To make the table concrete, consider the second row: any cp greater than 0.1481432 (but smaller than 0.3629629) results in pruning to a tree with 1 split. The SSE of that tree is 0.63704 * 6880928 = 4383406, and its CV error is 0.89007 * 6880928 = 6124493. The default tree (row 4, using the default cp = 0.01) has 3 splits (4 terminal nodes), with an SSE of 0.43716 * 6880928 = 3008082, which can be verified with the following code.

# calculate the default tree's training error manually
root_node_error <- sum((uscrime$Crime-mean(uscrime$Crime))^2); root_node_error # SSE for root node
## [1] 6880928
pred = predict(tree, newdata = uscrime[,1:15]) # prediction with default tree
sum((uscrime$Crime-pred)^2) # SSE with default tree
## [1] 3008082
sum((uscrime$Crime-pred)^2)/root_node_error # relative error
## [1] 0.4371622
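The second row of the cp table can be checked the same way: pruning with any cp inside that interval (say 0.2) leaves a single split, and its relative training error should come out near 0.63704. A quick sketch, reusing root_node_error from above:

# verify row 2: prune with a cp inside (0.1481432, 0.3629629), e.g. 0.2, which leaves 1 split
tree_1split <- prune(tree, cp = 0.2)
sum((uscrime$Crime - predict(tree_1split, newdata = uscrime))^2) / root_node_error # should be ~0.63704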

To visualize the CV results, we can pull up the plot of the cp table by calling plotcp(). Note that the bottom X-axis shows the geometric mean of the maximum and minimum cp values that prune the tree to the size shown on the top X-axis, rather than the minimum cp in the cp table.

plotcp(tree) # plot xerror versus cp (as geometric mean) and tree size
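To see exactly which values end up on the bottom X-axis, we can recompute them from the cp table. As far as I can tell from how plotcp() labels the axis, each value is the geometric mean of a CP value with the one before it (the first is Inf); a quick sketch:

# reproduce the bottom X-axis of plotcp(): geometric means of adjacent CP values
cp0 <- tree$cptable[, "CP"]
sqrt(cp0 * c(Inf, cp0[-length(cp0)]))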

Continuing with pruning, we have two options: 1) if we want the lowest possible error in predicting new data, we can choose the cp value that gives the lowest CV error, or 2) if we are willing to trade some predictive power for a simpler tree, we can choose a cp whose CV error is within one standard error of the lowest CV error (the 1-SE rule). If we went with option 2, the root node (with no split) would be the simplest tree within 1-SE of the lowest CV error (the dotted horizontal line in the cp plot is min(xerror) + xstd). Since our tree is already quite simple, I will go with option 1. The lowest CV error is 0.8893049, so we choose cp = 0.01. It turns out the default tree already gives the lowest CV error, so we don't need to prune any further. For the sake of completeness, I document the methods for pruning the tree below.

# get the best cp value from the cp table
best_cp<-tree$cptable[which.min(tree$cptable[,"xerror"]),"CP"] 

# two equivalent methods, one just prunes the previous tree, the other builds the tree from scratch with the best_cp parameter
prune(tree, cp = best_cp)
rpart(formula=Crime~., data = uscrime, method = 'anova', control = rpart.control(cp = best_cp))
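For comparison, if we had gone with option 2, the 1-SE rule could also be applied programmatically; a sketch:

# 1-SE rule: simplest tree whose xerror is within one standard error of the minimum
cptab <- tree$cptable
one_se <- min(cptab[, "xerror"]) + cptab[which.min(cptab[, "xerror"]), "xstd"]
cptab[which(cptab[, "xerror"] <= one_se)[1], ] # here this is the root node (nsplit = 0)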

As we have seen, the default rpart() settings already apply some automatic pruning so that the tree does not overfit. Just to see what a full-sized tree would look like, we can use the following settings.

# no need to set seed as we are essentially doing LOOCV
# xval is the number of cross-validation folds, default = 10
# minsplit is the minimum number of observations that must exist in a node in order for a split to be attempted, default = 20
# minbucket is the minimum number of observations in any terminal node, default = round(minsplit/3)
full_tree <- rpart(formula=Crime~., data = uscrime, method = 'anova', control = rpart.control(xval = nrow(uscrime), minsplit = 2, minbucket = 1, cp = 0))
plotcp(full_tree)

As we can see, if we allow the tree to grow bigger, it clearly overfits (the cross-validated error increases).

Finally, variable importance is stored in $variable.importance of the tree object. The values are calculated by summing up all the improvement measures each variable contributes as either a primary or a surrogate splitter. So although only Po1, Pop, and NW were used for splitting, they are not necessarily the most important variables; Po2, for example, ranks just as high as Po1 without ever being used as a primary splitter.

tree$variable.importance
##       Po1       Po2    Wealth      Ineq      Prob         M        NW       Pop 
## 2497521.7 2497521.7 1628818.5 1602212.0 1520230.6 1388627.8 1245883.8  661770.6 
##      Time        Ed        LF        So 
##  601906.0  569545.9  203872.5  161800.8
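A quick way to see these numbers side by side is a sorted barplot:

# visualize the rpart variable importance values
barplot(sort(tree$variable.importance, decreasing = TRUE), las = 2, ylab = "importance")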

(b) random forest model

For random forests, we don't need cross-validation or a separate validation set to get an unbiased estimate of the test error; it is estimated internally during the run using the out-of-bag (OOB) samples. So let's first use all the data to build a random forest model with the default settings.

library(randomForest)
## Warning: package 'randomForest' was built under R version 4.1.3
set.seed(1)
rf<-randomForest(Crime~.,data=uscrime)
print(rf)
## 
## Call:
##  randomForest(formula = Crime ~ ., data = uscrime) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 5
## 
##           Mean of squared residuals: 87461.6
##                     % Var explained: 40.26

Running the default randomForest algorithm, we get a mean squared error (MSE) of 87461.6. This number can also be calculated manually from the OOB predictions using the formula \(MSE={\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y_i})^2}\), as shown below.

# passing the randomForest object back to the predict() function without specifying newdata retrieves the OOB prediction. Same as $predicted.
# predict(rf)==rf$predicted
sum((predict(rf)-uscrime$Crime)^2)/nrow(uscrime)
## [1] 87461.6
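The "% Var explained" line can be checked in the same spirit. As far as I know, randomForest reports it as 1 minus the OOB MSE divided by the variance of the response computed with a 1/n denominator, so the sketch below should land near the reported 40.26.

# rough check of "% Var explained" (variance of Crime with division by n rather than n-1)
oob_mse <- sum((predict(rf) - uscrime$Crime)^2) / nrow(uscrime)
100 * (1 - oob_mse / mean((uscrime$Crime - mean(uscrime$Crime))^2))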

To fine-tune a random forest model, there are two main parameters to consider: ntree, the number of trees to grow (default 500), and mtry, the number of variables randomly sampled as candidates at each split (for regression the default is floor(p/3), i.e. 5 for our 15 predictors).

Other parameters, such as nodesize (the minimum size of terminal nodes) and sampsize (the size of the bootstrap sample drawn for each tree), can also be tuned but will not be explored here.

The plot below for our rf model shows that the MSE has settled down well before ntree reaches the default 500. Since our data set is small and the computation is not taxing, I will not tune ntree and will keep the default ntree = 500. I will use the built-in tuneRF() function of the randomForest package to tune mtry.

plot(rf)

# tune mtry
set.seed(1)
tune_result <- tuneRF(x=uscrime[,1:15], y=uscrime$Crime, stepFactor=1.5, improve=0.01, plot=TRUE, ntreeTry=500)
best_mtry <- tune_result[which.min(tune_result[, 2]), 1] # tuneRF returns a matrix of (mtry, OOBError); keep the mtry with the lowest OOB error
## mtry = 5  OOB error = 87461.6 
## Searching left ...
## mtry = 4     OOB error = 86598.04 
## 0.00987364 0.01 
## Searching right ...
## mtry = 7     OOB error = 89205.88 
## -0.0199434 0.01
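tuneRF's stepwise search can also be cross-checked with a plain loop over mtry values; a sketch (the numbers will differ slightly from tuneRF's because the bootstrap samples differ):

# manual cross-check of mtry: OOB MSE for a range of values
oob_by_mtry <- sapply(2:8, function(m) {
  set.seed(1)
  tail(randomForest(Crime ~ ., data = uscrime, mtry = m, ntree = 500)$mse, 1) # OOB MSE after all 500 trees
})
names(oob_by_mtry) <- 2:8
oob_by_mtry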

So it looks like mtry = 4 decreased the model error (albeit not by much), and we will use mtry = 4 in our final model. To improve the interpretability of the final model, we set importance = TRUE and nperm = 2 to allow two permutations per tree for a more stable estimate of variable importance. Note that with the default importance = FALSE, importance is still calculated by summing the reduction in node impurity (RSS for regression) over all splits on each variable across all trees. This method inflates the importance of continuous or high-cardinality categorical variables, since they appear more often in the trees. Instead, we will use permutation importance, which measures the increase in the MSE of the OOB predictions when a variable's values are randomly permuted (shuffled); the higher the number, the more important the variable. For a more detailed discussion of the different importance measures, see https://explained.ai/rf-importance/, https://blog.methodsconsultants.com/posts/be-aware-of-bias-in-rf-variable-importance-metrics/, and https://stats.stackexchange.com/questions/162465/in-a-random-forest-is-larger-incmse-better-or-worse.

set.seed(1)
rf_final<-randomForest(Crime~., data=uscrime, mtry = best_mtry, importance=TRUE, nperm = 2)
print(rf_final)
## 
## Call:
##  randomForest(formula = Crime ~ ., data = uscrime, mtry = best_mtry,      importance = TRUE, nperm = 2) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 4
## 
##           Mean of squared residuals: 82393.69
##                     % Var explained: 43.72
# importance(rf_final,type = 1,scale = F) 
varImpPlot(rf_final,type = 1,scale = F) # type 1 for permutation importance

So our final model has an MSE of 82393.69. The most important variables are Po1 and Po2. However, since Po1 and Po2 are highly correlated, their importance values are likely overestimated (permutation importance tends to overestimate the importance of correlated predictor variables, see https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-307).
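As a quick check of how strong that relationship is in this data set:

# Po1 (per capita police expenditure in 1960) and Po2 (1959) are highly correlated
cor(uscrime$Po1, uscrime$Po2)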

(c) logistic regression model

credit <- read.csv("./data 10.3/germancredit.txt",sep = "", header = F)
# The response variable is 2 for bad and 1 for good. Make it 1 or 0.
credit$V21[credit$V21==2]<-0

set.seed(1)
# split data into 80% training and 20% test sets
subset <- sample(nrow(credit), nrow(credit) * 0.8)
train = credit[subset,]
test = credit[-subset,]

logistic0 <- glm(V21~., data=train, family=binomial)
summary(logistic0)
## 
## Call:
## glm(formula = V21 ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4936  -0.6539   0.3701   0.6739   2.3062  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.285e-01  1.213e+00  -0.353 0.723862    
## V1A12        5.192e-01  2.513e-01   2.066 0.038853 *  
## V1A13        1.086e+00  4.050e-01   2.683 0.007299 ** 
## V1A14        1.747e+00  2.610e-01   6.692  2.2e-11 ***
## V2          -2.320e-02  1.075e-02  -2.159 0.030863 *  
## V3A31       -1.185e-01  6.156e-01  -0.192 0.847383    
## V3A32        5.655e-01  4.749e-01   1.191 0.233780    
## V3A33        6.357e-01  5.149e-01   1.235 0.216911    
## V3A34        1.279e+00  4.892e-01   2.614 0.008942 ** 
## V4A41        1.641e+00  4.373e-01   3.752 0.000175 ***
## V4A410       2.342e+00  9.428e-01   2.484 0.012978 *  
## V4A42        5.484e-01  2.935e-01   1.868 0.061708 .  
## V4A43        8.895e-01  2.815e-01   3.160 0.001577 ** 
## V4A44        2.264e-02  8.176e-01   0.028 0.977911    
## V4A45        3.923e-01  6.285e-01   0.624 0.532501    
## V4A46        3.765e-01  4.728e-01   0.796 0.425829    
## V4A48        1.856e+00  1.231e+00   1.508 0.131532    
## V4A49        5.449e-01  3.789e-01   1.438 0.150424    
## V5          -1.445e-04  5.154e-05  -2.804 0.005042 ** 
## V6A62        4.251e-01  3.308e-01   1.285 0.198794    
## V6A63        3.655e-01  4.638e-01   0.788 0.430609    
## V6A64        1.350e+00  6.428e-01   2.100 0.035757 *  
## V6A65        9.423e-01  2.947e-01   3.198 0.001385 ** 
## V7A72       -5.373e-02  4.752e-01  -0.113 0.909975    
## V7A73        2.799e-01  4.534e-01   0.617 0.537032    
## V7A74        9.702e-01  4.975e-01   1.950 0.051145 .  
## V7A75        4.071e-01  4.617e-01   0.882 0.377863    
## V8          -3.813e-01  1.019e-01  -3.741 0.000184 ***
## V9A92        4.102e-01  4.293e-01   0.956 0.339274    
## V9A93        9.071e-01  4.222e-01   2.148 0.031685 *  
## V9A94        6.736e-01  5.145e-01   1.309 0.190488    
## V10A102     -2.366e-01  4.887e-01  -0.484 0.628236    
## V10A103      6.922e-01  4.841e-01   1.430 0.152717    
## V11         -6.719e-04  9.993e-02  -0.007 0.994635    
## V12A122     -2.490e-01  2.958e-01  -0.842 0.399945    
## V12A123     -3.054e-01  2.696e-01  -1.133 0.257280    
## V12A124     -9.621e-01  5.066e-01  -1.899 0.057565 .  
## V13          2.056e-02  1.053e-02   1.953 0.050804 .  
## V14A142      4.833e-01  4.780e-01   1.011 0.312059    
## V14A143      8.110e-01  2.704e-01   2.999 0.002710 ** 
## V15A152      5.009e-01  2.648e-01   1.891 0.058564 .  
## V15A153      5.570e-01  5.585e-01   0.997 0.318575    
## V16         -2.538e-01  2.111e-01  -1.202 0.229292    
## V17A172     -8.743e-01  7.797e-01  -1.121 0.262181    
## V17A173     -8.431e-01  7.503e-01  -1.124 0.261138    
## V17A174     -5.922e-01  7.590e-01  -0.780 0.435247    
## V18         -2.719e-01  2.855e-01  -0.953 0.340840    
## V19A192      2.006e-01  2.230e-01   0.900 0.368228    
## V20A202      1.522e+00  7.029e-01   2.166 0.030334 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 965.23  on 799  degrees of freedom
## Residual deviance: 701.43  on 751  degrees of freedom
## AIC: 799.43
## 
## Number of Fisher Scoring iterations: 5

The default logistic regression with all predictor variables returns a model with an AIC of 799.43. Many factors have large p-values, so let's see whether removing some of them improves the model. For this purpose, we can use backward elimination with AIC as the criterion.

# by default step() performs backward elimination using AIC as the criterion
logistic1 <- step(logistic0,trace = 0)
summary(logistic1)
## 
## Call:
## glm(formula = V21 ~ V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + 
##     V13 + V14 + V15 + V20, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.5108  -0.7173   0.3794   0.6979   2.2302  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.950e+00  9.212e-01  -2.116 0.034323 *  
## V1A12        6.056e-01  2.455e-01   2.467 0.013617 *  
## V1A13        1.163e+00  3.977e-01   2.923 0.003464 ** 
## V1A14        1.758e+00  2.571e-01   6.837 8.08e-12 ***
## V2          -2.449e-02  1.025e-02  -2.389 0.016912 *  
## V3A31        8.222e-02  5.906e-01   0.139 0.889277    
## V3A32        7.450e-01  4.553e-01   1.636 0.101769    
## V3A33        6.396e-01  5.116e-01   1.250 0.211242    
## V3A34        1.276e+00  4.840e-01   2.636 0.008395 ** 
## V4A41        1.620e+00  4.235e-01   3.824 0.000131 ***
## V4A410       2.387e+00  8.898e-01   2.682 0.007311 ** 
## V4A42        5.446e-01  2.854e-01   1.908 0.056419 .  
## V4A43        9.391e-01  2.761e-01   3.402 0.000670 ***
## V4A44        8.296e-02  7.950e-01   0.104 0.916896    
## V4A45        3.417e-01  6.138e-01   0.557 0.577690    
## V4A46        2.891e-01  4.649e-01   0.622 0.533994    
## V4A48        1.681e+00  1.206e+00   1.393 0.163475    
## V4A49        5.559e-01  3.716e-01   1.496 0.134643    
## V5          -1.349e-04  4.679e-05  -2.883 0.003936 ** 
## V6A62        2.388e-01  3.155e-01   0.757 0.449228    
## V6A63        3.751e-01  4.591e-01   0.817 0.413963    
## V6A64        1.260e+00  6.296e-01   2.002 0.045312 *  
## V6A65        9.099e-01  2.892e-01   3.147 0.001652 ** 
## V7A72       -2.794e-01  4.227e-01  -0.661 0.508551    
## V7A73        9.414e-02  3.894e-01   0.242 0.808986    
## V7A74        7.778e-01  4.424e-01   1.758 0.078687 .  
## V7A75        1.703e-01  4.108e-01   0.415 0.678445    
## V8          -3.830e-01  9.771e-02  -3.920 8.86e-05 ***
## V9A92        4.107e-01  4.179e-01   0.983 0.325697    
## V9A93        8.401e-01  4.100e-01   2.049 0.040443 *  
## V9A94        7.753e-01  5.021e-01   1.544 0.122586    
## V13          2.091e-02  1.009e-02   2.073 0.038137 *  
## V14A142      4.335e-01  4.728e-01   0.917 0.359213    
## V14A143      8.065e-01  2.642e-01   3.052 0.002271 ** 
## V15A152      5.127e-01  2.505e-01   2.047 0.040688 *  
## V15A153     -9.796e-02  3.819e-01  -0.257 0.797535    
## V20A202      1.613e+00  6.999e-01   2.305 0.021174 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 965.23  on 799  degrees of freedom
## Residual deviance: 712.78  on 763  degrees of freedom
## AIC: 786.78
## 
## Number of Fisher Scoring iterations: 5

By dropping V10, V11, V12, V16, V17, V18, and V19, we slightly improved the model's AIC to 786.78. The estimated coefficients of the remaining factors are shown above. For a continuous variable such as V2, a one-unit increase in V2 is associated with a change of -0.024495 in the log odds of V21; equivalently, the odds of classifying a customer as 'good' are multiplied by a factor of exp(-0.024495) = 0.9758026, so the odds actually decrease. For a categorical variable, one baseline category is omitted from the table. For example, A11 is the baseline category for V1; if a customer is an A12, the log odds of V21 increase by 0.6056022 relative to A11, i.e., the odds of being classified as 'good' increase by a factor of exp(0.6056022) = 1.8323552.
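These log-odds-to-odds conversions can be pulled straight from the fitted model; a small sketch that reproduces the two examples above:

# odds ratios for the two coefficients discussed above
exp(coef(logistic1)[c("V2", "V1A12")])
# exp(coef(logistic1))            # odds ratios for every term
# exp(confint.default(logistic1)) # Wald confidence intervals on the odds scale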

Let’s see how our models perform on the test set using 50% as the threshold for classification.

pred0 <- predict(logistic0, newdata = test, type = 'response') # predict the test set, the output is the probability
threshold = 0.5
pred0_class<-ifelse(pred0 > threshold, 1, 0) # classify the response variable based on the threshold 
table(test$V21, pred0_class, dnn = c("Observed", "Predicted")) # confusion matrix
##         Predicted
## Observed   0   1
##        0  31  36
##        1  16 117
accuracy0<-mean(test$V21==pred0_class); accuracy0
## [1] 0.74
pred1 <- predict(logistic1, newdata = test, type = 'response') # predict the test set, the output is the probability
threshold = 0.5
pred1_class<-ifelse(pred1 > threshold, 1, 0) # classify the response variable based on the threshold 
table(test$V21, pred1_class, dnn = c("Observed", "Predicted")) # confusion matrix
##         Predicted
## Observed   0   1
##        0  34  33
##        1  17 116
accuracy1<-mean(test$V21==pred1_class); accuracy1
## [1] 0.75

Using 50% as the classification threshold, the model after variable selection (logistic1) performs similarly on the test set to the default model with all variables (logistic0). To compare their performance across the full range of possible thresholds, we construct ROC curves.

library(pROC)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
roc0<-roc(test$V21 ~ pred0); roc0$auc
## Area under the curve: 0.7962
roc1<-roc(test$V21 ~ pred1); roc1$auc
## Area under the curve: 0.7885
# simpler plot
# plot(roc0)  
# plot(roc1, add=TRUE, col='red')
ggroc(list(logistic0 = roc0, logistic1 = roc1), size = 2) +
  theme_minimal()

The ROC curves for logistic0 and logistic1 are very similar too, with AUC values of 0.7962 and 0.7885 respectively. But because logistic1 is simpler than logistic0, it is preferable.
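If we wanted a formal comparison rather than eyeballing the two curves, pROC's roc.test() (DeLong's test by default for paired ROC curves) could be used; a sketch:

# test whether the difference in AUC between the two (paired) ROC curves is significant
roc.test(roc0, roc1)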

Because a false positive (a bad customer classified as good) costs 5 times as much as a false negative, we could set the threshold at 5/(5+1) = 5/6, the cost-minimizing cutoff for a 5:1 cost ratio if the predicted probabilities are well calibrated. A more empirical approach is a grid search over thresholds.

# grid search for optimal threshold
# define the search grid from 0.01 to 0.99
grid = seq(0.01, 0.99, 0.01)

# create a matrix to store the cost for each corresponding threshold to be tested
result = cbind(grid, NA)

# cost function calculates the average cost for vectors of observed and predicted values at a given threshold
cost <- function(observed, predicted, thresh) {
  weight1 = 1 # weight for false negative (good customer classified as bad)
  weight0 = 5 # weight for false positive (bad customer classified as good)
  c1 = (observed == 1) & (predicted < thresh)  # logical vector for false negatives
  c0 = (observed == 0) & (predicted > thresh)  # logical vector for false positives
  return(mean(weight1 * c1 + weight0 * c0))
}

# loop over all threshold values to calculate the corresponding costs
for (i in 1:length(grid)) { 
  result[i, 2] <- cost(test$V21, pred1, thresh = result[i, 1]) # assign cost to the second column of result
}
plot(result, ylab = 'Cost', xlab = 'Threshold')

#find the optimal threshold
grid[which.min(result[,2])]
## [1] 0.91

The grid search provides an optimal threshold of 0.91, which is actually not far from our initial guess 5/6 = 0.8333333.
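To make that comparison concrete, we can compute the expected cost per customer on the test set at 0.5, at 5/6, and at the grid optimum, using the same 5:1 weighting as the cost function above; a sketch:

# expected cost per customer at a few thresholds of interest
sapply(c(0.5, 5/6, 0.91), function(t) {
  fn <- (test$V21 == 1) & (pred1 < t) # good customer classified as bad (weight 1)
  fp <- (test$V21 == 0) & (pred1 > t) # bad customer classified as good (weight 5)
  mean(1 * fn + 5 * fp)
})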

References:
For tree models:
https://bookdown.org/mpfoley1973/data-sci/regression-trees.html
http://ncss-tech.github.io/stats_for_soil_survey/chapters/8_Tree_models/treemodels.html
https://uc-r.github.io/regression_trees
For logistic regression:
https://rpubs.com/AnanyaDu/358597