In Bagged Trees or Random Forest models, the trees are developed independently, each from a random sample of rows and columns of the dataset. The main difference in Gradient Boosting Trees is that the trees are developed sequentially, and each tree model is built upon the errors (residuals) of the previous tree models. The sequential process of model building and prediction in gradient-boosted trees can be demonstrated conceptually below.
Let's try to implement this idea on a toy dataset in which we predict a readability score from two features (V220 and V166).
readability_sub <- read.csv('./data/readability_sub.csv',header=TRUE)
readability_sub[,c('V220','V166','target')]
V220 V166 target
1 -0.13908258 0.19028091 -2.06282395
2 0.21764143 0.07101288 0.58258607
3 0.05812133 0.03993277 -1.65313060
4 0.02526429 0.18845809 -0.87390681
5 0.22430885 0.06200715 -1.74049148
6 -0.07795373 0.10754109 -3.63993555
7 0.43400714 0.12202360 -0.62284268
8 -0.24364550 0.02454670 -0.34426981
9 0.15893717 0.10422343 -1.12298826
10 0.14496475 0.02339597 -0.99857142
11 0.34222975 0.22065343 -0.87656742
12 0.25219145 0.10865010 -0.03304643
13 0.03532625 0.07549474 -0.49529863
14 0.36410633 0.18675801 0.12453660
15 0.29988593 0.11618323 0.09678258
16 0.19837037 0.08272671 0.38422270
17 0.07807041 0.10235218 -0.58143038
18 0.07935690 0.11618605 -0.34324576
19 0.57000953 -0.02385423 -0.39054205
20 0.34523284 0.09299514 -0.67548411
We start with a simple model that uses the average target outcome to predict the readability for all observations in this toy dataset. We calculate the predictions and residuals from this initial intercept-only model.
readability_sub$pred0 <- mean(readability_sub$target)
readability_sub$res0 <- readability_sub$target - readability_sub$pred0
round(readability_sub[,c('V166','V220','target','pred0','res0')],3)
V166 V220 target pred0 res0
1 0.190 -0.139 -2.063 -0.763 -1.300
2 0.071 0.218 0.583 -0.763 1.346
3 0.040 0.058 -1.653 -0.763 -0.890
4 0.188 0.025 -0.874 -0.763 -0.111
5 0.062 0.224 -1.740 -0.763 -0.977
6 0.108 -0.078 -3.640 -0.763 -2.877
7 0.122 0.434 -0.623 -0.763 0.140
8 0.025 -0.244 -0.344 -0.763 0.419
9 0.104 0.159 -1.123 -0.763 -0.360
10 0.023 0.145 -0.999 -0.763 -0.235
11 0.221 0.342 -0.877 -0.763 -0.113
12 0.109 0.252 -0.033 -0.763 0.730
13 0.075 0.035 -0.495 -0.763 0.268
14 0.187 0.364 0.125 -0.763 0.888
15 0.116 0.300 0.097 -0.763 0.860
16 0.083 0.198 0.384 -0.763 1.148
17 0.102 0.078 -0.581 -0.763 0.182
18 0.116 0.079 -0.343 -0.763 0.420
19 -0.024 0.570 -0.391 -0.763 0.373
20 0.093 0.345 -0.675 -0.763 0.088
# SSE at the end of Iteration 0
sum(readability_sub$res0^2)
[1] 17.73309
Now, we fit a tree model to predict the residuals of Iteration 0 from the two predictors (V220 and V166). Notice that we fix the values of specific parameters while fitting the tree model (e.g., cp, minsplit, maxdepth).
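The fitting call for this first model is not shown above; a minimal sketch, assuming the same rpart settings used for Model 2 later in this section (the exact parameter values are an assumption), would look like this:
require(rpart)
require(rattle)
# Fit a shallow tree to the Iteration 0 residuals (settings assumed to mirror Model 2)
model1 <- rpart(formula = res0 ~ V166 + V220,
                data = readability_sub,
                method = "anova",
                control = list(minsplit = 2,
                               cp = 0,
                               minbucket = 2,
                               maxdepth = 2))
fancyRpartPlot(model1, type = 2, sub = '')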
Let’s see the predictions of residuals from Model 1.
pr.res <- predict(model1, readability_sub)
pr.res
1 2 3 4 5 6
-1.2523541 0.4220391 -0.4323615 0.4220391 -0.4323615 -1.2523541
7 8 9 10 11 12
0.4220391 -1.2523541 0.4220391 -0.4323615 0.4220391 0.4220391
13 14 15 16 17 18
0.4220391 0.4220391 0.4220391 0.4220391 0.4220391 0.4220391
19 20
-0.4323615 0.4220391
Now, let’s add the predicted residuals from Iteration 1 to the predictions from Iteration 0 to obtain the new predictions.
readability_sub$pred1 <- readability_sub$pred0 + pr.res
readability_sub$res1 <- readability_sub$target - readability_sub$pred1
round(readability_sub[,c('V166','V220','target','pred0','res0','pred1','res1')],3)
V166 V220 target pred0 res0 pred1 res1
1 0.190 -0.139 -2.063 -0.763 -1.300 -2.016 -0.047
2 0.071 0.218 0.583 -0.763 1.346 -0.341 0.924
3 0.040 0.058 -1.653 -0.763 -0.890 -1.196 -0.457
4 0.188 0.025 -0.874 -0.763 -0.111 -0.341 -0.533
5 0.062 0.224 -1.740 -0.763 -0.977 -1.196 -0.545
6 0.108 -0.078 -3.640 -0.763 -2.877 -2.016 -1.624
7 0.122 0.434 -0.623 -0.763 0.140 -0.341 -0.282
8 0.025 -0.244 -0.344 -0.763 0.419 -2.016 1.671
9 0.104 0.159 -1.123 -0.763 -0.360 -0.341 -0.782
10 0.023 0.145 -0.999 -0.763 -0.235 -1.196 0.197
11 0.221 0.342 -0.877 -0.763 -0.113 -0.341 -0.535
12 0.109 0.252 -0.033 -0.763 0.730 -0.341 0.308
13 0.075 0.035 -0.495 -0.763 0.268 -0.341 -0.154
14 0.187 0.364 0.125 -0.763 0.888 -0.341 0.466
15 0.116 0.300 0.097 -0.763 0.860 -0.341 0.438
16 0.083 0.198 0.384 -0.763 1.148 -0.341 0.726
17 0.102 0.078 -0.581 -0.763 0.182 -0.341 -0.240
18 0.116 0.079 -0.343 -0.763 0.420 -0.341 -0.002
19 -0.024 0.570 -0.391 -0.763 0.373 -1.196 0.805
20 0.093 0.345 -0.675 -0.763 0.088 -0.341 -0.334
# SSE at the end of Iteration 1
sum(readability_sub$res1^2)
[1] 9.964654
We repeat Iteration 1, but the only difference is that we now fit a tree model to predict the residuals at the end of Iteration 1.
model2 <- rpart(formula = res1 ~ V166 + V220,
data = readability_sub,
method = "anova",
control = list(minsplit=2,
cp=0,
minbucket = 2,
maxdepth = 2)
)
fancyRpartPlot(model2,type=2,sub='')
Let’s see the predictions of residuals from Model 2.
pr.res <- predict(model2, readability_sub)
pr.res
1 2 3 4 5 6
-0.4799134 0.1295162 -0.4799134 -0.4799134 0.1295162 -0.4799134
7 8 9 10 11 12
0.1295162 0.8912203 -0.4799134 0.8912203 0.1295162 0.1295162
13 14 15 16 17 18
-0.4799134 0.1295162 0.1295162 0.1295162 -0.4799134 -0.4799134
19 20
0.8912203 0.1295162
Now, add the predicted residuals from Iteration 2 to the predictions from Iteration 1 to obtain the new predictions.
readability_sub$pred2 <- readability_sub$pred1 + pr.res
readability_sub$res2 <- readability_sub$target - readability_sub$pred2
round(readability_sub[,c('V166','V220','target',
'pred0','res0','pred1','res1',
'pred2','res2')],3)
V166 V220 target pred0 res0 pred1 res1 pred2 res2
1 0.190 -0.139 -2.063 -0.763 -1.300 -2.016 -0.047 -2.496 0.433
2 0.071 0.218 0.583 -0.763 1.346 -0.341 0.924 -0.212 0.794
3 0.040 0.058 -1.653 -0.763 -0.890 -1.196 -0.457 -1.676 0.022
4 0.188 0.025 -0.874 -0.763 -0.111 -0.341 -0.533 -0.821 -0.053
5 0.062 0.224 -1.740 -0.763 -0.977 -1.196 -0.545 -1.066 -0.674
6 0.108 -0.078 -3.640 -0.763 -2.877 -2.016 -1.624 -2.496 -1.144
7 0.122 0.434 -0.623 -0.763 0.140 -0.341 -0.282 -0.212 -0.411
8 0.025 -0.244 -0.344 -0.763 0.419 -2.016 1.671 -1.124 0.780
9 0.104 0.159 -1.123 -0.763 -0.360 -0.341 -0.782 -0.821 -0.302
10 0.023 0.145 -0.999 -0.763 -0.235 -1.196 0.197 -0.304 -0.694
11 0.221 0.342 -0.877 -0.763 -0.113 -0.341 -0.535 -0.212 -0.665
12 0.109 0.252 -0.033 -0.763 0.730 -0.341 0.308 -0.212 0.179
13 0.075 0.035 -0.495 -0.763 0.268 -0.341 -0.154 -0.821 0.326
14 0.187 0.364 0.125 -0.763 0.888 -0.341 0.466 -0.212 0.336
15 0.116 0.300 0.097 -0.763 0.860 -0.341 0.438 -0.212 0.309
16 0.083 0.198 0.384 -0.763 1.148 -0.341 0.726 -0.212 0.596
17 0.102 0.078 -0.581 -0.763 0.182 -0.341 -0.240 -0.821 0.240
18 0.116 0.079 -0.343 -0.763 0.420 -0.341 -0.002 -0.821 0.478
19 -0.024 0.570 -0.391 -0.763 0.373 -1.196 0.805 -0.304 -0.086
20 0.093 0.345 -0.675 -0.763 0.088 -0.341 -0.334 -0.212 -0.464
# SSE at the end of Iteration 2
sum(readability_sub$res2^2)
[1] 5.588329
We can keep iterating and adding tree models as long as each new tree improves our predictions (reduces the SSE).
Let \(\mathbf{x}_i = (x_{i1},x_{i2},x_{i3},...,x_{ij})\) represent a vector of observed values for the \(i^{th}\) observation on \(j\) predictor variables, and \(y_i\) is the value of the target outcome for the \(i^{th}\) observation. A gradient-boosted tree model is an ensemble of \(T\) different tree models sequentially developed, and the final prediction of the outcome is obtained by using an additive function as
\[ \hat{y_i} = \sum_{t=1}^{T}f_t(\mathbf{x}_i),\]
where \(f_t\) is a tree model obtained at Iteration \(t\) from the residuals at Iteration \(t-1\).
The algorithm optimizes an objective function \(\mathfrak{L}(\mathbf{y},\mathbf{\hat{y}})\) in an additive manner. This objective loss function can be defined as the sum of squared errors when the outcome is continuous or logistic loss when the outcome is categorical.
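For reference, these two loss functions can be written as follows (using \(N\) for the number of observations). For a continuous outcome, the sum of squared errors is
\[\mathfrak{L}(\mathbf{y},\mathbf{\hat{y}}) = \sum_{i=1}^{N}(y_i - \hat{y}_i)^2,\]
and for a binary outcome, the logistic loss is
\[\mathfrak{L}(\mathbf{y},\mathbf{\hat{y}}) = -\sum_{i=1}^{N}\left[y_i \ln(\hat{p}_i) + (1-y_i)\ln(1-\hat{p}_i)\right],\]
where \(\hat{p}_i\) is the predicted probability for the \(i^{th}\) observation.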
The algorithm starts with a constant prediction. For instance, we start with the average outcome in the above example. Then, a new tree model that minimizes the objective loss function is searched and added at each iteration.
\[\hat{y}_{i}^{(0)} = \bar{y}\] \[\hat{y}_{i}^{(1)} = \hat{y}_{i}^{(0)} + \alpha f_1(\mathbf{x}_i)\]
\[\hat{y}_{i}^{(2)} = \hat{y}_{i}^{(1)} + \alpha f_2(\mathbf{x}_i)\] \[.\] \[.\] \[.\] \[\hat{y}_{i}^{(t)} = \hat{y}_{i}^{(t-1)} + \alpha f_t(\mathbf{x}_i)\]
Notice that I added a multiplier, \(\alpha\), while adding the new predictions at each iteration. In the above example, we fixed this multiplier at 1, \(\alpha = 1\), as we added the whole new prediction to the previous prediction. In the machine learning literature, this multiplier is called the learning rate. We could also choose to add only a fraction of the new predictions (e.g., \(\alpha = 0.1, 0.05, 0.01, 0.001\)) at each iteration.
The smaller the learning rate, the more iterations (more tree models) we will need to achieve the same level of performance. So, the number of iterations (the number of tree models, \(T\)) and the learning rate (\(\alpha\)) work in tandem. These two parameters are known as the boosting hyperparameters and need to be tuned.
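As a quick illustration of adding only a fraction of the new predictions, and assuming the model1 object sketched earlier, the Iteration 1 update with \(\alpha = 0.1\) would look like this:
# Iteration 1 update with a learning rate of 0.1 instead of 1
alpha <- 0.1
pred1_small <- readability_sub$pred0 + alpha*predict(model1, readability_sub)
# The SSE drops more slowly than with alpha = 1, so more trees would be needed
sum((readability_sub$target - pred1_small)^2)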
Think of choosing a learning rate as choosing your speed on a highway and the number of trees as the time it takes to arrive at your destination. Suppose you are traveling from Eugene to Portland on I-5. If you drive 40 miles/hour, you are less likely to be involved in an accident because you are more aware of your surroundings, but it will take 3-4 hours to arrive at your destination. If you drive 200 miles/hour, it will take only an hour, assuming you do not have an accident on the way (which is very likely). So, you try to find a speed that is fast enough to get you there in a reasonable time and safe enough to avoid an accident.
TECHNICAL NOTE
Why do people call it Gradient Boosting? It turns out that the updates at each iteration, based on the residuals from the previous model, are related to the concept of the negative gradient (first derivative) of the objective loss function with respect to the predicted values from the previous step:
\[-g_i^{(t)} = -\frac{\partial \mathfrak{L}(y_i,\hat{y}_i^{(t-1)})}{\partial \hat{y}_i^{(t-1)}} \approx \hat{y}_{i}^{(t)} - \hat{y}_{i}^{(t-1)} \]
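For instance, with the squared-error loss \(\mathfrak{L}(y_i,\hat{y}_i^{(t-1)}) = \frac{1}{2}(y_i - \hat{y}_i^{(t-1)})^2\), the negative gradient is simply the residual,
\[-g_i^{(t)} = y_i - \hat{y}_i^{(t-1)},\]
which is exactly what we fit the tree models to in the toy example above.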
The general logic of gradient boosting works as follows:
1. Take a differentiable loss function, \(\mathfrak{L}(\mathbf{y},\mathbf{\hat{y}})\), that summarizes the distance between the observed and predicted values.
2. Start with an initial model to obtain initial predictions, \(f_0(\mathbf{x}_i)\).
3. Iterate until termination:
   - calculate the negative gradients of the loss function with respect to the predictions from the previous step,
   - fit a tree model to the negative gradients,
   - update the predictions (with a multiplier, a.k.a. the learning rate).
Most software uses mathematical approximations and computational hocus pocus to do these computations for faster implementation.
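To make this logic concrete, here is a minimal sketch of the boosting loop for the squared-error loss using rpart on the toy data. The number of iterations, learning rate, and tree settings are assumptions chosen for illustration, not the exact routine used by any package:
# A bare-bones gradient boosting loop for a continuous outcome (squared-error loss)
require(rpart)

n_iter <- 50          # number of trees (assumed)
alpha  <- 0.1         # learning rate (assumed)

# Initial prediction: the mean of the outcome
pred  <- rep(mean(readability_sub$target), nrow(readability_sub))
trees <- vector('list', n_iter)

for(t in 1:n_iter){
  # Negative gradient of the squared-error loss = current residuals
  readability_sub$resid <- readability_sub$target - pred
  # Fit a shallow tree to the negative gradients
  trees[[t]] <- rpart(resid ~ V166 + V220,
                      data = readability_sub,
                      method = 'anova',
                      control = list(minsplit = 2, cp = 0,
                                     minbucket = 2, maxdepth = 2))
  # Update the predictions with the learning rate
  pred <- pred + alpha * predict(trees[[t]], readability_sub)
}

# SSE after n_iter iterations
sum((readability_sub$target - pred)^2)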
gbm package
Gradient boosting trees can be fitted using the gbm function from the gbm package. The code below tries to replicate our example above using the toy dataset. The main arguments of gbm are listed below.
Model and Data:
- formula, a description of the outcome and predictor variables in the model using column names
- data, the name of the data object in which to look for the variables in the formula statement
- distribution, a character string specifying the type of objective loss function to optimize; 'gaussian' is typically used for continuous outcomes (minimizes the squared error), and 'bernoulli' is typically used for binary outcomes (minimizes the logistic loss)
Hyperparameters:
- n.trees, the number of trees to fit (the number of iterations)
- shrinkage, the learning rate
- interaction.depth, the maximum depth of each tree developed at each iteration
- n.minobsinnode, the minimum number of observations in each terminal node of the tree models at each iteration
Stochastic Gradient Boosting:
- bag.fraction, the proportion of observations to be randomly selected for developing a new tree at each iteration.
In Gradient Boosting Trees, we use all observations (100% of the rows) when we develop a new tree model at each iteration. So, we can set bag.fraction=1, and gbm fits a gradient boosting tree model. On the other hand, adding a random component may help yield better performance. You can think of this as a marriage of Bagging and Boosting. So, we may want to take a random sample of observations to develop a tree model at each iteration. For instance, if you set bag.fraction=.9, the algorithm will randomly sample 90% of the observations at each iteration before fitting the new tree model to the residuals from the previous step. When bag.fraction is lower than 1, this is called Stochastic Gradient Boosting Trees. bag.fraction can also be considered a hyperparameter to tune by trying different values to find an optimal one, or it can be fixed to a certain value.
Cross-validation:
- cv.folds, the number of cross-validation folds to perform.
Parallel Processing:
- n.cores, the number of CPU cores to use.
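The original fitting code is not reproduced here; a minimal sketch of a gbm call that mirrors the hand-worked example above (two trees, a learning rate of 1, and the same tree-size settings, all assumed values) would look like this:
require(gbm)

# Fit a gradient boosting tree model to the toy data
# (settings are assumptions chosen to mirror the two hand-worked iterations above)
gbm.model <- gbm(formula = target ~ V220 + V166,
                 data = readability_sub,
                 distribution = 'gaussian',
                 n.trees = 2,
                 shrinkage = 1,
                 interaction.depth = 2,
                 n.minobsinnode = 2,
                 bag.fraction = 1,
                 cv.folds = 0,
                 n.cores = 1)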
# Obtain predictions from the model
predict(gbm.model)
[1] -2.4955898 -0.2117671 -1.6755972 -0.8211966 -1.0661677 -2.4955898
[7] -0.2117671 -1.1244561 -0.8211966 -0.3044636 -0.2117671 -0.2117671
[13] -0.8211966 -0.2117671 -0.2117671 -0.2117671 -0.8211966 -0.8211966
[19] -0.3044636 -0.2117671
# Plot the final model
plot(gbm.model,i.var = 1)
plot(gbm.model,i.var = 2)
caret package and Hyperparameter Tuning
The gbm algorithm is available in the caret package. Let's check the hyperparameters available to tune.
require(caret)
getModelInfo()$gbm$parameters
parameter class label
1 n.trees numeric # Boosting Iterations
2 interaction.depth numeric Max Tree Depth
3 shrinkage numeric Shrinkage
4 n.minobsinnode numeric Min. Terminal Node Size
The four most critical parameters are all available to tune. It is very challenging to find the best combination of values for all these four hyperparameters unless you implement a full grid search which may take a very long time. You may apply a general sub-optimal strategy to tune the hyperparameters step by step, either in pairs or one by one. Below is one way to implement such a strategy:
1. Fix interaction.depth and n.minobsinnode at certain values (e.g., interaction.depth = 5, n.minobsinnode = 10).
2. Pick a small value of learning rate (shrinkage), such as 0.05 or 0.1.
3. Do a grid search and find the optimal number of trees (n.trees) using the fixed values at #1 and #2.
4. Fix n.trees at its optimal value from #3, keep shrinkage the same as in #2, and do a two-dimensional grid search for interaction.depth and n.minobsinnode to find the optimal depth and minimum number of observations in a terminal node.
5. Fix interaction.depth and n.minobsinnode at their optimal values from #4, then lower the learning rate and increase the number of trees to see if the model performance can be further improved.
6. Fix interaction.depth, n.minobsinnode, shrinkage, and n.trees at their optimal values from the previous steps, and do a grid search for bag.fraction.
At the link below, you will find an interactive app you can play with to understand the dynamics among these hyperparameters and optimize them in toy examples.
http://arogozhnikov.github.io/2016/07/05/gradient_boosting_playground.html
First, we import and prepare data for modeling. Then, we split the data into training and test pieces.
require(here)
require(recipes)
require(caret)
# Import the dataset
readability <- read.csv(here('data/readability_features.csv'),header=TRUE)
# Write the recipe
blueprint_readability <- recipe(x = readability,
vars = colnames(readability),
roles = c(rep('predictor',768),'outcome')) %>%
step_zv(all_numeric()) %>%
step_nzv(all_numeric()) %>%
step_normalize(all_numeric_predictors())
# Train/Test Split
set.seed(10152021) # for reproducibility
loc <- sample(1:nrow(readability), round(nrow(readability) * 0.9))
read_tr <- readability[loc, ]
read_te <- readability[-loc, ]
dim(read_tr)
dim(read_te)
[1] 2551 769
[1] 283 769
Prepare the data partitions for 10-fold cross validation.
# Cross validation settings
read_tr = read_tr[sample(nrow(read_tr)),]
# Create 10 folds with equal size
folds = cut(seq(1,nrow(read_tr)),breaks=10,labels=FALSE)
# Create the list for each fold
my.indices <- vector('list',10)
for(i in 1:10){
my.indices[[i]] <- which(folds!=i)
}
cv <- trainControl(method = "cv",
index = my.indices)
Set up multiple cores for parallel processing.
require(doParallel)
ncores <- 10
cl <- makePSOCKcluster(ncores)
registerDoParallel(cl)
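Once all of the models below have been fitted, the cluster can be released (a small housekeeping step, not shown in the original code):
# Release the CPU cores when the parallel work is done
stopCluster(cl)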
Step 1: Tune the number of trees
Now, we will fix the learning rate at 0.1 (shrinkage=0.1), the interaction depth at 5 (interaction.depth=5), and the minimum number of observations at 10 (n.minobsinnode = 10). We will do a grid search for the number of trees from 1 to 500 (n.trees = 1:500). Note that I fixed the bag fraction at 1 and passed it as an argument to the caret::train function because it is not allowed in the hyperparameter grid.
# Grid Settings
grid <- expand.grid(shrinkage = 0.1,
n.trees = 1:500,
interaction.depth = 5,
n.minobsinnode = 10)
gbm1 <- caret::train(blueprint_readability,
data = read_tr,
method = 'gbm',
trControl = cv,
tuneGrid = grid,
bag.fraction = 1,
verbose = FALSE)
gbm1$times
$everything
user system elapsed
65.00 1.59 160.42
$final
user system elapsed
52.47 0.14 52.61
$prediction
[1] NA NA NA
It took about 3 minutes to run. We can now look at the plot and examine how the cross-validated RMSE changes as a function of the number of trees.
plot(gbm1,type='l')
It indicates that there is not much improvement after 300 trees with these settings (this is just eyeballing; there is no formal rule for choosing this number). So, I will fix the number of trees at 300 for the next step.
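If you prefer something less ad hoc than eyeballing the plot, one possible rule (a sketch, not part of the original analysis) is to pick the smallest number of trees whose cross-validated RMSE is within 1% of the minimum:
# Smallest n.trees with a cross-validated RMSE within 1% of the best RMSE
cv.rmse   <- gbm1$results$RMSE
best.rmse <- min(cv.rmse, na.rm = TRUE)
min(gbm1$results$n.trees[cv.rmse <= 1.01*best.rmse])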
Step 2: Tune the interaction depth and minimum number of observations
Now, we will fix the number of trees at 300 (n.trees = 300) and the learning rate at 0.1 (shrinkage=0.1).
Then, we will do a grid search by assigning values for the interaction depth from 1 to 15 and values for the minimum number of observations at 5, 10, 20, 30, 40, and 50. We still keep the bag fraction as 1.
grid <- expand.grid(shrinkage = 0.1,
n.trees = 300,
interaction.depth = 1:15,
n.minobsinnode = c(5,10,20,30,40,50))
gbm2 <- caret::train(blueprint_readability,
data = read_tr,
method = 'gbm',
trControl = cv,
tuneGrid = grid,
bag.fraction = 1,
verbose = FALSE)
gbm2$times
$everything
user system elapsed
87.53 1.55 7433.68
$final
user system elapsed
75.17 0.14 75.30
$prediction
[1] NA NA NA
This search took about two hours. If we look at the cross-validated RMSE for all 90 possible conditions, we see that the best result comes out when the interaction depth is equal to 9 and the minimum number of observations is equal to 50.
plot(gbm2,type='l')
gbm2$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
54 300 9 0.1 50
gbm2$results[which.min(gbm2$results$RMSE),]
shrinkage interaction.depth n.minobsinnode n.trees RMSE
54 0.1 9 50 300 0.5759197
Rsquared MAE RMSESD RsquaredSD MAESD
54 0.6904758 0.4582852 0.02670278 0.03351747 0.02441325
Step 3: Lower the learning rate and increase the number of trees
Now, we will fix the interaction depth at 9 (interaction.depth = 9) and the minimum number of observations at 50 (n.minobsinnode = 50). We will lower the learning rate to 0.01 (shrinkage=0.01) and increase the number of trees to 8000 (n.trees = 1:8000) to explore if a lower learning rate improves the performance.
grid <- expand.grid(shrinkage = 0.01,
n.trees = 1:8000,
interaction.depth = 9,
n.minobsinnode = 50)
gbm3 <- caret::train(blueprint_readability,
data = read_tr,
method = 'gbm',
trControl = cv,
tuneGrid = grid,
bag.fraction = 1,
verbose= FALSE)
gbm3$times
$everything
user system elapsed
1703.83 0.94 3821.78
$final
user system elapsed
1689.69 0.09 1719.50
$prediction
[1] NA NA NA
plot(gbm3,type='l')
gbm3$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
7948 7948 9 0.01 50
gbm3$results[which.min(gbm3$results$RMSE),]
shrinkage interaction.depth n.minobsinnode n.trees RMSE
7948 0.01 9 50 7948 0.568526
Rsquared MAE RMSESD RsquaredSD MAESD
7948 0.6990292 0.453076 0.02218744 0.02806958 0.01829231
This run took about another hour. The best performance was obtained with a model of 7948 trees and yielded an RMSE value of 0.5685. We can stop here and decide that this is our final model, or we can play with bag.fraction and see if we can improve the performance a little more.
Step 4: Tune Bag Fraction
To play with bag.fraction, we should write our own syntax, as caret::train does not allow it to be manipulated as a hyperparameter. Notice that I fixed the values of shrinkage, n.trees, interaction.depth, and n.minobsinnode at their optimal values. Then, I write a for loop to iterate over different values of bag.fraction from 0.1 to 1 with increments of 0.05. I save the model object from each iteration in a list object.
grid <- expand.grid(shrinkage = 0.01,
n.trees = 7948,
interaction.depth = 9,
n.minobsinnode = 50)
bag.fr <- seq(0.1,1,.05)
my.models <- vector('list',length(bag.fr))
for(i in 1:length(bag.fr)){
my.models[[i]] <- caret::train(blueprint_readability,
data = read_tr,
method = 'gbm',
trControl = cv,
tuneGrid = grid,
bag.fraction = bag.fr[i],
verbose= FALSE)
}
It took about 17 hours to complete with ten cores. Let’s check if it improved the performance.
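The check itself is not reproduced here; a minimal sketch of how the cross-validated RMSE could be extracted and plotted for each bag.fraction value (analogous to what is done for the Recidivism example below) is:
# Extract the cross-validated RMSE for each value of bag.fraction
cv.rmse <- c()
for(i in 1:length(bag.fr)){
  cv.rmse[i] <- my.models[[i]]$results$RMSE
}

ggplot()+
  geom_line(aes(x=bag.fr,y=cv.rmse))+
  theme_bw()+
  xlab('Bag Fraction')+
  ylab('RMSE (Cross-validated)')+
  scale_x_continuous(breaks = bag.fr)

# Bag fraction with the lowest cross-validated RMSE
bag.fr[which.min(cv.rmse)]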
The best performance was obtained when bag.fr is equal to 0.40.
Finally, we can check the performance of the final model with these settings on the test dataset and compare it to other methods.
final.gbm <- my.models[[7]]
# Predictions from the final gradient boosting model on the test set
predicted_te <- predict(final.gbm,read_te)
# MAE
mean(abs(read_te$target - predicted_te))
[1] 0.4479047
# RMSE
sqrt(mean((read_te$target - predicted_te)^2))
[1] 0.5509891
# R-square
cor(read_te$target,predicted_te)^2
[1] 0.7135852
| Model | R-square | MAE | RMSE |
|---|---|---|---|
| Ridge Regression | 0.727 | 0.435 | 0.536 |
| Lasso Regression | 0.725 | 0.434 | 0.538 |
| Gradient Boosting | 0.714 | 0.448 | 0.551 |
| Random Forests | 0.671 | 0.471 | 0.596 |
| Bagged Trees | 0.656 | 0.481 | 0.604 |
| Linear Regression | 0.644 | 0.522 | 0.644 |
| KNN | 0.623 | 0.500 | 0.629 |
| Decision Tree | 0.497 | 0.577 | 0.729 |
The code below implements a similar strategy and demonstrates how to fit a Gradient Boosting Tree model for the Recidivism dataset to predict recidivism in Year 2.
Import the dataset and initial data preparation
# Import data
recidivism <- read.csv(here('data/recidivism_y1 removed and recoded.csv'),
header=TRUE)
# List of variable types in the dataset
outcome <- c('Recidivism_Arrest_Year2')
id <- c('ID')
categorical <- c('Residence_PUMA',
'Prison_Offense',
'Age_at_Release',
'Supervision_Level_First',
'Education_Level',
'Prison_Years',
'Gender',
'Race',
'Gang_Affiliated',
'Prior_Arrest_Episodes_DVCharges',
'Prior_Arrest_Episodes_GunCharges',
'Prior_Conviction_Episodes_Viol',
'Prior_Conviction_Episodes_PPViolationCharges',
'Prior_Conviction_Episodes_DomesticViolenceCharges',
'Prior_Conviction_Episodes_GunCharges',
'Prior_Revocations_Parole',
'Prior_Revocations_Probation',
'Condition_MH_SA',
'Condition_Cog_Ed',
'Condition_Other',
'Violations_ElectronicMonitoring',
'Violations_Instruction',
'Violations_FailToReport',
'Violations_MoveWithoutPermission',
'Employment_Exempt')
numeric <- c('Supervision_Risk_Score_First',
'Dependents',
'Prior_Arrest_Episodes_Felony',
'Prior_Arrest_Episodes_Misd',
'Prior_Arrest_Episodes_Violent',
'Prior_Arrest_Episodes_Property',
'Prior_Arrest_Episodes_Drug',
'Prior_Arrest_Episodes_PPViolationCharges',
'Prior_Conviction_Episodes_Felony',
'Prior_Conviction_Episodes_Misd',
'Prior_Conviction_Episodes_Prop',
'Prior_Conviction_Episodes_Drug',
'Delinquency_Reports',
'Program_Attendances',
'Program_UnexcusedAbsences',
'Residence_Changes',
'Avg_Days_per_DrugTest',
'Jobs_Per_Year')
props <- c('DrugTests_THC_Positive',
'DrugTests_Cocaine_Positive',
'DrugTests_Meth_Positive',
'DrugTests_Other_Positive',
'Percent_Days_Employed')
# Convert all nominal, ordinal, and binary variables to factors
for(i in categorical){
recidivism[,i] <- as.factor(recidivism[,i])
}
# Write the recipe
require(recipes)
blueprint_recidivism <- recipe(x = recidivism,
vars = c(categorical,numeric,props,outcome,id),
roles = c(rep('predictor',48),'outcome','ID')) %>%
step_indicate_na(all_of(categorical),all_of(numeric),all_of(props)) %>%
step_zv(all_numeric()) %>%
step_impute_mean(all_of(numeric),all_of(props)) %>%
step_impute_mode(all_of(categorical)) %>%
step_logit(all_of(props),offset=.001) %>%
step_poly(all_of(numeric),all_of(props),degree=2) %>%
step_normalize(paste0(numeric,'_poly_1'),
paste0(numeric,'_poly_2'),
paste0(props,'_poly_1'),
paste0(props,'_poly_2')) %>%
step_dummy(all_of(categorical),one_hot=TRUE) %>%
step_num2factor(Recidivism_Arrest_Year2,
transform = function(x) x + 1,
levels=c('No','Yes'))
blueprint_recidivism
Train/Test Split and Cross-validation Settings
# Train/Test Split
loc <- which(recidivism$Training_Sample==1)
recidivism_tr <- recidivism[loc, ]
recidivism_te <- recidivism[-loc, ]
# Cross validation settings
set.seed(10302021) # for reproducibility
recidivism_tr = recidivism_tr[sample(nrow(recidivism_tr)),]
# Create 10 folds with equal size
folds = cut(seq(1,nrow(recidivism_tr)),breaks=10,labels=FALSE)
# Create the list for each fold
my.indices <- vector('list',10)
for(i in 1:10){
my.indices[[i]] <- which(folds!=i)
}
cv <- trainControl(method = "cv",
index = my.indices,
classProbs = TRUE,
summaryFunction = mnLogLoss)
Step 1: Initial model fit to tune the number of trees
We fix the learning rate at 0.1 (shrinkage=0.1), the interaction depth at 5 (interaction.depth=5), and the minimum number of observations at 10 (n.minobsinnode = 10). We do a grid search for the optimal number of trees from 1 to 1000 (n.trees = 1:1000).
grid <- expand.grid(shrinkage = 0.1,
n.trees = 1:1000,
interaction.depth = 5,
n.minobsinnode = 10)
gbm1 <- caret::train(blueprint_recidivism,
data = recidivism_tr,
method = 'gbm',
trControl = cv,
tuneGrid = grid,
bag.fraction = 1,
metric = 'logLoss')
plot(gbm1,type='l')
gbm1$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
178 178 5 0.1 10
It indicates that a model with 178 trees is optimal at the initial search.
Step 2: Tune the interaction depth and minimum number of observations
We fix the number of trees at 178 (n.trees = 178) and the learning rate at 0.1 (shrinkage=0.1). Then, we do a grid search by assigning values for the interaction depth from 1 to 15 and the minimum number of observations at 5, 10, 20, 30, 40, and 50.
grid <- expand.grid(shrinkage = 0.1,
n.trees = 178,
interaction.depth = 1:15,
n.minobsinnode = c(5,10,20,30,40,50))
gbm2 <- caret::train(blueprint_recidivism,
data = recidivism_tr,
method = 'gbm',
trControl = cv,
tuneGrid = grid,
bag.fraction = 1,
metric = 'logLoss')
plot(gbm2)
gbm2$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
41 178 7 0.1 40
The search indicates that the best performance is obtained when the interaction depth equals 7 and the minimum number of observations equals 40.
Step 3: Lower the learning rate and increase the number of trees
We fix the interaction depth at 7 (interaction.depth = 7) and the minimum number of observations at 40 (n.minobsinnode = 40). We lower the learning rate to 0.01 (shrinkage=0.01) and increase the number of trees to 10000 (n.trees = 1:10000) to explore if a lower learning rate improves the performance.
grid <- expand.grid(shrinkage = 0.01,
n.trees = 1:10000,
interaction.depth = 7,
n.minobsinnode = 40)
gbm3 <- caret::train(blueprint_recidivism,
data = recidivism_tr,
method = 'gbm',
trControl = cv,
tuneGrid = grid,
bag.fraction = 1,
metric = 'logLoss')
plot(gbm3,type='l')
gbm3$bestTune
n.trees interaction.depth shrinkage n.minobsinnode
1731 1731 7 0.01 40
Step 4: Tune the bag fraction
grid <- expand.grid(shrinkage = 0.01,
n.trees = 1731,
interaction.depth = 7,
n.minobsinnode = 40)
bag.fr <- seq(0.1,1,.05)
my.models <- vector('list',length(bag.fr))
for(i in 1:length(bag.fr)){
my.models[[i]] <- caret::train(blueprint_recidivism,
data = recidivism_tr,
method = 'gbm',
trControl = cv,
tuneGrid = grid,
bag.fraction = bag.fr[i])
}
cv.LogL <- c()
for(i in 1:length(bag.fr)){
cv.LogL[i] <- my.models[[i]]$results$logLoss
}
ggplot()+
geom_line(aes(x=bag.fr,y=cv.LogL))+
theme_bw()+
xlab('Bag Fraction')+
ylab('LogLoss (Cross-validated)')+
scale_x_continuous(breaks = bag.fr)
bag.fr[which.min(cv.LogL)]
[1] 0.7
The best result is obtained when the bag fraction is 0.7. So, we will proceed with that as our final model.
Step 5: Final Predictions on Test Set
final.gbm <- my.models[[13]]
# Predict the probabilities for the observations in the test dataset
predicted_te <- predict(final.gbm, recidivism_te, type='prob')
head(predicted_te)
No Yes
1 0.9479557 0.05204431
2 0.6394094 0.36059065
3 0.7002056 0.29979444
4 0.6725847 0.32741529
5 0.8364610 0.16353899
6 0.7666956 0.23330437
# Compute the AUC
require(cutpointr)
cut.obj <- cutpointr(x = predicted_te$Yes,
class = recidivism_te$Recidivism_Arrest_Year2)
auc(cut.obj)
[1] 0.7465408
# Confusion matrix assuming the threshold is 0.5
pred_class <- ifelse(predicted_te$Yes>.5,1,0)
confusion <- table(recidivism_te$Recidivism_Arrest_Year2,pred_class)
confusion
pred_class
0 1
0 3957 189
1 1109 205
# True Negative Rate
confusion[1,1]/(confusion[1,1]+confusion[1,2])
[1] 0.9544139
# False Positive Rate
confusion[1,2]/(confusion[1,1]+confusion[1,2])
[1] 0.04558611
# True Positive Rate
confusion[2,2]/(confusion[2,1]+confusion[2,2])
[1] 0.1560122
# Precision
confusion[2,2]/(confusion[1,2]+confusion[2,2])
[1] 0.5203046
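The comparison table below also reports overall accuracy (ACC), which is not computed above; it can be obtained from the same confusion matrix:
# Overall accuracy (proportion of correct classifications)
sum(diag(confusion))/sum(confusion)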
Comparison of Results
| Model | -LL | AUC | ACC | TPR | TNR | FPR | PRE |
|---|---|---|---|---|---|---|---|
| Gradient Boosting Trees | 0.499 | 0.747 | 0.763 | 0.156 | 0.954 | 0.046 | 0.520 |
| Random Forests | 0.507 | 0.725 | 0.761 | 0.153 | 0.954 | 0.046 | 0.514 |
| Bagged Trees | 0.506 | 0.724 | 0.759 | 0.144 | 0.954 | 0.046 | 0.500 |
| Logistic Regression | 0.510 | 0.719 | 0.755 | 0.142 | 0.949 | 0.051 | 0.471 |
| Logistic Regression with Ridge Penalty | 0.511 | 0.718 | 0.754 | 0.123 | 0.954 | 0.046 | 0.461 |
| Logistic Regression with Lasso Penalty | 0.509 | 0.720 | 0.754 | 0.127 | 0.952 | 0.048 | 0.458 |
| Logistic Regression with Elastic Net | 0.509 | 0.720 | 0.753 | 0.127 | 0.952 | 0.048 | 0.456 |
| KNN | ? | ? | ? | ? | ? | ? | ? |
| Decision Tree | 0.558 | 0.603 | 0.757 | 0.031 | 0.986 | 0.014 | 0.423 |