---
title: 'Statistical Learning, Homework #2'
author: "Elisa Paolazzi [MAT. 224424]"
date: "01/04/2022"
output:
  pdf_document:
    latex_engine: xelatex
  html_document:
    df_print: paged
geometry: left=2cm,right=2cm,top=0.5cm,bottom=1.5cm
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(warning=FALSE,
message=FALSE,
tidy.opts=list(width.cutoff = 60),
tidy = TRUE)
knitr::opts_knit$set(global.par=TRUE)
# built-in output hook
hook_output <- knitr::knit_hooks$get("output")
# custom chunk option output.lines to truncate text output
knitr::knit_hooks$set(output = function(x, options) {
n <- options$output.lines
if (!is.null(n)) {
x <- xfun::split_lines(x)
if(length(x) > n) {
top <- head(x, n)
bot <- tail(x, n)
x <- c(top, "\n....\n", bot)
}
x <- paste(x, collapse="\n")
}
hook_output(x, options)
})
```
### R Setup
```{r}
library(tidyverse)
library(dplyr) # for data manipulation
library(ggplot2) # for nice plots
library(GGally) # for ggpairs()
library(tree) # for cost-complexity trees
library(rsample) # for train-test split
library(caret) # for K-fold CV
library(randomForest) # for Random forest
library(gbm) # for boosted trees
library(cowplot) # for plot grid
library(glmnet) # for Shrinkage methods
# choose a seed
seed = 14
```
### Data exploration
The first step of the analysis is to load the data (`prostate.csv`) and explore it:
```{r}
# load the dataset
prostate_df = read.csv("prostate.csv")
# dataset dimensions
dim(prostate_df)
# show the dataset
knitr::kable(head(prostate_df))
```
As described in the assignment, the dataset contains 97 observations (men who were about to receive a radical prostatectomy) on 9 variables. The outcome variable is contained in the `lpsa` column, while the other 8 columns contain explanatory variables representing the patients' clinical measures.
We can also print a summary in order to better understand the content of the columns, and thus the nature of our predictors:
```{r}
summary(prostate_df)
```
The outcome variable `lpsa` is numerical and represents the logarithm of the prostate-specific antigen level, $\log(\text{PSA})$.
`lcavol`, `lweight`, `age`, `lbph`, `lcp` and `pgg45` are numerical predictors, `svi` is a dummy variable and `gleason` is a qualitative variable. No missing values are detected.
Since `svi` and `gleason` are categorical variables, we transform them into factors:
```{r}
# convert dummy and categorical columns as factors
prostate_df = prostate_df %>%
mutate_at(c('svi', 'gleason'), as.factor)
# show a sample of the dataset to verify the transformation
as_tibble(prostate_df)
```
Considering the predictors, it is useful to show their distributions and their relationships. We can do this by means of the `ggpairs()` function, which produces an informative plot:
```{r fig.height=10, fig.width=12}
ggpairs(prostate_df, title="Prostate-specific antigen and clinical measures")
```
Observing the plot we can see each variable's density plot on the diagonal, the correlations between the continuous variables (in the upper panels), the scatter plots of the continuous variables (in the lower panels), and the histograms and box plots for the combinations of categorical and continuous variables. From this visualization we notice some fairly correlated predictors, such as `lcavol` and `lcp` (0.68), as well as correlations between the outcome variable and some predictors, such as `lcavol` (0.73). Other useful insights about the relationships between the variables can be spotted in the visualization.
We can also directly show the correlation matrix between the numerical variables:
```{r}
knitr::kable(cor(dplyr::select(prostate_df, -svi, -gleason)))
```
As we said, there are some correlations between the variables.
### 1. Fit a decision tree on the whole data and plot the results. Choose the tree complexity by cross-validation and decide whether you should prune the tree based on the results. Prune the tree if applicable and interpret the fitted model.
Tree-based methods involve stratifying or segmenting the predictor space into a number of simple regions $R_1, \dots, R_n$. In a regression setting, in order to make a prediction for a given observation, we use the mean response value of the training observations in the region to which it belongs. As we will see, the main advantage of these methods over other types of regression models lies in their interpretability and graphical representation.
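Concretely, for each region $R_j$ the tree predicts the average training response in that region, and the recursive binary splits are chosen greedily so as to minimize the resulting residual sum of squares:
$$\hat{y}_{R_j} = \frac{1}{N_j}\sum_{i:\,x_i \in R_j} y_i, \qquad \text{RSS} = \sum_{j=1}^{J}\;\sum_{i:\,x_i \in R_j}\left(y_i - \hat{y}_{R_j}\right)^2,$$
where $N_j$ is the number of training observations falling in $R_j$ and $J$ is the number of terminal regions.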
We first fit a regression tree on the whole data:
```{r}
# fit a regression tree on the whole dataset
tree_pros = tree(lpsa ~ ., prostate_df)
# tree's summary
summary(tree_pros)
```
The output first shows the variables that were actually used in the construction of the tree, in our case `lcavol`, `lweight` and `pgg45` (3 out of 8). Secondly, it shows the number of terminal nodes (9 in our case) and the residual mean deviance, which is calculated by dividing the tree deviance (36.24) by $n - |T_0|$ (number of observations minus number of terminal nodes, $97 - 9 = 88$ in our case). Note that the residual mean deviance is a measure of the error remaining in the tree after construction and, for a regression tree, it is related to the mean squared error. Lastly, we can see the distribution of the residuals: their values are satisfactory since they are roughly centered on 0 and approximately symmetrical.
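As a quick sanity check we can recompute the residual mean deviance by hand from the fitted object (a small sketch: it relies on the tree's `frame`, where each row is a node, `var` equals `"<leaf>"` for terminal nodes and `dev` holds the node deviance):
```{r}
# residual deviance = sum of the deviances of the terminal nodes;
# residual mean deviance = residual deviance / (n - number of terminal nodes)
is_leaf  = tree_pros$frame$var == "<leaf>"
leaf_dev = sum(tree_pros$frame$dev[is_leaf])
leaf_dev / (nrow(prostate_df) - sum(is_leaf))
```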
We can plot the tree:
```{r fig.height=10, fig.width=12}
# plot the branches
plot(tree_pros)
# add branches' labels
text(tree_pros, pretty=0)
# add title
title(main="Prostate data: Unpruned regression tree on the whole data")
```
Observing the graph we can better understand how the tree was built using only three predictors (each predictor has one or more splits based on different thresholds). The most important predictor, which stands at the root of the tree, is `lcavol`, from which the subsequent branches are created.
The tree ends with 9 terminal nodes identifying the partitions ($R_1, \dots, R_9$) of the feature space. Each terminal node shows the average of the `lpsa` values for the corresponding region, which then represents the prediction for the observations that fall into it.
We can also examine in more details the tree's structure by displaying it in text mode:
```{r}
# display tree details
tree_pros
```
For each node some important information is displayed: the splitting criterion, the number of observations in the branch, the deviance and the overall branch prediction; lastly, an asterisk indicates that the node is terminal.
Once the tree is fitted and displayed, we can use cross-validation to choose the complexity of the tree and decide whether to prune it or not. However, remember that we are dealing with a tree fitted on the whole dataset.
```{r}
# set seed for reproducible results
set.seed(32)
# cross-validation
cv.pros = cv.tree(tree_pros)
cv.pros
```
The cross-validation output shows the number of terminal nodes of each candidate tree (from 9 down to 1), the corresponding cross-validation deviance and the value of the cost-complexity parameter (denoted by $k$ here).
As we can already note, the second position (corresponding to size 8) records the lowest cross-validation deviance (and also the lowest value of $k$, excluding the one belonging to the maximum size), which suggests that the tree may be pruned.
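For reference, in the usual cost-complexity formulation, for each value of $k$ (often denoted $\alpha$) pruning selects the subtree $T \subseteq T_0$ minimizing
$$\sum_{m=1}^{|T|}\;\sum_{i:\,x_i \in R_m}\left(y_i - \hat{y}_{R_m}\right)^2 + k\,|T|,$$
where $|T|$ is the number of terminal nodes: larger values of $k$ therefore favour smaller trees.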
Let's visualize the cross-validation deviance as a function of both the size and $k$:
```{r fig.width=12}
op = par(mfrow=c(1, 2))
# plot cross-validation deviance versus the number of terminal nodes
plot(cv.pros$size, cv.pros$dev, type="b", main="CV Deviance vs. number of terminal nodes",
xlab="Number of terminal nodes",
ylab="CV error rate")
# plot cost-complexity parameter versus the number of terminal nodes
plot(cv.pros$k, cv.pros$dev, type="b", main="Cost-complexity vs. number of terminal nodes",
xlab="cost-complexity parameter 'k'",
ylab="CV error rate")
```
Here we can visually inspect how the cross-validation deviance behaves depending on the size and the cost-complexity parameter. As previously noted, the plot clearly shows that the deviance reaches its minimum at a size of 8.
We can then prune the tree using the optimal size.
```{r}
# get the optimal size
opt.size = cv.pros$size[which.min(cv.pros$dev)]
# prune the tree using the optimal size value
prune_pros = prune.tree(tree_pros, best=opt.size)
# pruned tree's summary
summary(prune_pros)
```
Reading the summary, we notice that the residual mean deviance has slightly increased compared to the unpruned tree. This is expected: removing a split can only increase the total tree deviance, while the denominator $n - |T|$ grows only slightly (from $97 - 9 = 88$ to $97 - 8 = 89$).
We can also display the pruned tree:
```{r fig.height=10, fig.width=12}
# reset mfrow()
op = par(mfrow=c(1, 1))
# plot the branches
plot(prune_pros)
# add branches' labels
text(prune_pros, pretty=0)
# add title
title(main="Prostate data: Pruned regression tree on the whole data")
```
Compared to the previous tree, we can see that the last split of `lweight` (`lweight` < 3.8308) is missing.
It is now useful to calculate the mean squared error on the whole data, in order to see how the unpruned and pruned trees perform on it:
```{r}
# make prediction on the whole data using the whole data fitted unpruned tree
yhat_unpr = predict(tree_pros, newdata=prostate_df)
# calculate MSE
mean((yhat_unpr - prostate_df$lpsa)^2)
# make prediction on the whole data using the whole data fitted pruned tree
yhat_pr = predict(prune_pros, newdata=prostate_df)
# calculate MSE
mean((yhat_pr - prostate_df$lpsa)^2)
```
The two models perform roughly the same, since the MSE values (which here are essentially training errors) are very similar. However, the unpruned version performs slightly better, as expected: pruning can only increase the error on the data the tree was grown on.
```{r fig.height=4, fig.width=12}
op = par(mfrow=c(1, 2))
plot(yhat_unpr, prostate_df$lpsa, main = "Unpruned tree: prediction vs response variable (lpsa)",
xlab="prediction",
ylab="lpsa")
abline(0, 1)
plot(yhat_pr, prostate_df$lpsa, main = "Pruned tree: prediction vs response variable (lpsa)",
xlab="prediction",
ylab="lpsa")
abline(0, 1)
```
The results look very good, since the trees are able to capture the structure of the data. However, this is largely because we evaluated the models on the same data they were trained on (the whole dataset), so the models are naturally well adapted to it.
A regression tree's performance is not properly evaluated until we estimate the test error. For this purpose we split the observations into a training and a testing partition:
```{r}
# set the seed for reproducible partitions
set.seed(42)
# split data into training and test sets
split = initial_split(prostate_df, prop=0.7)
x_train = training(split)
x_test = testing(split)
y_test = x_test$lpsa
```
We can now fit the unpruned tree on the training set:
```{r}
# fit a regression tree on the training set
unpruned_tree = tree(lpsa ~ ., x_train)
# tree's summary
summary(unpruned_tree)
```
This time 5 predictors are actually used in the tree construction and there are 10 terminal nodes.
Let's visualize the tree:
```{r fig.height=10, fig.width=12}
# reset mfrow()
op = par(mfrow=c(1, 1))
plot(unpruned_tree)
text(unpruned_tree, pretty=0)
title(main="Prostate training data: Unpruned regression tree")
```
Let's perform a cross-validation in order to decide whether to prune the tree or not:
```{r fig.width=12}
# set seed for reproducible results
set.seed(8)
# cross-validation
cv.pros_tr = cv.tree(unpruned_tree)
cv.pros_tr
plot(cv.pros_tr$size, cv.pros_tr$dev, type="b",
main="CV Deviance vs. number of terminal nodes",
xlab="Number of terminal nodes",
ylab="CV error rate")
```
The optimal size is 9 (in agreement with the cost-complexity parameter), so we prune the tree accordingly:
```{r}
# get the optimal size
opt.size = cv.pros_tr$size[which.min(cv.pros_tr$dev)]
# prune the tree
pruned_tree = prune.tree(unpruned_tree, best=opt.size)
summary(pruned_tree)
```
Compared to the previous full tree, the `age` branch (which led to one of the terminal nodes) has been pruned, so this predictor is no longer included in the tree.
We can finally test the trees on the test set and compare the mean squared error:
```{r}
# make prediction on the test set using the training set fitted unpruned tree
yhat_unpr = predict(unpruned_tree, newdata=x_test)
# calculate MSE
mean((yhat_unpr - y_test)^2)
# calculate RMSE
sqrt(mean((yhat_unpr - y_test)^2))
# make prediction on the test set using the train set fitted pruned tree
yhat_pr = predict(pruned_tree, newdata=x_test)
# calculate MSE
mean((yhat_pr - y_test)^2)
# calculate RMSE
sqrt(mean((yhat_pr - y_test)^2))
```
And plot the results:
```{r fig.height=4, fig.width=12}
op = par(mfrow=c(1, 2))
plot(yhat_unpr, x_test$lpsa,
main = "Unpruned tree: prediction vs response variable (lpsa)",
xlab="prediction",
ylab="lpsa")
abline(0, 1)
plot(yhat_pr, x_test$lpsa,
main = "Pruned tree: prediction vs response variable (lpsa)",
xlab="prediction",
ylab="lpsa")
abline(0, 1)
```
In this case the pruned model performs better than the unpruned one. Moreover, the pruning procedure succeeds in slightly reducing the variance (the RMSE is also lower for the pruned tree) and thus the overfitting.
However, if we repeat the train-test split the results tend to vary quite a bit: this is due to the high variance that is typical of tree models. Indeed, pruning the tree is often not enough to reduce this variance.
### 2. Consider now a random forest and let $m$ be the number of variables to consider at each split. Set the range for $m$ from 1 to the number of explanatory variables, say $nvar$, and define a k-fold cross-validation schema for the selection of this tuning parameter, with $k$ of your choice. Prepare a matrix with $nvar$ rows and 2 columns and fill the first column with the average cross-validation error corresponding to each choice of $m$ and the second column with the OOB error (from the full dataset). Are the CV and OOB error different? Do they reach the minimum at the same value of $m$? Interpret the optimal model (either using the CV or the OOB error).
Random Forests are a useful method that reduces variance at the cost of some interpretability (since we end up with a *forest of trees*). To this end, a Random Forest builds a number of decision trees on bootstrapped training samples. However, each time a split in a tree is considered, a random sample of $m$ predictors is chosen as split candidates from the full set of $p$ predictors.
When we make a prediction, the result is obtained by averaging the predictions of all the trees.
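In other words, if $\hat{f}^{*b}$ denotes the tree grown on the $b$-th bootstrap sample (with only $m$ randomly chosen candidate predictors at each split), the random forest prediction is
$$\hat{f}_{\mathrm{rf}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{*b}(x),$$
where $B$ is the number of trees (500 by default in `randomForest()`).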
First, we define the `nvar` vector, which will contain the number of variables that Random Forest models will consider at each split (from 1 to 8, the number of predictors) and the 10 folds for the k-fold cross validation (using the full dataset):
```{r}
# number of predictors
n_pred = ncol(prostate_df) - 1
# create nvar vector
nvar = seq(1,n_pred)
# set seed for reproducible folds
set.seed(4)
# get 10 set of index for training sets composition
flds = createFolds(prostate_df$lpsa, k = 10, list = T, returnTrain = T)
```
We now initialize the results matrix containing the cross-validation error and the OOB error columns:
```{r}
# initialize empty matrix
results_matrix = matrix(nrow=n_pred, ncol=2)
```
We can now calculate the average cross-validation and the OOB errors corresponding to each choice of $m$:
```{r message=FALSE, warning=FALSE}
# iterate over m values
for (m in nvar) {
# CV error
# create an empty list to store each fold MSE
list_mse = c()
# iterate over the folds
for (i in flds) {
# set seed for reproducible randomForest() results
set.seed(seed)
# define index based on the i fold
train_idx = i
# split train and test sets according to the indexes
train = prostate_df[train_idx,]
test = prostate_df[-train_idx,]
y_ts = test$lpsa
# fit a random forest on k - 1 folds and test on the remaining fold using
# m number of variables
rf = randomForest(lpsa ~ ., data=train, mtry=m, xtest = dplyr::select(test, -lpsa),
ytest = y_ts, importance=TRUE)
# extract prediction
temp_yhat = rf$test$predicted
# compute MSE
temp_mse = mean((temp_yhat - y_ts)^2)
# add MSE to the mse_list
list_mse = append(list_mse, temp_mse)
}
# average k-folds cross validation errors
cv_error = mean(list_mse)
# place the value in the matrix
results_matrix[m,1] = cv_error
# OOB error
# set seed for reproducible randomForest() results
set.seed(seed)
# fit a random forest on the whole dataset using m number of variables
rf = randomForest(lpsa ~ ., data=prostate_df, mtry=m, importance=TRUE)
# extract OOB value
oob = rf$mse[length(rf$mse)]
# place the value in the matrix
results_matrix[m,2] = oob
}
results_matrix
```
We can also convert the matrix to a dataframe in order to sort the results:
```{r}
# convert matrix to df
results_df = as.data.frame(results_matrix, row.names=nvar)
colnames(results_df) = c("CV_error", "OOB_error")
# order results based on CV error
knitr::kable(results_df[order(results_df$CV_error),])
# order results based on OOB error
knitr::kable(results_df[order(results_df$OOB_error),])
```
As we can see, the cross-validation and OOB errors do not reach their minimum at the same value of $m$: the cross-validation error is minimized at $m = 2$, while the OOB error is minimized at $m = 3$.
It should be noted that they can be very similar, since the OOB error can be seen as an estimate of the cross-validation error (the OOB error converges to the leave-one-out cross-validation error estimate). Moreover, OOB has clear computational advantages over CV since no refitting is needed: we can obtain an estimate on unseen data directly from our fitted model.
For these reasons both cross-validation and OOB errors are considered good approaches to select an optimal model.
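As a small illustration of why no refitting is needed (a sketch, assuming the structure of the `randomForest` object: for regression, `rf$predicted` holds the out-of-bag prediction for each training observation and `rf$mse` the running OOB MSE as trees are added; the `mtry` value below is purely illustrative):
```{r}
set.seed(seed)
rf_oob_demo = randomForest(lpsa ~ ., data=prostate_df, mtry=3)
# OOB MSE computed by hand from the out-of-bag predictions...
mean((rf_oob_demo$predicted - prostate_df$lpsa)^2)
# ...matches the last element of rf$mse, i.e. the OOB error used above
tail(rf_oob_demo$mse, 1)
```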
However, we can see how the two models ($m=2$ for CV and $m=3$ for OOB) perform on the previous train-test split:
```{r}
# add "m" column to the df
results_df = results_df %>%
mutate(m=nvar)
# extract optimal CV m
opt.m_cv = results_df$m[which.min(results_df$CV_error)]
# extract optimal OOB m
opt.m_oob = results_df$m[which.min(results_df$OOB_error)]
# set seed for reproducible results
set.seed(14)
# fit the best CV Random Forest
best_CV_rf = randomForest(lpsa ~ ., data=x_train, mtry=opt.m_cv,
xtest = dplyr::select(x_test, -lpsa),
ytest = y_test, importance=TRUE)
best_CV_rf
# fit the best OOB Random Forest
best_OOB_rf = randomForest(lpsa ~ ., data=x_train, mtry=opt.m_oob,
xtest = dplyr::select(x_test, -lpsa),
ytest = y_test, importance=TRUE)
best_OOB_rf
```
As we can see, the test set MSE values are essentially the same, which suggests that the performances of the models are very similar in this case. Besides that, we can check the importance of each predictor in the two models:
```{r fig.width=12}
op = par(mfrow=c(2, 1))
varImpPlot(best_CV_rf,
main="Variables importance for the best Random Forest (based on CV error)")
varImpPlot(best_OOB_rf,
main="Variables importance for the best Random Forest (based on OOB error)")
```
The left-side plots show the mean increase in prediction error when that variable's values are permuted (`%IncMSE`), while the x-axis of the right-side plots represents the total decrease in node impurity resulting from splits over that variable. As we can see, the two models share the same most important variables, `lcavol` and `lweight`.
However, these values of $m$ (2 and 3) were selected using the whole dataset (as required by the assignment), so fitting and testing the models on our partitions with these values is not entirely fair, since information from the test observations has already leaked into the tuning.
For this reason, we can repeat the parameter tuning performing CV and OOB on the training set only:
```{r}
# initialize empty matrix
results_matrix_tr = matrix(nrow=n_pred, ncol=2)
# set seed for reproducible folds
set.seed(4)
# recreate 10 set of index for training set composition (using only training set)
flds_tr = createFolds(x_train$lpsa, k = 10, list = T, returnTrain = T)
# iterate over m values
for (m in nvar) {
# create an empty list to store each folds MSE
list_mse = c()
# iterate over the folds
for (i in flds_tr) {
# set seed for reproducible randomForest() results
set.seed(seed)
# define index based on the i fold
train_idx = i
# split train and test sets according to the indexes
train = x_train[train_idx,]
test = x_train[-train_idx,]
y_ts = test$lpsa
# fit a random forest on k - 1 folds and test on the remaining fold using
# m number of variables
rf = randomForest(lpsa ~ ., data=train, mtry=m, xtest = dplyr::select(test, -lpsa),
ytest = y_ts, importance=TRUE)
# extract prediction
temp_yhat = rf$test$predicted
# compute MSE
temp_mse = mean((temp_yhat - y_ts)^2)
# add MSE to the mse_list
list_mse = append(list_mse, temp_mse)
}
# average k-folds cross validation errors
cv_error = mean(list_mse)
# place the value in the matrix
results_matrix_tr[m,1] = cv_error
# set seed for reproducible randomForest() results
set.seed(seed)
# fit a random forest on the whole training set using m variables
rf = randomForest(lpsa ~ ., data=x_train, mtry=m, importance=TRUE)
# extract OOB value
oob = rf$mse[length(rf$mse)]
# place the value in the matrix
results_matrix_tr[m,2] = oob
}
# convert matrix to df
results_df_tr = as.data.frame(results_matrix_tr, row.names=nvar)
colnames(results_df_tr) = c("CV_error", "OOB_error")
# order results based on CV error
knitr::kable(results_df_tr[order(results_df_tr$CV_error),])
# order results based on OOB error
knitr::kable(results_df_tr[order(results_df_tr$OOB_error),])
```
We can also produce a plot in order to better observe how the errors vary with $m$:
```{r fig.width=12}
# add m column to df
results_df_tr = results_df_tr %>%
mutate(m=nvar)
# CV error plot
mse_plot = results_df_tr %>%
ggplot(aes(x=m, y=CV_error)) +
geom_line(color="grey", size=1.5) +
geom_point(shape=21, color="black", fill="#2374AB", size=5) +
labs(title = "CV error variation based on m",
x = "m",
y = "CV error")
# OOB error plot
oob_plot = results_df_tr %>%
ggplot(aes(x=m, y=OOB_error)) +
geom_line(color="grey", size=1.5) +
geom_point(shape=21, color="black", fill="#2374AB", size=5) +
labs(title = "OOB error variation based on m",
x = "m",
y = "OOB error")
# plot side by side
plot_grid(mse_plot, oob_plot,
labels = "", ncol = 2, nrow = 1)
```
As expected, the CV and OOB error values are higher now. In particular, the best value of $m$ is 6 according to the CV error and 4 according to the OOB error.
As a final step we can use these $m$ values (this time selected using the training set) to fit the models and test them on the test set:
```{r}
# extract optimal number of m based on CV
opt.m_cv_tr = results_df_tr$m[which.min(results_df_tr$CV_error)]
# extract optimal number of m based on OOB
opt.m_oob_tr = results_df_tr$m[which.min(results_df_tr$OOB_error)]
# set seed for reproducible results
set.seed(33)
# fit the best CV Random Forest
best_CV_rf = randomForest(lpsa ~ ., data=x_train, mtry=opt.m_cv_tr,
xtest = dplyr::select(x_test, -lpsa),
ytest = y_test, importance=TRUE)
best_CV_rf
# fit the best OOB Random Forest
best_OOB_rf = randomForest(lpsa ~ ., data=x_train, mtry=opt.m_oob_tr,
xtest = dplyr::select(x_test, -lpsa),
ytest = y_test, importance=TRUE)
best_OOB_rf
```
And we can again visualize the variable importance:
```{r fig.width=12}
op = par(mfrow=c(2, 1))
varImpPlot(best_CV_rf,
main="Variables importance for the best Random Forest (based on CV error)")
varImpPlot(best_OOB_rf,
main="Variables importance for the best Random Forest (based on OOB error)")
```
Once again the performances are very similar. Moreover, they exceed those of the previous regression trees (both the pruned and unpruned versions fitted and tested on the same train-test partitions). Even in this case, repeating the train-test split may slightly change both the optimal value of $m$ and the performance. The variable importance order can also change slightly; however, the top 2-3 most important variables (especially `lcavol` and `lweight` as 1st and 2nd) always seem to be the same.
Moreover, the models using the optimal values of $m$ selected via the OOB error seem to be generally slightly more accurate, presumably because the OOB error approaches the LOOCV error. However, we cannot select the optimal model using only a few train-test splits; cross-validation will be required for this purpose.
### 3. Fit boosted regression trees making a selection of the number of boosting iterations (n.trees) by CV. Interpret your selected optimal model.
Another technique to reduce the variance of trees is boosting. This approach is based on the idea of aggregating sequentially grown trees, where each tree is grown using information from the previously grown trees. Moreover, an advantage compared to Random Forests is that boosting implicitly performs a kind of variable selection.
For the purpose of the assignment we will use the gradient boosting algorithm, which is implemented in R by the `gbm` library.
We will select the optimal model by tuning the parameter $M$ (`n.trees` in the model), which represents the number of trees, i.e. the number of boosting iterations. This parameter is connected to the bias-variance trade-off, since too large an $M$ (here we allow up to 5000 trees) tends to overfit the data.
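Recall how gradient boosting for regression works: starting from $\hat{f}(x) = 0$ and residuals $r_i = y_i$, at each iteration a small tree $\hat{f}^{\,b}$ is fit to the current residuals and added to the model, shrunk by the learning rate $\lambda$:
$$\hat{f}(x) \leftarrow \hat{f}(x) + \lambda\,\hat{f}^{\,b}(x), \qquad r_i \leftarrow r_i - \lambda\,\hat{f}^{\,b}(x_i),$$
so that the final model is $\hat{f}(x) = \sum_{b=1}^{M} \lambda\,\hat{f}^{\,b}(x)$. This makes clear why $M$ and $\lambda$ are tied together: a smaller $\lambda$ requires a larger $M$ to achieve the same fit.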
For the selection of the number of boosting iterations (`n.trees`) we use k-fold cross-validation on the training set, which is already implemented in the `gbm()` function:
```{r fig.height=8, fig.width=12}
# reset mfrow()
op = par(mfrow=c(1, 1))
# set seed for reproducible results
set.seed(1)
# fit a boosted regression trees using 10-fold CV
boosted_reg = gbm(lpsa ~ ., data=x_train, distribution="gaussian", n.trees=5000,
interaction.depth=4, cv.folds=10)
# show summary
knitr::kable(summary(boosted_reg))
```
The summary shows the relative influence of each predictor on the output: in this case the top two variables are `lcavol` and `lweight`, the same as in the previous Random Forest models.
The `gbm.perf()` function allows us to find the optimal number of boosting iterations, based either on the cross-validation error or on the OOB error:
```{r fig.height=6, fig.width=12}
op = par(mfrow=c(1,2))
# optimal number of boosting iterations based CV
cv_opt_iter = gbm.perf(boosted_reg, method = "cv")
cv_opt_iter
# optimal number of boosting iterations based OOB
oob_opt_iter = gbm.perf(boosted_reg, method = "OOB")
oob_opt_iter
```
As we can also see from the output plots (squared error loss vs. number of iterations), the optimal number of trees is 28 according to cross-validation and 48 according to the OOB estimate. In particular, the green line indicates the test (here validation) error, the black line represents the training error and the blue dotted line marks the optimal number of iterations. In the CV-based plot the optimal number of iterations corresponds to the minimum of the green validation-error curve.
We can finally see how the models perform on our test set, using both the CV-based and the OOB-based number of iterations:
```{r fig.height=8, fig.width=12}
# reset mfrow()
op = par(mfrow=c(1, 1))
# set seed for reproducible results
set.seed(14)
# fit a boosted regression with the optimal number of iteration based on CV
boosted_reg_cv = gbm(lpsa ~ ., data=x_train, distribution="gaussian",
n.trees=cv_opt_iter, cv.folds=0)
# show summary
knitr::kable(summary(boosted_reg_cv))
# make prediction on test using the model
yhat = predict(boosted_reg_cv, newdata=x_test, n.trees=cv_opt_iter)
# calculate MSE
boosted_reg_cv_mse = mean((yhat - y_test)^2)
boosted_reg_cv_mse
set.seed(14)
# fit a boosted regression with the optimal number of iteration based on OOB
boosted_reg_OOB = gbm(lpsa ~ ., data=x_train, distribution="gaussian",
n.trees=oob_opt_iter, cv.folds=0)
# show summary
knitr::kable(summary(boosted_reg_OOB))
# make prediction on test using the model
yhat = predict(boosted_reg_OOB, newdata=x_test, n.trees=oob_opt_iter)
# calculate MSE
boosted_reg_OOB_mse = mean((yhat - y_test)^2)
boosted_reg_OOB_mse
```
Again, the top two variables in terms of relative influence are still `lcavol` and `lweight` for both models.
We can also evaluate the marginal effect of these two variables by producing partial dependence plots:
```{r}
# partial dependence plot CV based model
plot(boosted_reg_cv, i="lcavol",
main ="Partial dependence plot: lcavol (CV based model)",
ylab="lpsa")
plot(boosted_reg_cv, i="lweight",
main ="Partial dependence plot: lweight (CV based model)",
ylab="lpsa")
# partial dependence plot OOB based model
plot(boosted_reg_OOB, i="lcavol",
main ="Partial dependence plot: lcavol (OOB based model)",
ylab="lpsa")
plot(boosted_reg_OOB, i="lweight",
main ="Partial dependence plot: lweight (OOB based model)",
ylab="lpsa")
```
These plots represent the effect that the chosen variable has on the response variable. Since the performances of the OOB- and CV-based boosted regressions are similar, these plots, and thus the estimated effects of the most important variables on the response, are quite similar too.
In both cases the value of `lpsa` does not increase until `lcavol` reaches about 0.5; after that the response increases with `lcavol`, and once `lcavol` exceeds about 2.5 the value of `lpsa` stabilizes at just above 3.
In the `lweight` partial dependence plots of both models the value of `lpsa` is stable at around 2.2 until `lweight` reaches about 3.4; after that we notice a small dip, but then the response variable increases with `lweight` until about 3.35; at this point, for both models, the value of `lpsa` stabilizes at about 2.8.
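Formally, the partial dependence curves shown above estimate the marginal effect of a single predictor $x_j$ by averaging the fitted model over the observed values of the remaining predictors:
$$\bar{f}_j(x_j) = \frac{1}{n}\sum_{i=1}^{n}\hat{f}\!\left(x_j,\, x_{i,\setminus j}\right),$$
where $x_{i,\setminus j}$ denotes the values of all the other predictors for the $i$-th training observation.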
As in the previous cases, the performance of the OOB- and CV-based models seems to be roughly the same, but once again, repeating the train-test split may lead to somewhat different results. However, the top two variables always seem to be `lcavol` and `lweight`.
### 4. Compare the performance of the three methods (cost-complexity decision trees, random forests and boosting) using cross-validation. Make sure that the model complexity is re-optimized at each choice of the training set (either using another CV or using the OOB error).
The previous train-test splits were useful for presenting and interpreting the output of the various model functions, as well as for getting an initial idea of their performance. However, to properly evaluate the models and eventually select the optimal one, it is essential to use cross-validation.
We can finally compare the methods properly using 10-fold cross-validation. This choice stems from the fact that leave-one-out cross-validation would be rather expensive and time consuming in this case (especially for the Random Forest and boosted models).
We start by initializing the dataframe that will contain the results and the folds we are going to use for cross-validation.
```{r}
# initialize final results dataframe
final_results = data.frame(model = c("Unpruned decision tree", "Pruned decision tree",
"Random Forest (CV error based m)",
"Random Forest (OOB error based m)",
"Boosted regression (CV based)",
"Boosted regression (OOB based)"),
error=0, type="CV MSE")
# set seed for reproducible folds
set.seed(14)
# create leave-one-out folds (one training index set per observation)
flds_LOOCV = createFolds(prostate_df$lpsa, k = dim(prostate_df)[1], list = T,
returnTrain = T)
# set seed for reproducible folds
set.seed(14)
# create 10 folds for 10-fold cross-validation
flds_10_cv = createFolds(prostate_df$lpsa, k = 10, list = T, returnTrain = T)
```
First, we evaluate the performance of cost-complexity decision trees re-optimizing the tree complexity using cross-validation:
```{r}
# initialize folds' MSE lists
list_unpr_mse = c()
list_pr_mse = c()
# iterate over folds
for (f in flds_10_cv) {
# set seed for reproducible cv.tree() folds
set.seed(seed)
# define index based on the i fold
train_idx = f
# split train and test sets according to the indexes
train = prostate_df[train_idx,]
test = prostate_df[-train_idx,]
y_ts = test$lpsa
# fit a regression tree on the training set
unpruned_tree = tree(lpsa ~ ., train)
# perform CV to find the optimal number of terminal nodes
cv.pros_tr = cv.tree(unpruned_tree)
# get the optimal size
opt.size = cv.pros_tr$size[which.min(cv.pros_tr$dev)]
# prune the tree
pruned_tree = prune.tree(unpruned_tree, best=opt.size)
# test the cost-complexity decision trees models
# unpruned tree
yhat_unpr = predict(unpruned_tree, newdata=test)
# calculate MSE
unpr_mse = mean((yhat_unpr - y_ts)^2)
# pruned tree
yhat_pr = predict(pruned_tree, newdata=test)
# calculate MSE
pr_mse = mean((yhat_pr - y_ts)^2)
# add MSEs to the lists
list_unpr_mse = append(list_unpr_mse, unpr_mse)
list_pr_mse = append(list_pr_mse, pr_mse)
}
# add CV MSE to the final results df
final_results[1,]$error = mean(list_unpr_mse)
final_results[2,]$error = mean(list_pr_mse)
knitr::kable(final_results[1:2,])
```
The performances of the unpruned and the pruned tree are very similar.
Similarly, we can evaluate the performance of the Random Forests. In particular, we will report the performance both of the models re-optimized via the CV error and of those re-optimized via the OOB error.
```{r message=FALSE, warning=FALSE}
# initialize folds' MSE lists
list_CV_rf_mse = c()
list_OOB_rf_mse = c()
# iterate over folds
for (f in flds_10_cv) {
# set seed for reproducible randomForest() results
set.seed(seed)
# define index based on the i fold
train_idx = f
# split train and test sets according to the indexes
train = prostate_df[train_idx,]
test = prostate_df[-train_idx,]
y_ts = test$lpsa
# create a CV/OBB errors df
temp_res_m = data.frame(nvar=nvar, CV_error = 0, OOB_error = 0)
# iterate over m values
for (m in nvar) {
# CV optimization
set.seed(seed)
# recreate 10 set of index for training-validation sets composition
flds_val = createFolds(train$lpsa, k = 10, list = T, returnTrain = T)
# create an empty list to store each folds MSE
list_rf_mse = c()
for (i in flds_val) {
# set seed for reproducible randomForest() results
set.seed(seed)
# define index based on the i fold
train_idx_rf = i
# split train and test sets according to the indexes
train_rf = train[train_idx_rf,]
test_rf = train[-train_idx_rf,]
y_ts_rf = test_rf$lpsa
# fit a random forest on k - 1 folds and test on the remaining fold using
# m number of variables
rf = randomForest(lpsa ~ ., data=train_rf, mtry=m,
xtest = dplyr::select(test_rf, -lpsa),
ytest = y_ts_rf, importance=TRUE)
# extract prediction
temp_yhat = rf$test$predicted
# compute MSE
temp_mse = mean((temp_yhat - y_ts_rf)^2)
# add MSE to the mse_list
list_rf_mse = append(list_rf_mse, temp_mse)
}
# average k-folds cross validation errors
cv_error = mean(list_rf_mse)
# add to the df
temp_res_m[temp_res_m$nvar==m,]$CV_error = cv_error
# OOB optimization
# set seed for reproducible randomForest() results
set.seed(seed)
# fit a random forest on the outer training fold using m variables (for the OOB error)
rf = randomForest(lpsa ~ ., data=train, mtry=m, importance=TRUE)
# extract OOB value
oob = rf$mse[length(rf$mse)]
# add to the df
temp_res_m[temp_res_m$nvar==m,]$OOB_error = oob
}
# extract optimal value of m based on CV error
opt_m_cv = temp_res_m[temp_res_m$CV_error == min(temp_res_m$CV_error),]$nvar
# extract optimal value of m based on OOB error
opt_m_oob = temp_res_m[temp_res_m$OOB_error == min(temp_res_m$OOB_error),]$nvar
# set seed for reproducible results
set.seed(seed)
# fit the best CV Random Forest
best_CV_rf = randomForest(lpsa ~ ., data=train, mtry=opt_m_cv,
xtest = dplyr::select(test, -lpsa),
ytest = y_ts, importance=TRUE)
# extract prediction
temp_yhat = best_CV_rf$test$predicted
# compute MSE
temp_mse = mean((temp_yhat - y_ts)^2)
# add MSE to the mse list
list_CV_rf_mse = append(list_CV_rf_mse, temp_mse)
# fit the best OOB Random Forest
best_OOB_rf = randomForest(lpsa ~ ., data=train, mtry=opt_m_oob,
xtest = dplyr::select(test, -lpsa),
ytest = y_ts, importance=TRUE)
# extract prediction
temp_yhat = best_OOB_rf$test$predicted
# compute MSE
temp_mse = mean((temp_yhat - y_ts)^2)
# add MSE to the mse list
list_OOB_rf_mse = append(list_OOB_rf_mse, temp_mse)
}
# add cross-validation folds' average MSE for the two optimized Random Forest
final_results[3,]$error = mean(list_CV_rf_mse)
final_results[4,]$error = mean(list_OOB_rf_mse)
knitr::kable(final_results[1:4,])
```
Again, the performances of the two models (Random Forest optimized via CV and OOB errors) seem to be essentially the same.
Finally, we can do the same procedure for boosting:
```{r message=FALSE, warning=FALSE}
# initialize folds' MSE lists
boosted_CV_mse_list = c()
boosted_OOB_mse_list = c()
# iterate over folds
for (f in flds_10_cv) {
# set seed for reproducible results
set.seed(seed)
# define index based on the i fold
train_idx = f
# split train and test sets according to the indexes
train = prostate_df[train_idx,]
test = prostate_df[-train_idx,]
y_ts = test$lpsa
# fit a boosted regression trees on the training set using 10-fold CV
boosted_reg = gbm(lpsa ~ ., data=train, distribution="gaussian", n.trees=5000,
interaction.depth=4, cv.folds=10)
# find optimal number of boosting iterations based CV
cv_opt_iter = gbm.perf(boosted_reg, method = "cv", plot.it=F)
# find optimal number of boosting iterations based OOB
oob_opt_iter = gbm.perf(boosted_reg, method = "OOB", plot.it=F)
# fit a boosted regression with the optimal number of iteration based on CV
boosted_reg_cv = gbm(lpsa ~ ., data=train, distribution="gaussian",
n.trees=cv_opt_iter, cv.folds=0)
# make prediction on test
yhat_cv = predict(boosted_reg_cv, newdata=test, n.trees=cv_opt_iter)
# calculate MSE
boosted_reg_cv_mse = mean((yhat_cv - y_ts)^2)
# add to the MSE list
boosted_CV_mse_list = append(boosted_CV_mse_list, boosted_reg_cv_mse)
# fit a boosted regression with the optimal number of iteration based on OOB
boosted_reg_oob = gbm(lpsa ~ ., data=train, distribution="gaussian",
n.trees=oob_opt_iter, cv.folds=0)
# make prediction on test
yhat_oob = predict(boosted_reg_oob, newdata=test, n.trees=oob_opt_iter)
# calculate MSE
boosted_reg_oob_mse = mean((yhat_oob - y_ts)^2)
# add to the MSE list
boosted_OOB_mse_list = append(boosted_OOB_mse_list, boosted_reg_oob_mse)
}
# add cross-validation folds' average MSE for the two optimized boosted models
final_results[5,]$error = mean(boosted_CV_mse_list)
final_results[6,]$error = mean(boosted_OOB_mse_list)
```
Note that we have not tuned the shrinkage parameter. However, this parameter interacts with $M$ (`n.trees`), which we have tuned instead. Hence our strategy was to keep the shrinkage parameter small (we used the default, 0.001) and let $M$ grow until there is no significant reduction in loss (reaching the optimal number of iterations based on both CV and OOB). Learning is slower, and so is the procedure, but this is not a big problem since we are not dealing with a large dataset.
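If one did want to tune the shrinkage as well, a minimal sketch of a joint grid search could look as follows (not run here; the grid values are purely illustrative and the fits reuse the 10-fold CV built into `gbm()`):
```{r eval=FALSE}
shrinkage_grid = c(0.001, 0.01, 0.1)
grid_cv_error = sapply(shrinkage_grid, function(s) {
  set.seed(seed)
  fit = gbm(lpsa ~ ., data=x_train, distribution="gaussian", n.trees=5000,
            interaction.depth=4, shrinkage=s, cv.folds=10)
  # best CV error along the n.trees path for this shrinkage value
  min(fit$cv.error)
})
data.frame(shrinkage=shrinkage_grid, best_cv_error=grid_cv_error)
```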
Once the models have been evaluated using cross-validation, we can finally compare them. We now sort the results dataframe to see which models have recorded the lowest MSE.
```{r}
knitr::kable(final_results[order(final_results$error),])
```
### 5. Draw some general conclusions about the analysis and the different methods that you have considered.
As we can see from the results dataframe, the Random Forest models (both the one optimized via CV and the one optimized via OOB) have the best performance, achieving the lowest cross-validation MSE. They are followed by the boosted models; here the difference between the CV-optimized and the OOB-optimized model is more pronounced, as the gap between their MSEs is larger than for the Random Forests. Finally, we have the cost-complexity decision trees, for which the complete tree performs better than the pruned tree. As we said before, models optimized via the OOB error tend to perform slightly better, probably because the OOB error approaches the leave-one-out cross-validation error.
As we know, cost-complexity decision trees suffer from high variance and therefore tend to overfit more than models such as Random Forests and boosted models. These last two have in fact quite similar performances in our case. However, the boosted models seem more prone to overfitting: this could be because noise in the data affects them more than it affects Random Forests.
Furthermore, we have seen how some of the variables, such as `lcavol` and `lweight`, play an important role in the prediction of `lpsa`. In the previous points we observed that these variables were:
- at the root of the trees (in the case of cost-complexity decision trees),
- at the top of the lists of variable importance (in the case of Random Forests),
- at the top of the lists of relative influence (in the case of boosted models).
Note that in the data exploration, looking at the output of `ggpairs()` and the correlation matrix, we already had a first intuition about the importance that the different variables would have in the models.
Out of curiosity, I finally checked whether shrinkage models could perform better than the tree-based models. Applying the same cross-validation scheme and folds used for the previous models, a properly tuned LASSO model performs better than the Random Forest, achieving a cross-validation MSE of 0.55, while Ridge regression performs about as well as the Random Forest:
```{r}
# initialize MSE lists
list_MSE_lasso = c()
list_MSE_ridge = c()
# iterate over folds
for (f in flds_10_cv) {
# define index based on the i fold
train_idx = f
x_tr_full = prostate_df[train_idx, ]
x_ts_full = prostate_df[-train_idx, ]
# glmnet() requires numeric model matrices of the predictors
# matrix creation (model.matrix encodes the factors; [, -1] drops the intercept column)
x_tr_only_pred = model.matrix(lpsa ~ ., data=x_tr_full)[, -1]
x_ts_only_pred = model.matrix(lpsa ~ ., data=x_ts_full)[, -1]
# we store the y values
y_tr = x_tr_full$lpsa
y_ts = x_ts_full$lpsa
# Lasso
# fit lasso model on training
lasso_mod = glmnet(x_tr_only_pred, y_tr, alpha=1)
set.seed(seed)