---
title: "An Overview of the Logistic Regression and Regularization"
author:
  - name: Cengiz Zopluoglu
    affiliation: University of Oregon
date: 11/3/2022
output:
  distill::distill_article:
    self_contained: true
    toc: true
    toc_float: true
    theme: theme.css
---
<style>
.list-group-item.active, .list-group-item.active:focus, .list-group-item.active:hover {
z-index: 2;
color: #fff;
background-color: #FC4445;
border-color: #97CAEF;
}
#infobox {
padding: 1em 1em 1em 4em;
margin-bottom: 10px;
border: 2px solid black;
border-radius: 10px;
background: #E6F6DC 5px center/3em no-repeat;
}
</style>
```{r setup, include=FALSE}
knitr::opts_chunk$set(comment = "",fig.align='center')
require(here)
require(ggplot2)
require(plot3D)
require(kableExtra)
require(knitr)
require(gifski)
require(magick)
require(gridExtra)
library(scales)
library(lubridate)
require(plotly)
options(scipen=99)
```
`r paste('[Updated:',format(Sys.time(),'%a, %b %d, %Y - %H:%M:%S'),']')`
# 1. Overview of the Logistic Regression
Logistic regression is a type of model that can be used to predict a binary outcome variable. Linear and logistic regression are indeed members of the same family of models called *generalized linear models*. While linear regression can also technically be used to predict a binary outcome, the bounded nature of a binary outcome, [0,1], makes the linear regression solution suboptimal. Logistic regression is a more appropriate model that considers the bounded nature of the binary outcome when making predictions.
Binary outcomes can be coded in various ways in the data, such as 0 vs. 1, True vs. False, Yes vs. No, or Success vs. Failure. The rest of these notes assume that the outcome variable is coded as 0s and 1s. We are interested in predicting the probability that a future observation will belong to the class of 1s.
The notes in this section will first introduce a suboptimal solution to predict a binary outcome by fitting a linear probability model using linear regression and discuss the limitations of this approach. Then, the logistic regression model and its estimation will be demonstrated. Finally, we will discuss various metrics that can be used to evaluate the performance of a logistic regression model.
Throughout these notes, we will use the [Recidivism dataset from the NIJ competition](https://nij.ojp.gov/funding/recidivism-forecasting-challenge) to discuss different aspects of logistic regression and for demonstrations. This dataset and its variables were discussed in detail in [Lecture 1]( https://edld654-fall22.netlify.app/lecture-1.html) and [Lecture 2a – Section 6](https://edld654-fall22.netlify.app/lecture-2a.html#wrapping-up-using-the-recipes-package). The outcome of interest to predict in this dataset is whether or not an individual will be recidivated in the second year after initial release. To make the demonstrations easier, I randomly sampled 20 observations from this data. Eight observations have a value of 1 for the outcome (recidivated), and 12 observations have a value of 0 (not recidivated).
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
# Import the randomly sampled 20 observations from the recidivism dataset
recidivism_sub <- read.csv('./data/recidivism_sub.csv',header=TRUE)
# Outcome variable
table(recidivism_sub$Recidivism_Arrest_Year2)
```
## 1.1. Linear Probability Model
A linear probability model fits a typical regression model to a binary outcome. When the outcome is binary, the predictions from a linear regression model can be considered as the probability of the outcome being equal to 1,
$$\hat{Y} = P(Y = 1).$$
Suppose we want to predict recidivism in the second year (`Recidivism_Arrest_Year2`) by using the number of dependents a parolee has. Then, we could fit this model using the `lm()` function.
$$\hat{Y} = P(Y = 1) = \beta_0 + \beta_1X + \epsilon$$
```{r, echo=TRUE,eval=TRUE ,message=FALSE, warning=FALSE}
mod <- lm(Recidivism_Arrest_Year2 ~ 1 + Dependents,
data = recidivism_sub)
summary(mod)
```
The intercept is 0.75 and the slope for the predictor, `Dependents`, is -.25. We can interpret the intercept and slope as the following for this example. Note that the predicted values from this model can now be interpreted as probability predictions because the outcome is binary.
- Intercept (0.75): When the number of dependents is equal to 0, the probability of being recidivated in Year 2 is 0.75.
- Slope (-0.25): For every additional dependent (one unit increase in X) the individual has, the probability of being recidivated in Year 2 is reduced by .25.
The intercept and slope still represent the best-fitting line to our data, and this fitted line can be shown here.
```{r, echo=FALSE,eval=TRUE, message=FALSE, warning=FALSE}
set.seed(1234)
x <- recidivism_sub$Dependents + runif(20,-.1,.1)
y <- recidivism_sub$Recidivism_Arrest_Year2
ggplot()+
geom_point(aes(x=x,y=y))+
theme_bw()+
geom_abline(intercept = coef(mod)[1],slope = coef(mod)[2])+
xlim(c(min(x),10))
```
Suppose we want to calculate the model's predicted probability of being recidivated in Year 2 for a different number of dependents a parolee has. Let's assume that the number of dependents can be from 0 to 10. What would be the predicted probability of being recidivated in Year 2 for a parolee with eight dependents?
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show',message=FALSE, warning=FALSE}
X <- data.frame(Dependents = 0:10)
predict(mod,newdata = X)
```
It is not reasonable for a probability prediction to be negative. One of the major issues with using linear regression to predict a binary outcome using a linear-probability model is that the model predictions can go outside of the boundary [0,1] and yield unreasonable predictions. So, a linear regression model may not be the best tool to predict a binary outcome. We should use a model that respects the boundaries of the outcome variable.
## 1.2. Description of Logistic Regression Model
To overcome the limitations of the linear probability model, we bundle our prediction model in a sigmoid function. Suppose there is a real-valued function of $a$ such that
$$ f(a) = \frac{e^a}{1 + e^a}. $$
The output of this function is always between 0 and 1, regardless of the value of $a$, which makes the sigmoid function an appropriate choice for logistic regression: it ensures that the predicted probability is always bounded between 0 and 1.
Note that this function can also be written as the following, and they are mathematically equivalent.
$$ f(a) = \frac{1}{1 + e^{-a}}. $$
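As a quick numerical check, the short sketch below evaluates both forms of the sigmoid at a few values of $a$ and confirms that they return identical results.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
a <- seq(-5, 5, 1)

f1 <- exp(a)/(1 + exp(a))   # first form
f2 <- 1/(1 + exp(-a))       # algebraically equivalent form

all.equal(f1, f2)
```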
```{r, echo=FALSE,eval=TRUE,message=FALSE, warning=FALSE}
fx <- function(x) { exp(x)/(1+exp(x))}
ggplot()+
geom_function(fun=fx)+
xlim(c(-10,10))+
ylim(c(0,1))+
theme_bw()+
xlab('a')+
ylab('f(a)')+
ggtitle('Sigmoid function')
```
If we revisit the previous example, we can specify a logistic regression model to predict the probability of being recidivated in Year 2 as the following by using the number of dependents a parolee has as the predictor,
$$P(Y=1) = \frac{1}{1 + e^{-(\beta_0+\beta_1X)}}.$$
When the values of a predictor variable are entered into the equation, the model output can be directly interpreted as the probability of the binary outcome being equal to 1 (or whatever category a value of 1 represents). Then, we assume that the actual outcome follows a binomial distribution with the predicted probability.
$$ P(Y=1) = p $$
$$ Y \sim Binomial(p)$$
Suppose the coefficient estimates of this model are $\beta_0$=1.33 and $\beta_1$=-1.62. Then, for instance, we can compute the probability of being recidivated for a parolee with 8 dependents as the following:
$$P(Y=1) = \frac{1}{1+e^{-(1.33-1.62 \times 8)}} = 0.0000088951098.$$
```{r, echo=FALSE,eval=FALSE,message=FALSE, warning=FALSE}
mod <- glm(Recidivism_Arrest_Year2 ~ 1 + Dependents,
family = 'binomial',
data = recidivism_sub)
summary(mod)
```
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
b0 = 1.33
b1 = -1.62
x = 0:10
y = 1/(1+exp(-(b0+b1*x)))
data.frame(number.of.dependents = x, probability=y)
```
```{r, echo=FALSE,eval=TRUE,message=FALSE, warning=FALSE}
b0 = 1.33
b1 = -1.62
fx <- function(x) {1/(1+exp(-(b0+b1*x)))}
ggplot()+
geom_function(fun=fx)+
xlim(c(0,10))+
ylim(c(0,1))+
theme_bw()+
xlab('Number of Dependents')+
ylab('Probability of Being Recidivated')+
ggtitle('')
```
In its original form, the logistic regression parameters are difficult to interpret because, due to the nonlinear nature of the sigmoid function, a one-unit increase in the predictor is no longer linearly related to the probability of the outcome being equal to 1.
The most common presentation of logistic regression is obtained after a bit of algebraic manipulation to rewrite the model equation. The logistic regression model above can also be specified as the following without any loss of meaning as they are mathematically equivalent.
$$ln \left [ \frac{P(Y=1)}{1-P(Y=1)} \right] = \beta_0+\beta_1X.$$
The term on the left side of the equation is known as the **logit**. It is essentially the natural logarithm of odds. So, when the outcome is a binary variable, the logit transformation of the probability that the outcome is equal to 1 can be represented as a linear equation. It provides a more straightforward interpretation. For instance, we can now say that when the number of dependents is equal to zero, the predicted logit is equal to 1.33 (intercept), and for every additional dependent, the logit decreases by 1.62 (slope).
It is common to transform the logit to odds when interpreting the parameters. For instance, we can say that when the number of dependents is equal to zero, the odds of being recidivated are 3.78 ($e^{1.33}$), and for every additional dependent the odds of being recidivated are reduced by about 80% ($1 - e^{-1.62} \approx 0.80$).
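These quantities can be verified directly from the coefficient estimates with a couple of lines of R code.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
# Odds of being recidivated when the number of dependents is zero
exp(1.33)

# Multiplicative change in the odds for one additional dependent (the odds ratio)
exp(-1.62)

# ... expressed as a proportional reduction in the odds
1 - exp(-1.62)
```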
The right side of the equation can be expanded by adding more predictors, adding polynomial terms of the predictors, or adding interactions among predictors. A model with only the main effects of $P$ predictors can be written as
$$ ln \left [ \frac{P(Y=1)}{1-P(Y=1)} \right] = \beta_0 + \sum_{p=1}^{P} \beta_pX_{p},$$
and the coefficients can be interpreted as
- $\beta_0$: the predicted logit when the values of all the predictor variables in the model are equal to zero. Correspondingly, $e^{\beta_0}$ is the predicted odds of the outcome being equal to 1 when all the predictor variables are equal to zero.
- $\beta_p$: the change in the predicted logit for a one-unit increase in $X_p$ when all other predictors in the model are held constant. Correspondingly, for every one-unit increase in $X_p$, the odds of the outcome being equal to 1 are multiplied by $e^{\beta_p}$ when all other predictors are held constant. In other words, $e^{\beta_p}$ is the odds ratio, the ratio of the odds at $X_p = a+1$ to the odds at $X_p = a$.
It is essential to become familiar with these three concepts (probability, odds, and logit) and how they relate to each other in order to interpret the logistic regression parameters. The figure and the small conversion sketch below illustrate the relationship.
```{r, echo=FALSE,eval=TRUE,fig.align='center',fig.height=5,fig.width=12}
knitr::include_graphics(here('figs/pr-odd-logit.PNG'))
```
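The conversions among these three scales are easy to carry out in R. Below is a minimal sketch using a probability of 0.25 as an arbitrary example value.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
p     <- 0.25          # probability (an arbitrary example value)
odds  <- p/(1 - p)     # odds
logit <- log(odds)     # logit (log-odds)

c(probability = p, odds = odds, logit = logit)

# Going back from the logit to the probability with the sigmoid function
exp(logit)/(1 + exp(logit))
```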
***
<div id="infobox">
<center style="color:black;"> **NOTE** </center>
The sigmoid function is not the only tool for modeling a binary outcome. One can also use the cumulative distribution function of the standard normal distribution, $\Phi(a)$, whose output is also bounded between 0 and 1. When $\Phi$ is used to transform the prediction model, this is known as **probit regression**, and it serves the same purpose as logistic regression: predicting the probability that a binary outcome is equal to 1. However, logarithmic functions are easier to work with and have better computational properties, so logistic regression is used more commonly than probit regression.
</div>
***
## 1.3. Model Estimation
### 1.3.1. The concept of likelihood
It is essential to understand the **likelihood** concept for estimating the coefficients of a logistic regression model. We will consider a simple example of flipping coins for this.
Suppose you flip the same coin ten times and observe the following outcomes. We don't know whether this is a fair coin, for which the probability of observing a head or a tail would be equal to 0.5.
$$\mathbf{Y} = \left ( H,H,H,T,H,H,H,T,H,T \right )$$
Suppose we define $p$ as the probability of observing a head when we flip this coin. By definition, the probability of observing a tail is $1-p$.
$$P(Y=H) = p$$
$$P(Y=T) = 1 - p$$
Then, we can calculate the likelihood of our observations of heads and tails as a function of $p$.
$$ \mathfrak{L}(\mathbf{Y}|p) = p \times p \times p \times (1-p) \times p \times p \times p \times (1-p) \times p \times (1-p) $$
$$ \mathfrak{L}(\mathbf{Y}|p) = p^7 \times (1-p)^3 $$
For instance, if we say that this is a fair coin and, therefore, $p$ is equal to 0.5, then the likelihood of observing seven heads and three tails would be equal to
$$ \mathfrak{L}(\mathbf{Y}|p = 0.5) = 0.5^7 \times (1-0.5)^3 = 0.0009765625$$
On the other hand, another person can say this is probably not a fair coin, and the $p$ should be something higher than 0.5. How about 0.65?
$$ \mathfrak{L}(\mathbf{Y}|p = 0.65) = 0.65^7 \times (1-0.65)^3 = 0.00210183$$
Based on our observation, we can say that an estimate of $p$ being equal to 0.65 is more likely than an estimate of $p$ being equal to 0.5. Our observation of 7 heads and 3 tails is more likely if we estimate $p$ as 0.65 rather than 0.5.
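These two likelihood values can be verified with a quick computation.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
# Likelihood of observing 7 heads and 3 tails under two candidate values of p
0.50^7 * (1 - 0.50)^3
0.65^7 * (1 - 0.65)^3
```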
### 1.3.2. Maximum likelihood estimation (MLE)
Then, what would be the best estimate of $p$ given our observed data (seven heads and three tails)? We can try every possible value of $p$ between 0 and 1 and calculate the likelihood of our data ($\mathbf{Y}$) for each. Then, we can pick the value that makes our data most likely to be observed (the largest likelihood) as our best estimate. This value is called the maximum likelihood estimate of $p$, given the data we observed (seven heads and three tails).
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show',message=FALSE, warning=FALSE}
p <- seq(0,1,.001)
L <- p^7*(1-p)^3
ggplot()+
geom_line(aes(x=p,y=L)) +
theme_bw() +
xlab('Probability of Observing a Head (p)')+
ylab('Likelihood of Observing 7 Heads and 3 Tails')+
geom_vline(xintercept=p[which.max(L)],lty=2)
```
We can show that the $p$ value that makes the likelihood largest is 0.7, and the likelihood of observing seven heads and three tails is 0.002223566 when $p$ is equal to 0.7. Therefore, the maximum likelihood estimate of the probability of observing a head for this particular coin is 0.7, given the ten observations we have made.
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show',message=FALSE, warning=FALSE}
L[which.max(L)]
p[which.max(L)]
```
Note that our estimate can change and be updated if we continue collecting more data by flipping the same coin and recording our observations.
### 1.3.3. The concept of the log-likelihood
The computation of the likelihood requires multiplying many values between 0 and 1, and such a product gets smaller and smaller as more terms are multiplied. This eventually creates problems because of the limited precision with which any computer can represent numbers. For instance, you can see the smallest positive number that can be represented in R on your local machine.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
.Machine$double.xmin
```
When you have hundreds of thousands of observations, it is probably not a good idea to work directly with the likelihood. Instead, we work with the log of the likelihood (the log-likelihood), which has several advantages:
- We are less concerned about the precision of small numbers our computer can handle.
- Log-likelihood has better mathematical properties for optimization problems (the log of the product of two numbers equals the sum of the log of the two numbers).
- The value of $p$ that maximizes the likelihood is the same value that maximizes the log-likelihood, so our end result (the maximum likelihood estimate) does not change if we use the log-likelihood instead of the likelihood.
$$ ln(\mathfrak{L}(\mathbf{Y}|p)) = ln(p^7 \times (1-p)^3) $$
$$ ln(\mathfrak{L}(\mathbf{Y}|p)) = ln(p^7) + ln((1-p)^3) $$
$$ ln(\mathfrak{L}(\mathbf{Y}|p)) = 7 \times ln(p) + 3 \times ln(1-p) $$
```{r, echo=TRUE,eval=TRUE ,message=FALSE, warning=FALSE}
p <- seq(0,1,.001)
logL <- log(p)*7 + log(1-p)*3
ggplot()+
geom_line(aes(x=p,y=logL)) +
theme_bw() +
xlab('Probability of Observing a Head (p)')+
ylab('Loglikelihood of Observing 7 Heads and 3 Tails')+
geom_vline(xintercept=p[which.max(logL)],lty=2)
```
```{r, echo=TRUE,eval=TRUE ,message=FALSE, warning=FALSE}
logL[which.max(logL)]
p[which.max(logL)]
```
### 1.3.4. MLE for Logistic Regression coefficients
Now, we can apply these concepts to estimate the logistic regression coefficients. Let's revisit our previous example in which we predict the probability of being recidivated in Year 2, given the number of dependents a parolee has. Our model can be written as the following.
$$ln \left [ \frac{P_i(Y=1)}{1-P_i(Y=1)} \right] = \beta_0+\beta_1X_i.$$
Note that $X$ and $P$ have a subscript $i$ to indicate that each individual may have a different X value, and therefore each individual will have a different probability. Our observed outcome is a set of 0s and 1s. Remember that eight individuals were recidivated (Y=1), and 12 were not recidivated (Y=0).
```{r, echo=TRUE,eval=TRUE ,message=FALSE, warning=FALSE}
recidivism_sub$Recidivism_Arrest_Year2
```
Given a set of coefficients, {$\beta_0,\beta_1$}, we can calculate the logit for every observation using the model equation and then transform this logit to a probability, $P_i(Y=1)$. Finally, we can calculate the log of the probability for each observation and sum them across observations to obtain the log-likelihood of observing this set of observations (12 zeros and eight ones). Suppose that we have two guesstimates for {$\beta_0,\beta_1$}, which are 0.5 and -0.8, respectively. These coefficients imply the following predicted model.
```{r, echo=FALSE,eval=TRUE, message=FALSE, warning=FALSE}
b0 = 0.5
b1 = -0.8
fx <- function(x) {exp(b0+b1*x)/(1+exp(b0+b1*x))}
ggplot()+
geom_function(fun=fx)+
xlim(c(0,10))+
ylim(c(0,1))+
theme_bw()+
xlab('Number of Dependents')+
ylab('Probability of Being Recidivated')+
ggtitle('')
```
If these two coefficients were our estimates, how likely would we observe the outcome in our data, given the number of dependents? The below R code first finds the predicted logit for every observation, assuming that $\beta_0$ = 0.5 and $\beta_1$ = -0.8.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
b0 = 0.5
b1 = -0.8
x = recidivism_sub$Dependents
y = recidivism_sub$Recidivism_Arrest_Year2
pred_logit <- b0 + b1*x
pred_prob1 <- exp(pred_logit)/(1+exp(pred_logit))
pred_prob0 <- 1 - pred_prob1
data.frame(Dependents = x,
Recidivated = y,
Prob1 = pred_prob1,
Prob0 = pred_prob0)
```
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show',message=FALSE, warning=FALSE}
logL <- y*log(pred_prob1) + (1-y)*log(pred_prob0)
sum(logL)
```
We can summarize this by saying that if our model coefficients were $\beta_0$ = 0.5 and $\beta_1$ = -0.8, then the log of the likelihood of observing the outcome in our data would be -9.25.
$$\mathbf{Y} = \left ( 1,0,1,0,0,0,0,1,1,0,0,1,0,0,0,1,0,0,0,0 \right )$$
$$ \mathfrak{logL}(\mathbf{Y}|\beta_0 = 0.5,\beta_1 = -0.8) = -9.25$$
The critical question is whether or not there is another pair of values we can assign to $\beta_0$ and $\beta_1$ that would yield a higher likelihood for the data. If there is, that pair would provide better estimates for our model. If we can find the pair with the maximum log-likelihood of the data, those values are our maximum likelihood estimates for the given model.
We can approach this problem crudely to gain some intuition about maximum likelihood estimation. Suppose that a reasonable range of values for $\beta_0$ is from -3 to 3, and the same for $\beta_1$. Let's consider every possible combination of values for $\beta_0$ and $\beta_1$ within these ranges in increments of .01. Then, let's calculate the log-likelihood of the data for every combination and plot these in a 3D plot as a function of $\beta_0$ and $\beta_1$.
```{r, echo=TRUE,eval=FALSE, message=FALSE, warning=FALSE,fig.width=8,fig.height=8}
grid <- expand.grid(b0=seq(-3,3,.01),b1=seq(-3,3,.01))
grid$logL <- NA
for(i in 1:nrow(grid)){
x = recidivism_sub$Dependents
y = recidivism_sub$Recidivism_Arrest_Year2
pred_logit <- grid[i,]$b0 + grid[i,]$b1*x
pred_prob1 <- exp(pred_logit)/(1+exp(pred_logit))
pred_prob0 <- 1 - pred_prob1
logL <- y*log(pred_prob1) + (1-y)*log(pred_prob0)
grid[i,]$logL <- sum(logL)
print(i)
}
require(plotly)
plot_ly(grid, x = ~b0, y = ~b1, z = ~logL,
marker = list(color = ~logL,
showscale = FALSE,
cmin=min(grid$logL),
cmax=max(grid$logL),cauto=F),
width=600,height=600) %>%
add_markers()
```
```{r, echo=FALSE,eval=TRUE,message=FALSE, warning=FALSE,fig.width=8,fig.height=8}
#write.csv(grid,
# './data/grid_logistic_regression.csv',
# row.names = FALSE)
grid <- read.csv('./data/grid_logistic_regression.csv',header=TRUE)
plot_ly(grid, x = ~b0, y = ~b1, z = ~logL,
marker = list(color = ~logL,
showscale = FALSE,
cmin=min(grid$logL),
cmax=max(grid$logL),cauto=F),
width=600,height=600) %>%
add_markers()
```
What is the maximum point of this surface? Our simple search indicates that it is -8.30, and the set of $\beta_0$ and $\beta_1$ coefficients that make the observed data most likely is 1.33 and -1.62.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
grid[which.max(grid$logL),]
```
Therefore, given our dataset with 20 observations, our maximum likelihood estimates for the coefficients of the logistic regression model above are 1.33 and -1.62.
$$ln \left [ \frac{P_i(Y=1)}{1-P_i(Y=1)} \right] = 1.33 - 1.62 \times X_i.$$
### 1.3.5. Logistic Loss function
Below is a compact way of writing likelihood and log-likelihood in mathematical notation. For simplification purposes, we write $P_i$ to represent $P_i(Y=1)$.
$$ \mathfrak{L}(\mathbf{Y}|\boldsymbol\beta) = \prod_{i=1}^{N} P_i^{y_i} \times (1-P_i)^{1-y_i}$$
$$ \mathfrak{logL}(\mathbf{Y}|\boldsymbol\beta) = \sum_{i=1}^{N} Y_i \times ln(P_i) + (1-Y_i) \times ln(1-P_i)$$
The final equation above, $\mathfrak{logL}(\mathbf{Y}|\boldsymbol\beta)$, is known as the **logistic loss** function.
By finding the set of coefficients in a model, $\boldsymbol\beta = (\beta_0, \beta_1,...,\beta_P)$, that maximizes this quantity, we obtain the maximum likelihood estimates of the coefficients for the logistic regression model.
Unfortunately, the naive grid search we applied above would be inefficient for a complex model with many predictors. Another complication is that, unlike linear regression, logistic regression has no closed-form solution. Therefore, the only way to estimate the logistic regression coefficients is to use numerical approximation and computational algorithms to maximize the logistic loss function. Luckily, we have tools available to accomplish this task.
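For instance, a general-purpose optimizer such as `optim()` can maximize the logistic loss for our small example. The code below is a minimal sketch rather than the exact routine used by `glm()` or `glmnet()`; because `optim()` minimizes by default, we minimize the negative log-likelihood.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
# Negative logistic log-likelihood as a function of the coefficients
negLL <- function(beta, x, y){
  p <- 1/(1 + exp(-(beta[1] + beta[2]*x)))
  -sum(y*log(p) + (1 - y)*log(1 - p))
}

fit <- optim(par = c(0, 0),
             fn  = negLL,
             x   = recidivism_sub$Dependents,
             y   = recidivism_sub$Recidivism_Arrest_Year2)

fit$par      # estimates of beta0 and beta1 (close to 1.33 and -1.62)
-fit$value   # maximized log-likelihood (close to -8.3)
```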
***
<div id="infobox">
<center style="color:black;"> **NOTE** </center>
Why do we not use least square estimation and minimize the sum of squared residuals when estimating the coefficients of the logistic regression model? We can certainly use the sum of squared residuals as our loss function and minimize it to estimate the coefficients for the logistic regression, just like we did for the linear regression. The complication is that the sum of the squared residuals function yields a non-convex surface when the outcome is binary as opposed to a convex surface obtained from the logistic loss function. Non-convex optimization problems are more challenging than convex optimization problems, and they are more vulnerable to finding sub-optimal solutions (local minima/maxima). Therefore, the logistic loss function and maximizing it is preferred when estimating the coefficients of a logistic regression model.
```{r, echo=FALSE,eval=FALSE, message=FALSE, warning=FALSE,fig.width=8,fig.height=8}
# If you are interested, I replicated the previous crude grid search by using the sum of #squared residuals as the loss function. As you see, the least-square estimates are very #close to the maximum-likelihood estimates. It is not surprising since this is a simple #model with only one predictor.
grid <- expand.grid(b0=seq(-3,3,.01),b1=seq(-3,3,.01))
grid$SSR <- NA
for(i in 1:nrow(grid)){
x = recidivism_sub$Dependents
y = recidivism_sub$Recidivism_Arrest_Year2
pred_logit <- grid[i,]$b0 + grid[i,]$b1*x
pred_prob1 <- exp(pred_logit)/(1+exp(pred_logit))
grid[i,]$SSR <- sum((y - pred_prob1)^2)
print(i)
}
require(plotly)
plot_ly(grid, x = ~b0, y = ~b1, z = ~SSR,
marker = list(color = ~SSR,
showscale = FALSE,
cmin=min(grid$SSR),
cmax=max(grid$SSR),cauto=F),
width=600,height=600) %>%
add_markers()
```
```{r, echo=FALSE,eval=FALSE,message=FALSE, warning=FALSE,fig.width=8,fig.height=8}
grid <- read.csv('./data/grid_logistic_regression_ssr.csv',header=TRUE)
plot_ly(grid, x = ~b0, y = ~b1, z = ~SSR,
marker = list(color = ~SSR,
showscale = FALSE,
cmin=min(grid$SSR),
cmax=max(grid$SSR),cauto=F),
width=600,height=600) %>%
add_markers()
```
```{r, echo=FALSE,eval=FALSE,message=FALSE, warning=FALSE}
grid[which.min(grid$SSR),]
```
</div>
***
### 1.3.6. The `glm` function
The `glm()` function, part of the base `stats` package, can be used to estimate the logistic regression coefficients. Its use is very similar to the `lm()` function. The only difference is that we specify the `family='binomial'` argument to fit the logistic regression by maximizing the logistic loss function.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE,fig.width=8,fig.height=8}
mod <- glm(Recidivism_Arrest_Year2 ~ 1 + Dependents,
data = recidivism_sub,
family = 'binomial')
summary(mod)
```
In the **Coefficients** table, the numbers under the **Estimate** column are the estimated coefficients of the logistic regression model. The quantity labeled **Residual Deviance** in the output is $-2$ times the maximized log-likelihood,
$$ Deviance = -2 \times \mathfrak{logL}(\mathbf{Y}|\boldsymbol\beta). $$
Notice that the coefficient estimates from the `glm()` function are very close to our crude estimates from a brute-force search in an earlier section (1.33 and -1.62). From our simple search, we found the maximum log-likelihood as -8.3. So, if we multiply that number by -2, that equals 16.6, which is the number reported in this output as **Residual Deviance**.
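We can verify this relationship directly from the fitted `glm` object: `logLik()` extracts the maximized log-likelihood and `deviance()` extracts the residual deviance.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
logLik(mod)                  # maximized log-likelihood
-2*as.numeric(logLik(mod))   # equals the residual deviance
deviance(mod)
```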
### 1.3.7. The `glmnet` function
You can also use the `glmnet()` function from the `glmnet` package to fit the logistic regression. The advantage of the `glmnet` package is that it allows regularization while fitting the logistic regression. You can set the `alpha=0` and `lambda=0` arguments to obtain the coefficient estimates without penalty.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
require(glmnet)
mod <- glmnet(x = cbind(0,recidivism_sub$Dependents),
y = factor(recidivism_sub$Recidivism_Arrest_Year2),
family = 'binomial',
alpha = 0,
lambda = 0,
intercept = TRUE)
coef(mod)
```
The `x` argument is the input matrix of predictors, and the `y` argument is the vector of binary outcomes. `glmnet` requires the `y` argument to be a factor with two levels.
Note that I defined the `x` argument above as `cbind(0,recidivism_sub$Dependents)` because `glmnet` requires `x` to be a matrix with at least two columns. So, I added a column of zeros to trick the function and force it to run. That column of zeros has no impact on the estimation.
## 1.4. Evaluating the Predictive Performance for Logistic Regression Models
When the outcome is a binary variable, classification models such as logistic regression typically yield a probability estimate of class membership (or a continuous-valued prediction between 0 and 1).
For instance, we can obtain the model-predicted probabilities of being recidivated in Year 2 after fitting a simple logistic regression with the number of dependents as the predictor.
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show',message=FALSE, warning=FALSE,fig.width=8,fig.height=8}
mod <- glm(Recidivism_Arrest_Year2 ~ 1 + Dependents,
data = recidivism_sub,
family = 'binomial')
recidivism_sub$pred_prob <- predict(mod,type='response')
recidivism_sub[,c('ID','Dependents','Recidivism_Arrest_Year2','pred_prob')]
```
In an ideal situation where a model does a perfect job of predicting a binary outcome, we expect all those observations in Group 0 (Not Recidivated) to have a predicted probability of 0 and all those observations in Group 1 (Recidivated) to have a predicted probability of 1. So, predicted values close to 0 for observations in Group 0 and those close to 1 for Group 1 are desirable.
One way to look at the quality of predictions for a binary outcome is to examine the distribution of predictions for each category of the outcome variable. For this small demo set with 20 observations, the plot below shows that these distributions are not separated very well, which implies that our model predictions do not properly separate the two classes of the outcome variable. The more separation between the distributions of the two classes, the better the model performance.
**What would such a plot look like for a model with perfect classification?**
This visualization is a very intuitive initial look at the model performance. Later, we will see numerical summaries of the separation between these two distributions.
```{r, echo=FALSE,eval=TRUE,message=FALSE, warning=FALSE,fig.width=6,fig.height=6}
ggplot(data = recidivism_sub, aes(x=pred_prob,
lty=factor(Recidivism_Arrest_Year2,
levels = c(0,1),
labels = c('Not Recidivated','Recidivated'))))+
geom_density()+
theme_bw()+
xlab('Model Predictions')+
labs(lty='')+
scale_linetype_manual(values = c(1,2))
```
### 1.4.1. Accuracy of Class Probabilities
We can calculate MAE and RMSE to measure the accuracy of predicted probabilities when the outcome is binary, and they have the same definition as in the continuous case.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
outs <- recidivism_sub$Recidivism_Arrest_Year2
preds <- recidivism_sub$pred_prob
# Mean absolute error
mean(abs(outs - preds))
# Root mean squared error
sqrt(mean((outs - preds)^2))
```
### 1.4.2. Accuracy of Class Predictions
#### Confusion Matrix and Related Metrics
In most situations, the accuracy of class probabilities is not very useful by itself because, for practical reasons, one has to place observations into groups based on these probabilities. For instance, suppose you are predicting whether or not an individual will be recidivated so you can take action based on this decision. Then, you need to transform the continuous probability (or probability-like score) predicted by the model into a binary prediction, which requires choosing a cut-off value. Once a cut-off value is determined, we can generate class predictions.
Consider that we use a cut-off value of 0.5. So, if an observation has a predicted class probability less than 0.5, we predict that this person is in Group 0 (Not Recidivated). Likewise, if an observation has a predicted class probability higher than 0.5, we predict that this person is in Group 1.
In the code below, I convert the class probabilities to class predictions using a cut-off value of 0.5.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
recidivism_sub$pred_class <- ifelse(recidivism_sub$pred_prob>.5,1,0)
recidivism_sub[,c('ID','Dependents','Recidivism_Arrest_Year2','pred_prob','pred_class')]
```
We can summarize the relationship between the binary outcome and the binary predictions in a 2 x 2 table, commonly referred to as a **confusion matrix**.
```{r, echo=TRUE,eval=TRUE,class.source='klippy',class.source = 'fold-show',message=FALSE, warning=FALSE}
tab <- table(recidivism_sub$pred_class,
recidivism_sub$Recidivism_Arrest_Year2,
dnn = c('Predicted','Observed'))
tab
```
The table indicates that there are 12 observations with an outcome value of 0. The model accurately predicts the class membership as 0 for ten of these observations, while it inaccurately predicts the class membership as 1 for two of them. The table also indicates that there are eight observations with an outcome value of 1. The model accurately predicts the class membership as 1 for five of these observations, while it inaccurately predicts the class membership as 0 for three of them.
Based on the elements of this table, we can define four key concepts:
- **True Positives(TP)**: True positives are the observations where both the outcome and prediction are equal to 1.
- **True Negative(TN)**: True negatives are the observations where both the outcome and prediction are equal to 0.
- **False Positives(FP)**: False positives are the observations where the outcome is 0 but the prediction is 1.
- **False Negatives(FN)**: False negatives are the observations where the outcome is 1 but the prediction is 0.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
tn <- tab[1,1]
tp <- tab[2,2]
fp <- tab[2,1]
fn <- tab[1,2]
```
Several metrics can be defined based on these four variables. We define a few important metrics below.
- **Accuracy**: Overall accuracy simply represents the proportion of correct predictions.
$$ACC = \frac{TP + TN}{TP + TN + FP + FN}$$
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
acc <- (tp + tn)/(tp+tn+fp+fn)
acc
```
- **True Positive Rate (Sensitivity)**: The true positive rate (a.k.a. sensitivity) is the proportion of correct predictions among the observations for which the outcome is 1 (the event is observed).
$$TPR = \frac{TP}{TP + FN}$$
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
tpr <- (tp)/(tp+fn)
tpr
```
- **True Negative Rate (Specificity)**: The true negative rate (a.k.a. specificity) is the proportion of correct predictions among the observations for which the outcome is 0 (the event is not observed).
$$TNR = \frac{TN}{TN + FP}$$
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
tnr <- (tn)/(tn+fp)
tnr
```
- **Positive Predictive Value (Precision)**: The positive predictive value (a.k.a. precision) is the proportion of correct predictions among the observations for which the model predicts that the outcome is 1.
$$PPV = \frac{TP}{TP + FP}$$
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
ppv <- (tp)/(tp+fp)
ppv
```
- **F1 score**: The F1 score is a metric that combines PPV and TPR (it is their harmonic mean).
$$F1 = 2*\frac{PPV*TPR}{PPV + TPR}$$
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
f1 <- (2*ppv*tpr)/(ppv+tpr)
f1
```
#### Area Under the Receiver Operating Curve (AUC or AUROC)
The confusion matrix and related metrics all depend on the arbitrary cut-off value one picks when transforming continuous predicted probabilities to binary predicted classes. We can change the cut-off value to optimize certain metrics, and there is always a trade-off between these metrics for different cut-off values. For instance, let's pick different cut-off values and calculate these metrics for each one.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
# Write a generic function to return the metric for a given vector of observed
# outcome, predicted probability and cut-off value
cmat <- function(x,y,cut){
# x, a vector of predicted probabilities
# y, a vector of observed outcomes
# cut, user-defined cut-off value
x_ <- ifelse(x>cut,1,0)
tn <- sum(x_==0 & y==0)
tp <- sum(x_==1 & y==1)
fp <- sum(x_==1 & y==0)
fn <- sum(x_==0 & y==1)
acc <- (tp + tn)/(tp+tn+fp+fn)
tpr <- (tp)/(tp+fn)
tnr <- (tn)/(tn+fp)
ppv <- (tp)/(tp+fp)
f1 <- (2*ppv*tpr)/(ppv+tpr)
return(list(acc=acc,tpr=tpr,tnr=tnr,ppv=ppv,f1=f1))
}
# Try it out
#cmat(x=recidivism_sub$pred_prob,
# y=recidivism_sub$Recidivism_Arrest_Year2,
# cut=0.5)
# Do it for different cut-off values
metrics <- data.frame(cut=seq(0,1,0.01),
acc=NA,
tpr=NA,
tnr=NA,
ppv=NA,
f1=NA)
for(i in 1:nrow(metrics)){
cmat_ <- cmat(x = recidivism_sub$pred_prob,
y = recidivism_sub$Recidivism_Arrest_Year2,
cut = metrics[i,1])
metrics[i,2:6] = c(cmat_$acc,cmat_$tpr,cmat_$tnr,cmat_$ppv,cmat_$f1)
}
metrics
```
Notice the trade-off between TPR (sensitivity) and TNR (specificity). The more conservative cut-off value we pick (a higher probability), the higher TNR is and the lower TPR is.
A receiver operating characteristic (ROC) curve is a plot that represents this dynamic relationship between TPR and TNR across varying cut-off values. It looks a little jagged here because we only have 20 observations.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE,fig.width=6,fig.height=6}
ggplot(data = metrics, aes(x=1-tnr,y=tpr))+
geom_line()+
xlab('1-TNR (FPR)')+
ylab('TPR')+
geom_abline(lty=2)+
theme_bw()
```
The ROC curve is a useful plot for assessing a model's predictive power as well as for choosing an optimal cut-off value. The area under the ROC curve (AUC or AUROC) is typically used to evaluate the predictive power of classification models. The diagonal line in this plot represents a hypothetical model with no predictive power, and the AUC for the diagonal line is 0.5 (half of the whole square). The more the ROC curve resembles the diagonal line, the less predictive power the model has. AUC is not easy to calculate by hand or with a simple formula, as it requires numerical approximation. There are many alternatives in R for calculating AUC. We will use the `cutpointr` package, as it also provides other tools for selecting an optimal cut-off point.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE,fig.width=6,fig.height=6}
# install.packages('cutpointr')
require(cutpointr)
cut.obj <- cutpointr(x = recidivism_sub$pred_prob,
class = recidivism_sub$Recidivism_Arrest_Year2)
plot(cut.obj)
cut.obj$optimal_cutpoint
cut.obj$AUC
```
We see that the AUC for the predictions in this small dataset is 0.854, which can be considered good in terms of predictive power. The closer the AUC is to 1, the more predictive power the model has; the closer the AUC is to 0.5, the closer the predictions are to random guessing. The magnitude of the AUC is also closely related to how well the predicted probabilities for the two classes are separated.
The `cutpointr` package provides more in terms of finding optimal values to maximize certain metrics. For instance, suppose we want to find the optimal cut-off value that maximizes the sum of specificity and sensitivity.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE,fig.width=6,fig.height=6}
cut.obj <- cutpointr(x = recidivism_sub$pred_prob,
class = recidivism_sub$Recidivism_Arrest_Year2,
method = maximize_metric,
metric = sum_sens_spec)
cut.obj$optimal_cutpoint
plot(cut.obj)
```
You can also create custom cost functions to maximize if you can attach a dollar value to TP, FP, TN, and FN. For instance, suppose that a true-positive prediction made by the model brings a profit of 5 dollars, and a false-positive prediction costs 1 dollar. A true negative or a false negative brings no cost and no profit. Then, you can find an optimal cut-off value that maximizes the total profit.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE,fig.width=6,fig.height=6}
# Custom function
cost <- function(tp,fp,tn,fn,...){
my.cost <- matrix(5*tp - 1*fp + 0*tn + 0*fn, ncol = 1)
colnames(my.cost) <- 'my.cost'
return(my.cost)
}
cut.obj <- cutpointr(x = recidivism_sub$pred_prob,
class = recidivism_sub$Recidivism_Arrest_Year2,
method = maximize_metric,
metric = cost)
cut.obj$optimal_cutpoint
plot(cut.obj)
```
Check [this page](https://cran.r-project.org/web/packages/cutpointr/vignettes/cutpointr.html) to find more information about the `cutpointr` package.
## 1.5. Building a Prediction Model for Recidivism using the `caret` package
Please review the following notebook that builds a classification model using the logistic regression for the full recidivism dataset.
[Building a Logistic Regression Model](https://www.kaggle.com/code/uocoeeds/building-a-logistic-regression-model/notebook)
# 2. Regularization in Logistic Regression
Regularization works in logistic regression just as it does in linear regression. We add penalty terms to the loss function to avoid large coefficients, reducing model variance in exchange for some added bias. By tuning the degree of the penalty, we can typically obtain models with better performance than a logistic regression with no regularization.
## 2.1 Ridge Penalty
The loss function with ridge penalty applied in logistic regression is the following:
$$ \mathfrak{logL}(\mathbf{Y}|\boldsymbol\beta) = \left ( \sum_{i=1}^{N} Y_i \times ln(P_i) + (1-Y_i) \times ln(1-P_i) \right ) - \frac{\lambda}{2} \sum_{p=1}^{P}\beta_p^2$$
Notice that we subtract the penalty from the logistic loss because we maximize this quantity. The penalty term has the same effect as in linear regression: it pulls the regression coefficients toward zero but does not make them exactly equal to zero. Below, you can see a plot of the change in the logistic regression coefficients for some of the variables from a model that will be fit in the next section, at increasing levels of the ridge penalty, followed by a small code sketch of a ridge-penalized fit on the demo data.
```{r, echo=FALSE,eval=TRUE,class.source='klippy',class.source = 'fold-show',message=FALSE, warning=FALSE,fig.width=8,fig.height=8}
coef_ridge <- read.csv('./data/ridge coefs.csv')
ggplot()+
geom_line(aes(x=seq(0,9.9,.1),y=as.numeric(coef_ridge[7,])))+
geom_line(aes(x=seq(0,9.9,.1),y=as.numeric(coef_ridge[3,])))+
geom_line(aes(x=seq(0,9.9,.1),y=as.numeric(coef_ridge[112,])))+
theme_bw()+
xlab('Lambda')+
ylab('Logistic regression coefficients')
```
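Below is a minimal sketch of a ridge-penalized logistic regression fit with `glmnet` on the small 20-observation demo data (the full model in the linked notebook uses many predictors and is tuned with `caret`). As before, a dummy column of zeros is added because `glmnet` requires at least two columns in `x`.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
require(glmnet)

x <- cbind(0, recidivism_sub$Dependents)
y <- factor(recidivism_sub$Recidivism_Arrest_Year2)

ridge <- glmnet(x = x, y = y,
                family = 'binomial',
                alpha  = 0,                            # alpha = 0 requests the ridge penalty
                lambda = seq(2, 0, length.out = 21))   # a decreasing sequence of penalty values

# Coefficient for Dependents across the lambda sequence (largest lambda first).
# Row 3 of the coefficient matrix; rows 1 and 2 are the intercept and the dummy zero column.
# The coefficient is pulled toward zero at larger lambda values but never becomes exactly zero.
round(coef(ridge)[3, ], 3)
```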
***
<div id="infobox">
<center style="color:black;"> **NOTE** </center>
The `glmnet` package divides the value of the loss function by the sample size ($N$) during the optimization (Equation 12 and 2-3 in [this paper](https://www.jstatsoft.org/article/view/v033i01)). It technically does not change anything in terms of the optimal solution. On the other hand, we should be aware that the ridge penalty applied by the `glmnet` package effectively becomes
$$N\frac{\lambda}{2}\sum_{p=1}^{P}\beta_p^2.$$
This information may be necessary while searching plausible values of $\lambda$ for the `glmnet` package.
</div>
***
### Building a Classification Model with Ridge Penalty for Recidivism using the `caret` package
Please review the following notebook that builds a classification model using the logistic regression with ridge penalty for the full recidivism dataset.
[Building a Classification Model with Ridge Penalty](https://www.kaggle.com/code/uocoeeds/building-a-classification-model-with-ridge-penalty/)
## 2.2. Lasso Penalty
The loss function with the lasso penalty applied in logistic regression is the following:
$$ \mathfrak{logL}(\mathbf{Y}|\boldsymbol\beta) = \left ( \sum_{i=1}^{N} Y_i \times ln(P_i) + (1-Y_i) \times ln(1-P_i) \right)- \lambda \sum_{p=1}^{P}|\beta_p|$$
The penalty term has the same effect as in linear regression: it sets regression coefficients exactly to zero when a sufficiently large $\lambda$ value is applied.
Below, you can see a plot of the change in the logistic regression coefficients for some of the variables from a model that will be fit in the next section, at increasing levels of the lasso penalty, followed by a small code sketch of a lasso-penalized fit on the demo data.
```{r, echo=FALSE,eval=TRUE,class.source='klippy',class.source = 'fold-show',message=FALSE, warning=FALSE,fig.width=8,fig.height=8}
coef_lasso <- read.csv('./data/lasso coefs.csv')
ggplot()+
geom_line(aes(x=seq(0,0.07,.001),y=as.numeric(coef_lasso[7,])))+
geom_line(aes(x=seq(0,0.07,.001),y=as.numeric(coef_lasso[3,])))+
geom_line(aes(x=seq(0,0.07,.001),y=as.numeric(coef_lasso[112,])))+
theme_bw()+
xlab('Lambda')+
ylab('Logistic regression coefficients')
```
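Below is a minimal sketch of a lasso-penalized logistic regression fit with `glmnet` on the small demo data (again, the full model in the linked notebook is tuned with `caret`). Using `glmnet`'s default $\lambda$ path makes the behavior easy to see: the path starts at a value large enough to set the coefficient exactly to zero.
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
require(glmnet)

x <- cbind(0, recidivism_sub$Dependents)
y <- factor(recidivism_sub$Recidivism_Arrest_Year2)

lasso <- glmnet(x = x, y = y,
                family = 'binomial',
                alpha  = 1)   # alpha = 1 requests the lasso penalty

# Coefficient for Dependents along the first steps of the default lambda path
# (row 3 of the coefficient matrix). It is exactly zero at the largest lambda
# values and becomes nonzero as lambda decreases.
head(cbind(lambda = lasso$lambda, Dependents = coef(lasso)[3, ]), 10)
```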
***
<div id="infobox">
<center style="color:black;"> **NOTE** </center>
Note that the `glmnet` package divides the value of the loss function by the sample size ($N$) during the optimization (Equation 12 and 2-3 in [this paper](https://www.jstatsoft.org/article/view/v033i01)). You should be aware that the lasso penalty applied by the `glmnet` package effectively becomes
$$N\lambda\sum_{p=1}^{P}|\beta_p|.$$
This information may be important while searching plausible values of $\lambda$ for the `glmnet` package.
</div>
***
### Building a Classification Model with Lasso Penalty for Recidivism using the `caret` package
Please review the following notebook that builds a classification model using the logistic regression with lasso penalty for the full recidivism dataset.
[Building a Classification Model with Lasso Penalty](https://www.kaggle.com/code/uocoeeds/building-a-classification-model-with-lasso-penalty)
## 2.3. Elastic Net
The loss function with the elastic net penalty applied is the following:
$$ \mathfrak{logL}(\mathbf{Y}|\boldsymbol\beta) = \left ( \sum_{i=1}^{N} Y_i \times ln(P_i) + (1-Y_i) \times ln(1-P_i) \right)- \left ((1-\alpha)\frac{\lambda}{2} \sum_{p=1}^{P}\beta_p^2 + \alpha \lambda \sum_{p=1}^{P}|\beta_p| \right)$$
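The $\alpha$ parameter mixes the two penalties: $\alpha = 0$ corresponds to ridge, $\alpha = 1$ to lasso, and values in between blend them. Below is a minimal sketch on the small demo data with $\alpha = 0.5$ (the linked notebook tunes both $\alpha$ and $\lambda$ with `caret` on the full dataset).
```{r, echo=TRUE,eval=TRUE,message=FALSE, warning=FALSE}
require(glmnet)

x <- cbind(0, recidivism_sub$Dependents)
y <- factor(recidivism_sub$Recidivism_Arrest_Year2)

# alpha = 0.5 gives equal weight to the ridge and lasso penalties
enet <- glmnet(x = x, y = y,
               family = 'binomial',
               alpha  = 0.5)

# Coefficient for Dependents along the first steps of the default lambda path
# (row 3 of the coefficient matrix; rows 1 and 2 are the intercept and the dummy zero column)
head(cbind(lambda = enet$lambda, Dependents = coef(enet)[3, ]), 10)
```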
### Building a Classification Model with Elastic Net for Recidivism using the `caret` package
Please review the following notebook that builds a classification model using the logistic regression with elastic net for the full recidivism dataset.
[Building a Classification Model with Elastic Net](https://www.kaggle.com/code/uocoeeds/building-a-classification-model-with-elastic-net)