(Regularized) Logistic Regression
Cengiz Zopluoglu
College of Education, University of Oregon
Nov 7 & 14, 2022 
 Eugene, OR

The goals for the next two weeks:

Overview of the Logistic Regression
- Linear Probability Model
- Model Description
- Model Estimation
- Model Performance Evaluation
Regularized Logistic Regression
- Ridge penalty
- Lasso penalty
- Elastic Net
Review of Kaggle notebooks for building classification models


Demo Dataset for Two Group Classification

A random sample of 20 observations from the Recidivism dataset

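library(here)   # here() builds the file path relative to the project root
recidivism_sub <- read.csv(here('data/recidivism_sub.csv'),
                           header=TRUE)
# Show the ID, the predictor, and the binary outcome
recidivism_sub[,c('ID',
                  'Dependents',
                  'Recidivism_Arrest_Year2')]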

      ID Dependents Recidivism_Arrest_Year2
1  21953          0                       1
2   8255          1                       1
3   9110          2                       0
4  20795          1                       0
5   5569          1                       1
6  14124          0                       1
7  24979          0                       1
8   4827          1                       1
9  26586          3                       0
10 17777          0                       0
11 22269          1                       0
12 25016          0                       0
13 24138          0                       1
14 12261          3                       0
15 15417          3                       0
16 14695          0                       1
17  4371          3                       0
18 13529          3                       0
19 25046          3                       0
20  5340          3                       0

table(recidivism_sub$Recidivism_Arrest_Year2)


 0  1 
12  8

The outcome variable is binary (1: Recidivated, 0: Not Recidivated).
In practice, the binary outcomes can be coded in various ways:
- True vs. False
- Yes vs. No
- Success vs. Failure
In this class, we assume that the outcome variable is coded as 0s and 1s, and the category we want to predict is 1.
The predictor is the number of dependents a parolee has at the time of release


Linear Probability Model

A linear probability model fits a typical regression model to a binary outcome.
When the outcome is binary, the predictions from a linear regression model can be considered as the probability of the outcome being equal to 1,

$\hat{Y} = \hat{P}(Y = 1) = β_{0} + β_{1} X$

mod <- lm(Recidivism_Arrest_Year2 ~ 1 + Dependents,
          data = recidivism_sub)
summary(mod)


Call:
lm(formula = Recidivism_Arrest_Year2 ~ 1 + Dependents, data = recidivism_sub)
Residuals:
    Min      1Q  Median      3Q     Max 
-0.7500 -0.0625  0.0000  0.2500  0.5000 
Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   0.7500     0.1295    5.79 0.000017 ***
Dependents   -0.2500     0.0682   -3.66   0.0018 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.391 on 18 degrees of freedom
Multiple R-squared:  0.427,    Adjusted R-squared:  0.395 
F-statistic: 13.4 on 1 and 18 DF,  p-value: 0.00178


Intercept (0.75): When the number of dependents is equal to 0, the predicted probability of being recidivated in Year 2 is 0.75.
Slope (-0.25): For every additional dependent the individual has (a one-unit increase in X), the predicted probability of being recidivated in Year 2 decreases by 0.25.


A major issue when using a linear regression model to predict a binary outcome is that the model predictions can fall outside the interval [0, 1], which is not a valid range for a probability.

# Predicted values from the linear model for 0 to 10 dependents
X <- data.frame(Dependents = 0:10)
cbind(0:10,round(predict(mod,newdata = X),3))

   [,1]  [,2]
1     0  0.75
2     1  0.50
3     2  0.25
4     3  0.00
5     4 -0.25
6     5 -0.50
7     6 -0.75
8     7 -1.00
9     8 -1.25
10    9 -1.50
11   10 -1.75

A linear regression model may not be the best tool to predict a binary outcome.


Overview of the Logistic Regression


Model Description

To overcome the limitations of the linear probability model, we wrap our prediction model in a sigmoid function.

$f (a) = \frac{e^{a}}{1 + e^{a}} .$

$f (a) = \frac{1}{1 + e^{- a}} .$

The output of this function is always between 0 and 1 regardless of the value of $a$ .
The sigmoid function is an appropriate choice for the logistic regression (but not the only one) because it assures that the output is always bounded between 0 and 1.
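
As a small illustration of the boundedness claim, the sketch below evaluates the sigmoid at a few input values. The name `sigmoid` and the input values are chosen here for illustration and do not come from the slides.

```r
# Sigmoid (logistic) function: maps a real number into the interval (0, 1)
sigmoid <- function(a) 1 / (1 + exp(-a))

a <- c(-10, -2, 0, 2, 10)   # illustrative inputs, not from the slides
p <- sigmoid(a)
round(p, 5)

# The two algebraic forms of the function give the same values
all.equal(p, exp(a) / (1 + exp(a)))
```

Large negative inputs give values near 0, large positive inputs give values near 1, and an input of 0 gives exactly 0.5.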


If we revisit the previous example, we can specify a logistic regression model to predict the probability of being recidivated in Year 2 as the following:

$P (Y = 1) = \frac{1}{1 + e^{- (β_{0} + β_{1} X)}} .$

The model output can be directly interpreted as the probability of the binary outcome being equal to 1

Then, we assume that the actual outcome follows a binomial distribution with the predicted probability.

$P (Y = 1) = p$

$Y \sim \text{Binomial}(p)$

Suppose the coefficient estimates of this model are

$β_{0} = 1.33$
$β_{1} = - 1.62$

The probability of being recidivated for a parolee with 8 dependents:

$P (Y = 1) = \frac{1}{1 + e^{- (1.33 - 1.62 \times 8)}} = 0.0000088951098.$


b0 = 1.33
b1 = -1.62
x = 0:10
y = 1/(1+exp(-(b0+b1*x)))

data.frame(number.of.dependents=x, 
           probability=y)

   number.of.dependents  probability
1                     0 0.7908406348
2                     1 0.4280038671
3                     2 0.1289808521
4                     3 0.0284705877
5                     4 0.0057659656
6                     5 0.0011463790
7                     6 0.0002270757
8                     7 0.0000449462
9                     8 0.0000088951
10                    9 0.0000017603
11                   10 0.0000003484


$P (Y = 1) = \frac{1}{1 + e^{- (β_{0} + β_{1} X)}} .$

In its original form, it is difficult to interpret the logistic regression parameters because a one unit increase in the predictor is no longer linearly related to the probability of the outcome being equal to 1.
The most common presentation of logistic regression is obtained after a bit of algebraic manipulation to rewrite the model equation.

$\ln \left[ \frac{P(Y = 1)}{1 - P(Y = 1)} \right] = β_{0} + β_{1} X .$

The term on the left side of the equation is known as the logit (natural logarithm of odds).


It is essential that you get familiar with the three concepts (probability, odds, logit) and how these three are related to each other for interpreting the logistic regression parameters.
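
A minimal sketch of how the three quantities map onto each other is given below. The helper names (`prob_to_odds`, `odds_to_logit`, `logit_to_prob`) and the example probabilities are assumptions made for illustration.

```r
# Hypothetical helper functions for moving between the three scales
prob_to_odds  <- function(p) p / (1 - p)
odds_to_logit <- function(odds) log(odds)
logit_to_prob <- function(logit) 1 / (1 + exp(-logit))

p     <- c(0.1, 0.25, 0.5, 0.75, 0.9)   # illustrative probabilities
odds  <- prob_to_odds(p)
logit <- odds_to_logit(odds)

data.frame(probability = p, odds = odds, logit = logit)

# Going back from the logit recovers the original probability
all.equal(logit_to_prob(logit), p)
```

A probability of 0.5 corresponds to odds of 1 and a logit of 0; probabilities above 0.5 give odds above 1 and positive logits, and probabilities below 0.5 give odds below 1 and negative logits.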


$\ln \left[ \frac{P(Y = 1)}{1 - P(Y = 1)} \right] = 1.33 - 1.62 X .$

When the number of dependents is equal to zero, the predicted logit is equal to 1.33 (intercept), and for every additional dependent, the logit decreases by 1.62 (slope).
It is also common to transform the logit to odds when interpreting the parameters.
- When the number of dependents is equal to zero, the odds of being recidivated are $e^{1.33} = 3.78$ .
- For every additional dependent, the odds of being recidivated are multiplied by $e^{- 1.62}$ .
- Odds ratio: $e^{- 1.62} = 0.198$


The right side of the equation can be expanded by adding more predictors, adding polynomial terms of the predictors, or adding interactions among predictors.
A model with only the main effects of $P$ predictors can be written as

$\ln \left[ \frac{P(Y = 1)}{1 - P(Y = 1)} \right] = β_{0} + \sum_{p = 1}^{P} β_{p} X_{p}$

$β_{0}$
- the predicted logit when the values for all the predictor variables in the model are equal to zero.
- $e^{β_{0}}$ , the predicted odds of the outcome being equal to 1 when the values for all the predictor variables in the model are equal to zero.
$β_{p}$
- the change in the predicted logit for a one-unit increase in $X_{p}$ when the values for all other predictors in the model are held constant
- For every one-unit increase in $X_{p}$ , the odds of the outcome being equal to 1 are multiplied by $e^{β_{p}}$ when the values for all other predictors in the model are held constant


Model Estimation

The concept of likelihood

It is essential to understand the likelihood concept for estimating the coefficients of a logistic regression model.
Consider a simple example of flipping coins. Suppose you flip the same coin 10 times and observe the following data.

$Y = (H, H, H, T, H, H, H, T, H, T)$

We don't know whether this is a fair coin in which the probability of observing a head or tail is equal to 0.5.
Is this a fair coin? If not, what is the probability of observing a head for this coin?


Suppose we define $p$ as the probability of observing a head when we flip this coin.
By definition, the probability of observing a tail is $1 - p$ .

$P (Y = H) = p$

$P (Y = T) = 1 - p$

The likelihood of our observations of heads and tails can be written as a function of $p$ .

$L (Y | p) = p \times p \times p \times (1 - p) \times p \times p \times p \times (1 - p) \times p \times (1 - p)$

$L (Y | p) = p^{7} \times (1 - p)^{3}$

If this is a fair coin, then $p$ is equal to 0.5, and the likelihood of observing seven heads and three tails would be

$L (Y | p = 0.5) = {0.5}^{7} \times (1 - 0.5)^{3} = 0.0009765625$

If we assume that $p$ is equal to 0.65, the likelihood of observed data would be

$L (Y | p = 0.65) = {0.65}^{7} \times (1 - 0.65)^{3} = 0.00210183$

Based on the observed data, which one is more likely: $p = 0.5$ or $p = 0.65$ ?
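
The two likelihood values can be reproduced with a few lines of R. This is a minimal sketch; the function name `coin_lik` is chosen here for illustration.

```r
# Observed flips from the example: seven heads and three tails
flips   <- c("H", "H", "H", "T", "H", "H", "H", "T", "H", "T")
n_heads <- sum(flips == "H")
n_tails <- sum(flips == "T")

# Likelihood of the observed sequence for a given value of p
coin_lik <- function(p) p^n_heads * (1 - p)^n_tails

coin_lik(0.5)    # likelihood if the coin is fair
coin_lik(0.65)   # likelihood if p = 0.65
```

The second value is larger, so the observed data are more likely under $p = 0.65$ than under $p = 0.5$.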


Maximum likelihood estimation (MLE)

What would be the best estimate of $p$ given our observed data (seven heads and three tails)?
Suppose we try every possible value of $p$ between 0 and 1 and calculate the likelihood of observed data, $L (Y)$ .
Then, plot $p$ vs. $L (Y)$

Which value of $p$ makes the observed data most likely (largest likelihood)?
This $p$ value is called the maximum likelihood estimate of $p$ .
We can show that the $p$ value that makes the likelihood largest is 0.7.
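
The search described above can be sketched as a simple grid evaluation. The grid step of 0.001 is an arbitrary choice made for this illustration, and `p_grid`, `lik`, and `p_hat` are names introduced here.

```r
# Candidate values of p between 0 and 1 (step size is an arbitrary choice)
p_grid <- seq(0.001, 0.999, by = 0.001)

# Likelihood of seven heads and three tails at each candidate value
lik <- p_grid^7 * (1 - p_grid)^3

# The candidate with the largest likelihood
p_hat <- p_grid[which.max(lik)]
p_hat

# Uncomment to draw p against L(Y)
# plot(p_grid, lik, type = "l", xlab = "p", ylab = "L(Y)")
```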


The concept of the log-likelihood

The computation of the likelihood requires multiplying one probability for every observation.
When you multiply values between 0 and 1, the result gets smaller and smaller.
Multiplying many of these small values creates problems because of the limited numerical precision a computer can handle.

.Machine$double.xmin

[1] 2.225e-308

When you have hundreds of thousands of observations, it is probably not a good idea to work directly with likelihood.
Instead, we prefer working with the log of likelihood (log-likelihood).
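
The precision problem can be demonstrated with simulated values. In the sketch below, the 2,000 probabilities are randomly generated for illustration only (the seed and the range are arbitrary assumptions).

```r
set.seed(1)                                    # arbitrary seed for reproducibility
p_small <- runif(2000, min = 0.1, max = 0.9)   # 2,000 made-up probabilities

prod(p_small)       # the product is too small to represent and becomes 0
sum(log(p_small))   # the sum of the logs is an ordinary finite number
```

Once the product underflows to 0, all information about the likelihood is lost, while the log-likelihood can still be compared across candidate parameter values.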


The log-likelihood has two main advantages, and using it does not change the estimate:
- We are less concerned about the precision of small numbers our computer can handle.
- Log-likelihood has better mathematical properties for optimization problems (the log of the product of two numbers equals the sum of the logs of the two numbers).
- The value of $p$ that maximizes the likelihood is the same value that maximizes the log-likelihood, so the end result (the maximum likelihood estimate) is the same whether we use the log-likelihood or the likelihood.

$\ln(L(Y | p)) = \ln(p^{7} \times (1 - p)^{3})$

$\ln(L(Y | p)) = \ln(p^{7}) + \ln((1 - p)^{3})$

$\ln(L(Y | p)) = 7 \times \ln(p) + 3 \times \ln(1 - p)$
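
As a worked check, and assuming the usual calculus approach of setting the first derivative to zero (a step not shown on the slide), the value of $p$ that maximizes this log-likelihood can be derived directly:

$\frac{d}{dp} \ln(L(Y | p)) = \frac{7}{p} - \frac{3}{1 - p} = 0$

$7 (1 - p) = 3 p \Rightarrow \hat{p} = \frac{7}{10} = 0.7$

The result is simply the observed proportion of heads, seven out of ten flips.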


MLE for Logistic Regression coefficients

Let's apply these concepts to estimate the logistic regression coefficients for the demo dataset.

$\ln \left[ \frac{P_{i}(Y = 1)}{1 - P_{i}(Y = 1)} \right] = β_{0} + β_{1} X_{i} .$

Note that $X$ and $P$ have a subscript $i$ to indicate that each individual may have a different X value, and therefore each individual will have a different probability.
You can consider each individual as a separate coin flip with an unknown probability.
Our observed outcome is a set of 0s (not recidivated) and 1s (recidivated).

recidivism_sub$Recidivism_Arrest_Year2

 [1] 1 1 0 0 1 1 1 1 0 0 0 0 1 0 0 1 0 0 0 0

How likely is it to observe this set of values? What { $β_{0}, β_{1}$ } values make this data most likely?


Given a specific set of coefficients, { $β_{0}, β_{1}$ }, we can calculate the logit for every observation using the model equation and then transform this logit to a probability, $P_{i} (Y = 1)$ .
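Then, we can calculate the log of the probability of the observed outcome for each observation (the log of $P_{i}$ when the outcome is 1 and the log of $1 - P_{i}$ when the outcome is 0) and sum them across observations to obtain the log-likelihood of observing this data (12 zeros and eight ones).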
Suppose that we have two guesstimates for { $β_{0}, β_{1}$ }, which are 0.5 and -0.8, respectively. These coefficients imply the following predicted model.


b0 = 0.5
b1 = -0.8
x = recidivism_sub$Dependents
y = recidivism_sub$Recidivism_Arrest_Year2
# Predicted logit for each observation, then P(Y = 1) and P(Y = 0)
pred_logit <- b0 + b1*x
pred_prob1 <- exp(pred_logit)/(1+exp(pred_logit))
pred_prob0 <- 1 - pred_prob1 
data.frame(Dependents      = x, 
           Recidivated     = y, 
           Prob1 = pred_prob1,
           Prob0 = pred_prob0)

   Dependents Recidivated  Prob1  Prob0
1           0           1 0.6225 0.3775
2           1           1 0.4256 0.5744
3           2           0 0.2497 0.7503
4           1           0 0.4256 0.5744
5           1           1 0.4256 0.5744
6           0           1 0.6225 0.3775
7           0           1 0.6225 0.3775
8           1           1 0.4256 0.5744
9           3           0 0.1301 0.8699
10          0           0 0.6225 0.3775
11          1           0 0.4256 0.5744
12          0           0 0.6225 0.3775
13          0           1 0.6225 0.3775
14          3           0 0.1301 0.8699
15          3           0 0.1301 0.8699
16          0           1 0.6225 0.3775
17          3           0 0.1301 0.8699
18          3           0 0.1301 0.8699
19          3           0 0.1301 0.8699
20          3           0 0.1301 0.8699
# Log of the probability of each observed outcome, summed over observations
logL <-  y*log(pred_prob1) + (1-y)*log(pred_prob0)
sum(logL)

[1] -9.253

We can summarize this by saying that if our model coefficients were $β_{0}$ = 0.5 and $β_{1}$ = -0.8, then the log of the likelihood of observing the outcome in our data would be -9.25.

$Y = (1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)$

$\log L(Y | β_{0} = 0.5, β_{1} = - 0.8) = - 9.25$

Is there another pair of values we can assign to $β_{0}$ and $β_{1}$ that would give the observed data a higher likelihood?
Is there a pair of values that makes the log-likelihood largest?


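If we compute the log-likelihood for many candidate pairs of $β_{0}$ and $β_{1}$ , we obtain a surface. What is the maximum point of this surface?
Our simple search indicates that the maximum value of the log-likelihood on this surface is -8.30, and the $β_{0}$ and $β_{1}$ coefficients that make the observed data most likely are 1.33 and -1.62, respectively.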

$\ln \left[ \frac{P_{i}(Y = 1)}{1 - P_{i}(Y = 1)} \right] = 1.33 - 1.62 \times X_{i} .$
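
A crude search of this kind could be sketched as follows. The data vectors are copied from the table of 20 observations shown earlier. The grid range (-3 to 3) and step size (0.01) are assumptions made for this illustration and may differ from what was used to produce the slides.

```r
# Demo data copied from the table of 20 observations
x <- c(0, 1, 2, 1, 1, 0, 0, 1, 3, 0, 1, 0, 0, 3, 3, 0, 3, 3, 3, 3)  # Dependents
y <- c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)  # Recidivism_Arrest_Year2

# Log-likelihood of the data for one candidate pair of coefficients
loglik <- function(b0, b1) {
  p <- 1 / (1 + exp(-(b0 + b1 * x)))
  sum(y * log(p) + (1 - y) * log(1 - p))
}

# Candidate values; the range and step size are arbitrary choices
grid <- expand.grid(b0 = seq(-3, 3, by = 0.01),
                    b1 = seq(-3, 3, by = 0.01))
grid$logL <- mapply(loglik, grid$b0, grid$b1)

# The candidate pair with the largest log-likelihood
best <- grid[which.max(grid$logL), ]
best
```

The grid contains more than 360,000 candidate pairs for only two coefficients, which hints at why this approach does not scale to models with many predictors.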


Logistic Loss function

Below is a compact way of writing likelihood and log-likelihood in mathematical notation. For simplification purposes, we write $P_{i}$ to represent $P_{i} (Y = 1)$ .

$L(Y | β) = \prod_{i = 1}^{N} P_{i}^{Y_{i}} \times (1 - P_{i})^{1 - Y_{i}}$

$\log L(Y | β) = \sum_{i = 1}^{N} Y_{i} \times \ln(P_{i}) + (1 - Y_{i}) \times \ln(1 - P_{i})$

The final equation above, $\log L(Y | β)$ , is the log-likelihood. Its negative, $- \log L(Y | β)$ , is known as the logistic loss function.
By finding the set of coefficients in a model, $β = (β_{0}, β_{1}, . . ., β_{P})$ , that maximizes the log-likelihood (equivalently, minimizes the logistic loss), we obtain the maximum likelihood estimates of the coefficients for the logistic regression model.
There is no closed-form solution for estimating the logistic regression parameters.
The naive crude search we applied above would be inefficient when you have a complex model with many predictors.
The only way to estimate the logistic regression coefficients is to use numerical approximations and computational algorithms to maximize the log-likelihood.
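
As one possible illustration of such a numerical algorithm, the sketch below uses the general-purpose optimizer `optim()` from base R on the demo data. This is not necessarily the algorithm used by dedicated model-fitting functions. The function name `neg_loglik`, the starting values, and the choice of the BFGS method are assumptions made here.

```r
# Demo data copied from the table of 20 observations
x <- c(0, 1, 2, 1, 1, 0, 0, 1, 3, 0, 1, 0, 0, 3, 3, 0, 3, 3, 3, 3)  # Dependents
y <- c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)  # Recidivism_Arrest_Year2

# Negative log-likelihood as a function of beta = c(b0, b1)
neg_loglik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * x)))
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

# optim() minimizes by default, so we minimize the negative log-likelihood;
# starting values of 0 are an arbitrary choice
fit <- optim(par = c(0, 0), fn = neg_loglik, method = "BFGS")

fit$par      # estimated coefficients (b0, b1)
-fit$value   # maximized log-likelihood
```

The estimates should land close to 1.33 and -1.62 after a small number of iterations, without evaluating a full grid of candidate values.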


NOTE
Why do we not use least squares estimation and minimize the sum of squared residuals when estimating the coefficients of the logistic regression model? We can certainly use the sum of squared residuals as our loss function and minimize it to estimate the coefficients for the logistic regression, just like we did for the linear regression. The complication is that the sum of squared residuals yields a non-convex surface when the outcome is binary, whereas the logistic loss function yields a convex surface. Non-convex optimization problems are more challenging than convex optimization problems, and they are more vulnerable to finding sub-optimal solutions (local minima/maxima). Therefore, maximizing the log-likelihood (equivalently, minimizing the logistic loss) is preferred when estimating the coefficients of a logistic regression model.


The `glm` function

mod <- glm(Recidivism_Arrest_Year2 ~ 1 + Dependents,
           data   = recidivism_sub,
           family = 'binomial')
summary(mod)


Call:
glm(formula = Recidivism_Arrest_Year2 ~ 1 + Dependents, family = "binomial", 
    data = recidivism_sub)
Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.767  -0.312  -0.241   0.686   1.303  
Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)    1.326      0.820    1.62    0.106  
Dependents    -1.616      0.727   -2.22    0.026 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 26.920  on 19  degrees of freedom
Residual deviance: 16.612  on 18  degrees of freedom
AIC: 20.61
Number of Fisher Scoring iterations: 5

In the Coefficients table, the numbers under the Estimate column are the estimated coefficients for the logistic regression model. The quantity labeled as the Residual deviance in the output is minus two times the maximized log-likelihood,

$\text{Deviance} = - 2 \times \log L(Y | β) .$
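
This relationship can be checked numerically. The sketch below refits the model on the demo data entered as plain vectors (so it does not depend on the data file) and compares the deviance with the log-likelihood. The object names `mod_glm` and `logL_hat` are introduced here for illustration.

```r
# Demo data copied from the table of 20 observations
x <- c(0, 1, 2, 1, 1, 0, 0, 1, 3, 0, 1, 0, 0, 3, 3, 0, 3, 3, 3, 3)  # Dependents
y <- c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)  # Recidivism_Arrest_Year2

mod_glm <- glm(y ~ 1 + x, family = "binomial")

# Maximized log-likelihood of the fitted model
logL_hat <- as.numeric(logLik(mod_glm))
logL_hat

# Minus two times the log-likelihood matches the residual deviance
-2 * logL_hat
deviance(mod_glm)
```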


The `glmnet` function

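library(glmnet)
# alpha = 0 and lambda = 0 apply no penalty, so the result
# should be close to the glm() estimates
mod <- glmnet(x         = cbind(0,recidivism_sub$Dependents),
              y         = factor(recidivism_sub$Recidivism_Arrest_Year2),
              family    = 'binomial',
              alpha     = 0,
              lambda    = 0,
              intercept = TRUE)
coef(mod)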

3 x 1 sparse Matrix of class "dgCMatrix"
                s0
(Intercept)  1.325
V1           .    
V2          -1.616

The x argument is the input matrix for predictors, and the y argument is the binary outcome. For family = 'binomial', glmnet expects y to be a factor with two levels (a two-column matrix of counts or proportions is also accepted).

Note that I defined the x argument above as cbind(0,recidivism_sub$Dependents) because glmnet requires the x to be a matrix with at least two columns. So, I added a column of zeros to trick the function and force it to run. That column of zeros has zero impact on the estimation.


Model Performance Evaluation

When the outcome is a binary variable, classification models, such as logistic regression, yield a probability estimate for a class membership (or a continuous-valued prediction between 0 and 1).

$\ln \left[ \frac{P_{i}(Y = 1)}{1 - P_{i}(Y = 1)} \right] = 1.33 - 1.62 \times X_{i} .$

mod <- glm(Recidivism_Arrest_Year2 ~ 1 + Dependents,
           data   = recidivism_sub,
           family = 'binomial')
recidivism_sub$pred_prob <- predict(mod,type='response')
recidivism_sub[,c('ID','Dependents','Recidivism_Arrest_Year2','pred_prob')]

      ID Dependents Recidivism_Arrest_Year2 pred_prob
1  21953          0                       1   0.79010
2   8255          1                       1   0.42786
3   9110          2                       0   0.12935
4  20795          1                       0   0.42786
5   5569          1                       1   0.42786
6  14124          0                       1   0.79010
7  24979          0                       1   0.79010
8   4827          1                       1   0.42786
9  26586          3                       0   0.02867
10 17777          0                       0   0.79010
11 22269          1                       0   0.42786
12 25016          0                       0   0.79010
13 24138          0                       1   0.79010
14 12261          3                       0   0.02867
15 15417          3                       0   0.02867
16 14695          0                       1   0.79010
17  4371          3                       0   0.02867
18 13529          3                       0   0.02867
19 25046          3                       0   0.02867
20  5340          3                       0   0.02867


Separation of two classes

In an ideal situation where a model does a perfect job of predicting a binary outcome, we expect

all those observations in Group 0 (Not Recidivated) to have a predicted probability of 0,
and all those observations in Group 1 (Recidivated) to have a predicted probability of 1.

So, predicted values close to 0 for observations in Group 0 and those close to 1 for Group 1 are indicators of good model performance.

One way to look at the quality of separation between two classes of a binary outcome is to examine the distribution of predictions within each class.


From the demo analysis:


Class Predictions

In most situations, for practical reasons, we transform the continuous probability predicted by a model into a binary prediction.
Predicted class membership leads to actionable decisions in practice.
This is implemented by choosing a cut-off value, which is an arbitrary choice. Once a cut-off value is determined, we can generate class predictions.
Consider that we use a cut-off value of 0.5.

      ID Dependents Recidivism_Arrest_Year2 pred_prob pred_class
1  21953          0                       1   0.79010          1
2   8255          1                       1   0.42786          0
3   9110          2                       0   0.12935          0
4  20795          1                       0   0.42786          0
5   5569          1                       1   0.42786          0
6  14124          0                       1   0.79010          1
7  24979          0                       1   0.79010          1
8   4827          1                       1   0.42786          0
9  26586          3                       0   0.02867          0
10 17777          0                       0   0.79010          1
11 22269          1                       0   0.42786          0
12 25016          0                       0   0.79010          1
13 24138          0                       1   0.79010          1
14 12261          3                       0   0.02867          0
15 15417          3                       0   0.02867          0
16 14695          0                       1   0.79010          1
17  4371          3                       0   0.02867          0
18 13529          3                       0   0.02867          0
19 25046          3                       0   0.02867          0
20  5340          3                       0   0.02867          0

If an observation has a predicted class probability less than 0.5, we predict that this person is in Group 0 (Not Recidivated).
If an observation has a predicted class probability higher than 0.5, we predict that this person is in Group 1.
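
The class predictions in the table can be generated along the following lines. This is a sketch on the demo data entered as plain vectors; the names `mod_glm`, `cut_off`, and `pred_class` are assumptions, and the original slides may have used different code.

```r
# Demo data copied from the table of 20 observations
x <- c(0, 1, 2, 1, 1, 0, 0, 1, 3, 0, 1, 0, 0, 3, 3, 0, 3, 3, 3, 3)  # Dependents
y <- c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)  # Recidivism_Arrest_Year2

mod_glm   <- glm(y ~ 1 + x, family = "binomial")
pred_prob <- predict(mod_glm, type = "response")

# Predict class 1 when the predicted probability is above the cut-off
cut_off    <- 0.5
pred_class <- ifelse(pred_prob > cut_off, 1, 0)

# Cross-tabulate predicted and observed classes
tab <- table(Predicted = pred_class, Observed = y)
tab
```

With a cut-off of 0.5, only the observations with zero dependents (predicted probability of about 0.79) are assigned to Group 1.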


Confusion Matrix

We can summarize the relationship between the binary outcome and the binary prediction in a 2 x 2 table. This table is commonly referred to as a confusion matrix.

         Observed
Predicted  0  1
        0 10  3
        1  2  5

Based on the elements of this table, we can define four key concepts:

True Positives (TP): True positives are the observations where both the outcome and prediction are equal to 1.
True Negatives (TN): True negatives are the observations where both the outcome and prediction are equal to 0.
False Positives (FP): False positives are the observations where the outcome is 0 but the prediction is 1.
False Negatives (FN): False negatives are the observations where the outcome is 1 but the prediction is 0.


Accuracy: Overall accuracy represents the proportion of correct predictions.

$ACC = \frac{TP + TN}{TP + TN + FP + FN}$

True Positive Rate (Sensitivity): True positive rate (a.k.a. sensitivity) is the proportion of correct predictions among the observations for which the outcome is 1 (event is observed).

$TPR = \frac{TP}{TP + FN}$

True Negative Rate (Specificity): True negative rate (a.k.a. specificity) is the proportion of correct predictions among the observations for which the outcome is 0 (event is not observed).

$TNR = \frac{TN}{TN + FP}$

Positive Predictive Value (Precision): Positive predictive value (a.k.a. precision) is the proportion of correct decisions when the model predicts that the outcome is 1.

$PPV = \frac{TP}{TP + FP}$

F1 score: F1 score is a metric that combines both PPV and TPR (it is their harmonic mean).

$F1 = 2 \times \frac{PPV \times TPR}{PPV + TPR}$
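
As a worked example, the metrics can be computed from the four cells of the confusion matrix shown earlier for a cut-off of 0.5 (TP = 5, TN = 10, FP = 2, FN = 3). The variable names below are chosen for illustration.

```r
# Cell counts taken from the confusion matrix at a cut-off of 0.5
TP <- 5    # outcome 1, prediction 1
TN <- 10   # outcome 0, prediction 0
FP <- 2    # outcome 0, prediction 1
FN <- 3    # outcome 1, prediction 0

ACC <- (TP + TN) / (TP + TN + FP + FN)   # accuracy
TPR <- TP / (TP + FN)                    # sensitivity
TNR <- TN / (TN + FP)                    # specificity
PPV <- TP / (TP + FP)                    # precision
F1  <- 2 * (PPV * TPR) / (PPV + TPR)     # F1 score

round(c(ACC = ACC, TPR = TPR, TNR = TNR, PPV = PPV, F1 = F1), 4)
```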


Area Under the Receiver Operating Characteristic Curve (AUC or AUROC)

The confusion matrix and related metrics all depend on the arbitrary cut-off value one picks when transforming continuous predicted probabilities to binary predicted classes.
We can change the cut-off value to optimize certain metrics, and there is always a trade-off between these metrics for different cut-off values.

  cut  acc   tpr    tnr    ppv    fpr     f1
1 0.0 0.40 1.000 0.0000 0.4000 1.0000 0.5714
2 0.1 0.75 1.000 0.5833 0.6154 0.4167 0.7619
3 0.2 0.80 1.000 0.6667 0.6667 0.3333 0.8000
4 0.3 0.80 1.000 0.6667 0.6667 0.3333 0.8000
5 0.4 0.80 1.000 0.6667 0.6667 0.3333 0.8000
6 0.5 0.75 0.625 0.8333 0.7143 0.1667 0.6667
7 0.6 0.75 0.625 0.8333 0.7143 0.1667 0.6667
8 0.7 0.75 0.625 0.8333 0.7143 0.1667 0.6667
9 0.8 0.60 0.000 1.0000    NaN 0.0000    NaN


A receiver operating characteristic (ROC) curve is a plot that represents this dynamic relationship between TPR and FPR (1 - TNR) for varying levels of the cut-off value.

The area under the ROC curve (AUC or AUROC) is typically used to evaluate the predictive power of classification models.

The diagonal line in this plot represents a hypothetical model with no predictive power and AUC for the diagonal line is 0.5 (it is half of the whole square).
The closer AUC is to 0.5, the closer predictive power is to random guessing.
The more the ROC curve resembles the diagonal line, the less predictive power the model has.
The closer AUC is to 1, the more predictive power the model has.
The magnitude of AUC is closely related to how well the predicted probabilities separate the two classes.
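
One way to compute AUC without an additional package is the rank-based (Mann-Whitney) formula, which gives the proportion of pairs of one observation from each class in which the observation from Group 1 has the higher predicted probability, counting ties as one half. The sketch below applies it to the demo data. This is an illustrative choice and not necessarily how the values behind the slides were obtained.

```r
# Demo data copied from the table of 20 observations
x <- c(0, 1, 2, 1, 1, 0, 0, 1, 3, 0, 1, 0, 0, 3, 3, 0, 3, 3, 3, 3)  # Dependents
y <- c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)  # Recidivism_Arrest_Year2

mod_glm   <- glm(y ~ 1 + x, family = "binomial")
pred_prob <- predict(mod_glm, type = "response")

# Ranks of the predictions; tied predictions receive their average rank
# (rounding guards against tiny floating-point differences between ties)
r  <- rank(round(pred_prob, 8))
n1 <- sum(y == 1)   # number of observations in Group 1
n0 <- sum(y == 0)   # number of observations in Group 0

# Rank-based AUC
auc <- (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```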


Building a Logistic Regression Model via `caret`

Please review the following notebook that builds a classification model using the logistic regression for the full recidivism dataset.

Building a Logistic Regression Model


Regularized Logistic Regression


Regularization works in logistic regression the same way as discussed for linear regression.
We add a penalty term to the loss function to discourage large coefficients; the penalty reduces model variance in exchange for some added bias.
By tuning the degree of penalty, we can typically obtain models that perform better than a logistic regression with no regularization.

Logistic Loss with Ridge Penalty

$\log L(Y | β) = \left( \sum_{i = 1}^{N} Y_{i} \times \ln(P_{i}) + (1 - Y_{i}) \times \ln(1 - P_{i}) \right) - \frac{λ}{2} \sum_{p = 1}^{P} β_{p}^{2}$

Logistic Loss with Lasso Penalty

$\log L(Y | β) = \left( \sum_{i = 1}^{N} Y_{i} \times \ln(P_{i}) + (1 - Y_{i}) \times \ln(1 - P_{i}) \right) - λ \sum_{p = 1}^{P} | β_{p} |$

Logistic Loss with Elastic Net

$\log L(Y | β) = \left( \sum_{i = 1}^{N} Y_{i} \times \ln(P_{i}) + (1 - Y_{i}) \times \ln(1 - P_{i}) \right) - \left( (1 - α) \frac{λ}{2} \sum_{p = 1}^{P} β_{p}^{2} + α λ \sum_{p = 1}^{P} | β_{p} | \right)$
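
To make the three penalized forms concrete, the sketch below evaluates them on the demo data for a model with a single slope. The function name `penalized_loglik` and the values of λ and α are assumptions chosen for illustration; the intercept is left unpenalized because the penalty sums in the equations start at $p = 1$.

```r
# Demo data copied from the table of 20 observations
x <- c(0, 1, 2, 1, 1, 0, 0, 1, 3, 0, 1, 0, 0, 3, 3, 0, 3, 3, 3, 3)  # Dependents
y <- c(1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0)  # Recidivism_Arrest_Year2

# Penalized log-likelihood for one predictor (hypothetical helper)
# alpha = 0 gives the ridge penalty, alpha = 1 gives the lasso penalty
penalized_loglik <- function(b0, b1, lambda, alpha) {
  p     <- 1 / (1 + exp(-(b0 + b1 * x)))
  logL  <- sum(y * log(p) + (1 - y) * log(1 - p))
  ridge <- (lambda / 2) * b1^2
  lasso <- lambda * abs(b1)
  logL - ((1 - alpha) * ridge + alpha * lasso)
}

penalized_loglik(0.5, -0.8, lambda = 0, alpha = 0)     # no penalty
penalized_loglik(0.5, -0.8, lambda = 1, alpha = 0)     # ridge penalty
penalized_loglik(0.5, -0.8, lambda = 1, alpha = 1)     # lasso penalty
penalized_loglik(0.5, -0.8, lambda = 1, alpha = 0.5)   # elastic net
```

For the same coefficients, a larger λ lowers the penalized value more, so the maximum moves toward coefficients that are smaller in magnitude.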


Shrinkage in Logistic Regression Coefficients with Ridge Penalty


Shrinkage in Logistic Regression Coefficients with Lasso Penalty


Building a Regularized Logistic Regression Model via `caret`

Please review the following notebooks that build classification models using the regularized logistic regression for the full recidivism dataset.

Building a Logistic Regression Model with Ridge Penalty

Building a Logistic Regression Model with Lasso Penalty

Building a Logistic Regression Model with Elastic Net
