class: center, middle, inverse, title-slide .title[ # Regularized (Penalized) Linear Regression ] .author[ ### Cengiz Zopluoglu ] .institute[ ### College of Education, University of Oregon ] .date[ ### Oct 31, 2022
Eugene, OR ]

---

<style>

.blockquote {
  border-left: 5px solid #007935;
  background: #f9f9f9;
  padding: 10px;
  padding-left: 30px;
  margin-left: 16px;
  margin-right: 0;
  border-radius: 0px 4px 4px 0px;
}

#infobox {
  padding: 1em 1em 1em 4em;
  margin-bottom: 10px;
  border: 2px solid black;
  border-radius: 10px;
  background: #E6F6DC 5px center/3em no-repeat;
}

.centering[
  float: center;
]

.left-column2 {
  width: 50%;
  height: 92%;
  float: left;
  padding-top: 1em;
}

.right-column2 {
  width: 50%;
  float: right;
  padding-top: 1em;
}

.remark-code {
  font-size: 18px;
}

.tiny .remark-code { /*Change made here*/
  font-size: 75% !important;
}

.tiny2 .remark-code { /*Change made here*/
  font-size: 50% !important;
}

.indent {
  margin-left: 3em;
}

.single {
  line-height: 1 ;
}

.double {
  line-height: 2 ;
}

.title-slide h1 {
  padding-top: 0px;
  font-size: 40px;
  text-align: center;
  padding-bottom: 18px;
  margin-bottom: 18px;
}

.title-slide h2 {
  font-size: 30px;
  text-align: center;
  padding-top: 0px;
  margin-top: 0px;
}

.title-slide h3 {
  font-size: 30px;
  color: #26272A;
  text-align: center;
  text-shadow: none;
  padding: 10px;
  margin: 10px;
  line-height: 1.2;
}

</style>

### Today's Goals:

- A conceptual introduction to Regularized Regression

  - Ridge Regression

  - Lasso Regression

  - Elastic Net

- Implementation with the `glmnet` package

- Building Regularized Regression models with the `caret` package

---

### What is regularization?

- Regularization is a general strategy for incorporating additional penalty terms into the model fitting process.

- It is implemented in various other algorithms, not just regression.

- The idea behind regularization is to constrain the size of the model coefficients in order to reduce their sampling variation and, hence, the variance of model predictions.

- This reduction in variance comes at the expense of some bias in model predictions (the bias-variance tradeoff).

- The constraints are typically incorporated into the loss function to be optimized.

- There are two commonly used penalties, which can also be combined:

  - ridge penalty

  - lasso penalty

  - elastic net (a mix of the ridge and lasso penalties)

---

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

<center>

# Ridge Regression

---

### Ridge Penalty

- The loss function for unregularized linear regression is the sum of squared residuals across all observations:

`$$\sum_{i=1}^{N}\epsilon_{(i)}^2$$`

- For ridge regression, we add a penalty term to this loss function, which is a function of all the regression coefficients in the model:

`$$Penalty = \lambda \sum_{p=1}^{P}\beta_p^2,$$`

- `\(\lambda\)` is a constant that controls how heavily the regression coefficients are penalized as they get larger.

- The loss function to optimize for ridge regression is therefore

`$$Loss = \sum_{i=1}^{N}\epsilon_{(i)}^2 + \lambda \sum_{p=1}^{P}\beta_p^2.$$`

---

Let's consider the same example from the previous class. Suppose that we would like to predict the target readability score for a given text from Feature 220.

`$$Y = \beta_0 + \beta_1X + \epsilon$$`

Assume the set of coefficients is { `\(\beta_0,\beta_1\)` } = {-1.5, 2}, so my model is

`$$Y = -1.5 + 2X + \epsilon.$$`

Then, the value of the loss function when `\(\lambda=0.2\)` would be equal to 19.02.
.pull-left[
.single[
.tiny2[

```r
# Observed values, predictions, and residuals for the model Y = -1.5 + 2X
d <- readability_sub[,c('V220','target')]

b0 = -1.5
b1 = 2

d$predicted <- b0 + b1*d$V220
d$error     <- d$target - d$predicted

d
```

```
          V220      target  predicted       error
1  -0.13908258 -2.06282395 -1.7781652 -0.28465879
2   0.21764143  0.58258607 -1.0647171  1.64730321
3   0.05812133 -1.65313060 -1.3837573 -0.26937327
4   0.02526429 -0.87390681 -1.4494714  0.57556460
5   0.22430885 -1.74049148 -1.0513823 -0.68910918
6  -0.07795373 -3.63993555 -1.6559075 -1.98402809
7   0.43400714 -0.62284268 -0.6319857  0.00914304
8  -0.24364550 -0.34426981 -1.9872910  1.64302120
9   0.15893717 -1.12298826 -1.1821257  0.05913740
10  0.14496475 -0.99857142 -1.2100705  0.21149908
11  0.34222975 -0.87656742 -0.8155405 -0.06102693
12  0.25219145 -0.03304643 -0.9956171  0.96257066
13  0.03532625 -0.49529863 -1.4293475  0.93404886
14  0.36410633  0.12453660 -0.7717873  0.89632394
15  0.29988593  0.09678258 -0.9002281  0.99701073
16  0.19837037  0.38422270 -1.1032593  1.48748196
17  0.07807041 -0.58143038 -1.3438592  0.76242880
18  0.07935690 -0.34324576 -1.3412862  0.99804044
19  0.57000953 -0.39054205 -0.3599809 -0.03056111
20  0.34523284 -0.67548411 -0.8095343  0.13405021
```

]]]

.pull-right[
.single[
.tiny2[

```r
lambda = 0.2

# Sum of squared residuals
SSR     <- sum((d$error)^2)

# Ridge penalty: lambda times the sum of squared coefficients
penalty <- lambda*(b0^2 + b1^2)

# Penalized loss
loss    <- SSR + penalty

SSR
```

```
[1] 17.76657
```

```r
penalty
```

```
[1] 1.25
```

```r
loss
```

```
[1] 19.01657
```

]]]

---

- Note that when `\(\lambda\)` is equal to zero, the loss function is identical to SSR; therefore, the model becomes a linear regression with no regularization.

- As the value of `\(\lambda\)` increases, the penalty increases linearly.

- `\(\lambda\)` can technically take any non-negative value (from 0 to `\(\infty\)`).

- A visual representation of the effect of the penalty term on the loss function and regression coefficients:

![](ridge.gif)

---

### Model Estimation

The matrix solution we learned before for regression without regularization can also be applied to estimate the coefficients of ridge regression for a given `\(\lambda\)` value.

`$$\hat{\boldsymbol{\beta}} = (\mathbf{X^T}\mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X^T}\mathbf{Y}$$`

- `\(\mathbf{Y}\)` is an N x 1 column vector of observed values for the outcome variable,

- `\(\mathbf{X}\)` is an N x (P+1) **design matrix** for the set of predictor variables, including an intercept term,

- `\(\boldsymbol{\beta}\)` is a (P+1) x 1 column vector of regression coefficients,

- `\(\mathbf{I}\)` is a (P+1) x (P+1) identity matrix,

- and `\(\lambda\)` is a non-negative real number.

---

Suppose we want to predict the readability score using two predictors, Feature 220 ($X_1$) and Feature 166 ($X_2$). Our model will be

`$$Y_{(i)} = \beta_0 + \beta_1X_{1(i)} + \beta_2X_{2(i)} + \epsilon_{(i)}.$$`

If we estimate the ridge regression coefficients using `\(\lambda=.5\)`, the estimates would be { `\(\beta_0,\beta_1,\beta_2\)` } = {-.915, 1.169, -0.221}.

.single[
.tiny[

```r
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1,readability_sub$V220,readability_sub$V166))

lambda <- 0.5

# Closed-form ridge solution: (X'X + lambda*I)^(-1) X'Y
beta <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta
```

```
           [,1]
[1,] -0.9151087
[2,]  1.1691920
[3,] -0.2206188
```

]]
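---

An upcoming slide shows how the coefficients change as `\(\lambda\)` goes from 0 to 100. As a minimal sketch (not part of the original lecture code), the same closed-form solution can be wrapped in a small helper and applied over a grid of `\(\lambda\)` values; it assumes `readability_sub` is already loaded, and the helper name `ridge_beta` is just for illustration.

.single[
.tiny2[

```r
# A sketch: trace the ridge coefficients over a grid of lambda values
# using the closed-form solution (X'X + lambda*I)^(-1) X'Y

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1, readability_sub$V220, readability_sub$V166))

ridge_beta <- function(X, Y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% Y
}

lambdas <- seq(0, 100, by = 0.1)

# One row of coefficients (b0, b1, b2) per lambda value
path <- t(sapply(lambdas, function(l) ridge_beta(X, Y, l)))
colnames(path) <- c('b0', 'b1', 'b2')

# Coefficients shrink toward zero as lambda grows, but never reach it exactly
matplot(lambdas, path, type = 'l', lty = 1,
        xlab = 'lambda', ylab = 'Coefficient estimate')
```

]]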
---

If we change the value of `\(\lambda\)` to 2, we will get different estimates for the regression coefficients.

.single[
.tiny[

```r
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1,readability_sub$V220,readability_sub$V166))

lambda <- 2

beta <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta
```

```
           [,1]
[1,] -0.7550986
[2,]  0.4685138
[3,] -0.1152953
```

]]

---

Suppose you manipulate the value of `\(\lambda\)` from 0 to 100 with increments of .1 and calculate the regression coefficients at each value of `\(\lambda\)`.

<img src="slide4_files/figure-html/unnamed-chunk-6-1.svg" style="display: block; margin: auto;" />

Note that the regression coefficients shrink toward zero but are never exactly equal to zero in ridge regression.

---

### Standardized variables

A critical complication arises in ridge regression when you have more than one predictor.

- The scales of the variables affect the magnitude of the unstandardized regression coefficients.

- A regression coefficient for a predictor ranging from 0 to 100 will be very different from a regression coefficient for a predictor ranging from 0 to 1.

- If we work with unstandardized variables, the ridge penalty will affect the coefficients unevenly simply because of these scale differences: a variable measured on a smaller scale tends to have a larger coefficient and therefore absorbs more of the penalty.

Therefore, it is critical that we standardize the variables before we use ridge regression.

---

- When we standardize the variables, the mean of every variable becomes zero.

- Therefore, the intercept estimate for any regression model with standardized variables is guaranteed to be zero.

- The design matrix (**X**) doesn't need a column of ones anymore because the intercept is unnecessary (its estimate would be zero if we included one).

.single[
.tiny2[

```r
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))
```

]]

.pull-left[
.single[
.tiny2[

```r
# Standardized Y
Y <- scale(Y)
Y
```

```
             [,1]
 [1,] -1.34512103
 [2,]  1.39315702
 [3,] -0.92104526
 [4,] -0.11446655
 [5,] -1.01147297
 [6,] -2.97759768
 [7,]  0.14541127
 [8,]  0.43376355
 [9,] -0.37229209
[10,] -0.24350754
[11,] -0.11722056
[12,]  0.75591253
[13,]  0.27743280
[14,]  0.91902756
[15,]  0.89029923
[16,]  1.18783003
[17,]  0.18827737
[18,]  0.43482354
[19,]  0.38586691
[20,]  0.09092186
attr(,"scaled:center")
[1] -0.7633224
attr(,"scaled:scale")
[1] 0.9660852
```

]]]

.pull-right[
.single[
.tiny2[

```r
# Standardized X
X <- scale(X)
X
```

```
             [,1]        [,2]
 [1,] -1.54285661  1.44758852
 [2,]  0.24727019 -0.47711825
 [3,] -0.55323997 -0.97867834
 [4,] -0.71812450  1.41817232
 [5,]  0.28072891 -0.62244962
 [6,] -1.23609737  0.11236158
 [7,]  1.33304531  0.34607530
 [8,] -2.06757849 -1.22697345
 [9,] -0.04732188  0.05882229
[10,] -0.11743882 -1.24554362
[11,]  0.87248435  1.93772977
[12,]  0.42065052  0.13025831
[13,] -0.66763117 -0.40479141
[14,]  0.98226625  1.39073712
[15,]  0.65999286  0.25182543
[16,]  0.15056337 -0.28808443
[17,] -0.45313072  0.02862469
[18,] -0.44667480  0.25187100
[19,]  2.01553800 -2.00805114
[20,]  0.88755458 -0.12237606
attr(,"scaled:center")
[1] 0.1683671 0.1005784
attr(,"scaled:scale")
[1] 0.19927304 0.06196686
```

]]]

---

The regression coefficients with standardized variables when there is no ridge penalty:

.single[
.tiny[

```r
lambda <- 0

beta.s <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta.s
```

```
            [,1]
[1,]  0.42003881
[2,] -0.06335594
```

]]

The regression coefficients when the ridge penalty is increased to 0.5:
.single[
.tiny[

```r
lambda <- 0.5

beta.s <- solve(t(X)%*%X + lambda*diag(ncol(X)))%*%t(X)%*%Y
beta.s
```

```
            [,1]
[1,]  0.40931629
[2,] -0.06215875
```

]]

---

Change in the **standardized regression coefficients** when we manipulate the value of `\(\lambda\)` from 0 to 100 with increments of .1:

<img src="slide4_files/figure-html/unnamed-chunk-12-1.svg" style="display: block; margin: auto;" />

---

### Ridge regression with the `glmnet` package

Similar to the `lm()` function, we can use the `glmnet()` function from the `glmnet` package to run a regression model with a ridge penalty. There are many arguments for the `glmnet()` function. For now, the arguments we need to know are

- `x`: an N x P input matrix, where N is the number of observations and P is the number of predictors

- `y`: an N x 1 input matrix for the outcome variable

- `alpha`: a mixing constant for the lasso and ridge penalties. When it is 0, ridge regression is conducted.

- `lambda`: the penalty term

- `intercept`: set to FALSE to drop the intercept when the variables are standardized

---

Regression with no penalty (traditional OLS regression)

- `alpha = 0`
- `lambda = 0`

.single[
.tiny[

```r
require(glmnet)

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))

Y <- scale(Y)
X <- scale(X)

mod <- glmnet(x = X, y = Y, family = 'gaussian', alpha = 0, lambda = 0, intercept = FALSE)

coef(mod)
```

```
3 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept)  .         
V1           0.42003881
V2          -0.06335594
```

]]

---

Increase the penalty term to 0.5.

- `alpha = 0`
- `lambda = 0.5`

.single[
.tiny[

```r
require(glmnet)

Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))

Y <- scale(Y)
X <- scale(X)

mod <- glmnet(x = X, y = Y, family = 'gaussian', alpha = 0, lambda = 0.5, intercept = FALSE)

coef(mod)
```

```
3 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept)  .         
V1           0.27809880
V2          -0.04571182
```

]]

---

When we estimated the ridge regression coefficients with `\(\lambda=0.5\)` using the matrix solution a few slides back, we got different numbers than what the `glmnet()` function reports.

In the `glmnet()` function, it appears that the penalty term for ridge regression is specified as

`$$\lambda N \sum_{p=1}^{P}\beta_p^2.$$`

For identical results, we should therefore use `\(\lambda = 0.5/20\)` in the `glmnet()` function (N = 20 in this example).

.single[
.tiny2[

```r
mod <- glmnet(x = X, y = Y, family = 'gaussian', alpha = 0, lambda = 0.5/20, intercept = FALSE)

coef(mod)
```

```
3 x 1 sparse Matrix of class "dgCMatrix"
                     s0
(Intercept)  .         
V1           0.40958102
V2          -0.06218857
```

]]

Note that these numbers are still slightly different. The difference is due to the numerical approximation `glmnet` uses when optimizing the loss function; `glmnet` does not use the closed-form matrix solution for ridge regression. This is a good thing, because there is not always a closed-form solution for other types of regularization (e.g., lasso). The computational approximation in `glmnet` is therefore essential moving forward.

---

### Tuning the Hyperparameter `\(\lambda\)`

In the context of machine learning, the quantities in a model can be classified into two types: parameters and hyperparameters.

- The **parameters** are typically estimated from data and not set by the user. In the context of ridge regression, the regression coefficients, {$\beta_0,\beta_1,...,\beta_P$}, are parameters to be estimated from data.

- The **hyperparameters** are not estimable from data because the loss function has no first-order or second-order derivatives with respect to them.
Therefore, they must be set by the user. In the context of ridge regression, the penalty term, {$\lambda$}, is a hyperparameter.

The process of deciding what value to use for a hyperparameter is called **tuning**, and it is usually a trial-and-error process. You try many different hyperparameter values and check how well the model performs based on specific criteria (e.g., MAE, MSE, RMSE) using k-fold cross-validation. Then, you pick the hyperparameter value that provides the best predictive performance in cross-validation.

---

### Using Ridge Regression to Predict Readability Scores

Please review the following notebook for applying Ridge Regression to predict readability scores from 768 features using the whole dataset.

[Predicting Readability Scores using the Ridge Regression](https://www.kaggle.com/code/uocoeeds/building-a-ridge-regression-model)

---

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

<center>

# Lasso Regression

---

### Lasso Penalty

Lasso regression is similar to ridge regression but applies a different penalty term.

`$$Penalty = \lambda \sum_{p=1}^{P} |\beta_p|,$$`

- `\(\lambda\)` is the penalty constant,

- `\(|\beta_p|\)` is the absolute value of the regression coefficient for the `\(p^{th}\)` predictor.

The loss function for lasso regression to optimize:

`$$Loss = \sum_{i=1}^{N}\epsilon_{(i)}^2 + \lambda \sum_{p=1}^{P}|\beta_p|$$`

---

Let's consider the same example where we fit a simple linear regression model: the readability score is the outcome ($Y$) and Feature 220 is the predictor ($X$).

`$$Y = \beta_0 + \beta_1X + \epsilon$$`

Assume the set of coefficients is { `\(\beta_0,\beta_1\)` } = {-1.5, 2}, so my model is

`$$Y = -1.5 + 2X + \epsilon.$$`

Then, the value of the loss function when `\(\lambda=0.2\)` would be equal to 18.467.

.pull-left[
.single[
.tiny2[

```r
d <- readability_sub[,c('V220','target')]

b0 = -1.5
b1 = 2

d$predicted <- b0 + b1*d$V220
d$error     <- d$target - d$predicted

d
```

```
          V220      target  predicted       error
1  -0.13908258 -2.06282395 -1.7781652 -0.28465879
2   0.21764143  0.58258607 -1.0647171  1.64730321
3   0.05812133 -1.65313060 -1.3837573 -0.26937327
4   0.02526429 -0.87390681 -1.4494714  0.57556460
5   0.22430885 -1.74049148 -1.0513823 -0.68910918
6  -0.07795373 -3.63993555 -1.6559075 -1.98402809
7   0.43400714 -0.62284268 -0.6319857  0.00914304
8  -0.24364550 -0.34426981 -1.9872910  1.64302120
9   0.15893717 -1.12298826 -1.1821257  0.05913740
10  0.14496475 -0.99857142 -1.2100705  0.21149908
11  0.34222975 -0.87656742 -0.8155405 -0.06102693
12  0.25219145 -0.03304643 -0.9956171  0.96257066
13  0.03532625 -0.49529863 -1.4293475  0.93404886
14  0.36410633  0.12453660 -0.7717873  0.89632394
15  0.29988593  0.09678258 -0.9002281  0.99701073
16  0.19837037  0.38422270 -1.1032593  1.48748196
17  0.07807041 -0.58143038 -1.3438592  0.76242880
18  0.07935690 -0.34324576 -1.3412862  0.99804044
19  0.57000953 -0.39054205 -0.3599809 -0.03056111
20  0.34523284 -0.67548411 -0.8095343  0.13405021
```

]]]

.pull-right[
.single[
.tiny2[

```r
lambda = 0.2

# Sum of squared residuals
SSR     <- sum((d$error)^2)

# Lasso penalty: lambda times the sum of absolute coefficients
penalty <- lambda*(abs(b0) + abs(b1))

# Penalized loss
loss    <- SSR + penalty

SSR
```

```
[1] 17.76657
```

```r
penalty
```

```
[1] 0.7
```

```r
loss
```

```
[1] 18.46657
```

]]]

---

Below is a demonstration of what happens to the loss function and the regression coefficients for increasing levels of the penalty term (`\(\lambda\)`) in lasso regression.

![](lasso.gif)

Note that the regression coefficients become exactly equal to 0 at some point (this was not the case for ridge regression).
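---

To see this numerically, here is a minimal sketch (not part of the original lecture code) that fits the same two-predictor model with `glmnet()` across a decreasing grid of `\(\lambda\)` values and prints the coefficients at each value. Entries printed as `.` are exactly zero; the `\(\lambda\)` at which each coefficient vanishes depends on the data and on `glmnet`'s internal scaling of the penalty.

.single[
.tiny2[

```r
# A sketch: watch the lasso coefficients become exactly zero as lambda grows
# (assumes readability_sub is loaded and the glmnet package is installed)

require(glmnet)

Y <- scale(as.matrix(readability_sub$target))
X <- scale(as.matrix(cbind(readability_sub$V220, readability_sub$V166)))

# A decreasing sequence of lambda values, as recommended by glmnet
lambdas <- seq(0.5, 0, by = -0.05)

mod <- glmnet(x = X, y = Y, family = 'gaussian', alpha = 1, lambda = lambdas, intercept = FALSE)

# One column of coefficients per lambda value
coef(mod)
```

]]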
---

### Model Estimation via `glmnet()`

- There is no closed-form solution for lasso regression due to the absolute value terms in the loss function.

- The only way to estimate the coefficients of lasso regression is to use numerical techniques and obtain computational approximations.

- Similar to ridge regression, `glmnet` is an engine we can use to estimate the coefficients of lasso regression.

- The arguments are identical. The only difference is that we set the `alpha` argument to 1 for lasso regression.

.indent[
.single[
.tiny2[

```r
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(readability_sub$V220,readability_sub$V166))

Y <- scale(Y)
X <- scale(X)

mod <- glmnet(x = X, y = Y, family = 'gaussian', alpha = 1, lambda = 0.2, intercept = FALSE)

coef(mod)
```

```
3 x 1 sparse Matrix of class "dgCMatrix"
                   s0
(Intercept) .        
V1          0.2174345
V2          .        
```

]]]

---

- The `.` symbol for the coefficient of the second predictor indicates that it is equal to zero.

- While the regression coefficients in **ridge regression** shrink toward zero, they do not necessarily end up being exactly equal to zero.

- In contrast, **lasso regression** may yield a value of exactly zero for some coefficients in the model.

- For this reason, lasso regression may be used as a variable selection algorithm. The variables with coefficients equal to zero may be discarded from future consideration, as they are not crucial for predicting the outcome.

---

### Tuning `\(\lambda\)`

We implement the same strategy for finding the optimal value of `\(\lambda\)` as we did for ridge regression.

### Using Lasso Regression to Predict the Readability Scores

Please review the following notebook to apply Lasso Regression to predict readability scores from all 768 features using the whole dataset.

[Predicting Readability Scores using the Lasso Regression](https://www.kaggle.com/code/uocoeeds/building-a-lasso-regression-model)

---

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>

<center>

# Elastic Net

---

Elastic net combines the two types of penalty into one by mixing them through a weighted average. The penalty term for the elastic net can be written as

`$$\lambda \left[ (1-\alpha)\sum_{p=1}^{P} \beta_p^2 + \alpha\sum_{p=1}^{P} |\beta_p|\right].$$`

The loss function for the elastic net to optimize:

`$$Loss = \sum_{i=1}^{N}\epsilon_{(i)}^2 + \lambda \left[ (1-\alpha)\sum_{p=1}^{P} \beta_p^2 + \alpha\sum_{p=1}^{P} |\beta_p|\right]$$`

- When `\(\alpha\)` is set to 0, this is equivalent to ridge regression.

- When `\(\alpha\)` equals 1, this is equivalent to lasso regression.

- When `\(\alpha\)` takes any value between 0 and 1, the term becomes a weighted average of the ridge penalty and the lasso penalty.

In Elastic Net, two hyperparameters will be tuned: `\(\alpha\)` and `\(\lambda\)`. We can consider many possible combinations of these two hyperparameters and try to find the optimal combination using 10-fold cross-validation.

---

### Using Elastic Net to Predict the Readability Scores

Please review the following notebook for applying Elastic Net to predict readability scores from all 768 features using the whole dataset.

[Predicting Readability Scores using the Elastic Net](https://www.kaggle.com/code/uocoeeds/building-a-regression-model-with-elastic-net)

## Using the Prediction Model for a New Text

Please review the following notebook for predicting the readability of a new text with the existing model objects.

[Predicting Readability Scores for a new text](https://www.kaggle.com/code/uocoeeds/using-the-prediction-models-for-a-new-text)
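---

### Tuning `\(\lambda\)` and `\(\alpha\)` with `caret`

The notebooks above show the full workflow. As a minimal sketch of the 10-fold cross-validation tuning described earlier (assuming a data frame `readability` that contains the outcome `target` and the 768 features; the object name and grid values here are hypothetical), the `caret` package can search a grid of `\(\alpha\)` and `\(\lambda\)` values with its built-in `glmnet` method:

.single[
.tiny2[

```r
# A sketch of tuning lambda and alpha with 10-fold cross-validation via caret
# (assumes a data frame `readability` with a `target` column and the predictors)

require(caret)
require(glmnet)

cv <- trainControl(method = 'cv', number = 10)

# Candidate values for the two hyperparameters
grid <- expand.grid(alpha  = seq(0, 1, by = 0.1),
                    lambda = seq(0.001, 0.2, by = 0.005))

fit <- train(target ~ .,
             data       = readability,
             method     = 'glmnet',
             trControl  = cv,
             tuneGrid   = grid,
             preProcess = c('center', 'scale'))

# The (alpha, lambda) pair with the best cross-validated RMSE
fit$bestTune
```

]]

Setting `alpha = 0` in the grid corresponds to ridge regression and `alpha = 1` to lasso regression, so the same sketch covers all three models discussed today.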