class: center, middle, inverse, title-slide .title[ # An Overview of Linear Regression and Bias-Variance Tradeoff in Predictive Modeling ] .author[ ### Cengiz Zopluoglu ] .institute[ ### College of Education, University of Oregon ] .date[ ### Oct 17, 2022
Eugene, OR
]

---

<style>

.blockquote {
  border-left: 5px solid #007935;
  background: #f9f9f9;
  padding: 10px;
  padding-left: 30px;
  margin-left: 16px;
  margin-right: 0;
  border-radius: 0px 4px 4px 0px;
}

#infobox {
  padding: 1em 1em 1em 4em;
  margin-bottom: 10px;
  border: 2px solid black;
  border-radius: 10px;
  background: #E6F6DC 5px center/3em no-repeat;
}

.centering[
  float: center;
]

.left-column2 {
  width: 50%;
  height: 92%;
  float: left;
  padding-top: 1em;
}

.right-column2 {
  width: 50%;
  float: right;
  padding-top: 1em;
}

.remark-code {
  font-size: 18px;
}

.tiny .remark-code { /*Change made here*/
  font-size: 75% !important;
}

.tiny2 .remark-code { /*Change made here*/
  font-size: 50% !important;
}

.indent {
  margin-left: 3em;
}

.single {
  line-height: 1 ;
}

.double {
  line-height: 2 ;
}

.title-slide h1 {
  padding-top: 0px;
  font-size: 40px;
  text-align: center;
  padding-bottom: 18px;
  margin-bottom: 18px;
}

.title-slide h2 {
  font-size: 30px;
  text-align: center;
  padding-top: 0px;
  margin-top: 0px;
}

.title-slide h3 {
  font-size: 30px;
  color: #26272A;
  text-align: center;
  text-shadow: none;
  padding: 10px;
  margin: 10px;
  line-height: 1.2;
}

</style>

### Today's Goals:

- An Overview of Linear Regression
  - Model Description
  - Model Estimation
  - Performance Evaluation

- Understanding the concept of the bias-variance tradeoff for predictive models

- How to balance the model bias and variance when building predictive models

---

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<center>

# An Overview of Linear Regression

---

- Prediction algorithms are classified into two main categories: *supervised* and *unsupervised*.

- **Supervised algorithms** are used when the dataset has an actual outcome of interest to predict (labels), and the goal is to build the "best" model predicting the outcome of interest.

- **Unsupervised algorithms** are used when the dataset doesn't have an outcome of interest. The goal is typically to identify similar groups of observations (rows of data) or similar groups of variables (columns of data) in the data.

- This course will cover several *supervised* algorithms. Linear regression is one of the most straightforward of them and the easiest to interpret.

---

## Model Description

The linear regression model with `\(P\)` predictors and an outcome variable `\(Y\)` can be written as

`$$Y = \beta_0 + \sum_{p=1}^{P} \beta_pX_{p} + \epsilon$$`

In this model,

- `\(Y\)` represents the observed value of the outcome for an observation,

- `\(X_{p}\)` represents the observed value of the `\(p^{th}\)` variable for the same observation,

- `\(\beta_p\)` is the associated model parameter for the `\(p^{th}\)` variable,

- and `\(\epsilon\)` is the model error (residual) for the observation.

This model includes only the main effects of each predictor.

---

- The previous model can be easily extended by including quadratic or higher-order polynomial terms for all (or a specific subset of) predictors.
- A model with the first-order, second-order, and third-order polynomial terms for all predictors can be written as

`$$Y = \beta_0 + \sum_{p=1}^{P} \beta_pX_{p} + \sum_{k=1}^{P} \beta_{k+P}X_{k}^2 + \sum_{m=1}^{P} \beta_{m+2P}X_{m}^3 + \epsilon$$`

- Example: A model with only main effects

`$$Y = \beta_0 + \beta_1X_{1} + \beta_2X_{2} + \beta_3X_{3} + \epsilon.$$`

- Example: A model with polynomial terms up to the 3rd degree added:

`$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \\ \beta_4X_1^2 + \beta_5X_2^2 + \beta_6X_3^2+ \\ \beta_{7}X_1^3 + \beta_{8}X_2^3 + \beta_{9}X_3^3 + \epsilon$$`

---

- The effect of predictor variables on the outcome variable is sometimes not additive.

- When the effect of one predictor on the response variable depends on the levels of another predictor, non-additive effects (a.k.a. interaction effects) can also be added to the model.

- The interaction effects can be first-order interactions (interaction between two variables, e.g., `\(X_1*X_2\)`), second-order interactions (e.g., `\(X_1*X_2*X_3\)`), or higher orders.

- For instance, the model below also adds the first-order interactions.

`$$Y = \beta_0 + \sum_{p=1}^{P} \beta_pX_{p} + \sum_{k=1}^{P} \beta_{k+P}X_{k}^2 + \sum_{m=1}^{P} \beta_{m+2P}X_{m}^3 + \sum_{i=1}^{P}\sum_{j=i+1}^{P}\beta_{i,j}X_iX_j + \epsilon$$`

- A model with both interaction terms and polynomial terms up to the 3rd degree added:

`$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \\ \beta_4X_1^2 + \beta_5X_2^2 + \beta_6X_3^2+ \\ \beta_{7}X_1^3 + \beta_{8}X_2^3 + \beta_{9}X_3^3+ \\ \beta_{1,2}X_1X_2+ \beta_{1,3}X_1X_3 + \beta_{2,3}X_2X_3 + \epsilon$$`

---

## Model Estimation

- Suppose that we would like to predict the target readability score for a given text from Feature 220.

- Note that there are 768 features extracted from the NLP model as numerical embeddings. For the sake of simplicity, we will only use one of them (Feature 220).

- Below is a scatterplot showing the relationship between these two variables for a random sample of 20 observations.

<img src="slide3_files/figure-html/unnamed-chunk-2-1.svg" style="display: block; margin: auto;" />

---

- Consider a simple linear regression model

  - Outcome: the readability score ( `\(Y\)` )

  - Predictor: Feature 220 ( `\(X\)` )

- Our regression model is

`$$Y = \beta_0 + \beta_1X + \epsilon$$`

- The set of coefficients, { `\(\beta_0,\beta_1\)` }, represents a straight line.

- We can choose any set of { `\(\beta_0,\beta_1\)` } coefficients and use it as our model.

- For instance, suppose we guesstimate that these coefficients are { `\(\beta_0,\beta_1\)` } = {-1.5,2}. Then, our model would be

`$$Y = -1.5 + 2X + \epsilon$$`

---

<img src="slide3_files/figure-html/unnamed-chunk-3-1.svg" style="display: block; margin: auto;" />

---

We can predict the target readability score for any observation in the dataset using this model.

`$$Y_{(1)} = -1.5 + 2X_{(1)} + \epsilon_{(1)}.$$`

`$$\hat{Y}_{(1)} = -1.5 + 2*(-0.139) = -1.778$$`

$$\hat{\epsilon}_{(1)} = -2.062 - (-1.778) = -0.284 $$

The discrepancy between the observed value and the model prediction is the model error (residual) for the first observation and is captured in the `\(\epsilon_{(1)}\)` term of the model.

<img src="slide3_files/figure-html/unnamed-chunk-4-1.svg" style="display: block; margin: auto;" />

---

We can do the same thing for the second observation.
`$$Y_{(2)} = -1.5 + 2X_{(2)} + \epsilon_{(2)}.$$`

`$$\hat{Y}_{(2)} = -1.5 + 2*(0.2176) = -1.065$$`

$$\hat{\epsilon}_{(2)} = 0.583 - (-1.065) = 1.648 $$

<img src="slide3_files/figure-html/unnamed-chunk-5-1.svg" style="display: block; margin: auto;" />

---

Using a similar approach, we can calculate the model error for every observation.

<img src="slide3_files/figure-html/unnamed-chunk-7-1.svg" style="display: block; margin: auto;" />

---

.single[

```r
d <- readability_sub[,c('V220','target')]

d$predicted <- -1.5 + 2*d$V220
d$error     <- d$target - d$predicted

round(d,3)
```

]

.pull-left[
.single[

```
     V220 target predicted  error
1  -0.139 -2.063    -1.778 -0.285
2   0.218  0.583    -1.065  1.647
3   0.058 -1.653    -1.384 -0.269
4   0.025 -0.874    -1.449  0.576
5   0.224 -1.740    -1.051 -0.689
6  -0.078 -3.640    -1.656 -1.984
7   0.434 -0.623    -0.632  0.009
8  -0.244 -0.344    -1.987  1.643
9   0.159 -1.123    -1.182  0.059
10  0.145 -0.999    -1.210  0.211
11  0.342 -0.877    -0.816 -0.061
12  0.252 -0.033    -0.996  0.963
13  0.035 -0.495    -1.429  0.934
14  0.364  0.125    -0.772  0.896
15  0.300  0.097    -0.900  0.997
16  0.198  0.384    -1.103  1.487
17  0.078 -0.581    -1.344  0.762
18  0.079 -0.343    -1.341  0.998
19  0.570 -0.391    -0.360 -0.031
20  0.345 -0.675    -0.810  0.134
```

]
]

.pull-right[

`$$SSR = \sum_{i=1}^{N}(Y_{(i)} - (\beta_0+\beta_1X_{(i)}))^2$$`

`$$SSR = \sum_{i=1}^{N}(Y_{(i)} - \hat{Y}_{(i)})^2$$`

`$$SSR = \sum_{i=1}^{N}(\epsilon_{(i)})^2$$`

$$ SSR = 17.767$$

For the set of coefficients { `\(\beta_0,\beta_1\)` } = {-1.5,2}, SSR is equal to 17.767. Could you find another set of coefficients that does a better job of prediction (smaller SSR)?

]

---

**Thought Experiment**

- Suppose the potential range for the intercept, `\(\beta_0\)`, is from -10 to 10, and we will consider every single possible value from -10 to 10 with increments of .1.

- Also, suppose the potential range for the slope, `\(\beta_1\)`, is from -5 to 5, and we will consider every single possible value from -5 to 5 with increments of .01.

- Note that every single possible combination of `\(\beta_0\)` and `\(\beta_1\)` indicates a different model.

- How many possible sets of coefficients are there? (201 candidate values for `\(\beta_0\)` times 1,001 candidate values for `\(\beta_1\)` = 201,201 sets in total.)

- Can we try every single possible set of coefficients and compute the SSR? See the sketch on the next slide.

---
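**Brute-Force Search**

- With only two coefficients and a modest grid, we can indeed try every candidate set and keep the one with the smallest SSR. Below is a minimal sketch of this brute-force search (assuming `readability_sub` is loaded as in the earlier chunks); the grid code is an illustration, not part of the original workflow.

.single[

```r
# All 201 x 1,001 = 201,201 candidate coefficient sets
grid <- expand.grid(b0 = seq(-10, 10, by = .1),
                    b1 = seq(-5, 5, by = .01))

# Compute the SSR for every candidate model (slow but simple)
grid$SSR <- apply(grid, 1, function(b) {
  sum((readability_sub$target - (b[1] + b[2]*readability_sub$V220))^2)
})

# The winning set of coefficients
grid[which.min(grid$SSR), ]
```

]

- With increments this fine, the winning pair should land very close to the analytic solution derived on the following slides (about {-1.1, 2.05}), at the cost of over 200,000 model evaluations.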
---

**Optimization**

- In our simple demonstration, estimating the best set of coefficients is an optimization problem with the following loss function:

`$$Loss = \sum_{i=1}^{N}(Y_{(i)} - (\beta_0+\beta_1X_{(i)}))^2$$`

- For a standard linear regression problem, a typical loss function is the **sum of squared residuals**, and we try to pick the set of coefficients that minimizes this quantity.

- Most optimization problems must be solved with numerical approximation.

- In the case of standard linear regression, however, we can obtain the exact solution mathematically, without any numerical approximation.

---

### Matrix Solution

We can find the best set of coefficients for most regression problems with a simple matrix operation.

First, let's rewrite the regression problem in matrix form.

<br>

`$$Y_{(1)} = \beta_0 + \beta_1X_{(1)} + \epsilon_{(1)}$$`
`$$Y_{(2)} = \beta_0 + \beta_1X_{(2)} + \epsilon_{(2)}$$`
`$$Y_{(3)} = \beta_0 + \beta_1X_{(3)} + \epsilon_{(3)}$$`
`$$Y_{(4)} = \beta_0 + \beta_1X_{(4)} + \epsilon_{(4)}$$`
`$$Y_{(5)} = \beta_0 + \beta_1X_{(5)} + \epsilon_{(5)}$$`
`$$Y_{(6)} = \beta_0 + \beta_1X_{(6)} + \epsilon_{(6)}$$`
`$$...$$`
`$$...$$`
`$$...$$`
`$$Y_{(20)} = \beta_0 + \beta_1X_{(20)} + \epsilon_{(20)}$$`

---

We can write all of these equations in a much simpler format as

$$ \mathbf{Y} = \mathbf{X} \boldsymbol{\beta} + \boldsymbol{\epsilon}, $$

- `\(\mathbf{Y}\)` is an N x 1 column vector of observed values for the outcome variable,

- `\(\mathbf{X}\)` is an N x (P+1) **design matrix** for the set of predictor variables, including an intercept term,

- `\(\boldsymbol{\beta}\)` is a (P+1) x 1 column vector of regression coefficients,

- and `\(\boldsymbol{\epsilon}\)` is an N x 1 column vector of residuals.

---

These matrix elements would look like the following for the problem above with our small dataset.

.pull-left[
![](regression_matrix.png)
]

.pull-right[
![](data_regression_matrix.gif)
]

---

It can be shown that the set of `\(\boldsymbol{\beta}\)` coefficients that yields the minimum sum of squared residuals for this model can be found analytically using the following matrix operation.

`$$\hat{\boldsymbol{\beta}} = (\mathbf{X^T}\mathbf{X})^{-1}\mathbf{X^T}\mathbf{Y}$$`

Let's apply this matrix operation to the previous example.

.single[

```r
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1,readability_sub$V220))

beta <- solve(t(X)%*%X)%*%t(X)%*%Y
beta
```

```
          [,1]
[1,] -1.108295
[2,]  2.048931
```

]

The best set of { `\(\beta_0,\beta_1\)` } coefficients to predict the readability score with the least amount of error using Feature 220 as a predictor is { `\(\beta_0,\beta_1\)` } = {-1.108, 2.049}.

These estimates are also known as the **least squares estimates**; under the standard (Gauss-Markov) assumptions, they are the best linear unbiased estimators (BLUE) for the given regression model.

---

Once we find the best estimates for the model coefficients, we can also calculate the model predicted values and the residual sum of squares for the given model and dataset.

$$\boldsymbol{\hat{Y}} = \mathbf{X} \hat{\boldsymbol{\beta}} $$

$$ \boldsymbol{\hat{\epsilon}} = \boldsymbol{Y} - \hat{\boldsymbol{Y}} $$

$$ SSR = \boldsymbol{\hat{\epsilon}^T} \boldsymbol{\hat{\epsilon}} $$

.single[

```r
Y_hat <- X%*%beta

E <- Y - Y_hat

SSR <- t(E)%*%E
SSR
```

```
         [,1]
[1,] 14.56567
```

]

---

- The matrix formulation generalizes to regression models with more than one predictor.
- When there are more predictors in the model, the dimensions of the design matrix, `\(\mathbf{X}\)`, and the regression coefficient matrix, `\(\boldsymbol{\beta}\)`, will be different, but the matrix calculations will be identical.

- Assume that we would like to expand our model by adding Feature 166 as a second predictor. Our new model will be

`$$Y_{(i)} = \beta_0 + \beta_1X_{1(i)} + \beta_2X_{2(i)} + \epsilon_{(i)}$$`

- `\(X_1\)` represents Feature 220 and `\(X_2\)` represents Feature 166.

- Now, we are looking for the best set of three coefficients, { `\(\beta_0, \beta_1, \beta_2\)` }, that would yield the least error in predicting the readability score.

---

.single[

```r
Y <- as.matrix(readability_sub$target)
X <- as.matrix(cbind(1,readability_sub[,c('V220','V166')]))
```

]

.pull-left[
.single[

```r
X
```

```
      1        V220        V166
 [1,] 1 -0.13908258  0.19028091
 [2,] 1  0.21764143  0.07101288
 [3,] 1  0.05812133  0.03993277
 [4,] 1  0.02526429  0.18845809
 [5,] 1  0.22430885  0.06200715
 [6,] 1 -0.07795373  0.10754109
 [7,] 1  0.43400714  0.12202360
 [8,] 1 -0.24364550  0.02454670
 [9,] 1  0.15893717  0.10422343
[10,] 1  0.14496475  0.02339597
[11,] 1  0.34222975  0.22065343
[12,] 1  0.25219145  0.10865010
[13,] 1  0.03532625  0.07549474
[14,] 1  0.36410633  0.18675801
[15,] 1  0.29988593  0.11618323
[16,] 1  0.19837037  0.08272671
[17,] 1  0.07807041  0.10235218
[18,] 1  0.07935690  0.11618605
[19,] 1  0.57000953 -0.02385423
[20,] 1  0.34523284  0.09299514
```

]
]

.pull-right[
.single[

```r
Y
```

```
             [,1]
 [1,] -2.06282395
 [2,]  0.58258607
 [3,] -1.65313060
 [4,] -0.87390681
 [5,] -1.74049148
 [6,] -3.63993555
 [7,] -0.62284268
 [8,] -0.34426981
 [9,] -1.12298826
[10,] -0.99857142
[11,] -0.87656742
[12,] -0.03304643
[13,] -0.49529863
[14,]  0.12453660
[15,]  0.09678258
[16,]  0.38422270
[17,] -0.58143038
[18,] -0.34324576
[19,] -0.39054205
[20,] -0.67548411
```

]
]

---

.single[

```r
beta <- solve(t(X)%*%X)%*%t(X)%*%Y
beta
```

```
           [,1]
1    -1.0068344
V220  2.0363683
V166 -0.9877414
```

<br>

```r
Y_hat <- X%*%beta

E <- Y - Y_hat

SSR <- t(E)%*%E
SSR
```

```
         [,1]
[1,] 14.49461
```

]

---

### `lm()` function

- While it is always exciting to learn the inner mechanics behind the numbers, it is handy to use existing packages and tools to do all these computations.

- A simple go-to function for fitting a linear regression to predict a continuous outcome is the `lm()` function.

**Model 1: Predicting readability scores from Feature 220**

.single[.tiny[

```r
mod <- lm(target ~ 1 + V220,data=readability_sub)
summary(mod)
```

```

Call:
lm(formula = target ~ 1 + V220, data = readability_sub)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.37192 -0.45499 -0.00234  0.56655  1.26324 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -1.1083     0.2662  -4.163 0.000584 ***
V220          2.0489     1.0356   1.978 0.063390 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8996 on 18 degrees of freedom
Multiple R-squared:  0.1786,  Adjusted R-squared:  0.133 
F-statistic: 3.914 on 1 and 18 DF,  p-value: 0.06339
```

]]

---

**Model 2: Predicting readability scores from Feature 220 and Feature 166**

.single[.tiny[

```r
mod <- lm(target ~ 1 + V220 + V166,data=readability_sub)
summary(mod)
```

```

Call:
lm(formula = target ~ 1 + V220 + V166, data = readability_sub)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3681 -0.4265  0.0019  0.5827  1.2164 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)  
(Intercept)  -1.0068     0.4452  -2.262   0.0371 *
V220          2.0364     1.0639   1.914   0.0726 .
V166         -0.9877     3.4214  -0.289   0.7763  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9234 on 17 degrees of freedom
Multiple R-squared:  0.1826,  Adjusted R-squared:  0.08646 
F-statistic: 1.899 on 2 and 17 DF,  p-value: 0.1801
```

]]

---

## Performance Evaluation

### Accuracy Metrics

- **Mean Absolute Error (MAE)**

$$ MAE = \frac{\sum_{i=1}^{N} \left | e_i \right |}{N}$$

- **Mean Squared Error (MSE)**

$$ MSE = \frac{\sum_{i=1}^{N} e_i^{2}}{N}$$

- **Root Mean Squared Error (RMSE)**

$$ RMSE = \sqrt{\frac{\sum_{i=1}^{N} e_i^{2}}{N}}$$

---

If we take predictions from the second model, we can calculate these performance metrics for the small demo dataset using the following code.

.single[.tiny2[

```r
mod <- lm(target ~ 1 + V220 + V166,data=readability_sub)

readability_sub$pred <- predict(mod)

readability_sub[,c('target','pred')]
```

```
        target       pred
1  -2.06282395 -1.4780061
2   0.58258607 -0.6337787
3  -1.65313060 -0.9279212
4  -0.87390681 -1.1415349
5  -1.74049148 -0.6113060
6  -3.63993555 -1.2717997
7  -0.62284268 -0.2435638
8  -0.34426981 -1.5272322
9  -1.12298826 -0.7861256
10 -0.99857142 -0.7347420
11 -0.87656742 -0.5278772
12 -0.03304643 -0.6005980
13 -0.49529863 -1.0094665
14  0.12453660 -0.4498485
15  0.09678258 -0.5109152
16  0.38422270 -0.6845919
17 -0.58143038 -0.9489518
18 -0.34324576 -0.9599963
19 -0.39054205  0.1774767
20 -0.67548411 -0.3956684
```

```r
# Mean absolute error

mean(abs(readability_sub$target - readability_sub$pred))
```

```
[1] 0.6983844
```

```r
# Mean squared error

mean((readability_sub$target - readability_sub$pred)^2)
```

```
[1] 0.7247307
```

```r
# Root mean squared error

sqrt(mean((readability_sub$target - readability_sub$pred)^2))
```

```
[1] 0.8513112
```

]]

---

### Proportional Reduction in Total Amount of Error (R-squared)

`\(SSR_{null}\)` : the sum of squared residuals when we use only the mean to predict the outcome (intercept-only model)

`$$SSR_{null} = \sum_{i=1}^{N} (y_i-\bar{y})^2$$`

In our case, if we use the mean to predict the outcome for each observation, the sum of squared errors would be equal to 17.733.

.single[

```r
y_bar <- mean(readability_sub$target)

ssr_null <- sum((readability_sub$target-y_bar)^2)

ssr_null
```

```
[1] 17.73309
```

]

---

Instead, if we rely on our model (Feature 220 + Feature 166) to predict the outcome, the sum of squared errors would be equal to 14.495.

```r
ssr_model <- sum((readability_sub$target - readability_sub$pred)^2)

ssr_model
```

```
[1] 14.49461
```

The total amount of prediction error is reduced by about 18.3% when we use our model instead of a simple null model. This proportional reduction can be used as a performance measure for any model.

`$$1-\frac{SSR_{model}}{SSR_{null}}$$`

```r
1 - (ssr_model/ssr_null)
```

```
[1] 0.1826235
```

---

<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<br>
<center>

# Bias - Variance Tradeoff for Predictive Models

---

## How many parameters does it take to draw an elephant?

.pull-left[
<img src="elephant.png" width="1015" style="display: block; margin: auto;" />
]

.pull-right[

- https://kourentzes.shinyapps.io/FitElephant/

- Increase the number of parameters in this model from 1 to 70, and the model predictions will start to look like an elephant.

- Explore to find the number of parameters you would use to model an elephant.

- Start manipulating **p** (the number of parameters) and examine how the model-predicted contour changes.

- Stop when you believe you can convince someone else that it looks like an elephant.

]
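---

## More Parameters Always Fit the Training Data Better

The elephant demo has a simple numeric analog. As an illustrative sketch (reusing the 20-observation `readability_sub` sample from the earlier slides; this chunk is not part of the original materials), we can fit polynomials of Feature 220 with increasing degree and watch the training SSR shrink with every added parameter, whether or not the extra terms reflect anything real:

.single[

```r
# Training SSR for polynomial models of degree 1 through 6
sapply(1:6, function(p) {
  fit <- lm(target ~ poly(V220, p), data = readability_sub)
  round(sum(resid(fit)^2), 3)
})
```

]

A more flexible model always chases the data it was fitted to more closely. The real question, taken up next, is what this flexibility does to predictions for *new* data.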
---

## Bias - Variance Tradeoff

When we use a model to predict an outcome, there are two primary sources of error:

- **Model Error**: No model is a complete representation of the truth underlying the observed data; every model is misspecified. Conceptually, we can define the model error as the distance between the model and the true generating mechanism underlying the data. Technically, for a given set of predictors, it is the difference between the expected value predicted by the model and the true value underlying the data. The term **bias** is also commonly used for model error.

- **Sampling Error**: Because the amount of data is fixed during any modeling process, parameter estimates become less stable across samples drawn from the same population as model complexity increases. Consequently, the variance of predictions increases (a predicted value varies more across different samples) for the same set of predictor values. The terms **estimation error** and **variance** are also used for sampling error.

The essence of any modeling activity is to balance these two sources of error and find a stable model (generalizable across different samples) with the least amount of bias.

---

## A Simple Monte Carlo Experiment

Suppose that there is a true generating model underlying some observed data. This model is

`$$y = e^{(x-0.3)^2} - 1 + \epsilon,$$`

- `\(x\)` is a predictor variable with equally spaced values ranging from 0 to 1,

- `\(\epsilon\)` is a random error component that follows a normal distribution with a mean of zero and a standard deviation of 0.1,

- and `\(y\)` is the outcome variable.

Suppose we simulate a small observed dataset of size 20 from this model. Then, we use a straightforward linear model to represent the observed simulated data.

$$ y = \beta_0 + \beta_1x + \epsilon $$

---

.single[.tiny[

```r
set.seed(09282021)

N = 20

x <- seq(0,1,length=20)

e <- rnorm(20,0,.1)

y <- exp((x-0.3)^2) - 1 + e

mod <- lm(y ~ 1 + x)
mod
```

```

Call:
lm(formula = y ~ 1 + x)

Coefficients:
(Intercept)            x  
   -0.00542      0.35272  
```

```r
round(predict(mod),3)
```

```
     1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16     17 
-0.005  0.013  0.032  0.050  0.069  0.087  0.106  0.125  0.143  0.162  0.180  0.199  0.217  0.236  0.254  0.273  0.292 
    18     19     20 
 0.310  0.329  0.347 
```

]]

---

.pull-left[
<img src="slide3_files/figure-html/unnamed-chunk-26-1.svg" style="display: block; margin: auto;" />
]

.pull-right[

- The solid line represents the true nature of the relationship between `\(x\)` and `\(y\)`.

- The observed data points do not lie on this line due to the random error component (noise).

- If we use a simple linear model, the gray dashed line represents the predicted relationship between `\(x\)` and `\(y\)`.

]

- This demonstration represents only a single dataset. Suppose that we repeat the same process ten times.

- We will produce ten different datasets of the same size (N=20) using the same predictor values, `\(x\)`, and the same true data-generating model. Then, we will fit a simple linear model to each of these ten datasets.

---

<img src="slide3_files/figure-html/unnamed-chunk-28-1.svg" style="display: block; margin: auto;" />

---

<br>

What can you say about the bias and variance of model predictions?
<table class=" lightable-minimal table table-striped table-hover table-condensed table-responsive" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto; font-size: 10px; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="empty-cells: hide;" colspan="2"></th> <th style="padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="10"><div style="border-bottom: 2px solid #00000050; ">Model Predicted Value Across 10 Replications</div></th> <th style="empty-cells: hide;" colspan="2"></th> </tr> <tr> <th style="text-align:right;"> x </th> <th style="text-align:right;"> y (TRUE) </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> <th style="text-align:right;"> 6 </th> <th style="text-align:right;"> 7 </th> <th style="text-align:right;"> 8 </th> <th style="text-align:right;"> 9 </th> <th style="text-align:right;"> 10 </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 0.094 </td> <td style="text-align:right;"> -0.005 </td> <td style="text-align:right;"> -0.144 </td> <td style="text-align:right;"> -0.112 </td> <td style="text-align:right;"> -0.154 </td> <td style="text-align:right;"> -0.065 </td> <td style="text-align:right;"> -0.093 </td> <td style="text-align:right;"> -0.114 </td> <td style="text-align:right;"> -0.133 </td> <td style="text-align:right;"> -0.140 </td> <td style="text-align:right;"> -0.080 </td> <td style="text-align:right;"> -0.107 </td> <td style="text-align:right;"> 0.047 </td> </tr> <tr> <td style="text-align:right;"> 0.053 </td> <td style="text-align:right;"> 0.063 </td> <td style="text-align:right;"> 0.013 </td> <td style="text-align:right;"> -0.113 </td> <td style="text-align:right;"> -0.084 </td> <td style="text-align:right;"> -0.121 </td> <td style="text-align:right;"> -0.035 </td> <td style="text-align:right;"> -0.065 </td> <td style="text-align:right;"> -0.087 </td> <td style="text-align:right;"> -0.108 </td> <td style="text-align:right;"> -0.113 </td> <td style="text-align:right;"> -0.051 </td> <td style="text-align:right;"> -0.079 </td> <td style="text-align:right;"> 0.044 </td> </tr> <tr> <td style="text-align:right;"> 0.105 </td> <td style="text-align:right;"> 0.039 </td> <td style="text-align:right;"> 0.032 </td> <td style="text-align:right;"> -0.081 </td> <td style="text-align:right;"> -0.056 </td> <td style="text-align:right;"> -0.088 </td> <td style="text-align:right;"> -0.006 </td> <td style="text-align:right;"> -0.037 </td> <td style="text-align:right;"> -0.060 </td> <td style="text-align:right;"> -0.082 </td> <td style="text-align:right;"> -0.085 </td> <td style="text-align:right;"> -0.022 </td> <td style="text-align:right;"> -0.051 </td> <td style="text-align:right;"> 0.041 </td> </tr> <tr> <td style="text-align:right;"> 0.158 </td> <td style="text-align:right;"> 0.020 </td> <td style="text-align:right;"> 0.050 </td> <td style="text-align:right;"> -0.049 </td> <td style="text-align:right;"> -0.028 </td> <td style="text-align:right;"> -0.056 </td> <td style="text-align:right;"> 0.023 </td> <td style="text-align:right;"> -0.008 </td> <td style="text-align:right;"> -0.033 </td> <td style="text-align:right;"> -0.056 </td> <td style="text-align:right;"> -0.058 </td> <td style="text-align:right;"> 0.007 
</td> <td style="text-align:right;"> -0.024 </td> <td style="text-align:right;"> 0.039 </td> </tr> <tr> <td style="text-align:right;"> 0.211 </td> <td style="text-align:right;"> 0.008 </td> <td style="text-align:right;"> 0.069 </td> <td style="text-align:right;"> -0.017 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> -0.023 </td> <td style="text-align:right;"> 0.052 </td> <td style="text-align:right;"> 0.020 </td> <td style="text-align:right;"> -0.006 </td> <td style="text-align:right;"> -0.030 </td> <td style="text-align:right;"> -0.031 </td> <td style="text-align:right;"> 0.036 </td> <td style="text-align:right;"> 0.004 </td> <td style="text-align:right;"> 0.036 </td> </tr> <tr> <td style="text-align:right;"> 0.263 </td> <td style="text-align:right;"> 0.001 </td> <td style="text-align:right;"> 0.087 </td> <td style="text-align:right;"> 0.015 </td> <td style="text-align:right;"> 0.027 </td> <td style="text-align:right;"> 0.010 </td> <td style="text-align:right;"> 0.082 </td> <td style="text-align:right;"> 0.048 </td> <td style="text-align:right;"> 0.021 </td> <td style="text-align:right;"> -0.004 </td> <td style="text-align:right;"> -0.004 </td> <td style="text-align:right;"> 0.065 </td> <td style="text-align:right;"> 0.031 </td> <td style="text-align:right;"> 0.034 </td> </tr> <tr> <td style="text-align:right;"> 0.316 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 0.106 </td> <td style="text-align:right;"> 0.046 </td> <td style="text-align:right;"> 0.055 </td> <td style="text-align:right;"> 0.042 </td> <td style="text-align:right;"> 0.111 </td> <td style="text-align:right;"> 0.076 </td> <td style="text-align:right;"> 0.048 </td> <td style="text-align:right;"> 0.022 </td> <td style="text-align:right;"> 0.024 </td> <td style="text-align:right;"> 0.094 </td> <td style="text-align:right;"> 0.059 </td> <td style="text-align:right;"> 0.032 </td> </tr> <tr> <td style="text-align:right;"> 0.368 </td> <td style="text-align:right;"> 0.005 </td> <td style="text-align:right;"> 0.125 </td> <td style="text-align:right;"> 0.078 </td> <td style="text-align:right;"> 0.083 </td> <td style="text-align:right;"> 0.075 </td> <td style="text-align:right;"> 0.140 </td> <td style="text-align:right;"> 0.105 </td> <td style="text-align:right;"> 0.075 </td> <td style="text-align:right;"> 0.048 </td> <td style="text-align:right;"> 0.051 </td> <td style="text-align:right;"> 0.124 </td> <td style="text-align:right;"> 0.087 </td> <td style="text-align:right;"> 0.031 </td> </tr> <tr> <td style="text-align:right;"> 0.421 </td> <td style="text-align:right;"> 0.015 </td> <td style="text-align:right;"> 0.143 </td> <td style="text-align:right;"> 0.110 </td> <td style="text-align:right;"> 0.111 </td> <td style="text-align:right;"> 0.108 </td> <td style="text-align:right;"> 0.169 </td> <td style="text-align:right;"> 0.133 </td> <td style="text-align:right;"> 0.102 </td> <td style="text-align:right;"> 0.074 </td> <td style="text-align:right;"> 0.078 </td> <td style="text-align:right;"> 0.153 </td> <td style="text-align:right;"> 0.114 </td> <td style="text-align:right;"> 0.030 </td> </tr> <tr> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.031 </td> <td style="text-align:right;"> 0.162 </td> <td style="text-align:right;"> 0.142 </td> <td style="text-align:right;"> 0.139 </td> <td style="text-align:right;"> 0.140 </td> <td style="text-align:right;"> 0.199 </td> <td style="text-align:right;"> 0.161 </td> <td style="text-align:right;"> 
0.129 </td> <td style="text-align:right;"> 0.099 </td> <td style="text-align:right;"> 0.105 </td> <td style="text-align:right;"> 0.182 </td> <td style="text-align:right;"> 0.142 </td> <td style="text-align:right;"> 0.030 </td> </tr> <tr> <td style="text-align:right;"> 0.526 </td> <td style="text-align:right;"> 0.053 </td> <td style="text-align:right;"> 0.180 </td> <td style="text-align:right;"> 0.174 </td> <td style="text-align:right;"> 0.167 </td> <td style="text-align:right;"> 0.173 </td> <td style="text-align:right;"> 0.228 </td> <td style="text-align:right;"> 0.189 </td> <td style="text-align:right;"> 0.156 </td> <td style="text-align:right;"> 0.125 </td> <td style="text-align:right;"> 0.133 </td> <td style="text-align:right;"> 0.211 </td> <td style="text-align:right;"> 0.169 </td> <td style="text-align:right;"> 0.031 </td> </tr> <tr> <td style="text-align:right;"> 0.579 </td> <td style="text-align:right;"> 0.081 </td> <td style="text-align:right;"> 0.199 </td> <td style="text-align:right;"> 0.205 </td> <td style="text-align:right;"> 0.195 </td> <td style="text-align:right;"> 0.206 </td> <td style="text-align:right;"> 0.257 </td> <td style="text-align:right;"> 0.218 </td> <td style="text-align:right;"> 0.183 </td> <td style="text-align:right;"> 0.151 </td> <td style="text-align:right;"> 0.160 </td> <td style="text-align:right;"> 0.240 </td> <td style="text-align:right;"> 0.197 </td> <td style="text-align:right;"> 0.031 </td> </tr> <tr> <td style="text-align:right;"> 0.632 </td> <td style="text-align:right;"> 0.116 </td> <td style="text-align:right;"> 0.217 </td> <td style="text-align:right;"> 0.237 </td> <td style="text-align:right;"> 0.223 </td> <td style="text-align:right;"> 0.239 </td> <td style="text-align:right;"> 0.286 </td> <td style="text-align:right;"> 0.246 </td> <td style="text-align:right;"> 0.209 </td> <td style="text-align:right;"> 0.177 </td> <td style="text-align:right;"> 0.187 </td> <td style="text-align:right;"> 0.269 </td> <td style="text-align:right;"> 0.225 </td> <td style="text-align:right;"> 0.033 </td> </tr> <tr> <td style="text-align:right;"> 0.684 </td> <td style="text-align:right;"> 0.159 </td> <td style="text-align:right;"> 0.236 </td> <td style="text-align:right;"> 0.269 </td> <td style="text-align:right;"> 0.251 </td> <td style="text-align:right;"> 0.271 </td> <td style="text-align:right;"> 0.316 </td> <td style="text-align:right;"> 0.274 </td> <td style="text-align:right;"> 0.236 </td> <td style="text-align:right;"> 0.203 </td> <td style="text-align:right;"> 0.214 </td> <td style="text-align:right;"> 0.298 </td> <td style="text-align:right;"> 0.252 </td> <td style="text-align:right;"> 0.035 </td> </tr> <tr> <td style="text-align:right;"> 0.737 </td> <td style="text-align:right;"> 0.210 </td> <td style="text-align:right;"> 0.254 </td> <td style="text-align:right;"> 0.301 </td> <td style="text-align:right;"> 0.279 </td> <td style="text-align:right;"> 0.304 </td> <td style="text-align:right;"> 0.345 </td> <td style="text-align:right;"> 0.302 </td> <td style="text-align:right;"> 0.263 </td> <td style="text-align:right;"> 0.229 </td> <td style="text-align:right;"> 0.242 </td> <td style="text-align:right;"> 0.327 </td> <td style="text-align:right;"> 0.280 </td> <td style="text-align:right;"> 0.037 </td> </tr> <tr> <td style="text-align:right;"> 0.789 </td> <td style="text-align:right;"> 0.271 </td> <td style="text-align:right;"> 0.273 </td> <td style="text-align:right;"> 0.333 </td> <td style="text-align:right;"> 0.307 </td> <td style="text-align:right;"> 0.337 
</td> <td style="text-align:right;"> 0.374 </td> <td style="text-align:right;"> 0.330 </td> <td style="text-align:right;"> 0.290 </td> <td style="text-align:right;"> 0.255 </td> <td style="text-align:right;"> 0.269 </td> <td style="text-align:right;"> 0.357 </td> <td style="text-align:right;"> 0.308 </td> <td style="text-align:right;"> 0.039 </td> </tr> <tr> <td style="text-align:right;"> 0.842 </td> <td style="text-align:right;"> 0.342 </td> <td style="text-align:right;"> 0.292 </td> <td style="text-align:right;"> 0.365 </td> <td style="text-align:right;"> 0.335 </td> <td style="text-align:right;"> 0.369 </td> <td style="text-align:right;"> 0.403 </td> <td style="text-align:right;"> 0.359 </td> <td style="text-align:right;"> 0.317 </td> <td style="text-align:right;"> 0.281 </td> <td style="text-align:right;"> 0.296 </td> <td style="text-align:right;"> 0.386 </td> <td style="text-align:right;"> 0.335 </td> <td style="text-align:right;"> 0.042 </td> </tr> <tr> <td style="text-align:right;"> 0.895 </td> <td style="text-align:right;"> 0.424 </td> <td style="text-align:right;"> 0.310 </td> <td style="text-align:right;"> 0.396 </td> <td style="text-align:right;"> 0.363 </td> <td style="text-align:right;"> 0.402 </td> <td style="text-align:right;"> 0.433 </td> <td style="text-align:right;"> 0.387 </td> <td style="text-align:right;"> 0.344 </td> <td style="text-align:right;"> 0.306 </td> <td style="text-align:right;"> 0.323 </td> <td style="text-align:right;"> 0.415 </td> <td style="text-align:right;"> 0.363 </td> <td style="text-align:right;"> 0.045 </td> </tr> <tr> <td style="text-align:right;"> 0.947 </td> <td style="text-align:right;"> 0.521 </td> <td style="text-align:right;"> 0.329 </td> <td style="text-align:right;"> 0.428 </td> <td style="text-align:right;"> 0.391 </td> <td style="text-align:right;"> 0.435 </td> <td style="text-align:right;"> 0.462 </td> <td style="text-align:right;"> 0.415 </td> <td style="text-align:right;"> 0.371 </td> <td style="text-align:right;"> 0.332 </td> <td style="text-align:right;"> 0.351 </td> <td style="text-align:right;"> 0.444 </td> <td style="text-align:right;"> 0.390 </td> <td style="text-align:right;"> 0.048 </td> </tr> <tr> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> 0.632 </td> <td style="text-align:right;"> 0.347 </td> <td style="text-align:right;"> 0.460 </td> <td style="text-align:right;"> 0.419 </td> <td style="text-align:right;"> 0.467 </td> <td style="text-align:right;"> 0.491 </td> <td style="text-align:right;"> 0.443 </td> <td style="text-align:right;"> 0.398 </td> <td style="text-align:right;"> 0.358 </td> <td style="text-align:right;"> 0.378 </td> <td style="text-align:right;"> 0.473 </td> <td style="text-align:right;"> 0.418 </td> <td style="text-align:right;"> 0.051 </td> </tr> </tbody> </table> --- Let's do the same experiment by fitting a more complex 6th-degree polynomial to the same datasets with the same underlying true model. $$ y = \beta_0 + \beta_1x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + \beta_6 x^6 + \epsilon $$ <img src="slide3_files/figure-html/unnamed-chunk-31-1.svg" style="display: block; margin: auto;" /> --- How are these numbers different than the numbers when we fitted a simple linear model? 
<table class=" lightable-minimal table table-striped table-hover table-condensed table-responsive" style='font-family: "Trebuchet MS", verdana, sans-serif; margin-left: auto; margin-right: auto; font-size: 10px; margin-left: auto; margin-right: auto;'> <thead> <tr> <th style="empty-cells: hide;" colspan="2"></th> <th style="padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="10"><div style="border-bottom: 2px solid #00000050; ">Model Predicted Value Across 10 Replications</div></th> <th style="empty-cells: hide;" colspan="2"></th> </tr> <tr> <th style="text-align:right;"> x </th> <th style="text-align:right;"> y (TRUE) </th> <th style="text-align:right;"> 1 </th> <th style="text-align:right;"> 2 </th> <th style="text-align:right;"> 3 </th> <th style="text-align:right;"> 4 </th> <th style="text-align:right;"> 5 </th> <th style="text-align:right;"> 6 </th> <th style="text-align:right;"> 7 </th> <th style="text-align:right;"> 8 </th> <th style="text-align:right;"> 9 </th> <th style="text-align:right;"> 10 </th> <th style="text-align:right;"> Mean </th> <th style="text-align:right;"> SD </th> </tr> </thead> <tbody> <tr> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 0.094 </td> <td style="text-align:right;"> 0.192 </td> <td style="text-align:right;"> 0.166 </td> <td style="text-align:right;"> 0.164 </td> <td style="text-align:right;"> 0.012 </td> <td style="text-align:right;"> -0.038 </td> <td style="text-align:right;"> -0.081 </td> <td style="text-align:right;"> 0.211 </td> <td style="text-align:right;"> 0.087 </td> <td style="text-align:right;"> 0.060 </td> <td style="text-align:right;"> -0.147 </td> <td style="text-align:right;"> 0.086 </td> <td style="text-align:right;"> 0.105 </td> </tr> <tr> <td style="text-align:right;"> 0.053 </td> <td style="text-align:right;"> 0.063 </td> <td style="text-align:right;"> 0.102 </td> <td style="text-align:right;"> -0.019 </td> <td style="text-align:right;"> 0.054 </td> <td style="text-align:right;"> 0.012 </td> <td style="text-align:right;"> 0.127 </td> <td style="text-align:right;"> 0.040 </td> <td style="text-align:right;"> 0.043 </td> <td style="text-align:right;"> 0.049 </td> <td style="text-align:right;"> 0.019 </td> <td style="text-align:right;"> 0.122 </td> <td style="text-align:right;"> 0.048 </td> <td style="text-align:right;"> 0.045 </td> </tr> <tr> <td style="text-align:right;"> 0.105 </td> <td style="text-align:right;"> 0.039 </td> <td style="text-align:right;"> 0.107 </td> <td style="text-align:right;"> -0.071 </td> <td style="text-align:right;"> -0.001 </td> <td style="text-align:right;"> -0.005 </td> <td style="text-align:right;"> 0.155 </td> <td style="text-align:right;"> 0.099 </td> <td style="text-align:right;"> -0.036 </td> <td style="text-align:right;"> -0.015 </td> <td style="text-align:right;"> -0.007 </td> <td style="text-align:right;"> 0.160 </td> <td style="text-align:right;"> 0.025 </td> <td style="text-align:right;"> 0.076 </td> </tr> <tr> <td style="text-align:right;"> 0.158 </td> <td style="text-align:right;"> 0.020 </td> <td style="text-align:right;"> 0.130 </td> <td style="text-align:right;"> -0.061 </td> <td style="text-align:right;"> -0.026 </td> <td style="text-align:right;"> -0.023 </td> <td style="text-align:right;"> 0.121 </td> <td style="text-align:right;"> 0.112 </td> <td style="text-align:right;"> -0.069 </td> <td style="text-align:right;"> -0.068 </td> <td style="text-align:right;"> -0.028 </td> <td style="text-align:right;"> 0.100 </td> <td 
style="text-align:right;"> 0.010 </td> <td style="text-align:right;"> 0.085 </td> </tr> <tr> <td style="text-align:right;"> 0.211 </td> <td style="text-align:right;"> 0.008 </td> <td style="text-align:right;"> 0.133 </td> <td style="text-align:right;"> -0.031 </td> <td style="text-align:right;"> -0.036 </td> <td style="text-align:right;"> -0.033 </td> <td style="text-align:right;"> 0.072 </td> <td style="text-align:right;"> 0.096 </td> <td style="text-align:right;"> -0.082 </td> <td style="text-align:right;"> -0.094 </td> <td style="text-align:right;"> -0.047 </td> <td style="text-align:right;"> 0.024 </td> <td style="text-align:right;"> -0.003 </td> <td style="text-align:right;"> 0.082 </td> </tr> <tr> <td style="text-align:right;"> 0.263 </td> <td style="text-align:right;"> 0.001 </td> <td style="text-align:right;"> 0.107 </td> <td style="text-align:right;"> -0.006 </td> <td style="text-align:right;"> -0.039 </td> <td style="text-align:right;"> -0.031 </td> <td style="text-align:right;"> 0.036 </td> <td style="text-align:right;"> 0.068 </td> <td style="text-align:right;"> -0.084 </td> <td style="text-align:right;"> -0.091 </td> <td style="text-align:right;"> -0.062 </td> <td style="text-align:right;"> -0.024 </td> <td style="text-align:right;"> -0.011 </td> <td style="text-align:right;"> 0.069 </td> </tr> <tr> <td style="text-align:right;"> 0.316 </td> <td style="text-align:right;"> 0.000 </td> <td style="text-align:right;"> 0.059 </td> <td style="text-align:right;"> 0.008 </td> <td style="text-align:right;"> -0.036 </td> <td style="text-align:right;"> -0.019 </td> <td style="text-align:right;"> 0.023 </td> <td style="text-align:right;"> 0.039 </td> <td style="text-align:right;"> -0.078 </td> <td style="text-align:right;"> -0.064 </td> <td style="text-align:right;"> -0.068 </td> <td style="text-align:right;"> -0.030 </td> <td style="text-align:right;"> -0.015 </td> <td style="text-align:right;"> 0.050 </td> </tr> <tr> <td style="text-align:right;"> 0.368 </td> <td style="text-align:right;"> 0.005 </td> <td style="text-align:right;"> 0.008 </td> <td style="text-align:right;"> 0.010 </td> <td style="text-align:right;"> -0.026 </td> <td style="text-align:right;"> 0.002 </td> <td style="text-align:right;"> 0.032 </td> <td style="text-align:right;"> 0.021 </td> <td style="text-align:right;"> -0.061 </td> <td style="text-align:right;"> -0.024 </td> <td style="text-align:right;"> -0.061 </td> <td style="text-align:right;"> 0.004 </td> <td style="text-align:right;"> -0.011 </td> <td style="text-align:right;"> 0.034 </td> </tr> <tr> <td style="text-align:right;"> 0.421 </td> <td style="text-align:right;"> 0.015 </td> <td style="text-align:right;"> -0.030 </td> <td style="text-align:right;"> 0.008 </td> <td style="text-align:right;"> -0.009 </td> <td style="text-align:right;"> 0.029 </td> <td style="text-align:right;"> 0.060 </td> <td style="text-align:right;"> 0.018 </td> <td style="text-align:right;"> -0.029 </td> <td style="text-align:right;"> 0.018 </td> <td style="text-align:right;"> -0.040 </td> <td style="text-align:right;"> 0.064 </td> <td style="text-align:right;"> 0.003 </td> <td style="text-align:right;"> 0.033 </td> </tr> <tr> <td style="text-align:right;"> 0.474 </td> <td style="text-align:right;"> 0.031 </td> <td style="text-align:right;"> -0.041 </td> <td style="text-align:right;"> 0.011 </td> <td style="text-align:right;"> 0.018 </td> <td style="text-align:right;"> 0.058 </td> <td style="text-align:right;"> 0.097 </td> <td style="text-align:right;"> 0.032 </td> <td 
style="text-align:right;"> 0.017 </td> <td style="text-align:right;"> 0.052 </td> <td style="text-align:right;"> -0.005 </td> <td style="text-align:right;"> 0.134 </td> <td style="text-align:right;"> 0.027 </td> <td style="text-align:right;"> 0.040 </td> </tr> <tr> <td style="text-align:right;"> 0.526 </td> <td style="text-align:right;"> 0.053 </td> <td style="text-align:right;"> -0.016 </td> <td style="text-align:right;"> 0.026 </td> <td style="text-align:right;"> 0.056 </td> <td style="text-align:right;"> 0.087 </td> <td style="text-align:right;"> 0.137 </td> <td style="text-align:right;"> 0.061 </td> <td style="text-align:right;"> 0.078 </td> <td style="text-align:right;"> 0.076 </td> <td style="text-align:right;"> 0.042 </td> <td style="text-align:right;"> 0.198 </td> <td style="text-align:right;"> 0.061 </td> <td style="text-align:right;"> 0.043 </td> </tr> <tr> <td style="text-align:right;"> 0.579 </td> <td style="text-align:right;"> 0.081 </td> <td style="text-align:right;"> 0.041 </td> <td style="text-align:right;"> 0.059 </td> <td style="text-align:right;"> 0.102 </td> <td style="text-align:right;"> 0.115 </td> <td style="text-align:right;"> 0.175 </td> <td style="text-align:right;"> 0.102 </td> <td style="text-align:right;"> 0.148 </td> <td style="text-align:right;"> 0.088 </td> <td style="text-align:right;"> 0.094 </td> <td style="text-align:right;"> 0.246 </td> <td style="text-align:right;"> 0.103 </td> <td style="text-align:right;"> 0.041 </td> </tr> <tr> <td style="text-align:right;"> 0.632 </td> <td style="text-align:right;"> 0.116 </td> <td style="text-align:right;"> 0.121 </td> <td style="text-align:right;"> 0.113 </td> <td style="text-align:right;"> 0.154 </td> <td style="text-align:right;"> 0.142 </td> <td style="text-align:right;"> 0.209 </td> <td style="text-align:right;"> 0.148 </td> <td style="text-align:right;"> 0.220 </td> <td style="text-align:right;"> 0.094 </td> <td style="text-align:right;"> 0.145 </td> <td style="text-align:right;"> 0.275 </td> <td style="text-align:right;"> 0.150 </td> <td style="text-align:right;"> 0.042 </td> </tr> <tr> <td style="text-align:right;"> 0.684 </td> <td style="text-align:right;"> 0.159 </td> <td style="text-align:right;"> 0.207 </td> <td style="text-align:right;"> 0.186 </td> <td style="text-align:right;"> 0.209 </td> <td style="text-align:right;"> 0.172 </td> <td style="text-align:right;"> 0.243 </td> <td style="text-align:right;"> 0.195 </td> <td style="text-align:right;"> 0.283 </td> <td style="text-align:right;"> 0.103 </td> <td style="text-align:right;"> 0.190 </td> <td style="text-align:right;"> 0.287 </td> <td style="text-align:right;"> 0.198 </td> <td style="text-align:right;"> 0.049 </td> </tr> <tr> <td style="text-align:right;"> 0.737 </td> <td style="text-align:right;"> 0.210 </td> <td style="text-align:right;"> 0.281 </td> <td style="text-align:right;"> 0.271 </td> <td style="text-align:right;"> 0.263 </td> <td style="text-align:right;"> 0.210 </td> <td style="text-align:right;"> 0.280 </td> <td style="text-align:right;"> 0.237 </td> <td style="text-align:right;"> 0.329 </td> <td style="text-align:right;"> 0.128 </td> <td style="text-align:right;"> 0.223 </td> <td style="text-align:right;"> 0.294 </td> <td style="text-align:right;"> 0.247 </td> <td style="text-align:right;"> 0.057 </td> </tr> <tr> <td style="text-align:right;"> 0.789 </td> <td style="text-align:right;"> 0.271 </td> <td style="text-align:right;"> 0.329 </td> <td style="text-align:right;"> 0.360 </td> <td style="text-align:right;"> 0.314 </td> <td 
style="text-align:right;"> 0.262 </td> <td style="text-align:right;"> 0.330 </td> <td style="text-align:right;"> 0.278 </td> <td style="text-align:right;"> 0.352 </td> <td style="text-align:right;"> 0.178 </td> <td style="text-align:right;"> 0.246 </td> <td style="text-align:right;"> 0.310 </td> <td style="text-align:right;"> 0.294 </td> <td style="text-align:right;"> 0.059 </td> </tr> <tr> <td style="text-align:right;"> 0.842 </td> <td style="text-align:right;"> 0.342 </td> <td style="text-align:right;"> 0.347 </td> <td style="text-align:right;"> 0.443 </td> <td style="text-align:right;"> 0.364 </td> <td style="text-align:right;"> 0.336 </td> <td style="text-align:right;"> 0.399 </td> <td style="text-align:right;"> 0.325 </td> <td style="text-align:right;"> 0.355 </td> <td style="text-align:right;"> 0.260 </td> <td style="text-align:right;"> 0.270 </td> <td style="text-align:right;"> 0.350 </td> <td style="text-align:right;"> 0.345 </td> <td style="text-align:right;"> 0.057 </td> </tr> <tr> <td style="text-align:right;"> 0.895 </td> <td style="text-align:right;"> 0.424 </td> <td style="text-align:right;"> 0.354 </td> <td style="text-align:right;"> 0.512 </td> <td style="text-align:right;"> 0.420 </td> <td style="text-align:right;"> 0.442 </td> <td style="text-align:right;"> 0.491 </td> <td style="text-align:right;"> 0.399 </td> <td style="text-align:right;"> 0.355 </td> <td style="text-align:right;"> 0.373 </td> <td style="text-align:right;"> 0.318 </td> <td style="text-align:right;"> 0.421 </td> <td style="text-align:right;"> 0.407 </td> <td style="text-align:right;"> 0.065 </td> </tr> <tr> <td style="text-align:right;"> 0.947 </td> <td style="text-align:right;"> 0.521 </td> <td style="text-align:right;"> 0.402 </td> <td style="text-align:right;"> 0.565 </td> <td style="text-align:right;"> 0.497 </td> <td style="text-align:right;"> 0.587 </td> <td style="text-align:right;"> 0.602 </td> <td style="text-align:right;"> 0.533 </td> <td style="text-align:right;"> 0.385 </td> <td style="text-align:right;"> 0.499 </td> <td style="text-align:right;"> 0.428 </td> <td style="text-align:right;"> 0.520 </td> <td style="text-align:right;"> 0.500 </td> <td style="text-align:right;"> 0.080 </td> </tr> <tr> <td style="text-align:right;"> 1.000 </td> <td style="text-align:right;"> 0.632 </td> <td style="text-align:right;"> 0.586 </td> <td style="text-align:right;"> 0.606 </td> <td style="text-align:right;"> 0.622 </td> <td style="text-align:right;"> 0.781 </td> <td style="text-align:right;"> 0.715 </td> <td style="text-align:right;"> 0.781 </td> <td style="text-align:right;"> 0.507 </td> <td style="text-align:right;"> 0.601 </td> <td style="text-align:right;"> 0.663 </td> <td style="text-align:right;"> 0.622 </td> <td style="text-align:right;"> 0.651 </td> <td style="text-align:right;"> 0.092 </td> </tr> </tbody> </table> --- <img src="slide3_files/figure-html/unnamed-chunk-33-1.svg" style="display: block; margin: auto;" /> --- We can expand our experiment and examine a range of models from linear to the 6th-degree polynomial. 
`$$y = \beta_0 + \beta_1x + \epsilon$$`

`$$y = \beta_0 + \beta_1x + \beta_2 x^2 + \epsilon$$`

`$$y = \beta_0 + \beta_1x + \beta_2 x^2 + \beta_3 x^3 + \epsilon$$`

`$$y = \beta_0 + \beta_1x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \epsilon$$`

`$$y = \beta_0 + \beta_1x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + \epsilon$$`

`$$y = \beta_0 + \beta_1x + \beta_2 x^2 + \beta_3 x^3 + \beta_4 x^4 + \beta_5 x^5 + \beta_6 x^6 + \epsilon$$`

---

<img src="slide3_files/figure-html/unnamed-chunk-36-1.svg" style="display: block; margin: auto;" />

If you had to choose one of these models for one of the simulated datasets, which one would you choose? Why?

---

Could you find the sweet spot that provides a reasonable representation of the data and a reasonable amount of generalizability (consistent/stable predictions for observations other than the ones you used to develop the model)?

<img src="bias-variance_reduced.png" width="567" style="display: block; margin: auto;" />

---

## Use of Resampling Methods to Balance Model Bias and Model Variance

- Certain strategies are applied to avoid overfitting and find the sweet spot between model bias and model variance.

- This process is nicely illustrated in [Boehmke and Greenwell (2020, Figure 2.1)](https://bradleyboehmke.github.io/HOML/process.html)

<img src="modeling_process2.png" width="2391" style="display: block; margin: auto;" />

---

- **Resampling strategies**:

  - 80-20 or 70-30 splits for training vs. test data
  - Simple random sampling
  - Stratified sampling
  - Down-sampling and up-sampling

- **k-fold** cross-validation when training the model:

  - The training sample is randomly partitioned into *k* sets of equal size.
  - A model is fitted to *k*-1 folds, and the remaining fold is used to test the model performance.
  - This is repeated *k* times, treating a different fold as the hold-out set each time.
  - The performance metric is aggregated (e.g., averaged) across the *k* replications to obtain a *k*-fold cross-validation estimate of that metric.

- Be mindful of data leakage with longitudinal data when creating training/test splits and the partitions for k-fold cross-validation.

---

Please review the following notebook for a demonstration of how to apply the concepts discussed so far. The notebook builds a naive model to predict the readability score for a given text; a minimal cross-validation sketch also follows on the next slide.

[Building a linear prediction model with and without cross-validation using the caret package](https://www.kaggle.com/code/uocoeeds/building-a-prediction-model-with-cross-validation)
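---

Below is a minimal sketch of the kind of workflow the notebook demonstrates: a simple random 80-20 split followed by 10-fold cross-validation with the `caret` package. The object names (`readability`, with outcome `target`) are assumptions based on the examples in these slides, not code copied from the notebook.

.single[

```r
library(caret)

# Simple random 80-20 train/test split
# (assumes `readability` is the full dataset with outcome `target`)
set.seed(10172022)
loc   <- sample(1:nrow(readability), round(nrow(readability)*0.8))
train <- readability[ loc, ]
test  <- readability[-loc, ]

# 10-fold cross-validation settings
cv <- trainControl(method = 'cv', number = 10)

# Linear regression predicting the readability score
fit <- train(target ~ ., data = train, method = 'lm', trControl = cv)

fit$results   # cross-validated RMSE, R-squared, and MAE

# Performance on the hold-out test set
pred <- predict(fit, test)
sqrt(mean((test$target - pred)^2))   # test RMSE
```

]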