Processing Categorical Predictors
One-hot encoding (dummy variables)
Label encoding
Polynomial Contrasts
Processing Cyclic Variables
Processing Continuous Variables
Centering and Scaling
Box-Cox transformation
Logit transformation
Polynomial basis expansions
Handling Missing Data
The recipes package
Processing Text Data with pre-trained NLP models
A dummy variable is a synthetic variable with two values (0 and 1) that indicates membership in a group.
When there is a nominal variable with N levels, it is typical to create N dummy variables to represent the information in the nominal variable.
Each dummy variable represents membership to one of the levels in the nominal variable.
These dummy variables can be used as features in predictive models.
In the simplest case, consider the variable Race in the Recidivism dataset with two levels: Black and White. We can create two dummy variables to represent the information in this variable.
Race | Dummy Variable 1 | Dummy Variable 2 |
---|---|---|
Black | 1 | 0 |
White | 0 | 1 |
Variable Prison_Offense has five categories: Violent/Sex, Violent/Non-Sex, Property, Drug, and Other.
We can create five dummy variables using the following coding scheme.
Prison_Offense | Dummy Variable 1 | Dummy Variable 2 | Dummy Variable 3 | Dummy Variable 4 | Dummy Variable 5 |
---|---|---|---|---|---|
Violent/Sex | 1 | 0 | 0 | 0 | 0 |
Violent/Non-Sex | 0 | 1 | 0 | 0 | 0 |
Property | 0 | 0 | 1 | 0 | 0 |
Drug | 0 | 0 | 0 | 1 | 0 |
Other | 0 | 0 | 0 | 0 | 1 |
Race | Race_Black | Race_White |
---|---|---|
BLACK | 1 | 0 |
BLACK | 1 | 0 |
BLACK | 1 | 0 |
WHITE | 0 | 1 |
WHITE | 0 | 1 |
BLACK | 1 | 0 |
BLACK | 1 | 0 |
WHITE | 0 | 1 |
WHITE | 0 | 1 |
WHITE | 0 | 1 |
Prison Offense | Drug | Property | Violent/Non-Sex | Violent/Sex | Other |
---|---|---|---|---|---|
Drug | 1 | 0 | 0 | 0 | 0 |
Violent/Non-Sex | 0 | 0 | 1 | 0 | 0 |
Drug | 1 | 0 | 0 | 0 | 0 |
Property | 0 | 1 | 0 | 0 | 0 |
Property | 0 | 1 | 0 | 0 | 0 |
Drug | 1 | 0 | 0 | 0 | 0 |
Violent/Non-Sex | 0 | 0 | 1 | 0 | 0 |
Property | 0 | 1 | 0 | 0 | 0 |
Violent/Non-Sex | 0 | 0 | 1 | 0 | 0 |
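As a minimal sketch, base R's model.matrix() can generate these dummy variables from a small hypothetical data frame (dropping the intercept with `- 1` keeps one column per level):

```r
# Hypothetical data frame with the two nominal predictors used above
d <- data.frame(Race           = c("BLACK", "WHITE", "BLACK", "WHITE"),
                Prison_Offense = c("Drug", "Property", "Violent/Non-Sex", "Drug"))

# One-hot encoding: removing the intercept (- 1) yields one dummy column per level
race_dummies    <- model.matrix(~ Race - 1, data = d)
offense_dummies <- model.matrix(~ Prison_Offense - 1, data = d)

race_dummies
offense_dummies
```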
When the variable of interest is ordinal and there is a hierarchy among the levels, an alternative is to assign a numerical value to each category.
Consider the variable Age_At_Release in the Recidivism dataset. It is coded as 7 different age intervals in the dataset: 18-22, 23-27, 28-32, 33-37, 38-42, 43-47, 48 or older.
Age_At_Release | Encoding 1 | Encoding 2 |
---|---|---|
18-22 | 20 | 1 |
23-27 | 25 | 2 |
28-32 | 30 | 3 |
33-37 | 35 | 4 |
38-42 | 40 | 5 |
43-47 | 45 | 6 |
48 or older | 60 | 7 |
Age_at_Release | Encoding 1 | Encoding 2 |
---|---|---|
43-47 | 45 | 6 |
33-37 | 35 | 4 |
48 or older | 60 | 7 |
38-42 | 40 | 5 |
38-42 | 40 | 5 |
48 or older | 60 | 7 |
38-42 | 40 | 5 |
43-47 | 45 | 6 |
48 or older | 60 | 7 |
33-37 | 35 | 4 |
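A minimal sketch of both encodings using named lookup vectors in base R (the sample values below are made up):

```r
# Hypothetical sample of Age_At_Release values
age <- c("43-47", "33-37", "48 or older", "38-42")

# Encoding 1: interval midpoints (60 is used for the open-ended "48 or older" category)
encoding1 <- c("18-22" = 20, "23-27" = 25, "28-32" = 30, "33-37" = 35,
               "38-42" = 40, "43-47" = 45, "48 or older" = 60)

# Encoding 2: ordinal ranks
encoding2 <- c("18-22" = 1, "23-27" = 2, "28-32" = 3, "33-37" = 4,
               "38-42" = 5, "43-47" = 6, "48 or older" = 7)

data.frame(Age_at_Release = age,
           Encoding_1     = unname(encoding1[age]),
           Encoding_2     = unname(encoding2[age]))
```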
Another example would be the variable Education Level in the Recidivism dataset.
How would you encode this variable?
Education Level | Encoding 1 | Encoding 2 |
---|---|---|
Less than a HS diploma | | |
HS diploma | | |
At least some college | | |
Another way of encoding an ordinal variable is to use polynomial contrasts.
The polynomial contrasts may be helpful if one wants to explore whether or not there is a linear, quadratic, cubic, etc., relationship between the predictor variable and outcome variable.
If there are N levels, one can have polynomial terms up to the (N-1)th degree.
The polynomial terms are orthonormal vectors:
Age at Release | Linear Term | Quadratic Term | Cubic Term | 4th Degree Term | 5th Degree Term | 6th Degree Term |
---|---|---|---|---|---|---|
18-22 | -0.567 | 0.546 | -0.408 | 0.242 | -0.109 | 0.033 |
23-27 | -0.378 | 0 | 0.408 | -0.564 | 0.436 | -0.197 |
28-32 | -0.189 | -0.327 | 0.408 | 0.081 | -0.546 | 0.493 |
33-37 | 0 | -0.436 | 0 | 0.483 | 0 | -0.658 |
38-42 | 0.189 | -0.327 | -0.408 | 0.081 | 0.546 | 0.493 |
43-47 | 0.378 | 0 | -0.408 | -0.564 | -0.436 | -0.197 |
48 or older | 0.567 | 0.546 | 0.408 | 0.242 | 0.109 | 0.033 |
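These contrasts can be generated directly in base R with contr.poly(); for the seven levels of Age_At_Release:

```r
# Orthogonal polynomial contrasts for 7 ordered levels (linear through 6th-degree terms)
round(contr.poly(7), 3)

# Ordered factors use these contrasts by default when R builds a model matrix
options("contrasts")
```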
Some variables are cyclic by nature (e.g., month, day of week, hour).
Dummy variables or numerical encoding does not necessarily capture the information in these variables in the most meaningful way.
For cyclic variables, it may be more meaningful to create two new variables using sine and cosine transformations, as shown below:
$x_1 = \sin\left(\frac{2\pi x}{\max(x)}\right)$
$x_2 = \cos\left(\frac{2\pi x}{\max(x)}\right)$
Day | x | x1 (sine) | x2 (cosine) |
---|---|---|---|
Mon | 1 | 0.782 | 0.623 |
Tue | 2 | 0.975 | -0.223 |
Wed | 3 | 0.434 | -0.901 |
Thu | 4 | -0.434 | -0.901 |
Fri | 5 | -0.975 | -0.223 |
Sat | 6 | -0.782 | 0.623 |
Sun | 7 | 0 | 1 |
   hour     x1     x2
1     1  0.259  0.966
2     2  0.500  0.866
3     3  0.707  0.707
4     4  0.866  0.500
5     5  0.966  0.259
6     6  1.000  0.000
7     7  0.966 -0.259
8     8  0.866 -0.500
9     9  0.707 -0.707
10   10  0.500 -0.866
11   11  0.259 -0.966
12   12  0.000 -1.000
13   13 -0.259 -0.966
14   14 -0.500 -0.866
15   15 -0.707 -0.707
16   16 -0.866 -0.500
17   17 -0.966 -0.259
18   18 -1.000  0.000
19   19 -0.966  0.259
20   20 -0.866  0.500
21   21 -0.707  0.707
22   22 -0.500  0.866
23   23 -0.259  0.966
24   24  0.000  1.000
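A minimal sketch that reproduces the hour-of-day encoding above:

```r
# Sine/cosine encoding of a cyclic variable (hour of day, 1-24)
hour <- 1:24

x1 <- round(sin(2 * pi * hour / max(hour)), 3)  # sine term
x2 <- round(cos(2 * pi * hour / max(hour)), 3)  # cosine term

data.frame(hour, x1, x2)
```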
Centering a variable is done by subtracting the variable's mean from every value:
$X_{centered} = X - \bar{X}$
Centering ensures that the mean of the centered variable equals zero.
Scaling a variable is done by dividing the value of each observation by the variable's standard deviation:
$X_{scaled} = \frac{X}{\sigma_X}$
Scaling ensures that the standard deviation of the scaled variable equals 1.
When centering and scaling are both applied, it is called standardization.
$z_X = \frac{X - \bar{X}}{\sigma_X}$
When we standardize a variable, we ensure that its mean is equal to zero and variance is equal to 1.
Standardizing outcome and predictor variables may be critical and necessary for specific models (e.g., K-nearest neighbor, support vector machines, penalized regression), but it is not always necessary for other models (e.g., decision tree models).
Standardizing a variable only changes the first and second moments of a distribution (mean and variance)
Standardizing a variable doesn’t change the third and fourth moments of a distribution (skewness and kurtosis).
Some people in the data science field use the term normalization, but what they actually mean is standardization.
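A minimal sketch of centering, scaling, and standardizing in base R (x below is simulated for illustration):

```r
set.seed(123)
x <- rnorm(50, mean = 10, sd = 3)    # simulated predictor

x_centered     <- x - mean(x)               # mean becomes 0
x_scaled       <- x / sd(x)                 # sd becomes 1
x_standardized <- (x - mean(x)) / sd(x)     # mean 0, sd 1; equivalent to scale(x)

round(c(mean = mean(x_standardized), sd = sd(x_standardized)), 10)
```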
Variables with extreme skewness and kurtosis may deteriorate the model performance for certain types of models.
It may sometimes be useful to transform a variable with extreme skewness and kurtosis such that its distribution approximates to a normal distribution.
The Box-Cox transformation is a method for finding an optimal value of the parameter λ for the following transformation:
$$y(\lambda) = \begin{cases} \dfrac{y^{\lambda}-1}{\lambda}, & \lambda \neq 0 \\ \ln(y), & \lambda = 0 \end{cases}$$
require(bestNormalize)
require(psych)

set.seed(9272022)

old <- rbeta(1000, 1, 1000)

fit <- boxcox(old, standardize = FALSE)
fit
Non-Standardized Box Cox Transformation with 1000 nonmissing obs.:

Estimated statistics:
- lambda = 0.2449266
- mean (before standardization) = -3.390823
- sd (before standardization) = 0.1860852
new <- predict(fit)

describe(old)
   vars    n mean sd median trimmed mad min  max range skew kurtosis se
X1    1 1000    0  0      0       0   0   0 0.01  0.01 2.43    10.27  0
describe(new)
   vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
X1    1 1000 -3.39 0.19  -3.39   -3.39 0.19 -3.94 -2.76  1.18 -0.03     -0.2 0.01
$\pi^* = \ln\left(\frac{\pi}{1-\pi}\right),$
where π represents a proportion.
The logit transformation is particularly useful when the outcome variable is a proportion bounded between 0 and 1.
When a linear model is used to model an outcome bounded between 0 and 1, the model predictions may exceed the reasonable range of values (predictions equal to less than zero or greater than one).
The logit transformation maps values from the (0, 1) range onto the whole real line, so the transformed variable ranges from −∞ to ∞ on the logit scale.
One can build a model to predict $\pi^*$ instead of the proportion $\pi$, and then obtain the predicted proportion with a simple reverse operation: $\hat{\pi} = \frac{e^{\hat{\pi}^*}}{1+e^{\hat{\pi}^*}}$
Below is an example of logit transformation for a randomly generated variable.
old <- rbeta(1000, 1, 1000)
new <- log(old / (1 - old))
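Continuing this example, the reverse operation recovers the original proportions from the logit-scale values:

```r
# Inverse-logit transformation back to the proportion scale
recovered <- exp(new) / (1 + exp(new))

all.equal(recovered, old)   # TRUE (up to numerical precision)
```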
Basis expansions are useful to address nonlinearity between a continuous predictor variable and outcome variable.
We can create a set of feature variables using a nonlinear function of a variable x, ϕ(x).
For continuous predictors, the most commonly used expansions are polynomial basis expansions.
The nth degree polynomial basis expansion can be represented by
$\phi(x) = \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \dots + \beta_n x^n$
For continuous predictors, there is no limit on the degree of the polynomial.
The higher the degree of polynomial, the more flexible the model becomes, and there is a higher chance of overfitting.
Typically, polynomial terms up to the 3rd or 4th degree are more than enough.
One simply replaces the original variable x with the new variables obtained from ϕ(x).
Suppose we have 100 observations from a random normal variable x. The third-degree polynomial basis expansion (cubic basis expansion) can be obtained using the poly() function as follows.
set.seed(654)

x <- rnorm(100, 0, 1)
head(x)
[1] -0.76031762 -0.38970450 1.68962523 -0.09423560 0.09530146 0.81727228
head(poly(x,degree=3))
                1           2            3
[1,] -0.070492258 -0.06612854  0.056003658
[2,] -0.030023304 -0.07454585 -0.003988336
[3,]  0.197028288  0.28324096  0.348896805
[4,]  0.002240307 -0.06560960 -0.044790680
[5,]  0.022936731 -0.05256865 -0.063289287
[6,]  0.101772051  0.04942613 -0.034439696
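As a quick illustration (reusing x from above with a made-up outcome y), the expansion simply replaces x in the model formula:

```r
# Hypothetical outcome with a nonlinear relationship to x, for illustration only
y <- 2 + 0.5 * x - 0.3 * x^2 + rnorm(100, 0, 0.2)

# The polynomial basis expansion enters the model in place of x
fit <- lm(y ~ poly(x, degree = 3))
round(summary(fit)$coefficients, 3)
```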
For certain types of models, such as gradient boosting, missing data is not a problem, and one can leave missing values as they are without any processing.
Other models, such as regularized regression models, require complete data, so one has to deal with missing data before modeling.
Handling missing data
Creating an indicator variable for missingness
Imputation
Gang_Affiliated | Avg_Days_per_DrugTest | Gang_na | Drug_na |
---|---|---|---|
NA | 17.1875 | 1 | 0 |
NA | 232 | 1 | 0 |
NA | NA | 1 | 1 |
0 | NA | 0 | 1 |
0 | NA | 0 | 1 |
0 | NA | 0 | 1 |
NA | 92.33333333 | 1 | 0 |
0 | 190 | 0 | 0 |
0 | 66.9 | 0 | 0 |
0 | NA | 0 | 1 |
Missingness indicator variables don't solve the missing data problem because we may still have to impute the missing values for certain types of models.
An indicator variable about whether or not a variable is missing may sometimes provide some information in predicting the outcome when the missingness is not random.
If there is a systematic relationship between outcome and whether or not values are missing for a variable, missingness indicators may provide vital information.
Such an indicator variable would be meaningless for variables that don't have any missing values.
A common approach to missing data is imputation: building a separate model to predict the missing values of each variable from the other variables.
Below is a very naive example of how it would work if we have an outcome variable (Y) and three predictors (X1, X2, X3).
Imputation Model

Outcome | Predictors |
---|---|
X1 | X2, X3 |
X2 | X1, X3 |
X3 | X1, X2 |

Prediction Model

Outcome | Predictors |
---|---|
Y | X1, X2, X3 |
An imputation model can be as simple as an intercept-only model (mean imputation).
For numeric variables, missing values can be replaced with a simple mean, median, or mode of the observed data.
For categorical variables, missing values can be replaced with a value randomly drawn from a binomial or multinomial distribution with the observed probabilities.
An imputation model can also be as complex as desired using a regularized regression model, a decision tree model, or a K-nearest neighbors model.
The main idea of a more complex imputation model is to find observations that are similar (in terms of the other predictors) to the observations with a missing value, and to use the data from these similar observations to predict the missing values.
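A very naive sketch of a missingness indicator followed by mean imputation for a single numeric variable (the values below are made up):

```r
# Hypothetical numeric predictor with missing values
x <- c(2.5, NA, 3.1, 4.0, NA, 2.8)

x_na      <- as.integer(is.na(x))                          # missingness indicator
x_imputed <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)    # mean imputation

data.frame(x, x_na, x_imputed)
```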
The recipes package
Given a dataset, we may need to apply several processing steps to different types of variables.
One can write an R script to implement all these procedures manually, but it is likely a tedious job.
The recipes package helps us process data more efficiently and in an organized way.
The recipes package also makes it easier to replicate the same data processing for future datasets, as long as the data come in the same format (e.g., same column names, same variable types).
The recipes demo notebook
Note that the order of procedures applied to variables is important.
For instance:
there would be no point in using step_indicate_na() after using step_impute_bag(). Why?
there will be a problem if you first standardize variables using step_normalize() and then apply a Box-Cox transformation. Why?
For a complete list of step_ functions available in the recipes package, check this page.
Make sure you review the following notebook to see how to process the various types of variables in the NIJ's Recidivism dataset using the recipes package.
https://www.kaggle.com/code/uocoeeds/lecture-2a-data-preprocessing-i
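Below is a minimal sketch of what such a recipe might look like. The column names come from the Recidivism dataset, but the particular steps, their arguments, and the training data frame name are illustrative assumptions, not the notebook's exact recipe:

```r
library(recipes)

# `recidivism_train` is assumed to be a data frame with the columns used below
blueprint <- recipe(Recidivism_Arrest_Year2 ~ ., data = recidivism_train) %>%
  step_indicate_na(Gang_Affiliated, Avg_Days_per_DrugTest) %>%   # flag missingness first
  step_impute_mean(Avg_Days_per_DrugTest) %>%                    # then impute
  step_dummy(Prison_Offense, one_hot = TRUE) %>%                 # one-hot encode a nominal predictor
  step_poly(Supervision_Risk_Score_First, degree = 3) %>%        # polynomial basis expansion
  step_normalize(all_numeric_predictors())                       # standardize numeric predictors

prepped   <- prep(blueprint, training = recidivism_train)
processed <- bake(prepped, new_data = recidivism_train)
```

Note that the steps above respect the ordering discussed earlier: the missingness indicators are created before imputation, and normalization comes last.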
NLP = Linguistics + Computer Science + Statistics
The ultimate goal is to develop algorithms and models to understand and use human language in a way we understand and use it.
The goal is not only to understand individual words but also the context in which these words are used.
Recent advancements in NLP have revolutionized language models, and these models now play a critical role in our daily lives.
Has Gmail ever suggested how to complete a sentence for you?
Has Outlook ever suggested a greeting when you started drafting an email?
Have you ever interacted with a chatbot?
Model | Developer | Year | # of parameters | Estimated Cost |
---|---|---|---|---|
BERT-Large | Google AI | 2018 | 336 M | $7K |
RoBERTa-Large | Facebook AI | 2019 | 335 M | ? |
GPT-2 XL | OpenAI | 2019 | 1.5 B | $50K |
T5 | Google AI | 2020 | 11 B | $1.3 M |
GPT-3 | OpenAI | 2020 | 175 B | $12.0 M |
These models are expensive to train and use enormous amounts of data.
For instance,
BERT/RoBERTa were trained on the entire Wikipedia and a book corpus (a total of ~4.7 billion words),
GPT-2 was trained on 8 million web pages, and
GPT-3 was trained on 45 TB of data from the internet and books.
All these models except GPT-3 are open source.
They can be used immediately through open libraries (typically in Python).
Hugging Face is a platform where people host pre-trained AI/ML models, much like CRAN hosts R packages.
The tasks that can be achieved with these models include:
text generation
text classification
text summarization
question answering
sentence similarity
translation
speech recognition
audio classification
image classification
object detection
The reticulate package
The reticulate package provides an interface to call and run Python from R.
library(reticulate)

py_config()
python:         C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/python.exe
libpython:      C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/python36.dll
pythonhome:     C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate
version:        3.6.13 (default, Sep 23 2021, 07:38:49) [MSC v.1916 64 bit (AMD64)]
Architecture:   64bit
numpy:          C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/Lib/site-packages/numpy
numpy_version:  1.19.5

NOTE: Python version was forced by RETICULATE_PYTHON
conda_list()
          name                                                                       python
1    Anaconda3                                     C:\\Users\\cengiz\\Anaconda3\\python.exe
2 r-reticulate C:\\Users\\cengiz\\AppData\\Local\\r-miniconda\\envs\\r-reticulate\\python.exe
use_condaenv('r-reticulate')

# conda_install(envname  = 'r-reticulate',
#               packages = 'sentence_transformers',
#               pip      = TRUE)

st <- import('sentence_transformers')
st
Module(sentence_transformers)
Next, we will pick a pre-trained language model to play with.
It can be any model available on Hugging Face.
You have to find the associated Hugging Face page for that particular model and use the same tag used for that model.
For instance, suppose we want to use the RoBERTa model. The associated Huggingface webpage for this model is https://huggingface.co/roberta-base
Notice that the tag for this model is roberta-base.
model.name <- 'roberta-base'

roberta       <- st$models$Transformer(model.name)
pooling_model <- st$models$Pooling(roberta$get_word_embedding_dimension())

model <- st$SentenceTransformer(modules = list(roberta, pooling_model))
model
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
Each model has a limit on the length of the text sequence it can process.
For instance, RoBERTa can handle a sequence of at most 512 tokens.
model$get_max_seq_length()
[1] 512
If we submit any text longer than 512 tokens, the model will only process the first 512 tokens.
Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.
The following code reveals that RoBERTa returns a vector with a length of 768.
model$get_sentence_embedding_dimension()
[1] 768
model$encode('sofa')
  [1] -0.035673961  0.009145292  0.045155115 -0.022831999  0.444271356 -0.213501185  0.016776590 -0.031104138
  [9]  0.010763753 -0.109695613 -0.218385130 -0.114003226  0.123890847 -0.079736136  0.169241652  0.033682950
 [17] -0.060081303  0.062376734  0.090727039 -0.022418823 -0.048738785  0.146164417 -0.040947653  0.048980530
 [25] -0.078615293 -0.001459451  0.122411616  0.016039580 -0.027918482 -0.063380398 -0.218016654 -0.130548552
 [33]  0.069047354 -0.001986546  0.043062352  0.060666095  0.076921932  0.082324371 -0.009124480  0.074072585
 [41] -0.099838220  0.019355917 -0.161781847  0.006589258 -0.006635100 -0.009499688  0.142429739 -0.162081540
 [49]  0.035310790 -0.042761490  0.091216803 -0.069520645 -0.067890733  0.085272178 -0.052535873 -0.128475532
 [57]  0.078527123 -0.065088570 -0.077463746  0.036280572 -0.076873213  0.503972054 -0.041739281  0.019071214
 [65] -0.034255326  0.059130061 -0.068348601  0.298463583  0.103186190 -0.045786552  0.005054155 -0.082052834
 [73]  0.067019530  0.096304454 -0.005556324 -0.014345085  0.089176968 -2.471521616 -0.152103558  0.050706096
 [81]  0.071368039 -0.075957939  0.637347639  0.123483524  0.097477347  0.002311159  0.017645134  0.233651847
 [89] -0.020067355  0.051375460  0.057862498  0.038290158  0.038890921  0.078757085  0.026953537  0.042133540
 [97] -0.018540401  0.316686124 -0.064064525  0.024643535
 [ reached getOption("max.print") -- omitted 668 entries ]
model$encode('I like to drink Turkish coffee')
  [1] -0.011787563  0.102784999  0.010933759 -0.046234991 -0.004870780 -0.044982813  0.082253136 -0.030680846
  [9]  0.002545473 -0.074854769  0.006268861 -0.161093548  0.073251143 -0.008548131  0.040855005  0.282602638
 [17]  0.113437660  0.134714946  0.070556641  0.369077712 -0.023299653  0.129972905 -0.087970734  0.001767353
 [25] -0.135965645  0.056659225  0.119028516 -0.015006499  0.137721002 -0.004969854 -0.084344238 -0.078683749
 [33]  0.021612354 -0.039848015  0.044566937  0.058498740  0.116407432  0.022860289 -0.009472768  0.005023805
 [41]  0.046304196 -0.321840614 -0.020347901  0.017407298  0.013550013 -0.046380099 -0.056976926 -0.140553191
 [49]  0.074571297 -0.012873427 -0.015933262  0.075764850  0.029388802  0.041944694 -0.055896059  0.032724179
 [57]  0.038444079  0.054137304  0.120718785 -0.074913748  0.019386305  0.541475773 -0.155054778  0.053815808
 [65]  0.040421356 -0.006427096 -0.007460407 -0.121550187 -0.024524804  0.108934306  0.030513961 -0.088536829
 [73]  0.049289271 -0.089430012  0.018972535  0.071318731  0.068718828 -4.136425018  0.133511454  0.076649025
 [81]  0.036512222 -0.096736401  0.892177880 -0.043068361  0.041941863 -0.055145040 -0.021039886  0.175592646
 [89]  0.012270497 -0.003860307  0.044418067 -0.035518471 -0.075384133  0.111435041  0.049394138  0.038163140
 [97]  0.098541118 -0.023031345 -0.001796652 -0.000854142
 [ reached getOption("max.print") -- omitted 668 entries ]
The input can be many sentences.
For instance, if I submit a vector of three sentences as input, the model returns a 3 x 768 matrix containing the sentence embeddings.
Each row contains the embeddings for the corresponding sentence.
my.sentences <- c('The weather today is great.',
                  'I live in Eugene.',
                  'I am a graduate student.')

embeddings <- model$encode(my.sentences)

dim(embeddings)
[1] 3 768
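As a small follow-up sketch (using the embeddings matrix from the previous chunk), cosine similarities between the sentence embeddings can be computed directly in R:

```r
# Cosine similarity between the three sentence embeddings
cosine <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

sim <- outer(1:3, 1:3,
             Vectorize(function(i, j) cosine(embeddings[i, ], embeddings[j, ])))
round(sim, 2)
```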
Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the AllenAI Longformer model.
https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii