
Data Preprocessing

Cengiz Zopluoglu

College of Education, University of Oregon

Oct 10, 2022
Eugene, OR

1 / 49

Today's Goals:

  • Processing Categorical Predictors

    • One-hot encoding (dummy variables)

    • Label encoding

    • Polynomial Contrasts

  • Processing Cyclic Variables

  • Processing Continuous Variables

    • Centering and Scaling

    • Box-Cox transformation

    • Logit transformation

    • Polynomial basis expansions

  • Handling Missing Data

  • the recipes package

  • Processing Text Data with pre-trained NLP models

2 / 49










Processing Categorical Predictors

3 / 49
  • When a dataset includes categorical predictors, they must be transformed into numerical codes, because this is the only way most predictive models can use them.




  • When encoding categorical predictors, we try to preserve as much information as possible from their labels.
4 / 49

One-hot encoding (dummy variables)

  • A dummy variable is a synthetic variable with two values (0 and 1) representing a group membership.

  • When there is a nominal variable with N levels, it is typical to create N dummy variables to represent the information in the nominal variable.

  • Each dummy variable represents membership to one of the levels in the nominal variable.

  • These dummy variables can be used as features in predictive models.

  • In its simplest case, consider the variable Race in the Recidivism dataset with two levels: Black and White. We can create two dummy variables to represent the information in this variable.

        Dummy Variable 1   Dummy Variable 2
Black          1                  0
White          0                  1
5 / 49
  • Variable Prison_Offense has five categories: Violent/Sex, Violent/Non-Sex, Property, Drug, and Other.

  • We can create five dummy variables using the following coding scheme.

                  Dummy Variable 1  Dummy Variable 2  Dummy Variable 3  Dummy Variable 4  Dummy Variable 5
Violent/Sex              1                 0                 0                 0                 0
Violent/Non-Sex          0                 1                 0                 0                 0
Property                 0                 0                 1                 0                 0
Drug                     0                 0                 0                 1                 0
Other                    0                 0                 0                 0                 1
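
For illustration, below is a minimal sketch of how these dummy variables could be created in R. The short character vector is hypothetical, not the actual Recidivism data.

# Hypothetical vector with the five Prison_Offense categories
prison_offense <- factor(c('Violent/Sex', 'Property', 'Drug', 'Other', 'Violent/Non-Sex'))

# model.matrix() with '0 +' keeps one dummy variable per level (one-hot encoding)
dummies <- model.matrix(~ 0 + prison_offense)
dummies

# The recipes package offers the same via step_dummy(..., one_hot = TRUE)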
6 / 49


7 / 49


8 / 49


NOTE

When you fit a standard regression model without regularization using ordinary least squares (OLS), common practice is to drop the dummy variable for one of the levels. So, for instance, if a nominal variable has N levels, you only need to create (N-1) dummy variables, as the Nth one carries redundant information; the excluded category is represented by the intercept term. Including all N dummy variables creates a problem because the OLS procedure tries to invert a singular matrix, and you will likely get an error message.

On the other hand, this is not an issue when you fit a regularized regression model, which will be the case in this class. Therefore, you do not need to drop one of the dummy variables and can include all of them in the analysis. In fact, it may be beneficial to keep the dummy variables for all categories in the model when regularization is used in the regression. Otherwise, the model may produce different predictions depending on which category is excluded.
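
A minimal sketch of the singularity issue described above, using hypothetical simulated data (not the Recidivism dataset):

set.seed(1)
race <- factor(sample(c('Black', 'White'), 20, replace = TRUE))

# Intercept column plus a dummy variable for every level
X <- cbind(1, model.matrix(~ 0 + race))

qr(X)$rank            # 2, not 3: the columns are linearly dependent
# solve(t(X) %*% X)   # would fail because X'X is singular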
9 / 49

Label encoding

  • When the variable of interest is ordinal, and there is a hierarchy among the levels, another alternative is to assign a numerical value to each category.

  • Consider the variable Age_At_Release in the Recidivism dataset. It is coded as 7 different age intervals in the dataset: 18-22, 23-27, 28-32, 33-37, 38-42, 43-47, 48 or older.

              Encoding 1   Encoding 2
18-22             20            1
23-27             25            2
28-32             30            3
33-37             35            4
38-42             40            5
43-47             45            6
48 or older       60            7
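
A minimal sketch of both encodings in R, using a short hypothetical vector of age intervals:

# Hypothetical vector of age intervals (not the actual Recidivism data)
age <- factor(c('23-27', '48 or older', '18-22', '33-37'),
              levels = c('18-22', '23-27', '28-32', '33-37',
                         '38-42', '43-47', '48 or older'),
              ordered = TRUE)

# Encoding 2: consecutive integers following the level order
as.integer(age)

# Encoding 1: map each level to a rough interval midpoint
midpoints <- c(20, 25, 30, 35, 40, 45, 60)
midpoints[as.integer(age)]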
10 / 49


11 / 49
  • Another example would be the variable Education Level in the Recidivism dataset.

  • How would you encode this variable?

                         Encoding 1   Encoding 2
Less than a HS diploma
HS diploma
At least some college
12 / 49

Polynomial Contrasts

  • Another way of encoding an ordinal variable is to use polynomial contrasts.

  • The polynomial contrasts may be helpful if one wants to explore whether or not there is a linear, quadratic, cubic, etc., relationship between the predictor variable and outcome variable.

  • If there are N levels, one can have polynomial terms up to the (N-1)th degree.

  • The polynomial terms are orthonormal vectors:

    • the sum of the squares within each column is equal to 1
    • the dot product of any two different columns is equal to 0.
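
These properties can be checked directly in R with contr.poly(); below is a small sketch for a variable with five levels.

# Orthogonal polynomial contrasts for a 5-level ordinal variable
# (columns correspond to the linear, quadratic, cubic, and quartic terms)
C <- contr.poly(5)
round(C, 3)

colSums(C^2)           # sum of squares within each column is 1
round(t(C) %*% C, 10)  # dot products of different columns are 0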
13 / 49


14 / 49

15 / 49










Processing Cyclic Variables

16 / 49
  • There are sometimes variables that are cyclic by nature (e.g., month, day of the week, hour of the day).

  • Neither dummy variables nor numerical encoding necessarily captures the information in these variables in the most meaningful way.

  • For cyclic variables, it may be more meaningful to create two new variables using sine and cosine transformations, as follows:

    $x_1 = \sin\left(\frac{2\pi x}{\max(x)}\right)$

    $x_2 = \cos\left(\frac{2\pi x}{\max(x)}\right)$
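
A minimal sketch of this transformation in R, applied to the 24 hours of a day (this is how the hour-of-day table on a later slide can be computed):

# Sine/cosine encoding of hour of day (x = 1, ..., 24)
hour <- 1:24
x1 <- sin(2 * pi * hour / max(hour))
x2 <- cos(2 * pi * hour / max(hour))
head(round(data.frame(hour, x1, x2), 3))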

17 / 49

18 / 49
  • Below is another example for the time of day:

hour      x1       x2
   1   0.259    0.966
   2   0.500    0.866
   3   0.707    0.707
   4   0.866    0.500
   5   0.966    0.259
   6   1.000    0.000
   7   0.966   -0.259
   8   0.866   -0.500
   9   0.707   -0.707
  10   0.500   -0.866
  11   0.259   -0.966
  12   0.000   -1.000
  13  -0.259   -0.966
  14  -0.500   -0.866
  15  -0.707   -0.707
  16  -0.866   -0.500
  17  -0.966   -0.259
  18  -1.000    0.000
  19  -0.966    0.259
  20  -0.866    0.500
  21  -0.707    0.707
  22  -0.500    0.866
  23  -0.259    0.966
  24   0.000    1.000

19 / 49










Processing Continuous Variables

20 / 49

Centering and Scaling

  • Centering a variable is done by subtracting the variable’s mean from every value:

    $X_{centered} = X - \bar{X}$

    Centering ensures that the mean of the centered variable equals zero.

  • Scaling a variable is dividing the value of each observation by the variable’s standard deviation:

    $X_{scaled} = \dfrac{X}{\sigma_X}$

    Scaling ensures that the standard deviation of the scaled variable equals 1.

  • When centering and scaling are both applied, it is called standardization:

    $z_X = \dfrac{X - \bar{X}}{\sigma_X}$
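
A minimal sketch of these three operations in R, using a small simulated variable:

set.seed(123)
x <- rnorm(50, mean = 10, sd = 3)    # simulated variable

x_centered <- x - mean(x)            # centering only
x_scaled   <- x / sd(x)              # scaling only
z          <- (x - mean(x)) / sd(x)  # standardization; equivalent to scale(x)

round(c(mean(z), sd(z)), 3)          # mean ~ 0, sd = 1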

21 / 49
  • When we standardize a variable, we ensure that its mean is equal to zero and variance is equal to 1.

  • Standardizing outcome and predictor variables may be critical and necessary for specific models (e.g., K-nearest neighbor, support vector machines, penalized regression), but it is not always necessary for other models (e.g., decision tree models).

  • Standardizing a variable only changes the first and second moments of a distribution (mean and variance)

  • Standardizing a variable doesn’t change the third and fourth moments of a distribution (skewness and kurtosis).

  • Some people in the data science field use the term normalization, but what they actually mean is standardization.

22 / 49

Box-Cox transformation

  • Variables with extreme skewness and kurtosis may deteriorate the model performance for certain types of models.

  • It may sometimes be useful to transform a variable with extreme skewness and kurtosis so that its distribution approximates a normal distribution.

  • The Box-Cox transformation is a method for finding an optimal value of the parameter $\lambda$ for the following transformation:

$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(y), & \lambda = 0 \end{cases}$

23 / 49
require(bestNormalize)
require(psych)

set.seed(9272022)

# Generate a highly skewed variable
old <- rbeta(1000, 1, 1000)

# Estimate the optimal lambda and apply the Box-Cox transformation
fit <- boxcox(old, standardize = FALSE)
fit

Non-Standardized Box Cox Transformation with 1000 nonmissing obs.:
 Estimated statistics:
 - lambda = 0.2449266
 - mean (before standardization) = -3.390823
 - sd (before standardization) = 0.1860852

# Extract the transformed values
new <- predict(fit)

describe(old)

   vars    n mean sd median trimmed mad min  max range skew kurtosis se
X1    1 1000    0  0      0       0   0   0 0.01  0.01 2.43    10.27  0

describe(new)

   vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
X1    1 1000 -3.39 0.19  -3.39   -3.39 0.19 -3.94 -2.76  1.18 -0.03     -0.2 0.01

















24 / 49

Logit transformation

  • When a variable is a proportion bounded between 0 and 1, the logit transformation can be applied such that

    $\pi^* = \ln\left(\frac{\pi}{1-\pi}\right),$

    where $\pi$ represents a proportion.

  • It is particularly useful when your outcome variable is a proportion bounded between 0 and 1.

  • When a linear model is used for an outcome bounded between 0 and 1, the model predictions may fall outside the reasonable range of values (predictions less than zero or greater than one).

  • The logit transformation rescales the variable so that its range becomes $(-\infty, +\infty)$ on the logit scale.

  • One can build a model to predict $\pi^*$ instead of the proportion $\pi$, and then obtain the predicted proportion with a simple reverse operation: $\hat{\pi} = \dfrac{e^{\hat{\pi}^*}}{1 + e^{\hat{\pi}^*}}$

25 / 49

Below is an example of logit transformation for a randomly generated variable.

old <- rbeta(1000,1,1000)
new <- log(old/(1-old))
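
If a model is later fit on the logit scale, the predictions can be mapped back to proportions with the reverse operation from the previous slide. A minimal sketch continuing the example above:

# Back-transform the logit-scale values to proportions
back <- exp(new) / (1 + exp(new))
all.equal(back, old)   # TRUE, up to numerical precision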

26 / 49

Polynomial basis expansions

  • Basis expansions are useful for addressing nonlinear relationships between a continuous predictor and the outcome variable.

  • We can create a set of feature variables using a nonlinear function of a variable x, ϕ(x).

  • For continuous predictors, the most commonly used expansions are polynomial basis expansions.

  • The nth degree polynomial basis expansion can be represented by

$\phi(x) = \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \ldots + \beta_n x^n$

  • For continuous predictors, there is no limit on the degree of the polynomial.

  • The higher the degree of polynomial, the more flexible the model becomes, and there is a higher chance of overfitting.

  • Typically, polynomial terms up to the 3rd or 4th degree are more than enough.

  • One simply replaces the original variable x with the new variables obtained from ϕ(x).

27 / 49

Suppose we have 100 observations from a random normal variable x. The third-degree polynomial basis expansion (cubic basis expansion) can be obtained using the poly() function as follows.

set.seed(654)
x <- rnorm(100,0,1)
head(x)
[1] -0.76031762 -0.38970450 1.68962523 -0.09423560 0.09530146 0.81727228
head(poly(x, degree = 3))

                1           2            3
[1,] -0.070492258 -0.06612854  0.056003658
[2,] -0.030023304 -0.07454585 -0.003988336
[3,]  0.197028288  0.28324096  0.348896805
[4,]  0.002240307 -0.06560960 -0.044790680
[5,]  0.022936731 -0.05256865 -0.063289287
[6,]  0.101772051  0.04942613 -0.034439696

28 / 49










Handling Missing Data

29 / 49
  • For certain types of models, such as gradient boosting, missing data is not a problem, and the missing values can be left as they are without any processing.

  • Other models, such as regularized regression, require complete data, so one has to deal with missing values before modeling.

  • Handling missing data

    • Creating an indicator variable for missingness

    • Imputation

30 / 49

Creating Indicator Variable for Missingness

  • Identify the variables with missing data, and then create a binary indicator variable for each of them to mark missingness (0: not missing, 1: missing), as sketched below.
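
A minimal sketch with a small hypothetical vector:

# Hypothetical variable with missing values
x <- c(12, NA, 7, 3, NA, 9)

# Binary missingness indicator (0: not missing, 1: missing)
x_missing <- ifelse(is.na(x), 1, 0)
x_missing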
31 / 49
  • Missingness indicator variables don't solve the missing data problem because we may still have to impute the missing values for certain types of models.

  • An indicator variable about whether or not a variable is missing may sometimes provide some information in predicting the outcome when the missingness is not random.

  • If there is a systematic relationship between outcome and whether or not values are missing for a variable, missingness indicators may provide vital information.

  • This indicator variable would be meaningless for variables that don’t have any missing values.

32 / 49

Imputation

  • A common approach to handling missing data.

  • Below is a very naive example of how it would work if we have an outcome variable (Y) and three predictors (X1, X2, X3).

Imputation Models
  Outcome   Predictors
  X1        X2, X3
  X2        X1, X3
  X3        X1, X2

Prediction Model
  Outcome   Predictors
  Y         X1, X2, X3

  • Each predictor becomes an outcome of interest in the imputation stage, and the remaining predictors are used to build an imputation model that predicts its missing values. After the missing values are estimated and replaced for each predictor using its imputation model, the primary outcome of interest is predicted using the imputed X1, X2, and X3.
33 / 49
  • An imputation model can be as simple as an intercept-only model (mean imputation).

    • For numeric variables, missing values can be replaced with a simple mean, median, or mode of the observed data.

    • For categorical variables, missing values can be replaced with a value randomly drawn from a binomial or multinomial distribution with the observed probabilities.

  • An imputation model can also be as complex as desired using a regularized regression model, a decision tree model, or a K-nearest neighbors model.

  • The main idea behind a more complex imputation model is to find other observations that are similar (in terms of the other predictors) to the observations with a missing value, and to use the data from these similar observations to predict the missing values. A minimal sketch of the simplest case, mean imputation, follows this list.
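
As a sketch, mean imputation for a single hypothetical predictor looks like this; more complex imputation models follow the same logic with better predictions of the missing values.

# Hypothetical numeric predictor with missing values
x1 <- c(4.2, NA, 3.7, 5.1, NA, 4.9)

# Replace missing values with the mean of the observed values
x1_imputed <- ifelse(is.na(x1), mean(x1, na.rm = TRUE), x1)
x1_imputed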

34 / 49










the recipes package

35 / 49
  • Given a dataset, we may need to apply several processes for different types of variables.

  • One can write an R script to implement all these procedures manually, but it is likely a tedious job.

  • The recipes package helps us process data more efficiently and in an organized way.

  • The recipes package makes it easier to replicate the same data processing for future datasets as long as data comes in the same format (e.g., same column names, same variable types).

  • The recipes demo notebook

36 / 49
  • Note that the order of procedures applied to variables is important.

  • For instance:

    • it would be meaningless to use step_indicate_na() after step_impute_bag(). Why?

    • there would be a problem if you first standardized variables using step_normalize() and then applied a Box-Cox transformation. Why? (A sketch of a sensible ordering follows this list.)

  • For a complete list of step_ functions available in the recipes package, check this page.
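
Below is a minimal sketch of a recipe with the steps in a sensible order; the toy data frame and column names (train_df, x_num, x_cat, outcome) are hypothetical, not the Recidivism data.

library(recipes)

# Hypothetical toy data
set.seed(1)
train_df <- data.frame(outcome = rnorm(10),
                       x_num   = c(runif(8, 10, 100), NA, NA),
                       x_cat   = factor(sample(c('A', 'B', 'C'), 10, replace = TRUE)))

blueprint <- recipe(outcome ~ ., data = train_df) %>%
  step_indicate_na(x_num) %>%        # flag missingness before imputing
  step_impute_mean(x_num) %>%        # then fill in the missing values
  step_BoxCox(x_num) %>%             # transform before standardizing
  step_normalize(x_num) %>%          # center and scale
  step_dummy(x_cat, one_hot = TRUE)  # one-hot encode the nominal predictor

prepped <- prep(blueprint, training = train_df)   # estimate the required statistics
baked   <- bake(prepped, new_data = train_df)     # apply the same processing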

37 / 49

Make sure you review the following notebook to see how to process the various types of variables in the NIJ's Recidivism dataset using the recipes package.

https://www.kaggle.com/code/uocoeeds/lecture-2a-data-preprocessing-i

38 / 49










Processing Text Data with pre-trained NLP models

39 / 49

Natural Language Processing (NLP)

  • NLP = Linguistics + Computer Science + Statistics

  • The ultimate goal is to develop algorithms and models that understand and use human language the way we understand and use it.

  • The goal is not only to understand individual words but also the context in which these words are used.

  • Recent advancements in the field of NLP have revolutionized language models, which now play a critical role in our daily lives:

    • Has Gmail ever suggested how to complete a sentence?

    • Has Outlook ever suggested a greeting message when you started drafting an email?

    • Have you ever interacted with a chatbot?

40 / 49

41 / 49
  • Below is a brief list of some of these NLP models with basic information about each.

Model           Developer     Year   # of parameters   Estimated cost
BERT-Large      Google AI     2018   336 M             $7K
RoBERTa-Large   Facebook AI   2019   335 M             ?
GPT-2 XL        OpenAI        2019   1.5 B             $50K
T5              Google AI     2020   11 B              $1.3 M
GPT-3           OpenAI        2020   175 B             $12.0 M
  • These models are expensive to train and rely on enormous amounts of data.

  • For instance,

    • Bert/Roberta was trained using the entire Wikipedia and a Book Corpus (a total of ~ 4.7 billion words),

    • GPT-2 was trained using 8 million web pages, and

    • GPT3 was trained on 45 TB of data from the internet and books.

42 / 49
  • All these models except GPT-3 are open source.

  • They can be used immediately through open-source libraries (typically in Python).

  • Hugging Face is a platform where people host pre-trained AI/ML models, much like CRAN hosts R packages.

    https://huggingface.co/models
  • The tasks that can be achieved with these models are:

    • text generation

    • text classification

    • text summarization

    • question answering

    • sentence similarity

    • translation

    • speech recognition

    • audio classification

    • image classification

    • object detection

43 / 49

The reticulate package

The reticulate package provides an interface to call and run Python from R.

library(reticulate)
py_config()
python: C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/python.exe
libpython: C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/python36.dll
pythonhome: C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate
version: 3.6.13 (default, Sep 23 2021, 07:38:49) [MSC v.1916 64 bit (AMD64)]
Architecture: 64bit
numpy: C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/Lib/site-packages/numpy
numpy_version: 1.19.5
NOTE: Python version was forced by RETICULATE_PYTHON
conda_list()
name python
1 Anaconda3 C:\\Users\\cengiz\\Anaconda3\\python.exe
2 r-reticulate C:\\Users\\cengiz\\AppData\\Local\\r-miniconda\\envs\\r-reticulate\\python.exe
use_condaenv('r-reticulate')
# conda_install(envname = 'r-reticulate',
# packages = 'sentence_transformers',
# pip = TRUE)
st <- import('sentence_transformers')
st
Module(sentence_transformers)
44 / 49

Sample model: RoBERTa

  • Next, we will pick a pre-trained language model to play with.

  • It can be any model available on Hugging Face.

  • You have to find the associated Hugging Face page for that particular model and use the same tag used for that model.

  • For instance, suppose we want to use the RoBERTa model. The associated Hugging Face page for this model is https://huggingface.co/roberta-base

  • Notice that the tag for this model is roberta-base.

model.name <- 'roberta-base'
roberta <- st$models$Transformer(model.name)
pooling_model <- st$models$Pooling(roberta$get_word_embedding_dimension())
model <- st$SentenceTransformer(modules = list(roberta,pooling_model))
model
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
45 / 49
  • Each model has a limit on the length of the text sequence (number of tokens) it can process.

  • For instance, RoBERTa can handle a text sequence with a maximum of 512 tokens.

model$get_max_seq_length()
[1] 512
  • If we submit any text longer than 512 tokens, it will only process the first 512 tokens.

  • Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.

  • The following code reveals that RoBERTa returns a vector with a length of 768.

model$get_sentence_embedding_dimension()
[1] 768
  • RoBERTa can take any text sequence of up to 512 tokens as input and then return a numerical vector of length 768 that represents this text sequence. This process is also called encoding.
46 / 49
  • For instance, we can get the embeddings for a single word ‘sofa’.
model$encode('sofa')
[1] -0.035673961 0.009145292 0.045155115 -0.022831999 0.444271356 -0.213501185 0.016776590 -0.031104138
[9] 0.010763753 -0.109695613 -0.218385130 -0.114003226 0.123890847 -0.079736136 0.169241652 0.033682950
[17] -0.060081303 0.062376734 0.090727039 -0.022418823 -0.048738785 0.146164417 -0.040947653 0.048980530
[25] -0.078615293 -0.001459451 0.122411616 0.016039580 -0.027918482 -0.063380398 -0.218016654 -0.130548552
[33] 0.069047354 -0.001986546 0.043062352 0.060666095 0.076921932 0.082324371 -0.009124480 0.074072585
[41] -0.099838220 0.019355917 -0.161781847 0.006589258 -0.006635100 -0.009499688 0.142429739 -0.162081540
[49] 0.035310790 -0.042761490 0.091216803 -0.069520645 -0.067890733 0.085272178 -0.052535873 -0.128475532
[57] 0.078527123 -0.065088570 -0.077463746 0.036280572 -0.076873213 0.503972054 -0.041739281 0.019071214
[65] -0.034255326 0.059130061 -0.068348601 0.298463583 0.103186190 -0.045786552 0.005054155 -0.082052834
[73] 0.067019530 0.096304454 -0.005556324 -0.014345085 0.089176968 -2.471521616 -0.152103558 0.050706096
[81] 0.071368039 -0.075957939 0.637347639 0.123483524 0.097477347 0.002311159 0.017645134 0.233651847
[89] -0.020067355 0.051375460 0.057862498 0.038290158 0.038890921 0.078757085 0.026953537 0.042133540
[97] -0.018540401 0.316686124 -0.064064525 0.024643535
[ reached getOption("max.print") -- omitted 668 entries ]
  • Similarly, we can get the vector of numerical embeddings for a whole sentence.
model$encode('I like to drink Turkish coffee')
[1] -0.011787563 0.102784999 0.010933759 -0.046234991 -0.004870780 -0.044982813 0.082253136 -0.030680846
[9] 0.002545473 -0.074854769 0.006268861 -0.161093548 0.073251143 -0.008548131 0.040855005 0.282602638
[17] 0.113437660 0.134714946 0.070556641 0.369077712 -0.023299653 0.129972905 -0.087970734 0.001767353
[25] -0.135965645 0.056659225 0.119028516 -0.015006499 0.137721002 -0.004969854 -0.084344238 -0.078683749
[33] 0.021612354 -0.039848015 0.044566937 0.058498740 0.116407432 0.022860289 -0.009472768 0.005023805
[41] 0.046304196 -0.321840614 -0.020347901 0.017407298 0.013550013 -0.046380099 -0.056976926 -0.140553191
[49] 0.074571297 -0.012873427 -0.015933262 0.075764850 0.029388802 0.041944694 -0.055896059 0.032724179
[57] 0.038444079 0.054137304 0.120718785 -0.074913748 0.019386305 0.541475773 -0.155054778 0.053815808
[65] 0.040421356 -0.006427096 -0.007460407 -0.121550187 -0.024524804 0.108934306 0.030513961 -0.088536829
[73] 0.049289271 -0.089430012 0.018972535 0.071318731 0.068718828 -4.136425018 0.133511454 0.076649025
[81] 0.036512222 -0.096736401 0.892177880 -0.043068361 0.041941863 -0.055145040 -0.021039886 0.175592646
[89] 0.012270497 -0.003860307 0.044418067 -0.035518471 -0.075384133 0.111435041 0.049394138 0.038163140
[97] 0.098541118 -0.023031345 -0.001796652 -0.000854142
[ reached getOption("max.print") -- omitted 668 entries ]
47 / 49
  • The input can be many sentences.

  • For instance, if I submit a vector of three sentences as input, the model returns a 3 x 768 matrix containing the sentence embeddings.

Each row contains the embeddings for the corresponding sentence.

my.sentences <- c('The weather today is great.',
'I live in Eugene.',
'I am a graduate student.')
embeddings <- model$encode(my.sentences)
dim(embeddings)
[1] 3 768
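
One simple downstream use of this matrix is sentence similarity. Below is a minimal sketch of the cosine similarities among the three sentences, using base R and the embeddings object from the chunk above.

# Cosine similarity between every pair of the three sentences
norms <- sqrt(rowSums(embeddings^2))
cosine_sim <- (embeddings %*% t(embeddings)) / (norms %o% norms)
round(cosine_sim, 3)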
48 / 49

Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the AllenAI Longformer model.

https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii

49 / 49
