
Data Preprocessing

Cengiz Zopluoglu

College of Education, University of Oregon

Oct 10, 2022
Eugene, OR

1 / 49

Today's Goals:

  • Processing Categorical Predictors

    • One-hot encoding (dummy variables)

    • Label encoding

    • Polynomial Contrasts

  • Processing Cyclic Variables

  • Processing Continuous Variables

    • Centering and Scaling

    • Box-Cox transformation

    • Logit transformation

    • Polynomial basis expansions

  • Handling Missing Data

  • the recipes package

  • Processing Text Data with pre-trained NLP models

2 / 49










Processing Categorical Predictors

3 / 49
  • When a dataset includes categorical predictors, they must be transformed into numerical codes, because this is the only way most predictive models can use them.




  • When encoding categorical predictors, we try to preserve as much information as possible from their labels.
4 / 49

One-hot encoding (dummy variables)

  • A dummy variable is a synthetic variable with two values (0 and 1) representing a group membership.

  • When there is a nominal variable with N levels, it is typical to create N dummy variables to represent the information in the nominal variable.

  • Each dummy variable represents membership to one of the levels in the nominal variable.

  • These dummy variables can be used as features in predictive models.

  • In its simplest case, consider the variable Race in the Recidivism dataset with two levels: Black and White. We can create two dummy variables to represent the information in this variable.

        Dummy Variable 1   Dummy Variable 2
Black          1                  0
White          0                  1
5 / 49
  • Variable Prison_Offense has five categories: Violent/Sex, Violent/Non-Sex, Property, Drug, and Other.

  • We can create five dummy variables using the following coding scheme.

                  Dummy Variable 1  Dummy Variable 2  Dummy Variable 3  Dummy Variable 4  Dummy Variable 5
Violent/Sex              1                 0                 0                 0                 0
Violent/Non-Sex          0                 1                 0                 0                 0
Property                 0                 0                 1                 0                 0
Drug                     0                 0                 0                 1                 0
Other                    0                 0                 0                 0                 1
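
For illustration, below is a minimal sketch of how these dummy variables could be created in R. The short character vector is hypothetical, not the actual Recidivism data.

# Hypothetical vector with the five Prison_Offense categories
prison_offense <- factor(c('Violent/Sex', 'Property', 'Drug', 'Other', 'Violent/Non-Sex'))

# model.matrix() with '0 +' keeps one dummy variable per level (one-hot encoding)
dummies <- model.matrix(~ 0 + prison_offense)
dummies

# The recipes package offers the same via step_dummy(..., one_hot = TRUE)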
6 / 49


7 / 49


8 / 49


NOTE

When you fit a standard regression model without regularization using ordinary least squares (OLS), common practice is to drop the dummy variable for one of the levels. So, for instance, if a nominal variable has N levels, you only need to create (N-1) dummy variables, as the Nth one carries redundant information; the excluded category is represented by the intercept term. Including all N dummy variables creates a problem because the OLS procedure tries to invert a singular matrix, and you will likely get an error message.

On the other hand, this is not an issue when you fit a regularized regression model, which will be the case in this class. Therefore, you do not need to drop one of the dummy variables and can include all of them in the analysis. In fact, it may be beneficial to keep the dummy variables for all categories in the model when regularization is used in the regression. Otherwise, the model may produce different predictions depending on which category is excluded.
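
A minimal sketch of the singularity issue described above, using hypothetical simulated data (not the Recidivism dataset):

set.seed(1)
race <- factor(sample(c('Black', 'White'), 20, replace = TRUE))

# Intercept column plus a dummy variable for every level
X <- cbind(1, model.matrix(~ 0 + race))

qr(X)$rank            # 2, not 3: the columns are linearly dependent
# solve(t(X) %*% X)   # would fail because X'X is singular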
9 / 49

Label encoding

  • When the variable of interest is ordinal, and there is a hierarchy among the levels, another alternative is to assign a numerical value to each category.

  • Consider the variable Age_At_Release in the Recidivism dataset. It is coded as 7 different age intervals in the dataset: 18-22, 23-27, 28-32, 33-37, 38-42, 43-47, 48 or older.

              Encoding 1   Encoding 2
18-22             20            1
23-27             25            2
28-32             30            3
33-37             35            4
38-42             40            5
43-47             45            6
48 or older       60            7
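
A minimal sketch of both encodings in R, using a short hypothetical vector of age intervals:

# Hypothetical vector of age intervals (not the actual Recidivism data)
age <- factor(c('23-27', '48 or older', '18-22', '33-37'),
              levels = c('18-22', '23-27', '28-32', '33-37',
                         '38-42', '43-47', '48 or older'),
              ordered = TRUE)

# Encoding 2: consecutive integers following the level order
as.integer(age)

# Encoding 1: map each level to a rough interval midpoint
midpoints <- c(20, 25, 30, 35, 40, 45, 60)
midpoints[as.integer(age)]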
10 / 49


11 / 49
  • Another example would be the variable Education Level in the Recidivism dataset.

  • How would you encode this variable?

                         Encoding 1   Encoding 2
Less than a HS diploma
HS diploma
At least some college
12 / 49

Polynomial Contrasts

  • Another way of encoding an ordinal variable is to use polynomial contrasts.

  • The polynomial contrasts may be helpful if one wants to explore whether or not there is a linear, quadratic, cubic, etc., relationship between the predictor variable and outcome variable.

  • If there are N levels, one can have polynomial terms up to the (N-1)th degree.

  • The polynomial terms are orthonormal vectors:

    • the sum of the squares within each column is equal to 1
    • the dot product of any two different columns is equal to 0.
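
These properties can be checked directly in R with contr.poly(); below is a small sketch for a variable with five levels.

# Orthogonal polynomial contrasts for a 5-level ordinal variable
# (columns correspond to the linear, quadratic, cubic, and quartic terms)
C <- contr.poly(5)
round(C, 3)

colSums(C^2)           # sum of squares within each column is 1
round(t(C) %*% C, 10)  # dot products of different columns are 0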
13 / 49


14 / 49

15 / 49










Processing Cyclic Variables

16 / 49
  • There are sometimes variables that are cyclic by nature (e.g., month, day of the week, hour of the day).

  • Neither dummy variables nor numerical encoding necessarily captures the information in these variables in the most meaningful way.

  • For cyclic variables, it may be more meaningful to create two new variables using sine and cosine transformations, as follows:

    $x_1 = \sin\left(\frac{2\pi x}{\max(x)}\right)$

    $x_2 = \cos\left(\frac{2\pi x}{\max(x)}\right)$
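
A minimal sketch of this transformation in R, applied to the 24 hours of a day (this is how the hour-of-day table on a later slide can be computed):

# Sine/cosine encoding of hour of day (x = 1, ..., 24)
hour <- 1:24
x1 <- sin(2 * pi * hour / max(hour))
x2 <- cos(2 * pi * hour / max(hour))
head(round(data.frame(hour, x1, x2), 3))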

17 / 49

18 / 49
  • Below is another example for the time of day:

hour      x1       x2
   1   0.259    0.966
   2   0.500    0.866
   3   0.707    0.707
   4   0.866    0.500
   5   0.966    0.259
   6   1.000    0.000
   7   0.966   -0.259
   8   0.866   -0.500
   9   0.707   -0.707
  10   0.500   -0.866
  11   0.259   -0.966
  12   0.000   -1.000
  13  -0.259   -0.966
  14  -0.500   -0.866
  15  -0.707   -0.707
  16  -0.866   -0.500
  17  -0.966   -0.259
  18  -1.000    0.000
  19  -0.966    0.259
  20  -0.866    0.500
  21  -0.707    0.707
  22  -0.500    0.866
  23  -0.259    0.966
  24   0.000    1.000

19 / 49










Processing Continuous Variables

20 / 49

Centering and Scaling

  • Centering a variable is done by subtracting the variable’s mean from every value:

    $X_{centered} = X - \bar{X}$

    Centering ensures that the mean of the centered variable equals zero.

  • Scaling a variable is dividing the value of each observation by the variable’s standard deviation:

    $X_{scaled} = \dfrac{X}{\sigma_X}$

    Scaling ensures that the standard deviation of the scaled variable equals 1.

  • When centering and scaling are both applied, it is called standardization:

    $z_X = \dfrac{X - \bar{X}}{\sigma_X}$
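
A minimal sketch of these three operations in R, using a small simulated variable:

set.seed(123)
x <- rnorm(50, mean = 10, sd = 3)    # simulated variable

x_centered <- x - mean(x)            # centering only
x_scaled   <- x / sd(x)              # scaling only
z          <- (x - mean(x)) / sd(x)  # standardization; equivalent to scale(x)

round(c(mean(z), sd(z)), 3)          # mean ~ 0, sd = 1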

21 / 49
  • When we standardize a variable, we ensure that its mean is equal to zero and variance is equal to 1.

  • Standardizing outcome and predictor variables may be critical and necessary for specific models (e.g., K-nearest neighbor, support vector machines, penalized regression), but it is not always necessary for other models (e.g., decision tree models).

  • Standardizing a variable only changes the first and second moments of a distribution (mean and variance)

  • Standardizing a variable doesn’t change the third and fourth moments of a distribution (skewness and kurtosis).

  • Some people in the data science field use the term normalization, but what they actually mean is standardization.

22 / 49

Box-Cox transformation

  • Variables with extreme skewness and kurtosis may deteriorate the model performance for certain types of models.

  • It may sometimes be useful to transform a variable with extreme skewness and kurtosis so that its distribution approximates a normal distribution.

  • The Box-Cox transformation is a method for finding an optimal value of the parameter $\lambda$ for the following transformation:

$y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \ln(y), & \lambda = 0 \end{cases}$

23 / 49
require(bestNormalize)
require(psych)

set.seed(9272022)

# Generate a highly skewed variable
old <- rbeta(1000, 1, 1000)

# Estimate the optimal lambda and apply the Box-Cox transformation
fit <- boxcox(old, standardize = FALSE)
fit

Non-Standardized Box Cox Transformation with 1000 nonmissing obs.:
 Estimated statistics:
 - lambda = 0.2449266
 - mean (before standardization) = -3.390823
 - sd (before standardization) = 0.1860852

# Extract the transformed values
new <- predict(fit)

describe(old)

   vars    n mean sd median trimmed mad min  max range skew kurtosis se
X1    1 1000    0  0      0       0   0   0 0.01  0.01 2.43    10.27  0

describe(new)

   vars    n  mean   sd median trimmed  mad   min   max range  skew kurtosis   se
X1    1 1000 -3.39 0.19  -3.39   -3.39 0.19 -3.94 -2.76  1.18 -0.03     -0.2 0.01

















24 / 49

Logit transformation

  • When a variable is a proportion bounded between 0 and 1, the logit transformation can be applied such that

    $\pi^* = \ln\left(\frac{\pi}{1-\pi}\right),$

    where $\pi$ represents a proportion.

  • It is particularly useful when your outcome variable is a proportion bounded between 0 and 1.

  • When a linear model is used for an outcome bounded between 0 and 1, the model predictions may fall outside the reasonable range of values (predictions less than zero or greater than one).

  • The logit transformation rescales the variable so that its range becomes $(-\infty, +\infty)$ on the logit scale.

  • One can build a model to predict $\pi^*$ instead of the proportion $\pi$, and then obtain the predicted proportion with a simple reverse operation: $\hat{\pi} = \dfrac{e^{\hat{\pi}^*}}{1 + e^{\hat{\pi}^*}}$

25 / 49

Below is an example of logit transformation for a randomly generated variable.

old <- rbeta(1000,1,1000)
new <- log(old/(1-old))
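
If a model is later fit on the logit scale, the predictions can be mapped back to proportions with the reverse operation from the previous slide. A minimal sketch continuing the example above:

# Back-transform the logit-scale values to proportions
back <- exp(new) / (1 + exp(new))
all.equal(back, old)   # TRUE, up to numerical precision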

26 / 49

Polynomial basis expansions

  • Basis expansions are useful for addressing nonlinear relationships between a continuous predictor and the outcome variable.

  • We can create a set of feature variables using a nonlinear function of a variable x, ϕ(x).

  • For continuous predictors, the most commonly used expansions are polynomial basis expansions.

  • The nth degree polynomial basis expansion can be represented by

$\phi(x) = \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \ldots + \beta_n x^n$

  • For continuous predictors, there is no limit on the degree of the polynomial.

  • The higher the degree of polynomial, the more flexible the model becomes, and there is a higher chance of overfitting.

  • Typically, polynomial terms up to the 3rd or 4th degree are more than enough.

  • One simply replaces the original variable x with the new variables obtained from ϕ(x).

27 / 49

Suppose we have 100 observations from a random normal variable x. The third-degree polynomial basis expansion (cubic basis expansion) can be obtained using the poly() function as follows.

set.seed(654)
x <- rnorm(100,0,1)
head(x)
[1] -0.76031762 -0.38970450 1.68962523 -0.09423560 0.09530146 0.81727228
head(poly(x, degree = 3))

                1           2            3
[1,] -0.070492258 -0.06612854  0.056003658
[2,] -0.030023304 -0.07454585 -0.003988336
[3,]  0.197028288  0.28324096  0.348896805
[4,]  0.002240307 -0.06560960 -0.044790680
[5,]  0.022936731 -0.05256865 -0.063289287
[6,]  0.101772051  0.04942613 -0.034439696

28 / 49










Handling Missing Data

29 / 49
  • For certain types of models, such as gradient boosting, missing data is not a problem, and the missing values can be left as they are without any processing.

  • Other models, such as regularized regression, require complete data, so one has to deal with missing values before modeling.

  • Handling missing data

    • Creating an indicator variable for missingness

    • Imputation

30 / 49

Creating Indicator Variable for Missingness

  • Identify the variables with missing data, and then create a binary indicator variable for each of them to mark missingness (0: not missing, 1: missing), as sketched below.
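
A minimal sketch with a small hypothetical vector:

# Hypothetical variable with missing values
x <- c(12, NA, 7, 3, NA, 9)

# Binary missingness indicator (0: not missing, 1: missing)
x_missing <- ifelse(is.na(x), 1, 0)
x_missing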
31 / 49
  • Missingness indicator variables don't solve the missing data problem because we may still have to impute the missing values for certain types of models.

  • An indicator variable about whether or not a variable is missing may sometimes provide some information in predicting the outcome when the missingness is not random.

  • If there is a systematic relationship between outcome and whether or not values are missing for a variable, missingness indicators may provide vital information.

  • This indicator variable would be meaningless for variables that don’t have any missing values.

32 / 49

Imputation

  • A common approach to handling missing data.

  • Below is a very naive example of how it would work if we have an outcome variable (Y) and three predictors (X1, X2, X3).

Imputation Models
  Outcome   Predictors
  X1        X2, X3
  X2        X1, X3
  X3        X1, X2

Prediction Model
  Outcome   Predictors
  Y         X1, X2, X3

  • Each predictor becomes an outcome of interest in the imputation stage, and the remaining predictors are used to build an imputation model that predicts its missing values. After the missing values are estimated and replaced for each predictor using its imputation model, the primary outcome of interest is predicted using the imputed X1, X2, and X3.
33 / 49
  • An imputation model can be as simple as an intercept-only model (mean imputation).

    • For numeric variables, missing values can be replaced with a simple mean, median, or mode of the observed data.

    • For categorical variables, missing values can be replaced with a value randomly drawn from a binomial or multinomial distribution with the observed probabilities.

  • An imputation model can also be as complex as desired using a regularized regression model, a decision tree model, or a K-nearest neighbors model.

  • The main idea behind a more complex imputation model is to find other observations that are similar (in terms of the other predictors) to the observations with a missing value, and to use the data from these similar observations to predict the missing values. A minimal sketch of the simplest case, mean imputation, follows this list.
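
As a sketch, mean imputation for a single hypothetical predictor looks like this; more complex imputation models follow the same logic with better predictions of the missing values.

# Hypothetical numeric predictor with missing values
x1 <- c(4.2, NA, 3.7, 5.1, NA, 4.9)

# Replace missing values with the mean of the observed values
x1_imputed <- ifelse(is.na(x1), mean(x1, na.rm = TRUE), x1)
x1_imputed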

34 / 49










the recipes package

35 / 49
  • Given a dataset, we may need to apply several processes for different types of variables.

  • One can write an R script to implement all these procedures manually, but it is likely a tedious job.

  • The recipes package helps us process data more efficiently and in an organized way.

  • The recipes package makes it easier to replicate the same data processing for future datasets as long as data comes in the same format (e.g., same column names, same variable types).

  • The recipes demo notebook

36 / 49
  • Note that the order of procedures applied to variables is important.

  • For instance:

    • it would be meaningless to use step_indicate_na() after step_impute_bag(). Why?

    • there would be a problem if you first standardized variables using step_normalize() and then applied a Box-Cox transformation. Why? (A sketch of a sensible ordering follows this list.)

  • For a complete list of step_ functions available in the recipes package, check this page.
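
Below is a minimal sketch of a recipe with the steps in a sensible order; the toy data frame and column names (train_df, x_num, x_cat, outcome) are hypothetical, not the Recidivism data.

library(recipes)

# Hypothetical toy data
set.seed(1)
train_df <- data.frame(outcome = rnorm(10),
                       x_num   = c(runif(8, 10, 100), NA, NA),
                       x_cat   = factor(sample(c('A', 'B', 'C'), 10, replace = TRUE)))

blueprint <- recipe(outcome ~ ., data = train_df) %>%
  step_indicate_na(x_num) %>%        # flag missingness before imputing
  step_impute_mean(x_num) %>%        # then fill in the missing values
  step_BoxCox(x_num) %>%             # transform before standardizing
  step_normalize(x_num) %>%          # center and scale
  step_dummy(x_cat, one_hot = TRUE)  # one-hot encode the nominal predictor

prepped <- prep(blueprint, training = train_df)   # estimate the required statistics
baked   <- bake(prepped, new_data = train_df)     # apply the same processing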

37 / 49

Make sure you review the following notebook to see how to process the various types of variables in the NIJ's Recidivism dataset using the recipes package.

https://www.kaggle.com/code/uocoeeds/lecture-2a-data-preprocessing-i

38 / 49










Processing Text Data with pre-trained NLP models

39 / 49

Natural Language Processing (NLP)

  • NLP = Linguistics + Computer Science + Statistics

  • The ultimate goal is to develop algorithms and models that understand and use human language the way we understand and use it.

  • The goal is not only to understand individual words but also the context in which these words are used.

  • Recent advancements in the field of NLP have revolutionized language models, which now play a critical role in our daily lives:

    • Has Gmail ever suggested how to complete a sentence?

    • Has Outlook ever suggested a greeting message when you started drafting an email?

    • Have you ever interacted with a chatbot?

40 / 49

41 / 49
  • Below is a brief list of some of these NLP models with basic information about each.

Model           Developer     Year   # of parameters   Estimated cost
BERT-Large      Google AI     2018   336 M             $7K
RoBERTa-Large   Facebook AI   2019   335 M             ?
GPT-2 XL        OpenAI        2019   1.5 B             $50K
T5              Google AI     2020   11 B              $1.3 M
GPT-3           OpenAI        2020   175 B             $12.0 M
  • These models are expensive to train and rely on enormous amounts of data.

  • For instance,

    • Bert/Roberta was trained using the entire Wikipedia and a Book Corpus (a total of ~ 4.7 billion words),

    • GPT-2 was trained using 8 million web pages, and

    • GPT3 was trained on 45 TB of data from the internet and books.

42 / 49
  • All these models except GPT-3 are open source.

  • They can be used immediately through open-source libraries (typically in Python).

  • Hugging Face is a platform where people host pre-trained AI/ML models, much like CRAN hosts R packages.

    https://huggingface.co/models
  • The tasks that can be achieved with these models are:

    • text generation

    • text classification

    • text summarization

    • question answering

    • sentence similarity

    • translation

    • speech recognition

    • audio classification

    • image classification

    • object detection

43 / 49

The reticulate package

The reticulate package provides an interface to call and run Python from R.

library(reticulate)
py_config()
python: C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/python.exe
libpython: C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/python36.dll
pythonhome: C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate
version: 3.6.13 (default, Sep 23 2021, 07:38:49) [MSC v.1916 64 bit (AMD64)]
Architecture: 64bit
numpy: C:/Users/cengiz/AppData/Local/r-miniconda/envs/r-reticulate/Lib/site-packages/numpy
numpy_version: 1.19.5
NOTE: Python version was forced by RETICULATE_PYTHON
conda_list()
name python
1 Anaconda3 C:\\Users\\cengiz\\Anaconda3\\python.exe
2 r-reticulate C:\\Users\\cengiz\\AppData\\Local\\r-miniconda\\envs\\r-reticulate\\python.exe
use_condaenv('r-reticulate')
# conda_install(envname = 'r-reticulate',
# packages = 'sentence_transformers',
# pip = TRUE)
st <- import('sentence_transformers')
st
Module(sentence_transformers)
44 / 49

Sample model: RoBERTa

  • Next, we will pick a pre-trained language model to play with.

  • It can be any model available on Hugging Face.

  • You have to find the associated Hugging Face page for that particular model and use the same tag used for that model.

  • For instance, suppose we want to use the RoBERTa model. The associated Hugging Face page for this model is https://huggingface.co/roberta-base

  • Notice that the tag for this model is roberta-base.

model.name <- 'roberta-base'
roberta <- st$models$Transformer(model.name)
pooling_model <- st$models$Pooling(roberta$get_word_embedding_dimension())
model <- st$SentenceTransformer(modules = list(roberta,pooling_model))
model
SentenceTransformer(
(0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: RobertaModel
(1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
45 / 49
  • Each model has a limit on the length of the text sequence (number of tokens) it can process.

  • For instance, RoBERTa can handle a text sequence with a maximum of 512 tokens.

model$get_max_seq_length()
[1] 512
  • If we submit any text longer than 512 tokens, it will only process the first 512 tokens.

  • Another essential characteristic is the length of the output vector when a language model returns numerical embeddings.

  • The following code reveals that RoBERTa returns a vector with a length of 768.

model$get_sentence_embedding_dimension()
[1] 768
  • RoBERTa can take any text sequence of up to 512 tokens as input and then return a numerical vector of length 768 that represents this text sequence. This process is also called encoding.
46 / 49
  • For instance, we can get the embeddings for a single word ‘sofa’.
model$encode('sofa')
[1] -0.035673961 0.009145292 0.045155115 -0.022831999 0.444271356 -0.213501185 0.016776590 -0.031104138
[9] 0.010763753 -0.109695613 -0.218385130 -0.114003226 0.123890847 -0.079736136 0.169241652 0.033682950
[17] -0.060081303 0.062376734 0.090727039 -0.022418823 -0.048738785 0.146164417 -0.040947653 0.048980530
[25] -0.078615293 -0.001459451 0.122411616 0.016039580 -0.027918482 -0.063380398 -0.218016654 -0.130548552
[33] 0.069047354 -0.001986546 0.043062352 0.060666095 0.076921932 0.082324371 -0.009124480 0.074072585
[41] -0.099838220 0.019355917 -0.161781847 0.006589258 -0.006635100 -0.009499688 0.142429739 -0.162081540
[49] 0.035310790 -0.042761490 0.091216803 -0.069520645 -0.067890733 0.085272178 -0.052535873 -0.128475532
[57] 0.078527123 -0.065088570 -0.077463746 0.036280572 -0.076873213 0.503972054 -0.041739281 0.019071214
[65] -0.034255326 0.059130061 -0.068348601 0.298463583 0.103186190 -0.045786552 0.005054155 -0.082052834
[73] 0.067019530 0.096304454 -0.005556324 -0.014345085 0.089176968 -2.471521616 -0.152103558 0.050706096
[81] 0.071368039 -0.075957939 0.637347639 0.123483524 0.097477347 0.002311159 0.017645134 0.233651847
[89] -0.020067355 0.051375460 0.057862498 0.038290158 0.038890921 0.078757085 0.026953537 0.042133540
[97] -0.018540401 0.316686124 -0.064064525 0.024643535
[ reached getOption("max.print") -- omitted 668 entries ]
  • Similarly, we can get the vector of numerical embeddings for a whole sentence.
model$encode('I like to drink Turkish coffee')
[1] -0.011787563 0.102784999 0.010933759 -0.046234991 -0.004870780 -0.044982813 0.082253136 -0.030680846
[9] 0.002545473 -0.074854769 0.006268861 -0.161093548 0.073251143 -0.008548131 0.040855005 0.282602638
[17] 0.113437660 0.134714946 0.070556641 0.369077712 -0.023299653 0.129972905 -0.087970734 0.001767353
[25] -0.135965645 0.056659225 0.119028516 -0.015006499 0.137721002 -0.004969854 -0.084344238 -0.078683749
[33] 0.021612354 -0.039848015 0.044566937 0.058498740 0.116407432 0.022860289 -0.009472768 0.005023805
[41] 0.046304196 -0.321840614 -0.020347901 0.017407298 0.013550013 -0.046380099 -0.056976926 -0.140553191
[49] 0.074571297 -0.012873427 -0.015933262 0.075764850 0.029388802 0.041944694 -0.055896059 0.032724179
[57] 0.038444079 0.054137304 0.120718785 -0.074913748 0.019386305 0.541475773 -0.155054778 0.053815808
[65] 0.040421356 -0.006427096 -0.007460407 -0.121550187 -0.024524804 0.108934306 0.030513961 -0.088536829
[73] 0.049289271 -0.089430012 0.018972535 0.071318731 0.068718828 -4.136425018 0.133511454 0.076649025
[81] 0.036512222 -0.096736401 0.892177880 -0.043068361 0.041941863 -0.055145040 -0.021039886 0.175592646
[89] 0.012270497 -0.003860307 0.044418067 -0.035518471 -0.075384133 0.111435041 0.049394138 0.038163140
[97] 0.098541118 -0.023031345 -0.001796652 -0.000854142
[ reached getOption("max.print") -- omitted 668 entries ]
47 / 49
  • The input can be many sentences.

  • For instance, if I submit a vector of three sentences as input, the model returns a 3 x 768 matrix containing the sentence embeddings.

Each row contains the embeddings for the corresponding sentence.

my.sentences <- c('The weather today is great.',
'I live in Eugene.',
'I am a graduate student.')
embeddings <- model$encode(my.sentences)
dim(embeddings)
[1] 3 768
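
One simple downstream use of this matrix is sentence similarity. Below is a minimal sketch of the cosine similarities among the three sentences, using base R and the embeddings object from the chunk above.

# Cosine similarity between every pair of the three sentences
norms <- sqrt(rowSums(embeddings^2))
cosine_sim <- (embeddings %*% t(embeddings)) / (norms %o% norms)
round(cosine_sim, 3)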
48 / 49

Make sure you review the following notebook to see how to process 2834 reading excerpts in the CommonLit Readability dataset and obtain a 2834 x 768 matrix of numerical embeddings using the AllenAI Longformer model.

https://www.kaggle.com/code/uocoeeds/lecture-2b-data-preprocessing-ii

49 / 49
