Applied Machine Learning for Educational Data Science: K-Nearest Neighbors

Measuring the distance between two data points is at the core of the K Nearest Neighbors (KNN) algorithm, and it is essential to understand the concept of distance between two vectors.

Imagine that each observation in a dataset lives in a P-dimensional space, where P is the number of predictors.

A general definition of distance between two vectors is the Minkowski Distance. The Minkowski Distance can be defined as

For simplicity, suppose that we have two observations and three predictors, and we observe the following values for the two observations on these three predictors.

If we assume that the

for the Minkowski equation above, then we can calculate the distance as the following:

A <- c(20,25,30)
B <- c(80,90,75)


sum(abs(A - B))

When

is equal to 1 for the Minkowski equation, it becomes a special case known as Manhattan Distance. Manhattan Distance between these two data points is visualized below.

If we assume that the

for the Minkowski equation above, then we can calculate the distance as the following:

A <- c(20,25,30)
B <- c(80,90,75)


(sum(abs(A - B)^2))^(1/2)

When

is equal to 2 for the Minkowski equation, it is also a special case known as Euclidian Distance. The euclidian distance between these two data points is visualized below.

2. K-Nearest Neighbors

Given that there are

observations in a dataset, a distance between any observation and

remaining observations can be computed using Minkowski distance (with a user-defined choice of

value). Then, for any given observation, we can rank order the remaining observations based on how close they are to the given observation and then decide the K nearest neighbors (

), K observations closest to the given observation based on their distance.

Suppose that there are ten observations measured on three predictor variables (X1, X2, and X3) with the following values.

d <- data.frame(x1 =c(20,25,30,42,10,60,65,55,80,90),
                x2 =c(10,15,12,20,45,75,70,80,85,90),
                x3 =c(25,30,35,20,40,80,85,90,92,95),
                label= c('A','B','C','D','E','F','G','H','I','J'))

d

Given that there are ten observations, we can calculate the distance between all 45 pairs of observations (e.g., Euclidian distance).

dist <- as.data.frame(t(combn(1:10,2)))
dist$euclidian <- NA

for(i in 1:nrow(dist)){
  
  a <- d[dist[i,1],1:3]
  b <- d[dist[i,2],1:3]
  dist[i,]$euclidian <- sqrt(sum((a-b)^2))
  
}

dist

For instance, we can find the three closest observations to Point E (3-Nearest Neighbors). As seen below, the 3-Nearest Neighbors for Point E in this dataset would be Point B, Point C, and Point A.

# Point E is the fifth observation in the dataset

loc <- which(dist[,1]==5 | dist[,2]==5)

tmp <- dist[loc,]

tmp[order(tmp$euclidian),]

NOTE 1

The in the Minkowski distance equation and in the K-nearest neighbor are user-defined hyperparameters in the KNN algorithm. As a researcher and model builder, you can pick any values for and . They can be tuned using a similar approach applied in earlier classes for regularized regression models. One can pick a set of values for these hyperparameters and apply a grid search to find the combination that provides the best predictive performance.

It is typical to observe overfitting (high model variance, low model bias) for small values of K and underfitting (low model variance, high model bias) for large values of K. In general, people tend to focus their grid search for K around .

NOTE 2

It is essential to remember that the distance calculation between two observations is highly dependent on the scale of measurement for the predictor variables. If predictors are on different scales, the distance metric formula will favor the differences in predictors with larger scales, and it is not ideal. Therefore, it is essential to center and scale all predictors before the KNN algorithm so each predictor similarly contributes to the distance metric calculation.

3. Prediction with K-Nearest Neighbors (Do-It-Yourself)

Given that we learned about distance calculation and how to identify K-nearest neighbors based on a distance metric, the prediction in KNN is straightforward.

Note that Step 3 applies regardless of the type of outcome. If the outcome variable is continuous, we calculate the average outcome for the K-nearest neighbors as our prediction. If the outcome variable is binary (e.g., 0 vs. 1), then the proportion of observing each class among the K-nearest neighbors yields predicted probabilities for each class.

Below, I provide an example for both types of outcome using the Readability and Recidivism datasets.

3.1. Predicting a continuous outcome with the KNN algorithm

The code below is identical to the code we used in earlier classes for data preparation of the Readability datasets. Note that this is only to demonstrate the logic of model predictions in the context of K-nearest neighbors. So, we are using the whole dataset. In the next section, we will demonstrate the full workflow of model training and tuning with 10-fold cross-validation using the caret::train() function. 1. Import the data 2. Write a recipe for processing variables 3. Apply the recipe to the dataset

# Import the dataset

readability <- read.csv('./data/readability_features.csv',header=TRUE)

# Write the recipe

require(recipes)

blueprint_readability <- recipe(x     = readability,
                                vars  = colnames(readability),
                                roles = c(rep('predictor',768),'outcome')) %>%
  step_zv(all_numeric()) %>%
  step_nzv(all_numeric()) %>%
  step_normalize(all_numeric_predictors()) 

# Apply the recipe

baked_read <- blueprint_readability %>% 
  prep(training = readability) %>%
  bake(new_data = readability)

Our final dataset (baked_read) has 2834 observations and 769 columns (768 predictors; the last column is target outcome). Suppose we would like to predict the readability score for the first observation. The code below will calculate the Minkowski distance (with

) between the first observation and each of the remaining 2833 observations by using the first 768 columns of the dataset (predictors).

dist <- data.frame(obs = 2:2834,dist = NA,target=NA)

for(i in 1:2833){
  
  a <- as.matrix(baked_read[1,1:768])
  b <- as.matrix(baked_read[i+1,1:768])
  dist[i,]$dist   <- sqrt(sum((a-b)^2))
  dist[i,]$target <- baked_read[i+1,]$target

  #print(i)
}

We now rank-order the observations from closest to the most distant and then choose the 20 nearest observations (K=20). Finally, we can calculate the average of the observed outcome for the 20 nearest neighbors, which will become our prediction of the readability score for the first observation.

# Rank order the observations from closest to the most distant

dist <- dist[order(dist$dist),]

# Check the 20-nearest neighbors

dist[1:20,]

# Mean target for the 20-nearest observations

mean(dist[1:20,]$target)

# Check the actual observed value of reability for the first observation

readability[1,]$target

3.2. Predicting a binary outcome with the KNN algorithm

We can follow the same procedures to predict Recidivism in the second year after an individual’s initial release from prison.

# Import data

recidivism <- read.csv('./data/recidivism_y1 removed and recoded.csv',header=TRUE)

# Write the recipe

  outcome <- c('Recidivism_Arrest_Year2')
  
  id      <- c('ID')
  
  categorical <- c('Residence_PUMA',
                   'Prison_Offense',
                   'Age_at_Release',
                   'Supervision_Level_First',
                   'Education_Level',
                   'Prison_Years',
                   'Gender',
                   'Race',
                   'Gang_Affiliated',
                   'Prior_Arrest_Episodes_DVCharges',
                   'Prior_Arrest_Episodes_GunCharges',
                   'Prior_Conviction_Episodes_Viol',
                   'Prior_Conviction_Episodes_PPViolationCharges',
                   'Prior_Conviction_Episodes_DomesticViolenceCharges',
                   'Prior_Conviction_Episodes_GunCharges',
                   'Prior_Revocations_Parole',
                   'Prior_Revocations_Probation',
                   'Condition_MH_SA',
                   'Condition_Cog_Ed',
                   'Condition_Other',
                   'Violations_ElectronicMonitoring',
                   'Violations_Instruction',
                   'Violations_FailToReport',
                   'Violations_MoveWithoutPermission',
                   'Employment_Exempt') 

  numeric   <- c('Supervision_Risk_Score_First',
                 'Dependents',
                 'Prior_Arrest_Episodes_Felony',
                 'Prior_Arrest_Episodes_Misd',
                 'Prior_Arrest_Episodes_Violent',
                 'Prior_Arrest_Episodes_Property',
                 'Prior_Arrest_Episodes_Drug',
                 'Prior_Arrest_Episodes_PPViolationCharges',
                 'Prior_Conviction_Episodes_Felony',
                 'Prior_Conviction_Episodes_Misd',
                 'Prior_Conviction_Episodes_Prop',
                 'Prior_Conviction_Episodes_Drug',
                 'Delinquency_Reports',
                 'Program_Attendances',
                 'Program_UnexcusedAbsences',
                 'Residence_Changes',
                 'Avg_Days_per_DrugTest',
                 'Jobs_Per_Year')
  
  props      <- c('DrugTests_THC_Positive',
                  'DrugTests_Cocaine_Positive',
                  'DrugTests_Meth_Positive',
                  'DrugTests_Other_Positive',
                  'Percent_Days_Employed')
  

  for(i in categorical){
    
    recidivism[,i] <- as.factor(recidivism[,i])
    
  }
  # Blueprint for processing variables
    
  blueprint_recidivism <- recipe(x  = recidivism,
                    vars  = c(categorical,numeric,props,outcome,id),
                    roles = c(rep('predictor',48),'outcome','ID')) %>%
    step_indicate_na(all_of(categorical),all_of(numeric),all_of(props)) %>%
    step_zv(all_numeric()) %>%
    step_impute_mean(all_of(numeric),all_of(props)) %>%
    step_impute_mode(all_of(categorical)) %>%
    step_logit(all_of(props),offset=.001) %>%
    step_poly(all_of(numeric),all_of(props),degree=2) %>%
    step_normalize(paste0(numeric,'_poly_1'),
                   paste0(numeric,'_poly_2'),
                   paste0(props,'_poly_1'),
                   paste0(props,'_poly_2')) %>%
    step_dummy(all_of(categorical),one_hot=TRUE) %>%
    step_num2factor(Recidivism_Arrest_Year2,
                  transform = function(x) x + 1,
                  levels=c('No','Yes'))

# Apply the recipe

baked_recidivism <- blueprint_recidivism %>% 
  prep(training = recidivism) %>% 
  bake(new_data = recidivism)

The final dataset (baked_recidivism) has 18111 observations and 144 columns (the first column is the outcome variable, the second column is the ID variable, and remaining 142 columns are predictors). Now, suppose that we would like to predict the probability of Recidivism for the first individual. The code below will calculate the Minkowski distance (with

) between the first individual and each of the remaining 18,110 individuals by using values of the 142 predictors in this dataset.

dist2 <- data.frame(obs = 2:18111,dist = NA,target=NA)

for(i in 1:18110){
  
  a <- as.matrix(baked_recidivism[1,3:144])
  b <- as.matrix(baked_recidivism[i+1,3:144])
  dist2[i,]$dist   <- sqrt(sum((a-b)^2))
  dist2[i,]$target <- as.character(baked_recidivism[i+1,]$Recidivism_Arrest_Year2)

  #print(i)
}

We now rank-order the individuals from closest to the most distant and then choose the 100-nearest observations (K=100). Then, we calculate proportion of individuals who were recidivated (YES) and not recidivated (NO) among these 100-nearest neighbors. These proportions predict the probability of being recidivated or not recidivated for the first individual.

# Rank order the observations from closest to the most distant

dist2 <- dist2[order(dist2$dist),]

# Check the 100-nearest neighbors

dist2[1:100,]

# Mean target for the 100-nearest observations

table(dist2[1:100,]$target)

  # This indicates that the predicted probability of being recidivated is 0.28
  # for the first individual given the observed data for 100 most similar 
  # observations

# Check the actual observed outcome for the first individual

recidivism[1,]$Recidivism_Arrest_Year2

4. Kernels to Weight the Neighbors

In the previous section, we tried to understand how KNN predicts a target outcome by simply averaging the observed value for the target outcome from K-nearest neighbors. It was a simple average by equally weighing each neighbor.

Another way of averaging the target outcome from K-nearest neighbors would be to weigh each neighbor according to its distance and calculate a weighted average. A simple way to weigh each neighbor is to use the inverse of the distance. For instance, consider the earlier example where we find the 20-nearest neighbor for the first observation in the readability dataset.

dist <- dist[order(dist$dist),]

k_neighbors <- dist[1:20,]

k_neighbors

We can assign a weight to each neighbor by taking the inverse of their distance and rescaling them such that the sum of the weights equals 1.

k_neighbors$weight <- 1/k_neighbors$dist
k_neighbors$weight <- k_neighbors$weight/sum(k_neighbors$weight)


k_neighbors

Then, we can compute a weighted average of the target scores instead of a simple average.

# Weighted Mean target for the 20-nearest observations

sum(k_neighbors$target*k_neighbors$weight)

Several kernel functions can be used to assign weight to K-nearest neighbors (e.g., epanechnikov, quartic, triweight, tricube, gaussian, cosine). For all of them, closest neighbors are assigned higher weights while the weight gets smaller as the distance increases, and they slightly differ the way they assign the weight. Below is a demonstration of how assigned weight changes as a function of distance for different kernel functions.

5. Predicting a continuous outcome with the KNN algorithm via caret:train()

Please review the following notebook that builds a prediction model using the K-nearest neighbor algorithm for the readability dataset.

6. Predicting a binary outcome with the KNN algorithm via caret:train()

Please review the following notebook that builds a classification model using the K-nearest neighbor algorithm for the full recidivism dataset.

K-Nearest Neighbors

Author

Affiliation

Published

DOI

1. Distance Between Two Vectors

2. K-Nearest Neighbors

3. Prediction with K-Nearest Neighbors (Do-It-Yourself)

3.1. Predicting a continuous outcome with the KNN algorithm

3.2. Predicting a binary outcome with the KNN algorithm

4. Kernels to Weight the Neighbors

5. Predicting a continuous outcome with the KNN algorithm via `caret:train()`

6. Predicting a binary outcome with the KNN algorithm via `caret:train()`

Footnotes

Reuse

K-Nearest Neighbors

Author

Affiliation

Published

DOI

1. Distance Between Two Vectors

2. K-Nearest Neighbors

3. Prediction with K-Nearest Neighbors (Do-It-Yourself)

3.1. Predicting a continuous outcome with the KNN algorithm

3.2. Predicting a binary outcome with the KNN algorithm

4. Kernels to Weight the Neighbors

5. Predicting a continuous outcome with the KNN algorithm via caret:train()

6. Predicting a binary outcome with the KNN algorithm via caret:train()

Footnotes

Reuse

5. Predicting a continuous outcome with the KNN algorithm via `caret:train()`

6. Predicting a binary outcome with the KNN algorithm via `caret:train()`