class: center, middle, inverse, title-slide

.title[
# Introduction to K-Nearest Neighbors Algorithm
]

.author[
### Cengiz Zopluoglu
]

.institute[
### College of Education, University of Oregon
]

.date[
### Nov 21, 2022
Eugene, OR
]

---

<style>

.blockquote { border-left: 5px solid #007935; background: #f9f9f9; padding: 10px; padding-left: 30px; margin-left: 16px; margin-right: 0; border-radius: 0px 4px 4px 0px; }

#infobox { padding: 1em 1em 1em 4em; margin-bottom: 10px; border: 2px solid black; border-radius: 10px; background: #E6F6DC 5px center/3em no-repeat; }

.centering[ float: center; ]

.left-column2 { width: 50%; height: 92%; float: left; padding-top: 1em; }

.right-column2 { width: 50%; float: right; padding-top: 1em; }

.remark-code { font-size: 18px; }

.tiny .remark-code { /*Change made here*/ font-size: 75% !important; }

.tiny2 .remark-code { /*Change made here*/ font-size: 60% !important; }

.indent { margin-left: 3em; }

.single { line-height: 1; }

.double { line-height: 2; }

.title-slide h1 { padding-top: 0px; font-size: 40px; text-align: center; padding-bottom: 18px; margin-bottom: 18px; }

.title-slide h2 { font-size: 30px; text-align: center; padding-top: 0px; margin-top: 0px; }

.title-slide h3 { font-size: 30px; color: #26272A; text-align: center; text-shadow: none; padding: 10px; margin: 10px; line-height: 1.2; }

</style>

### The goals:

- K-nearest Neighbors Algorithm
- The concept of **distance** between two vectors
- The concept of K-nearest neighbors
- Predicting an outcome based on K-nearest neighbors
- Kernels to weight the neighbors
- Review of Kaggle notebooks for building KNN models

---

# Distance Between Two Vectors

- Imagine that each observation in a dataset lives in a *P*-dimensional space, where *P* is the number of predictors.

  - Observation 1: `\(\mathbf{A} = (A_1, A_2, A_3, ..., A_P)\)`

  - Observation 2: `\(\mathbf{B} = (B_1, B_2, B_3, ..., B_P)\)`

- A general definition of the distance between two vectors is the **Minkowski Distance**:

`$$\left ( \sum_{i=1}^{P}|A_i-B_i|^q \right )^{\frac{1}{q}},$$`

where `\(q\)` can take any positive value.

---

- Suppose that we have two observations and three predictors:

  - Observation 1: (20, 25, 30)
  - Observation 2: (80, 90, 75)
---

- If we assume that `\(q=1\)` in the Minkowski equation above, we can calculate the distance as follows:

.indent[
.single[
.tiny[

```r
A <- c(20,25,30)
B <- c(80,90,75)

sum(abs(A - B))
```

```
[1] 170
```
]]]

- If we assume that `\(q=2\)` in the Minkowski equation above, we can calculate the distance as follows:

.indent[
.single[
.tiny[

```r
A <- c(20,25,30)
B <- c(80,90,75)

(sum(abs(A - B)^2))^(1/2)
```

```
[1] 99.25
```
]]]

- If we assume that `\(q=3\)` in the Minkowski equation above, we can calculate the distance as follows:

.indent[
.single[
.tiny[

```r
A <- c(20,25,30)
B <- c(80,90,75)

(sum(abs(A - B)^3))^(1/3)
```

```
[1] 83.48
```
]]]

---

When `\(q\)` is equal to 1 in the Minkowski equation, it becomes a special case known as the **Manhattan Distance**.
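Since the three calculations above differ only in the exponent, they can be wrapped in a small helper function. This is only an illustrative sketch; the function name `minkowski` is not from any package used in these slides.

.indent[
.single[
.tiny[

```r
# Illustrative helper (not from any package): Minkowski distance
# between two numeric vectors for a given q

minkowski <- function(a, b, q){
  sum(abs(a - b)^q)^(1/q)
}

A <- c(20,25,30)
B <- c(80,90,75)

minkowski(A, B, q = 1)   # 170    (Manhattan)
minkowski(A, B, q = 2)   # 99.25  (Euclidean)
minkowski(A, B, q = 3)   # 83.48
```
]]]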
---

When `\(q\)` is equal to 2 in the Minkowski equation, it becomes another special case known as the **Euclidean Distance**.
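As an optional cross-check of the numbers above, base R's `dist()` function reproduces these special cases; `method = "minkowski"` with `p` playing the role of `\(q\)` gives the general form.

.indent[
.single[
.tiny[

```r
# Optional cross-check with base R's dist()
X <- rbind(A = c(20,25,30),
           B = c(80,90,75))

dist(X, method = "manhattan")          # 170
dist(X, method = "euclidean")          # about 99.25
dist(X, method = "minkowski", p = 3)   # about 83.48
```
]]]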
---

# K-Nearest Neighbors

- When there are `\(N\)` observations in a dataset, the distance between any observation and the remaining `\(N-1\)` observations can be computed using the Minkowski distance (with a user-defined choice of the `\(q\)` value, a hyperparameter).

- Then, for any given observation, we can rank-order the remaining observations by how close they are to that observation and select the K closest observations based on their distances.

- Suppose that there are ten observations measured on three predictor variables (X1, X2, and X3) with the following values.

.indent[
.single[
.tiny[

```r
d <- data.frame(x1 = c(20,25,30,42,10,60,65,55,80,90),
                x2 = c(10,15,12,20,45,75,70,80,85,90),
                x3 = c(25,30,35,20,40,80,85,90,92,95),
                label = c('A','B','C','D','E','F','G','H','I','J'))

d
```

```
   x1 x2 x3 label
1  20 10 25     A
2  25 15 30     B
3  30 12 35     C
4  42 20 20     D
5  10 45 40     E
6  60 75 80     F
7  65 70 85     G
8  55 80 90     H
9  80 85 92     I
10 90 90 95     J
```
]]]

---
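Before listing all pairwise distances explicitly on the next slide, note that base R's `dist()` function can compute the full 10 x 10 Euclidean distance matrix for these observations in a single call. This is a minimal sketch; the subsetting and rounding are only for display.

.single[
.tiny[

```r
# All pairwise Euclidean distances among the ten observations
D <- as.matrix(dist(d[, c("x1","x2","x3")], method = "euclidean"))
dimnames(D) <- list(d$label, d$label)

round(D[1:5, 1:5], 2)   # e.g., the distance between A and B is about 8.66
```
]]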
---

Given that there are ten observations, we can calculate the distance for each of the 45 pairs of observations (e.g., the Euclidean distance).

.pull-left[
.indent[
.single[
.tiny2[

```r
labels <- c('A','B','C','D','E',
            'F','G','H','I','J')

dist <- as.data.frame(t(combn(labels,2)))
dist$euclidian <- NA

for(i in 1:nrow(dist)){

  a <- d[d$label==dist[i,1],1:3]
  b <- d[d$label==dist[i,2],1:3]

  dist[i,]$euclidian <- sqrt(sum((a-b)^2))

}

dist
```
]]]]

.pull-right[
.indent[
.single[
.tiny2[

```
   V1 V2 euclidian
1   A  B     8.660
2   A  C    14.283
3   A  D    24.678
4   A  E    39.370
5   A  F    94.074
6   A  G    96.047
7   A  H   101.735
8   A  I   117.107
9   A  J   127.279
10  B  C     7.681
11  B  D    20.347
12  B  E    35.000
13  B  F    85.586
14  B  G    87.464
15  B  H    93.408
16  B  I   108.485
17  B  J   118.638
18  C  D    20.809
19  C  E    38.910
20  C  F    83.030
21  C  G    84.196
22  C  H    90.962
23  C  I   105.252
24  C  J   115.256
25  D  E    45.266
26  D  F    83.361
27  D  G    85.170
28  D  H    93.107
29  D  I   104.178
30  D  J   113.265
31  E  F    70.711
32  E  G    75.333
33  E  H    75.829
 [ reached 'max' / getOption("max.print") -- omitted 12 rows ]
```
]]]]

---

For instance, we can find the three closest observations to **Point E** (its 3-nearest neighbors). As seen below, the 3-nearest neighbors of **Point E** in this dataset are **Point B**, **Point C**, and **Point A**.

.single[
.tiny2[

```r
# Point E is the fifth observation in the dataset

loc <- which(dist[,1]=='E' | dist[,2]=='E')

tmp <- dist[loc,]

tmp[order(tmp$euclidian),]
```

```
   V1 V2 euclidian
12  B  E     35.00
19  C  E     38.91
4   A  E     39.37
25  D  E     45.27
31  E  F     70.71
32  E  G     75.33
33  E  H     75.83
34  E  I     95.94
35  E  J    107.00
```
]]

---

<br>
<br>
<br>

***

<div id="infobox">

<center style="color:black;"> <b>NOTE 1</b> </center>

The `\(q\)` in the Minkowski distance equation and the `\(K\)` in the K-nearest neighbors algorithm are user-defined hyperparameters of the KNN algorithm. As a researcher and model builder, you can pick any values for `\(q\)` and `\(K\)`. They can be tuned using an approach similar to the one applied in earlier classes for regularized regression models: pick a set of values for these hyperparameters and apply a grid search to find the combination that provides the best predictive performance. It is typical to observe overfitting (high model variance, low model bias) for small values of `\(K\)` and underfitting (low model variance, high model bias) for large values of `\(K\)`. In general, people tend to focus their grid search for `\(K\)` around `\(\sqrt{N}\)`.

</div>

***

---

<br>
<br>
<br>
<br>

***

<div id="infobox">

<center style="color:black;"> <b>NOTE 2</b> </center>

It is essential to remember that the distance between two observations depends heavily on the scale of measurement of the predictor variables. If the predictors are on different scales, the distance calculation will be dominated by differences in the predictors with larger scales, which is not ideal. Therefore, all predictors should be centered and scaled before applying the KNN algorithm so that each predictor contributes similarly to the distance calculation.

</div>

***

---

# Prediction with K-Nearest Neighbors

Below is a list of steps for predicting an outcome for a given observation. A compact sketch of these steps as an R function follows the list.

- 1. Calculate the distance between the observation and the remaining `\(N-1\)` observations in the data (with a user-defined choice of `\(q\)` in the Minkowski distance).

- 2. Rank-order the observations based on the calculated distances and choose the K nearest neighbors (with a user-defined choice of `\(K\)`).

- 3. Calculate the mean of the observed outcome among the K nearest neighbors as your prediction.
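This is only an illustrative sketch of the three steps for a continuous outcome, assuming the predictors are already centered and scaled; the function name `knn_predict` and its arguments are hypothetical and not part of any package used in these slides.

.single[
.tiny2[

```r
# Illustrative sketch of Steps 1-3 for a continuous outcome
# x_new: numeric vector of predictor values for the new observation
# X    : matrix (or all-numeric data frame) of predictors for the other observations
# y    : observed outcomes for the rows of X

knn_predict <- function(x_new, X, y, K = 20, q = 2){

  # Step 1: Minkowski distance from x_new to every row of X
  d <- apply(X, 1, function(row) sum(abs(row - x_new)^q)^(1/q))

  # Step 2: rank-order the distances and keep the K nearest neighbors
  nearest <- order(d)[1:K]

  # Step 3: average the observed outcome among the K nearest neighbors
  mean(y[nearest])
}
```
]]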
Note that Step 3 applies regardless of the type of outcome. If the outcome variable is continuous, we calculate the average outcome for the K-nearest neighbors as our prediction. If the outcome variable is binary (e.g., 0 vs. 1), the proportion of each class observed among the K-nearest neighbors yields the predicted probability of that class.

---

### An example of predicting a continuous outcome with the KNN algorithm

1. Import the data
2. Write a recipe for processing variables
3. Apply the recipe to the dataset

.indent[
.single[
.tiny2[

```r
# Import the dataset

readability <- read.csv('./data/readability_features.csv',header=TRUE)

# Write the recipe

require(recipes)

blueprint_readability <- recipe(x     = readability,
                                vars  = colnames(readability),
                                roles = c(rep('predictor',768),'outcome')) %>%
  step_zv(all_numeric()) %>%
  step_nzv(all_numeric()) %>%
  step_normalize(all_numeric_predictors())

# Apply the recipe

baked_read <- blueprint_readability %>%
  prep(training = readability) %>%
  bake(new_data = readability)
```
]]]

Our final dataset (`baked_read`) has 2834 observations and 769 columns (768 predictors; the last column is the target outcome). Suppose we would like to predict the readability score for the first observation.

---

The code below calculates the Minkowski distance (with `\(q=2\)`) between the first observation and each of the remaining 2833 observations by using the first 768 columns of the dataset (the predictors).

.indent[
.single[
.tiny2[

```r
dist <- data.frame(obs = 2:2834,dist = NA,target=NA)

for(i in 1:2833){

  a <- as.matrix(baked_read[1,1:768])
  b <- as.matrix(baked_read[i+1,1:768])

  dist[i,]$dist   <- sqrt(sum((a-b)^2))
  dist[i,]$target <- baked_read[i+1,]$target

  #print(i)
}
```
]]]

---

We now rank-order the observations from closest to most distant and choose the 20 nearest observations (K=20).

.single[
.tiny[

```r
# Rank order the observations from closest to the most distant

dist <- dist[order(dist$dist),]

# Check the 20-nearest neighbors

print(dist[1:20,], row.names = FALSE)
```

```
  obs  dist  target
 2441 24.18  0.5590
   45 24.37 -0.5864
 1992 24.91  0.1430
 2264 25.26 -0.9035
 2522 25.27 -0.6359
 2419 25.41 -0.2128
 1530 25.66 -1.8725
  239 25.93 -0.5611
  238 26.30 -0.8890
 1520 26.40 -0.6159
 2244 26.50 -0.3327
 1554 26.57 -1.8844
 1571 26.61 -1.1337
 2154 26.62 -1.1141
   76 26.64 -0.6056
 2349 26.68 -0.1593
 1189 26.85 -1.2395
 2313 26.95 -0.2532
 2179 27.05 -1.0299
 2017 27.06  0.1399
```
]]

---

Finally, we can calculate the average of the observed outcome for the 20 nearest neighbors, which becomes our prediction of the readability score for the first observation.

.tiny[

```r
mean(dist[1:20,]$target)
```

```
[1] -0.6594
```
]

The observed outcome (readability score) for the first observation:

.tiny[

```r
readability[1,]$target
```

```
[1] -0.3403
```
]

---

### An example of predicting a binary outcome with the KNN algorithm

- We can follow the same procedures to predict Recidivism in the second year after an individual's initial release from prison.

- The final dataset (`baked_recidivism`), after pre-processing, has 18,111 observations and 142 predictors.

- Suppose that we would like to predict the probability of Recidivism for the first individual.

- The code below calculates the Minkowski distance (with `\(q=2\)`) between the first individual and each of the remaining 18,110 individuals by using the values of the 142 predictors in this dataset.
.indent[
.single[
.tiny2[

```r
dist2 <- data.frame(obs = 2:18111,dist = NA,target=NA)

for(i in 1:18110){

  a <- as.matrix(baked_recidivism[1,3:144])
  b <- as.matrix(baked_recidivism[i+1,3:144])

  dist2[i,]$dist   <- sqrt(sum((a-b)^2))
  dist2[i,]$target <- as.character(baked_recidivism[i+1,]$Recidivism_Arrest_Year2)

  #print(i)
}
```
]]]

---

We now rank-order the individuals from closest to most distant and choose the 20 nearest observations. Then, we calculate the proportion of individuals who recidivated (Yes) and did not recidivate (No) among these 20 nearest neighbors.

.pull-left[
.single[
.tiny2[

```r
dist2 <- dist2[order(dist2$dist),]

print(dist2[1:20,], row.names = FALSE)
```

```
   obs  dist target
  7070 6.217     No
 14204 6.256     No
  1574 6.384     No
  4527 6.680     No
  8446 7.012     No
  6024 7.251     No
  7787 7.270     No
   565 7.279    Yes
  8768 7.288     No
  4646 7.359     No
  4043 7.376     No
  9113 7.385     No
  5316 7.405     No
  4095 7.536     No
  9732 7.566     No
   831 7.634     No
 14385 7.644     No
  2933 7.660    Yes
   647 7.676    Yes
  6385 7.685    Yes
```
]]]

.pull-right[
.single[
.tiny2[

```r
table(dist2[1:20,]$target)
```

```
 No Yes 
 16   4 
```

```r
# The observed outcome for the first individual

recidivism[1,]$Recidivism_Arrest_Year2
```

```
[1] 0
```
]]]

These proportions are the predicted probabilities of recidivating and not recidivating for the first individual. Based on the 20 nearest neighbors, the predicted probability that the first individual recidivates within two years is 0.2 (4/20).

---

# Kernels to Weight the Neighbors

- In the previous section, we used a simple average of the observed outcomes of the K-nearest neighbors.

- A simple average implies weighing each neighbor equally.

- Another way of averaging the target outcome over the K-nearest neighbors is to weigh each neighbor according to its distance and calculate a weighted average.

- A simple way to weigh each neighbor is to use the inverse of its distance.

- For instance, consider the earlier example in which we found the 20 nearest neighbors of the first observation in the readability dataset.

- We can assign a weight to each neighbor by taking the inverse of its distance and rescaling these weights so that they sum to 1.

---

.pull-left[
.single[
.tiny2[

```r
dist <- dist[order(dist$dist),]

k_neighbors <- dist[1:20,]

print(k_neighbors,row.names=FALSE)
```

```
  obs  dist  target
 2441 24.18  0.5590
   45 24.37 -0.5864
 1992 24.91  0.1430
 2264 25.26 -0.9035
 2522 25.27 -0.6359
 2419 25.41 -0.2128
 1530 25.66 -1.8725
  239 25.93 -0.5611
  238 26.30 -0.8890
 1520 26.40 -0.6159
 2244 26.50 -0.3327
 1554 26.57 -1.8844
 1571 26.61 -1.1337
 2154 26.62 -1.1141
   76 26.64 -0.6056
 2349 26.68 -0.1593
 1189 26.85 -1.2395
 2313 26.95 -0.2532
 2179 27.05 -1.0299
 2017 27.06  0.1399
```
]]]

.pull-right[
.single[
.tiny2[

```r
k_neighbors$weight <- 1/k_neighbors$dist
k_neighbors$weight <- k_neighbors$weight/sum(k_neighbors$weight)

print(k_neighbors,row.names=FALSE)
```

```
  obs  dist  target  weight
 2441 24.18  0.5590 0.05382
   45 24.37 -0.5864 0.05341
 1992 24.91  0.1430 0.05225
 2264 25.26 -0.9035 0.05152
 2522 25.27 -0.6359 0.05151
 2419 25.41 -0.2128 0.05122
 1530 25.66 -1.8725 0.05072
  239 25.93 -0.5611 0.05020
  238 26.30 -0.8890 0.04949
 1520 26.40 -0.6159 0.04930
 2244 26.50 -0.3327 0.04911
 1554 26.57 -1.8844 0.04899
 1571 26.61 -1.1337 0.04892
 2154 26.62 -1.1141 0.04890
   76 26.64 -0.6056 0.04886
 2349 26.68 -0.1593 0.04878
 1189 26.85 -1.2395 0.04847
 2313 26.95 -0.2532 0.04829
 2179 27.05 -1.0299 0.04812
 2017 27.06  0.1399 0.04810
```
]]]

We can now compute a weighted average of the target scores instead of a simple average.
.single[
.tiny2[

```r
sum(k_neighbors$target*k_neighbors$weight)
```

```
[1] -0.6526
```
]]

---

Several kernel functions can be used to assign weights to the K-nearest neighbors:

- Epanechnikov
- Rectangular
- Quartic
- Triweight
- Tricube
- Gaussian
- Cosine

For all of them, the closest neighbors receive the highest weights and the weights shrink as the distance increases; the functions differ slightly in how the weights are assigned.

---

Below is a demonstration of how the assigned weight changes as a function of distance for different kernel functions.

<img src="slide6a_files/figure-html/unnamed-chunk-25-1.svg" style="display: block; margin: auto;" />

---

<br>
<br>
<br>
<br>

***

<div id="infobox">

<center style="color:black;"> <b> NOTE 3 </b> </center>

Which kernel function should we use for weighting the neighbors? The type of kernel function can also be considered a hyperparameter to tune.

</div>

***

---

#### Hyperparameters for the KNN algorithm

```r
require(caret)
require(kknn)

getModelInfo()$kknn$parameters
```

```
  parameter     class           label
1      kmax   numeric Max. #Neighbors
2  distance   numeric        Distance
3    kernel character          Kernel
```

---

#### **Kaggle Notebook**

[Building a Prediction Model for a Continuous Outcome Using the KNN Algorithm](https://www.kaggle.com/code/uocoeeds/building-a-prediction-model-using-knn)

**Performance Comparison of Different Algorithms**

|                   | R-square | MAE   | RMSE  |
|-------------------|:--------:|:-----:|:-----:|
| Linear Regression | 0.658    | 0.499 | 0.620 |
| Ridge Regression  | 0.727    | 0.432 | 0.536 |
| Lasso Regression  | 0.721    | 0.433 | 0.542 |
| Elastic Net       | 0.726    | 0.433 | 0.539 |
| KNN               | 0.611    | 0.519 | 0.648 |

---

#### **Kaggle Notebook**

[Building a Classification Model for a Binary Outcome Using the KNN Algorithm](https://www.kaggle.com/code/uocoeeds/building-a-classification-model-using-knn)

**Performance Comparison of Different Algorithms**

|                                         | -LL    | AUC    | ACC   | TPR   | TNR   | FPR   | PRE   |
|-----------------------------------------|:------:|:------:|:-----:|:-----:|:-----:|:-----:|:-----:|
| Logistic Regression                     | 0.5096 | 0.7192 | 0.755 | 0.142 | 0.949 | 0.051 | 0.471 |
| Logistic Regression with Ridge Penalty  | 0.5111 | 0.7181 | 0.754 | 0.123 | 0.954 | 0.046 | 0.461 |
| Logistic Regression with Lasso Penalty  | 0.5090 | 0.7200 | 0.754 | 0.127 | 0.952 | 0.048 | 0.458 |
| Logistic Regression with Elastic Net    | 0.5091 | 0.7200 | 0.753 | 0.127 | 0.952 | 0.048 | 0.456 |
| KNN                                     | ?      | ?      | ?     | ?     | ?     | ?     | ?     |
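---

As a reference, below is a hedged sketch of how the three hyperparameters (`kmax`, `distance`, `kernel`) could be tuned together with `caret` and the `kknn` engine, reusing the readability data and recipe from the earlier slides. The 10-fold `cv` object and the grid values here are illustrative assumptions, not the settings used in the Kaggle notebooks.

.single[
.tiny2[

```r
require(caret)
require(kknn)

# Illustrative resampling scheme and grid (not the notebook settings)
cv <- trainControl(method = 'cv', number = 10)

grid <- expand.grid(kmax     = c(10, 30, 50),                              # K
                    distance = c(1, 2, 3),                                 # q in the Minkowski distance
                    kernel   = c('rectangular','epanechnikov','gaussian')) # weighting kernel

# Tune KNN using the recipe (blueprint) and data defined earlier
knn_fit <- caret::train(blueprint_readability,
                        data      = readability,
                        method    = 'kknn',
                        trControl = cv,
                        tuneGrid  = grid)

# Best combination of hyperparameters found by the grid search
knn_fit$bestTune
```
]]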