
Introduction to K-Nearest Neighbors Algorithm

Cengiz Zopluoglu

College of Education, University of Oregon

Nov 21, 2022
Eugene, OR

1 / 28

The goals:

  • K-nearest Neighbors Algorithm

    • The concept of distance between two vectors

    • The concept of K-nearest neighbors

    • Predicting an outcome based on K-nearest neighbors

    • Kernels to Weight the Neighbors

    • Review of Kaggle notebooks for building KNN models

2 / 28

Distance Between Two Vectors

  • Imagine that each observation in a dataset lives in a P-dimensional space, where P is the number of predictors.

    • Observation 1: $A = (A_1, A_2, A_3, \dots, A_P)$

    • Observation 2: $B = (B_1, B_2, B_3, \dots, B_P)$

  • A general definition of distance between two vectors is the Minkowski Distance.

$\left(\sum_{i=1}^{P}|A_i - B_i|^q\right)^{\frac{1}{q}}$, where q can take any positive value.
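
This definition translates directly into a small helper function. The sketch below is only illustrative (the name minkowski() does not appear in the original materials) and assumes two numeric vectors of equal length and a positive q.

# A minimal sketch of the Minkowski distance between two numeric vectors
minkowski <- function(a, b, q) {
  (sum(abs(a - b)^q))^(1/q)
}

Setting q = 1 or q = 2 reproduces the Manhattan and Euclidean distances discussed on the following slides.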

3 / 28
  • Suppose that we have two observations and three predictors

    • Observation 1: (20,25,30)

    • Observation 2: (80,90,75)

4 / 28
  • If we assume that q=1 in the Minkowski equation above, then we can calculate the distance as follows:
A <- c(20,25,30)
B <- c(80,90,75)
sum(abs(A - B))
[1] 170
  • If we assume that q=2 in the Minkowski equation above, then we can calculate the distance as follows:
A <- c(20,25,30)
B <- c(80,90,75)
(sum(abs(A - B)^2))^(1/2)
[1] 99.25
  • If we assume that q=3 in the Minkowski equation above, then we can calculate the distance as follows:
A <- c(20,25,30)
B <- c(80,90,75)
(sum(abs(A - B)^3))^(1/3)
[1] 83.48
5 / 28

When q is equal to 1 for the Minkowski equation, it becomes a special case known as Manhattan Distance.

6 / 28

When q is equal to 2 for the Minkowski equation, it becomes another special case known as Euclidean Distance.
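
These special cases are also available through base R's dist() function; the short check below is a sketch that reproduces the hand calculations above (method and p are standard arguments of dist()).

A <- c(20,25,30)
B <- c(80,90,75)

# Manhattan (q = 1), Euclidean (q = 2), and Minkowski with q = 3 via dist()
dist(rbind(A, B), method = 'manhattan')          # 170
dist(rbind(A, B), method = 'euclidean')          # 99.25
dist(rbind(A, B), method = 'minkowski', p = 3)   # 83.48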

7 / 28

K-Nearest Neighbors

  • When there are N observations in a dataset, the distance between any observation and the remaining N−1 observations can be computed using the Minkowski distance (with a user-defined choice of the q value, a hyperparameter).

  • Then, for any given observation, we can rank order the remaining observations based on how close they are to that observation and select the K closest observations based on their distance.

  • Suppose that there are ten observations measured on three predictor variables (X1, X2, and X3) with the following values.

d <- data.frame(x1    = c(20,25,30,42,10,60,65,55,80,90),
                x2    = c(10,15,12,20,45,75,70,80,85,90),
                x3    = c(25,30,35,20,40,80,85,90,92,95),
                label = c('A','B','C','D','E','F','G','H','I','J'))

d
x1 x2 x3 label
1 20 10 25 A
2 25 15 30 B
3 30 12 35 C
4 42 20 20 D
5 10 45 40 E
6 60 75 80 F
7 65 70 85 G
8 55 80 90 H
9 80 85 92 I
10 90 90 95 J
8 / 28

Given that there are ten observations, we can calculate the distance between all 45 possible pairs of observations (e.g., the Euclidean distance).

labels <- c('A','B','C','D','E',
            'F','G','H','I','J')

dist <- as.data.frame(t(combn(labels,2)))
dist$euclidian <- NA

for(i in 1:nrow(dist)){
  a <- d[d$label==dist[i,1], 1:3]
  b <- d[d$label==dist[i,2], 1:3]
  dist[i,]$euclidian <- sqrt(sum((a-b)^2))
}

dist
V1 V2 euclidian
1 A B 8.660
2 A C 14.283
3 A D 24.678
4 A E 39.370
5 A F 94.074
6 A G 96.047
7 A H 101.735
8 A I 117.107
9 A J 127.279
10 B C 7.681
11 B D 20.347
12 B E 35.000
13 B F 85.586
14 B G 87.464
15 B H 93.408
16 B I 108.485
17 B J 118.638
18 C D 20.809
19 C E 38.910
20 C F 83.030
21 C G 84.196
22 C H 90.962
23 C I 105.252
24 C J 115.256
25 D E 45.266
26 D F 83.361
27 D G 85.170
28 D H 93.107
29 D I 104.178
30 D J 113.265
31 E F 70.711
32 E G 75.333
33 E H 75.829
[ reached 'max' / getOption("max.print") -- omitted 12 rows ]
10 / 28

For instance, we can find the three closest observations to Point E (3-Nearest Neighbors). As seen below, the 3-Nearest Neighbors for Point E in this dataset would be Point B, Point C, and Point A.

# Point E is the fifth observation in the dataset
loc <- which(dist[,1]=='E' | dist[,2]=='E')
tmp <- dist[loc,]
tmp[order(tmp$euclidian),]
V1 V2 euclidian
12 B E 35.00
19 C E 38.91
4 A E 39.37
25 D E 45.27
31 E F 70.71
32 E G 75.33
33 E H 75.83
34 E I 95.94
35 E J 107.00
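
The same neighbors can also be found more compactly from a full distance matrix. The sketch below assumes the data frame d defined earlier and uses base R's dist(); it is an alternative to the pairwise loop above, not code from the original slides.

# Full 10 x 10 Euclidean distance matrix for the example data
D <- as.matrix(dist(d[,1:3], method = 'euclidean'))
rownames(D) <- colnames(D) <- d$label

# Distances from Point E, excluding E itself; keep the 3 closest
sort(D['E', -which(d$label == 'E')])[1:3]
#     B     C     A
# 35.00 38.91 39.37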
11 / 28





NOTE 1

The q in the Minkowski distance equation and the K in the K-nearest neighbors are user-defined hyperparameters of the KNN algorithm. As a researcher and model builder, you can pick any values for q and K. They can be tuned using an approach similar to the one applied in earlier classes for regularized regression models: pick a set of candidate values for these hyperparameters and apply a grid search to find the combination that provides the best predictive performance.

It is typical to observe overfitting (high model variance, low model bias) for small values of K and underfitting (low model variance, high model bias) for large values of K. In general, people tend to focus their grid search for K around √N.
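
As a rough sketch of what such a grid search could look like with a simple train/validation split (the data frame dat, the outcome name y, the split, and the candidate K values below are all illustrative assumptions, not part of the original example):

# A minimal sketch of tuning K, assuming a data frame dat whose first P
# columns are (centered and scaled) predictors and whose last column is a
# numeric outcome named y
set.seed(123)
idx   <- sample(nrow(dat), round(0.8 * nrow(dat)))
train <- dat[idx, ]
valid <- dat[-idx, ]
P     <- ncol(dat) - 1

# Predict the outcome for one new observation from its K nearest neighbors
knn_pred <- function(x_new, K){
  d <- sqrt(rowSums(sweep(as.matrix(train[, 1:P]), 2, as.numeric(x_new))^2))
  mean(train$y[order(d)[1:K]])
}

# Evaluate a few candidate values of K on the validation set
for(K in c(1, 5, 10, 20, 50, round(sqrt(nrow(train))))){
  pred <- apply(valid[, 1:P], 1, knn_pred, K = K)
  cat('K =', K, ' validation RMSE =', sqrt(mean((valid$y - pred)^2)), '\n')
}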


12 / 28






NOTE 2

It is essential to remember that the distance calculation between two observations depends heavily on the scale of measurement of the predictor variables. If the predictors are on different scales, the distance metric will be dominated by differences in the predictors with larger scales, which is not ideal. Therefore, it is essential to center and scale all predictors before running the KNN algorithm so that each predictor contributes similarly to the distance calculation.
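
A tiny illustration of this point (the two toy predictors below are made up for this example): without standardization, the predictor with the larger scale dominates the distance.

# Two observations on two predictors: x1 ranges over [0, 1], x2 over [0, 1000]
X <- rbind(c(0.9, 120),
           c(0.1, 150))

# The raw Euclidean distance is driven almost entirely by x2
dist(X)          # about 30.01, while the difference in x1 is only 0.8

# After centering and scaling, both predictors contribute equally
dist(scale(X))   # exactly 2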


13 / 28

Prediction with K-Nearest Neighbors

Below is a list of steps for predicting an outcome for a given observation.

    1. Calculate the distance between the observation and the remaining N−1 observations in the data (with a user-defined choice of q in the Minkowski distance).
    2. Rank order the observations based on the calculated distance and choose the K-nearest neighbors (with a user-defined choice of K).
    3. Calculate the mean of the observed outcome among the K-nearest neighbors as your prediction.

Note that Step 3 applies regardless of the type of outcome.

If the outcome variable is continuous, we calculate the average outcome for the K-nearest neighbors as our prediction.

If the outcome variable is binary (e.g., 0 vs. 1), the proportion of each class among the K-nearest neighbors yields the predicted probability for each class.
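
The three steps above can be collected into a small helper function. The sketch below is only illustrative (the function name and arguments are not part of the original materials) and assumes the predictors have already been centered and scaled.

# A minimal sketch of KNN prediction for a single new observation.
# x_train: numeric matrix of predictors, y_train: outcome vector,
# x_new: numeric vector of predictor values, K and q: hyperparameters.
knn_predict <- function(x_train, y_train, x_new, K = 5, q = 2){

  # Step 1: Minkowski distance to every observation in the data
  d <- (rowSums(abs(sweep(x_train, 2, x_new))^q))^(1/q)

  # Step 2: indices of the K nearest neighbors
  nn <- order(d)[1:K]

  # Step 3: average the outcome (mean for a continuous outcome,
  # class proportions for a binary outcome)
  if(is.numeric(y_train)) mean(y_train[nn]) else prop.table(table(y_train[nn]))
}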

14 / 28

An example of predicting a continuous outcome with the KNN algorithm

  1. Import the data

  2. Write a recipe for processing variables

  3. Apply the recipe to the dataset

# Import the dataset

readability <- read.csv('./data/readability_features.csv', header = TRUE)

# Write the recipe

require(recipes)

blueprint_readability <- recipe(x     = readability,
                                vars  = colnames(readability),
                                roles = c(rep('predictor',768),'outcome')) %>%
  step_zv(all_numeric()) %>%
  step_nzv(all_numeric()) %>%
  step_normalize(all_numeric_predictors())

# Apply the recipe

baked_read <- blueprint_readability %>%
  prep(training = readability) %>%
  bake(new_data = readability)

Our final dataset (baked_read) has 2834 observations and 769 columns (768 predictors; the last column is the target outcome).

Suppose we would like to predict the readability score for the first observation.

15 / 28

The code below will calculate the Minkowski distance (with q=2) between the first observation and each of the remaining 2833 observations by using the first 768 columns of the dataset (predictors).

dist <- data.frame(obs = 2:2834, dist = NA, target = NA)

for(i in 1:2833){
  a <- as.matrix(baked_read[1, 1:768])
  b <- as.matrix(baked_read[i+1, 1:768])
  dist[i,]$dist   <- sqrt(sum((a - b)^2))
  dist[i,]$target <- baked_read[i+1,]$target
  #print(i)
}
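
The loop above can also be written without an explicit for-loop. The sketch below is an equivalent vectorized version, assuming the same baked_read object.

# Vectorized alternative: distances from observation 1 to observations 2:2834
X     <- as.matrix(baked_read[, 1:768])
diffs <- sweep(X[-1, ], 2, X[1, ])     # subtract observation 1 from each row

dist  <- data.frame(obs    = 2:2834,
                    dist   = sqrt(rowSums(diffs^2)),
                    target = baked_read$target[-1])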
16 / 28

We now rank-order the observations from closest to the most distant and then choose the 20 nearest observations (K=20).

# Rank order the observations from closest to the most distant
dist <- dist[order(dist$dist),]
# Check the 20-nearest neighbors
print(dist[1:20,], row.names = FALSE)
obs dist target
2441 24.18 0.5590
45 24.37 -0.5864
1992 24.91 0.1430
2264 25.26 -0.9035
2522 25.27 -0.6359
2419 25.41 -0.2128
1530 25.66 -1.8725
239 25.93 -0.5611
238 26.30 -0.8890
1520 26.40 -0.6159
2244 26.50 -0.3327
1554 26.57 -1.8844
1571 26.61 -1.1337
2154 26.62 -1.1141
76 26.64 -0.6056
2349 26.68 -0.1593
1189 26.85 -1.2395
2313 26.95 -0.2532
2179 27.05 -1.0299
2017 27.06 0.1399
17 / 28

Finally, we can calculate the average of the observed outcome for the 20 nearest neighbors, which will become our prediction of the readability score for the first observation.

mean(dist[1:20,]$target)
[1] -0.6594

The observed outcome (readability score) for the first observation.

readability[1,]$target
[1] -0.3403
18 / 28

An example of predicting a binary outcome with the KNN algorithm

  • We can follow the same procedures to predict Recidivism in the second year after an individual's initial release from prison.

  • The final dataset (baked_recidivism) after pre-processing has 18111 observations and 142 predictors.

  • Suppose that we would like to predict the probability of Recidivism for the first individual.

  • The code below will calculate the Minkowski distance (with q=2) between the first individual and each of the remaining 18,110 individuals by using values of the 142 predictors in this dataset.

dist2 <- data.frame(obs = 2:18111, dist = NA, target = NA)

for(i in 1:18110){
  a <- as.matrix(baked_recidivism[1, 3:144])
  b <- as.matrix(baked_recidivism[i+1, 3:144])
  dist2[i,]$dist   <- sqrt(sum((a - b)^2))
  dist2[i,]$target <- as.character(baked_recidivism[i+1,]$Recidivism_Arrest_Year2)
  #print(i)
}
19 / 28

Suppose we now rank-order the individuals from closest to the most distant and then choose the 20 nearest observations.

Then, we calculate the proportion of individuals who recidivated (Yes) and did not recidivate (No) among these 20 nearest neighbors.

dist2 <- dist2[order(dist2$dist),]

print(dist2[1:20,], row.names = FALSE)
obs dist target
7070 6.217 No
14204 6.256 No
1574 6.384 No
4527 6.680 No
8446 7.012 No
6024 7.251 No
7787 7.270 No
565 7.279 Yes
8768 7.288 No
4646 7.359 No
4043 7.376 No
9113 7.385 No
5316 7.405 No
4095 7.536 No
9732 7.566 No
831 7.634 No
14385 7.644 No
2933 7.660 Yes
647 7.676 Yes
6385 7.685 Yes
table(dist2[1:20,]$target)
No Yes
16 4
# The observed outcome for the first individual
recidivism[1,]$Recidivism_Arrest_Year2
[1] 0

These proportions serve as the predicted probabilities of recidivating or not recidivating for the first individual.

Based on the 20 nearest neighbors, the predicted probability that the first individual recidivates within two years is 0.2 (4/20).
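
The same predicted probabilities can be read off directly with prop.table(), assuming the dist2 object created above:

prop.table(table(dist2[1:20,]$target))
 No Yes 
0.8 0.2 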

20 / 28

Kernels to Weight the Neighbors

  • In the previous section, we used a simple average of the observed outcome from K-nearest neighbors.

  • A simple average implies equally weighing each neighbor.

  • Another way of averaging the target outcome from K-nearest neighbors would be to weigh each neighbor according to its distance and calculate a weighted average.

  • A simple way to weigh each neighbor is to use the inverse of the distance.

  • For instance, consider the earlier example where we found the 20 nearest neighbors for the first observation in the readability dataset.

  • We can assign a weight to each neighbor by taking the inverse of their distance and rescaling them such that the sum of the weights equals 1.

21 / 28
dist <- dist[order(dist$dist),]
k_neighbors <- dist[1:20,]
print(k_neighbors,row.names=FALSE)
obs dist target
2441 24.18 0.5590
45 24.37 -0.5864
1992 24.91 0.1430
2264 25.26 -0.9035
2522 25.27 -0.6359
2419 25.41 -0.2128
1530 25.66 -1.8725
239 25.93 -0.5611
238 26.30 -0.8890
1520 26.40 -0.6159
2244 26.50 -0.3327
1554 26.57 -1.8844
1571 26.61 -1.1337
2154 26.62 -1.1141
76 26.64 -0.6056
2349 26.68 -0.1593
1189 26.85 -1.2395
2313 26.95 -0.2532
2179 27.05 -1.0299
2017 27.06 0.1399
k_neighbors$weight <- 1/k_neighbors$dist
k_neighbors$weight <- k_neighbors$weight/sum(k_neighbors$weight)
print(k_neighbors,row.names=FALSE)
obs dist target weight
2441 24.18 0.5590 0.05382
45 24.37 -0.5864 0.05341
1992 24.91 0.1430 0.05225
2264 25.26 -0.9035 0.05152
2522 25.27 -0.6359 0.05151
2419 25.41 -0.2128 0.05122
1530 25.66 -1.8725 0.05072
239 25.93 -0.5611 0.05020
238 26.30 -0.8890 0.04949
1520 26.40 -0.6159 0.04930
2244 26.50 -0.3327 0.04911
1554 26.57 -1.8844 0.04899
1571 26.61 -1.1337 0.04892
2154 26.62 -1.1141 0.04890
76 26.64 -0.6056 0.04886
2349 26.68 -0.1593 0.04878
1189 26.85 -1.2395 0.04847
2313 26.95 -0.2532 0.04829
2179 27.05 -1.0299 0.04812
2017 27.06 0.1399 0.04810

Compute a weighted average of the target scores instead of a simple average.

sum(k_neighbors$target*k_neighbors$weight)
[1] -0.6526
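
Base R's weighted.mean() gives the same result without manually rescaling the weights, since it normalizes them internally:

weighted.mean(k_neighbors$target, w = 1/k_neighbors$dist)
[1] -0.6526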
22 / 28

Several kernel functions can be used to assign weights to the K-nearest neighbors:

  • Epanechnikov

  • Rectangular

  • Quartic

  • Triweight

  • Tricube

  • Gaussian

  • Cosine

For all of these kernels, the closest neighbors receive the highest weights, and the weight shrinks as the distance increases; the kernels differ only slightly in how they assign these weights.
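
As an illustration (not code from the original slides), two of these kernels can be written as simple functions of a rescaled distance d between 0 and 1; the exact scaling used by software such as kknn may differ.

# Two example kernel weight functions of a rescaled distance d in [0, 1]
epanechnikov <- function(d) 0.75 * (1 - d^2) * (abs(d) <= 1)
gaussian     <- function(d) exp(-0.5 * d^2) / sqrt(2 * pi)

# Weights are largest at d = 0 and shrink as the distance grows
d <- seq(0, 1, by = 0.25)
round(rbind(Epanechnikov = epanechnikov(d), Gaussian = gaussian(d)), 3)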

23 / 28

Below is a demonstration of how assigned weight changes as a function of distance for different kernel functions.

24 / 28






NOTE 3

Which kernel function should we use for weighting the neighbors? The type of kernel function can also be considered a hyperparameter to tune.


25 / 28

Hyperparameters for the KNN algorithm

require(caret)
require(kknn)
getModelInfo()$kknn$parameters
parameter class label
1 kmax numeric Max. #Neighbors
2 distance numeric Distance
3 kernel character Kernel
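
Given these parameter names, a grid search with caret might look like the sketch below; the cross-validation settings and candidate values are illustrative choices, not the ones used in the Kaggle notebooks.

# A sketch of tuning kmax, distance, and kernel with caret + kknn,
# assuming the readability data with the outcome column named 'target'
grid <- expand.grid(kmax     = c(5, 10, 20, 50),
                    distance = c(1, 2, 3),
                    kernel   = c('rectangular', 'epanechnikov', 'gaussian'),
                    stringsAsFactors = FALSE)

knn_fit <- train(target ~ .,
                 data       = readability,
                 method     = 'kknn',
                 trControl  = trainControl(method = 'cv', number = 10),
                 tuneGrid   = grid,
                 preProcess = c('center', 'scale'))

knn_fit$bestTune   # best combination of kmax, distance, and kernel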
26 / 28

Kaggle Notebook

Building a Prediction Model for a Continuous Outcome Using the KNN Algorithm

Performance Comparison of Different Algorithms

R-square MAE RMSE
Linear Regression 0.658 0.499 0.620
Ridge Regression 0.727 0.432 0.536
Lasso Regression 0.721 0.433 0.542
Elastic Net 0.726 0.433 0.539
KNN 0.611 0.519 0.648
27 / 28

Kaggle Notebook

Building a Classification Model for a Binary Outcome Using the KNN Algorithm

Performance Comparison of Different Algorithms

-LL AUC ACC TPR TNR FPR PRE
Logistic Regression 0.5096 0.7192 0.755 0.142 0.949 0.051 0.471
Logistic Regression with Ridge Penalty 0.5111 0.7181 0.754 0.123 0.954 0.046 0.461
Logistic Regression with Lasso Penalty 0.5090 0.7200 0.754 0.127 0.952 0.048 0.458
Logistic Regression with Elastic Net 0.5091 0.7200 0.753 0.127 0.952 0.048 0.456
KNN ? ? ? ? ? ? ?
28 / 28
