K-nearest Neighbors Algorithm
The concept of distance between two vectors
The concept of K-nearest neighbors
Predicting an outcome based on K-nearest neighbors
Kernels to Weight the neighbors
Review of Kaggle notebooks for building KNN models
Imagine that each observation in a dataset lives in a P-dimensional space, where P is the number of predictors.
Observation 1: $A = (A_1, A_2, A_3, \ldots, A_P)$
Observation 2: $B = (B_1, B_2, B_3, \ldots, B_P)$
A general definition of distance between two vectors is the Minkowski Distance.
$$\left(\sum_{i=1}^{P} |A_i - B_i|^q\right)^{1/q},$$ where q can take any positive value.
A <- c(20, 25, 30)
B <- c(80, 90, 75)

sum(abs(A - B))
[1] 170
A <- c(20, 25, 30)
B <- c(80, 90, 75)

(sum(abs(A - B)^2))^(1/2)
[1] 99.25
A <- c(20, 25, 30)
B <- c(80, 90, 75)

(sum(abs(A - B)^3))^(1/3)

[1] 83.48
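The three calculations above are special cases of a single expression that changes only in q. Below is a minimal sketch that wraps the formula into a reusable function (the name minkowski is ours for illustration).

# Minkowski distance between two vectors for any positive value of q
minkowski <- function(a, b, q){
  sum(abs(a - b)^q)^(1/q)
}

A <- c(20, 25, 30)
B <- c(80, 90, 75)

minkowski(A, B, q = 1)   # q = 1 (Manhattan distance): 170
minkowski(A, B, q = 2)   # q = 2 (Euclidean distance): 99.25
minkowski(A, B, q = 3)   # q = 3: 83.48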
When there are N observations in a dataset, the distance between any observation and each of the remaining N−1 observations can be computed using the Minkowski distance (with a user-defined value of q, a hyperparameter).
Then, for any given observation, we can rank-order the remaining observations by their distance and select the K closest observations as its K-nearest neighbors.
Suppose that there are ten observations measured on three predictor variables (X1, X2, and X3) with the following values.
d <- data.frame(x1 = c(20, 25, 30, 42, 10, 60, 65, 55, 80, 90),
                x2 = c(10, 15, 12, 20, 45, 75, 70, 80, 85, 90),
                x3 = c(25, 30, 35, 20, 40, 80, 85, 90, 92, 95),
                label = c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J'))

d
   x1 x2 x3 label
1  20 10 25     A
2  25 15 30     B
3  30 12 35     C
4  42 20 20     D
5  10 45 40     E
6  60 75 80     F
7  65 70 85     G
8  55 80 90     H
9  80 85 92     I
10 90 90 95     J
Given that there are ten observations, we can calculate the distance between all 45 possible pairs of observations (e.g., the Euclidean distance).
labels <- c('A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J')

dist <- as.data.frame(t(combn(labels, 2)))
dist$euclidian <- NA

for(i in 1:nrow(dist)){
  a <- d[d$label == dist[i, 1], 1:3]
  b <- d[d$label == dist[i, 2], 1:3]
  dist[i, ]$euclidian <- sqrt(sum((a - b)^2))
}

dist
   V1 V2 euclidian
1   A  B     8.660
2   A  C    14.283
3   A  D    24.678
4   A  E    39.370
5   A  F    94.074
6   A  G    96.047
7   A  H   101.735
8   A  I   117.107
9   A  J   127.279
10  B  C     7.681
11  B  D    20.347
12  B  E    35.000
13  B  F    85.586
14  B  G    87.464
15  B  H    93.408
16  B  I   108.485
17  B  J   118.638
18  C  D    20.809
19  C  E    38.910
20  C  F    83.030
21  C  G    84.196
22  C  H    90.962
23  C  I   105.252
24  C  J   115.256
25  D  E    45.266
26  D  F    83.361
27  D  G    85.170
28  D  H    93.107
29  D  I   104.178
30  D  J   113.265
31  E  F    70.711
32  E  G    75.333
33  E  H    75.829
 [ reached 'max' / getOption("max.print") -- omitted 12 rows ]
For instance, we can find the three closest observations to Point E (3-Nearest Neighbors). As seen below, the 3-Nearest Neighbors for Point E in this dataset would be Point B, Point C, and Point A.
# Point E is the fifth observation in the dataset

loc <- which(dist[, 1] == 'E' | dist[, 2] == 'E')

tmp <- dist[loc, ]

tmp[order(tmp$euclidian), ]
   V1 V2 euclidian
12  B  E     35.00
19  C  E     38.91
4   A  E     39.37
25  D  E     45.27
31  E  F     70.71
32  E  G     75.33
33  E  H     75.83
34  E  I     95.94
35  E  J    107.00
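Note that base R's dist() function computes the same pairwise distances directly; its method and p arguments correspond to the Minkowski distance and the value of q.

# All pairwise Minkowski distances (q = 2) among the ten observations
stats::dist(d[, 1:3], method = 'minkowski', p = 2)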
The q in the Minkowski distance equation and the K in K-nearest neighbors are user-defined hyperparameters of the KNN algorithm. As a researcher and model builder, you can pick any values for q and K. They can be tuned using an approach similar to the one applied in earlier classes for regularized regression models: pick a set of values for these hyperparameters and apply a grid search to find the combination that provides the best predictive performance.
It is typical to observe overfitting (high model variance, low model bias) for small values of K and underfitting (low model variance, high model bias) for large values of K. In general, people tend to focus their grid search for K around √N.
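For example, a sketch of a candidate grid centered around √N (the window and step size below are arbitrary choices for illustration):

N      <- 1000                                   # hypothetical number of training observations
k_grid <- seq(round(sqrt(N)) - 20, round(sqrt(N)) + 20, by = 5)
k_grid

[1] 12 17 22 27 32 37 42 47 52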
It is essential to remember that the distance between two observations depends heavily on the scale of measurement of the predictor variables. If the predictors are on different scales, the distance will be dominated by differences in the predictors with larger scales, which is not ideal. Therefore, it is essential to center and scale all predictors before applying the KNN algorithm so that each predictor contributes similarly to the distance calculation.
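For the small toy dataset above, this can be done with base R's scale() function (a quick sketch; the recipes package used later in this section handles the same step for the larger datasets):

# Center and scale the three predictors so each has mean 0 and SD 1
d_scaled        <- d
d_scaled[, 1:3] <- scale(d[, 1:3])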
Below is a list of steps for predicting an outcome for a given observation.
Step 1: Calculate the distance between the given observation and each of the remaining observations.
Step 2: Rank the remaining observations by distance and identify the K nearest neighbors.
Step 3: Calculate the average of the observed outcome among the K-nearest neighbors as the prediction.
Note that Step 3 applies regardless of the type of outcome.
If the outcome variable is continuous, we calculate the average outcome of the K-nearest neighbors as our prediction.
If the outcome variable is binary (e.g., 0 vs. 1), the proportion of each class observed among the K-nearest neighbors yields the predicted probability of that class.
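As a minimal sketch, these steps can be collected into a single R function (the name knn_predict and its arguments are ours for illustration; it assumes the predictors have already been centered and scaled):

# X: matrix of predictors, y: outcome vector, x_new: a single new observation, K and q: hyperparameters
knn_predict <- function(X, y, x_new, K, q = 2){
  dists   <- apply(X, 1, function(row) sum(abs(row - x_new)^q)^(1/q))  # Step 1: distances to x_new
  nearest <- order(dists)[1:K]                                         # Step 2: K closest observations
  mean(y[nearest])                                                     # Step 3: average their outcomes
}

For a binary outcome coded 0/1, the same average is the proportion of 1s among the neighbors, i.e., the predicted probability of class 1.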
Import the data
Write a recipe for processing variables
Apply the recipe to the dataset
# Import the dataset
readability <- read.csv('./data/readability_features.csv', header = TRUE)

# Write the recipe
require(recipes)

blueprint_readability <- recipe(x     = readability,
                                vars  = colnames(readability),
                                roles = c(rep('predictor', 768), 'outcome')) %>%
  step_zv(all_numeric()) %>%
  step_nzv(all_numeric()) %>%
  step_normalize(all_numeric_predictors())

# Apply the recipe
baked_read <- blueprint_readability %>%
  prep(training = readability) %>%
  bake(new_data = readability)
Our final dataset (baked_read) has 2834 observations and 769 columns (768 predictors; the last column is the target outcome).
Suppose we would like to predict the readability score for the first observation.
The code below will calculate the Minkowski distance (with q=2) between the first observation and each of the remaining 2833 observations by using the first 768 columns of the dataset (predictors).
dist <- data.frame(obs = 2:2834, dist = NA, target = NA)

for(i in 1:2833){
  a <- as.matrix(baked_read[1, 1:768])
  b <- as.matrix(baked_read[i + 1, 1:768])
  dist[i, ]$dist   <- sqrt(sum((a - b)^2))
  dist[i, ]$target <- baked_read[i + 1, ]$target
  #print(i)
}
We now rank-order the observations from closest to the most distant and then choose the 20 nearest observations (K=20).
# Rank order the observations from closest to the most distant
dist <- dist[order(dist$dist), ]

# Check the 20-nearest neighbors
print(dist[1:20, ], row.names = FALSE)
  obs  dist  target
 2441 24.18  0.5590
   45 24.37 -0.5864
 1992 24.91  0.1430
 2264 25.26 -0.9035
 2522 25.27 -0.6359
 2419 25.41 -0.2128
 1530 25.66 -1.8725
  239 25.93 -0.5611
  238 26.30 -0.8890
 1520 26.40 -0.6159
 2244 26.50 -0.3327
 1554 26.57 -1.8844
 1571 26.61 -1.1337
 2154 26.62 -1.1141
   76 26.64 -0.6056
 2349 26.68 -0.1593
 1189 26.85 -1.2395
 2313 26.95 -0.2532
 2179 27.05 -1.0299
 2017 27.06  0.1399
Finally, we can calculate the average of the observed outcome for the 20 nearest neighbors, which will become our prediction of the readability score for the first observation.
mean(dist[1:20,]$target)
[1] -0.6594
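The same prediction can be reproduced with a few vectorized lines instead of the explicit loop (a sketch assuming the same baked_read object):

X   <- as.matrix(baked_read[, 1:768])            # predictors only
d1  <- sqrt(colSums((t(X[-1, ]) - X[1, ])^2))    # distances from the first observation to the rest
k20 <- order(d1)[1:20]                           # positions of the 20 nearest neighbors
mean(baked_read$target[-1][k20])                 # should reproduce the prediction above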
The observed outcome (readability score) for the first observation.
readability[1,]$target
[1] -0.3403
We can follow the same procedures to predict Recidivism in the second year after an individual's initial release from prison.
The final dataset (baked_recidivism) after pre-processing has 18111 observations and 142 predictors.
Suppose that we would like to predict the probability of Recidivism for the first individual.
The code below will calculate the Minkowski distance (with q=2) between the first individual and each of the remaining 18,110 individuals by using values of the 142 predictors in this dataset.
dist2 <- data.frame(obs = 2:18111, dist = NA, target = NA)

for(i in 1:18110){
  a <- as.matrix(baked_recidivism[1, 3:144])
  b <- as.matrix(baked_recidivism[i + 1, 3:144])
  dist2[i, ]$dist   <- sqrt(sum((a - b)^2))
  dist2[i, ]$target <- as.character(baked_recidivism[i + 1, ]$Recidivism_Arrest_Year2)
  #print(i)
}
We now rank-order the individuals from closest to the most distant and choose the 20 nearest observations.
Then, we calculate the proportion of individuals who recidivated (Yes) and who did not recidivate (No) among these 20 nearest neighbors.
dist2 <- dist2[order(dist2$dist), ]

print(dist2[1:20, ], row.names = FALSE)
   obs  dist target
  7070 6.217     No
 14204 6.256     No
  1574 6.384     No
  4527 6.680     No
  8446 7.012     No
  6024 7.251     No
  7787 7.270     No
   565 7.279    Yes
  8768 7.288     No
  4646 7.359     No
  4043 7.376     No
  9113 7.385     No
  5316 7.405     No
  4095 7.536     No
  9732 7.566     No
   831 7.634     No
 14385 7.644     No
  2933 7.660    Yes
   647 7.676    Yes
  6385 7.685    Yes
table(dist2[1:20,]$target)
 No Yes 
 16   4 
# The observed outcome for the first individual
recidivism[1, ]$Recidivism_Arrest_Year2
[1] 0
These proportions serve as the predicted probabilities of recidivating or not recidivating for the first individual.
Based on the 20 nearest neighbors, the predicted probability that the first individual recidivates within two years is 0.2 (4/20).
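Equivalently, prop.table() converts the class counts among the 20 nearest neighbors into these predicted probabilities directly:

prop.table(table(dist2[1:20, ]$target))

 No Yes 
0.8 0.2 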
In the previous section, we used a simple average of the observed outcome from K-nearest neighbors.
A simple average implies equally weighing each neighbor.
Another way of averaging the target outcome from K-nearest neighbors would be to weigh each neighbor according to its distance and calculate a weighted average.
A simple way to weigh each neighbor is to use the inverse of the distance.
For instance, consider the earlier example in which we found the 20 nearest neighbors of the first observation in the readability dataset.
We can assign a weight to each neighbor by taking the inverse of their distance and rescaling them such that the sum of the weights equals 1.
dist <- dist[order(dist$dist), ]

k_neighbors <- dist[1:20, ]

print(k_neighbors, row.names = FALSE)
  obs  dist  target
 2441 24.18  0.5590
   45 24.37 -0.5864
 1992 24.91  0.1430
 2264 25.26 -0.9035
 2522 25.27 -0.6359
 2419 25.41 -0.2128
 1530 25.66 -1.8725
  239 25.93 -0.5611
  238 26.30 -0.8890
 1520 26.40 -0.6159
 2244 26.50 -0.3327
 1554 26.57 -1.8844
 1571 26.61 -1.1337
 2154 26.62 -1.1141
   76 26.64 -0.6056
 2349 26.68 -0.1593
 1189 26.85 -1.2395
 2313 26.95 -0.2532
 2179 27.05 -1.0299
 2017 27.06  0.1399
k_neighbors$weight <- 1/k_neighbors$dist
k_neighbors$weight <- k_neighbors$weight/sum(k_neighbors$weight)

print(k_neighbors, row.names = FALSE)
  obs  dist  target  weight
 2441 24.18  0.5590 0.05382
   45 24.37 -0.5864 0.05341
 1992 24.91  0.1430 0.05225
 2264 25.26 -0.9035 0.05152
 2522 25.27 -0.6359 0.05151
 2419 25.41 -0.2128 0.05122
 1530 25.66 -1.8725 0.05072
  239 25.93 -0.5611 0.05020
  238 26.30 -0.8890 0.04949
 1520 26.40 -0.6159 0.04930
 2244 26.50 -0.3327 0.04911
 1554 26.57 -1.8844 0.04899
 1571 26.61 -1.1337 0.04892
 2154 26.62 -1.1141 0.04890
   76 26.64 -0.6056 0.04886
 2349 26.68 -0.1593 0.04878
 1189 26.85 -1.2395 0.04847
 2313 26.95 -0.2532 0.04829
 2179 27.05 -1.0299 0.04812
 2017 27.06  0.1399 0.04810
Then, we compute a weighted average of the target scores instead of a simple average.
sum(k_neighbors$target*k_neighbors$weight)
[1] -0.6526
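The manual rescaling of the weights is not strictly necessary: base R's weighted.mean() normalizes the weights internally, so the raw inverse distances give the same result.

# Weighted average of the 20 nearest targets using inverse-distance weights
weighted.mean(k_neighbors$target, w = 1/k_neighbors$dist)

[1] -0.6526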
Several kernel functions can be used to assign weights to the K-nearest neighbors:
Epanechnikov
Rectangular
Quartic
Triweight
Tricube
Gaussian
Cosine
For all of these kernels, the closest neighbors receive the highest weights and the weight decreases as the distance increases; the kernels differ slightly in how they assign these weights.
Below is a demonstration of how assigned weight changes as a function of distance for different kernel functions.
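As an illustration, two of these kernels (Epanechnikov and Gaussian) can be written as simple R functions of a distance u rescaled to the [0, 1] interval (a sketch using the standard textbook formulas):

# Standard kernel formulas; u is a rescaled distance
epanechnikov <- function(u) 0.75 * (1 - u^2) * (abs(u) <= 1)
gaussian     <- function(u) exp(-u^2/2) / sqrt(2 * pi)

# Closer neighbors (smaller u) receive larger weights under both kernels
u <- seq(0, 1, by = 0.25)
cbind(u, epanechnikov = epanechnikov(u), gaussian = gaussian(u))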
Which kernel function should we use for weighing the distance? The type of kernel function can also be considered a hyperparameter to tune.
require(caret)
require(kknn)

getModelInfo()$kknn$parameters
  parameter     class           label
1      kmax   numeric Max. #Neighbors
2  distance   numeric        Distance
3    kernel character          Kernel
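Below is a sketch of how these three hyperparameters could be tuned jointly with caret and 10-fold cross-validation on the processed readability data (the grid values are our own choices for illustration):

# Hypothetical tuning grid over the number of neighbors, the q value, and the kernel
knn_grid <- expand.grid(kmax     = c(10, 30, 50, 70),
                        distance = c(1, 2, 3),
                        kernel   = c('rectangular', 'epanechnikov', 'gaussian'))

cv <- trainControl(method = 'cv', number = 10)

knn_fit <- train(target ~ .,
                 data      = baked_read,
                 method    = 'kknn',
                 trControl = cv,
                 tuneGrid  = knn_grid)

knn_fit$bestTune   # best combination of kmax, distance, and kernel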
Building a Prediction Model for a Continuous Outcome Using the KNN Algorithm
Performance Comparison of Different Algorithms
|                   | R-square | MAE   | RMSE  |
|-------------------|----------|-------|-------|
| Linear Regression | 0.658    | 0.499 | 0.620 |
| Ridge Regression  | 0.727    | 0.432 | 0.536 |
| Lasso Regression  | 0.721    | 0.433 | 0.542 |
| Elastic Net       | 0.726    | 0.433 | 0.539 |
| KNN               | 0.611    | 0.519 | 0.648 |
Building a Classification Model for a Binary Outcome Using the KNN Algorithm
Performance Comparison of Different Algorithms
|                                        | -LL    | AUC    | ACC   | TPR   | TNR   | FPR   | PRE   |
|----------------------------------------|--------|--------|-------|-------|-------|-------|-------|
| Logistic Regression                    | 0.5096 | 0.7192 | 0.755 | 0.142 | 0.949 | 0.051 | 0.471 |
| Logistic Regression with Ridge Penalty | 0.5111 | 0.7181 | 0.754 | 0.123 | 0.954 | 0.046 | 0.461 |
| Logistic Regression with Lasso Penalty | 0.5090 | 0.7200 | 0.754 | 0.127 | 0.952 | 0.048 | 0.458 |
| Logistic Regression with Elastic Net   | 0.5091 | 0.7200 | 0.753 | 0.127 | 0.952 | 0.048 | 0.456 |
| KNN                                    | ?      | ?      | ?     | ?     | ?     | ?     | ?     |