K-Nearest Neighbor (KNN) Explained
In this article you will learn about one of the most popular supervised machine learning algorithms: the K-Nearest Neighbor (KNN) algorithm for classification and regression. After reading this post you should know the following:
- What is K-Nearest Neighbor Algorithm?
- KNN for Classification and Regression
- How Does The KNN Algorithm work?
- KNN Classification & Regression Algorithm Steps
- Python Implementation of a KNN Classifier
- How to Choose K?
- Weighted KNN
- Summary & Notes
- Resources & References
What is K-Nearest Neighbor Algorithm?
K-Nearest Neighbor (KNN) is a supervised learning algorithm used for both regression and classification. The KNN algorithm measures the similarity between a new data point and the available data points, and puts the new data point into the category it is most similar to. But how does it do that? Let's find out next.
How Does It Work?
The KNN algorithm stores all the available data and classifies a new data point based on similarity. This means that when a new data point appears, it can easily be assigned to a well-suited category. For example, imagine you want to buy a new phone, but first you check which brands your friends are using so you can decide. Then you buy the brand that most of your friends bought. That is KNN.
Let’s dig deeper into KNN.
KNN tries to predict the correct category for the new data point by calculating the distance between that point and all the training data points. It then selects the K points that are closest to the new data point and assigns it the most common category among those K points.
Example of KNN Classification
Let’s say you have a dataset that consists of 2 features and 2 categories (Squares and Rectangles), and you want to classify a new data point into one of the 2 available categories.
The next step is to choose the K value; K stands for the number of neighboring samples whose labels / categories you want to check.
For example, with K = 3 we will look at the closest 3 data points; this can be done using the Euclidean distance.
The Euclidean distance, as its name suggests, gives the straight-line distance between two points. Assume that (x1, y1) and (x2, y2) are two points; the Euclidean distance between them is d = √[(x2 − x1)² + (y2 − y1)²].
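As a quick illustrative snippet (using NumPy, which is not required here, just convenient), the same distance can be computed as follows:
import numpy as np
#Straight-line (Euclidean) distance between two 2D points
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])
d = np.sqrt(np.sum((p2 - p1) ** 2))   #same as np.linalg.norm(p2 - p1), gives 5.0
print(d)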
By looking at the minimum distances computed, we define our neighbors.
(Figure: Euclidean distance)
Hence, we will assign the new data point to the Rectangle category: the 3 closest points were 2 Rectangles and 1 Square, so we choose the category with the most votes, which is Rectangle. As you can see, it is that simple!
KNN Classification Algorithm Steps
- Step-1: Select the number K of neighbors.
- Step-2: Calculate the Euclidean distance between the new data point and every training data point.
- Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.
- Step-4: Among these K neighbors, count the number of data points in each category.
- Step-5: Assign the new data point to the category with the highest count (a minimal sketch of these steps in code is shown below).
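The steps above translate almost directly into code. Here is a minimal from-scratch sketch (not the scikit-learn classifier used in the next section); knn_classify and its argument names are illustrative:
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    #Step-2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    #Step-3: indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    #Step-4 and Step-5: count the categories among the neighbors and take the majority
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]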
Implementation Of a KNN Classifier
We will practice KNN classification on the MNIST dataset, and we will walk through the process step by step.
First we import our needed libraries and load the dataset.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from sklearn.neighbors import KNeighborsClassifier
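#Load the MNIST dataset: 60,000 training and 10,000 test images of handwritten digits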
(X_train, y_train), (X_test, y_test) = mnist.load_data()
The loaded dataset consists of images of shape (number of images, 28, 28). In order to use it as input to our KNN classifier, we need to flatten each image into a single vector, giving a shape of (number of images, 784), and normalize the values by dividing by 255 so they fall in the range 0 to 1.
#Flatten the data and normalize it
X_train = X_train.reshape(X_train.shape[0], -1) / 255.0
X_test = X_test.reshape(X_test.shape[0], -1) / 255.0
Choose a value for K, initialize the classifier, and then fit it on the training data.
K = 3
classifier = KNeighborsClassifier(n_neighbors = K)
classifier.fit(X_train, y_train)
Then we can predict the labels of the test set and start our evaluation.
#predict the output for the test set
y_pred = classifier.predict(X_test)
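To finish the evaluation, we can compare the predictions against the true labels, for example with scikit-learn's accuracy_score (note that prediction can be slow, since KNN defers all of its computation to query time):
from sklearn.metrics import accuracy_score

#Compare the predicted labels with the true test labels
print("Test accuracy:", accuracy_score(y_test, y_pred))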
How to Choose The K Value?
- There is no pre-defined statistical method for finding the most favorable value of K.
- Start with a trial value of K and evaluate the model.
- Choosing a small value of K leads to unstable decision boundaries.
- A larger K value is better for classification, as it smooths the decision boundaries.
- K is usually chosen as an odd number when the number of classes is 2, to avoid ties.
- A common rule of thumb is to set K close to √N, where N is the total number of samples.
- Use an error plot or accuracy plot to find the most favorable K value (see the sketch below).
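As a rough sketch of that last point, we could sweep over a few K values and plot the resulting accuracy, reusing the MNIST arrays from the example above (on the full dataset this is slow, so evaluating on a smaller subset is a reasonable shortcut):
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

#Evaluate several K values on a held-out set and plot the accuracy
k_values = list(range(1, 16, 2))
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

plt.plot(k_values, accuracies, marker="o")
plt.xlabel("K")
plt.ylabel("Accuracy")
plt.show()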
KNN for Regression
KNN can also be used for regression problems in pretty much the same way as for classification. Look at the example below.
Let’s say we want to know the point Y for X = 4.
What happens is that we look at the closest points.
For K = 1, we look at the single closest point to X = 4, which is X = 3.2 with value 6; therefore our prediction with K = 1 is Y = 6.
We do the same if we choose a different K.
At K = 2, we take the average of the values at X = 3.2 and X = 5. Therefore, our prediction will be (6 + 3) / 2 = 4.5.
At K = 3, we take the average of the values at X = 3.2, X = 5, and X = 6.
Therefore, our prediction will be (6 + 3 + 2) / 3 ≈ 3.67.
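As a short sketch, we could reproduce these numbers with scikit-learn's KNeighborsRegressor (the toy X and Y values below are taken from the example above):
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

#Toy data from the example: X values and their Y targets
X = np.array([[3.2], [5.0], [6.0]])
y = np.array([6.0, 3.0, 2.0])

for k in (1, 2, 3):
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X, y)
    print(k, reg.predict([[4.0]]))   #prints 6.0, 4.5 and about 3.67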
KNN Regression Algorithm Steps
- Select the number K of neighbors.
- Calculate the Euclidean distance between the new data point and every training data point.
- Take the K nearest neighbors according to the calculated Euclidean distances.
- Among these K neighbors, take the average of their target values.
- Assign this average as the predicted value for the new data point.
Weighted KNN
One of the many issues that affect the performance of the KNN algorithm is the choice of the hyperparameter k. If k is too small, the algorithm would be more sensitive to data points which are outliers. If the value of k is too large, then the neighborhood may include too many points from other classes.
To overcome this disadvantage, weighted KNN is used.
In weighted KNN, each of the K nearest points is assigned a weight: points that are nearby get a higher weight, and points that are farther away get a lower weight.
The simplest function used for this is the inverse distance function (1 / Euclidean distance), which implies that as the distance increases the weight decreases, and as the distance decreases the weight increases.
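In scikit-learn this behavior is available out of the box: passing weights="distance" to KNeighborsClassifier weights each neighbor by the inverse of its distance, whereas the default weights="uniform" gives every neighbor an equal vote. Reusing the MNIST arrays from earlier as a sketch:
from sklearn.neighbors import KNeighborsClassifier

#Neighbors vote with weight 1 / distance instead of equal votes
weighted_classifier = KNeighborsClassifier(n_neighbors=3, weights="distance")
weighted_classifier.fit(X_train, y_train)
y_pred_weighted = weighted_classifier.predict(X_test)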
Summary and Notes
- K-Nearest Neighbor is one of the simplest machine learning algorithms based on the supervised learning technique.
- The KNN algorithm measures the similarity between the new case/data and the available cases and puts the new case into the category it is most similar to.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs its computation at classification time.
- At the training phase the KNN algorithm just stores the dataset, and when it gets new data it classifies it into the category it is most similar to.
- The KNN algorithm can also be used for imputing missing values of both categorical and continuous variables (see the sketch below).
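For numerical features, scikit-learn ships a ready-made KNNImputer that fills a missing entry with the mean of that feature over the nearest neighboring rows; a minimal sketch (the toy matrix below is made up for illustration):
import numpy as np
from sklearn.impute import KNNImputer

#Replace the NaN with the mean of that feature over the 2 nearest rows
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))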
Resources & References
- KNN classifier implementation code on Github
- Documentation for KNN on the scikit-learn website, which covers its parameters and how to use it.
- You can read more about available modifications of KNN, such as weighted KNN, from here.
- Other distance measures, such as Hamming, Manhattan, or Minkowski, can also be used.