K-Nearest Neighbor (KNN) Explained
In this article you will learn about one of the most popular supervised machine learning algorithms: the K-Nearest Neighbor (KNN) algorithm for classification and regression. After reading this post you should know the following:
- What is K-Nearest Neighbor Algorithm?
- KNN for Classification and Regression
- How Does The KNN Algorithm work?
- KNN Classification & Regression Algorithm Steps
- Python Implementation of a KNN Classifier
- How to Choose K?
- Weighted KNN
- Summary & Notes
- Resources & References
What is K-Nearest Neighbor Algorithm?
K-Nearest Neighbor (KNN) is a supervised learning algorithm used for both regression and classification. The KNN algorithm measures the similarity between a new data point and the available data points, and puts the new data point into the category it is most similar to. But how does it do that? Let's find out next.
How Does It Work?
The KNN algorithm stores all the available data and classifies a new data point based on similarity. This means that when a new data point appears, it can easily be assigned to a well-suited category. For example, imagine you want to buy a new phone, but first you check which brands your friends are using so you can decide. Then you buy the brand that most of your friends bought. That is KNN.
Let’s dig deeper into KNN.
KNN tries to predict the correct category for the new data point by calculating the distance between that point and all the training data points. It then selects the K points that are closest to the new data point and assigns it the most common category among those K points.
Example of KNN Classification
Let’s say you have a dataset that consists of 2 features and 2 categories (Squares and Rectangles), and you want to classify a new data point into one of the 2 available categories.
The next step is to choose the K value; K stands for the number of neighboring samples whose labels / categories you want to check.
For example, with K = 3 we will look at the closest 3 data points; this can be done using the Euclidean distance.
The Euclidean distance, as its name suggests, gives the straight-line distance between two points. Assume that (x1, y1) and (x2, y2) are two points; the Euclidean distance between them is d = √[(x2 − x1)² + (y2 − y1)²].
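As a quick illustrative snippet (using NumPy, which is not required here, just convenient), the same distance can be computed as follows:
import numpy as np
#Straight-line (Euclidean) distance between two 2D points
p1 = np.array([1.0, 2.0])
p2 = np.array([4.0, 6.0])
d = np.sqrt(np.sum((p2 - p1) ** 2))   #same as np.linalg.norm(p2 - p1), gives 5.0
print(d)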
By looking at the minimum distances computed, we define our neighbors.
(Figure: Euclidean distance)
Hence, we will assign the new data point to the Rectangle category: the 3 closest points were 2 Rectangles and 1 Square, so we choose the category with the most votes, which is Rectangle. As you can see, it is that simple!
KNN Classification Algorithm Steps
- Step-1: Select the number K of neighbors.
- Step-2: Calculate the Euclidean distance between the new data point and every training data point.
- Step-3: Take the K nearest neighbors according to the calculated Euclidean distances.
- Step-4: Among these K neighbors, count the number of data points in each category.
- Step-5: Assign the new data point to the category with the highest count (a minimal sketch of these steps in code is shown below).
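The steps above translate almost directly into code. Here is a minimal from-scratch sketch (not the scikit-learn classifier used in the next section); knn_classify and its argument names are illustrative:
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    #Step-2: Euclidean distance from the new point to every training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    #Step-3: indices of the K nearest neighbors
    nearest = np.argsort(distances)[:k]
    #Step-4 and Step-5: count the categories among the neighbors and take the majority
    return Counter(np.asarray(y_train)[nearest]).most_common(1)[0][0]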
Implementation Of a KNN Classifier
We will practice KNN classification on the MNIST dataset, and we will walk through the process step by step.
First we import our needed libraries and load the dataset.
import numpy as np
import matplotlib.pyplot as plt
from keras.datasets import mnist
from sklearn.neighbors import KNeighborsClassifier
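#Load the MNIST dataset: 60,000 training and 10,000 test images of handwritten digits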
(X_train, y_train), (X_test, y_test) = mnist.load_data()
The loaded dataset consists of images of shape (number of images, 28, 28). In order to use it as input to our KNN classifier, we need to flatten each image into a single vector, giving a shape of (number of images, 784), and normalize the values by dividing by 255 so they fall in the range 0 to 1.
#Flatten the data and normalize it
X_train = X_train.reshape(X_train.shape[0], -1) / 255.0
X_test = X_test.reshape(X_test.shape[0], -1) / 255.0
Choose a value for K, initialize the classifier, and then fit it on the training data.
K = 3
classifier = KNeighborsClassifier(n_neighbors = K)
classifier.fit(X_train, y_train)
Then we can predict the labels of the test set and start our evaluation.
#predict the output for the test set
y_pred = classifier.predict(X_test)
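To finish the evaluation, we can compare the predictions against the true labels, for example with scikit-learn's accuracy_score (note that prediction can be slow, since KNN defers all of its computation to query time):
from sklearn.metrics import accuracy_score

#Compare the predicted labels with the true test labels
print("Test accuracy:", accuracy_score(y_test, y_pred))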
How to Choose The K Value?
- There is no pre-defined statistical method for finding the most favorable value of K.
- Start with a trial value of K and evaluate the model.
- Choosing a small value of K leads to unstable decision boundaries.
- A larger K value is better for classification, as it smooths the decision boundaries.
- K is usually chosen as an odd number when the number of classes is 2, to avoid ties.
- A common rule of thumb is to set K close to √N, where N is the total number of samples.
- Use an error plot or accuracy plot to find the most favorable K value (see the sketch below).
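As a rough sketch of that last point, we could sweep over a few K values and plot the resulting accuracy, reusing the MNIST arrays from the example above (on the full dataset this is slow, so evaluating on a smaller subset is a reasonable shortcut):
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

#Evaluate several K values on a held-out set and plot the accuracy
k_values = list(range(1, 16, 2))
accuracies = []
for k in k_values:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    accuracies.append(model.score(X_test, y_test))

plt.plot(k_values, accuracies, marker="o")
plt.xlabel("K")
plt.ylabel("Accuracy")
plt.show()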
KNN for Regression
KNN can also be used for regression problems in pretty much the same way as for classification. Look at the example below.
Let’s say we want to know the point Y for X = 4.
What happens is that we look at the closest points.
For K = 1, we look at the single closest point to X = 4, which is X = 3.2 with value 6; therefore our prediction with K = 1 is Y = 6.
We do the same if we choose a different K.
At K = 2, we take the average of the values at X = 3.2 and X = 5. Therefore, our prediction will be (6 + 3) / 2 = 4.5.
At K = 3, we take the average of the values at X = 3.2, X = 5, and X = 6.
Therefore, our prediction will be (6 + 3 + 2) / 3 ≈ 3.67.
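As a short sketch, we could reproduce these numbers with scikit-learn's KNeighborsRegressor (the toy X and Y values below are taken from the example above):
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

#Toy data from the example: X values and their Y targets
X = np.array([[3.2], [5.0], [6.0]])
y = np.array([6.0, 3.0, 2.0])

for k in (1, 2, 3):
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X, y)
    print(k, reg.predict([[4.0]]))   #prints 6.0, 4.5 and about 3.67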
KNN Regression Algorithm Steps
- Select the number K of neighbors.
- Calculate the Euclidean distance between the new data point and every training data point.
- Take the K nearest neighbors according to the calculated Euclidean distances.
- Among these K neighbors, take the average of their target values.
- Assign this average as the predicted value for the new data point.
Weighted KNN
One of the many issues that affect the performance of the KNN algorithm is the choice of the hyperparameter k. If k is too small, the algorithm would be more sensitive to data points which are outliers. If the value of k is too large, then the neighborhood may include too many points from other classes.
To overcome this disadvantage, weighted KNN is used.
In weighted KNN, each of the K nearest points is assigned a weight: points that are nearby get a higher weight, and points that are farther away get a lower weight.
The simplest function used for this is the inverse distance function (1 / Euclidean distance), which implies that as the distance increases the weight decreases, and as the distance decreases the weight increases.
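In scikit-learn this behavior is available out of the box: passing weights="distance" to KNeighborsClassifier weights each neighbor by the inverse of its distance, whereas the default weights="uniform" gives every neighbor an equal vote. Reusing the MNIST arrays from earlier as a sketch:
from sklearn.neighbors import KNeighborsClassifier

#Neighbors vote with weight 1 / distance instead of equal votes
weighted_classifier = KNeighborsClassifier(n_neighbors=3, weights="distance")
weighted_classifier.fit(X_train, y_train)
y_pred_weighted = weighted_classifier.predict(X_test)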
Summary and Notes
- K-Nearest Neighbor is one of the simplest machine learning algorithms based on the supervised learning technique.
- The KNN algorithm measures the similarity between the new case/data and the available cases and puts the new case into the category it is most similar to.
- It is also called a lazy learner algorithm because it does not learn from the training set immediately; instead, it stores the dataset and performs its computation at classification time.
- At the training phase the KNN algorithm just stores the dataset, and when it gets new data it classifies it into the category it is most similar to.
- The KNN algorithm can also be used for imputing missing values of both categorical and continuous variables (see the sketch below).
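For numerical features, scikit-learn ships a ready-made KNNImputer that fills a missing entry with the mean of that feature over the nearest neighboring rows; a minimal sketch (the toy matrix below is made up for illustration):
import numpy as np
from sklearn.impute import KNNImputer

#Replace the NaN with the mean of that feature over the 2 nearest rows
X = np.array([[1.0, 2.0], [3.0, np.nan], [5.0, 6.0], [8.0, 8.0]])
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))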
Resources & References
- KNN classifier implementation code on Github
- Documentation for KNN on the scikit-learn website, which covers its parameters and how to use it.
- You can read more about available modifications of KNN, such as weighted KNN, from here.
- Other distance measures, such as Hamming, Manhattan, or Minkowski, can also be used.