K-Nearest Neighbours

k-Nearest Neighbours, or kNN for short, is one of the simplest supervised machine learning algorithms and can be used to solve both regression and classification problems. Let's break the ice.

Exploring kNN

As you know, in supervised learning algorithms we train the model with labeled data (input) so that it can make predictions on new data. So, let's dive more in-depth.

The k-Nearest Neighbours (kNN) algorithm determines the target value of a new data point or instance by comparing it with existing data points or instances that are close to it. The target values of the k closest instances are aggregated and taken as the target output for the new data point: typically a majority vote for classification and the mean for regression.

Now, the question is: how are the nearest neighbours (data points) identified?

The algorithm identifies the nearest neighbours based on Euclidean distance (when the data tuple comprises numeric attributes) or Hamming distance (when the data tuple comprises categorical attributes). We will discuss k-Nearest Neighbours based on Euclidean distance.

Euclidean Distance

The Euclidean distance between two tuples X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n) can be computed as:

$$\mathrm{Euclidean\_dist}(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2}$$

where x11, x12, …, x1n are the numeric attributes of X1 and x21, x22, …, x2n are the numeric attributes of X2.
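As a quick illustration, here is a minimal NumPy sketch of the formula above, using made-up attribute values (say age, loan amount, and years employed):

```python
import numpy as np

# Two tuples with the same three numeric attributes (illustrative values)
X1 = np.array([25.0, 40000.0, 2.0])
X2 = np.array([30.0, 52000.0, 5.0])

# Euclidean distance: square root of the sum of squared attribute differences
dist = np.sqrt(np.sum((X1 - X2) ** 2))
print(dist)
```

Notice how the loan-amount attribute dominates the result, which is exactly the scaling problem discussed next.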

One of the limitations of Euclidean distance is that attributes with larger ranges contribute more to the distance. For example, if the numeric attributes are age and loan amount, it can be observed that the typical age in years may range from 0 to 90, whereas the loan amount in dollars may range from 0 to several thousand.

To avoid this situation, all the numeric attributes of the tuples can be normalized before they are used for computing the Euclidean distance. Normalization takes into consideration the different measurement scales of the attributes. Min-Max normalization is a common method for this.

This method is available in sklearn.preprocessing as MinMaxScaler (the module offers several other normalization methods as well). Min-Max normalization transforms the value v of a numeric attribute A to a value v′, where

$$v' = \frac{v - \min_A}{\max_A - \min_A}$$
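Here is a minimal sketch of the formula applied by hand to a toy array; sklearn's MinMaxScaler performs the same computation:

```python
import numpy as np

# Made-up values for a numeric attribute A (say, age in years)
v = np.array([18.0, 35.0, 52.0, 90.0])

# Min-Max normalization: v' = (v - min_A) / (max_A - min_A)
v_norm = (v - v.min()) / (v.max() - v.min())
print(v_norm)  # every value now lies in [0, 1]
```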

Selecting the value of ‘k’

Typically, for binary-class cases (like Yes/No), an odd value of k is preferred to avoid tie situations (e.g. if k=2 and the 2 neighbours belong to different classes). For multi-class cases (more than 2 classes), ties can be broken by assigning a class at random or by assigning the class that occurs most frequently.

To increase the accuracy of the model, we can try different values of 'k' using GridSearchCV or RandomizedSearchCV from sklearn.model_selection. So, 'k' is the hyperparameter in kNN, and fine-tuning the value of 'k' will improve the model accuracy.
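As a rough sketch of how such a search might look (assuming training data X_train and y_train, as produced in Step 3 of the walkthrough below):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Try odd values of k from 1 to 19 with 5-fold cross-validation
param_grid = {'n_neighbors': list(range(1, 20, 2))}
grid = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
grid.fit(X_train, y_train)  # X_train, y_train assumed to exist already
print(grid.best_params_)
```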

Building a K-Nearest Neighbours model

Let us understand the algorithm with the help of the 'defaulter' dataset, which contains data about customers defaulting on loans.

Step 1: Loading the data
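A minimal sketch of this step, assuming the dataset is available as a CSV file named defaulter.csv with a 'default' target column and numeric 'balance' and 'income' columns (the file name and column names are assumptions for illustration):

```python
import pandas as pd

# Load the defaulter dataset (file name and column names are assumed)
df = pd.read_csv('defaulter.csv')
print(df.head())
```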
Step 2: Feature Engineering – Normalisation

Here, we normalize the balance and income columns in the data.
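Continuing the sketch, this can be done with sklearn's MinMaxScaler (column names as assumed above):

```python
from sklearn.preprocessing import MinMaxScaler

# Rescale both numeric attributes to [0, 1] so that neither one
# dominates the Euclidean distance
scaler = MinMaxScaler()
df[['balance', 'income']] = scaler.fit_transform(df[['balance', 'income']])
```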

Step 3: Splitting data into train and test

Here, we split the data into training and test sets.
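A sketch of the split (the 70/30 ratio and random_state are illustrative choices):

```python
from sklearn.model_selection import train_test_split

X = df[['balance', 'income']]  # features
y = df['default']              # target (assumed column name)

# Hold out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
```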

Step 4: Building the model
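A sketch of this step, using k=3 as in the discussion below:

```python
from sklearn.neighbors import KNeighborsClassifier

# Fit a kNN classifier with k=3 on the normalized training data
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
```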
Step 5: Evaluate model performance on train and test set
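Accuracy is one simple way to compare performance on the two sets; a rough sketch:

```python
from sklearn.metrics import accuracy_score

# Compare accuracy on the training and test sets
print('Train accuracy:', accuracy_score(y_train, knn.predict(X_train)))
print('Test accuracy:', accuracy_score(y_test, knn.predict(X_test)))
```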

In the above example, we chose the value k=3 in our model. However, fine-tuning 'k' will improve the performance of the model.

Conclusion

kNN can be used for both classification and regression problems, but in general, kNN is used for classification problems. Some applications of kNN are handwriting recognition, economic forecasting, data compression, etc.

The kNN algorithm is simple, easy to interpret, and works well on classification with multiple classes. But since the algorithm works with distances, kNN is sensitive to outliers, so outliers need to be handled before building a kNN model.

Written by: Ganesh Hari

Reviewed by: Savya Sachi
