Introduction
In machine learning classification problems, we very often deal with imbalanced datasets. By an imbalanced dataset, we mean a dataset in which one class (the minority class) has far fewer instances than the other. Let me explain this with a very common scenario: fraud detection. Suppose we have to identify fraudulent transactions from a dataset containing both fraudulent and non-fraudulent transactions.
We will observe that fraudulent transactions make up only a very small fraction of the total. In that case, a model built on this type of dataset will be biased towards the majority class, and accuracy will be misleading.
To demonstrate this, I have downloaded credit card transaction data from Kaggle to recognize fraudulent and non-fraudulent transactions. The link to the dataset is below.
Dataset Link:
https://www.kaggle.com/mlg-ulb/creditcardfraud
I have used Jupyter Notebook for analyzing and processing the dataset.
There are a total of 30 input variables and one output variable.
The output variable “Class” indicates whether the transaction is fraudulent or not.
“0” indicates Non-Fraudulent Transaction
“1” indicates Fraudulent Transaction
From figure A, it can be observed that there are 284315 non-fraudulent transactions and only 492 fraudulent transactions.
Fraud Percentage = 0.17%
Non-Fraud Percentage = 99.83%
Result
As we can observe, the percentage of the “1” category (fraudulent transactions) is very small and that of the majority class “0” (non-fraudulent) is very high. A model built from this dataset will mostly predict the majority class, and the objective of identifying fraudulent transactions will not be met. This type of imbalanced dataset should be balanced before training the machine learning algorithm. There are various resampling techniques to deal with imbalanced datasets.
One of these resampling techniques is SMOTE, which I will be discussing here. I have discussed the random undersampling and random oversampling techniques in my previous blog.
Synthetic Minority Oversampling Technique, abbreviated as SMOTE, came into the picture to deal with the problem of overfitting caused by random oversampling.
In this resampling technique, unlike random oversampling, new minority data points are generated synthetically rather than duplicated. The process involved in SMOTE is discussed below.
SMOTE PROCESS
- Identify a feature vector (a minority class instance) and its nearest minority neighbor.
- Take the difference between the feature vector and the nearest neighbor.
- Multiply this difference by a random number between 0 and 1.
- Identify a new point on the line segment by adding this product to the feature vector.
- Repeat the process for the remaining minority feature vectors.
The above figure represents the SMOTE process. In the 1st plot, minority data points are identified. In the 2nd plot, new points are generated by following the above steps.
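The interpolation step at the heart of SMOTE can be sketched in a few lines of NumPy. This is a minimal illustration of the steps listed above, not the imblearn implementation; the feature vectors here are hypothetical 2-D points.

```python
import numpy as np

def smote_point(x, neighbor, rng):
    """Create one synthetic minority point on the segment joining a
    minority feature vector x and one of its nearest minority neighbors."""
    diff = neighbor - x   # difference between the two feature vectors
    gap = rng.random()    # random number in [0, 1)
    return x + gap * diff # new point on the line segment

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])         # hypothetical minority feature vector
neighbor = np.array([3.0, 4.0])  # its nearest minority neighbor
new_point = smote_point(x, neighbor, rng)
# new_point lies somewhere on the segment between x and neighbor
```

Because the random factor is between 0 and 1, the synthetic point always falls between the two original minority points rather than duplicating either of them.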
How to perform SMOTE in Python?
Separate input features and output features
Import SMOTE function from imblearn library
Perform SMOTE
After performing SMOTE, the number of instances increased, as we can see from the figures below.
If we visualize the Class variable now, we can properly understand what SMOTE does.
From the above figure, it can be interpreted that the number of instances of the minority class “1” has increased to balance the number of instances of the majority class “0”. The new points that are added are synthetically generated points, not exact replicas of existing minority class instances. So, the problem of overfitting encountered during random oversampling is overcome in the SMOTE process.
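A bar chart of the class counts before and after resampling makes this easy to see. The "before" counts are the ones from the dataset above; the "after" counts assume SMOTE's default behavior of raising the minority class to the majority count.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Class counts before SMOTE (from the credit card dataset) and after
# (assuming the default sampling strategy balances to the majority count)
before = pd.Series({0: 284315, 1: 492})
after = pd.Series({0: 284315, 1: 284315})

fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)
before.plot.bar(ax=axes[0], title="Before SMOTE", rot=0)
after.plot.bar(ax=axes[1], title="After SMOTE", rot=0)
fig.savefig("class_balance.png")
```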
Though the SMOTE process is quite common and effective and has advantages over the random oversampling method, it too has drawbacks that cannot be ignored. The following are the drawbacks of SMOTE.
Drawbacks Of SMOTE
· High-dimensional data is not suitable for SMOTE.
· SMOTE does not consider neighboring examples from other classes, which can increase the overlap between classes and generate additional noise.
A model trained on the balanced dataset will not be biased towards the majority class, as both classes have an equal number of instances, so we can hope to achieve better performance while predicting the output class.
The working code for SMOTE is available in the GitHub repository. You can download it from the link below.
https://github.com/nabanitapaul1/SMOTE–Fraud-Detection.git
Conclusion
Here I have discussed SMOTE, the SMOTE process, how to implement SMOTE in Python using the example of a credit card fraud transaction dataset, and the drawbacks of SMOTE. I hope I could clarify the concepts lucidly.
If there are any queries, please post them in the comment section. Stay Healthy, Stay Happy.
Written By: Nabanita Paul
Reviewed By: Krishna Heroor