For many machine learning practitioners and data scientists, the most confusing part of the learning journey is the difference between precision and recall.
The difference is actually easy to remember, provided you develop a clear understanding of what each term means.
Source: https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/
When building any machine learning model, we all know that achieving a good fit is important and often challenging. This involves balancing overfitting and underfitting, or in other words, managing the trade-off between bias and variance.
However, in classification there is another trade-off that is often overlooked in favour of the bias-variance trade-off: the precision-recall trade-off. Imbalanced classes occur commonly in datasets, and for certain use cases we want to give more importance to the precision and recall metrics, and to the balance between them.
We will use a KNN model to keep the prediction step simple.
First, we import the necessary libraries:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import auc
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Next, let's look at the data and the target variable:
data_file_path = '..pathToDirectory/input/heart-disease-uci/heart.csv'
data_df = pd.read_csv(data_file_path)
data_df.head()
# Before modelling, use EDA techniques to understand the data, remove outliers, and perform any other necessary cleaning
Then we scale the features and split the data into train and test sets:
y = data_df["target"].values
x = data_df.drop(["target"], axis = 1)

# Scaling - mandatory for KNN, since it is a distance-based algorithm
ss = StandardScaler()
x = ss.fit_transform(x)

# Splitting into train and test
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.3) # 70% training and 30% test
Rather than relying on trial and error, we search for the best value of k directly: the optimum k is the one that gives the highest test score.
For that, we evaluate the training and testing scores for up to 20 nearest neighbours:
train_score = []
test_score = []
k_vals = []

for k in range(1, 21):
    k_vals.append(k)
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)

    tr_score = knn.score(X_train, y_train)
    train_score.append(tr_score)

    te_score = knn.score(X_test, y_test)
    test_score.append(te_score)
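Since we imported matplotlib and seaborn at the start, a quick sketch like the following (assuming the k_vals, train_score, and test_score lists built above) can visualize how the scores change with k:

# Plot train vs. test accuracy for each value of k
plt.figure(figsize = (10, 5))
sns.lineplot(x = k_vals, y = train_score, marker = 'o', label = 'Train score')
sns.lineplot(x = k_vals, y = test_score, marker = 'o', label = 'Test score')
plt.xlabel('Number of neighbours (k)')
plt.ylabel('Accuracy')
plt.title('KNN train/test scores vs. k')
plt.show()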
The code below can be used to find the maximum test score and the values of k that achieve it:
## score that comes from the testing set only
max_test_score = max(test_score)
test_scores_ind = [i for i, v in enumerate(test_score) if v == max_test_score]
print('Max test score {} and k = {}'.format(max_test_score * 100, list(map(lambda x: x + 1, test_scores_ind))))
Thus, we have obtained the optimum value of k as 3, 11, or 20, each with a test score of 83.5. We will finalize one of these values and fit the model accordingly:
# Set up a KNN classifier with k = 3 neighbours
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
knn.score(X_test, y_test)
Now we generate the confusion matrix:
y_pred = knn.predict(X_test)
confusion_matrix(y_test,y_pred)
pd.crosstab(y_test, y_pred, rownames = ['Actual'], colnames = ['Predicted'], margins = True)
A confusion matrix helps us gain an insight into how correct our predictions were and how they hold up against the actual values.
From our train-test split, we already know that our test data consists of 91 data points; that is the value in the third row and third column (the margins) of the crosstab. We also see the actual and predicted totals. The actual values are the number of data points that were originally categorized as 0 or 1.
The predicted values are the number of data points our KNN model predicted as 0 or 1.
The actual values are:
• The patients who actually don’t have a heart disease = 41
• The patients who actually do have a heart disease = 50
The predicted values are:
• Number of patients who were predicted as not having a heart disease = 40
• Number of patients who were predicted as having a heart disease = 51
Each of the values obtained above has a standard term. Let's go over them one by one:
The cases in which the patients did not actually have heart disease and our model also predicted that they did not are called True Negatives. For our matrix, True Negatives = 33.
The cases in which the patients actually have heart disease and our model also predicted so are called True Positives. For our matrix, True Positives = 43.
However, there are some cases where the patient does not actually have heart disease, but our model predicted that they do. This kind of error is a Type I Error, and we call these values False Positives. For our matrix, False Positives = 8.
Similarly, there are some cases where the patient actually has heart disease, but our model predicted that they don't. This kind of error is a Type II Error, and we call these values False Negatives. For our matrix, False Negatives = 7.
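As a quick check, these four values can be read straight out of the sklearn confusion matrix; a minimal sketch, assuming the y_test and y_pred arrays from above:

# ravel() flattens the 2x2 matrix in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('TN = {}, FP = {}, FN = {}, TP = {}'.format(tn, fp, fn, tp))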
What is Precision?
Right, so now we come to the crux of this article. What in the world is precision? And what does all of the above have to do with it?
In the simplest terms, precision is the ratio of True Positives to all predicted positives. For our problem statement, that is the fraction of patients our model identified as having heart disease who actually have it. Mathematically:
Precision = TP / (TP + FP)
Source: https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/
What is the precision for our model? It is 43 / (43 + 8) = 0.843: when the model predicts that a patient has heart disease, it is correct around 84% of the time.
Precision also gives us a measure of how relevant our positive predictions are. It is important that we don't start treating a patient who doesn't actually have a heart ailment just because our model predicted that they do.
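sklearn can compute this directly; a minimal sketch using precision_score (an extra import beyond those listed earlier):

from sklearn.metrics import precision_score

# Precision = TP / (TP + FP)
print(precision_score(y_test, y_pred))  # ~0.843 for the matrix above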
What is Recall?
Recall is the measure of how well our model identifies True Positives: of all the patients who actually have heart disease, recall tells us how many we correctly identified as having it. Mathematically:
Recall = TP / (TP + FN)
Source: https://www.analyticsvidhya.com/blog/2020/09/precision-recall-machine-learning/
For our model, Recall = 43 / (43 + 7) = 0.86. Recall measures how well our model identifies the relevant data points; it is also referred to as Sensitivity or the True Positive Rate. What if a patient has heart disease but receives no treatment because our model predicted otherwise? That is a situation we would like to avoid!
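Again, sklearn provides this out of the box (recall_score is another extra import):

from sklearn.metrics import recall_score

# Recall = TP / (TP + FN)
print(recall_score(y_test, y_pred))  # ~0.86 for the matrix above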
The Easiest Metric to Understand – Accuracy
Finally, we come to one of the simplest metrics of all: accuracy. Accuracy is the ratio of the number of correct predictions to the total number of predictions. Can you guess what the formula for accuracy will be?
For our model, Accuracy = (TP + TN) / Total = (43 + 33) / 91 = 0.835.
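The same number falls out of sklearn's accuracy_score (again an extra import), or simply from knn.score on the test set as we saw earlier:

from sklearn.metrics import accuracy_score

# Accuracy = (TP + TN) / (TP + TN + FP + FN)
print(accuracy_score(y_test, y_pred))  # ~0.835 for the matrix above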
Using accuracy as the defining metric for our model makes sense intuitively, but more often than not it is advisable to look at precision and recall too. There can be situations where accuracy is very high but precision or recall is low. Ideally, for our model, we would like to completely avoid any situation where a patient has heart disease but our model classifies them as not having it, i.e. we should aim for high recall.
The Role of the F1-Score
Understanding accuracy made us realize that we need a trade-off between precision and recall. We first need to decide which is more important for our classification problem.
For example, for our dataset, we can consider achieving a high recall more important than a high precision: we would like to detect as many heart patients as possible. For other problems, such as classifying whether a bank customer is a loan defaulter, it is desirable to have high precision, since the bank wouldn't want to lose customers who were denied a loan based on the model's prediction that they would default.
In such cases, we use something called the F1-score. The F1-score is the harmonic mean of precision and recall:
F1-score = 2 × (Precision × Recall) / (Precision + Recall)
This is easier to work with, since instead of balancing precision and recall separately, we can just aim for a good F1-score, which is indicative of a good precision and a good recall value as well.
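A minimal sketch of computing it, both by hand from the values above and with sklearn's f1_score (an extra import):

from sklearn.metrics import f1_score

precision, recall = 0.843, 0.86
# Harmonic mean of precision and recall
print(2 * (precision * recall) / (precision + recall))  # ~0.851

# Equivalent, computed directly from the predictions
print(f1_score(y_test, y_pred))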
We can also generate all of the above metrics for our dataset using sklearn:
print(classification_report(y_test, y_pred))
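Since we imported precision_recall_curve and auc at the start, here is a minimal sketch of the precision-recall trade-off curve for the fitted KNN model (it assumes the knn, X_test, and y_test objects from above):

# Probability estimates for the positive class (heart disease = 1)
y_proba = knn.predict_proba(X_test)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_test, y_proba)
print('PR AUC:', auc(recalls, precisions))

plt.plot(recalls, precisions, marker = '.')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve for KNN')
plt.show()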
Written By: Nikesh Maurya
Reviewed By: Krishna Heroor