Handling Numeric Missing Values

Handling numeric missing values is one of the biggest challenges in the preprocessing stage, because making the right decision about how to treat them leads to more robust data models. Let us look at different methods of imputing missing values.

Methods to Handle Numeric Missing Values

To illustrate these methods, we will work with the Titanic dataset:

[Figure: first 20 rows of the dataset]

Deleting Rows:

This method is commonly used to handle null values. We either delete a particular row if it has a null value for a specific feature, or delete a particular column if more than 70-75% of its values are missing. This method is recommended only when there are enough samples in the dataset.

However, one has to make sure that removing the data does not introduce bias. Deleting data leads to loss of information, which can hurt the quality of the predictions.

[Figure: the dataset after deleting the rows]

Pros:
  • Completely removing data with missing values yields a robust model, since no approximate values are introduced.
  • Deleting a row or column that carries no specific information costs little, since it does not have a high weight.
Cons:
  • Loss of information and data.
  • It works poorly if the percentage of missing values is high (say, ≥30% of the whole dataset).
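A minimal sketch of both deletions with pandas, using a hypothetical toy frame (the column names follow the usual Titanic schema; the values are made up):

```python
import pandas as pd
import numpy as np

# Toy frame standing in for the Titanic data (hypothetical values).
df = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, 35.0, np.nan],
    "Fare":  [7.25, 71.28, 7.92, 53.10, 8.05],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Drop any row that has a null in a specific feature (here: Age).
rows_dropped = df.dropna(subset=["Age"])

# Drop any column where more than ~70% of the values are missing.
cols_to_drop = [c for c in df.columns if df[c].isna().mean() > 0.70]
cols_dropped = df.drop(columns=cols_to_drop)
```

Here `Cabin` is 80% missing, so it is the only column removed, while the row deletion keeps only the three rows with an observed `Age`.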

Replacing With Mean/Median/Mode:

This approach can be applied to a feature with numeric data, such as the age of a person or the ticket fare. We compute the mean, median, or mode of the feature and replace the missing values with it. This is only an approximation and distorts the variance of the feature, but the loss of data is avoided, which often gives better results than removing rows and columns.

Replacing missing values with one of these three statistics is thus a statistical method of handling numeric missing values. Note that if the statistic is computed over the full dataset rather than only the training split, it can leak information while training. Another way is to approximate the missing value from the deviation of nearby values; this works better if the data is linear.

[Figure: the dataset after replacing with the mean]

Assumptions of mean or median imputation:

Data is missing at random.

The missing observations most likely resemble the majority of the observations in the variable.

Pros:
  • Easy to implement.
  • Fast way of obtaining complete datasets.
  • It can sometimes be used in production, i.e., during model deployment.
Cons:
  • It changes the original variable distribution and variance.
  • It changes the covariance with the remaining dataset variables.
  • The greater the percentage of missing values, the higher the distortions.
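The three statistics can be computed and substituted with pandas; the `age` values below are made up for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical Age column with two gaps.
age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 30.0], name="Age")

mean_filled   = age.fillna(age.mean())     # mean of observed values (28.25)
median_filled = age.fillna(age.median())   # median of observed values (28.0)
mode_filled   = age.fillna(age.mode()[0])  # most frequent observed value
```

In practice, these statistics should be computed on the training split only and then reused on the test split, to avoid the leakage mentioned above.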

Random Sample Imputation:

Random sample imputation consists of taking random observations from the pool of available (non-missing) values of the variable and using those randomly selected values to fill in the missing ones.

Assumptions of random sample imputation:

Data is missing at random.

The missing values are replaced with other values drawn from the same distribution as the original variable.

Advantages of random sample imputation:
  • It is easy to implement and a fast way of obtaining complete datasets.
  • It can be used in production.
  • Preserves the variance of the variable.
Disadvantages of random sample imputation:
  • Randomness: different runs can produce different imputed datasets unless the random seed is fixed.
  • The relationship between imputed variables and other variables may be affected if there are a lot of missing values.
  • Memory use at deployment can be large, since the original training set must be stored so that replacement values can be drawn from it.
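A sketch of random sample imputation with pandas, again on made-up values; fixing `random_state` makes the draws repeatable:

```python
import pandas as pd
import numpy as np

# Hypothetical Age column with two gaps.
age = pd.Series([22.0, np.nan, 26.0, 35.0, np.nan, 30.0], name="Age")

observed   = age.dropna()                # pool of available observations
missing_ix = age.index[age.isna()]       # positions of the missing slots

# Draw one random observed value per missing slot (with replacement).
sampled = observed.sample(n=len(missing_ix), replace=True, random_state=0)
sampled.index = missing_ix               # align the draws with the gaps

filled = age.fillna(sampled)             # fillna aligns on the index
```

Every imputed value comes from the observed pool, so the variable's distribution is preserved; the observed entries are left untouched.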

Arbitrary Value Imputation:

This approach consists of replacing all occurrences of missing values (NA) within a variable with an arbitrary value. The arbitrary value should be dissimilar from the mean or median and not within the normal values of the variable.

We can use arbitrary values such as 0, 999, -999 (or different combinations of 9s) or -1 (if the distribution is positive).

Assumptions of arbitrary value imputation:

Data is not missing at random.

Advantages of arbitrary value imputation:
  • Easy to implement.
  • It’s a quick way to obtain complete datasets.
  • It can sometimes be used in production, i.e., during model deployment.
  • It captures the importance of a value being “missing”, if missingness itself is informative.
Disadvantages of arbitrary value imputation:
  • Distortion of the original variable distribution and variance.
  • Distortion of the covariance with the remaining dataset variables.
  • If the arbitrary value is at the end of the distribution, it may mask/hide or create outliers.
  • We must be careful not to choose an arbitrary value too similar to the mean or median (or any other typical value of the variable distribution).
  • The higher the percentage of NA values, the greater the distortion.
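A minimal pandas sketch, using the made-up flag value 999 together with an optional missing-indicator column:

```python
import pandas as pd
import numpy as np

# Hypothetical Age column with two gaps.
age = pd.Series([22.0, np.nan, 26.0, np.nan], name="Age")

# 999 is an arbitrary flag value far outside the normal Age range.
age_filled = age.fillna(999)

# Optionally record missingness explicitly in an indicator column,
# so the model can learn from the fact that the value was absent.
was_missing = age.isna().astype(int)
```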

Replacing with previous value – Forward fill:

In time-series data, replacing with nearby values is more suitable than replacing with the mean. This method fills the missing value with the previous value.

Replacing with next value – Backward fill:

It uses the next value to fill the missing value.

Replacing with an average of previous and next value:

In time-series data, oftentimes the average value of the previous and next value will be a better estimate of the missing value.
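All three time-series fills above can be sketched with pandas on a made-up series:

```python
import pandas as pd
import numpy as np

# Hypothetical time-series readings with two gaps.
s = pd.Series([10.0, np.nan, 14.0, np.nan, 20.0])

forward  = s.ffill()                       # previous value carried forward
backward = s.bfill()                       # next value pulled backward
averaged = s.interpolate(method="linear")  # midpoint of the two neighbours
```

With linear interpolation, each gap between 10 and 14 becomes 12, and the gap between 14 and 20 becomes 17.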

Conclusion:

Almost every dataset has some missing values that need to be dealt with, and intelligently handling them to produce robust models is a challenging task. We have gone through several ways in which nulls can be replaced. It is not necessary to handle a particular dataset in one single way: one can use different methods on different features, depending on how and what the data is about. A little domain knowledge about the data is important, as it can give you insight into how to approach the problem.

Written By: Prudvi Raj

Reviewed By: Rushikesh Lavate

