Handling categorical features in python - Pianalytix

Handling Categorical Features

While we work on huge data we encounter Handling categorical features in many datasets. These generally include different categories or levels associated with the observation, which is strings and should be converted to the computer to process them. Hence these are converted into integers.

They are mainly 6 techniques to handle categorical data :

One hot encoding
One hot encoding with many features
Ordinal encoding
Count or Frequency encoding
Mean encoding
Target guided encoding

One hot encoding

It is a technique where every category is consider as a feature and assigns 1 or 0. For N features there are N rows. This is a simple way to handle categorical data. The only disadvantage of using categorical data is that it has a “curse of dimensionality ” which means that there will be an increase in dimensions of data as the number of features increases. In python, we use get_dummies function.

After applying one-hot encoding to the first column, single columns are divide into many columns.

As the number of categories in a column increases, the number of columns increases while one hot encoding which makes data complex hence we can limit columns.

One hot encoding | Handling Categorical Data

It is an alternative way when the number of categories in the column are more. This is similar to one-hot encoding but the only difference is that we remove categories which are least repeat and perform one-hot encoding. It is most efficient when there are more categories.

For this, we first sort categories based on the frequency. Then we remove the least frequent category. Behind the scene, removed categories are consider as a single feature.

So first sort categories according to frequency.

Now, we remove the least repeated columns and perform one hot-encoding manually.

Here “SpiceJet” and “Multiple carriers” don’t have a separate column as they are least repeated.

While applying machine learning algorithms both come into a single category.

Ordinal encoding

It is a simple technique use when there are very few categories in a column. We simply assign a number to each category through a dictionary in python. We put the category as key and number in as the value. And we use map function to replace categories with a number.

Count or Frequency encoding | Handling Categorical Data

Count encoding or feature encoding is a technique where we replace categories with frequency in a column. This method is easy to implement and with help of this, we can assign weights to a category. we are not increasing feature space. But the only disadvantage with this method is that, if two categories have the same count then they are consider as a single category. This technique is use when we need to represent.

Mean encoding

Mean encoding is a complex way of encoding which is use to encode based on the target. Which means that it encoded based on the independent column. This is similar to count frequency. The only difference is that the independent feature of a data is use. We use this technique when we use binary classification where the output is either 0 or 1.

Mean of a category is label. This gives weights and is use for classification. There are two disadvantages in implementing this technique. Firstly if there are 2 categories with the same mean then they are consider as a single category. Another disadvantage of this technique is that we can’t implement regression.

Here we used unique() to check how many distinct categories are present in the Geography column.Then we grouped by category and took the mean of target column.

categories are then labeled with their means of target using map function.

Target guided encoding

This is similar to mean encoding but the only difference is that it is used to means are arranged in ascending order and then enumerated. We use this technique when we use binary classification where the output is either 0 or 1.Mean of a category is labelled. This gives weights and is used for classification.There are two disadvantages in implementing this technique. Firstly if there are 2 categories with the same mean then they are considered as a single category. Another disadvantage of this technique is that we can’t implement regression.

The enumerated data is in the form of a dictionary, then the labels are changed using the map function.

Conclusion

We conclude that each and every encoding technique is used in its own purpose.

One hot encoding gives best with few features, we can limit dimensions according to frequency. With ordinal encoding, we can give our own labels. Using count frequency, we use frequency to encode a category. Through mean encoding, we calculate the mean of the target variable. With target guided encoding, we encode according to ascending order of mean.

written by: Abhishek Vardhan

reviewed by: shivani yadav

If you are Interested In Machine Learning You Can Check Machine Learning Internship Program
Also Check Other Technical And Non Technical Internship Programs