Feature Selection

The information about the target is contained in the features, and the selected features are used to define the classification function. One might therefore think that more features mean more information, and that more information leads to better discriminative or classification power. But is this always the case? In this blog, we will see that it is not: simply giving the system more information does not guarantee better classification performance.

The Curse of Dimensionality

Consider a specific case: the number of training examples is fixed and the training set is not very large. As the number of features increases, the performance of the classifier may first improve and reach a maximum, and beyond that maximum it may start to degrade.

SOURCE: https://miro.medium.com/max/300/0*fmwS1rDbutO2BWCW.png

The reason for this degradation is that some of the features may not be relevant. In algorithms such as k-nearest neighbours, these extra features introduce noise. The noisy features hamper the process of finding instances that are truly close together, and the performance of the classifier suffers severely.

There may also be redundant features; they add no new information and confuse the learner, especially when training examples and computational resources are limited. When there is an abundance of features, the hypothesis space becomes larger, so the search may take more time depending on the algorithm chosen. With limited training examples, one cannot work with such a large number of features because it leads to overfitting. Overfitting refers to the scenario in which the model learns the training data, including its noise, to such an extent that its performance on new data is negatively affected.

This phenomenon is termed the curse of dimensionality, and it degrades the learning algorithm. To overcome it, feature reduction is carried out, which is of two types: feature selection and feature extraction.

Feature Selection and Feature Extraction

In feature selection, an initial set of features is given:

F = {x1, x2, x3, …, xn}

The aim is to obtain a subset F′ ⊆ F,

F′ = {x1′, x2′, x3′, …, xm′},   m ≤ n

which maintains (or improves) the classification accuracy while reducing the classifier's complexity.

As seen with the curse of dimensionality, the classification accuracy first increases and then decreases; it may also increase and then stay flat. The objective, therefore, is to find a small number of features that maintains or improves the accuracy of classification. This is what feature selection does.

The other method of feature reduction is feature extraction. Here the original features are transformed, or projected, into a new subspace with a smaller number of dimensions. Consider the same set of initial features as in feature selection: feature extraction projects the n original features onto m < n new dimensions, whereas feature selection picks a subset of the original features.
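
As a rough illustration of this difference (a minimal sketch in Python, assuming NumPy and scikit-learn and using toy data that is not from this post): feature selection keeps a subset of the original columns, while feature extraction such as PCA projects all the original features onto a smaller number of new dimensions.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # 100 examples, n = 5 original features

# Feature selection: keep a subset of the original columns (here columns 0 and 3).
X_selected = X[:, [0, 3]]                        # still the original features, just fewer of them

# Feature extraction: project all n features onto m = 2 new dimensions.
X_extracted = PCA(n_components=2).fit_transform(X)   # new features are combinations of the old ones

print(X_selected.shape, X_extracted.shape)       # both (100, 2), but with different meanings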

How to select the features?

Consider the same feature set

F = {x1, x2, x3, …, xn}

The number of possible subsets is 2^n. It is not feasible to check every subset and see how good it is, because the number of subsets is exponential (a quick check of this blow-up follows the list below). Thus a method is required that is efficient as well as quick. The methods can be

  • Optimum Methods
  • Heuristic Methods
  • Randomised Methods
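
To get a feel for why exhaustive search is infeasible, here is the quick check mentioned above (a small Python illustration; the feature names are placeholders): even four features already give 2^4 − 1 = 15 non-empty subsets, and the count doubles with every extra feature.

from itertools import combinations

features = ["x1", "x2", "x3", "x4"]              # n = 4 features for illustration
subsets = [c for r in range(1, len(features) + 1)
           for c in combinations(features, r)]
print(len(subsets))                              # 15 = 2**4 - 1 non-empty subsets
print(2 ** 20)                                   # with n = 20 features: 1,048,576 subsets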

Optimum Methods

Optimum methods are used when the hypothesis space has enough structure that an optimal algorithm running in polynomial time exists. If this is not the case, a heuristic (greedy) algorithm or a randomised algorithm is used.

The search algorithm that explores the different feature subsets also needs some way to evaluate each subset. For this evaluation, two kinds of methods are used: supervised methods and unsupervised methods.

In unsupervised methods, the subsets are not evaluated over the training examples; the main focus is on the input, and the subset that contains the most information is picked. These methods are called filter methods.

In supervised methods, also referred to as wrapper methods, the feature subset is evaluated by using it with a learning algorithm (training on the selected subset). The error is estimated on a validation set. The flowchart below illustrates this.


SOURCE: https://www.researchgate.net/figure/The-key-differences-of-filter-and-wrapper-method-based-on-how_fig2_229704411
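
A minimal sketch of the wrapper idea described above (assuming scikit-learn; the data, the learner and the candidate subsets are illustrative placeholders): a candidate feature subset is used to train a learner, and its error is then estimated on a held-out validation set.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))                    # toy data: 200 examples, 6 features
y = (X[:, 0] + X[:, 2] > 0).astype(int)          # the target depends only on features 0 and 2

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

def wrapper_score(subset):
    """Train on the selected columns only and score on the validation set."""
    clf = LogisticRegression().fit(X_tr[:, subset], y_tr)
    return clf.score(X_val[:, subset], y_val)

print(wrapper_score([0, 2]))                     # relevant subset: high validation accuracy
print(wrapper_score([3, 4, 5]))                  # irrelevant subset: close to chance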

Feature Selection Steps


In feature selection, there is a search algorithm and an objective function, and the aim is to optimise the objective function. As can be seen in the diagram, the search algorithm selects a feature subset, the objective function scores it, and that score determines which part of the search space is explored next. Once the search is complete, the final feature subset is obtained and handed to the machine learning algorithm. In this way, a subset that is optimal with respect to the objective function is selected.

Feature Selection Algorithm

One need not use all the features if some of them are redundant; instead, a set of mutually uncorrelated features can be determined. In other words, one starts with some features, and when a new feature is considered, it is not taken if it is highly correlated with a feature already selected (which already contains that information).
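
A tiny sketch of this redundancy check (Python with NumPy; the data and the 0.95 correlation threshold are illustrative assumptions, not from this post): a candidate feature is skipped when it is highly correlated with a feature that has already been kept.

import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)       # nearly a copy of x1, i.e. redundant
x3 = rng.normal(size=200)                        # unrelated to x1
X = np.column_stack([x1, x2, x3])

selected = [0]                                   # start with feature 0 already selected
threshold = 0.95                                 # illustrative cut-off for "highly correlated"
for j in range(1, X.shape[1]):
    # correlation of candidate feature j with every feature already selected
    corrs = [abs(np.corrcoef(X[:, j], X[:, k])[0, 1]) for k in selected]
    if max(corrs) < threshold:                   # keep it only if it is not highly correlated
        selected.append(j)

print(selected)                                  # expected: [0, 2]; feature 1 is redundant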

Two heuristic frameworks are commonly used for selecting the feature subset: forward selection and backward selection. In forward selection, one starts from an empty feature set and adds features one at a time. Each remaining feature is tried, and the classification or regression error that results from adding it is estimated; the feature that gives the maximum improvement is selected. To measure this improvement, one uses the validation set, not the training set. When there is no further significant improvement, the procedure stops. It is thus a greedy algorithm.

Along the same lines, there is backward selection. Here the full feature set is taken as the starting point and features are removed one at a time. At each step, the feature whose removal gives the greatest improvement in performance is found and dropped; in other words, the feature with the smallest impact on the error is removed.
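
Both greedy searches are available off the shelf in scikit-learn's SequentialFeatureSelector (a sketch under the assumption that its cross-validated score is an acceptable stand-in for the validation-set error described above; the dataset and estimator are placeholders):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
estimator = LogisticRegression(max_iter=5000)

# Forward selection: start empty and greedily add the feature that improves the CV score most.
forward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                    direction="forward", cv=5).fit(X, y)

# Backward selection: start with all features and greedily drop the least useful one.
backward = SequentialFeatureSelector(estimator, n_features_to_select=5,
                                     direction="backward", cv=5).fit(X, y)

print(forward.get_support(indices=True))         # indices of the 5 features kept by forward search
print(backward.get_support(indices=True))        # indices of the 5 features kept by backward search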

The feature selection algorithms can be Univariate or Multivariate.

In univariate methods, each feature is considered independently of the others. Several scoring methods can be used, for example:

  • Pearson correlation coefficient
  • F-score
  • Chi-square
  • Signal-to-noise ratio
  • Mutual information

The features are ranked by importance according to the chosen method. The user selects a cut-off, and the top M features above the cut-off are kept. A univariate method therefore measures some kind of correlation between two random variables: in this case, between a specific feature and the target variable. Features with a higher correlation are retained. The correlation value lies between +1 and −1, where +1 indicates perfect positive correlation, −1 indicates perfect negative correlation, and zero means no correlation. Thus features whose correlation is close to +1 or −1 are selected, as can be seen in the figure.


SOURCE: WIKIPEDIA
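
A short sketch of this univariate ranking (Python with NumPy on an illustrative toy dataset; the cut-off M is a user choice): each feature is scored by the absolute value of its Pearson correlation with the target, and the top-M features are kept.

import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 6
X = rng.normal(size=(n, d))
y = 2 * X[:, 0] - 3 * X[:, 4] + rng.normal(scale=0.5, size=n)   # only features 0 and 4 matter

# Pearson correlation of each feature with the target.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])

M = 2                                            # user-chosen cut-off: keep the top-M features
top = np.argsort(-np.abs(corr))[:M]              # the features closest to +1 or -1
print(corr.round(2))
print(top)                                       # expected to pick features 0 and 4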

Signal to Noise Ratio

Another method worth mentioning is the signal-to-noise ratio, which measures the difference in the class means divided by the difference in the class standard deviations.

S2N(X, Y) = (μX − μY) / (σX − σY)

The larger the value of S2N, the stronger the correlation between the feature and the class.
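
As a minimal sketch on illustrative two-class data, the score can be computed per feature from the class-wise means and standard deviations. Note that the variant below uses the sum of the standard deviations in the denominator (the common Golub-style formulation), which differs slightly from the definition given above but avoids dividing by a near-zero difference.

import numpy as np

rng = np.random.default_rng(2)
X_cls = rng.normal(loc=1.0, size=(50, 4))        # class X: 50 examples, 4 features
Y_cls = rng.normal(loc=0.0, size=(50, 4))        # class Y: 50 examples, 4 features
X_cls[:, 3] = rng.normal(loc=0.0, size=50)       # make feature 3 carry no class signal

# Golub-style signal-to-noise score per feature (sum of std devs in the denominator).
s2n = (X_cls.mean(axis=0) - Y_cls.mean(axis=0)) / (X_cls.std(axis=0) + Y_cls.std(axis=0))
print(s2n.round(2))                              # informative features score far from zero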

Multivariate Selection Methods

In contrast to univariate selection methods, multivariate selection methods consider all the features together. Consider, for example, linear regression. Linear regression determines a set of weights w1, w2, …, wn, which are the coefficients of the n attributes, together with a bias value w0.

Suppose w is the weight vector of a linear classifier. The classification of a point x is determined by the linear function

wᵀx + w0

where wᵀ is the transpose of the weight vector. The learned coefficients can now be examined. Consider any term wi xi: if wi is small, the corresponding feature xi contributes little to the output y, and vice versa.

So, for example, if w = (12, 0.03, −10), the features whose coefficients are 12 and −10 contribute strongly to the output, whereas the feature whose coefficient is 0.03 contributes little and can be neglected. This is multivariate feature selection. The vector w can be obtained from any linear classifier.
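
A rough sketch of this idea (assuming scikit-learn and illustrative data; the features are standardised first so that the weights are comparable in magnitude): fit a linear classifier, inspect the learned weight vector w, and keep the features with the largest-magnitude coefficients.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 3))
# Labels generated with weights roughly (12, 0.03, -10), so feature 1 plays almost no role.
y = (12 * X[:, 0] + 0.03 * X[:, 1] - 10 * X[:, 2] + rng.normal(scale=2.0, size=300) > 0).astype(int)

X_std = StandardScaler().fit_transform(X)        # put the features on the same scale first
clf = LogisticRegression().fit(X_std, y)

w = clf.coef_.ravel()
print(w.round(2))                                # a small |w_i| means feature i contributes little
keep = np.argsort(-np.abs(w))[:2]                # keep the two largest-magnitude weights
print(keep)                                      # expected: features 0 and 2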

Written by: Aryaman Dubey

Reviewed by: Krishna Heroor

