Introduction: Machine Learning Model
Building an accurate model in a machine learning project becomes a challenge as data grows bigger and more complex. Training on big data may seem easy, but predicting with that model on test data is what shows how accurate it really is. On big data a model tends to cling to the complex patterns of the training set, while on small data it may only pick up a rough trend. We will go into the details of these problems and also talk about solutions.
Before that, we need to understand two simple concepts: bias and variance.
Bias is the difference between the predictions made by a model and the actual values; in other words, the error arising from the assumptions the model makes.
Variance is the error due to the model's sensitivity to fluctuations in the training data.
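For squared-error regression, these two quantities come together in the classic bias-variance decomposition of the expected prediction error (a textbook result, with sigma squared standing for the irreducible noise in the data):

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \mathrm{Bias}\big[\hat{f}(x)\big]^2 + \mathrm{Var}\big[\hat{f}(x)\big] + \sigma^2$$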
Overfitting and Underfitting
Underfitting: A machine learning model has underfitted if it is not able to capture the pattern in the data, which results in low accuracy when predicting.
It usually happens either when there is too little data to build an accurate model or when we try to fit a linear model to non-linear data. Underfitting can simply be solved by training on more data and, for non-linear data, by training a more suitable model.
Underfitting is a result of high bias and low variance.
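As a quick illustration, here is a minimal sketch of underfitting: a plain linear model fit to quadratic data, compared with a polynomial model that has enough flexibility. The synthetic dataset is purely illustrative.

```python
# A minimal sketch of underfitting: a straight line fit to quadratic data.
# The dataset is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.uniform(-3, 3, 100)).reshape(-1, 1)
y = X.ravel() ** 2 + rng.normal(scale=0.5, size=100)  # non-linear target

# A plain linear model cannot capture the curve: high bias, underfitting.
linear = LinearRegression().fit(X, y)
print("Linear R^2:", linear.score(X, y))

# Polynomial features give the model enough flexibility to fit the trend.
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("Polynomial R^2:", poly.score(X, y))
```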
Overfitting: This problem arises when a model is trained on big, complex data and clings to the details and noise in it, which in turn lowers the model's accuracy. The model builds concepts out of the noise in the training data, and when new data is presented for prediction, those concepts negatively affect its accuracy. Overfitting is also common with non-parametric and non-linear algorithms that are very flexible learners, for example decision trees and random forests.
Overfitting is a result of low bias and high variance.
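To see overfitting in action, here is a minimal sketch using a decision tree with no depth limit; the synthetic dataset and the split are illustrative assumptions. The gap between training and test accuracy is the telltale sign.

```python
# A minimal sketch of overfitting: an unconstrained decision tree memorises
# the training set. The dataset is synthetic and purely illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)

# A large gap between the two scores is the classic sign of overfitting.
print("Train accuracy:", tree.score(X_train, y_train))  # close to 1.0
print("Test accuracy: ", tree.score(X_test, y_test))    # noticeably lower
```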
Solutions
Although underfitting is easy to fix by selecting a better model and gathering more data, overfitting is the trickier one. In this section, I will go over ways to fix overfitting and their implementation with scikit-learn.
Cross-Validation: This is a method of splitting the data into samples and training the model a number of times on different groups of samples. scikit-learn has a KFold class that takes the number of splits as input and divides the data into that many folds to train and test on.
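Here is a minimal sketch of k-fold cross-validation; the logistic regression model and the synthetic dataset are placeholder assumptions, and any estimator works the same way.

```python
# A minimal sketch of k-fold cross-validation with scikit-learn's KFold.
# Model and data are placeholders; any estimator can be scored this way.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)

# One accuracy per fold; a stable mean suggests the model generalises.
print("Fold scores:", scores)
print("Mean accuracy:", scores.mean())
```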
Regularisation: This is a technique through which we make the model simpler and therefore less prone to overfitting. For regression problems, for example, there are two common regularisation techniques, L1 and L2: L1 is called Lasso and L2 is called Ridge. scikit-learn provides estimators for both.
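A minimal sketch of both estimators on a synthetic regression problem follows; the alpha value, which controls the penalty strength, is an arbitrary illustrative choice.

```python
# A minimal sketch of L1 (Lasso) and L2 (Ridge) regularisation.
# The dataset is synthetic and alpha is an arbitrary illustrative value.
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=30, noise=10.0,
                       random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)  # L1: can shrink coefficients to zero
ridge = Ridge(alpha=1.0).fit(X, y)  # L2: shrinks coefficients smoothly

print("Non-zero Lasso coefficients:", (lasso.coef_ != 0).sum())
print("Ridge R^2:", ridge.score(X, y))
```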
Feature Selection: This is a technique where only the valuable features in the data are used. Having a large number of features makes the data more complex and the model more prone to overfitting, so a solution is to keep only the most valuable features. There are many ways to approach this, for example scikit-learn's feature-selection utilities, combined with its grid search to decide how many features to keep (see the sketch below).
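Below is a minimal sketch that combines scikit-learn's SelectKBest selector with grid search to choose how many features to keep; the dataset and the candidate values of k are illustrative assumptions.

```python
# A minimal sketch of feature selection: SelectKBest inside a pipeline,
# with grid search tuning k. Data and candidate k values are illustrative.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=25, n_informative=5,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(score_func=f_classif)),
    ("model", LogisticRegression(max_iter=1000)),
])

# Treat the number of kept features as a hyperparameter to tune.
search = GridSearchCV(pipe, {"select__k": [5, 10, 15, 25]}, cv=5)
search.fit(X, y)
print("Best k:", search.best_params_["select__k"])
print("Best CV accuracy:", search.best_score_)
```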
Conclusion
Usually, real-world data is not as clean and well managed as what we find in online repositories, and overfitting is a common problem everyone has to face. Detecting it is easy: your model tends to have good accuracy on training data but not on test data or new data. These solutions are often used together in a problem as convenient, and yes, there are many other ways to approach this problem that we will look into later.
Written by: Saurav Kumar
Reviewed By: Krishna Heroor