Random Forest

Prerequisite: Decision Tree Algorithm

INTRODUCTION

The Random Forest algorithm is a supervised learning algorithm capable of solving both classification and regression problems.

It is a classic example of ensemble learning based on the bagging technique.

Bagging stands for Bootstrap Aggregating: different models are trained on random samples drawn from a single training dataset (this sampling step is known as bootstrapping), and the results of those models are then aggregated, which boosts accuracy and reduces variance.

In the case of Random Forest, the algorithm builds many decision trees and aggregates their results to produce the final prediction.

The greater the number of decision trees used during the training phase, the higher the accuracy tends to be, and the less the model suffers from the overfitting of any single decision tree.

WORKING OF RANDOM FOREST ALGORITHM

Random Forest works as a bagging ensemble and operates in two phases.

The first phase is to create N decision trees from the training dataset.

The second phase is to collect and merge the predictions from each decision tree created in the first phase of the algorithm.

In general, the steps can be specified as:

  1. Set the number of decision trees to be formed, i.e. N.
  2. Randomly select M data points (with replacement) from the training dataset.
  3. Build a decision tree from those M data points, and repeat until all N decision trees are formed.
  4. To make a prediction, take a data point and feed it to every decision tree created earlier, collecting the output of each tree.
  5. Use averaging (for regression problems) or majority voting (for classification problems) to combine those outputs into the final prediction of the Random Forest algorithm, as in the sketch below.
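A minimal sketch of this bagging-and-voting workflow, written from scratch around scikit-learn's DecisionTreeClassifier (X, y, and x_new are placeholder NumPy arrays; a real random forest also samples a random subset of features at each split, which this sketch omits for simplicity):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def random_forest_predict(X, y, x_new, n_trees=10, sample_size=None, seed=0):
    rng = np.random.default_rng(seed)
    sample_size = sample_size or len(X)
    votes = []
    for _ in range(n_trees):
        # Steps 2-3: draw a bootstrap sample of M points and fit one decision tree
        idx = rng.integers(0, len(X), size=sample_size)
        tree = DecisionTreeClassifier().fit(X[idx], y[idx])
        # Step 4: collect this tree's prediction for the new data point
        votes.append(tree.predict([x_new])[0])
    # Step 5: majority voting for classification
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]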

HYPERPARAMETERS IN RANDOM FOREST ALGORITHM

Hyperparameters are used to increase the accuracy of a random forest model or to make it train and predict faster. In this tutorial we will discuss the hyperparameters available in Python's scikit-learn library.

1. Increasing the Accuracy 

We will set the parameter n_estimators, which is simply the number of decision trees generated by the random forest algorithm.

The more decision trees there are, the higher the accuracy tends to be, but the slower the computation.

Another hyperparameter is criterion, which sets the function used to find the best split at each node; it can be either 'gini' (Gini impurity) or 'entropy'.

We can also limit how deep each tree grows by using the max_depth parameter.

Also, there are several other parameters that you can tune to make your model the best fit. You can learn about them in the scikit-learn documentation.
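As an illustration, these accuracy-related hyperparameters can be passed to scikit-learn's RandomForestClassifier as shown here (the values are arbitrary examples, not recommendations):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_estimators=200,     # number of decision trees in the forest
    criterion="entropy",  # split-quality measure: "gini" or "entropy"
    max_depth=8,          # maximum depth allowed for each tree
)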

2. Increasing the speed of the model 

We can also specify parameters that make our model run faster, such as:

n_jobs specifies the number of jobs allowed to be run in parallel.

Another hyperparameter is bootstrap, which sets whether bootstrap samples are used when building the decision trees during the training phase.
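A small sketch of these speed-related settings on RandomForestClassifier (again, the values are only examples):

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(
    n_jobs=-1,       # run tree building and prediction on all available CPU cores
    bootstrap=True,  # draw bootstrap samples when building each tree
)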

IMPLEMENTATION OF RANDOM FOREST IN PYTHON

We will use the Random Forest algorithm to predict survivors of the Titanic, using the Titanic dataset. We are given attributes of the people aboard the Titanic and have to predict whether each person survived, so without wasting any time let's get started.

Now we will load our dataset using the pandas read_csv method.
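A minimal sketch of this step, assuming the Titanic data is saved locally as "titanic.csv":

import pandas as pd

# Load the Titanic dataset into a DataFrame and preview the first rows
df = pd.read_csv("titanic.csv")
print(df.head())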


Now we will get some information about the attributes of the dataset. 
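For example, using pandas:

# Column types, non-null counts, and basic statistics of the dataset
df.info()
print(df.describe())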



Now we will graphically visualize the distribution of survivors, relate it to various factors, and determine which factors matter most for survival.
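One possible way to do this with seaborn (column names assume the standard Titanic dataset, e.g. "Survived", "Sex", and "Pclass"):

import seaborn as sns
import matplotlib.pyplot as plt

# Count of survivors split by sex and by passenger class
sns.countplot(x="Survived", hue="Sex", data=df)
plt.show()
sns.countplot(x="Survived", hue="Pclass", data=df)
plt.show()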


Now we will check for null values and handle them if any are present in the dataset.
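A quick way to check this in pandas:

# Number of null values in each column
print(df.isnull().sum())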


Here the Age, Cabin, and Embarked columns contain null values, so we have to handle them. To do this we will use the drop and fillna methods accordingly.
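A sketch of this cleaning step (the exact choices are assumptions: drop the sparse Cabin column, fill Age with the median, and fill Embarked with the most common value):

# Drop the mostly-empty Cabin column
df = df.drop(columns=["Cabin"])
# Fill missing ages with the median age
df["Age"] = df["Age"].fillna(df["Age"].median())
# Fill missing embarkation ports with the most frequent value
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])
print(df.isnull().sum())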


Now we will look at the graphical relationship between the features of the dataset and the output variable 'Survived'.
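One way to inspect these relationships is a correlation heatmap over the numeric columns (the column names below, such as SibSp, Parch, and Fare, assume the standard Titanic dataset):

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation between numeric features and the Survived column
cols = ["Survived", "Pclass", "Age", "SibSp", "Parch", "Fare"]
sns.heatmap(df[cols].corr(), annot=True)
plt.show()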



Next, we will import the random forest classifier along with the other libraries we need.
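For example:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_curve, roc_auc_score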

Then we will perform some transformations before feeding our data into the random forest classifier, since the model needs numeric inputs.
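A sketch of one possible encoding (assuming the standard Titanic columns Sex, Embarked, Name, Ticket, and PassengerId):

# Encode the categorical columns as numbers
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})
df = pd.get_dummies(df, columns=["Embarked"], drop_first=True)

# Drop identifier/text columns that are not useful as features
df = df.drop(columns=["Name", "Ticket", "PassengerId"], errors="ignore")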

Now we will split the dataset into training and test sets, fit our random forest model on the training set, and store the model's predictions in the y_pred variable.
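Continuing from the imports above, a sketch of this step (the split ratio and hyperparameter values are example choices):

# Separate features and target
X = df.drop(columns=["Survived"])
y = df["Survived"]

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the random forest and predict on the test set
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)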

Then we will evaluate the predictions by printing the accuracy score and the F1 score.
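For example:

print("Accuracy:", accuracy_score(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))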



Now we will plot the ROC curve for our model and also calculate its ROC-AUC score.
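Continuing from the model trained above, a sketch of this step based on the predicted probability of survival:

import matplotlib.pyplot as plt

# Probability of the positive class (Survived = 1) for each test sample
y_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_prob)

plt.plot(fpr, tpr, label=f"ROC-AUC = {roc_auc_score(y_test, y_prob):.3f}")
plt.plot([0, 1], [0, 1], linestyle="--")  # chance-level reference line
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()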

 



This is all for the random forest classifier implementation. 

ADVANTAGES OF RANDOM FOREST ALGORITHM

  • It often produces better accuracy and predictions than simpler models such as logistic regression and the Naive Bayes classifier.
  • It works efficiently on large datasets.
  • It needs very little preprocessing of the dataset.
  • We can also make it work faster by tuning its hyperparameters.
  • It also works well in the case of high-dimensional datasets.
  • It can also handle imbalanced datasets.

DISADVANTAGES OF RANDOM FOREST ALGORITHM

  • It takes plenty of time to predict results, since every tree in the forest must be evaluated.
  • It can also overfit on some noisy datasets.
  • It can sometimes produce biased predictions in the case of categorical variables.
  • It requires a large amount of memory to train and to predict the output.

APPLICATIONS OF RANDOM FOREST ALGORITHM

  • Fraud transaction prediction
  • Medical and drugs sector
  • Recommendation tasks in the eCommerce sector
  • Stock price prediction
  • Customer segmentation tasks

CONCLUSION

In this tutorial, we learned about the Random Forest algorithm: how it works, the role of bagging in it, and its main features and hyperparameters.

We also implemented a random forest classifier to predict Titanic survivors and plotted its ROC curve.

We also discussed its advantages, disadvantages, and applications. 

Written By: Mohit Kumar

Reviewed By: Savya Sachi

