Introduction to Decision Tree
Decision Tree is a popular supervised classification technique that is used when the target variable is discrete or categorical (having two or more classes) and the predictor variables are either categorical or numerical. It is based on multi-stage criteria and variables and is very effective as a decision-making tool. Its pictorial output is easier to understand and implement than the output of other predictive models.
Thus, a decision tree can be interpreted as a set of if-else rules for a classification problem where the target variable is discrete or categorical.
A tree has three basic elements:
- Nodes
- Branches
- Leaves
- Nodes: These are the points from which one or more branches originate. Every node contains either a condition on which the tree splits further or an output value.
- Branches: A branch joins a parent node with one of its child nodes; every node except the root is connected to its parent by a branch.
- Leaves: A node from which no branch originates is a leaf. A leaf has no child nodes and holds the class to be predicted.
The image above shows a basic representation of a tree along with its components. In this image, the internal nodes are also termed decision nodes, since a decision is made at each node and, based on that decision, the tree splits further as discussed.
Structure of a Decision Tree
A decision tree starts with a root node, where the first decision is made, and the tree splits further into decision nodes if required, based on the data. These decision nodes are eventually followed by terminal nodes, which are referred to as leaves. Every node except the terminal nodes represents one variable, and its branches represent the different categories (values) of that variable. The terminal node, however, represents the final decision or value for that route.
Concept of Homogeneity
Homogeneity can be described in two ways:
- More similar things together.
- Less dissimilar things together.
A homogeneous distribution means that similar values of the target variable are grouped together so that a concrete decision can be made, which is the ultimate goal of the decision tree.
Concept of Entropy
Entropy is the measure of the uncertainty of a random variable; it characterizes the impurity of an arbitrary collection of examples and is used to measure the impurity or heterogeneity of a node.
Entropy is calculated as:
Entropy(S) = −∑ pᵢ * log₂(pᵢ), where pᵢ is the proportion of observations in S that belong to class i.
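To see how the formula behaves, here is a minimal Python sketch (the function name `entropy` and the toy counts below are just for illustration) that computes the entropy of a node from its class counts:

```python
import math

def entropy(class_counts):
    """Shannon entropy of a node, given the count of observations in each class."""
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count == 0:
            continue  # a class with zero observations contributes nothing
        p = count / total
        result -= p * math.log2(p)
    return result

# A perfectly homogeneous node has zero entropy, while a node split evenly
# between two classes has the maximum entropy of 1 bit.
print(entropy([10, 0, 0]))  # 0.0
print(entropy([5, 5]))      # 1.0
```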
Decision Tree with an example
Let us build a decision tree based on the data given above. We have four useful input features: Rainfall, Terrain, Fertilizers, and Groundwater. Based on these, we have to predict whether the harvest is Bumper, Moderate, or Meagre.
Based on the data above, we can form some intuitions, such as:
- If Terrain is Plateau, Groundwater is No, and Fertilizers is No (not applied), then the harvest will be Meagre.
- If Terrain is Plateau, Groundwater is Yes, and Fertilizers is No (not applied), then the harvest will be Moderate.
Understanding data with the concept of Homogeneity:
We can see that for Low rainfall, the harvest is distributed as 0%, 71%, and 29% across Bumper, Meagre, and Moderate, which is a more homogeneous classification than 0% (Bumper), 50% (Meagre), and 50% (Moderate) for hilly terrains. This is because Low rainfall was able to group more of the Meagre harvests (71%) together than any category of the Terrain variable could. For the Fertilizers and Groundwater variables, the highest homogeneity that can be achieved is 67%. Thus, Rainfall is the variable best suited to classify the target variable, that is, the harvest type.
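Distributions like these come from a row-normalized cross-tabulation of a candidate variable against the target. The sketch below shows the idea with a small hypothetical table, since the article's full dataset is not reproduced here:

```python
import pandas as pd

# Hypothetical data, only to illustrate the calculation.
df = pd.DataFrame({
    "Rainfall": ["Low", "Low", "High", "Medium", "Low", "High"],
    "Harvest":  ["Meagre", "Meagre", "Bumper", "Moderate", "Moderate", "Bumper"],
})

# Share of each harvest class within each rainfall category,
# i.e. how homogeneous each group is.
print(pd.crosstab(df["Rainfall"], df["Harvest"], normalize="index"))
```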
Understanding data with the concept of Entropy:
The entropy can range from 0 to log₂3 (≈ 1.58), as we have three categories of the target variable, namely Bumper, Meagre, and Moderate, with proportions 4/20, 9/20, and 7/20. The entropy for the data is:
S = −((4/20) * log₂(4/20) + (9/20) * log₂(9/20) + (7/20) * log₂(7/20)) ≈ 1.51
Entropy will be zero when the target variable is perfectly homogeneous, that is, when all observations belong to a single class.
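The same calculation can be checked quickly in Python using the class proportions above:

```python
import math

# Proportions of Bumper, Meagre, and Moderate harvests (4, 9, and 7 out of 20).
proportions = [4/20, 9/20, 7/20]
S = -sum(p * math.log2(p) for p in proportions)
print(round(S, 2))  # ≈ 1.51
```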
Information Gain
When we use a node in the decision tree to partition the training instances into smaller subsets, the entropy changes. Information gain is the measure of this change in entropy. It can be calculated as:
Information Gain(S, V) = Entropy(S) − ∑ (Vc / V) * Entropy(Vc)
Where: Entropy(S): total entropy of the data before splitting on the node variable V
c: stands for the categories of the node variable
Vc: number of observations with category c
V: total number of observations
Entropy(Vc): entropy of the subset of observations having category c of the node variable.
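A minimal Python sketch of this calculation is shown below; the function names and the example counts are hypothetical and are not taken from the article's dataset:

```python
import math

def entropy(class_counts):
    total = sum(class_counts)
    return -sum((c / total) * math.log2(c / total) for c in class_counts if c > 0)

def information_gain(parent_counts, child_counts_per_category):
    """Information gain from splitting a node on one candidate variable.

    parent_counts: class counts at the node before the split.
    child_counts_per_category: one class-count list per category of the variable.
    """
    total = sum(parent_counts)
    weighted_child_entropy = sum(
        (sum(child) / total) * entropy(child)
        for child in child_counts_per_category
    )
    return entropy(parent_counts) - weighted_child_entropy

# Hypothetical example: a parent node with 4 Bumper, 9 Meagre, and 7 Moderate
# observations is split into two groups by some candidate variable.
print(information_gain([4, 9, 7], [[4, 2, 3], [0, 7, 4]]))
```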
Let us calculate the entropy for the different categories of the Terrain variable.
Therefore information gain is:
Similarly, the information gain for the other available variables is calculated as follows:
The variable that provides the maximum information gain is chosen to be the node. In this case, the Rainfall variable is chosen as it results in the maximum information gain of 0.42.
Therefore the decision tree can be visualized as:
This algorithm, which uses entropy and information gain, is known as the ID3 algorithm.
Steps to create a Decision Tree based on ID3 algorithm
- Calculate the initial entropy of the system based on the target variable.
- Calculate the information gain for each candidate variable and select the one with the highest information gain as the decision node.
- Repeat step 2 for each branch of the node until we end up at leaf nodes, where the remaining data is classified perfectly.
Note:
- Apart from the ID3 algorithm, there are other methods like the Gini index method, Chi-Square Automatic Interaction Detector (CHAID) method, and Reduction in Variance method that can be used to decide which variable should be chosen as a subnode.
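To make these steps concrete, here is a minimal sketch using scikit-learn on hypothetical toy data (the article's dataset is not reproduced here). Note that scikit-learn grows a CART-style binary tree rather than a strict ID3 tree, but criterion="entropy" applies the same entropy and information-gain idea, while criterion="gini" would use the Gini index mentioned in the note above:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data in the spirit of the harvest example.
data = pd.DataFrame({
    "Rainfall":    ["Low", "High", "Medium", "High", "Low", "Medium"],
    "Terrain":     ["Plateau", "Hilly", "Plateau", "Plain", "Hilly", "Plain"],
    "Fertilizers": ["No", "Yes", "No", "Yes", "No", "Yes"],
    "Groundwater": ["No", "Yes", "Yes", "Yes", "No", "No"],
    "Harvest":     ["Meagre", "Bumper", "Moderate", "Bumper", "Meagre", "Moderate"],
})

# One-hot encode the categorical predictors so the tree can split on them.
X = pd.get_dummies(data.drop(columns="Harvest"))
y = data["Harvest"]

# criterion="entropy" selects splits by information gain, as in ID3.
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned if-else rules as text.
print(export_text(tree, feature_names=list(X.columns)))
```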
Advantages of Decision Trees
- It does not require much computation to predict the output.
- Handles both continuous and categorical variables. Regression trees are used to predict continuous variables.
- Works well with missing values and incorrect values.
- It is easy to explain the structure of a decision tree.
- Doesn’t require much data pre-processing.
Disadvantages of Decision Trees
- Training often requires a large amount of time.
- The chances of overfitting are high.
- They are less appropriate for predicting numeric (continuous) values.
Written by: Chaitanya Virmani
Reviewed By: Krishna Heroor