Hypothesis testing is one of the most important basic concepts in statistics. Anyone starting out in data science needs it, and it is used in many other fields because it gives a principled way to check a claim or an assumption against data.
Let’s take an example. We have all used sanitizers for the past 6 months because of COVID. Sanitizer brands claim that their product kills up to 99% of viruses. How can they make this claim? Is there a technique they use to back up what they advertise? Yes, they use hypothesis testing. As mentioned above, hypothesis testing is mainly used to assess a claim or an assumption.
Table of contents
- What is Hypothesis testing?
- Null hypothesis and alternative hypothesis
- One tailed and two tailed hypothesis testing
- Level of confidence
- Level of significance
- Type I and type II error
- P-value
What is Hypothesis testing?
Hypothesis testing gives us a way of using samples to test whether statistical claims are likely to be true.
A hypothesis is an assumption we make about a population parameter. This assumption may turn out to be right or wrong, and hypothesis testing is the procedure we use to check it against sample data.
For example, “The IPL 2020 title will be won by the RCB team” is just an assumption that we make based on the number of matches won and the net run rate. We can test this statement using the IPL match dataset.
Null Hypothesis and Alternative Hypothesis
The null hypothesis is the default position, the statement we assume to be true until the data give us enough evidence against it. It usually states that there is no effect, no difference, or no relationship.
In simple terms, the null hypothesis is the statement we keep unless the test tells us to reject it. For example, “The coin is fair”. The null hypothesis is denoted by H0.
H0: The coin is fair
The alternative hypothesis is the statement that contradicts the null hypothesis, and it is what we conclude when the null hypothesis is rejected. Together, the null and alternative hypotheses frame the assumption we make about the population parameter. The alternative hypothesis is denoted by H1.
For example, “The coin is biased towards heads”.
H1: The coin is biased towards heads
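To make the coin example concrete, here is a minimal sketch using only the Python standard library. The data (60 heads in 100 flips) and the function name `binomial_upper_tail` are hypothetical, chosen only for illustration. The function computes the probability of seeing at least that many heads if H0 (the coin is fair) were true, which is the p-value discussed later in this article.

```python
from math import comb


def binomial_upper_tail(heads: int, flips: int, p_heads: float = 0.5) -> float:
    """Return P(X >= heads) for X ~ Binomial(flips, p_heads)."""
    return sum(
        comb(flips, k) * p_heads**k * (1 - p_heads) ** (flips - k)
        for k in range(heads, flips + 1)
    )


# Hypothetical experiment: 60 heads observed in 100 flips.
# H0: the coin is fair (p_heads = 0.5)
# H1: the coin is biased towards heads (p_heads > 0.5)
p_value = binomial_upper_tail(heads=60, flips=100)
print(f"P(at least 60 heads | fair coin) = {p_value:.4f}")
```

For these made-up numbers the probability comes out at roughly 0.028, so a result this lopsided would be fairly unusual for a fair coin.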
Let’s understand this with an example.
During this pandemic, we have all used sanitizers for the past 6 months. Sanitizer brands claim that their product kills up to 99% of viruses. How can such a claim be tested? Let’s formulate it using the null and alternative hypotheses.
Null hypothesis H0: The sanitizer kills 99% of viruses (on average)
Alternative hypothesis H1: The sanitizer kills less than 99% of viruses (on average)
One tailed and two tailed hypothesis testing
We formulate the statement to be tested using the null and alternative hypotheses. If the alternative hypothesis allows the parameter to differ in both directions (less than or greater than) from the value specified in the null hypothesis, the test is called a two-tailed hypothesis test.
For example: if H0: μ = 1000 and H1: μ ≠ 1000, where μ is the population parameter (for example, the mean), then H1 covers values both less than 1000 and greater than 1000. So this is a two-tailed hypothesis.
If the alternative hypothesis allows only one direction (either less than or greater than) from the value specified in the null hypothesis, the test is called a one-tailed hypothesis test.
For example: if H0: μ = 1000 and H1: μ > 1000, then H1 covers only values greater than 1000. So this is a one-tailed hypothesis.
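The difference between the two kinds of test shows up in how the tail probability is computed. The sketch below assumes a z-test with a known population standard deviation. The sample numbers (mean 1010, standard deviation 50, 100 observations) and the function name `z_test_p_value` are assumptions made for illustration, not part of the original example.

```python
from math import sqrt
from statistics import NormalDist


def z_test_p_value(sample_mean, mu0, sigma, n, tail="two-sided"):
    """Tail probability of a z-test for H0: mu = mu0 with known sigma."""
    z = (sample_mean - mu0) / (sigma / sqrt(n))
    std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1
    if tail == "greater":      # H1: mu > mu0 (one-tailed, right tail)
        return 1 - std_normal.cdf(z)
    if tail == "less":         # H1: mu < mu0 (one-tailed, left tail)
        return std_normal.cdf(z)
    if tail == "two-sided":    # H1: mu != mu0 (both tails)
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'greater', 'less' or 'two-sided'")


# Hypothetical sample: mean 1010 from n = 100 observations, sigma = 50, so z = 2.0.
one_tailed = z_test_p_value(1010, 1000, 50, 100, tail="greater")
two_tailed = z_test_p_value(1010, 1000, 50, 100, tail="two-sided")
print(f"one-tailed: {one_tailed:.4f}, two-tailed: {two_tailed:.4f}")
```

With the same data, the two-tailed probability is twice the one-tailed one, because deviations in both directions count as evidence against H0.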
Critical region
To test a hypothesis, the entire sample space is partitioned into two regions:
- Critical region or rejection region
- Non-rejection region
1. Critical region or rejection region
This is the region of the sample space in which, if the calculated value of the test statistic falls there, we reject the null hypothesis. It is called the critical region or rejection region.
2. Non-rejection region
This is the region of the sample space in which, if the calculated value of the test statistic falls there, we do not reject the null hypothesis. It is called the non-rejection region.
Test statistic: a statistic computed from the sample that is used to decide whether to reject the null hypothesis.
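As a rough illustration of the two regions, the sketch below assumes the test statistic is a z-score that follows a standard normal distribution under H0, and it uses a 5% cut-off by default. The helper names `critical_values` and `in_critical_region` are hypothetical.

```python
from statistics import NormalDist


def critical_values(alpha=0.05, tail="two-sided"):
    """Return (lower, upper) boundaries of the critical region for a z-statistic.

    A boundary of None means there is no rejection region on that side.
    """
    std_normal = NormalDist()
    if tail == "two-sided":
        z = std_normal.inv_cdf(1 - alpha / 2)
        return (-z, z)
    if tail == "greater":
        return (None, std_normal.inv_cdf(1 - alpha))
    if tail == "less":
        return (std_normal.inv_cdf(alpha), None)
    raise ValueError("tail must be 'greater', 'less' or 'two-sided'")


def in_critical_region(z, alpha=0.05, tail="two-sided"):
    """True if the test statistic z falls in the rejection region."""
    lower, upper = critical_values(alpha, tail)
    below = lower is not None and z < lower
    above = upper is not None and z > upper
    return below or above


print(critical_values())                        # two-tailed boundaries, about ±1.96
print(in_critical_region(1.8))                  # two-tailed test
print(in_critical_region(1.8, tail="greater"))  # one-tailed test, same statistic
```

Note that the same test statistic can land in the non-rejection region of a two-tailed test and in the critical region of a one-tailed test, since the one-tailed test puts the whole rejection area on one side.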
Level of confidence
As the name suggests, the level of confidence describes how confident we are in the decision we take. It is the probability of not rejecting the null hypothesis when it is actually true. It is conventionally set to 95%, and stricter settings such as 99% are used when wrongly rejecting the null hypothesis would be costly.
Level of significance
The level of significance is the probability threshold below which we reject the null hypothesis: if the p-value falls below it, H0 is rejected.
No test can be 100% accurate, so we have to tolerate a small chance of wrongly rejecting a true null hypothesis. The level of significance is usually set to 5%.
It is denoted by α, and the formula is α = 1 - confidence level.
Type I and Type II Error
When we test the null hypothesis against the alternative hypothesis, there are four possible outcomes.
1. The null hypothesis (H0) is not rejected when it is true
2. The null hypothesis (H0) is rejected when it is true [type I error]
3. The null hypothesis (H0) is not rejected when it is false [type II error]
4. The null hypothesis (H0) is rejected when it is false
Type I error
We reject the null hypothesis even though it was true.
α = P(type I error)
α = P(reject H0 | H0 is true)
Type II error
We fail to reject the null hypothesis even though it was false.
β = P(type II error)
β = P(do not reject H0 | H0 is false)
Let’s understand with an example
A person is stopped by the police for not wearing a helmet, but he has all his documents, such as his license. The police have to decide whether the person is innocent or guilty.
H0: The person is innocent
H1: The person is not innocent, that is, he is guilty
A type I error occurs if the police convict the person [reject H0] although the person was innocent and had all the documents [H0 is true].
A type II error occurs if the police release the person [do not reject H0] although the person is guilty of not wearing a helmet [H1 is true].
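Both error rates can also be estimated by simulation. The sketch below is illustrative only: it assumes a one-tailed z-test of H0: μ = 1000 against H1: μ > 1000 on normally distributed data with a known standard deviation, and every number in it (sample size, standard deviation, the true mean of 1020 used for the "H0 is false" case) is a made-up assumption. The type II error rate is the quantity often written as β.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_mean, mu0=1000.0, sigma=50.0, n=25,
                   alpha=0.05, trials=5000, seed=0):
    """Fraction of simulated experiments in which H0: mu = mu0 is rejected."""
    rng = random.Random(seed)                 # fixed seed for repeatability
    z_crit = NormalDist().inv_cdf(1 - alpha)  # one-tailed critical value
    standard_error = sigma / sqrt(n)
    rejections = 0
    for _ in range(trials):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        z = (fmean(sample) - mu0) / standard_error
        if z > z_crit:
            rejections += 1
    return rejections / trials


# H0 is true (true mean equals mu0): every rejection is a type I error.
type_1_rate = rejection_rate(true_mean=1000.0)

# H0 is false (true mean is 1020): every non-rejection is a type II error.
type_2_rate = 1 - rejection_rate(true_mean=1020.0)

print(f"estimated type I error rate:  {type_1_rate:.3f}")
print(f"estimated type II error rate: {type_2_rate:.3f}")
```

When H0 is true, the estimated type I error rate should land close to the chosen α of 0.05. The type II error rate depends on how far the true mean is from 1000 and on the sample size, so it changes whenever those assumptions change.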
P-value
The p-value is the probability of observing a result at least as extreme as the one actually observed, assuming the null hypothesis is true. We use the p-value to decide whether to reject the null hypothesis. It measures the strength of the evidence against H0: the smaller the p-value, the stronger the evidence for rejecting H0.
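The sanitizer example from earlier can be turned into a small p-value calculation. This is only a sketch under stated assumptions: the lab result (985 of 1000 virus particles killed) is invented, each particle is treated as an independent trial, and `binomial_lower_tail` is a hypothetical helper name. Since H1 says the kill rate is less than 99%, "at least as extreme" here means "985 or fewer killed".

```python
from math import comb


def binomial_lower_tail(successes: int, trials: int, p_success: float) -> float:
    """Return P(X <= successes) for X ~ Binomial(trials, p_success)."""
    return sum(
        comb(trials, k) * p_success**k * (1 - p_success) ** (trials - k)
        for k in range(successes + 1)
    )


ALPHA = 0.05  # level of significance

# Hypothetical lab result: 985 of 1000 virus particles were killed.
# H0: kill rate = 0.99    H1: kill rate < 0.99
p_value = binomial_lower_tail(successes=985, trials=1000, p_success=0.99)

decision = "reject H0" if p_value < ALPHA else "do not reject H0"
print(f"p-value = {p_value:.4f} -> {decision}")
```

The decision rule is the comparison on the second-to-last line: H0 is rejected only when the p-value is smaller than the chosen level of significance.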
Implementation
1) One sample and Two sample t-tests
2) ANOVA
3) Type I and Type II errors
4) Chi-Squared Tests
Find the complete notebook code for the above concepts here. If you like the kernel, please upvote it.
Happy learning!
Thank you!
written by: Krishna Heroor
reviewed by: Umamah