ESSENTIAL STATISTICS CONCEPTS FOR DATA SCIENCE

Data science is an interdisciplinary field that combines domain knowledge, computer science, and mathematics to extract insights from data.

One cannot become a data scientist without a working knowledge of mathematics. Probability, statistics, calculus, and linear algebra are the main areas of mathematics used extensively in data science.

In this article, we are specifically going to talk about some of the concepts of statistics that are used in data science.

What is statistics?

Statistics is the discipline that deals with the collection, analysis, interpretation, and reporting of numerical data. It provides methods and tools for making predictions and discovering patterns in large datasets.

Now let’s study some concepts of statistics one by one.

1. Mean, Mode and Median

The mean is the average of the dataset. It is calculated by adding all the values in the dataset and dividing the sum by the number of values.

The mode is the most frequently occurring value in the dataset.

The median is the middle value of the dataset when all values are arranged in ascending order.

If the number of values n in the dataset is odd, the median is the ((n+1)/2)th value.

If n is even, the median is the average of the (n/2)th value and the ((n+2)/2)th value.
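
The two position rules above can be sketched in Python as follows. This is a minimal illustration, not a library routine: the function name median_by_position and the small input lists are made up for the example, and positions are converted from the 1-based counting used in the text to Python's 0-based indexing.

```python
def median_by_position(values):
    """Median via the ((n+1)/2)th rule for odd n and the (n/2)th, ((n+2)/2)th rule for even n."""
    ordered = sorted(values)  # the rules assume ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        # Odd n: take the ((n+1)/2)th value; subtract 1 for 0-based indexing.
        return ordered[(n + 1) // 2 - 1]
    # Even n: average the (n/2)th and ((n+2)/2)th values.
    lower = ordered[n // 2 - 1]
    upper = ordered[(n + 2) // 2 - 1]
    return (lower + upper) / 2


# Hypothetical inputs chosen only to exercise both branches.
odd_example = median_by_position([7, 1, 5])      # sorted: 1, 5, 7
even_example = median_by_position([7, 1, 5, 3])  # sorted: 1, 3, 5, 7
print(odd_example, even_example)
```

For the odd-length list the middle (2nd) value 5 is returned; for the even-length list the 2nd and 3rd values, 3 and 5, are averaged to 4.0.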

E.g. Calculate the mean, mode, and median of the following list of values.

list  – 18, 13, 13, 14, 13, 16, 14, 22, 21  

Mean = (18+13+13+14+13+16+14+22+21)/9 = 16

The mode is the most frequently occurring value in the list. Here 13 occurs three times, more often than any other value, so the mode is 13.

Before calculating the median, arrange the values in ascending order.

13, 13, 13, 14, 14, 16, 18, 21, 22

Here the total number of elements is 9, which is odd, so the position of the middle element is (n+1)/2 = 10/2 = 5.

The 5th element is 14, so the median is 14.

The mean and the median both describe the center of the data, but they are not always equal: here the mean is 16 while the median is 14.
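
The worked example can be checked with Python's built-in statistics module. The variable names are chosen for this sketch; the data is the list used above.

```python
import statistics

values = [18, 13, 13, 14, 13, 16, 14, 22, 21]

mean_value = statistics.mean(values)      # sum of the values divided by their count
mode_value = statistics.mode(values)      # most frequently occurring value
median_value = statistics.median(values)  # middle value after sorting

print(mean_value, mode_value, median_value)
```

The three results agree with the hand calculation: a mean of 16, a mode of 13, and a median of 14.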

2. Variance and Standard deviation

Variance and standard deviation are the most commonly used measures of the spread of the data. The larger the variance, the more spread out the data is.

Variance:

It is the average of the squared differences from the mean.

Variance is denoted by σ².

Variance is calculated with the following formula:

σ² = (1/n) Σ (Xr − μ)², with the sum taken over r = 1, ..., n

Where Xr = the r-th value in the dataset

μ = mean

n = number of values

Standard deviation:

It equals the square root of the variance.

It is denoted by σ (sigma).

Standard deviation is calculated with the following formula:

σ = √( (1/n) Σ (Xr − μ)² ), with the sum taken over r = 1, ..., n
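
As a rough check of these definitions, the sketch below computes the variance as the average squared difference from the mean and the standard deviation as its square root. It reuses the list from the mean/mode/median example as an assumption, since this section gives no data of its own.

```python
import math
import statistics

values = [18, 13, 13, 14, 13, 16, 14, 22, 21]
n = len(values)
mu = sum(values) / n  # mean of the dataset

# Variance: average of the squared differences from the mean.
variance = sum((x - mu) ** 2 for x in values) / n
# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(round(variance, 3), round(std_dev, 3))

# Cross-check against the standard library's population versions.
assert math.isclose(variance, statistics.pvariance(values))
assert math.isclose(std_dev, statistics.pstdev(values))
```

The squared differences sum to 100, so the variance is 100/9 (about 11.111) and the standard deviation is 10/3 (about 3.333). Note that statistics.variance and statistics.stdev divide by n − 1 instead of n (the sample versions), so they give slightly larger values than the definition used here.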

3. Central Limit Theorem:

The central limit theorem states that the sampling distribution of the sample mean approaches a normal (Gaussian) distribution as the sample size grows, irrespective of the distribution of the population, provided the population has a finite variance.

The mean of the sampling distribution of sample means equals the mean of the population, and its variance equals σ²/n.

Where σ² = variance of the population.

n = sample size.

Understanding of the central limit theorem:

Suppose you have a dataset whose distribution you do not know. If you want to estimate the mean and the variance of the underlying population, you can do so with the help of the central limit theorem.

Step 1.

Choose the sample size (n). A common rule of thumb is that if the sample size is >= 30, the sampling distribution of the sample mean will be approximately Gaussian.

Step 2.

Take m samples, each of size n, from the population.

Samples {S1, S2, S3, ..., Sm}

Step 3.

Calculate the mean of each sample.

Mean {μ1, μ2, μ3, ..., μm}

Step 4.

Now plot the distribution of these sample means. This distribution is called the sampling distribution of the sample mean.

Figure 1. Sampling Distribution Of Sampling Means (horizontal axis marked at μ-2σ, μ-σ, μ, μ+σ, μ+2σ)

Step 5. 

This distribution has a mean equal to the mean of the population and a variance equal to (variance of the population / sample size n).

In this way, one can estimate the mean of the population from the mean of the sample means, and the variance of the population by multiplying the variance of the sample means by n.
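
The five steps can be tried out with a small simulation. This is only a sketch under stated assumptions: the population is synthetic (exponentially distributed with mean 1, chosen because it is clearly not Gaussian), and the population size, n = 30, m = 5000, and the random seed are arbitrary choices for the illustration.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical non-normal population: exponential with mean 1.
population = [rng.expovariate(1.0) for _ in range(20_000)]
pop_mean = statistics.fmean(population)
pop_var = statistics.pvariance(population)

n = 30     # Step 1: sample size
m = 5_000  # Step 2: number of samples

# Steps 2-3: draw m samples of size n and record the mean of each.
sample_means = [statistics.fmean(rng.sample(population, n)) for _ in range(m)]

# Step 5: compare the sampling distribution with the population.
mean_of_means = statistics.fmean(sample_means)
var_of_means = statistics.pvariance(sample_means)

print(f"population mean      = {pop_mean:.3f}")
print(f"mean of sample means = {mean_of_means:.3f}")
print(f"population var / n   = {pop_var / n:.4f}")
print(f"var of sample means  = {var_of_means:.4f}")
```

The mean of the sample means should land close to the population mean, and the variance of the sample means close to the population variance divided by n. A histogram of sample_means (Step 4) would look roughly bell-shaped even though the population itself is heavily skewed.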

4. Hypothesis Testing:

A hypothesis is an assumption about the given data. It should be testable, either by observation or by experiment. A hypothesis need not always be true. Statisticians use hypothesis testing to decide whether to reject the hypothesis or not.

There are two types of hypothesis:

  1. Null hypothesis – represented by H0
  2. Alternate hypothesis – represented by Ha or H1 

Let’s understand hypothesis testing through an example.

Step 1.

Suppose there are 2 classes and each class has 50 students. We have measured the heights of the students and found that the difference between the mean heights of the two classes is 10 cm.

X = μ2 – μ1 = 10 cm. This is the observed difference, or ground reality.

Step 2. Design of Null hypothesis

The null hypothesis assumes that there is no relationship between the dependent and independent variables. To formulate the null hypothesis, ask a question and then rephrase it as a statement that assumes no relationship between the dependent and independent variables.

Question – Is there a difference in the heights of the students in class 1 and class 2?

Null hypothesis – There is no difference in heights of students of class 1 and class 2.

Alternate Hypothesis – There is a difference in heights of the students of class 1 and class 2.

Step 3. Design of experiment

We have assumed that there is no difference in the heights of the students of class 1 and class 2.

So, to check this, we combine the data from class 1 and class 2.

Now randomly create 2 samples of size 50 from the combined data and calculate the mean of each sample.

Calculate the difference of the means (δ = μ2 – μ1).

Now repeat this resampling K times and calculate δ for each resampling.

We now have the values δ1, δ2, δ3, δ4, ..., δK.
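
One possible reading of Step 3 is a shuffle-and-split (permutation) resampling, sketched below. The function name resample_differences, the seeds, and the height data are all hypothetical: the heights are synthetic values generated for the illustration, not measurements from the example.

```python
import random
import statistics


def resample_differences(class_1, class_2, k, seed=0):
    """Return k differences in means from random re-splits of the pooled data."""
    rng = random.Random(seed)
    pooled = list(class_1) + list(class_2)  # combine both classes
    size_1 = len(class_1)
    deltas = []
    for _ in range(k):
        rng.shuffle(pooled)          # under H0 the class labels do not matter
        sample_1 = pooled[:size_1]   # first random sample
        sample_2 = pooled[size_1:]   # second random sample
        deltas.append(statistics.fmean(sample_2) - statistics.fmean(sample_1))
    return deltas


# Synthetic heights in cm (assumed values, 50 students per class).
data_rng = random.Random(1)
class_1 = [data_rng.gauss(150, 8) for _ in range(50)]
class_2 = [data_rng.gauss(160, 8) for _ in range(50)]

deltas = resample_differences(class_1, class_2, k=1000)
print(len(deltas), round(min(deltas), 2), round(max(deltas), 2))
```

Because each resampling ignores the original class labels, the resulting δ values are centered near zero; they show how large a difference in means can arise by chance alone when the null hypothesis holds.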

Step 4. Calculate the probability

If P( δ >= 10 cm | H0 ) = 0.95

This means that, assuming H0 is true, there is a 95% probability of seeing a difference in mean heights >= 10 cm. The observed difference is therefore consistent with H0.

Hence we fail to reject the null hypothesis (informally, it is accepted).

If P( δ >= 10 cm | H0 ) = 0.01

This means that, assuming H0 is true, there is only a 1% probability of seeing a difference in mean heights >= 10 cm, yet that is what we observed. Therefore H0 is unlikely to be true.

Hence the null hypothesis is rejected.

Note – This probability is called the p-value. Generally the p-value should be less than 5% (0.05) to reject the null hypothesis.

In this way, statisticians and data scientists use hypothesis testing to decide whether to reject a hypothesis.
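
Step 4 can be sketched as counting how many resampled differences are at least as large as the observed one. The helper names p_value and decide, the short list of δ values, and the two observed differences are made up for this illustration; the 0.05 threshold is the 5% level mentioned in the note.

```python
def p_value(deltas, observed):
    """Fraction of resampled differences at least as large as the observed one."""
    if not deltas:
        raise ValueError("need at least one resampled difference")
    extreme = sum(1 for d in deltas if d >= observed)
    return extreme / len(deltas)


def decide(p, alpha=0.05):
    """Reject H0 when the p-value falls below the significance level alpha."""
    return "reject H0" if p < alpha else "fail to reject H0"


# Hypothetical resampled differences in cm, for illustration only.
deltas = [-3.0, -1.5, -0.5, 0.0, 0.4, 1.2, 2.0, 2.5, 3.1, 11.0]

p_large = p_value(deltas, observed=2.0)   # 4 of the 10 values are >= 2.0
p_small = p_value(deltas, observed=12.0)  # none of the 10 values is >= 12.0
print(p_large, decide(p_large))
print(p_small, decide(p_small))
```

With an observed difference of 2.0 the estimated probability is 0.4, well above 0.05, so H0 is not rejected; with an observed difference of 12.0 it is 0.0, so H0 is rejected. In practice K would be in the thousands, not ten, so that small probabilities can be estimated reliably.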

Conclusion:

In this article, we have studied the importance of statistics in data science and, with examples, some of the statistical concepts most widely used in data science.

written by: Sanket Landge

reviewed by: Rushikesh Lavate
