Introduction: Project on Data Science
Data Science is a growing field with increasing job opportunities. Getting the role of Data Scientist is hard, though: it needs patience and sheer hard work. A good first step on that path is making a project on Data Science.
Projects help you showcase your skills and gain experience at the same time. But where to start and how to make one are big questions at the very beginning. So in this article, we will learn how to get started with making projects and much more.
Getting Started
The very first things you need are data and a problem statement. The first step is getting data, which is the easiest part of a data science project. Many sites, such as Kaggle and the UCI Machine Learning Repository, provide data, or you can use the datasets that ship with scikit-learn (sklearn).
Sklearn has its own datasets API, “sklearn.datasets”.
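As a minimal sketch of what the datasets API looks like, the snippet below loads one of the small datasets bundled with scikit-learn and inspects it. The choice of “load_diabetes” follows the dataset named later in this article; the variable names are only illustrative.

```python
from sklearn import datasets

# Loaders whose names start with "load_" read small datasets
# that are bundled with scikit-learn (no download needed).
diabetes = datasets.load_diabetes()

print(diabetes.data.shape)     # feature matrix: (samples, features)
print(diabetes.target.shape)   # one target value per sample
print(diabetes.feature_names)  # names of the feature columns
```

The returned object also carries a “DESCR” attribute with a text description of the dataset, which is a handy starting point for writing the data description in your project.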
The next step is making a problem statement that fits your data. What most projects lack is a problem statement, and without one a project has no clear purpose. It should seem like you are solving some real-world problem through your project (it need not be real; you can make up a fictitious one).
Now that you have data and a problem statement, you can set up your programming environment. I prefer using Jupyter Notebook for Python, but you can use any environment you find convenient (like Spyder or PyCharm).
A project is always meant to be presented, so everything should be explained. That is why we need a problem statement, and why every piece of code should have a comment.
The outline of your data science project is as follows:
- Problem Statement
- Importing Libraries and Data
- Cleaning Data (if needed)
- Splitting Data (into training and testing data)
- Building model using machine learning
Here I am using Google Colab, a cloud computing platform provided by Google that lets aspiring data scientists with a lower-end computer use hardware such as GPUs. In Colab, most of the popular Python libraries are already installed, so all you need to do is import them and start.
Technical Part
Start by giving a description of the data set and then the problem statement. Then you can start with your code. The data we are going to use here is the diabetes data from the sklearn datasets API. This project is basically a classification project: we classify a given candidate, on the basis of his/her features, as having diabetes or not.
1. Importing Libraries and Data
Start by importing the required libraries and then the data.
Here I used “sklearn.datasets” to import the diabetes data, and then imported the pandas library.
I then made a data frame from the loaded data using pandas. As you can see, the data is already split into features and targets.
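The original notebook code is not reproduced in this article, so the following is only a sketch of what this step might look like, assuming the bundled “load_diabetes” loader is the one meant. The names “features” and “target” are illustrative.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

# Load the dataset; the features and the target come as separate arrays.
diabetes = load_diabetes()

# Wrap the feature matrix in a data frame with readable column names.
features = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)

# Keep the target as its own labelled series.
target = pd.Series(diabetes.target, name="target")

print(features.head())
print(target.head())
```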
2. Cleaning Data
Below I used the data frame’s ‘info()’ method, which prints information such as the number of rows and columns, the data type of each column and the number of non-null values in each column.
We can see that there are no null values in the data and that every column has the desired data type, so there is no need to clean the data. The data we are using here is already clean, but we might not be lucky every time, so it is worth running basic checks with ‘data.info()’.
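A basic check along these lines might look like the sketch below. It assumes the data frame is called “data” and is built from the bundled “load_diabetes” loader; the explicit missing-value count is an addition for illustration, since ‘info()’ only reports non-null counts.

```python
import pandas as pd
from sklearn.datasets import load_diabetes

diabetes = load_diabetes()
data = pd.DataFrame(diabetes.data, columns=diabetes.feature_names)
data["target"] = diabetes.target

# Prints row count, column names, non-null counts and dtypes.
data.info()

# Count missing values per column explicitly.
missing_per_column = data.isnull().sum()
print(missing_per_column)
```

If any column showed missing values here, that would be the point to decide between dropping those rows and filling them in.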
3. Splitting Data
In this step, we will split the data into a training set and a testing set. This step is necessary because the model first learns from the training data and then makes predictions on the testing data, which it has not seen before. The accuracy of those predictions is checked against the known targets of the testing data. Usually, data is split in a seventy-thirty ratio (seventy per cent of the data for training and the remaining thirty per cent for testing). You might be wondering why the training set is larger: we want the model to learn from as many scenarios as possible, because that affects how well the model performs.
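A seventy-thirty split can be sketched with scikit-learn's “train_test_split”. The fixed “random_state” is an assumption added here so that the split is reproducible; it is not something the article specifies.

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split

# return_X_y=True gives the feature matrix and target array directly.
X, y = load_diabetes(return_X_y=True)

# test_size=0.3 holds out roughly 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("training rows:", len(X_train))
print("testing rows:", len(X_test))
```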
4. Building model using a machine learning algorithm
A machine learning model is the result of training an algorithm to recognize patterns in data; it is usually saved as a file. Here I am going to use ‘Gaussian Naive Bayes’, which is used for classification projects. Naive Bayes is basically Bayes’ theorem applied with the assumption that the features are independent of each other given the class.
As we can see here, this model gives us an accuracy of 94%, which means it predicts 94% of the test targets correctly. This accuracy is really good for a start.
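The training code itself is not shown in this article, so the sketch below is one possible version, not the author's exact notebook. One caveat: the target in scikit-learn's bundled “load_diabetes” is a continuous disease-progression score, not a yes/no label. To get a classification problem, the sketch assumes a label of 1 when the score is above the median; that threshold is purely illustrative, and the result will not necessarily match the figure quoted above.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y_continuous = load_diabetes(return_X_y=True)

# ASSUMPTION: turn the continuous target into two classes
# (1 = above the median progression score, 0 = otherwise).
y = (y_continuous > np.median(y_continuous)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Fit Gaussian Naive Bayes on the training data only.
model = GaussianNB()
model.fit(X_train, y_train)

# Predict on unseen test data and compare with the true labels.
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Test accuracy: {accuracy:.2%}")
```

Accuracy here is simply the fraction of test rows whose predicted label matches the true label.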
What next?
This is good for a basic-level project, but we need to improve. There are many more technical practices to apply in a project, such as normalizing data, which brings every feature into the same range, and ‘one-hot encoding’, which deals with categorical data.
Also, as data becomes bigger and more complex, it gets hard to achieve even 90% accuracy. The ‘Curse of Dimensionality’ is a problem that arises when dealing with data that has a high number of features.
Deploying a model is another thing that is rarely taught in data science courses.
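To make these two practices concrete, here is a small sketch on a made-up table. The “patients” data, its column names and its values are hypothetical and are not part of the diabetes dataset. Min-max scaling is used as one common way of normalizing.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data, invented only for this illustration.
patients = pd.DataFrame({
    "age": [25.0, 40.0, 60.0],
    "glucose": [85.0, 130.0, 160.0],
    "activity": ["low", "high", "medium"],
})

# Normalizing: rescale each numeric column to the range [0, 1].
numeric_columns = ["age", "glucose"]
scaled = patients.copy()
scaled[numeric_columns] = MinMaxScaler().fit_transform(patients[numeric_columns])

# One-hot encoding: replace the categorical column with one
# indicator column per category.
encoded = pd.get_dummies(scaled, columns=["activity"])
print(encoded)
```

In a real project the scaler should be fitted on the training set only and then applied to the testing set, so that no information from the test data leaks into training.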
written by: Saurav Kumar
Reviewed By: Krishna Heroor