What is Pandas?
Pandas is a fast, powerful, flexible, and easy-to-use library for data manipulation and data analysis, written for the Python programming language. Its stated goal is to become the most powerful and flexible open-source data manipulation tool.
This article is useful for beginners as well as professionals. We will go through the basics of Pandas and the commands for fundamental data analysis on a given dataset.
Installation of a library
There are multiple ways to install the pandas library. Run the following commands in a terminal or command prompt, not inside the Python interpreter.
Using the pip command:
pip install pandas
Using the conda command:
conda install pandas
Using Google Colab and Jupyter Notebook:
To install the pandas library directly from a Jupyter Notebook or Google Colab cell, prefix the command with an exclamation mark. The conda form works only where conda is available in the notebook environment.
!pip install pandas
!conda install pandas
Install a specific version of pandas
You can also install a specific version of pandas. Do not put spaces around the == operator:
pip install pandas==required_version
Ex. pip install pandas==1.1.3
Upgrading a library
To upgrade an installed library, use the following command:
pip install --upgrade pandas
Importing of a library
To make use of the functions in a library, you’ll need to import the library with the help of an import statement. An import statement is created by the import keyword along with the name of the library.
Importing pandas library :
>>> import pandas
We can also import the library under an alias.
>>> import pandas as pd
Here, the “pandas” library is imported under the alias “pd”.
Instead of importing the whole library, you can import a specific name from it by using the from ... import form.
For example, to import the DataFrame class from the pandas library, use the command below:
>>> from pandas import DataFrame
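A quick way to confirm that the import worked is to print the installed version. The sketch below assumes pandas is already installed; the variable name same_class is only illustrative.

```python
import pandas as pd
from pandas import DataFrame

# pd.__version__ holds the installed pandas version as a string.
print(pd.__version__)

# Both spellings refer to the same class, so either import style can be used.
same_class = DataFrame is pd.DataFrame
print(same_class)
```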
Data Structure in pandas
There are two major pandas data structures that you will come across while doing data manipulation:
- Series
- Dataframe
Both can hold data of any type, such as strings, numbers, or arbitrary Python objects.
- Series
A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floats, Python objects, and so on. By default its index runs from 0 to n-1, where n is the length of the Series. The axis labels are collectively referred to as the index.
Creating a pandas series
>>> s = pd.Series(data, index=index)
(First install pandas, then import pandas as pd and NumPy as np.)
>>> s = pd.Series(np.random.randn(4), index=['1', '2', '3', '4'])
>>> s
Out :
1 0.469112
2 -0.282863
3 -1.509059
4 -1.135632
dtype: float64
Here, we create a Series s with four random values generated by np.random.randn() and the string index labels '1', '2', '3', and '4'.
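To make the role of the index concrete, here is a minimal sketch that builds one Series with the default index and one with explicit labels. The values and the names default_s and labeled_s are made up for illustration.

```python
import pandas as pd

# Without an explicit index, labels default to 0, 1, ..., n-1.
default_s = pd.Series([10.5, 20.0, 30.25])

# With an explicit index, values can be looked up by label.
labeled_s = pd.Series([10.5, 20.0, 30.25], index=["a", "b", "c"])

print(list(default_s.index))   # [0, 1, 2]
print(labeled_s["b"])          # 20.0
print(labeled_s.dtype)         # float64
```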
- DataFrame
A DataFrame is a two-dimensional, labeled data structure made up of one or more Series that share an index. Each column is a Series, and different columns can have different data types such as integers, strings, floats, or objects.
Creating a pandas DataFrame
There are various ways to create a DataFrame, but a convenient one is to use a dictionary. First create a dictionary that maps column names to lists of values, then pass it to the pandas DataFrame constructor:
In this example,
we create a dictionary with a dict literal
>>> data = {'Car': [3, 2, 0, 1], 'Bike': [0, 3, 7, 2]}
then pass it to the DataFrame constructor
>>> vehicle = pd.DataFrame(data)
>>> vehicle
Out :
   Car  Bike
0    3     0
1    2     3
2    0     7
3    1     2
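The idea that a DataFrame is a collection of Series can be checked directly. The sketch below reuses the Car and Bike data from the example; the Owner column and its names are hypothetical additions used only to show a column of a different type.

```python
import pandas as pd

data = {"Car": [3, 2, 0, 1], "Bike": [0, 3, 7, 2]}
vehicle = pd.DataFrame(data)

# Selecting one column returns a Series that shares the DataFrame's index.
car = vehicle["Car"]
print(type(car).__name__)   # Series

# Each column keeps its own dtype; a text column can sit next to numeric ones.
vehicle["Owner"] = ["Ann", "Bob", "Cy", "Dee"]  # hypothetical names
print(vehicle.dtypes)
```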
How to Read Data in pandas
In pandas, the read_* family of functions reads files in different formats such as CSV, JSON, SQL, and Excel and converts them into a DataFrame. After that, we can perform operations on the loaded data.
Reading Data From CSV Files
CSV stands for comma-separated values, a plain-text format that stores data in tabular form.
Reading a CSV file:
>>> df_csv = pd.read_csv("income.csv")
>>> df_csv
Out:
Reading Data From JSON Files
JSON stands for JavaScript Object Notation, a text format commonly used when data is sent from a server to a web page.
Reading a JSON file:
>>> df_json = pd.read_json("income.json")
>>> df_json
Out:
Reading Data From Excel Files
Excel files are saved with the .xlsx extension, the Microsoft Excel Open XML Spreadsheet format, which stores data in tabular form.
Reading an xlsx file:
>>> df_xlsx = pd.read_excel("income.xlsx")
>>> df_xlsx
Out:
In this way we can read many file types by using the matching function. Below are some of the functions used for reading data:
- pd.read_csv("file's path")
- pd.read_clipboard() (reads from the system clipboard, so it takes no file path)
- pd.read_excel("file's path")
- pd.read_feather("file's path")
- pd.read_html("file's path")
- pd.read_json("file's path")
- etc
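The reader functions also accept file-like objects, which makes them easy to try without any files on disk. The following is a minimal sketch under that assumption; the names and income figures are made up and stand in for the contents of files such as income.csv and income.json.

```python
import io
import pandas as pd

# Hypothetical file contents, held in memory so the example runs anywhere.
csv_text = "name,income\nAsha,52000\nRavi,48000\n"
json_text = '[{"name": "Asha", "income": 52000}, {"name": "Ravi", "income": 48000}]'

# Each format has its own reader: read_csv for CSV text, read_json for JSON text.
df_csv = pd.read_csv(io.StringIO(csv_text))
df_json = pd.read_json(io.StringIO(json_text))

print(df_csv)
print(df_json)
```

Passing JSON or Excel data to pd.read_csv would not parse it correctly, which is why the reader has to match the file format.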
DataFrame Operations
As mentioned earlier, a pandas DataFrame is a two-dimensional structure whose columns can have different data types such as integers, strings, floats, or objects; that is, the data is laid out in a table of rows and columns. We perform operations on it to manipulate the data.
Load DataFrame
We load the Titanic dataset from a CSV file into a DataFrame and set PassengerId as the index.
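A minimal sketch of this loading step is shown below. The original file name is not given, so the example reads a few made-up rows in the style of the Titanic dataset from memory; with a real file, the first argument would be its path (for example "titanic.csv", an assumed name).

```python
import io
import pandas as pd

# Made-up rows in the style of the Titanic dataset, not the real data.
titanic_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Age,Cabin\n"
    "1,0,3,22,\n"
    "2,1,1,38,C85\n"
    "3,1,3,,\n"
)

# index_col tells read_csv which column to use as the row index.
df = pd.read_csv(titanic_csv, index_col="PassengerId")

print(df.index.name)   # PassengerId
print(df.shape)        # (3, 4)
```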
View data
The df.head() method shows the first five rows, and df.tail() shows the last five. Both accept the number of rows to show as an argument, for example df.head(17) or df.tail(13).
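As a small illustration, the sketch below uses a stand-in DataFrame with 20 rows rather than the Titanic data; the column name value is hypothetical.

```python
import pandas as pd

# A stand-in DataFrame with 20 rows (values 0 to 19).
df = pd.DataFrame({"value": range(20)})

first_five = df.head()    # no argument: first 5 rows
last_three = df.tail(3)   # explicit argument: last 3 rows

print(first_five["value"].tolist())   # [0, 1, 2, 3, 4]
print(last_three["value"].tolist())   # [17, 18, 19]
```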
Getting Info of data
The df.info() method gives the key details about the dataset, such as the number of rows and columns, the number of non-null values, and the data type of each column.
The df.columns attribute holds the column names of the data frame.
The df.shape attribute gives the shape of the data frame as a (rows, columns) tuple.
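The sketch below shows the three side by side on a tiny made-up DataFrame with Titanic-style column names. Note that info() is called with parentheses, while columns and shape are attributes and are read without them.

```python
import pandas as pd

# Tiny made-up DataFrame; the values are illustrative only.
df = pd.DataFrame(
    {"Survived": [0, 1, 1], "Pclass": [3, 1, 3], "Age": [22.0, 38.0, None]},
    index=pd.Index([1, 2, 3], name="PassengerId"),
)

df.info()                  # prints the summary itself and returns None
print(list(df.columns))    # ['Survived', 'Pclass', 'Age']
print(df.shape)            # (3, 3)
```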
Handling missing values
A data frame may contain null or missing values, and we usually want to remove or replace them to get better, more accurate results.
First, we check the total number of null values per column. The .isnull() method marks the null values, and chaining .sum() counts them for each column.
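A minimal sketch of this check, using a small made-up DataFrame in place of the Titanic data:

```python
import pandas as pd

# Made-up data with some missing entries.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Pclass": [3, 1, 3, 2],
})

# isnull() gives a True/False table; sum() counts the True values per column.
null_counts = df.isnull().sum()
print(null_counts)
```

Here Age has 2 missing values, Cabin has 3, and Pclass has none.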
Here, we can see that there are several null values in our data set. To handle them, we can delete the affected rows or replace the missing values with the mean, median, or mode. For more details, see the article How Do You Handle Missing Values, Categorical Data And Feature Scaling In Machine Learning.
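The two approaches can be sketched as follows. The data is made up, and filling Age with the mean and Embarked with the mode is one possible choice for illustration, not a recommendation for the real Titanic dataset.

```python
import pandas as pd

# Made-up data with one missing value in each column.
df = pd.DataFrame({
    "Age": [22.0, None, 26.0, 30.0],
    "Embarked": ["S", "C", None, "S"],
})

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill missing values instead of dropping rows.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())  # mean of 22, 26, 30
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])  # most frequent value

print(len(dropped))   # 2
print(filled)
```

Dropping rows is simpler but discards data, while filling keeps every row at the cost of introducing estimated values.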
Summary
After reading this article, we understand that the main difference between a Series and a DataFrame is that a Series holds a single one-dimensional column of values with an index, whereas a DataFrame can be made of more than one Series. We also saw how to load files in different formats using the pandas read functions and how to perform some basic operations on the Titanic DataFrame.
Article By: Rushikesh Lavate