Pandas | Python Data Analysis Library

What is Pandas?

Pandas is a fast, powerful, flexible, and easy-to-use data handling library written for the Python programming language. It is used for data manipulation, data analysis, and related tasks. The project also has the ambitious goal of becoming the most powerful and flexible open-source data manipulation tool.

This article is useful for beginners as well as professionals. It goes through the basics of Pandas and the commands needed to do fundamental data analysis on a given dataset.

Installation of a library

There are multiple ways to install the pandas library. Run these commands in a terminal (command prompt), not inside the Python interpreter.

Using the pip command :

pip install pandas

Using the conda command :

conda install pandas

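Using Google Colab and Jupyter Notebook :

If you want to install the pandas library directly from a Jupyter notebook or Google Colab cell, prefix the command with an exclamation mark so that it runs as a shell command :

!pip install pandas

!conda install -y pandas

The conda command only works in environments where conda is installed. The -y flag answers the confirmation prompt, which a notebook cell cannot answer interactively.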

Install a specific version of pandas

You can also install a specific version of pandas by using the command :

pip install pandas==required_version

Ex. pip install pandas==1.1.3

Upgrading a library

To upgrade the installed library, use the command :

pip install --upgrade pandas
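To confirm which version is active after installing or upgrading, one quick check is to print the version string from inside Python. This is only a minimal sketch; the value printed depends on your environment.

```python
import pandas as pd

# __version__ holds the installed pandas version as a string
installed_version = pd.__version__
print(installed_version)
```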

Importing of a library

To make use of the functions in a library, you’ll need to import the library with the help of an import statement. An import statement is created by the import keyword along with the name of the library.

Importing pandas library :

>>> import pandas 

We can also import the library under an alias.

>>> import pandas as pd

Here, the “pandas” library is imported using the alias “pd”, which is the conventional short name.

Instead of the whole library, you can import a specific name from it by using the from ... import form.

Suppose you want to import only the DataFrame class from the pandas library. You can use the command below :

>>> from pandas import DataFrame
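As a quick illustration, the sketch below suggests that the import styles above all reach the same DataFrame class; only the name you type differs.

```python
import pandas
import pandas as pd
from pandas import DataFrame

# The full name, the alias, and the directly imported name
# all refer to the same class object
same_class = (pandas.DataFrame is pd.DataFrame) and (pd.DataFrame is DataFrame)
print(same_class)  # prints True
```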

Data Structure in pandas 

There are two major pandas data structures that you will come across while doing data manipulation.

  1. Series
  2. DataFrame

These structures can hold data of any type, such as characters, numbers, or objects.

  1. Series

A Series is a one-dimensional labeled array that can hold any data type: integers, strings, floats, Python objects, etc. By default it has an integer index running from 0 to n-1, where n is the length of the series. The axis labels are collectively referred to as the index.

Creating a pandas series 

>>> s = pd.Series(data, index=index)

(First install the pandas library, then import pandas and NumPy.)

>>> import numpy as np

>>> import pandas as pd

>>> s = pd.Series(np.random.randn(4), index=['1', '2', '3', '4'])

>>> s

Out (the values are random, so they will differ on each run) :

1    0.469112

2   -0.282863

3   -1.509059

4   -1.135632

dtype: float64

Here, we create a series ‘s’ with four random values generated by NumPy’s randn() function, labeled with the string index values ‘1’, ‘2’, ‘3’, and ‘4’.
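The two indexing behaviours described above can be sketched as follows. The variable names and values here are made up for illustration.

```python
import pandas as pd

# A Series with explicit string labels as its index
prices = pd.Series([10.5, 20.0, 7.25], index=["a", "b", "c"])
print(prices["b"])  # look up a value by its label

# Without an index argument, pandas assigns the labels 0 .. n-1
counts = pd.Series([3, 2, 0, 1])
print(list(counts.index))
```

Looking values up by label rather than by position is what distinguishes a Series from a plain Python list.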

  2. DataFrame

A DataFrame is a two-dimensional, tabular data structure. It can be seen as a collection of series that share the same index, where each column can have its own data type, such as integers, strings, floats, objects, etc.

Creating a pandas DataFrame

There are various methods to create a DataFrame, but a convenient option is to use a dictionary. First, create a dictionary whose keys are the column names and whose values are lists of column data, and then pass it to the pandas DataFrame constructor.

In this example, we create a dictionary :

>>> data = {'Car': [3, 2, 0, 1], 'Bike': [0, 3, 7, 2]}

then pass it to the DataFrame constructor :

>>> vehicle = pd.DataFrame(data)

>>> vehicle

Out : 

   Car  Bike
0    3     0
1    2     3
2    0     7
3    1     2
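To illustrate the idea that a DataFrame is a collection of series, the sketch below rebuilds the same vehicle example, pulls one column back out as a Series, and computes column totals.

```python
import pandas as pd

data = {"Car": [3, 2, 0, 1], "Bike": [0, 3, 7, 2]}
vehicle = pd.DataFrame(data)

# Selecting a single column returns a Series
car = vehicle["Car"]
print(type(car).__name__)

# Each column keeps its own data type
print(vehicle.dtypes)

# Column-wise totals, returned as a Series indexed by column name
totals = vehicle.sum()
print(totals)
```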

How to Read Data in pandas 

In pandas, the read_* functions help us read files in different formats such as CSV, JSON, SQL, Excel, etc. and convert them into a DataFrame. After that, we can perform operations on the loaded data.

Reading Data From CSV Files

CSV stands for comma-separated values, a plain-text file format that stores data in tabular form.

Reading a CSV file :

>>> df_csv = pd.read_csv("income.csv")

>>> df_csv

Out: 

Reading Data From JSON Files

JSON stands for JavaScript Object Notation, a text format that is commonly used when data is sent from a server to a web page.

Reading a JSON file :

>>> df_json = pd.read_json("income.json")

>>> df_json

Out:

Reading Data From Excel Files

An Excel file is saved with the extension .xlsx, the Microsoft Excel Open XML Spreadsheet format created by Microsoft Excel. It stores data in tabular form. Reading .xlsx files with pandas requires an Excel engine such as openpyxl to be installed.

Reading an xlsx file :

>>> df_xlsx = pd.read_excel("income.xlsx")

>>> df_xlsx

Out:

In this way we can read many types of files by using the matching function. Below are some of the functions used for reading data :

  • pd.read_csv("file's path")
  • pd.read_clipboard() (reads from the system clipboard, so it takes no file path)
  • pd.read_excel("file's path")
  • pd.read_feather("file's path")
  • pd.read_html("file's path or URL") (returns a list of DataFrames, one per table found)
  • pd.read_json("file's path")
  • etc
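The readers listed above can be tried without any files on disk. The sketch below uses made-up income data held in memory (the names and numbers are hypothetical) and passes it to read_csv and read_json through io.StringIO, which stands in for a file path.

```python
import io
import pandas as pd

# Hypothetical income data, held in memory instead of on disk
csv_text = "name,income\nAsha,52000\nRavi,48000\n"
json_text = '[{"name": "Asha", "income": 52000}, {"name": "Ravi", "income": 48000}]'

# Each format has its own reader; both return a DataFrame
df_from_csv = pd.read_csv(io.StringIO(csv_text))
df_from_json = pd.read_json(io.StringIO(json_text))

print(df_from_csv)
print(df_from_json)
```

With real files, you would pass the file path as a string in place of the StringIO object.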

DataFrame Operations

As mentioned earlier, a pandas DataFrame is a two-dimensional data structure in which data is aligned in a tabular fashion in rows and columns, and each column can hold its own data type, such as integers, strings, floats, objects, etc. We perform operations on it for data manipulation.

Load DataFrame

We load the Titanic DataFrame from a CSV file and set PassengerId as the index.
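The article does not give the file name, so the sketch below uses a few made-up rows shaped like the Titanic data (not the real dataset) and reads them from memory. The index_col parameter of read_csv is what turns PassengerId into the index.

```python
import io
import pandas as pd

# Made-up rows in the shape of the Titanic data (not the real dataset).
# With a real file you would pass its path instead of the StringIO object.
sample_csv = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
4,0,2,male,35,
5,0,3,male,,8.05
"""

# index_col makes PassengerId the row index instead of a regular column
df = pd.read_csv(io.StringIO(sample_csv), index_col="PassengerId")
print(df)
```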

View data 

By using the df.head() function we can view the first five rows, and for the bottom five we use the df.tail() function. We can also specify how many top or bottom rows we want to view by passing the number of rows as a parameter, for example df.head(17) or df.tail(13).
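A minimal sketch of both calls, using a small stand-in DataFrame with hypothetical data rather than the Titanic file:

```python
import pandas as pd

# A small stand-in DataFrame with 20 rows (hypothetical data)
df = pd.DataFrame({"value": range(20)})

first_rows = df.head()   # first 5 rows by default
last_rows = df.tail(3)   # last 3 rows
print(first_rows)
print(last_rows)
```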

Getting Info of data

The df.info() function gives the required details about the given dataset, such as the number of rows and columns, the number of non-null values, what type of data is in each column, etc.

The df.columns attribute gives the column names of the given DataFrame.

The df.shape attribute gives the shape of the given DataFrame as a (rows, columns) tuple.
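The three inspection tools above can be sketched on a tiny hypothetical DataFrame. Note that info() is called with parentheses, while columns and shape are attributes and take none.

```python
import pandas as pd

# Hypothetical data with one missing Age value
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, None],
})

df.info()                        # prints a summary and returns None
column_names = list(df.columns)  # attribute, no parentheses
n_rows, n_cols = df.shape        # (rows, columns) tuple
print(column_names, n_rows, n_cols)
```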

Handling missing values 

A given DataFrame may contain null or missing values, and we want to remove or replace them for better and more accurate results.

First, we check the total number of null values per column. To find null values we use the .isnull() function, and to count them we chain the .sum() function: df.isnull().sum().
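The counting step can be sketched like this, again on made-up rows rather than the real Titanic data:

```python
import pandas as pd

# Hypothetical data with some missing cells
df = pd.DataFrame({
    "Age": [22.0, None, 35.0, None],
    "Fare": [7.25, 71.28, None, 8.05],
    "Sex": ["male", "female", "female", "male"],
})

# isnull() marks each missing cell as True;
# sum() then counts the True values in each column
null_counts = df.isnull().sum()
print(null_counts)
```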

The Titanic dataset contains several null values, and we now want to handle them. Common methods are deleting the affected rows or replacing the missing values with the mean, median, or mode of the column. For more details, see the article “How Do You Handle Missing Values, Categorical Data And Feature Scaling In Machine Learning”.
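The two approaches mentioned above, deleting rows and replacing values, might look like the sketch below. The data is made up, and choosing the mean for the numeric column and the mode for the text column is an assumption for illustration, not a recommendation from the article.

```python
import pandas as pd

# Hypothetical data with missing values in both columns
df = pd.DataFrame({
    "Age": [22.0, None, 35.0, None],
    "Embarked": ["S", "C", None, "S"],
})

# Option 1: drop every row that contains at least one missing value
dropped = df.dropna()

# Option 2: fill numeric gaps with the mean (median() works the same way)
# and text gaps with the mode, the most frequent value
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].mean())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(dropped)
print(filled)
```

Dropping rows is simple but discards data, while filling keeps every row at the cost of introducing estimated values.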

Summary

After reading this article, we understand that the main difference between a Series and a DataFrame is that a Series holds a single indexed column of values, whereas a DataFrame can be made of more than one series. We also saw how to load files in different formats using the pandas read functions and how to perform some basic operations on the Titanic DataFrame.

Article By: Rushikesh Lavate

