Pandas for Data Analysis


Why Pandas for Data Analysis?

Real-world raw data needs a lot of wrangling before a data scientist can analyze it, and one of the most popular tools for data wrangling in Python is Pandas. Python owes much of its popularity in data work to the wide availability of packages for almost every task. Pandas is one such package: its extensive built-in functions for manipulating and visualizing data make analysis much easier.

Pandas First Steps

If you are using Anaconda, pandas is already included. If for some reason it is missing, run this command:

conda install pandas

If you are not using Anaconda, install it with pip:

pip install pandas

To import pandas, use:

import pandas as pd
import numpy as np

It is a good idea to import NumPy alongside pandas. Pandas is built on NumPy, and many NumPy features are useful during Exploratory Data Analysis (EDA).

Pandas Data Structures

Pandas has two main data structures:

  • Series
  • DataFrame

SERIES

The basic syntax to create a pandas Series is as follows:

newSeries = pd.Series(data, index)

The data can be a Python dictionary, list, or tuple. It can also be a NumPy array.
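
As a minimal sketch of the other input types mentioned above, the following builds a Series from a tuple and from a NumPy array. The variable names and values are illustrative only and do not come from the original examples.

```python
import numpy as np
import pandas as pd

# Illustrative labels; the data must have the same length as the index.
labels = ['a', 'b', 'c']

# Series from a tuple
tuple_series = pd.Series((10, 20, 30), index=labels)

# Series from a NumPy array
array_series = pd.Series(np.array([1.5, 2.5, 3.5]), index=labels)

print(tuple_series)
print(array_series)
```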

Let’s build a Series from a Python list:

mylist = ['Tanishka', 'Machine Learning', 24, 'India']
labels = ['Name', 'Career', 'Age', 'Country']
newSeries = pd.Series(mylist, index=labels)
print(newSeries)

We can also create a Series from a Python dictionary. The dictionary keys become the index labels.

myDict = {'Name': 'Tanishka',
          'Career': 'Machine Learning',
          'Age': 24,
          'Country': 'India'}
mySeries = pd.Series(myDict)
print(mySeries)

Accessing data from Series

The usual pattern for accessing data in a pandas Series is:

seriesName['IndexName']


Take the mySeries we created earlier. To get the values of Name, Age, and Career, all we have to do is:

print(mySeries['Name'])
print(mySeries['Age'])
print(mySeries['Career'])
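
Beyond single labels, a list of labels returns a smaller Series, and `.get` returns a default value instead of raising `KeyError` when a label is missing. A short sketch, rebuilding the same Series so the block runs on its own (the 'City' label and the 'unknown' default are made up for illustration):

```python
import pandas as pd

mySeries = pd.Series({'Name': 'Tanishka',
                      'Career': 'Machine Learning',
                      'Age': 24,
                      'Country': 'India'})

# A list of labels returns a new Series with just those entries
subset = mySeries[['Name', 'Country']]
print(subset)

# .get returns the default when the label is absent (no KeyError)
city = mySeries.get('City', 'unknown')
print(city)
```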

Basic Operations on Pandas Series

Let’s create two new Series to perform operations on:

newSeries1 = pd.Series([10, 20, 30, 40], index=['LONDON', 'NEWYORK', 'Washington', 'Singapore'])
newSeries2 = pd.Series([10, 20, 35, 46], index=['LONDON', 'NEWYORK', 'INDIA', 'CANADA'])
print(newSeries1, newSeries2, sep='\n\n')

The basic arithmetic operations are +, -, *, and /. They are applied by matching index labels, not positions, so let’s try them.

newSeries1 + newSeries2

Here, the labels LONDON and NEWYORK are present in the index of both Series, so their values are added. Every other label appears in only one Series, so its result is NaN (Not a Number). The same label alignment applies to multiplication and division:

newSeries1 * newSeries2
newSeries1 / newSeries2
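
If NaN for non-matching labels is not what you want, the method form of each operator accepts a `fill_value`. The sketch below uses `Series.add` with `fill_value=0`, so a label present in only one Series keeps its own value. Whether 0 is a sensible default depends on your data.

```python
import pandas as pd

newSeries1 = pd.Series([10, 20, 30, 40],
                       index=['LONDON', 'NEWYORK', 'Washington', 'Singapore'])
newSeries2 = pd.Series([10, 20, 35, 46],
                       index=['LONDON', 'NEWYORK', 'INDIA', 'CANADA'])

# Plain + aligns on labels; a label missing from either side gives NaN
plain_sum = newSeries1 + newSeries2

# .add with fill_value treats a missing label as 0 before adding
filled_sum = newSeries1.add(newSeries2, fill_value=0)

print(plain_sum)
print(filled_sum)
```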

DATAFRAMES

Creating a DataFrame from a list

import pandas as pd

# list of strings
new_list = ['Mango','Kiwi','Strawberry','Pineapple']

# Calling DataFrame constructor on list
df = pd.DataFrame(new_list)
print(df)

Creating a DataFrame from a dict of ndarrays/lists

import pandas as pd

# initialise data as a dict of lists; each key becomes a column name
data = {'Name': ['Mango', 'Kiwi', 'Strawberry', 'Pineapple'], 'Price': [20, 21, 19, 18]}

# Create DataFrame
df = pd.DataFrame(data)

# Print the output.
print(df)
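
The heading above mentions ndarrays as well as lists, but only lists are shown. A minimal sketch of the ndarray case, assuming the same fruit data stored in NumPy arrays (the name `fruit_data` is illustrative):

```python
import numpy as np
import pandas as pd

# Same fruit data, but each column is a NumPy array instead of a list
fruit_data = {'Name': np.array(['Mango', 'Kiwi', 'Strawberry', 'Pineapple']),
              'Price': np.array([20, 21, 19, 18])}

df = pd.DataFrame(fruit_data)

print(df)
print(df.shape)  # (number of rows, number of columns)
```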

Indexing and Selecting Data

In pandas, indexing simply means selecting particular rows and columns of data from a DataFrame. That could mean all of the rows and some of the columns, some of the rows and all of the columns, or some of each. Indexing is also known as subset selection.
The term "indexing operator" refers to the square brackets that follow an object. The .loc and .iloc indexers also use square brackets to make selections. In this section, "indexing operator" means df[].
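
The three kinds of subset described above can be sketched as follows, using the fruit data from the earlier examples. Note that a `.loc` slice by label includes both endpoints.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Mango', 'Kiwi', 'Strawberry', 'Pineapple'],
                   'Price': [20, 21, 19, 18]})

# All rows, some columns
names_only = df[['Name']]

# Some rows, all columns (label slice 0..1, both ends included)
first_two = df.loc[0:1]

# Some rows and some columns
some_prices = df.loc[[0, 2], ['Price']]

print(names_only, first_two, some_prices, sep='\n\n')
```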

Selecting a single column

# importing pandas package
import pandas as pd

# initialise data as a dict of lists
data = {'Name': ['Mango', 'Kiwi', 'Strawberry', 'Pineapple'], 'Price': [20, 21, 19, 18]}

# load the data into a DataFrame
df = pd.DataFrame(data)

# retrieve a column with the indexing operator
first = df["Price"]

print(first)

Selecting a single Row using .loc

import pandas as pd

data = {
  "Name": ['Mango','Kiwi','Strawberry','Pineapple'],
  "Price": [20, 21, 19, 18]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

# select the row whose index label is 0:
print(df.loc[0])

Selecting a single Row using .iloc

import pandas as pd

data = {
  "Name": ['Mango','Kiwi','Strawberry','Pineapple'],
  "Price": [20, 21, 19, 18]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

# select the row at integer position 3 (the fourth row):
print(df.iloc[3])
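
With the default 0, 1, 2, ... index, .loc and .iloc can look interchangeable. The difference shows once the index holds labels: .loc selects by label, while .iloc selects by integer position. A sketch, assuming the fruit names are used as the index (a choice made here purely for illustration):

```python
import pandas as pd

df = pd.DataFrame({'Price': [20, 21, 19, 18]},
                  index=['Mango', 'Kiwi', 'Strawberry', 'Pineapple'])

# .loc looks up the row whose label is 'Kiwi'
by_label = df.loc['Kiwi']

# .iloc looks up the row at integer position 1 (the second row)
by_position = df.iloc[1]

print(by_label)
print(by_position)
```

Both lookups return the same row here, because 'Kiwi' is the label of the row at position 1.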


Written by 

Tanishka Garg is a Software Consultant working in AI/ML domain.
