 # Data Analysis using Python: Pandas

In this blog, I am going to explain pandas, an open-source library for data manipulation, analysis, and cleaning.

Pandas is a high-level data manipulation tool developed by Wes McKinney. The name is derived from "panel data", an econometrics term for multidimensional data sets. Pandas is built on top of NumPy.

Regardless of where the data comes from, processing and analyzing it typically involves five steps: load, prepare, manipulate, model, and analyze.
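As a rough sketch of how these five steps might look in pandas: the inline CSV text, the column names, and the use of `np.polyfit` as a stand-in "model" are all assumptions made for illustration, not part of any particular workflow.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical inline CSV standing in for a real data source.
raw = io.StringIO(
    "city,year,sales\n"
    "Delhi,2020,100\n"
    "Delhi,2021,120\n"
    "Pune,2020,80\n"
    "Pune,2021,\n"
)

# 1. Load the raw data into a DataFrame.
df = pd.read_csv(raw)

# 2. Prepare: drop rows where sales is missing.
clean = df.dropna(subset=["sales"]).copy()

# 3. Manipulate: derive a column counting years since 2020.
clean["years_since_2020"] = clean["year"] - 2020

# 4. Model: fit a straight line to sales over time (a deliberately simple stand-in).
slope, intercept = np.polyfit(clean["years_since_2020"], clean["sales"], deg=1)

# 5. Analyze: average sales per city.
summary = clean.groupby("city")["sales"].mean()
print(summary)
```

Real projects usually spend most of their effort in the prepare and manipulate steps, which is where pandas is most useful.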

You can install pandas using pip or conda:

pip install pandas

or

conda install pandas

Start using pandas by importing it as:

import pandas as pd

Pandas has historically offered the following three data structures (Panel was deprecated in pandas 0.20 and removed in 0.25):

• Series
• DataFrame
• Panel

###### Series

A Series is a one-dimensional, array-like structure that holds homogeneous data. Its values are mutable, but its size is immutable.
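A small sketch of what "value-mutable" and "homogeneous" mean in practice. The sample values are made up, and the exact integer dtype printed can vary by platform.

```python
import pandas as pd

s = pd.Series([10, 20, 30])
print(s.dtype)        # a single dtype applies to the whole Series

# Values can be changed in place.
s.iloc[1] = 99
print(s.tolist())

# Mixing types does not fail; pandas falls back to the generic object dtype.
mixed = pd.Series([1, "two", 3.0])
print(mixed.dtype)
```

Operations that appear to change the size, such as `drop` or `pd.concat`, return a new Series rather than resizing the existing one.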

• To create an empty series:

pd.Series(dtype='float64')
Series([], dtype: float64)

• To create a series with default indexes:

import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
series = pd.Series(data)
print(series)
0    a
1    b
2    c
3    d
dtype: object

• To create a series with custom indexes:

series = pd.Series(data, index=[10, 11, 12, 13])
print(series)
10    a
11    b
12    c
13    d
dtype: object

• To create a series using a dictionary:

data = {'a': 0., 'b': 1., 'c': 2.}
series = pd.Series(data)
print(series)
a    0.0
b    1.0
c    2.0
dtype: float64

• To create a series from a dictionary with custom indexes. Here, the order of the given index is preserved, and the missing element is filled with NaN:

data = {'a': 0., 'b': 1., 'c': 2.}
series = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(series)
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

• To access a single element of the series by position:

print(series.iloc[0])
1.0

• To access multiple elements of the series by position:

print(series.iloc[[0, 1]])
b    1.0
c    2.0
dtype: float64
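Elements can be looked up either by position or by label. The sketch below rebuilds the dictionary-based series from above and uses `.iloc` for positions and `.loc` for labels; the variable names are illustrative only.

```python
import pandas as pd

series = pd.Series({"a": 0.0, "b": 1.0, "c": 2.0}, index=["b", "c", "d", "a"])

first_by_position = series.iloc[0]   # first row, whatever its label is
value_by_label = series.loc["a"]     # the row labelled 'a'
subset = series.loc[["b", "a"]]      # several labels at once
present = series.dropna()            # drop the NaN entry for 'd'

print(first_by_position, value_by_label)
print(subset)
print(present)
```

Being explicit with `.iloc` and `.loc` avoids ambiguity when the index itself contains integers.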

###### DataFrame

A DataFrame stores and manipulates tabular data in rows of observations and columns of variables. It is a size-mutable, value-mutable, two-dimensional structure that can hold heterogeneous data. For example, in an employee management system, a column could be employeeName and each row would hold one employee's name. A DataFrame can be created by loading data from existing storage such as a SQL database, a CSV file, or an Excel file. It can also be created from lists, dictionaries, a list of dictionaries, and so on.

• To create a DataFrame from a CSV file:

• To create a DataFrame from a database:

• To create a DataFrame from an Excel file:
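A minimal sketch of the three loading routes, under stated assumptions: an in-memory string stands in for a CSV file, an in-memory SQLite table named `employees` stands in for a real database, and `load_excel` is a hypothetical helper whose `path` argument is a placeholder. `pd.read_excel` also needs an Excel engine such as openpyxl installed, so it is defined here but not called.

```python
import io
import sqlite3

import pandas as pd

# CSV: read_csv accepts a file path, a URL, or a file-like object.
csv_text = io.StringIO("Name,Age\nJoe,10\nBob,12\n")
df_csv = pd.read_csv(csv_text)

# Database: an in-memory SQLite table stands in for a real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (Name TEXT, Age INTEGER)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)", [("Joe", 10), ("Bob", 12)]
)
df_sql = pd.read_sql_query("SELECT Name, Age FROM employees", conn)
conn.close()


# Excel: hypothetical helper; requires an engine such as openpyxl.
def load_excel(path, sheet_name=0):
    """Read one sheet of a workbook into a DataFrame."""
    return pd.read_excel(path, sheet_name=sheet_name)


print(df_csv)
print(df_sql)
```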

• To create an empty DataFrame:

print(pd.DataFrame())
Empty DataFrame
Columns: []
Index: []

• To create a DataFrame from the list:

print(pd.DataFrame([1, 2]))
   0
0  1
1  2

• To create a DataFrame with the given column names:

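data = [['Joe', 10], ['Bob', 12]]
# Cast only the numeric column; recent pandas raises if asked to cast names to float
df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})
print(df)
  Name   Age
0  Joe  10.0
1  Bob  12.0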

• To filter the DataFrame:

df.loc[df['Name'] == 'Joe']
  Name   Age
0  Joe  10.0
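Filters can also combine several conditions. The sketch below uses a made-up three-row table and shows the same filter written once as a boolean mask and once with `DataFrame.query`.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Joe", "Bob", "Amy"], "Age": [10.0, 12.0, 15.0]})

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (df["Age"] > 10) & (df["Name"] != "Amy")
older = df.loc[mask]

# The same filter expressed as a query string.
same = df.query("Age > 10 and Name != 'Amy'")

print(older)
```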

• To create DataFrame from the Series:

print(pd.DataFrame({'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']), 'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}))

```
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
```

Many operations can be performed on DataFrames, such as filtering and adding or deleting rows and columns. DataFrames also expose many methods and properties, such as isnull(), head(), tail(), empty, axes, values, size, and transpose().
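A short sketch of a few of these operations and properties on a made-up table. The column name `Senior` and the extra row for `Zed` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Joe", "Bob", "Amy"], "Age": [10.0, 12.0, None]})

df["Senior"] = df["Age"] > 11                    # add a column
df = df.drop(columns=["Senior"])                 # delete a column
extra = pd.DataFrame({"Name": ["Zed"], "Age": [9.0]})
df = pd.concat([df, extra], ignore_index=True)   # add a row
df = df.drop(index=1)                            # delete the row labelled 1

print(df.head(2))       # first two rows
print(df.tail(1))       # last row
print(df.isnull())      # True wherever a value is missing
print(df.empty)         # whether the DataFrame has no elements
print(df.size)          # number of cells (rows x columns)
print(df.axes)          # row index and column index
print(df.values)        # underlying data as a NumPy array
print(df.T)             # transpose; shorthand for df.transpose()
```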

###### Panel

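A Panel is a 3D container of data. It was deprecated in pandas 0.20 and removed in 0.25, so the snippet in this section only runs on older versions. The names of the 3 axes are: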

• items: axis 0, each item corresponds to a DataFrame contained inside.
• major_axis: axis 1, it is the rows of each of the DataFrames.
• minor_axis: axis 2, it is the columns of each of the DataFrames.

print(pd.Panel())
<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None
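On current pandas versions, where Panel is no longer available, a similar three-axis layout can be approximated with a DataFrame that has a MultiIndex. This is a sketch of one possible substitute rather than a drop-in replacement; the axis labels and values are made up.

```python
import numpy as np
import pandas as pd

items = ["item1", "item2"]     # plays the role of the items axis
major = ["r1", "r2", "r3"]     # rows of each inner table (major_axis)
minor = ["c1", "c2"]           # columns of each inner table (minor_axis)

# One row per (item, major) pair; the columns are the minor axis.
index = pd.MultiIndex.from_product([items, major], names=["items", "major_axis"])
values = np.arange(12).reshape(6, 2)
stacked = pd.DataFrame(values, index=index, columns=minor)

# Selecting one item gives back an ordinary 2D DataFrame.
item1 = stacked.loc["item1"]
print(item1)
```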