In this post, I explain pandas, an open-source library for data manipulation, analysis, and cleaning.
Pandas is a high-level data manipulation tool developed by Wes McKinney. The name is derived from "panel data", an econometrics term for multidimensional data sets. Pandas is built on top of NumPy.
The five typical steps in processing and analyzing data, regardless of its origin, are load, prepare, manipulate, model, and analyze.
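To make the five steps concrete, here is a minimal sketch of one pass through them. The inline data, the column names (hours, score), and the choice of a straight-line fit for the "model" step are all assumptions made for illustration; a real project would load its own data and pick its own model.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical inline data standing in for a real file or database.
raw = io.StringIO(
    "hours,score\n"
    "1,52\n"
    "2,61\n"
    "3,\n"
    "4,79\n"
    "5,88\n"
)

# 1. Load
df = pd.read_csv(raw)

# 2. Prepare: drop the row with a missing score
clean = df.dropna().reset_index(drop=True)

# 3. Manipulate: derive a new column
clean["passed"] = clean["score"] >= 60

# 4. Model: fit a straight line score = slope * hours + intercept
slope, intercept = np.polyfit(
    clean["hours"].to_numpy(), clean["score"].to_numpy(), deg=1
)

# 5. Analyze: summarize the prepared data and the fitted line
print(clean.describe())
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

The steps rarely run once in a straight line in practice; it is common to go back to "prepare" after the first look at the results.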
You can install pandas using pip or conda:
pip install pandas
or
conda install pandas
Start using pandas by importing it as:
import pandas as pd
Pandas has historically offered the following three data structures (Panel was removed in pandas 0.25):
- Series
- DataFrame
- Panel
Series
A Series is a value-mutable, size-immutable, one-dimensional array-like structure that holds homogeneous data.
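As a rough illustration of these properties, the sketch below changes a value in place, checks that the whole Series shares one dtype, and grows the data by building a new Series rather than resizing the old one. The values are made up for the example.

```python
import pandas as pd

s = pd.Series([10, 20, 30])

# Values are mutable: assigning through .iloc changes the element in place.
s.iloc[0] = 99

# Data is homogeneous: the whole Series shares a single dtype.
print(s.dtype)

# A typical way to "grow" a Series is to build a new one with pd.concat;
# the original keeps its length.
longer = pd.concat([s, pd.Series([40])], ignore_index=True)
print(len(s), len(longer))
```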
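- To create an empty series:
# Pass dtype explicitly; the default dtype of an empty Series differs between pandas versions.
pd.Series(dtype='float64')
Series([], dtype: float64)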
- To create a series with default indexes:
import numpy as np
data = np.array(['a', 'b', 'c', 'd'])
series = pd.Series(data)
print(series)
0    a
1    b
2    c
3    d
dtype: object
- To create a series with custom indexes:
series = pd.Series(data, index=[10, 11, 12, 13])
print(series)
10    a
11    b
12    c
13    d
dtype: object
- To create a series using a dictionary:
data = {'a': 0., 'b': 1., 'c': 2.}
series = pd.Series(data)
print(series)
a    0.0
b    1.0
c    2.0
dtype: float64
- To create a series using a dictionary with custom indexes. Here, the index order is preserved, and an index label with no matching dictionary key is filled with NaN:
data = {'a': 0., 'b': 1., 'c': 2.}
series = pd.Series(data, index=['b', 'c', 'd', 'a'])
print(series)
b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64
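Once a Series contains NaN, you usually want to find or replace the gaps. The sketch below reuses the dictionary from the example above; filling with 0.0 is just an assumption for illustration, since the right replacement depends on your data.

```python
import pandas as pd

data = {"a": 0.0, "b": 1.0, "c": 2.0}
series = pd.Series(data, index=["b", "c", "d", "a"])

# isna() flags labels that had no matching key in the dictionary.
missing = series.index[series.isna()].tolist()
print(missing)

# fillna() returns a new Series with the gaps replaced.
filled = series.fillna(0.0)
print(filled["d"])
```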
- To access a single element of the series by position:
print(series.iloc[0])
1.0
- To access multiple elements of the series by position:
print(series.iloc[[0, 1]])
b    1.0
c    2.0
dtype: float64
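Elements can be selected by label as well as by position. The following sketch contrasts the two accessors, .loc (labels) and .iloc (positions), on a small made-up Series:

```python
import pandas as pd

series = pd.Series([1.0, 2.0, 0.0], index=["b", "c", "a"])

by_label = series.loc["c"]        # label-based lookup
by_position = series.iloc[1]      # position-based lookup (second element)
subset = series.loc[["a", "b"]]   # several labels, in the order requested
first_two = series.iloc[:2]       # positional slice

print(by_label, by_position)
print(subset)
print(first_two)
```

Being explicit about .loc or .iloc avoids ambiguity when the index itself contains integers.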
DataFrame
A DataFrame lets you store and manipulate tabular data in rows of observations and columns of variables. It is a size-mutable, value-mutable, two-dimensional structure that can hold heterogeneous data. For example, in an employee management system, employeeName can be a column and each row can describe one employee. A DataFrame can be created by loading a dataset from existing storage such as a SQL database, a CSV file, or an Excel file. It can also be created from lists, a dictionary, a list of dictionaries, and so on.
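Here is a small sketch of the dictionary and list-of-dictionaries routes. The employeeName column comes from the example above; the age column and all values are invented for illustration.

```python
import pandas as pd

# Dictionary of lists: keys become column names.
from_dict = pd.DataFrame({"employeeName": ["Joe", "Bob"], "age": [30, 41]})

# List of dictionaries: each dictionary becomes one row;
# keys missing from a row are filled with NaN.
from_records = pd.DataFrame(
    [{"employeeName": "Joe", "age": 30}, {"employeeName": "Bob"}]
)

print(from_dict.dtypes)
print(from_records)
```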
- To create a DataFrame from a CSV file:
df = pd.read_csv(file)
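If you do not have a CSV file at hand, you can try read_csv on an in-memory buffer. In this sketch, the CSV text is made up and io.StringIO stands in for the file:

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would be a path or file object.
csv_text = "Name,Age\nJoe,10\nBob,12\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)
```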
- To create a DataFrame from the database:
df = pd.read_sql_query(query,conn)
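The query and conn arguments above can be tried out with Python's built-in sqlite3 module. The sketch below assumes an in-memory database with a hypothetical people table; with another database you would pass a different connection object.

```python
import sqlite3

import pandas as pd

# Build a throwaway in-memory database (table and rows are hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [("Joe", 10), ("Bob", 12)])
conn.commit()

# Parameters are passed separately instead of being formatted into the SQL string.
query = "SELECT name, age FROM people WHERE age > ?"
df = pd.read_sql_query(query, conn, params=(10,))
conn.close()

print(df)
```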
- To create a DataFrame from an Excel file:
df = pd.read_excel(file, sheet_name=sheet_name)
- To create an empty DataFrame:
print(pd.DataFrame())
Empty DataFrame
Columns: []
Index: []
- To create a DataFrame from a list:
print(pd.DataFrame([1, 2]))
   0
0  1
1  2
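- To create a DataFrame with the given column names and a float Age column:
data = [['Joe', 10], ['Bob', 12]]
# Cast only the numeric column; recent pandas versions raise an error if dtype=float is applied to the names.
df = pd.DataFrame(data, columns=['Name', 'Age']).astype({'Age': float})
print(df)
  Name   Age
0  Joe  10.0
1  Bob  12.0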
- To filter the DataFrame:
df.loc[df['Name'] == 'Joe']
  Name   Age
0  Joe  10.0
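Filters can also combine several conditions. The sketch below extends the Name/Age table with one extra made-up row and shows two common patterns: combining boolean masks with & and matching against a list of values with isin.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Joe", "Bob", "Ann"], "Age": [10.0, 12.0, 15.0]})

# Each condition needs its own parentheses when combined with & or |.
older = df.loc[(df["Age"] > 10) & (df["Name"] != "Ann")]

# Select rows whose Name is in a list, keeping only the Name column.
names = df.loc[df["Name"].isin(["Joe", "Ann"]), "Name"]

print(older)
print(names)
```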
- To create a DataFrame from Series:
print(pd.DataFrame({
    'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
    'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}))
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4
Many operations can be performed on DataFrames, such as filtering and adding or deleting rows and columns. DataFrames also come with many methods and properties, such as isnull(), head(), tail(), empty, axes, values, size, and transpose().
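The following sketch walks through the operations and properties named above on a small made-up table. The Senior column and the extra row are hypothetical and exist only to show adding and deleting.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Joe", "Bob"], "Age": [10.0, None]})

# Add a column, then delete it again.
df["Senior"] = df["Age"] > 11
df = df.drop(columns=["Senior"])

# Add a row by concatenating a one-row DataFrame, then delete the row labeled 0.
new_row = pd.DataFrame([{"Name": "Ann", "Age": 15.0}])
df = pd.concat([df, new_row], ignore_index=True)
df = df.drop(index=0)

print(df.isnull())      # True where a value is missing
print(df.head(1))       # first row
print(df.tail(1))       # last row
print(df.empty)         # whether the frame has no elements
print(df.size)          # number of cells (rows * columns)
print(df.axes)          # [row index, column index]
print(df.values)        # underlying data as a NumPy array
print(df.transpose())   # rows and columns swapped
```

Note that drop and concat return new DataFrames, which is why the results are assigned back to df.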
Panel
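A Panel is a 3D container of data. It was deprecated in pandas 0.20 and removed in pandas 0.25, so the example in this section only runs on older versions. The names of the three axes are: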
- items: axis 0, each item corresponds to a DataFrame contained inside.
- major_axis: axis 1, it is the rows of each of the DataFrames.
- minor_axis: axis 2, it is the columns of each of the DataFrames.
print(pd.Panel())  # requires a pandas version older than 0.25
<class 'pandas.core.panel.Panel'>
Dimensions: 0 (items) x 0 (major_axis) x 0 (minor_axis)
Items axis: None
Major_axis axis: None
Minor_axis axis: None
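On current pandas versions, one common way to hold the same kind of 3D data is a DataFrame with a MultiIndex, where the outer index level plays the role of the Panel's items axis. This is a sketch of that idea, not a drop-in replacement; the tables, labels, and level names below are invented for illustration.

```python
import pandas as pd

# Two hypothetical 2x2 tables that a Panel would have stored as "items".
q1 = pd.DataFrame({"sales": [1, 2], "cost": [3, 4]}, index=["north", "south"])
q2 = pd.DataFrame({"sales": [5, 6], "cost": [7, 8]}, index=["north", "south"])

# Stack them under an outer index level; dictionary keys become its labels.
stacked = pd.concat({"q1": q1, "q2": q2}, names=["item", "region"])

print(stacked)
print(stacked.loc["q2"])                     # one "item" back as a 2D table
print(stacked.loc[("q1", "south"), "cost"])  # a single cell
```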
Thanks for reading!