Data Analysis Using Python


In this blog, we will give an overview of the Python packages used for data analysis. We will then learn how to import data into and export data from Python, and how to obtain basic insights from a dataset.

To understand the basic concepts of Data Analytics, you can go through this link.

Python packages for Data Analysis:

In order to do analysis in Python, these are a few libraries that help us perform operations with minimal code.

  • Pandas provides easy-to-use dataframes with powerful indexing functionality.
  • NumPy is useful for working with arrays and operations on arrays.
  • SciPy includes functions for some advanced math problems.
  • Matplotlib is the most well-known library for data visualization; it is great for making graphs and plots.
  • Seaborn is based on Matplotlib and makes it very easy to generate plots such as heat maps, time series, and violin plots.
  • Scikit-learn contains tools for statistical modeling and machine learning, including regression, classification, clustering, and so on. With these algorithms we can develop a model from our dataset and obtain predictions (a minimal sketch follows this list).
  • StatsModels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
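
As a quick taste of how a few of these libraries fit together, here is a minimal sketch that fits a linear regression on a tiny made-up dataset (the numbers and column names below are illustrative only, and scikit-learn is assumed to be installed):

# build a tiny dataframe with made-up engine sizes and prices (illustration only)
import pandas as pd
from sklearn.linear_model import LinearRegression

data = pd.DataFrame({
    "engine-size": [97, 109, 130, 152, 183],
    "price": [5499, 7609, 13950, 17450, 25552],
})

# fit a simple linear regression: price as a function of engine size
model = LinearRegression()
model.fit(data[["engine-size"]], data["price"])

# predict the price of a hypothetical car with engine size 120
print(model.predict(pd.DataFrame({"engine-size": [120]})))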

Importing Data

Importing is the process of loading and reading data into a Python program from various sources.

Two important properties of a data file are:

  1. format – .csv, .json, .xlsx
  2. file path – C:/Desktop/python/…..
# import library
import pandas as pd
import numpy as np
Read Data

We use the pandas.read_csv() function to read the csv file. In the brackets, we put the file path within quotation marks so that pandas reads the file into a dataframe from that address. The file path can be either a URL or your local file address.

Because the data does not include headers, we add the argument header=None inside the read_csv() method so that pandas will not automatically set the first row as the header.

# Import pandas library
import pandas as pd

# Read the online file by the URL provided above, and assign it to the variable "df"
other_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"
df = pd.read_csv(other_path, header=None)

After reading the dataset, we can use the dataframe.head(n) method to check the top n rows of the dataframe, where n is an integer. In contrast, dataframe.tail(n) will show you the bottom n rows of the dataframe.

# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe") 
df.head(5)
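
Similarly, we can check the bottom of the dataset with dataframe.tail(n):

# show the last 5 rows using dataframe.tail() method
print("The last 5 rows of the dataframe")
df.tail(5)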

First, we create a list “headers” that includes all column names in order. Then, we use dataframe.columns = headers to replace the default headers with the list we created.

# create headers list
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location", "wheel-base",
           "length", "width", "height", "curb-weight", "engine-type", "num-of-cylinders",
           "engine-size", "fuel-system", "bore", "stroke", "compression-ratio", "horsepower",
           "peak-rpm", "city-mpg", "highway-mpg", "price"]
print("headers\n", headers)

The main types stored in Pandas dataframes are object, float, int, bool, and datetime64. In order to better understand each attribute, it is always good to know the data type of each column. In Pandas, we can check the data types with the dtypes attribute:

df.dtypes
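
For a more complete overview in one call, dataframe.info() prints the column names, the number of non-null entries, and the dtype of each column:

# concise summary: columns, non-null counts and dtypes
df.info()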

Exporting a .csv file

To save the dataframe as a csv file on your machine, we use the dataframe.to_csv() method; setting index=False leaves the row index out of the file.

df.to_csv("automobile.csv", index=False)
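
pandas offers similar writers for other formats as well; for example (the Excel writer additionally assumes an engine such as openpyxl is installed):

# export the same dataframe to JSON and Excel
df.to_json("automobile.json")
df.to_excel("automobile.xlsx", index=False)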
Describe

If we would like to get a statistical summary of each column (e.g. count, column mean value, column standard deviation, etc.), we use the describe method:

# general syntax: dataframe.describe()
df.describe()
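
By default, describe() only summarises the numeric columns; passing include="all" also summarises the object (string) columns:

# include object-type columns in the summary as well
df.describe(include="all")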

Data Normalization

Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
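
As a quick illustration of these three approaches (a sketch, assuming col is any numeric pandas Series, e.g. df['length']):

# simple feature scaling: divide by the maximum (for positive data this maps values into (0, 1])
col_scaled = col / col.max()

# min-max scaling: values end up in [0, 1]
col_minmax = (col - col.min()) / (col.max() - col.min())

# z-score standardisation: mean 0 and standard deviation 1
col_zscore = (col - col.mean()) / col.std()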

Example

To demonstrate normalization, let’s say we want to scale the columns “length”, “width” and “height”.

Target: would like to normalize those variables so their value ranges from 0 to 1

Approach: replace original value by (original value)/(maximum value)

# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()

# show the scaled columns
df[["length","width","height"]].head()

Binning

Binning is a process of transforming continuous numerical variables into discrete categorical ‘bins’ for grouped analysis.

Example:

In our dataset, “horsepower” is a real-valued variable ranging from 48 to 288, and it has 59 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and low horsepower (3 types)? Can we rearrange them into three ‘bins’ to simplify the analysis?

We will use the pandas method ‘cut’ to segment the ‘horsepower’ column into 3 bins.

df["horsepower"]=df["horsepower"].astype(int, copy=True)

Let’s plot the histogram of horsepower to see what the distribution of horsepower looks like.

import matplotlib.pyplot as plt

# plot the histogram of horsepower to see its distribution
plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")

We would like 3 bins of equal width, so we use numpy’s linspace(start_value, end_value, numbers_generated) function.

bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
group_names = ['Low', 'Medium', 'High']
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)
df[['horsepower','horsepower-binned']].head(20)
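
To see how many vehicles ended up in each bin, we can apply value_counts() to the new column:

# count the number of cars in each horsepower bin
df["horsepower-binned"].value_counts()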

Conclusion

In this blog we have covered the importing and exporting of .csv files, analysis using the pandas library, graphs and plots using matplotlib, normalization, and binning.

Written by 

Lokesh Kumar is an intern in the AI/ML studio at Knoldus. He is passionate about Artificial Intelligence and Machine Learning, with knowledge of C, C++, Python, Data Analytics, and much more. He is recognised as a good team player, a dedicated and responsible professional, and a technology enthusiast. He is a quick learner and curious to learn new technologies.
