In this blog we will give an overview of the Python packages used for data analysis. We will also learn how to import data into Python and export it, and how to obtain basic insights from our datasets.
For an understanding of the basic concepts of Data Analytics, you can go through this link.
Python packages for Data Analysis:
In order to do analysis in Python, these are a few libraries that help us perform operations with minimal code.
- Pandas is used to provide easy indexing functionality via dataframes.
- NumPy is useful for arrays and operations on arrays.
- SciPy includes functions for some advanced math problems.
- Matplotlib package is the most well-known library for data visualization. It is great for making graphs and plots.
- Seaborn is based on Matplotlib. It makes it very easy to generate various plots such as heat maps, time series, and violin plots.
- Scikit-learn contains tools for statistical modeling and machine learning, including regression, classification, clustering and so on. With these algorithms, we're able to develop a model using our dataset and obtain predictions.
- StatsModels is a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
Importing of Data
Importing data is the process of loading and reading data into a Python program from various resources. It has two important properties:
- format – .csv, .json, .xlsx
- file path – C:/Desktop/python/…..
```python
# import libraries
import pandas as pd
import numpy as np
```
We use the pandas.read_csv() function to read the csv file. In the brackets, we put the file path inside quotation marks so that pandas will read the file into a dataframe from that address. The file path can be either a URL or your local file address.
Because the data does not include headers, we can add the argument header=None inside the read_csv() method so that pandas will not automatically set the first row as a header.
```python
# Import pandas library
import pandas as pd

# Read the online file by the URL provided above, and assign it to the variable "df"
other_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"
df = pd.read_csv(other_path, header=None)
```
After reading the dataset, we can use the dataframe.head(n) method to check the top n rows of the dataframe, where n is an integer. Conversely, dataframe.tail(n) will show you the bottom n rows of the dataframe.
```python
# show the first 5 rows using the dataframe.head() method
print("The first 5 rows of the dataframe")
df.head(5)
```
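The dataframe.tail(n) counterpart can be tried the same way; here is a minimal sketch on a small stand-in dataframe (the values below are illustrative, not the auto dataset):

```python
import pandas as pd

# a small stand-in dataframe with made-up horsepower values
df = pd.DataFrame({"horsepower": [48, 70, 95, 120, 288]})

# show the last 2 rows of the dataframe
print(df.tail(2))
```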
First, we create a list "headers" that includes all column names in order. Then, we use dataframe.columns = headers to replace the default headers with the list we created.
```python
# create headers list
headers = ["symboling","normalized-losses","make","fuel-type","aspiration",
           "num-of-doors","body-style","drive-wheels","engine-location","wheel-base",
           "length","width","height","curb-weight","engine-type","num-of-cylinders",
           "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
           "peak-rpm","city-mpg","highway-mpg","price"]
print("headers\n", headers)

# replace the default headers with the list we created
df.columns = headers
```
The main types stored in Pandas dataframes are object, float, int, bool and datetime64. In order to better learn about each attribute, it is always good for us to know the data type of each column. In pandas, the dtypes attribute returns the data type of each column.
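As a sketch, the dtypes attribute reports each column's type (the columns below mimic a few auto-dataset fields with made-up values):

```python
import pandas as pd

# illustrative columns: a string column, a float column, and an int column
df = pd.DataFrame({"make": ["audi", "bmw"],
                   "price": [13950.0, 16430.0],
                   "curb-weight": [2337, 2823]})

# dtypes returns the data type of each column
print(df.dtypes)
```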
Exporting of a .csv file
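As a minimal sketch, pandas exports a dataframe to a csv file with the to_csv() method; the filename "automobile.csv" here is just an illustrative choice:

```python
import pandas as pd

# a small dataframe to export (illustrative values)
df = pd.DataFrame({"length": [0.81, 0.82], "width": [0.89, 0.91]})

# write the dataframe to a csv file; index=False drops the row-index column
df.to_csv("automobile.csv", index=False)
```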
If we would like to get a statistical summary of each column (e.g. count, column mean value, column standard deviation, etc.), we use the describe() method.
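A minimal sketch on illustrative values:

```python
import pandas as pd

# a single numeric column with made-up horsepower values
df = pd.DataFrame({"horsepower": [48.0, 100.0, 152.0]})

# describe() reports count, mean, std, min, quartiles, and max per numeric column
summary = df.describe()
print(summary)
```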
Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
To demonstrate normalization, let’s say we want to scale the columns “length”, “width” and “height”.
Target: would like to normalize those variables so their value ranges from 0 to 1
Approach: replace original value by (original value)/(maximum value)
```python
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()

# show the scaled columns
df[["length","width","height"]].head()
```
Binning is a process of transforming continuous numerical variables into discrete categorical ‘bins’ for grouped analysis.
In our dataset, "horsepower" is a real-valued variable ranging from 48 to 288, and it has 59 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and low horsepower (3 types)? Can we rearrange them into three 'bins' to simplify analysis?
We will use the pandas method 'cut' to segment the 'horsepower' column into 3 bins.
Let’s plot the histogram of horsepower to see what the distribution of horsepower looks like.
```python
import matplotlib.pyplot as plt

# plot the histogram of horsepower
plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
```
We would like 3 bins of equal bandwidth, so we use numpy's linspace(start_value, end_value, numbers_generated) function.
```python
# 4 equally spaced dividers produce 3 equal-width bins
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
group_names = ['Low', 'Medium', 'High']

# segment horsepower into the three labelled bins
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)
df[['horsepower','horsepower-binned']].head(20)
```
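The resulting bin membership can be inspected with value_counts(); a sketch on a small stand-in series (the horsepower values below are illustrative):

```python
import numpy as np
import pandas as pd

hp = pd.Series([48, 60, 100, 150, 200, 288])

# 4 equally spaced dividers between min and max: [48, 128, 208, 288]
bins = np.linspace(hp.min(), hp.max(), 4)
labels = ["Low", "Medium", "High"]
binned = pd.cut(hp, bins, labels=labels, include_lowest=True)

# number of values falling into each bin
print(binned.value_counts())
```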
In this blog we have covered importing and exporting .csv files, analysis using the pandas library, graphs and plots using Matplotlib, normalization, and binning.