# Data Analysis Using Python

Reading Time: 4 minutes

In this blog we will give an overview of the Python packages used for data analysis. Then we will learn how to import and export data to and from Python, and how to obtain basic insights from datasets.

For understanding the basic concepts of Data Analytics, you can go through this link.

### Python packages for Data Analysis:

In order to do data analysis in Python, these are a few libraries that help us perform operations with minimal code.

• Pandas is used to provide easy indexing functionality via dataframes.
• NumPy is useful for arrays and operations on arrays.
• SciPy includes functions for some advanced math problems.
• Matplotlib is the most well-known library for data visualization; it is great for making graphs and plots.
• Seaborn is based on Matplotlib; it makes it very easy to generate plots such as heat maps, time series, and violin plots.
• Scikit-learn contains tools for machine learning and statistical modeling, including regression, classification, clustering and so on. With its algorithms, we are able to develop a model using our dataset and obtain predictions.
• StatsModels is also a Python module that allows users to explore data, estimate statistical models, and perform statistical tests.
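
The division of labor between the first two libraries can be sketched as follows (the makes and prices here are illustrative values, not taken from any particular dataset):

```python
import numpy as np
import pandas as pd

# NumPy holds the raw numerical arrays...
prices = np.array([13495.0, 16500.0, 18920.0])

# ...while Pandas wraps them in a labeled dataframe for easy indexing
df = pd.DataFrame({"make": ["alfa-romero", "alfa-romero", "audi"],
                   "price": prices})

# select a column by label, then compute on it with NumPy
print(df["make"].tolist())
print(np.mean(df["price"]))
```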

### Importing of Data

Importing is the process of loading and reading data into a Python program from various sources. A file has two important properties:

1. format – .csv, .json, .xlsx
2. file path – C:/Desktop/python/…..
```python
# import libraries
import pandas as pd
import numpy as np
```

We use the `pandas.read_csv()` function to read a csv file. In the brackets, we put the file path in quotation marks so that pandas will read the file into a dataframe from that address. The file path can be either a URL or your local file address.

Because the data does not include headers, we can add the argument `header=None` inside the `read_csv()` call so that pandas will not automatically set the first row as a header.

```python
# Import pandas library
import pandas as pd

# Read the online file from the URL provided above, and assign it to variable "df"
other_path = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"
df = pd.read_csv(other_path, header=None)
```

After reading the dataset, we can use the `dataframe.head(n)` method to check the top n rows of the dataframe, where n is an integer. In contrast to `dataframe.head(n)`, `dataframe.tail(n)` will show you the bottom n rows of the dataframe.

```python
# show the first 5 rows using dataframe.head() method
print("The first 5 rows of the dataframe")
print(df.head(5))
```

First, we create a list “headers” that includes all column names in order. Then, we use `dataframe.columns = headers` to replace the headers with the list we created.

```python
# create headers list
headers = ["symboling", "normalized-losses", "make", "fuel-type", "aspiration",
           "num-of-doors", "body-style", "drive-wheels", "engine-location",
           "wheel-base", "length", "width", "height", "curb-weight", "engine-type",
           "num-of-cylinders", "engine-size", "fuel-system", "bore", "stroke",
           "compression-ratio", "horsepower", "peak-rpm", "city-mpg",
           "highway-mpg", "price"]

# replace the default integer headers with the list we created
df.columns = headers
```

The main types stored in Pandas dataframes are object, float, int, bool, and datetime64. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas, we can check the data types with:

```python
df.dtypes
```
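
On a small illustrative dataframe (not the auto dataset), `dtypes` reports one type per column, and a mistyped column can be converted with `astype()`:

```python
import pandas as pd

# toy dataframe: one object column, one int column, one float column
df = pd.DataFrame({"make": ["audi", "bmw"],
                   "doors": [4, 2],
                   "price": [18920.0, 16430.0]})

print(df.dtypes)

# convert an integer column to float
df["doors"] = df["doors"].astype(float)
print(df.dtypes)
```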

### Exporting of Data

To export the dataframe to a .csv file, we use the `to_csv()` method with `index=False` so that the row index is not written into the file.

```python
df.to_csv("automobile.csv", index=False)
```
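
`to_csv` is one of several exporters; pandas provides matching writers for the other formats mentioned earlier. The file names below are just examples:

```python
import pandas as pd

df = pd.DataFrame({"make": ["audi", "bmw"], "price": [18920.0, 16430.0]})

# index=False keeps the row index out of the file
df.to_csv("automobile.csv", index=False)
df.to_json("automobile.json")
# .xlsx export works the same way, but requires the openpyxl package:
# df.to_excel("automobile.xlsx", index=False)

# round-trip check: read the csv back into a new dataframe
df2 = pd.read_csv("automobile.csv")
print(df2.shape)
```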
##### Describe

If we would like to get a statistical summary of each column (e.g. count, column mean value, column standard deviation, etc.), we use the `describe()` method:

```python
# general form: dataframe.describe()
df.describe()
```
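
By default, `describe()` summarizes only the numeric columns; object columns are skipped unless we pass `include="all"`. A toy example with hypothetical values:

```python
import pandas as pd

df = pd.DataFrame({"make": ["audi", "bmw", "audi"],
                   "price": [18920.0, 16430.0, 23875.0]})

summary = df.describe()
print(summary)

# 'make' is an object column, so it is excluded by default;
# df.describe(include="all") would summarize it as well
print(summary.loc["count", "price"])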

### Data Normalization

Normalization is the process of transforming values of several variables into a similar range. Typical normalizations include scaling the variable so the variable average is 0, scaling the variable so the variance is 1, or scaling the variable so the variable values range from 0 to 1.
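
The three typical normalizations above can each be written in one line with pandas; here is a sketch on a hypothetical column `x`:

```python
import pandas as pd

x = pd.Series([160.0, 170.0, 190.0, 200.0])

# 1. center the variable so its average is 0
centered = x - x.mean()

# 2. z-score: average 0 and variance 1
zscore = (x - x.mean()) / x.std()

# 3. min-max: values range from 0 to 1
minmax = (x - x.min()) / (x.max() - x.min())

print(centered.tolist())
print(minmax.tolist())
```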

Example

To demonstrate normalization, let’s say we want to scale the columns “length”, “width” and “height”.

Target: would like to normalize those variables so their value ranges from 0 to 1

Approach: replace original value by (original value)/(maximum value)

```python
# replace (original value) by (original value)/(maximum value)
df['length'] = df['length']/df['length'].max()
df['width'] = df['width']/df['width'].max()
df['height'] = df['height']/df['height'].max()

# show the scaled columns
print(df[['length', 'width', 'height']].head())
```

### Binning

Binning is a process of transforming continuous numerical variables into discrete categorical ‘bins’ for grouped analysis.

Example:

In our dataset, “horsepower” is a real-valued variable ranging from 48 to 288, and it has 59 unique values. What if we only care about the price difference between cars with high horsepower, medium horsepower, and low horsepower (3 types)? Can we rearrange them into three ‘bins’ to simplify analysis?

We will use the pandas function `cut` to segment the ‘horsepower’ column into 3 bins.

```python
# convert horsepower to integer type before binning
df["horsepower"] = df["horsepower"].astype(int, copy=True)
```

Let’s plot the histogram of horsepower to see what the distribution of horsepower looks like.

```python
import matplotlib.pyplot as plt

# plot the histogram of horsepower
plt.hist(df["horsepower"])

# set x/y labels and plot title
plt.xlabel("horsepower")
plt.ylabel("count")
plt.title("horsepower bins")
plt.show()
```

We would like 3 bins of equal width, so we use numpy’s `linspace(start_value, end_value, numbers_generated)` function. Since 3 equal-width bins need 4 dividers, `numbers_generated` is 4.

```python
bins = np.linspace(min(df["horsepower"]), max(df["horsepower"]), 4)
group_names = ['Low', 'Medium', 'High']
df['horsepower-binned'] = pd.cut(df['horsepower'], bins, labels=group_names, include_lowest=True)
```