A Simple Guide to Data Preprocessing in Machine Learning

Reading Time: 5 minutes

Machine learning algorithms are completely data-dependent: data is what enables model training in the first place. At the same time, if you feed data to an ML algorithm without understanding it first, the resulting model is useless. Simply put, you always need to provide the right data, which is why preparing data with the appropriate scale, format, and meaningful attributes for the problem the machine is meant to solve becomes an utmost important step.

Why Data Pre-processing?

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data preprocessing converts the selected data into a form we can work with and feed to ML algorithms. We always need to preprocess our data so that it meets the expectations of the machine learning algorithm.

Data Pre-processing Techniques:

1. Scaling

In most cases, the dataset contains attributes of various scales. Such data cannot be provided to the ML algorithm as-is and needs to be rescaled. Rescaling puts all attributes on the same scale, generally in the range 0 to 1. ML algorithms such as gradient descent and k-Nearest Neighbors require scaled data. You can rescale your data using the MinMaxScaler class in the scikit-learn Python library.

from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

## Now we can use MinMaxScaler
data_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)

#Print rescaled data
set_printoptions(precision=1)
print("\nScaled data:\n", data_rescaled[0:10])
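To see what MinMaxScaler is actually computing, here is a minimal NumPy sketch on made-up numbers: the rescaling is just (x - min) / (max - min), applied column-wise.

```python
import numpy as np

# Hypothetical toy data: two attributes on very different scales.
X = np.array([[1.0, 100.0],
              [2.0, 300.0],
              [3.0, 500.0]])

# Min-max rescaling, computed per column: (x - min) / (max - min).
# This is the same transformation MinMaxScaler(feature_range=(0, 1)) applies.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)
# Both columns now span exactly [0, 1], regardless of their original units.
```

After rescaling, the smallest value in each column maps to 0 and the largest to 1, so attributes measured in wildly different units contribute comparably to distance-based algorithms like k-Nearest Neighbors.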

2. Normalization

Another data preprocessing technique used in data preparation for machine learning is normalization. It is used to rescale each row of data to a length of 1. This is mainly useful for sparse datasets with lots of zeros. You can rescale the data using the Normalizer class in the scikit-learn Python library.

Types of Normalization

1. L1 Normalization

It may be defined as the normalization technique that modifies the dataset values so that in each row the sum of the absolute values is always 1. It is also known as Least Absolute Deviations. Example code:

from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

## Now, we can use the Normalizer class with L1 to normalize the data.

data_normalizer = Normalizer(norm='l1').fit(array)
data_normalized = data_normalizer.transform(array)

## We can also summarize the output as per our choice. Here, we set the precision to 2 and show the first 3 rows.

set_printoptions(precision=2)
print("\nNormalized data:\n", data_normalized[0:3])
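The effect of L1 normalization is easy to verify on made-up numbers: each row is divided by the sum of its absolute values, so afterwards those sums are all exactly 1. A minimal NumPy sketch of what Normalizer(norm='l1') computes:

```python
import numpy as np

# Hypothetical toy rows; one holds a negative value to show the
# absolute-value part of L1 normalization.
rows = np.array([[1.0, 2.0, -3.0],
                 [4.0, 0.0, 4.0]])

# L1 normalization: divide each row by the sum of its absolute values.
l1_normalized = rows / np.abs(rows).sum(axis=1, keepdims=True)

print(l1_normalized)
# Each row's absolute values now sum to 1.
print(np.abs(l1_normalized).sum(axis=1))
```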

2. L2 Normalization

Another normalization method for preparing data in machine learning, it may be defined as the technique that modifies the dataset values so that in each row the sum of the squares is always 1. It is also called least squares. Example code:

from pandas import read_csv
from numpy import set_printoptions
from sklearn.preprocessing import Normalizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

## Now, we can use the Normalizer class with L2 to normalize the data.

data_normalizer = Normalizer(norm='l2').fit(array)
data_normalized = data_normalizer.transform(array)

## We can also summarize the output as per our choice. Here, we set the precision to 2 and show the first 3 rows.

set_printoptions(precision=2)
print("\nNormalized data:\n", data_normalized[0:3])
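L2 normalization can be sketched the same way on toy numbers: each row is divided by its Euclidean length, so every row ends up as a unit vector. Note that rows pointing in the same direction collapse to the same vector, which is exactly the point: only direction is preserved, not magnitude.

```python
import numpy as np

# Hypothetical toy rows; the second is just the first scaled by 2.
rows = np.array([[3.0, 4.0],
                 [6.0, 8.0]])

# L2 normalization: divide each row by its Euclidean length,
# the same computation as Normalizer(norm='l2').
l2_normalized = rows / np.linalg.norm(rows, axis=1, keepdims=True)

print(l2_normalized)
# Both rows map to the same unit vector [0.6, 0.8].
```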

3. Binarization

As the name implies, this is a technique for making data binary. You can use a binary threshold: values above the threshold are converted to 1 and values at or below it are converted to 0. For example, if you select threshold = 0.5, values greater than 0.5 become 1 and the rest become 0. Therefore, it can be called data binarization or data thresholding. This technique is useful if your dataset contains probabilities and you want to convert them to crisp values. You can binarize the data using the Binarizer class of the scikit-learn Python library.

Example

from pandas import read_csv
from sklearn.preprocessing import Binarizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

## Now, we can use the Binarizer class to convert the data into binary values.

binarizer = Binarizer(threshold=0.5).fit(array)
data_binarized = binarizer.transform(array)

## Here, we are showing the first 5 rows in the output.

print("\nBinary data:\n", data_binarized[0:5])
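The thresholding itself is a one-liner; here is a minimal sketch on hypothetical probabilities showing the rule Binarizer applies (values strictly greater than the threshold become 1, everything else 0).

```python
import numpy as np

# Hypothetical model probabilities to turn into hard 0/1 decisions.
probs = np.array([0.1, 0.5, 0.51, 0.9])

# Thresholding as Binarizer(threshold=0.5) does: strictly greater
# than the threshold -> 1, at or below it -> 0.
binary = (probs > 0.5).astype(int)

print(binary)
# Note that 0.5 itself maps to 0, since the comparison is strict.
```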

4. Standardization

Another beneficial data preprocessing technique, largely used to transform data attributes with a Gaussian distribution. It shifts the mean and SD (standard deviation) to a standard Gaussian distribution with a mean of 0 and an SD of 1. This technique is beneficial in ML algorithms like linear regression and logistic regression that assume a Gaussian distribution in the input dataset and produce better results with rescaled data. We can standardize the data (mean = 0 and SD = 1) with the help of the StandardScaler class of the scikit-learn Python library. Example code:

from sklearn.preprocessing import StandardScaler
from pandas import read_csv
from numpy import set_printoptions
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

#Now, we can use StandardScaler class to rescale the data.

data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)

#summarize the data 

set_printoptions(precision=2)
print("\nRescaled data:\n", data_rescaled[0:5])
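The transformation StandardScaler performs per column is the familiar z-score: subtract the mean, divide by the standard deviation. A minimal NumPy sketch on made-up numbers, verifying that the result has mean 0 and SD 1:

```python
import numpy as np

# Hypothetical column of values with mean != 0 and SD != 1.
x = np.array([2.0, 4.0, 6.0, 8.0])

# Standardization (z-score), as StandardScaler computes it per column:
# subtract the mean, divide by the (population) standard deviation.
z = (x - x.mean()) / x.std()

print(z.mean(), z.std())
# The standardized column has mean 0 and standard deviation 1.
```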

Data Labeling

We have described the importance of good data for ML algorithms and some techniques for preprocessing the data before sending it to the ML algorithm. Another aspect of this context is data labeling. It is also very important to send the data to the ML algorithm with proper labeling. For example, in the case of classification problems, the data carries many labels in the form of words, numbers, and so on.

What is Label Encoding?

Most sklearn functions expect data with number labels rather than word labels. Hence, we need to convert such labels into number labels. This process is called label encoding. We can perform label encoding with the help of the LabelEncoder class of the scikit-learn Python library. Example code:

from sklearn import preprocessing

## Now, we need to provide the input labels as follows:

input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

## The next lines create the label encoder and train it.

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

## The next lines check the performance by encoding a randomly ordered list.

test_labels = ['green', 'red', 'black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)
print("Encoded values =", list(encoded_values))

## We can also decode a set of numbers back to their labels.

encoded_values = [3, 0, 4, 1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)
print("Decoded labels =", list(decoded_list))

Conclusion

In this blog, we have looked at some data preprocessing techniques in detail with working examples, along with insight into why data preparation is one of the most important steps in machine learning.
