Every machine learning practitioner will agree with me when I say that one of the most important part of machine learning is preparing data for machine learning. It certainly requires some experience to properly and effectively prepare data for machine learning. Although data preparation is in itself a really big topic, today we will only be looking at a part of its process, that is feature scaling.
From Wikipedia –
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step.
So, in simple terms, feature scaling is bringing the features of a data set to approximately a same scale.
That’s fine, but how does it help in improving accuracy?
Let’s understand that using an example. Suppose we want to find out the dependence of an athlete’s speed on his/her height and weight. So, our dataset consists of heights say in the range 4 ft. to 7 ft. and weights in the range 70 to 120 kilograms. Now, as we can clearly see, the range of weights is more than the range of heights. If we feed in these parameters to our ML model as it is, then our model will give higher weight-age to the weight of an athlete than his/her height. So, a change in the weight of an athlete will be reflected much more than the change in the height, which is totally undesirable, and at this point, the accuracy of our algorithms will be very low. If we were to scale down weight to approximately to the scale of height, then we will be able to get a much more desirable result, as changes in both parameters will be equally reflected, giving us much more accuracy than before.
With few exceptions, Machine Learning algorithms don’t perform well when the input numerical attributes have very different scales. Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.
Another reason why feature scaling is applied is that gradient descent converges much faster with feature scaling than without it.
Two of the most common ways to apply feature scaling, i.e., to get all attributes to have the same scale are Min-max scaling and standardization.
Min-max scaling (many people call this normalization) is quite simple, values are shifted and rescaled so that they end up ranging from 0 to 1. We do this by subtracting the min value and dividing by the max minus the min.
where, x = value to be normalized. min(x) = minimum value of the attribute in the data set. max(x) = maximum value of the attribute in the data set. z = value after normalization.
So, if a data set had some attribute with the values – 29, 59, 53, 76, the min-max scaling would have produced – 0, 0.63, 0.51, 1.
Standardization is quite different: first, it subtracts the mean value (so standardized values always have a zero mean), and then it divides by the variance so that the resulting distribution has unit variance. Unlike min-max scaling, standardization does not
bound values to a specific range, which may be a problem for some algorithms (e.g., neural networks often expect an input value ranging from 0 to 1). However, standardization is much less affected by outliers.
where, x = value to be normalized µ = mean of the attribute in the data set. σ² = variance of the attribute in the data set. z = value after standardization.
So, if a dataset had some attribute with the values – 29, 59, 53, 76, standardization would have produced – -0.066, 0.012, -0.003, 0.057.
Feature scaling can improve the accuracy of your ML models significantly, but implementing it can be quite tricky and can only be mastered through experience. It is certainly one of those things you cannot ignore.
- Ronak Choksy’s answer on this Quora question.
- Sebastian Raschka’s article.
- Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron.
- Feature scaling – Wikipedia