MachineX: The second dimensionality reduction method


In the previous blog, we went through how more data, or to be precise more dimensions in the data, creates problems like overfitting in classification and regression algorithms. This is known as the “curse of dimensionality”. We then went through the solution to the problem, i.e. dimensionality reduction, and focused mainly on one of the dimensionality reduction methods, feature selection. In this blog, we will go through the second dimensionality reduction method that we were discussing: feature extraction.

Unlike feature selection, feature extraction doesn’t create subsets of the existing features and then pick the best one out of them; instead, it creates a whole new feature set from the existing features. For example, if we have the data set X = {x1, x2, x3, …, xn}, after feature extraction we get something like Y = {y1, y2, y3, …, ym}, where m is the number of dimensions extracted. If we put it into a formula, it would be f(X) = Y. Basically, the function f projects a higher-dimensional feature space onto a lower-dimensional feature space where the new features are uncorrelated and cannot be reduced further.
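
To make the f(X) = Y idea concrete, here is a minimal sketch in plain Scala. The function name `project` and the example matrix `w` are purely illustrative (not from any library): a projection matrix maps an n-dimensional point to an m-dimensional one.

```scala
object ProjectionSketch {
  // Project an n-dimensional point x down to m dimensions using an m x n
  // matrix w: y(i) = sum over j of w(i)(j) * x(j). This is the f(X) = Y idea.
  def project(w: Array[Array[Double]], x: Array[Double]): Array[Double] =
    w.map(row => row.zip(x).map { case (wij, xj) => wij * xj }.sum)

  def main(args: Array[String]): Unit = {
    val x = Array(2.0, 1.0, 3.0)                 // a point in 3 dimensions
    val w = Array(Array(1.0, 0.0, 0.0),          // keep the 1st coordinate
                  Array(0.0, 0.7071, 0.7071))    // mix the 2nd and 3rd
    println(project(w, x).mkString(", "))        // a point in 2 dimensions
  }
}
```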

One thing we have to keep in mind while doing the projection is that every new feature must have a large variance so that it can easily be distinguished from the other features. This method of finding uncorrelated features with high variance is known as Principal Component Analysis (PCA).

Let’s take the example below. We have two geometric representations, and as you can see, in the second diagram the variance of the projected points is larger. Keeping the previous point in mind, we have to select the second diagram as our projection.

Let us suppose we have observations X1, X2, X3, …, Xp, where every sample has n features, i.e. X11, X12, X13, …, X1n. We need to find a projection Z, an m-dimensional representation of them. For that, we need to find a matrix W such that Z = W^t X (i.e. Z1 = W^t X1, Z2 = W^t X2, where t denotes the transpose). W is what we need to find so that the variance is maximum for the first feature we compute, and finding it is exactly what PCA, or principal component analysis, does. The principal component vector is the direction along which we get the maximum variance for our projection. Moreover, the principal component vector is also the direction for which the reconstruction error from the low-dimensional space back to the higher-dimensional one is minimal; however, that is out of the scope of this blog.
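
As a rough sketch of how such a W could be found for the first feature, here is some dependency-free Scala (the helper names `covariance` and `firstPrincipalComponent` are mine, not from any library). It centres the data, builds the covariance matrix, and uses power iteration to approximate the direction of maximum variance, i.e. the first principal component vector.

```scala
object FirstPC {
  // Covariance matrix of data given as rows of samples (n samples x d features).
  def covariance(rows: Array[Array[Double]]): Array[Array[Double]] = {
    val n = rows.length
    val d = rows.head.length
    val mean = Array.tabulate(d)(j => rows.map(_(j)).sum / n)
    val centred = rows.map(r => r.indices.map(j => r(j) - mean(j)).toArray)
    Array.tabulate(d, d)((i, j) => centred.map(r => r(i) * r(j)).sum / (n - 1))
  }

  // Power iteration: repeatedly multiply a vector by the covariance matrix and
  // normalise; it converges to the eigenvector with the largest eigenvalue,
  // which is the direction of maximum variance.
  def firstPrincipalComponent(cov: Array[Array[Double]], iters: Int = 200): Array[Double] = {
    var w = Array.fill(cov.length)(1.0)
    for (_ <- 0 until iters) {
      val next = cov.map(row => row.zip(w).map { case (a, b) => a * b }.sum)
      val norm = math.sqrt(next.map(x => x * x).sum)
      w = next.map(_ / norm)
    }
    w
  }

  def main(args: Array[String]): Unit = {
    val data = Array(Array(2.5, 2.4), Array(0.5, 0.7), Array(2.2, 2.9),
                     Array(1.9, 2.2), Array(3.1, 3.0), Array(2.3, 2.7))
    val w = firstPrincipalComponent(covariance(data))
    println(w.mkString(", "))   // roughly the diagonal direction for this data
  }
}
```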

Now the obvious question arises: extracting the first feature using PCA with maximum variance is fine, but what about the other features? That’s the interesting part. As you can see in the diagram below (which is a poor drawing of mine; with some more time a proper 3D image could have been created, but bear with me for now), we have drawn a three-dimensional space in which we have extracted three features from the training data. The first feature is what we get from principal component direction 1 (the principal component vector is what lets us draw the direction lines in the graph), i.e. the direction of maximum variance mentioned in the paragraph above. The second one is what we get from PC direction 2, and the conditions to get it are: first, it has to be orthogonal to the first principal component vector, and second, it still has to have the maximum variance. For the third feature we repeat the same conditions as for the second, and by recursing we can get the rest of our m dimensions.

[Figure: hand-drawn sketch of three principal component directions in a 3D feature space]

So, to remember the steps, a short note would be something like:

  1. Pick the first feature along the direction of maximum variance
  2. Choose the second and each remaining feature with a principal component vector orthogonal to the previous ones, while still maximising the variance (a sketch of this step follows the list)
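
One common way (of several) to get those remaining components is deflation: subtract the variance already explained by the components found so far and look for the maximum-variance direction again; orthogonality to the previous vectors then falls out automatically. A rough, dependency-free Scala sketch, with my own function names `topDirection`, `deflate` and `principalComponents`:

```scala
object MorePCs {
  // Direction of maximum variance of a symmetric matrix, via power iteration.
  def topDirection(m: Array[Array[Double]], iters: Int = 200): Array[Double] = {
    var w = Array.fill(m.length)(1.0)
    for (_ <- 0 until iters) {
      val next = m.map(row => row.zip(w).map { case (a, b) => a * b }.sum)
      val norm = math.sqrt(next.map(x => x * x).sum)
      w = next.map(_ / norm)
    }
    w
  }

  // Deflation: remove the variance explained by direction w so that the next
  // call to topDirection finds a direction orthogonal to w.
  def deflate(m: Array[Array[Double]], w: Array[Double]): Array[Array[Double]] = {
    val mw = m.map(row => row.zip(w).map { case (a, b) => a * b }.sum)
    val lambda = mw.zip(w).map { case (a, b) => a * b }.sum   // variance along w
    Array.tabulate(m.length, m.length)((i, j) => m(i)(j) - lambda * w(i) * w(j))
  }

  // Collect the first k principal component vectors from a covariance matrix.
  def principalComponents(cov: Array[Array[Double]], k: Int): List[Array[Double]] =
    (1 to k).foldLeft((cov, List.empty[Array[Double]])) { case ((m, acc), _) =>
      val w = topDirection(m)
      (deflate(m, w), acc :+ w)
    }._2
}
```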

Although we have gone through PCA for feature extraction, it might not be a very good idea when it comes to classification problems. PCA does not take the dependent, label, or target variable into account while doing its calculation, but in supervised learning that is a crucial thing to consider. For classification, we need to find features that separate the classes rather than features that merely have the maximum variance. So when we plot them in a graph, the centroids of the classes have to be far apart, while the points within each class have to be at a minimum distance from their centroid. For this we use a different kind of analysis called Linear Discriminant Analysis, or LDA for short. (Although the figure below shows how the classes can be kept apart by their distance, diagrams in LDA are usually more of a linear diagram.)
[Figure: two classes separated by the distance between their centroids]

One of the methods used to calculate LDA is the Fisher linear discriminant, which for a two-class problem we represent as below:

J(w) = (m1 - m2)^2 / (s1^2 + s2^2)

Here, m1 and m2 are the means of the two classes after projection, and s1 and s2 are the scatters (standard deviations) of the classes after projection.
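
As a small, hedged illustration in plain Scala (function names `projectOnto` and `fisherJ` are mine), here is how J(w) could be evaluated for a candidate direction w: project each class onto w, then compare the squared distance between the projected means with the sum of the projected variances. A direction with a larger J(w) separates the classes better.

```scala
object FisherCriterion {
  // Project points onto direction w (assumed to have unit length).
  def projectOnto(points: Array[Array[Double]], w: Array[Double]): Array[Double] =
    points.map(p => p.zip(w).map { case (pi, wi) => pi * wi }.sum)

  def mean(xs: Array[Double]): Double = xs.sum / xs.length

  def variance(xs: Array[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.length
  }

  // J(w) = (m1 - m2)^2 / (s1^2 + s2^2) for two classes projected onto w.
  def fisherJ(class1: Array[Array[Double]], class2: Array[Array[Double]],
              w: Array[Double]): Double = {
    val p1 = projectOnto(class1, w)
    val p2 = projectOnto(class2, w)
    val between = math.pow(mean(p1) - mean(p2), 2)
    val within  = variance(p1) + variance(p2)
    between / within
  }

  def main(args: Array[String]): Unit = {
    val classA = Array(Array(1.0, 2.0), Array(2.0, 3.0), Array(3.0, 3.0))
    val classB = Array(Array(6.0, 5.0), Array(7.0, 8.0), Array(8.0, 8.0))
    // Compare two candidate directions; the better one has the larger J(w).
    println(fisherJ(classA, classB, Array(1.0, 0.0)))        // project on x-axis
    println(fisherJ(classA, classB, Array(0.7071, 0.7071)))  // diagonal direction
  }
}
```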

[Figure: three classes with their centroids and a central point (coloured blue) between them]

For more than two classes we have to do something like the image above. We take a central point (coloured blue) among all the classes, and from that blue point the distances to the classes are calculated. If we take Di as the distance of class i’s mean from that central point (instead of the simple difference m1 - m2), the formula for multiple classes becomes:

J(w) = (D1^2 + D2^2 + D3^2) / (s1^2 + s2^2 + s3^2)
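
Carrying on the hedged sketch from above (again my own function names), the multi-class version of the criterion could be evaluated like this: project every class onto w, take the overall projected centroid as the “blue point”, and sum the squared distances of the class means from it over the summed within-class scatters.

```scala
object MultiClassJ {
  def project(points: Array[Array[Double]], w: Array[Double]): Array[Double] =
    points.map(p => p.zip(w).map { case (pi, wi) => pi * wi }.sum)

  def mean(xs: Array[Double]): Double = xs.sum / xs.length

  def variance(xs: Array[Double]): Double = {
    val m = mean(xs)
    xs.map(x => (x - m) * (x - m)).sum / xs.length
  }

  // J(w) = (D1^2 + D2^2 + ...) / (s1^2 + s2^2 + ...), where Di is the distance
  // of class i's projected mean from the overall projected centroid (the
  // "blue point") and si^2 is class i's projected scatter.
  def fisherJ(classes: Seq[Array[Array[Double]]], w: Array[Double]): Double = {
    val projected = classes.map(project(_, w))
    val centroid  = mean(projected.flatten.toArray)
    val between   = projected.map(p => math.pow(mean(p) - centroid, 2)).sum
    val within    = projected.map(variance).sum
    between / within
  }
}
```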

So far we have gone through both methods, PCA and LDA, and seen how they differ. However, they have a lot in common as well. Both of them:

  1. Draw axes according to their priorities
    1. PCA creates the first axis on the basis of maximum variance
    2. LDA creates the first axis on the basis of the maximum distance between the categories
  2. Both of them create the second axis and repeat the same for the rest of the dimensions. Just remember that for LDA, as in PCA, the second axis needs to be orthogonal to the first

By now we have gone through the theoretical part of dimensionality reduction. In the upcoming blogs, we will go through the implementation part, probably in Scala. Stay tuned!!




Written by 

Pranjut Gogoi is an enthusiast of Machine Learning and AI with 8+ years of experience. He has been implementing different machine learning projects at Knoldus. He started an initiative called MachineX, through which he shares knowledge with the world. As part of this initiative, he broadcasts free webinars, writes blogs, and contributes to open-source communities on machine learning and AI.
