In this blog post, we will look at a feature extraction technique in machine learning known as Principal Component Analysis (PCA). PCA is one of the most widely used techniques for dimensionality reduction, and it underlies the classic eigenfaces approach to face recognition. Let’s first understand what dimensionality reduction is.
As an example, let’s say we have a data set with a large number of features (which is not uncommon in real-world scenarios). Let’s say we plot only two of those features, as in the image below.
As we can see, these two features are redundant: they both measure the same thing, only expressed differently (for example, in different units), so we might want a single new feature in place of the two. We can therefore define a new feature Z by projecting every point onto a single line (the red line in the plot). The problem of deriving new features such as Z by combining the original features, and thereby reducing the data from N dimensions to K dimensions, is known as feature extraction. In our case, a training example that was described by a (2×1) feature vector can now be described by a (1×1) feature vector.
One of the most popular and commonly used algorithms for dimensionality reduction is PCA. As we saw in the example above, to reduce the dimension of the data set we need a line (a vector) onto which the data can be projected with minimum projection error, where the projection error of a point is its distance to that line. PCA finds a lower-dimensional surface, a single vector in our case, onto which the data can be projected with the minimum average squared projection error. To reduce the data from N dimensions to K dimensions, PCA gives us K vectors onto which the data can be projected while keeping this error to a minimum.
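To make "projection error" concrete, here is one common way to write it down for the single-vector case. The notation is an assumption made for illustration and is not used elsewhere in this post: x^{(i)} is the i-th training example after mean normalization, m is the number of training examples, and u is a direction vector of unit length.

```latex
\[
\text{error}(u) \;=\; \frac{1}{m} \sum_{i=1}^{m}
\left\| \, x^{(i)} - \left( u^{T} x^{(i)} \right) u \, \right\|^{2},
\qquad \| u \| = 1
\]
```

Here (u^T x^{(i)}) u is the projection of the example onto the line, so each term is the squared distance between a point and its projection. PCA picks the direction u that makes this average as small as possible.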
Before applying PCA, it is standard practice to apply mean normalization and feature scaling to the data. Once that is done, we proceed with PCA as follows:
- Compute the covariance matrix of the data.
- Compute the eigenvectors of the covariance matrix and collect them as the columns of a matrix U.
- Sort the eigenvectors by decreasing eigenvalue and take the first K columns of U, where K is the number of dimensions to reduce the data to. Call the new matrix U(reduce).
- Compute the new features Z by transposing U(reduce) and multiplying it with the original feature vector X of a training example, that is, Z = U(reduce)^T X.
- Plot the new features or feed them to a downstream ML algorithm.
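The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `pca_reduce` and the synthetic two-feature data are made up for this example, and the code assumes no feature is constant (otherwise the scaling step would divide by zero).

```python
import numpy as np

def pca_reduce(X, k):
    """Reduce X (m examples x n features) to k dimensions."""
    # Mean normalization and feature scaling (assumes no constant feature).
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    X_norm = (X - mu) / sigma

    # Step 1: covariance matrix (n x n) of the normalized data.
    m = X_norm.shape[0]
    cov = (X_norm.T @ X_norm) / m

    # Step 2: eigenvectors of the symmetric covariance matrix.
    # np.linalg.eigh returns the eigenvalues in ascending order.
    eigvals, U = np.linalg.eigh(cov)

    # Step 3: sort by eigenvalue, largest first, and keep the first k columns.
    order = np.argsort(eigvals)[::-1]
    U_reduce = U[:, order[:k]]

    # Step 4: Z = U_reduce^T x for every example, done for all rows at once.
    Z = X_norm @ U_reduce
    return Z, U_reduce

# Hypothetical data: two features that measure almost the same thing.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = 2.0 * x1 + rng.normal(scale=0.1, size=200)
X = np.column_stack([x1, x2])

Z, U_reduce = pca_reduce(X, k=1)
print(Z.shape)         # (200, 1)
print(U_reduce.shape)  # (2, 1)
```

Each row of `Z` is the single new feature for one training example, so the two redundant features have been replaced by one.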
Now let’s look at these steps in more detail.
In the first step we compute the covariance matrix, and then we compute its eigenvectors, which form the columns of U. We sort these columns by decreasing eigenvalue (scikit-learn returns the components already sorted this way by default) and keep only as many columns as the number of dimensions we want to reduce the data to, i.e. we take only the first K columns of U, which gives U(reduce). Then, by transposing U(reduce) and multiplying it with the data, we get the new features Z, which have fewer dimensions than the original data set X.
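For comparison, here is roughly how the same reduction looks with scikit-learn. Treat it as a sketch under a few assumptions: the data is synthetic and made up for this example, and `StandardScaler` is used for the normalization step because scikit-learn's `PCA` centers the data but does not scale it. Internally scikit-learn uses a singular value decomposition rather than an explicit covariance matrix, but the resulting components play the role of U(reduce).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical data: three features driven by one underlying signal plus noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2.0 * base, -base]) + rng.normal(scale=0.1, size=(200, 3))

# Mean normalization and feature scaling.
X_scaled = StandardScaler().fit_transform(X)

# Keep K = 2 principal components.
pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)

print(Z.shape)                # (200, 2)
print(pca.components_.shape)  # (2, 3): each row is one principal direction
print(pca.explained_variance_ratio_)  # sorted from largest to smallest
```

Note that `pca.components_` stores the directions as rows, so it corresponds to the transpose of U(reduce) as described above.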
Choosing the Value of K
K is the number of principal components retained or, in simple terms, the number of dimensions we want to reduce the data to. Hence choosing the value of K is important. To find the best value of K, we choose the smallest value of K for which
Average Squared Projection Error / Total Variation in data <= 0.01
This means that 99% of the variance is retained. In practice, we run PCA with increasing values of K and pick the first K that satisfies this condition.
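The search for K does not have to rerun PCA from scratch each time. For mean-normalized data, the ratio above equals one minus the fraction of variance captured by the first K components, and that fraction can be read off the singular values of the data matrix. The sketch below relies on that identity; the function name `choose_k` and the synthetic data are made up for this example, and the features are assumed to be on comparable scales already.

```python
import numpy as np

def choose_k(X, retained=0.99):
    """Return the smallest k that retains at least `retained` of the variance."""
    # Mean normalization (feature scaling is assumed to have been done already).
    X_centered = X - X.mean(axis=0)

    # Singular values are returned sorted from largest to smallest.
    s = np.linalg.svd(X_centered, compute_uv=False)

    # Fraction of the total variance retained by the first 1, 2, ... components.
    variances = s ** 2
    cumulative = np.cumsum(variances) / np.sum(variances)

    # First position where the retained fraction reaches the target.
    k = int(np.searchsorted(cumulative, retained)) + 1
    k = min(k, len(cumulative))  # guard against floating-point round-off
    return k, cumulative

# Hypothetical data: three features driven by one underlying signal plus noise.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
X = np.hstack([base, 2.0 * base, -base]) + rng.normal(scale=0.1, size=(200, 3))

k, cumulative = choose_k(X, retained=0.99)
print(k)
print(cumulative)
```

`cumulative[k - 1]` is the fraction of variance retained with the chosen K, so `1 - cumulative[k - 1]` is the value of the ratio in the condition above.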
Hope you enjoyed the post.