MachineX: When data is a curse to learning

Data and learning are like best friends, though perhaps learning is too dependent on data for the two to be called just friends. When data overwhelms, learning acts pricey, so it feels more like a girlfriend-boyfriend sort of relationship. Don’t get confused or bothered by how I am comparing data and learning; it is just my depiction of something called dimensionality reduction in machine learning. On a serious note, in this blog we will go through what the “curse of dimensionality” is, what feature selection is, and how it works, and in the end you might just like my depiction and make it yours too.

Well, learning has every right to act pricey; after all, quality wins over quantity. Having a lot of dimensions doesn’t mean that every one of them is useful. In fact, it becomes a curse, and “curse of dimensionality” is the term data science uses for this problem. It refers to the phenomena that arise when analyzing and organizing data in a high-dimensional space. These problems generally don’t occur in a low-dimensional space like the 3D physical space. Too many features/dimensions also make the model overfit, which is something we want to avoid in every classification algorithm, be it KNN, a decision tree, or a neural network. But while too many features lead to overfitting, too few cause the opposite problem: the model underfits because it doesn’t have enough information to learn from.
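
To get a feel for why high-dimensional spaces behave differently, here is a minimal sketch in plain Python. It is my own illustration, not part of any library: the function name `distance_contrast`, the uniform random data, and the point counts are all assumptions. It measures how much farther the farthest point is than the nearest one from a random query point. As the number of dimensions grows, that gap tends to shrink, which is one reason distance-based methods like KNN struggle.

```python
import math
import random


def distance_contrast(n_points, n_dims, seed=0):
    """Relative gap between the farthest and nearest neighbour of a query point.

    Points are drawn uniformly from the unit hypercube [0, 1]^n_dims.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # The printed contrast should get smaller as n_dims grows.
    for n_dims in (2, 10, 100, 1000):
        print(n_dims, round(distance_contrast(500, n_dims), 3))
```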

So the solution is dimensionality reduction. Two well-known dimensionality reduction techniques are feature selection and feature extraction. I find the need for dimensionality reduction similar to a girlfriend acting pricey, and the above-mentioned techniques are what keep learning in control. So enough with the depiction, and let’s focus on the two techniques now.

Feature Selection: In a one-liner, feature selection is the process of selecting a subset of the features by skipping the redundant or irrelevant ones. Let’s take an example: our features are {X1, X2, X3, …, Xn}, so the number of possible subsets is 2^n. In other words, the number of subsets grows exponentially with the number of features. So, as you can see, although we have to select a subset, we cannot go through each and every one of them.
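
A small sketch makes the 2^n growth concrete. The helper name `all_subsets` is just something I made up for illustration; it enumerates every subset for a tiny feature list and then only counts (without enumerating) for larger n.

```python
from itertools import combinations


def all_subsets(features):
    """Yield every subset of `features`, from the empty set to the full set."""
    for size in range(len(features) + 1):
        yield from combinations(features, size)


if __name__ == "__main__":
    subsets = list(all_subsets(["X1", "X2", "X3"]))
    print(len(subsets))  # 8, i.e. 2**3
    # Counting only: enumerating these would quickly become impractical.
    for n in (10, 20, 40):
        print(n, "features ->", 2 ** n, "subsets")
```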

We need methods that find a good subset in as little time as possible. They can be:

Optimum method
Heuristic method
Randomized method

For the optimum method, though, the hypothesis space of subsets needs to have some structure so that the selection can be done in polynomial time. Otherwise, we have to fall back on a heuristic (greedy) or randomized method. The search mechanism these methods use to move from one subset to another also has to evaluate the subsets. The evaluation can be done with both supervised and unsupervised methods. The unsupervised kind doesn’t train a model on the training examples to score a subset; it evaluates the subset on the basis of the information in the subset itself. We generally call these filter methods. The supervised kind, known as the wrapper method, evaluates a subset by how well a learning algorithm performs on the training examples when it uses that subset.

Feature selection is an optimization problem. For both the filter and the wrapper method, the flow works in a similar fashion. The filter method involves a search algorithm and an objective function, and the goal is to optimize that objective function.

In the gray box of the flow diagram, a loop is running: the search picks a subset and passes it to the objective function, the objective function checks whether the subset is good enough to be selected, and the loop goes on until we find the best one. A similar loop exists for wrapper methods; the only difference is that instead of an objective function it has a PR (pattern recognition) algorithm, that is, the learning algorithm itself.

While searching for the best subset we need to take care of two things: first, the features we select should be uncorrelated with one another, and second, redundant features should be eliminated. To do this we have two different methods.

Forward Selection method: It’s an iterative, greedy method where we start with an empty feature set and end up with the best subset. The steps look like this:

Start with an empty feature set
Try each remaining feature
Estimate the classification/regression error for adding each feature
Select the feature that gives the maximum improvement
Stop when there is no significant improvement
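
The forward selection steps can be sketched in a few lines of plain Python. Treat this as an illustration under my own assumptions, not a reference implementation: the error estimate is the leave-one-out error of a 1-nearest-neighbour classifier (a wrapper-style evaluation), and the names `loo_1nn_error`, `forward_selection`, and the `min_gain` threshold are all hypothetical choices.

```python
import math


def loo_1nn_error(X, y, features):
    """Leave-one-out error of a 1-nearest-neighbour classifier on `features`."""
    if not features:
        # With no features, fall back to always predicting the majority class.
        majority = max(set(y), key=y.count)
        return sum(label != majority for label in y) / len(y)
    errors = 0
    for i, xi in enumerate(X):
        nearest, nearest_dist = None, math.inf
        for j, xj in enumerate(X):
            if i == j:
                continue
            dist = sum((xi[f] - xj[f]) ** 2 for f in features)
            if dist < nearest_dist:
                nearest, nearest_dist = j, dist
        errors += y[nearest] != y[i]
    return errors / len(X)


def forward_selection(X, y, min_gain=0.01):
    """Greedily add the feature that lowers the estimated error the most."""
    n_features = len(X[0])
    selected = []                                   # start with an empty set
    current_error = loo_1nn_error(X, y, selected)
    while len(selected) < n_features:
        remaining = [f for f in range(n_features) if f not in selected]
        # Try each remaining feature and estimate the error of adding it.
        trials = {f: loo_1nn_error(X, y, selected + [f]) for f in remaining}
        candidate = min(trials, key=trials.get)
        if current_error - trials[candidate] < min_gain:
            break                                   # no significant improvement
        selected.append(candidate)
        current_error = trials[candidate]
    return selected, current_error


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 is misleading.
    X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [1.1, 1.1]]
    y = [0, 0, 1, 1]
    print(forward_selection(X, y))
```

On real data you would swap in whatever model and validation scheme you actually plan to use; the loop itself stays the same.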

Backward Elimination method: In contrast to forward selection, backward elimination starts with the full feature set and keeps eliminating features until we find the best subset. It’s an iterative, greedy method as well, and these are the steps it follows:

Start with the full feature set
Try removing each feature
Drop the feature whose removal has the smallest impact on the error
Stop when dropping any remaining feature would make the error significantly worse
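
Backward elimination can be sketched the same way. Again, this is only an illustration under my own assumptions: the error estimate is the leave-one-out error of a 1-nearest-neighbour classifier, and `loo_1nn_error`, `backward_elimination`, and the `tolerance` parameter are hypothetical names, not something from a library.

```python
import math


def loo_1nn_error(X, y, features):
    """Leave-one-out error of a 1-nearest-neighbour classifier on `features`."""
    errors = 0
    for i, xi in enumerate(X):
        nearest, nearest_dist = None, math.inf
        for j, xj in enumerate(X):
            if i == j:
                continue
            dist = sum((xi[f] - xj[f]) ** 2 for f in features)
            if dist < nearest_dist:
                nearest, nearest_dist = j, dist
        errors += y[nearest] != y[i]
    return errors / len(X)


def backward_elimination(X, y, tolerance=0.01):
    """Greedily drop the feature whose removal hurts the estimated error least."""
    selected = list(range(len(X[0])))               # start with all features
    current_error = loo_1nn_error(X, y, selected)
    while len(selected) > 1:
        # Try removing each feature and estimate the error without it.
        trials = {
            f: loo_1nn_error(X, y, [g for g in selected if g != f])
            for f in selected
        }
        candidate = min(trials, key=trials.get)
        if trials[candidate] > current_error + tolerance:
            break                                   # every removal hurts too much
        selected.remove(candidate)
        current_error = trials[candidate]
    return selected, current_error


if __name__ == "__main__":
    # Toy data: feature 0 separates the classes, feature 1 is misleading.
    X = [[0.0, 5.0], [0.1, 1.0], [1.0, 5.1], [1.1, 1.1]]
    y = [0, 0, 1, 1]
    print(backward_elimination(X, y))
```

Note that this sketch always keeps at least one feature, which is a design choice of mine rather than part of the method.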

Feature selection can be done with univariate methods, which look at each feature independently of the others. Univariate methods measure some type of correlation between two random variables, here a feature and the target. For that, we have the following measures:

Pearson correlation coefficient
F-score
Chi-square
Signal to noise ratio
Mutual information
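
As a minimal sketch of a univariate filter, the snippet below scores each feature by the absolute Pearson correlation with the target and keeps the top k. The function names `pearson`, `rank_features`, and `select_top_k` are my own, and treating a constant column as zero correlation is an assumption I made to avoid dividing by zero.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (norm_x * norm_y)


def rank_features(X, y):
    """Score every feature on its own and sort from most to least relevant."""
    columns = list(zip(*X))
    scores = {f: abs(pearson(col, y)) for f, col in enumerate(columns)}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def select_top_k(X, y, k):
    """Indices of the k features with the highest univariate score."""
    return [f for f, _ in rank_features(X, y)[:k]]


if __name__ == "__main__":
    # Feature 0 tracks the target, feature 1 is constant, feature 2 is noisy.
    X = [[1, 7, 3], [2, 7, 1], [3, 7, 4], [4, 7, 1], [5, 7, 5]]
    y = [2, 4, 6, 8, 10]
    print(rank_features(X, y))
    print(select_top_k(X, y, 2))
```

Because each feature is scored in isolation, this kind of filter is cheap, but it cannot notice that two high-scoring features carry the same information.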

Multivariate feature selection considers all features simultaneously. Following are some methods that do so:

Minimum Redundancy and Maximum Relevance (mRMR): This works on the principle of forward selection method.
Fast Correlation based Feature Selection (FCBF): This works on the principle of backward elimination method.
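
To show the idea behind mRMR, here is a simplified greedy sketch. Be aware of the assumptions: the original mRMR is defined with mutual information, while this version swaps in absolute Pearson correlation for both relevance and redundancy just to stay dependency-free, and the names `pearson` and `mrmr` are hypothetical. Each step picks the feature with the best "relevance minus average redundancy" score.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / (norm_x * norm_y)


def mrmr(X, y, k):
    """Greedy forward pass: maximise relevance, minimise redundancy."""
    columns = list(zip(*X))
    n_features = len(columns)
    # Relevance: how strongly each feature correlates with the target.
    relevance = [abs(pearson(col, y)) for col in columns]
    selected = [max(range(n_features), key=relevance.__getitem__)]
    while len(selected) < min(k, n_features):
        best_feature, best_score = None, -math.inf
        for f in range(n_features):
            if f in selected:
                continue
            # Redundancy: average correlation with already selected features.
            redundancy = sum(
                abs(pearson(columns[f], columns[s])) for s in selected
            ) / len(selected)
            score = relevance[f] - redundancy
            if score > best_score:
                best_feature, best_score = f, score
        selected.append(best_feature)
    return selected


if __name__ == "__main__":
    # Feature 1 is a near copy of feature 0; feature 2 adds new information.
    a = [1, 1, 1, 1, -1, -1, -1, -1]
    a_copy = [1.1, 0.9, 1.0, 1.0, -1.0, -1.0, -0.9, -1.1]
    b = [1, 1, -1, -1, 1, 1, -1, -1]
    X = [list(row) for row in zip(a, a_copy, b)]
    y = [2 * ai + bi for ai, bi in zip(a, b)]
    print(mrmr(X, y, 2))
```

In this toy data a plain relevance ranking would keep the two near-duplicate features, while the redundancy penalty pushes the second pick toward the feature that adds something new.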

Undoubtedly this is a bit more information than the depiction we started off with, but dimensionality reduction doesn’t end here. As you saw, one of the paragraphs above mentions one more technique: feature extraction. For now, feature extraction is left over for the next blog, and the coding part and the results on data are also something we will explain in later blogs. Till then, keep reading Knoldus blogs.