MachineX: What is K-Fold Cross Validation?

In this blog, we are going to explore K-Fold Cross Validation, a statistical method for evaluating a Machine Learning model’s performance. To understand what K-Fold Cross Validation is, we first need to understand what evaluating a model means and why we need to do it.

Evaluating a Machine Learning Model

Evaluating a Machine Learning model means measuring how well it performs, most commonly by finding its accuracy. The accuracy of a model tells us how good or bad our model is, i.e., how it is going to perform on a data sample that it was not trained on.

We need to evaluate every ML model so that we can find out its accuracy and decide whether the model is good enough. We also need accuracy to compare different ML models. There might be scenarios where different models can achieve the same task. In these scenarios, we often have to choose a final model, and accuracy helps us do just that.

So, K-Fold Cross Validation is a way to calculate an ML model’s accuracy. Let’s now see how it actually works.

How does K-Fold Cross Validation work?

While creating an ML model, we have a data set on which we train our model so that it can learn to predict the class or group of an unknown data sample. This data set is usually pretty large, often containing 10,000 data samples or more.

What we generally do to evaluate our model is train it on one part of the data set and test it on the rest. In other words, we split our data set into a training set and a test set. The training set generally contains about 80% of the data set, and the test set contains the remaining 20%. So, for a data set with 10,000 samples, we would put 8,000 samples in our training set to train our model, and 2,000 samples in our test set, which we will use to test the model.
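As a minimal sketch of the 80/20 split described above, the snippet below uses only the Python standard library. The function name `train_test_split`, the `test_fraction` and `seed` parameters, and the list of integers standing in for real samples are all assumptions made for illustration.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=42):
    """Shuffle a copy of the samples and split off a test set (hypothetical helper)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_test = round(len(shuffled) * test_fraction)
    # The first n_test shuffled samples become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

data = list(range(10_000))  # stand-in for a data set with 10,000 samples
train_set, test_set = train_test_split(data)
print(len(train_set), len(test_set))
```

Shuffling before splitting is a common precaution so that the test set is not just the first or last rows of the data.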

K-Fold Cross Validation does exactly this, but K times, and each time the training and test sets are a little different. Let’s understand this visually.

Suppose we have a very small data set with 16 samples. We have a model built for it, and now we want to evaluate that model using K-Fold Cross Validation with K=4. We will split the data set into a training set and a test set 4 times. Each time, the test set is a different part of the data set and the remaining data samples make up the training set. In each run, we train the model on the training set and test it using the test set.
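The splitting step for 16 samples and K=4 can be sketched as follows. The helper `k_fold_indices` is a hypothetical name, and the sketch assumes the number of samples divides evenly by K and that the folds are taken in order, as in the example above.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    fold_size = n_samples // k  # assumes n_samples is divisible by k
    indices = list(range(n_samples))
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test_idx = indices[start:stop]                 # this run's test set
        train_idx = indices[:start] + indices[stop:]   # everything else
        yield train_idx, test_idx

folds = list(k_fold_indices(16, 4))
for run, (train_idx, test_idx) in enumerate(folds, start=1):
    print(f"run {run}: test={test_idx}, training set size={len(train_idx)}")
```

Every sample ends up in the test set exactly once and in the training set for the other three runs.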

Suppose in the below diagram, the four parts represent the same data set, with the circles representing the test set and the squares representing the training set. In our first run, we will take the first 4 samples as the test set, and the remaining samples will make up the training set. We will train our model using the training set, and evaluate it using the test set. To score the model on the test set, we can use any standard evaluation metric: accuracy for a classification model, or Mean Squared Error (MSE) or Root Mean Squared Error (RMSE) for a regression model. Here we will use accuracy. Suppose in our first run, the model’s accuracy comes out as 84%.

[Figure: K-Fold Cross Validation with K=4]
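The evaluation metrics mentioned above can be sketched in a few lines of plain Python. The function names and the tiny label and value lists are made up for illustration; accuracy applies to predicted classes, while MSE and RMSE apply to predicted numbers.

```python
import math

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between true and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def root_mean_squared_error(y_true, y_pred):
    """Square root of the MSE, in the same units as the target."""
    return math.sqrt(mean_squared_error(y_true, y_pred))

# Classification example: 3 of the 4 predicted labels are correct.
labels_true = ["cat", "dog", "dog", "cat"]
labels_pred = ["cat", "dog", "cat", "cat"]
acc = accuracy(labels_true, labels_pred)

# Regression example: squared errors are 1, 0, 4 and 1.
values_true = [3.0, 5.0, 2.0, 7.0]
values_pred = [2.0, 5.0, 4.0, 8.0]
mse = mean_squared_error(values_true, values_pred)
rmse = root_mean_squared_error(values_true, values_pred)
print(acc, mse, rmse)
```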

We repeat this process a second time, but this time the test set is the next 4 samples, and the remaining samples form the training set. We again train our model on this training set and evaluate it on this test set, and let’s suppose it gives us an accuracy of 80%.

We repeat this process 2 more times, for a total of exactly 4 runs (since K=4), and suppose these last 2 runs give us accuracies of 85% and 86%.
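Putting the runs together, the whole procedure might look like the sketch below. The "model" here is a deliberately trivial threshold classifier invented for this example (`fit_threshold` and `predict` are hypothetical helpers), and the 16 samples are synthetic, so the scores it produces are not the 84%, 80%, 85% and 86% used in this walkthrough.

```python
from statistics import mean

def fit_threshold(train):
    """Toy 'training' step: midpoint between the mean x of each class."""
    low = [x for x, label in train if label == "low"]
    high = [x for x, label in train if label == "high"]
    return (mean(low) + mean(high)) / 2

def predict(threshold, x):
    """Toy 'prediction' step: compare x against the learned threshold."""
    return "high" if x >= threshold else "low"

def cross_validate(samples, k):
    """Train and test k times, returning one accuracy score per run."""
    fold_size = len(samples) // k  # assumes len(samples) is divisible by k
    scores = []
    for fold in range(k):
        start, stop = fold * fold_size, (fold + 1) * fold_size
        test = samples[start:stop]
        train = samples[:start] + samples[stop:]
        threshold = fit_threshold(train)  # train only on the training set
        correct = sum(predict(threshold, x) == label for x, label in test)
        scores.append(correct / len(test))  # evaluate only on the test set
    return scores

# Synthetic data: 16 samples, labelled "high" when x is 8 or more.
samples = [(x, "high" if x >= 8 else "low") for x in range(16)]
scores = cross_validate(samples, k=4)
print(scores)
```

The important detail is that the model is fitted from scratch in every run, so no run ever sees its own test samples during training.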

Now, we have 4 different accuracies for 4 different test sets. The final step is to average these accuracies to get the final accuracy. We add up the 4 accuracies, (84 + 80 + 85 + 86) = 335, and divide the sum by 4, which gives us 83.75%. This is the final accuracy that we get using K-Fold Cross Validation.
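The averaging step is plain arithmetic, and can be checked with a couple of lines of Python. The variable names are illustrative; the four values are the accuracies from the runs above.

```python
from statistics import mean

fold_accuracies = [84, 80, 85, 86]  # accuracy (in %) from each of the 4 runs
final_accuracy = mean(fold_accuracies)  # (84 + 80 + 85 + 86) / 4
print(f"{final_accuracy:.2f}%")
```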

So, now we have discussed what K-Fold Cross Validation is and how it works. In our next blog, we will dive into sample code to understand how to go about implementing it in a program.

Hope this blog was helpful to you, thanks for reading and happy blogging!

Written by 

Akshansh Jain is a Software Consultant with more than 1 year of experience. He is familiar with Java and also has knowledge of various other programming languages such as Scala, HTML and C++. He is also familiar with different Web Technologies and Android programming. He is a passionate programmer and always eager to learn new technologies and apply them in his projects.