MachineX: Performance Metrics for Model Evaluation

Reading Time: 6 minutes

In this blog, we are going to see how to choose the right metrics for model evaluation in different kinds of applications.


There are different metric categories based on the ML model/application, and we are going to cover the popular metrics used in the following problems:

  • Classification Metrics (accuracy, precision, recall, F1-score, ROC, AUC)
  • Regression Metrics (MSE, MAE)

There are more metric categories, such as computer vision metrics, NLP metrics, and deep-learning-specific metrics, but for the scope of this blog we will cover classification and regression metrics.

Importance of model evaluation

Suppose you have done the data analysis, feature engineering (feature selection), and model training, but how will you know whether the work done so far is correct? The best way is to evaluate the model with appropriate evaluation metrics. This is not an easy task, as choosing the right metric is crucial when evaluating machine learning (ML) models. Various metrics have been proposed for evaluating ML models in different applications. In some applications a single metric may not capture the whole performance of the problem you are solving, and you may want to use a subset of metrics to get a concrete evaluation of your models.

So let's see some of the metrics.

Classification is one of the most widely used problem types in machine learning. A lot of industries use classification algorithms in their applications, from face recognition, YouTube video categorization, content moderation, and medical diagnosis to text classification and hate speech detection on Twitter. Models such as support vector machines (SVM), logistic regression, decision trees, and random forests are some of the most popular classification models. Within classification itself there are a lot of ways to evaluate a model, so let's look at some of the most popular metrics.

The first one is not a metric, but it is the most important thing to understand.

Confusion matrix 

One of the most important concepts in classification problems is the confusion matrix (also known as the error matrix), a tabular visualization of the model's predictions versus the original labels. Each row of this matrix represents the instances in a predicted class, and each column represents the instances in an actual class.

For example, suppose we want to build a binary classifier to separate cat images from non-cat images. Let's assume our test set has 1100 images, of which 1000 are non-cat images and 100 are cat images. After model prediction, the confusion matrix looks like this:

                        Actual Cat    Actual Non-Cat
    Predicted Cat           90              60
    Predicted Non-Cat       10             940

Here, out of 100 cat images, our model has predicted 90 correctly as cat and 10 incorrectly as non-cat. If we consider the class 'Cat' as the positive class and 'Non-Cat' as the negative class, then the 90 samples correctly predicted as cat are true positives, and the 10 samples predicted as non-cat are false negatives.

Similarly, of the 1000 non-cat images, our model has classified 940 samples correctly as non-cat and misclassified 60 of them as cat. The 940 correctly classified samples are referred to as true negatives, and those 60 are referred to as false positives.
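The counts above can be reproduced with scikit-learn. This is a minimal sketch with hypothetical label arrays matching the cat example (note that scikit-learn puts actual classes on the rows and predicted classes on the columns, the transpose of the layout described above):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for the cat example: 1 = cat, 0 = non-cat.
y_true = np.array([1] * 100 + [0] * 1000)
# 90 cats predicted correctly, 10 missed; 60 non-cats flagged as cat, 940 correct.
y_pred = np.array([1] * 90 + [0] * 10 + [1] * 60 + [0] * 940)

# ravel() flattens the 2x2 matrix into (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tp, fn, fp, tn)  # 90 10 60 940
```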

I hope the confusion matrix is clear now; let's see the actual metrics.

Classification Accuracy

Classification accuracy is the simplest metric you can learn, and it tells you how often the model is correct overall. We can define accuracy as the number of correct predictions divided by the total number of predictions, multiplied by 100. In our example above, we have 1100 total examples, of which our model predicted 1030 correctly, so the accuracy can be found as:

Classification accuracy = (90 + 940)/(100 + 1000) = 1030/1100 ≈ 93.6%
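As a quick check of the arithmetic, here is the same calculation in Python, assuming the counts from the cat/non-cat confusion matrix above:

```python
# Counts from the cat/non-cat confusion matrix.
tp, fn, fp, tn = 90, 10, 60, 940

# Accuracy = correct predictions / total predictions.
accuracy = (tp + tn) / (tp + fn + fp + tn)
print(f"{accuracy:.1%}")  # 93.6%
```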


If you think accuracy alone can tell you the performance of the model, you are wrong. Suppose your class distribution is imbalanced (one class is more frequent than the others). If your model predicted every sample as the most frequent class, you would get a high accuracy, but in that case your model has learned nothing except to predict the frequent class. Our running example has exactly this problem. Go back to the cat vs non-cat example and try to find it. Do you see it?

Ok, let me tell you: in our cat vs non-cat classification above, if the model predicts all samples as non-cat, it would result in:

1000/1100= 90.9% accuracy

That is why we also need to look at class-specific performance metrics, and here comes precision. Precision is the fraction of samples predicted as positive that are actually positive. It can be defined as:

Precision= True_Positive/ (True_Positive+ False_Positive)


Recall is another important metric, defined as the fraction of samples from a class that are correctly predicted by the model. More formally:

Recall= True_Positive/ (True_Positive+ False_Negative)

F1 Score

The F1 score considers both precision and recall: it is the harmonic mean of the two. The F1 score is highest when there is some balance between precision (P) and recall (R). Conversely, the F1 score is not high if one measure is improved at the expense of the other.
For example, if P is 1 and R is 0, the F1 score is 0.

We can define the F1 score as:

F1-score= 2*Precision*Recall/(Precision+Recall)

So for our classification example with the confusion matrix above, precision for the cat class is 90/(90+60) = 0.6 and recall is 90/(90+10) = 0.9, so the F1 score can be calculated as:

F1_cat = 2*0.6*0.9/(0.6+0.9) = 0.72 = 72%
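Here is a minimal sketch computing precision, recall, and F1 for the 'Cat' class directly from the confusion-matrix counts:

```python
# Counts for the 'Cat' (positive) class from the confusion matrix above.
tp, fn, fp = 90, 10, 60

precision = tp / (tp + fp)  # 90 / 150 = 0.6
recall = tp / (tp + fn)     # 90 / 100 = 0.9
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 2))  # 0.6 0.9 0.72
```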

Sensitivity and Specificity

Sensitivity and specificity are two other popular metrics mostly used in medical and biology-related fields, and are defined as:

Sensitivity tells us what percentage of the cat images were correctly identified.

Sensitivity= Recall= TP/(TP+FN)

Specificity tells us what percentage of the non-cat images were correctly identified.

Specificity= True Negative Rate= TN/(TN+FP)

If correctly identifying positives is important for us, then we should choose a model with higher Sensitivity. However, if correctly identifying negatives is more important, then we should choose specificity as the measurement metric.
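Both metrics follow directly from the same confusion-matrix counts; a minimal sketch for our example:

```python
# Counts from the cat/non-cat confusion matrix above.
tp, fn, fp, tn = 90, 10, 60, 940

sensitivity = tp / (tp + fn)  # same as recall: 90 / 100 = 0.9
specificity = tn / (tn + fp)  # 940 / 1000 = 0.94
print(sensitivity, specificity)  # 0.9 0.94
```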

Unlike classification metrics, regression metrics are used to evaluate models that predict a continuous value, and are therefore slightly different from classification metrics.

Models such as linear regression, random forest, XGBoost, convolutional neural networks, and recurrent neural networks are some of the most popular regression models.

These metrics are:

MSE (Mean squared error)

It is the most popular metric used for regression problems. It essentially finds the average squared error between the predicted and actual values.

For example, suppose we have created a model to predict prices of houses in New Delhi. Assume each prediction is denoted by ŷᵢ and the corresponding actual price by yᵢ, with n houses in total. The MSE equation can be defined as:

MSE = (1/n) * Σᵢ (yᵢ - ŷᵢ)²

MAE (Mean Absolute Error)

Mean absolute error (also known as mean absolute deviation) is another metric that finds the average absolute distance between the predicted and actual values. The MAE equation can be defined as:

MAE = (1/n) * Σᵢ |yᵢ - ŷᵢ|
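As a minimal numeric sketch of both regression metrics, assuming a handful of hypothetical house prices (the numbers are illustrative, not real data):

```python
import numpy as np

# Hypothetical actual prices and model predictions (same units).
y_true = np.array([250.0, 300.0, 180.0, 400.0])
y_pred = np.array([240.0, 310.0, 200.0, 390.0])

# MSE: mean of squared errors; MAE: mean of absolute errors.
mse = np.mean((y_true - y_pred) ** 2)
mae = np.mean(np.abs(y_true - y_pred))
print(mse, mae)  # 175.0 12.5
```

Because MSE squares each error, it penalizes large mistakes much more heavily than MAE, which is one common reason to prefer one over the other.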

So these are some metrics used for evaluating the performance of classification and regression models. Now you know which metric you should use to evaluate your model and you can create a more accurate model.

Always keep an eye on your model performance like Tony did :p

Stay tuned, happy learning 🙂

Follow MachineX Intelligence for more:

Written by 

Shubham Goyal is a Data Scientist at Knoldus Inc. He is also an artificial intelligence researcher, interested in research on problems from different domains, and a regular contributor to the community through blogs and webinars on machine learning and artificial intelligence. He has also written a few research papers on machine learning. Moreover, he is a conference speaker and an official author at Towards Data Science.