Model Evaluation Metrics for Machine Learning Algorithms

When you build a Machine Learning model, every audience, stakeholders included, asks the same questions: How does the model perform? Which evaluation metrics were used? How accurate is the model?

Model evaluation metrics quantify the performance of a model. Evaluating the model you have developed helps you refine and improve it, and you keep developing and evaluating until you reach an acceptable level of performance.

Regression Metrics

Regression models produce continuous output, so we need metrics based on the distance between the predicted values and the ground truth.

  • Mean Absolute Error (MAE)
  • Mean Squared Error (MSE)
  • Root Mean Squared Error (RMSE)
  • R-Squared
  • Adjusted R-Squared

Mean Absolute Error (MAE)

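Mean Absolute Error is the average of the absolute differences between the actual values and the predicted values. Mathematically, it is represented as:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

where $y_i$ is the actual value, $\hat{y}_i$ is the predicted value, and $n$ is the number of observations.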

A few key points about MAE:

  • MAE is more robust to outliers than squared-error metrics, since it does not exaggerate large errors.
  • It measures how far the predictions were from the actual output. However, because MAE uses the absolute value of the residual, it gives no idea of the direction of the error, that is, whether we are under-predicting or over-predicting.
  • Interpretation is straightforward: MAE is expressed in the same units as the target variable.
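
As a minimal sketch of the definition above, MAE can be computed in plain Python. The function name `mean_absolute_error` and the sample values are assumptions made for illustration; they do not come from any particular library or dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |actual - predicted| over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values, chosen only to illustrate the calculation.
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# Absolute errors are 0.5, 0.0, 1.5 and 1.0, so MAE = 3.0 / 4 = 0.75
mae = mean_absolute_error(actual, predicted)
print(mae)
```

The result is in the same units as the target, which is what makes MAE easy to read.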

Mean Squared Error (MSE)

Mean Squared Error is the average of the squared differences between the target values and the values predicted by the regression model:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

A few key points about MSE:

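  • MSE is differentiable, which makes it convenient as a loss function for gradient-based optimization.
  • Squaring amplifies large errors, so a handful of outliers can dominate the metric and overstate how bad the model is. Its units are the square of the target’s units, which makes it harder to interpret directly.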
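
The effect of squaring can be seen in a small sketch. The helper `mean_squared_error` and the data are hypothetical; the second prediction list differs from the first only by one deliberately large miss.

```python
def mean_squared_error(y_true, y_pred):
    """Average of (actual - predicted)^2 over all samples."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical values for illustration only.
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# Squared errors: 0.25, 0.0, 2.25, 1.0 -> MSE = 3.5 / 4 = 0.875
mse = mean_squared_error(actual, predicted)

# Same predictions, but the last one now misses by 10 instead of 1.
predicted_with_outlier = [2.5, 5.0, 4.0, 17.0]

# Squared errors: 0.25, 0.0, 2.25, 100.0 -> MSE = 102.5 / 4 = 25.625
mse_with_outlier = mean_squared_error(actual, predicted_with_outlier)

print(mse, mse_with_outlier)
```

A single bad prediction moves MSE from 0.875 to 25.625, while the other three errors are unchanged.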

Root Mean Squared Error (RMSE)

Root Mean Squared Error is the square root of the average of the squared differences between the target values and the values predicted by the regression model. Mathematically, it is represented as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

A few key points about RMSE:

  • It retains the differentiable property of MSE.
  • Taking the square root brings the error back to the scale of the target, which tempers the inflation introduced by squaring.
  • Error interpretation is straightforward, as the metric is now on the same scale as the target variable.
  • Outliers inflate RMSE less dramatically than they inflate MSE, though RMSE remains more sensitive to outliers than MAE.
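
A minimal sketch of RMSE, assuming a hypothetical helper `root_mean_squared_error` and made-up sample values, shows that the result lands back on the scale of the target:

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Square root of the mean squared error."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - p) ** 2 for a, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


# Hypothetical values for illustration only.
actual = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]

# MSE is 0.875, so RMSE = sqrt(0.875), roughly 0.935
rmse = root_mean_squared_error(actual, predicted)
print(round(rmse, 3))
```

For the same data the MAE is 0.75; RMSE is never smaller than MAE, and the gap widens as the errors become more uneven.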

R-Squared (Coefficient of Determination)

R-Squared works as a post metric. The point of calculating the coefficient of determination is to answer the question: “How much (what %) of the total variance in Y (the target) is explained by the regression line fitted on X?”

R-Squared describes the goodness of fit of a regression model. It helps to explain and account for the difference between the observed data and the predicted data.

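Let’s see the mathematical representation. The total variation in Y (the total sum of squares, which is proportional to the variance of Y) is:

$$SS_{tot} = \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2$$

The variation left unexplained by the regression line is the sum of squared errors, $SS_{res} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$, so the fraction of the variation described by the regression line is:

$$\frac{SS_{tot} - SS_{res}}{SS_{tot}}$$

Finally, the coefficient of determination formula tells how good or bad the fit of the regression line is:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$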

A few key points about R-Squared:

  • If the sum of squared errors of the regression line is small, R² will be close to 1 (the ideal), meaning the regression captured nearly all of the variance in the target variable.
  • Conversely, if the sum of squared errors of the regression line is high, R² will be close to 0, meaning the regression captured almost none of the variance in the target variable.
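
The two cases above can be sketched in plain Python. The function `r_squared` and the sample values are assumptions for illustration; the second call uses a "model" that always predicts the mean of the target.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_tot = sum((a - mean_y) ** 2 for a in y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    if ss_tot == 0:
        raise ValueError("R-squared is undefined when the target is constant")
    return 1 - ss_res / ss_tot


# Hypothetical values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
close_fit = [1.1, 1.9, 3.2, 3.8, 5.0]

# SS_tot = 10 and SS_res is about 0.10, so R-squared is about 0.99
r2_close = r_squared(actual, close_fit)

# Always predicting the mean (3.0) explains none of the variance.
r2_mean = r_squared(actual, [3.0] * len(actual))

print(round(r2_close, 2), r2_mean)
```

With this formula, a model that fits worse than the mean-only baseline produces a negative value.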

Adjusted R-Squared

The coefficient of determination has a drawback. When new features are added to the model, the R-Squared value either increases or stays the same. It does not penalize features that add no value to the model. Adjusted R-Squared is an improved version of R-Squared defined to overcome this drawback.

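Adjusted R-Squared is never higher than R-Squared. It adjusts for the number of features, or independent variables, in the model, so it increases only when a new feature improves the fit by more than would be expected by chance.

$$R^2_{adj} = 1 - \frac{\left(1 - R^2\right)\left(n - 1\right)}{n - k - 1}$$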

  • n – number of observations
  • k – number of independent variables
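
To see the penalty at work, here is a small sketch using the standard adjusted R-Squared formula with n and k as defined above. The function name `adjusted_r_squared` and the input numbers are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more observations than features plus one")
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Hypothetical scenario: the same R-squared of 0.90 on 50 observations,
# reached once with 5 features and once with 20 features.
adj_few = adjusted_r_squared(0.90, n=50, k=5)    # 1 - 0.1 * 49 / 44
adj_many = adjusted_r_squared(0.90, n=50, k=20)  # 1 - 0.1 * 49 / 29

print(round(adj_few, 3), round(adj_many, 3))
```

Both values fall below 0.90, and the model that needed 20 features to reach the same R-Squared is penalized more.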

Classification Metrics

Classification is one of the most widely researched areas of Machine Learning, and use cases are present in almost all production and industrial environments. Speech recognition, face recognition, text classification: the list is endless.

Classification models produce discrete output, so we need metrics that compare discrete classes in some form. Classification metrics tell us how good or bad the classification is, but each of them evaluates it in a different way.

  • Confusion Matrix
  • Accuracy
  • Precision
  • Recall
  • F1-Score
  • AUC-ROC

Confusion Matrix

A confusion matrix is a tabular visualization of the ground-truth labels versus the model’s predictions. Here, each row of the confusion matrix represents the instances in a predicted class, while each column represents the instances in an actual class. Some libraries, such as scikit-learn, use the transposed layout.

It shows the number of test cases classified correctly and incorrectly. With 1 (Positive) and 0 (Negative) as the target classes, its four cells are:

  1. TN: Number of negative cases correctly classified as negative
  2. TP: Number of positive cases correctly classified as positive
  3. FN: Number of positive cases incorrectly classified as negative
  4. FP: Number of negative cases incorrectly classified as positive
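
The four counts can be tallied directly from paired labels. The sketch below is a plain-Python illustration; `confusion_counts` and the label lists are hypothetical names and data.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            counts["TP"] += 1   # positive, predicted positive
        elif predicted == positive:
            counts["FP"] += 1   # negative, predicted positive
        elif actual == positive:
            counts["FN"] += 1   # positive, predicted negative
        else:
            counts["TN"] += 1   # negative, predicted negative
    return counts


# Hypothetical labels for illustration only (1 = positive, 0 = negative).
actual_labels = [1, 0, 1, 1, 0, 0, 1, 0]
predicted_labels = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_counts(actual_labels, predicted_labels)
print(cm)  # {'TP': 3, 'FP': 1, 'FN': 1, 'TN': 3}
```

Every metric in the rest of this section is a ratio built from these four numbers.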


Accuracy

Accuracy is the number of test cases correctly classified divided by the total number of test cases:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

It applies to most generic problems, but it is not very useful for imbalanced datasets.

For instance, if we are detecting fraud in bank data, the ratio of fraud to non-fraud cases can be 1:99. In such a case, a model that predicts every test case as non-fraud is 99% accurate, yet it is completely useless.

Suppose a poorly trained model predicts all 1000 (say) data points as non-fraud. It misses all 10 fraud data points, but it still classifies 990 data points correctly, so its accuracy is (990/1000)*100 = 99%! This is why accuracy is a misleading indicator of the model’s health on imbalanced data. Such a case calls for a metric that focuses on the ten fraud data points that the model missed completely.
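
The fraud scenario above can be reproduced in a few lines. This is only a sketch: the `accuracy` helper is a hypothetical name, and the labels are synthetic (10 fraud cases, 990 non-fraud cases, and a model that always predicts non-fraud).

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for a, p in zip(y_true, y_pred) if a == p)
    return correct / len(y_true)


# Synthetic data: 1 = fraud, 0 = non-fraud.
actual = [1] * 10 + [0] * 990
predicted = [0] * 1000  # the model labels everything as non-fraud

acc = accuracy(actual, predicted)

# How many of the 10 fraud cases did the model actually catch?
frauds_caught = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)

print(acc, frauds_caught)  # 0.99 0
```

The accuracy is 0.99 even though not a single fraud case was detected.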


Precision

Precision measures the correctness of the positive classifications. It is the ratio of correct positive classifications to the total number of predicted positive classifications:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

The greater the fraction, the higher the precision, and the better the model’s ability to classify the positive class correctly. Precision matters in problems such as predictive maintenance, where every false alarm carries a cost.
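
A minimal sketch of precision computed from confusion-matrix counts follows. The counts are hypothetical, and returning 0.0 when the model makes no positive predictions is one common convention assumed here, not something the text prescribes.

```python
def precision(tp, fp):
    """Correct positive predictions divided by all positive predictions."""
    if tp + fp == 0:
        return 0.0  # assumed convention: no positive predictions at all
    return tp / (tp + fp)


# Hypothetical counts: 40 positive predictions, 30 of them correct.
p = precision(tp=30, fp=10)
print(p)  # 0.75
```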


Recall

Recall is the number of positive cases correctly identified out of the total number of positive cases:

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

Going back to the fraud problem, recall is very useful there: a high recall value indicates that most of the fraud cases were identified.
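
Recall can be sketched the same way from confusion-matrix counts. The `recall` helper and the numbers are hypothetical; the first call mirrors a model that labels every case as non-fraud and so misses all 10 fraud cases.

```python
def recall(tp, fn):
    """Correctly identified positives divided by all actual positives."""
    if tp + fn == 0:
        return 0.0  # assumed convention: no actual positives in the data
    return tp / (tp + fn)


# A model that predicts "non-fraud" for everything: 0 caught, 10 missed.
recall_all_negative = recall(tp=0, fn=10)

# A hypothetical better model that catches 8 of the 10 fraud cases.
recall_better = recall(tp=8, fn=2)

print(recall_all_negative, recall_better)  # 0.0 0.8
```

Unlike accuracy, recall exposes a model that never flags the positive class: it scores 0.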


F1-Score

The F1 score is the harmonic mean of recall and precision, and thus balances the two:

$$F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

It is useful in cases where both recall and precision are valuable, such as identifying plane parts that might require repair. Precision keeps the company’s costs down, while recall ensures that the machinery is stable and not a threat to human lives.
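
The harmonic mean punishes an imbalance between precision and recall more than a simple average would. The sketch below uses a hypothetical `f1_score` helper and made-up precision and recall values.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# Hypothetical, fairly balanced model.
f1_balanced = f1_score(0.75, 0.6)   # 0.9 / 1.35, roughly 0.667

# Hypothetical model with perfect precision but very poor recall.
f1_lopsided = f1_score(1.0, 0.1)    # 0.2 / 1.1, roughly 0.182

print(round(f1_balanced, 3), round(f1_lopsided, 3))
```

The arithmetic mean of 1.0 and 0.1 would be 0.55, but the F1 score is about 0.18, so a model cannot score well by excelling at only one of the two.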


AUC-ROC

The ROC curve is a plot of the true positive rate (recall, TP / (TP + FN)) against the false positive rate (FP / (FP + TN)) across classification thresholds. AUC-ROC stands for Area Under the Receiver Operating Characteristic curve. The higher the area, the better the model’s performance. If the curve lies near the 50% diagonal line, the model is predicting the output variable no better than random guessing.
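
As a rough sketch of how the curve and its area come about, the code below sweeps a threshold over predicted scores, records one (FPR, TPR) point per threshold, and integrates with the trapezoidal rule. The helpers `roc_points` and `auc`, and the four-sample dataset, are assumptions made for illustration; the sketch also assumes both classes are present in the labels.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points, one per distinct score used as a threshold."""
    positives = sum(1 for a in y_true if a == 1)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; a sample is predicted positive
    # when its score is at or above the threshold.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(y_true, scores) if s >= threshold and a == 1)
        fp = sum(1 for a, s in zip(y_true, scores) if s >= threshold and a == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve via the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical labels and model scores for illustration only.
actual = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

curve = roc_points(actual, scores)
area = auc(curve)
print(curve)
print(area)  # 0.75
```

An area of 0.75 sits between random guessing (0.5) and a perfect ranking (1.0): one of the two negatives is scored above one of the two positives.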


Machine Learning model evaluation techniques are fairly extensive, but with practice and a steady investment of time they become easy to apply. Happy Learning!




Written by 

Working as a Sr. Software Consultant AI/ML at Knoldus. Like exploring more of Data Science and its related technology. Current learning areas are Natural Language Processing, Deep Learning and Artificial Intelligence.