MachineX: Boosting performance with XGBoost

Reading Time: 5 minutes

In this blog, we are going to see how XGBoost works and look at some of its important features with the help of an example.

Many of us have heard about tree models and boosting techniques. Let’s put these concepts together and talk about XGBoost, one of the most powerful machine learning algorithms in use today.

XGBoost stands for eXtreme Gradient Boosting.

The name XGBoost, though, actually refers to the engineering goal of pushing the limits of computational resources for boosted tree algorithms, which is one of the main reasons so many people use it.

Ever since its introduction in 2014, XGBoost has offered high predictive power and is often reported to be around 10 times faster than other gradient boosting implementations. It also includes a variety of regularization options that reduce overfitting and improve overall performance; hence it is also known as a ‘regularized boosting‘ technique. XGBoost has proved its mettle in terms of both performance and speed.

Okay, don’t worry, let’s first talk about boosting.

What is a boosted tree?

We have encountered a lot of tree-based algorithms, such as decision trees, where we train a single model on a particular dataset, maybe with some parameter tuning. In classic ensemble models, too, we train all the models separately and then combine them.

Boosting is also an ensemble technique: it combines many models to produce a final one, but rather than training all the models independently, boosting trains them in sequence. Every new model is trained to correct the errors of the previous ones, and the sequence stops when there is no further improvement. That is why boosting tends to be more accurate, as the sketch below illustrates.
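To make the sequential idea concrete, here is a minimal hand-rolled boosting sketch for a toy regression problem. It uses plain Scikit Learn regression trees and explicit residual fitting purely as an illustration; it is not how XGBoost is implemented internally.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# toy 1-D regression problem
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=100)

learning_rate = 0.2              # the "eta" of this little ensemble
prediction = np.zeros_like(y)    # start from a trivial model that predicts 0
trees = []

for _ in range(50):
    residual = y - prediction                      # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=2)
    tree.fit(X, residual)                          # new tree learns to correct them
    prediction += learning_rate * tree.predict(X)  # shrink its contribution and add it
    trees.append(tree)

print("Training MSE:", np.mean((y - prediction) ** 2))

Each new tree only sees what the ensemble so far still gets wrong, which is exactly the sequential error-correction described above.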

Installing XGBoost

There is a comprehensive installation guide on the XGBoost documentation website.

It covers installation for Linux, Mac OS X and Windows.

It also covers installation for languages such as R and Python.
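For a typical Python setup, the simplest route is usually installing the prebuilt package from PyPI (this assumes a recent pip; for GPU builds or other languages, follow the official guide):

pip install xgboost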

Setting up our data

So, the first thing is to prepare the data for our model. We are going to use the iris flower dataset from Scikit Learn.

Here, we load our dataset from Scikit Learn in Python and also import the XGBoost library:

from sklearn import datasets
import xgboost as xgb

iris = datasets.load_iris()
X = iris.data    # features: sepal and petal measurements
y = iris.target  # labels: the three iris species (0, 1, 2)

Next, we have to split our dataset into two parts: train and test data. This is an important step for seeing how well our model generalizes. So, we are going to split our data 80%/20% into train and test sets.

from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.2)

Unlike most other libraries, XGBoost needs our data to be transformed into a specific format, the DMatrix.

DMatrix is an internal data structure used by XGBoost which is optimized for both memory efficiency and training speed.

D_train = xgb.DMatrix(X_train, label=Y_train)
D_test = xgb.DMatrix(X_test, label=Y_test)

Now we have our numpy arrays converted to the DMatrix format to feed our model. But before training, we need to define the model.

Define an XGBoost model

The first thing we have to do is define the parameters of our gradient boosting ensemble. There are a large number of parameters available for the model, but for now we are going to focus on some of the important ones. The full list of possible parameters is available on the official XGBoost website.

param = {
    'eta': 0.2,                     # learning rate
    'max_depth': 4,                 # maximum depth of each tree
    'objective': 'multi:softprob',  # multi-class classification, output class probabilities
    'num_class': 3                  # the iris dataset has three classes
    }

steps = 20  # number of boosting rounds

So here are our parameters:

  • max_depth: maximum depth of the decision trees being trained
  • objective: the loss function being used
  • num_class: the number of classes in the dataset
  • eta: the learning rate

As we already know, this kind of model works in a sequential way, which makes it more complex. This technique is therefore quite prone to overfitting.

The eta parameter (learning rate) helps our algorithm prevent overfitting: rather than adding the predictions of new trees to the ensemble with full weight, each new tree’s contribution is multiplied by eta, which reduces its weight.

Note: it is advised to keep eta small, typically in the range of 0.1 to 0.3.
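In standard gradient-boosting notation (a generic formulation, not anything specific to XGBoost’s internals), each boosting round adds a shrunken copy of the new tree h_m to the current ensemble F_{m-1}:

F_m(x) = F_{m-1}(x) + \eta \, h_m(x)

so with a small eta, each individual tree contributes only a fraction of its raw prediction.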

We have our model defined now, so let’s train it.

Training and Testing

model = xgb.train(param, D_train, steps)

The process is very similar to Scikit Learn, and running an evaluation is also very familiar.

import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

preds = model.predict(D_test)                                 # per-class probabilities for each test row
best_preds = np.asarray([np.argmax(line) for line in preds])  # pick the most probable class

print("Precision = {}".format(precision_score(Y_test, best_preds, average='macro')))
print("Recall = {}".format(recall_score(Y_test, best_preds, average='macro')))
print("Accuracy = {}".format(accuracy_score(Y_test, best_preds)))

Running this prints the precision, recall, and accuracy on the test set. That’s great: we achieved an accuracy above 90%.

As mentioned above, we have a lot of parameters, and choosing the wrong ones can affect your model’s performance quite a lot.

So the question here is: how do we choose the right parameters?

Well, it is easy enough to compare model performance across different values. Let’s see how.

Finding optimal parameters

Setting the optimal hyperparameters of any ML model can be a challenge. So why not let Scikit Learn do it for you? We can combine Scikit Learn’s grid search with an XGBoost classifier quite easily:

from sklearn.model_selection import GridSearchCV

clf = xgb.XGBClassifier()
parameters = {
     "eta"    : [0.05, 0.10, 0.15, 0.20, 0.25, 0.30 ] ,
     "max_depth"        : [ 3, 4, 5, 6, 8, 10, 12, 15],
     "min_child_weight" : [ 1, 3, 5, 7 ],
     "gamma"            : [ 0.0, 0.1, 0.2 , 0.3, 0.4 ],
     "colsample_bytree" : [ 0.3, 0.4, 0.5 , 0.7 ]
     }

grid = GridSearchCV(clf,
                    parameters, n_jobs=4,
                    scoring="neg_log_loss",
                    cv=3)

grid.fit(X_train, Y_train)
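Once the fit finishes, the winning combination can be inspected through GridSearchCV’s standard attributes, for example:

print(grid.best_params_)  # best hyper-parameter combination found
print(grid.best_score_)   # its mean cross-validated score (here, negative log loss)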


Only do that on a big dataset if you have time to kill — doing a grid search is essentially training an ensemble of decision trees many times over!

Once your XGBoost model is trained, you can dump a human-readable description of it into a text file:

model.dump_model('dump.raw.txt')

So this is how we can create an XGBoost model and choose ideal hyper-parameters for it.

Stay tuned, happy learning 🙂

Follow MachineX Intelligence for more.

Written by Shubham Goyal

Shubham Goyal is a Data Scientist at Knoldus Inc. He is an artificial intelligence researcher, interested in working on problems from different domains, and a regular contributor to the community through blogs and webinars on machine learning and artificial intelligence. He has also written a few research papers on machine learning, and is a conference speaker and an official author at Towards Data Science.