Concept | Tuning XGBoost models in Python#

XGBoost is a gradient boosted trees library with a Python API. It is integrated into Dataiku visual machine learning, meaning that you can train XGBoost models without writing any code.

Here, we are going to cover some advanced optimization techniques that can help you go even further with your XGBoost models, by using custom Python code.

We assume that you are already familiar with how to train a model using Python code (for example with scikit-learn).

Using a sparse matrix#

XGBoost can take a sparse matrix as input. This allows you to convert categorical variables with high cardinality into a dummy matrix, then build a model without getting an out of memory error.

For this we use a Python function:

import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

def sparse_dummies(categorical_values):
    # Encode each distinct value as an integer code (0, 1, 2, ...)
    categories = pd.Categorical(categorical_values)
    N = len(categorical_values)
    row_numbers = np.arange(N, dtype=np.int64)
    ones = np.ones((N,))
    # One stored 1.0 per row, in the column matching that row's category code.
    # Missing values get code -1, so fill or drop them before calling this.
    return csr_matrix((ones, (row_numbers, categories.codes)))

sparse_dummies(df.VAR_0001)

This returns a sparse matrix with 3 columns, one per distinct value of VAR_0001:

<145231x3 sparse matrix of type '<type 'numpy.float64'>'
      with 145231 stored elements in Compressed Sparse Row format>

You can concatenate this matrix with another dummy matrix with the SciPy hstack function:

from scipy.sparse import hstack
cat1 = sparse_dummies(df.VAR_0001)
cat2 = sparse_dummies(df.VAR_0002)
hstack((cat1,cat2), format="csr")
<145231x7 sparse matrix of type '<type 'numpy.float64'>'
    with 290462 stored elements in Compressed Sparse Row format>
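
To make the column layout concrete, here is a minimal sketch on toy data. It assumes a NumPy-only variant of sparse_dummies (built on np.unique rather than pandas) and made-up column values; the names colors, sizes, and numeric are purely illustrative. It also shows that a numeric column can be stacked next to the dummy matrices once it is wrapped in csr_matrix.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

def sparse_dummies(values):
    # np.unique returns the sorted distinct values and, for each row,
    # the position of its value in that sorted list (its integer code).
    categories, codes = np.unique(np.asarray(values), return_inverse=True)
    n = len(codes)
    rows = np.arange(n)
    ones = np.ones(n)
    # Passing shape explicitly keeps the column count stable even if the
    # last category happened to be absent from the data.
    return csr_matrix((ones, (rows, codes)), shape=(n, len(categories)))

# Toy data (illustrative only): two categorical columns and one numeric column
colors = ["red", "green", "blue", "green", "red"]
sizes = ["S", "M", "S", "L", "M"]
numeric = np.array([[1.5], [2.0], [0.5], [3.0], [1.0]])

# 3 color columns + 3 size columns + 1 numeric column
X = hstack((sparse_dummies(colors), sparse_dummies(sizes), csr_matrix(numeric)),
           format="csr")
print(X.shape)  # (5, 7)
print(X.nnz)    # 15 stored values instead of 35 dense cells
```

Only the non-zero cells are stored, which is why the memory footprint grows with the number of rows rather than with rows times categories.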

Early stopping#

Note

When creating an XGBoost model using the visual machine learning component of Dataiku, it automatically uses early stopping (you don’t need to write any code to benefit from this).

A really cool feature of XGBoost is early stopping. As you train more and more trees, you eventually overfit your training dataset. Early stopping lets you specify a validation dataset and a number of rounds: training stops once the score on the validation dataset has not improved for that many consecutive rounds.
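
The stopping rule itself is simple enough to sketch in plain Python. The function below is not XGBoost's implementation, only an illustration of the rule under the assumption that the metric is maximized (as with AUC); the name early_stopping_point and the toy scores are hypothetical.

```python
def early_stopping_point(scores, patience):
    """Return (best_round, last_round) for a metric where higher is better.

    Training stops at the first round that is `patience` rounds past the
    best round seen so far without any improvement.
    """
    best_round = 0
    for i, score in enumerate(scores):
        if score > scores[best_round]:
            best_round = i          # new best validation score
        elif i - best_round >= patience:
            return best_round, i    # no improvement for `patience` rounds
    # The scores ran out before the patience was exhausted
    return best_round, len(scores) - 1

# Toy validation scores (illustrative only)
valid_auc = [0.70, 0.73, 0.75, 0.74, 0.75, 0.74, 0.73]
print(early_stopping_point(valid_auc, patience=3))  # (2, 5)
```

The best round (2) is what you would keep; the rounds trained after it are the cost of waiting to confirm that the score has stopped improving.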

To use it, you can specify in the fit method of the classifier an evaluation set, an evaluation method and the early stopping round number:

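import xgboost as xgb

# XGBoost 2.0+ takes eval_metric and early_stopping_rounds in the constructor instead.
clf = xgb.XGBClassifier(n_estimators=10000)
eval_set = [(train, y_train), (valid, y_valid)]
clf.fit(train, y_train, eval_set=eval_set,
        eval_metric="auc", early_stopping_rounds=30)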

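Here, we explicitly set n_estimators to a very large number, so that early stopping rather than the tree limit decides when training ends. In your job log, you’ll see the score increasing on each dataset in the eval_set list (validation_0 is the training set, validation_1 is the validation set):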

Will train until validation_1 error hasn't decreased in 30 rounds.
[0]    validation_0-auc:0.733451   validation_1-auc:0.698659
[1]    validation_0-auc:0.776699   validation_1-auc:0.731099
[2]    validation_0-auc:0.789156   validation_1-auc:0.740601
[3]    validation_0-auc:0.792534   validation_1-auc:0.744378
[4]    validation_0-auc:0.800747   validation_1-auc:0.748260
[5]    validation_0-auc:0.805586   validation_1-auc:0.750209
[6]    validation_0-auc:0.810889   validation_1-auc:0.752157
[7]    validation_0-auc:0.812459   validation_1-auc:0.752554
[8]    validation_0-auc:0.812928   validation_1-auc:0.752733
[9]    validation_0-auc:0.813815   validation_1-auc:0.753650
[10]   validation_0-auc:0.814547   validation_1-auc:0.753750
...
...
...
[271]  validation_0-auc:0.897922   validation_1-auc:0.782187
[272]  validation_0-auc:0.898150   validation_1-auc:0.782179
[273]  validation_0-auc:0.898150   validation_1-auc:0.782179
[274]  validation_0-auc:0.898439   validation_1-auc:0.782225
[275]  validation_0-auc:0.898439   validation_1-auc:0.782225
[276]  validation_0-auc:0.898591   validation_1-auc:0.782219
Stopping. Best iteration:
[246]  validation_0-auc:0.894087   validation_1-auc:0.782487

Note that you can define your own evaluation metric instead.

Viewing feature importance#

You can get the feature importance with clf.get_booster().get_fscore(), where clf is your trained classifier (older XGBoost releases named this method clf.booster()).

For example, we can use this in a Jupyter notebook:

import pandas as pd

features = [ "your list of features ..." ]
# Map XGBoost's default feature names (f0, f1, ...) back to readable names
mapFeat = dict(zip(["f"+str(i) for i in range(len(features))], features))
ts = pd.Series(clf.get_booster().get_fscore())
ts.index = ts.index.map(mapFeat)
# Plot the 15 most important features
ts.sort_values()[-15:].plot(kind="barh", title="Feature importance")
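
The name mapping can be checked without a trained model. The sketch below uses a hypothetical dictionary in the shape that get_fscore() returns; the feature names and counts are made up for illustration. Note that features never used in a split are absent from the dictionary.

```python
# Hypothetical feature list, in the column order used for training
features = ["age", "income", "city", "tenure"]

# Hypothetical get_fscore() result: keys are f<column index>, values are
# split counts. "f2" is missing because it was never used in a split.
fscore = {"f0": 12, "f1": 40, "f3": 7}

# Map f0, f1, ... back to readable names
name_of = {"f" + str(i): name for i, name in enumerate(features)}

# Rank features from most to least used
ranked = sorted(((name_of[key], count) for key, count in fscore.items()),
                key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('income', 40), ('age', 12), ('tenure', 7)]
```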

Using Hyperopt for hyperparameter search#

You can fine-tune your XGBoost model by exploring the space of possible parameter values. For this task, you can use the hyperopt package. Hyperopt is a Python library for optimizing over awkward search spaces with real-valued, discrete, and conditional dimensions.

Here is an example Python recipe that uses it:

import dataiku
import pandas as pd, numpy as np
from dataiku import pandasutils as pdu
from sklearn.metrics import roc_auc_score
import xgboost as xgb
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials

train = dataiku.Dataset("train").get_dataframe()
valid = dataiku.Dataset("valid").get_dataframe()

y_train = train.target
y_valid = valid.target

del train["target"]
del valid["target"]

def objective(space):

    clf = xgb.XGBClassifier(n_estimators = 10000,
                            # hp.quniform returns floats; max_depth must be an integer
                            max_depth = int(space['max_depth']),
                            min_child_weight = space['min_child_weight'],
                            subsample = space['subsample'])

    eval_set  = [(train, y_train), (valid, y_valid)]

    # Train with early stopping on the validation set
    clf.fit(train, y_train,
            eval_set=eval_set, eval_metric="auc",
            early_stopping_rounds=30)

    # Score the validation set; Hyperopt minimizes, so return 1 - AUC
    pred = clf.predict_proba(valid)[:,1]
    auc = roc_auc_score(y_valid, pred)
    print("SCORE:", auc)

    return {'loss': 1-auc, 'status': STATUS_OK}


space = {
    'max_depth': hp.quniform('x_max_depth', 5, 30, 1),
    'min_child_weight': hp.quniform('x_min_child', 1, 10, 1),
    'subsample': hp.uniform('x_subsample', 0.8, 1)
}


trials = Trials()
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=100,
            trials=trials)

print(best)

After loading your training and validation datasets, we define our objective function.

This function trains a model, evaluates it and returns the error on the validation set. We define the space we want to explore: here, we want to try values from 5 to 30 for max_depth, from 1 to 10 for min_child_weight and from 0.8 to 1 for subsample.

Hyperopt will minimize this error in a maximum of 100 experiments.
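
The dictionary returned by fmin is keyed by the labels given to the hp functions (x_max_depth, x_min_child, x_subsample), not by the XGBoost parameter names, and hp.quniform values come back as floats. Here is a minimal sketch of turning it into constructor arguments; the values in best are hypothetical, and the helper to_xgb_params is our own illustration, not part of Hyperopt.

```python
# Hypothetical result of fmin, for illustration only
best = {"x_max_depth": 12.0, "x_min_child": 3.0, "x_subsample": 0.91}

# Hyperopt label -> (XGBoost parameter name, type to cast to)
label_to_param = {
    "x_max_depth": ("max_depth", int),
    "x_min_child": ("min_child_weight", int),
    "x_subsample": ("subsample", float),
}

def to_xgb_params(best):
    # Rename each label and cast quniform floats back to integers
    return {param: cast(best[label])
            for label, (param, cast) in label_to_param.items()}

params = to_xgb_params(best)
print(params)  # {'max_depth': 12, 'min_child_weight': 3, 'subsample': 0.91}
```

The resulting dictionary can then be unpacked into the classifier, for example xgb.XGBClassifier(**params), to retrain a final model with the selected values.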

See the Hyperopt documentation for more details.

What’s next?#

If modeling in the Visual Analysis with custom code does not suit your needs, you can also take full control by coding the whole machine learning pipeline yourself (training, scoring, validation, etc.) in your preferred language (Python, R, Scala, or Shell), which lets you use any external ML library.