Notes on eXtreme Gradient Boosting

Trevor Hastie hypothesizes (slide 17) that in terms of model accuracy: 

Boosting > Random Forest > Bagging > Single Tree

Thanks to XGBoost, it's become computationally feasible to test Hastie's claim about boosting's superiority on pretty much any size or shape of dataset.  The darling of many recent Kaggle competitions, XGBoost is a library for fast, general-purpose gradient boosting. It is parallelized using OpenMP and also provides an AllReduce-based distributed version. It implements generalized linear models and gradient boosted regression trees.

In a quick laptop-scale experiment on covertype, we see that boosting can indeed achieve slightly better prediction accuracy (2.5% error vs. a Random Forest benchmark of 2.6%). Covertype is a classic dataset for benchmarking multi-class, non-linear algorithms. The data consist of 54 variables and 581,012 observations, spread across 7 classes (some of which are minority classes).
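
For reference, here is a minimal sketch of pulling the same dataset in Python. It assumes scikit-learn's bundled fetch_covtype loader; the loader and the printed checks are my additions, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import fetch_covtype  # assumption: scikit-learn's covertype loader

# Fetch the UCI covertype data: 581,012 rows x 54 columns, labels 1-7
cov = fetch_covtype()
X, y = cov.data, cov.target
print(X.shape)        # (581012, 54)
print(np.unique(y))   # [1 2 3 4 5 6 7] -- note the 1-based class labels
```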

Some crib notes from building this model:

  • Setting the shrinkage parameter low (eta < 0.1) yields a significant improvement in model generalization, but it must be offset computationally by running more rounds of boosting to capture the residuals (see the training sketch after this list).
  • XGBoost expects multi-class labels to be 0 (zero) indexed.  The dataset comes with class labels 1-7, so we need to shift them by -1.
  • The watchlist does not affect model training.  It is simply a way to monitor prediction error on an independent, held-out sample during the training process.
  • The categorical variable in this particular dataset has already been expanded into binary indicator columns, but if that weren't the case, one-hot encoding would need to be applied to all categoricals (see Orchestra.jl, or the encoding sketch after this list).
  • XGBoost grows one tree per class per round of boosting (in this case 7 x 500, or 3,500 trees).  As model complexity grows, prediction time increases: this model scored ~110k observations in 18 seconds, which may be fine for some use cases but is perhaps too slow for adtech.
  • The shrinkage parameter (eta) only affects the prediction scores at the leaf nodes and not the shape of the tree, whereas the other parameters (gamma, max_depth, min_child_weight, max_delta_step, subsample, colsample_bytree) influence the tree shape.
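
To illustrate the one-hot encoding point above, here is a minimal sketch using pandas. The soil_type column and its values are hypothetical; covertype itself already ships with the indicator columns expanded.

```python
import pandas as pd

# Hypothetical frame with a raw categorical column (covertype is already expanded)
df = pd.DataFrame({"elevation": [2596, 2590, 2804],
                   "soil_type": ["C7745", "C7745", "C4703"]})

# Expand the categorical into 0/1 indicator columns before handing the matrix to XGBoost
df_encoded = pd.get_dummies(df, columns=["soil_type"])
print(df_encoded.columns.tolist())
# ['elevation', 'soil_type_C4703', 'soil_type_C7745']
```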
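Putting the notes above together, here is a minimal sketch of the kind of run described, using the Python xgboost API. The split ratio, parameter values, and timing code are my assumptions for illustration, not the exact settings behind the 2.5% figure.

```python
import time
import xgboost as xgb
from sklearn.datasets import fetch_covtype
from sklearn.model_selection import train_test_split

cov = fetch_covtype()
X, y = cov.data, cov.target - 1   # shift labels 1-7 down to 0-6: XGBoost expects 0-indexed classes

# Hold out a sample for the watchlist; it is monitored during training but never trained on
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
dtrain = xgb.DMatrix(X_train, label=y_train)
dvalid = xgb.DMatrix(X_valid, label=y_valid)

params = {
    "objective": "multi:softmax",   # predict the class label directly
    "num_class": 7,
    "eta": 0.1,                     # shrinkage: scales each tree's leaf scores, not its shape
    "max_depth": 6,                 # the parameters below do influence tree shape
    "min_child_weight": 1,
    "gamma": 0,
    "subsample": 1.0,
    "colsample_bytree": 1.0,
    "eval_metric": "merror",        # multi-class error, reported for each watchlist entry
}
watchlist = [(dtrain, "train"), (dvalid, "valid")]

# Low eta needs more rounds; 500 rounds x 7 classes = 3,500 trees in the final model
bst = xgb.train(params, dtrain, num_boost_round=500, evals=watchlist)

# Scoring time grows with the number of trees (~110k validation rows here)
start = time.time()
pred = bst.predict(dvalid)
print(f"scored {dvalid.num_row()} rows in {time.time() - start:.1f}s, "
      f"error = {(pred != y_valid).mean():.3f}")
```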
[Figure: example trees from the model. Left: eta=0.1 vs. eta=0.5; right: colsample_bytree=1 vs. colsample_bytree=0.5.]

Resources

Finally, here are some great resources if you want to continue learning about boosting.

Tianqi Chen - Introduction to Boosted Trees

Trevor Hastie - Gradient Boosting Machine Learning

Mikhail Golovnya - Advances In Gradient Boosting: The Power of Post-Processing

Hastie, Tibshirani & Friedman - Elements of Statistical Learning (Chapter 10)

Hastie & Tibshirani - Stanford Statistical Learning Course