# Gradient Boosted Trees

Gradient Boosted Trees (a.k.a GBT) is a commonly used tree-based ML algorithm which works for both regression and classification type of data mining problems. This algorithm has attracted a lot of attention from data science communities because of its success in industrial problems and data mining competition platforms such as Kaggle. Actually, GBT (XGBoost) algorithm was used for almost half of the winning solutions on Kaggle contests.

In this article, we will discuss what makes this algorithm really good compared to others. We will discover the technical details and shed light on the mathematical formulation of the GBT algorithm in order to use it for our own needs with the best performance (configuration). I am not going to stop saying that if we don’t know the math behind any ML algorithm, then it is not possible to optimise it for your own data problems. Therefore, if you wonder about a ML algorithm, I strongly suggest you to learn it from technical papers instead of just discovering its R/Python packages and their parameters.

GBT is a tree ensemble model since it is consisted of multiple regression trees. Let’s start with defining objective function of GBT. As we already talked about in my previous article, there are two major parts in any objective function : training loss and regularization term. Training loss function may differ with respect to data mining problem type (regression, classification or ranking etc.) and regularization term is specifically defined for each algorithm. We are looking for these definitions right now for GBT. Assume that we have K trees in our tree-ensemble, our prediction for instance (i) is going to be sum of K trees (or we can call this K ‘functions’). We ask for prediction for instance (i) from each of K trees(functions) and sum of them will be final prediction →

Now, it is time to define our objective function for GBT in general. Training loss is easy to define since it measures the distance (difference) between predictions and their real values. However, things are little different for the regularization term part. We penalise weight parameters in a linear regression to overcome complexity problem of a model. Since it is a tree-based algorithm, we don’t have weights parameters like we do in a linear regression algorithm. Therefore, we have to think about it and define new way to penalise model complexity. Possible ways to do it : Number of nodes in a tree or depth, L2 norm of the leaf weights etc. Let me explain these : If the number of nodes or depth of trees are high, then it is easy to overfit. Likewise, if the leaf weights are high, it means that you don’t give much chance for new upcoming trees in your ensemble. This also makes the algorithm unguarded to overfitting problems. While keeping these factors in our minds, let’s define our objective function here →

Next question is ‘how do we learn?’ Since we are learning trees (functions) in GBT instead of just numerical vectors, we can not use Stochastic Gradient Descent (SGD) here. Another methodology is used to find the best form of trees in the ensemble which is called Additive Training.

Start from constant prediction and add the best tree (function) for each round while minimizing objective function (training loss+regularization).

As you can see in the equations above, at each round a new function (tree) is added into our ensemble. While GBT is adding a new function or the best tree in other words, the algorithm always tries to optimise objective function in backend side. For example, at round (t), new function is added to previous round (t-1) tree-ensemble with the aim of minimising objective function. At round (t), f_t(x_i) is what we need to decide.

As shown in first red circle in the equations above, we can write our prediction (y cap) for instance (i) at round (t) as a summation of prediction at round (t-1) and new function value for instance (i) at round (t). If we consider square loss, then training loss part of the objective function can be written as shown in the second line. Note that in the second line of the formulation, there should be also square of difference between prediction value of instance (i) at round (t-1) and real value of instance (i). However, since it is a constant residual from previous round and no effect on deciding new function on round (t), that term is not written separately and included in const.

Let’s think about the Taylor Approximation shown below : x+₺∆x is the parameter of the approximated function. We will use this Taylor expansion formula to approximate our objective function. Here, question is that ‘how we can convert our objective function in the form of Taylor Approximation formula?’. While converting objective function into the Taylor form, function f represents loss function l in our objective, x represents training loss of instances at round (t-1) and ∆x represents the new added function (tree) f at round (t).

Taylor Approximation

Objective Function →

Define →

Approximated Objective →

Think of g and h functions as derivatives of square loss and refer the previously written objective formula :

g and h functions :

objective function :

When constants are removed from the approximated objective function, then final formula will be like shown below and now our goal is to minimize this function.

Let’s interpret our new objective function. Now, learning of new function f at round (t) only depends on objective via g and h functions. g function means that first derivative of loss function at round (t-1) for instance i and h function means that second derivative of loss function at round (t-1) for instance i. Since h is second degree derivative, it returns a constant number if talk about square loss. (In square loss case, it returns 2) However, function g shows the gradient of the objective function by instance i. That really makes sense because new tree should be added to our ensemble while keeping track of gradient of our objective function with respect to changes on instances’ values. It’s all clear now!

Let’s define complexity of a tree in our ensemble. Consider number of leaf nodes and leaf weights to penalize model complexity. If number of leaf nodes are getting bigger, then GBT is much more unguarded to overfitting. In other words, the algorithm is getting far away from generalization. Same issue occurs when weights of leaf nodes are much higher. The reason is that if leaf weights are high, then your model does not give chance to upcoming new trees anymore. Therefore, we have to define a regularization term which includes both number of leaves and leaf weights in the formula as shown below.

There is one more tree re-definition step before we go further. This definition will be used re-define new added f function at round (t). Actually, function (tree) f is just a value which is one of the leaf weights for any instance i. Therefore, we will try to rewrite function f at round (t) in terms of leaf weights. We define a tree by a vector of weights in leaves and a leaf index mapping function that maps instance to a leaf.

As you can see q is the function that maps an instance to a leaf node and w is the weight vector that keeps scores for all the leaf nodes. Note that f_t(x) is now expressed as w_q(x). Now, we are ready to revisit our objective function and apply necessary changes to the function. This time we have our objective function as a summation over leaf nodes (T) instead of instances (n) because we also convert f_t(x) into leaf weight values. Define instance set in leaf node j as I_j={i | q(x_i)=j}. It means that all instances exist in leaf node j will be represented by I_j.

I hope you can make it so far. If you understand how we generate the last line of the objective function above, then our problem is just to solve a sum of T independent quadratic functions. Right now, it is easy to move on for us because we are all familiar with solving quadratic functions. Remember that there are two basic facts about single variable quadratic function shown below.

Let’s define sum of g_i (first derivative of loss function by instance i) in a leaf node j as G_j and sum of h_i (second derivative of loss function by instance i) in a leaf node j as H_j. (Below)

If we put G and H into the last obtained objective function, then it becomes a quadratic function and can be solved like we do for any quadratic function. What were we looking for from the beginning? Minimising our objective function, right? Therefore, we will find the best values for leaf weights (w_j) that optimizes (minimize) our objective function. We are going to do this like we do above for a single variable quadratic function. Let’s look at our objective function and its solution below.

Finally, we ended up an objective function that enables us to measure how good a tree structure is. All we have to do is to calculate g_i and h_i (first and second degree derivatives of instance (i) loss function in the previous round), then use all g and h values to find the best tree structure for the next round that gives the smallest objective function score. (Look at the figure below!)

Next and one the last question arises at this stage : Is it feasible to calculate all possible combinations for the next tree structure? Of course, complexity of this problem NP-hard and it is not feasible. Therefore, we have to find a proxy here to overcome this issue. Instead of enumerating all tree structures, let’s optimize each level of the tree once at a time calculating the formula below. It answers the question that whether or not dividing parent node into left and right leaf nodes will improve our objective score. Each split, below formula is calculated and taken action accordingly.

For a real valued data, instead of trying all possible solutions, just sort all instances by their g and h values first and then scanning from left to right to find the optimal split as shown below.

This is pretty much it about the dynamics of Gradient Boosted Machine algorithm. Now, we are aware of how this algorithm works and why it is really efficient and applicable on most of the data mining problems. Next step would be having a look at GBT R, Python or Spark packages and getting more familiar with the tuning parameters of this algorithm.

Currently Amsterdam-based and working at Ebay. Senior Data Scientist with M. Sc degree in Machine Learning and 7 years of professional experience.

## More from Burak Özen

Currently Amsterdam-based and working at Ebay. Senior Data Scientist with M. Sc degree in Machine Learning and 7 years of professional experience.