It is important to learn how a machine learning algorithm works behind the curtains. For a data scientist, it is crucial to wonder about the logic & the math behind these algorithms. However, even with best parameter configuration, using only one ML algorithm on a data mining problem may limit your performance and your capabilities while solving the problem. At this point, we have to meet some really important strategies in Machine Learning to boost our modelling performance. These strategies are called ‘Bagging’ and ‘Boosting’. We should keep these concepts in our minds for almost every project that we’ll be doing from now on. Let me start with explaining bagging strategy first.
Bagging (or sometimes it is called ‘bootstrapping’) is a strategy in Machine Learning that mainly makes use of several learning algorithms instead of just one. It is actually kind of a sampling technique used for training several learning algorithms by using a single dataset. In this strategy, there are generated smaller subset of the data by sampling whole dataset with replacement technique. Then, for every smaller subset of the data, we train a classifier and combine results of these classifiers with simply an averaging function. For example, these classifiers can be a set of different algorithms such as a logistic regression, an SVM, a neural nets, a decision tree etc. or a set of same type of classifier such as a bunch of decision trees. However, the logic is the same : Instead of getting result from just a single classifier, we now take advantage of multiple classifiers’ votes and create a single collaborative result. One of the most famous algorithm that uses bagging strategy is Random Forest. Actually, RF algorithm takes a step further beyond bagging strategy which is random sub-sampling over columns to select split candidate features for each leaf node. It de-correlates the trees in the forest and increase learning performance. (A kind of dealing with possibly overfitting problem in a single Decision Tree)
Since I already talked about bagging strategy and especially random forest algorithm in my previous post, I’ll stop explaining this concept here and focus on ‘Boosting Strategy’ for the rest of this article.
As we already mention that bagging is a strategy to combine votes of multiple learning algorithm by simply doing averaging. However, this combining step could be done much smarter instead of just taking average. Therefore, in general, we can define ‘Boosting’ strategy as a bagging strategy with much smarter combining function. Please don’t judge me because of this rough definition. Of course, there are some major difference between Boosting and Bagging strategies. Here is a small list of them :
- Boosting suggests that multiple weak classifiers can be combined to generate a one strong classifier.
- Each weak classifier should focus on the weaknesses of previous model and improve overall model when it is added to the group of classifiers.
- For each round — each time a new classifier is added, instances classified wrong by the previous set of classifiers have higher weights and new classifier is focused on classifying these instances correctly.
- Weighted average/sum is applied on combining classifiers decisions. In this step, the coefficient of each classifier is produced as weighted error of that classifier. (No worries, I’ll talk about weighted error of a classifier in a minute.)
- Boosting is a stagewise strategy unlike bagging which is a bunch of parallel independent tree generation. Each tree is independent to each other in bagging, but in boosting, each tree is generated as additive to previous classifiers. Therefore, structure of a new tree is dependent on the former model in boosting.
P.S Distributed Computation Frameworks such as Spark or H2O have both Random Forest (bagging) and Gradient Boosting Model (GBM) (Boosting) in their algorithm lists. However, because of the difference #5, parallelisation logic of these two algorithms is different. While building an RF model, each tree generation can be done on a different node at the same time, whereas distributed GBM has parallel tasks such as finding optimal splitting features for leaf nodes of a single tree. For more detailed information, read the article by H2O in this link.
Of course, there are also some similarities between ‘Bagging’ and ‘Boosting’ strategies :
- Both strategy takes advantage of multiple classifiers instead of just single one to generate a collaborative result.
- These are not a kind of ML algorithms. These are concepts frequently used in ML pipelines and both strategies can support different types of classifiers.
However, since tree based boosting methods are the most common one in this field, I’ll focus on boosting strategies based on decision trees such as AdaBoost, Gradient Boosting Trees and XGBoost Algorithm.
Types of Boosting Algorithms
Underlying engine (classifier) can be anything for boosting algorithms. I’m going to talk about 3 different boosting algorithms which have different types of underlying weak classifier and used for different types of data mining problems.
- AdaBoost (Adaptive Boosting)
- Gradient Boosting Trees
Since these algorithms are used for supervised learning, let me first give the general structure of any supervised learning algorithm. In a supervised ML algorithm, there is always an objective function that we want to minimize. During the learning step, all focus of the algorithm is just to minimize this objective function. Objective function is consisted of two main parts : Training Loss and Regularization Term.
This equation exists in almost every learning models. Mainly, we can say that training loss means fitting well on training data and encourages predictive power of a model. On the other hand, regularization is used to decrease large variance in the future and encourages simplicity in a model. The balance between training loss and regularization term determines what we called in Machine Learning field ‘Bias-Variance Tradeoff’. Therefore, our major goal in any machine learning problem is to get an algorithm that is simple but also effective in terms of predictive power.
Of course, both training loss and regularization have different formulation in different algorithms and different types of data mining problems.
As shown above, there are different types of training loss functions. For example, square loss can be used for a regression problem, whereas logistic loss can be used for a classification data mining problem. Also, we can apply different types of regularization functions such as L2 norm or L1 norm (lasso) to penalize the complexity of a model. Of course, this leads us to overcome any possible overfitting issue in our learning model.
Since I am going to refer a lot to training loss and regularization terms in the rest of the article, I talked about these issues first. Now, it is time to move on to our first boosting algorithm.
AdaBoost is the first realization of boosting algorithms in 1996 by Freund & Schapire. This boosting algorithm is designed for only binary classification and its base classifier is a decision stamp. Remember that underlying classifier in a boosting algorithm is called ‘base classifier’. This underlying base classifier for AdaBoost is decision stamp which is a decision tree with one level. Major goal of AdaBoost algorithm is to combine these really weak one-level decision stamp classifiers and generate a strong classifier at the end.
Remember that boosting algorithms give more weights to instances that are misclassified by previous model and try to correct predictions on those instances. Giving more weights to some instances means that applied misclassification penalty in the training loss are much higher for those instances. In the figure below, each time a decision stamp D1,D2 and D3 is added to the system, it focuses on misclassified instances in the previous stage. These instances are shown bigger in the figure and each decision stamp line is drawn to correct predictions of these instances in the current round of boosting algorithm. At the end, all decision stamps (Box1,Box2 and Box3) are collected and final model is formed.(Box4)
Let me first give the pseudo code of the algorithm and explain the technical details right after.
Therefore, in the step 1 of the pseudo code above, initially all the weights of instances are set equal. This weight is equal to 1/N where N is number of instances in our training set.
All learning is done in the step 2 of pseudo code. The loop generates each tree stage-wise by using weighted instances from previous stage. In step 2a, during fitting a classifier to weighted training data, a decision stamp divides data into two class while focusing on previously misclassified instances. In step 2b, weighted error of the newest tree is calculated over misclassified instances. It is because of function ‘I’ in the formula. It gives 1 for misclassified instances and returns 0 for correctly classified ones.
In step 2c, we compute another weight value but this weight is not for instances, it is actually a weight of the newest tree in the last step formula which is for the combination of classifiers. It is calculated using weighted error from step 2b. As you can see, if classification error of a tree is high, then it gets lower weight. This is a more reasonable way of combining votes of multiple classifiers for the final result. Finally, since AdaBoost is an algorithm just designed for binary classification (1 or -1), sign of a weighted sum of classifiers’ votes is calculated in step 3. Predictions are made by calculating weighted sum of weak classifiers.
Let me explain why weighted sum is important by an example. Assume that we have five different weak classifiers in our AdaBoost algorithm and they predict 1.0, 1.0, -1.0, 1.0 and -1.0. Even if final prediction seems 1.0 from the majority votes, it may be resulted in different way if we use weighted sum of votes. Let’s assume that weights (or sometimes it is called ‘stage values’) of classifiers are 0.2, 0.5, 0.8, 0.2 and 0.9 respectively. When we calculate weighted sum, it gives -0.8, which would be an ensemble prediction of class -1.0 for the final classifier.
Remember that each time we add a new tree to our boosting algorithm, all we want is to minimize our exponential loss function which is show below.
Each tree added to the boosting algorithm tries to minimize below argument. Since all instances are re-weighted in each round, the newest tree has to penalize instances with high weight when they misclassify again in the current round using exponential function.
However, overall system has to minimize the loss function : L(y,f(x)) = exp(-y f(x)) Since it is a stagewise algorithm, model (m) is totally dependent on model (m-1). Therefore, the loss function can be re-writed in details as shown below :
Our new candidate additive tree is G_m. So far, we have m-1 trees in our boosting algorithm and now we should add G_m tree with beta_m coefficient and try to minimize our loss function. Actually, this beta parameter is the same with ‘classifier weight’ that we mention in the explanation part of the pseudo-code.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —
That’s all about technical details of AdaBoost algorithm. It is really easy to understand algorithm among other boosting algorithms. Before closing the issue, there are some useful remarks to note. As you can see in the way of how this algorithm works, AdaBoost can be oversensitive to outlier or noisy data. It is because the ensemble method continues to attempt to correct misclassifications in the training data. Outliers will force the ensemble down the rabbit hole of working hard to correct for cases that are unrealistic. Therefore, always be careful and keep this in your mind while feeding data to this algorithm.
Some questions may arise in your minds at this point. If we gradually fix misclassification error of previous models, then why not get near %100 accuracy. If this is possible, does it mean ‘overfitting’? So, we’ll talk about ‘learning rate’ factor in the next article that answers the questions. In the following article, I’ll talk about Gradient Boosting Model, XGBoost algorithm, what’s going on inside of these algorithms and how to tune parameters while using R/Python version of these algorithms.