# Multi-class Classification on Imbalanced Data using Random Forest Algorithm in Spark

I don’t know if you’re a kind of person who is addicted to a machine learning algorithm and use your favorite one as long as it is applicable to a problem. But, I am that kind of machine learning geek. My favorite algorithm is Random Forest and I have my own reasons for this preference. First of all, Random Forest is one of the most easy-going algorithm among all other machine learning algorithms.

Let’s make a list of some advantages of Random Forest :

- Random Forest can be used for both classification and regression problems.
- Random Forest is a transparent machine learning methodology that we can see and interpret what’s going on inside of the algorithm. Not just use it as a black box application.
- Random Forest works well with both categorical and numerical (continuous) features. You don’t need to categorize (bucketize) numerical features before you use it. It supports these types of features at the same time.
- Since Random Forest is a rule-based algorithm, you also don’t need to normalize numerical features like we do for machine learning algorithms with linear separator. In Spark,
**’maxBins’**parameter set the number of bins while discretising continuous features. Let’s say you have a continuous feature that ranges from 0 to 300. If you set maxBins parameter 3, then it divides this feature into 0<<100, 100<<200 and 200<< 300.

- Let’s dig into the question : Why they called it ‘Random Forest’? Actually, both ‘Random’ and ‘Forest’ words have meanings. There are two reasons behind ‘Random’ word. First, for each individual decision tree in the forest, training data of each decision tree is produced by doing randomly sub-sampling the whole data. In Spark, they named this variable
**‘subsamplingRate’**. Second reason of randomness is node splitting criteria. For each node split, random subset of the features are selected as split candidates and after calculating entropies for each candidate feature, the best feature is selected to split that node. For most of the random forest algorithms, the default subsampling rate is square root of total number of features. For example, if you have 100 features to train your random forest model, each time the algorithm will choose 10 randomly selected features to split a node into sub-nodes. In Spark, this variable is named**‘featureSubsetStrategy’**. With the help of these two randomness approaches, Random Forest is really robust to*overfitting problem*. - In order to deal with overfitting problem, Random Forest has some other different parameters such as
**‘minInstancesPerNode’**and**‘maxDepth’**in Spark Implementation. While tuning parameters for the model, different values for these two parameters should be tried via grid search.**‘ParamGridBuilder’**class is implemented for this purpose in Spark.

- Let’s talk a bit about why it is called ‘Forest’. Due to
*bagging*methodology; rather than training one single decision tree, a bag of trees are trained and their majority of votes is used as final result. Since it is a series of decision trees, this algorithm has ‘Forest’ in its name. It has been proved that most of the time, a bag of trees will give better results compared to one single decision tree. In Spark, number of trees in the random forest algorithm is set by**‘numTrees’**parameter. - Random Forest is also robust to
*outliers*. Since algorithm uses majority votes of a bag of decision trees, the negative effect of outliers is handled. - RF is good for
*parallelisation*. *Feature selection*is not a must-do task for RF algorithm. If you have bad/outlier features in your feature set, these features will be eliminated automatically due to entropy-based node splitting methodology. In Spark,**‘featureImportances’**parameter of RF model will give you the importance weight list of all features. However, I must say that feature selection is always worth trying to decrease model complexity even if you use Random Forest.- Random Forest can handle
*imbalanced data*. There are bunch of different strategies can be applied to solve imbalanced data problem. Actually, this is the main topic that I would like to talk about throughout this paper. In Spark, there is a variable called**‘thresholds’**for this purpose. Let’s assume that we use RF on a training data with the class label distribution : %90 Negative, %10 Positive. Then, if we set threshold variable(t) [0.9,0.1], the class with largest value p/t is predicted, where p is the original probability of that class and t is the class’ threshold. However, I believe that this is not an optimal way of solving imbalanced data situation. I am going to list some other ways of solving this issue later in this article.

My major goal to write this article is to explain how to approach a multi-class imbalanced data problem using RF. Imbalanced data problem is a little bit more interesting issue for me; therefore I am going to save it for later and right here focus on multi-class problem. As I stated before, RF has ability to do multi-class classification by its nature. However, from a paper that I read recently, it is stated that instead of training a single RF model for multi-class classification problem, creating one RF model for each class using One-vs-Rest (One-vs-All) strategy and then selecting the class label which has the maximum probability score gives better results. [1] I personally tried this approach for my very recent project and increased the classification accuracy and F-Score. Let’s assume that you are about to solve a multi-class classification problem with 10 class labels. Then, if you decide to use this approach for your problem, you are going to create a different model for each class and end up with 10 models in total. When you want to score and learn to which class a new instance belongs, you will give this instance as input to all models and pick the class with the highest score. In Spark, there is a class called *‘OneVsRest’* with a parameter ‘*classifier’*. When you set your RF model to this classifier parameter using *‘setClassifier’* method, you are done applying One-vs-Rest methodology with RF classifier. However, if we want to use Spark’s *‘OneVsRest’* class as is, there is still an unsolved imbalanced data problem that arises some serious issues. We are going to talk about it in the next paragraph. The take home lesson here is to keep this One-vs_Rest approach in your mind when you face with a multi-class classification problem and decide to solve this problem with RF.

Now, it is time to explain how we can deal with imbalance data problem. Probably as you already realised, when we apply One-vs-Rest to our multi-class classification problem, the model for minority class will be trained with a very skewed data.

There are several solutions to list here :

## a. Under-sampling or Over-sampling :

This is very intuitive way of overcoming this issue. If you decide to apply under sampling, then pick a number of instances from each class with the same amount of minority class. However, there is a drawback of using under-sampling technique such that losing most of the training instances while doing sampling. Probably, we don’t want to shrink down our big data. In Spark, DataFrame class has a method called **‘sample’** with a** ‘fraction’ **parameter which is the ratio/percentage of sampling.

In over-sampling scenario, you create some synthetic and duplicated data for minority classes to reach out the number of instances of the majority class. In the paper [2], it is stated that oversampling is not a good way of doing this. If you have enough data and decide to use sampling to balance your training data, you are better to prefer under-sampling strategy over over-sampling.

## b. Give Weights to Instances :

If you give bigger weights to minority class instances and less weights to majority class instances, then we have a chance to use whole data without sampling. For example, let’s assume that you have 3 classes with instance weights (1,3,10). If a new/test instance falls into a leaf node in which there are 10 instances of class-A, 4 instances of class-B and 2 instances of class-C, then this instance is assigned with label C. It is because that WEIGHc(10)*NUMc(2) is both bigger than WEIGHTa(1)*NUMa(10) and WEIGHTb(3)*NUMb(4).

Random Forest algorithm in Spark has not supported this feature yet but in R, you can find this feature in RandomForest package with parameter named **‘classwt’**. For now, Spark only supports class ‘thresholds’ that I mentioned before in this article and again it is not a better way compared to class weights logic. If Spark starts supporting class weights feature someday, I would go with this to solve any imbalanced data problem.

Because of the deficiencies in current version of RF in Spark ML, what I am going to suggest to you for imbalanced data problem in multi-class classification is slightly different than these two approaches above. Since I prefer using One-vs-Rest strategy, I applied sampling while generating training data for each class model. Let’s assume that we have a training data with class distribution : % 5 Class A, % 15 Class B and %80 Class C. Then, while training a model for class A, use all class A labels (%5) and randomly select same amount of data from the rest of the data (%95). While training a model for class B, use all class B data (%15) and randomly select same amount of data from the rest (%85). For the model of class C, since it constitutes %80 of the data and the rest is %20, randomly choosing samples from the class C (%80) with the same amount of all other classes instances. (In this case, randomly select %25 of class C instances — it will be having equal size of %20 of the whole training data)

There is another issue about Random Forest that I would like to share with you and kindly ask your opinions about it. It is a different implementation of RF and I believe that it is worth to write a research paper about it. As I’ve already mentioned before in this article, Random Forest algorithm randomly chooses some features as split candidates and use the best feature to split the node into two. However, in some cases we need to able to tune this flow for our own needs. For example, let’s assume a case that an e-commerce website would like to develop a model that predicts users’ interest scores to each categories in the website. They know gender of a user is one the most important factor to calculate these interest scores. Besides gender, probably they also have tens of different factors in their feature set. There is two ways to go : ** Add ‘gender’ as one of the features or Create a different model for each gender group**. These two ways give totally different results. It is because when we add gender in our feature set, we can not be sure using gender feature in first node (root) splitting. Some can say that if the gender is the optimal feature to split that first node, RF will go on with ‘gender’ feature by its nature [3]. However, entropy based calculation is done for each individual node split only. Maybe, gender is not the best feature to split the first node but for rest of the node splits, splitting root node by gender may result with the optimum leaf nodes.

Of course, we can divide our training data into two by gender and then create different models by using each part of the data. The question is that when we also decide to use ‘age’ factor in the same way as gender (<30 & >=30 etc.), the number of separate models we have to create is getting much more and this makes our system complicated. Therefore, if we can implement a new version of Random Forest algorithm to which we are able to assign specific features for first/root node splittings, we may handle this issue in a simplest way. Instead of training a model for each data part (gender and age groups), we will have a single RF model and this simplifies the whole system for sure. In a comprehensive research study about Random Forest algorithm, one can compare the model results between the traditional RF with all features (gender and age included) and the new version of RF that we can externally specify features for first node splits.

If you are interested in the topics we’ve covered throughout the article, please give feedbacks and write down your comments about it.

*References :*

[1] https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es2015-5.pdf

[2] http://statistics.berkeley.edu/sites/default/files/tech-reports/666.pdf

[3] http://stackoverflow.com/questions/37044187/force-split-random-forest