Uplift Modeling: From Correlation to Causation
Many businesses need an ML model to predict whether a newly introduced campaign or promotion will work on their user base. The ultimate goal of such an initiative might be converting more users with the help of those campaigns, or improving particular business metrics such as CTR. Of course, depending on the business type, campaigns can come in different forms, such as a discount or a loyalty program.
For the sake of simplicity, imagine that we are running a discount campaign to increase return on investment (ROI) for an e-commerce website. An intuitive and simple way to formulate this problem is to train a machine learning model that predicts whether a product will sell soon. The business could use such predictions in a smart way to increase ROI, applying discounts only to the products that are unlikely to be sold otherwise. For such a model, the training methodology is quite straightforward: we use all past transaction (purchase) data. Let's say that if a product still has not been sold 7 days after it was listed on the website, then the target variable is 0 (not sold); otherwise the label is 1 (sold). This is the well-known "likelihood of selling a product" model, which computes P(Y=1 | x) for each product, where Y is the target variable (1 — sold) and x is the set of features, such as a combination of product, user, and transaction features.
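This baseline "likelihood of selling" model can be sketched in a few lines with scikit-learn. The features and the label-generating rule below are purely illustrative synthetic stand-ins for real product, user, and transaction data:

```python
# Minimal sketch of the "likelihood of selling a product" baseline model.
# Feature meanings and the synthetic label rule are illustrative assumptions.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5_000
X = np.column_stack([
    rng.uniform(5, 500, n),   # product price (synthetic)
    rng.integers(0, 10, n),   # category id (synthetic)
    rng.uniform(0, 1, n),     # historical category conversion rate (synthetic)
])
# Label: 1 if sold within 7 days of listing, else 0 (synthetic rule here)
y = (0.2 * (X[:, 2] > 0.5) + 0.3 * (X[:, 0] < 100)
     + rng.uniform(0, 1, n) > 0.7).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingClassifier().fit(X_tr, y_tr)
p_sell = model.predict_proba(X_te)[:, 1]  # P(Y=1 | x) for each held-out product
```

Products with a low `p_sell` would then be the candidates for a discount under the correlation-based formulation.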
Introduction
Uplift modeling is a methodology that approaches the same problem in a different way. Here, what we really want to know is whether a discount is needed to sell a product. In other words, we would like to see whether the applied discount is a causal factor in the successful sale of the product. This is the point where we transition from "correlation" to "causation" within the ML realm. In the previous problem formulation, we were making a rather strong assumption: that applying discounts to products unlikely to be sold will always affect conversion positively and increase our total return on investment. This is the presumed "correlation" between the number of discounts and the number of purchases.
Uplift modeling, however, refrains from applying discounts to all the products that are unlikely to be sold. It creates targeted sub-groups out of those "unlikely to be sold" products. The ultimate goal of uplift modeling is to find the particular sub-group of products that would be sold only if we apply discounts to them. Since "applying a discount" enters the uplift formulation as an external factor, we now have a total of 2 (sold / not sold) x 2 (discount / no discount) = 4 groups of labelled data.
1. Purchase organically — Purchase after applying discount:
This set is sometimes called the "sure things". Regardless of whether we apply a discount, the product will be sold anyway. The logical thing to do here is not to apply a discount to this product group; we would not want to waste our budget on these products and lower the ROI.
2. No Purchase organically — Purchase after applying discount:
This set is sometimes called the "persuadables". This is the particular group of products that uplift modeling seeks. If we apply discounts only to these products, we can positively affect both our conversion rate and our ROI.
3. Purchase organically — No Purchase after applying discount:
This set is sometimes called "do not disturb". Even though it sounds a little counter-intuitive, there can be a set of users or products that react negatively to discounts. The reasonable thing to do for this group is simply to skip the discount so as not to hurt the ROI metric.
4. No Purchase organically — No Purchase after applying discount:
This set is sometimes called the "lost causes". You could think of this group as churned: even if we apply a discount to these products, we cannot change the outcome and make them sell. Since our ultimate goal is to optimize the ROI metric, we don't want to apply any discounts to this product group either.
ROI of the Discount Campaign
A few terms are used often within the uplift context to measure the performance of a discount campaign, such as incrementality, cannibalization, and incremental return on investment (iROI).
Before defining those terms one by one, let's set names for some numbers in this example:
N: total number of purchases with a discount applied (e.g., 10,000)
T: total number of purchases in the treatment group (e.g., 25,000)
C: total number of purchases in the control group (e.g., 21,000)
R_t: total revenue in the treatment group (e.g., $2.5M)
R_c: total revenue in the control group (e.g., $2.1M)
L: total revenue lost to the applied discounts (e.g., $300K)
For simplicity, assume that the treatment and control groups are the same size. Here is how the computations go in that case:
Incrementality purchase ratio of the Discount Campaign:
[(T - C) / N] → (25000 – 21000) / 10000 = 4000/10000 = 0.4 = 40%
Cannibalization purchase ratio of the Discount Campaign:
The discounted purchases that are not incremental are due to cannibalization: these are purchases that would have happened even without any discount. The number of cannibalized purchases is [N - (T - C)], and its ratio is simply (1 - incrementality ratio), which means (1 - 0.4) = 0.6 = 60% in this example.
Incremental Return of Investment (iROI):
This is the ratio of the incremental revenue between treatment and control to the total revenue lost to the applied discounts.
[(R_t - R_c) / L] → ($2.5M - $2.1M) / $300K = 1.33
If this ratio is greater than 1, we can call the campaign profitable.
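The three metrics above can be checked directly with the example numbers from the text:

```python
# Campaign metrics computed from the example numbers given in the text.
N, T, C = 10_000, 25_000, 21_000             # discounted, treatment, control purchases
R_t, R_c, L = 2_500_000, 2_100_000, 300_000  # revenues and discount loss, in dollars

incrementality = (T - C) / N     # 0.4: share of discounted purchases that are incremental
cannibalization = 1 - incrementality  # 0.6: purchases that would have happened anyway
iroi = (R_t - R_c) / L           # about 1.33: > 1, so the campaign is profitable
```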
Model Part
At its core, uplift modeling computes this probability difference:
P(Y=1 | x, T=1) - P(Y=1 | x, T=0)
where Y is our target variable (0 — not sold, 1 — sold), x is the set of features, and T indicates whether the discount is applied (T=1) or not (T=0). This formula calculates the difference (uplift) in the probability of selling the product with and without the discount. In other words: how much does applying the discount increase the chance of selling this product? (Causation)
The main difference of uplift modeling compared to the mainstream "likelihood of selling a product" model is that it requires past data for the specific discount whose efficiency you want to measure. In other words, before moving on to the modeling part, we should already have applied this discount to products randomly for a while to collect data for uplift modeling.
There are mainly 4 different modeling approaches to compute this uplift score, as listed below. I don't want to go deep into the details of these approaches in this blog post, since there are already plenty of write-ups out there you can easily find. One important note: I intentionally keep calling the items below "approaches" rather than "models", because you can use any preferred ML model as a base model within each approach.
Approaches
S-Learner:
This is a single-model approach that defines the treatment (T) as just another feature and computes the uplift score of a product by taking the difference of predictions with the treatment feature set to 1 and to 0. The main drawback of this methodology is that the treatment feature (T) may not end up being an important feature in the single model (m), in which case its effect on the predictions, and therefore on the uplift scores, is underestimated.
m(Y=1 | x, T=1) - m(Y=1 | x, T=0)
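A minimal S-learner sketch with scikit-learn, on synthetic data (the feature and outcome generation below is purely illustrative; in the synthetic world the discount only helps when feature 0 is positive):

```python
# S-learner sketch: one model, treatment T appended as an extra feature.
# Data generation is a synthetic illustration, not a real campaign dataset.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 4_000
X = rng.normal(size=(n, 3))    # product/user features (synthetic)
T = rng.integers(0, 2, n)      # 1 = discount applied, 0 = not
# Synthetic outcome: the discount only helps when feature 0 is positive
y = (rng.uniform(0, 1, n) < 0.3 + 0.2 * T * (X[:, 0] > 0)).astype(int)

m = GradientBoostingClassifier().fit(np.column_stack([X, T]), y)

# Uplift score: score the same products with T forced to 1 and to 0
p1 = m.predict_proba(np.column_stack([X, np.ones(n)]))[:, 1]
p0 = m.predict_proba(np.column_stack([X, np.zeros(n)]))[:, 1]
uplift = p1 - p0
```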
T-Learner
This is a two-model approach where we train one model on the treatment group data (discount applied, T=1) and another model on the control group data (no discount, T=0). To compute the uplift for a product, we take the difference of the outputs of these two models (m1 and m0). The main drawback of this methodology is the risk of getting uncalibrated, out-of-sync scores from the two models, because they are trained on data with different natures and distributions.
m1(Y=1 | x) - m0(Y=1 | x)
where m1 is the model trained on treatment group data and m0 is the model trained only on control group data.
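The same synthetic setup can illustrate the T-learner; here the two arms get entirely separate models:

```python
# T-learner sketch: separate models for treatment and control rows.
# Data generation is a synthetic illustration, as before.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(1)
n = 4_000
X = rng.normal(size=(n, 3))
T = rng.integers(0, 2, n)
y = (rng.uniform(0, 1, n) < 0.3 + 0.2 * T * (X[:, 0] > 0)).astype(int)

m1 = GradientBoostingClassifier().fit(X[T == 1], y[T == 1])  # treatment model
m0 = GradientBoostingClassifier().fit(X[T == 0], y[T == 0])  # control model

# Uplift: difference of the two models' scores on the same products
uplift = m1.predict_proba(X)[:, 1] - m0.predict_proba(X)[:, 1]
```

Because m1 and m0 never see each other's data, any miscalibration between them flows straight into the uplift scores, which is exactly the drawback noted above.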
X-Learner
This is also a two-model approach, where each model learns an individual treatment effect (ITE). The first step is the same as in the T-learner; the second step is to apply each model to the opposite group and calculate the ITE like this:
m1(Y=1 | x) - Y_real → for control group
Y_real - m0(Y=1 | x) → for treatment group
The last step is to train another 2 models using the ITEs calculated in the previous step as new target variables. Let's name these 2 new models m1_ITE and m0_ITE. The final uplift score is then calculated as:
g(x) * m1_ITE + (1-g(x)) * m0_ITE
where g(x) is a propensity score weight function. For instance, if your treatment and control groups are the same size, you could use the constant value 0.5 for g(x).
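The three steps above can be sketched end to end; as in the earlier sketches the data is synthetic, regressors are used for the ITE stage, and g(x) is fixed at 0.5 for a 50/50 split:

```python
# X-learner sketch following the three steps described in the text.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 4_000
X = rng.normal(size=(n, 3))
T = rng.integers(0, 2, n)
y = (rng.uniform(0, 1, n) < 0.3 + 0.2 * T * (X[:, 0] > 0)).astype(int)

# Step I: same as the T-learner
m1 = GradientBoostingClassifier().fit(X[T == 1], y[T == 1])
m0 = GradientBoostingClassifier().fit(X[T == 0], y[T == 0])

# Step II: imputed individual treatment effects, each model applied to the other group
ite_control = m1.predict_proba(X[T == 0])[:, 1] - y[T == 0]    # m1(x) - Y_real
ite_treatment = y[T == 1] - m0.predict_proba(X[T == 1])[:, 1]  # Y_real - m0(x)

# Step III: regress the imputed ITEs on x, then blend with the propensity weight
m1_ite = GradientBoostingRegressor().fit(X[T == 1], ite_treatment)
m0_ite = GradientBoostingRegressor().fit(X[T == 0], ite_control)
g = 0.5  # equal-sized treatment and control groups
uplift = g * m1_ite.predict(X) + (1 - g) * m0_ite.predict(X)
```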
Uplift Trees
This is a tree-based algorithm that uses the difference in uplift as its splitting criterion: each split is chosen to increase the difference in outcome distributions between the treatment and control groups in the child nodes, compared to the parent node.
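To make the splitting criterion concrete, here is a toy sketch of one common choice, the squared difference in treatment-vs-control outcome rates (real implementations also use divergence measures such as KL or chi-squared; the data and threshold below are illustrative):

```python
# Toy uplift-tree splitting criterion: squared treatment/control outcome-rate gap.
import numpy as np

def uplift_in(node_y, node_t):
    """Difference in purchase rate between treated and control rows in a node."""
    return node_y[node_t == 1].mean() - node_y[node_t == 0].mean()

def split_gain(x, y, t, threshold):
    """Gain of splitting on x <= threshold: weighted child uplift spread vs parent."""
    left, right = x <= threshold, x > threshold
    parent = uplift_in(y, t) ** 2
    n = len(y)
    children = (left.sum() / n) * uplift_in(y[left], t[left]) ** 2 \
             + (right.sum() / n) * uplift_in(y[right], t[right]) ** 2
    return children - parent

rng = np.random.default_rng(3)
n = 10_000
x = rng.normal(size=n)
t = rng.integers(0, 2, n)
# Synthetic data: the discount only works when x > 0
y = (rng.uniform(0, 1, n) < 0.3 + 0.3 * t * (x > 0)).astype(int)

# Splitting at 0 isolates the "responds to discount" subgroup, so the gain is positive
gain = split_gain(x, y, t, 0.0)
```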
Uplift Approximation
Even though the real uplift values could be computed as shown in the uplift formula below, in a real-life scenario it is almost never possible both to apply and not apply a discount to the same product at the same time, or to show and not show discounted prices to the same user at the same time. In addition, we might not always have negative-label (Y=0) data in hand, especially if we don't have a concrete definition of what a "non-purchase" is. In that case, we can introduce an approximation to the real uplift value; think of it as a proxy for the real uplift. What we do have in hand is data on successful purchase transactions (Y=1), where we also know whether a discount was applied to each past transaction (T=1 or T=0).
Here is a brief derivation of an estimate of the real uplift values using only successful purchase transactions:
Uplift Formula: P(Y=1 | x, T=1) - P(Y=1 | x, T=0)
Apply Bayes' theorem to the two parts of the formula:
Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
Step I:
P(Y=1 | x, T=1) = [P(T=1 | Y=1, x) * P(Y=1 | x)] / P(T=1 | x)
P(Y=1 | x, T=0) = [P(T=0 | Y=1, x) * P(Y=1 | x)] / P(T=0 | x)
Step II:
P(Y=1 | x, T=1) / P(Y=1 | x, T=0) = P(T=1 | Y=1, x) / P(T=0 | Y=1, x)
where P(T=1 | x) and P(T=0 | x) are both 0.5, since we assume the chance of a product being treated (discount applied) or not treated (no discount) is equal.
Step III:
P(Y=1 | x, T=1) - P(Y=1 | x, T=0) ≈ P(T=1 | Y=1, x) - [1 - P(T=1 | Y=1, x)] = 2*P(T=1 | Y=1, x) - 1
where we substitute the numerator and denominator of the right-hand side of Step II for the two parts of the uplift formula. Note that this substitution drops a common positive factor: by Step I, P(Y=1 | x, T=1) = 2*P(T=1 | Y=1, x)*P(Y=1 | x) when P(T=1 | x) = 0.5, so the true uplift equals 2*P(Y=1 | x) * [2*P(T=1 | Y=1, x) - 1]. The expression above is therefore a proxy that preserves the sign of the real uplift for each product, though not its exact magnitude.
The final estimate of the real uplift looks like this:
P(Y=1 | x, T=1) - P(Y=1 | x, T=0) ≈ 2*P(T=1 | Y=1, x) - 1
This means all we need is to train a model on past successful transactions (Y=1) that predicts the probability that a discount was applied to a given successful transaction. Training such a model is a much more realistic and achievable goal, because we already have that data.
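This approximation can be sketched on the same synthetic setup as the earlier examples; the key change is that the model is trained only on sold items (Y=1) and predicts the treatment indicator T instead of the outcome:

```python
# Sketch of the approximation: train only on successful purchases (Y=1) to predict
# whether a discount was applied, then score uplift as 2*P(T=1 | Y=1, x) - 1.
# Assumes 50/50 random discount assignment, as in the derivation; data is synthetic.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(4)
n = 20_000
X = rng.normal(size=(n, 3))
T = rng.integers(0, 2, n)  # discounts assigned at random with p = 0.5
y = (rng.uniform(0, 1, n) < 0.3 + 0.2 * T * (X[:, 0] > 0)).astype(int)

# Keep only successful transactions, as the text prescribes
X_sold, T_sold = X[y == 1], T[y == 1]

m = GradientBoostingClassifier().fit(X_sold, T_sold)  # learns P(T=1 | Y=1, x)
uplift_proxy = 2 * m.predict_proba(X)[:, 1] - 1
```

In this synthetic world the proxy should score products with feature 0 positive (the group where the discount actually causes sales) higher than the rest.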
To sum up, using this little approximation trick, we can estimate the causal effect of a discount on our sales. This is how we answer the question of whether the discount we applied was the causative factor behind a transaction. In my opinion, this is a more effective modeling approach than the mainstream one, which rests on a strong assumed correlation between discounts and sales.