An End-to-End Journey of a Multinational Predictive Conversion Likelihood Model for Marketplaces
- Learn about the end-to-end journey of a real-life data science project
- Get some modelling tips and key takeaways
- See a predictive machine learning model in action
Conversion is one of the most crucial metrics for every kind of e-business. Although the definition varies across different parts of the industry, it has always been a vital metric to track in order to measure the overall success of any e-platform. For a pure e-commerce website such as eBay, conversion means completing a purchase or a payment. The definition is less clear when there are no, or only a few, actual transactions taking place on the platform, as is the case with classified websites.
A classified website is an online marketplace where people can sell and buy new and used items from a wide selection of categories. The vast majority of user actions on these platforms take place between buyers and sellers without any real payment transaction. Therefore, at eBay Classifieds Group (eCG), we had to come up with an operational definition of conversion: if a buyer sends a message to a seller, or if a seller posts a new listing on the platform, we refer to both as “converted”.
eCG is an umbrella company managing 14 different classified platforms from all across the world. Conversion numbers for the group are tracked by a central team named ‘Global Growth’, which is tasked with increasing the number of active (recently converted) users on the platforms. They turned to the Data Science team with a question: “Who will have a conversion soon?” The initial purpose of this model was to let them know in advance who is likely to convert soon, so that the marketing team would be able to take action on users beforehand.
The idea is straightforward: focus only on users who are unlikely to convert soon and try to find different ways to bring them back to our platform (and avoid allocating resources to users who will convert soon anyway, without any extra targeting). This personalised marketing strategy was the main objective of this project.
This article is about the ML model we developed to accomplish this objective by predicting a user’s likelihood of conversion in the near future, based on their past actions on a platform.
Table of Contents
- Data Exploration and Discovery
- Feature Engineering
- Modelling and Parameter Optimisation
- Results
- Production Jobs
- Behavioural User Segments
- A Sample Use-Case
Step 1: Data Exploration and Discovery
The way the problem was presented to the Data Science team, Global Growth was looking for an efficient way of targeting different user groups in a more personalised manner. Our first step was casting the problem into the data science domain, arguably the most important stage due to its impact on all subsequent stages. Some important decisions taken in this step:
- There will be 2 different models, one for Buyers and the other one for Sellers.
This decision was based on discovery work, in which we realised that the browsing behaviour of our Buyers and Sellers differs substantially. When we tried to fit one single model for all users, the model performance suffered because of these differences in behavioural patterns.
Another reason for this decision was to be able to give more reliable insights into the user journey towards conversion for each user type. As expected, the factors triggering Buyers and Sellers to convert are quite different.
- There will be different models for each market/country.
Since this project comes out of the central data team (cData), we always had to take the different markets into account during the discovery stage. We decided to utilise Google Analytics (GA) data and fit models for the following markets:
- Marktplaats (Netherlands)
- eBayK (Germany)
- Kijiji (Canada)
- Gumtree (Australia)
Due to discrepancies in GA data across countries, we decided to create separate Data Transformation/Cleaning processes for each market, to ensure alignment for the next stage, Feature Engineering.
Another reason why we built separate market models is the unique characteristics of each country. For example, while users in country A might be keen on sharing a listing on social platforms, users in country B might be reluctant to share any listings with their social network. By using disjoint country datasets, we had the chance to discover each country’s behavioural patterns separately and gain healthier insights about them.
- We will use login user-ids to track our users across different platforms.
For each country, we have three different platforms, and respective data sources, for Web, iOS and Android. There were 2 ways to go: either use GA cookie-ids for web users and app-ids for mobile users, or use login user-ids for all users. Each choice has its merits: with cookie-ids/app-ids we can capture user activities even while users are not logged in. However, during the discovery phase, the results indicated that there are a lot of multi-platform users in all countries. In order to have a single holistic user journey across different platforms, we needed a unique user id that is consistent across platforms. Therefore, we opted for internal login user-ids, tracking logged-in user activities only.
- We will predict users’ likelihood of conversion for the next 30 days.
While setting the prediction time interval for the model, we had to balance two criteria: the time period should be long enough to take action on users beforehand, and at the same time short enough to be precise and up-to-date. Exploratory data analysis revealed that using the last 90 days of users’ historical clicking behaviour gives us the best performance for predicting the likelihood of conversion over the next 30 days.
Step 2: Feature Engineering
At this stage of the project, our main objective was to come up with the best possible list of predictors that might have an impact on user conversion behaviour. While generating this feature list, we included every factor we thought could carry information for predicting users’ likelihood of conversion in the near future. We did not limit our imagination or filter out ideas at this point, keeping the list as long as possible. We would rather test as many predictors as we could, because this is not the right step to eliminate redundant or misleading predictors; we will do that during the next stage, modelling.
We categorised the predictors into different contexts such as browsing-based, conversion-based, event-based, personal, and marketing-medium based. We ended up with a total of 70 features, including both numerical and categorical variables. Some features are extracted from the last 90 days of behavioural data and others from the users’ last 30 days on the platform.
For each market, we created a separate “Data Cleaning” class which transforms hit-level, raw GA (Google Analytics) data into aggregated, more structured user-level data. In the next step, on top of the user-level market datasets, we created a single “Feature Extraction” class used for all markets to extract the features listed below.
This way, despite the GA data inconsistencies across countries mentioned in the previous chapter, we managed to end up with exactly the same list of features for all countries.
This is a crucial detail, because it lets us analyse each market separately while returning an outcome for every market in the same format, which makes the results easier to interpret and manage for the Global Growth team. It is also worth mentioning that we wrote these country-specific Data Cleaning classes and the Feature Extraction class using SparkSQL and DataFrame functions; a sketch follows the feature list below.
- The most visited (page views-wise) category of the user during last 30 days
- Number of days since the last session
- Average time per session during last 90 days
- Number of sessions during last 90 days
- Number of ‘VIP’ (View Item Page) page views during last 30 days
- Number of days since the last Ad posting
- Number of days since the last Ad replying
- Number of posts during last 90 days
- Number of R2S (reply-to-seller) events during last 90 days
- Number of ‘Save Search’ events during last 90 days
- Number of ‘Favourite Ad’ events during last 90 days
- Number of ‘Call Button’ (Calling the seller of a listing) clicks during last 30 days
- Number of ‘Share Ad’ (Sharing a listing on social platforms) events during last 30 days
- The user’s platform (web, iOS or Android)
- The user’s region/city
- If the user uses multiple platforms or not
- If the user prefers visiting the site during working hours or not
- The marketing medium of the latest session (Organic, Social, Paid, Notification, eMail etc.)
- The most used marketing medium by the user during last 90 days
- Number of sessions initiated by ‘Direct’ medium during last 90 days
- Number of sessions initiated by ‘Retargeting’ medium during last 90 days
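To make the feature list above concrete, here is a minimal sketch of what the shared Feature Extraction step can look like with Spark DataFrame functions. The input schema (user_id, event_date, event_type, session_id, page_views), the path, and the reference date are illustrative assumptions, not our production code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("feature-extraction").getOrCreate()

# user-level data produced by the market-specific Data Cleaning class
events = spark.read.parquet("/data/user_level/marktplaats")  # hypothetical path
ref_date = F.to_date(F.lit("2019-01-01"))                    # scoring reference date

features = (
    events
    .where(F.col("event_date") >= F.date_sub(ref_date, 90))  # 90-day lookback
    .groupBy("user_id")
    .agg(
        F.countDistinct("session_id").alias("sessions_90d"),
        F.datediff(ref_date, F.max("event_date")).alias("days_since_last_session"),
        F.sum((F.col("event_type") == "R2S").cast("int")).alias("r2s_events_90d"),
        F.sum(
            F.when((F.col("event_type") == "VIP")
                   & (F.col("event_date") >= F.date_sub(ref_date, 30)),
                   F.col("page_views")).otherwise(0)
        ).alias("vip_views_30d"),
    )
)
```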
We formulated our supervised learning problem as binary classification: conversion (1) or not (0). Response variables are formed for Buyers and Sellers, as described below.
Buyer model:
1: if the user has at least one R2SEmailSuccess or R2SChatSuccess event during the prediction time interval
0: if there is no such event during that period
Seller model:
1: if the user has at least one PostAdFreeSuccess or PostAdPaidSuccess event during the prediction time interval
0: if there is no such event during that period
Abbreviations → R2S: Reply-to-Seller, VIP: View Item Page
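A minimal sketch of how such a label can be derived for the Buyer model, reusing the events and features DataFrames from the sketch above; the window dates and column names are assumptions, while the event names come from the definition just given.

```python
from pyspark.sql import functions as F

BUYER_CONVERSION_EVENTS = ["R2SEmailSuccess", "R2SChatSuccess"]

# users with at least one qualifying event inside the 30-day prediction window
converted = (
    events
    .where(F.col("event_date").between("2019-01-01", "2019-01-30"))
    .where(F.col("event_type").isin(BUYER_CONVERSION_EVENTS))
    .select("user_id").distinct()
    .withColumn("label", F.lit(1))
)

# every other user in the feature table gets label 0
training = features.join(converted, "user_id", "left").na.fill({"label": 0})
```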
Step 3: Modelling and Parameter Optimisation
At this stage of the project, we tried out 4 different combinations of ML packages (SparkML, H2O) and algorithms (RandomForest, GradientBoostingTrees) and ranked them by speed and performance.
H2O runs quite a bit faster than SparkML, and Gradient Boosting Trees do a better job than Random Forest for this type of classification problem.
Based on the results of this comparison, we decided to use H2O’s Gradient Boosting Trees for our modelling problem. We made use of the Sparkling Water package (H2O on Spark), which enables us to run an H2O cluster on existing Spark executors. This way, we stayed within the Spark framework while being able to use H2O’s machine learning algorithms for better performance.
What we are doing is putting H2O into the middle of the project pipeline just for modelling purposes, while still using Spark for the other stages such as data munging and prediction processing.
A charming feature of Sparkling Water is the ability to switch between H2O’s and Spark’s data frames seamlessly. It is important to stay in the Spark framework from beginning to end, because we want to keep the entire pipeline in a single environment, making it easier to manage in the production stage later.
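Here is a minimal sketch of that workflow with PySparkling, assuming a recent Sparkling Water release and the training DataFrame from the earlier sketches; the seamless frame conversion in and out of H2O is the point.

```python
from pysparkling import H2OContext
from h2o.estimators.gbm import H2OGradientBoostingEstimator

hc = H2OContext.getOrCreate()  # starts an H2O cluster on the existing Spark executors

train_h2o = hc.asH2OFrame(training, "train")        # Spark DataFrame -> H2OFrame
train_h2o["label"] = train_h2o["label"].asfactor()  # classification, not regression

predictors = [c for c in train_h2o.columns if c not in ("user_id", "label")]
gbm = H2OGradientBoostingEstimator(ntrees=300, max_depth=5, distribution="bernoulli")
gbm.train(x=predictors, y="label", training_frame=train_h2o)

# back to a Spark DataFrame so the downstream jobs stay in the Spark framework
scores = hc.asSparkFrame(gbm.predict(train_h2o))
```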
After deciding on the algorithm type, the next step was hyper-parameter tuning (optimisation). At this stage, we generated our training and validation datasets accordingly and found the best parameters for our GBT model using log-loss as the optimisation metric. Of course, we also held out a completely disjoint dataset to assess the final performance of the optimal model on completely unseen future data.
The key factor here is to generate the training data from several sets of earlier dates, to eliminate any temporal effect on model learning. That is why we repeatedly shifted the time frame backwards by a week over a certain period, generating training data for various dates.
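A sketch of the tuning step with H2O’s grid search, reusing train_h2o and predictors from above; the valid_h2o frame (built the same way from a later time window) and the grid values are illustrative assumptions.

```python
from h2o.grid.grid_search import H2OGridSearch
from h2o.estimators.gbm import H2OGradientBoostingEstimator

hyper_params = {
    "max_depth": [3, 5, 7],
    "learn_rate": [0.01, 0.05, 0.1],
    "ntrees": [100, 300, 500],
}

grid = H2OGridSearch(
    H2OGradientBoostingEstimator(distribution="bernoulli"),
    hyper_params=hyper_params,
)
grid.train(x=predictors, y="label",
           training_frame=train_h2o, validation_frame=valid_h2o)

# pick the model with the lowest validation log-loss
best_gbm = grid.get_grid(sort_by="logloss", decreasing=False).models[0]
```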
Tree-based algorithms can inform us about useless variables through variable importance (a by-product of the training routine). In this setup we care not only about model performance but also about the insights gained from the feature importance list, so it is still valuable for us to remove useless features from the total list. We achieved that by adding a random white-noise column to the dataset and using its importance as a benchmark for all other features.
There is a reason why we added a random continuous variable instead of a random binary one: tree-based algorithms tend to give more importance to features with more levels, so a continuous noise column sets a suitably high, conservative benchmark.
There are 2 major advantages of applying this process: 1. the feature importance list might change dramatically after the removal; 2. it decreases the complexity of the model, and hence the training time, leading to increased parsimony.
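A sketch of the white-noise benchmark under the same assumptions as the previous snippets: add a uniform random column, retrain, and drop every feature it out-ranks.

```python
from h2o.estimators.gbm import H2OGradientBoostingEstimator

# a continuous random column: the benchmark every real feature must beat
train_h2o["white_noise"] = train_h2o.runif(seed=42)

probe = H2OGradientBoostingEstimator(ntrees=300, distribution="bernoulli")
probe.train(x=predictors + ["white_noise"], y="label", training_frame=train_h2o)

imp = probe.varimp(use_pandas=True)  # columns: variable, relative_importance, ...
cutoff = imp.loc[imp["variable"] == "white_noise", "relative_importance"].iloc[0]
kept = imp[imp["relative_importance"] > cutoff]["variable"].tolist()
```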
Step 4: Results
Lift-Gain Chart Bucket Analysis
The first question we asked ourselves after the model training and optimisation steps was: “Does this model really work at all?” We had found the best model using metrics such as log-loss and AUC, but in practice this model will be used to divide users into smaller buckets by their conversion-likelihood probabilities. For that reason, we wanted to perform a Lift-Gain Chart Analysis, sometimes known as “Bucket Analysis”, to evaluate the final performance of the model.
As model output, we get conversion likelihood probabilities for the next 30 days for all users. We then sort these in decreasing order and divide the sorted list into 100 equal-sized user buckets: the top bucket (bucket #1, the top 1%) contains the users with the highest conversion likelihood probabilities, while the bottom bucket (bucket #100, the bottom 1%) is the group the model predicts to be highly unlikely to convert.
Since we run this analysis on historical data, we already know the overall conversion rate (CR) of all users during the prediction time period (30 days). Another comparable number to keep in mind is what we called the ‘Intuitive Way Conversion Rate’, which assumes that users who converted during the last 30 days will convert again in the next 30 days. Before this project, this simple, intuitive logic was how our markets decided who would convert in the next 30 days. Our primary objective was to have top buckets with conversion rates higher than both the overall and the intuitive CR, and bottom buckets with a lower CR than the overall. In other words, if the overall conversion rate falls somewhere in the middle of the CRs of those 100 user buckets, we call it a ‘success’. The plot below shows the results of this analysis for our Dutch market, Marktplaats.
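A minimal sketch of this bucket analysis in Spark, assuming a scored DataFrame with the model probability (p_convert) and the realised outcome (converted, 0/1) per user; column names are hypothetical.

```python
from pyspark.sql import Window, functions as F

# note: a global (unpartitioned) window is fine at user-table scale
w = Window.orderBy(F.desc("p_convert"))

bucket_cr = (
    scored
    .withColumn("bucket", F.ntile(100).over(w))  # bucket 1 = top 1% of users
    .groupBy("bucket")
    .agg(F.avg("converted").alias("bucket_cr"),
         F.count("*").alias("n_users"))
    .orderBy("bucket")
)

overall_cr = scored.agg(F.avg("converted").alias("cr")).first()["cr"]
```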
The next step was discovering which subsets of features are important for Buyers’ and Sellers’ conversions respectively (recall that we used the same feature list for all markets and for both model types). It is intuitively expected that the factors that make users reply to a listing and those that make them post a new listing will differ. We were also expecting to see different important features across markets because of the unique characteristics of those countries. Below, you will see the results for 2 countries and their 2 model types: the first 2 bar plots show the feature importance for our Canadian platform Kijiji, for its Buyer and Seller models respectively, and the second row shows the results for our German platform eBay Kleinanzeigen.
Partial Dependence Plots
We can use ‘Partial Dependence Plots’ to make a black-box tree-based algorithm tell us more about the individual effect of a predictor on the response variable.
Once we shared those insights with the marketers, they also wondered about the effect of a single variable on user conversion behaviour: sometimes knowing that a feature is important for predicting conversion is not enough to take action.
We should also know how to tune that feature to get the best impact out of it. Should we try to decrease or increase a particular factor to convert a user? In a linear model, the coefficients provide this information (if a coefficient is positive, increasing the value of that variable has a positive impact on the outcome). Achieving a similar level of transparency and interpretability with tree ensembles requires an additional step: ‘Partial Dependence Plots’ (PDP).
The logic behind these plots is straightforward: keeping all other features constant, if we change the value of only one specific predictor, what is the mean effect on the response variable? The bar plot below shows how each level of the categorical feature “the user’s most visited category during the last 30 days” affects the response variable; in other words, how likely a user is to convert in the next 30 days given the category they visited most. The cardinality of this feature is equal to the number of categories on our Australian platform Gumtree.
The plot below shows the impact of the numerical feature “number of distinct visited categories during the last 90 days” on users’ conversion behaviour for the Marktplaats dataset. This plot has a very interesting property: it tells us that the number of visited categories has a positive impact on conversion behaviour up to about 10 categories. Beyond that point the impact flattens out, meaning it has no additional effect once a user has visited 10 different categories. There are 36 categories in total on Marktplaats, so our aim should be to get users to visit at least 10 different categories during the last 90 days of their journey on the platform, to increase the chance that they convert in the next 30 days.
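With H2O, producing the data behind such a plot takes a single call. A minimal sketch, reusing the trained gbm and train_h2o from the modelling step; the feature name is hypothetical.

```python
# mean model response while sweeping one feature, everything else left as-is
pdp = gbm.partial_plot(
    data=train_h2o,
    cols=["distinct_categories_90d"],  # hypothetical feature name
    nbins=36,                          # one bin per possible category count
    plot=False,                        # return the table instead of plotting
)
print(pdp[0])  # feature value vs. mean_response (plus std error columns)
```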
Apart from all these analyses, we also wondered how stable or dynamic users’ conversion likelihood scores are over time. Since each user gets a score from the model daily, we asked ourselves how fast the transitions between user buckets happen within a certain time period. We created user buckets in the same manner as explained in the bucket analysis chapter above, but this time we divided the entire user set into 10 equal-sized buckets. We then observed the ‘absolute change’ in users’ bucket numbers after a certain time period, such as 7 or 30 days.
For instance, the light blue bar-2 in the plots below shows the percentage of users who jumped 2 buckets within that specific time period; a user moving from bucket 1 to 3 and a user moving from bucket 7 to 5 both belong to this same bar-2. There is also a dark green NA-bar in each plot, showing the percentage of users who had previously been scored by the model but, after that time period, fell out of the model’s scoring scope (users need at least one action during the last 90 days to be scored).
The bar plots below, generated from Gumtree Australia’s Seller model, show the transitions between user buckets over 7 days and then over 30 days. As expected, user buckets are more stable (bar-0 is high, meaning no change in buckets, and the NA-bar is low) when the time period is shorter. For example, around 91% of users either did not change their bucket at all (bar-0: 65.2%) or shifted only to an adjacent bucket (bar-1: 25.8%) within 7 days; this percentage is around 60% in the 30-day case.
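A sketch of how such a transition histogram can be computed, assuming two Spark DataFrames buckets_t0 and buckets_t1 holding each user’s bucket number on the two scoring dates.

```python
from pyspark.sql import functions as F

n_users = buckets_t0.count()

transitions = (
    buckets_t0.alias("a")
    .join(buckets_t1.alias("b"), "user_id", "left")  # left join keeps the NA bar
    .withColumn("jump", F.abs(F.col("a.bucket") - F.col("b.bucket")))
    .groupBy("jump")                                 # null jump = out of scoring scope
    .agg((F.count("*") / F.lit(n_users) * 100).alias("pct_users"))
    .orderBy("jump")
)
```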
Step 5: Production Jobs
This is the stage of a modelling project where data engineers and data scientists have to come together, primarily because it is the data engineers who are in charge of the production lines. This implies the need to maintain a clean and structured codebase that is ready for handover. Therefore, we needed to revisit our code at this stage and decide, for example, on the optimal number of jobs to run in production.
In our case, we split the pipeline into 5 smaller production jobs which run sequentially. As you can see from the chart below, the output of each job is an input to the next one. Also, at the end of each job, the results are written to a file and saved in our Hadoop cluster. This way, when a job fails at some stage, we can spot the failed stage just by checking the health of those files.
The chart below shows a production run for our Dutch market (Marktplaats) on a particular date (20190101). In the first job, we run the country-specific ‘DataCleaning_MarketName’ classes to convert daily hit-level GA data into aggregated user-level data. In Job #2, we extract features from each user’s last 90 days using the user-level datasets from the previous job. In the Model Update job, we train our GBM model using the training data from Job #2.
Subsequently, in the Scoring job, we feed the trained model with the scoring data to generate the likelihood predictions. In the last job, Behavioural User Segments (BUS), we fit a simple k-means on top of users’ buyer and seller likelihoods in order to group them into different behavioural clusters. In the next chapter, we dive into the details of this BUS job.
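A toy sketch of the chaining logic with placeholder job bodies; in production each stage wraps a Spark/H2O application, and the paths are illustrative.

```python
from datetime import date

RUN_DATE = date(2019, 1, 1)
BASE = "/pipeline/marktplaats"

def noop_job(src_path, out_path):
    # placeholder body; a real job reads src_path, computes, and writes out_path
    print(f"{src_path} -> {out_path}")

path = f"{BASE}/ga_hits/{RUN_DATE:%Y%m%d}"
for name in ("data_cleaning", "feature_extraction",
             "model_update", "scoring", "bus_segmentation"):
    out = f"{BASE}/{name}/{RUN_DATE:%Y%m%d}"
    noop_job(path, out)  # each job persists its output before the next one starts
    path = out           # the next job consumes the previous job's file
```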
Step 6: Behavioural User Segments
By running this model daily, each user gets 2 numerical likelihood scores, one for their probability of being a Buyer and one for being a Seller on the platform in the next 30 days. In the last job of our production line, we added one more layer on top to label our users with hard-coded, meaningful segment names. We do this segmentation by fitting a simple k-means on this 2-dimensional data, which left us with the 7 user cohorts listed below (a sketch of the clustering step follows the list).
- Inactive: Users who haven’t got any scores from our models (users with no visit to our platform during the last 90 days)
- Visitor: Users who got low buyer and seller likelihood scores
- Low-Buyer: Users who got a low seller score and a relatively higher buyer score
- Medium-Buyer: Users who got a low seller score and a moderate buyer score
- High-Buyer: Users who got a low seller score and a high buyer score
- Seller: Users who got a low buyer score and a high seller score
- Hybrid: Users who got both a high buyer score and a high seller score
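A sketch of the clustering step with Spark ML, assuming a scored_users DataFrame with one buyer-likelihood and one seller-likelihood column per user; the column names are hypothetical.

```python
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

assembler = VectorAssembler(inputCols=["buyer_score", "seller_score"],
                            outputCol="features")
points = assembler.transform(scored_users)

# 7 clusters, later mapped to the hard-coded segment names above
kmeans = KMeans(k=7, seed=1, featuresCol="features", predictionCol="cluster")
segments = kmeans.fit(points).transform(points)
```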
Another reason why we needed this extra step of labelling users with hard-coded cluster names is to make the results easier to interpret for all the other teams at the company. This way, instead of struggling with numerical probability values, all they need to do is pick a segment to target or run an experiment on.
Step 7: A Sample Use-Case
The initial objective of this project was to enable our Global Growth marketers to do personalised marketing. However, during our project journey we also discovered a different, fascinating use-case for this model.
Normally, when we come up with a new product idea, we test it on the whole user base, in other words on an ‘average user’. These user cohorts, however, enable us to measure a product’s performance on smaller user groups separately. This way, we avoid the assumption that our new product will work well for all users, which is not the case most of the time. This is what we call ‘Personalised Product Testing’.
In the sample case below, we used this notion to test the performance of the newly developed ‘Personalised Homepage Feed’ on our Canadian platform Kijiji. Simply put, we first measured the performance of this new product on all users (all segments) and then analysed its performance on each user cohort separately. We observed statistically meaningful differences in product impact across the cohorts.
The main takeaway from this experiment is that instead of trying to develop a one-size-fits-all product, we have the option of more personalised product development. From the chart below, it is clear that this product has a really positive impact on the Inactive, Seller and Visitor cohorts, but does not do a good job for the other segments. For that reason, it makes more sense to roll this product out to the cohorts where it works well and develop other products for the remaining segments. A sketch of the per-cohort significance check follows the chart notes below.
Metrics in the chart below: Leads = Reply or Post, SRP = Number of Search Result Page views, VIP = Number of View Item Page views
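One way to read out such an experiment per cohort is a two-proportion z-test on the conversion metric, treatment vs. control, run separately for each segment. The sketch below uses made-up placeholder counts purely to show the mechanics, not the experiment’s actual results.

```python
from statsmodels.stats.proportion import proportions_ztest

# (segment, conversions_treatment, n_treatment, conversions_control, n_control)
cohort_counts = [
    ("Inactive", 310, 10_000, 240, 10_000),  # placeholder numbers
    ("Seller",   520,  8_000, 450,  8_000),
]

for seg, conv_t, n_t, conv_c, n_c in cohort_counts:
    _, p_value = proportions_ztest([conv_t, conv_c], [n_t, n_c])
    lift = conv_t / n_t - conv_c / n_c
    print(f"{seg}: lift={lift:+.4f}, p={p_value:.4f}")
```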
In this article, we wanted to share with you the holistic journey of the Conversion Likelihood Model project carried out by eCG’s Data Science team. We covered some of the challenges we faced and the lessons we learned along the way, and touched on questions like ‘How do you measure the success of an ML model?’ and ‘How do you come up with different real-life use-cases for your model?’ We hope you find this journey useful for your own projects.
Special thanks to Rei Sayag, Rogier Peters, Gijsbert Kooper, Konrad Banachewicz and many more for their great contributions to this project!