Photo by Clemens van Lay on Unsplash

An End-to-End Journey of a Multinational Predictive Conversion Likelihood Model for Marketplaces

Overview

Introduction

Conversion is one of the most crucial metrics for every kind of e-business. Although its definition varies across the industry, it has always been tracked closely as a measure of an e-platform’s overall success. For a pure e-commerce website such as eBay, conversion means completing a purchase or a payment. The definition is less clear when few or no actual transactions take place on the platform, as is the case with classifieds websites.

Project Big Picture (Image by Author)

Table of Contents

  1. Data Exploration and Discovery
  2. Feature Engineering
  3. Modelling and Parameter Optimisation
  4. Results
  5. Production Jobs
  6. Behavioural User Segments
  7. A Sample Use-Case

Step 1: Data Exploration and Discovery

When Global Growth presented the problem to the Data Science team, they were looking for an efficient way of targeting different user groups in a more personalised manner. Our first step was to cast the problem in the data science domain, arguably the most important stage because of its impact on all subsequent stages. Some important decisions taken in this step:

Step 2: Feature Engineering

At this stage of the project, our main objective was to come up with the best list of predictors that might have an impact on user conversion behaviour. In generating this feature list, we included every factor we thought might help predict users’ likelihood of conversion in the near future. We did not limit our imagination or filter out ideas at this point, so that the list could be as long as possible. We preferred to test as many predictors as we could, because this is not the right step for eliminating redundant or misleading predictors. We do that during the next stage, modelling.

Time Frames for Training Data Generations (Image by Author)

Predictors

Browsing-based:

  • Number of days since the last session
  • Average time per session during last 90 days
  • Number of sessions during last 90 days
  • Number of ‘VIP’ (View Item Page) page views during last 30 days
  • Number of days since the last ad reply
  • Number of posts during last 90 days
  • Number of R2S (reply-to-seller) events during last 90 days
  • Number of ‘Favourite Ad’ events during last 90 days
  • Number of ‘Call Button’ (Calling the seller of a listing) clicks during last 30 days
  • Number of ‘Share Ad’ (Sharing a listing on social platforms) events during last 30 days
  • The user’s region/city
  • Whether the user uses multiple platforms
  • Whether the user prefers visiting the site during working hours
  • The most used marketing medium by the user during last 90 days
  • Number of sessions initiated by ‘Direct’ medium during last 90 days
  • Number of sessions initiated by ‘Retargeting’ medium during last 90 days
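To make the look-back windows concrete, here is a minimal sketch of how three of the predictors above could be derived from raw session records. The `Session` record, the `browsing_features` helper and the output key names are hypothetical names chosen for illustration; the production feature jobs are not shown in this article and may well look different.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Session:
    """Hypothetical session record; the real schema is an assumption."""
    user_id: str
    day: date
    duration_sec: float


def browsing_features(sessions, user_id, reference_day, window_days=90):
    """Compute three browsing-based predictors for one user as of reference_day."""
    window_start = reference_day - timedelta(days=window_days)
    # Keep only this user's sessions that fall inside the look-back window.
    recent = [
        s for s in sessions
        if s.user_id == user_id and window_start <= s.day < reference_day
    ]
    if not recent:
        return None  # no activity in the window, so nothing to compute
    last_day = max(s.day for s in recent)
    return {
        "days_since_last_session": (reference_day - last_day).days,
        "n_sessions": len(recent),
        "avg_session_sec": sum(s.duration_sec for s in recent) / len(recent),
    }


# Toy data: one old session falls outside the 90-day window.
sessions = [
    Session("u1", date(2020, 1, 10), 120.0),
    Session("u1", date(2020, 3, 1), 60.0),
    Session("u1", date(2019, 6, 1), 300.0),
    Session("u2", date(2020, 2, 1), 30.0),
]
features = browsing_features(sessions, "u1", date(2020, 3, 11))
```

The 30-day predictors in the list would follow the same pattern with `window_days=30`.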

Response Variable

We formulated our supervised learning problem as binary classification: conversion (1) or not (0). Response variables are formed for Buyers and Sellers, as described below.
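As a rough sketch of the binary labelling, the snippet below marks a user as converted (1) if at least one conversion event falls inside a forward-looking window after a reference day. The 30-day horizon, the helper names and the toy event dates are assumptions for illustration. Which events count as a Buyer or a Seller conversion is deliberately left as an input, since it depends on the platform's own definition.

```python
from datetime import date, timedelta


def conversion_label(event_days, reference_day, horizon_days=30):
    """Return 1 if any conversion event falls in [reference_day, reference_day + horizon)."""
    horizon_end = reference_day + timedelta(days=horizon_days)
    return int(any(reference_day <= d < horizon_end for d in event_days))


def build_labels(conversions_by_user, users, reference_day, horizon_days=30):
    """Label every user in scope; users without any recorded event get 0."""
    return {
        u: conversion_label(conversions_by_user.get(u, []), reference_day, horizon_days)
        for u in users
    }


# Hypothetical conversion dates per user, one dictionary per response variable.
reference_day = date(2020, 3, 11)
users = ["u1", "u2", "u3"]
buyer_events = {"u1": [date(2020, 3, 20)], "u2": [date(2020, 5, 1)]}
seller_events = {"u2": [date(2020, 3, 11)]}

buyer_labels = build_labels(buyer_events, users, reference_day)
seller_labels = build_labels(seller_events, users, reference_day)
```

Keeping the two response variables in separate dictionaries mirrors the idea of training one model for Buyers and one for Sellers.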

Step 3: Modelling and Parameter Optimisation

At this stage of the project, we tried out 4 different combinations of ML packages (SparkML, H2O) and algorithms (RandomForest, GradientBoostingTrees) and ranked them by speed and performance.

Model Selection (Source: h2o.ai)
Shifted Training and Validation Datasets Generation (Image by Author)
Feature Importance List (Image by Author)

Step 4: Results

Lift-Gain Chart Bucket Analysis

Bucket Analysis: Comparison of Buckets CRs vs. Overall CR vs Intuitive Way CR (Image by Author)
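The comparison of bucket conversion rates (CR) against the overall CR can be sketched as follows. This is an illustration under assumptions, not our production analysis: the number of buckets, the descending sort order (bucket 1 holds the highest scores) and the function names are all chosen for the example.

```python
def bucket_conversion_rates(scores, labels, n_buckets=5):
    """Sort users by score (highest first), split into equal-sized buckets, return CR per bucket."""
    if len(scores) != len(labels) or len(scores) < n_buckets:
        raise ValueError("need one label per score and at least one user per bucket")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    n = len(ranked)
    rates = []
    for b in range(n_buckets):
        # Integer arithmetic keeps bucket sizes within one user of each other.
        chunk = ranked[b * n // n_buckets:(b + 1) * n // n_buckets]
        rates.append(sum(label for _, label in chunk) / len(chunk))
    return rates


def lift(bucket_rates, overall_rate):
    """Lift of each bucket: its CR divided by the overall CR (overall CR must be > 0)."""
    return [rate / overall_rate for rate in bucket_rates]


# Toy scores and observed conversions (1 = converted).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

overall_cr = sum(labels) / len(labels)
bucket_crs = bucket_conversion_rates(scores, labels, n_buckets=5)
bucket_lifts = lift(bucket_crs, overall_cr)
```

A model that ranks users well shows a lift clearly above 1 in the top buckets and below 1 in the bottom ones.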

Feature Importances

Feature Importance Lists of Kijiji Canada’s Buyer and Seller Models (Image by Author)
Feature Importance Lists of eBay Kleinanzeigen’s Buyer and Seller Models (Image by Author)

Partial Dependence Plots

Partial Dependence Plot: Impact of ‘The Most Visited Category’-feature on User Conversion-response variable (Image by Author)
Partial Dependence Plot: Impact of ‘Number of Different Categories Visited’-feature on User Conversion-response variable (Image by Author)

Bucket Transitions

Apart from all these analyses, we also wondered how stable or dynamic users’ conversion likelihood scores are over time. Since each user gets scores from the model daily, we asked ourselves how fast transitions between user buckets happen within a certain time period. We created user buckets in the same manner as explained in the bucket analysis section above, but this time we divided the entire user set into 10 equal-sized buckets. We then observed the ‘absolute change’ in users’ bucket numbers after a certain time period, such as 7 days or 30 days.

For instance, the light blue bar-2 in the plots below shows the percentage of users who jumped 2 buckets within that time period. A user moving from bucket 1 to 3 and a user moving from bucket 7 to 5 therefore both belong to bar-2. Each plot also has a dark green NA-bar, which shows the percentage of users who had scores from the model before but fell out of its scoring scope by the end of the period. The scoring scope covers users who took at least one action during the last 90 days.

The following bar plots, generated from Gumtree Australia’s seller model, show the transitions between user buckets first in 7 days and then in 30 days. As expected, user buckets are more stable when the time period is shorter: bar-0 (no change in buckets) is high and bar-NA is low. For example, around 91% of users either did not change their bucket at all (bar-0: 65.2%) or shifted by only 1 to an adjacent bucket (bar-1: 25.8%) in 7 days. This percentage is around 60% for the 30-day case.

Transitions Between Buckets in 7 Days (Image by Author)
Transitions Between Buckets in 30 Days (Image by Author)
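The transition bars can be reproduced with a few lines of code. The sketch below assigns users to 10 equal-sized buckets by score and then counts the absolute bucket change between two scoring dates, with 'NA' for users who are no longer scored. The function names and the choice that bucket 1 holds the lowest scores are assumptions made for the example.

```python
from collections import Counter


def assign_buckets(scores_by_user, n_buckets=10):
    """Rank users by score and split them into equal-sized buckets (1 = lowest scores)."""
    ranked = sorted(scores_by_user, key=scores_by_user.get)
    n = len(ranked)
    return {user: (i * n_buckets) // n + 1 for i, user in enumerate(ranked)}


def transition_shares(buckets_before, buckets_after):
    """Share of users per absolute bucket change; 'NA' means the user is no longer scored."""
    counts = Counter()
    for user, before in buckets_before.items():
        after = buckets_after.get(user)
        counts["NA" if after is None else abs(after - before)] += 1
    total = len(buckets_before)
    return {change: count / total for change, count in counts.items()}


# Buckets for twenty toy users, two per bucket.
toy_scores = {f"u{i}": i / 20 for i in range(20)}
toy_buckets = assign_buckets(toy_scores)

# Hand-written buckets at two dates; user "e" drops out of the scoring scope.
before = {"a": 1, "b": 3, "c": 7, "d": 5, "e": 2}
after = {"a": 3, "b": 3, "c": 5, "d": 6}
shares = transition_shares(before, after)
```

Users "a" (bucket 1 to 3) and "c" (bucket 7 to 5) both land in the change-2 group, just like the bar-2 example in the text.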

Step 5: Production Jobs

This is the stage of a modelling project where data engineers and data scientists have to come together, primarily because it is the data engineers who are in charge of production lines. This implies the need to maintain a clean and structured codebase that is ready for handover. We therefore need to revisit our code at this stage and decide, for example, on the optimum number of jobs to run in production.

Production Pipeline: Jobs Flow (Image by Author)

Step 6: Behavioural User Segments

Because the model runs daily, each user gets 2 numerical likelihood scores: the probability of being a Buyer and the probability of being a Seller on the platform in the next 30 days. In the last job of our production line, we added one more layer on top to label users with hard-coded, meaningful segment names. We did this segmentation by fitting a simple k-means on this 2-dimensional data and ended up with 7 different user cohorts.

  1. Visitor: Users who got low buyer and seller likelihood scores
  2. Low-Buyer: Users who got low seller score and relatively higher buyer score
  3. Medium-Buyer: Users who got low seller score and moderate buyer score
  4. High-Buyer: Users who got low seller score and high buyer score
  5. Seller: Users who got low buyer score and high seller score
  6. Hybrid: Users who got high buyer score and also high seller score

Predictive Behavioural User Clusters (Image by Author)
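For readers who want to see the mechanics, here is a minimal pure-Python sketch of k-means on (buyer score, seller score) pairs, followed by a hard-coded mapping from cluster index to segment name. It is an illustration only: the `kmeans_2d` function, the toy scores and the explicitly passed initial centroids are assumptions that keep the example small and deterministic. In practice a library implementation with a proper initialisation scheme would be the natural choice.

```python
def kmeans_2d(points, initial_centroids, n_iter=50):
    """Plain Lloyd's k-means for 2-dimensional points; returns (centroids, assignments)."""
    centroids = list(initial_centroids)
    k = len(centroids)
    assignments = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid.
        assignments = [
            min(
                range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2,
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centroids.append(centroids[c])  # leave an empty cluster where it is
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, assignments


# Toy (buyer_score, seller_score) pairs for six users.
points = [(0.05, 0.05), (0.1, 0.08), (0.9, 0.1), (0.85, 0.05), (0.1, 0.9), (0.05, 0.8)]
centroids, assignments = kmeans_2d(points, [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)])

# Hard-coded names per cluster index, using three of the segment names above.
segment_names = {0: "Visitor", 1: "High-Buyer", 2: "Seller"}
segments = [segment_names[a] for a in assignments]
```

Because cluster indices carry no meaning on their own, the naming step has to inspect where each centroid sits in the score plane before a label is attached to it.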

Step 7: A Sample Use-Case

The initial objective of this project was to enable our Global Growth team’s marketers to do Personalised Marketing. However, we also discovered a different, fascinating use case for this model during the project.

Experiment: How The New Product Performs On Different User Segments (Image by Author)

Conclusion

In this article, we wanted to share with you a holistic journey of the Conversion Likelihood Model project done by eCG’s Data Science team. We mentioned some challenges we faced and also some lessons we took from this journey. We also touched on some subjects like ‘How to measure the success of an ML model?’ or ‘How to come up with different real-life use-cases for your model?’ We hope you will find this journey useful for your own projects.

Author Signature

Currently Amsterdam-based and working at eBay. Senior Data Scientist with an M.Sc. degree in Machine Learning and 7 years of professional experience.