MadKudu's Data Science Studio is the proprietary platform that allows users to easily build segmentation models. This platform was originally built for our internal data scientists to build segmentation models for our customers and we are excited to now open this "black box" to external users. The platform will be accessible in read-only mode as a first step to allow our customers to see how their model is configured.
This document focuses on the Data Science Studio capabilities for the Customer Fit models.
At a high level, the platform allows you to:
- create a training dataset from your CRM or upload via a CSV file
- understand which firmographic, demographic or technographic traits of leads are correlated with their conversion
- build a decision-tree based or point-based model from those traits
- adjust the thresholds between the customer fit segments (very good, good, medium, low)
- create the signals of the model
- validate the performance of the model on a validation dataset from your CRM or uploaded via CSV file
- preview a sample of scored leads
Now let's get into the details.
On the home page, you will find the list of segmentations live in your CRM. To view the model, click on "Open".
Within the model page:
- the Customer Profile section gives access to the list of traits available to build a segmentation
- the Data Science Studio shows the impact of each trait on conversion, and how the model is built
In this section, you can see when the training and validation datasets were last uploaded.
Note: The functionality of creating or changing the training and validation datasets is currently available to internal users only and will soon be available to external users.
In the Computations tab, you can find the list of firmographic, demographic, and technographic traits ("computations") available to use in the customer fit segmentation and how they are defined.
Some computations come directly from Clearbit data (like company category industry), some others from our own computations (like predicted revenue), and some others can be created from your own data in your CRM.
Data Science Studio
In the Data page, you can see the size and conversion rate of the training set and validation set uploaded to train and validate the customer fit model. A training dataset is essentially a table with one row per lead, including the following columns:
- email: the email of the lead
- target: 1 if the lead has converted, 0 otherwise.
A training set is usually built to reach a conversion rate of at least 20%, which avoids a class imbalance problem.
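As a rough illustration of how such a set could be assembled, the sketch below keeps every converted lead and randomly drops non-converted leads until conversions reach the target rate. The helper name `downsample_to_rate`, the `target_pct` parameter, and the toy emails are hypothetical; this is one simple way to do it, not necessarily how the platform builds its datasets.

```python
import random

def downsample_to_rate(rows, target_pct=20, seed=0):
    """Keep all converted leads and a random subset of non-converted ones
    so that conversions make up at least target_pct % of the result.
    Hypothetical helper, for illustration only."""
    converted = [r for r in rows if r["target"] == 1]
    not_converted = [r for r in rows if r["target"] == 0]
    # Largest number of non-converted leads that keeps the rate >= target_pct.
    max_negatives = len(converted) * (100 - target_pct) // target_pct
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = rng.sample(not_converted, min(max_negatives, len(not_converted)))
    return converted + kept

# Toy data: 1,000 leads, 50 of which converted (5% conversion rate).
rows = [{"email": f"lead{i}@example.com", "target": 1 if i < 50 else 0}
        for i in range(1000)]
training_set = downsample_to_rate(rows)
rate = sum(r["target"] for r in training_set) / len(training_set)
print(len(training_set), rate)
```

Working with an integer percentage keeps the arithmetic exact, so the resulting rate does not drift below the target because of floating-point rounding.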
In the Univariate Analysis tab, see which firmographic, demographic and technographic traits are important conversion factors for your ideal customer profile. This section is also available in the MadKudu App, and you'll find a great article on how to read these graphs here.
Springbok allows users to build customer fit segmentations with either a decision tree-based model or a point-based model.
- A decision tree-based model is particularly suited to identifying different ideal customer profiles and should be used with a large enough training dataset (> 1,000 records).
- A point-based model is suited to cases where you already know your ideal customer profile, or where your training dataset is not large enough for a tree-based model.
When the model base selected is a decision tree, the Trees and Rules tabs contain the structure of the model.
When the model base selected is rule-based (point-based), only the Rules tab contains the structure of the model.
If the Model base is a Decision Tree
In each node, a decision tree uses a condition on one or more computations to separate leads into classes, and each class is characterized by a conversion rate.
How to read the tree? Each node is a split condition. The sub-node on the left contains the leads that satisfy the split condition (in addition to all the conditions that led to this node). The sub-node on the right contains the remaining leads, in other words the leads that satisfy the conditions of the parent node but not the split condition.
In the example below, node 18 corresponds to all the personal emails, as it includes all the leads whose enrichment satisfies the Node Definition. This node in turn separates the leads into node 19, where the Split condition is satisfied, and node 30, where it is not.
You can see the univariate analysis of each node by clicking on "See univariate analysis for this node", as well as a sample of converted (target = 1) or non-converted (target = 0) leads in this node.
The node stats section tells us that node 18 contains 7,218 emails, but only 113 have converted, therefore the conversion rate is 1.57%.
Building a tree-based model is essentially trying to find the split conditions which will separate the leads into two classes with a very different conversion rate.
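To make the idea of a split concrete, here is a minimal sketch that applies one split condition to a node and compares the conversion rates of the two children. The helper names and the `is_personal` toy field are assumptions for illustration; the last line only reproduces the node 18 arithmetic quoted above.

```python
def conversion_rate(leads):
    """Share of leads with target == 1 (0.0 for an empty node)."""
    return sum(lead["target"] for lead in leads) / len(leads) if leads else 0.0

def split_node(leads, condition):
    """Left child: leads that satisfy the split condition; right child: the rest."""
    left = [lead for lead in leads if condition(lead)]
    right = [lead for lead in leads if not condition(lead)]
    return left, right

# Toy node: 4 personal emails (none converted), 6 business emails (3 converted).
node = (
    [{"is_personal": 1, "target": 0}] * 4
    + [{"is_personal": 0, "target": 1}] * 3
    + [{"is_personal": 0, "target": 0}] * 3
)
left, right = split_node(node, lambda lead: lead["is_personal"] == 1)
print(conversion_rate(node), conversion_rate(left), conversion_rate(right))

# Node stats from the example: 113 conversions out of 7,218 emails, in percent.
node_18_rate = round(113 / 7218 * 100, 2)
```

A useful split is one where the two children end up with clearly different conversion rates, as in the toy node here.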
All the end nodes of the trees are then sorted by conversion rate. The nodes with the highest conversion rate will correspond to the very good leads while the nodes with the lowest conversion rate will correspond to the low fit leads.
For example, if one node isolates the Software, B2B, MidMarket US-based companies and this node has the highest conversion rate, then the leads with these traits will be scored very good. Whereas if the leaf isolating the personal emails has a 0% conversion rate, then any personal email would be scored low by the model.
Note: to avoid overfitting the model, every end node should have a minimum population of about 100 leads.
When building a customer fit model, we want to identify the small portion of leads that will account for most of the conversions. This means trying to isolate leads in nodes with a small population but a high conversion rate.
To check if the tree is indeed doing a good job at separating the good leads from the bad leads, we look at the AUC.
The AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve is one of the most important evaluation metrics for checking a classification model's performance in machine learning.
It tells how well the model is able to distinguish between classes.
- An excellent model has an AUC near 1, which means it separates the classes well.
- A poor model has an AUC near 0, which means it ranks the classes in reverse (converting leads are ranked below non-converting ones).
- When the AUC is 0.5, the model has no class separation capacity whatsoever.
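The three cases above can be checked with a small sketch. It uses the pairwise definition of the AUC: the probability that a randomly chosen converted lead gets a higher score than a randomly chosen non-converted lead, with ties counting for half. The function name `auc` and the toy scores are illustrative, and this quadratic loop is only meant for small examples.

```python
def auc(scores, targets):
    """AUC as the share of (converted, non-converted) pairs ranked correctly.
    Ties count for half a pair."""
    positives = [s for s, t in zip(scores, targets) if t == 1]
    negatives = [s for s, t in zip(scores, targets) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

targets = [1, 1, 0, 0]
perfect = auc([0.9, 0.8, 0.2, 0.1], targets)    # converters always ranked first
reversed_ = auc([0.1, 0.2, 0.8, 0.9], targets)  # converters always ranked last
no_signal = auc([0.5, 0.5, 0.5, 0.5], targets)  # same score for everyone
print(perfect, reversed_, no_signal)
```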
How to read the AUC graph? The graph below is essentially a representation of the table of ranked nodes described above. At the bottom left are the nodes with the highest conversion rates, and at the top right the nodes with the lowest conversion rates. A good model shows a sharp increase at the start of the curve, then a flat tail. This means that a small % of the population accounts for a large % of conversions, and that the model isolates a large portion of non-converting leads. In the example below, we can read that 30% of the leads in the training set account for 75% of the conversions.
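The curve can be rebuilt from the end nodes alone. The sketch below ranks nodes by conversion rate and accumulates the share of leads and the share of conversions; the node populations and conversion counts are made up for illustration, and `cumulative_gains` is a hypothetical helper name.

```python
def cumulative_gains(nodes):
    """nodes: list of (population, conversions) tuples, one per end node.
    Returns (cumulative share of leads, cumulative share of conversions)
    points, starting with the highest-converting node."""
    ranked = sorted(nodes, key=lambda node: node[1] / node[0], reverse=True)
    total_population = sum(population for population, _ in ranked)
    total_conversions = sum(conversions for _, conversions in ranked)
    points = [(0.0, 0.0)]
    seen_population = seen_conversions = 0
    for population, conversions in ranked:
        seen_population += population
        seen_conversions += conversions
        points.append((seen_population / total_population,
                       seen_conversions / total_conversions))
    return points

# Made-up end nodes: (population, conversions).
end_nodes = [(500, 5), (100, 60), (200, 30), (200, 5)]
points = cumulative_gains(end_nodes)
for share_of_leads, share_of_conversions in points:
    print(f"{share_of_leads:.0%} of leads -> {share_of_conversions:.0%} of conversions")
```

With these toy numbers the first 30% of leads already capture 90% of conversions, which is the "sharp increase, then flat tail" shape described above.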
Finally, a tree-based model can combine up to 3 trees. The advantage of building several trees is that leads can be separated along more dimensions than with a single tree: each tree can use different features to isolate leads, which helps optimize the performance of the model.
The customer fit segment (very good, good, medium, low) of a lead is then defined by its predicted conversion rate, which is the weighted average of the conversion rates given by the trees. Those weights and the thresholds between the segments are configured in the Ensembling tab.
If the Model base is a Decision Tree
Within each customer fit segment, we can also adjust the final score of a lead.
Because the final segmentation in your CRM is as follows:
- low = 0 - 49 score
- medium = 50 - 69 score
- good = 70 - 84 score
- very good = 85 - 100 score
we may want to boost or penalize the leads within each segment to fit your business needs.
For example, to avoid overfitting the model, we may choose not to use person-level traits in the trees. However, among the companies scored very good, you may want to boost the leads with a good title, or reserve the 95-100 scores for the Fortune 100. To do so, the Rules tab lets you allocate points to specific traits to move the score up or down, so that some leads land at the bottom or top of their segment.
Note: As of October 2020, that part is either configured to default rules or hardcoded in SQL, but we are working on building a better interface.
If the Model base is rule-based
For a rule-based model, the logic behind the score of a lead is more straightforward. The score of a lead is the sum of all the points associated with each rule the lead verifies and then the score is normalized between 0 and 100.
For example, if we have the following conditions:
- WHEN is_personal = 1 THEN -30
- WHEN pers_title = 'CTO' THEN 50
then a lead with a personal email and a CTO job title will have an internal score of 20 (-30 + 50). If the threshold between low and medium is 10 points and the threshold between medium and good is 30, the lead will be in the medium segment and have a final score in your CRM between 50 and 69.
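The example above can be sketched as follows. The two rules, the 10 and 30 point thresholds, and the CRM score bands come from the text; the good/very good threshold of 60, the choice to treat thresholds as inclusive lower bounds, and all function names are assumptions made for illustration. The exact normalization to a 0-100 score is not described here, so the sketch only returns the score band of the segment.

```python
# Rules from the example: (condition, points).
RULES = [
    (lambda lead: lead.get("is_personal") == 1, -30),
    (lambda lead: lead.get("pers_title") == "CTO", 50),
]

# Lower bound (in points) of each segment. 10 and 30 come from the example;
# 60 is a made-up value for the good / very good threshold.
THRESHOLDS = [(10, "medium"), (30, "good"), (60, "very good")]

# Score bands of the final segmentation in the CRM.
SCORE_BANDS = {"low": (0, 49), "medium": (50, 69),
               "good": (70, 84), "very good": (85, 100)}

def internal_score(lead):
    """Sum of the points of every rule the lead satisfies."""
    return sum(points for condition, points in RULES if condition(lead))

def segment_for(points):
    """Highest segment whose lower bound is reached (assumed inclusive)."""
    segment = "low"
    for lower_bound, name in THRESHOLDS:
        if points >= lower_bound:
            segment = name
    return segment

lead = {"is_personal": 1, "pers_title": "CTO"}
points = internal_score(lead)
segment = segment_for(points)
print(points, segment, SCORE_BANDS[segment])
```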
The Ensembling tab lets you "ensemble" trees together and configure the conversion rate thresholds that define the customer fit segments.
In this example
- Tree #1 is allocated a weight of 15% in the score of the leads, Tree #2 25%, and Tree #3 60%. This means that if the lead falls into node #18 in Tree #1, node #20 in Tree #2, and node #7 in Tree #3, then the predicted conversion rate of the lead is:
lead cvr = 0.15 x cvr(node 18 of tree 1) + 0.25 x cvr(node 20 of tree 2) + 0.60 x cvr(node 7 of tree 3)
- The threshold between Low and Medium is an 8% conversion rate. This means that all the leads in the nodes of the trees with a conversion rate lower than 8% will be scored low.
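Here is a minimal sketch of that weighted average and of the threshold lookup. The tree weights and the 8% low/medium threshold come from the example; the node conversion rates, the two other thresholds, and the function names are hypothetical values chosen only to make the sketch runnable.

```python
WEIGHTS = [0.15, 0.25, 0.60]  # weights of Tree #1, #2 and #3 in the example

# Lower bound of each segment on the predicted conversion rate.
# Only the 8% low / medium threshold comes from the example;
# 15% and 25% are made-up values.
THRESHOLDS = [(0.08, "medium"), (0.15, "good"), (0.25, "very good")]

def predicted_cvr(node_cvrs, weights=WEIGHTS):
    """Weighted average of the conversion rates of the nodes the lead falls into."""
    return sum(weight * cvr for weight, cvr in zip(weights, node_cvrs))

def segment_for(cvr):
    """Highest segment whose lower bound is reached; below 8% is low."""
    segment = "low"
    for lower_bound, name in THRESHOLDS:
        if cvr >= lower_bound:
            segment = name
    return segment

# Made-up conversion rates for node 18 of tree 1, node 20 of tree 2, node 7 of tree 3.
lead_cvr = predicted_cvr([0.02, 0.10, 0.30])
print(round(lead_cvr, 3), segment_for(lead_cvr))
```

Because Tree #3 carries 60% of the weight, its node dominates the result: the lead sits in a 2% node in Tree #1 but still ends up well above the 8% threshold.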
To measure the performance of the model we look at 2 metrics: the Recall and the Precision.
The first graph returns the performance of the model on the training set.
On the left-hand side, it shows the distribution of leads by their customer fit segment. Ideally, the model identifies 20-35% of very good and good leads, 25-35% of medium, and 40-50% of lows.
On the right-hand side, it shows the distribution of conversions by their customer fit segment. What we want to optimize is the Recall, which is the % of conversions scored very good or good.
Recall refers to the percentage of total relevant results correctly classified by the model. It is calculated as the number of True Positives divided by (True Positives + False Negatives). The False Negatives here are the leads scored low or medium that converted anyway. Precision refers to the percentage of results returned by the model that are relevant, that is True Positives divided by (True Positives + False Positives).
From this graph, we try to optimize the model to be as close as possible to the 20/80 rule: 20% of leads account for 80% of conversions. But a 20/80 performance is seldom achievable, because some outliers convert and the model cannot predict them without overfitting.
Here, the very good and good leads represent 35% of leads and account for 83% of conversions (Recall).
The second metric to look at is the Precision: is the model correctly identifying the leads who convert at a higher rate than others? Ideally, we want at least a 10x difference in conversion rate between the very good and the low segments. This means the very good leads actually have a higher probability of converting than the low ones.
Note: the conversion rates here should not be read as absolute values, because the training dataset has been engineered (downsampled) to reach a 20% conversion rate.
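Both metrics can be read off a table of leads and conversions per segment. In the sketch below the counts are hypothetical, picked so that the result matches the 35% of leads / 83% of conversions example; the variable names are illustrative.

```python
# Hypothetical counts per segment, picked to reproduce the 35% / 83% example.
leads = {"very good": 150, "good": 200, "medium": 250, "low": 400}
conversions = {"very good": 50, "good": 33, "medium": 12, "low": 5}

top_segments = ("very good", "good")

# Share of all leads that the model scores very good or good.
share_of_leads = sum(leads[s] for s in top_segments) / sum(leads.values())

# Recall: share of all conversions that were scored very good or good.
recall = sum(conversions[s] for s in top_segments) / sum(conversions.values())

# Conversion rate per segment, and the very good vs. low ratio ("10x" check).
cvr = {s: conversions[s] / leads[s] for s in leads}
lift = cvr["very good"] / cvr["low"]

print(f"{share_of_leads:.0%} of leads, {recall:.0%} recall, {lift:.1f}x lift")
```

As the note above says, the per-segment conversion rates depend on how the dataset was downsampled, so the ratio between segments is the more meaningful figure.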
The segment of a lead is defined by the model according to the result returned by the trees. However, you may want to override the model to force the segment of some leads based on one or more traits.
For example, suppose your company has not converted many leads from Enterprise companies, so the model "naturally" scores them medium because your historical data says these leads don't convert well. However, you still want to prioritize these leads as very good for your sales team to go after. To do so, you would apply an override like "If the company size of the lead is Enterprise, then it should be scored very good" [regardless of its intrinsic score].
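Conceptually, an override is a condition checked after the model has produced a segment. The following sketch shows the Enterprise example under simple assumptions: the `company_size` field name, the list-of-pairs representation, and the first-match-wins behavior are all hypothetical and may differ from how overrides are stored in the platform.

```python
# Each override is a (condition, forced segment) pair; the first match wins.
OVERRIDES = [
    (lambda lead: lead.get("company_size") == "Enterprise", "very good"),
]

def apply_overrides(lead, model_segment, overrides=OVERRIDES):
    """Return the forced segment if an override matches, else the model's segment."""
    for condition, forced_segment in overrides:
        if condition(lead):
            return forced_segment
    return model_segment

enterprise_lead = {"email": "cto@bigco.example", "company_size": "Enterprise"}
small_lead = {"email": "founder@startup.example", "company_size": "Small"}
print(apply_overrides(enterprise_lead, "medium"))  # override applies
print(apply_overrides(small_lead, "medium"))       # model segment is kept
```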
In the Overrides tab, you can see all the overrides that are applied to the model. In the sub-tab "check performance of overrides", you will see how the performance changes compared to the Ensembling tab once the overrides are applied. This page lets you check that an override you just applied is not boosting too many leads into good / very good, or decreasing the Recall.
A model needs to be validated on a validation dataset that does not overlap with the training dataset. For that, we usually take leads that are more recent than those in the training dataset, and check the performance of the model on this dataset in the Validation tab. The same metrics of Recall and Precision can be extracted from the graph, and a model is "valid" if we are still close to 65%+ Recall and 10x Precision.
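These two checks can be expressed in a few lines. In this sketch the function names are hypothetical, and the "close to 65%+ Recall and 10x Precision" guideline is simplified into hard cut-offs, which is stricter than the wording above.

```python
def has_overlap(training_emails, validation_emails):
    """True if at least one email appears in both datasets."""
    return bool(set(training_emails) & set(validation_emails))

def is_valid(recall, lift, min_recall=0.65, min_lift=10.0):
    """Simplified validity check: recall of very good + good conversions,
    and very good vs. low conversion-rate ratio, both above a cut-off."""
    return recall >= min_recall and lift >= min_lift

training_emails = ["a@example.com", "b@example.com"]
validation_emails = ["c@example.com", "d@example.com"]

print(has_overlap(training_emails, validation_emails))
print(is_valid(recall=0.70, lift=12.0))
print(is_valid(recall=0.55, lift=12.0))
```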
The Signals tab defines the rules that create the signals displayed in your CRM to explain the score of the leads. These rules identify the traits that impact the score of a lead positively or negatively. For example, if a split condition in a tree creates a node with a high conversion rate, it means the condition on this trait impacts the score of a lead positively.
Note: As of October 2020, the signals are hardcoded in YAML but we are working on developing a better interface.
Check out a sample of scored leads from the validation dataset in the Spot Check page. For each email, you will find its score, segment, and signals as you would see them in your CRM. The links to the nodes of the trees show which node the lead falls into in each tree, which explains the score of the lead.