MadKudu's Data Science Studio (also called Springbok or DSS) is our platform for easily building predictive models and segmentations. It was originally built for our internal data scientists to build models for our customers, and we are excited to now open this "black box" to external users. The platform is accessible:
- in read-only mode for users with the User or Admin role: they can see how the model is configured, but cannot modify it.
- in edit mode for users with the role Architect.
(Admins can manage users in app.madkudu.com > Settings > Users)
Please note that this access is in beta and we are working on improving the Data Science Studio based on customer feedback. Don't hesitate to share yours at email@example.com!
This article focuses on the Data Science Studio capabilities for the Customer Fit models.
At a high level, the platform allows you to:
- create a training dataset from your CRM or upload via a CSV file
- understand which firmographic, demographic or technographic traits of leads are correlated with their conversion
- build a decision-tree based or point-based model from those traits
- adjust the thresholds between the customer fit segments (very good, good, medium, low)
- create the signals of the model
- validate the performance of the model on a validation dataset from your CRM or uploaded via CSV file
- preview a sample of scored leads
Now let's get into the details.
On the home page, you will find the list of models live in your CRM. To view the model, click on "Open".
Within the model page,
- the Customer Profile section allows you to access the list of traits available to build a segmentation.
- the Data Science Studio allows you to see the impact of each trait on the conversion, as well as how the model is built.
In this section, you can see when the training and validation datasets were last uploaded.
In the Computations tab, you can find the list of firmographic, demographic, and technographic traits ("computations") available to use in the customer fit models and how they are defined.
Some computations come directly from our Enrichment partners (like company category industry), others from our own computations (like predicted revenue), and others can be created from your own data in your CRM. Learn more about Computations.
Data Science Studio
On the Data page, you can see the size and conversion rate of the training set and validation set uploaded to train and validate the customer fit model. A training dataset is essentially a table with 2 columns:
- email: the email of the lead
- target: 1 if the lead has converted, 0 otherwise.
A training set is usually built to reach a conversion rate of at least 20%, to avoid a class imbalance problem.
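As an illustration of how a raw list of leads could be brought to a 20% conversion rate, here is a minimal sketch that keeps every converted lead and randomly down-samples the non-converted ones. The function name `downsample_to_rate` and the sample data are hypothetical; this is not necessarily how the platform builds the dataset.

```python
import random

def downsample_to_rate(rows, target_rate=0.20, seed=0):
    """Keep all converted leads and sample non-converted leads so that
    converted leads make up roughly `target_rate` of the result."""
    converted = [r for r in rows if r["target"] == 1]
    not_converted = [r for r in rows if r["target"] == 0]
    # Number of non-converted leads needed to reach the target rate.
    n_keep = round(len(converted) * (1 - target_rate) / target_rate)
    n_keep = min(n_keep, len(not_converted))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return converted + rng.sample(not_converted, n_keep)

# Hypothetical raw data: 50 converted leads out of 1,000 (5% conversion rate).
raw = [
    {"email": f"lead{i}@example.com", "target": 1 if i < 50 else 0}
    for i in range(1000)
]
training = downsample_to_rate(raw)
rate = sum(r["target"] for r in training) / len(training)
print(len(training), f"{rate:.0%}")
```

With these made-up numbers, the 50 converted leads are kept and 200 non-converted leads are sampled, which gives a 250-row training set.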
In the Univariate Analysis tab, you can see which firmographic, demographic and technographic traits are important conversion factors for your ideal customer profile. This section is also available in the MadKudu App, and you'll find a great article on how to read these graphs here.
Springbok allows users to build customer fit segmentations with either a decision tree-based model, or a point-based model.
- A decision tree-based model is particularly well suited to identifying different ideal customer profiles and should be used with a large enough training dataset (> 1,000 records).
- A point-based model is a better choice when you already know your ideal customer profile or when your training dataset is not large enough for a tree-based model.
When the selected model base is a decision tree, the Trees and Rules tabs contain the logic of the model.
When the selected model base is rule-based (point-based), only the Rules tab contains the rules of the model.
If the model is a Decision Tree
We've got a full article about it just for you here.
The TL;DR: decision trees classify leads into populations with very different conversion rates, to separate the high-performing populations from the low-performing ones.
If the Model base is Tree-based
Within each customer fit segment, we can also adjust the final score of a lead.
Because the final segmentation in your CRM is as follows:
- low = 0 - 49 score
- medium = 50 - 69 score
- good = 70 - 84 score
- very good = 85 - 100 score
we may want to boost or penalize the leads among each segment to fit your business needs.
For example, to avoid overfitting the model, we could exclude person-level traits from the trees. However, among the companies scored very good, you may want to boost the leads with a good title, or reserve the 95-100 scores for the Fortune 100. To do so, the Rules tab lets you allocate points to some traits to move the score up or down, so that leads are placed more precisely at the bottom or top of a segment.
Note: As of October 2020, that part is either configured to default rules or hardcoded in SQL, but we are working on building a better interface.
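To make the relationship between score ranges and within-segment adjustments concrete, here is a small sketch. The segment ranges come from the list above; the function names are hypothetical, and clamping the adjusted score so that it stays inside its segment is an assumption made for illustration, not a description of the platform's actual logic.

```python
# Score ranges per segment, as listed above (inclusive bounds).
SEGMENT_RANGES = {
    "low": (0, 49),
    "medium": (50, 69),
    "good": (70, 84),
    "very good": (85, 100),
}

def segment_for_score(score):
    """Return the customer fit segment for an integer score in 0-100."""
    for segment, (low, high) in SEGMENT_RANGES.items():
        if low <= score <= high:
            return segment
    raise ValueError(f"score out of range: {score}")

def adjust_within_segment(score, points):
    """Add (or subtract) points, keeping the lead in its current segment.

    Assumption: the adjusted score is clamped to the segment's range.
    """
    low, high = SEGMENT_RANGES[segment_for_score(score)]
    return max(low, min(high, score + points))

print(segment_for_score(72))            # good
print(adjust_within_segment(88, 10))    # 98: boosted toward the top of "very good"
print(adjust_within_segment(72, -10))   # 70: penalized, but still "good"
```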
If the Model base is Rule-based
For a rule-based model, the logic behind the score of a lead is more straightforward. The score of a lead is the sum of all the points associated with each rule the lead complies with.
For example, if we have the following conditions:
- WHEN is_personal = 1 THEN -30
- WHEN pers_title = 'CTO' THEN 50
then a lead with a personal email but a CTO job title will have a score of -30 + 50 = 20.
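The same example can be written as a short sketch. The two rules are the ones listed above; representing each rule as a (condition, points) pair and a lead as a dictionary is an illustrative choice, not the platform's internal format.

```python
# Each rule is a (condition, points) pair. Conditions mirror the example above.
RULES = [
    (lambda lead: lead.get("is_personal") == 1, -30),
    (lambda lead: lead.get("pers_title") == "CTO", 50),
]

def point_score(lead, rules=RULES):
    """Sum the points of every rule the lead complies with."""
    return sum(points for condition, points in rules if condition(lead))

lead = {"is_personal": 1, "pers_title": "CTO"}
print(point_score(lead))  # 20
```

A lead that matches no rule gets a score of 0 in this sketch.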
A full article is available here to give you more details on creating or editing the rules of your point-based model.
If the Model base is Decision Tree-based
The Ensembling tab allows you to "ensemble" trees together and to configure the conversion rate thresholds to define the customer fit segments.
In this example:
- Tree #1 is allocated a weight of 15% in the score of the leads, Tree #2 25%, and Tree #3 60%. This means that if a lead falls into node #18 in Tree #1, node #20 in Tree #2, and node #7 in Tree #3, then the predicted conversion rate of the lead is the combination of the conversion rates of those three nodes, weighted by 15%, 25%, and 60% respectively.
- The threshold between Low and Medium is an 8% conversion rate. This means that all the leads in the nodes of the trees with a conversion rate lower than 8% will be scored low.
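The ensembling described above can be sketched as a weighted sum. The tree weights (15%, 25%, 60%) and the 8% Low/Medium threshold come from the example; the node conversion rates below are made-up values, and the function names are hypothetical.

```python
# Tree weights from the example above.
TREE_WEIGHTS = {"tree_1": 0.15, "tree_2": 0.25, "tree_3": 0.60}

# Hypothetical conversion rates of the nodes the lead falls into
# (node #18 in Tree #1, node #20 in Tree #2, node #7 in Tree #3).
node_rates = {"tree_1": 0.10, "tree_2": 0.20, "tree_3": 0.30}

LOW_MEDIUM_THRESHOLD = 0.08  # 8% conversion rate, from the example above

def predicted_conversion_rate(rates, weights=TREE_WEIGHTS):
    """Weighted sum of the conversion rates of the lead's nodes."""
    return sum(weights[tree] * rates[tree] for tree in weights)

def is_low(rate, threshold=LOW_MEDIUM_THRESHOLD):
    """A predicted conversion rate below the threshold is scored low."""
    return rate < threshold

predicted = predicted_conversion_rate(node_rates)
print(f"{predicted:.1%}", is_low(predicted))
```

With these assumed node rates, the prediction is 0.15 x 10% + 0.25 x 20% + 0.60 x 30% = 24.5%, well above the 8% threshold.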
To measure the performance of the model we look at 2 metrics: the Recall and the Precision.
The first graph returns the performance of the model on the training set.
On the left-hand side, it shows the distribution of leads by their customer fit segment. Ideally, the model identifies 20-35% of very good and good leads, 25-35% of medium, and 40-50% of lows.
On the right-hand side, it shows the distribution of conversions by their customer fit segment. What we want to optimize is the Recall, which is the % of conversions that were scored very good or good.
Recall refers to the percentage of total relevant results correctly classified by the model. It's calculated as the number of True Positives divided by (True Positives + False Negatives). The False Negatives here are the leads scored low or medium who converted anyway. Precision refers to the percentage of results that are relevant.
From this graph, we try to optimize the model to be as close as possible to the 20/80 rule: 20% of leads account for 80% of conversions. But a 20/80 performance is seldom achievable because some outliers convert and the model can't predict them without overfitting.
Here, 35% of leads account for 83% of conversions (the Recall).
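As a worked illustration, the two percentages can be computed from per-segment counts. The counts below are hypothetical and were chosen only so that they reproduce the 35% and 83% figures; they are not taken from a real model.

```python
# Hypothetical counts per customer fit segment.
leads = {"very good": 150, "good": 200, "medium": 300, "low": 350}
conversions = {"very good": 100, "good": 66, "medium": 24, "low": 10}

TOP_SEGMENTS = ("very good", "good")

def share(counts, segments):
    """Fraction of the total that falls into the given segments."""
    return sum(counts[s] for s in segments) / sum(counts.values())

# Share of all leads scored very good or good.
top_lead_share = share(leads, TOP_SEGMENTS)

# Recall: True Positives / (True Positives + False Negatives), i.e. the
# share of all conversions that were scored very good or good.
recall = share(conversions, TOP_SEGMENTS)

print(f"{top_lead_share:.0%} of leads account for {recall:.0%} of conversions")
```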
The second metric to look at is the Precision: is the model correctly identifying the leads who convert at a higher rate than others? Ideally, we want to have at least a 10x difference in conversion rate between the very good and the low. This means the very goods will actually have a higher probability to convert than the lows.
Note: the conversion rates here should not be read as absolute values, as the training dataset has been engineered (down-sampled) to reach a 20% conversion rate.
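The 10x check can be sketched as the ratio between the conversion rate of the very good segment and that of the low segment. The counts are hypothetical, and the helper names are illustrative.

```python
# Hypothetical counts per customer fit segment on a training set.
leads = {"very good": 150, "good": 200, "medium": 300, "low": 350}
conversions = {"very good": 100, "good": 66, "medium": 24, "low": 10}

def conversion_rate(segment):
    """Conversions divided by leads within one segment."""
    return conversions[segment] / leads[segment]

very_good_rate = conversion_rate("very good")
low_rate = conversion_rate("low")
ratio = very_good_rate / low_rate

print(f"very good: {very_good_rate:.1%}, low: {low_rate:.1%}, ratio: {ratio:.1f}x")
print("meets the 10x target:", ratio >= 10)
```

Because the dataset is down-sampled, only the ratio between segments is meaningful here, not the individual rates.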
If the Model base is Rule-based
To make things easier for the end-users of the scoring, you may want to use a label (or "segment") instead of a score. For example, does your team know if a score of 20 is a good score because it's 20 out of 30 or it's a bad score because it's 20 out of 100?
The Ensembling tab allows you to configure the 4 available segments: very good, good, medium, low. You can set the thresholds defining the minimum score for each segment.
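To illustrate why a label is easier to read than a raw score, here is a sketch that maps the same score of 20 to a segment under two different threshold configurations. The threshold values and the function name are hypothetical.

```python
def segment_from_thresholds(score, min_scores):
    """Return the highest segment whose minimum score is reached.

    `min_scores` maps "very good", "good" and "medium" to their minimum
    score; anything below the "medium" minimum is "low".
    """
    for segment in ("very good", "good", "medium"):
        if score >= min_scores[segment]:
            return segment
    return "low"

# Hypothetical thresholds for a model whose scores top out around 30...
small_scale = {"very good": 25, "good": 18, "medium": 10}
# ...and for a model whose scores go up to 100.
large_scale = {"very good": 85, "good": 70, "medium": 50}

print(segment_from_thresholds(20, small_scale))  # good
print(segment_from_thresholds(20, large_scale))  # low
```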
Overrides allow you to force the segment of populations of leads regardless of what your historical data says about them. This allows you to include your Sales team's feedback in the scoring (for example, always scoring your partners or resellers low to make sure they are never sent to Sales). Learn more about overrides and how to add one to a live model.
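Conceptually, an override is a rule that takes precedence over the segment computed by the model. The sketch below illustrates the partner/reseller example; the `account_type` field and the function name are hypothetical, and real overrides are configured in the platform rather than in code like this.

```python
# Each override is a (condition, forced_segment) pair.
# `account_type` is a hypothetical field used only for this illustration.
OVERRIDES = [
    (lambda lead: lead.get("account_type") in {"partner", "reseller"}, "low"),
]

def apply_overrides(lead, model_segment, overrides=OVERRIDES):
    """Return the forced segment of the first matching override,
    or the model's segment when no override applies."""
    for condition, forced_segment in overrides:
        if condition(lead):
            return forced_segment
    return model_segment

print(apply_overrides({"account_type": "reseller"}, "very good"))  # low
print(apply_overrides({"account_type": "customer"}, "very good"))  # very good
```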
A model needs to be validated on a validation dataset that does not overlap with the training dataset. For that, we usually take leads more recent than those in the training dataset and check the performance of the model on this dataset in the Validation tab. The same metrics of Recall and Precision can be extracted from the graph, and a model is "valid" if we are still close to 65%+ Recall and 10x Precision.
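Two checks follow from this paragraph: the datasets should share no emails, and the validation metrics should stay near the targets. Here is a minimal sketch of both, assuming datasets shaped like the email/target table described earlier. The function names are hypothetical, and the strict cutoffs are a simplification of "close to".

```python
def overlapping_emails(training, validation):
    """Emails present in both datasets (case-insensitive)."""
    train_emails = {row["email"].strip().lower() for row in training}
    valid_emails = {row["email"].strip().lower() for row in validation}
    return sorted(train_emails & valid_emails)

def meets_targets(recall, very_good_rate, low_rate, min_recall=0.65, min_ratio=10):
    """True when Recall is at least 65% and the very good segment converts
    at least 10x more than the low segment."""
    return recall >= min_recall and very_good_rate >= min_ratio * low_rate

# Hypothetical datasets and metrics.
training = [{"email": "a@example.com", "target": 1},
            {"email": "b@example.com", "target": 0}]
validation = [{"email": "A@example.com", "target": 0},
              {"email": "c@example.com", "target": 1}]

print(overlapping_emails(training, validation))  # ['a@example.com']
print(meets_targets(recall=0.70, very_good_rate=0.50, low_rate=0.04))  # True
```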
Signals are meant to provide information in your CRM to explain the score of a lead. They are also useful for surfacing relevant information about the prospect, even if that information is not used in the scoring.
Check out a sample of scored leads from the validation dataset in the spot check page. You will find for each email its score, segment, and signals as you would see it in your CRM.
[For decision tree-based models] The links to the nodes of the trees allow you to see which node a lead falls into for each tree, which explains the score of the lead.
A springbok, fast antelope native to southern Africa, Antidorcas marsupialis.
What does -333 mean as a value?
-333 or -1 is used for "unknown" values in numerical computations.
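If you export numerical computations and analyze them yourself, these placeholder values should be treated as missing rather than as real numbers. Here is a minimal sketch, assuming both -333 and -1 mean "unknown" for the computation at hand; the function name and sample values are hypothetical.

```python
# Placeholder values meaning "unknown" (assumed to apply to this computation).
UNKNOWN_VALUES = {-333, -1}

def clean_numeric(value):
    """Return None for "unknown" placeholders, the value itself otherwise."""
    return None if value in UNKNOWN_VALUES else value

# Hypothetical values of a numerical computation, e.g. employee counts.
values = [120, -333, 45, -1, 300]
known = [v for v in map(clean_numeric, values) if v is not None]

# Averaging only the known values avoids skewing the result.
average = sum(known) / len(known)
print(known, average)
```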