A Customer Fit model on Springbok can be built from simple rules or from decision trees.
This article explains how to read the decision trees. For rule-based models, please follow this article.
The goal of a decision tree is to separate the populations with the highest conversion rate (aka the good fits) from the populations with the lowest conversion rate (aka the low fits).
At each node, the decision tree applies a condition on one or more computations to split the leads into sub-populations, and each population is defined by a set of features (or "traits") and a conversion rate.
How to read the decision tree?
Node definition and split conditions
Each node is a split condition, and:
- the sub-node on the left contains the leads which validate the split condition, in addition to all the conditions that led to this node
- the sub-node on the right contains the remaining leads, in other words the leads which do not validate the split condition but do validate the conditions of the parent node.
In the example below, among the leads present in node 17, the split condition "is_personal=1" splits the population into the personal emails (node 18) and the business emails (node 31).
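To make the mechanics concrete, here is a minimal sketch in Python (pandas), using hypothetical column names such as is_personal and converted that are not taken from Springbok:

```python
import pandas as pd

# Hypothetical leads: one row per lead, with the computation used by the
# split condition (is_personal) and the conversion outcome (converted).
leads = pd.DataFrame({
    "email":       ["a@gmail.com", "b@acme.com", "c@yahoo.com", "d@initech.io"],
    "is_personal": [1, 0, 1, 0],
    "converted":   [0, 1, 0, 1],
})

# The split condition "is_personal = 1" sends the matching leads to the left
# sub-node (node 18 in the example); the remaining leads go to the right
# sub-node (node 31).
left_node = leads[leads["is_personal"] == 1]
right_node = leads[leads["is_personal"] != 1]
```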
The Node definition section provides the condition(s) met by all the leads present in the node.
The Split condition tells you the condition which creates the sub-nodes.
The Node stats section tells us that node 18 contains 7,218 emails, of which only 113 have converted, so the conversion rate is 1.57%.
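The conversion rate of a node is simply the number of conversions divided by the node's population; for node 18:

```python
population = 7_218   # leads in node 18
conversions = 113    # leads in node 18 that converted

conversion_rate = conversions / population
print(f"{conversion_rate:.2%}")  # -> 1.57%
```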
Building a tree-based model is essentially a matter of finding the split conditions that separate the leads into classes with very different conversion rates.
All the end nodes of the tree are then sorted by conversion rate.
- The nodes with the highest conversion rate (at the top) will correspond to the very good leads
- The nodes with the lowest conversion rate (at the bottom) will correspond to the low fit leads.
For example, if one node isolates the Software, B2B, MidMarket US-based companies and this node has the highest conversion rate, then the leads with these traits will be scored very good. Conversely, if the leaf isolating the personal emails has a 0% conversion rate, then any personal email will be scored low by the model.
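As a small illustration of this ranking step (the node numbers and figures below are made up, except node 18 which reuses the stats above):

```python
import pandas as pd

# Hypothetical end nodes of a tree, each with its population and conversions.
end_nodes = pd.DataFrame({
    "node":        [18, 31, 42, 57],
    "population":  [7218, 3500, 450, 900],
    "conversions": [113, 420, 90, 18],
})
end_nodes["conversion_rate"] = end_nodes["conversions"] / end_nodes["population"]

# Rank the end nodes from highest to lowest conversion rate: the nodes at the
# top correspond to the very good leads, the ones at the bottom to the low fits.
ranked = end_nodes.sort_values("conversion_rate", ascending=False)
print(ranked)
```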
When building a customer fit model, we want to identify the small portion of leads that will account for most of the conversions (close to the 20/80 Pareto rule). This means trying to isolate leads in nodes with a small population but a high conversion rate.
The Performance section gives you an indication of how close the tree is to this distribution.
How to read the AUC graph?
To check if the tree is indeed doing a good job at separating the good leads from the bad leads, we look at the AUC.
The AUC (Area Under the Curve) of the ROC curve (Receiver Operating Characteristic) is one of the most important evaluation metrics for checking any classification model's performance in machine learning. No need for a Data Science degree, we are keeping things simple.
It tells you how well the model is able to distinguish between the classes.
- An excellent model has an AUC close to 1, which means it separates the classes well.
- When the AUC is 0.5, the model has no class separation capacity whatsoever (in blue below): this is as good as randomly flagging leads as very good to send to Sales.
The graph below is essentially a representation of the table described above with the ranked nodes: at the bottom left are the nodes with the highest conversion rates, while at the top right are the nodes with the lowest conversion rates. A good model shows a sharp increase at the bottom, then a flat tail. This means that a small % of the population accounts for a large % of the conversions, and that the model isolates a large portion of non-converting leads. In the example below, we can read that 30% of the leads in the training set account for 75% of the conversions.
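The reading of the graph can be reproduced from the ranked end nodes by cumulating their populations and conversions. A rough sketch, reusing the hypothetical figures from the ranking example above:

```python
import pandas as pd

# Hypothetical end nodes already ranked by conversion rate (best first).
ranked = pd.DataFrame({
    "node":        [42, 31, 57, 18],
    "population":  [450, 3500, 900, 7218],
    "conversions": [90, 420, 18, 113],
})

# Cumulative share of leads versus cumulative share of conversions:
# "x% of the leads account for y% of the conversions".
ranked["cum_leads_pct"] = ranked["population"].cumsum() / ranked["population"].sum()
ranked["cum_conversions_pct"] = ranked["conversions"].cumsum() / ranked["conversions"].sum()
print(ranked[["node", "cum_leads_pct", "cum_conversions_pct"]])
# A good model shows a sharp rise followed by a flat tail: here roughly a third
# of the leads account for about 80% of the conversions.
```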
Why are there 3 trees?
The advantage of building several trees is to separate leads along more dimensions than a single tree can: each tree can use different features to isolate leads, which optimizes the performance of the model while avoiding overfitting. With the Data Science Studio, a tree-based model can combine up to 3 trees.
The customer fit segment (very good, good, medium, low) of a lead is then defined by its predicted conversion rate, which is the weighted average of the conversion rates given by the trees. Those weights and the thresholds between the segments are configured in the Ensembling tab.
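As an illustration of this ensembling step (the rates, weights, and thresholds below are hypothetical, not Springbok defaults):

```python
# Conversion rate of the lead's end node in each of the 3 trees.
tree_conversion_rates = [0.12, 0.08, 0.15]
# Tree weights, as configured in the Ensembling tab (hypothetical values).
tree_weights = [0.5, 0.3, 0.2]

# Predicted conversion rate = weighted average of the trees' conversion rates.
predicted_rate = sum(w * r for w, r in zip(tree_weights, tree_conversion_rates))

# Thresholds between segments (hypothetical values), also set in the Ensembling tab.
thresholds = [(0.10, "very good"), (0.05, "good"), (0.02, "medium")]
segment = next((label for cutoff, label in thresholds if predicted_rate >= cutoff), "low")

print(predicted_rate, segment)  # 0.114 -> "very good"
```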
Overfitting a model means building a model whose predictions are based on observations from a sample that is too small.
Example: You've visited London for a weekend and the weather was perfectly sunny. You come back home telling your friends that London is ALWAYS sunny...based on the observations of 2 days out of 365 days. We know it's not true, no offense :)
Now that you have the idea, it would be the same as building a model based on Sales feedback that two 30-person Gaming companies from Iceland converted in the last month out of the 3 leads received from this market, and concluding that these should be your most highly scored leads.
To avoid overfitting the model, we keep the trees from growing too deep and make sure all the end nodes have a minimum population of ~100 leads.
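For readers who want to see what those guardrails look like in code, here is an illustration with scikit-learn (this is not Springbok's actual implementation, just the same idea expressed with a standard library):

```python
from sklearn.tree import DecisionTreeClassifier

# Two common guards against overfitting: cap the depth of the tree and
# require a minimum number of leads in every end node (leaf).
tree = DecisionTreeClassifier(
    max_depth=6,           # keep the tree from growing too deep
    min_samples_leaf=100,  # every leaf must contain at least ~100 leads
)
# tree.fit(X, y)  # X = lead features/traits, y = converted (1) or not (0)
```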
How to see the univariate analysis for a subset of the population?
You can see the univariate analysis of each node by clicking on "See univariate analysis for this node", as well as a sample of leads in this node that converted (target = 1) or did not convert (target = 0).
Help us improve the product! Send us your feedback at firstname.lastname@example.org