- You have access to the Data Studio
- You know what a computation is
A Customer Fit model in the Data Studio can be built based on simple rules, or based on decision trees.
This article explains how to read the decision trees. For rule-based models, please follow this article.
The goal of a decision tree is to isolate populations with the highest conversion rate (aka the good fit), versus populations with the lowest conversion rate (aka the low fit).
Let's get started
Access the Data Studio via the App > Predictions tab > 'Data Studio' button on the top right corner.
Once you're in the Studio, click on the name of the Customer Fit model you want to dive into, then go to the 'Model' tab and open the 'Trees' subsection.
In the Tree Visualization window, you can see several circles (the 'nodes') identified by a number and linked by lines (the branches).
How to read the decision tree?
Each node contains a subset of your whole lead population.
Node number 1 is the starting node. It contains the whole population (i.e. the group of leads that made it into your training dataset).
Each node is split in two using a condition on one or more computations. Leads that satisfy the splitting condition go to the node on the left, and leads that don't go to the node on the right.
We'll use the terms 'parent node' and 'sub-node'. In the image above, Node 1 is the parent of Node 2 and Node 17. Node 18 and Node 19 are the sub-nodes of Node 17.
In the end, your leads will be separated into classes of populations, and each population is defined by a set of features (or "traits", i.e. the set of splitting conditions they met) and their own conversion rate.
This means the tree allows you to identify leads that look like one another and calculate the average conversion rate of each group of similar leads.
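As a rough illustration of this left/right routing, here is a minimal Python sketch. It is not MadKudu's implementation: the `Node` class, the `route` function, and the `employees < 10` condition on node 1 are hypothetical, and a lead is assumed to be a plain dictionary of computation values. Only the node numbers and the `is_personal = 1` condition echo the examples in this article.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    number: int
    condition: Optional[Callable[[dict], bool]] = None  # None means end node
    label: str = ""                                      # readable form of the condition
    left: Optional["Node"] = None                        # leads that satisfy the condition
    right: Optional["Node"] = None                       # leads that do not


def route(lead: dict, node: Node):
    """Walk a lead down the tree; return the end node number and the traits met."""
    traits = []
    while node.condition is not None:
        if node.condition(lead):
            traits.append(node.label)
            node = node.left
        else:
            traits.append(f"NOT {node.label}")
            node = node.right
    return node.number, traits


# Hypothetical tree: the condition on node 1 is invented for illustration.
tree = Node(
    1, lambda lead: lead["employees"] < 10, "employees < 10",
    left=Node(2),
    right=Node(
        17, lambda lead: lead["is_personal"] == 1, "is_personal = 1",
        left=Node(18), right=Node(19),
    ),
)

end_node, traits = route({"employees": 250, "is_personal": 1}, tree)
print(end_node, traits)
```

The list of traits returned alongside the end node plays the role of the node definition: it is the set of splitting conditions the lead met (or failed) on its way down.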
Node Definition and Split conditions
As we've just seen, each node contains a subpopulation and a split condition.
- the sub-node on the left contains the leads that satisfy the split condition in addition to all the conditions of the parent node
- the sub-node on the right contains the remaining leads, in other words the leads that do not satisfy the split condition but still satisfy the conditions of the parent node.
In the example below, among the leads present in node 17, the split condition "is_personal=1" splits the population into the personal emails (node 18) and the business emails (node 31).
Click on a node to display its information on the right.
The Node definition section provides the condition(s) met by all the leads present in the node.
The Split condition tells you which condition creates its sub-nodes.
The Node stats section tells us that node 2 contains 650 emails ('Population'), but only 10 have converted ('Conversions'), therefore the conversion rate is 1.54%.
It also tells us that 39.3% of the whole population lies in this node ('% population'; meaning that 650 emails represent 39.3% of the whole leads population) and that 3% of the leads that converted lie in this node ('% conversions'; meaning that 10 converted emails is 3% of the total number of conversions in the training dataset).
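The arithmetic behind these node stats can be sketched in a few lines of Python. The function name `node_stats` is hypothetical, and the totals of 1,654 leads and 333 conversions are not stated in this article: they are back-computed from the percentages above and rounded, so treat them as assumptions.

```python
def node_stats(population, conversions, total_population, total_conversions):
    """Compute the three ratios shown in the Node stats section."""
    return {
        "conversion_rate": conversions / population,        # conversions within the node
        "pct_population": population / total_population,    # share of all leads in this node
        "pct_conversions": conversions / total_conversions,  # share of all conversions in this node
    }


# 650 and 10 come from the node 2 example; 1654 and 333 are assumed totals.
stats = node_stats(650, 10, total_population=1654, total_conversions=333)
for name, value in stats.items():
    print(f"{name}: {value:.2%}")
```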
Building a tree-based model is essentially trying to find the split conditions which will separate the leads into two classes with a very different conversion rate.
How to see the Insights for a subset of the population?
You can see the Insights page of each node by clicking on the node and then "View Insights for this node".
You can also see a sample of the converted (target = 1) or non-converted (target = 0) leads in this node.
Nodes ranked by conversion rate
At the bottom of the page, you'll find a table listing all the end nodes of the tree (i.e. the nodes that aren't split any further), sorted by conversion rate.
- The nodes with the highest conversion rate (at the top) will correspond to the nodes containing the very good leads
- The nodes with the lowest conversion rate (at the bottom) will correspond to the nodes containing the low fit leads
For example, if one node isolates the Software, B2B, MidMarket US-based companies and this node has the highest conversion rate, then the leads with these traits will be scored very good. Conversely, if the node isolating the personal emails has a 0% conversion rate, then any personal email will be scored low by the model.
This table also displays:
- Conversions: the number of conversions in the node
- Population: the number of leads in the node
- Population %: cumulative part of the population that has been caught by the nodes above
- Conversion %: cumulative part of the conversions that have been caught by the nodes above
- Conversion rate: the conversion rate of the node. The table is sorted by this column, from highest to lowest.
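To make the cumulative columns concrete, here is a small Python sketch that rebuilds such a table from a list of end nodes. The node numbers and counts are made up, the function name `ranked_table` is hypothetical, and the sketch assumes each cumulative figure includes the current row, which the Studio may or may not do.

```python
def ranked_table(end_nodes):
    """end_nodes: list of (node_number, population, conversions) tuples."""
    total_pop = sum(pop for _, pop, _ in end_nodes)
    total_conv = sum(conv for _, _, conv in end_nodes)

    rows, cum_pop, cum_conv = [], 0, 0
    # Sort end nodes from highest to lowest conversion rate.
    for number, pop, conv in sorted(end_nodes, key=lambda n: n[2] / n[1], reverse=True):
        cum_pop += pop
        cum_conv += conv
        rows.append({
            "node": number,
            "conversions": conv,
            "population": pop,
            "population_pct": cum_pop / total_pop,    # cumulative share of leads
            "conversion_pct": cum_conv / total_conv,  # cumulative share of conversions
            "conversion_rate": conv / pop,
        })
    return rows


# Hypothetical end nodes: (node number, population, conversions)
rows = ranked_table([(12, 650, 10), (4, 200, 60), (15, 750, 0), (9, 400, 30)])
for row in rows:
    print(row)
```

In this made-up example, the first two rows hold 30% of the leads and 90% of the conversions, which is the kind of concentration a customer fit model aims for.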
When building a customer fit model, we want to identify the small portion of leads that will account for most of the conversions (close to the 20/80 Pareto rule). This means trying to isolate leads in nodes with a small population but a high conversion rate.
The Tree Performance section gives you an indication of how close the tree is to this distribution.
How to read the AUC curve?
To check if the tree is indeed doing a good job at separating the good leads from the bad leads, we look at the AUC.
The AUC (Area Under the Curve) of the ROC (Receiver Operating Characteristic) curve is one of the most important evaluation metrics for checking a classification model's performance in machine learning. No need for a Data Science degree: we are keeping things simple.
It tells you how well the model can distinguish between classes.
- An excellent model has an AUC close to 1, which means it separates the classes well.
- When the AUC is 0.5, the model has no class separation capacity whatsoever (in blue below). This is as good as randomly flagging leads as very good to send to Sales.
In the graph below, the red line is essentially a representation of the table described above, with the nodes ranked by conversion rate. The nodes with the highest conversion rates are at the bottom left, and the nodes with the lowest conversion rates are at the top right. A good model shows a sharp increase at the bottom left, then a flat tail. This means that a small % of the population accounts for a large % of the conversions, and that the model isolates a large portion of non-converting leads. In the example below, we can read that 30% of the leads in the training set account for 75% of the conversions.
The blue line corresponds to an AUC of 0.5 and shows the performance of a random model as a reference. Having the red line above the blue line means you are getting valuable information out of your model.
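For readers who want to see where such a number comes from, the area under a curve like this can be approximated with the trapezoid rule. The sketch below is illustrative only: `area_under_curve` is a hypothetical helper, and the red curve is reduced to the single point quoted above (30% of leads, 75% of conversions). It also puts the share of all leads on the x axis, as in the graph described here, whereas a textbook ROC curve uses the share of non-converting leads, so the result will not exactly match the AUC reported by the Studio.

```python
def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Diagonal: a model that flags leads at random.
random_model = [(0.0, 0.0), (1.0, 1.0)]
# Simplified red curve: 30% of the leads account for 75% of the conversions.
example_tree = [(0.0, 0.0), (0.30, 0.75), (1.0, 1.0)]

print(area_under_curve(random_model))
print(area_under_curve(example_tree))
```

The diagonal yields 0.5, and any curve that rises faster than the diagonal yields a larger area, which is why the red line sitting above the blue line indicates a useful model.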
Why are there 3 trees?
If you're an advanced user of MadKudu's Studio, keep reading!
As explained further below, to avoid overfitting the model we avoid building trees that are too deep, making sure all the end nodes have a minimum population of ~100 leads.
This rule might prevent you from using all the computations you have in mind as splitting conditions in your first tree.
The advantage of building several trees is to use different computations in several trees in order to optimize the performance of the model while avoiding overfitting. With the Data Studio, a tree-based model can combine up to 3 trees. Each tree is attributed a weight in the Thresholds tab.
The customer fit segment (very good, good, medium, low) of a lead is then defined by its predicted conversion rate, which is the weighted average of the conversion rates of the nodes the lead falls into, one node per tree. The weights and the thresholds between the segments are configured in the Thresholds tab as well. Learn more about the Customer Fit score calculation
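A minimal Python sketch of this calculation follows. The function names, the weights, the node conversion rates, and the threshold values are all hypothetical; the real values are whatever is configured in the Thresholds tab. The sketch divides by the sum of the weights so that it works whether or not the weights add up to 1, which is an assumption about how the weights are expressed.

```python
def predicted_conversion_rate(node_rates, weights):
    """Weighted average of the conversion rate of the node hit in each tree."""
    return sum(rate * w for rate, w in zip(node_rates, weights)) / sum(weights)


def segment(rate, thresholds):
    """thresholds: list of (minimum rate, segment name), highest minimum first."""
    for minimum, name in thresholds:
        if rate >= minimum:
            return name
    return "low"


# Hypothetical lead: conversion rate of the end node it falls into in each of 3 trees.
node_rates = [0.12, 0.08, 0.02]
weights = [0.5, 0.3, 0.2]  # hypothetical tree weights
thresholds = [(0.10, "very good"), (0.05, "good"), (0.02, "medium")]  # hypothetical

rate = predicted_conversion_rate(node_rates, weights)
print(f"{rate:.1%}", segment(rate, thresholds))
```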
Overfitting a model means creating a model which produces predictions based on observations from a sample that is too small.
Example: You've visited London for a weekend and the weather was perfectly sunny. You come back home telling your friends that London is ALWAYS sunny... based on the observations of 2 days out of 365. We know it's not true, no offense :)
Now that you have the idea, it would be the same as building a model based on Sales feedback that two 30-people Gaming companies from Iceland converted in the last month out of the 3 leads received from this market, and therefore these should be your most highly scored Leads.
To avoid overfitting the model, we avoid building trees that are too deep, making sure all the end nodes have a minimum population of ~100 leads.