Have you just uploaded a dataset to your model, or are you looking at an existing model and trying to understand why the training dataset shows a much higher conversion rate than the validation dataset?
If you have a data science background, you probably already know about class imbalance; if you don't, you will find the answer in this article.
What is the Class Imbalance problem?
A class imbalance occurs when the total number of examples in one class of a training dataset (here, the converters) is far smaller than the total number of examples in the other class (here, the non-converters).
The minority class of converters is harder to predict because there are few examples of it. With so little data, it is more challenging for a model to learn the common traits of converters and to differentiate them from the majority class, so the model ends up biased towards the majority class. For example, suppose your model is correct 99% of the time when predicting whether a lead will convert. That sounds awesome! But if only 1% of leads convert, the model can reach 99% accuracy simply by always predicting "non-converter": it never correctly identifies a converter, and so it does not help surface the very good leads, but only helps reject the bad ones.
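A minimal sketch of this accuracy trap, using hypothetical numbers (1,000 leads, a 1% conversion rate, and a naive model that always predicts "non-converter"):

```python
# Hypothetical dataset: 1,000 leads, only 10 of which converted (1%).
labels = [1] * 10 + [0] * 990  # 1 = converter, 0 = non-converter

# A naive model biased entirely towards the majority class:
# it always predicts "will not convert".
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
converters_found = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))

print(f"accuracy: {accuracy:.1%}")             # 99.0% -- looks great
print(f"converters found: {converters_found}")  # 0 -- useless for surfacing leads
```

Despite the impressive headline accuracy, the model finds zero converters, which is exactly the failure mode described above.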
How do we work around Class Imbalance?
To fight the class imbalance problem, we use two balancing techniques to make sure the ratio of converters to non-converters in the training dataset is at least 2 out of 10 (i.e. a conversion rate of 20%):
- boosting the minority class of converters by adding leads to the dataset who converted but were not created in the selected timeframe.
- downsampling the majority class of non-converters by randomly selecting and removing non-converters from the training dataset.
By default, all customer fit training datasets are created with boosting; downsampling is controlled by the Rebalancing ratio parameter in the Advanced options section.
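To make the downsampling step concrete, here is a sketch with made-up numbers (50 converters, 2,000 non-converters): every converter is kept, and non-converters are randomly dropped until converters make up the target 20% of the training set. The dataset shape and target ratio are illustrative assumptions, not the product's actual internals.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical training set: 50 converters, 2,000 non-converters (~2.4% rate).
converters = [{"lead": i, "converted": True} for i in range(50)]
non_converters = [{"lead": i, "converted": False} for i in range(50, 2050)]

# Target: converters should make up at least 2 out of 10 rows (20%).
target_ratio = 0.20

# Keep every converter; cap the non-converters so the ratio holds.
max_non_converters = int(len(converters) * (1 - target_ratio) / target_ratio)
sampled = random.sample(non_converters, min(max_non_converters, len(non_converters)))

training_set = converters + sampled
rate = len(converters) / len(training_set)
print(f"{len(training_set)} rows, conversion rate {rate:.0%}")  # 250 rows, 20%
```

Downsampling discards information from the majority class, which is why it is exposed as a tunable parameter rather than applied blindly: a more aggressive ratio balances the classes further but leaves the model fewer non-converter examples to learn from.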