Fine-tuning of probabilities: An example of model calibration

In this blog post, we will explore the issue of using model probabilities directly in data analysis. Specifically, we examine the practice of taking model probabilities at face value as expected values and discuss the potential problems this can cause, illustrated by a bank churn example. We will also highlight the importance of looking at the distribution of predicted probabilities, particularly when it comes to post-processing them. Finally, we will suggest potential solutions to these problems to help ensure accurate and meaningful analyses.
What drives churn?
So, let’s get started… Imagine being a Data Scientist working in a bank that faces customer churn. The business department would like to approach the topic of churn prediction analytically and is also willing to spend money on it. By predicting which customers are likely to churn, the bank can take proactive measures such as offering personalized incentives or improving the quality of service. [1]
So what does a Data Scientist do? Ask for data and build a model… As easy as that, right?
One of the main problems with churn is data imbalance. This means that the number of customers who churn (the minority class) is much smaller than the number of customers who do not churn (the majority class). For example, in a dataset with 10,000 customers, there may be only approximately 2,000 customers who have churned. It can be difficult for a machine learning algorithm to accurately predict churn, as it tends to be biased towards the majority class. To identify this bias, one can measure the calibration of the final model with an approach that will be introduced later.
Here one can see how imbalanced the churn class is in the dataset the department handed to the Data Scientist.
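A quick way to quantify this is to count the classes directly. A minimal sketch, assuming the Kaggle dataset from [1] with its Exited column as the churn label:

```python
import pandas as pd

# Assuming the bank churn dataset from [1], where "Exited" marks churned customers
df = pd.read_csv("Churn_Modelling.csv")

counts = df["Exited"].value_counts()
print(counts)            # absolute counts per class
print(counts / len(df))  # relative frequencies, roughly 80% stay vs. 20% churn
```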

The Data Scientist digs through his toolbox and comes up with a Random Forest model (RF) which is able to predict whether a customer is likely to churn or not. Luckily, the Random Forest model provides a predict_proba() method. With this model, the Data Scientist is now able to run inference and send individual churn predictions over to the department.
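A minimal sketch of this step, with preprocessing and feature selection simplified (the full code is in [5]) and X, y standing for the prepared feature matrix and churn labels:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X: preprocessed feature matrix, y: churn labels (e.g. the "Exited" column)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)

# Column 1 holds the score of the positive (churn) class
churn_scores = rf.predict_proba(X_test)[:, 1]
```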
This little story could conclude here. The department takes the churn predictions and applies its business logic to prevent customer churn.
Well, but …
… wait a moment! It’s a trap.
The department received churn predictions from the Data Scientist, which were used to determine incentives. However, the controlling department has observed a significant increase in costs. Upon further investigation, they have discovered that the mean churn probability in the predictions was higher than the actual churn rates in the past.
Therefore, the Data Scientist dived deeper into the RF and found an issue with the predict_proba() method. What happened? There is one wrong assumption in the process: a predicted value between zero and one is not equivalent to a probability.
In reality, these values between zero and one do not represent genuine probabilities. This is due to the following reasons:
- Lack of calibration: predict_proba() values are scores produced by the model and are not guaranteed to match observed event frequencies, unlike true probabilities.
- Bias and variance: the model’s predictions can be biased and may not accurately reflect the data distribution.
- Model dependence: the outcome is strongly influenced by the model architecture itself. The supposed probabilities of a neural network may behave completely differently from those of a random forest. This can be the case even if the class predictions (e.g. with a threshold of 0.5) are the same for all data points.
Visual Inspection
Below, one can see the so-called calibration plot of the baseline model. Here, the predictions are divided into ten bins of equal width. The x‑axis represents the mean predicted probability of each bin, whereas the y‑axis represents the actual fraction of churns in that bin. In an ideal world, all bins would lie on the dotted line: the mean predicted probability would match the actual churn probability.
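Such a plot can be produced with scikit-learn’s calibration_curve; a rough sketch, reusing y_test and churn_scores from above:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Ten equal-width bins over the predicted scores
frac_pos, mean_pred = calibration_curve(
    y_test, churn_scores, n_bins=10, strategy="uniform"
)

plt.plot(mean_pred, frac_pos, marker="o", label="Random Forest")
plt.plot([0, 1], [0, 1], linestyle="--", label="Perfectly calibrated")
plt.xlabel("Mean predicted probability")
plt.ylabel("Fraction of churns (observed)")
plt.legend()
plt.show()
```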

One can directly see that the model tends to be overconfident: in most bins, the actual churn rate is lower than the predicted probability. Well, this fits the feedback which brought the Data Scientist here.
Note that the training dataset is imbalanced. Because of this, the predictions are skewed, with most of them being smaller than 0.2. To take this into account, the bins can be defined by quantiles instead of equal widths, so that each bin contains the same number of customers (one tenth of the data).
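calibration_curve supports this directly via its strategy parameter:

```python
# Ten bins containing (approximately) the same number of customers each
frac_pos_q, mean_pred_q = calibration_curve(
    y_test, churn_scores, n_bins=10, strategy="quantile"
)
```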

The calibration falls apart in the last three bins. A Data Scientist with an analytical view would like to measure these deviations. Luckily, there are some metrics which help out.
Metrics
Expected Calibration Error
The Expected Calibration Error (ECE) is a metric that measures the difference between a model’s predicted probabilities and the true probabilities of the events they are predicting. To calculate the ECE, the predicted probabilities are first binned into a set of intervals. Then, the average difference between the predicted probabilities and the true probabilities is calculated for each interval. The ECE is then calculated as the weighted average of these average differences, where the weights are the number of samples in each interval. [2]
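The code used for this post lives in [5]; purely as an illustration, a straightforward implementation of this definition could look like the following sketch (y_true and y_prob being NumPy arrays of labels and predicted scores):

```python
import numpy as np

def bin_statistics(y_true, y_prob, n_bins=10):
    """Mean predicted probability, observed churn rate and size of each equal-width bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    stats = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():  # skip empty bins
            stats.append((y_prob[mask].mean(), y_true[mask].mean(), mask.sum()))
    return np.array(stats)  # columns: mean_pred, frac_pos, count

def expected_calibration_error(y_true, y_prob, n_bins=10):
    mean_pred, frac_pos, count = bin_statistics(y_true, y_prob, n_bins).T
    weights = count / count.sum()  # weight each bin by its share of samples
    return float(np.sum(weights * np.abs(frac_pos - mean_pred)))
```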

Maximum Calibration Error
The Maximum Calibration Error (MCE) is a related metric that looks at the worst case instead of the average. It is calculated as the maximum difference between the mean predicted probability and the observed fraction of positives over all bins, which makes it particularly relevant when a single badly calibrated bin is costly, for example on imbalanced data. [2]
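Reusing the binning helper from above, the worst-case gap over all bins can be sketched as:

```python
def maximum_calibration_error(y_true, y_prob, n_bins=10):
    # Largest gap between mean predicted probability and observed frequency over all bins
    mean_pred, frac_pos, _ = bin_statistics(y_true, y_prob, n_bins).T
    return float(np.max(np.abs(frac_pos - mean_pred)))
```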

Relative Maximum Calibration Error
Here we would like to present a new metric called Relative Maximum Calibration Error (rel MCE). It is a variant of the MCE that normalizes the maximum calibration error by the mean predicted value. This normalization allows for a more meaningful comparison of calibration errors across different models or datasets, since it takes into account the overall level of confidence in the predictions.
Specifically, this metric calculates the relative distance between the observed fraction of positives and the mean predicted value at each threshold, and takes the maximum value of this relative difference as the final score. The relative difference is simply the absolute difference divided by the mean predicted value, which ensures that the score is normalized by the scale of the predictions.
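Following that description (dividing each bin’s gap by that bin’s mean predicted value before taking the maximum), a sketch could be:

```python
def relative_maximum_calibration_error(y_true, y_prob, n_bins=10):
    # Like the MCE, but each bin's gap is normalized by the bin's mean predicted value
    mean_pred, frac_pos, _ = bin_statistics(y_true, y_prob, n_bins).T
    return float(np.max(np.abs(frac_pos - mean_pred) / mean_pred))
```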

Metrics summary
Overall, these metrics provide a useful way to evaluate the calibration performance of a binary classifier, and they are particularly helpful when comparing models with different levels of predicted probabilities.
How to Calibrate?
Luckily, there is a solution for a better fit. To address these concerns and refine prediction accuracy, Data Scientists often employ calibration techniques. Calibration aims to adjust the initial output of predict_proba() to better align with true probabilities. One common approach is to use scikit-learn, which provides tools for creating calibrated probabilities through isotonic regression or logistic regression. These methods apply mathematical transformations to the initial predictions, enhancing the reliability of the values as probability estimates.
In general, one can use any kind of transformation, but a transformation should have at least the following properties to produce more accurate probabilities:
- Scaling: The transformation should smoothly map values to the range between 0 and 1, making them suitable as probabilities.
- Monotonicity: The transformation should preserve the order of the predicted values, ensuring that higher values correspond to higher probabilities and maintaining the underlying ranking.
- Interpretability: The transformation should be interpretable, so that one can understand the relationship between the output of the baseline model and the resulting probabilities.
- Flexibility: The transformation should handle linear and non-linear relationships between the model output and the observed outcomes, adapting to various data patterns.
Given the scikit-learn utilities, one can calibrate a model via a sigmoid function or an isotonic regression. The first fits a logistic regression (typically minimizing the log loss) on the model outputs, the second fits a non-decreasing step function. As discussed before, one could use any kind of function, but empirically these two methods work pretty well.
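To get a feeling for the two mappings, one can fit them directly on hold-out scores. In the sketch below, raw_scores and y_holdout are hypothetical uncalibrated RF scores and true labels on a hold-out set; note that scikit-learn’s built-in sigmoid calibration uses its own Platt-scaling fit internally, so this only approximates the idea:

```python
from sklearn.isotonic import IsotonicRegression
from sklearn.linear_model import LogisticRegression

# Sigmoid (Platt-style) calibration: a logistic regression on the raw model scores
platt = LogisticRegression()
platt.fit(raw_scores.reshape(-1, 1), y_holdout)
sigmoid_calibrated = platt.predict_proba(raw_scores.reshape(-1, 1))[:, 1]

# Isotonic calibration: a monotone, non-decreasing step function
iso = IsotonicRegression(out_of_bounds="clip")
iso.fit(raw_scores, y_holdout)
isotonic_calibrated = iso.predict(raw_scores)
```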

In the documentation [3], the advice is to use isotonic regression only if the dataset consists of more than 1,000 samples, since the step function tends to overfit easily.
In summary, logistic regression and isotonic regression enhance probability estimates due to their mathematical properties, interpretability, flexibility, and calibration capabilities, improving overall model performance.
The other important parameter is the validation strategy. It can be set to an integer or to the string prefit. If one already has a trained model, one can use prefit and fit the calibration on a hold-out set. Otherwise, one can use cross-validation to fit both the baseline model and the calibration on the whole training data. [4]
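With an already fitted rf and a hypothetical hold-out set X_holdout, y_holdout, the prefit variant could look roughly like this (parameter names as documented in [4]; details may differ between scikit-learn versions):

```python
from sklearn.calibration import CalibratedClassifierCV

# rf has already been fitted on the training data;
# the calibration mapping is learned on a separate hold-out set
calibrated_prefit = CalibratedClassifierCV(rf, method="sigmoid", cv="prefit")
calibrated_prefit.fit(X_holdout, y_holdout)

calibrated_scores = calibrated_prefit.predict_proba(X_test)[:, 1]
```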
Calibration
With this knowledge, the Data Scientist is now able to create a new model and evaluate its performance.
Now one can stack a calibrated classifier on top of the RF and check out the results. Here, scikit-learn’s cross-validated calibration classifier with an isotonic regression is used. In this scenario, the cross-validation consists of five folds.
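A sketch of this setup (the exact code for the post is in [5]):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# A fresh RF as base estimator; model fitting and calibration are cross-validated together
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200, random_state=42),
    method="isotonic",
    cv=5,  # five-fold cross-validation
)
calibrated_rf.fit(X_train, y_train)

calibrated_churn_scores = calibrated_rf.predict_proba(X_test)[:, 1]
```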

At first glance, it looks like a promising model. But what story do the metrics tell?

Overall, the metrics have improved, and the curve matches the perfectly calibrated dotted line of the plot much better. The Data Scientist still faces some overconfidence in the highest bins, but the model has gotten closer. Comparing this plot with the one above, the bins up to 0.8 are now closer to perfect calibration than before.
Now that the Data Scientist has trained a calibrated Random Forest model, the business department can be satisfied with a less overconfident model.
Summary & Outlook
This was a quick and hopefully helpful introduction to how to calibrate a model. One should be aware that this is only a straightforward example to explain the problem. A real-world application is more complex, and data variability can increase the effort.
For a good, production-ready model pipeline, one needs to preprocess the data after the train-test split and within the pipeline, optimize the model parameters, and train and compare multiple models.
For calibration, one needs to keep in mind that the bin count and the underlying distribution matter for a meaningful analysis of the metrics. Also, all of the parameters discussed before can change the probabilities and make the model more or less confident. Whatever the model’s confidence, the business case is most important, and one needs to decide where to focus:
- small probabilities,
- high probabilities,
- or overall consistency.
In this example, it would be the decision where one should spend more money on preventing a customer from churning and where to rather save the money for retention measures.
Corresponding code for this example can be found in [5].
Happy Hacking!
References
[1] https://www.kaggle.com/datasets/adammaus/predicting-churn-for-bank-customers
[2] https://github.com/scikit-learn/scikit-learn/discussions/21785
[3] https://scikit-learn.org/stable/modules/calibration.html
[4] https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
[5] https://github.com/datadrivers/blogpost-calibration-example