Fine-tuning of probabilities: An example of model calibration



In this blog post, we will explore the pitfalls of using probability models directly in data analysis. Specifically, we examine the practice of taking model outputs as expected probabilities and discuss the problems this can cause, illustrated by a bank churn example. We will also highlight the importance of looking at the distribution of predicted probabilities, particularly when it comes to post-processing them. Finally, we will suggest potential solutions to these problems to help ensure accurate and meaningful analyses.

What drives churn?

So, let’s get started… Imagine being a Data Scientist working in a bank that faces churn. The business department would like to approach the topic of churn prediction analytically and is also willing to spend money on it. By predicting which customers are likely to churn, the bank can take proactive measures, such as offering personalized incentives or improving the quality of service. [1]

So what does a Data Scientist do? Ask for data and build a model… As easy as that, right?

One of the main problems with churn is data imbalance. This means that the number of customers who churn (the minority class) is much smaller than the number of customers who do not churn (the majority class). For example, in a dataset with 10,000 customers, there may be only approx. 2,000 customers who have churned. This makes it difficult for a machine learning algorithm to accurately predict churn, as it can be biased towards the majority class. To identify this bias, one can measure the calibration of the final model, an approach which will be introduced later.

Here one can see how imbalanced the churn class is in the dataset the department handed to the Data Scientist.

Figure: normalized value counts of churned customers
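As a minimal sketch of this check, assuming the Kaggle dataset [1] is loaded into a pandas DataFrame with its binary Exited column as the churn label:

```python
import pandas as pd

# Load the churn dataset (file name and column name follow the Kaggle dataset [1]).
df = pd.read_csv("Churn_Modelling.csv")

# Normalized value counts show the share of churned (1) vs. retained (0) customers:
# roughly 80% non-churn vs. 20% churn for this dataset.
print(df["Exited"].value_counts(normalize=True))
```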

The Data Scientist digs through his toolbox and comes up with a Random Forest model (RF) which is able to predict whether a customer is likely to churn or not. Luckily, the Random Forest model provides a predict_proba() method. With this model, the Data Scientist can now run inference and send individual churn predictions over to the department.
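A minimal sketch of this baseline, assuming a numeric feature matrix X and the binary churn target y have already been prepared from the dataset:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# X, y: preprocessed features and binary churn target (assumed to exist).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline Random Forest classifier.
rf = RandomForestClassifier(n_estimators=300, random_state=42)
rf.fit(X_train, y_train)

# "Probabilities" for the positive (churn) class.
churn_scores = rf.predict_proba(X_test)[:, 1]
```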

This little story could conclude here. The department takes the churn predictions and applies its business logic to prevent customer churn.

Well, but …

… wait a moment! It’s a trap.

The department received churn predictions from the Data Scientist, which were used to determine incentives. However, the controlling department has observed a significant increase in costs. Upon further investigation, they have discovered that the mean churn probability in the predictions was higher than the actual churn rates in the past.

Therefore, the Data Scientist dived deeper into the RF and found an issue with the predict_proba() method. What happened? There is one wrong assumption in the process: a predicted value between zero and one is not automatically a probability.

In reality, these values between zero and one do not represent genuine probabilities. This is due to the following reasons:

  1. Lack of calibration: the values returned by predict_proba() are not guaranteed to match the observed event frequencies, unlike true probabilities.
  2. Bias and variance: the model’s predictions can be biased and may not accurately reflect the data distribution.
  3. The outcome is strongly influenced by the model architecture itself. Supposed probabilities of a neural network may behave completely differently from those of a random forest. This can be the case even if the class predictions (e.g. with a threshold of 0.5) are the same for all data points.

Visual Inspection

Below, one can see the so-called calibration plot of the baseline model. Say that the predictions are divided into ten bins of equal width. The x-axis represents the mean predicted probability of each bin, whereas the y-axis represents the actual fraction of churns in that bin. In an ideal world, all bins would lie on the dotted line: the mean predicted probability would match the actual churn probability.

Figure: calibration plot of the baseline model with equal-width bins (x-axis: mean predicted probability)
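Such a plot can be produced with scikit-learn’s calibration_curve; a sketch that reuses the rf model and the hold-out data assumed above:

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

# Ten equal-width bins (strategy="uniform" is the default).
prob_true, prob_pred = calibration_curve(
    y_test, rf.predict_proba(X_test)[:, 1], n_bins=10, strategy="uniform"
)

plt.plot(prob_pred, prob_true, marker="o", label="Random Forest")
plt.plot([0, 1], [0, 1], linestyle="--", label="perfectly calibrated")
plt.xlabel("mean predicted probability")
plt.ylabel("fraction of churned customers")
plt.legend()
plt.show()
```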

One can directly see that the model tends to be overconfident. In most cases, the actual churn rate is lower than the predicted probability. Well, this fits the feedback which brought the Data Scientist here.

Note that the training dataset is imbalanced. Due to this imbalance, the predictions are skewed, with most predictions being smaller than 0.2. To take this into account, bins can be defined by quantiles instead of equal width. This means that each bin contains the same number of customers (1/10 of them).

Figure: calibration plot of the baseline model with quantile bins (x-axis: mean predicted probability)
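Continuing the sketch above, the quantile-based variant only changes the binning strategy:

```python
# Same plot, but each of the ten bins now contains the same number of customers.
prob_true, prob_pred = calibration_curve(
    y_test, rf.predict_proba(X_test)[:, 1], n_bins=10, strategy="quantile"
)
```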

The last three bins are where the model breaks down. A Data Scientist with an analytical view would like to measure these deviations. Luckily, there are some metrics which help out.

Metrics

Expected Calibration Error

The Expected Calibration Error (ECE) is a metric that measures the difference between a model’s predicted probabilities and the true probabilities of the events it is predicting. To calculate the ECE, the predicted probabilities are first binned into a set of intervals. Then, the average difference between the predicted probabilities and the true probabilities is calculated for each interval. The ECE is then calculated as the weighted average of these average differences, where the weights are the number of samples in each interval. [2]

ECE = \sum_{m=1}^{M} \frac{|B_m|}{n} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|

where B_m denotes the m-th bin, n the total number of samples, acc(B_m) the observed fraction of positives in the bin, and conf(B_m) the mean predicted probability in the bin.
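A minimal NumPy sketch of this calculation (the function name and the equal-width binning are assumptions for illustration, not the reference implementation from [2]):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average gap between predicted probability and observed frequency."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, bins[1:-1])  # assign each prediction to one of n_bins bins
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        acc = y_true[mask].mean()    # observed fraction of positives in the bin
        conf = y_prob[mask].mean()   # mean predicted probability in the bin
        ece += (mask.sum() / len(y_prob)) * abs(acc - conf)
    return ece
```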

Maximum Calibration Error

The Maximum Calibration Error (MCE) is a similar metric that is particularly relevant for imbalanced data. It is calculated as the maximum difference between the mean predicted probability and the observed fraction of positives across all bins. [2]

MCE = \max_{m \in \{1, \dots, M\}} \left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|
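A matching sketch, again with assumed equal-width binning, which keeps the largest gap instead of the weighted average:

```python
import numpy as np

def maximum_calibration_error(y_true, y_prob, n_bins=10):
    """Largest gap between predicted probability and observed frequency over all bins."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, bins[1:-1])
    gaps = [
        abs(y_true[bin_ids == b].mean() - y_prob[bin_ids == b].mean())
        for b in range(n_bins)
        if (bin_ids == b).any()
    ]
    return max(gaps)
```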

Relative Maximum Calibration Error

Here we would like to present a new metric called Relative Maximum Calibration Error (rel MCE). It is a variant of the MCE that normalizes the maximum calibration error by the mean predicted value. This normalization allows for a more meaningful comparison of calibration errors across different models or datasets, since it takes into account the overall level of confidence in the predictions.

Specifically, this metric calculates the relative distance between the observed fraction of positives and the mean predicted value in each bin, and takes the maximum of this relative difference as the final score. The relative difference is simply the absolute difference divided by the mean predicted value, which ensures that the score is normalized by the scale of the predictions.

\mathrm{rel\,MCE} = \max_{m \in \{1, \dots, M\}} \frac{\left| \mathrm{acc}(B_m) - \mathrm{conf}(B_m) \right|}{\mathrm{conf}(B_m)}
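And a sketch of the rel MCE in the same style; the small epsilon guard against bins with a mean prediction of zero is an implementation assumption:

```python
import numpy as np

def relative_maximum_calibration_error(y_true, y_prob, n_bins=10):
    """Largest per-bin calibration gap, normalized by the bin's mean predicted value."""
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(y_prob, bins[1:-1])
    rel_gaps = []
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue
        conf = y_prob[mask].mean()
        acc = y_true[mask].mean()
        rel_gaps.append(abs(acc - conf) / max(conf, 1e-12))
    return max(rel_gaps)
```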

Metrics summary

Overall, there are multiple metrics that provide a useful way to evaluate the calibration performance of a binary classifier, and they can be particularly useful when comparing models with different levels of predicted probabilities.

How to Calibrate?

Luckily, there is a solution for a better fit. To address these concerns and refine prediction accuracy, Data Scientists often employ calibration techniques. Calibration aims to adjust the initial output of predict_proba() to better align with true probabilities. One common approach is to use scikit-learn, which provides tools for creating calibrated probabilities through isotonic regression or logistic regression. These methods apply mathematical transformations to the initial predictions, enhancing the reliability of the values as probability estimates.

In general, one can use any kind of transformation, but typically a transformation should have at least the following properties to produce more accurate probabilities:

  • Scaling: smoothly mapping values to probabilities between 0 and 1, making the transformation suitable for modeling probabilities.
  • Monotonicity: the transformation should preserve the order of predicted values, ensuring higher values correspond to higher probabilities and maintaining the underlying probability relationships.
  • Interpretability: transformations should offer interpretability to understand the relationship between the outcome of the baseline model and the resulting probabilities.
  • Flexibility: transformations should handle linear and non-linear relationships between features and the target variable, adapting to various data patterns.

Given the scikit-learn utilities, one can calibrate a model via a sigmoid function or an isotonic regression. The first one uses a logistic regression, typically fitted with a log loss, and the second one a step-function regression. As discussed before, one can use any kind of function, but empirically these two methods work pretty well.
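In scikit-learn, both options are selected via the method parameter of CalibratedClassifierCV; a minimal sketch, reusing the rf baseline assumed earlier (the first argument is called estimator in recent versions, base_estimator in older ones):

```python
from sklearn.calibration import CalibratedClassifierCV

# Platt scaling: a sigmoid (logistic regression) fitted on the model outputs.
rf_sigmoid = CalibratedClassifierCV(rf, method="sigmoid", cv=5)

# Isotonic regression: a non-parametric, monotone step function.
rf_isotonic = CalibratedClassifierCV(rf, method="isotonic", cv=5)
```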

Figure: sigmoid function and isotonic regression

In the documentation [3], the advice is given to use isotonic regression only if the dataset consists of more than 1,000 samples, since the step function tends to overfit easily otherwise.

In summary, logistic regression and isotonic regression enhance probability estimates due to their mathematical properties, interpretability, flexibility, and calibration capabilities, improving overall model performance.

The other parameter is the validation method. This parameter can be set to an integer value or to the string "prefit". If one has an already trained model, one can use the prefit value and fit the calibration on a hold-out set. Otherwise, one can cross-validate the fitting of the baseline model as well as the calibration over the whole training data. [4]
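A sketch of the two variants; prefit_rf, X_holdout and y_holdout are assumed names for an already trained model and a hold-out split:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# Variant 1: cross-validated fitting of both the baseline model and the calibration.
calibrated_cv = CalibratedClassifierCV(
    RandomForestClassifier(random_state=42), method="isotonic", cv=5
)
calibrated_cv.fit(X_train, y_train)

# Variant 2: calibrate an already trained model on a separate hold-out set (cv="prefit", see [4]).
calibrated_prefit = CalibratedClassifierCV(prefit_rf, method="isotonic", cv="prefit")
calibrated_prefit.fit(X_holdout, y_holdout)
```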

Calibration

With this knowledge the Data Scientist is now able to create a new model and evaluate the performance.

Now one can stack a calibrated classifier on top of the RF and check out the results. Here, the scikit-learn cross-validation classifier for calibration with an isotonic regression is used. More details about the cross-validation can be found in the documentation [4]. In this scenario, the k-fold consists of five folds.
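Putting it together, a sketch that reuses the train/test split, the rf baseline, and the metric functions sketched above:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

# Calibrate the Random Forest with isotonic regression and 5-fold cross-validation.
rf_calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=300, random_state=42),
    method="isotonic",
    cv=5,
)
rf_calibrated.fit(X_train, y_train)

# Evaluate both models with the metric sketches defined earlier.
for name, model in [("Random Forest", rf), ("RF calibrated", rf_calibrated)]:
    proba = model.predict_proba(X_test)[:, 1]
    print(
        name,
        expected_calibration_error(y_test, proba),
        maximum_calibration_error(y_test, proba),
        relative_maximum_calibration_error(y_test, proba),
    )
```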

Figure: calibration plot of the calibrated model (x-axis: mean predicted probability)

At first glance it looks like a promising model. What story do the metrics tell?

Table: calibration metrics for the Random Forest and the calibrated RF

From what one can see, the overall metrics are all better and the curve is closer to the perfectly calibrated dotted line of the plot. The Data Scientist still faces some overconfidence within the highest bins, but the calibrated model got closer. If one compares this plot with the one above, one will see that the bins up to 0.8 are now closer to perfect calibration than before.

Now, after the Data Scientist trained a calibrated Random Forest model, the business department can be satisfied with a less overconfident model.

Summary & Outlook

This was a quick and hopefully helpful introduction on how to calibrate a model. One should be aware that this is only a straightforward example to explain the problem. A real-world application is more complex, and data variability can increase the effort.

For a good and productive model pipeline, one needs to preprocess the data after the train-test split and within the pipeline, optimize the model parameters, and also train and compare multiple models.
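As a rough sketch of such a setup (the column names and the parameter grid are purely illustrative assumptions, not taken from this example):

```python
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Preprocessing lives inside the pipeline, so it is fitted on the training folds only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["CreditScore", "Age", "Balance"]),     # hypothetical numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Geography"]),   # hypothetical categorical column
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("model", RandomForestClassifier(random_state=42)),
])

search = GridSearchCV(
    pipeline,
    param_grid={"model__n_estimators": [100, 300], "model__max_depth": [None, 10]},
    scoring="neg_brier_score",  # the Brier score also rewards well-calibrated probabilities
    cv=5,
)
search.fit(X_train, y_train)
```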

For calibration, one needs to take into account that the bin count and the underlying distribution are important for a good analysis of the metrics. Also, all the parameters discussed before can change the probabilities and make the model more confident. Whether the model is confident or not, the business case is most important, and one needs to decide where to focus:

  • small probabilities,
  • high probabilities,
  • or the aim to be consistent in general.

In this example, it would be the decision whether to spend more money on preventing customer churn or to save the money for a retention case.

Corresponding code for this example can be found in [5].

Happy Hacking!

References

[1] https://www.kaggle.com/datasets/adammaus/predicting-churn-for-bank-customers

[2] https://github.com/scikit-learn/scikit-learn/discussions/21785

[3] https://scikit-learn.org/stable/modules/calibration.html

[4] https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html

[5] https://github.com/datadrivers/blogpost-calibration-example