Introduction to Bayesian inference with PyStan – Part II



This is the continuation of the first part of the blog post on this topic. If you are completely new to the topic of Bayesian inference, please don’t forget to start with the first part, which introduced Bayes’ Theorem.

This second part focuses on examples of applying Bayes’ Theorem to data-analytical problems.

A simple example

Imagine we want to estimate the fairness of a coin by assessing a number of coin tosses. The parameter we want to estimate is, in this case, the probability that the coin lands on its head (assuming that the tossing process is unbiased).

Obviously, a fair coin would be characterized by

$$\theta = P(\text{heads}) = 0.5$$

The (binomial) Likelihood

The data we observe consists of a number of “Heads” and “Tails”. Let’s assume we observe N_H heads out of N coin tosses. In this case (assuming that all coin tosses are independent of each other and identically distributed (i.i.d.)), the probability of the data given the parameter is given by a binomial likelihood:

$$P(D \mid \theta) = \binom{N}{N_H}\, \theta^{N_H} (1-\theta)^{N-N_H}$$

The prior distribution

In order to have a fully specified Bayesian model, we need to specify a prior distribution over all parameters of the model. In this case, we have only one free parameter: theta. The conventional choice here is a beta distribution. The beta distribution is defined for input values from 0 to 1 – and is thus suitable for a probability parameter. Another reason for choosing the beta distribution is the fact that it is a conjugate distribution for the binomial likelihood – which will make calculations easier.

Choosing a conjugate distribution as a prior distribution will lead to a posterior distribution which has the same form as the prior distribution. The conjugate prior distribution for a binomial likelihood is the beta distribution [1]:

$$P(\theta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha,\beta)}$$

The normalization constant B is the so-called Beta function.

The posterior distribution

Using Bayes’ Theorem, we obtain the following expression for calculating the posterior distribution:

$$P(\theta \mid D) = \frac{P(D \mid \theta)\, P(\theta)}{P(D)}$$

If we plug in the prior and the likelihood and ignore all normalizing constants, we obtain:

$$P(\theta \mid D) \propto \theta^{N_H+\alpha-1}(1-\theta)^{N-N_H+\beta-1}$$

The posterior distribution thus has exactly the same form as the prior distribution – which was one of the reasons we chose the Beta distribution in the first place. We know

  • a) that the posterior distribution is normalized
  • b) that the posterior distribution is a Beta distribution
  • c) the normalizing factor for a Beta distribution (the Beta function in the denominator)

Thus, the posterior is given by:

$$P(\theta \mid D) = \frac{\theta^{\alpha'-1}(1-\theta)^{\beta'-1}}{B(\alpha',\beta')}$$

with:

$$\alpha' = \alpha + N_H, \qquad \beta' = \beta + (N - N_H)$$

Going from the prior to the posterior in this case simply amounts to updating the parameters of the Beta distribution. The initial parameter alpha is updated by adding the number of “positive” observations (the number of heads). The parameter beta is updated by adding the number of “negative” observations (total coin tosses minus the number of heads).
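To make this update rule concrete, here is a minimal sketch in Python (the use of scipy and the variable names are my own choices, not taken from the original post):

```python
from scipy import stats

# Prior hyperparameters of the Beta distribution
alpha_prior, beta_prior = 2.0, 2.0

# Observed data: N_heads heads out of N tosses
N, N_heads = 3, 3

# Conjugate update: add "positive" counts to alpha, "negative" counts to beta
alpha_post = alpha_prior + N_heads
beta_post = beta_prior + (N - N_heads)

posterior = stats.beta(alpha_post, beta_post)
print(posterior.mean(), posterior.interval(0.8))  # point estimate and an 80% interval
```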

This leads to the following interpretation: the parameter alpha of the prior distribution, minus one, can be regarded as the number of initial positive (“heads”) observations. Likewise, beta minus one can be interpreted as the initial number of negative (“tails”) observations.

Upon observation of actual data, the parameters get updated accordingly (the number of “heads” observations gets added to alpha, the number of “tails” observations gets added to beta).

For alpha=1 and beta=1, the Beta distribution is identical to the continuous uniform distribution on the interval [0,1]. This corresponds to 0 prior observations of “heads” and 0 prior observations of “tails”.

Plotting the prior, the posterior and the Likelihood

Let’s look at a visualization of the prior distribution, the likelihood function and the posterior distribution for this case:

[Figure: prior distribution, likelihood function and posterior distribution for the coin-toss example]

I have chosen a prior distribution with parameters alpha=2 and beta=2. As a reminder: this can be interpreted as corresponding to the prior observation of 1 “heads” and 1 “tails”. It is therefore symmetric around 0.5 and has a probability density of 0 at theta=0 and theta=1.0. This of course makes sense – after all, 1 “heads” has already been observed, so the probability of heads cannot be 0 anymore. The same goes for the density at theta=1.0 (“heads only”): it has to be 0 because 1 “tails” has already been observed.

The likelihood function is plotted for N=3 total coin tosses, all of which – N_heads=3 – turned out “heads”. The value of the likelihood function at theta=0 is again zero – reflecting the fact that it’s impossible to observe 3 “heads” if the underlying probability of “heads” is zero. The highest likelihood is given at theta=1.0 – reflecting the fact that the value of theta that makes the observation of 3 heads in 3 coin tosses most probable is 1.0. As a further note: the likelihood function is not a (normalized) probability distribution, unlike the prior and the posterior distribution – so don’t be surprised if “integrating the curve in your head” doesn’t result in a value of 1 for the area.

The posterior distribution is proportional to the product of the prior distribution and the likelihood function. It is in a sense a compromise between both. The prior distribution had its probability mass spread out symmetrically around the value 0.5. Due to the observation of 3 “heads” in 3 coin tosses, the posterior distribution has its probability mass strongly shifted towards higher values. Typically – as is also the case here – the posterior distribution is significantly narrower than the prior distribution. The reason is simple: it is based on the observation of additional data and is thus “more certain”.
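A figure like the one above can be reproduced with a few lines of Python – a minimal sketch using scipy and matplotlib (the styling details of the original figure are omitted):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

theta = np.linspace(0, 1, 500)

# Prior Beta(2, 2); data: 3 heads out of 3 tosses; posterior Beta(5, 2)
plt.plot(theta, stats.beta(2, 2).pdf(theta), label="prior")
plt.plot(theta, stats.binom.pmf(3, 3, theta), label="likelihood")
plt.plot(theta, stats.beta(5, 2).pdf(theta), label="posterior")
plt.xlabel("theta")
plt.legend()
plt.show()
```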

There are multiple ways of reporting the result of a Bayesian analysis. The three black points in the posterior distribution show (from left to right) the mean, the median and the mode of the posterior distribution. It certainly doesn’t hurt to report one or more of these point estimates. Yet, it should be mentioned that the spirit of Bayesian data analysis is exactly not to rely too strongly on point estimates. None of the aforementioned point estimates (median, mean or mode) provides the full picture. The entire information is incorporated in the distribution.

Another way of reporting results of a Bayesian data analysis is by communicating the boundaries of Credible Intervals. These are the Bayesian equivalent of Confidence Intervals (in fact, confidence intervals are often misinterpreted as if they were Bayesian credible intervals). An 80% credible interval contains the true value of a parameter with a probability of 80%. There are different ways of constructing a credible interval. Depicted in the figure above is a credible interval from the 10th to the 90th percentile of the posterior distribution (area filled in blue). The area under the posterior curve to the left of the blue region is identical to the area to its right. The area filled in red is the so-called highest density interval (HDI). It also contains 80% of the probability mass, but is the narrowest range that does so.
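Both interval types are easy to compute for our Beta posterior. Here is a small sketch – the hdi helper is my own construction (a grid search over candidate intervals), not code from the original post:

```python
import numpy as np
from scipy import stats

posterior = stats.beta(5, 2)

# Equal-tailed 80% credible interval: 10th to 90th percentile
print(posterior.ppf([0.10, 0.90]))

# 80% highest density interval: the narrowest interval containing 80% of the mass
def hdi(dist, mass=0.8, grid=10_000):
    lo = np.linspace(0.0, 1.0 - mass, grid)            # candidate lower quantiles
    bounds = dist.ppf(np.column_stack([lo, lo + mass]))
    widths = bounds[:, 1] - bounds[:, 0]
    return bounds[np.argmin(widths)]                   # pick the narrowest interval

print(hdi(posterior))
```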

Inference with (Py)Stan

We obtained the posterior distribution in this case analytically – making use of the “cheap” trick of a conjugate prior. So, we have no actual need for PyStan in this case. But to demonstrate the functionality of PyStan, we will do it anyways 😉 .

The program Stan has its own modelling language. Model specifications in Stan are usually saved in a *.stan file. It contains several sections: the input data is specified in the data section, the parameters are specified in the parameters section, and the model (likelihood function and prior distributions) is specified in the model section. There are more sections that provide, for example, the functionality to specify user-defined functions or the possibility to directly generate simulations based on the inferred parameters. We don’t necessarily need these additional sections here.

/model/bin_lik_beta_prior.stan (the code can be found in the GitHub repository: https://github.com/DataInsightsGmbH/pystan_examples):

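The listing was shown as an image in the original post; the following is a sketch of what the model file plausibly contains, reconstructed from the description below (the exact listing is in the repository):

```stan
data {
  int<lower=0> N_heads;
  int<lower=0> N_tails;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  theta ~ beta(2, 2);                            // prior
  N_heads ~ binomial(N_heads + N_tails, theta);  // likelihood
}
```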

The data section in the above listing tells Stan the names (N_heads, N_tails), types (integers) and potentially constraints (here: non-negativity) of the data.

If you call Stan via PyStan from Python, you can pass the data in a dictionary, where the names of the keys have to be identical to the names specified in the *.stan model file.

The parameters section tells Stan the names, types and constraints of the parameters. In this case, we call the probability of heads theta and define it to be a real-valued (continuous) parameter between 0 and 1.

The model section serves to specify the posterior distribution. The syntax might need some getting used to. The first line specifies the prior distribution for the parameter theta: a beta distribution with (hyper)parameters alpha=2 and beta=2. The second line specifies the likelihood function: a binomial distribution with probability for heads theta, total number of observations N_heads + N_tails, and N_heads actual observations of “heads”.

The posterior distribution can be specified in a slightly different – and maybe easier to read – way. In order to motivate it, let’s take the logarithm of the posterior distribution:

$$\log P(\theta \mid D) = \log P(D \mid \theta) + \log P(\theta) - \log P(D)$$

The last term (the marginal likelihood) does not depend on the parameter theta (it’s integrated out). It can therefore be regarded as a constant. We can thus write:

$$\log P(\theta \mid D) = \log P(D \mid \theta) + \log P(\theta) + \text{const.}$$

Meaning: the logarithm of the posterior distribution is given by the sum of the logarithms of the likelihood function and the prior. In Stan code, this is specified in the following way:

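Again a sketch reconstructed from the description – only the model block changes compared to the listing above:

```stan
model {
  target += beta_lpdf(theta | 2, 2);                            // log prior
  target += binomial_lpmf(N_heads | N_heads + N_tails, theta);  // log likelihood
}
```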

Again, both ways result in the specification of the same model. Yet, I would imagine that the second version is probably more intuitive for people who haven’t worked with Stan or similar software before.

What remains to be done is:

  1. load the model (into a string)
  2. compile the model
  3. pass the data
  4. let Stan create samples for us

The code for doing this is shown in the listing below:

conduct_inference.py: (excerpt)

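A sketch of the excerpt, following the four steps above (PyStan 2 API; the data values correspond to the coin-toss example):

```python
import pystan

# 1. Load the model specification into a string
with open("model/bin_lik_beta_prior.stan") as f:
    model_code = f.read()

# 2. Compile the model (this can take a couple of minutes)
sm = pystan.StanModel(model_code=model_code)

# 3. Pass the data as a dictionary; keys must match the names in the .stan file
data = {"N_heads": 3, "N_tails": 0}

# 4. Let Stan create samples
fit = sm.sampling(data=data, iter=20000, warmup=10000, chains=1)
print(fit)
```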

We first open the file and read the file’s contents (the complete model specification) into a string.

Afterwards, pystan.StanModel is called with the model (string) as a parameter. Stan will subsequently compile the model (into C++ code). This can usually take up to a couple of minutes.

The object returned by pystan.StanModel has a method called sampling that can be used to actually create the samples. The data should be passed as a Python dictionary. Additionally, the number of iterations and the number of chains (for simplicity’s sake, here 1) can be specified. The usage of multiple chains (= sequences of samples) is usually recommended, because significant differences in the statistics of samples from different chains can be indicative of problems with the sampling.

The duration of the sampling can vary strongly with model complexity and the number of samples. Here, the generation of 20 000 samples (the first 10 000 of which will be discarded) took around 0.12 s. Obviously, the inference of more complex models can take significantly longer. In general, one should take as many samples as possible (usually constrained by computational resources / time). The more samples we obtain, the more precisely any statistic we would like to obtain (mean, standard deviation, …) can be estimated from the samples.

[Stan output: summary statistics for the inferred parameter and the log posterior]

Let’s take another look at Stan’s output above: it tells us that 20 000 samples have been generated, the first 10k of which were ignored (warmup = 10000), and every sample afterwards is retained (thin = 1). In addition, Stan provides us with summary statistics about the inferred parameter and the value of the log posterior (up to some constant). For example, the mean, the standard error of the mean, the standard deviation and a number of quantiles are provided for the parameter and the log posterior. In addition, the number of effective samples n_eff is specified.

The samples that Stan creates are correlated, and thus the number of “effective” / independent samples is usually smaller than the number of samples (here: n_eff = 4151, number of samples = 10000). This correlation of samples is typical for MCMC methods in general (they don’t create independent samples) – but it is usually not too problematic if the effective sample size is still “high” (in the thousands). The Rhat statistic is another convergence diagnostic; values significantly above 1.0 are a cause for concern.

So, now that Stan did the sampling for us, we somehow have to make the samples accessible from within Python:

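In PyStan 2 this is essentially a one-liner (the variable name is my own choice):

```python
# Extract the post-warmup samples for theta as a numpy array
theta_samples = fit.extract()["theta"]
```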

Afterwards, we can simply work with the numpy array (or a Python list, if we prefer) as desired.

The following graph illustrates “the magic”: we plotted the 10 000 samples in the left panel. The right panel shows the distribution of the samples – superimposed as a black line is a beta distribution with parameters alpha=5 and beta=2 (the posterior distribution we calculated for N_heads = 3, N_tails = 0 and a beta prior with alpha=2 and beta=2).
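The right panel can be reproduced along these lines (a sketch, assuming theta_samples from above):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Histogram of the samples vs. the analytically derived posterior Beta(5, 2)
plt.hist(theta_samples, bins=50, density=True, alpha=0.5)
x = np.linspace(0, 1, 500)
plt.plot(x, stats.beta(5, 2).pdf(x), "k-")
plt.xlabel("theta")
plt.show()
```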

So, that’s exactly what sampling with Stan is about: we specify a likelihood function and a prior over all parameters, and obtain samples which are distributed according to the posterior distribution.

[Figure: left – samples for theta; right – distribution of the samples with the analytic posterior superimposed]

Bayesian Logistic Regression

Let’s look at an example with a little bit more “real-world” appeal. Let’s assume that we have data on a number of students: we know how long they studied for a particular exam and whether they passed or not (this example is borrowed from the Wikipedia page for “Logistic Regression” – we will use it to illustrate the Bayesian way of “fitting” a logistic regression model).

We will first simulate some data according to a logistic regression model and subsequently use Stan to do the inference for us.

Logistic regression is a member of a class of statistical models called Generalized Linear Models.

A linear predictor function is mapped to the interval between 0 and 1 (we are dealing with probabilities) by being transformed by a logistic function, sigma(z) = 1 / (1 + exp(-z)). The outcome variable is assumed to be Bernoulli-distributed (being either 0 or 1; the underlying probability is given by the transformed linear predictor).

Let’s first use Python to simulate some test data. We first sample the preparation times for the exam from a continuous uniform distribution (minimum 1.0 hour, maximum 12.0 hours). The probability of passing is subsequently calculated by:

  1. linearly transforming the “hours” (a*x+b) – students that studied longer are supposed to have increased chances of passing the exam
  2. applying a logistic function to map the results of the linear transform to values from 0 to 1
  3. the outcome “passed” is finally obtained by sampling from a Bernoulli distribution (binomial with n=1)

The code for creating the test data is shown below:

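A sketch of the simulation – note that the true values of the slope and intercept are my own guesses, chosen only to be consistent with the posterior ranges reported further below, not values taken from the original listing:

```python
import numpy as np

rng = np.random.default_rng(42)

N = 15                                   # number of students
hours = rng.uniform(1.0, 12.0, size=N)   # 1. preparation times

a_true, b_true = 0.65, -4.25             # assumed slope and intercept
logit = a_true * hours + b_true          # linear predictor a*x + b
p_pass = 1.0 / (1.0 + np.exp(-logit))    # 2. logistic function

passed = rng.binomial(1, p_pass)         # 3. Bernoulli outcome (binomial, n=1)
```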

The simulated test data (here for 15 students) as well as the underlying (studying-time dependent) probability of passing is shown in the figure below:

[Figure: studying time and pass/fail outcomes of 15 simulated students, with the underlying pass probability]

The Stan model for doing inference on this model is given below. We pass the total number of data points N (the number of students) and, for each student, the number of hours they studied as well as whether they passed. In this case, we have two parameters (for the linear predictor: the intercept beta and the slope alpha).

In this case, we haven’t specified prior distributions for the parameters. This implicitly gives the parameters (improper) uniform distributions. The MAP of the posterior distribution will then be almost (apart from numerical issues / sampling error) identical to the Maximum Likelihood Estimate.

The specification of the likelihood might again need some getting used to. Basically, all the transformations (the logistic function as well as the Bernoulli distribution) are packed into one line 😉 :

Stan model for logistic regression with one predictor (see e.g. the Stan manual):

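A sketch of the model, reconstructed from the description above – bernoulli_logit applies the logistic transform and the Bernoulli likelihood in one go (the exact listing is in the repository):

```stan
data {
  int<lower=0> N;
  vector[N] hours;
  int<lower=0, upper=1> passed[N];
}
parameters {
  real alpha;  // slope
  real beta;   // intercept
}
model {
  // logistic function and Bernoulli distribution packed into one line
  passed ~ bernoulli_logit(alpha * hours + beta);
}
```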

The code for compiling the model and sampling from it is very similar to the code from the first example.

We again call the StanModel function and the sampling method:

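Again a sketch (the file name is hypothetical; the pattern is the same as in the first example):

```python
import pystan

with open("model/logistic_regression.stan") as f:  # hypothetical file name
    model_code = f.read()

sm = pystan.StanModel(model_code=model_code)
fit = sm.sampling(data={"N": N, "hours": hours, "passed": passed},
                  iter=20000, warmup=10000, chains=1)

samples = fit.extract()
alpha_samples, beta_samples = samples["alpha"], samples["beta"]
```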

As already mentioned, we obtain samples for two different parameters this time (the slope and the intercept of the linear predictor). The samples of these parameters are displayed in the left panel below. We can see a couple of thousand samples for the parameters alpha (x-axis) and beta (y-axis). Each pair of parameters describes a logistic curve. To illustrate this further, 4 points have been picked (basically at random). The corresponding 4 logistic curves (calculated by plugging in the values for alpha and beta of that sample) are shown in the middle panel.

The samples are distributed according to the posterior distribution (somewhat centered around alpha = 0.6 .. 0.7 and beta = -4.0 .. -4.5). Each sample corresponds to one logistic curve. The distribution of the points therefore also determines the distribution of the logistic curves. We can calculate the “mean curve” (black line) and the 80% credible interval (blue region) for the logistic curve – giving us an estimate of the curve as well as of the curve’s uncertainty.

[Figure: samples from the posterior distribution, example logistic curves, and the mean logistic curve with 80% credible region]

Congratulations, you have made it to the end (you didn’t skip any part, did you?). I hope this blog post served as an introduction to Bayesian statistics / Bayesian inference (with PyStan), and hopefully I could convince you to take Bayesian analysis into consideration as an alternative to MLE methods (if you haven’t already).

This post could obviously only scratch the surface of a lot of topics – e.g. the richness of models that you can infer with Stan: from Generalized Linear Models to time-series models (Facebook’s Prophet [2,3] uses Stan) to systems of Ordinary Differential Equations (interesting, among others, for pharmaceutical companies doing inference on so-called pharmacokinetic models [4]).

Also, topics like dealing with missing data in Bayesian data analysis, dealing with hierarchical data, or making decisions based on Bayesian inferences (Bayesian decision theory) would be interesting to eventually discuss further – maybe there will be a follow-up blog post in the future.

I hope you anyway got a taste of what Bayesian methods in general and Stan in particular have to offer.
Thank you again for reading and have a great day!

Links, References:

[1] https://en.wikipedia.org/wiki/Beta_distribution

[2] https://facebook.github.io/prophet/

[3] https://towardsdatascience.com/a-quick-start-of-time-series-forecasting-with-a-practical-example-using-fb-prophet-31c4447a2274