Introduction to Bayesian inference with PyStan – Part I



Introduction

About this blog post

This blog article is intended as a hands-on tutorial on how to conduct Bayesian inference. Bayesian inference is a method for learning the values of parameters in statistical models from data. It is a fully probabilistic approach – its outcomes are probability distributions. Another distinctive feature of Bayesian inference is the use of prior information in the analyses. Bayesian inference is not tied to one particular statistical model – in fact, a multitude of different statistical / ML models (be it Linear Regression, Logistic Regression, Artificial Neural Networks, …) can be "trained" in either a Bayesian or a non-Bayesian way. Bayesian inference can be summarized as being a method for learning from data.

This blog post consists of two parts. The first gives a very brief recapitulation of the history of Bayesian data analysis, as well as a few words on the differences between Bayesian methods and the more common Frequentist methods. It concludes with the derivation of Bayes' Theorem, an example demonstrating its application, and a short introduction to how it can be used to tackle data-analytical problems.

Python has been chosen as the programming language (R would arguably be the first alternative), and Stan (Python interface: PyStan) will be used as the tool for specifying Bayesian models and conducting the inference.

The main intention is to give someone who has not previously performed Bayesian inference on a dataset the means to do so – and hopefully also convincing arguments for why anyone would want to.

Short history

The roots of Bayesian statistics date back to the 18th century, when Reverend Thomas Bayes discovered/invented Bayes' Theorem. The famous French mathematician Pierre-Simon Laplace was one of the first proponents of Bayesian data analysis and developed the methods further as well as popularized them.

Especially in the early 20th century, Bayesian statistics was exposed to fierce opposition. One of the most prominent opponents of Bayesian statistics – or "inverse probability" – was the highly influential statistician Ronald Fisher. This strong rejection is presumably responsible for the relative demise of Bayesian statistics in the 20th century. Furthermore, Bayesian inference of real-world models with potentially hundreds or thousands of free parameters is computationally significantly more expensive than its alternatives, which almost certainly constituted another reason for the vast dominance of frequentist statistics in the past.

Apart from the tailwind of constantly increasing computational power, the advent of computational tools for conducting Bayesian inference (among the first were BUGS [1] and JAGS [2]) has, relatively recently, led to an increase in the popularity of Bayesian methods.

These tools for conducting Bayesian inference make heavy use of Markov Chain Monte Carlo (MCMC) methods [3] – methods that were mostly developed by scientists involved in the Manhattan project, in the years after the end of the project. Initially used to simulate physical systems, they were later applied in statistics – for example in Bayesian inference. One of the scientists strongly involved in the invention of MCMC methods was the Polish mathematician Stanislaw Ulam – after whom the probabilistic programming language Stan [4,5] was named. PyStan [6] is Stan's Python interface.

Bayesian methods are being used in a large variety of domains and applied to a diverse set of problems – from modelling the spread of infectious diseases [7] to analysing astrophysical data [8].

Bayesian Statistics vs. Frequentist Statistics

Frequentist statistics (sometimes also called Fisherian statistics after the aforementioned statistician), which makes use of Maximum Likelihood methods, has been the default choice in statistical inference for large parts of the 20th century.

Given a dataset, Maximum Likelihood Estimation (MLE) starts with a model for the probability / likelihood of the data. This model will have unknown parameters. In Maximum Likelihood Estimation, optimization methods are used to obtain the parameter values that maximize the likelihood of the data. Just to stress this point – we do not obtain the parameters that are most probable given the data, but instead the parameters that make the data most likely.
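To make this a bit more concrete, here is a minimal sketch (with made-up synthetic data) of MLE for a very simple case – estimating the mean and standard deviation of normally distributed data by minimizing the negative log-likelihood with scipy:

```python
# Minimal MLE sketch: find the parameter values that make the observed
# (here: synthetic) data most likely under a Normal model.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(42)
data = rng.normal(loc=5.0, scale=2.0, size=200)   # synthetic "observed" data

def neg_log_likelihood(params):
    mu, log_sigma = params                        # optimize log(sigma) so that sigma stays > 0
    return -np.sum(norm.logpdf(data, loc=mu, scale=np.exp(log_sigma)))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)                          # point estimates only, no distribution
```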

Similar to Maximum Likelihood methods, Bayesian methods start with a model for the likelihood of the observed data. Additionally, a Bayesian model requires the specification of a probability distribution over the parameters. This probability distribution can be thought of as the "belief" in the parameter values before any data (at least the data to be analyzed in a particular analysis) has been observed. It is usually called the prior distribution or a priori distribution. Bayes' Theorem is subsequently used to take the influence of the observed data into consideration – a distribution over the parameters before the data has been observed (the prior distribution) is transformed into a probability distribution over the parameters after the data has been observed (the posterior distribution).

The use of prior distributions can be seen as one of the strengths of Bayesian inference – for example, it provides regularization and thus "stabilizes" statistical inference. Many approaches to regularization in MLE (such as Lasso or Ridge regression) can be understood in a meaningful way when taking the Bayesian viewpoint (see e.g. [9]). Yet, one of the most prominent arguments against Bayesian inference is that prior distributions introduce subjectivity into the analysis – the data no longer speak for themselves, but the analysis is influenced by the analyst. The usual response of Bayesians to this argument is that the choice of the details of the likelihood function already influences the analysis. The argument could be summarized as: every data analysis will be influenced by the analyst – the Bayesian approach is just very transparent about this.

One of the main advantages of Bayesian data analysis is the ease of incorporating prior information. Assume that you are analysing a scientific experiment – let's say you want to estimate the value of some physical constant (e.g. the speed of light). Now, assume that you have already analysed data from a previous experiment. You might not want to trash the insights you gained from the first experiment, but instead take them into consideration when analysing the data of the second experiment. Bayesian statistics, and specifically the use of prior distributions, provides you with an easy way of achieving this – the first experiment's posterior distribution of the parameter (the speed of light) becomes the prior distribution for the parameter in the inference on the second experiment's data. We've learned something from the analysis of the first experiment and don't ignore this knowledge, but use it as a starting point for the interpretation of data from further analyses.
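As a sketch of this idea (assuming, for illustration, a conjugate Normal model with known measurement noise and made-up numbers), the posterior from the first experiment can literally be plugged in as the prior for the second:

```python
# Sequential Bayesian updating sketch (conjugate Normal model, known noise).
# The posterior of experiment 1 is reused as the prior for experiment 2.
import numpy as np

def update_normal(prior_mu, prior_sd, data, noise_sd):
    """Posterior of the mean of a Normal likelihood with known noise_sd."""
    prior_prec = 1.0 / prior_sd**2
    data_prec = len(data) / noise_sd**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mu = post_var * (prior_prec * prior_mu + data_prec * np.mean(data))
    return post_mu, np.sqrt(post_var)

mu, sd = 299_000.0, 10_000.0                         # vague prior belief (made-up numbers, km/s)
exp1 = np.array([299_793.1, 299_791.8, 299_792.6])   # made-up measurements, experiment 1
exp2 = np.array([299_792.2, 299_792.9])              # made-up measurements, experiment 2

mu, sd = update_normal(mu, sd, exp1, noise_sd=2.0)   # posterior after experiment 1 ...
mu, sd = update_normal(mu, sd, exp2, noise_sd=2.0)   # ... becomes the prior for experiment 2
print(mu, sd)
```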

Bayesian inference thus not only provides you with results that are easier to interpret, but also gives you an easy way of combining the results of different analyses.

Bayesian Statistics – the mechanics

Bayes' Theorem

Bayes' Theorem deals with conditional and marginal probabilities. One of the prototypical examples for illustrating Bayes' Theorem deals with tests for a disease and can thus easily be turned into a currently relevant example (currently meaning during the COVID-19 crisis of 2020, in case you are reading this in the far future).

Example: Testing for diseases

Let's assume we are dealing with a test (T) for past infections (PI) with the virus that causes COVID-19. It is supposed to check whether an individual person has already overcome an infection with SARS-CoV-2.

Individual people take this test and their result is either negative (T=0) or positive (T=1). At the same time, people either have already successfully fought off the virus (PI=1) or have not been infected yet (PI=0). For simplicity's sake, let's ignore the edge case of people who are currently infected.

Sensitivity and Specificity

The quality of the test can be summarized by its sensitivity – how many of the people who have actually been infected get a positive test result – and its specificity. The specificity tells you what percentage of people who have not had the virus get a negative test result. Let's assume that the test has a sensitivity of 98% and a specificity of 90%.

Sensitivity and specificity can be expressed as conditional probabilities. The sensitivity is the probability that a test result is positive given that the tested individual has previously been infected:

p(T=1 | PI=1) = 0.98

The specificity is the conditional probability that a test result is negative, given that the person has not been previously infected:

p(T=0 | PI=0) = 0.90

It is obvious that a good test should have both – a high sensitivity and a high specificity. It is, for example, very easy to achieve a high sensitivity (just design a test that always returns a positive result). This test would have perfect sensitivity, since it would give a positive result for everyone who has had the virus. Yet, its specificity would be abysmal, since it would also give everyone who has not had the virus a positive test result.

Now, the answer we might really be interested in is the probability with which a person who has a positive test result has actually been infected. It is, in a sense, the "inverse" of the probability of having a positive test result given that someone has been infected. Remember, Bayesian analysis has also been called "inverse probability".

So, we are interested in the quantity p(PI=1|T=1). As you may have guessed – Bayes' Theorem will help us in obtaining the desired answer.

Derivation of Bayes' Theorem

In general, any joint probability distribution of two random variables (such as PI and T) can be factored in two ways. One way is:

p(T, PI) = p(PI) * p(T | PI)

Just to clarify this further, let's look at the case of a positive test result (T=1) for a previously infected person (PI=1):

p(T=1, PI=1) = p(PI=1) * p(T=1 | PI=1)

The probability that a person has a positive test result and has had the virus can be expressed as the probability that a person has had the virus (p(PI=1)) times the probability that a test result was positive given that the person has had the virus (p(T=1|PI=1)).

The joint probability distribution p(T,PI) can also be factored in another way:

p(T, PI) = p(T) * p(PI | T)

To clarify this one further, let's look again at the case of a positive test result (T=1) for somebody who has been previously infected:

p(T=1, PI=1) = p(T=1) * p(PI=1 | T=1)

The probability that an individual has a positive test result and has had the virus can also be expressed as the probability of a positive test result times the probability that someone actually has previously been infected given that the test result was positive.

We can combine both factorizations to obtain:

p(PI) * p(T | PI) = p(T) * p(PI | T)

Solving for p(PI|T):

p(PI | T) = p(T | PI) * p(PI) / p(T)

Et voilà – that's one way to write down Bayes' Theorem. It gives us a way to reason about the probability of previous infection given a test result, when all we have to start with are the probabilities of test results given the states of previous infection. Well, we also need a way to come up with values for p(PI), the marginal probability of previous infection, and p(T), the marginal probability of test results.

Let's assume that Germany has done a pretty good job so far – and that the roughly 200,000 confirmed cases roughly represent the number of infections we have had. Assuming we have 80 million people (yep, I am not being precise here, I want to have some nice numbers) – that would mean the chance that a randomly selected person has already had the virus is around 1 in 400 (or 0.25%). Now, let's assume we randomly test people in Germany with the aforementioned antibody tests and want to know the probability that somebody who has tested positive actually has had the virus:

p(PI=1 | T=1) = p(T=1 | PI=1) * p(PI=1) / p(T=1)

What is the marginal probability p(T=1)? Well, it is the probability that a test is positive, which can be expressed as the probability that a test is positive given that the person was previously infected times the probability that an individual was previously infected, plus the probability that a test is positive given that the person has not been previously infected times the probability that an individual has not been previously infected... Long story short:

p(T=1) = p(T=1 | PI=1) * p(PI=1) + p(T=1 | PI=0) * p(PI=0)

So, putting it all together, we obtain:

p(PI=1 | T=1) = p(T=1 | PI=1) * p(PI=1) / [ p(T=1 | PI=1) * p(PI=1) + p(T=1 | PI=0) * p(PI=0) ]

Plugging in the actual numbers:

p(PI=1 | T=1) = (0.98 * 0.0025) / (0.98 * 0.0025 + 0.10 * 0.9975) ≈ 0.024

So, that's it – despite the fact that we have a test with high sensitivity and reasonably high specificity, the probability that someone with a positive test result actually has had SARS-CoV-2 is only about 2.4%. Or, putting it slightly differently – around 98% of people who receive a positive test result will actually never have been infected. The reason for this perhaps surprising result is that the vast majority of people who are tested never had the virus. Even if the probability of a false positive is only 10%, the absolute number of false positives will largely outnumber the true positives – thus a given positive result is very likely attributable to a false positive and not a true positive.
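The whole calculation takes only a few lines of Python; this sketch simply re-does the arithmetic from above:

```python
# Bayes' Theorem for the antibody-test example, using the numbers from the text.
sensitivity = 0.98        # p(T=1 | PI=1)
specificity = 0.90        # p(T=0 | PI=0)
prevalence = 0.0025       # p(PI=1): roughly 200,000 cases / 80 million people

p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)   # p(T=1)
p_infected_given_positive = sensitivity * prevalence / p_positive              # p(PI=1 | T=1)
print(round(p_infected_given_positive, 3))   # ~0.024, i.e. about 2.4%
```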

Agreed – the estimate of the prior probability that a randomly selected person has had the virus is coarse at best – it's pretty much guesswork in this case. Yet, even if it were quite a bit higher, false positives would still outnumber true positives and Bayes' Theorem would still rightfully tell us that we should not take the positive result at face value.

Now, whether this will play a role in testing for SARS-CoV-2 – or whether the tests will have such high sensitivities and specificities that "we don't have to worry about statistics" – remains to be seen. There is evidence from the FDA that some of the tests are very good indeed, and the numbers given here are quite a bit too pessimistic [10]. Yet, it is known, for example, that positive results in mammograms for young women (around the age of 40) are largely dominated by false positives. Numbers vary quite a bit with exact age, source, etc. – but roughly only every 13th positive result is actually a true positive [11]. The reasoning is the same as in the example above – the prior is strongly skewed in the direction of "no breast cancer" – so a relatively small imperfection in the specificity will lead to many more false positives than true positives. While certainly nobody should (or would) take such a positive test result lightly, the odds after one positive result would still be largely in the young woman's favor, and with more than 90% probability she would not have breast cancer. One way of addressing this is by applying multiple tests, as sketched below.
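As a rough sketch of the multiple-test idea – assuming, hypothetically, that repeated tests are independent and reusing the sensitivity, specificity and prevalence from the COVID-19 example above – the posterior after each positive result simply becomes the prior for the next test:

```python
# Iterative updating with repeated (assumed independent) positive test results.
def posterior_after_positive(prior, sensitivity=0.98, specificity=0.90):
    p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)
    return sensitivity * prior / p_positive

p = 0.0025                                   # prior probability of past infection
for n in range(1, 3):                        # two consecutive positive results
    p = posterior_after_positive(p)
    print(n, round(p, 3))                    # ~0.024 after one, ~0.19 after two positives
```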

Continuous Form

So far the whole discussion has been about discrete random variables – we used binary outcomes such as sick / healthy or positive / negative. The form of Bayes' Theorem for this case has been:

p(A | B) = p(B | A) * p(A) / p(B)

where p(B) could be expressed as a sum:

p(B) = Σ_A p(B | A) * p(A)

If we go from discrete to continuous cases (i.e. we look at continuous random variables, such as a person's height or weight, an employee's income, …), the discrete probability distributions become probability densities and the sum turns into an integral:

p(A | B) = p(B | A) * p(A) / ∫ p(B | A) * p(A) dA

Bayesian data analysis

So far, we have only introduced Bayes' Theorem. There are usually not that many cases in a data analyst's daily business where she/he/… has to calculate probabilities about tests for diseases. So, it's perfectly fair to ask: Why should data analysts / data scientists / data … / business … care about Bayes' Theorem?

The point is that many data-analytical problems can be seen as the attempt to infer the values of a model's (yeah, this kind of model) parameters from data. Well, that's exactly what Bayes' Theorem can provide:

p(Theta | Data) = p(Data | Theta) * p(Theta) / p(Data)

Theta here denotes the (potentially multivariate) parameter(s) and Data is simply our (very likely multivariate) data. It basically means that we can get a probability distribution for our parameters given the data (p(Theta|Data)). This distribution is conventionally called the posterior (after having seen the data). The distribution p(Theta) is our prior distribution (what credibility do we assign to certain parameter values before we have seen the data?). The quantity p(Data|Theta) is the likelihood. Frequentists would simply use optimization to obtain the parameters Theta* that maximize the likelihood (or the log-likelihood, to make things computationally easier). The normalization constant p(Data) is conventionally called the evidence or marginal likelihood (we are integrating / marginalizing the likelihood p(Data|Theta) over Theta).

Bayesians also use the likelihood, but in addition they have to specify prior distributions for their parameters. In return, they obtain an entire distribution over the parameters and not (only) point estimates for the parameter values that maximize the likelihood.
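Just to give a first impression of what this looks like in practice (the actual examples follow in part II), here is a minimal sketch – assuming the PyStan 2.x interface and synthetic data – of a model with a Normal likelihood and priors on its two parameters:

```python
# Minimal Bayesian model sketch with Stan / PyStan (assumes PyStan 2.x).
import numpy as np
import pystan

model_code = """
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  mu ~ normal(0, 10);      // prior for mu
  sigma ~ normal(0, 5);    // prior for sigma (half-Normal due to the constraint)
  y ~ normal(mu, sigma);   // likelihood of the data
}
"""

y = np.random.normal(loc=5.0, scale=2.0, size=100)          # synthetic data
model = pystan.StanModel(model_code=model_code)
fit = model.sampling(data={"N": len(y), "y": y}, iter=2000, chains=4)
mu_samples = fit.extract()["mu"]                             # posterior samples for mu
print(mu_samples.mean(), mu_samples.std())
```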

Sampling and Markov Chain Monte Carlo methods

Unfortunately, in most practical cases it is impossible to obtain an analytical solution for the posterior distribution. The denominator in the continuous form of Bayes' Theorem consists of an integral over a potentially high-dimensional space. Given that you have 50 free parameters in a model (a rather realistic number), you will have to perform an integral over a space with 50 dimensions. Most of the time you will not be able to solve this analytically. Simple numerical methods (integration over "a grid") will also fail – after all, even if you have only 10 dimensions, a grid with 1000 points per dimension will have 1000^10 points in total, which renders grid integration computationally infeasible.

One way of tackling this is by obtaining samples from the posterior distribution. These samples are distributed according to the posterior distribution. You can calculate basically every quantity that you would want to calculate from the posterior distribution from these samples. What is the mean value of the distribution? Well, gather enough (independent) samples and calculate their mean. Standard deviations, … – same thing, just calculate the quantity from the obtained samples.

Usually, so-called Markov Chain Monte Carlo (MCMC) methods (a great explanation is given in [12]) are used to obtain samples from a distribution. These methods start from initial values for the parameters. Afterwards, new values are proposed and accepted according to certain proposal and acceptance functions. This process of proposal and acceptance is iterated over many times to create a large number of samples. Different methods mostly differ in their implementations of the proposal and acceptance functions.
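To illustrate the propose/accept mechanics, here is a minimal random-walk Metropolis sketch for a one-dimensional toy target (note that Stan itself uses a more sophisticated variant, Hamiltonian Monte Carlo / NUTS):

```python
# Random-walk Metropolis sketch for a toy one-dimensional target distribution.
import numpy as np

def log_target(theta):
    # Unnormalized log-density of the target, here a Normal(3, 1.5) for illustration.
    return -0.5 * ((theta - 3.0) / 1.5) ** 2

rng = np.random.default_rng(0)
theta = 0.0                                    # initial value
samples = []
for _ in range(20_000):
    proposal = theta + rng.normal(scale=0.5)   # proposal function: random-walk step
    if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
        theta = proposal                       # acceptance function: Metropolis criterion
    samples.append(theta)
samples = np.array(samples)
```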

MCMC methods may need some post-processing of the obtained samples – e.g. it can be necessary to discard the first (often hundreds or thousands of) samples, because "they have been collected far away from the relevant parameter values". Also, samples are usually correlated, and thus often only every Nth sample (after the removal of the initial samples) is retained. Furthermore, the methods can fail (to give representative samples of the distribution). There are ways to diagnose this, such as running multiple "sampling chains" (and comparing their outcomes). I will not go into details in this blog post – it's an introduction with simple examples after all.
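Continuing the toy sketch above, the post-processing could look like this – discard an initial warm-up portion and keep only every Nth of the remaining samples:

```python
# Post-processing of the toy chain: remove burn-in and thin the correlated samples.
burn_in, thin = 2_000, 10
kept = samples[burn_in::thin]
print(kept.mean(), kept.std())   # posterior summaries computed from the retained samples
```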

Summary

This first part gave a short review of the history of Bayesian statistics, as well as the derivation of Bayes' Theorem. After an illustrative example for the application of Bayes' Theorem, the transition to data-analytical applications of Bayes' Theorem was made. We concluded with a very brief discussion of Markov Chain Monte Carlo methods.

I hope you enjoyed reading this first part – if so, please don't forget to read part II.

Links, References:

[1] https://www.mrc-bsu.cam.ac.uk/software/bugs/

[2] http://mcmc-jags.sourceforge.net/

[3] https://en.wikipedia.org/wiki/Markov_chain_Monte_Carlo

[4] https://mc-stan.org/

[5] Bob Carpenter et al., 2017. Stan: A probabilistic programming language. Journal of Statistical Software 76(1). DOI 10.18637/jss.v076.i01

[6] https://pystan.readthedocs.io/en/latest/

[7] Nature: Special report: The simulations driving the world's response to COVID-19

(contains a link to the following (Bayesian) report):

https://spiral.imperial.ac.uk:8443/bitstream/10044/1/77731/10/2020–03-30-COVID19-Report-13.pdf

[8] https://statmodeling.stat.columbia.edu/2016/02/15/the-recent-black-hole-ligo-experiment-used-pystan/

[9] https://bjlkeng.github.io/posts/probabilistic-interpretation-of-regularization/

[10] https://www.fda.gov/medical-devices/emergency-situations-medical-devices/eua-authorized-serology-test-performance

[11] https://www.northcoastjournal.com/humboldt/bayes-mammograms-and-false-positives/Content?oid=3677370

[12] Richard McElreath: Statistical Rethinking: A Bayesian course with examples in R and Stan.