Probabilistic Counting



Flajolet, HyperLogLog and counting Big Data

Given a stream or tuple of items, how would you obtain the number of different items N contained in it? If you cannot memorize the items, you might want to write them down on paper, or at least the different ones. But what do you do when you don't have enough paper? In this scenario, you are left with two choices. Either you buy more paper, or you employ probabilistic algorithms which are designed to write less, i.e., consume less paper. In this article, we briefly present the HyperLogLog (HLL) algorithm and its history.

Motivation

The problem we are concerned with here is approximate counting or cardinality estimation. It arises in many areas, such as network monitoring, search engines, or data mining. Cardinality-based methods are used to detect distributed denial of service (DDoS) attacks by counting source/target IP address pairs. Another example is databases, which internally use the cardinality of columns to decide how to serve the user's queries using the least amount of resources.

An exact count requires persisting all items on a device, e.g. using a hash table. However, storage technologies come with an inevitable tradeoff between capacity and ease of access, as measured by energy, latency, or bandwidth. That is, the expensive storage (SRAM, DRAM) is cheap to access, while the cheap storage (tape, HDD) is costly to access. Flash drives or SSDs sit in the middle of this spectrum and exhibit similar tradeoffs between NAND- and NOR-based architectures. Over the last four decades, progress in memory access has consistently lagged behind CPU performance improvements [1].

Suppose the cardinality of the input stream becomes too large. In that case, the performance of the counting application will degrade below the level you are willing to accept, or else the required hardware will become expensive. That's where probabilistic algorithms enter the scene. They only provide approximate counts, but offer a low memory footprint, which is also beneficial for runtime if memory is scarce.


Figure 1: Flow chart for distinct count using a hash table.

Bloom filters O(N)

One approach to reduce the required memory is to employ Bloom filters instead of hash tables. A Bloom filter is a data structure that implements probabilistic set membership. The items themselves are not persisted; instead, a signature consisting of a few bits is stored in a shared bit array for each item. Hash functions are used to determine this signature, leading to a dense representation with limited capabilities. In particular, it is not possible to retrieve items, nor is it possible to delete single items, because different items may share the same bit in the array. Compared to a B-tree or hash table, the required space is drastically reduced, but still linear in N.
Bloom filters support the following Add, Check, and Delete operations (a minimal code sketch follows the list):

1. Add an item to the Bloom filter.
2. Check whether an item is contained in the Bloom filter.
3. Delete all items from the Bloom filter.
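
To make these operations concrete, below is a minimal Bloom filter sketch in Python. The array length, the number of hash functions, and the use of salted SHA-256 digests are illustrative assumptions, not prescriptions from the literature.

import hashlib

class BloomFilter:
    # Minimal sketch: an m-bit array and k salted hash positions per item.
    def __init__(self, m=1024, k=3):
        self.m = m                      # length of the shared bit array
        self.k = k                      # signature bits (hash functions) per item
        self.bits = [False] * m

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for salt in range(self.k):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def check(self, item):
        # May return false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(item))

    def delete_all(self):
        # Single items cannot be removed; only the whole filter can be cleared.
        self.bits = [False] * self.m

For example, after bf = BloomFilter(); bf.add("car 1"), the call bf.check("car 1") returns True, while bf.check("car 2") returns False unless a hash collision produces a false positive.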

You can think of a Bloom filter as a small garage where you can park an infinite number of cars. The cars are lost forever in the garage, but you can ask whether a particular car is present, to which you get an answer that may or may not be true.

Bloom filters are biased in that the response to the Check operation may return false positives, but never false negatives (perfect sensitivity). This means that the presence of items is systematically overestimated. As more items are added, the false positive rate increases, i.e. the specificity drops. Once all bits are set, the Bloom filter becomes a useless estimator claiming set membership for every possible item. Therefore the capacity of a Bloom filter is limited in practice by the available memory.

The length of the bit array and the number of bits per item are design parameters that determine the tradeoff between correctness (i.e. specificity or false positive rate) and capacity. Bloom filters are used by content delivery networks for efficient caching and by databases or applications in general to avoid unnecessary memory accesses or network requests. They have many applications today, but counting is not their biggest strength.
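
To get a feeling for this tradeoff, the widely used approximation for the false positive rate of a Bloom filter with an m-bit array, k hash functions, and n inserted items is (1 - e^(-kn/m))^k. The short sketch below evaluates it for a few fill levels; the concrete sizes are arbitrary illustrative choices.

from math import exp

def bloom_false_positive_rate(m_bits, k_hashes, n_items):
    # Standard approximation of the Bloom filter false positive rate.
    return (1.0 - exp(-k_hashes * n_items / m_bits)) ** k_hashes

m, k = 8 * 1024, 3                      # 1 KiB bit array, 3 hash functions
for n in (100, 1000, 5000):
    print(n, bloom_false_positive_rate(m, k, n))
# The rate grows with n; for a given m and n, the optimal number of
# hash functions is roughly (m / n) * ln 2.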

Development

Today, many database products support approximate counting aggregation functions (named approx_count_distinct or similar). That is, the queries

select count(distinct col1) from tab1
select approx_count_distinct(col1) from tab1

will usually return different results, but the second one is designed to run much faster on large tables. Many vendors have opted for the HLL algorithm when it comes to implementing cardinality estimation. The name stems from the required memory being of order log(log N). The required space can be sized in advance and copes with any cardinality of practical relevance.

One of the earlier papers on probabilistic counting was by Morris [2]. He presented a way to count from 1 to 130000 using a single byte of memory. The basic idea underlying HLL was developed by Philippe Flajolet and G. Nigel Martin during the early 1980s. At that time, Flajolet was Director of Research at INRIA in France, while Martin was a researcher working for IBM in the UK. Both HLL and HLL++ (used by GCP BigQuery) stem from the Flajolet-Martin and PCSA (Probabilistic Counting with Stochastic Averaging) algorithms from 1985 [3],[4]. Martin's brilliant idea was to hash every item and extract useful information from the resulting bit pattern. Flajolet refined this idea into an algorithm ready for practical use. PCSA has log N space complexity and was used in many systems for two decades until it was gradually supplanted by HLL, introduced by Flajolet and his co-workers in 2007 [5].

Both PCSA and HLL rely on stochastic averaging, i.e. the input stream is divided into multiple substreams or buckets. The number of buckets m=2^p depends on the precision p and determines the overall accuracy. HLL uses a harmonic mean to combine the results of the individual buckets into the final result. Part of the hash is used to allocate items to their bucket, so that duplicate items are always sent to the same bucket. Deterministic allocation to buckets using the same hash function is essential for the algorithm. It also allows combining algorithm states belonging to different streams to estimate cardinalities of unions. It is worth noting that a higher number of buckets increases the cardinality threshold below which HLL performs poorly and needs to rely on other algorithms, such as Linear Counting [6].
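
The following sketch illustrates stochastic averaging and the merging of states in Python: the first p bits of an item's hash select the bucket, the remaining bits form the pattern that PCSA or HLL inspect, and two HLL-style register arrays over different streams can be combined bucket by bucket via an element-wise maximum. The hash function (MD5 truncated to 64 bits) and the bit layout are illustrative assumptions.

import hashlib

P = 12                                  # precision, m = 2^p buckets
M = 1 << P
empty_sketch = [0] * M                  # one register per bucket

def bucket_and_pattern(item):
    # Deterministic split of the hash: top p bits pick the bucket,
    # the remaining bits form the pattern to be inspected.
    h = int(hashlib.md5(str(item).encode()).hexdigest()[:16], 16)   # 64-bit hash
    bucket = h >> (64 - P)
    pattern = h & ((1 << (64 - P)) - 1)
    return bucket, pattern

def merge(registers_a, registers_b):
    # Union of two HLL-style sketches: element-wise maximum per bucket.
    # This is what makes cardinality estimates for unions possible.
    return [max(a, b) for a, b in zip(registers_a, registers_b)]

Because the split is deterministic, duplicate items always land in the same bucket with the same pattern, and merging the states of two streams yields the same registers as processing their union directly.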

The main reason for the performance benefit of probabilistic counting algorithms is that items are inspected once and immediately forgotten, i.e. afterwards it is impossible to tell which items were counted. This removes the need to reorganize data structures and provide space for new data. However, it also limits the applicability to counting; removing duplicates with these algorithms, for example, is impossible.

Counting Bits

Using a hash function that assigns every bit with equal probability to zero or one, the pattern

11111111 11111111 11111111

will occur as the trailing 24 bits of a hash value with probability 2^(-24)≈6∙10^(-8). The occurrence or non-occurrence of such patterns can be used to estimate the cardinality of the stream. Using the md5sum utility on the command line, the bit pattern for some input can be obtained with:

echo <input> | md5sum | cut -c1-32 | xxd -r -p | xxd -b

The table below shows the bookkeeping used by PCSA and HLL for the first five items of a given bucket. For the sake of comparison, HLL is presented as counting trailing ones instead of leading zeros [5].

Figure 2: Bookkeeping of the PCSA and HLL algorithms for the first five items of a substream. In this example with 5 input values, PCSA uses the value R=3 to estimate the cardinality.

For every bucket, PCSA relies on a bit array r which is initialized to 0. For a given input item, the number of trailing ones in its hash value x is counted, and the corresponding bit in the bit array is set using the assignment

r = r | (~x & (x+1)) ,

where |, &, ~ denote the bitwise operators for inclusive disjunction (or), conjunction (and), and complement (not). At the end of the aggregation, the high-order bits of r will be zero, while the low-order bits will be one. Finally, PCSA averages the number of consecutive trailing ones R over the buckets' bit arrays and returns the estimate N = m∙2^(R_AVERAGE)/0.7735.
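
A per-bucket sketch of this bookkeeping in Python may look as follows; the 32-bit hash derived from MD5 is an illustrative assumption, while the bit trick and the constant 0.7735 are taken from the text above.

import hashlib

def hash32(item):
    # Illustrative 32-bit hash of an item.
    return int(hashlib.md5(str(item).encode()).hexdigest()[:8], 16)

def pcsa_update(r, item):
    # Set the bit whose position equals the number of trailing ones in the hash:
    # ~x & (x + 1) isolates the lowest zero bit of x.
    x = hash32(item)
    return r | (~x & (x + 1))

def pcsa_rank(r):
    # R = number of consecutive trailing ones in the bucket's bit array.
    rank = 0
    while r & (1 << rank):
        rank += 1
    return rank

# With m buckets and R averaged over their bit arrays, the estimate is
# N ≈ m * 2 ** R_average / 0.7735.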

In contrast to PCSA, HLL simply remembers the maximum number of consecutive trailing ones previously encountered in each bucket. Whereas PCSA needs to allocate several bytes for each bucket, HLL relies on a single byte per bucket, leading to lower memory usage. For the same number of buckets, PCSA offers better accuracy than HLL; however, the overall space-accuracy tradeoff is better for HLL: it maintains the same accuracy with less memory but more buckets than PCSA.
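
Putting the pieces together, a compact HLL-style sketch in Python could look like this. It follows the presentation above (trailing ones instead of leading zeros) and uses the standard bias-correction constant; the small-cardinality corrections of the published algorithm are omitted, so this is an illustration rather than a faithful implementation.

import hashlib

P = 12                                  # precision, m = 2^p buckets
M = 1 << P
ALPHA = 0.7213 / (1 + 1.079 / M)        # standard HLL bias-correction constant

def hll_add(registers, item):
    # Keep, per bucket, the maximum rank (trailing ones + 1) seen so far.
    h = int(hashlib.md5(str(item).encode()).hexdigest()[:16], 16)   # 64-bit hash
    bucket = h >> (64 - P)
    pattern = h & ((1 << (64 - P)) - 1)
    rank = 1
    while pattern & 1:
        pattern >>= 1
        rank += 1
    registers[bucket] = max(registers[bucket], rank)

def hll_estimate(registers):
    # Harmonic mean of 2^register over all buckets, scaled by alpha * m^2.
    harmonic = sum(2.0 ** -reg for reg in registers)
    return ALPHA * M * M / harmonic

registers = [0] * M
for i in range(100_000):
    hll_add(registers, f"item-{i}")
print(round(hll_estimate(registers)))   # close to 100000, typically within a few percent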

The memory-accuracy tradeoff for HLL and most approximate counting algorithms is quadratic, i.e. cutting the average error in half requires four times more memory. The results returned by PCSA and HLL are independent of the record order.
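
The scaling can be read off from the standard error of HLL, which is approximately 1.04/sqrt(m) according to [5]: every additional bit of precision doubles the number of buckets but shrinks the error only by a factor of sqrt(2). A quick illustration in Python:

from math import sqrt

# Approximate relative standard error of HLL for precisions p = 10..16.
for p in range(10, 17):
    m = 1 << p
    print(f"p={p:2d}  buckets={m:6d}  error≈{1.04 / sqrt(m):.2%}")
# Halving the error, e.g. from about 1.6% at p=12 to about 0.8% at p=14,
# requires four times as many buckets and hence roughly four times the memory.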

Applications

The databases Oracle, MS SQL Server, Snowflake, AWS Redshift, Azure Synapse Analytics, Redis, and others rely on HLL implementations for approximate counting. An exception is Google with its GCP BigQuery service, which uses HLL++, a variant that introduces algorithmic improvements for small cardinalities [7].

Snowflake, Redshift, and BigQuery claim average relative errors below two percent. Still, there are no guarantees for single runs, and it is possible (but tedious) to prepare data that fools the algorithm.

Apart from the APPROX_COUNT_DISTINCT function, these databases provide several HLL functions that allow managing the cardinality estimation using sketches. In particular, it is possible to estimate the cardinality of unions using already gathered statistics, i.e., building hierarchies and providing estimates for every layer using information gathered at the bottom. For instance, you may count unique events per 10-minute interval and store this information using the sketch data type. This information can then be reused to build unique counts per hour, day, week, etc.

Snowflake and Redshift also allow exporting the algorithm state as JSON objects. BigQuery allows the user to configure the precision to any value between 10 and 24, leading to different accuracies. In contrast, Snowflake and Redshift use fixed precisions of 12 and 15, respectively.

HLL is only one of several approximate algorithms supported by these databases. Other examples are the generation of quantiles or identification of the most frequent records using Snowflake's APPROX_PERCENTILE and APPROX_TOP_K functions. These memory-saving aggregation functions can help to design reporting solutions over large data sets that reflect the most recent data.

Today, HLL is one of the best-known algorithms for approximate counting and has found widespread adoption in many systems since its invention 15 years ago. It allows counting datasets of any cardinality using minimal resources, is well suited for stream processing, and can run in parallel. Significant improvements over HLL seem unlikely, so HLL may stay with us for some time. You may also find the presentations by Flajolet and Sedgewick interesting [8],[9]. Happy counting!

Many thanks to Prof. Conrado Martínez from the Universitat Politècnica de Catalunya for his useful remarks.

References

[1] Onur Mutlu, "Computer Architecture Lecture 2a: Memory Refresh", 2019
[2] Robert Morris, "Counting large numbers of events in small registers", 1978
[3] Philippe Flajolet, G. Nigel Martin, "Probabilistic Counting Algorithms for Data Base Applications", 1985
[4] Philippe Flajolet, "Approximate Counting: A Detailed Analysis", 1985
[5] Philippe Flajolet, Éric Fusy, Olivier Gandouet, Frédéric Meunier, "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm", 2007
[6] Kyu-Young Whang, Brad T. Vander-Zanden, Howard M. Taylor, "A Linear-Time Probabilistic Counting Algorithm for Database Applications", 1990
[7] Stefan Heule, Marc Nunkesser, Alexander Hall, "HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm", 2013
[8] Philippe Flajolet, "Probabilistic Counting: from analysis to algorithms to programs", 2007
[9] Robert Sedgewick, "Cardinality Estimation", 2018