## Sketch of the Day: Probabilistic Counting with Stochastic Averaging (PCSA)

Before there was LogLog, SuperLogLog or HyperLogLog there was Probabilistic Counting with Stochastic Averaging (PCSA) from the seminal work “Probabilistic Counting Algorithms for Data Base Applications” (also known as the “FM Sketches” due to its two authors, Flajolet and Martin). The basis of PCSA matches that of the other Flajolet distinct value (DV) counters: hash values from a collection into binary strings, use patterns in those strings as indicators for the number of distinct values in that collection (bit-pattern observables), then use stochastic averaging to combine m trials into a better estimate. Our HyperLogLog post has more details on these estimators as well as stochastic averaging.

### Observables

The choice of observable pattern in PCSA comes from the knowledge that in a collection of randomly generated binary strings, the following probabilities occur:

\begin{aligned} &P( ..... 1) &= 2^{-1} \\ &P( .... 10) &= 2^{-2} \\ &P( ... 100) &= 2^{-3} \end{aligned}

$\vdots$

$P(...10^{k-1}) = 2^{-k}$

For each value added to the DV counter, a suitable hash is created and the position of the least-significant (right-most) 1 is determined. The corresponding position in a bitmap is updated and stored. I’ve created the simulation below so that you can get a feel for how this plays out.

Click above to run the bit-pattern simulation

(All bit representations in this post are numbered from 0 (the least-significant bit) on the right. This is the opposite of the direction in which they’re represented in the paper.)

Run the simulation a few times and notice how the bitmap is filled in. In particular, notice that it doesn’t necessarily fill in from the right side to the left — there are gaps that exist for a time that eventually get filled in. As the cardinality increases there will be a block of 1s on the right (the high probability slots), a block of 0s on the left (the low probability slots) and a “fringe” (as Flajolet et al. called it) of 1s and 0s in the middle.

I added a small pointer below the bitmap in the simulation to show how the cardinality corresponds to the expected bit position (based on the above probabilities). Notice what Flajolet et al. saw when they ran this same experiment: the least-significant (right-most) 0 is a pretty good estimator for the cardinality! In fact when you run multiple trials you see that this least-significant 0 for a given cardinality has a narrow distribution. When you combine the results with stochastic averaging it leads to a small relative error of $0.78 / \sqrt{m}$ and estimates the cardinality of the set quite well. You might have also observed that the most-significant (left-most) 1 can also be used for an estimator for the cardinality but it isn’t as clear-cut. This value is exactly the observable used in LogLog, SuperLogLog and HyperLogLog and does in fact lead to the larger relative error of $1.04 / \sqrt{m}$ (in the case of HLL).

### Algorithm

The PCSA algorithm is elegant in its simplicity:

m = 2^b # with b in [4...16]

bitmaps = [[0]*32]*m # initialize m 32bit wide bitmaps to 0s

##############################################################################################
# Construct the PCSA bitmaps
for h in hashed(data):
bitmap_index = 1 + get_bitmap_index( h,b ) # binary address of the rightmost b bits
run_length = run_of_zeros( h,b ) # length of the run of zeroes starting at bit b+1
bitmaps[ bitmap_index ][ run_length ] = 1 # set the bitmap bit based on the run length observed

##############################################################################################
# Determine the cardinality
phi = 0.77351
DV = m / phi * 2 ^ (sum( least_sig_bit( bitmap ) ) / m) # the DV estimate


Stochastic averaging is accomplished via the arithmetic mean.

You can see PCSA in action by clicking on the image below.

Click above to run the PCSA simulation

There is one point to note about PCSA and small cardinalities: Flajolet et al. mention that there are “initial nonlinearities” in the algorithm which result in poor estimation at small cardinalities ($n/m \approx 10 \, \text{to} \, 20$) which can be dealt with by introducing corrections but they leave it as an exercise for the reader to determine what those corrections are. Scheuermann et al. did the leg work in “Near-Optimal Compression of Probabilistic Counting Sketches for Networking Applications” and came up with a small correction term (see equation 6). Another approach is to simply use the linear (ball-bin) counting introduced in the HLL paper.

### Set Operations

Just like HLL and KMV, unions are trivial to compute and lossless. The PCSA sketch is essentially a “marker” for runs of zeroes, so to perform a union you merely bit-wise OR the two sets of bitmaps together. Folding a PCSA down to a smaller m works the same way as HLL but instead of HLL’s max you bit-wise OR the bitmaps together. Unfortunately for intersections you have the same issue as HLL, you must perform them using the inclusion/exclusion principle. We haven’t done the plots on intersection errors for PCSA but you can imagine they are similar to HLL (and have the benefit of the better relative error $0.78 / \sqrt{m}$).

### PCSA vs. HLL

That fact that PCSA has a better relative error than HyperLogLog with the same number of registers ($1.04 / 0.78 \approx 1.33$) is slightly deceiving in that $m$ (the number of stored observations) are different sizes. A better way to look at it is to fix the accuracy of the sketches and see how they compare. If we would like to have the same relative error from both sketches we can see that the relationship between registers is:

$\text{PCSA}_{RE} = \text{HLL}_{RE}$

$\dfrac{0.78}{\sqrt{m_{\scriptscriptstyle PCSA}}} = \dfrac{1.04}{\sqrt{m_{\scriptscriptstyle HLL}}}$

$m_{\scriptscriptstyle PCSA} = \left( \dfrac{0.78}{1.04} \right)^2 m_{\scriptscriptstyle HLL} \approx 0.563 \ m_{\scriptscriptstyle HLL}$

Interestingly, PCSA only needs a little more than half the registers of an HLL to reach the same relative error. But this is also deceiving. What we should be asking is what is the size of each sketch if they provide the same relative error? HLL commonly uses a register width of 5 bits to count to billions whereas PCSA requires 32 bits. That means a PCSA sketch with the same accuracy as an HLL would be:

\begin{aligned} \text{Size of PCSA} &= 32 \text{bits} \ m_{\scriptscriptstyle PCSA} = 32 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} ) \\ \\ \text{Size of HLL} &= 5 \text{bits} \ m_{\scriptscriptstyle HLL} \end{aligned}

Therefore,

$\dfrac{\text{Size of PCSA}} {\text{Size of HLL}} = \dfrac{32 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} )}{5 \text{bits} \ m_{\scriptscriptstyle HLL}} \approx 3.6$

A PCSA sketch with the same accuracy is 3.6 times larger than HLL!

### Optimizations

But what if you could make PCSA smaller by reducing the size of the bitmaps? Near the end of the paper in the Scrolling section, Flajolet et al. bring up the point that you can make the bitmaps take up less space. From the simulation you can observe that with a high probability there is a block of consecutive 1s on the right side of the bitmap and a block of consecutive 0s on the left side of the bitmap with a fringe in between. If one found the “global fringe” — that is the region defined by the left-most 1 and right-most 0 across all bitmaps — then only those bits need to be stored (along with an offset value). The authors theorized that a fringe width of 8 bits would be sufficient (though they fail to mention if there are any dependencies on the number of distinct values counted). We put this to the test.

In our simulations it appears that a fringe width of 12 bits is necessary to provide an unbiased estimator comparable to full-fringe PCSA (32-bit) for the range of distinct values we analyzed. (Notice the consistent bias of smaller fringe sizes.) There are many interesting reasons that this “fringe” concept can fail. Look at the notes to this post for more. If we take the above math and update 32 to 12 bits per register (and include the 32 bit offset value) we get:

\begin{aligned} \dfrac{\text{Size of PCSA}} {\text{Size of HLL}} &= \dfrac{12 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} ) + 32\text{bits}} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}} \\ \\ &= \dfrac{12 \text{bits} \, ( 0.563 \ m_{\scriptscriptstyle HLL} )} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}} + \dfrac{32\text{bits}} {5 \text{bits} \cdot m_{\scriptscriptstyle HLL}} \\ \\ &\approx 1.35 \text{ (for }m\gg64 \text{)} \end{aligned}

This is getting much closer to HLL! The combination of tighter bounds on the estimate and the fact that the fringe isn’t really that wide in practice result in PCSA being very close to the size of the much lauded HLL. This got us thinking about further compression techniques for PCSA. After all, we only need to get the sketch about 1/3 smaller to be comparable in size to HLL. In a future post we will talk about what happens if you Huffman code the PCSA bitmaps and the tradeoffs you make when you do this.

### Summary

PCSA provides for all of the goodness of HLL: very fast updates making it suitable for real-time use, small footprint compared to the information that it provides, tunable accuracy and unions. The fact that it has a much better relative error per register than HLL indicates that it should get more credit than it does. Unfortunately, each bitmap in PCSA requires more space than HLL and you still get less accuracy per bit. Look for a future post on how it is possible to use compression (e.g. Huffman encoding) to reduce the number of bits per bitmap, thus reducing the error per bit to match that of HLL, resulting in an approach that matches HLL in size but exceeds its precision!

### Notes on the Fringe

While we were putting this post together we discovered many interesting things to look at with respect to fringe optimization. One of the questions we wanted to answer was “How often does the limited size of the fringe muck up a bitmap?” Below is a plot that shows how often any given sketch had a truncation event (that affected the DV estimate) in the fringe of any one of its bitmaps for a given fringe width (i.e. some value could not be stored in the space available).

Note that this is an upper bound on the error that could be generated by truncation. If you compare the number of runs that had a truncation event (almost all of the runs) with the error plot in the post it is quite shocking that the errors are as small as they are.

Since we might not get around to all of the interesting research here, we are calling out to the community to help! Some ideas:

1. There are likely a few ways to improve the fringe truncation. Since PCSA is so sensitive to the least-significant 1 in each bitmap, it would be very interesting to see how different approaches affect the algorithm. For example, in our algorithm we “left” truncated meaning that all bitmaps had to have a one in the least-significant position of the bitmap in order to move up the offset. It would be interesting to look at “right” truncation. If one bitmap is causing many of the others to not record incoming values perhaps it should be bumped up. Is there some math to back up this intuition?
2. It is interesting to us that the fringe width truncation events are DV dependent. We struggled with the math on this for a bit before we just stopped. Essentially we want to know what is the width of the theoretical fringe? It obviously appears to be DV dependent and some sort of coupon collector problem with unequal probabilities. Someone with better math skills than us needs to help here.

### Closing thoughts

We uncovered PCSA again as a way to go back to first principles and see if there are lessons to be learned that could be applied to HLL to make it even better. For instance, can all of this work on the fringe be applied to HLL to reduce the number of bits per register while still maintaining the same precision? HLL effectively records the “strandline” (what we call the left-most 1s). More research into how this strandline behaves and if it is possible to improve the storage of it through truncation could reduce the standard HLL register width from 5 bits to 4, a huge savings! Obviously, we uncovered a lot of open questions with this research and we feel there are algorithmic improvements to HLL right around the corner. We have done some preliminary tests and the results so far are intriguing. Stay tuned!

## HyperLogLog Engineering: Choosing The Right Bits

Author’s Note: this is just a quick post about an engineering hiccup we ran into while implementing HyperLogLog features that aren’t mentioned in the original paper. We have an introduction to the algorithm and several other posts on the topic if you’re interested.

Say you had two HyperLogLog data structures with 5-bit-wide registers, one with $log_{2}m = 11$ and the other with $log_{2}m = 15$, and wanted to compute their union. You could just follow my colleague Chris’ advice and “fold” the larger one down to the size of the smaller one and then proceed as usual taking the pairwise max of the registers. This turns out to be a more involved process than Chris makes it out to be if you designed your HLL implementation in a particular way. For instance, if you use the 15 least(/most) significant bits of the 64-bit hashed input to determine register index and the next 30 bits to determine the register value, you end up in a tricky situation when you truncate the last 4 bits of the index to get the new 11-bit index.

If you imagine feeding the same element into an HLL of the smaller size, then the 4 bits you truncated from the index would have actually been used in the computation of the register value.

You couldn’t simply take the original register value you computed, you’d have to take into account the new prefix added to the register value bit string. If the prefix has a 1 in it, you would recompute the run of zeroes on just the prefix (because you know it contains a 1 and thus all the information you need), and if not, you’d add the length of the prefix to the original register value computed. Not a ton of work, but having clutter like this in algorithmic code distracts the reader from the true intention. So how do we avoid this?

Well, you could say that it’s very, very unlikely that you’ll ever need more than 30 bits for your register value, so you could assume that the register width would remain constant forever and use the bottom 30 bits for your register value and the next $log_{2}m$ bits for your register index. That way you could just truncate the last 4 bits of the index and know that your register value would still be the same. On the other hand, if you’re Google, that may not be true. In that case, what you should do is use the $log_{2}m$ least (/most) significant bits of your hashed value for the register index and the 30 most (/least) significant bits for the register value.

Now you can just truncate the register index and use the original register value.

If you’re using a good hash function like MurmurHash3 that gives you 128 bits of entropy, you could simply compute the register index from the first 64-bit word in the hash and compute the register value from the second 64-bit word and completely ignore this problem up to a mind-bending $log_{2}m = 64$ and register width of 6 (aka the heat death of the universe).

I know it’s not always possible to anticipate this problem in the early stages of implementing and vetting an algorithm, but hopefully with a bit of research the next time someone looks to implement HLL they’ll see this and learn from our mistake.

## The Sketching Press

Yesterday we had a visitor at the office, Bahman Bahmani. He was nice enough to give us a preview of his talk for Strata this week. As we are sketching cheerleaders, it was really cool of him to let us see his talk and to trade some war stories. If you are at Strata this week, definitely go and check it out. He has some really cool examples of sketching applications and a detailed description of his work at Twitter for their streaming PageRank sketch.

## Doubling the Size of an HLL Dynamically – Unions

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm.  See Matt’s overview of the algorithm, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the second in a series of three posts on doubling the number of bins of HLLs. The first post dealt with the recovery time after doubling and the next post will deal with ways to utilize an extra bit or two per bin.

### Overview

Let’s say we have two streams of data which we’re monitoring with the HLL algorithm, and we’d like to get an estimate on the cardinality of these two streams combined, i.e. thought of as one large stream.  In this case, we have to take advantage of the algorithm’s built-in “unionfeature.  Done naively, the accuracy of the estimate will depend entirely on the the number of bins, $m$, of the smaller of the two HLLs.  In this case, to make our estimate more accurate, we would need to increase this $m$ of one (or both) of our HLLs.  This post will investigate the feasibility of doing this; we will apply our idea of “doubling” to see if we can gain any accuracy.  We will not focus on intersections, since the only support the HyperLogLog algorithm has for intersections is via the inclusion/exclusion principle. Hence the error can be kind of funky for this – for a better overview of this, check out Timon’s post here. For this reason, we only focus on how the union works with doubling.

### The Strategy: A Quick Reminder

In my last post we discussed the benefits and drawbacks of many different doubling strategies in the context of recovery time of the HLL after doubling. Eventually we saw that two of our doubling strategies worked significantly better than the others. In this post, instead of testing many different strategies, we’ll focus instead on one strategy, “proportion doubling” (PD), and how to manipulate it to work best in the context of unions. The idea behind PD is to guess the approximate intersection cardinality of the two datasets and to force that estimate to remain after doubling. To be more specific, suppose we have an HLL $A$ and an HLL $B$ with$n$ bins and $2n$ bins, respectively. Then we check what proportion of bins in $A$, call it $p$, agree with the bins in $B$. When we doubled $A$, we fill in the bins by randomly selecting $p\cdot n$ bins, and filling them in with the value in the corresponding bins in $B$. To fill in the rest of the bins, we fill them in randomly according to the distribution.

### The Naive Approach

To get some idea of how well this would work, I put the most naive strategy to the test. The idea was to run 100 trials where I took two HLLs (one of size $2^5 = 32$ and one of size $2^6 = 64$), ran 200K keys through them, doubled the smaller one (according to Random Estimate), and took a union. I had a hunch that the accuracy of our estimate after doubling would depend on how large the true intersection cardinality of the two datasets would be, so I ran this experiment for overlaps of size 0, 10K, 20K, etc. The graphs below are organized by the true intersection cardinality, and each graph shows the boxplot of the error for the trials.

This graph is a little overwhelming and a bit of a strange way to display the data, but is useful for getting a feel for how the three estimates work in the different regimes.  The graph below is from the same data and just compares the “Small” and “Doubled” HLLs.  The shaded region represents the middle 50% of the data, and the blue dots represent the data points.

The first thing to notice about these graphs is the accuracy of the estimate in the small intersection regime. However, outside of this, the estimates are not very accurate – it is clearly a better choice to just use the estimate from the smaller HLL.

Let’s try a second approach. Above we noticed that the algorithm’s accuracy depended on the cardinality of the intersection. Let’s try to take that into consideration. Let’s use the “Proportion Doubling” (PD) strategy we discussed in our first post. That post goes more in depth into the algorithm, but the take away is that this doubling strategy preserves the proportion of bins in the two HLLs which agree. I ran some trials like I did above to get some data on this. The graphs below represent this.

Here we again, show the data in a second graph comparing just the “Doubled” and “Small” HLL estimates.  Notice how much tighter the middle 50% region is on the top graph (for the “Doubled” HLL).  Hence in the large intersection regime, we get very accurate estimates.

One thing to notice about the second set of graphs is how narrow the error bars are.  Even when the estimate is biased, it still has much smaller error.  Also, notice that this works well in the large intersection regime but horribly in the small intersection regime.  This suggests that we may be able to interpolate our strategies. The next set of graphs is for an attempt at this. The algorithm gets an estimate of the intersection cardinality, then decides to either double using PD, double using RE, or not double depending on whether the intersection is large, small, or medium.

Here, the algorithm works well in the large intersection regime and doesn’t totally crap out outside of this regime (like the second algorithm), but doesn’t sustain the accuracy of the first algorithm in the small intersection regime. This is most likely because the algorithm cannot “know” which regime it is in and thus, must make a guess.  Eventually, it will guess wrong will severely underestimate the union cardinality. This will introduce a lot of error, and hence, our boxplot looks silly in this regime. The graph below shows the inefficacy of this new strategy. Notice that there are virtually no gains in accuracy in the top graph.

### Conclusion

With some trickery, it is indeed possible to gain some some accuracy when estimating the cardinality of the union of two HLLs by doubling one.  However, in order for this to be feasible, we need to apply the correct algorithm in the correct regime. This isn’t a major disappointment since for many practical cases, it would be easy to guess which regime the HLLs should fall under and we could build in the necessary safeguards if we guess incorrectly.  In any case, our gains were modest but certainly encouraging!

## Open Source Release: postgresql-hll

We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0 which we feel is an excellent balance between permissive usage and liability limitation.

### What is it and what can I do with it?

The extension introduces a new data type, hll, which represents a probabilistic distinct value counter that is a hybrid between a HyperLogLog data structure (for large cardinalities) and a simple set (for small cardinalities). These structures support the basic HLL methods: insert, union, and cardinality, and we’ve also provided aggregate and debugging functions that make using and understanding these things a breeze. We’ve also included a way to do schema versioning of the binary representations of hlls, which should allow a clear path to upgrading the algorithm, as new engineering insights come up.

A quick overview of what’s included in the release:

• C-based extension that provides the hll data structure and algorithms
• Austin Appleby’s MurmurHash3 implementation and SQL-land wrappers for integer numerics, bytes, and text
• Full storage specification in STORAGE.markdown
• Full function reference in REFERENCE.markdown
• .spec file for rpmbuild
• Full test suite

A quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like HyperLogLog and we came to the conclusion that it wouldn’t be very user-friendly to force the user to figure out how to get a good hash function into SQL-land. Sure, there are plenty of cryptographic hash functions available, but those are (computationally) overkill for what is needed. We did the research and found MurmurHash3 to be an excellent non-cryptographic hash function in both theory and practice. We’ve been using it in production for a while now with excellent results. As mentioned in the README, it’s of crucial importance to reliably hash the inputs to hlls.

### Why did you build it?

The short answer is to power these two UIs:

On the left is a simple plot of the number of unique users seen per day and the number of cumulative unique users seen over the days in the month. The SQL behind this is very very straightforward:

SELECT report_date,
#users as by_day,
#hll_union_agg(users) as cumulative_by_day OVER (ORDER BY report_date ASC)
FROM daily_uniques
WHERE report_date BETWEEN '2013-01-01' AND '2013-01-31'
ORDER BY report_date ASC;

where daily_uniques is basically:

   Column    | Type | Modifiers
-------------+------+-----------
report_date | date |
users       | hll  |

Briefly, # is the cardinality operator which is operating on the hll result of the hll_union_agg aggregate function which unions the previous days’ hlls.

On the right is a heatmap of the percentage of an inventory provider’s users that overlap with another inventory provider. Essentially, we’re doing interactive set-intersection of operands with millions or billions of entries in milliseconds. This is intersection computed using the inclusion-exclusion principle as applied to hlls:

SELECT ip1.id as provider1,
ip2.id as provider2,
(#ip1.users + #ip2.users - #hll_union(ip1.users, ip2.users))/#ip1.users as overlap
FROM inventory_provider_stats ip1, inventory_provider_stats ip2
WHERE ip1.id <> ip2.id;

where inventory_provider_stats is basically:

   Column    | Type | Modifiers
-------------+------+-----------
id          | date |
users       | hll  |

(Some of you may note that the diagonal is labeled “exclusive reach” and is not represented in the query’s result set. That’s because the SQL above is a simplification of what’s happening. There’s some extra work done that replaces that the useless diagonal entries with the percent of the inventory provider’s users that are only seen on that inventory provider.)

We’ve been running this type of code in production for over a year now and are extremely pleased with its performance, ease of use, and expressiveness. Everyone from engineers to researchers to ops people to analysts have been using hlls in their daily reports and queries. We’re seeing product innovation coming from all different directions in the organization as a direct result of having these powerful data structures in an easily accessed and queried format. Dynamic COUNT(DISTINCT ...) queries that would have taken minutes or hours to compute from a fact table or would have been impossible in traditional cube aggregates return in milliseconds. Combine that speed with PostgreSQL’s window and aggregate functions and you have the ability to present interactive, rich distinct-value reporting over huge data sets. I’ll point you to the README and our blog posts on HyperLogLog for more technical details on storage, accuracy, and in-depth use cases.

I believe that this pattern of in-database probabilistic sketching is the future of interactive analytics. As our VP of Engineering Steve Linde said to me, “I can’t emphasize enough how much business value [sketches] deliver day in and day out.”

### Our Commitment

Obviously we’re open-sourcing this for both philanthropic and selfish reasons: we’d love for more people to use this technology so that they can tell us all the neat uses for it that we haven’t thought of yet. In exchange for their insight, we’re promising to stay active in terms of stewardship and contribution of our own improvements. Our primary tool for this will be the GitHub Issues/Pull Request mechanism. We’d considered a mailing list but that seems like overkill right now. If people love postgresql-hll, we’ll figure something out as needed.

Please feel free to get in touch with us about the code on GitHub and about the project in general in the comments here. We hope to release additional tools that allow seamless Java application integration with the raw hll data in the future, so stay tuned!

### Update

Looks Dimitri Fontaine wrote up a basic “how-to” post on using postgresql-hll here and another on unions here. (Thanks, Dimitri!) He brings up the issue that hll_add_agg() returns NULL when aggregating over an empty set when it should probably return an empty hll. Hopefully we’ll have a fix for that soon. You can follow the progress of the issue here.

## HyperLogLog++: Google’s Take On Engineering HLL

Matt Abrams recently pointed me to Google’s excellent paper “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm” [UPDATE: changed the link to the paper version without typos] and I thought I’d share my take on it and explain a few points that I had trouble getting through the first time. The paper offers a few interesting improvements that are worth noting:

1. Move to a 64-bit hash function
2. A new small-cardinality estimation regime
3. A sparse representation

I’ll take a look at these one at a time and share our experience with similar optimizations we’ve developed for a streaming (low latency, high throughput) environment.

### 32-bit vs. 64-bit hash function

I’ll motivate the move to a 64-bit hash function in the context of the original paper a bit more since the Google paper doesn’t really cover it except to note that they wanted to count billions of distinct values.

#### Some math

In the original HLL paper, a 32-bit hash function is required with the caveat that measuring cardinalities in the hundreds of millions or billions would become problematic because of hash collisions. Specifically, Flajolet et al. propose a “large range correction” for when the estimate $E$ is greater than $2^{32}/30$.  In this regime, they replace the usual HLL estimate by the estimate

$\displaystyle E^* := -2^{32} \mbox{log}\Big(1 - \frac{E}{2^{32}}\Big)$.

This reduces to a simple probabilistic argument that can be modeled with balls being dropped into bins. Say we have an $L$-bit hash. Each distinct value is a ball and each bin is designated by a value in the hash space.  Hence, you “randomly” drop a ball into a bin if the hashed value of the ball corresponds to the hash value attached to the bin.  Then, if we get an estimate $E$ for the cardinality, that means that (approximately) $E$ of our bins have values in them, and so there are $2^L - E$ empty bins.  The number of empty bins should be about $2^L e^{ - n/2^{L} }$, where $n$ is the number of balls.  Hence $2^{L} - E = 2^{L} e^{-n/2^{L}}$.  Solving this gives us the formula he recommends using: $-2^L \log(1 - \frac{E}{2^{L}})$.

Aside:  The empty bins expected value comes from the fact that

$E(\# \text{ of empty bins}) = m(1 - \frac{1}{m})^{n}$,

where $m$ is the number of bins and $n$ the number of balls.  This is pretty quick to show by induction.  Hence,

$\displaystyle E(\#\text{ of empty bins}) \sim m e^{-n/m}$ as $n \rightarrow \infty$.

Again, the general idea is that the $E$ ends up being some number smaller than $n$ because some of the balls are getting hashed to the same value.  The correction essentially doesn’t do anything in the case when $E$ is small compared to $2^L$ as you can see here. (Plotted is $-\log(1 - x)$, where $x$ represents $E / 2^L$, against the line $y = x$. The difference between the two graphs represents the difference between $E$ and $n$.)

#### A solution and a rebuttal

The natural move to start estimating cardinalities in the billions is to simply move to a larger hash space where the hash collision probability becomes negligibly small. This is fairly straightforward since most good hash functions give you at least 64-bits of entropy these days and it’s also the size of a machine word. There’s really no downside to moving to the larger hash space, from an engineering perspective. However, the Google researchers go one step further: they increase the register width by one bit (to 6), as well, ostensibly to be able to support the higher possible register values in the 64-bit setting. I contend this is unnecessary. If we look at the distribution of register values for a given cardinality, we see that it takes about a trillion elements before a 5-bit register overflows (at the black line):

The distributions above come from the LogLog paper, on page 611, right before formula 2. Look for $\mathbb{P}_{\nu}(M = k)$.

Consider the setting in the paper where $p = \log_2(m) = 14$. Let’s says we wanted to safely count into the 100 billion range. If we have $L = p + (2^5 - 1) = 14 + 31 = 45$ then our new “large range correction” boundary is roughly one trillion, per the adapted formula above. As you can see from the graph below, even at $p = 10, L = 41$ the large range correction only kicks in at a little under 100 billion!

The black line is the cutoff for a 5-bit register, and the points are plotted when the total number of hash bits required reaches 40, 50, and 60.

The real question though is all this practically useful? I would argue no: there are no internet phenomena that I know of that are producing more than tens of billions of distinct values, and there’s not even a practical way of empirically testing the accuracy of HLL at cardinalities above 100 billion. (Assuming you could insert 50 million unique, random hashed values per second, it would take half an hour to fill an HLL to 100 billion elements, and then you’d have to repeat that 5000 times as they do in the paper for a grand total of 4 months of compute time per cardinality in the range.)

[UPDATE: After talking with Marc Nunkesser (one of the authors) it seems that Google may have a legitimate need for both the 100 billion to trillion range right now, and even more later, so I retract my statement! For the rest of us mere mortals, maybe this section will be useful for picking whether or not you need five or six bits.]

At AK we’ve run a few hundred test runs at 1, 1.5, and 2 billion distinct values under the $p = 10-14, L = 41-45$ configuration range and found the relative error to be identical to that of lower cardinalities (in the millions of DVs). There’s really no reason to inflate the storage requirements for large cardinality HLLs by 20% simply because the hash space has expanded. Furthermore, you can do all kinds of simple tricks like storing an offset as metadata (which would only require at most 5 bits) for a whole HLL and storing the register values as the difference from that base offset, in order to make use of a larger hash space.

### Small Cardinality Estimation

Simply put, the paper introduces a second small range correction between the existing one (linear counting) and the “normal” harmonic mean estimator ($E$ in the original paper) in order to eliminate the “large” jump in relative error around the boundary between the two estimators.

They empirically calculate the bias of $E$ and create a lookup table for various $p$, for 200 values less than $5 \cdot 2^p$ with a correction to the overestimate of $E$. They interpolate between the 200 reference points to determine the correction to apply for any given raw $E$ value. Their plots give compelling evidence that this bias correction makes a difference in the $m$ to $5m$ cardinality range (cuts 95th percentile relative error from about 2% to 1.2%).

I’ve been a bit terse about this improvement since sadly it doesn’t help us at AK much because most of our data is Zipfian. Few of our reporting keys live in the narrow cardinality range they are optimizing: they either wallow in the linear counting range or shoot straight up into the normal estimator range.

Nonetheless, if you find you’re doing a lot of DV counting in this range, these corrections are pretty cheap to implement (as they’ve provided numerical values for all the cutoffs and bias corrections in the appendix.)

### Sparse representation

The general theme of this optimization isn’t particularly new (our friends at MetaMarkets mentioned it in this post): for smaller cardinality HLLs there’s no need to materialize all the registers. You can get away with just materializing the set registers in a map. The paper proposes a sorted map (by register index) with a hash map off the side to allow for fast insertions. Once the hash map reaches a certain size, its entries are batch-merged into the sorted list, and once the sorted list reaches the size of the materialized HLL, the whole thing is converted to the fully-materialized representation.

Aside: Though the hash map is a clever optimization, you could skip it if you didn’t want the added complexity. We’ve found that the simple sorted list version is extremely fast (hundreds of thousands of inserts per second). Also beware the variability of the the batched sort-and-merge cost every time the hash map repeatedly outgrows its limits and has to be merged into the sorted list. This can add significant latency spikes to a streaming system, whereas a one-by-one insertion sort to a sorted list will be slower but less variable.

The next bit is very clever: they increase $p$ when the HLL is in the sparse representation because of the saved space. Since they’re already storing entries in 32-bit integers, they increase $p$ to $p^{\prime} = 31 - \mbox{regWidth} = 31 - 6 = 25$. (I’ll get to the leftover bit in a second!) This gives them increased precision which they can simply “fold” down when converting from the sparse to fully materialized representation. Even more clever is their trick of not having to always store the full register value as the value of an entry in the map. Instead, if the longer register index encodes enough bits to determine the value, they use the leftover bit I mentioned before to signal as much.

In the diagram, $p$ and $p^{\prime}$ are as in the Google paper, and $q$ and $q^{\prime}$ are the number of bits that need to be examined to determine $\rho$ for either the $p$ or $p^{\prime}$ regime. I encourage you to read section 5.3.3 as well as EncodeHash and DecodeHash in Figure 8 to see the whole thing. [UPDATE: removed the typo section as it has been fixed in the most recent version of the paper (linked at the top)]

The paper also tacks on a difference encoding (which works very well because it’s a sorted list) and a variable length encoding to the sparse representation to further shrink the storage needed, so that the HLL can use the increased register count, $p^{\prime}$, for longer before reverting to the fully materialized representation at $p$. There’s not much to say about it because it seems to work very well, based on their plots, but I’ll note that in no way is that type of encoding suitable for streaming or “real-time” applications. The encode/decode overhead simply takes an already slow (relative to the fully materialized representation) sparse format and adds more CPU overhead.

### Conclusion

The researchers at Google have done a great job with this paper, meaningfully tackling some hard engineering problems and showing some real cleverness. Most of the optimizations proposed work very well in a database context, where the HLLs are either being used as temporary aggregators or are being stored as read-only data, however some of the optimizations aren’t suitable for streaming/”real-time” work. On a more personal note, it’s very refreshing to see real algorithmic engineering work being published from industry. Rob, Matt, and I just got back from New Orleans and SODA / ALENEX / ANALCO and were hoping to see more work in this area, and Google sure did deliver!

### Appendix

Sebastiano Vigna brought up the point that 6-bit registers are necessary for counting above 4 billion in the comments. I addressed it in the original post (see “A solution and a rebuttal“) but I’ll lay out the math in a bit more detail here to show that you can easily count above 4 billion with only 5-bit registers.

If you examine the original LogLog paper (the same as mentioned above) you’ll see that the register distribution for LogLog (and HyperLogLog consequently) registers is

$\displaystyle \mathbb{P}_{\nu}(M > k) = 1 - \mathbb{P}_{\nu}(M \le k) = 1 - \Big(1 - \frac{1}{2^k}\Big)^{\nu}$

where $k$ is the register value and $\nu$ is the number of (distinct) elements a register has seen.

So, I assert that 5 bits for a register (which allows the maximum value to be 31) is enough to count to ten billion without any special tricks. Let’s fix $p=14$ and say we insert $10^{10}$ distinct elements. That means, any given register will see about $\frac{10^{10}}{2^p} = \frac{10^{10}}{2^{14}} = \approx 6.1 \times 10^5$ elements assuming we have a decent hash function. So, the probability that a given register will have a value greater than 31 is (per the LogLog formula given above)

$\displaystyle \mathbb{P}_{\nu}(M > 31) = 1 - \mathbb{P}_{\nu}(M \le 31) = 1 - \Big(1 - \frac{1}{2^{31}}\Big)^{6.1 \times 10^5} \approx 0.00028$

and hence the expected number of registers that would overflow is approximately $2^{14} \times 0.00028 \approx 4.5$. So five registers out of sixteen thousand would overflow. I am skeptical that this would meaningfully affect the cardinality computation. In fact, I ran a few tests to verify this and found that the average number of registers with values greater than 31 was 4.5 and the relative error had the same standard deviation as that predicted by the paper, $1.04/\sqrt{m}$.

For argument, let’s assume that you find those five overflowed registers to be unacceptable. I argue that you could maintain an offset in 5 bits for the whole HLL that would allow you to still use 5 bit registers but exactly store the value of every register with extremely high probability. I claim that with overwhelmingly high probability, every register the HLL used above is greater than 15 and less than or equal to 40. Again, we can use the same distribution as above and we find that the probability of a given register being outside those bounds is

$\mathbb{P}_{\nu}(M < 15) \approx 10^{-162}$ and

$\mathbb{P}_{\nu}(M > 40) \approx 10^{-7}$.

Effectively, there are no register values outside of $[15,40]$. Now I know that I can just store 15 in my offset and the true value minus the offset (which now fits in 5 bits) in the actual registers.

## HLL Intersections

### Why?

The intersection of two streams (of user ids) is a particularly important business need in the advertising industry. For instance, if you want to reach suburban moms but the cost of targeting those women on a particular inventory provider is too high, you might want to know about a cheaper inventory provider whose audience overlaps with the first provider. You may also want to know how heavily your car-purchaser audience overlaps with a certain metro area or a particular income range. These types of operations are critical for understanding where and how well you’re spending your advertising budget.

As we’ve seen before, HyperLogLog provides a time- and memory-efficient algorithm for estimating the number of distinct values in a stream. For the past two years, we’ve been using HLL at AK to do just that: count the number of unique users in a stream of ad impressions. Conveniently, HLL also supports the union operator ( $\cup$ ) allowing us to trivially estimate the distinct value count of any composition of streams without sacrificing precision or accuracy. This piqued our interest because if we can “losslessly” compute the union of two streams and produce low-error cardinality estimates, then there’s a chance we can use that estimate along with the inclusion-exclusion principle to produce “directionally correct” cardinality estimates of the intersection of two streams. (To be clear, when I say “directionally correct” my criteria is “can an advertiser make a decision off of this number?”, or “can it point them in the right direction for further research?”. This often means that we can tolerate relative errors of up to 50%.)

The goals were:

1. Get a grasp on the theoretical error bounds of intersections done with HLLs, and
2. Come up with heuristic bounds around $m$, $overlap$, and the set cardinalities that could inform our usage of HLL intersections in the AK product.

Quick terminology review:

• If I have set of integers $A$, I’m going to call the HLL representing it $H_{A}$.
• If I have HLLs $H_{A}, H_{B}$ and their union $H_{A \cup B}$, then I’m going to call the intersection cardinality estimate produced $|H_{A \cap B}|$.
• Define the $overlap$ between two sets as $overlap(A, B) := \frac{|A \cap B|}{min(|A|, |B|)}$.
• Define the cardinality ratio $\frac{max(|A|, |B|)}{min(|A|, |B|)}$ as a shorthand for the relative cardinality of the two sets.
• We’ll represent the absolute error of an observation $|H_{A}|$ as $\Delta |H_{A}|$.

That should be enough for those following our recent posts, but for those just jumping in, check out Appendices A and B at the bottom of the post for a more thorough review of the set terminology and error terminology.

### Experiment

We fixed 16 $overlap$ values (0.0001, 0.001, 0.01, 0.02, 0.05, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0) and 12 set cardinalities (100M, 50M, 10M, 5M, 1M, 500K, 100K, 50K, 10K, 5K, 1K, 300) and did 100 runs of each permutation of $(overlap, |A|, |B|)$. A random stream of 64-bit integers hashed with Murmur3 was used to create the two sets such that they shared exactly $min(|A|,|B|) \cdot overlap = |A \cap B|$  elements. We then built the corresponding HLLs $H_{A}$ and $H_{B}$ for those sets and calculated the intersection cardinality estimate $|H_{A} \cap H_{B}|$ and computed its relative error.

Given that we could only generate and insert about 2M elements/second per core, doing runs with set cardinalities greater than 100M was quickly ruled out for this blog post. However, I can assure you the results hold for much larger sets (up to the multiple billion-element range) as long as you heed the advice below.

### Results

This first group of plots has a lot going on, so I’ll preface it by saying that it’s just here to give you a general feeling for what’s going on. First note that within each group of boxplots $overlap$ increases from left to right (orange to green to pink), and within each plot cardinality ratio increases from left to right. Also note that the y-axis (the relative error of the observation) is log-base-10 scale. You can clearly see that as the set sizes diverge, the error skyrockets for all but the most similar (in both cardinality and composition) sets. I’ve drawn in a horizontal line at 50% relative error to make it easier to see what falls under the “directionally correct” criteria. You can (and should) click for a full-sized image.

Note: the error bars narrow as we progress further to the right because there are fewer observations with very large cardinality ratios. This is an artifact of the experimental design.

A few things jump out immediately:

• For cardinality ratio > 500, the majority of observations have many thousands of percent error.
• When cardinality ratio is smaller than that and $overlap > 0.4$, register count has little effect on error, which stays very low.
• When $overlap \le 0.01$, register count has little effect on error, which stays very high.

Just eyeballing this, the lesson I get is that computing intersection cardinality with small error (relative to the true value) is difficult and only works within certain constraints. Specifically,

1. $\frac{|A|}{|B|} < 100$, and
2. $overlap(A, B) = \frac{|A \cap B|}{min(|A|, |B|)} \ge 0.05$.

The intuition behind this is very simple: if the error of any one of the terms in your calculation is roughly as large as the true value of the result then you’re not going to estimate that result well. Let’s look back at the intersection cardinality formula. The left-hand side (that we are trying to estimate) is a “small” value, usually. The three terms on the right tend to be “large” (or at least “larger”) values. If any of the “large” terms has error as large as the left-hand side we’re out of luck.

So, let’s say you can compute the cardinality of an HLL with relative error of a few percent. If $|H_{A}|$ is two orders of magnitude smaller than $|H_{B}|$, then the error alone of $|H_{B}|$ is roughly as large as $|A|$.

$|A \cap B| \le |A|$ by definition, so

$|A \cap B| \le |A| \approx |H_{A}| \approx \Delta |H_{B}|$.

In the best scenario, where $A \cap B = A$, the errors of $|H_{B}|$ and $|H_{A \cup B}| \approx |H_{B}|$ are both roughly the same size as what you’re trying to measure. Furthermore, even if $|A| \approx |B|$ but the overlap is very small, then $|A \cap B|$  will be roughly as large as the error of all three right-hand terms.

### On the bubble

Let’s throw out the permutations whose error bounds clearly don’t support “directionally correct” answers ($overlap < 0.01$ and $\frac{|A|}{|B|} > 500$) and those that trivially do ($overlap > 0.4$) so we can focus more closely on the observations that are “on the bubble”. Sadly, these plots exhibit a good deal of variance inherent in their smaller sample size. Ideally we’d have tens of thousands of runs of each combination, but for now this rough sketch will hopefully be useful. (Apologies for the inconsistent colors between the two plots. It’s a real bear to coordinate these things in R.) Again, please click through for a larger, clearer image.

By doubling the number of registers, the variance of the relative error falls by about a quarter and moves the median relative error down (closer to zero) by 10-20 points.

In general, we’ve seen that the following cutoffs perform pretty well, practically. Note that some of these aren’t reflected too clearly in the plots because of the smaller sample sizes.

Register Count Data Structure Size Overlap Cutoff Cardinality Ratio Cutoff
8192 5kB 0.05 10
16384 10kB 0.05 20
32768 20kB 0.05 30
65536 41kB 0.05 100

### Error Estimation

To get a theoretical formulation of the error envelope for intersection, in terms of the two set sizes and their overlap, I tried the first and simplest error propagation technique I learned. For variables $Y, Z, ...$, and $X$ a linear combination of those (independent) variables, we have

$\Delta X = \sqrt{ (\Delta Y)^2 + (\Delta Z)^2 + ...}$

Applied to the inclusion-exclusion formula:

$\begin{array}{ll} \displaystyle \Delta |H_{A \cap B}| &= \sqrt{ (\Delta |H_{A}|)^2 + (\Delta |H_{B}|)^2 + (\Delta |H_{A \cup B}|)^2} \\ &= \sqrt{ (\sigma\cdot |A|)^2 + (\sigma\cdot |B|)^2 + (\sigma\cdot |A \cup B|)^2} \end{array}$

where

$\sigma = \frac{1.04}{\sqrt{m}}$ as in section 4 (“Discussion”) of the HLL paper.

Aside: Clearly $|H_{A \cup B}|$ is not independent of $|H_{A}| + |H_{B}|$, though $|H_{A}|$ is likely independent of $|H_{B}|$. However, I do not know how to a priori calculate the covariance in order to use an error propagation model for dependent variables. If you do, please pipe up in the comments!

I’ve plotted this error envelope against the relative error of the observations from HLLs with 8192 registers (approximately 5kB data structure).

Despite the crudeness of the method, it provided a 95% error envelope for the data without significant differences across cardinality ratio or $overlap$. Specifically, at least 95% of observations satisfied

$(|H_{A \cap B}| - |A \cap B|) < \Delta |H_{A \cap B}|$

However, it’s really only useful in the ranges shown in the table above. Past those cutoffs the bound becomes too loose and isn’t very useful.

This is particularly convenient because you can tune the number of registers you need to allocate based on the most common intersection sizes/overlaps you see in your application. Obviously, I’d recommend everyone run these tests and do this analysis on their business data, and not on some contrived setup for a blog post. We’ve definitely seen that we can get away with far less memory usage than expected to successfully power our features, simply because we tuned and experimented with our most common use cases.

### Next Steps

We hope to improve the intersection cardinality result by finding alternatives to the inclusion-exclusion formula. We’ve tried a few different approaches, mostly centered around the idea of treating the register collections themselves as sets, and in my next post we’ll dive into those and other failed attempts!

### Appendix A: A Review Of Sets

Let’s say we have two streams of user ids, $S_{A}$ and $S_{B}$. Take a snapshot of the unique elements in those streams as sets and call them $A$ and $B$. In the standard notation, we’ll represent the cardinality, or number of elements, of each set as $|A|$ and $|B|$.

Example: If $A = \{1,2,10\}$ then $|A| = 3$.

If I wanted to represent the unique elements in both of those sets combined as another set I would be performing the union, which is represented by $A \cup B$.

Example: If $A = \{1,2,3\}, B=\{2,3,4\}$ then $A \cup B = \{1,2,3,4\}$.

If I wanted to represent the unique elements that appear in both $A$ and $B$ I would be performing the intersection, which is represented by $A \cap B$.

Example: With $A, B$ as above, $A \cap B = \{2,3\}$.

The relationship between the union’s cardinality and the intersection’s cardinality is given by the inclusion-exclusion principle. (We’ll only be looking at the two-set version in this post.) For reference, the two-way inclusion-exclusion formula is $|A \cap B| = |A| + |B| - |A \cup B|$.

Example: With $A, B$ as above, we see that $|A \cap B| = 2$ and $|A| + |B| - |A \cup B| = 3 + 3 - 4 = 2$.

For convenience we’ll define the $overlap$ between two sets as $overlap(A, B) := \frac{|A \cap B|}{min(|A|, |B|)}$.

Example: With $A, B$ as above, $overlap(A,B) = \frac{|A \cap B|}{min(|A|, |B|)} = \frac{2}{min(3,3)} = \frac{2}{3}$.

Similarly, for convenience, we’ll define the cardinality ratio $\frac{max(|A|, |B|)}{min(|A|, |B|)}$ as a shorthand for the relative cardinality of the two sets.

The examples and operators shown above are all relevant for exact, true values. However, HLLs do not provide exact answers to the set cardinality question. They offer estimates of the cardinality along with certain error guarantees about those estimates. In order to differentiate between the two, we introduce HLL-specific operators.

Consider a set $A$. Call the HLL constructed from this set’s elements $H_{A}$. The cardinality estimate given by the HLL algorithm for $H_{A}$ is $|H_{A}|$.

Define the union of two HLLs $H_{A} \cup H_{B} := H_{A \cup B}$, which is also the same as the HLL created by taking the pairwise max of $H_{A}$‘s and $H_{B}$‘s registers.

Finally, define the intersection cardinality of two HLLs in the obvious way: $|H_{A} \cap H_{B}| := |H_{A}| + |H_{B}| - |H_{A \cup B}|$. (This is simply the inclusion-exclusion formula for two sets with the cardinality estimates instead of the true values.)

### Appendix B: A (Very Brief) Review of Error

The simplest way of understanding the error of an estimate is simply “how far is it from the truth?”. That is, what is the difference in value between the estimate and the exact value, also known as the absolute error.

However, that’s only useful if you’re only measuring a single thing over and over again. The primary criteria for judging the utility of HLL intersections is relative error because we are trying to measure intersections of many different sizes. In order to get an apples-to-apples comparison of the efficacy of our method, we normalize the absolute error by the true size of the intersection. So, for some observation $\hat{x}$ whose exact value is non-zero $x$, we say that the relative error of the observation is $\frac{x-\hat{x}}{x}$. That is, “by what percentage off the true value is the observation off?”

Example: If $|A| = 100, |H_{A}| = 90$ then the relative error is $\frac{100 - 90}{100} = \frac{10}{100} = 10\%$.

## HLLs and Polluted Registers

### Introduction

It’s worth thinking about how things can go wrong, and what the implications of such occurrences might be. In this post, I’ll be taking a look at the HyperLogLog (HLL) algorithm for cardinality estimation, which we’ve discussed before.

### The Setup

HLLs have the property that their register values increase monotonically as they run. The basic update rule is:

for item in stream:
index, proposed_value = process_hashed_item(hash(item))
hll.registers[index] = max(hll.registers[index], proposed_value)


There’s an obvious vulnerability here: what happens to your counts if you get pathological data that blows up a register value to some really large number? These values are never allowed to decrease according to the vanilla algorithm. How much of a beating can these sketches take from such pathological data before their estimates are wholly unreliable?

### Experiment The First

To get some sense of this, I took a 1024 bucket HLL, ran a stream through it, and then computed the error in the estimate. I then proceeded to randomly choose a register, max it out, and compute the error again. I repeated this process until I had maxed out 10% of the registers. In pseudo-python:

print("n_registers_touched,relative_error")
print(0, relative_error(hll.cardinality(), stream_size), sep = ",")
for index, reg in random.sample(range(1024), num_to_edit):
hll.registers[reg] = 32
print(index + 1, relative_error(hll.cardinality(), stream_size), sep = ",")


In practice, HLL registers are fixed to be a certain bit width. In our case, registers are 5 bits wide, as this allows us to count runs of 0s up to length 32. This allows us to count astronomically high in a 1024 register HLL.

Repeating this for many trials, and stream sizes of 100k, 1M, and 10M, we have the following picture. The green line is the best fit line.

What we see is actually pretty reassuring. Roughly speaking, totally poisoning x% of registers results in about an x% error in your cardinality estimate. For example, here are the error means and variances across all the trials for the 1M element stream:

Number of Registers Modified Percentage of Registers Modified Error Mean Error Variance
0 0 -0.0005806094 0.001271119
10 0.97% 0.0094607324 0.001300780
20 1.9% 0.0194860481 0.001356282
30 2.9% 0.0297495396 0.001381753
40 3.9% 0.0395013208 0.001436418
50 4.9% 0.0494727527 0.001460182
60 5.9% 0.0600436774 0.001525749
70 6.8% 0.0706375356 0.001522320
80 7.8% 0.0826034639 0.001599104
90 8.8% 0.0937465662 0.001587156
100 9.8% 0.1060810958 0.001600348

### Initial Reactions

I was actually not too surprised to see that the induced error was modest when only a small fraction of the registers were poisoned. Along with some other machinery, the HLL algorithm uses the harmonic mean of the individual register estimates when computing its guess for the number of distinct values in the data stream. The harmonic mean does a very nice job of downweighting values that are significantly larger than others in the set under consideration:

In [1]: from scipy.stats import hmean

In [2]: from numpy import mean

In [3]: f = [1] * 100000 + [1000000000]

In [4]: mean(f)
Out[4]: 10000.899991000089

In [5]: hmean(f)
Out[5]: 1.0000099999999899


It is this property that provides protection against totally wrecking the sketch’s estimate when we blow up a fairly small fraction of the registers.

### Experiment The Second

Of course, the algorithm can only hold out so long. While I was not surprised by the modesty of the error, I was very surprised by how linear the error growth was in the first figure. I ran the same experiment, but instead of stopping at 10% of the registers, I went all the way to the end. This time, I have plotted the results with a log-scaled y-axis:

Note that some experiments appear to start after others. This is due to missing data from taking the logarithm of negative errors.

Without getting overly formal in our analysis, there are roughly three phases in error growth here. At first, it’s sublinear on the log-scale, then linear, then superlinear. This roughly corresponds to “slow”, “exponential”, and “really, really, fast”. As our mathemagician-in-residence points out, the error will grow roughly as p/(1-p) where p is the fraction of polluted registers. The derivation of this isn’t too hard to work out, if you want to give it a shot! The implication of this little formula matches exactly what we see above. When p is small, the denominator does not change much, and the error grows roughly linearly. As p approaches 1, the error begins to grow super-exponentially. Isn’t it nice when experiment matches theory?

### Final Thoughts

It’s certainly nice to see that the estimates produced by HLLs are not overly vulnerable to a few errant register hits. As is often the case with this sort of analysis, the academic point must be put in balance with the practical. The chance of maxing out even a single register under normal operation is vanishingly small, assuming you chose a sane hash function for your keys. If I was running an HLL in the wild, and saw that 10% of my registers were pegged, my first thought would be “What is going wrong with my system!?” and not “Oh, well, at least I know my estimate to within 10%!” I would be disinclined to trust the whole data set until I got a better sense of what caused the blowups, and why I should give any credence at all to the supposedly unpolluted registers.

## Doubling the Size of an HLL Dynamically – Recovery

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm. See Matt’s overview of HLL, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the first in a series of three posts on doubling the size of HLLs – the next two will be about set operations and utilizing additional bits of data, respectively.

### Overview

In this post, we explore the error of the cardinality estimate of an HLL whose size has been doubled using several different fill techniques. Specifically, we’re looking at how the error changes as additional keys are seen by the HLL.

#### A Quick Reminder – Terminology and Fill Strategies

If we have an HLL of size $2^n$ and we double it to be an HLL of size $2^{n+1}$, we call two bins “partners” if their bin number differs by $2^n$.  For example, in an HLL double to be size $8$, the bins $1$ and $5$ are partners, as are $2$ and $6$, etc. The Zeroes doubling strategy fills in the newly created bins with zeroes. The Concatenate strategy fills in the newly created bins with the values of their partner bins. MinusTwo fills in each bin with two less than its partner bin’s value. RE fills in the newly created bins according to the empirical distribution of each bin.

### Some Sample Recovery Paths

Below, we ran four experiments to check recovery time. Each experiment consisted of running an HLL of size 210 on 500,000 unique hashed keys (modeled here using a random number generator), doubling the HLL to be size 211, and then ran 500,000 more hashed keys through the HLL. Below, we have graphs showing how the error decreases as more keys are added.  Both graphs show the same data (the only difference being the scale on the y-axis). We have also graphed “Large,” an HLL of size $2^{11}$, and “Small,” an HLL of size $2^{10}$, which are shown only for comparison and are never doubled.

One thing to note about the graphs is that the error is relative.

Notice that Concatenate and Zeroes perform particularly poorly. Even after 500,000 extra keys have been added, they still don’t come within 5% of the true value! For Zeroes, this isn’t too surprising. Clearly the initial error of Zeroes, that is the error immediately after doubling, should be high.  A quick look at the harmonic mean shows why this occurs. If a single bin has a zero as its value, the harmonic mean of the values in the bins will be zero. Essentially, the harmonic mean of a list always tends towards the lowest elements of the list. Hence, even after all the zeroes have been replaced with positive values, the cardinality estimate will be very low.

On the other hand, a more surprising result is that Concatenate gives such a poor guess. To see this we need to look at the formula for the estimate again.  The formula for the cardinality estimate is $\frac{\alpha_m m^2}{\sum_{i=1}^{m} 2^{-M_i}}$ where $M_i$ is the value in the $i^{th}$ bin, $m$ is the number of bins, and $\alpha_m$ is a constant approaching about $.72$. For Concatenate, the value $M_{i + m}$ is equal to $M_i$.  Hence we have that the cardinality estimate for Concatenate is:

$\begin{array}{ll}\displaystyle\frac{\alpha_{2m} (2m)^2}{\left(\displaystyle\sum_{i=1}^{2m} 2^{-M_i}\right)} \vspace{10pt}&\approx \displaystyle\frac{.72\cdot 4\cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) + \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) }\vspace{10pt} \\&\displaystyle= \displaystyle 4\cdot \frac{.72 \cdot m^2}{2\cdot \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\vspace{10pt}\\&= \displaystyle 2 \cdot \frac{.72 \cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\vspace{10pt}\\&\approx \displaystyle 2 \cdot \frac{ \alpha_m \cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\end{array}$

Notice that this last term is about equal to 2 times the cardinality estimate of the HLL before doubling. One quick thing that we can take away from this is that it is unlikely for two “partner” bins to have the same value in them (since if this happens frequently, we get an estimate close to that given by Concatenate – which is very inaccurate!).

As for MinusTwo and RE, these have small initial error and the error only falls afterwards. The initial error is small since the rules for these give guesses approximately equal to the guess of the original HLL before doubling. From here, the error should continue to shrink, and eventually, it should match that of the large HLL.

One thing we noticed was that error for Concatenate in the graph above suggested that the absolute error wasn’t decreasing at all. To check this we looked at the trials and, sure enough, the absolute error stays pretty flat. Essentially, Concatenate overestimates pretty badly, and puts the HLL in a state where it thinks it has seen twice as many keys as it actually has. In the short term, it will continue to make estimates as if it has seen 500,000 extra keys. We can see this clearly in the graphs below.

### Recovery Time Data

I also ran 100 experiments where we doubled the HLLs after adding 500,000 keys, then continued to add keys until the cardinality estimate fell within 5% of the true cardinality.  The HLLs were set up to stop running at 2,000,000 keys if they hadn’t arrived at the error bound.

Notice how badly Concatenate did! In no trials did it make it under 5% error. Zeroes did poorly as well, though it did recover eventually. My guess here is that the harmonic mean had a bit to do with this – any bin with a low value, call it $k$, in it would pull the estimate down to be about $m^2 \cdot 2^k$. As a result, the estimate produced by the Zeroes HLL will remain depressed until every bin is hit with a(n unlikely) high value. Zeroes and Concatenate should not do well since essentially the initial estimate (after doubling) of each HLL is off by a very large fixed amount. The graph of absolute errors, above, shows this.

On the other hand, RE and MinusTwo performed fairly well. Certainly, RE looks better in terms of median and middle 50%, though its variance is much higher than MinusTwo‘s.This should make sense as we are injecting a lot of randomness into RE when we fill in the values, whereas MinusTwo‘s bins are filled in deterministically.

### Recovery Time As A Function of Size

One might wonder whether the recovery time of MinusTwo and RE depend on the size of the HLL before the doubling process. To get a quick view of whether or not this is true, we did 1,000 trials like those above but by adding 200K, 400K, 600K, 800K, 1M keys and with a a cutoff of 3% this time. Below, we have the box plots for the data for each of these. The headings of each graph gives the size of the HLL before doubling, and the y-axis gives the fractional recovery time (the true recovery time divided by the size of the HLL before doubling).

Notice that, for each doubling rule, there is almost no variation between each of the plots. This suggests that the size of the HLL before doubling doesn’t change the fractional recovery time. As a side note, one thing that we found really surprising is that RE is no longer king – MinusTwo has a slightly better average case. We think that this is just a factor of the higher variation of RE and the change in cutoff.

### Summary

Of the four rules, MinusTwo and RE are clearly the best. Both take about 50 – 75% more keys after doubling to get within 3% error, and both are recover extremely quickly if you ask for them to only get within 5% error.

To leave you with one last little brainteaser, an HLL of size $2^{10}$, which is then doubled, will eventually have the same values in its bins as an HLL of size $2^{11}$ which ran on the same data. About how long will it take for these HLLs to converge? One (weak) requirement for this to happen is to have the value in every bin of both HLLs be changed. To get an upper bound on how long this should take, one should read about the coupon collector problem.

## Doubling the Size of an HLL Dynamically

### Introduction

In my last post, I explained how to halve the number of bins used in an HLL as a way to allow set operations between that HLL and smaller HLLs.  Unfortunately, the accuracy of an HLL is tied to the number of bins used, so one major drawback with this “folding” method is that each time you have the number of bins, you reduce that accuracy by a factor of $\sqrt{2}$.

In this series of posts I’ll focus on the opposite line of thinking: given an HLL, can one double the number of bins, assigning the new bins values according to some strategy, and recover some of the accuracy that a larger HLL would have had?  Certainly, one shouldn’t be able to do this (short of creating a new algorithm for counting distinct values) since once we use the HLL on a dataset the extra information that a larger HLL would have gleaned is gone.  We can’t recover it and so we can’t expect to magically pull a better estimate out of thin air (assuming Flajolet et al. have done their homework properly and the algorithm makes the best possible guess with the given information – which is a pretty good bet!).  Instead, in this series of posts, I’ll focus on how doubling plays with recovery time and set operations.  By this, I mean the following:  Suppose we have an HLL of size 2n and while its running, we double it to be an HLL of size 2n+1. Initially, this may have huge error, but if we allow it to continue running, how long will it take for its error to be relatively small?  I’ll also discuss some ways of modifying the algorithm to carry slightly more information.

### The Candidates

Before we begin, a quick piece of terminology.  Suppose we have an HLL of size 2n and we double it to be an HLL of size 2 n+1.  We consider two bins to be partners if their bin numbers differ by 2n.  To see why this is important – check the post on HLL folding.

Colin and I did some thinking and came up with a few naive strategies to fill in the newly created bins after the doubling. I’ve provided a basic outline of the strategies below.

• Zeroes – Fill in with zeroes.
• Concatenate – Fill in each bin with the value of its partner.
• MinusTwo – Fill in each bin with the value of its partner minus two. Two may seem like an arbitrary amount, but quick look at the formulas involved in the algorithm show that this leaves the cardinality estimate approximately unchanged.
• RandomEstimate (RE) – Fill in each bin according to its probability distribution. I’ll describe more about this later.
• ProportionDouble (PD) – This strategy is only for use with set operations. We estimate the number of bins in the two HLLs which should have the same value, filling in the second half so that that proportion holds and the rest are filled in according to RE.

#### Nitty Gritty of RE

The first three strategies given above are pretty self-explanatory, but the last two are a bit more complicated. To understand these, one needs to understand the distribution of values in a given bin.  In the original paper, Flajolet et al. calculate the probability that a given bin takes the value $k$ to be given by $(1 - 0.5^k)^v - (1 - 0.5^{k-1})^v$ where $v$ is the number of keys that the bin has seen so far. Of course, we don’t know this value ($v$) exactly, but we can easily estimate it by dividing the cardinality estimate by the number of bins. However, we have even more information than this. When choosing a value for our doubled HLL, we know that that value cannot exceed its partner’s value. To understand why this is so, look back at my post on folding, and notice how the value in the partner bins in a larger HLL correspond to the value in the related bin in the smaller HLL.

Hence, to get the distribution for the value in a given bin, we take the original distribution, chop it off at the relevant value, and rescale it to have total area 1. This may seem kind of hokey but let’s quickly look at a toy example. Suppose you ask me to guess a number between 1 and 10, and you will try to guess which number I picked. At this moment, assuming I’m a reasonable random number generator, there is a $1/10$ chance that I chose the number one, a $1/10$ chance that I chose the number two, etc. However, if I tell you that my guess is no larger than two, you can now say there there is a $1/2$ chance that my guess is a one, a $1/2$ chance that my guess is a two, and there is no chance that my guess is larger. So what happened here? We took the original probability distribution, used our knowledge to cut off and ignore the values above the maximum possible value, and then rescaled them so that the sum of the possible probabilities is equal to zero.

RE consists simply of finding this distribution, picking a value according to it, and placing that value in the relevant bin.

#### Nitty Gritty of PD

Recall that we only use PD for set operations. One thing we found was that the accuracy of doubling with set operations according to RE is highly dependent on the the intersection size of the two HLLs. To account for this, we examine the fraction of bins in the two HLLs which contain the same value, and then we force the doubled HLL to preserve this fraction

So how do we do this? Let’s say we have two HLLs: $H$ and $G$. We wish to double $H$ before taking its union with $G$. To estimate the proportion of their intersection, make a copy of $G$ and fold it to be the same size as $H$. Then count the number of bins where $G$ and $H$ agree, call this number $a$. Then if $m$ is the number of bins in $H$, we can estimate that $H$ and $G$ should overlap in about $a/m$ bins. Then for each bin, with probability $a/m$ we fill in the bin with the the minimum of the relevant bin from $G$ and that bin’s partner in $G$. With probability $1 - a/m$ we fill in the bin according to the rules of RE.