## Sketch of the Day: Frugal Streaming

We are always worried about the size of our sketches. At AK we like to count stuff, and if you can count stuff with smaller sketches then you can count more stuff! We constantly have conversations about how we might be able to make our sketches, and subsequently our datastores, smaller. During our science summit, Muthu pointed us at some of the new work in Frugal Streaming. The concept of Frugal Streaming is to process your data as it comes, O(N), but severely limit the amount of memory you are using for your sketch, and by “severely” they mean using perhaps one or two pieces of information. Needless to say, we were interested!

### History

The concept of Frugal Streaming reminds me of an algorithm for finding the dominant item in a stream, MJRTY written by Boyer and Moore in 1980. (This paper has a very interesting history). The MJRTY algorithm sets out to solve the problem of finding the majority element in a stream (an element comprising more than 50% of the stream). Moore proposed to solve this by using only 2 pieces of information and a single scan of the data.

Imagine you have a stream of names (“matt”, “timon”, “matt”, “matt”, “rob”, “ben”, …) and you wanted to know if any name appeared in more than half the stream. Boyer and Moore proposed the following:

count = 0
majority = ""

for val in stream:
if count == 0:
majority = val
count = 1
elif val == majority:
count += 1
else:
count -= 1

print majority if count > 0 else "no majority!"


If you’re anything like me you will go through a few phases: “That can’t work!”, “Wait, that works?!”, “Nah, this data will break that”, “Holy sh*t that works!!”. If you think about it for a minute you realize that it HAS to work. If any element in the stream comprises more than half the stream values there is no way to get to the end and have a counter of zero. To convince yourself suppose the majority element only comprises the first half + 1 of your N-length stream. The counter would count up to $N/2+1$ and then start decrementing until all N values have been seen, which would leave the counter at $2 = (N/2+1) - (N/2-1)$*. This will hold regardless of the ordering of the elements in the stream. A simple simulation is provided by the authors. Philippe Flajolet apparently “liked this algorithm very much and called it the ‘gang war’, because in his mind, every element is a gang member, and members of opposite gangs are paired in a standoff, and shoot each other. The gang members remaining are the more numerous”**.

The astute among you will have noticed that this algorithm only works if there is, in fact, a majority element. If there is not then it will fail. A stream such as {“matt”,”matt”,”timon”,”timon”,”rob”} would result in the algorithm returning “rob”. In practice you need ways of ensuring that your stream does indeed have a majority element or have a guarantee ahead of time.

* Note, that formula is for an even length stream. For a stream of odd length the counter will indeed be at 1. Proof is left to the reader.

** Thanks to Jeremie Lumbroso for his insightful feedback on the history of MJRTY and his memory of Philippe’s explanation.

### One “bit” – Frugal-1U

In their Frugal Streaming paper, Qiang and Muthu decided to see if they could find a frugal solution to the streaming quantile problem. The streaming quantiles problem is one I enjoy quite a bit and I have used it as an interview question for potential candidates for some time. Simply stated it is: “How would you find the quantiles of a stream of numbers in O(N) with limited memory?” There are a few different approaches to solve this problem, with the most widely known probably being Count-Min Sketch. However, with Count-Min you need to keep your keys around in order to query the sketch. There are other approaches to this question as well.

Instead of focusing on quantiles generally, Qiang and Muthu’s first solution is a frugal way to find the median of a stream. As with MJRTY above, I’ll just write it down and see how you react:

median_est = 0
for val in stream:
if val > median_est:
median_est += 1
elif val < median_est:
median_est -= 1


Granted, the above is just for the median, where the stream is much longer than the value of the median, but it is so simple that I don’t think I would have ever considered this approach to the problem. The extension to quantiles is equally as shocking. If you are trying to find the 75th percentile of the data stream you do the same as above but increment up randomly 75% of the time and decrement down randomly 25% of the time:


quantile_75 = 0
for val in stream:
r = random()
if val > quantile_75 and r > 1 - 0.75:
quantile_75 += 1
elif val < quantile_75 and r > 0.75:
quantile_75 -= 1


As the paper states:

The trick to generalize median estimation to any $\frac{h}{k}$ -quantile estimation is that not every stream item seen will cause an update. If the current stream item is larger than estimation, an increment update will be triggered only with probability $\frac{h}{k}$. The rationale behind it is that if we are estimating $\frac{h}{k}$ -quantile, and if the current estimate is at stream’s true $\frac{h}{k}$ -quantile, we will expect to see stream items larger than the current estimate with probability $1-\frac{h}{k}$ .

### Finding Quantiles With Two “bits”- Frugal-2U

There are a few obvious drawbacks to the above algorithm. Since we are only incrementing by 1 each time, the time to converge is linear and our initial guess of zero could be very bad. Secondly, and by design, the algorithm has no memory, can fluctuate wildly and, as they show in the paper, the estimates can drift very far away. (Note: while it is extremely unlikely that the estimates will drift far from the correct values the paper has some very loose bounds on how bad it can drift. See Lemma 3 in the paper.) They suggest a few improvements over Frugal-1U where you essentially include a varying step (instead of just incrementing by 1 every time) and 1 “bit” so you know which way you incremented in the last update. The intuition here is that if you have been consistently incrementing up or down for the last few elements of a stream then you are probably “far” away from the quantile in question. If this is the case we can speed up convergence time by incrementing a larger amount. The Frugal-2U algorithm:

def frugal_2u(stream, m = 0, q = 0.5, f = constantly_one):
step, sign = 1, 1

for item in stream:
if item > m and random() > 1 - q:
# Increment the step size if and only if the estimate keeps moving in
# the same direction. Step size is incremented by the result of applying
# the specified step function to the previous step size.
step += f(step) if sign > 0 else -1 * f(step)
# Increment the estimate by step size if step is positive. Otherwise,
# increment the step size by one.
m += step if step > 0 else 1
# Mark that the estimate increased this step
sign = 1
# If the estimate overshot the item in the stream, pull the estimate back
# and re-adjust the step size.
if m > item:
step += (item - m)
m = item
# If the item is less than the stream, follow all of the same steps as
# above, with signs reversed.
elif item < m and random() > q:
step += f(step) if sign < 0 else -1 * f(step)
m -= step if step > 0 else 1
sign = -1
if m < item:
step += (m - item)
m = item
# Damp down the step size to avoid oscillation.
if (m - item) * sign < 0 and step > 1:
step = 1



You can play around with the 1U and 2U algorithms in the simulation below.

Click above to run the Frugal Streaming simulation

As usual, there are a few interesting tidbits as well. If you read the paper you will see that they define the updates to step as a function. This means they are allowing many different types of increments to step. For example, instead of increasing the size of step by 1 each time we could increase it by 10 or even have it increase multiplicatively. They do talk about some uses of different updates to step but there is no analysis around this (yet) and they restrict all of the work in the paper to step increments of 1. We offer a few different step update functions in the simulation and they indeed do interesting things. Further exploration is definitely needed to get some insights here.

A non-obvious thing about the step variable is how it behaves under decrements. My initial thought was that step would get large if your current estimate was far below the actual value (thus allowing you to approach it faster from below) and that step would get to be a large negative number if your current estimate was very far above the actual value. This turns out to just be flat wrong. The negative updates to step have the effect of stabilizing the estimate (notice when step is negative that the updates to your estimates are always ± 1 ). If you read the algorithm closely you will see that step decrements when you consistently alternate up and down updates. This behavior occurs when the estimate is close to the actual value which causes step to become a very large negative number. And I mean VERY large. In practice we have seen this number as small as $-10^{102}$ for some simulations.

### Monitoring

One of the first things I thought of when I saw this sketch was to use it as a monitoring tool. Specifically, perhaps it could be used to replace the monitoring we use on our application server response times. It turns out that using this sketch for monitoring introduces some very interesting issues. So far, we have mostly talked about what I will call “static streams”. These are streams that have data in them which is pulled consistently from some static underlying distribution (as in the examples above). However, what if the underlying distribution changes? For example, what if one of our servers all of the sudden starts responding with slower response times? Does this frugal sketch enable you to quickly figure out that something has broken and fire off an alarm with high confidence? Well, in some ways this is an identical problem to cold start: how long does it take for an initial guess to reach a stable quantile? Unfortunately, there is no easy way to know when you have reached “equilibrium” and this makes identifying when an underlying distribution change has occurred difficult as well. The paper ends with an open challenge:

… as our experiments and insights indicate, frugal streaming algorithms work with so little memory of the past that they are adaptable to changes in the stream characteristics. It will be of great interest to understand this phenomenon better.

The paper shows some interesting experiments with changing data, but I agree with Qiang: we need to understand these algorithms better before I swap out all of our server monitoring tools (and our ops team would kill me). However, since these are so simple to implement and so small, we can easily deploy a few tests and see how the results compare “in the field” (you can play around with this by changing the underlying stream distribution in our simulation above.)

### Conclusion

The frugal quantile algorithms proposed in the paper are fascinating. It is a very interesting turn in the sketching literature and Qiang and Muthu’s creativity really comes across. I am very interested in getting some real world uses out of this sketch and am excited to see what other applications we (and Qiang!) can think of.  Many thanks to MuthuQiang Ma and Jeremie Lumbroso for all their help!

### Appendix

• Variability: While the bounds on the accuracy of these algorithms seem wide to me, clearly in real world experiments we see much better performance than the guaranteed bounds in the paper. In fact, the error bounds in the paper are dependent on the size of the stream and not fixed.
• Size of step: A potential gotcha is the size of the step variable. Depending on your update function it indeed seems possible for this value to get below -MAXINT. Clearly a real implementation would need some error checking.
• Cold Start: One more gotcha is that you have no real way of knowing when you are near the quantile in question. For instance, starting your estimate at zero and measuring a true median which is 100,000,000,000 will take a long time to “catch up” to this value. There are a few ways to limit this, some of which are discussed in the paper. One way is to try and make a sane guess up front (especially if you know something about the distribution) and another is to start your guess with the value from the last counter. For instance, suppose you are in a monitoring situation and you are computing means over the course of a day. It would make sense to start tomorrow’s estimate with yesterdays final value.
• Accuracy:  And, lastly, there is some interesting dependence on “atomicity”. Meaning, the estimates in some sense depend on how “large” the actual values are. In both, your minimum step size is 1. If my median in the stream is, say, 6 then this “atomic” update of 1 causes high relative fluctuation. It seems in real world examples you would like to balance the size of the thing you are estimating with the speed of approach of the algorithm. This could lead you to estimate latencies in microseconds rather than milliseconds for instance. All of this points to the fact that there are a bunch of real world engineering questions that need to be answered and thought about before you actually go and implement a million frugal sketches throughout your organization.

## Data Science Summit – Update

I don’t think I’m going out on a limb saying that our conference last week was a success. Thanks to everyone who attended and thanks again to all the speakers. Muthu actually beat us to it and wrote up a nice summary. Very kind words from a great guy. For those of you that couldn’t make it (or those that want to relive the fun) we posted the videos and slides. Thanks again to everyone for making it such a success!

Muthu being Muthu during David Woodruff’s fantastic talk on streaming linear algebra

## Foundation Capital and Aggregate Knowledge Sponsor Streaming/Sketching Conference

We, along with our friends at Foundation Capital, are pleased to announce a 1 day mini-conference on streaming and sketching algorithms in Big Data.  We have gathered an amazing group of speakers from academia and industry to give talks.  If you are a reader of this blog we would love to have you come!  The conference will be on 6/20 (Thursday) from 10 AM to 5:30 PM at the 111 Minna Gallery in San Francisco and attendance is limited. Breakfast and lunch included!

There will also be a happy hour afterwards if you cannot make the conference or just want a beer.

The speaker list includes:

##### Muthu Muthukrishnan

The Count-Min Sketch, 10 Years Later

The Count-Min Sketch is a data structure for indexing data streams in very small space. In a decade since its introduction, it has found many uses in theory and practice, with data streaming systems and beyond. This talk will survey the developments.

Muthu Muthukrishnan is a Professor at Rutgers University and a Research Scientist at Microsoft, India. His research interest is in development of data stream theory and systems, as well as online advertising systems.

##### David P. Woodruff

Sketching as a Tool for Numerical Linear Algebra

The talk will focus on how sketching techniques from the data stream literature can be used to speed up well-studied algorithms for problems occurring in numerical linear algebra, such as least squares regression and approximate singular value decomposition. It will also discuss how they can be used to achieve very efficient algorithms for variants of these problems, such as robust regression.

David Woodruff joined the algorithms and complexity group at IBM Almaden in 2007 after completing his Ph.D. at MIT in theoretical computer science. His interests are in compressed sensing, communication, numerical linear algebra, sketching, and streaming.

##### Sam Ritchie

Summingbird: Streaming Map/Reduce at Twitter

Summingbird is a platform for streaming map/reduce used at Twitter to build aggregations in real-time or on hadoop. When the programmer describes her job, that job can be run without change on Storm or Hadoop. Additionally, summingbird can manage merging realtime/online computations with offline batches so that small errors in real-time do not accumulate. Put another way, summingbird gives eventual consistency in a manner that is easy for the programmer to reason about.

Sam Ritchie works on data analysis and infrastructure problems in Twitter’s Revenue engineering team. He is co-author of a number of open-source Scala and Clojure libraries, including Bijection, Algebird, Cascalog 2 and ElephantDB. He holds a bachelor’s degree in mechanical and aerospace engineering.

##### Alexandr Andoni

Similarity Search Algorithms

Nearest Neighbor Search is an ubiquitous problem in analyzing massive datasets: its goal is to process a set of objects (such as images), so that later, one can find the object most similar to a given query object. I will survey the state-of-the-art for this problem, starting from the (Kanellakis-award winning) technique of Locality Sensitive Hashing, to its more modern relatives, and touch upon connection to the theory of sketching.

Alexandr Andoni is a researcher in the Microsoft Research at Silicon Valley since 2010, after finishing his PhD in MIT’s theory group and year-long postdoctoral position at Princeton University. His research interests revolve around algorithms for massive datasets, including similarity search and streaming/sublinear algorithms, as well as theoretical machine learning.

There will be a panel discussion on the topic of harboring research in startups. Speakers include:
Pete Skomoroch from the LinkedIn data science team.
Rob Grzywinski of Aggregate Knowledge.
Joseph Turian of Metaoptimize

##### Lightning talks
• Armon Dadgar (Kiip) - Sketching Data Structures at Kiip
• Blake Mizerany (Heroku) - An Engineer Reads a Data Sketch
• Timon Karnezos (AK) - TBD
• Jérémie Lumbroso (INRIA) - Philippe Flajolet’s Contribution to Streaming Algorithms

Update! We posted the videos and slides for those of you that couldn’t make it (or those that want to relive the fun). Enjoy!

## Doubling the Size of an HLL Dynamically – Extra Bits…

Author’s Note: This post is related to a few previous posts on the HyperLogLog algorithm.  See Matt’s overview of the algorithm, and see this for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the final post in a series on doubling the number of bins in HLLs. The first post dealt with the recovery time after doubling, and the second dealt with doubling’s accuracy when taking unions of two HLLs.

### Introduction

The main draw to the HyperLogLog algorithm is its ability to make accurate cardinality estimates using small, fixed memory.  In practice, there are two choices a user makes which determine how much memory the algorithm will use: the number of registers (bins) and the size of each register (how high they can count).  As Timon discussed previously, increasing the size of each register will only increase the accuracy if the true cardinality of the stream is HUGE.

Recall that HyperLogLog (and most other streaming algorithms) is designed to work with a fixed number of registers, $m$, which is chosen as a function of the expected cardinality to approximate. We track a great number of different cardinality streams and in this context it is useful for us to not have one fixed value of $m$, but to have this evolve with the needs of a given estimation.

We are thus confronted with many engineering problems, some of which we have already discussed. In particular, one problem is that the neat feature of sketches, namely that they allow for an estimate of the cardinality of the union of multiple streams at no cost, depends on having sketches of the same size.

We’ve discussed how to get around this by folding HLLs, though with some increase in error. We’ve also explored a few options on how to effectively perform a doubling procedure. However, we started to wonder if any improvements could be made by using just a small amount of extra memory, say an extra bit for each register. In this post we will discuss one such idea and its use in doubling. Note: we don’t talk about quadrupling or more. We limit ourselves to the situation where HLL sketches only differ in $m$‘s by 1.

### The Setup

One of the downfalls in doubling is that it there is no way to know, after doubling, whether a value belongs in its bin or its partner bin. Recall that a “partner bin” is the register that could have been used had our “prefix” (the portion of the hashed value which is used to decide which register to update) been one bit longer. If the binary representation of the bin index used only two bits of the hashed value, e.g. $01$, then in an HLL that used a three-bit index, the same hashed value could have been placed in the bin whose index is either $101$ or $001$. Since $001$ and $01$ are the same number, we call $101$ the “partner bin”. (See the “Key Processing” section in Set Operations On HLLs of Different Sizes).

Consider an example where we have an HLL with $2^{10}$ bins.The $k^{th}$ bin has the value 7 in it, and after doubling we guess that its partner bin, at index $(2^{10} + k)^{th}$, should have a 5 in it. It is equally likely that the $k^{th}$ bin should have the 5 in it and the $(2^{10}+k)^{th}$ bin should have the 7 in it (since the “missing” prefix bit could have been a 1 or a 0)! Certainly the arrangement doesn’t change the basic cardinality estimate, but once we start getting involved with unions, the arrangement can make a very large difference.

To see how drastic the consequences can be, let’s look at a simple example. Suppose we start with an HLL with 2 bins and get the value 6 in each of its bins. Then we run the doubling procedure and decide that the partner bins should both have 1′s in them. With this information, it is equally likely that both of the arrangements below, “A” and “B”, could be the “true” larger HLL.

Further suppose we have some other data with which we wish to estimate the union. Below, I’ve diagrammed what happens when we take the union.

Arrangement A leads to a cardinality estimate (of the union) of about 12 and Arrangement B leads to a cardinality estimate (of the union) of about 122. This is an order of magnitude different! Obviously not all cases are this bad, but this example is instructive. It tells us that knowing the true location of each value is very important. We’ve attempted to improve our doubling estimate by keeping an extra bit of information as we will describe below.

### The Algorithm

Suppose we have an HLL with $m$ bins. Let’s keep another array of data which holds $m$ total bits, one for each bin — we will call these the “Cached Values.” For each bin, we keep a 0 if the value truly belongs in the bin in which it was placed (i.e. if, had we run an HLL with $2m$ bins, the value would have been placed in the first $m$ bins in the HLL), and we keep a 1 if the value truly belongs in the partner bin of the one in which it was placed (i.e. if, had we run an HLL with $2m$ bins, the value would have been placed in the last $m$ bins). See the image below for an example. Here we see two HLLs which have processed the same data. The one on the left is half the size and collects the cached values as it runs on the data. The one on the right is simply the usual HLL algorithm run on the same data.

Looking at the first row of the small HLL (with $m$ bins), the $0$ cache value means that the 2 “belongs” in the top half of the large HLL, i.e. if we had processed the stream using a larger HLL the 2 would be in the same register. Essentially this cached bit allows you to know exactly where the largest value in a bin was located in the larger HLL (if the $i^{th}$ bin has value $V$ and cached value $S$, we place the value $V$ in the $S * 2^{\log{m}} + i$ = $(S\cdot m + i)^{th}$ bin).

In practice, when we double, we populate the doubled HLL first with the (now correct location) bin values from the original HLL then we fill the remaining bins by using our “Proportion Doubling” algorithm.

Before we begin looking at the algorithm’s performance, let’s think about how much extra space this requires. In our new algorithm, notice that for each bin, we keep around either a zero or a one as its cached value. Hence, we require only one extra bit per bin to accommodate the cached values. Our implementation of HLL requires 5 bits per bin, since we want to be able to include values up to $2^5 -1= 31$ in our bins. Thus, a standard HLL with $m$ bins, requires $5m$ bits. Hence, this algorithm requires $5m + m = 6m$ bits (with the extra $m$ bins representing the cached values). This implies that this sketch requires 20% more space.

### The Data

Recall in the last post in this series, we explored doubling with two main strategies: Random Estimate (RE) and Proportion Doubling (PD). We did the same here, though using the additional information from this cached bit. We want to know a few things:

• Does doubling using a cache bit work? i.e. is it better to fold the bigger one or double the smaller one when comparing HLL’s of different sizes?
• Does adding in a cache bit change which doubling strategy is preferred (RE or PD)?
• Does the error in union estimate depend on intersection size as we have seen in the past?

Is it better to double or fold?

For each experiment we took 2 sets of data (each generated from 200k random keys) and estimated the intersection size between them using varying methods.

• “Folded”: estimate by filling up an HLL with $log_{2}(m) = 10$ and  comparing it to a folded HLL starting from $log_{2}(m) = 11$ and folded down $log_{2}(m) = 10$
• “Large”: estimate by using two HLL’s of a larger HLL of $log_{2}(m) = 11$.  This is effectively a lower bound for our doubling approaches.
• “Doubled – PD”: estimate by taking an HLL of $log_{2}(m) = 10$ and double it up to $log_{2}(m) = 11$ using the Proportion Doubling strategy.  Once this larger HLL is approximated we estimate the intersection with another HLL of native size $log_{2}(m) = 11$
• “Doubled – RE”: estimate by taking an HLL of $log_{2}(m) = 10$ and doubling up to $log_{2}(m) = 11$ using Random Estimate strategy.

We performed an experiment 300 times at varying intersection sizes from 0 up to 200k (100%) overlapping elements between sets (in steps of 10k). The plots below show our results (and extrapolate between points).

The graph of the mean error looks pretty bad for Random Estimate doubling. Again we see that the error depends heavily on the intersection size and becomes more biased as the set’s overlap more. On the other hand, Proportion Doubling was much more successful  (recall that this strategy forces the proportion of bins in the to-be-doubled HLL and the HLL with which we will union it to be equal before and after doubling.)  It’s possible there is some error bias with small intersections but we would need to run more trials to know for sure. As expected, the “Folded” and the “Large” are centered around zero. But what about the spread of the error?

The Proportion Doubling strategy looks great! In my last post on this subject, we found that this doubling strategy (without the cached part) really only worked well in the large intersection size regime, but here, with the extra cache bits, we seem to avoid that. Certainly the large intersection regime is where the standard deviation is lowest, but for every intersection size, it is significantly lower than that of the smaller HLL. This suggests that one of our largest sources of error when we use doubling in conjunction with unions is related to our lack of knowledge of the arrangement of the bins (i.e. when doubling, we do not know which of the two partner bins gets the larger, observed value). So it appears that the strategy of keeping cache bits around does indeed work, provided you use a decent doubling scheme.

Interestingly, it is always much better to double a smaller cache HLL than to fold a larger HLL when comparing sketches of different sizes. This is represented above by the lower error of the doubled HLL than the small HLL. The error bounds do seem to depend on the size of the intersection between the two sets but this will require more work to really understand how, especially in the case of Proportion Doubling.

Notes:  In this work we focus solely on doubling a HLL sketch and then immediately using this new structure to compute set operations. It would be interesting to see if set operation accuracy changes as a doubled HLL goes through its “recovery” period under varying doubling methods. It is our assumption that nothing out of the ordinary would come of this, but we definitely could be wrong. We will leave this as an exercise for the reader.

### Summary

We’ve found an interesting way of trading space for accuracy with this cached bit method, but there are certainly other ways of using an extra bit or two (per bucket). For instance, we could keep more information about the distribution of each bin by keeping a bit indicating whether or not the bin’s value minus one has been seen. (If the value is $k$, keep track of whether $k-1$ has shown up.)

We should be able to use any extra piece of information about the distribution or position of the data to help us obtain a more accurate estimate. Certainly, there are a myriad of other ideas ways of storing a bit or two of extra information per bin in order to gain a little leverage — it’s just a matter of figuring out what works best. We’ll be messing around more with this in the coming weeks, so if you have any ideas of what would work best, let us know in the comments!

(P.S. A lot of our recent work has been inspired by Flajolet et al.’s paper on PCSA – check out our post on this here!)

Thanks to Jeremie Lumbroso for his kind input on this post. We are much indebted to him and hopefully you will see more from our collaboration.

## Doubling the Size of an HLL Dynamically – Unions

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm.  See Matt’s overview of the algorithm, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the second in a series of three posts on doubling the number of bins of HLLs. The first post dealt with the recovery time after doubling and the next post will deal with ways to utilize an extra bit or two per bin.

### Overview

Let’s say we have two streams of data which we’re monitoring with the HLL algorithm, and we’d like to get an estimate on the cardinality of these two streams combined, i.e. thought of as one large stream.  In this case, we have to take advantage of the algorithm’s built-in “unionfeature.  Done naively, the accuracy of the estimate will depend entirely on the the number of bins, $m$, of the smaller of the two HLLs.  In this case, to make our estimate more accurate, we would need to increase this $m$ of one (or both) of our HLLs.  This post will investigate the feasibility of doing this; we will apply our idea of “doubling” to see if we can gain any accuracy.  We will not focus on intersections, since the only support the HyperLogLog algorithm has for intersections is via the inclusion/exclusion principle. Hence the error can be kind of funky for this – for a better overview of this, check out Timon’s post here. For this reason, we only focus on how the union works with doubling.

### The Strategy: A Quick Reminder

In my last post we discussed the benefits and drawbacks of many different doubling strategies in the context of recovery time of the HLL after doubling. Eventually we saw that two of our doubling strategies worked significantly better than the others. In this post, instead of testing many different strategies, we’ll focus instead on one strategy, “proportion doubling” (PD), and how to manipulate it to work best in the context of unions. The idea behind PD is to guess the approximate intersection cardinality of the two datasets and to force that estimate to remain after doubling. To be more specific, suppose we have an HLL $A$ and an HLL $B$ with$n$ bins and $2n$ bins, respectively. Then we check what proportion of bins in $A$, call it $p$, agree with the bins in $B$. When we doubled $A$, we fill in the bins by randomly selecting $p\cdot n$ bins, and filling them in with the value in the corresponding bins in $B$. To fill in the rest of the bins, we fill them in randomly according to the distribution.

### The Naive Approach

To get some idea of how well this would work, I put the most naive strategy to the test. The idea was to run 100 trials where I took two HLLs (one of size $2^5 = 32$ and one of size $2^6 = 64$), ran 200K keys through them, doubled the smaller one (according to Random Estimate), and took a union. I had a hunch that the accuracy of our estimate after doubling would depend on how large the true intersection cardinality of the two datasets would be, so I ran this experiment for overlaps of size 0, 10K, 20K, etc. The graphs below are organized by the true intersection cardinality, and each graph shows the boxplot of the error for the trials.

This graph is a little overwhelming and a bit of a strange way to display the data, but is useful for getting a feel for how the three estimates work in the different regimes.  The graph below is from the same data and just compares the “Small” and “Doubled” HLLs.  The shaded region represents the middle 50% of the data, and the blue dots represent the data points.

The first thing to notice about these graphs is the accuracy of the estimate in the small intersection regime. However, outside of this, the estimates are not very accurate – it is clearly a better choice to just use the estimate from the smaller HLL.

Let’s try a second approach. Above we noticed that the algorithm’s accuracy depended on the cardinality of the intersection. Let’s try to take that into consideration. Let’s use the “Proportion Doubling” (PD) strategy we discussed in our first post. That post goes more in depth into the algorithm, but the take away is that this doubling strategy preserves the proportion of bins in the two HLLs which agree. I ran some trials like I did above to get some data on this. The graphs below represent this.

Here we again, show the data in a second graph comparing just the “Doubled” and “Small” HLL estimates.  Notice how much tighter the middle 50% region is on the top graph (for the “Doubled” HLL).  Hence in the large intersection regime, we get very accurate estimates.

One thing to notice about the second set of graphs is how narrow the error bars are.  Even when the estimate is biased, it still has much smaller error.  Also, notice that this works well in the large intersection regime but horribly in the small intersection regime.  This suggests that we may be able to interpolate our strategies. The next set of graphs is for an attempt at this. The algorithm gets an estimate of the intersection cardinality, then decides to either double using PD, double using RE, or not double depending on whether the intersection is large, small, or medium.

Here, the algorithm works well in the large intersection regime and doesn’t totally crap out outside of this regime (like the second algorithm), but doesn’t sustain the accuracy of the first algorithm in the small intersection regime. This is most likely because the algorithm cannot “know” which regime it is in and thus, must make a guess.  Eventually, it will guess wrong will severely underestimate the union cardinality. This will introduce a lot of error, and hence, our boxplot looks silly in this regime. The graph below shows the inefficacy of this new strategy. Notice that there are virtually no gains in accuracy in the top graph.

### Conclusion

With some trickery, it is indeed possible to gain some some accuracy when estimating the cardinality of the union of two HLLs by doubling one.  However, in order for this to be feasible, we need to apply the correct algorithm in the correct regime. This isn’t a major disappointment since for many practical cases, it would be easy to guess which regime the HLLs should fall under and we could build in the necessary safeguards if we guess incorrectly.  In any case, our gains were modest but certainly encouraging!

## HyperLogLog++: Google’s Take On Engineering HLL

Matt Abrams recently pointed me to Google’s excellent paper “HyperLogLog in Practice: Algorithmic Engineering of a State of The Art Cardinality Estimation Algorithm” [UPDATE: changed the link to the paper version without typos] and I thought I’d share my take on it and explain a few points that I had trouble getting through the first time. The paper offers a few interesting improvements that are worth noting:

1. Move to a 64-bit hash function
2. A new small-cardinality estimation regime
3. A sparse representation

I’ll take a look at these one at a time and share our experience with similar optimizations we’ve developed for a streaming (low latency, high throughput) environment.

### 32-bit vs. 64-bit hash function

I’ll motivate the move to a 64-bit hash function in the context of the original paper a bit more since the Google paper doesn’t really cover it except to note that they wanted to count billions of distinct values.

#### Some math

In the original HLL paper, a 32-bit hash function is required with the caveat that measuring cardinalities in the hundreds of millions or billions would become problematic because of hash collisions. Specifically, Flajolet et al. propose a “large range correction” for when the estimate $E$ is greater than $2^{32}/30$.  In this regime, they replace the usual HLL estimate by the estimate

$\displaystyle E^* := -2^{32} \mbox{log}\Big(1 - \frac{E}{2^{32}}\Big)$.

This reduces to a simple probabilistic argument that can be modeled with balls being dropped into bins. Say we have an $L$-bit hash. Each distinct value is a ball and each bin is designated by a value in the hash space.  Hence, you “randomly” drop a ball into a bin if the hashed value of the ball corresponds to the hash value attached to the bin.  Then, if we get an estimate $E$ for the cardinality, that means that (approximately) $E$ of our bins have values in them, and so there are $2^L - E$ empty bins.  The number of empty bins should be about $2^L e^{ - n/2^{L} }$, where $n$ is the number of balls.  Hence $2^{L} - E = 2^{L} e^{-n/2^{L}}$.  Solving this gives us the formula he recommends using: $-2^L \log(1 - \frac{E}{2^{L}})$.

Aside:  The empty bins expected value comes from the fact that

$E(\# \text{ of empty bins}) = m(1 - \frac{1}{m})^{n}$,

where $m$ is the number of bins and $n$ the number of balls.  This is pretty quick to show by induction.  Hence,

$\displaystyle E(\#\text{ of empty bins}) \sim m e^{-n/m}$ as $n \rightarrow \infty$.

Again, the general idea is that the $E$ ends up being some number smaller than $n$ because some of the balls are getting hashed to the same value.  The correction essentially doesn’t do anything in the case when $E$ is small compared to $2^L$ as you can see here. (Plotted is $-\log(1 - x)$, where $x$ represents $E / 2^L$, against the line $y = x$. The difference between the two graphs represents the difference between $E$ and $n$.)

#### A solution and a rebuttal

The natural move to start estimating cardinalities in the billions is to simply move to a larger hash space where the hash collision probability becomes negligibly small. This is fairly straightforward since most good hash functions give you at least 64-bits of entropy these days and it’s also the size of a machine word. There’s really no downside to moving to the larger hash space, from an engineering perspective. However, the Google researchers go one step further: they increase the register width by one bit (to 6), as well, ostensibly to be able to support the higher possible register values in the 64-bit setting. I contend this is unnecessary. If we look at the distribution of register values for a given cardinality, we see that it takes about a trillion elements before a 5-bit register overflows (at the black line):

The distributions above come from the LogLog paper, on page 611, right before formula 2. Look for $\mathbb{P}_{\nu}(M = k)$.

Consider the setting in the paper where $p = \log_2(m) = 14$. Let’s says we wanted to safely count into the 100 billion range. If we have $L = p + (2^5 - 1) = 14 + 31 = 45$ then our new “large range correction” boundary is roughly one trillion, per the adapted formula above. As you can see from the graph below, even at $p = 10, L = 41$ the large range correction only kicks in at a little under 100 billion!

The black line is the cutoff for a 5-bit register, and the points are plotted when the total number of hash bits required reaches 40, 50, and 60.

The real question though is all this practically useful? I would argue no: there are no internet phenomena that I know of that are producing more than tens of billions of distinct values, and there’s not even a practical way of empirically testing the accuracy of HLL at cardinalities above 100 billion. (Assuming you could insert 50 million unique, random hashed values per second, it would take half an hour to fill an HLL to 100 billion elements, and then you’d have to repeat that 5000 times as they do in the paper for a grand total of 4 months of compute time per cardinality in the range.)

[UPDATE: After talking with Marc Nunkesser (one of the authors) it seems that Google may have a legitimate need for both the 100 billion to trillion range right now, and even more later, so I retract my statement! For the rest of us mere mortals, maybe this section will be useful for picking whether or not you need five or six bits.]

At AK we’ve run a few hundred test runs at 1, 1.5, and 2 billion distinct values under the $p = 10-14, L = 41-45$ configuration range and found the relative error to be identical to that of lower cardinalities (in the millions of DVs). There’s really no reason to inflate the storage requirements for large cardinality HLLs by 20% simply because the hash space has expanded. Furthermore, you can do all kinds of simple tricks like storing an offset as metadata (which would only require at most 5 bits) for a whole HLL and storing the register values as the difference from that base offset, in order to make use of a larger hash space.

### Small Cardinality Estimation

Simply put, the paper introduces a second small range correction between the existing one (linear counting) and the “normal” harmonic mean estimator ($E$ in the original paper) in order to eliminate the “large” jump in relative error around the boundary between the two estimators.

They empirically calculate the bias of $E$ and create a lookup table for various $p$, for 200 values less than $5 \cdot 2^p$ with a correction to the overestimate of $E$. They interpolate between the 200 reference points to determine the correction to apply for any given raw $E$ value. Their plots give compelling evidence that this bias correction makes a difference in the $m$ to $5m$ cardinality range (cuts 95th percentile relative error from about 2% to 1.2%).

I’ve been a bit terse about this improvement since sadly it doesn’t help us at AK much because most of our data is Zipfian. Few of our reporting keys live in the narrow cardinality range they are optimizing: they either wallow in the linear counting range or shoot straight up into the normal estimator range.

Nonetheless, if you find you’re doing a lot of DV counting in this range, these corrections are pretty cheap to implement (as they’ve provided numerical values for all the cutoffs and bias corrections in the appendix.)

### Sparse representation

The general theme of this optimization isn’t particularly new (our friends at MetaMarkets mentioned it in this post): for smaller cardinality HLLs there’s no need to materialize all the registers. You can get away with just materializing the set registers in a map. The paper proposes a sorted map (by register index) with a hash map off the side to allow for fast insertions. Once the hash map reaches a certain size, its entries are batch-merged into the sorted list, and once the sorted list reaches the size of the materialized HLL, the whole thing is converted to the fully-materialized representation.

Aside: Though the hash map is a clever optimization, you could skip it if you didn’t want the added complexity. We’ve found that the simple sorted list version is extremely fast (hundreds of thousands of inserts per second). Also beware the variability of the the batched sort-and-merge cost every time the hash map repeatedly outgrows its limits and has to be merged into the sorted list. This can add significant latency spikes to a streaming system, whereas a one-by-one insertion sort to a sorted list will be slower but less variable.

The next bit is very clever: they increase $p$ when the HLL is in the sparse representation because of the saved space. Since they’re already storing entries in 32-bit integers, they increase $p$ to $p^{\prime} = 31 - \mbox{regWidth} = 31 - 6 = 25$. (I’ll get to the leftover bit in a second!) This gives them increased precision which they can simply “fold” down when converting from the sparse to fully materialized representation. Even more clever is their trick of not having to always store the full register value as the value of an entry in the map. Instead, if the longer register index encodes enough bits to determine the value, they use the leftover bit I mentioned before to signal as much.

In the diagram, $p$ and $p^{\prime}$ are as in the Google paper, and $q$ and $q^{\prime}$ are the number of bits that need to be examined to determine $\rho$ for either the $p$ or $p^{\prime}$ regime. I encourage you to read section 5.3.3 as well as EncodeHash and DecodeHash in Figure 8 to see the whole thing. [UPDATE: removed the typo section as it has been fixed in the most recent version of the paper (linked at the top)]

The paper also tacks on a difference encoding (which works very well because it’s a sorted list) and a variable length encoding to the sparse representation to further shrink the storage needed, so that the HLL can use the increased register count, $p^{\prime}$, for longer before reverting to the fully materialized representation at $p$. There’s not much to say about it because it seems to work very well, based on their plots, but I’ll note that in no way is that type of encoding suitable for streaming or “real-time” applications. The encode/decode overhead simply takes an already slow (relative to the fully materialized representation) sparse format and adds more CPU overhead.

### Conclusion

The researchers at Google have done a great job with this paper, meaningfully tackling some hard engineering problems and showing some real cleverness. Most of the optimizations proposed work very well in a database context, where the HLLs are either being used as temporary aggregators or are being stored as read-only data, however some of the optimizations aren’t suitable for streaming/”real-time” work. On a more personal note, it’s very refreshing to see real algorithmic engineering work being published from industry. Rob, Matt, and I just got back from New Orleans and SODA / ALENEX / ANALCO and were hoping to see more work in this area, and Google sure did deliver!

### Appendix

Sebastiano Vigna brought up the point that 6-bit registers are necessary for counting above 4 billion in the comments. I addressed it in the original post (see “A solution and a rebuttal“) but I’ll lay out the math in a bit more detail here to show that you can easily count above 4 billion with only 5-bit registers.

If you examine the original LogLog paper (the same as mentioned above) you’ll see that the register distribution for LogLog (and HyperLogLog consequently) registers is

$\displaystyle \mathbb{P}_{\nu}(M > k) = 1 - \mathbb{P}_{\nu}(M \le k) = 1 - \Big(1 - \frac{1}{2^k}\Big)^{\nu}$

where $k$ is the register value and $\nu$ is the number of (distinct) elements a register has seen.

So, I assert that 5 bits for a register (which allows the maximum value to be 31) is enough to count to ten billion without any special tricks. Let’s fix $p=14$ and say we insert $10^{10}$ distinct elements. That means, any given register will see about $\frac{10^{10}}{2^p} = \frac{10^{10}}{2^{14}} = \approx 6.1 \times 10^5$ elements assuming we have a decent hash function. So, the probability that a given register will have a value greater than 31 is (per the LogLog formula given above)

$\displaystyle \mathbb{P}_{\nu}(M > 31) = 1 - \mathbb{P}_{\nu}(M \le 31) = 1 - \Big(1 - \frac{1}{2^{31}}\Big)^{6.1 \times 10^5} \approx 0.00028$

and hence the expected number of registers that would overflow is approximately $2^{14} \times 0.00028 \approx 4.5$. So five registers out of sixteen thousand would overflow. I am skeptical that this would meaningfully affect the cardinality computation. In fact, I ran a few tests to verify this and found that the average number of registers with values greater than 31 was 4.5 and the relative error had the same standard deviation as that predicted by the paper, $1.04/\sqrt{m}$.

For argument, let’s assume that you find those five overflowed registers to be unacceptable. I argue that you could maintain an offset in 5 bits for the whole HLL that would allow you to still use 5 bit registers but exactly store the value of every register with extremely high probability. I claim that with overwhelmingly high probability, every register the HLL used above is greater than 15 and less than or equal to 40. Again, we can use the same distribution as above and we find that the probability of a given register being outside those bounds is

$\mathbb{P}_{\nu}(M < 15) \approx 10^{-162}$ and

$\mathbb{P}_{\nu}(M > 40) \approx 10^{-7}$.

Effectively, there are no register values outside of $[15,40]$. Now I know that I can just store 15 in my offset and the true value minus the offset (which now fits in 5 bits) in the actual registers.

## HLLs and Polluted Registers

### Introduction

It’s worth thinking about how things can go wrong, and what the implications of such occurrences might be. In this post, I’ll be taking a look at the HyperLogLog (HLL) algorithm for cardinality estimation, which we’ve discussed before.

### The Setup

HLLs have the property that their register values increase monotonically as they run. The basic update rule is:

for item in stream:
index, proposed_value = process_hashed_item(hash(item))
hll.registers[index] = max(hll.registers[index], proposed_value)


There’s an obvious vulnerability here: what happens to your counts if you get pathological data that blows up a register value to some really large number? These values are never allowed to decrease according to the vanilla algorithm. How much of a beating can these sketches take from such pathological data before their estimates are wholly unreliable?

### Experiment The First

To get some sense of this, I took a 1024 bucket HLL, ran a stream through it, and then computed the error in the estimate. I then proceeded to randomly choose a register, max it out, and compute the error again. I repeated this process until I had maxed out 10% of the registers. In pseudo-python:

print("n_registers_touched,relative_error")
print(0, relative_error(hll.cardinality(), stream_size), sep = ",")
for index, reg in random.sample(range(1024), num_to_edit):
hll.registers[reg] = 32
print(index + 1, relative_error(hll.cardinality(), stream_size), sep = ",")


In practice, HLL registers are fixed to be a certain bit width. In our case, registers are 5 bits wide, as this allows us to count runs of 0s up to length 32. This allows us to count astronomically high in a 1024 register HLL.

Repeating this for many trials, and stream sizes of 100k, 1M, and 10M, we have the following picture. The green line is the best fit line.

What we see is actually pretty reassuring. Roughly speaking, totally poisoning x% of registers results in about an x% error in your cardinality estimate. For example, here are the error means and variances across all the trials for the 1M element stream:

Number of Registers Modified Percentage of Registers Modified Error Mean Error Variance
0 0 -0.0005806094 0.001271119
10 0.97% 0.0094607324 0.001300780
20 1.9% 0.0194860481 0.001356282
30 2.9% 0.0297495396 0.001381753
40 3.9% 0.0395013208 0.001436418
50 4.9% 0.0494727527 0.001460182
60 5.9% 0.0600436774 0.001525749
70 6.8% 0.0706375356 0.001522320
80 7.8% 0.0826034639 0.001599104
90 8.8% 0.0937465662 0.001587156
100 9.8% 0.1060810958 0.001600348

### Initial Reactions

I was actually not too surprised to see that the induced error was modest when only a small fraction of the registers were poisoned. Along with some other machinery, the HLL algorithm uses the harmonic mean of the individual register estimates when computing its guess for the number of distinct values in the data stream. The harmonic mean does a very nice job of downweighting values that are significantly larger than others in the set under consideration:

In [1]: from scipy.stats import hmean

In [2]: from numpy import mean

In [3]: f = [1] * 100000 + [1000000000]

In [4]: mean(f)
Out[4]: 10000.899991000089

In [5]: hmean(f)
Out[5]: 1.0000099999999899


It is this property that provides protection against totally wrecking the sketch’s estimate when we blow up a fairly small fraction of the registers.

### Experiment The Second

Of course, the algorithm can only hold out so long. While I was not surprised by the modesty of the error, I was very surprised by how linear the error growth was in the first figure. I ran the same experiment, but instead of stopping at 10% of the registers, I went all the way to the end. This time, I have plotted the results with a log-scaled y-axis:

Note that some experiments appear to start after others. This is due to missing data from taking the logarithm of negative errors.

Without getting overly formal in our analysis, there are roughly three phases in error growth here. At first, it’s sublinear on the log-scale, then linear, then superlinear. This roughly corresponds to “slow”, “exponential”, and “really, really, fast”. As our mathemagician-in-residence points out, the error will grow roughly as p/(1-p) where p is the fraction of polluted registers. The derivation of this isn’t too hard to work out, if you want to give it a shot! The implication of this little formula matches exactly what we see above. When p is small, the denominator does not change much, and the error grows roughly linearly. As p approaches 1, the error begins to grow super-exponentially. Isn’t it nice when experiment matches theory?

### Final Thoughts

It’s certainly nice to see that the estimates produced by HLLs are not overly vulnerable to a few errant register hits. As is often the case with this sort of analysis, the academic point must be put in balance with the practical. The chance of maxing out even a single register under normal operation is vanishingly small, assuming you chose a sane hash function for your keys. If I was running an HLL in the wild, and saw that 10% of my registers were pegged, my first thought would be “What is going wrong with my system!?” and not “Oh, well, at least I know my estimate to within 10%!” I would be disinclined to trust the whole data set until I got a better sense of what caused the blowups, and why I should give any credence at all to the supposedly unpolluted registers.

## Doubling the Size of an HLL Dynamically – Recovery

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm. See Matt’s overview of HLL, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the first in a series of three posts on doubling the size of HLLs – the next two will be about set operations and utilizing additional bits of data, respectively.

### Overview

In this post, we explore the error of the cardinality estimate of an HLL whose size has been doubled using several different fill techniques. Specifically, we’re looking at how the error changes as additional keys are seen by the HLL.

#### A Quick Reminder – Terminology and Fill Strategies

If we have an HLL of size $2^n$ and we double it to be an HLL of size $2^{n+1}$, we call two bins “partners” if their bin number differs by $2^n$.  For example, in an HLL double to be size $8$, the bins $1$ and $5$ are partners, as are $2$ and $6$, etc. The Zeroes doubling strategy fills in the newly created bins with zeroes. The Concatenate strategy fills in the newly created bins with the values of their partner bins. MinusTwo fills in each bin with two less than its partner bin’s value. RE fills in the newly created bins according to the empirical distribution of each bin.

### Some Sample Recovery Paths

Below, we ran four experiments to check recovery time. Each experiment consisted of running an HLL of size 210 on 500,000 unique hashed keys (modeled here using a random number generator), doubling the HLL to be size 211, and then ran 500,000 more hashed keys through the HLL. Below, we have graphs showing how the error decreases as more keys are added.  Both graphs show the same data (the only difference being the scale on the y-axis). We have also graphed “Large,” an HLL of size $2^{11}$, and “Small,” an HLL of size $2^{10}$, which are shown only for comparison and are never doubled.

One thing to note about the graphs is that the error is relative.

Notice that Concatenate and Zeroes perform particularly poorly. Even after 500,000 extra keys have been added, they still don’t come within 5% of the true value! For Zeroes, this isn’t too surprising. Clearly the initial error of Zeroes, that is the error immediately after doubling, should be high.  A quick look at the harmonic mean shows why this occurs. If a single bin has a zero as its value, the harmonic mean of the values in the bins will be zero. Essentially, the harmonic mean of a list always tends towards the lowest elements of the list. Hence, even after all the zeroes have been replaced with positive values, the cardinality estimate will be very low.

On the other hand, a more surprising result is that Concatenate gives such a poor guess. To see this we need to look at the formula for the estimate again.  The formula for the cardinality estimate is $\frac{\alpha_m m^2}{\sum_{i=1}^{m} 2^{-M_i}}$ where $M_i$ is the value in the $i^{th}$ bin, $m$ is the number of bins, and $\alpha_m$ is a constant approaching about $.72$. For Concatenate, the value $M_{i + m}$ is equal to $M_i$.  Hence we have that the cardinality estimate for Concatenate is:

$\begin{array}{ll}\displaystyle\frac{\alpha_{2m} (2m)^2}{\left(\displaystyle\sum_{i=1}^{2m} 2^{-M_i}\right)} \vspace{10pt}&\approx \displaystyle\frac{.72\cdot 4\cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) + \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) }\vspace{10pt} \\&\displaystyle= \displaystyle 4\cdot \frac{.72 \cdot m^2}{2\cdot \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\vspace{10pt}\\&= \displaystyle 2 \cdot \frac{.72 \cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\vspace{10pt}\\&\approx \displaystyle 2 \cdot \frac{ \alpha_m \cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\end{array}$

Notice that this last term is about equal to 2 times the cardinality estimate of the HLL before doubling. One quick thing that we can take away from this is that it is unlikely for two “partner” bins to have the same value in them (since if this happens frequently, we get an estimate close to that given by Concatenate – which is very inaccurate!).

As for MinusTwo and RE, these have small initial error and the error only falls afterwards. The initial error is small since the rules for these give guesses approximately equal to the guess of the original HLL before doubling. From here, the error should continue to shrink, and eventually, it should match that of the large HLL.

One thing we noticed was that error for Concatenate in the graph above suggested that the absolute error wasn’t decreasing at all. To check this we looked at the trials and, sure enough, the absolute error stays pretty flat. Essentially, Concatenate overestimates pretty badly, and puts the HLL in a state where it thinks it has seen twice as many keys as it actually has. In the short term, it will continue to make estimates as if it has seen 500,000 extra keys. We can see this clearly in the graphs below.

### Recovery Time Data

I also ran 100 experiments where we doubled the HLLs after adding 500,000 keys, then continued to add keys until the cardinality estimate fell within 5% of the true cardinality.  The HLLs were set up to stop running at 2,000,000 keys if they hadn’t arrived at the error bound.

Notice how badly Concatenate did! In no trials did it make it under 5% error. Zeroes did poorly as well, though it did recover eventually. My guess here is that the harmonic mean had a bit to do with this – any bin with a low value, call it $k$, in it would pull the estimate down to be about $m^2 \cdot 2^k$. As a result, the estimate produced by the Zeroes HLL will remain depressed until every bin is hit with a(n unlikely) high value. Zeroes and Concatenate should not do well since essentially the initial estimate (after doubling) of each HLL is off by a very large fixed amount. The graph of absolute errors, above, shows this.

On the other hand, RE and MinusTwo performed fairly well. Certainly, RE looks better in terms of median and middle 50%, though its variance is much higher than MinusTwo‘s.This should make sense as we are injecting a lot of randomness into RE when we fill in the values, whereas MinusTwo‘s bins are filled in deterministically.

### Recovery Time As A Function of Size

One might wonder whether the recovery time of MinusTwo and RE depend on the size of the HLL before the doubling process. To get a quick view of whether or not this is true, we did 1,000 trials like those above but by adding 200K, 400K, 600K, 800K, 1M keys and with a a cutoff of 3% this time. Below, we have the box plots for the data for each of these. The headings of each graph gives the size of the HLL before doubling, and the y-axis gives the fractional recovery time (the true recovery time divided by the size of the HLL before doubling).

Notice that, for each doubling rule, there is almost no variation between each of the plots. This suggests that the size of the HLL before doubling doesn’t change the fractional recovery time. As a side note, one thing that we found really surprising is that RE is no longer king – MinusTwo has a slightly better average case. We think that this is just a factor of the higher variation of RE and the change in cutoff.

### Summary

Of the four rules, MinusTwo and RE are clearly the best. Both take about 50 – 75% more keys after doubling to get within 3% error, and both are recover extremely quickly if you ask for them to only get within 5% error.

To leave you with one last little brainteaser, an HLL of size $2^{10}$, which is then doubled, will eventually have the same values in its bins as an HLL of size $2^{11}$ which ran on the same data. About how long will it take for these HLLs to converge? One (weak) requirement for this to happen is to have the value in every bin of both HLLs be changed. To get an upper bound on how long this should take, one should read about the coupon collector problem.

## Doubling the Size of an HLL Dynamically

### Introduction

In my last post, I explained how to halve the number of bins used in an HLL as a way to allow set operations between that HLL and smaller HLLs.  Unfortunately, the accuracy of an HLL is tied to the number of bins used, so one major drawback with this “folding” method is that each time you have the number of bins, you reduce that accuracy by a factor of $\sqrt{2}$.

In this series of posts I’ll focus on the opposite line of thinking: given an HLL, can one double the number of bins, assigning the new bins values according to some strategy, and recover some of the accuracy that a larger HLL would have had?  Certainly, one shouldn’t be able to do this (short of creating a new algorithm for counting distinct values) since once we use the HLL on a dataset the extra information that a larger HLL would have gleaned is gone.  We can’t recover it and so we can’t expect to magically pull a better estimate out of thin air (assuming Flajolet et al. have done their homework properly and the algorithm makes the best possible guess with the given information – which is a pretty good bet!).  Instead, in this series of posts, I’ll focus on how doubling plays with recovery time and set operations.  By this, I mean the following:  Suppose we have an HLL of size 2n and while its running, we double it to be an HLL of size 2n+1. Initially, this may have huge error, but if we allow it to continue running, how long will it take for its error to be relatively small?  I’ll also discuss some ways of modifying the algorithm to carry slightly more information.

### The Candidates

Before we begin, a quick piece of terminology.  Suppose we have an HLL of size 2n and we double it to be an HLL of size 2 n+1.  We consider two bins to be partners if their bin numbers differ by 2n.  To see why this is important – check the post on HLL folding.

Colin and I did some thinking and came up with a few naive strategies to fill in the newly created bins after the doubling. I’ve provided a basic outline of the strategies below.

• Zeroes – Fill in with zeroes.
• Concatenate – Fill in each bin with the value of its partner.
• MinusTwo – Fill in each bin with the value of its partner minus two. Two may seem like an arbitrary amount, but quick look at the formulas involved in the algorithm show that this leaves the cardinality estimate approximately unchanged.
• RandomEstimate (RE) – Fill in each bin according to its probability distribution. I’ll describe more about this later.
• ProportionDouble (PD) – This strategy is only for use with set operations. We estimate the number of bins in the two HLLs which should have the same value, filling in the second half so that that proportion holds and the rest are filled in according to RE.

#### Nitty Gritty of RE

The first three strategies given above are pretty self-explanatory, but the last two are a bit more complicated. To understand these, one needs to understand the distribution of values in a given bin.  In the original paper, Flajolet et al. calculate the probability that a given bin takes the value $k$ to be given by $(1 - 0.5^k)^v - (1 - 0.5^{k-1})^v$ where $v$ is the number of keys that the bin has seen so far. Of course, we don’t know this value ($v$) exactly, but we can easily estimate it by dividing the cardinality estimate by the number of bins. However, we have even more information than this. When choosing a value for our doubled HLL, we know that that value cannot exceed its partner’s value. To understand why this is so, look back at my post on folding, and notice how the value in the partner bins in a larger HLL correspond to the value in the related bin in the smaller HLL.

Hence, to get the distribution for the value in a given bin, we take the original distribution, chop it off at the relevant value, and rescale it to have total area 1. This may seem kind of hokey but let’s quickly look at a toy example. Suppose you ask me to guess a number between 1 and 10, and you will try to guess which number I picked. At this moment, assuming I’m a reasonable random number generator, there is a $1/10$ chance that I chose the number one, a $1/10$ chance that I chose the number two, etc. However, if I tell you that my guess is no larger than two, you can now say there there is a $1/2$ chance that my guess is a one, a $1/2$ chance that my guess is a two, and there is no chance that my guess is larger. So what happened here? We took the original probability distribution, used our knowledge to cut off and ignore the values above the maximum possible value, and then rescaled them so that the sum of the possible probabilities is equal to zero.

RE consists simply of finding this distribution, picking a value according to it, and placing that value in the relevant bin.

#### Nitty Gritty of PD

Recall that we only use PD for set operations. One thing we found was that the accuracy of doubling with set operations according to RE is highly dependent on the the intersection size of the two HLLs. To account for this, we examine the fraction of bins in the two HLLs which contain the same value, and then we force the doubled HLL to preserve this fraction

So how do we do this? Let’s say we have two HLLs: $H$ and $G$. We wish to double $H$ before taking its union with $G$. To estimate the proportion of their intersection, make a copy of $G$ and fold it to be the same size as $H$. Then count the number of bins where $G$ and $H$ agree, call this number $a$. Then if $m$ is the number of bins in $H$, we can estimate that $H$ and $G$ should overlap in about $a/m$ bins. Then for each bin, with probability $a/m$ we fill in the bin with the the minimum of the relevant bin from $G$ and that bin’s partner in $G$. With probability $1 - a/m$ we fill in the bin according to the rules of RE.

### The Posts

(Links will be added as the posts are published. Keep checking back for updates!)

## Sketching the last year

Sketching is an area of big-data science that has been getting a lot of attention lately. I personally am very excited about this.  Sketching analytics has been a primary focus of our platform and one of my personal interests for quite a while now. Sketching as an area of big-data science has been slow to unfold, (thanks Strata for declining our last two proposals on sketching talks!), but clearly the tide is turning. In fact, our summarizer technology, which relies heavily on our implementation of Distinct Value (DV) sketches, has been in the wild for almost a year now (and, obviously we were working on it for many months before that).

#### Fast, But Fickle

The R&D of the summarizer was fun but, as with most technical implementations, it’s never as easy as reading the papers and writing some code. The majority of the work we have done to make our DV sketches perform in production has nothing to do with the actual implementation.  We spend a lot of time focused on how we tune them, how we feed them, and make them play well with the rest of our stack.

Likewise, setting proper bounds on our sketches is an ongoing area of work for us and has led down some very interesting paths.  We have gained insights that are not just high level business problems, but very low level watchmaker type stuff.  Hash function behaviors and stream entropy alongside the skewness of data-sets themselves are areas we are constantly looking into to improve our implementations. This work has helped us refine and find optimizations around storage that aren’t limited to sketches themselves, but the architecture of the system as a whole.

#### Human Time Analytics

Leveraging DV sketches as more than just counters has proven unbelievably useful for us. The DV sketches we use provide arbitrary set operations. This comes in amazingly handy when our customers ask “How many users did we see on Facebook and on AOL this month that purchased something?” You can imagine how far these types of questions go in a real analytics platform. We have found that DV counts alongside set operation queries satisfy a large portion of our analytics platforms needs.

Using sketches for internal analytics has been a blast as well. Writing implementations and libraries in scripting languages enables our data-science team to perform very cool ad-hoc analyses faster and in “human-time”. Integrating DV sketches as custom data-types into existing databases has proven to be a boon for analysts and engineers alike.

#### Reap The Rewards

Over the course of the year that we’ve been using DV sketches to power analytics, the key takeaways we’ve found are: be VERY careful when choosing and implementing sketches; and leverage as many of their properties as possible.  When you get the formula right, these are powerful little structures. Enabling in-memory DV counting and set operations is pretty amazing when you think of the amount of data and analysis we support. Sketching as an area of big-data science seems to have (finally!) arrived and I, for one, welcome our new sketching overlords.