## Doubling the Size of an HLL Dynamically – Recovery

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm. See Matt’s overview of HLL, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the first in a series of three posts on doubling the size of HLLs – the next two will be about set operations and utilizing additional bits of data, respectively.

### Overview

In this post, we explore the error of the cardinality estimate of an HLL whose size has been doubled using several different fill techniques. Specifically, we’re looking at how the error changes as additional keys are seen by the HLL.

#### A Quick Reminder – Terminology and Fill Strategies

If we have an HLL of size $2^n$ and we double it to be an HLL of size $2^{n+1}$, we call two bins “partners” if their bin numbers differ by $2^n$.  For example, in an HLL doubled to be size $8$, the bins $1$ and $5$ are partners, as are $2$ and $6$, etc. The Zeroes doubling strategy fills in the newly created bins with zeroes. The Concatenate strategy fills in the newly created bins with the values of their partner bins. MinusTwo fills in each bin with two less than its partner bin’s value. RE fills in the newly created bins according to the empirical distribution of each bin.
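To make the first three rules concrete, here is a minimal sketch (Python; the plain list of registers, the helper names, and the fixed $\alpha \approx .72$ are illustrative assumptions, not our production implementation):

```python
def double_registers(registers, strategy):
    """Double an HLL's register list; new bin i + m is the 'partner' of bin i.

    Illustrative sketch covering only the Zeroes, Concatenate, and MinusTwo
    rules; RE requires sampling and is sketched in the post on doubling below.
    """
    m = len(registers)
    if strategy == "zeroes":
        new_half = [0] * m
    elif strategy == "concatenate":
        new_half = list(registers)
    elif strategy == "minus_two":
        new_half = [max(r - 2, 0) for r in registers]
    else:
        raise ValueError("unknown strategy: %s" % strategy)
    return registers + new_half


def raw_estimate(registers, alpha=0.72):
    """Raw HLL cardinality estimate: alpha * m^2 / sum(2^-M_i)."""
    m = len(registers)
    return alpha * m * m / sum(2.0 ** -r for r in registers)
```

As a quick sanity check, `raw_estimate(double_registers(regs, "concatenate"))` comes out to roughly twice `raw_estimate(regs)`, which is exactly the algebra worked through below for Concatenate.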

### Some Sample Recovery Paths

Below, we ran four experiments to check recovery time. Each experiment consisted of running an HLL of size $2^{10}$ on 500,000 unique hashed keys (modeled here using a random number generator), doubling the HLL to be size $2^{11}$, and then running 500,000 more hashed keys through the HLL. Below, we have graphs showing how the error decreases as more keys are added.  Both graphs show the same data (the only difference being the scale on the y-axis). We have also graphed “Large,” an HLL of size $2^{11}$, and “Small,” an HLL of size $2^{10}$, which are shown only for comparison and are never doubled.

One thing to note about the graphs is that the error shown is relative error, i.e. the difference between the estimate and the true cardinality, expressed as a fraction of the true cardinality.

Notice that Concatenate and Zeroes perform particularly poorly. Even after 500,000 extra keys have been added, they still don’t come within 5% of the true value! For Zeroes, this isn’t too surprising. Clearly the initial error of Zeroes, that is, the error immediately after doubling, should be high.  A quick look at the harmonic mean shows why this occurs. A bin with a zero in it contributes the largest possible term to the denominator of the estimate, and, essentially, the harmonic mean of a list always tends towards the lowest elements of the list. Hence, even after all the zeroes have been replaced with positive values, the cardinality estimate will be very low.

On the other hand, a more surprising result is that Concatenate gives such a poor guess. To see this we need to look at the formula for the estimate again.  The formula for the cardinality estimate is $\frac{\alpha_m m^2}{\sum_{i=1}^{m} 2^{-M_i}}$ where $M_i$ is the value in the $i^{th}$ bin, $m$ is the number of bins, and $\alpha_m$ is a constant approaching about $.72$. For Concatenate, the value $M_{i + m}$ is equal to $M_i$.  Hence we have that the cardinality estimate for Concatenate is:

$\begin{aligned}\frac{\alpha_{2m} (2m)^2}{\displaystyle\sum_{i=1}^{2m} 2^{-M_i}} &\approx \frac{.72\cdot 4\cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) + \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)} \\ &= 4\cdot \frac{.72 \cdot m^2}{2\cdot \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)} \\ &= 2 \cdot \frac{.72 \cdot m^2}{\displaystyle\sum_{i=1}^m 2^{-M_i}} \\ &\approx 2 \cdot \frac{\alpha_m \cdot m^2}{\displaystyle\sum_{i=1}^m 2^{-M_i}}\end{aligned}$

Notice that this last term is about equal to 2 times the cardinality estimate of the HLL before doubling. One quick thing that we can take away from this is that it is unlikely for two “partner” bins to have the same value in them (since if this happens frequently, we get an estimate close to that given by Concatenate – which is very inaccurate!).

As for MinusTwo and RE, these have small initial error and the error only falls afterwards. The initial error is small since the rules for these give guesses approximately equal to the guess of the original HLL before doubling. From here, the error should continue to shrink, and eventually, it should match that of the large HLL.

One thing we noticed was that the error for Concatenate in the graph above suggested that its absolute error wasn’t decreasing at all. To check this, we looked at the trials and, sure enough, the absolute error stays pretty flat. Essentially, Concatenate overestimates pretty badly, and puts the HLL in a state where it thinks it has seen twice as many keys as it actually has. In the short term, it will continue to make estimates as if it has seen 500,000 extra keys. We can see this clearly in the graphs below.

### Recovery Time Data

I also ran 100 experiments where we doubled the HLLs after adding 500,000 keys, then continued to add keys until the cardinality estimate fell within 5% of the true cardinality.  The HLLs were set up to stop running at 2,000,000 keys if they hadn’t arrived at the error bound.

Notice how badly Concatenate did! In no trials did it make it under 5% error. Zeroes did poorly as well, though it did recover eventually. My guess here is that the harmonic mean had a bit to do with this – any bin with a low value, call it $k$, in it would pull the estimate down to be about $m^2 \cdot 2^k$. As a result, the estimate produced by the Zeroes HLL will remain depressed until every bin is hit with a(n unlikely) high value. Zeroes and Concatenate should not do well since essentially the initial estimate (after doubling) of each HLL is off by a very large fixed amount. The graph of absolute errors, above, shows this.

On the other hand, RE and MinusTwo performed fairly well. Certainly, RE looks better in terms of median and middle 50%, though its variance is much higher than MinusTwo’s. This should make sense, as we are injecting a lot of randomness into RE when we fill in the values, whereas MinusTwo’s bins are filled in deterministically.

### Recovery Time As A Function of Size

One might wonder whether the recovery time of MinusTwo and RE depends on the size of the HLL before the doubling process. To get a quick view of whether or not this is true, we did 1,000 trials like those above, but adding 200K, 400K, 600K, 800K, and 1M keys, and with a cutoff of 3% this time. Below, we have the box plots for the data for each of these. The heading of each graph gives the size of the HLL before doubling, and the y-axis gives the fractional recovery time (the true recovery time divided by the size of the HLL before doubling).

Notice that, for each doubling rule, there is almost no variation between each of the plots. This suggests that the size of the HLL before doubling doesn’t change the fractional recovery time. As a side note, one thing that we found really surprising is that RE is no longer king – MinusTwo has a slightly better average case. We think that this is just a factor of the higher variation of RE and the change in cutoff.

### Summary

Of the four rules, MinusTwo and RE are clearly the best. Both take about 50 – 75% more keys after doubling to get within 3% error, and both recover extremely quickly if you only ask them to get within 5% error.

To leave you with one last little brainteaser, an HLL of size $2^{10}$, which is then doubled, will eventually have the same values in its bins as an HLL of size $2^{11}$ which ran on the same data. About how long will it take for these HLLs to converge? One (weak) requirement for this to happen is to have the value in every bin of both HLLs be changed. To get an upper bound on how long this should take, one should read about the coupon collector problem.
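For a rough sense of scale (this is not the full answer to the teaser), the coupon collector expectation for merely touching every one of the $m = 2^{11}$ bins at least once is

$E[T] = m H_m = m\left(1 + \tfrac{1}{2} + \cdots + \tfrac{1}{m}\right) \approx m(\ln m + \gamma) \approx 2048 \cdot 8.2 \approx 17{,}000 \text{ keys},$

and the teaser is how much more data it takes before every bin’s value has actually changed and the two HLLs agree.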

## Doubling the Size of an HLL Dynamically

### Introduction

In my last post, I explained how to halve the number of bins used in an HLL as a way to allow set operations between that HLL and smaller HLLs.  Unfortunately, the accuracy of an HLL is tied to the number of bins used, so one major drawback with this “folding” method is that each time you halve the number of bins, you reduce that accuracy by a factor of $\sqrt{2}$.

In this series of posts I’ll focus on the opposite line of thinking: given an HLL, can one double the number of bins, assigning the new bins values according to some strategy, and recover some of the accuracy that a larger HLL would have had?  Certainly, one shouldn’t be able to do this (short of creating a new algorithm for counting distinct values) since once we use the HLL on a dataset the extra information that a larger HLL would have gleaned is gone.  We can’t recover it and so we can’t expect to magically pull a better estimate out of thin air (assuming Flajolet et al. have done their homework properly and the algorithm makes the best possible guess with the given information – which is a pretty good bet!).  Instead, in this series of posts, I’ll focus on how doubling plays with recovery time and set operations.  By this, I mean the following:  Suppose we have an HLL of size $2^n$ and, while it’s running, we double it to be an HLL of size $2^{n+1}$. Initially, this may have huge error, but if we allow it to continue running, how long will it take for its error to be relatively small?  I’ll also discuss some ways of modifying the algorithm to carry slightly more information.

### The Candidates

Before we begin, a quick piece of terminology.  Suppose we have an HLL of size $2^n$ and we double it to be an HLL of size $2^{n+1}$.  We consider two bins to be partners if their bin numbers differ by $2^n$.  To see why this is important – check the post on HLL folding.

Colin and I did some thinking and came up with a few naive strategies to fill in the newly created bins after the doubling. I’ve provided a basic outline of the strategies below.

• Zeroes – Fill in with zeroes.
• Concatenate – Fill in each bin with the value of its partner.
• MinusTwo – Fill in each bin with the value of its partner minus two. Two may seem like an arbitrary amount, but a quick look at the formulas involved in the algorithm shows that this leaves the cardinality estimate approximately unchanged.
• RandomEstimate (RE) – Fill in each bin according to its probability distribution. I’ll describe more about this later.
• ProportionDouble (PD) – This strategy is only for use with set operations. We estimate the number of bins in the two HLLs which should have the same value, filling in the second half so that that proportion holds and the rest are filled in according to RE.

#### Nitty Gritty of RE

The first three strategies given above are pretty self-explanatory, but the last two are a bit more complicated. To understand these, one needs to understand the distribution of values in a given bin.  In the original paper, Flajolet et al. calculate the probability that a given bin takes the value $k$ to be given by $(1 - 0.5^k)^v - (1 - 0.5^{k-1})^v$ where $v$ is the number of keys that the bin has seen so far. Of course, we don’t know this value ($v$) exactly, but we can easily estimate it by dividing the cardinality estimate by the number of bins. However, we have even more information than this. When choosing a value for our doubled HLL, we know that that value cannot exceed its partner’s value. To understand why this is so, look back at my post on folding, and notice how the value in the partner bins in a larger HLL correspond to the value in the related bin in the smaller HLL.

Hence, to get the distribution for the value in a given bin, we take the original distribution, chop it off at the relevant value, and rescale it to have total area 1. This may seem kind of hokey, but let’s quickly look at a toy example. Suppose I pick a number between 1 and 10, and you try to guess which number I picked. At this moment, assuming I’m a reasonable random number generator, there is a $1/10$ chance that I chose the number one, a $1/10$ chance that I chose the number two, etc. However, if I tell you that my number is no larger than two, you can now say there is a $1/2$ chance that my number is a one, a $1/2$ chance that my number is a two, and there is no chance that it is larger. So what happened here? We took the original probability distribution, used our knowledge to cut off and ignore the values above the maximum possible value, and then rescaled the remaining probabilities so that they sum to one.

RE consists simply of finding this distribution, picking a value according to it, and placing that value in the relevant bin.
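A minimal sketch of that procedure (Python; the helper name is made up, and $v$ is estimated from the pre-doubling cardinality estimate as described above):

```python
import random

def re_fill_value(partner_value, cardinality_estimate, num_bins):
    """Sample a value for a newly created bin under the RE rule.

    Uses P(bin = k) = (1 - 0.5**k)**v - (1 - 0.5**(k - 1))**v, truncated at
    the partner bin's value and renormalized. Illustrative sketch only.
    """
    if partner_value == 0:
        return 0
    v = max(cardinality_estimate / float(num_bins), 1.0)  # estimated keys per bin
    weights = [(1 - 0.5 ** k) ** v - (1 - 0.5 ** (k - 1)) ** v
               for k in range(1, partner_value + 1)]
    total = sum(weights)
    if total <= 0:                 # numerical corner case: just take the cap
        return partner_value
    r = random.random() * total    # renormalize by sampling within the truncated mass
    for k, w in enumerate(weights, start=1):
        r -= w
        if r <= 0:
            return k
    return partner_value
```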

#### Nitty Gritty of PD

Recall that we only use PD for set operations. One thing we found was that the accuracy of doubling with set operations according to RE is highly dependent on the intersection size of the two HLLs. To account for this, we examine the fraction of bins in the two HLLs which contain the same value, and then we force the doubled HLL to preserve this fraction.

So how do we do this? Let’s say we have two HLLs: $H$ and $G$. We wish to double $H$ before taking its union with $G$. To estimate the proportion of their intersection, make a copy of $G$ and fold it to be the same size as $H$. Then count the number of bins where $G$ and $H$ agree, call this number $a$. Then if $m$ is the number of bins in $H$, we can estimate that $H$ and $G$ should overlap in about $a/m$ bins. Then for each bin, with probability $a/m$ we fill in the bin with the minimum of the relevant bin from $G$ and that bin’s partner in $G$. With probability $1 - a/m$ we fill in the bin according to the rules of RE.

## Set Operations On HLLs of Different Sizes

### Introduction

Here at AK, we’re in the business of storing huge amounts of information in the form of 64 bit keys. As shown in other blog posts and in the HLL post by Matt, one efficient way of getting an estimate of the size of the set of these keys is by using the HyperLogLog (HLL) algorithm.  There are two important decisions one has to make when implementing this algorithm.  The first is how many bins one will use and the second is the maximum value one allows in each bin.  As a result, the amount of space this will take up is going to be the number of bins times the log of the maximum value you allow in each bin.  For this post we’ll ignore this second consideration and focus instead on the number of bins one uses.  The accuracy for an estimate is given approximately by 1.04/√b, where b is the number of bins.  Hence there is a tradeoff between the accuracy of the estimate and the amount of space you wish to dedicate to this estimate. Certainly, projects will have various requirements that call for different choices of number of bins.

The HLL algorithm natively supports the union operation.  However, one requirement for this operation is that the HLLs involved are of the same size, i.e. have the same number of bins.  In practice, there’s no guarantee that HLLs will satisfy this requirement.  In this post, I’ll outline the method by which we transform an HLL with a certain number of bins to one with fewer bins, allowing us to perform set operations on any two HLLs, regardless of size.

### Key Processing

As discussed in the HyperLogLog paper, to get a cardinality estimate with an HLL with $2^n$ bins on a data set we pass over each key, using the placement of the rightmost “1” to determine the value of the key and the next $n$ digits to the left to determine in which bin to place that value.  In each bin, we only store the maximum value that that bin has “seen.”
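Under that convention, processing a single key looks roughly like the following sketch (Python; the register list and function name are illustrative):

```python
def process_key(registers, key, n):
    """Feed one hashed key into an HLL with 2**n bins (len(registers) == 2**n).

    Convention from this post: the position of the rightmost '1' bit gives the
    value, and the next n bits to its left choose the bin. Each bin keeps the
    maximum value it has seen. Illustrative sketch only.
    """
    if key == 0:
        return                        # a zero hash has no rightmost '1'
    value = 1
    while key & 1 == 0:               # walk out to the rightmost '1'
        value += 1
        key >>= 1
    key >>= 1                         # discard the rightmost '1' itself
    bin_index = key & ((1 << n) - 1)  # the next n bits to the left
    registers[bin_index] = max(registers[bin_index], value)
```

Note that the bottom $n$ of the $n+1$ bin bits used by a larger HLL are exactly the bin bits used by the smaller one, which is what makes the folding described below possible.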

Below I’ve shown how two HLLs (one of size $2^3$ and one of size $2^4$) process two different keys.  Here, the keys have the same value, because the purpose of this example is to illustrate how the location in which we place the key changes when the HLL has twice the number of bins.

Above, the keys which are attributed to the fifth and thirteenth bins in the larger HLL would both have been attributed to the fifth bin in the smaller HLL.  Hence, unraveling the algorithm a bit, we see that the values which are seen by the fifth and thirteenth bins in the larger HLL would have been seen by the fifth bin in the smaller HLL had they run on the same dataset.  Because of this, in the case where the two algorithms estimate the same dataset, the value stored in the fifth bin in the smaller HLL is the maximum of the values stored in the fifth and thirteenth bins in the larger HLL.

### Folding HLLs

What happened above is not an isolated phenomenon.  In general, if one uses the HLL algorithm twice on a dataset, once with $2^{n+1}$ bins and once with $2^n$ bins, the value in the $k^{th}$ bin in the smaller HLL will be the maximum of the values in the $k^{th}$ and $(k + 2^n)^{th}$ bins of the larger HLL.  As a result, if given an HLL of size $2^{n+1}$ that one wishes to transform to an HLL of size $2^n$, one can simply fold the HLL by letting the value of the $k^{th}$ bin in the folded HLL be given by the maximum of the values in the $k^{th}$ and $(k + 2^n)^{th}$ bins of the original HLL.

In fact, we can fold any HLL an arbitrary number of times.  Repeating this process, we can take an HLL of size $2^n$ to an HLL of size $2^m$ for any $m$ which is less than or equal to $n$.  Hence if we wish to perform a set operation on two HLLs of different sizes, we can simply fold the larger HLL repeatedly until it is the same size as the smaller HLL.  After this, we can take unions and intersections as we wish.
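The fold itself is about as short as its description (again assuming a plain list of registers):

```python
def fold(registers):
    """Fold an HLL of size 2**(n+1) down to size 2**n.

    Bin k of the result is the max of bins k and k + 2**n of the input.
    """
    half = len(registers) // 2
    return [max(registers[k], registers[k + half]) for k in range(half)]


def fold_to(registers, target_size):
    """Fold repeatedly until only target_size bins remain."""
    while len(registers) > target_size:
        registers = fold(registers)
    return registers
```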

### Folding – An Example

Below, we show a simple example of how folding works.  Here we have an HLL with $2^3$ bins which we fold to be an HLL with $2^2$ bins.  In the diagram, I’ve torn an HLL of size $2^3$ in half and placed the strips side by side to emphasize how we line up bins and take maximums in the folding process.  Notice that the values in the bins of the folded HLL are the maximum of the relevant bins in the larger HLL.

This technique gives us the flexibility to be able to perform set operations on any two HLLs regardless of the number of bins used in the algorithms.  Its usefulness in this regard is a bit offset by the fact that the accuracy of the estimate on these is limited by the accuracy of the least accurate HLL.  For example, an HLL of size $2^{10}$ will have accuracy roughly $2^3$ times better than an HLL of size $2^4$ (to see where I’m getting these numbers from, you’ll have to read the paper!).  Unfortunately, if we combine these with a set operation, our resulting estimate will have the same accuracy as the smaller HLL.

### Summary

The HyperLogLog algorithm supports set operations in a nice way only if the number of bins used is fixed.  Using folding, one can correct for this by reducing the size of the larger HLL to the size of the smaller.  The cost of this convenience is in the accuracy of the estimates after the folding process.  In my next post, I’ll explore some methods of performing the set operations without this loss of accuracy.

## K-Minimum Values: Sketching Error, Hash Functions, and You

### Introduction

“All known efficient cardinality estimators rely on randomization, which is ensured by the use of hash functions.”
–Flajolet, et al

Recalling the KMV algorithm Matt presented in his last post, one will note that every stream element is passed to a hash function as part of the processing step. This is meant to transform the data being operated on from its native distribution into something uniformly distributed. Unfortunately, we don’t live in a perfect world, and since all of the algorithm’s analysis assumes that this hash function does its job well, we wanted to get some sense of how it behaves under less friendly conditions. The first half of this post will investigate the algorithm’s performance when we artificially introduce bias, and the second half will look at its behavior with a handful of real hash functions.

### A Simple Error Model

The first hash function error model that came to mind is admittedly unrealistic and ham-fisted, but hopefully illustrative. Suppose you have a stream of fixed size, an ideal hash function, and from these you produce a distinct value estimate using the KMV algorithm. Now suppose that for some unlucky reason, one bit from your hash function is stuck; it’s always a zero or a one, but the other 31 bits are free of this curse.

There’s nothing to stop you from computing a distinct value estimate using this janky hash with KMV, but your intuition suggests that it shouldn’t be very good. We went through this exact process with various choices of k, using a random number generator to simulate a perfect hash function.
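Here is a rough sketch of that process (Python; the function name is made up, random 32-bit integers stand in for a perfect hash, and the estimator is the common $(k-1)/h_{(k)}$ form of KMV):

```python
import heapq
import random

def kmv_estimate_with_stuck_bit(n_distinct, k, bit, stuck_to):
    """Estimate n_distinct with KMV when one hash output bit is stuck.

    A random 32-bit value stands in for a perfect hash of each element, then
    `bit` is forced to `stuck_to` (0 or 1). The estimate is (k - 1) divided by
    the kth-smallest hash, normalized to [0, 1). Illustrative sketch only.
    """
    mask = 1 << bit
    negated = []                                # holds the k smallest hashes, negated
    for _ in range(n_distinct):
        h = random.getrandbits(32)
        h = (h | mask) if stuck_to else (h & ~mask)
        if len(negated) < k:
            heapq.heappush(negated, -h)
        elif h < -negated[0]:
            heapq.heapreplace(negated, -h)      # evict the current kth smallest
    kth_smallest = -negated[0] / float(1 << 32)
    return (k - 1) / kth_smallest
```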

Before we look at the data, let’s think about what we should expect. From the perspective of KMV, it shouldn’t make a whole lot of difference if your kth smallest element is odd or even (for instance, in a case where the lowest order bit is always/never set, respectively). It does, however, make a difference if you’re actually incapable of seeing values smaller than $2^{31}$, which is what happens when the highest order bit is always set. Thus, in both the 0-biasing and 1-biasing cases, we should expect that higher order bits have a much more dramatic effect on error than lower order bits.

Notice how the performance degradation follows two different patterns. When we are fixing bits as ones, the observed error increases fairly smoothly, and tends to result in underestimates. In contrast, setting bits to zeroes produces little change until the stuck bit is of high enough order, at which point the error jumps, producing catastrophic overestimates. Additionally, larger values of k have protective effects against these biases.

### A Somewhat Less Simple Error Model

Now that we have some intuition for the problem, let’s get a little more subtle. Instead of always setting the nth bit as a 0 or 1, let’s add a probabilistic element. We’ll do the same experiment as before, except we will now fix the nth bit with probability p. Thus, when p = 0 we have a perfectly well behaved KMV, and when p = 1, we have the experiment we just finished discussing. In the following diagram, each tile represents the average error across several experiments in which a stream of 1,000,000 unique elements was fed to a KMV sketch (k = 1024) which was rigged to modify the nth bit with probability p.

Many of the same lessons can be seen here — high order bits matter more, downward biasing degrades performance sharply, upward biasing degrades more smoothly. Additionally, as we’d expect, within a given bit, more bias means more error.

### Send in the Hash Functions

All of the experiments to this point have involved using a random number generator instead of hashing real data. I think it’s time that we took a look at what happens when we drop in a few real hash functions with real data. For the following experiments, I’m using four 32-bit hashes — Murmur3, SDBM, Arash Partow’s hash, and one of the old Donald Knuth hashes. You may recall these from our series on choosing a good hash function (although 64-bit versions were used there). I chose four text corpuses:

• Romeo and Juliet, stripped of all punctuation and converted to lower case (3794 words)
• /usr/share/dict/words (99171 words)
• 1,000,000 random 12 character long strings, each sharing the same suffix: “123456”
• 1,000,000 random 12 character long strings, each sharing the same prefix: “123456”

Using formulas from this paper, we can compute the relative error that 99% of KMV estimates should theoretically fall within. This turns out to depend on k and the stream size.

To make these pictures, I chose random values of k within each hash/document pair at which I evaluate the cardinality estimate and compute the relative error. The lines are linear interpolations between sampled points and are shown solely for clarity. The y-axis scale is adjusted on a per-picture basis to best display the theoretical envelope within which we expect our errors to lie.

Now that we’ve gotten through all the necessary preamble, let’s take a look at the results!

One picture in and we’ve already learned a lesson: choice of hash function seriously matters! SDBM and DEK cause the algorithm to perform well below its capabilities. DEK’s error is actually off the charts for most of this graph, which is why it does not appear until k > 3,000.

On a bigger corpus with tighter theoretical error bounds, Murmur3 and AP are still doing quite well. Do note, however, that AP dips outside the envelope for a while at k = 70,000 or so.

With the random strings, SDBM performs much better than it did on English words. DEK, however, is still hopeless. It’s a little tough to see on these pictures, but at high k, AP starts to fall off the wagon, and even Murmur3 dips outside the envelope, though not beyond what we’d expect, statistically speaking. Honestly, I was hoping for some fireworks here, but they didn’t materialize. I was wondering if we might see some hashes break on one version of these strings, and do fine on the other due to the location of the varying key bits (high order/low order). Sadly, that didn’t happen, but a negative result is a result none the less.

To summarize these results, I made the following table, which shows the percentage of time that one of the samples falls outside the theoretical envelope. In this view, Murmur3’s superiority is clear.

| Corpus | AP | DEK | Murmur3 | SDBM |
|---|---|---|---|---|
| Romeo and Juliet | 0.00% | 100.00% | 0.00% | 61.54% |
| /usr/share/dict/words | 10.76% | 100.00% | 0.00% | 68.46% |
| Common Suffix | 7.22% | 99.11% | 1.10% | 0.27% |
| Common Prefix | 3.33% | 100.00% | 0.22% | 0.0001% |

### Fin

KMV is a very nice little algorithm that is incredibly simple to understand, implement, and use. That said, if you’re going to make use of it, you really do need to practice some due diligence when choosing your hash function. With packages like smhasher available, trying out multiple hash functions is a cinch, and a little legwork at the start of a project can save you from confusion and despair later on!

## Efficient Field-Striped, Nested, Disk-backed Record Storage

At AK we deal with a torrent of data every day. We can report on the lifetime of a campaign which may encompass more than a year’s worth of data. To be able to efficiently access our data we are constantly looking at different approaches to storage, retrieval and querying. One approach that we have been interested in involves dissecting data into its individual fields (or “columns” if you’re thinking in database terms) so that we only need to access the fields that are pertinent to a query. This is not a new approach to dealing with large volumes of data – it’s the basis of column-oriented databases like HBase.

Much of our data contains nested structures and this causes things to start to get a little more interesting, since this no longer easily fits within the data-model of traditional column-stores. Our Summarizer uses an in-memory approach to nested, field-striped storage but we wanted to investigate this for our on-disk data. Google published the Dremel paper a few years ago covering this exact topic. As with most papers, it only provides a limited overview of the approach without covering many of the “why”s and trade-offs made. So, we felt that we needed to start from the ground up and investigate how nested, field-striped storage works in order to really understand the problem.

Due to time constraints we have only been able to scratch the surface. Since the community is obviously interested in a Dremel-like project, we want to make the work that we have done available. We apologize in advance for the rough edges.

Without further ado: Efficient Field-Striped, Nested, Disk-backed Record Storage (on GitHub).

## No BS Data Salon #3

On Saturday Aggregate Knowledge hosted the third No BS Data Salon on databases and data infrastructure. A handful of people showed up to hear Scott Andreas of Boundary talk about distributed, streaming service architecture, and I also gave a talk about AK’s use of probabilistic data structures.

The smaller group made for some fantastic, honest conversation about the different approaches to streaming architectures, the perils of distributing analytics workloads in a streaming setting, and the challenges of pushing scientific and engineering breakthroughs all the way through to product innovation.

We’re all looking forward to the next event, which will be in San Francisco, in a month or two. If you have topics you’d like to see covered, we’d love to hear from you in the comments below!

As promised, I’ve assembled something of a “References” section to my talk, which you can find below.

### Random

• Sean Gourley’s talk on human-scale analytics and decision-making
• Muthu Muthukrishnan’s home page, where research on streaming in general abounds
• A collection of C and Java implementations of different probabilistic sketches

## Sketching the last year

Sketching is an area of big-data science that has been getting a lot of attention lately. I personally am very excited about this.  Sketching analytics has been a primary focus of our platform and one of my personal interests for quite a while now. Sketching as an area of big-data science has been slow to unfold (thanks, Strata, for declining our last two proposals on sketching talks!), but clearly the tide is turning. In fact, our summarizer technology, which relies heavily on our implementation of Distinct Value (DV) sketches, has been in the wild for almost a year now (and, obviously, we were working on it for many months before that).

#### Fast, But Fickle

The R&D of the summarizer was fun but, as with most technical implementations, it’s never as easy as reading the papers and writing some code. The majority of the work we have done to make our DV sketches perform in production has nothing to do with the actual implementation.  We spend a lot of time focused on how we tune them, how we feed them, and how we make them play well with the rest of our stack.

Likewise, setting proper bounds on our sketches is an ongoing area of work for us and has led down some very interesting paths.  We have gained insights that are not just high level business problems, but very low level watchmaker type stuff.  Hash function behaviors and stream entropy alongside the skewness of data-sets themselves are areas we are constantly looking into to improve our implementations. This work has helped us refine and find optimizations around storage that aren’t limited to sketches themselves, but the architecture of the system as a whole.

#### Human Time Analytics

Leveraging DV sketches as more than just counters has proven unbelievably useful for us. The DV sketches we use provide arbitrary set operations. This comes in amazingly handy when our customers ask “How many users did we see on Facebook and on AOL this month that purchased something?” You can imagine how far these types of questions go in a real analytics platform. We have found that DV counts alongside set operation queries satisfy a large portion of our analytics platform’s needs.

Using sketches for internal analytics has been a blast as well. Writing implementations and libraries in scripting languages enables our data-science team to perform very cool ad-hoc analyses faster and in “human-time”. Integrating DV sketches as custom data-types into existing databases has proven to be a boon for analysts and engineers alike.

#### Reap The Rewards

Over the course of the year that we’ve been using DV sketches to power analytics, the key takeaways we’ve found are: be VERY careful when choosing and implementing sketches; and leverage as many of their properties as possible.  When you get the formula right, these are powerful little structures. Enabling in-memory DV counting and set operations is pretty amazing when you think of the amount of data and analysis we support. Sketching as an area of big-data science seems to have (finally!) arrived and I, for one, welcome our new sketching overlords.

## Statistical Toolbox: The Kolmogorov-Smirnov Test

Author’s Note: The Kolmogorov-Smirnov test is a handy tool that is conceptually clean, and can be useful in a variety of data analysis situations. I’ll introduce it in the context of a problem that I came across, and give a feel for what it does, and how it might be useful.

### A Question and A Tool

I’ve been doing a lot of work with hash functions, and as part of that work I was posed with a question. If I take the same data, encode it two different ways, and feed the two encodings to the same hash function, is there any difference in the statistical properties of the hashed output data sets?

The model I used to explore this question was to take a great number of SHA1 checksums, and MurmurHash3 these numbers, first encoded as 16 byte integers, and then again as Java Strings. There are a lot of things that one could do at this stage, but the first thing I thought to apply was the Kolmogorov-Smirnov (KS) test.

### The Whatnow?

First, some background. The cumulative distribution function (CDF) is a common and natural way of characterizing a probability distribution. The KS test gives us a tool for taking two CDFs and speaking intelligently about how “different” they are. A typical use case is as follows:

• You collect data that you suspect follows some theoretical distribution (uniform, Poisson, whatever)
• From the raw data you construct an empirical cumulative distribution function (ECDF)
• You use the KS test to answer the question, “Assuming my data were sampled from this theoretical distribution, what is the probability of seeing an ECDF that is at least this different from what one would predict?”

A more interesting use case is to compare two empirical distributions for equality. The test is conceptually exactly the same, except instead of comparing a CDF generated from data to one generated by theory, the comparison is between two empirical CDFs. A minor consequence of comparing two empirical data sets is that there is some additional uncertainty that must be dealt with, but this can be addressed by simply using larger samples (see the scaling factors discussed below).

### What Does It Look Like?

The figure on the right is very helpful in understanding what is going on in this test.

Given two CDFs, the first thing the KS test does is find their maximum positive and negative differences, $D^+$ and $D^-$, respectively. These differences are scaled to produce so-called “K statistics.” In the case where one is comparing an empirical to a theoretical CDF (shown in the figure), all one needs to do is scale the differences by $\sqrt{n}$, where there are $n$ observations. For the comparison of two empirical distributions of size $n$ and $m$, $D^+$ and $D^-$ are scaled by $\sqrt{nm/(n+m)}$.
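As a sketch of the mechanics (not a substitute for the library implementations listed at the end of this post), here is the one-sample case against a theoretical CDF, with the $\sqrt{n}$ scaling applied:

```python
import math

def ks_statistics(sample, cdf):
    """Return the scaled (K+, K-) statistics for one sample vs. a theoretical CDF.

    D+ is the largest amount by which the ECDF exceeds the theoretical CDF,
    D- the largest amount by which it falls below; both are scaled by sqrt(n).
    Illustrative sketch of the textbook definitions only.
    """
    xs = sorted(sample)
    n = len(xs)
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return math.sqrt(n) * d_plus, math.sqrt(n) * d_minus

# e.g. testing uniformity on [0, 1): ks_statistics(data, lambda x: min(max(x, 0.0), 1.0))
```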

This scaling takes care of the idea that the same magnitude of difference is more troubling if you have more data. A chance large jump or long lag in your ECDF curve is increasingly unlikely as your samples grow.

For a vanilla KS test, the larger of $K^+$ and $K^-$ is compared against the Kolmogorov distribution. This allows you to compute a p-value telling you the probability of seeing a K statistic as large as you did under the assumptions of the null hypothesis that the sample is drawn from the theoretical distribution you are testing it against.

The KS test doesn’t need a lot of data to start detecting fairly small differences. If you have a lot of data, and you want to get fancy, you can break your data set up into many disjoint subsets and run the KS test on each of the subsets, keeping the $K^+$ and $K^-$ statistics for each subset. You can then pool all of the $K^+$ statistics into one collection, all of the $K^-$ statistics into another, and individually compare them to their theoretical distribution, which is well approximated by $1-e^{-2x^2}$. In this way you can make good use of all of your data, and better balance the competing goals of detecting both global and local divergence from the ideal CDF. See TAOCP Vol. II for a more thorough discussion of this technique.

### So What Happened?

A simple call to scipy.stats.ks_2samp and some waiting returned a p-value of 0.9977065. The difference between the two data samples’ ECDFs was well within what one would expect, were they drawn from the same underlying distribution. This result is nice. A good hash function should be as insensitive to the statistical nuances of the input data as possible, always producing a nice, uniform output. Note that this statistic says nothing about the quality of MurmurHash3’s output distribution, only that its ability to grind up the same numbers doesn’t appear to suffer dramatically when they are encoded as strings vs. bytes. As it so happens, we’ve seen that Murmur is pretty darn good!

### Closing Thoughts

As with all test statistics, you shouldn’t blindly accept or reject a result on the basis of some arbitrary cutoff. The KS test can’t tell you whether or not any “statistically significant” difference is practically significant. It is a very sensitive test, and given a large enough sample size can detect differences that are meaningless to your application. It’s certainly worth looking at plots of your ECDFs, repeating your analysis on different subsets of your data, and even judging the results of the test in light of other statistical measures or related data. This test wasn’t the end of my analysis of this problem, but it was certainly a useful tool along the way. I hope that it may one day be similarly useful for you!

#### Implementations

• R’s ks.test and ks.boot functions implement the standard and bootstrapped KS test for single and two-sample cases
• SciPy implements a lot of KS tools in the scipy.stats module
• Matlab’s versions live in the statistics toolbox
• Octave has these tests as builtins

#### Books

• TAOCP Vol II. Seminumerical Algorithms by Knuth has a very nice writeup, but is focused on 1 sample tests.
• The KS test is discussed in John Cook’s chapter on testing a random number generator in Beautiful Testing. It is freely readable here.

## No BS Data Salon #2

On Saturday, our illustrious Chief Scientist Matt Curcio sat on the Frameworks, Tools, and Techniques for Scaling up Machine Learning panel at the second No BS Data Salon hosted by MetaMarkets. The discussion ranged from scaling the human aspect of ML and analytics to brass tacks about the difficulties of actually performing ML on web scale data sets.

The theme for this Salon was analytics, and just like last time the focus was on real use cases and a no-nonsense open discussion. We heard great presentations from Sean Gourley of Quid, Ian Wong of Square, and Metamarkets’ own Nelson Ray. Sean gave a great talk about the future of human-scale vs. machine-scale decision-making and analytics. Ian gave an informative overview of the challenges of growing a risk analysis team at Square. Nelson threw down with an awesome technical presentation about how to “A/B test anything”.

This second event in the series was certainly a worthwhile follow-up to the first! Thanks again to Metamarkets for the food, drink, and welcome. You won’t find better hosts than Mike and Nisha!

## Choosing a Good Hash Function, Part 3

Author’s note: Part three of a series studying hash functions. My last post identified a few candidate algorithms that are subjected to further scrutiny here today.

### The Story So Far

The simplest attribute on which one could imagine differentiating candidate hash functions is the number of collisions produced when hashing a fixed pool of keys. By that standard, my last post identified Murmur3, Jenkins, City, Spooky, FNV1/1a, SDBM, AP, and RS as possible contenders. Today we’re going to see how they compare to each other on some more rigorous tests.

### Random Uniformity

A hash function ought to distribute its keys uniformly across its output range. To see how these functions stack up, we’ll put our 42 million unique keys through each hash function, bin the output, and compare the bin counts with expectation:

For bins of equal size, $E[\text{bin}_i] = \text{Number of items hashed} / \text{Number of bins}$

Now, uniformity is different from random uniformity. In general the latter is not always necessary for building a good hash table, but the analysis of some schemes assume it. For our purposes, we’re going to want our hashes to look like they are drawn from a random uniform distribution — simple uniformity won’t cut it for our applications. This means that when we look at our bin counts, we want them to be neither too smooth nor too lumpy. To quantify this concept, we’ll use a chi-squared test.
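As a sketch of what that looks like in practice (Python/SciPy; equal-width bins over the 64-bit output space, and the function name is made up):

```python
from collections import Counter
from scipy.stats import chisquare

def uniformity_p_value(hashes, num_bins=100, hash_bits=64):
    """Chi-squared p-value for 'these hash outputs look uniform over the output space'.

    Bins the hash values into num_bins equal-width bins and compares the
    observed counts against a flat expectation. Illustrative sketch only.
    """
    width = (1 << hash_bits) // num_bins
    counts = Counter(min(h // width, num_bins - 1) for h in hashes)
    observed = [counts.get(i, 0) for i in range(num_bins)]
    _, p_value = chisquare(observed)   # expected counts default to the flat mean
    return p_value
```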

In volume II of TAOCP Donald Knuth provides a somewhat ad-hoc, but easy to understand method for interpreting the p-values calculated by a chi-squared test of randomness. If your p-value is less than 0.01 or greater than 0.99 the process that generated those results is almost certainly non-random. Something less than 0.05 or greater than 0.95 should be considered suspect. Finally, he designates a p-value of less than 0.1 or greater than 0.90 as “almost suspect”.

Here I’ve cut the whole 64 bit output space into 100 bins, and again in 1,000,000 bins. For a final test I modded out the bottom 20 bits, to check their distributions in isolation.

| Hash Function | 1 Million bins* | Bottom 20 bits* | 100 bins |
|---|---|---|---|
| AP | 0.70 | 0.50 | <0.01 |
| City | 0.07 | 0.29 | 0.46 |
| FNV64-1 | <0.01 | >0.99 | 0.97 |
| FNV64-1a | >0.99 | >0.99 | 0.87 |
| Jenkins | 0.17 | 0.46 | 0.72 |
| Murmur3 | 0.14 | 0.31 | 0.08 |
| RS | >0.99 | >0.99 | 0.23 |
| SDBM | >0.99 | >0.99 | >0.99 |
| Spooky | 0.84 | 0.27 | 0.98 |

*p-values estimated from a standard normal distribution

Jenkins passes all three of these nicely. City and Murmur each come up “almost suspect” once, and Spooky shows some suspicious behavior in the 100 bin test. I put the heaviest weight on the bottom 20 bit test, and can pretty comfortably give these four functions a pass here. AP does dramatically better at higher bin counts, which is interesting. We can pretty solidly eliminate RS, SDBM, AP, and both FNV variants based on this analysis alone.

As a final note, hash functions are not meant to be RNGs! This test holds them to a very rigid standard that is not generally necessary to build a good hash table. It’s just that in our specific application, we’re going to want our hash values to be somewhat random looking.

### Using Keyspace Structure

Before I continue, let me explain a little bit more of the structure of the data I am working with. I have 251 namespaces, each of which has a variable number of 192 and 256 bit keys associated with it. All told I have in the neighborhood of 66 million datapoints of the form (namespace, key). Only the key portion of these tuples actually gets hashed, however. Up until this point, we have been ignoring the namespace attribute of these data points, and thus have been restricted to looking at the 42 million unique (key, hash(key)) pairs. Let’s see if we can exploit this larger set of data by including the namespaces!

In the chi-squared analysis above, we did our binning over the union of all namespaces. Now let’s individually bin the hash values of each namespace. All said and done, we have 251 namespaces ranging in size from a tiny handful to several million elements. This gives us 251 vectors of size 100, with

$V_{n,i}$ = Number of items of namespace $n$ hashed to the $i^{th}$ bin

For each namespace, we can compute the mean and variance of its count vector. I’ll leave it as an exercise to the reader, but it’s a pretty simple calculation to show that if you sample from a random uniform distribution, the variance of such a bin-count vector should equal its mean. If the variance is lower than the mean, it implies that the distribution is flatter than expected. On the contrary, if the variance is higher, it implies the existence of hot-spots on the range that are getting more than their fair share of data points hashed there.
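For the record, the exercise boils down to the bin counts being binomial: if $N$ items are hashed uniformly into $B$ bins, each bin count is $\mathrm{Binomial}(N, 1/B)$, so

$E[\text{count}] = \frac{N}{B}, \qquad \mathrm{Var}[\text{count}] = N \cdot \frac{1}{B}\left(1 - \frac{1}{B}\right) \approx \frac{N}{B} \quad \text{for } B \gg 1,$

and the variance of a namespace’s bin-count vector should therefore sit right around its mean under random uniform hashing.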

Enough with the words, let’s look at the graphs! To generate these, I took the subset of namespaces that had at least 100,000 elements, of which there are 83. Each point is a namespace, and the green line shows the theoretical variance = mean relationship we’d expect from binning a random uniform distribution. Finally, I ran a Bonferroni corrected chi-squared test within each namespace. Those that come out “almost suspect” or worse are highlighted in red.

You can think of these namespaces as small experiments. Together, they help give us a picture of what the chi-squared test done on the whole dataset tells us.

A few observations:

• Under the 100 bin chi-squared test, SDBM was flagged as being way too uniformly distributed. We can see that quite clearly here. Generally, the variance of the bin counts is quite a bit lower than the mean bin count.
• On the other hand, AP has a comparatively high variance. This translates, again, to some bins being overly “favored” by the hash function.
• These pictures also give us some idea of how noisy the functions are on a namespace by namespace basis. Compare Spooky and Murmur3. The residuals for all of the namespaces are quite low, and basically equal for Spooky, whereas Murmur3’s residuals show a lot more variability.

So far we’ve been taking our input sets as a given, and examining the statistical properties of the outputs. While powerful, we need not limit ourselves to these techniques. Onward to avalanche!

### Avalanche Analysis

A common test of hash function performance is whether or not it achieves “avalanche.” This refers to the desirable characteristic that

$P(\text{Output bit } i \text{ changes} \mid \text{Input bit } j \text{ changes}) = 0.5$ for all $i, j$

Basically, if we keep all of the input bits the same, save for exactly 1 which we flip, we’d hope that each of our hash function’s output bits changes with probability 1/2.
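A sketch of how such a bias table can be tabulated (Python; the splitmix64-style mixer below is only a stand-in for whichever hash is actually under test, and 4000 sampled keys matches the experiment described next):

```python
import random

def mix64(x):
    """Stand-in 64-bit mixer (splitmix64 finalizer); swap in the hash under test."""
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9 & 0xffffffffffffffff
    x = (x ^ (x >> 27)) * 0x94d049bb133111eb & 0xffffffffffffffff
    return x ^ (x >> 31)

def avalanche_bias(hash_fn, key_bits=64, out_bits=64, samples=4000):
    """bias[j][i] = |P(output bit i flips when input bit j flips) - 0.5|.

    0.0 is ideal ('black' in the diagrams); 0.5 means the output bit is stuck
    with respect to that input bit. Pure Python, so slow but simple.
    """
    flips = [[0] * out_bits for _ in range(key_bits)]
    for _ in range(samples):
        key = random.getrandbits(key_bits)
        base = hash_fn(key)
        for j in range(key_bits):
            diff = base ^ hash_fn(key ^ (1 << j))   # which output bits flipped?
            for i in range(out_bits):
                if diff >> i & 1:
                    flips[j][i] += 1
    return [[abs(count / samples - 0.5) for count in row] for row in flips]
```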

I generated the following avalanche diagrams by using a random sample of 4000 keys (2000 of each type). The x-axis is the input key bit, the y-axis is the output hash bit, and the color of the (x,y) tile is a measure of the bias that I/O pair has. Black indicates the desired 50% flip-probability; bright green indicates that the output bit is “stuck” and, ceteris paribus, it doesn’t ever vary as a result of flipping just that input bit.

This test absolutely wrecks AP, SDBM, both FNV twins, and RS. Jenkins has some poor mixing in its upper bits, but that is mentioned in the implementation. It’s very small, but a slight bias can be observed in City’s lowest bits on the Creative keys. Murmur3 and Spooky are the only two functions left unscathed by this test. Given some of our algorithmic needs, this is a very slight knock against both Jenkins and City.

### Conclusion

After all of this, Murmur3, Jenkins, City, and Spooky are the only functions that I’m really pleased with for our work. I’ll give a slight edge to Murmur3 and City over Jenkins due to the avalanche results, and City’s incredible speed. Spooky’s performance here is notable, but I’m a little uneasy putting it forward as a candidate for use in production, as it is still in beta. I’ll be keeping my eye on it. Based on these results it shows a lot of promise!

The next logical step is to plug some of these in to Timon’s work, and see how they serve as the keystone of our hash table!