Sketching the last year

Sketching is an area of big-data science that has been getting a lot of attention lately. I personally am very excited about this.  Sketching analytics has been a primary focus of our platform and one of my personal interests for quite a while now. Sketching as an area of big-data science has been slow to unfold, (thanks Strata for declining our last two proposals on sketching talks!), but clearly the tide is turning. In fact, our summarizer technology, which relies heavily on our implementation of Distinct Value (DV) sketches, has been in the wild for almost a year now (and, obviously we were working on it for many months before that).

Fast, But Fickle

The R&D of the summarizer was fun but, as with most technical implementations, it’s never as easy as reading the papers and writing some code. The majority of the work we have done to make our DV sketches perform in production has nothing to do with the actual implementation.  We spend a lot of time focused on how we tune them, how we feed them, and make them play well with the rest of our stack.

Likewise, setting proper bounds on our sketches is an ongoing area of work for us and has led down some very interesting paths.  We have gained insights that are not just high level business problems, but very low level watchmaker type stuff.  Hash function behaviors and stream entropy alongside the skewness of data-sets themselves are areas we are constantly looking into to improve our implementations. This work has helped us refine and find optimizations around storage that aren’t limited to sketches themselves, but the architecture of the system as a whole.

Human Time Analytics

Leveraging DV sketches as more than just counters has proven unbelievably useful for us. The DV sketches we use provide arbitrary set operations. This comes in amazingly handy when our customers ask “How many users did we see on Facebook and on AOL this month that purchased something?” You can imagine how far these types of questions go in a real analytics platform. We have found that DV counts alongside set operation queries satisfy a large portion of our analytics platforms needs.

Using sketches for internal analytics has been a blast as well. Writing implementations and libraries in scripting languages enables our data-science team to perform very cool ad-hoc analyses faster and in “human-time”. Integrating DV sketches as custom data-types into existing databases has proven to be a boon for analysts and engineers alike.

Reap The Rewards

Over the course of the year that we’ve been using DV sketches to power analytics, the key takeaways we’ve found are: be VERY careful when choosing and implementing sketches; and leverage as many of their properties as possible.  When you get the formula right, these are powerful little structures. Enabling in-memory DV counting and set operations is pretty amazing when you think of the amount of data and analysis we support. Sketching as an area of big-data science seems to have (finally!) arrived and I, for one, welcome our new sketching overlords.

Netty’s CodecEmbedder

We love Netty. It’s a great full-featured network framework for Java. One of the features that rounds out the framework is the CodecEmbedder. It allows you to test your encoders and decoders without any fuss using a offer-poll paradigm. For example, to test our Rsyslog decoder, we simply:

ChannelBuffer messageBuffer =
    ChannelBuffers.copiedBuffer("2011-06-30T00:00:03-07:00 some.host.agkn.net EVT_NM column1,column2\n", CharsetUtil.UTF_8);
RsyslogDecoder decoder = new RsyslogDecoder();
DecoderEmbedder<IRsyslogMessage> embedder = new DecoderEmbedder<IRsyslogMessage>(decoder);
    embedder.offer(messageBuffer);

IRsyslogMessage message = embedder.poll();
assertNotNull(message, "Decoded message");
assertEquals(message.getTimestamp(), "2011-06-30T00:00:03-07:00", "Timestamp");
assertEquals(message.getHostname(), "some.host.agkn.net", "Host");
assertEquals(message.getProgramname(), "EVT_NM", "Programname");
assertEquals(message.getBody(), "column1,column2", "Body");

One gotcha to watch out for (which always manages to bite me in the butt, and is the impetus for writing this post) is that handlers will only process the type of data that they understand. Data of other types is passed along completely untouched. For example, while the following successfully compiles, it throws a java.lang.ClassCastException: java.lang.String cannot be cast to IRsyslogMessage at embedder.poll():

RsyslogDecoder decoder = new RsyslogDecoder();
DecoderEmbedder<IRsyslogMessage> embedder = new DecoderEmbedder<IRsyslogMessage>(decoder);
    embedder.offer("2011-06-30T00:00:03-07:00 some.host.agkn.net EVT_NM column1,column2\n");

IRsyslogMessage message = embedder.poll();
assertNotNull(message, "Decoded message");

The ChannelPipeline that backs the embedder can handle any type of input object. In the above case the object offer‘d is a string which is simply passed through the RsyslogDecoder untouched and tries unsuccessfully to pop out of the poll as an IRsyslogMessage. As long as you always make sure that your offered object is understood by one of your handlers then the embedder will work as you expect.

Statistical Toolbox: The Kolmogorov-Smirnov Test

Author’s Note: The Kolmogorov-Smirnov test is a handy tool that is conceptually clean, and can be useful in a variety of data analysis situations. I’ll introduce it in the context of a problem that I came across, and give a feel for what it does, and how it might be useful.

A Question and A Tool

I’ve been doing a lot of work with hash functions, and as part of that work I was posed with a question. If I take the same data, encode it two different ways, and feed the two encodings to the same hash function, is there any difference in the statistical properties of the hashed output data sets?

Conceptual map

The model I used to explore this question was to take a great number of SHA1 checksums, and MurmurHash3 these numbers, first encoded as 16 byte integers, and then again as Java Strings. There are a lot of things that one could do at this stage, but the first thing I thought to apply was the Kolmogorov-Smirnov (KS) test.

The Whatnow?

First, some background. The cumulative distribution function (CDF) is a common and natural way of characterizing a probability distribution. The KS test gives us a tool for taking two CDFs and speaking intelligently about how “different” they are. A typical use case is as follows:

  • You collect data that you suspect follows some theoretical distribution (uniform, Poisson, whatever)
  • From the raw data you construct an empirical cumulative distribution function (ECDF)
  • You use the KS test to answer the question, “Assuming my data were sampled from this theoretical distribution, what is the probability of seeing an ECDF that is at least this different from what one would predict?”

A more interesting use case is to compare two empirical distributions for equality. The test is conceptually exactly the same, except instead of comparing a CDF generated from data to one generated by theory, the comparison is between two empirical CDFs. A minor consequence of comparing two empirical data sets is that there is some additional uncertainty that must be dealt with, but this can be addressed by simply using larger samples (see the scaling factors discussed below).

What Does It Look Like?KS Schematic

The figure on the right is very helpful in understanding what is going on in this test.

Given two CDFs, the first thing the KS test does is find their maximum positive and negative differences, D+ and D-, respectively. These differences are scaled to produce so-called “K statistics.” In the case where one is comparing an empirical to a theoretical CDF (shown in the figure), all one needs to do is scale the differences by sqrt(n) where there are n observations. For the comparison of two empirical distributions of size n and m, D+ and D- are scaled by sqrt(nm/(n+m))

This scaling takes care of the idea the same magnitude of difference is more troubling if you have more data. A chance large jump or long lag in your ECDF curve is increasingly unlikely as your samples grow.

For a vanilla KS test, the larger of K+ and K- is compared against the Kolmogorov distribution. This allows you to compute a p-value telling you the probability of seeing a K statistic as large as you did under the assumptions of the null hypothesis that the sample is drawn from the theoretical distribution you are testing it against.

The KS test doesn’t need a lot of data to start detecting fairly small differences. If you have a lot of data, and you want to get fancy, you can break your data set up into many disjoint subsets and run KS test on each of the subsets, keeping the K+ and K- statistics for each subset. You can then pool all of K+ statistics into one collection, all of the K- statistics into another and individually compare them to their theoretical distribution, which is well approximated by 1-e-2x2. In this way you can make good use of all of your data, and better balancing the competing goals of detecting both global and local divergence from the ideal CDF. See TAOCP Vol. II for a more thorough discussion of this technique.

So What Happened?

A simple call to scipy.stats.ks_2samp and some waiting returned a p-value of 0.9977065. The size difference between the two data samples’ ECDFs was well within what one would expect, were they drawn from the same underlying distribution. This result is nice. A good hash function should be as insensitive to the statistical nuances of the input data as possible, always producing a nice, uniform, output. Note that this statistic says nothing about the quality of MurmurHash3‘s output distribution, only that its ability to grind up the name numbers doesn’t appear to suffer dramatically when they are encoded as strings vs. bytes. As it so happens we’ve seen that Murmur is pretty darn good!

Closing Thoughts

As with all test statistics, you shouldn’t blindly accept or reject a result on the basis of some arbitrary cutoff. The KS test can’t tell you whether or not any “statistically significant” difference is practically significant. It is a very sensitive test, and given a large enough sample size can detect differences that are meaningless to your application. It’s certainly worth looking at plots of your ECDFs, repeating your analysis on different subsets of your data, and even judging the results of the test in light of other statistical measures or related data. This test wasn’t end of my analysis of this problem, but it was certainly a useful tool along the way. I hope that it may one day be similarly useful for you!

Additional Resources

Implementations

  • R’s ks.test and ks.boot functions implement the standard and bootstrapped KS test for single and two-sample cases
  • SciPy implements a lot of KS tools in the scipy.stats module
  • Matlab’s versions live in the statistics toolbox
  • Octave has these tests as builtins

Books

  • TAOCP Vol II. Seminumerical Algorithms by Knuth has a very nice writeup, but is focused on 1 sample tests.
  • The KS test is discussed in John Cook’s chapter on testing a random number generator in Beautiful Testing. It is freely readable here.

Papers

No BS Data Salon #2

On Saturday, our illustrious Chief Scientist Matt Curcio sat on the Frameworks, Tools, and Techniques for Scaling up Machine Learning panel at the second No BS Data Salon hosted by MetaMarkets. The discussion ranged from scaling the human aspect of ML and analytics to brass tacks about the difficulties of actually performing ML on web scale data sets.

The theme for this Salon was analytics, and just like last time the focus was on real use cases and a no-nonsense open discussion. We heard great presentations from Sean Gourley of Quid, Ian Wong of Square, and Metamarkets’ own Nelson Ray. Sean gave a great talk about the future of human-scale vs. machine-scale decision-making and analytics. Ian gave an informative overview of the challenges of growing a risk analysis team at Square. Nelson threw down with an awesome technical presentation about how to “A/B test anything”.

This second event in the series was certainly a worthwhile follow-up to the first! Thanks again to Metamarkets for the food, drink, and welcome. You won’t find better hosts than Mike and Nisha!

No BS Data Salon #2 attendees during Sean Gourley's PresentationNo BS Data Salon #2 attendees during Sean Gourley's PresentationNo BS Data Salon #2 attendees caffeinating in between presentations

Choosing a Good Hash Function, Part 3

Author’s note: Part three of a series studying hash functions. My last post identified a few candidate algorithms that are subjected to further scrutiny here today.

The Story So Far

The simplest attribute on which one could imagine differentiating candidate hash functions is the number of collision produced when hashing a fixed pool of keys. By that standard, my last post identified Murmur3, Jenkins, City, Spooky, FNV1/1a, SDBM, AP, and RS as possible contenders. Today we’re going to see how they compare  to each other on some more rigorous tests.

Random Uniformity

A hash function ought to distribute its keys uniformly across its output range. To see how these functions stack up, we’ll put our 42 million unique keys through each hash function, bin the output, and compare the bin counts with expectation:

For bins of equal size, E[bini] = Number of items hashed/Number of bins

Now, uniformity is different from random uniformity. In general the latter is not always necessary for building a good hash table, but the analysis of some schemes assume it. For our purposes, we’re going to want our hashes to look like they are drawn from a random uniform distribution — simple uniformity won’t cut it for our applications. This means that when we look at our bin counts, we want them to be neither too smooth nor too lumpy. To quantify this concept, we’ll use a chi-squared test.

In volume II of TAOCP Donald Knuth provides a somewhat ad-hoc, but easy to understand method for interpreting the p-values calculated by a chi-squared test of randomness. If your p-value is less than 0.01 or greater than 0.99 the process that generated those results is almost certainly non-random. Something less than 0.05 or greater than 0.95 should be considered suspect. Finally, he designates a p-value of less than 0.1 or greater than 0.90 as “almost suspect”.

Here I’ve cut the whole 64 bit output space into 100 bins, and again in 1,000,000 bins. For a final test I modded out the bottom 20 bits, to check their distributions in isolation.

Hash Function 1 Million bins* Bottom 20 bits* 100 bins
AP  0.70  0.50  <0.01
City 0.07  0.29  0.46
FNV64-1  <0.01  >0.99  0.97
FNV64-1a  >0.99  >0.99  0.87
Jenkins  0.17  0.46  0.72
Murmur3  0.14  0.31  0.08
RS  >0.99  >0.99  0.23
SDBM  >0.99  >0.99  >0.99
Spooky  0.84  0.27  0.98

*p-values estimated from a standard normal distribution

Jenkins passes all three of these nicely. City and Murmur each come up “almost suspect” once, and Spooky shows some suspicious behavior in the 100 bin test. I put the heaviest weight on the bottom 20 bit test, and can pretty comfortably give these four functions a pass here. AP does dramatically better at higher bin counts, which is interesting. We can pretty solidly eliminate RS, SDBM, AP, and both FNV variants based on this analysis alone.

As a final note, hash functions are not meant to be RNGs! This test holds them to a very rigid standard that is not generally necessary to build a good hash table. It’s just that in our specific application, we’re going to want our hash values to be somewhat random looking.

Using Keyspace Structure

Before I continue, let me explain a little bit more of the structure of the data I am working with. I have 251 namespaces, each of which has a variable number of 192 and 256 bit keys associated with it. All told I have in the neighborhood of 66 million datapoints of the form (namespace, key). Only the key portion of these tuples actually gets hashed, however. Up until this point, we have been ignoring the namespace attribute of these data points, and thus have been restricted to looking at the 42 million unique (key, hash(key)) pairs. Let’s see if we can exploit larger set of data by including the namespaces!

In the chi-squared analysis above, we did our binning over the union of all namespaces. Now let’s individually bin the hash values of each namespace. All said and done, we have 251 namespaces ranging in size from a tiny handful to several million elements. This gives us 251 vectors of size 100, with

V{n,i} = Number of items of namespace n hashed to the i-th bin

For each namespace, we can compute the mean and variance of its count vector. I’ll leave it as an exercise to the reader, but it’s a pretty simple calculation to show that if you sample from a random uniform distribution, the variance of such a bin-count vector should equal its mean. If the variance is lower than the mean, it implies that the distribution is flatter than expected. On the contrary, if the variance is higher, it implies the existence of hot-spots on the range that are getting more than their fair share of data points hashed there.

Enough with the words, let’s look at the graphs! To generate these, I took the subset of namespaces that had at least 100,000 elements, of which there are 83. Each point is a namespace, and the green line shows the theoretical variance = mean relationship we’d expect from binning a random uniform distribution. Finally, I ran a Bonferroni corrected chi-squared test within each namespace. Those that come out “almost suspect” or worse are highlighted in red.

You can think of these namespaces as small experiments. Together, they help give us a picture of what the chi-squared test done on the whole dataset tells us.

A few observations:

  • Under the 100 bin chi-squared test, SDBM was flagged as being way too uniformly distributed. We can see that quite clearly here. Generally, the variance of the bin counts is quite a bit lower than the mean bin count.
  • On the other hand, AP has a comparatively high variance. This translates, again, to some bins being overly “favored” by the hash function.
  • These pictures also give us some idea of how noisy the functions are on a namespace by namespace basis. Compare Spooky and Murmur3. The residuals for all of the namespaces are quite low, and basically equal for Spooky, whereas Murmur3′s residuals show a lot more variability.

So far we’ve been taking our input sets as a given, and examining the statistical properties of the outputs. While powerful, we need not limit ourselves to these techniques. Onward to avalanche!

Avalanche Analysis

A common test of hash function performance is whether or not it achieves “avalanche.” This refers to the desireable characteristic that

P(Output bit i changes | Input bit j changes) = 0.5 for all i, j

Basically, if we keep all of the input bits the same, save for exactly 1 which we flip, we’d hope that each of our hash function’s output bits changes with probability 1/2.

I generated the following avalanche diagrams by using a random sample of 4000 keys (2000 of each type). The x-axis is the input key bit, the y axis is the output hash bit, and the color of the (x,y) tile is a measure of the bias that I/O pair has. Black indicates the desired 50% flip-probability, bright green indicates that the output bit is “stuck” and, certeris paribus, it doesn’t ever vary as a result of flipping just that input bit.

Avalanche Diagram

This test absolutely wrecks AP, SDBM, both FNV twins, and RS. Jenkins has some poor mixing in its upper bits, but that is mentioned in the implementation. It’s very small, but a slight bias can be observed in City’s lowest bits on the Creative keys. Murmur3 and Spooky are the only two functions left unscathed by this test. Given some of our algorithmic needs, this is a very slight knock against both Jenkins and City.

Conclusion

After all of this, Murmur3, Jenkins, City, and Spooky are the only functions that I’m really pleased with for our work. I’ll give a slight edge to Murmur3 and City over Jenkins due to the avalanche results, and City’s incredible speed. Spooky’s performance here is notable, but I’m a little uneasy putting it forward as a candidate for use in production, as it is still in beta. I’ll be keeping my eye on it. Based on these results it shows a lot of promise!

The next logical step is to plug some of these in to Timon’s work, and see how they serve as the keystone of our hash table!

Big Data Ain’t Fat Data: A Case Study

We’ve always had a hunch that our users stick to the same geographic region. Sure, there’s the occasional jet-setter that takes their laptop from New York to Los Angeles (or like Rob, goes Chicago to San Francisco) on a daily or weekly basis, but they’re the exception and not the rule. Knowing how true this is can simplify the way we work with user-centric data across multiple data centers.

When Rob asked me to find this out for sure, my first instinct was to groan and fire up Hive on an Elastic MapReduce cluster, but after a second, I heard Matt’s voice in my head saying, “Big Data isn’t Fat Data”. Why bother with Hadoop?

The Setup

If I was solving this problem on a small data-set, it’d be pretty straight-forward. I could write a Python script in about 10 minutes that would take care of the problem. It would probably look something like:

users = {}

for line in sys.stdin:
    user, data_center = parse(line)
    try:
        users[user].append(data_center)
    except KeyError:
        users[user] = [data_center]

total_users = len(users)
multiple_dc_users = len([u for u in users if len(users[u]) > 1])

Easy peasy. However, explicitly storing such a large hash-table gets a little problematic once you start approaching medium-sized data (1GB+). Your memory needs grow pretty rapidly – with M users and N data centers, storage is O(MN) – , and things start to get a little slow in Python. At this point there are two options. You can brute force the problem by throwing hardware at it, either with a bigger machine or with something like Hadoop. Or, we can put on our Computer Science and Statistics hats and get a little bit clever.

What if we turn the problem sideways? Above, we’re keeping a hash table that holds a set of data-center for each user. Instead, let’s keep a set of users per data-center, splitting the problem up into multiple hash tables. This lets us keep a small, fixed number of tables – since I’d hope any company knows exactly how many data centers they have – and spread the load across them, hopefully making the load on each table more tolerable. We can then check how many sets each user falls into, and call it a day.

data_centers = dict([(dc, set()) for dc in AK_DATA_CENTERS])

for line in sys.stdin:
    user, data_center = parse(line)
    data_centers[data_center].add(user)

# Get the total users by intersecting all of the data center sets
...

# Get all users who are in exactly one set by taking symmetric differences (XOR) of data-center sets
# and count the size of that set.
...

While this approach theoretically has better performance with the same O(MN) space requirements, with big enough data the space requirements of the problem totally dominate whatever improvement this approach would provide. In other words, it doesn’t matter how small each hash table is, you can’t fit 80GB of user IDs into the 8GB of RAM on your laptop.

It’s looking pretty bleak for the Clever Way of doing things, since what we really want is a magic hash table that can store our 80GB of user IDs in the memory on our laptops.

Bloom Filters

Enter Bloom Filters. A bloom filter is a fixed-size set data structure with two minor features/drawbacks:

  1. You can never ask a Bloom Filter for the set of elements it contains.
  2. Membership queries have a small, controllable, false-positive probability. Bloom filters will never return false negatives.

With a little bit of work, it’s pretty easy to substitute Bloom Filters for plain old hash tables in our sideways approach above. There’s a slight tweak we have to make to our algorithm to accommodate the fact that we can’t ever query a bloom filter for the elements it contains, but the idea remains the same.

The Payoff

Suppose now we’re keeping a bloom-filter of users per data center. The only thing we have to work around is the fact that we’ll never be able to recover the list of users we’ve added to each set. So, we’ll just deal with users each time we see them instead of deferring our counting to the end.

With that idea in the bag, there are really only a few things to worry about when a request comes in for a given data center.

  • Check the bloom filter for that data center to see if the user has been to that one before
  • Check the other bloom filters to see how many other data-centers that user has been to before
  • Count the number of total data-centers that user has seen before. If the user is new to this data center, and the user has seen exactly one other data center before, increment the multiple data center user counter
  • If the user has never seen any of your data centers before, that user is a completely new user. Increment the total number of users seen.
  • If the user has already seen this data-center, this user is a repeat. Do nothing!

We ran our version of this overnight. It took us one core, 8GB of RAM, and just under than 4 hours to count the number of users who hit multiple data centers in a full week worth of logs.

Not bad!

Cookies

At Aggregate Knowledge we are constantly concerned about our data space. And since our most basic data key is cookies (cookie ids) we are very interested in how they behave. To that end we have done a ton of research into what the cookie space looks like in the advertising world and the web in general. Understanding the basic behavior of cookies (count, ingestion rate, growth rate, etc.) is vital for our architecture planning. Here I will show you a view of the cookie space that we collect at Aggregate Knowledge and then take you through some of the research we are doing in the next few posts.

To start things off we asked “How should cookies behave?” It’s pretty easy to model what we expect to see. Let’s make the reasonable assumptions that cookies are finite and persistent. As we track advertising around the web we are randomly sampling from this set of numbers (cookie ids). The question is: how many cookies will I see with respect to the number of ads I show? i.e., if I draw from a set of uniquely numbered balls with replacement, how many draws do I need to see most or all of the numbers? Well, if you think of this as a collision problem with n trials and k draws, you can write the expected number of collisions as:

E[collisions]= n − k + k (1 −1/k)^n

so the expected number of unique values we have seen is just n minus this or

E[uniques] = k*(1 – (1-1/k)^n)

Let’s make some reasonable assumptions and plot this against our data. With an assumption of 500M cookies in the US we would expect:

That seems reasonable. We’ll “see” all of the cookies in about 3 Billion page views. Let’s plot our data on top:

Uh…ok. Well, clearly there are more than 500M cookies. Some of this can be explained by everyone having smartphones and iPads, meaning there are at least a few devices per internet user. All we should really need to do is collect a bit more data on our side and see when the unique cookie vs impression chart starts to keel over. Then I could fit an asymptote curve to it and guess as to how many cookies there are in the world. Well, fortunately we have more data available – let’s look at all of AK’s traffic this summer:

What could this possibly mean? At 40B ad impressions we must have seen a significant amount of the cookies on the internet. So, whats going on? Well, we have some theories (robots, deleters, etc.) and over the next few weeks we’ll share some of our adventures in cookie analysis.

No BS Data Salon

After being quite disenchanted with the state of the Big Data conferences, I thought that I would reach out to some folks that do work similar to ours and plan a mini conference of our own. The first guy that I reached out to was Mike Driscoll, the CTO of MetaMarkets. I had hit the jackpot on the first pull. Mike had been toying with the idea of having a “No BS Data Salon” where he’d get together folks that have challenging problems and present how they’ve solved them in a use-case style format. He wanted to hit at least three levels of the stack: visualization, analytics and data infrastructure. Timon and I encouraged him to take his ideas and make it real since it was exactly what we were thinking.

Today we had the first in the series. It covered data visualization. Mike put together a fantastic group of presenters from Bret Victor to Nick Bilton. All told, there were 5 presenters, a panel discussion on JS visualization tools, and around 20 attendees. It was an awesome opportunity to just talk shop about data and visualizations.

A big thanks to Mike, Nisha and all of the MetaMarkets folks for all of their work and hospitality. And another big thanks to all of the presenters. I certainly look forward to attending and presenting at future get togethers.

Image provided by Xavier Leaute

Choosing a Good Hash Function, Part 2

Author’s note: Part two of a series in which I investigate the performance of a menagerie of hash functions on our data. In today’s episode the analysis begins in earnest with an investigation of collision rates.

Hash function designers have many tools at their disposal, but at their heart, most algorithms follow the same pattern: bytewise iteration over a key during which some internal state is mixed up with the key bits via some combination of ANDs, ORs, XORs, ADDs, shifts, magic numbers, modular arithmetic, and similar tools. As an example, consider the famous FNV hash function, which is astoundingly simple in its construction:

uint64_t fnv1_hash (void *key, int n_bytes)
{
    unsigned char *p = key;
    uint64_t h = 14695981039346656037;
    int i;
    for (i = 0; i < n_bytes; i++) {
        h = (h * 1099511628211) ^ p[i];
    }
    return h;
}

With all hash functions, the hope is that one may sufficiently mix up the input bits such that, on average, the output is uniformly distributed across its available range. If you think that designing such an algorithm sounds tricky, you’re right!

Over the years many hash functions have been developed that vary widely in quality and complexity. There are many that, despite some demonstrable theoretical flaws, have worked well in enough practical applications to have gained popularity. Other algorithms have been designed from the ground up to achieve a variety of theoretical benchmarks. To get started with this project, I spent some time looking around and came up with a list of 16 reasonably well-known functions that run a pretty wide breadth of quality from negative control to veteran. I started with the simplest test imaginable: I have ~42 million keys available, each of which are either 192 or 256 bits long. Given my entire available set of keys, what fraction can be hashed without collision?

Fraction of keys hashed without collision

A few notes about this graph:

  • All hashes are 64 bits.
  • Hashing is hard. Many of these functions do quite poorly compared to sampling from a random uniform distribution. The theoretical expectation here is that 0 keys should collide.
  • It looks like there is a significant hurdle at ~85% of the keys.
  • Although hard to see on this chart, OAT (Bob Jenkins’ less popular one-at-a-time hash) came in just under 100%. While this is a standout performance in comparison to most of the functions tested, it is still below what is expected by theory.
  • Unsurprisingly, Murmur3 and Jenkins eat this data set for lunch. They are carefully designed to work well on a broad variety of inputs, thoroughly tested, and I would have been shocked to see them fail here. They are matched by Google’s City Hash, Spooky Hash (Jenkins’ most recent project, which is still under development), FNV-1/1a, SDBM hash (also known as x65599), RS (Arash Partow‘s version of a hash function designed by Robert Sedgewick), another function of Partow’s own creation.

We’re by no means done here — we’ve simply thinned our list to a few algorithms that merit deeper exploration. The challenge now becomes distinguishing our high performers, and for that we’ll need tools a little bit more sophisticated than simple collision counts. Bring your statistics thinking cap to part 3!

Appendix: Further Reading

  • Unsurprisingly, Donald Knuth’s chapter from The Art of Computer Programming, Volume III: Sorting and Searching is an excellent piece.
  • Bob Jenkins wrote a great article in Dr. Dobb’s back in 1997 that is also a great starting place.
  • More generally, Jenkins’ own website is a treasure trove of material on the subject of hashing
  • There’s a lot of material about FNV to be had here.
  • And let’s not leave out Murmur Hash and City Hash.

Big Memory, Part 4

Author’s Note: This is part 4 of a series of posts about my adventures in building a “large”, in-memory hash table. This post is a summary of some pure Java hash table libraries.

Background

In my last post, I discussed the results of rerunning Nick Welch’s benchmark of C/C++ hash tables. However, since we use the JVM in production, those results were purely academic. I need to either verify or strike down my suspicions in the Java world.

For this test, I’m only concerned with primitive long-long mappings. For those following along, this is a departure from the methodology used in part 3. Here’s why.

Goals

I was hoping to replicate some fraction of the amazing throughput seen in the C/C++ benchmark. (A million inserts per second wouldn’t be a bad start!) More importantly, I sought to learn something about the relative performance of different schemes on the JVM. Would the bit-masking used in power-of-2-sized tables prove faster than the modulus operator used by prime-sized tables? Would the extra pointer hops of an externally chained implementation be more or less costly than the probe chains of open addressing schemes? Would Java even be able to handle a 128GiB hash table? What effect would GC have? What implementation tricks could be learned?

I’ll save you some time right now if you’re actually looking for the exact answers to all those questions: I don’t have most of them. I’ll hopefully get to them in later posts. The available, open-source primitive hash maps simply did not cover enough of the algorithm space to reliably demonstrate the isolated effect of a particular design choice such as power-of-2- vs prime-sized bucket arrays.

Instead, this post’s goals are more modest: I wish to answer the following questions:

  • What is the effect of load factor on the performance of these maps?
  • Given the out-of-the-box implementations available, which designs are the most appealing?
  • What optimizations can be applied to their implementation, given my requirements?
  • Are any of these suitable for my needs right out of the box?

Candidates

Lib. Ver. Rel. Date Class Coll. Res. No. Buckets Hash Fun. Bytes/Entry
mahout-collections 1.0.0 2010 OpenLongLongHashMap double hashing prime [1] 17
Trove 3.0.1 2011 TLongLongHashMap double hashing prime [2] 17
PCJ 1.2 2003 LongKeyLongOpenHashMap double hashing prime [2] 17
fast-util 6.4.1 2011 Long2LongOpenHashMap linear probing 2n MurmurHash3 17
hppc 0.4.1 2011 LongLongOpenHashMap linear probing 2n MurmurHash3 17
PCJ 1.2 2003 LongKeyLongChainedHashMap external chaining prime [2] 48
JDK 1.6.0_27 2011 HashMap<Long, Long> external chaining 2n [3] 100NOTE

[1] value ^ (value >> 32) [2] value ^ (value >>> 32) [3] h ^= (h >>> 20) ^ (h >>> 12); h ^ (h >>> 7) ^ (h >>> 4);

Setup

  • 2 warmup runs, 2 observation runs for each combination of (Class, Load Factor, Size Hint)
  • Centos 5.4, dual-Xeon X7542  (2.67GHz, 6 cores each), 256 GB of RAM
  • 4.72 billion 64-bit integer (976 million uniques) keys inserted with random long values, read from disk
  • baseline read/parse rate was roughly 10 million records per second
  • JVM flags: -server -Xmx128g -XX:+PrintGCTimeStamps
  • Code is on GitHub, added to the benchmark used in part 3. See the section of the README that discusses the “libraries” comparison.

The runs completed are listed below, as are some odd failures that occurred in testing.

The test pushed several hash tables to their breaking point, since they either imposed artificial limits on the backing array length (usually 230 ~ 1.074B) or they ran into Java’s built-in limit, Integer.MAX_VALUE, of 231-1 ~ 2.147B. Extremely high load factors were required to even get some tables to finish the benchmark, which I understand is suboptimal.

In general, I tried to run three different maximum load factors where possible: 0.50, 0.75, and 0.91. The first two are common default load factors for the hash tables in question, and the last one is the highest load factor such that 976,000,000/loadFactor < 230. What’s important to note here is that I actively avoided resizes during the runs by setting an initial capacity corresponding to the load factor. The number of “usable” buckets was held constant (at 976M) while the number of empty buckets varied in order to test the effect of load factor on performance.

Results

The broad strokes are that all of these libraries live between the 1M to 2M inserts per second range, with performance degrading as they approach their maximum load factors, some more than others. The drastically different bottom quartiles are what stand out to me.

Worst-cases for the chained maps are close to full stops whereas the open addressing maps’ throughputs fall by about 25%. HashMap<Long, Long> and LongKeyLongChainedHashMap clearly fall apart in a serious way on a regular basis, roughly every 50 million new keys. It’s not clear to me that the code would allow resizes/rehashes given the initial capacity provided, so my guess was that these were stop-the-world GCs.

I reran the test with -XX:+PrintGCTimeStamps -verbose:gc to verify, and 78% of the real time of the two observation runs was consumed by GC, with regular pauses up to 100 seconds. For contrast, Long2LongOpenHashMap‘s GCs amounted to less than 1% of real time with no pause greater than 0.2 seconds (with the exception of the Full GC that occurred between observation runs). I computed those percentages by dividing the sum of the reported GC durations by the ‘real’ value from time(1).

It makes sense that GC pauses would uniquely affect the externally-chained implementations, since they produce so many objects. (All the open addressing implementations only create a few handfuls of objects per map instance.) In any case, that problem alone disqualifies it for my uses: such significant dips in throughput are unacceptable. Indeed, any externally-chained map will be suspect at these sizes because of the need to create a new object per entry. Perhaps different collectors can mitigate this problem, but given the extremely sound performance of the open addressing implementations, I’m inclined to table the externally chained maps for a while. The operational advantage of simply not worrying about tuning GC is huge for me.

The two most interesting maps proved to be HPPC’s LongLongOpenHashMap and fastutil’s Long2LongOpenHashMap. Both were blazing fast and quite stable, with minor throughput degradation only after 700 million keys. LongKeyLongOpenHashMap, TLongLongHashMap, and OpenLongLongHashMap began their degradation right from the get-go, dropping by over 50% over the course of the test. fastutil’s implementation’s variance was also extremely low, which is a very admirable quality from an operational perspective.

It’s fascinating that a design as straightforward as power-of-two bucket counts and linear probing could be so amazingly effective. It’s the classic example of a ‘flawed’ open addressing scheme: it’s susceptible to primary clustering, the number of probes shoots up as you go past 90% occupancy, and it’s unforgiving of mediocre hash functions.

After thinking about it, though, it makes great sense that it performed so well here. Both favorites use a byte[] to store the state of each bucket, which means that each 64-byte cache line has on average the next 32 bucket statuses pre-cached. This means that walking the linear probe chain is effectively free for the next 32 buckets (compared to going to L1 or, heaven-forbid, main memory). At 90% occupancy, the average successful lookup requires 5.5 probes while unsuccessful lookups require 50 probes. This means that on average we’re fetching at most two cache lines for the probing. Compare this to double hashing, which by design is meant to probe far from the original hash, which requires a line fetch per probe. At 2.5/10 average probes for successful/unsuccessful lookups, you’ll be doing far more fetches with double hashing than with linear probing. If you roll your own status array on top of a long[] and some masking tricks, you could potentially fit 256 2-bit statuses on each cache line, further improving linear probing’s efficiency. ($ cat /sys/devices/system/cpu/cpu0/cache/index0/size to get your cache line size in bytes.)

If you combine linear hashing with a good hash function, like Murmur3, you can avoid primary clustering in practice. What I find curious is that so many of the double-hashing libraries use such a weak hash function. Sure it’s only 2 ops, but is the hash function the real bottleneck in their code? Can they really not afford the extra 6 ops? I doubt it. I’m going to add some instrumentation to a few of these implementations to do probe length histograms to verify what I’ve speculated. Stay tuned.

Conclusions

In order to meet the requirements I’ve enumerated in my previous posts, I’m going to need more than 976 million occupied buckets. I’m going to need 1.5 billion, and a fraction of that again for empty buckets. Unfortunately, 230 is just a bit over 1.073 billion, and well short of 1.5 billion. This means that I can’t use one of the two favorites from my testing (fastutil’s Long2LongOpenHashMap) out of the box. Luckily, HPPC’s implementation does not have an artificial cap on the maximum number of buckets, so moving forward that will be a viable out-of-the-box option. (Edit: It has the same 230 cap as fastutil.) In order to complete my testing, I’ll likely dig up another month’s worth of data and run HPPC and modified variants of the other open addressing hash tables up to about 2 billion keys.


Mapping types  [UP]

In part 3 of this series of posts, I tested out several hash table “services” to see if any of them would give me the end-to-end storage functionality I needed for this project. In that test, I used an append semantic for the performance comparison because I was hoping that one of those “services” would solve all of my storage needs in one shot. Ideally, such a service would provide a fast algorithm, efficient memory management, and solid monitoring facilities. Sadly, all of them failed on the first count and I was forced to move onto a more hands-on approach.

This means worrying about the actual storage implementation and its interaction with the JVM’s GC. If I take the naive route and simply allocate a new long[] for every user’s activity, the memory allocation time alone could cripple the application. It has become clear that storing 1.5 billion long-lived long[]s (about 80GB’s worth) on a high-occupancy heap will cause unacceptable GC pauses. As such, I’ll be storing the user activity chains in a slab allocator (of my own construction) and using the hash table to translate between user IDs and their storage reference. Thus, I’m transitioning from benchmarking appends to puts, and I’m now mapping from long to long instead of long to long[]. Since the slab allocator will return roughly random longs as storage references, I’ve simulated the values of the hash table with a call to Random#nextLong().

An Aside on HashMap<Long, Long> and External Chaining  [UP]

Unlike the other candidates, HashMap<Long, Long> stores boxed keys and values, wrapping each pair in an Entryobject.

    public final class Long extends Number implements Comparable {
        ...
        private final long value;
        ...
    }

    static class Entry implements Map.Entry {
        final K key;
        V value;
        Entry next;
        final int hash;
        ...
    }

On a 64-bit JVM with a starting heap greater than 32GB, each of our 976 million unique users (wrapped in an Entryinstance) requires:

  • 8 bytes for the pointer to the Entry from the backing Entry[],
  • 16 bytes for the object header of the Entry,
  • 8 bytes for the pointer to final K key,
  • 8 bytes for the pointer to V value,
  • 8 byte pointer to Entry next,
  • 4 bytes for int hash,
  • 16 bytes for the object header of final K key,
  • 8 bytes for the underlying long of final K key,
  • 16 bytes for the object header of V value, and
  • 8 bytes for the underlying long of V value

for a grand total of 100 bytes per unique key. This means that at the default 0.75 load factor, you’re looking at 97,600,000,000 bytes for the used buckets and 2,600,000,000 bytes for the empty buckets. 93 GiB later and I haven’t even stored the events themselves. Since boxing has its own CPU costs, I’m not expecting much in terms of performance or memory footprint from HashMap. It’s included here mostly as a baseline, but also to demonstrate it’s unsuitability for this scale of work. The same goes for LongKeyLongChainedHashMap, minus the boxing. Even at a more modest 48 bytes per entry, it’s still using 55 GiB plus the cost of maintaing a billion items on the heap. External chaining is simply unviable at these sizes.

Runs  [UP]

For each load factor, the initial capacity of the table was hinted to be at least 976,000,000/loadFactor in order to avoid resizes, where possible. However, Trove and fastutil automatically do this computation by requesting the number of unique elements expected, instead of a lower bound on the initial bucket count. They were given 976,000,000 as their starting size hint regardless of load factor. As a last resort, 976,000,000 was used as the size hint when a map could or would not finish a test run.

The “Constructor” column indicates whether the constructor expects the initial capacity (“IC”) or the expected elements count (“EE”).

Lib. Class Constructor LF Size Hint
mahout-collections OpenLongLongHashMap IC 0.91 1,073M
0.75 1,302M
0.50 1,952M
Trove TLongLongHashMap EE 0.91 976M
0.75 976M
0.50 976M
PCJ LongKeyLongOpenHashMap IC 0.91 1,073M
0.75 1,302M
0.50 1,952M
fast-util Long2LongOpenHashMap EE 0.91 976M
hppc LongLongOpenHashMap IC 0.91 1,073M
PCJ LongKeyLongChainedHashMap IC 0.91 976M
JDK HashMap<Long, Long> IC 0.91 976M
0.75 1,302M
0.50 1,952M

Oddities  [UP]

When running initial tests, the configuration (TLongLongHashMap, 0.50, 976M) forced a catastrophic six-hour-long rehash at roughly 1.78 billion records in. At first I suspected a pathological probe chain followed by a rehash, however, I could not reproduce this particular case despite using identical data and identical runtime parameters. This is utterly horrifying from an operational perspective, but I’m willing to write this off as some artifact of the machine or perhaps the screen session. In another initial test, the configuration (LongKeyLongChainedHashMap, 0.50, 976M) crashed the Parallel Mark-Sweep garbage collector thread. Again, I couldn’t reproduce the problem despite my best efforts, so I’m writing it off as an artifact. It could have been the 128GB heap, it could have been the hundreds of millions of objects, it could have been anything. Sadly, I don’t even have a reproducible test case to give the JVM team, so I don’t foresee much progress on the bug.

Follow

Get every new post delivered to your Inbox.

Join 31 other followers