Choosing a Good Hash Function, Part 2

Author’s note: Part two of a series in which I investigate the performance of a menagerie of hash functions on our data. In today’s episode the analysis begins in earnest with an investigation of collision rates.

Hash function designers have many tools at their disposal, but at their heart, most algorithms follow the same pattern: bytewise iteration over a key during which some internal state is mixed up with the key bits via some combination of ANDs, ORs, XORs, ADDs, shifts, magic numbers, modular arithmetic, and similar tools. As an example, consider the famous FNV hash function, which is astoundingly simple in its construction:

```uint64_t fnv1_hash (void *key, int n_bytes)
{
unsigned char *p = key;
uint64_t h = 14695981039346656037;
int i;
for (i = 0; i < n_bytes; i++) {
h = (h * 1099511628211) ^ p[i];
}
return h;
}
```

With all hash functions, the hope is that one may sufficiently mix up the input bits such that, on average, the output is uniformly distributed across its available range. If you think that designing such an algorithm sounds tricky, you’re right!

Over the years many hash functions have been developed that vary widely in quality and complexity. There are many that, despite some demonstrable theoretical flaws, have worked well in enough practical applications to have gained popularity. Other algorithms have been designed from the ground up to achieve a variety of theoretical benchmarks. To get started with this project, I spent some time looking around and came up with a list of 16 reasonably well-known functions that run a pretty wide breadth of quality from negative control to veteran. I started with the simplest test imaginable: I have ~42 million keys available, each of which are either 192 or 256 bits long. Given my entire available set of keys, what fraction can be hashed without collision?

• All hashes are 64 bits.
• Hashing is hard. Many of these functions do quite poorly compared to sampling from a random uniform distribution. The theoretical expectation here is that 0 keys should collide.
• It looks like there is a significant hurdle at ~85% of the keys.
• Although hard to see on this chart, OAT (Bob Jenkins’ less popular one-at-a-time hash) came in just under 100%. While this is a standout performance in comparison to most of the functions tested, it is still below what is expected by theory.
• Unsurprisingly, Murmur3 and Jenkins eat this data set for lunch. They are carefully designed to work well on a broad variety of inputs, thoroughly tested, and I would have been shocked to see them fail here. They are matched by Google’s City Hash, Spooky Hash (Jenkins’ most recent project, which is still under development), FNV-1/1a, SDBM hash (also known as x65599), RS (Arash Partow‘s version of a hash function designed by Robert Sedgewick), another function of Partow’s own creation.

We’re by no means done here — we’ve simply thinned our list to a few algorithms that merit deeper exploration. The challenge now becomes distinguishing our high performers, and for that we’ll need tools a little bit more sophisticated than simple collision counts. Bring your statistics thinking cap to part 3!

• Unsurprisingly, Donald Knuth’s chapter from The Art of Computer Programming, Volume III: Sorting and Searching is an excellent piece.
• Bob Jenkins wrote a great article in Dr. Dobb’s back in 1997 that is also a great starting place.
• More generally, Jenkins’ own website is a treasure trove of material on the subject of hashing
• There’s a lot of material about FNV to be had here.
• And let’s not leave out Murmur Hash and City Hash.

1. SpookyHash is finished now. I’ll declare it production Feb 2 unless someone finds a reason for me to change it. The function did change when it went beta Dec 31, right after you made this post.

2. cpesyna says: