Author’s note: Hello, reader! I’m Colin, a new data scientist on the team. This is the first in a series of posts in which I will be describing my efforts to characterize various hash functions for use here at AK. Future posts will discuss the statistical and computational properties exhibited by these algorithms on our data. Additionally, I will be tackling the problem of trying to use the data that we have available to uncover potentially pathological input sets.
At AK, every event that we track is encoded as an n-tuple of 64-bit integers:
key component #1, key component #2, … , key component #n
This is a convenient form for summary and analysis, but obviously not optimal from a storage perspective. Internet advertising is no stranger to large numbers, but 264n is enormous. The set of keys that we will draw from this theoretical universe of keys is comparatively quite small. We find ourselves posed with a problem that looks very much like a natural fit for hashing!
A well chosen hash function, operating at the heart of solidly designed hash table could allow us a big win on both the internal storage/representation front, as well as in wild, freeing up space in client cookies, etc.
Paraphrasing Knuth, one should not choose a random hash function to generate a good hash table. As with any hashing task, there are the three classical issues to consider:
- The size of the hash in terms of the number of bits of output needed to hit your collision (two distinct keys hashing to the same value) goals and remain within your storage constraints
- The distributions of hashes on your input data, and the related problem of collisions
- Computation time
Over the next several posts, I will be putting a number of hash functions through the wringer in an effort to identify a handful that perform well on our data.
[...] note: Part two of a series in which I investigate the performance of a menagerie of hash functions on our data. In [...]