The Sketching Press

bahman

Yesterday we had a visitor at the office, Bahman Bahmani. He was nice enough to give us a preview of his talk for Strata this week. As we are sketching cheerleaders, it was really cool of him to let us see his talk and to trade some war stories. If you are at Strata this week, definitely go and check it out. He has some really cool examples of sketching applications and a detailed description of his work at Twitter for their streaming PageRank sketch.

Sketch of the Day: HyperLogLog — Cornerstone of a Big Data Infrastructure

Intro

In the Zipfian world of AK, the HyperLogLog distinct value (DV) sketch reigns supreme. This DV sketch is the workhorse behind the majority of our DV counters (and we’re not alone) and enables us to have a real time, in memory data store with incredibly high throughput. HLL was conceived of by Flajolet et al. in the phenomenal paper HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. This sketch extends upon the earlier Loglog Counting of Large Cardinalities (Durand et al.) which in turn is based on the seminal FM-85, Flajolet and Martin’s original work on probabilistic counting. (Many thanks to Jérémie Lumbroso for the correction of the history here. I am very much looking forward to his upcoming introduction to probabilistic counting in Flajolet’s complete works.) UPDATE – Rob has recently published a blog post about PCSA, a direct precursor to LogLog counting, which is filled with interesting thoughts. There have been a few posts on HLL recently so I thought I would dive into the intuition behind the sketch and into some of the details.

Just like all the other DV sketches, HyperLogLog looks for interesting things in the hashed values of your incoming data.  However, unlike other DV sketches HLL is based on bit pattern observables as opposed to KMV (and others) which are based on order statistics of a stream.  As Flajolet himself states:

Bit-pattern observables: these are based on certain patterns of bits occurring at the beginning of the (binary) S-values. For instance, observing in the stream S at the beginning of a string a bit-pattern 0^{\rho-1}1 is more or less a likely indication that the cardinality n of S is at least 2^\rho.

Order statistics observables: these are based on order statistics, like the smallest (real) values, that appear in S. For instance, if X = min(S), we may legitimately hope that n is roughly of the order of 1/X…

In my mind HyperLogLog is really composed of two insights: Lots of crappy things are sometimes better than one really good thing; and bit pattern observables tell you a lot about a stream. We’re going to look at each component in turn.

Bad Estimator

Even though the literature refers to the HyperLogLog sketch as a different family of estimator than KMV I think they are very similar. It’s useful to understand the approach of HLL by reviewing the KMV sketch. Recall that KMV stores the smallest k values that you have seen in a stream. From these k values you get an estimate of the number of distinct elements you have seen so far. HLL also stores something similar to the smallest values ever seen. To see how this works it’s useful to ask “How could we make the KMV sketch smaller?” KMV stores the actual value of the incoming numbers. So you have to store k 64 bit values which is tiny, but not that tiny. What if we just stored the “rank” of the numbers?  Let’s say the number 94103 comes through (I’ll use base 10 here to make things easier). That number is basically 9*10^4 plus some stuff. So, let’s just store the exponent, i.e. 4. In this way I get an approximation of the size of numbers I have seen so far. That turns the original KMV algorithm into only having to store the numbers 1-19 (since 2^{64} \approx 10^{19}) which is a whole lot less than 2^{64} numbers. Of course, this estimate will be much worse than storing the actual values.

Bit Pattern Observables

In actuality HLL, just like all the other DV sketches, uses hashes of the incoming data in base 2. And instead of storing the “rank” of the incoming numbers HLL uses a nice trick of looking for runs of zeroes in the hash values. These runs of zeroes are an example of “bit pattern observables”. This concept is similar to recording the longest run of heads in a series of coin flips and using that to guess the number of times the coin was flipped. For instance, if you told me that you spent some time this afternoon flipping a coin and the longest run of heads you saw was 2 I could guess you didn’t flip the coin very many times. However, if you told me you saw a run of 100 heads in a row I would gather you were flipping the coin for quite a while. This “bit pattern observable”, the run of heads, gives me information about the stream of data it was pulled from. An interesting thing to note is just how probable long runs of heads are. As Mark Shilling points out, you can almost always tell the difference between a human generated set of coin flips and an actual one, due to humans not generating long runs. (The world of coin flipping seems to be a deep and crazy pit.) Disclaimer: The only thing I am trying to motivate here is that by keeping a very small piece of information (the longest run of heads) I can get some understanding of what has happened in a stream. Of course, you could probably guess that even though we have now reduced the storage of our sketch the DV estimate is pretty crummy. But what if we kept more than one of them?
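To make the single-observable idea concrete, here is a minimal sketch that keeps only the longest run of zeroes seen at the low end of the hashed values and guesses the cardinality as 2^R. The helper names (hash64, trailing_zeros, crude_estimate) and the use of SHA-1 as the hash are my own assumptions for illustration; this is the crummy one-register estimator, not HLL.

```python
import hashlib

def hash64(value):
    """Hash any value to a 64-bit integer (SHA-1 is an arbitrary choice here)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def trailing_zeros(x, width=64):
    """Number of consecutive zero bits at the low end of x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1

def crude_estimate(stream):
    """Keep one observable, the longest run of zeroes, and guess 2^R."""
    longest = 0
    for item in stream:
        longest = max(longest, trailing_zeros(hash64(item)))
    return longest, 2 ** longest

longest, estimate = crude_estimate(range(10000))
print(longest, estimate)
```

Feeding the same items through again does not change the answer, which is exactly the property a distinct value counter needs. The estimate itself can only ever be a power of two, so it is easy to see why one register is not enough.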

Stochastic Averaging

In order to improve the estimate, the HLL algorithm stores many estimators instead of one and averages the results. However, in order to do this you would have to hash the incoming data through a bunch of independent hash functions. This approach isn’t a very good idea since hashing each value a bunch of times is expensive and finding good independent hash families is quite difficult in practice. The work around for this is to just use one hash function and “split up” the input into m buckets while maintaining the observable (longest run of zeroes) for each bucket. This procedure is called stochastic averaging. You could do this split in KMV as well and it’s easier to visualize. For an m of 3 it would look like:

To break the input into the m buckets, Durand suggests using the first few (k) bits of the hash value as an index into a bucket and computing the longest run of zeroes (R) on what is left over. For example, if your incoming datum looks like 010100000110 and k = 3 you could use the 3 rightmost bits, 110, to tell you which register to update (110_2 = 6) and from the remaining bits, 010100000, you could take the longest run of zeroes (up to some max), which in this case is 5. In order to compute the number of distinct values in the stream you would just take the average of all of the m buckets:

\displaystyle DV_{LL} = \displaystyle\text{constant} * m*2^{\overline{R}}

Here \overline{R} is the average of the values R in all the buckets. The formula above is actually the estimator for the LogLog algorithm, not HyperLogLog. To get HLL, you need one more piece…
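The bucket split in the example above is easy to check by hand in code. This is a small sketch under my own naming (split_hash, run_of_zeros, loglog_estimate are hypothetical helpers); it measures the run of zeroes starting at the low end of the leftover bits, and it leaves the constant as a parameter since its value is not given here.

```python
def split_hash(h, k):
    """Return (register index, remaining bits), using the k rightmost bits as the index."""
    index = h & ((1 << k) - 1)
    rest = h >> k
    return index, rest

def run_of_zeros(rest, max_run):
    """Length of the run of zeroes at the low end of rest, capped at max_run."""
    run = 0
    while run < max_run and (rest >> run) & 1 == 0:
        run += 1
    return run

def loglog_estimate(registers, constant):
    """constant * m * 2^(average R), the LogLog form of the estimator."""
    m = len(registers)
    mean_r = sum(registers) / m
    return constant * m * 2 ** mean_r

h = 0b010100000110
index, rest = split_hash(h, 3)
print(index, format(rest, "09b"), run_of_zeros(rest, 9))  # 6 010100000 5
```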

Harmonic Mean

A fundamental insight that Flajolet had to improve LogLog into HyperLogLog was that he noticed the distribution of the values in the m registers is skewed to the right, and there can be some dramatic outliers that really mess up the average used in LogLog (see Fig. 1 below). He and Durand knew this when they wrote LogLog and did a bunch of hand-wavey stuff (like cut off the top 30% of the register values) to create what he called the “SuperLogLog”, but in retrospect this seems kind of dumb. He fixed this in HLL by tossing out the odd rules in SuperLogLog and deciding to take the harmonic mean of the DV estimates. The harmonic mean tends to throw out extreme values and behave well in this type of environment. This seems like an obvious thing to do. I’m a bit surprised they didn’t try this in the LogLog paper, but perhaps the math is harder to deal with when using the harmonic mean vs the geometric mean.

Fig. 1:  The theoretical distribution of register values after v distinct values have been run through an HLL.

Throw all these pieces together and you get the HyperLogLog DV estimator:

\displaystyle DV_{HLL} = \displaystyle\text{constant} * m^2 *\left (\sum_{j=1}^m 2^{-R_j} \right )^{-1}

Here R_j is the longest run of zeroes in the j^{th} bucket.
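To see why the harmonic mean helps, here is a tiny comparison of the two estimator shapes on a set of registers with one dramatic outlier. The register values are made up for illustration and the constants are left out of both forms, so only the relative behavior matters.

```python
def geometric_style(registers):
    """m * 2^(mean of R): the LogLog form, with the constant left out."""
    m = len(registers)
    return m * 2 ** (sum(registers) / m)

def harmonic_style(registers):
    """m^2 / sum(2^-R): the HyperLogLog form, with the constant left out."""
    m = len(registers)
    return m * m / sum(2.0 ** -r for r in registers)

clean = [3] * 8            # every register agrees
skewed = [3] * 7 + [12]    # one register saw a freakishly long run

for name, regs in (("clean", clean), ("skewed", skewed)):
    print(name, geometric_style(regs), harmonic_style(regs))
```

On the clean registers both forms give 64. With the single outlier, the LogLog form more than doubles while the harmonic form moves by roughly 15%, which is the behavior described above.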

Putting it All Together

Even with the harmonic mean Flajolet still has to introduce a few “corrections” to the algorithm. When the HLL begins counting, most of the registers are empty and it takes a while to fill them up. In this range he introduces a “small range correction”. The other correction is when the HLL gets full. If a lot of distinct values have been run through an HLL the odds of collisions in your hash space increase. To correct for hash collisions Flajolet introduces the “large range correction”. The final algorithm looks like (it might be easier for some of you to just look at the source in the JavaScript HLL simulation):

# hashed, get_register_index, run_of_zeros and count_of_zero_registers are left as helpers;
# log is the natural log

m = 2**b  # with b in [4...16]

if m == 16:
    alpha = 0.673
elif m == 32:
    alpha = 0.697
elif m == 64:
    alpha = 0.709
else:
    alpha = 0.7213/(1 + 1.079/m)

registers = [0]*m  # initialize m registers to 0

##############################################################################################
# Construct the HLL structure
for h in hashed(data):
    register_index = get_register_index( h,b )  # value of the rightmost b bits, in [0, m-1]
    # run of zeroes starting at bit b+1, plus one (Flajolet's rho), so a touched register is never 0
    run_length = 1 + run_of_zeros( h,b )
    registers[ register_index ] = max( registers[ register_index ], run_length )

##############################################################################################
# Determine the cardinality
DV_est = alpha * m**2 / sum( 2**-register for register in registers )  # the raw HLL estimate

if DV_est < 5/2 * m:  # small range correction
    V = count_of_zero_registers( registers )  # the number of registers equal to zero
    if V == 0:  # if none of the registers are empty, use the HLL estimate
        DV = DV_est
    else:
        DV = m * log(m/V)  # i.e. balls and bins correction
elif DV_est <= 1/30 * 2**32:  # intermediate range, no correction
    DV = DV_est
else:  # large range correction
    DV = -2**32 * log( 1 - DV_est/2**32 )
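For anyone who wants to run the algorithm rather than read it, here is a minimal self-contained sketch of the same steps. A few things are my assumptions rather than anything specified above: the hash is the first 32 bits of SHA-1, the class and method names are made up, and each register stores the run of zeroes plus one (Flajolet's rho), so that a register of 0 always means "never touched" in the small range correction.

```python
import hashlib
import math

def hash32(value):
    """32-bit hash of a value (first 4 bytes of SHA-1; an arbitrary choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")

class HyperLogLog:
    def __init__(self, b=10):
        self.b = b
        self.m = 1 << b
        self.registers = [0] * self.m
        if self.m == 16:
            self.alpha = 0.673
        elif self.m == 32:
            self.alpha = 0.697
        elif self.m == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        h = hash32(value)
        index = h & (self.m - 1)   # rightmost b bits pick the register
        rest = h >> self.b         # the remaining 32 - b bits
        width = 32 - self.b
        # rho = (run of zeroes at the low end of rest) + 1, capped at width + 1
        rho = 1
        while rho <= width and (rest >> (rho - 1)) & 1 == 0:
            rho += 1
        self.registers[index] = max(self.registers[index], rho)

    def cardinality(self):
        m = self.m
        raw = self.alpha * m * m / sum(2.0 ** -r for r in self.registers)
        if raw <= 2.5 * m:                               # small range
            zeros = self.registers.count(0)
            return m * math.log(m / zeros) if zeros else raw
        if raw <= (1 / 30) * 2 ** 32:                    # intermediate range
            return raw
        return -(2 ** 32) * math.log(1 - raw / 2 ** 32)  # large range

hll = HyperLogLog(b=10)
for i in range(10000):
    hll.add(i)
print(round(hll.cardinality()))
```

With b = 10 the sketch is 1024 small registers no matter how many values go through it, and the printed estimate should land within a few percent of 10000.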

Rob wrote up an awesome HLL simulation for this post. You can get a real sense of how this thing works by playing around with different values and just watching how it grows over time.

Unions

Unions are very straightforward to compute in HLL and, like KMV, are lossless. All you need to do to combine the register values of the 2 (or n) HLL sketches is take the max of the 2 (or n) register values and assign that to the union HLL. With a little thought you should realize that this is the same thing as if you had fed in the union stream to begin with. A nice side effect of lossless unions is that HLL sketches are trivially parallelizable. This is great if, like us, you are trying to digest a firehose of data and need multiple boxes to do summarization. So, you have:

R_new = [0] * len(R_1)  # R_1 and R_2 must have the same number of registers
for i in range(0, len(R_1)):
     R_new[i] = max( R_1[i], R_2[i] )
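The "lossless" claim can be checked directly. The sketch below builds registers for two overlapping streams and for their combined stream, then compares the register-wise max against the registers of the combined stream. The function names and the SHA-1 based hash are assumptions for illustration.

```python
import hashlib

def registers_for(stream, b=8):
    """Build HLL-style registers (max of run-of-zeroes + 1 per bucket) for a stream."""
    m = 1 << b
    registers = [0] * m
    for item in stream:
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index, rest = h & (m - 1), h >> b
        rho = 1
        while rest & 1 == 0 and rho <= 64 - b:
            rest >>= 1
            rho += 1
        registers[index] = max(registers[index], rho)
    return registers

def union(r1, r2):
    """Register-wise max of two sketches with the same number of registers."""
    if len(r1) != len(r2):
        raise ValueError("sketches must have the same number of registers")
    return [max(x, y) for x, y in zip(r1, r2)]

sketch_a = registers_for(range(0, 6000))
sketch_b = registers_for(range(4000, 10000))
merged = union(sketch_a, sketch_b)
print(merged == registers_for(range(0, 10000)))  # True
```

Because the merged registers are identical to the ones built from the full stream, it does not matter which box saw which slice of the data.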

To combine HLL sketches that have differing sizes read Chris’s blog post about it.

Wrapping Up

In our research, and as the literature says, the HyperLogLog algorithm gives you the biggest bang for the buck for DV counting. It has the best accuracy per storage of all the DV counters to date. The biggest drawbacks we have seen are around intersections. Unlike KMV, there is no explicit intersection logic; you have to use the inclusion/exclusion principle, and this gets really annoying for anything more than 3 sets. Aside from that, we’ve been tickled pink using HLL for our production reporting. We have even written a PostgreSQL HLL data type that supports cardinality, union, and intersection. This has enabled all kinds of efficiencies for our analytics teams as there are fewer round trips to Hadoop and most of the analysis can be done in SQL. We have seen a massive increase in the types of analytics that go on at AK since we have adopted a sketching infrastructure and I don’t think I’m crazy saying that many big data platforms will be built this way in the future.

P.S.  Sadly, Philippe Flajolet passed away in March 2011. It was actually a very sad day for us at Aggregate Knowledge because we were so deep in our HLL research at the time and would have loved to reach out to him, he seems like he would have been happy to see his theory put to practice. Based on all I’ve read about him I’m very sorry to have not met him in person. I’m sure his work will live on but we have definitely lost a great mind both in industry and academia. Keep counting Philippe!

Photo courtesy of http://www.ae-info.org/

Sketch of the Day: K-Minimum Values

Intro

We’ve been talking about probabilistic distinct value counting with sketches (DV sketches) for a while now and have had some fun experiences implementing them into our production environment. In this post I want to talk about a DV sketch that is very intuitive and easy to implement, the K-minimum Values sketch (KMV). While KMV sketches are relatively lightweight and accurate, they are not the best of breed when it comes to DV counting. They are useful in two ways to me though, for exposition and multi-set operations.

History

KMV seems to have been first introduced in 2002 by Ziv Bar-Yossef et al. in the great paper Counting distinct elements in a data stream. In this paper they talk about improving on the basic intuition of the seminal DV sketch papers of Flajolet and Martin and of Alon, Matias, and Szegedy (AMS). (AMS put some formality around the frequency moment problems, bounds of algorithms, etc.) Flajolet and Martin’s paper is in turn based upon work from Morris 1978 (looking for streaks of right-most zeroes i.e. the predecessor to LogLog and HyperLogLog). These are fun to read (although they admittedly get pretty mathy) and it’s cool to see the progression of knowledge, accuracy, and efficiency as these guys do their work. You can almost imagine the fist fights that happen during their meet-ups! The final detailed work on KMV is by Beyer et al. in On Synopses for Distinct-Value Estimation Under Multiset Operations.

How it works

The intuition behind KMV is straightforward. Supposing you have a good hash function, i.e. hash values are evenly distributed over the hash space (I will normalize the hash space output to [0, 1] for the rest of this), then you could estimate the number of distinct values you have seen by knowing the average spacing between values in the hash space. If I see 10 distinct values, I would expect them on average to be spaced about 1/10th apart from each other. We could do this cheaply by keeping track of, say, the smallest value you have ever seen. If the values are indeed uniformly distributed and provided you’ve thrown a decent amount of data through it, you could guess that the smallest value you have seen is a decent estimate of the average spacing of hash values in your space.

Of course, this doesn’t have a lot of “nice” properties. Taking only one value opens you up to a ton of variance and you are fairly dependent on the “goodness” of your hash. In order to improve upon this Bar-Yossef suggests keeping the k smallest values you have ever seen around. The algorithm becomes:

Initialize KMV with the first k distinct hash values
for all h(n):
     if h(n) < max(KMV) and h(n) not in KMV:
          insert h(n) into KMV set
          remove largest value from KMV

Cardinality(KMV):
     return: (k-1)/max(KMV)
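The pseudocode above translates almost line for line into a runnable sketch. The class name, the SHA-1 based hash normalized to the unit interval, and the max-heap bookkeeping are my own choices for illustration; the estimator is the (k-1)/max(KMV) form given above.

```python
import hashlib
import heapq

def unit_hash(value):
    """Hash a value to a float in the unit interval (64 bits of SHA-1; an arbitrary choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

class KMV:
    def __init__(self, k):
        self.k = k
        self._heap = []        # max-heap of kept values, stored negated
        self._members = set()  # the same values, for duplicate checks

    def add(self, value):
        h = unit_hash(value)
        if h in self._members:
            return             # already one of the k smallest
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # replace the current largest kept value with the new, smaller one
            evicted = -heapq.heapreplace(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def values(self):
        return sorted(self._members)

    def cardinality(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct values seen, so the count is exact
        return (self.k - 1) / -self._heap[0]

kmv = KMV(256)
for i in range(20000):
    kmv.add(i)
print(round(kmv.cardinality()))
```

The heap keeps the largest of the k values at hand so each insert is cheap, and the duplicate check keeps a repeated value from pushing out a legitimate member.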

For a KMV sketch of size k=3, graphically you have:

A very straightforward approach. Note that the “-1” in the numerator comes from a bias correction in the estimate. You’re going to have to read the paper for that. So, the size of the sketch is basically k 64-bit values.

Set Operations

Performing set operations with KMVs is also incredibly straightforward. The intuition around unions is that there is no difference between combining 2 KMV sketches and keeping the k minimum values in both versus just keeping one to start with, so unions are “lossless”. To perform union, you merely take 2 sketches and combine their values and keep the k smallest ones (if the 2 sketches are of different sizes, k and k’, then you keep the min(k,k’) values in order to keep the lowest resolution).

Union(A,B):
     k = min( |A|, |B|)
     return: min_k( A U B )
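As a quick worked example of the union step, here it is on two hand-written sketches. The hash values are made up and already normalized to [0, 1], and the function names are hypothetical.

```python
def kmv_union(a_values, b_values):
    """Union of two KMV sketches given as collections of hash values in [0, 1]."""
    k = min(len(a_values), len(b_values))
    return sorted(set(a_values) | set(b_values))[:k]

def kmv_cardinality(values):
    """(k-1)/max(KMV) for a full sketch of k values."""
    k = len(values)
    return (k - 1) / max(values)

A = [0.1, 0.2, 0.4]         # a sketch with k = 3
B = [0.05, 0.2, 0.3, 0.5]   # a sketch with k' = 4
merged = kmv_union(A, B)
print(merged, kmv_cardinality(merged))  # [0.05, 0.1, 0.2] 10.0
```

The shared value 0.2 is only counted once, and the result drops to the smaller of the two sketch sizes, as described above.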

For intersections you use the KMV to estimate the Jaccard coefficient for the 2 (or n) sets. Basically, you treat the 2 KMV sketches for each set as a random uniform sample and intersect these to estimate Jaccard. So, you assemble the k minimum values of the two sets (as you did in union above), and intersect this result with the original sketches to obtain an estimate of the overlap of the 2 sets. The steps are:

IntersectionCard(A,B):
     L = UnionSet(A,B)  # the set this time, not just the cardinality
     k = min( |A|, |B|)
     K = | L ∩ A ∩ B |
     return: K/k * Cardinality(L)
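The intersection steps above can be traced on the same kind of toy input. The values below are made up and the function name is hypothetical; the logic follows the IntersectionCard pseudocode.

```python
def kmv_intersection_cardinality(a_values, b_values):
    """Estimate |A ∩ B| from two KMV sketches given as hash values in [0, 1]."""
    a, b = set(a_values), set(b_values)
    k = min(len(a), len(b))
    union_values = sorted(a | b)[:k]          # L: the k smallest values of the union
    overlap = len(set(union_values) & a & b)  # K = |L ∩ A ∩ B|
    union_cardinality = (k - 1) / union_values[-1]
    return overlap / k * union_cardinality    # Jaccard estimate times union size

A = [0.1, 0.2, 0.4]
B = [0.05, 0.2, 0.3, 0.5]
print(kmv_intersection_cardinality(A, B))
```

Here L is [0.05, 0.1, 0.2], only 0.2 appears in both sketches, so the Jaccard estimate is 1/3 and the union estimate is 10, giving an intersection estimate of about 3.3. Extending to n sets just means intersecting L with all n sketches.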

One of the nice features of KMV which is different than say HyperLogLog, is that you can take n-way intersections by extending the algorithm above. To do this with HyperLogLog you actually need to compute the n-way algebra for set intersection i.e.

|A ∩ B| = |A| + |B| - |A ∪ B|
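To get a feel for how the algebra grows with n, here is a sketch of the general inclusion/exclusion sum written against any function that returns the cardinality of a union. Exact Python sets stand in for the sketches here, which is an assumption for illustration; with HLL you would plug in the union-then-estimate step instead.

```python
from itertools import combinations

def intersection_cardinality(sets, union_cardinality):
    """|S1 ∩ ... ∩ Sn| computed from union cardinalities only."""
    total = 0
    n = len(sets)
    for size in range(1, n + 1):
        sign = 1 if size % 2 == 1 else -1   # + for odd-sized groups, - for even
        for group in combinations(sets, size):
            total += sign * union_cardinality(group)
    return total

def exact_union_cardinality(group):
    """Stand-in for a sketch union followed by a cardinality estimate."""
    return len(set().union(*group))

A = {1, 2, 3, 4}
B = {3, 4, 5}
C = {4, 5, 6}
print(intersection_cardinality([A, B, C], exact_union_cardinality))  # 1
```

The sum has 2^n - 1 terms, and with real sketches every term carries its own estimation error, which is a good part of why this gets annoying quickly.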

However, in our experience of using KMV for set operations on Zipfian data, KMVs still don’t perform as well as HyperLogLog sketches for computing n-way intersections using the same amount of memory.

Expansion to Multisets

One of the nice features of KMV sketches is their expansion to supporting multiset operations, dubbed the AKMV sketch. This is great if you are using them for document representations and want to support document similarity operations like tf-idf (or any other multiset operation). In order to expand the basic KMV structure to support multisets (described here) you just add a counter on top of the values you are storing. In this way you get a decent sample of the counts of things in the stream/document to use for multiset operations. Most other DV sketches, HyperLogLog in particular, don’t support these types of queries.
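The "counter on top of the values" idea can be sketched in a few lines. This is my own minimal reading of it, not the construction from the paper: a dictionary from each kept hash value to the number of times it was seen, with the same k-smallest rule as plain KMV. The function names and the SHA-1 based hash are assumptions.

```python
import hashlib

def unit_hash(value):
    """Hash a value to a float in the unit interval (64 bits of SHA-1; an arbitrary choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2 ** 64

def akmv(stream, k):
    """Return {hash value: count} for the k smallest distinct hash values in the stream."""
    counts = {}
    for item in stream:
        h = unit_hash(item)
        if h in counts:
            counts[h] += 1            # a value we are already tracking
        elif len(counts) < k:
            counts[h] = 1             # sketch not full yet
        else:
            largest = max(counts)
            if h < largest:           # new value displaces the current largest
                del counts[largest]
                counts[h] = 1
    return counts

words = "the cat sat on the mat the end".split()
print(sorted(akmv(words, k=3).values()))
```

A value that ends up in the sketch is tracked from its first occurrence and never evicted, so its count is exact; what you lose is the counts of everything that did not make the cut.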

To see how well this might work in practice, I took a look at some simple tf-idf similarity against the 20 news groups data set. This data set contains about 1000 news group emails on various topics such as atheism and motorcycles (woo!). For each article I constructed an AKMV sketch of the words in it and used this representation as the basis for tf-idf.  I cleaned up the data marginally by limiting my analysis to the 5000 most common words in the corpus (as seems to be the norm) and only considered alphanumeric “words”.   Additionally, I cherry picked only a few newsgroups from the set that showed “nice” separation in the SVD.  You can think of the documents looking a bit like this where the red dots are the entries in the AKMV and the green dots are not (as above):

Once I created the tf-idf matrix, I SVD-ed it and plotted each newsgroup against the second and third singular vectors (the first vector in this case contained mostly information about the mean of the document vectors and contained little real information for classification).  The intermediate singular vectors for differing k were projected onto the actual singular vectors from the complete matrix (k = Inf).  Running through increasing k, the newsgroups look like this:

You can see the structure start to appear relatively quickly for small k and you can also see how some of the articles “stick” to their final spots due to them having fewer than k words.  Clearly you would have to do more work and testing if you wanted to implement something like this in a real classifier or search engine but it seems to be a promising approach.

Here is the same thing for a corpus composed of 23 articles about the Tom Cruise/Katie Holmes divorce and 20 articles about the Higgs boson.


Using document sketches as a basis for a recommender system/search engine or any other application that requires similarity metrics seems like a promising avenue.  It would be very interesting indeed to run some real tests of precision/recall and memory footprint for sketch based recommenders/classifiers against other more standard approaches.

Disclaimer:

I make no claims about having built a classifier of any sort here. A lot of work and decisions would be necessary to move from these ideas to a useful classification scheme in a real environment. I was interested in how much of the flavor of a document would be retained in an AKMV sketch. Based on the above results, I think that the answer is “quite a bit,” even for modest values of k. I don’t think it would be out of the question to try to build a system that allowed you to compute similarities or apply classification tools after the sampling process inherent in the construction of these sketches.

Compression

An interesting thing to notice is that as your DV count gets larger, the max value of your k items gets smaller. This means a simple compression scheme works: just throw away the unused higher-order bits of all the k values. Oddly, as the DV count gets larger your KMV will get smaller without losing accuracy.
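A toy version of that compression, assuming the sketch stores raw integer hash values: find how many bits the largest of the k values actually needs and pack every value at that width. The function names and the sample values are made up for illustration.

```python
def pack(values):
    """Return (bit width, packed integer) for a list of integer hash values."""
    width = max(values).bit_length()   # bits needed by the largest kept value
    packed = 0
    for v in sorted(values):
        packed = (packed << width) | v
    return width, packed

def unpack(width, packed, count):
    """Recover the sorted values from a packed integer."""
    mask = (1 << width) - 1
    values = [(packed >> (width * i)) & mask for i in range(count)]
    return sorted(values)

values = [5, 300, 1023]                # pretend these are the k smallest 64-bit hashes
width, packed = pack(values)
print(width, unpack(width, packed, len(values)))  # 10 [5, 300, 1023]
```

Three values that would take 192 bits as 64-bit integers fit in 30 bits here, and nothing is lost because the dropped bits were all zero.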

Summary

There are many DV sketches in the world and KMV is one of the most interesting due to how easy it is to comprehend and implement. I particularly enjoy using KMV as a pedagogical tool and a solid jumping off point for DV sketching. The fact that KMV is so straightforward makes it stand out in a world of more confusing math and complicated sketching algorithms. In the right context it very well could be the right solution for your sketching needs, especially given the multiset support.

Sketching the last year

Sketching is an area of big-data science that has been getting a lot of attention lately. I personally am very excited about this.  Sketching analytics has been a primary focus of our platform and one of my personal interests for quite a while now. Sketching as an area of big-data science has been slow to unfold (thanks Strata for declining our last two proposals on sketching talks!), but clearly the tide is turning. In fact, our summarizer technology, which relies heavily on our implementation of Distinct Value (DV) sketches, has been in the wild for almost a year now (and, obviously, we were working on it for many months before that).

Fast, But Fickle

The R&D of the summarizer was fun but, as with most technical implementations, it’s never as easy as reading the papers and writing some code. The majority of the work we have done to make our DV sketches perform in production has nothing to do with the actual implementation.  We spend a lot of time focused on how we tune them, how we feed them, and how we make them play well with the rest of our stack.

Likewise, setting proper bounds on our sketches is an ongoing area of work for us and has led down some very interesting paths.  We have gained insights that are not just high level business problems, but very low level watchmaker type stuff.  Hash function behaviors and stream entropy alongside the skewness of data-sets themselves are areas we are constantly looking into to improve our implementations. This work has helped us refine and find optimizations around storage that aren’t limited to sketches themselves, but the architecture of the system as a whole.

Human Time Analytics

Leveraging DV sketches as more than just counters has proven unbelievably useful for us. The DV sketches we use provide arbitrary set operations. This comes in amazingly handy when our customers ask “How many users did we see on Facebook and on AOL this month that purchased something?” You can imagine how far these types of questions go in a real analytics platform. We have found that DV counts alongside set operation queries satisfy a large portion of our analytics platform’s needs.

Using sketches for internal analytics has been a blast as well. Writing implementations and libraries in scripting languages enables our data-science team to perform very cool ad-hoc analyses faster and in “human-time”. Integrating DV sketches as custom data-types into existing databases has proven to be a boon for analysts and engineers alike.

Reap The Rewards

Over the course of the year that we’ve been using DV sketches to power analytics, the key takeaways we’ve found are: be VERY careful when choosing and implementing sketches; and leverage as many of their properties as possible.  When you get the formula right, these are powerful little structures. Enabling in-memory DV counting and set operations is pretty amazing when you think of the amount of data and analysis we support. Sketching as an area of big-data science seems to have (finally!) arrived and I, for one, welcome our new sketching overlords.

Cookies

At Aggregate Knowledge we are constantly concerned about our data space. And since our most basic data key is cookies (cookie ids) we are very interested in how they behave. To that end we have done a ton of research into what the cookie space looks like in the advertising world and the web in general. Understanding the basic behavior of cookies (count, ingestion rate, growth rate, etc.) is vital for our architecture planning. Here I will show you a view of the cookie space that we collect at Aggregate Knowledge and then take you through some of the research we are doing in the next few posts.

To start things off we asked “How should cookies behave?” It’s pretty easy to model what we expect to see. Let’s make the reasonable assumptions that cookies are finite and persistent. As we track advertising around the web we are randomly sampling from this set of numbers (cookie ids). The question is: how many cookies will I see with respect to the number of ads I show? i.e., if I draw from a set of uniquely numbered balls with replacement, how many draws do I need to see most or all of the numbers? Well, if you think of this as a collision problem with n draws from k distinct values, you can write the expected number of collisions as:

E[collisions] = n - k + k*(1 - 1/k)^n

so the expected number of unique values we have seen is just n minus this or

E[uniques] = k*(1 - (1 - 1/k)^n)
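The formula is easy to evaluate for a few impression counts. This is a small sketch; the function name is mine, and it uses log1p/expm1 only to keep precision when k is in the hundreds of millions and 1/k is tiny.

```python
import math

def expected_uniques(k, n):
    """E[uniques] = k * (1 - (1 - 1/k)^n) for n draws with replacement from k values."""
    # (1 - 1/k)^n = exp(n * log1p(-1/k)), and 1 - exp(x) = -expm1(x)
    return -k * math.expm1(n * math.log1p(-1.0 / k))

k = 500e6  # assumed number of cookies
for n in (0.5e9, 1e9, 2e9, 3e9):
    seen = expected_uniques(k, n)
    print("%.1fB impressions: %.1fM uniques (%.2f%% of cookies)" % (n / 1e9, seen / 1e6, 100 * seen / k))
```

At 3 billion impressions the model says we should have seen roughly 99.75% of a 500M cookie pool, since (1 - 1/k)^n is very close to e^(-n/k) = e^(-6) there.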

Let’s make some reasonable assumptions and plot this against our data. With an assumption of 500M cookies in the US we would expect:

That seems reasonable. We’ll “see” all of the cookies in about 3 Billion page views. Let’s plot our data on top:

Uh…ok. Well, clearly there are more than 500M cookies. Some of this can be explained by everyone having smartphones and iPads, meaning there are at least a few devices per internet user. All we should really need to do is collect a bit more data on our side and see when the unique cookie vs impression chart starts to keel over. Then I could fit an asymptote curve to it and guess as to how many cookies there are in the world. Well, fortunately we have more data available – let’s look at all of AK’s traffic this summer:

What could this possibly mean? At 40B ad impressions we must have seen a significant fraction of the cookies on the internet. So, what’s going on? Well, we have some theories (robots, deleters, etc.) and over the next few weeks we’ll share some of our adventures in cookie analysis.

On Accuracy and Precision

A joint post from Matt and Ben

Believe it or not, we’ve been getting inspired by MP3s lately, and not by turning on music in the office. Instead, we drew a little bit of inspiration from the way MP3 encoding works. From Wikipedia:

“The compression works by reducing accuracy of certain parts of sound that are considered to be beyond the auditory resolution ability of most people. This method is commonly referred to as perceptual coding. It uses psychoacoustic models to discard or reduce precision of components less audible to human hearing, and then records the remaining information in an efficient manner.”

Very similarly, in online advertising there are signals that go “beyond the resolution of advertisers to action”. Rather than tackling the problem of clickstream analysis in the standard way, we’ve employed an MP3-like philosophy to storage. Instead of storing absolutely everything and counting it, we’ve employed a probabilistic, streaming approach to measurement. This lets us give clients real-time measurements of how many users and impressions a campaign has seen at excruciating levels of detail. The downside is that our reports tend to include numbers like “301M unique users last month” as opposed to “301,123,098 unique users last month”, but we believe that the benefits of this approach far outweigh the cost of limiting precision.

Give a little, get a lot

The precision of our approach does not depend on the size of the thing we’re counting. When we set our precision to +/-1%, we can tell the difference between 1000 and 990 as easily as we can tell the difference between 30 billion and 29.7 billion users. For example when we count the number of users a campaign reached in Wernersville, PA (Matt’s hometown) we can guarantee that we saw 1000 +/- 10 unique cookies, as well as saying the campaign reached 1 Billion +/- 10M unique cookies overall.

Our storage size is fixed once we choose our level of precision. This means that we can accurately predict the amount of storage needed and our system has no problem coping with increases in data volume and scales preposterously well. Just to reiterate, it takes exactly as much space to count the number of users you reach in Wernersville as it does to count the total number of users you reach in North America. Contrast this with sampling, where to maintain a fixed precision when capturing long-tail features (things that don’t show up a lot relative to the rest of the data-set, like Wernersville) you need to drastically increase the size of your storage.

The benefits of not having unexpected storage spikes, and scaling well are pretty obvious – fewer technical limits, fewer surprises, and lower costs for us, which directly translates to better value for our users and a more reliable product. A little bit of precision seems like a fair trade here.

The technique we chose supports set-operations. This lets us ask questions like, “how many unique users did I see from small towns in Pennsylvania today?” and get an answer instantaneously by composing multiple data structures. Traditionally, the answers to questions like this have to be pre-computed, leaving you waiting for a long job to run every time you ask a question you haven’t prepared for. Fortunately, we can do these computations nearly instantaneously, so you can focus on digging into your data. You can try that small-town PA query again, but this time including Newton, MA (Ben’s hometown), and not worry that no one has prepared an answer.

Unfortunately, not all of these operations are subject to the same “nice” error bounds. However, we’ve put the time in to detect these errors, and make sure that the functionality our clients see degrades gracefully. And since our precision is tunable, we can always dial the precision up as necessary.

Getting insight from data

Combined with our awesome streaming architecture this allows us to stop thinking about storage infrastructure as the limiting factor in analytics, similar to the way MP3 compression allows you to fit more and more music on your phone or MP3-player. When you throw the ability to have ad-hoc queries execute nearly instantly into the mix, we have no regrets about getting a little bit lossy. We’ve already had our fair share of internal revelations, and enabled clients to have quite a few of their own, just because it’s now just so easy to work with our data.

Streaming Algorithms and Sketches

Here at Aggregate Knowledge we spend a lot of time thinking about how to do analytics on a massive amount of data. Rob recently posted about building our streaming datastore and the architecture that helps us deal with “big data”. Given a streaming architecture, the obvious question for the data scientist is “How do we fit in?”. Clearly we need to look towards streaming algorithms to match the speed and performance of our datastore.

A streaming algorithm is generally defined as one that has finite memory – significantly smaller than the data presented to it – and that must process its input in one pass. Streaming algorithms start pretty simple, for instance counting the number of elements in the stream:

counter = 0
for event in stream:
    counter += 1

While eventually counter will overflow (and you can be somewhat clever about avoiding that) this is way better than the non-streaming alternative.

elements = list(stream)
counter = len(elements)

Pretty simple stuff. Even a novice programmer can tell you why the second method is way worse than the first. You can get more complicated and keep the same basic approach – computing the mean of a floating point number stream is almost as simple: keep around a counter as above, add a new variable that accumulates total_sum += value_new, and divide total_sum by counter whenever you want the mean. Now that we’re feeling smart, what about the quantiles of the stream? Ah! Now that is harder.
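The running mean described above can be sketched in a few lines. This is only a minimal illustration: the function name streaming_mean is made up for this example, and it assumes the stream is any Python iterable of numbers.

```python
def streaming_mean(stream):
    """Mean of an iterable of numbers in one pass, using constant memory."""
    counter = 0      # number of elements seen so far
    total_sum = 0.0  # running sum of the values seen so far
    for value_new in stream:
        counter += 1
        total_sum += value_new
    if counter == 0:
        raise ValueError("the mean of an empty stream is undefined")
    return total_sum / counter


# The stream is consumed once and never stored.
print(streaming_mean(iter([1.0, 2.0, 3.0, 4.0])))  # prints 2.5
```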

While it may not be immediately obvious, you can prove (as Munro and Paterson did in 1980) that computing exact quantiles of a stream requires memory that is at least linear with respect to the size of the stream. So, we’re left approximating a solution to the quantiles problem. A first stab might be sampling, where you keep every 1000th element. While this isn’t horrible, it has its downsides – if your stream is infinite, you’ll still run out of space. It’s a good thing there are much better solutions. One of the first and most elegant was proposed by Cormode and Muthukrishnan in 2003, where they introduced the Count-Min sketch data structure. (A nice reference for sketching data structures can be found here.)

The Count-Min sketch works much like a Bloom filter. You set up k empty tables and k hash functions, one per table. For each incoming element we simply hash it through each function and increment the corresponding cell in the corresponding table. To find out how many times we have historically seen a particular element, we hash our query the same way and take the MINIMUM value that we find across the tables. In this way we limit the effects of hash collisions, and clearly we balance the size of the Count-Min sketch against the accuracy we require for the final answer. Here’s how it works:
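The update and query steps can be sketched as follows. This is a minimal illustration and not the implementation we run in production: the class name, the default sizes, and the use of a salted BLAKE2 hash to stand in for k independent hash functions are all assumptions made for the example.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: k tables of `width` counters each."""

    def __init__(self, k=4, width=1024):
        self.k = k
        self.width = width
        self.tables = [[0] * width for _ in range(k)]

    def _index(self, row, item):
        # Assumption: salting one hash with the row number emulates
        # k different hash functions.
        digest = hashlib.blake2b(
            repr(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Increment one cell in every table.
        for row in range(self.k):
            self.tables[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions only ever inflate a cell, so the minimum across
        # tables is the least-contaminated count.
        return min(
            self.tables[row][self._index(row, item)] for row in range(self.k)
        )


sketch = CountMinSketch()
for word in ["cat", "dog", "cat", "bird", "cat"]:
    sketch.add(word)
print(sketch.estimate("cat"))  # never below the true count of 3
```

Because every cell can only be pushed up by collisions, the estimate is never below the true count. Widening the tables or adding more of them trades memory for a tighter overestimate.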

The Count-Min sketch is an approximation to the histogram of the incoming data; in fact, it is only inexact when hashes collide. In order to compute quantiles we want to find the “mass” of the histogram above/below a certain point. Luckily Count-Min sketches support range queries of the type “select count(*) where val between 1 and x;”. Now it is just a matter of finding the quantile of choice.

To actually find the quantiles is slightly tricky, but not that hard. You basically have to perform a binary search with the range queries. So to find the first decile value, and supposing you kept around the number of elements you have seen in the stream, you would binary search through values of x until the count returned by the range query is 1/10 of the total count.
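The binary search above can be sketched like this. To keep the example short, the range query is answered by an exact, sorted stand-in (exact_range_count) instead of a real sketch. That stand-in, the function name find_quantile, and the assumption that values are integers in a known range are all made up for illustration. With a Count-Min sketch the counts would be approximate, and so would the resulting quantile.

```python
import bisect
from fractions import Fraction


def find_quantile(range_count, total, q, low, high):
    """Smallest integer x in [low, high] with range_count(low, x) >= q * total."""
    target = q * total
    left, right = low, high
    while left < right:
        mid = (left + right) // 2
        if range_count(low, mid) >= target:
            right = mid      # enough mass at or below mid, look lower
        else:
            left = mid + 1   # not enough mass yet, look higher
    return left


# Stand-in for the sketch's range query: exact counts over a sorted list.
values = list(range(1, 101))  # pretend the stream was 1, 2, ..., 100


def exact_range_count(a, b):
    """Number of stream values v with a <= v <= b."""
    return bisect.bisect_right(values, b) - bisect.bisect_left(values, a)


first_decile = find_quantile(exact_range_count, len(values), Fraction(1, 10), 1, 100)
print(first_decile)  # prints 10
```

The search makes about log2(high - low) range queries, so the cost of finding a quantile grows with the logarithm of the value range, not with the length of the stream.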

Pretty neat, huh?

The Effects of Ad Campaign Gravity

There exists an interesting and not immediately obvious relationship between the volume (impressions, unique users) of a campaign and its performance.  This generally occurs due to inaccurate measures of attribution.  The concept is simple once you think about it.  Basically, advertisers tend to blast ads all over the internet.  The signal to noise ratio is extremely low for display ads (<1% CTR, <0.01% direct conversions), making optimization difficult in many cases.  The ad space has devised various methods to increase the signal in their campaigns.  A few of these concepts are “view-throughs”, attribution windows, “hover actions” and many others.  As a data scientist you realize that while there is very definitely some type of branding effect in advertising, it is unfair to attribute every downstream purchase to “branding”.  While there is nothing a priori wrong with adding in these types of metrics, it is their interpretation that is tricky. One of the first things you need to correct for in these scenarios is what I dub “campaign gravity”.  Larger campaigns receive more credit (and thus higher measured performance) merely due to their size and not their effectiveness.

Let’s run through a thought experiment and see what this looks like:  Suppose I’m running my e-commerce site, awesomewidgets.com.  For some reason I decide to run a very large ad campaign of just public service announcements (PSAs) that have no reference to my awesome site.  At the same time I launch another ad campaign which is 10 times smaller with similar PSAs.  I would expect the performance of these two campaigns to be identical.  After all, I’m not advertising my site and there should be no reason these campaigns would drive any incremental lift to my sales.  But, when I run the numbers I see that the larger campaign is “outperforming” the smaller one by about 10%.  Let’s take a look at why.

From the Universe of all users (or cookies, however you want to think about it) on the internet there are a few people that converted on my awesomewidgets.com.

I launch both of my PSA ad campaigns and I have the following situation:

What I’m really interested in is the intersection of all 3 circles.  Users in this space have seen ads from both campaigns and converted.  In a last touch attribution model, the way these campaigns are “credited” is via a race condition to see which campaign they saw last.  A user’s history in this case might look like: A,A,B,A,A,B,A,A.  The question is how often is campaign A versus campaign B last in a user’s chain?  The answer involves the ratio of volumes for each campaign.

Enough with the pretty pictures, let’s do the math:

  • Cost of campaign A: $1000
  • Cost of campaign B: $100

Suppose that in isolation they both have the same “CPA” of $1.00 (1000 actions on A and 100 on B).  Note that this CPA is basically fictitious since I’m not driving any actions to my site.  This is the latent baseline CPA of any campaign. However, when I look at them together they have different measured performance.

  • For every converter that sees ads from both campaigns, A “steals” about 90% (actually, 1-1/11) of the attribution away from campaign B
  • If 10% of converters have seen ads from both campaigns, A “steals” 90% of this 10% overlap which is 9%

What this means is that 9% of B’s conversions are now being credited to A and B will appear to be performing worse. So, of the 100 actions that in isolation are attributed to B, 9 of them will now go to A and campaign B’s CPA will go up.  The CPA for campaign B will now look like $100.00/91 actions ~= $1.10.  A 10% reduction in measured “performance”!
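The arithmetic above can be checked with a few lines of code. This is just a sketch of the thought experiment: the function name is made up, and it assumes that impression volume is proportional to spend, so that A wins the last touch with probability 1000/1100 = 1 - 1/11. It also only looks at what B loses to A, as the bullets above do.

```python
def measured_cpa_of_small_campaign(cost_a, cost_b, actions_b, overlap):
    """Measured CPA of campaign B after A 'steals' last-touch credit.

    overlap is the fraction of converters who saw ads from both campaigns.
    Assumption: each campaign's volume is proportional to its cost.
    """
    steal_rate = 1 - cost_b / (cost_a + cost_b)  # A's share of last touches
    stolen = actions_b * overlap * steal_rate    # B's actions credited to A
    return cost_b / (actions_b - stolen)


cpa_b = measured_cpa_of_small_campaign(
    cost_a=1000, cost_b=100, actions_b=100, overlap=0.10
)
print(round(cpa_b, 2))  # prints 1.1
```

Using the unrounded 1 - 1/11 instead of 90% moves about 9.09 actions from B to A, which puts the measured CPA at $1.10, up from the $1.00 baseline.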

What does this mean?  If you are comparing performance across multiple campaigns of very different sizes using a last touch attribution model then bigger always wins.  This can make analysis tricky on the backend to say the least.  These types of gotchas abound in all complex systems and care must always be taken when doing analytics.

There is another reason larger campaigns will appear better and it has to do with improper sampling by smaller campaigns.  Look for a future post on that topic.

Hacking is Indeed a Data Science Skill

Hacking
Recently I ran into a task that required me to manipulate a bunch of disparate log level data. Ahh, the tedium of data mining! There were a few particularly annoying things about this task.

 

  1. The data was split into 3 distinct files. Impression logs, activity logs, and user-attribute logs.
  2. The user-attribute logs were stored sequentially. It looked like:

    request_id, date, user_id, attribute_id

    where there were a varying number of lines for each request_id, representing how many attributes we may have been handed at that time for that user.

  3. The data was reasonably large (~500GB)

What I wanted after this was a flattened, normalized data set to use for various modeling tasks. The output format needed to be:

request_id, date, user_id, activity_id, { attribute_ids }

The first approach I thought of was to get the entire set of unique attributes from the file using something like “cat | cut | sort -u” to create a database table and generate a bunch of inserts. This was dumb, and obviously it gets annoying very quickly. Not to mention that my final data set would be a few hundred GB and my research instance of Postgres would get real annoyed.

How about Hadoop?  While this isn’t a terrible answer, there are a few problems. Mainly, I’m under a deadline and getting 500GB to the cluster would take too long. What I really want is some Unix-foo that I can kick off and forget about. It feels like there is some “cut | join | awk” solution. These are times when I wish I had better Unix skills.  Maybe emacs has a function that does this and brings you lunch (c-x-lunch)?

So, what did I do?  Well, many definitions of data science include the technical skill of “hacking” as a necessary ingredient. One of the finer points of “hacking” has to be social engineering. It’s way easier to get the president of the bank drunk and have him tell you the combination to his lock than it is to crack the safe. So, along these lines I came up with a plan. Most engineers pride themselves on being extremely smart (and most are) and love challenges. This can also get them into trouble though. Next time you walk into an engineering meeting, ask an engineer what sorting algorithm Java uses and if it’s the right choice. One hour wasted!

Our CTO, Rob G., happens to be a brilliant engineer, so I called him up and casually brought up this annoying formatting problem I was having. He immediately started brainstorming solutions and he ended up talking himself into Java as the fastest way (wall clock) that he could get this done.  Fortunately, I’m not really a Java guy. So after Rob convinced himself that his solution was best, he also ended up talking himself into writing all the code. Awesome! Now, my annoying data task was “executing” and I could go back to work on more important things. This entire conversation took about 10 minutes. Much faster than Googling around for Unix foo. The next morning, my data set was all organized and sitting on one of our servers.  Hacking is indeed a useful data science skill!

I guess the moral here is twofold. 1) Sometimes asking for help (and figuring out ways to get it!) really is the best solution, and 2) distributing workloads across your team makes everybody work faster.

P.S. Obviously wasting the CTO’s time is never a good idea. Luckily, Rob is a champion of scheduling and apparently he had a few extra cycles, so no harm done to the greater good.

Hello World!

Hi all,

Matt here, Chief Scientist at AK.  Here at AK we are surrounded by massive amounts of data.  Obviously we aren’t unique in this way: data is the new revolution and data scientists are the leaders of this revolution.  Our approach to leadership in this space is to devise new technologies specific to our industry and apply both new and existing approaches to the ingestion and digestion of data.  It’s extremely exciting to be at the forefront of this revolution at a company where data is our core competency.

In this vein I’ve been doing a lot of thinking about what it means to be a great data scientist and what it means to build a team capable of handling the typical tasks on a daily basis.  This document from O’Reilly does a pretty good job of describing data scientists and the data science “ecosystem”.  I particularly like the sentence: “[Data scientists] can think outside the box to come up with new ways to view  the problem, or to work with very broadly defined problems: ‘here’s a lot of data, what can you make from it?’”

In the upcoming blog posts we will show you the different ways that our team ‘makes’ stuff from our data.
