AWS Redshift: How Amazon Changed The Game

Edit: Thank you to Curt Monash who points out that Netezza is available for as little as $20k/TB/year with hardware (and 2.25x compression) and that there is an inconsistency in my early price estimates and the fraction I quote in my conclusion. I’ve incorporated his observations into my corrections below. I’ve also changed a sentence in the conclusion to make the point that the $5k/TB/year TCO number is the effective TCO given that a Redshift cluster that can perform these queries at the desired speed has far more storage than is needed to just hold the tables for the workloads I tested.

Author’s Note: I’ll preface this post with a warning: some of the content will be inflammatory if you go into it with the mindset that I’m trying to sell you on an alternative to Hadoop. I’m not. I’m here to talk about how an MPP system blew us away, as jaded as we are, and how it is a sign of things to come. There’s enormous, ripe-for-the-picking business value in Redshift and you’d be remiss to ignore it because “we’re a Hadoop shop” or “commercial MPP is too expensive”.

Flash back a few years: We’d struggled mightily with commercial data warehouses that simply couldn’t load the data volumes we wanted to query (at that time this was significantly less than a billion rows) in any decent amount of time, let alone query them semi-interactively. Anecdotally, we’d hear that vendor X had solved our problem or vendor Y had a completely different model that would “completely blow us out of the water“, and so on, but either the prices were astronomical ($100k/TB/year +) or the performance didn’t pan out. In our minds, MPP solutions were a non-starter.

To oversimplify the design we chose for our infrastructure: we built the Summarizer and threw in some Postgres for our product reporting needs, mixed in some Elastic Map-Reduce for the “tough” jobs, and went about our business. We’re very happy with the Summarizer because it solves a particular business need very inexpensively. (We do all of our aggregation on one box and let a couple of Postgres boxes handle the querying.) However, our EMR bill was becoming pretty large, and for what? A handful of reports that weren’t (yet) feasible to do in a streaming setting like the Summarizer. That’s it. Did a handful of reports really merit the enormous cost and capability of Hadoop? Our gut always told us no, but who are we to argue with a working solution? We had other fish to fry.

Fast-forward to (nearly) the present day: the business has grown, and we have ourselves a medium-to-large data (1-200TB) query problem and some cash to solve it. That old ache in our gut told us it was time again to poke our heads up and see if Spring had come. A bit of catching up on research revealed the tradeoff to be the same as ever, either:

  • sign a fat check for Greenplum/Teradata/Aster/Vertica/ParAccel/Kognitio/Infobright/Oracle/IBM/SAP/Microsoft etc… and pay per terabyte per year, or
  • hop onto Hadoop and ride the open-source rodeo with Hive or Pig by your side.

(Sure, sure there are other open-source options out there that aren’t Hadoop, but for example Impala 1.0 just came out and Drill’s nowhere near ready for production.)

Our first question was: “How much are we willing to spend?” Running a Hadoop cluster in-house or on EMR isn’t cheap, and avoiding some of  its hassles is worth something too. The answer varied depending on the vendor, but word on the street was $25k/TB/year $20k/TB/year to $600k/TB/year. We were still stuck between a rock and a hard place: Hadoop was cheaper than the data warehouses, but so much more tool than we needed.

Present day: we hear about AWS’s Redshift offering. It’s basically ParAccel, a columnar, compressed data warehouse that a bunch of AWS engineers “cloud-ified”. I won’t lie, our first reaction was incredulity and dismissive chuckles followed by “Sure, ‘petabyte-scale’. Bet we can break this one with a few days of data.” At their claimed $1k/TB/year TCO, it seemed impossible that they could offer a product that worked. To prove our point, I fired up a few instances, and found out that we were flat out wrong.

Objective

I’m going to try to convince you that Redshift is worth a look by showing you the results of a handful of non-trivial tests I ran, not much more. What’s more is that I’m not going to whitewash any of the issues I ran into. I’m going to tell you about some pretty ugly lumps on this sucker, but by the time I’m done I don’t think it will matter. I think you’ll come to the same conclusion we did: for the price, Redshift is an exceptionally powerful tool that almost any medium data business can use to solve all of their exploration problems for a reasonable price.

I’m here to tell the story of how Amazon and MPP caught up in a big way, and how Redshift is the real deal.

Background

I’ll give you the standard caveats: I ran tests on our data, which has a certain shape, a certain size, and a certain density/sparsity with respect to our reporting keys. Yours may be different. My methods may not work well on your data. In all likelihood, however, if your data is human-generated advertising behavioral data or something like it, Redshift is going to work pretty well with it.

Ground rules for using Redshift

  • Your data must be flat, in a CSV/TSV/*SV format. No nested structures.
  • Your data must be loaded from S3 or DynamoDB.
  • Your data should have a natural group of columns to sort by (that suits the queries you want to run). Your data should also have a column by which it can be evenly distributed across the cluster, when consistently-hashed.
  • Result sets come out via JDBC/ODBC (don’t do this for anything bigger than a few thousand rows) or are shipped out to S3.

Overview of my tests

6 cluster configurations. 1x, 4x, 16x, 32x dw.hs1.xlarge, and 2x, 4x dw.hs1.8xlarge. (For brevity’s sake, I’m going to refer to the node types as ‘xlarge’ and ’8xlarge’ from here on.)
4 data sets. 2B rows, 10B rows, 44B rows, 57B rows in the main fact table. As gzip’d CSV they ranged from 80GB to 2.4TB.

Not all data sets were run on all clusters. Anything smaller than 16x xlarge/2x 8xlarge was simply too slow for the larger (read: more realistic) data sets. I tested them to the point where I could safely rule them out for our workloads and then moved up to the next cluster.

Data sets tested by cluster
2B 10B 44B 57B
 1x dw1.hs1.xlarge X
 4x dw1.hs1.xlarge X X
16x dw1.hs1.xlarge X X X X
32x dw1.hs1.xlarge X
 2x dw1.hs1.8xlarge X X X X
 4x dw1.hs1.8xlarge X

Standard data warehouse table setup. About a dozen numeric columns per table. The largest table (impressions) is 100x larger than the next biggest table (clicks), and 5000x larger than the smallest table (conversions). All tables sorted by date and by user id (SORTKEY), distributed by user id (DISTKEY).

Two types of queries. (pardon the pseudo-SQL, see Appendix for full queries)

  1. SELECT keys, aggregates FROM all_tables_unioned WHERE date_predicate GROUP BY keys
  2. WITH windowed_query_over_user_sessions, SELECT keys, session_statistic_value, COUNT(*) FROM windowed_query_over_user_sessions WHERE date_predicate GROUP BY keys

That is, (1) a basic aggregate report with some hard stuff like COUNT(DISTINCT), and (2) a windowed walk over user sessions that computes a histogram of a certain statistic about certain ads. Think a histogram for each ad of how many times it shows up as the last thing (impression/click/conversion) in a session, the second-to-last-thing in a session, and so on. The catch is we’re not specifying a particular campaign or advertiser for which we’re running these reports. We’re computing everyone’s reports all at once, just to push the envelope a bit.

(Unreasonable?) Expectations

As I mentioned earlier, we walked into this with no small amount of incredulity. We’ve spent a lot of engineering effort keeping ahead of the billions of events we see each day, and as a result our expectations have grown. I mean, if a handful of loons like us can keep up, Amazon’s insane resources should be miles ahead of us, right?

So, I’m going to say some unreasonable things in this section and they should be taken with a grain of salt. For example, when I say I should be able to add nodes to my cluster “quickly”, I mean that I should be able to so within the limitations of copying data across a network, I shouldn’t have to babysit the data transfer, and I shouldn’t have to provision the boxes myself or reconfigure anything. I should be able to click a button, and so on.

  • Per dollar, I want linear scaling. Period. Load, query, etc… If I’m paying more, it should be faster!
  • I want my dollar to be just as well-spent on a xlarge cluster as a 8xlarge cluster. That means N 8xlarge should be “worth” 8N xlarge instances. (I test N=2,4 here.)
  • I want to load the marginal “day”, about 2B rows for our purposes here, in an hour and I should be able to rerun my queries with linear scaling in the new number of rows.
  • I want to be able to painlessly and quickly add nodes to my cluster.
  • I want linear degradation in query times for parallel queries. (Two queries at once will run in twice their normal time, and so on.)
  • I want to be able to shut down and take a snapshot quickly. I want to be able to start a cluster from a snapshot quickly. (Because if I can, I’d like to turn this thing off when I’m not using it, so I don’t have to pay for it.)

Starting a cluster and loading data into it

Starting a cluster, regardless of what size and type, took between 3 and 20 minutes. No rhyme or reason. Once the cluster was up, I would issue queries via psql from a t1.micro in the same zone and security group in order to avoid a firewall timeout issue which will kill long-lasting connections from outside of EC2. The result sets were gzip’d and sent to S3 directly from the cluster.

Each data set had its own S3 bucket and was distributed over hundreds of gzip’d files in that bucket. I started by trying to load the whole bucket at once with something like

COPY impression FROM "s3://test-data/impressions/" WITH CREDENTIALS ... GZIP DELIMITER ','  MAXERROR AS 100000;

This worked for the smallest data set, but as soon as I moved to the larger ones I would see errors like:

ERROR:  S3ServiceException:speed limit exceeded,StatusCode 500,ErrorCode N/A,RequestId N/A,ExtendedRequestId N/A,CanRetryException 1
DETAIL:
  -----------------------------------------------
  error:  S3ServiceException:speed limit exceeded,StatusCode 500,ErrorCode N/A,RequestId N/A,ExtendedRequestId N/A,CanRetryException 1
  code:      9001
  context:   S3 key being read : impressions/201.csv.gz
  query:     1446
  location:  table_s3_scanner.cpp:355
  process:   query1_s9_2 [pid=5471]
  -----------------------------------------------

and the whole load would fail. (A failed load leaves the database in the prior, intact state, so you don’t have to worry about being in a halfway-loaded state.)  I started splitting the loads into smaller chunks of about 100 files, but sometimes those would fail too. I ended up splitting them into groups of 10-20 files (3GB-60GB total) and those would succeed very reliably. The tradeoff is that there is an overhead to starting up and winding down a new COPY command, so the fewer COPYs, the better for overall load speed.

This slideshow requires JavaScript.

As you can see, there are dips in CPU utilization/network throughput between COPYs that isn’t present on loads that grouped more files together. The dips come from an initial analysis phase before the download and a final sort phase after download. I saw a 40% decrease in load speed when I went from 200 files-at-a-time to 20 files at a time for the same data set.

Once I had settled on a sane way of chunking the loads, the overall COPY speed scaled linearly with cost and with data size, measured over hundreds of chunked loads. I was also pleasantly surprised to find that there was very, very little variability within the load times, when controlling for chunk size and file count. Each xlarge node could load about 3.17MB/sec of compressed S3 data or 78k rows/sec and each 8xlarge node could load about 23.8MB/sec or 584k rows/sec, meaning one of the latter was worth about 7.5 of the former while costing 8 times as much. Roughly speaking, every extra dollar spent bought a marginal 3.5-3.7MB/sec in load speed. In row throughput, each marginal dollar gave us an additional 80-100k rows/sec. The plots below are averages of the chunked load times grouped by data set and cluster. (The costs are all in terms of the on-demand, un-reserved instance costs.)

COPY Cost Scaling

copy_cost_scaling_rows-1

There is an outlier in the load time for the 2B row data set on the 16-node xlarge cluster (lavender circle). I believe that this was caused by the relatively high node count compared to the relatively small file count and data size. Informally, I saw this regression on both 16- and 32-node clusters for data sets with fewer than 200 files and less than 100GB of data. Overall, you can count on equally-priced xlarge clusters outperforming their 8xlarge peers by 5-20% in COPY times as long as the number of files and total sizes are large enough. (That said, there is a 32-node limit on xlarge clusters, so after a certain point your hand will be forced and you’ll have to move to the larger nodes.)

After loading the data, it is a best practice to VACUUM the tables to ensure that the rows are all in sorted order. Each COPY will sort the rows in that load, but will not merge them into the existing sorted rows. Not VACUUM’ing will force your queries to execute over the multiple sorted sections (which I’ll call partitions), which can significantly diminish performance on queries that depend heavily on the sorted order of the data. For queries that didn’t take advantage of the sorted-order of the data, the difference between the VACUUM’d data and the load-partitioned data was minimal until I hit dozens of partitions. For queries that did rely on the sorted-order for their execution, I saw a night-and-day effect: either a minor (1-5%) degradation in query time or the query wouldn’t finish for tens of hours. In general, I’d just pay the cost of VACUUMing to avoid the edge cases where the queries would implode.

The plot below shows the VACUUM rate per dollar/hour in both types of clusters, for on-demand, non-reserved instances. (I know the units obfuscate the real performance, but it’s just to normalize clusters within their node type and across node types.)

vac_rate_per_dollar

Per dollar, the primary driver of the VACUUM rate was how many rows were already present in the table before COPY (assuming they were already VACUUM’d themselves.) xlarge clusters performed 10-25% better on VACUUMs than their 8xlarge counterparts, on similar workloads.

Is it feasible to load 2B more rows in about an hour with any of the clusters? Not the way I have my data set up, it isn’t. The COPY part is easy. It would take less than half an hour to COPY onto the 2x 8xlarge and 16x xlarge clusters, but the VACUUM wouldn’t be nearly so quick. With over 50B rows already in the table, both node types struggle to VACUUM more than 200k row/sec, which means at least 3 hours of VACUUM time. I suspect if I had manually split my tables into smaller chunks (say 10B-20B rows) like the docs told me to, I would have been able to COPY and VACUUM the 2B extra rows in just about an hour and a half.

Aside: let’s just do the math quickly from the plot above. Say we’re talking about a 8xlarge cluster with two machines. Say I have about 15B rows in my table and I’m adding 2B more. I’ll eyeball it and say that I could do 45k rows/sec/$. That cluster costs  $13.60/hour. That means I should see a VACUUM rate of about 612k rows/sec. That means it should take about 55 minutes. The plots earlier showed that the same cluster could load at a rate of 1.1M rows/sec. That means 2B rows get COPY’d in 30 minutes. Like I said, about an hour and a half.

For loading, I saw linear scaling per dollar and observed just a bit worse than the 8:1 price ratio between the small and big nodes. I could also load my marginal day in about the time I wanted. Not bad for a start.

Note: I know I haven’t mentioned anything here about column compression, which is a big selling point of Redshift. My objective in this testing was to see how Redshift performed “out-of-the-box”, which meant leaving it to its own devices for compression. It seems like it’s own devices are quite good: the compression types it selected for the columns were exactly what I would have chosen, knowing the exact column statistics. Furthermore, the reported storage used by Redshift was only 1.6 times greater than the original GZIP’d files for all three data sets, which meant that about 115B rows fit into a little under 8TB of cluster storage. A 2-node 8xlarge/16-node xlarge cluster comes with 32 TB of cluster storage, meaning I could run a 24/7 cluster with about 450B rows in it for $70k/year with a 1-year reserve purchase. For a more tactical perspective, 450B/year equates to an average of 14k events/sec ingress for the whole year straight. Just buying 16 machines that look like xlarge instances would probably cost around $50k. If you factor in amortization over three years, we’re looking at a (3 x $70k) – $50k = $160k difference to cover any other operating expenses such as ops salaries, MPP licenses, and facility costs. A ParAccel license for $1k/TB/year alone would cost about $100k for those three years, and that would be a damn cheap license. (For a 30TB cluster over three years, a $2500/TB/year license alone would be more than the all-in AWS cost!)

Querying the data

As I mentioned above, I tested two classes of queries:

  1. a simple aggregation that is only restricted by a date predicate, and
  2. a windowed walk over user sessions, computing and aggregating statistics for those sessions.

The first query class is the bread and butter of our reports, but was actually a very challenging query in the table setup described above because the table is only sorted by one of the six GROUP BY columns (date), which means that either a giant hash map of all the keys (of which there were 10s of millions) needed to be populated for each day, or the whole data set had to be resorted according to those keys. The balance I tried to reach when defining the schema was to make the “impossible” queries (#2) easy even if I had to sacrifice performance on the simpler queries. In a production situation, I’d probably have two schemas side-by-side, one sorted for the first class of queries, and another for the second given how much storage Redshift nodes come with.

Though this query does have a date predicate, the plot below is the average execution time for multiple runs with a predicate that includes every row in every table.

simple_agg_query_time

As you can see, the identically-priced xlarge cluster is about three times faster for the two largest data sets, and finishes the query on the 57B row data set in about three hours. The xlarge cluster query times degrade linearly with the number of rows in the data set, right up until the last data set where there’s a clear super-linear jump. The 8xlarge cluster, from the start, seems to degrade almost-quadratically.

The poor performance of the 8xlarge nodes was very surprising. I would have expected that the cost of shuffling/merging results across 16 nodes would hurt more than having 16 separate nodes helped. (Recall that the DISTKEY was in no way connected to the reporting keys of this query.) Just as in the COPY section, the distribution of work over nodes seems to be more efficient than the distribution of work over cores, as shown by the (more) graceful degradation of performance of the xlarge cluster. (This will be a recurring trend.)

The second class of queries walked over users’ sessions and produced histograms of how many times each ad appeared in various positional offsets and date offsets from the end of a user’s session. I also produced a version of the query than ran only on user sessions that contained at least one conversion. (In this data set, approximately 1 in 250 users are converters.)

session_walk_query_time-4

As you can see, clusters composed of many small nodes bested their large-node counterparts by about 20% on all but the population date histogram query, where the small nodes were five times faster. The more-selective versions of the two queries were at least an order of magnitude faster than their ‘Population’ counterparts on the same clusters. Of course, it’s nowhere near a one-to-one speedup with the selectivity of the converters predicate, but going from 90 minutes to 7.5 minutes is impressive given that there’s a non-trivial join from a 57B row table to a 20M row table happening in the “Converters” queries. The query times clearly degrade linearly in the number of rows in the data set, perhaps with the exception of the date histogram for the population.

Similarly, query times halved (approximately) when I doubled the number of nodes in the cluster, in both clusters, for the battery of queries over the largest data set. (Apologies ahead of time for not synchronizing the colors of the queries between the plot above and the one below. It’s just a real pain.)

query_cost_scaling-1

Note the above graph has log-scaled query time to demonstrate that both types of clusters see linear or better-than-linear per-dollar scaling on all of the queries I performed. Note, also, that certain queries improve more per dollar over the transition in the 8xlarge cluster. (Unfortunately, I couldn’t explore this comparison any further because of Amazon’s 32-node limitation for dw.h1.xlarge clusters.)

So I won’t say that I saw linear scaling across rows and nodes, but I will say this: it was pretty close to it. It’s clear that my 16-node xlarge cluster’s linear-ish performance was starting to fall apart on queries that had to touch more than 50B rows (the little uptick seen between the last two data points.) But let’s just back that up for a second. The cluster assembled and walked 4B sessions and computed the histograms for approximately 100,000 distinct ads in just under two hours. When restricting to converters, this happened in just over three minutes. I thought I had written the wrong query the first time the query returned that quickly, so I went back and wrote even more thorough unit tests for it. After convincing myself I had the right query, I restarted the cluster and reran the query to find it had once again run in the same amount of time, to the second. And then I did it again, and it was off by a few seconds. And again, and again, and again. I saw this repeated in all of my queries: the standard error on query time hovered around 1% for the most variable queries.

To get back to my point about business value, I caused a small riot among the analysts when I mentioned off-hand how quickly I could run these queries on substantial data sets. On a service that I launched and loaded overnight with about three days of prior fiddling/self-training. Sure, putting this into production is a whole other beast, but I contend it’s no worse than the alternative of trying to keep a Hadoop cluster running or trying to reliably process jobs. And I don’t even have to teach any of the analysts a particular flavor of SQL. They get to reuse the Postgres-flavored stuff that they know and love.

I’ll quickly throw down some caveats/lessons-learned on querying:

  • Running any two of those queries at the same time doubled the nominal execution time of each individually. Three at a time tripled it plus change. At four, however, the cluster crawled to a halt and took about 10x the time. At five concurrent queries, the cluster went into an unhealthy state and rebooted, losing all history of the concurrent queries. The takeaway is that you really need to implement work-load management (WLM) if you’re going to run concurrent queries on the cluster.
  • Learn to cancel queries before you run anything serious, and be sure to EXPLAIN and analyze them, as you can often avoid costly writes to disk by simply adding redundant predicates and inlining WITH clauses.

Snapshotting a cluster

Snapshotting is a relatively painless process through the UI. It’s as simple as clicking a button and obeying some unreasonably-strict naming policies. I turned off the auto-snapshotting and managed mine manually because I found the periodic dips in performance caused by the auto-snapshotting very annoying. Backing up my 4-5TB cluster reliably took 2-3 hours, which means I was getting about about 400MB/sec to S3 from my 2-node 8xlarge/16-node xlarge clusters which isn’t out of line with what I’ve heard from our internal teams that push big result sets from EMR back to S3. Good enough in my book.

It’s important to note that booting a cluster from a snapshot always yields the same node type and count, meaning you can’t up- or down-grade a cluster from a snapshot. Once you have a snapshot, it’s actually very fast to boot a cluster from it. In my experience it took about 20 to 30 minutes which just seemed insane until I realized that the cluster seems to become available well before the “work” of loading the snapshot is complete. The dashboard shows a whole lot of CPU, network, and disk usage all before I submitted my first query.

post_load_snapshotMake no mistake, this is no phantom: those are real resources being used. I saw queries run significantly more slowly during this post-snapshot-load period, as well as a much less responsive web dashboard. (Oh, did I mention that? The web dashboard is querying directly to your cluster, so if your cluster is wedged on something, it’s almost impossible to debug it. Makes WLM all the more important.) This work usually took between 45 minutes to an hour to clear up, after which everything seemed to chug along just fine. I’d say it took a bit over an hour to get a cluster from snapshot to full-speed.

Resizing a cluster

In Redshift, resizing a cluster is basically an automated process of launching a new cluster, starting the data transfer from old to new in the background while answering new and existing queries, then seamlessly finishing the running queries and pointing all new ones at the new cluster. Sure there is some performance degradation of queries run during this period, but no more than I saw when running two concurrent queries. In fact, my experience resizing a cluster was nothing short of remarkable. (My tests were fairly limited, since running one test meant launching a cluster from a snapshot, waiting for the work to finish, launching a resize and then shutting down the cluster so I didn’t waste my testing credit. This whole process could take several hours start to finish, so I won’t pretend that I have anything more than anecdotal data here.) I saw an effective rate of about 175MB/sec transfer when upgrading from 2-node 8xlarge/16-node xlarge to 4-node 8xlarge/32-node xlarge clusters, meaning I looked up the amount of space Redshift reported using on disk and then divided that by the time between hitting the resize button and being able to run my first query on the new cluster. For my data sets, that meant 6-7 hours of degraded query performance (but not downtime!) followed by a seamless transition. Even query endpoints were seamlessly switched to the new cluster without my client even noticing! Not a bad deal, especially since you don’t even have to worry about the old cluster–it shuts down and gets cleaned up in the background.

Recap

So let’s review my crazy expectations and see how Redshift did.

  • Per dollar, I want linear scaling. Within reasonable bounds (like an order-of-magnitude row count range), I found that to be true for loading, querying, snapshotting, and resizing.
  • I want my dollar to be just as well-spent on a xlarge cluster as a 8xlarge cluster. For everything but querying, the 8xlarge nodes were about 8x an xlarge node. (But if you can get away with it, stick with xlarge clusters!)
  • I want to load the marginal “day” in an hour and query it. Looks like an hour and a half is the real number. However, querying scaled linearly and without complaint once the COPY and VACUUM completed.
  • I want to be able to painlessly and quickly add nodes to my cluster. A few clicks in the UI, wait a few hours, no downtime. Feels painless and quick to me!
  • I want linear degradation in query times for parallel queries. Mostly true, until you crush the thing. Break it a few times to test the limits on your queries, then set up WLM and don’t worry about it. If you’re using this mostly for periodic reporting, then you probably won’t even notice.
  • I want to be able to snapshot/boot quickly and easily. I honestly don’t know what kind of magic is going on that lets me boot a snapshot in an hour and then let me query 3TB of data, but I’ll take it, degraded performance and all.

Redshift knocked it out of the park, in my opinion. (And this is coming from AK, the “do-everything-streaming” peanut-gallery. If we’re convinced, that’s gotta count for something, right?) In fact, the appeal of Redshift is much like the appeal of the Summarizer to me: it takes a non-trivial business problem (executing expressive SQL over medium data) and completely undercuts the mainstream solutions’ price points by trading away generality for ease-of-use and performance.

And sure, it has some flaws: COPY failures, load-induced cluster crashes, dashboard slowness, and so on. But I don’t care. I found workarounds for all of them over the course of a few days. Moreover, my communication with the Redshift team has been nothing but constructive, and given what I heard the beta days were like, they’re making real progress. I have no doubt that the service is only going to get better and cheaper.

Oddly enough, Redshift isn’t going to sell because devs think it’s super-duper-whizz-bang. It’s going to sell because it took a problem and an industry famous for it’s opaque pricing, high TCO, and unreliable results and completely turned it on its head. MPP was never a bad idea. Selling that way was. Yeah, the effective TCO is closer to $5k/TB/year than it is to their stated $1k/TB/year, but the pricing scheme is transparent and it’s half a quarter the price of the other MPP alternatives.

Now, rightfully, you could argue that it’s not Teradata or Vertica that it’s competing with anymore, but rather Hadoop. And clearly, it can’t do everything Hadoop can, but I contend that it doesn’t need to. All it needs to do is convince people that they don’t need Hive, and that isn’t a hard sell for a ton of businesses out there. Having vanilla SQL along with a familiar data model, and the added speed of a system built for these types of queries is probably justification enough to pick this over Hadoop+Hive if BI is what you’re after. The hosted/as-a-service aspect is just frosting on the cake at that point.

Appendix: Queries

Query 1: Simple Aggregate

This query simulates the “wide” aggregates that the Summarizer generates. Specifically, it does multiple simple grouping aggregates and coalesces them with an outer join. It represents the simplest type of reporting we do.

WITH imps AS (
    SELECT
        campaign_id,
        tracking_campaign_id,
        inventory_placement_id,
        audience_definition_id,
        creative_group_id,
        creative_id,

        SUM(1)                      AS raw_impression,
        COUNT(DISTINCT(ak_user_id)) AS uu_impression,
        SUM(media_cost)             AS media_cost,
        SUM(data_cost)              AS aggregate_attribute_data_cost

    FROM
        impression
    GROUP BY
        1,2,3,4,5,6
),

clicks AS (
    SELECT
        campaign_id,
        tracking_campaign_id,
        inventory_placement_id,
        audience_definition_id,
        creative_group_id,
        creative_id,

        SUM(1)                      AS raw_click,
        COUNT(DISTINCT(ak_user_id)) AS uu_click,
        0::float8                   AS click_revenue

    FROM
        click
    GROUP BY
        1,2,3,4,5,6
),

a_conv_click AS (
    SELECT
        campaign_id,
        tracking_campaign_id,
        inventory_placement_id,
        audience_definition_id,
        creative_group_id,
        creative_id,

        SUM(1)                      AS raw_click_attributed_conversion,
        COUNT(DISTINCT(ak_user_id)) AS uu_click_attributed_conversion,
        SUM(attributed_revenue)     AS click_attributed_conversion_revenue
    FROM
        attributedconversion
    WHERE type_id = 1
    GROUP BY
        1,2,3,4,5,6
),

a_conv_imp AS (
    SELECT
        campaign_id,
        tracking_campaign_id,
        inventory_placement_id,
        audience_definition_id,
        creative_group_id,
        creative_id,

        SUM(1)                      AS raw_impression_attributed_conversion,
        COUNT(DISTINCT(ak_user_id)) AS uu_impression_attributed_conversion,
        SUM(attributed_revenue)     AS impression_attributed_conversion_revenue
    FROM
        attributedconversion
    WHERE type_id = 2
    GROUP BY
        1,2,3,4,5,6
)

SELECT
    COALESCE(imps.campaign_id           ,  clicks.campaign_id           , a_conv_click.campaign_id           , a_conv_imp.campaign_id           ) AS campaign_id           ,
    COALESCE(imps.tracking_campaign_id  ,  clicks.tracking_campaign_id  , a_conv_click.tracking_campaign_id  , a_conv_imp.tracking_campaign_id  ) AS tracking_campaign_id  ,
    COALESCE(imps.inventory_placement_id,  clicks.inventory_placement_id, a_conv_click.inventory_placement_id, a_conv_imp.inventory_placement_id) AS inventory_placement_id,
    COALESCE(imps.audience_definition_id,  clicks.audience_definition_id, a_conv_click.audience_definition_id, a_conv_imp.audience_definition_id) AS audience_definition_id,
    COALESCE(imps.creative_group_id     ,  clicks.creative_group_id     , a_conv_click.creative_group_id     , a_conv_imp.creative_group_id     ) AS creative_group_id     ,
    COALESCE(imps.creative_id           ,  clicks.creative_id           , a_conv_click.creative_id           , a_conv_imp.creative_id           ) AS creative_id           ,
    imps.raw_impression,
    imps.uu_impression,
    imps.media_cost,
    imps.aggregate_attribute_data_cost,
    clicks.raw_click,
    clicks.uu_click,
    clicks.click_revenue,
    a_conv_click.raw_click_attributed_conversion,
    a_conv_click.uu_click_attributed_conversion,
    a_conv_click.click_attributed_conversion_revenue,
    a_conv_imp.raw_impression_attributed_conversion,
    a_conv_imp.uu_impression_attributed_conversion,
    a_conv_imp.impression_attributed_conversion_revenue
FROM
         imps
FULL OUTER JOIN clicks ON
            imps.campaign_id            = clicks.campaign_id            AND
            imps.tracking_campaign_id   = clicks.tracking_campaign_id   AND
            imps.inventory_placement_id = clicks.inventory_placement_id AND
            imps.audience_definition_id = clicks.audience_definition_id AND
            imps.creative_group_id      = clicks.creative_group_id      AND
            imps.creative_id            = clicks.creative_id
FULL OUTER JOIN a_conv_click ON
            imps.campaign_id            = a_conv_click.campaign_id            AND
            imps.tracking_campaign_id   = a_conv_click.tracking_campaign_id   AND
            imps.inventory_placement_id = a_conv_click.inventory_placement_id AND
            imps.audience_definition_id = a_conv_click.audience_definition_id AND
            imps.creative_group_id      = a_conv_click.creative_group_id      AND
            imps.creative_id            = a_conv_click.creative_id
FULL OUTER JOIN a_conv_imp ON
            imps.campaign_id            = a_conv_imp.campaign_id            AND
            imps.tracking_campaign_id   = a_conv_imp.tracking_campaign_id   AND
            imps.inventory_placement_id = a_conv_imp.inventory_placement_id AND
            imps.audience_definition_id = a_conv_imp.audience_definition_id AND
            imps.creative_group_id      = a_conv_imp.creative_group_id      AND
            imps.creative_id            = a_conv_imp.creative_id
;

Query 2: Session-walk Position Histogram (Population)

This query is the simplest demonstration of the expressivity of the SQL you can run on Redshift. The WITH and OVER clauses provide an encapsulated, session-based position (DENSE_RANK) for each user’s impressions, which are aggregated by reporting key and position. This allows us to see if particular advertisements tend to appear earlier or later in users’ sessions. (You may recognize queries like this from market basket analysis.)

    WITH annotated_chains AS (
        SELECT campaign_id,
               tracking_campaign_id,
               inventory_placement_id,
               DENSE_RANK() OVER (PARTITION BY ak_user_id ORDER BY record_date DESC) AS position
        FROM impression
        WHERE record_date >= some_date AND
              record_date < other_date
        ORDER BY ak_user_id, record_date DESC
    )
    SELECT campaign_id, tracking_campaign_id, inventory_placement_id, position, COUNT(*) as ct
    FROM annotated_chains
    GROUP BY 1,2,3,4;

Query 3: Session-walk Position Histogram (Converters)

This query does the same thing as #2 but joins against the conversion table to construct a session that reaches 90 days back for each conversion event. This allows us to once again see if certain advertisements tend to be seen “close” to conversions.

    WITH annotated_chains AS (
        SELECT i.ak_user_id                                AS ak_user_id,
               i.campaign_id                               AS campaign_id,
               i.tracking_campaign_id                      AS tracking_campaign_id,
               i.inventory_placement_id                    AS inventory_placement_id,
               i.record_date                               AS record_date,
               DENSE_RANK()
            OVER (PARTITION BY i.ak_user_id ORDER BY i.record_date DESC) AS position
        FROM
              impression i
        JOIN  unattributedconversion c
                    ON     c.ak_user_id = i.ak_user_id
                       AND c.record_date >= i.record_date
                       AND (c.record_date - interval '90 days') < i.record_date
                       AND c.record_date >= some_date
                       AND c.record_date <  other_date
    )
    SELECT campaign_id, tracking_campaign_id, inventory_placement_id, position, count(*) AS ct
    FROM annotated_chains
    GROUP BY 1,2,3,4;

Query 4: Session-walk Date Histogram (Population)

This query is the date-based version of #2. Instead of reporting a position-based offset in the session, it reports a day-offset from the end of the session. This query allows us to examine temporal distribution of different advertisements.

    WITH annotated_chains AS (
        SELECT campaign_id,
               tracking_campaign_id,
               inventory_placement_id,
               DATEDIFF(day, record_date,
                  LAST_VALUE(record_date) OVER (PARTITION BY ak_user_id ORDER BY record_date ASC rows between unbounded preceding and unbounded following)
               )  AS days_behind_conversion
        FROM impression
        WHERE record_date >= some_date AND
              record_date < other_date
        ORDER BY ak_user_id, record_date DESC
    )
    SELECT campaign_id, tracking_campaign_id, inventory_placement_id, days_behind_conversion, COUNT(*) as ct
    FROM annotated_chains
    GROUP BY 1,2,3,4;

Query 5: Session-walk Date Histogram (Converters)

This query is the date-based version of #3. Again, instead of position-based offsets, it reports day-offsets from conversion. This query explores temporal proximity of advertisements to conversions.

    WITH annotated_chains AS (
        SELECT i.ak_user_id                                AS ak_user_id,
               i.campaign_id                               AS campaign_id,
               i.tracking_campaign_id                      AS tracking_campaign_id,
               i.inventory_placement_id                    AS inventory_placement_id,
               i.record_date                               AS record_date,
               DATEDIFF(day, i.record_date, c.record_date) AS days_behind_conversion
        FROM
              impression i
        JOIN  unattributedconversion c
                    ON     c.ak_user_id = i.ak_user_id
                       AND c.record_date >= i.record_date
                       AND (c.record_date - interval '90 days') < i.record_date
                       AND c.record_date >= some_date
                       AND c.record_date <  other_date
    )
    SELECT campaign_id, tracking_campaign_id, inventory_placement_id, days_behind_conversion, count(*) AS ct
    FROM annotated_chains
    GROUP BY 1,2,3,4;

Call for Summer Interns

AK is looking for a summer intern in our R&D group. If any of our blog posts have interested you, then you’ll fit right in!

We’re looking for someone who has a good handle on a few programming languages (pick any two from R/Mathematica/Python/Javascript/Java) and has some math in their background — college-level calculus or algebra is plenty. Ideally, you’re interested in learning about:

  • building and tuning high-performance data structures,
  • streaming algorithms,
  • interesting data visualizations, and
  • how to translate academic research into business value.

It’s OK if you’ve never seen the stuff we write about on the blog before! We didn’t either until we started researching them!

I can’t emphasize this enough: we don’t expect you to know how to do the things above yet. We simply expect you to have a passion for learning about them and the diligence to work through what (at the time) seem like impossible problems. Work experience is nice, but not necessary. As long as you can write clean code and can work hard, you’re well-qualified for this job.

If you’re interested, please send a brief note about why you’re interested, along with a CV and/or GitHub username to timon at aggregateknowledge dot com. For extra credit, please submit one (or more!) of the following:

  • An implementation of HLL, Count-Min Sketch, K-Min Values, or Distinct Sampling in a language of your choice.
  • An extension to Colin’s blog post about a good hash function that adds CityHash and SipHash to the shoot-out.
  • An explanation of the tradeoffs between using a hash map and Count-Min Sketch for counting item frequency.

(I feel like I shouldn’t have to say this, but yes, these are all answered somewhere on the internet. Don’t plagiarize. What we want is for you to go learn from them and try your own hand at implementing/experimenting. Also, don’t freak out, these are extra credit!)

Doubling the Size of an HLL Dynamically – Extra Bits…

Author’s Note: This post is related to a few previous posts on the HyperLogLog algorithm.  See Matt’s overview of the algorithm, and see this for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the final post in a series on doubling the number of bins in HLLs. The first post dealt with the recovery time after doubling, and the second dealt with doubling’s accuracy when taking unions of two HLLs.

Introduction

The main draw to the HyperLogLog algorithm is its ability to make accurate cardinality estimates using small, fixed memory.  In practice, there are two choices a user makes which determine how much memory the algorithm will use: the number of registers (bins) and the size of each register (how high they can count).  As Timon discussed previously, increasing the size of each register will only increase the accuracy if the true cardinality of the stream is HUGE.

Recall that HyperLogLog (and most other streaming algorithms) is designed to work with a fixed number of registers, m, which is chosen as a function of the expected cardinality to approximate. We track a great number of different cardinality streams and in this context it is useful for us to not have one fixed value of m, but to have this evolve with the needs of a given estimation.

We are thus confronted with many engineering problems, some of which we have already discussed. In particular, one problem is that the neat feature of sketches, namely that they allow for an estimate of the cardinality of the union of multiple streams at no cost, depends on having sketches of the same size.

We’ve discussed how to get around this by folding HLLs, though with some increase in error. We’ve also explored a few options on how to effectively perform a doubling procedure. However, we started to wonder if any improvements could be made by using just a small amount of extra memory, say an extra bit for each register. In this post we will discuss one such idea and its use in doubling. Note: we don’t talk about quadrupling or more. We limit ourselves to the situation where HLL sketches only differ in m‘s by 1.

The Setup

One of the downfalls in doubling is that it there is no way to know, after doubling, whether a value belongs in its bin or its partner bin. Recall that a “partner bin” is the register that could have been used had our “prefix” (the portion of the hashed value which is used to decide which register to update) been one bit longer. If the binary representation of the bin index used only two bits of the hashed value, e.g. 01, then in an HLL that used a three-bit index, the same hashed value could have been placed in the bin whose index is either 101 or 001. Since 001 and 01 are the same number, we call 101 the “partner bin”. (See the “Key Processing” section in Set Operations On HLLs of Different Sizes).

Consider an example where we have an HLL with 2^{10} bins.The k^{th} bin has the value 7 in it, and after doubling we guess that its partner bin, at index (2^{10} + k)^{th}, should have a 5 in it. It is equally likely that the k^{th} bin should have the 5 in it and the (2^{10}+k)^{th} bin should have the 7 in it (since the “missing” prefix bit could have been a 1 or a 0)! Certainly the arrangement doesn’t change the basic cardinality estimate, but once we start getting involved with unions, the arrangement can make a very large difference.

To see how drastic the consequences can be, let’s look at a simple example. Suppose we start with an HLL with 2 bins and get the value 6 in each of its bins. Then we run the doubling procedure and decide that the partner bins should both have 1′s in them. With this information, it is equally likely that both of the arrangements below, “A” and “B”, could be the “true” larger HLL.

arrangement

Further suppose we have some other data with which we wish to estimate the union. Below, I’ve diagrammed what happens when we take the union.

union_diag

Arrangement A leads to a cardinality estimate (of the union) of about 12 and Arrangement B leads to a cardinality estimate (of the union) of about 122. This is an order of magnitude different! Obviously not all cases are this bad, but this example is instructive. It tells us that knowing the true location of each value is very important. We’ve attempted to improve our doubling estimate by keeping an extra bit of information as we will describe below.

The Algorithm

Suppose we have an HLL with m bins. Let’s keep another array of data which holds m total bits, one for each bin — we will call these the “Cached Values.” For each bin, we keep a 0 if the value truly belongs in the bin in which it was placed (i.e. if, had we run an HLL with 2m bins, the value would have been placed in the first m bins in the HLL), and we keep a 1 if the value truly belongs in the partner bin of the one in which it was placed (i.e. if, had we run an HLL with 2m bins, the value would have been placed in the last m bins). See the image below for an example. Here we see two HLLs which have processed the same data. The one on the left is half the size and collects the cached values as it runs on the data. The one on the right is simply the usual HLL algorithm run on the same data.

swap_diag

Looking at the first row of the small HLL (with m bins), the 0 cache value means that the 2 “belongs” in the top half of the large HLL, i.e. if we had processed the stream using a larger HLL the 2 would be in the same register. Essentially this cached bit allows you to know exactly where the largest value in a bin was located in the larger HLL (if the i^{th} bin has value V and cached value S, we place the value V in the S * 2^{\log{m}} + i = (S\cdot m + i)^{th} bin).

Doubling Bit Diagram

In practice, when we double, we populate the doubled HLL first with the (now correct location) bin values from the original HLL then we fill the remaining bins by using our “Proportion Doubling” algorithm.

Before we begin looking at the algorithm’s performance, let’s think about how much extra space this requires. In our new algorithm, notice that for each bin, we keep around either a zero or a one as its cached value. Hence, we require only one extra bit per bin to accommodate the cached values. Our implementation of HLL requires 5 bits per bin, since we want to be able to include values up to 2^5 -1= 31 in our bins. Thus, a standard HLL with m bins, requires 5m bits. Hence, this algorithm requires 5m + m = 6m bits (with the extra m bins representing the cached values). This implies that this sketch requires 20% more space.

The Data

Recall in the last post in this series, we explored doubling with two main strategies: Random Estimate (RE) and Proportion Doubling (PD). We did the same here, though using the additional information from this cached bit. We want to know a few things:

  • Does doubling using a cache bit work? i.e. is it better to fold the bigger one or double the smaller one when comparing HLL’s of different sizes?
  • Does adding in a cache bit change which doubling strategy is preferred (RE or PD)?
  • Does the error in union estimate depend on intersection size as we have seen in the past?
Experimental Setup

Is it better to double or fold?

For each experiment we took 2 sets of data (each generated from 200k random keys) and estimated the intersection size between them using varying methods.

  • “Folded”: estimate by filling up an HLL with log_{2}(m) = 10 and  comparing it to a folded HLL starting from log_{2}(m) = 11 and folded down log_{2}(m) = 10
  • “Large”: estimate by using two HLL’s of a larger HLL of log_{2}(m) = 11.  This is effectively a lower bound for our doubling approaches.
  • “Doubled – PD”: estimate by taking an HLL of log_{2}(m) = 10 and double it up to log_{2}(m) = 11 using the Proportion Doubling strategy.  Once this larger HLL is approximated we estimate the intersection with another HLL of native size log_{2}(m) = 11
  • “Doubled – RE”: estimate by taking an HLL of log_{2}(m) = 10 and doubling up to log_{2}(m) = 11 using Random Estimate strategy.

We performed an experiment 300 times at varying intersection sizes from 0 up to 200k (100%) overlapping elements between sets (in steps of 10k). The plots below show our results (and extrapolate between points).

Doubling_bias

The graph of the mean error looks pretty bad for Random Estimate doubling. Again we see that the error depends heavily on the intersection size and becomes more biased as the set’s overlap more. On the other hand, Proportion Doubling was much more successful  (recall that this strategy forces the proportion of bins in the to-be-doubled HLL and the HLL with which we will union it to be equal before and after doubling.)  It’s possible there is some error bias with small intersections but we would need to run more trials to know for sure. As expected, the “Folded” and the “Large” are centered around zero. But what about the spread of the error?

Doubling_spread

The Proportion Doubling strategy looks great! In my last post on this subject, we found that this doubling strategy (without the cached part) really only worked well in the large intersection size regime, but here, with the extra cache bits, we seem to avoid that. Certainly the large intersection regime is where the standard deviation is lowest, but for every intersection size, it is significantly lower than that of the smaller HLL. This suggests that one of our largest sources of error when we use doubling in conjunction with unions is related to our lack of knowledge of the arrangement of the bins (i.e. when doubling, we do not know which of the two partner bins gets the larger, observed value). So it appears that the strategy of keeping cache bits around does indeed work, provided you use a decent doubling scheme.

Interestingly, it is always much better to double a smaller cache HLL than to fold a larger HLL when comparing sketches of different sizes. This is represented above by the lower error of the doubled HLL than the small HLL. The error bounds do seem to depend on the size of the intersection between the two sets but this will require more work to really understand how, especially in the case of Proportion Doubling.

Notes:  In this work we focus solely on doubling a HLL sketch and then immediately using this new structure to compute set operations. It would be interesting to see if set operation accuracy changes as a doubled HLL goes through its “recovery” period under varying doubling methods. It is our assumption that nothing out of the ordinary would come of this, but we definitely could be wrong. We will leave this as an exercise for the reader.

Summary

We’ve found an interesting way of trading space for accuracy with this cached bit method, but there are certainly other ways of using an extra bit or two (per bucket). For instance, we could keep more information about the distribution of each bin by keeping a bit indicating whether or not the bin’s value minus one has been seen. (If the value is k, keep track of whether k-1 has shown up.)

We should be able to use any extra piece of information about the distribution or position of the data to help us obtain a more accurate estimate. Certainly, there are a myriad of other ideas ways of storing a bit or two of extra information per bin in order to gain a little leverage — it’s just a matter of figuring out what works best. We’ll be messing around more with this in the coming weeks, so if you have any ideas of what would work best, let us know in the comments!

(P.S. A lot of our recent work has been inspired by Flajolet et al.’s paper on PCSA – check out our post on this here!)

Thanks to Jeremie Lumbroso for his kind input on this post. We are much indebted to him and hopefully you will see more from our collaboration.

HyperLogLog Engineering: Choosing The Right Bits

Author’s Note: this is just a quick post about an engineering hiccup we ran into while implementing HyperLogLog features that aren’t mentioned in the original paper. We have an introduction to the algorithm and several other posts on the topic if you’re interested.

Say you had two HyperLogLog data structures with 5-bit-wide registers, one with log_{2}m = 11 and the other with log_{2}m = 15, and wanted to compute their union. You could just follow my colleague Chris’ advice and “fold” the larger one down to the size of the smaller one and then proceed as usual taking the pairwise max of the registers. This turns out to be a more involved process than Chris makes it out to be if you designed your HLL implementation in a particular way. For instance, if you use the 15 least(/most) significant bits of the 64-bit hashed input to determine register index and the next 30 bits to determine the register value, you end up in a tricky situation when you truncate the last 4 bits of the index to get the new 11-bit index.

bit string bad

If you imagine feeding the same element into an HLL of the smaller size, then the 4 bits you truncated from the index would have actually been used in the computation of the register value.

bit string bad after fold

You couldn’t simply take the original register value you computed, you’d have to take into account the new prefix added to the register value bit string. If the prefix has a 1 in it, you would recompute the run of zeroes on just the prefix (because you know it contains a 1 and thus all the information you need), and if not, you’d add the length of the prefix to the original register value computed. Not a ton of work, but having clutter like this in algorithmic code distracts the reader from the true intention. So how do we avoid this?

Well, you could say that it’s very, very unlikely that you’ll ever need more than 30 bits for your register value, so you could assume that the register width would remain constant forever and use the bottom 30 bits for your register value and the next log_{2}m bits for your register index. That way you could just truncate the last 4 bits of the index and know that your register value would still be the same. On the other hand, if you’re Google, that may not be true. In that case, what you should do is use the log_{2}m least (/most) significant bits of your hashed value for the register index and the 30 most (/least) significant bits for the register value.

bit string

Now you can just truncate the register index and use the original register value.

bit string after fold

If you’re using a good hash function like MurmurHash3 that gives you 128 bits of entropy, you could simply compute the register index from the first 64-bit word in the hash and compute the register value from the second 64-bit word and completely ignore this problem up to a mind-bending log_{2}m = 64 and register width of 6 (aka the heat death of the universe).

I know it’s not always possible to anticipate this problem in the early stages of implementing and vetting an algorithm, but hopefully with a bit of research the next time someone looks to implement HLL they’ll see this and learn from our mistake.

Doubling the Size of an HLL Dynamically – Unions

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm.  See Matt’s overview of the algorithm, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the second in a series of three posts on doubling the number of bins of HLLs. The first post dealt with the recovery time after doubling and the next post will deal with ways to utilize an extra bit or two per bin.

Overview

Let’s say we have two streams of data which we’re monitoring with the HLL algorithm, and we’d like to get an estimate on the cardinality of these two streams combined, i.e. thought of as one large stream.  In this case, we have to take advantage of the algorithm’s built-in “unionfeature.  Done naively, the accuracy of the estimate will depend entirely on the the number of bins, m, of the smaller of the two HLLs.  In this case, to make our estimate more accurate, we would need to increase this m of one (or both) of our HLLs.  This post will investigate the feasibility of doing this; we will apply our idea of “doubling” to see if we can gain any accuracy.  We will not focus on intersections, since the only support the HyperLogLog algorithm has for intersections is via the inclusion/exclusion principle. Hence the error can be kind of funky for this – for a better overview of this, check out Timon’s post here. For this reason, we only focus on how the union works with doubling.

The Strategy: A Quick Reminder

In my last post we discussed the benefits and drawbacks of many different doubling strategies in the context of recovery time of the HLL after doubling. Eventually we saw that two of our doubling strategies worked significantly better than the others. In this post, instead of testing many different strategies, we’ll focus instead on one strategy, “proportion doubling” (PD), and how to manipulate it to work best in the context of unions. The idea behind PD is to guess the approximate intersection cardinality of the two datasets and to force that estimate to remain after doubling. To be more specific, suppose we have an HLL A and an HLL B withn bins and 2n bins, respectively. Then we check what proportion of bins in A, call it p, agree with the bins in B. When we doubled A, we fill in the bins by randomly selecting p\cdot n bins, and filling them in with the value in the corresponding bins in B. To fill in the rest of the bins, we fill them in randomly according to the distribution.

The Naive Approach

To get some idea of how well this would work, I put the most naive strategy to the test. The idea was to run 100 trials where I took two HLLs (one of size 2^5 = 32 and one of size 2^6 = 64), ran 200K keys through them, doubled the smaller one (according to Random Estimate), and took a union. I had a hunch that the accuracy of our estimate after doubling would depend on how large the true intersection cardinality of the two datasets would be, so I ran this experiment for overlaps of size 0, 10K, 20K, etc. The graphs below are organized by the true intersection cardinality, and each graph shows the boxplot of the error for the trials.

pd_graphs_naiveThis graph is a little overwhelming and a bit of a strange way to display the data, but is useful for getting a feel for how the three estimates work in the different regimes.  The graph below is from the same data and just compares the “Small” and “Doubled” HLLs.  The shaded region represents the middle 50% of the data, and the blue dots represent the data points.

smoothed_naive_union_error

The first thing to notice about these graphs is the accuracy of the estimate in the small intersection regime. However, outside of this, the estimates are not very accurate – it is clearly a better choice to just use the estimate from the smaller HLL.

Let’s try a second approach. Above we noticed that the algorithm’s accuracy depended on the cardinality of the intersection. Let’s try to take that into consideration. Let’s use the “Proportion Doubling” (PD) strategy we discussed in our first post. That post goes more in depth into the algorithm, but the take away is that this doubling strategy preserves the proportion of bins in the two HLLs which agree. I ran some trials like I did above to get some data on this. The graphs below represent this.

pd_graphsHere we again, show the data in a second graph comparing just the “Doubled” and “Small” HLL estimates.  Notice how much tighter the middle 50% region is on the top graph (for the “Doubled” HLL).  Hence in the large intersection regime, we get very accurate estimates.

smoothed_pd_union_error

One thing to notice about the second set of graphs is how narrow the error bars are.  Even when the estimate is biased, it still has much smaller error.  Also, notice that this works well in the large intersection regime but horribly in the small intersection regime.  This suggests that we may be able to interpolate our strategies. The next set of graphs is for an attempt at this. The algorithm gets an estimate of the intersection cardinality, then decides to either double using PD, double using RE, or not double depending on whether the intersection is large, small, or medium.

hybridpic1

smoothed_hybrid_union_error

Here, the algorithm works well in the large intersection regime and doesn’t totally crap out outside of this regime (like the second algorithm), but doesn’t sustain the accuracy of the first algorithm in the small intersection regime. This is most likely because the algorithm cannot “know” which regime it is in and thus, must make a guess.  Eventually, it will guess wrong will severely underestimate the union cardinality. This will introduce a lot of error, and hence, our boxplot looks silly in this regime. The graph below shows the inefficacy of this new strategy. Notice that there are virtually no gains in accuracy in the top graph.

Conclusion

With some trickery, it is indeed possible to gain some some accuracy when estimating the cardinality of the union of two HLLs by doubling one.  However, in order for this to be feasible, we need to apply the correct algorithm in the correct regime. This isn’t a major disappointment since for many practical cases, it would be easy to guess which regime the HLLs should fall under and we could build in the necessary safeguards if we guess incorrectly.  In any case, our gains were modest but certainly encouraging!

Open Source Release: postgresql-hll

We’re happy to announce the first open-source release of AK’s PostgreSQL extension for building and manipulating HyperLogLog data structures in SQL, postgresql-hll. We are releasing this code under the Apache License, Version 2.0 which we feel is an excellent balance between permissive usage and liability limitation.

What is it and what can I do with it?

The extension introduces a new data type, hll, which represents a probabilistic distinct value counter that is a hybrid between a HyperLogLog data structure (for large cardinalities) and a simple set (for small cardinalities). These structures support the basic HLL methods: insert, union, and cardinality, and we’ve also provided aggregate and debugging functions that make using and understanding these things a breeze. We’ve also included a way to do schema versioning of the binary representations of hlls, which should allow a clear path to upgrading the algorithm, as new engineering insights come up.

A quick overview of what’s included in the release:

  • C-based extension that provides the hll data structure and algorithms
  • Austin Appleby’s MurmurHash3 implementation and SQL-land wrappers for integer numerics, bytes, and text
  • Full storage specification in STORAGE.markdown
  • Full function reference in REFERENCE.markdown
  • .spec file for rpmbuild
  • Full test suite

A quick note on why we included MurmurHash3 in the extension: we’ve done a good bit of research on the importance of a good hash function when using sketching algorithms like HyperLogLog and we came to the conclusion that it wouldn’t be very user-friendly to force the user to figure out how to get a good hash function into SQL-land. Sure, there are plenty of cryptographic hash functions available, but those are (computationally) overkill for what is needed. We did the research and found MurmurHash3 to be an excellent non-cryptographic hash function in both theory and practice. We’ve been using it in production for a while now with excellent results. As mentioned in the README, it’s of crucial importance to reliably hash the inputs to hlls.

Why did you build it?

The short answer is to power these two UIs:

On the left is a simple plot of the number of unique users seen per day and the number of cumulative unique users seen over the days in the month. The SQL behind this is very very straightforward:

SELECT report_date,
       #users as by_day,
       #hll_union_agg(users) as cumulative_by_day OVER (ORDER BY report_date ASC)
FROM daily_uniques
WHERE report_date BETWEEN '2013-01-01' AND '2013-01-31'
ORDER BY report_date ASC;

where daily_uniques is basically:

   Column    | Type | Modifiers 
-------------+------+-----------
 report_date | date | 
 users       | hll  |

Briefly, # is the cardinality operator which is operating on the hll result of the hll_union_agg aggregate function which unions the previous days’ hlls.

On the right is a heatmap of the percentage of an inventory provider’s users that overlap with another inventory provider. Essentially, we’re doing interactive set-intersection of operands with millions or billions of entries in milliseconds. This is intersection computed using the inclusion-exclusion principle as applied to hlls:

SELECT ip1.id as provider1,
       ip2.id as provider2,
       (#ip1.users + #ip2.users - #hll_union(ip1.users, ip2.users))/#ip1.users as overlap
FROM inventory_provider_stats ip1, inventory_provider_stats ip2
WHERE ip1.id <> ip2.id;

where inventory_provider_stats is basically:

   Column    | Type | Modifiers 
-------------+------+-----------
 id          | date | 
 users       | hll  |

(Some of you may note that the diagonal is labeled “exclusive reach” and is not represented in the query’s result set. That’s because the SQL above is a simplification of what’s happening. There’s some extra work done that replaces that the useless diagonal entries with the percent of the inventory provider’s users that are only seen on that inventory provider.)

We’ve been running this type of code in production for over a year now and are extremely pleased with its performance, ease of use, and expressiveness. Everyone from engineers to researchers to ops people to analysts have been using hlls in their daily reports and queries. We’re seeing product innovation coming from all different directions in the organization as a direct result of having these powerful data structures in an easily accessed and queried format. Dynamic COUNT(DISTINCT ...) queries that would have taken minutes or hours to compute from a fact table or would have been impossible in traditional cube aggregates return in milliseconds. Combine that speed with PostgreSQL’s window and aggregate functions and you have the ability to present interactive, rich distinct-value reporting over huge data sets. I’ll point you to the README and our blog posts on HyperLogLog for more technical details on storage, accuracy, and in-depth use cases.

I believe that this pattern of in-database probabilistic sketching is the future of interactive analytics. As our VP of Engineering Steve Linde said to me, “I can’t emphasize enough how much business value [sketches] deliver day in and day out.”

Our Commitment

Obviously we’re open-sourcing this for both philanthropic and selfish reasons: we’d love for more people to use this technology so that they can tell us all the neat uses for it that we haven’t thought of yet. In exchange for their insight, we’re promising to stay active in terms of stewardship and contribution of our own improvements. Our primary tool for this will be the GitHub Issues/Pull Request mechanism. We’d considered a mailing list but that seems like overkill right now. If people love postgresql-hll, we’ll figure something out as needed.

Please feel free to get in touch with us about the code on GitHub and about the project in general in the comments here. We hope to release additional tools that allow seamless Java application integration with the raw hll data in the future, so stay tuned!


Update

Looks Dimitri Fontaine wrote up a basic “how-to” post on using postgresql-hll here and another on unions here. (Thanks, Dimitri!) He brings up the issue that hll_add_agg() returns NULL when aggregating over an empty set when it should probably return an empty hll. Hopefully we’ll have a fix for that soon. You can follow the progress of the issue here.

HLLs and Polluted Registers

Introduction

It’s worth thinking about how things can go wrong, and what the implications of such occurrences might be. In this post, I’ll be taking a look at the HyperLogLog (HLL) algorithm for cardinality estimation, which we’ve discussed before.

The Setup

HLLs have the property that their register values increase monotonically as they run. The basic update rule is:

for item in stream:
    index, proposed_value = process_hashed_item(hash(item))
    hll.registers[index] = max(hll.registers[index], proposed_value)

There’s an obvious vulnerability here: what happens to your counts if you get pathological data that blows up a register value to some really large number? These values are never allowed to decrease according to the vanilla algorithm. How much of a beating can these sketches take from such pathological data before their estimates are wholly unreliable?

Experiment The First

To get some sense of this, I took a 1024 bucket HLL, ran a stream through it, and then computed the error in the estimate. I then proceeded to randomly choose a register, max it out, and compute the error again. I repeated this process until I had maxed out 10% of the registers. In pseudo-python:

print("n_registers_touched,relative_error")
print(0, relative_error(hll.cardinality(), stream_size), sep = ",")
for index, reg in random.sample(range(1024), num_to_edit):
    hll.registers[reg] = 32
    print(index + 1, relative_error(hll.cardinality(), stream_size), sep = ",")

In practice, HLL registers are fixed to be a certain bit width. In our case, registers are 5 bits wide, as this allows us to count runs of 0s up to length 32. This allows us to count astronomically high in a 1024 register HLL.

Repeating this for many trials, and stream sizes of 100k, 1M, and 10M, we have the following picture. The green line is the best fit line.

Error grows linearly with polluted register values

What we see is actually pretty reassuring. Roughly speaking, totally poisoning x% of registers results in about an x% error in your cardinality estimate. For example, here are the error means and variances across all the trials for the 1M element stream:

Number of Registers Modified Percentage of Registers Modified Error Mean Error Variance
0 0 -0.0005806094 0.001271119
10 0.97% 0.0094607324 0.001300780
20 1.9% 0.0194860481 0.001356282
30 2.9% 0.0297495396 0.001381753
40 3.9% 0.0395013208 0.001436418
50 4.9% 0.0494727527 0.001460182
60 5.9% 0.0600436774 0.001525749
70 6.8% 0.0706375356 0.001522320
80 7.8% 0.0826034639 0.001599104
90 8.8% 0.0937465662 0.001587156
100 9.8% 0.1060810958 0.001600348

Initial Reactions

I was actually not too surprised to see that the induced error was modest when only a small fraction of the registers were poisoned. Along with some other machinery, the HLL algorithm uses the harmonic mean of the individual register estimates when computing its guess for the number of distinct values in the data stream. The harmonic mean does a very nice job of downweighting values that are significantly larger than others in the set under consideration:

In [1]: from scipy.stats import hmean

In [2]: from numpy import mean

In [3]: f = [1] * 100000 + [1000000000]

In [4]: mean(f)
Out[4]: 10000.899991000089

In [5]: hmean(f)
Out[5]: 1.0000099999999899

It is this property that provides protection against totally wrecking the sketch’s estimate when we blow up a fairly small fraction of the registers.

Experiment The Second

Of course, the algorithm can only hold out so long. While I was not surprised by the modesty of the error, I was very surprised by how linear the error growth was in the first figure. I ran the same experiment, but instead of stopping at 10% of the registers, I went all the way to the end. This time, I have plotted the results with a log-scaled y-axis:

Note that some experiments appear to start after others. This is due to missing data from taking the logarithm of negative errors.

Without getting overly formal in our analysis, there are roughly three phases in error growth here. At first, it’s sublinear on the log-scale, then linear, then superlinear. This roughly corresponds to “slow”, “exponential”, and “really, really, fast”. As our mathemagician-in-residence points out, the error will grow roughly as p/(1-p) where p is the fraction of polluted registers. The derivation of this isn’t too hard to work out, if you want to give it a shot! The implication of this little formula matches exactly what we see above. When p is small, the denominator does not change much, and the error grows roughly linearly. As p approaches 1, the error begins to grow super-exponentially. Isn’t it nice when experiment matches theory?

Final Thoughts

It’s certainly nice to see that the estimates produced by HLLs are not overly vulnerable to a few errant register hits. As is often the case with this sort of analysis, the academic point must be put in balance with the practical. The chance of maxing out even a single register under normal operation is vanishingly small, assuming you chose a sane hash function for your keys. If I was running an HLL in the wild, and saw that 10% of my registers were pegged, my first thought would be “What is going wrong with my system!?” and not “Oh, well, at least I know my estimate to within 10%!” I would be disinclined to trust the whole data set until I got a better sense of what caused the blowups, and why I should give any credence at all to the supposedly unpolluted registers.

Doubling the Size of an HLL Dynamically – Recovery

Author’s Note: This post is related to a few previous posts dealing with the HyperLogLog algorithm. See Matt’s overview of HLL, and see this post for an overview of “folding” or shrinking HLLs in order to perform set operations. It is also the first in a series of three posts on doubling the size of HLLs – the next two will be about set operations and utilizing additional bits of data, respectively.

Overview

In this post, we explore the error of the cardinality estimate of an HLL whose size has been doubled using several different fill techniques. Specifically, we’re looking at how the error changes as additional keys are seen by the HLL.

A Quick Reminder – Terminology and Fill Strategies

If we have an HLL of size 2^n and we double it to be an HLL of size 2^{n+1}, we call two bins “partners” if their bin number differs by 2^n.  For example, in an HLL double to be size 8, the bins 1 and 5 are partners, as are 2 and 6, etc. The Zeroes doubling strategy fills in the newly created bins with zeroes. The Concatenate strategy fills in the newly created bins with the values of their partner bins. MinusTwo fills in each bin with two less than its partner bin’s value. RE fills in the newly created bins according to the empirical distribution of each bin.

Some Sample Recovery Paths

Below, we ran four experiments to check recovery time. Each experiment consisted of running an HLL of size 210 on 500,000 unique hashed keys (modeled here using a random number generator), doubling the HLL to be size 211, and then ran 500,000 more hashed keys through the HLL. Below, we have graphs showing how the error decreases as more keys are added.  Both graphs show the same data (the only difference being the scale on the y-axis). We have also graphed “Large,” an HLL of size 2^{11}, and “Small,” an HLL of size 2^{10}, which are shown only for comparison and are never doubled.

One thing to note about the graphs is that the error is relative.

Notice that Concatenate and Zeroes perform particularly poorly. Even after 500,000 extra keys have been added, they still don’t come within 5% of the true value! For Zeroes, this isn’t too surprising. Clearly the initial error of Zeroes, that is the error immediately after doubling, should be high.  A quick look at the harmonic mean shows why this occurs. If a single bin has a zero as its value, the harmonic mean of the values in the bins will be zero. Essentially, the harmonic mean of a list always tends towards the lowest elements of the list. Hence, even after all the zeroes have been replaced with positive values, the cardinality estimate will be very low.

On the other hand, a more surprising result is that Concatenate gives such a poor guess. To see this we need to look at the formula for the estimate again.  The formula for the cardinality estimate is \frac{\alpha_m m^2}{\sum_{i=1}^{m} 2^{-M_i}} where M_i is the value in the i^{th} bin, m is the number of bins, and \alpha_m is a constant approaching about .72. For Concatenate, the value M_{i + m} is equal to M_i.  Hence we have that the cardinality estimate for Concatenate is:

\begin{array}{ll}\displaystyle\frac{\alpha_{2m} (2m)^2}{\left(\displaystyle\sum_{i=1}^{2m} 2^{-M_i}\right)} \vspace{10pt}&\approx \displaystyle\frac{.72\cdot 4\cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) + \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right) }\vspace{10pt} \\&\displaystyle= \displaystyle 4\cdot \frac{.72 \cdot m^2}{2\cdot \left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\vspace{10pt}\\&= \displaystyle 2 \cdot \frac{.72 \cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\vspace{10pt}\\&\approx \displaystyle 2 \cdot \frac{ \alpha_m \cdot m^2}{\left(\displaystyle\sum_{i=1}^m 2^{-M_i}\right)}\end{array}

Notice that this last term is about equal to 2 times the cardinality estimate of the HLL before doubling. One quick thing that we can take away from this is that it is unlikely for two “partner” bins to have the same value in them (since if this happens frequently, we get an estimate close to that given by Concatenate – which is very inaccurate!).

As for MinusTwo and RE, these have small initial error and the error only falls afterwards. The initial error is small since the rules for these give guesses approximately equal to the guess of the original HLL before doubling. From here, the error should continue to shrink, and eventually, it should match that of the large HLL.

One thing we noticed was that error for Concatenate in the graph above suggested that the absolute error wasn’t decreasing at all. To check this we looked at the trials and, sure enough, the absolute error stays pretty flat. Essentially, Concatenate overestimates pretty badly, and puts the HLL in a state where it thinks it has seen twice as many keys as it actually has. In the short term, it will continue to make estimates as if it has seen 500,000 extra keys. We can see this clearly in the graphs below.

Recovery Time Data

I also ran 100 experiments where we doubled the HLLs after adding 500,000 keys, then continued to add keys until the cardinality estimate fell within 5% of the true cardinality.  The HLLs were set up to stop running at 2,000,000 keys if they hadn’t arrived at the error bound.

Notice how badly Concatenate did! In no trials did it make it under 5% error. Zeroes did poorly as well, though it did recover eventually. My guess here is that the harmonic mean had a bit to do with this – any bin with a low value, call it k, in it would pull the estimate down to be about m^2 \cdot 2^k. As a result, the estimate produced by the Zeroes HLL will remain depressed until every bin is hit with a(n unlikely) high value. Zeroes and Concatenate should not do well since essentially the initial estimate (after doubling) of each HLL is off by a very large fixed amount. The graph of absolute errors, above, shows this.

On the other hand, RE and MinusTwo performed fairly well. Certainly, RE looks better in terms of median and middle 50%, though its variance is much higher than MinusTwo‘s.This should make sense as we are injecting a lot of randomness into RE when we fill in the values, whereas MinusTwo‘s bins are filled in deterministically.

Recovery Time As A Function of Size

One might wonder whether the recovery time of MinusTwo and RE depend on the size of the HLL before the doubling process. To get a quick view of whether or not this is true, we did 1,000 trials like those above but by adding 200K, 400K, 600K, 800K, 1M keys and with a a cutoff of 3% this time. Below, we have the box plots for the data for each of these. The headings of each graph gives the size of the HLL before doubling, and the y-axis gives the fractional recovery time (the true recovery time divided by the size of the HLL before doubling).

Notice that, for each doubling rule, there is almost no variation between each of the plots. This suggests that the size of the HLL before doubling doesn’t change the fractional recovery time. As a side note, one thing that we found really surprising is that RE is no longer king – MinusTwo has a slightly better average case. We think that this is just a factor of the higher variation of RE and the change in cutoff.

Summary

Of the four rules, MinusTwo and RE are clearly the best. Both take about 50 – 75% more keys after doubling to get within 3% error, and both are recover extremely quickly if you ask for them to only get within 5% error.

To leave you with one last little brainteaser, an HLL of size 2^{10}, which is then doubled, will eventually have the same values in its bins as an HLL of size 2^{11} which ran on the same data. About how long will it take for these HLLs to converge? One (weak) requirement for this to happen is to have the value in every bin of both HLLs be changed. To get an upper bound on how long this should take, one should read about the coupon collector problem.

Doubling the Size of an HLL Dynamically

Introduction

In my last post, I explained how to halve the number of bins used in an HLL as a way to allow set operations between that HLL and smaller HLLs.  Unfortunately, the accuracy of an HLL is tied to the number of bins used, so one major drawback with this “folding” method is that each time you have the number of bins, you reduce that accuracy by a factor of \sqrt{2}.

In this series of posts I’ll focus on the opposite line of thinking: given an HLL, can one double the number of bins, assigning the new bins values according to some strategy, and recover some of the accuracy that a larger HLL would have had?  Certainly, one shouldn’t be able to do this (short of creating a new algorithm for counting distinct values) since once we use the HLL on a dataset the extra information that a larger HLL would have gleaned is gone.  We can’t recover it and so we can’t expect to magically pull a better estimate out of thin air (assuming Flajolet et al. have done their homework properly and the algorithm makes the best possible guess with the given information – which is a pretty good bet!).  Instead, in this series of posts, I’ll focus on how doubling plays with recovery time and set operations.  By this, I mean the following:  Suppose we have an HLL of size 2n and while its running, we double it to be an HLL of size 2n+1. Initially, this may have huge error, but if we allow it to continue running, how long will it take for its error to be relatively small?  I’ll also discuss some ways of modifying the algorithm to carry slightly more information.

The Candidates

Before we begin, a quick piece of terminology.  Suppose we have an HLL of size 2n and we double it to be an HLL of size 2 n+1.  We consider two bins to be partners if their bin numbers differ by 2n.  To see why this is important – check the post on HLL folding.

Colin and I did some thinking and came up with a few naive strategies to fill in the newly created bins after the doubling. I’ve provided a basic outline of the strategies below.

  • Zeroes – Fill in with zeroes.
  • Concatenate – Fill in each bin with the value of its partner.
  • MinusTwo – Fill in each bin with the value of its partner minus two. Two may seem like an arbitrary amount, but quick look at the formulas involved in the algorithm show that this leaves the cardinality estimate approximately unchanged.
  • RandomEstimate (RE) – Fill in each bin according to its probability distribution. I’ll describe more about this later.
  • ProportionDouble (PD) – This strategy is only for use with set operations. We estimate the number of bins in the two HLLs which should have the same value, filling in the second half so that that proportion holds and the rest are filled in according to RE.

Nitty Gritty of RE

The first three strategies given above are pretty self-explanatory, but the last two are a bit more complicated. To understand these, one needs to understand the distribution of values in a given bin.  In the original paper, Flajolet et al. calculate the probability that a given bin takes the value k to be given by (1 - 0.5^k)^v - (1 - 0.5^{k-1})^v where v is the number of keys that the bin has seen so far. Of course, we don’t know this value (v) exactly, but we can easily estimate it by dividing the cardinality estimate by the number of bins. However, we have even more information than this. When choosing a value for our doubled HLL, we know that that value cannot exceed its partner’s value. To understand why this is so, look back at my post on folding, and notice how the value in the partner bins in a larger HLL correspond to the value in the related bin in the smaller HLL.

Hence, to get the distribution for the value in a given bin, we take the original distribution, chop it off at the relevant value, and rescale it to have total area 1. This may seem kind of hokey but let’s quickly look at a toy example. Suppose you ask me to guess a number between 1 and 10, and you will try to guess which number I picked. At this moment, assuming I’m a reasonable random number generator, there is a 1/10 chance that I chose the number one, a 1/10 chance that I chose the number two, etc. However, if I tell you that my guess is no larger than two, you can now say there there is a 1/2 chance that my guess is a one, a 1/2 chance that my guess is a two, and there is no chance that my guess is larger. So what happened here? We took the original probability distribution, used our knowledge to cut off and ignore the values above the maximum possible value, and then rescaled them so that the sum of the possible probabilities is equal to zero.

RE consists simply of finding this distribution, picking a value according to it, and placing that value in the relevant bin.

Nitty Gritty of PD

Recall that we only use PD for set operations. One thing we found was that the accuracy of doubling with set operations according to RE is highly dependent on the the intersection size of the two HLLs. To account for this, we examine the fraction of bins in the two HLLs which contain the same value, and then we force the doubled HLL to preserve this fraction

So how do we do this? Let’s say we have two HLLs: H and G. We wish to double H before taking its union with G. To estimate the proportion of their intersection, make a copy of G and fold it to be the same size as H. Then count the number of bins where G and H agree, call this number a. Then if m is the number of bins in H, we can estimate that H and G should overlap in about a/m bins. Then for each bin, with probability a/m we fill in the bin with the the minimum of the relevant bin from G and that bin’s partner in G. With probability 1 - a/m we fill in the bin according to the rules of RE.

The Posts

(Links will be added as the posts are published. Keep checking back for updates!)

Efficient Field-Striped, Nested, Disk-backed Record Storage

At AK we deal with a torrent of data every day. We can report on the lifetime of a campaign which may encompass more than a year’s worth of data. To be able to efficiently access our data we are constantly looking at different approaches to storage, retrieval and querying. One approach that we have been interested in involves dissecting data into its individual fields (or “columns” if you’re thinking in database terms) so that we only need to access the fields that are pertinent to a query. This is not a new approach to dealing with large volumes of data – it’s the basis of column-oriented databases like HBase.

Much of our data contains nested structures and this causes things to start to get a little more interesting, since this no longer easily fits within the data-model of traditional column-stores. Our Summarizer uses an in-memory approach to nested, field-striped storage but we wanted to investigate this for our on-disk data. Google published the Dremel paper a few years ago covering this exact topic. As with most papers, it only provides a limited overview of the approach without covering many of the “why”s and trade-offs made. So, we felt that we needed to start from the ground up and investigate how nested, field-striped storage works in order to really understand the problem.

Due to time constraints we have only been able to scratch the surface. Since the community is obviously interested in a Dremel-like project, we want to make the work that we have done available. We apologize in advance for the rough edges.

Without further ado: Efficient Field-Striped, Nested, Disk-backed Record Storage (on GitHub).

Follow

Get every new post delivered to your Inbox.

Join 117 other followers