Monday, June 8, 2015

Quickstep is now part of Pivotal

In January, my students and I spun out a company to help commercialize the technology produced in the Quickstep project. I'm excited to announce that we have now been acquired by Pivotal. More details are here and here.

Monday, August 25, 2014

WideTable: An Accelerator for Analytic Data Processing

Last year, I told you about BitWeaving, which is a compressed columnar scan that exploits intra-cycle parallelism. With BitWeaving, you can scan a table ... pretty darn fast!

As part of the Quickstep project, Yinan and I have been wondering if we could expand the "sweet spot" of BitWeaving's bare-metal performance to cover a wider variety of analytic query processing. The answer we have come up with is called WideTable.

The key idea behind WideTable is to denormalize the data into a wide table, and then run complex analytic SQL queries (e.g. with joins and nested subqueries) on the WideTable using BitWeaved scans. There are a number of different ways to use a WideTable, including as a materialized view and as an accelerator. Below, we show how WideTable can be used as an accelerator.

The WideTable component plugs in as a front-end to a traditional SQL engine, which could be a traditional RDBMS (like Oracle) or a Hadoop/SQL engine (like Hive). It takes the data and schema and runs them through a denormalizer module that transforms the schema using outer joins, converting the data into a WideTable. SQL queries are intercepted and run through a series of rules to determine if and how the query can be translated into scans on the WideTable. Not all queries can be translated to WideTable scans, and those that can't be transformed are sent to the original source system for evaluation in the "usual" ways. For more details, see the paper.
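To make the denormalizer module concrete, here is a minimal Python sketch of the idea, assuming a toy fact table (orders) and a single dimension table (customers). The function and table names are made up for illustration; this is not code from the paper or the system.

    # Hypothetical sketch of the denormalizer step: left-outer-join a fact table
    # with a dimension table along a foreign key, producing one wide row per
    # fact-table row. Unmatched dimension attributes are padded with NULLs.

    def outer_join(fact_rows, dim_rows, fk, dim_key):
        dim_index = {row[dim_key]: row for row in dim_rows}
        null_dim = {col: None for col in dim_rows[0] if col != dim_key}
        wide = []
        for fact in fact_rows:
            dim = dim_index.get(fact[fk], null_dim)
            wide_row = dict(fact)
            wide_row.update({c: v for c, v in dim.items() if c != dim_key})
            wide.append(wide_row)
        return wide

    orders = [
        {"o_id": 1, "cust_id": 10, "amount": 99.0},
        {"o_id": 2, "cust_id": 11, "amount": 15.5},
        {"o_id": 3, "cust_id": 99, "amount": 42.0},   # dangling FK -> NULL padding
    ]
    customers = [
        {"cust_id": 10, "name": "Ann", "nation": "US"},
        {"cust_id": 11, "name": "Bo",  "nation": "DE"},
    ]

    wide_table = outer_join(orders, customers, fk="cust_id", dim_key="cust_id")
    # Each wide row now carries both order and customer attributes, so a query
    # that joins orders and customers becomes a plain scan over wide_table.

Once the data is folded into the wide table, a query that joins orders with customers (say, total order amount by customer nation) reduces to a single scan over the wide rows.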

Two questions naturally follow: 1) What is the space overhead associated with WideTables? 2) What class of queries can be translated into (fast BitWeaved) scans on WideTables?

As a yardstick, we use the TPC-H benchmark to answer these questions. For the scale factor 10 TPC-H dataset (approximately 10GB of raw data), the WideTables take only 8.5GB of space (we end up creating more than one WideTable in this case). The BitWeaved columnar storage and its dictionary compression method help control the space overhead that is normally associated with denormalization.

How many of the TPC-H queries can we answer using the WideTable method? All of them except Query 21. (Query 21 has a join that is not a primary key-foreign key join, which we can't handle, so we send that query to the source system.) Note that we can also handle the update queries, though as you might guess, updates are slower when using a WideTable. More details on update query performance can be found in the paper.

What is the performance benefit of WideTable? Here are numbers comparing WideTable and MonetDB on the 10GB TPC-H benchmark.

WideTable is over 10X better than MonetDB in many cases! See the paper for many more experimental results. We also compared with VectorWise and the results have the same relative trend as MonetDB.

Some parting notes. First, WideTable can be viewed as a materialized view, but it is a unique type of materialized view that is schema-centric rather than workload-centric. Second, it is easier to optimize and predict the response times of WideTable queries, as the query plan is largely made up of simple scan kernels; so, for example, p99 response time management is easier. Third, the WideTable technique can be used to dramatically reduce the size of the source system (e.g. use fewer nodes) if the workload is amenable to it. Finally, WideTable's effectiveness depends on the schema graph. A star schema is fairly easy to deal with, but there are ways to handle snowflake schemas too, by selectively denormalizing around each fact table and then using fast join methods (like those described here and here). More on that at a later date as we keep working on improving the WideTable technique.

Friday, August 2, 2013

From Locomatix to Twitter

Way back five years ago, just as the iPhone came out, Karthik Ramasamy (@karthikz) and I (@pateljm) started talking about building a combined real-time streaming and real-time analytical platform to power enterprise and consumer mobile services. We started a company called Locomatix to build such a platform. Sanjeev Kulkarni (@sanjeevrk), a founding member of the Google AdSense team, joined Locomatix soon after it was formed.  Next came Chris Kellogg (@cckellogg), who joined in his last semester at the University of Michigan.

Today, I’m very happy to announce that the Locomatix team is joining Twitter!

Special thanks to our families and friends for supporting us throughout this incredible journey. We simply wouldn't have made it without you!

Sanjeev, Karthik and I all got our graduate degrees from the University of Wisconsin, so a big part of Locomatix was powered by UW grads. Go Badgers! We are working on converting Chris into a Badger fan -- it's still a work in (very slow) progress.

Monday, June 24, 2013

Life Beyond Column-Stores: Exploiting intra-cycle parallelism

So, I have finally decided to start a blog. Why now, you might ask?

The first reason is that I have been too lazy to start a blog till now. Since blogs often matter as much as publications, this oversight is inexcusable. Better late than never.

The second reason to start a blog is that U. Wisconsin doesn't, AFAIK, have a blog in the big data space. This is sad given the tradition that our university has in producing key data processing technologies, like the GAMMA parallel database system (which continues to influence what we now call "big data systems") and BIRCH, which provided a key pivot point for data mining (the buzzword-compliant term for that today is "deep data analytics"). This blog is a humble start at getting some of what Wisconsin does today in the big data space out into the blogosphere.

So, the topic for today is: exploiting intra-cycle parallelism for query processing. First, some quick background on this line of thinking. If you zoom into the processor and view it at the circuit level, there is an incredible amount of parallelism within each cycle. For example, if you add two 64-bit numbers, which takes about one cycle, the hardware circuits are actually computing on all 64 bits in parallel. Thus, there is 64-way parallelism at the circuit level. That is enough to be a potential game changer if we can exploit it effectively for some key data processing kernel.

The data processing kernel that we picked is a simple scan. Scans are pretty common in practice, and companies like SAP and IBM worry a great deal about scan efficiency in their main-memory data analytics products. Now, if you care about efficiency in these environments, then you have likely already implemented a column store in your product. So, the real challenge is: How can we speed up scans in a column-store engine?

The answer that Yinan Li and I came up with is called BitWeaving. BitWeaving builds on column stores but takes them to the next level. Here is how it works: First, we take each column and encode it using a fixed-length, order-preserving code. So, if you have a column with the birthdates of all the users in a table, then we map each unique birthdate to a code, turning the column into an array of codes (one code for each original column value).
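Here is a small Python sketch of what such an order-preserving, fixed-length encoding could look like; the helper names and data are mine, purely for illustration.

    # Hypothetical sketch of order-preserving, fixed-length dictionary encoding:
    # distinct values are sorted and numbered, so comparing codes gives the same
    # answer as comparing the original values, and every code uses the same
    # (minimal) number of bits.

    def build_dictionary(values):
        distinct = sorted(set(values))
        code_of = {v: i for i, v in enumerate(distinct)}
        width = max(1, (len(distinct) - 1).bit_length())   # bits per code
        return code_of, width

    birthdates = ["1990-07-14", "1985-03-02", "1990-07-14", "2001-12-30"]
    code_of, width = build_dictionary(birthdates)
    codes = [code_of[v] for v in birthdates]   # one code per column value

    assert width == 2 and codes == [1, 0, 1, 2]
    # Order is preserved: "1985-03-02" < "1990-07-14" and 0 < 1, so a predicate
    # like birthdate < '1990-07-14' becomes the integer test code < 1.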

Next, we view the codes at the bit level (just like the processor's circuits do), and lay out the bits in memory in a way that lets us exploit the circuit-level intra-cycle parallelism.

BitWeaving comes in two flavors. The first, called BitWeaving/V, is like a column store, but at the bit level. So, the highest-order bits of all the codes in a column are laid out sequentially in memory, followed by the second highest-order bits, and so on. Thus, when you bring data into the processor, which is typically about 64 bits at a time, you actually have bits from 64 different codes in the processor registers.

We have an algebraic framework that allows us to operate on these "oddly packed bits." With this framework, we can use regular CPU instructions, like binary addition and exclusive OR, to operate on the bits of 64 codes in a few instructions (cycles). Furthermore, in many cases, we can safely prune the computation without looking at all the bits! The intuition behind the early pruning is as follows: consider codes that are 3 bits wide. If you are evaluating the predicate "col < 2", then you know that any code with its most significant bit set to 1 does not match the predicate, and you don't need to look at its last two bits. Organizing the data at the bit level allows us to skip over entire words of lower-order bits. The end result is that in just a few cycles, we can compute the predicate for 64 codes. In other words, we have successfully exploited intra-cycle parallelism!
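To make the vertical layout and the early-pruning scan a bit more tangible, here is a rough Python sketch of the idea. It is my own toy illustration, not the paper's code: it uses 8 codes per word instead of 64, and checks the single predicate "code < constant".

    # Illustrative sketch of a BitWeaving/V-style scan. Bit j of every code in a
    # group is packed into one word (most significant bits first), and the
    # predicate "code < const" is evaluated one bit-plane at a time with early
    # pruning once no code's outcome can change.

    WORD = 8        # codes per word (64 on real hardware)
    WIDTH = 3       # bits per code

    def vertical_layout(codes):
        """Return WIDTH words; word j holds bit j (MSB first) of every code."""
        words = []
        for j in range(WIDTH):
            w = 0
            for i, c in enumerate(codes):
                w |= ((c >> (WIDTH - 1 - j)) & 1) << i
            words.append(w)
        return words

    def less_than(words, const):
        full = (1 << WORD) - 1
        lt, eq = 0, full                            # lt: result bits, eq: still-tied codes
        for j in range(WIDTH):
            c_word = full if (const >> (WIDTH - 1 - j)) & 1 else 0
            lt |= eq & ~words[j] & c_word & full    # tied codes with a 0 where const has a 1
            eq &= ~(words[j] ^ c_word) & full       # codes that still match const so far
            if eq == 0:                             # early pruning: remaining bits can't matter
                break
        return lt

    codes = [5, 1, 3, 7, 0, 2, 6, 4]
    result = less_than(vertical_layout(codes), 2)
    matches = [i for i in range(WORD) if (result >> i) & 1]
    assert matches == [1, 4]   # only the codes 1 and 0 satisfy "code < 2"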

The second way to BitWeave is called BitWeaving/H. Here we pack the codes horizontally, laying them out sequentially in memory. Such horizontal code packing is not new, but what we do that is new is:
  1. Add an extra bit to each code, which we use as a placeholder to record the result of the predicate evaluation on that code, and 
  2. Store consecutive codes stacked as columns across memory words. So, if we have a 3-bit code, the code for tuple 1 is followed in the memory address space by the code for tuple 5 (and not tuple 2!). See the paper for more details.
With these techniques, we can dramatically reduce the number of cycles that BitWeaving/H needs to evaluate predicates. The end result is that we can compute the predicate across a batch of codes in a few cycles (again, we have successfully exploited intra-cycle parallelism).
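Here is a rough Python sketch of the word-parallel predicate evaluation that the extra delimiter bit enables. Again, this is my own toy illustration of the general technique (parameters and names are made up), not the code from the paper.

    # Illustrative sketch of horizontal packing with one delimiter bit per code.
    # With 3-bit codes plus a delimiter bit, a 64-bit word holds 16 codes, and a
    # single pass of word-wide arithmetic evaluates "code < const" for all of
    # them: the delimiter bit of each field catches the carry of the comparison.

    WIDTH = 3                      # data bits per code
    FIELD = WIDTH + 1              # plus one delimiter bit
    PER_WORD = 64 // FIELD         # 16 codes per 64-bit word
    DATA_MASK  = sum(((1 << WIDTH) - 1) << (i * FIELD) for i in range(PER_WORD))
    DELIM_MASK = sum(1 << (i * FIELD + WIDTH) for i in range(PER_WORD))

    def pack(codes):
        """Pack up to PER_WORD codes into one word; delimiter bits start at 0."""
        word = 0
        for i, c in enumerate(codes):
            word |= (c & ((1 << WIDTH) - 1)) << (i * FIELD)
        return word

    def less_than(word, const):
        """Delimiter bit of field i becomes 1 iff code i < const."""
        const_word = pack([const] * PER_WORD)
        # Per field: const + (2^WIDTH - 1 - code) carries into the delimiter bit
        # exactly when code < const; the zeroed delimiter bits stop carries from
        # leaking into the neighboring field.
        return (const_word + ((word ^ DATA_MASK) & DATA_MASK)) & DELIM_MASK

    codes = [5, 1, 3, 7, 0, 2, 6, 4]
    result = less_than(pack(codes), 2)
    matches = [i for i in range(len(codes)) if (result >> (i * FIELD + WIDTH)) & 1]
    assert matches == [1, 4]   # only the codes 1 and 0 satisfy "code < 2"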

Okay, so how well does this work in practice? Below is a figure comparing the performance of different techniques, using cycles per code as the performance metric. These results are for a synthetic data set with a single column containing one billion uniformly distributed integer values. In this experiment, we vary the column width (the number of bits in the code). We scan the table using a predicate that selects 10% of the tuples, and then feed the results to a COUNT aggregate. This experiment was run on a single core of a Xeon X5680.

The top line in this figure corresponds to fetching one column code at a time, evaluating the predicate, and then feeding the result to a count aggregate operator. The second line, labeled "SIMD Scan", is our best-effort implementation of the method described in the SAP paper. It packs codes horizontally, like BitWeaving/H, but without the extra bit and the columnar layout, and it uses SIMD instructions. The next two lines represent BitWeaving/H and BitWeaving/V.

As we can see, both BitWeaving methods provide significant gains over the traditional method and the SIMD Scan. For example, with a code size of 12 bits, BitWeaving/V is 10X and 6X faster than the Traditional and the SIMD Scan methods respectively. At a code size of 12 bits, BitWeaving/H is 9X and 5X faster than the Traditional and the SIMD Scan methods respectively.

Of course, in the experiment above, the final output was a COUNT(*), which can be computed directly by counting how many results match the predicate. As one might imagine, BitWeaving/V is more expensive if you have to actually reconstruct the column values to produce the output. Here is another experimental result with a TPC-H database at scale factor 10 (i.e. a 10GB database), using the scan query Q6. This query has an aggregate that requires computing an average over the product of two columns. Here BitWeaving/H outperforms BitWeaving/V.

One final note: BitWeaving can be used as a native storage format, as an index, or both. With 2-way replication, you could choose to store one copy of the column in a BitWeaved format and the other in a regular column store.

Of course, there are lots of interesting unanswered questions, including how to extend this idea of intra-cycle parallelism to work with other columnar compression schemes, expanding the use of BitWeaving beyond simple scans, physical schema optimization using BitWeaving, making BitWeaving work with other architectures (e.g. GPUs), dealing with updates and batch appends, etc. So, we are likely to continue having fun with this line of thinking.

The BitWeaving work is being presented this week @ SIGMOD in New York, and I hope to see some of you there.