Monday, August 25, 2014

WideTable: An Accelerator for Analytic Data Processing

Last year, I told you about BitWeaving, which is a compressed columnar scan that exploits intra-cycle parallelism. With BitWeaving, you can scan a table ... pretty darn fast!

As part of the Quickstep project, Yinan and I have been wondering if we could expand the "sweet spot'' of the bare metal performance of BitWeaving to a wider variety of analytical query processing. The answer that we have come up with is called WideTable.

The key idea behind WideTable is to denormalize the data into a wide table, and then run complex analytic SQL queries (e.g. with joins and nested subqueries) on the WideTable using BitWeaved scans. There are a number of different ways to use a WideTable, including as a materialized view and as an accelerator. Below, we show the how WideTable can be used as an accelerator.

The WideTable component plugs in as a front-end to a traditional SQL engine, which could be a traditional SQL RDBMS system (like Oracle) or a Hadoop/SQL engine (like Hive). It takes data and schema and runs it through a denormalizer module that transforms the schema using outerjoins, converting the data into a WideTable. SQL queries are intercepted and run through a series of rules to determine if and how the query can be translated into scans on the WideTable. Not all queries can be translated to WideTable scans, and those that can't be transformed are sent to the original source system for evaluation in the "usual" ways. For more details, see the paper.

Two questions that naturally follow are: 1) What is the space overhead associated with WideTables, and 2) What class of queries can be translated into (fast BitWeaved) scans on WideTables.

As a yardstick, we use the TPC-H benchmark to answer these questions. For the scale factor 10 TPC-H dataset (approximately 10GB of raw data), the WideTables takes only 8.5GB of space (we end up creating more than one WideTable in this case). The BitWeaved columnar storage and its dictionary compression method helps control the space overhead that is normally associated with denomalization.

How many of the TPC-H queries can we answer using the WideTable method? All of them except for Query 21. (Query 21 has a non primary-key foreign-key join that we can't handle, and have to send that query to the source system.) Note we can also handle the update queries, though as you might guess updates are slower when using a WideTable. More details on update query performance can be found in the paper.

What is the performance benefit of WideTable? Here are numbers comparing WideTable and MonetDB on the 10GB TPC-H benchmark.

WideTable is over 10X better than MonetDB in many cases! See the paper for many more experimental results. We also compared with VectorWise and the results have the same relative trend as MonetDB.

Some parting notes. First, WideTable can be viewed as a materialized view. But, it is a unique type of materialized view that is schema-centric vs. being workload-centric. Second, it is also easier to optimize and predict the query response of the WideTable queries as the query plan is largely made of simple scan kernels. So, for example, p99 response time management is easier. Third, the WideTable technique can be used to dramatically reduce the size of the source system (e.g. use fewer nodes) if the workload is amenable to the WideTable technique. Finally, WideTable's effectiveness is dependent on the schema graph. Star schema is fairly easy to deal with, but there are ways to deal with snowflake schemas too by selectively denormalizing around each fact table, and then using fast join methods (like those described here and here). More on that at a later date as we keep working on improving the WideTable technique.

No comments:

Post a Comment