Tuesday, February 5, 2019

Graph Analytics: Simpler than we think?

Graph analytics are all the rage today with companies rushing to build new software platforms that target just graph analytics all the time. But, are they really that different than many types of iterative data processing?

Here we argue that for iterative graph processing, it is not. In fact, one can map BSP-style computation to the good-old relational algebra, with a few extensions for user-defined functions and the ability to express a loop. How can this be done?

Consider the common BSP style vertex-centric computation.  The canonical graph analytics computation loop looks like this:

We can map this generic BSP computation to relational algebra using the following recipe. First, store the graph data in two relational tables, a nodes table (with one tuple for each node), and an edges table (with one edge tuple for each edge in the graph). Then, use the table below to map the BSP computation to relational algebra.

Essentially in the relational system, one executes a sequence of joins, aggregates and user-defined functions to carry out the graph computation. For shortest-path computation, the last line above (send messages) takes the current shortest path for each node and “sends” it to the next iteration by computing a join based on the edges table. These new values are “received” (the “received messages” row) in the next iteration, and the minimum of the current shortest path and the new path lengths is computed. These new values are now stored (“Mutate Values”) in a table that has one tuple for each node.

A more detailed description is in the Fan et al. 2015 CIDR paper titled: “The Case Against Specialized Graph Analytics Engines”. But, I hope you get the general idea: Graph analytics can be mapped to relational algebra. Now with this mapping, you can leverage relational platforms that have matured over time and have high-performance scalable data processing algorithms. Basically, you can scale your graph algorithms by simply using relational platforms. The CIDR 2015 paper has performance results comparing this approach with a few other graph system (circa 2015).

We have recently picked this work and are now implementing Grail on top of the Quickstep system. (Quickstep is a high-performance relational database system built at Wisconsin.) While this work is still on-going, here are a few quick performance results for a network security application in which we find advanced persistent threats using application and system logs. On a dataset with about 3K nodes and 15K edges Quickstep with Grail (QuickGrail) was about 4X faster than Neo4J; on a dataset with 75K nodes and 153K edges QuickGrail was about 500X faster; and on larger datasets Neo4J ran out of memory while QuickGrail was able to carry out its computation (on this single node dual socket Xeon machine). These results are preliminary and over the Spring we plan to write out a full paper. So, look for it, if you are interested in this space!

Sunday, April 29, 2018

Database systems research: dead or alive? The answer may be backpropagation!

Much has been said recently about the future of database systems research. Mike Stonebraker has loudly complained about how core systems papers are disappearing at an alarming rate from SIGMOD and VLDB. He called it Fear #1 in a recent talk. Another excellent and a more optimistic view of the database research was published by Mike Caferalla and Chris RĂ©.

So what is going on: Is there or is there not an issue with database systems research? While in my mind there is no doubt that there is a rich future for database systems research, it is definitely the case that publishing systems papers has gotten much harder. There is undoubtedly an urgent issue to solve with reviewing papers.

One key root cause is that there is a lot of bad reviewing in our conferences. And, there is no long-term mechanism to deal with bad reviewing. With the gigantic PCs that we now have and a bandwagon approach to following-the-latest-trend (Mike makes a point about this too in his talk), the way to game the system is to write a lot of papers. By definition, systems folks are at a disadvantage in this game as it takes much longer to write most systems papers. For example, the Quickstep paper (VLDB'18) took 5+ years with 7+ students to write.

So, a key issue that we need to deal with is measuring and reporting the quality of reviewers. Here I have a simple suggestion to PC chairs and the powers-to-be at the SIGMOD and PVLDB executive committees:
  1. Reduce the size of the PC so that each PC member has a significant number of papers to review. Current SIGMOD and VLDB committees are way too large. Reducing them to half is a good start (fewer random variables). For PVLDB the monthly cycle adds even more randomness. Reduce that too.
  2. Incorporate transparency in the review process so that the distribution of the overall recommendations made by the PC member is made public. There are reviewers who consistently hate everything they read, and reviewers who love nearly everything that they read. This really messes up the system, and I think is a big factor in why our system is broken. Note reducing the PC size (suggestion #1) yields more data per PC member, which you need to make sound comparisons. Also randomly assign papers to reviewers based on the interests that reviewers express to reduce the chances of someone saying "I had a bad batch of reviews."
In other words, a key problem that we need to solve is having a fairer way of accepting or rejecting papers. A key component to achieving this fairness is exposing overly optimistic or overly pessimistic reviewers. Lets do just that! Let's create a website (add this to DBLP?) with a scorecard for each reviewer that shows the distribution of their overall accept/reject scores. This information can be used by PC chairs to decide how to pick a good PC. Most importantly, it may lead to reviewers self-correcting themselves so that they are more balanced in their evaluation.

This backpropagation, if you will, can help correct those reviewers who haven't calibrated themselves (but wish to), and starkly expose those who likely shouldn't be on PCs.

Of course there are lots other good ideas floating around, including having more training for junior members of the community (I hope all advisors train their students in this), in-person PC meetings at least for the area chairs, and so on. I see all those as complementary to the proposal above.

What do you think?

Saturday, February 11, 2017

Introducing Ava

Recently we published a paper @ CIDR on a new system (called Ava) that we are building here @ UW. SIGMOD invited me to write a blog, and that blog has just been published here. In a nutshell, we claim that many data science pipelines will be constructed by chatbots in the near future, and going further in the longer run conversational bots will allow end users to use natural language (both for Q&A) to directly get insights from their data.

Monday, June 8, 2015

Quickstep is now part of Pivotal

In January, my students and I spun out a company to help commercialize the technology produced in the Quickstep project. I'm excited to announce that we have now been acquired by Pivotal. More details are here and here.

Monday, August 25, 2014

WideTable: An Accelerator for Analytic Data Processing

Last year, I told you about BitWeaving, which is a compressed columnar scan that exploits intra-cycle parallelism. With BitWeaving, you can scan a table ... pretty darn fast!

As part of the Quickstep project, Yinan and I have been wondering if we could expand the "sweet spot'' of the bare metal performance of BitWeaving to a wider variety of analytical query processing. The answer that we have come up with is called WideTable.

The key idea behind WideTable is to denormalize the data into a wide table, and then run complex analytic SQL queries (e.g. with joins and nested subqueries) on the WideTable using BitWeaved scans. There are a number of different ways to use a WideTable, including as a materialized view and as an accelerator. Below, we show the how WideTable can be used as an accelerator.

The WideTable component plugs in as a front-end to a traditional SQL engine, which could be a traditional SQL RDBMS system (like Oracle) or a Hadoop/SQL engine (like Hive). It takes data and schema and runs it through a denormalizer module that transforms the schema using outerjoins, converting the data into a WideTable. SQL queries are intercepted and run through a series of rules to determine if and how the query can be translated into scans on the WideTable. Not all queries can be translated to WideTable scans, and those that can't be transformed are sent to the original source system for evaluation in the "usual" ways. For more details, see the paper.

Two questions that naturally follow are: 1) What is the space overhead associated with WideTables, and 2) What class of queries can be translated into (fast BitWeaved) scans on WideTables.

As a yardstick, we use the TPC-H benchmark to answer these questions. For the scale factor 10 TPC-H dataset (approximately 10GB of raw data), the WideTables takes only 8.5GB of space (we end up creating more than one WideTable in this case). The BitWeaved columnar storage and its dictionary compression method helps control the space overhead that is normally associated with denomalization.

How many of the TPC-H queries can we answer using the WideTable method? All of them except for Query 21. (Query 21 has a non primary-key foreign-key join that we can't handle, and have to send that query to the source system.) Note we can also handle the update queries, though as you might guess updates are slower when using a WideTable. More details on update query performance can be found in the paper.

What is the performance benefit of WideTable? Here are numbers comparing WideTable and MonetDB on the 10GB TPC-H benchmark.

WideTable is over 10X better than MonetDB in many cases! See the paper for many more experimental results. We also compared with VectorWise and the results have the same relative trend as MonetDB.

Some parting notes. First, WideTable can be viewed as a materialized view. But, it is a unique type of materialized view that is schema-centric vs. being workload-centric. Second, it is also easier to optimize and predict the query response of the WideTable queries as the query plan is largely made of simple scan kernels. So, for example, p99 response time management is easier. Third, the WideTable technique can be used to dramatically reduce the size of the source system (e.g. use fewer nodes) if the workload is amenable to the WideTable technique. Finally, WideTable's effectiveness is dependent on the schema graph. Star schema is fairly easy to deal with, but there are ways to deal with snowflake schemas too by selectively denormalizing around each fact table, and then using fast join methods (like those described here and here). More on that at a later date as we keep working on improving the WideTable technique.

Friday, August 2, 2013

From Locomatix to Twitter

Way back five years ago, just as the iPhone came out, Karthik Ramasamy (@karthikz) and I (@pateljm) started talking about building a combined real-time streaming and real-time analytical platform to power enterprise and consumer mobile services. We started a company called Locomatix to build such a platform. Sanjeev Kulkarni (@sanjeevrk), a founding member of the Google AdSense team, joined Locomatix soon after it was formed.  Next came Chris Kellogg (@cckellogg), who joined in his last semester at the University of Michigan.

Today, I’m very happy to announce that the Locomatix team is joining Twitter!

Special thanks to our families and friends for supporting us throughout this incredible journey. We simply wouldn't have made it without you!

Sanjeev, Karthik and I all got our graduate degrees from the University of Wisconsin, so a big part of Locomatix was powered by UW grads. Go Badgers! We are working on converting Chris into a Badger fan -- its still work in (very slow) progress.

Monday, June 24, 2013

Life Beyond Column-Stores: Exploiting intra-cycle parallelism

So, I have finally decided to start a blog. Why now, you might ask?

The first reason is that I have been too lazy to start a blog till now. Since blogs often matter as much as publications, this oversight is inexcusable. Better late than never.

The second reason to start a blog is that U. Wisconsin doesn’t, AFAIK, have a blog in the big data space. This is sad given the tradition that our university has in producing key data processing technologies, like the GAMMA parallel database system (which continues to influence what we now call “big data systems”), and BIRCH which provided a key pivot point for data mining (the buzzword-compliant term for that today is “deep data analytics”). This blog is a humble start in trying to get some of what Wisconsin does today in the big data space out into the blogosphere.

So, the topic for today is – Exploiting intra-cycle parallelism for query processing. First, a quick background behind this line of thinking. If you zoom into the processor and view it at the circuit-level, then there is an incredible amount of parallelism within each cycle. For example, if you add two 64-bit numbers, which takes about one cycle, the hardware circuits are actually computing on all the 64 bits in parallel. Thus, there is a 64-way parallelism at the circuit level. Enough here to be a potential game changer if we can exploit it effectively for some key data processing kernel.

The data processing kernel that we picked is a simple scan. Scans are pretty common in practice, and companies like SAP and IBM tend to worry about its efficiency excessively in their main memory data analytics products. Now, if you care about efficiency in these environments, then you should have implemented a column store in your product. So, the real challenge is: How can we speed up scans in a column-store engine?

The answer that Yinan Li and I came up with is called BitWeaving. BitWeaving builds on column stores but takes it to the next level. Here is how it works: First, we take each column and encode it using a fixed-length order-preserving code. So, if you have a column for all the birthdates of users in a table, then we map each unique birthdate in the column to an array of codes (one code for each original column).

Next, we view the codes at the bit-level (just like the processor's circuit does), and layout the bits in memory in a way that lets us exploit the circuit-level intra-cycle parallelism.

BitWeaving comes in two flavors. The first called BitWeaving/V, is like a column store, but at the bit level.  So, the highest-order bits of the column are layed out sequentially in memory, followed by the second highest-order bit, and so on. Thus, when you bring data into the processor, which is typically about 64-bits at a time, you actually have bits from different columns that are brought into the processor registers.

We have an algebraic framework that allows us to operate on these "oddly packed bits." With this framework, we can use regular CPU instructions like binary addition and exclusive OR, to operate on bits across the 64 columns in a few instructions (cycles). Furthermore, in many cases, we can safely prune the computation without looking at all the bits! The intuition behind the early pruning is as follows: Consider column codes that are 3-bit wide. If you are evaluating the predicate "col < 2", then you know that any column with a bit value set to 1 in the most significant bit position does not match the predicate, and you don't need to look at the last two bits of information. Organizing the column at the bit level allows us to skip over columns of bits. The end result is that in just a few cycles, we can compute the predicate for 64 columns. In other words, we have successfully exploited intra-cycle parallelism!

The second way to BitWeave is called BitWeaving/H. Here we pack the codes horizontally, laying the codes sequentially in memory. Such horizontal code packing is not new, but what we do that is new is:
  1. Add an extra bit to each code, which we use as a placeholder to record the result of the predicate evaluation on that code, and 
  2. We store consecutive codes stacked as columns across the memory. So, if we have a 3-bit code, the code for column 1 is followed in the memory address space by the code for column 5 (and not column 2!). See the paper more details.
With these techniques we can dramatically reduce the number of cycles that is used by BitWeaving/H to evaluate predicates. The end result is that we can compute the predicate across a batch of codes in a few cycles (again, we have successfully exploited intra-cycle parallelism).

Okay, so how well does this work in practice? Below is a figure comparing the performance of different techniques using cycles/code as the performance metric. These results are for a synthetic data set with a single column and one billion uniformly distributed integer values in this column. In this experiment, we vary the column width (# bits in the code). We scan the table using a predicate that selects 10% of the tuples, and then feed the results to a COUNT aggregate. This experiment was run on a single core of a Xeon X5680.

The top line in this figure corresponds to fetching one column code at a time, evaluating the predicate, and then feeding the result to a count aggregate operator. The second line labeled “SIMD Scan” is our best effort implementation of the method described in the SAP paper. It packs codes horizontally, like BitWeaving/H, but without the extra bit and the columnar layout, and uses SIMD instructions. The next two lines represent BitWeaving/H and BitWeaving/V.

As we can see, both BitWeaving methods provide significant gains over the traditional method and the SIMD Scan. For example, with a code size of 12 bits, BitWeaving/V is 10X and 6X faster than the Traditional and the SIMD Scan methods respectively. At a code size of 12 bits, BitWeaving/H is 9X and 5X faster than the Traditional and the SIMD Scan methods respectively.

Of course in the experiment above, the final output was a COUNT(*), which can be computed directly by counting how many results match the predicate. As one might imagine BitWeaving/V is more expensive if you have to actually reconstruct the column to produce the output. Here is another experimental result with a TPC-H database at scale factor 10 (i.e. a 10GB database), with the scan query Q6. This query has an aggregate that requires computing the average across a product of two columns.  Here BitWeaving/H outperforms BitWeaving/V.
One final note, BitWeaving can be used both as a native storage format and/or as an index. With 2-way replication, you could choose to store one copy of the column in a BitWeaved format and the other in a regular column store.

Of course, there are lots of interesting unanswered questions, including how to extend this idea of intra-cycle parallelism to work with other columnar compression schemes, expanding the use of BitWeaving beyond simple scans, physical schema optimization using BitWeaving, making BitWeaving work with other architectures (e.g. GPUs), dealing with updates and batch appends, etc. So, we are likely to continue having fun with this line of thinking.

The BitWeaving work is being presented this week @ SIGMOD in New York, and I hope to see some of you there.