Graph Analytics: Simpler than we think?

Graph analytics are all the rage today with companies rushing to build new software platforms that target just graph analytics all the time. But, are they really that different than many types of iterative data processing?

Here we argue that for iterative graph processing, it is not. In fact, one can map BSP-style computation to the good-old relational algebra, with a few extensions for user-defined functions and the ability to express a loop. How can this be done?

Consider the common BSP style vertex-centric computation.  The canonical graph analytics computation loop looks like this:
 

We can map this generic BSP computation to relational algebra using the following recipe. First, store the graph data in two relational tables, a nodes table (with one tuple for each node), and an edges table (with one edge tuple for each edge in the graph). Then, use the table below to map the BSP computation to relational algebra.


Essentially in the relational system, one executes a sequence of joins, aggregates and user-defined functions to carry out the graph computation. For shortest-path computation, the last line above (send messages) takes the current shortest path for each node and “sends” it to the next iteration by computing a join based on the edges table. These new values are “received” (the “received messages” row) in the next iteration, and the minimum of the current shortest path and the new path lengths is computed. These new values are now stored (“Mutate Values”) in a table that has one tuple for each node.

A more detailed description is in the Fan et al. 2015 CIDR paper titled: “The Case Against Specialized Graph Analytics Engines”. But, I hope you get the general idea: Graph analytics can be mapped to relational algebra. Now with this mapping, you can leverage relational platforms that have matured over time and have high-performance scalable data processing algorithms. Basically, you can scale your graph algorithms by simply using relational platforms. The CIDR 2015 paper has performance results comparing this approach with a few other graph system (circa 2015).

We have recently picked this work and are now implementing Grail on top of the Quickstep system. (Quickstep is a high-performance relational database system built at Wisconsin.) While this work is still on-going, here are a few quick performance results for a network security application in which we find advanced persistent threats using application and system logs. On a dataset with about 3K nodes and 15K edges Quickstep with Grail (QuickGrail) was about 4X faster than Neo4J; on a dataset with 75K nodes and 153K edges QuickGrail was about 500X faster; and on larger datasets Neo4J ran out of memory while QuickGrail was able to carry out its computation (on this single node dual socket Xeon machine). These results are preliminary and over the Spring we plan to write out a full paper. So, look for it, if you are interested in this space!

Comments

Popular posts from this blog

WideTable: An Accelerator for Analytic Data Processing

Life Beyond Column-Stores: Exploiting intra-cycle parallelism