SIPping and TIPping for faster search over sorted arrays

Can you do better than binary search?

TL;DR: Yes, you can, with our new algorithms SIP and TIP. And not everything at SIGMOD/PVLDB has to be called ML-in-Systems!

That was the question that two of my students, Peter Van Sandt (a recent graduate from our BS program and the winner of the David DeWitt Undergraduate Scholarship) and Yannis Chronis, worked on over the last few years. It is a central question, as searching over sorted in-memory arrays is a fundamental operation used in nearly every modern data platform. And most of these platforms use some form of Binary Search (BS).

Interestingly, Interpolation Search (IS) has a better average-case complexity of O(log log N), but this result assumes that the data is uniformly distributed. Past attempts to use IS in less-than-ideal situations have generally not worked, leaving BS as the undisputed search method over sorted arrays.

We took a deeper look at this search problem in a paper that Peter and Yannis will co-present at the upcoming SIGMOD conference. We started by noting a key technological trend: memory accesses will continue to become relatively more expensive compared to CPU cycles. While IS incurs a far more expensive computation in the inner search loop (to calculate the interpolant), it should incur far fewer memory lookups than BS: O(log log N) vs. O(log N). Yet past attempts to use IS have, in general, not been able to beat a well-tuned BS implementation.
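For reference, here is a minimal sketch of textbook interpolation search in C++ (my own illustration, not the code from the paper; the function name and the int64_t key type are just assumptions for the example). The only difference from binary search is the probe position: instead of always probing the midpoint, IS guesses where the key would sit if the values were spread uniformly between the two endpoints.

```cpp
#include <cstdint>

// Textbook interpolation search (a minimal sketch, not the paper's code).
// Returns the index of key in the ascending sorted array a[0..n-1], or -1.
long interpolation_search(const int64_t* a, long n, int64_t key) {
    long left = 0, right = n - 1;
    while (left <= right && key >= a[left] && key <= a[right]) {
        if (a[left] == a[right])             // avoid dividing by zero
            return (a[left] == key) ? left : -1;
        // Guess where the key would sit if values were spread uniformly
        // between a[left] and a[right].
        long probe = left + (long)((double)(key - a[left]) /
                                   (double)(a[right] - a[left]) *
                                   (double)(right - left));
        if (a[probe] == key)      return probe;
        else if (a[probe] < key)  left = probe + 1;
        else                      right = probe - 1;
    }
    return -1;
}
```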

We dug in and started to look at why IS was slower. For one, if you do simple linear interpolation, you could land far away from your “answer”. In the figure below, you can see how, with linear interpolation using the leftmost and rightmost points, a straight-line interpolation ends up at point 1, which is element 2 in the sorted array. This is far away from the actual answer (in this case, the position of y = 341); the skewed distribution throws off the linear interpolation calculation. We need an adaptive method to deal with skew.
Left figure: A collection of values. Right figure: Linear interpolation when searching for y = 341 (which is located at x = 6).
But we don’t have to interpolate using just two points. If we interpolate using three points, we can make a far better “guess.” However, we need to compute this 3-point interpolation efficiently. We found a beautiful result proposed by Jarratt and Nudds in 1965 for the problem of 3-point iterative root finding using linear fractions. To make it work in our case, we had to combine it with another technique called bracketing (more details are in the paper). Collectively, we end up with an efficient interpolation calculation (the exact formula is in the paper; a rough sketch of the idea follows).
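To give a flavor of the 3-point idea, here is a small C++ sketch of interpolating through three bracketing points with a linear fraction (a hyperbola), which is the spirit of the Jarratt and Nudds result. This is my own derivation for illustration; it is not the closed-form expression from the paper, and it ignores the bracketing and fixed-point machinery.

```cpp
// Fit x(y) = (a + b*d) / (1 + c*d), where d = y - y_target, through three
// points and return x(y_target) = a, the next probe position. In the search
// setting, the x values are array positions and the y values are the keys
// stored at those positions.
double three_point_interpolate(double x0, double y0,
                               double x1, double y1,
                               double x2, double y2,
                               double y_target) {
    const double d0 = y0 - y_target;
    const double d1 = y1 - y_target;
    const double d2 = y2 - y_target;

    // Two linear equations in (b, c), obtained by subtracting the fit
    // conditions x_i * (1 + c*d_i) = a + b*d_i pairwise to eliminate a.
    const double a11 = d1 - d0, a12 = -(x1 * d1 - x0 * d0), r1 = x1 - x0;
    const double a21 = d2 - d1, a22 = -(x2 * d2 - x1 * d1), r2 = x2 - x1;

    // Solve the 2x2 system by Cramer's rule.
    const double det = a11 * a22 - a12 * a21;
    const double b = (r1 * a22 - a12 * r2) / det;
    const double c = (a11 * r2 - r1 * a21) / det;

    // a = x(y_target): the interpolated position of the target key.
    return x1 + c * x1 * d1 - b * d1;
}
```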

We can make the computation even more efficient in the first iteration (the initialization) by building on a result from Ridders ’79 (again, see the paper for the exact formula).
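For intuition, the classic Ridders step takes three equally spaced points (here: the two ends and the middle of the array) and fits an exponential correction to them. Below is a sketch of that textbook step applied to our setting; the paper's initialization builds on the same result but may differ in the details, and the function name and types are my own assumptions.

```cpp
#include <cmath>
#include <cstdint>

// A sketch of the classic Ridders (1979) step used as a cheap first guess:
// treat f(i) = a[i] - key and use the two ends and the midpoint of the array.
// Assumes the key is bracketed (f at the ends differs in sign), so the
// expression under the square root is positive.
long ridders_first_guess(const int64_t* a, long n, int64_t key) {
    const long lo = 0, hi = n - 1, mid = (lo + hi) / 2;
    const double f_lo  = (double)(a[lo]  - key);
    const double f_mid = (double)(a[mid] - key);
    const double f_hi  = (double)(a[hi]  - key);

    const double s    = std::sqrt(f_mid * f_mid - f_lo * f_hi);
    const double sign = (f_lo >= f_hi) ? 1.0 : -1.0;
    return mid + (long)((double)(mid - lo) * sign * f_mid / s);
}
```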

There are a few more mechanisms that we bring to the table, including reusing the calculation of the slope in the interpolation from one loop iteration to the next (it doesn’t change that quickly), and using guards (switching to sequential search when the search window “narrows”).

Another key mechanism is to use fixed-point arithmetic instead of floating-point arithmetic. Similar to work on Machine Learning (ML) acceleration methods (such as the work on the TPU at Google and BuckWild! at Stanford), here too we can accommodate some loss of precision in the calculations. In our case, we exploit the observation that multiplication followed by division by 2^64 can be fused together and done more quickly than multiplication by an arbitrary fraction. (See the paper for more details.)
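Concretely, the trick is the standard one of storing the fraction as a 64-bit fixed-point value scaled by 2^64, so that "multiply, then divide by 2^64" is just the high half of a 64x64-bit widening multiply. A minimal sketch (my own illustration using the GCC/Clang __int128 extension, not the paper's code):

```cpp
#include <cstdint>

// frac_fixed approximates a fraction in [0, 1) scaled by 2^64, i.e.,
// frac_fixed ~= fraction * 2^64. Multiplying and then dividing by 2^64
// is fused: take the high 64 bits of the 128-bit product. On x86-64
// this compiles down to a single widening multiply.
static inline uint64_t mul_fraction(uint64_t x, uint64_t frac_fixed) {
    return (uint64_t)(((unsigned __int128)x * frac_fixed) >> 64);
}
```

For example, a slope such as (right - left) / (a[right] - a[left]) can be converted to this fixed-point form once, after which each interpolation step costs one integer multiply rather than a floating-point multiply plus conversions.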

Combining all of this, we propose two algorithms, Slope-reuse Interpolation (SIP) and Three-point Interpolation (TIP), that share a common recipe: interpolate to pick the next probe, guard the probe, and finish with a sequential search once the window is small.
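To tie the pieces together, here is a rough, SIP-flavored outline in C++ of how the mechanisms above fit into one loop. This is a paraphrase for illustration only (the guard threshold, the simple clamping guard, and the error handling are my own assumptions), not the paper's implementation:

```cpp
#include <cstdint>

// A rough outline of the recipe: interpolate, guard, and fall back to
// sequential search when the remaining window is small. The slope is
// computed once and reused across iterations; a full implementation
// would refresh it when it drifts, handle duplicates, and use the
// fixed-point multiply shown earlier. Assumes n > 0 and a sorted ascending.
long sip_like_search(const int64_t* a, long n, int64_t key) {
    const long GUARD = 8;                       // hypothetical switch-over size
    long left = 0, right = n - 1;
    const double denom = (double)(a[right] - a[left]);
    const double slope = (denom != 0.0) ? (double)(right - left) / denom : 0.0;
    while (right - left > GUARD) {
        long probe = left + (long)((double)(key - a[left]) * slope);
        if (probe < left)  probe = left;        // guard: stay inside the window
        if (probe > right) probe = right;
        if (a[probe] < key)      left  = probe + 1;
        else if (a[probe] > key) right = probe - 1;
        else                     return probe;
    }
    for (long i = left; i <= right; ++i)        // finish with a sequential scan
        if (a[i] == key) return i;
    return -1;
}
```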

We compared these methods with an optimized version of BS on a number of datasets. The key result is highlighted below with one uniform dataset (Facebook user IDs) and one skewed dataset. The figure shows the relative performance of SIP, TIP, and vanilla IS compared to BS (the horizontal line) when performing key lookups on a sorted array.
As can be seen above, SIP is about 2X faster than BS on the Facebook dataset. TIP is ~3X faster than BS on the skewed dataset, where the other methods are actually far slower than BS. There are additional results in the paper, including the end-to-end benefits of this approach in NumPy and LevelDB.

Three final notes: 
1. The approach we take here, namely fast and efficient interpolation, is a technique that many in the community are starting to employ under the broader umbrella of applying ML methods to systems. Some examples of this approach are Andy Pavlo's work on OtterTune and Tim Kraska's work on SageDB. Take, for example, the OtterTune work: Andy applies ML to tune systems, which is really hard to do with traditional tuning methods. Within this context, there are interesting parallels between using optimized methods for interpolation (as we do here) and using ML to predict the "right" point in a distribution. Exploring this connection is an interesting direction for future work.

2. We had an interesting experience with this paper at SIGMOD. While most reviewers liked this paper in the first round, one reviewer was super-negative (among the harshest reviews I have ever seen). As far as we can tell, the key criticism was: simplicity in solving a problem is "not novel." This kind of thinking is increasingly starting to creep into our community, where everything needs to be called ML-in-Systems to "sell." Dangerously, funding proposals are subject to this trend too. While there are compelling reasons to bring ML into systems (see some examples above), I think it is dangerous to call everything ML-in-Systems. We could have called the adaptive interpolation above a "regression" and thus magically made it an ML algorithm, but do we need to? There is, in my view, a more elegant and simpler way to present and solve this problem by giving credit to the amazing results from decades-old optimization/approximation theory. Why cloud it with marketing terms? Again, I want to emphasize that I think there are many valid applications of ML-in-Systems, but there are also cases where we don't need to call everything we do ML. We should definitely not reject papers for not having been "marketed right." Why not present things in the simplest possible way, like academicians often (used to?) do?

3. In a normal year, one super-negative review in the initial round would have killed the paper. However, thanks to the excellent reviewing work at SIGMOD this year, we got a shot at rebutting the reviewer. We want to give a huge shout-out to the SIGMOD PC and the key leaders Anastasia Ailamaki (PC Chair), Amol Deshpande and Tim Kraska (Vice-Chairs) for a thorough review process. Thanks for accepting a paper even though it wasn't sold as ML-in-Systems!
