
Optimization of External Sort Performance


Sorting is one of the fundamental operations of any query system, and good sort performance pays dividends in overall query performance. Drill's external sort performs an in-memory sort-merge when data fits in memory, and spills to disk otherwise. The performance of the in-memory sort is already quite good. Can we make it better? It turns out we can double performance (cut run time in half) by revising how Drill generates code for the external sort operator.

Overview

Recent experiments show that we can more than double in-memory sort performance by generating tighter code in the “inner loops” of the sort and merge implementations. The improved sort performance yields a 50% reduction in overall run time (a doubling of performance) for the sample query, which sorts 10 million rows of randomly generated integers. Other queries will, of course, see different results.

The details are below. In short, the generated code uses layer upon layer of function calls to fetch each value. The speed-up comes from eliminating 13 function calls per compare operation. This matters because the sort performs on the order of O(n log n) compares, so the savings add up quickly: for this test, the reduction is something like ~1.5 billion function calls.

Existing:

compare()
. SelectionVector2.getIndex() // x 2
. . DrillBuf.getChar()
. . . DrillBuf.getShort()
. . . . PlatformDependent.getShort()
. doEval() // generated; this is for a required int
. . IntVector.getAccessor() // x 2
. . . Accessor.get()
. . . . DrillBuf.getInt()
. . . . . PlatformDependent.getInt()

Revised:

    public int compare(int leftIndex, int rightIndex) {
      int sv1 = PlatformDependent.getShort(addrSv + (leftIndex << 1)) & 0xFFFF;
      int sv2 = PlatformDependent.getShort(addrSv + (rightIndex << 1)) & 0xFFFF;
      int left = PlatformDependent.getInt(addr0 + (sv1 << 2));
      int right = PlatformDependent.getInt(addr4 + (sv2 << 2));
      return left < right ? -1 : (left == right) ? 0 : 1;
    }

A similar change was made in the “MSorter”: the code that merges sorted batches in memory.

Interestingly, no additional performance gain was seen in replacing the PlatformDependent calls with direct calls to Unsafe. Presumably that path is simple enough to be optimized away by the JVM.

This experiment hand-coded the replacement “SingleBatchSorter” and “MSorter”. A next step would be to revise the code generator to emit the required code. And, of course, nullable, variable-length, and array vectors will be more complex than the simple required int vector used in the experiment.

This technique can be used anywhere we have compute-intensive inner loops in generated code.

Baseline Sort Numbers

To determine where we stand, we can start with a baseline using several available sort algorithms. All tests use a data set of 10 million randomly generated integers uniformly distributed over the entire range of values (both positive and negative.)

The sorts are:

Arrays.sort(int): 847 ms
Arrays.sort(Integer): 4681 ms
TimSort.sort(Integer, comparator): 4982 ms
Hadoop QuickSort (Integer): 5920 ms
Hadoop QuickSort (int): 2398 ms
Hadoop QuickSort (ByteBuffer): 4090 ms
Hadoop QuickSort (ByteBuffer, SV): 9840 ms

The first two data points use the basic Arrays.sort method: the first on an array of int values, the second on an array of Integer objects. In Java 8, the object version of Arrays.sort is implemented using a "TimSort", while the primitive version uses a dual-pivot quicksort (see the Javadocs for details.) The third test uses TimSort directly, passing in a comparator object, to measure the overhead of the extra method call.

Hadoop provides a QuickSort implementation which is the basis for Drill's in-memory sort. The next two lines measure its time using an array of Integer and of int. Drill stores data in direct memory buffers; to simulate that, the penultimate line runs the Hadoop QuickSort with data stored in a ByteBuffer. Drill actually adds a level of indirection, the so-called "selection vector." The last line simulates that indirection using a selection vector and a data vector, both in ByteBuffers.

The numbers above came from a test run on a Mac. Values shown are from the last of five iterations of each test.
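For reference, here is a minimal sketch of the kind of harness behind the first baseline. The class name and structure are illustrative only, not the actual test code:

    import java.util.Arrays;
    import java.util.Random;

    public class SortBaseline {
      private static final int COUNT = 10_000_000;

      public static void main(String[] args) {
        // Generate 10 million integers uniformly distributed over the full int range.
        Random rand = new Random();
        int[] values = new int[COUNT];
        for (int i = 0; i < COUNT; i++) {
          values[i] = rand.nextInt();
        }
        // Run five iterations; the last reflects a warmed-up JIT.
        for (int iter = 0; iter < 5; iter++) {
          int[] copy = Arrays.copyOf(values, COUNT);
          long start = System.currentTimeMillis();
          Arrays.sort(copy);
          System.out.println("Arrays.sort(int) time: " + (System.currentTimeMillis() - start));
        }
      }
    }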

As we will see, Drill actually compares very well against these experiments.

Drill Baseline

To provide a place to start, consider a query of 10 million rows using the mock data source described on the Performance Basics page. The tests use an embedded Drillbit in the test framework described on that page, along with the configuration options described there (turning off assertions, disabling debug logging, and so on). Sufficient memory is provided that the sort occurs in memory. The query is:

SELECT id_i FROM `mock`.employee_10M ORDER BY id_i

This simply says to select an integer field ("_i") that we call "id" for convenience, from a table of 10 million generated rows ("_10M") that we call "employee", sorting by that integer field.

The query runs in 9684 ms. The physical plan and time breakdown are as follows:

Sorted 10,000,000 records in 154 batches, elapsed: 9684 ms

Op: 0 Screen
  Process: 10 - 0%, 0%
  Wait:    8
Op: 1 Project
  Process: 3 - 0%, 0%
Op: 2 SelectionVectorRemover
  Process: 846 - 8%, 8%
Op: 3 Sort
  Process: 8482 - 89%, 89%
  Spills: 0
  Memory: 118
Op: 4 Scan
  Process: 144 - 1%, 1%
Total:
  Setup:   1
  Process: 9485

As expected, the bulk of the time (89%) is spent in the (in-memory) sort. A bit of analysis showed that the time is spent in two key "inner loops": the in-memory sorter and the in-memory merger (called, for some reason, the "MSorter"):

Load time: 3219 - 144 = 3075 ms.
MSort time: 5256 ms.

Here, load time is mostly the per-batch sorter (shown with the scan time subtracted.) The rest of the time is in the MSorter (merger).

Optimizing the In-Memory Sorter

The bulk of the time in the in-memory sorter is in the code generated from the SingleBatchSorterTemplate template. Fortunately, DRILL-5052 gives us the tools to see and debug the generated code. Here is an abbreviated form:

public class SingleBatchSorterGen1 extends SingleBatchSorterTemplate {
    IntVector vv0;
    IntVector vv4;

    public int doEval(char leftIndex, char rightIndex) throws SchemaChangeException {
        IntHolder out3 = new IntHolder();
        out3.value = vv0.getAccessor().get(leftIndex);
        IntHolder out7 = new IntHolder();
        out7.value = vv4.getAccessor().get(rightIndex);
        IntHolder out8 = new IntHolder();
        IntHolder left = out3;
        IntHolder right = out7;
        out8.value = left.value < right.value ? -1 : (left.value == right.value ? 0 : 1);
        return out8.value;
    }

This code is unique to the fact that we are sorting (non-nullable) integers. The code would vary if the sort were for nullable columns, were for another data type, were for variable-length columns (e.g. VARCHAR), etc. That's why Drill needs code generation.
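For example, a nullable INT version of the comparison must consult the "is set" bits as well as the values. The following is only a rough sketch of what such generated code might look like (not actual Drill output), assuming nulls sort low; the real ordering depends on the ORDER BY clause:

    NullableIntVector vv0; // sketch: the nullable variant of the vector

    public int doEval(char leftIndex, char rightIndex) {
      int leftSet = vv0.getAccessor().isSet(leftIndex);   // 1 if non-null, 0 if null
      int rightSet = vv0.getAccessor().isSet(rightIndex);
      if (leftSet == 0 || rightSet == 0) {
        return leftSet - rightSet; // null vs. non-null decided by the set bits
      }
      int left = vv0.getAccessor().get(leftIndex);
      int right = vv0.getAccessor().get(rightIndex);
      return left < right ? -1 : (left == right ? 0 : 1);
    }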

What we see is that the sorter acts as if it has two incoming vectors: vv0 and vv4. This is an artifact of code generation; a sorter, of course, has just a single input vector. The code fetches a value from each, given the two indexes, and compares them. Simple enough.

Note also that the generated code uses "holders" to work with vector values. Each holder is a new Java object. We might think that we could gain some performance by removing the temporary objects, but that already happens. Drill already performs "scalar value replacement" at the byte code level: it is the magic that replaces these temporary objects with scalar variables. However, in modern Java, even that is unnecessary: the Java compiler and/or runtime does the work "for free."

To make our task a bit easier, we can rewrite the code as the JVM sees it:

    public int doEval(char leftIndex, char rightIndex) throws SchemaChangeException {
        int left = vv0.getAccessor().get(leftIndex);
        int right = vv0.getAccessor().get(rightIndex);
        return left < right ? -1 : (left == right ? 0 : 1);
    }

Note that since vv0 and vv4 are the same, we retain just one of them.

Cache Accessor

The key observation runs a bit deeper. Notice how we fetch each value:

left = vv0.getAccessor().get(leftIndex);

That is two function calls per value, four per comparison. That's got to add up. How could we eliminate them? First, we could cache the accessor instead of the vector. Here is the current setup code (retaining just vv0):

    public void doSetup(FragmentContext context, VectorAccessible incoming, RecordBatch outgoing)
        throws SchemaChangeException
    {
        int[] fieldIds1 = new int[] { 0 };
        Object tmp2 = (incoming).getValueAccessorById(IntVector.class, fieldIds1).getValueVector();
        vv0 = ((IntVector) tmp2);
    }

Caching the accessor is simple:

        a0 = vv0.getAccessor();

Now eval is:

        left = a0.get(leftIndex);
        right = a0.get(rightIndex);
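Putting the pieces together, the accessor-cached class looks roughly like this (a sketch assembled from the fragments above; a0 is the cached accessor field):

    public class SingleBatchSorterGen1 extends SingleBatchSorterTemplate {
        IntVector vv0;
        IntVector.Accessor a0; // cached once per batch

        public void doSetup(FragmentContext context, VectorAccessible incoming, RecordBatch outgoing)
            throws SchemaChangeException {
            int[] fieldIds1 = new int[] { 0 };
            vv0 = (IntVector) incoming.getValueAccessorById(IntVector.class, fieldIds1).getValueVector();
            a0 = vv0.getAccessor();
        }

        public int doEval(char leftIndex, char rightIndex) {
            int left = a0.get(leftIndex);   // one call per value instead of two
            int right = a0.get(rightIndex);
            return left < right ? -1 : (left == right ? 0 : 1);
        }
    }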

Cache the DrillBuf

This provides a bit of an improvement, but we want to go further. We can ask what the get() method does. It turns out that, for IntVector, it just turns around and invokes a DrillBuf method:

    public int get(int index) {
      return data.getInt(index * 4);
    }

So, we could repeat the trick, caching the DrillBuf instead of the accessor. In doing so, we must now do the byte-offset calculations ourselves:

        b0 = vv0.getBuffer();

        left = b0.getInt(leftIndex * 4);
        right = b0.getInt(rightIndex * 4);
  

Work with Memory Addresses

This is again a bit of an improvement, but we can go further still. Recall that Drill's value vectors are just blocks of direct memory. Why not work directly with the underlying memory addresses? (Yes, this strips away several layers of abstraction, but in production this code would be generated, so the code generator can track any changes in the value vector implementation.)

Now we have:

      addr0 = vv0.getBuffer().memoryAddress();

      int left = PlatformDependent.getInt(addr0 + (leftIndex << 2));
      int right = PlatformDependent.getInt(addr0 + (rightIndex << 2));

Doing all this reduces run time noticeably because we've squeezed out six function calls per comparison.
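Assembled in one place, the address-based sorter looks roughly like this (a sketch assuming the buffer address is cached once per batch and the buffer does not move during the sort; PlatformDependent is io.netty.util.internal.PlatformDependent):

    public class SingleBatchSorterGen1 extends SingleBatchSorterTemplate {
        private long addr0; // base address of the data buffer

        public void doSetup(FragmentContext context, VectorAccessible incoming, RecordBatch outgoing)
            throws SchemaChangeException {
            int[] fieldIds1 = new int[] { 0 };
            IntVector vv0 = (IntVector) incoming
                .getValueAccessorById(IntVector.class, fieldIds1).getValueVector();
            addr0 = vv0.getBuffer().memoryAddress();
        }

        public int doEval(char leftIndex, char rightIndex) {
            // One PlatformDependent call per value; the byte offset is just a shift.
            int left = PlatformDependent.getInt(addr0 + (leftIndex << 2));
            int right = PlatformDependent.getInt(addr0 + (rightIndex << 2));
            return left < right ? -1 : (left == right ? 0 : 1);
        }
    }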

Optimizing the Selection Vector

The methods examined so far sit within a larger context in the SingleBatchSorterTemplate. Sorting is actually done through a selection vector of two-byte indexes (an "SV2"). Original code:

  public int compare(int leftIndex, int rightIndex) {
    char sv1 = vector2.getIndex(leftIndex);
    char sv2 = vector2.getIndex(rightIndex);
    return doEval(sv1, sv2);
  }

The compare method is called by the Hadoop QuickSort, so we can't optimize above this method. Note that the method accesses the selection vector through the same levels of indirection we saw for the data vectors. We can replace the above with direct memory access, then inline doEval to avoid yet another method call. The result is:

    public int compare(int leftIndex, int rightIndex) {
      int sv1 = PlatformDependent.getShort(addrSv + (leftIndex << 1)) & 0xFFFF;
      int sv2 = PlatformDependent.getShort(addrSv + (rightIndex << 1)) & 0xFFFF;
      int left = PlatformDependent.getInt(addr0 + (sv1 << 2));
      int right = PlatformDependent.getInt(addr4 + (sv2 << 2));
      return left < right ? -1 : (left == right) ? 0 : 1;
    }
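Hadoop's QuickSort drives the sort through an interface that requires a swap method as well as compare; swap simply exchanges two 16-bit selection-vector entries. Under the same assumptions as the compare above (addrSv caches the SV2 buffer address), a direct-memory swap might look like:

    public void swap(int sv0, int sv1) {
      short tmp = PlatformDependent.getShort(addrSv + (sv0 << 1));
      PlatformDependent.putShort(addrSv + (sv0 << 1),
          PlatformDependent.getShort(addrSv + (sv1 << 1)));
      PlatformDependent.putShort(addrSv + (sv1 << 1), tmp);
    }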

Results

We've now removed seven more function calls (three in each getIndex() chain, plus the doEval() call itself), for a total savings of 13 function calls per compare. Not bad.

Run time decreases from ~9700 ms to ~8200 ms, or about 15%. But if we measure just the time in the in-memory sorter, it decreases from ~3200 ms to ~1800 ms, or about 44%. This is a significant savings.

Merge Sorter

The other main component of an in-memory sort is the merge pass, which successively merges pairs of runs until a single, final run remains. We can apply the same step-by-step analysis as above to convert a prototype implementation to work directly with memory buffers. That work is too extensive to show in full here; a sketch of the inner comparison appears below.
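To give the flavor: the in-memory merge works through a four-byte selection vector (an "SV4") whose entries identify both a batch and a row within that batch. Here is a hedged sketch of the direct-memory compare, assuming the usual encoding of batch index in the upper two bytes and row offset in the lower two; the addrs array of cached per-batch buffer addresses is an assumption of this sketch:

    public int compare(int leftIndex, int rightIndex) {
      int sv1 = PlatformDependent.getInt(addrSv4 + (leftIndex << 2));
      int sv2 = PlatformDependent.getInt(addrSv4 + (rightIndex << 2));
      long leftAddr = addrs[sv1 >>> 16];    // batch index selects the data buffer
      long rightAddr = addrs[sv2 >>> 16];
      int left = PlatformDependent.getInt(leftAddr + ((sv1 & 0xFFFF) << 2));
      int right = PlatformDependent.getInt(rightAddr + ((sv2 & 0xFFFF) << 2));
      return left < right ? -1 : (left == right ? 0 : 1);
    }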

The results are significant. Total run time drops from ~8200 ms to ~4700 ms, or about 30%. Time for just the in-memory merge drops from ~5300 ms to ~1900 ms, or 64%. (That is, the new MSorter time is just 36% of the original. Said another way, this is a speed-up of almost 3x.)

The output of a typical run is as follows:

Per-batch sort time: 1797
MSort time: 1806
Sorted 10,000,000 records in 153 batches, elapsed: 4641 ms

Op: 0 Screen
  Process: 10 - 0%, 0%
  Wait:    13
Op: 1 Project
  Process: 3 - 0%, 0%
Op: 2 SelectionVectorRemover
  Process: 830 - 18%, 18%
Op: 3 Sort
  Process: 3612 - 78%, 78%
  Spills: 0
  Memory: 118
Op: 4 Scan
  Process: 142 - 3%, 3%
Total:
  Setup:   1
  Process: 4597

We see that the sort still takes the majority of the time, but its share dropped from 89% to 78%.

No Need to Go Further

The optimizations above use the Netty PlatformDependent class. The functions in this class wrap another class that is itself a wrapper around the sun.misc.Unsafe class. Perhaps we can optimize further by working directly with Unsafe? Experiments, however, showed no benefit at all on the 10-million-integer query when calls to PlatformDependent were replaced with equivalent calls to Unsafe. This is just as well: if Unsafe were faster, the code generator would have to replicate the platform-specific decisions that Netty already makes to use Unsafe correctly.
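For the record, the Unsafe variant that was tried looks roughly like this (a sketch; sun.misc.Unsafe must be obtained reflectively because Unsafe.getUnsafe() rejects application code):

    import java.lang.reflect.Field;
    import sun.misc.Unsafe;

    public class UnsafeAccess {
      static final Unsafe UNSAFE;
      static {
        try {
          Field f = Unsafe.class.getDeclaredField("theUnsafe");
          f.setAccessible(true);
          UNSAFE = (Unsafe) f.get(null);
        } catch (ReflectiveOperationException e) {
          throw new ExceptionInInitializerError(e);
        }
      }
    }

    // In compare(), this:
    //   int left = UNSAFE.getInt(addr0 + (sv1 << 2));
    // ran no faster than this:
    //   int left = PlatformDependent.getInt(addr0 + (sv1 << 2));
    // presumably because the JVM inlines the PlatformDependent wrapper away.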

Summary

Sorting is one of the fundamental operations in a query engine (the other being hashing.) The Drill external sort, when operating in memory, spends most of its run time in the compare and merge methods of generated code. By converting that code to work directly with memory buffers (avoiding the three levels of indirection from value vector to accessor to DrillBuf to memory), total run time for a 10-million-integer data set drops from ~9700 ms to ~4700 ms, a savings of 52%. That is, sort performance doubles.

The experiments were done using a prototype version of query-specific code created by hand for this one query. The next step would be to modify the code generation mechanism to emit code of this new form for all queries.
