-
Notifications
You must be signed in to change notification settings - Fork 980
Performance Basics
Drill performance work consists of focusing on one particular area, discovering current performance, and seeking ways to make that particular operation or query run faster. The "school of hard knocks" has taught that the unwary novice Drill developer can spend time "discovering" that Drill runs slow if you set the wrong options. This page describes those options, when to use them, and how to turn them off when doing performance runs.
Drill has a very complex memory allocator to manage direct memory. Clearly, those who wrestled this part of Drill into perfection found the need to carefully monitor each allocation. This shows up in the effect of enabling assertions (the -ea option the Java command) when running Drill. In general, assertions are very handy to include debugging checks that Java will optimize out at run-time. Drill, unfortunately, uses the -ea option as an indication to turn on detailed memory allocation logging. This logging has the effect of adding 10x to 20x to the run time of each query.
Be sure to turn on assertions when changing Drill to catch memory issues early. (Assertions are enabled for Drill unit tests run through Maven, for example.) But, be sure assertions are off when doing performance runs. Assertions will be off if you use the out-of-the-box Drill launch scripts (those in $DRILL_HOME/bin
) but may be on if you run Drill through your favorite IDE.
Note that DRILL-5122 is in progress to address this issue.
Drill provides comprehensive logging via the Logback Classic logging system. Like most loggers, Logback provides a variety of logging levels. At the default logging levels, logging will consume much of the run time of a query.
Before doing any performance runs, change the log level to error
. If running Drill using the out-of-the-box scripts, look for $DRILL_HOME/conf/logback.xml
(or $DRILL_SITE/logback.xml
). Change these lines:
<logger name="org.apache.drill" additivity="false">
<level value="error" />
<appender-ref ref="FILE" />
</logger>
<logger name="org.apache.drill" additivity="false">
<level value="error" />
<appender-ref ref="SOCKET" />
</logger>
That is, change the levels from the default of info
and debug
to error
. Note that all log levels must be at error
. If any logger (even one you are not using, such as the SOCKET
logger) is at debug
, you will pay the debug logging cost.
If you run Drill from an IDE, beware of the multiple (or missing) Logback configuration files. ([DRILL-5114)(https://issues.apache.org/jira/browse/DRILL-5114).
Depending on your focus, it is helpful to limit Drill's parallelism in order to force Drill to run a query on a single Drillbit in a single fragment. Doing so focusses efforts on the performance of a specific set of operators, if that is what you want to optimize.
ALTER SESSION SET `planner.width.max_per_node` = 1;
This can also be done via a data source that supports just one fragment, such as the mock data source described below.
The Drill query profile contains a wealth of information. While the Drill UI is a great start, and is very good for visualizing the query, there is no substitute for looking at the query profile itself. Best to look at the JSON form. Use a JSON formatter web site or program to turn Drill's JSON into something you can read. Or, create a simple program to grab out the important information such as timings. For example:
Op: 0 Screen
Setup: 0 - 0%, 0%
Process: 3 - 0%, 0%
Wait: 4
Op: 1 Project
Setup: 1 - 100%, 0%
Process: 1 - 0%, 0%
Op: 2 SelectionVectorRemover
Setup: 0 - 0%, 0%
Process: 762 - 12%, 12%
Op: 3 Sort
Setup: 0 - 0%, 0%
Process: 5016 - 84%, 84%
Spills: 0
Memory: 118
Op: 4 Scan
Setup: 0 - 0%, 0%
Process: 143 - 2%, 2%
Total:
Setup: 1
Process: 5925
The above tell us that the sort takes 5016 ms for this query which is 84% of the run time. This tells us exactly where to focus our efforts.
The profile contains various metrics (encoded, unfortunately as numbers) that give insight into the operation of each operator. For example, the "Spills" number in the sort operator is a numbered metric (number 0 in this case.)
Often performance work requires testing, changing something, and testing again. This cycle is much easier if done from within an IDE using a test framework. Drill offers several (such as BaseTestQuery
). DRILL-5126 describes a simple, unified framework that makes it very easy to configure an embedded Drillbit with a known set of configuration and session options, run a query, and get the results in various forms, including simply timing and counting the results. For example:
FixtureBuilder builder = ClusterFixture.builder()
.configProperty(ExecConstants.SYS_STORE_PROVIDER_LOCAL_ENABLE_WRITE, true)
.configProperty(ExecConstants.REMOVER_ENABLE_GENERIC_COPIER, true)
.configProperty(ClassBuilder.SAVE_CODE_OPTION, true)
.sessionOption(ExecConstants.MAX_QUERY_MEMORY_PER_NODE_KEY, 3L * 1024 * 1024 * 1024)
.maxParallelization(1)
;
try (ClusterFixture cluster = builder.build();
ClientFixture client = cluster.clientFixture()) {
String sql = "SELECT id_i FROM `mock`.employee_10M ORDER BY id_i";
QuerySummary summary = client.queryBuilder().sql(sql).run();
System.out.println(String.format("Sorted %,d records in %d batches.", summary.recordCount(), summary.batchCount()));
System.out.println(String.format("Query Id: %s, elapsed: %d ms", summary.queryIdString(), summary.runTimeMs()));
}
}
Running an embedded Drillbit in this way provides very fast turn-around: no need to set up a Drill install directory, edit config files and so on. The IDE does the work of rebuilding (quickly) prior to each run. Of course, this works only if you are focused on specific internal bits of Drill; it is not as useful if you are interested in the distributed aspects of Drill.
Drill provides a mock data source that has historically only been available when used in a physical plan. (See TestSimpleExternalSort
.) DRILL-5152 describes an enhancement to give access to the mock data source from SQL. For example, to generate a table of 10 million rows, each with a single int column with random values:
SELECT id_i FROM `mock`.employee_10M ORDER BY id_i