
DagLibraries

Steve Wardle edited this page Apr 7, 2020 · 1 revision

Since at its heart the task performed by the build system can naturally be represented by a DAG (Directed Acyclic Graph), we may be able to make use of an existing module/library for constructing and executing a DAG. These types of tools are somewhat interchangeably referred to as "DAG Libraries", "Work/Task Schedulers", "Pipeline tool-kits", "ETL Work-flow Managers" (Extract, transform, load) and various other permutations and variations of similar or related words. The overlap in terms seems to come from the fact that a DAG is a good model for most logical work-flows where a series of tasks needs to be executed.

Spoiled for choice

Even the most cursory bit of online research reveals a surprising number of these types of libraries. It makes sense that such a common type of problem would be relatively mature and well covered, but one quickly realises that almost every reasonably large online service seems to have written its own DAG library:

  • Airflow (AirBnb)
  • Pinball (Pinterest)
  • Luigi (Spotify)
  • Suro (Netflix)
  • Cadence (Uber)
  • Azkaban (LinkedIn)
  • MaaT (Alibaba)

The above isn't even close to an exhaustive list. Once you also include the numerous other efforts by smaller scale teams and individuals it is really rather shocking how many implementations there are. Fortunately we can trim down this pool significantly because we know our intention is to write Fab in Python. Restricting ourselves to only Python, and filtering out the various smaller-scale projects (most of the small projects are so simple that it doesn't seem worth the effort of introducing the dependency) leaves us with three likely contenders:

  • Luigi (Spotify)
  • Airflow (AirBnb)
  • Dask

Next we looked at each of these in a little more detail, to see how well each would fit the task at hand.

Dask

We considered Dask not because it showed up in our research above, but because some other projects at the Met Office are already using it, making it conveniently available and possibly easier to work with owing to overlapping knowledge with colleagues. At the time of considering Dask it was relatively unknown (to us), but having started looking into it we quickly realised that the DAG-related functionality within Dask is not the main focus of the library. The Dask authors themselves mention that there are two broad use cases for Dask.

Examining the way in which the DAG is specified suggested it may be rather unwieldy for the large set of tasks we expect the build system to require. Dask essentially uses a series of nested dictionaries to define the DAG (see this simple example or this more complicated example).
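
To give a flavour of the format, here is a toy sketch in plain Python (it does not use Dask itself, and the file names and callables are invented for illustration). Dask represents a graph as a dictionary mapping keys to either literal values or task tuples of the form (callable, *arguments), where string arguments refer back to other keys; a few lines are enough to mimic how such a graph is evaluated.

```python
# A toy executor for a Dask-style task graph, written in plain Python so
# it needs no dependencies.  Dask uses the same dictionary format: each
# key maps either to a literal value or to a tuple of
# (callable, *arguments), where arguments naming other keys are
# evaluated first.

def toy_get(graph, key):
    """Recursively evaluate `key` in a Dask-style graph dictionary."""
    value = graph[key]
    if isinstance(value, tuple):
        func, *args = value
        # Resolve each argument: keys are looked up recursively,
        # anything else is passed through as a literal.
        resolved = [toy_get(graph, a) if a in graph else a for a in args]
        return func(*resolved)
    return value

# Invented example: "b.o" depends on "a.o", much as compiled Fortran
# objects depend on one another.
graph = {
    "a.f90": "source of a",
    "a.o": (lambda src: src + " -> compiled", "a.f90"),
    "b.o": (lambda dep: "compiled b using " + dep, "a.o"),
}

print(toy_get(graph, "b.o"))
```

Even in this tiny case the nesting is visible; for thousands of compilation tasks the dictionaries would have to be generated programmatically, which is part of why the format felt unwieldy.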

After some deliberation we concluded that we probably don't want to depend on such a weighty library only to use a small subset of its features, and we weren't convinced that Dask's scheduling would be suitable for Fab.

Luigi

Unlike Dask, Luigi really is a specific DAG library. It has a series of classes that represent "tasks" it may perform (transforms in Fab's terminology) and "targets" (artifacts for Fab). The syntax seems clear and looks like it would be able to do what we need it to; this example gives a good overview. Along with the means to run the graph, it comes with nice built-in visualisation tools that allow the graph and the status of the tasks to be monitored in a web-browser based GUI. The developers also provide a list of what they consider the limitations of Luigi. Reading these, the only point of concern is their worry about scaling beyond tens of thousands of jobs; it's likely that we would need a high number of jobs when dealing with models on the scale of the UM/LFRic, so this is something we would have to monitor carefully.

Since Luigi seemed like it would be a good candidate, we decided to attempt to implement a very simple example to better understand how it might fit what we want to do. We used a very simple test case: a pair of Fortran files containing a "hello world" program with a dependency between them, along with an attempt to generate and execute the DAG with Luigi. You can find this example in the "Experimental/DAG" directory of the main repository; it is named luigi_test.py. The code of the example contains many explanatory comments, which we recommend reading through before proceeding below.

In writing the example, it took a little time to understand the best way of working with Luigi. Initially some attempts were made to pass information about the dependencies via the arguments to the Task instances, but this didn't work well at all because of the way instances are created in multiple places (having to "pass down" the dependency information through several levels of dependent Tasks quickly becomes cumbersome). Looking at some examples of how Luigi is used within Spotify showed that typically the operations it is used for are date/time based (e.g. "Update the most-played artists list for today's date") and as a result the constructor Parameter passed to each Task used to perform this operation was the date.

After thinking about this some more, two fairly fundamental problems emerged when trying to apply this model to our imagined system in Fab. For one thing, the dependencies between tasks seem to be assumed to be process-flow type dependencies that are effectively known and fixed for every "cycle" or "run" of the DAG (as in the example; to calculate most-played artists maybe you have to extract the play counts from a database, then order the list, then format the results to be presented to users). It's certainly possible to do this for a build system, but especially once the project being built becomes large this would require significant calculation and processing up-front, followed by a fairly lengthy process to form the structures required by Luigi.
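
To make that up-front cost concrete, here is a sketch of the kind of whole-project scan that would be needed before the graph could even be handed over (the file contents are invented, and the parsing is deliberately simplistic): every source file must be read and its `use` statements extracted before any task can run.

```python
import re

# Invented in-memory "project"; a real build system would read these
# files from disk.  Each Fortran file's dependencies come from its
# `use` lines.
sources = {
    "constants.f90": "module constants\nend module",
    "physics.f90": "module physics\nuse constants\nend module",
    "main.f90": "program main\nuse physics\nend program",
}

USE_RE = re.compile(r"^\s*use\s+(\w+)", re.IGNORECASE | re.MULTILINE)

def scan_dependencies(sources):
    """Map each file to the files defining the modules it uses.
    Simplistic assumption: module `foo` lives in `foo.f90`."""
    deps = {}
    for name, text in sources.items():
        deps[name] = [mod + ".f90" for mod in USE_RE.findall(text)]
    return deps

print(scan_dependencies(sources))
```

For a project the size of the UM this scan covers thousands of files, and all of it has to finish before the first Luigi task could be constructed.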

The second awkward detail is that, as you can see in the example, the entire DAG has to be defined prior to running and cannot really be altered during the run (it can, sort of: a task can forcibly run an extra task within its "run" method, but it isn't particularly elegant). In our requirements we talked about how it's possible that some transforms may need to be performed before the dependency analysis (and it's also possible that analysing dependencies may itself become a transform someday). This doesn't lend itself very well to an implementation where everything is determined ahead of time.

Airflow

Apache Airflow is another popular DAG library, with a fairly similar set of features to Luigi but a different implementation. Based on reading the examples and documentation, it seems as though Airflow would present the same broad set of difficulties for use as part of Fab. However, in case it is illuminating, we will still attempt to implement the same example case using Airflow (this is not currently complete but will follow).

Conclusions

The main thing we have learned through this exercise is that most (if not all) of the DAG library implementations already available don't quite fit what we are trying to do with Fab. There's a heavy dose of irony in the fact that we started this piece of investigation by commenting on the huge proliferation of different libraries, but are now effectively suggesting writing another new implementation. Perhaps this goes some way to explaining why there are so many; it doesn't take much for a specific case to require something outside of the existing capabilities.

In our case we have discovered that to provide the level of flexibility we want (in particular with on-the-fly dependency analysis as an intermediate step in the process chain), perhaps the entire idea of a "DAG library" is not a good fit. Certainly it is still the case that the build is representable as a DAG, but the method for both constructing and "solving" it may be quite different from the approaches seen here. Our thoughts when discussing this issue further were that perhaps Fab would be more suited to an approach where tasks are submitted via a queue/stack as needed. If we could write a set of methods where the dependency tree/graph is traversed as part of the running work-flow, rather than requiring construction of the entire DAG up-front, we would have the required flexibility as well as perhaps some good options for optimisations. Doing this wouldn't preclude us from visualising the graph either; we would just have to do it at the end of the run (or visualise the "current" graph at a point during the run).
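
One possible shape for such an approach, purely as a sketch of the idea (the file names and the pre-baked dependency table are invented; in the real system dependencies would be discovered by running transforms): tasks are pulled from a queue, and processing one task may discover dependencies and push further tasks, so the graph is never constructed up-front.

```python
from collections import deque

# Sketch of a queue-driven build: dependencies are discovered *while*
# tasks are processed, rather than being wired into a complete DAG
# up-front.  This table is invented for illustration; a real run would
# discover these edges by analysing each file as it is reached.
dependencies = {
    "main.f90": ["physics.f90"],
    "physics.f90": ["constants.f90"],
    "constants.f90": [],
}

def build(root):
    queue = deque([root])
    done = []
    while queue:
        task = queue.popleft()
        missing = [d for d in dependencies[task] if d not in done]
        if missing:
            # Dependencies discovered on the fly: queue them first and
            # revisit this task once they have been built.
            queue.extend(m for m in missing if m not in queue)
            queue.append(task)
        else:
            done.append(task)  # a real transform would compile here
    return done

print(build("main.f90"))
```

Because the queue is just a list of outstanding work, a later dependency-analysis transform could push new tasks mid-run, and the "current" graph could be snapshotted for visualisation at any point.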
