workflow
Workflow dichotomy (having a graph will help you express parallel execution):
- Makefile-style dependency (reverse dependency graph)
  - Know your endpoint but not your beginning
  - Strong idempotency
  - Doesn't like to be dependent on data
- Forward-facing workflow graph
  - Know your beginning but not your endpoint (may have many choices)
  - Know your direction/velocity
For any defined abstraction layer:
- Only important that the contract is adhered to
- No implication that there are lower level abstraction layers
- May show a forward-looking vision of elegant lower level abstractions
chain :twitter_parse do
  wukong_rb 'parse_api.rb'
  pig       'uniq_and_unsplice.pig'
end
Wukong.workflow(:launch) do
  task :aim do
    # ...
  end
  task :enter do
  end
  task :commit do
    # ...
  end
end

Wukong.workflow(:recall) do
  task :smash_with_rock do
    # ...
  end
  task :reprogram do
    # ...
  end
end
Wukong workflows work somewhat differently than tools you may be familiar with, such as Rake.
In wukong, a stage corresponds to a resource; you can then act on that resource.
Consider first compiling a C program:
- to build the executable, run `cc -o cake eggs.o milk.o flour.o sugar.o -I./include -L./lib`
- to build files like `{file}.o`, run `cc -c -o {file}.o {file}.c -I./include`
In this case, you define the steps, implying the resources.
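The Make-style rules above can be written in Rake, the tool this contrasts with; this is a minimal illustrative sketch, not part of Wukong:

```ruby
# Rake sketch of the Make-style rules above (illustrative; not Wukong code).
# `file` declares a resource and its dependencies; `rule` teaches Rake how
# to build any '.o' from the matching '.c'. You define steps; the resources
# are implied by the task names.
require 'rake'
include Rake::DSL

file 'cake' => %w[eggs.o milk.o flour.o sugar.o] do |t|
  sh "cc -o #{t.name} #{t.prerequisites.join(' ')} -I./include -L./lib"
end

rule '.o' => '.c' do |t|
  sh "cc -c -o #{t.name} #{t.source} -I./include"
end
```

Running `rake cake` would walk the reverse dependency graph: each missing `.o` is built from its `.c` first, then the executable is linked.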
Something rake can't do (but we should be able to): make it so I can define a dependency that runs last
A run
is the event that ensues when you invoke a workflow. Invoking the `bake_pie` workflow at 01:20:55 on Jan 30, 2012 results in the `bake_pie-20120130012055` run.
A stage
is a data process having:
- one input: an array of length one called `inputs`. (later: multiple inputs, named inputs)
- one output, called `output` (later: multiple outputs, named outputs)
- (later) an error channel named `:error`.
Any stage can be invoked by name; only that stage is executed.
A chain
runs a sequence of stages, one after the other, in order. A chain is itself a stage; it has an array of sub-stages (called `steps`) that it will execute in order.
- the input to the chain becomes the input to the first stage, and the output of the last stage becomes the output of the chain.
You can of course invoke either the chain or one of its steps.
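The piping rule above can be sketched in plain Ruby; the `Chain` class here is illustrative, not Wukong's actual implementation:

```ruby
# Plain-Ruby sketch of chain semantics (illustrative; not Wukong's API):
# each stage is a callable, and the chain threads the output of each
# stage into the input of the next.
class Chain
  attr_reader :steps

  def initialize(*steps)
    @steps = steps
  end

  # The chain's input feeds the first stage; the last stage's output
  # becomes the chain's output.
  def call(input)
    steps.reduce(input) { |data, stage| stage.call(data) }
  end
end

parse = ->(lines) { lines.map(&:strip) }
uniq  = ->(lines) { lines.uniq }
Chain.new(parse, uniq).call([' a ', 'b', ' a '])  # => ["a", "b"]
```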
A shell_process
invokes the swineherd runner. It has:
- a hash of config variables
- (?ordered?) inputs
- one output, named `:output`, and an error channel named `:error`
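A shell-backed stage with one output and an error channel might look like the following; `ShellProcess` and its interface here are my invention for illustration, not swineherd's actual runner:

```ruby
require 'open3'

# Illustrative sketch (not swineherd's actual runner): a stage that shells
# out, exposing stdout as its single :output and stderr as the error channel.
class ShellProcess
  def initialize(cmd, config = {})
    @cmd    = cmd
    @config = config  # hash of config variables
  end

  # inputs: ordered list of arguments; returns the stage's output
  def call(inputs)
    out, err, status = Open3.capture3(@cmd, *inputs.map(&:to_s))
    raise "stage failed: #{err}" unless status.success?
    out
  end
end

ShellProcess.new('echo').call(['hello']).strip  # => "hello"
```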
By default, a stage’s inputs are specified by the outputs of its dependencies.
The output asset names are constructed from the stage's metadata. There is a small set of pathname templates (in fact, only one):
- Development mode output pathname template, built somehow from: %{user}, %{run_id}, %{session}, %{run_index}, %{prod|dev|test}
  (?implement a template that you think works; those are some possible ingredients we'll codify &/or fix?)
- (later) Automated mode output pathname template (used when deployment class is `prod` and `test`): `/%{project_path}/%{run_id}/%{transformed_stage_name}-%{deployment_class}`
  (just implement something sensible; we'll figure out the details)
  somehow: %{user}, %{session}, %{run_index}, %{prod|dev|test}, %{timestamp}
- project_path: a container for runs for the same purpose/project
- session: a temporally close, connected set of runs
- run_index: an auto-incremented counter for the runs
- deployment_class: the type of deployment instantiation. These may be used for more than one granularity of sets of runs.
- run_id: the time the run started, plus some other information to uniquely identify this specific invocation of the workflow. (?complete as you find natural?)
- timestamp: timestamp of the run; everything in this invocation will have the same timestamp.
- user: username; `ENV['USER']` by default
- sources: basenames of job inputs, minus extension, non-`\w` characters replaced with '_', joined by '-', max 50 chars.
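The `sources` rule above can be sketched as follows; `sources_slug` is my own implementation of the stated rule, not Wukong's code:

```ruby
# Sketch of the `sources` naming rule (illustrative implementation):
# basenames minus extension, non-word characters replaced with '_',
# joined by '-', truncated to 50 characters.
def sources_slug(inputs, max_len = 50)
  inputs.map { |path| File.basename(path, File.extname(path)).gsub(/\W/, '_') }
        .join('-')[0, max_len]
end

sources_slug(['/data/tweets 2012.json', '/data/users.tsv'])  # => "tweets_2012-users"
```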
Normally, one should not rename inputs and outputs. However, there are some (hopefully rare) cases where they may be renamed. Example cases include:
You can override the default input name to adapt to external processes:
- (show how)
- (make sure I can still inject an explicit name at execution time)
You can also inject an explicit name:
- (show how)
...
- handled by configliere: `nukes launch --launch_code=GLG20`
- TODO: configliere needs context-specific config vars, so I only get information about the `launch` action in the `nukes` job when I run `nukes launch --help`
- when files are generated or removed, relocate them to a timestamped location
- a file `/path/to/file.txt` is relocated to `~/.wukong/backups/path/to/file.txt.wukong-20110102120011`, where `20110102120011` is the job timestamp
- accepts a `max_size` param
- raises if it can't write to the directory -- must explicitly say `--safe_file_ops=false`
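The relocation rule can be sketched as follows; `backup_path` is an illustrative name and implementation, not Wukong's code:

```ruby
# Sketch of the safe-file-ops relocation rule above (illustrative, not
# Wukong's code): mirror the file's absolute path under ~/.wukong/backups
# and append the job timestamp.
def backup_path(path, timestamp)
  File.join(Dir.home, '.wukong', 'backups',
            "#{path.sub(%r{\A/}, '')}.wukong-#{timestamp}")
end

backup_path('/path/to/file.txt', '20110102120011')
# ends with ".wukong/backups/path/to/file.txt.wukong-20110102120011"
```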
each action
- the default action is `call`
- all stages respond to `nothing`, and, like ze goggles, do nothing.
- `clobber` -- run, but clear all dependencies
- `undo` --
- `clean` --
The primitives correspond closely with those of Rake and Chef. However, they extend them in many ways, fail to cover all of their functionality, and are incompatible in several ways.
The concrete swineherd runnables each have eponymous stage names.
Any simple Rake task should work as a swineherd flow:
- task
- desc
- namespace
- ... (flesh out)