WIP: a new serialization format for optimizing & executing lowered IR #309

Open · wants to merge 3 commits into base: master

Conversation

@timholy (Member) commented Jul 29, 2019

This is a draft of a new tokenization of lowered IR. The main goal is to split off from the representation currently used in Base so that we are free to perform more significant transformations of the IR to enable more performance optimizations.

This PR supports serialization, printing, and very limited execution. Lots is manual, and lots is broken, but I wanted to post it to collect early feedback. (This is built on top of #307 so this PR contains lots of irrelevant changes. To start with you should focus just on two files, src/serializer.jl and the demo in test/serialization.jl, or just look at the commits individually.)

A demo

In test/ there's a demo script, serialization.jl. If you run it you get this output:

  1: call zero methlist listargs <vec of length 1> loadparameter 1 
 32: storeslot 3 
 41: loadslot 2 
 50: storessa 2 
 59: call iterate methlist listargs <vec of length 1> loadssa 2 
 90: storeslot 4 
 99: callbuiltin === fixedargs 2 loadslot 4 nothingtok 
116: storessa 4 
125: callintrinsic xor_int listargs <vec of length 1> loadssa 4 
148: storessa 5 
157: gotoifnot loadssa 5 444
175: loadslot 4 
184: storessa 7 
193: callbuiltin getfield fixedargs 2 loadssa 7 int 1 
218: storeslot 5 
227: callbuiltin getfield fixedargs 2 loadssa 7 int 2 
252: storessa 9 
261: call + methlist listargs <vec of length 2> loadslot 3 loadslot 5 
301: storeslot 3 
310: call iterate methlist listargs <vec of length 2> loadssa 2 loadssa 9 
350: storeslot 4 
359: callbuiltin === fixedargs 2 loadslot 4 nothingtok 
376: storessa 12 
385: callintrinsic xor_int listargs <vec of length 1> loadssa 12 
408: storessa 13 
417: gotoifnot loadssa 13 444
435: goto 175
444: returntok loadslot 3 
99

The trailing 99 comes from the last few lines of the script, which step the machine forward until execution reaches the line beginning 99: callbuiltin... (builtins are not yet supported in execution). All of this is generated from the lowered code for summer, which looks like this:

julia> src
CodeInfo(
1 ─       s = Main.zero($(Expr(:static_parameter, 1)))
│   %2  = A
│         @_4 = Base.iterate(%2)
│   %4  = @_4 === nothing
│   %5  = Base.not_int(%4)
└──       goto #4 if not %5
2 ┄ %7  = @_4
│         a = Core.getfield(%7, 1)
│   %9  = Core.getfield(%7, 2)
│         s = s + a
│         @_4 = Base.iterate(%2, %9)
│   %12 = @_4 === nothing
│   %13 = Base.not_int(%12)
└──       goto #4 if not %13
3 ─       goto #2
4 ┄       return s
)

From this you should be able to learn a lot about the serialization format I've designed, but for the benefit of all I've reproduced below the extended comments that appear at the beginning of src/serializer.jl:

A brief description of the serialization format

This uses a simple format, conceptually implementing a machine with the following
properties:

  • a tape of instructions called ser
  • a single implicit "register" called ans
  • the ability to execute operations specific to a particular instruction token
    in ser. Executing these operations may consume (nondestructively) future tokens.

Operations are conceptually of 4 categories:

  • loads (which fill ans from a variety of sources), encoded by load* or literal tokens
  • stores (which put ans somewhere more permanent), encoded by store* tokens
  • calls (for which the return value is stored in ans)
  • control-flow

The implementation of calls is allowed to store data to named local variables or lists,
thus increasing the temporary storage beyond ans.
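To make the machine concrete, here is a minimal sketch (in Python, purely for illustration): a flat token tape ser, a single implicit ans register, and operations that consume future tokens as operands. The token names follow the vocabulary above, but this toy loop is not the actual serializer.

```python
def run(ser, slots):
    """Execute a toy token tape; returns (ans, ssavalues)."""
    ans = None
    ssavalues = {}
    pc = 0
    while pc < len(ser):
        tok = ser[pc]
        if tok == "loadslot":        # load: fill ans from a slot
            ans = slots[ser[pc + 1]]
            pc += 2
        elif tok == "int":           # literal encoded in ser itself
            ans = ser[pc + 1]
            pc += 2
        elif tok == "storessa":      # store: put ans somewhere permanent
            ssavalues[ser[pc + 1]] = ans
            pc += 2
        elif tok == "storeslot":
            slots[ser[pc + 1]] = ans
            pc += 2
        else:
            raise ValueError(f"unknown token {tok!r}")
    return ans, ssavalues

# Load slot 1 into ans, store ans into %3, then load the literal 7.
ans, ssa = run(["loadslot", 1, "storessa", 3, "int", 7], {1: 42})
```

Note that calls and control flow are elided here; they fit the same pattern of one instruction token followed by operand tokens.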

The serialization of the lowered IR

   %4 = atan(@3, 2.4)

might look something like this on the tape:

   call atan_idx methlist fixedargs 2 loadslot 3 float64 2.4
   storessa 4

where

  • call is an instruction token signaling that next operation is a function call
    (in reality, there are multiple call-type tokens for intrinsics, builtins,
    generics via the interpreter, generics via ordinary compiled dispatch,
    invokelatest, Core._apply, etc.)
  • atan_idx is a token representing atan. It is encoded as an integer index
    into a table of functions (the table is maintained by the serializer)
  • methlist is a pointer to a local method table, a performance optimization
    used for avoiding full-blown dispatch (this also stores whether the method should
    be called via the interpreter or the compiled path)
  • fixedargs 2 is an indication that this call should use the path optimized for a
    particular (small) number of arguments, which in this case is 2.
    An alternative is listargs args, which packs an arbitrary number of arguments
    into a literally-encoded args::Vector{Any} stored (via its pointer) in ser.
    Contrary to this (fictitious) example, currently only builtins exploit fixedargs
    since they are the only ones that will typically need runtime dispatch.
  • loadslot 3 indicates that the next argument (first argument) is to be loaded
    from the slots at index 3
  • float64 2.4 indicates that the next argument (second argument) is a literal
    value of type Float64 encoded in ser itself.
  • after the second argument, the function call is executed and the result is
    stored in ans
  • storessa 4 indicates that ans should be placed in %4.
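Putting the pieces together, a hypothetical decoder for the call/storessa sequence above might look like this (Python used for illustration; ftable stands in for the serializer's function table, and the methlist entry is treated as an opaque placeholder rather than a real method-table pointer):

```python
import math

def exec_call(ser, pc, ftable, slots, ssavalues):
    """Decode and execute one `call ... fixedargs n ...` sequence."""
    assert ser[pc] == "call"
    f = ftable[ser[pc + 1]]            # atan_idx -> the callee
    assert ser[pc + 2] == "methlist"   # placeholder for the local method table
    assert ser[pc + 3] == "fixedargs"
    nargs = ser[pc + 4]
    pc += 5
    args = []
    for _ in range(nargs):             # consume one load/literal per argument
        tok = ser[pc]
        if tok == "loadslot":
            args.append(slots[ser[pc + 1]])
        elif tok == "float64":
            args.append(ser[pc + 1])
        pc += 2
    ans = f(*args)                     # the call fires; result lands in ans
    if ser[pc] == "storessa":
        ssavalues[ser[pc + 1]] = ans   # ans placed in %4
        pc += 2
    return ans, pc

# The tape for `%4 = atan(@3, 2.4)` from the example above
ser = ["call", 0, "methlist", "fixedargs", 2,
       "loadslot", 3, "float64", 2.4, "storessa", 4]
ssa = {}
ans, pc = exec_call(ser, 0, [math.atan2], {3: 1.0}, ssa)
```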

Call sites that use listargs currently have their own private args vector sized appropriately for that particular call site. Having one per site is almost certainly overkill. A better format would be to have framecode construction figure out what sizes the method needs and then have the framedata allocate a pool of different sizes. This would support multithreaded interpretation while also decreasing the total storage size. This would probably be one of the most urgent changes to make.
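The pooled layout might be sketched as follows (Python, with hypothetical names ArgsPool and acquire): framecode construction collects the argument-list sizes the method needs, and the framedata preallocates one reusable buffer per distinct size:

```python
class ArgsPool:
    """One reusable args buffer per distinct argument-list size."""
    def __init__(self, sizes_needed):
        # sizes_needed is collected at framecode construction time
        self.buffers = {n: [None] * n for n in set(sizes_needed)}

    def acquire(self, n):
        # reuse the preallocated buffer; no per-call-site allocation
        return self.buffers[n]

pool = ArgsPool([1, 2, 2, 1])   # call sites needing 1 and 2 arguments
args = pool.acquire(2)
args[0], args[1] = "x", "y"
```

For multithreaded interpretation each frame (or thread) would own its own pool, so concurrent calls never share a buffer.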

Potential performance benefits

Don't even think about timing things yet, since it's not finished enough to be meaningful. We need to support builtins/intrinsics and implement recursive calls.

But the serialization format already has some potential performance benefits. For example, one of our hottest methods is maybe_evaluate_builtin, which gets called even when f is not a builtin. With the new format, we decide "at compile time" (framecode construction time) whether f is a builtin or something else and then "dispatch" (a big if/then block) to the appropriate method.
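The idea can be sketched as follows (Python, with a stand-in BUILTINS set): classify runs once, at framecode construction time, and the per-call runtime check disappears in favor of a branch on the precomputed token:

```python
BUILTINS = {"===", "getfield"}   # stand-ins for the Core builtins

def classify(fname):
    """Run once when the framecode is built, not on every iteration."""
    return "callbuiltin" if fname in BUILTINS else "call"

def execute(token, f, args):
    """A big if/then block replaces a maybe_evaluate_builtin-style check."""
    if token == "callbuiltin":
        if f == "===":
            return args[0] is args[1]
        if f == "getfield":
            return getattr(args[0], args[1])
        raise ValueError(f"unsupported builtin {f!r}")
    return f(*args)   # generic path (real dispatch elided)

tok = classify("===")             # decided "at compile time"
```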

In the longer run, as mentioned above we may be able to perform optimizations that would be incompatible with the tools we rely on for handling lowered IR. For example, the section

175: loadslot 4 
184: storessa 7 
193: callbuiltin getfield fixedargs 2 loadssa 7 int 1 
218: storeslot 5 
227: callbuiltin getfield fixedargs 2 loadssa 7 int 2 
252: storessa 9 

might be simplified using a couple of new tokens as something like the following:

175: loadslot 4 
184: storevec storessa 7 getfieldtok 1 storeslot 5 getfieldtok 2 storessa 9

and we might be able to implement an optimized method that avoids having to create new frames in the default cases.

I don't know exactly where we want to head with this, but this might at least illustrate some possibilities.

The future

Just getting this far required that I put slightly more time into this than I can afford, so consider this post to be an invitation for others to run with it if interested. I've named the branch serialization rather than teh/serialization to explicitly disavow ownership. I'm well aware others will have their own agenda too, so if no one grabs it then it can just wait until I have time for it again. But that could be a while.

It's also worth noting that JeffB mentioned at JuliaCon that he had been wondering about doing something similar, so at some point (once this has been developed a bit further) we should probably ping him and see what he thinks of the format. It might be nice to share one format (at least in non-optimized form) between base & this package.

EDITS

I'd now change several things about this, esp. not storing pointers inline with the code. Just use a single long "stack" for all frames and pass in an offset when you enter into a new frame.
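That single-stack idea might be sketched like this (Python; Stack, enter_frame, and the other names are hypothetical): all frames share one long storage vector, a frame is identified only by its base offset, and slot i of the current frame lives at stack[base + i]:

```python
class Stack:
    """One shared slot vector for all frames, addressed by base offset."""
    def __init__(self):
        self.data = []

    def enter_frame(self, nslots):
        base = len(self.data)            # new frame starts where the last ended
        self.data.extend([None] * nslots)
        return base

    def leave_frame(self, base):
        del self.data[base:]             # drop the frame's slots

    def getslot(self, base, i):
        return self.data[base + i]

    def setslot(self, base, i, v):
        self.data[base + i] = v

st = Stack()
outer = st.enter_frame(3)
st.setslot(outer, 0, "caller")
inner = st.enter_frame(2)                # recursive call: just a new offset
st.setslot(inner, 0, "callee")
st.leave_frame(inner)
```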

timholy and others added 3 commits July 27, 2019 11:31
This is a draft of a new tokenization of lowered IR. The main goal is
to split off from the representation currently used in Base so that
we are free to perform more significant transformations of the IR
to enable more performance optimizations.

This tackles the first step, serializing lowered code to the new
tokenized format. Currently the only things supported are serializing
and printing; execution is in draft form but never tested.
@KristofferC (Member):

Really cool. Should we go ahead now and swap out the TypeMapEntry stuff for DispatchableMethod in here? It seems easier and potentially faster.

@timholy (Member, Author) commented Jul 30, 2019

Quite likely. If you're interested, feel free to make changes to the struct. The goal is to make things as easy as possible. One thought would also be to organize these differently: for a call with two arguments, rather than having a list like this

Tuple{Int, Int} => meth1
Tuple{Int16,Int} => meth1
Tuple{Float64,Int} => meth2
Tuple{Int,Float64} => meth3

one might instead consider a tree:

├──first arg is one of Union{Int,Int16}
|   ├──second arg isa Int => meth1
|   ├──second arg isa Float64 => meth3
├──first arg isa Float64
|   ├──second arg isa Int => meth2

It's not obvious that this is a good idea, since I bet >90% of all call sites inside loops (which is basically all we care about) dispatch to a single method, and it seems likely to be hard to beat the list in that case.
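For reference, the linear list with a move-to-front heuristic might be sketched like this (Python; signatures modeled as tuples of types, purely for illustration): if most call sites in a loop hit a single method, the common case reduces to a one-entry scan:

```python
def lookup(methlist, argtypes):
    """Scan (signature, method) pairs; promote a hit to the front (MRU)."""
    for i, (sig, meth) in enumerate(methlist):
        if sig == argtypes:
            if i > 0:                      # move-to-front heuristic
                methlist.insert(0, methlist.pop(i))
            return meth
    return None                            # fall back to full dispatch

methlist = [((int, int), "meth1"),
            ((float, int), "meth2")]
m = lookup(methlist, (float, int))         # hit on the second entry
```

After this call, ((float, int), "meth2") sits at the front, so a loop repeatedly dispatching on (float, int) pays only a single comparison per iteration.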

In general, writing this up was a good exercise in re-thinking "where should this bit of information go?" One example is the compiled flag in DispatchableMethod: currently we allow either Compiled() or a TypeMapEntry to be stored in the local method table, but really I think we want both. When a given function has some methods that should be run in interpreted mode and others that must be run in compiled mode, you'd like to be able to exploit a local method table to quickly figure out which method is applicable and then run it in the appropriate mode. (I'm actually not quite sure what happens in that case currently; it might be worth testing.)

Other good ideas that could potentially be "stolen" from this and applied to our current infrastructure include checking whether something is a builtin/intrinsic at the time of framecode construction, rather than having to call maybe_evaluate_builtin on each iteration of a loop. We can't write it into the serialized code, but we could add yet another array to the framecode. Builtins & intrinsics don't support multiple methods, so this is something that can be checked from the expression alone.

@KristofferC (Member):

Should we define some "calling convention" for this serialized IR, so that the function that gets called knows how to look up the arguments without having to put them into a vector, pass it to the next frame, and push them into locals?

A bit related to the comment at #308 (comment).

@KristofferC (Member) commented Jul 30, 2019

Is the reason for storing the ssastores explicitly that some ssavalues are not used (is_used)? Otherwise, they could potentially just be implicit, since every codeinfo statement stores to its corresponding ssaslot.

@timholy (Member, Author) commented Jul 30, 2019

Should we define some "calling convention" for this serialized IR, so that the function that gets called knows how to look up the arguments without having to put them into a vector, pass it to the next frame, and push them into locals?

A bit related to the comment at #308 (comment).

Yes, I think if we do it the way outlined in that comment, that basically becomes the calling convention.

@timholy (Member, Author) commented Jul 30, 2019

Is the reason for storing the ssastores explicitly that some ssavalues are not used (is_used)? Otherwise, they could potentially just be implicit, since every codeinfo statement stores to its corresponding ssaslot.

One option here is to renumber them so that you only have as many SSAValues as you use. See

    isassign(frame, pc) = (pc in frame.framecode.used)

and

    if isassign(frame, pc)
        if !@isdefined(rhs)
            @show frame node
        end
        lhs = getlhs(pc)
        do_assignment!(frame, lhs, rhs)
    end

It's probably a slowdown to check this for every single statement. It's also a slowdown to save results to ssavalues that won't be used. If you can rewrite the IR it seems you can have the best of both worlds.
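That rewriting might be sketched as follows (Python; renumber and the tuple-based store tokens are hypothetical, with used playing the role of framecode.used): only SSA values that are actually read get a store emitted, and those are renumbered densely:

```python
def renumber(stmts, used):
    """Emit stores only for used SSA values, renumbered densely."""
    # map old SSA ids -> new dense ids, dropping unused ones
    remap = {old: new for new, old in enumerate(sorted(used), start=1)}
    out = []
    for pc, stmt in enumerate(stmts, start=1):
        out.append(stmt)
        if pc in remap:                # store only when the value is read later
            out.append(("storessa", remap[pc]))
    return out, remap

# Statements 1 and 3 are read somewhere; statement 2's value is never used.
stmts = ["stmt1", "stmt2", "stmt3"]
new_ir, remap = renumber(stmts, used={1, 3})
```

With this rewrite there is no per-statement isassign check at run time, and no storage is wasted on values that are never read.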
