FR: Add plan_anchor field to PlanRel #725

tokoko · 2024-10-11T18:44:50Z

I'm working on a library that lets you build up substrait plans using dataframe API. Over the course of dataframe transformations various unrelated plans need to be merged to construct the final substrait plan. For the most part it's pretty straightforward, but you need to deal with extension functions and references during the merges.

Functions in substrait fortunately let you define "pointers" in a relaxed fashion with anchors. This means that as long as there is some global way of assigning a unique identifier to each function beforehand, merging plans become straightforward. Unlike functions, relationship between PlanRels and RefereceRels is modeled differently, RefereceRels need to be aware of the ordinal position of the target rel in the relations array. This makes merging plans very tricky. One has to either traverse the whole plan to adjust ordinal positions on each merge or use some sort of placeholders and do a single-pass traversal at the end.

To make this simpler, substrait could use the same anchor mechanism with PlanRels. An additional plan_anchor field can be introduced in PlanRel message, which will be used by ReferenceRel instead of relying on ordinal positions.

The text was updated successfully, but these errors were encountered:

drin · 2024-10-18T16:07:13Z

This sounds cool and potentially related to something I'm working on.

Functions in substrait fortunately let you define "pointers" in a relaxed fashion with anchors. This means that as long as there is some global way of assigning a unique identifier to each function beforehand, merging plans become straightforward

Do you mean merging plans is straightforward with respect to not having to compare function semantics?

To make this simpler, substrait could use the same anchor mechanism with PlanRels

How would you define the same type of anchor for a PlanRel? It seems easy to think that a function extension URI maps 1-1 with a particular semantic definition of a function. I'm not sure how to map that to PlanRels

drin · 2024-10-18T16:16:14Z

In my own project, I use PlanAnchor to refer to an operator where I will split a plan for distributed processing (PlanAnchor definition). The operator becomes a leaf of the original plan and a parent of the subplan that was originally rooted at the operator. The subplan becomes a new, independent query plan that carries the PlanAnchor as a reference for later merging.

Discussion of splitting and merging aside, the PlanAnchor marks a spot in a plan that says "someone else will execute this sub-graph, and when they're done the results should stream back into this spot;" so, I think my intention is the same but I have the benefit of knowing that I can create a PlanAnchor at planning time in a way that there is an origin to reconnect it back to.

If that sounds similar to what you're doing, then I think this is a great idea. I can explain the details of how I do it and find some way to upstream it. If you're trying to do something more general, then I'd be interested in hearing more about your use case.

tokoko · 2024-10-18T17:37:38Z

What I mean by a merge is to use two potentially independent Rels in the same parent Rel (for example a JoinRel or a SetRel). Normally a query like the following: WITH T AS (SELECT a from A) SELECT a FROM T would produce a substrait plan that looks like this:

{
  "relations": [
    {
      "rel": {
        "read": {
          "common": {
            "direct": {}
          },
          "baseSchema": {
            "names": [
              "a"
            ],
            "struct": {
              "types": [
                {
                  "i64": {
                    "nullability": "NULLABILITY_NULLABLE"
                  }
                }
              ],
              "nullability": "NULLABILITY_NULLABLE"
            }
          },
          "namedTable": {
            "names": [
              "A"
            ]
          }
        }
      }
    },
    {
      "root": {
        "input": {
          "project": {
            "common": {
              "emit": {
                "outputMapping": [
                  1
                ]
              }
            },
            "input": {
              "reference": {} // empty object here means subtree_ordinal = 0
            },
            "expressions": [
              {
                "selection": {
                  "directReference": {
                    "structField": {}
                  },
                  "rootReference": {}
                }
              }
            ]
          }
        },
        "names": [
          "a"
        ]
      }
    }
  ],
  "version": {
    "minorNumber": 54,
    "producer": "subframe"
  }
}

Note that the plan has two entries in the list of relations, the first being a plain Rel (representing the contents of a CTE) and another RelRoot that uses ReferenceRel{ subtree_ordinal = 0 } to refer to the first entry in the list.

The problem arises when you want to join two subtrees that already independently contain CTEs:

WITH T AS (SELECT a from A) SELECT a FROM T
WITH T AS (SELECT b from B) SELECT b FROM T

Each plan already uses a ReferenceRel{ subtree_ordinal = 0 } but actually pointing to different CTEs, the merged version of the plan can obviously only put one of them in the 0th position of the relations list, therefore the other plan will have to be traversed to change every occurrence of ReferenceRel{ subtree_ordinal = 0 } to ReferenceRel{ subtree_ordinal = 1 } because of the forced change in the ordinal position.

The alternative would be to have a plan_anchor field (an integer) in PlanRel message which will be used by ReferenceRel instead of the list index (something like ReferenceRel{ rel_reference = 99 }). This makes merging simpler because as long as plan_anchor integers are unique, there's no need to make changes to the existing ReferenceRel messages.

tokoko · 2024-10-18T17:43:06Z

someone else will execute this sub-graph, and when they're done the results should stream back into this spot

Unless I'm misunderstanding, I think you should be able to accomplish this w/o introducing PlanAnchor message type (even w/o changes requested in this ticket). Instead of a PlanAnchor you can put the Rel in the list of relations and refer to it by ReferenceRel instead of a PlanAnchor.

drin · 2024-10-18T18:09:08Z

Instead of a PlanAnchor you can put the Rel in the list of relations and refer to it by ReferenceRel instead of a PlanAnchor.

I think this would only work if I still keep all Rels in the plan. I was actually reducing the substrait message itself to only contain the Rels to be executed. Maybe I can demote the PlanRel.RelRoot to a normal Rel and promote the desired Rel to a new PlanRel.RelRoot and mark it so that a consumer doesn't think it's a CTE then that could work? But, that also seems wasteful in terms of payload size and complexity. I can think on it though

drin · 2024-10-18T18:16:58Z

as long as plan_anchor integers are unique

but how would you do this in the general case? I guess you could maybe make a hash that accommodates operators and schemas (essentially an identifier of a subgraph in a plan), but even if you do that, wouldn't you have to verify there were no collisions?

tokoko · 2024-10-18T18:26:58Z

but how would you do this in the general case?

For a general case, you can't really do anything, there can always be a "collision". (the same applies for function references as well) For a specific scenario when all the plans used are being incrementally built in the same process (converting from other representation or building with a dataframe API) you can simply set up a system-wide unique integer generator. That's how we handle function references in ibis-substrait producer now for example.

drin · 2024-10-18T18:28:43Z

therefore the other plan will have to be traversed to change every occurrence of ReferenceRel{ subtree_ordinal = 0 } to ReferenceRel{ subtree_ordinal = 1 } because of the forced change in the ordinal position.

Also, this could potentially be addressed by keeping a list of ReferenceRels per plan. Then traversal for any particular plan is O(n) where n is the number of references. I would think this shouldn't be too costly, though definitely annoying.

drin · 2024-10-18T18:34:39Z

yeah, so I get what the idea is and I wonder if it should just be called something other than an anchor since it's a general identity property that you can use an extrinsic approach for (UUID generator) or an intrinsic approach for (subgraph identity as a hash). I am interested in supporting materialized views so it would be valuable for me to reuse this type of property for an intrinsic identity value

EpsilonPrime · 2024-10-20T02:50:22Z

So I can see how referencing by id instead of index might make merging easier. It might be useful to look at what other operations are happening when you're merging plans for context. For instance, when merging happens you also have to reorganize function references and function URIs (which is even more involved as you have to walk all the expressions in the merged tree instead of the relations).

tokoko · 2024-10-20T08:35:43Z

@EpsilonPrime to be exact, even for plan references, you need to walk all the expressions in all the merged trees as well, because you might have to update subtree_ordinal fields in ReferenceRel messages which might be anywhere.

Let me give you a basic rundown of what's happening during a merge wrt function references. There are 2 general scenarios:

trees being merged are arbitrary valid substrait plans that you had no hand in crafting
trees being merged have been created by you, so you have control over what references were used

For the first scenario, there are no easy escape hatches, you have to at some point (probably right before merge) check anchors (inside extensions, w/o walking the trees) for collisions and in case there is one walk at least one of those trees fully to change all the relevant fields (function_reference, comparison_function_reference, custom_function_reference).

For the second scenario, the easy solution is to set up a single central registry which will be handing out unique references to functions to all the plans being created even if they don't have anything to do with one another. This is to guarantee that in case you need a merge, there will be no collisions and therefore no need to walk all the subtrees. You simply need to merge extensions and extension_uris and call it a day. This is roughly what's happening both in ibis-substrait and subframe, we read all function yamls on startup, parse the functions and assign a unique anchor to each function entry.

The point of this ticket is that the solution in the second scenario is possible only for function_reference fields right now, not for subtree_ordinal. Because subtree_ordinal is an index rather than an id, you're guaranteed to have a collision.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FR: Add plan_anchor field to PlanRel #725

FR: Add plan_anchor field to PlanRel #725

tokoko commented Oct 11, 2024

drin commented Oct 18, 2024

drin commented Oct 18, 2024 •

edited

Loading

tokoko commented Oct 18, 2024

tokoko commented Oct 18, 2024

drin commented Oct 18, 2024

drin commented Oct 18, 2024

tokoko commented Oct 18, 2024

drin commented Oct 18, 2024

drin commented Oct 18, 2024

EpsilonPrime commented Oct 20, 2024

tokoko commented Oct 20, 2024

FR: Add plan_anchor field to PlanRel #725

FR: Add plan_anchor field to PlanRel #725

Comments

tokoko commented Oct 11, 2024

drin commented Oct 18, 2024

drin commented Oct 18, 2024 • edited Loading

tokoko commented Oct 18, 2024

tokoko commented Oct 18, 2024

drin commented Oct 18, 2024

drin commented Oct 18, 2024

tokoko commented Oct 18, 2024

drin commented Oct 18, 2024

drin commented Oct 18, 2024

EpsilonPrime commented Oct 20, 2024

tokoko commented Oct 20, 2024

drin commented Oct 18, 2024 •

edited

Loading