
Allow checking whether operators/types are supported for a backend before creating a graph #463

Closed · huningxin opened this issue Sep 8, 2023 · 30 comments · Fixed by #755

@huningxin
Contributor

This feedback is from @RafaelCintron (thanks!) from the Chromium CL-4828993 review.

In the current spec and implementation, if any operator/type combination is not supported by a backend (for example, the DirectML backend doesn't support the dilations of the average pooling operator), the error is thrown when user code calls MLGraphBuilder.build(). Rafael mentioned this might be too late because the user code may have already downloaded all the weights.

We should add new WebNN APIs to help developers determine which of these operator/type combinations will fail BEFORE they attempt to create a graph. This will help them avoid large model downloads for models that use operators/types the backend does not support.

@wacky6

wacky6 commented Sep 11, 2023

🤔 Or perhaps WebNN needs a weight loading step?

This is essentially the pattern I see with popular models (e.g. Real-ESRGAN, Stable diffusion). The graph describes the architecture and doesn't include weights.

If we want to fail before building the graph, I'm afraid the operand methods (e.g. conv2d) need to be async (to allow querying the backend), or we need to make every feature feature-detectable (think of sending a feature matrix to the renderer).

@huningxin
Contributor Author

@wacky6

Or perhaps WebNN needs a weight loading step?

This is a great idea.

According to the current WebNN spec, the WebGPU context has an MLCommandEncoder.initializeGraph(graph) method for preparing and optimizing constant input data before subsequent graph executions.

We could probably reuse this method for the default context and extend it for weight uploading. For example, MLContext.initializeGraph(graph, weights) would allow uploading the weights to constant operands after building the graph.

If constant/weight uploading were decoupled from graph building, developers could build the graph topology/architecture before downloading the weights. If there is any unsupported-operator error, they could handle the error or stop there. Only when the graph building succeeds would they proceed to download the weights from the network and upload them to the device with MLContext.initializeGraph().
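A minimal sketch of that flow, assuming the hypothetical MLContext.initializeGraph(graph, weights) extension described above (the named, data-less constant() and fetchWeights() are also assumptions, not existing API):

// Build the topology first; no weight data is needed at this point.
const builder = new MLGraphBuilder(context);
const input = builder.input('input', {dataType: 'float32', dimensions: [1, 3, 224, 224]});
const filter = builder.constant('filter', {dataType: 'float32', dimensions: [32, 3, 3, 3]});
const output = builder.conv2d(input, filter);

// Rejects here if an operator/type combination is unsupported,
// before any weights have been downloaded.
const graph = await builder.build({output});

// Only after a successful build: download the weights and upload them to the device.
const weights = {'filter': await fetchWeights('filter.bin')};
await context.initializeGraph(graph, weights);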

@wacky6

wacky6 commented Sep 12, 2023

Weight loading would substantially change how graphBuilder.constant() works. This will be a non-trivial change to implementation and usage.

We need to match constant nodes with weight buffers. Probably do-able if we ask clients to retain references to constant MLOperand and pass pair<constNode, arrayBuffer> to loadWeights().

The overall usage will be:

// Build the graph
const x = graphBuilder.input()
let constNode = graphBuilder.constant(shape)
const y = graphBuilder.add(x, constNode)

// Compile the graph on the backend.
// Will reject if any operator in the graph isn't supported.
const graph = await graphBuilder.build(y)

// Download the weights and match them with the constant nodes
const weightNodes = new Map()
weightNodes.set(constNode, arrayBuffer)   // Note the `constNode` reference here

// Load the weights and make the graph computable.
await graph.loadWeight(weightNodes)

// Compute
await context.compute(graph, {...inputs})

I'd recommend including this issue in v2. Diffuser / LLM are both large.

Stable diffusion XL already hit 7GB (fp16). Maybe its size will double again this year :)

@huningxin
Contributor Author

We need to match constant nodes with weight buffers. Probably do-able if we ask clients to retain references to constant MLOperand and pass pair<constNode, arrayBuffer> to loadWeights().

Alternatively, we could reuse MLNamedArrayBufferViews for weight uploading. In common model formats, e.g. ONNX or TFLite, a weight usually has a name, which can be used to build a constant operand when building the graph and to map a buffer when uploading the weights.

The overall usage will be:

The usage looks good. I slightly modified the pseudo code by using MLNamedArrayBufferViews:

// Build the graph
const x = graphBuilder.input('x', shape)
let constNode = graphBuilder.constant('constNode', shape)
const y = graphBuilder.add(x, constNode)

// Compile the graph on the backend.
// Will reject if any operator in the graph isn't supported.
const graph = await graphBuilder.build({y})

// Download the weights and match them with array buffers by name
const weights = {'constNode': arrayBuffer}

// Load the weights and make the graph computable.
await graph.loadWeight(weights)

// Compute
await context.compute(graph, {...inputs})

@reillyeon
Contributor

I had this same thought when reviewing Chromium CL 5232604. In that example there are a number of input shapes which aren't supported by XNNPACK. Generating these errors at build time rather than when the MLGraphBuilder.matmul() function is called will make it all but impossible to debug why the build failed.

In the XNNPACK case it is easy enough to do these checks earlier and in the renderer process, but I can see how for backends with a more elaborate build step this becomes much more difficult. Perhaps a build error API could return a reference to the problematic MLOperand so that debug tools can visualize the graph and highlight the problem. This would also aid in frameworks performing feature detection, though as @RafaelCintron pointed out in the original comment, the earlier the incompatibility can be detected the less wasted effort there is.
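One possible shape for such an error, purely as a sketch (neither the operand nor the operatorName field exists on build errors today):

let graph;
try {
  graph = await builder.build({output});
} catch (e) {
  // Hypothetical fields that would point back at the offending node, so debug
  // tooling could highlight it in a visualization of the graph.
  if (e.operand !== undefined) {
    console.error(`build() failed at the '${e.operatorName}' operator`, e.operand);
  }
  throw e;
}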

@philloooo
Contributor

philloooo commented Apr 11, 2024

What's the expected behavior for the framework when an operator/type is not supported?

If we just want to fail fast, then making weight loading a separate step is sufficient. But if we want to allow frameworks to adjust the graph partitioning or deploy a different model that works for the current WebNN backend, then we do need some form of feature detection that surfaces the particular unsupported features.

@philloooo
Contributor

philloooo commented May 15, 2024

Hi! Given that we are seeing more and more data type support differences across platforms (e.g. #653, #675, #283 (comment)), I'd like to make more progress on this issue.

What do folks think about the following structure, which starts with just exposing the data type support?

context.opSupportLimits():
{
    "input": {
        "dataType": [
            "float16",
            "float32"
        ]
    },
    "constant": {
        "dataType": [
            "float16",
            "float32",
            "int32"
        ]
    },
    "gather": {
        "input": {
            "dataType": [
                "float16",
                "float32"
            ]
        },
        "indices": {
            "dataType": [
                "uint32",
                "int32",
                "int64"
            ]
        }
    },
    "softmax": {
        "input": {
            "dataType": [
                "float16",
                "float32"
            ]
        }
    },
   ...
}

Probing supported feature limits should be tied to an MLContext, as the limits are particular to the underlying devices.
MLContext.opSupportLimits() returns the supported types for all ops and all of their operand parameters.

In this proposal “dataType” is in a nested field to make it more extensible, so later we can specify other limits like {“rankRange”: [0, 4]}.

This will be generated as the intersection of the WebNN-supported types (for Chromium, currently defined in graph_validation_utils) and the underlying platform's supported types. Since MLContext creation is async, this information can be gathered at MLContext creation time by querying the backend.

  • For input and constant, since they are data operands themselves, the value is an object containing all the limits; currently it’s only the dataType limit.
  • For operator methods, the value is an object with all the operand parameters.
  • For operator methods with overloads (activation functions), the value is a superset of all the operand parameters, which for activations is e.g. {“softmax”: {“input”: {“dataType”: [“float32”, “float16”]}}}

This is also useful even if a context supports all types declared in the WebNN spec, as it provides a programmatic way for frameworks to check the supported types for all ops.
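For example, a framework could probe the limits before committing to a large download (a sketch against the proposed shape above; requiredOps would come from the model's own metadata):

const limits = context.opSupportLimits();

// Operator/data-type requirements extracted from the model the framework wants to run.
const requiredOps = [
  {op: 'gather', operand: 'indices', dataType: 'int64'},
  {op: 'softmax', operand: 'input', dataType: 'float16'},
];

const unsupported = requiredOps.filter(
    ({op, operand, dataType}) => !limits[op]?.[operand]?.dataType.includes(dataType));

if (unsupported.length > 0) {
  // Pick a different model variant (or partition) instead of downloading this one.
  console.warn('Not supported on this context:', unsupported);
}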

Thoughts? @huningxin @fdwr
@Honry, interested in your thoughts on whether this makes sense for the ONNX Runtime WebNN EP.

@Honry
Contributor

Honry commented May 22, 2024

Thanks @philloooo, that's very useful for the current ONNX Runtime WebNN EP. With this API we don't need to maintain a whitelist of each op's data type constraints.

However, in the future we may upgrade the WebNN EP to use the same architecture as the DML EP. This would involve registering the supported operators, as well as their supported data types, with the onnxruntime kernel. This would allow ORT to handle the graph partitioning and help the WebNN EP gain higher-level graph optimization capabilities. If we adopt this new architecture, we would still need to maintain a whitelist.

@reillyeon
Contributor

If we adopt this new architecture, we would still need to maintain a whitelist.

I don't understand this conclusion. Can you explain why or what information would be necessary to avoid the need for an allowlist?

@Honry
Contributor

Honry commented May 23, 2024

I don't understand this conclusion. Can you explain why or what information would be necessary to avoid the need for an allowlist?

FYI, the DML EP maintains a whitelist in OperatorRegistration.cpp to register its supported ONNX ops, ONNX opset ranges, data type constraints, etc. with the ORT kernel.

To correct my last comment: context.opSupportLimits() would still be helpful for generating the supportedTypeList* for each op.
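For instance, roughly the following could derive such a list in the EP (a sketch; the WebNN op and operand names used, and the dataType field name, follow the proposal above):

// Sketch: derive a per-op supported data type list from the context,
// e.g. to feed the EP's type registration.
function supportedTypeList(context, webnnOp, operandName = 'input') {
  const limits = context.opSupportLimits();
  return limits[webnnOp]?.[operandName]?.dataType ?? [];
}

// Example: data types the backend accepts for gather's indices.
const gatherIndicesTypes = supportedTypeList(context, 'gather', 'indices');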

@huningxin
Contributor Author

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".

/cc @fs-eire
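A hypothetical shape for exposing this, just to illustrate the request (the attribute name is made up and not part of the spec):

// Hypothetical attribute, shown only to illustrate the request above.
const context = await navigator.ml.createContext({deviceType: 'gpu'});
console.log(context.opSupportLimits().preferredInputLayout);  // e.g. 'nchw' or 'nhwc'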

@philloooo
Contributor

FYI, the DML EP maintains a whitelist in OperatorRegistration.cpp to register its supported ONNX ops, ONNX opset ranges, data type constraints, etc. with the ORT kernel.

@Honry I still don't understand the DML EP architecture. Does it not allow registering the type constraints at the time we get a context?

To correct my last comment: context.opSupportLimits() would still be helpful for generating the supportedTypeList* for each op.

Or does this mean that context.opSupportLimits() is compatible with the DML architecture (so a whitelist is not needed)?

The difference between the DML backend and WebNN is that WebNN has different type limits across backends, so a single whitelist wouldn't work and we have to get the limits through the context.

@Honry
Contributor

Honry commented May 29, 2024

The difference is that the DML EP has to register the supported data type list with the ORT kernel, while the current WebNN EP doesn't; it only needs to check whether the data type of the ONNX op it delegates is supported by the corresponding WebNN op.

Does it not allow registering the type constraints at the time we get a context?

That's not a problem, we can get the context before registering the type constraints.

Or does this mean that context.opSupportLimits() is compatible with the DML architecture (so a whitelist is not needed)?

Yes, it's compatible with the DML architecture. We can use context.opSupportLimits() to generate the supported data type list.

The difference between the DML backend and WebNN is that WebNN has different type limits across backends, so a single whitelist wouldn't work and we have to get the limits through the context.

That's really helpful for different backends. 👍

@Honry
Contributor

Honry commented May 29, 2024

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".

/cc @fs-eire

Can we also expose the device type info? Currently the WebNN TFLite backend still has some constraints, and we work around them in the WebNN EP.

@Honry
Contributor

Honry commented May 31, 2024

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".

/cc @fs-eire

Per @reillyeon's CL https://chromium-review.googlesource.com/c/chromium/src/+/5528084, if Chromium can handle the preferred layout for each backend (inserting transposes in Chromium), I think we don't need to expose preferred layout info from MLContext.

@reillyeon
Contributor

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".
/cc @fs-eire

Per @reillyeon's CL https://chromium-review.googlesource.com/c/chromium/src/+/5528084, if Chromium can handle the preferred layout for each backend (inserting transposes in Chromium), I think we don't need to expose preferred layout info from MLContext.

We should do some measurements of how expensive the inserted transpose operators are before we make this decision. My CL will make the model work, but a version which matches the platform's preferred layout might have better performance.

@Honry
Contributor

Honry commented Jun 3, 2024

We should do some measurements of how expensive the inserted transpose operators are before we make this decision. My CL will make the model work, but a version which matches the platform's preferred layout might have better performance.

Currently ORT-Web will also insert the transpose ops if the preferred layout is NHWC.

@wacky6

wacky6 commented Jun 3, 2024

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".

/cc @fs-eire

🤔 What does "preferred" here mean? Is this the preferred tensor layout for DML, or the underlying device?

FWIW, one data point from my past experience: the "preferred layout" might differ based on device (different vendor, or same vendor but different generation) and workload (i.e. model architecture); see the thread here: xinntao/Real-ESRGAN#650 (comment)

@Honry
Contributor

Honry commented Jun 5, 2024


cast's type parameter is a special case; currently TFLite doesn't support casting to uint64.

type is a value of the MLOperandDataType enum rather than an MLOperand, so we need to find a way to express such a constraint.
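One way to express this would be an output-level entry for cast rather than a per-operand one, roughly as follows (the concrete type lists are illustrative; on TFLite the output entry would simply omit "uint64"):

"cast": {
    "input": {
        "dataType": ["float32", "float16", "int32", "uint32", "int8", "uint8", "int64"]
    },
    "output": {
        "dataType": ["float32", "float16", "int32", "uint32", "int8", "uint8", "int64"]
    }
}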

@reillyeon
Contributor

What does "preferred" here mean? Is this the preferred tensor layout for DML, or the underlying device?

In my experience each backend library only supports one tensor layout, but if multiple were supported then the preference should be device-specific so that the developer can load the appropriate model.

@philloooo
Contributor

philloooo commented Jun 26, 2024

Hi! I've landed the first CL for opSupportLimits to chromium with three initial fields:

console.log(JSON.stringify(context.opSupportLimits(), null, 2))
{
  "constant": {
    "dataTypes": [
      "float32",
      "float16",
      "int32"
    ]
  },
  "gather": {
    "indices": {
      "dataTypes": [
        "int32"
      ]
    },
    "input": {
      "dataTypes": [
        "float32",
        "float16",
        "int32",
        "int8",
        "uint8"
      ]
    }
  },
  "input": {
    "dataTypes": [
      "float32",
      "float16",
      "int32"
    ]
  }
}
  • This also changes Chromium behavior so that some unsupported data type errors are thrown synchronously during the graph-building phase (when calling builder.input, builder.gather, etc.) instead of during the async builder.build().
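For example (the data type here is chosen purely for illustration):

const builder = new MLGraphBuilder(context);
try {
  // Throws synchronously if 'int64' isn't in context.opSupportLimits().input.dataTypes,
  // instead of surfacing the error later from the async builder.build().
  const indices = builder.input('indices', {dataType: 'int64', dimensions: [16]});
} catch (e) {
  console.log('Unsupported at graph-building time:', e.message);
}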

TODOs:

  • add the rest of ops to this list.
  • add output as a key to the result. This is needed for CoreML because the overall graph's output has the same constraint as the input.
  • also support output on the operator level. This is needed for cast and argmin/max.
  • add rankRange to each op (see Define the maximum number of operand dimensions (maximum rank) #456 )

Let me know if you have any feedback.
If anyone is interested in participating in this work, especially in filling the gaps in the TFLite/DML backends, that would be great!

@huningxin
Contributor Author

Hi! I've landed the first CL for opSupportLimits to chromium with three initial fields:

Thanks @philloooo! This is what I got from DirectML on Windows. Developers/frameworks can get platform-specific supported data types now.

context = await navigator.ml.createContext({deviceType: 'gpu'})
console.log(JSON.stringify(context.opSupportLimits(), null, 2))
{
  "constant": {
    "dataTypes": [
      "float32",
      "float16",
      "int32",
      "uint32",
      "int64",
      "uint64",
      "int8",
      "uint8"
    ]
  },
  "gather": {
    "indices": {
      "dataTypes": [
        "int32",
        "uint32",
        "int64"
      ]
    },
    "input": {
      "dataTypes": [
        "float32",
        "float16",
        "int32",
        "uint32",
        "int64",
        "uint64",
        "int8",
        "uint8"
      ]
    }
  },
  "input": {
    "dataTypes": [
      "float32",
      "float16",
      "int32",
      "uint32",
      "int64",
      "uint64",
      "int8",
      "uint8"
    ]
  }
}

@huningxin
Contributor Author

@inexorabletash proposed that rank limits can be expressed in the context's opSupportLimits(): #456 (comment)

@philloooo and @a-sully and I had an internal discussion about this. Over in issue #463 @philloooo proposes that each op could express rank limits in the context's opSupportLimits() output, e.g. {rankRange: [0,4]}

@fdwr
Collaborator

fdwr commented Jul 12, 2024

Hi! I've landed the first CL for opSupportLimits to chromium with three initial fields:

😎

Let me know if you have any feedback.

You know, the names of input tensor parameters didn't really matter until now (tensors inside options mattered, but not parameters), and we could freely fix the condition selection operator's inputs from "input" and "other" to the clearer "trueValue" and "falseValue" without API impact (MLOperand where(MLOperand condition, MLOperand trueValue, MLOperand falseValue)). Once encoded into the operator-support-capabilities though, the names start to matter.

Another realization is that multiple tensors must share the same data type, like "trueValue" and "falseValue" (and it would be redundant to report them both separately). So do we return them as {"condition", "value"} data types, or as {"condition", "trueValues", "falseValues"}? 🤔 For reflection-like scenarios, using the exact name and returning one dataType list per distinct tensor might simplify tooling, but for other scenarios (like checking data type matching in ORT), returning just one dataType list for both values may be clearer (because returning two distinct lists implies the possibility that trueValues and falseValues may differ, which they must not). ❔
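Concretely, the two shapes in question for where() would look roughly like this (the type lists are illustrative):

// Option 1: one shared entry for the two value tensors.
"where": {
    "condition": {"dataTypes": ["uint8"]},
    "value":     {"dataTypes": ["float32", "float16", "int32"]}
}

// Option 2: one entry per distinct parameter name, even when the lists must match.
"where": {
    "condition":   {"dataTypes": ["uint8"]},
    "trueValues":  {"dataTypes": ["float32", "float16", "int32"]},
    "falseValues": {"dataTypes": ["float32", "float16", "int32"]}
}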

@philloooo
Contributor

philloooo commented Jul 15, 2024

Thanks for the feedback! @fdwr
I actually like the fact that opSupportLimits now formalizes the parameter naming.
For your second point, I prefer the former design: it's less error-prone for consumers of opSupportLimits, more future-proof (it doesn't need to change if some group of parameters is later allowed to have different types), and just simpler, one less thing to process (no need to maintain another mapping in the spec).

Internally in the Chromium implementation we can use the same member in the mojom struct to reduce redundancy.

@fdwr
Collaborator

fdwr commented Jul 15, 2024

I actually like the fact that opSupportLimits now formalizes the parameter naming.

Yep, I wasn't implying it was a bad change, just a new consideration.

add output as a key to the result

Naming the outputs may be interesting. For simple cases that return a single output tensor, just using the name "output" makes sense (or the plural "outputs" for cases like split and gru), but it's more interesting when we have an operator that returns multiple different tensors (e.g. say a dynamicQuantizeLinear existed that returned three tensors: the input quantized to int8, the scale factor, and the zero-point tensor). In such cases, rather than just returning a triplet sequence<MLOperand>, we should probably return a struct with named fields (e.g. output, outputScale, outputZeroPoint), and those field names would be used in opSupportLimits.
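A rough sketch of what that could look like (the op and all of its names are hypothetical, as above):

// Hypothetical: a multi-output op returns a dictionary of named MLOperands
// rather than a sequence<MLOperand>.
const {output, outputScale, outputZeroPoint} = builder.dynamicQuantizeLinear(input);

// The same field names would then show up in opSupportLimits(), e.g.:
// "dynamicQuantizeLinear": {
//   "input":           {"dataTypes": ["float32"]},
//   "output":          {"dataTypes": ["int8"]},
//   "outputScale":     {"dataTypes": ["float32"]},
//   "outputZeroPoint": {"dataTypes": ["int8"]}
// }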

For your second point, I prefer the former design: it's less error-prone for consumers of opSupportLimits

So then (to clarify), you mean {"condition", "trueValues", "falseValues"} where each tensor (whether via parameter or options) has exactly one entry in the operator support limits, right?

@philloooo
Contributor

philloooo commented Jul 15, 2024

@fdwr I was planning on adding output as a key only to ops that have platform-dependent behaviors, which right now is only cast and argmin/argmax, plus the overall graph. WDYT?

So then (to clarify), you mean {"condition", "trueValues", "falseValues"} where each tensor (whether via parameter or options) has exactly one entry in the operator support limits, right?

Yeah that's right!

@fdwr
Collaborator

fdwr commented Jul 15, 2024

I was planning on adding output as a key only to ops that have platform-dependent behaviors ... WDYT?

@philloooo 🤔 Consistently including the output field for all operators feels cleaner to me, allowing the caller to use more generic code paths for all cases rather than handling a few scattered cases specially. Granted, I realize that the vast majority of ops have identical input and output data types anyway (e.g. add's a, b, and output tensor data type lists are identical), but my thinking is that when we add more ops later, hopefully the caller logic works more transparently.

Additionally, it's nice for diagnostic purposes if somebody wanted to print out the IO data types into a table to see all the constraints, and a number of other libraries (e.g. StableHLO, DML, ONNX) already list data type constraints for both inputs and outputs.

@inexorabletash
Member

Consistently including the output field for all operators feels cleaner to me

Agreed - it's helpful in cases like where() where the type is not the same as the type of the first input tensor.

@fdwr
Collaborator

fdwr commented Jul 19, 2024

@philloooo

TODOs:

  • add the rest of ops to this list.
  • add output as a key to the result. This is needed for CoreML because the overall graph's output has the same constraint as the input.
  • also support output on the operator level. This is needed for cast and argmin/max.
    Let me know if you have any feedback.

Since #456 was folded into this issue, should we add {rankRange: [0,4]} into this issue's todo list, or reactivate 456?
