
Allow checking whether operators/types are supported for a backend before creating a graph #463

Closed · huningxin opened this issue Sep 8, 2023 · 30 comments · Fixed by #755

@huningxin
Contributor

This feedback is from @RafaelCintron (thanks!) from the Chromium CL-4828993 review.

In the current spec and implementation, if any operator/type combination is not supported by a backend (for example, the DirectML backend doesn't support the dilations of the average pooling operator), the error is thrown when user code calls MLGraphBuilder.build(). Rafael mentioned this might be too late because the user code may have already downloaded all the weights.

We should add new WebNN APIs to help developers determine which of these operator/type combinations will fail BEFORE they attempt to create a graph. This will help them avoid large model downloads for models that use operators/types the backend does not support.

@wacky6

wacky6 commented Sep 11, 2023

🤔 Or perhaps WebNN needs a weight loading step?

This is essentially the pattern I see with popular models (e.g. Real-ESRGAN, Stable diffusion). The graph describes the architecture and doesn't include weights.

If we want to fail before building the graph, I'm afraid the operand methods (e.g. conv2d) need to be async (to allow querying the backend), or we need to make every feature feature-detectable (think of sending a feature matrix to the renderer).

@huningxin
Contributor Author

@wacky6

Or perhaps WebNN needs a weight loading step?

This is a great idea.

According to the current WebNN spec, the WebGPU context has an MLCommandEncoder.initializeGraph(graph) method for preparing and optimizing constant input data before subsequent graph executions.

We could probably reuse this method for the default context and extend it for weight uploading. For example, MLContext.initializeGraph(graph, weights) would allow uploading the weights to constant operands after building the graph.

If constant/weight uploading were decoupled from graph building, developers could build the graph topology/architecture before downloading the weights. If there is any unsupported-operator error, they could handle the error or stop there. Only when the graph building succeeds would they proceed to download the weights from the network and upload them to the device with MLContext.initializeGraph().
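A minimal sketch of that flow, assuming the hypothetical MLContext.initializeGraph(graph, weights) extension described above (the named, data-less constant() and fetchWeights() are also assumptions, not existing API):

// Build the topology first; no weight data is needed at this point.
const builder = new MLGraphBuilder(context);
const input = builder.input('input', {dataType: 'float32', dimensions: [1, 3, 224, 224]});
const filter = builder.constant('filter', {dataType: 'float32', dimensions: [32, 3, 3, 3]});
const output = builder.conv2d(input, filter);

// Rejects here if an operator/type combination is unsupported,
// before any weights have been downloaded.
const graph = await builder.build({output});

// Only after a successful build: download the weights and upload them to the device.
const weights = {'filter': await fetchWeights('filter.bin')};
await context.initializeGraph(graph, weights);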

@wacky6

wacky6 commented Sep 12, 2023

Weight loading would substantially change how graphBuilder.constant() works. This will be a non-trivial change to implementation and usage.

We need to match constant nodes with weight buffers. Probably do-able if we ask clients to retain references to constant MLOperand and pass pair<constNode, arrayBuffer> to loadWeights().

The overall usage will be:

// Build the graph
const x = graphBuilder.input()
let constNode = graphBuilder.constant(shape)
const y = graphBuilder.add(x, constNode)

// Compile the graph on the backend.
// Will reject if any operator in the graph isn't supported.
const graph = await graphBuilder.build(y)

// Download the weights and match them with the constant nodes
const weightNodes = new Map()
weightNodes.set(constNode, arrayBuffer)   // Note the `constNode` reference here

// Load the weights and make the graph computable.
await graph.loadWeight(weightNodes)

// Compute
await context.compute(graph, {...inputs})

I'd recommend including this issue in v2. Diffuser / LLM are both large.

Stable diffusion XL already hit 7GB (fp16). Maybe its size will double again this year :)

@huningxin
Contributor Author

We need to match constant nodes with weight buffers. Probably do-able if we ask clients to retain references to constant MLOperand and pass pair<constNode, arrayBuffer> to loadWeights().

Alternatively, we could reuse MLNamedArrayBufferViews for weight uploading. In common model formats, e.g. ONNX or TFLite, a weight usually has a name, which can be used to build a constant operand when building the graph and to map a buffer when uploading the weights.

The overall usage will be:

The usage looks good. I slightly modified the pseudo code by using MLNamedArrayBufferViews:

// Build the graph
const x = graphBuilder.input('x', shape)
let constNode = graphBuilder.constant('constNode', shape)
const y = graphBuilder.add(x, constNode)

// Compile the graph on the backend.
// Will reject if any operator in the graph isn't supported.
const graph = await graphBuilder.build({y})

// Download the weights and match them with array buffers by name
const weights = {'constNode': arrayBuffer}

// Load the weights and make the graph computable.
await graph.loadWeight(weights)

// Compute
await context.compute(graph, {...inputs})

@reillyeon
Contributor

I had this same thought when reviewing Chromium CL 5232604. In that example there are a number of input shapes which aren't supported by XNNPACK. Generating these errors at build time rather than when the MLGraphBuilder.matmul() function is called will make it all but impossible to debug why the build failed.

In the XNNPACK case it is easy enough to do these checks earlier and in the renderer process, but I can see how for backends with a more elaborate build step this becomes much more difficult. Perhaps a build error API could return a reference to the problematic MLOperand so that debug tools can visualize the graph and highlight the problem. This would also aid in frameworks performing feature detection, though as @RafaelCintron pointed out in the original comment, the earlier the incompatibility can be detected the less wasted effort there is.
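One possible shape for such an error, purely as a sketch (neither the operand nor the operatorName field exists on build errors today):

let graph;
try {
  graph = await builder.build({output});
} catch (e) {
  // Hypothetical fields that would point back at the offending node, so debug
  // tooling could highlight it in a visualization of the graph.
  if (e.operand !== undefined) {
    console.error(`build() failed at the '${e.operatorName}' operator`, e.operand);
  }
  throw e;
}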

@philloooo
Contributor

philloooo commented Apr 11, 2024

What's the expected behavior for the framework when an operator/type is not supported?

If we just want to fail fast, then making weight loading a separate step is sufficient. But if we want to allow frameworks to adjust the graph partitioning or deploy a different model that works for the current WebNN backend, then we do need some form of feature detection that surfaces the particular unsupported features.

@philloooo
Contributor

philloooo commented May 15, 2024

Hi! Given that we are seeing more and more data type support differences across platforms (e.g. #653, #675, #283 (comment)), I'd like to make more progress on this issue.

What do folks think about the following structure, which starts with just exposing the data type support?

context.opSupportLimits():
{
    "input": {
        "dataType": [
            "float16",
            "float32"
        ]
    },
    "constant": {
        "dataType": [
            "float16",
            "float32",
            "int32"
        ]
    },
    "gather": {
        "input": {
            "dataType": [
                "float16",
                "float32"
            ]
        },
        "indices": {
            "dataType": [
                "uint32",
                "int32",
                "int64"
            ]
        }
    },
    "softmax": {
        "input": {
            "dataType": [
                "float16",
                "float32"
            ]
        }
    },
   ...
}

Probing supported feature limits should be tied to an MLContext, as the limits are particular to the underlying devices.
MLContext.opSupportLimits() returns the supported types for all ops and all of their operand parameters.

In this proposal “dataType” is in a nested field to make it more extensible, so later we can specify other limits like {“rankRange”: [0, 4]}.

This will be generated as the intersection of the WebNN-supported types (for Chromium, currently defined in graph_validation_utils) and the underlying platform's supported types. Since MLContext creation is async, this information can be gathered at MLContext creation time by querying the backend.

  • For input and constant, since they are data operands themselves, the value is an object containing all the limits; currently it’s only the dataType limit.
  • For operator methods, the value is an object with all the operand parameters.
  • For operator methods with overloads (activation functions), the value is a superset of all the operand parameters, which for activations is e.g. {“softmax”: {“input”: {“dataType”: [“float32”, “float16”]}}}

This is also useful even if a context supports all types declared in the WebNN spec, as it provides a programmatic way for frameworks to check the supported types for all ops.
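For example, a framework could probe the limits before committing to a large download (a sketch against the proposed shape above; requiredOps would come from the model's own metadata):

const limits = context.opSupportLimits();

// Operator/data-type requirements extracted from the model the framework wants to run.
const requiredOps = [
  {op: 'gather', operand: 'indices', dataType: 'int64'},
  {op: 'softmax', operand: 'input', dataType: 'float16'},
];

const unsupported = requiredOps.filter(
    ({op, operand, dataType}) => !limits[op]?.[operand]?.dataType.includes(dataType));

if (unsupported.length > 0) {
  // Pick a different model variant (or partition) instead of downloading this one.
  console.warn('Not supported on this context:', unsupported);
}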

Thoughts? @huningxin @fdwr
@Honry, interested in your thoughts on whether this makes sense for the ONNX Runtime WebNN EP.

@Honry
Contributor

Honry commented May 22, 2024

Thanks @philloooo, that's very useful for the current ONNX Runtime WebNN EP. With this API we don't need to maintain a whitelist of each op's data type constraints.

However, in the future we may upgrade the WebNN EP to use the same architecture as the DML EP. This would involve registering the supported operators, as well as their supported data types, with the onnxruntime kernel. This would allow ORT to handle the graph partitioning and help the WebNN EP gain higher-level graph optimization capabilities. If we adopt this new architecture, we would still need to maintain a whitelist.

@reillyeon
Contributor

If we adopt this new architecture, we would still need to maintain a whitelist.

I don't understand this conclusion. Can you explain why or what information would be necessary to avoid the need for an allowlist?

@Honry
Contributor

Honry commented May 23, 2024

I don't understand this conclusion. Can you explain why or what information would be necessary to avoid the need for an allowlist?

FYI, the DML EP maintains a whitelist in OperatorRegistration.cpp to register its supported ONNX ops, ONNX opset ranges, data type constraints, etc. with the ORT kernel.

To correct my last comment: context.opSupportLimits() would still be helpful for generating the supportedTypeList* for each op.
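For instance, roughly the following could derive such a list in the EP (a sketch; the WebNN op and operand names used, and the dataType field name, follow the proposal above):

// Sketch: derive a per-op supported data type list from the context,
// e.g. to feed the EP's type registration.
function supportedTypeList(context, webnnOp, operandName = 'input') {
  const limits = context.opSupportLimits();
  return limits[webnnOp]?.[operandName]?.dataType ?? [];
}

// Example: data types the backend accepts for gather's indices.
const gatherIndicesTypes = supportedTypeList(context, 'gather', 'indices');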

@huningxin
Contributor Author

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".

/cc @fs-eire
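A hypothetical shape for exposing this, just to illustrate the request (the attribute name is made up and not part of the spec):

// Hypothetical attribute, shown only to illustrate the request above.
const context = await navigator.ml.createContext({deviceType: 'gpu'});
console.log(context.opSupportLimits().preferredInputLayout);  // e.g. 'nchw' or 'nhwc'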

@philloooo
Contributor

FYI, the DML EP maintains a whitelist in OperatorRegistration.cpp to register its supported ONNX ops, ONNX opset ranges, data type constraints, etc. with the ORT kernel.

@Honry I still don't understand the DML EP architecture. Does it not allow registering the type constraints at the time we get a context?

To correct my last comment: context.opSupportLimits() would still be helpful for generating the supportedTypeList* for each op.

Or does this mean that context.opSupportLimits() is compatible with the DML architecture (so a whitelist is not needed)?

The difference between the DML backend and WebNN is that WebNN has different type limits across backends, so a single whitelist wouldn't work and we have to get the limits through the context.

@Honry
Contributor

Honry commented May 29, 2024

The difference is that the DML EP has to register the supported data type list with the ORT kernel, while the current WebNN EP doesn't; it only needs to check whether the data type of the ONNX op it delegates is supported by the corresponding WebNN op.

Does it not allow registering the type constraints at the time we get a context?

That's not a problem, we can get the context before registering the type constraints.

Or does this mean that context.opSupportLimits() is compatible with the DML architecture (so a whitelist is not needed)?

Yes, it's compatible with the DML architecture. We can use context.opSupportLimits() to generate the supported data type list.

The difference between the DML backend and WebNN is that WebNN has different type limits across backends, so a single whitelist wouldn't work and we have to get the limits through the context.

That's really helpful for different backends. 👍

@Honry
Contributor

Honry commented May 29, 2024

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".

/cc @fs-eire

Can we also expose the device type info? Currently the WebNN TFLite backend still has some constraints, and we work around them in the WebNN EP.

@Honry
Contributor

Honry commented May 31, 2024

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".

/cc @fs-eire

Per @reillyeon's CL https://chromium-review.googlesource.com/c/chromium/src/+/5528084, if Chromium can handle the preferred layout for each backend (inserting transposes in Chromium), I think we don't need to expose preferred layout info from MLContext.

@reillyeon
Contributor

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".
/cc @fs-eire

Per @reillyeon's CL https://chromium-review.googlesource.com/c/chromium/src/+/5528084, if Chromium can handle the preferred layout for each backend (inserting transposes in Chromium), I think we don't need to expose preferred layout info from MLContext.

We should do some measurements of how expensive the inserted transpose operators are before we make this decision. My CL will make the model work, but a version which matches the platform's preferred layout might have better performance.

@Honry
Contributor

Honry commented Jun 3, 2024

We should do some measurements of how expensive the inserted transpose operators are before we make this decision. My CL will make the model work, but a version which matches the platform's preferred layout might have better performance.

Currently ORT-Web will also insert the transpose ops if the preferred layout is NHWC.

@wacky6

wacky6 commented Jun 3, 2024

Regarding the ONNX Runtime WebNN EP discussion, it would be ideal if the preferred tensor layout, for example "nchw" or "nhwc" for conv2d or pooling operators, could be exposed via MLContext, so the framework won't have to guess it based on device type, for example "nhwc" for "cpu" and "nchw" for "gpu".

/cc @fs-eire

🤔 What does "preferred" here mean? Is this the preferred tensor layout for DML, or the underlying device?

FWIW, one data point from my past experience: the "preferred layout" might differ based on device (different vendor, or same vendor but different generation) and workload (i.e. model architecture); see the thread here: xinntao/Real-ESRGAN#650 (comment)

@Honry
Contributor

Honry commented Jun 5, 2024


cast's type parameter is a special case; currently TFLite doesn't support casting to uint64.

type is a value of the MLOperandDataType enum rather than an MLOperand, so we need to find a way to express such a constraint.
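One way to express this would be an output-level entry for cast rather than a per-operand one, roughly as follows (the concrete type lists are illustrative; on TFLite the output entry would simply omit "uint64"):

"cast": {
    "input": {
        "dataType": ["float32", "float16", "int32", "uint32", "int8", "uint8", "int64"]
    },
    "output": {
        "dataType": ["float32", "float16", "int32", "uint32", "int8", "uint8", "int64"]
    }
}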

@reillyeon
Contributor

What does "preferred" here mean? Is this the preferred tensor layout for DML, or the underlying device?

In my experience each backend library only supports one tensor layout, but if multiple were supported then the preference should be device-specific so that the developer can load the appropriate model.

@philloooo
Contributor

philloooo commented Jun 26, 2024

Hi! I've landed the first CL for opSupportLimits to chromium with three initial fields:

console.log(JSON.stringify(context.opSupportLimits(), null, 2))
{
  "constant": {
    "dataTypes": [
      "float32",
      "float16",
      "int32"
    ]
  },
  "gather": {
    "indices": {
      "dataTypes": [
        "int32"
      ]
    },
    "input": {
      "dataTypes": [
        "float32",
        "float16",
        "int32",
        "int8",
        "uint8"
      ]
    }
  },
  "input": {
    "dataTypes": [
      "float32",
      "float16",
      "int32"
    ]
  }
}
  • This also changes Chromium behavior so that some unsupported data type errors are thrown synchronously during the graph-building phase (when calling builder.input, builder.gather, etc.) instead of during the async builder.build().
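For example (the data type here is chosen purely for illustration):

const builder = new MLGraphBuilder(context);
try {
  // Throws synchronously if 'int64' isn't in context.opSupportLimits().input.dataTypes,
  // instead of surfacing the error later from the async builder.build().
  const indices = builder.input('indices', {dataType: 'int64', dimensions: [16]});
} catch (e) {
  console.log('Unsupported at graph-building time:', e.message);
}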

TODOs:

  • add the rest of ops to this list.
  • add output as a key to the result. This is needed for CoreML because the overall graph's output has the same constraint as the input.
  • also support output on the operator level. This is needed for cast and argmin/max.
  • add rankRange to each op (see Define the maximum number of operand dimensions (maximum rank) #456 )

Let me know if you have any feedback.
If anyone is interested in participating in this work, especially in filling the gaps in the TFLite/DML backends, that would be great!

@huningxin
Contributor Author

Hi! I've landed the first CL for opSupportLimits to chromium with three initial fields:

Thanks @philloooo! This is what I got from DirectML on Windows. Developers/frameworks can get platform-specific supported data types now.

context = await navigator.ml.createContext({deviceType: 'gpu'})
console.log(JSON.stringify(context.opSupportLimits(), null, 2))
{
  "constant": {
    "dataTypes": [
      "float32",
      "float16",
      "int32",
      "uint32",
      "int64",
      "uint64",
      "int8",
      "uint8"
    ]
  },
  "gather": {
    "indices": {
      "dataTypes": [
        "int32",
        "uint32",
        "int64"
      ]
    },
    "input": {
      "dataTypes": [
        "float32",
        "float16",
        "int32",
        "uint32",
        "int64",
        "uint64",
        "int8",
        "uint8"
      ]
    }
  },
  "input": {
    "dataTypes": [
      "float32",
      "float16",
      "int32",
      "uint32",
      "int64",
      "uint64",
      "int8",
      "uint8"
    ]
  }
}

@huningxin
Contributor Author

@inexorabletash proposed that rank limits can be expressed in the context's opSupportLimits(): #456 (comment)

@philloooo and @a-sully and I had an internal discussion about this. Over in issue #463 @philloooo proposes that each op could express rank limits in the context's opSupportLimits() output, e.g. {rankRange: [0,4]}

@fdwr
Collaborator

fdwr commented Jul 12, 2024

Hi! I've landed the first CL for opSupportLimits to chromium with three initial fields:

😎

Let me know if you have any feedback.

You know, the names of input tensor parameters didn't really matter until now (tensors inside options mattered, but not parameters), and we could freely fix the condition selection operator's inputs from "input" and "other" to the clearer "trueValue" and "falseValue" without API impact (MLOperand where(MLOperand condition, MLOperand trueValue, MLOperand falseValue)). Once encoded into the operator-support-capabilities though, the names start to matter.

Another realization is that multiple tensors must share the same data type, like "trueValue" and "falseValue" (and it would be redundant to report them both separately). So do we return them as {"condition", "value"} data types, or as {"condition", "trueValues", "falseValues"}? 🤔 For reflection-like scenarios, using the exact name and returning one dataType list per distinct tensor might simplify tooling, but for other scenarios (like checking data type matching in ORT), returning just one dataType list for both values may be clearer (because returning two distinct lists implies the possibility that trueValues and falseValues may differ, which they must not). ❔
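Concretely, the two shapes in question for where() would look roughly like this (the type lists are illustrative):

// Option 1: one shared entry for the two value tensors.
"where": {
    "condition": {"dataTypes": ["uint8"]},
    "value":     {"dataTypes": ["float32", "float16", "int32"]}
}

// Option 2: one entry per distinct parameter name, even when the lists must match.
"where": {
    "condition":   {"dataTypes": ["uint8"]},
    "trueValues":  {"dataTypes": ["float32", "float16", "int32"]},
    "falseValues": {"dataTypes": ["float32", "float16", "int32"]}
}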

@philloooo
Contributor

philloooo commented Jul 15, 2024

Thanks for the feedback! @fdwr
I actually like the fact that opSupportLimits now formalizes the parameter naming.
For your second point, I prefer the former design: it's less error-prone for consumers of opSupportLimits, more future-proof (it doesn't need to change if some group of parameters is later allowed to have different types), and just simpler, one less thing to process (no need to maintain another mapping in the spec).

Internally in the Chromium implementation we can use the same member in the mojom struct to reduce redundancy.

@fdwr
Collaborator

fdwr commented Jul 15, 2024

I actually like the fact that opSupportLimits now formalizes the parameter naming.

Yep, I wasn't implying it was a bad change, just a new consideration.

add output as a key to the result

Naming the outputs may be interesting. For simple cases that return a single output tensor, just using the name "output" makes sense (or the plural "outputs" for cases like split and gru), but it's more interesting when we have an operator that returns multiple different tensors (e.g. say a dynamicQuantizeLinear existed that returned three tensors: the input quantized to int8, the scale factor, and the zero-point tensor). In such cases, rather than just returning a triplet sequence<MLOperand>, we should probably return a struct with named fields (e.g. output, outputScale, outputZeroPoint), and those field names would be used in opSupportLimits.
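A rough sketch of what that could look like (the op and all of its names are hypothetical, as above):

// Hypothetical: a multi-output op returns a dictionary of named MLOperands
// rather than a sequence<MLOperand>.
const {output, outputScale, outputZeroPoint} = builder.dynamicQuantizeLinear(input);

// The same field names would then show up in opSupportLimits(), e.g.:
// "dynamicQuantizeLinear": {
//   "input":           {"dataTypes": ["float32"]},
//   "output":          {"dataTypes": ["int8"]},
//   "outputScale":     {"dataTypes": ["float32"]},
//   "outputZeroPoint": {"dataTypes": ["int8"]}
// }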

For your second point, I prefer the former design: it's less error-prone for consumers of opSupportLimits

So then (to clarify), you mean {"condition", "trueValues", "falseValues"} where each tensor (whether via parameter or options) has exactly one entry in the operator support limits, right?

@philloooo
Contributor

philloooo commented Jul 15, 2024

@fdwr I was planning on adding output as a key only to ops that have platform-dependent behaviors, which right now is only cast and argmin/argmax, plus the overall graph. WDYT?

So then (to clarify), you mean {"condition", "trueValues", "falseValues"} where each tensor (whether via parameter or options) has exactly one entry in the operator support limits, right?

Yeah that's right!

@fdwr
Collaborator

fdwr commented Jul 15, 2024

I was planning on adding output as a key only to ops that have platform-dependent behaviors ... WDYT?

@philloooo 🤔 Consistently including the output field for all operators feels cleaner to me, allowing the caller to use more generic code paths for all cases rather than handling a few scattered cases specially. Granted, I realize that the vast majority of ops have identical input and output data types anyway (e.g. add's a, b, and output tensor data type lists are identical), but my thinking is that when we add more ops later, hopefully the caller logic works more transparently.

Additionally, it's nice for diagnostic purposes if somebody wanted to print out the IO data types into a table to see all the constraints, and a number of other libraries (e.g. StableHLO, DML, ONNX) already list data type constraints for both inputs and outputs.

@inexorabletash
Member

Consistently including the output field for all operators feels cleaner to me

Agreed - it's helpful in cases like where() where the type is not the same as the type of the first input tensor.

@fdwr
Collaborator

fdwr commented Jul 19, 2024

@philloooo

TODOs:

  • add the rest of ops to this list.
  • add output as a key to the result. This is needed for CoreML because the overall graph's output has the same constraint as the input.
  • also support output on the operator level. This is needed for cast and argmin/max.
    Let me know if you have any feedback.

Since #456 was folded into this issue, should we add {rankRange: [0,4]} into this issue's todo list, or reactivate 456?
