[AUDIO_WORKLET] Optimise the copy back from wasm's heap to JS #22753

Draft · wants to merge 5 commits into base: main
Changes from 4 commits
33 changes: 26 additions & 7 deletions src/audio_worklet.js
@@ -37,6 +37,9 @@ function createWasmAudioWorkletProcessor(audioParams) {
// 'render quantum size', and this exercise of passing in the value
// shouldn't be required (to be verified).
this.samplesPerChannel = opts['sc'];
// Typed views of the output buffers on the worklet's stack, which after
// creation should only change if the audio chain changes.
this.outputViews = [];
}

static get parameterDescriptors() {
@@ -54,7 +57,7 @@ function createWasmAudioWorkletProcessor(audioParams) {
bytesPerChannel = this.samplesPerChannel * 4,
stackMemoryNeeded = (numInputs + numOutputs) * {{{ C_STRUCTS.AudioSampleFrame.__size__ }}},
oldStackPtr = stackSave(),
inputsPtr, outputsPtr, outputDataPtr, paramsPtr,
inputsPtr, outputsPtr, outputDataPtr, paramsPtr, requiredViews = 0,
didProduceAudio, paramArray;

// Calculate how much stack space is needed.
@@ -93,6 +96,24 @@ function createWasmAudioWorkletProcessor(audioParams) {
k += {{{ C_STRUCTS.AudioSampleFrame.__size__ / 4 }}};
// Reserve space for the output data
dataPtr += bytesPerChannel * i.length;
// How many output views are needed in total?
requiredViews += i.length;
}

// Verify we have enough views (it doesn't matter if we have too many, any
// excess won't be accessed) then also verify the views' start address
// hasn't changed.
// TODO: allocate space for outputDataPtr before any inputs?

Collaborator:

It would be ok to turn this around to allocate outputs before inputs. This way it should be possible to only assert() in debug builds that the invariant this.outputViews[0].byteOffset == k << 2 holds.

Also, the stack is already preallocated by the time the above constructor() is called, so things can be precomputed there in the ctor instead of at process time for any performance wins.

There are two things that I'd recommend:

  1. Benchmark the effect of this optimization. Since the code becomes more sophisticated and more dependent on things having been pre-set up, it would be good to confirm that there is actually a performance benefit to calling TypedArray.set() instead of manually looping over the samples.

I think performance.now() should work here, so adding calls to performance.now() at the very top and the very end of process() to measure the overall perf differential would probably work.

  2. It would be important to test two scenarios: a) multiple distinct audio processor classes, and b) different node paths in the same graph with different numbers of input/output channels.

With the original code it was reasonably "trivial" to statically prove that it would handle any combination of classes or channels; once state is cached like this, there is the possibility that different graph shapes could trip it up.

I wonder if a good test case might be to have a scenario where there's simultaneously
a) one stereo source clip -> simple passthrough stereo worklet -> stereo audiocontext speaker destination
b) one mono source clip -> simple passthrough mono worklet -> mono output
c) two copies of mono sources from b) above -> worklet merging (memcpying) these into stereo as Left/Right channels -> stereo audiocontext speaker destination

In such a test, there would be a full combination of "1ch input, 1ch output", "2ch input, 2ch outputs", "two 1ch inputs, 2ch output" cases that I think would cover every possibility (the graph would have multiple processor classes and differently shaped nodes).

The worklets themselves would be trivial memcpy input -> output implementations, so not much complexity in the audio code itself, but they would then test that different shapes of the graph are preserved. Would something like that be a feasible test case to build?
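To make that concrete, here is a rough sketch of how such a graph could be wired up with the plain Web Audio API (run from a module or async function; the processor names, the merge worklet and the test clips are hypothetical stand-ins, and in the Emscripten test the nodes would come from the Wasm Audio Worklets API instead):

```js
// Assumes 'passthrough' and 'merge-to-stereo' processors are registered in a
// hypothetical passthrough-processors.js module.
const ctx = new AudioContext();
await ctx.audioWorklet.addModule('passthrough-processors.js');

// Silent test clips (the contents don't matter for a graph-shape test).
const monoClip = new AudioBuffer({ numberOfChannels: 1, length: ctx.sampleRate, sampleRate: ctx.sampleRate });
const stereoClip = new AudioBuffer({ numberOfChannels: 2, length: ctx.sampleRate, sampleRate: ctx.sampleRate });

// a) stereo clip -> stereo passthrough worklet -> speakers
const stereoSrc = new AudioBufferSourceNode(ctx, { buffer: stereoClip });
const stereoPass = new AudioWorkletNode(ctx, 'passthrough',
    { numberOfInputs: 1, numberOfOutputs: 1, outputChannelCount: [2] });
stereoSrc.connect(stereoPass).connect(ctx.destination);

// b) mono clip -> mono passthrough worklet -> speakers
const monoSrc = new AudioBufferSourceNode(ctx, { buffer: monoClip });
const monoPass = new AudioWorkletNode(ctx, 'passthrough',
    { numberOfInputs: 1, numberOfOutputs: 1, outputChannelCount: [1] });
monoSrc.connect(monoPass).connect(ctx.destination);

// c) two mono clips -> worklet memcpying them into L/R of one stereo output -> speakers
const monoA = new AudioBufferSourceNode(ctx, { buffer: monoClip });
const monoB = new AudioBufferSourceNode(ctx, { buffer: monoClip });
const merge = new AudioWorkletNode(ctx, 'merge-to-stereo',
    { numberOfInputs: 2, numberOfOutputs: 1, outputChannelCount: [2] });
monoA.connect(merge, 0, 0); // -> left
monoB.connect(merge, 0, 1); // -> right
merge.connect(ctx.destination);

for (const src of [stereoSrc, monoSrc, monoA, monoB]) src.start();
```

That covers the "1ch in, 1ch out", "2ch in, 2ch out" and "two 1ch in, 2ch out" combinations listed above in a single graph.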

Contributor (PR author):

> It would be ok to turn this around to allocate outputs before inputs. This way it should be possible to only assert() in debug builds that the invariant this.outputViews[0].byteOffset == k << 2 holds.

I was unexpectedly free this afternoon, so I've already rearranged everything.

> Also, the stack is already preallocated by the time the above constructor() is called, so things can be precomputed there in the ctor instead of at process time for any performance wins.

I'll look at creating some views in the ctor up front after I've done the timings.

But you're 100% right about timing this, which I'll aim to do once it's in roughly the right shape, and the same goes for the test cases. At work our use case is a mono mic, stereo out from a wasm mixer, and then the combined mic and mixer, but I want to see it tested with more ins and outs like you propose.

(I can dedicate quite a bit of time to this over the next few weeks.)

Contributor (PR author):

> I think performance.now() should work here

It's not part of the AudioWorkletGlobalScope. Date.now() is available but I just get zeros, though I did see a 1 which brought excitement for a brief moment.
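
One way to squeeze a signal out of a millisecond clock, as a sketch (assuming Date.now() stays the only timer available in the worklet scope, with a hypothetical processor name): accumulate the per-call delta over many render quanta and post the total back to the main thread.

```js
class TimedProcessor extends AudioWorkletProcessor {
  constructor() {
    super();
    this.totalMs = 0; // sum of per-call Date.now() deltas (mostly 0, sometimes 1)
    this.calls = 0;
  }
  process(inputs, outputs, params) {
    const t0 = Date.now();
    // ... the audio work being measured would go here ...
    this.totalMs += Date.now() - t0;
    if (++this.calls % 10000 == 0) {
      // ~27s of audio at 128 frames / 48kHz; enough calls for the 1ms ticks to add up
      this.port.postMessage({ calls: this.calls, totalMs: this.totalMs });
    }
    return true;
  }
}
registerProcessor('timed-passthrough', TimedProcessor);
```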

Contributor (PR author):

I added the same Date.now() to the original code... and it's a lot of zeros, then some occasional ones. I think there are more ones than with the views, but that's not very scientific.

What I'll need to do is write something standalone that runs process() outside the AudioWorkletGlobalScope and then benchmark it over many runs on different browsers and hardware (which is worth doing).
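
Something along these lines might work as a standalone micro-benchmark outside the AudioWorkletGlobalScope; it only compares the two copy strategies rather than the whole process() path, and all the sizes and iteration counts below are made up:

```js
const CHANNELS = 2, SAMPLES = 128, ITERATIONS = 1e6;

// Stand-ins for the wasm heap and the worklet's output channel buffers.
const HEAPF32 = new Float32Array(64 * 1024).map(() => Math.random());
const outputs = Array.from({ length: CHANNELS }, () => new Float32Array(SAMPLES));
const outputDataPtr = 256; // arbitrary float index into the fake heap

// Original approach: copy every sample by hand.
function manualCopy() {
  let ptr = outputDataPtr;
  for (const channel of outputs) {
    for (let k = 0; k < SAMPLES; ++k) channel[k] = HEAPF32[ptr++];
  }
}

// PR approach: views created once (mirroring this.outputViews), then set().
const views = Array.from({ length: CHANNELS },
    (_, c) => HEAPF32.subarray(outputDataPtr + c * SAMPLES, outputDataPtr + (c + 1) * SAMPLES));
function viewCopy() {
  for (let c = 0; c < CHANNELS; ++c) outputs[c].set(views[c]);
}

for (const [name, fn] of [['manual loop', manualCopy], ['subarray + set', viewCopy]]) {
  const t0 = performance.now();
  for (let i = 0; i < ITERATIONS; ++i) fn();
  console.log(`${name}: ${(performance.now() - t0).toFixed(1)} ms for ${ITERATIONS} copies`);
}
```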

k = outputDataPtr;
if (this.outputViews.length < requiredViews || (this.outputViews.length && this.outputViews[0].byteOffset != k << 2)) {
this.outputViews = [];
for (/*which output*/ i of outputList) {

Collaborator:

Can the length of outputList be different in each call?

Perhaps assert(this.outputViews.length == outputList.length) after this block to be clear?

Contributor (PR author):

It can grow if other nodes are manually added. I’ll need to write contrived examples to see this, since in the real world I don’t think it will.
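
For reference, a possible shape for that assertion in the Emscripten JS-library style (just a sketch using the names from this diff; the exact invariant would depend on how the caching finally lands):

```js
#if ASSERTIONS
// One view is cached per output channel, so the cache must cover the current
// graph shape, and the first view must still start at the output data area.
assert(this.outputViews.length >= requiredViews);
assert(!requiredViews || this.outputViews[0].byteOffset == outputDataPtr << 2);
#endif
```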

for (/*which channel*/ j of i) {
this.outputViews.push(
HEAPF32.subarray(k, k += this.samplesPerChannel)
);
}
}
}

// Copy parameters descriptor structs and data to Wasm
@@ -112,14 +133,12 @@ function createWasmAudioWorkletProcessor(audioParams) {
// Call out to Wasm callback to perform audio processing
if (didProduceAudio = this.callbackFunction(numInputs, inputsPtr, numOutputs, outputsPtr, numParams, paramsPtr, this.userData)) {
// Read back the produced audio data to all outputs and their channels.
// (A garbage-free function TypedArray.copy(dstTypedArray, dstOffset,
// srcTypedArray, srcOffset, count) would sure be handy.. but web does
// not have one, so manually copy all bytes in)
// The 'outputViews' are subarray views into the heap, each with the
// correct offset and size to be copied directly into the output.
k = 0;
for (i of outputList) {
for (j of i) {
for (k = 0; k < this.samplesPerChannel; ++k) {
j[k] = HEAPF32[outputDataPtr++];
}
j.set(this.outputViews[k++]);
}
}
}
5 changes: 4 additions & 1 deletion src/library_webaudio.js
@@ -164,7 +164,10 @@ let LibraryWebAudio = {

let audioWorkletCreationFailed = () => {
#if WEBAUDIO_DEBUG
console.error(`emscripten_start_wasm_audio_worklet_thread_async() addModule() failed!`);
// Note about Cross-Origin here: if the Cross-Origin-Opener-Policy and
// Cross-Origin-Embedder-Policy headers are missing from the server's
// response, the worklet file will fail to load.
console.error(`emscripten_start_wasm_audio_worklet_thread_async() addModule() failed! Are the Cross-Origin headers being set?`);
#endif
{{{ makeDynCall('viip', 'callback') }}}(contextHandle, 0/*EM_FALSE*/, userData);
};
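
Regarding the Cross-Origin note above: the headers have to be sent by the server with the page (and worklet/wasm) responses, since the shared-memory worklet thread requires a cross-origin-isolated context. A minimal dev-server sketch, assuming Node with Express (not part of this PR):

```js
const express = require('express');

const app = express();
app.use((req, res, next) => {
  // These two response headers make the served page cross-origin isolated.
  res.setHeader('Cross-Origin-Opener-Policy', 'same-origin');
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp');
  next();
});
app.use(express.static('out')); // serve the Emscripten output directory
app.listen(8080);
```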