Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: Using futures_executor::block_on blocks forever in WASM #2047

Open
mendess opened this issue Aug 23, 2024 · 4 comments
Open

[Bug]: Using futures_executor::block_on blocks forever in WASM #2047

mendess opened this issue Aug 23, 2024 · 4 comments
Labels
bug Something isn't working triage:todo Needs to be traiged.

Comments

@mendess
Copy link

mendess commented Aug 23, 2024

What happened?

Using this library in WASM causes the program to block forever and use 100% CPU, due to the usage of futures_executor::block_on

API Version

0.24.0

SDK Version

0.24.1

What Exporter(s) are you seeing the problem on?

N/A

Using futures_executor::block_on is harmful to performance and correctness.

In single-threaded contexts, blocking in place leads to blocking the entire runtime forever, and block_on causes that thread to busy loop until the future resolves, using 100% of CPU.

In a multithreaded environment it is also harmful as it blocks an entire thread, and if that thread happens to have other tasks scheduled on it, this will cause those tasks to be starved of resources, worse still if the future the thread is blocking on is waiting for a task that's scheduled on that same thread via a mechanism such as the tokio LIFO slot, leading to a deadlock.

See this draft PR as a first attempt to address this problem.

In order to make this work for WASM I introduced a spawn_future function, which does the right thing in WASM but continues using the bad futures_executor::block_on function in other environments. With this patch I stopped having problems in the WASM project I'm working on, thus proving that this was the root of the issue.

Next I made some more methods async. The trait methods I made async are:

  • LogProcessor::emit
  • LogProcessor::force_flush
  • LogProcessor::shutdown
  • SpanProcessor::on_end
  • SpanProcessor::force_flush
  • SpanProcessor::shutdown

This avoids having to call spawn_future inside these functions. I couldn't however avoid calling this inside the Drop implementations that needed to call SpanProcessor::shutdown, as well as, inside periodic_reader.rs because I was unsure as to how best to proceed. So there are still some open questions.

@mendess mendess added bug Something isn't working triage:todo Needs to be traiged. labels Aug 23, 2024
@cijothomas
Copy link
Member

Thanks for opening the issue with detailed analysis and the PR! I left a comment on the PR https://github.com/open-telemetry/opentelemetry-rust/pull/2048/files#r1729165305 .

We need to see what is the right way to execute all the backgrounds tasks (batch exporting, metric reader). @ThomsonTan was also looking into this. Tagging @TommyCpp and @lalitb also to share ideas/concerns.

@cijothomas
Copy link
Member

In a multithreaded environment it is also harmful as it blocks an entire thread, and if that thread happens to have other tasks scheduled on it, this will cause those tasks to be starved of resources, worse still if the future the thread is blocking on is waiting for a task that's scheduled on that same thread via a mechanism such as the tokio LIFO slot, leading to a deadlock.

We are exploring a design where OTel will have its own dedicated thread, where these issues are going to be avoided. Do you consider it an issue, if Otel does a blocking call in its own background thread? That thread is OTel's own, and won't have any other tasks scheduled on it.

@demurgos
Copy link
Contributor

I keep hitting the same issues on regular x86_64 linux builds when using Tokio. I am not entirely sure what's causing this but all runtime.block_on calls hang. The presence of blocking calls seems to make most of the API brittle.

@mzabaluev
Copy link

mzabaluev commented Oct 16, 2024

Yes, block_on is problematic also in async contexts run by Tokio. This minimized test blocks:

use opentelemetry_sdk::{runtime, testing::trace::NoopSpanExporter, trace::TracerProvider};

#[tokio::test]
async fn shutdown_in_async_scope() {
    let provider = TracerProvider::builder()
        .with_batch_exporter(NoopSpanExporter::new(), runtime::Tokio)
        .build();

    provider.shutdown().unwrap();
}

One workaround is to call shutdown in a worker thread:

#[tokio::test]
async fn shutdown_in_tokio_worker() {
    let provider = TracerProvider::builder()
        .with_batch_exporter(NoopSpanExporter::new(), runtime::Tokio)
        .build();

    tokio::task::spawn_blocking(move || provider.shutdown())
        .await
        .unwrap()
        .unwrap();
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage:todo Needs to be traiged.
Projects
None yet
Development

No branches or pull requests

4 participants