Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Cleanup Global Error handler #2181

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

lalitb
Copy link
Member

@lalitb lalitb commented Oct 8, 2024

Fixes #2175

Changes

In Progress. TODO:

  • Move TraceError and LogError from API to SDK. API should be only returning no-op constructs, and not the error objects. MetricsError is still being used in its API, so need to be handled separately.
  • Better handling for <signal>Error, CoW and few other types inside otel macros. As of now, the called needs to use format! to invoke their Debug impl, which is cumbersome.

For future improvement:

  • Allow applications/users to plugin their log handler.

Please provide a brief description of the changes here.

Merge requirement checklist

  • CONTRIBUTING guidelines followed
  • Unit tests added/updated (if applicable)
  • Appropriate CHANGELOG.md files updated for non-trivial, user-facing changes
  • Changes in public API reviewed (if applicable)

@lalitb lalitb requested a review from a team as a code owner October 8, 2024 01:08
Copy link

codecov bot commented Oct 8, 2024

Codecov Report

Attention: Patch coverage is 19.00000% with 81 lines in your changes missing coverage. Please review.

Project coverage is 79.0%. Comparing base (99d24b7) to head (665e678).

Files with missing lines Patch % Lines
opentelemetry-sdk/src/trace/config.rs 0.0% 17 Missing ⚠️
opentelemetry-sdk/src/trace/span_processor.rs 0.0% 10 Missing ⚠️
opentelemetry-sdk/src/metrics/pipeline.rs 0.0% 8 Missing ⚠️
opentelemetry-sdk/src/metrics/view.rs 11.1% 8 Missing ⚠️
opentelemetry-sdk/src/propagation/baggage.rs 11.1% 8 Missing ⚠️
opentelemetry-sdk/src/metrics/periodic_reader.rs 0.0% 6 Missing ⚠️
opentelemetry-proto/src/transform/metrics.rs 0.0% 4 Missing ⚠️
opentelemetry-sdk/src/metrics/meter.rs 63.6% 4 Missing ⚠️
opentelemetry-sdk/src/metrics/manual_reader.rs 0.0% 3 Missing ⚠️
opentelemetry-jaeger-propagator/src/propagator.rs 77.7% 2 Missing ⚠️
... and 8 more
Additional details and impacted files
@@          Coverage Diff          @@
##            main   #2181   +/-   ##
=====================================
  Coverage   79.0%   79.0%           
=====================================
  Files        121     121           
  Lines      20945   20908   -37     
=====================================
- Hits       16561   16532   -29     
+ Misses      4384    4376    -8     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@@ -146,7 +145,7 @@ impl Drop for LoggerProviderInner {
fn drop(&mut self) {
for processor in &mut self.processors {
if let Err(err) = processor.shutdown() {
global::handle_error(err);
otel_error!(name: "LoggerProviderInner.Drop", otel_name = "LoggerProviderInner.Drop", error = format!("{:?}", err));
Copy link
Member

@cijothomas cijothomas Oct 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any warning/error level message must be user understandable and should not refer to internal constructs like 'Inner'.
name:LoggerProvideShutdownOnDropFailure
msg = "LoggerProvider shutdown() failed when invoked from Drop"
error = "error"

Its okay to handle this in a follow up - so this PR can focus on replacing global error handler with internal logs.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can have name to follow pattern like:
name = <Module>.<Funtion>.<Action>
So in this case it would be LoggerProvider.Drop.Error or LoggerProvider.Drop.ShutdownError. The actual error message would be contained in the error attribute. And if required, more attributes can be there for specific field values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

module.func.action pattern is good pattern to follow, for debug logging that owners of this crate use to debug. I don't think we can have such a pattern for user-actionable ones, as they don't need to know which module/component is triggering this error.

@@ -175,7 +174,7 @@ impl<R: RuntimeChannel> LogProcessor for BatchLogProcessor<R> {
name: "batch_log_processor_emit_error",
error = format!("{:?}", err)
);
global::handle_error(LogError::Other(err.into()));
otel_error!(name: "BatchLogProcessor.emit", otel_error = "BatchLogProcessor.emit", error = format!("{:?}", LogError::Other(err.into())));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to re-visit this, as this can flood if channel is full.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to be discussed whether it's necessary to handle these errors internally by implementing logic to prevent log flooding. This could involve using an "already-sent" flag and tracking the last sent timestamp to reduce the frequency of repeated error messages.
Or else, let the user handle this flooding by adding a filter logic.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or else, let the user handle this flooding by adding a filter logic.

OTel SDK, by default, should not burden the user with such requirements!
Few thoughts:

  1. If the item is being dropped due to buffer being full, emit a log only once in the app lifetime. If item is dropped for other unknown reasons, it maybe fine to emit log for each occurrence. I prefer that this should be a simple bool flag only and not a sophisticated throttling mechanism.
  2. We should use an internal metric for items dropped. This may require more refactoring/designs.
  3. Keep the count of items dropped (using atomic uint etc.). During shutdown/drop, if that count is greater than zero, emit a warning log "N items were dropped due to buffer full".

All of these are good for future additions, so no need to tackle in this PR.
My only minimum bar for RC release - do not log for each time item is dropped due to buffer is full. (We had similar issue for Metrics overflow that was fixed recently)

@@ -47,40 +43,3 @@ impl<T> From<PoisonError<T>> for Error {
Error::Other(err.to_string())
}
}

struct ErrorHandler(Box<dyn Fn(Error) + Send + Sync>);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest to split the PR into smaller ones -

  1. Review and replace global handler with error! internal logs, one crate or even small scope in one PR
  2. When we are done with all crates, removing the global handler can be a simple PR.
  3. Revisit each internal log - if it is not user actionable, and only good for sdk owners, then it should be debug level.

@@ -50,12 +50,24 @@ macro_rules! otel_warn {
{
tracing::warn!(name: $name, target: env!("CARGO_PKG_NAME"), "");
}
#[cfg(not(feature = "internal-logs"))]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is best left for Error level only.

@@ -55,7 +55,9 @@ impl SpanRef<'_> {
if let Some(ref inner) = self.0.inner {
match inner.lock() {
Ok(mut locked) => f(&mut locked),
Err(err) => global::handle_error(err),
Err(_err) => {
otel_error!(name: "SpanRef.with_inner_mut", otel_name = "SpanRef.with_inner_mut", error = format!("{:?}", _err));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of repeating otel_name, lets do it in the wrapper itself for now, until tracing:fmt is fixed to display event name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I thought so, but then this will force this behavior for anyone using these macros in their custom exporters or processors. Thought not a big reason for not adding it in macros :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this will force this behavior for anyone using these macros in their custom exporters or processors.

That seems okay. And when fmt is fixed, we can just fix the wrapper, instead of asking everyone to change their code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove global error handler in favor of internal logs
2 participants