Add Error Tracking Standalone Config option #30065

Guillaume-Barrier · 2024-10-11T15:43:33Z

What does this PR do?

This PR adds ErrorTrackingStandalone config option as a boolean. When set to true, all samplers but the error sampler are bypassed. Only chunks that contain an error span or a span with exception span events are run through the error sampler. Kept spans are specifically tagged.

Motivation

We want to offer users Error Tracking Standalone, i.e. a way to gather backend errors at a lower cost than buying APM, but with upsell in mind. Error Tracking Backend Standalone (ETBS) relies only on chunks that contain errors, so only the error sampler should be run.

https://datadoghq.atlassian.net/browse/ERRORT-4747

Error Tracking will support OpenTelemetry exception span events as issues. The sampler should not drop spans that don't have an error status but do have exception span events.

ErrorTrackingStandalone is a boolean. When set to true, all samplers but the error sampler are bypassed. Kept spans are specifically tagged.

bits-bot · 2024-10-11T15:43:37Z

All committers have signed the CLA.

pr-commenter · 2024-10-11T16:28:19Z

Test changes on VM

Use this command from test-infra-definitions to manually test this PR's changes on a VM:

inv create-vm --pipeline-id=46925840 --os-family=ubuntu

Note: This applies to commit ce0d656

pr-commenter · 2024-10-11T17:02:25Z

Regression Detector

Regression Detector Results

Run ID: 0f286266-25e1-4a7f-a5f0-7c977d656e5a

Baseline: 0eab229
Comparison: ce0d656

Performance changes are noted in the perf column of each table:

✅ = significantly better comparison variant performance
❌ = significantly worse comparison variant performance
➖ = no significant change in performance

No significant changes in experiment optimization goals

Confidence level: 90.00%
Effect size tolerance: |Δ mean %| ≥ 5.00%

There were no significant changes in experiment optimization goals at this confidence level and effect size tolerance.

Fine details of change detection per experiment

perf	experiment	goal	Δ mean %	Δ mean % CI	trials	links
➖	basic_py_check	% cpu utilization	+1.49	[-1.21, +4.18]	1	Logs
➖	file_to_blackhole_1000ms_latency	egress throughput	+0.40	[-0.09, +0.89]	1	Logs
➖	uds_dogstatsd_to_api_cpu	% cpu utilization	+0.35	[-0.38, +1.09]	1	Logs
➖	file_tree	memory utilization	+0.30	[+0.18, +0.43]	1	Logs
➖	file_to_blackhole_500ms_latency	egress throughput	+0.06	[-0.19, +0.30]	1	Logs
➖	tcp_dd_logs_filter_exclude	ingress throughput	+0.00	[-0.01, +0.01]	1	Logs
➖	uds_dogstatsd_to_api	ingress throughput	-0.00	[-0.10, +0.10]	1	Logs
➖	file_to_blackhole_0ms_latency	egress throughput	-0.02	[-0.35, +0.32]	1	Logs
➖	file_to_blackhole_100ms_latency	egress throughput	-0.02	[-0.25, +0.20]	1	Logs
➖	file_to_blackhole_300ms_latency	egress throughput	-0.10	[-0.27, +0.08]	1	Logs
➖	idle	memory utilization	-0.15	[-0.20, -0.10]	1	Logs bounds checks dashboard
➖	tcp_syslog_to_blackhole	ingress throughput	-0.36	[-0.41, -0.31]	1	Logs
➖	idle_all_features	memory utilization	-1.01	[-1.11, -0.91]	1	Logs bounds checks dashboard
➖	pycheck_lots_of_tags	% cpu utilization	-1.05	[-3.50, +1.40]	1	Logs
➖	otel_to_otel_logs	ingress throughput	-1.56	[-2.36, -0.75]	1	Logs

Bounds Checks

perf	experiment	bounds_check_name	replicates_passed
✅	file_to_blackhole_0ms_latency	memory_usage	10/10
✅	file_to_blackhole_1000ms_latency	memory_usage	10/10
✅	file_to_blackhole_100ms_latency	memory_usage	10/10
✅	file_to_blackhole_300ms_latency	memory_usage	10/10
✅	file_to_blackhole_500ms_latency	memory_usage	10/10
✅	idle	memory_usage	10/10
✅	idle_all_features	memory_usage	10/10

Explanation

A regression test is an A/B test of target performance in a repeatable rig, where "performance" is measured as "comparison variant minus baseline variant" for an optimization goal (e.g., ingress throughput). Due to intrinsic variability in measuring that goal, we can only estimate its mean value for each experiment; we report uncertainty in that value as a 90.00% confidence interval denoted "Δ mean % CI".

For each experiment, we decide whether a change in performance is a "regression" -- a change worth investigating further -- if all of the following criteria are true:

1. Its estimated |Δ mean %| ≥ 5.00%, indicating the change is big enough to merit a closer look.
2. Its 90.00% confidence interval "Δ mean % CI" does not contain zero, indicating that if our statistical model is accurate, there is at least a 90.00% chance there is a difference in performance between baseline and comparison variants.
3. Its configuration does not mark it "erratic".
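
As a rough illustration of the three criteria above, here is a minimal Go sketch. The `Result` type, its field names, and `IsRegression` are hypothetical and are not taken from the Regression Detector itself; only the 5.00% tolerance and the "CI does not contain zero" rule come from the explanation.

```go
package main

import "math"

// Result holds the per-experiment figures reported in the table.
// The type and its field names are hypothetical.
type Result struct {
	DeltaMeanPct float64 // estimated Δ mean %
	CILow        float64 // lower bound of the 90.00% confidence interval
	CIHigh       float64 // upper bound of the 90.00% confidence interval
	Erratic      bool    // true if the experiment configuration marks it "erratic"
}

// effectSizeTolerance is the |Δ mean %| threshold quoted in the report.
const effectSizeTolerance = 5.0

// IsRegression reports whether a change is worth investigating further:
// the effect is large enough, the confidence interval excludes zero,
// and the experiment is not marked erratic.
func IsRegression(r Result) bool {
	bigEnough := math.Abs(r.DeltaMeanPct) >= effectSizeTolerance
	excludesZero := r.CILow > 0 || r.CIHigh < 0
	return bigEnough && excludesZero && !r.Erratic
}
```

With the figures from the table, `file_tree` (+0.30, CI [+0.18, +0.43]) has an interval that excludes zero but stays below the 5.00% tolerance, so it would not be flagged.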


Removing the _dd.span_events.has_exception tag would require looping through the whole chunk, which is not ideal. Also:
- the tag is hidden (although visible in devtools)
- it indicates why the chunk was sampled if there is no error span
- the error sampler can be disabled, so there would be no cleanup

And while it is there, this tag could also be used at processing to detect exception span events without even unmarshalling.


mackjmr · 2024-10-18T11:44:21Z

pkg/trace/api/otlp.go

@@ -517,6 +517,12 @@ func (o *OTLPReceiver) convertSpan(rattr map[string]string, lib pcommon.Instrume
 	if in.Events().Len() > 0 {
 		transform.SetMetaOTLP(span, "events", transform.MarshalEvents(in.Events()))
 	}
+	for i := range in.Events().Len() {


We're already iterating through events and checking for exceptions here; it may be more efficient to add span.Meta["_dd.span_events.has_exception"] = "true" below that line.

It also seems like Status2Error gets called in both paths you changed, so moving to there would only require adding the logic in 1 place.

From what I can see, there is a check in Status2Error here so that it is only applied to error spans.

The point being to consider non error spans that do contain exception span events in the error sampler (in addition to error spans), I don't think I can move the tag setting there unfortunately.

The point being to consider non error spans that do contain exception span events in the error sampler

Ah, wasn't aware of this. Agreed you can't add to Status2Error then. May be good to still create a func to avoid duplication but is fine as is too, approving.

Moved into its own func in 7f8a8e5, cleaner indeed!
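
For readers following the thread, the extracted helper could look roughly like the sketch below. This is not the code from 7f8a8e5: `Span`, `SpanEvent`, and the function name are simplified, hypothetical stand-ins for the agent's own types, and the event name `"exception"` is the one OpenTelemetry uses for exception span events. Only the tag key and its `"true"` value come from the discussion above.

```go
package main

// SpanEvent and Span are simplified stand-ins for the OTLP span event
// and the Datadog span types; the real types in the trace-agent differ.
type SpanEvent struct {
	Name string
}

type Span struct {
	Meta map[string]string
}

// tagSpanIfContainsExceptionSpanEvent (hypothetical name) sets the hidden
// "_dd.span_events.has_exception" tag as soon as one event named
// "exception" is found, regardless of the span's error status.
func tagSpanIfContainsExceptionSpanEvent(span *Span, events []SpanEvent) {
	for _, ev := range events {
		if ev.Name == "exception" {
			if span.Meta == nil {
				span.Meta = make(map[string]string)
			}
			span.Meta["_dd.span_events.has_exception"] = "true"
			return
		}
	}
}
```

Keeping this in one function means both OTLP conversion paths can call it, which is the duplication concern raised in the review.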

pgimalac

LGTM for ASC, just a few nitpicks

pkg/config/config_template.yaml

pkg/config/setup/apm.go


ichinaski · 2024-10-18T12:00:28Z

pkg/trace/agent/agent.go

+	// Trace chunks that don't contain errors are dropped.
+	if a.conf.ErrorTrackingStandalone {
+		return a.errorSampling(now, ts, pt)
+	}


I think this bypass may be too aggressive. The trace-agent needs to run probabilistic samplers to adapt sampling rates returned to the tracer. If this is not run, the tracer will miss updates on sampling rates.

May I get your thoughts on this @ajgajg1134 ?

edit: looking into runSamplers() func in this very same file, I think this may be a better place to put this logic. It already contains conditional statements on which samplers should be run (eg ProbabilisticSampler).

That may be more of a Product question but here is the reasoning:

- if a user sets a host to ETS, we only want to keep chunks with errors (or exception span events)
- the error sampler itself may not be needed - likely users would like all their errors and then use exclusion filters - but we can still leave it if they want to sample
- apart from the priority sampler with manual drop, all sampling decisions could be to keep regardless of the presence of errors - which we don't want as we only need to send chunks with errors

Then I have to admit I didn't know the probabilistic sampler talked to the tracer, but I guess we wouldn't need that if the host is set as ETS and we never run it?

I discussed with @dussault-antoine this week and we concluded that it was fine to bypass all samplers but the error sampler.

Then I don't remember if I considered putting the logic in runSamplers(), maybe that would give better readability?
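
To make the two options in this thread concrete, here is a minimal sketch of what placing the standalone check inside a `runSamplers`-style function could look like. Everything here is a simplification under stated assumptions: `Config`, `Chunk`, `Sampler`, and the signature are hypothetical, and the non-standalone branch is only a placeholder for the agent's real sampler chain.

```go
package main

// Config and Chunk are simplified, hypothetical stand-ins for the
// agent configuration and a processed trace chunk.
type Config struct {
	ErrorTrackingStandalone bool
}

type Chunk struct {
	// HasError is true if the chunk contains an error span or a span
	// with exception span events.
	HasError bool
}

// Sampler decides whether a chunk is kept.
type Sampler func(Chunk) bool

// runSamplers sketches the placement suggested in the review: the
// standalone check lives next to the other conditions that choose
// which samplers run.
func runSamplers(conf Config, c Chunk, errorSampler Sampler, others []Sampler) bool {
	if conf.ErrorTrackingStandalone {
		// All other samplers are bypassed: chunks without errors are
		// dropped, chunks with errors go through the error sampler only.
		return c.HasError && errorSampler(c)
	}
	// Placeholder for the regular chain: keep if any other sampler keeps.
	for _, s := range others {
		if s(c) {
			return true
		}
	}
	return c.HasError && errorSampler(c)
}
```

In this shape the bypass reads as one more branch of the existing decision logic rather than an early return in the caller, which is the readability point raised above.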

ichinaski · 2024-10-18T12:14:44Z

pkg/trace/agent/agent.go

+			if span.Error != 0 || spanContainsExceptionSpanEvent(span) {
+				span.Meta["_dd.error_tracking_backend_standalone.error"] = "true"
+			} else {
+				span.Meta["_dd.error_tracking_backend_standalone.error"] = "false"


This tag is pretty much derived from existing span properties. Can't we avoid the propagation of this tag on every span and resolve the value, if needed, in the backend? It would make the transport more efficient and less costly.

We are going to need the tag for billing: there is no other way to know the spans come from ETS (we don't want to bill them for APM).
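
A minimal sketch of the tagging discussed here, based on the diff hunk above. The `Span` type is a simplified stand-in, `tagForStandalone` is a hypothetical name, and the body of `spanContainsExceptionSpanEvent` is an assumption (it reads the `_dd.span_events.has_exception` tag); the actual implementation in the PR may differ.

```go
package main

// Span is a simplified stand-in for the agent's span type.
type Span struct {
	Error int32
	Meta  map[string]string
}

// etsTag is the tag key shown in the diff.
const etsTag = "_dd.error_tracking_backend_standalone.error"

// spanContainsExceptionSpanEvent is assumed here to read the hidden
// has_exception tag set during OTLP conversion.
func spanContainsExceptionSpanEvent(s *Span) bool {
	return s.Meta["_dd.span_events.has_exception"] == "true"
}

// tagForStandalone (hypothetical name) marks every span of a kept chunk
// so the backend can tell the spans came from Error Tracking Standalone.
func tagForStandalone(chunk []*Span) {
	for _, s := range chunk {
		if s.Meta == nil {
			s.Meta = make(map[string]string)
		}
		if s.Error != 0 || spanContainsExceptionSpanEvent(s) {
			s.Meta[etsTag] = "true"
		} else {
			s.Meta[etsTag] = "false"
		}
	}
}
```

Every span in the chunk receives the tag, with `"true"` or `"false"` depending on whether that span itself carries an error or an exception span event.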

ichinaski · 2024-10-18T13:35:49Z

pkg/trace/agent/agent.go

+// Also sets "DroppedTrace" on the chunk.
+func (a *Agent) errorSampling(now time.Time, ts *info.TagStats, pt *traceutil.ProcessedTrace) (keep bool, numEvents int) {
+	sampled := a.runErrorSampler(now, *pt)
+	numEvents = len(a.getAnalyzedEvents(pt, ts))


do we really need analyzed spans when only this error sampler is enabled? 🤔

tbh I don't know a lot about analyzed spans but as it is set to true when running the error sampler in runSamplers(), I figured I would have it as well

ichinaski · 2024-10-18T13:43:03Z

pkg/trace/agent/agent.go

+	return false
+}
+
+func traceContainsErrorOrExceptionSpanEvent(trace pb.Trace) bool {


We have a very similar function already, maybe it's worth extending traceContainsError() adding a boolean parameter to determine whether exceptions should be considered?

that also works! done in ce0d656
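
The suggestion above could be sketched as follows. This is not the code from ce0d656: `Span` is a simplified stand-in, the parameter name `considerExceptionEvents` is hypothetical, and detecting exception span events via the `_dd.span_events.has_exception` tag is an assumption based on the earlier thread.

```go
package main

// Span is a simplified stand-in for the agent's span type.
type Span struct {
	Error int32
	Meta  map[string]string
}

// traceContainsError reports whether any span in the trace is an error
// span. When considerExceptionEvents is true, a span carrying exception
// span events also counts, even if its error status is not set.
func traceContainsError(trace []*Span, considerExceptionEvents bool) bool {
	for _, s := range trace {
		if s.Error != 0 {
			return true
		}
		if considerExceptionEvents && s.Meta["_dd.span_events.has_exception"] == "true" {
			return true
		}
	}
	return false
}
```

A single function with a flag keeps existing callers unchanged (they pass `false`) while the standalone path opts in to exception span events.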

aliciascott

good for docs

Guillaume-Barrier added 2 commits October 11, 2024 16:30

Consider spans with exception spanEvents as errors

42def98

Error Tracking will support OpenTelemetry exception span events as issues. The sampler should not drop spans that don't have an error status but do have exception span events.

Add Error Tracking Standalone Config option

d274bb0

ErrorTrackingStandalone is a boolean. When set to true, all samplers but the error sampler are bypassed. Kept spans are specifically tagged.

Guillaume-Barrier and others added 7 commits October 14, 2024 10:48

Merge branch 'main' into guillaume.barrier/support-span-events-in-error-sampler

182e2de

Add tag in receive span v2

8a5fddf

There is now a second method to convert otlp spans to dd spans. Adding the has_exception tag in this method as well.

Merge branch 'guillaume.barrier/support-span-events-in-error-sampler' into guillaume.barrier/add-error-tracking-standalone-config-option

81aa305

Merge branch 'guillaume.barrier/support-span-events-in-error-sampler' into guillaume.barrier/add-error-tracking-standalone-config-option

7239492

Add missing setup of config ET standalone

ef36b54

Guillaume-Barrier changed the base branch from guillaume.barrier/support-span-events-in-error-sampler to main October 18, 2024 08:29

Merge branch 'main' into guillaume.barrier/add-error-tracking-standalone-config-option

e908828

github-actions bot added the team/agent-apm trace-agent label Oct 18, 2024

Only consider exception span events for ETS

a097126

Guillaume-Barrier marked this pull request as ready for review October 18, 2024 11:32

Guillaume-Barrier requested review from a team as code owners October 18, 2024 11:32

Guillaume-Barrier requested a review from dinooliva October 18, 2024 11:32

Guillaume-Barrier mentioned this pull request Oct 18, 2024

Consider spans with exception spanEvents as errors #30064

Closed

mackjmr reviewed Oct 18, 2024

pgimalac approved these changes Oct 18, 2024

pkg/config/config_template.yaml (outdated, resolved)

pkg/config/setup/apm.go (outdated, resolved)

mackjmr approved these changes Oct 18, 2024

Guillaume-Barrier added 2 commits October 18, 2024 14:13

Address comment

f62c51d

Indent subkeys in the config template and set default to false for error tracking standalone flag.

Move span tagging into its own method

7f8a8e5

ichinaski reviewed Oct 18, 2024

aliciascott approved these changes Oct 18, 2024

Use traceContainsError to detect exception events

ce0d656
