Add Sampling SIG research notes #213

Draft: wants to merge 19 commits into main

Conversation

@spencerwilson (Contributor) commented Sep 8, 2022

  • Create text/experimental/notes/ directory, containing a partial OTEP ("Motivation" section only) and two research documents on the sampling landscape

The scope of the OTEP (which appears in this PR in partial form) is a prerequisite to #191.

This PR needn't be merged; its purpose is to provide a vehicle for review and iteration by the Sampling SIG. I previously distributed these documents as GitHub gists, but the SIG decided a PR would be a better venue. Once the PR has served its purpose, it can be closed.

@spencerwilson requested a review from a team September 8, 2022
1. "Statistics" can be anything from [RED metrics](https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/) by service, to data used to answer richer questions like "Which dimensions of trace data are correlated with higher error rate?". You want to ensure that all inferences made from the data you *do* collect are valid.
2. Setting sampling error targets is akin to setting Service Level Objectives: just as one aspires to build *appropriately reliable* systems, so too one needs statistics which are *just accurate enough* to get valid insights from, but not so accurate that you excessively sacrifice goals #1 and #2.
3. An example consequence of this goal being unmet: metrics derived from the trace data become spiky and unfit for purpose.
4. Ensure traces are complete.

I feel this is a bit too strong. While we want to see many complete traces, we must not require that all traces are complete. Consider infrequently used sub-services which might not get enough representation when all sampling decisions are made at the root. BTW, poor coverage for such services is a weak point across the whole surveyed landscape today.

- limiting: Support all of spans per second, spans per month, and GB per month (approximation is ok)
- degree of limiting: Soft is ok
- horizontally scalable: Yes
- Prioritize tail sampling in Collector over head sampling in SDK

I don't think this fits well here. It is a technical design decision. While I agree with this sentence in practice, users may even have the opposite opinion, as tail-based sampling is generally more expensive than head-based sampling.

1. Reduce or limit costs stemming from the construction and transmission of spans.
2. Analytics queries are faster when searching less data.
2. Respect limits of downstream storage systems.
   1. Trace storage systems often have data ingest limits (e.g., GBs per second, spans per second, spans per calendar month). The costs of exceeding these limits can be either reduced reliability or increased hosting expenditures.

I think similarly "hard" limitations apply for the tracers, collector and the network. Collecting too much data up front can lead to excessive memory usage, CPU or network saturation and can cause not only performance issues, but application malfunction as well.

- Per-stratum limiting: Partition input traces into strata, and sample such that each stratum's throughput does not exceed a threshold.
- Global limiting: Sample such that total throughput doesn't exceed a threshold.

Note that in addition to limiting traces per unit time, there are also use cases to support limting spans per unit time, or bytes per unit time. In such cases the limiter implementation should take care not to impart bias by systematically preferring traces comprising fewer spans, or fewer bytes, over "larger" traces.

typo: "limting"


##### TraceIdRatioBased

`TraceIdRatioBased` may be used to consistently sample or drop a certain fixed percentage of spans. The decision is based on a random value, the trace ID, rather than any of the span metadata available to ShouldSample (span name, initial attributes, etc.). As a result,

Currently, since OpenTelemetry doesn't specify what hashing algorithm to use, my understanding is that this statement is not fully accurate (since different language SDKs could take different approaches), e.g. when it is used for sampling non-root spans. It may be good to clarify that current limitation here.
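
To make the ambiguity described in this thread concrete, below is one possible (non-normative) way a sampler could derive the decision from trace-ID bits. This is only a sketch: since the specification at the time of this discussion does not pin down a single algorithm, two SDKs could legitimately disagree on the same trace ID.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample interprets the low 56 bits of the trace ID as a uniform
// random value and keeps the span when that value falls below ratio * 2^56.
// This is just one of several approaches an SDK might take today.
func shouldSample(traceID [16]byte, ratio float64) bool {
	r := binary.BigEndian.Uint64(traceID[8:16]) & ((1 << 56) - 1)
	threshold := uint64(ratio * (1 << 56))
	return r < threshold
}

func main() {
	id := [16]byte{
		0x4b, 0xf9, 0x2f, 0x35, 0x77, 0xb3, 0x4d, 0xa6,
		0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36,
	}
	fmt.Println(shouldSample(id, 0.25))
}
```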

1. "Statistics" can be anything from [RED metrics](https://www.weave.works/blog/the-red-method-key-metrics-for-microservices-architecture/) by service, to data used to answer richer questions like "Which dimensions of trace data are correlated with higher error rate?". You want to ensure that all inferences made from the data you *do* collect are valid.
2. Setting sampling error targets is akin to setting Service Level Objectives: just as one aspires to build *appropriately reliable* systems, so too one needs statistics which are *just accurate enough* to get valid insights from, but not so accurate that you excessively sacrifice goals #1 and #2.
3. An example consequence of this goal being unmet: metrics derived from the trace data become spiky and unfit for purpose.
4. Ensure traces are complete.

Would something like "Ensure traces are as complete and consistent as possible" better express the intent here?

2. Setting sampling error targets is akin to setting Service Level Objectives: just as one aspires to build *appropriately reliable* systems, so too one needs statistics which are *just accurate enough* to get valid insights from, but not so accurate that you excessively sacrifice goals #1 and #2.
3. An example consequence of this goal being unmet: metrics derived from the trace data become spiky and unfit for purpose.
4. Ensure traces are complete.
1. "Complete" means that all of the spans belonging to the trace are collected. For more information, see ["Trace completeness"](https://github.com/open-telemetry/opentelemetry-specification/blob/v1.12.0/specification/trace/tracestate-probability-sampling.md#trace-completeness) in the trace spec.

Have there been any discussions around how sampling impacts "linked traces" and whether the consistency/completeness goals should support linked traces as well? Since many async operations are modelled as linked traces, I am trying to understand if there's a way consistent sampling can be achieved across linked traces.


Notes:

- TODO(Spencer): Look at https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/telemetryquerylanguage/tql and give feedback. Split out from transformprocessor.

Suggested change
- TODO(Spencer): Look at https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/telemetryquerylanguage/tql and give feedback. Split out from transformprocessor.
- TODO(Spencer): Look at https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/ottl and give feedback. Split out from transformprocessor.

@jmacd (Contributor) commented Dec 8, 2022

FYI I would like to merge this for the record. Thank you @spencerwilson!!

@tedsuo marked this pull request as draft January 9, 2023
@tedsuo added the triaged label Jan 30, 2023

@bputt-e commented Apr 21, 2023

Not sure if this is something this PR will address, but we were experimenting with a client/API that enabled folks to set their sampling rate by span attributes. Using RQL would allow =, !=, regex, and like/not-like statements, and would give more control over the data being traced. Yes, it does require that the information used to determine the sampling rate be available when a span is first created.

Our use case was to let folks increase sampling rates for a given value (or set of values) without having to make any code changes; the "sampling rules" would recompile each time the client received a change from the API (i.e., when a user makes a change). Moving forward, RQL isn't required, but flexibility could be helpful if folks want more than = / != for comparing span attributes against what they want to sample by.
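
To illustrate the idea in this comment, here is a minimal sketch in Go (all names are hypothetical; this is not an OpenTelemetry or RQL API) of attribute-keyed sampling rules supporting equality, inequality, and regex matches, each mapping to a rate. In the scenario described above, the rule set would be rebuilt whenever the control API pushes a change; only the matching logic is shown.

```go
package main

import (
	"fmt"
	"regexp"
)

type op int

const (
	opEq op = iota
	opNeq
	opRegex
)

// rule maps a span-attribute predicate to a sampling rate.
type rule struct {
	attribute string
	op        op
	value     string
	re        *regexp.Regexp // compiled once when rules are (re)loaded
	rate      float64        // sampling rate applied when the rule matches
}

func (r rule) matches(attrs map[string]string) bool {
	v, ok := attrs[r.attribute]
	if !ok {
		return false
	}
	switch r.op {
	case opEq:
		return v == r.value
	case opNeq:
		return v != r.value
	case opRegex:
		return r.re.MatchString(v)
	}
	return false
}

// rateFor returns the rate of the first matching rule, else a default rate.
func rateFor(rules []rule, attrs map[string]string, defaultRate float64) float64 {
	for _, r := range rules {
		if r.matches(attrs) {
			return r.rate
		}
	}
	return defaultRate
}

func main() {
	rules := []rule{
		{attribute: "http.route", op: opRegex, re: regexp.MustCompile(`^/checkout`), rate: 1.0},
		{attribute: "tenant", op: opEq, value: "acme", rate: 0.5},
	}
	attrs := map[string]string{"http.route": "/checkout/cart", "tenant": "acme"}
	fmt.Println(rateFor(rules, attrs, 0.01)) // first matching rule wins
}
```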

@tedsuo (Contributor) commented Jul 31, 2023

@spencerwilson we are cleaning up stale OTEP PRs. If there is no further action at this time, we will close this PR in one week. Feel free to open it again when it is time to pick it back up.

@tedsuo added the stale label Jul 31, 2023