Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question: Exemplar + SimpleFixedSizeExemplarReservoir + Cumulative aggregation #3891

Open
CodeBlanch opened this issue Feb 20, 2024 · 13 comments
Assignees
Labels
spec:metrics Related to the specification/metrics directory triage:deciding:community-feedback

Comments

@CodeBlanch
Copy link
Member

CodeBlanch commented Feb 20, 2024

The spec for SimpleFixedSizeExemplarReservoir says:

Any stateful portion of sampling computation SHOULD be reset every collection cycle. For the above example, that would mean that the num_measurements_seen count is reset every time the reservoir is collected.

The spec for ExemplarReservoir says:

The "collect" method MUST return accumulated Exemplars. Exemplars are expected to abide by the AggregationTemporality of any metric point they are recorded with. In other words, Exemplars reported against a metric data point SHOULD have occurred within the start/stop timestamps of that point. SDKs are free to decide whether "collect" should also reset internal storage for delta temporal aggregation collection, or use a more optimal implementation.

I'm working on exemplars for .NET and I want to make sure I understand this correctly.

It makes total sense to me for delta aggregation. But let's say we are in cumulative aggregation.

I have SimpleFixedSizeExemplarReservoir with 3 slots. I have recorded 3 exemplars A, B, & C. A collection occurs and I export A, B, & C.

I reset num_measurements_seen after that collection following what the spec says to do.

Now a new measurement happens and it goes into the first slot due to the reset.

On the next collection do I send...

  • D
    or
  • D, B, C

The first case seems a little odd (to me) given we're in cumulative mode. But so does the second. Why would we give unfair priority to the first slot? If I don't reset num_measurements_seen then D may or may not end up in any slot and we would see either A, B, C again or one of (D, B, C / A, D, C / A, B, D). This seems more reasonable (to me) for cumulative mode.

Apologies in advanced if this has already been discussed.

@CodeBlanch CodeBlanch added the spec:metrics Related to the specification/metrics directory label Feb 20, 2024
@MrAlias
Copy link
Contributor

MrAlias commented Feb 21, 2024

On the next collection, if there are no other measurements offered, you will send D, B, C, A, D, C, or A, B, D, but you won't send D. Sending D not produce a cumulative result.

The order of the exemplars in the reservoir, or the export, has no significance. Remember, they should be randomly replaced once the reservoir is full anyway. Users are not expected to understand or use the order they are exported in any way. The only important part here is going to be consistent and complete in your algorithm.

Given you have reset your count and need to replace all of the previous exemplars before replacing the new it is likely easier to structure your algorithm in a predicable way so you don't have to track these replacements. For example, if you make measurement D and have D, B, C, you cannot then be offered measurement E and hold E, B, C. The definition of the reservoir requires that either B or C is replaced instead.

The example algorithm R makes the replacements by starting at the front of the storage and working to the end before starting random replacements. You are free to, say, start at the back and work to the front, or maybe some middle out approach, but the required thing is you are consistent in your complete replacement of prior measurements before you start random evictions.

@CodeBlanch
Copy link
Member Author

I agree with most of this. The part I don't agree with is that in cumulative we should ever reset num_measurements_seen. In cumulative IMO once the SimpleFixedSizeExemplarReservoir fills up, we should just stay there and let all future exemplars have an equal random chance to get into the reservoir.

It might help in the SimpleFixedSizeExemplarReservoir section to have some examples?

Something like htis...

SimpleFixedSizeExemplarReservoir

...existing content goes here minus the reset bit...

Aggregation behavior and collection

In delta temporal aggregation any stateful portion of sampling computation SHOULD be reset every collection cycle. For the above example, that would mean that the num_measurements_seen count is reset every time the reservoir is collected. In cumulative temporal aggregation collection SHOULD NOT reset num_measurements_seen count.

Aggregation example

Given a SimpleFixedSizeExemplarReservoir with 3 slots...

  • SimpleFixedSizeExemplarReservoir Delta Aggregation

    • Exemplar A, B, & C recorded.
    • Collect called. Exemplar A, B, & C returned.
    • num_measurements_seen reset.
    • Exemplar D recorded.
    • Collect called. Exemplar D returned.
  • SimpleFixedSizeExemplarReservoir Cumulative Aggregation

    • Exemplar A, B, & C recorded.
    • Collect called. Exemplar A, B, & C returned.
    • Exemplar D offered. Depending on the result of the random operation the storage of the reservoir becomes one of these states:
      • A, B, & C.
      • D, B, & C.
      • A, D, & C.
      • A, B, & D.
    • Collect called and the 3 exemplars in the reservoir are returned.

@MrAlias
Copy link
Contributor

MrAlias commented Feb 21, 2024

You will prioritize exemplars from the start of the application instead of the most recent collection cycle given the probability of inclusion diminishes the more observations you see. That is not what the user wants. Users do not want to see measurements from the first collection cycle in the Nth collection cycle, unless there haven't been any measurements.

@CodeBlanch
Copy link
Member Author

I don't see how your logic solves anything. Let's say my app gets very busy and I fill up a SimpleFixedSizeExemplarReservoir with 10 exemplars:

[A, B, C, D, E, F, G, H, I, J]

A collection fires.

Now my app is less busy and I see only two new exemplars: K, L.

A collection fires.

[K, L, C, D, E, F, G, H, I, J]

My app stays less busy and I see only two exemplars: M, N.

A collection fires.

[M, N, C, D, E, F, G, H, I, J]

We are still sending things which were part of the first cycle, except we've given unfair weight to new things.

Users do not want to see measurements from the first collection cycle in the Nth collection cycle, unless there haven't been any measurements.

Is that what users want? In cumulative they want to see everything which has happened between start & end times which are the lifetime of the process, no?

@MrAlias
Copy link
Contributor

MrAlias commented Feb 21, 2024

unfair weight to new thing

How is this unfair?

@MrAlias
Copy link
Contributor

MrAlias commented Feb 21, 2024

In your example what if the next collection cycle is just as busy as before, with hundreds of measurements offered. Based on random probability you still return [K, L, C, D, E, F, G, H, I, J] to your users.

Is this really what they want? They want measurements from previous collections even though there have been hundreds more?

@CodeBlanch
Copy link
Member Author

Is this really what they want? They want measurements from previous collections even though there have been hundreds more?

TBH, I don't know 😄

The spec does say...

Exemplars reported against a metric data point SHOULD have occurred within the start/stop timestamps of that point.

Collection cycle doesn't seem relevant, only start/stop time. For cumulative our start/stop is the lifetime of the process so why wouldn't I see a mix of exemplars recorded over the lifetime of the process for any given collect?

@cijothomas
Copy link
Member

You will prioritize exemplars from the start of the application instead of the most recent collection cycle given the probability of inclusion diminishes the more observations you see. That is not what the user wants. Users do not want to see measurements from the first collection cycle in the Nth collection cycle, unless there haven't been any measurements.

Agree. This is currently ensured by the spec.

@reyang
Copy link
Member

reyang commented Feb 28, 2024

The TC discussed about this during the spec issue triage, here is the suggestion - check what Prometheus does, and if Prometheus is doing the same (only have exemplars from the latest collection cycle), we can close the issue as it aligns with the current spec.

FYI @dashpole if you could help to confirm.

@reyang reyang added the triaged-needmoreinfo The issue is triaged - the OTel community needs more information to decide label Feb 28, 2024
@MrAlias
Copy link
Contributor

MrAlias commented Feb 28, 2024

Prometheus has exemplar storage ...

implemented as a fixed size circular buffer that stores exemplars in memory for all series.

This can be verified here.

Prometheus does match our behavior in that they only show the latest collection cycle if there were enough exemplars offered. They explicitly evict oldest from the last collection cycle first, like the example in our spec suggest.

@dashpole
Copy link
Contributor

@MrAlias I think that is about how exemplars are stored in the prometheus server. For clients:

For histograms, Prometheus stores one exemplar per-bucket. When a new observation is made with an exemplar, it always overwrites the exemplar for that bucket: https://github.com/prometheus/client_golang/blob/14259fa70cfb69f1262f69fdfe58ed5e6318d441/prometheus/histogram.go#L745. For counters, it only keeps a single exemplar. When a new observation is made with an exemplar, it always overwrites the existing exemplar for the counter: https://github.com/prometheus/client_golang/blob/14259fa70cfb69f1262f69fdfe58ed5e6318d441/prometheus/counter.go#L172.

If there haven't been any observations for the counter during a scrape interval, it would still export the exemplar from the previous interval.

Overall, Prometheus will always return the most recent exemplar recorded for a series regardless of whether it has been exported/scraped.

@jmacd
Copy link
Contributor

jmacd commented Feb 28, 2024

I support the statement by @reyang -- we should ensure the default behavior with cumulative aggregation matches Prometheus. I agree with @MrAlias that the "unfairness" is intentional, that the specified behavior as written ensures that recent sampled traces will be exposed while they are still recent. I agree with @CodeBlanch that the concept of "collection time" is awkward when talking about cumulative data.

except we've given unfair weight to new things

I want to make sure this concept is well-defined, as it stems from an algorithmic problem with simple random sampling. When you have a two populations of different sizes, and you take a simple random sample of equal size from each, the resulting samples cannot be recombined fairly using a simple algorithm, such that the items are equally representative (*).

What we can do, instead, is combine the two samples into unequal-probability or weighted samples. My preferred algorithm for this task is VarOpt. Here's a Golang reference implementation. Here's how this can work for cumulative collection--

Say we define a semantic conventional-attribute, to be included in the exemplar, stating the adjusted count of the exemplar (e.g. adjusted.count). When we use simple random sampling and there are M events between collection cycles with an N-size reservoir, each exemplar has adjusted count of M/N, so the sum of the collected exemplars adjusted.count equals num_measurements_seen each cycle.

In order to compute a single, fair sample over the lifetime of all cumulative collection cycles, create a VarOpt sampler of the intended size. Next, enter every exemplar that has been (ever) collected into the VarOpt sampler using the adjusted.count value as weight. At the end of this process, the resulting sample will cover all collection cycles -- however we have to update the adjusted.count of each exemplar as it is output. After VarOpt finishes its calculation, the sum of the output exemplars adjusted.count values is expected to equal the sum of the input exemplars adjusted.count values -- which corresponds with any cumulative metrics that have been collected. More importantly, the individual exemplar adjusted.count are individually representative, without any recency bias.

This is a powerful technique because it applies to all subsets of collected exemplars. Choose any subset of exemplars and sum their adjusted counts (or recombine them and sum their recombined adjusted counts), and the resulting total is expected to match the cumulative counter or histogram-counter from which it was derived. This means you can actually dispense with the timeseries data entirely, a set of exemplars with accurate adjusted counts can estimate any timeseries you might have otherwise computed, and this works with any aggregation temporality.

@CodeBlanch If this interests you, I would be glad to support development on this semantic convention. I have this prototype for the Lightstep metrics SDK.

(*) This is not to say there are not equal-representativity algorithms -- see "Conditional Poisson Sampling" schemes.

@CodeBlanch
Copy link
Member Author

@jmacd I'm happy to work on this with you but I'll admit, a lot of that went over my head 🤣

Let me put down what I think you are saying:

For SimpleFixedSizeExemplarReservoir we would reset its state the same way for delta & cumulative aggregations. Meaning, nothing from the previous collect will ever show up on the next one. When in cumulative, we add a tag to the recorded exemplars which tracks the count?

@jmacd jmacd self-assigned this Apr 24, 2024
@austinlparker austinlparker added triage:deciding:community-feedback and removed triaged-needmoreinfo The issue is triaged - the OTel community needs more information to decide labels Apr 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
spec:metrics Related to the specification/metrics directory triage:deciding:community-feedback
Projects
None yet
Development

No branches or pull requests

8 participants