
Disambiguate http.*.duration and/or split them into separate metrics #3520

Closed

antonfirsov opened this issue May 25, 2023 · 26 comments

Labels: spec:metrics (Related to the specification/metrics directory)

@antonfirsov

antonfirsov commented May 25, 2023

What are you trying to achieve?

The .NET team is working on the implementation of http.client.duration for HttpClient, and we expected this metric to have a clear definition of the measured span: whether it is the duration to the first byte of the response body vs. the duration to the last byte of the body. Unfortunately this is not the case, which leads to confusion and disagreement about how the metric should be interpreted.

From the discussion in #3519 it seems like both interpretations are valuable, and the preferred interpretation depends on the application. This makes me think that there should be two separate metrics instead of an ambiguous http.client.duration.

Additional context.

Statements from the discussion:

Java HTTP client instrumentation ends the HTTP client span before any user callbacks are executed (so that user code execution won't be included in the HTTP client span duration)

Because [...] general complications outside of .NET to measure time to last byte, I argue that HTTP request duration should represent time-to-first-byte.

So the Java implementation has already chosen the "time to first byte" interpretation, and some frameworks may have difficulty implementing the "time to last byte" version. It would be very unfortunate if different frameworks/libraries interpreted the metric in different ways.

The same problem also applies to http.server.duration, which has already been implemented as "time to last byte" in .NET and may have been (?) interpreted differently in other existing implementations.

Recommendation.

  • As a minimum, this aspect of http.client.duration and http.server.duration should be disambiguated in the specs. Because of the statements quoted, I assume the preferred interpretation would be "time to first byte"?
  • Since there are good arguments for "time to last byte" being a valuable metric for many applications, consider defining http.client.duration-to-last-byte and http.server.duration-to-last-byte.
  • For better clarity, consider renaming http.client.duration to http.client.duration-to-first-byte, though this would break existing implementations, and potentially delay the .NET delivery.
@antonfirsov antonfirsov added the spec:metrics Related to the specification/metrics directory label May 25, 2023
@JamesNK

JamesNK commented May 25, 2023

I think http.server.duration should measure from the start of the HTTP request to when the HTTP response has finished downloading. If you ask most people what the duration of an HTTP request is, that is the answer they will give you. That interpretation follows the principle of least surprise.

I know opentelemetry-dotnet currently has a counter that ends the duration when headers are received. I took a look at what opentelemetry-js does, and it appears to complete the duration on the "end" event, which is when the response has finished being read.

Options:

  • A new counter could be added for receiving the first byte. It could be called http.client.duration-to-first-byte. Or http.client.duration-to-headers (response headers are always the first thing the server returns). However, I think you should wait until people ask for this before adding a new counter that implementations should implement.
  • Clarify what http.server.duration means. I think it should be from the start of the request to when the response has finished being received. For frameworks that can't measure that, a note should be added allowing them to measure from the start of the request to when response headers are received.
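For concreteness, the two candidate end points can be sketched with .NET's HttpClient (a minimal illustration only, not a proposed instrumentation; the URL and variable names are made up):

```csharp
using System.Diagnostics;
using System.IO;
using System.Net.Http;

using var client = new HttpClient();
var sw = Stopwatch.StartNew();

// With ResponseHeadersRead, SendAsync returns as soon as the
// response headers are available, before the body is downloaded.
using var response = await client.SendAsync(
    new HttpRequestMessage(HttpMethod.Get, "https://example.com"),
    HttpCompletionOption.ResponseHeadersRead);
var timeToHeaders = sw.Elapsed;   // candidate: "time to first byte" / headers

// Draining the body moves the end point to response completion.
await response.Content.CopyToAsync(Stream.Null);
var timeToLastByte = sw.Elapsed;  // candidate: "time to last byte"
```

The ambiguity in the spec is exactly which of these two elapsed times http.client.duration should record.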

@JamesNK

JamesNK commented May 25, 2023

This same question applies to tracing of HttpClient: https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/semantic_conventions/http.md#http-client

When should the trace start and when does it stop?

  • Start - Client sends HTTP request headers. I think everyone agrees on this.
  • Stop - Client receives HTTP response headers or HTTP response finishes?

@lmolkova
Contributor

lmolkova commented May 25, 2023

I think http.server.duration should measure from the start of the HTTP request to when the HTTP response has finished downloading. If you ask most people what the duration of an HTTP request is, that is the answer they will give you. That interpretation follows the principle of least surprise.

I wish there were such a consensus.

developer.mozilla.org says

Network latency is the time it takes for a data request to get from the computer making the request, to the computer responding. Including the time it takes for a byte of data to make it from the responding computer back to the requesting computer.

random article on the internet says

Response time is sometimes defined as Time to First Byte (TTFB),

Wikipedia says

> Note that round trip latency excludes the amount of time that a destination system spends processing the packet

Wikipedia also says

> Time to first byte (TTFB) is a measurement used as an indication of the responsiveness of a webserver

@JamesNK

JamesNK commented May 26, 2023

Those articles are talking about latency. Looking at those, it occurs to me that http.client.latency is a pretty good name instead of something awkward like http.client.duration-first-byte.

  • http.client.latency - Time to receive the response headers
  • http.client.duration - Time to the response being complete

@antonfirsov
Author

  • http.client.latency - Time to receive the response headers
  • http.client.duration - Time to the response being complete

Thinking long-term, this is the cleanest possible resolution of the problem.

@trask
Member

trask commented May 26, 2023

I think allowing flexibility in how http.client.duration is calculated may be useful for now:

  • Instrumentation which is external to the library (which is most OpenTelemetry HTTP client instrumentation at this point) typically has limited instrumentation options compared to native library instrumentation
  • I think it's ideal to have a single metric that matches whatever duration was recorded on the HTTP client span

If there's agreement, we could add a recommendation for http.client.duration to be duration-to-completion when that can be done efficiently and reliably (which are typical challenges for non-native library instrumentation), and we could recommend that HTTP client instrumentations document their behavior around how they calculate duration.

@lmolkova
Contributor

lmolkova commented May 26, 2023

I believe we need both metrics (not necessarily now and however we call them). Despite challenges for non-native HTTP client instrumentations, client libraries (such as Azure SDK, Camel, Spring integrations, .NET Polly), service meshes, and user applications should be able to record both without any problem.

If we allow flexibility in what http.client.duration means (i.e. the best client call duration we can measure, ideally to completion), making it more precise in the future will be breaking.

One option I see is eventually having three metrics: TTFB (latency), TTLB (duration), client (no user code) call duration.

Client call duration should be implemented in all instrumentations to enable default and consistent experience.
TTFB/TTLB can be supported by certain instrumentations which can populate them. They can be opt-in and users will have control to enable one or both if they care.

I don't see other options if we keep the http.client.duration definition flexible. @trask am I missing something? How do you envision us evolving it?
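The three-metric split could be sketched with System.Diagnostics.Metrics as follows (the metric names here are placeholders, not agreed conventions):

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("Example.Http.Client");

// 1. TTFB: time until response headers / first byte are received.
var latency = meter.CreateHistogram<double>(
    "http.client.latency", unit: "s", description: "Time to first byte");

// 2. TTLB: time until the last byte of the response body is read.
var duration = meter.CreateHistogram<double>(
    "http.client.duration", unit: "s", description: "Time to last byte");

// 3. Client call duration: time spent inside the client call itself,
//    excluding user callbacks; implementable by all instrumentations.
var callDuration = meter.CreateHistogram<double>(
    "http.client.call.duration", unit: "s", description: "Client call duration");
```

Under this model, (3) would be mandatory and consistent everywhere, while (1) and (2) would be opt-in for instrumentations that can populate them.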

@lmolkova
Contributor

lmolkova commented May 26, 2023

WRT consistency with span duration on .NET: it feels really confusing to have the HTTP span duration implicitly cover all of this code:

var response = await client.SendAsync(request);
// do something long and potentially failing here
var stream = response.Content.ReadAsStream();
while (canRead(stream)) {            // canRead/readNextLine are pseudocode
  var line = await readNextLine(stream);
  // do something long and potentially failing here
}

It would be a change in behavior and would require users to add instrumentation around var response = await client.SendAsync(request); to measure just the response time.
It would also likely be .NET-specific behavior, which is not great either.

From a tracing perspective, I would model it as a span for inner HTTP handler and a separate (non HTTP) span for reading a stream.

Content mode with buffered content is obvious here:

[screenshot omitted]

Header mode is obvious here:

[screenshot omitted]

Plus, stream instrumentation is useful beyond HTTP in sockets or IO, and this way we create orthogonal instrumentations with clearer semantics.
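The tracing model described above could be sketched with ActivitySource like this (illustrative only; the source name, URL, and span names are made up):

```csharp
using System.Diagnostics;
using System.IO;
using System.Net.Http;

var source = new ActivitySource("Example.Instrumentation");
using var client = new HttpClient();

HttpResponseMessage response;
using (var httpSpan = source.StartActivity("GET", ActivityKind.Client))
{
    // The HTTP span covers only the inner handler call,
    // ending when response headers arrive.
    response = await client.SendAsync(
        new HttpRequestMessage(HttpMethod.Get, "https://example.com"),
        HttpCompletionOption.ResponseHeadersRead);
}

using (var readSpan = source.StartActivity("read response stream"))
{
    // Reading the body is a separate, non-HTTP span.
    await response.Content.CopyToAsync(Stream.Null);
}
response.Dispose();
```

This keeps the HTTP span semantics uniform across libraries while still making the stream-read time visible when it matters.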

@lmolkova
Contributor

lmolkova commented May 26, 2023

One thing we learned on the Azure SDK .NET side is that we can't reliably know when users are done reading the content: they frequently don't read all of it and don't dispose of the response. This is extensive and frequent, and we've made some design decisions based on it. I'm probably missing something, but I still don't understand how TTLB can work reliably given this.

@trask
Member

trask commented May 26, 2023

making it more precise in the future will be breaking

Is it breaking based on the semantic convention stability rules?

@lmolkova
Contributor

lmolkova commented May 26, 2023

Is it breaking based on the semantic convention stability rules?

Nope, not formally, but it changes the semantics. One day my request duration is 100 ms (TTFB) and tomorrow it's 20 s (TTLB); as a user, I'd be surprised.
Perhaps we can formally restrict changing the semantics?

UPDATE:

I think it is based on this:

Semantic Conventions defines breaking changes as those that would break the common usage of tooling written against the telemetry it produces. That is, the portions of telemetry where specialized tooling (alerts, dashboards, e.g.) interact are expected to remain stable for that tooling

alerts will be broken

@antonfirsov
Author

@lmolkova @trask any update? Did you manage to touch this topic in your meeting?

@lmolkova
Contributor

lmolkova commented May 31, 2023

@antonfirsov This Monday was a public holiday in the US, so nothing happened. We'll post once we have any update.

@antonfirsov @JamesNK I wonder what your thoughts are on #3520 (comment) and #3520 (comment)?

we don't know which byte is the last (i.e. if the user reads the stream until a condition/error/etc.) and can't guarantee the response stream is closed on time. Arguably, applications that don't follow best practices need observability even more than well-behaved ones, and providing skewed and irrelevant telemetry to them would be a bummer anyway.

@JamesNK

JamesNK commented Jun 1, 2023

@antonfirsov @JamesNK I wonder what your thoughts are on #3520 (comment) and #3520 (comment)?

Think of the duration or span ending as separate from the response stream being read to the end. Instead, the end is tied to the end of the HTTP request (the point where the client releases all resources for it). An end can happen for many reasons:

  • Reading the response to its end. Graceful completion of the HTTP request.
  • HttpResponseMessage is disposed of. Ungraceful client cancellation.
  • Server abort. Ungraceful server cancellation.
  • Timeout. A timeout could come from the server or client, resulting in that side of the request triggering an abort/cancellation. For example, servers and proxies have a lot of functionality to kill idle requests that take up resources to prevent attacks.

The key point here is an HTTP request will end even if someone stops using it and doesn't clean it up. HTTP clients know when the request has ended (for whatever reason).

And if, for some reason, an app is leaving HTTP requests hanging and they are eventually closed because of a server timeout, that is useful information to communicate in metrics.
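One way a client instrumentation could tie the recorded duration to "request end" rather than "headers received" is to wrap the response content stream and record when it is fully read or disposed. This is only a sketch under those assumptions; the RequestEndHandler/EndTrackingStream names and the Record callback are made up, and most error paths are omitted:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

sealed class RequestEndHandler : DelegatingHandler
{
    public RequestEndHandler(HttpMessageHandler inner) : base(inner) { }

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken ct)
    {
        var sw = Stopwatch.StartNew();
        var response = await base.SendAsync(request, ct);
        var inner = await response.Content.ReadAsStreamAsync(ct);
        // Replace the content with a wrapper that reports "request end".
        var content = new StreamContent(new EndTrackingStream(inner, sw, Record));
        foreach (var h in response.Content.Headers)
            content.Headers.TryAddWithoutValidation(h.Key, h.Value);
        response.Content = content;
        return response;
    }

    void Record(TimeSpan elapsed) { /* emit the duration metric here */ }
}

sealed class EndTrackingStream : Stream
{
    readonly Stream _inner;
    readonly Stopwatch _sw;
    readonly Action<TimeSpan> _onEnd;
    bool _done;

    public EndTrackingStream(Stream inner, Stopwatch sw, Action<TimeSpan> onEnd)
        => (_inner, _sw, _onEnd) = (inner, sw, onEnd);

    public override int Read(byte[] buffer, int offset, int count)
    {
        int n = _inner.Read(buffer, offset, count);
        if (n == 0) Complete();        // graceful: last byte was read
        return n;
    }

    protected override void Dispose(bool disposing)
    {
        Complete();                    // disposal before EOF: early release
        if (disposing) _inner.Dispose();
        base.Dispose(disposing);
    }

    void Complete() { if (!_done) { _done = true; _onEnd(_sw.Elapsed); } }

    // Pass-through / unsupported members required by Stream:
    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override long Length => _inner.Length;
    public override long Position
    { get => _inner.Position; set => throw new NotSupportedException(); }
    public override void Flush() => _inner.Flush();
    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
}
```

Note that this fires only on read-to-end or disposal, so the abandoned-response case still depends on the timeouts and aborts listed above.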

@lmolkova
Contributor

lmolkova commented Jun 1, 2023

So we can only measure the responses that are read to the end/cancelled/failed. Assuming none of these things has happened, we don't know when the user is done.

E.g. I can forget to dispose of the response. I can sometimes get away with it. Should I know that I forgot to close it? Probably, via static analysis, but not at the cost of my telemetry's usefulness.

E.g. it's super easy to ignore the response stream for non-successful status codes. It's also super common to send error details in the response body.

Another case, I can write

async Task GetAndProcess(HttpRequestMessage request)
{
    using var response = await client.SendAsync(request);
    var content = readFirst100Bytes(response);   // pseudocode helper

    // do unrelated, heavy and long stuff
    await _storageClient.Upload(content);
    _queue.Publish("new content ....");
}

I.e. when the response is disposed is irrelevant and says nothing about the time the user would be interested in. And asking people to close the response as soon as they're done with it is too much.

@JamesNK

JamesNK commented Jun 1, 2023

So we can only measure the responses that are read to the end/cancelled/failed. Assuming none of these things has happened, we don't know when the user is done.

No, I mentioned timeouts:

Timeout. A timeout could come from the server or client, resulting in that side of the request triggering an abort/cancellation. For example, servers and proxies have a lot of functionality to kill idle requests that take up resources to prevent attacks.

For example:

If you stop using an HTTP request, then servers will kill it. Because, as you said, it's super easy for a client not to clean up a request. If servers didn't kill badly behaving HTTP clients, the internet would quickly DDOS itself because clients are awful (I write and ship HTTP servers).

@lmolkova
Contributor

lmolkova commented Jun 1, 2023

@JamesNK the timeout case is fine, but can you please comment on disposal?

@lmolkova
Contributor

lmolkova commented Jun 1, 2023

To clarify: in the sample above, the user did everything right, but the duration of their span (and metrics) is wrong because they didn't read the stream to the end and didn't close the response right away after being done with it.
With a timeout, the user did something they were used to doing (not disposing the response) and suddenly their telemetry is wrong.

i.e.

async Task GetAndProcess(HttpRequestMessage request)
{
    using var response = await client.SendAsync(request);
    // this is the duration we can reliably measure

    var content = readFirst100Bytes(response);
    // this is when the user is done with the response, but we have no way of knowing

    await _storageClient.Upload(content); ...

    // this is the duration we'll measure as HTTP client call duration, which makes no sense to me
}

@JamesNK

JamesNK commented Jun 1, 2023

The code runs, but did the user do everything right? Leaving HTTP requests open can cause many problems:

  • They left an unused HTTP request active, which is consuming resources on the client and consuming resources on the server. Bad for obvious reasons.
  • In HTTP/1.1 they are holding a TCP socket open. Instead of the request completing and the client returning the TCP socket to the connection pool, new requests are forced to open new TCP sockets. Under high load, this could cause the app to error with TCP socket exhaustion. Even under low load, this causes higher latency because new requests must open a new TCP socket and negotiate TLS. This is much slower than reusing a pooled connection.
  • In HTTP/2 the active request takes up a value against the SETTINGS_MAX_CONCURRENT_STREAMS limit. By default, .NET has a SETTINGS_MAX_CONCURRENT_STREAMS value of 100 active HTTP/2 requests for a server. After that limit, unused requests start queuing on the client. If requests are being left open for 5 seconds before timeout, then it is very easy to kill performance under load.
  • The client or server is forced to abort the request. This will reduce overall server and client performance. I don't know about other servers, but the one I write (Kestrel) only caches and reuses server types when a request is gracefully completed. Aborted requests are considered too unpredictable to reuse. All state for the request is thrown away and must be garbage collected.
  • Aborting many requests under high load can cause HTTP/2 connections to be aborted with the wrong conditions.

Reporting that the request ended when the response headers arrived is false information. It potentially tells someone that the request completed in an orderly and correct way.

On the surface, the HTTP request may look healthy, but the reality is the request is still open. As explained above, that can cause many problems. HTTP request metrics should tell the truth and provide accurate information (it didn't end when you thought it ended) so someone can look at the data, see what is wrong with their app and fix it.

@lmolkova
Contributor

lmolkova commented Jun 1, 2023

The code runs, but did the user do everything right?

The code is correct, but not optimal. Moreover, it can be the only reasonable thing to do when you conditionally need to continue reading the stream and checking the condition is a long operation.

I see what you're saying: by measuring the duration when the response is disposed (or fully completed), we're telling user when resources are released, which is useful information. Users then may decide if they want to optimize or tolerate perf impact from it.

What I'm trying to say is that tracing instrumentations so far have been measuring a different thing, the call duration, which is another piece of useful information: how long it took my application code to get a response at all, before I started running other, potentially unrelated, code.

So, let's try to agree that both pieces of information are useful.
Then, I argue that the first thing to measure is the time to get a response, excluding the user code that reads it, because it's the least surprising (tracing instrumentations, HTTP or not, have been doing something along these lines for the past decade). It's also good to measure because we can do so reliably in generic instrumentations, and it should provide reasonable observability for the absolute majority of apps, including buggy and inefficient ones. Especially those.

Time to full completion is an additive thing, and we need to define the new semantics for it.

@lmolkova
Contributor

lmolkova commented Jun 1, 2023

And one more point. Only by comparing the different times, (1) when the user got a response, (2) when the user is done with it, and (3) when the request is fully completed and all resources are released, can users understand where and what they can optimize. With just (3), they won't have a clue, and the duration will heavily depend on how the application code is written. Refactoring can change this time a lot; it won't be possible to compare similar calls to the same service when made by different pieces of code.

@trask
Member

trask commented Jun 13, 2023

Removing this as an HTTP semconv stability blocker based on the merging of #69 and #70

@trask
Member

trask commented Oct 10, 2023

@antonfirsov are you still interested in additional HTTP metrics, or do the clarifications to the duration metrics resolve this? thx!

@antonfirsov
Author

The clarification unblocked the .NET 8.0 metrics work which is done by now. We can discuss new metrics on-demand in the future.

@trask
Member

trask commented Oct 10, 2023

if all good, can you close this? (I don't have permissions in this repo)
