
Add client metrics #751

Open

wants to merge 36 commits into base: master
Conversation

isacikgoz
Member

Summary

Add a client metrics implementation. Since the performance report is generated from user activity, it is better to reflect that behaviour, which makes the payload much more realistic. I wanted to hear early feedback on whether this is a good approach.

I'll add the ClientFirstContentfulPaint and ClientLargestContentfulPaint metrics in the meantime.

@isacikgoz isacikgoz added the 2: Dev Review Requires review by a core committer label May 28, 2024
Member

@agarciamontoro agarciamontoro left a comment

Looks great, I like the approach! I left some comments, looking forward to seeing more metrics 👀 Thanks for this work!

Comment on lines 157 to 160
err := c.user.ObserveClientMetric(model.ClientTimeToFirstByte, float64(time.Now().UnixMilli()-start)/1000)
if err != nil {
mlog.Warn("Failed to store observation", mlog.Err(err))
}
Member

At this point we've already made 3-4 requests, so I don't think this is the time to first byte, right? How are we defining the time to first byte in the webapp?
We can go quite low-level here and use something like https://pkg.go.dev/net/http/httptrace#WithClientTrace; there's a callback for GotFirstResponseByte, where we can observe the metric. But this is a per-request metric, so are we planning on measuring that in all requests? Or in some specific one?

Member Author

TIL! We want to measure that whenever a user logs in (thinking of a scenario where a user opens a web page). But the user could also already be logged in with a session, so we may need to cover that scenario. The goal of the PR is to measure the load on the server from client metrics, but I think it could be a good idea to actually get realistic metrics from the agents.

Maybe we can add this to backlog, WDYT?

Resolved review threads: loadtest/control/simulcontroller/controller.go, loadtest/store/memstore/store.go (×3)
Contributor

@streamer45 streamer45 left a comment

Thanks! Gave this a first quick pass and left some comments.

Resolved review threads: .golangci.yml, go.mod, loadtest/control/simulcontroller/actions.go, loadtest/control/simulcontroller/controller.go, loadtest/store/memstore/store.go (×2)
@agnivade
Member

@isacikgoz - Just checking on this, is there anything left other than us reviewing this again?

Comment on lines 60 to 61
func randomUserAgent() string {
i := rand.Intn(len(userAgents))
Contributor

Whaaat? Are you saying there's an equal chance of Safari sending metrics as Chrome? :p

Member Author

There are quite a lot of vanilla macOS users among devs, to be fair :p

Resolved review threads: .golangci.yml (×2), loadtest/control/simulcontroller/periodic.go (×2), loadtest/user/userentity/report.go (×2), loadtest/user/userentity/user.go
Comment on lines +158 to +162
elapsed := time.Since(start).Seconds()
err := c.user.ObserveClientMetric(model.ClientTimeToFirstByte, elapsed)
if err != nil {
mlog.Warn("Failed to store observation", mlog.Err(err))
}
Member

I still think this is not semantically correct. I'm not sure if it's a problem with the naming (how are we using it exactly in the webapp?) or with the code here. For reference: I would understand this if the metric were called something like model.ClientTimeToLogin.

Member

Also, answering your comment in the other thread (sorry, just read it now):

TIL! We want to measure that whenever a user logs in (thinking of a scenario where opens a web page). But the user could also be logged in with the session so we may need to cover that scenario. The goal of the PR is to measure the load in the server with client metrics, but I think it could be a good idea to actually get realistic metrics from the agents.
Maybe we can add this to backlog, WDYT?

I'm ok adding a ticket to refine this to the backlog, but I'd still like to understand how the webapp uses this specific metric, I think I lack some context here, sorry.

Member Author

An explanation can be found here: https://web.dev/articles/ttfb. It turns out my interpretation was also incorrect. I just pushed a canonical way of measuring it using your suggestion 😅

isacikgoz and others added 9 commits July 30, 2024 11:59
Co-authored-by: Alejandro García Montoro <[email protected]>
* Enable ethtool metrics

* Increase network interface rx size on proxy instance

* Use retransmission rate instead of timeouts

* Review socket buffer sizes

* Update panel unit

* Add TCP retransmissions panel

* Fix datasource

* Make RX size dynamic

* Fix comment
* Update postgres client

* Make assets
Halve the value based on latest test results
Recover broken images and tweak things that have changed since this was
first written.
* MM-59319: Allow opensearch installations to be created

We remove the prefix limitation and allow either prefix.

https://mattermost.atlassian.net/browse/MM-59319

* Fix test
* Allow 0 shard replicas

For 1-node ES clusters, we need 0 shard replicas, not 1. Otherwise, the
cluster status check is never valid, since there are always unassigned
shards (one per index) and subsequently the cluster status is always
yellow.

* Validate RestoreSnapshot options before marshaling
* Remove Cloudwatch log policy

There's a limit of 10 Cloudwatch log policies per region per account, so
it's not scalable to create one with every deployment. Instead, we rely on
such a policy already being present in the AWS account. If it is not
present, the only downside is that logs cannot be viewed through
Cloudwatch, but everything else should keep working.

* make assets

* Check needed policy and create if it doesn't exist

* Refactor CloudWatch logic to another file and test
Member

@agarciamontoro agarciamontoro left a comment

Thanks! Just a couple more comments

Comment on lines 106 to 115
trace := &httptrace.ClientTrace{
GotFirstResponseByte: func() {
elapsed := time.Since(startTime).Seconds()
err := t.ue.ObserveClientMetric(model.ClientTimeToFirstByte, elapsed)
if err != nil {
mlog.Warn("Failed to store observation", mlog.Err(err))
}
},
}
req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
Member

Oh, neat! Do we know the amount of overhead this creates on the requests?

Member Author

We are here to find that out :p. It shouldn't have much impact, as it just stores these observations locally and sends them every minute.
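For context, that buffering pattern can be sketched roughly like this (type and method names are illustrative, not the load-test tool's actual API):

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

// metricBuffer accumulates observations locally so instrumented requests only
// pay for an in-memory append; a periodic flush sends them in batches.
type metricBuffer struct {
    mu  sync.Mutex
    obs map[string][]float64
}

func newMetricBuffer() *metricBuffer {
    return &metricBuffer{obs: map[string][]float64{}}
}

// Observe records one sample for the named metric.
func (b *metricBuffer) Observe(name string, v float64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.obs[name] = append(b.obs[name], v)
}

// Flush drains the buffer and hands the batch to send (e.g. an HTTP POST to
// the server's client-performance endpoint). Empty batches are skipped.
func (b *metricBuffer) Flush(send func(map[string][]float64)) {
    b.mu.Lock()
    batch := b.obs
    b.obs = map[string][]float64{}
    b.mu.Unlock()
    if len(batch) > 0 {
        send(batch)
    }
}

// Run flushes on every tick until stop is closed, then flushes once more.
func (b *metricBuffer) Run(interval time.Duration, stop <-chan struct{}, send func(map[string][]float64)) {
    t := time.NewTicker(interval)
    defer t.Stop()
    for {
        select {
        case <-t.C:
            b.Flush(send)
        case <-stop:
            b.Flush(send)
            return
        }
    }
}

func main() {
    b := newMetricBuffer()
    b.Observe("ttfb", 0.12)
    b.Observe("ttfb", 0.34)
    b.Flush(func(batch map[string][]float64) {
        fmt.Println(len(batch["ttfb"]))
    })
}
```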

Resolved review thread: loadtest/control/simulcontroller/actions.go
Member

@agarciamontoro agarciamontoro left a comment

Thank you, this is great! Do you think we can run a comparison before merge and check whether the overhead on the RoundTrip is negligible or not?

@isacikgoz
Member Author

Thank you, this is great! Do you think we can run a comparison before merge and check whether the overhead on the RoundTrip is negligible or not?

That'd be my first load-test tool comparison; is there a way to execute it?

@isacikgoz isacikgoz changed the title WIP: Add client metrics Add client metrics Jul 31, 2024
@agnivade
Member

agnivade commented Aug 1, 2024

@isacikgoz - you can do an ltctl comparison run, or just run two consecutive bounded tests. Both work.

@agarciamontoro
Member

@isacikgoz: I think you'll have to run two consecutive bounded tests, updating the LoadTestDownloadURL between one and the other (you can use a local path pointing to the file generated by make package to use your changes). Let me know if you need some assistance and we can sync!

Comment on lines 12 to 13
userAgents = []string{"desktop", "firefox", "chrome", "safari", "edge", "other"}
platforms = []string{"linux", "macos", "ios", "android", "windows", "other"}
Member

Let's just use other/other for this. If we use all the platforms, then this becomes hard for a user to open a browser session and observe the "actual" client metrics in a live load test.

More context here: https://community.mattermost.com/core/pl/9y8cgyxxq7yq5yjorpqykbjkrw

Member

Thank you for bringing this up. I think this approach strikes a good balance between the team wanting to use the load-testing tool for the web client metrics and the tool being used to measure the impact of web client metrics with a load test.

@isacikgoz
Member Author

@agarciamontoro after discussing with @hmhealey, I reverted the TTFB measurement so it is made only once, on login. I think that more or less reflects our current usual scenario. I'll merge it as it is now if you are okay with it.

@agarciamontoro
Member

@agarciamontoro after discussing with @hmhealey, I reverted the TTFB measurement so it is made only once, on login. I think that more or less reflects our current usual scenario. I'll merge it as it is now if you are okay with it.

Ok! I still don't think the name is correct, though, since it's not the time to first byte, but the time for the whole login to finish. Is there a way we can rename it? Don't get me wrong, I think that "the time for the whole login to finish" is a great metric to measure, it's just that "TTFB" means a very specific thing.

@isacikgoz
Member Author

@agarciamontoro Yes, you are right in terms of naming, but this is just to simulate the load that client metrics bring. I'm not sure we can use this metric for the agents anyway. We could add a new metric to simulate the same payload, but we would just be adding another time series in Prometheus. Is that something we should do?

@agarciamontoro
Member

Yeah, for this PR I'm ok with what we have as long as we don't deviate from the implementation in our clients. My concern is with the metric itself: if that's the name that we use in the clients, I think we should change it. But that's off-topic for this PR, actually.

My TL;DR for this PR is: if the implementation here mimics what our clients do, then let's merge it (and keep it updated with the clients implementation in the future) :)

@isacikgoz
Member Author

@agarciamontoro gotcha, yes, this is only to mimic what our clients do. I'd say let's go with it for now and keep discussing renaming it, using it per request, etc.

@agarciamontoro agarciamontoro added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Sep 16, 2024
@agarciamontoro
Member

Sounds good! All yours to merge, then :)

@hmhealey
Member

@isacikgoz @agarciamontoro I think there might be some misunderstanding here still. The TTFB measurement that the web app reports is indeed a proper TTFB, but it's the TTFB for the initial request made from the app to the server when loading the page, not the TTFB for every single request. It's sent exactly once on each page load.

There's more technical information on it and the other web app metrics here: https://mattermost.atlassian.net/wiki/spaces/ICU/pages/2715418659/Grafana+Metrics#Time-to-First-Byte

@agarciamontoro
Member

@hmhealey, can you point to the code measuring and reporting this?

@hmhealey
Member

@agarciamontoro The code that reports it is here, but the calculations are all done by either the browser itself or by the Chrome team's web-vitals library. There's very little that we do here ourselves.

@agarciamontoro
Member

Ah, ok ok, thanks, @hmhealey! Then I think we should move the login TTFB metric to the lowest level possible and get it for the actual first byte. If we measure it in SimulController.login as we're doing now, with a deferred measurement, we're folding several finished requests into that metric. SimulController.login calls control.Login, which calls UserEntity.Login, which in turn calls the final Client4.Login. That's where we need to measure it, I believe.
I agree that in terms of load it will be the same, but now that we're adding it, I would like to get it right. That metric may be useful in tests as well. @isacikgoz, thoughts?

@isacikgoz
Member Author

@agarciamontoro If you think that metric will be useful, then let's go for it. So basically we'll run it again in the HTTP trace, but only a single time, with sync.Once, right?

@agnivade
Member

I think we are spending way too much time here. The purpose of this PR is just to add coverage for the client metrics feature. It should not be taken as real data, because clients aren't going to be on the same network as the server. Where it will be useful is when someone actually logs in with a browser and points to the instance. That's why we are using "other"/"other" as the browser/platform combination for the load-test agent, so that we can clearly see real data when used from browsers.

It's already been months with this PR :)

@agarciamontoro
Member

I'm ok unblocking this PR and getting it merged. But I don't feel comfortable having a metric that doesn't do what its name suggests. So here's my proposal: let's merge this and, in parallel, create a ticket to address that concern, so that the tool comes closer to the real implementation. Thoughts?

@agnivade
Member

Sounds good to me. 👍
