
Add client metrics #751

Open

wants to merge 36 commits into base: master
Conversation

isacikgoz
Member

Summary

Add a client metrics implementation. Since the performance report is generated from user activity, it is better to reflect that behaviour, which makes the payload much more realistic. I wanted to hear early feedback on whether this is a good approach.

I'll add the ClientFirstContentfulPaint and ClientLargestContentfulPaint metrics in the meantime.

@isacikgoz isacikgoz added the 2: Dev Review Requires review by a core committer label May 28, 2024
Member

@agarciamontoro agarciamontoro left a comment

Looks great, I like the approach! I left some comments, looking forward to seeing more metrics 👀 Thanks for this work!

Comment on lines 157 to 160
err := c.user.ObserveClientMetric(model.ClientTimeToFirstByte, float64(time.Now().UnixMilli()-start)/1000)
if err != nil {
mlog.Warn("Failed to store observation", mlog.Err(err))
}
Member

At this point we've already made 3-4 requests, so I don't think this is the time to first byte, right? How are we defining the time to first byte in the webapp?
We can go quite low-level here and use something like https://pkg.go.dev/net/http/httptrace#WithClientTrace; there's a callback for GotFirstResponseByte, where we can observe the metric. But this is a per-request metric, so are we planning on measuring that in all requests? Or in some specific one?

Member Author

TIL! We want to measure that whenever a user logs in (thinking of a scenario where a user opens a web page). But the user could also already be logged in with a session, so we may need to cover that scenario. The goal of the PR is to measure the load on the server from client metrics, but I think it could be a good idea to actually get realistic metrics from the agents.

Maybe we can add this to backlog, WDYT?

Resolved review threads: loadtest/control/simulcontroller/controller.go, loadtest/store/memstore/store.go (×3)
Contributor

@streamer45 streamer45 left a comment

Thanks! Gave this a first quick pass and left some comments.

Resolved review threads: .golangci.yml, go.mod, loadtest/control/simulcontroller/actions.go, loadtest/control/simulcontroller/controller.go, loadtest/store/memstore/store.go (×2)
@agnivade
Member

@isacikgoz - Just checking on this, is there anything left other than us reviewing this again?

Comment on lines 60 to 61
func randomUserAgent() string {
i := rand.Intn(len(userAgents))
Contributor

Whaaat? Are you saying there's an equal chance of Safari sending metrics as Chrome? :p

Member Author

There are quite a lot of vanilla macOS users among devs, to be fair :p

Resolved review threads: .golangci.yml (×2), loadtest/control/simulcontroller/periodic.go (×2), loadtest/user/userentity/report.go (×2), loadtest/user/userentity/user.go
Comment on lines +158 to +162
elapsed := time.Since(start).Seconds()
err := c.user.ObserveClientMetric(model.ClientTimeToFirstByte, elapsed)
if err != nil {
mlog.Warn("Failed to store observation", mlog.Err(err))
}
Member

I still think this is not semantically correct. I'm not sure if it's a problem with the naming (how are we using it exactly in the webapp?) or with the code here. For reference: I would understand this if the metric were called something like model.ClientTimeToLogin.

Member

Also, answering your comment in the other thread (sorry, just read it now):

TIL! We want to measure that whenever a user logs in (thinking of a scenario where opens a web page). But the user could also be logged in with the session so we may need to cover that scenario. The goal of the PR is to measure the load in the server with client metrics, but I think it could be a good idea to actually get realistic metrics from the agents.
Maybe we can add this to backlog, WDYT?

I'm ok adding a ticket to refine this to the backlog, but I'd still like to understand how the webapp uses this specific metric, I think I lack some context here, sorry.

Member Author

An explanation can be found here: https://web.dev/articles/ttfb. It turns out my interpretation was also incorrect. I just pushed a canonical way of measuring it using your suggestion 😅

isacikgoz and others added 9 commits July 30, 2024 11:59
Co-authored-by: Alejandro García Montoro <[email protected]>
* Enable ethtool metrics

* Increase network interface rx size on proxy instance

* Use retransmission rate instead of timeouts

* Review socket buffer sizes

* Update panel unit

* Add TCP retransmissions panel

* Fix datasource

* Make RX size dynamic

* Fix comment
* Update postgres client

* Make assets
Halve the value based on latest test results
Recover broken images and tweak things that have changed since this was
first written.
* MM-59319: Allow opensearch installations to be created

We remove the prefix limitation and allow either prefix.

https://mattermost.atlassian.net/browse/MM-59319

* Fix test
* Allow 0 shard replicas

For 1-node ES clusters, we need 0 shard replicas, not 1. Otherwise, the
cluster status check is never valid, since there are always unassigned
shards (one per index) and subsequently the cluster status is always
yellow.

* Validate RestoreSnapshot options before marshaling
* Remove Cloudwatch log policy

There's a limit of 10 Cloudwatch log policies per region per account, so
it's not scalable to create one with every deployment. Instead, we rely on
such a policy already being present in the AWS account. If it is not
present, the only downside is that logs cannot be viewed through
Cloudwatch, but everything else should keep working.

* make assets

* Check needed policy and create if it doesn't exist

* Refactor CloudWatch logic to another file and test
Member

@agarciamontoro agarciamontoro left a comment

Thanks! Just a couple more comments

Comment on lines 106 to 115
trace := &httptrace.ClientTrace{
GotFirstResponseByte: func() {
elapsed := time.Since(startTime).Seconds()
err := t.ue.ObserveClientMetric(model.ClientTimeToFirstByte, elapsed)
if err != nil {
mlog.Warn("Failed to store observation", mlog.Err(err))
}
},
}
req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
Member

Oh, neat! Do we know the amount of overhead this creates on the requests?

Member Author

We are here to find that out :p. It shouldn't have much impact, as it just stores these observations locally and sends them every minute.
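For context, that buffering pattern can be sketched roughly like this (type and method names are illustrative, not the load-test tool's actual API):

```go
package main

import (
    "fmt"
    "sync"
    "time"
)

// metricBuffer accumulates observations locally so instrumented requests only
// pay for an in-memory append; a periodic flush sends them in batches.
type metricBuffer struct {
    mu  sync.Mutex
    obs map[string][]float64
}

func newMetricBuffer() *metricBuffer {
    return &metricBuffer{obs: map[string][]float64{}}
}

// Observe records one sample for the named metric.
func (b *metricBuffer) Observe(name string, v float64) {
    b.mu.Lock()
    defer b.mu.Unlock()
    b.obs[name] = append(b.obs[name], v)
}

// Flush drains the buffer and hands the batch to send (e.g. an HTTP POST to
// the server's client-performance endpoint). Empty batches are skipped.
func (b *metricBuffer) Flush(send func(map[string][]float64)) {
    b.mu.Lock()
    batch := b.obs
    b.obs = map[string][]float64{}
    b.mu.Unlock()
    if len(batch) > 0 {
        send(batch)
    }
}

// Run flushes on every tick until stop is closed, then flushes once more.
func (b *metricBuffer) Run(interval time.Duration, stop <-chan struct{}, send func(map[string][]float64)) {
    t := time.NewTicker(interval)
    defer t.Stop()
    for {
        select {
        case <-t.C:
            b.Flush(send)
        case <-stop:
            b.Flush(send)
            return
        }
    }
}

func main() {
    b := newMetricBuffer()
    b.Observe("ttfb", 0.12)
    b.Observe("ttfb", 0.34)
    b.Flush(func(batch map[string][]float64) {
        fmt.Println(len(batch["ttfb"]))
    })
}
```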

Resolved review thread: loadtest/control/simulcontroller/actions.go
Member

@agarciamontoro agarciamontoro left a comment

Thank you, this is great! Do you think we can run a comparison before merge and check whether the overhead on the RoundTrip is negligible or not?

@isacikgoz
Member Author

Thank you, this is great! Do you think we can run a comparison before merge and check whether the overhead on the RoundTrip is negligible or not?

That'd be my first load-test tool comparison; is there a way to execute it?

@isacikgoz isacikgoz changed the title WIP: Add client metrics Add client metrics Jul 31, 2024
@agnivade
Member

agnivade commented Aug 1, 2024

@isacikgoz - you can do an ltctl comparison run, or just run two consecutive bounded tests. Both work.

@agarciamontoro
Member

@isacikgoz: I think you'll have to run two consecutive bounded tests, updating the LoadTestDownloadURL between one and the other (you can use a local path pointing to the file generated by make package to use your changes). Let me know if you need some assistance and we can sync!

Comment on lines 12 to 13
userAgents = []string{"desktop", "firefox", "chrome", "safari", "edge", "other"}
platforms = []string{"linux", "macos", "ios", "android", "windows", "other"}
Member

Let's just use other/other for this. If we use all the platforms, then this becomes hard for a user to open a browser session and observe the "actual" client metrics in a live load test.

More context here: https://community.mattermost.com/core/pl/9y8cgyxxq7yq5yjorpqykbjkrw

Member

Thank you for bringing this up. I think this approach strikes a good balance between the team wanting to use the load-testing tool for the web client metrics and the tool being used to measure the impact of web client metrics with a load test.

@isacikgoz
Member Author

@agarciamontoro after discussing with @hmhealey, I reverted the TTFB measurement so it is made only once, on login. I think that more or less reflects our current usual scenario. I'll merge it as it is now if you are okay with it.

@agarciamontoro
Member

@agarciamontoro after discussing with @hmhealey, I reverted the TTFB measurement so it is made only once, on login. I think that more or less reflects our current usual scenario. I'll merge it as it is now if you are okay with it.

Ok! I still don't think the name is correct, though, since it's not the time to first byte, but the time for the whole login to finish. Is there a way we can rename it? Don't get me wrong, I think that "the time for the whole login to finish" is a great metric to measure, it's just that "TTFB" means a very specific thing.

@isacikgoz
Member Author

@agarciamontoro Yes, you are right in terms of naming, but this is just to simulate the load that client metrics bring. I'm not sure we can use this metric for the agents anyway. We could add a new metric to simulate the same payload, but we would just be adding another time series in Prometheus. Is that something we should do?

@agarciamontoro
Member

Yeah, for this PR I'm ok with what we have as long as we don't deviate from the implementation in our clients. My concern is with the metric itself: if that's the name that we use in the clients, I think we should change it. But that's off-topic for this PR, actually.

My TL;DR for this PR is: if the implementation here mimics what our clients do, then let's merge it (and keep it updated with the clients implementation in the future) :)

@isacikgoz
Member Author

@agarciamontoro gotcha, yes, this is only to mimic what our clients do. I'd say let's go with it for now and keep discussing renaming it, using it per request, etc.

@agarciamontoro agarciamontoro added 4: Reviews Complete All reviewers have approved the pull request and removed 2: Dev Review Requires review by a core committer labels Sep 16, 2024
@agarciamontoro
Member

Sounds good! All yours to merge, then :)

@hmhealey
Member

@isacikgoz @agarciamontoro I think there might be some misunderstanding here still. The TTFB measurement that the web app reports is indeed a proper TTFB, but it's the TTFB for the initial request made from the app to the server when loading the page, not the TTFB for every single request. It's sent exactly once on each page load.

There's more technical information on it and the other web app metrics here: https://mattermost.atlassian.net/wiki/spaces/ICU/pages/2715418659/Grafana+Metrics#Time-to-First-Byte

@agarciamontoro
Member

@hmhealey, can you point to the code measuring and reporting this?

@hmhealey
Member

@agarciamontoro The code that reports it is here, but the calculations are all done by either the browser itself or by the Chrome team's web-vitals library. There's very little that we do here ourselves.

@agarciamontoro
Member

Ah, ok ok, thanks, @hmhealey! Then I think we should move the login TTFB metric to the lowest level possible and get it for the actual first byte. If we measure it in SimulController.login as we're doing now, with a deferred measurement, we're folding several finished requests into that metric. SimulController.login calls control.Login, which calls UserEntity.Login, which in turn calls the final Client4.Login. That's where we need to measure it, I believe.
I agree that in terms of load it will be the same, but now that we're adding it, I would like to get it right. That metric may be useful in tests as well. @isacikgoz, thoughts?

@isacikgoz
Member Author

@agarciamontoro If you think that metric will be useful, then let's go for it. So basically we'll run it again in the HTTP trace, but only a single time, with sync.Once, right?

@agnivade
Member

I think we are spending way too much time here. The purpose of this PR is just to add coverage for the client metrics feature. It should not be taken as real data, because clients aren't going to be on the same network as the server. Where it will be useful is when someone actually logs in with a browser and points to the instance. That's why we are using "other"/"other" as the browser/platform combination for the load-test agent, so that we can clearly see real data when used from browsers.

It's already been months with this PR :)

@agarciamontoro
Member

I'm ok unblocking this PR and getting it merged. But I don't feel comfortable having a metric that doesn't do what its name suggests. So here's my proposal: let's merge this and, in parallel, create a ticket to address that concern, so that the tool comes closer to the real implementation. Thoughts?

@agnivade
Member

Sounds good to me. 👍
