
garm webhook && metrics/o11y #272

Open · pathcl opened this issue Jun 26, 2024 · 3 comments

@pathcl commented Jun 26, 2024

Hello folks,

One of the challenges with runners and GitHub Actions, even after all these years, is still observability.

I'd like to know if there are plans to work on o11y for GARM's webhook handling.

func (a *APIController) handleWorkflowJobEvent(ctx context.Context, w http.ResponseWriter, r *http.Request) {

Use case(s)

  • If there's a stuck workflow because of a failed runner/provider (I know we have a timeout for bootstrap).
  • What's the P99/P90 startup time for jobs and runners?
  • Get better insights about jobs. It should be possible to log/report on webhook events.
  • GitHub Actions doesn't provide a retry mechanism. How do we cope with that?
@gabriel-samfira (Member) commented

Hi @pathcl

The scope of GARM is limited to successfully spinning up runners and making them available to the workflow jobs that are triggered on GitHub. Everything we add to GARM is geared towards that scope. But I'll address each point:

If there's a stuck workflow because of a failed runner/provider (I know we have a timeout for bootstrap).

Indeed, this is something that can be addressed simultaneously with adding metrics to providers (see below). If the stuck workflow is a symptom of a stuck runner/provider, then that should be addressed by metrics added to providers. Otherwise, it's outside the scope of GARM to watch workflows themselves. We only care about the workflow jobs that we record. The distinction is important. We may not record all jobs, for various reasons:

  • There is no pool to handle it, so we don't care
  • GARM was down when that event was generated.
  • GitHub was down and webhooks were never sent out (happens more often than one might think)

Sadly, there is no efficient way to fix either of the last two scenarios. We have orgs which may have many repos, and enterprises which may have many orgs, each of which may have many repos. Workflows only exist at the repo level, so if we attempt to ingest any workflows we missed, it means hammering the GH API for workflows on potentially thousands of repos.

The only potential workaround: if the operator of GARM knows that GARM/GitHub was down for a while and missed some jobs, they can increase min-idle-runners to match max-runners until the queue on their GitHub repos is consumed, then set min-idle-runners back to its original value.

What's the P99/P90 startup time for jobs and runners?

This is one area where we need to improve. We have metrics for the GH API calls, but no metrics for provider calls. We can't currently see if a runner failed to reach the idle state and is being recreated over and over due to the bootstrap timeout. So yes, this needs to be addressed. Would you be willing to open an issue about this?
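To illustrate the direction (a minimal sketch only, assuming prometheus/client_golang; the metric name, labels and bucket choices below are hypothetical and not part of GARM today), provider call metrics could look roughly like this:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ProviderOperationDuration is a hypothetical histogram for external provider
// calls (CreateInstance, DeleteInstance, ...), labeled by provider, operation
// and outcome.
var ProviderOperationDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "garm",
		Subsystem: "provider",
		Name:      "operation_duration_seconds",
		Help:      "Duration of external provider calls.",
		Buckets:   prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s .. ~256s
	},
	[]string{"provider", "operation", "status"},
)

func init() {
	prometheus.MustRegister(ProviderOperationDuration)
}

// ObserveProviderOperation records a single provider call.
func ObserveProviderOperation(provider, operation string, start time.Time, err error) {
	status := "success"
	if err != nil {
		status = "error"
	}
	ProviderOperationDuration.WithLabelValues(provider, operation, status).
		Observe(time.Since(start).Seconds())
}

With something like this in place, P99/P90 queries over the histogram and alerting on runners that are repeatedly recreated after the bootstrap timeout would become straightforward.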

Get better insights about jobs. It should be possible to log/report on webhook events.

We have that. If you look at the function you highlighted above, you should see these in its body:

metrics.WebhooksReceived.WithLabelValues(
	"false",         // label: valid
	"owner_unknown", // label: reason
).Inc()

metrics.WebhooksReceived.WithLabelValues(
	"false",             // label: valid
	"signature_invalid", // label: reason
).Inc()

metrics.WebhooksReceived.WithLabelValues(
	"false",   // label: valid
	"unknown", // label: reason
).Inc()

metrics.WebhooksReceived.WithLabelValues(
	"true", // label: valid
	"",     // label: reason
).Inc()

We can improve on this. If you have any suggestions regarding what extra info you believe would make sense, we can find a way to add it.
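As an illustration only (this is not GARM's current metric definition, just a sketch of one possible extension), an extra "event" label on the webhook counter might look like this:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical variant of the webhooks counter with an extra "event" label,
// so received webhooks can also be broken down by event type.
var WebhooksReceived = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "garm",
		Subsystem: "webhook",
		Name:      "received",
		Help:      "Webhooks received, by validity, failure reason and event type.",
	},
	[]string{"valid", "reason", "event"},
)

func init() {
	prometheus.MustRegister(WebhooksReceived)
}

The handler would then pass the event type taken from the X-GitHub-Event header, e.g. WebhooksReceived.WithLabelValues("true", "", "workflow_job").Inc().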

GitHub Actions doesn't provide a retry mechanism. How do we cope with that?

That is something that we can't fix. The scope of GARM is to make a runner available to a workflow job. As long as we receive a webhook for a queued job that we can handle, we go ahead and try to create a runner. But retrying jobs is something that should be done by the repo maintainer. Any retry attempt will generate new jobs, which will get to GARM and GARM will do the right thing (hopefully).

gabriel-samfira added this to the v0.1.6 milestone on Jul 1, 2024
@bavarianbidi (Contributor) commented

[...] We have metrics for the GH API calls, but no metrics for provider calls [...]

We are able to "calculate" some kind of error rate when it comes to provider interaction (this metric is part of the runner_ scope); please see my comment here.

@bavarianbidi (Contributor) commented

Slightly off topic, but somewhat related to this discussion.
I can't share the exact code here (I will check whether that's possible, but it's not that complicated if you read the next few words 😅):

We are operating GARM at the enterprise level. To make this work, we receive every Actions event from GitHub (as described in the GARM documentation).

With that, we get a lot of events, even ones we are not responsible for. To get more insight into our users/customers, we take the information we already have in the event payload and store it in a database.
To do custom processing of the event requests, we installed Traefik in front of GARM.
That lets us use Traefik's mirroring feature (xref) to send the traffic to GARM and also to our custom piece of code, which parses the event payload and stores information like org/repo, job started, job ended, job queued, job status, ... (see the sketch below).

With that we are, e.g., able to see how delayed GitHub's event delivery to our system is.
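For illustration, a minimal sketch of such a mirror target could look like this (this is not the code referenced above; the /webhooks path, the port, and a log line standing in for the database insert are placeholder assumptions):

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// workflowJobEvent extracts only the fields of interest from GitHub's
// "workflow_job" webhook payload.
type workflowJobEvent struct {
	Action      string `json:"action"` // queued, in_progress, completed
	WorkflowJob struct {
		ID          int64      `json:"id"`
		StartedAt   *time.Time `json:"started_at"`
		CompletedAt *time.Time `json:"completed_at"`
	} `json:"workflow_job"`
	Repository struct {
		FullName string `json:"full_name"` // org/repo
	} `json:"repository"`
}

func handleMirror(w http.ResponseWriter, r *http.Request) {
	// The mirror receives every event; only workflow_job is of interest here.
	if r.Header.Get("X-GitHub-Event") != "workflow_job" {
		w.WriteHeader(http.StatusOK)
		return
	}
	var ev workflowJobEvent
	if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	// A real setup would insert into a database; logging stands in for that.
	// Comparing received_at with the job timestamps shows how delayed
	// GitHub's delivery is.
	log.Printf("repo=%s job=%d action=%s started_at=%v completed_at=%v received_at=%s",
		ev.Repository.FullName, ev.WorkflowJob.ID, ev.Action,
		ev.WorkflowJob.StartedAt, ev.WorkflowJob.CompletedAt,
		time.Now().UTC().Format(time.RFC3339))
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/webhooks", handleMirror)
	log.Fatal(http.ListenAndServe(":8080", nil))
}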
