
garm webhook && metrics/o11y #272

Open · pathcl opened this issue Jun 26, 2024 · 3 comments

@pathcl commented Jun 26, 2024

Hello folks,

One of the challenges with runners and GitHub Actions, even after all these years, is still observability.

I'd like to know if there are plans to work on o11y for GARM's webhook handling.

func (a *APIController) handleWorkflowJobEvent(ctx context.Context, w http.ResponseWriter, r *http.Request) {

Use case(s)

  • If there's a stuck workflow because of a failed runner/provider (I know we have a timeout for bootstrap).
  • What's the P99/P90 startup time for jobs and runners?
  • Get better insights about jobs. It should be possible to log/report on webhook events.
  • GitHub Actions doesn't provide a retry mechanism. How do we cope with that?
@gabriel-samfira (Member) commented

Hi @pathcl

The scope of GARM is limited to successfully spinning up runners and making them available to the workflow jobs that are triggered on GitHub. Everything we add to GARM is geared towards that scope. But I'll address each point:

If there's a stuck workflow because of a failed runner/provider (I know we have a timeout for bootstrap).

Indeed, this is something that can be addressed simultaneously with adding metrics to providers (see below). If the stuck workflow is a symptom of a stuck runner/provider, then that should be addressed by metrics added to providers. Otherwise, it's outside the scope of GARM to watch workflows themselves. We only care about the workflow jobs that we record. The distinction is important. We may not record all jobs, for various reasons:

  • There is no pool to handle it, so we don't care
  • GARM was down when that event was generated.
  • GitHub was down and webhooks were never sent out (happens more often than one might think)

Sadly, there is no efficient way to fix either of the last two scenarios. We have orgs which may have many repos, and enterprises which may have many orgs, each of which may have many repos. Workflows only exist at the repo level, so if we attempt to ingest any workflows we missed, it means hammering the GH API for workflows on potentially thousands of repos.

The only potential workaround: if the operator of GARM knows that GARM/GitHub was down for a while and missed some jobs, they can increase min-idle-runners to match max-runners until the queue on their GitHub repos is consumed, then set min-idle-runners back to its original value.

What's the P99/P90 startup time for jobs and runners?

This is one area where we need to improve. We have metrics for the GH API calls, but no metrics for provider calls. We can't currently see if a runner failed to reach the idle state and is being recreated over and over due to the bootstrap timeout. So yes, this needs to be addressed. Would you be willing to open an issue about this?
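To illustrate the direction (a minimal sketch only, assuming prometheus/client_golang; the metric name, labels and bucket choices below are hypothetical and not part of GARM today), provider call metrics could look roughly like this:

package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// ProviderOperationDuration is a hypothetical histogram for external provider
// calls (CreateInstance, DeleteInstance, ...), labeled by provider, operation
// and outcome.
var ProviderOperationDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Namespace: "garm",
		Subsystem: "provider",
		Name:      "operation_duration_seconds",
		Help:      "Duration of external provider calls.",
		Buckets:   prometheus.ExponentialBuckets(0.5, 2, 10), // 0.5s .. ~256s
	},
	[]string{"provider", "operation", "status"},
)

func init() {
	prometheus.MustRegister(ProviderOperationDuration)
}

// ObserveProviderOperation records a single provider call.
func ObserveProviderOperation(provider, operation string, start time.Time, err error) {
	status := "success"
	if err != nil {
		status = "error"
	}
	ProviderOperationDuration.WithLabelValues(provider, operation, status).
		Observe(time.Since(start).Seconds())
}

With something like this in place, P99/P90 queries over the histogram and alerting on runners that are repeatedly recreated after the bootstrap timeout would become straightforward.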

Get better insights about jobs. It should be possible to log/report on webhook events.

We have that. If you look at the function you highlighted above, you should see these in its body:

metrics.WebhooksReceived.WithLabelValues(
	"false",         // label: valid
	"owner_unknown", // label: reason
).Inc()

metrics.WebhooksReceived.WithLabelValues(
	"false",             // label: valid
	"signature_invalid", // label: reason
).Inc()

metrics.WebhooksReceived.WithLabelValues(
	"false",   // label: valid
	"unknown", // label: reason
).Inc()

metrics.WebhooksReceived.WithLabelValues(
	"true", // label: valid
	"",     // label: reason
).Inc()

We can improve on this. If you have any suggestions regarding what extra info you believe would make sense, we can find a way to add it.
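As an illustration only (this is not GARM's current metric definition, just a sketch of one possible extension), an extra "event" label on the webhook counter might look like this:

package metrics

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical variant of the webhooks counter with an extra "event" label,
// so received webhooks can also be broken down by event type.
var WebhooksReceived = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "garm",
		Subsystem: "webhook",
		Name:      "received",
		Help:      "Webhooks received, by validity, failure reason and event type.",
	},
	[]string{"valid", "reason", "event"},
)

func init() {
	prometheus.MustRegister(WebhooksReceived)
}

The handler would then pass the event type taken from the X-GitHub-Event header, e.g. WebhooksReceived.WithLabelValues("true", "", "workflow_job").Inc().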

GitHub Actions doesn't provide a retry mechanism. How do we cope with that?

That is something that we can't fix. The scope of GARM is to make a runner available to a workflow job. As long as we receive a webhook for a queued job that we can handle, we go ahead and try to create a runner. But retrying jobs is something that should be done by the repo maintainer. Any retry attempt will generate new jobs, which will get to GARM and GARM will do the right thing (hopefully).

gabriel-samfira added this to the v0.1.6 milestone on Jul 1, 2024
@bavarianbidi (Contributor) commented

[...] We have metrics for the GH API calls, but no metrics for provider calls [...]

We are able to "calculate" some kind of error rate when it comes to provider interaction (this metric is part of the runner_ scope); please see my comment here.

@bavarianbidi (Contributor) commented

Slightly off topic, but somewhat related to this discussion.
I can't share the exact code here (I will check whether that's possible, but it's not that complicated if you read the next few words 😅):

We are operating GARM at the enterprise level. To make this work, we receive every Actions event from GitHub (as described in the GARM documentation).

With that, we get a lot of events, even ones we are not responsible for. To get more insight into our users/customers, we take the information we already have in the event payload and store it in a database.
To do custom processing of the event requests, we installed Traefik in front of GARM.
That lets us use Traefik's mirroring feature (xref) to send the traffic to GARM and also to our custom piece of code, which parses the event payload and stores information like org/repo, job started, job ended, job queued, job status, ... (see the sketch below).

With that we are, e.g., able to see how delayed GitHub's event delivery to our system is.
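For illustration, a minimal sketch of such a mirror target could look like this (this is not the code referenced above; the /webhooks path, the port, and a log line standing in for the database insert are placeholder assumptions):

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// workflowJobEvent extracts only the fields of interest from GitHub's
// "workflow_job" webhook payload.
type workflowJobEvent struct {
	Action      string `json:"action"` // queued, in_progress, completed
	WorkflowJob struct {
		ID          int64      `json:"id"`
		StartedAt   *time.Time `json:"started_at"`
		CompletedAt *time.Time `json:"completed_at"`
	} `json:"workflow_job"`
	Repository struct {
		FullName string `json:"full_name"` // org/repo
	} `json:"repository"`
}

func handleMirror(w http.ResponseWriter, r *http.Request) {
	// The mirror receives every event; only workflow_job is of interest here.
	if r.Header.Get("X-GitHub-Event") != "workflow_job" {
		w.WriteHeader(http.StatusOK)
		return
	}
	var ev workflowJobEvent
	if err := json.NewDecoder(r.Body).Decode(&ev); err != nil {
		http.Error(w, "bad payload", http.StatusBadRequest)
		return
	}
	// A real setup would insert into a database; logging stands in for that.
	// Comparing received_at with the job timestamps shows how delayed
	// GitHub's delivery is.
	log.Printf("repo=%s job=%d action=%s started_at=%v completed_at=%v received_at=%s",
		ev.Repository.FullName, ev.WorkflowJob.ID, ev.Action,
		ev.WorkflowJob.StartedAt, ev.WorkflowJob.CompletedAt,
		time.Now().UTC().Format(time.RFC3339))
	w.WriteHeader(http.StatusOK)
}

func main() {
	http.HandleFunc("/webhooks", handleMirror)
	log.Fatal(http.ListenAndServe(":8080", nil))
}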
