Production - [Alerting] Build Analysis: Exceptions and Errors Alert #12954

Closed
3 tasks
dotnet-eng-status bot opened this issue Mar 24, 2023 · 12 comments
Labels: Grafana Alert, Inactive Alert, Ops - First Responder, Production

Comments

@dotnet-eng-status

dotnet-eng-status bot commented Mar 24, 2023

💔 Metric state changed to alerting

Description and instructions for this alert

  • Message 12

Metric Graph

Go to rule

@dotnet/dnceng, please investigate

Release Note Category

  • Feature changes/additions
  • Bug fixes
  • Internal Infrastructure Improvements

Release Note Description

No need for release notes

Automation information below, do not change

Grafana-Automated-Alert-Id-6fe0b7b34a004f0bad0064a42f9b9135

dotnet-eng-status bot added the Active Alert, Critical, Ops - First Responder, Grafana Alert, and Production labels on Mar 24, 2023
@riarenas
Member

riarenas commented Mar 24, 2023

The alert description mentioned this is usually for catastrophic/unexpected errors, so I took a look in case there was an outage:

We're throwing non-stop exceptions trying to parse the known issue in dotnet/runtime#83655

Invalid pattern '['WasmTestOnBrowser-System.+ END OF WORK ITEM LOG: Command timed out, and was killed]' at offset 21. [x-y] range in reverse order.

I escaped the - and hopefully that will make things stop.
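To illustrate the failure, here is a minimal sketch in Python rather than in the service's own .NET code, so the wording of the engine's message differs from the one quoted above. The pattern text is copied from that error message as shown; because it opens with `[`, the engine appears to read what follows as a character class, where `r-S` is a range whose start sorts after its end. The helper name `compile_error` is hypothetical.

```python
import re
from typing import Optional

# Pattern text copied from the error message above, brackets and quote included.
BROKEN = "['WasmTestOnBrowser-System.+ END OF WORK ITEM LOG: Command timed out, and was killed]"

# Same text with the hyphen escaped, mirroring the fix described above.
ESCAPED = BROKEN.replace("-", r"\-")


def compile_error(pattern: str) -> Optional[str]:
    """Return the regex engine's complaint, or None if the pattern compiles."""
    try:
        re.compile(pattern)
    except re.error as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print("unescaped:", compile_error(BROKEN))
    print("escaped:  ", compile_error(ESCAPED))
```

Escaping the hyphen only makes the pattern parse. Whether a bracketed character class is what the issue author intended is a separate question that this sketch does not answer.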

A few questions on this:

  • Should we keep alerting on what looks like user error writing the message? I know we have an issue to surface errors in the error pattern so this might be at least a stopgap until then?
  • Even with these errors being thrown, it looks like we still found matches for the known issue. How come?

@riarenas riarenas removed the Critical label Mar 24, 2023
@riarenas
Member

FYI @missymessa @AlitzelMendez

@riarenas
Member

  • Even with these errors being thrown, it looks like we still found matches for the known issue. How come?

Figured this one out at least. Initially this issue had an error message, which got the hits, and this only started failing after the issue was switched to an errorPattern that uses the regex.

@missymessa
Member

@AlitzelMendez do you think we should turn this into a warning instead of an error when an invalid pattern occurs?

@riarenas
Member

I believe I initially mentioned I thought this should be an error, since we wouldn't have any way of knowing the service is failing to process these if they were warnings. I'm fine with downgrading if a single bad issue is enough to trigger a critical alert.

@riarenas riarenas self-assigned this Mar 24, 2023
@riarenas
Member

We have stopped throwing the errors as of the fixed regex. Waiting until the alert clears.

@missymessa
Member

missymessa commented Mar 24, 2023

I feel that errors should incur alerts if there's something wrong on our side of things, but malformed regex provided by the user shouldn't meet that bar.

@riarenas
Member

riarenas commented Mar 24, 2023

https://dev.azure.com/dnceng/internal/_git/dotnet-helix-service/pullrequest/30231?_a=files to downgrade this to a warning and stop logging the exception.
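The linked PR is internal, so its contents are not visible here. As a rough sketch of the behavior discussed in this thread (treat a malformed user-supplied pattern as a warning, skip it, and do not log the exception itself), something like the following could work. It is written in Python for illustration only; the logger name, function name, and message text are all hypothetical and are not taken from the service.

```python
import logging
import re
from typing import Optional

# Hypothetical logger name; the real service's logging setup is not shown in this thread.
logger = logging.getLogger("known_issues")


def try_compile_error_pattern(issue_id: str, pattern: str) -> Optional[re.Pattern]:
    """Compile a user-supplied error pattern.

    On a malformed pattern, log a warning (message only, no exception
    traceback) and return None so the caller can skip this known issue.
    """
    try:
        return re.compile(pattern)
    except re.error as exc:
        # Warning level: the fault lies in the issue text, not in the service.
        logger.warning(
            "Skipping known issue %s: invalid error pattern (%s)", issue_id, exc
        )
        return None


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    compiled = try_compile_error_pattern("dotnet/runtime#83655", "[r-S]")
    print("compiled:", compiled)
```

The trade-off raised earlier in the thread still applies: once this is a warning, a bad pattern no longer trips the alert, so the issue author needs some other signal that the pattern was rejected.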

@dotnet-eng-status
Author

💔 Metric state changed to alerting

Description and instructions for this alert

  • Message 14

Metric Graph

Go to rule

dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label on Mar 28, 2023
@dotnet-eng-status
Author

💚 Metric state changed to ok

Description and instructions for this alert

Metric Graph

Go to rule

dotnet-eng-status bot added the Active Alert label and removed the Inactive Alert label on Mar 29, 2023
@dotnet-eng-status
Author

💔 Metric state changed to alerting

Description and instructions for this alert

  • Message 16

Metric Graph

Go to rule

dotnet-eng-status bot added the Inactive Alert label and removed the Active Alert label on Mar 30, 2023
@dotnet-eng-status
Author

💚 Metric state changed to ok

Description and instructions for this alert

Metric Graph

Go to rule
