
KubeAPIDown not working as intended if targets a set of clusters #825

Open
thunko opened this issue Feb 13, 2023 · 3 comments

thunko commented Feb 13, 2023

Hi,

I get the following rule when generating Prometheus alerts for the kube-apiserver:

- "alert": "KubeAPIDown"
    "annotations":
      "description": "KubeAPI has disappeared from Prometheus target discovery."
      "runbook_url": "https://github.com/kubernetes-monitoring/kubernetes-mixin/tree/master/runbook.md#alert-name-kubeapidown"
      "summary": "Target disappeared from Prometheus target discovery."
    "expr": |
      absent(up{job="kube-apiserver"} == 1)
    "for": "15m"
    "labels":
      "severity": "critical"

The issue I'm running into is that my Prometheus instance reads data from several clusters, so this rule doesn't work as intended: absent() evaluates over the whole result set, and the alert will never trigger as long as any kube-apiserver in any cluster is up.
I could create a rule for each cluster, but I'd like to avoid hard-coding cluster names.
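
For reference, a minimal sketch of that hard-coded approach (the cluster names are placeholders) would be one expression per cluster:

# One rule per cluster, e.g.:
absent(up{job="kube-apiserver", cluster="cluster-a"} == 1)
absent(up{job="kube-apiserver", cluster="cluster-b"} == 1)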

Have you run into a similar situation, and what would you suggest for such a use case?
Thank you,


zoftdev commented Aug 30, 2024

+1. The best approach, I think, is to compare against history: if an apiserver that was previously being scraped disappears, raise an alert.
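
A minimal sketch of that idea in PromQL (the one-hour lookback is an arbitrary choice):

# Clusters that had kube-apiserver targets an hour ago but have none now.
group by (cluster) (up{job="kube-apiserver"} offset 1h)
unless on (cluster)
group by (cluster) (up{job="kube-apiserver"})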

Another technique is to move only the "up" rule out into a separate group and deploy that group per cluster. That way we have the common rules plus a per-cluster rule.
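
For example, the common groups stay as they are, and something like this (the group name and cluster value are placeholders) gets rendered once per cluster:

"groups":
- "name": "kube-apiserver-up-cluster-a"
  "rules":
  - "alert": "KubeAPIDown"
    "expr": |
      absent(up{job="kube-apiserver", cluster="cluster-a"} == 1)
    "for": "15m"
    "labels":
      "severity": "critical"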

skl (Collaborator) commented Aug 30, 2024

This is tough when considering auto-scaling node groups. For example, if a node is scaled down and removed intentionally, that shouldn't trigger an alert. So taking every single instance into account seems difficult.

However, you could try to assert that at least one instance of the API server job is present in each cluster with a query like:

# This query lists all clusters found by kube_node_info and marks each one
# as either 1 or 0 depending on whether it has up{job="kube-apiserver"} or
# not (respectively).
#
# List all clusters and mark them with value 0, e.g.
# {cluster="my-cluster-without-apiserver-job"} 0
1 - group by (cluster) (max by (cluster, node) (kube_node_info{cluster!=""}))
unless on (cluster) (
  # ...except those clusters that do have kube-apiserver...
  group by (cluster) (up{job="kube-apiserver", cluster!=""})
)
# ...and list all clusters with kube-apiserver, marked with value 1, e.g.
or on (cluster) (
  # {cluster="my-cluster-with-apiserver-job"} 1
  group by (cluster) (max by (cluster, node) (kube_node_info{cluster!=""}))
)

But this is use-case dependent.

Some users would want ALL clusters to have the apiserver job, which is fairly easy to alert on (look for anything with a value of zero).
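
For example, assuming the query above is saved as a recording rule named cluster:kube_apiserver:present (both that rule name and the alert name below are illustrative), the alert reduces to:

- "alert": "KubeAPIDownInCluster"
  "expr": |
    cluster:kube_apiserver:present == 0
  "for": "15m"
  "labels":
    "severity": "critical"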

However, some users would want apiserver on only certain clusters, which likely needs the query to be modified to match only the subset of clusters that are intended to have the apiserver job, as sketched below.
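
A hypothetical sketch of that modification, assuming the intended clusters share a naming prefix such as prod- (the regex is a placeholder). Each series returned marks a cluster in the subset that is missing the job, so the expression can be used directly as an alert:

# Clusters matching the placeholder regex that lack the kube-apiserver job.
1 - group by (cluster) (max by (cluster, node) (kube_node_info{cluster=~"prod-.*"}))
unless on (cluster) (
  group by (cluster) (up{job="kube-apiserver", cluster=~"prod-.*"})
)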

github-actions bot commented Sep 30, 2024

This issue has not had any activity in the past 30 days, so the
stale label has been added to it.

  • The stale label will be removed if there is new activity
  • The issue will be closed in 7 days if there is no new activity
  • Add the keepalive label to exempt this issue from the stale check action

Thank you for your contributions!

github-actions bot added the stale label Sep 30, 2024