
Node labels and annotations are reset after restarting k3s-agent in v1.31.1+k3s1 #10957

Closed
christian-schlichtherle opened this issue Sep 28, 2024 · 2 comments

Comments

@christian-schlichtherle

christian-schlichtherle commented Sep 28, 2024

... which was not the case in v1.26.*, so it's a regression from my point of view.

Details:

We are using node labels, annotations and taints to ensure that certain workloads only run on specific nodes and carry some node-specific configuration. It is therefore critical that node labels, annotations and taints are persisted.
In /etc/rancher/k3s/config.yaml, we only have --node-taint: specified. Configuring this ensures that a newly joined node doesn't get unwanted pods scheduled on it prematurely. We do not specify --node-label, however, because...

  • it isn't required, because we use kubectl patch node ... instead to set the desired node labels and annotations (see the sketch below this list)
  • there is no way in config.yaml to specify node annotations, so why bother?
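
For context, a minimal sketch of the setup described above; the taint, label, and annotation keys shown here are placeholders, not the ones we actually use:

```sh
# /etc/rancher/k3s/config.yaml on the agent contains roughly (placeholder key):
#   node-taint:
#     - "example.com/uninitialized=true:NoSchedule"
#
# Labels and annotations are then applied after the node has joined, e.g.:
kubectl patch node <node-name> --type merge \
  -p '{"metadata":{"labels":{"example.com/role":"worker"},"annotations":{"example.com/note":"managed-manually"}}}'
```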

This was all working fine in v1.26.*. Now that we've upgraded to v1.31.1+k3s1, we found that after restarting the k3s-agent service, all our node labels and annotations are gone.

Is there a way to bring the persistence back?

To reproduce the issue, stop the k3s-agent service and wait a few seconds for the master node(s) to recognize the change and remove the node from the cluster. We are using embedded etcd, BTW.
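
A rough reproduction sketch (unit name and node name are placeholders):

```sh
# On the affected worker node:
systemctl stop k3s-agent

# From a server node, watch the node drop out of the cluster:
kubectl get nodes -w

# After restarting the agent, the node re-registers, but the patched
# labels/annotations are no longer there:
systemctl start k3s-agent
kubectl get node <node-name> --show-labels
```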

@brandond
Member

There is nothing in k3s that resets node labels/annotations/taints when the agent is restarted. Similarly, there is nothing in k3s that will delete a node from the cluster when it is down. I suspect you have some 3rd party component enabled that is doing these things.
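
One way to narrow down which component is responsible (a hedged diagnostic sketch, not an official troubleshooting procedure) is to look at which controllers are running and what recently touched the Node objects:

```sh
# List controller-style workloads that could be mutating or deleting
# Node objects (names vary per cluster):
kubectl -n kube-system get deployments,daemonsets

# Check recent cluster events involving Node objects for hints about
# who removed or re-registered the node:
kubectl get events -A --field-selector involvedObject.kind=Node
```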

@christian-schlichtherle
Author

christian-schlichtherle commented Sep 29, 2024

You are right, we are using the hcloud-cloud-controller-manager, and that is deleting nodes that have been offline for 30 seconds. Sorry for barking up the wrong tree.
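
For anyone who lands here with the same symptom: because the Node object is deleted and re-created, one possible mitigation (a sketch, assuming it is acceptable to manage the labels through the k3s config rather than via kubectl patch) is to declare them in config.yaml so they are re-applied at registration; annotations still have to be re-applied separately:

```sh
# /etc/rancher/k3s/config.yaml on the agent (placeholder keys/values):
#   node-label:
#     - "example.com/role=worker"
#   node-taint:
#     - "example.com/uninitialized=true:NoSchedule"
#
# Annotations are not configurable via config.yaml, so re-apply them after
# the node has re-registered, e.g.:
kubectl annotate node <node-name> example.com/note="managed-manually"
```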
