k3s-agent panics and exits uncleanly after network interruptions #10981

Closed
PeaceRebel opened this issue Oct 3, 2024 · 1 comment

@PeaceRebel

Environmental Info:
K3s Version:
v1.29.4+k3s1

Node(s) CPU architecture, OS, and Version:
Linux my-edge-host 6.10.7-200.fc40.aarch64 # 1 SMP PREEMPT_DYNAMIC Fri Aug 30 00:37:24 UTC 2024 aarch64 GNU/Linux

Cluster Configuration:
1 server in AWS and 32 agent nodes (a mix of amd64 and aarch64 machines)

Describe the bug:
The agent nodes are on the edge and have occasional network interruptions; a node can be offline for a few hours. k3s-agent keeps trying to reach the server, but after a point it panics and the agent crashes (unclean exit). After the crash I see a "failed to get CA certs" error, and the agent does not reconnect to the server once the network is stable. We know that the server is fine, as the other edge nodes are healthy.

Steps To Reproduce:

  • Install K3s
  • Unplug the network cable or block all traffic to and from the agent node

Expected behavior:
The agent shouldn't panic, and it should exit cleanly.

Actual behavior:
k3s-agent crashes, and the service has to be restarted before the agent reports back to the server once the network is stable.

Additional context / logs:

Oct 02 18:20:28 my-edge-host k3s[1525]: time="2024-10-02T18:20:28Z" level=info msg="Connecting to proxy" url="wss://<server-ip>:6443/v1-k3s/connect"
Oct 02 18:20:28 my-edge-host k3s[1525]: time="2024-10-02T18:20:28Z" level=error msg="Failed to connect to proxy. Empty dialer response" error="dial tcp <server-ip>:6443: connect: network is unreachable"
Oct 02 18:20:28 my-edge-host k3s[1525]: time="2024-10-02T18:20:28Z" level=error msg="Remotedialer proxy error; reconecting..." error="dial tcp <server-ip>:6443: connect: network is unreachable" url="wss://<server-ip>:6443/v1-k3s/connect"
Oct 02 18:20:28 my-edge-host k3s[1525]: panic: runtime error: index out of range [2] with length 2
Oct 02 18:20:28 my-edge-host k3s[1525]: goroutine 213962 [running]:
Oct 02 18:20:28 my-edge-host k3s[1525]: github.com/k3s-io/k3s/pkg/agent/loadbalancer.(*LoadBalancer).nextServer(0x40004d3420, {0x4000700648?, 0x4000700648?})
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/src/github.com/k3s-io/k3s/pkg/agent/loadbalancer/servers.go:131 +0x2a0
Oct 02 18:20:28 my-edge-host k3s[1525]: github.com/k3s-io/k3s/pkg/agent/loadbalancer.(*LoadBalancer).dialContext(0x40004d3420, {0x6a37140, 0x400054eb60?}, {0x5ab6058, 0x3}, {0x0?, 0x0?})
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/src/github.com/k3s-io/k3s/pkg/agent/loadbalancer/loadbalancer.go:176 +0x248
Oct 02 18:20:28 my-edge-host k3s[1525]: inet.af/tcpproxy.(*DialProxy).HandleConn(0x400066d8c0, {0x6a54130, 0x4002866188})
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/pkg/mod/inet.af/[email protected]/tcpproxy.go:359 +0xf0
Oct 02 18:20:28 my-edge-host k3s[1525]: inet.af/tcpproxy.(*Proxy).serveConn(0x50bdd40?, {0x6a54130?, 0x4002866188}, {0x400088d190, 0x1, 0x4000aca060?})
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/pkg/mod/inet.af/[email protected]/tcpproxy.go:239 +0x28c
Oct 02 18:20:28 my-edge-host k3s[1525]: created by inet.af/tcpproxy.(*Proxy).serveListener in goroutine 275
Oct 02 18:20:28 my-edge-host k3s[1525]:         /go/pkg/mod/inet.af/[email protected]/tcpproxy.go:221 +0x40
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ An ExecStart= process belonging to unit k3s-agent.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 2.
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit k3s-agent.service has entered the 'failed' state with result 'exit-code'.
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Unit process 4511 (containerd-shim) remains running after unit stopped.
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Unit process 4512 (containerd-shim) remains running after unit stopped.
Oct 02 18:20:28 my-edge-host systemd[1]: k3s-agent.service: Consumed 9min 39.280s CPU time, 304.2M memory peak, 0B memory swap peak.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ The unit k3s-agent.service completed and consumed the indicated resources.
Oct 02 18:20:33 my-edge-host systemd[1]: k3s-agent.service: Scheduled restart job, restart counter is at 1.
░░ Subject: Automatic restarting of a unit has been scheduled
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ Automatic restarting of the unit k3s-agent.service has been scheduled, as the result for
░░ the configured Restart= setting for the unit.
Oct 02 18:20:33 my-edge-host systemd[1]: k3s-agent.service: Found left-over process 4511 (containerd-shim) in control group while starting unit. Ignoring.
Oct 02 18:20:33 my-edge-host systemd[1]: k3s-agent.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 02 18:20:33 my-edge-host systemd[1]: k3s-agent.service: Found left-over process 4512 (containerd-shim) in control group while starting unit. Ignoring
Oct 02 18:20:33 my-edge-host systemd[1]: Starting k3s-agent.service - Lightweight Kubernetes...
░░ Subject: A start job for unit k3s-agent.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit k3s-agent.service has begun execution.
░░
░░ The job identifier is 11608.
Oct 02 18:20:33 my-edge-host sh[15540]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Oct 02 18:20:34 my-edge-host systemd[1]: k3s-agent.service: Found left-over process 4511 (containerd-shim) in control group while starting unit. Ignoring.
Oct 02 18:20:34 my-edge-host systemd[1]: k3s-agent.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 02 18:20:34 my-edge-host systemd[1]: k3s-agent.service: Found left-over process 4512 (containerd-shim) in control group while starting unit. Ignoring.
Oct 02 18:20:34 my-edge-host systemd[1]: k3s-agent.service: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Starting k3s agent v1.29.4+k3s1 (94e29e2e)"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: server.my-aws-server.com:6443"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: <server-ip>:6443"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Removing server from load balancer k3s-agent-load-balancer: server.my-aws-server.com:6443"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [<server-ip>:6443] [default: server.my-aws-server.com:6443]"
Oct 02 18:20:34 my-edge-host k3s[15547]: time="2024-10-02T18:20:34Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:44428->127.0.0.1:6444: read: connection reset by peer"
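
For readers following the trace: "index out of range [2] with length 2" is the message the Go runtime produces when a slice of length 2 is indexed at position 2. One way this class of failure can arise in a round-robin picker is a stored index that outlives a shrinking server list. The sketch below illustrates that general pattern and a bounds-checked alternative. It is not the k3s load balancer code and not necessarily the actual fix; the pool type, its fields, and the remove and nextSafe methods are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// pool is a hypothetical round-robin server list.
// It is NOT the k3s LoadBalancer type.
type pool struct {
	mu      sync.Mutex
	servers []string
	next    int // index of the server to hand out next
}

// remove drops a server from the list without adjusting p.next,
// which is how a stale index can end up past the end of the slice.
func (p *pool) remove(addr string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	kept := p.servers[:0]
	for _, s := range p.servers {
		if s != addr {
			kept = append(kept, s)
		}
	}
	p.servers = kept
}

// nextSafe returns the next server, resetting the stored index if the
// list has shrunk since the index was last advanced.
func (p *pool) nextSafe() (string, bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.servers) == 0 {
		return "", false
	}
	if p.next >= len(p.servers) {
		p.next = 0
	}
	s := p.servers[p.next]
	p.next = (p.next + 1) % len(p.servers)
	return s, true
}

func main() {
	p := &pool{servers: []string{"a:6443", "b:6443", "c:6443"}, next: 2}
	p.remove("c:6443") // the list now has length 2, but next is still 2

	// An unguarded p.servers[p.next] at this point would panic with
	// "index out of range [2] with length 2".
	s, ok := p.nextSafe()
	fmt.Println(s, ok) // prints "a:6443 true"
}
```

The lock and the length check together make the index and the slice consistent at the moment of the read, which is the property an unguarded lookup lacks.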

@brandond
Member

brandond commented Oct 3, 2024

This was fixed in June, please update to a newer release.

@brandond brandond closed this as completed Oct 3, 2024