Init container TCP timeout causes retries forever #203

Open
sidewinder12s opened this issue Dec 17, 2020 · 7 comments
Labels
bug Something isn't working

Comments

@sidewinder12s

sidewinder12s commented Dec 17, 2020

Describe the bug
It appears that if the Vault init container runs into TCP connection errors (as opposed to an HTTP 500 or 400 error), it will keep retrying forever. The deployment that hit this behavior was only cleaned up when another process removed the entire pod as failed.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy application annotated for vault-agent injection
  2. Have some kind of networking failure on the node that prevents connections to Vault (one way to simulate this is sketched below).
  3. Watch the init container run forever.
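One way to simulate step 2 without breaking a node is to drop egress traffic from the annotated pod with a NetworkPolicy, so connections to Vault time out at the TCP level instead of returning an HTTP error. This is only a sketch: the namespace and pod labels are placeholders, and it assumes a CNI that enforces NetworkPolicy and drops (rather than rejects) blocked packets.

```yaml
# Sketch: deny all egress from the app pod except DNS, so the Vault Agent init
# container sees "dial tcp: i/o timeout" rather than an HTTP error.
# Namespace and labels are placeholders; requires a CNI that enforces NetworkPolicy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: drop-vault-egress
  namespace: my-app
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Egress
  egress:
    # Allow DNS so name resolution still succeeds and the failure is a TCP timeout.
    - ports:
        - protocol: UDP
          port: 53
```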

Application deployment:

Annotations: vault.hashicorp.com/agent-inject: true
              vault.hashicorp.com/agent-inject-secret-airflow: kv-v2/secret
              vault.hashicorp.com/agent-inject-secret-cloud-swe-jwt: kv-v2/secret
              vault.hashicorp.com/agent-inject-status: injected
              vault.hashicorp.com/agent-inject-template-airflow:
                
                {{- with secret "kv-v2/secrett" -}}
                {{- .Data.data.value -}}
                {{- end -}}
              vault.hashicorp.com/agent-inject-template-cloud-swe-jwt:
                
                {{- with secret "kv-v2/secret" -}}
                {{- .Data.data.value -}}
                {{- end -}}
              vault.hashicorp.com/agent-pre-populate-only: true
              vault.hashicorp.com/role: my-role

Dec 16, 2020 @ 09:19:50.568 | 2020-12-16T17:19:50.568Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=1.696763184
Dec 16, 2020 @ 09:19:20.567 | 2020-12-16T17:19:20.567Z [INFO]  auth.handler: authenticating
Dec 16, 2020 @ 09:19:17.792 | 2020-12-16T17:19:17.792Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=2.774728483
Dec 16, 2020 @ 09:18:47.791 | 2020-12-16T17:18:47.791Z [INFO]  auth.handler: authenticating
Dec 16, 2020 @ 09:18:46.354 | 2020-12-16T17:18:46.354Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=1.437024355

Expected behavior

I would have expected the client timeout or retry limits to have an effect, or at least a hard timeout so the agent gives up at some point.
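For anyone hitting this on releases newer than the vault-k8s 0.6.0 / Vault Agent 1.5.4 reported here: Vault Agent's auto-auth later gained a max_backoff setting and an option to exit on authentication errors, and newer vault-k8s releases expose the latter as a pod annotation. The sketch below is an assumption about those newer versions (the agent-auto-auth-exit-on-err annotation is not available in 0.6.0); check your release's annotation docs before relying on it.

```yaml
# Sketch, assuming a vault-k8s release new enough to support
# agent-auto-auth-exit-on-err (not available in the 0.6.0 reported above).
# With it set, the init container exits on repeated auth failures instead of
# retrying forever, letting the pod's normal restart/backoff handling kick in.
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/agent-pre-populate-only: "true"
    vault.hashicorp.com/role: "my-role"
    vault.hashicorp.com/agent-auto-auth-exit-on-err: "true"
```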

Environment

  • Kubernetes version: 1.18
    • Distribution or cloud vendor (OpenShift, EKS, GKE, AKS, etc.): EKS
  • vault-k8s version: 0.6.0

Additional context
I think we've been running into conntrack limits on some of our nodes, which have led to dropped packets, though this failure and its duration make it seem like the node itself had something wrong with it.

sidewinder12s added the bug label on Dec 17, 2020
@jasonodonnell
Contributor

Hi @sidewinder12s, which version of Vault Agent are you using?

@sidewinder12s
Author

Version 1.5.4

@sidewinder12s
Author

The root cause on our k8s cluster was a high number of DNS requests/conntrack entries across the cluster, which would then overload the node the Vault agent was running on. I've since addressed that with reduced kube-proxy usage and node-local DNS caching, but I assume the underlying issue still stands.

@esethuraman

esethuraman commented Feb 1, 2021

@sidewinder12s Can you please detail the steps you used to track down the k8s cluster traffic issue?

@sidewinder12s
Author

I had the Prometheus node exporter, which has metrics for node conntrack entries. I've since seen this bug/behavior a couple of other times where the TCP connection fails rather than returning an HTTP error code: once when we broke routing to Vault across AWS accounts, in addition to this node connection failure.
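For reference, node_exporter exposes node_nf_conntrack_entries and node_nf_conntrack_entries_limit, which are the metrics relevant to spotting this kind of saturation. A rough alert rule along these lines could surface it; this is a sketch assuming the Prometheus Operator's PrometheusRule CRD, and the name and threshold are illustrative.

```yaml
# Sketch, assuming the Prometheus Operator (PrometheusRule CRD) and node_exporter's
# conntrack collector. Fires when a node's conntrack table approaches its limit,
# which is roughly when new TCP connections (e.g. to Vault) start getting dropped.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-conntrack-saturation
spec:
  groups:
    - name: conntrack
      rules:
        - alert: NodeConntrackNearlyFull
          expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            description: "Node conntrack table is above 90% of its limit."
```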

@maksemuz

Experiencing the same issue.
K8s v1.23.16
Vault v1.13.1
Is there any fix/workaround?

@puneetloya

I have seen similar behavior when the Vault server was having trouble: we were seeing context deadline exceeded errors from the client, and the container never crashed until I manually deleted the pod.
