Init container TCP timeout causes retries forever #203

Open
sidewinder12s opened this issue Dec 17, 2020 · 7 comments
Labels
bug Something isn't working

Comments

@sidewinder12s

sidewinder12s commented Dec 17, 2020

Describe the bug
It appears that if the Vault init container runs into TCP connection errors (as opposed to an HTTP 500 or 400 error), it will keep retrying forever. The deployment that hit this behavior was only cleaned up when another process removed the entire pod as failed.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy application annotated for vault-agent injection
  2. Have some kind of networking failure on the node that prevents connections to Vault (one way to simulate this is sketched below).
  3. Watch the init container run forever.
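One way to simulate step 2 without breaking a node is to drop egress traffic from the annotated pod with a NetworkPolicy, so connections to Vault time out at the TCP level instead of returning an HTTP error. This is only a sketch: the namespace and pod labels are placeholders, and it assumes a CNI that enforces NetworkPolicy and drops (rather than rejects) blocked packets.

```yaml
# Sketch: deny all egress from the app pod except DNS, so the Vault Agent init
# container sees "dial tcp: i/o timeout" rather than an HTTP error.
# Namespace and labels are placeholders; requires a CNI that enforces NetworkPolicy.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: drop-vault-egress
  namespace: my-app
spec:
  podSelector:
    matchLabels:
      app: my-app
  policyTypes:
    - Egress
  egress:
    # Allow DNS so name resolution still succeeds and the failure is a TCP timeout.
    - ports:
        - protocol: UDP
          port: 53
```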

Application deployment:

Annotations: vault.hashicorp.com/agent-inject: true
              vault.hashicorp.com/agent-inject-secret-airflow: kv-v2/secret
              vault.hashicorp.com/agent-inject-secret-cloud-swe-jwt: kv-v2/secret
              vault.hashicorp.com/agent-inject-status: injected
              vault.hashicorp.com/agent-inject-template-airflow:
                
                {{- with secret "kv-v2/secrett" -}}
                {{- .Data.data.value -}}
                {{- end -}}
              vault.hashicorp.com/agent-inject-template-cloud-swe-jwt:
                
                {{- with secret "kv-v2/secret" -}}
                {{- .Data.data.value -}}
                {{- end -}}
              vault.hashicorp.com/agent-pre-populate-only: true
              vault.hashicorp.com/role: my-role

Dec 16, 2020 @ 09:19:50.568 | 2020-12-16T17:19:50.568Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=1.696763184
Dec 16, 2020 @ 09:19:20.567 | 2020-12-16T17:19:20.567Z [INFO]  auth.handler: authenticating
Dec 16, 2020 @ 09:19:17.792 | 2020-12-16T17:19:17.792Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=2.774728483
Dec 16, 2020 @ 09:18:47.791 | 2020-12-16T17:18:47.791Z [INFO]  auth.handler: authenticating
Dec 16, 2020 @ 09:18:46.354 | 2020-12-16T17:18:46.354Z [ERROR] auth.handler: error authenticating: error="Put "https://vault/login": dial tcp: i/o timeout" backoff=1.437024355

Expected behavior

I would have expected the client timeout or retry limits to have an effect, or at least a hard timeout so the agent gives up at some point.
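For anyone hitting this on releases newer than the vault-k8s 0.6.0 / Vault Agent 1.5.4 reported here: Vault Agent's auto-auth later gained a max_backoff setting and an option to exit on authentication errors, and newer vault-k8s releases expose the latter as a pod annotation. The sketch below is an assumption about those newer versions (the agent-auto-auth-exit-on-err annotation is not available in 0.6.0); check your release's annotation docs before relying on it.

```yaml
# Sketch, assuming a vault-k8s release new enough to support
# agent-auto-auth-exit-on-err (not available in the 0.6.0 reported above).
# With it set, the init container exits on repeated auth failures instead of
# retrying forever, letting the pod's normal restart/backoff handling kick in.
metadata:
  annotations:
    vault.hashicorp.com/agent-inject: "true"
    vault.hashicorp.com/agent-pre-populate-only: "true"
    vault.hashicorp.com/role: "my-role"
    vault.hashicorp.com/agent-auto-auth-exit-on-err: "true"
```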

Environment

  • Kubernetes version: 1.18
    • Distribution or cloud vendor (OpenShift, EKS, GKE, AKS, etc.): EKS
  • vault-k8s version: 0.6.0

Additional context
I think we've been running into conntrack limits on some of our nodes, which have led to dropped packets, though this failure and its duration make it seem like the node itself had something wrong with it.

sidewinder12s added the bug label on Dec 17, 2020
@jasonodonnell
Contributor

Hi @sidewinder12s, which version of Vault Agent are you using?

@sidewinder12s
Author

Version 1.5.4

@sidewinder12s
Author

The root cause on our k8s cluster was a high number of DNS requests/conntrack entries across the cluster, which would then overload the node the Vault agent was running on. I've since addressed that with reduced kube-proxy usage and node-local DNS caching, but I assume the underlying issue still stands.

@esethuraman

esethuraman commented Feb 1, 2021

@sidewinder12s Can you please detail the steps you used to track down the k8s cluster traffic issue?

@sidewinder12s
Author

I had the Prometheus node exporter, which has metrics for node conntrack entries. I've since seen this bug/behavior a couple of other times where the TCP connection fails rather than returning an HTTP error code: once when we broke routing to Vault across AWS accounts, in addition to this node connection failure.
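For reference, node_exporter exposes node_nf_conntrack_entries and node_nf_conntrack_entries_limit, which are the metrics relevant to spotting this kind of saturation. A rough alert rule along these lines could surface it; this is a sketch assuming the Prometheus Operator's PrometheusRule CRD, and the name and threshold are illustrative.

```yaml
# Sketch, assuming the Prometheus Operator (PrometheusRule CRD) and node_exporter's
# conntrack collector. Fires when a node's conntrack table approaches its limit,
# which is roughly when new TCP connections (e.g. to Vault) start getting dropped.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-conntrack-saturation
spec:
  groups:
    - name: conntrack
      rules:
        - alert: NodeConntrackNearlyFull
          expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            description: "Node conntrack table is above 90% of its limit."
```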

@maksemuz

Experiencing the same issue.
K8s v1.23.16
Vault v1.13.1
Is there any fix/workaround?

@puneetloya

I have seen similar behavior when the Vault server was having trouble: we were seeing context deadline exceeded errors from the client, and the container never crashed until I manually deleted the pod.
