🐛: elbv2: wait for LB active state instead of resolving DNS name #5093
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing the appropriate command. |
Hi @r4f4. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected accordingly. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
In my local tests I see:
I'm going to run Openshift e2e CI tests on this change to check for regressions. /cc @patrickdillon |
@r4f4: GitHub didn't allow me to request PR reviews from the following users: patrickdillon. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
* Fix for DNS name resolution too early: kubernetes-sigs/cluster-api-provider-aws#5093
I don't see any install issues in the e2e tests of openshift/installer#8825 |
@AndiDog I think this is a good compromise regarding your comment in #4976 (comment) . Doing the DNS name check after waiting for the LB to become "active" would address the issue in #5033 but we've got reports of cases where the DNS name resolution didn't happen until the host DNS server was changed to force a DNS cache refresh. |
/ok-to-test |
/test ? |
@r4f4: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/test pull-cluster-api-provider-aws-e2e pull-cluster-api-provider-aws-e2e-conformance |
Yes, this might work. I'm just wondering overall why the DNS check was introduced and what would happen if we remove it completely, i.e. go on even before DNS and LB are ready. Would it just delay cluster readiness because CAP* controllers error out and retry? Would the cluster fail to come up? Did you try such a scenario? Basically I'd like to have some certainty that this change doesn't cause problems in 1 out of 100 cases. |
I was not expecting such a solution to be accepted by capa devs. Let me try to run the openshift e2e tests with the change. I'll report back. |
For what it's worth, when we were using Terraform for provisioning there was no such LB DNS name check. But the Terraform AWS provider does wait for the LB to become active after creation: https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/elbv2/load_balancer.go#L406-L408 |
No install failures in 11 CI jobs: openshift/installer#8927. |
/assign @kubernetes-sigs/cluster-api-provider-aws-maintainers for approval |
Sorry, posted that PR assignment to the wrong PR 🙄 |
Using DNS name resolution to check that the load balancer is working can cause problems that depend on the host running CAPA. On some systems, DNS resolution can fail because of cached DNS responses with very large TTLs, causing very long provisioning times. Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left to the clients.
Force-pushed from 665e6e5 to a880668
Update: rebased on top of master so I could rebase openshift/installer#8927 and re-run the Openshift e2e tests. |
@AndiDog I haven't seen a single regression in the OCP e2e tests so far. Anything else I can do to move this along? |
Right, Chesterton's Fence says we need to understand it before we tear it down. We discussed the motivation in this Slack thread. There is no complete consensus, but to me the key takeaways/concerns are:
I want to reiterate that the current DNS check only ensures that the DNS is resolvable from the manager--not the nodes. In our tests the nodes are always able to resolve the DNS quickly, but in some edge cases (like the DNS resolver for a distinguished engineer's local ISP or AWS GovCloud simulators) DNS can be very slow to propagate.
So in our (OpenShift) case removing this check speeds up--not delays--cluster readiness, because we no longer block on the manager being able to resolve the DNS name before it is needed. According to @vincepri in the thread, CAPI performs a similar check.
We use a different bootstrapping process so I'm not as familiar with kubeadm as I would like, but the only potential for failure would be if the cluster coming up depends on the manager resolving the LB DNS name. I would also point out that the current code arbitrarily assumes clusters use the LB DNS name, whereas it should technically be possible to create a DNS record pointing to the LB IPs (NLB IPs are static for their lifetime). That is an edge case, but we are indeed working on exactly that scenario.
We are extensively testing this in openshift, but again our bootstrapping & install process has some differences with the standard CAPI flow. So far we have only found that the DNS check causes problems rather than preventing them. |
/test pull-cluster-api-provider-aws-e2e |
Let's try again: |
If it fails again, we can see which AWS account it is using and then investigate. |
@r4f4: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
I suspect the timeout failures in the e2e are due to #5131 |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Using DNS name resolution to check that the load balancer is working can cause problems that depend on the host running CAPA. On some systems, DNS resolution can fail because of cached DNS responses with very large TTLs, causing very long provisioning times.
Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left to the clients.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #5032
Special notes for your reviewer:
This is an alternative approach to #4976 and #5033.
Checklist:
Release note: