🐛: elbv2: wait for LB active state instead of resolving DNS name #5093
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing the appropriate command. |
Hi @r4f4. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. Once the patch is verified, the new status will be reflected accordingly. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
In my local tests I see:
I'm going to run Openshift e2e CI tests on this change to check for regressions. /cc @patrickdillon |
@r4f4: GitHub didn't allow me to request PR reviews from the following users: patrickdillon. Note that only kubernetes-sigs members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
* Fix for DNS name resolution too early: kubernetes-sigs/cluster-api-provider-aws#5093
I don't see any install issues in the e2e tests of openshift/installer#8825 |
@AndiDog I think this is a good compromise regarding your comment in #4976 (comment) . Doing the DNS name check after waiting for the LB to become "active" would address the issue in #5033 but we've got reports of cases where the DNS name resolution didn't happen until the host DNS server was changed to force a DNS cache refresh. |
/ok-to-test |
/test ? |
@r4f4: The following commands are available to trigger required jobs:
The following commands are available to trigger optional jobs:
Use
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/test pull-cluster-api-provider-aws-e2e pull-cluster-api-provider-aws-e2e-conformance |
Yes, this might work. I'm just wondering overall why the DNS check was introduced and what would happen if we remove it completely, i.e. go on even before DNS and LB are ready. Would it just delay cluster readiness because CAP* controllers error out and retry? Would the cluster fail to come up? Did you try such a scenario? Basically I'd like to have some certainty that this change doesn't cause problems in 1 out of 100 cases. |
I was not expecting such a solution to be accepted by capa devs. Let me try to run the openshift e2e tests with the change. I'll report back. |
For what it's worth, when we were using Terraform for provisioning there was no such LB DNS name check. But the Terraform AWS provider does wait for the LB to become active after creation: https://github.com/hashicorp/terraform-provider-aws/blob/main/internal/service/elbv2/load_balancer.go#L406-L408 |
No install failures in 11 CI jobs: openshift/installer#8927. |
/assign @kubernetes-sigs/cluster-api-provider-aws-maintainers for approval |
Sorry, posted that PR assignment to the wrong PR 🙄 |
Using DNS name resolution to check that the load balancer is working can cause problems that depend on the host running CAPA. On some systems, DNS resolution can fail because of cached DNS responses with very large TTLs, causing very long provisioning times. Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left to the clients.
Force-pushed from 665e6e5 to a880668
Update: rebased on top of master so I could rebase openshift/installer#8927 and re-run the Openshift e2e tests. |
@AndiDog I haven't seen a single regression in the OCP e2e tests so far. Anything else I can do to move this along? |
Right, Chesterton's Fence says we need to understand it before we tear it down. We discussed the motivation in this Slack thread. There is no complete consensus, but to me the key takeaways/concerns are:
I want to reiterate that the current DNS check only ensures that the DNS is resolvable from the manager--not the nodes. In our tests the nodes are always able to resolve the DNS quickly, but in some edge cases (like the DNS resolver for a distinguished engineer's local ISP or AWS GovCloud simulators) DNS can be very slow to propagate.
So in our (OpenShift) case removing this check speeds up--not delays--cluster readiness, because we no longer block on the manager being able to resolve the DNS name before it is needed. According to @vincepri in the thread, CAPI performs a similar check.
We use a different bootstrapping process so I'm not as familiar with kubeadm as I would like, but the only potential for failure would be if the cluster coming up depends on the manager resolving the LB DNS name. I would also point out that the current code arbitrarily assumes clusters use the LB DNS name, whereas it should technically be possible to create a DNS record pointing to the LB IPs (NLB IPs are static for their lifetime). That is an edge case, but we are indeed working on exactly that scenario.
We are extensively testing this in openshift, but again our bootstrapping & install process has some differences with the standard CAPI flow. So far we have only found that the DNS check causes problems rather than preventing them. |
/test pull-cluster-api-provider-aws-e2e |
Let's try again: |
If it fails again, we can see which AWS account it is using and then investigate. |
@r4f4: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
I suspect the timeout failures in the e2e are due to #5131 |
What type of PR is this?
/kind bug
What this PR does / why we need it:
Using DNS name resolution to check that the load balancer is working can cause problems that depend on the host running CAPA. On some systems, DNS resolution can fail because of cached DNS responses with very large TTLs, causing very long provisioning times.
Instead of DNS resolution, let's use the AWS API to check for the load balancer "active" state. Waiting for resolvable DNS names should be left to the clients.
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Fixes #5032
Special notes for your reviewer:
This is an alternative approach to #4976 and #5033.
Checklist:
Release note: