Measure CF-D upgrades appropriately with uptimer #1176

Open
2 of 4 tasks
ctlong opened this issue May 10, 2024 · 3 comments

Comments

@ctlong
Member

ctlong commented May 10, 2024

CF-D pipelines currently use uptimer to capture certain measurements relating to downtime during CF-D upgrades, and to fail the deploys when those measurements exceed certain thresholds. However, due to the retry logic we've added to deploys, the thresholds are not being applied appropriately: uptimer may fail a deploy that otherwise succeeded, and on the retry the deploy is essentially a no-op, so no change occurs and there is no downtime for uptimer to fail on.
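
To make the masking effect concrete, here is a minimal Python sketch of the sequence described above. It is a toy model, not uptimer or pipeline code: `Deployment`, `deploy_and_measure`, `run_with_retry` and the figure of 2 failed requests are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical stand-in for a CF deployment with a pending upgrade."""
    pending_changes: bool = True


def deploy_and_measure(deployment: Deployment) -> int:
    """Run one deploy attempt and return the failed requests observed.

    Assumption for illustration: a deploy that changes something causes
    2 failed requests, and a no-op deploy causes none.
    """
    if deployment.pending_changes:
        deployment.pending_changes = False  # the upgrade itself succeeded
        return 2
    return 0


def run_with_retry(deployment: Deployment, allowed_failures: int, attempts: int) -> list[bool]:
    """Return the pass/fail verdict of each attempt, stopping at the first pass."""
    verdicts = []
    for _ in range(attempts):
        failures = deploy_and_measure(deployment)
        passed = failures <= allowed_failures
        verdicts.append(passed)
        if passed:
            break
    return verdicts


# First attempt fails the threshold; the retry is a no-op and passes.
print(run_with_retry(Deployment(), allowed_failures=0, attempts=2))  # [False, True]
```

With a single attempt the same run reports `[False]`, so in this model the threshold only has teeth when the retry is absent.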

Tasks

@jochenehret
Contributor

Good point. I think we should remove the "retry" for the "bosh-deploy-cf-develop" tasks in the "upgrade-deploy" and "experimental-deploy" jobs.

The first attempts of "bosh-deploy-cf-develop" have indeed been failing often recently, mostly because of the newly introduced "Stats availability" test: cloudfoundry/uptimer#115.

[UPTIMER] 2024/05/10 02:16:16 FAILED (Stats availability): 2 failed attempts to retrieve stats for app exceeded the threshold of 0 allowed failures (Total attempts: 103, pass rate 98.06%)

The current default value for APP_STATS_THRESHOLD is set to 0: https://github.com/cloudfoundry/cf-deployment-concourse-tasks/blob/bbca3545a2a9781f6aa8635b9440a664f184a848/bosh-deploy/task.yml#L113

We should increase this value slightly. However, I'm not sure what a reasonable threshold might be.
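
For reference, the numbers in the log line above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the `evaluate` function is hypothetical, and it assumes (based on the wording of the log message) that the threshold is an absolute count of allowed failures rather than a percentage.

```python
def evaluate(total_attempts: int, failed_attempts: int, allowed_failures: int) -> tuple[float, bool]:
    """Return (pass rate in percent, whether the check passes).

    Assumption: the check fails when the number of failed attempts
    exceeds the allowed count, as the log message suggests.
    """
    pass_rate = 100 * (total_attempts - failed_attempts) / total_attempts
    passed = failed_attempts <= allowed_failures
    return round(pass_rate, 2), passed


# Values from the log line: 103 attempts, 2 failures, threshold 0.
print(evaluate(103, 2, 0))  # (98.06, False)

# The same run with a threshold of 2 allowed failures.
print(evaluate(103, 2, 2))  # (98.06, True)
```

The second call only shows that this particular run would have passed with a threshold of 2; it is not a recommendation for the default value.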

@jochenehret
Contributor

Opened PR: #1177

@ctlong
Member Author

ctlong commented May 14, 2024

@jochenehret called out that uptimer emits failure messages pretty consistently during the tear down, e.g. [UPTIMER] 2024/05/08 20:09:08 FAILURE (App pushability, 2/10): exit status 1, which are confusing coming right after the measurement summaries.

We should improve those messages. Either:

  • Don't emit them at all, if possible.
  • Make the message clearly state that the failure happened during tear down only.

I've added a task to the issue description for this.
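
As a rough illustration of the second option, the sketch below tags a failure line with the phase it occurred in. It is written in Python for brevity and does not reflect uptimer's actual logging code; `format_failure` and its `phase` parameter are hypothetical names, and only the overall line shape is copied from the example message above.

```python
from datetime import datetime


def format_failure(name: str, attempt: int, total: int, error: str,
                   phase: str, now: datetime) -> str:
    """Format a failure line, marking it when it happened during tear down."""
    stamp = now.strftime("%Y/%m/%d %H:%M:%S")
    # Hypothetical marker: only tear down failures get the extra wording.
    marker = " during tear down" if phase == "teardown" else ""
    return f"[UPTIMER] {stamp} FAILURE ({name}, {attempt}/{total}){marker}: {error}"


when = datetime(2024, 5, 8, 20, 9, 8)
print(format_failure("App pushability", 2, 10, "exit status 1", "teardown", when))
# [UPTIMER] 2024/05/08 20:09:08 FAILURE (App pushability, 2/10) during tear down: exit status 1
```

Failures in any other phase keep the existing format, so anything that parses the current messages would be unaffected in this sketch.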
