
Improve support for diagnostics of failed compilation: flag to preserve compilation source packages and logs #2481

Open
gberche-orange opened this issue Dec 7, 2023 · 2 comments


@gberche-orange

Is your feature request related to a problem? Please describe.

As a bosh operator,
In order to diagnose the root cause of bosh release compilation failures,
I need access to the compilation environment after the compilation has failed (such as compilation log files),
And I need to be able to retry executing the failing package compilation command, possibly with additional debugging flags.

Currently, after enabling the director.debug.keep_unreachable_vms property, I'm able to ssh into the compilation vm, but the bosh agent quickly removes the compilation data as observed in the following sample trace:

```
/var/vcap/bosh/log# less current
[...]
2023-12-07_13:19:37.48530 [File System] 2023/12/07 13:19:37 DEBUG - Remove all /var/vcap/data/compile/galera
```

Describe the solution you'd like

A flag in the bosh director instructing the agent to skip cleaning the file system on compilation failure. This flag could have the side effect of preventing reuse of the faulty compilation VM (as its file system might fill up, preventing new compilation jobs from succeeding).
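
For illustration only, a rough sketch of what enabling such a flag could look like when deploying the director with ops-files. The property name `director.debug.keep_compilation_data_on_failure` is hypothetical (it does not exist today; only `director.debug.keep_unreachable_vms` does), and the paths assume a bosh-deployment style manifest:

```bash
# Sketch only: a hypothetical director flag next to the existing
# director.debug.keep_unreachable_vms property. Paths assume a
# bosh-deployment style manifest where the director instance group is named "bosh".
cat > keep-compilation-data.yml <<'EOF'
- type: replace
  path: /instance_groups/name=bosh/properties/director/debug?/keep_unreachable_vms
  value: true
- type: replace
  # hypothetical property proposed by this issue; does not exist today
  path: /instance_groups/name=bosh/properties/director/debug?/keep_compilation_data_on_failure
  value: true
EOF

# then applied with the usual create-env flow, e.g.:
# bosh create-env bosh-deployment/bosh.yml -o keep-compilation-data.yml [other ops-files and vars]
```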

Describe alternatives you've considered

The current workaround is to race against the bosh agent and make a file system copy of the `/var/vcap/data/compile/<package>` directory before the compilation job ends, and then work on the copy to perform compilation diagnostics and retries.

See a full diagnostic example at https://github.com/orange-cloudfoundry/paas-templates/issues/2209
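
For reference, a rough sketch of this workaround, run as root on the compilation VM (kept alive thanks to director.debug.keep_unreachable_vms). The package name galera comes from the trace above; the install target and log paths are illustrative, and the exact packaging invocation used by the agent may differ:

```bash
# Race the agent: snapshot the compile directory before the agent's
# "Remove all /var/vcap/data/compile/<package>" cleanup kicks in.
# Run this while the compilation task is still in progress.
PKG=galera   # failing package, taken from the sample trace above
while [ ! -d "/var/vcap/data/compile/${PKG}" ]; do sleep 1; done
cp -a "/var/vcap/data/compile/${PKG}" "/var/vcap/data/compile-keep-${PKG}"

# Later, retry the packaging script against the copy with extra debugging.
# BOSH_COMPILE_TARGET and BOSH_INSTALL_TARGET are the standard packaging
# environment variables; the install target below is an illustrative scratch dir.
cd "/var/vcap/data/compile-keep-${PKG}"
export BOSH_COMPILE_TARGET="$PWD"
export BOSH_INSTALL_TARGET="/var/vcap/data/compile-keep-${PKG}-install"
mkdir -p "${BOSH_INSTALL_TARGET}"
bash -x packaging 2>&1 | tee "/tmp/packaging-${PKG}-debug.log"
```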

Additional context

The bosh agent currently seems to perform the cleanup unconditionally upon compilation completion. See the likely related sources below:

- cleanup of the whole compilation directory: https://github.com/cloudfoundry/bosh-agent/blob/efdd50448fc8936e68a53ff5ff35c7df3d7e385c/agent/compiler/concrete_compiler.go#L80-L85
- cleanup after the packaging script completes: https://github.com/cloudfoundry/bosh-agent/blob/efdd50448fc8936e68a53ff5ff35c7df3d7e385c/agent/compiler/concrete_compiler.go#L109-L113

@rkoster
Contributor

rkoster commented Dec 14, 2023

There is work underway to add the ability to compile bosh releases directly with the bosh-agent: cloudfoundry/bosh-agent#315. This would allow for a docker-based local development workflow.
Would that address your problem or is this for debugging issues in production?

@gberche-orange
Author

Thanks @rkoster for the follow-up. Our use-case is rather debugging issues when compiling 3rd party bosh releases, in particular in our automated pre-compilation pipelines, which we use to speed up bosh stemcell bumps.

/CC @o-orand @poblin-orange

We'll study in more depth whether replacing our current pipelines, which rely on bosh-director based bosh release compilation, with new pipelines using the bosh agent could help with compilation reproducibility (more at https://github.com/orange-cloudfoundry/paas-templates/issues/2210) and ease diagnostics in case of errors.

At first sight, I think that bosh-agent based release compilation might make it harder to ensure reproducibility with respect to the target iaas stemcells used at runtime: we'd need the new pipelines running the bosh-agent to run on the target stemcell and iaas infrastructure. We have observed in the past that compilation failures are a useful signal of breaking changes in the infrastructure/stemcell. Such breaking changes might be harder to detect when only running the compiled release.

The docker-based bosh-agent approach might therefore not be suited to our use-case of debugging failing compilations: the overhead of recreating the build environment is likely larger than that of the workaround described at #2481 (comment).
