
Users shall be able to update existing jobs #3871

Closed
wdbaruni opened this issue Jan 22, 2024 · 1 comment
Labels
th/production-readiness (Get ready for production workloads), type/epic (Type: A higher level set of issues)

Comments

@wdbaruni (Member)

The Problem

With the introduction of long-running jobs, users shall be able to update the specs of active jobs: the orchestrator shall deploy updated jobs in place where possible, or select new compute nodes otherwise. Today users must cancel a job and submit a new one whenever they want to change it, which is inefficient, introduces gaps in execution, and discards the versioning and history of previous job instances.

Updates include, but are not limited to:

  1. Update the execution count, where the orchestrator must cancel existing executions or deploy new ones depending on the new count.
  2. Update resources, where the orchestrator should attempt to deploy the update in place and ask the compute node to apply the new resource requirements if the node has capacity, or otherwise find a new node and cancel the previous execution.
  3. Update metadata and labels, where the orchestrator and compute node shall update the execution’s metadata without stopping anything.
  4. Update the engine config or engine type, where again the orchestrator can decide to update in place if the compute node supports the new requirements, or find a new node.
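The decision points above can be sketched as a small update planner. This is an illustrative sketch only: the `JobSpec` fields, `Action` values, and `planUpdate` function are assumptions made for this issue, not Bacalhau's actual types or API.

```go
package main

import "fmt"

// JobSpec is an illustrative subset of a job spec; the type and field
// names are assumptions for this sketch, not Bacalhau's actual API.
type JobSpec struct {
	Count  int
	CPU    float64
	Engine string
	Labels string // flattened metadata, kept comparable for simplicity
}

type Action string

const (
	NoAction   Action = "no-op"
	InPlace    Action = "update-in-place"
	ScaleOnly  Action = "scale"
	Reschedule Action = "reschedule"
)

// planUpdate mirrors the decision points above: an unchanged spec is a
// no-op; resource or engine changes apply in place only when the current
// node can satisfy them; count changes scale executions up or down; and
// metadata-only changes apply in place without stopping anything.
func planUpdate(old, updated JobSpec, nodeCanSatisfy bool) Action {
	if old == updated {
		return NoAction
	}
	if old.CPU != updated.CPU || old.Engine != updated.Engine {
		if nodeCanSatisfy {
			return InPlace
		}
		return Reschedule
	}
	if old.Count != updated.Count {
		return ScaleOnly
	}
	return InPlace // labels/metadata only
}

func main() {
	base := JobSpec{Count: 2, CPU: 1.0, Engine: "docker", Labels: "env=dev"}
	fmt.Println(planUpdate(base, base, true))
	fmt.Println(planUpdate(base, JobSpec{Count: 2, CPU: 2.0, Engine: "docker", Labels: "env=dev"}, false))
	fmt.Println(planUpdate(base, JobSpec{Count: 4, CPU: 1.0, Engine: "docker", Labels: "env=dev"}, true))
}
```

A real planner would of course compare full engine specs and resource vectors rather than two scalar fields, but the branching structure is the same.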

Requirements

More info can be found here

  1. As a user, I want the capability to update the specifications of service and daemon jobs, with Bacalhau rolling out these changes across the network.
  2. As a user, I want the ability to define job update strategies, including rolling updates and blue/green deployments.
  3. As a user, I expect Bacalhau to automatically roll back to a previous job version if a new version is deemed unhealthy within a configurable timeframe.
  4. As a user, I need Bacalhau to conduct health checks to determine if a job is unhealthy, even if the execution environment (e.g., docker container) appears operational.
  5. As a user, I want access to previous versions of my job specifications, with the option to roll back to earlier versions.
  6. As a user, I prefer to submit and query jobs using a unique job name I provide, rather than relying solely on a Bacalhau-generated job ID.
  7. As a user, I want to stop a job using the unique name I have provided, in addition to the Bacalhau-generated ID.
  8. As a user, I expect re-submitting a job with the same name to be treated as an update by Bacalhau, triggering appropriate actions.
  9. As a user, I want re-submitting a job with the same specifications and name to result in no action from Bacalhau and no increment to the job version.
  10. As a user, I desire an option to force Bacalhau to update and re-deploy a job even if the specifications have not changed.
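Requirements 8–10 together imply change detection on resubmission: same name plus identical spec is a no-op, a changed spec is an update, and a force option always redeploys. One way to sketch that detection is a content digest over the spec; the function names here are hypothetical, not Bacalhau's actual implementation.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// specDigest hashes a JSON encoding of a spec so that resubmission of an
// identical spec can be detected. Go's json.Marshal sorts map keys, so
// the encoding is deterministic for map-based specs.
func specDigest(spec any) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// shouldBumpVersion captures requirements 8-10: a changed spec under the
// same name is an update (8), an identical spec is a no-op with no
// version increment (9), and a force flag always redeploys (10).
func shouldBumpVersion(oldDigest, newDigest string, force bool) bool {
	return force || oldDigest != newDigest
}

func main() {
	v1, _ := specDigest(map[string]any{"count": 2, "engine": "docker"})
	v2, _ := specDigest(map[string]any{"count": 2, "engine": "docker"})
	fmt.Println("identical resubmit bumps version:", shouldBumpVersion(v1, v2, false))
	fmt.Println("forced resubmit bumps version:", shouldBumpVersion(v1, v2, true))
}
```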

Open Questions:

  • What is the desired behaviour when re-submitting a batch or an ops job with the same name?
    1. Option 1: If there is no change in the spec and the user did not force an update, take no action. Otherwise, treat it as an update by stopping any existing executions and deploying new ones.
    2. Option 2: Always deploy new executions without stopping existing ones. bacalhau get and bacalhau describe will always return the results and status of the latest job version; users will have to pass a new --version <int> flag to describe or download the results of previous versions. bacalhau stop, on the other hand, will stop all active executions, including those from previous versions; users will have to pass --version <int> to stop a specific version.
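Option 2's bookkeeping can be sketched as a per-name version log. The type and method names below are illustrative assumptions, not Bacalhau's actual API; the `0` sentinel stands in for "no --version flag passed".

```go
package main

import "fmt"

// versionedJob sketches Option 2's bookkeeping: every resubmission under
// the same name appends a new version, describe defaults to the latest
// version, and stop cancels all active versions unless one is pinned.
// The type and method names are illustrative, not Bacalhau's actual API.
type versionedJob struct {
	name     string
	versions []string     // spec snapshots; index 0 holds version 1
	active   map[int]bool // version -> still running
}

func newVersionedJob(name string) *versionedJob {
	return &versionedJob{name: name, active: map[int]bool{}}
}

// submit always deploys a new version without stopping earlier ones.
func (j *versionedJob) submit(spec string) int {
	j.versions = append(j.versions, spec)
	v := len(j.versions)
	j.active[v] = true
	return v
}

// describe returns the spec of the given version; 0 means "latest",
// mirroring the proposed default when no --version flag is passed.
func (j *versionedJob) describe(version int) string {
	if version == 0 {
		version = len(j.versions)
	}
	return j.versions[version-1]
}

// stop cancels a specific version, or every active version when 0.
func (j *versionedJob) stop(version int) {
	if version == 0 {
		for v := range j.active {
			j.active[v] = false
		}
		return
	}
	j.active[version] = false
}

func main() {
	j := newVersionedJob("my-job")
	j.submit("spec-a")
	j.submit("spec-b")
	fmt.Println(j.describe(0), j.describe(1)) // latest vs pinned version
	j.stop(0)
	fmt.Println(j.active)
}
```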
@wdbaruni wdbaruni added the type/epic and th/production-readiness labels Jan 22, 2024
@wdbaruni wdbaruni added this to the v1.4.0 milestone Jan 25, 2024
@wdbaruni wdbaruni removed this from the v1.4.0 milestone Apr 16, 2024
@wdbaruni wdbaruni transferred this issue from another repository Apr 21, 2024
@wdbaruni (Member, Author)

Replaced with a Linear project.

@wdbaruni wdbaruni closed this as not planned Oct 13, 2024