Reliable Orchestration #3861

Closed
4 of 8 tasks
wdbaruni opened this issue Jan 25, 2024 · 1 comment
Labels: th/production-readiness (Get ready for production workloads), type/epic (Type: A higher level set of issues)

@wdbaruni
Member

wdbaruni commented Jan 25, 2024

The Problem

Today there are many situations where the requester node believes a compute node is still running a job when it isn't, or where a compute node is still running a job when it shouldn't be. A major cause is the way the requester and compute nodes communicate: we rely solely on live communication and lack periodic checks.

For example, when a compute node finishes an execution, it notifies the requester by calling OnRunComplete on the requester node. However, the requester might be down at that time, the compute node might be isolated, or the requester might hit a transient failure and fail to handle the request properly (e.g. a transient failure when writing to the job store).
A similar issue can arise when the user stops a job and the requester forwards the stop request to the compute node while the compute node is unreachable. The compute node then continues to run the job even though the user and the requester state say otherwise.
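
To make the two failure modes above concrete, here is a minimal sketch of what a periodic check could compare. It is an illustration only, not the project's implementation: the `ExecutionState` and `Action` types, the `Reconcile` function, and the execution IDs are all hypothetical names introduced for this example. The assumption is that the requester can periodically obtain the compute node's view of its executions and diff it against its own state.

```go
package main

import "fmt"

// ExecutionState is a hypothetical, simplified status of one execution.
type ExecutionState string

const (
	Running   ExecutionState = "running"
	Completed ExecutionState = "completed"
	Stopped   ExecutionState = "stopped"
)

// Action is the hypothetical corrective step a periodic check would take.
type Action string

const (
	NoAction        Action = "none"
	UpdateRequester Action = "update-requester" // requester missed a completion notice
	StopOnCompute   Action = "stop-on-compute"  // compute node missed a stop request
	MarkLost        Action = "mark-lost"        // compute node has no record of the execution
)

// Reconcile compares the requester's view of each execution with the view
// reported by the compute node and returns one action per execution ID.
func Reconcile(requester, compute map[string]ExecutionState) map[string]Action {
	actions := make(map[string]Action, len(requester))
	for id, want := range requester {
		got, known := compute[id]
		switch {
		case !known && want == Running:
			// Requester thinks it is running, compute node does not know it.
			actions[id] = MarkLost
		case want == Running && got == Completed:
			// The completion notification never reached the requester.
			actions[id] = UpdateRequester
		case want == Stopped && got == Running:
			// The stop request never reached the compute node.
			actions[id] = StopOnCompute
		default:
			actions[id] = NoAction
		}
	}
	return actions
}

func main() {
	requester := map[string]ExecutionState{
		"exec-1": Running, // completion notice was lost
		"exec-2": Stopped, // stop request was lost
		"exec-3": Running, // both sides agree
	}
	compute := map[string]ExecutionState{
		"exec-1": Completed,
		"exec-2": Running,
		"exec-3": Running,
	}
	actions := Reconcile(requester, compute)
	for _, id := range []string{"exec-1", "exec-2", "exec-3"} {
		fmt.Printf("%s: %s\n", id, actions[id])
	}
}
```

The point of the sketch is that neither side needs the original message to have been delivered: as long as the two views are compared on a schedule, a lost completion or a lost stop request shows up as a mismatch that can be corrected later.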

The Proposal

Multiple changes are required to improve the communication path between the requester and compute nodes, including:

Tasks

  1. 1 of 1 · wdbaruni
  2. 4 of 4 · comp/ncl type/epic · wdbaruni
  3. 4 of 4 · comp/ncl type/epic · wdbaruni
  4. comp/ncl type/epic · wdbaruni
  5. comp/ncl type/epic · wdbaruni
  6. comp/ncl type/epic · wdbaruni
  7. 2 of 2 · comp/ncl type/epic · wdbaruni
  8. comp/ncl type/epic · wdbaruni
@wdbaruni wdbaruni added this to the v1.4.0 milestone Jan 25, 2024
@wdbaruni wdbaruni added type/epic Type: A higher level set of issues th/production-readiness Get ready for production workloads labels Jan 25, 2024
@wdbaruni wdbaruni removed this from the v1.4.0 milestone Apr 16, 2024
@wdbaruni wdbaruni transferred this issue from another repository Apr 21, 2024
@wdbaruni wdbaruni transferred this issue from another repository Apr 21, 2024
@wdbaruni wdbaruni self-assigned this May 21, 2024
@wdbaruni wdbaruni added this to the v1.5.0 milestone Aug 12, 2024
@wdbaruni wdbaruni closed this as not planned Oct 12, 2024
@wdbaruni
Member Author

Replacing this issue with a Linear project.
