Reliable Orchestration #3861
Labels
th/production-readiness
Get ready for production workloads
type/epic
Type: A higher level set of issues
The Problem
Today there are many situations where the requester node thinks a compute node is still running a job when it isn't, or where a compute node is still running a job when it shouldn't be. A major reason is the way the requester and compute nodes communicate: we rely only on live communication and lack periodic checks.
As an example, when a compute node finishes execution, it reports that by calling OnRunComplete on the requester node. That call can be lost: the requester might be down at that time, the compute node might be isolated, or the requester might hit a transient failure and fail to handle the request properly (e.g. a transient failure when writing to the job store).
A similar issue can happen when the user stops a job and the requester forwards the stop request to the compute node while the compute node is unreachable. The compute node then continues to run the job even though the user and requester state say otherwise.
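To make the missing periodic check concrete, here is a minimal sketch of a requester-side reconciliation pass. It is an illustration under stated assumptions, not code from the project: the State and Action types, their values, and the Reconcile function are all hypothetical names, and the state model is far simpler than a real job store.

```go
package main

import (
	"fmt"
	"sort"
)

// State is a hypothetical, simplified execution state (not the project's real type).
type State string

const (
	Running   State = "Running"
	Completed State = "Completed"
	Stopped   State = "Stopped"
)

// Action is a hypothetical follow-up the requester takes after a periodic check.
type Action string

const (
	MarkCompleted Action = "MarkCompleted" // the completion notification was lost
	MarkFailed    Action = "MarkFailed"    // the compute node no longer knows the execution
	ResendStop    Action = "ResendStop"    // the stop request never reached the compute node
)

// Reconcile compares the requester's recorded state with the state reported
// by a compute node and returns an action for every execution that disagrees.
// Executions on which both sides agree produce no entry.
func Reconcile(requester, compute map[string]State) map[string]Action {
	actions := make(map[string]Action)
	for id, want := range requester {
		got, known := compute[id]
		switch {
		case want == Running && !known:
			actions[id] = MarkFailed
		case want == Running && got == Completed:
			actions[id] = MarkCompleted
		case want == Stopped && known && got == Running:
			actions[id] = ResendStop
		}
	}
	return actions
}

func main() {
	requester := map[string]State{
		"e1": Running, // compute node finished, but the notification was lost
		"e2": Stopped, // user stopped the job, but the compute node was unreachable
		"e3": Running, // both sides agree
		"e4": Running, // compute node has no record of this execution
	}
	compute := map[string]State{"e1": Completed, "e2": Running, "e3": Running}

	actions := Reconcile(requester, compute)

	// Sort the IDs so the output order is deterministic.
	ids := make([]string, 0, len(actions))
	for id := range actions {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	for _, id := range ids {
		fmt.Printf("%s: %s\n", id, actions[id])
	}
}
```

In practice a pass like this would run on a timer (for example a time.Ticker) against each compute node's reported executions, so a lost notification or a missed stop request is corrected on the next check instead of leaving the two sides permanently out of sync.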
The Proposal
Multiple changes are required to improve the communication path between the requester and compute nodes, including:
Tasks