Reliable Orchestration #3861

wdbaruni · 2024-01-25T10:32:22Z

The Problem

Today we have many situations where the requester node thinks a compute node is still running a job when it isn't, or where a compute node is still running a job when it shouldn't be. A major reason for this is the way the requester and the compute node communicate: we rely only on live communication and lack periodic checks.

As an example, when a compute node finishes an execution, it communicates that to the requester by calling OnRunComplete on the requester node. However, the requester might be down at that time, the compute node might be isolated, or the requester might hit a transient failure and not handle the request properly (e.g. a transient failure when writing to the job store).
A similar issue can happen when the user stops a job and the requester forwards that to the compute node while the compute node is unreachable. The compute node will then continue to run the job even though the user and requester state say otherwise.
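
To make the missing "periodic check" concrete, here is a minimal sketch of what a reconciliation pass could look like. It is only an illustration of the two failure cases described above, not the design proposed in this issue: the types `ExecState` and `Action`, the function `Reconcile`, and the action names are all hypothetical and do not exist in the codebase.

```go
package main

import (
	"fmt"
	"sort"
)

// ExecState is a hypothetical, simplified execution state.
type ExecState string

const (
	Running   ExecState = "running"
	Completed ExecState = "completed"
	Stopped   ExecState = "stopped"
)

// Action is a hypothetical corrective step produced by a periodic check.
type Action struct {
	ExecutionID string
	Do          string
}

// Reconcile compares the requester's view of each execution with the state
// reported by the compute node and returns the corrections needed.
// Results are sorted by execution ID so the output is deterministic.
func Reconcile(requester, compute map[string]ExecState) []Action {
	var actions []Action
	for id, want := range requester {
		got, known := compute[id]
		switch {
		case !known && want == Running:
			// Requester believes the job is running, compute node has no record of it.
			actions = append(actions, Action{ExecutionID: id, Do: "mark-lost"})
		case known && want == Running && got == Completed:
			// The completion notification (e.g. OnRunComplete) never reached the requester.
			actions = append(actions, Action{ExecutionID: id, Do: "record-completion"})
		case known && want == Stopped && got == Running:
			// The stop request never reached the compute node.
			actions = append(actions, Action{ExecutionID: id, Do: "resend-stop"})
		}
	}
	sort.Slice(actions, func(i, j int) bool {
		return actions[i].ExecutionID < actions[j].ExecutionID
	})
	return actions
}

func main() {
	requester := map[string]ExecState{"e1": Running, "e2": Stopped, "e3": Running}
	compute := map[string]ExecState{"e1": Completed, "e2": Running, "e3": Running}
	for _, a := range Reconcile(requester, compute) {
		fmt.Printf("%s: %s\n", a.ExecutionID, a.Do)
	}
}
```

Running such a pass on a timer, in addition to the live calls, would let either side recover from a missed message instead of staying out of sync indefinitely.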

The Proposal

Multiple changes are required to improve the communication path between the requester and compute nodes, including:

Tasks

Design reliable orchestration #4134 (1 of 1)
Core Event System #4261 (4 of 4; comp/ncl, type/epic)
BoltDB Event Logging #4267 (4 of 4; comp/ncl, type/epic)
Improved Logging #4283 (comp/ncl, type/epic)
Async Job Assignment #4272 (comp/ncl, type/epic)
State Synchronization #4276 (comp/ncl, type/epic)
Event Management #4280 (2 of 2; comp/ncl, type/epic)
Failure Handling #4286 (comp/ncl, type/epic)

wdbaruni · 2024-10-12T17:00:34Z

Replacing the issue with a Linear project.

wdbaruni added this to the v1.4.0 milestone Jan 25, 2024

wdbaruni added the type/epic and th/production-readiness labels Jan 25, 2024

coderabbitai bot mentioned this issue Mar 22, 2024

refactor: bidder to simplfiy exposing errors #3680

Merged

wdbaruni removed this from the v1.4.0 milestone Apr 16, 2024

wdbaruni transferred this issue from another repository Apr 21, 2024

wdbaruni self-assigned this May 21, 2024

wdbaruni mentioned this issue Jul 1, 2024

Better "waiting" information #2533

Open

This was referenced Aug 11, 2024

Bacalhau connect / delete no ability to reconnect #4198

Open

Execution status compute API #3858

Closed

wdbaruni added this to the v1.5.0 milestone Aug 12, 2024

wdbaruni closed this as not planned Oct 12, 2024
