Execution Engine Monitoring #3869

Closed
wdbaruni opened this issue Jan 24, 2024 · 1 comment
Labels
th/production-readiness Get ready for production workloads type/epic Type: A higher level set of issues

Comments

@wdbaruni
Member

wdbaruni commented Jan 24, 2024

The Problem

While working on the use cases, I have repeatedly observed leaked executions: docker containers still running for jobs that the compute node's state marks as stopped. These containers continue to run indefinitely until the user stops them with the docker CLI rather than through bacalhau. Conversely, some executions were marked as Running in the execution store even though their container was no longer running.

A major piece of feedback we received from Mycelial on why they don't want to run docker with bacalhau is that they don't want their users to manage, or even know about, both docker and bacalhau. Our answer to them was that bacalhau will do all of the work and their users won't need to know about docker. We need to deliver on that promise.

The Proposal

While we can fix whatever is causing the leak today, and we should, there will always be edge cases or future bugs that introduce new leaks.

The proposal is to implement a background task in the compute node that periodically monitors all active executions, detects any divergence between the desired state (execution store) and the actual state (e.g. docker containers), and takes corrective action. A metric should be published whenever such a divergence is detected so we can keep track of these incidents.

Preferably, this should be implemented as part of the Execution Runtime story, but we might have to implement something earlier if Execution Runtime is going to be delayed.
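As a rough illustration, the reconciliation loop could look like the sketch below (Go). The `ExecutionStore` and `Runtime` interfaces and the `OnDrift` metric hook are hypothetical names for illustration, not existing bacalhau APIs:

```go
// reconciler.go: periodically compares the execution store (desired state)
// against the container runtime (actual state) and corrects drift.
package compute

import (
	"context"
	"log"
	"time"
)

type ExecutionStore interface {
	// ActiveExecutionIDs returns executions the store believes are running.
	ActiveExecutionIDs(ctx context.Context) (map[string]bool, error)
}

type Runtime interface {
	// RunningExecutionIDs returns executions that actually have a live container.
	RunningExecutionIDs(ctx context.Context) (map[string]bool, error)
	// Stop terminates the container backing the given execution.
	Stop(ctx context.Context, executionID string) error
}

type Reconciler struct {
	Store    ExecutionStore
	Runtime  Runtime
	Interval time.Duration
	OnDrift  func(executionID, kind string) // e.g. publish a metric
}

func (r *Reconciler) Run(ctx context.Context) {
	ticker := time.NewTicker(r.Interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			r.reconcileOnce(ctx)
		}
	}
}

func (r *Reconciler) reconcileOnce(ctx context.Context) {
	desired, err := r.Store.ActiveExecutionIDs(ctx)
	if err != nil {
		log.Printf("reconciler: reading execution store: %v", err)
		return
	}
	actual, err := r.Runtime.RunningExecutionIDs(ctx)
	if err != nil {
		log.Printf("reconciler: listing containers: %v", err)
		return
	}
	// Containers running with no active execution in the store: leaked, stop them.
	for id := range actual {
		if !desired[id] {
			r.drift(id, "leaked_container")
			if err := r.Runtime.Stop(ctx, id); err != nil {
				log.Printf("reconciler: stopping leaked execution %s: %v", id, err)
			}
		}
	}
	// Executions marked Running with no live container: report so their state can be corrected.
	for id := range desired {
		if !actual[id] {
			r.drift(id, "missing_container")
		}
	}
}

func (r *Reconciler) drift(id, kind string) {
	if r.OnDrift != nil {
		r.OnDrift(id, kind)
	}
}
```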

How to identify related executions/containers?

Today we label docker containers with bacalhau, nodeID, jobID and executionID. While this might sound sufficient to detect any leaks, there are some challenges (see the listing sketch after this list):

  1. We can rely on the nodeID to find all containers that belong to a compute node, and stop any container that is not in the execution store. This will cause issues if the node changes its ID at some point, re-initializes its repo, or if multiple nodes with the same ID run on the same host as part of different networks (a very bad practice). I understand that if a node changes its ID it can be treated as a new node, but something still has to clean up the leaked containers. Maybe proper documentation of the manual actions to take when uninstalling bacalhau or renaming a node is sufficient.
  2. We can rely on the executionID and look it up in the execution store of the compute node. If the execution doesn't exist, it belongs to a different node. If it exists, we compare the desired and actual states. This requires a long enough garbage collection window for stopped executions before they are fully deleted from the execution store. We will also face issues similar to the above if the user deletes the repo where the execution store is persisted.
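For reference, a minimal sketch of listing this node's containers by label with the Docker Go SDK. The label key `bacalhau.nodeID` is an assumption for illustration (the exact keys are whatever the compute node sets today), and the `types.ContainerListOptions` form assumes an SDK version around v24:

```go
// List containers that carry this node's bacalhau label using the Docker Go SDK.
package compute

import (
	"context"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/filters"
	"github.com/docker/docker/client"
)

func listBacalhauContainers(ctx context.Context, nodeID string) ([]types.Container, error) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return nil, fmt.Errorf("creating docker client: %w", err)
	}
	defer cli.Close()

	// Filter by the node's own label so we only see containers this node created.
	args := filters.NewArgs(
		filters.Arg("label", fmt.Sprintf("bacalhau.nodeID=%s", nodeID)),
	)
	return cli.ContainerList(ctx, types.ContainerListOptions{
		All:     true, // include stopped containers so their state can be checked too
		Filters: args,
	})
}
```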

Recommendation

  1. First, we should stop executions when the bacalhau agent is stopped, with some grace window for those executions to wrap up. This avoids leaking executions if bacalhau is stopped indefinitely, and if the node ID is changed or the repo is deleted while bacalhau is down. I also believe that stopping locally deployed jobs is the behaviour users would expect when they shut down bacalhau, right? @aronchick (see the shutdown sketch after this list)
  2. Second, we should never run containers with auto-restart enabled. This bounds how long a leaked execution can survive: at worst until the host is restarted. We don't do this today, and we shouldn't in the future, even for long-running jobs.
  3. Third, the previous mitigations only cover graceful shutdowns of bacalhau. To handle ungraceful terminations, we have the following options:
    1. Option 1: run executions as child processes of bacalhau, so that they are terminated together with bacalhau even when it exits ungracefully. I'm not sure this is possible with docker, though.
    2. Option 2: if that is not possible, we should rely on a heartbeat from the bacalhau agent instead of fixed labels. Periodically, bacalhau would increment a heartbeat version on all of its containers or runtime environments; in docker terms, this might mean updating a label on a running container (not sure if that is possible). When a bacalhau agent sees a container with a bacalhau label and an expired heartbeat, it can decide to stop the container even if it was created by another node. Race conditions can occur where multiple nodes on the same host try to stop the same container, but that is acceptable since the action is idempotent.
    3. Option 3: if heartbeating is not possible, our best effort is to rely on the nodeID. We need to document how to run multiple nodes on the same host, make sure each has a unique nodeID (which is very reasonable), and the reasons not to do it except for testing purposes. We also need to document how to uninstall or fully stop bacalhau, including the manual commands to check for and stop any remaining executions.
  4. Lastly, with all the previous mitigations in place, and assuming Option 1 or Option 2 is feasible, it should be safe to rely on the executionID to identify the executions that belong to the current node, detect inconsistencies between desired and actual state, and take action.
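A minimal sketch of the graceful-shutdown behaviour from recommendation 1, assuming a hypothetical `stopAllExecutions` helper and a configurable grace period (both placeholders, not existing bacalhau code):

```go
// On SIGINT/SIGTERM, stop all active executions within a bounded grace window.
package main

import (
	"context"
	"log"
	"os/signal"
	"syscall"
	"time"
)

const shutdownGracePeriod = 30 * time.Second // assumed value; would be configurable

func main() {
	// ctx is cancelled when bacalhau receives a termination signal.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	runAgent(ctx) // blocks until ctx is cancelled

	// Give active executions a bounded window to stop before exiting.
	shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownGracePeriod)
	defer cancel()
	if err := stopAllExecutions(shutdownCtx); err != nil {
		log.Printf("failed to stop all executions before exit: %v", err)
	}
}

// Placeholders standing in for the agent's run loop and execution cleanup.
func runAgent(ctx context.Context)                { <-ctx.Done() }
func stopAllExecutions(ctx context.Context) error { return nil }
```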

Priority

  • Priority: 2 (Medium)
  • Quarter: 2024/Q2
  • Justification:
    • Not a low priority, because we can't onboard actual production workloads with unreliable executions, where containers are not running even though we say they are, or containers are leaking even though we say they have been stopped
    • Not a high priority, because I don't expect to onboard production workloads this quarter or early in the next one. Also, the solution might be better implemented once Execution Runtime is defined
@wdbaruni wdbaruni added type/epic Type: A higher level set of issues th/production-readiness Get ready for production workloads labels Jan 24, 2024
@wdbaruni wdbaruni added this to the v1.4.0 milestone Jan 25, 2024
@wdbaruni wdbaruni removed this from the v1.4.0 milestone Apr 16, 2024
@wdbaruni wdbaruni transferred this issue from another repository Apr 21, 2024
@wdbaruni wdbaruni closed this as not planned Oct 12, 2024
@wdbaruni
Member Author

replaced by a linear project
