Cancelling a task is not resilient enough #1002

Open
benoit74 opened this issue Jul 24, 2024 · 3 comments
@benoit74
Collaborator

When a user or the periodic scheduler requests a task cancellation, the information is stored in the DB.

It is then the worker manager's responsibility to detect that these tasks have been requested for cancellation, and to actually cancel them on the worker.

The limitation is that when the worker manager is down (e.g. because one task is consuming far too much disk and the worker manager shut itself down, as encountered on zimit.kiwix.org when someone tricked the size limit), nothing can stop the task.

I wonder whether the task manager should itself query the API to check whether it should stop.

The impact of having both the task manager and the worker manager cancel tasks would need to be assessed, but it should be feasible.
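
A minimal sketch of what that self-check could look like, assuming a hypothetical `GET /tasks/{task_id}` endpoint and a `cancel_requested` status value (both are placeholders for illustration, not the actual Zimfarm API):

```python
import time

import requests

API_URL = "https://api.example.org/v1"  # placeholder, not the real Zimfarm API URL
POLL_INTERVAL = 60  # seconds between checks


def cancellation_requested(task_id: str, token: str) -> bool:
    """Ask the API whether cancellation was requested for this task.

    Assumes a hypothetical endpoint returning the task status as JSON.
    """
    resp = requests.get(
        f"{API_URL}/tasks/{task_id}",
        headers={"Authorization": f"Token {token}"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("status") == "cancel_requested"


def wait_for_cancellation(task_id: str, token: str) -> None:
    """Poll the API; return once the task manager should cancel its own task."""
    while not cancellation_requested(task_id, token):
        time.sleep(POLL_INTERVAL)
```

The task manager would run such a loop alongside the scraper and, once it returns, terminate its own containers just as it does today when the worker manager asks it to stop.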

Another concern is that if the task manager is gone, the worker manager has no way to kill the other containers linked to that task. Currently it is the task manager's responsibility to properly terminate all containers associated with it. I think we should account for the scenario where the task manager is gone or unresponsive, and make it possible to kill all containers associated with a given task from the worker manager.
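
For the cleanup part, a rough sketch of what the worker manager could do with the Docker SDK, assuming task containers carry a label with their task id (the label key below is an assumption, not necessarily the one Zimfarm uses):

```python
import docker


def kill_task_containers(task_id: str) -> None:
    """Force-remove every container belonging to a task, even if the task manager is gone.

    Relies on containers being labelled with their task id (label key is an assumption).
    """
    client = docker.from_env()
    containers = client.containers.list(
        all=True, filters={"label": f"task_id={task_id}"}
    )
    for container in containers:
        container.remove(force=True)  # force=True kills the container first if still running
```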

@rgaudin
Member

rgaudin commented Jul 24, 2024

I like that the solution has a clear design and intent, with those roles kept separate.

The manager has responsibilities, and if it's not running they obviously cannot be fulfilled. If the manager is shut down on purpose, then that's the worker admin's responsibility.

If the manager stops itself (currently when available disk space is lower than expected), then we can revisit that action.

Changing the responsibilities of the actors because of some behavior at the feature/bug border feels 😳 to me

@benoit74
Collaborator Author

Makes sense, you're probably right; we should rather consider that the manager should not stop itself, no matter the reason.

When disk space is lower than expected, the manager should continue to run but simply refuse to start new tasks (which is the primary goal, as far as I understand).
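
Something along these lines, where the threshold and path are placeholders:

```python
import shutil

MIN_AVAILABLE_BYTES = 20 * 1024**3  # illustrative threshold: 20 GiB


def can_start_new_task(workdir: str = "/data") -> bool:
    """Tell whether the manager should accept new tasks.

    When disk space is low, the manager keeps running (so running tasks can
    still be cancelled and cleaned up) but declines to schedule anything new.
    """
    return shutil.disk_usage(workdir).free >= MIN_AVAILABLE_BYTES
```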

We should also probably revisit the actions available to a zimfarm worker admin so that it is possible to stop scheduling new tasks (typically because maintenance is needed) while keeping the manager running. This is the current situation on badger2, where we cannot cancel running tasks because the worker admin wants to cleanly stop their worker.

Both would indeed be much simpler and more straightforward.

@rgaudin
Member

rgaudin commented Jul 24, 2024

Exactly. We could also imagine sending richer feedback than a ping from the manager to the API.
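
For instance, instead of a bare ping, the periodic check-in could carry a small status payload (the field names below are made up for illustration, not the actual API):

```python
# Illustrative check-in payload the manager could send instead of a bare ping.
payload = {
    "status": "running",            # or "not-scheduling" during maintenance / low disk
    "free_disk_bytes": 123_456_789,
    "running_task_ids": ["abc123"],
    "accepts_new_tasks": False,
}
```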
