
Fix: use max_workers and add worker timeout #18

Open

wants to merge 1 commit into master

Conversation

geektophe

This patch forces the module to take the `max_workers` parameter into account.

The previous implementation spawned as many workers as the machine has cores, using the `multiprocessing` library. The main issue is that the first operation `multiprocessing` performs when spawning a `Process` is a `fork()` system call. Since the module runs inside the scheduler, it is the scheduler itself that gets forked. With big configurations, the memory consumption is large enough to crash the whole machine.

The number of workers is now controlled by the `max_workers` parameter, which was previously ignored.
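
For illustration, here is a minimal sketch of the bounded spawn; the names `spawn_workers` and `export_chunk` are hypothetical, not the module's actual API:

```python
import multiprocessing

def export_chunk(chunk):
    # Placeholder for the real export work (e.g. flushing a batch of
    # scheduler objects to MongoDB).
    for item in chunk:
        pass

def spawn_workers(jobs, max_workers):
    """Run the export in at most max_workers processes.

    Every multiprocessing.Process() starts with a fork() of the calling
    process; since this code runs inside the scheduler, an unbounded
    pool (one process per core) forks the whole scheduler once per CPU.
    """
    # Round-robin the jobs into at most max_workers chunks.
    chunks = [jobs[i::max_workers] for i in range(max_workers)]
    workers = [multiprocessing.Process(target=export_chunk, args=(chunk,))
               for chunk in chunks if chunk]
    for proc in workers:
        proc.start()
    return workers
```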

The other fix is the addition of a `worker_timeout` parameter.

The previous implementation waited 30 seconds for the spawned workers to join, then simply moved on even if the job wasn't finished. Again with big configurations, if the MongoDB server does not handle the queries quickly enough, stalled scheduler processes (the spawned workers) can be left behind.

With this patch, the scheduler waits for `worker_timeout` seconds, then kills the remaining jobs and logs an error.

Setting `worker_timeout` to `0` forces the scheduler to wait for all children to finish before going further.
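
Under the same caveat (`wait_for_workers` is an illustrative name, not the patch's actual code), the timeout handling could look like this:

```python
import logging
import time

def wait_for_workers(workers, worker_timeout):
    """Join the workers; terminate and log any that outlive the timeout.

    worker_timeout == 0 means no deadline: join() is called without a
    timeout and blocks until every child has finished.
    """
    deadline = time.time() + worker_timeout if worker_timeout else None
    for proc in workers:
        # One shared deadline for the whole pool, rather than a full
        # worker_timeout per worker.
        remaining = None if deadline is None else max(0.0, deadline - time.time())
        proc.join(remaining)
        if proc.is_alive():
            logging.error("worker %s did not finish within %ss, terminating it",
                          proc.name, worker_timeout)
            proc.terminate()
            proc.join()  # reap the killed child
```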


maethor commented Oct 8, 2020

Hi @geektophe, I just pushed a new version which does not use workers anymore. I invite you to test it.
