Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache: reduce sleep time #504

Merged
merged 7 commits into from
Dec 13, 2021
Merged
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
16 changes: 12 additions & 4 deletions pyopencl/cache.py
Original file line number Diff line number Diff line change
Expand Up @@ -88,18 +88,26 @@ def __init__(self, cleanup_m, cache_dir):
except OSError:
pass

wait_time_seconds = 0.05
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for this change? It seems really short. If you were running this (by mistake, say) on 1000 nodes sharing a cache directory, I think the resulting thrashing would not be a good thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specific value was taken from https://github.com/tox-dev/py-filelock/blob/a6c8fabc4192fa7a4ae19b1875ee842ec5eb4f61/src/filelock/_api.py#L113. I can't really predict what the result of this with thousands of nodes would be, but we already keep on hitting the 1 second timeout when running with 2 ranks on a single node, which I guess indicates that the existing polling time would cause minutes-long waiting times when running with thousands of ranks.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that argument. Could you add a comment justifying this choice of time-out value? (making reference to the single-node interactive "test" use case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 387011e


# Warn every 10 seconds if not able to acquire lock
warn_attempts = int(10/wait_time_seconds)

# Exit after 60 seconds if not able to acquire lock
exit_attempts = int(60/wait_time_seconds)

from time import sleep
sleep(1)
sleep(wait_time_seconds)

attempts += 1

if attempts > 10:
if attempts % warn_attempts == 0:
from warnings import warn
warn("could not obtain cache lock--delete '%s' if necessary"
matthiasdiener marked this conversation as resolved.
Show resolved Hide resolved
% self.lock_file)

if attempts > 3 * 60:
raise RuntimeError("waited more than three minutes "
if attempts > exit_attempts:
raise RuntimeError("waited more than one minute "
"on the lock file '%s'"
"--something is wrong" % self.lock_file)

Expand Down