Cache: reduce sleep time #504
Conversation
pyopencl/cache.py (Outdated)

```diff
         attempts += 1

-        if attempts > 10:
+        if attempts % (10/wait_time_seconds) == 0:
```
That won't work reliably. `/` unconditionally has a floating-point result, and comparing floating-point numbers to zero is unreliable.
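To illustrate the point being made here (this snippet is not from the PR; the names mirror it, and the `report_every` fix is just one possible remedy): for many values of `wait_time_seconds`, `10 / wait_time_seconds` is not exactly representable, so the float modulus is never exactly zero and the branch never fires.

```python
# Hypothetical illustration of the review comment: comparing a float
# modulus against zero only works when the division happens to be exact.
wait_time_seconds = 0.3  # an interval for which 10 / x is inexact

# The proposed condition never triggers for any integer attempt count,
# so a "report roughly every 10 seconds" branch would be dead code.
hits = [a for a in range(1, 10_000) if a % (10 / wait_time_seconds) == 0]
print(hits)  # []

# A robust variant: compute an integer report interval once, up front.
report_every = max(1, round(10 / wait_time_seconds))
hits = [a for a in range(1, 100) if a % report_every == 0]
print(hits)  # [33, 66, 99]
```

Integer modulus is exact, so the check fires on a predictable schedule regardless of the chosen interval.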
4d13b82 should fix this
Actually, what about not doing any polling at all, i.e., no sleep? There are much more efficient system calls available, like …
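The comment cuts off before naming the call, so the following is only a guess at what was meant: on POSIX, one non-polling option is `flock(2)`, exposed in Python as `fcntl.flock`, where waiters block in the kernel and are woken on release instead of sleeping in a loop. The lock-file name below is hypothetical, not taken from pyopencl.

```python
# Sketch of non-polling locking via flock(2), assuming POSIX.
# The lock path is hypothetical; pyopencl's cache layout differs.
import fcntl
import os
import tempfile

lock_path = os.path.join(tempfile.gettempdir(), "hypothetical-cache.lock")

fd = os.open(lock_path, os.O_CREAT | os.O_RDWR)
fcntl.flock(fd, fcntl.LOCK_EX)      # blocks in the kernel until the lock is free
try:
    pass  # ... read/write the cache directory here ...
finally:
    fcntl.flock(fd, fcntl.LOCK_UN)  # waiters are woken immediately, no poll delay
    os.close(fd)
```

One caveat relevant to a shared cache directory: `flock` locks are advisory, and their behavior on network filesystems such as NFS has historically been unreliable, which may be why a lock-directory-plus-polling scheme was used in the first place.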
Thanks for the suggestion. IMO, anytime there is contention for this lock,
you're running your program poorly. (I.e. this should always be avoidable,
e.g. by setting a private cache folder.) As such, I'm not very inclined to optimize that case.
A bit of background on this PR: we keep seeing this timeout a lot when running multiple MPI ranks on a single node, a situation in which it is difficult to set private cache folders (especially when running interactively).
```diff
@@ -88,18 +88,26 @@ def __init__(self, cleanup_m, cache_dir):
             except OSError:
                 pass

+            wait_time_seconds = 0.05
```
What's the motivation for this change? It seems really short. If you were running this (by mistake, say) on 1000 nodes sharing a cache directory, I think the resulting thrashing would not be a good thing.
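Not part of the PR, but a common middle ground between the two concerns raised here (fast hand-off on a single node vs. thrashing with many nodes sharing a cache directory) is capped exponential backoff with jitter; the helper below is a sketch, not pyopencl code.

```python
# Hypothetical sketch: capped exponential backoff with jitter.
# Early retries are fast (good for the lightly-contended case); under
# sustained contention the interval is bounded by `cap`, limiting load.
import random

def backoff_delays(base=0.05, cap=1.0, n=10):
    """Yield n sleep intervals, doubling from `base` up to `cap`."""
    delay = base
    for _ in range(n):
        # Jitter desynchronizes waiters so they don't all retry at once.
        yield random.uniform(0, delay)
        delay = min(cap, delay * 2)

print([round(d, 3) for d in backoff_delays()])
```

The jitter matters at scale: if a thousand waiters all poll on the same fixed schedule, they hit the filesystem in synchronized bursts; randomizing the interval spreads that load out.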
The specific value was taken from https://github.com/tox-dev/py-filelock/blob/a6c8fabc4192fa7a4ae19b1875ee842ec5eb4f61/src/filelock/_api.py#L113. I can't really predict what the result would be with thousands of nodes, but we already keep hitting the 1-second timeout with just 2 ranks on a single node, which suggests that the existing polling interval could cause minutes-long waits when running with thousands of ranks.
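The "minutes-long" estimate can be sanity-checked with back-of-the-envelope arithmetic (these are assumptions for illustration, not measurements from the PR): if ranks end up acquiring the lock one at a time and each hand-off wastes up to one full poll interval, total wait grows linearly with the rank count.

```python
# Illustrative worst-case model (assumed serialization, not measured):
# each lock hand-off costs up to one poll interval.
def worst_case_wait(ranks, poll_interval_s):
    return ranks * poll_interval_s

for poll in (1.0, 0.05):
    total = worst_case_wait(1000, poll)
    print(f"poll {poll} s, 1000 ranks -> up to {total:.0f} s")
```

Under this model, a 1-second poll with 1000 ranks allows waits on the order of a quarter hour, while 0.05 s keeps the worst case under a minute.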
I can see that argument. Could you add a comment justifying this choice of timeout value, making reference to the single-node interactive "test" use case?
Added in 387011e
CC: #503.