Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cache: reduce sleep time #504

Merged
merged 7 commits into from
Dec 13, 2021
Merged

Cache: reduce sleep time #504

merged 7 commits into from
Dec 13, 2021

Conversation

matthiasdiener
Copy link
Contributor

@matthiasdiener matthiasdiener commented Jul 15, 2021

CC: #503.


attempts += 1

if attempts > 10:
if attempts % (10/wait_time_seconds) == 0:
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't work reliably. / unconditionally has a floating point result, and comparing floating point numbers to zero is unreliable.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That won't work reliably. / unconditionally has a floating point result, and comparing floating point numbers to zero is unreliable.

4d13b82 should fix this

@yxliang01
Copy link
Contributor

Actually, what about don't do polling at all, i.e. no sleep. There's a lot more efficient system calls available like inotify and seemingly good quality python package watchdog (never use myself though) to do this cross-platform.

@inducer
Copy link
Owner

inducer commented Jul 23, 2021 via email

@matthiasdiener
Copy link
Contributor Author

Thanks for the suggestion. IMO, anytime there is contention for this lock, you're running your program poorly. (I.e. this should always be avoidable, e.g. by setting a private cache folder.) As such, I'm not very inclined to optimize that case.

A bit of background on this PR: We keep seeing this timeout a lot when running multiple MPI ranks on a single node, which is a difficult situation to set private cache folders (especially when running interactively).

@@ -88,18 +88,26 @@ def __init__(self, cleanup_m, cache_dir):
except OSError:
pass

wait_time_seconds = 0.05
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation for this change? It seems really short. If you were running this (by mistake, say) on 1000 nodes sharing a cache directory, I think the resulting thrashing would not be a good thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The specific value was taken from https://github.com/tox-dev/py-filelock/blob/a6c8fabc4192fa7a4ae19b1875ee842ec5eb4f61/src/filelock/_api.py#L113. I can't really predict what the result of this with thousands of nodes would be, but we already keep on hitting the 1 second timeout when running with 2 ranks on a single node, which I guess indicates that the existing polling time would cause minutes-long waiting times when running with thousands of ranks.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see that argument. Could you add a comment justifying this choice of time-out value? (making reference to the single-node interactive "test" use case)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 387011e

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants