Simple MPI-based SSG program failing #13
Comments
In GitLab by @mdorier on Oct 31, 2019, 17:39: changed the description
In GitLab by @mdorier on Oct 31, 2019, 19:48: Update with
In GitLab by @mdorier on Oct 31, 2019, 20:00: Some more testing, introducing some
I'm getting the following:
Then the program either crashes with a segfault (most of the time) or keeps looping, displaying the following message every second:
In GitLab by @mdorier on Oct 31, 2019, 20:03: When the code segfaults, gdb indicates that it does so in src/ssg.c at line 719, and inspecting the variables there shows that
In GitLab by @mdorier on Oct 31, 2019, 20:37: Some more progress, changing
I'll leave the issue open just in case someone runs into the same problem (namely, leaving a group immediately after it is created), but I don't think there is anything that could be done at this point. Maybe a couple of
Something similar could surely be implemented for
In GitLab by @mdorier on Nov 1, 2019, 09:12: This combination of sleep times either hangs or segfaults:
If I increase the last sleep to 2 seconds or more, it hangs every time.
In GitLab by @shanedsnyder on Nov 4, 2019, 16:15: Thanks for all of the details! I think this is partially fixed with 40a22e3. Essentially, the last group member to leave a group was assuming the group view wasn't empty and was poking at memory it should not have been. Now, we just detect when we're the last group member and skip the forwarding of the leave request in that case. I no longer see segfaults in your example code, but I do occasionally get instances where at least one group member never fully shuts down. Investigating that now.
In GitLab by @shanedsnyder on Nov 5, 2019, 15:51: OK, I fixed more of this with 29a7f9d. This bug was causing SSG to skip the step of destroying the group locally when it was unable to forward the leave request to its target (e.g., because the target is shutting down, too). It's an open question whether SSG should destroy the group locally if the leave request forward fails, whether it should just return an error and let the user decide to retry the leave or just destroy (which is what it was doing, except it also caused a hang), or whether it should use a more elaborate retry scheme before destroying locally. I went with the first option for now, which just makes the forward request best effort; we can revisit in the future to see if we want to add some retry logic.
In GitLab by @shanedsnyder on Nov 5, 2019, 15:55: My current status testing your code: I can run without segfaults or hangs with any combination of margo_init threading parameters when using 2 or 3 processes to run your example. I have not tried using ofi+tcp, just shared memory. However, I still see weird behavior when using 4 group members. Specifically, one group member does not appear to shut down cleanly, and it keeps using SWIM to try to reach group members. Investigating that now.
In GitLab by @shanedsnyder on Nov 5, 2019, 18:01: OK, one more update. Commit c4e707e applies one more bug fix you were probably hitting: SSG RPCs were not using
I can run a bunch of iterations of your test code now, using different variations of margo thread parameters and numbers of MPI processes, and nothing crashes. I'll leave this open in case you have more issues. I'll also continue to investigate the crashes I'm seeing in issue #12, which are related to processes implicitly leaving the group (silent fail). It's possible that something happening there could affect this test case in some instances as well.
In GitLab by @mdorier on Oct 31, 2019, 17:37:
Trying out this simple SSG program with the version of ssg that Spack installs by default right now (0.3.0):
Running it on a local machine with 4 ranks gives me this:
and the program hangs.
If I use 0 for the third argument of `margo_init` (no progress thread), I get this:
and the program hangs.
If I use "ofi+tcp" instead of "na+sm" and enable a progress thread, the program hangs.
If I don't use a progress loop, I get the following error:
and the program hangs.