
Simple MPI-based SSG program failing #13

Open
shanedsnyder opened this issue Mar 18, 2021 · 11 comments

@shanedsnyder

In GitLab by @mdorier on Oct 31, 2019, 17:37

Trying out this simple SSG program with the version of ssg that Spack installs by default right now (0.3.0):

#include <margo.h>
#include <ssg.h>
#include <ssg-mpi.h>
#include <mpi.h>
#include <unistd.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    margo_instance_id mid = margo_init("na+sm", MARGO_SERVER_MODE, 1, -1);
    ssg_init(mid);
    ssg_group_id_t gid = ssg_group_create_mpi("mygroup", MPI_COMM_WORLD, NULL, NULL);
    ssg_group_leave(gid);
    ssg_finalize();
    margo_finalize(mid);
    MPI_Finalize();
}

Running it on a local machine with 4 ranks gives me this:

# HG -- Warning -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:2703
 # hg_core_process(): Could not find RPC ID in function map

and the program hangs.

If I use 0 for the third argument of margo_init (no progress thread), I get this:

# HG -- Warning -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:2703
 # hg_core_process(): Could not find RPC ID in function map
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:1239
 # hg_core_finalize(): HG addrs must be freed before finalizing HG (3 remaining)
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:3616
 # HG_Core_finalize(): Cannot finalize HG core layer
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury.c:1120
 # HG_Finalize(): Could not finalize HG core class

and the program hangs.

If I use "ofi+tcp" instead of "na+sm", and enable a progress thread, the program hangs.

If I don't use a progress thread, I get the following error:

# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:1239
 # hg_core_finalize(): HG addrs must be freed before finalizing HG (3 remaining)
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:3616
 # HG_Core_finalize(): Cannot finalize HG core layer
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury.c:1120
 # HG_Finalize(): Could not finalize HG core class

and the program hangs.

@shanedsnyder shanedsnyder self-assigned this Mar 18, 2021
@shanedsnyder

In GitLab by @mdorier on Oct 31, 2019, 17:39

changed the description

@shanedsnyder

In GitLab by @mdorier on Oct 31, 2019, 19:48

Update with ssg@develop

The code

#include <margo.h>
#include <ssg.h>
#include <ssg-mpi.h>
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    margo_instance_id mid = margo_init("na+sm", MARGO_SERVER_MODE, 0, -1);
    ssg_init();
    ssg_group_config config = SSG_GROUP_CONFIG_INITIALIZER;
    ssg_group_id_t gid = ssg_group_create_mpi(mid, "mygroup", MPI_COMM_WORLD, &config, NULL, NULL);
    ssg_group_leave(gid);
    ssg_finalize();
    margo_finalize(mid);
    MPI_Finalize();
}

With na+sm, progress thread disabled

I get this error:

SWIM dping req recv error -- invalid group state
test_ssg_crash: src/swim-fd/swim-fd.c:982: swim_apply_ssg_member_update: Assertion `swim_ctx != NULL' failed.
# NA -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/na/na_sm.c:1489
 # na_sm_send_conn_id(): sendmsg() failed (Broken pipe)
# NA -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/na/na_sm.c:2018
 # na_sm_progress_accept(): Could not send connection ID
# NA -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/na/na_sm.c:1804
 # na_sm_progress_cb(): Could not make progress on accept
# HG Util -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/util/mercury_poll.c:581
 # hg_poll_wait(): poll cb failed
# NA -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/na/na_sm.c:3943
 # na_sm_progress(): hg_poll_wait() failed
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:3014
 # hg_core_progress_na_cb(): Could not make progress on NA
# HG Util -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/util/mercury_poll.c:581
 # hg_poll_wait(): poll cb failed
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:3255
 # hg_core_progress_poll(): hg_poll_wait() failed
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:4820
 # HG_Core_progress(): Could not make progress
WARNING: unexpected return code (6) from HG_Progress()

Then the program crashes.

With na+sm, progress thread enabled

I'm getting the following messages, then the program hangs:

SWIM dping req recv error -- invalid group state
SWIM dping req recv error -- group 13614397414369239985 not found
SWIM iping req recv error -- group 13614397414369239985 not found
SWIM iping req recv error -- invalid group state
SWIM iping req recv error -- invalid group state
SWIM iping req recv error -- group 13614397414369239985 not found
SWIM iping req recv error -- invalid group state
SWIM dping req recv error -- invalid group state
SWIM iping req recv error -- invalid group state
SWIM iping req recv error -- invalid group state
SWIM dping req recv error -- invalid group state
SWIM iping req recv error -- invalid group state
SWIM dping req recv error -- invalid group state

With ofi+tcp, progress thread disabled

I get this error, then the program hangs:

SWIM dping req recv error -- group 13614397414369239985 not found
SWIM dping req recv error -- group 13614397414369239985 not found
SWIM dping req recv error -- group 13614397414369239985 not found
SWIM dping req recv error -- group 13614397414369239985 not found
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:1239
 # hg_core_finalize(): HG addrs must be freed before finalizing HG (3 remaining)
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:3616
 # HG_Core_finalize(): Cannot finalize HG core layer
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury.c:1120
 # HG_Finalize(): Could not finalize HG core class
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:1239
 # hg_core_finalize(): HG addrs must be freed before finalizing HG (3 remaining)
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:3616
 # HG_Core_finalize(): Cannot finalize HG core layer
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury.c:1120
 # HG_Finalize(): Could not finalize HG core class
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:1239
 # hg_core_finalize(): HG addrs must be freed before finalizing HG (3 remaining)
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:3616
 # HG_Core_finalize(): Cannot finalize HG core layer
# HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury.c:1120
 # HG_Finalize(): Could not finalize HG core class

With ofi+tcp, progress thread enabled

I get this series of messages, then the program hangs:

SWIM dping req recv error -- group 13614397414369239985 not found
SWIM dping req recv error -- group 13614397414369239985 not found
SWIM dping req recv error -- group 13614397414369239985 not found
SWIM dping req recv error -- group 13614397414369239985 not found

@shanedsnyder

In GitLab by @mdorier on Oct 31, 2019, 20:00

Some more testing, introducing margo_thread_sleep calls to make sure processes have time to initialize things:

#include <margo.h>
#include <ssg.h>
#include <ssg-mpi.h>
#include <mpi.h>
#include <unistd.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    margo_instance_id mid = margo_init("ofi+tcp", MARGO_SERVER_MODE, 1, -1);
    ssg_init();
    margo_thread_sleep(mid, 1000);
    ssg_group_config config = SSG_GROUP_CONFIG_INITIALIZER;
    ssg_group_id_t gid = ssg_group_create_mpi(mid, "mygroup", MPI_COMM_WORLD, &config, NULL, NULL);
    fprintf(stderr, "Before sleeping\n");
    margo_thread_sleep(mid, 2000*rank);
    fprintf(stderr, "After sleeping\n");
    ssg_group_leave(gid);
    margo_thread_sleep(mid, 2000);
    ssg_finalize();
    margo_finalize(mid);
    MPI_Finalize();
}

I'm getting the following:

[3] Before sleeping
[1] Before sleeping
[0] Before sleeping
[2] Before sleeping
[0] After sleeping
[0] SWIM dping req recv error -- group 13614397414369239985 not found
[0] SWIM dping ack recv error -- group 13614397414369239985 not found
[0] SWIM dping req recv error -- group 13614397414369239985 not found
[1] After sleeping
[1] SWIM dping req recv error -- group 13614397414369239985 not found
[2] After sleeping
[3] After sleeping

Then the program either crashes with a segfault (most of the time) or keeps looping, displaying the following message every second:

[3] # NA -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/na/na_ofi.c:4040
[3]  # na_ofi_msg_send_unexpected(): fi_tsend(unexpected) failed, rc: -113(No route to host)
[3] # HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:2057
[3]  # hg_core_forward_na(): Could not post send for input buffer
[3] # HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury_core.c:4718
[3]  # HG_Core_forward(): Could not forward buffer
[3] # HG -- Error -- /tmp/mdorier/spack-stage/mercury-master-7bbslxovmxwec4veom2zifkt5fgkc4fn/spack-src/src/mercury.c:2092
[3]  # HG_Forward(): Could not forward call

@shanedsnyder

In GitLab by @mdorier on Oct 31, 2019, 20:03

When the code segfaults, gdb indicates that it does so in src/ssg.c at line 719, and inspecting the variables there shows that g_desc->g_data.g->view.member_map is actually NULL.

@shanedsnyder

In GitLab by @mdorier on Oct 31, 2019, 20:37

Some more progress: changing margo_thread_sleep(mid, 2000*rank); into margo_thread_sleep(mid, 1000+2000*rank); removes the segfault, so I guess the segfault was caused by the fact that rank 0 was not sleeping and was immediately destroying the group, which prevented the other ranks from reaching it fast enough. Now I only get Mercury error messages saying that either the forward failed (no route to host) or it could not find the RPC ID, which I guess is normal (though annoying; it would be nice to be able to turn these messages off).

I'll leave the issue open just in case someone runs into the same problem (namely, leaving a group immediately after it is created), but I don't think there is anything that could be done at this point.

Maybe a couple of MPI_Barriers could be added in ssg_group_create_mpi (one at the beginning, one at the end), to make sure that (1) when ssg_group_create_mpi starts creating the group, all processes involved have called ssg_init, and (2) when ssg_group_create_mpi ends, the group has been created on all processes.

Something similar could surely be implemented for ssg_group_create_pmix, using the PMIx equivalent of MPI_Barrier. As for the file-based and address-list-based functions, I don't know what could be used.
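
For illustration, here is a minimal caller-side sketch of that barrier idea (a hypothetical wrapper using the ssg@develop API from the snippets above, not the actual SSG implementation; the real fix would live inside ssg_group_create_mpi itself):

#include <mpi.h>
#include <margo.h>
#include <ssg.h>
#include <ssg-mpi.h>

/* Hypothetical wrapper illustrating the barrier suggestion above. */
static ssg_group_id_t create_group_synced(margo_instance_id mid, MPI_Comm comm) {
    /* (1) every rank has reached this point, i.e. has already called ssg_init */
    MPI_Barrier(comm);
    ssg_group_config config = SSG_GROUP_CONFIG_INITIALIZER;
    ssg_group_id_t gid = ssg_group_create_mpi(mid, "mygroup", comm, &config, NULL, NULL);
    /* (2) the group now exists on every rank before anyone proceeds
     *     (and possibly leaves it right away) */
    MPI_Barrier(comm);
    return gid;
}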

@shanedsnyder

In GitLab by @mdorier on Nov 1, 2019, 09:12

This combination of sleep times either hangs or segfaults:

#include <margo.h>
#include <ssg.h>
#include <ssg-mpi.h>
#include <mpi.h>
#include <unistd.h>
#include <stdio.h>

void update(void* arg, ssg_member_id_t id, ssg_member_update_type_t update_type) {
    fprintf(stderr, "Member %ld updated\n", id);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    margo_instance_id mid = margo_init("ofi+tcp", MARGO_SERVER_MODE, 1, -1);
    ssg_init();
    margo_thread_sleep(mid, 1000);
    ssg_group_config config = SSG_GROUP_CONFIG_INITIALIZER;
    ssg_group_id_t gid = ssg_group_create_mpi(mid, "mygroup", MPI_COMM_WORLD, &config, &update, NULL);
    fprintf(stderr, "Before sleeping\n");
    margo_thread_sleep(mid, 1000*(1+rank));
    fprintf(stderr, "After sleeping\n");
    ssg_group_leave(gid);
    margo_thread_sleep(mid, 1000);
    ssg_finalize();
    margo_finalize(mid);
    MPI_Finalize();
}

If I increase the last sleep to 2 seconds or more, it systematically hangs.
If I increase the second sleep time to, say, 5000*(1+rank), it systematically segfaults.
I also tried setting the config like in the ssg tests, with no success; I always get segfaults.

@shanedsnyder

In GitLab by @shanedsnyder on Nov 4, 2019, 16:15

Thanks for all of the details!

I think this is partially fixed with 40a22e3.

Essentially, the last group member to leave a group was assuming the group view wasn't empty and was poking at memory it should not have been. Now, we just detect when we're the last group member and skip the forwarding of the leave request in that case.
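
Roughly, the new logic looks like this (a paraphrased sketch, not the literal code from 40a22e3):

#include <stdbool.h>
#include <stddef.h>

/* Paraphrased sketch: a leaving member only forwards the leave request if
 * another member remains in its local view to receive it; otherwise it
 * skips the forward and just tears down its local group state. */
static bool should_forward_leave(size_t view_size /* members in local view, including self */) {
    return view_size > 1;
}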

I no longer see segfaults in your example code, but I do occasionally get instances where at least one group member never fully shuts down. Investigating that now.

@shanedsnyder

In GitLab by @shanedsnyder on Nov 5, 2019, 15:51

OK, I fixed more of this with 29a7f9d.

This bug was causing SSG to skip the step of destroying the group locally when it was unable to forward the leave request to its target (e.g., because the target is shutting down, too). ssg_group_leave() was returning an error in this case, but it was also trashing some runtime state, which prevented a clean shutdown.

I guess it's an open question whether SSG should destroy the group locally if forwarding the leave request fails, whether it should just return an error and let the user decide to retry the leave or destroy the group (which is what it was doing, except it also caused a hang), or whether it should use a more elaborate retry scheme before destroying locally. I went with the first option for now, which just makes the forward request best effort -- we can revisit in the future if we want to add some retry logic.
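
As an illustration of the best-effort behavior (the forward and destroy steps are represented by caller-supplied functions here, not SSG's real internals):

#include <stdbool.h>

/* Illustrative sketch only: forward_leave stands in for the internal
 * leave-request forward, destroy_local for the local group teardown. */
static void leave_best_effort(bool (*forward_leave)(void *), void (*destroy_local)(void *), void *group_state) {
    if (!forward_leave(group_state)) {
        /* forward failed, e.g. the target is itself shutting down;
         * ignore the error instead of trashing runtime state */
    }
    /* always destroy the group locally so shutdown can complete */
    destroy_local(group_state);
}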

@shanedsnyder

In GitLab by @shanedsnyder on Nov 5, 2019, 15:55

Current status of my testing: I can run your example without segfaults or hangs, with any combination of margo_init threading parameters, when using 2 or 3 processes. I have not tried ofi+tcp, just shared memory (na+sm).

However, I still see weird behavior when using 4 group members. Specifically, one group member does not appear to shut down cleanly; it keeps using SWIM to try to reach other group members. Investigating that now.

@shanedsnyder

In GitLab by @shanedsnyder on Nov 5, 2019, 18:01

OK, one more update. Commit c4e707e applies one more bug fix you were probably hitting.

SSG RPCs were not using margo_forward_timed(), which could cause them to hang completely if the remote target is entirely offline (already called margo_finalize()).
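
For reference, the difference is roughly the following (illustrative fragment; the handle and input struct are assumed to come from the surrounding RPC code, and the timeout value is arbitrary):

#include <margo.h>

/* margo_forward() can block indefinitely if the target has already called
 * margo_finalize(); margo_forward_timed() gives up after timeout_ms instead. */
static hg_return_t send_req_with_timeout(hg_handle_t handle, void *in) {
    /* return margo_forward(handle, in);   <-- may never return in that case */
    return margo_forward_timed(handle, in, 2000.0 /* ms, arbitrary */);
}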

I can run a bunch of iterations of your test code now, using different variations on margo thread parameters and number of MPI processes, and nothing crashes.

I'll leave this open in case you have more issues. I'll also continue to investigate the crashes I'm seeing in issue #12, which are related to processes implicitly leaving the group (silent fail). It's possible something happening there could affect this test case in some instances, as well.
