RPC client-server does not work between macos and linux #616

Open
jspanchu opened this issue Aug 24, 2022 · 1 comment

Comments

@jspanchu

Describe the bug
The Thallium hello-world RPC example fails in a heterogeneous environment (macOS + Linux). See hello-world. I modified the source to use the 'sockets' provider instead of 'tcp'. I am posting this here because the error messages come from Mercury, and possibly from libfabric.

Run the server on macOS:

~/hello-thallium  $ ./server
Server running at address ofi+sockets://10.50.58.248:39517
# [80739.928023] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:2431
 # na_ofi_addr_map_insert(): fi_av_insert() failed, inserted: 0
# [80739.928109] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:2320
 # na_ofi_addr_key_lookup(): Could not insert new address
# [80739.928120] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:4756
 # na_ofi_cq_process_recv_unexpected_event(): Could not lookup address
# [80739.928128] mercury->msg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:4680
 # na_ofi_cq_process_event(): Could not process unexpected recv event
# [80739.928156] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3917
 # hg_core_progress_na(): Could not make progress on NA (NA_PROTOCOL_ERROR)
# [80739.928167] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3809
 # hg_core_poll_wait(): hg_core_progress_na() failed
# [80739.928173] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3708
 # hg_core_progress(): Could not make blocking progress on context
# [80739.928180] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:5077
 # HG_Core_progress(): Could not make progress
# [80739.928208] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury.c:2074
 # HG_Progress(): Could not make progress on context (HG_PROTOCOL_ERROR)
[critical] unexpected return code (12: HG_PROTOCOL_ERROR) from HG_Progress()
Assertion failed: (0), function __margo_hg_progress_fn, file margo-core.c, line 1659.
zsh: abort      ./server

Then run the client on Linux:

$ ./client ofi+sockets://10.50.58.248:39517

I get the same output with the roles swapped, i.e. with the client on macOS and the server on Linux.

To Reproduce
Steps to reproduce the behavior:
On macOS, spack installs [email protected], which crashes the server with a segmentation fault, so use argobots@main on both Linux and macOS with these commands:

$ spack install mochi-thallium@develop^libfabric fabrics=tcp,rxm,sockets ^argobots@main
$ spack load mochi-thallium@develop

Compile

  1. server.cpp
// c++ --std=c++14 -o server server.cpp `pkg-config --cflags --libs thallium`
#include <iostream>
#include <thallium.hpp>

namespace tl = thallium;

void hello(const tl::request& req) {
    std::cout << "Hello World!" << std::endl;
}

int main(int argc, char** argv) {
    HG_Set_log_level("debug");
    tl::engine myEngine("sockets", THALLIUM_SERVER_MODE);
    myEngine.define("hello", hello).disable_response();
    std::cout << "Server running at address " << myEngine.self() << std::endl;

    return 0;
}
  2. client.cpp
// c++ --std=c++14 -o client client.cpp `pkg-config --cflags --libs thallium`
#include <cstdlib>
#include <iostream>
#include <thallium.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {

    if(argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <address>" << std::endl;
        return 1;  // report the usage error with a non-zero exit status
    }

    tl::engine myEngine("sockets", THALLIUM_CLIENT_MODE);
    tl::remote_procedure hello = myEngine.define("hello").disable_response();
    tl::endpoint server = myEngine.lookup(argv[1]);
    hello.on(server)();

    return 0;
}

Platforms:
macOS: Monterey 12.5.1 on M1 with clang-13.1.6
Linux: Ubuntu 22.04 with GCC 11.2.0

Here is the output of spack spec mochi-thallium on each platform.

# macOS
$ spack spec mochi-thallium 
Input spec
--------------------------------
mochi-thallium

Concretized
--------------------------------
mochi-thallium@develop%[email protected]+cereal~ipo build_type=RelWithDebInfo arch=darwin-monterey-m1
    ^[email protected]%[email protected]~ipo build_type=RelWithDebInfo patches=2dfa0bf arch=darwin-monterey-m1
        ^[email protected]%[email protected]~doc+ncurses+ownlibs~qt build_type=Release arch=darwin-monterey-m1
            ^[email protected]%[email protected]~symlinks+termlib abi=none arch=darwin-monterey-m1
                ^gnuconfig@2021-08-14%[email protected] arch=darwin-monterey-m1
                ^[email protected]%[email protected] arch=darwin-monterey-m1
            ^[email protected]%[email protected]~docs~shared certs=mozilla patches=3fdcf2d arch=darwin-monterey-m1
                ^ca-certificates-mozilla@2022-07-19%[email protected] arch=darwin-monterey-m1
                ^[email protected]%[email protected]+cpanm+shared+threads arch=darwin-monterey-m1
                    ^[email protected]%[email protected]+cxx~docs+stl patches=b231fcc arch=darwin-monterey-m1
                    ^[email protected]%[email protected]~debug~pic+shared arch=darwin-monterey-m1
                        ^[email protected]%[email protected] arch=darwin-monterey-m1
                            ^[email protected]%[email protected] libs=shared,static arch=darwin-monterey-m1
                    ^[email protected]%[email protected] arch=darwin-monterey-m1
                        ^[email protected]%[email protected] arch=darwin-monterey-m1
                    ^[email protected]%[email protected]+optimize+pic+shared patches=0d38234 arch=darwin-monterey-m1
    ^mochi-margo@develop%[email protected]~debug~pvar arch=darwin-monterey-m1
        ^argobots@main%[email protected]~affinity~debug~lazy_stack_alloc+perf~stackunwind~tool~valgrind stackguard=none arch=darwin-monterey-m1
            ^[email protected]%[email protected] patches=35c4492,7793209,a49dd5b arch=darwin-monterey-m1
                ^[email protected]%[email protected]+sigsegv patches=9dc5fbd,bfdffa7 arch=darwin-monterey-m1
                    ^[email protected]%[email protected] arch=darwin-monterey-m1
            ^[email protected]%[email protected] arch=darwin-monterey-m1
            ^[email protected]%[email protected] arch=darwin-monterey-m1
        ^[email protected]%[email protected]~ipo build_type=RelWithDebInfo arch=darwin-monterey-m1
        ^mercury@master%[email protected]~bmi+boostsys+checksum~debug~hwloc~ipo~mpi+ofi~psm~psm2+shared+sm~ucx~udreg build_type=RelWithDebInfo arch=darwin-monterey-m1
            ^[email protected]%[email protected]+atomic+chrono~clanglibcpp~container~context~contract~coroutine+date_time~debug+exception~fiber+filesystem+graph~graph_parallel~icu+iostreams~json+locale+log+math~mpi+multithreaded~nowide~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded~stacktrace+system~taggedlayout+test+thread+timer~type_erasure~versionedlayout+wave cxxstd=98 patches=a440f96 visibility=hidden arch=darwin-monterey-m1
            ^[email protected]%[email protected]~debug~disable-spinlocks~kdreg fabrics=rxm,sockets,tcp arch=darwin-monterey-m1
# linux
$ spack spec mochi-thallium
Input spec
--------------------------------
mochi-thallium

Concretized
--------------------------------
mochi-thallium@develop%[email protected]+cereal~ipo build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
    ^[email protected]%[email protected]~ipo build_type=RelWithDebInfo patches=2dfa0bf arch=linux-ubuntu22.04-icelake
        ^[email protected]%[email protected]~doc+ncurses+ownlibs~qt build_type=Release arch=linux-ubuntu22.04-icelake
            ^[email protected]%[email protected]~symlinks+termlib abi=none arch=linux-ubuntu22.04-icelake
                ^[email protected]%[email protected] arch=linux-ubuntu22.04-icelake
            ^[email protected]%[email protected]~docs~shared certs=mozilla patches=3fdcf2d arch=linux-ubuntu22.04-icelake
                ^ca-certificates-mozilla@2022-03-29%[email protected] arch=linux-ubuntu22.04-icelake
                ^[email protected]%[email protected]+cpanm+shared+threads arch=linux-ubuntu22.04-icelake
                    ^[email protected]%[email protected]+cxx~docs+stl patches=b231fcc arch=linux-ubuntu22.04-icelake
                    ^[email protected]%[email protected]~debug~pic+shared arch=linux-ubuntu22.04-icelake
                        ^[email protected]%[email protected] arch=linux-ubuntu22.04-icelake
                            ^[email protected]%[email protected] libs=shared,static arch=linux-ubuntu22.04-icelake
                    ^[email protected]%[email protected] arch=linux-ubuntu22.04-icelake
                        ^[email protected]%[email protected] arch=linux-ubuntu22.04-icelake
                    ^[email protected]%[email protected]+optimize+pic+shared patches=0d38234 arch=linux-ubuntu22.04-icelake
    ^mochi-margo@develop%[email protected]~pvar arch=linux-ubuntu22.04-icelake
        ^argobots@main%[email protected]~affinity~debug~lazy_stack_alloc+perf~stackunwind~tool~valgrind stackguard=none arch=linux-ubuntu22.04-icelake
            ^[email protected]%[email protected] patches=35c4492,7793209,a49dd5b arch=linux-ubuntu22.04-icelake
                ^[email protected]%[email protected]+sigsegv patches=9dc5fbd,bfdffa7 arch=linux-ubuntu22.04-icelake
                    ^[email protected]%[email protected] arch=linux-ubuntu22.04-icelake
            ^[email protected]%[email protected] arch=linux-ubuntu22.04-icelake
            ^[email protected]%[email protected] arch=linux-ubuntu22.04-icelake
        ^[email protected]%[email protected]~ipo build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
        ^mercury@master%[email protected]~bmi+boostsys+checksum~debug~hwloc~ipo~mpi+ofi~psm~psm2+shared+sm~ucx~udreg build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
            ^[email protected]%[email protected]+atomic+chrono~clanglibcpp~container~context~contract~coroutine+date_time~debug+exception~fiber+filesystem+graph~graph_parallel~icu+iostreams~json+locale+log+math~mpi+multithreaded~nowide~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded~stacktrace+system~taggedlayout+test+thread+timer~type_erasure~versionedlayout+wave cxxstd=98 patches=a440f96 visibility=hidden arch=linux-ubuntu22.04-icelake
            ^[email protected]%[email protected]~debug~disable-spinlocks~kdreg fabrics=rxm,sockets,tcp arch=linux-ubuntu22.04-icelake
@soumagne
Member

soumagne commented Dec 9, 2022

We should investigate what the right solution is for this now, since anything that uses OFI's sockets provider will be unsupported.

@soumagne soumagne added this to the future milestone Dec 9, 2022