
WeeklyTelcon_20190924

Geoffrey Paulsen edited this page Oct 1, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Charles Shereda (LLNL)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Erik Zeiske
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Intel)
  • Noah Evans (Sandia)
  • Ralph Castain (Intel)
  • Thomas Naughton (ORNL)
  • William Zhang (AWS)
  • Akshay Venkatesh (NVIDIA)

not there today (I keep this for easy cut-n-paste for future notes)

  • Brian Barrett (AWS)
  • Todd Kordenbrock (Sandia)
  • Josh Hursey (IBM)
  • Brendan Cunningham (Intel)
  • Artem Polyakov (Mellanox)
  • Brandon Yates (Intel)
  • George Bosilca (UTK)
  • Joshua Ladd (Mellanox)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Tom Naughton
  • Xin Zhao (Mellanox)
  • mohan (AWS)

Agenda/New Business

lists.open-mpi.org isn't working

  • Unexpected outage. They're doing some transition slowly. Howard is following up.

OFI MTL Fragmentation issue:

  • PR 7004 - master https://github.com/open-mpi/ompi/pull/7004
    • Merged. If message is too large, returns an error.
    • This is an issue with ANY OFI transport as well, not just verbs. Innate in OFI.
    • We all assume this is a band-aid that fixes the Silent part of the issue (most troubling part of this)
    • Technically this is not part of the MPI SPEC, because MPI doesn't have a Max message size.
      • counter-point, all CM MTLs have limits, and CM doesn't protect, and it's been fine so far.
    • LLNL, Intel, HPE, AWS - OFI parties.
    • A better approach might be a layered approach.
  • PR 7005 - v4.0.x https://github.com/open-mpi/ompi/pull/7005
    • Can we pull into v4.0.2rc2 ?
    • Up to the v4.0.2 release managers whether they want to release with this PR, so that it doesn't fail silently.
  • PR 7003 - v3.1 https://github.com/open-mpi/ompi/pull/7003
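The behavior described for PR 7004 (return an error instead of silently mishandling an oversized message) can be sketched as a simple pre-send guard. This is only an illustration, not the code from the PR: `check_send_size`, the `send_status` enum, and the way the limit is passed in are all hypothetical names chosen for the sketch. In libfabric the provider's limit is reported in the `max_msg_size` field of `struct fi_ep_attr`; here it is just a plain parameter so the example stays self-contained.

```c
#include <stddef.h>

/* Hypothetical status codes for this sketch; not Open MPI's real constants. */
enum send_status { SEND_OK = 0, SEND_ERR_TOO_LARGE = -1 };

/*
 * Hypothetical guard run before handing a message to the transport.
 * max_msg_size stands in for the limit the provider advertises.
 * Returning an error here turns a silent failure into a visible one.
 */
enum send_status check_send_size(size_t msg_len, size_t max_msg_size)
{
    if (msg_len > max_msg_size) {
        return SEND_ERR_TOO_LARGE;
    }
    return SEND_OK;
}
```

A caller would check the return value and surface the error to the application rather than posting the send.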

Affinity discussion -

  • Had a good discussion, and Jeff updated

Infrastructure

Process enforcement bots

  • No update (Brian on vacation)

Submodule prototype

  • OMPI has been waiting for some git submodule work in Jenkins on AWS.

    • It's been a few months, with no progress.
    • Three pieces: Jenkins, CI, bot.
      • AWS has a libfabric setup like this for testing.
      • Issue is that they're reworking the design, and will roll it out for both libfabric and open-mpi.
  • Howard and Jeff have access to Jenkins on AWS. Part of the problem is that we don't have much expertise on Jenkins/AWS.

    • William will probably be administering the Jenkins/AWS or communicating with those who will.
  • Merged --recurse-submodules update into ompi-scripts Jenkins script as first step. Let's see if that works.

  • Modular thread re-write (Noah)

    • UGNI and Vader BTLs were getting better performance, not sure why.
    • For modular threading library, might be interesting to decide at compile time or runtime.
    • Previously similar things seemed to be related to ICACHE.

Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • Release goal of Oct 31st.
  • Need to put an RC out soon (will discuss date with Brian)
  • Start drawing up a list of fixes that won't be backported to v3.0.x
    • Datatype bug won't be backported, because it snowballed too big.
    • With the new 3.0.x and 3.1.x releases, will put out a list (in either NEWS or README) of issues fixed in v4.0.x that are NOT being backported, with a "please upgrade" note.

Review v4.0.x Milestones v4.0.2

  • ikrit fix ready to go in.

  • Will

  • 7002 ready to merge.

  • not sure what to do with 7001.

  • Howard tested the CMA workaround PR

  • Issue 6976 - Thinks this is a PSM issue, not a v4.0.x

    • Confirmed that this issue exists.
    • Should this be a blocker for v4.0.2? Think this is an OFI-layer issue.
      • Silent data issue. Would really like a fatal error at Open-MPI layer.
      • Not a regression,
      • Not the default path (the OFI MTL is non-default, unlike the BTL path)
      • IS a default path if built with libfabric
      • Will work on issues.
      • Intel will look at what it might take to add a fatal error check for v4.0.2
  • ABI changes: https://github.com/open-mpi/ompi/issues/6949

    • Linkers are a bit smarter now and we should define our ABI better.
    • Help it work with the tool.
    • Looks like in this
    • We have Open MPI the package, then we have Open MPI and Open SHMEM libraries.
      • Our versioning is on the larger package, not really on library level.
      • Compatibility guarantees are confusing
      • We're letting OpenSHMEM add new functions, though not Open MPI.
      • this is confusing for folks.
      • Tearing this apart will be challenging.
    • Let's take this particular issue seriously.
    • It would be cool to have CI - Geoff signs up to find out more information about tools.
    • This is probably okay for v4.0.2.
    • We should
  • Geoffroy Vallee has a system setup to run cross-compatibility, and can report out which versions are failing. Ralph will forward info to devel-core.

  • Still have some issues; we expect to still have to do an rc2, e.g., https://github.com/open-mpi/ompi/issues/6932.

  • Discuss Issue 6568 - large messages overwhelm put

    • PR 6961 went into master - Nathan said it might help.
      • George commented it's a partial solution.
      • See if this fixes 6568, and if it does consider for v4.0.2
      • Hold off on pulling into v4.0.x until after rc2, for easier regression testing.
      • The other interfaces don't have as tight of constraints, and might not hit this.
    • This SHOULD stay as a blocker, since it ends in hang.
    • We need to look for a workaround.
      • Could disable put completely.
      • Could use an opal_unlikely check of message-size, and only then kick it back if the message size is too large.
    • Does OB1 try put / get, and if these don't work, fall back to send/recv?
    • possibly a flaw in put itself.
    • Jeff will ask George what would be a viable workaround, and identify it.
      • Not signing up to implement.
  • PR6942 - ready to merge.

  • MTT failures in Generic Simple unpack on v4.0.x - segfaults, assertions.

    • DDT-unpack assertion on v4.0.x
  • NERSC - running the IBM suite will always fail because srun won't pass connect-accept.

  • See older weekly notes for prior items.
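One of the workarounds floated above for Issue 6568 is a branch-hinted message-size check that only diverts the rare oversized message away from put. A rough sketch of that idea follows. Reading "kick it back" as "fall back to send/recv" is an assumption, and `choose_path`, `xfer_path`, and `put_limit` are hypothetical names. `SKETCH_UNLIKELY` is a local stand-in for a branch-prediction hint built on the GCC/Clang `__builtin_expect` builtin; it is not Open MPI's own macro.

```c
#include <stddef.h>

/* Local stand-in for a branch-prediction hint (assumption for this sketch). */
#if defined(__GNUC__)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_UNLIKELY(x) (x)
#endif

/* Hypothetical transfer paths: one-sided put, or two-sided send/recv. */
enum xfer_path { XFER_PUT, XFER_SEND_RECV };

/*
 * Pick the transfer path for a message. The common case (message fits
 * within put_limit) stays on the put path; only the rare oversized
 * message is diverted to send/recv.
 */
enum xfer_path choose_path(size_t msg_len, size_t put_limit)
{
    if (SKETCH_UNLIKELY(msg_len > put_limit)) {
        return XFER_SEND_RECV;
    }
    return XFER_PUT;
}
```

The hint only tells the compiler which branch to favor when laying out code, so the check should cost little on the fast path. Whether such a limit can be chosen reliably for a given interface is the open question from the discussion.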

Review Master Pull Requests

  • Howard will test master to see if PR 6961 fixes Issue 6568 (large messages overwhelm put)
    • If it goes well, we can
  • PR 6844 - If Jeff gives the okay, Howard says we should merge this.
    • This does fix what container folks were seeing (having to disable CMA)
    • Trying to talk to each other through vader, will talk to each other (bypassing CMA)
    • XPmem doesn't care about memspaces, just the key to access virtual address space.
    • This is a good PR.
    • Is this for v4.0.x or just master?
      • Need to investigate if it changes datastructures that are exchanged.
    • PMIx did a thing in v3.1.4 to extend the modex at some point, since it was just added to the existing one.
      • So this does it similarly, so shouldn't be an issue.
  • IBM's PGI test has NEVER worked. Is it a real issue or local to IBM?
  • nVidia bought PGI, perhaps someone there could take a look?
    • Akshay said he'd talk to a PGI person at nVidia to see.
  • Edgar mentioned that Mark Allen should rebase PR6756 and get that in to resolve an issue another customer is seeing.

CI status

  • Cray running into problems again. :frown:
    • Back on track.

v5.0.0


Dependencies

PMIx Update

ORTE/PRRTE


Next face to face

MTT

  • IBM has to triage some failures on master and v4.0.x and some test build issues. Josh Hursey thought they might be accidentally mixing XLC and PGI compilers. Will investigate.
  • Cisco has a build failure to investigate.

Back to 2019 WeeklyTelcon-2019
