WeeklyTelcon_20190924

Open MPI Weekly Telecon

Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

Geoffrey Paulsen (IBM)
Jeff Squyres (Cisco)
Charles Shereda (LLNL)
David Bernhold (ORNL)
Edgar Gabriel (UH)
Erik Zeiske
Harumi Kuno (HPE)
Howard Pritchard (LANL)
Matthew Dosanjh (Sandia)
Michael Heinz (Intel)
Noah Evans (Sandia)
Ralph Castain (Intel)
Thomas Naughton (ORNL)
William Zhang (AWS)
Akshay Venkatesh (NVIDIA)

not there today (I keep this for easy cut-n-paste for future notes)

Brian Barrett (AWS)
Todd Kordenbrock (Sandia)
Josh Hursey (IBM)
Brendan Cunningham (Intel)
Artem Polyakov (Mellanox)
Brandon Yates (Intel)
George Bosilca (UTK)
Joshua Ladd (Mellanox)
Mark Allen (IBM)
Matias Cabral (Intel)
Nathan Hjelm (Google)
Tom Naughton
Xin Zhao (Mellanox)
mohan (AWS)

Agenda/New Business

lists.open-mpi.org isn't working

Unexpected outage; they're slowly doing some transition. Howard is following up.

OFI MTL Fragmentation issue:

PR 7004 - master https://github.com/open-mpi/ompi/pull/7004
- Merged. If message is too large, returns an error.
- This is an issue with ANY OFI transport, not just verbs; it is innate in OFI.
- We all assume this is a band-aid that fixes the silent part of the issue (the most troubling part).
- Technically this limit is not part of the MPI spec, because MPI doesn't define a maximum message size.
  - Counterpoint: all CM MTLs have limits, CM doesn't protect against them, and it's been fine so far.
- Interested OFI parties: LLNL, Intel, HPE, AWS.
- A better approach might be a layered approach.
PR 7005 - v4.0.x https://github.com/open-mpi/ompi/pull/7005
- Can we pull this into v4.0.2rc2?
- Up to the v4.0.2 release managers; we want to release with this PR so it doesn't fail silently.
PR 7003 - v3.1 https://github.com/open-mpi/ompi/pull/7003
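
The band-aid described above (return an error instead of silently mishandling an oversized message) can be sketched roughly as follows. This is a minimal illustration only, not the code from PR 7004: the function name `sketch_check_msg_size`, the return codes, and the idea of passing the limit as a plain parameter are all assumptions made for this example. In a real OFI-based transport the limit would presumably come from the provider.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical return codes for this sketch (not Open MPI's constants). */
enum { SKETCH_SUCCESS = 0, SKETCH_ERR_TOO_LARGE = -1 };

/*
 * Hypothetical guard: refuse a send whose length exceeds the transport's
 * maximum message size, so the failure is reported instead of being silent.
 */
int sketch_check_msg_size(size_t length, size_t max_msg_size)
{
    if (length > max_msg_size) {
        fprintf(stderr,
                "message of %zu bytes exceeds transport limit of %zu bytes\n",
                length, max_msg_size);
        return SKETCH_ERR_TOO_LARGE;
    }
    return SKETCH_SUCCESS;
}
```

A caller would run this check before handing the buffer to the transport and propagate the error upward. A layered approach, as mentioned above, would instead split the message rather than reject it.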

Affinity discussion -

Had a good discussion, and Jeff updated

Infrastructure

Process enforcement bots

No update (Brian on vacation)

Submodule prototype

OMPI has been waiting for some git submodule work in Jenkins on AWS.
- It's been a few months, with no progress.
- Three pieces: Jenkins, CI, bot.
  - AWS has a libfabric setup like this for testing.
  - The issue is that they're reworking the design, and will roll it out for both libfabric and Open MPI.
Howard and Jeff have access to Jenkins on AWS. Part of the problem is that we don't have much expertise on Jenkins/AWS.
- William will probably be administering the Jenkins/AWS setup or communicating with those who will.
Merged --recurse-submodules update into ompi-scripts Jenkins script as first step. Let's see if that works.
Modular thread re-write (Noah)
- UGNI and Vader BTLs were getting better performance; not sure why.
- For the modular threading library, it might be interesting to decide at compile time or at runtime.
- Previously, similar effects seemed to be related to the instruction cache (ICACHE).

Release Branches

Blockers All Open Blockers

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

Release goal of Oct 31st.
Need to put an RC out soon (will discuss date with Brian)
Start drawing up a list of fixes that won't be backported to v3.0.x
- The datatype bug fix won't be backported, because it snowballed into too large a change.
- At the new 3.0.x and 3.1.x releases, will publish (in either NEWS or README) a list of issues fixed in v4.0.x that are NOT being backported, with a note asking users to upgrade.

Review v4.0.x Milestones v4.0.2

ikrit fix ready to go in.
Will
7002 ready to merge.
not sure what to do with 7001.
Howard tested the CMA workaround PR
Issue 6976 - Believed to be a PSM issue, not specific to v4.0.x.
- Confirmed that this issue exists.
- Should this be a blocker for v4.0.2? We think this is an OFI layer issue.
  - Silent data issue. Would really like a fatal error at the Open MPI layer.
  - Not a regression.
  - Not-default path (OFI MTL (non-default) BTL
  - IS a default path if built with libfabric
  - Will work on issues.
  - Intel will look at what it might take to add a fatal error check for v4.0.2
ABI changes: https://github.com/open-mpi/ompi/issues/6949
- Linkers are a bit smarter now and we should define our ABI better.
- Help it work with the tool.
- Looks like in this
- We have Open MPI the package, then we have Open MPI and Open SHMEM libraries.
  - Our versioning is on the larger package, not really on library level.
  - Compatibility guarantees are confusing
  - We're letting OpenSHMEM add new functions, though not Open MPI.
  - this is confusing for folks.
  - Tearing this apart will be challenging.
- Let's take this particular issue seriously.
- It would be cool to have CI - Geoff signs up to find out more information about tools.
- This is probably okay for v4.0.2.
- We should
Geoffroy Vallee has a system set up to run cross-compatibility tests, and can report which versions are failing. Ralph will forward info to devel-core.
Still have some issues; we expect to still have to do an rc2, e.g., https://github.com/open-mpi/ompi/issues/6932.
Discuss Issue 6568 - large messages overwhelm put
- PR 6961 went into master - Nathan said it might help.
  - George commented it's a partial solution.
  - See if this fixes 6568, and if it does consider for v4.0.2
  - Hold off on pulling into v4.0.x until after rc2, for easier regression testing.
  - The other interfaces don't have constraints as tight, and might not hit this.
- This SHOULD stay as a blocker, since it ends in a hang.
- We need to look for a workaround.
  - Could disable put completely.
  - Could use an opal_unlikely check of message-size, and only then kick it back if the message size is too large.
- OB1 tries put / get, and if these don't work, it falls back to send/recv?
- Possibly a flaw in put itself.
- Jeff will ask George to identify what would be a viable workaround.
  - Not signing up to implement.
PR6942 - ready to merge.
MTT failures in Generic Simple unpack on v4.0.x - segfaults, assertions.
- DDT-unpack assertion on v4.0.x
NERSC - running the IBM suite will always fail because srun won't pass connect-accept.
See older weekly notes for prior items.

Review Master Master Pull Requests

Howard will test master to see if PR 6961 fixes Issue 6568 (large messages overwhelm put)
- If it goes well, we can
PR 6844 - If Jeff gives the okay, Howard says we should merge this.
- This does fix what container folks were seeing (having to disable CMA)
- Processes trying to talk to each other through vader will now be able to do so (bypassing CMA).
- XPMEM doesn't care about memspaces, just the key to access the virtual address space.
- This is a good PR.
- Is this for v4.0.x or just master?
  - Need to investigate whether it changes data structures that are exchanged.
- PMIx did a similar thing in v3.1.4 to extend the modex, since it just added to the existing one.
  - So this does it similarly, so shouldn't be an issue.
IBM's PGI test has NEVER worked. Is it a real issue or local to IBM?
NVIDIA bought PGI; perhaps someone there could take a look?
- Akshay said he'd talk to a PGI person at NVIDIA to see.
Edgar mentioned that Mark Allen should rebase PR6756 and get that in to resolve an issue another customer is seeing.

CI status

Cray running into problems again.
- Back on track.

v5.0.0

No discussion this week.
See older weekly notes for prior items.

Dependencies

PMIx Update

No discussion this week.
See older weekly notes for prior items.

ORTE/PRRTE

No discussion this week.
See older weekly notes for prior items.

Next face to face

MTT

IBM has to triage some failures on master and v4.0.x and some test build issues. Josh Hursey thought they might be accidentally mixing XLC and PGI compilers. Will investigate.
Cisco has a build failure to investigate.
