WeeklyTelcon_20210504

Geoffrey Paulsen edited this page May 5, 2021 · 2 revisions

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Aurelien Bouteiller (UTK)
  • Austen Lauria (IBM)
  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Raghu Raja
  • Sam Gutierrez (LANL)
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic
  • William Zhang (AWS)

Not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Brandon Yates (Intel)
  • Brendan Cunningham (Cornelis Networks)
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Joshua Ladd (nVidia/Mellanox)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Naughton III, Thomas (ORNL)
  • Noah Evans (Sandia)
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia/Mellanox)

New Items

  • Tommy is taking over for Josh Ladd for the short term.
    • Please send Mellanox items to him.
    • He will also help with v5 RM work.
  • Howard was trying to build the most recent OSU benchmarks; they don't build out of the box against master and v5.
    • Howard didn't have mpicxx or mpicpp
    • If this is an actual issue, assign this to Jeff.
    • Also, Joseph set the CC but not the CXX environment variable, and the C++ wrapper wasn't being built.
      • This could be correct behavior even if it's unexpected.

v4.0.x

  • We're still waiting on Datatype issues now reported in v4.1.1
    • Can others replicate the tests/datatype/partial failure with make check?
      • Jeff and George cannot get it to fail.
      • If anyone can make it fail with the original test, then try the debug test with lots of output.
    • Two users have reported it in two different environments.
      • This is concerning.
    • Also run by CI.
    • The test in question (partial.c) is on master. It was not cherry-picked back to the release branches.
      • The test is in PRs merged into v4.1.x, but we haven't merged the PR to v4.0.x yet.
      • Jeff will check the test on v4.1.x branch.
  • Issue 8918 - Another datatype issue we need to look at.
  • Need a review for 8898 (and equivalent v4.1.x)

v4.1.x

  • In holding pattern waiting for Datatype issue.
  • Not taking too many more PRs in case we decide to spin a v4.1.2 with datatype regression fixes

v5.0.0

  • Went through a bunch of stuff last week.
  • At least 3 PRs pending for v5.0
  • Got ROMIO 3.4.1 sync in.
  • Bringing Tommy in for nVidia RM.

Master

  • Examples and tests directory need to get done.
  • Code Refactoring needs to get done.
    • ompi PR 8816 is still open. Needs rebasing.
    • Could be as easy as running clang-format on HEAD, and merging quickly.
    • Any volunteer?
      • Joseph saw that some copyright headers in the OPAL code got scrambled.
        • Fixed in master and v5.0
      • Some macros might need "don't reformat" tags around them.
      • Includes might need reordering to build properly.
    • May need to stop committing other PRs until this gets done.
    • Nathan responded to a ping during the call and will try to get it done Thursday.
  • Should eventually do oshmem as well.
  • Some folks didn't like the results
    • Macros were one area, and that can be addressed with tags.
  • Do we want to set a date to close master if this doesn't get done?
    • Not really, someone should just do it.
  • Scope should only be an hour.
  • Turn on CI on May 14th.

Reformatting master

  • PR 8816
  • Would like Nathan to rebase and merge to master.
  • There are certain blocks we don't want to format (specifically some in datatype)
    • clang-format trips over them.
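
The "don't reformat" tags mentioned above most likely correspond to clang-format's `clang-format off` / `clang-format on` comment markers, which leave everything between them untouched. A minimal sketch in C, assuming hypothetical names (`EXAMPLE_SWAP`, `example_table`, `example_sum_row`) that are illustrative only and not taken from the Open MPI tree:

```c
#include <assert.h>
#include <stddef.h>

/* Everything between the two markers is left exactly as written,
 * so hand-aligned macros and tables survive a clang-format run. */
/* clang-format off */
#define EXAMPLE_SWAP(type, a, b) \
    do {                         \
        type tmp_ = (a);         \
        (a) = (b);               \
        (b) = tmp_;              \
    } while (0)

static const int example_table[2][3] = {
    { 1,   10,   100 },
    { 2,   20,   200 },
};
/* clang-format on */

/* Code outside the markers is reformatted as usual.
 * Hypothetical helper: sums one row of the table above. */
static int example_sum_row(size_t row)
{
    assert(row < 2);
    int sum = 0;
    for (size_t col = 0; col < 3; ++col) {
        sum += example_table[row][col];
    }
    return sum;
}
```

Wrapping only the smallest block that clang-format mishandles keeps the rest of the file under the automated style, which fits the goal of a one-time reformat followed by a CI check.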

PMIx

  • PMIx is trying to maintain standards and library versions that are in sync with each other.
    • There is a PMIx standard version and an open PMIx library version.
  • Added some PMIx v4.1 standard items to the PMIx v4.0 branch
  • Rest the PMIx v4.0 branch without all of v4.1 functionality.
  • Open MPI v5.0 will ship the Open PMIx v4.1 submodule
    • PRRTE v2.0 will require Open PMIx v4.1
    • So if running with Open PMIx v4.0 or older, just can't use PRRTE
  • Has anyone checked how far back Open-MPI v5 can work with PMIx?
    • At one point verified it worked with open pmix v3.1, but there had been some work on top and need to reverify.

PRRTE v2.0

  • No update

Some outstanding work for the way that OMPI calls PRRTE configure.

  • Also some changes with libcurl, especially since this breaks the OMPI build.
  • PMIx can interface with REST interfaces (used by libcurl)
    • JSON
    • Build system issue in PMIx when we changed to static DSOs.
    • Think this has been resolved
  • Ralph was looking at this (private messaged Geoff)

Issue 8801 - mpirun and prefixing.

  • Jeff and Ralph and Yosif had a good conversation
  • Lengthy discussion; the summary is that it's a work in progress.

MTT

  • Need to look at the public tests repo for merging in both ULFM and Sessions tests.
  • Howard and Geoff will look at this this week.
  • ULFM is built in by default.
    • Since we don't test it, it degenerates quickly.
    • At this moment the latest changes to PRRTE have broken ULFM.
    • May be easier to integrate into somewhere else.
    • Some tests put into OMPI-public - this test ran for 4 minutes on 4 nodes
  • Would MTT be sufficient for the ULFM testing?
    • It would be a step in the right direction.
    • It WOULD be good to get things into CI.
    • How do we do it without adding more time to CI?
  • If someone has one physical box.
    • The limit on Open MPI CI is not machines; someone needs to set this up and maintain it.
  • It'd be great if someone in the community could extend the Open MPI infrastructure and maintain this.
    • Our CI tests are currently running on a single node.
    • Could be extended, just need volunteers to learn and maintain.

Longer Term discussions

Doc update
