WeeklyTelcon_20210302

Geoffrey Paulsen edited this page Mar 7, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Aurelien Bouteiller (UTK)
  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • Edgar Gabriel (UH)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Howard Pritchard (LANL)
  • Joseph Schuchart
  • Marisa Roman (Cornelis Networks)
  • Christoph Niethammer (HLRS)
  • George Bosilca (UTK)
  • Joshua Ladd (nVidia/Mellanox)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Jeff Squyres (Cisco)
  • Josh Hursey (IBM)
  • Naughton III, Thomas (ORNL)
  • David Bernholdt (ORNL)
  • Howard Pritchard
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Tomislav Janjusic
  • Xin Zhao (nVidia/Mellanox)

Sorry, the notes are not well organized this week. I'll try to clean up for next week.

New Discussion

  • v4.0.x and v4.1.x blocked on UCX priority PR https://github.com/open-mpi/ompi/pull/8496

    • Jeff is actively testing
    • AWS tested, looks good.
  • v4.1.x what's needed for Bull 2020 update of coll/han - https://github.com/open-mpi/ompi/pull/8487

  • ci check broke over the weekend, and caused some PRs to hang. Jeff fixed.

    • Some confusion over the name. "ci" - initially named this, and only later named something
    • Checks for the Signed-off-by line.
    • Checks for bogus email names. Optionally checks for cherry-pick messages.
    • Now verifies that cherry-pick commits exist in Open-MPI. Will prevent release branch merge before merge to master.
    • Had it working on master, working on v4.1, but had to re-run PRs it was waiting on.
    • "bot:retest", etc. does not rerun it; the way to rerun this check is to change the PR description text.
      • If anyone can help figure out the bot retest situation, that'd be great.
  • Should we HAVE the org name in the CI test name, so folks know who to talk to?

    • This one, code is in ompi repository.
    • Would rather have a reference of who to mention when something's broken.
    • AWS doesn't like where we're at.
    • But if CI is broken, it stops PR progress, and stops development.
      • If no one can help, then maybe don't make it mandatory
    • Only discussing mandatory ones. At least need a wiki describing it, and mentioning how to help.
    • Jenkins is a Giant responsibility.
    • Jeff will help here for now.
  • Do we have a wiki page on how to rerun bots? How to make this more prominent?

  • Building with autoconf 2.70 is broken (or just a lot of warnings?)
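The commit checks described above (Signed-off-by line, bogus email, optional cherry-pick message, and confirming that the cherry-picked commit exists on master) can be sketched roughly as follows. This is a minimal illustration, not the actual checker in the ompi repository: the function name `check_commit`, the regular expressions, and the idea of passing in a set of known master hashes are all assumptions made for the example.

```python
import re

# Assumed patterns for illustration; the real checker may use different rules.
SIGNOFF_RE = re.compile(r"^Signed-off-by: .+ <[^<>@\s]+@[^<>@\s]+>$", re.MULTILINE)
CHERRY_RE = re.compile(r"\(cherry picked from commit ([0-9a-f]{7,40})\)")
BOGUS_EMAIL_RE = re.compile(r"(@localhost|\.local|\(none\))$", re.IGNORECASE)


def check_commit(message, author_email, master_hashes, require_cherry_pick=False):
    """Return a list of problems found in one commit; an empty list means it passes.

    master_hashes is a hypothetical set of full commit hashes known to be on master.
    """
    problems = []

    # 1. Every commit needs a Signed-off-by line.
    if not SIGNOFF_RE.search(message):
        problems.append("missing Signed-off-by line")

    # 2. Reject obviously bogus author emails.
    if "@" not in author_email or BOGUS_EMAIL_RE.search(author_email):
        problems.append("bogus author email")

    # 3. Optionally require a cherry-pick message (e.g. on release branches).
    picked = CHERRY_RE.findall(message)
    if require_cherry_pick and not picked:
        problems.append("missing cherry-pick reference")

    # 4. Any referenced commit must already exist on master.
    for short_hash in picked:
        if not any(full.startswith(short_hash) for full in master_hashes):
            problems.append(f"cherry-picked commit {short_hash} not found on master")

    return problems
```

With `require_cherry_pick=True` on a release branch, step 4 is what would prevent a release-branch merge before the change has landed on master.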

New Topics

  • PMIX v4.1 might be delayed.

    • So backup plan is get PRRTE working with PMIx v4.0
    • Not sure what we'll lose with PMIx v4.0 instead of v4.1
    • Folks should try running OMPI with PMIx v4.0. Probably release Open-MPI v5.0 with PMIx v4.0.

  • PR 8511 Addresses issues/8321

    • Ready to merge to master.
    • Need to PR to release branches.
  • IBM CI needs to upgrade UCX from 1.8 to 1.9

  • What MOFED do we need to pickup UCX 1.9?

    • Josh L. will get back.
  • Issue 8489 - UCX being selected where

    • Priority PR 8496
    • Should there be a VENDOR check in the PML?
    • Jeff was using UCX over TCP, where he was expecting to use the TCP BTL
      • Still a lot getting lucky
    • UCX over UD over EFA - want libFabric OFI MTL
    • on Amazon, do you build UCX, so you have both UCX and libFabric components?
      • Today no, but what we're seeing is that at least one distro plans on shipping Open MPI built against both.
      • So need better runtime checks.
    • Ties used to be broken by what was built.
    • Who's the default TCP provider?
    • All came up with --net, which TCP do you get?
      • came down to check vendor of hardware, and if there's a vendor specific hardware, then pick that, but if generic, then default to BTLs.
    • What time can this decision be made? MPI_Init() is the right time.
      • ib_device doesn't tell WHO's device.
      • hwloc does have VENDOR info, but it gets loaded a bit too late.
        • But we've discussed moving hwloc earlier (because we also get some cache-sizes incorrect today).
    • OSHMEM only works on UCX.
    • So that's some good discussion about futures, but focusing on PR 8496 for NOW
      • PR8496 currently makes UCX not the TCP provider
      • What about OPA?
        • How do we tell the difference between a Cornelis Networks OPA device and a Mellanox device?
        • PR 8496 - tried to fix, but not sure... probably not correct for OPA
        • Cornelis strongly recommends users NOT use verbs
          • But it presents itself as verbs, so this PR may try to run.
    • Cornelis Networks is not developing PSM3, that's Intel.
      • PSM3 is completely incompatible with PSM2 and doesn't work on Omni-Path hardware.
    • This PR is a good first step, but need to tweak some.
    • This looks like it's close for OMPI v4.0 and v4.1
    • But for Open-MPI v5, we should do the larger work to work on open-mpi that's built against both UCX and OFA
    • What's the default shared memory transport?
      • Have to go the same route... if you're using UCX for fabric, should probably run UCX for shmem.
        • But if running SHMEM only, that may not be the case, but still probably a good first guess.
      • SHMEM gets more complex because may want CUDA, and not all shmem providers support CUDA.
    • Does libFabric support CUDA-IPC? It will, but doesn't currently.
    • Brian will come up with a proposal... don't want to make HARD statements, but want to select defaults.
      • Will send out a v5.0 thought today.
    • Don't have a solution for EFA/usNIC/OPA
      • EFA provides a UD interface.
      • Solution will be the same
  • PR 8435 - https://github.com/open-mpi/ompi/pull/8435

    • No progress this week.
    • Question as to what George was saying.
    • George just saying that MPI already has that info and we don't need to ask PMIx again.
      • Need it in HAN, and if we need it elsewhere, just move to base
    • That being said, George doesn't want it in Tuned at all.
    • It was a mistake that this was targeting v4.1 instead of master.
  • UCX Issue 8321,

    • We do need to understand what's going on, as there were comments saying we should not support anything older than 1.9.0, but then there was a comment that it's reproducible in 1.9 also.
    • Is this a UCX problem, or a PML problem?
      • We don't know if it's PML or UCX
  • UCX 1.9.0 + OMPI 4.0.4 - Issue 8442

    • datatype engine issue
    • George has a fix, but it no longer applies cleanly.
    • He will try to push, so someone else can
    • PR8473 - Sergy pushed a possible fix, but it still failed a CI test, and then closed the PR.
    • May not be related to Issue 8321
    • We're ready to cut an RC for both 4.1.1 and 4.0.6, these two are blocking.
  • UCX meeting is on Wednesdays

    • Howard may go tomorrow.
    • UCX community didn't like us configuring out, they're looking into
    • It'd be nice to link this to an issue tomorrow.
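The default-selection idea from the Issue 8489 discussion (check the hardware vendor; if there is vendor-specific hardware pick that vendor's transport, otherwise default to the BTLs) could look something like the sketch below. Everything here is hypothetical: the vendor labels, the vendor-to-component table, and the function name are illustrative assumptions, not Open MPI code or an agreed design.

```python
# Hypothetical vendor -> preferred transport table, for illustration only.
VENDOR_PREFERENCE = {
    "mellanox": "ucx",
    "aws-efa": "ofi",
    "cornelis-opa": "psm2",
}


def pick_default_transport(device_vendors, built_components):
    """Pick a default transport from detected device vendors.

    device_vendors: vendor labels for detected NICs (assumed to be available
                    at MPI_Init() time, e.g. from hwloc if it is loaded early).
    built_components: set of transports this Open MPI build was compiled with.
    """
    for vendor in device_vendors:
        preferred = VENDOR_PREFERENCE.get(vendor)
        # Only honor the vendor preference if that component was actually built.
        if preferred is not None and preferred in built_components:
            return preferred
    # Generic or unknown hardware: fall back to the BTLs.
    return "btl"


# A build that has both UCX and OFI, running on generic Ethernet, stays on the BTLs.
print(pick_default_transport(["generic"], {"ucx", "ofi", "btl"}))
```

The point of the sketch is that the tie is broken at run time by what hardware is present, rather than by what happened to be built.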

4.0.x

  • We put out 4.0.6rc2 last week, but we'll know more about UCX and maybe 8466.
    • PR8466 just waiting on Cisco testing.
  • 4.0.6 blocking on UCX fixes.
    • Issue 8442 - datatype issue
      • Severity blocker - and no fix other than using
      • opal_datatype_Unpack issue.
    • PR 8496 - waiting on Cisco testing
  • Austen can simplify PR

v4.1

  • blocking on UCX issues (see New topics above)
    • PR 8466 and Issue 8442

Open-MPI v5.0

  • Geoff went through most open PRs and many of the newer issues to see if anything would block the branching of v5.0. Discussed these briefly: WeeklyTelcon_20210302-ompiv5-branching

    • Looks on target to branch next week, after the AWS GPU Direct PR and the remove-CR change get in.
  • PRRTE making good progress:

    • Ralph resolved about 11 tickets in PRRTE last week. Maybe 20 more
    • Then prrte will branch v2.0
    • Open-MPI can branch anytime, we'll revisit end of Feb.
  • Raghu, How is GPU Direct RDMA for AWS? Still on track. PR this week.

  • One-sided tests are still busted. Do we keep running these if they're failing?

    • Nathan is actively working on, so hopeful we'll get this.
  • Issue 7486

  • Josh summarized discussion from last week in issue.

  • Anything else Josh needs to implement?

    • No, Josh will get to before end of month, before v5.0 branches.
  • master configure issue - for v5.0 both of these will need to be fixed.

  1. Lustre configure option: Edgar sees it, but has no idea how to fix it.
    • Not sure if he should open an issue. Ralph thinks Gilles fixed it. Edgar will give it a try.
  2. SharedFP component: Edgar opened an issue this morning.
    • Blocker for v5.0
  • What's going to be the state of the SM Cuda BTL and CUDA support in v5.0?

    • What's the general state? Any known issues?
    • AWS would like to get.
    • Josh Ladd - Will take internally to see what they have to say.
    • From nVidia/Mellanox, Cuda Support is through UCX, SM Cuda isn't tested that much.
    • Hessam Mirsadeghi - All Cuda awareness through UCX
    • May ask George Bosilca about this.
    • Don't want to remove a BTL if someone is interested in it.
    • UCX also supports TCP via CUDA
    • PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on
  • Update 11/17/2020

    • UTK is interested in this BTL, and maybe others.
    • Still gap in the MTL use-case.
    • nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX
    • What's the state of the shared memory in the BTL?
      • This is the really old generation Shared Memory. Older than Vader.
    • Was told after a certain point, no more development in SM Cuda.
    • One option might be to
    • Another option might be to bring that SM in SMCuda to Vader(now SM)
  • Edgar atomicity issue for OMPIO. Not sure if it's a full feature, but need to have on radar.

    • Not yet resolved.
    • ETA: a few days after Edgar finds time. 2-3 weeks.
    • Made some progress, hope in next few days
  • Discuss for v5.0

    • Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
    • One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
    • Non-Homogenous clusters (GPUs on some nodes, and non-GPUs on some other)

Doc update

  • PR 8329 - convert README, HACKING, and possibly Manpages to restructured text.
    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Intent this is for v5.0
      • mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
    • Ralph has asked about this for PMIx/PRRTE since this is turning out to work

Longer Term discussions

ROMIO Long Term (12/8)

  • What do we want to do about ROMIO in general.
    • OMPIO is the default everywhere.
    • Gilles is saying the changes we made are integration changes.
      • There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
      • We may be able to work with upstream to make a clear API between the two.
    • As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
  • Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
  • Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
  • Putting new tests there
  • Very little there so far, but working on adding some more.
  • Should have some new Sessions tests

MTT

  • what is being reported looks pretty good.
    • ppc atomics - Austen has been looking at this
  • Intercomm Merge is getting inconsistent ordering of procs.
    • What is the priority of this?
    • Many of the ibm tests start off by doing some intercomm manipulation.
      • Won't get
  • Mellanox MTT had been failing. Boris set some debug, and they unplugged it.
    • They plan to re-enable it tomorrow.

Video Presentation

  • ECP Community days ( March 30-April 1st )
    • David Bernholdt and/or George Bosilca
    • Each day 90 minute time slots.
    • Get proposal in by this Friday.