WeeklyTelcon_20210629

Geoffrey Paulsen edited this page Jul 5, 2021 · 1 revision

Open MPI Weekly Telecon

Attendees (on Webex)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • Geoffrey Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (NVIDIA)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart (HLRS)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Naughton III, Thomas (ORNL)
  • Sam Gutierrez (LANL)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Erik Zeiske (HPE)
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Josh Hursey (IBM)
  • Joshua Ladd (NVIDIA)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic (NVIDIA)
  • William Zhang (AWS)
  • Xin Zhao (NVIDIA)

New Items

v4.0.x

  • No schedule for v4.0.7
    • Might be possible to have this SOMEDAY.
    • Cisco would like v4.0.7 someday.
  • PR9094 - external32 - Do we want it in v4.0?
  • PR9088 - long long - Do we want it in v4.1?
  • We need both 9094 and 9088 on v4.0.x to fix the bug reported.
    • Quality of what this is and what's needed.
  • v4.0.6 shipped last week. Looking good.
  • Mpool PR, waiting for review and to go into master first.
    • Howard is testing.
  • 8919: NVIDIA cannot link. Some users may have already hit this.
    • Tomislav will try to find someone to look at it.

v4.1.x

  • Schedule: Planning on late August (no reason for August) for accumulated bugfixes.
  • Fix huge page allocator waiting on Howard's testing.
  • Long Long one
  • 8867 - show help if libz is missing. Jeff is looking at it.

v5.0.x

  • PMIx / PRRTE plan to release in the next few weeks.

  • Need to do a v5.0 rc as soon as PRRTE v2 ships.

    • Need feedback if we've missed an important one.
  • PMIx Tools support is still not functional. Opened tickets in PRRTE.

    • Not a common case for most users.
    • This also impacts the MPIR shim.
      • PRRTE v2 will probably ship with broken tool support.
  • Is the driving force for PRRTE v2.0 OMPI?

    • So we'd be indirectly/directly responsible for PRRTE shipping with broken tool support?
    • Ralph would like to retire, and really wants to finish PRRTE v2.0 before he retires.
    • Or just fix it in PRRTE v2.0?
    • Is broken tool support a blocker for PRRTE v2.0?
      • Don't ship OMPI v5.0 with broken Tools support.
  • Are there any objections to delaying

    • Either we resource this
  • https://github.com/openpmix/pmix-tests/issues/88#issuecomment-861006665

    • Current state of PMIx tool support.
    • We'd like to get Tool support in CI, but need it to be working to enable the CI.
  • https://github.com/openpmix/prrte/issues/978#issuecomment-856205950

    • Blocking issue for Open MPI
    • Brian
  • PR 9014 - new blocker.

    • fix should just be a couple of lines of code... hard to decide what we want.
    • Ralph, Jeff and Brian started talking.
    • Simplest solution was to have our own
  • Need people working on v5.0 stuff.

  • Need some configury changes in before we RC.

  • Issue 8850, 8990 and more

  • Brian will file 3-ish issues

    • One is configure pmix
  • Dynamic Windows fix in for UCX.

  • Any update on debugger support?

  • Need some documentation that Open MPI v5.0 supports PMIx based debuggers, and that if

  • UCC coll component is being updated to be the default when UCX is selected (PR 8969).

    • Intent is that this will eventually replace hcoll.
    • Quality

Documentation

  • Solid progress happening on Read the Docs.
  • Would these docs be on the readthedocs.io site, or on our site?
    • Haven't thought either way yet.
    • No strong opinion yet.

Master

  • Issue 8884 - ROMIO detects CUDA differently.

    • Gilles proposed a quick fix for now.

MPI 4.0 API

  • Now released.

  • Virtual Face to face.

  • Persistent Collectives

    • So nice to get MPIX_ rename into v5.0
    • Don't think this was planned for v5.0
    • Don't know if anyone asked them this. - Might not matter to them
      • Virtual face to face -
  • a bunch of stuff in pipeline. Then details.

  • Plan to open Sessions pull request.

    • Big, almost all in OMPI.
    • Some of it is more impacted by clang-format changes.
    • New functions.
    • Considerably more functions can be called before MPI_Init/Finalize
    • Don't want to do sessions in v5.0
    • Hessam Mirsadeghi is interested in trying MPI_Sessions.
      • Interested in a timeline of a release that will contain MPI_Sessions.
    • Sessions working group meets every Monday at noon Central Time.
    • Update:
      • Did some cleanup of refactoring.
      • Topology might NOT change with Sessions relative to what's currently in master.
      • Extra topology work that wasn't accepted by MPI v4.0 standard.
      • Question on how we do mca versioning
  • We don't KNOW that OMPI v6.0 won't be an ABI break.

  • Would be NICE to get MPIX symbols into a separate library.

    • What's left in MPIX after persistent collectives?
      • Short Float
      • Pcall_req - persistent collective
      • Affinity
    • If they're NOT built by default, it's not too high of a priority.
      • Should just be some code-shuffling.
        • On the surface shouldn't be too much.
        • If they use wrapper compilers, or official mechanism
        • Top level library, since app -> MPI and app -> MPIX lib.
        • libmpi_x library can then be versioned differently.
  • Don't change to build MPIX by default.

  • Open an issue to track all of our MPI 4.0 items

    • MPI Forum will want this, certainly before Supercomputing.
  • Do we want an MPI 4.0 Design meeting in place of a Tuesday meeting?

    • An in-person meeting is off the table for many of us. We might want an out-of-sequence meeting.
    • Let's Doodle something a couple of weeks out.
    • Doodle and send it out
    • Trivial wiki page in the style of the other in-person meeting wiki pages.
  • Two days of 2-hour blocks - wiki

MTT

  • Who owns our open-SQL?

    • No one?
    • What value is the viewer using to generate the ORG data?
      • Looking for field in the perl client
        • It's just the username. It's nothing simple.
          • Something about how the CherryPy server is stuffing data into the database.
      • Thought it was in the ini file, but isn't.
    • Concerned that we don't have an owner.
    • Back in the day, we used MTT because there was nothing else.
      • But perhaps there's something else now?
  • A lot of segfaults in UCX one-sided at IBM.

  • Howard Pritchard: Does someone at NVIDIA have a good set of tests for GPUs?

    • Can ask around.
    • The only tests mentioned: the OSU MPI benchmarks have support for CUDA and ROCm tests.
      • Good enough for sanity.
      • No support for Intel low level stuff now.
    • PyTorch - machine learning framework - resembles an actual application.
      • Has different backends: the collective reduction library NCCL, but also a CUDA backend for single/multiple nodes.
  • ECP - worried we're going to get so far behind MPICH because all 3 major exascale systems are using essentially the same technology and their vendors use MPICH. They're racing ahead with integrating GPU offloaded code with MPICH. Just a heads up.

    • A thread on the GPU can trigger something to happen in MPI.
    • CUDA_Async Not sure of

PMIx

  • No discussion

PRRTE v2.0

  • No update

Longer Term discussions

  • No discussion.