
WeeklyTelcon_20190716

Geoffrey Paulsen edited this page Jul 16, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Akshay Venkatesh (nVidia)
  • Artem Polyakov (Mellanox)
  • Brendan Cunningham (Intel)
  • Brian Barrett (Amazon)
  • Edgar Gabriel (UH)
  • Geoff Paulsen (IBM)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Josh Hursey (IBM)
  • Michael Heinz (Intel)
  • Ralph Castain (Intel)
  • Todd Kordenbrock

not there today (I keep this for easy cut-n-paste for future notes)

  • Aravind Gopalakrishnan (Intel)
  • Arm (UTK)
  • Brandon Yates (Intel)
  • Dan Topa (LANL)
  • David Bernhold
  • Geoffroy Vallee
  • George Bosilca (UTK)
  • Jake Hemstad
  • Joshua Ladd (Mellanox)
  • Mark Allen (IBM)
  • Matias Cabral
  • Matthew Dosanjh (Sandia)
  • Nathan Hjelm
  • Noah Evans (Sandia)
  • Peter Gottesman (Cisco)
  • Thomas Naughton
  • Xin Zhao (Mellanox)
  • mohan

Agenda/New Business

  • Git submodules

    • This PR is in progress. It requires CI owners to add --recursive to their Jenkins git clone commands.
    • As a first step, Jeff created:
      • PR 6821 "hwloc201 use a submodule"
  • What to do with OFI BTL and OFI MTL

    • Harumi Kuno (HPE) - Discussion about OMPI's component philosophy
    • mail archive: https://www.mail-archive.com/[email protected]/msg20736.html
    • ofi/BTL and MTL components can step on each other.
    • PSM2 - reference counting is only honored when initializing, not when finalizing. As long as there is a PSM2 provider, the first caller of PSM2_Finalize was therefore finalizing PSM2 for the entire job.
  • Status of Scale testing

    • No update
    • Issue 6786 "OMPI 4.0.1 TCP connection errors beyond 86 nodes"
    • Issue 6198 "SSH launch fails when host file has more than 64 hosts"
    • IBM is also working on something like this (for ssh launch)
      • Prefer to run this nightly rather than on each PR.
  • Issue 6799 "UVM buffers failing in cuIpcGetMemHandle ?"

    • No update
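
The PSM2 finalize problem noted above (reference counting honored on init but not on finalize) can be illustrated with a minimal sketch. This is not PSM2's API: the class and method names are hypothetical and only show the intended behavior, where the shared library is torn down by the last finalize rather than the first.

```python
class RefCountedLibrary:
    """Hypothetical stand-in for a shared transport library (not the PSM2 API)."""

    def __init__(self):
        self._refs = 0
        self.active = False

    def init(self):
        # Only the first caller actually brings the library up.
        if self._refs == 0:
            self.active = True
        self._refs += 1

    def finalize(self):
        if self._refs == 0:
            raise RuntimeError("finalize called without a matching init")
        self._refs -= 1
        # Only the last caller actually tears the library down.
        if self._refs == 0:
            self.active = False


lib = RefCountedLibrary()
lib.init()      # first user, e.g. one OFI component
lib.init()      # second user, e.g. another OFI component
lib.finalize()  # first finalize: the library must stay up
still_up_after_first_finalize = lib.active
lib.finalize()  # last finalize: now it shuts down
up_after_last_finalize = lib.active
```

The behavior described in the notes corresponds to a `finalize` that skips the counter check and shuts everything down on the first call.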

Infrastructure

Transition website and email to AWS

  • Complete

Process enforcement bots

  • No update

Submodule prototype

  • Suggest just doing hwloc (stable and not too much development) first
  • No update

Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • Tested new PMIx
    • Exposed a few new test suite issues in "ibm", but fixed

Review v4.0.x Milestones v4.0.2

  • PR6806 - Want to wait until CI is back. Do we have any tests to test this?
    • Howard will reproduce and add to ibm suite
  • 2nd Put issue PR 6568 (Vader deadlocking with 4MB transfers)
    • waiting on George to return (end of the month)
  • New Datatype work https://github.com/open-mpi/ompi/pull/6695 (master)
    • Want for v4.0.2
    • Now approved for master.
    • waiting on George to return (end of the month). We could merge to master, but if any issues, we'd need George to fix.
  • https://github.com/open-mpi/ompi/issues/6568 - put protocol has lost its pipelining.
    • Right now only shows in vader, because all others prefer get protocol.
    • Vader generates a bunch of 32K fragments, so a 4MB transfer overwhelms vader.
    • Does NOT occur with single copy like CMA or KNEM.
  • Issue 6789 - OMPI crashes when configured with ucx version
    • Issue with PML UCX conflicting with btl_uct - memory hooks
    • New this week: Howard not convinced it's memory hooks.
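
For a sense of scale on the put-protocol issue above, a rough count of the fragments behind one transfer can be sketched as follows. It assumes binary units (4 MiB messages, 32 KiB fragments) and a hypothetical helper name. It says nothing about how vader actually queues fragments.

```python
def fragment_count(message_bytes, fragment_bytes):
    """Number of fragments needed to carry a message (ceiling division)."""
    return -(-message_bytes // fragment_bytes)


MESSAGE_BYTES = 4 * 1024 * 1024   # assumed: "4MB" means 4 MiB
FRAGMENT_BYTES = 32 * 1024        # assumed: "32K" means 32 KiB

fragments = fragment_count(MESSAGE_BYTES, FRAGMENT_BYTES)  # 128
```

Without pipelining, all of these fragments would be generated up front for a single put.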

Review Master Pull Requests

  • PR6556 and 6621 should go to the release branches.
    • no update
  • Good reminder that we now need to be careful about OPAL's ABI.

v5.0.0

  • When do we get rid of 32bit?
  • Still don't have any release manager.
    • Need to identify someone in next few months.

Dependencies

PMIx Update

  • PMIx v3.1.3 is ready to release.
    • Two issues around MPIR attach
      • 5501 - IBM needs to investigate.
      • 5115 - community Open MPI; possibly still a PMIx issue
        • Howard will try to reproduce
      • MPIR attach issue is still open in v3.1.x
      • Neither of these issues should block v4.0.2
    • MPIR - we emit a warning saying we've deprecated MPIR
      • Need a wiki page describing how to get MPIR to work.
      • What is the answer?
      • DDT is about 90% ready.
  • PMIx v2.2 update could be ready soon after that.

ORTE/PRRTE

  • Take a look at Gilles' PRRTE work. He may have done SOME of that. He should have done all of it in the PRRTE layer, so maybe just some MPI layer work remains.

Next face to face

  • Need people to react and do things.
  • Fall Face to face is canceled due to lack of agenda
    • PRTE transition still requires dedicated discussion
  • Might meet in New Mexico, University of Tennessee, or Dallas (IBM)
    • Should make a meeting prep page
    • Jeff will make doodle.
    • Two days

MTT

  • IBM has to triage some failures on master and v4.0.x

