WeeklyTelcon_20200107

Geoffrey Paulsen edited this page Jan 7, 2020 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Akshay Venkatesh (NVIDIA)
  • Austen Lauria (IBM)
  • Charles Shereda (LLNL)
  • Josh Hursey (IBM)
  • Joshua Ladd (Mellanox)
  • Thomas Naughton (ORNL)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • Brian Barrett (AWS)
  • Brendan Cunningham (Intel)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Michael Heinz (Intel)

not there today (I keep this for easy cut-n-paste for future notes)

  • Noah Evans (Sandia)
  • William Zhang (AWS)
  • George Bosilca (UTK)
  • Artem Polyakov (Mellanox)
  • David Bernhold (ORNL)
  • Edgar Gabriel (UH)
  • Matthew Dosanjh (Sandia)
  • Brandon Yates (Intel)
  • Erik Zeiske
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Xin Zhao (Mellanox)
  • mohan (AWS)

New Business

  • Discuss PR 6821

    • This would be the first PR with a submodule. Uses hwloc via submodules.
    • Name of the hwloc component changed to hwloc2 (not hwloc20x)
    • Question: were there some issues with submodule PR testing?
      • Initially we had some issues due to bad git docs; they were caught by PR CI.
        • CI didn't get through the checkout phase.
      • All issues resolved now.
  • Discuss Probot process 7260

    • Delayed until next week.
  • Unprefixed Symbols from December:

    • A ton of unprefixed symbols are being exported by the MPI library.
      • Symbols prefixed with OMPI, OPAL, or ORTE are ours.
      • Everything that starts with MCA is also in there as a public symbol.
      • The problem: if another library reuses the MCA system, you hit this.
    • Domain frameworks - adding mca components to a list for autoclosure, but sequencing of closing needs to be very specific.
      • Want to strip out as it's causing problems.
      • Might need this for sessions
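
A minimal sketch of how the unprefixed-symbol check discussed above could be automated. It assumes `nm`-style text lines (address, type letter, name) as input; the default prefix list and the sample symbol names are illustrative assumptions, not the project's agreed policy. The MCA prefix is deliberately left out of the allowed list so that such symbols get flagged.

```python
# Assumed set of "ours" prefixes; adjust to whatever the project settles on.
DEFAULT_PREFIXES = ("ompi_", "opal_", "orte_", "MPI_", "PMPI_")


def unprefixed_symbols(nm_lines, allowed_prefixes=DEFAULT_PREFIXES):
    """Return exported symbol names that lack an allowed prefix.

    Simplification: a symbol counts as exported when its nm type letter
    is uppercase and is not 'U' (undefined).
    """
    flagged = []
    for line in nm_lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        sym_type, name = fields[-2], fields[-1]
        if len(sym_type) != 1 or not sym_type.isupper() or sym_type == "U":
            continue  # local (lowercase) or undefined symbol
        if not name.startswith(tuple(allowed_prefixes)):
            flagged.append(name)
    return flagged


# Hypothetical nm output, for illustration only.
sample = [
    "0000000000001000 T ompi_mpi_init",
    "0000000000002000 T mca_base_open",
    "0000000000003000 t local_helper",
    "                 U malloc",
    "0000000000004000 D opal_process_info",
    "0000000000005000 T MPI_Init",
]
print(unprefixed_symbols(sample))
```

In practice the input lines would come from running `nm` on the built library; only the text filtering is shown here.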

Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • New oops 3.0.x/3.1.x
    • Issue 7212 - patcher issue for new compilers
    • Jeff reviewed and merged the other PRs. The fix is merged in, but not tested yet.
    • Two PRs still open.
  • No schedule yet for 3.0.6 and 3.0.5. Based on RM availability.
  • Possibly a configure test for pmix warning/error.

Review v4.0.x Milestones v4.0.3

  • v4.0.3 in the works.
    • Schedule: End of January.
    • Need to confirm with George whether we need PR 7149 for 4.0.3.
  • There may be a new PMIx v3.1.5 in January, which we could pick up for v4.0.3.
    • We'll know next week

v5.0.0

  • Schedule: April 2020?

Face to face

  • It's official! Portland, Oregon, Feb 17, 2020.
  • Please register on the Wiki page, since Jeff has to register you.
  • Date looks good: Feb 17th, right before the MPI Forum.
    • 2pm Monday, and maybe most of Tuesday.
    • Cisco has a Portland facility and is happy to host.
    • About a 20-30 min drive from the MPI Forum; will probably need a car.

Infrastructure

Review Master Pull Requests

CI status


Dependencies

PMIx Update

  • There may be a new PMIx v3.1.5 in January, which we could pick up for v4.0.3.
    • We'll know next week

ORTE/PRRTE

  • PRRTE almost ready to merge, but need help with oshmem
    • PR 7202 build logic working correctly.
    • Possibly some glitches in PRRTE support, but in general working okay.
    • OSHMEM is compiling, but is expecting ORTE to do something.
      • oshrun of hello_oshmem.
      • Possibly some ORTE call in oshrun?
    • It's the application that's crashing.
    • Suspicion that MPI_Init is no longer calling ORTE_Init.
    • Question: Will things work "the same" under SLURM with this PR?
      • yes.
    • How will startup performance differ?
      • Don't expect any difference.
    • Mellanox can take a look. Probably something pretty trivial.
  • In this PR, mpirun may be a shell script?
  • PRRTE testing infrastructure
    • Josh has been working on this, in IBM's virtual scale cluster.
    • Hitting a number of issues in PRRTE.
    • CI testing of PRRTE itself (without Open MPI).
    • Two PRRTE CI tests:
      • PMIx Hello World (passes depending on number of nodes)
      • Submit like 100 jobs in single job.
    • Once this is working might want to use this in Open MPI testing before moving submodule pointer.
  • Note: Can also use PMIx client
  • PR 7202 looks good for building, but do we want to move from mpirun to prrte at the same time, or wait until PRRTE testing is better?
  • Submodule vs. embedded is harder than simply embedding ORTE (or PRRTE).
    • This coordination between repos is hard.
    • Especially because PMIx is embedded in Open MPI but PRRTE is a submodule.
  • Part of this PR makes pmix a static component, and calls PMIx directly.
    • Could separate the PMIx static component and direct PMIx calls into a separate PR.
  • Once this settles down, track release branches instead of master.

MTT


Back to 2019 WeeklyTelcon-2019