
WeeklyTelcon_20191022

Geoffrey Paulsen edited this page Oct 23, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (Mellanox)
  • Austen Lauria (IBM)
  • Brendan Cunningham (Intel)
  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Jeff Squyres (Cisco)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Howard Pritchard (LANL)
  • Joshua Ladd (Mellanox)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Ralph Castain (Intel)
  • Thomas Naughton (ORNL)
  • Tom Naughton
  • Xin Zhao (Mellanox)
  • mohan (AWS)

Agenda/New Business

New PRRTE launcher proposal on mailing list.

  • All of this is in the context of v5.0.
  • Intel is no longer driving PRRTE work, and Ralph won't be available for PRRTE much either.
  • PRRTE will be a good PMIx development environment, but being a scalable, robust launcher is no longer a focus.
  • OMPI community could come into PRRTE, and put in production / scalability testing, features, etc.
  • Given that we have not been good at contributing to PRRTE (other than Ralph), there's another proposal
    • ORTE and PRRTE have drifted apart, so transitioning is risky.
  • Step 1. Make PMIX a first class citizen
    • Still good to keep PMIx as a static framework (no more glue; it stays under orte/mca/pmix but basically just passes through), and call PMIx_ functions directly.
    • Allows us to still have internal backup PMIx if no external PMIX is found.
  • Step 2. We can whittle down orte, since PMIX does much of this.
  • Two things PRRTE won't care about are scale and support for all binding patterns.
  • Only recent versions of SLURM have PMIx
  • Need to continue to support ssh.
    • Not just core PMIx, still need daemons for SSH to work, but they're not part of PMIx.
    • Part of ORTE that we wouldn't be deleting.
  • What does Altair PbsPro and open source PbsPro do?
    • Torque is different than PbsPro
  • Are there OLD systems that we currently support but no longer care about, and could stop supporting in v5.x?
    • Who supports PMIx, and who doesn't?
  • If PMIx becomes a first class citizen and rest of code base just makes PMIx calls, how do we support these things?
    • mpirun would still have to launch orteds via plm.
    • srun wouldn't need them.
    • But this is how it works today. Torque doesn't support PMIx at all, but TM just launches ORTEDs
    • ALPS - aprun ./a.out - requires a.out to connect up to ALPS daemons.
      • Cray still supports PMI - someone would need to write a PMI -> PMIX adapter.
    • ORTE does not have the concept of persistent daemons.
  • Is there a situation where we might have a launcher launching orteds and we'd need to relay PMIx calls to the correct PMIx server layer?
    • Generally we won't have that situation, since the launcher won't launch ORTEds.
  • George's work currently depends on PRRTE
    • If ORTEDs provide PMIx_Events, would that be enough?
      • No, George needs PRRTE's fault-tolerant overlay network.
      • George will scope the effort to port that feature from PRRTE to ORTE.
  • ACTION - Please gather a list of the resource managers and tools that we care about supporting in Open MPI v5.0.x.

Face to face

  • Date looks good. Feb 17th right before MPI Forum
    • 2pm Monday, and maybe most of Tuesday.
    • Cisco has a Portland facility and is happy to host.
    • But willing to step aside if others want to host.
    • About a 20-30 min drive from MPI Forum; will probably need a car.
  • It's official! Portland Oregon, Feb 17, 2020.
    • Safe to begin booking travel now.

Infrastructure

Submodule prototype

  • OMPI has been waiting for some git submodule work in Jenkins on AWS.

    • Need someone to figure out why Jenkins doesn't like Jeff's PR.
      • Anyone with github account for ompi team should have access.
      • PR 6821
      • Apparently Jenkins isn't behaving as it should.
    • Three pieces: Jenkins, CI, bot.
      • AWS has a libfabric setup like this for testing.
      • Issue is that they're reworking the design, and will roll it out for both libfabric and Open MPI.
    • William Zhang talked to Brian
      • Not something AWS team will work on, but Brian will work on it.
    • Jeff will talk to Brian as well.
  • Howard and Jeff have access to Jenkins on AWS. Part of the problem is that we don't have much expertise on Jenkins/AWS.

    • William will probably be administering the Jenkins/AWS setup or communicating with those who will.
  • Merged --recurse-submodules update into ompi-scripts Jenkins script as first step. Let's see if that works.

  • Modular thread re-write (Noah)

    • UGNI and Vader BTLs were getting better performance, not sure why.
    • For modular threading library, might be interesting to decide at compile time or runtime.
    • Previously similar things seemed to be related to ICACHE.
    • Howard will look at it.

Release Branches

Review v3.0.x Milestones v3.0.4

Review v3.1.x Milestones v3.1.4

  • Will put out RCs for v3.0.5 and v3.1.5 this week.
  • Please test RCs when they become available.
  • Start drawing up a list of fixes that won't be backported to v3.0.x
    • Datatype bug won't be backported, because it snowballed too big.
    • With the new v3.0.x and v3.1.x releases, will publish a list (in either NEWS or README) of issues fixed in v4.0.x that are NOT being backported, with a recommendation to upgrade.

Review v4.0.x Milestones v4.0.2

  • v4.0.2 was released, and no catastrophic issues have come in.
  • We're beginning to merge in new v4.0.3 PRs.

v5.0.0

  • Schedule: April 2020?
    • Wiki - go look at items, and we should discuss a bit in weekly calls.
    • Some items:
      • MPI1 removed stuff.

Review Master Pull Requests

CI status

  • IBM's PGI test has NEVER worked. Is it a real issue or local to IBM?
  • Absoft 32-bit Fortran failures.

Dependencies

PMIx Update

ORTE/PRRTE

  • No discussion this week.

MTT

