
WeeklyTelcon_20200414

Geoffrey Paulsen edited this page Apr 16, 2020 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Akshay Venkatesh (NVIDIA)
  • Austen Lauria (IBM)
  • Brendan Cunningham (Intel)
  • David Bernhold (ORNL)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Joshua Ladd (Mellanox)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Intel)
  • Noah Evans (Sandia)
  • Ralph Castain (Intel)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Artem Polyakov (Mellanox)
  • Brian Barrett (AWS)
  • Geoffroy Vallee (ARM)
  • Scott Breyer (Sandia?)
  • Erik Zeiske
  • Shintaro Iwasaki
  • Nathan Hjelm (Google)
  • Charles Shereda (LLNL)
  • Brandon Yates (Intel)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Xin Zhao (Mellanox)
  • mohan (AWS)

New

Sign-off checker - when code is edited on GitHub, the commit gets "Co-authored-by" instead of "Signed-off-by"

  • It would be nice to update the automation to accept this.
  • Would need to update the legal text to say that users who do this are agreeing to the terms of the community.
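
A minimal sketch of what the proposed check could look like, assuming the checker works on the raw commit message text. The function name `has_accepted_trailer` and the regular expression are hypothetical and are not taken from the project's actual automation; accepting "Co-authored-by" is the proposal above, not current behavior.

```python
import re

# Trailer keys treated as acceptable. Accepting "co-authored-by" is the
# proposal from the notes, not what the real checker does today.
ACCEPTED_TRAILERS = ("signed-off-by", "co-authored-by")

# A trailer line looks like "Key: Some Name <user@host>".
TRAILER_RE = re.compile(
    r"^(?P<key>[A-Za-z-]+):\s*(?P<value>.+<[^<>\s]+@[^<>\s]+>)\s*$"
)


def has_accepted_trailer(message: str) -> bool:
    """Return True if any line of the commit message is an accepted trailer."""
    for line in message.splitlines():
        match = TRAILER_RE.match(line.strip())
        if match and match.group("key").lower() in ACCEPTED_TRAILERS:
            return True
    return False
```

A commit message ending in either trailer would pass this check; one carrying only a different trailer (for example "Reviewed-by") would not.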

Jeff presented Feature --net (and --tune)

  • Use a new openmpi-params.ini instead of openmpi-mca-params.conf
    • symlink for forward compatibility.
  • --tune <file> is similar: it provides more command line and env parameters from a file.
  • Two parts to "visible" rule:
    1. Command line values take precedence over file values.
    2. If there's a conflict between things you don't see, that's an error.
  • Already slated for v5.0.x, slightly different than v4.0.x
  • --net says take a single line from a parameter.
  • --tune says take an entire parameter file.
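
The "visible" rule above can be sketched roughly as follows. This is an illustration only, not Open MPI's implementation: `resolve_params` and `ParamConflictError` are hypothetical names, and it assumes that a conflict between two files is ignored when the command line also sets that parameter (the notes do not say either way). The parameter names and values in the usage note are illustrative.

```python
class ParamConflictError(ValueError):
    """Raised when two parameter files disagree on a value."""


def resolve_params(cli_params, file_params_list):
    """Merge parameters from the command line and from parameter files.

    cli_params: dict of values given directly on the command line (visible).
    file_params_list: list of dicts, one per parameter file (not visible).
    """
    resolved = {}
    for file_params in file_params_list:
        for name, value in file_params.items():
            if name in cli_params:
                # Rule 1: the visible command line value wins,
                # so the file value is ignored.
                continue
            if name in resolved and resolved[name] != value:
                # Rule 2: two non-visible sources disagree.
                raise ParamConflictError(
                    f"{name!r} set to both {resolved[name]!r} and {value!r}"
                )
            resolved[name] = value
    # Command line values are applied last and override nothing
    # that was kept from the files.
    resolved.update(cli_params)
    return resolved
```

For example, `resolve_params({"btl": "tcp,self"}, [{"btl": "vader"}])` keeps the command line value, while two files that set the same parameter to different values raise `ParamConflictError`.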

Markdown vs Nroff

  • Writing manpages in nroff is painful
  • Jeff wrote MPI_T.5.md man page in Markdown
    • Make converts markdown files to nroff via new tool pandoc
    • Don't want to require users to install pandoc, so will convert .md files to nroff in make dist
    • Configure will error if you don't have pandoc v1.19 in path
      • will check if we can lower requirements to v1.12.3 (Comes with CentOS 7.7)
      • Initial testing looks good, verifying now.
  • Native nroff and markdown can co-exist, so don't need to do them all at once.
  • Can we suppress generation of manpages if there is no pandoc?
    • No. Don't want to support dist without "full" contents.
    • --without-manpages - maybe could make this work.
  • Jeff will send something to packagers downstream
  • Worst case, we could pull this for v5.0
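
The minimum-version check discussed above could be prototyped like this. It is a sketch, not the project's configure logic: the helper names are hypothetical, and it assumes the first line of `pandoc --version` output has the form `pandoc <version>`.

```python
import re

# Minimum version from the notes; lowering it to 1.12.3 was still being tested.
MIN_PANDOC = (1, 19)


def parse_pandoc_version(output: str):
    """Extract a version tuple from `pandoc --version` output, or None."""
    first_line = output.splitlines()[0] if output.strip() else ""
    match = re.match(r"pandoc(?:\.exe)?\s+(\d+(?:\.\d+)*)", first_line)
    if match is None:
        return None
    return tuple(int(part) for part in match.group(1).split("."))


def pandoc_is_new_enough(output: str, minimum=MIN_PANDOC) -> bool:
    """Return True if the reported version is at least `minimum`."""
    version = parse_pandoc_version(output)
    return version is not None and version >= minimum
```

The `output` string could come from `subprocess.run(["pandoc", "--version"], capture_output=True, text=True).stdout`; passing `minimum=(1, 12, 3)` would model the relaxed requirement under evaluation.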

Old

  • MTT -
    • If you change your MTT to start up PRRTE at the beginning of the session and just use prun, you can see times cut in half or more.
    • This is good, but also need to test mpirun wrapper.
    • Cisco is converting half of MPI installs to use prrte/prun

OMPI submodules

  • OMPI master submodule pointers setup to track PMIx and PRRTE master.

Release Branches

Review v4.0.x Milestones v4.0.4

  • v4.0.4 in the works.
  • 7616 - ABI break introduced in OMPI v4.0.3 for some f08 symbols.
    • May drive an earlier v4.0.4 to fix.
  • 7617 - Howard is looking at this, may want for v4.0.4
  • OLD - Do we want to integrate with latest PMIx v3.1 branch (commits after v3.1.5)?
    • open question for RMs.
  • Comm Spawn failure on v4.0.3, possibly related to PMIx v3.1.4 commit.
    • Ralph can't reproduce it. Complex app. Easy workaround: use PMIx v3.1.3 or earlier.
    • Ralph is looking at it.
    • If this is in PMIx, then this might drive a new v3.1 release.

v5.0.0

  • Schedule:

    • Feature Freeze: April 30
    • Release: End of June
  • Discussing Features on google sheets document

  • PMIx v4.0.0 - on track

    • Schedule:
    • PMIx won't release v4.0 in time for OMPI v5.0, but will drop a tag that Open MPI can use.
  • PRRTE v2.0 - on track

  • A number of new MTT failures.

  • Issues not tracked on spreadsheet.

    • libopal isn't slurped into Open MPI correctly (related to 7560)
      • Jeff and Brian will meet Friday

master

  • Hierarchical collectives

    • If someone wants to do this, PMIx has much of this information already.
    • Not too hard to do, and they're much faster. Will be in the next version of a competitor MPI.
    • Probably not for v5.0
  • Static linking is failing on master right now.

    • Issue 7560
    • May be an issue in static build support in PMIx and PRRTE as well as how we're pulling it in.
    • Affects everything, just masked at the moment because static linking is broken.
    • Jeff will investigate
    • No progress.
  • SLURM PMIx plugin has been locked on PMIx v2 for some time.

    • There are some NEW PMIx calls that SHOULD be added to bring it up.
      • Ralph has started a PR, but needs help.
    • So for now, there's some optional info that won't be passed correctly.
      • No OMPI_INFO for now.
      • Ralph gets pinged occasionally.
    • Not sure priority of this.
  • MTT on master is looking pretty good.

Face to face

  • Deferred.

Infrastructure

  • scale-testing, PRs have to opt-into it.

Review Master Pull Requests

CI status


Dependencies

PMIx Update

  • CI testing only checks that it builds and that it runs, but doesn't test HOW it ran.
    • Environment setup can be a bit different.
    • For example, no permissions in /tmp: a test might pass on one machine and fail on another that lacks /tmp permissions.
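
One way to surface this kind of environment difference is a small probe run before the tests. The sketch below is an assumption about how such a check might look, not part of the existing CI; `tmp_dir_is_usable` is a hypothetical helper.

```python
import tempfile


def tmp_dir_is_usable(path=None):
    """Return True if a file can be created and written in the temp directory.

    With no argument, probes the default temp directory (typically /tmp).
    """
    directory = path or tempfile.gettempdir()
    try:
        # The probe file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory) as handle:
            handle.write(b"probe")
            handle.flush()
        return True
    except OSError:
        # Covers missing directories and permission errors alike.
        return False
```

A CI job could log the result of `tmp_dir_is_usable()` at startup so that a failure on one machine can be traced to its environment rather than to the code under test.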

ORTE/PRRTE

MTT


Back to 2019 WeeklyTelcon-2019
