
WeeklyTelcon_20220906

Geoffrey Paulsen edited this page Oct 4, 2022 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Brendan Cunningham (Cornelis Networks)
  • Christoph Niethammer (HLRS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UoH)
  • Geoffrey Paulsen (IBM)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Thomas Naughton (ORNL)
  • Todd Kordenbrock (Sandia)
  • Tommy Janjusic (nVidia)
  • William Zhang (AWS)

Not there today (kept for easy cut-and-paste into future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia)
  • Aurelien Bouteiller (UTK)
  • Austen Lauria (IBM)
  • Brandon Yates (Intel)
  • Brian Barrett (AWS)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Jan (Sandia -ULT support in Open MPI)
  • Jingyin Tang
  • Joseph Schuchart
  • Josh Fisher (Cornelis Networks)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Sam Gutierrez (LLNL)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia)

Reminders

  • Thursday: HAN/Adapt wrap-up decision.
    • Contact Geoff Paulsen if you need the Webex info.

v4.1.x

  • Multiple weeks spent on the CVE reported by NVIDIA.
  • v4.1.5
    • Schedule: targeting ~6 months out (November?).
    • No driver for the schedule yet.
  • Potential CVE from a 4-year-old issue in libevent, but we might not need to do anything.
    • Update: one company reported that their scanner didn't flag anything.
    • Waiting on confirmation that the patches to remove dead code were enough.

v5.0.x

  • SLURM allocation.
    • RC this week.
    • Sept 30.
  • Finally swapped the PRRTE submodule pointer to point to the v3.0 branch.
  • Did it without the SLURM fix, but there was some traction there.
  • Posted Open MPI issue #10698 listing about 13 issues that will need fixing.
  • NEED an mpirun manpage
  • NEED mpirun --help
  • Need all these fixes before PRTE ships v3.0.0
  • Any of these issues complex?
  • Testing mpirun command-line options.
  • mpirun is supposed to automatically translate old command-line options to the new options.
    • Are we planning to get rid of the old options at some point?
    • Not printing the deprecation warning by default.
    • We've made new options (that are the new way), but if we're not encouraging people to move to them, why have them?
      • Can we even map old options to new options one-to-one?
    • We "own" the schizo component, so we could ditch the new options and only use the old options if we want.
    • Before we force any change, we should get users' feedback.
    • The old options had auto-completion.
    • If we have old options that map to new options, it's weird that we don't print the messages.
    • v5.0 was supposed to be pretty disruptive. If we go back and make it less disruptive, that's fine, but then we are effectively saying that the old options are the way.
  • Do we want HW_GUIDED in v5?
    • No discussion.
  • It'd be nice to make a test suite that assumes 2-4 nodes with 4 ppr or so.
  • Schedule:
    • PMIx and PRRTE changes coming at end of August.
      • PMIx v3.2 released.
      • Try to have bugfixes PRed by end of August, to give time to iterate and merge.
    • Still using the Critical v5.0.x Issues project board (https://github.com/open-mpi/ompi/projects/3) as of yesterday.
  • Docs
    • mpirun --help is OUT OF DATE.
      • Have to fix this relatively quickly, before PRRTE releases.
      • Austen, Geoff, and Tomi will be working on this.
      • The REASON: the mpirun command line is implemented in PRRTE.
  • The mpirun manpage needs to be rewritten.
    • Docs are online and can be updated asynchronously.
    • Jeff posted a PR to document runpath vs. rpath.
      • Our configure checks some linker flags, but a default in the linker or the system may really govern what happens.
  • Symbol pollution - need an issue posted.
    • OPAL_DECLSPEC - do we have docs on this?
      • No. The intent is to mark where you want a symbol available:
        • If outside of your library, then use OPAL_DECLSPEC (like Windows DECLSPEC).
        • It means "I want you to export this symbol."
    • Need to clean up as much as possible.
    • From the Open MPI community's perspective, our ABI is just the MPI_* symbols.
    • Still unfortunate. We need to clean up as much as possible.

Main branch

  • Case of Qthreads, where they need a recursive lock.
    • A configury problem was fixed.

Accelerator framework

  • Not merged into main or v5 yet.
    • still a couple of discussion points.
  • No discussion. Still some changes needed before we can retest/rereview.
    • Show-load errors came out of this.
    • Intent is to turn this error off by default.
    • In Open MPI v5, we've slurped all MCA libraries into libmpi (components can still be built as DSOs via configure).
  • If you build them as a DSO (say, the cuda component):
    • dlopen will fail because CUDA isn't there,
    • and the MCA framework will emit a warning on stderr.
    • Accelerators are expensive, so you might not have them on all nodes.
  • BUT customers have hit this ERROR in the field.
  • In this case.
    • What if we make this switch not a boolean (always show the warning / never show the warning)?
    • Jeff posted #10763.
  • Two mechanisms... accelerators could be DSOs.
    • Because if the failure is in libmpi.so, the whole job will not run.
  • Overall, Edgar likes the ideas in the PR.
    • How does Open MPI (or PRTE) deal with Slurm?
      • The Slurm component is built every time, even if configure doesn't find Slurm.
      • Slurm headers/libs are GPL,
      • so Open MPI fork/execs srun.
  • An MCA component can still do a dlopen on required libraries.
  • The HCOLL component must be dlopening also.
  • If we don't get Accelerator Framework in v5, is there any AMD accelerator support?
    • Not much... just some specific derived-datatype support.
    • No streams, no abstraction, etc.
    • Would be a big gap.
  • William will try
  • Edgar also has a follow up commit.
  • Waiting until big commit is merged into main, to not further complicate this commit.
  • Any testing with libfabric and accelerator support?
    • Edgar is hoping to test this week.
    • If something is missing, it'd probably be on the libfabric side.

Atomics PRs

  • Switching to builtin atomics.
    • #10613 - preferred PR. GCC / Clang should have the builtins.
    • Next step would be to refactor the atomics post-v5.0.
    • Waiting on Brian's review and CI fixes.
  • Joseph will post some additional info in the ticket.

MTT

Administrative tasks

Face-to-face
