
WeeklyTelcon_20211214

Geoffrey Paulsen edited this page Jan 4, 2022 · 1 revision

Open MPI Weekly Telecon ---

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (UCX/nVidia)
  • William Zhang (AWS)
  • Christoph Niethammer (HLRS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Howard Pritchard (LANL)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Joshua Ladd (nVidia/Mellanox)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Naughton III, Thomas (ORNL)
  • Noah Evans (Sandia)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Sam Gutierrez (LLNL)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic
  • Xin Zhao (nVidia/Mellanox)

v4.0.x

  • Schedule: No schedule for v4.0.8 yet - sometime in 2022
    • bugfixes case-by-case basis

v4.1.x

  • Schedule: No schedule for v4.1.3 yet either - sometime in 2022
  • Slowing down.
  • 9756 - one outstanding PR.

v5.0.x

  • Austen PRed a bunch of commits on master not yet in v5.0
    • Opened two more.
  • 9643 - Issue for - needs PMIx and PRRTE updates
  • Submodule pointers on v5.0 need updating
    • Still pointing at something on PMIx v4.1.x.
    • Brian PRing some fixes so we can update to PMIx v4.2
  • Issue
    • If there is an SPML, we build the OSHMEM interface.
  • libNBC uninitialized variable. Jeff filed 9749 this morning (prob on both master and v5.0.x)

Master

  • Community Warm/Open to bringing in Sessions, but want to see Howard's PR later this week

  • Clock Monotonic - Jeff updated Timers.md in ompi-www

    • May only be Linux and OSX - maybe just an opal_inline, doesn't warrant a whole framework
    • WTIME a long time ago said not using framework.
      • Everyone just needs to agree to use one function
      • just need ompi_wtime (very MPI specific), wouldn't put it into opal
        • just going to call clock_gettime with CLOCK_MONOTONIC_RAW (doesn't allow for migrating to another core)
    • Maybe we should unify the times.
    • No requirement for MPI_Times to be comparable to Wtick and Wtime.
      • Quirks on different platforms.
    • Opal_Timers were really built for opal progress, where we needed a 10ms timer with low perturbation.
  • NUMA Domain in BIOS - Didn't have a chance to test the newest Open MPI v5.

    • Systems where you can change the way to distribute the cores in BIOS
    • Default binding. When you run more than two processes, they should bind to socket.
      • Man pages are misleading, though they were right at the time.
      • It binds to the NUMA domain (at the time this was a one-to-one mapping with a socket)
    • Might be - lstopo output and hwloc output.
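
The "one function" idea from the Clock Monotonic discussion above could look roughly like the following. This is a minimal sketch under stated assumptions, not the Open MPI implementation: the names sketch_wtime and sketch_wtick are hypothetical, and falling back to CLOCK_MONOTONIC where CLOCK_MONOTONIC_RAW is not defined is an assumption, not something decided on the call.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <time.h>

/* Pick the raw monotonic clock where the platform defines it.
 * Assumption: plain CLOCK_MONOTONIC is an acceptable fallback elsewhere. */
#if defined(CLOCK_MONOTONIC_RAW)
#define SKETCH_CLOCK_ID CLOCK_MONOTONIC_RAW
#else
#define SKETCH_CLOCK_ID CLOCK_MONOTONIC
#endif

/* Hypothetical stand-in for an ompi_wtime-style helper.
 * Returns seconds since an arbitrary fixed point, or -1.0 on failure. */
static double sketch_wtime(void)
{
    struct timespec ts;

    if (clock_gettime(SKETCH_CLOCK_ID, &ts) != 0) {
        return -1.0;
    }
    return (double) ts.tv_sec + (double) ts.tv_nsec * 1e-9;
}

/* Hypothetical companion reporting the resolution of the same clock,
 * in seconds, so that both values come from one time source.
 * Returns -1.0 on failure. */
static double sketch_wtick(void)
{
    struct timespec res;

    if (clock_getres(SKETCH_CLOCK_ID, &res) != 0) {
        return -1.0;
    }
    return (double) res.tv_sec + (double) res.tv_nsec * 1e-9;
}
```

Taking both the time and the tick from the same clock ID is one way to address the "maybe we should unify the times" point; whether that is desirable on every platform is left open in the notes.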

MTT

  • Cisco has some test build failures.

  • Intel systems that have the Level Zero API - ROMIO compilation issue

    • Issue 9715 - Only workaround is to disable building ROMIO (Lustre perf issue)
    • To fix it right, we might need to upgrade to the ROMIO in MPICH v4.
      • This package has been rewritten.
    • No configury to disable the Intel GPU support in ROMIO. This would work around this issue.
    • Is this a blocker for v5? Probably No? Perhaps Intel?
  • IBM has an OMPI build failure with XL compiler on ppc64le.

    • We might need to
ompi_proc_sentinel_to_name(uintptr_t)$AF56_10.  Compilation ended.  Contact your Service
Representative and provide the following information: Internal abort. For more information visit:
http://www.ibm.com/support/docview.wss?uid=swg21110810
make[2]: *** [Makefile:2559: dpm/dpm.lo] Error 1
make[1]: *** [Makefile:2665: all-recursive] Error 1
make: *** [Makefile:1478: all-recursive] Error 1
  • IBM's looking to work around it with an Open MPI code change.

PMIx

  • Should we be concerned with an API break from PMIx v4.x to v5.x?
    • Not sure?
    • ABI things were going to break, so he wanted to break API at the same time.
      • Storage spaces for strings.
      • He had them all fixed stride so compilers could optimize... but not sure why.
      • Not sure how to solve striding problem with variable length strings.
  • There was something that was brought up previously about module pointers being fixed for v5.0 for OMPI.
    • Is the long term we'll always
    • Probably converging, but a few hiccups

Longer Term discussions

Doc update

  • No discussion 12/14/2021

  • OMPI docs and manpages, but persistent problem that mpirun is really prrterun

  • PR 8329 - convert README, HACKING, and possibly Manpages to reStructuredText.

    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Intent this is for v5.0
      • mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
    • Ralph has asked about this for PMIx/PRRTE since this is turning out to work
  • No update - 3/16

    • Could be independent of PMIx and PRRTE.
    • PMIx and PRRTE want to follow suit, and not require both pandoc and Sphinx.