
WeeklyTelcon_20210601

Geoffrey Paulsen edited this page Jul 5, 2021 · 2 revisions

Open MPI Weekly Telecon

Attendees (on Web-ex)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • Hessam Mirsadeghi (NVIDIA)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart (HLRS)
  • Matthew Dosanjh (Sandia)
  • Sam Gutierrez (LANL)
  • Todd Kordenbrock (Sandia)
  • Tomislav Janjusic (NVIDIA)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • Erik Zeiske (HPE)
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Josh Hursey (IBM)
  • Joshua Ladd (NVIDIA)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Michael Heinz (Cornelis Networks)
  • Nathan Hjelm (Google)
  • Naughton III, Thomas (ORNL)
  • Noah Evans (Sandia)
  • Raghu Raja (secret startup)
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (NVIDIA)

New Items

v4.0.x

  • Will roll v4.0.6 rc today

  • We'll do one more RC, and then get a final v4.0.6 out.

  • Where are we on pack/unpack with long and long double?

    • only external32
    • This worked before, but not sure
  • 8918 - pack/unpack with external32

  • 8818 - checking if

  • Brian thinks Issue 8990 would also apply to v4.0.x

    • With --with-libevent=/usr (as Debian packaging does), we add a -L/usr to the wrapper output and put all of the -L flags for finding dependencies before the -L for libmpi.so, and if there is also an ompi in /usr/lib,
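
The search-order concern in Issue 8990 can be illustrated with a small sketch. It relies only on the general fact that a linker searches -L directories in the order given and takes the first match. The directory names (`usr/lib`, `opt/ompi/lib`) and the helpers `resolve_library` and `demo` are hypothetical and exist only for this illustration; this is not the wrapper compiler's actual logic.

```python
import os
import tempfile


def resolve_library(lib_name, search_dirs):
    """Return the first path in search_dirs that contains lib_name.

    Mimics, in simplified form, how a linker walks -L directories in order.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, lib_name)
        if os.path.exists(candidate):
            return candidate
    return None


def demo():
    """Build two fake library directories and resolve libmpi.so both ways."""
    with tempfile.TemporaryDirectory() as root:
        # Hypothetical layout: a system-wide copy and the intended install.
        system_lib = os.path.join(root, "usr", "lib")
        ompi_lib = os.path.join(root, "opt", "ompi", "lib")
        for directory in (system_lib, ompi_lib):
            os.makedirs(directory)
            open(os.path.join(directory, "libmpi.so"), "w").close()

        # Dependency -L first: the system copy shadows the intended one.
        dep_first = resolve_library("libmpi.so", [system_lib, ompi_lib])
        # libmpi.so -L first: the intended copy is found.
        ompi_first = resolve_library("libmpi.so", [ompi_lib, system_lib])
        return (os.path.relpath(dep_first, root),
                os.path.relpath(ompi_first, root))
```

Calling `demo()` returns two relative paths: the first resolves to the copy under `usr/lib` because that directory was searched first, and the second resolves to the copy under `opt/ompi/lib`.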

v4.1.x

  • Shooting for end of August
  • No driver to rush, so now just in bugfix phase.

v5.0.x

  • Unscheduled RC
  • PR 9014 - new blocker.
    • fix should just be a couple of lines of code... hard to decide what we want.
    • Ralph, Jeff and Brian started talking.
  • Need some configury changes in before we RC.
  • Issue 8850, 8990 and more
  • Brian will file 3-ish issues
    • One is configure pmix
  • Dynamic Windows fix in for UCX.
  • Any update on debugger support?
  • Need some documentation that Open MPI v5.0 supports PMIx based debuggers, and that if
  • MPIR Shim - pushed up fixes, and enabled CI.
    • Could add it to some more CI, to ensure that PMIx doesn't break
    • IBM is working on some CI testing with MPIR (typically very brittle)
    • Need some guidance on pmix version.
    • Right now it's probably not a big deal, but perhaps in 2 years, when we have 3 release branches with different PMIx versions, it might make sense to do Open MPI CI testing.
      • Shouldn't be too much work to do.
  • UCC coll component is being updated to be set as the default when UCX is selected. PR 8969
    • Intent is that this will eventually replace hcoll.

Reformatting

Master

  • PR 8998 - MPIPy
    • In shift to PRRTE, --oversubscribe is NOT being handled. If you have more procs than slots on a node, internal oversubscribe var is not yet being set.
    • Jeff will look at.
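
As a rough sketch of the condition described above (more procs than slots on a node), the check could look like the following. The function name `oversubscribed_nodes` and the dictionary layout are assumptions made for illustration; this is not PRRTE's internal implementation or its variable names.

```python
def oversubscribed_nodes(slots_per_node, procs_per_node):
    """Map each node name to True if it runs more procs than it has slots.

    Both arguments are hypothetical dicts keyed by node name; a node with
    no entry in procs_per_node is treated as running zero procs.
    """
    return {
        node: procs_per_node.get(node, 0) > slots
        for node, slots in slots_per_node.items()
    }


# Example: node1 has 4 slots but is asked to run 6 procs.
flags = oversubscribed_nodes(
    {"node0": 4, "node1": 4},
    {"node0": 4, "node1": 6},
)
```

In this example only `node1` is flagged, which is the case where an internal oversubscribe setting would be expected to take effect.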

MTT

  • Mellanox hasn't been reporting for a while. Tommi will follow up.
  • Jeff did some work on Cisco MTT.
    • There are a bunch of one-sided issues across nodes.
    • Austen and Jeff looking into.
    • Narrowed it down to strange results from MPI_Comm_split
      • Local Peers value appears to be set wrong under PRRTE
  • Joseph saw hwloc being installed into the installation path, which leads to warnings when another hwloc is used.
    • We changed how all of this worked a few weeks ago.
    • We shouldn't be installing one unless we can't find an external one.
    • Problem is if you link the application to a different hwloc, it now complains.
    • This has always been true, we just warn now. Don't do this.
  • Austen filed a couple of issues from MTT.

PMIx

  • No discussion

PRRTE v2.0

  • No update

Longer Term discussions

  • No discussion.