
WeeklyTelcon_20180911

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Howard Pritchard
  • Josh Hursey
  • Edgar Gabriel
  • Matthew Dosanjh
  • Thomas Naughton
  • Matias Cabral
  • Todd Kordenbrock
  • Ralph Castain
  • Nathan Hjelm
  • Geoff Paulsen
  • Xin Zhao

not there today (I keep this for easy cut-n-paste for future notes)

  • Geoffroy Vallee
  • Aravind Gopalakrishnan (Intel)
  • Akvenkatesh (nVidia)
  • Dan Topa (LANL)
  • Brian
  • Joshua Ladd
  • David Bernholdt
  • George
  • Peter Gottesman (Cisco)

Agenda/New Business

  • Ralph proposed moving mailman to a new hosting site.

    • mailmanhost.com - $3/list for up to 4K members.
    • dotlist is company behind them.
    • We have about 2600 subscribers.
    • Might have had some more issues today with current provider.
    • No action until face to face
  • Silent Wrong Issue(s)

    • Vader fence issue (Originally Issue 4937)
    • Released v2.1.x with this.
    • Other things for v3.1.x
      • Put out an RC for v3.1.x
    • ACTION: Did this get fixed for v4.0.x?
    • ACTION: did this go to all release branches?
  • Nathan is requesting comments on

    • C11 integration into master. PR5445
    • Got good comments from George and others.
    • Eliminate all of our atomics in favor of C11 atomics.
      • So will need to support until 2020 due to RHEL.
    • Nathan agreed to clear out old stuff now, and will rebase.
  • github suggestion on email filtering

Minutes

Review v2.1.x Milestones v2.1.5

  • v2.1.5 - Done. Did this just for vader bug.
    • That was the last one here as well as v2.0.x. Vendors may submit PRs, but there is no expectation of another release.

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • v3.0.3 - targeting Oct 1st (start RCs when 2.1 wraps up).
    • Not important enough to do in parallel with v4.0.x
  • There was an issue with external PMIx v3.0 hanging. Fixed on master; Ralph backported the fix to OMPI v3.0.x and v3.1.x. Already fixed in OMPI v4.0.
  • fairly extensive bug fix list is building.
  • PR 5634 - Paulsen take a look.

Review v3.1.x Milestones v3.1.0

  • 5083 - ucx segfault - Geoff (IBM) will grab UCX from upstream release and verify Issue 5083 (UCX issue not OMPI issue)

  • NEW: There are a bunch of issues on the v3.1 series that probably also affect v4.0.x.

    • Some of these issues mean entire platforms are broken.
    • Some non-core developers can't add labels to issues.
    • 5540 issue with overlapping datatype.
    • Need to review all open Issues before we ship v4.0.x and see if any are blockers.

v4.0.0

  • Schedule: release: End of Sept.
    • Date for first RC - Sept 11 (today)
  • Passing configure options through to PMIx is challenging.
    • considering a mechanism to pass configure flags down to PMIx configure
    • consider something similar to: --with-romio-flags and pass that to romio.
    • Caution: this can be very painful for escaping.
  • Another issue is it's hard to see how pmix was configured.
    • pmix has a pmix_info - and we should build/package that.
  • PR 5650:
    • The approach in PR 5650 was much simpler. The final solution will come through PMIx, so the larger solution is not needed here.
    • PR 5650 has been removed for earlier versions, and will be satisfied by pmix in the future. So this PR is ONLY needed for v4.0.0
    • Why 2 commits? - Ralph got a bit confused about what Matias wanted.
    • Matias and Ralph have agreed to back out the 2nd commit and create a new PR with it; both will get into RC2.
  • Geoff and Howard came up with list of commits in master, not PRed to v4.0.x and will send out list to devel-core and to people directly.
  • Issue: 5470
    • builtin atomics seem to fail on ppc64le / ARM?
    • Nathan will look at Issue 5470.
    • We may disable atomics by default on everything but Intel systems for v4.0.
  • Issue: 5375 in vader.
    • may be new blocker for v4.0.0
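
The escaping caution raised above for passing configure flags down to PMIx can be illustrated with a small sketch. This is not how Open MPI's configure works; it only shows why a single outer option that carries several inner arguments needs careful quoting. The option name `--with-pmix-flags` and both helper functions are hypothetical, and the inner flags are example values only.

```python
import shlex


def wrap_subconfigure_flags(option_name, flags):
    """Pack a list of inner configure arguments into one outer option."""
    # Quote each inner argument so spaces inside it survive a later split.
    inner = " ".join(shlex.quote(flag) for flag in flags)
    return f"{option_name}={inner}"


def unwrap_subconfigure_flags(argument):
    """Recover the inner argument list from the outer option."""
    # Split only on the first "=", since inner arguments contain "=" too.
    _, _, inner = argument.partition("=")
    return shlex.split(inner)


# Hypothetical option name; the inner flags are illustrative values.
inner_flags = ["--prefix=/opt/my pmix", "CFLAGS=-O2 -g"]
outer = wrap_subconfigure_flags("--with-pmix-flags", inner_flags)

# A second layer of quoting is needed to type the outer option in a shell.
shell_word = shlex.quote(outer)

# Splitting on whitespace without honoring quotes breaks the arguments apart.
naive_split = outer.partition("=")[2].split()
```

Each layer (the shell, the outer configure, the inner configure) strips one level of quoting, which is where the pain comes from: a value containing a space has to be protected once per layer.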

PMIx

  • Release 2.1.3 and 3.0.1

  • Testers for dstore functionality with Cross version functionality.

    • Build with older PMIx and run with latest PMIx dstore.
  • Added PMIx related talking items for face2face.

  • Still an issue with PMIx not supporting cross mpirun connect / accept.

  • Open MPI v5.x Future of Launch

    • Geoffroy Vallee sent out a document with a summary to devel-core.
      Everyone please read and reply.
    • ORTE/PRTE
      • We had a working group meeting to discuss launching under Open MPI v5.0
      • Summary is to throw away ORTE, and make calls directly to PMIx, and then use PRTE with an mpirun wrapper around PRTE.
    • Split this into two steps:
      1. Make PMIx a first class citizen - and call PMIx API directly.
        • When we added the opal PMIx layer, we added infrastructure. We're talking about flipping that around, so that internally Open MPI makes PMIx calls, and other components might then translate the PMIx calls to PMI1 or PMI2 or whatever else.
        • The PMIx community has been operating as a "standard" for a year or so now.
        • PMIx standard document is in progress.
        • Just doing this much should bring ORTE much more in line with PRTE and reduce the bug-fixing effort between the two.
      2. Packaging / Launcher.
        • PRTE is that far ahead of ORTE because it's painful to move them back.
        • Many don't want to have to download something different to launch.
      3. Will need to ponder and come to consensus at face to face.

New topics

  • MTT License discussion - MTT needs to be de-GPL-ified.

    • Main desire is for the Python code to live in a repo with no GPL code (no Perl code).
    • Current status:
      • Need to make progress on this sooner rather than later.
      • Ralph will move existing MTT to new mtt-legacy repo,
        • then rip out perl from MTT repo.
      • Cisco spins up a different Slurm job for each MPI build, with a single ini file. Done this way, it depends on many Perl funclets.
      • If this changes to a different ini for each "stream", it should work okay with Python. Didn't happen before Peter left.
    • Resolution - Just back up current MTT to mtt-legacy,
      • and then rip out the perl from main mtt.
  • MTT performance database?

    • No status for a while.
    • MTT does report this, but no one looks.
    • Howard suggests many different performance dashboards.
      • InfluxDB with Jenkins, which can be queried.
      • Still need to get an up to date viewer.

Review Master Pull Requests

  • didn't discuss today.
  • Next Face to Face
    • When? Week of Oct 16-18th
    • Where? San Jose - Cisco
    • Need Agenda items added to the face to face.
      • Issue with devel-core / mailman.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
