WeeklyTelcon_20180821

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Edgar Gabriel
  • David Bernholdt
  • Howard Pritchard
  • Geoffroy Vallee
  • George
  • Peter Gottesman (Cisco)
  • Ralph Castain
  • Thomas Naughton
  • Geoff Paulsen
  • Todd Kordenbrock
  • Nathan Hjelm
  • Akvenkatesh (nVidia)
  • Xin Zhao

not there today (I keep this for easy cut-n-paste for future notes)

  • Brian
  • Josh Hursey
  • Matthew Dosanjh
  • Joshua Ladd
  • Matias Cabral
  • Dan Topa (LANL)

Agenda/New Business

  • Silent Wrong Issue(s)

    • Vader fence issue (Originally Issue 4937)
    • Released v2.1.x with this.
    • Other things for v3.1.x
      • Put out an RC for v3.1.x
  • 128 bit discussion on devel-core.

    • Real issue for ARM.
    • Bug in a specific ARM architecture back-end.
      • Jeff filed compiler bug with LLVM
    • ARM architecture we're testing is different than what ARM was testing.
      • George sent: Sorry guys, I won't be able to attend the call today to talk about the 2 pending issues (atomic 128 CAS and datatype). However, with Jeff's help the atomic ticket can be considered resolved (once the -latomic on Power 9 is resolved and the PR merged), as it addresses all issues we know about with the 128-bit CAS. The datatype bug (vector with stride less than block length) is still open. I have a patch, but I have not yet had time to completely validate it and make a PR. I will try to get to it asap.
      • Failing one of the IBM XL compiler CIs. The problem is the -latomic library: it seems to work fine with C, but not with Fortran (PR 5546).
      • IBM will investigate and comment on PR 5546
  • Nathan is requesting comments on

    • C11 integration into master. PR5445
      • Will rebase after the 128-bit atomics are in, since there are some analogous changes.
      • ACTION: Please review and comment on code.
    • Eliminate all of our own atomics in favor of C11 atomics.
      • Will need to keep supporting the existing ones until 2020 due to RHEL.
  • ORTE discussion went well, Geoffroy Vallee wrote up summary and posted to devel-core on Jul 24th.

    • ACTION: Everyone please read and reply to devel-core with your thoughts.
  • User wants an env var to allow root to run.

    • Compromise is to require setting TWO env vars, to express "I want this, yes I really want this".
    • Okay this sounds reasonable.
  • github suggestion on email filtering
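
As an illustration of the two-env-var compromise for running as root noted above, here is a minimal sketch. The variable names are hypothetical placeholders chosen for this example (the notes do not name them), and the check is only one plausible way to express "both must be set":

```python
import os

# Hypothetical variable names, used only for illustration; the meeting
# notes do not specify the actual names.
ALLOW_VAR = "EXAMPLE_ALLOW_RUN_AS_ROOT"
CONFIRM_VAR = "EXAMPLE_ALLOW_RUN_AS_ROOT_CONFIRM"


def root_run_permitted(env, uid):
    """Return True if a launch may proceed for the given user id.

    Non-root users are always allowed. Root is allowed only when BOTH
    variables are set to "1", so a single stray setting is not enough.
    """
    if uid != 0:
        return True
    return env.get(ALLOW_VAR) == "1" and env.get(CONFIRM_VAR) == "1"


if __name__ == "__main__":
    if root_run_permitted(os.environ, os.getuid()):
        print("launch permitted")
    else:
        print("refusing to run as root without both variables set")
```

Requiring a second, separately named confirmation variable makes it unlikely that root execution is enabled by accident, for example through one inherited setting in a shared environment file.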

Minutes

Review v2.x Milestones v2.1.5

  • v2.1.5 - Done. Did this just for vader bug.

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • v3.0.x will try to do an RC today.
    • Probably won't have George's patch in it.
    • Want to have Nathan's patch (which isn't in master yet).
  • On old compilers, the configury in PMIx isn't quite right for 'inline'.
  • v3.0.3 - targeting Oct 1st (start RCs when 2.1 wraps up).
    • Anticipate RC1 after the Aug 10th release of v2.1.4.
    • Got good progress in reviews.

Review v3.1.x Milestones v3.1.0

  • v3.1.2 release process, starts after Sept 1st release of v3.0.3
  • PMIX 2.1.3
  • Schedule: Still looking on target for Sept 1st.
  • Please test with v3.1.2 RC1.
  • Lots of PRs multiple 5485
  • ucx segfault - Geoff (IBM) will grab UCX from upstream release and verify Issue 5083 (UCX issue not OMPI issue)
  • 5083 - we just need some update. Xin Zhao will update issue.

v4.0.0

  • Schedule: branch: July 18. release: Sept 17
    • Date for first RC - Aug 13 (after sunset of 2.1.4)
  • a few PRs need review.
  • 5562 - Edgar - ordering problem during file open
  • Cuda support:
    • Does nVidia want if --with-cuda, then openib included by default?
      • Yes, because at this moment UCX is not on par, but still want to migrate to ucx cuda.
      • Warning message will mention deprecated openib vs ucx.
      • Has this work been done???
  • MXM removal stuff -
    • Howard will work on removing MXM (no configure option).
  • NEWS - Deprecate MPIR message for NEWS - Ralph can help with this.
    • DONE
  • PR 5497 - ROMIO - waiting for Gilles to review later this week.
  • PR 5472 - joint effort of 4 commits - Jeff to review
    • status update: Good enough at the moment, Not exactly the scheme we outlined in prior issue. It does satisfy external hwloc or external libevent. Since it broke aws.
  • Geoff and Howard will build test suites with v3.1.x and run with master/v4.0 to see if anything breaks.
    • Didn't happen last week, will try for this week.

PMIx

  • Release 2.1.3 and 3.0.1

  • Added PMIx related talking items for face2face.

  • Still an issue with PMIx not supporting cross mpirun connect / accept.

  • Open MPI v5.x Future of Launch

    • Geoffroy Vallee sent out a document with a summary to devel-core. Everyone please read and reply.
    • ORTE/PRTE
      • We had a working group meeting to discuss future of launching under Open MPI.
      • Summary is to throw away ORTE, and make calls directly to PMIx, and then use PRTE with an mpirun wrapper around PRTE.
    • Split this into two layers:
      1. Make PMIx a first class citizen - and call PMIx API.
        • When we added the opal PMIx layer, we added infrastructure, and we're talking about flipping that around, so internally Open MPI calls PMIx calls, and then other components might translate the PMIx calls to PMI1 or PMI2 or whatever else.
        • PMIx community operating as a "standard" for over a year or so now.
        • PMIx standard document is in progress.
        • Just doing this much should bring ORTE much more in line with PRTE, and make carrying bug fixes between the two much less work.
      2. Packaging / Launcher.
        • PRTE is that far ahead of ORTE because it's painful to move them back.
        • Many don't want to have to download something different to launch.
      3. Will need to ponder and come to consensus at face to face.

New topics

  • HLRS - bunch of master failures.
  • The failure is a segv in pull-base-gather-in-place. - George.
  • From three weeks ago:
    • MTT License discussion - MTT needs to be de-GPL-ified.
      • All go try the python. - All the GPL is in the perl modules (using python works around that).
      • Ralph started a PR, and now in limbo. Need to get this done by end of 2018
    • Main concern is that the python lives in a repo with no GPL code.
      • Could delete perl altogether, but may need to just move perl to a different repo for a period of time, until everyone can move off of perl.
    • Has cisco found an alternative to perl funclets?
      • Python ini execution is different than perl's.
      • Not yet, and this is Peter's last week.
    • Cisco has one perl ini for each branch, and under that, 20-30 MPI installs.
      • Probably will go with a template and stamp out 20-30 times
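
The "template and stamp out 20-30 times" idea above could look roughly like the following sketch. The section and key names are made up for illustration and are not the real MTT ini schema; the install names and configure arguments are likewise hypothetical:

```python
from string import Template

# Illustrative template only: the section header and keys below are
# hypothetical and do not reflect the actual MTT ini format.
INI_TEMPLATE = Template(
    "[MPI install: $name]\n"
    "branch = $branch\n"
    "configure_arguments = $configure_args\n"
)


def stamp_out(branch, installs):
    """Render one ini fragment per install for a given branch.

    `installs` maps an install name to its configure arguments.
    Returns a dict mapping install name to the rendered ini text.
    """
    return {
        name: INI_TEMPLATE.substitute(
            name=name, branch=branch, configure_args=args
        )
        for name, args in installs.items()
    }


if __name__ == "__main__":
    rendered = stamp_out(
        "v4.0.x",
        {"gcc-debug": "--enable-debug", "clang": "CC=clang"},
    )
    for text in rendered.values():
        print(text)
```

Keeping the per-install differences in a plain mapping means the 20-30 variants stay in one place, while the shared structure lives in a single template.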

Review Master Pull Requests

  • PR for setting VERSION on master. Have we broken any VERSIONs?
  • Issue 5529 - George and Jeff discussing a flag that doesn't work. Not sure how to fix it yet.
  • Hope to have better Cisco MTT in a week or two

    • Peter is going through them and has found a few failures, some of which have been posted.
      • one-sided - nathan's looking at.
      • some more coming.
    • OSC_pt2pt will exclude itself in an MT run.
      • One of Cisco's MTT runs sets an env var to turn every MPI_Init into MPI_Init_thread (even though the run is single-threaded).
        • Now that osc_pt2pt is ineligible, many tests fail.
        • on Master, this will fix itself 'soon'
        • BLOCKER for v4.0 for this work so we'll have vader and something for osc_pt2pt.
        • Probably an issue on v3.x also.
      • Did this for release branches, Nathan's not sure if on Master. - v4.0.x has RMA capable vader. Once
  • PR5570 Attempting to put an asterisk character in the name of an MCA parameter.

    • No precedent for this, how do we like this?
    • Consensus that we'd prefer something other than an asterisk character.
  • Next Face to Face

    • When? Week of Oct 16-18th
    • Where? San Jose - Cisco

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018