
WeeklyTelcon_20180828

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Aravind Gopalakrishnan (Intel)
  • Akvenkatesh (nVidia)
  • Edgar Gabriel
  • Geoffroy Vallee
  • Howard Pritchard
  • Geoff Paulsen
  • Dan Topa (LANL)
  • Matthew Dosanjh
  • Nathan Hjelm
  • Ralph Castain
  • Todd Kordenbrock
  • Xin Zhao
  • Josh Hursey

not there today (I keep this for easy cut-n-paste for future notes)

  • Brian
  • Joshua Ladd
  • Matias Cabral
  • Dan Topa (LANL)
  • David Bernholdt
  • George
  • Peter Gottesman (Cisco)
  • Thomas Naughton

Agenda/New Business

  • Ralph proposed moving mailman to a new hosting site.

    • mailmanhost.com - $3/list for up to 4K members.
    • dotlist is company behind them.
    • We have about 2600 subscribers.
  • Silent Wrong Issue(s)

    • Vader fence issue (Originally Issue 4937)
    • Released v2.1.x with this.
    • Other things for v3.1.x
      • Put out an RC for v3.1.x
  • 128 bit discussion on devel-core.

    • Real issue for ARM.
    • Bug in a specific ARM architecture back-end.
      • Jeff filed compiler bug with LLVM
    • ARM architecture we're testing is different than what ARM was testing.
      • George sent: Sorry guy, I wont be able to attend the call today to talk about the 2 pending issues (atomic 128 CAS and datatype). However, with Jeff's help the atomic ticket can be considered as resolved (once the -latomic on Power 9 is resolved and the PR merged) as it addresses all issues we know about the 128 bits CAS. The datatype bug (vector with stride less than block length) is still open, I have a patch but I did not yet had time to completely validate the patch and make a PR. I will try to get to it asap.
      • Failing one of the IBM XL compiler CIs. The problem is the -latomic library: it seems to work fine with C, but not with Fortran (PR 5546).
      • IBM will investigate and comment on PR 5546
    • Aug 28 - Jeff is asking Nathan for confirmation that his commit message is correct, and will PR soon.
    • Blocker on v4.0
  • Nathan is requesting comments on:

    • C11 integration into master. PR5445
      • Will rebase after the 128-bit atomics are in, since some changes are analogous.
      • ACTION: Please review and comment on code.
    • Eliminate all of our own atomics in favor of C11 atomics.
      • Will need to keep supporting our own atomics until 2020 due to RHEL.
  • ORTE discussion went well, Geoffroy Vallee wrote up summary and posted to devel-core on Jul 24th

    • ACTION: Everyone please read and reply to devel-core with your thoughts.
  • User wants an env var to allow root to run.

    • Compromise is to set TWO env vars to allow "I want this, yes I really want this".
    • Okay this sounds reasonable.
  • github suggestion on email filtering
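
The two-variable compromise for running as root discussed above could look roughly like the sketch below. This is only an illustration of the "I want this, yes I really want this" idea: the variable names `DEMO_ALLOW_RUN_AS_ROOT` and `DEMO_ALLOW_RUN_AS_ROOT_CONFIRM`, the helper names, and the rule that each variable must be exactly `1` are all assumptions made for this example, not anything agreed on the call.

```c
/* Only needed so setenv()/unsetenv() are declared for callers that
 * want to exercise this sketch under a strict -std=c11 build. */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical variable names, chosen only for this sketch. */
#define DEMO_ALLOW_ROOT_VAR   "DEMO_ALLOW_RUN_AS_ROOT"
#define DEMO_CONFIRM_ROOT_VAR "DEMO_ALLOW_RUN_AS_ROOT_CONFIRM"

/* True only when the named variable is set to exactly "1"
 * (an assumed convention for this example). */
bool demo_env_is_one(const char *name)
{
    assert(name != NULL);
    const char *value = getenv(name);
    return value != NULL && strcmp(value, "1") == 0;
}

/* Root may run only if BOTH variables are set; setting just one
 * of them is treated the same as setting none. */
bool demo_root_run_allowed(void)
{
    return demo_env_is_one(DEMO_ALLOW_ROOT_VAR) &&
           demo_env_is_one(DEMO_CONFIRM_ROOT_VAR);
}
```

Requiring both variables means a single stray export in a shell profile is not enough to enable root execution; the user has to opt in twice.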

Minutes

v2.0.x

  • PR for old v2.0.x - their version they're shipping is based on v2.0.x
  • Right now any PR would fail, because of CI.
  • Policy:
    • If it passes CI, put it in; if it doesn't, don't put it in.
    • Once a release branch goes out of maintenance, a vendor may still care to put PRs in it. If the release engineer doesn't want to keep pulling things in, they could give the vendor merge permission.

Review v2.x Milestones v2.1.5

  • v2.1.5 - Done. Did this just for vader bug.

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • v3.0.x will try to do an RC today.
    • Probably won't have George's patch in it.
    • Want to have Nathan's patch (which isn't in master yet).
  • On old compilers, the configury in PMIx isn't quite right for 'inline'.
  • v3.0.3 - targeting Oct 1st (will start RCs when 2.1 wraps up).
    • Anticipate RC1 after the Aug 10th release of v2.1.4.
    • Got good progress in reviews.
  • There was an issue with external PMIx v3.0 hanging. Fixed on master; Ralph backported the fix to OMPI v3.0.x and v3.1.x. Already fixed in OMPI v4.0.

Review v3.1.x Milestones v3.1.0

  • v3.1.2 release process, starts after Sept 1st release of v3.0.3
  • PMIx 2.1.4 will be released on Friday.
    • Want to bring it into the internal copy for OMPI v3.1.x and MAYBE v3.0.x.
  • Schedule: Still looking on target for Sept 1st.
  • Please test with v3.1.2 RC1.
  • Lots of PRs multiple 5485
  • ucx segfault - Geoff (IBM) will grab UCX from upstream release and verify Issue 5083 (UCX issue not OMPI issue)
  • 5083 - we just need some update. Xin Zhao will update issue.

v4.0.0

  • Schedule: branch: July 18. release: Sept 17
    • Date for first RC - Aug 13 (after sunset of 2.1.4)
  • RMs will meet Friday.
  • a few PRs need review.
  • 5562 - Edgar - ordering problem during file open
    • Edgar has a few PRs.
  • Cuda support:
    • Does NVIDIA want openib included by default when --with-cuda is given?
      • Yes, because at this moment UCX is not on par, but still want to migrate to UCX CUDA.
      • Warning message will mention deprecated openib vs UCX.
      • Has this work been done???
  • MXM MTL removal stuff -
    • Howard will work on removing MXM (no configure option).
  • Geoff to look at 5608 - and 5606.
  • 5565 - PMIx v3.0.1 - josh hursey to review.
  • PR some issues sharing a daemon with mpirun - feature that leads to this problem is also in v3.0 and v3.1.
    • Nathan did some testing and hit this. One on ALPS and one General.
    • opal_output has a lock. The problem arises if one thread calls fork (before exec) while another thread is inside opal_output holding that lock.
      • Could just GET the lock before the fork, and then have parent and child both drop the lock.
      • OR could just have children dump ALL locks.
      • Clone might be safer (at least on Crays) in some situations.
    • Nathan hit when he added extra opal_outputs testing multi-threaded launcher (alps)
      • Only an issue in multi-threaded spawner.
  • NEWS - Deprecate MPIR message for NEWS - Ralph can help with this.
    • DONE
  • PR 5497 - ROMIO wait for Giles to review. Later this week.
  • PR 5472 - joint effort of 4 commits - Jeff to review
    • status update: Good enough at the moment, Not exactly the scheme we outlined in prior issue. It does satisfy external hwloc or external libevent. Since it broke aws.
  • Geoff and Howard will build test suites with v3.1.x and run with master/v4.0 to see if anything breaks.
    • Didn't happen last week, will try for this week.
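
The first option raised for the opal_output fork problem above (take the lock before the fork, then have both parent and child drop it) can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual opal_output code: `demo_output_lock`, `demo_fork_with_output_lock`, and `demo_child_can_take_lock` are hypothetical names, and a plain POSIX mutex stands in for whatever lock the real output routine uses.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for the lock inside an output routine (hypothetical name). */
static pthread_mutex_t demo_output_lock = PTHREAD_MUTEX_INITIALIZER;

/* Fork while holding the output lock, so no other thread can be in the
 * middle of an output call (and own the lock) at the instant of the fork.
 * Parent and child each release their own copy of the lock afterwards. */
pid_t demo_fork_with_output_lock(void)
{
    int rc = pthread_mutex_lock(&demo_output_lock);
    assert(rc == 0);

    pid_t pid = fork();

    /* Runs in the parent, in the child, and on fork failure. */
    rc = pthread_mutex_unlock(&demo_output_lock);
    assert(rc == 0);
    return pid;
}

/* Self-check: returns 1 if the child was able to take the lock
 * right after the fork, 0 otherwise. */
int demo_child_can_take_lock(void)
{
    pid_t pid = demo_fork_with_output_lock();
    if (pid < 0) {
        return 0;
    }
    if (pid == 0) {
        /* Child: the lock was released above, so trylock should succeed. */
        _exit(pthread_mutex_trylock(&demo_output_lock) == 0 ? 0 : 1);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) {
        return 0;
    }
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

Without the lock taken around `fork()`, the child could inherit a mutex that was held by a thread that does not exist in the child, and its first output call before `exec` would block forever. The same effect can be obtained with `pthread_atfork()` handlers; the explicit version is shown here only because it mirrors the wording in the notes.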

PMIx

  • Release 2.1.3 and 3.0.1

  • Testers for dstore functionality with Cross version functionality.

    • Build with older PMIx and run with latest PMIx dstore.
  • Added PMIx related talking items for face2face.

  • Still an issue with PMIx not supporting cross mpirun connect / accept.

  • Open MPI v5.x Future of Launch

    • Geoffroy Vallee sent out a document with a summary to devel-core. Everyone please read and reply.
    • ORTE/PRTE
      • We had a working group meeting to discuss future of launching under Open MPI.
      • Summary is to throw away ORTE, and make calls directly to PMIx, and then use PRTE with an mpirun wrapper around PRTE.
    • Split this into two layers:
      1. Make PMIx a first class citizen - and call PMIx API.
        • When we added the opal PMIx layer, we added infrastructure, and we're talking about flipping that around, so internally Open MPI calls PMIx calls, and then other components might translate the PMIx calls to PMI1 or PMI2 or whatever else.
        • PMIx community operating as a "standard" for over a year or so now.
        • PMIx standard document is in progress.
        • Just doing this much should bring ORTE much more in line with PRTE, and reduce the bug-fixing effort between the two.
      2. Packaging / Launcher.
        • PRTE is that far ahead of ORTE because it's painful to move them back.
        • Many don't want to have to download something different to launch.
      3. Will need to ponder and come to consensus at face to face.

New topics

  • HLRS - bunch of master failures.
  • Failure is pull-base-gather-in-place segfaulting. - George.
  • From three weeks ago:
    • MTT License discussion - MTT needs to be de-GPL-ified.
      • OLD - All go try the python. - All the GPL is in the perl modules (using python works around that).
      • Ralph started a PR, and now in limbo.
        • Need to get this done by end of 2018 (Sooner than that!)
      • Agreed to move perl MTT. perl client can stay, but CPAN stuff must go, and users can get their own.
      • Cisco spins up a different slurm job for each MPI build, and each section then depends on that. Cisco is doing it with a single ini file.
    • Main concern is python is in a repo with no GPL code.
      • Could delete perl altogether, but may need to just move perl to a different repo for a period of time, until everyone can move off of perl.
    • Has cisco found an alternative to perl funclets?
      • Python ini execution is different than Perl's.
      • Not yet, and this is Peter's last week.
    • Cisco has one perl ini for each branch, and under that, 20-30 MPI installs.
      • Probably will go with a template and stamp out 20-30 times
  • MTT performance database?
    • MTT does report this, but no one looks.
    • Howard suggests many different performance dashboards.
      • Influx DB with jenkins, and can be queried.
      • Still need to get an up to date viewer.

Review Master Pull Requests

  • PR for setting VERSION on master. Have we broken any VERSIONs?
  • Issue 5529 - George and Jeff discussing a flag that doesn't work. Not sure how to fix it yet.
  • Hope to have better Cisco MTT in a week or two

    • Peter is going through, and he found a few failures, which some have been posted.
      • one-sided - nathan's looking at.
      • some more coming.
    • osc_pt2pt will exclude itself in an MT run.
      • One of Cisco's MTT runs uses an env setting to turn all MPI_Init calls into MPI_Init_thread (even though it is a single-threaded run).
        • Now that osc_pt2pt is ineligible, many tests fail.
        • on Master, this will fix itself 'soon'
        • BLOCKER for v4.0 for this work so we'll have vader and something for osc_pt2pt.
        • Probably an issue on v3.x also.
      • Did this for release branches, Nathan's not sure if on Master. - v4.0.x has RMA capable vader. Once
  • PR5570 Attempting to put an asterisk character in the name of an MCA parameter.

    • No precedent for this, how do we like this?
    • Consensus that we'd prefer something other than an asterisk character.
  • Next Face to Face

    • When? Week of Oct 16-18th
    • Where? San Jose - Cisco
    • Need Agenda items added to the face to face.
      • Issue with devel-core / mailman.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

