WeeklyTelcon_20180814

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Brian
  • David Bernholdt
  • Howard Pritchard
  • Geoffroy Vallee
  • George
  • Josh Hursey
  • Peter Gottesman (Cisco)
  • Ralph Castain
  • Thomas Naughton
  • Geoff Paulsen
  • Todd Kordenbrock
  • Xin Zhao

not there today (I keep this for easy cut-n-paste for future notes)

  • Nathan Hjelm
  • akshay
  • Matthew Dosanjh
  • Joshua Ladd
  • Matias Cabral
  • Edgar Gabriel
  • Akvenkatesh (nVidia)
  • Dan Topa (LANL)

Agenda/New Business

  • Silent Wrong Issue(s)

    • Branching issue
    • The head of v2.1.x branch is essentially the same (tag and branch)
    • For v3.0.x and v3.1.x, the branch has a lot of commits AFTER the last tag.
    • The Nightly tarballs would test the v3.0.x and v3.1.x, but not the special branch.
    • Let's not panic, and fix it like we'd normally fix it.
    • This issue is NOT on v3.0.x. It was fixed by Mark Allen March 26 here:
    • MAY flip the dates of v3.0.x and v3.1.x milestones, since we just put out a v3.0.x release, and it'd be easy to cherry-pick a few changes and roll that release, and pick up the larger v3.1.x release after v4.0.0 goes out.
  • Nathan is requesting comments on

    • C11 integration into master. PR5445
    • Eliminate all of our own atomics in favor of C11 atomics.
    • ACTION: Please review and comment on code.
  • ORTE discussion went well, Geoffroy Vallee wrote up summary and posted to devel-core on Jul 24th.

    • ACTION: Everyone please read and reply to devel-core with your thoughts.
  • GitHub suggestion on email filtering

Minutes

Review v2.x Milestones v2.1.4

  • v2.1.4 - Released v2.1.4 ON TIME.
  • Now we need v2.1.5
  • A serious issue on VADER came in. Bad memory barrier in fast box.
    • Introduced last Dec. on all supported release streams.
    • Potential Silent Data Corruption.
  • George issue -
    • If users use an overlapping datatype (data overlaps itself), Open MPI sends wrong data.
    • Potential Silent Data Corruption.
    • Affects all releases.
    • George has a patch
  • Always used to have a src RPM as part of RC.
  • Jeff had some problems using the Python script to upload 2.1.4 tarballs built on AWS to S3.
  • Typo fix for PMIx (MB prefix), but not upgrading because 2.1.4 is the end of the 2.x stream
  • Peter filed an Issue 5520
    • Thread Multiple warnings when exit on an error. Doesn't block.
  • Aug 10th is release date.
    • Test RC, get feedback back.
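
For the overlapping-datatype issue noted above, it may help to pin down what "correct" data looks like. The sketch below is a toy model only, not Open MPI's datatype engine and not a reproduction of the bug: it assumes a datatype can be described as a list of (displacement, length) byte blocks, and the helper names `pack` and `overlaps` are hypothetical. On the send side, a byte covered by two blocks should appear in the packed stream once per block.

```python
def pack(buffer: bytes, blocks):
    """Reference pack: concatenate each (displacement, length) block in order."""
    out = bytearray()
    for disp, length in blocks:
        if disp < 0 or disp + length > len(buffer):
            raise ValueError("block falls outside the buffer")
        out += buffer[disp:disp + length]
    return bytes(out)


def overlaps(blocks):
    """Return True if any two blocks cover a common byte."""
    spans = sorted((d, d + n) for d, n in blocks if n > 0)
    # After sorting by start, any overlap shows up between neighbours.
    return any(prev_end > start
               for (_, prev_end), (start, _) in zip(spans, spans[1:]))


if __name__ == "__main__":
    data = b"ABCDEFGH"
    layout = [(0, 4), (2, 4)]      # bytes 2..3 are covered twice
    print(overlaps(layout))        # whether the layout overlaps itself
    print(pack(data, layout))      # overlapped bytes are sent once per block
```

A regression test along these lines could compare what the receiver gets against the reference `pack` output for layouts where `overlaps` is true.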

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • v3.0.x will try to do an RC today.
    • Probably won't have George's patch in it.
    • Want to have Nathan's patch (which isn't in master yet).
  • Is there a reason it's not in master? Jeff will followup.
  • PR 5484 - want into RC1, but Gilles is on vacation. - Nathan can test
  • need
  • v3.0.3 - targeting Sept 1st (start RCs when 2.1 wraps up).
    • Anticipate RC1 after the Aug 10th release of v2.1.4.
    • Got good progress in reviews.

Review v3.1.x Milestones v3.1.0

  • v3.1.2 release process starts after the Sept 1st release of v3.0.3
  • Lots of PRs multiple 5485
  • ucx segfault
  • 5083 - we just need some update. Xin Zhao will update issue.

v4.0.0

  • Schedule: branch: July 18. release: Sept 17
    • Date for first RC - Aug 13 (after sunset of 2.1.4)
  • Cuda support:
    • Does NVIDIA want openib included by default when building --with-cuda?
      • Yes, because at this moment UCX is not on par, but still want to migrate to ucx cuda.
      • Warning message will mention deprecated openib vs ucx
      • Has this work been done???
  • NEWS - Deprecate MPIR message for NEWS - Ralph can help with this.
  • PR 5497 - ROMIO wait for Gilles to review. Later this week.
  • PR 5472 - joint effort of 4 commits - Jeff to review
    • status update: Good enough at the moment, Not exactly the scheme we outlined in prior issue. It does satisfy external hwloc or external libevent. Since it broke aws.
  • New OMPI-IO components - PR 5539 - DDN added support for Infinite Memory Engine.
    • Can we Pull this into v4.0.0?
    • Sorry, No. This is new functionality and we've already branched for v4.0.x
      • We can consider this for v4.0.1, but it might not get it until v4.1.x
    • Who has a filesystem that can test this?
    • Very well isolated component. Can it be considered?
  • PR 5504 - Please ensure bug fixes only, and separate commits to allow us to consider them separately.
  • Geoff and Howard will build test suites with v3.1.x and run with master/v4.0 to see if anything breaks.

PMIx

  • ORTE/PRTE - Geoffroy Vallee sent out document with summary to core-devel. Everyone please read and reply.
    • Just asked everyone to please read this, and will discuss next week.
    • Want to make sure that there are very good alternatives to whatever orte is turning into that will use PMIx.
    • Replacing framework and calling PMIx directly is a really good idea.
      • Will mess up if there is no native support for PMIx.
    • in Open MPI v5.0.x timeframe.
  • A couple of PMIx release branches are getting closer to release.
    • Some updates that might be worth getting into Open MPI, but don't hold up release for.

New topics

  • From two weeks ago:
    • MTT License discussion - MTT needs to be de-GPL-ified.
      • All go try the python. - All the GPL is in the perl modules (using python works around that).
      • Ralph started a PR, and now in limbo. Need to get this done by end of 2018
    • Main concern is python is in a repo with no GPL code.
      • Could delete perl altogether, but may need to just move perl to a different repo for a period of time, until everyone can move off of perl.
    • Has cisco found an alternative to perl funclets?
      • Python ini execution is different than Perl's.
    • Cisco has one perl ini for each branch, and under that, 20-30 MPI installs.
      • Probably will go with a template and stamp it out 20-30 times.
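
The "template and stamp out 20-30 times" idea above could be sketched roughly as follows. This is only an illustration under assumptions: the `[install:...]` section layout, the key names, and the `render` helper are all hypothetical and do not reflect MTT's real ini schema; the install names and configure arguments are placeholder examples.

```python
from string import Template

# Hypothetical section layout; MTT's real ini keys may differ.
SECTION = Template(
    "[install:$name]\n"
    "branch = $branch\n"
    "configure_args = $configure_args\n"
)


def render(branch, installs):
    """Stamp out one ini section per (name, configure_args) pair."""
    return "\n".join(
        SECTION.substitute(name=name, branch=branch, configure_args=args)
        for name, args in installs
    )


if __name__ == "__main__":
    # Placeholder install variants for one branch.
    installs = [
        ("gnu-debug", "--enable-debug"),
        ("gnu-static", "--disable-shared"),
    ]
    print(render("v4.0.x", installs))
```

With one list of install variants shared across branches, each per-branch ini would reduce to a single `render` call.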

Review Master Pull Requests

  • PR for setting VERSION on master. Have we broken any VERSIONs?
  • Issue 5529 - George and Jeff discussing a flag that doesn't work. Not sure how to fix it yet.
  • Hope to have better Cisco MTT in a week or two

    • Peter is going through, and he found a few failures, some of which have been posted.
      • one-sided - nathan's looking at.
      • some more coming.
    • osc_pt2pt will exclude itself in an MT run.
      • One of Cisco's MTT runs uses an env setting to turn all MPI_Init calls into MPI_Init_thread (even though it is a single-threaded run).
        • Now that osc_pt2pt is ineligible, many tests fail.
        • on Master, this will fix itself 'soon'
        • BLOCKER for v4.0 for this work so we'll have vader and something for osc_pt2pt.
        • Probably an issue on v3.x also.
      • Did this for release branches, Nathan's not sure if on Master. - v4.0.x has RMA capable vader. Once
  • Next Face to Face

    • When? Discuss results of doodle. Settle: Oct 16-18 week
    • Where? Settle:
      • San Jose - Cisco
        • Brian may be able to come if it's San Jose for a day-trip
      • Albuquerque - Sandia (believe it's okay, but need to verify)
        • May have problems with foreign nationals (90 days), so too late.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
