
WeeklyTelcon_20170620

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Artem Polyakov
  • Jeff Squyres (Cisco)
  • Howard Pritchard
  • Josh Hursey
  • Mohan
  • Murali Emani (LLNL)
  • Todd Kordenbrock
  • David Bernholdt (ORNL)
  • Nathan Hjelm
  • Ralph
  • Brian Barrett (Amazon)

Agenda

2.0.3

  • just a few PRs going in.

Review v2.x

  • 3714 - Does shift signal forwarding need to go to 2.0.x?
    • This is an enhancement, not a bugfix, for running under SLURM.
    • It's a bug because, if they are using mpirun on SLURM, scancel won't get the signal.
    • It's certainly not a regression.
    • LLNL - will handle this with a patch.
    • Whenever we mess with job termination, it causes issues.
    • We'll think about it; there is no rush for the next 2.1.x.
  • PR3487 -
    • Continue to discuss in PR.
  • Looked at timelines for 2.0.x and 2.1.x
    • No super critical bugs / bugfixes.
  • PMIx - PR3696
    • IBM will open an issue associated with this.
    • When PMIx fixed the IBM load/store issue, it opened a can of worms (memory corruption, alignment in the PMIx library).
      • Was hitting some hangs and data corruption.
      • Still iterating on it. Once that's done, it can be PR'd to v3.x.
      • Ralph thinks he's got that running cleanly now.
    • We want these changes inside of v3.0.
  • Cisco tests still having some weird issue in their MTT with
    • Leave Session Attached is busted.
  • PMIx & SLURM
    • In SLURM, if you configure with the defaults, you don't get PMIx support.
    • In 3.0.x, if you launch directly, they all throw an error in MPI_Init().
    • Ralph will improve the error message when MPI_Init() can't find a PMIx server.
    • Not a blocker, but nice to have.
  • Brian is working on Release Template, and will get v3.0 RC out this week.
  • Schedule for v3.0 is still end of this month.
  • The branch for the next release will be made at the end of the Face to Face in July.
  • Folks are expected to test the RC.
    • Down the road we should make a release tarball each night, and have MTT test THAT nightly.
    • Very different in how they're built, until they call 'make dist'.

  • Mellanox was having some MTT testing issues; Artem will look at it.
    • Mellanox might be seeing it because of deprecated build status stuff.
  • Some tests run successfully but then hang at the end of output and die due to a timeout.
  • Right now, PR testing builds exactly what the person submitted in the PR.
    • But it could build AFTER a merge of the PR and test THAT.
    • IBM has seen internally that this method caught a failure before it was merged to the branch.
    • Amazon likes this approach also.
  • We have always allowed merging to master without a PR, but are trying to make it more attractive to use a PR.
  • Still test each commit to master, and also
  • ompi_scripts/Jenkins - all available, can make changes there.
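
As a rough illustration of the "build AFTER a merge of the PR" idea discussed above, here is a minimal sketch of the git steps a CI job could run before building. It is not taken from ompi_scripts or the project's Jenkins setup; the function name, the default remote name `origin`, and the example PR number are all hypothetical. It relies only on the fact that GitHub exposes each pull request's tip as the ref `pull/<number>/head`.

```python
import subprocess


def merge_then_test_commands(pr_number, target_branch, remote="origin"):
    """Return the git commands that check out the result of merging a PR
    into the current tip of its target branch (hypothetical helper)."""
    pr_ref = f"pull/{pr_number}/head"  # GitHub's ref for the PR's tip commit
    return [
        # Start from the current tip of the target branch.
        ["git", "fetch", remote, target_branch],
        ["git", "checkout", "--detach", "FETCH_HEAD"],
        # Merge the PR on top; a conflict makes git exit non-zero.
        ["git", "fetch", remote, pr_ref],
        ["git", "merge", "--no-ff", "--no-edit", "FETCH_HEAD"],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the plan only; call run_all() inside a real clone to execute it.
    for cmd in merge_then_test_commands(1234, "master"):
        print(" ".join(cmd))
```

The build and test steps would then run on the merged tree, so a PR that is fine on its own but broken against the current branch tip fails before it is merged.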

MTT Dev status:

  • Intel is pushing content somewhat regularly, but unclear how much longer.
    • Not seeing much benefit.
  • Howard - Trying to use it and trying to work on the viewer.

Exceptional topics

  • Face2Face Meeting-2017-07
    • Date: July 11-13 (9am Tuesday - noon on Thursday).
    • Cisco has booked space in Chicago.
    • Jeff will see about setting up a Web-Ex for those who are interested.
      • Please email him if you are interested in attending via Web-Ex.

Status Updates:

  • Cisco - Focused on release manager things.
  • ORNL - IBM helping with some cluster issue.

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2017 WeeklyTelcon-2017
