WeeklyTelcon_20180116

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

--- Will fill out as meeting starts

  • Geoff Paulsen
  • Brian
  • David Bernholdt
  • Edgar Gabriel
  • Geoffroy Vallee
  • Jeff Squyres
  • Howard
  • Matthew Dosanjh
  • Mohan
  • Ralph
  • Todd Kordenbrock
  • Joshua Ladd
  • Josh Hursey

Agenda

Minutes

Review v2.x Milestones v2.1.2

  • Delayed until next week.
  • No one has URGENT need, but would like to get this out

Review v3.0.x Milestones v3.0

  • Schedule: RC2
    • On 3.x series trying to cut RCs on nightly tarballs.
    • Didn't get RC last week
    • Will get RC today.
  • No Blockers on v3.0.x (one we JUST merged)
  • Will Pull in PR4715
  • Will Pull in PR4716
    • Issue 4563 - not seeing this on the little ARM boxes here; Jenkins uses --disable-builtin-atomics.
  • Comm Spawn - Documentation PR ready or pulled
  • Issue 4509
    • We believe this is closed. Asked Nathan to close.

Review v3.1.x Milestones v3.1

  • SCHEDULE:
    • Will aim to get a Release Candidate out Friday.
  • BLOCKER:
    • OSC monitoring fix (doesn't build with Portals 4)
    • PMIx 2.1 PR4605
      • Ralph - there is cleanup issue with PMIx 2.1, but we have cleanup issues today
    • UCX one sided violating PR4688
    • Issue 4303
      • Probably just need to build a patch.

Review Master Master Pull Requests

  • Issue PR4686
    • Jeff tried to reproduce and failed.
    • Thought HCOLL was the issue; Artem took it out and put it back.
    • Something going on in there. Possibly atomic related.
    • Might need Nathan's attention.
    • Someone could try reverting the one change to atomics to see if that caused it.
    • Mellanox will try to reproduce after reverting atomic change. Timing issue.
  • Dynamic operations, a TON of segfaults. All in opal_progress, during ompi_sync_wait multi-credit.
    • Something is wrong with atomics. Intercomm_create or Spawn.
    • Cisco is tickling the most, and will look at.
  • PR4697 seems to have stalled.
    • Opal Progress change looks good for most interconnects.
    • TCP performance regression.
    • Pointer solution seems reasonable.
    • Mellanox will try to implement pointer.
  • Regular expression creation.
    • PR4710
    • Someone created a test and put it in make-check rather than MTT.
    • Then made the component static so that make install isn't required.
    • Don't think we should be adding tests to make-check.
    • Question - Is there a Regex library we could use? Reg-ex is hard.
    • This is working pretty well, but did add Framework to allow for future components.

Process

  • When your PR has been accepted into a release branch, please go to the issue, and remove the target of the release branch that it was just merged into. Attempting to automate this in the future.

MTT / Jenkins Testing Dev

  • New Topic - We currently can't write unit tests against components.
    • Some way to say "this unit test is against this component".
    • Intel went through and did this internally for orte. Already hosted in public domain.
      • Ralph will send link to Brian to take a look.
  • Python Client can't report back to database.

Other Discussion

Next Face-to-face

  • Probably looking at March or early April
    • San Jose or Dallas
      • Geoff will send out two Doodles for date and time.

Abandoning OpenIB BTL

  • Discuss abandoning openib btl.
    • LANL is no longer paying anyone to maintain the openib btl.
      • Nathan has a UCX BTL
    • ETA on GPU in UCX - basic minus CUDA IPC in test now.
    • Any warning message if on iWARP?
    • What's the roadmap for this? 3.x or 4.x?

Oldest PR

Oldest Issue

Next face-to-face meeting

  • Pushed date to late Feb or March.

Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2017 WeeklyTelcon-2017
