
WeeklyTelcon_20180501

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • David Bernholdt
  • Geoffroy Vallee
  • Howard Pritchard
  • Joshua Ladd
  • Josh Hursey
  • Nathan Hjelm
  • Todd Kordenbrock
  • Xin Zhao

Agenda/New Business

Minutes

Review v2.x Milestones v2.1.4

  • v2.1.4 - Targeting Oct 15th.
  • Lower priority than v3.0 and v3.1.
  • No new news on v2.1.x

Review v3.0.x Milestones v3.0.2

  • Schedule:
    • Quick turnaround on this, shooting for May 1st.
  • Waiting for PMIx
  • v3.0.2 open for bugfixes.
    • Will pre-emptively fix PMIx compatibility pieces to pick up PMIx v1.2.5 clients.
      • Fixed
    • This will bring in PMIx compatibility with OMPI client (mpirun/orted/libmpi) from OMPI v2.1.3
  • memkind disable needs to get into v3.0.2; either taken care of or waiting to be taken care of.
    • DONE - merged in
  • All is good, just waiting on v3.1.0 update.

Review v3.1.x Milestones v3.1.0

  • Schedule - ASAP - but blockers keep getting filed.
    • Need to discuss with Brian.
    • Should be able to release this week after priority of UCX/OSC is reduced to zero (see blockers below).
  • Blockers
    • Still have original Issue 5083 - PR5094
      • Issue on ConnectX-3, but not on ConnectX-4
    • Suggestion: Put priority down to zero, and release with a known issue in README.
      • Mellanox okay with this for v3.1.x (and upping priority again when fixed in v3.1.1)
      • Consensus - After Xin creates a PR to put priority to zero and add an item to known issues.
      • Consensus - Since this is "removal of a component" everyone is okay without rolling an RC after this PR, and going for full release.
    • Issue 5048
      • Feature that works in v3.0 that doesn't work in v3.1
      • also broken in master.
      • Gilles confirmed this is not a problem in the v3.0.x series.
      • This is not mpirun/orte differences. Just a library.
      • Gilles' fix does not go back to PMIx v1.2
      • Any use cases for who wants this to work?
        • We made statements about compatibility
        • We're not sure what the container folks want/need.
      • Suggestion with OMPI v3.1.x you should switch to PMIx v2.x
      • Suggestion we rename this to v4.0, but then we know fall release is v5.0

v4.0.0

  • Last week voted for Geoff and Howard as release managers for v4.0. Branch mid-July, release mid-Sept?
  • Howard and Jeff were talking about a proposal that we were discussing in face to face:
    • Rename openib BTL to iWarp - because if UCX is the preferred way for Mellanox, go whole hog on Mellanox.
      • And change logic so that it only works on iWarp devices.
      • If someone (Broadcom or Chelsio?) steps up for MTT testing.
      • Jeff will get contact info to v4.0 release managers who will reach out to iWarp providers to request MTT testing.
  • New UCT BTL - UCX-based BTL. RDMA only. Low priority for RDMA. Good performance for OSC. UCX PML would still be the winner for pt2pt, and hopefully OSC soon.
  • TKR : We wanted to get rid of TKR because we thought that only the old gfortran compiler was using it, but NAG (Numerical Algorithms Group Inc) is still using it, so we'll keep TKR in for v4.0
  • UCX PML should have GPU support (and something for AMD GPU)
    • At the moment UCX h[...] David Bernholdt
      • Targeted to go into v1.3.
      • Believe UCX PML has support for GPUs, possibly not OSC.
  • What are people's thoughts about removing C++ bindings?
    • v4.0 is a good time to remove (major release)
    • IBM doesn't build C++ bindings.
    • Others don't have visibility into whether customers are building / using them.
    • Boost uses C bindings.
    • Momentum has shifted to removing this.
    • David Bernholdt can poll his users to see if anyone is depending on these language bindings.

Review Master Pull Requests

  • Last week: OSHMEM v1.4 - not sure if we have to drop the deprecated APIs, curious whether OMPI is dropping deprecated APIs...
    • Only remove things that were removed from the OSHMEM standard, not things that are deprecated: "deprecated" means it will be removed from a future version of the standard. If some APIs were removed from the standard, then ask the OSHMEM email list for their thoughts.

Other topics

  • Update on old discussion:
    • Cisco MTT SLURM dies in weird ways for both v3.0.x and v3.1.x - pretty sure bug in SLURM
    • 100% triggered by plm base verbose 100, or leave attached. std forwarding in old SLURM v14.x seems broken.
    • Cisco has now turned off these two options (together they caused 100% failures).
    • Cisco is still running MTT with SLURM v14.x and so may randomly see timeouts, but they're looking to upgrade their SLURM for MTT sometime (this summer?).
  • OMPI testing of PMIx compatibility
    • No Progress.

MTT / Jenkins Testing Dev

  • Got compiler licenses for NAG compiler, and Absoft
    • Both Fortran
  • Get copy of perl JSON, and put it on MTT.

When should we branch v4.0?

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
