
WeeklyTelcon_20170228


Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Josh Hursey
  • Ralph Castain
  • Joshua Ladd
  • Nathan Hjelm
  • Thomas Naughton
  • Todd Kordenbrock

Agenda

  • No plans for a v1.10.7.
  • PMIx 1.2.1
    • PMIx 1.2.1 is in!
    • A commit was missed; PMIx can fix it in the PMIx tree.
    • Open MPI doesn't want to wait on a PMIx rev.
    • Issue 3048: Josh can create a patch for this one fix for Open MPI v2.1 to ship with.
    • Make a README down in the component so it's clear it is PMIx 1.2.1 + that commit.
  • v2.1 RC1 went out last Sunday. Will do RC2 later today.
    • Some NEWS updates. Developers, please go read NEWS.
    • Please go read the README.
  • Confusion about the README's backwards-compatibility statement.
    • Static linking and containers are affected; developers' input is wanted on this.
  • Nathan thinks he can test Open MPI master with both PMIx 1.2.1 and PMIx master.
    • Nathan is still concerned, still sees lots of scaling issues.
    • Not going to hold up v2.1.0 for BSD fix.
  • PMIx verbiage for Open MPI v2.1.0: for jobs launched with PMIx, the memory footprint is reduced, but the launch-time problem hasn't been fixed.
    • No memory improvements on Alps or older versions of SLURM (which do not use PMIx).
  • At least one PR: 3012.
    • Bug fix to scale at O(nodes) rather than O(procs).
    • This is a specific fix, but it's not complete, in that there are similar loops in other places in the code.
    • This fix did not go into master.
    • This is the most obvious place for this O(procs) loop, but there are others.
    • Need to commit PR 3011 and let it run overnight; then we can do PR 3012.
    • Ralph: additional work is needed to find all the places where we loop over all procs just to find the local procs. It would be better to come up with another way to iterate over only the local procs, rather than O(all procs). (A minimal sketch of the two patterns follows this agenda list.)
  • OSHMEM has a work-in-progress allocator. Not super critical; just awaiting review of PR 2717.
    • Can Mellanox have until the end of the week to get a review?
    • Jeff and Howard will discuss.
  • Pushed configure issue to v2.1.1.
  • What about issue of significant performance degradation on v2.x?
    • Seems fine.
  • Jeff revved the shared library versions by 10. See the commit log on VERSION.
  • Does it make sense to shoot for a June 15th v3.0 rather than a v2.2?
    • Yes, this still makes sense.
  • Can we branch v3.x soon?
    • Still have a few white-listed features:
      • Hook framework - can go in before the branch.
      • Removing the internal hwloc component.
      • PMIx 2.0.0 - April is more realistic.
    • After that, bugfixes only.
  • What about back-end mappers? Nathan was going to test the RR2 mapper.
    • A couple hundred nodes helps tell whether this significantly reduces launch time.
    • Takes a 40 MB launch message down to 15 MB, then compresses that to ~600 bytes. Mapping is pushed to the back end, so mpirun just sends a regex of the node list (see the toy sketch after this list).
  • Howard wanted to ask about the DVM: is it supported in v3.0?
    • Yes, it is well tested and stable in master.
  • v3.0.0 Schedule
    • Aim for branching next Tuesday after meeting.
    • If Howard and Brian have a philosophy, they should get together and present it to the community.
  • List of committed features for v3.0?
    • What's on master now, plus the white-listed features.
  • Testing focus is on v2.1 now; then transition to testing v3.0.0.
  • We support what we test; however, we're still not ripping out support for things we're not testing, just clearly delineating levels of testing / support.
  • Nathan's been working on a customized libev, but is still seeing some threading issues. It needs some time to develop. Out for v3.0, but has potential for the future.
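
For the O(procs) vs. O(local procs) discussion above: a minimal, generic sketch of the two iteration patterns. The proc_t descriptor and local index array are hypothetical illustrations, not the actual ORTE/OPAL data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical proc descriptor -- not the real ORTE/OPAL structures. */
typedef struct {
    int  rank;
    bool is_local;    /* does this proc live on the current node? */
} proc_t;

/* O(total procs): scan every proc in the job just to find the few local
 * ones.  This is the kind of loop the fix is trying to eliminate from the
 * launch path. */
static void visit_local_slow(const proc_t *procs, size_t nprocs)
{
    for (size_t i = 0; i < nprocs; ++i) {
        if (procs[i].is_local) {
            printf("local rank %d\n", procs[i].rank);
        }
    }
}

/* O(local procs): keep a per-node index of local procs (built once when
 * the mapping is computed) and iterate only over that. */
static void visit_local_fast(const proc_t *procs, const size_t *local_idx,
                             size_t nlocal)
{
    for (size_t i = 0; i < nlocal; ++i) {
        printf("local rank %d\n", procs[local_idx[i]].rank);
    }
}

int main(void)
{
    proc_t procs[] = { {0, true}, {1, false}, {2, false}, {3, true} };
    size_t local_idx[] = { 0, 3 };

    visit_local_slow(procs, 4);
    visit_local_fast(procs, local_idx, 2);
    return 0;
}
```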
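
For the regex-based launch message mentioned above: a toy sketch, not the actual ORTE regex code, showing how a run of consecutive, zero-padded hostnames can collapse into one short range expression. This is the kind of reduction that lets mpirun send a few hundred bytes instead of a full node/proc map, with each daemon expanding the expression and computing its own mapping locally.

```c
#include <stddef.h>
#include <stdio.h>

/* Toy version of the idea: collapse a run of consecutive, zero-padded
 * hostnames (node001..node128) into a single range expression.  The real
 * component handles arbitrary prefixes, multiple ranges, padding widths,
 * and so on. */
static void compress_run(const char *prefix, int first, int last,
                         char *out, size_t outlen)
{
    if (first == last) {
        snprintf(out, outlen, "%s%03d", prefix, first);
    } else {
        snprintf(out, outlen, "%s[%03d-%03d]", prefix, first, last);
    }
}

int main(void)
{
    char expr[64];

    /* 128 hostnames shrink to a 13-character expression. */
    compress_run("node", 1, 128, expr, sizeof(expr));
    printf("%s\n", expr);   /* prints: node[001-128] */
    return 0;
}
```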

MTT Dev status:


Exceptional topics

  • We should begin thinking about scheduling our next face to face.

Status Updates:

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM, Fujitsu

Back to 2017 WeeklyTelcon-2017
