WeeklyTelcon_20170207

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Artem Polyakov
  • Edgar Gabriel
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Todd Kordenbrock
  • Thomas Naughton
  • Nathan Hjelm
  • Ryan Grant

Agenda

  • Ralph put in the approved stuff this morning.
  • Still 7 PRs that need review.
  • No schedule yet.
  • Want to check that 2678 doesn't impact 1.10, but think it might.
  • [PR 2593|https://github.com/open-mpi/ompi/pull/2593] - osc_pt2pt: previous locks must complete in order; not sure what the standard requires.
    • Nathan - the code in question does something and then deadlocks. Gist of test code.
    • Nathan - low on priority list. osc_rdma succeeds (uses atomics instead of osc_pt2pt locks).
    • Nathan - not opposed to the patch, but wants to understand.
  • Nathan trying to get OMPI 2.1 to launch at scale this week.
  • PR2932 - would cause PMIx to use dstore by default in v2.1 to match master.
    • Nathan knows there is a bug in dstore. It stomps memory. v2.0.x launches okay at scale, but master does not. Possibly related to this.
    • It's just add_procs - only happens in 5%-10% of ranks. Offset was 0, data segment was 0.
    • Artem: Let's use Artem's patch to keep those ranks running so we can attach and see.
    • Nathan ran STAT, and showed where it was.
    • We all agreed we WANT dstore in OMPI v2.1, but don't want to merge this PR until we have dstore fixed.
      • Artem is actively working it.
      • 1ppn launch is pretty good, but 64ppn will crash or do bad things because of dstore overwrite.
  • Everything else is about bugging folks for reviews.
  • Yesterday we started looking at them. Some need rebasing, and many need reviews.
  • Schedule - the gating feature is PMIx, and dstore is also a blocker.

PMIx - v1.2.1

  • held up because of this dstore / shared memory ppn scaling issue.
  • Is the dstore similar enough between PMIx - v1.1.2 and v1.2.1?
    • pretty close, but cherry-picks are not clean.
  • related to the problem we are solving.
  • Hopeful to have a PMIx v1.2.1 RC rolled this week, and PR this into Open MPI v2.1 late this week, or next week.
  • Artem opened a PR to resolve the dstore problem. Would like to know what happens next with it.
    • MPI_Spawn problem
    • Sylvain was seeing MTT failures with a very similar error.
    • Jeff opened PR 2925; do we need to wait for Ralph to review?
      • Josh Hursey will review.
  • Jeff could narrow this down, because he wasn't seeing it last week.
    • Don't know if this affects Open MPI v2.1.x

  • Jeff's MTT tonight is running slower than it usually does. So far doing all master things.
    • Don't know if something went in during the past day or three that would cause a noticeable slowdown over 10,000 tests.

MTT Dev status:


Exceptional topics

  • Mac OS X - only 4 processors. Right now only building.
    • Are you doing MTT on OS X? - plan to.
    • /tmpdir issue - orte is generating a path into /var/lib/tmp - the path is too long, so have to manually set tempdir.
      • Apple's problem, but we need a workaround.
    • How long can travis take? Any reason to keep botnybay?
      • Like to see Amazon AWS come online first, before we turn off the Linux side of Travis.
  • Travis - paid service is $130 / month = $1,560 / year.
    • Just like GitHub Enterprise, there is Travis Enterprise.
      • Ask friends to see if any open source projects / SPI are using Travis Enterprise.

Status Updates:

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM, Fujitsu

Back to 2017 WeeklyTelcon-2017
