
WeeklyTelcon_20170117

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Ralph
  • Howard
  • Josh Hursey
  • Josh Ladd
  • Nathan Hjelm
  • Sylvain Jeaugey
  • Todd Kordenbrock
  • David Bernholdt
  • Geoffroy Vallee

Agenda

  • 1.10.6 will be needed.
  • Still 5 PRs that need review (Jeff and Gilles).
  • Estimated schedule: RC this week; check the issues; want the release by the end of the month.
  • From last week: want to check that 2678 doesn't impact 1.10, but think it might (-O3 optimization).
    • Ralph already merged it in.
  • Want to verify that 1654 was fixed by Nathan's pull request.
    • Nathan: we were freeing deleted VMAs in the memory path. Nathan now puts them on a list and cleans them up later, from a path that will never be called from a memory hook.
  • Closing 2666: Paul Hargrove found issues with his install (OS X); when he fixed it, the problem went away.
  • Jeff needs to review 2728.
  • PR 2730 and related PRs on 2.x.
  • Curious about the -hostfile change. Is this a change in our behavior?
    • Yes: if you use -host on 2.0 today with no -n, it will launch only 1 process.
    • With the change, it will auto-detect how many slots that host has.
    • We decided at the last face-to-face not to make changes like this in a minor release.
      • The problem that surfaced: there is a fundamental difference between -host foo and putting foo in a hostfile.
    • Decision: make -host and -hostfile behave the same.
    • Make the user specify slots when in a non-resource-managed environment.
    • If foo appears in a hostfile, we auto-detect slots.
    • Don't like the idea of changing behavior in a minor update.
    • We went with the user having to specify in non-managed cases.
      • Master does not do this today.
    • All in the notes from the last face-to-face.
    • Will hash this out at the face-to-face next week.
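The behaviors being compared above can be sketched with hypothetical command lines (node01, myhosts, a.out, and the slot counts are made-up examples, not from the minutes):

```shell
# On the 2.0 series today: -host with no -n launches exactly one process.
mpirun -host node01 ./a.out

# A bare hostname in a hostfile, by contrast, auto-detects the slot count:
echo "node01" > myhosts
mpirun -hostfile myhosts ./a.out     # one process per detected slot

# Specifying slots explicitly removes the ambiguity either way:
echo "node01 slots=4" > myhosts
mpirun -hostfile myhosts ./a.out     # exactly 4 processes
```

The proposed change would make the bare -host node01 case auto-detect slots too, so -host and -hostfile behave the same.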
  • 2724 - porting for additional signals
    • Minor change in behavior: it would be surprising if anyone is relying on us NOT forwarding a signal.
    • In 1.10 the child processes were in the same process group as the orted, so people relied on hitting an orted with a signal and all the children getting it. Now we need to trap and forward.
    • There are signals that people have come to rely on; we probably want this.
    • Agreed to merge it in.
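The trap-and-forward scheme described above can be sketched in plain shell (a toy stand-in for what the orted does in C; the child here is just a `sleep`, and the self-signaling subshell exists only to demo the forwarding):

```shell
# Toy launcher: the child is no longer in our process group, so we must
# catch signals ourselves and forward them to the child.
sleep 5 &                           # stand-in for the application process
child=$!
trap 'kill -TERM "$child"' TERM     # forward SIGTERM to the child

( sleep 1; kill -TERM $$ ) &        # demo: deliver SIGTERM to the "launcher"
while kill -0 "$child" 2>/dev/null; do
    wait "$child" 2>/dev/null || true   # wait is interrupted when the trap fires
done
echo "child $child is gone"
```

With 1.10's shared process group, the kernel delivered the signal to the children for free; after the change, the launcher needs an explicit trap per forwarded signal.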
  • Put in Memheap refactoring.
  • Otherwise, 2.0.2 will probably do one more RC (before the -hostfile / -host change). Howard.

PMIx update

  • Artem will start on it in the next few hours or so.
  • If we bring the PR in, we can begin PMIx testing with 1.2.0 and then ship with 1.2.1. Alternatively we can do it in one giant PR.
    • Does the community want it in 1 PR or 2?
      • Depends on the extra amount of work.
      • The fix we're waiting on is in a code path Open MPI should never go down... An MPI application might go down that road, but Open MPI shouldn't hit it.
      • PMIx folks are in favor of earlier testing of PMIx 1.2.0, then picking up PMIx 1.2.1 for code correctness.
    • Some reports of folks having issues on Titan. Does that come into play here?
      • No, it's a known issue, but we shouldn't react to it yet, since their use case is very different from Open MPI's.
    • Someone should file a blocker bug to pick up PMIx 1.2.1 (Josh).
    • How should we stress-test?
      • Just launching in general for now.
  • PMIx is the biggest thing.
  • Ralph has a fix that we need before 2.1; otherwise we'll have problems like on Trinity.
  • mpirun runs on a login node (different from the compute nodes).
    • PMIx sees this (a node different from the mpirun node), so all ranks send their topology strings back to mpirun. This consumes a lot of time!
    • New MCA parameter to request that no one but the first rank send its topology string.
    • It won't solve the eventual Trinity problem (compute nodes of different types).
  • What is the OMPI v2.1 schedule?
    • Let's talk about this next week. Depends on how the PMIx work goes.

  • Mellanox cluster is dying somehow. Will look into it.
  • MTT and Open SHMEM:
    • Gilles said you're just running all the OSHMEM tests and expecting them all to pass.
    • Jeff looked at the latest and sent them a patch to help us run the OSHMEM tests. This should help us run OSHMEM; if others could turn this on after we get it back, that would help.

MTT Dev status:

  • Python stuff: Ralph sees a bunch of MTT failures he needs to look at.
    • Should restart a telcon on this... haven't held one since December.
    • Will get the MTT telcon going again.
  • Got MTT emails restored.

Exceptional topics

  • Face-to-face
    • Did not expect 15 people, so the location will probably change. Jeff will send out a note.
    • Please add agenda items to the face-to-face page. The items are not in order, so just add to the bottom.
  • Anything with SPI?
    • http://www.spi-inc.org/projects/open-mpi/
    • All good to go; we're official now. Jeff and Ralph just need to close the loop on this.
    • Lobbying GitHub to change us to a non-profit; going back and forth.
    • Ralph and Jeff need to finish on-boarding with SPI.

Status Updates:

  • Mellanox: doing a lot with UCX; deploying to customer sites.
    • Had some performance analysis; dashboard-monitoring proposal for the face-to-face.
    • Something on the dev list in the last 48 hours: KNEM + Yalla on 2.0.1.
      • Idea: Mellanox HPCX Open MPI is compiled with KNEM, but KNEM is not activated on the cluster.
    • Shouldn't vader just run without it if it's not there?
      • MXM is also barking because it can't find KNEM.
      • Set the BTL vader copy mechanism to CMA, or NONE on a 2.x kernel.
      • KNEM might be a little faster than CMA.
  • Sandia
    • Before the holiday, implemented some non-contiguous atomics.
    • MTT broke; hope to fix it this weekend.
    • Intel fixed some bugs recently (DBM mostly).
      • Most of the time spent on PMIx.
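The vader workaround discussed under the Mellanox item can be expressed as an MCA setting. This sketch assumes the 2.x-series vader parameter `btl_vader_single_copy_mechanism` (the minutes only say "vader copy mechanism"), and a.out is a placeholder application:

```shell
# Fall back to CMA when the KNEM kernel module is not loaded:
mpirun --mca btl_vader_single_copy_mechanism cma ./a.out

# Or disable single-copy entirely (e.g. on 2.x kernels, which lack CMA):
mpirun --mca btl_vader_single_copy_mechanism none ./a.out
```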

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM, Fujitsu

