
WeeklyTelcon_20170410

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Artem Polyakov
  • Brian Barrett
  • David Bernholdt
  • Geoffroy Vallee
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Ralph
  • Thomas Naughton
  • Todd Kordenbrock

Agenda

  • https://github.com/open-mpi/ompi/issues/3267 - a v2.1.1 based blocker
    • Jeff seems to remember some persistent one-sided failure.
    • The issue is still open, but it looks like the PRs have been pulled in?
    • Cisco can turn on MTT for master.
  • https://github.com/open-mpi/ompi/issues/3268
    • Artem was still seeing this, but hasn't seen it since Nathan's merge.
  • Segfault when trying to launch under a debugger, specific to v2.1.1
    • Ralph created a PR with a fix that should go into a v2.x release.
  • LoadLeveler support was removed, but the code remains. IBM approves its removal on master.
  • v3.0 Support items:
    • 64-bit
    • macOS 10.12
    • FreeBSD
    • Cisco MTT is doing -m32 builds.

MTT Dev status:


Exceptional topics

  • Git PRs: why do we merge, rather than rebase and merge?
    • Merging leaves empty (or sometimes non-empty) merge commits in the history.
    • The idea is that we merge exactly what CI tested.
    • It can be very hard to line up PRs.
    • Good to periodically audit what we're doing, and discuss.
    • The merge commit is not signed off (and gets flagged a lot in CI).
  • https://github.com/open-mpi/ompi/pull/3288
    • Ralph noticed that a bunch of OMPI_ env vars were being propagated that shouldn't be.
    • All OMPI_* variables were being propagated, but we really should be propagating only OMPI_MCA_*.
      • We do set some OMPI_UNIVERSE_SIZE type env vars.
      • This was a surprise: it was forwarding env vars that it shouldn't have been.
      • Document that users should stop relying on this.
    • We'll continue to discuss next week.
    • There are times when you need to capture something prior to calling OPAL_Init, for example to influence STDOUT.
      • These can't be MCA params, because the MCA system won't be open yet.
  • Ralph has an issue when using -btl sm.
    • Could put an abort where it can't find an endpoint, but this is in BML R2. The error message is coming from there.
    • Portion of code in end_procs: an abort will give a stack trace, and we can figure it out from there.
    • This communication removes the advantage of not doing a full modex, because an on-demand modex then happens as processes try to see who they can talk to.
      • This shouldn't be happening. Ralph will look into R2 and try to figure out who's communicating and why.
    • Ralph will give a presentation next time. It looks really good, aside from a kernel issue with KNL.
  • FYI: you will see lots of Jenkins jobs on jenkins.open-mpi.org, including lots of builder jobs. That's Brian adding things; Amazon is fiddling with the Jenkins settings.
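
To make the OMPI_* versus OMPI_MCA_* point from the pull/3288 discussion concrete, here is a minimal sketch of prefix-based filtering of an environment before forwarding it. It is an illustration only, not the Open MPI launcher code: the function name `forwardable_env` and the sample values are hypothetical, and it does not handle the internally set variables such as OMPI_UNIVERSE_SIZE mentioned above.

```python
def forwardable_env(env, prefix="OMPI_MCA_"):
    """Return only the variables whose names start with `prefix`.

    Hypothetical helper for illustration; not part of Open MPI.
    """
    return {name: value for name, value in env.items()
            if name.startswith(prefix)}


# Sample environment (made-up values) mixing MCA parameters,
# another OMPI_-prefixed variable, and an unrelated variable.
sample = {
    "OMPI_MCA_btl": "self,sm",
    "OMPI_UNIVERSE_SIZE": "4",
    "PATH": "/usr/bin",
}

# Filtering on "OMPI_" would keep two variables; filtering on
# "OMPI_MCA_" keeps only the MCA parameter.
broad = forwardable_env(sample, prefix="OMPI_")
narrow = forwardable_env(sample)
print(sorted(broad))
print(sorted(narrow))
```

With the broader `OMPI_` prefix, OMPI_UNIVERSE_SIZE is forwarded along with the MCA parameter; with the narrower default it is not, which is the behavior change being discussed.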

Status Updates:

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, IBM, Fujitsu

