
WeeklyTelcon_20181009

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen

  • Jeff Squyres

  • Brian

  • Edgar Gabriel

  • Howard Pritchard

  • Josh Hursey

  • Joshua Ladd?

  • Matias Cabral

  • Matthew Dosanjh

  • Nathan Hjelm

  • Ralph Castain

  • Todd Kordenbrock

  • Xin Zhao

  • mohan


not there today (I keep this for easy cut-n-paste for future notes)

  • Arm (UTK)
  • Dan Topa (LANL)
  • Thomas Naughton
  • Aravind Gopalakrishnan (Intel)
  • Akvenkatesh (nVidia)
  • Geoffroy Vallee
  • David Bernholdt
  • George
  • Peter Gottesman (Cisco)

Agenda/New Business

  • Vader - Compilers are ripping (tearing) our word-sized writes.

    • Compilers no longer guarantee that a word-sized write is an atomic write. They used to guarantee this, but no longer do.
    • The Linux kernel solved this by having a contract with gcc and llvm on exactly what volatile means, so that the ACCESS_ONCE, READ_ONCE, and WRITE_ONCE macros work.
    • The other way to solve this is custom asm to prevent writes from being ripped.
    • Our Options:
      • Require the core part of Open MPI to be compiled with a compiler that honors what gcc does with volatile (gcc v4.0 and later, last few years of icc, and llvm)
    • On master, we use C11 atomics by default. If C11 isn't available, we'll use gcc sync builtins (no atomic load/store), and finally we'd use our hand-written atomics. HOPEFULLY we can just use C11 instead of gcc sync and the hand-written ones, but:
      • C11 - has atomic load/store.
      • gcc builtins - have atomic load/store - sufficient.
      • sync builtins - don't work for us.
      • base assembly - very easy to write an atomic load/store.
    • The day we REQUIRE C11 is a glorious day. :)
    • Brian thinks we should reduce sync ops to below hand-asm priority
      • Then only enable fbox support if there is atomic load/store support
    • Do this mid-stream - if using sync built-ins today.
    • If on a platform without hand-asm, would still use sync built-ins, but disable vader-fastbox feature.
    • As we do more and more with atomics rather than locks, it's going to be harder and harder to support such a wide range of compilers.
    • Would support a "wrapper compiler" that compiles MOST things, but compiles atomic based things with core compilers (smaller list that conforms)
  • Face to Face is next week

    • Oct 16th - Brian and Nathan might come this one day.
      • Libfabric / OFI on Oct16th.
  • github suggestion on email filtering
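
A minimal sketch of the Vader atomics discussion above, not the actual Vader BTL code. It picks a word-sized load/store primitive at compile time (C11 atomics first, then gcc __atomic builtins) and only enables a fastbox-style fast path when such a primitive exists. All sketch_* names, the SKETCH_HAVE_ATOMIC_LDST macro, and the header layout (tag in the high 16 bits, size in the low 16 bits) are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Pick the strongest word-sized load/store primitive the compiler offers. */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L && \
    !defined(__STDC_NO_ATOMICS__)
#  include <stdatomic.h>
#  define SKETCH_HAVE_ATOMIC_LDST 1
typedef _Atomic uint32_t sketch_word_t;
static inline void sketch_store(sketch_word_t *p, uint32_t v)
{
    atomic_store_explicit(p, v, memory_order_release);
}
static inline uint32_t sketch_load(sketch_word_t *p)
{
    return atomic_load_explicit(p, memory_order_acquire);
}
#elif defined(__GNUC__) && defined(__ATOMIC_RELEASE)
#  define SKETCH_HAVE_ATOMIC_LDST 1
typedef uint32_t sketch_word_t;
static inline void sketch_store(sketch_word_t *p, uint32_t v)
{
    __atomic_store_n(p, v, __ATOMIC_RELEASE);
}
static inline uint32_t sketch_load(sketch_word_t *p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}
#else
/* Only __sync-style builtins: there is no atomic load/store, so plain
 * accesses are used and the fast path below stays disabled. */
#  define SKETCH_HAVE_ATOMIC_LDST 0
typedef volatile uint32_t sketch_word_t;
static inline void sketch_store(sketch_word_t *p, uint32_t v) { *p = v; }
static inline uint32_t sketch_load(sketch_word_t *p) { return *p; }
#endif

/* Hypothetical header layout: tag in the high 16 bits, size in the low 16. */
static inline uint32_t sketch_pack(uint16_t tag, uint16_t size)
{
    return ((uint32_t) tag << 16) | size;
}

/* The fast path is only allowed when header stores cannot be torn. */
static inline int sketch_fastbox_enabled(void)
{
    return SKETCH_HAVE_ATOMIC_LDST;
}

/* Sender: publish tag and size with a single word-sized store.
 * Returns -1 when the caller must use the slower path instead. */
static inline int sketch_post(sketch_word_t *hdr, uint16_t tag, uint16_t size)
{
    assert(tag != 0); /* a header value of 0 means "empty" in this sketch */
    if (!sketch_fastbox_enabled()) {
        return -1;
    }
    sketch_store(hdr, sketch_pack(tag, size));
    return 0;
}

/* Receiver: one word-sized load, so tag and size come from the same write.
 * Returns 1 if a header was present, 0 if the slot is still empty. */
static inline int sketch_poll(sketch_word_t *hdr, uint16_t *tag, uint16_t *size)
{
    uint32_t v = sketch_load(hdr);
    if (0 == v) {
        return 0;
    }
    *tag = (uint16_t) (v >> 16);
    *size = (uint16_t) (v & 0xffffu);
    return 1;
}
```

In this sketch a platform that falls through to the last branch still builds and runs, but sketch_post() always reports that the fast path is unavailable. That mirrors the proposal above to keep using sync built-ins while disabling the vader-fastbox feature.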

Minutes

Review v2.1.6 (not going to do this in the immediate future).

  • PRing Nathan's Vader BTL for fastbox to ALL release branches back to v2.1.x

  • Compilers COULD but probably won't get around this fix, so it should be good.

  • Nathan will discuss some future vader fixes later.

  • This PR is good for Release branches.

  • Vader problem is still happening on i386 and MIPSL nodes.

    • Do we want to just NOT support 32bit builds?
    • That makes our packager's lives difficult.
    • 32bit should be considered a "canary in the coalmine", and we might have other REAL issues.
    • Tested with patch, and still failing, so THIS might not be the only issue.
    • Not ready to say "drop 32bit".
    • Brian will investigate as time permits.
  • Driving a new release because it's a regression.

  • Dec 1st.

Review v3.0.x Milestones v3.0.3

  • Schedule:
  • v3.0.3 - targeting Oct 1st (start RCs when v2.1 wraps up).
    • Not important enough to do in parallel with v4.0.x

Review v3.1.x Milestones v3.1.0

  • Schedule: Dec 1st (post v4.0.x)
  • We could pull in this release.
  • Issue 5083 - ucx segfault - Geoff (IBM) will grab UCX from upstream release and verify Issue 5083 (UCX issue not OMPI issue)
    • Open PR: PMIx v2.1.4 upgrade
    • PR 4986 - if no updates in 7 days, Brian will close PR.
    • Issue 5540 issue with overlapping datatype.
      • George is working on.

v4.0.0

  • Schedule: release: End of Sept.
    • Date for first RC - Sept 11 (today)
  • PR5765 - merge as is to v4.0.x (independent)
  • --with-hwloc=external PR filed this morning.
  • Released rc4 yesterday morning.
  • Issue 5638 - 32bit fail in vader probably for all releases.
    • Fastbox thing is just an optimization.
    • We could just disable this optimization for 32bit.
      • We should do a build with fastbox disabled, and run through user's CI.
      • Then if we have a fix in time for release, then perhaps
      • Disabling fastbox, no mca parameter.
    • Brian will look into this a bit more.
    • Howard will look into adding an mca param to disable this.
  • Fastboxes - Nathan made it configurable
  • TCP sockets issue on v4.0.0 George was going to look at this.
    • SEGV when trying to print error message.
  • AWS, Mellanox and IBM need to update to use the legacy repo which uses PERL
  • End of the week

PMIx

  • Talking Thursday about bringing in dstore.

    • IBM is doing compatibility testing. Found a few issues in the tests themselves.
      • Trying to identify a delta in PMIx
  • PMIx team close to releasing the version 2 of the PMIx standard.

  • No action: Open MPI v5.x Future of Launch

    • Geoffroy Vallee sent out document with summary to core-devel.
      Everyone please read and reply.
    • ORTE/PRTE
      • We had a working group meeting to discuss launching under Open MPI v5.0
      • Summary is to throw away ORTE, and make calls directly to PMIx, and then use PRTE with an mpirun wrapper around PRTE.
    • Split this into two steps:
      1. Make PMIx a first class citizen - and call PMIx API directly.
        • When we added the opal PMIx layer, we added infrastructure, and we're talking about flipping that around, so internally Open MPI calls PMIx calls, and then other components might translate the PMIx calls to PMI1 or PMI2 or whatever else.
        • PMIx community operating as a "standard" for over a year or so now.
        • PMIx standard document is in progress.
        • Just doing this much should bring ORTE much more in line with PRTE, and make porting bug fixes between the two much less work.
      2. Packaging / Launcher.
        • PRTE is that far ahead of ORTE because it's painful to move them back.
        • Many don't want to have to download something different to launch.
      3. Will need to ponder and come to consensus at face to face.

New topics

  • MTT License discussion - MTT needs to be de-GPL-ified.

    • Main desire is python is in a repo with no GPL code (no Perl code)
    • Current status:
      • Need to make progress on this sooner rather than later.
      • Ralph will move existing MTT to new mtt-legacy repo,
        • then rip out perl from MTT repo.
      • Cisco spins up a different slurm job for each MPI build, with a single ini file. By doing it this way, it depends on many perl funclets.
      • If change to have a different ini for each different "stream", it should work okay with python. Didn't happen before Peter left.
    • Ralph is waiting for MTT users to move to MTT-legacy repo.
      • Absoft, Amazon, IBM, need to move.
  • Do we need to update the LICENSE doc?

    • No, because not planning to distribute the legacy repo.
    • There are plans to redistribute the new MTT repo.
  • MTT performance database?

    • No status for a while.
    • MTT does report this, but no one looks.
    • Howard suggests many different performance dashboards.
      • Influx DB with jenkins, and can be queried.
      • Still need to get an up to date viewer.

Review Master Pull Requests

  • didn't discuss today.
  • Ralph is setting up a virtual machine and hitting a TON of new warnings
    • Most of these are not checking return code of snprintf or asprintf.
      • There is an opal_asprintf().
  • Thought about adding CI to check for new warnings.
    • warning count delta is gross.
    • Getting warning free would be next to printf.
  • Next Face to Face
    • When? Week of Oct 16-18th
    • Where? San Jose - Cisco
    • Need Agenda items added to the face to face.
      • Issue with devel-core / mailman.
      • Discuss MPIR / PMIx debugger interfaces.
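
As an illustration of the unchecked snprintf/asprintf return codes noted above, here is a rough sketch of a wrapper that checks the result of vsnprintf() for both an encoding error and truncation. sketch_checked_snprintf() is a hypothetical helper written for these notes. It is not an Open MPI API and not how opal_asprintf() is implemented.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/*
 * Hypothetical helper: format into buf and report failure when
 * vsnprintf() fails or when the output would not fit in len bytes.
 * Returns 0 on success, -1 on encoding error or truncation.
 */
static int sketch_checked_snprintf(char *buf, size_t len, const char *fmt, ...)
{
    va_list ap;
    int rc;

    assert(NULL != buf && len > 0);

    va_start(ap, fmt);
    rc = vsnprintf(buf, len, fmt, ap);
    va_end(ap);

    if (rc < 0) {
        return -1; /* encoding error */
    }
    if ((size_t) rc >= len) {
        return -1; /* output was truncated to len - 1 characters */
    }
    return 0;
}
```

A caller would test the return value instead of discarding it, for example `if (0 != sketch_checked_snprintf(name, sizeof(name), "rank %d", rank)) { /* handle error */ }`. Whether truncation should be treated as an error is a design choice made for this sketch only.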

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

