
WeeklyTelcon_20200512

Geoffrey Paulsen edited this page May 13, 2020 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Austen Lauria (IBM)
  • Christoph Niethammer (HLRS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Joshua Ladd (nVidia/Mellanox)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Intel)
  • Naughton III, Thomas (ORNL)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)

not there today (I keep this for easy cut-n-paste for future notes)

  • Barrett, Brian (AWS)
  • Brendan Cunningham (Intel)
  • William Zhang (AWS)
  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (nVidia/Mellanox)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Xin Zhao (nVidia/Mellanox)
  • mohan (AWS)

New

Old

MTT setup

  • If you change your MTT setup to start PRRTE at the beginning of the session and then just use prun, run times can be cut in half or more.
  • This is good, but we also need to test the mpirun wrapper.
  • Cisco is converting half of its MPI installs to use prrte/prun.
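
The session layout described above (one persistent PRRTE instance, many prun launches) can be sketched roughly as follows. This is only an illustration, not MTT's actual configuration: it assumes the `prte --daemonize`, `prun -n` and `pterm` commands from PRRTE are on the PATH, and the helper names and test binaries are hypothetical.

```python
import subprocess


def build_session_commands(tests, np=2):
    """Return the argv lists for one session: start PRRTE once,
    launch every test with prun, then shut PRRTE down."""
    commands = [["prte", "--daemonize"]]          # assumed: start the persistent daemon once
    for test in tests:
        commands.append(["prun", "-n", str(np), test])  # each test reuses the running daemon
    commands.append(["pterm"])                    # assumed: terminate the daemon at the end
    return commands


def run_session(tests, np=2):
    """Run the session; a failing test does not stop the remaining tests."""
    start, *runs, stop = build_session_commands(tests, np)
    subprocess.run(start, check=True)
    try:
        return [subprocess.run(cmd, check=False).returncode for cmd in runs]
    finally:
        subprocess.run(stop, check=False)
```

The saving comes from paying the runtime startup cost once per session instead of once per test. A session that only uses prun does not exercise the mpirun wrapper, which is why that path still needs its own tests.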

OMPI submodules

  • OMPI master submodule pointers are set up to track PMIx and PRRTE master.

Release Branches

Review v4.0.x Milestones v4.0.4

  • v4.0.4rc1 - available this last weekend (see: https://www.open-mpi.org/software/ompi/v4.0/)
    • 7616 - ABI break introduced in OMPI v4.0.3 for some f08 symbols.
    • Got feedback - It hangs on launch.
      • Mailing list devel.
    • Pull Request checker - Something is going on with SLES 12 / AWS automation
      • Blocking ALL PRs.
    • Probably do an RC2
  • Discuss if we want to take https://github.com/open-mpi/ompi/pull/7698 to v4.0.4?
    • NEWS-worthy.
    • Doesn't break backward or forward guarantees.
    • Yes, let's take it.
    • history: libevent - changed their library name to libevent_core / libevent_pthread
      • libevent is sum of libevent_core and libevent_extra.
      • libevent_core is the code OMPI uses, and libevent_extra is other functionality that OMPI doesn't use.
    • Why on v4.0.x before master?
    • v4.0.x was "complete" solution, but on master, need to split the fix up into ompi, pmix, and prrte pieces.

Review v5.0.0 Milestones v5.0.0

  • Schedule:
    • Can't fork until configure changes are in.
      • PRRTE is still chugging along.
      • Slipping to at least May 22nd.
      • Taking it week-by-week.
    • Feature Freeze: May 14 (slipped from April 30)
      • Please post a PR ASAP as a placeholder.
    • Release: End of June
    • Pandoc - got a little pushback on Open HPC
      • Not all MTT systems have pandoc - Absoft, AWS, HLRS.
  • PMIx v4.0.0 - on track
    • Schedule: Still a number of issues, but probably not blocking
  • Hwloc - Are we still going to support older 1.x ?
    • Issue: master build failure on Ubuntu, because its hwloc is too old.
    • Distros won't use the embedded hwloc 2.x
    • If Open MPI doesn't support hwloc 1.x, Open-MPI
    • What's the effort to support hwloc 1.x?
      • Coding effort. Have to build against the older version and adjust accordingly.
    • Sounds like we don't have a choice but to do it.
    • PRRTE could handle it, but not true with new binding stuff (in Ralph's branch)
    • Also HWLOC ABI break between hwloc 2.1 and 2.2.
    • Dropping support needs to happen in a major release.
    • Master MPICH dropped support for hwloc 1.x
    • A lot of supported distros still at hwloc: https://github.com/open-mpi/ompi/wiki/OMPI-Third-Party-Packages
    • Need to test both versions of hwloc.
    • Which 1.x version? 1.8 or 1.10?
  • PRRTE v2.0 -
    • Went through issues to discuss remaining issues.
    • MCA usage is very different in PRRTE than in ORTE.
      • ORTE was a "one-shot" launcher, but PRRTE is persistent.
      • When launching PRRTE you can set "defaults" for the daemon.
      • Individual pruns override these defaults via command-line parameters, not MCA parameters.
    • A lot of change.
    • Now have two MCA users in the job: OPAL and PRRTE. If you set something in the wrong one, it gets ignored, which is confusing.
    • There will be a lot of MCA param files that won't do what people expect them to do.
      • Might want to open some issues on the OMPI side to do some docs.
    • Report bindings doesn't make sense as a "default setting" in PRRTE, so it is always set on a per-job basis.
    • RC1 blockers: things to get done before RC1 (maybe 2-3 weeks?)
      • Need to get user-facing stuff done to reduce user confusion.
      • Binding reporting should be done (confusing) 523 - Needs thinking/careful work.
      • A bunch of gnarly issues in here:
        • Call tomorrow?
        • Socket -> Package name change - Should we do now or later?
          • Already a lot of change, but hwloc has already moved onto the new name.
        • Also what to do with NUMA? - Doesn't even make sense anymore on some architectures.
      • Depends on what versions of hwloc we support. Will be tricky (or more expensive) to support both hwloc 1 and 2.
      • Is there a list of distros and hwloc versions? Brian will put together list.
  • Discussing Features on google sheets document
  • Please send collective tuning data to AWS to help select new defaults.
  • Today with libevent, we default to preferring the external libevent if its version is newer (we redistribute 2.0.22).
    • Still a bunch of distros that ship a libevent not newer than 2.0.22, but works.
  • For v5.0 we're continuing down the path of NOT encouraging users to use the internal libs.
    • So probably should just use external if it's found, as long as it's newer than 2.0.21 (trusted version)
  • Issues not tracked on spreadsheet.
    • libopal isn't slurped into Open-MPI correctly (related to 7560)
      • Jeff and Brian will meet Friday
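
The libevent rule proposed above for v5.0 (use the external library if one is found and it is newer than 2.0.21) could look roughly like this. It is a sketch of the decision only, not the actual configure logic; the function names are hypothetical, and treating a missing external library as "use internal" is an assumption.

```python
import re

# Per the notes: an external libevent must be newer than this trusted version.
TRUSTED_VERSION = (2, 0, 21)


def parse_version(text):
    """Parse 'major.minor.patch', ignoring any suffix such as '-stable'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"unrecognized libevent version: {text!r}")
    return tuple(int(part) for part in match.groups())


def choose_libevent(external_version):
    """Return 'external' or 'internal' for the given external version string.

    external_version is None when no external libevent was found
    (assumption: fall back to the internal copy in that case).
    """
    if external_version is None:
        return "internal"
    if parse_version(external_version) > TRUSTED_VERSION:
        return "external"
    return "internal"
```

Comparing versions as integer tuples avoids the string-comparison trap where "2.0.9" would sort after "2.0.21".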

master

  • Hierarchical collectives

    • If someone wants to do, PMIx has much of this information already.
    • Not too hard to do, and they're much faster. Will be in next version of competitor MPI
    • Probably not for v5.0
  • SLURM PMIx plugin has been locked on PMIx v2 for some time.

    • There are some NEW PMIx calls that SHOULD be added to bring it up.
      • Ralph has started a PR, but needs help.
    • PR???
    • So for now, there's some optional info that won't be passed correctly.
      • No OMPI_INFO for now.
      • Ralph gets pinged occasionally.
    • Not sure priority of this.
  • MTT on master is looking pretty good.

Face to face

  • Deferred.

Infrastructure

  • scale-testing, PRs have to opt-into it.

Review Master Master Pull Requests

CI status


Dependencies

PMIx Update

  • CI testing only tests build and did it run, but doesn't test HOW it ran.
    • Environment setup can be a bit different.
    • For example, no permissions in /tmp: a test might pass on one machine and fail on another that lacks /tmp permissions.
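
One way a CI job could surface the /tmp difference mentioned above is to probe the temporary directory before running tests. This is a minimal sketch and not part of the PMIx CI; the function name is hypothetical.

```python
import tempfile


def tmp_is_usable(directory=None):
    """Return True if a file can actually be created in the temp directory.

    With directory=None, the platform default (usually /tmp) is probed.
    """
    try:
        target = directory or tempfile.gettempdir()
        with tempfile.TemporaryFile(dir=target):
            return True
    except OSError:
        # Covers a missing directory as well as missing permissions.
        return False
```

Creating a real file is a stronger check than inspecting permission bits, since it also catches a missing directory or a read-only mount.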

ORTE/PRRTE

MTT


Back to 2020 WeeklyTelcon-2020
