WeeklyTelcon_20190108

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Geoff Paulsen
  • Jeff Squyres
  • Todd Kordenbrock
  • Edgar Gabriel
  • Howard Pritchard
  • Josh Hursey
  • Joshua Ladd
  • Ralph Castain
  • Xin Zhao
  • Aravind Gopalakrishnan (Intel)
  • Brian Barrett

Not there today (I keep this list for easy cut-and-paste into future notes):

  • Nathan Hjelm
  • Dan Topa (LANL)
  • Thomas Naughton
  • Matias Cabral
  • Akshay Venkatesh (nVidia)
  • David Bernholdt
  • Geoffroy Vallee
  • Matthew Dosanjh
  • Arm (UTK)
  • George
  • Peter Gottesman (Cisco)
  • Mohan

Agenda/New Business

  • Summary of PMIx re-architecting for v5.0

  • Lots of TCP wire-up discussion

  • Session work is complete (Nathan and Howard worked on it)

    • Or check the archives of the MPI Sessions working group.

    • Works with MPI_Init.

    • Involved a lot of cleanup for setup and shutdown.

    • Can keep it as a prototype, or put it in without the headers.

    • For MPI_Init/MPI_Finalize-only apps, fully backward compatible.

      • Such apps simply initialize a "default" Session.
    • Asking about adding this to master in mid-January

    • Part of the cleanup is to make shutdown happen in the reverse order of setup.

    • Cleanup sounds good: well contained, a set of patches.

      • Calling it "instances" inside of MPI, but we'll be renaming it if/when MPI standardizes sessions.
    • Summary: let's take the cleanup patches and review them.

      • The sessions work itself needs a closer look.
      • We can discuss sessions bindings in the future.
    • Session init is all local, so timing should still be good (see the sketch after this list).
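
A minimal sketch of sessions-style code, assuming the names from the MPI Sessions working-group proposal (these later became the MPI-4 API; the prototype discussed here may have spelled them differently):

```c
/* Sketch only: sessions-style initialization per the MPI Sessions
 * working-group proposal.  Names assume the eventual MPI-4 spellings. */
#include <mpi.h>
#include <stdio.h>

int main(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;
    int rank;

    /* Session init is purely local -- no global wire-up happens here. */
    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    /* Build a group from a named process set, then a communicator from
     * the group -- the "create communicators from groups" step. */
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "example.tag", MPI_INFO_NULL,
                               MPI_ERRORS_RETURN, &comm);

    MPI_Comm_rank(comm, &rank);
    printf("rank %d initialized via a session\n", rank);

    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
    return 0;
}
```

An MPI_Init/MPI_Finalize-only app never sees any of this; the library can serve it through the internal "default" Session mentioned above.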

  • Opal PR6136 - Nathan did some Opal cleanup prep for Sessions

    • It's failing on ARM, and on PPC64LE with XL, due to a missing symbol.
    • Next bring in ompi cleanup.
    • Then create communicators from groups.
    • And need to bring in PMIx v4.x updates.
    • Can use the embedded PMIx for the sessions stuff (and have been).
    • Have to use prterun.
    • Is this going to require PMIx 4.x even for non-session MPI apps?
      • No, because fake sessions can use PMIx 3.x functionality.
      • It checks whether the group-create interface is available, and falls back if not (a sketch of such a check follows this list).
      • prterun / the prte server can support different PMIx versions at build time.
      • New MPI session based
    • Do we want to update mpirun inside of prte to know about sessions before prterun?
      • No, introduce sessions in prterun first, and follow with mpirun
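
A hedged sketch of what the availability check could look like at runtime; the real code may instead gate this at configure/build time, as noted above. PMIx_Group_construct is the PMIx 4.x group API; the helper name here is illustrative:

```c
/* Sketch only: probe for the PMIx 4.x group-construct entry point at
 * runtime; if it is absent (e.g. under PMIx 3.x), take the fake-sessions
 * fallback path instead. */
#define _GNU_SOURCE   /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>
#include <stdbool.h>

static bool have_pmix_group_create(void)
{
    /* Resolves against whatever PMIx library is already loaded. */
    return NULL != dlsym(RTLD_DEFAULT, "PMIx_Group_construct");
}
```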
  • New Work: We need to contextify Opal.

    • So that we can just have one libopal, and different systems can use the same one.
    • PRTE uses the MCA system, but didn't rename everything the way PMIx did.
    • One possibility could be to split it / rename it, and move on.
      • two problems: atomics, and mca system itself.
    • Other possibility is to contextify, so that various projects can pass in a context and behave nicely together.
    • Contextify - would need to look at all global variables and the variable systems; everything would have to be evaluated against a context. Would need a system where clients could register variables, etc. (a hypothetical sketch follows this list).
      • First step would be to determine what needs to be contextified versus shared.
      • Then you could have multiple instances that play together in the same process.
    • Bundling is also somewhat of an issue: we assume most users grab everything, but some users deconstruct the build.
    • Building libopal as a separate project, you end up with a configure script per project (unavoidable), and longer configure times.
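
To make "contextify" concrete, here is a purely hypothetical sketch; none of these names exist in OPAL, they only illustrate the idea of per-consumer state:

```c
/* Hypothetical sketch of a contextified OPAL: instead of process-global
 * state, each consumer (OMPI, PRTE, PMIx) holds its own context. */
#include <stdlib.h>
#include <string.h>

typedef struct opal_context {
    char *project;   /* owning project, e.g. "ompi" or "prte" */
    void *mca_vars;  /* per-context MCA variable registry */
} opal_context_t;

/* Each consumer creates its own context rather than sharing globals, so
 * OMPI and PRTE instances can coexist in one process. */
opal_context_t *opal_context_create(const char *project)
{
    opal_context_t *ctx = calloc(1, sizeof(*ctx));
    if (NULL != ctx) {
        ctx->project = strdup(project);
    }
    return ctx;
}

/* Clients would register variables against a context instead of a global
 * table, and every OPAL entry point would take a context argument. */
int opal_context_var_register(opal_context_t *ctx, const char *name,
                              void *storage);
```

The audit flagged above is the hard part: deciding which of today's globals move into the context and which (atomics, for example) can stay shared.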
  • Really two problems:

    1. Opal Problem
    2. Deconstruction Problem
  • Compatibility matrix becomes more complicated to define.

    • A bit of a maintenance headache, but perhaps easiest...
      • PRTE doesn't use much of OPAL. Perhaps the best answer is to sever it, rename it, and let it diverge; hand-merge MCA and atomics changes into PRTE.
  • Now we're at the point where we're shipping:

    • an MPI library, with a portability layer
    • a runtime, with a portability layer
    • an OPAL with a portability layer. Doesn't sound like an awesome story.
  • At one point, we were going to have a configure script per project; no one liked it because configure times were slow, but maybe that's okay today.

  • Git submodules might also help here.

  • At Amazon, they have a project where, whenever a commit gets pushed to a submodule, Jenkins creates and publishes a PR to the master project.

    • DON'T try to have both one big configure script when we build OMPI and a separate OPAL-only configure script.
  • What about versioning information?

    • Some candidate solutions don't work for static builds.
    • Trending towards a solution that will require versioning of OPAL.
  • Need to come up with a solution; we're getting to the point where we should do something, but we need direction.

    • Whatever solution we come up with might work for ALL of our embedded projects. If we're going to do something, it would be nice to keep it consistent.
  • Submodules are not too bad; people are using them more.

    • We could catch problems with CI if we allow changes via PRs only.
  • Need to discuss how our branches track submodule releases (their master, or their versions)

  • GitHub suggestion on email filtering

Minutes

Review v2.1.6

  • Schedule: posted a v2.1.6 rc1 (Nov 28th); no problems since then, but delayed for the holidays and a good round of MTT.
  • Driver: assembly and locking fixes, vader and PMIx, etc.
    • We think the atomic fixes didn't matter for PMIx in 2.1.x.
  • Should release by end of the week assuming good MTT nightly runs.
  • Uses the OLD release process, so not hindered by the AWS / Jenkins issue (see v4.0.x)

Review v3.0.x Milestones v3.0.3

  • Scheduled v3.0.4 for May 2019
  • May be able to pull this date IN
  • Will merge in PRs this afternoon.

Review v3.1.x Milestones v3.1.0

  • Scheduled v3.1.4 for April 2019
  • May be able to pull this date IN
  • Will merge in PRs this afternoon.
  • Brian will reply to the question on GitHub.

Review v4.0.x Milestones v4.0.1

  • Schedule: need a quick turnaround for a v4.0.1
  • v4.0.0 - a few major issues:
    1. mpi.h is correct, but the library is not building the removed and deprecated functions because they're missing from the Makefile.
    2. Two issues hit via Spack packaging:
      • root cause may be: make -j creates TOO many threads of parallel execution on some OSes.
      • Max filename restriction on Fortran header files.
        • PR6121 on master - should resolve this on v4.0.x??
    3. Manpage generation is Perl; Jeff runs it on a Mac, maybe with some other magic. Needs rman.
  • Discuss pulling PR 6110 into v4.0.1
    • Bug, some OSHMEM APIs missed in v4.0.0
    • Jeff pulled up slides showing that we can ADD APIs in minor versions.
      • Executables built against the old library must be able to run with the newer one.
      • We need to verify the patch doesn't break anything for executables built against the old library.
    • Because this PR is just adding functions, it should be okay (see the sketch after this list).
    • Mellanox volunteered to test: build against the old OMPI and run with the newer one.
    • If that test passes, everyone is okay with pulling this in.
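
To make the compatibility test concrete, a sketch of the kind of toy program involved: built against v4.0.0, then run against the patched library. Since it references only long-standing OpenSHMEM symbols, newly added APIs cannot break it (the specific APIs PR 6110 adds are not shown; this exact program is an assumption):

```c
/* Toy OSHMEM program for the forward-compatibility check: build against
 * the old v4.0.0 library, then run against the newer one.  It only
 * references symbols that existed before the patch. */
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();
    printf("PE %d of %d\n", shmem_my_pe(), shmem_n_pes());
    shmem_finalize();
    return 0;
}
```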
  • UCX priority PR - expecting a PR from master
  • Matias Cabral's fix for local procs with the OFI MTL - the master PR is okay; it will come back to v4.0.x (PR 6106)
  • Two rankfile mapper issues reported on the mailing list. Howard will file an issue.
  • Need to create v4.0.x issues for https://www.mail-archive.com/[email protected]/msg32847.html
    • @siegmargross

Master

  • Issue 6242 -
  • Issue 6228 - Open MPI v4.0.2 would like PMIx 3.1.0 (still unreleased)
  • PR 6191 - Aravind - asked Brian and Howard to take a look.
  • Opal Issue - One version embedded in Open MPI, and another in PRTE.
    • How do we manage that overlapping code?
    • similar to libevent, and hwloc (prte, pmix, and ompi)
    • Already affecting us: if you want an external PMIx, you have to use an external libevent and hwloc.
    • We have a decision to make in the near future about libopal. It's used by other packages; we need to figure out a way out of this.
    • Brian is writing a doc on an approach
    • Some discussion.
  • Libtool issue came up before or during supercomputing.
    • Went around with many options - ultimately we will need to version all the .so's (see the illustration after this list).
      • Need to explicitly version on each release branch going forward.
      • WON'T make the OPALs on various release branches compatible with each other.
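
An illustration of what per-branch versioning buys (the soname below is made up for the example): a binary records the soname it linked against, so the dynamic loader refuses an OPAL from a different branch instead of silently mixing ABIs.

```c
/* Illustration only -- the soname is hypothetical.  A consumer built
 * against one branch's OPAL asks the loader for exactly that ABI and
 * fails cleanly if a different branch's library is installed.
 * Link with -ldl on Linux. */
#include <dlfcn.h>
#include <stdio.h>

int main(void)
{
    void *handle = dlopen("libopen-pal.so.40", RTLD_NOW);
    if (NULL == handle) {
        fprintf(stderr, "no matching OPAL ABI: %s\n", dlerror());
        return 1;
    }
    dlclose(handle);
    return 0;
}
```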
  • Amazon AWS / Jenkins is still crippled
    • Jenkins broke the EC2 plugin; there is a fix for the plugin, but it has not been released.
    • Scope of how this affects Open MPI Projects:
      • The release build process is broken.
      • Only about 10% of CI tests are running right now.
    • Status: we're currently stuck waiting on this EC2 fix.

PMIx

  • Ralph worked a lot on PMIx Tools interface, and documenting it for standard.
    • Ralph should have 3 new chapters of PMIx v4 standard document in a few weeks.
    • Ralph will send email to PMIx announce list.
    • PMIx groups, PMIx tools, and PMIx fabric.
  • Will release PMIx v3.1.0 in the next week or two for Open MPI v4.0.x

MTT

  • IBM test configure should have caused that.
  • Cisco has a one-sided info check that failed a hundred times.
    • The Cisco install failure looks like a legitimate compile failure (IPv6, master).

New topics

  • We have a new open-mpi Slack channel for Open MPI developers.
    • Not for users, just developers...
    • Email Jeff if you're interested in being added.

Review Master Pull Requests

  • Didn't discuss today.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2018 WeeklyTelcon-2018
