WeeklyTelcon_20180710

Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision

Open MPI Weekly Telecon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Geoff Paulsen
  • Josh Hursey
  • Matias Cabral
  • Matthew Dosanjh
  • Nathan Hjelm
  • Peter Gottesman (Cisco)
  • Ralph Castain
  • Thomas Naughton
  • Akvenkatesh (nVidia)

not there today (I keep this for easy cut-n-paste for future notes)

  • Edgar Gabriel
  • Geoffroy Vallee
  • Todd Kordenbrock
  • Howard Pritchard
  • Joshua Ladd
  • Dan Topa (LANL)
  • Xin Zhao
  • David Bernholdt
  • Brian

Agenda/New Business

  • default to external for v4.0

    • should this be a blanket statement, or is there a version limit? For example, if someone has a v1.1 version of PMIx installed in a default location, do we really want to use that versus the internal v3.0?
      • if external support is found and is "compatible"... keep it fuzzy.
        • If we know it won't work, just hardcode versioning.
        • If it's compatible, but lower than internal, we'd emit a warning, but still use external one.
      • no issue with PMIx v3.0 vs 2.x, since we haven't implemented new features anyway.
      • A little worried about PMIx, since older SLURMs are still PMIx v1.x based.
      • if they're using PMIx v1.2.5, Open MPI needs to use v1.2.5, since older PMIx won't work with newer PMIx.
      • Does PMIx website have this compatibility chart somewhere? We should point to it.
      • We like checking at configure time, since it's a nice early failure, but this will have to be runtime failures.
      • configure summary at end. opal_summary_add m4.
  • Comm_Spawn issue:

    • George has raised a comm_spawn issue about inheriting MCA params https://github.com/open-mpi/ompi/issues/5376
      • Child job is launched using the same MCA params as first node. There isn't an easy way to override those, and some can't be overridden at all (other than don't use that).
      • He proposed that MCA parameters should not be inherited by child jobs. If it's set on command line, as opposed to an MCA param file, then it shouldn't be passed to child jobs.
    • for Open MPI v4.0
      • Just a couple of lines of code to NOT propagate the MCA params related to launch.
      • map-by option - can't turn off. (perhaps we need a sentinel meaning "default")
  • nVidia - update for UCX CUDA support

    • ompi_opal converter does not need any changes (verified by Akshay); the changes previously thought to be needed are not needed.
    • See more on UCX and --with-cuda below.
  • github suggestion on email filtering
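
The comm_spawn proposal above (do not propagate launch-related MCA params to child jobs, and possibly allow a sentinel meaning "default") can be sketched roughly as follows. This is only an illustration, not the actual Open MPI change: the choice of which prefixes count as "related to launch", the `"default"` sentinel value, and the function name `params_for_child` are all assumptions made for the example.

```python
from typing import Dict, Optional

# Assumption: which parameter prefixes count as "related to launch" is
# illustrative only; the minutes do not list them.
LAUNCH_RELATED_PREFIXES = ("rmaps_", "plm_", "ras_")

# Hypothetical sentinel meaning "drop the inherited value, use the default".
DEFAULT_SENTINEL = "default"


def params_for_child(parent_params: Dict[str, str],
                     overrides: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the MCA params a spawned child job would see.

    Launch-related params from the parent are not propagated. Any override
    set to DEFAULT_SENTINEL removes the inherited value; other overrides
    replace it.
    """
    child = {
        name: value
        for name, value in parent_params.items()
        if not name.startswith(LAUNCH_RELATED_PREFIXES)
    }
    for name, value in (overrides or {}).items():
        if value == DEFAULT_SENTINEL:
            child.pop(name, None)
        else:
            child[name] = value
    return child


parent = {
    "rmaps_base_mapping_policy": "node",  # launch-related, not inherited
    "btl": "tcp,self",
    "mpi_show_handle_leaks": "1",
}
child_params = params_for_child(parent)
reset_params = params_for_child(parent, {"btl": DEFAULT_SENTINEL})
```

Here `child_params` keeps `btl` and `mpi_show_handle_leaks` but not the mapping policy, and `reset_params` additionally drops `btl` so the child falls back to its default.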

Minutes

Review v2.x Milestones v2.1.4

  • v2.1.4 - Final release on v2.1.x
  • Moved the date back from Aug 31, to Aug 10th to allow more time for other releases.
  • If we pushed it OUT, it probably wouldn't happen due to schedule, and after v4.0.x it would not be too useful.

Review v3.0.x Milestones v3.0.3

  • Schedule:
    • v3.0.2 has been shipped.
  • v3.0.3 - targeting Sept 1st
    • Cisco is seeing some weirdness in v3.0 and v3.1
      • Haven't nailed it down, and haven't reported it yet. PMIx / runtime.
      • Ralph wants to see what happens when we upgrade to PMIx v2.2, but it is probably a problem in ORTE.

Review v3.1.x Milestones v3.1.0

  • v3.1.1 - Just released

  • Power9 hang in make check

  • Issue 5363 - teardown in shmem that tries to unlink a file that has already been unlinked.

  • Issue 5336 - Brian merged a PR that will print a bit more info. PMIx + libevent issue hit at Cisco.

  • PMIx v2.1.2 in Open MPI v3.1 and v3.0 - Ralph is about to release PMIx v2.1.2; does Open MPI want to embed it?

    • All just bug fixes, not much work for Ralph to update.
    • Jeff: yes, but wants to talk to Brian.

v4.0.0

  • Schedule: branch: July 15. release: Sept 17
    • Date for all MTT testing online - July 22?
    • Date for first RC - Aug 13 (after sunset of 2.1.4)
  • Targeting UCX v1.4 to support CUDA buffers.
    • May be changes in UCX PML and/or datatype converter.
    • Will have more info by next week.
  • Cuda support - cudasm, and openib
    • Still a couple of steps away from being on par in UCX regarding CUDA support.
    • Does nVidia want if --with-cuda, then openib included by default?
      • Yes, because at this moment UCX is not on par, but still want to migrate to ucx cuda.
      • Warning message will mention deprecated openib vs. UCX
  • NEWS - Deprecate MPIR message for NEWS - Ralph can help with this.
  • Sent email to ompi-packagers list with schedule and info on
  • Still at risk features for branch of v4.0 on July 15.
    • UTK ULFM - Fault Tolerant - Geoff JUST emailed George after meeting.
    • external preferences configury - Giles did libevent.
      • Ralph will update PMIx.
      • leaves hwloc - Jeff will review Giles code, and see if it can be easily translated to hwloc
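
The external-versus-internal preference discussed under Agenda (use an external install when it is "compatible", warn when it is older than the internal copy, and hardcode versions known not to work) could look roughly like the sketch below. It is not the actual configury: the version numbers, the minimum-version cutoff, and the names `choose_package` and `Choice` are hypothetical, chosen only to make the policy concrete.

```python
from typing import NamedTuple, Optional, Tuple

Version = Tuple[int, int, int]

# Hypothetical values for illustration; the real cutoffs were not decided
# in these minutes.
INTERNAL_VERSION: Version = (3, 0, 0)
MIN_COMPATIBLE: Version = (1, 2, 5)


class Choice(NamedTuple):
    use_external: bool
    warning: Optional[str]


def fmt(version: Version) -> str:
    """Render a version tuple as 'v1.2.3'."""
    return "v" + ".".join(str(part) for part in version)


def choose_package(external: Optional[Version],
                   internal: Version = INTERNAL_VERSION,
                   minimum: Version = MIN_COMPATIBLE) -> Choice:
    """Decide between an external install and the internal copy."""
    if external is None:
        # Nothing installed externally: use the internal copy silently.
        return Choice(False, None)
    if external < minimum:
        # Hardcoded "we know it won't work" case.
        return Choice(False, f"external {fmt(external)} is known not to work; "
                             f"using internal {fmt(internal)}")
    if external < internal:
        # Compatible but older than internal: warn, still use external.
        return Choice(True, f"external {fmt(external)} is older than "
                            f"internal {fmt(internal)}; using external anyway")
    return Choice(True, None)


old = choose_package((1, 1, 0))
older_but_ok = choose_package((2, 1, 2))
current = choose_package((3, 0, 0))
```

The warning strings are the kind of text that could also be fed into a configure summary; as noted in the Agenda, some of these checks would have to happen at run time rather than at configure time.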

PMIx

  • Ralph merged in some PMIx v3.0
  • Overall Runtime Discussion (talking v5.0 timeframe, 2019)
    • TODAY - Geoff Paulsen will send out doodle for next week to devel-core.

New topics

  • From last week:
    • MTT License discussion - MTT needs to be de-GPL-ified.
    • Last week Brian had an interesting proposal: take all of the Perl out, or all of the Python out?
    • Next week we'll have Brian on the call.
    • Schedule - Would like resolution by end of July.

Overall Runtime Discussion (talking v5.0 timeframe, 2019)

  • Will discuss this in a separate call 2nd week in July.
  • Two Options:
    1. Keep going on our current path, and taking updates to ORTE, etc.
    2. Shuffle our code a bit (new ompi_rte framework merged with orte_pmix framework moved down and renamed)
      • Opal used to be single process abstraction, but not as true anymore.
      • API of foo, looks pretty much like PMIx API.
        • Still have PMIx v2.0, PMI2 or other components (all retooled for new framework to use PMIx)
      • to call just call opal_foo.spawn(), etc then you get whatever component is underneath.
      • what about mpirun? Well, PRTE comes in, it's the server side of the PMIx stuff.
      • Could use their prun and wrap in a new mpirun wrapper
      • PRTE doesn't just replace ORTE. PRTE and OMPI layer don't really interact with each other, they both call the same OPAL layer (which contains PMIx, and other OPAL stuff).
        • prun has a lam-boot looking approach.
      • Build system about opal, etc. Code shuffling, retooling of components.
      • We want to leverage the work the PMIx community is doing correctly.
  • If we do this, we still need people to do runtime work over in PRTE.
    • In some ways it might be harder to get resources from management for yet another project.
    • Nice to have a componentized interface, without moving runtime to a 3rd party project.
    • Need to think about it.
  • Concerns with the work of adding ORTE PMIx integration.
  • Want to know the state of SLURM PMIx Plugin with PMIx v3.x
    • It should build, and work with v3. They only implemented about 5 interfaces, and they haven't changed.
  • A few related to OMPIx project, talking about how much to contribute to this effort.
    • How to factor in requirements of OSHMEM (who use our runtimes), and already doing things to adapt.
    • Would be nice to support both groups with a straightforward component to handle both of these.
  • Thinking about how much effort this will be, and how to manage these tasks in a timely manner.
  • Testing, will need to discuss how to best test all of this.
  • ACTION: Let's go off and reflect and discuss at next week's Web-Ex.
    • We aren't going to do this before v4.0 branches in mid-July.
    • Need to be thinking about the Schedule, action items, and owners.
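
Option 2 above describes a framework whose API looks much like the PMIx API, where callers invoke something like opal_foo.spawn() and get whichever component is underneath. A minimal sketch of that dispatch pattern follows. It is only an analogy for discussion: the class names, priorities, and string return values are hypothetical and do not reflect the real MCA component interface.

```python
from abc import ABC, abstractmethod
from typing import Iterable, List


class RuntimeComponent(ABC):
    """One backend behind the framework (all names here are hypothetical)."""

    name = "base"
    priority = 0

    def __init__(self, available: bool = True) -> None:
        self.available = available

    @abstractmethod
    def spawn(self, argv: List[str]) -> str:
        """Launch a job and return a description of what was done."""


class PMIxComponent(RuntimeComponent):
    name = "pmix"
    priority = 20

    def spawn(self, argv: List[str]) -> str:
        return "pmix spawn: " + " ".join(argv)


class PMI2Component(RuntimeComponent):
    name = "pmi2"
    priority = 10

    def spawn(self, argv: List[str]) -> str:
        return "pmi2 spawn: " + " ".join(argv)


class Framework:
    """Callers use spawn(); the selected component does the actual work."""

    def __init__(self, components: Iterable[RuntimeComponent]) -> None:
        self.components = list(components)

    def select(self) -> RuntimeComponent:
        # Pick the highest-priority component that reports itself usable.
        usable = [c for c in self.components if c.available]
        if not usable:
            raise RuntimeError("no usable runtime component")
        return max(usable, key=lambda c: c.priority)

    def spawn(self, argv: List[str]) -> str:
        return self.select().spawn(argv)


framework = Framework([PMI2Component(), PMIxComponent()])
result = framework.spawn(["./a.out"])

fallback = Framework([PMI2Component(), PMIxComponent(available=False)])
fallback_name = fallback.select().name
```

The caller never names a backend: `result` comes from the PMIx-like component because it has the higher priority, and `fallback_name` is `pmi2` when that component is unavailable.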

Review Master Pull Requests

  • Decided to file PR5200 to begin the long process of deleting osc/pt2pt (by enabling all relevant RDMA BTLs so that every transport will use osc/rdma).
  • Anything Jeff can help with on the Absoft and NAG licenses?
    • waiting.
  • Hope to have better Cisco MTT in a week or two

    • Peter is going through, and he found a few failures, some of which have been posted.
      • one-sided - Nathan's looking at it.
      • some more coming.
    • osc_pt2pt will exclude itself in an MT run.
      • One of Cisco's MTT runs uses an environment setting to turn every MPI_Init into MPI_Init_thread (even though it is a single-threaded run).
        • Now that osc_pt2pt is ineligible, many tests fail.
        • on Master, this will fix itself 'soon'
        • BLOCKER for v4.0 for this work so we'll have vader and something for osc_pt2pt.
        • Probably an issue on v3.x also.
  • OSHMEM v1.4 - cleanup work

    • and refactoring.
  • Edgar has some issues running on Omni-Path - Not able to open HFI correctly.

    • Not sure if it's OFI components.
    • Matias just updated his PR5004 and asked Jeff to review.
      • libfabric related, but probably not Edgar's issue.
  • Next Face to Face?

    • When? Late summer, early fall?
    • Where? San Jose - Cisco, Albuquerque - Sandia
    • Supercomputing is in Dallas this year in Nov.

Oldest PR

Oldest Issue


Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon
  4. Cisco, ORNL, UTK, NVIDIA

