WeeklyTelcon_20200804

Geoffrey Paulsen edited this page Jan 19, 2021 · 2 revisions

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

Attendance was not captured accurately, so this list may not be fully correct. I put a "yes" next to the people I know were there today.

  • NOT-YET-UPDATED

Release Branches

Review v4.0.x Milestones v4.0.5

  • Still waiting on blocker (also v4.1): cache line stuff

    • Why is this a correctness issue (not just a performance optimization)?
      • We align the data in the shared-memory segment to cache line boundaries.
      • Local rank 0 sets up the ring using a 128-byte stride.
      • The other processes then discover the real cache line size of 64.
      • When those processes attach to the shared memory, they use a cache line size/alignment of 64.
      • The first message gets sent, but the 2nd message is never received (and/or corrupt data is read, because the reader looks at offset 64 instead of 128).
    • How is this not happening anywhere else?
      • Previously, the cache line size was set up very, very late (after all the shmem stuff was set up -- even for processes other than local rank 0). I.e., we got lucky.
      • At some point we moved hwloc initialization earlier, which broke this.
      • This only happens in smcuda BTL (and possibly only in single-node runs, because other BTLs/PMLs may have been selected).
      • The plain sm and vader BTLs do this differently.
      • Meaning: this is a very specific corner case.
    • Solutions?
      • Trivial fix: just have everyone use a fixed value (e.g., 128 or 64).
      • Pretty simple: modex-send the size to be used from local rank 0 to the others. The others modex recv the value and use it.
      • A little more complicated: also add code to smcuda to read the cache line size from Linux /proc or /sys.
    • There's a PR for master that does the fix -- but in a way that will kill scalability.
      • Once Brian's configury fixes are in, this is easy to fix on master.
      • Or it could be done the "a little more complicated" way, above. Neither approach is difficult.
    • For 4.0 and 4.1: George will make one-liner patch to make everyone use a fixed value.
      • This clears the blocker.
  • https://github.com/open-mpi/ompi/issues/7968: added a note to the README for v4.0: there's a known issue when using UCX with very, very old IB hardware (pre-ConnectX) -- it'll segv. According to Mellanox, UCX 1.10 will fix this issue.

Review v4.1.x Milestones v4.1.0

  • Same cache line blocker as v4.0.

  • https://github.com/open-mpi/ompi/issues/7982: OFI BTL and FI_DELIVERY_COMPLETE. This only matters for MPI one-sided.

    • EFA and other providers are misbehaving
    • https://github.com/open-mpi/ompi/pull/7973: PR for fix: Disable EFA provider
      • ...but then later discovered that other providers also misbehave in the same way.
    • AWS proposal: extend #7973 to exclude other providers that misbehave.
    • Meaning: if you're using libfabric over verbs, the OFI BTL won't be used.
      • In v4.0.x, there is no OFI BTL, so this is not an issue.
      • In v4.1 this is a minor inconvenience because we still have osc/pt2pt. I.e., OMPI will automatically fall back to osc/pt2pt.
      • This is unfortunately a big problem for master/v5.0. Need to figure this out -- i.e., coordinate with libfabric community.
      • NOTE: This is a different code path than the MPI-one-sided problem Cisco MTT discovered when we removed osc/rdma (and all MPI_WIN_CREATE operations failed).
        • Looks like Cisco MTT is still failing one-sided tests -- need to follow up with Nathan.
    • Howard asks: how can I see this problem?
      • Anything with MPI_PUT. E.g., IBM one-sided tests.
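One way to exercise this code path, sketched as a command line. This is an assumption-laden example: `./put_test` is a hypothetical binary that calls MPI_Put across two processes, and which BTL/osc components are actually available depends on how Open MPI was built.

```shell
# Force the OFI BTL (plus self for loopback) and the rdma one-sided
# component, then run any two-process test that calls MPI_Put.
mpirun -np 2 --mca btl ofi,self --mca osc rdma ./put_test
```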
  • ADAPT / HAN.

    • Need to test and produce some documentation for ADAPT and HAN.

Review v5.0.0 Milestones v5.0.0

  • No update this week other than master discussion.

Master

  • osc/pt2pt removal on master

    • George: There are many machines where osc/pt2pt is the only mechanism, and it was the most performant.
    • Brian: osc/pt2pt wasn't removed because it wasn't needed; it was removed because it's very buggy (including no good path to becoming multi-thread safe), "unrecoverably broken" (Brian's words -- and he wrote it!), and no one will take ownership of fixing it.
    • ...so if someone wants to take ownership of fixing it, they can!
  • Ralph points out:

    • AWS MTT builds for SLURM need fixes to compile against external hwloc/libevent. Brian + William will talk internally.
    • Java: builds failing from Aurelien's PR. He'll have a look.

Annual review of OMPI committers

  • It's after July, so Jeff will go de-activate people.
    • Brian will go do it today.

Virtual meeting next week

  • Agenda items for next week.
    • Talk through MPI-4 features. Howard will make a list of big-ticket MPI-4 features (from MPI-4 changelog).
      • Sessions
      • Default error handler
      • ...etc.
    • Walk through PRRTE issues.
      • Figure out: which are blockers for v5.0? (etc.)
    • With these two, we're good enough for Monday's meeting.
      • Please add any other items to the wiki.
      • We'll evaluate if we still need Tuesday's meeting.

Back to 2020 WeeklyTelcon-2020