WeeklyTelcon_20160906

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Jeff Squyres
  • Artem Polyakov
  • Brad Benton
  • Geoffroy Vallee
  • George
  • Howard
  • Josh Hursey
  • Nathan Hjelm
  • ralph
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones
  • 1.10.4
    • Only potential blocker is issue with wrapper compiler.
      • mpifort is not libpath-ing rpath lib
      • when you do C builds, add rpath to all dependent libs during build.
      • static builds on 1.10
    • 1.10.4 Released!
    • Ralph will "bulk move" still open 1.10.4 PRs to 1.10.5.

Review 2.0.x

  • Wiki

  • Milestones

    • 2.0.1 is OUT
    • moving outstanding stuff to 2.0.2 or 2.1.0
    • Jeff and Howard pulled in some PRs for 2.0.2
    • coll_sync - macro had a typo in it. Works, but was wrong. Fixed.
    • Figured out bug with powerpc atomics - there is a fix.
      • option A - re-enable PGI atomics and apply a patch.
      • option B - re-write the atomics.
      • Summary- there are a small number of asm files that are handlined.
        • If there are non-inline atomics, and no asm file - fails horribly in configure
        • If there are non-inline atomics, but the asm file is stale, it fails at build time (powerpc).
      • Hjelm - is proposing to remove the asm files (as all compilers we support also support inline atomics).
        • We had a check that said "if PGI, then just use asm file"
        • We should require PGI version > 10.8 (for inline atomics).
        • Nvidia (Sylvain) agreed this was okay.
        • Paul filed bug with PGI inline assembly fix.
    • Schedule - End of October.
    • Issue 2030 - Comm Spawn is still Broken. - timeout in OPAL_PMIX_Exchange macro. Fixed in master?
      • Very hard to reproduce.
      • Race condition that's tickled by MTT, but not manually. Have seen this for years.
    • Issue 2049 -
      • Patcher issue. Can't write to page (in shared code, read only page).
      • disabling patcher framework fixes this.
      • No OpenBSD drivers, since OpenBSD puts program shared pages in read-only, Linux does not.
      • Resolved to NOT support this on OpenBSD at this time.
    • Issue 2028 - SPML Yoda not BTL 3.0 compliant
      • Blocking issue for 2.1.0!
      • Work not done for OpenSHMEM.
      • Still allocate a fragment
      • OpenSHMEM - works with Open1, and whatever MxM flavors. ???
      • Open question, who's going to fix this.
      • Artem - Mellanox is now testing yoda in their jenkins.
      • Suggest we remove the broken test from Mellanox jenkins.
        • Artem will fix now.
      • rework way callbacks are done, and for put and get, don't allocate a fragment.
        • Hjelm - can help by telling how BTL3 works.
    • Significant degradation in message rates observed on Master - Issue 1831
      • Master from 2 days ago, so it already includes all MT fixes, etc.
      • George is trying to figure out where the bandwidth / latency slowdown came in.
      • Message rate is good again, but bandwidth / latency are not yet.
      • Significantly slower for large messages on this machine, despite being configured with CMA.
      • Really strange that vader is slower than SM, since they're making the same calls. Bizarre!
      • Looks like we went from on-cache to off-cache performance.
      • Not a weird binding issue. George did more testing to ensure not a binding issue.
      • Need more people to try to reproduce this.
    • Hidden in a message that Gilles sent today. Really funny bug.
      • If you send a message on a communicator, then free that communicator, allocate a new one, and THEN receive the message on the new communicator: if the message is small enough, it goes eager and ends up in comm->frags->cannot_match.
      • Later when you create a communicator, we can match that message.
      • Because we can re-use a CID.
      • NOT hard to fix. Multiple ways.
      • doesn't happen in MPICH.
      • window is probably small. Need a distributed system that is out of sync.
      • Could split up CID into two parts.
      • Why do we always return the lowest CID? - Fragmentation would be horrible if we didn't.
      • Someone will file a bug about this. Need to think through this.
  • Ralph sent out proposed language for new Contributor agreement. Need to talk to legal departments.

    • We've always had by-laws on wiki
    • Folks should comment, so we don't iterate with legal too many times.
    • Once we've finalized, we need to have an official vote.
  • Don't know if ompi_release -> ompi transition will be done by next Tuesday.

    • Still pulling in the ready PRs.
    • need to cut a 2.0 branch from v2.x branch.
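
The CID re-use bug discussed under the 2.0.x milestones above (lowest CID is always returned, so a late eager message can match a new communicator) can be illustrated with a toy model. This is only a sketch, not Open MPI code: every name below (`toy_cid_t`, `toy_cid_alloc`, etc.) is hypothetical, and using a generation counter as the second part of a split CID is an assumption about what "split up CID into two parts" could look like, not something decided on the call.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_CIDS 8  /* toy table size, not an Open MPI limit */

/* Hypothetical two-part context ID: a re-usable slot plus a generation
 * counter that is bumped every time the slot is handed out again. */
typedef struct {
    uint16_t slot;
    uint16_t generation;
} toy_cid_t;

/* Must be zero-initialized before first use, e.g. toy_cid_table_t t = {0}; */
typedef struct {
    bool     in_use[TOY_MAX_CIDS];
    uint16_t generation[TOY_MAX_CIDS];
} toy_cid_table_t;

/* Hand out the lowest free slot (keeps the table dense).
 * Returns 0 on success, -1 if the table is full. */
static int toy_cid_alloc(toy_cid_table_t *t, toy_cid_t *out)
{
    for (int i = 0; i < TOY_MAX_CIDS; ++i) {
        if (!t->in_use[i]) {
            t->in_use[i] = true;
            t->generation[i]++;
            out->slot = (uint16_t)i;
            out->generation = t->generation[i];
            return 0;
        }
    }
    return -1;
}

/* Release a slot so the next allocation can re-use it. */
static void toy_cid_free(toy_cid_table_t *t, toy_cid_t cid)
{
    t->in_use[cid.slot] = false;
}

/* Slot-only matching: a stale fragment sent on the freed communicator
 * matches any later communicator that re-uses the same slot. */
static bool toy_match_slot_only(toy_cid_t msg, toy_cid_t comm)
{
    return msg.slot == comm.slot;
}

/* Two-part matching: the stale fragment carries the old generation,
 * so it no longer matches the re-allocated slot. */
static bool toy_match_two_part(toy_cid_t msg, toy_cid_t comm)
{
    return msg.slot == comm.slot && msg.generation == comm.generation;
}
```

In this model, allocating a CID, freeing it, and allocating again yields the same slot with a new generation. A message stamped with the first CID passes `toy_match_slot_only` against the second communicator but fails `toy_match_two_part`. The real fix would also need to handle generation wrap-around and agreement across ranks, which this sketch ignores.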

New Agenda Items:

Review Master MTT testing (https://mtt.open-mpi.org/)

  • Master has a sea of red.
  • Mellanox is pulling Yoda issue out of Jenkins.

MTT Dev status:

Website migration

Open MPI Developer's Meeting

  • Date of another face to face. January or February? Think about, and discuss next week.

Status Update Rotation

  1. LANL, Houston, IBM
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel

