
WeeklyTelcon_20160823

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen
  • Artem Polyakov
  • Jeff Squyres
  • Brian
  • Edgar Gabriel
  • George
  • Geoffroy Vallee
  • Howard
  • Josh Hursey
  • Joshua Ladd
  • Ralph
  • Sylvain Jeaugey
  • Todd Kordenbrock

Agenda

Review 1.10

  • Milestones
  • 1.10.4
    • A few PRs to pull in, but we want folks to focus on 2.0. Once 2.0.1 is out, work on 1.10.4 might begin.
    • Ralph may need a 1.10.4.
      • Cisco's MTT shows a bunch of failures on 1.10.4; Jeff needs to know if there is an issue.
      • Driver: the sync component is missing from the collectives. This is causing problems for several customers. Trying to bump them to 2.x, but that might not be possible.
      • What problem does the sync component in collectives solve? If an application calls non-blocking collectives in a tight loop and one process starts to fall behind (typically because of extra work), we don't have flow control for that.
      • Nathan has an idea: if the unexpected message queue (per rank) gets big, send a message to the other side telling it to use a synchronous send for its next message.
        • This can deadlock if messages come in, in a bad order (in non-blocking send).
      • We do have an ACK protocol for long and sync messages. The ACK is piggybacked for rendezvous.
      • Portals MTL has something, since it has a small unexpected msg.
      • The MPI standard is not clear about what the sender side should do for non-blocking sends if running out of messages.
      • Hard to do scalably, reliably, and fast.
      • Should take this offline to wiki or email or something.
        • George will describe deadlock path.
      • coll_sync is the temporary solution?
        • George: either we force it on all the time for everybody, or we ask people to activate it by hand.
          • OR they could change the size of the eager limit, and get almost the same effect.
        • Can't set eager below match size.
        • coll_sync is a good band-aid.
        • coll_sync was present up through the 1.6 series, but then it disappeared, and they want / need it.
      • Is coll_sync on FAQ? - yes, think so.
      • May need coll_sync in 2.0.2 also.
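
The flow-control problem discussed above can be illustrated with a toy model. This is a minimal sketch, not Open MPI's coll_sync implementation: it counts how deep a slow receiver's unexpected message queue gets when a fast sender posts one eager message per step, with and without a periodic synchronization point. The function name and its parameters (`recv_every`, `barrier_every`) are hypothetical, and the "barrier" is modeled simply as the sender waiting until the receiver has drained its queue.

```python
def max_unexpected_depth(num_ops, recv_every, barrier_every=None):
    """Toy model of a fast sender and a slow receiver.

    Each step the sender posts one eager message. The receiver only
    consumes one message every `recv_every` steps. If `barrier_every`
    is set, the sender stalls after every `barrier_every` operations
    until the receiver has caught up (a stand-in for a barrier).
    Returns the largest unexpected-queue depth observed.
    """
    queue = 0       # messages that arrived before a matching receive
    max_depth = 0
    for step in range(1, num_ops + 1):
        queue += 1                      # sender posts an eager message
        if step % recv_every == 0:
            queue -= 1                  # slow receiver consumes one
        max_depth = max(max_depth, queue)
        if barrier_every and step % barrier_every == 0:
            queue = 0                   # sender waits; receiver drains
    return max_depth


# Without any synchronization the queue grows with the run length.
print(max_unexpected_depth(1000, recv_every=2))                      # 500
# A synchronization point every 100 operations bounds the queue.
print(max_unexpected_depth(1000, recv_every=2, barrier_every=100))   # 50
```

In this model the unthrottled queue depth scales with the number of operations, while the periodic synchronization caps it at a value that depends only on the interval, which is the "band-aid" effect described in the notes.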

Review 2.0.x

  • Wiki

  • 2.0.1 PRs that are reviewed and approved

  • Blocker Issues

  • Milestones

  • Paul Hargrove uncovered 3 things.

    • Need to update PMIx anyway, due to solaris issue.
    • Can drop OSX v10.6. The list of systems tested is currently OSX v10.6 - 10.10, and 10.6 can't even be run in a VM.
    • Should change test list to OSX v10.8 - 10.11. (10.12 still in beta)
    • dlopen crash, possibly specific to XLC in Patcher.
      • Nathan may not have got the XL piece correct.
      • Don't actually refer to the translation table.
    • Oracle Studio lightly tested.
  • PR1333 - hcoll datatype fixes.

  • Check AUTHORS file - now auto-generated from a spreadsheet.

    • git .mailmap - names and emails are filtered through the .mailmap file.
    • Edgar had a commit from his wife's local MacBook, so that address was put into .mailmap; wherever it shows up, it is changed to his actual email.
    • The dist directory has a make AUTHORS script to run before a release, to regenerate AUTHORS.
  • coll_sync - 2.0.1 or 2.0.2?

    • 2.0.2 - already in PR list, Ralph will set milestone.
  • Mellanox needs PMIx 2.0 in 2.1.0

    • PMIx will release a 2.0 that just has shared memory data as an addition,
      • but doesn't have everything else they were targeting for 2.0.0.
      • This should come out in early September.
      • This is the piece that Mellanox and IBM are interested in.
    • Put items requested on the wiki (e.g., PMIx direct modex, OpenSHMEM, stability improvements)
    • What do people want to see for 2.1.0?
    • Finalize the list in Dallas meeting
    • Hopefully target a Sept./Oct. release; this is not a Supercomputing goal.
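
The identity filtering mentioned under the AUTHORS item above relies on Git's mapping file (named `.mailmap` in Git's documentation). As a rough illustration of what such a mapping does, here is a minimal sketch of a simplified parser and lookup. It is not the project's AUTHORS script and does not cover every detail of Git's own matching; the function names, the sample name, and the `example.org` / `.local` addresses are all hypothetical.

```python
import re

# Matches an optional name followed by an <email> token.
_ADDR = re.compile(r"\s*([^<]*?)\s*<([^>]*)>")


def parse_mailmap(text):
    """Parse mailmap-style lines into
    (proper_name, proper_email, commit_name, commit_email) tuples."""
    entries = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        parts = _ADDR.findall(line)
        if len(parts) == 1:
            # "Proper Name <commit@email>": only the name is corrected.
            name, email = parts[0]
            entries.append((name or None, None, None, email.lower()))
        elif len(parts) == 2:
            # "[Proper Name] <proper@email> [Commit Name] <commit@email>"
            (pname, pemail), (cname, cemail) = parts
            entries.append((pname or None, pemail, cname or None, cemail.lower()))
    return entries


def canonical(entries, name, email):
    """Return the (name, email) to display for a commit identity."""
    best = None
    for pname, pemail, cname, cemail in entries:
        if cemail != email.lower():
            continue
        if cname is not None and cname.lower() != name.lower():
            continue
        # Prefer entries that match on both commit name and email.
        if cname is not None or best is None:
            best = (pname, pemail)
    if best is None:
        return name, email
    return best[0] or name, best[1] or email


# Hypothetical entry: a commit made from a borrowed laptop.
sample = "Jane Developer <jane@example.org> <jane@borrowed-laptop.local>\n"
entries = parse_mailmap(sample)
print(canonical(entries, "jane", "jane@borrowed-laptop.local"))
```

Identities that have no entry are returned unchanged, so the mapping file only needs to list the exceptions.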

Review Master MTT testing (https://mtt.open-mpi.org/)

  • Howard looks close to talking to the reporter.
    • Looks like Django CherryPy is not running during HTTP_PUT. Josh will check.
    • There is a separate, different path; send email to Josh, and Josh will check.

MTT Dev status:

  • Getting closer.
  • Josh started moving MTT server to Amazon cloud server.
    • There will probably be a transition period for the database transfer; not this week.

Website migration

  • Most of it's migrated now, other than MTT database.
  • Statistics for download numbers?
    • At the moment these are gone.
    • When did we actually flip the bits to move to HostGator?
    • 3-4 weeks ago.
    • Get numbers up until then to Edgar.
  • Google analytics only has permissions to certain directories.
    • So can't track number of downloads.
    • If we're eventually going to move downloads to S3, then we get that for free.

Open MPI Developer's Meeting

  • Date of the next face-to-face: January or February? Think about it, and discuss next week.

  • Non-Profit

    • Ralph sent an email out to the list; please comment, either pro or con.

Status Update Rotation

  1. LANL, Houston, IBM
  2. Cisco, ORNL, UTK, NVIDIA
  3. Mellanox, Sandia, Intel

