WeeklyTelcon_20170919

Geoffrey Paulsen edited this page Jan 9, 2018 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Geoff Paulsen (IBM)
  • Ralph Castain (Intel)
  • Howard
  • Brian Barrett
  • Todd Kordenbrock
  • Jeff Squyres (Cisco)
  • Joshua Hursey
  • David Bernholdt (ORNL)
  • Geoffroy Vallee (ORNL)
  • Artem (Mellanox)
  • Mohan
  • Nathan Hjelm
  • Thomas Naughton

Agenda

Review v2.0.x Milestones v2.0.4

  • Going to switch v2.0.x to Critical fixes only!
    • Only Critical fix we know of now is MAdvise fix.
  • Ask people to move to v2.1.x or v3.0.0

Review v2.x Milestones v2.1.2

  • v2.1.2
    • Allinea is still seeing intermittent issues in the v2.1.2 release candidate (Issue 3660).
    • Issue 2614
      • Targeting v2.1.x - Not sure if it affects v3.0.x or master
      • Nathan may have a fix.
      • Mark Allen can help

Review v3.0.x Milestones v3.0

  • v3.0.1 - Opened the branch for bugfixes Sep 18th.
    • Looking at mid-October.
    • SuSE reported that a particular combination of flags causes issues. Gilles is looking at it. Will take a while.
  • Nightlies have all switched.
  • orte-dvm is broken on v3.0.0
    • Could do a specific fix for v3.0.0. It already works with the latest PMIx / master, but taking that fix would require upgrading to the new PMIx and making changes through all of orted.
    • Do we want to fix in v3.0.1? It is already in master, and will be released in v3.1.

Review v3.1.x Milestones v3.1 (https://github.com/open-mpi/ompi/milestone/27)

  • Plan to branch from master September 19.
  • Gives us 6 weeks to stabilize and release before Supercomputing.
  • Schedule for NEXT v3.1 release (Branch and Ship)
    • Would like to have all features into master before we branch for v3.1
    • Ralph is working on tool connection.
      • Told the debugger community to move away from MPIR to PMIx as the standard attaching mechanism.
      • This is a first cut of that mechanism, so debuggers can start their development work.
      • Not that critical to be in an OMPI release, because they could get the functionality via other channels.
    • RML OFI component is now complete. Above 32 nodes it launches much faster. Sockets or PSM2? Should work with uGNI, PSM2, and sockets (but sockets don't get any benefit).
    • Amazon has something in review for v3.1
  • No whitelist for v3.1. v3.0.0 was transition, and no more whitelists.
  • New Features in master:
    • Mellanox added some stuff.
    • Howard added some code for tools.
  • Want to remove the Reachable framework in v3.0.0 since it's very broken and not used, and we can't backport from v3.1.x.
  • Amazon wants to put the Reachable framework back in, with PR 4225 merged into master before we branch v3.1.x.
    • Some bugs in the TCP btl code, hoping to have it USING Reachable before we branch for v3.1.x.
    • Not sure if we can fix TCP without Reachable framework.
  • Will turn around and create an RC as soon as we can.

Review Master Master Pull Requests

  • Single digit number of fails.
  • Artem will look (sometime) at out of resources in dstore
  • dynamic disconnect test needs to run with --oversubscribe (otherwise will fail).
  • argv null tests for Fortran spawn: all failing with "executable can't be found".

MTT / Jenkins Testing

  • Howard is having issues with reaching out and getting an ID from MTT. Josh isn't sure.
  • Brian tried some new ways of building the tarball, but it failed... so delayed until Thanksgiving.
  • Root filesystem on webserver failed, because Jenkins failed.
    • Jenkins reachout plugin is terrible, so having the clients reach out to the server is more stable.
    • Brian is working with Nathan's Mac. Not sure if this approach would work for Cray machines.
    • Howard would like to get this setup. Brian can send instructions.

This week Discussion Points.

  • Ralph proposed to have a bot that could scan issues, and close issues if no action in some time.
    • A bit of concern about auto-closing (losing visibility of legitimate issues)
    • Ticket shaming seems to work.
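
As a rough illustration of the bot idea above, here is a minimal sketch of the triage logic only. Everything in it is an assumption made for illustration: the Issue record, the triage_stale function, and the 90-day and 180-day thresholds were not discussed on the call, and the sketch does not talk to GitHub at all. It separates issues that would merely get a reminder (the "ticket shaming" approach) from those that would be candidates for auto-closing, which is where the visibility concern applies.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


@dataclass
class Issue:
    """Hypothetical issue record; a real bot would fill this from the tracker."""
    number: int
    last_activity: datetime


def triage_stale(
    issues: Iterable[Issue],
    now: datetime,
    remind_after: timedelta = timedelta(days=90),   # assumed threshold
    close_after: timedelta = timedelta(days=180),   # assumed threshold
) -> Tuple[List[int], List[int]]:
    """Return (issue numbers to remind about, issue numbers to consider closing)."""
    to_remind: List[int] = []
    to_close: List[int] = []
    for issue in issues:
        idle = now - issue.last_activity
        if idle >= close_after:
            to_close.append(issue.number)
        elif idle >= remind_after:
            to_remind.append(issue.number)
    return to_remind, to_close
```

Keeping the "remind" and "close" lists separate would let such a bot start with reminders only and leave actual closing to a human.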

Oldest PR

Oldest Issue

Next face-to-face meeting

  • Jan / Feb
  • Possible locations: San Jose, Portland, Albuquerque, Dallas

Status Updates:

Status Update Rotation

  1. Mellanox, Sandia, Intel
  2. LANL, Houston, IBM, Fujitsu
  3. Amazon,
  4. Cisco, ORNL, UTK, NVIDIA

Back to 2017 WeeklyTelcon-2017