Meeting 2015 01

Tomislav Janjusic edited this page Jan 6, 2024 · 1 revision

January 2015 OMPI Developer's Meeting

This is a standalone meeting; it is not being held in conjunction with an MPI Forum meeting.

Logistics

Doodle for choosing the date: https://doodle.com/zzaupgxge9y6medu

  • Date: 9am Tuesday, January 27 through 3pm Thursday, January 29, 2015
  • Location: Cisco Richardson facility (outside Dallas), building 4:

Cisco Building 4
2200 East President George Bush Highway
Richardson, Texas 75082-3550

Google maps link: https://goo.gl/maps/SNrbu

Attendees

Local attendees:

  • (*) Jeff Squyres - Cisco
  • (*) Howard Pritchard - Los Alamos
  • (*) Ralph Castain - Intel
  • (*) George Bosilca - U. Tennessee, Knoxville
  • (*) Dave Goodell - Cisco
  • (*) Edgar Gabriel - U. Houston
  • (*) Vish Venkatesan (not Tuesday) - Intel
  • (*) Geoff Paulsen - IBM
  • (*) Joshua Ladd - Mellanox Technologies
  • (*) Rayaz Jagani - IBM
  • (*) Dave Solt - IBM
  • (*) Perry Schmidt - IBM
  • (*) Naoyuki Shida - Fujitsu
  • (*) Shinji Sumimoto - Fujitsu
  • (*) Stan Graves - IBM
  • (*) Mark Allen - IBM
  • ...please add your name if you plan to attend...

(*) = Registered (by Jeff)

Remote attendees

  • Nathan Hjelm - Los Alamos
  • Ryan Grant - Sandia (planning to attend for the MTL and 1.9 branch discussions)

Topics still to discuss

Deferred

  • Ralph: RTE-MPI sharing of BTLs

Resolved

  • Jeff/Howard: Branch for v1.9

    • See Releasev19 wiki page
    • We need to make a list of features for v1.9.0 to see if we're ready to branch yet
  • Jeff: libtool 2.4.4 bug / libltdl may no longer be embeddable. Should we embed manually, or should we just tell people to have libltdl-devel installed?

    • Resolved: let's stop embedding; we'll always link against external libltdl.
    • However: this means people need to have the libltdl headers installed (e.g., libltdl-devel RPM). We don't care about telling developers to do this, but we are a little worried about telling users to do this, because it raises the bar for building Open MPI (the assumption is that libltdl-devel is almost certainly not installed on most user machines).
    • The question becomes: what is configure's default behavior when it can't find ltdl.h?
      1. Abort
      2. Just fall back to --disable-dlopen behavior (i.e., slurp in plugins)
    • Let's bring up the "default behavior" issue as an RFC / beer discussion.
  • Jeff/Howard: Jenkins integration with Github:

    • how do we do multiple Jenkins servers? (e.g., running at different organizations)
    • much discussion in the room. Seems like a good idea to have multiple Jenkins polling github and running their own smoke tests. Need to figure out how to have them report results. Mike Dubman/Eugene V/Dave G will go investigate how to do this.
  • Howard/George: fate of coll ML

    • see http://www.open-mpi.org/community/lists/devel/2015/01/16820.php
    • who owns it?
    • should we try to fix it or disable by default?
    • Point was raised that coll/ml is very expensive during communicator creation -- including MPI_COMM_WORLD. Should we delete coll/ml? George asked Pasha; Pasha is checking.
    • Pasha: disable it for now, ORNL will fix and re-enable
    • DONE: George opal_ignore'd the coll/ml component
  • Ralph: Scalable startup, including:

    • Current state of opal_pmix integration
    • Async modex, static endpoint support
    • Re-define the role of PML/BTL add_procs: need to move to a more lazy-based setup of peers
    • Memory footprint reduction
    • Resolved:
    • Revive sparse groups
      • Edgar checked: passes smoke test today
      • first phase: replace ompi_proc_t array with pointer array to ompi_proc_t's
        • investigate further reduction in footprint
          • very simple, 1-way static setup of group hash, currently optimized for MCW
    • remove add_procs from MPI_Init unless preconnect called
      • PML calls add_procs with 1 proc on first send to peer
        • need centralized method to check if we need to make a proc (must be thread safe)
        • may need to poll BTLs, etc. Expensive! Async? Must also be thread safe
        • still a blocking call
        • Nathan: if one-sided calls BTLs directly, then need to check/call add_procs
      • call add_procs with all procs for preconnect-all and in connect/accept, or if PML component indicates it needs to add_procs with all procs
      • need to check with MTL owners on impact to them
      • will only add_procs a peer proc at most once before it is del_proc'd
    • del_procs needs to release memory and NULL the proc entry to ensure that you get NULL when you next look for the proc
    • differentiate between "I need a proc for..."
      • communication
      • non-communication
    • need to check BTL/MTLs to see how they handle messages from peers that we don't have an ompi_proc_t for
      • need way for BTL/MTL to upcall the PML with the message so the PML can create a new ompi_proc_t, call add_proc, handle message
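The lazy setup discussed above can be sketched as follows. This is only an illustration under stated assumptions: `proc_t`, `peer_table_t`, `get_or_add_proc`, `peek_proc` and `del_proc` are hypothetical names, not Open MPI APIs, and a single mutex stands in for whatever centralized, thread-safe check the PML would actually use.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-in for ompi_proc_t; not the real structure. */
typedef struct {
    int rank;
    int add_procs_calls;   /* times the "add_procs" step ran for this peer */
} proc_t;

/* Pointer array instead of an array of full proc structures:
 * a peer we never talk to costs one NULL pointer. */
typedef struct {
    proc_t **procs;
    int size;
    pthread_mutex_t lock;
} peer_table_t;

static int peer_table_init(peer_table_t *t, int size)
{
    assert(NULL != t && size > 0);
    t->procs = calloc((size_t)size, sizeof(proc_t *));
    if (NULL == t->procs) return -1;
    t->size = size;
    return pthread_mutex_init(&t->lock, NULL);
}

/* "I need a proc for communication": called on first send to a peer.
 * Creates the proc and runs the add_procs step at most once. */
static proc_t *get_or_add_proc(peer_table_t *t, int rank)
{
    proc_t *p = NULL;
    if (rank < 0 || rank >= t->size) return NULL;
    pthread_mutex_lock(&t->lock);
    p = t->procs[rank];
    if (NULL == p) {
        p = calloc(1, sizeof(*p));
        if (NULL != p) {
            p->rank = rank;
            p->add_procs_calls = 1;  /* placeholder for the real add_procs work */
            t->procs[rank] = p;
        }
    }
    pthread_mutex_unlock(&t->lock);
    return p;
}

/* "I need a proc for non-communication": look only, never create. */
static proc_t *peek_proc(peer_table_t *t, int rank)
{
    proc_t *p;
    if (rank < 0 || rank >= t->size) return NULL;
    pthread_mutex_lock(&t->lock);
    p = t->procs[rank];
    pthread_mutex_unlock(&t->lock);
    return p;
}

/* del_procs step: release memory and NULL the entry so the next
 * lookup for this peer sees NULL. */
static void del_proc(peer_table_t *t, int rank)
{
    if (rank < 0 || rank >= t->size) return;
    pthread_mutex_lock(&t->lock);
    free(t->procs[rank]);
    t->procs[rank] = NULL;
    pthread_mutex_unlock(&t->lock);
}
```

The lookup is still a blocking call, as noted in the discussion. The sketch does not cover polling the BTLs or the upcall path for messages from unknown peers.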
  • COMM_SPLIT_TYPE PR: https://github.com/open-mpi/ompi/pull/326 -- what about IP issues?

    • Jeff added request to PR that the author mark it as released as BSD so we can properly ingest it
    • George to contact offlist to discuss enhancements
  • Edgar: extracting libnbc core from the collective component into a standalone directory such that it can be used from OMPIO and other locations

    • move the libnbc core portions into a subdirectory in ompi
    • modification to libnbc will include new read/write primitives as well as new send/recv primitives with an additional indirection level for buffer pointers.
  • Ralph: Review: v1.8 series / RM experience with Github and Jenkins and the release process

    • Ralph's feedback: lots more PRs than we used to have CMRs
    • Ralph's feedback: people seem to be relying on Jenkins for correctness, when Jenkins is really just a smoke test
    • Github fans will look at creating some helpful scripts to support MTT testing of PRs
  • Ralph: PMIx update

    • Given orally at meeting
  • Ralph: Data passing down to OPAL

    • Revising process naming scheme
    • MPI_Info
      • OPAL_info (renamed) object and typedef it at the OMPI layer
        • Dave Solt from IBM volunteered
        • Perry is going to ensure that IBM's Schedule A is up-to-date
    • Error response propagation (e.g., BTL error propagation up from OPAL into ORTE and OMPI, particularly in the presence of async progress).
      • Create opal_errhandler registration, call that function with errcode and remote process involved (if applicable) when encountering error that cannot be propagated upward (e.g., async progress thread)
        • Ralph will move the orte_event_base + progress thread down to OPAL
        • Ralph will provide opal_errhandler registration and callback mechanism
        • Ralph will integrate the pmix progress thread to the OPAL one
        • opal_event_base priority reservations:
          • error handler (top)
          • next 4 levels for BTLs
          • lowest 3 levels for ORTE/RTE
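A rough sketch of the registration and callback idea above, assuming hypothetical names throughout (`errhandler_register`, `errhandler_invoke`, the `PRI_*` constants). The eight-level layout is just one reading of the reservation list, with 0 as the highest priority. It is not the interface that was eventually implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical layout of 8 event priorities (0 = highest):
 * 1 level for the error handler, 4 for BTLs, 3 for the RTE. */
enum {
    PRI_ERROR     = 0,
    PRI_BTL_FIRST = 1,
    PRI_BTL_LAST  = 4,
    PRI_RTE_FIRST = 5,
    PRI_RTE_LAST  = 7,
    PRI_COUNT     = 8
};

/* Callback receives the error code and the remote process involved
 * (-1 if not applicable). */
typedef void (*errhandler_fn_t)(int errcode, int remote_proc, void *cbdata);

static errhandler_fn_t registered_fn = NULL;
static void *registered_cbdata = NULL;

/* Upper layers register one handler. Not thread safe in this sketch. */
static void errhandler_register(errhandler_fn_t fn, void *cbdata)
{
    registered_fn = fn;
    registered_cbdata = cbdata;
}

/* Lower layers call this when an error cannot be returned up the call
 * stack, e.g., from an async progress thread. Returns 0 if a handler
 * was called, -1 if none is registered. */
static int errhandler_invoke(int errcode, int remote_proc)
{
    if (NULL == registered_fn) return -1;
    registered_fn(errcode, remote_proc, registered_cbdata);
    return 0;
}

/* Example handler that just records the most recent error. */
typedef struct {
    int errcode;
    int remote_proc;
    int count;
} err_record_t;

static void record_error(int errcode, int remote_proc, void *cbdata)
{
    err_record_t *rec = cbdata;
    assert(NULL != rec);
    rec->errcode = errcode;
    rec->remote_proc = remote_proc;
    rec->count++;
}
```

In a real event loop the handler would presumably be delivered as an event at `PRI_ERROR` rather than called inline; the direct call here keeps the sketch self-contained.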
  • Howard: Progress on async progress

  • Nathan: --disable-smp-locks: remove this option?

    • See RFC email http://www.open-mpi.org/community/lists/devel/2015/01/16736.php
    • See, in particular, George's replies
    • In short: atomics are only used when multi-threading is enabled. But sm and vader need the smp locks.
    • However, people are discovering --disable-smp-locks, but this breaks sm/vader.
    • OMPI atomic functions:
      • CAPS versions: only enabled when opal_using_threads() is true, which is only the case after set_opal_using_thread(true) has been called, which only happens when we are MPI_THREAD_MULTIPLE
      • lower_case version: only on when --enable-smp-locks
    • George misunderstood the issue. Now he understands and agrees with Nathan: remove the --enable-smp-locks option.
  • Nathan: Performance of freelists and other common OPAL classes with OPAL_ENABLE_MULTI_THREADS==1 (as discussed in [GitHub]). Part of this is done already -- LIFO is a bit faster now (with threads), etc.

    • This is pretty much already resolved (after this item was added to the agenda) -- a fix went in on master for this, and a different fix went in for v1.8.
    • So the issue is now moot. Yay!
  • Vish: Memkind integration: see http://www.open-mpi.org/community/lists/devel/2014/11/16320.php

    • Vish has slides that he will post here.
    • We all generally agree that memkind introduces some new, desirable functionality
    • With some discussion in the room, it seems "easy" to add this functionality to MPI_ALLOC_MEM/MPI_FREE_MEM.
    • We decided that it's quite hard to know how to use this internally in the rest of the OMPI code base right now. We assume we will want to use it; we just don't know how yet (there are many variables). So let's get some experience with memkind in MPI_ALLOC_MEM first and revisit how to use this internally in the rest of the code base.
    • Here are the 4 steps we think we need to do:
      1. remove "allocator" framework use from ob1, replace it with malloc (because the use of allocator there seems to be pretty useless)
      2. create new allocator modules for things like:
        • posix_memalign
        • mmap
        • malloc
        • ...?
      3. change the mpool framework/modules to use allocator modules to get memory
      4. update MPI_Alloc_mem to:
        • lazily create allocator modules from memkind when each memkind type requested
        • make an mpool with that allocator
        • allocate memory from the mpool associated with that memkind allocator type
        • (somehow) register the memory with all other mpools (e.g., mpools in use by the BTLs)
        • MPI_FREE_MEM needs to unregister with all mpools (probably already done?)
        • MPI_FREE_MEM needs to return the memory to the right mpool
    • Nathan and Vish will coordinate to move forward on this.
    • George and Nathan are digging in to ensure that allocator is not already being used in a way that will be problematic. ob1 usage seems to be understood / ok to change. sm mpool needs to be investigated -- it uses allocator, too.
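Step 4 above (lazily creating an allocator per memory kind, and returning memory to the right owner on free) might look roughly like this. Everything here is an assumption for illustration: `mem_kind_t`, `allocator_t`, `kind_alloc` and `kind_free` are hypothetical names, both kinds are backed by plain malloc, and no memkind or mpool calls are made. The sketch is not thread safe and leaves out registration with other mpools.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical memory kinds; a real version would map to memkind types. */
typedef enum { KIND_DEFAULT = 0, KIND_HIGH_BW = 1, KIND_COUNT = 2 } mem_kind_t;

/* Hypothetical allocator module, one per kind. */
typedef struct {
    mem_kind_t kind;
    size_t live_allocations;
} allocator_t;

/* Header stored in front of each user buffer so that free can find the
 * owning allocator; the union keeps the user pointer suitably aligned. */
typedef union {
    allocator_t *owner;
    max_align_t align;
} block_header_t;

static allocator_t *allocators[KIND_COUNT];

/* Create the allocator for a kind the first time it is requested. */
static allocator_t *allocator_get(mem_kind_t kind)
{
    if ((int)kind < 0 || (int)kind >= KIND_COUNT) return NULL;
    if (NULL == allocators[kind]) {
        allocator_t *a = calloc(1, sizeof(*a));
        if (NULL == a) return NULL;
        a->kind = kind;
        allocators[kind] = a;
    }
    return allocators[kind];
}

/* Allocation path: pick (or lazily create) the allocator, then allocate. */
static void *kind_alloc(mem_kind_t kind, size_t size)
{
    allocator_t *a = allocator_get(kind);
    if (NULL == a) return NULL;
    /* Placeholder backend: a real module would call its own allocator. */
    block_header_t *h = malloc(sizeof(*h) + size);
    if (NULL == h) return NULL;
    h->owner = a;
    a->live_allocations++;
    return h + 1;
}

/* Free path: return the memory to the allocator that produced it. */
static void kind_free(void *ptr)
{
    if (NULL == ptr) return;
    block_header_t *h = (block_header_t *)ptr - 1;
    assert(h->owner->live_allocations > 0);
    h->owner->live_allocations--;
    free(h);
}
```

The per-block header is only one way to find the owner on free; a lookup keyed by address would avoid the extra bytes at the cost of a search.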
  • Fujitsu: future plans for Open MPI development

  • Ralph/Nathan: MTL overhead reduction

    • Why does Yalla perform better? I.e., why was it better to move it to a PML than just fix CM?
    • Part of it was atomics in freelists; fixed in master and v1.8
    • Also atomics in OBJ_FREE/RELEASE; fixed in master and v1.8
    • But Nathan says there are more fixes that are needed.
    • Josh: moving to PML just short-circuited some code paths, e.g., filling in descriptors.
    • Nathan: perhaps we should apply same kind of optimization that we did in ob1 -- don't create/make descriptors that we don't need to. I'll look into this in the next month or so.
    • Meta question: So what is cm for?
    • After much discussion -- it seems like CM might not be as necessary as it was before. In the case of MXM, it literally just added overhead. Hence, moving it to PML made the message rates much better.
    • Supposition: Brian basically did it for Portals 3 and 4. Portals API is a bit higher abstraction and PSM and MXM and whatnot.
    • Q: What happens if yalla job calls MPI_PUT?
      • Will use osc/pt2pt. So it's still correct.
      • Could write osc/mxm if Mellanox ever cares. Today, the functionality is correct -- but could be more optimized if osc/mxm is provided.
    • So... perhaps PSM, OFI should move up to PMLs?
      • Then should OFI and MXM write osc modules?
      • Nathan: there is potential for a lot of common code between these.
      • Perhaps put the common code in osc/base...?
    • We need to talk to the MTL authors about this before going forward. We set up time to talk to the MTL authors tomorrow at 1pm US Central. More notes after this discussion.
  • Jeff: MPI extensions (and not-yet-published MPI symbols): MPIX_ prefix, or OMPI_ prefix?

    • Just a discussion between Jeff and George.
    • Resolved to have a "rule of thumb" about naming symbols in mpi-ext:
      • If the symbol is never intended to be something outside of OMPI (e.g., OMPI_Paffinity_str), give it an "OMPI_" prefix.
      • If the symbol is intended to be standardized -- i.e., other MPI implementations may pick it up (e.g., ULFM functionality), give it an "MPIX_" prefix.
      • If the symbol has passed at least one vote at the MPI Forum (and subjectively passed it "easily"), i.e., the symbol looks like it's going to get into an official MPI standard but just hasn't done so yet, give it an "MPI_" prefix.
      • Jeff added ompi/mpiext/README.txt file with this rule of thumb.
  • Ralph: ORCM update

    • Roadmap
    • Instant On launch planning
    • See presentation below
  • MTL issues (some of which might become moot...?):

    • Review note Jeff sent out yesterday about MTL idea
      • General discussion: PSM and OFI MTL maintainers said that they would look into moving to a PML; they do not think it will be a problem. Portals maintainers need to investigate further.
      • There was some discussion and clarification about osc components (e.g., PSM and OFI might make their own osc components. There is already a portals4 osc component).
    • Intel/LANL: MTL selection issue (PSM vs. OFI)
      • Howard committed it. If most/all MTLs move to be PMLs, then Howard volunteers to do the same kind of fix for PML selection.
    • Nathan: Enhance MTL interface to include one-sided and atomics
      • All of the above seems to make the idea of adding one-sided interfaces to the MTL moot.
  • Collective switching points & MPI tuning params - what is required to change them. Had a discussion brought up by Mellanox, and we never finished this.

    • Much discussion. Random points:
      • Those benchmarking to make Open MPI look bad will always find something that they can point to that their MPI does better. So we're not trying to fix that problem (we can't).
      • What we do want to fix is to allow vendors to provide their own coll/tuned cut point text config files. But those are challenging to create.
      • LANL and UH are interested in collaborating to make the creation of these files easier (and mostly/fully automated).
      • Additionally, George is going to move all the naked algorithm routines out of the tuned coll and put them in the coll/base. This will allow others to write their own coll modules and use these naked routines without needing to duplicate them.
      • The tuned coll will therefore become much smaller, but still be functionally identical (i.e., still have the same decision tables, still allow loading config text files, etc.).
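To make the notion of cut points concrete, here is a minimal sketch of a decision table that picks a broadcast algorithm from communicator size and message size. The rule structure, the algorithm names and the threshold values are all made up for illustration; this is not the coll/tuned config file format or its actual decision logic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical algorithm identifiers. */
typedef enum { ALG_LINEAR, ALG_BINOMIAL, ALG_PIPELINE } bcast_alg_t;

/* One cut point: applies when both thresholds are met. */
typedef struct {
    int         min_comm_size;
    size_t      min_msg_bytes;
    bcast_alg_t alg;
} cut_point_t;

/* Made-up example table, ordered from least to most specific. */
static const cut_point_t example_rules[] = {
    { 4, 0,     ALG_BINOMIAL },
    { 4, 65536, ALG_PIPELINE },
};

/* Walk the table in order; the last matching rule wins.
 * Falls back to ALG_LINEAR when nothing matches. */
static bcast_alg_t choose_bcast(const cut_point_t *rules, size_t nrules,
                                int comm_size, size_t msg_bytes)
{
    bcast_alg_t alg = ALG_LINEAR;
    assert(NULL != rules || 0 == nrules);
    for (size_t i = 0; i < nrules; i++) {
        if (comm_size >= rules[i].min_comm_size &&
            msg_bytes >= rules[i].min_msg_bytes) {
            alg = rules[i].alg;
        }
    }
    return alg;
}
```

A vendor-supplied table would replace `example_rules`, while the algorithm routines themselves could stay shared, which is the split the move to coll/base is meant to enable.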
  • All: Progress on thread-multiple support

    • Refresher on current status:
      • Must configure with --enable-mpi-thread-multiple
      • there is a question as to whether the openib BTL is thread safe or not -- no one seems to know.
      • usnic BTL is (probably) not thread safe yet
      • all other MPI objects are (nominally) thread safe -- but really need to be tested.
      • collectives are (nominally) thread safe -- but really need to be stress tested
      • Dave points out that we should run an MTT with --enable-mpi-thread-multiple and setenv OMPI_MPI_THREAD_LEVEL to effect MPI_THREAD_MULTIPLE. Cisco and LANL will do this in MTT.
      • There are only a handful of THREAD_MULTIPLE tests in our repo. Mellanox will investigate if they can share more.
    • George thinks that his pending MPI_TEST/WAIT fixes will go a long way to fixing THREAD_MULTIPLE performance problems. Will need to re-evaluate THREAD_MULTIPLE performance once these have been applied to see where we are.

Presentation Material
