Releasev2x

Howard Pritchard edited this page Nov 1, 2016 · 51 revisions

Notes for the v2.X series

Estimated timeline

  • June '15 master branched to 2.0.0
  • July '16 2.0.0 is released (yes 1 year and 1 month later!)
  • September '16 2.0.1 release
  • September 30, feature freeze for 2.1.0
  • Late Oct./early November '16 2.1.0 release (driven by OpenSHMEM 1.3 compliance, PMIx 2.0 embedded)
  • Early '17 2.2.0 release

Must-have Features for 2.0.X

  • Thread safety (MPI_THREAD_MULTIPLE) support
    • need to verify which BTLs are thread safe (via testing vs stating) (DONE)
    • need more testing (non-blocking collectives, one-sided, MPI I/O, etc.)
    • need to document what is not thread safe (DONE)
    • performance improvements when using MPI_THREAD_MULTIPLE (i.e., TEST/WAIT improvements) - may wait for a publication before committing (DONE)
  • MPI-3.1 Compliance
    • Ticket #349 (MPI_Aint_add) - 2.0.X candidate (DONE)
    • Ticket #369 (same_disp_info key for MPI_Win_create) - 2.0.X candidate, maybe (DONE)
    • Ticket #273 (non blocking coll I/O, non trivial). This is dependent on moving libnbc core out of libnbc component. (DONE)
    • Ticket #404 (MPI_Aint_diff) - 2.0.X candidate (DONE)
    • Ticket #357 (MPI_Initialized, MPI_Query_thread, MPI_Thread_is_main) always thread safe (probably just verify with a test to see this is true now for OMPI thread models) (DONE)
  • MPI-3 Errata Items
  • Coverity cleanup (IN PROGRESS, down to ~260) (never ending)
  • Scalable startup work (smarter add_proc in the OB1 PML), needs more work (DONE)
    • Sparse groups (DONE)
    • Additional PMIx features (issue 394)
  • ROMIO refresh - need to be using a released ROMIO package (DONE)
  • Fix Java bindings garbage collection issues (DONE)
  • Hwloc 1.11.3 final (DONE)
  • CUDA extension (to add MPIX_CUDA_IS_AWESOME to mpi.h) and MPI_T Cvar for run-time query of whether CUDA is supported in this OMPI (DONE)
  • Add MPI 3 features to Java bindings (DONE)

Must-have Features for the 2.1 Series

  • PMIx 2.0 integration (Intel, MLNX) - partially in master (Issue #2072); will require porting to the v2.x branch. This may not happen; PMIx 1.2 may be used instead (PR to bring PMIx v1.2 into OMPI v2.1)
  • OpenSHMEM 1.3 compliance (MLNX) - issue #2109 tracking multiple PRs (DONE)
  • mpool/rcache rewrite (LANL) - already in master (DONE)
  • HCOLL datatypes (MLNX) - already in master (DONE)
  • Better CMA support detection / make CMA the default in vader (LANL) - [PR #2216](https://github.com/open-mpi/ompi/pull/2216)
  • BTL/OpenIB across different subnets PR #1043 (DONE)
  • usNIC BTL thread safety PR #1326 and libfabric 1.4 support (DONE)
  • Lustre performance PR #987 (DONE)
  • mpirun command line options enhancements PR #1317 (DONE)
  • pending PRs (Nathan's free list work) (DONE)
  • Multi-rail performance in OB1? What happened? (WONT FIX)
  • Support for thread-based asynchronous progress for BTLs (DONE)
  • Improved story on out-of-the-box performance, particularly for collectives. Ideally some kind of auto-tune type of mechanism. (otopo project) (kind of DONE, not otopo)

Desirable-to-have Features for 2.1 (but may go in to a 2.2)

  • Awesome bug fixes from IBM
  • direct AMO support in OSC/RDMA (compartmentalized) - partially in master (DONE)
  • TCP latency went up and bandwidth (rendezvous) went way down. What happened? Maybe in 2.x series sometime... (AWS)
  • better CUDA support (UTK/Nvidia working on this; BTL- and ob1-specific, with a follow-on for collectives)
  • MPI I/O support MPI_THREAD_MULTIPLE (IBM, UH)
  • New coll tuned component (UTK)
  • Missing MPI I/O routines in ROMIO: MPI_File_iread_all, etc. (relies on generalized request MPICH-specific extensions) (IBM, UH)
  • Configury enhancements for modified library names, sonames, etc. (IBM) (DONE) PR #2139

Desirable-to-have Features for 2.1/2.2 (but may go in to a future release)

  • Rationalized configuration for Cray XE/XC (DONE)
  • platform file for using OFI MTL on Cray XC/KNL
  • usNIC stuff
    • conversion to libfabric (DONE)
  • simplified verbs BTL for iWarp? (NOT GOING TO HAPPEN)
  • Mellanox stuff (see above)
  • OMPI commands (mpirun, orte_info, etc.): deprecate all single-dash options except for the sacrosanct ones (-np, etc.). Print a stderr warning for all the deprecated options.
    • Note that MPI-3.1 8.8 mpiexec mentions: -soft, -host, -arch, -wdir, -path, -file (MOVED TO 3.x)
  • Score-P integration (won't hit 2.0.0, but will get in 2.x) (MOVED TO 3.x)
  • libfabric support (Intel MTL, Cisco BTL, others) (DONE in 1.10)
  • Memkind support both for MPI_Alloc_mem and Open MPI internal (DONE)
    • Note that the info infrastructure needed to let apps use memkind via MPI_Alloc_mem is not yet in place
  • Nathan Hjelm's BTL 3.0 changes (DONE)
  • MPI-4 features (maybe as extensions?)
    • endpoints proposal (3.0 or later)
    • ULFM (as of June 2015, Ralph/George are coordinating so that ORTE can give ULFM what it needs) (Not for 2.1)
    • MPI_T extensions - event stuff to replace Peruse (3.0)
  • Better interop with OpenMP 4 placement - esp. for nested OMP parallelism
  • OFI MTL support MPI_THREAD_MULTIPLE - may already be thread safe
  • OFI OSC component (probably will not happen)
  • Pipelined algorithms in collectives don't comply with the MPI standard (placeholder); see [Issue #1763](https://github.com/open-mpi/ompi/issues/1763)
  • multi-target configury
    • install of both thread safe/thread single with a single configure option

Features that are already in master that made it in to v2.0

    • Switch to using OMPI I/O as default
    • Switch to vader as default for shared memory BTL
    • PSM2 MTL
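
As an illustration of the option-deprecation item in the desirable-features list above (deprecate single-dash options except ones such as -np, and print a stderr warning), here is a minimal sketch of that policy. It is not Open MPI's command-line parser. The function names, the warning text, and the exact whitelist are assumptions: only -np and the MPI-3.1 8.8 mpiexec arguments named on this page are kept, and `-some-option` in the usage note is a hypothetical option.

```python
import sys

# Options allowed to keep their single-dash form. "-np" is named on this page;
# the rest are the mpiexec arguments from MPI-3.1 section 8.8 listed above.
# The real "sacrosanct" set is not specified here, so this set is an assumption.
KEEP_SINGLE_DASH = {"-np", "-soft", "-host", "-arch", "-wdir", "-path", "-file"}


def deprecated_single_dash(argv):
    """Return multi-letter single-dash options in argv that are not whitelisted."""
    flagged = []
    for arg in argv:
        if arg == "--":
            break  # assumption: everything after "--" belongs to the application
        is_single_dash = arg.startswith("-") and not arg.startswith("--")
        # Single-letter options such as "-x" are left alone in this sketch.
        if is_single_dash and len(arg) > 2 and arg not in KEEP_SINGLE_DASH:
            flagged.append(arg)
    return flagged


def warn_deprecated(argv, stream=sys.stderr):
    """Print one warning per deprecated option and return the flagged options."""
    flagged = deprecated_single_dash(argv)
    for opt in flagged:
        print(f"warning: option '{opt}' is deprecated; use '-{opt}' instead",
              file=stream)
    return flagged
```

For example, `warn_deprecated(["-np", "4", "-some-option", "--other"])` flags only `-some-option`. A real launcher would also have to stop scanning at the application executable and skip option values that happen to begin with a dash, which this sketch does not attempt.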

    Terminating support

    • Cray XT legacy items (ESS alps component, etc.) (DONE - although new ess/alps for Cray XE/XC)
    • MX BTL (DONE)
    • What other BTLs to delete? SCIF?
    • Clean up README (DONE)
    • Delete coll hierarch component
    • coll ML disabled
    • Delete VampirTrace interface
    • Deprecate mpif77/mpif90: print a stderr warning

    Testing

    • What do we want to test?
      • More thread safety tests - non blocking collectives, etc.
      • OMPI I/O tests, refresh from HDF group? (DONE)

    Stale code check - Opal

    Framework Component Owner Status
    shmem sysv LANL maintenance
    shmem mmap LANL maintenance
    shmem posix LANL maintenance
    shmem base LANL maintenance
    backtrace none SNL maintenance
    backtrace execinfo SNL maintenance
    backtrace printstack SNL maintenance
    backtrace base project maintenance
    crs none UTK maintenance
    crs criu CISCO maintenance
    crs self UTK maintenance
    crs dmtcp UBrit.Columbia unmaintained
    crs base project maintenance
    pstat linux INTEL maintenance
    pstat test INTEL maintenance
    if bsdx_ipv4 INTEL maintenance
    if bsdx_ipv6 INTEL maintenance
    if linux_ipv6 INTEL maintenance
    if solaris_ipv6 nobody maintenance
    if posix_ipv4 INTEL maintenance
    if base project active
    pmix s1 INTEL active
    pmix s2 INTEL active
    pmix cray LANL active
    installdirs env SNL maintenance
    installdirs config SNL maintenance
    installdirs base project active
    hwloc external CISCO maintenance
    hwloc hwloc1110 INTEL maintenance
    hwloc base project maintenance
    reachable netlink INTEL unmaintained
    reachable weighted INTEL unmaintained
    reachable base INTEL unmaintained
    event external CISCO maintenance
    event libevent2022 INTEL active
    event base project maintenance
    allocator basic NVIDIA maintenance
    allocator bucket NVIDIA maintenance
    allocator base project maintenance
    timer solaris nobody unmaintained
    timer linux SNL maintenance
    timer darwin SNL unmaintained
    timer aix IBM? unmaintained
    timer altix SNL? unmaintained
    timer base SNL maintenance
    compress gzip project maintenance
    compress bzip project maintenance
    compress base project maintenance
    rcache vma LANL maintenance
    memcpy base project maintenance
    common sm UTK maintenance
    common verbs MELLANOX maintenance
    common ugni LANL active
    common cuda NVIDIA active
    common libfabric INTEL active
    memchecker valgrind HLRS? unmaintained
    memchecker base project unmaintained
    dstore hash project active
    dstore base project active
    sec basic INTEL maintenance
    sec munge INTEL active
    sec keystone INTEL maintenance
    sec base INTEL active
    btl tcp UTK active
    btl sm UTK active
    btl usnic CISCO active
    btl template project active
    btl portals4 SNL active?
    btl scif LANL maintenance
    btl self UTK active
    btl ugni LANL active
    btl smcuda NVIDIA active
    btl vader LANL active
    btl openib Chelsio maintenance
    btl base btlowners active
    mpool sm LANL maintenance
    mpool gpusm NVIDIA maintenance
    mpool rgpusm NVIDIA maintenance
    mpool grdma LANL maintenance
    mpool udreg LANL maintenance
    mpool base project maintenance
    memory linux MELLANOX,CISCO maintenance
    memory malloc_solaris nobody unmaintained
    memory base project maintenance

    Stale code check - ORTE

    Framework Component Owner Status
    oob tcp INTEL maintenance
    oob alps LANL active
    oob usock INTEL maintenance
    oob ud MELLANOX maintenance
    oob base project maintenance
    rtc hwloc INTEL maintenance
    rtc freq INTEL maintenance
    rtc omp INTEL active
    rtc base INTEL maintenance
    schizo ompi INTEL active
    schizo base INTEL active
    filem raw INTEL maintenance
    filem base INTEL maintenance
    dfs app INTEL maintenance
    dfs orted INTEL maintenance
    dfs base INTEL maintenance
    rmaps rank_file INTEL maintenance
    rmaps lama CISCO maintenance
    rmaps round_robin INTEL maintenance
    rmaps seq INTEL maintenance
    rmaps staged INTEL maintenance
    rmaps resilient INTEL maintenance
    rmaps mindist MELLANOX maintenance
    rmaps ppr INTEL maintenance
    rmaps base INTEL maintenance
    routed binomial INTEL maintenance
    routed debruijn LANL? unmaintained
    routed direct INTEL active
    routed base INTEL maintenance
    errmgr default_app INTEL maintenance
    errmgr default_orted INTEL maintenance
    errmgr default_tool INTEL maintenance
    errmgr default_hnp INTEL maintenance
    errmgr base INTEL maintenance
    plm isolated INTEL maintenance
    plm tm INTEL maintenance
    plm rsh INTEL maintenance
    plm alps LANL maintenance
    plm slurm INTEL maintenance
    plm lsf INTEL maintenance
    plm base project maintenance
    ess tm INTEL maintenance
    ess tool INTEL maintenance
    ess pmi INTEL maintenance
    ess hnp INTEL maintenance
    ess singleton INTEL maintenance
    ess alps LANL maintenance
    ess env INTEL maintenance
    ess slurm INTEL maintenance
    ess lsf INTEL maintenance
    ess base project maintenance
    rml oob INTEL maintenance
    rml ftrm ? unmaintained
    rml base INTEL maintenance
    snapc full nobody unmaintained
    snapc base nobody unmaintained
    sstore central nobody unmaintained
    sstore stage nobody unmaintained
    sstore base nobody unmaintained
    odls alps LANL active
    odls default INTEL maintenance
    odls base project maintenance
    state app INTEL active
    state novm INTEL active
    state staged_orted INTEL active
    state tool INTEL active
    state dvm INTEL active
    state hnp INTEL active
    state orted INTEL active
    state staged_hnp INTEL active
    state base INTEL active
    grpcomm rcd INTEL maintenance
    grpcomm direct INTEL maintenance
    grpcomm brks INTEL maintenance
    grpcomm base INTEL maintenance
    ras tm INTEL maintenance
    ras loadleveler IBM maintenance
    ras alps LANL active
    ras simulator INTEL maintenance
    ras slurm INTEL maintenance
    ras lsf INTEL maintenance
    ras gridengine INTEL unmaintained
    ras base INTEL maintenance
    common alps LANL active
    iof mr_hnp INTEL maintenance
    iof tool INTEL maintenance
    iof hnp INTEL maintenance
    iof mr_orted INTEL maintenance
    iof orted INTEL maintenance
    iof base INTEL maintenance

    Stale code check - OMPI

    Framework Component Owner Status
    fbtl pvfs2 UH active
    fbtl posix UH active
    fbtl plfs UH active
    fbtl base UH active
    pubsub orte INTEL maintenance
    pubsub pmi INTEL maintenance
    pubsub base INTEL maintenance
    osc sm LANL maintenance
    osc rdma LANL active?
    osc portals4 SNL active
    osc pt2pt LANL active
    osc base LANL active
    fs pvfs2 UH active
    fs lustre UH active
    fs ufs UH active
    fs plfs UH active
    fs base UH active
    pml ob1 LANL active
    pml yalla MELLANOX active
    pml v UTK maintenance
    pml bfo NVIDIA unmaintained
    pml cm SNL maintenance
    pml crcpw nobody unmaintained
    pml base project active
    dpm orte INTEL maintenance
    dpm base INTEL maintenance
    topo basic UTK maintenance
    topo example UTK maintenance
    topo base UTK maintenance
    vprotocol pessimist UTK maintenance
    vprotocol example UTK maintenance
    vprotocol base UTK maintenance
    coll libnbc project active
    coll inter UH maintenance
    coll sm nobody unmaintained
    coll fca MELLANOX active
    coll hcoll MELLANOX active
    coll portals4 SNL active
    coll self CISCO maintenance
    coll cuda NVIDIA maintenance
    coll basic UH maintenance
    coll ml ORNL? unmaintained
    coll demo project maintenance
    coll tuned UTK maintenance
    coll base project maintenance
    bml r2 SNL maintenance
    bml base project maintenance
    io romio314 LANL/RIST active
    io ompio UH active
    io base project maintenance
    bcol iboffload ORNL unmaintained
    bcol basesmuma ORNL unmaintained
    bcol ptpcoll ORNL unmaintained
    bcol base ORNL unmaintained
    sharedfp sm UH maintenance
    sharedfp lockedfile UH maintenance
    sharedfp individual UH maintenance
    sharedfp addproc UH maintenance
    sharedfp base UH maintenance
    sbgp p2p ORNL unmaintained
    sbgp basesmuma ORNL unmaintained
    sbgp basesmsocket ORNL unmaintained
    sbgp ibnet ORNL unmaintained
    sbgp base ORNL unmaintained
    op x86 INTEL maintenance
    op example project maintenance
    op base project unmaintained
    mtl mxm MELLANOX active
    mtl psm INTEL active
    mtl portals4 SNL active
    mtl ofi INTEL active
    mtl base project active
    fcoll dynamic UH active
    fcoll static UH active
    fcoll individual UH active
    fcoll two_phase UH active
    fcoll base UH active
    crcp bkmrk nobody unmaintained
    crcp base nobody unmaintained
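
The stale code tables above are whitespace-separated rows of framework, component, owner, and status, so the unmaintained components can be pulled out mechanically. The following is a minimal sketch, assuming every data row has exactly four fields (true of the rows on this page) and treating a trailing "?" on the status as uncertainty rather than a separate value. The function names are hypothetical and the sample rows are copied from the Opal table.

```python
from collections import defaultdict

HEADER = ["Framework", "Component", "Owner", "Status"]


def parse_rows(text):
    """Parse 'framework component owner status' lines into dictionaries."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and anything that is not 4 fields.
        if len(fields) != 4 or fields == HEADER:
            continue
        framework, component, owner, status = fields
        rows.append({"framework": framework, "component": component,
                     "owner": owner, "status": status})
    return rows


def unmaintained_by_framework(rows):
    """Group the components marked unmaintained by their framework."""
    result = defaultdict(list)
    for row in rows:
        # "unmaintained?" would count too; the "?" only marks uncertainty.
        if row["status"].rstrip("?") == "unmaintained":
            result[row["framework"]].append(row["component"])
    return dict(result)


SAMPLE = """\
Framework Component Owner Status
crs criu CISCO maintenance
crs dmtcp UBrit.Columbia unmaintained
timer solaris nobody unmaintained
timer linux SNL maintenance
timer aix IBM? unmaintained
"""

if __name__ == "__main__":
    print(unmaintained_by_framework(parse_rows(SAMPLE)))
```

Feeding the full text of the three tables to `parse_rows` gives a per-framework list of candidates for removal, which is the question the stale code check is meant to answer.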