Meeting Minutes 2018 03

Tomislav Janjusic edited this page Jan 6, 2024 · 1 revision

Open MPI Developer's Face-to-Face Meeting - March 2018

Schedule

  • Overall dates: March 20-22, 2018 (see below)
  • Open MPI
    • Tue, Mar 20: 9am-6pm
    • Wed, Mar 21: 9am-noon
    • Thu, Mar 22: 9am-noon
  • PMIx
    • Thu, Mar 22: noon-6pm
    • Fri, Mar 23: 9am-noon
  • ORTE
    • Wed, Mar 21: noon-6pm

Attendees

Add your name below if you plan to attend:

  • Tuesday
  1. Geoff Paulsen (IBM)
  2. Josh Hursey (IBM)
  3. Brice Goglin (Inria)
  4. Howard Pritchard (LANL)
  5. Shinji Sumimoto (Fujitsu)
  6. Takahiro Kawashima (Fujitsu)
  7. Brian Barrett (AWS)
  8. Xin Zhao (Mellanox)
  9. Ralph Castain (Intel)
  10. Jeff Squyres (Cisco)
  11. Mark Allen (IBM)
  12. George Bosilca (UTK)
  13. Matthew Dosanjh (Sandia National Laboratories)
  14. Edgar Gabriel (UH)
  15. Arm Patinyasakdikul (UTK)
  16. Dong Zhong (UTK)
  17. Geoffroy Vallee (ORNL)

Remote attendance

Webex for joining remotely will be posted on the day of the meetings.

Tuesday Webex: https://cisco.webex.com/ciscosales/j.php?MTID=ma0d9503e06fc21eba8c12ca52038f3e1, password SvHBAwEX

Topics to discuss

PMIx things

  1. Deeper utilization of PMIx
    Our integration strategy for PMIx so far has been just replacement - i.e., we made no code path or logic changes, but simply replaced RTE-related calls with their PMIx equivalent. Perhaps it is time to step back and take a fresh look at how we can exploit PMIx. For example, we engage in potentially multiple negotiating steps to determine a new communicator ID to ensure it is globally unique - could we instead utilize the PMIx_Connect function (which returns a globally unique nspace identifier)?
    See https://github.com/open-mpi/ompi/issues/4542 for some initial thoughts.
  2. Now that PMIx is a stable, standalone project, is it time to talk about separating MCA into a separate library again? (this is a shade different than making OPAL a standalone library) Yes, there are many challenges with this. ...but is it time to figure them out?
    • The MCA base code is tightly coupled to the rest of OPAL - it uses lists, atomics, etc.
    • PMIx cannot link against OPAL for political reasons, so this is a non-starter
  3. Backward compatibility concerns
    • Need to start testing cross-version support in both PMIx and OMPI
  4. Ralph will give a presentation about all the new PMIx functionality (e.g., PMIx debuggers)
  5. PMIx integration
    • remove embedded?
    • forward-version compatibility in components to support packagers
      • https://github.com/open-mpi/ompi/issues/4072#issuecomment-373746869
      • Steps required:
        • Consolidate to one external component
        • Remove check for version in external component's configure.m4
        • Use #if to exclude APIs that are not supported by given PMIx version
        • Further use #if to handle API break in v1.1 series, if desired
      • May need a static framework / libtool magic to prevent ORTE from linking directly to the external PMIx.
      • very late in v2.x lifecycle.
    • Alternatively, supply multiple external components and try them in version order from highest to lowest; the higher-version external components will just fail.
      • May need a show-load-warning MCA parameter to prevent lots of runtime warnings when running with older components.
      • Need to see if we can compile all 3 libevent external components against 1 libpmix, or if we need 3 different libpmixs to compile all 3 libpmixs.
      • Can specify which version of pmix external component to use by setting an mca parameter.
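
The "try the external components from highest to lowest version" idea can be sketched roughly as follows. This is a minimal Python illustration, not Open MPI's MCA selection code: the component names, the `select_component` and `make_components` helpers, and the `forced` argument (standing in for the MCA parameter mentioned above) are all hypothetical.

```python
def select_component(components, forced=None):
    """Pick an external component.

    components: list of (name, version_tuple, try_open) entries, where
        try_open() returns True if the component works with the installed library.
    forced: optional component name, standing in for the MCA parameter.
    """
    if forced is not None:
        for name, _version, try_open in components:
            if name == forced and try_open():
                return name
        return None
    # Try from the highest version to the lowest; newer components simply fail.
    for name, _version, try_open in sorted(components, key=lambda c: c[1], reverse=True):
        if try_open():
            return name
    return None


def make_components(installed):
    """Hypothetical components: each opens only if the library is new enough."""
    versions = [(1, 2), (2, 0), (3, 0)]
    return [("ext%dx" % v[0], v, (lambda v=v: installed >= v)) for v in versions]
```

With an installed library at (hypothetical) version 2.0, the 3.x component fails to open and the 2.x component is selected; forcing a specific component bypasses the ordering.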

ORTE discussion

  • ORTE Presentation
  • Wednesday 2pm central
  • Now that ORTE is a PMIx server and we embed PMIx with Open MPI, why do we still need ORTE?
    • We don't. We just need ORTE for ssh launch. We'd just need an mpi_boot and PMIx reference server.
    • Long term, we could ditch ORTE
    • Reference server includes:
      1. psrvr - equivalent to lamboot
      2. prun does the same as mpirun, parses command lines and calls PMIx_Spawn()
  • Torque has a TM interface, but no PMIx interface. Would we still need a PLM-like thing to start the daemons? - Yes.
  • Changes in last year or so:
    • Event driven state machine - gives us flexibility to add progress threads.
    • Thread-safe operations based on events (Progress thread owns some data)
    • Heavy reliance on PMIx.
    • Distributed mapping (Now just start DVM up, and send out app-context, and everyone creates data in sync)
  • State machine
    • Every operation executes in an event
    • App/proc lifecycle broken into "states"
      • Dynamic mapping of state to operation - mapping of this state calls this function.
      • There is an API to override the default.
    • All implemented in libevent event functions. The void ptr returned each time is a state function.
    • Can override, add, delete, reorder states.
    • State activation macros MUST be used.
      • Provides "trace" capability for debugging. Because this is an asynchronous event engine, it tells you WHERE it's being called from.
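
The state-machine design described above (each state mapped to a function, every operation run as an event, activation through a single entry point that records the caller) might look roughly like this. It is a simplified Python sketch, not ORTE's libevent-based implementation; the class, the state names, and the handlers are hypothetical.

```python
import inspect
from collections import deque


class StateMachine:
    def __init__(self):
        self.handlers = {}     # state -> function called when the state is activated
        self.queue = deque()   # pending (state, job) events
        self.trace = []        # (state, name of the function that activated it)

    def set_state(self, state, handler):
        """Add a state, or override the handler of an existing one."""
        self.handlers[state] = handler

    def remove_state(self, state):
        self.handlers.pop(state, None)

    def activate(self, state, job):
        """Queue a state change; plays the role of a state activation macro."""
        caller = inspect.currentframe().f_back.f_code.co_name
        self.trace.append((state, caller))
        self.queue.append((state, job))

    def run(self):
        """Process events one at a time, as a single progress thread would."""
        while self.queue:
            state, job = self.queue.popleft()
            handler = self.handlers.get(state)
            if handler is not None:
                handler(self, job)


def allocate(sm, job):
    job.append("allocated")
    sm.activate("MAP", job)


def map_procs(sm, job):
    job.append("mapped")
    sm.activate("LAUNCH", job)


def launch(sm, job):
    job.append("launched")


def build_default():
    sm = StateMachine()
    sm.set_state("ALLOCATE", allocate)
    sm.set_state("MAP", map_procs)
    sm.set_state("LAUNCH", launch)
    return sm
```

Because every transition goes through `activate`, the trace shows which handler requested each state, and replacing a handler with `set_state` changes behavior without touching the other states.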
  • Thread-Safety
    • Access to data based on progress thread.
    • Main ORTE progress thread
      • ALL ORTE globals, most framework globals.
    • Framework progress threads
      • Most only have one, but some have multiple: odls, oob
      • These framework threads don't progress the main orte state, but they "do work" for a single state.
        • in some cases, like odls, the main orte thread can return early and continue other orte state info, but when the other threads are done, the final one updates the state info.
  • PMIx reliance
    • Apps no longer use OOB at all.
      • All app-to-daemon interactions over PMIx
      • show_help flows over PMIx_Log
    • ORTE daemons (PMIx reference server)
      • Support all PMIx calls except for PMIx_Alloc
      • It's very complete even though Open MPI isn't using all of it, because ORTE is a development environment / reference server.
    • MPIR tools deprecated
      • Shift to PMIx-based tools
      • Greatly extends capabilities
      • PMIx is more featureful, e.g., it is nice to be able to query the fabric for the traffic report on a switch
  • Distributed Mapping
  • mpirun does a pass through application descriptions after daemons have been launched and it knows the topology
    • says 3 procs on this node, and 4 on that node.
    • slot info is required for RR by slot, node and ppr for node, but DON'T have to have topology.
    • If you use non-VM mode, where you do mapping first and THEN launch daemons to gather topology, you can only use the mappings above.
      • Unless you supply topology info or specify homogeneous, in which case it infers remote topology info.
    • Does not assign location within node, rank.
    • Sends node (includes nodename, daemon vpid, #slots), ppn regex to all daemons.
    • Also sends whatever else PMIx sends as a payload (such as environment and application context)
    • Each daemon calls rmaps.assign_locations, which computes rank, local rank, node rank (across ALL ranks in all JOBS)
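
The point of distributed mapping is that every daemon can derive the same process map from the same compact input (node list plus procs-per-node), so the full map never has to be shipped. A rough Python sketch of that idea follows; it is not the rmaps code. The function body, the contiguous-by-node rank ordering, and the `already_on_node` argument are assumptions made for illustration.

```python
def assign_locations(nodes, ppn, already_on_node=None):
    """Compute rank, local rank, and node rank for every process of one job.

    nodes: node names in daemon vpid order.
    ppn: number of processes of this job on each node, same order.
    already_on_node: processes from earlier jobs per node (hypothetical input),
        so that node ranks count across all jobs.
    """
    already_on_node = already_on_node or {}
    procs = []
    rank = 0
    for node, count in zip(nodes, ppn):
        base = already_on_node.get(node, 0)
        for local_rank in range(count):
            procs.append({"rank": rank, "node": node,
                          "local_rank": local_rank,
                          "node_rank": base + local_rank})
            rank += 1
    return procs
```

For "3 procs on this node, and 4 on that node", each daemon calling `assign_locations(["n0", "n1"], [3, 4])` gets an identical seven-entry map without any further communication.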
  • Frameworks: Infrastructure.
    • RTE Messaging Layer (RML)
    • Out of Band (OOB)
    • Routed (routed) defines fanout
      • Radix tree / Binomial tree.
    • IO Forwarding (iof)
      • Daemons forward to mpirun
      • mpirun outputs and forwards to tools (PMIx interface at mpirun node).
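
The fanout that the routed framework defines can be pictured with two textbook tree layouts. The Python sketch below shows a generic heap-style radix (k-ary) tree and a standard binomial tree over daemon vpids; it does not claim to match the exact layout the routed components compute, and the function names are hypothetical.

```python
def radix_children(vpid, num_daemons, radix=2):
    """Children of `vpid` in a k-ary tree laid out like a heap (root is 0)."""
    first = vpid * radix + 1
    return [c for c in range(first, first + radix) if c < num_daemons]


def binomial_children(vpid, num_daemons):
    """Children of `vpid` in a binomial tree: vpid + 2**j for every 2**j > vpid."""
    children = []
    step = 1
    while step <= vpid:
        step *= 2
    while vpid + step < num_daemons:
        children.append(vpid + step)
        step *= 2
    return children
```

In both layouts every daemon other than the root has exactly one parent, so a message injected at vpid 0 reaches each daemon once.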
  • State (state)
    • Correlates state activation to function
    • Determines ordering of launch operations, where errors get handled.
    • Can customize - add/set/delete state if you're in the main orte thread (only one allowed to change state)
  • Environment Setup Subsystem (ess)
    • Initialize ORTE for given proc type.
    • Components for daemons when launched by different RMs.
    • PMI component for apps.
    • ess components are parsing RM specific arguments to get their identity.
  • Error Manager (errmgr)
    • What are you going to do if you get an error.
    • No APIs in this framework. Strictly done by states.
    • First searches for specific error types, and if none, then move to errmgr state.
  • Framework: Launch
    • Resource Allocation subsystem (ras)
    • Discover allocations from RMs
    • Process hostfile, -host specifications
    • Simulator component
      • Specify number of nodes, topology file
      • Create arbitrary collections to test mapping
      • Automatically sets do-not-launch.
    • Regular Expressions (regx)
      • Generate/parse nodes, ppn.
      • Strictly for COMPRESSING list of ALL HOSTs into a reg-ex that should be shorter for sending out to ALL nodes to figure out where everyone else is.
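
To illustrate why compressing the host list pays off, here is a small Python sketch that folds hostnames with a shared prefix and numeric suffix into bracketed ranges and expands them again. The output syntax (`node[001-128]`, parts joined by `;`) is made up for this example and is not the format the regx components generate.

```python
import re


def compress(hosts):
    """Fold names like node001..node128 into node[001-128]."""
    groups = {}   # (prefix, digit width) -> suffix numbers
    plain = []    # names without a numeric suffix are passed through
    for host in hosts:
        m = re.fullmatch(r"(.*?)(\d+)", host)
        if m is None:
            plain.append(host)
            continue
        prefix, digits = m.groups()
        groups.setdefault((prefix, len(digits)), []).append(int(digits))
    parts = list(plain)
    for (prefix, width), nums in groups.items():
        nums = sorted(set(nums))
        ranges = []
        start = prev = nums[0]
        for n in nums[1:] + [None]:      # None flushes the final range
            if n is not None and n == prev + 1:
                prev = n
                continue
            lo, hi = f"{start:0{width}d}", f"{prev:0{width}d}"
            ranges.append(lo if start == prev else lo + "-" + hi)
            start = prev = n
        parts.append(prefix + "[" + ",".join(ranges) + "]")
    return ";".join(parts)


def expand(pattern):
    """Inverse of compress()."""
    hosts = []
    for part in pattern.split(";") if pattern else []:
        m = re.fullmatch(r"(.*)\[(.*)\]", part)
        if m is None:
            hosts.append(part)
            continue
        prefix, ranges = m.groups()
        for r in ranges.split(","):
            lo, _, hi = r.partition("-")
            for n in range(int(lo), int(hi or lo) + 1):
                hosts.append(f"{prefix}{n:0{len(lo)}d}")
    return hosts
```

A list of 128 consecutively numbered nodes collapses to a single short token, which is what makes it cheap to send to all nodes.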
  • Framework: rmaps
    • components on how to distribute procs across hosts.
  • Process Launch and Monitoring (plm)
    • No longer launching procs, only launching daemons.
    • Uses RM native launcher if available, else 'ssh'
    • No longer does any monitoring.
  • ORTE Daemon Launch Subsystem (odls)
    • Fork/exec launch of local procs.
  • Schizoid (schizo)
    • Customize/tailor cmd line
    • Personality-based operations.
    • non-mpi cmd line parsing.
  • Runtime controls (RTC)
    • hwloc - bind, topology shared memory
    • Freq - power control
      • took it out because it needed root access.
  • Group Communications (grpcomm)
    • OOB bcast, allgather support.
  • Distributed File System (dfs)
    • Access remote files (read/write)
    • If you want to open a file remotely, and read/write data... orte will carry bits, and do file ops. just a file descriptor
  • File Mover (filem)
    • Move directories/files across nodes.
    • If you're running on system without shared filesystem, you can ask filem to move files to remote node, and then give filenames back to ORTE
  • Notifier (notifier)
    • basically PMIx_Log.
  • App Lifecycle.
    • See Slides... useful and detailed.
    • There are a lot of states defined, because always want to do stuff between tasks, for various hooking purposes.
  • ORTE DVM
    • Identical to PMIx Reference Server
    • Launches daemons, reads allocation, etc. exactly like mpirun
    • If you start PMIx reference server to setup as system server (non-default), will accept jobs from any user... by default, only same user.
    • Difference
      • Does not terminate at end of job
        • Must be manually terminated (prun --terminate)
      • Use prun to launch apps (same options as mpirun)
      • If you want mpirun to find dvm, you have to specify context info with --hnp , prun does this automatically.
  • What is left to do with ORTE?
    • move to PMIx reference Server. Here to discuss this.
    • Intel intends to continue to invest in PMIx in Open Source.
    • Would you split out orte without PMIx? Probably not. With PMIx, maybe.
  • orterun is an options parser, some glue code, and a go button.
    • Some glue code might go away, but open mpi would want to keep the options parser.
    • In the long term, ssh would do something similar to lamboot (possibly via mpirun)
    • many customers don't WANT to use a resource scheduler and want to use ssh.
  • ORTE is now a 'fork' of PMIx reference server.
  • Ralph plans to collect feature requests, get them done over the summer, and stabilize over the next year and a half.
  • There is debugging and stress testing of PMIx reference server that doesn't make it back into ORTE today.
  • Do you do releases of reference server?
    • Not to date, but need to.
    • If we adopt PMIx reference server we'd have to sync up the release schedules a bit
  • PMIx reference server depends on opal, but PMIx server would use prefixing of opal symbols and would be in the daemon server, not the application.
  • If you use prun, you need to run psrvr (like lamboot) first to start the servers.
    • psrvr understands the -host and hostfile type options.
    • prun also understands those options because you can launch WITHIN those options.
  • Average customer shouldn't care that we switched the runtime.
  • We'd still control mpirun as a "template" parameters
  • We need to try to run with the reference server.
  • Today we only test 10% of ORTE.
  • Amazon will take the MTT action item.
  • This should get written up and emailed as an RFC. Do this on a Tuesday call.
    • Deadline: send out tomorrow. Josh and Jeff will work on slides at dinner.
  • Software Fragmentation
    • Heavily relies on opal and critical internals.
  • Intel is still interested in PMIx but not Open MPI
  • We need more resources for whatever runtime.
  • Another way to think about this is how to maintain a runtime with minimal man-power.

Open SHMEM discussion Thur 9:15am

  • Status for Open SHMEM for v1.4 spec released Dec 2017
    • Contains many new features
    • Has a new Context Object - allows users to overlap communication / computation in each thread.
    • Default is multithreaded, though users could specify serialized, private or nostore (can't guarantee ordering)
    • If user doesn't create a context, there is a default.
  • Strategy: Map to UCX worker, and what should be in context.
  • Thread Safety
    • two new APIs: shmem_init_thread (same as MPI thread level requests), and shmem_query_thread to query.
    • SHMEM_THREAD_SINGLE - only 1 thread in whole app.
  • Schedule
    • By end of March - Finalize plan and design details
    • By mid-May - Implementation and debugging
    • By end of June - add test units, achieve phase 1.
  • IBM is testing OSHMEM + UCX
    • Who should we work with at Mellanox?
    • Report them through the same mechanism as HCOLL.
  • Testing between IBM and Mellanox.
    • Open MPI would like issues for major issues that might hold up a release for their tracking.
  • Maintaining OpenSHMEM code in Open MPI
    • Issues - On platforms where there is no transport for OSHMEM, it's still built by default, but no transport.
      • Could we ONLY build OSHMEM on platforms where there are transports by default?
      • If you don't have a viable UCX SPML on that platform, don't build OSHMEM
      • Yes, reasonable.
      • Brian will have this ACTION for 4.0
      • UCX is in new distros now.
    • Issue - some concern about Major and minor release versioning for Open MPI versioning and OSHMEM versioning.
      • Mellanox is harmonizing with Open MPI major releases, and doing major OSHMEM stuff in major open mpi versions, and minor in minor versions.

Fujitsu MPI for Post-K computer

  • 20180320_Fujitsu-OMPI-Dev-Meeting.pdf

  • currently based on Open MPI v2.0.4, planning to rebase onto v3.1.x

  • Optimized for new Tofu interconnect

  • Use PMIx v2.0 with Fujitsu RM.

  • Supported features:

    • Persistent collective operations (depends on standardization early next year).
    • Open MPI Java Binding
    • MPI-IO on hierarchical
  • Two Pull Requests:

    • PR2758
    • PR4515, combined into PR4618.
  • ACTION - don't want these in MPI_, keep them as MPIX_ prefix until the MPI Forum decides what to do.

    • Need to figure out the versioning.
    • Need documentation that says that MPIX routines might completely change in a minor version.
  • No man pages yet, maybe one man page for all of them for now.

    • Easiest way might be to go through the existing collective manpages, and add them there.
  • Another topic is MTT Run on SPARC

    • Fujitsu resumed work of running MTT on our machines.
    • Current plan, Linux/SPARC64
    • master branch and one or two release branch (v3.1 and v3.0?)
    • Weekly run.
  • When enough armv8 machines become available, we'll add.

  • Open MPI Java Binding

    • Missing methods and some test programs were contributed last year.
  • Thread parallel tests are under development and will be contributed in a year.

ORTE discussion continues.

  • OMPI_PMIx_RTE_Proposal.pdf
  • We need ORTE help
    • Open MPI ORTE has drifted from PMIx reference library ORTE fork.
  • Open MPI Runtime Support
    • Should we fork ORTE from Open MPI? - It already has forked. Ralph is doing ORTE work in PMIx reference server.
    • ORTE and Reference server
  • Clarification
    • There is the PMIx Reference Library - client, and dependant on for years, different than.
    • PMIx Reference Server - RM implements these (SLURM v1.2, IBM JSM, others)
  • Direct Launch today (on master)
    • Rank0 uses: ompi/mca/rte orte and pmix
    • OPAL layer uses orte/mca/ess/pmix - Requires orte errmgr and state frameworks.
  • Launching with reference server Orte2
    • prun -> psrvr -> psrvd Compute node -
  • Does PMIXRTE actually work? Why is it disabled by default?
    • A few bugs still.
    • We need it for MPIR interface, because that's not in PMIx
  • mpirun launch today:
    • same, but mpirun launches the orted on remote nodes directly (no mpiboot / psrvr start type step)
    • Would just need to remove ORTE and push functionality into mca pmix component.
  • Howard noticed some issues
  • Next steps:
    • Do nothing to ORTE, but run with external ORTE2 server.
    • start running tests and start fixing bugs, and pushing to this.
  • If we treat this like hwloc, we'd have an internal and external component, and Open MPI could just pull in "released" versions of ORTE2/PMIx reference library. (whatever it's named)
  • PMIx team needs to go and decide what the minimal functions / keys would be required to be considered PMIx compliant. PMIx team is meeting tomorrow on this.

Everything else

  1. HWLOC

    • upgrade to v2.0 planning
      • Probably need to skip hwloc v2.0 support until Open MPI v4.0
      • Don't want to change default from internal to external for v3.1.x or change internal component, but would be open to a new external hwloc component that supports the v2.0.x hwloc.
    • ACTION:
      • Ralph and Brice will meet to discuss issues of getting v3.1.x and master to work with hwloc external.
      • Jeff will add configury for v2.x to disable hwloc v2.0.0 and Brice will do configury to prevent v3.0.x from working with hwloc v2.0.x
      • IBM and Intel interested in testing hwloc v2.0.x with open mpi v3.1.x and v4.0
    • ACTION: For components that have internal components (hwloc, libevent, etc)...
      • If search and FIND an external component, we will compare versions with the internal.
      • If the external that was found (again without --with-foo specified) was OLDER than the internal, we'll die in configure with a message saying they should specify if they want that older external or the internal explicitly.
      • If the external and internal are THE SAME, then prefer external.
    • ACTION: Ralph and Brice will talk to get Open MPI master (pre v4.0) to work with hwloc v2.0.x
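
The internal-versus-external decision agreed in the ACTION item above can be written down as a small decision function. This Python sketch only mirrors the rules as minuted; it is not the configure logic. `choose`, `ConfigureError`, and the `requested` argument (standing in for an explicit --with-foo choice) are hypothetical, and the "external is newer, so use it" branch is an assumption because the minutes do not state that case.

```python
class ConfigureError(Exception):
    """Stands in for configure aborting with a message."""


def choose(internal, external=None, requested=None):
    """Decide which copy of a package (hwloc, libevent, ...) to build against.

    internal, external: version tuples; external is None if none was found.
    requested: "internal", "external", or None when the user gave no preference.
    """
    if requested is not None:
        if requested == "external" and external is None:
            raise ConfigureError("external copy requested but not found")
        return requested
    if external is None:
        return "internal"
    if external < internal:
        raise ConfigureError(
            "found an external copy older than the internal one; "
            "specify explicitly which one to use")
    # Same version: prefer external. Newer external: assumed to be used as well.
    return "external"
```

The explicit request always wins, so a user who really wants the older external copy can still get it by asking for it.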
  2. github contributor guidelines - Contribution Guidelines. If anyone cares, comment.

  3. Memkind mpool component needs a maintainer - the APIs being called in it have been deprecated.

    • LANL will maintain this. Howard's currently working on a simplified variant.
    • Is it really deprecated? There's a Jan 2018 release listed in http://memkind.github.io/memkind/
    • Not correct. What changed is that the component was using a memkind partition interface, and instead we need to work with the kind API.
      • This made memkind deprecated. The idea of using something that was deprecated isn't smart.
    • Action: Need to move away from deprecated API usage.
    • Howard agreed to maintain memkind.
  4. https://www.open-mpi.org/papers/ page:

    • Listing of academic papers gets sparse after 2007.
      • Agreed to Keep the actual pages of all those papers, just in case there are links to them elsewhere
      • Only list BOF slides / OMPI-project-specific pages
    • project specific pages.
    • ACTION rename papers to presentations
    • Some of these can not be blind, such as
    • Jeff will take action
  5. One sided / osc_rdma updates

    • https://github.com/open-mpi/ompi/pull/4918
      • already updated and in master. Dealing with some issues with OSC + MT.
      • Large accumulates were having problems. Sometimes single threaded would get hit also.
      • IBM OSC + MT now runs clean
      • Large patch; release managers need to see if they're okay with that.
      • Heavily tested over the last 3 months.
      • Nathan should file PRs to v3.1.x and v3.0.x - Similar enough to master, should apply cleanly.
    • https://github.com/open-mpi/ompi/pull/4919 - BTL_UCT
      • UCX + OSC - UCX (used UCT which is UCX lower level BTL like) (UCP is higher level)
      • Problem, to get good performance, have to create multiple low level UCT components, but those get bound to resources.
      • When One sided operations come in, it will farm those out.
      • binds threads to devices using C11.
      • This is a new BTL, ONLY for OSC - if BTL_Send is not set, it totally ignores it for 2sided.
      • LLNL will be owner of this OSC BTL - Want to check with Mellanox.
      • Mellanox would like to see a multithreading benchmark to optimize UCX further.
      • Arm (UTK) - Should get built in automatically, but didn't see it in ompi_info.
      • Is it better to have separate 1sided and 2sided BTLs? Better to keep them together.
  6. When should we branch v4.0?

    • New Features coming in for v4.0:
      1. Better Multithreading (UTK) - what's the scope?
      2. In OB1 PML, normal OMP parallel Sections. Improved for injection and extraction rates.
      3. Implications for other PMLs. Very OB1 specific. Maybe a little bit in progress.
      4. still working on (Half ready)
      5. PMIx debugger-support (PMIx v3.0 this summer?)
      6. hwloc v2.0.x
        • We need a small amount of configure work in v3.x to print that Open MPI v3.x and earlier won't work with hwloc v2.0.x and newer.
    • ULFM (UTK) own new MPIX functions. Most is in MPIX, but some in PML.
    • George has test suite they could add to MTT.
    • ready now, but no PR yet.
    • SPC - Software Performance Counters (looks good)
    • Where are we on next MPI Forum?
      1. Some hope for an MPI v3.2 Don't wait for this.
    • Understand PMIx implications before branch for v4.0
    • Decide if we're going to continue to embed hwloc and libevent?
    • -prot and -entry (maybe?)
    • OSHMEM 1.4 - Mellanox [Planning for 9:15am Thursday]
    • Discussing mid-july branch
    • Push the branch date BACK so that once we branch, NO more features.
    • Nice to have firm branch date so feature writers can plan.
  7. libevent replacement

    • remove embedded?
    • What needs to happen to upgrade to latest libevent?
      • Open MPI requires a thread safe libevent, and if you read docs, then you see it's only thread safe if you used different contexts (useless for us).
      • Nathan did work to make the APIs we care about thread safe.
      • But to upgrade to latest libevent, someone needs to redo this work and repackage for Open MPI.
    • libev is long term approach because it's much simpler and only includes about what we need.
      • If we go libev route, we may need to fork and embed a version of it.
      • performance of libev seems advantageous only at many many file descriptors, but we don't care about that case.
    • Cent-OS doesn't have latest libevent.
    • RHEL 7.4 is 2.0.24
    • CONCLUSION - going same route as hwloc is the Answer.
    • But who would do the work to upgrade to latest libevent? Possibly backporting to Open MPI v3.1.1
    • Some runtime issues with OMPI v1.8.6 and 1.8.8 using libevent 2.1.8, actually issue in PMIx I/O forwarding on master and not an issue.
      • talking about backporting to OMPI v2.0 series.
      • Should be good in v3.x series.
    • ACTION: Jeff will build external libevent v2.1.8
    • If testing goes well, ONLY update configury to require libevent v2.1.8 or internal (which we won't rev).
  8. OSHMEM status.

    • There may be a major rev in the OSHMEM API or in the MPI API, but not both. The problem we're fighting with is confusing users about major revs that are only major for one API and not the other.
    • Does OSHMEM still need to be piggybacking on Open MPI?
    • There are libraries that run OSHMEM and other libraries that run on top of MPI, and both are used in the same application.
    • We as a community have decided that we have that (but only have that for UCX).
    • Challenge is that users are conservative, and don't want to upgrade to next major release.
    • We could choose to NOT drive the Open MPI version number based on OSHMEM, only base it on MPI version number.
    • You can mix OSHMEM and MPI and have totally different implementation.
    • If you don't share the same opal it won't work.
    • These would both depend on opal.
    • Can easily rename symbols with one line change in configure.ac
    • Has always been weird that Open MPI contains OSHMEM.
    • Because of OSHMEM we get installed onto machines
    • Frustrating for release managers to manage the OSHMEM stuff, especially when we are Open MPI.
    • Various community members
    • What's common:
      • Orte and direct launch (ess)
      • OPAL
      • PMIx has MPI-RTE to replace ess (better for OSHMEM).
      • Ah... A lot of Fortran stuff, OMPI datatypes, a lot of MPI stuff.
    • Ralph's proposing an easy way to do this would be to have OSHMEM be a fork of Open MPI that has OSHMEM and then have OSHMEM fork can pull when they're ready.
    • Let's make a cost/benefit analysis.
    • Would like to see this move forward. We'll talk tomorrow morning (March 21st).
    • One idea would be to put a deadline of July (v4.0 branch) to conclude, if we don't conclude this by then we could remove OSHMEM from Open MPI. Decided to NOT yet do this, but just discuss further tomorrow when Josh Ladd is online.
  9. Webpage updates

    • Jeff will work on ompi-www Issue 28 and 35.
    • What needs updates on the FAQ? Home grown HTML. It'd be nice if the FAQ engine did something better. Replacing it with better FAQ system. A better system to "show me the FAQ for version X.Y" would be awesome.
    • Geoff Paulsen signs up to create issue describing what's out of date, etc.
  10. What to do about unsupported platforms (e.g., in the context of POWER 7/BE)

    • Alastair Mc. made some good points about just letting "unsupported" platforms build and let people know THIS IS UNSUPPORTED!: https://github.com/open-mpi/ompi/issues/4349
    • E.g., should we re-enable POWER BE under this nomenclature? We don't know that it's broken -- we just know that we disabled it in v2.0.x and v2.1.x... for some reason.
    • We are STRONG about not wanting silent data corruption, and issues with atomics can lend toward silent data corruption.
    • We did a bad job of commit message saying why we disabled a platform.
    • We didn't communicate to users and maintainers well. We need to do better job talking to maintainers.
    • We'll create a new communication channel to allow for some discussion about changing the packaging
      • possibly disabling platform, embedding libraries, switching to external packaging by default.
      • encourage redistributors to subscribe and comment.
    • Second part of the question:
      • Can you NOT disable unsupported platforms?
      • We list the platforms we test, and say support.
      • Brian will add to issue something to the NEWS
    • Issue 4563 - Jeff will list this in the FAQ of places for redistributors to join.
      • developers should join this list too.
  11. Software-based performance counters

    • Expose some of the internal performance issues to tools. Either go the way that MPI_T is going, or go the way PAPI is going (SDC - software defined counters).
      • A whole list of counters. A number of things (out of sequences, time matching, etc).
      • high water mark, how long is the queue at certain points, how many out of sequence messages.
    • https://github.com/open-mpi/ompi/pull/4885
    • IBM: What's the compiler segv?
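
As a rough picture of the kind of counters discussed for this item (out-of-sequence messages and a queue high water mark), consider the Python sketch below. It is only an illustration of the bookkeeping; the class and counter names are hypothetical and are not taken from the pull request.

```python
class MatchCounters:
    """Counts out-of-sequence arrivals and tracks how long the holding queue got."""

    def __init__(self):
        self.out_of_sequence = 0
        self.queue_length = 0        # messages currently parked behind a gap
        self.queue_high_water = 0    # largest queue_length ever observed
        self._next_seq = {}          # peer -> next expected sequence number
        self._pending = {}           # peer -> set of parked sequence numbers

    def _adjust_queue(self, delta):
        self.queue_length += delta
        self.queue_high_water = max(self.queue_high_water, self.queue_length)

    def message_arrived(self, peer, seq):
        expected = self._next_seq.get(peer, 0)
        pending = self._pending.setdefault(peer, set())
        if seq != expected:
            self.out_of_sequence += 1
            pending.add(seq)
            self._adjust_queue(+1)
            return
        expected += 1
        while expected in pending:   # drain messages the gap was holding back
            pending.remove(expected)
            self._adjust_queue(-1)
            expected += 1
        self._next_seq[peer] = expected
```

A tool would read `out_of_sequence` and `queue_high_water` after the run (or sample them periodically) rather than instrumenting the application itself.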
  12. Improve Jenkins reliability

    • We have regular problems with the Jenkins testers yielding false positives (e.g., full disks). These failures sometimes occur during inconvenient times such as on weekends or USA holidays when people are not available to fix them. This leaves non-USA developers (and others working on their own time) with no recourse.
    • Could/should we provide a bot to repair identifiable problems?
    • training / documentation could help bring more to help.
    • Other options?
  13. More Jenkins testing:

    • Absoft running in EC2
      • Have license -- will install when possible.
    • ??NAG running in EC2??
      • No reply since March 5; NAG probably not interested.
  14. MTT update - status of the Python based client, server, viewer.

    • Walkthrough of how to move from the Perl to Python client for Open MPI testing. (Howard?)
    • Howard discuss status of server. Initial Intel implementation is more general purpose framework, so howard added some open mpi specifics.
    • Shows some specific minor picky things.
    • .ini is NOT a python script, perl HAD functlets, but that's NOT in this, plugins gets you much of that though.
    • Yapsi - plugin manager to load other stuff
    • python 2 and 3 (supposed to be both)
    • https://github.com/open-mpi/mtt/wiki/How-to-Set-up-Python-Virtual-Environments
      • Don't have to use this virtual env, but it's nice. Howard was going to have someone check to see if it works with Anaconda (DOE), he's only run as root or made a virtual env.
      • Does Pip use anaconda? Don't know. Pip is more general way... PIP install
      • Brian, It would be awesome to just 'pip install MTT'. Howard agrees.
    • Question - we're at the point of giving some trials and discuss, not yet ready for everyone to switch.
      • python client doesn't do chksums, but Brian says they should and do it like they figured out.
    • Mtt-modern - idea is wrap the interfaces in REST interfaces, and get someone to write a new reporter/viewer based on cherry-py.
    • Jeff, can I get a sample file, a run and a submission.
    • Intel documented a README on how to create Dox documents. It's just not pushed to the web. Sample ini file?
    • Directory under samples with inis.
    • Getting started? (maybe, certainly in doxygen docs but not uploaded to web).
      • closest thing is that make virtual env.
    • How to setup cherry py server - https://github.com/open-mpi/mtt/wiki/MTT-Cherrypy-Server (not needed if submitting to Open MPI cherry py)
    • Cherry Py submission API has been beaten on pretty well.
    • Can't incrementally upload to database; everything has to run to completion before anything shows up.
    • Intel discussed possibly moving from Cherry py REST to something more robust like Django.
    • Think about partial updates. - Ralph has a plugin to do this, he will talk to about uploading that.
    • Background, Ralph was tasked with a hardware bring up test/database.
      • If you look at it, you'll see a lot of stuff beyond MTT tests.
      • For example there are harassers which will load the machine down.
    • Stuff to shoot the hardware from under you, to see how you handle.
    • Intel has a bunch of plugins to do various things.
      • Noel and Ricky have been doing much of the work.
  15. Encourage people to use "unset" MCA params (vs. sentinel values).

    • Sentinel values, and where it came from.
    • Mathias has added a new value like "unset". If you have components that use -1, you should switch to 'unset', since that looks better.
    • Auto-bool enumerator. mpi-leave-pinned takes -1, 1, 0. -1 means if I can do it, turn it on.
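
The three-valued "auto-bool" behavior described for mpi-leave-pinned (-1, 1, 0, with -1 meaning "turn it on if possible") can be sketched as an enumerator plus a resolution step. This Python sketch is illustrative only; the `AutoBool` type, the accepted spellings such as "auto", and the `resolve` helper are hypothetical and do not describe the MCA variable system's API.

```python
from enum import Enum


class AutoBool(Enum):
    AUTO = -1    # "if I can do it, turn it on"
    FALSE = 0
    TRUE = 1


def parse_auto_bool(text):
    """Accept the numeric sentinels as well as readable names (names are assumed)."""
    names = {"-1": AutoBool.AUTO, "auto": AutoBool.AUTO,
             "0": AutoBool.FALSE, "false": AutoBool.FALSE,
             "1": AutoBool.TRUE, "true": AutoBool.TRUE}
    try:
        return names[str(text).strip().lower()]
    except KeyError:
        raise ValueError("not an auto-bool value: %r" % (text,))


def resolve(setting, supported):
    """Turn the three-valued setting into the on/off decision."""
    if setting is AutoBool.AUTO:
        return supported
    return setting is AutoBool.TRUE
```

Naming the third state keeps the "let the library decide" intent visible in the parameter value instead of hiding it behind a bare -1.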
  16. Dealing with long standing Issues

  17. open Issue Round-up

  18. old PR Round-up

  19. Default binding policy considering #4799.

    • Answered on ticket and closed - concluded that current default bindings are the correct ones for OMPI
  20. MPIR deprecation warning

    • Add to NEWS?
    • Output when attached?
    • Replaced
  21. What to do about unsupported platforms (e.g., in the context of POWER 7/BE)

    • Alastair Mc. made some good points about just letting "unsupported" platforms build and let people know THIS IS UNSUPPORTED!: https://github.com/open-mpi/ompi/issues/4349#issuecomment-364382688
    • E.g., should we re-enable POWER BE under this nomenclature? We don't know that it's broken -- we just know that we disabled it in v2.0.x and v2.1.x... for some reason.
  22. How do we expose MCA params in component packages such as PMIx?

    • Do we need some kind of "registration" API that ompi_info can call to harvest them?
  23. When using multiple MPI_COMM_DUPs simultaneously in multiple threads, we barf.

  24. Envar version of allow-run-as-root for container folks who keep complaining about it?

  25. MPI_Init Connectivity Map (IBM)

    • See https://github.com/open-mpi/ompi/issues/30
    • Print a 2D table to STDOUT to see how rank X is communicating with Y.
    • Right now we just walk a bunch of function pointers to see what BTLs, MTLs, and PMLs are loaded, but it's a hack.
    • Don't go to the level of whether UCX / libfabric is being used; just get the top level.
    • If EACH MCA gave a short string, and a LONG string.
    • If we add this at the PML function, then the various PMLs can report what they're using.
      • PML can then walk MTL or BTL. (extending BTL will be a bit challenging)
    • We're assuming there is one way to get from X to Y.
    • Current IBM implementation assumes only one way to get from here to there, and today only prints host-to-host connectivity.
    • PML for now is good; keep OSC and COLL for the future (less useful, harder, etc.).
    • Consensus: it's great. Come up with an interface, and Jeff will help prototype on usnic.
      • Needed by mid-July for v4.0
    • Would be nice for XML / JSON, etc. Admins love this.
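The 2D table and the JSON variant discussed above could look roughly like the sketch below. It is a Python illustration only, under the same assumption stated in the notes that there is exactly one way to get from X to Y. The function names, the host names, and the short transport strings in the example data are all placeholders, not output from the IBM implementation.

```python
import json


def format_connectivity_table(hosts, transport):
    """Return a 2D text table of host-to-host connectivity.

    `transport` maps (src, dst) -> short string reported for that pair.
    Pairs with no entry are shown as "-".
    """
    names = list(hosts) + list(transport.values()) + ["-"]
    width = max(len(n) for n in names)
    header = " " * width + " " + " ".join(h.ljust(width) for h in hosts)
    lines = [header.rstrip()]
    for src in hosts:
        cells = [transport.get((src, dst), "-").ljust(width) for dst in hosts]
        lines.append((src.ljust(width) + " " + " ".join(cells)).rstrip())
    return "\n".join(lines)


def connectivity_as_json(hosts, transport):
    """Same data as nested JSON, for tools that prefer machine-readable output."""
    nested = {
        src: {dst: transport.get((src, dst)) for dst in hosts}
        for src in hosts
    }
    return json.dumps(nested, sort_keys=True)


# Example data (placeholders): two hosts, one transport per pair.
example_hosts = ["node0", "node1"]
example_transport = {
    ("node0", "node0"): "self",
    ("node0", "node1"): "tcp",
    ("node1", "node0"): "tcp",
    ("node1", "node1"): "self",
}
print(format_connectivity_table(example_hosts, example_transport))
```

Keeping the data as a plain (src, dst) mapping means the same collection step can feed both the STDOUT table and a JSON or XML dump.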
  26. Fujitsu's status

    • Persistent collective operations
    • MTT run on SPARC
    • Other development status
  27. Spark-MPI-TensorFlow

    • Ralph will provide a presentation describing what has been done, if there is interest
    • Initiate discussion on possible MPI Sessions role
      • Sessions isn't quite the same, not about dynamic process management.
    • End up with lots of idle time, unless you can have dynamic resource allocation to return resources back to the system (for better efficiency)
    • George saw a paper on using UFLM and had some success with that approach.
  28. Endpoint management (Ralph, Jeff, Howard)

    • How to handle multiple libraries/plugins creating libfabric endpoints when "instant on" provides single endpoint?
    • Can we define a single rendezvous connection point for each proc, and then exchange endpoint-specific info via that?
    • Does that require an endpoint manager plugin for OFI?
  29. Check padding on MPI predefined objects to ensure adequate room for lifetime of 4.0

  30. Jenkins reliability:

    • AWS instances get re-created all the time. Disk full and what not are not much of an issue there.
    • Mellanox has random failures (disk full, UCX problems, etc.).
    • IBM drops offline sometimes -- they get no warnings when this will happen.
    • Jenkins can't easily handle "error" cases (e.g., IBM offline). Don't know how to make that better.
    • Have seen some github failures recently:
      • On the required email checks, PHP is claiming it didn't get a full JSON blob.
      • We should start logging what is missing and sending some kind of alert (e.g., email).
      • These two CI tests need to be made more reliable, since they are required checks.
    • What about the weekend / holiday scenarios (i.e., something went wrong and someone can't look at it for days)
    • We don't know how to balance: I have something to merge, but someone can't fix CI for days
    • Should we add a "Failed CI" tag? It would help, but it doesn't necessarily make anything fundamentally better.
    • We don't really have better answers here. 😦
  31. Status of coll/sm component

    • George indicates that this component might be working, but it remains relatively untested and probably shouldn't be used
    • If the folks asking about it do test it, we would like them to let us know if it works and/or any patches it required
  32. Re-evaluate compiler support for Open MPI v4.0.0 (drop older than gcc 5 support, etc)

    • Amazon, IBM, Cisco are still interested in older compilers
    • But some, like IBM, are okay with requiring new compilers for configure/build, but want older compilers for customer apps.
    • RHEL6 is End of Life 2020, no GCC v5 RPMs.
    • gcc 4.8.5 is default on RHEL7 past that.
    • Customers using RHEL6 and RHEL7.
    • Developing new stuff that only works with new stuff is fine, but breaking things that currently work is bad.
    • OPAL_C_HAS_THREAD_LOCAL is another good candidate for a macro to search for.
    • gcc v5 is the first line we would like to cross.
    • The next line is Intel 2018, which is when they first added C11 atomics.