
Final Report


Abstract

The Mochi project explores a software-defined storage approach for composing storage services that provide new levels of functionality, performance, and reliability for science applications at extreme scale. One key component of Mochi is its use of OpenFabrics Interfaces (OFI), a framework focused on exporting fabric communication services to applications. Libfabric is a core component of OFI. We have developed a test suite, fabtsuite, to assess libfabric for the features that the Mochi project requires.

Introduction

The Mochi project [1] provides a set of software building blocks for rapid construction of custom, composable HPC data services. These building blocks rely upon the libfabric [2] network transport library, by way of the Mercury RPC library [3], for high-performance network support on most HPC platforms.

Mochi and Mercury impose specific requirements on network transport libraries, and libfabric’s OFI interface meets all of these requirements. We have found, however, that many of the features that we rely upon are not as thoroughly tested as the features that are more routinely used in other environments (e.g., by message-passing libraries). The objective of the Mochi libfabric test suite, fabtsuite, is to evaluate the specific libfabric features required by Mochi in a controlled, reproducible manner, with minimal external dependencies, so that stakeholders can quickly assess the suitability of different providers for use in HPC data services.

Problem

We identify seven features that Mochi requires but that are not thoroughly tested in exascale HPC environments.

FI_WAIT_FD File Descriptor

FI_WAIT_FD provides a file descriptor that users can block on (without busy spinning) until there are events to process in the libfabric interface. This allows data services to minimize compute resource usage when they are idle while still responding in a timely manner to new requests.
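The sketch below illustrates the pattern, assuming a domain has already been opened; it is not fabtsuite's exact code, and production code should also call fi_trywait() before blocking.

    #include <poll.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>

    /* Open a CQ whose wait object is a file descriptor, then block in poll()
     * until the provider signals that completions are ready to be read.
     * 'domain' is assumed to be an already-opened struct fid_domain *. */
    static int wait_for_completion(struct fid_domain *domain)
    {
        struct fi_cq_attr cq_attr = {
            .format = FI_CQ_FORMAT_CONTEXT,
            .wait_obj = FI_WAIT_FD,         /* ask for a pollable descriptor */
        };
        struct fid_cq *cq;
        struct pollfd pfd = { .events = POLLIN };
        int rc;

        rc = fi_cq_open(domain, &cq_attr, &cq, NULL);
        if (rc != 0)
            return rc;

        /* Retrieve the file descriptor behind the CQ's wait object. */
        rc = fi_control(&cq->fid, FI_GETWAIT, &pfd.fd);
        if (rc == 0)
            rc = poll(&pfd, 1, -1);         /* sleep until events arrive */

        fi_close(&cq->fid);
        return rc;
    }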

fi_cancel() Function

fi_cancel() is used to cancel pending asynchronous operations and permit the reuse of memory buffers that were previously associated with those operations. This is necessary for graceful shutdown of persistent services that have pending buffers posted. It is also necessary for fault tolerant use cases, in which a process must cancel pending operations before re-issuing them to a secondary server.
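As an illustration (a sketch, not fabtsuite's implementation), cancelling a pending receive identified by its context pointer and draining the FI_ECANCELED error completion might look like the following; the endpoint and completion queue are assumed to be already initialized.

    #include <string.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_endpoint.h>
    #include <rdma/fi_errno.h>

    /* Cancel a pending receive identified by its context pointer, then drain
     * the resulting error completion so the buffer can be reused or freed.
     * 'ep' and 'cq' are assumed to be an already-initialized endpoint and a
     * completion queue bound to it. */
    static int cancel_pending_recv(struct fid_ep *ep, struct fid_cq *cq,
                                   void *context)
    {
        int rc = (int)fi_cancel(&ep->fid, context);
        if (rc != 0)
            return rc;

        /* A cancelled operation completes with an error entry whose err
         * field is FI_ECANCELED.  Real code would bound this retry loop. */
        struct fi_cq_err_entry err;
        memset(&err, 0, sizeof(err));
        ssize_t n;
        do {
            n = fi_cq_readerr(cq, &err, 0);
        } while (n == -FI_EAGAIN);

        if (n > 0 && err.err == FI_ECANCELED && err.op_context == context)
            return 0;   /* the buffer is now safe to reuse */
        return -1;
    }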

Cross-Job Communication

HPC applications or collections of processes are typically launched with a parallel job launcher, like mpiexec, aprun, or srun. For many data service use cases, we need to be able to launch service daemons separately from applications. The separately launched collections of processes must be able to communicate with each other, exchanging both messages and RMA transfers.
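One common pattern for bootstrapping such communication, shown below as a hedged sketch rather than fabtsuite's actual mechanism, is to publish the server endpoint's raw address out of band (here, through a shared file) and insert it into the client's address vector.

    #include <stdio.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_cm.h>
    #include <rdma/fi_domain.h>

    /* Server side: write the endpoint's raw address to a file that the
     * separately launched client job can read. */
    static int publish_address(struct fid_ep *ep, const char *path)
    {
        char addr[128];
        size_t addrlen = sizeof(addr);
        int rc = fi_getname(&ep->fid, addr, &addrlen);
        if (rc != 0)
            return rc;

        FILE *f = fopen(path, "wb");
        if (f == NULL)
            return -1;
        fwrite(addr, 1, addrlen, f);
        fclose(f);
        return 0;
    }

    /* Client side: read the published address and insert it into the
     * address vector to obtain an fi_addr_t for messaging and RMA. */
    static int lookup_peer(struct fid_av *av, const char *path, fi_addr_t *peer)
    {
        char addr[128];
        FILE *f = fopen(path, "rb");
        if (f == NULL)
            return -1;
        size_t addrlen = fread(addr, 1, sizeof(addr), f);
        fclose(f);
        if (addrlen == 0)
            return -1;

        int n = fi_av_insert(av, addr, 1, peer, 0, NULL);
        return (n == 1) ? 0 : -1;
    }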

Graceful Resource Clean-up

Mercury uses connectionless RDM endpoints in libfabric, but some providers internally allocate resources or memory buffers on a per-peer basis. We need to make sure that these resources are correctly released when those peers exit. In the data service use case, a single daemon may be relatively long-lived as a large number of clients connect to it and exit over time.
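Conceptually, a clean teardown closes child objects before their parents, as in the following sketch; the exact set of objects varies by application and is an assumption here.

    #include <rdma/fabric.h>
    #include <rdma/fi_domain.h>
    #include <rdma/fi_endpoint.h>

    /* Release libfabric resources in child-before-parent order.  Any of the
     * pointers may be NULL if the corresponding object was never opened. */
    static void teardown(struct fid_ep *ep, struct fid_cq *cq, struct fid_av *av,
                         struct fid_domain *domain, struct fid_fabric *fabric,
                         struct fi_info *info)
    {
        if (ep)     fi_close(&ep->fid);      /* endpoints first */
        if (cq)     fi_close(&cq->fid);      /* then completion queues */
        if (av)     fi_close(&av->fid);      /* and address vectors */
        if (domain) fi_close(&domain->fid);  /* then the domain */
        if (fabric) fi_close(&fabric->fid);  /* finally the fabric */
        if (info)   fi_freeinfo(info);
    }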

Multithreading

HPC systems are trending towards systems with many cores sharing a single network interface. Data services will likely utilize multithreading to make more efficient use of these systems. We need to ensure that multithreaded workloads operate correctly and do not present a significant performance degradation.
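A caller can ask libfabric for a thread-safe provider configuration through the fi_getinfo() hints, as in the sketch below; the capability bits and API version shown are illustrative assumptions, not fabtsuite's exact settings.

    #include <rdma/fabric.h>

    /* Ask libfabric for providers that are safe to call from multiple threads
     * by setting the threading model in the fi_getinfo() hints. */
    static struct fi_info *get_thread_safe_info(void)
    {
        struct fi_info *hints = fi_allocinfo();
        struct fi_info *info = NULL;

        if (hints == NULL)
            return NULL;

        hints->ep_attr->type = FI_EP_RDM;            /* connectionless RDM */
        hints->caps = FI_MSG | FI_RMA;               /* messages and RMA */
        hints->domain_attr->threading = FI_THREAD_SAFE;

        if (fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, hints, &info) != 0)
            info = NULL;

        fi_freeinfo(hints);
        return info;
    }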

Vectored I/O

HPC applications routinely generate structured data. For example, a column of a multidimensional array may describe particles in a molecular dynamics code that generates trillions of rows. Being able to describe those non-contiguous requests with a list of offset-length vectors, and then transferring the request in a single operation, gives the network transport a chance to deliver the data with lower latency and higher bandwidth.
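For example, fi_writev() lets a sender push several non-contiguous local segments to a remote region in one RMA operation; in the sketch below the registration descriptors, peer address, remote offset, and key are assumed to have been set up earlier.

    #include <sys/uio.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_rma.h>

    /* RDMA-write two non-contiguous local segments to one contiguous remote
     * region in a single operation.  'desc', 'peer', 'remote_addr', and 'key'
     * come from earlier memory registration and address exchange. */
    static ssize_t write_two_segments(struct fid_ep *ep,
                                      void *seg0, size_t len0,
                                      void *seg1, size_t len1,
                                      void **desc, fi_addr_t peer,
                                      uint64_t remote_addr, uint64_t key,
                                      void *context)
    {
        struct iovec iov[2] = {
            { .iov_base = seg0, .iov_len = len0 },
            { .iov_base = seg1, .iov_len = len1 },
        };
        return fi_writev(ep, iov, desc, 2, peer, remote_addr, key, context);
    }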

MPI Interoperability

When an application uses both MPI and libfabric, MPI may use libfabric in a way that conflicts with the application's own use of libfabric.
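A minimal interoperability check, sketched below (this is not the planned fabtsuite test, which is not yet implemented), is to initialize MPI and query libfabric directly in the same process and confirm that both remain usable.

    #include <mpi.h>
    #include <stdio.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_errno.h>

    /* Minimal interoperability check: initialize MPI (which may itself be
     * built on libfabric) and then query libfabric directly from the same
     * process.  Both libraries should coexist without interfering. */
    int main(int argc, char **argv)
    {
        int rank;
        struct fi_info *info = NULL;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        int rc = fi_getinfo(FI_VERSION(1, 15), NULL, NULL, 0, NULL, &info);
        if (rank == 0)
            printf("fi_getinfo: %s\n", rc == 0 ? "ok" : fi_strerror(-rc));

        if (info)
            fi_freeinfo(info);
        MPI_Finalize();
        return 0;
    }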

Background

Libfabric supports many providers, and a provider may use a memory-registration cache. Libfabric ships a test suite, fabtests [4], for the different providers, but it does not test MPI applications that call MPI APIs. Its multi-node tests are launched over ssh, and memory-leak testing is performed with Valgrind.

The Mercury NA layer has some basic tests [5]. Mercury uses a plugin architecture for libfabric and MPI, and it uses the CMake build and testing system; the Mercury tests require KWSys. AddressSanitizer is integrated with CMake to check for resource leaks.

Solution

One Program, Two Applications

We have developed fabtsuite. It consists of one test program that runs either as a transmitter (fabtput) or as a receiver (fabtget). Using command-line options, a user can select different operating modes, for example to compare single- and multi-threaded (MT) libfabric operation, or to compare contiguous (fi_write) and vector (fi_writev) transfers.

fabtput RDMA-writes multiple non-contiguous buffer segments at once using single fi_writemsg(3) calls.
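The call pattern looks roughly like the sketch below, which packs the local segments and the remote target into one struct fi_msg_rma; the completion flag and helper signature are assumptions rather than fabtsuite's exact code.

    #include <sys/uio.h>
    #include <rdma/fabric.h>
    #include <rdma/fi_rma.h>

    /* Issue one fi_writemsg() that transfers 'count' non-contiguous local
     * segments to a single remote region in one RDMA write. */
    static ssize_t write_segments(struct fid_ep *ep,
                                  const struct iovec *iov, void **desc,
                                  size_t count, fi_addr_t peer,
                                  uint64_t remote_addr, uint64_t key,
                                  void *context)
    {
        struct fi_rma_iov rma = {
            .addr = remote_addr,    /* offset or virtual address at the peer */
            .len = 0,               /* total length, filled in below */
            .key = key,             /* remote memory-registration key */
        };
        for (size_t i = 0; i < count; i++)
            rma.len += iov[i].iov_len;

        struct fi_msg_rma msg = {
            .msg_iov = iov,
            .desc = desc,
            .iov_count = count,
            .addr = peer,
            .rma_iov = &rma,
            .rma_iov_count = 1,
            .context = context,
            .data = 0,
        };
        return fi_writemsg(ep, &msg, FI_COMPLETION);
    }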

Single Local Test Script

We wrote a single test script that exercises all options on a single machine. This script helps users ensure that the two applications can communicate properly.

The script uses several environment variables to adjust test parameters.

Name                      Purpose
FABTSUITE_CANCEL_TIMEOUT  Cancel the transfer after the given number of seconds (default: 2).
FABTSUITE_RANDOM_FAIL     Fail randomly (default: no).
FI_MR_CACHE_MAX_SIZE      Setting this to 0 disables the memory-registration cache.
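For illustration, a test program might consume these variables roughly as follows; the accepted values and the parsing code are assumptions, and only the variable names and defaults come from the table above.

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    /* Illustrative parsing of the test-suite environment variables listed
     * above; defaults match the table (2 seconds, no random failure). */
    static int cancel_timeout_seconds(void)
    {
        const char *s = getenv("FABTSUITE_CANCEL_TIMEOUT");
        return (s != NULL && atoi(s) > 0) ? atoi(s) : 2;
    }

    static bool random_fail_enabled(void)
    {
        const char *s = getenv("FABTSUITE_RANDOM_FAIL");
        return s != NULL && strcmp(s, "yes") == 0;   /* assumed yes/no values */
    }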

Multiple Scripts for Multi-Node Testing with CTest

We wrote parallel batch job scripts for multi-node testing. They can be submitted easily through CTest, and both PBS and Slurm are supported.

Experiment

We tested one local system (A) and two HPC systems (B and C) for six of these features. The following table summarizes the pass (P) / fail (F) results.

Feature   A   B   C
wait      P   P   P
cancel    P   P   F
cross     P   F   P
thread    P   P   P
vector    P   P   P
MPI-IO    N   N   N

The MPI-IO feature test is not yet implemented (N).

Discussion

The wait test requires a slightly longer time allocation than the other tests; otherwise, the receiver job will not finish in time to generate its output.

The cancel test failure on C was caused by an INT signal:

    fabtget: caught a signal, exiting.
    real 1.99
    user 0.53
    sys 1.38
    1
    srun: error: SystemB: task 0: Exited with exit code 1
    srun: launch/slurm: _step_signal: Terminating StepId=186257.0

The cross test failure on B is due to a system queue issue that does not allow jobs to run on more than three machines.

Conclusion

Although libfabric provides an extensive set of unit tests, our test suite offers an alternative set of tests that target exascale HPC use cases.

Future Work

Our test suite is fully open source, and anyone can extend it; for example, one can add a new command-line option to test a new feature.

We provide sample test scripts for HPC systems.

Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative.

Glossary

CQ: Completion Queue

MR: Memory Registration

NA: Network Abstraction

RDMA: Remote Direct Memory Access

RMA: Remote Memory Access

rcvr: receiver

xmtr: transmitter

References

  1. https://mochi.readthedocs.io/
  2. https://ofiwg.github.io/libfabric/
  3. https://mercury-hpc.github.io/
  4. https://github.com/ofiwg/libfabric/tree/main/fabtests
  5. https://github.com/mercury-hpc/mercury/tree/master/Testing/na