Saïd ulfm framework

Visit the Saïd GitHub project and clone it:

git clone https://github.com/samirouche/ft_uflm.git

Install and test ulfm docker image

We follow this tutorial: http://fault-tolerance.org/2017/02/10/sc16-tutorial/

Install docker on a Debian sid machine

Here is the tutorial to install docker on Debian: https://docs.docker.com/engine/installation/linux/debian/#install-using-the-repository

On my sid machine, however, the docker-ce install failed, while the docker.io package from the baseline sid distribution worked directly.

Check debian release

What is my distribution?

lsb_release -cs
sid

Debian packages pre-requisites

sudo aptitude install apt-transport-https ca-certificates curl software-properties-common

Debian repository for docker-ce install

sudo aptitude update 
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian  $(lsb_release -cs) stable"
sudo aptitude update 

Install docker package

According to the Docker tutorial for Debian, we should install the docker-ce package:

sudo apt-get install docker-ce

However, this failed on my sid distribution and I simply installed the docker.io package:

sudo aptitude install docker.io

Hello world test

sudo docker run hello-world

Set privileges

sudo usermod -a -G docker username
su - username

Deploy ulfm docker image

Get ulfm docker image

ULFMREP=/tmp/ulfm
mkdir $ULFMREP
cd $ULFMREP
wget http://fault-tolerance.org/downloads/docker-ftmpi.sc16tutorial.tar.xz
tar xf docker-ftmpi.sc16tutorial.tar.xz 
--2017-03-30 19:23:18--  http://fault-tolerance.org/downloads/docker-ftmpi.sc16tutorial.tar.xz
Resolving fault-tolerance.org (fault-tolerance.org)... 160.36.131.221
Connecting to fault-tolerance.org (fault-tolerance.org)|160.36.131.221|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 39673176 (38M) [application/x-tar]
Saving to: 'docker-ftmpi.sc16tutorial.tar.xz'

2017-03-30 19:23:22 (8.07 MB/s) - 'docker-ftmpi.sc16tutorial.tar.xz' saved [39673176/39673176]

Load the pre-compiled ULFM Docker machine into your Docker installation

sudo docker load < ftmpi_ulfm-1.1.tar
. dockervars.sh load

Test an example

Compile examples

cd ulfm-testing/tutorial
make

Test

Launch mpi with ulfm enabled.

Note the special `-am ft-enable-mpi` parameter; if this parameter is omitted, the non-fault tolerant version of Open MPI is launched and applications containing failures will automatically abort.

mpirun -am ft-enable-mpi -np 10 ex2.1.revoke

Testing ulfm examples

In order to start handling ULFM and understanding it in depth, I started by analysing, compiling and running the examples from the ULFM page, from the basic ones up to revoking a communicator and reconstructing another healthy one that contains the same ranks.

Determine the ranks of the failed processes

The examples Single failed, Multiple failed, Exchange value between 2 consecutive processes, and Receive from any source differ slightly in their behaviour (they exercise basic MPI operations), but for the ULFM part they share a common task: determining the ranks of the failed processes.

We took the examples Single failed and Multiple failed, which demonstrate how ULFM detects failures and identifies their ranks. After setting an error handler, a sample test kills one or more processes: the last one in the first example, half of the remaining processes in the second. Once inside the error handler, the program calls MPIX_Comm_failure_ack and MPIX_Comm_failure_get_acked to acknowledge the failed processes on the communicator and gather them into a group; at the same time it builds the group of the original communicator and uses MPI_Group_translate_ranks to determine the failed ranks, resulting in an array of failed ranks.

mpirun -am ft-enable-mpi -np 4 ex0.3.report_one
mpirun -am ft-enable-mpi -np 8 ex0.4.report_many
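
Below is a minimal, hedged sketch of that detection pattern. It is not the tutorial's exact code: the helper name report_failures and the buffers are illustrative.

#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions (MPIX_* calls) */
#include <stdio.h>
#include <stdlib.h>

/* Print the rank, in comm, of every process known to have failed. */
static void report_failures(MPI_Comm comm)
{
    MPI_Group failed_grp, comm_grp;
    int nfailed;

    /* acknowledge the locally known failures and retrieve them as a group */
    MPIX_Comm_failure_ack(comm);
    MPIX_Comm_failure_get_acked(comm, &failed_grp);
    MPI_Group_size(failed_grp, &nfailed);

    /* translate the ranks of the failed group into ranks of the communicator */
    MPI_Comm_group(comm, &comm_grp);
    int *grp_ranks  = malloc(nfailed * sizeof(int));
    int *comm_ranks = malloc(nfailed * sizeof(int));
    for (int i = 0; i < nfailed; i++)
        grp_ranks[i] = i;
    MPI_Group_translate_ranks(failed_grp, nfailed, grp_ranks, comm_grp, comm_ranks);

    for (int i = 0; i < nfailed; i++)
        printf("rank %d has failed\n", comm_ranks[i]);

    free(grp_ranks);
    free(comm_ranks);
    MPI_Group_free(&failed_grp);
    MPI_Group_free(&comm_grp);
}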

Revoke, shrink and reconstruct the communicator

In order to understand how ULFM proceeds in fixing a communicator that contains failures, I used this global example, which illustrates a communicator with a failure that is shrunk and then completed by spawning a new process with the same old rank.

To do so, a process is killed just before a broadcast, and a repair function, MPIX_Comm_replace, is set up by hand to handle this type of failure. Inside this function, the communicator is shrunk to remove all the dead processes, and the number of failed processes to be respawned is determined. Once they are spawned, a manipulation of ranks restores the original numbering: rank 0 in the shrunk communicator sends the rank assignments to the new processes.

Once the repair function has finished, another broadcast is performed, this time to all the surviving ranks and the spawned ones.
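
A hedged sketch of such a repair routine, in the spirit of the tutorial's MPIX_Comm_replace, is given below; the function name comm_replace, the spawn command "./myapp" and the overall structure are assumptions, not the tutorial's exact code.

#include <mpi.h>
#include <mpi-ext.h>

/* Rebuild a usable communicator after one or more failures. */
static int comm_replace(MPI_Comm worldcomm, MPI_Comm *newcomm)
{
    MPI_Comm shrunk, intercomm, merged;
    int nworld, nsurvivors, nspawn;

    /* 1. revoke the damaged communicator so every survivor abandons
          pending operations, then shrink it to keep only live processes */
    MPIX_Comm_revoke(worldcomm);
    MPIX_Comm_shrink(worldcomm, &shrunk);

    /* 2. the size of the original communicator does not change after a
          failure, so the difference gives the number of processes to respawn */
    MPI_Comm_size(worldcomm, &nworld);
    MPI_Comm_size(shrunk, &nsurvivors);
    nspawn = nworld - nsurvivors;

    /* 3. spawn the replacements collectively over the shrunk communicator
          and merge the intercommunicator into a single intracommunicator
          (survivors keep the low ranks) */
    MPI_Comm_spawn("./myapp", MPI_ARGV_NULL, nspawn, MPI_INFO_NULL,
                   0, shrunk, &intercomm, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(intercomm, 0, &merged);

    /* 4. the original rank numbering is then restored, e.g. with
          MPI_Comm_split keyed on the old ranks (see the later sketch) */
    *newcomm = merged;
    return MPI_SUCCESS;
}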

Preliminary personal tests

Download FT_ULFM

git clone https://github.com/samirouche/ft_uflm.git

Compile the source code

cd ft_uflm/Resilient_version/
 . dockervars.sh load
make

First personal example using ULFM

After reviewing and analysing all the examples in the tutorial provided by the ULFM team, I found a common point: the part that fixes failures is always written as a function (a protocol to follow). Each time a process fails, the program calls MPIX_Comm_replace; inside this function the communicator is revoked and shrunk, then as many processes as have failed are spawned and added to the shrunk communicator.

Each time a process enters that function, if it is a survivor it takes the same rank as before and then continues its work; if it is new, it gets its rank assignment from rank 0 and joins the work.

My example is a simple broadcast performed after killing a process and then calling the repair function; once the communicator has been revoked and shrunk and the failed processes respawned, it performs another broadcast. The results are here. Its goal is to apply ULFM and reconstruct a communicator that keeps the same rank assignment as the original one.

mpirun -am ft-enable-mpi -np 4 ulfm_ranks
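
The renumbering step described above can be sketched as follows; this is a hedged illustration with invented names (merged is the intracommunicator obtained after the merge, old_rank is the process's rank in the original communicator, or -1 for a freshly spawned process), not the repository's code.

#include <mpi.h>

/* Give every process of the merged communicator its place in the rebuilt
   communicator: survivors keep their old rank, replacements receive theirs
   from rank 0. */
static void restore_ranks(MPI_Comm merged, int old_rank, int nsurvivors,
                          int nspawn, const int *failed_ranks, MPI_Comm *repaired)
{
    int me, key;
    MPI_Comm_rank(merged, &me);

    if (me == 0) {
        /* rank 0 of the shrunk communicator tells each replacement which
           old rank it has to take over */
        for (int i = 0; i < nspawn; i++)
            MPI_Send(&failed_ranks[i], 1, MPI_INT, nsurvivors + i, 1, merged);
    }

    if (old_rank >= 0)
        key = old_rank;                 /* survivor: keep the old rank */
    else
        MPI_Recv(&key, 1, MPI_INT, 0, 1, merged, MPI_STATUS_IGNORE);

    /* split with the target rank as key: the resulting communicator has
       the same rank assignment as the original one */
    MPI_Comm_split(merged, 0, key, repaired);
}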

Summation with resilience example

The respawn example is an excellent reference for MPI applications that need to shrink, spawn, and reconstruct a new communicator. I therefore reused parts of it, precisely the repair function and the modifications needed to make ULFM work, to update a simple summation example.

The main scenario is a simple summation using MPI during which a random process is killed. Once the master rank has sent the job's chunks to the other ranks and detects an error of type “MPI_ERR_PROC_FAILED” or “MPI_ERR_PENDING”, it calls the repair function, which is supposed to shrink the communicator containing the failed processes, then spawn replacements and merge them with the shrunk communicator so that each process takes back its old rank. My first implementation of this failed: it only works with 2 processes. With more than 2 processes it blocks and never finishes the work, because with 2 processes, once the master kills the other one, it can repair since no other processes are still working; with more than 2 processes, when the master enters the repair part, the remaining processes block while they are still working.

In order to avoid this problem, instead of checking every send from the master, I placed an MPI_Barrier and put the check on it: if an error of one of the previously mentioned types occurs, the repair function is called. Once the communicator is repaired and we are assured that no process is dead, the program moves to the recovery part: it re-sends the job's chunks to all the ranks and finishes the job successfully. This version works for any number of processes and illustrates the scenario described at the beginning.

mpirun -am ft-enable-mpi -np 4 sumulfm
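
A hedged sketch of that barrier-based check follows. It assumes the comm_replace routine sketched earlier, a hypothetical redistribute_chunks helper standing for the master's send phase, and the ULFM error classes MPIX_ERR_PROC_FAILED / MPIX_ERR_REVOKED (the situation referred to above as MPI_ERR_PROC_FAILED / MPI_ERR_PENDING).

/* inside the main loop, after the master has distributed the chunks;
   comm must use MPI_ERRORS_RETURN so errors are returned instead of fatal */
int rc = MPI_Barrier(comm);
if (rc != MPI_SUCCESS) {
    int eclass;
    MPI_Error_class(rc, &eclass);
    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
        MPI_Comm repaired;
        comm_replace(comm, &repaired);   /* shrink + respawn + renumber */
        comm = repaired;
        redistribute_chunks(comm);       /* the master re-sends the chunks */
    }
}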

Example

These results illustrate the summation of an array using MPI and ULFM with 8 processes.

More comprehensive examples

Building blocks

In general, to construct a resilient MPI application we need a set of building blocks, each contributing to the composition of the final resilient MPI application. It can be seen as the composition of 4 blocks: one containing the main goal of the application, another serving as a reference for using ULFM to build a resilient application, and two more designed for the recovery of static and dynamic data.

MPI numerical algorithms

This block presents the main problem to solve: the plain, working MPI source code without fault tolerance. It can be any kind of MPI application.

I took three examples, all MPI numerical algorithms, that I will need for my internship.

Summation

The summation example treated in the previous part

Jacobi

In order to practise resilience for parallel iterative methods, I decided to start by applying it to the Jacobi method. For that I use This code, which solves the Laplace equation in two dimensions with finite differences on a 12 x 12 mesh with only 4 processes.

CG

MPIX reference rebuild steps

There are many ways to apply the ULFM method. In our case we need to shrink, respawn, and reconstruct the communicator; the previous respawn example provides these features, so we will consider it our reference for building resilient applications. The main steps to take are:

Shrink

Create a new communicator that contains only the surviving processes

Get knowledge

Determine which processes failed and their count, which will be used just after for respawning and for the rank assignment

Respawn

Re-create the failed processes according to their count

Reconstruct

Merge the new safe communicator with the respawned processes in such a way that each one takes its previous rank in the old communicator

Recovery schemes for static data

The static data is the data supplied as input to the program by the user; for solving a linear system it is the matrix A and the vector B. Recovering it simply means re-initializing the input data as at the beginning.

Numerical recovery schemes for dynamic data

There are several ways to recover the dynamic data, i.e. the data obtained during the computation; some of these ways are:

Reset

One of the most basic recovery schemes: it consists in doing all the work again from the beginning.

Enforced Restart (EF)

It basically consists of saving the current state of the results so that, in case of failure, the computation can restart from that point rather than from the beginning. This is particularly important for long-running applications, which are highly at risk of failures.
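
A minimal sketch of this idea, assuming per-rank checkpoint files and illustrative helper names (the repository's actual code may organise this differently): every few iterations each rank writes its current iterate to stable storage, and after the communicator has been repaired every rank, including a respawned one, reloads the last checkpoint and resumes from there.

#include <stdio.h>

/* write the current iterate and iteration counter of this rank to disk */
static void save_checkpoint(const double *x_local, int n_local, int iter, int rank)
{
    char name[64];
    snprintf(name, sizeof(name), "ckpt_rank%03d.bin", rank);
    FILE *f = fopen(name, "wb");
    fwrite(&iter, sizeof(int), 1, f);
    fwrite(x_local, sizeof(double), n_local, f);
    fclose(f);
}

/* reload the last checkpoint; returns the iteration to resume from
   (0 if no checkpoint exists, i.e. restart from the initial guess) */
static int load_checkpoint(double *x_local, int n_local, int rank)
{
    char name[64];
    int iter = 0;
    snprintf(name, sizeof(name), "ckpt_rank%03d.bin", rank);
    FILE *f = fopen(name, "rb");
    if (f) {
        if (fread(&iter, sizeof(int), 1, f) != 1)
            iter = 0;
        fread(x_local, sizeof(double), n_local, f);
        fclose(f);
    }
    return iter;
}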

Linear Interpolation (LI)

We consider the Linear Interpolation (LI) scheme from https://hal.inria.fr/hal-01323192

Resilient algorithms by composing building blocks

For all the following numerical algorithms, we use The respawn example as the MPIX reference.

Resilient summation

For the summation of an array we used just the Reset method. Combining our previous example with our MPIX reference code, with some modifications, we obtained this resilient version; when executed with 8 processes it gives these results.

mpirun -am ft-enable-mpi -np 4 sumulfm

Resilient Jacobi

We took the previous Jacobi example and edited it so that it becomes resilient to process failures. The numerical recovery scheme for the dynamic data remains the user's choice; in our case we implemented the 3 previously mentioned ways:

Reset

The reset method for the Jacobi iterations is presented in This code; it does not require a numerical recovery of the dynamic data, only of the static data, as presented in the mentioned example.

The execution of this code with 4 processes gave us these results.

mpirun -am ft-enable-mpi -np 4 reset_jacobi

Enforced Restart

The checkpoint/restart method for the Jacobi iterations is illustrated in this code; it does not require a numerical recovery of the static data because it works only with the dynamic data after the first iteration. The results of the execution with 4 processes can be found here.

mpirun -am ft-enable-mpi -np 4 jacobi_check_restart

LI

For the Jacobi example, the linear interpolation consists in getting the ghost points from the neighbouring processes, which contain the values of the adjacent mesh points. Once we have them, we recover the static data for the failed processes and iterate locally on the local mesh until a stopping condition is reached.

After the interpolation, we continue the work in parallel, iterating until it converges or reaches a stopping condition.

This code illustrates the linear interpolation for the Jacobi example. These results were obtained by executing the code with:

mpirun -am ft-enable-mpi -np 4 LI_jacobi 
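
A hedged sketch of the LI recovery step, with illustrative names and layout assumptions (u is the nl x nc local block stored row-major, rows 0 and nl-1 being the ghost rows; up and down are the neighbour ranks, MPI_PROC_NULL at the domain boundary); it is not the repository's exact code.

#include <mpi.h>
#include <math.h>

/* All processes refresh their ghost rows, then the replacement process
   alone iterates its local Jacobi stencil, ghost rows held fixed, until a
   local tolerance is reached; the global parallel iteration then resumes. */
static void li_recover(double *u, int nl, int nc, int up, int down,
                       int is_replacement, double tol, MPI_Comm comm)
{
    /* refresh the ghost rows from the neighbouring processes */
    MPI_Sendrecv(&u[1*nc],      nc, MPI_DOUBLE, up,   0,
                 &u[(nl-1)*nc], nc, MPI_DOUBLE, down, 0,
                 comm, MPI_STATUS_IGNORE);
    MPI_Sendrecv(&u[(nl-2)*nc], nc, MPI_DOUBLE, down, 1,
                 &u[0],         nc, MPI_DOUBLE, up,   1,
                 comm, MPI_STATUS_IGNORE);

    if (!is_replacement)
        return;

    /* local interpolation sweeps: interior updated, boundaries held fixed */
    double res = tol + 1.0;
    while (res > tol) {
        res = 0.0;
        for (int i = 1; i < nl - 1; i++)
            for (int j = 1; j < nc - 1; j++) {
                double unew = 0.25 * (u[(i-1)*nc + j] + u[(i+1)*nc + j]
                                    + u[i*nc + j - 1] + u[i*nc + j + 1]);
                res = fmax(res, fabs(unew - u[i*nc + j]));
                u[i*nc + j] = unew;
            }
    }
}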

Resilient CG

CG/LI

  • MPI numerical algorithm: …
  • Numerical recovery scheme for dynamic data: LI

Numerical experiments & results

The recovery strategies used are gathered as functions in the same source code here, which makes it possible to run the same tests each time with different recovery strategies simply by changing the execution parameter:

  • 0 : NF (without faults)
  • 1 : RESET (Reset the work of the dying process)
  • 2 : ER (Enforced restart)
  • 3 : LI (Linear interpolation)
mpirun -am ft-enable-mpi -np 6 resilient 0  # for the NF method
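
A minimal sketch of how such an execution parameter could dispatch to the different recovery schemes at run time; the function names are illustrative stubs, not the ones used in the repository.

#include <stdlib.h>

/* one stub per recovery scheme; the real routines live in the source code */
static void recover_nf(void)    { /* 0: no fault injection               */ }
static void recover_reset(void) { /* 1: redo the work from the beginning */ }
static void recover_er(void)    { /* 2: enforced restart (checkpoint)    */ }
static void recover_li(void)    { /* 3: linear interpolation             */ }

typedef void (*recovery_fn)(void);

static recovery_fn select_recovery(int argc, char **argv)
{
    int scheme = (argc > 1) ? atoi(argv[1]) : 0;   /* the execution parameter */
    switch (scheme) {
        case 1:  return recover_reset;
        case 2:  return recover_er;
        case 3:  return recover_li;
        default: return recover_nf;
    }
}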

MPIX_Comm_agree utilities and cost

The ULFM function MPIX_Comm_agree is known to be costly: it propagates failure knowledge to all the surviving processes, establishes an agreement over a binary value, and synchronises them. The graph below shows the difference in execution time of the fault-free version with and without the use of MPIX_Comm_agree.

The costly way to reach an agreement over all the processes

https://github.com/samirouche/ft_uflm/blob/master/Results_resilience/agree.jpg
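
A hedged sketch of that agreement step, assuming a local flag local_failure_seen (set when an MPI call on this rank returned a failure-related error) and the comm_replace routine sketched earlier:

/* after an iteration: each rank contributes 1 if everything is fine
   locally, 0 if it observed a failure */
int all_fine = local_failure_seen ? 0 : 1;
MPIX_Comm_agree(comm, &all_fine);   /* bitwise AND over all surviving ranks */
if (!all_fine) {
    /* at least one process observed a failure: repair before continuing */
    MPI_Comm repaired;
    comm_replace(comm, &repaired);
    comm = repaired;
}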

For the previous examples, killing a process was simulated from inside the application with “exit(0)”. The main purpose of resilient MPI support is to rebuild the communicator and continue working after a real external failure or node crash. To that end, different tests were set up by injecting faults during the execution of the different types of recovery, at different iterations, in order to compare the efficiency of the methods after reconstructing the communicator. The convergence condition is set to 10.e-14; we compare the number of iterations needed to reach it and save the outputs of each method in a txt file.

mpirun -am ft-enable-mpi -np 6 resilient 3 >> li.txt 
mpirun -am ft-enable-mpi -np 6 resilient 1 >> reset.txt 
mpirun -am ft-enable-mpi -np 6 resilient 2 >> er.txt
mpirun -am ft-enable-mpi -np 6 resilient 0 >> nf.txt
Two other graphs were added, representing the non-faulty version and a relaxed Jacobi method (both without agreement, i.e. without MPIX_Comm_agree).

The plots were produced with Octave, using a script that reads the different outputs of the source code. They show the behaviour of the recovery methods (LI, Reset, ER) compared with the no-fault method (NF) and the relaxed one, when 4 processes are killed at different iterations, in terms of the number of iterations and the time needed to converge.

Convergence in terms of the number of iterations

https://github.com/samirouche/ft_uflm/blob/master/Results_resilience/relax_itr.jpg

Convergence in terms of execution time

https://github.com/samirouche/ft_uflm/blob/master/Results_resilience/relax_time.jpg
