Visit the GitHub project and clone it:
git clone https://github.com/samirouche/ft_uflm.git
We follow this tutorial: http://fault-tolerance.org/2017/02/10/sc16-tutorial/
Here is the tutorial to install docker on Debian: https://docs.docker.com/engine/installation/linux/debian/#install-using-the-repository
On my sid machine, however, the docker-ce install failed, while the docker.io package from the baseline sid distribution worked directly.
What is my distribution?
lsb_release -cs
sid
sudo aptitude install apt-transport-https ca-certificates curl software-properties-common
sudo aptitude update
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository "deb [arch=amd64] https://download.docker.com/linux/debian $(lsb_release -cs) stable"
sudo aptitude update
According to the Docker tutorial for Debian, we should install the docker-ce package:
sudo apt-get install docker-ce
However, this failed on my sid distribution and I simply installed the docker.io package:
sudo aptitude install docker.io
sudo docker run hello-world
sudo usermod -a -G docker username
su - username
ULFMREP=/tmp/ulfm
mkdir $ULFMREP
cd $ULFMREP
wget http://fault-tolerance.org/downloads/docker-ftmpi.sc16tutorial.tar.xz
tar xf docker-ftmpi.sc16tutorial.tar.xz
wget reports 200 OK and saves docker-ftmpi.sc16tutorial.tar.xz (39673176 bytes, about 38 MB).
sudo docker load < ftmpi_ulfm-1.1.tar
. dockervars.sh load
cd ulfm-testing/tutorial
make
Launch MPI with ULFM enabled.
Note the special `-am ft-enable-mpi` parameter; if it is omitted, the non-fault-tolerant version of Open MPI is launched and applications that encounter failures will abort automatically.
mpirun -am ft-enable-mpi -np 10 ex2.1.revoke
To start working with ULFM and understand it in depth, I analysed, compiled and ran the examples from the ULFM page, from the basic ones up to revoking a communicator and reconstructing another healthy one that contains the same ranks.
The examples (single failure, multiple failures, exchanging a value between two consecutive processes, receiving from any source) differ slightly in their behaviour, as they sample basic MPI operations, but the ULFM part is common to all of them: determining the ranks of the failed processes.
We took the single-failure and multiple-failure examples, which demonstrate how ULFM detects and identifies the failed ranks. After setting an error handler and killing one or more processes (the last rank in the first example, half of the remaining processes in the second), the error handler calls MPIX_Comm_failure_ack and MPIX_Comm_failure_get_acked to acknowledge the failed processes on the communicator and place them in a group. A group containing the processes of the original communicator is also created, so that the failed ranks can be determined with MPI_Group_translate_ranks, yielding an array of failed ranks.
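As a minimal sketch of this detection pattern (not the tutorial's exact code; it assumes the ULFM extensions from `<mpi-ext.h>` are available):

```c
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_* */
#include <stdio.h>
#include <stdlib.h>

/* Called from the error handler: acknowledge the known failures on comm
 * and print the failed processes as ranks of the original communicator. */
static void report_failed_ranks(MPI_Comm comm)
{
    MPI_Group failed_grp, comm_grp;
    int nfailed;

    MPIX_Comm_failure_ack(comm);                  /* acknowledge locally known failures */
    MPIX_Comm_failure_get_acked(comm, &failed_grp);
    MPI_Comm_group(comm, &comm_grp);              /* group of the original communicator */

    MPI_Group_size(failed_grp, &nfailed);
    int *franks = malloc(nfailed * sizeof(int));  /* 0..nfailed-1 inside failed_grp */
    int *cranks = malloc(nfailed * sizeof(int));  /* same processes, as ranks of comm */
    for (int i = 0; i < nfailed; i++) franks[i] = i;

    MPI_Group_translate_ranks(failed_grp, nfailed, franks, comm_grp, cranks);
    for (int i = 0; i < nfailed; i++)
        printf("detected failure of rank %d\n", cranks[i]);

    free(franks); free(cranks);
    MPI_Group_free(&failed_grp);
    MPI_Group_free(&comm_grp);
}
```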
mpirun -am ft-enable-mpi -np 4 ex0.3.report_one
mpirun -am ft-enable-mpi -np 8 ex0.4.report_many
To understand how ULFM fixes a communicator that contains failures, I used this global example, which shows a communicator with a failure being shrunk and then a new process being spawned back into the same old rank.
To do so, a process is killed just before a broadcast. A repair function, MPIX_Comm_replace, is written by hand to handle this type of failure. Inside this function the communicator is shrunk to remove all the dead processes, the number of failed processes to respawn is determined, and once they are spawned the ranks are rearranged so that the old ranks are restored: rank 0 of the shrunk communicator sends the new processes their rank assignments.
Once the repair function finishes, another broadcast is performed, this time reaching both the surviving ranks and the spawned ones.
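A rough sketch of the survivor-side path of such a repair function is shown below; the binary name "./app" and the rank redistribution details are placeholders, and the spawned processes would reach the merge through MPI_Comm_get_parent.

```c
#include <mpi.h>
#include <mpi-ext.h>

/* Survivor-side sketch of an MPIX_Comm_replace-style repair function. */
static void comm_replace(MPI_Comm comm, MPI_Comm *newcomm)
{
    MPI_Comm shrunk, inter, merged;
    int nc, ns, ndead, oldrank;

    MPIX_Comm_shrink(comm, &shrunk);              /* keep only the survivors */
    MPI_Comm_size(comm, &nc);
    MPI_Comm_size(shrunk, &ns);
    ndead = nc - ns;                              /* how many processes to respawn */
    MPI_Comm_rank(comm, &oldrank);                /* survivors remember their old rank */

    /* Respawn the missing processes; they re-execute the application and
     * join through MPI_Comm_get_parent(). */
    MPI_Comm_spawn("./app", MPI_ARGV_NULL, ndead, MPI_INFO_NULL,
                   0, shrunk, &inter, MPI_ERRCODES_IGNORE);
    MPI_Intercomm_merge(inter, 0, &merged);       /* survivors get the low ranks */

    /* Rank 0 of the shrunk communicator tells each spawned process which
     * old rank it replaces (the free ranks are found by comparing the
     * shrunk group with the original one; omitted here). Everyone then
     * re-orders the merged communicator so that the old ranks are restored. */
    MPI_Comm_split(merged, 0, oldrank, newcomm);

    MPI_Comm_free(&shrunk);
    MPI_Comm_free(&inter);
    MPI_Comm_free(&merged);
}
```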
git clone https://github.com/samirouche/ft_uflm.git
cd ft_uflm/Resilient_version/
. dockervars.sh load
make
After going through all the examples in the tutorial provided by the ULFM team and analysing them, a common point stands out: the failure-fixing part is always written as a function, a protocol to follow. Each time a process fails, the program calls MPIX_Comm_replace; inside this function the communicator is shrunk and revoked, and then as many processes as failed are spawned and added to the shrunk communicator.
Each time a process enters that function, if it is a survivor it takes the same rank as before and then continues its work; if it is new, it gets its rank assignment from rank 0 and joins the work.
My example is a simple broadcast after killing a process, followed by a call to the repair function; once the communicator is shrunk and revoked and the failed processes are respawned, it performs another broadcast. The results are here. Its goal is to apply ULFM and reconstruct a communicator with the same rank assignment as the original one.
mpirun -am ft-enable-mpi -np 4 ulfm_ranks
The respawn example is a good reference for MPI applications that need to shrink, spawn and reconstruct a new communicator. I therefore reused parts of it, namely the repair function and the modifications required to make ULFM work, to update a simple summation example.
The main scenario is a simple summation using MPI in which a random process is killed. Once the master rank has sent the job chunks to the other ranks and detects an error of class "MPI_ERR_PROC_FAILED" or "MPI_ERR_PENDING", it calls the repair function, which is supposed to shrink the communicator containing the failed processes, spawn replacements, and merge them with the shrunk communicator so that each process takes its old rank. My first implementation of this was a failure: it works only with 2 processes. With more than 2 processes it blocks and never finishes the work, because with 2 processes, once the master kills the other one, there are no other working processes to worry about, whereas with more than 2 processes the remaining processes block inside their work while the master enters the repair part.
To avoid this problem, instead of checking every send from the master, I added an MPI_Barrier and checked its return code. If it reports an error of one of the classes mentioned above, the repair function is called; once the communicator is repaired and we are sure that no process is dead, the program moves to the recovery part, re-sends the job chunks to all the ranks and finishes the job successfully. This version works for any number of processes and illustrates the scenario described at the beginning.
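A sketch of that barrier-based check follows; repair() is a placeholder for the MPIX_Comm_replace-style function described earlier, and the error classes are those reported by ULFM.

```c
#include <mpi.h>
#include <mpi-ext.h>

void repair(MPI_Comm *comm);   /* placeholder for the shrink/respawn/restore function */

/* All ranks meet at a barrier; if the barrier reports a process failure,
 * the communicator is revoked and repaired before the recovery part. */
static void checked_barrier(MPI_Comm *comm)
{
    int rc = MPI_Barrier(*comm), eclass;
    MPI_Error_class(rc, &eclass);
    if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPI_ERR_PENDING
        || eclass == MPIX_ERR_REVOKED) {
        MPIX_Comm_revoke(*comm);   /* free any rank stuck in a pending call */
        repair(comm);              /* shrink + respawn + restore old ranks */
    }
}
```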
mpirun -am ft-enable-mpi -np 4 sumulfm
These results illustrate the summation of an array using MPI and ULFM with 8 processes.
In general, to build a resilient MPI application we need a set of building blocks, each contributing to the final resilient application. It can be seen as the composition of four blocks: one containing the main goal of the application, another serving as a reference for using ULFM to build a resilient application, and the remaining two designed for the recovery of static and dynamic data.
This block presents the main problem to solve: the simple working MPI source code without fault tolerance. It can be any kind of MPI application.
For my part, I took three examples, all numerical MPI algorithms that I will need for my internship.
The summation example treated in the previous part
To practise resilient parallelism in iterative methods, I decided to start by applying it to the Jacobi method. For that I use this code, which solves the Laplace equation in two dimensions with finite differences, on a 12 x 12 mesh with only 4 processes.
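For reference, a plain (non-resilient) Jacobi sweep over a local block of the 2-D Laplace problem looks roughly like this; the row-block layout with a one-point ghost border is an assumption, not necessarily the layout of the linked code.

```c
/* One Jacobi sweep on an nrows x ncols block of interior points, stored
 * row-major with a ghost border already filled by the neighbouring ranks.
 * Returns this rank's contribution to the squared residual. */
static double jacobi_sweep(const double *u, double *unew, int nrows, int ncols)
{
    double diff = 0.0;
    int stride = ncols + 2;                       /* width including the ghost columns */
    for (int i = 1; i <= nrows; i++)
        for (int j = 1; j <= ncols; j++) {
            int k = i * stride + j;
            unew[k] = 0.25 * (u[k - 1] + u[k + 1] + u[k - stride] + u[k + stride]);
            double d = unew[k] - u[k];
            diff += d * d;
        }
    return diff;
}
```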
There are many ways to apply ULFM; in our case we need to shrink, respawn and reconstruct the communicator. The previous respawn example provides these features, so we will take it as our reference for building resilient applications. The main steps are:
Create a new communicator that contains only the surviving processes
Determine which processes failed and how many, which will be used both for respawning just after and for rank assignment (a sketch follows this list)
Respawn as many new processes as failed
Merge the new safe communicator with the respawned processes so that each one takes its previous rank in the old communicator
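For the second step, one way to find out which ranks died and how many is sketched below (not the respawn example's exact code):

```c
#include <mpi.h>

/* Compare the original communicator with its shrunk version and return the
 * number of dead processes; their old ranks are written into dead_ranks. */
static int failed_ranks(MPI_Comm comm, MPI_Comm shrunk, int *dead_ranks)
{
    MPI_Group gcomm, gshrunk, gdead;
    int ndead;

    MPI_Comm_group(comm, &gcomm);
    MPI_Comm_group(shrunk, &gshrunk);
    MPI_Group_difference(gcomm, gshrunk, &gdead); /* processes of comm missing from shrunk */
    MPI_Group_size(gdead, &ndead);

    for (int i = 0; i < ndead; i++)
        MPI_Group_translate_ranks(gdead, 1, &i, gcomm, &dead_ranks[i]);

    MPI_Group_free(&gcomm);
    MPI_Group_free(&gshrunk);
    MPI_Group_free(&gdead);
    return ndead;                                 /* how many processes to respawn */
}
```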
The static data is the data supplied as input to the program by the user; for solving a linear system it is the matrix A and the vector B. Recovering it simply means re-initializing the input data as at the beginning.
There are several ways to recover the dynamic data, i.e. the data obtained during the computation. Some of these ways are:
One of the most basic recovery schemes: it consists in redoing all the work from the beginning.
It basically consists of saving the current state of the results, so that in case of failure the computation can restart from that point rather than from the beginning. This is particularly important for long-running applications, which are highly exposed to failures.
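A minimal sketch of what such a checkpoint could look like for the iterative examples below (the file layout and names are assumptions, not the author's format):

```c
#include <stdio.h>

/* Each rank periodically dumps its iteration counter and its local block
 * of the iterate, and reloads them after the communicator has been repaired. */
static void write_checkpoint(int rank, int iter, const double *x, int n)
{
    char name[64];
    snprintf(name, sizeof name, "ckpt_rank%d.bin", rank);
    FILE *f = fopen(name, "wb");
    fwrite(&iter, sizeof iter, 1, f);
    fwrite(x, sizeof(double), (size_t)n, f);
    fclose(f);
}

static int read_checkpoint(int rank, int *iter, double *x, int n)
{
    char name[64];
    snprintf(name, sizeof name, "ckpt_rank%d.bin", rank);
    FILE *f = fopen(name, "rb");
    if (!f) return 0;                             /* no checkpoint: restart from scratch */
    fread(iter, sizeof *iter, 1, f);
    fread(x, sizeof(double), (size_t)n, f);
    fclose(f);
    return 1;
}
```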
We consider the Linear Interpolation (LI) scheme from https://hal.inria.fr/hal-01323192
For all the following numerical algorithms, we use the respawn example as our reference for the MPIX part.
For the summation of an array we used only the Reset method: we combined our previous example with our MPIX reference code, with some modifications, and obtained this resilient version. When executed with 8 processes it gives these results.
mpirun -am ft-enable-mpi -np 4 sumulfm
We took the previous Jacobi example and edited it so that it becomes resilient to process failures; the numerical recovery scheme for the dynamic data, however, remains the user's choice. In our case we implemented the three ways mentioned above:
The Reset method for the Jacobi iterations is presented in this code; it does not require a numerical recovery of dynamic data, only of static data, as in the example mentioned above.
Executing this code with 4 processes gave us these results.
mpirun -am ft-enable-mpi -np 4 reset_jacobi
The Checkpoint/Restart method for the Jacobi iterations is illustrated in this code; it does not require a numerical recovery of static data, because after the first iteration it works only with the dynamic data. The results of an execution with 4 processes can be found here.
mpirun -am ft-enable-mpi -np 4 jacobi_check_restart
For the Jacobi example, the linear interpolation consists in getting the ghost points of the neighbouring processes, which contain the values of the adjacent points. Once we have them, we recover the static data for the failed processes and iterate the local mesh until a stopping condition is met.
After the interpolation, the work continues in parallel, iterating until it converges or reaches a stopping condition.
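A sketch of one way such a local recovery can start, assuming a 1-D row-block decomposition: the respawned rank seeds its lost block by interpolating linearly between the ghost rows received from its neighbours, then refines it with local Jacobi sweeps.

```c
/* Fill the recovered nrows x ncols block by linear interpolation between
 * the ghost row from the rank above (top) and the one from the rank below
 * (bottom); this initial guess is then refined by local Jacobi iterations. */
static void li_seed(double *u, int nrows, int ncols,
                    const double *top, const double *bottom)
{
    for (int i = 0; i < nrows; i++) {
        double w = (double)(i + 1) / (double)(nrows + 1);
        for (int j = 0; j < ncols; j++)
            u[i * ncols + j] = (1.0 - w) * top[j] + w * bottom[j];
    }
}
```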
This code illustrates the linear interpolation for the Jacobi example. These results were obtained by running the code with:
mpirun -am ft-enable-mpi -np 4 LI_jacobi
- MPI numerical algorithm: …
- Numerical recovery scheme for dynamic data: LI
The recovery strategies used are gathered as functions in the same source code here, which allows the same tests to be run each time with different recovery strategies, selected through an execution parameter:
- 0 : NF (without faults)
- 1 : RESET (reset the work of the dying process)
- 2 : ER (Enforced restart)
- 3 : LI (Linear interpolation)
mpirun -am ft-enable-mpi -np 6 resilient 0 # for the NF method
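A hypothetical sketch of how such an execution parameter can drive the choice of strategy (the enum and function names are placeholders, not the ones used in the source code):

```c
#include <stdlib.h>

enum strategy { NF = 0, RESET = 1, ER = 2, LI = 3 };

/* Read the recovery strategy from the first command-line argument,
 * defaulting to the fault-free run (NF). */
static enum strategy pick_strategy(int argc, char **argv)
{
    return (argc > 1) ? (enum strategy)atoi(argv[1]) : NF;
}
```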
The ULFM function MPIX_Comm_agree is known to be costly; it is used to propagate a failure to all the surviving processes, to reach agreement over a binary value, and to synchronise them. This graph shows the difference in execution time of the fault-free version with and without MPIX_Comm_agree.
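A sketch of the typical use that makes it expensive: after each iteration every process contributes a local success flag and MPIX_Comm_agree computes a fault-tolerant agreement (a bitwise AND over the survivors), so that everyone enters the repair path together.

```c
#include <mpi.h>
#include <mpi-ext.h>

/* Returns 1 only if every surviving process reports MPI_SUCCESS for the
 * iteration; the agreement itself tolerates failures during the call. */
static int iteration_ok(MPI_Comm comm, int local_rc)
{
    int ok = (local_rc == MPI_SUCCESS);
    MPIX_Comm_agree(comm, &ok);
    return ok;
}
```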
In the previous examples, killing a process was simulated from inside the application with "exit(0)". The main purpose of resilient MPI support, however, is to rebuild the communicator and keep working after a real external failure or node crash. To that end, a set of tests was run, injecting faults during the execution of the different recovery types at different iterations, in order to compare the efficiency of the methods after reconstructing the communicator. The convergence condition is set to 10.e-14, we compare the number of iterations needed to reach it, and the output of each method is saved to a txt file.
mpirun -am ft-enable-mpi -np 6 resilient 3 >> li.txt
mpirun -am ft-enable-mpi -np 6 resilient 1 >> reset.txt
mpirun -am ft-enable-mpi -np 6 resilient 2 >> er.txt
mpirun -am ft-enable-mpi -np 6 resilient 0 >> nf.txt
Two other curves were added, representing the non-faulty version and a relaxed Jacobi method (both without agreement, i.e. without MPIX_Comm_agree).
The plots were produced with Octave, using a script that plots the different outputs of the source code. They show the different behaviour of the recovery methods (LI, Reset, ER) compared with the non-faulty method (NF) and the relaxed one, when killing 4 processes at different iterations, looking at the number of iterations or the time needed to converge.