Commit 7da82f9: Update README.md

linusseelinger authored Feb 25, 2024 · 1 parent a61b92a

1 changed file with 5 additions and 5 deletions: `kubernetes/README.md`
```
kubectl get services --namespace=haproxy-controller
```

The model instances may be accessed from any UM-Bridge client, and up to `replicas` requests will be handled in parallel.

## Multinode MPI on Kubernetes

The instructions above work for any UM-Bridge model container, even MPI-parallel ones. However, a single container is naturally limited to a single physical node. In order to parallelize across nodes (and therefore across containers) via MPI, the additional steps below are needed.

### Step 1: mpi-operator base image

The multinode MPI configuration makes use of the [mpi-operator](https://github.com/kubeflow/mpi-operator) from Kubeflow. This implies that the model base image has to be constructed via one of the mpi-operator base images, depending on the MPI implementation. When separating between builder and final image, the corresponding base images may be used:

- `mpioperator/intel`
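For illustration, a two-stage image build for an OpenMPI-based model might look like the following sketch. The `mpioperator/openmpi-builder`/`mpioperator/openmpi` image pair, the `model.cpp` source file, and the install path are assumptions here, not part of the original instructions:

```
# Builder stage: compile the MPI model (assumed source file model.cpp)
FROM mpioperator/openmpi-builder AS builder
COPY model.cpp /src/model.cpp
RUN mpic++ /src/model.cpp -o /src/model

# Final stage: runtime image with MPI plus the compiled binary
FROM mpioperator/openmpi
COPY --from=builder /src/model /usr/local/bin/model
```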


### Step 2: Deploy mpi-operator

In addition to choosing a suitable base image for the model, the mpi-operator needs to be deployed on the cluster:

```
kubectl apply -f https://raw.githubusercontent.com/kubeflow/mpi-operator/master/deploy/v2beta1/mpi-operator.yaml
```
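Whether the operator is in place can be verified, for instance, by checking that its `MPIJob` custom resource definition exists (this check is an addition here, not part of the original instructions):

```
kubectl get crd mpijobs.kubeflow.org
```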

### Step 3: Setting up NFS

The multinode MPI setup mounts a shared (NFS) file system on the `/shared` directory of your model container, replicating a traditional HPC setup. The NFS server is set up via the manifests in the `setup/` directory.

Then run:

```
kubectl apply -f setup/nfs-pv-pvc.yaml
```
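Once applied, the resulting persistent volume and claim should report a `Bound` status, which can be checked for instance via:

```
kubectl get pv,pvc
```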

### Step 4: Running a job on the new cluster

The job configuration is located in `multinode-mpi-model.yaml`. It is largely analogous to `model.yaml`, except that both launcher and worker containers are configured; the additional config options cover these launcher and worker specifications, as sketched below.
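Purely as an illustration of such a launcher/worker job definition, here is a minimal kubeflow `MPIJob` sketch; all names, images, and replica counts are assumptions rather than the contents of `multinode-mpi-model.yaml`:

```
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: model               # hypothetical job name
spec:
  slotsPerWorker: 1         # MPI ranks per worker pod
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - name: model
            image: my-registry/mpi-model   # hypothetical image
            command: ["mpirun", "-n", "2", "/usr/local/bin/model"]
    Worker:
      replicas: 2           # two worker pods, potentially on different nodes
      template:
        spec:
          containers:
          - name: model
            image: my-registry/mpi-model   # hypothetical image
```

Such a job would then be submitted with `kubectl apply -f multinode-mpi-model.yaml`, analogous to the single-node case.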

