The Message Passing Interface (MPI) is a standard widely used by HPC applications to implement communication across the compute nodes of a single system or across compute platforms. There are currently two main open-source implementations of MPI, Open MPI and MPICH, both of which are supported by {Singularity}. The goal of this page is to demonstrate how to develop and run MPI programs using {Singularity} containers.
There are several ways of doing this. The most popular way of executing MPI applications installed in a {Singularity} container is to rely on the MPI implementation available on the host. This is called the Host MPI or Hybrid model, since both the MPI implementation provided by system administrators (on the host) and the one in the container are used.
Another approach is to use only the MPI implementation available on the host and not include any MPI in the container. This is called the Bind model, since it requires binding (mounting) the MPI installation available on the host into the container.
Note
The bind model requires users to be able to mount user-specified files from the host into the container. This ability is sometimes disabled by system administrators for operational reasons. If this is the case on your system, please follow the hybrid approach.
The basic idea behind the Hybrid Approach is that when you execute a {Singularity} container with MPI code, you call mpiexec or a similar launcher on the singularity command itself. The MPI process outside of the container then works in tandem with the MPI inside the container and the containerized MPI code to instantiate the job.
The Open MPI/{Singularity} workflow in detail:
- The MPI launcher (e.g., mpirun, mpiexec) is called by the resource manager or the user directly from a shell.
- Open MPI then calls the process management daemon (ORTED).
- The ORTED process launches the {Singularity} container requested by the launcher command.
- {Singularity} instantiates the container and namespace environment.
- {Singularity} then launches the MPI application within the container.
- The MPI application launches and loads the Open MPI libraries.
- The Open MPI libraries connect back to the ORTED process via the Process Management Interface (PMI).
At this point the processes within the container run as they would normally directly on the host.
The advantages of this approach are:
- Integration with resource managers such as Slurm.
- Simplicity, since it is similar to natively running MPI applications.
The drawbacks are:
- The MPI in the container must be compatible with the version of MPI available on the host.
- The MPI implementation in the container must be carefully configured for optimal use of the hardware if performance is critical.
Since the MPI implementation in the container must be compatible with the version available on the host system, a standard approach is to build your own MPI container, including building from source the same MPI framework that is installed on the host.
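Before building the container, it can help to compare the MPI version reported on the host with the one inside an image. The following is a minimal sketch, not part of the original workflow: the helper names extract_version and mpi_versions_match are hypothetical, and it assumes that the launcher on both sides prints a dotted version number when called with --version, which may not hold for every MPI build.

```shell
# Extract the first dotted version number (e.g. 4.0.5) from text on stdin.
extract_version() {
    grep -oE '[0-9]+\.[0-9]+(\.[0-9]+)?' | head -n 1
}

# Hypothetical helper: succeed only if the host launcher and the launcher
# inside the image (first argument) report the same version string.
# Assumes mpirun is on PATH both on the host and in the container.
mpi_versions_match() {
    host_version=$(mpirun --version 2>&1 | extract_version)
    container_version=$(singularity exec "$1" mpirun --version 2>&1 | extract_version)
    echo "host: $host_version, container: $container_version"
    [ -n "$host_version" ] && [ "$host_version" = "$container_version" ]
}
```

For example, mpi_versions_match hybrid-openmpi.sif would print both versions and return a non-zero status when they differ. Identical version numbers do not guarantee compatible build options, so treat this only as a first check.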
To illustrate how {Singularity} can be used to execute MPI applications, we will assume for a moment that the application is mpitest.c, a simple Hello World:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main (int argc, char **argv) {
    int rc;
    int size;
    int myrank;
    rc = MPI_Init (&argc, &argv);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Init() failed");
        return EXIT_FAILURE;
    }
    rc = MPI_Comm_size (MPI_COMM_WORLD, &size);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Comm_size() failed");
        goto exit_with_error;
    }
    rc = MPI_Comm_rank (MPI_COMM_WORLD, &myrank);
    if (rc != MPI_SUCCESS) {
        fprintf (stderr, "MPI_Comm_rank() failed");
        goto exit_with_error;
    }
    fprintf (stdout, "Hello, I am rank %d/%d\n", myrank, size);
    MPI_Finalize();
    return EXIT_SUCCESS;
exit_with_error:
    MPI_Finalize();
    return EXIT_FAILURE;
}
Note
MPI is a library interface specification: it consists of function calls, provided by libraries, that can be used from many programming languages. The standard defines bindings for C and Fortran, but applications in many other languages, such as Python and R, can also use MPI through wrapper libraries.
The next step is to create the definition file used to build the container, which will depend on the MPI implementation available on the host.
If the host MPI is MPICH, a definition file such as the following example can be used:
Bootstrap: docker
From: ubuntu:18.04
%files
    mpitest.c /opt
%environment
    # Point to MPICH binaries, libraries, man pages
    export MPICH_DIR=/opt/mpich
    export PATH="$MPICH_DIR/bin:$PATH"
    export LD_LIBRARY_PATH="$MPICH_DIR/lib:$LD_LIBRARY_PATH"
    export MANPATH="$MPICH_DIR/share/man:$MANPATH"
%post
    echo "Installing required packages..."
    export DEBIAN_FRONTEND=noninteractive
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make
    # Information about the version of MPICH to use
    export MPICH_VERSION=3.3.2
    export MPICH_URL="http://www.mpich.org/static/downloads/$MPICH_VERSION/mpich-$MPICH_VERSION.tar.gz"
    export MPICH_DIR=/opt/mpich
    echo "Installing MPICH..."
    mkdir -p /tmp/mpich
    mkdir -p /opt
    # Download
    cd /tmp/mpich && wget -O mpich-$MPICH_VERSION.tar.gz $MPICH_URL && tar xzf mpich-$MPICH_VERSION.tar.gz
    # Compile and install
    cd /tmp/mpich/mpich-$MPICH_VERSION && ./configure --prefix=$MPICH_DIR && make install
    # Set env variables so we can compile our application
    export PATH=$MPICH_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$MPICH_DIR/lib:$LD_LIBRARY_PATH
    echo "Compiling the MPI application..."
    cd /opt && mpicc -o mpitest mpitest.c
Note
The version of MPICH you install in the container must be compatible with the version on the host. It should also be configured to support the same process management mechanism and version, e.g. PMI2 / PMIx, as used on the host.
There are wide variations in MPI configuration across HPC systems. Consult your system documentation, or ask your support staff for details.
If the host MPI is Open MPI, the definition file looks like:
Bootstrap: docker
From: ubuntu:18.04
%files
    mpitest.c /opt
%environment
    # Point to OMPI binaries, libraries, man pages
    export OMPI_DIR=/opt/ompi
    export PATH="$OMPI_DIR/bin:$PATH"
    export LD_LIBRARY_PATH="$OMPI_DIR/lib:$LD_LIBRARY_PATH"
    export MANPATH="$OMPI_DIR/share/man:$MANPATH"
%post
    echo "Installing required packages..."
    apt-get update && apt-get install -y wget git bash gcc gfortran g++ make file
    echo "Installing Open MPI"
    export OMPI_DIR=/opt/ompi
    export OMPI_VERSION=4.0.5
    export OMPI_URL="https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-$OMPI_VERSION.tar.bz2"
    mkdir -p /tmp/ompi
    mkdir -p /opt
    # Download
    cd /tmp/ompi && wget -O openmpi-$OMPI_VERSION.tar.bz2 $OMPI_URL && tar -xjf openmpi-$OMPI_VERSION.tar.bz2
    # Compile and install
    cd /tmp/ompi/openmpi-$OMPI_VERSION && ./configure --prefix=$OMPI_DIR && make -j8 install
    # Set env variables so we can compile our application
    export PATH=$OMPI_DIR/bin:$PATH
    export LD_LIBRARY_PATH=$OMPI_DIR/lib:$LD_LIBRARY_PATH
    echo "Compiling the MPI application..."
    cd /opt && mpicc -o mpitest mpitest.c
Note
The version of Open MPI you install in the container must be compatible with the version on the host. It should also be configured to support the same process management mechanism and version, e.g. PMI2 / PMIx, as used on the host.
There are wide variations in MPI configuration across HPC systems. Consult your system documentation, or ask your support staff for details.
The standard way to execute MPI applications with hybrid {Singularity} containers is to run the native mpirun command from the host, which will start {Singularity} containers and ultimately MPI ranks within the containers.
Assuming your container with MPI and your application is already built, and that it follows the hybrid model, the mpirun command to start your application looks like this:
$ mpirun -n <NUMBER_OF_RANKS> singularity exec <PATH/TO/MY/IMAGE> </PATH/TO/BINARY/WITHIN/CONTAINER>
Practically, this command will first start a process instantiating mpirun and then {Singularity} containers on compute nodes. Finally, when the containers start, the MPI binary is executed:
$ mpirun -n 8 singularity run hybrid-mpich.sif /opt/mpitest
Hello, I am rank 3/8
Hello, I am rank 4/8
Hello, I am rank 6/8
Hello, I am rank 2/8
Hello, I am rank 0/8
Hello, I am rank 5/8
Hello, I am rank 1/8
Hello, I am rank 7/8
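Because the ranks print in arbitrary order, it can be convenient to check a run automatically rather than by eye. The sketch below is an illustration only: the function name all_ranks_present is hypothetical, and it assumes the output format of the mpitest program shown earlier ("Hello, I am rank R/N").

```shell
# Succeed if stdin contains exactly $1 distinct ranks reported as "rank R/$1".
all_ranks_present() {
    expected=$1
    # Keep only the rank number from lines ending in "rank R/<expected>",
    # then count the distinct values.
    found=$(sed -n 's|.*rank \([0-9][0-9]*\)/'"$expected"'$|\1|p' \
        | sort -nu | wc -l | tr -d ' ')
    [ "$found" -eq "$expected" ]
}
```

It could be used as mpirun -n 8 singularity exec hybrid-mpich.sif /opt/mpitest | all_ranks_present 8, which returns a non-zero status if any rank is missing or if the processes report a different communicator size.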
Similar to the Hybrid Approach, the basic idea behind the Bind Approach is to start the MPI application by calling the MPI launcher (e.g., mpirun) from the host. The main difference between the hybrid and bind approach is the fact that with the bind approach, the container usually does not include any MPI implementation. This means that {Singularity} needs to mount/bind the MPI available on the host into the container.
Technically this requires two steps:
- Know where the MPI implementation on the host is installed.
- Mount/bind it into the container in a location where the system will be able to find libraries and binaries.
The advantages of this approach are:
- Integration with resource managers such as Slurm.
- Container images are smaller since there is no need to add an MPI implementation to the containers.
The drawbacks are:
- The MPI used to compile and install the application in the container must be compatible with the version of MPI available on the host.
- The user must know where the host MPI is installed.
- The user must ensure that binding the directory where the host MPI is installed is possible.
The creation of a {Singularity} container for the bind model is based on the following steps:
- Compile your application on a system with the target MPI implementation, as you would do to install your application on any system.
- Create a definition file that includes the copy of the application from the host to the container image, as well as all required dependencies.
- Generate the container image.
As already mentioned, the compilation of the application on the host is not different from the installation of your application on any system. Just make sure that the MPI on the system where you create your container is compatible with the MPI available on the platform(s) where you want to run your containers. For example, a container where the application has been compiled with MPICH will not be able to run on a system where only Open MPI is available, even if you mount the directory where Open MPI is installed.
A definition file for a container in bind mode is fairly straightforward. The following example shows the definition file for the test program, which in this case has been compiled on the host to /tmp/mpitest:
Bootstrap: docker
From: ubuntu:18.04
%files
    /tmp/mpitest /opt/mpitest
%environment
    export PATH="$MPI_DIR/bin:$PATH"
    export LD_LIBRARY_PATH="$MPI_DIR/lib:$LD_LIBRARY_PATH"
In this example, the application mpitest is copied from the host into /opt, so we will need to run it as /opt/mpitest inside our container.
The environment section adds paths for binaries and libraries under $MPI_DIR, which we will need to set when running the container.
When running our bind mode container we need to --bind our host's MPI installation into the container. We also need to set the environment variable $MPI_DIR in the container to point to the location where the MPI installation is bound in.
Setting up the container in this way makes it semi-portable between systems that have a version-compatible MPI installation under different installation paths. You can also hard-code the MPI path in the definition file if you wish.
$ export MPI_DIR="<PATH/TO/HOST/MPI/DIRECTORY>"
$ mpirun -n <NUMBER_OF_RANKS> singularity exec --bind "$MPI_DIR" <PATH/TO/MY/IMAGE> </PATH/TO/BINARY/WITHIN/CONTAINER>
On an example system we may be using an Open MPI installation at /cm/shared/apps/openmpi/gcc/64/4.0.5/. This means that the commands to run the container in bind mode are:
$ export MPI_DIR="/cm/shared/apps/openmpi/gcc/64/4.0.5"
$ mpirun -n 8 singularity exec --bind "$MPI_DIR" bind.sif /opt/mpitest
Hello, I am rank 1/8
Hello, I am rank 2/8
Hello, I am rank 0/8
Hello, I am rank 7/8
Hello, I am rank 5/8
Hello, I am rank 3/8
Hello, I am rank 4/8
Hello, I am rank 6/8
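If the location of the host MPI is not known in advance, one possible way to derive it is from the path of the launcher. This is a sketch under the assumption that the launcher lives at <prefix>/bin/<launcher>, which is typical but not guaranteed (wrapper scripts or symlinks set up by a module system can break it). The function name mpi_prefix_from_launcher is hypothetical.

```shell
# Print the installation prefix for a launcher path of the form
# <prefix>/bin/<launcher>. Assumes that layout; it does not resolve symlinks.
mpi_prefix_from_launcher() {
    dirname "$(dirname "$1")"
}
```

With this helper, MPI_DIR could be set with export MPI_DIR="$(mpi_prefix_from_launcher "$(command -v mpirun)")" before calling mpirun as shown above. Check the result against your system documentation before relying on it.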
If your target system is set up with a batch system such as Slurm, a standard way to execute MPI applications is through a batch script. The following example illustrates the contents of a Slurm batch script that starts a {Singularity} container on each node allocated to the job. It can easily be adapted for all major batch systems available.
$ cat my_job.sh
#!/bin/bash
#SBATCH --job-name singularity-mpi
#SBATCH -N $NNODES # total number of nodes
#SBATCH --time=00:05:00 # Max execution time
mpirun -n $NP singularity exec /var/nfsshare/gvallee/mpich.sif /opt/mpitest
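The example describes a job that requests the number of nodes given by NNODES and a total number of MPI processes given by NP. Note that Slurm does not expand shell variables inside #SBATCH directives, so $NNODES must be replaced with a literal value (or the node count passed on the sbatch command line with -N) before submitting. $NP on the mpirun line is expanded normally, as long as NP is set in the job's environment. The example also assumes that the container is based on the hybrid model; if it is based on the bind model, please add the appropriate bind options.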
A user can then submit a job by executing the following Slurm command:
$ sbatch my_job.sh
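Since Slurm treats #SBATCH lines as comments and does not substitute shell variables in them, one option is to generate the batch script with literal values filled in. The following is a minimal sketch of that idea; the function name write_job_script is hypothetical, and it assumes a hybrid-model container and paths without whitespace.

```shell
# Print a Slurm batch script with literal values filled in.
# Arguments: <number of nodes> <number of ranks> <image path> <binary in container>
write_job_script() {
    nnodes=$1
    np=$2
    image=$3
    binary=$4
    cat <<EOF
#!/bin/bash
#SBATCH --job-name singularity-mpi
#SBATCH -N ${nnodes}
#SBATCH --time=00:05:00
mpirun -n ${np} singularity exec ${image} ${binary}
EOF
}
```

For example, write_job_script 2 8 /var/nfsshare/gvallee/mpich.sif /opt/mpitest > my_job.sh writes a script that requests two nodes and starts eight ranks, which can then be submitted with sbatch as shown above.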
On many systems it is common to use an alternative launcher to start MPI applications, e.g. Slurm's srun rather than the mpirun provided by the MPI installation. This approach is supported with {Singularity} as long as the container MPI version supports the same process management interface (e.g. PMI2 / PMIx) and version as is used by the launcher.
In the bind mode the host MPI is used in the container, and should interact correctly with the same launchers as it does on the host.
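When launching with srun, it may be useful to confirm that the process management interface you intend to use is offered by Slurm on your system. The sketch below assumes that srun --mpi=list prints the available plugin types one per line, which is the case on the Slurm versions we are aware of but whose exact format varies; the helper name has_pmi_type is hypothetical.

```shell
# Succeed if the PMI type given as $1 appears as a whole word in the
# plugin listing read from stdin.
has_pmi_type() {
    grep -qw -- "$1"
}
```

A possible use is srun --mpi=list 2>&1 | has_pmi_type pmix && srun --mpi=pmix -n 8 singularity exec hybrid-openmpi.sif /opt/mpitest. This only checks the launcher side: the MPI inside the container must still have been built with support for the same interface.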
High performance interconnects such as InfiniBand and Omni-Path require that MPI implementations are built to support them. You may need to install or bind InfiniBand/Omni-Path libraries into your containers when using these interconnects.
By default {Singularity} exposes every device in /dev to the container. If you run a container using the --contain or --containall flags, a minimal /dev is used instead. In this case you may need to bind in additional /dev/ entries manually to support the operation of your interconnect drivers in the container.
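As an illustration of binding device entries when --contain is used, the sketch below builds --bind options only for device paths that exist on the current node. The function name dev_bind_flags is hypothetical, and the device paths you pass to it (for example /dev/infiniband) are assumptions that depend on your interconnect and drivers.

```shell
# Print a "--bind <path>" option for each argument that exists on this host.
# Paths that are missing are skipped silently.
dev_bind_flags() {
    flags=""
    for dev in "$@"; do
        if [ -e "$dev" ]; then
            flags="${flags:+$flags }--bind $dev"
        fi
    done
    printf '%s\n' "$flags"
}
```

It could be used as singularity exec --contain $(dev_bind_flags /dev/infiniband) hybrid-openmpi.sif /opt/mpitest, leaving the command substitution unquoted so that each option is passed as a separate word. Consult your system documentation for the device entries your interconnect actually needs.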
If your containers run N rank 0 processes, instead of operating correctly as an MPI application, it is likely that the MPI stack used to launch the containerized application is not compatible with, or cannot communicate with, the MPI stack in the container.
For example, if we attempt to run the hybrid Open MPI container, but with mpirun from MPICH loaded on the host:
$ module add mpich
$ mpirun -n 8 singularity run hybrid-openmpi.sif /opt/mpitest
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
Hello, I am rank 0/1
If your container starts processes of different ranks but fails with communication errors, there may also be a version incompatibility, or interconnect libraries may not be available or configured properly with the MPI stack in the container.
Please check the following things carefully before asking questions in the {Singularity} community:
- For the hybrid mode, is the MPI version on the host compatible with the version in the container? Newer MPI versions can generally tolerate some mismatch in the version number, but it is safest to use identical versions.
- Is the MPI stack in the container configured to support the process management method used on the host? E.g. if you are launching tasks with srun configured for PMIx only, then a containerized MPI supporting PMI2 only will not operate as expected.
- If you are using an interconnect other than standard Ethernet, are any required libraries for it installed or bound into the container? Is the MPI stack in the container configured correctly to use them?
We recommend using the {Singularity} Google Group or Slack Channel to ask for MPI advice from the {Singularity} community. HPC cluster configurations vary greatly and most MPI problems are related to MPI / interconnect configuration, and not caused by issues in {Singularity} itself.