Parallelization over multiple nodes of a cluster
Running distributed Monolix on a cluster
To distribute calculations over multiple nodes, a dedicated Monolix executable needs to be used. This executable, named distMonolix, comes with the installation of the Linux version of MonolixSuite. A separate cluster license is required in order to use distMonolix. All settings that apply to the monolix executable described on this page can also be used with distMonolix on a cluster.
Monolix installation
To run MonolixSuite on a cluster, each cluster node must have access to the MonolixSuite directory and to the user home directory. Thus, there are two possibilities.
MonolixSuite is installed on each node.
MonolixSuite installation is shared. MonolixSuite is installed on a master server, and each cluster node accesses MonolixSuite through a shared directory (via CIFS, a network drive, NFS, …).
License management
On a cluster, the usage of our applications is managed with the license management system described here.
The license management server runs on a physical machine and manages the applications through its license file. The license file has to be placed in the folder {MonolixSuite install path}/config/system/access (and also in {MonolixSuite install path}/bin/Monolix_mcr/runtime/config/system/access for MonolixSuite2016R1): on all nodes in installation case 1, or only on the master server in case 2.
Running Monolix on a single node
If a Monolix run is performed on a single node, it is possible to run Monolix using its executable in the lib folder (typically $HOME/Lixoft/MonolixSuite2024R1/lib/):
monolix --no-gui -p mlxtran_project_path
where mlxtran_project_path is a Monolix project with a .mlxtran extension.
Running Monolix on multiple nodes using MPI
To run Monolix on multiple nodes, OpenMPI needs to be installed on all nodes. To run with MPI directly using the distMonolix executable in the lib folder (typically $HOME/Lixoft/MonolixSuite2024R1/lib/), you can use the following command. It distributes Monolix over the 4 nodes listed in hostfile.txt, with hostfile.txt specifying that each host (i.e., node) has 1 slot (i.e., core):
mpirun -n 4 -hostfile hostfile.txt distMonolix -p mlxtran_project_path --thread 16
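For reference, a hostfile matching this command could look as follows. The snippet below simply writes such a file; the hostnames node1 through node4 are placeholders for your cluster's actual node names.

```shell
# Hypothetical hostfile for the mpirun command above: 4 hosts, 1 slot each.
# Replace node1..node4 with the hostnames of your cluster nodes.
cat > hostfile.txt <<'EOF'
node1 slots=1
node2 slots=1
node3 slots=1
node4 slots=1
EOF
```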
Arguments that can be provided to distMonolix are the same ones as with Monolix. This includes --tool to select a multi-run task (model building, convergence assessment, bootstrap) and --config to provide settings for this task. The --thread argument indicates the number of threads for each MPI process; see details below.
MPI and multithreading
The distMonolix executable is multithreaded, meaning that each MPI process uses multiple threads.
Multithreading is generally faster than distributing computation across multiple MPI processes on the same node, because threads share memory space and avoid the overhead of inter-process communication. MPI, on the other hand, is designed for distributed memory environments, and using it for core-level parallelism on a single node can introduce unnecessary complexity and reduce performance.
The number of CPU cores used for multithreading by distMonolix is determined either by the --thread argument or, if this argument is not provided, by the value specified in the config.ini file. By default, the config.ini file sets the number of threads to the number of available cores on the node. To control thread usage explicitly, you can pass --thread <number> when launching distMonolix.
It is strongly recommended to run only one MPI process per node and use multithreading for parallelism within that node. Running multiple MPI processes per node can lead to resource contention and significantly reduce performance.
We strongly recommend using the --thread argument when running distMonolix to explicitly control the number of threads per MPI process. If --thread is not specified, distMonolix will default to the value in config.ini, which may not align with the allocated resources, potentially leading to suboptimal performance.
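As an example of aligning the thread count with the node's hardware without hard-coding it, the core count can be queried at launch time. This is a sketch: nproc reports the cores of the machine where the command runs, which assumes a homogeneous cluster.

```shell
# Query the core count of the current machine; on a homogeneous cluster
# this matches each node's capacity (assumption: all nodes are identical).
THREADS=$(nproc)
echo "threads per MPI process: $THREADS"
```

The resulting value can then be passed as --thread $THREADS in the mpirun command shown earlier.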
MPI troubleshooting
Different versions of distributed Monolix were built using different versions of Open MPI. If a more recent version was installed on the cluster, the following error may appear when trying to run distributed Monolix:
distMonolix: error while loading shared libraries: libmpi_cxx.so.YY: cannot open shared object file: No such file or directory
To resolve the error, you have to create two symbolic links in your installation:
from your installation of libmpi.so (usually in /usr/lib64/openmpi/lib/libmpi.so.XX) to libmpi.so.YY (in the MonolixSuite lib folder):
sudo ln -s your_installation_of_openmpi/lib/libmpi.so.XX installation_of_MonolixSuiteXXXX/lib/libmpi.so.YY
from your installation of libmpi_cxx.so (usually in /usr/lib64/openmpi/lib/libmpi_cxx.so.XX) to libmpi_cxx.so.YY (in the MonolixSuite lib folder):
sudo ln -s your_installation_of_openmpi/lib/libmpi_cxx.so.XX installation_of_MonolixSuiteXXXX/lib/libmpi_cxx.so.YY
Distributed calculation
How the distribution is done differs between different tasks:
in MCMC (SAEM, Fisher by Stochastic Approximation, Conditional Distribution): pools of individuals (IDs) are created and distributed over the MPI processes,
in Importance Sampling: the same is done with simulation pools,
in multi-run tasks (bootstrap, convergence assessment): each run is distributed over all processes and the runs are performed one after the other.
Using distributed Monolix with a scheduler
Usually, runs on clusters are scheduled using a job scheduling application (e.g., Torque, PBS, GridEngine, Slurm, LSF, …). After submitting a Monolix run with the job scheduling application, the run waits in a queue until enough resources become available, at which point the run is performed.
Generally, a run is submitted to the cluster using a specific command, e.g. qsub in the case of Torque, PBS or GridEngine (formerly SGE). This command runs a script, provided as a parameter, on a cluster node chosen by the cluster scheduler.
Scheduling Monolix runs with Slurm Workload Manager
When using Slurm Workload Manager on a cluster, runs are submitted using the sbatch command. A path to a batch script needs to be provided with the command. A simple example of a batch script that can be used to run Monolix is shown here (note that there is no need to provide the number of nodes directly to the mpirun command in the script, since Slurm will automatically pass that information to Open MPI):
#!/bin/bash
mpirun --bind-to core --map-by slot:PE=$2 ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p $1 --thread $2
While the examples provided use mpirun within sbatch scripts, launching your MPI application directly via srun (e.g., srun ./your_executable) is generally the preferred and often more efficient method on Slurm. srun typically offers better integration with Slurm's resource allocation (--cpus-per-task, task distribution) and process management. We recommend using srun for launching MPI tasks whenever feasible.
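A minimal sketch of an srun-based alternative to the mpirun script above, assuming Slurm's MPI integration is configured for Open MPI (the distMonolix path and argument order follow the mpirun example; this is a job script fragment, not a tested recipe):

```shell
#!/bin/bash
# Sketch: launch distMonolix through srun; Slurm starts one task per
# allocated task slot, so no process count is given here.
# $1 = path to the .mlxtran project, $2 = threads per MPI process.
srun --cpus-per-task="$2" ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p "$1" --thread "$2"
```

The examples that follow continue to use the mpirun version of the script.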
If the script is saved as run.sh, we can schedule a Monolix project to run with the following command (the command will distribute the run across 4 tasks on 4 nodes, with each task running 8 threads):
$ sbatch --nodes 4 --ntasks-per-node 1 --cpus-per-task 8 run.sh mlxtran_project_path 8
Additional arguments, such as a time limit or job name, can be provided to the sbatch command either through the command line or through the batch script. All available options are listed in the Slurm sbatch documentation.
The --ntasks-per-node option indicates the number of tasks (i.e., MPI processes) per node. Within a node, parallelization using multithreading is more efficient than parallelization using several processes. It is thus recommended to set --ntasks-per-node to 1 and to use the --thread argument of distMonolix to define the number of threads.
The --cpus-per-task option specifies the number of CPU cores allocated per task. When running distMonolix, each MPI process is considered a task, so setting --cpus-per-task=<N> ensures that each MPI process gets N CPU cores for multithreading. Thus, the values of --cpus-per-task and --thread should be the same.
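One way to keep the two values in sync automatically is to read Slurm's environment inside the batch script instead of passing the thread count twice. This is a sketch: SLURM_CPUS_PER_TASK is set by Slurm only when --cpus-per-task was given to sbatch, so a fallback is used here.

```shell
# Read the per-task CPU allocation that Slurm exports inside a job.
# Outside a Slurm allocation the variable is unset, so fall back to 8.
THREADS="${SLURM_CPUS_PER_TASK:-8}"
echo "thread count: $THREADS"
```

Inside the batch script, $THREADS could then replace the hand-passed $2 in the --thread argument.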
Here is how the run.sh file should look if we want to assign a job name through the file:
#!/bin/bash
#SBATCH --job-name=monolixRun
mpirun --bind-to core --map-by slot:PE=$2 ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p $1 --thread $2
After submitting the job using the sbatch command, we can use the squeue command to check the status of a run:
$ sbatch --nodes 4 --ntasks-per-node 1 --cpus-per-task 8 run.sh mlxtran_project_path 8
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
86 debug monolixR lixoft R 0:02 4 slave[1-4]
We can cancel the run using scancel and specifying the job ID:
$ scancel 86
Scheduling Monolix runs with IBM Spectrum LSF
When using IBM Spectrum LSF on a cluster, runs are submitted using the bsub command. A path to a batch script needs to be provided with the command. A simple example of a batch script that can be used to run Monolix is shown here (note that there is no need to provide the number of processes directly to the mpirun command in the script, since LSF and the MPI library communicate to determine this from the resource request):
#!/bin/bash
# Basic IBM Spectrum LSF execution script (resource requests via bsub command line)
mpirun ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p "$1" --thread "$2"
If the script is saved as run_lsf.sh, we can schedule a Monolix project to run with the following command. This command replicates the Slurm example's resource allocation (distributing the run across 4 tasks [i.e., MPI processes] on 4 nodes, with each task running 8 threads). In IBM Spectrum LSF, this is achieved by requesting 4 processes (-n 4), placing 1 process per host (span[ptile=1]), and requesting 8 cores bound to each process (affinity[core(8)]).
$ bsub -n 4 -R "span[ptile=1] affinity[core(8)]" < run_lsf.sh mlxtran_project_path 8
bsub: the LSF job submission command.
-n 4: requests 4 MPI processes (tasks) in total for the job.
-R "span[ptile=1] affinity[core(8)]": a resource requirement string specifying:
span[ptile=1]: places each of the 4 MPI processes onto a separate host (i.e., node),
affinity[core(8)]: tells LSF to allocate and bind 8 CPU cores to each of the 4 processes, ensuring that each multithreaded Monolix process has dedicated resources. Note: the exact affinity syntax can vary with the IBM Spectrum LSF version and configuration; consult your local IBM Spectrum LSF documentation.
< run_lsf.sh: redirects the script content to the bsub command.
mlxtran_project_path: the path to the Monolix project file (becomes $1).
8: the number of threads for each Monolix task (becomes $2, used by Monolix via --thread and requested from LSF via affinity[core(8)]).
The number of threads indicated via --thread should be the same as the number of cores bound to each MPI process with affinity[core()].
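One way to keep these two values from drifting apart is to define the core count once and reuse it for both the affinity request and the --thread argument. In this sketch the submission command is only printed rather than executed, since bsub is not available outside an LSF cluster.

```shell
# Define the per-process core count once, then reuse it both in the
# bsub resource request and as the --thread value, so they cannot diverge.
# The command is only printed here; on a real cluster it would be run directly.
CORES=8
SUBMIT="bsub -n 4 -R \"span[ptile=1] affinity[core($CORES)]\" < run_lsf.sh mlxtran_project_path $CORES"
echo "$SUBMIT"
```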
The span[ptile=xx] option indicates the number of MPI processes per node. Within a node, parallelization using multithreading is more efficient than parallelization using several processes. It is thus recommended to set span[ptile=1] (1 MPI process per node) and use the --thread argument of distMonolix to define the number of threads.
Additional arguments, such as a time limit (-W HH:MM), job name (-J jobname), or queue (-q queuename), can be provided to the bsub command either through the command line or through #BSUB directives within the batch script. All available options are listed in the IBM Spectrum LSF documentation (man bsub).
Here is how the run_lsf.sh file should look if we want to assign a job name through the file:
#!/bin/bash
#BSUB -J monolixRun
mpirun ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p "$1" --thread "$2"
After submitting the job using the bsub command, we can use the bjobs command to check the status of a run:
$ bsub -n 4 -R "span[ptile=1] affinity[core(8)]" < run_lsf.sh mlxtran_project_path 8
Job is submitted to default queue <normal>.
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
98766 lixoft PEND normal login01 monolixRun Apr 3 16:28:00
# --- A short time later ---
$ bjobs
JOBID USER STAT QUEUE FROM_HOST EXEC_HOST JOB_NAME SUBMIT_TIME
98766 lixoft RUN normal login01 host1 monolixRun Apr 3 16:28:15
We can cancel the run using bkill
and specifying the job ID:
$ bkill 98766
Job is being terminated
For further assistance, contact us.