
Parallelization over multiple nodes of a cluster

Running distributed Monolix on a cluster

To distribute calculations over multiple nodes, a special Monolix executable needs to be used. This executable, named distMonolix, comes with the installation of the Linux version of MonolixSuite. A separate cluster license is available and needs to be obtained in order to use distMonolix. All settings applicable to the monolix executable on this page can also be used with distMonolix on a cluster.

Monolix installation

To run MonolixSuite on a cluster, each cluster node must have access to the MonolixSuite directory and to the user home directory. Thus, there are two possibilities.

  1. MonolixSuite is installed on each node.

  2. MonolixSuite installation is shared. MonolixSuite is installed on a master server. Each cluster node accesses MonolixSuite through a shared directory (via CIFS, Network drive, NFS, …).

License management

On a cluster, the usage of our applications is managed with the license management system described here.
The license management server runs on a physical machine and manages the application through its license file. The associated license file has to be placed in the folder {MonolixSuite install path}/config/system/access (and also {MonolixSuite install path}/bin/Monolix_mcr/runtime/config/system/access for MonolixSuite2016R1), i.e., either on all nodes in installation case 1, or only on the master server in configuration 2.

Running Monolix on a single node

If a Monolix run is performed on a single node, it is possible to run Monolix using its executable in the lib folder (typically $HOME/Lixoft/MonolixSuite2024R1/lib/):

CODE
monolix --no-gui -p mlxtran_project_path

where mlxtran_project_path is a Monolix project with a .mlxtran extension.

Running Monolix on multiple nodes using MPI

To run Monolix on multiple nodes, Open MPI needs to be installed on all nodes. To run with MPI directly, using the distMonolix executable in the lib folder (typically $HOME/Lixoft/MonolixSuite2024R1/lib/), you can use the following command. It distributes Monolix over the 4 nodes listed in hostfile.txt, where hostfile.txt specifies that each host (i.e., node) provides 1 slot, so that a single MPI process runs per node:

CODE
mpirun -n 4 -hostfile hostfile.txt distMonolix -p mlxtran_project_path --thread 16
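
For reference, a hostfile.txt matching this example could look as follows (the host names node01 to node04 are placeholders for your actual node names; slots=1 limits each host to a single MPI process):

CODE
node01 slots=1
node02 slots=1
node03 slots=1
node04 slots=1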

Arguments that can be provided to distMonolix are the same ones as with Monolix. This includes --tool to select a multi-run task (model building, convergence assessment, bootstrap) and --config to provide settings for this task. The --thread argument indicates the number of threads for each MPI process, see details below.

MPI and multithreading

The distMonolix executable is multithreaded, meaning that each MPI process uses multiple threads.

Multithreading is generally faster than distributing computation across multiple MPI processes on the same node, because threads share memory space and avoid the overhead of inter-process communication. MPI, on the other hand, is designed for distributed memory environments, and using it for core-level parallelism on a single node can introduce unnecessary complexity and reduce performance.

The number of CPU cores used for multithreading by distMonolix is determined either by the --thread argument, or, if this argument is not provided, by the value specified in the config.ini file. By default, the config.ini file sets the number of threads to the number of available cores on the node. To control thread usage explicitly, you can pass --thread <number> when launching distMonolix.

It is strongly recommended to run only one MPI process per node and use multithreading for parallelism within that node. Running multiple MPI processes per node can lead to resource contention and significantly reduce performance.

We strongly recommend using the --thread argument when running distMonolix to explicitly control the number of threads per MPI process. If --thread is not specified, distMonolix will default to the value in config.ini, which may not always align with the allocated resources, potentially leading to suboptimal performance.
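
For example, on a hypothetical cluster of 4 nodes with 8 cores each, one would launch a single MPI process per node (using the hostfile shown above) and let each process use the 8 cores of its node:

CODE
mpirun -n 4 -hostfile hostfile.txt distMonolix -p mlxtran_project_path --thread 8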

MPI troubleshooting

Different versions of distributed Monolix were built using different versions of Open MPI. If an incompatible (e.g., more recent) version is installed on the cluster, the following error may appear when trying to run distributed Monolix:

distMonolix: error while loading shared libraries: libmpi_cxx.so.YY: cannot open shared object file: No such file or directory

To resolve the error, you have to create symbolic links from your installation:

  • from your installation of libmpi.so (usually in /usr/lib64/openmpi/lib/libmpi.so.XX) to libmpi.so.YY (in the MonolixSuite lib folder):

    CODE
    sudo ln -s your_installation_of_openmpi/lib/libmpi.so.XX installation_of_MonolixSuiteXXXX/lib/libmpi.so.YY
  • from your installation of libmpi_cxx.so (usually in /usr/lib64/openmpi/lib/libmpi_cxx.so.XX) to libmpi_cxx.so.YY (in the MonolixSuite lib folder):

    CODE
    sudo ln -s your_installation_of_openmpi/lib/libmpi_cxx.so.XX installation_of_MonolixSuiteXXXX/lib/libmpi_cxx.so.YY
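
To check which Open MPI library versions are present on a node and which libraries distMonolix is linked against, standard commands such as ls and ldd can be used (the paths below are typical examples and may differ on your system):

CODE
# Open MPI libraries shipped with the system installation
ls /usr/lib64/openmpi/lib/libmpi*.so*
# MPI libraries expected by distMonolix
ldd ~/Lixoft/MonolixSuite2024R1/lib/distMonolix | grep libmpi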

Distributed calculation

How the distribution is done differs between different tasks:

  • in MCMC (SAEM, Fisher by Stochastic Approximation, Conditional Distribution): pools of ids are created and distributed over the MPI processes,

  • in Importance Sampling: the same is done with simulation pools,

  • in multi-run tasks (bootstrap, convergence assessment): each run is distributed over all processes and the runs are performed one after the other.

Using distributed Monolix with a scheduler

Usually, runs on clusters are scheduled using a job scheduling application (e.g., Torque, PBS, GridEngine, Slurm, LSF, …). After submitting a Monolix run with the job scheduling application, the run waits in a queue until enough resources become available, and is then performed.

Generally, a run is submitted to the cluster using a specific command, e.g. qsub in the case of Torque, PBS or GridEngine (formerly SGE). This command runs a script, provided as a parameter, on a cluster node chosen by the cluster scheduler.

Scheduling Monolix runs with Slurm Workload Manager

When using Slurm Workload Manager on a cluster, the runs are submitted using the sbatch command. With the command, a path to a batch script needs to be provided. A simple example of a batch script that can be used to run Monolix is shown here (note that there is no need to provide the number of nodes directly to the mpirun command in the script, since Slurm will automatically pass that information to Open MPI):

BASH
#!/bin/bash
mpirun --bind-to core --map-by slot:PE=$2 ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p $1 --thread $2

While the examples provided use mpirun within sbatch scripts, launching your MPI application directly via srun (e.g., srun ./your_executable) is generally the preferred and often more efficient method on Slurm. srun typically offers better integration with Slurm's resource allocation (--cpus-per-task, task distribution) and process management. We recommend using srun for launching MPI tasks whenever feasible.
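
As an illustration, here is a minimal sketch of an srun-based batch script (assuming the Open MPI used by distMonolix supports launching through Slurm/PMIx on your cluster; the file name run_srun.sh is arbitrary and the thread count is taken from Slurm's $SLURM_CPUS_PER_TASK instead of a second script argument):

BASH
#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=8

# One MPI task per node; each task uses the cores allocated by --cpus-per-task.
# On some clusters, srun --mpi=pmix may be required in addition.
srun --cpus-per-task=$SLURM_CPUS_PER_TASK ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p "$1" --thread "$SLURM_CPUS_PER_TASK"

Such a script would then be submitted with sbatch run_srun.sh mlxtran_project_path.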

If the script is saved as run.sh, we can schedule a Monolix project to run with the following command (the command will distribute the run across 4 tasks on 4 nodes, with each task running 8 threads):

CODE
$ sbatch --nodes 4 --ntasks-per-node 1 --cpus-per-task 8 run.sh mlxtran_project_path 8

Additional arguments, such as a time limit or job name, can be provided to the sbatch command either through the command line or through the batch script. All the available options are listed in the Slurm sbatch documentation (man sbatch).

The --ntasks-per-node option indicates the number of tasks (i.e., MPI processes) per node. Within a node, parallelization using multi-threading is more efficient than parallelization using several processes. It is thus recommended to set --ntasks-per-node to 1 and to use the --thread argument of distMonolix to define the number of threads.

The --cpus-per-task option specifies the number of CPU cores allocated per task. When running distMonolix, each MPI process is considered a task, so setting --cpus-per-task=<N> ensures that each MPI process gets N CPU cores for multithreading. Thus, the value of --cpus-per-task and --thread should be the same.

Here is how the run.sh file should look if we want to assign a job name through the file:

BASH
#!/bin/bash
#SBATCH --job-name=monolixRun

mpirun --bind-to core --map-by slot:PE=$2 ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p $1 --thread $2

After submitting the job using the sbatch command, we can use the squeue command to check the status of a run:

CODE
$ sbatch --nodes 4 --ntasks-per-node 1 --cpus-per-task 8 run.sh mlxtran_project_path 8
$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                86     debug monolixR   lixoft  R       0:02      4 slave[1-4]

We can cancel the run using scancel and specifying the job ID:

CODE
$ scancel 86

Scheduling Monolix runs with IBM Spectrum LSF

When using IBM Spectrum LSF on a cluster, the runs are submitted using the bsub command. With the command, a path to a batch script needs to be provided. A simple example of a batch script that can be used to run Monolix is shown here (note that there is no need to provide the number of processes directly to the mpirun command in the script, since LSF and the MPI library communicate to determine this from the resource request):

BASH
#!/bin/bash
# Basic IBM Spectrum LSF execution script (resource requests via bsub command line)
mpirun ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p "$1" --thread "$2"

If the script is saved as run_lsf.sh, we can schedule a Monolix project to run with the following command. This command aims to replicate the Slurm example's resource allocation (distribute the run across 4 tasks [i.e MPI processes] on 4 nodes with each task running 8 threads). In IBM Spectrum LSF, this is achieved by requesting 4 processes (-n 4), placing 1 process per host (span[ptile=1]), and requesting 8 cores bound to each process (affinity[core(8)]).

CODE
$ bsub -n 4 -R "span[ptile=1] affinity[core(8)]" < run_lsf.sh mlxtran_project_path 8
  • bsub: The LSF job submission command.

  • -n 4: Requests 4 MPI processes (tasks) in total for the job.

  • -R "span[ptile=1] affinity[core(8)]": A resource requirement string specifying:

    • span[ptile=1]: Places each of the 4 MPI processes onto a separate host [i.e node]

    • affinity[core(8)]: Tells LSF to allocate and bind 8 CPU cores to each of the 4 processes. This ensures that each multithreaded Monolix process has dedicated resources. Note: The exact affinity syntax can sometimes vary based on IBM Spectrum LSF version and configuration. Consult your local IBM Spectrum LSF documentation.

  • < run_lsf.sh: Redirects the script content to the bsub command.

  • mlxtran_project_path: The path to the Monolix project file (becomes $1).

  • 8: The number of threads for each Monolix task (becomes $2, used by Monolix via --thread and requested from LSF via affinity[core(8)]).

The number of threads indicated via --thread should be the same as the number of cores bound to each MPI process with affinity[core()].

The span[ptile=xx] indicates the number of MPI processes per node. Within a node, parallelization using multi-threading is more efficient than parallelization using several processes. It is thus recommended to set span[ptile=1] (1 MPI process per node) and use the --thread argument of distMonolix to define the number of threads.

Additional arguments, such as time limit (-W HH:MM), job name (-J jobname), or queue (-q queuename), can be provided to the bsub command either through the command line or through #BSUB directives within the batch script. All available options are listed in the IBM Spectrum LSF documentation (man bsub).

Here is how the run_lsf.sh file should look if we want to assign a job name through the file:

BASH
#!/bin/bash
#BSUB -J monolixRun
mpirun ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p "$1" --thread "$2"
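
Other options mentioned above can be embedded in the same way through additional #BSUB directives, for instance a queue and a wall-clock limit (the queue name normal and the two-hour limit below are placeholders, to be adapted to your cluster):

BASH
#!/bin/bash
#BSUB -J monolixRun
#BSUB -q normal
#BSUB -W 02:00
mpirun ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p "$1" --thread "$2"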

After submitting the job using the bsub command, we can use the bjobs command to check the status of a run:

CODE
$ bsub -n 4 -R "span[ptile=1] affinity[core(8)]" < run_lsf.sh mlxtran_project_path 8
Job <98766> is submitted to default queue <normal>.

$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
98766   lixoft  PEND  normal     login01                 monolixRun Apr  3 16:28:00
# --- A short time later ---
$ bjobs
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
98766   lixoft  RUN   normal     login01     host1       monolixRun Apr  3 16:28:15

We can cancel the run using bkill and specifying the job ID:

CODE
$ bkill 98766
Job <98766> is being terminated

Scheduling Monolix runs with Sun Grid Engine (SGE)

When using Sun Grid Engine (SGE) or its derivatives (like Son of Grid Engine, Oracle Grid Engine) on a cluster, jobs are submitted using the qsub command. This command typically requires a path to a batch script.

Here is an example of a batch script tailored for running distMonolix with SGE, using an Open MPI parallel environment (commonly named orte or similar). Note that SGE sets an environment variable (usually $NSLOTS) indicating the number of allocated process slots when a parallel environment (-pe) is requested. The script uses this variable to compute the number of MPI processes passed to mpirun.

BASH
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -N monolixRun
#$ -j y

# $NSLOTS is set by SGE to the total number of slots requested via -pe
TOTAL_CORES=$NSLOTS
CORES_PER_NODE=$2
NUM_MPI_PROCS=$((TOTAL_CORES / CORES_PER_NODE))

mpirun -np $NUM_MPI_PROCS --map-by node ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p $1 --thread $2

If this script is saved as run_sge.sh, we can schedule a Monolix project to run using the following qsub command. This command aims to replicate the resource allocation goal from the LSF/Slurm examples: distribute the run across 4 MPI processes on 4 separate nodes, with each process running 8 threads. In SGE, this is achieved by requesting 32 slots from the orte parallel environment (-pe orte 32). The mpirun command within the script then launches 4 MPI processes (-np $NUM_MPI_PROCS, i.e., 32 slots divided by 8 cores per process), and the --map-by node option directs Open MPI to place each process on a distinct node, if possible according to the cluster configuration and PE setup.

BASH
$ qsub -pe orte 32 run_sge.sh mlxtran_project_path 8
  • qsub: The SGE job submission command.

  • -pe orte 32: Requests the parallel environment named orte and allocates 32 cores for the job. SGE will set the $NSLOTS environment variable to 32 within the job's environment. The specific name (orte) might differ on your cluster; consult your cluster documentation.

  • run_sge.sh: The batch script to be executed.

  • mlxtran_project_path: The path to the Monolix project file (passed as $1 to the script).

  • 8: The number of threads for each Monolix task (passed as $2 to the script and used by distMonolix via --thread).

Important Considerations:

  1. Thread Count and Cores: The number of threads specified via --thread (8 in this example) tells each distMonolix process how many threads to use. You must ensure that the nodes allocated by SGE have at least this many cores available per MPI process allocated to that node. In this setup (requesting 32 slots and launching 4 MPI processes mapped by node), you are relying on each of the 4 nodes having at least 8 cores available for the single Monolix process running there. SGE itself, unlike LSF's affinity, doesn't typically bind cores directly via simple qsub options in this manner; core binding is often handled by the MPI implementation (mpirun flags like --bind-to core) or the PE configuration, if desired.

  2. Process Placement (--map-by node): Using --map-by node with mpirun asks Open MPI to place subsequent processes on different nodes until all available nodes are used, then potentially cycling back (depending on the exact MPI version and configuration). When the number of processes (-np $NUM_MPI_PROCS) is less than or equal to the number of unique nodes allocated by SGE for the requested slots, this effectively achieves one process per node. This is generally recommended for distMonolix, as intra-node parallelization is often more efficiently handled by multi-threading (--thread).

  3. Parallel Environment (-pe): The configuration of the SGE parallel environment (orte in this example) is crucial. It defines how slots are counted and distributed across hosts. Consult your cluster's documentation for details on available PEs and their behavior; the qconf commands shown after this list can also help inspect them.
    Example parallel environment configuration for hybrid processes (assumes that the number of slots per node in the queue definition is equal to the number of cores):

    CODE
       pe_name            orte
       slots              99999
       user_lists         NONE
       xuser_lists        NONE
       start_proc_args    NONE
       stop_proc_args     NONE
       allocation_rule    $round_robin 
       control_slaves     TRUE
       job_is_first_task  FALSE
       urgency_slots      min
       accounting_summary FALSE
       qsort_args         NONE
  4. $NSLOTS: This environment variable is standard in SGE for jobs submitted with -pe, but confirm its availability and name on your specific system.
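
To inspect which parallel environments exist on your cluster and how they are configured, the standard qconf commands can be used (orte below is only an example PE name):

CODE
# List all parallel environments defined on the cluster
qconf -spl
# Show the configuration of a given parallel environment
qconf -sp orte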

For this setup to work efficiently, you MUST verify that your resource request and the cluster's configuration ensure that:

  1. The SGE parallel environment (<pe_name>) you use, combined with the total cores requested (<total_cores>), effectively allocates your job across the intended number of distinct compute nodes (<total_cores> / <cores_per_node>).

  2. Each compute node allocated to your job makes at least <cores_per_node> physical CPU cores exclusively available to the single Monolix MPI process running on it.

Additional Arguments and Script Directives:

SGE options can be provided directly on the qsub command line or embedded within the batch script using #$ directives. Common examples include:

  • -N jobname: Specify a job name (overrides #$ -N in the script).

  • -q queuename: Submit to a specific queue.

  • -l h_rt=HH:MM:SS: Request a maximum wall clock time limit.

  • -j y: Merge standard error into the standard output stream.

  • -o output_file.log: Specify a file for standard output (and error if -j y is used).

The example script run_sge.sh already includes directives for the shell (#$ -S), working directory (#$ -cwd), job name (#$ -N), and merging output streams (#$ -j y).

Monitoring and Managing Jobs:

After submitting the job using qsub, you can check its status using qstat:

BASH
$ qsub -pe orte 32 run_sge.sh mlxtran_project_path 8
Your job 12345 ("monolixRun") has been submitted

$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  12345 0.55500 monolixRun   lixoft       qw    04/16/2025 18:25:00                                    32
# --- A short time later ---
$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
  12345 0.60500 monolixRun   lixoft       r     04/16/2025 18:25:15 all.q@node01.cluster               32
                                                                    all.q@node02.cluster               
                                                                    all.q@node03.cluster               
                                                                    all.q@node04.cluster

(Note: qstat output format can vary significantly between SGE installations).

To cancel a running or pending job, use the qdel command with the job ID:

BASH
$ qdel 12345
user lixoft has deleted job 12345

Always consult your local cluster documentation for the specific SGE configuration, available parallel environments, queue names, resource limits, and recommended practices.

For further assistance, contact us.
