Parallelization over multiple nodes of a cluster
Running distributed Monolix on a cluster
To distribute calculations over multiple nodes, a dedicated Monolix executable named distMonolix must be used. It comes with the installation of the Linux version of MonolixSuite. A separate cluster license is required and must be obtained in order to use distMonolix. All settings on this page can also be used with distMonolix on a cluster.
Monolix installation
To run MonolixSuite on a cluster, each cluster node must have access to the MonolixSuite directory and to the user home directory. There are therefore two possible setups:
MonolixSuite is installed on each node.
The MonolixSuite installation is shared: MonolixSuite is installed on a master server, and each cluster node accesses it through a shared directory (via CIFS, a network drive, NFS, …), as sketched below.
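As an illustration of the shared-installation setup, here is a minimal sketch that mounts a MonolixSuite directory exported over NFS; the host name master and the path /opt/MonolixSuite are hypothetical and must be adapted to your environment:
# On each cluster node (hypothetical host name and export path):
sudo mkdir -p /opt/MonolixSuite
sudo mount -t nfs master:/opt/MonolixSuite /opt/MonolixSuite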
License management
On a cluster, the usage of our applications is handled with the license management system described here.
The license management server runs on a physical machine and manages the applications through its license file. The license file must be placed in the folder {MonolixSuite install path}/config/system/access (and also in {MonolixSuite install path}/bin/Monolix_mcr/runtime/config/system/access for MonolixSuite2016R1): on every node in installation case 1, or only on the master server in case 2.
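For example, assuming MonolixSuite2024R1 is installed in the default location and the license file is named license.lic (a hypothetical name), it could be copied as follows:
cp license.lic $HOME/Lixoft/MonolixSuite2024R1/config/system/access/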
Running Monolix on a single node
To perform a Monolix run on a single node, you can call the monolix executable located in the lib folder (typically $HOME/Lixoft/MonolixSuite2024R1/lib/):
monolix --no-gui -p mlxtran_project_path
where mlxtran_project_path is a Monolix project with a .mlxtran extension.
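For example, assuming a project located at /data/projects/warfarin_project.mlxtran (a hypothetical path):
$HOME/Lixoft/MonolixSuite2024R1/lib/monolix --no-gui -p /data/projects/warfarin_project.mlxtran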
Running Monolix on multiple nodes using MPI
To run Monolix on multiple nodes, Open MPI needs to be installed on all nodes. To launch the run through MPI directly, use the distMonolix executable in the lib folder (typically $HOME/Lixoft/MonolixSuite2024R1/lib/) with the following command, which distributes Monolix over the 4 nodes listed in hostfile.txt:
mpirun -n 4 -hostfile hostfile.txt distMonolix -p mlxtran_project_path
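The hostfile is a plain text file listing the machines to use, one per line, optionally followed by the number of slots available on that machine. A minimal sketch, assuming four nodes with hypothetical host names node01 to node04:
node01 slots=1
node02 slots=1
node03 slots=1
node04 slots=1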
The arguments that can be provided to distMonolix are the same as for Monolix. This includes --tool to select a multi-run task (model building, convergence assessment, bootstrap) and --config to provide settings for this task.
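For instance, a multi-run task could be launched as sketched below, where <task_name> and <config_file> are placeholders for the values accepted by the regular Monolix command line:
mpirun -n 4 -hostfile hostfile.txt distMonolix -p mlxtran_project_path --tool <task_name> --config <config_file>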
MPI troubleshooting
Different versions of distributed Monolix were built against different versions of Open MPI. If a more recent version is installed on the cluster, the following error may appear when trying to run distributed Monolix:
distMonolix: error while loading shared libraries: libmpi_cxx.so.YY: cannot open shared object file: No such file or directory
To resolve the error, you have to create two symbolic links from your Open MPI installation:
from your installation of libmpi.so (usually in /usr/lib64/openmpi/lib/libmpi.so.XX) to libmpi.so.YY (in the MonolixSuite lib folder):
sudo ln -s your_installation_of_openmpi/lib/libmpi.so.XX installation_of_MonolixSuiteXXXX/lib/libmpi.so.YY
from your installation of libmpi_cxx.so (usually in /usr/lib64/openmpi/lib/libmpi_cxx.so.XX) to libmpi_cxx.so.YY (in the MonolixSuite lib folder):
sudo ln -s your_installation_of_openmpi/lib/libmpi_cxx.so.XX installation_of_MonolixSuiteXXXX/lib/libmpi_cxx.so.YY
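To identify the version suffixes present on your system and the one expected by distMonolix, you can, for example, list the installed Open MPI libraries and inspect the executable's dependencies (the paths below use the same placeholders as above):
ls /usr/lib64/openmpi/lib/libmpi*.so.*
ldd installation_of_MonolixSuiteXXXX/lib/distMonolix | grep libmpi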
Distributed calculation
How the calculations are distributed differs between tasks:
in MCMC (SAEM, Fisher by Stochastic Approximation, Conditional Distribution): pools of ids are created and distributed across processes,
in Importance Sampling: the same is done with simulation pools,
in multi-run tasks (bootstrap, convergence assessment): each run is distributed over all processes.
Using distributed Monolix with a scheduler
Usually, runs on clusters are scheduled using a job scheduling application (e.g., Torque, PBS, GridEngine, Slurm, LSF, …). After a Monolix run is submitted through the job scheduler, it waits in a queue until the requested resources become available and is then executed.
Generally, a task is submitted to the cluster using a scheduler-specific command, e.g. qsub in the case of Torque, PBS or GridEngine (formerly SGE). This command runs a script, provided as a parameter, on a cluster node chosen by the cluster scheduler.
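As an illustration, here is a minimal sketch of a Torque/PBS submission script (saved, for example, as run.pbs and submitted with qsub run.pbs), assuming 4 nodes with one process each, the default MonolixSuite2024R1 install path, and a hypothetical project path:
#!/bin/bash
#PBS -N monolixRun
#PBS -l nodes=4:ppn=1
# $PBS_NODEFILE lists the nodes allocated by the scheduler
mpirun -hostfile $PBS_NODEFILE ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p /path/to/project.mlxtran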
Scheduling Monolix runs with Slurm Workload Manager
When using the Slurm Workload Manager on a cluster, runs are submitted using the sbatch command, which takes the path to a batch script as an argument. A simple example of a batch script that can be used to run Monolix is shown here (note that there is no need to pass the number of processes to the mpirun command in the script, since Slurm automatically passes the allocation to Open MPI):
#!/bin/bash
mpirun ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p $1
If the script is saved as run.sh, we can schedule a Monolix project to run with the following command (the command will distribute the run across 16 tasks on 4 nodes):
$ sbatch -n 16 --nodes 4 run.sh mlxtran_project_path
Additional arguments, such as a time limit or a job name, can be provided to the sbatch command either on the command line or in the batch script. All the available options are listed in the Slurm sbatch documentation.
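For example, a time limit and a job name can be set directly on the command line using the standard sbatch options --time and --job-name (the values below are only illustrative):
$ sbatch -n 16 --nodes 4 --time=02:00:00 --job-name=monolixRun run.sh mlxtran_project_path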
Here is what the run.sh file should look like if we want to assign a job name through the file:
#!/bin/bash
#SBATCH --job-name=monolixRun
mpirun ~/Lixoft/MonolixSuite2024R1/lib/distMonolix -p $1
After submitting the job using the sbatch command, we can use the squeue command to check the status of the run:
$ sbatch -n 16 --nodes 4 run.sh mlxtran_project_path
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
86 debug monolixR lixoft R 0:02 4 slave[1-4]
We can cancel the run using scancel and specifying the job ID:
$ scancel 86
For further assistance, contact us.