Overview
This example demonstrates the efficiency of Message Passing Interface (MPI) parallelization in MonolixSuite for a large-scale quantitative systems pharmacology (QSP) job.
The benchmark used here is a Neuro-Dynamic QSP model with 4056 subjects and 20 iterations of SAEM.
The results illustrate how distributing the workload across multiple compute nodes significantly reduces execution time compared to single-node symmetric multiprocessing (SMP) execution.
Benchmark Configuration
| Parameter | Description |
|---|---|
| Model | Neuro-Dynamic QSP |
| SAEM Iterations | 20 |
| Nodes | 1–20 |
| Cores per node | 96 |
| Parallel modes tested | SMP (1 node) and MPI (1–20 nodes) |
Results Summary
| Parallel Mode | # of Nodes | # of Cores | Wallclock Time (min) | Speed-up (fold) |
|---|---|---|---|---|
| SMP | 1 | 96 | 55 | 1.0 |
| MPI | 1 | 96 | 43 | 1.3 |
| MPI | 2 | 192 | 25 | 2.2 |
| MPI | 3 | 288 | 18 | 3.0 |
| MPI | 4 | 384 | 13.8 | 4.0 |
| MPI | 5 | 480 | 12 | 4.6 |
| MPI | 10 | 960 | 8 | 6.9 |
| MPI | 20 | 1920 | 5.8 | 9.5 |
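The speed-up column follows directly from the wallclock times, using the single-node SMP run (55 min) as the baseline. A minimal Python sketch, with the times transcribed from the table above (values match the table to rounding):

```python
# Wallclock times in minutes, keyed by (parallel mode, node count),
# transcribed from the results table.
times = {
    ("SMP", 1): 55.0,
    ("MPI", 1): 43.0,
    ("MPI", 2): 25.0,
    ("MPI", 3): 18.0,
    ("MPI", 4): 13.8,
    ("MPI", 5): 12.0,
    ("MPI", 10): 8.0,
    ("MPI", 20): 5.8,
}

# Speed-up is relative to the single-node SMP baseline.
baseline = times[("SMP", 1)]
speedups = {key: baseline / t for key, t in times.items()}

for (mode, nodes), s in sorted(speedups.items()):
    print(f"{mode} on {nodes:>2} node(s): {s:.1f}x")
```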
Performance Analysis
Speed-Up Behavior
- Performance improves nearly linearly up to around 5 nodes, achieving a 4.6× speed-up over the single-node SMP run.
- Beyond 5 nodes, speed-up continues to increase, reaching 9.5× with 20 nodes (1920 cores).
- The curve indicates diminishing returns at higher node counts, typical for communication-bound workloads.
The following plot shows the scaling behavior:
The trend reflects a strong scaling regime where computational work is distributed effectively but inter-node communication gradually becomes the dominant factor.
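One way to quantify this communication-bound behavior is to invert Amdahl's law, S(n) = 1 / (f + (1 − f)/n), and estimate the effective serial fraction f at each node count. The sketch below does this against the MPI single-node run as baseline; it is an illustrative check added here, not part of the original benchmark:

```python
# MPI wallclock times (minutes) by node count, from the results table.
times = {1: 43.0, 2: 25.0, 3: 18.0, 4: 13.8, 5: 12.0, 10: 8.0, 20: 5.8}

def serial_fraction(n, speedup):
    # Invert Amdahl's law S = 1 / (f + (1 - f)/n):
    #   f = (1/S - 1/n) / (1 - 1/n)
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n)

# Speed-up here is relative to the MPI 1-node run, so f reflects the
# non-parallelizable (serial + communication) share of that run.
estimates = {n: serial_fraction(n, times[1] / t) for n, t in times.items() if n > 1}

for n, f in sorted(estimates.items()):
    print(f"{n:>2} nodes: f ~ {f:.3f}")
```

The estimates cluster around f ≈ 0.09–0.16, consistent with near-linear scaling at small node counts and roughly 10× speed-up as an asymptotic ceiling.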
Efficiency Calculation
Parallel efficiency is defined as:

Efficiency (%) = Speed-up / (# of Nodes) × 100
| # of Nodes | Speed-up | Efficiency (%) |
|---|---|---|
| 1 | 1.3 | 130* |
| 2 | 2.2 | 110* |
| 3 | 3.0 | 100 |
| 4 | 4.0 | 100 |
| 5 | 4.6 | 92 |
| 10 | 6.9 | 69 |
| 20 | 9.5 | 48 |
* Values > 100 % at small scale reflect measurement noise and cache effects rather than true superlinear scaling.
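The efficiency column follows mechanically from the definition above; a short Python sketch using the speed-up values from the table:

```python
# Speed-up (fold) by node count, from the results table.
speedups = {1: 1.3, 2: 2.2, 3: 3.0, 4: 4.0, 5: 4.6, 10: 6.9, 20: 9.5}

# Efficiency (%) = speed-up / node count * 100.
efficiency = {n: 100.0 * s / n for n, s in speedups.items()}

for n, e in sorted(efficiency.items()):
    print(f"{n:>2} nodes: {e:.0f} %")
```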
Interpretation
- Up to 4 nodes, MPI scales almost perfectly (≈ 100 % efficiency).
- Between 5 and 10 nodes, efficiency drops moderately due to increased communication overhead.
- At 20 nodes, performance remains strong but efficiency decreases to ~50 %, which is typical for distributed workloads with high synchronization needs.
Best Practices for MPI Execution
- Use sufficient problem size: the computational load per core should be high enough to offset communication costs.
- Monitor scaling efficiency: efficiency below 70 % typically signals that further node scaling is not cost-effective.
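The second rule can be automated: given measured speed-ups, report the largest node count that still meets the efficiency threshold. A minimal sketch, using the speed-up values from this benchmark:

```python
# Speed-up (fold) by node count, from the results table.
speedups = {1: 1.3, 2: 2.2, 3: 3.0, 4: 4.0, 5: 4.6, 10: 6.9, 20: 9.5}

def max_efficient_nodes(speedups, threshold=70.0):
    """Return the largest node count whose parallel efficiency
    (speed-up / nodes * 100) stays at or above the threshold."""
    best = 1
    for n in sorted(speedups):
        if 100.0 * speedups[n] / n >= threshold:
            best = n
    return best

print(max_efficient_nodes(speedups))
```

With the 70 % threshold this flags 5 nodes as the cost-effective limit for this benchmark, matching the interpretation above.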
Conclusion
MPI parallelization in MonolixSuite offers near-linear speed-up up to 4–5 nodes and substantial performance gains up to 20 nodes.
For large QSP simulations, this reduces a roughly hour-long run to under six minutes, making high-throughput parameter estimation and simulation studies practical on modern HPC clusters.