
Evaluating MPI Efficiency

Overview

This example demonstrates the efficiency of Message Passing Interface (MPI) parallelization in MonolixSuite for a large-scale quantitative systems pharmacology (QSP) job.
The benchmark used here is a Neuro-Dynamic QSP model with 4056 subjects, run for 20 iterations of the SAEM (stochastic approximation expectation-maximization) algorithm.

The results illustrate how distributing the workload across multiple compute nodes significantly reduces execution time compared to single-node symmetric multiprocessing (SMP) execution.

Benchmark Configuration

| Parameter             | Description                          |
|-----------------------|--------------------------------------|
| Model                 | Neuro-Dynamic QSP                    |
| SAEM iterations       | 20                                   |
| Nodes                 | 1 – 20                               |
| Cores per node        | 96                                   |
| Parallel modes tested | SMP (1 node) and MPI (1 – 20 nodes)  |

Results Summary

| Parallel Mode | # of Nodes | # of Cores | Wallclock Time (mins) | Speed-up (fold) |
|---------------|-----------:|-----------:|----------------------:|----------------:|
| SMP           |          1 |         96 |                  55   |             1.0 |
| MPI           |          1 |         96 |                  43   |             1.3 |
| MPI           |          2 |        192 |                  25   |             2.2 |
| MPI           |          3 |        288 |                  18   |             3.0 |
| MPI           |          4 |        384 |                  13.8 |             4.0 |
| MPI           |          5 |        480 |                  12   |             4.6 |
| MPI           |         10 |        960 |                   8   |             6.9 |
| MPI           |         20 |       1920 |                   5.8 |             9.5 |
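The speed-up values in the table are simply the single-node SMP wallclock time divided by each run's wallclock time. A minimal sketch reproducing them from the timings above (the dictionary of timings is constructed here for illustration, not part of any MonolixSuite API):

```python
# Speed-up relative to the 55-minute single-node SMP baseline.
# Wallclock times (minutes) are taken from the results table above.
baseline_smp = 55.0

# MPI wallclock times, keyed by node count.
mpi_runs = {1: 43.0, 2: 25.0, 3: 18.0, 4: 13.8, 5: 12.0, 10: 8.0, 20: 5.8}

for nodes, minutes in mpi_runs.items():
    speedup = baseline_smp / minutes
    print(f"{nodes:>2} node(s): {speedup:.1f}x")
```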

Performance Analysis

Speed-Up Behavior

  • The performance improves nearly linearly up to around 5 nodes, achieving a 4.6× speed-up over the single-node SMP run.

  • Beyond 5 nodes, speed-up continues to increase, reaching 9.5× with 20 nodes (1920 cores).

  • The curve indicates diminishing returns at higher node counts, typical for communication-bound workloads.

The following plot shows the scaling behavior:

[Plot: MPI speed-up vs. number of nodes]

The trend reflects a strong scaling regime where computational work is distributed effectively but inter-node communication gradually becomes the dominant factor.

Efficiency Calculation

Parallel efficiency is defined as the speed-up divided by the number of nodes:

Efficiency(N) = Speed-up(N) / N × 100 %

| # of Nodes | Speed-up | Efficiency (%) |
|-----------:|---------:|---------------:|
|          1 |      1.3 |         130 %* |
|          2 |      2.2 |         110 %* |
|          3 |      3.0 |          100 % |
|          4 |      4.0 |          100 % |
|          5 |      4.6 |           92 % |
|         10 |      6.9 |           69 % |
|         20 |      9.5 |           48 % |

* Values > 100 % at small scale reflect measurement noise and cache effects rather than true superlinear scaling.
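The efficiency column follows directly from the definition above. A short sketch computing it from the reported speed-ups:

```python
# Parallel efficiency = speed-up / node count x 100 %, using the
# speed-up values reported in the results table above.
speedups = {1: 1.3, 2: 2.2, 3: 3.0, 4: 4.0, 5: 4.6, 10: 6.9, 20: 9.5}

for nodes, speedup in speedups.items():
    efficiency = speedup / nodes * 100
    print(f"{nodes:>2} node(s): {efficiency:.0f} %")
```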

Interpretation

  • Up to 4 nodes, MPI scales almost perfectly (≈ 100 % efficiency).

  • Between 5 and 10 nodes, efficiency drops moderately as communication overhead grows.

  • At 20 nodes, performance remains strong but efficiency decreases to ~50 %, which is typical for distributed workloads with high synchronization needs.

Best Practices for MPI Execution

  1. Use sufficient problem size:
    The computational load per core should be high enough to offset communication costs.

  2. Monitor scaling efficiency:
    Efficiency below 70 % typically signals that adding more nodes is no longer cost-effective.
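The 70 % rule of thumb above can be expressed as a simple check. The helper below is a hypothetical sketch (not part of MonolixSuite) that flags whether a given node count is still worth using:

```python
# Hypothetical helper: decide whether a node count is still
# cost-effective, using the 70 % efficiency rule of thumb.
def worth_scaling(speedup: float, nodes: int, threshold: float = 0.70) -> bool:
    """Return True if parallel efficiency meets the threshold."""
    return speedup / nodes >= threshold

print(worth_scaling(4.6, 5))   # 92 % efficiency
print(worth_scaling(6.9, 10))  # 69 % efficiency, just below the cutoff
```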

Conclusion

MPI parallelization in MonolixSuite offers near-linear speed-up up to 4–5 nodes and substantial further gains up to 20 nodes.
For large QSP jobs, this reduces a roughly one-hour single-node run to under six minutes, making high-throughput parameter estimation and simulation studies practical on modern HPC clusters.
