MPI-Parallel Saddle Point Search¶
What you will learn¶
This tutorial walks through running a GP-accelerated dimer saddle point search with MPI parallelism. By the end you will be able to:
Build
gprdwith the ScaLAPACK backend for distributed dense linear algebraPrepare input files (
.congeometry, capnp config, orientation) for any molecular systemRun the search with a PET-MAD universal ML potential via
rgpotScale across MPI ranks and measure the speedup
Prerequisites¶
A working pixi installation
An ML potential model file (PET-MAD
.ptor any metatomic-compatible model)A
.congeometry file for your system
Building with MPI support¶
The ScaLAPACK backend distributes both the GP kernel assembly and the Cholesky factorization across MPI ranks. Install the environment and build:
pixi install -e scalapack
pixi run -e scalapack meson setup bld \
-Dlinalg_backend=scalapack \
-Duse_readcon=true \
-Duse_capnp=true \
-Duse_hdf5=false \
-Dwith_tests=true
pixi run -e scalapack ninja -C bld
For MetatomicPot (PET-MAD), you also need to pass cmake prefix paths for torch, metatensor, and metatomic. These are found via Python:
SP=$(pixi run -e scalapack python3 -c \
'import site; print(site.getsitepackages()[0])')
cat > torch_native.ini << EOF
[built-in options]
cmake_prefix_path = [
'$SP/torch/share/cmake',
'$SP/metatensor/lib/cmake',
'$SP/metatensor/torch/torch-2.7/lib/cmake',
'$SP/metatomic/torch/torch-2.7/lib/cmake'
]
cpp_args = [
'-I$SP/metatensor/include',
'-I$SP/metatensor/torch/torch-2.7/include',
'-I$SP/metatomic/torch/torch-2.7/include',
'-I$SP/vesin/include'
]
EOF
LIBRARY_PATH="$SP/vesin/lib:$LIBRARY_PATH" \
pixi run -e scalapack meson setup bld \
--native-file torch_native.ini \
-Dlinalg_backend=scalapack \
-Duse_readcon=true \
-Duse_capnp=true
pixi run -e scalapack ninja -C bld
Verify the build:
pixi run -e scalapack meson test -C bld
# Expect 41/41 tests pass (or 43 with scaling tests)
Preparing input files¶
Geometry: pos.con¶
The .con format is used by eOn and stores atomic positions, types, cell
dimensions, and frozen/moving flags. Example for oxadiazole (9 atoms):
Generated by eOn
20.000000 20.000000 20.000000
90.000000 90.000000 90.000000
4
2 1 2 4
14.000000 15.980000 15.980000 1.008000
N
Coordinates of component 1
12.709927 12.943985 10.896305 0 1
...
The fifth column (0 or 1) marks each atom as moving (0) or frozen (1).
gprd reads this via readcon-core and automatically assigns atom types from
atomic numbers.
Configuration: capnp binary¶
GPR dimer parameters are stored as a Cap’n Proto binary. Generate one from a TOML config:
# Convert an eOn config.ini to the TOML format:
python3 schema/eon_ini_to_toml.py config.ini --con pos.con -o config.toml
# Generate the capnp binary:
uv run schema/gen_capnp_params.py config.toml params.bin
Key parameters to tune per system:
Parameter |
TOML key |
Typical value |
|---|---|---|
Convergence force |
|
0.01 |
Max outer iterations |
|
300 |
SCG max iterations |
|
400 |
SCG lambda |
|
100 |
GP noise variance |
|
1e-5 |
Active radius |
|
30.3 |
Dimer orientation: direction.dat¶
One line per atom with dx dy dz displacement components. This sets the
initial dimer orientation along a physically meaningful direction (e.g. a
reaction coordinate). If omitted, gprd generates a random orientation.
Running the search¶
Single rank (serial)¶
gprd --pos pos.con \
--orient direction.dat \
--config params.bin \
--potential metatomic \
--model pet-mad-xs-v1.5.0.pt
Multiple ranks (MPI parallel)¶
mpirun -np 4 gprd --pos pos.con \
--orient direction.dat \
--config params.bin \
--potential metatomic \
--model pet-mad-xs-v1.5.0.pt
Available potentials: xtb (GFN2-xTB), gfn2, gfnff (GFN force field),
lj (Lennard-Jones), cuh2 (EAM for Cu+H2), metatomic (any metatrain
model).
What gets parallelized¶
The GP pipeline distributes work across MPI ranks at three levels:
Kernel matrix assembly: The covariance matrix
Kis built tile by tile directly into a 2D block-cyclic distributed buffer. Each rank walks the lower-triangle of observation tile pairs and the BLACS ownership predicate selects exactly one rank per tile; owned tiles are evaluated locally and written into the rank’s block-cyclic slice. No replicated fullKis materialized and noMPI_Allreducefires on the assembled matrix. A legacy round-robin + Allreduce assembly (do_reduce=true) remains for non-ScaLAPACK callers and unit tests.Cholesky factorization:
K = L L^Tvia ScaLAPACKpdpotrfover the same 2D block-cyclic BLACS process grid that owns the assembled tiles – no scatter step needed. Triangular solves (pdtrsm) and inverse (pdpotri) are also distributed and reuse the cached factor.Gradient matrices: The
dK/d\thetamatrices for each hyperparameter follow the same tile-based ownership pattern; the per-hyperparameter trace term is computed locally per rank and a singleordered_sum_bcastreduces alln_hptraces in one collective.
The force evaluations (PET-MAD inference) are replicated on all ranks. For systems where force evaluation is cheap relative to GP operations (e.g. ML potentials on small molecules), the MPI scaling is dominated by the GP linear algebra.
Scaling results¶
CuH2 (218 atoms, EAM potential)¶
The gprd binary on the hardcoded CuH2 system shows near-linear scaling:
Ranks |
Wall time (s) |
Speedup |
|---|---|---|
1 |
26.1 |
1.0x |
2 |
14.4 |
1.8x |
4 |
8.3 |
3.1x |
8 |
4.7 |
5.6x |
Cyclopentadienyl radical (14 atoms, PET-MAD-xs-v1.5.0)¶
A real saddle point search on a hydrocarbon radical. The GP training set grows from 86 to 500+ data points as the dimer converges. With PET-MAD as the potential:
Ranks |
Wall time (s) |
Speedup |
Outer iterations |
Force evals |
|---|---|---|---|---|
1 |
4829.5 |
1.0x |
17 |
23 |
4 |
558.7 |
8.6x |
13 |
19 |
16 |
263.8 |
18.3x |
13 |
19 |
All multi-rank runs use OMP_NUM_THREADS=2 to control torch thread
contention. The np=16 run uses rank-0-only force evaluation with
MPI_Bcast, eliminating 15 redundant ML inference calls.
The np=4 run converges in 9.3 minutes with the dimer finding a saddle point at energy -6.952 eV. The observation count grows to 500+ during the search, giving kernel matrices of dimension 500 x 43 = 21,500 – well into the regime where distributed linear algebra matters.
Convergence path¶
The maximum force on any atom decreases across outer relaxation iterations:
Iteration np=1 np=4
0 52.18 52.18
1 8.99 8.65
2 4.01 4.08
3 3.75 4.07
4 3.37 5.78
5 2.50 1.70
6 2.80 0.96
7 2.87 0.81
8 2.61 0.26
9 2.57 0.16
10 4.69 0.05
11 1.12 0.02
12 0.55 0.01 (np=4 converged)
13 0.25
14 0.06
15 0.03
16 0.02 (np=1 converged)
The slight differences between ranks (e.g. 8.99 vs 8.65 at iteration 1)
are from floating-point non-associativity in the distributed reductions
(ordered_sum_bcast and MPI_Allreduce inside the BLAS calls). Both
runs converge to the same saddle point within the force tolerance.
Troubleshooting¶
Warning! Matrix L is not positive definite¶
The GP covariance matrix became numerically singular during SCG optimization. The SCG automatically increases its damping parameter (lambda) and retries. This is normal for ill-conditioned systems and does not indicate a bug.
Warning! Function value at xnew not finite¶
SCG took a step that produced NaN or infinity in the log-marginal likelihood. This happens when hyperparameters drift to extreme values (e.g. lengthscale near zero). The SCG damping mechanism handles it automatically.
Dimer not converging¶
If Current Max is not decreasing across outer iterations:
Check that
--orientprovides a physically meaningful direction (along a reaction coordinate, not random)Increase
max_outer_iterationsTry a smaller
dimer_separation(default 0.01)Ensure the potential gives reasonable forces for your system (test with a single-point evaluation first)