MPI-Parallel Saddle Point Search¶

What you will learn¶

This tutorial walks through running a GP-accelerated dimer saddle point search with MPI parallelism. By the end you will be able to:

Build gprd with the ScaLAPACK backend for distributed dense linear algebra
Prepare input files (.con geometry, capnp config, orientation) for any molecular system
Run the search with a PET-MAD universal ML potential via rgpot
Scale across MPI ranks and measure the speedup

Prerequisites¶

A working pixi installation
An ML potential model file (PET-MAD .pt or any metatomic-compatible model)
A .con geometry file for your system

Building with MPI support¶

The ScaLAPACK backend distributes both the GP kernel assembly and the Cholesky factorization across MPI ranks. Install the environment and build:

pixi install -e scalapack
pixi run -e scalapack meson setup bld \
    -Dlinalg_backend=scalapack \
    -Duse_readcon=true \
    -Duse_capnp=true \
    -Duse_hdf5=false \
    -Dwith_tests=true
pixi run -e scalapack ninja -C bld

For MetatomicPot (PET-MAD), you also need to pass cmake prefix paths for torch, metatensor, and metatomic. These are found via Python:

SP=$(pixi run -e scalapack python3 -c \
    'import site; print(site.getsitepackages()[0])')
cat > torch_native.ini << EOF
[built-in options]
cmake_prefix_path = [
    '$SP/torch/share/cmake',
    '$SP/metatensor/lib/cmake',
    '$SP/metatensor/torch/torch-2.7/lib/cmake',
    '$SP/metatomic/torch/torch-2.7/lib/cmake'
]
cpp_args = [
    '-I$SP/metatensor/include',
    '-I$SP/metatensor/torch/torch-2.7/include',
    '-I$SP/metatomic/torch/torch-2.7/include',
    '-I$SP/vesin/include'
]
EOF
LIBRARY_PATH="$SP/vesin/lib:$LIBRARY_PATH" \
pixi run -e scalapack meson setup bld \
    --native-file torch_native.ini \
    -Dlinalg_backend=scalapack \
    -Duse_readcon=true \
    -Duse_capnp=true
pixi run -e scalapack ninja -C bld

Verify the build:

pixi run -e scalapack meson test -C bld
# Expect 41/41 tests pass (or 43 with scaling tests)

Preparing input files¶

Geometry: `pos.con`¶

The .con format is used by eOn and stores atomic positions, types, cell dimensions, and frozen/moving flags. Example for oxadiazole (9 atoms):

Generated by eOn

20.000000 20.000000 20.000000
90.000000 90.000000 90.000000


4
2 1 2 4
14.000000 15.980000 15.980000 1.008000
N
Coordinates of component   1
12.709927 12.943985 10.896305 0 1
...

The fifth column (0 or 1) marks each atom as moving (0) or frozen (1). gprd reads this via readcon-core and automatically assigns atom types from atomic numbers.

Configuration: capnp binary¶

GPR dimer parameters are stored as a Cap’n Proto binary. Generate one from a TOML config:

# Convert an eOn config.ini to the TOML format:
python3 schema/eon_ini_to_toml.py config.ini --con pos.con -o config.toml

# Generate the capnp binary:
uv run schema/gen_capnp_params.py config.toml params.bin

Key parameters to tune per system:

Parameter	TOML key	Typical value
Convergence force	`problem.finalConvergenceForce`	0.01
Max outer iterations	`problem.relaxation.maxOuterIterations`	300
SCG max iterations	`optimizer.maxIter`	400
SCG lambda	`optimizer.scg.lambda`	100
GP noise variance	`gpr.gpSigma2`	1e-5
Active radius	`problem.actdistFro`	30.3

Dimer orientation: `direction.dat`¶

One line per atom with dx dy dz displacement components. This sets the initial dimer orientation along a physically meaningful direction (e.g. a reaction coordinate). If omitted, gprd generates a random orientation.

Running the search¶

Single rank (serial)¶

gprd --pos pos.con \
     --orient direction.dat \
     --config params.bin \
     --potential metatomic \
     --model pet-mad-xs-v1.5.0.pt

Multiple ranks (MPI parallel)¶

mpirun -np 4 gprd --pos pos.con \
     --orient direction.dat \
     --config params.bin \
     --potential metatomic \
     --model pet-mad-xs-v1.5.0.pt

Available potentials: xtb (GFN2-xTB), gfn2, gfnff (GFN force field), lj (Lennard-Jones), cuh2 (EAM for Cu+H2), metatomic (any metatrain model).

What gets parallelized¶

The GP pipeline distributes work across MPI ranks at three levels:

Kernel matrix assembly: The covariance matrix K is built tile by tile directly into a 2D block-cyclic distributed buffer. Each rank walks the lower-triangle of observation tile pairs and the BLACS ownership predicate selects exactly one rank per tile; owned tiles are evaluated locally and written into the rank’s block-cyclic slice. No replicated full K is materialized and no MPI_Allreduce fires on the assembled matrix. A legacy round-robin + Allreduce assembly (do_reduce=true) remains for non-ScaLAPACK callers and unit tests.
Cholesky factorization: K = L L^T via ScaLAPACK pdpotrf over the same 2D block-cyclic BLACS process grid that owns the assembled tiles – no scatter step needed. Triangular solves (pdtrsm) and inverse (pdpotri) are also distributed and reuse the cached factor.
Gradient matrices: The dK/d\theta matrices for each hyperparameter follow the same tile-based ownership pattern; the per-hyperparameter trace term is computed locally per rank and a single ordered_sum_bcast reduces all n_hp traces in one collective.

The force evaluations (PET-MAD inference) are replicated on all ranks. For systems where force evaluation is cheap relative to GP operations (e.g. ML potentials on small molecules), the MPI scaling is dominated by the GP linear algebra.

Scaling results¶

CuH2 (218 atoms, EAM potential)¶

The gprd binary on the hardcoded CuH2 system shows near-linear scaling:

Ranks	Wall time (s)	Speedup
1	26.1	1.0x
2	14.4	1.8x
4	8.3	3.1x
8	4.7	5.6x

Cyclopentadienyl radical (14 atoms, PET-MAD-xs-v1.5.0)¶

A real saddle point search on a hydrocarbon radical. The GP training set grows from 86 to 500+ data points as the dimer converges. With PET-MAD as the potential:

Ranks	Wall time (s)	Speedup	Outer iterations	Force evals
1	4829.5	1.0x	17	23
4	558.7	8.6x	13	19
16	263.8	18.3x	13	19

All multi-rank runs use OMP_NUM_THREADS=2 to control torch thread contention. The np=16 run uses rank-0-only force evaluation with MPI_Bcast, eliminating 15 redundant ML inference calls.

The np=4 run converges in 9.3 minutes with the dimer finding a saddle point at energy -6.952 eV. The observation count grows to 500+ during the search, giving kernel matrices of dimension 500 x 43 = 21,500 – well into the regime where distributed linear algebra matters.

Convergence path¶

The maximum force on any atom decreases across outer relaxation iterations:

Iteration   np=1     np=4
        52.18    52.18
         8.99     8.65
         4.01     4.08
         3.75     4.07
         3.37     5.78
         2.50     1.70
         2.80     0.96
         2.87     0.81
         2.61     0.26
         2.57     0.16
        4.69     0.05
        1.12     0.02
        0.55     0.01  (np=4 converged)
        0.25
        0.06
        0.03
        0.02     (np=1 converged)

The slight differences between ranks (e.g. 8.99 vs 8.65 at iteration 1) are from floating-point non-associativity in the distributed reductions (ordered_sum_bcast and MPI_Allreduce inside the BLAS calls). Both runs converge to the same saddle point within the force tolerance.

Troubleshooting¶

`Warning! Matrix L is not positive definite`¶

The GP covariance matrix became numerically singular during SCG optimization. The SCG automatically increases its damping parameter (lambda) and retries. This is normal for ill-conditioned systems and does not indicate a bug.

`Warning! Function value at xnew not finite`¶

SCG took a step that produced NaN or infinity in the log-marginal likelihood. This happens when hyperparameters drift to extreme values (e.g. lengthscale near zero). The SCG damping mechanism handles it automatically.

Dimer not converging¶

If Current Max is not decreasing across outer iterations:

Check that --orient provides a physically meaningful direction (along a reaction coordinate, not random)
Increase max_outer_iterations
Try a smaller dimer_separation (default 0.01)
Ensure the potential gives reasonable forces for your system (test with a single-point evaluation first)