Performance Optimisation

Overview

This document covers the performance-critical paths in the GP model and the optimisation techniques applied to them. The GP training hot-path accounts for the bulk of wall-clock time during dimer searches.

GP Hot-Path Architecture

The inner loop of hyperparameter optimisation (ADAM or SCG) calls evaluateEnergyAndGradient at each iteration, which triggers the full covariance matrix assembly, Cholesky decomposition, and gradient computation pipeline.

Optimizer loop (ADAM / SCG)
  |
  +-- evaluateEnergyAndGradient(w)
        |
        +-- setParameters(w)
        +-- evaluateEnergy(x, x_ind, y)
        |     +-- evaluateTrainingCovarianceMatrix
        |     |     +-- applyCovarianceFunction  (SexpatCF + ConstantCF)
        |     |     |     +-- dist_at / dist_at_vec  <-- distance computation
        |     |     |     +-- extractCoordinatesByIndex
        |     |     +-- LikGaussian::evaluateTrainingCovarianceMatrix  <-- diagonal noise
        |     +-- decomposeCovarianceMatrix  (Cholesky, O(N^3))
        |     +-- calculateMeanPrediction
        |
        +-- evaluateGradient(x, x_ind, y, gradient)
              +-- calculateGradientWithCovFunc  (SexpatCF)
              |     +-- Gradient::calculateDerivativesSameDim
              |     |     +-- calculateGradientBetweenMovingAtoms
              |     |     +-- calculateGradientBetweenMovingAndFrozenAtoms
              |     +-- Gradient::calculateDerivativesDiffDim
              +-- calculateGradientWithCovFunc  (ConstantCF)

Optimisation Techniques Applied

1. Pre-compute loop-invariant x2 quantities

In dist_at, dist_at_vec, and Gradient inner loops, the x2-dependent inverse distances are invariant across the outer n-loop (x1 observations). Pre-computing them into a std::vector<double> before the n-loop eliminates (n1-1) redundant coordinate extractions and norm computations per atom pair.

Function

Before (per pair)

After (per pair)

distat

n1 * n2 norms

n1 + n2 norms

Gradient

n1 * n2 norms

n1 + n2 norms

2. Cache frozen atom positions

In dist_at, dist_at_vec, and dist_max1Dlog, frozen atom positions are extracted via gpr::coord::at(conf_info.atoms_froz_active.positions, 0, i) inside the (n, m) inner loop but are invariant. Hoisting them to the (j, i) loop level eliminates n1 * n2 redundant segment extractions per frozen atom pair.

3. Vectorize Eigen operations

Replaced scalar pointer loops with Eigen vectorised operations:

Before

After

\*(data + i) = 0. loop

setZero()

\*(data + i) = val loop

setConstant(val)

C(i,i) = sigma2 loop

C.diagonal().setConstant(sigma2)

sqrt(*(data + n)) loop

Map<ArrayXd>(data, size).sqrt()

Per-element matrix copy

matrix.block(...) = source

4. Reduce heap allocations in ADAM

The ADAM optimizer allocated m_hat, v_hat, and w_update vectors on every iteration (100–300 iterations typical). Pre-allocating them before the loop and converting moment updates to in-place operations eliminates 3 heap allocations per iteration.

// Before:
m = beta1 * m + (1 - beta1) * g;     // temporary + assignment
Eigen::VectorXd m_hat = m / (1 - beta1_t);  // heap allocation

// After:
m *= beta1;
m.noalias() += (1 - beta1) * g;      // in-place
m_hat.noalias() = m / (1 - beta1_t); // reuse pre-allocated storage

5. Hoist vectors outside gradient loops

In calculateGradientWithCovFunc, three std::vector<FieldMatrixd> were constructed fresh inside the per-dimension loop body. Moving them before the loop and calling .clear() each iteration avoids 3 * D vector constructions per gradient evaluation (D = 3 * Nmov= 6 for CuH2).

6. Eliminate redundant temporaries in ConstantCF

ConstantCF::ginput2/3/4 created a temp matrix, zeroed it with a scalar loop, then copied it into each DKff element. Replaced with direct DKff[i].setZero(). calculateCovarianceMatrix used a temp matrix filled via scalar loop then copied to C; replaced with C.setConstant(val).

7. Block copy in extractCoordinatesByIndex

The function iterated over all rows checking ind_Ddim[i] == ind and copying one row at a time. Since ind_Ddim has known contiguous block structure (blocks of Nimidentical indices), replaced with a single x.block(ind * block_size, 0, block_size, n_cols) copy.

8. Const maps in applyCrossCovarianceFunction

applyCrossCovarianceFunction copied ind_Ddim and ind_Ddim2 into mutable VectorXd to create Map<ArrayXd> for comparisons. Since the maps are only read (for == comparisons), replaced with Map<const ArrayXd> directly on the original data, eliminating 2 heap allocations per call.

Benchmark Results

Measured with pixi run -e python bench (A/B comparison vs main):

Benchmark

Before (main)

After (branch)

Speedup

SCG Dimer

979 ms

733 ms

1.34x

ADAM Dimer

3.48 s

2.69 s

1.29x

GP Training (40 obs)

9.25 ms

7.61 ms

1.22x

GP Training (10 obs)

5.08 ms

3.65 ms

1.39x

Profiling

For future performance work, use the A/B benchmark:

pixi run -e python bench

To profile a specific function:

pixi shell -e python
python -c "
import cProfile
from benchmarks.bench_gpr import TimeGPTraining
b = TimeGPTraining()
b.param = 40
b.setup(40)
cProfile.run('b.time_gp_train(40)', sort='cumulative')
"

Guidelines for Performance PRs

  1. Always run pixi run -e python bench before and after changes

  2. Focus on the hot-path (distance computation, gradient evaluation, optimizer iteration)

  3. Prefer Eigen block operations over scalar loops

  4. Pre-compute loop-invariant quantities outside inner loops

  5. Avoid heap allocations inside iteration loops (.noalias(), pre-allocate)

  6. Add a towncrier news fragment under docs/newsfragments/ with category changed