Performance Optimisation¶

Overview¶

This document covers the performance-critical paths in the GP model and the optimisation techniques applied to them. The GP training hot-path accounts for the bulk of wall-clock time during dimer searches.

GP Hot-Path Architecture¶

The inner loop of hyperparameter optimisation (ADAM or SCG) calls evaluateEnergyAndGradient at each iteration, which triggers the full covariance matrix assembly, Cholesky decomposition, and gradient computation pipeline.

Optimizer loop (ADAM / SCG)
  |
  +-- evaluateEnergyAndGradient(w)
        |
        +-- setParameters(w)
        +-- evaluateEnergy(x, x_ind, y)
        |     +-- evaluateTrainingCovarianceMatrix
        |     |     +-- applyCovarianceFunction  (SexpatCF + ConstantCF)
        |     |     |     +-- dist_at / dist_at_vec  <-- distance computation
        |     |     |     +-- extractCoordinatesByIndex
        |     |     +-- LikGaussian::evaluateTrainingCovarianceMatrix  <-- diagonal noise
        |     +-- decomposeCovarianceMatrix  (Cholesky, O(N^3))
        |     +-- calculateMeanPrediction
        |
        +-- evaluateGradient(x, x_ind, y, gradient)
              +-- calculateGradientWithCovFunc  (SexpatCF)
              |     +-- Gradient::calculateDerivativesSameDim
              |     |     +-- calculateGradientBetweenMovingAtoms
              |     |     +-- calculateGradientBetweenMovingAndFrozenAtoms
              |     +-- Gradient::calculateDerivativesDiffDim
              +-- calculateGradientWithCovFunc  (ConstantCF)

Optimisation Techniques Applied¶

1. Pre-compute loop-invariant x2 quantities¶

In dist_at, dist_at_vec, and Gradient inner loops, the x2-dependent inverse distances are invariant across the outer n-loop (x1 observations). Pre-computing them into a std::vector<double> before the n-loop eliminates (n1-1) redundant coordinate extractions and norm computations per atom pair.

Function	Before (per pair)	After (per pair)
dist_at	n1 * n2 norms	n1 + n2 norms
Gradient	n1 * n2 norms	n1 + n2 norms

2. Cache frozen atom positions¶

In dist_at, dist_at_vec, and dist_max1Dlog, frozen atom positions are extracted via gpr::coord::at(conf_info.atoms_froz_active.positions, 0, i) inside the (n, m) inner loop but are invariant. Hoisting them to the (j, i) loop level eliminates n1 * n2 redundant segment extractions per frozen atom pair.

3. Vectorize Eigen operations¶

Replaced scalar pointer loops with Eigen vectorised operations:

Before	After
`\*(data + i) = 0.` loop	`setZero()`
`\*(data + i) = val` loop	`setConstant(val)`
`C(i,i) = sigma2` loop	`C.diagonal().setConstant(sigma2)`
`sqrt(*(data + n))` loop	`Map<ArrayXd>(data, size).sqrt()`
Per-element matrix copy	`matrix.block(...) = source`

4. Reduce heap allocations in ADAM¶

The ADAM optimizer allocated m_hat, v_hat, and w_update vectors on every iteration (100–300 iterations typical). Pre-allocating them before the loop and converting moment updates to in-place operations eliminates 3 heap allocations per iteration.

// Before:
m = beta1 * m + (1 - beta1) * g;     // temporary + assignment
Eigen::VectorXd m_hat = m / (1 - beta1_t);  // heap allocation

// After:
m *= beta1;
m.noalias() += (1 - beta1) * g;      // in-place
m_hat.noalias() = m / (1 - beta1_t); // reuse pre-allocated storage

5. Hoist vectors outside gradient loops¶

In calculateGradientWithCovFunc, three std::vector<FieldMatrixd> were constructed fresh inside the per-dimension loop body. Moving them before the loop and calling .clear() each iteration avoids 3 * D vector constructions per gradient evaluation (D = 3 * N_mov= 6 for CuH2).

6. Eliminate redundant temporaries in ConstantCF¶

ConstantCF::ginput2/3/4 created a temp matrix, zeroed it with a scalar loop, then copied it into each DKff element. Replaced with direct DKff[i].setZero(). calculateCovarianceMatrix used a temp matrix filled via scalar loop then copied to C; replaced with C.setConstant(val).

7. Block copy in extractCoordinatesByIndex¶

The function iterated over all rows checking ind_Ddim[i] == ind and copying one row at a time. Since ind_Ddim has known contiguous block structure (blocks of N_imidentical indices), replaced with a single x.block(ind * block_size, 0, block_size, n_cols) copy.

8. Const maps in applyCrossCovarianceFunction¶

applyCrossCovarianceFunction copied ind_Ddim and ind_Ddim2 into mutable VectorXd to create Map<ArrayXd> for comparisons. Since the maps are only read (for == comparisons), replaced with Map<const ArrayXd> directly on the original data, eliminating 2 heap allocations per call.

Benchmark Results¶

Measured with pixi run -e python bench (A/B comparison vs main):

Benchmark	Before (main)	After (branch)	Speedup
SCG Dimer	979 ms	733 ms	1.34x
ADAM Dimer	3.48 s	2.69 s	1.29x
GP Training (40 obs)	9.25 ms	7.61 ms	1.22x
GP Training (10 obs)	5.08 ms	3.65 ms	1.39x

Profiling¶

For future performance work, use the A/B benchmark:

pixi run -e python bench

To profile a specific function:

pixi shell -e python
python -c "
import cProfile
from benchmarks.bench_gpr import TimeGPTraining
b = TimeGPTraining()
b.param = 40
b.setup(40)
cProfile.run('b.time_gp_train(40)', sort='cumulative')
"

Guidelines for Performance PRs¶

Always run pixi run -e python bench before and after changes
Focus on the hot-path (distance computation, gradient evaluation, optimizer iteration)
Prefer Eigen block operations over scalar loops
Pre-compute loop-invariant quantities outside inner loops
Avoid heap allocations inside iteration loops (.noalias(), pre-allocate)
Add a towncrier news fragment under docs/newsfragments/ with category changed