Performance Optimisation¶
Overview¶
This document covers the performance-critical paths in the GP model and the optimisation techniques applied to them. The GP training hot-path accounts for the bulk of wall-clock time during dimer searches.
GP Hot-Path Architecture¶
The inner loop of hyperparameter optimisation (ADAM or SCG) calls
evaluateEnergyAndGradient at each iteration, which triggers the full
covariance matrix assembly, Cholesky decomposition, and gradient
computation pipeline.
Optimizer loop (ADAM / SCG)
|
+-- evaluateEnergyAndGradient(w)
|
+-- setParameters(w)
+-- evaluateEnergy(x, x_ind, y)
| +-- evaluateTrainingCovarianceMatrix
| | +-- applyCovarianceFunction (SexpatCF + ConstantCF)
| | | +-- dist_at / dist_at_vec <-- distance computation
| | | +-- extractCoordinatesByIndex
| | +-- LikGaussian::evaluateTrainingCovarianceMatrix <-- diagonal noise
| +-- decomposeCovarianceMatrix (Cholesky, O(N^3))
| +-- calculateMeanPrediction
|
+-- evaluateGradient(x, x_ind, y, gradient)
+-- calculateGradientWithCovFunc (SexpatCF)
| +-- Gradient::calculateDerivativesSameDim
| | +-- calculateGradientBetweenMovingAtoms
| | +-- calculateGradientBetweenMovingAndFrozenAtoms
| +-- Gradient::calculateDerivativesDiffDim
+-- calculateGradientWithCovFunc (ConstantCF)
Optimisation Techniques Applied¶
1. Pre-compute loop-invariant x2 quantities¶
In dist_at, dist_at_vec, and Gradient inner loops, the x2-dependent
inverse distances are invariant across the outer n-loop (x1 observations).
Pre-computing them into a std::vector<double> before the n-loop eliminates
(n1-1) redundant coordinate extractions and norm computations per atom
pair.
Function |
Before (per pair) |
After (per pair) |
|---|---|---|
distat |
n1 * n2 norms |
n1 + n2 norms |
Gradient |
n1 * n2 norms |
n1 + n2 norms |
2. Cache frozen atom positions¶
In dist_at, dist_at_vec, and dist_max1Dlog, frozen atom positions are
extracted via gpr::coord::at(conf_info.atoms_froz_active.positions, 0, i)
inside the (n, m) inner loop but are invariant. Hoisting them to the (j, i)
loop level eliminates n1 * n2 redundant segment extractions per frozen
atom pair.
3. Vectorize Eigen operations¶
Replaced scalar pointer loops with Eigen vectorised operations:
Before |
After |
|---|---|
|
|
|
|
|
|
|
|
Per-element matrix copy |
|
4. Reduce heap allocations in ADAM¶
The ADAM optimizer allocated m_hat, v_hat, and w_update vectors on every
iteration (100–300 iterations typical). Pre-allocating them before the loop
and converting moment updates to in-place operations eliminates 3 heap
allocations per iteration.
// Before:
m = beta1 * m + (1 - beta1) * g; // temporary + assignment
Eigen::VectorXd m_hat = m / (1 - beta1_t); // heap allocation
// After:
m *= beta1;
m.noalias() += (1 - beta1) * g; // in-place
m_hat.noalias() = m / (1 - beta1_t); // reuse pre-allocated storage
5. Hoist vectors outside gradient loops¶
In calculateGradientWithCovFunc, three std::vector<FieldMatrixd> were
constructed fresh inside the per-dimension loop body. Moving them before
the loop and calling .clear() each iteration avoids 3 * D vector
constructions per gradient evaluation (D = 3 * Nmov= 6 for CuH2).
6. Eliminate redundant temporaries in ConstantCF¶
ConstantCF::ginput2/3/4 created a temp matrix, zeroed it with a scalar
loop, then copied it into each DKff element. Replaced with direct
DKff[i].setZero(). calculateCovarianceMatrix used a temp matrix filled
via scalar loop then copied to C; replaced with C.setConstant(val).
7. Block copy in extractCoordinatesByIndex¶
The function iterated over all rows checking ind_Ddim[i] == ind and copying
one row at a time. Since ind_Ddim has known contiguous block structure
(blocks of Nimidentical indices), replaced with a single
x.block(ind * block_size, 0, block_size, n_cols) copy.
8. Const maps in applyCrossCovarianceFunction¶
applyCrossCovarianceFunction copied ind_Ddim and ind_Ddim2 into
mutable VectorXd to create Map<ArrayXd> for comparisons. Since the
maps are only read (for == comparisons), replaced with
Map<const ArrayXd> directly on the original data, eliminating 2 heap
allocations per call.
Benchmark Results¶
Measured with pixi run -e python bench (A/B comparison vs main):
Benchmark |
Before (main) |
After (branch) |
Speedup |
|---|---|---|---|
SCG Dimer |
979 ms |
733 ms |
1.34x |
ADAM Dimer |
3.48 s |
2.69 s |
1.29x |
GP Training (40 obs) |
9.25 ms |
7.61 ms |
1.22x |
GP Training (10 obs) |
5.08 ms |
3.65 ms |
1.39x |
Profiling¶
For future performance work, use the A/B benchmark:
pixi run -e python bench
To profile a specific function:
pixi shell -e python
python -c "
import cProfile
from benchmarks.bench_gpr import TimeGPTraining
b = TimeGPTraining()
b.param = 40
b.setup(40)
cProfile.run('b.time_gp_train(40)', sort='cumulative')
"
Guidelines for Performance PRs¶
Always run
pixi run -e python benchbefore and after changesFocus on the hot-path (distance computation, gradient evaluation, optimizer iteration)
Prefer Eigen block operations over scalar loops
Pre-compute loop-invariant quantities outside inner loops
Avoid heap allocations inside iteration loops (
.noalias(), pre-allocate)Add a towncrier news fragment under
docs/newsfragments/with categorychanged