Performance Development Guidelines¶
Hot-Path Overview¶
The GP training inner loop is the dominant cost. During a dimer search with 40 observations and 2 moving atoms (D=6), the covariance matrix is 280x280 and the optimizer runs 100–400 iterations.
The key cost centres are:
Distance computation (
dist_at,dist_at_vec) – O(Nmov2* n1 * n2)Gradient evaluation (
calculateGradientBetweenMovingAtoms) – O(Nmov2* n1 * n2)Cholesky decomposition (
decomposeCovarianceMatrix) – O(((D+1)*nobs)3)Optimizer iteration (ADAM/SCG) – allocations per iteration
Optimisation Principles¶
Prefer Eigen block operations¶
Eigen’s block/segment/diagonal accessors map to optimised BLAS/memcpy
calls. Scalar pointer loops (\*(data + i) = val) bypass vectorisation.
// Bad
for (Eigen::Index i = 0; i < C.rows(); ++i) C(i, i) = sigma2;
// Good
C.diagonal().setConstant(sigma2);
Pre-compute loop-invariant quantities¶
If a value depends only on the outer loop variable, compute it once outside the inner loop.
// Bad: recomputes x2 norm for every (n, m) pair
for (n ...) {
for (m ...) {
double invr2 = 1.0 / (coord::at(x2, m, j) - coord::at(x2, m, i)).norm();
// Good: pre-compute x2 quantities before the n-loop
std::vector<double> invr2(n2);
for (m ...) invr2[m] = 1.0 / (coord::at(x2, m, j) - coord::at(x2, m, i)).norm();
for (n ...) {
for (m ...) {
double diff = invr1 - invr2[m];
Avoid heap allocations in iteration loops¶
Eigen::VectorXd allocates on construction. Pre-allocate before the
loop and reuse with .noalias():
// Before the loop:
Eigen::VectorXd m_hat(nparams), v_hat(nparams);
// Inside the loop:
m_hat.noalias() = m / (1.0 - beta1_t); // reuses storage
Use .noalias() for matrix products¶
When the destination doesn’t alias the source, .noalias() skips the
temporary:
C.noalias() = A * B; // no temporary
Cache invariant positions¶
gpr::coord::at(positions, 0, i) extracts a 3-element segment via
pointer arithmetic. If the result doesn’t change across loop
iterations, hoist it:
Eigen::Vector3d froz_pos = gpr::coord::at(frozen.positions, 0, i);
for (n ...) {
double d = (gpr::coord::at(x1, n, j) - froz_pos).norm();
Memory Layout¶
gpr_optim uses mixed Eigen storage orders:
Type |
Storage |
Used for |
|---|---|---|
|
RowMajor |
Observations, coordinates |
|
ColMajor |
Covariance matrices, linear algebra |
The assembleVectorFromEnergyAndGradient function must copy gradients in
column-major order to match the ind_Ddim block structure (all dE/dx first,
then dE/dy, then dE/dz). This is a structural requirement of the covariance
matrix layout, not a performance choice.
Benchmarking Workflow¶
# A/B comparison against main
pixi run -e python bench
# Compare against a specific branch
pixi run -e python -- nim r scripts/bench_compare.nim develop
# Keep venvs for debugging
pixi run -e python -- nim r scripts/bench_compare.nim main --keep-venvs
The script builds both branches in parallel (isolated venvs) then benchmarks
sequentially to avoid CPU contention. See scripts/bench_compare.nim for
details.
Adding a Benchmark¶
Add new benchmarks to benchmarks/bench_gpr.py using ASV conventions:
class TimeMyOperation:
params = [10, 20, 40]
param_names = ["n_obs"]
def setup(self, n_obs):
# expensive setup (not timed)
...
def time_my_operation(self, n_obs):
# timed operation
...