Performance Development Guidelines

Hot-Path Overview

The GP training inner loop is the dominant cost. During a dimer search with 40 observations and 2 moving atoms (D=6), the covariance matrix is 280x280 and the optimizer runs 100–400 iterations.

The key cost centres are:

  1. Distance computation (dist_at, dist_at_vec) – O(Nmov2* n1 * n2)

  2. Gradient evaluation (calculateGradientBetweenMovingAtoms) – O(Nmov2* n1 * n2)

  3. Cholesky decomposition (decomposeCovarianceMatrix) – O(((D+1)*nobs)3)

  4. Optimizer iteration (ADAM/SCG) – allocations per iteration

Optimisation Principles

Prefer Eigen block operations

Eigen’s block/segment/diagonal accessors map to optimised BLAS/memcpy calls. Scalar pointer loops (\*(data + i) = val) bypass vectorisation.

// Bad
for (Eigen::Index i = 0; i < C.rows(); ++i) C(i, i) = sigma2;

// Good
C.diagonal().setConstant(sigma2);

Pre-compute loop-invariant quantities

If a value depends only on the outer loop variable, compute it once outside the inner loop.

// Bad: recomputes x2 norm for every (n, m) pair
for (n ...) {
    for (m ...) {
        double invr2 = 1.0 / (coord::at(x2, m, j) - coord::at(x2, m, i)).norm();

// Good: pre-compute x2 quantities before the n-loop
std::vector<double> invr2(n2);
for (m ...) invr2[m] = 1.0 / (coord::at(x2, m, j) - coord::at(x2, m, i)).norm();
for (n ...) {
    for (m ...) {
        double diff = invr1 - invr2[m];

Avoid heap allocations in iteration loops

Eigen::VectorXd allocates on construction. Pre-allocate before the loop and reuse with .noalias():

// Before the loop:
Eigen::VectorXd m_hat(nparams), v_hat(nparams);

// Inside the loop:
m_hat.noalias() = m / (1.0 - beta1_t);  // reuses storage

Use .noalias() for matrix products

When the destination doesn’t alias the source, .noalias() skips the temporary:

C.noalias() = A * B;  // no temporary

Cache invariant positions

gpr::coord::at(positions, 0, i) extracts a 3-element segment via pointer arithmetic. If the result doesn’t change across loop iterations, hoist it:

Eigen::Vector3d froz_pos = gpr::coord::at(frozen.positions, 0, i);
for (n ...) {
    double d = (gpr::coord::at(x1, n, j) - froz_pos).norm();

Memory Layout

gpr_optim uses mixed Eigen storage orders:

Type

Storage

Used for

FieldMatrixd / Coord

RowMajor

Observations, coordinates

EigenMatrix / MatrixXd

ColMajor

Covariance matrices, linear algebra

The assembleVectorFromEnergyAndGradient function must copy gradients in column-major order to match the ind_Ddim block structure (all dE/dx first, then dE/dy, then dE/dz). This is a structural requirement of the covariance matrix layout, not a performance choice.

Benchmarking Workflow

# A/B comparison against main
pixi run -e python bench

# Compare against a specific branch
pixi run -e python -- nim r scripts/bench_compare.nim develop

# Keep venvs for debugging
pixi run -e python -- nim r scripts/bench_compare.nim main --keep-venvs

The script builds both branches in parallel (isolated venvs) then benchmarks sequentially to avoid CPU contention. See scripts/bench_compare.nim for details.

Adding a Benchmark

Add new benchmarks to benchmarks/bench_gpr.py using ASV conventions:

class TimeMyOperation:
    params = [10, 20, 40]
    param_names = ["n_obs"]

    def setup(self, n_obs):
        # expensive setup (not timed)
        ...

    def time_my_operation(self, n_obs):
        # timed operation
        ...