AMR Module

The AMR (Adaptive Mesh Refinement) module in backend/distributed/unstructured/amr/ provides GPU-native, multi-rank h-refinement for unstructured hex and tet meshes built on top of ElementDomain. The pipeline is octree-aligned: each rank emits children for its locally-owned, marked elements, and cstone redistributes them by SFC during the rebuild.

What's in the module

File	Class / kernels	Purpose
`mars_amr.hpp`	umbrella include	exposes the public API
`mars_amr_octree.hpp`	`AmrOctree`	refinement-level tracking, Doerfler marking, 1-irregular enforcement
`mars_amr_octree_native.hpp`	`OctreeNativeAmr`	drives marking from the cstone leaf decomposition
`mars_amr_octree_refine.hpp`	`OctreeAlignedRefine`	hex8 1→8 child emission with new edge/face/body nodes; solution interpolation kernel
`mars_amr_hex_refine.hpp`	`HexRefiner`	legacy hex refiner (replaced by `OctreeAlignedRefine` for the multi-rank path)
`mars_amr_tet_refine.hpp`	`TetRefiner`	tet4 1→8 red refinement (Bey's algorithm)
`mars_amr_error_indicator.hpp`	`HexErrorIndicator`, `TetErrorIndicator`, `ErrorIndicator`	per-element gradient-jump error proxy + Doerfler marking helpers
`mars_amr_solution_transfer.hpp`	`SolutionTransfer`	parentage-based interpolation (legacy hex/tet path)
`mars_amr_manager.hpp`	`AmrManager`	orchestrates the full cycle: mark → refine → rebuild → transfer

   d_errorPerElement (size = numElements)
            │
            ▼
   ┌─────────────────────────────────────┐
   │  marking (Doerfler or OctreeNative) │  AmrOctree / OctreeNativeAmr
   │  → d_marks (size = numElements)     │
   └─────────────────────────────────────┘
            │
            ▼
   ┌─────────────────────────────────────┐
   │  octree-aligned refinement          │  OctreeAlignedRefine::refineLocal
   │  → 8 children per marked parent     │
   │    + 19 new node coords / parent    │
   │    (no dedup; cstone handles it)    │
   └─────────────────────────────────────┘
            │
            ▼
   ┌─────────────────────────────────────┐
   │  solution transfer (per field)      │  OctreeAlignedRefine::transferSolution
   │  → trilinear interp at the same     │
   │    19 positions emitted by refine   │
   └─────────────────────────────────────┘
            │
            ▼
   ┌─────────────────────────────────────┐
   │  rebuildDomainFromDevice            │  ElementDomain device-data ctor
   │  → cstone re-derives box, sorts by  │
   │    SFC, redistributes, dedupes,     │
   │    rebuilds halos                   │
   └─────────────────────────────────────┘
            │
            ▼
   ┌─────────────────────────────────────┐
   │  post-sync key re-encode + lookup   │
   │  → match transferred values onto    │
   │    new local node IDs by SFC key    │
   │  → exchangeNodeHalo to fill ghosts  │
   └─────────────────────────────────────┘
            │
            ▼
   d_nodeSolution (warm start for next level's CG)

Public API

`AmrManager` (the orchestrator)

template<typename ElementTag, typename KeyType, typename RealType>
class AmrManager {
public:
    struct Config {
        int    maxLevels        = 3;
        RealType refineFraction = 0.3;     // Doerfler fraction
        RealType errorTolerance = 1e-6;    // stop refining below this error
        int    blockSize        = 256;
        int    bucketSize       = 64;
        MarkingStrategy strategy = MarkingStrategy::OctreeNative;
    };

    AmrManager(Config cfg = {});

    void initialize(const std::string& meshFile, int rank, int numRanks);

    // Single-field overload
    AmrStats adaptMesh(const RealType* d_errorPerElement,
                       const RealType* d_nodeSolution,
                       cstone::DeviceVector<RealType>& d_newNodeSolution);

    // Multi-field overload — takes any number of node-keyed fields
    // (e.g. pressure, velocity components)
    AmrStats adaptMeshMultiField(const RealType* d_errorPerElement,
                                  const std::vector<const RealType*>& oldFields,
                                  const std::vector<DevVector*>& newFields);

    bool shouldContinue(const AmrStats& stats) const;
    int  currentLevel() const;
    ElementDomain<ElementTag, RealType, KeyType, cstone::GpuTag>& domain();
    static void printStats(const AmrStats& stats, int rank);
};

AmrStats reports per-level: elements before/after (local + global), nodes before/after, elements refined, error norm, total time.

shouldContinue(stats) returns false when any of: currentLevel >= maxLevels, errorNorm <= errorTolerance, or elementsRefined == 0.

Marking strategies

enum class MarkingStrategy {
    Doerfler,        // mark elements with err > refineFraction * maxError
    OctreeNative,    // mark via cstone leaf-error reduction + computeNodeOpsGpu
};

Doerfler runs AmrOctree::markForRefinement then enforces 1-irregular at mesh level (with an extra halo exchange of int-promoted marks for cross-rank consistency).

OctreeNative reduces per-leaf errors via MPI_Allreduce(MAX) on the global cstone tree, then calls cstone's computeNodeOpsGpu for split decisions. Per-leaf split conformity is the octree's invariant — no extra mark exchange needed.

`OctreeAlignedRefine`

template<typename KeyType, typename RealType>
struct OctreeAlignedRefine {
    struct Result {
        DeviceVector<KeyType>  d_conn0..d_conn7;   // hex connectivity
        DeviceVector<RealType> d_x, d_y, d_z;       // node coords
        size_t numNodes, numElements;
    };

    static Result refineLocal(/* parent conn, coords, marks, sizes */);

    static void transferSolution(/* parent conn, oldSolution, marks, sizes */,
                                 DeviceVector<RealType>& newSolution);
};

refineLocal emits 8 children per marked element with 19 new local nodes per parent (12 edge midpoints + 6 face centers + 1 body center). No dedup — cstone::Domain::sync handles it during the rebuild.

transferSolution runs trilinear interpolation at the same 19 positions used by the geometry kernel. See Solution Transfer.

`ErrorIndicator` + `HexErrorIndicator`

template<typename KeyType, typename RealType>
class HexErrorIndicator {
public:
    static cstone::DeviceVector<RealType> computeError(/* hex conn, sol, coords, n2d, count */);
};

template<typename KeyType, typename RealType>
class ErrorIndicator {
public:
    // sums over ALL local elements (incl. halos) — double-counts at boundaries
    static RealType globalErrorNorm(DeviceVector<RealType>& d_error);

    // sums only over [startIdx, endIdx) — owned elements only, correct for distributed
    static RealType globalErrorNormOwned(DeviceVector<RealType>& d_error,
                                          size_t startIdx, size_t endIdx);
};

Use globalErrorNormOwned whenever d_error includes halo elements, which is always the case when computed via per-element kernels over the full local element array. globalErrorNorm is kept for cases where d_error is already filtered.

Working example: `mars_amr_cvfem_graph.cu`

Distributed Poisson with adaptive refinement and warm-start CG. Located at examples/distributed/unstructured/mars_amr_cvfem_graph.cu.

// Setup
amr.initialize(meshFile, rank, numRanks);

cstone::DeviceVector<RealType> d_nodeSolution;
cstone::DeviceVector<RealType> d_errorPerElement;

for (int level = 0; level <= amrLevels; ++level) {
    // Solve the current level. d_nodeSolution arrives non-empty from the
    // previous level (transferred + halo-filled), so CG warm-starts from it.
    solvePoissonGraph(amr.domain(), sourceTerm, kernelVariant, ...,
                      d_nodeSolution, d_errorPerElement);

    if (level >= amrLevels) break;

    // Refine + transfer
    cstone::DeviceVector<RealType> d_newSolution;
    auto stats = amr.adaptMesh(d_errorPerElement.data(),
                                d_nodeSolution.data(), d_newSolution);
    AmrManager<HexTag, KeyType, RealType>::printStats(stats, rank);

    d_nodeSolution = std::move(d_newSolution);
    if (!amr.shouldContinue(stats)) break;
}

The solver itself (solvePoissonGraph) uses the standard distributed CG path:

ConjugateGradientSolver<RealType, int, cstone::GpuTag> solver(maxIter, tolerance);
solver.setOwnedSize(numOwnedDofs);
solver.setHaloExchangeCallback(
    [&domain, dofMap = d_node_to_dof.data()](cstone::DeviceVector<RealType>& p) {
        domain.exchangeNodeHalo(p, dofMap);
    });
solver.solve(A, b, x);

// Halo-exchange the per-node solution before the error indicator,
// so per-element gradient kernels see correct ghost-corner values.
domain.exchangeNodeHalo(d_nodeSolution);

See Halo Management for the halo layering, Node Halo Topology for exchangeNodeHalo internals, Solution Transfer for the warm-start mechanism.

Verified results

cube16 mesh (16³ hex), 4 ranks, --amr-levels=2:

Level	Elements (global)	`\|\|u\|\|`	`\|\|err\|\|`	Solve time
L0	4096	1.588786	0.0753	384 ms (cold)
L1	32768	4.513941	0.0796	70 ms (warm-started)

L1 is ~3× faster than a cold-start L1 (~209 ms) thanks to the trilinear-interpolated initial guess for CG. See Solution Transfer.

Single-rank baseline: - L0: ||u||=1.588588, matches 4-rank to 0.012%. - globalDofs matches exactly (4913) thanks to the lowest-rank-wins ownership tiebreaker in NodeHaloTopology.

Multi-rank correctness invariants

Maintained by the AMR pipeline:

Each globally-shared node has exactly one owner. Enforced by Allgatherv tiebreaker in NodeHaloTopology build, triggered automatically by ensureHalo() after every refinement.
Ghost-DOF values are sourced from each node's TRUE owner, not from intermediate ranks holding stale ghost data. Enforced by the per-node send/recv lists.
Per-rank emitted children with coincident coords across ranks dedupe via cstone SFC. Both ranks compute the same trilinear value (same parent corners), so dedup preserves correctness.
Solution transfer preserves the converged value on coincident pre/post-sync nodes. Enforced by post-sync key re-encode (re-encode pre-sync coords with cstone's new box, look up by SFC key).

Known limitations

No coarsening (de-refinement) yet. AMR is monotonic: marked elements split, others stay. Adding coarsening is straightforward but unimplemented.
||err|| discrepancy. 4-rank ||err|| runs ~2.17× higher than single-rank for the same mesh. Traced to ~9 corner-shared nodes whose RHS gets contributions only from one rank's owned elements (cstone halo-width gap). Doesn't block AMR; tracked separately.
Tet path partially tested. TetRefiner and TetErrorIndicator exist; multi-rank tet AMR end-to-end is on the validation list.

AMR Module

What's in the module

Refinement cycle (one level)

Public API

`AmrManager` (the orchestrator)

Marking strategies

`OctreeAlignedRefine`

`ErrorIndicator` + `HexErrorIndicator`

Working example: `mars_amr_cvfem_graph.cu`

Verified results

Multi-rank correctness invariants

Known limitations

See also

AMR Module

What's in the module

Refinement cycle (one level)

Public API

AmrManager (the orchestrator)

Marking strategies

OctreeAlignedRefine

ErrorIndicator + HexErrorIndicator

Working example: mars_amr_cvfem_graph.cu

Verified results

Multi-rank correctness invariants

Known limitations

See also

`AmrManager` (the orchestrator)

`OctreeAlignedRefine`

`ErrorIndicator` + `HexErrorIndicator`

Working example: `mars_amr_cvfem_graph.cu`