AMR Module
The AMR (Adaptive Mesh Refinement) module in backend/distributed/unstructured/amr/ provides GPU-native, multi-rank h-refinement for unstructured hex and tet meshes built on top of ElementDomain. The pipeline is octree-aligned: each rank emits children for its locally-owned, marked elements, and cstone redistributes them by SFC during the rebuild.
What's in the module
| File | Class / kernels | Purpose |
|---|---|---|
mars_amr.hpp |
umbrella include | exposes the public API |
mars_amr_octree.hpp |
AmrOctree |
refinement-level tracking, Doerfler marking, 1-irregular enforcement |
mars_amr_octree_native.hpp |
OctreeNativeAmr |
drives marking from the cstone leaf decomposition |
mars_amr_octree_refine.hpp |
OctreeAlignedRefine |
hex8 1→8 child emission with new edge/face/body nodes; solution interpolation kernel |
mars_amr_hex_refine.hpp |
HexRefiner |
legacy hex refiner (replaced by OctreeAlignedRefine for the multi-rank path) |
mars_amr_tet_refine.hpp |
TetRefiner |
tet4 1→8 red refinement (Bey's algorithm) |
mars_amr_error_indicator.hpp |
HexErrorIndicator, TetErrorIndicator, ErrorIndicator |
per-element gradient-jump error proxy + Doerfler marking helpers |
mars_amr_solution_transfer.hpp |
SolutionTransfer |
parentage-based interpolation (legacy hex/tet path) |
mars_amr_manager.hpp |
AmrManager |
orchestrates the full cycle: mark → refine → rebuild → transfer |
Refinement cycle (one level)
d_errorPerElement (size = numElements)
│
▼
┌─────────────────────────────────────┐
│ marking (Doerfler or OctreeNative) │ AmrOctree / OctreeNativeAmr
│ → d_marks (size = numElements) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ octree-aligned refinement │ OctreeAlignedRefine::refineLocal
│ → 8 children per marked parent │
│ + 19 new node coords / parent │
│ (no dedup; cstone handles it) │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ solution transfer (per field) │ OctreeAlignedRefine::transferSolution
│ → trilinear interp at the same │
│ 19 positions emitted by refine │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ rebuildDomainFromDevice │ ElementDomain device-data ctor
│ → cstone re-derives box, sorts by │
│ SFC, redistributes, dedupes, │
│ rebuilds halos │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ post-sync key re-encode + lookup │
│ → match transferred values onto │
│ new local node IDs by SFC key │
│ → exchangeNodeHalo to fill ghosts │
└─────────────────────────────────────┘
│
▼
d_nodeSolution (warm start for next level's CG)
Public API
AmrManager (the orchestrator)
template<typename ElementTag, typename KeyType, typename RealType>
class AmrManager {
public:
struct Config {
int maxLevels = 3;
RealType refineFraction = 0.3; // Doerfler fraction
RealType errorTolerance = 1e-6; // stop refining below this error
int blockSize = 256;
int bucketSize = 64;
MarkingStrategy strategy = MarkingStrategy::OctreeNative;
};
AmrManager(Config cfg = {});
void initialize(const std::string& meshFile, int rank, int numRanks);
// Single-field overload
AmrStats adaptMesh(const RealType* d_errorPerElement,
const RealType* d_nodeSolution,
cstone::DeviceVector<RealType>& d_newNodeSolution);
// Multi-field overload — takes any number of node-keyed fields
// (e.g. pressure, velocity components)
AmrStats adaptMeshMultiField(const RealType* d_errorPerElement,
const std::vector<const RealType*>& oldFields,
const std::vector<DevVector*>& newFields);
bool shouldContinue(const AmrStats& stats) const;
int currentLevel() const;
ElementDomain<ElementTag, RealType, KeyType, cstone::GpuTag>& domain();
static void printStats(const AmrStats& stats, int rank);
};
AmrStats reports per-level: elements before/after (local + global), nodes before/after, elements refined, error norm, total time.
shouldContinue(stats) returns false when any of: currentLevel >= maxLevels, errorNorm <= errorTolerance, or elementsRefined == 0.
Marking strategies
enum class MarkingStrategy {
Doerfler, // mark elements with err > refineFraction * maxError
OctreeNative, // mark via cstone leaf-error reduction + computeNodeOpsGpu
};
Doerfler runs AmrOctree::markForRefinement then enforces 1-irregular at mesh level (with an extra halo exchange of int-promoted marks for cross-rank consistency).
OctreeNative reduces per-leaf errors via MPI_Allreduce(MAX) on the global cstone tree, then calls cstone's computeNodeOpsGpu for split decisions. Per-leaf split conformity is the octree's invariant — no extra mark exchange needed.
OctreeAlignedRefine
template<typename KeyType, typename RealType>
struct OctreeAlignedRefine {
struct Result {
DeviceVector<KeyType> d_conn0..d_conn7; // hex connectivity
DeviceVector<RealType> d_x, d_y, d_z; // node coords
size_t numNodes, numElements;
};
static Result refineLocal(/* parent conn, coords, marks, sizes */);
static void transferSolution(/* parent conn, oldSolution, marks, sizes */,
DeviceVector<RealType>& newSolution);
};
refineLocal emits 8 children per marked element with 19 new local nodes per parent (12 edge midpoints + 6 face centers + 1 body center). No dedup — cstone::Domain::sync handles it during the rebuild.
transferSolution runs trilinear interpolation at the same 19 positions used by the geometry kernel. See Solution Transfer.
ErrorIndicator + HexErrorIndicator
template<typename KeyType, typename RealType>
class HexErrorIndicator {
public:
static cstone::DeviceVector<RealType> computeError(/* hex conn, sol, coords, n2d, count */);
};
template<typename KeyType, typename RealType>
class ErrorIndicator {
public:
// sums over ALL local elements (incl. halos) — double-counts at boundaries
static RealType globalErrorNorm(DeviceVector<RealType>& d_error);
// sums only over [startIdx, endIdx) — owned elements only, correct for distributed
static RealType globalErrorNormOwned(DeviceVector<RealType>& d_error,
size_t startIdx, size_t endIdx);
};
Use globalErrorNormOwned whenever d_error includes halo elements, which is always the case when computed via per-element kernels over the full local element array. globalErrorNorm is kept for cases where d_error is already filtered.
Working example: mars_amr_cvfem_graph.cu
Distributed Poisson with adaptive refinement and warm-start CG. Located at examples/distributed/unstructured/mars_amr_cvfem_graph.cu.
// Setup
amr.initialize(meshFile, rank, numRanks);
cstone::DeviceVector<RealType> d_nodeSolution;
cstone::DeviceVector<RealType> d_errorPerElement;
for (int level = 0; level <= amrLevels; ++level) {
// Solve the current level. d_nodeSolution arrives non-empty from the
// previous level (transferred + halo-filled), so CG warm-starts from it.
solvePoissonGraph(amr.domain(), sourceTerm, kernelVariant, ...,
d_nodeSolution, d_errorPerElement);
if (level >= amrLevels) break;
// Refine + transfer
cstone::DeviceVector<RealType> d_newSolution;
auto stats = amr.adaptMesh(d_errorPerElement.data(),
d_nodeSolution.data(), d_newSolution);
AmrManager<HexTag, KeyType, RealType>::printStats(stats, rank);
d_nodeSolution = std::move(d_newSolution);
if (!amr.shouldContinue(stats)) break;
}
The solver itself (solvePoissonGraph) uses the standard distributed CG path:
ConjugateGradientSolver<RealType, int, cstone::GpuTag> solver(maxIter, tolerance);
solver.setOwnedSize(numOwnedDofs);
solver.setHaloExchangeCallback(
[&domain, dofMap = d_node_to_dof.data()](cstone::DeviceVector<RealType>& p) {
domain.exchangeNodeHalo(p, dofMap);
});
solver.solve(A, b, x);
// Halo-exchange the per-node solution before the error indicator,
// so per-element gradient kernels see correct ghost-corner values.
domain.exchangeNodeHalo(d_nodeSolution);
See Halo Management for the halo layering, Node Halo Topology for exchangeNodeHalo internals, Solution Transfer for the warm-start mechanism.
Verified results
cube16 mesh (16³ hex), 4 ranks, --amr-levels=2:
| Level | Elements (global) | ||u|| |
||err|| |
Solve time |
|---|---|---|---|---|
| L0 | 4096 | 1.588786 | 0.0753 | 384 ms (cold) |
| L1 | 32768 | 4.513941 | 0.0796 | 70 ms (warm-started) |
L1 is ~3× faster than a cold-start L1 (~209 ms) thanks to the trilinear-interpolated initial guess for CG. See Solution Transfer.
Single-rank baseline:
- L0: ||u||=1.588588, matches 4-rank to 0.012%.
- globalDofs matches exactly (4913) thanks to the lowest-rank-wins ownership tiebreaker in NodeHaloTopology.
Multi-rank correctness invariants
Maintained by the AMR pipeline:
- Each globally-shared node has exactly one owner. Enforced by Allgatherv tiebreaker in
NodeHaloTopologybuild, triggered automatically byensureHalo()after every refinement. - Ghost-DOF values are sourced from each node's TRUE owner, not from intermediate ranks holding stale ghost data. Enforced by the per-node send/recv lists.
- Per-rank emitted children with coincident coords across ranks dedupe via cstone SFC. Both ranks compute the same trilinear value (same parent corners), so dedup preserves correctness.
- Solution transfer preserves the converged value on coincident pre/post-sync nodes. Enforced by post-sync key re-encode (re-encode pre-sync coords with cstone's new box, look up by SFC key).
Known limitations
- No coarsening (de-refinement) yet. AMR is monotonic: marked elements split, others stay. Adding coarsening is straightforward but unimplemented.
||err||discrepancy. 4-rank||err||runs ~2.17× higher than single-rank for the same mesh. Traced to ~9 corner-shared nodes whose RHS gets contributions only from one rank's owned elements (cstone halo-width gap). Doesn't block AMR; tracked separately.- Tet path partially tested.
TetRefinerandTetErrorIndicatorexist; multi-rank tet AMR end-to-end is on the validation list.