Skip to content

global_evaluate: faithful parallel evaluate() (fix out-of-domain mislocation)#222

Open
lmoresi wants to merge 1 commit into
developmentfrom
bugfix/global-evaluate-parallel-extrapolation
Open

global_evaluate: faithful parallel evaluate() (fix out-of-domain mislocation)#222
lmoresi wants to merge 1 commit into
developmentfrom
bugfix/global-evaluate-parallel-extrapolation

Conversation

@lmoresi
Copy link
Copy Markdown
Member

@lmoresi lmoresi commented Jun 6, 2026

Problem

global_evaluate's parallel path silently returned wrong values for query points outside the (old) domain. The evaluation-swarm migrate routes unclaimed points by nearest rank centre-of-mass and strands them on an arbitrary rank, which extrapolates from a geometrically-far local cell. Deterministic reproduction (rotation gate, linear field T=x, np=5): a point at x=+0.64 read −0.42 (opposite side), max_err=1.06. This corrupted mesh-variable transfer on parallel mover-adapted meshes.

Fix

Make global_evaluate a faithful parallel evaluate() — interpolate inside, extrapolate from the true nearest cell outside, flag inside/outside. A best-claim out-of-domain fallback in global_evaluate_nd:

  • allgather the (small, boundary-layer) extrapolated set,
  • each rank reports nearest-local-cell distance + its local rbf extrapolation,
  • Allreduce(MIN dist / MIN rank / SUM winner-value) picks the globally-nearest rank's value.

Only unconditional collectives + a local rbf_evaluate (never the collective FE interpolation, which would desync per-rank → hang). O(boundary points), no dense global tree. Deadlock-safe by construction.

Default on; GE_LOCAL_FALLBACK=0 restores legacy. Serial path unchanged (gated mpi.size>1).

Validation

Rotation gate, T=x, np=5: max_err 1.06 → 0.003 (bit-identical to serial 0.003); turning the fallback off reproduces 1.06. tier-A green; used throughout the parallel adaptive-convection runs.

Underworld development team with AI support from Claude Code

…location

global_evaluate's parallel path had no correct handling for query points
outside the (old) domain: the evaluation-swarm migrate routes unclaimed points
by nearest rank centre-of-mass and strands them on an arbitrary rank, which
then extrapolates from a geometrically-far cell -> silently-wrong values,
parallel-only (e.g. an annulus boundary point reading the opposite side). This
corrupted mesh-variable transfer on parallel mover-adapted meshes.

Restore the serial evaluate() contract (interpolate inside / extrapolate from
the TRUE nearest cell outside / flag inside-outside) with a best-claim
out-of-domain fallback in global_evaluate_nd: allgather the (small,
boundary-layer) extrapolated set; every rank reports its nearest-local-cell
distance + its LOCAL rbf extrapolation; Allreduce(MIN dist / MIN rank / SUM
winner value) picks the globally-nearest rank's value. Only unconditional
collectives + local rbf_evaluate (never the collective FE interpolation, which
would desync) -> deadlock-safe. O(boundary points), no dense global tree.

Default on; GE_LOCAL_FALLBACK=0 restores legacy. Serial unchanged (gated
mpi.size>1). Validated: deterministic-rotation gate, linear field T=x, np=5,
max_err 1.06 -> 0.003 (== serial).

Underworld development team with AI support from Claude Code
Copilot AI review requested due to automatic review settings June 6, 2026 08:51
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR updates global_evaluate_nd to correct MPI-parallel evaluation for points that end up outside any rank’s owned cells after the swarm migrate round-trip, aiming to make parallel behavior match the serial evaluate() contract (interpolate in-domain; extrapolate just outside; return inside/outside via check_extrapolated) and to avoid deadlocks by using only unconditional collectives and rank-local RBF evaluation.

Changes:

  • Adds a parallel “best-claim” out-of-domain fallback: allgather stranded points, compute per-rank distance/value, and Allreduce to select the globally-best rank’s extrapolation.
  • Adds an environment-variable escape hatch (GE_LOCAL_FALLBACK) to disable the new fallback and restore legacy behavior.
  • Expands the global_evaluate_nd docstring/comments to document the parallel contract and deadlock-safety constraints.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +593 to +603
ext_vals, ext_flag = evaluate_nd(
expr, all_ext, rbf=True, evalf=False, verbose=False,
check_extrapolated=True,)
ext_vals = np.ascontiguousarray(
np.asarray(ext_vals, dtype=np.double).reshape((n_ext_total,) + expr_shape))
ext_flag = np.asarray(ext_flag).reshape(n_ext_total).astype(np.int32)

# Nearest-local-cell distance for every point (local kd-tree query).
mesh._build_kd_tree_index()
dist2, _ = mesh._centroid_index.query(all_ext, k=1, sqr_dists=True)
dist2 = np.ascontiguousarray(np.asarray(dist2, dtype=np.double).ravel())
Comment on lines +574 to +576
import os
if uw.mpi.size > 1 and os.environ.get("GE_LOCAL_FALLBACK", "1") not in (
"0", "off", "false", "no", ""):
Comment on lines +371 to +374
Contract: this is a faithful *parallel* counterpart of :func:`evaluate` —
a query point is interpolated wherever in the mesh it lands (on any rank),
a point just outside the mesh is extrapolated from its true nearest cell,
and ``check_extrapolated`` returns an inside/outside flag per point. The
Comment on lines +533 to +537
# ------------------------------------------------------------------
# Out-of-domain extrapolation — keep the parallel result a faithful
# match for the serial ``evaluate()`` contract: interpolate a point
# wherever it lands across ranks, extrapolate a point just outside the
# mesh, and flag inside/outside.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants