mapping_code rules engine: config-driven token classification (replace hardcoded R case-when) #197

@NewGraphEnvironment

Context

lnk_pipeline_mapping_code classifies the second token of mapping_code_<sp> via two hardcoded R case-when chains (R/lnk_pipeline_mapping_code.R:196-210), one each for resident and anadromous flavors:

mc_barrier_resident[any_remed] <- "REMEDIATED"
mc_barrier_resident[!set & any_dam_resident] <- "DAM"
mc_barrier_resident[!set & any_anth & any_pscis & !any_dam_resident] <- "ASSESSED"
mc_barrier_resident[!set & any_anth & !any_pscis & !any_dam_resident] <- "MODELLED"
mc_barrier_resident[!set & !any_anth] <- "NONE"

Inputs are also hardcoded: the function probes specific column names in the access tibble (has_barriers_anthropogenic_dnstr, has_barriers_pscis_dnstr, has_barriers_dams_dnstr, dam_dnstr_ind, remediated_dnstr_ind).

What's locked

| Surface | Hardcoded? | Where |
|---|---|---|
| Token vocabulary (DAM, MODELLED, ASSESSED, REMEDIATED, NONE) | Yes | Function body |
| Precedence order | Yes | Function body |
| Condition expressions (any_anth & any_pscis & !any_dam) | Yes | Function body |
| Source-flag column names | Yes | has() probes |
| Residence-flavor chains (resident vs anadromous) | Yes | Two parallel chains in function body |

What's flex today

Why this matters

Adding a new token (e.g., FALLS for natural-barrier cases, CLOSURE for regional-management closures, SUBSURFACE for sub-surface) requires:

  • Editing function body
  • Coordinating across both residence chains
  • Adding column-probe entries for new source flags
  • Re-running full pipeline + parity

Same flavor as #189 (species residence hardcoded) — config-driven would let bundles override per-project.

Design

Naming convention

R-side: <noun>_<verb> per NGE convention. Function name stays lnk_pipeline_mapping_code (pre-existing, exported). Internal rule-evaluation helper: .lnk_rules_eval(rules, env) — verb-last, returns the per-row classification vector.

CSV: parameters_mapping_code_rules.csv per existing parameters_<noun>.csv convention (parameters_habitat_dimensions.csv, parameters_habitat_thresholds.csv, parameters_fresh.csv).

Param naming follows <type>_<role>:

  • rules arg on lnk_pipeline_mapping_code — accepts a tibble (loaded from CSV) or named list (programmatic override)

Rule shape

CSV columns:

flavor,precedence,token,when
resident,1,REMEDIATED,any_remed
resident,2,DAM,any_dam_resident
resident,3,ASSESSED,any_anth & any_pscis & !any_dam_resident
resident,4,MODELLED,any_anth & !any_pscis & !any_dam_resident
resident,5,NONE,!any_anth
anadromous,1,REMEDIATED,any_remed
anadromous,2,DAM,any_dam_anadr
anadromous,3,ASSESSED,any_pscis
anadromous,4,MODELLED,any_anth
anadromous,5,NONE,!any_anth
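
A minimal sketch of how a table in this shape could be read and prepared, using base R only. The inline string stands in for the CSV file and holds just the resident rows; `rules_resident` and `when_exprs` are illustrative names, not part of the package.

```r
# Inline stand-in for parameters_mapping_code_rules.csv (resident rows only).
rules_csv <- "flavor,precedence,token,when
resident,1,REMEDIATED,any_remed
resident,2,DAM,any_dam_resident
resident,3,ASSESSED,any_anth & any_pscis & !any_dam_resident
resident,4,MODELLED,any_anth & !any_pscis & !any_dam_resident
resident,5,NONE,!any_anth"

rules <- read.csv(text = rules_csv, stringsAsFactors = FALSE, strip.white = TRUE)

# Select one flavor and put its rules in precedence order.
rules_resident <- rules[rules$flavor == "resident", ]
rules_resident <- rules_resident[order(rules_resident$precedence), ]

# Each `when` cell is a plain string until it is parsed into an R expression.
when_exprs <- lapply(rules_resident$when, function(w) parse(text = w)[[1]])

# The identifiers a rule depends on can be listed without evaluating it.
all.vars(when_exprs[[3]])
```

Because the `when` expressions use `&` and `!` but no commas, they need no quoting in the CSV; a rule that needed a comma (for example a function call with two arguments) would have to be quoted.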

Abstract framing

Rules table is general-purpose. Same shape could drive future analogous classifications (e.g., habitat-class assignment, edge-type bucketing). The rules engine helper .lnk_rules_eval is the reusable primitive — give it a rules tibble + an environment with the bound variables (any_anth, any_pscis, etc.) → returns a length-N character vector.

Engine responsibilities:

  • Sort rules by (flavor, precedence)
  • For each row in input data: walk rules in order; the first rule whose when is TRUE assigns its token, and later rules fill only rows that are still NA (they never overwrite)
  • Validation: lnk_rules_validate(rules, available_columns) checks for duplicate (flavor, precedence), unparseable when expressions, missing tokens, unknown column references
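
The first-match-wins walk described above could look roughly like the following. This is a sketch under assumptions, not the package implementation: `rules_eval_sketch` is a hypothetical name, the extra `flavor` argument is an assumption (the issue only specifies `rules` and `env`), and `env` is taken to be a data frame, list, or environment of equal-length logical vectors.

```r
# Hypothetical first-match-wins evaluator (illustrative only).
rules_eval_sketch <- function(rules, env, flavor) {
  vars <- as.list(env)              # accepts a data frame, list, or environment
  n <- length(vars[[1]])            # assumes every bound variable has length n
  r <- rules[rules$flavor == flavor, , drop = FALSE]
  r <- r[order(r$precedence), , drop = FALSE]
  out <- rep(NA_character_, n)
  for (i in seq_len(nrow(r))) {
    hit <- eval(parse(text = r$when[i])[[1]], envir = vars, enclos = baseenv())
    hit <- rep_len(as.logical(hit), n)        # lets a scalar TRUE act as a fallthrough
    fill <- is.na(out) & !is.na(hit) & hit    # only rows no earlier rule has claimed
    out[fill] <- r$token[i]
  }
  out
}

rules <- data.frame(
  flavor = "resident",
  precedence = 1:5,
  token = c("REMEDIATED", "DAM", "ASSESSED", "MODELLED", "NONE"),
  when = c(
    "any_remed",
    "any_dam_resident",
    "any_anth & any_pscis & !any_dam_resident",
    "any_anth & !any_pscis & !any_dam_resident",
    "!any_anth"
  ),
  stringsAsFactors = FALSE
)

# Five made-up rows, one per expected token.
access <- data.frame(
  any_remed        = c(TRUE,  FALSE, FALSE, FALSE, FALSE),
  any_dam_resident = c(TRUE,  TRUE,  FALSE, FALSE, FALSE),
  any_anth         = c(TRUE,  TRUE,  TRUE,  TRUE,  FALSE),
  any_pscis        = c(FALSE, FALSE, TRUE,  FALSE, FALSE)
)

tokens <- rules_eval_sketch(rules, access, "resident")
```

Each rule is evaluated once over the whole vector rather than row by row, which gives the same result as the per-row walk because a row is only filled while it is still NA. A `when` that evaluates to NA for a row is treated as no match here; that choice is an assumption.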

Source-flag column names

Today's six columns (has_barriers_anthropogenic_dnstr, etc.) become the environment variables the when expressions reference. Rules library defines aliases:

  • any_anth := has_barriers_anthropogenic_dnstr
  • any_pscis := has_barriers_pscis_dnstr
  • any_dam_resident := dam_dnstr_ind OR has_barriers_dams_dnstr (fallback when sequence-aware unavailable)
  • any_dam_anadr := has_barriers_dams_dnstr
  • any_remed := remediated_dnstr_ind OR has_remediated_dnstr

Aliases declared in parameters_mapping_code_inputs.csv (or as an inputs: block in the rules file). Decouples user-facing rule expressions from the raw column names — column names can be renamed later (#189-flavor cleanup) without touching every rule's when.
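
One way the alias layer might be bound to the access tibble is sketched below. The layout is an assumption: each alias maps to an ordered list of candidate columns, and the "OR ... fallback" wording is read as "use the first candidate column that exists". `alias_bind_sketch` is a hypothetical name, and erroring when no candidate exists is also an assumption.

```r
# Alias -> ordered candidate source columns (first one present wins).
aliases <- list(
  any_anth         = "has_barriers_anthropogenic_dnstr",
  any_pscis        = "has_barriers_pscis_dnstr",
  any_dam_resident = c("dam_dnstr_ind", "has_barriers_dams_dnstr"),
  any_dam_anadr    = "has_barriers_dams_dnstr",
  any_remed        = c("remediated_dnstr_ind", "has_remediated_dnstr")
)

# Hypothetical helper: resolve every alias to a logical vector.
alias_bind_sketch <- function(access, aliases) {
  lapply(aliases, function(candidates) {
    found <- candidates[candidates %in% names(access)]
    if (length(found) == 0) {
      stop("no source column found among: ", paste(candidates, collapse = ", "))
    }
    as.logical(access[[found[1]]])
  })
}

# Made-up access table lacking the sequence-aware columns, so both
# two-candidate aliases fall back to their second column.
access <- data.frame(
  has_barriers_anthropogenic_dnstr = c(TRUE, FALSE),
  has_barriers_pscis_dnstr         = c(FALSE, FALSE),
  has_barriers_dams_dnstr          = c(TRUE, FALSE),
  has_remediated_dnstr             = c(FALSE, TRUE)
)

env <- alias_bind_sketch(access, aliases)
```

The resulting named list can be passed wherever the rule expressions are evaluated, so a later column rename only touches the alias table.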

Validation

lnk_rules_validate(rules, available_columns) runs before evaluation:

  • Each when parses as R (no syntax errors)
  • All identifiers in when are declared aliases or known columns
  • Precedences within flavor are unique
  • All declared flavors have a "fallthrough" rule (e.g. precedence-N when = TRUE) — guarantees no NA tokens leak
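
The four checks above could be sketched in base R as follows. `rules_validate_sketch` is a hypothetical name, returning a character vector of problems (empty when the rules pass) is an assumption about the interface, and the fallthrough check uses the strictest reading: the highest-precedence rule of each flavor must be the literal `TRUE`.

```r
# Hypothetical validator: returns a character vector of problems (empty = OK).
rules_validate_sketch <- function(rules, available_columns) {
  problems <- character(0)
  for (i in seq_len(nrow(rules))) {
    parsed <- tryCatch(parse(text = rules$when[i]), error = function(e) NULL)
    if (is.null(parsed) || length(parsed) != 1) {
      problems <- c(problems, sprintf("row %d: `when` is not a single R expression", i))
      next
    }
    # all.vars() lists identifiers without evaluating the expression.
    unknown <- setdiff(all.vars(parsed), available_columns)
    if (length(unknown) > 0) {
      problems <- c(problems, sprintf("row %d: unknown identifier(s): %s",
                                      i, paste(unknown, collapse = ", ")))
    }
  }
  if (anyDuplicated(paste(rules$flavor, rules$precedence)) > 0) {
    problems <- c(problems, "duplicate (flavor, precedence)")
  }
  for (f in unique(rules$flavor)) {
    r <- rules[rules$flavor == f, , drop = FALSE]
    last_when <- r$when[which.max(r$precedence)]
    if (!identical(trimws(last_when), "TRUE")) {
      problems <- c(problems, sprintf("flavor '%s': last rule is not a TRUE fallthrough", f))
    }
  }
  problems
}

known <- c("any_dam_resident", "any_anth")

good <- data.frame(
  flavor = "resident", precedence = 1:2,
  token = c("DAM", "NONE"),
  when = c("any_dam_resident", "TRUE"),
  stringsAsFactors = FALSE
)

bad <- data.frame(
  flavor = "resident", precedence = c(1, 1, 2),
  token = c("DAM", "ASSESSED", "NONE"),
  when = c("any_dam_resident &", "typo_flag", "!any_anth"),
  stringsAsFactors = FALSE
)

problems_good <- rules_validate_sketch(good, known)
problems_bad <- rules_validate_sketch(bad, known)
```

Under this strict reading, a chain that ends in `!any_anth` is flagged even when the chain is logically exhaustive, so the real check would either need a looser exhaustiveness test or the shipped rules would need an explicit `TRUE` row.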

Acceptance

  • inst/extdata/configs/bcfishpass/parameters_mapping_code_rules.csv ships the current rules verbatim. Loaded into loaded$parameters_mapping_code_rules via lnk_load_overrides.
  • inst/extdata/configs/default/parameters_mapping_code_rules.csv same content (extendable per-bundle).
  • lnk_pipeline_mapping_code takes a rules arg (default reads from loaded if available, else falls back to hardcoded chains for back-compat one release).
  • .lnk_rules_eval(rules, env) helper, internal and unexported (@noRd) initially; could be promoted to user-facing if other classifications adopt it.
  • lnk_rules_validate(rules, available_columns) helper: the validation surface.
  • Bcfishpass-bundle reproduces current output bit-identical.
  • Default-bundle can add a new rule (e.g., insert FALLS after DAM) without R code change.
  • Deprecation warning when function falls back to hardcoded chains; remove in v0.42.0.
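
The rules-resolution order in the list above (explicit arg, then loaded config, then hardcoded fallback with a warning) might be sketched like this. `rules_resolve_sketch` is a hypothetical helper, and returning NULL to signal "use the hardcoded chains" is an assumption about how the caller would branch.

```r
# Hypothetical resolver for the `rules` argument.
rules_resolve_sketch <- function(rules = NULL, loaded = list()) {
  if (!is.null(rules)) {
    return(rules)                                   # explicit override wins
  }
  if (!is.null(loaded$parameters_mapping_code_rules)) {
    return(loaded$parameters_mapping_code_rules)    # rules shipped with the bundle
  }
  warning(
    "No mapping_code rules supplied; falling back to hardcoded chains. ",
    "This fallback is deprecated and scheduled for removal in v0.42.0.",
    call. = FALSE
  )
  NULL                                              # caller uses the legacy chains
}

bundle <- list(
  parameters_mapping_code_rules = data.frame(
    flavor = "resident", precedence = 1L, token = "NONE", when = "TRUE",
    stringsAsFactors = FALSE
  )
)

from_bundle <- rules_resolve_sketch(loaded = bundle)
fell_back <- tryCatch(rules_resolve_sketch(), warning = function(w) "warned")
```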

Out of scope

  • SQL-side translation: rules could in principle be compiled to SQL CASE WHEN and run as INSERT...SELECT (no R round-trip, ~10× faster). Decision: stay R-side. A 1–2 min wall-clock time for a provincial run is acceptable; R-side keeps multi-row patterns, NULL semantics, and unit-testability simpler. SQL is a follow-on if performance ever bites.
  • Per-bundle column aliasing: aliases stay in the package's parameters_mapping_code_inputs.csv for now; bundle-level alias override is a follow-up.
  • Rule-evaluation backend choice (eval(parse()) vs rlang::eval_tidy): implementation detail; pick at coding time.

References
