
Persist+consolidate must be host-/species-count-agnostic: persist_init blind to species-column-set drift #204

Description

@NewGraphEnvironment

Problem

The wide per-species persist tables — <persist>.streams_access and <persist>.streams_mapping_code — carry one column per species (access_<sp>, has_barriers_<sp>_dnstr, mapping_code_<sp>). Their column set is therefore a function of whatever species vector was passed to lnk_persist_init() when the table was first created.

lnk_persist_init() uses CREATE TABLE IF NOT EXISTS, and its drift guard (.lnk_validate_persist_table) only detects GENERATED ALWAYS column drift. It is blind to species-column-set drift: a host whose table was created from a different species set keeps the stale shape silently, and the next CREATE IF NOT EXISTS is a no-op.
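The no-op behaviour can be reproduced in isolation. The following is a minimal sketch, not the package's code: it uses Python's built-in SQLite as a stand-in for Postgres (both honour CREATE TABLE IF NOT EXISTS the same way for this purpose). The key column `id_segment`, the species codes `bt` and `ch`, and the helper names are illustrative assumptions; only `ct`, `dv`, and `rb` come from the diagnosis in this issue.

```python
import sqlite3

def wide_columns(species):
    """Hypothetical helper: per-species columns of a wide persist table."""
    cols = []
    for sp in species:
        cols.append(f"access_{sp}")
        cols.append(f"has_barriers_{sp}_dnstr")
    return cols

def create_if_not_exists(conn, species):
    # Mirrors the CREATE TABLE IF NOT EXISTS pattern described above.
    # id_segment is an assumed key column, not taken from the real schema.
    col_defs = ", ".join(f"{c} INTEGER" for c in wide_columns(species))
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS streams_access (id_segment INTEGER, {col_defs})"
    )

def live_columns(conn):
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    return [row[1] for row in conn.execute("PRAGMA table_info(streams_access)")]

conn = sqlite3.connect(":memory:")

# First init on this host used a wider species set.
create_if_not_exists(conn, ["bt", "ch", "ct", "dv", "rb"])
before = live_columns(conn)

# A later init with a narrower species set silently changes nothing.
create_if_not_exists(conn, ["bt", "ch"])
after = live_columns(conn)

print(before == after)       # the stale shape survives
print("access_ct" in after)  # columns for dropped species are still present
```

The second call raises no error and no warning, which is why the drift goes unnoticed until something downstream depends on the column count.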

This breaks the cross-host COPY-consolidate (data-raw/schema_consolidate.R), which (until the fix below) did a positional COPY (SELECT *) TO STDOUT feeding COPY <t> FROM STDIN. As soon as two hosts' wide tables had different column counts, consolidate failed with ERROR: extra data after last expected column.

How it was found

3-WSG smoke on M1 + 2 cyphers (2026-05-25, link#175). streams/streams_habitat_*/barriers/barrier_overrides consolidated fine; streams_access + streams_mapping_code failed. Diagnosis:

So the cyphers' wide tables had 6 extra columns (access_ct/dv/rb, has_barriers_ct/dv/rb_dnstr) → positional COPY mismatch. (The warm cypher snapshot predates the wide tables, #187/v0.40.0, so this was the species-source mismatch, not a stale-snapshot artifact.)

Fixes already landed on the 175-... branch

  1. data-raw/cypher_prep.sh — align persist species to cfg$species (with the same parameters_fresh fallback lnk_pipeline_species() uses), matching lnk_pipeline_run. Removes the drift at the source.
  2. data-raw/schema_consolidate.R — COPY by shared column name (destination ordinal order), not positional. Enumerates columns on both sides, copies the intersection by name; source-only columns drop at SELECT, destination-only take default/NULL. Makes the consolidate shape-tolerant regardless of host drift. (Sibling to schema_consolidate: enumerate wgc_tables per-source, not from destination #185.)
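The copy-by-name idea in fix 2 can be sketched as follows. This is an illustration in Python, not the contents of data-raw/schema_consolidate.R: the function names, the schema name `persist`, and the example column lists are assumptions. It only builds the two COPY statements; enumerating columns on each host and streaming the data between them are left out.

```python
def shared_columns(src_cols, dst_cols):
    """Columns present on both sides, in destination ordinal order."""
    src = set(src_cols)
    return [c for c in dst_cols if c in src]

def quote_ident(name):
    """Double-quote an identifier, escaping embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'

def copy_statements(table, src_cols, dst_cols):
    """Build the export/import COPY pair for the shared columns.

    `table` is assumed to be a trusted, already schema-qualified name.
    """
    cols = shared_columns(src_cols, dst_cols)
    if not cols:
        raise ValueError(f"no shared columns for {table}")
    col_list = ", ".join(quote_ident(c) for c in cols)
    # Source-only columns are dropped here because they are never selected.
    export_sql = f"COPY (SELECT {col_list} FROM {table}) TO STDOUT"
    # Destination-only columns are omitted, so they take their default/NULL.
    import_sql = f"COPY {table} ({col_list}) FROM STDIN"
    return export_sql, import_sql

# Example: the source host has extra ct columns, the destination has an extra ch column.
src_cols = ["id_segment", "access_bt", "access_ct", "has_barriers_ct_dnstr"]
dst_cols = ["id_segment", "access_bt", "access_ch"]
export_sql, import_sql = copy_statements("persist.streams_access", src_cols, dst_cols)
print(export_sql)
print(import_sql)
```

Because both statements name the same explicit column list, the stream stays aligned even when the two tables differ in width or column order.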

Design north star: abstract, not hardcoded

The persist + consolidate machinery should be host-agnostic and species-count-agnostic. It must not matter which machine runs a bucket, how many species columns a wide table has, or which WSGs are present — everything (column sets, table sets, species, WSG buckets, host/transport) is discovered at runtime, never hardcoded. Fix 2 above embodies this on the consolidate side (runtime column intersection, copy-by-name). The remaining work below embodies it on the persist side.

Remaining hardening (this issue)

lnk_persist_init() should detect species-column-set drift on streams_access / streams_mapping_code (compare live columns vs the expected set derived from species) and DROP CASCADE + recreate under force_recreate = TRUE. Today it only handles GENERATED drift. This makes the persist tables self-healing on any host when the configured species set changes, rather than relying on operators remembering to drop the wide tables.

The per-source flag columns (.lnk_cols_streams_access_source_flags()) are still a hardcoded list (#197). Data-driving those is the same abstraction principle applied to the second-token sources.

Fold into / coordinate with the persist-family reshape (#177), #196, and #197.
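One possible shape for the drift check, as a sketch under stated assumptions rather than a proposed implementation: it is written in Python for illustration, the function names are hypothetical, and it assumes species columns can be recognised purely by the access_<sp> and has_barriers_<sp>_dnstr naming patterns. The behaviour when force_recreate is not set (here, signalling an error) is also an assumption; this issue only specifies the force_recreate = TRUE path.

```python
import re

# Assumed naming patterns for per-species columns on streams_access.
SPECIES_PATTERNS = [
    re.compile(r"^access_(\w+)$"),
    re.compile(r"^has_barriers_(\w+)_dnstr$"),
]

def expected_columns(species):
    """Expected per-species column set derived from the configured species."""
    return {f"access_{sp}" for sp in species} | {
        f"has_barriers_{sp}_dnstr" for sp in species
    }

def species_drift(live_cols, species):
    """Compare live per-species columns against the expected set."""
    live = {c for c in live_cols if any(p.match(c) for p in SPECIES_PATTERNS)}
    expected = expected_columns(species)
    return {
        "missing": sorted(expected - live),
        "extra": sorted(live - expected),
    }

def plan_action(live_cols, species, force_recreate=False):
    """Decide what an init step would do; no SQL is executed here."""
    drift = species_drift(live_cols, species)
    if not drift["missing"] and not drift["extra"]:
        return "keep"
    # Assumption: without force_recreate the drift is surfaced, not ignored.
    return "drop_and_recreate" if force_recreate else "error"

# Example: the table was created with bt + ct, but the config now lists only bt.
live = [
    "id_segment",
    "access_bt", "has_barriers_bt_dnstr",
    "access_ct", "has_barriers_ct_dnstr",
]
print(species_drift(live, ["bt"]))
print(plan_action(live, ["bt"], force_recreate=True))
```

Pattern matching is the weak point of this sketch: any non-species column that happens to match access_* would be misread as drift, so a real check would need an authoritative list of non-species columns.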

Refs
