Problem
The wide per-species persist tables, <persist>.streams_access and <persist>.streams_mapping_code, carry one column per species (access_<sp>, has_barriers_<sp>_dnstr, mapping_code_<sp>). Their column set is therefore a function of whatever species vector was passed to lnk_persist_init() when the table was first created.
lnk_persist_init() uses CREATE TABLE IF NOT EXISTS, and its drift guard (.lnk_validate_persist_table) only detects GENERATED ALWAYS column drift. It is blind to species-column-set drift: a host whose table was created from a different species set keeps the stale shape silently, and the next CREATE IF NOT EXISTS is a no-op.
This bites the cross-host COPY-consolidate (data-raw/schema_consolidate.R), which (until the fix below) did a positional COPY (SELECT *) TO STDOUT → COPY <t> FROM STDIN. The moment two hosts' wide tables had different column counts, consolidate failed with ERROR: extra data after last expected column.
How it was found
3-WSG smoke on M1 + 2 cyphers (2026-05-25, link#175). streams / streams_habitat_* / barriers / barrier_overrides consolidated fine; streams_access + streams_mapping_code failed. Diagnosis:
- lnk_pipeline_run sizes persist DDL to cfg$species = 8 (BT CH CM CO PK SK ST WCT); see R/lnk_pipeline_run.R (lnk_persist_init(conn, cfg, species = cfg$species)), landed in "lnk_persist_init: wide tables sized to active_species break on heterogeneous WSG runs" #194 (v0.40.2).
- data-raw/cypher_prep.sh seeded persist from unique(loaded$parameters_fresh$species_code) = 11 (adds CT DV RB).
So the cyphers' wide tables had 6 extra columns (access_ct/dv/rb, has_barriers_ct/dv/rb_dnstr) → positional COPY mismatch. (The warm cypher snapshot predates the wide tables, #187/v0.40.0, so this was the species-source mismatch, not a stale-snapshot artifact.)
Fixes already landed on the 175-... branch
1. data-raw/cypher_prep.sh: align persist species to cfg$species (with the same parameters_fresh fallback lnk_pipeline_species() uses), matching lnk_pipeline_run. Removes the drift at the source.
2. data-raw/schema_consolidate.R: COPY by shared column name (destination ordinal order), not positional. Enumerates columns on both sides, copies the intersection by name; source-only columns drop at SELECT, destination-only take default/NULL. Makes the consolidate shape-tolerant regardless of host drift. (Sibling to "schema_consolidate: enumerate wgc_tables per-source, not from destination" #185.)
Design north star: abstract, not hardcoded
The persist + consolidate machinery should be host-agnostic and species-count-agnostic. It must not matter which machine runs a bucket, how many species columns a wide table has, or which WSGs are present — everything (column sets, table sets, species, WSG buckets, host/transport) is discovered at runtime, never hardcoded. Fix 2 above embodies this on the consolidate side (runtime column intersection, copy-by-name). The remaining work below embodies it on the persist side.
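To make the copy-by-name idea concrete, here is a minimal base-R sketch of how the COPY pair could be derived from the column names enumerated on each side. It is not the code in data-raw/schema_consolidate.R: the helper name copy_by_name_sql(), the id_segment column, and the persist.streams_access table name are all hypothetical, and the two column vectors are assumed to have been read beforehand (for example from information_schema.columns, ordered by ordinal_position).

```r
# Hypothetical helper (not from the repo): build the COPY pair for one table
# from the column names enumerated on the source and destination hosts.
copy_by_name_sql <- function(table, src_cols, dst_cols) {
  # Keep destination ordinal order; source-only columns drop out here.
  shared <- dst_cols[dst_cols %in% src_cols]
  if (length(shared) == 0) {
    stop("no shared columns for ", table)
  }
  # Double-quote identifiers, escaping any embedded double quotes.
  quoted <- sprintf('"%s"', gsub('"', '""', shared, fixed = TRUE))
  col_list <- paste(quoted, collapse = ", ")
  list(
    shared = shared,
    # Run on the source host: select only the shared columns, by name.
    export = sprintf("COPY (SELECT %s FROM %s) TO STDOUT", col_list, table),
    # Run on the destination host: unlisted columns take their default/NULL.
    import = sprintf("COPY %s (%s) FROM STDIN", table, col_list)
  )
}

# Illustrative column sets: the source has an extra access_ct column and a
# different column order than the destination.
src <- c("id_segment", "access_bt", "access_ct", "has_barriers_bt_dnstr")
dst <- c("id_segment", "has_barriers_bt_dnstr", "access_bt")
sql <- copy_by_name_sql("persist.streams_access", src, dst)
cat(sql$export, sql$import, sep = "\n")
```

Because both statements name the same columns in the same order, a difference in column count or column order between hosts no longer matters to the transfer.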
Remaining hardening (this issue)
lnk_persist_init() should detect species-column-set drift on streams_access / streams_mapping_code (compare live columns vs the expected set derived from species) and DROP CASCADE + recreate under force_recreate = TRUE; today it only handles GENERATED drift. This makes the persist tables self-healing on any host when the configured species set changes, rather than relying on operators remembering to drop the wide tables. The per-source flag columns (.lnk_cols_streams_access_source_flags()) are still a hardcoded list (#197); data-driving those is the same abstraction principle applied to the second-token sources. Fold into / coordinate with the persist-family reshape (#177), #196, and #197.
Refs
175-promote-with-mapping-code-flag-to-stand branch base, v0.40.4).
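The species-column-set check proposed under "Remaining hardening" could look roughly like the base-R sketch below. It is an illustration only, not the planned implementation: expected_species_cols() and species_col_drift() are hypothetical names, id_segment is a made-up non-species column, and the regular expression used to pick out per-species columns is an assumption that may need adjusting (for instance, to avoid catching the per-source flag columns). In practice the live column names would presumably be read from the database catalog, and a drifted result would be what triggers DROP CASCADE + recreate under force_recreate = TRUE.

```r
# Hypothetical helper: expected per-species columns of streams_access,
# following the naming pattern access_<sp> and has_barriers_<sp>_dnstr.
expected_species_cols <- function(species) {
  sp <- tolower(species)
  c(paste0("access_", sp), paste0("has_barriers_", sp, "_dnstr"))
}

# Compare live column names against the set expected for `species`.
# Assumption: per-species columns are exactly those matching the two patterns.
species_col_drift <- function(live_cols, species) {
  expected <- expected_species_cols(species)
  live_sp <- grep("^access_|^has_barriers_.+_dnstr$", live_cols, value = TRUE)
  list(
    missing = setdiff(expected, live_sp),  # expected but absent from the table
    extra   = setdiff(live_sp, expected),  # present but no longer configured
    drifted = !setequal(expected, live_sp)
  )
}

# Reproduce the smoke-test situation: the table was seeded from 11 species,
# while the configured set is the 8 species in cfg$species.
cfg_species <- c("BT", "CH", "CM", "CO", "PK", "SK", "ST", "WCT")
seeded <- c(cfg_species, "CT", "DV", "RB")
live <- c("id_segment", expected_species_cols(seeded))

drift <- species_col_drift(live, cfg_species)
print(drift$drifted)
print(drift$extra)
```

With these inputs the extra set contains the six ct/dv/rb columns named in the diagnosis, which is the condition the current GENERATED-only guard does not see.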