schema_consolidate: enumerate wgc_tables per-source, not from destination #185

@NewGraphEnvironment

Context

data-raw/schema_consolidate.R (driver-side, runs on M4) iterates per-source per-table:

for (src in sources) {
  wgc_tables <- query_destination(...)        # ← enumerated from DESTINATION
  for (t in wgc_tables) {
    ssh src "psql -c 'COPY (SELECT ... FROM <schema>.<t> WHERE wsg IN bucket) TO STDOUT'"
    ...
  }
}

wgc_tables is the list of tables on the destination. The destination accumulates tables across runs (e.g. M4's fresh_default carries streams_habitat_ch/sk/st/pk/ko/co residue from prior runs with study-area WSGs whose species presence covered the full set). Source hosts only create habitat tables for species their assigned bucket actually models (lnk_persist_init creates streams_habitat_<species> per species in the WSG's presence list).

Problem

When a source's table set is a strict subset of destination's, the per-table COPY hits the first absent table and breaks the loop:

ERROR:  relation "fresh_default.streams_habitat_ch" does not exist
LINE 1: COPY (SELECT * FROM fresh_default.streams_habitat_ch WHERE w...

Effect: tables AFTER the failure point (alphabetically) never get copied even when they exist on the source.

Caught 2026-05-15 on branch 180-step0-additive-default (Peace Tier 2 retry, data-raw/logs/wsgs_run_pipeline/20260515_091243_consolidate.log). Cyphers' Peace bucket only built habitat tables for BT/GR/RB; M4 had 11 tables; loop broke at streams_habitat_ch so _gr and _rb data never copied. 7 WSGs (TOOD, NATR, UOMI, PCEA, FINA, MESI, CARP) ended up in M4's streams_habitat_bt only — missing RB/GR (and KO for NATR/CARP). Per-WSG modelling RDS confirmed the source data existed; consolidate just dropped it.
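The failure mode can be reproduced in miniature without a database. The sketch below is an illustration only: it uses four of the table names from the incident rather than M4's full set of 11, and `copy_from_source()` is a hypothetical stand-in for the `ssh` + `psql` COPY step, not the function in `schema_consolidate.R`.

```r
# Simplified table sets: destination carries a table the source never built.
dest_tables   <- paste0("streams_habitat_", c("bt", "ch", "gr", "rb"))
source_tables <- paste0("streams_habitat_", c("bt", "gr", "rb"))

# Hypothetical stand-in for the remote COPY: errors when the relation is
# absent on the source, mirroring the psql error text.
copy_from_source <- function(tbl) {
  if (!tbl %in% source_tables) {
    stop(sprintf('relation "fresh_default.%s" does not exist', tbl))
  }
  invisible(tbl)
}

copied <- character(0)
for (tbl in dest_tables) {                 # enumerated from the destination
  ok <- tryCatch({
    copy_from_source(tbl)
    TRUE
  }, error = function(e) FALSE)
  if (!ok) break                           # first absent table ends the loop
  copied <- c(copied, tbl)
}

# Tables that exist on the source but were never attempted.
missed <- setdiff(source_tables, copied)
```

With these inputs only `streams_habitat_bt` is copied; `_gr` and `_rb` exist on the source but are never reached because the loop stops at `_ch`.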

Goals

  • Per-source enumeration: each source's COPY loop iterates only tables that exist on that source (intersect with destination tables so we don't try to write to a destination-absent table).
  • next over break: a single per-table failure logs a warning and continues; doesn't poison other tables for the same source.
  • Return value's per-source result includes the full copied-table set + any errored tables — no silent partial copies.
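A minimal sketch of how the three goals could fit together, assuming the source's table list has already been fetched by some per-source query (not shown). `consolidate_source()`, its arguments, the result field names, and the host name `src_a` are all hypothetical; `copy_table` stands in for the real `ssh` + `psql` COPY step.

```r
# Hypothetical per-source driver: copy only tables present on both sides,
# continue past per-table failures, and report everything that happened.
consolidate_source <- function(src, source_tables, dest_tables, copy_table) {
  todo    <- intersect(source_tables, dest_tables)  # exists on source AND destination
  skipped <- setdiff(dest_tables, source_tables)    # absent on source: not an error
  copied  <- character(0)
  errored <- character(0)

  for (tbl in todo) {
    ok <- tryCatch({
      copy_table(src, tbl)
      TRUE
    }, error = function(e) {
      # Log per-source per-table and carry on.
      message(sprintf("WARN [%s] %s: %s", src, tbl, conditionMessage(e)))
      FALSE
    })
    if (!ok) {
      errored <- c(errored, tbl)
      next                                          # next, not break
    }
    copied <- c(copied, tbl)
  }

  list(source = src, copied = copied, skipped_absent = skipped,
       errored = errored, ok = length(errored) == 0L)
}

# Usage with a fake copy step that fails on one table.
dest_tables   <- paste0("streams_habitat_", c("bt", "ch", "gr", "rb"))
source_tables <- paste0("streams_habitat_", c("bt", "gr", "rb"))
fake_copy <- function(src, tbl) {
  if (tbl == "streams_habitat_gr") stop("simulated COPY failure")
  invisible(TRUE)
}
res <- consolidate_source("src_a", source_tables, dest_tables, fake_copy)
```

Here `streams_habitat_ch` is recorded as skipped-because-absent rather than attempted, and the simulated failure on `_gr` does not stop `_rb` from copying.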

Acceptance

  • Heterogeneous-config dispatch (some sources missing some habitat tables vs destination) succeeds end-to-end: every table that exists on a source lands on destination.
  • A specific table erroring is logged per-source per-table but does not prevent other tables for the same source from copying.
  • Result object names every table actually copied + any that errored; per-source ok reflects "all tables either copied or skipped-because-absent".
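One way to check the acceptance criteria against a result object is to confirm that every table expected from a source is accounted for as either copied or errored. The sketch below assumes a hypothetical per-source result shape (`source`, `source_tables`, `copied`, `errored`); the real return value may differ.

```r
# Hypothetical check: a source is ok only if nothing errored and no
# expected table went missing without being reported.
check_source_result <- function(r, dest_tables) {
  expected    <- intersect(r$source_tables, dest_tables)
  unaccounted <- setdiff(expected, c(r$copied, r$errored))
  list(
    source      = r$source,
    ok          = length(r$errored) == 0L && length(unaccounted) == 0L,
    errored     = r$errored,
    unaccounted = unaccounted
  )
}

dest_tables <- paste0("streams_habitat_", c("bt", "ch", "gr", "rb"))
src_tables  <- paste0("streams_habitat_", c("bt", "gr", "rb"))

# A complete copy, and a silent partial copy like the one in the incident.
complete <- list(source = "src_a", source_tables = src_tables,
                 copied = src_tables, errored = character(0))
partial  <- list(source = "src_b", source_tables = src_tables,
                 copied = "streams_habitat_bt", errored = character(0))

checks <- lapply(list(complete, partial), check_source_result,
                 dest_tables = dest_tables)
```

The second check flags `_gr` and `_rb` as unaccounted, which is the silent partial copy the acceptance criteria are meant to rule out.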

References

  • data-raw/schema_consolidate.R: wgc_tables enumeration around L135; per-table loop break around L210.
  • Consolidate log evidence: data-raw/logs/wsgs_run_pipeline/20260515_091243_consolidate.log (both cyphers fail at stage=source_copy on streams_habitat_ch).
  • Continuation of #180, "Step 0 pre-clean should be additive-by-default; --reset-schema opts into full wipe" (additive Step 0 + bucket-filtered COPY-streaming). The fix lands on that branch: same code path, immediate follow-on.
