feat(dataset): ploidy-1 unphased union view (#222) #224

Merged
d-laub merged 13 commits into main from worktree-unphased-union-ploidy1
Jun 14, 2026

d-laub (Collaborator) commented Jun 14, 2026

Summary

Adds an opt-in unphased_union read-time view that folds a diploid dataset's two stored haplotypes onto a single haploid sequence — the union of called ALTs per (region, sample) — for haploid somatic modeling (issue #222).

  • New flag Dataset.with_settings(unphased_union=True) on the Haps reconstructor. Stored genotypes stay diploid on disk; this is purely a read-time view.
  • Under the flag: ds.ploidy == 1, n_variants(...).shape[-1] == 1 (naive sum across haplotypes, int32-preserved, no dedup), and "variants" / "variant-windows" decode at ploidy 1.
  • The core fold lives in get_variants_flat: a pure offset re-grouping (row_offsets[::ploidy]) with no sort and no data movement — safe because the downstream consumer is permutation-invariant. A content-level test confirms the union row is exactly hap-0's calls then hap-1's, with nothing dropped or duplicated.
  • Phased "haplotypes" / "annotated" output is rejected under the flag (guarded in both with_seqs and _check_valid_state, covering both orderings).
  • Removes the retired, order-dependent germline-CCF inference path (infer_germline_ccfs_ + _infer_germline_ccfs), which had been unused for about a year and was the only consumer that assumed start-ordering.
  • Documents the option in the genvarloader skill (SKILL.md).
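
The offset re-grouping described above can be illustrated with plain NumPy. This is a minimal sketch, not the code in get_variants_flat: it assumes a CSR-style ragged layout in which each (region, sample) pair owns `ploidy` consecutive rows, and the names `fold_offsets`, `rows`, `data`, and `row_offsets` are hypothetical.

```python
import numpy as np


def fold_offsets(row_offsets: np.ndarray, ploidy: int) -> np.ndarray:
    """Merge every `ploidy` consecutive ragged rows into one row.

    Assumes haplotypes of the same (region, sample) are stored as
    adjacent rows. Only the offsets are sliced; the data is untouched.
    """
    n_rows = len(row_offsets) - 1
    if n_rows % ploidy != 0:
        raise ValueError("number of rows must be a multiple of ploidy")
    # Keeping every ploidy-th boundary drops the boundaries between haplotypes.
    return row_offsets[::ploidy]


def rows(data: np.ndarray, offsets: np.ndarray) -> list:
    """Materialize the ragged rows as Python lists, for inspection only."""
    return [data[s:e].tolist() for s, e in zip(offsets[:-1], offsets[1:])]


# Toy data: 2 (region, sample) pairs x 2 haplotypes = 4 ragged rows.
data = np.array([10, 11, 12, 20, 30, 31], dtype=np.int32)
row_offsets = np.array([0, 2, 3, 3, 6], dtype=np.int64)

union_offsets = fold_offsets(row_offsets, ploidy=2)
diploid_rows = rows(data, row_offsets)
union_rows = rows(data, union_offsets)
```

In this toy case each union row is hap-0's entries followed by hap-1's, with nothing dropped or duplicated. The result is not sorted by position, which is why the fold is only safe for consumers that do not depend on order.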

Test Plan

  • New suite tests/dataset/test_unphased_union.py (15 tests): flag plumbing + genotypes guard, ploidy==1, n_variants fold + int32 contract, both phased-output guard orderings, variant-windows ploidy-axis collapse, union count == sum over haplotypes, union content == hap-0∥hap-1, ragged-variants path, toggle-off restores diploid, AF-filter composition.
  • Regression: test_flat_flanks.py, test_flat_variants_type.py, test_flat_mode_equivalence.py, test_rag_variants.py pass (the eff_ploidy == ploidy off-path is unchanged).
  • Full suite: pytest tests/dataset tests/unit — 528 passed, 21 skipped, 2 xfailed.
  • ruff check python/ clean; pyrefly typecheck 0 errors.

Acceptance criteria from the spec (all covered by passing tests): n_variants(...).shape[-1] == 1, variant-windows P=1 layout, union count == sum over haplotypes, haplotypes/annotated raise under the flag, ds.ploidy == 1.
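
The count-related criteria (trailing axis of length 1, union count equal to the sum over haplotypes, int32 preserved) can be sketched in NumPy as follows. This is an illustration under assumptions, not the library's implementation: `fold_counts` and the toy `counts` array are hypothetical, and the ploidy axis is assumed to be the last one.

```python
import numpy as np


def fold_counts(n_variants: np.ndarray) -> np.ndarray:
    """Naive sum over the trailing ploidy axis, kept as a length-1 axis.

    Passing dtype explicitly matters: NumPy's default sum accumulates
    int32 input in the platform integer, which would break an int32 contract.
    """
    return n_variants.sum(axis=-1, keepdims=True, dtype=n_variants.dtype)


# Toy per-haplotype counts with shape (n_rows, ploidy=2).
counts = np.array([[2, 1], [0, 3]], dtype=np.int32)
folded = fold_counts(counts)
```

Because the sum is naive, an ALT called on both haplotypes is counted twice, matching the "no dedup" behaviour stated in the summary.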

Scoping note: per the plan, this implements only the with_settings entry point, not Dataset.open(unphased_union=...). Dataset.open promotes to the "haplotypes" default, which the guard rejects, so an open-time flag would be a footgun. Parity with open can be a follow-up.
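
To make the ordering issue concrete, here is a toy model of a guard that is checked from both entry points. It is a sketch only: `ToyView`, `PHASED_KINDS`, and `rejected` are hypothetical stand-ins, and the real Dataset class may structure its checks differently.

```python
from dataclasses import dataclass, replace

# Output kinds that require phased haplotypes (taken from the PR text).
PHASED_KINDS = {"haplotypes", "annotated"}


@dataclass(frozen=True)
class ToyView:
    """Hypothetical stand-in for a dataset view; not the real Dataset."""

    seq_kind: str = "variants"
    unphased_union: bool = False

    def _check_valid_state(self) -> None:
        if self.unphased_union and self.seq_kind in PHASED_KINDS:
            raise ValueError(
                f"{self.seq_kind!r} output is phased and cannot be combined "
                "with unphased_union=True"
            )

    def with_settings(self, unphased_union: bool) -> "ToyView":
        new = replace(self, unphased_union=unphased_union)
        new._check_valid_state()  # catches: phased output first, flag second
        return new

    def with_seqs(self, kind: str) -> "ToyView":
        new = replace(self, seq_kind=kind)
        new._check_valid_state()  # catches: flag first, phased output second
        return new


def rejected(build) -> bool:
    """Return True if building the view raises the guard's ValueError."""
    try:
        build()
    except ValueError:
        return True
    return False


flag_then_phased = rejected(
    lambda: ToyView().with_settings(unphased_union=True).with_seqs("haplotypes")
)
phased_then_flag = rejected(
    lambda: ToyView(seq_kind="haplotypes").with_settings(unphased_union=True)
)
```

The second case mirrors the footgun in the scoping note: a view that already defaults to "haplotypes" fails as soon as the flag is set, so the flag would have to be paired with a non-phased output choice at the same time.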

🤖 Generated with Claude Code

@d-laub d-laub force-pushed the worktree-unphased-union-ploidy1 branch from 14af992 to 39d262d Compare June 14, 2026 02:08
@d-laub d-laub merged commit b221ef9 into main Jun 14, 2026
7 checks passed
@d-laub d-laub deleted the worktree-unphased-union-ploidy1 branch June 14, 2026 02:20