feat(dataset): ploidy-1 unphased union view (#222) #224
Merged
…ion (#222) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
14af992 to 39d262d
Summary
Adds an opt-in `unphased_union` read-time view that folds a diploid dataset's two stored haplotypes onto a single haploid sequence (the union of called ALTs per `(region, sample)`) for haploid somatic modeling (issue #222).

- `Dataset.with_settings(unphased_union=True)` on the `Haps` reconstructor. Stored genotypes stay diploid on disk; this is purely a read-time view.
- `ds.ploidy == 1`, `n_variants(...).shape[-1] == 1` (naive sum across haplotypes, int32-preserved, no dedup), and `"variants"`/`"variant-windows"` decode at ploidy 1.
- `get_variants_flat`: a pure offset re-grouping (`row_offsets[::ploidy]`) with no sort and no data movement, which is safe because the downstream consumer is permutation-invariant. A content-level test confirms the union row is exactly hap-0's calls then hap-1's, with nothing dropped or duplicated.
- `"haplotypes"`/`"annotated"` output is rejected under the flag (guarded in both `with_seqs` and `_check_valid_state`, covering both orderings).
- `infer_germline_ccfs_` + `_infer_germline_ccfs`: unused for ~1 year and the only consumer that assumed start-ordering.
- `genvarloader` skill (SKILL.md).

Test Plan
- `tests/dataset/test_unphased_union.py` (15 tests): flag plumbing + genotypes guard, `ploidy == 1`, `n_variants` fold + int32 contract, both phased-output guard orderings, variant-windows ploidy-axis collapse, union count == sum over haplotypes, union content == hap-0∥hap-1, ragged-variants path, toggle-off restores diploid, AF-filter composition.
- `test_flat_flanks.py`, `test_flat_variants_type.py`, `test_flat_mode_equivalence.py`, `test_rag_variants.py` pass (the `eff_ploidy == ploidy` off-path is unchanged).
- `pytest tests/dataset tests/unit`: 528 passed, 21 skipped, 2 xfailed.
- `ruff check python/` clean; `pyrefly` typecheck 0 errors.

Acceptance criteria from the spec (all covered by passing tests): `n_variants(...).shape[-1] == 1`, variant-windows P=1 layout, union count == sum over haplotypes, `haplotypes`/`annotated` raise under the flag, `ds.ploidy == 1`.

Scoping note: this implements only the `with_settings` entry point (not `Dataset.open(unphased_union=...)`), per the plan. `open` promotes to the `"haplotypes"` default, which the guard rejects, making it a footgun. Open-parity can be a follow-up.

🤖 Generated with Claude Code
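
For reviewers who want to see why the offset re-grouping needs no sort and no data movement, here is a minimal NumPy sketch of the idea on a toy ragged layout. It is not the code in this PR: the helper `union_offsets` and the toy arrays `flat` and `row_offsets` are hypothetical, and the layout (one ragged row per haplotype, haplotypes of the same `(region, sample)` stored consecutively) is an assumption. Only the `row_offsets[::ploidy]` slice and the naive int32 sum across haplotypes are taken from the description.

```python
import numpy as np


def union_offsets(row_offsets: np.ndarray, ploidy: int) -> np.ndarray:
    """Merge each run of `ploidy` consecutive ragged rows into one row.

    `row_offsets` has length n_rows + 1. Keeping every `ploidy`-th offset
    drops the boundaries between haplotypes of the same (region, sample),
    so the flat data array is reused untouched.
    """
    n_rows = len(row_offsets) - 1
    if n_rows % ploidy != 0:
        raise ValueError("number of rows must be a multiple of ploidy")
    return row_offsets[::ploidy]


# Toy data (hypothetical): 2 (region, sample) pairs, ploidy 2 -> 4 ragged rows.
# Per-haplotype variant indices: [0, 2], [1], [], [3, 4, 5]
ploidy = 2
flat = np.array([0, 2, 1, 3, 4, 5], dtype=np.int32)
row_offsets = np.array([0, 2, 3, 3, 6], dtype=np.int32)

union = union_offsets(row_offsets, ploidy)

# Each union row is hap-0's calls followed by hap-1's, in stored order.
union_rows = [flat[s:e].tolist() for s, e in zip(union[:-1], union[1:])]

# Per-haplotype counts folded onto a ploidy axis of length 1.
# dtype is passed explicitly because NumPy would otherwise upcast the sum.
per_hap = np.diff(row_offsets).reshape(-1, ploidy)
folded = per_hap.sum(axis=-1, keepdims=True, dtype=np.int32)

print(union.tolist())   # [0, 3, 6]
print(union_rows)       # [[0, 2, 1], [3, 4, 5]]
print(folded.tolist())  # [[3], [3]]
```

The first union row comes out as `[0, 2, 1]`, which is not start-ordered. That is the trade-off the description names: the view is only valid for consumers that are permutation-invariant over a row's variants.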