Feature: unphased / ploidy-1 union variant-windows view for haploid (somatic) modeling
Repo: mcvickerlab/GenVarLoader · gvl: 932400b39 (main) · Date: 2026-06-14
Reported from: genvarformer@main WINDOW Variants driving the gvf-germ-som somatic trans model
Companion to #214 (flat mode, landed) and #221 (fetch overhead).
TL;DR
The flat variant-windows output mode is ploidy-aware: it emits per-(variant, haplotype)
windows over P = genotypes.shape[-2] slots. For somatic data stored as diploid genotypes
(ploidy=2), this breaks haploid modeling: the model wants one effective sequence carrying all
called ALTs, but the windows (and genvarformer's IntraGenicAnn, which auto-detects P from the
variant layout) split them across 2 haplotypes. There is no way to request a ploidy-1 / unphased
union view — ploidy is read straight from the svar genotypes and is not configurable on
Dataset.open / with_settings.
Request: an opt-in unphased ploidy-1 variant-windows view that folds variant occurrences
across haplotypes onto a single haploid sequence (union of called ALTs per (region, sample)),
so n_variants(...).shape[-1] == 1 and the windows/coding-annotation decode at ploidy 1.
Why (concrete)
gvf-germ-som models MMRF somatic mutations as haploid (PLOIDY=1, deliberate: somatic
mutations are not a diploid genotype, they're "present on the tumor genome"). On the OLD stack this
worked because with_seqs("variants") produced an unphased, effectively ploidy-1 variant set.
After migrating to the new WINDOW API, the model forward crashes in genvarformer's
IntraGenicVarGeneEncoder.pre_concat: the coding annotation builds P_data * total_G per-(hap,gene)
slots (P_data=2), while the encoder tiles gene tokens by the model's ctx.ploidy=1 →
shape mismatch: [342,64] vs [684,64]. (Both the OLD and NEW genvarformer encoders effectively
require model_ploidy == data_ploidy.)
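The mismatch above is plain slot arithmetic, and it can be reproduced without either library. The sketch below is illustrative only: `gene_slot_counts` and `check_ploidy_match` are hypothetical helpers, not genvarformer functions, and `TOTAL_G = 342` is inferred from the reported shapes (684 = 2 * 342), not read from the code.

```python
from typing import Tuple

TOTAL_G = 342   # inferred from the reported [342,64] vs [684,64] shapes (assumption)
EMBED_DIM = 64  # second dimension in the reported shapes


def gene_slot_counts(model_ploidy: int, data_ploidy: int, total_genes: int) -> Tuple[int, int]:
    """Return (rows the encoder tiles, rows the coding annotation builds)."""
    return model_ploidy * total_genes, data_ploidy * total_genes


def check_ploidy_match(model_ploidy: int, data_ploidy: int, total_genes: int) -> int:
    """Raise if the model and data disagree on the number of (hap, gene) slots."""
    tiled, built = gene_slot_counts(model_ploidy, data_ploidy, total_genes)
    if tiled != built:
        raise ValueError(
            f"shape mismatch: [{tiled},{EMBED_DIM}] vs [{built},{EMBED_DIM}]"
        )
    return tiled


# Model at ploidy 1, data at ploidy 2: the two sides disagree by a factor of 2.
tiled_rows, built_rows = gene_slot_counts(1, 2, TOTAL_G)

# With a ploidy-1 data view the counts agree and the check passes.
matched_rows = check_ploidy_match(1, 1, TOTAL_G)
```

Any fix therefore has to make the data side report P=1; changing only the model side leads to the PLOIDY=2 stopgap described under Notes.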
The two haplotypes are not redundant — on this dataset somatic ALTs sit overwhelmingly on
hap-1 (5,226 occurrences) vs hap-0 (29) across chr21 — so "just take haplotype 0" loses ~all
variants. A correct haploid view must union the called ALTs across both haplotypes.
ds = gvl.Dataset.open("mmrf.gvl", "hg38.fa")
ds.ploidy # 2 (from svar genotypes.shape[-2]); not configurable
ds.n_variants(r, s).shape[-1] # 2 — want a mode where this is 1 (union)
Proposed API (sketch — exact shape TBD)
# opt in to a haploid union at open or via settings
ds = gvl.Dataset.open("mmrf.gvl", "hg38.fa", ploidy=1) # or:
ds = ds.with_settings(ploidy=1) # or a named flag, e.g. unphased_union=True
# then the existing flat path is ploidy-1:
ds.with_output_format("flat").with_seqs("variant-windows", gvl.VarWindowOpt(...))
# n_variants(...).shape[-1] == 1; windows union ALT occurrences across the stored haplotypes.
Semantics: for each (region, sample), take the union of variant indices across the stored
P haplotypes (dedup by variant id), reconstruct one haploid window set. Phase is discarded
(appropriate for unphased somatic calls). Het and hom both contribute the ALT once.
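The semantics above can be sketched in a few lines of plain Python. This is a minimal illustration, not GenVarLoader's implementation: `union_across_haplotypes` is a hypothetical helper, and it assumes the per-haplotype variant indices for one (region, sample) are already available as integer sequences and that sorting by variant index is an acceptable output order.

```python
from typing import Iterable, List, Sequence


def union_across_haplotypes(hap_variant_idxs: Sequence[Iterable[int]]) -> List[int]:
    """Fold per-haplotype variant indices into one haploid set.

    hap_variant_idxs holds one iterable per stored haplotype, each listing the
    variant indices with a called ALT on that haplotype for a single
    (region, sample). Phase is discarded.
    """
    merged = set()
    for idxs in hap_variant_idxs:
        merged.update(int(i) for i in idxs)
    # A set deduplicates by variant index, so a hom call (same index on both
    # haplotypes) contributes the ALT once, the same as a het call.
    return sorted(merged)


# Toy (region, sample): variants 3 and 7 are hom, variants 1 and 12 are het on hap-1.
hap0 = [3, 7]
hap1 = [1, 3, 7, 12]
haploid = union_across_haplotypes([hap0, hap1])

# Per-haplotype counts vs. the single count a ploidy-1 view would report.
per_hap_counts = [len(hap0), len(hap1)]
union_count = [len(haploid)]
```

In this toy case the per-haplotype counts `[2, 4]` collapse to a single count of 4, which is the last-axis length of 1 that the acceptance criteria ask `n_variants(...)` to have.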
Acceptance criteria
- With the ploidy-1 view, n_variants(...).shape[-1] == 1 and variant-windows decode emits one
  window set per (region, sample) = the union of called ALTs.
- genvarformer's IntraGenicAnn detects P=1, so IntraGenicVarGeneEncoder runs with a model
  ploidy=1 (no ctx.ploidy vs P_data mismatch).
- Byte-equivalent to the OLD with_seqs("variants") unphased set re-expressed as windows, on a
  representative index set (we have a guardrail and can help validate).
Notes / non-blockers
- The old "ploidy=1 corrupted the heap" comment in genvarformer was an unrelated nb.prange
  write-race in a cost-model kernel (_add_allele_lens, since deleted; fixed in gvf 2f2fa0d).
  It is not an obstacle to a ploidy-1 reconstruction path.
- Stopgap on the consumer side: set model PLOIDY=2 and average the 2 haplotype embeddings. This
  works, but dilutes the somatic-haplotype signal ~50% against the (mostly-reference) other
  haplotype.
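The ~50% dilution in the stopgap follows directly from mean pooling. The sketch below uses made-up 3-dimensional embeddings and a hypothetical `mean_embedding` helper. It assumes one haplotype is pure reference and the other carries all somatic ALTs, which roughly matches the hap-0 vs hap-1 census above.

```python
from typing import List, Sequence


def mean_embedding(hap_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-haplotype embeddings element-wise."""
    n = len(hap_embeddings)
    return [sum(vals) / n for vals in zip(*hap_embeddings)]


# Illustrative values only: hap-0 is reference, hap-1 carries the somatic ALTs.
ref_emb = [0.0, 1.0, -2.0]
alt_emb = [0.5, 3.0, -1.0]

pooled = mean_embedding([ref_emb, alt_emb])

# Signal = deviation from the reference embedding.
signal_haploid = [a - r for a, r in zip(alt_emb, ref_emb)]
signal_pooled = [p - r for p, r in zip(pooled, ref_emb)]
```

Here `signal_pooled` is exactly half of `signal_haploid` in every dimension, because the mean of (ref, alt) sits midway between them. A ploidy-1 union view avoids the pooling step entirely.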
Repro / evidence
gvf-germ-som: scripts/investigate_ref_fetch.py (per-pair variant census shows ~98% empty,
somatic ALTs on hap-1), and the model forward crash under PLOIDY=1 (see
docs/bug-reports/2026-06-13-trans-dataload-reprofile-post-flat-migration.md).