Feature: unphased / ploidy-1 union variant-windows view for haploid (somatic) modeling #222

@d-laub

Repo: mcvickerlab/GenVarLoader · gvl: 932400b39 (main) · Date: 2026-06-14
Reported from: genvarformer@main WINDOW Variants driving the gvf-germ-som somatic trans model
Companion to #214 (flat mode, landed) and #221 (fetch overhead).

TL;DR

The flat variant-windows output mode is ploidy-aware: it emits per-(variant, haplotype)
windows over P = genotypes.shape[-2] slots. For somatic data stored as diploid genotypes
(ploidy=2), this breaks haploid modeling: the model wants one effective sequence carrying all
called ALTs, but the windows (and genvarformer's IntraGenicAnn, which auto-detects P from the
variant layout) split them across 2 haplotypes. There is no way to request a ploidy-1 / unphased
union view: ploidy is read straight from the svar genotypes and is not configurable on
Dataset.open / with_settings.

Request: an opt-in unphased ploidy-1 variant-windows view that folds variant occurrences
across haplotypes onto a single haploid sequence (union of called ALTs per (region, sample)),
so n_variants(...).shape[-1] == 1 and the windows/coding-annotation decode at ploidy 1.

Why (concrete)

gvf-germ-som models MMRF somatic mutations as haploid (PLOIDY=1, deliberate: somatic
mutations are not a diploid genotype, they're "present on the tumor genome"). On the OLD stack this
worked because with_seqs("variants") produced an unphased, effectively ploidy-1 variant set.

After migrating to the new WINDOW API, the model forward crashes in genvarformer's
IntraGenicVarGeneEncoder.pre_concat: the coding annotation builds P_data * total_G per-(hap,gene)
slots (P_data=2), while the encoder tiles gene tokens by the model's ctx.ploidy=1, giving a
shape mismatch: [342,64] vs [684,64]. (Both the OLD and NEW genvarformer encoders effectively
require model_ploidy == data_ploidy.)
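
The row counts in that mismatch can be reproduced with simple arithmetic. The sketch below is illustrative only and is not genvarformer code: `slot_rows` is a hypothetical helper, and `TOTAL_G = 342` is an assumption inferred from the reported shapes, with 64 taken as the embedding width.

```python
def slot_rows(ploidy: int, total_genes: int) -> int:
    """Number of per-(haplotype, gene) rows for a given ploidy."""
    return ploidy * total_genes


TOTAL_G = 342  # assumption: inferred from the reported [342, 64] shape
EMBED_DIM = 64

# Encoder side: gene tokens tiled by the model's ctx.ploidy = 1.
model_shape = (slot_rows(1, TOTAL_G), EMBED_DIM)
# Annotation side: slots built from the data ploidy P_data = 2.
data_shape = (slot_rows(2, TOTAL_G), EMBED_DIM)

print(model_shape, data_shape)  # (342, 64) (684, 64)
assert model_shape != data_shape  # the two sides cannot be concatenated row-wise
```

With a ploidy-1 data view both sides would produce 342 rows, which is the condition the encoders effectively require.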

The two haplotypes are not redundant: on this dataset somatic ALTs sit overwhelmingly on
hap-1 (5,226 occurrences) vs hap-0 (29) across chr21, so "just take haplotype 0" would drop
5,226 of 5,255 occurrences (over 99%). A correct haploid view must union the called ALTs across
both haplotypes.

import genvarloader as gvl

ds = gvl.Dataset.open("mmrf.gvl", "hg38.fa")
ds.ploidy                       # 2 (from svar genotypes.shape[-2]); not configurable
ds.n_variants(r, s).shape[-1]   # 2; want a mode where this is 1 (union). r, s select region and sample

Proposed API (sketch — exact shape TBD)

# opt in to a haploid union at open or via settings
ds = gvl.Dataset.open("mmrf.gvl", "hg38.fa", ploidy=1)          # or:
ds = ds.with_settings(ploidy=1)                                  # or a named flag, e.g. unphased_union=True
# then the existing flat path is ploidy-1:
ds.with_output_format("flat").with_seqs("variant-windows", gvl.VarWindowOpt(...))
# n_variants(...).shape[-1] == 1; windows union ALT occurrences across the stored haplotypes.

Semantics: for each (region, sample), take the union of variant indices across the stored
P haplotypes (dedup by variant id), reconstruct one haploid window set. Phase is discarded
(appropriate for unphased somatic calls). Het and hom both contribute the ALT once.
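
The union semantics can be sketched in plain Python. This is a minimal illustration, not GenVarLoader code: `union_haplotypes` and the list-of-lists input are hypothetical stand-ins for the per-haplotype variant indices of a single (region, sample), and sorting by index assumes that variant indices follow genomic order.

```python
def union_haplotypes(hap_variant_idxs):
    """Fold per-haplotype variant indices into one haploid set.

    hap_variant_idxs: list of P lists, one per stored haplotype, each holding
    the variant indices called ALT on that haplotype (hypothetical layout).
    Returns a list holding a single haplotype: sorted, de-duplicated indices.
    """
    merged = set()
    for idxs in hap_variant_idxs:
        merged.update(idxs)  # phase is discarded here
    # Assumption: ascending variant index corresponds to positional order.
    return [sorted(merged)]


# Variant 3 is hom (both haplotypes), 12 is het on hap-0, 7 is het on hap-1.
diploid = [[3, 12], [3, 7]]
haploid = union_haplotypes(diploid)

assert len(haploid) == 1          # ploidy-1 view
assert haploid[0] == [3, 7, 12]   # the hom variant contributes its ALT once
```

A real implementation would operate on the stored ragged genotype arrays rather than Python lists, but the de-duplication rule is the same.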

Acceptance criteria

  1. With the ploidy-1 view, n_variants(...).shape[-1] == 1 and variant-windows decode emits one
    window set per (region, sample), equal to the union of called ALTs.
  2. genvarformer's IntraGenicAnn detects P=1, so IntraGenicVarGeneEncoder runs with a model
    ploidy=1 (no ctx.ploidy vs P_data mismatch).
  3. Byte-equivalent to the OLD with_seqs("variants") unphased set re-expressed as windows, on a
    representative index set (we have a guardrail and can help validate).

Notes / non-blockers

  • The old "ploidy=1 corrupted the heap" comment in genvarformer was an unrelated nb.prange
    write-race in a cost-model kernel (_add_allele_lens, since deleted; fixed in gvf 2f2fa0d).
    It is not an obstacle to a ploidy-1 reconstruction path.
  • Stopgap on the consumer side: set model PLOIDY=2 and average the 2 haplotype embeddings — works,
    but dilutes the somatic-haplotype signal ~50% against the (mostly-reference) other haplotype.
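
The ~50% dilution in the stopgap follows directly from averaging two haplotype embeddings when one of them is essentially reference. The toy example below uses made-up 2-dimensional vectors and hypothetical helper names; it only illustrates the arithmetic and says nothing about the real embeddings.

```python
def mean_vectors(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def offset(vec, base):
    """Element-wise difference vec - base."""
    return [v - b for v, b in zip(vec, base)]


ref_hap = [1.0, 1.0]    # toy embedding of a (mostly) reference haplotype
som_hap = [3.0, -1.0]   # toy embedding of the haplotype carrying the somatic ALTs

averaged = mean_vectors([ref_hap, som_hap])

full_signal = offset(som_hap, ref_hap)      # deviation from reference at ploidy 1
diluted_signal = offset(averaged, ref_hap)  # deviation after averaging 2 haplotypes

# Averaging halves the deviation from the reference embedding.
assert diluted_signal == [0.5 * x for x in full_signal]
```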

Repro / evidence

gvf-germ-som: scripts/investigate_ref_fetch.py (per-pair variant census shows ~98% empty,
somatic ALTs on hap-1), and the model forward crash under PLOIDY=1 (see
docs/bug-reports/2026-06-13-trans-dataload-reprofile-post-flat-migration.md).
