perf: eliminate ref-fetch parallel overhead + fuse variant-window fetches (#221) #223
Merged
…cate (#221) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…bytes (#221) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A non-integer `GVL_NUM_THREADS` override (e.g. "auto") previously raised ValueError during `import genvarloader`. Fall back to cgroup detection instead.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
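A minimal sketch of the fallback described in this commit, not the actual contents of `_threads.py`. The helper names (`parse_override`, `parse_cpu_max`, `detect_cores`, `resolve_num_threads`) are hypothetical, and reading the cgroup v2 `cpu.max` file is only one assumed way to obtain a cgroup-aware core count.

```python
import os
from typing import Optional


def parse_override(raw: Optional[str]) -> Optional[int]:
    """Return a positive integer override, or None if unset or unusable."""
    if raw is None:
        return None
    try:
        value = int(raw.strip())
    except ValueError:
        # e.g. "auto": ignore the override instead of failing at import time
        return None
    return value if value > 0 else None


def parse_cpu_max(text: str) -> Optional[int]:
    """Parse cgroup v2 cpu.max contents: "<quota> <period>" or "max <period>"."""
    fields = text.split()
    if len(fields) != 2 or fields[0] == "max":
        return None  # no CPU quota configured
    try:
        quota, period = int(fields[0]), int(fields[1])
    except ValueError:
        return None
    if quota <= 0 or period <= 0:
        return None
    # Ceiling division, so a fractional allocation still gets a whole worker.
    return max(1, -(-quota // period))


def detect_cores(cpu_max_path: str = "/sys/fs/cgroup/cpu.max") -> int:
    """Cgroup-aware core count, falling back to the host CPU count."""
    try:
        with open(cpu_max_path) as fh:
            limit = parse_cpu_max(fh.read())
    except OSError:
        limit = None
    host = os.cpu_count() or 1
    return min(limit, host) if limit is not None else host


def resolve_num_threads(env=os.environ) -> int:
    """Use a valid GVL_NUM_THREADS override, else the detected core count."""
    override = parse_override(env.get("GVL_NUM_THREADS"))
    return override if override is not None else detect_cores()
```

Under these assumptions, the resolved value would be computed once at import and then used to cap Numba's worker count, for example through `numba.set_num_threads`.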
Summary
Eliminates two sources of overhead in the `variant-windows` reference read path so `Reference.fetch` scales with bytes copied instead of dominating the decode (issue #221):

- `numba.get_num_threads()` reports host logical CPUs, not the cgroup allocation (e.g. 208 reported vs. 52 allocated), so `parallel=True` regions paid a flat ~37 ms fork-join for trivial work. New `_threads.py` caps the worker count to the cgroup-aware core count once at import (overridable via `GVL_NUM_THREADS`).
- `_fetch_impl` in `Reference.fetch`, and `get_reference`, now route to a serial njit below a per-thread byte threshold and a parallel njit above it, via `should_parallelize(total_bytes)`. Both kernels share an `inline="always"` row body, so serial and parallel are byte-identical by construction.
- `variant-windows` flank builders in `_flat_flanks.py` now do a single `[start−L, end+L)` read and slice `f5`/`f3` internally. The both-window decode is routed through the fused `compute_windows` (1 fetch instead of 2). Public signatures are unchanged, so the existing oracle tests act as byte-identity guards.

Test Plan
- `pixi run -e dev pytest tests/dataset/ tests/unit/dataset/ tests/unit/test_threads.py` → 294 passed, 4 skipped, 2 xfailed
- Tests of `_fetch_impl_ser`/`_par` and `_get_reference_ser`/`_par` confirm serial ≡ parallel, incl. OOB-left/right regions
- `_oracle_*` / split-equivalence tests confirm the fused fetch is byte-identical to the old separate-fetch path
- `test_variant_windows_single_fetch_per_decode` pins the both-window decode to exactly 1 `Reference.fetch` call (down from 3)
- `ruff check python/` clean; `pyrefly` 0 errors
- `gvf-germ-som` per #221 acceptance criteria (issue title: "Perf: Reference.fetch dominated by numba parallel=True fork-join overhead in variant-windows path (~37ms/call for tiny windows); also 3 redundant fetches/decode")

Closes #221.
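The serial ≡ parallel property checked above can be illustrated with a small stand-alone sketch. This is not the PR's Numba code: a thread pool stands in for the parallel njit kernel, and `BYTES_PER_THREAD`, `N_THREADS`, the `N` pad byte, and every function except `should_parallelize` are assumptions made for illustration. The point it shows is that when both paths call the same row body, their outputs cannot diverge.

```python
from concurrent.futures import ThreadPoolExecutor

PAD = ord("N")                 # assumed fill byte for out-of-bounds positions
BYTES_PER_THREAD = 64 * 1024   # hypothetical per-thread threshold
N_THREADS = 4                  # hypothetical capped worker count


def should_parallelize(total_bytes: int) -> bool:
    """Go parallel only when every worker would get a meaningful share."""
    return N_THREADS > 1 and total_bytes >= BYTES_PER_THREAD * N_THREADS


def copy_row(ref: bytes, start: int, end: int, out: bytearray, offset: int) -> None:
    """Shared row body: copy ref[start:end] into out, padding outside ref."""
    for i, pos in enumerate(range(start, end)):
        out[offset + i] = ref[pos] if 0 <= pos < len(ref) else PAD


def _layout(regions):
    """Output offset of each region plus the total output size."""
    offsets, total = [], 0
    for start, end in regions:
        offsets.append(total)
        total += end - start
    return offsets, total


def fetch_serial(ref: bytes, regions) -> bytes:
    offsets, total = _layout(regions)
    out = bytearray(total)
    for (start, end), off in zip(regions, offsets):
        copy_row(ref, start, end, out, off)
    return bytes(out)


def fetch_parallel(ref: bytes, regions) -> bytes:
    offsets, total = _layout(regions)
    out = bytearray(total)
    # Each task writes a disjoint slice of `out`, so no locking is needed.
    with ThreadPoolExecutor(max_workers=N_THREADS) as pool:
        futures = [
            pool.submit(copy_row, ref, start, end, out, off)
            for (start, end), off in zip(regions, offsets)
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return bytes(out)


def fetch(ref: bytes, regions) -> bytes:
    """Dispatch on total output size, as the summary describes."""
    total = sum(end - start for start, end in regions)
    if should_parallelize(total):
        return fetch_parallel(ref, regions)
    return fetch_serial(ref, regions)
```

With a 10-byte reference and regions that run off both ends, `fetch_serial` and `fetch_parallel` return the same padded bytes, and a request this small takes the serial path.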
🤖 Generated with Claude Code
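For readers unfamiliar with the flank fusion, a toy sketch of the idea follows. It assumes `f5` is the `L` bases immediately upstream of `start` and `f3` the `L` bases immediately downstream of `end`. `CountingReference`, `flanks_separate`, and `flanks_fused` are hypothetical stand-ins, not the builders in `_flat_flanks.py`.

```python
class CountingReference:
    """Toy in-memory reference that counts fetch calls (hypothetical)."""

    def __init__(self, seq: bytes, pad: int = ord("N")):
        self.seq = seq
        self.pad = pad      # assumed fill byte for out-of-bounds positions
        self.calls = 0

    def fetch(self, start: int, end: int) -> bytes:
        """Return the half-open range [start, end), padding outside seq."""
        self.calls += 1
        n = len(self.seq)
        return bytes(
            self.seq[pos] if 0 <= pos < n else self.pad
            for pos in range(start, end)
        )


def flanks_separate(ref, start: int, end: int, L: int):
    """Old shape: one fetch per flank."""
    f5 = ref.fetch(start - L, start)
    f3 = ref.fetch(end, end + L)
    return f5, f3


def flanks_fused(ref, start: int, end: int, L: int):
    """New shape: one [start-L, end+L) read, then slice both flanks out."""
    window = ref.fetch(start - L, end + L)
    return window[:L], window[len(window) - L:]
```

The fused read also copies the interior `[start, end)` bytes, so under these assumptions the trade is a slightly larger copy in exchange for fewer calls, which pays off when per-call overhead dominates.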