Nvbench compare process bulk data#386
Draft
oleksandr-pavlyk wants to merge 68 commits into
📝 Walkthrough (summary by CodeRabbit)
Adds a compare tool reference page, centralizes C++ timeout-warning logging and statistics helpers, and updates Python scripts and wheel packaging to use lazy optional dependencies.
C++ statistics and timeout-warning refactor
Python tools packaging and lazy dependency loading
nvbench-compare documentation
Teach nvbench_compare to parse GPU timing summaries into structured values and prefer the robust median/IQR summaries when both compared measurements provide them. Fall back to the existing mean/stdev summaries when robust summaries are not available. Classify comparisons with the larger available relative noise estimate instead of the smaller one, keep unavailable noise distinct from encoded infinite noise, and report improvements separately from regressions. Keep the process exit code as success for completed comparisons; regression counts are reported in the summary instead of being used as the process status. Make plotting tolerate unavailable noise by leaving gaps in confidence bands, sort plotted series by the plotted axis, and avoid reusing pyplot state across plot calls. Add focused Python tests for robust-summary preference, unavailable-noise classification, non-finite timing centers, plot-along handling when the selected axis is absent, and the exit-code contract.
Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.
Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.
Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.
Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.
Original commit messages folded into this change:
Tweaks for nvbench_compare
1. When JSON files contain multiple entries with the same name and axis values,
make sure that the script compares corresponding entries.
Previous logic would extract the first entry from ref data, and would compare
measurements for each state in cmp against the first entry from ref. The
change introduces a counter to know which nth entry we process for a
particular axis value, and retrieve corresponding entry in ref.
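The counter-based matching described above can be sketched as follows. This is a minimal illustration, not the code in nvbench_compare.py: the function name `pair_by_occurrence` and the dictionary layout of a state are assumptions made for the example.

```python
from collections import defaultdict


def pair_by_occurrence(ref_states, cmp_states):
    """Pair states that share a name by their occurrence index (hypothetical helper)."""
    ref_by_name = defaultdict(list)
    for state in ref_states:
        ref_by_name[state["name"]].append(state)

    seen = defaultdict(int)  # how many times each name has appeared in cmp so far
    pairs, unmatched = [], []
    for state in cmp_states:
        name = state["name"]
        nth = seen[name]
        seen[name] += 1
        candidates = ref_by_name.get(name, [])
        if nth < len(candidates):
            # The nth cmp entry is compared against the nth ref entry.
            pairs.append((candidates[nth], state))
        else:
            # No corresponding ref entry: report it instead of dropping it.
            unmatched.append(state)
    return pairs, unmatched


ref = [{"name": "A=1", "t": 1.0}, {"name": "A=1", "t": 2.0}]
cmp_ = [{"name": "A=1", "t": 1.1}, {"name": "A=1", "t": 2.1}, {"name": "A=1", "t": 3.1}]
pairs, unmatched = pair_by_occurrence(ref, cmp_)
```

With the old first-entry logic, both compare entries would have been measured against the `t=1.0` reference; here the third compare entry has no partner and is surfaced as unmatched.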
Scope occurrence matching by device.
Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, so reused IDs in a different order no longer pair against the
wrong reference device.
Require devices in ref and cmp to have the same cardinality
Handle mismatch when the number of duplicates in ref data is not the same as in cmp data
Use pytest monkeypatch fixture to pretend third-party package dependencies are
available during test run for nvbench_compare without introducing test-time
dependency
Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.
Fix to filter-before-matching.
- compare_benches() now pairs devices by selected position instead of taking a
device id.
- For each device pair, compare_benches() now builds:
- ref_device_states: matching reference device and axis filters
- cmp_device_states: matching compare device and axis filters
- State occurrence counts and duplicate occurrence matching now operate only
on those filtered per-device lists.
- Removed the later matches_axis_filters() skip inside the compare-state loop
because filtering now happens before matching.
Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.
Introduce device filtering in nvbench_compare
- --reference-devices all|ID|ID,ID,...
- --compare-devices all|ID|ID,ID,...
- Integer lists preserve order and duplicates.
- Requested IDs are validated against the file-level device list.
- Filtered reference/compare device counts must match before comparison.
- compare_benches() pairs selected reference and compare devices by position.
- Each benchmark validates that requested device IDs are present in its own
devices list.
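A rough sketch of the `all|ID|ID,ID,...` filter syntax listed above. The helper names `parse_device_filter` and `validate_device_ids` are hypothetical, and the exact error handling in the real script may differ.

```python
def parse_device_filter(text):
    """Return None for 'all', otherwise a list of ids with order and duplicates kept."""
    text = text.strip()
    if text == "all":
        return None
    ids = []
    for token in text.split(","):
        token = token.strip()
        # Reject empty tokens, signs, and anything that is not a plain ASCII integer.
        if not (token.isascii() and token.isdigit()):
            raise ValueError(f"invalid device id: {token!r}")
        ids.append(int(token))
    return ids


def validate_device_ids(requested, available):
    """Check requested ids against the device ids present in a file or benchmark."""
    missing = [d for d in requested if d not in available]
    if missing:
        raise ValueError(f"unknown device ids: {missing}")
    return requested


reference_ids = validate_device_ids(parse_device_filter("1,0,1"), available=[0, 1])
```

Keeping duplicates and order is what allows the selected reference and compare devices to be paired by position afterwards.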
Implemented benchmark-scoped --axis handling.
- --axis and --benchmark now share an ordered argparse action, so their
relative CLI order is preserved.
- -a before any -b becomes a global axis filter.
- -a after -b <name> applies to that most recent benchmark only.
- Repeated -b entries are treated as separate filter scopes and combined as
alternatives for that benchmark.
- Device filtering remains global and is applied independently.
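The ordering rules above rely on a custom argparse action that records options in the order they were given. A minimal sketch, assuming a shared `ordered_filters` attribute and a `build_filter_plan` helper (both names are made up for this example):

```python
import argparse


class OrderedFilterAction(argparse.Action):
    """Append (dest, value) to one shared list so relative CLI order survives parsing."""

    def __call__(self, parser, namespace, values, option_string=None):
        ordered = getattr(namespace, "ordered_filters", None)
        if ordered is None:
            ordered = []
            namespace.ordered_filters = ordered
        ordered.append((self.dest, values))


def build_filter_plan(ordered):
    """Split ordered options into global axis filters and per-benchmark scopes."""
    global_axes, scopes = [], []
    for kind, value in ordered:
        if kind == "benchmark":
            scopes.append((value, []))       # each -b opens a new scope
        elif scopes:
            scopes[-1][1].append(value)      # -a after -b: most recent benchmark only
        else:
            global_axes.append(value)        # -a before any -b: global filter
    return global_axes, scopes


parser = argparse.ArgumentParser()
parser.add_argument("-b", "--benchmark", dest="benchmark", action=OrderedFilterAction)
parser.add_argument("-a", "--axis", dest="axis", action=OrderedFilterAction)
args = parser.parse_args(["-a", "A=2", "-b", "bench1", "-a", "B=4", "-b", "bench2"])
plan = build_filter_plan(args.ordered_filters)
```

A plain `action="append"` would keep two independent lists and lose the information about which `-a` followed which `-b`.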
Allow non-matching devices for explicit device selection
Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.
Fix for resolve_benchmark_device_ids, add comments
The return value of resolve_benchmark_device_ids now always owns its list.
Use monkeypatch class in set_test_devices helper
Stricter device id validation
Test for device id validation
Introduce GpuTimingData, SummaryComparison, ComparisonStats, and ComparisonRunData to make timing extraction, classification, and run-level state explicit. Load sample-time and SM-frequency bulk data from JSON binary output into GpuTimingData when available, preserving count validation between paired sample and frequency arrays. Move GPU timing comparison logic into compare_gpu_timings(), prefer robust median/IQR data when available, and fall back to mean/stdev summaries otherwise. Keep missing or invalid noise on the unknown path. Replace module-level comparison counters and selected-device globals with per-run data passed into compare_benches(). Update tests to validate timing classification, bulk-data loading, device pairing, filtered duplicate matching, and summary counters through the new structures.
It is not emitted yet, but the code is ready for it once it starts being emitted.
Store JSON-bin sample time and frequency metadata in GpuTimingData instead of reading the binary files during summary extraction. Add Float32BinarySource and lazy cached accessors for samples and frequencies. Use np.fromfile by default, but allow tests and alternate callers to inject a float32 reader returning any buffer-compatible object convertible to the "<f4" data type. Treat optional bulk-data failures as unavailable evidence instead of aborting comparison: unreadable files, invalid buffers, count mismatches, and mismatched sample/frequency metadata now emit RuntimeWarning and return None. Update nvbench_compare tests to verify lazy loading, cache reuse, injected reader behavior, warning-based degradation, and count mismatch handling.
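The lazy, cached, warning-on-failure behaviour can be illustrated with a small stand-in class. This is not Float32BinarySource itself: the class name `LazyFloat32Source` and its constructor are assumptions, and the sketch decodes with the standard-library `struct` module rather than `np.fromfile` so it runs without NumPy.

```python
import struct
import warnings


class LazyFloat32Source:
    """Hypothetical stand-in: read little-endian float32 data once, on first access."""

    def __init__(self, path, count, reader):
        self.path = path
        self.count = count
        self._reader = reader      # injected callable returning a buffer-compatible object
        self._loaded = False
        self._values = None

    def values(self):
        if not self._loaded:       # cache both successes and failures
            self._values = self._read()
            self._loaded = True
        return self._values

    def _read(self):
        try:
            raw = memoryview(self._reader(self.path)).cast("B")
        except (OSError, TypeError, ValueError) as exc:
            warnings.warn(f"bulk data unavailable for {self.path}: {exc}", RuntimeWarning)
            return None
        if len(raw) != 4 * self.count:
            warnings.warn(f"sample count mismatch for {self.path}", RuntimeWarning)
            return None
        return struct.unpack(f"<{self.count}f", raw)


calls = []

def fake_reader(path):
    calls.append(path)
    return struct.pack("<3f", 1.0, 2.0, 4.0)

source = LazyFloat32Source("virtual.bin", 3, fake_reader)
first = source.values()
second = source.values()           # served from the cache, reader is not called again
```

Returning None instead of raising lets the caller treat missing bulk data as absent evidence and continue with summary-only comparison.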
Its intent is to be a cheaply retrievable metric of the average SM clock frequency over the entire sample.
The quantile values are not currently used, but plumbed through
Implemented the clear-gap comparison, with the log-distance-equivalent
algebra and pessimistic SM-clock fallback.
What changed:
- Added TimingInterval and interval construction from summaries:
- robust interval: [min, q3], centered at median
- fallback interval: clipped [mean - stdev, mean + stdev] intersected with [min, max]
- Added CLEAR_GAP_RELATIVE_THRESHOLD = 0.005.
- FAST gap uses:
(ref.lower - cmp.upper) / cmp.upper >= delta
which is equivalent to log(ref.lower / cmp.upper) >= log(1 + delta).
- SLOW gap uses:
(cmp.lower - ref.upper) / ref.upper >= delta
- FAST/SLOW now requires SM clock summaries on both sides and the same clear-gap result after scaling intervals by sm_clock_rate_mean.
- If intervals are missing, overlap, fail the gap threshold, have missing/invalid clock summaries, or time/cycle comparison disagrees, status is UNDECIDED.
- Existing center/noise values are still computed and displayed, but no longer drive FAST/SLOW/SAME classification.
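The clear-gap rule above can be sketched as follows. The threshold value and the two gap formulas come from the list; the `Interval` class, the function names, and the way clock summaries are passed in are assumptions made for illustration.

```python
import math
from dataclasses import dataclass

DELTA = 0.005  # mirrors CLEAR_GAP_RELATIVE_THRESHOLD from the change description


@dataclass(frozen=True)
class Interval:
    lower: float
    upper: float


def clear_gap(ref, cmp):
    """Return 'FAST', 'SLOW', or None when the intervals are not clearly separated."""
    if min(ref.lower, cmp.lower) <= 0:
        return None                                   # ratios need positive bounds
    if (ref.lower - cmp.upper) / cmp.upper >= DELTA:  # cmp finishes clearly earlier
        return "FAST"
    if (cmp.lower - ref.upper) / ref.upper >= DELTA:  # cmp finishes clearly later
        return "SLOW"
    return None


def classify(ref_time, cmp_time, ref_clock, cmp_clock):
    """Require the same clear gap in time and in cycles (time * mean SM clock rate)."""
    in_time = clear_gap(ref_time, cmp_time)
    clocks = (ref_clock, cmp_clock)
    if in_time is None or not all(c is not None and math.isfinite(c) and c > 0 for c in clocks):
        return "UNDECIDED"
    ref_cycles = Interval(ref_time.lower * ref_clock, ref_time.upper * ref_clock)
    cmp_cycles = Interval(cmp_time.lower * cmp_clock, cmp_time.upper * cmp_clock)
    return in_time if clear_gap(ref_cycles, cmp_cycles) == in_time else "UNDECIDED"


ref_iv, cmp_iv = Interval(10.0, 10.2), Interval(9.0, 9.5)
same_clock = classify(ref_iv, cmp_iv, 1.5e9, 1.5e9)
no_clock = classify(ref_iv, cmp_iv, None, 1.5e9)
shifted_clock = classify(ref_iv, cmp_iv, 1.0e9, 1.2e9)
```

In the last call the compare run is faster in wall time but slower in cycles, so the time and cycle comparisons disagree and the sketch reports UNDECIDED.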
Updated tests to cover:
- center/noise-only comparisons becoming UNDECIDED
- clear FAST/SLOW with matching clock evidence
- missing clock fallback to UNDECIDED
- frequency-shift disagreement becoming UNDECIDED
- regression reporting with robust interval and clock evidence
If the SLOW/FAST check returns UNDECIDED, we attempt a conservative
SAME check based on summary data alone (bulk data are not read).
Reference and compare measurements are considered SAME if:
- both centers are positive finite values;
- abs(ref - cmp) / min(ref, cmp) <= 0.5%.
This is equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta;
- interval overlap must cover at least 50% of the smaller interval;
- relative dispersion must be finite on both sides and no more than 2%;
- if SM clock summaries are available, the same check must also pass in cycle space.
Otherwise UNDECIDED remains the working decision, to be refined by further checks.
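A compact sketch of the summary-only SAME conditions listed above, in time space only. The thresholds (0.5%, 50%, 2%) are taken from the list; the function names, the tuple representation of intervals, and the treatment of zero-width intervals are assumptions. When SM clock summaries are available, the same function would be called again on cycle-scaled inputs.

```python
import math


def overlap_fraction(a, b):
    """Overlap length divided by the length of the shorter interval."""
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    shorter = min(a[1] - a[0], b[1] - b[0])
    if shorter <= 0:
        # Assumption: a zero-width interval counts as covered if it touches the other one.
        return 1.0 if overlap >= 0 else 0.0
    return max(overlap, 0.0) / shorter


def summary_same(ref_center, cmp_center, ref_iv, cmp_iv, ref_noise, cmp_noise,
                 delta=0.005, min_overlap=0.5, max_noise=0.02):
    centers = (ref_center, cmp_center)
    if not all(math.isfinite(c) and c > 0 for c in centers):
        return False
    # Equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta.
    if abs(ref_center - cmp_center) / min(centers) > delta:
        return False
    if overlap_fraction(ref_iv, cmp_iv) < min_overlap:
        return False
    return all(math.isfinite(n) and 0 <= n <= max_noise for n in (ref_noise, cmp_noise))


close = summary_same(100.0, 100.3, (99.5, 100.5), (99.8, 100.8), 0.01, 0.01)
far = summary_same(100.0, 101.0, (99.5, 100.5), (100.5, 101.5), 0.01, 0.01)
noisy = summary_same(100.0, 100.3, (99.5, 100.5), (99.8, 100.8), 0.05, 0.01)
```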
- Add DecisionReason(code, message) and internal
TimingDecision(status, reason).
- SummaryComparison now carries reason
- ComparisonStats now aggregates undecided reasons.
- Final summary prints a reason breakdown only when
undecided reasons exist, e.g.:
- Undecided (comparison requires more evidence): 3
- Reasons:
- noise_too_high: 2 (relative dispersion is too
high to declare same)
- weak_interval_overlap: 1 (timing intervals do not
overlap strongly enough to declare same)
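The reason breakdown shown above could be produced along these lines. `DecisionReason(code, message)` follows the description; the aggregation helper `format_undecided_summary` and its output layout are illustrative guesses rather than the script's actual code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionReason:
    code: str
    message: str


def format_undecided_summary(reasons):
    """Return summary lines, or an empty list when nothing was undecided."""
    if not reasons:
        return []
    counts = Counter(reason.code for reason in reasons)
    messages = {}
    for reason in reasons:
        messages.setdefault(reason.code, reason.message)  # keep first message per code
    lines = [
        f"- Undecided (comparison requires more evidence): {len(reasons)}",
        "- Reasons:",
    ]
    for code, count in counts.most_common():
        lines.append(f"- {code}: {count} ({messages[code]})")
    return lines


noise = DecisionReason("noise_too_high", "relative dispersion is too high to declare same")
weak = DecisionReason(
    "weak_interval_overlap",
    "timing intervals do not overlap strongly enough to declare same",
)
summary_lines = format_undecided_summary([noise, weak, noise])
```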
Add a bulk-data SAME path to nvbench_compare for cases where summary intervals do not provide a clear FAST/SLOW decision. The new path compares sample times and SM-clock-adjusted cycles with symmetric nearest-neighbor coverage over unique values and sample counts.
The comparison now requires both sample-weight coverage and unique-support coverage to pass before declaring SAME. If bulk data is available but coverage does not pass, the result remains UNDECIDED instead of falling back to the summary-only SAME rule.
Also improve undecided diagnostics by aggregating reason codes while preserving the most severe representative detail, including observed coverage values and thresholds for bulk support mismatches.
Add tests for:
- bulk data confirming SAME despite changed mode weights;
- bulk time mismatch overriding summary-only SAME;
- cycle coverage vetoing time-only agreement;
- sample-weight and unique-support coverage diagnostics;
- aggregation of undecided reason details.
Select robust timing inputs independently for reference and compare data: prefer robust summaries when present, otherwise recompute robust statistics from that side's bulk samples. Fall back to mean/stdev summaries only when both sides cannot provide robust timing inputs. This allows modern JSON data with robust summaries to compare against legacy JSON data that lacks robust summaries but includes bulk sample data, without mixing summary families or unnecessarily falling back to mean/stdev.
Reject boolean and floating-point values for int64 bulk binary sizes instead of silently converting them with int(). Keep integer strings accepted for existing NVBench JSON compatibility, and add regression coverage for valid and malformed size payloads.
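A minimal sketch of the stricter size parsing, assuming a hypothetical helper named `parse_bulk_size`. The key detail is that `bool` is a subclass of `int` in Python, so it has to be rejected before the integer check, and that `int()` is never applied to floats.

```python
import re

_INT_TEXT = re.compile(r"[0-9]+")


def parse_bulk_size(value):
    """Return a non-negative int64 size, or raise ValueError for malformed payloads."""
    if isinstance(value, bool):
        raise ValueError(f"boolean is not a valid size: {value!r}")
    if isinstance(value, int):
        size = value
    elif isinstance(value, str) and _INT_TEXT.fullmatch(value):
        size = int(value)          # integer strings stay accepted for compatibility
    else:
        raise ValueError(f"size must be an integer or integer string, got {value!r}")
    if not 0 <= size <= 2**63 - 1:
        raise ValueError(f"size out of int64 range: {size}")
    return size


sizes = [parse_bulk_size(128), parse_bulk_size("128")]
```

By contrast, a bare `int(value)` would silently turn `True` into 1 and `1.9` into 1.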
Use rounded-rank method, rather than NumPy's quantile
…o occurrence order.
When any --benchmark filter is present, keep comparison limited to the explicitly selected benchmarks. Leading --axis filters are still replayed onto each selected benchmark, matching native NVBench option parsing, but they no longer cause unrelated benchmarks to be compared. E.g., `-a A=2 -b bench1` now compares only bench1, and `-a A=2 -b bench1 -b bench2` applies A=2 to both selected benchmarks. Update tests for global axis filters with benchmark scopes and document the selection behavior.
Also add a test to check that STDOUT also works.
Reject boolean summary float payloads instead of coercing them to 1.0/0.0, while keeping numeric strings accepted for NVBench JSON compatibility. Add regression coverage for generated bulk-debug Python filenames that require escaping, and strengthen the plot-along test to assert log-log axes and confidence-band rendering.
Move Float32BinarySource material-payload detection into the source object. Default file-backed sources still use resolved file size so missing or empty sidecars remain unavailable, but positive-count sources with custom readers are treated as material and proceed through the lazy read path. Add regression coverage for virtual bulk sources whose custom reader provides data without a local sidecar file.
Derive absolute standard deviation from relative stdev and mean when nv/cold/time/gpu/stdev/absolute is absent. This lets older JSON files that only contain mean and relative stdev still construct timing intervals. Also allow mean/stdev intervals to be built without min/max summaries, using min/max only as optional clipping bounds when present. This restores SAME classification for legacy fixture data instead of treating those rows as missing-interval AMBG cases. Update nvbench_compare tests to cover derived stdev handling and the legacy mean/stdev comparison path.
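The derivation can be sketched as below. The function name and parameters are hypothetical, the sketch assumes the relative stdev is stored as a fraction (not a percentage), and the tiny positive lower clamp is an assumption standing in for whatever clamp the script applies.

```python
import math
import sys


def mean_stdev_interval(mean, stdev_abs=None, stdev_rel=None, min_time=None, max_time=None):
    """Build [mean - stdev, mean + stdev], clipped by min/max only when they are present."""
    if stdev_abs is None:
        if stdev_rel is None:
            return None
        stdev_abs = stdev_rel * mean   # assumption: relative stdev is a fraction of the mean
    if not (math.isfinite(mean) and mean > 0 and math.isfinite(stdev_abs) and stdev_abs >= 0):
        return None
    lower, upper = mean - stdev_abs, mean + stdev_abs
    if min_time is not None:
        lower = max(lower, min_time)
    if max_time is not None:
        upper = min(upper, max_time)
    # Keep the lower bound positive so later ratio and log checks stay defined.
    lower = max(lower, sys.float_info.min)
    return lower, upper


legacy = mean_stdev_interval(100.0, stdev_rel=0.02)
clipped = mean_stdev_interval(100.0, stdev_abs=5.0, min_time=97.0, max_time=101.0)
```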
Add a shared nvbench_tooling_deps helper for importing packages required by NVBench console tools. Missing tooling packages now raise a dedicated error with an install recipe instead of failing with a raw ImportError. Update script imports to work both as installed package modules and as direct source-tree scripts by using the __package__ import pattern for nvbench_json and the new tooling helper. Defer nvbench-compare dependencies to the points where they are needed: NumPy/colorama during normal comparison setup, tabulate during table rendering, jsondiff only for device mismatch reporting, and plotting packages only for plot modes. Update tests to initialize compare tooling when calling internals directly and add coverage for the tooling dependency loader. Closes NVIDIA#384
Introduce optional dependency categories in the wheel metadata, with cuda-bench[tools] encompassing all dependencies of nvbench scripts. Closes NVIDIA#393
Move `load_nvbench_compare_tooling()` after:
- config resolution / --dump-config
- filter parsing
- device-filter parsing
- input arity validation
test_nvbench_tooling_deps.py now has a smoke test that builds a temporary package layout matching the wheel mapping, cuda/bench/scripts/nvbench_tooling_deps.py, and imports cuda.bench.scripts.nvbench_tooling_deps. That covers the cuda.bench.scripts.* path without requiring a wheel build/install inside this unit test.
Add a shared nvbench_tooling_deps helper that imports packages required by console tools and reports missing packages with an actionable install recipe. Update NVBench scripts to support both direct source-tree execution and installed package execution through the __package__ import pattern. Defer third-party tooling imports until they are needed, including lazy loading for compare tables, device diffs, and plotting paths. Make loaders resilient to partial initialization failures so a later retry can complete any dependency that failed previously. Update tests to cover direct internal use, packaged cuda.bench.scripts imports, dependency-loader error messages, and cheap CLI validation before tooling dependencies are loaded.
require_tooling_dependency() now only translates ModuleNotFoundError when the missing module is the requested top-level package. Other import failures are re-raised unchanged. This helps in situations where a third-party dependency is installed but broken for whatever reason. Previously we would intercept the error and suggest running pip install cuda-bench[tools], even though that had already been done.
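The narrowed translation can be sketched with `importlib` and the `name` attribute of `ModuleNotFoundError`. The function below is an illustrative stand-in with an assumed signature, and `MissingToolingDependency` is a made-up exception name.

```python
import importlib


class MissingToolingDependency(ImportError):
    """Hypothetical error type carrying an install recipe."""


def require_tooling_dependency_sketch(module_name, install_hint="pip install cuda-bench[tools]"):
    top_level = module_name.partition(".")[0]
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        if exc.name != top_level:
            # The package itself was found, but something it imports is missing or
            # broken: re-raise unchanged so the real cause stays visible.
            raise
        raise MissingToolingDependency(
            f"{top_level} is required for this tool; try: {install_hint}"
        ) from exc


json_module = require_tooling_dependency_sketch("json")
```

Checking `exc.name` is what separates "the requested package is absent" from "the requested package is installed but one of its own imports failed".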
round() in Python rounds halfway cases to the nearest even integer, while std::round in C++ rounds them away from zero, i.e. x -> floor(x + 0.5) for non-negative x.
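The difference matters whenever a quantile rank lands exactly on a half. A small sketch, where the rank formula `q * (n - 1)` is an assumption used only to demonstrate the rounding behaviour and is not taken from the NVBench sources:

```python
import math


def round_half_up(x):
    """Match C++ std::round for non-negative x: floor(x + 0.5)."""
    return math.floor(x + 0.5)


def rounded_rank_quantile(sorted_samples, q):
    """Pick the sample at the rounded rank (hypothetical rank formula q * (n - 1))."""
    rank = q * (len(sorted_samples) - 1)
    return sorted_samples[round_half_up(rank)]


samples = [10, 20, 30, 40, 50, 60]
median_half_up = rounded_rank_quantile(samples, 0.5)   # rank 2.5 rounds up to index 3
median_bankers = samples[round(0.5 * (len(samples) - 1))]  # rank 2.5 rounds to even index 2
```

Using the built-in `round` here would select a different sample than the C++ side for the same data.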
Factor finite-value checks through a common helper so positive and non-negative finite predicates share the same None/finite guard. Add a comment explaining the positive lower-bound clamp used for mean/stdev intervals, since later ratio and log-distance checks require positive bounds. Also quote the --axis pow2 example in CLI help so shell users can copy the example safely.
Implemented robo-review feedback suggesting to expand coverage of lazy loading helper.
Thread the originating JSON path through bulk timing extraction so sidecar filenames can be resolved with knowledge of the JSON basename. This lets nvbench-compare handle --jsonbin output where the filename field may store the benchmark-process-relative path, such as data/result.json-bin/0.bin, instead of a path relative to the JSON file directory. Keep resolution anchored to the JSON location and avoid falling back to the caller working directory. Add regression coverage for this sidecar layout.
This change helps nvbench_compare properly resolve the location of sidecar binary files in the following CCCL-based workflow:
```
./bin/cub.bench.bitonic_sort.warp_keys.base -d 1 --cold-warmup-runs 16 --min-samples 100 --stopping-criterion entropy --min-r2 0.88 --jsonbin ../../../perf_data/warp-bitonic-sort-run1.json
./bin/cub.bench.bitonic_sort.warp_keys.base -d 1 --cold-warmup-runs 16 --min-samples 100 --stopping-criterion entropy --min-r2 0.88 --jsonbin ../../../perf_data/warp-bitonic-sort-run2.json
PYTHONPATH=./_deps/nvbench-src/python/scripts python ./_deps/nvbench-src/python/scripts/nvbench_compare.py ../../../perf_data/warp-bitonic-sort-run1.json ../../../perf_data/warp-bitonic-sort-run2.json
```
Before this change, the script generated numerous RuntimeWarnings alerting that the absolute path cannot be resolved and that bulk data will be treated as unavailable.
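One way to anchor a process-relative sidecar path to the JSON file is sketched below. The resolution strategy and the helper name `resolve_sidecar` are assumptions for illustration; only the `<json name>-bin` directory layout is taken from the example path above.

```python
from pathlib import Path, PurePosixPath


def resolve_sidecar(json_path, stored_filename):
    """Resolve a stored sidecar filename relative to the JSON file, never the cwd."""
    json_path = Path(json_path)
    bin_dir_name = json_path.name + "-bin"          # e.g. result.json -> result.json-bin
    parts = PurePosixPath(stored_filename).parts
    if bin_dir_name in parts:
        # Keep only the tail starting at the last "<json name>-bin" component.
        idx = len(parts) - 1 - parts[::-1].index(bin_dir_name)
        return json_path.parent.joinpath(*parts[idx:])
    stored = Path(stored_filename)
    return stored if stored.is_absolute() else json_path.parent / stored


resolved = resolve_sidecar("/perf_data/result.json", "data/result.json-bin/0.bin")
```

The leading `data/` component reflects where the benchmark process ran, so the sketch discards it and keeps only the part that is stable relative to the JSON file.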
This was referenced Jul 2, 2026
This PR contains only Python changes.
- python/scripts/nvbench_compare.py
- docs/nvbench_compare.md documenting usage of nvbench_compare and its decision logic
- cuda-bench[tools].
- cuda-bench ([Skill]: Report and interpret benchmark results #390).
Closes #320
Closes #393
Closes #384
This PR builds on top of #392 (changes to C++ sources in nvbench/ and testing/ folders) and contains only Python changes. The branch has been rebased after #392 was merged.