
Nvbench compare process bulk data#386

Draft
oleksandr-pavlyk wants to merge 68 commits into
NVIDIA:mainfrom
oleksandr-pavlyk:nvbench-compare-process-bulk-data

@oleksandr-pavlyk oleksandr-pavlyk commented Jun 4, 2026

Collaborator

This PR contains only Python changes.

  • Main change: python/scripts/nvbench_compare.py
    • structured timing data extraction
    • robust timing intervals
    • bulk sample loading
    • device pairing/filtering
    • benchmark-axis scoped filtering
    • comparison thresholds
    • presets/TOML config
    • display modes
    • plotting validation
    • bulk-debug python script output for further bulk data analysis
  • Documentation: docs/nvbench_compare.md describes usage of nvbench_compare and its decision logic
  • All scripts load their third-party dependencies lazily and raise a dedicated exception with an informative message on how to install them (Indicate script dependencies in cuda.bench package metadata #384). Optional dependencies are advertised in wheel metadata as cuda-bench[tools].
  • The location of scripts in Python wheel has changed, but this is transparent to users of installed cuda-bench ([Skill]: Report and interpret benchmark results #390).

Closes #320
Closes #393
Closes #384


This PR builds on top of #392 (changes to C++ sources in nvbench/ and testing/ folders) and contains only Python changes.

The branch has been rebased after #392 was merged.

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the nvbench-compare-process-bulk-data branch from f18d48d to f0db899 Compare June 29, 2026 19:59
@oleksandr-pavlyk

Collaborator Author

@coderabbitai full review

@coderabbitai

This comment was marked as outdated.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features
    • Added a new nvbench-compare documentation page with result classes, matching rules, and TOML/config guidance.
    • Improved Python CLI tools to load plotting/output dependencies lazily and exit with clear errors when tooling is missing.
  • Bug Fixes
    • Enhanced timeout/no-accepted-samples warnings during benchmark summary generation.
    • Improved noise/relative-noise computation and more consistent handling of NaN/invalid/insufficient-sample inputs in statistics (percentiles/quartiles/robust noise).
  • Tests
    • Added/expanded unit tests for timeout warnings, statistics edge cases, and tooling-dependency import behavior.

Walkthrough

Adds a compare tool reference page, centralizes C++ timeout-warning logging and statistics helpers, and updates Python scripts and wheel packaging to use lazy optional dependencies under cuda.bench.scripts.

C++ statistics and timeout-warning refactor

| Layer | File(s) | Summary |
|---|---|---|
| statistics helpers and NaN handling | nvbench/detail/statistics.cuh | Adds shared noise-gating and sentinel helpers, changes percentile and quartile behavior for NaN-containing inputs, and centralizes quartile selection threshold use. |
| centralized timeout warning logging | nvbench/detail/measure_timeout_warnings.cuh, nvbench/detail/measure_cold.cuh, nvbench/detail/measure_cold.cu, nvbench/detail/measure_cpu_only.cxx | Introduces shared timeout-warning helpers and wires them into cold and CPU-only measurement code. |
| statistics and timeout-warning tests | testing/CMakeLists.txt, testing/measure_timeout_warnings.cu, testing/statistics.cu | Adds CUDA tests for timeout warnings and expands statistics coverage for NaNs, duplicate-heavy quartiles, and new noise helpers. |

Python tools packaging and lazy dependency loading

| Layer | File(s) | Summary |
|---|---|---|
| tooling dependency loader | python/scripts/nvbench_tooling_deps.py | Defines the tooling dependency dataclass, missing-dependency error, and helper for lazy importing optional packages with install hints. |
| lazy script loading | python/scripts/nvbench_histogram.py, python/scripts/nvbench_plot_bwutil.py, python/scripts/nvbench_walltime.py | Updates the histogram, plot_bwutil, and walltime scripts to use package-aware imports, lazy tooling loaders, and explicit exit codes on missing optional dependencies. |
| wheel packaging and tooling tests | python/pyproject.toml, python/test/test_nvbench_tooling_deps.py | Moves the scripts wheel package under cuda/bench/scripts, retargets console entry points, and adds tests for packaged imports and tooling-dependency error handling. |

nvbench-compare documentation

| Layer | File(s) | Summary |
|---|---|---|
| compare reference documentation | docs/nvbench_compare.md | Documents nvbench-compare behavior, command-line usage, matching, timing summaries, bulk-debug Python, comparison decisions, AMBG handling, and configuration keys. |

Assessment against linked issues

| Objective | Addressed | Explanation |
|---|---|---|
| Distribution-based comparison in nvbench_compare.py with SAME/AMBG/INSUFFICIENT handling [#320] | | No code changes to nvbench_compare.py are present; only documentation for the tool was added. |
| Use a more descriptive scripts package name in the wheel and update script entry points [#393] | | |
| Add optional tool dependencies in package metadata and make scripts fail gracefully without them [#384] | | |

Assessment against linked issues: Out-of-scope changes

| Code Change | Explanation |
|---|---|
| C++ statistics helper changes and NaN/noise gating (nvbench/detail/statistics.cuh lines 57-423) | The linked issues do not mention modifying measurement statistics internals. |
| Centralized timeout-warning logging and cold/CPU-only measurement refactor (nvbench/detail/measure_timeout_warnings.cuh, nvbench/detail/measure_cold.cu, nvbench/detail/measure_cpu_only.cxx) | These changes are unrelated to the linked packaging and compare-tool objectives. |
| New timeout-warning and statistics tests (testing/measure_timeout_warnings.cu, testing/statistics.cu, testing/CMakeLists.txt) | The linked issues do not require these test additions. |


coderabbitai[bot]

This comment was marked as resolved.

@oleksandr-pavlyk oleksandr-pavlyk force-pushed the nvbench-compare-process-bulk-data branch 2 times, most recently from 42927d9 to a240460 Compare June 30, 2026 11:41
@oleksandr-pavlyk

Collaborator Author

@CodeRabbit full review

@coderabbitai

This comment was marked as outdated.

coderabbitai[bot]

This comment was marked as resolved.

@oleksandr-pavlyk oleksandr-pavlyk added python type: enhancement New feature or request. labels Jun 30, 2026
@oleksandr-pavlyk oleksandr-pavlyk self-assigned this Jun 30, 2026
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 30, 2026
@oleksandr-pavlyk oleksandr-pavlyk moved this from Todo to In Progress in CCCL Jun 30, 2026
@jrhemstad jrhemstad requested a review from gevtushenko June 30, 2026 15:52

Teach nvbench_compare to parse GPU timing summaries into structured values and
prefer the robust median/IQR summaries when both compared measurements provide
them. Fall back to the existing mean/stdev summaries when robust summaries are
not available.

Classify comparisons with the larger available relative noise estimate instead
of the smaller one, keep unavailable noise distinct from encoded infinite noise,
and report improvements separately from regressions. Keep the process exit code
as success for completed comparisons; regression counts are reported in the
summary instead of being used as the process status.

Make plotting tolerate unavailable noise by leaving gaps in confidence bands,
sort plotted series by the plotted axis, and avoid reusing pyplot state across
plot calls.

Add focused Python tests for robust-summary preference, unavailable-noise
classification, non-finite timing centers, plot-along handling when the selected
axis is absent, and the exit-code contract.
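
The noise-selection rule described in this commit can be sketched as follows. This is a minimal illustration, not the code in nvbench_compare.py: `pick_relative_noise` is a hypothetical helper, and it assumes that unavailable noise is represented as `None` (NaN is treated the same way here) while an encoded infinite noise arrives as `math.inf`.

```python
import math
from typing import Optional


def pick_relative_noise(
    ref_noise: Optional[float], cmp_noise: Optional[float]
) -> Optional[float]:
    """Return the larger of the available noise estimates.

    None means "noise was not reported"; math.inf means "noise was reported
    and is infinite". The two cases are kept distinct on purpose.
    """
    # Assumption: NaN is treated like a missing value.
    available = [
        noise
        for noise in (ref_noise, cmp_noise)
        if noise is not None and not math.isnan(noise)
    ]
    if not available:
        return None  # unknown noise, not infinite noise
    return max(available)
```

Taking the maximum makes the classification pessimistic: a quiet reference run cannot hide a noisy compare run.
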

Teach nvbench_compare to keep the order of --benchmark and --axis arguments so
axis filters can apply either globally or to the most recent benchmark. Build a
filter plan from the ordered CLI arguments and apply the same plan to table
output and plotting labels.

Add explicit --reference-devices and --compare-devices filters. The filters
accept all, a single device id, or a comma-separated list of ids; ordered lists
and duplicates are preserved so selected reference and compare devices can be
paired by position. Device-section mismatches remain fatal for unfiltered
all-vs-all comparisons, but become warnings when the user explicitly selects
devices and the selected device counts match.

Match duplicate benchmark states by occurrence within each filtered device
section instead of matching only by state name across the whole benchmark. This
keeps repeated axis values and filtered duplicate states aligned between the
reference and compare inputs, and reports mismatched occurrence counts instead
of silently dropping extra states.

Add Python tests for duplicate-state matching, axis filtering before matching,
device filter parsing and validation, explicit cross-device pairing, and
benchmark-scoped axis filters.

Original commit messages folded into this change:

Tweaks for nvbench_compare

1. When JSON files contain multiple entries with the same name and axis values,
   make sure that the script compares corresponding entries.

   Previous logic would extract the first entry from ref data, and would compare
   measurements for each state in cmp against the first entry from ref. The
   change introduces a counter to know which nth entry we process for a
   particular axis value, and retrieve corresponding entry in ref.
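
A minimal sketch of the occurrence counter described above, assuming each state is a dict with a `"name"` key. The function name `pair_by_occurrence` and the returned shape are hypothetical, and the sketch only reports surplus compare states, not surplus reference states.

```python
from collections import defaultdict


def pair_by_occurrence(ref_states, cmp_states):
    """Pair states that share a name by their order of appearance.

    Returns (pairs, unmatched), where unmatched holds compare states that
    have no reference state with the same name and occurrence index.
    """
    ref_by_name = defaultdict(list)
    for state in ref_states:
        ref_by_name[state["name"]].append(state)

    seen = defaultdict(int)  # how many times each name was seen in cmp
    pairs, unmatched = [], []
    for state in cmp_states:
        name = state["name"]
        index = seen[name]
        seen[name] += 1
        candidates = ref_by_name.get(name, [])
        if index < len(candidates):
            # n-th compare entry is paired with the n-th reference entry
            pairs.append((candidates[index], state))
        else:
            unmatched.append(state)
    return pairs, unmatched
```

With this scheme the second `A` entry in cmp is compared against the second `A` entry in ref instead of always against the first one.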

Scope occurrence matching by device.

Device pairing in nvbench_compare.py is strictly index-based under
--ignore-devices, so reused IDs in a different order no longer pair against the
wrong reference device.

Require devices in ref and cmp to have the same cardinality

Handle mismatch when the number of duplicates in ref data differs from that in cmp data

Use pytest monkeypatch fixture to pretend third-party package dependencies are
available during test run for nvbench_compare without introducing test-time
dependency

Added the happy-path test and fixed its direct-call setup by initializing the
device globals that main() normally populates.

Fix to filter-before-matching.

 - compare_benches() now pairs devices by selected position instead of taking a
   device id.
 - For each device pair, compare_benches() now builds:
     - ref_device_states: matching reference device and axis filters
     - cmp_device_states: matching compare device and axis filters
 - State occurrence counts and duplicate occurrence matching now operate only
   on those filtered per-device lists.
 - Removed the later matches_axis_filters() skip inside the compare-state loop
   because filtering now happens before matching.

Added a regression test where ref/cmp have duplicate state names in opposite
order, and --axis keeps only one of them. The test verifies the kept compare
state is matched against the kept reference state, not the first unfiltered
occurrence.

Introduce device filtering in nvbench_compare

 - --reference-devices all|ID|ID,ID,...
 - --compare-devices all|ID|ID,ID,...
 - Integer lists preserve order and duplicates.
 - Requested IDs are validated against the file-level device list.
 - Filtered reference/compare device counts must match before comparison.
 - compare_benches() pairs selected reference and compare devices by position.
 - Each benchmark validates that requested device IDs are present in its own
   devices list.
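
The filter syntax and positional pairing listed above could look roughly like this. The helper names `parse_device_filter` and `pair_selected_devices` are hypothetical, and representing `all` as `None` is an assumption made for the sketch.

```python
def parse_device_filter(text):
    """Parse 'all', 'ID', or 'ID,ID,...'.

    Returns None for 'all', otherwise a list of ints that preserves the
    order and any duplicates given on the command line.
    """
    text = text.strip()
    if text == "all":
        return None
    ids = []
    for token in text.split(","):
        token = token.strip()
        # Accept plain ASCII decimal digits only.
        if not (token.isascii() and token.isdigit()):
            raise ValueError(f"invalid device id {token!r} in {text!r}")
        ids.append(int(token))
    return ids


def pair_selected_devices(ref_ids, cmp_ids):
    """Pair selected reference and compare devices by position."""
    if len(ref_ids) != len(cmp_ids):
        raise ValueError(
            f"device count mismatch: {len(ref_ids)} reference vs "
            f"{len(cmp_ids)} compare"
        )
    return list(zip(ref_ids, cmp_ids))
```

For example, `--reference-devices 1,0 --compare-devices 0,1` would pair reference device 1 with compare device 0, and reference device 0 with compare device 1.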

Implemented benchmark-scoped --axis handling.

  - --axis and --benchmark now share an ordered argparse action, so their
    relative CLI order is preserved.
  - -a before any -b becomes a global axis filter.
  - -a after -b <name> applies to that most recent benchmark only.
  - Repeated -b entries are treated as separate filter scopes and combined as
    alternatives for that benchmark.
  - Device filtering remains global and is applied independently.
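
One way to preserve the relative order of two options in argparse is a custom action that appends to a single shared list. The sketch below shows the idea; the class name `OrderedFilterAction` and the namespace attribute `ordered_filters` are hypothetical and may differ from what the script uses.

```python
import argparse


class OrderedFilterAction(argparse.Action):
    """Append (option kind, value) to one shared list, in CLI order."""

    def __call__(self, parser, namespace, values, option_string=None):
        ordered = getattr(namespace, "ordered_filters", None)
        if ordered is None:
            ordered = []
            namespace.ordered_filters = ordered
        # self.dest is "benchmark" or "axis", depending on the option used.
        ordered.append((self.dest, values))


def build_parser():
    parser = argparse.ArgumentParser()
    parser.set_defaults(ordered_filters=None)
    parser.add_argument(
        "-b", "--benchmark", dest="benchmark", action=OrderedFilterAction
    )
    parser.add_argument("-a", "--axis", dest="axis", action=OrderedFilterAction)
    return parser


args = build_parser().parse_args(["-a", "A=2", "-b", "bench1", "-a", "B=4"])
print(args.ordered_filters)
```

Because both options write to the same list, a later step can tell that `A=2` preceded `-b bench1` while `B=4` followed it, which the default `append` action with separate destinations cannot express.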

Allow non-matching devices for explicit device selection

Now the device-section equality check remains fatal only for unfiltered
all-vs-all comparisons. If either --reference-devices or --compare-devices is
explicit, mismatched selected device metadata is printed as a warning, but
comparison proceeds after the selected device counts have been validated.

Fix for resolve_benchmark_device_ids, add comments

The return value of resolve_benchmark_device_ids now always owns its list.

Use monkeypatch class in set_test_devices helper

Stricter device id validation

Test for device id validation

Introduce GpuTimingData, SummaryComparison, ComparisonStats, and
ComparisonRunData to make timing extraction, classification, and run-level
state explicit.

Load sample-time and SM-frequency bulk data from JSON binary output into
GpuTimingData when available, preserving count validation between paired
sample and frequency arrays.

Move GPU timing comparison logic into compare_gpu_timings(), prefer robust
median/IQR data when available, and fall back to mean/stdev summaries otherwise.
Keep missing or invalid noise on the unknown path.

Replace module-level comparison counters and selected-device globals with
per-run data passed into compare_benches(). Update tests to validate timing
classification, bulk-data loading, device pairing, filtered duplicate matching,
and summary counters through the new structures.

It is not emitted just yet, but the code is ready for it
once it starts being emitted.

Store JSON-bin sample time and frequency metadata in GpuTimingData instead of
reading the binary files during summary extraction.

Add Float32BinarySource and lazy cached accessors for samples and frequencies.
Use np.fromfile by default, but allow tests and alternate callers to inject a
float32 reader returning any buffer-compatible object convertible to the "<f4"
data type.

Treat optional bulk-data failures as unavailable evidence instead of aborting
comparison: unreadable files, invalid buffers, count mismatches, and mismatched
sample/frequency metadata now emit RuntimeWarning and return None.

Update nvbench_compare tests to verify lazy loading, cache reuse, injected
reader behavior, warning-based degradation, and count mismatch handling.
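
The lazy, cached, reader-injectable design can be illustrated with a standard-library-only sketch. It is not the Float32BinarySource from this PR: the class name `LazyFloat32Source` and its interface are hypothetical, and `struct` is swapped in for NumPy so the example has no third-party dependency.

```python
import struct
import warnings


def read_file_bytes(path):
    """Default reader: return the raw bytes of a sidecar file."""
    with open(path, "rb") as stream:
        return stream.read()


class LazyFloat32Source:
    """Hypothetical stand-in for a lazily loaded, cached float32 sidecar."""

    def __init__(self, path, count, reader=read_file_bytes):
        self.path = path
        self.count = count
        self._reader = reader
        self._loaded = False
        self._values = None

    def values(self):
        """Return a tuple of floats, or None when the data is unavailable."""
        if not self._loaded:
            self._values = self._load()
            self._loaded = True  # cache both successes and failures
        return self._values

    def _load(self):
        try:
            # bytes() accepts any buffer-compatible object.
            raw = bytes(self._reader(self.path))
        except (OSError, TypeError, ValueError) as exc:
            warnings.warn(
                f"bulk data unavailable for {self.path}: {exc}", RuntimeWarning
            )
            return None
        if len(raw) != 4 * self.count:
            warnings.warn(
                f"{self.path}: expected {self.count} float32 values, "
                f"got {len(raw)} bytes",
                RuntimeWarning,
            )
            return None
        # "<f" is little-endian float32, matching the "<f4" dtype.
        return struct.unpack(f"<{self.count}f", raw)
```

Injecting the reader lets tests supply data without a sidecar file on disk, and returning `None` with a warning keeps a damaged sidecar from aborting the whole comparison.
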

Its intent is to be a cheaply retrievable metric of the average
SM clock frequency over the entire sample.

The quantile values are not currently used, but are plumbed through.

Implemented the clear-gap comparison, with the log-distance-equivalent
algebra and pessimistic SM-clock fallback.

What changed:

 - Added TimingInterval and interval construction from summaries:
    - robust interval: [min, q3], centered at median
    - fallback interval: clipped [mean - stdev, mean + stdev] intersected with [min, max]
 - Added CLEAR_GAP_RELATIVE_THRESHOLD = 0.005.
 - FAST gap uses:

   (ref.lower - cmp.upper) / cmp.upper >= delta
   which is equivalent to log(ref.lower / cmp.upper) >= log(1 + delta).
 - SLOW gap uses:

   (cmp.lower - ref.upper) / ref.upper >= delta
 - FAST/SLOW now requires SM clock summaries on both sides and the same clear-gap result after scaling intervals by sm_clock_rate_mean.
 - If intervals are missing, overlap, fail the gap threshold, have missing/invalid clock summaries, or time/cycle comparison disagrees, status is UNDECIDED.
 - Existing center/noise values are still computed and displayed, but no longer drive FAST/SLOW/SAME classification.
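
The clear-gap rules above can be sketched in a few lines. The `Interval` class and the function names are hypothetical, and scaling a time interval by a single mean clock rate is a simplification of whatever the script does with `sm_clock_rate_mean`.

```python
from dataclasses import dataclass

CLEAR_GAP_RELATIVE_THRESHOLD = 0.005


@dataclass(frozen=True)
class Interval:
    lower: float
    upper: float

    def scaled(self, factor):
        """Convert a time interval to cycles (time * clock rate)."""
        return Interval(self.lower * factor, self.upper * factor)


def clear_gap(ref, cmp, delta=CLEAR_GAP_RELATIVE_THRESHOLD):
    """Return 'FAST', 'SLOW', or None when there is no clear gap."""
    if ref.lower <= 0 or cmp.lower <= 0:
        return None  # ratio tests need positive bounds
    if (ref.lower - cmp.upper) / cmp.upper >= delta:
        return "FAST"  # compare interval lies clearly below reference
    if (cmp.lower - ref.upper) / ref.upper >= delta:
        return "SLOW"  # compare interval lies clearly above reference
    return None


def classify(ref_time, cmp_time, ref_clock, cmp_clock):
    """Require the same clear-gap verdict in time space and cycle space."""
    time_verdict = clear_gap(ref_time, cmp_time)
    if time_verdict is None or ref_clock is None or cmp_clock is None:
        return "UNDECIDED"
    cycle_verdict = clear_gap(
        ref_time.scaled(ref_clock), cmp_time.scaled(cmp_clock)
    )
    return time_verdict if cycle_verdict == time_verdict else "UNDECIDED"
```

The cycle-space check guards against a run that only looks faster because the GPU happened to clock higher during it.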

Updated tests to cover:

 - center/noise-only comparisons becoming UNDECIDED
 - clear FAST/SLOW with matching clock evidence
 - missing clock fallback to UNDECIDED
 - frequency-shift disagreement becoming UNDECIDED
 - regression reporting with robust interval and clock evidence

If the FAST/SLOW check is undecided, attempt a conservative
SAME check based on summary data alone (bulk data is not read).

Reference and compare measurements are considered SAME if
   - both centers are positive finite values;
   - abs(ref - cmp) / min(ref, cmp) <= 0.5%.
     This is equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta;
   - interval overlap must cover at least 50% of the smaller interval;
   - relative dispersion must be finite on both sides and no more than 2%;
   - if SM clock summaries are available, the same check must also pass in cycle space.

Otherwise UNDECIDED remains the working decision, to be refined by further checks.
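
A minimal sketch of the summary-only SAME test in time space, using the thresholds quoted above. The function name `is_same`, its parameter layout, and the handling of zero-width intervals are assumptions; the cycle-space repetition of the check is omitted.

```python
import math


def is_same(
    ref_center,
    cmp_center,
    ref_interval,
    cmp_interval,
    ref_noise,
    cmp_noise,
    delta=0.005,
    min_overlap=0.5,
    max_noise=0.02,
):
    """Intervals are (lower, upper) tuples; noise is relative dispersion."""
    centers = (ref_center, cmp_center)
    if not all(math.isfinite(c) and c > 0 for c in centers):
        return False
    # Equivalent to max(ref, cmp) / min(ref, cmp) <= 1 + delta.
    if abs(ref_center - cmp_center) / min(centers) > delta:
        return False
    noises = (ref_noise, cmp_noise)
    if not all(math.isfinite(n) and 0 <= n <= max_noise for n in noises):
        return False
    overlap = min(ref_interval[1], cmp_interval[1]) - max(
        ref_interval[0], cmp_interval[0]
    )
    smaller = min(
        ref_interval[1] - ref_interval[0], cmp_interval[1] - cmp_interval[0]
    )
    if smaller <= 0:
        # Assumption: a zero-width interval counts as covered when it
        # touches the other interval.
        return overlap >= 0
    return overlap / smaller >= min_overlap
```

Every condition must hold for SAME; failing any one of them leaves the result UNDECIDED in the scheme described here.
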

- Add DecisionReason(code, message) and internal
  TimingDecision(status, reason).
- SummaryComparison now carries a reason.
- ComparisonStats now aggregates undecided reasons.
- Final summary prints a reason breakdown only when
  undecided reasons exist, e.g.:

  - Undecided   (comparison requires more evidence): 3
    - Reasons:
      - noise_too_high: 2 (relative dispersion is too
                           high to declare same)
      - weak_interval_overlap: 1 (timing intervals do not
                 overlap strongly enough to declare same)

Add a bulk-data SAME path to nvbench_compare for cases where summary
intervals do not provide a clear FAST/SLOW decision. The new path compares
sample times and SM-clock-adjusted cycles with symmetric nearest-neighbor
coverage over unique values and sample counts.

The comparison now requires both sample-weight coverage and unique-support
coverage to pass before declaring SAME. If bulk data is available but coverage
does not pass, the result remains UNDECIDED instead of falling back to the
summary-only SAME rule.

Also improve undecided diagnostics by aggregating reason codes while preserving
the most severe representative detail, including observed coverage values and
thresholds for bulk support mismatches.

Add tests for:
 - bulk data confirming SAME despite changed mode weights;
 - bulk time mismatch overriding summary-only SAME;
 - cycle coverage vetoing time-only agreement;
 - sample-weight and unique-support coverage diagnostics;
 - aggregation of undecided reason details.

Select robust timing inputs independently for reference and compare data:
prefer robust summaries when present, otherwise recompute robust statistics from
that side's bulk samples. Fall back to mean/stdev summaries only when both sides
cannot provide robust timing inputs.

This allows modern JSON data with robust summaries to compare against legacy JSON
data that lacks robust summaries but includes bulk sample data, without mixing
summary families or unnecessarily falling back to mean/stdev.

Reject boolean and floating-point values for int64 bulk binary sizes instead of
silently converting them with int(). Keep integer strings accepted for existing
NVBench JSON compatibility, and add regression coverage for valid and malformed
size payloads.
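
The stricter size parsing can be illustrated as below. `parse_int64_size` is a hypothetical name, and accepting only ASCII decimal digit strings is an assumption about what "integer strings" means here.

```python
def parse_int64_size(value):
    """Accept ints and decimal integer strings; reject everything else."""
    # bool is a subclass of int in Python, so it must be checked first.
    if isinstance(value, bool):
        raise TypeError("boolean is not a valid size")
    if isinstance(value, int):
        result = value
    elif isinstance(value, str):
        text = value.strip()
        if not (text.isascii() and text.isdigit()):
            raise ValueError(f"not an integer string: {value!r}")
        result = int(text)
    else:
        # Floats are rejected here instead of being truncated by int().
        raise TypeError(f"unsupported size type: {type(value).__name__}")
    if result < 0:
        raise ValueError("size must be non-negative")
    return result
```

The explicit `bool` check matters because `int(True)` silently yields 1, and `int(3.9)` silently yields 3.
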

Use the rounded-rank method rather than NumPy's quantile.

When any --benchmark filter is present, keep comparison limited to the
explicitly selected benchmarks. Leading --axis filters are still replayed onto
each selected benchmark, matching native NVBench option parsing, but they no
longer cause unrelated benchmarks to be compared.

E.g., `-a A=2 -b bench1` now compares only bench1,
`-a A=2 -b bench1 -b bench2` applies A=2 to both selected benchmarks

Update tests for global axis filters with benchmark scopes and document the
selection behavior.
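
The scoping rules can be sketched as a small plan builder over ordered `(kind, value)` pairs. `build_filter_plan` and its return shape are hypothetical; the sketch only shows how leading axis filters are replayed onto each selected benchmark.

```python
def build_filter_plan(ordered_filters):
    """Turn ordered ("axis" | "benchmark", value) pairs into filter scopes.

    Returns (global_axes, scopes). Each scope is (benchmark name, axis
    filters). When scopes is empty, no benchmark was selected and the
    global axis filters apply to every benchmark.
    """
    global_axes = []
    scopes = []
    for kind, value in ordered_filters:
        if kind == "benchmark":
            # A new scope starts with a copy of the leading axis filters.
            scopes.append((value, list(global_axes)))
        elif kind == "axis":
            if scopes:
                scopes[-1][1].append(value)  # applies to most recent -b
            else:
                global_axes.append(value)  # -a before any -b
        else:
            raise ValueError(f"unknown filter kind: {kind!r}")
    return global_axes, scopes


plan = build_filter_plan(
    [("axis", "A=2"), ("benchmark", "bench1"), ("benchmark", "bench2")]
)
print(plan)
```

For `-a A=2 -b bench1 -b bench2` the plan selects only bench1 and bench2, each with `A=2`, so unrelated benchmarks are not compared.
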

Also add a test to check that STDOUT also works.

Reject boolean summary float payloads instead of coercing them to 1.0/0.0,
while keeping numeric strings accepted for NVBench JSON compatibility.

Add regression coverage for generated bulk-debug Python filenames that require
escaping, and strengthen the plot-along test to assert log-log axes and
confidence-band rendering.

Move Float32BinarySource material-payload detection into the source object.
Default file-backed sources still use resolved file size so missing or empty
sidecars remain unavailable, but positive-count sources with custom readers are
treated as material and proceed through the lazy read path.

Add regression coverage for virtual bulk sources whose custom reader provides
data without a local sidecar file.

Derive absolute standard deviation from relative stdev and mean when
nv/cold/time/gpu/stdev/absolute is absent. This lets older JSON files
that only contain mean and relative stdev still construct timing
intervals.

Also allow mean/stdev intervals to be built without min/max summaries,
using min/max only as optional clipping bounds when present. This
restores SAME classification for legacy fixture data instead of treating
those rows as missing-interval AMBG cases.

Update nvbench_compare tests to cover derived stdev handling and the
legacy mean/stdev comparison path.
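
A sketch of the derived-stdev interval, under stated assumptions: the relative stdev is taken to be a fraction of the mean (not a percentage), the positive clamp value is arbitrary, and `mean_stdev_interval` is a hypothetical name.

```python
import math
import sys


def mean_stdev_interval(
    mean, stdev_abs=None, stdev_rel=None, minimum=None, maximum=None
):
    """Build [mean - stdev, mean + stdev], clipped to [min, max] if given.

    Returns (lower, upper), or None when the inputs cannot form an interval.
    """
    if mean is None or not math.isfinite(mean) or mean <= 0:
        return None
    if stdev_abs is None and stdev_rel is not None:
        # Assumption: relative stdev is stored as a fraction of the mean.
        stdev_abs = stdev_rel * mean
    if stdev_abs is None or not math.isfinite(stdev_abs) or stdev_abs < 0:
        return None
    lower, upper = mean - stdev_abs, mean + stdev_abs
    # min/max are optional clipping bounds, not required inputs.
    if minimum is not None:
        lower = max(lower, minimum)
    if maximum is not None:
        upper = min(upper, maximum)
    # Keep the lower bound positive so later ratio checks stay defined.
    lower = max(lower, sys.float_info.min)
    if lower > upper:
        return None
    return lower, upper
```

With `mean=10.0` and `stdev_rel=0.1` and no min/max summaries, the interval is `(9.0, 11.0)`, which is enough for the interval-based checks to run on legacy files.
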

Add a shared nvbench_tooling_deps helper for importing packages required
by NVBench console tools. Missing tooling packages now raise a dedicated
error with an install recipe instead of failing with a raw ImportError.

Update script imports to work both as installed package modules and as
direct source-tree scripts by using the __package__ import pattern for
nvbench_json and the new tooling helper.

Defer nvbench-compare dependencies to the points where they are needed:
NumPy/colorama during normal comparison setup, tabulate during table
rendering, jsondiff only for device mismatch reporting, and plotting
packages only for plot modes.

Update tests to initialize compare tooling when calling internals
directly and add coverage for the tooling dependency loader.

Closes NVIDIA#384

Introduce optional dependency categories in the wheel metadata,
with cuda-bench[tools] encompassing all dependencies of nvbench
scripts.

Closes NVIDIA#393

Move `load_nvbench_compare_tooling()` after:
  - config resolution / --dump-config
  - filter parsing
  - device-filter parsing
  - input arity validation

test_nvbench_tooling_deps.py now has a smoke test that builds a
temporary package layout matching the wheel mapping:

cuda/bench/scripts/nvbench_tooling_deps.py

and imports:

cuda.bench.scripts.nvbench_tooling_deps

That covers the cuda.bench.scripts.* path without requiring a wheel
build/install inside this unit test.

Add a shared nvbench_tooling_deps helper that imports packages required
by console tools and reports missing packages with an actionable install
recipe.

Update NVBench scripts to support both direct source-tree execution and
installed package execution through the __package__ import pattern. Defer
third-party tooling imports until they are needed, including lazy loading
for compare tables, device diffs, and plotting paths.

Make loaders resilient to partial initialization failures so a later
retry can complete any dependency that failed previously.

Update tests to cover direct internal use, packaged cuda.bench.scripts
imports, dependency-loader error messages, and cheap CLI validation before
tooling dependencies are loaded.

require_tooling_dependency() now translates
ModuleNotFoundError only when the missing module is the
requested top-level package. Other import
failures are re-raised unchanged.

This helps when a third-party dependency
is installed but broken. Previously
the error was intercepted with a suggestion to run
pip install cuda-bench[tools], even though that had already been done.
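
The narrowed translation can be sketched with `importlib` and the `name` attribute of `ModuleNotFoundError`. The names `require_dependency` and `MissingToolingDependencyError` are hypothetical and do not claim to match the signature of `require_tooling_dependency()`.

```python
import importlib


class MissingToolingDependencyError(RuntimeError):
    """Raised when an optional tooling package is not installed."""


def require_dependency(module_name, install_hint="pip install cuda-bench[tools]"):
    """Import module_name, translating only a genuinely missing package."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        top_level = module_name.partition(".")[0]
        if exc.name != top_level:
            # The package exists, but something it imports is missing:
            # re-raise unchanged so the real problem stays visible.
            raise
        raise MissingToolingDependencyError(
            f"'{top_level}' is required for this tool; install it with: "
            f"{install_hint}"
        ) from exc
```

Comparing `exc.name` against the requested top-level package is what separates "not installed" from "installed but broken".
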

Python's round() rounds halfway cases to the nearest even integer, while in C++
it behaves as x -> floor(x + 0.5) (for non-negative x).
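
The difference is easy to reproduce. `round_half_up` below is a hypothetical helper that mirrors the `floor(x + 0.5)` behavior and is meant for non-negative inputs such as ranks.

```python
import math


def round_half_up(x: float) -> int:
    """Round halfway cases upward, i.e. floor(x + 0.5)."""
    return math.floor(x + 0.5)


for value in (0.5, 1.5, 2.5, 3.5):
    # Built-in round() uses banker's rounding (ties go to the even integer).
    print(value, round(value), round_half_up(value))
```

For 0.5 and 2.5 the built-in `round()` gives 0 and 2, while `round_half_up` gives 1 and 3, so a rank computed with `round()` can select a different sample than the C++ code would.
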

Factor finite-value checks through a common helper so positive and
non-negative finite predicates share the same None/finite guard. Add a
comment explaining the positive lower-bound clamp used for mean/stdev
intervals, since later ratio and log-distance checks require positive
bounds.

Also quote the --axis pow2 example in CLI help so shell users can copy
the example safely.

Implemented robo-review feedback suggesting expanded test coverage
of the lazy-loading helper.

Thread the originating JSON path through bulk timing extraction so sidecar
filenames can be resolved with knowledge of the JSON basename. This lets
nvbench-compare handle --jsonbin output where the filename field may
store the benchmark-process-relative path, such as data/result.json-bin/0.bin,
instead of a path relative to the JSON file directory.

Keep resolution anchored to the JSON location and avoid falling back to the
caller working directory. Add regression coverage for this sidecar layout.
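
One possible resolution strategy is sketched below. It is an assumption about how such anchoring could work, not a description of the script: `sidecar_candidates` is a hypothetical helper that assumes the sidecar directory is named `<json basename>-bin`, and it only computes candidate paths without touching the filesystem.

```python
from pathlib import PurePosixPath


def sidecar_candidates(json_path, recorded):
    """Return candidate sidecar locations, all anchored at the JSON file.

    json_path: path of the JSON result file.
    recorded:  the filename field stored in the JSON, which may be relative
               to the JSON directory or to the benchmark process directory.
    """
    json_path = PurePosixPath(json_path)
    recorded = PurePosixPath(recorded)
    json_dir = json_path.parent

    # Candidate 1: the recorded path relative to the JSON directory.
    candidates = [json_dir / recorded]

    # Candidate 2: re-anchor at the "<json name>-bin" component, which
    # handles process-relative paths such as data/result.json-bin/0.bin.
    bin_dir_name = json_path.name + "-bin"
    parts = recorded.parts
    if bin_dir_name in parts:
        index = len(parts) - 1 - parts[::-1].index(bin_dir_name)
        rebased = json_dir.joinpath(*parts[index:])
        if rebased not in candidates:
            candidates.append(rebased)
    return candidates


print(sidecar_candidates("/perf/data/result.json", "data/result.json-bin/0.bin"))
```

A caller would try the candidates in order and treat the bulk data as unavailable if none exists; the current working directory is never consulted.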

This change helps nvbench_compare properly resolve the location of sidecar
binary files in the following CCCL-based workflow:

```
./bin/cub.bench.bitonic_sort.warp_keys.base -d 1 --cold-warmup-runs 16 --min-samples 100 --stopping-criterion entropy --min-r2 0.88 --jsonbin ../../../perf_data/warp-bitonic-sort-run1.json
./bin/cub.bench.bitonic_sort.warp_keys.base -d 1 --cold-warmup-runs 16 --min-samples 100 --stopping-criterion entropy --min-r2 0.88 --jsonbin ../../../perf_data/warp-bitonic-sort-run2.json

PYTHONPATH=./_deps/nvbench-src/python/scripts python ./_deps/nvbench-src/python/scripts/nvbench_compare.py ../../../perf_data/warp-bitonic-sort-run1.json ../../../perf_data/warp-bitonic-sort-run2.json
```

Before this change, the script generated numerous RuntimeWarnings reporting
that the absolute path could not be resolved and that bulk data would be treated as unavailable.
