2 changes: 1 addition & 1 deletion LICENSE.md
@@ -1,4 +1,4 @@
-Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved.
+Copyright (c) 2019-2026 Advanced Micro Devices, Inc. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
158 changes: 158 additions & 0 deletions docs/conceptual/transferbench-data-validation.rst
@@ -0,0 +1,158 @@
.. meta::
:description: Explains how TransferBench validates transfer correctness by comparing destination memory against precomputed expected values derived from source buffers.
:keywords: TransferBench data validation, TransferBench correctness, ValidateAllTransfers, PrepareReference, destination buffer, source buffer

.. _transferbench-data-validation:

==============================
TransferBench data validation
==============================

TransferBench validates the transfer results by comparing the destination (DST) memory to
precomputed expected values.

Overview
=========

Validation verifies that, for each transfer, the DST buffer contains the expected value:
the element-wise sum of all source (SRC) buffers, or the memset fill value when the transfer has no
sources. A transfer is correct if, for every element ``i``, the value matches the expected value given in the following table:

.. list-table::
:header-rows: 1

* - Number of sources
- Expected value

* - 0 sources
- ``dst[i] == 0`` (or memset value)

* - 1 source
- ``dst[i] == src0[i]``

* - N sources
- ``dst[i] == src0[i] + src1[i] + ... + src(N-1)[i]``

Source data preparation
=======================

Before any transfers run, TransferBench prepares the SRC and DST memories as discussed in the following sections:

Expected source pattern (``PrepareReference``)
-----------------------------------------------

TransferBench builds the reference SRC buffers on the host by calling
``PrepareReference(cfg, cpuBuffer, bufferIdx)``.

The pattern used depends on the configuration:

.. list-table::
:header-rows: 1

* - Configuration
- Behavior

* - ``fillCompress`` (non-empty)
- Mix of random floats with optional zeroing per 64-byte line:
``0`` = random, ``1`` = 1B0, ``2`` = 2B0, ``3`` = 4B0, ``4`` = 32B0.
Percentages control the mix. For details, see
:ref:`data-validation-var`.

* - ``fillPattern`` (non-empty)
- Repeats the given ``vector<float>`` over all SRC buffers.

* - Default
- Pseudo-random: ``PrepSrcValue(bufferIdx, i) = (((i % 383) * 517) % 383 + 31) * (bufferIdx + 1)``

``bufferIdx`` is the SRC index (0, 1, …), so each SRC buffer gets a different pattern.
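The default pattern can be reproduced with a few lines of Python. This is a minimal sketch that transcribes the formula from the table; the function names and the choice to return a ``float`` are assumptions made for illustration, not TransferBench code.

```python
def prep_src_value(buffer_idx: int, i: int) -> float:
    """Default source pattern, transcribed from the formula in the table."""
    return float((((i % 383) * 517) % 383 + 31) * (buffer_idx + 1))


def build_src_reference(buffer_idx: int, num_elements: int) -> list:
    """Hypothetical helper: fill one reference SRC buffer with the default pattern."""
    return [prep_src_value(buffer_idx, i) for i in range(num_elements)]


src0 = build_src_reference(0, 4)
src1 = build_src_reference(1, 4)
print(src0)  # [31.0, 165.0, 299.0, 50.0]
print(src1)  # [62.0, 330.0, 598.0, 100.0]
```

Because of the ``i % 383`` term, the pattern repeats every 383 elements, and buffer ``k`` holds the buffer-0 values scaled by ``k + 1``.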

Expected destination (``dstReference``)
----------------------------------------

The expected destination is computed once before the iteration loop:

.. code-block:: text

dstReference[0] = memset to MEMSET_CHAR (used when numSrcs == 0)
dstReference[1] = srcReference[0] (1 source)
dstReference[2] = dstReference[1] + srcReference[1] (2 sources)
dstReference[k] = dstReference[k-1] + srcReference[k-1] (k sources)

``dstReference[numSrcs]`` is the expected result for a transfer with ``numSrcs`` sources.
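The recurrence above is a running (prefix) sum over the source references. The following sketch illustrates it with plain Python lists; ``build_dst_reference`` and the ``memset_value`` argument are hypothetical names, and using ``0.0`` as the fill value is an assumption standing in for the result of the ``MEMSET_CHAR`` fill.

```python
def build_dst_reference(src_reference, num_elements, memset_value=0.0):
    """Return a list where entry k is the expected DST for a transfer with k sources."""
    # Entry 0: no sources, so the destination holds only the fill value (assumed 0.0 here).
    dst_reference = [[memset_value] * num_elements]
    for k, src in enumerate(src_reference, start=1):
        if k == 1:
            # One source: the destination is a copy of the first source.
            dst_reference.append(list(src))
        else:
            # k sources: previous running sum plus the next source, element-wise.
            prev = dst_reference[k - 1]
            dst_reference.append([a + b for a, b in zip(prev, src)])
    return dst_reference


refs = build_dst_reference([[1.0, 2.0], [10.0, 20.0], [100.0, 200.0]], 2)
print(refs[0])  # [0.0, 0.0]
print(refs[2])  # [11.0, 22.0]
print(refs[3])  # [111.0, 222.0]
```

Note that entry 1 is a copy of the first source rather than entry 0 plus the first source, which keeps the fill value out of the sums.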

Initializing source and destination memories
---------------------------------------------

For each transfer, the SRC memory on the rank that owns it is filled from the corresponding
``srcReference`` buffer via ``hipMemcpy`` (host-to-device or device-to-device as appropriate).
DST memory is zeroed (or filled with the memset value) before transfers run.

How validation is timed
========================

The timing of validation is controlled by the ``alwaysValidate`` option. By default
(``alwaysValidate = 0``), validation runs once after all timed iterations complete,
minimizing overhead during benchmarking. When ``alwaysValidate = 1``, validation is
performed after every iteration; any detected error immediately stops the run.

.. list-table::
:header-rows: 1

* - Option
- When
- Behavior

* - ``alwaysValidate = 0`` (default)
- Once at the end of all iterations
- ``ValidateAllTransfers`` called after the iteration loop.

* - ``alwaysValidate = 1``
- After every timed iteration
- ``ValidateAllTransfers`` called inside the loop; any error stops the run.
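The two modes differ only in where the validation call sits relative to the iteration loop. The sketch below shows that control flow under stated assumptions: ``run_iterations``, ``run_iteration``, and ``validate_all_transfers`` are hypothetical stand-ins, and the validator is assumed to return ``None`` on success or an error description on failure.

```python
def run_iterations(num_iterations, run_iteration, validate_all_transfers,
                   always_validate=False):
    """Run the iterations and return the first validation error, or None."""
    for iteration in range(num_iterations):
        run_iteration(iteration)
        if always_validate:
            # alwaysValidate = 1: check after every iteration and stop on error.
            error = validate_all_transfers()
            if error is not None:
                return error
    if not always_validate:
        # alwaysValidate = 0: a single check after the loop.
        return validate_all_transfers()
    return None


calls = []
result = run_iterations(3, lambda i: calls.append("run"),
                        lambda: calls.append("validate"))
print(calls)  # ['run', 'run', 'run', 'validate']
```

Validating inside the loop costs extra time per iteration but identifies the first failing iteration.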

How validation (``ValidateAllTransfers``) works
================================================

For each transfer and each DST, the following steps are performed:

1. **Rank check:** Only the rank that owns the destination performs validation.

2. **Getting actual output:**

- **CPU destination** or ``validateDirect = 1``: Point directly at the destination memory.
- **GPU destination** and ``validateDirect = 0``: Copy destination to a host ``outputBuffer``
via ``hipMemcpy``, then compare against ``outputBuffer``.

3. **Comparison:** The actual output is compared with the expected values using ``memcmp(output, expected, numBytes)``. On a mismatch, the code finds the first differing index and returns an error that reports the index, the expected value, and the actual value.

4. **Expected values:** The ``expected`` pointer is ``dstReference[t.srcs.size()].data()``, the precomputed sum for the transfer's number of sources.
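The comparison step can be approximated in Python by packing both buffers as 32-bit floats and comparing the raw bytes, which mirrors what ``memcmp`` does. This is an illustrative sketch, not the TransferBench implementation; ``first_mismatch`` and the sample buffers are hypothetical.

```python
import struct


def first_mismatch(output, expected):
    """Compare two float32 buffers bytewise.

    Returns None if they match, else (index, expected_value, actual_value)
    for the first differing element.
    """
    if len(output) != len(expected):
        raise ValueError("buffers must have the same length")
    n = len(output)
    out_bytes = struct.pack(f"{n}f", *output)
    exp_bytes = struct.pack(f"{n}f", *expected)
    if out_bytes == exp_bytes:  # plays the role of memcmp(...) == 0
        return None
    size = struct.calcsize("f")
    for i in range(n):
        if out_bytes[i * size:(i + 1) * size] != exp_bytes[i * size:(i + 1) * size]:
            return (i, expected[i], output[i])


# Expected buffer is selected by the number of sources, as in step 4.
dst_reference = [[0.0, 0.0, 0.0], [31.0, 165.0, 299.0]]
srcs = ["src0"]  # one source
expected = dst_reference[len(srcs)]
print(first_mismatch([31.0, 165.0, 299.0], expected))  # None
print(first_mismatch([31.0, 165.0, 0.0], expected))    # (2, 299.0, 0.0)
```

A bytewise comparison is stricter than a numeric one: for example, ``0.0`` and ``-0.0`` compare equal numerically but differ in their byte representation.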

Validation options
==================

The following options control when and how validation is performed. They can be set as
environment variables or in a configuration file.

.. list-table::
:header-rows: 1

* - Option
- Environment variable
- Description

* - ``alwaysValidate``
- ``ALWAYS_VALIDATE``
- To validate after each iteration, set to ``1``. To validate once at the end, set to ``0``.

* - ``validateDirect``
- ``VALIDATE_DIRECT``
- To compare GPU DST memory directly without a host copy, set to ``1`` (supported on AMD hardware only).
To copy to host and compare, set to ``0``.

* - ``validateSource``
- ``VALIDATE_SOURCE``
- To validate the SRC memory right after it is initialized, set to ``1`` (optional early check).

.. note::

``validateDirect`` is not supported on NVIDIA hardware; the code falls back to copying to the host.
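As a rough illustration of how the environment variables map to the options in the table, the sketch below reads them into a dictionary. The helper name is hypothetical, and treating an unset variable as ``0`` is an assumption; this page states a default only for ``alwaysValidate``.

```python
import os

# Environment variable name -> option name, taken from the table above.
ENV_TO_OPTION = {
    "ALWAYS_VALIDATE": "alwaysValidate",
    "VALIDATE_DIRECT": "validateDirect",
    "VALIDATE_SOURCE": "validateSource",
}


def load_validation_options(env=None):
    """Hypothetical helper: read the validation options from an environment mapping."""
    env = os.environ if env is None else env
    # Assumption: an unset variable behaves like 0.
    return {option: int(env.get(var, "0")) for var, option in ENV_TO_OPTION.items()}


print(load_validation_options({"ALWAYS_VALIDATE": "1"}))
```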
146 changes: 146 additions & 0 deletions docs/conceptual/transferbench-timing.rst
@@ -0,0 +1,146 @@
.. meta::
:description: Explains how TransferBench measures performance at the test, executor, and transfer levels using HIP events and CPU wall-clock timing.
:keywords: TransferBench timing, TransferBench measurement, HIP events, CPU wall-clock, executor timing, transfer timing, overhead

.. _transferbench-timing:

====================
TransferBench timing
====================

TransferBench measures performance at three nested levels: Test, Executor, and Transfer. Each level
captures a different scope of elapsed time, and the timing method used depends on the executor type.

Timing levels
=============

The following diagram illustrates the three levels of timing:

.. image:: /data/timing.png
:width: 100%
:align: center

The following table provides a quick summary of the three timing levels:

.. list-table::
:header-rows: 1

* - Timing level
- What it measures
- How it is timed

* - Test
- All Transfers across all executors and all ranks
- CPU wall-clock (``std::chrono::high_resolution_clock``)

* - Executor
- All Transfers that run on this executor
- Varies by executor type (see :ref:`timing-methods`)

* - Transfer
- A single Transfer
- Varies by executor type (see :ref:`timing-methods`)

.. _timing-methods:

Timing methods
==============

The timing method used for each executor and transfer depends on the executor type and the value of
``USE_HIP_EVENTS``.

Executor timing
---------------

.. list-table::
:header-rows: 1

* - Executor type
- Timing method

* - CPU
- CPU wall-clock (``std::chrono::high_resolution_clock``)

* - GFX / DMA
- For ``USE_HIP_EVENTS=1`` (default): HIP events (``hipEventElapsedTime``)

For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)

* - NIC
- CPU wall-clock (``std::chrono::high_resolution_clock``)

Transfer timing
---------------

.. list-table::
:header-rows: 1

* - Executor type
- Timing method

* - CPU
- CPU wall-clock (``std::chrono::high_resolution_clock``)

* - GFX
- For ``USE_HIP_EVENTS=1`` (default): GPU wall-clock timestamp (``wall_clock64()``)

For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)

* - DMA
- For ``USE_HIP_EVENTS=1`` (default): HIP events (``hipEventElapsedTime``)

For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)

* - NIC
- CPU wall-clock (``std::chrono::high_resolution_clock``)
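The two tables above can be condensed into a single lookup. The following sketch encodes them as a Python function for quick reference; the function name and the string labels for levels are hypothetical, while the returned method names are taken from the tables.

```python
CPU_CLOCK = "CPU wall-clock (std::chrono::high_resolution_clock)"
HIP_EVENTS = "HIP events (hipEventElapsedTime)"
GPU_CLOCK = "GPU wall-clock timestamp (wall_clock64())"


def timing_method(level: str, executor: str, use_hip_events: bool = True) -> str:
    """Return the timing method for a level ('executor' or 'transfer') and executor type."""
    if level not in ("executor", "transfer"):
        raise ValueError(f"unknown level: {level}")
    if executor not in ("CPU", "GFX", "DMA", "NIC"):
        raise ValueError(f"unknown executor type: {executor}")
    # CPU and NIC executors always use the CPU clock, as does everything
    # else when USE_HIP_EVENTS=0.
    if executor in ("CPU", "NIC") or not use_hip_events:
        return CPU_CLOCK
    # GFX transfers are the one case timed with the GPU wall-clock timestamp.
    if level == "transfer" and executor == "GFX":
        return GPU_CLOCK
    return HIP_EVENTS


print(timing_method("transfer", "GFX"))
print(timing_method("executor", "GFX"))
print(timing_method("transfer", "DMA", use_hip_events=False))
```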

Overhead
========

Overhead is the difference between the total CPU wall-clock time (Test time) and the elapsed time of
the slowest executor:

.. code-block:: text

Overhead = Test Time - MAX(Executor 0 Time, Executor 1 Time, ...)

Overhead captures scheduling and synchronization costs that fall outside of executor-measured time,
such as barrier waits and thread management.
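The overhead formula can be sketched as a small helper. This is a minimal illustration rather than TransferBench code; the function name and the sample timings are hypothetical.

```python
def overhead_ms(test_time_ms: float, executor_times_ms: list) -> float:
    """Overhead = Test Time - MAX(executor times), all in milliseconds."""
    return test_time_ms - max(executor_times_ms)


# Illustrative timings: the slowest executor (9.5 ms) determines the overhead.
print(overhead_ms(10.0, [9.5, 3.0]))  # 0.5
```

Only the slowest executor matters here, because executors are assumed to run concurrently within a test.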

Example output
==============

The following example shows TransferBench output for a test with two executors (CPU and GPU) and
four transfers:

.. code-block:: text

Test 1:
-------------------┬--------------┬------------┬-------------------┬--------------------
Executor: CPU 00 │ 0.027 GB/s │ 77.492 ms │ 2097152 bytes │ 4.489 GB/s (sum)
Executor 0 Time = 77.492 ms
-------------------┼--------------┼------------┼-------------------┼--------------------
Transfer 0 │ 4.476 GB/s │ 0.234 ms │ 1048576 bytes │ C0 -> C0:4 -> N
Transfer 0 Time = 0.234 ms
Transfer 1 │ 0.014 GB/s │ 77.359 ms │ 1048576 bytes │ G0 -> C0:4 -> N
Transfer 1 Time = 77.359 ms
-------------------┼--------------┼------------┼-------------------┼--------------------
Executor: GPU 00 │ 97.436 GB/s │ 0.689 ms │ 67108864 bytes │ 129.692 GB/s (sum)
Executor 1 Time = 0.689 ms
-------------------┼--------------┼------------┼-------------------┼--------------------
Transfer 2 │ 80.886 GB/s │ 0.415 ms │ 33554432 bytes │ G0 -> G0:4 -> G0
Transfer 2 Time = 0.415 ms
Transfer 3 │ 48.807 GB/s │ 0.687 ms │ 33554432 bytes │ G0 -> G0:4 -> G1
Transfer 3 Time = 0.687 ms
-------------------┼--------------┼------------┼-------------------┼--------------------
Aggregate (CPU) │ 0.891 GB/s │ 77.688 ms │ 69206016 bytes │ Overhead 0.197 ms
Test Time = 77.688 ms
-------------------┴--------------┴------------┴-------------------┴--------------------
Overhead = 77.688 - MAX(77.492, 0.689) = 0.197 ms

In this example:

- **Executor 0** (CPU) runs Transfers 0 and 1 and takes 77.492 ms (dominated by Transfer 1 at 77.359 ms).
- **Executor 1** (GPU) runs Transfers 2 and 3 and takes 0.689 ms.
- **Test Time** is 77.688 ms, measured by the CPU wall-clock across all executors.
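- **Overhead** is reported as 0.197 ms, calculated as ``77.688 - MAX(77.492, 0.689)``. Subtracting the rounded values shown gives 0.196 ms; the 0.001 ms difference is consistent with the output being computed from unrounded timings.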
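The bandwidth column in the example can be cross-checked from the byte counts and times. The sketch below assumes that bandwidth is bytes divided by elapsed time with 1 GB = 10^9 bytes; that convention is inferred from the figures in the example output and is not stated on this page. The function name is hypothetical.

```python
def bandwidth_gb_per_s(num_bytes: int, elapsed_ms: float) -> float:
    """Bandwidth in GB/s, assuming 1 GB = 1e9 bytes (inferred, not documented here)."""
    return (num_bytes / 1e9) / (elapsed_ms / 1e3)


# CPU executor row: 2097152 bytes in 77.492 ms.
print(round(bandwidth_gb_per_s(2097152, 77.492), 3))   # 0.027
# Aggregate row: 69206016 bytes in 77.688 ms.
print(round(bandwidth_gb_per_s(69206016, 77.688), 3))  # 0.891
```

Rows with sub-millisecond times may differ in the last digits when recomputed this way, because the displayed times are rounded to three decimals.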