ROCm · SwRaw · Jun 8, 2026 · Jun 9, 2026
@@ -1,4 +1,4 @@
-Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved.
+Copyright (c) 2019-2026 Advanced Micro Devices, Inc. All rights reserved.
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal

@@ -0,0 +1,158 @@
+.. meta::
+  :description: Explains how TransferBench validates transfer correctness by comparing destination memory against precomputed expected values derived from source buffers.
+  :keywords: TransferBench data validation, TransferBench correctness, ValidateAllTransfers, PrepareReference, destination buffer, source buffer
+
+.. _transferbench-data-validation:
+
+==============================
+TransferBench data validation
+==============================
+
+TransferBench validates the transfer results by comparing the destination (DST) memory to
+precomputed expected values.
+
+Overview
+=========
+
+Validation verifies that, for each transfer, the DST buffer contains the expected value:
+the element-wise sum of all source (SRC) buffers (or zero when there are no sources). A transfer is correct if, for
+every element ``i``, the value matches the expected value given in the following table:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Number of sources
+      - Expected value
+
+    * - 0 sources
+      - ``dst[i] == 0`` (or the memset value, ``MEMSET_CHAR``)
+
+    * - 1 source
+      - ``dst[i] == src0[i]``
+
+    * - N sources
+      - ``dst[i] == src0[i] + src1[i] + ... + src(N-1)[i]``
+
+Source data preparation
+=======================
+
+Before any transfers run, TransferBench prepares the SRC and DST memories as described in the following sections.
+
+Expected source pattern (``PrepareReference``)
+-----------------------------------------------
+
+Before any transfers run, TransferBench builds reference SRC buffers on the host using
+``PrepareReference(cfg, cpuBuffer, bufferIdx)``.
+
+The pattern used depends on the configuration:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Configuration
+      - Behavior
+
+    * - ``fillCompress`` (non-empty)
+      - Mix of random floats with optional zeroing per 64-byte line:
+        ``0`` = random, ``1`` = 1B0, ``2`` = 2B0, ``3`` = 4B0, ``4`` = 32B0.
+        Percentages control the mix. For details, see
+        :ref:`data-validation-var`.
+
+    * - ``fillPattern`` (non-empty)
+      - Repeats the given ``vector<float>`` over all SRC buffers.
+
+    * - Default
+      - Pseudo-random: ``PrepSrcValue(bufferIdx, i) = (((i % 383) * 517) % 383 + 31) * (bufferIdx + 1)``
+
+        ``bufferIdx`` is the SRC index (0, 1, …) so each SRC buffer gets a different pattern.
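+
+As a minimal sketch, the default formula above can be transcribed into Python to see what the reference
+data looks like. The names ``prep_src_value`` and ``prepare_reference`` are illustrative stand-ins, not the
+TransferBench implementation, and the ``fillPattern`` and ``fillCompress`` modes are not modeled:
+
+.. code-block:: python
+
+  def prep_src_value(buffer_idx, i):
+      """Default source value for element i of SRC buffer buffer_idx."""
+      return float((((i % 383) * 517) % 383 + 31) * (buffer_idx + 1))
+
+
+  def prepare_reference(buffer_idx, num_elements):
+      """Hypothetical helper: build one host-side reference SRC buffer."""
+      return [prep_src_value(buffer_idx, i) for i in range(num_elements)]
+
+
+  src0 = prepare_reference(0, 4)  # first SRC buffer
+  src1 = prepare_reference(1, 4)  # second SRC buffer, scaled by (bufferIdx + 1) = 2
+
+Because of the ``i % 383`` term, the pattern repeats every 383 elements, and SRC buffer ``k`` holds the
+values of SRC buffer 0 multiplied by ``k + 1``.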
+
+Expected destination (``dstReference``)
+----------------------------------------
+
+The expected destination is computed once before the iteration loop:
+
+.. code-block:: text
+
+  dstReference[0] = memset to MEMSET_CHAR          (used when numSrcs == 0)
+  dstReference[1] = srcReference[0]                (1 source)
+  dstReference[2] = dstReference[1] + srcReference[1]  (2 sources)
+  dstReference[k] = dstReference[k-1] + srcReference[k-1]  (k sources)
+
+``dstReference[numSrcs]`` is the expected result for a transfer with ``numSrcs`` sources.
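+
+The running sum above can be sketched in Python as follows. This is an illustration under stated
+assumptions rather than the actual code: ``build_dst_reference`` is a hypothetical name, buffers are plain
+lists of floats, and ``memset_value`` stands in for whatever float value results from filling the buffer
+with ``MEMSET_CHAR``:
+
+.. code-block:: python
+
+  def build_dst_reference(src_reference, memset_value=0.0):
+      """Return a list where entry k is the expected DST for a transfer with k sources."""
+      num_elements = len(src_reference[0]) if src_reference else 0
+      # Entry 0: no sources, so the expectation is the memset fill (assumed value).
+      dst_reference = [[memset_value] * num_elements]
+      for k, src in enumerate(src_reference, start=1):
+          if k == 1:
+              # One source: the expected DST is a copy of the first SRC buffer.
+              dst_reference.append(list(src))
+          else:
+              # k sources: add SRC buffer k-1 to the expectation for k-1 sources.
+              prev = dst_reference[k - 1]
+              dst_reference.append([p + s for p, s in zip(prev, src)])
+      return dst_reference
+
+
+  refs = build_dst_reference([[31.0, 165.0], [62.0, 330.0], [93.0, 495.0]])
+
+Building each entry from the previous one means every possible source count is covered by a single pass
+over the SRC references, so validation only needs an index lookup.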
+
+Initializing source and destination memories
+---------------------------------------------
+
+For each transfer, the SRC memory on the rank that owns it is filled from the corresponding
+``srcReference`` buffer via ``hipMemcpy`` (host-to-device or device-to-device as appropriate).
+DST memory is zeroed (or filled with the memset value) before transfers run.
+
+How validation is timed
+========================
+
+The timing of validation is controlled by the ``alwaysValidate`` option. By default
+(``alwaysValidate = 0``), validation runs once after all timed iterations complete,
+minimizing overhead during benchmarking. When ``alwaysValidate = 1``, validation is
+performed after every iteration; any detected error immediately stops the run.
+
+.. list-table::
+    :header-rows: 1
+
+    * - Option
+      - When
+      - Behavior
+
+    * - ``alwaysValidate = 0`` (default)
+      - Once at the end of all iterations
+      - ``ValidateAllTransfers`` called after the iteration loop.
+
+    * - ``alwaysValidate = 1``
+      - After every timed iteration
+      - ``ValidateAllTransfers`` called inside the loop; any error stops the run.
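+
+The control flow described above can be sketched as follows. ``run_test``, ``run_iteration``, and
+``validate`` are hypothetical names used only for illustration; warm-up iterations and timing are not
+modeled:
+
+.. code-block:: python
+
+  def run_test(num_iterations, always_validate, run_iteration, validate):
+      """Run the iterations and return how many validation passes were performed."""
+      validations = 0
+      for iteration in range(num_iterations):
+          run_iteration(iteration)
+          if always_validate:
+              # alwaysValidate = 1: check after every iteration and stop on error.
+              validations += 1
+              if not validate():
+                  raise RuntimeError(f"validation failed after iteration {iteration}")
+      if not always_validate:
+          # alwaysValidate = 0: a single check once the loop has finished.
+          validations += 1
+          if not validate():
+              raise RuntimeError("validation failed after the final iteration")
+      return validations
+
+With five iterations, this sketch performs one validation pass when ``always_validate`` is false and five
+when it is true.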
+
+How validation (``ValidateAllTransfers``) works
+================================================
+
+For each transfer and each DST, the following steps are performed:
+
+1. **Rank check:** Only the rank that owns the destination performs validation.
+
+2. **Getting actual output:**
+
+   - **CPU destination** or ``validateDirect = 1``: Point directly at the destination memory.
+   - **GPU destination** and ``validateDirect = 0``: Copy destination to a host ``outputBuffer``
+     via ``hipMemcpy``, then compare against ``outputBuffer``.
+
+3. **Expected values:** Looked up as ``expected = dstReference[t.srcs.size()].data()``, which is the
+   precomputed sum for the transfer's number of sources.
+
+4. **Comparison:** Performed using ``memcmp(output, expected, numBytes)``. On mismatch, the code finds
+   the first differing index and returns an error with the index, expected value, and actual value.
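+
+The comparison step above can be sketched in Python. This is an illustration, not the TransferBench code:
+``validate_buffer`` is a hypothetical name, the buffers are lists of floats packed as 32-bit values, and a
+bytewise equality check stands in for ``memcmp``:
+
+.. code-block:: python
+
+  import struct
+
+
+  def validate_buffer(output, expected):
+      """Return None on a match, else (index, expected, actual) for the first mismatch."""
+      if len(output) != len(expected):
+          raise ValueError("buffers must have the same number of elements")
+      out_bytes = struct.pack(f"{len(output)}f", *output)
+      exp_bytes = struct.pack(f"{len(expected)}f", *expected)
+      # Fast path: one bytewise comparison over the whole buffer.
+      if out_bytes == exp_bytes:
+          return None
+      # Slow path: scan element by element to report the first difference.
+      size = struct.calcsize("f")
+      for i in range(len(output)):
+          if out_bytes[i * size:(i + 1) * size] != exp_bytes[i * size:(i + 1) * size]:
+              return (i, expected[i], output[i])
+      raise AssertionError("unreachable: the buffers differ but no element does")
+
+
+  result = validate_buffer([31.0, 0.0, 7.0], [31.0, 165.0, 299.0])
+
+The element scan only runs on the failure path, so a passing transfer costs a single bulk comparison.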
+
+Validation options
+==================
+
+The following options control when and how validation is performed. They can be set as
+environment variables or in a configuration file.
+
+.. list-table::
+    :header-rows: 1
+
+    * - Option
+      - Environment variable
+      - Description
+
+    * - ``alwaysValidate``
+      - ``ALWAYS_VALIDATE``
+      - To validate after each iteration, set to ``1``. To validate once at the end, set to ``0``.
+
+    * - ``validateDirect``
+      - ``VALIDATE_DIRECT``
+      - To compare GPU DST memory directly without a host copy, set to ``1`` (supported on AMD hardware only).
+        To copy to host and then compare, set to ``0``.
+
+    * - ``validateSource``
+      - ``VALIDATE_SOURCE``
+      - To validate the SRC memory right after it is initialized, set to ``1``. This is an optional early check.
+
+.. note::
+
+  ``validateDirect`` is not supported on NVIDIA hardware, where the code falls back to copying to host.
@@ -0,0 +1,146 @@
+.. meta::
+  :description: Explains how TransferBench measures performance at the test, executor, and transfer levels using HIP events and CPU wall-clock timing.
+  :keywords: TransferBench timing, TransferBench measurement, HIP events, CPU wall-clock, executor timing, transfer timing, overhead
+
+.. _transferbench-timing:
+
+====================
+TransferBench timing
+====================
+
+TransferBench measures performance at three nested levels: Test, Executor, and Transfer. Each level
+captures a different scope of elapsed time, and the timing method used depends on the executor type.
+
+Timing levels
+=============
+
+The following diagram illustrates the three levels of timing:
+
+.. image:: /data/timing.png
+  :width: 100%
+  :align: center
+
+The following table provides a quick summary of the three timing levels:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Timing level
+      - What it measures
+      - How it is timed
+
+    * - Test
+      - All Transfers across all executors and all ranks
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - Executor
+      - All Transfers that run on this executor
+      - Varies by executor type (see :ref:`timing-methods`)
+
+    * - Transfer
+      - A single Transfer
+      - Varies by executor type (see :ref:`timing-methods`)
+
+.. _timing-methods:
+
+Timing methods
+==============
+
+The timing method used for each executor and transfer depends on the executor type and the value of
+``USE_HIP_EVENTS``.
+
+Executor timing
+---------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Executor type
+      - Timing method
+
+    * - CPU
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - GFX / DMA
+      - For ``USE_HIP_EVENTS=1`` (default): HIP events (``hipEventElapsedTime``)
+
+        For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - NIC
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+Transfer timing
+---------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Executor type
+      - Timing method
+
+    * - CPU
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - GFX
+      - For ``USE_HIP_EVENTS=1`` (default): GPU wall-clock timestamp (``wall_clock64()``)
+
+        For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - DMA
+      - For ``USE_HIP_EVENTS=1`` (default): HIP events (``hipEventElapsedTime``)
+
+        For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - NIC
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
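+
+The two tables above can be condensed into a small lookup, sketched here in Python. The function
+``timing_method`` and its string arguments are hypothetical and exist only to summarize the tables; they
+are not part of TransferBench:
+
+.. code-block:: python
+
+  CPU_CLOCK = "std::chrono::high_resolution_clock"
+  HIP_EVENTS = "hipEventElapsedTime"
+  GPU_CLOCK = "wall_clock64()"
+
+
+  def timing_method(level, executor=None, use_hip_events=True):
+      """Return the timer listed in the tables for a timing level and executor type."""
+      if level == "test":
+          return CPU_CLOCK  # Test time is always CPU wall-clock.
+      if level not in ("executor", "transfer"):
+          raise ValueError(f"unknown timing level: {level}")
+      if executor not in ("CPU", "GFX", "DMA", "NIC"):
+          raise ValueError(f"unknown executor type: {executor}")
+      if executor in ("CPU", "NIC") or not use_hip_events:
+          return CPU_CLOCK
+      if level == "transfer" and executor == "GFX":
+          return GPU_CLOCK  # Per-transfer GFX timing uses GPU timestamps.
+      return HIP_EVENTS
+
+The only case where the Executor and Transfer levels differ is GFX with ``USE_HIP_EVENTS=1``, where the
+executor is timed with HIP events and each transfer with ``wall_clock64()``.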
+
+Overhead
+========
+
+Overhead is the difference between the total CPU wall-clock time (Test time) and the elapsed time of
+the slowest executor:
+
+.. code-block:: text
+
+  Overhead = Test Time - MAX(Executor 0 Time, Executor 1 Time, ...)
+
+Overhead captures scheduling and synchronization costs that fall outside of executor-measured time,
+such as barrier waits and thread management.
+
+Example output
+==============
+
+The following example shows TransferBench output for a test with two executors (CPU and GPU) and
+four transfers:
+
+.. code-block:: text
+
+  Test 1:
+  -------------------┬--------------┬------------┬-------------------┬--------------------
+  Executor: CPU 00   │  0.027 GB/s  │  77.492 ms │    2097152 bytes  │  4.489 GB/s (sum)
+  Executor 0 Time = 77.492 ms
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+      Transfer 0     │  4.476 GB/s  │   0.234 ms │    1048576 bytes  │  C0 -> C0:4 -> N
+  Transfer 0 Time =   0.234 ms
+      Transfer 1     │  0.014 GB/s  │  77.359 ms │    1048576 bytes  │  G0 -> C0:4 -> N
+  Transfer 1 Time =  77.359 ms
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+  Executor: GPU 00   │ 97.436 GB/s  │   0.689 ms │   67108864 bytes  │ 129.692 GB/s (sum)
+  Executor 1 Time = 0.689 ms
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+      Transfer 2     │ 80.886 GB/s  │   0.415 ms │   33554432 bytes  │  G0 -> G0:4 -> G0
+  Transfer 2 Time =   0.415 ms
+      Transfer 3     │ 48.807 GB/s  │   0.687 ms │   33554432 bytes  │  G0 -> G0:4 -> G1
+  Transfer 3 Time =   0.687 ms
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+  Aggregate (CPU)    │  0.891 GB/s  │  77.688 ms │   69206016 bytes  │  Overhead 0.197 ms
+  Test Time     = 77.688 ms
+  -------------------┴--------------┴------------┴-------------------┴--------------------
+  Overhead      = 77.688 - MAX(77.492, 0.689) = 0.197 ms
+
+In this example:
+
+- **Executor 0** (CPU) runs Transfers 0 and 1 and takes 77.492 ms (dominated by Transfer 1 at 77.359 ms).
+- **Executor 1** (GPU) runs Transfers 2 and 3 and takes 0.689 ms.
+- **Test Time** is 77.688 ms, measured by the CPU wall-clock across all executors.
+- **Overhead** is 0.197 ms, calculated as ``77.688 - MAX(77.492, 0.689)``.
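+
+As a quick check, the overhead formula can be applied to the printed values with a short Python sketch.
+``overhead_ms`` is a hypothetical helper name. The printed operands are rounded to three decimals, so
+subtracting them gives 0.196 ms rather than the reported 0.197 ms; the reported figure is presumably
+computed from the unrounded timings:
+
+.. code-block:: python
+
+  def overhead_ms(test_time_ms, executor_times_ms):
+      """Overhead = Test Time - MAX(executor times), all in milliseconds."""
+      return test_time_ms - max(executor_times_ms)
+
+
+  # Values taken from the example output above.
+  example = overhead_ms(77.688, [77.492, 0.689])
+  print(f"{example:.3f} ms")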