diff --git a/LICENSE.md b/LICENSE.md index eb08d9e8..3d5a69a6 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -1,4 +1,4 @@ -Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved. +Copyright (c) 2019-2026 Advanced Micro Devices, Inc. All rights reserved. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal diff --git a/docs/conceptual/transferbench-data-validation.rst b/docs/conceptual/transferbench-data-validation.rst new file mode 100644 index 00000000..622909b5 --- /dev/null +++ b/docs/conceptual/transferbench-data-validation.rst @@ -0,0 +1,158 @@ +.. meta:: + :description: Explains how TransferBench validates transfer correctness by comparing destination memory against precomputed expected values derived from source buffers. + :keywords: TransferBench data validation, TransferBench correctness, ValidateAllTransfers, PrepareReference, destination buffer, source buffer + +.. _transferbench-data-validation: + +============================== +TransferBench data validation +============================== + +TransferBench validates the transfer results by comparing the destination (DST) memory to +precomputed expected values. + +Overview +========= + +Validation verifies that for each transfer, the DST buffer contains the expected value: +the sum of all source (SRC) buffers (or zero when there are no sources). A transfer is correct if, for +every element ``i``, the value matches the expected value given in the following table: + +.. list-table:: + :header-rows: 1 + + * - Number of sources + - Expected value + + * - 0 sources + - ``dst[i] == 0`` (or memset value) + + * - 1 source + - ``dst[i] == src0[i]`` + + * - N sources + - ``dst[i] == src0[i] + src1[i] + ... 
+ srcN-1[i]`` + +Source data preparation +======================= + +Before any transfers run, TransferBench prepares the SRC and DST memories as discussed in the following sections: + +Expected source pattern (``PrepareReference``) +----------------------------------------------- + +Before any transfers run, TransferBench builds reference SRC buffers on the host using +``PrepareReference(cfg, cpuBuffer, bufferIdx)``. + +The pattern used depends on the configuration: + +.. list-table:: + :header-rows: 1 + + * - Configuration + - Behavior + + * - ``fillCompress`` (non-empty) + - Mix of random floats with optional zeroing per 64-byte line: + ``0`` = random, ``1`` = 1B0, ``2`` = 2B0, ``3`` = 4B0, ``4`` = 32B0. + Percentages control the mix. For details, see + :ref:`data-validation-var`. + + * - ``fillPattern`` (non-empty) + - Repeats the given ``vector`` over all SRC buffers. + + * - Default + - Pseudo-random: ``PrepSrcValue(bufferIdx, i) = (((i % 383) * 517) % 383 + 31) * (bufferIdx + 1)`` + + ``bufferIdx`` is the SRC index (0, 1, …) so each SRC buffer gets a different pattern. + +Expected destination (``dstReference``) +---------------------------------------- + +The expected destination is computed once before the iteration loop: + +.. code-block:: text + + dstReference[0] = memset to MEMSET_CHAR (used when numSrcs == 0) + dstReference[1] = srcReference[0] (1 source) + dstReference[2] = dstReference[1] + srcReference[1] (2 sources) + dstReference[k] = dstReference[k-1] + srcReference[k-1] (k sources) + +``dstReference[numSrcs]`` is the expected result for a transfer with ``numSrcs`` sources. + +Initializing source and destination memories +--------------------------------------------- + +For each transfer, the SRC memory on the rank that owns it is filled from the corresponding +``srcReference`` buffer via ``hipMemcpy`` (host-to-device or device-to-device as appropriate). +DST memory is zeroed (or memset) before transfers run. 
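The reference construction described above can be sketched as follows. This is a minimal illustration under stated assumptions, not TransferBench's implementation: the ``float`` element type, the ``BuildDstReference`` helper, and its ``memsetChar`` parameter (standing in for ``MEMSET_CHAR``, whose value is not given here) are hypothetical. Only the ``PrepSrcValue`` formula and the accumulation order come from the text.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Default source pattern, using the formula from the table above.
// The float element type is an assumption made for this sketch.
float PrepSrcValue(int bufferIdx, std::size_t i)
{
  return static_cast<float>((((i % 383) * 517) % 383 + 31) * (bufferIdx + 1));
}

// Hypothetical helper: builds dstReference[0..maxSrcs] for numElems elements.
// memsetChar stands in for MEMSET_CHAR; its real value is not stated in the text.
std::vector<std::vector<float>> BuildDstReference(int maxSrcs,
                                                  std::size_t numElems,
                                                  unsigned char memsetChar)
{
  std::vector<std::vector<float>> dstRef(maxSrcs + 1,
                                         std::vector<float>(numElems));

  // dstReference[0]: byte-wise memset, used when a transfer has no sources
  if (numElems > 0)
    std::memset(dstRef[0].data(), memsetChar, numElems * sizeof(float));

  // dstReference[1] = srcReference[0]
  // dstReference[k] = dstReference[k-1] + srcReference[k-1] for k >= 2
  for (int k = 1; k <= maxSrcs; ++k) {
    for (std::size_t i = 0; i < numElems; ++i) {
      float const prev = (k == 1) ? 0.0f : dstRef[k - 1][i];
      dstRef[k][i] = prev + PrepSrcValue(k - 1, i);
    }
  }
  return dstRef;
}
```

With this sketch, ``BuildDstReference(2, n, c)[2]`` holds the element-wise sum of the first two source patterns, which is what a two-source transfer would be compared against.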
+ +How validation is timed +======================== + +The timing of validation is controlled by the ``alwaysValidate`` option. By default +(``alwaysValidate = 0``), validation runs once after all timed iterations complete, +minimizing overhead during benchmarking. When ``alwaysValidate = 1``, validation is +performed after every iteration; any detected error immediately stops the run. + +.. list-table:: + :header-rows: 1 + + * - Option + - When + - Behavior + + * - ``alwaysValidate = 0`` (default) + - Once at the end of all iterations + - ``ValidateAllTransfers`` called after the iteration loop. + + * - ``alwaysValidate = 1`` + - After every timed iteration + - ``ValidateAllTransfers`` called inside the loop; any error stops the run. + +How validation (``ValidateAllTransfers``) works +================================================ + +For each transfer and each DST, the following steps are performed: + +1. **Rank check:** Only the rank that owns the destination performs validation. + +2. **Getting actual output:** + + - **CPU destination** or ``validateDirect = 1``: Point directly at the destination memory. + - **GPU destination** and ``validateDirect = 0``: Copy destination to a host ``outputBuffer`` + via ``hipMemcpy``, then compare against ``outputBuffer``. + +3. **Comparison:** Performed using ``memcmp(output, expected, numBytes)``. On mismatch, the code finds the first differing index and returns an error with the index, expected value, and actual value. + +4. **Expected values:** The expected buffer is ``dstReference[t.srcs.size()].data()``, the precomputed sum for the transfer's number of sources. + +Validation options +================== + +The following options control when and how validation is performed. They can be set as +environment variables or in a configuration file. + +.. 
list-table:: + :header-rows: 1 + + * - Option + - Environment variable + - Description + + * - ``alwaysValidate`` + - ``ALWAYS_VALIDATE`` + - To validate after each iteration, set to ``1``. To validate once at the end, set to ``0``. + + * - ``validateDirect`` + - ``VALIDATE_DIRECT`` + - To compare GPU DST directly, set to ``1``. Supported on AMD hardware only; no host copy. + To copy to host and compare, set to ``0``. + + * - ``validateSource`` + - ``VALIDATE_SOURCE`` + - To validate the SRC memory right after it's initialized, set to ``1``. (Optional early check). + +.. note:: + + ``validateDirect`` is not supported on NVIDIA. The code falls back to copying to host. diff --git a/docs/conceptual/transferbench-timing.rst b/docs/conceptual/transferbench-timing.rst new file mode 100644 index 00000000..bc6852e1 --- /dev/null +++ b/docs/conceptual/transferbench-timing.rst @@ -0,0 +1,146 @@ +.. meta:: + :description: Explains how TransferBench measures performance at the test, executor, and transfer levels using HIP events and CPU wall-clock timing. + :keywords: TransferBench timing, TransferBench measurement, HIP events, CPU wall-clock, executor timing, transfer timing, overhead + +.. _transferbench-timing: + +==================== +TransferBench timing +==================== + +TransferBench measures performance at three nested levels: Test, Executor, and Transfer. Each level +captures a different scope of elapsed time, and the timing method used depends on the executor type. + +Timing levels +============= + +The following diagram illustrates the three levels of timing: + +.. image:: /data/timing.png + :width: 100% + :align: center + +The following table provides a quick summary of the three timing levels: + +.. 
list-table:: + :header-rows: 1 + + * - Timing level + - What it measures + - How it is timed + + * - Test + - All Transfers across all executors and all ranks + - CPU wall-clock (``std::chrono::high_resolution_clock``) + + * - Executor + - All Transfers that run on this executor + - Varies by executor type (see :ref:`timing-methods`) + + * - Transfer + - A single Transfer + - Varies by executor type (see :ref:`timing-methods`) + +.. _timing-methods: + +Timing methods +============== + +The timing method used for each executor and transfer depends on the executor type and the value of +``USE_HIP_EVENTS``. + +Executor timing +--------------- + +.. list-table:: + :header-rows: 1 + + * - Executor type + - Timing method + + * - CPU + - CPU wall-clock (``std::chrono::high_resolution_clock``) + + * - GFX / DMA + - For ``USE_HIP_EVENTS=1`` (default): HIP events (``hipEventElapsedTime``) + + For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``) + + * - NIC + - CPU wall-clock (``std::chrono::high_resolution_clock``) + +Transfer timing +--------------- + +.. list-table:: + :header-rows: 1 + + * - Executor type + - Timing method + + * - CPU + - CPU wall-clock (``std::chrono::high_resolution_clock``) + + * - GFX + - For ``USE_HIP_EVENTS=1`` (default): GPU wall-clock timestamp (``wall_clock64()``) + + For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``) + + * - DMA + - For ``USE_HIP_EVENTS=1`` (default): HIP events (``hipEventElapsedTime``) + + For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``) + + * - NIC + - CPU wall-clock (``std::chrono::high_resolution_clock``) + +Overhead +======== + +Overhead is the difference between the total CPU wall-clock time (Test time) and the elapsed time of +the slowest executor: + +.. code-block:: text + + Overhead = Test Time - MAX(Executor 0 Time, Executor 1 Time, ...) 
+ +Overhead captures scheduling and synchronization costs that fall outside of executor-measured time, +such as barrier waits and thread management. + +Example output +============== + +The following example shows TransferBench output for a test with two executors (CPU and GPU) and +four transfers: + +.. code-block:: text + + Test 1: + -------------------┬--------------┬------------┬-------------------┬-------------------- + Executor: CPU 00 │ 0.027 GB/s │ 77.492 ms │ 2097152 bytes │ 4.489 GB/s (sum) + Executor 0 Time = 77.492 ms + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 0 │ 4.476 GB/s │ 0.234 ms │ 1048576 bytes │ C0 -> C0:4 -> N + Transfer 0 Time = 0.234 ms + Transfer 1 │ 0.014 GB/s │ 77.359 ms │ 1048576 bytes │ G0 -> C0:4 -> N + Transfer 1 Time = 77.359 ms + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: GPU 00 │ 97.436 GB/s │ 0.689 ms │ 67108864 bytes │ 129.692 GB/s (sum) + Executor 1 Time = 0.689 ms + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 2 │ 80.886 GB/s │ 0.415 ms │ 33554432 bytes │ G0 -> G0:4 -> G0 + Transfer 2 Time = 0.415 ms + Transfer 3 │ 48.807 GB/s │ 0.687 ms │ 33554432 bytes │ G0 -> G0:4 -> G1 + Transfer 3 Time = 0.687 ms + -------------------┼--------------┼------------┼-------------------┼-------------------- + Aggregate (CPU) │ 0.891 GB/s │ 77.688 ms │ 69206016 bytes │ Overhead 0.197 ms + Test Time = 77.688 ms + -------------------┴--------------┴------------┴-------------------┴-------------------- + Overhead = 77.688 - MAX(77.492, 0.689) = 0.197 ms + +In this example: + +- **Executor 0** (CPU) runs Transfers 0 and 1 and takes 77.492 ms (dominated by Transfer 1 at 77.359 ms). +- **Executor 1** (GPU) runs Transfers 2 and 3 and takes 0.689 ms. +- **Test Time** is 77.688 ms, measured by the CPU wall-clock across all executors. 
+- **Overhead** is 0.197 ms, calculated as ``77.688 - MAX(77.492, 0.689)``. diff --git a/docs/conceptual/transferbench-workflow.rst b/docs/conceptual/transferbench-workflow.rst new file mode 100644 index 00000000..d1f41eee --- /dev/null +++ b/docs/conceptual/transferbench-workflow.rst @@ -0,0 +1,117 @@ +.. meta:: + :description: Explains the TransferBench internal workflow, including how presets and config files feed into RunTransfers() and how transfers are executed and reported. + :keywords: TransferBench workflow, RunTransfers, TransferBench internals, TransferBench architecture, TransferBench conceptual + +.. _transferbench-workflow: + +======================= +TransferBench workflow +======================= + +Transfers enter the system through either presets or a configuration (config) file, both of which ultimately call +``RunTransfers()``, the main utility function within TransferBench. + +Entry points +============ + +.. list-table:: + :header-rows: 1 + + * - Source + - Flow + + * - Presets + - The client selects a preset (for example, ``p2p``, ``nicp2p``, or ``sweep``). + The preset builds a ``std::vector<Transfer>`` and calls + ``TransferBench::RunTransfers(cfg, transfers, results)``. + + * - Config file or command line + - The client parses transfers from a config file or command-line arguments via + ``ParseTransfers()``, then calls ``RunTransfers()``. + +RunTransfers workflow +===================== + +``RunTransfers()`` is the main entry point into the backend TransferBench library: + +.. 
code-block:: cpp + + /** + * Run a set of Transfers + * + * @param[in] config Configuration options + * @param[in] transfers Set of Transfers to execute + * @param[out] results Timing results + * @returns true if and only if Transfers were run successfully without any fatal errors + */ + bool RunTransfers(ConfigOptions const& config, + std::vector<Transfer> const& transfers, + TestResults& results); + +As shown in the following diagram, the function executes four sequential phases: + +.. image:: /data/workflow.png + :width: 100% + :align: center + +1. **Initial validation:** Checks that inputs are consistent and valid. +2. **Prepare transfers:** Allocates resources and initializes memory. +3. **Iteration loop:** Runs the timed transfer iterations. +4. **Finalize:** Validates results and assembles output. + +Initial validation +------------------ + +Here are the steps involved in the first phase: + +1. **Check ConfigOptions:** Verify that the provided ``ConfigOptions`` are valid. ``ConfigOptions`` control how TransferBench runs (for example, GFX unroll factor and number of warmup iterations). When running in multinode mode, consistency across ranks is also checked. + +2. **Check transfers:** Verify that the provided ``Transfers`` are properly specified. Checks include confirming that requested devices exist and that each transfer has an appropriate number of source (SRC) and destination (DST) endpoints. When running in multinode mode, consistency across ranks is also checked. + +3. **Log transfers (optional):** If ``TB_DUMP_CFG_FILE`` is set, log the transfers to a config file that can be re-executed by TransferBench. This is useful for capturing the exact transfers run by a preset so they can be modified and replayed. + +Prepare transfers +----------------- + +Here are the steps involved in the second phase: + +1. 
**Prepare executors:** Perform all executor-specific setup, such as creating HIP streams, allocating SRC and DST memory, and exchanging fabric handles for pod communication support. This step also divides the work across subexecutors. + +2. **Initialize memory:** Initializes SRC memory buffers with data patterns and computes reference results used for later validation. + + .. note:: + + For GPU SRC memory locations, data is copied onto the GPUs via DMA (``hipMemcpy``). If profiling, this copy appears as part of the profiling trace, which is why setting ``USE_INTERACTIVE`` = 1 is recommended when profiling. + +3. **Optional pause:** When ``USE_INTERACTIVE`` = 1, TransferBench pauses for user input after all memory has been initialized. Virtual addresses are printed at this point, which is useful for attaching a profiler before any transfers execute. + +Iteration loop +-------------- + +In the third phase, the iteration loop runs for the number of iterations specified by ``NUM_ITERATIONS``. +Each iteration proceeds through the following steps: + +1. **Barrier (pre):** Synchronizes all ranks before transfers begin, ensuring that every rank is ready before any rank starts executing transfers. + +2. **Start CPU timing:** Starts a CPU timer on the current rank, capturing the total elapsed time across all transfers on this rank. + +3. **Execute:** Spawns one CPU thread per executor. Each executor runs all the transfers it is assigned and is responsible for its own per-transfer timing. + +4. **Barrier (post):** Waits for all executors across all ranks to finish before proceeding. + +5. **Stop CPU timing:** Stops the CPU timer. + +6. **Validate (optional):** If ``ALWAYS_VALIDATE`` = 1, performs a correctness check after each iteration to verify that destination memory matches the expected reference results. + + .. note:: + + By default, validation runs only once after all iterations complete. 
Setting ``ALWAYS_VALIDATE`` = 1 validates after every iteration, which can help detect transient errors that would otherwise be masked by a passing final iteration. + +Finalize +-------- + +Here are the steps involved in the last phase: + +1. **Validate all transfers:** Checks all transfers to confirm that the DST memory matches the expected reference results computed during the Initialize Memory step. + +2. **Prepare results:** Collects timing data from each executor and assembles the final ``TestResults`` output returned to the caller. diff --git a/docs/data/a2a_MI300X.png b/docs/data/a2a_MI300X.png new file mode 100644 index 00000000..ba2d9680 Binary files /dev/null and b/docs/data/a2a_MI300X.png differ diff --git a/docs/data/a2a_MI350X.png b/docs/data/a2a_MI350X.png new file mode 100644 index 00000000..42948227 Binary files /dev/null and b/docs/data/a2a_MI350X.png differ diff --git a/docs/data/a2a_serialization.png b/docs/data/a2a_serialization.png new file mode 100644 index 00000000..01fcb02f Binary files /dev/null and b/docs/data/a2a_serialization.png differ diff --git a/docs/data/a2asweep_MI300X.png b/docs/data/a2asweep_MI300X.png new file mode 100644 index 00000000..0928dc1e Binary files /dev/null and b/docs/data/a2asweep_MI300X.png differ diff --git a/docs/data/a2asweep_MI350X.png b/docs/data/a2asweep_MI350X.png new file mode 100644 index 00000000..159aa30f Binary files /dev/null and b/docs/data/a2asweep_MI350X.png differ diff --git a/docs/data/nicrings.png b/docs/data/nicrings.png new file mode 100644 index 00000000..b8c966c7 Binary files /dev/null and b/docs/data/nicrings.png differ diff --git a/docs/data/nicrings_MI350X.png b/docs/data/nicrings_MI350X.png new file mode 100644 index 00000000..2d2ea021 Binary files /dev/null and b/docs/data/nicrings_MI350X.png differ diff --git a/docs/data/schmoo_MI300X.png b/docs/data/schmoo_MI300X.png new file mode 100644 index 00000000..b3490810 Binary files /dev/null and b/docs/data/schmoo_MI300X.png differ diff --git 
a/docs/data/schmoo_MI350X.png b/docs/data/schmoo_MI350X.png new file mode 100644 index 00000000..609cf595 Binary files /dev/null and b/docs/data/schmoo_MI350X.png differ diff --git a/docs/data/timing.png b/docs/data/timing.png new file mode 100644 index 00000000..8fd6b407 Binary files /dev/null and b/docs/data/timing.png differ diff --git a/docs/data/workflow.png b/docs/data/workflow.png new file mode 100644 index 00000000..32dd227c Binary files /dev/null and b/docs/data/workflow.png differ diff --git a/docs/how to/running-transferbench-customized.rst b/docs/how to/running-transferbench-customized.rst new file mode 100644 index 00000000..dafd2779 --- /dev/null +++ b/docs/how to/running-transferbench-customized.rst @@ -0,0 +1,131 @@ +.. meta:: + :description: How to run custom transfer tests using TransferBench by defining them in a configuration file or on the command line, including wildcard syntax and examples. + :keywords: TransferBench usage, TransferBench how to, TransferBench configuration file, TransferBench custom transfers, TransferBench wildcards + +.. _running-transferbench-customized: + +======================================== +Running custom tests using TransferBench +======================================== + +You can run custom transfer tests using TransferBench by defining them in a configuration file. This topic describes the configuration file format and how to run tests using a file or the command line. + +.. seealso:: + + :ref:`transfer-definition-syntax`: complete reference for simple and advanced mode syntax, memory and executor letter codes, and wildcards. + +Running TransferBench with a configuration file +================================================ + +To run TransferBench with a configuration file, use: + +.. code-block:: shell + + ./TransferBench <config_file> [num_bytes] + +The command accepts the following arguments: + +- ``config_file``: Path to a configuration file that defines the transfers to run. 
+ +- ``num_bytes``: Number of bytes per transfer. You can suffix this value with ``K``, ``M``, or ``G`` (for example, ``128M``). The value must be a multiple of 4. You can omit this argument. + +.. note:: + + If you set ``num_bytes`` to ``0``, TransferBench sweeps over transfer sizes from 1 KB to 512 MB in power-of-2 steps. Use ``SAMPLING_FACTOR`` to control how many sizes are sampled per power-of-2 range. For example, ``SAMPLING_FACTOR=4`` produces four evenly spaced sizes between each power of 2. + +Alternative modes +------------------ + +In addition to configuration file mode, TransferBench supports the following alternative modes: + +.. list-table:: + :header-rows: 1 + + * - Mode + - Usage + - Description + + * - ``cmdline`` + - ``./TransferBench cmdline [num_bytes] `` + - Defines transfers on the command line instead of in a file. + + * - ``dryrun`` + - ``./TransferBench dryrun [num_bytes] `` + - Parses and prints the expanded transfers without executing them. Use this mode to validate wildcard expansion. + +- Using ``cmdline`` mode: + + .. code-block:: shell + + ./TransferBench cmdline 1G "1 1 (G0->G0->G1)" + + Running the preceding command produces the same result as using a configuration file that contains the following line: + + .. code-block:: shell + + 1 1 (G0->G0->G1) + +- Using ``dryrun`` mode: + + .. code-block:: shell + + ./TransferBench dryrun 1G "1 1 (G0->G0->G*)" + + The output lists all transfers that would be executed: + + .. 
code-block:: shell + + ============================================================================================================= + Transfers to be executed (dry-run): + ================================================================================ + Transfer 0: (G0->G0->G0) + Transfer 1: (G0->G0->G1) + Transfer 2: (G0->G0->G2) + Transfer 3: (G0->G0->G3) + Transfer 4: (G0->G0->G4) + Transfer 5: (G0->G0->G5) + Transfer 6: (G0->G0->G6) + Transfer 7: (G0->G0->G7) + +Configuration file format +========================== + +Each line in a configuration file defines one test, which is a set of transfers that run in parallel. The following rules apply: + +- Blank lines are ignored. + +- Lines starting with ``#`` are treated as comments and are ignored. + +- Lines starting with ``##`` are echoed to the output. Use these for section headers. + +- Round brackets ``()`` and arrows ``->`` are optional and ignored. You can include them for readability. + +Example configuration file +============================ + +The following configuration file shows a range of transfer types using both simple and advanced modes: + +.. code-block:: shell + + # Single GPU-executed transfer between GPUs 0 and 1 using 4 CUs + 1 4 (G0->G0->G1) + + # Single DMA transfer between GPUs 0 and 1 + 1 1 (G0->D0->G1) + + # Advanced: 1 MB GPU 0 to GPU 1 with 4 CUs; 2 MB GPU 1 to GPU 0 with 8 CUs + -2 (G0 G0 G1 4 1M) (G1 G1 G0 8 2M) + + # Memset by GPU 0 to its own memory (null source) + 1 32 (N0->G0->G0) + + # Read-only by CPU 0 (null destination) + 1 4 (C0->C0->N0) + + # Broadcast from GPU 0 to GPUs 0 and 1 + 1 16 (G0->G0->G0G1) + + # Multi-rank: GPU 0 on rank 0 to GPU 1 on rank 1 + 1 4 (R0G0->R0G0->R1G1) + +For information on how to define transfers in the configuration file, including how to specify what memory to use, which executor runs the transfer, and how many subexecutors to use, see :ref:`transfer-definition-syntax`. 
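The line-handling rules listed under "Configuration file format" can be sketched as follows. This is a hedged illustration only: ``LineKind``, ``ClassifyConfigLine``, and ``StripDecorations`` are hypothetical names, treating whitespace-only lines as blank is an assumption, and replacing brackets and arrows with spaces is one possible way to ignore them, not necessarily how TransferBench parses its input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical classification of one configuration-file line.
enum class LineKind { Blank, Comment, Echo, Test };

LineKind ClassifyConfigLine(const std::string& line)
{
  // Assumption: a line containing only whitespace counts as blank.
  if (line.find_first_not_of(" \t\r\n") == std::string::npos)
    return LineKind::Blank;
  // "##" must be checked before "#", because echoed lines also start with '#'.
  if (line.compare(0, 2, "##") == 0)
    return LineKind::Echo;
  if (line[0] == '#')
    return LineKind::Comment;
  return LineKind::Test;
}

// Replaces round brackets and "->" arrows with single spaces, so that
// "(G0->G0->G1)" and "G0 G0 G1" yield the same whitespace-separated tokens.
// A lone '-' (as in the advanced-mode count "-2") is left untouched.
std::string StripDecorations(const std::string& line)
{
  std::string out;
  for (std::size_t i = 0; i < line.size(); ++i) {
    if (line[i] == '(' || line[i] == ')') {
      out += ' ';
    } else if (line[i] == '-' && i + 1 < line.size() && line[i + 1] == '>') {
      out += ' ';
      ++i;  // skip the '>' of the arrow
    } else {
      out += line[i];
    }
  }
  return out;
}
```

Splitting the stripped line on whitespace would then give the count, the subexecutor number, and the memory and executor tokens, regardless of whether the author used the decorated form.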
diff --git a/docs/how to/use-transferbench.rst b/docs/how to/use-transferbench.rst deleted file mode 100644 index b237a1c8..00000000 --- a/docs/how to/use-transferbench.rst +++ /dev/null @@ -1,215 +0,0 @@ -.. meta:: - :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs) - :keywords: TransferBench usage, TransferBench how to, TransferBench user guide, TransferBench user manual - -.. _using-transferbench: - ---------------------- -Using TransferBench ---------------------- - -You can control the SRC and DST memory locations by indicating the memory type followed by the device index. TransferBench supports the following memory types: - -* Coarse-grained pinned host -* Unpinned host -* Fine-grained host -* Coarse-grained global device -* Fine-grained global device -* Null (for an empty transfer) - -In addition, you can determine the size of the transfer (number of bytes to copy) for the tests. - -You can also specify transfer executors. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of Sub-Executors (SE). The number of SEs specifies the number of CPU threads in the case of a CPU executor and the number of compute units (CU) for a GPU executor. -For a DMA executor, the SE argument determines the number of streams to be used. - -You can specify the transfers in a configuration file or use preset configurations for transfers. - -Specifying transfers in a configuration file ----------------------------------------------- - -A transfer is defined as a single operation where an executor reads and adds together values from SRC memory locations, followed by writing the sum to the DST memory locations. -This simplifies to a copy operation when using a single SRC or DST. -Here's a copy operation from a single SRC to DST: - -.. 
code-block:: bash - - SRC 0 DST 0 - SRC 1 -> Executor -> DST 1 - SRC X DST Y - -Three executors are supported by TransferBench: - -.. code-block:: bash - - Executor: SubExecutor: - 1. CPU CPU thread - 2. GPU GPU threadblock/Compute Unit (CU) - 3. DMA N/A (Can only be used for a single SRC to DST copy) - -Each line in the configuration file defines a set of transfers, also known as a test, to run in parallel. - -There are two ways to specify a test: - -- **Basic** - - The basic specification assumes the same number of SEs used per transfer. - A positive number of transfers is specified, followed by the number of SEs and triplets describing each transfer: - - .. code-block:: bash - - Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL) - - The arguments used to specify transfers in the config file are described in the :ref:`arguments table `. - - **Example**: - - .. code-block:: bash - - 1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1 - 1 4 (C1->G2->G0) Uses 4 CUs on GPU2 to copy from CPU1 to GPU0 - 2 4 G0->G0->G1 G1->G1->G0 Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs - -- **Advanced** - - In the advanced specification, a negative number of transfers is specified, followed by quintuplets describing each transfer. - Specifying a non-zero number of bytes overrides any provided value. - - .. code-block:: bash - - Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL) - - The arguments used to specify transfers in the config file are described in the :ref:`arguments table `. - - **Example**: - - .. code-block:: bash - - -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1Mb from GPU0 to GPU1 with 4 SEs and 2Mb from GPU1 to GPU0 with 2 SEs - -Here is the list of arguments used to specify transfers in the config file: - -.. _config_file_arguments_table: - -.. 
list-table:: - :header-rows: 1 - - * - Argument - - Description - - * - Transfers - - Number of transfers to be run in parallel - - * - SE - - Number of SEs to use (CPU threads or GPU threadblocks) - - * - srcMemL - - Source memory locations (where the data is read) - - * - Executor - - | Executor is specified by a character indicating type, followed by the device index (0-indexed): - | - C: CPU-executed (indexed from 0 to NUMA nodes - 1) - | - G: GPU-executed (indexed from 0 to GPUs - 1) - | - D: DMA-executor (indexed from 0 to GPUs - 1) - - * - dstMemL - - Destination memory locations (where the data is written) - - * - bytesL - - | Number of bytes to copy (use command-line specified size when 0). - | Must be a multiple of four and can be suffixed with ('K','M', or 'G'). - | Memory locations are specified by one or more device characters or device index pairs. - | Characters indicate memory type and are followed by device index (0-indexed). - | Here are the characters and their respective memory locations: - | - C: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1]) - | - U: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1]) - | - B: Fine-grain host memory (on NUMA node, indexed from 0 to [NUMA nodes-1]) - | - G: Global device memory (on GPU device, indexed from 0 to [GPUs - 1]) - | - F: Fine-grain device memory (on GPU device, indexed from 0 to [GPUs - 1]) - | - N: Null memory (index ignored) - -Round brackets and arrows "->" can be included for human clarity, but will be ignored. -Lines starting with # are ignored while lines starting with ## are echoed to the output. 
- -**Transfer examples:** - -Single GPU-executed transfer between GPU 0 and 1 using 4 CUs:: - - 1 4 (G0->G0->G1) - -Single DMA-executed transfer between GPU 0 and 1:: - - 1 1 (G0->D0->G1) - -Copying 1Mb from GPU 0 to GPU 1 with 4 CUs, and 2Mb from GPU 1 to GPU 0 with 8 CUs:: - - -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M) - -"Memset" by GPU 0 to GPU 0 memory:: - - 1 32 (N0->G0->G0) - -"Read-only" by CPU 0:: - - 1 4 (C0->C0->N0) - -Broadcast from GPU 0 to GPU 0 and GPU 1:: - - 1 16 (G0->G0->G0G1) - -.. note:: - - Running TransferBench with no arguments displays usage instructions and detected topology information. - -Using preset configurations ------------------------------- - -Here is the list of preset configurations that can be used instead of configuration files: - -.. list-table:: - :header-rows: 1 - - * - Configuration - - Description - - * - ``a2a`` - - All-to-all benchmark test - - * - ``cmdline`` - - Allows transfers to run from the command line instead of a configuration file - - * - ``dryrun`` - - Lists the set of transfers to be executed as provided from the command line - - This is useful when using wildcards to ensure correctness - - * - ``healthcheck`` - - Simple health check (supported on AMD Instinct MI300 series only) - - * - ``nic_rings`` - - Measure performance of NICs set up in a ring across ranks - - * - ``p2p`` - - Peer-to-peer benchmark test - - * - ``pcopy`` - - Benchmark parallel copies from a single GPU to other GPUs - - * - ``rsweep`` - - Random sweep across possible sets of transfers - - * - ``rwrite`` - - Benchmark parallel remote writes from a single GPU to other GPUs - - * - ``scaling`` - - GPU subexecutor scaling tests - - * - ``schmoo`` - - Read, write, or copy operation on local or remote between two GPUs - - * - ``sweep`` - - Sweep across possible sets of transfers - -Performance tuning ---------------------- - -When you use the same GPU executor in multiple simultaneous transfers on separate streams by setting 
``USE_SINGLE_STREAM=0``, the performance might be serialized due to the maximum number of hardware queues available. -To improve the performance, adjust the number of maximum hardware queues using ``GPU_MAX_HW_QUEUES``. diff --git a/docs/index.rst b/docs/index.rst index 2e8b36df..98795e77 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -1,13 +1,30 @@ .. meta:: - :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs) - :keywords: Benchmarking utility, Memory transfers, Device transfers + :description: TransferBench documentation home. TransferBench is a utility for benchmarking simultaneous memory transfers between CPUs, GPUs, and NICs. + :keywords: TransferBench, benchmarking utility, memory transfers, GPU transfers, NIC transfers, multinode benchmark **************************** TransferBench documentation **************************** -TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs). A transfer is a single operation where an executor reads and adds values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations. -This simplifies to a simple copy operation when dealing with a single SRC or DST. +TransferBench is a utility for benchmarking simultaneous memory transfers between user-specified devices (CPUs, GPUs, and NICs). + +A memory transfer is a single operation where an Executor (EXE) reads and adds values from source (SRC) memory devices, then writes the sum to destination (DST) memory devices. When dealing with a single SRC or DST, a memory transfer is similar to a simple copy operation. The memory transfer is commonly denoted by the (SRC->EXE->DST) triplet. + +A Memory device consists of a location (a specific device that owns the memory) and a memory type (usually some attribute about the memory). 
For example, fine-grained HBM memory (memory type) on GPU 0 (location) or pinned CPU memory (memory type) on NUMA node 1 (location). + +TransferBench supports the following features: + +- **Multiple executors:** CPU threads, GPU compute kernels, GPU Direct Memory Access (DMA) or System DMA (SDMA), and Remote Direct Memory Access (RDMA) NIC or RNIC. Some Executors support SubExecutors, allowing further partitioning of the data to be transferred. + +- **Multi-input or multi-output (MIMO) transfers:** Element-wise sum from multiple SRCs to multiple DSTs. + +- **Multinode execution:** Using MPI or sockets across distributed systems. + +- **Flexible configuration:** Using Config files or presets for common benchmarks. + +- **Flexible hardware:** Supports HIP and CUDA programs that can run on both AMD and NVIDIA hardware. + +TransferBench provides a frontend client (the executable) and a backend library (the header-only TransferBench.hpp). The backend library can be used to integrate TransferBench into other custom applications. The code is open and hosted at ``_. @@ -18,13 +35,21 @@ The code is open and hosted at ``_. * :ref:`install-transferbench` - .. grid-item-card:: API reference + .. grid-item-card:: How to + + * :ref:`running-transferbench-customized` - * :ref:`transferbench-api` + .. grid-item-card:: Conceptual - .. grid-item-card:: How to + * :ref:`transferbench-workflow` + * :ref:`transferbench-timing` + * :ref:`transferbench-data-validation` + + .. grid-item-card:: Reference - * :ref:`using-transferbench` + * :ref:`running-presets` + * :ref:`environment-variables` + * :ref:`faq` To contribute to the documentation, refer to `Contributing to ROCm `_. diff --git a/docs/install/install.rst b/docs/install/install.rst index 4a44ff59..f243dc17 100644 --- a/docs/install/install.rst +++ b/docs/install/install.rst @@ -1,85 +1,409 @@ .. 
meta:: - :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs) - :keywords: Build TransferBench, Install TransferBench + :description: Instructions for installing TransferBench from source or using a package manager on supported platforms. + :keywords: Build TransferBench, Install TransferBench, TransferBench package manager, TransferBench source build .. _install-transferbench: ---------------------------- -Installing TransferBench ---------------------------- +----------------------- +Install TransferBench +----------------------- -This topic describes how to build TransferBench. +To install TransferBench, you have the following options: -Prerequisite ---------------- +- :ref:`Build from source ` -* Install ROCm stack on the system to obtain :doc:`HIP runtime ` -* Install ``libnuma`` on the system -* `Enable AMD IOMMU `_ and set to passthrough for AMD Instinct cards +- :ref:`Use package manager ` -Building TransferBench ------------------------- +.. _source-build: -Here are the steps to build TransferBench: +Building TransferBench from source +=================================== -1. Download the latest version of TransferBench from the git repository. +First, install the following required dependencies. - .. code-block:: bash +Required dependencies +---------------------- - git clone https://github.com/ROCm/TransferBench.git - cd TransferBench +* `ROCm stack `_ to obtain :doc:`HIP runtime `. -2. Build TransferBench using Makefile or CMake. + - The installed HIP version might impact support for some features, such as amd-smi pod membership detection, or UALoE support. - To build using Makefile, use: +* ``libnuma`` for allocating memory or spawning threads on correct NUMA nodes. - .. code-block:: bash + - For Ubuntu/Debian: - make + .. code-block:: shell - To build using CMake, use: + sudo apt install libnuma-dev - .. 
code-block:: bash + - For RHEL/CentOS: - mkdir build - cd build - CXX=/opt/rocm/bin/hipcc cmake .. - make + .. code-block:: shell -.. note:: + sudo yum install numactl-devel - If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately. +Optional dependencies +---------------------- - NIC executor support will be enabled if IBVerbs is detected and if ``infiniband/verbs.h`` is found in the default include path. - NIC executor support can be disabled explicitly by setting ``DISABLE_NIC_EXEC=1`` +Depending on your requirement, you can install these optional dependencies: - MPI support will be enabled if mpi.h is found in ``MPI_PATH/include/`` - MPI executor support can be disabled explicitly by setting ``DISABLE_MPI_COMM=1`` +- ``libibverbs``: Required for enabling NIC executor for RDMA transfers. -Building documentation ------------------------ + - For Ubuntu/Debian: + + .. code-block:: shell + + sudo apt install rdma-core libibverbs-dev ibverbs-utils + + - For RHEL/CentOS: + + .. code-block:: shell + + sudo yum install rdma-core libibverbs libibverbs-devel + +- MPI installation (any of the following) + + - ``OpenMPI``: + + - For Ubuntu/Debian: + + .. code-block:: shell + + sudo apt install openmpi-bin libopenmpi-dev + + - For RHEL/CentOS: + + .. code-block:: shell -To build documentation locally, use: + sudo yum install openmpi openmpi-devel -.. code-block:: bash + - ``MPICH``: - cd docs - pip3 install -r ./sphinx/requirements.txt - python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html + - For Ubuntu/Debian: -NVIDIA platform support --------------------------- + .. code-block:: shell -You can build TransferBench to run on NVIDIA platforms using native NVIDIA CUDA Compiler Driver (NVCC). + sudo apt-get install mpich libmpich-dev -To build with native NVCC, use: + - For RHEL/CentOS: -.. code-block:: bash + .. 
code-block:: shell + sudo yum install mpich mpich-devel + +You can build TransferBench from source using two methods: :ref:`Makefile ` and :ref:`CMake `. + +.. _makefile: + +Method 1: Building from source using Makefile +---------------------------------------------- + +To build TransferBench from source using Makefile, run: + +.. code-block:: shell + + git clone https://github.com/ROCm/TransferBench.git + cd TransferBench + make + +.. note:: + + By default, building with ``make`` only builds for the GPU detected on the machine being used for compilation. To specifically target GPU architectures to compile for, set ``GPU_TARGETS``. See :ref:`menv-var`. + +.. _menv-var: + +Makefile environment variables ++++++++++++++++++++++++++++++++ + +To modify the Makefile behavior, use the following environment variables: + +.. raw:: html + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
   <table>
     <thead>
       <tr><th>Category</th><th>Environment variable</th><th>Description</th><th>Default value</th></tr>
     </thead>
     <tbody>
       <tr><td rowspan="7">Paths and compilers - To customize which compiler to use or the library to link against.</td><td>ROCM_PATH</td><td>ROCm installation path for HIP compiler, includes, and libs.</td><td>/opt/rocm</td></tr>
       <tr><td>CUDA_PATH</td><td>CUDA installation path for NVCC when building TransferBenchCuda.</td><td>/usr/local/cuda</td></tr>
       <tr><td>MPI_PATH</td><td>MPI installation path (for mpi.h and MPI libraries).</td><td>/usr/local/openmpi</td></tr>
       <tr><td>HIPCC</td><td>HIP compiler. Falls back to hipcc, if not found.</td><td>$(ROCM_PATH)/bin/amdclang++</td></tr>
       <tr><td>NVCC</td><td>NVIDIA CUDA compiler (for building TransferBenchCuda)</td><td>$(CUDA_PATH)/bin/nvcc</td></tr>
       <tr><td>ROCM_DEVICE_LIB_PATH</td><td>Path to amdgcn bitcode. Auto-detected from the ROCm layout.</td><td>(auto)</td></tr>
       <tr><td>HIPCONFIG</td><td>Path to hipconfig, which is used to query the HIP version (for pod communication support check).</td><td>hipconfig</td></tr>
       <tr><td rowspan="5">Feature flags - To control enabling features that require compile-time support. By default, these are enabled under the right conditions.</td><td>DISABLE_NIC_EXEC</td><td>Disables NIC executor support.</td><td>0</td></tr>
       <tr><td>DISABLE_DMA_BUF</td><td>Disables DMA-BUF for GPU Direct RDMA. Requires NIC executor support.</td><td>1</td></tr>
       <tr><td>DISABLE_MPI_COMM</td><td>Disables MPI communication backend support for multinode TransferBench.</td><td>0</td></tr>
       <tr><td>DISABLE_AMD_SMI</td><td>Disables AMD-SMI pod membership checks.</td><td>0</td></tr>
       <tr><td>DISABLE_POD_COMM</td><td>Disables pod communication support (UALoE / MNNVL).</td><td>0</td></tr>
       <tr><td rowspan="3">Build options</td><td>SINGLE_KERNEL</td><td>To compile with a single GFX kernel (faster build, but fewer kernel variants), set to 1. Used mostly for development and debug.</td><td>0</td></tr>
       <tr><td>GPU_TARGETS</td><td>Comma-separated GPU architecture targets such as gfx942, gfx950.</td><td>native</td></tr>
       <tr><td>DEBUG</td><td>To build in debug mode with debug symbols (-O0, -g), set to 1. Otherwise, builds in release mode (-O3).</td><td>0</td></tr>
     </tbody>
   </table>
+
+ +.. _cmake: + +Method 2: Building from source using CMake +------------------------------------------- + +To build TransferBench from source using CMake, run: + +.. code-block:: shell + + git clone https://github.com/ROCm/TransferBench.git + cd TransferBench + mkdir build && cd build + cmake .. make -TransferBench looks for NVCC in ``/usr/local/cuda`` by default. To modify the location of NVCC, use environment variable `CUDA_PATH`: +CMake environment variables +++++++++++++++++++++++++++++ + +To modify the CMake behavior, use the following environment variables: + +.. raw:: html + +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
   <table>
     <thead>
       <tr><th>Category</th><th>Environment variable</th><th>Description</th><th>Default value</th></tr>
     </thead>
     <tbody>
       <tr><td rowspan="4">Paths and compilers - To customize which compiler to use or the library to link against.</td><td>ROCM_PATH</td><td>ROCm installation path.</td><td>/opt/rocm</td></tr>
       <tr><td>CMAKE_TOOLCHAIN_FILE</td><td>Toolchain file. Uses ROCM_PATH and CXX to select compiler.</td><td>toolchain-linux.cmake</td></tr>
       <tr><td>CXX</td><td>C++ compiler. If not set, amdclang++ or hipcc is used.</td><td>Taken from the toolchain</td></tr>
       <tr><td>MPI_PATH</td><td>Path to MPI installation. Takes priority over find_package(MPI).</td><td></td></tr>
       <tr><td rowspan="6">Build options (ON/OFF) - Pass -DVAR=value to set</td><td>BUILD_LOCAL_GPU_TARGET_ONLY</td><td>Builds only for the GPUs detected on the given machine using rocm_agent_enumerator.</td><td>OFF</td></tr>
       <tr><td>ENABLE_NIC_EXEC</td><td>Enables RDMA NIC executor.</td><td>OFF</td></tr>
       <tr><td>ENABLE_MPI_COMM</td><td>Enables MPI communicator as backbone for multinode TransferBench.</td><td>OFF</td></tr>
       <tr><td>ENABLE_DMA_BUF</td><td>Enables DMA-BUF for GPU Direct RDMA (requires NIC).</td><td>OFF</td></tr>
       <tr><td>ENABLE_AMD_SMI</td><td>Enables AMD-SMI pod membership queries.</td><td>OFF</td></tr>
       <tr><td>ENABLE_POD_COMM</td><td>Enables pod communication (HIP &gt;= 8.0).</td><td>OFF</td></tr>
       <tr><td rowspan="3">CMake cache variables</td><td>GPU_TARGETS</td><td>Semicolon-separated GPU architectures. Overridden if BUILD_LOCAL_GPU_TARGET_ONLY is ON</td><td>gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201;gfx1250</td></tr>
       <tr><td>AMD_SMI_EXECUTABLE</td><td>Path to amd-smi for AMD-SMI version check.</td><td>amd-smi</td></tr>
       <tr><td>HIPCONFIG_EXECUTABLE</td><td>Path to hipconfig for HIP version or pod check.</td><td>hipconfig</td></tr>
     </tbody>
   </table>
+
 + +.. note:: + + CMake uses an ``opt-in`` model for optional features (``OFF`` by default), whereas the Makefile uses an ``opt-out`` model (``ON`` by default). To set cache variables, pass ``-DVAR=value`` to CMake. + +**Example: building with MPI and NIC support** + +.. code-block:: shell + + git clone https://github.com/ROCm/TransferBench.git + cd TransferBench + mkdir build && cd build + cmake .. -DENABLE_NIC_EXEC=ON -DENABLE_MPI_COMM=ON + make + +Troubleshooting common build errors +------------------------------------ + +Here are some commonly encountered build errors and their fixes: + +- ``Could not find /opt/rocm/bin/amdclang++ or /opt/rocm/bin/hipcc. Check if the path is correct if you want to build TransferBench`` + + Occurs if HIP isn't installed correctly. If it is installed in a different directory, specify it using ``ROCM_PATH``. + +- ``Could not find standard C++ header 'cmath'`` + + Normally occurs if the standard C++ headers aren't installed. Try installing ``g++-12`` or ``g++-14`` based on the OS version. For example, ``apt-get install g++-12``. + +.. _package-manager: + +Installing TransferBench using package manager +=============================================== + +To install TransferBench using a package manager, install ROCm first and then run: + +.. code-block:: shell + + ## Install the transferbench-dev package + sudo apt-get install transferbench-dev + +This installs TransferBench to ``/opt/{rocm-version}/bin/TransferBench``. To verify the installed files, use: + +.. code-block:: shell + + dpkg -L transferbench-dev + +.. note:: + + The prepackaged installation doesn't include optional features such as NIC executor, MPI support, and pod support. + +Building TransferBenchCuda +=========================== + +To build TransferBenchCuda from the source code, install the required dependencies first. + +Required dependencies +---------------------- + +- CUDA: The installed CUDA version might impact support for some features such as MNNVL support.
+ +- libnuma: Used for allocating memory or spawning threads on the right NUMA nodes. Here are the install instructions based on the OS: + + - Ubuntu/Debian: + + .. code-block:: shell + + sudo apt install libnuma-dev + + - RHEL/CentOS: + + .. code-block:: shell + + sudo yum install numactl-devel + +Building TransferBenchCuda from source code +-------------------------------------------- + +To build TransferBenchCuda, run: -.. code-block:: bash +.. code-block:: shell - CUDA_PATH=/usr/local/cuda make + git clone https://github.com/ROCm/TransferBench.git + cd TransferBench + make TransferBenchCuda diff --git a/docs/reference/api.rst b/docs/reference/api.rst index 438696e6..ae88593a 100644 --- a/docs/reference/api.rst +++ b/docs/reference/api.rst @@ -1,6 +1,6 @@ .. meta:: - :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs) - :keywords: TransferBench library, TransferBench functions, Transferbench API, Transferbench interface + :description: API reference for the TransferBench backend library, including functions and interfaces exposed by the header-only TransferBench.hpp. + :keywords: TransferBench library, TransferBench functions, TransferBench API, TransferBench interface, TransferBench.hpp .. _transferbench-api: diff --git a/docs/reference/environment-variables.rst b/docs/reference/environment-variables.rst new file mode 100644 index 00000000..d6f63116 --- /dev/null +++ b/docs/reference/environment-variables.rst @@ -0,0 +1,413 @@ +.. meta:: + :description: Reference for TransferBench environment variables that control the frontend client, backend library, and runtime behavior for configuration-file and preset runs. + :keywords: TransferBench environment variables, TransferBench configuration, TransferBench client, TransferBench customization, TransferBench how to + +.. 
_environment-variables: + +==================================== +TransferBench environment variables +==================================== + +TransferBench behavior can be customized using environment variables. This topic describes the environment variables that control the TransferBench client (frontend), backend library, and runtime. + +.. note:: + + Environment variables read by the frontend client apply to both configuration-file runs and preset runs. Variables prefixed with ``TB_`` are read by the backend library and are not part of the frontend client. + +Frontend client environment variables +======================================= + +The following environment variables are read by the TransferBench client and apply to configuration-file runs and presets. + +General options +--------------- + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``NUM_ITERATIONS`` + - Number of timed iterations per test. If negative, runs for that many seconds instead. + - ``10`` + + * - ``NUM_SUBITERATIONS`` + - Sub-iterations per iteration. Set to ``0`` for infinite sub-iterations. + + Within each iteration, the transfer is repeated ``NUM_SUBITERATIONS`` times. This can reduce the impact of kernel launch latencies, but might over-emphasize cache reuse. + - ``1`` + + * - ``NUM_WARMUPS`` + - Untimed warmup iterations per test. + - ``3`` + + * - ``SHOW_BORDERS`` + - Shows ASCII box-drawing characters in tables. Set to ``1`` to show, ``0`` to hide. + - ``1`` + + * - ``SHOW_ITERATIONS`` + - Shows per-iteration timing. Set to ``1`` to show, ``0`` to hide. + - ``0`` + + * - ``USE_INTERACTIVE`` + - Specifies whether to pause for user input before the transfer loop. Set to ``1`` to pause, ``0`` to skip. + + The first pause occurs after memory allocations are prepared and before any transfers are executed. The second pause occurs after transfers are executed and before transfers are validated. 
This is useful for profiling: start profiling after the first pause, then capture data before validation begins. + - ``0`` + + * - ``HIDE_ENV`` + - Hides the environment variable listing. Set to ``1`` to hide, ``0`` to show. + - ``0`` + + * - ``OUTPUT_TO_CSV`` + - Generates results in CSV format. Set to ``1`` for CSV, ``0`` for human-readable output. + - ``0`` + + * - ``SAMPLING_FACTOR`` + - Affects auto-generated N values when N is ``0``. + - ``1`` + +.. _data-validation-var: + +Data and validation options +---------------------------- + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``ALWAYS_VALIDATE`` + - Specifies whether to validate after each iteration. Set to ``1`` to validate each iteration, ``0`` to validate once after all iterations. + + By default, validation is only done after all iterations, which can mask errors that occurred in all but the last iteration. + - ``0`` + + * - ``BLOCK_BYTES`` + - Granularity in bytes for dividing work across sub-executors. + - ``256`` + + * - ``BYTE_OFFSET`` + - Initial byte offset for allocations. Must be a multiple of 4. + - ``0`` + + * - ``FILL_COMPRESS`` + - Comma-separated percentages for 64-byte line fill across five bins: random, 1B0, 2B0, 4B0, and 32B0. + + This feature tests various compressible data patterns supported by XGMI. The integer values must sum to 100 and correspond to the following bins: + + .. list-table:: + :header-rows: 1 + + * - Bin + - Name + - Description + + * - 0 + - Random + - Random data + + * - 1 + - 1B0 + - The upper 1 byte of each 2 bytes in the 64-byte line is 0 + + * - 2 + - 2B0 + - The upper 2 bytes of each 4 bytes in the 64-byte line are 0 + + * - 3 + - 4B0 + - The upper 4 bytes of each 8 bytes in the 64-byte line are 0 + + * - 4 + - 32B0 + - The upper 32 bytes of each 64-byte line are 0 + + - — + + * - ``FILL_PATTERN`` + - Big-endian hex pattern for source data. Must have an even number of digits. 
Allows users to specify a particular data pattern. + - — + + * - ``VALIDATE_DIRECT`` + - Specifies whether to validate the GPU destination directly. Set to ``1`` to validate directly, ``0`` to validate via a CPU staging buffer. + + On AMD hardware, the CPU can directly access GPU device memory, avoiding the need for a staging buffer. This feature is not supported on NVIDIA hardware. + - ``0`` + + * - ``VALIDATE_SOURCE`` + - Specifies whether to validate the source immediately after preparation. Set to ``1`` to validate, ``0`` to skip. + + This was introduced to help debug issues where the initial copy of source data to the GPU didn't meet expectations, due to a hardware DMA issue. + - ``0`` + +.. _gfx-options: + +GFX and kernel options +----------------------- + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``GFX_BLOCK_ORDER`` + - Block ordering when running in multitransfer single-stream mode. ``0`` = sequential, ``1`` = interleaved, ``2`` = random. + + This controls how threadblocks are assigned to transfers. For example, with 4 transfers (A, B, C, D) each using 3 CUs: + + .. code-block:: shell + + Threadblock : 00 01 02 03 04 05 06 07 08 09 10 11 + ==================================================== + 0 = Sequential : A0 A1 A2 B0 B1 B2 C0 C1 C2 D0 D1 D2 + 1 = Interleaved: A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2 + 2 = Random : C1 D2 B1 B0 A0 C0 D1 D0 A1 C2 A2 B2 + + Use this setting to investigate how threadblock assignment to XCCs impacts performance. + - ``0`` + + * - ``GFX_BLOCK_SIZE`` + - Number of threads per threadblock. Must be a multiple of 64. + - ``256`` + + * - ``GFX_SE_TYPE`` + - Subexecutor granularity. ``0`` = threadblock, ``1`` = warp. + + By default, each subexecutor consists of one threadblock. Setting this to ``1`` makes each subexecutor consist of one warp instead. 
On some architectures such as AMD Instinct™ MI355X, this can impact performance, especially when used together with ``GFX_BLOCK_ORDER``. + - ``0`` + + * - ``GFX_TEMPORAL`` + - Controls how stores and loads are performed using non-temporal operations. ``0`` = none, ``1`` = loads only, ``2`` = stores only, ``3`` = both loads and stores. + - ``0`` + + * - ``GFX_UNROLL`` + - Unroll factor for the GFX kernel. Set to ``0`` for automatic selection. See :ref:`gfx-unroll`. + - (architecture-dependent) + + * - ``GFX_SINGLE_TEAM`` + - Subexecutor memory access mode. ``1`` = subexecutors operate on the full array, ``0`` = subexecutors operate on disjoint subarrays. + - ``1`` + + * - ``GFX_WAVE_ORDER`` + - Stride ordering for GFX waves. ``0`` = UWC, ``1`` = UCW, ``2`` = WUC, ``3`` = WCU, ``4`` = CUW, ``5`` = CWU. + - ``0`` + + * - ``GFX_WORD_SIZE`` + - Packed data size in DWORDs. ``4`` = DWORD x 4, ``2`` = DWORD x 2, ``1`` = DWORD x 1. + - ``4`` + + * - ``USE_HIP_EVENTS`` + - Timing method for GFX and DMA transfers. ``1`` = use HIP events, ``0`` = use CPU wall time. + - ``1`` + + * - ``USE_SINGLE_STREAM`` + - Stream assignment. ``1`` = one stream per GPU, ``0`` = one stream per transfer. + - ``1`` + + * - ``CU_MASK`` + - CU mask for streams, specified as a comma-separated list of indices or ranges (for example, ``5,10-12,14``). AMD only. + - — + + * - ``XCC_PREF_TABLE`` + - Preferred XCC per source-by-destination GPU pair, specified as a comma-separated list. Supported on AMD hardware only. + - — + +DMA options +----------- + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``USE_HSA_DMA`` + - DMA implementation. ``1`` = use ``hsa_amd_async_copy``, ``0`` = use ``hipMemcpy``. + - ``0`` + +Variable subexecutor options +------------------------------ + +.. 
list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``MIN_VAR_SUBEXEC`` + - Minimum number of subexecutors for variable subexecutor transfers. + - ``1`` + + * - ``MAX_VAR_SUBEXEC`` + - Maximum number of subexecutors. Set to ``0`` to use the device limit. + - ``0`` + +NIC options +----------- + +The following environment variables apply only when NIC support is enabled. + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``IB_GID_INDEX`` + - RoCE GID index. Set to ``-1`` for automatic selection. + - ``-1`` + + * - ``IB_PORT_NUMBER`` + - RDMA port number. + - ``1`` + + * - ``IP_ADDRESS_FAMILY`` + - IP address family. ``4`` = IPv4, ``6`` = IPv6. + - ``4`` + + * - ``NIC_CHUNK_BYTES`` + - Bytes per NIC RDMA chunk. + - ``1073741824`` + + * - ``NIC_RELAX_ORDER`` + - RDMA ordering. ``1`` = relaxed, ``0`` = strict. + - ``1`` + + * - ``ROCE_VERSION`` + - RoCE version. + - ``2`` + +Backend and runtime options (client-read) +------------------------------------------ + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``GPU_MAX_HW_QUEUES`` + - A HIP runtime environment variable that determines the maximum number of hardware queues each GPU has access to per process. When more than four GPU-executed transfers run simultaneously, they might serialize while waiting for available hardware queues. In this case, increase the value of this environment variable from the default ``4`` when ``USE_SINGLE_STREAM=0``. See :ref:`gpu-max-hw-queues`. + - ``4`` + +Backend environment variables +============================== + +The following environment variables are read by the TransferBench backend library or build system. They are not part of the frontend client and are prefixed with ``TB_``. + +Socket-related variables +------------------------ + +.. 
list-table:: + :header-rows: 1 + + * - Environment variable + - Description + + * - ``TB_RANK`` + - The process rank (0-based). Used for socket-based multinode runs. + + * - ``TB_NUM_RANKS`` + - Total number of processes. + + * - ``TB_MASTER_ADDR`` + - IP address of rank 0, used by the socket communicator. + + * - ``TB_MASTER_PORT`` + - Port for rank coordination. Default: ``29500``. + + * - ``TB_SINGLE_LOG`` + - When set, only rank 0 produces output. Useful for multinode socket mode. + +Backend variables +----------------- + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + + * - ``TB_VERBOSE`` + - Backend verbosity level. For example, set to ``1`` for extra logging. + + * - ``TB_DUMP_CFG_FILE`` + - Path of the configuration file used to dump executed transfers. + + This dumps all executed transfers to a configuration file that can then be re-executed by TransferBench. This can be used to capture the transfers executed by a preset, facilitating any further modifications and customizations. + + * - ``TB_PAUSE`` + - Pause before execution. Useful for attaching a debugger. + + For example: + + .. code-block:: shell + + > TB_PAUSE=1 ./TransferBench + # Pausing for debug attachment (PID: 2443974) + + > sudo gdb -p 2443974 + 5741 while (pause); + + set pause=false + continue + + * - ``TB_NIC_FILTER`` + - Regex pattern to limit visible NICs. Useful in preset scenarios that require homogeneous configurations. + + For example: + + .. code-block:: shell + + # Without filter: + > mlx5_0, mlx5_1, mlx5_2, mlx5_3, mlx5_4, mlx5_5, mlx5_6, mlx5_7 + + TB_NIC_FILTER="mlx5_1|mlx5_3" > mlx5_1, mlx5_3 + TB_NIC_FILTER="mlx5_[1,4,5]" > mlx5_1, mlx5_4, mlx5_5 + TB_NIC_FILTER="mlx5_[1-3,7]" > mlx5_1, mlx5_2, mlx5_3, mlx5_7 + TB_NIC_FILTER="mlx5_.*" > mlx5_0, mlx5_1, mlx5_2, mlx5_3, mlx5_4, mlx5_5, mlx5_6, mlx5_7 + + * - ``TB_DUMP_LINES`` + - Number of 64-byte lines to dump when debugging ``FILL_COMPRESS``. + + For example: + + .. 
code-block:: shell + + TB_DUMP_LINES=10 + + Input pattern 64B line statistics for bufferIdx 0: + Total lines: 16384 + - 0: Random : 3276 ( 19.995%) + - 1: 1B0 : 3277 ( 20.001%) + - 2: 2B0 : 3277 ( 20.001%) + - 3: 4B0 : 3277 ( 20.001%) + - 4: 32B0 : 3277 ( 20.001%) + + * - ``TB_FORCE_SINGLE_POD`` + - Forces single pod mode, skipping AMD-SMI and NVML pod queries. This assumes that all GPUs are in the same pod and skips cluster membership API calls. + +HSA runtime variables +--------------------- + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + + * - ``HSA_ENABLE_SDMA`` + - Enables SDMA when set to ``1``. To disable, set to ``0``. + + This is a HIP runtime environment variable. When SDMA is disabled, the DMA executor falls back to using blit kernels (GFX) internally. diff --git a/docs/reference/faq.rst b/docs/reference/faq.rst new file mode 100644 index 00000000..ba970090 --- /dev/null +++ b/docs/reference/faq.rst @@ -0,0 +1,428 @@ +.. meta:: + :description: Frequently asked questions about TransferBench, covering common errors, warnings, and configuration issues including IOMMU, memory types, and XGMI. + :keywords: TransferBench FAQ, TransferBench errors, TransferBench warnings, IOMMU, GPU_MAX_HW_QUEUES, GFX_UNROLL, validation, XGMI, UALoE, memory types + +.. _faq: + +========================== +Frequently asked questions +========================== + +This topic answers common questions about TransferBench errors, warnings, features, environment variables, and presets. + +Error and warning messages +=========================== + +This section describes common TransferBench error and warning messages and how to resolve them. + +[ERROR] Unexpected mismatch at index... +----------------------------------------- + +TransferBench validates each transfer to ensure that data has been moved correctly. This +error indicates that the destination (DST) memory doesn't match the expected value. + +For example: + +.. 
code-block:: text + + [ERROR] Transfer 0: Unexpected mismatch at index 0 of destination 0 on rank 0: Expected 31.00000 Actual: 0.00000 + +In this example, the first element of the DST memory was expected to hold +``31.00000`` but actually contained ``0.00000``. + +This error is generally not a TransferBench issue. It's usually a sign of a system +configuration problem. + +Common causes include: + +- Improperly configured IOMMU +- A ROCm runtime and driver version mismatch + +IOMMU must be set to pass-through mode in the BIOS. To verify, check for ``iommu=pt`` +in the kernel command line: + +.. code-block:: shell + + # Check for iommu=pt in the output + cat /proc/cmdline + + BOOT_IMAGE=/boot/vmlinuz-5.15.0-70-generic root=UUID=7489cc43-aaab-4b61-8c63-86a419728dea + ro panic=0 nowatchdog msr.allow_writes=on nokaslr amdgpu.noretry=1 pci=realloc=off + modprobe.blacklist=amdgpu intel_iommu=on iommu=pt numa_balancing=disable console=tty0 + console=ttyS0,115200n8 + +For IOMMU configuration guidance, see +`AMD Instinct MI300X system optimization `_. + +.. _gpu-max-hw-queues: + +[WARN] ... attempting X parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 +------------------------------------------------------------------------------------- + +The HIP runtime limits the number of independent hardware queues each GPU can use per +process. This limit is controlled by the ``GPU_MAX_HW_QUEUES`` environment variable. For +more information, see +`ROCm environment variables `_. + +When the number of transfers requiring hardware queues exceeds the configured limit, +those transfers serialize instead of running in parallel. TransferBench detects this +condition and issues this warning. + +This commonly occurs with DMA-executed transfers, because each DMA transfer requires one +hardware queue. It is frequently seen when running the :ref:`all-to-all preset `. + +To resolve this, set ``GPU_MAX_HW_QUEUES`` to a value greater than the number of +transfers. 
It is recommended to set at least one extra queue beyond the number of +transfers. + +The following examples show the effect on an 8-GPU system running the all-to-all preset +with DMA execution enabled. + +Without setting ``GPU_MAX_HW_QUEUES``: + +.. code-block:: shell + + USE_DMA_EXEC=1 ./TransferBench a2a + + ... + GPU-DMA All-To-All benchmark: + ============================== + [268435456 bytes per Transfer] [DMA:8] [1 Read(s) 1 Write(s)] [MemType:uncached GPU] [NIC QueuePairs:0] [#Ranks:1] + + Average bandwidth (GPU Timed): 60.952 GB/s + Aggregate bandwidth (GPU Timed): 3413.290 GB/s + Aggregate bandwidth (CPU Timed): 1338.252 GB/s + [WARN] DMA 0 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 + [WARN] DMA 1 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 + [WARN] DMA 2 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 + [WARN] DMA 3 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 + [WARN] DMA 4 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 + [WARN] DMA 5 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 + [WARN] DMA 6 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 + [WARN] DMA 7 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4 + +Setting ``GPU_MAX_HW_QUEUES=8``: + +.. code-block:: shell + + GPU_MAX_HW_QUEUES=8 USE_DMA_EXEC=1 ./TransferBench a2a + + ... + GPU-DMA All-To-All benchmark: + ============================== + [268435456 bytes per Transfer] [DMA:8] [1 Read(s) 1 Write(s)] [MemType:uncached GPU] [NIC QueuePairs:0] [#Ranks:1] + + Average bandwidth (GPU Timed): 60.091 GB/s + Aggregate bandwidth (GPU Timed): 3365.111 GB/s + Aggregate bandwidth (CPU Timed): 2222.415 GB/s + +.. note:: + + Individual transfer bandwidths are similar in both cases because each transfer is timed + from when it starts. 
However, the CPU wall-clock time is nearly double in the + ``GPU_MAX_HW_QUEUES=4`` case, because serialized transfers complete one after another + instead of running in parallel. + +Feature questions +================== + +This section answers common questions about TransferBench features and behavior. + +Can TransferBench target a specific UALoE station? +---------------------------------------------------- + +No. TransferBench has no direct control over which Unified Accelerator Link over Ethernet +(UALoE) station gets used, and doesn't have any knowledge of which station is selected. + +Does TransferBench perform any validation? +------------------------------------------- + +Yes. TransferBench initializes source data buffers with a pattern (which can be +user-specified), then checks that destination data buffers contain the expected result +after each transfer completes. For details, see :ref:`transferbench-data-validation`. + +Does TransferBench alter underlying XGMI speeds when it runs? +-------------------------------------------------------------- + +No. TransferBench runs on the current hardware settings and doesn't modify them. + +To query current XGMI settings on AMD Instinct machines, use ``amd-smi xgmi``: + +.. code-block:: shell + + amd-smi xgmi + + LINK METRIC TABLE: + bdf bit_rate max_bandwidth link_type GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 + GPU0 0000:0c:00.0 38 Gb/s 608 Gb/s XGMI + Read N/A 39.61 TB 15.40 TB 15.47 TB 5.349 TB 4.993 TB 5.078 TB 5.952 TB + Write N/A 41.96 TB 15.32 TB 15.00 TB 5.332 TB 4.859 TB 4.979 TB 5.448 TB + +Environment variable questions +================================ + +This section answers common questions about TransferBench environment variables. + +.. _gfx-unroll: + +What is the GFX unroll factor? 
+--------------------------------
+
+Specifying an unroll factor of X means that each GPU thread reads X pieces of source data
+into registers, then writes those X pieces of data out to the destination, as shown in the following table:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Instruction order
+     - Unroll 1
+     - Unroll 2
+     - Unroll 4
+
+   * - 1
+     - READ [A]
+     - READ [A]
+     - READ [A]
+
+   * - 2
+     - WRITE [A]
+     - READ [B]
+     - READ [B]
+
+   * - 3
+     - READ [B]
+     - WRITE [A]
+     - READ [C]
+
+   * - 4
+     - WRITE [B]
+     - WRITE [B]
+     - READ [D]
+
+   * - 5
+     - READ [C]
+     - READ [C]
+     - WRITE [A]
+
+   * - 6
+     - WRITE [C]
+     - READ [D]
+     - WRITE [B]
+
+   * - 7
+     - READ [D]
+     - WRITE [C]
+     - WRITE [C]
+
+   * - 8
+     - WRITE [D]
+     - WRITE [D]
+     - WRITE [D]
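
The ordering in the table above can be reproduced with a short Python sketch. The helper ``instruction_order`` is hypothetical and isn't part of TransferBench; it models only the read/write issue order for one thread, not the actual GFX kernel.

```python
def instruction_order(chunks, unroll):
    """Return the READ/WRITE issue order for one thread under a given unroll factor."""
    order = []
    for start in range(0, len(chunks), unroll):
        group = chunks[start:start + unroll]
        # Issue all reads for the group first, so several loads are in flight at once.
        order += [f"READ [{c}]" for c in group]
        # Then write the same pieces out to the destination.
        order += [f"WRITE [{c}]" for c in group]
    return order


for unroll in (1, 2, 4):
    print(f"Unroll {unroll}:", ", ".join(instruction_order("ABCD", unroll)))
```

A larger ``unroll`` value groups more reads ahead of the first write, which is where the extra register usage comes from: every piece read in a group must be held until its write is issued.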
+
+Having more reads in flight can reduce write stalls. However, a higher unroll factor also
+increases register pressure because more intermediate values must be held simultaneously.
+
+The following example assumes that a read takes four units of time to arrive, after which the write can be issued. The example also assumes that the link hasn't reached capacity.
+
+.. raw:: html
+
Unroll 1AABBCCDD
Unroll 2ABABCDCD
Unroll 4ABCDABCD
+ +The measured effect of unroll factor varies by transfer type. The following table shows +example bandwidth values (in GB/s): + +.. list-table:: + :header-rows: 1 + + * - ``GFX_UNROLL`` + - Local copy with 4 CUs (``1 4 G0->G0->G0``) + - Remote 1 subexecutor copy (``1 1 G0->G0->G1``) + + * - 1 + - 20.297 + - 20.297 + + * - 2 + - 37.669 + - 36.599 + + * - 3 + - 48.781 + - 48.439 + + * - 4 + - 62.887 + - 59.407 + + * - 5 + - 74.076 + - 44.100 + + * - 6 + - 84.769 + - 59.386 + + * - 7 + - 95.074 + - — + + * - 8 + - 101.101 + - — + +For the remote copy case, performance doesn't scale monotonically beyond unroll 4 because +the link becomes the bottleneck rather than register occupancy. + +To configure the unroll factor, see :ref:`GFX_UNROLL environment variable `. + +Preset questions +================= + +This section answers common questions about TransferBench presets. + +.. _mem-type: + +What memory types do presets support? +--------------------------------------- + +Some TransferBench presets use the ``MEM_TYPE`` environment variable (or CPU- and +GPU-specific variants) to select the memory type used during the transfer. The following +table lists the supported memory types based on CPU or GPU: + +.. 
list-table:: + :header-rows: 1 + + * - Memory device + - Memory type index + - Description + - Symbol + - Allocation method + + * - CPU + - 0 + - Default pinned host memory + - ``C`` + - ``hipHostMalloc`` + + * - CPU + - 1 + - Coherent pinned host memory + - ``B`` + - ``hipHostMalloc`` with ``hipHostMallocCoherent`` flag + + * - CPU + - 2 + - Non-coherent pinned host memory + - ``D`` + - ``hipHostMalloc`` with ``hipHostMallocNonCoherent`` flag + + * - CPU + - 3 + - Uncached pinned host memory + - ``K`` + - ``hipHostMalloc`` with ``hipHostMallocUncached`` flag + + * - CPU + - 4 + - Unpinned host memory + - ``H`` + - ``numa_alloc_onnode`` + + * - GPU + - 0 + - Default GPU memory + - ``G`` + - ``hipMalloc`` + + * - GPU + - 1 + - Fine-grained GPU memory + - ``F`` + - ``hipExtMallocWithFlags`` with ``hipDeviceMallocFinegrained`` + + * - GPU + - 2 + - Uncached GPU memory + - ``U`` + - ``hipExtMallocWithFlags`` with ``hipDeviceMallocUncached`` + + * - GPU + - 3 + - Managed memory + - ``M`` + - ``hipMallocManaged`` diff --git a/docs/reference/presets.rst b/docs/reference/presets.rst new file mode 100644 index 00000000..58c42728 --- /dev/null +++ b/docs/reference/presets.rst @@ -0,0 +1,1722 @@ +.. meta:: + :description: Reference for TransferBench presets, including all-to-all, peer-to-peer, NIC rings, sweep, and scaling tests with supported environment variables and example outputs. + :keywords: TransferBench presets, TransferBench a2a, TransferBench p2p, TransferBench nicrings, TransferBench nicp2p, TransferBench sweep, TransferBench scaling + +.. _running-presets: + +======================= +TransferBench presets +======================= + +Presets are a predefined series of Transfers that can be used instead of manually configuring the Transfers. + +The following table lists the presets available on TransferBench 1.66.03: + +.. 
list-table::
+   :header-rows: 1
+
+   * - Preset name
+     - Description
+     - Multinode support
+
+   * - :ref:`All-to-all preset (a2a) <a2a>`
+     - Tests parallel transfers between all pairs of GPU devices.
+     - ✅
+
+   * - :ref:`All-to-all via nearest NIC preset (a2a_n) <a2a_n>`
+     - Tests parallel transfers between all pairs of GPU devices using nearest NIC RDMA.
+     - ❌
+
+   * - :ref:`All-to-all sweep preset (a2asweep) <a2asweep>`
+     - Performs a parameter sweep of GFX-based all-to-all transfers across different subexecutor counts, unroll factors, and thread block sizes.
+     - ❌
+
+   * - :ref:`NIC rings preset (nicrings) <nicrings>`
+     - Tests NIC rings created across identical NIC indices across ranks.
+     - ✅
+
+   * - :ref:`NIC peer-to-peer preset (nicp2p) <nicp2p>`
+     - Tests multinode peer-to-peer RDMA transfer between all NICs across all ranks.
+     - ✅
+
+   * - :ref:`One-to-all preset (one2all) <one2all>`
+     - Tests all subsets of parallel transfers from one GPU to the others.
+     - ❌
+
+   * - :ref:`Peer-to-peer preset (p2p) `
+     - Tests unidirectional and bidirectional transfers for CPU-to-CPU, CPU-to-GPU, and GPU-to-GPU combinations.
+     - ❌
+
+   * - :ref:`Scaling preset (scaling) `
+     - Runs a scaling test from one GPU to all other devices (CPUs and GPUs).
+     - ❌
+
+   * - :ref:`Schmoo preset (schmoo) `
+     - Runs scaling tests for local and remote read, write, and copy operations between two GPUs.
+     - ❌
+
+   * - :ref:`Sweep or random sweep preset (sweep/rsweep) `
+     - Tests combinations of source (SRC), executor, and destination (DST) with varying parallelism.
+     - ❌
+
+.. note::
+
+   You can modify a preset using environment variables, which are listed when you run the preset.
+
+.. _a2a:
+
+All-to-all preset (a2a)
+========================
+
+The a2a preset tests parallel transfers between all pairs of GPU devices. It measures bidirectional bandwidth across every GPU-to-GPU combination on a single-node or multinode system. It supports GFX (compute kernel) and DMA all-to-all transfers, and can optionally run NIC executor rings in parallel.
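
As a rough illustration of what "all pairs of GPU devices" means, the following Python sketch enumerates the SRC and DST GPU pairs of an all-to-all pattern on one node. The function ``a2a_pairs`` and its ``include_local`` parameter are hypothetical names used only for this example; the sketch ignores link-topology filtering and isn't TransferBench's implementation.

```python
from itertools import product


def a2a_pairs(num_gpus, include_local=False):
    """List (src, dst) GPU index pairs for an all-to-all pattern on one node."""
    return [
        (src, dst)
        for src, dst in product(range(num_gpus), repeat=2)
        # Local (i -> i) transfers are skipped unless explicitly requested.
        if include_local or src != dst
    ]


pairs = a2a_pairs(8)
print(len(pairs))  # 56: each of the 8 GPUs sends to its 7 peers
```

On an 8-GPU node this gives 56 transfers, or 7 outgoing transfers per GPU.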
+ +**Key features:** + +- **GFX/DMA mode:** Creates transfers for every (src GPU to dst GPU) pair on each rank. Optionally restricts to directly connected XGMI links (A2A_DIRECT=1). + +- **Transfer modes:** Copy (1 src → 1 dst), read-only (1 src → null), write-only (null → 1 dst), or custom (numSrcs:numDsts). + +- **NIC rings:** When ``NUM_QUEUE_PAIRS`` > 0, adds NIC-based ring transfers (GPU i → GPU (i+1)%N) using nearest-NIC RDMA. + +- Prints a SRC x DST bandwidth matrix with row or column totals, aggregate bandwidth, and min/max/avg across ranks for multinode system. + +- Forces ``USE_SINGLE_STREAM=1`` for all-to-all. + +- **On AMD hardware:** ``A2A_DIRECT=1`` uses ``hipExtGetLinkTypeAndHopCount`` to skip non-direct XGMI pairs. + +- **Multinode:** Each rank must have the same number of GPUs. Differences in the NIC configuration across ranks produce a warning. + +**Usage:** + +.. code-block:: shell + + ./TransferBench a2a [numBytes] + +Environment variables +---------------------- + +To modify the behavior of a2a preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``A2A_DIRECT`` + - To use only directly connected XGMI links (hop count = 1). 0 = full all-to-all. This can be useful on older MI2XX hardware that doesn't feature full all-to-all XGMI connectivity, and running the standard all-to-all between all pairs of GPUs ends up utilizing XGMI links more than once. + - ``1`` + + * - ``A2A_LOCAL`` + - To include local transfers (i→i). 0 = exclude, 1 = include. + - ``0`` + + * - ``A2A_MODE`` + - Transfer mode: 0=Copy, 1=Read-Only, 2=Write-Only, or numSrcs:numDsts for custom. Systems with multiple sources or destinations mimic the behavior of some collective algorithms such as RingReduce, which sometimes require reading from two local buffers, adding them together, then writing to a local output buffer and remote temp buffer. 
+ - ``0`` + + * - ``GFX_UNROLL`` + - GFX kernel unroll factor. Overrides global default. See :ref:`gfx-unroll`. + - ``2`` + + * - ``MEM_TYPE`` + - GPU memory type: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`. + - ``2`` + + * - ``NUM_GPU_DEVICES`` + - Number of GPUs to use. + - (detected) + + * - ``NUM_QUEUE_PAIRS`` + - Queue pairs per NIC transfer. 0 = no NIC rings. + - ``0`` + + * - ``NUM_RESULTS`` + - Shows top or bottom N results per cell for multinode. Default = 1 if numRanks > 1. + - ``0`` or ``1`` + + * - ``NUM_SUB_EXEC`` + - Sub-executors (CUs or WGPs) per transfer. + - ``8`` + + * - ``SHOW_DETAILS`` + - Shows full results per transfer. + - ``0`` + + * - ``USE_DMA_EXEC`` + - To use DMA executor instead of GFX. Valid only for A2A_MODE=0 (copy). + - ``0`` + + * - ``USE_FINE_GRAIN`` + - To use MEM_TYPE. + - (deprecated) + + * - ``USE_REMOTE_READ`` + - To use DST GPU as executor (remote read) instead of SRC GPU (local read). + - ``0`` + +Example output +--------------- + +.. tab-set:: + + .. tab-item:: AMD Instinct™ MI300X + + .. image:: /data/a2a_MI300X.png + :width: 100% + :align: center + + .. tab-item:: AMD Instinct™ MI350X + + .. image:: /data/a2a_MI350X.png + :width: 100% + :align: center + +The table in the output shows the transfer rate for each pair of GPUs, as measured using GPU timestamps. + +- ``STotal``: Indicates the total send bandwidth as a sum of SRC GPU's bandwidth. + +- ``RTotal``: Indicates the total receive bandwidth as a sum of DST GPU's bandwidth. + +- ``Actual``: Reflects the actual time for the kernel to finish executing the slowest transfer. Because one GFX kernel is launched to handle all Transfers to other GPUs, the kernel doesn't finish until the slowest transfer completes. + +- ``CPU Timed``: Measures all the Transfers. + +.. note:: + + To rule out any possibility of serialization, check if the CPU Timed bandwidth is close to the aggregate GPU Timed bandwidth. 
+ + To avoid serialization when running with DMA executor, increase the number of hardware queues available. + + As the following output shows, ``GPU_MAX_HW_QUEUES`` defaults to just 4 if not set: + + .. image:: /data/a2a_serialization.png + :width: 100% + :align: center + + Although TransferBench issues a warning ``[WARN] DMA 0 attempting n parallel transfers, however GPU_MAX_HW_QUEUES only set to 4``, the hardware queue insufficiency can also be noticed by the large discrepancy between CPU Timed aggregate bandwidth and GPU timed aggregated bandwidth. + +.. _a2a_n: + +All-to-all via nearest NIC preset (a2a_n) +========================================== + +The a2a_n preset tests parallel transfers between all pairs of GPU devices using nearest NIC RDMA. Each transfer uses the NIC closest to the SRC GPU to send to the NIC closest to the DST GPU. + +**Key features:** + +- Creates Transfers for every SRC GPU and DST GPU pair using the NIC closest to the SRC GPU to read, and the NIC closest to the DST GPU to write. + +- Prints a SRC x DST bandwidth matrix with row totals, column totals, and aggregate bandwidth. + +- Reports average and aggregate bandwidth (Tx-thread timed and CPU timed). + +- Supports single node only: Multinode is not supported. + +**Usage:** + +.. code-block:: shell + + ./TransferBench a2a_n [numBytes] + +Environment variables +---------------------- + +To modify the behavior of a2a_n preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``MEM_TYPE`` + - GPU memory type: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`. + - ``2`` + + * - ``NUM_GPU_DEVICES`` + - Number of GPUs to use. + - (detected) + + * - ``NUM_QUEUE_PAIRS`` + - Queue pairs per transfer. + - ``1`` + +.. note:: + + The a2a_n preset divides the available NIC bandwidth into the number of GPU peers. + +.. 
_a2asweep: + +All-to-all sweep preset (a2asweep) +=================================== + +The a2asweep preset performs a parameter sweep of GFX-based all-to-all transfers across different subexecutor counts, unroll factors, and thread block sizes. It helps find optimal configurations for GPU all-to-all bandwidth on your hardware. + +**Key features:** + +- Sweeps ``BLOCKSIZES`` (thread block size). + +- For each block size, sweeps ``NUM_SUB_EXECS`` (CU count) x ``UNROLLS`` (unroll factor). + +- Sweep order: Outer loop over ``BLOCKSIZES``, then table of (``NUM_SUB_EXECS`` x ``UNROLLS``). + +- By default reports only the slowest GPU's bandwidth (min bandwidth) per CU-Unroll combination. To include the fastest GPU's bandwidth (max bandwidth) per config, set ``SHOW_MIN_ONLY`` = 0. + +- Uses same transfer topology as a2a preset, such as direct links, A2A_MODE, and others. + +**Restrictions:** + +- Supports single node only: Multinode is not supported. + +- Forced single-stream: ``useSingleStream`` = 1. + +- Can't use ``USE_SPRAY`` with multiple destination buffers (``numDsts`` > 1). + +**Usage:** + +.. code-block:: shell + + ./TransferBench a2asweep + +To use custom sweep ranges: + +.. code-block:: shell + + BLOCKSIZES=256,384 UNROLLS=2,4,8 NUM_SUB_EXECS=4,8,16 ./TransferBench a2asweep + +Environment variables +---------------------- + +To modify the behavior of a2asweep preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``A2A_DIRECT`` + - To use only directly-connected GPU pairs, set to ``1``. For full all-to-all, set to ``0``. + - ``1`` + + * - ``A2A_LOCAL`` + - To include local transfers, set to ``1``. To exclude, set to ``0``. + - ``0`` + + * - ``A2A_MODE`` + - Transfer mode: 0=Copy, 1=Read-Only, 2=Write-Only, or numSrcs:numDsts for custom. + - ``0`` + + * - ``BLOCKSIZES`` + - Comma-separated thread block sizes, such as 256, 384, or 512. 
+ - ``256`` + + * - ``MEM_TYPE`` + - GPU memory type: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`. + - ``2`` + + * - ``NUM_GPU_DEVICES`` + - Number of GPUs in all-to-all group. + - (all detected) + + * - ``NUM_SUB_EXECS`` + - Comma-separated subexecutor (CU or WGP) counts to sweep. + - ``4,8,12,16,24,32`` + + * - ``SHOW_MIN_ONLY`` + - To show only the slowest GPU result, set to ``1``. To show the slowest and the fastest GPU results, set to ``0``. + - ``1`` + + * - ``UNROLLS`` + - Comma-separated unroll factors to sweep. See :ref:`gfx-unroll`. + - ``1,2,3,4,6,8`` + + * - ``USE_REMOTE_READ`` + - To use the executor on DST, set to ``1``. To use the executor on SRC, set to ``0``. + - ``0`` + + * - ``USE_SPRAY`` + - To configure each subexecutor to target all GPUs, set to ``1``. To target only one GPU, set to ``0``. Invalid for multiple DST. + - ``0`` + + * - ``VERBOSE`` + - Shows detailed results per config. + - ``0`` + +Example output +--------------- + +.. tab-set:: + + .. tab-item:: AMD Instinct MI300X + + .. image:: /data/a2asweep_MI300X.png + :width: 100% + :align: center + + .. tab-item:: AMD Instinct MI350X + + .. image:: /data/a2asweep_MI350X.png + :width: 100% + :align: center + +.. _nicrings: + +NIC rings (nicrings) +===================== + +The nicrings preset tests NIC rings created across identical NIC indices across ranks. It measures RDMA bandwidth in ring topologies where each rank sends to the next rank in the ring, using GPU or CPU memory closest to each NIC. + +The following image shows the ring topology: + +.. image:: /data/nicrings.png + :width: 100% + :align: center + +**Key features:** + +- Ring construction: Creates parallel RDMA rings across all ranks with one ring per GPU/CPU-to-NIC pair (memIndex-nicIndex), where that NIC is the closest to that memory. + +- Topology of each ring: Rank 0->1->2->...->N-1->0. + +- Can use GPU memory or CPU memory (NUMA nearest to NIC) as buffer. + +- Supports RDMA read or write. 
To choose the rank for RDMA read or write in multirank systems, use ``USE_RDMA_READ``. + +- Homogeneous ranks required: Supports multinode provided that all ranks are homogeneous (same topology). Use ``NIC_FILTER`` to limit NIC visibility if needed. + +- Transfer direction: ``currRank`` sends to (``currRank`` + 1) % ``numRanks``. + +- Executor placement: Executor is placed on the SRC rank for RDMA write and DST rank for RDMA read. + +**Usage:** + +.. code-block:: shell + + ./TransferBench nicrings + +To use CPU memory: + +.. code-block:: shell + + USE_CPU_MEM=1 ./TransferBench nicrings + +To use RDMA read and see details: + +.. code-block:: shell + + SHOW_DETAILS=1 USE_RDMA_READ=1 NUM_QUEUE_PAIRS=2 ./TransferBench nicrings + +Environment variables +---------------------- + +To modify the behavior of nicrings preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``MEM_TYPE`` + - Memory type index. See :ref:`mem-type`. + - ``0`` + + * - ``NUM_QUEUE_PAIRS`` + - Queue pairs per NIC transfer. + - ``1`` + + * - ``SHOW_DETAILS`` + - To see full transfer details, set to ``1``. + - ``0`` + + * - ``USE_CPU_MEM`` + - To use CPU memory closest to each NIC, set to ``1``. To use GPU memory, set to ``0``. + - ``0`` + + * - ``USE_RDMA_READ`` + - To use RDMA reads, set to ``1``. To use RDMA writes, set to ``0``. Applies when ``numRanks`` > 1. + - ``0`` + +Example output +--------------- + +Here is an example output collected on four MI350X nodes with 8 NICs: + +.. image:: /data/nicrings_MI350X.png + :width: 100% + :align: center + +.. _nicp2p: + +NIC peer-to-peer preset (nicp2p) +================================= + +The nicp2p preset runs a multinode peer-to-peer RDMA transfer test between all NICs across all ranks. It measures bandwidth for every NIC-to-NIC pair using round-robin scheduling to avoid contention. 
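
To illustrate how round-robin scheduling avoids contention, the following Python sketch uses the classic circle method to split rank pairs into rounds in which no rank appears twice. This is an assumption about the general technique, not a description of TransferBench's scheduler: ``round_robin_rounds`` is a hypothetical helper, and it covers only unordered pairs of distinct ranks, whereas the preset also tests both directions and same-rank NIC pairs.

```python
def round_robin_rounds(num_ranks):
    """Split all unordered rank pairs into rounds with no rank used twice per round."""
    ranks = list(range(num_ranks))
    if len(ranks) % 2:
        ranks.append(None)  # placeholder "bye" slot for an odd number of ranks
    n = len(ranks)
    rounds = []
    for _ in range(n - 1):
        # Pair the i-th entry from the front with the i-th entry from the back.
        pairs = [(ranks[i], ranks[n - 1 - i]) for i in range(n // 2)]
        rounds.append([p for p in pairs if None not in p])
        # Keep the first entry fixed and rotate the rest by one position.
        ranks = [ranks[0], ranks[-1]] + ranks[1:-1]
    return rounds


for i, pairs in enumerate(round_robin_rounds(4)):
    print(f"round {i}: {pairs}")
```

Because each rank appears at most once per round, the pairs within a round can run in parallel without two transfers competing for the same rank.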
+ +**Key features:** + +- Tests all (``srcRank``, ``srcNic``) -> (``dstRank``, ``dstNic``) pairs. + +- Device selection: Uses ``GetClosestDeviceToNic()`` to pick CPU NUMA or GPU closest to each NIC based on ``SRC_MEM_TYPE`` or ``DST_MEM_TYPE``, and ``USE_CPU_*`` flags. + +- Allows using RDMA read instead of write through ``USE_REMOTE_READ``. + +- Round-robin and combination schedule: Node pairs are scheduled in round-robin. Within each node pair, NIC pairs use combination schedule with ``NIC_PARALLEL_LEVEL``. + +- Output: Full matrix or column format, including top 10 fastest or slowest connections. + +- Progress report: Prints progress to stderr. For example, "Completed X/Y pairs in Zs, estimated remaining time Ws". + +- Homogeneous ranks required: Supports multinode provided that all ranks are homogeneous (same topology). Use ``NIC_FILTER`` to limit NIC visibility if needed. + +- NICs required: Exits with error if no NICs are detected. + +**Usage:** + +.. code-block:: shell + + ./TransferBench nicp2p + +To use CPU memory and see output in column format: + +.. code-block:: shell + + OUTPUT_FORMAT=0 USE_CPU_SRC_MEM=1 USE_CPU_DST_MEM=1 ./TransferBench nicp2p + +Environment variables +---------------------- + +To modify the behavior of nicp2p preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``NUM_QUEUE_PAIRS`` + - Queue pairs per transfer (displayed as ``NUM_NIC_SE``). + - ``1`` + + * - ``USE_REMOTE_READ`` + - To use DST GPU as executor (remote read) instead of SRC GPU (local read). + - ``0`` + + * - ``OUTPUT_FORMAT`` + - To output full matrix, set to ``1``. For output in column format, set to ``0``. Column format is recommended when there are lots of NIC pairs. + - ``1`` + + * - ``USE_CPU_SRC_MEM`` + - To use CPU memory as SRC, set to ``1``. To use GPU memory as SRC, set to ``0``. 
+ - ``0`` + + * - ``USE_CPU_DST_MEM`` + - To use CPU memory as DST, set to ``1``. To use GPU memory as DST, set to ``0``. + - ``0`` + + * - ``SRC_MEM_TYPE`` + - Source memory type index. See :ref:`mem-type`. + - ``2`` + + * - ``DST_MEM_TYPE`` + - Destination memory type index. See :ref:`mem-type`. + - ``2`` + + * - ``PARALLEL_NODE`` + - To execute node pairs in parallel, set to ``1``. For serial execution, set to ``0``. By default, nicp2p tries to run Transfers between node pairs in parallel to reduce the overall runtime. For example, (Rank 0->Rank 1) + (Rank 2->Rank 3) are run in parallel instead of (Rank 0->Rank 1) followed by (Rank 2->Rank 3). + - ``1`` + + * - ``NIC_PARALLEL_LEVEL`` + - NIC-to-NIC pairs that run in parallel between a node pair. By default, between a pair of nodes, all available NICs are used in parallel. NICs aren't used more than once at a time. This option reduces the overall runtime, which can be disabled if it impacts the performance. + - ``numNicsPerRank`` + +Example output +--------------- + +.. code-block:: shell + + [P2P Network Related] + NUM_NIC_SE = 1 : Using 1 queue pairs per Transfer + USE_REMOTE_READ = 0 : Using SRC as executor + OUTPUT_FORMAT = 1 : Printing results in full matrix format + USE_CPU_SRC_MEM = 0 : Source memory is GPU + USE_CPU_DST_MEM = 0 : Destination memory is GPU + SRC_MEM_TYPE = 2 : Using uncached GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed) + DST_MEM_TYPE = 2 : Using uncached GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed) + PARALLEL_NODE = 1 : Executing p2p node pairs in parallel: yes + NIC_PARALLEL_LEVEL = 8 : Between a pair of nodes, 8 pairs of NIC-NIC transfers executed in parallel + + Unidirectional copy peak bandwidth GB/s (NIC RDMA Using Nearest Device) + Completed 8/256 pairs in 2.656s, estimated remaining time 82.326s. + Completed 16/256 pairs in 5.537s, estimated remaining time 83.057s. + Completed 24/256 pairs in 8.351s, estimated remaining time 80.731s. 
+ Completed 32/256 pairs in 11.251s, estimated remaining time 78.756s. + Completed 40/256 pairs in 14.159s, estimated remaining time 76.460s. + Completed 48/256 pairs in 16.688s, estimated remaining time 72.315s. + Completed 56/256 pairs in 19.113s, estimated remaining time 68.261s. + Completed 64/256 pairs in 21.748s, estimated remaining time 65.245s. + Completed 72/256 pairs in 24.465s, estimated remaining time 62.521s. + Completed 80/256 pairs in 27.377s, estimated remaining time 60.229s. + Completed 88/256 pairs in 30.264s, estimated remaining time 57.777s. + Completed 96/256 pairs in 32.851s, estimated remaining time 54.752s. + Completed 104/256 pairs in 35.601s, estimated remaining time 52.033s. + Completed 112/256 pairs in 38.404s, estimated remaining time 49.377s. + Completed 120/256 pairs in 41.035s, estimated remaining time 46.507s. + Completed 128/256 pairs in 43.756s, estimated remaining time 43.756s. + Completed 144/256 pairs in 45.877s, estimated remaining time 35.682s. + Completed 160/256 pairs in 47.736s, estimated remaining time 28.641s. + Completed 176/256 pairs in 50.091s, estimated remaining time 22.769s. + Completed 192/256 pairs in 51.892s, estimated remaining time 17.297s. + Completed 208/256 pairs in 53.863s, estimated remaining time 12.430s. + Completed 224/256 pairs in 55.850s, estimated remaining time 7.979s. + Completed 240/256 pairs in 57.924s, estimated remaining time 3.862s. + Completed 256/256 pairs in 60.043s, estimated remaining time 0.000s. 
+ ┌------------┬-------------------------┬---------------------------------------------------------------------------------------┬---------------------------------------------------------------------------------------┐ + │SRC+EXE\DST │ │ Rank 00 │ Rank 01 │ + ├------------┼-------------------------┼---------------------------------------------------------------------------------------┼---------------------------------------------------------------------------------------┤ + │ │ NIC Device │ bnxt_re0 bnxt_re1 bnxt_re2 bnxt_re3 bnxt_re4 bnxt_re5 bnxt_re6 bnxt_re7 │ bnxt_re0 bnxt_re1 bnxt_re2 bnxt_re3 bnxt_re4 bnxt_re5 bnxt_re6 bnxt_re7 │ + │ │ Mem Device │ GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 │ GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 │ + ├------------┼-------------------------┼---------------------------------------------------------------------------------------┼---------------------------------------------------------------------------------------┤ + │ Rank 00 │ bnxt_re0 GPU 00 │ 31.36 31.31 31.31 31.31 31.31 31.31 31.31 31.30 │ 31.32 31.31 31.31 31.30 31.30 31.31 31.31 31.31 │ + │ │ bnxt_re1 GPU 01 │ 31.31 31.35 31.31 31.31 31.31 31.31 31.31 31.31 │ 31.31 31.32 31.31 31.31 31.31 31.31 31.31 31.31 │ + │ │ bnxt_re2 GPU 02 │ 31.31 31.32 31.36 31.31 31.31 31.31 31.30 31.31 │ 31.30 31.30 31.32 31.31 31.31 31.30 31.30 31.31 │ + │ │ bnxt_re3 GPU 03 │ 31.31 31.32 31.32 31.35 31.30 31.31 31.31 31.30 │ 31.31 31.32 31.31 31.31 31.31 31.30 31.31 31.31 │ + │ │ bnxt_re4 GPU 04 │ 31.31 31.32 31.31 31.32 31.35 31.31 31.31 31.30 │ 31.31 31.31 31.31 31.31 31.32 31.31 31.31 31.30 │ + │ │ bnxt_re5 GPU 05 │ 31.32 31.32 31.32 31.32 31.32 31.35 31.31 31.31 │ 31.31 31.31 31.30 31.32 31.31 31.33 31.31 31.32 │ + │ │ bnxt_re6 GPU 06 │ 31.31 31.31 31.32 31.32 31.32 31.32 31.36 31.31 │ 31.31 31.31 31.31 31.31 31.31 31.31 31.33 31.31 │ + │ │ bnxt_re7 GPU 07 │ 31.31 31.32 31.32 31.32 31.32 31.31 31.32 31.36 │ 31.31 31.32 31.30 31.31 31.30 31.31 31.30 31.32 │ + 
├------------┼-------------------------┼---------------------------------------------------------------------------------------┼---------------------------------------------------------------------------------------┤ + │ Rank 01 │ bnxt_re0 GPU 00 │ 31.33 31.30 31.30 31.31 31.31 31.31 31.31 31.30 │ 31.36 31.31 31.31 31.31 31.31 31.31 31.30 31.31 │ + │ │ bnxt_re1 GPU 01 │ 31.32 31.32 31.31 31.30 31.31 31.31 31.31 31.31 │ 31.32 31.36 31.31 31.31 31.30 31.30 31.30 31.30 │ + │ │ bnxt_re2 GPU 02 │ 31.31 31.30 31.32 31.31 31.31 31.31 31.30 31.31 │ 31.32 31.32 31.35 31.31 31.31 31.31 31.30 31.31 │ + │ │ bnxt_re3 GPU 03 │ 31.31 31.31 31.31 31.32 31.30 31.31 31.30 31.31 │ 31.31 31.32 31.32 31.36 31.31 31.31 31.31 31.30 │ + │ │ bnxt_re4 GPU 04 │ 31.30 31.31 31.31 31.31 31.32 31.31 31.32 31.31 │ 31.32 31.32 31.31 31.32 31.36 31.31 31.31 31.31 │ + │ │ bnxt_re5 GPU 05 │ 31.30 31.31 31.31 31.31 31.30 31.32 31.31 31.31 │ 31.31 31.32 31.32 31.32 31.32 31.36 31.31 31.31 │ + │ │ bnxt_re6 GPU 06 │ 31.32 31.31 31.31 31.30 31.31 31.30 31.33 31.30 │ 31.32 31.31 31.31 31.32 31.32 31.31 31.35 31.31 │ + │ │ bnxt_re7 GPU 07 │ 31.31 31.31 31.31 31.31 31.31 31.31 31.31 31.32 │ 31.31 31.31 31.32 31.32 31.31 31.32 31.32 31.35 │ + └------------┴-------------------------┴---------------------------------------------------------------------------------------┴---------------------------------------------------------------------------------------┘ + Summary of top 10 fastest/slowest connection + ┌--------------------------┬--------------┬--------------┬--------------------------┬--------------┬--------------┐ + │ Fastest Bandwidth (GB/s) │ Src │ Dst │ Slowest Bandwidth (GB/s) │ Src │ Dst │ + ├--------------------------┼--------------┼--------------┼--------------------------┼--------------┼--------------┤ + │ 31.36 │ R00:bnxt_re0 │ R00:bnxt_re0 │ 31.30 │ R01:bnxt_re0 │ R00:bnxt_re1 │ + │ 31.36 │ R01:bnxt_re5 │ R01:bnxt_re5 │ 31.30 │ R00:bnxt_re4 │ R01:bnxt_re7 │ + │ 31.36 │ R00:bnxt_re7 │ 
R00:bnxt_re7 │ 31.30 │ R01:bnxt_re5 │ R00:bnxt_re4 │ + │ 31.36 │ R01:bnxt_re0 │ R01:bnxt_re0 │ 31.30 │ R00:bnxt_re3 │ R01:bnxt_re7 │ + │ 31.36 │ R00:bnxt_re2 │ R00:bnxt_re2 │ 31.30 │ R01:bnxt_re2 │ R00:bnxt_re1 │ + │ 31.36 │ R00:bnxt_re6 │ R00:bnxt_re6 │ 31.30 │ R01:bnxt_re0 │ R00:bnxt_re7 │ + │ 31.36 │ R01:bnxt_re1 │ R01:bnxt_re1 │ 31.30 │ R00:bnxt_re5 │ R01:bnxt_re2 │ + │ 31.36 │ R01:bnxt_re4 │ R01:bnxt_re4 │ 31.30 │ R01:bnxt_re1 │ R01:bnxt_re5 │ + │ 31.36 │ R01:bnxt_re3 │ R01:bnxt_re3 │ 31.30 │ R01:bnxt_re6 │ R01:bnxt_re7 │ + │ 31.35 │ R01:bnxt_re7 │ R01:bnxt_re7 │ 31.30 │ R01:bnxt_re2 │ R01:bnxt_re6 │ + └--------------------------┴--------------┴--------------┴--------------------------┴--------------┴--------------┘ + +.. _one2all: + +One-to-all preset (one2all) +============================ + +The one2all preset tests all subsets of parallel transfers from one GPU to the others. It sweeps over varying numbers of DST peers (from ``SWEEP_MIN`` to ``SWEEP_MAX``), and for each count, tests every combination of DST GPUs from a single SRC or executor GPU. + +**Key features:** + +- Minimum two GPUs: Requires at least two GPUs. Uses one GPU (``EXE_INDEX``) as SRC and executor. + +- Sweeps over all combinations of 1, 2, ..., N DST GPUs (excluding the SRC). + +- Combination sweep: For each peer count ``p``, iterates over all bitmasks with exactly ``p`` bits set (excluding ``EXE_INDEX``). + +- For each combination, runs parallel transfers and reports bandwidth per DST. + +- Supports GFX or DMA executor. SRC or DST can either be GPU or Null. + +- Supports single node only: Multinode is not supported. + +- Invalid configs skipped: Skips when (``exe`` = DMA and ( ``src`` = N or ``dst`` = N)) or ( ``src`` = N and ``dst`` = N). + +- Output format: Each line shows bandwidth per DST GPU, ``p``, ``numSubExecs``, and transfer triplets. + +**Usage:** + +.. code-block:: shell + + ./TransferBench one2all + +To run using GPU 2 as SRC and DST peers between 4 to 7: + +.. 
code-block:: shell + + EXE_INDEX=2 SWEEP_MIN=4 SWEEP_MAX=7 ./TransferBench one2all + +Environment variables +---------------------- + +To modify the behavior of one2all preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``NUM_GPU_DEVICES`` + - Number of GPUs. + - (all detected) + + * - ``NUM_GPU_SE`` + - Subexecutors (CUs) per transfer. + - ``4`` + + * - ``EXE_INDEX`` + - GPU index to use as executor or SRC. + - ``0`` + + * - ``SWEEP_DIR`` + - Transfer direction. + - ``0`` + + * - ``SWEEP_SRC`` + - SRC memory types: G=GPU, N=Null. + - ``G`` + + * - ``SWEEP_DST`` + - DST memory types. + - ``G`` + + * - ``SWEEP_EXE`` + - Executor types: G=GFX, D=DMA. + - ``G`` + + * - ``SWEEP_MIN`` + - Minimum number of DST peers. + - ``1`` + + * - ``SWEEP_MAX`` + - Maximum number of DST peers. + - ``numGpuDevices`` + +Example output +--------------- + +.. tab-set:: + + .. tab-item:: AMD Instinct MI300X + + .. 
code-block:: shell + + [One-To-All Related] + NUM_GPU_DEVICES = 8 : Using 8 GPUs + NUM_GPU_SE = 4 : Using 4 subExecutors/CUs per Transfer + EXE_INDEX = 0 : Executing on GPU 0 + SWEEP_DIR = 0 : Direction of transfer + SWEEP_DST = G : DST memory types to sweep + SWEEP_EXE = G : Executor type to use + SWEEP_MAX = 8 : Maximum number of peers + SWEEP_MIN = 1 : Minimum number of peers + SWEEP_SRC = G : SRC memory types to sweep + + Executing (G0 -> G0 -> G*) + GPU 1 GPU 2 GPU 3 GPU 4 GPU 5 GPU 6 GPU 7 + ------------------------------------------------------------------------------------------- + 49.409 1 4 (G0 G0 G1) + 49.467 1 4 (G0 G0 G2) + 49.215 1 4 (G0 G0 G3) + 47.526 1 4 (G0 G0 G4) + 48.045 1 4 (G0 G0 G5) + 48.278 1 4 (G0 G0 G6) + 48.132 1 4 (G0 G0 G7) + 48.954 35.346 2 4 (G0 G0 G1) (G0 G0 G2) + 48.851 48.869 2 4 (G0 G0 G1) (G0 G0 G3) + 49.009 48.861 2 4 (G0 G0 G2) (G0 G0 G3) + 48.962 47.599 2 4 (G0 G0 G1) (G0 G0 G4) + 49.008 47.486 2 4 (G0 G0 G2) (G0 G0 G4) + 35.706 47.563 2 4 (G0 G0 G3) (G0 G0 G4) + 48.833 31.660 2 4 (G0 G0 G1) (G0 G0 G5) + 49.002 35.160 2 4 (G0 G0 G2) (G0 G0 G5) + 49.137 47.565 2 4 (G0 G0 G3) (G0 G0 G5) + 47.613 47.706 2 4 (G0 G0 G4) (G0 G0 G5) + 48.972 48.413 2 4 (G0 G0 G1) (G0 G0 G6) + 48.917 48.389 2 4 (G0 G0 G2) (G0 G0 G6) + 37.319 48.397 2 4 (G0 G0 G3) (G0 G0 G6) + 32.618 48.334 2 4 (G0 G0 G4) (G0 G0 G6) + 47.749 48.497 2 4 (G0 G0 G5) (G0 G0 G6) + 48.787 35.541 2 4 (G0 G0 G1) (G0 G0 G7) + 48.824 32.099 2 4 (G0 G0 G2) (G0 G0 G7) + 48.862 47.863 2 4 (G0 G0 G3) (G0 G0 G7) + 47.478 48.014 2 4 (G0 G0 G4) (G0 G0 G7) + 47.705 35.595 2 4 (G0 G0 G5) (G0 G0 G7) + 48.509 47.931 2 4 (G0 G0 G6) (G0 G0 G7) + 44.235 48.729 44.548 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) + 43.164 45.482 43.238 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) + 31.360 48.819 31.280 3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) + 31.624 48.941 31.406 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) + 41.797 46.652 41.706 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) + 41.739 48.994 41.575 3 4 (G0 G0 G1) (G0 G0 G3) (G0 
G0 G5) + 42.676 48.992 42.683 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) + 42.621 47.369 42.536 3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) + 43.504 47.353 43.639 3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) + 31.263 47.357 31.202 3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) + 44.168 47.169 44.632 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G6) + 30.692 48.787 30.939 3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G6) + 32.297 48.687 32.237 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) + 28.916 47.483 29.027 3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G6) + 28.024 47.429 28.253 3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) + 27.484 47.547 27.506 3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) + 43.660 40.609 44.131 3 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G6) + 44.196 46.915 44.520 3 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) + 42.547 47.627 43.041 3 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) + 44.828 47.705 45.032 3 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 46.291 44.552 46.139 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G7) + 46.779 48.784 46.969 3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G7) + 42.319 48.889 42.591 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G7) + 46.980 47.296 47.003 3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G7) + 44.806 47.395 45.020 3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G7) + 31.296 47.280 31.418 3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G7) + 45.477 44.531 45.229 3 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G7) + 45.001 43.060 44.962 3 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G7) + 47.083 41.743 46.937 3 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G7) + 42.876 45.829 43.211 3 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 42.205 48.237 42.679 3 4 (G0 G0 G1) (G0 G0 G6) (G0 G0 G7) + 46.007 48.087 45.818 3 4 (G0 G0 G2) (G0 G0 G6) (G0 G0 G7) + 31.938 48.267 32.044 3 4 (G0 G0 G3) (G0 G0 G6) (G0 G0 G7) + 28.835 48.077 28.934 3 4 (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 46.681 48.237 46.443 3 4 (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 40.538 39.734 40.637 39.989 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) + 43.540 35.372 43.132 35.497 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) + 46.522 36.693 46.656 36.883 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) + 41.551 35.359 
41.382 35.482 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) + 41.302 40.839 40.951 40.931 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) + 38.601 37.573 38.677 37.788 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) + 39.196 41.692 39.371 42.069 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) + 39.194 46.098 39.083 45.956 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) + 33.541 41.203 33.486 41.436 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) + 41.140 38.015 41.354 37.837 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) + 41.764 42.981 42.139 43.384 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) + 44.813 46.952 45.157 47.063 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) + 42.990 42.942 42.790 42.787 4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 42.439 41.103 42.451 41.035 4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 41.678 42.340 41.546 42.608 4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 46.897 43.268 46.988 43.206 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G7) + 42.473 35.981 42.221 35.803 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G7) + 39.066 37.271 38.889 37.162 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7) + 41.392 40.677 41.546 40.580 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7) + 38.916 30.582 39.062 30.730 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G7) + 43.248 39.370 43.099 39.565 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7) + 45.966 34.208 46.186 34.160 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7) + 42.943 37.965 43.105 37.827 4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 37.814 29.784 37.870 29.790 4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 38.329 38.749 38.351 38.800 4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 44.992 32.694 44.743 32.608 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G6) (G0 G0 G7) + 39.867 39.650 39.837 39.575 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7) + 31.324 30.215 31.371 30.228 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7) + 34.020 39.810 33.860 39.709 4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 
G6) (G0 G0 G7) + 33.420 33.132 33.431 33.105 4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 31.942 41.954 32.008 41.790 4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 37.573 31.076 37.701 31.144 4 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 38.455 36.476 38.483 36.316 4 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 45.473 38.297 45.467 38.204 4 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 44.440 37.996 44.530 38.044 4 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 37.237 44.266 37.207 44.146 37.286 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) + 34.692 45.404 34.637 45.561 34.513 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) + 35.046 32.117 34.965 32.262 35.007 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) + 39.664 33.774 39.592 33.895 39.598 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 32.818 32.518 32.747 32.515 32.774 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 31.579 43.096 31.577 43.457 31.578 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 40.813 42.963 40.801 43.090 40.737 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7) + 40.565 34.567 40.630 34.859 40.559 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7) + 39.137 32.169 39.183 32.270 39.037 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 31.289 34.060 31.225 34.050 31.250 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 38.908 42.629 38.936 43.247 38.947 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 41.545 44.415 41.614 44.221 41.622 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7) + 34.760 37.380 34.741 37.467 34.541 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 28.091 35.858 28.037 35.823 28.072 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 28.942 37.485 28.963 37.353 28.894 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 32.473 36.354 32.466 36.272 32.430 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) 
(G0 G0 G6) (G0 G0 G7) + 41.725 37.835 41.615 37.916 41.462 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 35.491 45.836 35.415 45.785 35.436 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 44.632 38.803 44.496 38.664 44.305 5 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 39.944 44.310 40.085 44.310 39.938 5 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 29.816 36.004 29.770 35.960 29.717 5 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 34.725 35.633 34.708 35.705 34.657 35.797 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 39.720 37.520 39.526 37.566 39.491 37.550 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 39.609 41.426 39.536 41.532 39.521 41.447 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 39.203 33.233 39.339 33.162 39.220 33.234 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 35.246 34.889 35.226 34.842 35.218 34.841 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 41.457 37.283 41.567 37.332 41.352 37.204 6 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 33.003 37.075 33.068 36.900 32.971 36.937 6 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 38.626 41.000 38.632 41.087 38.518 40.911 38.775 7 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + + .. tab-item:: AMD Instinct MI355X + + .. 
code-block:: shell + + [One-To-All Related] + NUM_GPU_DEVICES = 8 : Using 8 GPUs + NUM_GPU_SE = 4 : Using 4 subExecutors/CUs per Transfer + EXE_INDEX = 0 : Executing on GPU 0 + SWEEP_DIR = 0 : Direction of transfer + SWEEP_DST = G : DST memory types to sweep + SWEEP_EXE = G : Executor type to use + SWEEP_MAX = 8 : Maximum number of peers + SWEEP_MIN = 1 : Minimum number of peers + SWEEP_SRC = G : SRC memory types to sweep + + Executing (G0 -> G0 -> G*) + GPU 1 GPU 2 GPU 3 GPU 4 GPU 5 GPU 6 GPU 7 + -------------------------------------------------------------------------------- ----------- + 57.060 1 4 (G0 G0 G1) + 56.969 1 4 (G0 G0 G2) + 49.018 1 4 (G0 G0 G3) + 49.616 1 4 (G0 G0 G4) + 56.926 1 4 (G0 G0 G5) + 56.751 1 4 (G0 G0 G6) + 49.459 1 4 (G0 G0 G7) + 57.858 55.950 2 4 (G0 G0 G1) (G0 G0 G2) + 56.203 56.584 2 4 (G0 G0 G1) (G0 G0 G3) + 56.249 55.990 2 4 (G0 G0 G2) (G0 G0 G3) + 56.304 56.307 2 4 (G0 G0 G1) (G0 G0 G4) + 55.829 56.026 2 4 (G0 G0 G2) (G0 G0 G4) + 55.066 55.944 2 4 (G0 G0 G3) (G0 G0 G4) + 55.941 53.563 2 4 (G0 G0 G1) (G0 G0 G5) + 48.896 49.449 2 4 (G0 G0 G2) (G0 G0 G5) + 50.291 50.699 2 4 (G0 G0 G3) (G0 G0 G5) + 49.792 49.264 2 4 (G0 G0 G4) (G0 G0 G5) + 48.798 49.999 2 4 (G0 G0 G1) (G0 G0 G6) + 55.917 53.447 2 4 (G0 G0 G2) (G0 G0 G6) + 49.444 49.879 2 4 (G0 G0 G3) (G0 G0 G6) + 50.038 49.559 2 4 (G0 G0 G4) (G0 G0 G6) + 57.729 56.534 2 4 (G0 G0 G5) (G0 G0 G6) + 56.182 55.834 2 4 (G0 G0 G1) (G0 G0 G7) + 55.878 55.928 2 4 (G0 G0 G2) (G0 G0 G7) + 56.481 57.752 2 4 (G0 G0 G3) (G0 G0 G7) + 49.900 49.185 2 4 (G0 G0 G4) (G0 G0 G7) + 55.853 56.308 2 4 (G0 G0 G5) (G0 G0 G7) + 56.321 55.775 2 4 (G0 G0 G6) (G0 G0 G7) + 52.080 50.746 51.941 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) + 54.335 54.254 54.202 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) + 49.266 55.731 49.445 3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) + 52.413 55.947 52.325 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) + 39.503 54.296 39.712 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) + 57.383 56.119 57.456 3 4 (G0 G0 G1) (G0 G0 G3) (G0 
G0 G5) + 50.184 56.256 50.205 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) + 57.250 56.207 57.346 3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) + 49.933 56.055 49.519 3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) + 48.265 56.240 48.151 3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) + 47.040 50.109 47.149 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G6) + 50.567 56.220 50.564 3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G6) + 56.907 56.313 56.986 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) + 50.609 56.264 50.417 3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G6) + 56.975 56.041 56.826 3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) + 48.868 56.275 48.590 3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) + 49.474 49.799 49.414 3 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G6) + 39.407 53.626 39.264 3 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) + 52.668 51.885 52.746 3 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) + 54.683 50.035 54.503 3 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 54.751 51.185 54.714 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G7) + 49.464 56.451 49.507 3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G7) + 50.542 56.494 50.419 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G7) + 47.802 53.791 47.561 3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G7) + 47.249 52.755 47.091 3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G7) + 41.682 55.054 41.609 3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G7) + 53.857 50.240 53.689 3 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G7) + 46.694 49.802 46.467 3 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G7) + 52.817 49.695 52.708 3 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G7) + 42.766 49.378 42.681 3 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 47.020 50.272 46.866 3 4 (G0 G0 G1) (G0 G0 G6) (G0 G0 G7) + 51.293 50.344 51.281 3 4 (G0 G0 G2) (G0 G0 G6) (G0 G0 G7) + 52.745 50.363 52.573 3 4 (G0 G0 G3) (G0 G0 G6) (G0 G0 G7) + 43.464 50.005 43.378 3 4 (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 52.110 53.252 52.204 3 4 (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 53.978 53.951 53.909 53.994 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) + 52.088 48.838 52.174 48.706 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) + 54.746 51.347 54.722 51.213 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) + 53.295 54.767 
53.528 54.685 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) + 50.468 48.308 50.462 47.927 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) + 50.893 46.216 50.966 46.051 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) + 52.775 43.437 52.870 43.390 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) + 51.347 47.597 51.299 47.533 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) + 54.851 54.193 54.852 54.315 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) + 52.597 53.273 52.389 53.026 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) + 49.185 51.880 49.343 51.712 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) + 50.603 56.058 50.795 55.960 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) + 49.493 53.818 49.462 53.614 4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 50.473 52.841 50.388 52.713 4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 55.233 53.448 54.880 53.259 4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 53.965 53.219 54.128 53.233 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G7) + 48.949 50.712 48.946 50.613 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G7) + 52.486 47.821 52.730 47.730 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7) + 51.232 49.069 51.309 48.869 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7) + 49.876 51.404 49.772 51.046 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G7) + 57.132 57.070 56.963 56.772 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7) + 49.970 57.176 49.987 56.920 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7) + 57.333 49.658 57.264 49.806 4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 50.165 49.903 50.134 49.860 4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 52.488 51.273 52.639 51.069 4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 51.169 54.829 51.031 54.709 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G6) (G0 G0 G7) + 50.695 57.240 50.471 56.931 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7) + 56.892 57.171 56.747 57.028 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7) + 50.567 49.730 50.262 49.642 4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 
G6) (G0 G0 G7) + 56.972 49.999 56.850 49.764 4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 55.711 51.511 55.656 51.235 4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 51.660 54.717 51.631 54.697 4 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 48.339 50.612 48.182 50.660 4 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 54.962 54.106 54.762 53.969 4 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 49.339 51.046 49.435 50.976 4 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 44.784 52.269 44.716 52.113 44.546 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) + 51.310 52.573 51.138 52.632 51.216 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) + 53.244 47.458 53.128 47.570 53.279 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) + 53.537 49.462 53.431 49.427 53.541 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 47.703 56.413 47.796 56.427 47.780 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 47.025 53.682 46.996 53.527 47.114 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 44.808 52.363 44.935 52.466 44.864 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7) + 53.774 44.200 53.745 44.146 53.889 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7) + 50.409 42.969 50.554 42.820 50.407 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 46.721 55.426 46.727 55.217 46.646 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 49.917 52.813 50.019 52.586 49.718 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 49.463 50.373 49.695 50.244 49.436 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7) + 49.394 50.794 49.331 50.565 49.373 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 47.873 51.213 47.900 51.305 47.921 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 47.109 51.965 47.182 51.776 47.153 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 50.039 54.672 50.159 54.760 50.229 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) 
(G0 G0 G6) (G0 G0 G7) + 46.918 52.488 47.028 52.327 47.033 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 49.807 52.877 50.009 52.756 49.904 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 48.313 54.666 48.258 54.596 48.103 5 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 47.352 52.476 47.680 52.375 47.412 5 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 45.647 51.850 45.700 51.787 45.618 5 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 53.797 53.041 53.715 53.185 53.728 53.055 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) + 50.819 49.056 50.912 49.257 50.800 49.082 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7) + 53.691 53.287 53.672 53.443 53.601 53.312 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7) + 51.184 51.978 51.156 51.922 51.261 51.993 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 52.548 50.879 52.511 51.038 52.776 50.962 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 52.010 52.226 51.881 52.229 51.977 52.150 6 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 49.444 48.838 49.543 48.895 49.396 48.811 6 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + 48.520 53.242 48.504 53.057 48.517 53.075 48.642 7 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7) + +.. _p2p: + +Peer-to-peer preset (p2p) +========================== + +The p2p preset measures device memory bandwidth between all pairs of CPU NUMA nodes and GPUs. It tests unidirectional and bidirectional transfers for CPU-to-CPU, CPU-to-GPU, and GPU-to-GPU combinations. + +**Key features:** + +- Tests all SRC-to-DST pairs across CPUs and GPUs. + +- Supports both unidirectional and bidirectional transfers (``P2P_MODE``). + +- Uses GFX or DMA as GPU executor (``USE_GPU_DMA``). + +- Supports remote read (DST GPU as executor) instead of source-side execution. 
+ +- Prints a bandwidth matrix with row and column labels. Optionally shows min/max/stddev per iteration. + + +**Restrictions:** + +- Supports single node only: Multinode is not supported. + +- ``USE_FINE_GRAIN`` deprecated: Returns an error if ``USE_FINE_GRAIN`` is set. Use ``CPU_MEM_TYPE`` and ``GPU_MEM_TYPE`` instead. + +- NVIDIA CPU: On NVIDIA, CPU executors can't access GPU memory; those pairs are skipped. + +- Self-transfers skipped: CPU i-to-i and GPU i-to-i are skipped in bidirectional mode. + +**Usage:** + +.. code-block:: shell + + ./TransferBench p2p + +To run unidirectional transfers only, using DMA: + +.. code-block:: shell + + P2P_MODE=1 USE_GPU_DMA=1 ./TransferBench p2p + +Environment variables +---------------------- + +To modify the behavior of the p2p preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``CPU_MEM_TYPE`` + - CPU memory: 0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned. See :ref:`mem-type`. + - ``0`` + + * - ``GPU_MEM_TYPE`` + - GPU memory: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`. + - ``0`` + + * - ``NUM_CPU_DEVICES`` + - Number of CPU NUMA nodes. To avoid using any pairs involving CPUs, set it to ``0``. + - (all detected) + + * - ``NUM_CPU_SE`` + - CPU threads per CPU-executed transfer. + - ``4`` + + * - ``NUM_GPU_DEVICES`` + - Number of GPUs. This can be modified to reduce the number of GPUs to test. + - (all detected) + + * - ``NUM_GPU_SE`` + - GPU CUs per transfer. Default value varies according to ``USE_GPU_DMA``. + - (device max / GFX default) + + * - ``SHOW_ITERATIONS`` + - To show detailed min/max/stddev per iteration, set to ``1``. + - ``0`` + + * - ``P2P_MODE`` + - 0=both, 1=unidirectional only, 2=bidirectional only. + - ``0`` + + * - ``USE_GPU_DMA`` + - To use DMA for the GPU executor, set to ``1``. To use GFX, set to ``0``.
+ - ``0`` + + * - ``USE_REMOTE_READ`` + - To place the executor on DST, set to ``1``. To place on SRC, set to ``0``. + - ``0`` + +Example output +--------------- + +.. tab-set:: + + .. tab-item:: AMD Instinct MI300X + + .. code-block:: shell + + [P2P Related] + CPU_MEM_TYPE = 0 : Using default CPU (0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned) + GPU_MEM_TYPE = 0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed) + NUM_CPU_DEVICES = 2 : Using 2 CPUs + NUM_CPU_SE = 4 : Using 4 CPU threads per Transfer + NUM_GPU_DEVICES = 8 : Using 8 GPUs + NUM_GPU_SE = 304 : Using 304 GPU subexecutors/CUs per Transfer + P2P_MODE = 0 : Running Uni + Bi transfers + USE_GPU_DMA = 0 : Using GPU-GFX as GPU executor + USE_REMOTE_READ = 0 : Using SRC as executor + Bytes Per Direction 268435456 + Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX) + SRC+EXE\DST CPU 00 CPU 01 GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 + CPU 00 -> 37.62 38.04 39.44 34.00 33.12 35.53 31.90 29.73 28.11 31.00 + CPU 01 -> 37.84 37.69 29.92 29.85 31.19 29.63 38.99 38.41 38.32 39.56 + GPU 00 -> 55.36 55.25 1618.87 48.83 48.89 49.00 48.05 47.94 48.27 47.85 + GPU 01 -> 55.36 54.14 48.89 1860.47 48.95 48.95 47.91 48.04 48.49 48.32 + GPU 02 -> 55.35 55.26 48.83 49.01 1868.43 49.07 48.70 48.34 48.85 48.97 + GPU 03 -> 55.34 55.26 49.01 49.02 49.07 1877.42 48.51 48.17 48.85 49.04 + GPU 04 -> 55.30 55.38 47.95 48.26 48.85 48.61 1849.65 48.99 48.85 48.84 + GPU 05 -> 55.29 55.35 47.95 48.02 48.51 48.03 49.01 1853.87 49.15 49.01 + GPU 06 -> 55.32 55.34 48.31 48.62 48.88 48.94 48.99 48.83 1829.05 49.17 + GPU 07 -> 55.30 55.34 48.23 48.27 48.59 48.90 48.60 49.09 49.14 1841.42 + CPU->CPU CPU->GPU GPU->CPU GPU->GPU + Averages (During UniDir): 37.94 33.67 55.25 48.65 + Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX) + SRC\DST CPU 00 CPU 01 GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 + CPU 00 -> 
N/A 33.59 33.54 36.84 33.35 35.02 29.55 31.33 31.30 28.09 + CPU 00 <- N/A 39.94 54.81 54.73 54.51 54.48 29.25 28.84 28.13 30.44 + CPU 00 <-> N/A 73.52 88.35 91.57 87.86 89.51 58.80 60.17 59.43 58.53 + CPU 01 -> 36.21 N/A 31.09 28.54 31.93 31.76 38.02 38.74 37.19 36.09 + CPU 01 <- 33.60 N/A 28.85 28.27 27.93 28.54 54.85 54.80 54.68 54.70 + CPU 01 <-> 69.81 N/A 59.94 56.81 59.86 60.30 92.87 93.54 91.86 90.78 + GPU 00 -> 54.77 29.18 N/A 46.15 46.10 46.55 46.16 46.05 46.31 45.95 + GPU 00 <- 34.70 30.98 N/A 46.21 46.40 46.65 46.12 46.00 46.25 45.98 + GPU 00 <-> 89.47 60.15 N/A 92.36 92.50 93.20 92.27 92.05 92.56 91.93 + GPU 01 -> 54.77 29.18 46.19 N/A 46.08 46.54 46.17 46.05 46.33 46.14 + GPU 01 <- 32.11 30.59 46.11 N/A 46.64 46.42 46.16 46.09 46.51 46.20 + GPU 01 <-> 86.89 59.77 92.30 N/A 92.73 92.97 92.32 92.14 92.84 92.33 + GPU 02 -> 54.76 29.56 46.40 46.63 N/A 46.62 46.49 46.16 46.41 46.09 + GPU 02 <- 32.05 27.70 46.07 46.05 N/A 46.24 46.18 46.26 46.12 46.27 + GPU 02 <-> 86.81 57.25 92.47 92.68 N/A 92.86 92.67 92.42 92.53 92.37 + GPU 03 -> 54.73 30.33 46.62 46.44 46.23 N/A 46.15 46.34 46.25 46.47 + GPU 03 <- 33.13 29.77 46.50 46.52 46.61 N/A 46.17 46.22 46.23 46.46 + GPU 03 <-> 87.86 60.10 93.13 92.96 92.84 N/A 92.32 92.56 92.48 92.93 + GPU 04 -> 29.91 54.85 46.18 46.20 46.21 46.17 N/A 46.56 46.23 46.50 + GPU 04 <- 30.60 34.45 46.27 46.37 46.58 46.17 N/A 46.49 46.25 46.44 + GPU 04 <-> 60.52 89.30 92.45 92.57 92.78 92.34 N/A 93.05 92.49 92.93 + GPU 05 -> 30.58 54.76 45.99 46.04 46.24 46.32 46.51 N/A 46.38 46.15 + GPU 05 <- 26.98 35.95 46.00 46.01 46.18 46.38 46.56 N/A 46.26 46.20 + GPU 05 <-> 57.55 90.70 91.99 92.05 92.43 92.69 93.07 N/A 92.63 92.36 + GPU 06 -> 30.22 54.65 46.34 46.40 46.13 46.24 46.26 46.33 N/A 46.43 + GPU 06 <- 27.72 35.78 46.37 46.35 46.35 46.28 46.25 46.37 N/A 46.30 + GPU 06 <-> 57.94 90.44 92.72 92.75 92.48 92.52 92.51 92.70 N/A 92.73 + GPU 07 -> 30.55 54.66 46.03 46.15 46.35 46.38 46.39 46.17 46.35 N/A + GPU 07 <- 27.28 36.17 46.05 46.11 46.12 
46.45 46.48 46.15 46.41 N/A + GPU 07 <-> 57.83 90.83 92.08 92.26 92.47 92.83 92.87 92.32 92.76 N/A + CPU->CPU CPU->GPU GPU->CPU GPU->GPU + Averages (During BiDir): 35.83 37.51 36.98 46.28 + + .. tab-item:: AMD Instinct MI350X + + .. code-block:: shell + + [P2P Related] + CPU_MEM_TYPE = 0 : Using default CPU (0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned) + GPU_MEM_TYPE = 0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed) + NUM_CPU_DEVICES = 2 : Using 2 CPUs + NUM_CPU_SE = 4 : Using 4 CPU threads per Transfer + NUM_GPU_DEVICES = 8 : Using 8 GPUs + NUM_GPU_SE = 256 : Using 256 GPU subexecutors/CUs per Transfer + P2P_MODE = 0 : Running Uni + Bi transfers + USE_GPU_DMA = 0 : Using GPU-GFX as GPU executor + USE_REMOTE_READ = 0 : Using SRC as executor + Bytes Per Direction 268435456 + Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX) + SRC+EXE\DST CPU 00 CPU 01 GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 + CPU 00 -> 83.89 93.99 42.90 42.89 42.94 42.93 42.90 42.88 41.76 42.81 + CPU 01 -> 91.09 83.25 42.77 42.84 42.27 42.88 42.79 42.91 42.83 42.79 + GPU 00 -> 53.18 53.14 2285.12 57.51 57.46 57.38 57.33 57.32 57.28 57.64 + GPU 01 -> 53.11 53.16 57.53 2280.83 57.36 57.30 57.32 57.33 57.48 57.44 + GPU 02 -> 53.11 53.13 57.45 57.29 2286.68 57.36 57.58 57.53 57.38 57.35 + GPU 03 -> 53.19 53.11 57.31 57.26 57.52 2281.59 57.52 57.47 57.33 57.38 + GPU 04 -> 53.11 53.12 57.33 57.27 57.57 57.53 2292.99 57.51 57.36 57.36 + GPU 05 -> 53.13 53.13 57.34 57.32 57.55 57.48 57.28 2276.23 57.42 57.50 + GPU 06 -> 53.18 53.19 57.28 57.47 57.39 57.35 57.54 57.40 2305.57 57.49 + GPU 07 -> 53.16 53.15 57.44 57.47 57.35 57.36 57.32 57.35 57.51 2289.74 + CPU->CPU CPU->GPU GPU->CPU GPU->GPU + Averages (During UniDir): 92.54 42.76 53.14 57.41 + Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX) + SRC\DST CPU 00 CPU 01 GPU 00 GPU 01 GPU 02 GPU 03 GPU 04 GPU 05 GPU 06 GPU 07 + 
CPU 00 -> N/A 79.71 42.40 42.40 42.39 42.51 42.05 41.14 42.22 42.45 + CPU 00 <- N/A 80.90 52.72 52.61 52.74 52.69 52.69 52.68 52.61 52.64 + CPU 00 <-> N/A 160.62 95.11 95.01 95.13 95.20 94.75 93.82 94.83 95.09 + CPU 01 -> 80.77 N/A 42.27 42.39 42.50 42.49 42.50 42.47 42.43 42.46 + CPU 01 <- 79.50 N/A 52.68 52.60 52.69 52.66 52.68 52.65 52.68 52.68 + CPU 01 <-> 160.27 N/A 94.95 94.99 95.19 95.15 95.17 95.11 95.11 95.14 + GPU 00 -> 52.72 52.61 N/A 54.77 54.78 54.61 54.66 54.58 54.51 54.85 + GPU 00 <- 42.48 42.34 N/A 54.77 54.72 54.52 54.58 54.53 54.57 54.72 + GPU 00 <-> 95.20 94.95 N/A 109.54 109.51 109.13 109.23 109.11 109.08 109.57 + GPU 01 -> 52.68 52.69 54.75 N/A 54.66 54.51 54.54 54.57 54.74 54.70 + GPU 01 <- 42.43 42.40 54.84 N/A 54.46 54.55 54.45 54.61 54.82 54.79 + GPU 01 <-> 95.11 95.09 109.59 N/A 109.12 109.06 108.99 109.18 109.56 109.50 + GPU 02 -> 52.72 52.59 54.80 54.52 N/A 54.62 54.87 54.86 54.64 54.53 + GPU 02 <- 42.48 42.36 54.80 54.62 N/A 54.71 54.79 54.75 54.59 54.56 + GPU 02 <-> 95.20 94.94 109.60 109.15 N/A 109.33 109.66 109.61 109.23 109.09 + GPU 03 -> 52.61 52.59 54.43 54.52 54.64 N/A 54.80 54.82 54.61 54.59 + GPU 03 <- 42.49 42.38 54.63 54.53 54.63 N/A 54.79 54.73 54.47 54.49 + GPU 03 <-> 95.09 94.97 109.06 109.05 109.28 N/A 109.59 109.56 109.08 109.08 + GPU 04 -> 52.69 52.59 54.56 54.50 54.74 54.76 N/A 54.75 54.57 54.64 + GPU 04 <- 41.98 42.47 54.66 54.53 54.82 54.81 N/A 54.56 54.74 54.53 + GPU 04 <-> 94.67 95.06 109.22 109.03 109.56 109.57 N/A 109.31 109.31 109.17 + GPU 05 -> 52.71 52.58 54.54 54.56 54.78 54.71 54.55 N/A 54.59 54.73 + GPU 05 <- 42.33 42.36 54.59 54.58 54.85 54.83 54.74 N/A 54.50 54.68 + GPU 05 <-> 95.04 94.94 109.13 109.14 109.64 109.55 109.29 N/A 109.09 109.41 + GPU 06 -> 52.64 52.70 54.56 54.82 54.63 54.53 54.61 54.59 N/A 54.82 + GPU 06 <- 42.37 42.52 54.53 54.83 54.66 54.56 54.60 54.59 N/A 54.75 + GPU 06 <-> 95.02 95.22 109.10 109.65 109.28 109.09 109.21 109.18 N/A 109.57 + GPU 07 -> 52.70 52.66 54.70 54.84 54.58 54.53 
54.50 54.68 54.83 N/A + GPU 07 <- 42.16 42.45 54.88 54.72 54.55 54.63 54.61 54.73 54.73 N/A + GPU 07 <-> 94.85 95.11 109.58 109.56 109.12 109.16 109.11 109.41 109.56 N/A + CPU->CPU CPU->GPU GPU->CPU GPU->GPU + Averages (During BiDir): 80.22 47.49 47.51 54.66 + +.. _scaling: + +Scaling preset (scaling) +========================= + +The scaling preset runs a scaling test from one GPU to all other devices (CPUs and GPUs). It varies the number of subexecutors (CUs) from ``SWEEP_MIN`` to ``SWEEP_MAX`` and reports bandwidth for each target device. It helps find the optimal CU count per transfer. + +**Key features:** + +- Uses one GPU (``LOCAL_IDX``) as the source. + +- Single transfer per target: Performs only one transfer at a time (one SRC to one DST) per cell. + +- Copies to each CPU NUMA node and every other GPU. + +- For each CU count (``SWEEP_MIN`` to ``SWEEP_MAX``), runs one transfer per target and reports bandwidth. + +- Prints a table: rows = CU count, columns = target device. + +- Reports best row: Shows peak bandwidth and optimal CU count per target. + +**Restrictions:** + +- Supports single node only: Multinode is not supported. + +- ``USE_FINE_GRAIN`` deprecated: Returns an error if set. Use ``CPU_MEM_TYPE`` and ``GPU_MEM_TYPE`` instead. + +**Usage:** + +.. code-block:: shell + + ./TransferBench scaling + +To run using GPU 2 as SRC with a CU range from 4 to 64: + +.. code-block:: shell + + LOCAL_IDX=2 SWEEP_MIN=4 SWEEP_MAX=64 ./TransferBench scaling + +Environment variables +---------------------- + +To modify the behavior of the scaling preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``CPU_MEM_TYPE`` + - CPU memory type: 0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned. See :ref:`mem-type`. + - ``0`` + + * - ``GPU_MEM_TYPE`` + - GPU memory type: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`. 
+ - ``0`` + + * - ``LOCAL_IDX`` + - Index of the GPU that performs the copies to the other devices. + - ``0`` + + * - ``NUM_CPU_DEVICES`` + - Number of CPU NUMA nodes. + - (all detected) + + * - ``NUM_GPU_DEVICES`` + - Number of GPUs. + - (all detected) + + * - ``SWEEP_MIN`` + - Minimum subexecutors (CUs). + - ``1`` + + * - ``SWEEP_MAX`` + - Maximum subexecutors (CUs). + - ``32`` + +Example output +--------------- + +.. tab-set:: + + .. tab-item:: AMD Instinct MI300X + + .. code-block:: shell + + [Scaling Related] + CPU_MEM_TYPE = 0 : Using default CPU (0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned) + GPU_MEM_TYPE = 0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed) + LOCAL_IDX = 0 : Local GPU index + NUM_CPU_DEVICES = 2 : Using 2 CPUs + NUM_GPU_DEVICES = 8 : Using 8 GPUs + SWEEP_MAX = 32 : Max number of subExecutors to use + SWEEP_MIN = 1 : Min number of subExecutors to use + GPU-GFX Scaling benchmark: + ========================== + - Copying 268435456 bytes from GPU 0 to other devices + - All numbers reported as GB/sec + NumCUs CPU00 CPU01 GPU00 GPU01 GPU02 GPU03 GPU04 GPU05 GPU06 GPU07 + 1 20.22 20.41 18.68 25.91 26.08 26.06 25.95 26.04 26.01 26.02 + 2 37.37 37.03 36.88 48.65 48.33 49.24 47.91 47.72 48.48 47.13 + 3 52.96 51.92 55.74 48.92 48.43 49.22 47.45 47.47 48.11 47.88 + 4 56.38 53.18 73.05 49.41 49.19 49.34 47.84 47.79 48.46 48.05 + 5 54.61 52.96 91.57 45.23 44.60 49.03 47.73 44.32 48.57 44.11 + 6 54.61 53.78 109.70 48.98 48.82 49.13 47.84 47.46 48.38 48.19 + 7 56.48 54.27 127.43 49.15 49.14 49.15 48.00 47.85 48.50 48.05 + 8 56.60 54.71 142.35 49.14 49.37 49.31 47.86 47.98 48.39 48.13 + 9 56.66 54.93 161.43 49.13 49.58 49.16 47.77 47.92 48.44 48.18 + 10 56.83 55.31 178.08 49.17 49.33 49.07 47.79 47.99 48.33 48.23 + 11 56.84 55.63 195.82 49.36 49.43 49.10 47.56 47.96 48.54 48.37 + 12 57.12 55.83 210.39 49.50 48.97 49.43 47.97 47.73 48.63 48.27 + 13 56.91 55.65 226.79 49.52 48.86 49.22 47.63 47.92 48.60 48.16 + 14 57.10 55.83 238.49 49.26 
49.42 49.13 48.08 48.18 48.44 48.18 + 15 57.09 55.86 258.25 49.23 49.19 49.42 47.74 47.96 48.68 48.11 + 16 57.11 55.98 271.55 49.62 49.25 49.54 47.84 47.75 48.39 47.93 + 17 57.10 55.82 287.98 49.10 49.36 49.35 47.64 47.97 48.81 48.28 + 18 57.10 55.81 306.06 49.33 49.14 49.34 47.81 47.99 48.47 48.05 + 19 56.94 55.69 319.71 49.20 49.14 49.32 48.13 47.93 48.61 48.30 + 20 57.14 55.88 334.89 49.35 49.25 49.22 48.19 47.97 48.62 48.24 + 21 57.12 55.94 346.59 49.13 49.23 49.19 48.24 47.84 48.52 48.16 + 22 57.13 56.01 362.42 49.34 49.39 49.09 47.95 48.00 48.53 48.20 + 23 57.13 56.17 375.70 49.10 49.22 49.43 47.98 48.14 48.58 48.46 + 24 57.14 56.23 388.97 49.25 49.24 49.52 47.72 48.06 48.67 48.31 + 25 57.14 56.30 403.32 49.04 49.20 49.42 48.05 48.01 48.51 47.97 + 26 57.14 56.17 417.88 49.57 49.59 49.57 47.89 48.04 48.79 48.34 + 27 57.12 56.02 426.76 49.32 49.24 49.29 48.14 48.01 48.50 48.04 + 28 57.13 56.05 444.58 49.31 49.37 49.12 48.00 47.96 48.44 47.99 + 29 57.14 56.07 453.05 49.55 49.40 49.56 48.16 47.78 48.18 48.17 + 30 57.14 56.12 462.74 49.11 49.27 49.33 47.97 48.20 48.63 48.26 + 31 57.13 56.12 478.60 49.35 48.96 49.06 47.94 48.33 48.43 48.35 + 32 57.15 56.35 493.17 49.23 49.55 49.33 47.77 48.28 48.56 48.22 + Best 57.15( 32) 56.35( 32) 493.17( 32) 49.62( 16) 49.59( 26) 49.57( 26) 48.24( 21) 48.33( 31) 48.81( 17) 48.46( 23) + + .. tab-item:: AMD Instinct MI350X + + .. 
code-block:: shell + + [Scaling Related] + CPU_MEM_TYPE = 0 : Using default CPU (0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned) + GPU_MEM_TYPE = 0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed) + LOCAL_IDX = 0 : Local GPU index + NUM_CPU_DEVICES = 2 : Using 2 CPUs + NUM_GPU_DEVICES = 8 : Using 8 GPUs + SWEEP_MAX = 32 : Max number of subExecutors to use + SWEEP_MIN = 1 : Min number of subExecutors to use + GPU-GFX Scaling benchmark: + ========================== + - Copying 268435456 bytes from GPU 0 to other devices + - All numbers reported as GB/sec + NumCUs CPU00 CPU01 GPU00 GPU01 GPU02 GPU03 GPU04 GPU05 GPU06 GPU07 + 1 26.51 26.30 15.81 26.48 26.57 26.44 25.68 26.39 26.58 26.04 + 2 51.52 50.86 31.50 52.65 52.28 52.28 52.00 52.57 52.95 52.39 + 3 42.83 43.01 46.13 53.32 57.39 55.81 49.08 57.41 49.65 48.31 + 4 50.02 49.93 61.77 57.34 57.09 49.10 49.67 57.02 56.89 49.79 + 5 53.58 53.58 77.07 55.85 57.78 57.22 50.69 57.72 54.80 50.27 + 6 53.84 53.82 91.73 58.29 58.48 56.60 54.91 58.41 57.81 54.89 + 7 53.56 53.63 106.60 57.98 57.24 57.18 55.86 56.97 57.79 55.87 + 8 53.22 52.97 121.07 58.40 58.27 57.43 58.07 58.17 58.07 58.39 + 9 54.22 54.22 135.97 58.37 57.88 57.54 58.10 57.72 58.21 58.26 + 10 54.37 54.34 148.80 58.61 58.63 57.74 58.49 58.63 58.32 58.35 + 11 54.62 54.58 163.28 57.83 58.55 58.26 58.17 58.53 57.94 58.38 + 12 54.63 54.56 177.99 58.93 58.69 58.59 58.49 58.68 58.57 58.64 + 13 54.66 54.69 191.79 58.55 58.51 58.59 58.24 58.50 58.42 58.33 + 14 54.73 54.63 205.97 58.73 58.49 58.36 58.28 58.40 58.51 58.30 + 15 54.73 54.64 221.70 58.65 58.55 58.41 58.44 58.61 58.56 58.43 + 16 54.63 54.59 233.49 59.14 59.04 58.84 58.93 59.03 58.98 59.08 + 17 54.75 54.76 247.85 58.78 58.56 58.43 58.55 58.61 58.62 58.48 + 18 54.74 54.73 262.07 58.70 58.42 58.34 58.33 58.37 58.48 58.40 + 19 54.77 54.73 274.98 58.57 58.42 58.38 58.37 58.55 58.40 58.33 + 20 54.82 54.86 287.02 58.76 58.77 58.58 58.62 58.67 58.58 58.69 + 21 54.79 54.76 301.35 58.62 
58.48 58.38 58.40 58.45 58.47 58.38 + 22 54.74 54.72 313.96 58.59 58.56 58.43 58.43 58.45 58.56 58.42 + 23 54.79 54.78 328.28 58.55 58.53 58.41 58.38 58.48 58.41 58.34 + 24 54.65 54.73 343.28 58.76 59.02 58.68 58.78 58.68 59.01 58.76 + 25 54.72 54.78 354.62 58.57 58.50 58.38 58.42 58.39 58.50 58.41 + 26 54.67 54.71 367.90 58.58 58.51 58.54 58.43 58.46 58.52 58.55 + 27 54.74 54.73 377.03 58.52 58.41 58.26 58.31 58.45 58.39 58.36 + 28 54.67 54.73 393.19 58.69 58.36 58.32 58.40 58.44 58.46 58.41 + 29 54.72 54.71 402.84 58.50 58.31 58.26 58.33 58.36 58.48 58.35 + 30 54.75 54.79 418.54 58.82 58.52 58.37 58.39 58.67 58.52 58.46 + 31 54.79 54.75 429.11 58.65 58.33 58.33 58.55 58.35 58.41 58.41 + 32 54.74 54.79 445.36 59.08 59.12 58.85 58.81 59.02 59.13 59.11 + Best 54.82( 20) 54.86( 20) 445.36( 32) 59.14( 16) 59.12( 32) 58.85( 32) 58.93( 16) 59.03( 16) 59.13( 32) 59.11( 32) + +.. _schmoo: + +Schmoo preset (schmoo) +======================= + +The schmoo preset runs scaling tests for local and remote read, write, and copy operations between two GPUs. For each CU count (``SWEEP_MIN`` to ``SWEEP_MAX``), it measures six bandwidth values: Local Read, Local Write, Local Copy, Remote Read, Remote Write, and Remote Copy. + +**Key features:** + +- Minimum 2 GPUs: Requires at least two GPUs: ``LOCAL_IDX`` (local) and ``REMOTE_IDX`` (remote). + +- Fixed topology: Always two GPUs (local and remote). No sweep over device count. + +- For each CU count, runs the following six tests. Each test measures bandwidth for the corresponding operation pattern: + + - Local Read: Local GPU reads from local memory (SRC->G->null). + + - Local Write: Local GPU writes to local memory (null->G->DST). + + - Local Copy: Local GPU copies (local->local). + + - Remote Read: Local GPU reads from remote memory. + + - Remote Write: Local GPU writes to remote memory. + + - Remote Copy: Local GPU copies (local->remote). + +- Outputs a table: rows = #CUs, columns = the 6 operation types. 
+ +- Supports single node only: Multinode is not supported. + +**Usage:** + +.. code-block:: shell + + ./TransferBench schmoo + +To run using GPUs 0 and 3, sweeping from 4 to 32 CUs: + +.. code-block:: shell + + LOCAL_IDX=0 REMOTE_IDX=3 SWEEP_MIN=4 SWEEP_MAX=32 ./TransferBench schmoo + +To run using fine-grained memory: + +.. code-block:: shell + + USE_FINE_GRAIN=1 ./TransferBench schmoo + +Environment variables +---------------------- + +To modify the behavior of the schmoo preset, use the following environment variables: + +.. list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``LOCAL_IDX`` + - Local GPU index. + - ``0`` + + * - ``REMOTE_IDX`` + - Remote GPU index. + - ``1`` + + * - ``SWEEP_MIN`` + - Minimum CUs. + - ``1`` + + * - ``SWEEP_MAX`` + - Maximum CUs. + - ``32`` + + * - ``USE_FINE_GRAIN`` + - To use fine-grained GPU memory, set to ``1``. For coarse-grained memory, set to ``0``. + - ``0`` + +Example output +--------------- + +.. tab-set:: + + .. tab-item:: AMD Instinct MI300X + + .. image:: /data/schmoo_MI300X.png + :width: 100% + :align: center + + .. tab-item:: AMD Instinct MI350X + + .. image:: /data/schmoo_MI350X.png + :width: 100% + :align: center + +.. _sweep: + +Sweep (sweep) and random sweep preset (rsweep) +=============================================== + +The sweep preset performs an ordered sweep through sets of transfers. It systematically tests combinations of (SRC, executor, DST) with varying parallelism (from ``SWEEP_MIN`` simultaneous transfers up to ``SWEEP_MAX``) using lexicographic permutation order. The rsweep preset is similar to the sweep preset, but runs the tests in random order. + +.. note:: + + This preset is primarily used for stress testing.
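As a rough illustration of the ordered sweep described above, the following Python sketch builds the candidate (SRC, executor, DST) triplets as a Cartesian product and then enumerates every set of M simultaneous transfers, for M from ``SWEEP_MIN`` to ``SWEEP_MAX``. It is not TransferBench code: the function names ``candidate_transfers`` and ``ordered_sweep`` are hypothetical, device filters are omitted, and ``itertools.combinations`` stands in for the tool's own iteration, so the exact test order may differ.

```python
from itertools import combinations, product


def candidate_transfers(src_list, exe_list, dst_list):
    """Return every (SRC, EXE, DST) triplet as a Cartesian product."""
    return list(product(src_list, exe_list, dst_list))


def ordered_sweep(candidates, sweep_min, sweep_max, test_limit=0):
    """Yield tests made of M simultaneous transfers, M = sweep_min..sweep_max.

    sweep_max == 0 means no upper limit other than the candidate count.
    test_limit == 0 means no limit on the number of tests (assumed semantics).
    """
    upper = len(candidates) if sweep_max == 0 else min(sweep_max, len(candidates))
    emitted = 0
    for m in range(sweep_min, upper + 1):
        # All M-combinations are exhausted before M is incremented.
        for test in combinations(candidates, m):
            if test_limit and emitted >= test_limit:
                return
            emitted += 1
            yield test


# Hypothetical small system: two GPU memories, one GPU executor.
candidates = candidate_transfers(["G0", "G1"], ["G0"], ["G0", "G1"])
tests = list(ordered_sweep(candidates, sweep_min=1, sweep_max=2))
print(len(candidates), len(tests))
```

With four candidate triplets and M limited to 1 and 2, the sketch yields 4 + 6 = 10 tests. The total is a sum of binomial coefficients, so it grows very quickly as devices are added, which is why a test or time limit is useful for real runs.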
+ +**Key features:** + +- Possible set: Builds the set of all possible (SRC, EXE, DST) triplets from ``SWEEP_SRC``, ``SWEEP_EXE``, ``SWEEP_DST``, and the device counts, as the Cartesian product (``srcList`` x ``exeList`` x ``dstList``), then applies filters such as XGMI hop count and the CPU-on-GPU skip on NVIDIA. + +- Optionally filters using XGMI hop count (``SWEEP_XGMI_MIN``, ``SWEEP_XGMI_MAX``). + +- M increment: Selects M transfers for each test. M starts at ``SWEEP_MIN`` and is incremented each time all M-combinations are exhausted, until M > ``SWEEP_MAX`` (``SWEEP_MAX`` = 0 means no limit). + +- Ordered permutation: Uses ``std::prev_permutation`` to iterate through M-combinations of the possible transfer set in a deterministic order. + +- Log format: Logs each test's transfers to ``SWEEP_FILE``. The ``SWEEP_FILE`` contains lines such as "# Test N" and "-M (src->exe->dst CUs bytes)...". + +- Honors ``SWEEP_TEST_LIMIT`` and ``SWEEP_TIME_LIMIT``. + +- Default executors: ``SWEEP_EXE`` = CDG includes CPU, DMA, and GFX for broad coverage. + +- Supports single node only: Multinode is not supported. + +.. note:: + + Running the default sweep might never finish executing, especially on configurations with a large number of devices. It is highly recommended to set either ``SWEEP_TEST_LIMIT`` or ``SWEEP_TIME_LIMIT``. + +**Usage:** + +.. code-block:: shell + + ./TransferBench sweep + +To run with memory and executors limited to GPUs, at least one XGMI hop, and at most 16 simultaneous transfers: + +.. code-block:: shell + + SWEEP_SRC=G SWEEP_DST=G SWEEP_EXE=G SWEEP_XGMI_MIN=1 SWEEP_MAX=16 ./TransferBench sweep + +To limit the run to one hour and save the sweep configuration to a custom file: + +.. code-block:: shell + + SWEEP_TIME_LIMIT=3600 SWEEP_FILE=/tmp/mySweep.cfg ./TransferBench sweep + +Environment variables +---------------------- + +To modify the behavior of the sweep and rsweep presets, use the following environment variables: + +..
list-table:: + :header-rows: 1 + + * - Environment variable + - Description + - Default value + + * - ``CONTINUE_ON_ERROR`` + - To continue despite validation error, set to ``1``. To stop, set to ``0``. + - ``0`` + + * - ``NUM_CPU_DEVICES`` + - Number of CPU NUMA nodes. + - (all detected) + + * - ``NUM_CPU_SE`` + - CPU threads per CPU-executed transfer. + - ``4`` + + * - ``NUM_GPU_DEVICES`` + - Number of GPUs. + - (all detected) + + * - ``NUM_GPU_SE`` + - CUs per GPU-executed transfer. + - ``4`` + + * - ``SWEEP_SRC`` + - Source memory types: C=CPU, G=GPU, N=Null. + - ``CG`` + + * - ``SWEEP_DST`` + - Destination memory types. + - ``CG`` + + * - ``SWEEP_EXE`` + - Executor types: C=CPU, D=DMA, G=GFX. + - ``CDG`` + + * - ``SWEEP_FILE`` + - File where sweep configuration is saved. + - ``/tmp/lastSweep.cfg`` + + * - ``SWEEP_MIN`` + - Minimum simultaneous transfers. + - ``1`` + + * - ``SWEEP_MAX`` + - Maximum simultaneous transfers (0=no limit). + - ``24`` + + * - ``SWEEP_RAND_BYTES`` + - To use random transfer size, set to ``1``. For constant, set to ``0``. + - ``0`` + + * - ``SWEEP_SEED`` + - Random seed. Used for rsweep or ``SWEEP_RAND_BYTES``. + - time(NULL) + + * - ``SWEEP_TEST_LIMIT`` + - Maximum number of tests allowed to run. ``0`` = no limit. + - ``0`` + + * - ``SWEEP_TIME_LIMIT`` + - Maximum allowed test duration (in seconds). ``0`` = no limit. + - ``0`` + + * - ``SWEEP_XGMI_MIN`` + - Minimum XGMI hops for transfers. + - ``0`` + + * - ``SWEEP_XGMI_MAX`` + - Maximum allowed XGMI hops. ``-1`` = no limit. + - ``-1`` + +Example output +--------------- + +.. 
code-block:: shell + + [Sweep Related] + CONTINUE_ON_ERROR = 0 : Stop after first error + NUM_CPU_DEVICES = 2 : Using 2 CPUs + NUM_CPU_SE = 4 : Using 4 CPU threads per CPU executed Transfer + NUM_GPU_DEVICES = 8 : Using 8 GPUs + NUM_GPU_SE = 4 : Using 4 subExecutors/CUs per GPU executed Transfer + SWEEP_DST = CG : Destination Memory Types to sweep + SWEEP_EXE = CDG : Executor Types to sweep + SWEEP_FILE = /tmp/lastSweep.cfg : File to store the executing sweep configuration + SWEEP_MAX = 24 : Max simultaneous transfers (0 = no limit) + SWEEP_MIN = 1 : Min simultaenous transfers + SWEEP_RAND_BYTES = 0 : Using constant number of bytes per Transfer + SWEEP_SEED = 1773692223 : Random seed set to 1773692223 + SWEEP_SRC = CG : Source Memory Types to sweep + SWEEP_TEST_LIMIT = 0 : Max number of tests to run during sweep (0 = no limit) + SWEEP_TIME_LIMIT = 0 : Max number of seconds to run sweep for (0 = no limit) + SWEEP_XGMI_MAX = -1 : Max number of XGMI hops for Transfers (-1 = no limit) + SWEEP_XGMI_MIN = 0 : Min number of XGMI hops for Transfers + + Sweep configuration saved to: /tmp/lastSweep.cfg + Test 1: + -------------------┬--------------┬------------┬-------------------┬-------------------- + Executor: CPU 00 │ 30.660 GB/s │ 8.755 ms │ 268435456 bytes │ 30.847 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 0 │ 30.847 GB/s │ 8.702 ms │ 268435456 bytes │ C1 -> C0:4 -> G6 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: GPU 01 │ 38.662 GB/s │ 6.943 ms │ 268435456 bytes │ 38.669 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 8 │ 38.669 GB/s │ 6.942 ms │ 268435456 bytes │ G2 -> G1:4 -> G1 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: GPU 02 │ 61.598 GB/s │ 4.358 ms │ 268435456 bytes │ 61.615 GB/s (sum) + 
-------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 9 │ 61.615 GB/s │ 4.357 ms │ 268435456 bytes │ G2 -> G2:4 -> G0 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: GPU 03 │ 38.816 GB/s │ 6.916 ms │ 268435456 bytes │ 38.826 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 10 │ 38.826 GB/s │ 6.914 ms │ 268435456 bytes │ G2 -> G3:4 -> G7 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: GPU 06 │ 44.298 GB/s │ 12.120 ms │ 536870912 bytes │ 58.182 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 11 │ 22.151 GB/s │ 12.118 ms │ 268435456 bytes │ G1 -> G6:4 -> C1 + Transfer 12 │ 36.030 GB/s │ 7.450 ms │ 268435456 bytes │ G2 -> G6:4 -> G5 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: GPU 07 │ 37.963 GB/s │ 7.071 ms │ 268435456 bytes │ 37.969 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 13 │ 37.969 GB/s │ 7.070 ms │ 268435456 bytes │ G4 -> G7:4 -> G6 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: DMA 01 │ 43.428 GB/s │ 12.362 ms │ 536870912 bytes │ 77.585 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 1 │ 55.481 GB/s │ 4.838 ms │ 268435456 bytes │ C0 -> D1:4 -> G0 + Transfer 2 │ 22.105 GB/s │ 12.144 ms │ 268435456 bytes │ G7 -> D1:4 -> C1 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: DMA 03 │ 31.427 GB/s │ 8.541 ms │ 268435456 bytes │ 32.353 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 3 │ 32.353 GB/s │ 8.297 ms │ 268435456 bytes │ 
G4 -> D3:4 -> G6 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: DMA 04 │ 22.214 GB/s │ 12.084 ms │ 268435456 bytes │ 22.536 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 4 │ 22.536 GB/s │ 11.912 ms │ 268435456 bytes │ C1 -> D4:4 -> G1 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: DMA 06 │ 53.665 GB/s │ 10.004 ms │ 536870912 bytes │ 72.749 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 5 │ 27.768 GB/s │ 9.667 ms │ 268435456 bytes │ G2 -> D6:4 -> C0 + Transfer 6 │ 44.981 GB/s │ 5.968 ms │ 268435456 bytes │ G3 -> D6:4 -> G2 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Executor: DMA 07 │ 57.440 GB/s │ 4.673 ms │ 268435456 bytes │ 60.131 GB/s (sum) + -------------------┼--------------┼------------┼-------------------┼-------------------- + Transfer 7 │ 60.131 GB/s │ 4.464 ms │ 268435456 bytes │ G7 -> D7:4 -> G5 + -------------------┼--------------┼------------┼-------------------┼-------------------- + Aggregate (CPU) │ 295.108 GB/s │ 12.735 ms │ 3758096384 bytes │ Overhead 0.372 ms + -------------------┴--------------┴------------┴-------------------┴-------------------- + +The exact format depends on ``OUTPUT_TO_CSV`` and ``PrintResults``. Typically shows test number, transfer count, bandwidth, and timing per test. diff --git a/docs/reference/transfer-definition-syntax.rst b/docs/reference/transfer-definition-syntax.rst new file mode 100644 index 00000000..694f6ab1 --- /dev/null +++ b/docs/reference/transfer-definition-syntax.rst @@ -0,0 +1,267 @@ +.. meta:: + :description: Reference for TransferBench transfer definition syntax, including simple and advanced modes, memory and executor letter codes, and wildcard syntax. 
+ :keywords: TransferBench transfer definition, TransferBench syntax, TransferBench memory types, TransferBench executor types, TransferBench wildcards + +.. _transfer-definition-syntax: + +Transfer definition syntax +========================== + +A transfer is a single operation where an executor reads and adds values from source (SRC) memory, then writes the sum to destination (DST) memory. When a transfer has a single SRC and a single DST, it is a copy operation. + +TransferBench supports two modes for defining transfers: simple mode and advanced mode. + +Simple mode +----------- + +Use simple mode when all transfers in a test share the same number of subexecutors. The format is: + +.. code-block:: shell + + #Transfers #SEs (srcMem1 Executor1 dstMem1) ... (srcMemL ExecutorL dstMemL) + +The format uses the following fields: + +- ``#Transfers``: A positive integer that specifies the number of parallel transfers. +- ``#SEs``: The number of subexecutors (CUs, threads, or queue pairs) used by all transfers. +- ``(srcMem Executor dstMem)``: A triplet that describes one transfer. + +The transfer size comes from the ``num_bytes`` command-line argument. + +The following examples show valid simple mode definitions: + +.. code-block:: shell + + 1 4 (G0->G0->G1) # Uses 4 CUs on GPU 0 to copy from GPU 0 to GPU 1 + 2 4 (G0 G0 G1) (G1 G1 G0) # Two parallel transfers: GPU 0 to GPU 1 and GPU 1 to GPU 0 + 1 4 (G0->G0->G0G1G2G3) # Reads GPU 0 then writes to GPU 0, GPU 1, GPU 2, and GPU 3 + +Advanced mode +------------- + +Use advanced mode when transfers in a test require different subexecutor counts or sizes. The format is: + +.. code-block:: shell + + -#Transfers (srcMem1 Executor1 dstMem1 #SEs1 Bytes1) ... (srcMemL ExecutorL dstMemL #SEsL BytesL) + +The format uses the following fields: + +- ``-#Transfers``: A negative integer that specifies the number of parallel transfers. +- ``(srcMem Executor dstMem #SEs Bytes)``: A quintuplet that describes one transfer. 
+- ``Bytes``: The per-transfer size. Set to ``0`` to use the command-line ``num_bytes`` value. You can suffix this value with ``K``, ``M``, or ``G``. A non-zero value overrides the command-line value for that transfer only. + +The following example runs two transfers with different sizes and subexecutor counts: + +.. code-block:: shell + + -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) # 1 MB GPU 0 to GPU 1 with 4 CUs; 2 MB GPU 1 to GPU 0 with 2 CUs + +Memory and executor letter codes +================================== + +Memory locations and executors are written as a letter that indicates the memory or executor type, followed by an index. The following tables list what each letter stands for. + +Memory location letters +------------------------ + +Memory locations use the format ``[R<Rank>]<MemType><Index>``, where ``MemType`` is a single letter and ``Index`` is the zero-based device index. If you omit the rank prefix ``R``, TransferBench uses the local rank. + +The following table lists the supported memory type letters: + +.. list-table:: + :header-rows: 1 + + * - Letter + - Memory type + - Indexed by + + * - ``C`` + - Coarse-grained pinned host + - NUMA node (0 to #NUMA - 1) + + * - ``P`` + - Pinned host (closest to GPU) + - GPU index (0 to #GPUs - 1) + + * - ``B`` + - Coherent pinned host + - NUMA node + + * - ``D`` + - Non-coherent pinned host + - NUMA node + + * - ``K`` + - Uncached pinned host + - NUMA node + + * - ``H`` + - Unpinned host + - NUMA node + + * - ``G`` + - Coarse-grained global device + - GPU (0 to #GPUs - 1) + + * - ``F`` + - Fine-grained device + - GPU + + * - ``U`` + - Uncached device + - GPU + + * - ``M`` + - Managed memory + - GPU + + * - ``N`` + - Null (empty). Use this to denote read-only or write-only transfers. + - Index ignored. + +You can concatenate multiple memory locations for broadcast or reduce operations. For example, ``G0G1`` refers to GPU 0 and GPU 1. + +Executor letters +----------------- + +Executors use the format ``[R<Rank>]<ExeType><Index>[Slot][.<SubIndex>][SubSlot]``.
+ +The following table lists the supported executor type letters: + +.. list-table:: + :header-rows: 1 + + * - Letter + - Executor type + - Subexecutor + - Indexed by + + * - ``C`` + - CPU + - CPU thread + - NUMA node (0 to #NUMA - 1) + + * - ``G`` + - GPU (GFX/kernel) + - Threadblock/CU + - GPU (0 to #GPUs - 1) + + * - ``D`` + - DMA (SDMA) + - N/A (streams) + - GPU + + * - ``I`` + - NIC RDMA + - Queue pair + - NIC index. Requires ``.SubIndex`` (for example, ``I0.2``). + + * - ``N`` + - Nearest NIC + - Queue pair + - GPU. Uses the closest NIC for SRC and DST. SubIndex is optional. + +The optional executor fields have the following meanings: + +- ``Slot`` (``A``, ``B``, ...): Selects which closest NIC to use, where ``A`` is the first and ``B`` is the second, and so on. Used for ``EXE_NIC_NEAREST``. +- ``SubIndex`` (after ``.``): Specifies the queue pair or sub-index for the NIC. +- ``SubSlot``: Specifies the alpha range for sub-slot selection. + +Wildcards +========== + +Wildcards expand a single configuration file line into multiple concrete transfers at runtime. + +Numeric wildcards +------------------ + +Numeric wildcards apply to ranks (``R``), device indices (for example, ``G0`` or ``C1``), and executor indices. + +The following table describes the supported numeric wildcard syntax: + +.. list-table:: + :header-rows: 1 + + * - Syntax + - Meaning + - Example + + * - ``*`` + - All values in range + - ``R*G0``: GPU 0 on all ranks + + * - ``[n]`` + - Single value + - ``R[2]G0``: GPU 0 on rank 2 + + * - ``[n,m,...]`` + - Comma-separated list + - ``G[0,2,4]``: GPUs 0, 2, and 4 + + * - ``[start..end]`` + - Inclusive range + - ``G[1..3]``: GPUs 1, 2, and 3 + +The following list shows additional examples: + +- ``G*``: All GPUs (for example, G0, G1, G2, ...) +- ``G[0,2,4]``: GPUs 0, 2, and 4 +- ``R[1..3]G0``: GPU 0 on ranks 1, 2, and 3 + +Alpha wildcards +---------------- + +Alpha wildcards apply to executor slots and subslots (``A``, ``B``, ``C``, ...). 
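The bracket forms have the same shape for numeric and alpha wildcards. As a minimal sketch of how one wildcard token could be expanded (an illustration only, not TransferBench's parser; ``expand_wildcard`` and its ``available`` argument are hypothetical names), consider:

```python
def _span(start, end):
    """Expand an inclusive range of integers ("1".."3") or letters ("A".."C")."""
    if start.isdigit() and end.isdigit():
        return list(range(int(start), int(end) + 1))
    return [chr(code) for code in range(ord(start), ord(end) + 1)]


def _single(value):
    """Convert one list entry to an int index or keep it as a slot letter."""
    return int(value) if value.isdigit() else value


def expand_wildcard(token, available):
    """Expand one wildcard token into concrete values.

    token:     "*", a bare value ("3", "A"), or a bracket form such as
               "[2]", "[0,2,4]", "[1..3]", "[A..C]", "[A,D,F]".
    available: the values that "*" resolves to, known only at runtime.
    """
    if token == "*":
        return list(available)
    bracketed = token.startswith("[") and token.endswith("]")
    body = token[1:-1] if bracketed else token
    values = []
    for part in body.split(","):
        if ".." in part:
            start, end = part.split("..")
            values.extend(_span(start, end))
        else:
            values.append(_single(part))
    return values


# Assume a node with 8 GPUs for the "*" case.
print(expand_wildcard("[1..3]", range(8)))
print(expand_wildcard("[A,D,F]", "ABCDEF"))
```

A full parser would also have to split a token such as ``R[1..3]G0`` into its rank, type letter, and index parts before expanding each one; that step is left out here.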
+ +The following table describes the supported alpha wildcard syntax: + +.. list-table:: + :header-rows: 1 + + * - Syntax + - Meaning + - Selects + + * - ``*`` + - Full wildcard, resolved at runtime + - All available slots + + * - ``[A]`` or ``A`` + - Single letter + - First closest NIC + + * - ``[A..C]`` + - Letter range + - Slots A, B, and C + + * - ``[A,D,F]`` + - Comma-separated letters + - Slots A, D, and F + +Rank prefix (multinode) +------------------------- + +The ``R`` prefix is optional and specifies the rank index for a memory location or executor: + +- ``R2G3``: GPU 3 on rank 2. +- ``G3`` (no ``R``): GPU 3 on the local rank. + +The rank prefix accepts the same numeric wildcards as device indices, such as ``R*`` and ``R[0..1]``. + +Nearest NIC wildcard +--------------------- + +For the ``N`` (Nearest NIC) executor, you can omit the executor index and sub-index. TransferBench resolves the correct NICs for the SRC and DST memory locations at runtime. + +For example, consider the following definition: + +.. code-block:: shell + + (R0G0 N R2G4) + +This definition expands to: + +..
code-block:: shell + + (R0G0 N0.4 R2G4) diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in index 3f09d2e5..4a0c1e16 100644 --- a/docs/sphinx/_toc.yml.in +++ b/docs/sphinx/_toc.yml.in @@ -5,16 +5,24 @@ subtrees: - caption: Install entries: - file: install/install.rst - title: Installation -- caption: API reference +- caption: How to entries: - - file: reference/api.rst - title: API library + - file: how to/running-transferbench-customized.rst + title: Run custom tests using TransferBench -- caption: How to +- caption: Conceptual + entries: + - file: conceptual/transferbench-workflow.rst + - file: conceptual/transferbench-timing.rst + - file: conceptual/transferbench-data-validation.rst + +- caption: Reference entries: - - file: how to/use-transferbench.rst + - file: reference/presets.rst + - file: reference/transfer-definition-syntax.rst + - file: reference/environment-variables.rst + - file: reference/faq.rst - caption: About entries: