diff --git a/LICENSE.md b/LICENSE.md
index eb08d9e8..3d5a69a6 100644
--- a/LICENSE.md
+++ b/LICENSE.md
@@ -1,4 +1,4 @@
-Copyright (c) 2019-2025 Advanced Micro Devices, Inc. All rights reserved.
+Copyright (c) 2019-2026 Advanced Micro Devices, Inc. All rights reserved.
 
 Permission is hereby granted, free of charge, to any person obtaining a copy
 of this software and associated documentation files (the "Software"), to deal
diff --git a/docs/conceptual/transferbench-data-validation.rst b/docs/conceptual/transferbench-data-validation.rst
new file mode 100644
index 00000000..622909b5
--- /dev/null
+++ b/docs/conceptual/transferbench-data-validation.rst
@@ -0,0 +1,158 @@
+.. meta::
+  :description: Explains how TransferBench validates transfer correctness by comparing destination memory against precomputed expected values derived from source buffers.
+  :keywords: TransferBench data validation, TransferBench correctness, ValidateAllTransfers, PrepareReference, destination buffer, source buffer
+
+.. _transferbench-data-validation:
+
+==============================
+TransferBench data validation
+==============================
+
+TransferBench validates the transfer results by comparing the destination (DST) memory to
+precomputed expected values.
+
+Overview
+=========
+
+Validation verifies that, for each transfer, the DST buffer contains the expected value:
+the element-wise sum of all source (SRC) buffers (or the zero or memset fill value when there are no sources).
+A transfer is correct if, for every element ``i``, the value matches the expected value given in the following table:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Number of sources
+      - Expected value
+
+    * - 0 sources
+      - ``dst[i] == 0`` (or memset value)
+
+    * - 1 source
+      - ``dst[i] == src0[i]``
+
+    * - N sources
+      - ``dst[i] == src0[i] + src1[i] + ... + srcN-1[i]``
+
+Source data preparation
+=======================
+
+TransferBench prepares the SRC and DST memories as described in the following sections.
+
+Expected source pattern (``PrepareReference``)
+-----------------------------------------------
+
+Before any transfers run, TransferBench builds reference SRC buffers on the host using
+``PrepareReference(cfg, cpuBuffer, bufferIdx)``.
+
+The pattern used depends on the configuration:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Configuration
+      - Behavior
+
+    * - ``fillCompress`` (non-empty)
+      - Mix of random floats with optional zeroing per 64-byte line:
+        ``0`` = random, ``1`` = 1B0, ``2`` = 2B0, ``3`` = 4B0, ``4`` = 32B0.
+        Percentages control the mix. For details, see
+        :ref:`data-validation-var`.
+
+    * - ``fillPattern`` (non-empty)
+      - Repeats the given ``vector<float>`` over all SRC buffers.
+
+    * - Default
+      - Pseudo-random: ``PrepSrcValue(bufferIdx, i) = (((i % 383) * 517) % 383 + 31) * (bufferIdx + 1)``
+
+        ``bufferIdx`` is the SRC index (0, 1, …) so each SRC buffer gets a different pattern.
+
+Expected destination (``dstReference``)
+----------------------------------------
+
+The expected destination is computed once before the iteration loop:
+
+.. code-block:: text
+
+  dstReference[0] = memset to MEMSET_CHAR          (used when numSrcs == 0)
+  dstReference[1] = srcReference[0]                (1 source)
+  dstReference[2] = dstReference[1] + srcReference[1]  (2 sources)
+  dstReference[k] = dstReference[k-1] + srcReference[k-1]  (k sources)
+
+``dstReference[numSrcs]`` is the expected result for a transfer with ``numSrcs`` sources.
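+
+As a rough illustration only, the default pattern and the running sums above can be sketched in Python. The helper names ``prep_src_value`` and ``build_dst_reference`` are hypothetical and are not part of TransferBench; the zero-source entry is left as ``None`` because the memset value is not modeled here.
+
+.. code-block:: python
+
+  def prep_src_value(buffer_idx, i):
+      # Default pattern from the table above; buffer_idx is the SRC index.
+      return (((i % 383) * 517) % 383 + 31) * (buffer_idx + 1)
+
+  def build_dst_reference(num_srcs, num_elems):
+      # src_reference[k] holds the expected contents of SRC buffer k.
+      src_reference = [[float(prep_src_value(k, i)) for i in range(num_elems)]
+                       for k in range(num_srcs)]
+      # Index 0 is the memset case, which this sketch does not model.
+      dst_reference = [None]
+      for k in range(1, num_srcs + 1):
+          prev = [0.0] * num_elems if k == 1 else dst_reference[k - 1]
+          # dst_reference[k] = dst_reference[k-1] + src_reference[k-1]
+          dst_reference.append([p + s for p, s in zip(prev, src_reference[k - 1])])
+      return dst_reference
+
+Under these assumptions, ``build_dst_reference(2, 4)[2]`` would be the expected DST contents for a transfer with two sources.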
+
+Initializing source and destination memories
+---------------------------------------------
+
+For each transfer, the SRC memory on the rank that owns it is filled from the corresponding
+``srcReference`` buffer via ``hipMemcpy`` (host-to-device or device-to-device as appropriate).
+DST memory is zeroed (or memset) before transfers run.
+
+How validation is timed
+========================
+
+The timing of validation is controlled by the ``alwaysValidate`` option. By default
+(``alwaysValidate = 0``), validation runs once after all timed iterations complete,
+minimizing overhead during benchmarking. When ``alwaysValidate = 1``, validation is
+performed after every iteration; any detected error immediately stops the run.
+
+.. list-table::
+    :header-rows: 1
+
+    * - Option
+      - When
+      - Behavior
+
+    * - ``alwaysValidate = 0`` (default)
+      - Once at the end of all iterations
+      - ``ValidateAllTransfers`` called after the iteration loop.
+
+    * - ``alwaysValidate = 1``
+      - After every timed iteration
+      - ``ValidateAllTransfers`` called inside the loop; any error stops the run.
+
+How validation (``ValidateAllTransfers``) works
+================================================
+
+For each transfer and each DST, the following steps are performed:
+
+1. **Rank check:** Only the rank that owns the destination performs validation.
+
+2. **Getting actual output:**
+
+   - **CPU destination** or ``validateDirect = 1``: Point directly at the destination memory.
+   - **GPU destination** and ``validateDirect = 0``: Copy destination to a host ``outputBuffer``
+     via ``hipMemcpy``, then compare against ``outputBuffer``.
+
+3. **Comparison:** Performed using ``memcmp(output, expected, numBytes)``. On mismatch, the code finds the first differing index and returns an error with the index, expected value, and actual value.
+
+4. **Expected values:** The comparison target is ``expected = dstReference[t.srcs.size()].data()``, which is the precomputed sum for the transfer's number of sources.
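+
+The comparison step can be sketched as follows. This is a simplified Python stand-in, not the TransferBench implementation: the function name ``validate`` is hypothetical, and a list equality check takes the place of ``memcmp``.
+
+.. code-block:: python
+
+  def validate(output, expected):
+      # Fast path: whole-buffer equality, standing in for memcmp.
+      if output == expected:
+          return None
+      # On mismatch, report the first differing index with both values.
+      for i, (actual, want) in enumerate(zip(output, expected)):
+          if actual != want:
+              return (i, want, actual)
+      # Only reachable when the two buffers differ in length.
+      return (min(len(output), len(expected)), None, None)
+
+A return value of ``None`` means the buffers match; otherwise the tuple holds the index, the expected value, and the actual value.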
+
+Validation options
+==================
+
+The following options control when and how validation is performed. They can be set as
+environment variables or in a configuration file.
+
+.. list-table::
+    :header-rows: 1
+
+    * - Option
+      - Environment variable
+      - Description
+
+    * - ``alwaysValidate``
+      - ``ALWAYS_VALIDATE``
+      - To validate after each iteration, set to ``1``. To validate once at the end, set to ``0``.
+
+    * - ``validateDirect``
+      - ``VALIDATE_DIRECT``
+      - To compare GPU DST directly, set to ``1``. Supported on AMD hardware only; no host copy.
+        To copy to host and compare, set to ``0``.
+
+    * - ``validateSource``
+      - ``VALIDATE_SOURCE``
+      - To validate the SRC memory right after it is initialized, set to ``1`` (optional early check).
+
+.. note::
+
+  ``validateDirect`` is not supported on NVIDIA. The code falls back to copying to host.
diff --git a/docs/conceptual/transferbench-timing.rst b/docs/conceptual/transferbench-timing.rst
new file mode 100644
index 00000000..bc6852e1
--- /dev/null
+++ b/docs/conceptual/transferbench-timing.rst
@@ -0,0 +1,146 @@
+.. meta::
+  :description: Explains how TransferBench measures performance at the test, executor, and transfer levels using HIP events and CPU wall-clock timing.
+  :keywords: TransferBench timing, TransferBench measurement, HIP events, CPU wall-clock, executor timing, transfer timing, overhead
+
+.. _transferbench-timing:
+
+====================
+TransferBench timing
+====================
+
+TransferBench measures performance at three nested levels: Test, Executor, and Transfer. Each level
+captures a different scope of elapsed time, and the timing method used depends on the executor type.
+
+Timing levels
+=============
+
+The following diagram illustrates the three levels of timing:
+
+.. image:: /data/timing.png
+  :width: 100%
+  :align: center
+
+The following table provides a quick summary of the three timing levels:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Timing level
+      - What it measures
+      - How it is timed
+
+    * - Test
+      - All Transfers across all executors and all ranks
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - Executor
+      - All Transfers that run on this executor
+      - Varies by executor type (see :ref:`timing-methods`)
+
+    * - Transfer
+      - A single Transfer
+      - Varies by executor type (see :ref:`timing-methods`)
+
+.. _timing-methods:
+
+Timing methods
+==============
+
+The timing method used for each executor and transfer depends on the executor type and the value of
+``USE_HIP_EVENTS``.
+
+Executor timing
+---------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Executor type
+      - Timing method
+
+    * - CPU
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - GFX / DMA
+      - For ``USE_HIP_EVENTS=1`` (default): HIP events (``hipEventElapsedTime``)
+
+        For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - NIC
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+Transfer timing
+---------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Executor type
+      - Timing method
+
+    * - CPU
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - GFX
+      - For ``USE_HIP_EVENTS=1`` (default): GPU wall-clock timestamp (``wall_clock64()``)
+
+        For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - DMA
+      - For ``USE_HIP_EVENTS=1`` (default): HIP events (``hipEventElapsedTime``)
+
+        For ``USE_HIP_EVENTS=0``: CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+    * - NIC
+      - CPU wall-clock (``std::chrono::high_resolution_clock``)
+
+Overhead
+========
+
+Overhead is the difference between the total CPU wall-clock time (Test time) and the elapsed time of
+the slowest executor:
+
+.. code-block:: text
+
+  Overhead = Test Time - MAX(Executor 0 Time, Executor 1 Time, ...)
+
+Overhead captures scheduling and synchronization costs that fall outside of executor-measured time,
+such as barrier waits and thread management.
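+
+The formula is a single subtraction, shown here as a small Python helper. The name ``overhead_ms`` and the sample numbers are illustrative only and do not come from TransferBench.
+
+.. code-block:: python
+
+  def overhead_ms(test_time_ms, executor_times_ms):
+      # Overhead = Test Time - MAX(Executor 0 Time, Executor 1 Time, ...)
+      return test_time_ms - max(executor_times_ms)
+
+  # Hypothetical test: 10.0 ms wall-clock, executors took 7.5 ms and 2.0 ms.
+  sample = overhead_ms(10.0, [7.5, 2.0])
+
+Only the slowest executor matters, because executors run concurrently.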
+
+Example output
+==============
+
+The following example shows TransferBench output for a test with two executors (CPU and GPU) and
+four transfers:
+
+.. code-block:: text
+
+  Test 1:
+  -------------------┬--------------┬------------┬-------------------┬--------------------
+  Executor: CPU 00   │  0.027 GB/s  │  77.492 ms │    2097152 bytes  │  4.489 GB/s (sum)
+  Executor 0 Time = 77.492 ms
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+      Transfer 0     │  4.476 GB/s  │   0.234 ms │    1048576 bytes  │  C0 -> C0:4 -> N
+  Transfer 0 Time =   0.234 ms
+      Transfer 1     │  0.014 GB/s  │  77.359 ms │    1048576 bytes  │  G0 -> C0:4 -> N
+  Transfer 1 Time =  77.359 ms
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+  Executor: GPU 00   │ 97.436 GB/s  │   0.689 ms │   67108864 bytes  │ 129.692 GB/s (sum)
+  Executor 1 Time = 0.689 ms
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+      Transfer 2     │ 80.886 GB/s  │   0.415 ms │   33554432 bytes  │  G0 -> G0:4 -> G0
+  Transfer 2 Time =   0.415 ms
+      Transfer 3     │ 48.807 GB/s  │   0.687 ms │   33554432 bytes  │  G0 -> G0:4 -> G1
+  Transfer 3 Time =   0.687 ms
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+  Aggregate (CPU)    │  0.891 GB/s  │  77.688 ms │   69206016 bytes  │  Overhead 0.197 ms
+  Test Time     = 77.688 ms
+  -------------------┴--------------┴------------┴-------------------┴--------------------
+  Overhead      = 77.688 - MAX(77.492, 0.689) = 0.197 ms
+
+In this example:
+
+- **Executor 0** (CPU) runs Transfers 0 and 1 and takes 77.492 ms (dominated by Transfer 1 at 77.359 ms).
+- **Executor 1** (GPU) runs Transfers 2 and 3 and takes 0.689 ms.
+- **Test Time** is 77.688 ms, measured by the CPU wall-clock across all executors.
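+- **Overhead** is reported as 0.197 ms, calculated as ``77.688 - MAX(77.492, 0.689)``. Recomputing from the rounded values shown gives 0.196 ms; the small difference is presumably because the displayed times are rounded.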
diff --git a/docs/conceptual/transferbench-workflow.rst b/docs/conceptual/transferbench-workflow.rst
new file mode 100644
index 00000000..d1f41eee
--- /dev/null
+++ b/docs/conceptual/transferbench-workflow.rst
@@ -0,0 +1,117 @@
+.. meta::
+  :description: Explains the TransferBench internal workflow, including how presets and config files feed into RunTransfers() and how transfers are executed and reported.
+  :keywords: TransferBench workflow, RunTransfers, TransferBench internals, TransferBench architecture, TransferBench conceptual
+
+.. _transferbench-workflow:
+
+=======================
+TransferBench workflow
+=======================
+
+Transfers enter the system through either a preset or a configuration (config) file. Both paths ultimately call
+``RunTransfers()``, the main utility function within TransferBench.
+
+Entry points
+============
+
+.. list-table::
+    :header-rows: 1
+
+    * - Source
+      - Flow
+
+    * - Presets
+      - The client selects a preset (for example, ``p2p``, ``nicp2p``, or ``sweep``).
+        The preset builds a ``std::vector<Transfer>`` and calls
+        ``TransferBench::RunTransfers(cfg, transfers, results)``.
+
+    * - Config file or command line
+      - The client parses transfers from a config file or command-line arguments via
+        ``ParseTransfers()``, then calls ``RunTransfers()``.
+
+RunTransfers workflow
+=====================
+
+``RunTransfers()`` is the main entry point into the backend TransferBench library:
+
+.. code-block:: cpp
+
+    /**
+     * Run a set of Transfers
+     *
+     * @param[in] config     Configuration options
+     * @param[in] transfers  Set of Transfers to execute
+     * @param[out] results   Timing results
+     * @returns true if and only if Transfers were run successfully without any fatal errors
+     */
+    bool RunTransfers(ConfigOptions const& config,
+                      vector<Transfer> const& transfers,
+                      TestResults& results);
+
+As shown in the following diagram, the function executes four sequential phases:
+
+.. image:: /data/workflow.png
+  :width: 100%
+  :align: center
+
+1. **Initial validation:** Checks that inputs are consistent and valid.
+2. **Prepare transfers:** Allocates resources and initializes memory.
+3. **Iteration loop:** Runs the timed transfer iterations.
+4. **Finalize:** Validates results and assembles output.
+
+Initial validation
+------------------
+
+Here are the steps involved in the first phase:
+
+1. **Check ConfigOptions:** Verify that the provided ``ConfigOptions`` are valid. ``ConfigOptions`` control how TransferBench runs (for example, GFX unroll factor and number of warmup iterations). When running in multinode mode, consistency across ranks is also checked.
+
+2. **Check transfers:** Verify that the provided ``Transfers`` are properly specified. Checks include confirming that requested devices exist and that each transfer has an appropriate number of source (SRC) and destination (DST) endpoints. When running in multinode mode, consistency across ranks is also checked.
+
+3. **Log transfers (optional):** If ``TB_DUMP_CFG_FILE`` is set, log the transfers to a config file that can be re-executed by TransferBench. This is useful for capturing the exact transfers run by a preset so they can be modified and replayed.
+
+Prepare transfers
+-----------------
+
+Here are the steps involved in the second phase:
+
+1. **Prepare executors:** Perform all executor-specific setup, such as creating HIP streams, allocating SRC and DST memory, and exchanging fabric handles for pod communication support. This step also divides the work across subexecutors.
+
+2. **Initialize memory:** Initialize SRC memory buffers with data patterns and compute the reference results used for later validation.
+
+   .. note::
+
+    For GPU SRC memory locations, data is copied onto the GPUs via DMA (``hipMemcpy``). If profiling, this copy appears as part of the profiling trace, which is why setting ``USE_INTERACTIVE`` = 1 is recommended when profiling.
+
+3. **Optional pause:** When ``USE_INTERACTIVE`` = 1, TransferBench pauses for user input after all memory has been initialized. Virtual addresses are printed at this point, which is useful for attaching a profiler before any transfers execute.
+
+Iteration loop
+--------------
+
+In the third phase, the iteration loop runs for the number of iterations specified by ``NUM_ITERATIONS``.
+Each iteration proceeds through the following steps:
+
+1. **Barrier (pre):** Synchronizes all ranks before transfers begin, ensuring that every rank is ready before any rank starts executing transfers.
+
+2. **Start CPU timing:** Starts a CPU timer on the current rank, capturing the total elapsed time across all transfers on this rank.
+
+3. **Execute:** Spawns one CPU thread per executor. Each executor runs all the transfers it is assigned and is responsible for its own per-transfer timing.
+
+4. **Barrier (post):** Waits for all executors across all ranks to finish before proceeding.
+
+5. **Stop CPU timing:** Stops the CPU timer.
+
+6. **Validate (optional):** If ``ALWAYS_VALIDATE`` = 1, performs a correctness check after each iteration to verify that destination memory matches the expected reference results.
+
+   .. note::
+
+    By default, validation runs only once after all iterations complete. Setting ``ALWAYS_VALIDATE`` = 1 validates after every iteration, which can help detect transient errors that would otherwise be masked by a passing final iteration.
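+
+The timing structure of one iteration can be sketched in Python as follows. This is a single-rank approximation under stated assumptions: ``run_iteration`` is a hypothetical name, each executor is modeled as a plain callable, and joining the threads stands in for the post barrier. Cross-rank barriers and validation are not modeled.
+
+.. code-block:: python
+
+    import threading
+    import time
+
+    def run_iteration(executors):
+        # executors: list of zero-argument callables, one per executor.
+        executor_ms = [0.0] * len(executors)
+
+        def worker(idx, fn):
+            # Each executor is responsible for its own timing.
+            start = time.perf_counter()
+            fn()
+            executor_ms[idx] = (time.perf_counter() - start) * 1000.0
+
+        threads = [threading.Thread(target=worker, args=(i, fn))
+                   for i, fn in enumerate(executors)]
+        test_start = time.perf_counter()   # start CPU timing
+        for t in threads:
+            t.start()                      # one CPU thread per executor
+        for t in threads:
+            t.join()                       # stand-in for the post barrier
+        test_ms = (time.perf_counter() - test_start) * 1000.0  # stop CPU timing
+        return test_ms, executor_ms
+
+Because the test timer brackets every executor thread, the returned test time is never smaller than the slowest executor time.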
+
+Finalize
+--------
+
+Here are the steps involved in the last phase:
+
+1. **Validate all transfers:** Checks all transfers to confirm that the DST memory matches the expected reference results computed during the Initialize Memory step.
+
+2. **Prepare results:** Collects timing data from each executor and assembles the final ``TestResults`` output returned to the caller.
diff --git a/docs/data/a2a_MI300X.png b/docs/data/a2a_MI300X.png
new file mode 100644
index 00000000..ba2d9680
Binary files /dev/null and b/docs/data/a2a_MI300X.png differ
diff --git a/docs/data/a2a_MI350X.png b/docs/data/a2a_MI350X.png
new file mode 100644
index 00000000..42948227
Binary files /dev/null and b/docs/data/a2a_MI350X.png differ
diff --git a/docs/data/a2a_serialization.png b/docs/data/a2a_serialization.png
new file mode 100644
index 00000000..01fcb02f
Binary files /dev/null and b/docs/data/a2a_serialization.png differ
diff --git a/docs/data/a2asweep_MI300X.png b/docs/data/a2asweep_MI300X.png
new file mode 100644
index 00000000..0928dc1e
Binary files /dev/null and b/docs/data/a2asweep_MI300X.png differ
diff --git a/docs/data/a2asweep_MI350X.png b/docs/data/a2asweep_MI350X.png
new file mode 100644
index 00000000..159aa30f
Binary files /dev/null and b/docs/data/a2asweep_MI350X.png differ
diff --git a/docs/data/nicrings.png b/docs/data/nicrings.png
new file mode 100644
index 00000000..b8c966c7
Binary files /dev/null and b/docs/data/nicrings.png differ
diff --git a/docs/data/nicrings_MI350X.png b/docs/data/nicrings_MI350X.png
new file mode 100644
index 00000000..2d2ea021
Binary files /dev/null and b/docs/data/nicrings_MI350X.png differ
diff --git a/docs/data/schmoo_MI300X.png b/docs/data/schmoo_MI300X.png
new file mode 100644
index 00000000..b3490810
Binary files /dev/null and b/docs/data/schmoo_MI300X.png differ
diff --git a/docs/data/schmoo_MI350X.png b/docs/data/schmoo_MI350X.png
new file mode 100644
index 00000000..609cf595
Binary files /dev/null and b/docs/data/schmoo_MI350X.png differ
diff --git a/docs/data/timing.png b/docs/data/timing.png
new file mode 100644
index 00000000..8fd6b407
Binary files /dev/null and b/docs/data/timing.png differ
diff --git a/docs/data/workflow.png b/docs/data/workflow.png
new file mode 100644
index 00000000..32dd227c
Binary files /dev/null and b/docs/data/workflow.png differ
diff --git a/docs/how to/running-transferbench-customized.rst b/docs/how to/running-transferbench-customized.rst
new file mode 100644
index 00000000..dafd2779
--- /dev/null
+++ b/docs/how to/running-transferbench-customized.rst	
@@ -0,0 +1,131 @@
+.. meta::
+  :description: How to run custom transfer tests using TransferBench by defining them in a configuration file or on the command line, including wildcard syntax and examples.
+  :keywords: TransferBench usage, TransferBench how to, TransferBench configuration file, TransferBench custom transfers, TransferBench wildcards
+
+.. _running-transferbench-customized:
+
+========================================
+Running custom tests using TransferBench
+========================================
+
+You can run custom transfer tests using TransferBench by defining them in a configuration file. This topic describes the configuration file format and how to run tests using a file or the command line.
+
+.. seealso::
+
+  :ref:`transfer-definition-syntax` — complete reference for simple and advanced mode syntax, memory and executor letter codes, and wildcards.
+
+Running TransferBench with a configuration file
+================================================
+
+To run TransferBench with a configuration file, use:
+
+.. code-block:: shell
+
+    ./TransferBench <config_file> [num_bytes]
+
+The command accepts the following arguments:
+
+- ``config_file``: Path to a configuration file that defines the transfers to run.
+
+- ``num_bytes``: Number of bytes per transfer. You can suffix this value with ``K``, ``M``, or ``G`` (for example, ``128M``). The value must be a multiple of 4. This argument is optional.
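+
+The size rules can be illustrated with a short Python sketch. The function ``parse_num_bytes`` is hypothetical and is not the TransferBench parser. It also assumes that ``K``, ``M``, and ``G`` are binary multiples (1K = 1024 bytes), which this topic does not state explicitly.
+
+.. code-block:: python
+
+    def parse_num_bytes(text):
+        # Assumption: suffixes are binary multiples (1K = 1024 bytes).
+        units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
+        text = text.strip().upper()
+        if text and text[-1] in units:
+            value = int(text[:-1]) * units[text[-1]]
+        else:
+            value = int(text)
+        # The documented constraint: sizes must be a multiple of 4.
+        if value % 4 != 0:
+            raise ValueError("num_bytes must be a multiple of 4")
+        return value
+
+Under that assumption, ``parse_num_bytes("128M")`` yields 134217728 bytes.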
+
+.. note::
+
+  If you set ``num_bytes`` to ``0``, TransferBench sweeps over transfer sizes from 1 KB to 512 MB in power-of-2 steps. Use ``SAMPLING_FACTOR`` to control how many sizes are sampled per power-of-2 range. For example, ``SAMPLING_FACTOR=4`` produces four evenly spaced sizes between each power of 2.
+
+Alternative modes
+------------------
+
+In addition to configuration file mode, TransferBench supports the following alternative modes:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Mode
+      - Usage
+      - Description
+
+    * - ``cmdline``
+      - ``./TransferBench cmdline [num_bytes] <transfer_definitions...>``
+      - Defines transfers on the command line instead of in a file.
+
+    * - ``dryrun``
+      - ``./TransferBench dryrun [num_bytes] <transfer_definitions...>``
+      - Parses and prints the expanded transfers without executing them. Use this mode to validate wildcard expansion.
+
+- Using ``cmdline`` mode:
+
+  .. code-block:: shell
+
+    ./TransferBench cmdline 1G "1 1 (G0->G0->G1)"
+
+  Running the preceding command produces the same result as using a configuration file that contains the following line:
+
+  .. code-block:: shell
+
+    1 1 (G0->G0->G1)
+
+- Using ``dryrun`` mode:
+
+  .. code-block:: shell
+
+    ./TransferBench dryrun 1G "1 1 (G0->G0->G*)"
+
+  The output lists all transfers that would be executed:
+
+  .. code-block:: text
+
+    =============================================================================================================
+    Transfers to be executed (dry-run):
+    ================================================================================
+    Transfer     0: (G0->G0->G0)
+    Transfer     1: (G0->G0->G1)
+    Transfer     2: (G0->G0->G2)
+    Transfer     3: (G0->G0->G3)
+    Transfer     4: (G0->G0->G4)
+    Transfer     5: (G0->G0->G5)
+    Transfer     6: (G0->G0->G6)
+    Transfer     7: (G0->G0->G7)
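+
+The expansion shown in the dry-run output can be approximated with the following Python sketch. It is an illustration only: ``expand_dst_wildcard`` is a hypothetical helper that handles just a ``G*`` destination, and the GPU count of 8 is assumed to match the example output. The full wildcard grammar is covered in :ref:`transfer-definition-syntax`.
+
+.. code-block:: python
+
+    def expand_dst_wildcard(transfer, num_gpus):
+        # Split "(SRC->EXE->DST)" into its three parts.
+        src, exe, dst = transfer.strip("()").split("->")
+        if dst != "G*":
+            return [f"({src}->{exe}->{dst})"]
+        # One transfer per GPU index for a wildcard destination.
+        return [f"({src}->{exe}->G{i})" for i in range(num_gpus)]
+
+    expanded = expand_dst_wildcard("(G0->G0->G*)", 8)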
+
+Configuration file format
+==========================
+
+Each line in a configuration file defines one test, which is a set of transfers that run in parallel. The following rules apply:
+
+- Blank lines are ignored.
+
+- Lines starting with ``#`` are treated as comments and are ignored.
+
+- Lines starting with ``##`` are echoed to the output. Use these for section headers.
+
+- Round brackets ``()`` and arrows ``->`` are optional and ignored. You can include them for readability.
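+
+The rules above can be sketched as a small line classifier in Python. The function ``classify_line`` is hypothetical and is not the TransferBench parser. It also assumes that leading whitespace is ignored, which the rules do not state.
+
+.. code-block:: python
+
+    def classify_line(line):
+        # Assumption: leading and trailing whitespace is ignored.
+        stripped = line.strip()
+        if not stripped:
+            return ("blank", None)
+        # Check "##" before "#", because both start with the same character.
+        if stripped.startswith("##"):
+            return ("echo", stripped)
+        if stripped.startswith("#"):
+            return ("comment", None)
+        # Brackets and arrows are optional, so treat them as separators.
+        for token in ("(", ")", "->"):
+            stripped = stripped.replace(token, " ")
+        return ("test", stripped.split())
+
+For example, ``classify_line("1 4 (G0->G0->G1)")`` and ``classify_line("1 4 G0 G0 G1")`` produce the same token list.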
+
+Example configuration file
+============================
+
+The following configuration file shows a range of transfer types using both simple and advanced modes:
+
+.. code-block:: shell
+
+  # Single GPU-executed transfer between GPUs 0 and 1 using 4 CUs
+  1 4 (G0->G0->G1)
+
+  # Single DMA transfer between GPUs 0 and 1
+  1 1 (G0->D0->G1)
+
+  # Advanced: 1 MB GPU 0 to GPU 1 with 4 CUs; 2 MB GPU 1 to GPU 0 with 8 CUs
+  -2 (G0 G0 G1 4 1M) (G1 G1 G0 8 2M)
+
+  # Memset by GPU 0 to its own memory (null source)
+  1 32 (N0->G0->G0)
+
+  # Read-only by CPU 0 (null destination)
+  1 4 (C0->C0->N0)
+
+  # Broadcast from GPU 0 to GPUs 0 and 1
+  1 16 (G0->G0->G0G1)
+
+  # Multi-rank: GPU 0 on rank 0 to GPU 1 on rank 1
+  1 4 (R0G0->R0G0->R1G1)
+
+For information on how to define transfers in the configuration file, including how to specify what memory to use, which executor runs the transfer, and how many subexecutors to use, see :ref:`transfer-definition-syntax`.
diff --git a/docs/how to/use-transferbench.rst b/docs/how to/use-transferbench.rst
deleted file mode 100644
index b237a1c8..00000000
--- a/docs/how to/use-transferbench.rst	
+++ /dev/null
@@ -1,215 +0,0 @@
-.. meta::
-  :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
-  :keywords: TransferBench usage, TransferBench how to, TransferBench user guide, TransferBench user manual
-
-.. _using-transferbench:
-
----------------------
-Using TransferBench
----------------------
-
-You can control the SRC and DST memory locations by indicating the memory type followed by the device index. TransferBench supports the following memory types:
-
-* Coarse-grained pinned host
-* Unpinned host
-* Fine-grained host
-* Coarse-grained global device
-* Fine-grained global device
-* Null (for an empty transfer)
-
-In addition, you can determine the size of the transfer (number of bytes to copy) for the tests.
-
-You can also specify transfer executors. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of Sub-Executors (SE). The number of SEs specifies the number of CPU threads in the case of a CPU executor and the number of compute units (CU) for a GPU executor.
-For a DMA executor, the SE argument determines the number of streams to be used.
-
-You can specify the transfers in a configuration file or use preset configurations for transfers.
-
-Specifying transfers in a configuration file
-----------------------------------------------
-
-A transfer is defined as a single operation where an executor reads and adds together values from SRC memory locations, followed by writing the sum to the DST memory locations.
-This simplifies to a copy operation when using a single SRC or DST.
-Here's a copy operation from a single SRC to DST:
-
-.. code-block:: bash
-
-   SRC 0                DST 0
-   SRC 1 -> Executor -> DST 1
-   SRC X                DST Y
-
-Three executors are supported by TransferBench:
-
-.. code-block:: bash
-
-  Executor:        SubExecutor:
-  1. CPU           CPU thread
-  2. GPU           GPU threadblock/Compute Unit (CU)
-  3. DMA           N/A (Can only be used for a single SRC to DST copy)
-
-Each line in the configuration file defines a set of transfers, also known as a test, to run in parallel.
-
-There are two ways to specify a test:
-
-- **Basic**
-
-  The basic specification assumes the same number of SEs used per transfer.
-  A positive number of transfers is specified, followed by the number of SEs and triplets describing each transfer:
-
-  .. code-block:: bash
-
-    Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
-
-  The arguments used to specify transfers in the config file are described in the :ref:`arguments table <config_file_arguments_table>`.
-
-  **Example**:
-
-  .. code-block:: bash
-
-   1 4 (G0->G0->G1)                   Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
-   1 4 (C1->G2->G0)                   Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
-   2 4 G0->G0->G1 G1->G1->G0          Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
-
-- **Advanced**
-
-  In the advanced specification, a negative number of transfers is specified, followed by quintuplets describing each transfer.
-  Specifying a non-zero number of bytes overrides any provided value.
-
-  .. code-block:: bash
-
-    Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)
-
-  The arguments used to specify transfers in the config file are described in the :ref:`arguments table <config_file_arguments_table>`.
-
-  **Example**:
-
-  .. code-block:: bash
-
-   -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1Mb from GPU0 to GPU1 with 4 SEs and 2Mb from GPU1 to GPU0 with 2 SEs
-
-Here is the list of arguments used to specify transfers in the config file:
-
-.. _config_file_arguments_table:
-
-.. list-table::
-   :header-rows: 1
-
-   * - Argument
-     - Description
-
-   * - Transfers
-     - Number of transfers to be run in parallel
-
-   * - SE
-     - Number of SEs to use (CPU threads or GPU threadblocks)
-
-   * - srcMemL
-     - Source memory locations (where the data is read)
-
-   * - Executor
-     - | Executor is specified by a character indicating type, followed by the device index (0-indexed):
-       | - C: CPU-executed  (indexed from 0 to NUMA nodes - 1)
-       | - G: GPU-executed  (indexed from 0 to GPUs - 1)
-       | - D: DMA-executor  (indexed from 0 to GPUs - 1)
-
-   * - dstMemL
-     - Destination memory locations (where the data is written)
-
-   * - bytesL
-     - | Number of bytes to copy (use command-line specified size when 0).
-       | Must be a multiple of four and can be suffixed with ('K','M', or 'G').
-       | Memory locations are specified by one or more device characters or device index pairs.
-       | Characters indicate memory type and are followed by device index (0-indexed).
-       | Here are the characters and their respective memory locations:
-       | - C:    Pinned host memory       (on NUMA node, indexed from 0 to [NUMA nodes-1])
-       | - U:    Unpinned host memory     (on NUMA node, indexed from 0 to [NUMA nodes-1])
-       | - B:    Fine-grain host memory   (on NUMA node, indexed from 0 to [NUMA nodes-1])
-       | - G:    Global device memory     (on GPU device, indexed from 0 to [GPUs - 1])
-       | - F:    Fine-grain device memory (on GPU device, indexed from 0 to [GPUs - 1])
-       | - N:    Null memory              (index ignored)
-
-Round brackets and arrows "->" can be included for human clarity, but will be ignored.
-Lines starting with # are ignored while lines starting with ## are echoed to the output.
-
-**Transfer examples:**
-
-Single GPU-executed transfer between GPU 0 and 1 using 4 CUs::
-
-   1 4 (G0->G0->G1)
-
-Single DMA-executed transfer between GPU 0 and 1::
-
-   1 1 (G0->D0->G1)
-
-Copying 1Mb from GPU 0 to GPU 1 with 4 CUs, and 2Mb from GPU 1 to GPU 0 with 8 CUs::
-
-   -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
-
-"Memset" by GPU 0 to GPU 0 memory::
-
-   1 32 (N0->G0->G0)
-
-"Read-only" by CPU 0::
-
-   1 4 (C0->C0->N0)
-
-Broadcast from GPU 0 to GPU 0 and GPU 1::
-
-   1 16 (G0->G0->G0G1)
-
-.. note::
-
-   Running TransferBench with no arguments displays usage instructions and detected topology information.
-
-Using preset configurations
-------------------------------
-
-Here is the list of preset configurations that can be used instead of configuration files:
-
-.. list-table::
-   :header-rows: 1
-
-   * - Configuration
-     - Description
-
-   * - ``a2a``
-     - All-to-all benchmark test
-
-   * - ``cmdline``
-     - Allows transfers to run from the command line instead of a configuration file
-
-   * - ``dryrun``
-     - Lists the set of transfers to be executed as provided from the command line
-     - This is useful when using wildcards to ensure correctness
-
-   * - ``healthcheck``
-     - Simple health check (supported on AMD Instinct MI300 series only)
-
-   * - ``nic_rings``
-     - Measure performance of NICs set up in a ring across ranks
-
-   * - ``p2p``
-     - Peer-to-peer benchmark test
-
-   * - ``pcopy``
-     - Benchmark parallel copies from a single GPU to other GPUs
-
-   * - ``rsweep``
-     - Random sweep across possible sets of transfers
-
-   * - ``rwrite``
-     - Benchmark parallel remote writes from a single GPU to other GPUs
-
-   * - ``scaling``
-     - GPU subexecutor scaling tests
-
-   * - ``schmoo``
-     - Read, write, or copy operation on local or remote between two GPUs
-
-   * - ``sweep``
-     - Sweep across possible sets of transfers
-
-Performance tuning
----------------------
-
-When you use the same GPU executor in multiple simultaneous transfers on separate streams by setting ``USE_SINGLE_STREAM=0``, the performance might be serialized due to the maximum number of hardware queues available.
-To improve the performance, adjust the number of maximum hardware queues using ``GPU_MAX_HW_QUEUES``.
diff --git a/docs/index.rst b/docs/index.rst
index 2e8b36df..98795e77 100644
--- a/docs/index.rst
+++ b/docs/index.rst
@@ -1,13 +1,30 @@
 .. meta::
-  :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
-  :keywords: Benchmarking utility, Memory transfers, Device transfers
+  :description: TransferBench documentation home. TransferBench is a utility for benchmarking simultaneous memory transfers between CPUs, GPUs, and NICs.
+  :keywords: TransferBench, benchmarking utility, memory transfers, GPU transfers, NIC transfers, multinode benchmark
 
 ****************************
 TransferBench documentation
 ****************************
 
-TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs). A transfer is a single operation where an executor reads and adds values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
-This simplifies to a simple copy operation when dealing with a single SRC or DST.
+TransferBench is a utility for benchmarking simultaneous memory transfers between user-specified devices (CPUs, GPUs, and NICs).
+
+A memory transfer is a single operation where an Executor (EXE) reads and adds values from source (SRC) memory devices, then writes the sum to destination (DST) memory devices. When dealing with a single SRC or DST, a memory transfer is similar to a simple copy operation. The memory transfer is commonly denoted by the (SRC->EXE->DST) triplet.
+
+A memory device consists of a location (the specific device that owns the memory) and a memory type (an attribute of that memory). For example, fine-grained HBM (memory type) on GPU 0 (location), or pinned CPU memory (memory type) on NUMA node 1 (location).
+
+TransferBench supports the following features:
+
+- **Multiple executors:** CPU threads, GPU compute kernels, GPU Direct Memory Access (DMA) or System DMA (SDMA), and Remote Direct Memory Access (RDMA) NIC or RNIC. Some Executors support SubExecutors, allowing further partitioning of the data to be transferred.
+
+- **Multi-input or multi-output (MIMO) transfers:** Element-wise sum from multiple SRCs to multiple DSTs.
+
+- **Multinode execution:** Using MPI or sockets across distributed systems.
+
+- **Flexible configuration:** Using configuration files or presets for common benchmarks.
+
+- **Flexible hardware:** Supports HIP and CUDA programs that can run on both AMD and NVIDIA hardware.
+
+TransferBench provides a frontend client (the executable) and a backend library (the header-only TransferBench.hpp). The backend library can be used to integrate TransferBench into other custom applications.
 
 The code is open and hosted at `<https://github.com/ROCm/TransferBench>`_.
 
@@ -18,13 +35,21 @@ The code is open and hosted at `<https://github.com/ROCm/TransferBench>`_.
 
     * :ref:`install-transferbench`
 
-  .. grid-item-card:: API reference
+  .. grid-item-card:: How to
+
+    * :ref:`running-transferbench-customized`
 
-    * :ref:`transferbench-api`
+  .. grid-item-card:: Conceptual
 
-  .. grid-item-card:: How to
+    * :ref:`transferbench-workflow`
+    * :ref:`transferbench-timing`
+    * :ref:`transferbench-data-validation`
+
+  .. grid-item-card:: Reference
 
-    * :ref:`using-transferbench`
+    * :ref:`running-presets`
+    * :ref:`environment-variables`
+    * :ref:`faq`
 
 To contribute to the documentation, refer to
 `Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
diff --git a/docs/install/install.rst b/docs/install/install.rst
index 4a44ff59..f243dc17 100644
--- a/docs/install/install.rst
+++ b/docs/install/install.rst
@@ -1,85 +1,409 @@
 .. meta::
-  :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
-  :keywords: Build TransferBench, Install TransferBench
+  :description: Instructions for installing TransferBench from source or using a package manager on supported platforms.
+  :keywords: Build TransferBench, Install TransferBench, TransferBench package manager, TransferBench source build
 
 .. _install-transferbench:
 
----------------------------
-Installing TransferBench
----------------------------
+-----------------------
+Install TransferBench
+-----------------------
 
-This topic describes how to build TransferBench.
+To install TransferBench, you have the following options:
 
-Prerequisite
----------------
+- :ref:`Build from source <source-build>`
 
-* Install ROCm stack on the system to obtain :doc:`HIP runtime <hip:index>`
-* Install ``libnuma`` on the system
-* `Enable AMD IOMMU <https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#iommu-configuration-systems-with-256-cpu-threads>`_ and set to passthrough for AMD Instinct cards
+- :ref:`Use package manager <package-manager>`
 
-Building TransferBench
-------------------------
+.. _source-build:
 
-Here are the steps to build TransferBench:
+Building TransferBench from source
+===================================
 
-1. Download the latest version of TransferBench from the git repository.
+First, install the following required dependencies.
 
-   .. code-block:: bash
+Required dependencies
+----------------------
 
-    git clone https://github.com/ROCm/TransferBench.git
-    cd TransferBench
+* `ROCm stack <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/>`_ to obtain :doc:`HIP runtime <hip:index>`.
 
-2. Build TransferBench using Makefile or CMake.
+  - The installed HIP version might impact support for some features, such as amd-smi pod membership detection, or UALoE support.
 
-   To build using Makefile, use:
+* ``libnuma`` for allocating memory or spawning threads on correct NUMA nodes.
 
-   .. code-block:: bash
+  - For Ubuntu/Debian:
 
-    make
+    .. code-block:: shell
 
-   To build using CMake, use:
+      sudo apt install libnuma-dev
 
-   .. code-block:: bash
+  - For RHEL/CentOS:
 
-    mkdir build
-    cd build
-    CXX=/opt/rocm/bin/hipcc cmake ..
-    make
+    .. code-block:: shell
 
-.. note::
+      sudo yum install numactl-devel
 
-  If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.
+Optional dependencies
+----------------------
 
-  NIC executor support will be enabled if IBVerbs is detected and if ``infiniband/verbs.h`` is found in the default include path.
-  NIC executor support can be disabled explicitly by setting ``DISABLE_NIC_EXEC=1``
+Depending on your requirement, you can install these optional dependencies:
 
-  MPI support will be enabled if mpi.h is found in ``MPI_PATH/include/``
-  MPI executor support can be disabled explicitly by setting ``DISABLE_MPI_COMM=1``
+- ``libibverbs``: Required to enable the NIC executor for RDMA transfers.
 
-Building documentation
------------------------
+  - For Ubuntu/Debian:
+
+    .. code-block:: shell
+
+      sudo apt install rdma-core libibverbs-dev ibverbs-utils
+
+  - For RHEL/CentOS:
+
+    .. code-block:: shell
+
+      sudo yum install rdma-core libibverbs libibverbs-devel
+
+- MPI installation (any of the following)
+
+  - ``OpenMPI``:
+
+    - For Ubuntu/Debian:
+
+      .. code-block:: shell
+
+        sudo apt install openmpi-bin libopenmpi-dev
+
+    - For RHEL/CentOS:
+
+      .. code-block:: shell
 
-To build documentation locally, use:
+        sudo yum install openmpi openmpi-devel
 
-.. code-block:: bash
+  - ``MPICH``:
 
-  cd docs
-  pip3 install -r ./sphinx/requirements.txt
-  python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
+    - For Ubuntu/Debian:
 
-NVIDIA platform support
---------------------------
+      .. code-block:: shell
 
-You can build TransferBench to run on NVIDIA platforms using native NVIDIA CUDA Compiler Driver (NVCC).
+        sudo apt-get install mpich libmpich-dev
 
-To build with native NVCC, use:
+    - For RHEL/CentOS:
 
-.. code-block:: bash
+      .. code-block:: shell
 
+        sudo yum install mpich mpich-devel
+
+You can build TransferBench from source using two methods: :ref:`Makefile <makefile>` and :ref:`CMake <cmake>`.
+
+.. _makefile:
+
+Method 1: Building from source using Makefile
+----------------------------------------------
+
+To build TransferBench from source using Makefile, run:
+
+.. code-block:: shell
+
+  git clone https://github.com/ROCm/TransferBench.git
+  cd TransferBench
+  make
+
+.. note::
+
+  By default, ``make`` builds only for the GPU detected on the machine used for compilation. To compile for specific GPU architectures, set ``GPU_TARGETS``. See :ref:`menv-var`.
+
+.. _menv-var:
+
+Makefile environment variables
++++++++++++++++++++++++++++++++
+
+To modify the Makefile behavior, use the following environment variables:
+
+.. raw:: html
+
+  <div class="pst-scrollable-table-container">
+    <table id="makefile-env-var" class="table">
+        <thead>
+            <tr>
+                <th>Category</th>
+                <th>Environment variable</th>
+                <th>Description</th>
+                <th>Default value</th>
+            </tr>
+        </thead>
+        <colgroup>
+            <col span="1">
+            <col span="1">
+        </colgroup>
+        <tbody class="makefile-env-variables">
+            <tr>
+              <td rowspan="7"><b>Paths and compilers</b> - To customize which compiler to use or the library to link against.</td>
+              <td><code>ROCM_PATH</code></td>
+              <td>ROCm installation path for HIP compiler, includes, and libs.</td>
+              <td><code>/opt/rocm</code></td>
+            </tr>
+            <tr>
+              <td><code>CUDA_PATH</code></td>
+              <td>CUDA installation path for NVCC when building <code>TransferBenchCuda</code>.</td>
+              <td><code>/usr/local/cuda</code></td>
+            </tr>
+            <tr>
+              <td><code>MPI_PATH</code></td>
+              <td>MPI installation path (for <code>mpi.h</code> and MPI libraries).</td>
+              <td><code>/usr/local/openmpi</code></td>
+            </tr>
+            <tr>
+              <td><code>HIPCC</code></td>
+              <td>HIP compiler. Falls back to <code>hipcc</code>, if not found.</td>
+              <td><code>$(ROCM_PATH)/bin/amdclang++</code></td>
+            </tr>
+            <tr>
+              <td><code>NVCC</code></td>
+              <td>NVIDIA CUDA compiler (for building <code>TransferBenchCuda</code>)</td>
+              <td><code>$(CUDA_PATH)/bin/nvcc</code></td>
+            </tr>
+            <tr>
+              <td><code>ROCM_DEVICE_LIB_PATH</code></td>
+              <td>Path to <code>amdgcn</code> bitcode. Auto-detected from the ROCm layout.</td>
+              <td><code>(auto)</code></td>
+            </tr>
+            <tr>
+              <td><code>HIPCONFIG</code></td>
+              <td>Path to <code>hipconfig</code>, which is used to query the HIP version (for pod communication support check).</td>
+              <td><code>hipconfig</code></td>
+            </tr>
+            <tr>
+              <td rowspan="5"><b>Feature flags</b> - To control enabling features that require compile-time support. By default, these are enabled under the right conditions.</td>
+              <td><code>DISABLE_NIC_EXEC</code></td>
+              <td>Disables NIC executor support.</td>
+              <td><code>0</code></td>
+            </tr>
+            <tr>
+              <td><code>DISABLE_DMA_BUF</code></td>
+              <td>Disables <code>DMA-BUF</code> for GPU Direct RDMA. Requires NIC executor support.</td>
+              <td><code>1</code></td>
+            </tr>
+            <tr>
+              <td><code>DISABLE_MPI_COMM</code></td>
+              <td>Disables MPI communication backend support for multinode TransferBench.</td>
+              <td><code>0</code></td>
+            </tr>
+            <tr>
+              <td><code>DISABLE_AMD_SMI</code></td>
+              <td>Disables AMD-SMI pod membership checks.</td>
+              <td><code>0</code></td>
+            </tr>
+            <tr>
+              <td><code>DISABLE_POD_COMM</code></td>
+              <td>Disables pod communication support (UALoE / MNNVL).</td>
+              <td><code>0</code></td>
+            </tr>
+            <tr>
+              <td rowspan="3"><b>Build options</b></td>
+              <td><code>SINGLE_KERNEL</code></td>
+              <td>To compile with a single GFX kernel (faster build, but fewer kernel variants), set to 1. Used mostly for development and debug.</td>
+              <td><code>0</code></td>
+            </tr>
+            <tr>
+              <td><code>GPU_TARGETS</code></td>
+              <td>Comma-separated GPU architecture targets such as gfx942, gfx950.</td>
+              <td><code>native</code></td>
+            </tr>
+            <tr>
+              <td><code>DEBUG</code></td>
+              <td>To build in debug mode with debug symbols (-O0, -g), set to 1. Otherwise, builds in release mode (-O3).</td>
+              <td><code>0</code></td>
+            </tr>
+        </tbody>
+    </table>
+  </div>
+
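+**Example: building with the Makefile for specific GPU targets**
+
+As a minimal sketch based on the variables in the preceding table, the following command compiles for two GPU architectures and disables NIC executor support. The architecture names are illustrative; substitute the targets that match your hardware.
+
+.. code-block:: shell
+
+  # Build for gfx942 and gfx950 only, without NIC executor support
+  GPU_TARGETS=gfx942,gfx950 DISABLE_NIC_EXEC=1 make
+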
+.. _cmake:
+
+Method 2: Building from source using CMake
+-------------------------------------------
+
+To build TransferBench from source using CMake, run:
+
+.. code-block:: shell
+
+  git clone https://github.com/ROCm/TransferBench.git
+  cd TransferBench
+  mkdir build && cd build
+  cmake ..
   make
 
-TransferBench looks for NVCC in ``/usr/local/cuda`` by default. To modify the location of NVCC, use environment variable `CUDA_PATH`:
+CMake environment variables
+++++++++++++++++++++++++++++
+
+To modify the CMake behavior, use the following environment variables:
+
+.. raw:: html
+
+  <div class="pst-scrollable-table-container">
+    <table id="cmake-env-var" class="table">
+        <thead>
+            <tr>
+                <th>Category</th>
+                <th>Environment variable</th>
+                <th>Description</th>
+                <th>Default value</th>
+            </tr>
+        </thead>
+        <colgroup>
+            <col span="1">
+            <col span="1">
+        </colgroup>
+        <tbody class="cmake-env-variables">
+            <tr>
+              <td rowspan="4"><b>Paths and compilers</b> - To customize which compiler to use or the library to link against.</td>
+              <td><code>ROCM_PATH</code></td>
+              <td>ROCm installation path.</td>
+              <td><code>/opt/rocm</code></td>
+            </tr>
+            <tr>
+              <td><code>CMAKE_TOOLCHAIN_FILE</code></td>
+              <td>Toolchain file. Uses ROCM_PATH and CXX to select compiler.</td>
+              <td><code>toolchain-linux.cmake</code></td>
+            </tr>
+            <tr>
+              <td><code>CXX</code></td>
+              <td>C++ compiler. If not set, <code>amdclang++</code> or <code>hipcc</code> is used.</td>
+              <td> Taken from the toolchain</td>
+            </tr>
+            <tr>
+              <td><code>MPI_PATH</code></td>
+              <td>Path to MPI installation. Takes priority over <code>find_package(MPI)</code>.</td>
+              <td> </td>
+            </tr>
+            <tr>
+              <td rowspan="6"><b>Build options (ON/OFF)</b> - Pass -DVAR=value to set</td>
+              <td><code>BUILD_LOCAL_GPU_TARGET_ONLY</code></td>
+              <td>Builds only for the GPUs detected on the given machine using <code>rocm_agent_enumerator</code>.</td>
+              <td><code>OFF</code></td>
+            </tr>
+            <tr>
+              <td><code>ENABLE_NIC_EXEC</code></td>
+              <td>Enables RDMA NIC executor.</td>
+              <td><code>OFF</code></td>
+            </tr>
+            <tr>
+              <td><code>ENABLE_MPI_COMM</code></td>
+              <td>Enables MPI communicator as backbone for multinode TransferBench.</td>
+              <td><code>OFF</code></td>
+            </tr>
+            <tr>
+              <td><code>ENABLE_DMA_BUF</code></td>
+              <td>Enables DMA-BUF for GPU Direct RDMA (requires NIC).</td>
+              <td><code>OFF</code></td>
+            </tr>
+            <tr>
+              <td><code>ENABLE_AMD_SMI</code></td>
+              <td>Enables AMD-SMI pod membership queries.</td>
+              <td><code>OFF</code></td>
+            </tr>
+            <tr>
+              <td><code>ENABLE_POD_COMM</code></td>
+              <td>Enables pod communication (HIP >= 8.0).</td>
+              <td><code>OFF</code></td>
+            </tr>
+            <tr>
+              <td rowspan="3"><b>CMake cache variables</b></td>
+              <td><code>GPU_TARGETS</code></td>
+              <td>Semicolon-separated GPU architectures. Overridden if <code>BUILD_LOCAL_GPU_TARGET_ONLY</code> is <code>ON</code></td>
+              <td><code style="word-break: break-all;">gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201;gfx1250</code></td>
+              </tr>
+              <tr>
+                <td><code>AMD_SMI_EXECUTABLE</code></td>
+                <td>Path to <code>amd-smi</code> for AMD-SMI version check.</td>
+                <td><code>amd-smi</code></td>
+              </tr>
+              <tr>
+                <td><code>HIPCONFIG_EXECUTABLE</code></td>
+                <td>Path to <code>hipconfig</code> for HIP version or pod check.</td>
+                <td><code>hipconfig</code></td>
+        </tbody>
+    </table>
+  </div>
+
+.. note::
+
+  CMake treats optional features as opt-in (``OFF`` by default), whereas the Makefile treats them as opt-out (enabled by default when the right conditions are met). To set cache variables, pass ``-DVAR=value`` to CMake.
+
+**Example: building with MPI and NIC support**
+
+.. code-block:: shell
+
+  git clone https://github.com/ROCm/TransferBench.git
+  cd TransferBench
+  mkdir build && cd build
+  cmake .. -DENABLE_NIC_EXEC=ON -DENABLE_MPI_COMM=ON
+  make
+
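+**Example: building with CMake for specific GPU targets**
+
+A similar sketch for limiting the GPU architectures, assuming the cache variables behave as described in the preceding table. The architecture names are illustrative. Quote the value so that the shell doesn't interpret the semicolon as a command separator.
+
+.. code-block:: shell
+
+  # Run from the build directory; compile for gfx942 and gfx950 only
+  cmake .. -DGPU_TARGETS="gfx942;gfx950"
+  make
+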
+Troubleshooting common build errors
+------------------------------------
+
+Here are some commonly encountered build errors and their fixes:
+
+- ``Could not find /opt/rocm/bin/amdclang++ or /opt/rocm/bin/hipcc. Check if the path is correct if you want to build TransferBench``
+
+  Occurs if HIP isn't installed correctly. If it is installed in a different directory, specify it using ``ROCM_PATH``.
+
+- ``Could not find standard C++ header 'cmath'``
+
+  Normally occurs if the standard C++ headers aren't installed. Try installing ``g++-12`` or ``g++-14`` based on the OS version. For example, ``apt-get install g++-12``.
+
+.. _package-manager:
+
+Installing TransferBench using package manager
+===============================================
+
+To install TransferBench using a package manager, install ROCm first and then run:
+
+.. code-block:: shell
+
+  ## Install the transferbench-dev package
+  sudo apt-get install transferbench-dev
+
+This installs TransferBench to ``/opt/{rocm-version}/bin/TransferBench``. To check the installed files, use:
+
+.. code-block:: shell
+
+  dpkg -L transferbench-dev
+
+.. note::
+
+  The prepackaged installation doesn't include optional compile-time features such as NIC executor support, MPI support, or pod support.
+
+Building TransferBenchCuda
+===========================
+
+To build TransferBenchCuda from the source code, install the required dependencies first.
+
+Required dependencies
+----------------------
+
+- CUDA: The installed CUDA version might impact support for some features, such as MNNVL support.
+
+- libnuma: Used for allocating memory or spawning threads on the right NUMA nodes. Here are the install instructions based on the OS:
+
+  - Ubuntu/Debian:
+
+    .. code-block:: shell
+
+      sudo apt install libnuma-dev
+
+  - RHEL/CentOS:
+
+    .. code-block:: shell
+
+      sudo yum install numactl-devel
+
+Building TransferBenchCuda from source code
+--------------------------------------------
+
+To build TransferBenchCuda, run:
 
-.. code-block:: bash
+.. code-block:: shell
 
-  CUDA_PATH=/usr/local/cuda make
+  git clone https://github.com/ROCm/TransferBench.git
+  cd TransferBench
+  make TransferBenchCuda
diff --git a/docs/reference/api.rst b/docs/reference/api.rst
index 438696e6..ae88593a 100644
--- a/docs/reference/api.rst
+++ b/docs/reference/api.rst
@@ -1,6 +1,6 @@
 .. meta::
-  :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
-  :keywords: TransferBench library, TransferBench functions, Transferbench API, Transferbench interface
+  :description: API reference for the TransferBench backend library, including functions and interfaces exposed by the header-only TransferBench.hpp.
+  :keywords: TransferBench library, TransferBench functions, TransferBench API, TransferBench interface, TransferBench.hpp
 
 .. _transferbench-api:
 
diff --git a/docs/reference/environment-variables.rst b/docs/reference/environment-variables.rst
new file mode 100644
index 00000000..d6f63116
--- /dev/null
+++ b/docs/reference/environment-variables.rst
@@ -0,0 +1,413 @@
+.. meta::
+  :description: Reference for TransferBench environment variables that control the frontend client, backend library, and runtime behavior for configuration-file and preset runs.
+  :keywords: TransferBench environment variables, TransferBench configuration, TransferBench client, TransferBench customization, TransferBench how to
+
+.. _environment-variables:
+
+====================================
+TransferBench environment variables
+====================================
+
+TransferBench behavior can be customized using environment variables. This topic describes the environment variables that control the TransferBench client (frontend), backend library, and runtime.
+
+.. note::
+
+  Environment variables read by the frontend client apply to both configuration-file runs and preset runs. Variables prefixed with ``TB_`` are read by the backend library and are not part of the frontend client.
+
+Frontend client environment variables
+=======================================
+
+The following environment variables are read by the TransferBench client and apply to configuration-file runs and presets.
+
+General options
+---------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``NUM_ITERATIONS``
+      - Number of timed iterations per test. If negative, runs for that many seconds instead.
+      - ``10``
+
+    * - ``NUM_SUBITERATIONS``
+      - Sub-iterations per iteration. Set to ``0`` for infinite sub-iterations.
+
+        Within each iteration, the transfer is repeated ``NUM_SUBITERATIONS`` times. This can reduce the impact of kernel launch latencies, but might over-emphasize cache reuse.
+      - ``1``
+
+    * - ``NUM_WARMUPS``
+      - Untimed warmup iterations per test.
+      - ``3``
+
+    * - ``SHOW_BORDERS``
+      - Shows ASCII box-drawing characters in tables. Set to ``1`` to show, ``0`` to hide.
+      - ``1``
+
+    * - ``SHOW_ITERATIONS``
+      - Shows per-iteration timing. Set to ``1`` to show, ``0`` to hide.
+      - ``0``
+
+    * - ``USE_INTERACTIVE``
+      - Specifies whether to pause for user input before and after the transfer loop. Set to ``1`` to pause, ``0`` to skip.
+
+        The first pause occurs after memory allocations are prepared and before any transfers are executed. The second pause occurs after transfers are executed and before transfers are validated. This is useful for profiling: start profiling after the first pause, then capture data before validation begins.
+      - ``0``
+
+    * - ``HIDE_ENV``
+      - Hides the environment variable listing. Set to ``1`` to hide, ``0`` to show.
+      - ``0``
+
+    * - ``OUTPUT_TO_CSV``
+      - Generates results in CSV format. Set to ``1`` for CSV, ``0`` for human-readable output.
+      - ``0``
+
+    * - ``SAMPLING_FACTOR``
+      - Affects auto-generated N values when N is ``0``.
+      - ``1``
+
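+For example, the following sketch runs the ``p2p`` preset for about five seconds per test instead of a fixed iteration count, and writes the results in CSV format. The values are illustrative, and the ``./TransferBench`` path assumes that you run from the directory containing the executable.
+
+.. code-block:: shell
+
+  # A negative NUM_ITERATIONS is interpreted as a duration in seconds
+  NUM_ITERATIONS=-5 NUM_WARMUPS=5 OUTPUT_TO_CSV=1 ./TransferBench p2p
+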
+.. _data-validation-var:
+
+Data and validation options
+----------------------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``ALWAYS_VALIDATE``
+      - Specifies whether to validate after each iteration. Set to ``1`` to validate each iteration, ``0`` to validate once after all iterations.
+
+        By default, validation is only done after all iterations, which can mask errors that occurred in all but the last iteration.
+      - ``0``
+
+    * - ``BLOCK_BYTES``
+      - Granularity in bytes for dividing work across sub-executors.
+      - ``256``
+
+    * - ``BYTE_OFFSET``
+      - Initial byte offset for allocations. Must be a multiple of 4.
+      - ``0``
+
+    * - ``FILL_COMPRESS``
+      - Comma-separated percentages for 64-byte line fill across five bins: random, 1B0, 2B0, 4B0, and 32B0.
+
+        This feature tests various compressible data patterns supported by XGMI. The integer values must sum to 100 and correspond to the following bins:
+
+        .. list-table::
+            :header-rows: 1
+
+            * - Bin
+              - Name
+              - Description
+
+            * - 0
+              - Random
+              - Random data
+
+            * - 1
+              - 1B0
+              - The upper 1 byte of each 2 bytes in the 64-byte line is 0
+
+            * - 2
+              - 2B0
+              - The upper 2 bytes of each 4 bytes in the 64-byte line are 0
+
+            * - 3
+              - 4B0
+              - The upper 4 bytes of each 8 bytes in the 64-byte line are 0
+
+            * - 4
+              - 32B0
+              - The upper 32 bytes of each 64-byte line are 0
+
+      - —
+
+    * - ``FILL_PATTERN``
+      - Big-endian hex pattern for source data. Must have an even number of digits. Allows users to specify a particular data pattern.
+      - —
+
+    * - ``VALIDATE_DIRECT``
+      - Specifies whether to validate the GPU destination directly. Set to ``1`` to validate directly, ``0`` to validate via a CPU staging buffer.
+
+        On AMD hardware, the CPU can directly access GPU device memory, avoiding the need for a staging buffer. This feature is not supported on NVIDIA hardware.
+      - ``0``
+
+    * - ``VALIDATE_SOURCE``
+      - Specifies whether to validate the source immediately after preparation. Set to ``1`` to validate, ``0`` to skip.
+
+        This was introduced to help debug issues where the initial copy of source data to the GPU didn't meet expectations, due to a hardware DMA issue.
+      - ``0``
+
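+For example, the following sketch fills half of the 64-byte lines with random data and the other half with the 32B0 pattern, and validates after every iteration. The percentages and the choice of the ``p2p`` preset are illustrative.
+
+.. code-block:: shell
+
+  # Bins in order: random,1B0,2B0,4B0,32B0 (values must sum to 100)
+  FILL_COMPRESS=50,0,0,0,50 ALWAYS_VALIDATE=1 ./TransferBench p2p
+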
+.. _gfx-options:
+
+GFX and kernel options
+-----------------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``GFX_BLOCK_ORDER``
+      - Block ordering when running in multitransfer single-stream mode. ``0`` = sequential, ``1`` = interleaved, ``2`` = random.
+
+        This controls how threadblocks are assigned to transfers. For example, with 4 transfers (A, B, C, D) each using 3 CUs:
+
+        .. code-block:: text
+
+          Threadblock     : 00 01 02 03 04 05 06 07 08 09 10 11
+          =====================================================
+          0 = Sequential  : A0 A1 A2 B0 B1 B2 C0 C1 C2 D0 D1 D2
+          1 = Interleaved : A0 B0 C0 D0 A1 B1 C1 D1 A2 B2 C2 D2
+          2 = Random      : C1 D2 B1 B0 A0 C0 D1 D0 A1 C2 A2 B2
+
+        Use this setting to investigate how threadblock assignment to XCCs impacts performance.
+      - ``0``
+
+    * - ``GFX_BLOCK_SIZE``
+      - Number of threads per threadblock. Must be a multiple of 64.
+      - ``256``
+
+    * - ``GFX_SE_TYPE``
+      - Subexecutor granularity. ``0`` = threadblock, ``1`` = warp.
+
+        By default, each subexecutor consists of one threadblock. Setting this to ``1`` makes each subexecutor consist of one warp instead. On some architectures such as AMD Instinct™ MI355X, this can impact performance, especially when used together with ``GFX_BLOCK_ORDER``.
+      - ``0``
+
+    * - ``GFX_TEMPORAL``
+      - Controls whether loads and stores use non-temporal operations. ``0`` = none, ``1`` = loads only, ``2`` = stores only, ``3`` = both loads and stores.
+      - ``0``
+
+    * - ``GFX_UNROLL``
+      - Unroll factor for the GFX kernel. Set to ``0`` for automatic selection. See :ref:`gfx-unroll`.
+      - (architecture-dependent)
+
+    * - ``GFX_SINGLE_TEAM``
+      - Subexecutor memory access mode. ``1`` = subexecutors operate on the full array, ``0`` = subexecutors operate on disjoint subarrays.
+      - ``1``
+
+    * - ``GFX_WAVE_ORDER``
+      - Stride ordering for GFX waves. ``0`` = UWC, ``1`` = UCW, ``2`` = WUC, ``3`` = WCU, ``4`` = CUW, ``5`` = CWU.
+      - ``0``
+
+    * - ``GFX_WORD_SIZE``
+      - Packed data size in DWORDs. ``4`` = DWORD x 4, ``2`` = DWORD x 2, ``1`` = DWORD x 1.
+      - ``4``
+
+    * - ``USE_HIP_EVENTS``
+      - Timing method for GFX and DMA transfers. ``1`` = use HIP events, ``0`` = use CPU wall time.
+      - ``1``
+
+    * - ``USE_SINGLE_STREAM``
+      - Stream assignment. ``1`` = one stream per GPU, ``0`` = one stream per transfer.
+      - ``1``
+
+    * - ``CU_MASK``
+      - CU mask for streams, specified as a comma-separated list of indices or ranges (for example, ``5,10-12,14``). AMD only.
+      - —
+
+    * - ``XCC_PREF_TABLE``
+      - Preferred XCC per source-by-destination GPU pair, specified as a comma-separated list. Supported on AMD hardware only.
+      - —
+
+DMA options
+-----------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``USE_HSA_DMA``
+      - DMA implementation. ``1`` = use ``hsa_amd_async_copy``, ``0`` = use ``hipMemcpy``.
+      - ``0``
+
+Variable subexecutor options
+------------------------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``MIN_VAR_SUBEXEC``
+      - Minimum number of subexecutors for variable subexecutor transfers.
+      - ``1``
+
+    * - ``MAX_VAR_SUBEXEC``
+      - Maximum number of subexecutors. Set to ``0`` to use the device limit.
+      - ``0``
+
+NIC options
+-----------
+
+The following environment variables apply only when NIC support is enabled.
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``IB_GID_INDEX``
+      - RoCE GID index. Set to ``-1`` for automatic selection.
+      - ``-1``
+
+    * - ``IB_PORT_NUMBER``
+      - RDMA port number.
+      - ``1``
+
+    * - ``IP_ADDRESS_FAMILY``
+      - IP address family. ``4`` = IPv4, ``6`` = IPv6.
+      - ``4``
+
+    * - ``NIC_CHUNK_BYTES``
+      - Bytes per NIC RDMA chunk.
+      - ``1073741824``
+
+    * - ``NIC_RELAX_ORDER``
+      - RDMA ordering. ``1`` = relaxed, ``0`` = strict.
+      - ``1``
+
+    * - ``ROCE_VERSION``
+      - RoCE version.
+      - ``2``
+
+Backend and runtime options (client-read)
+------------------------------------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``GPU_MAX_HW_QUEUES``
+      - A HIP runtime environment variable that determines the maximum number of hardware queues each GPU has access to per process. When more than four GPU-executed transfers run simultaneously, they might serialize while waiting for available hardware queues. In this case, increase the value of this environment variable from the default ``4`` when ``USE_SINGLE_STREAM=0``. See :ref:`gpu-max-hw-queues`.
+      - ``4``
+
+Backend environment variables
+==============================
+
+The following environment variables are read by the TransferBench backend library, the build system, or the underlying runtime. They are not part of the frontend client. TransferBench-specific variables are prefixed with ``TB_``.
+
+Socket-related variables
+------------------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+
+    * - ``TB_RANK``
+      - The process rank (0-based). Used for socket-based multinode runs.
+
+    * - ``TB_NUM_RANKS``
+      - Total number of processes.
+
+    * - ``TB_MASTER_ADDR``
+      - IP address of rank 0, used by the socket communicator.
+
+    * - ``TB_MASTER_PORT``
+      - Port for rank coordination. Default: ``29500``.
+
+    * - ``TB_SINGLE_LOG``
+      - When set, only rank 0 produces output. Useful for multinode socket mode.
+
+Backend variables
+-----------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+
+    * - ``TB_VERBOSE``
+      - Backend verbosity level. For example, set to ``1`` for extra logging.
+
+    * - ``TB_DUMP_CFG_FILE``
+      - Path of the configuration file used to dump executed transfers.
+
+        This dumps all executed transfers to a configuration file that can then be re-executed by TransferBench. This can be used to capture the transfers executed by a preset, facilitating any further modifications and customizations.
+
+    * - ``TB_PAUSE``
+      - Pause before execution. Useful for attaching a debugger.
+
+        For example:
+
+        .. code-block:: shell
+
+          > TB_PAUSE=1 ./TransferBench
+          # Pausing for debug attachment (PID: 2443974)
+
+          > sudo gdb -p 2443974
+          5741        while (pause);
+
+          set pause=false
+          continue
+
+    * - ``TB_NIC_FILTER``
+      - Regex pattern to limit visible NICs. Useful in preset scenarios that require homogeneous configurations.
+
+        For example:
+
+        .. code-block:: shell
+
+          # Without filter:
+          > mlx5_0, mlx5_1, mlx5_2, mlx5_3, mlx5_4, mlx5_5, mlx5_6, mlx5_7
+
+          TB_NIC_FILTER="mlx5_1|mlx5_3"    > mlx5_1, mlx5_3
+          TB_NIC_FILTER="mlx5_[1,4,5]"     > mlx5_1, mlx5_4, mlx5_5
+          TB_NIC_FILTER="mlx5_[1-3,7]"     > mlx5_1, mlx5_2, mlx5_3, mlx5_7
+          TB_NIC_FILTER="mlx5_.*"           > mlx5_0, mlx5_1, mlx5_2, mlx5_3, mlx5_4, mlx5_5, mlx5_6, mlx5_7
+
+    * - ``TB_DUMP_LINES``
+      - Number of 64-byte lines to dump when debugging ``FILL_COMPRESS``.
+
+        For example:
+
+        .. code-block:: shell
+
+          TB_DUMP_LINES=10
+
+          Input pattern 64B line statistics for bufferIdx 0:
+          Total lines: 16384
+          - 0: Random :  3276 ( 19.995%)
+          - 1: 1B0    :  3277 ( 20.001%)
+          - 2: 2B0    :  3277 ( 20.001%)
+          - 3: 4B0    :  3277 ( 20.001%)
+          - 4: 32B0   :  3277 ( 20.001%)
+
+    * - ``TB_FORCE_SINGLE_POD``
+      - Forces single pod mode, skipping AMD-SMI and NVML pod queries. This assumes that all GPUs are in the same pod and skips cluster membership API calls.
+
+HSA runtime variables
+---------------------
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+
+    * - ``HSA_ENABLE_SDMA``
+      - Enables SDMA when set to ``1``. To disable, set to ``0``.
+
+        This is an HSA runtime environment variable. When SDMA is disabled, the DMA executor falls back to using blit kernels (GFX) internally.
diff --git a/docs/reference/faq.rst b/docs/reference/faq.rst
new file mode 100644
index 00000000..ba970090
--- /dev/null
+++ b/docs/reference/faq.rst
@@ -0,0 +1,428 @@
+.. meta::
+  :description: Frequently asked questions about TransferBench, covering common errors, warnings, and configuration issues including IOMMU, memory types, and XGMI.
+  :keywords: TransferBench FAQ, TransferBench errors, TransferBench warnings, IOMMU, GPU_MAX_HW_QUEUES, GFX_UNROLL, validation, XGMI, UALoE, memory types
+
+.. _faq:
+
+==========================
+Frequently asked questions
+==========================
+
+This topic answers common questions about TransferBench errors, warnings, features, environment variables, and presets.
+
+Error and warning messages
+===========================
+
+This section describes common TransferBench error and warning messages and how to resolve them.
+
+[ERROR] Unexpected mismatch at index...
+-----------------------------------------
+
+TransferBench validates each transfer to ensure that data has been moved correctly. This
+error indicates that the destination (DST) memory doesn't match the expected value.
+
+For example:
+
+.. code-block:: text
+
+  [ERROR] Transfer 0: Unexpected mismatch at index 0 of destination 0 on rank 0: Expected 31.00000 Actual: 0.00000
+
+In this example, the first element of the DST memory was expected to hold
+``31.00000`` but actually contained ``0.00000``.
+
+This error is generally not a TransferBench issue. It's usually a sign of a system
+configuration problem.
+
+Common causes include:
+
+- Improperly configured IOMMU
+- A ROCm runtime and driver version mismatch
+
+IOMMU must be set to pass-through mode. To verify, check for ``iommu=pt``
+in the kernel command line:
+
+.. code-block:: shell
+
+  # Check for iommu=pt in the output
+  cat /proc/cmdline
+
+  BOOT_IMAGE=/boot/vmlinuz-5.15.0-70-generic root=UUID=7489cc43-aaab-4b61-8c63-86a419728dea
+  ro panic=0 nowatchdog msr.allow_writes=on nokaslr amdgpu.noretry=1 pci=realloc=off
+  modprobe.blacklist=amdgpu intel_iommu=on iommu=pt numa_balancing=disable console=tty0
+  console=ttyS0,115200n8
+
+For IOMMU configuration guidance, see
+`AMD Instinct MI300X system optimization <https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html>`_.
+
+.. _gpu-max-hw-queues:
+
+[WARN] ... attempting X parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+-------------------------------------------------------------------------------------
+
+The HIP runtime limits the number of independent hardware queues each GPU can use per
+process. This limit is controlled by the ``GPU_MAX_HW_QUEUES`` environment variable. For
+more information, see
+`ROCm environment variables <https://rocm.docs.amd.com/en/latest/reference/env-variables.html#debug-variables>`_.
+
+When the number of transfers requiring hardware queues exceeds the configured limit,
+those transfers serialize instead of running in parallel. TransferBench detects this
+condition and issues this warning.
+
+This commonly occurs with DMA-executed transfers, because each DMA transfer requires one
+hardware queue. It is frequently seen when running the :ref:`all-to-all preset <a2a>`.
+
+To resolve this, set ``GPU_MAX_HW_QUEUES`` to a value greater than the number of
+parallel transfers reported in the warning. At least one extra queue beyond that number
+is recommended.
+
+The following examples show the effect on an 8-GPU system running the all-to-all preset
+with DMA execution enabled.
+
+Without setting ``GPU_MAX_HW_QUEUES``:
+
+.. code-block:: shell
+
+  USE_DMA_EXEC=1 ./TransferBench a2a
+
+  ...
+  GPU-DMA All-To-All benchmark:
+  ==============================
+  [268435456 bytes per Transfer] [DMA:8] [1 Read(s) 1 Write(s)] [MemType:uncached GPU] [NIC QueuePairs:0] [#Ranks:1]
+
+  Average bandwidth (GPU Timed): 60.952 GB/s
+  Aggregate bandwidth (GPU Timed): 3413.290 GB/s
+  Aggregate bandwidth (CPU Timed): 1338.252 GB/s
+  [WARN] DMA 0 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+  [WARN] DMA 1 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+  [WARN] DMA 2 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+  [WARN] DMA 3 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+  [WARN] DMA 4 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+  [WARN] DMA 5 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+  [WARN] DMA 6 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+  [WARN] DMA 7 attempting 7 parallel transfers, however GPU_MAX_HW_QUEUES only set to 4
+
+Setting ``GPU_MAX_HW_QUEUES=8``:
+
+.. code-block:: shell
+
+  GPU_MAX_HW_QUEUES=8 USE_DMA_EXEC=1 ./TransferBench a2a
+
+  ...
+  GPU-DMA All-To-All benchmark:
+  ==============================
+  [268435456 bytes per Transfer] [DMA:8] [1 Read(s) 1 Write(s)] [MemType:uncached GPU] [NIC QueuePairs:0] [#Ranks:1]
+
+  Average bandwidth (GPU Timed): 60.091 GB/s
+  Aggregate bandwidth (GPU Timed): 3365.111 GB/s
+  Aggregate bandwidth (CPU Timed): 2222.415 GB/s
+
+.. note::
+
+  Individual transfer bandwidths are similar in both cases because each transfer is timed
+  from when it starts. However, the CPU wall-clock time is about 1.7 times longer with the
+  default ``GPU_MAX_HW_QUEUES=4`` (1338.252 GB/s versus 2222.415 GB/s CPU Timed), because
+  serialized transfers complete one after another instead of running in parallel.
+
+Feature questions
+==================
+
+This section answers common questions about TransferBench features and behavior.
+
+Can TransferBench target a specific UALoE station?
+----------------------------------------------------
+
+No. TransferBench has no direct control over which Unified Accelerator Link over Ethernet
+(UALoE) station gets used, and doesn't have any knowledge of which station is selected.
+
+Does TransferBench perform any validation?
+-------------------------------------------
+
+Yes. TransferBench initializes source data buffers with a pattern (which can be
+user-specified), then checks that destination data buffers contain the expected result
+after each transfer completes. For details, see :ref:`transferbench-data-validation`.
+
+Does TransferBench alter underlying XGMI speeds when it runs?
+--------------------------------------------------------------
+
+No. TransferBench runs on the current hardware settings and doesn't modify them.
+
+To query current XGMI settings on AMD Instinct machines, use ``amd-smi xgmi``:
+
+.. code-block:: shell
+
+  amd-smi xgmi
+
+  LINK METRIC TABLE:
+  bdf             bit_rate  max_bandwidth  link_type  GPU0     GPU1     GPU2     GPU3     GPU4     GPU5     GPU6     GPU7
+  GPU0  0000:0c:00.0  38 Gb/s  608 Gb/s  XGMI
+    Read   N/A       39.61 TB  15.40 TB  15.47 TB  5.349 TB  4.993 TB  5.078 TB  5.952 TB
+    Write  N/A       41.96 TB  15.32 TB  15.00 TB  5.332 TB  4.859 TB  4.979 TB  5.448 TB
+
+Environment variable questions
+================================
+
+This section answers common questions about TransferBench environment variables.
+
+.. _gfx-unroll:
+
+What is the GFX unroll factor?
+--------------------------------
+
+Specifying an unroll factor of X means that each GPU thread reads X pieces of source data
+into registers, then writes those X pieces of data out to the destination, as shown in the following table:
+
+.. raw:: html
+
+   <style>
+     .tb-unroll { border-collapse: collapse; }
+     .tb-unroll td, .tb-unroll th { border: 1px solid #ccc; padding: 6px 14px; text-align: center; }
+     .tb-unroll td { font-family: monospace; }
+     .tb-unroll thead tr { background: var(--pst-color-primary, #f0f0f0); color: var(--pst-color-on-primary, #000); }
+     .tb-unroll thead th { font-weight: bold; }
+     .tb-unroll tbody tr:nth-child(odd) td:first-child  { background: var(--pst-color-surface, #f8f8f8); }
+     .tb-unroll tbody tr:nth-child(even) td:first-child { background: var(--pst-color-on-background, #eaeaea); }
+     .tb-unroll .r { background: #c8f0d8; }
+     .tb-unroll .w { background: #f8c8c8; }
+   </style>
+   <table class="table table--middle-left tb-unroll">
+     <thead>
+       <tr>
+         <th class="head"><p>Instruction order</p></th>
+         <th class="head"><p>Unroll 1</p></th>
+         <th class="head"><p>Unroll 2</p></th>
+         <th class="head"><p>Unroll 4</p></th>
+       </tr>
+     </thead>
+     <tbody>
+       <tr><td>1</td><td class="r">READ [A]</td> <td class="r">READ [A]</td> <td class="r">READ [A]</td></tr>
+       <tr><td>2</td><td class="w">WRITE [A]</td><td class="r">READ [B]</td> <td class="r">READ [B]</td></tr>
+       <tr><td>3</td><td class="r">READ [B]</td> <td class="w">WRITE [A]</td><td class="r">READ [C]</td></tr>
+       <tr><td>4</td><td class="w">WRITE [B]</td><td class="w">WRITE [B]</td><td class="r">READ [D]</td></tr>
+       <tr><td>5</td><td class="r">READ [C]</td> <td class="r">READ [C]</td> <td class="w">WRITE [A]</td></tr>
+       <tr><td>6</td><td class="w">WRITE [C]</td><td class="r">READ [D]</td> <td class="w">WRITE [B]</td></tr>
+       <tr><td>7</td><td class="r">READ [D]</td> <td class="w">WRITE [C]</td><td class="w">WRITE [C]</td></tr>
+       <tr><td>8</td><td class="w">WRITE [D]</td><td class="w">WRITE [D]</td><td class="w">WRITE [D]</td></tr>
+     </tbody>
+   </table>
+
+Having more reads in flight can reduce write stalls. However, a higher unroll factor also
+increases register pressure because more intermediate values must be held simultaneously.
+
+The following timeline assumes that a read takes four units of time to arrive, after which the corresponding write can be issued. It also assumes that the link hasn't reached capacity.
+
+.. raw:: html
+
+   <style>.tb-timeline th { font-family: inherit; } .tb-timeline td { font-family: monospace; }</style>
+   <table class="tb-timeline" style="border-collapse:collapse;text-align:center;">
+     <colgroup>
+       <col style="width:80px;">
+       <col span="24" style="width:32px;">
+     </colgroup>
+     <tbody>
+       <tr>
+         <th style="text-align:left;padding:4px 8px;">Unroll 1</th>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">A</td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">A</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">B</td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">B</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">C</td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">C</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">D</td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">D</td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+       </tr>
+       <tr>
+         <th style="text-align:left;padding:4px 8px;">Unroll 2</th>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">A</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">B</td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">A</td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">B</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">C</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">D</td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#aaa;border:1px solid #ccc;"></td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">C</td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">D</td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+       </tr>
+       <tr>
+         <th style="text-align:left;padding:4px 8px;">Unroll 4</th>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">A</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">B</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">C</td>
+         <td style="background:#c8f0d8;border:1px solid #ccc;">D</td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">A</td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">B</td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">C</td>
+         <td style="background:#f8c8c8;border:1px solid #ccc;">D</td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+         <td style="border:1px solid #ccc;"></td>
+       </tr>
+     </tbody>
+   </table>
+
+The measured effect of unroll factor varies by transfer type. The following table shows
+example bandwidth values (in GB/s):
+
+.. list-table::
+    :header-rows: 1
+
+    * - ``GFX_UNROLL``
+      - Local copy with 4 CUs (``1 4 G0->G0->G0``)
+      - Remote copy with 1 subexecutor (``1 1 G0->G0->G1``)
+
+    * - 1
+      - 20.297
+      - 20.297
+
+    * - 2
+      - 37.669
+      - 36.599
+
+    * - 3
+      - 48.781
+      - 48.439
+
+    * - 4
+      - 62.887
+      - 59.407
+
+    * - 5
+      - 74.076
+      - 44.100
+
+    * - 6
+      - 84.769
+      - 59.386
+
+    * - 7
+      - 95.074
+      - —
+
+    * - 8
+      - 101.101
+      - —
+
+For the remote copy case, performance doesn't scale monotonically beyond unroll 4 because
+the link becomes the bottleneck rather than register occupancy.
+
+To configure the unroll factor, see :ref:`GFX_UNROLL environment variable <gfx-options>`.
+
+Preset questions
+=================
+
+This section answers common questions about TransferBench presets.
+
+.. _mem-type:
+
+What memory types do presets support?
+---------------------------------------
+
+Some TransferBench presets use the ``MEM_TYPE`` environment variable (or CPU- and
+GPU-specific variants) to select the memory type used during the transfer. The following
+table lists the supported memory types for CPU and GPU devices:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Memory device
+      - Memory type index
+      - Description
+      - Symbol
+      - Allocation method
+
+    * - CPU
+      - 0
+      - Default pinned host memory
+      - ``C``
+      - ``hipHostMalloc``
+
+    * - CPU
+      - 1
+      - Coherent pinned host memory
+      - ``B``
+      - ``hipHostMalloc`` with ``hipHostMallocCoherent`` flag
+
+    * - CPU
+      - 2
+      - Non-coherent pinned host memory
+      - ``D``
+      - ``hipHostMalloc`` with ``hipHostMallocNonCoherent`` flag
+
+    * - CPU
+      - 3
+      - Uncached pinned host memory
+      - ``K``
+      - ``hipHostMalloc`` with ``hipHostMallocUncached`` flag
+
+    * - CPU
+      - 4
+      - Unpinned host memory
+      - ``H``
+      - ``numa_alloc_onnode``
+
+    * - GPU
+      - 0
+      - Default GPU memory
+      - ``G``
+      - ``hipMalloc``
+
+    * - GPU
+      - 1
+      - Fine-grained GPU memory
+      - ``F``
+      - ``hipExtMallocWithFlags`` with ``hipDeviceMallocFinegrained``
+
+    * - GPU
+      - 2
+      - Uncached GPU memory
+      - ``U``
+      - ``hipExtMallocWithFlags`` with ``hipDeviceMallocUncached``
+
+    * - GPU
+      - 3
+      - Managed memory
+      - ``M``
+      - ``hipMallocManaged``
diff --git a/docs/reference/presets.rst b/docs/reference/presets.rst
new file mode 100644
index 00000000..58c42728
--- /dev/null
+++ b/docs/reference/presets.rst
@@ -0,0 +1,1722 @@
+.. meta::
+  :description: Reference for TransferBench presets, including all-to-all, peer-to-peer, NIC rings, sweep, and scaling tests with supported environment variables and example outputs.
+  :keywords: TransferBench presets, TransferBench a2a, TransferBench p2p, TransferBench nicrings, TransferBench nicp2p, TransferBench sweep, TransferBench scaling
+
+.. _running-presets:
+
+=======================
+TransferBench presets
+=======================
+
+A preset is a predefined series of Transfers that you can run instead of configuring the Transfers manually.
+
+The following table lists the presets available on TransferBench 1.66.03:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Preset name
+      - Description
+      - Multinode support
+
+    * - :ref:`All-to-all preset (a2a) <a2a>`
+      - Tests parallel transfers between all pairs of GPU devices.
+      - ✅
+
+    * - :ref:`All-to-all via nearest NIC preset (a2a_n) <a2a_n>`
+      - Tests parallel transfers between all pairs of GPU devices using nearest NIC RDMA.
+      - ❌
+
+    * - :ref:`All-to-all sweep preset (a2asweep) <a2asweep>`
+      - Performs a parameter sweep of GFX-based all-to-all transfers across different subexecutor counts, unroll factors, and thread block sizes.
+      - ❌
+
+    * - :ref:`NIC rings preset (nicrings) <nicrings>`
+      - Tests NIC rings created across identical NIC indices across ranks.
+      - ✅
+
+    * - :ref:`NIC peer-to-peer preset (nicp2p) <nicp2p>`
+      - Tests multinode peer-to-peer RDMA transfer between all NICs across all ranks.
+      - ✅
+
+    * - :ref:`One-to-all preset (one2all) <one2all>`
+      - Tests all subsets of parallel transfers from one GPU to the others.
+      - ❌
+
+    * - :ref:`Peer-to-peer preset (p2p) <p2p>`
+      - Tests unidirectional and bidirectional transfers for CPU-to-CPU, CPU-to-GPU, and GPU-to-GPU combinations.
+      - ❌
+
+    * - :ref:`Scaling preset (scaling) <scaling>`
+      - Runs a scaling test from one GPU to all other devices (CPUs and GPUs).
+      - ❌
+
+    * - :ref:`Schmoo preset (schmoo) <schmoo>`
+      - Runs scaling tests for local and remote read, write, and copy operations between two GPUs.
+      - ❌
+
+    * - :ref:`Sweep or random sweep preset (sweep/rsweep) <sweep>`
+      - Tests combinations of source (SRC), executor, and destination (DST) with varying parallelism.
+      - ❌
+
+.. note::
+
+    You can modify a preset using environment variables, which are listed in the output when you run the preset.
+
+.. _a2a:
+
+All-to-all preset (a2a)
+========================
+
+The a2a preset tests parallel transfers between all pairs of GPU devices. It measures bidirectional bandwidth across every GPU-to-GPU combination on a single-node or multinode system. It supports GFX (compute kernel) and DMA all-to-all, and can run NIC executor rings in parallel.
+
+**Key features:**
+
+- **GFX/DMA mode:** Creates transfers for every (src GPU to dst GPU) pair on each rank. Optionally restricts to directly connected XGMI links (``A2A_DIRECT=1``).
+
+- **Transfer modes:** Copy (1 src → 1 dst), read-only (1 src → null), write-only (null → 1 dst), or custom (numSrcs:numDsts).
+
+- **NIC rings:** When ``NUM_QUEUE_PAIRS`` > 0, adds NIC-based ring transfers (GPU i → GPU (i+1)%N) using nearest-NIC RDMA.
+
+- Prints a SRC x DST bandwidth matrix with row and column totals, aggregate bandwidth, and min/max/avg across ranks for multinode systems.
+
+- Forces ``USE_SINGLE_STREAM=1`` for all-to-all.
+
+- **On AMD hardware:** ``A2A_DIRECT=1`` uses ``hipExtGetLinkTypeAndHopCount`` to skip non-direct XGMI pairs.
+
+- **Multinode:** Each rank must have the same number of GPUs. Differences in the NIC configuration across ranks produce a warning.
+
+**Usage:**
+
+.. code-block:: shell
+
+    ./TransferBench a2a [numBytes]
+
+Environment variables
+----------------------
+
+To modify the behavior of the a2a preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``A2A_DIRECT``
+      - ``1`` = use only directly connected XGMI links (hop count = 1), ``0`` = full all-to-all. Restricting to direct links can be useful on older MI2XX hardware that doesn't feature full all-to-all XGMI connectivity, where running the standard all-to-all between all pairs of GPUs ends up using some XGMI links more than once.
+      - ``1``
+
+    * - ``A2A_LOCAL``
+      - Whether to include local transfers (i→i). ``0`` = exclude, ``1`` = include.
+      - ``0``
+
+    * - ``A2A_MODE``
+      - Transfer mode: 0=Copy, 1=Read-Only, 2=Write-Only, or numSrcs:numDsts for custom. Modes with multiple sources or destinations mimic the behavior of some collective algorithms such as RingReduce, which sometimes require reading from two local buffers, adding them together, then writing to a local output buffer and a remote temp buffer.
+      - ``0``
+
+    * - ``GFX_UNROLL``
+      - GFX kernel unroll factor. Overrides global default. See :ref:`gfx-unroll`.
+      - ``2``
+
+    * - ``MEM_TYPE``
+      - GPU memory type: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`.
+      - ``2``
+
+    * - ``NUM_GPU_DEVICES``
+      - Number of GPUs to use.
+      - (detected)
+
+    * - ``NUM_QUEUE_PAIRS``
+      - Queue pairs per NIC transfer. 0 = no NIC rings.
+      - ``0``
+
+    * - ``NUM_RESULTS``
+      - Shows the top or bottom N results per cell for multinode runs. Defaults to ``1`` when running with more than one rank, otherwise ``0``.
+      - ``0`` or ``1``
+
+    * - ``NUM_SUB_EXEC``
+      - Subexecutors (CUs or WGPs) per transfer.
+      - ``8``
+
+    * - ``SHOW_DETAILS``
+      - Shows full results per transfer.
+      - ``0``
+
+    * - ``USE_DMA_EXEC``
+      - ``1`` = use the DMA executor instead of GFX. Valid only for ``A2A_MODE=0`` (copy).
+      - ``0``
+
+    * - ``USE_FINE_GRAIN``
+      - Deprecated. Use ``MEM_TYPE`` instead.
+      - (deprecated)
+
+    * - ``USE_REMOTE_READ``
+      - ``1`` = use the DST GPU as executor (remote read), ``0`` = use the SRC GPU as executor (local read).
+      - ``0``
+
+Example output
+---------------
+
+.. tab-set::
+
+    .. tab-item:: AMD Instinct™ MI300X
+
+        .. image:: /data/a2a_MI300X.png
+            :width: 100%
+            :align: center
+
+    .. tab-item:: AMD Instinct™ MI350X
+
+        .. image:: /data/a2a_MI350X.png
+            :width: 100%
+            :align: center
+
+The table in the output shows the transfer rate for each pair of GPUs, as measured using GPU timestamps.
+
+- ``STotal``: Indicates the total send bandwidth, summed for each SRC GPU.
+
+- ``RTotal``: Indicates the total receive bandwidth, summed for each DST GPU.
+
+- ``Actual``: Reflects the actual time for the kernel to finish executing the slowest transfer. Because one GFX kernel is launched to handle all Transfers to other GPUs, the kernel doesn't finish until the slowest transfer completes.
+
+- ``CPU Timed``: Reflects the wall-clock time, as measured on the CPU, for all the Transfers to complete.
+
+.. note::
+
+    To rule out serialization, check that the CPU Timed bandwidth is close to the aggregate GPU Timed bandwidth.
+
+    To avoid serialization when running with DMA executor, increase the number of hardware queues available.
+
+    As the following output shows, ``GPU_MAX_HW_QUEUES`` defaults to just 4 if not set:
+
+    .. image:: /data/a2a_serialization.png
+        :width: 100%
+        :align: center
+
+    In addition to the warning ``[WARN] DMA 0 attempting n parallel transfers, however GPU_MAX_HW_QUEUES only set to 4`` that TransferBench issues, insufficient hardware queues also show up as a large discrepancy between the CPU Timed aggregate bandwidth and the GPU Timed aggregate bandwidth.
+
+.. _a2a_n:
+
+All-to-all via nearest NIC preset (a2a_n)
+==========================================
+
+The a2a_n preset tests parallel transfers between all pairs of GPU devices using nearest NIC RDMA. Each transfer uses the NIC closest to the SRC GPU to send to the NIC closest to the DST GPU.
+
+**Key features:**
+
+- Creates Transfers for every SRC GPU and DST GPU pair using the NIC closest to the SRC GPU to read, and the NIC closest to the DST GPU to write.
+
+- Prints a SRC x DST bandwidth matrix with row totals, column totals, and aggregate bandwidth.
+
+- Reports average and aggregate bandwidth (Tx-thread timed and CPU timed).
+
+- Supports single node only: Multinode is not supported.
+
+**Usage:**
+
+.. code-block:: shell
+
+    ./TransferBench a2a_n [numBytes]
+
+Environment variables
+----------------------
+
+To modify the behavior of the a2a_n preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``MEM_TYPE``
+      - GPU memory type: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`.
+      - ``2``
+
+    * - ``NUM_GPU_DEVICES``
+      - Number of GPUs to use.
+      - (detected)
+
+    * - ``NUM_QUEUE_PAIRS``
+      - Queue pairs per transfer.
+      - ``1``
+
+.. note::
+
+    The a2a_n preset divides the available NIC bandwidth among the GPU peers.
+
+.. _a2asweep:
+
+All-to-all sweep preset (a2asweep)
+===================================
+
+The a2asweep preset performs a parameter sweep of GFX-based all-to-all transfers across different subexecutor counts, unroll factors, and thread block sizes. It helps find optimal configurations for GPU all-to-all bandwidth on your hardware.
+
+**Key features:**
+
+- Sweeps ``BLOCKSIZES`` (thread block size).
+
+- For each block size, sweeps ``NUM_SUB_EXECS`` (CU count) x ``UNROLLS`` (unroll factor).
+
+- Sweep order: Outer loop over ``BLOCKSIZES``, then table of (``NUM_SUB_EXECS`` x ``UNROLLS``).
+
+- By default, reports only the slowest GPU's bandwidth (min bandwidth) per CU-Unroll combination. To also include the fastest GPU's bandwidth (max bandwidth) per config, set ``SHOW_MIN_ONLY=0``.
+
+- Uses the same transfer topology settings as the a2a preset, such as ``A2A_DIRECT`` and ``A2A_MODE``.
+
+**Restrictions:**
+
+- Supports single node only: Multinode is not supported.
+
+- Forced single-stream: ``useSingleStream`` = 1.
+
+- Can't use ``USE_SPRAY`` with multiple destination buffers (``numDsts`` > 1).
+
+**Usage:**
+
+.. code-block:: shell
+
+    ./TransferBench a2asweep
+
+To use custom sweep ranges:
+
+.. code-block:: shell
+
+    BLOCKSIZES=256,384 UNROLLS=2,4,8 NUM_SUB_EXECS=4,8,16 ./TransferBench a2asweep
+
+Environment variables
+----------------------
+
+To modify the behavior of the a2asweep preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``A2A_DIRECT``
+      - To use only directly-connected GPU pairs, set to ``1``. For full all-to-all, set to ``0``.
+      - ``1``
+
+    * - ``A2A_LOCAL``
+      - To include local transfers, set to ``1``. To exclude, set to ``0``.
+      - ``0``
+
+    * - ``A2A_MODE``
+      - Transfer mode: 0=Copy, 1=Read-Only, 2=Write-Only, or numSrcs:numDsts for custom.
+      - ``0``
+
+    * - ``BLOCKSIZES``
+      - Comma-separated thread block sizes, such as 256, 384, or 512.
+      - ``256``
+
+    * - ``MEM_TYPE``
+      - GPU memory type: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`.
+      - ``2``
+
+    * - ``NUM_GPU_DEVICES``
+      - Number of GPUs in all-to-all group.
+      - (all detected)
+
+    * - ``NUM_SUB_EXECS``
+      - Comma-separated subexecutor (CU or WGP) counts to sweep.
+      - ``4,8,12,16,24,32``
+
+    * - ``SHOW_MIN_ONLY``
+      - To show only the slowest GPU result, set to ``1``. To show the slowest and the fastest GPU results, set to ``0``.
+      - ``1``
+
+    * - ``UNROLLS``
+      - Comma-separated unroll factors to sweep. See :ref:`gfx-unroll`.
+      - ``1,2,3,4,6,8``
+
+    * - ``USE_REMOTE_READ``
+      - To use the executor on DST, set to ``1``. To use the executor on SRC, set to ``0``.
+      - ``0``
+
+    * - ``USE_SPRAY``
+      - To configure each subexecutor to target all GPUs, set to ``1``. To target only one GPU, set to ``0``. Not valid with multiple DST buffers.
+      - ``0``
+
+    * - ``VERBOSE``
+      - To show detailed results per configuration, set to ``1``.
+      - ``0``
+
+Example output
+---------------
+
+.. tab-set::
+
+    .. tab-item:: AMD Instinct MI300X
+
+        .. image:: /data/a2asweep_MI300X.png
+            :width: 100%
+            :align: center
+
+    .. tab-item:: AMD Instinct MI350X
+
+        .. image:: /data/a2asweep_MI350X.png
+            :width: 100%
+            :align: center
+
+.. _nicrings:
+
+NIC rings (nicrings)
+=====================
+
+The nicrings preset tests NIC rings formed from NICs that share the same index on every rank. It measures RDMA bandwidth in ring topologies where each rank sends to the next rank in the ring, using the GPU or CPU memory closest to each NIC.
+
+The following image shows the ring topology:
+
+.. image:: /data/nicrings.png
+    :width: 100%
+    :align: center
+
+**Key features:**
+
+- Ring construction: Creates parallel RDMA rings across all ranks with one ring per GPU/CPU-to-NIC pair (memIndex-nicIndex), where that NIC is the closest to that memory.
+
+- Topology of each ring: Rank 0->1->2->...->N-1->0.
+
+- Can use GPU memory or CPU memory (NUMA nearest to NIC) as buffer.
+
+- Supports RDMA read or write. To choose between RDMA read and RDMA write in multirank runs, use ``USE_RDMA_READ``.
+
+- Homogeneous ranks required: Supports multinode provided that all ranks are homogeneous (same topology). Use ``NIC_FILTER`` to limit NIC visibility if needed.
+
+- Transfer direction: ``currRank`` sends to (``currRank`` + 1) % ``numRanks``.
+
+- Executor placement: Executor is placed on the SRC rank for RDMA write and DST rank for RDMA read.
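+
+The transfer direction and executor placement rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not the TransferBench source: the function name ``ring_transfers`` and its tuple layout are hypothetical.
+
+.. code-block:: python
+
+    def ring_transfers(num_ranks, use_rdma_read=False):
+        """Return (src_rank, dst_rank, exe_rank) for each hop of one ring."""
+        transfers = []
+        for curr_rank in range(num_ranks):
+            next_rank = (curr_rank + 1) % num_ranks  # last rank wraps to rank 0
+            # RDMA write runs on the SRC rank, RDMA read on the DST rank.
+            exe_rank = next_rank if use_rdma_read else curr_rank
+            transfers.append((curr_rank, next_rank, exe_rank))
+        return transfers
+
+    print(ring_transfers(4))
+    # [(0, 1, 0), (1, 2, 1), (2, 3, 2), (3, 0, 3)]
+
+One such ring exists per memory-NIC pair, so a rank with several NICs takes part in several parallel rings.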
+
+**Usage:**
+
+.. code-block:: shell
+
+    ./TransferBench nicrings
+
+To use CPU memory:
+
+.. code-block:: shell
+
+  USE_CPU_MEM=1 ./TransferBench nicrings
+
+To use RDMA read and see details:
+
+.. code-block:: shell
+
+  SHOW_DETAILS=1 USE_RDMA_READ=1 NUM_QUEUE_PAIRS=2 ./TransferBench nicrings
+
+Environment variables
+----------------------
+
+To modify the behavior of the nicrings preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``MEM_TYPE``
+      - Memory type index. See :ref:`mem-type`.
+      - ``0``
+
+    * - ``NUM_QUEUE_PAIRS``
+      - Queue pairs per NIC transfer.
+      - ``1``
+
+    * - ``SHOW_DETAILS``
+      - To see full transfer details, set to ``1``.
+      - ``0``
+
+    * - ``USE_CPU_MEM``
+      - To use CPU memory closest to each NIC, set to ``1``. To use GPU memory, set to ``0``.
+      - ``0``
+
+    * - ``USE_RDMA_READ``
+      - To use RDMA reads, set to ``1``. To use RDMA writes, set to ``0``. Applies when ``numRanks`` > 1.
+      - ``0``
+
+Example output
+---------------
+
+Here is an example output collected on four MI350X nodes with 8 NICs:
+
+.. image:: /data/nicrings_MI350X.png
+  :width: 100%
+  :align: center
+
+.. _nicp2p:
+
+NIC peer-to-peer preset (nicp2p)
+=================================
+
+The nicp2p preset runs a multinode peer-to-peer RDMA transfer test between all NICs across all ranks. It measures bandwidth for every NIC-to-NIC pair using round-robin scheduling to avoid contention.
+
+**Key features:**
+
+- Tests all (``srcRank``, ``srcNic``) -> (``dstRank``, ``dstNic``) pairs.
+
+- Device selection: Uses ``GetClosestDeviceToNic()`` to pick CPU NUMA or GPU closest to each NIC based on ``SRC_MEM_TYPE`` or ``DST_MEM_TYPE``, and ``USE_CPU_*`` flags.
+
+- Allows using RDMA read instead of write through ``USE_REMOTE_READ``.
+
+- Round-robin and combination schedule: Node pairs are scheduled in round-robin. Within each node pair, NIC pairs use combination schedule with ``NIC_PARALLEL_LEVEL``.
+
+- Output: Full matrix or column format, followed by a summary of the 10 fastest and 10 slowest connections.
+
+- Progress report: Prints progress to stderr. For example, "Completed X/Y pairs in Zs, estimated remaining time Ws".
+
+- Homogeneous ranks required: Supports multinode provided that all ranks are homogeneous (same topology). Use ``NIC_FILTER`` to limit NIC visibility if needed.
+
+- NICs required: Exits with error if no NICs are detected.
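+
+The number of tested pairs grows quadratically with the number of NICs, which is why the progress report matters. The following sketch counts the pairs and extrapolates the remaining time. It is an illustration only: the helper names are hypothetical, and the linear remaining-time formula is an assumption inferred from the example output in this section, not taken from the TransferBench source.
+
+.. code-block:: python
+
+    from itertools import product
+
+    def nic_pairs(num_ranks, nics_per_rank):
+        """List every ((srcRank, srcNic), (dstRank, dstNic)) pair."""
+        endpoints = list(product(range(num_ranks), range(nics_per_rank)))
+        # A NIC paired with itself is included, as in the example matrix.
+        return list(product(endpoints, endpoints))
+
+    def estimate_remaining(elapsed_s, completed, total):
+        """Linear extrapolation of remaining time (assumed formula)."""
+        return elapsed_s * (total - completed) / completed
+
+    pairs = nic_pairs(num_ranks=2, nics_per_rank=8)
+    print(len(pairs))  # 256
+    print(round(estimate_remaining(43.756, 128, 256), 3))  # 43.756
+
+Two ranks with eight NICs each give 16 endpoints and 16 x 16 = 256 pairs, matching the pair count in the example output.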
+
+**Usage:**
+
+.. code-block:: shell
+
+  ./TransferBench nicp2p
+
+To use CPU memory and see output in column format:
+
+.. code-block:: shell
+
+  OUTPUT_FORMAT=0 USE_CPU_SRC_MEM=1 USE_CPU_DST_MEM=1 ./TransferBench nicp2p
+
+Environment variables
+----------------------
+
+To modify the behavior of the nicp2p preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``NUM_QUEUE_PAIRS``
+      - Queue pairs per transfer (displayed as ``NUM_NIC_SE``).
+      - ``1``
+
+    * - ``USE_REMOTE_READ``
+      - To use the DST GPU as executor (remote read), set to ``1``. To use the SRC GPU as executor (local read), set to ``0``.
+      - ``0``
+
+    * - ``OUTPUT_FORMAT``
+      - To output the full matrix, set to ``1``. For output in column format, set to ``0``. Column format is recommended when there are many NIC pairs.
+      - ``1``
+
+    * - ``USE_CPU_SRC_MEM``
+      - To use CPU memory as SRC, set to ``1``. To use GPU memory as SRC, set to ``0``.
+      - ``0``
+
+    * - ``USE_CPU_DST_MEM``
+      - To use CPU memory as DST, set to ``1``. To use GPU memory as DST, set to ``0``.
+      - ``0``
+
+    * - ``SRC_MEM_TYPE``
+      - Source memory type index. See :ref:`mem-type`.
+      - ``2``
+
+    * - ``DST_MEM_TYPE``
+      - Destination memory type index. See :ref:`mem-type`.
+      - ``2``
+
+    * - ``PARALLEL_NODE``
+      - To execute node pairs in parallel, set to ``1``. For serial execution, set to ``0``. By default, nicp2p tries to run Transfers between node pairs in parallel to reduce the overall runtime. For example, (Rank 0->Rank 1) + (Rank 2->Rank 3) are run in parallel instead of (Rank 0->Rank 1) followed by (Rank 2->Rank 3).
+      - ``1``
+
+    * - ``NIC_PARALLEL_LEVEL``
+      - Number of NIC-to-NIC pairs that run in parallel between a node pair. By default, all available NICs between a pair of nodes are used in parallel, and no NIC is used more than once at a time. This parallelism reduces the overall runtime. It can be disabled if it impacts the measured performance.
+      - ``numNicsPerRank``
+
+Example output
+---------------
+
+.. code-block:: shell
+
+  [P2P Network Related]
+  NUM_NIC_SE           =            1 : Using 1 queue pairs per Transfer
+  USE_REMOTE_READ      =            0 : Using SRC as executor
+  OUTPUT_FORMAT        =            1 : Printing results in full matrix format
+  USE_CPU_SRC_MEM      =            0 : Source memory is GPU
+  USE_CPU_DST_MEM      =            0 : Destination memory is GPU
+  SRC_MEM_TYPE         =            2 : Using uncached GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
+  DST_MEM_TYPE         =            2 : Using uncached GPU memory (0=default, 1=fine-grained, 2=uncached, 3=managed)
+  PARALLEL_NODE        =            1 : Executing p2p node pairs in parallel: yes
+  NIC_PARALLEL_LEVEL   =            8 : Between a pair of nodes, 8 pairs of NIC-NIC transfers executed in parallel
+
+  Unidirectional copy peak bandwidth GB/s (NIC RDMA Using Nearest Device)
+  Completed 8/256 pairs in  2.656s, estimated remaining time 82.326s.
+  Completed 16/256 pairs in  5.537s, estimated remaining time 83.057s.
+  Completed 24/256 pairs in  8.351s, estimated remaining time 80.731s.
+  Completed 32/256 pairs in 11.251s, estimated remaining time 78.756s.
+  Completed 40/256 pairs in 14.159s, estimated remaining time 76.460s.
+  Completed 48/256 pairs in 16.688s, estimated remaining time 72.315s.
+  Completed 56/256 pairs in 19.113s, estimated remaining time 68.261s.
+  Completed 64/256 pairs in 21.748s, estimated remaining time 65.245s.
+  Completed 72/256 pairs in 24.465s, estimated remaining time 62.521s.
+  Completed 80/256 pairs in 27.377s, estimated remaining time 60.229s.
+  Completed 88/256 pairs in 30.264s, estimated remaining time 57.777s.
+  Completed 96/256 pairs in 32.851s, estimated remaining time 54.752s.
+  Completed 104/256 pairs in 35.601s, estimated remaining time 52.033s.
+  Completed 112/256 pairs in 38.404s, estimated remaining time 49.377s.
+  Completed 120/256 pairs in 41.035s, estimated remaining time 46.507s.
+  Completed 128/256 pairs in 43.756s, estimated remaining time 43.756s.
+  Completed 144/256 pairs in 45.877s, estimated remaining time 35.682s.
+  Completed 160/256 pairs in 47.736s, estimated remaining time 28.641s.
+  Completed 176/256 pairs in 50.091s, estimated remaining time 22.769s.
+  Completed 192/256 pairs in 51.892s, estimated remaining time 17.297s.
+  Completed 208/256 pairs in 53.863s, estimated remaining time 12.430s.
+  Completed 224/256 pairs in 55.850s, estimated remaining time  7.979s.
+  Completed 240/256 pairs in 57.924s, estimated remaining time  3.862s.
+  Completed 256/256 pairs in 60.043s, estimated remaining time  0.000s.
+  ┌------------┬-------------------------┬---------------------------------------------------------------------------------------┬---------------------------------------------------------------------------------------┐
+  │SRC+EXE\DST │                         │  Rank 00                                                                              │  Rank 01                                                                              │
+  ├------------┼-------------------------┼---------------------------------------------------------------------------------------┼---------------------------------------------------------------------------------------┤
+  │            │ NIC Device              │ bnxt_re0   bnxt_re1   bnxt_re2   bnxt_re3   bnxt_re4   bnxt_re5   bnxt_re6   bnxt_re7 │ bnxt_re0   bnxt_re1   bnxt_re2   bnxt_re3   bnxt_re4   bnxt_re5   bnxt_re6   bnxt_re7 │
+  │            │              Mem Device │   GPU 00     GPU 01     GPU 02     GPU 03     GPU 04     GPU 05     GPU 06     GPU 07 │   GPU 00     GPU 01     GPU 02     GPU 03     GPU 04     GPU 05     GPU 06     GPU 07 │
+  ├------------┼-------------------------┼---------------------------------------------------------------------------------------┼---------------------------------------------------------------------------------------┤
+  │    Rank 00 │   bnxt_re0       GPU 00 │    31.36      31.31      31.31      31.31      31.31      31.31      31.31      31.30 │    31.32      31.31      31.31      31.30      31.30      31.31      31.31      31.31 │
+  │            │   bnxt_re1       GPU 01 │    31.31      31.35      31.31      31.31      31.31      31.31      31.31      31.31 │    31.31      31.32      31.31      31.31      31.31      31.31      31.31      31.31 │
+  │            │   bnxt_re2       GPU 02 │    31.31      31.32      31.36      31.31      31.31      31.31      31.30      31.31 │    31.30      31.30      31.32      31.31      31.31      31.30      31.30      31.31 │
+  │            │   bnxt_re3       GPU 03 │    31.31      31.32      31.32      31.35      31.30      31.31      31.31      31.30 │    31.31      31.32      31.31      31.31      31.31      31.30      31.31      31.31 │
+  │            │   bnxt_re4       GPU 04 │    31.31      31.32      31.31      31.32      31.35      31.31      31.31      31.30 │    31.31      31.31      31.31      31.31      31.32      31.31      31.31      31.30 │
+  │            │   bnxt_re5       GPU 05 │    31.32      31.32      31.32      31.32      31.32      31.35      31.31      31.31 │    31.31      31.31      31.30      31.32      31.31      31.33      31.31      31.32 │
+  │            │   bnxt_re6       GPU 06 │    31.31      31.31      31.32      31.32      31.32      31.32      31.36      31.31 │    31.31      31.31      31.31      31.31      31.31      31.31      31.33      31.31 │
+  │            │   bnxt_re7       GPU 07 │    31.31      31.32      31.32      31.32      31.32      31.31      31.32      31.36 │    31.31      31.32      31.30      31.31      31.30      31.31      31.30      31.32 │
+  ├------------┼-------------------------┼---------------------------------------------------------------------------------------┼---------------------------------------------------------------------------------------┤
+  │    Rank 01 │   bnxt_re0       GPU 00 │    31.33      31.30      31.30      31.31      31.31      31.31      31.31      31.30 │    31.36      31.31      31.31      31.31      31.31      31.31      31.30      31.31 │
+  │            │   bnxt_re1       GPU 01 │    31.32      31.32      31.31      31.30      31.31      31.31      31.31      31.31 │    31.32      31.36      31.31      31.31      31.30      31.30      31.30      31.30 │
+  │            │   bnxt_re2       GPU 02 │    31.31      31.30      31.32      31.31      31.31      31.31      31.30      31.31 │    31.32      31.32      31.35      31.31      31.31      31.31      31.30      31.31 │
+  │            │   bnxt_re3       GPU 03 │    31.31      31.31      31.31      31.32      31.30      31.31      31.30      31.31 │    31.31      31.32      31.32      31.36      31.31      31.31      31.31      31.30 │
+  │            │   bnxt_re4       GPU 04 │    31.30      31.31      31.31      31.31      31.32      31.31      31.32      31.31 │    31.32      31.32      31.31      31.32      31.36      31.31      31.31      31.31 │
+  │            │   bnxt_re5       GPU 05 │    31.30      31.31      31.31      31.31      31.30      31.32      31.31      31.31 │    31.31      31.32      31.32      31.32      31.32      31.36      31.31      31.31 │
+  │            │   bnxt_re6       GPU 06 │    31.32      31.31      31.31      31.30      31.31      31.30      31.33      31.30 │    31.32      31.31      31.31      31.32      31.32      31.31      31.35      31.31 │
+  │            │   bnxt_re7       GPU 07 │    31.31      31.31      31.31      31.31      31.31      31.31      31.31      31.32 │    31.31      31.31      31.32      31.32      31.31      31.32      31.32      31.35 │
+  └------------┴-------------------------┴---------------------------------------------------------------------------------------┴---------------------------------------------------------------------------------------┘
+  Summary of top 10 fastest/slowest connection
+  ┌--------------------------┬--------------┬--------------┬--------------------------┬--------------┬--------------┐
+  │ Fastest Bandwidth (GB/s) │          Src │          Dst │ Slowest Bandwidth (GB/s) │          Src │          Dst │
+  ├--------------------------┼--------------┼--------------┼--------------------------┼--------------┼--------------┤
+  │                    31.36 │ R00:bnxt_re0 │ R00:bnxt_re0 │                    31.30 │ R01:bnxt_re0 │ R00:bnxt_re1 │
+  │                    31.36 │ R01:bnxt_re5 │ R01:bnxt_re5 │                    31.30 │ R00:bnxt_re4 │ R01:bnxt_re7 │
+  │                    31.36 │ R00:bnxt_re7 │ R00:bnxt_re7 │                    31.30 │ R01:bnxt_re5 │ R00:bnxt_re4 │
+  │                    31.36 │ R01:bnxt_re0 │ R01:bnxt_re0 │                    31.30 │ R00:bnxt_re3 │ R01:bnxt_re7 │
+  │                    31.36 │ R00:bnxt_re2 │ R00:bnxt_re2 │                    31.30 │ R01:bnxt_re2 │ R00:bnxt_re1 │
+  │                    31.36 │ R00:bnxt_re6 │ R00:bnxt_re6 │                    31.30 │ R01:bnxt_re0 │ R00:bnxt_re7 │
+  │                    31.36 │ R01:bnxt_re1 │ R01:bnxt_re1 │                    31.30 │ R00:bnxt_re5 │ R01:bnxt_re2 │
+  │                    31.36 │ R01:bnxt_re4 │ R01:bnxt_re4 │                    31.30 │ R01:bnxt_re1 │ R01:bnxt_re5 │
+  │                    31.36 │ R01:bnxt_re3 │ R01:bnxt_re3 │                    31.30 │ R01:bnxt_re6 │ R01:bnxt_re7 │
+  │                    31.35 │ R01:bnxt_re7 │ R01:bnxt_re7 │                    31.30 │ R01:bnxt_re2 │ R01:bnxt_re6 │
+  └--------------------------┴--------------┴--------------┴--------------------------┴--------------┴--------------┘
+
+.. _one2all:
+
+One-to-all preset (one2all)
+============================
+
+The one2all preset tests all subsets of parallel transfers from one GPU to the others. It sweeps over varying numbers of DST peers (from ``SWEEP_MIN`` to ``SWEEP_MAX``), and for each count, tests every combination of DST GPUs from a single SRC or executor GPU.
+
+**Key features:**
+
+- Minimum two GPUs: Requires at least two GPUs. Uses one GPU (``EXE_INDEX``) as SRC and executor.
+
+- Sweeps over all combinations of 1, 2, ..., N DST GPUs (excluding the SRC).
+
+- Combination sweep: For each peer count ``p``, iterates over all bitmasks with exactly ``p`` bits set (excluding ``EXE_INDEX``).
+
+- For each combination, runs parallel transfers and reports bandwidth per DST.
+
+- Supports GFX or DMA executor. SRC or DST can either be GPU or Null.
+
+- Single node only: Multinode runs are not supported.
+
+- Invalid configs skipped: Skips a configuration when (``exe`` = DMA and (``src`` = N or ``dst`` = N)) or (``src`` = N and ``dst`` = N).
+
+- Output format: Each line shows bandwidth per DST GPU, ``p``, ``numSubExecs``, and transfer triplets.
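+
+The combination sweep can be sketched as a bitmask enumeration. This is a minimal illustration, not the TransferBench source: the function name ``dst_combinations`` is hypothetical, and visiting the masks in ascending order is an assumption that happens to match the row order of the example output in this section.
+
+.. code-block:: python
+
+    def dst_combinations(num_gpus, exe_index, num_peers):
+        """Return the DST GPU list for every mask with exactly num_peers bits set."""
+        combos = []
+        for mask in range(1 << num_gpus):
+            if mask & (1 << exe_index):
+                continue  # the SRC/executor GPU is never a DST
+            if bin(mask).count("1") != num_peers:
+                continue  # keep only masks with exactly p bits set
+            combos.append([g for g in range(num_gpus) if mask & (1 << g)])
+        return combos
+
+    print(dst_combinations(8, 0, 2)[:4])
+    # [[1, 2], [1, 3], [2, 3], [1, 4]]
+
+With eight GPUs and ``EXE_INDEX`` = 0, a peer count of 2 yields 21 combinations, one per output row for that peer count.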
+
+**Usage:**
+
+.. code-block:: shell
+
+  ./TransferBench one2all
+
+To use GPU 2 as SRC and sweep from 4 to 7 DST peers:
+
+.. code-block:: shell
+
+  EXE_INDEX=2 SWEEP_MIN=4 SWEEP_MAX=7 ./TransferBench one2all
+
+Environment variables
+----------------------
+
+To modify the behavior of the one2all preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``NUM_GPU_DEVICES``
+      - Number of GPUs.
+      - (all detected)
+
+    * - ``NUM_GPU_SE``
+      - Subexecutors (CUs) per transfer.
+      - ``4``
+
+    * - ``EXE_INDEX``
+      - GPU index to use as executor or SRC.
+      - ``0``
+
+    * - ``SWEEP_DIR``
+      - Transfer direction.
+      - ``0``
+
+    * - ``SWEEP_SRC``
+      - SRC memory types: G=GPU, N=Null.
+      - ``G``
+
+    * - ``SWEEP_DST``
+      - DST memory types: G=GPU, N=Null.
+      - ``G``
+
+    * - ``SWEEP_EXE``
+      - Executor types: G=GFX, D=DMA.
+      - ``G``
+
+    * - ``SWEEP_MIN``
+      - Minimum number of DST peers.
+      - ``1``
+
+    * - ``SWEEP_MAX``
+      - Maximum number of DST peers.
+      - ``numGpuDevices``
+
+Example output
+---------------
+
+.. tab-set::
+
+  .. tab-item:: AMD Instinct MI300X
+
+    .. code-block:: shell
+
+      [One-To-All Related]
+      NUM_GPU_DEVICES      =            8 : Using 8 GPUs
+      NUM_GPU_SE           =            4 : Using 4 subExecutors/CUs per Transfer
+      EXE_INDEX            =            0 : Executing on GPU 0
+      SWEEP_DIR            =            0 : Direction of transfer
+      SWEEP_DST            =            G : DST memory types to sweep
+      SWEEP_EXE            =            G : Executor type to use
+      SWEEP_MAX            =            8 : Maximum number of peers
+      SWEEP_MIN            =            1 : Minimum number of peers
+      SWEEP_SRC            =            G : SRC memory types to sweep
+
+      Executing (G0 -> G0 -> G*)
+        GPU 1        GPU 2        GPU 3        GPU 4        GPU 5        GPU 6        GPU 7
+      -------------------------------------------------------------------------------------------
+          49.409                                                                                  1 4 (G0 G0 G1)
+                      49.467                                                                     1 4 (G0 G0 G2)
+                                    49.215                                                        1 4 (G0 G0 G3)
+                                                47.526                                           1 4 (G0 G0 G4)
+                                                              48.045                              1 4 (G0 G0 G5)
+                                                                          48.278                 1 4 (G0 G0 G6)
+                                                                                        48.132    1 4 (G0 G0 G7)
+          48.954       35.346                                                                     2 4 (G0 G0 G1) (G0 G0 G2)
+          48.851                    48.869                                                        2 4 (G0 G0 G1) (G0 G0 G3)
+                      49.009       48.861                                                        2 4 (G0 G0 G2) (G0 G0 G3)
+          48.962                                 47.599                                           2 4 (G0 G0 G1) (G0 G0 G4)
+                      49.008                    47.486                                           2 4 (G0 G0 G2) (G0 G0 G4)
+                                    35.706       47.563                                           2 4 (G0 G0 G3) (G0 G0 G4)
+          48.833                                              31.660                              2 4 (G0 G0 G1) (G0 G0 G5)
+                      49.002                                 35.160                              2 4 (G0 G0 G2) (G0 G0 G5)
+                                    49.137                    47.565                              2 4 (G0 G0 G3) (G0 G0 G5)
+                                                47.613       47.706                              2 4 (G0 G0 G4) (G0 G0 G5)
+          48.972                                                           48.413                 2 4 (G0 G0 G1) (G0 G0 G6)
+                      48.917                                              48.389                 2 4 (G0 G0 G2) (G0 G0 G6)
+                                    37.319                                 48.397                 2 4 (G0 G0 G3) (G0 G0 G6)
+                                                32.618                    48.334                 2 4 (G0 G0 G4) (G0 G0 G6)
+                                                              47.749       48.497                 2 4 (G0 G0 G5) (G0 G0 G6)
+          48.787                                                                        35.541    2 4 (G0 G0 G1) (G0 G0 G7)
+                      48.824                                                           32.099    2 4 (G0 G0 G2) (G0 G0 G7)
+                                    48.862                                              47.863    2 4 (G0 G0 G3) (G0 G0 G7)
+                                                47.478                                 48.014    2 4 (G0 G0 G4) (G0 G0 G7)
+                                                              47.705                    35.595    2 4 (G0 G0 G5) (G0 G0 G7)
+                                                                          48.509       47.931    2 4 (G0 G0 G6) (G0 G0 G7)
+          44.235       48.729       44.548                                                        3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3)
+          43.164       45.482                    43.238                                           3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4)
+          31.360                    48.819       31.280                                           3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4)
+                      31.624       48.941       31.406                                           3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4)
+          41.797       46.652                                 41.706                              3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5)
+          41.739                    48.994                    41.575                              3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5)
+                      42.676       48.992                    42.683                              3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5)
+          42.621                                 47.369       42.536                              3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5)
+                      43.504                    47.353       43.639                              3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5)
+                                    31.263       47.357       31.202                              3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5)
+          44.168       47.169                                              44.632                 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G6)
+          30.692                    48.787                                 30.939                 3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G6)
+                      32.297       48.687                                 32.237                 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G6)
+          28.916                                 47.483                    29.027                 3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G6)
+                      28.024                    47.429                    28.253                 3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G6)
+                                    27.484       47.547                    27.506                 3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G6)
+          43.660                                              40.609       44.131                 3 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G6)
+                      44.196                                 46.915       44.520                 3 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G6)
+                                    42.547                    47.627       43.041                 3 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G6)
+                                                44.828       47.705       45.032                 3 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+          46.291       44.552                                                           46.139    3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G7)
+          46.779                    48.784                                              46.969    3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G7)
+                      42.319       48.889                                              42.591    3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G7)
+          46.980                                 47.296                                 47.003    3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G7)
+                      44.806                    47.395                                 45.020    3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G7)
+                                    31.296       47.280                                 31.418    3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G7)
+          45.477                                              44.531                    45.229    3 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G7)
+                      45.001                                 43.060                    44.962    3 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G7)
+                                    47.083                    41.743                    46.937    3 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G7)
+                                                42.876       45.829                    43.211    3 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+          42.205                                                           48.237       42.679    3 4 (G0 G0 G1) (G0 G0 G6) (G0 G0 G7)
+                      46.007                                              48.087       45.818    3 4 (G0 G0 G2) (G0 G0 G6) (G0 G0 G7)
+                                    31.938                                 48.267       32.044    3 4 (G0 G0 G3) (G0 G0 G6) (G0 G0 G7)
+                                                28.835                    48.077       28.934    3 4 (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+                                                              46.681       48.237       46.443    3 4 (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+          40.538       39.734       40.637       39.989                                           4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4)
+          43.540       35.372       43.132                    35.497                              4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5)
+          46.522       36.693                    46.656       36.883                              4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5)
+          41.551                    35.359       41.382       35.482                              4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5)
+                      41.302       40.839       40.951       40.931                              4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5)
+          38.601       37.573       38.677                                 37.788                 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G6)
+          39.196       41.692                    39.371                    42.069                 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G6)
+          39.194                    46.098       39.083                    45.956                 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6)
+                      33.541       41.203       33.486                    41.436                 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6)
+          41.140       38.015                                 41.354       37.837                 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G6)
+          41.764                    42.981                    42.139       43.384                 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6)
+                      44.813       46.952                    45.157       47.063                 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6)
+          42.990                                 42.942       42.790       42.787                 4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+                      42.439                    41.103       42.451       41.035                 4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+                                    41.678       42.340       41.546       42.608                 4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+          46.897       43.268       46.988                                              43.206    4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G7)
+          42.473       35.981                    42.221                                 35.803    4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G7)
+          39.066                    37.271       38.889                                 37.162    4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7)
+                      41.392       40.677       41.546                                 40.580    4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7)
+          38.916       30.582                                 39.062                    30.730    4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G7)
+          43.248                    39.370                    43.099                    39.565    4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7)
+                      45.966       34.208                    46.186                    34.160    4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7)
+          42.943                                 37.965       43.105                    37.827    4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+                      37.814                    29.784       37.870                    29.790    4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+                                    38.329       38.749       38.351                    38.800    4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+          44.992       32.694                                              44.743       32.608    4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G6) (G0 G0 G7)
+          39.867                    39.650                                 39.837       39.575    4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7)
+                      31.324       30.215                                 31.371       30.228    4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7)
+          34.020                                 39.810                    33.860       39.709    4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+                      33.420                    33.132                    33.431       33.105    4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+                                    31.942       41.954                    32.008       41.790    4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+          37.573                                              31.076       37.701       31.144    4 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                      38.455                                 36.476       38.483       36.316    4 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                                    45.473                    38.297       45.467       38.204    4 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                                                44.440       37.996       44.530       38.044    4 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+          37.237       44.266       37.207       44.146       37.286                              5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5)
+          34.692       45.404       34.637       45.561                    34.513                 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6)
+          35.046       32.117       34.965                    32.262       35.007                 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6)
+          39.664       33.774                    39.592       33.895       39.598                 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+          32.818                    32.518       32.747       32.515       32.774                 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+                      31.579       43.096       31.577       43.457       31.578                 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+          40.813       42.963       40.801       43.090                                 40.737    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7)
+          40.565       34.567       40.630                    34.859                    40.559    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7)
+          39.137       32.169                    39.183       32.270                    39.037    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+          31.289                    34.060       31.225       34.050                    31.250    5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+                      38.908       42.629       38.936       43.247                    38.947    5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+          41.545       44.415       41.614                                 44.221       41.622    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7)
+          34.760       37.380                    34.741                    37.467       34.541    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+          28.091                    35.858       28.037                    35.823       28.072    5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+                      28.942       37.485       28.963                    37.353       28.894    5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+          32.473       36.354                                 32.466       36.272       32.430    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+          41.725                    37.835                    41.615       37.916       41.462    5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                      35.491       45.836                    35.415       45.785       35.436    5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+          44.632                                 38.803       44.496       38.664       44.305    5 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                      39.944                    44.310       40.085       44.310       39.938    5 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                                    29.816       36.004       29.770       35.960       29.717    5 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+          34.725       35.633       34.708       35.705       34.657       35.797                 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+          39.720       37.520       39.526       37.566       39.491                    37.550    6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+          39.609       41.426       39.536       41.532                    39.521       41.447    6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+          39.203       33.233       39.339                    33.162       39.220       33.234    6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+          35.246       34.889                    35.226       34.842       35.218       34.841    6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+          41.457                    37.283       41.567       37.332       41.352       37.204    6 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                      33.003       37.075       33.068       36.900       32.971       36.937    6 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+          38.626       41.000       38.632       41.087       38.518       40.911       38.775    7 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+
+    .. tab-item:: AMD Instinct MI355X
+
+      .. code-block:: shell
+
+        [One-To-All Related]
+        NUM_GPU_DEVICES      =            8 : Using 8 GPUs
+        NUM_GPU_SE           =            4 : Using 4 subExecutors/CUs per Transfer
+        EXE_INDEX            =            0 : Executing on GPU 0
+        SWEEP_DIR            =            0 : Direction of transfer
+        SWEEP_DST            =            G : DST memory types to sweep
+        SWEEP_EXE            =            G : Executor type to use
+        SWEEP_MAX            =            8 : Maximum number of peers
+        SWEEP_MIN            =            1 : Minimum number of peers
+        SWEEP_SRC            =            G : SRC memory types to sweep
+
+        Executing (G0 -> G0 -> G*)
+          GPU 1        GPU 2        GPU 3        GPU 4        GPU 5        GPU 6        GPU 7
+        -------------------------------------------------------------------------------------------
+            57.060                                                                                  1 4 (G0 G0 G1)
+                        56.969                                                                     1 4 (G0 G0 G2)
+                                      49.018                                                        1 4 (G0 G0 G3)
+                                                  49.616                                           1 4 (G0 G0 G4)
+                                                                56.926                              1 4 (G0 G0 G5)
+                                                                            56.751                 1 4 (G0 G0 G6)
+                                                                                          49.459    1 4 (G0 G0 G7)
+            57.858       55.950                                                                     2 4 (G0 G0 G1) (G0 G0 G2)
+            56.203                    56.584                                                        2 4 (G0 G0 G1) (G0 G0 G3)
+                        56.249       55.990                                                        2 4 (G0 G0 G2) (G0 G0 G3)
+            56.304                                 56.307                                           2 4 (G0 G0 G1) (G0 G0 G4)
+                        55.829                    56.026                                           2 4 (G0 G0 G2) (G0 G0 G4)
+                                      55.066       55.944                                           2 4 (G0 G0 G3) (G0 G0 G4)
+            55.941                                              53.563                              2 4 (G0 G0 G1) (G0 G0 G5)
+                        48.896                                 49.449                              2 4 (G0 G0 G2) (G0 G0 G5)
+                                      50.291                    50.699                              2 4 (G0 G0 G3) (G0 G0 G5)
+                                                  49.792       49.264                              2 4 (G0 G0 G4) (G0 G0 G5)
+            48.798                                                           49.999                 2 4 (G0 G0 G1) (G0 G0 G6)
+                        55.917                                              53.447                 2 4 (G0 G0 G2) (G0 G0 G6)
+                                      49.444                                 49.879                 2 4 (G0 G0 G3) (G0 G0 G6)
+                                                  50.038                    49.559                 2 4 (G0 G0 G4) (G0 G0 G6)
+                                                                57.729       56.534                 2 4 (G0 G0 G5) (G0 G0 G6)
+            56.182                                                                        55.834    2 4 (G0 G0 G1) (G0 G0 G7)
+                        55.878                                                           55.928    2 4 (G0 G0 G2) (G0 G0 G7)
+                                      56.481                                              57.752    2 4 (G0 G0 G3) (G0 G0 G7)
+                                                  49.900                                 49.185    2 4 (G0 G0 G4) (G0 G0 G7)
+                                                                55.853                    56.308    2 4 (G0 G0 G5) (G0 G0 G7)
+                                                                            56.321       55.775    2 4 (G0 G0 G6) (G0 G0 G7)
+            52.080       50.746       51.941                                                        3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3)
+            54.335       54.254                    54.202                                           3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4)
+            49.266                    55.731       49.445                                           3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4)
+                        52.413       55.947       52.325                                           3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4)
+            39.503       54.296                                 39.712                              3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5)
+            57.383                    56.119                    57.456                              3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5)
+                        50.184       56.256                    50.205                              3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5)
+            57.250                                 56.207       57.346                              3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5)
+                        49.933                    56.055       49.519                              3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5)
+                                      48.265       56.240       48.151                              3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5)
+            47.040       50.109                                              47.149                 3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G6)
+            50.567                    56.220                                 50.564                 3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G6)
+                        56.907       56.313                                 56.986                 3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G6)
+            50.609                                 56.264                    50.417                 3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G6)
+                        56.975                    56.041                    56.826                 3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G6)
+                                      48.868       56.275                    48.590                 3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G6)
+            49.474                                              49.799       49.414                 3 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G6)
+                        39.407                                 53.626       39.264                 3 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G6)
+                                      52.668                    51.885       52.746                 3 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G6)
+                                                  54.683       50.035       54.503                 3 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+            54.751       51.185                                                           54.714    3 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G7)
+            49.464                    56.451                                              49.507    3 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G7)
+                        50.542       56.494                                              50.419    3 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G7)
+            47.802                                 53.791                                 47.561    3 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G7)
+                        47.249                    52.755                                 47.091    3 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G7)
+                                      41.682       55.054                                 41.609    3 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G7)
+            53.857                                              50.240                    53.689    3 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G7)
+                        46.694                                 49.802                    46.467    3 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G7)
+                                      52.817                    49.695                    52.708    3 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G7)
+                                                  42.766       49.378                    42.681    3 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+            47.020                                                           50.272       46.866    3 4 (G0 G0 G1) (G0 G0 G6) (G0 G0 G7)
+                        51.293                                              50.344       51.281    3 4 (G0 G0 G2) (G0 G0 G6) (G0 G0 G7)
+                                      52.745                                 50.363       52.573    3 4 (G0 G0 G3) (G0 G0 G6) (G0 G0 G7)
+                                                  43.464                    50.005       43.378    3 4 (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+                                                                52.110       53.252       52.204    3 4 (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+            53.978       53.951       53.909       53.994                                           4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4)
+            52.088       48.838       52.174                    48.706                              4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5)
+            54.746       51.347                    54.722       51.213                              4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5)
+            53.295                    54.767       53.528       54.685                              4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5)
+                        50.468       48.308       50.462       47.927                              4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5)
+            50.893       46.216       50.966                                 46.051                 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G6)
+            52.775       43.437                    52.870                    43.390                 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G6)
+            51.347                    47.597       51.299                    47.533                 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6)
+                        54.851       54.193       54.852                    54.315                 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6)
+            52.597       53.273                                 52.389       53.026                 4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G6)
+            49.185                    51.880                    49.343       51.712                 4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6)
+                        50.603       56.058                    50.795       55.960                 4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6)
+            49.493                                 53.818       49.462       53.614                 4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+                        50.473                    52.841       50.388       52.713                 4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+                                      55.233       53.448       54.880       53.259                 4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+            53.965       53.219       54.128                                              53.233    4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G7)
+            48.949       50.712                    48.946                                 50.613    4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G7)
+            52.486                    47.821       52.730                                 47.730    4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7)
+                        51.232       49.069       51.309                                 48.869    4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7)
+            49.876       51.404                                 49.772                    51.046    4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G7)
+            57.132                    57.070                    56.963                    56.772    4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7)
+                        49.970       57.176                    49.987                    56.920    4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7)
+            57.333                                 49.658       57.264                    49.806    4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+                        50.165                    49.903       50.134                    49.860    4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+                                      52.488       51.273       52.639                    51.069    4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+            51.169       54.829                                              51.031       54.709    4 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G6) (G0 G0 G7)
+            50.695                    57.240                                 50.471       56.931    4 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7)
+                        56.892       57.171                                 56.747       57.028    4 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7)
+            50.567                                 49.730                    50.262       49.642    4 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+                        56.972                    49.999                    56.850       49.764    4 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+                                      55.711       51.511                    55.656       51.235    4 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+            51.660                                              54.717       51.631       54.697    4 4 (G0 G0 G1) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                        48.339                                 50.612       48.182       50.660    4 4 (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                                      54.962                    54.106       54.762       53.969    4 4 (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                                                  49.339       51.046       49.435       50.976    4 4 (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+            44.784       52.269       44.716       52.113       44.546                              5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5)
+            51.310       52.573       51.138       52.632                    51.216                 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6)
+            53.244       47.458       53.128                    47.570       53.279                 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6)
+            53.537       49.462                    53.431       49.427       53.541                 5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+            47.703                    56.413       47.796       56.427       47.780                 5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+                        47.025       53.682       46.996       53.527       47.114                 5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+            44.808       52.363       44.935       52.466                                 44.864    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G7)
+            53.774       44.200       53.745                    44.146                    53.889    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G7)
+            50.409       42.969                    50.554       42.820                    50.407    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+            46.721                    55.426       46.727       55.217                    46.646    5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+                        49.917       52.813       50.019       52.586                    49.718    5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+            49.463       50.373       49.695                                 50.244       49.436    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G6) (G0 G0 G7)
+            49.394       50.794                    49.331                    50.565       49.373    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+            47.873                    51.213       47.900                    51.305       47.921    5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+                        47.109       51.965       47.182                    51.776       47.153    5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+            50.039       54.672                                 50.159       54.760       50.229    5 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+            46.918                    52.488                    47.028       52.327       47.033    5 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                        49.807       52.877                    50.009       52.756       49.904    5 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+            48.313                                 54.666       48.258       54.596       48.103    5 4 (G0 G0 G1) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                        47.352                    52.476       47.680       52.375       47.412    5 4 (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                                      45.647       51.850       45.700       51.787       45.618    5 4 (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+            53.797       53.041       53.715       53.185       53.728       53.055                 6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6)
+            50.819       49.056       50.912       49.257       50.800                    49.082    6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G7)
+            53.691       53.287       53.672       53.443                    53.601       53.312    6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G6) (G0 G0 G7)
+            51.184       51.978       51.156                    51.922       51.261       51.993    6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+            52.548       50.879                    52.511       51.038       52.776       50.962    6 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+            52.010                    52.226       51.881       52.229       51.977       52.150    6 4 (G0 G0 G1) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+                        49.444       48.838       49.543       48.895       49.396       48.811    6 4 (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+            48.520       53.242       48.504       53.057       48.517       53.075       48.642    7 4 (G0 G0 G1) (G0 G0 G2) (G0 G0 G3) (G0 G0 G4) (G0 G0 G5) (G0 G0 G6) (G0 G0 G7)
+
+.. _p2p:
+
+Peer-to-peer preset (p2p)
+==========================
+
+The p2p preset measures memory copy bandwidth between all pairs of CPU NUMA nodes and GPUs. It tests unidirectional and bidirectional transfers for CPU-to-CPU, CPU-to-GPU, GPU-to-CPU, and GPU-to-GPU combinations.
+
+**Key features:**
+
+- Tests all SRC-to-DST pairs across CPUs and GPUs.
+
+- Supports both unidirectional and bidirectional transfers (``P2P_MODE``).
+
+- Uses either GFX or DMA as the GPU executor (``USE_GPU_DMA``).
+
+- Supports remote read, where the DST device acts as the executor instead of the SRC device (``USE_REMOTE_READ``).
+
+- Prints a bandwidth matrix with row and column labels. Optionally shows min/max/stddev per iteration (``SHOW_ITERATIONS``).
+
+
+**Restrictions:**
+
+- Single node only: Multinode runs are not supported.
+
+- ``USE_FINE_GRAIN`` is deprecated: The preset returns an error if ``USE_FINE_GRAIN`` is set. Use ``CPU_MEM_TYPE`` and ``GPU_MEM_TYPE`` instead.
+
+- NVIDIA platforms: CPU executors can't access GPU memory, so those pairs are skipped.
+
+- Self-transfers skipped: CPU i-to-i and GPU i-to-i are skipped in bidirectional mode.
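+
+Because the preset rejects ``USE_FINE_GRAIN`` whenever it is set, a launch script might clear the variable and select the memory type explicitly. The following is a minimal sketch of such a pre-flight step, not part of TransferBench itself; the choice of ``GPU_MEM_TYPE=1`` is only an illustration and is not claimed to be equivalent to any previous ``USE_FINE_GRAIN`` value:
+
+.. code-block:: shell
+
+  # Hypothetical pre-flight check before launching the p2p preset.
+  # ${VAR+x} expands to "x" only when VAR is set, even if it is set to an empty value.
+  if [ -n "${USE_FINE_GRAIN+x}" ]; then
+    echo "USE_FINE_GRAIN is deprecated for p2p; unsetting it"
+    unset USE_FINE_GRAIN
+  fi
+
+  # Select the GPU memory type explicitly (1 = fine-grained, illustrative choice).
+  export GPU_MEM_TYPE=1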
+
+**Usage:**
+
+.. code-block:: shell
+
+  ./TransferBench p2p
+
+To run only unidirectional transfers with the DMA executor:
+
+.. code-block:: shell
+
+  P2P_MODE=1 USE_GPU_DMA=1 ./TransferBench p2p
+
+Environment variables
+----------------------
+
+To modify the behavior of the p2p preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``CPU_MEM_TYPE``
+      - CPU memory: 0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned. See :ref:`mem-type`.
+      - ``0``
+
+    * - ``GPU_MEM_TYPE``
+      - GPU memory: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`.
+      - ``0``
+
+    * - ``NUM_CPU_DEVICES``
+      - Number of CPU NUMA nodes. To avoid using any pairs involving CPUs, set it to ``0``.
+      - (all detected)
+
+    * - ``NUM_CPU_SE``
+      - CPU threads per CPU-executed transfer.
+      - ``4``
+
+    * - ``NUM_GPU_DEVICES``
+      - Number of GPUs. This can be modified to reduce the number of GPUs to test.
+      - (all detected)
+
+    * - ``NUM_GPU_SE``
+      - GPU CUs per transfer. Default value varies according to ``USE_GPU_DMA``.
+      - (device max / GFX default)
+
+    * - ``SHOW_ITERATIONS``
+      - To show detailed min/max/stddev per iteration, set to ``1``.
+      - ``0``
+
+    * - ``P2P_MODE``
+      - 0=both, 1=unidirectional only, 2=bidirectional only.
+      - ``0``
+
+    * - ``USE_GPU_DMA``
+      - To use DMA for GPU executor, set to ``1``. To use GFX, set to ``0``.
+      - ``0``
+
+    * - ``USE_REMOTE_READ``
+      - To place the executor on DST, set to ``1``. To place on SRC, set to ``0``.
+      - ``0``
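+
+As a sketch of how these variables can be combined in a script, the loop below runs the preset once per GPU executor type and saves each report to its own file. The ``TB`` variable, the log file names, and the limit of four GPUs are illustrative assumptions, not TransferBench features:
+
+.. code-block:: shell
+
+  # Illustrative wrapper; TB and the log names are assumptions.
+  TB="${TB:-./TransferBench}"
+
+  for dma in 0 1; do   # 0 = GFX executor, 1 = DMA executor
+    log="p2p_dma${dma}.log"
+    if [ -x "$TB" ]; then
+      # GPU pairs only, unidirectional, 4 GPUs, with per-iteration statistics
+      NUM_CPU_DEVICES=0 NUM_GPU_DEVICES=4 P2P_MODE=1 SHOW_ITERATIONS=1 \
+        USE_GPU_DMA="$dma" "$TB" p2p > "$log"
+    else
+      echo "skipped USE_GPU_DMA=$dma: $TB not found or not executable"
+    fi
+  done
+
+Setting the variables inline on the command line, as here, keeps them scoped to a single run, so one configuration does not leak into the next.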
+
+Example output
+---------------
+
+.. tab-set::
+
+  .. tab-item:: AMD Instinct MI300X
+
+    .. code-block:: shell
+
+      [P2P Related]
+      CPU_MEM_TYPE         =            0 : Using default CPU (0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned)
+      GPU_MEM_TYPE         =            0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed)
+      NUM_CPU_DEVICES      =            2 : Using 2 CPUs
+      NUM_CPU_SE           =            4 : Using 4 CPU threads per Transfer
+      NUM_GPU_DEVICES      =            8 : Using 8 GPUs
+      NUM_GPU_SE           =          304 : Using 304 GPU subexecutors/CUs per Transfer
+      P2P_MODE             =            0 : Running Uni + Bi transfers
+      USE_GPU_DMA          =            0 : Using GPU-GFX as GPU executor
+      USE_REMOTE_READ      =            0 : Using SRC as executor
+      Bytes Per Direction 268435456
+      Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
+      SRC+EXE\DST    CPU 00    CPU 01       GPU 00    GPU 01    GPU 02    GPU 03    GPU 04    GPU 05    GPU 06    GPU 07
+        CPU 00  ->     37.62     38.04        39.44     34.00     33.12     35.53     31.90     29.73     28.11     31.00
+        CPU 01  ->     37.84     37.69        29.92     29.85     31.19     29.63     38.99     38.41     38.32     39.56
+        GPU 00  ->     55.36     55.25      1618.87     48.83     48.89     49.00     48.05     47.94     48.27     47.85
+        GPU 01  ->     55.36     54.14        48.89   1860.47     48.95     48.95     47.91     48.04     48.49     48.32
+        GPU 02  ->     55.35     55.26        48.83     49.01   1868.43     49.07     48.70     48.34     48.85     48.97
+        GPU 03  ->     55.34     55.26        49.01     49.02     49.07   1877.42     48.51     48.17     48.85     49.04
+        GPU 04  ->     55.30     55.38        47.95     48.26     48.85     48.61   1849.65     48.99     48.85     48.84
+        GPU 05  ->     55.29     55.35        47.95     48.02     48.51     48.03     49.01   1853.87     49.15     49.01
+        GPU 06  ->     55.32     55.34        48.31     48.62     48.88     48.94     48.99     48.83   1829.05     49.17
+        GPU 07  ->     55.30     55.34        48.23     48.27     48.59     48.90     48.60     49.09     49.14   1841.42
+                                CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
+      Averages (During UniDir):     37.94     33.67     55.25     48.65
+      Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
+          SRC\DST    CPU 00    CPU 01       GPU 00    GPU 01    GPU 02    GPU 03    GPU 04    GPU 05    GPU 06    GPU 07
+        CPU 00  ->       N/A     33.59        33.54     36.84     33.35     35.02     29.55     31.33     31.30     28.09
+        CPU 00 <-        N/A     39.94        54.81     54.73     54.51     54.48     29.25     28.84     28.13     30.44
+        CPU 00 <->       N/A     73.52        88.35     91.57     87.86     89.51     58.80     60.17     59.43     58.53
+        CPU 01  ->     36.21       N/A        31.09     28.54     31.93     31.76     38.02     38.74     37.19     36.09
+        CPU 01 <-      33.60       N/A        28.85     28.27     27.93     28.54     54.85     54.80     54.68     54.70
+        CPU 01 <->     69.81       N/A        59.94     56.81     59.86     60.30     92.87     93.54     91.86     90.78
+        GPU 00  ->     54.77     29.18          N/A     46.15     46.10     46.55     46.16     46.05     46.31     45.95
+        GPU 00 <-      34.70     30.98          N/A     46.21     46.40     46.65     46.12     46.00     46.25     45.98
+        GPU 00 <->     89.47     60.15          N/A     92.36     92.50     93.20     92.27     92.05     92.56     91.93
+        GPU 01  ->     54.77     29.18        46.19       N/A     46.08     46.54     46.17     46.05     46.33     46.14
+        GPU 01 <-      32.11     30.59        46.11       N/A     46.64     46.42     46.16     46.09     46.51     46.20
+        GPU 01 <->     86.89     59.77        92.30       N/A     92.73     92.97     92.32     92.14     92.84     92.33
+        GPU 02  ->     54.76     29.56        46.40     46.63       N/A     46.62     46.49     46.16     46.41     46.09
+        GPU 02 <-      32.05     27.70        46.07     46.05       N/A     46.24     46.18     46.26     46.12     46.27
+        GPU 02 <->     86.81     57.25        92.47     92.68       N/A     92.86     92.67     92.42     92.53     92.37
+        GPU 03  ->     54.73     30.33        46.62     46.44     46.23       N/A     46.15     46.34     46.25     46.47
+        GPU 03 <-      33.13     29.77        46.50     46.52     46.61       N/A     46.17     46.22     46.23     46.46
+        GPU 03 <->     87.86     60.10        93.13     92.96     92.84       N/A     92.32     92.56     92.48     92.93
+        GPU 04  ->     29.91     54.85        46.18     46.20     46.21     46.17       N/A     46.56     46.23     46.50
+        GPU 04 <-      30.60     34.45        46.27     46.37     46.58     46.17       N/A     46.49     46.25     46.44
+        GPU 04 <->     60.52     89.30        92.45     92.57     92.78     92.34       N/A     93.05     92.49     92.93
+        GPU 05  ->     30.58     54.76        45.99     46.04     46.24     46.32     46.51       N/A     46.38     46.15
+        GPU 05 <-      26.98     35.95        46.00     46.01     46.18     46.38     46.56       N/A     46.26     46.20
+        GPU 05 <->     57.55     90.70        91.99     92.05     92.43     92.69     93.07       N/A     92.63     92.36
+        GPU 06  ->     30.22     54.65        46.34     46.40     46.13     46.24     46.26     46.33       N/A     46.43
+        GPU 06 <-      27.72     35.78        46.37     46.35     46.35     46.28     46.25     46.37       N/A     46.30
+        GPU 06 <->     57.94     90.44        92.72     92.75     92.48     92.52     92.51     92.70       N/A     92.73
+        GPU 07  ->     30.55     54.66        46.03     46.15     46.35     46.38     46.39     46.17     46.35       N/A
+        GPU 07 <-      27.28     36.17        46.05     46.11     46.12     46.45     46.48     46.15     46.41       N/A
+        GPU 07 <->     57.83     90.83        92.08     92.26     92.47     92.83     92.87     92.32     92.76       N/A
+                                CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
+      Averages (During  BiDir):     35.83     37.51     36.98     46.28
+
+  .. tab-item:: AMD Instinct MI350X
+
+    .. code-block:: shell
+
+      [P2P Related]
+      CPU_MEM_TYPE         =            0 : Using default CPU (0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned)
+      GPU_MEM_TYPE         =            0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed)
+      NUM_CPU_DEVICES      =            2 : Using 2 CPUs
+      NUM_CPU_SE           =            4 : Using 4 CPU threads per Transfer
+      NUM_GPU_DEVICES      =            8 : Using 8 GPUs
+      NUM_GPU_SE           =          256 : Using 256 GPU subexecutors/CUs per Transfer
+      P2P_MODE             =            0 : Running Uni + Bi transfers
+      USE_GPU_DMA          =            0 : Using GPU-GFX as GPU executor
+      USE_REMOTE_READ      =            0 : Using SRC as executor
+      Bytes Per Direction 268435456
+      Unidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
+      SRC+EXE\DST    CPU 00    CPU 01       GPU 00    GPU 01    GPU 02    GPU 03    GPU 04    GPU 05    GPU 06    GPU 07
+        CPU 00  ->     83.89     93.99        42.90     42.89     42.94     42.93     42.90     42.88     41.76     42.81
+        CPU 01  ->     91.09     83.25        42.77     42.84     42.27     42.88     42.79     42.91     42.83     42.79
+        GPU 00  ->     53.18     53.14      2285.12     57.51     57.46     57.38     57.33     57.32     57.28     57.64
+        GPU 01  ->     53.11     53.16        57.53   2280.83     57.36     57.30     57.32     57.33     57.48     57.44
+        GPU 02  ->     53.11     53.13        57.45     57.29   2286.68     57.36     57.58     57.53     57.38     57.35
+        GPU 03  ->     53.19     53.11        57.31     57.26     57.52   2281.59     57.52     57.47     57.33     57.38
+        GPU 04  ->     53.11     53.12        57.33     57.27     57.57     57.53   2292.99     57.51     57.36     57.36
+        GPU 05  ->     53.13     53.13        57.34     57.32     57.55     57.48     57.28   2276.23     57.42     57.50
+        GPU 06  ->     53.18     53.19        57.28     57.47     57.39     57.35     57.54     57.40   2305.57     57.49
+        GPU 07  ->     53.16     53.15        57.44     57.47     57.35     57.36     57.32     57.35     57.51   2289.74
+                                CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
+      Averages (During UniDir):     92.54     42.76     53.14     57.41
+      Bidirectional copy peak bandwidth GB/s [Local read / Remote write] (GPU-Executor: GFX)
+          SRC\DST    CPU 00    CPU 01       GPU 00    GPU 01    GPU 02    GPU 03    GPU 04    GPU 05    GPU 06    GPU 07
+        CPU 00  ->       N/A     79.71        42.40     42.40     42.39     42.51     42.05     41.14     42.22     42.45
+        CPU 00 <-        N/A     80.90        52.72     52.61     52.74     52.69     52.69     52.68     52.61     52.64
+        CPU 00 <->       N/A    160.62        95.11     95.01     95.13     95.20     94.75     93.82     94.83     95.09
+        CPU 01  ->     80.77       N/A        42.27     42.39     42.50     42.49     42.50     42.47     42.43     42.46
+        CPU 01 <-      79.50       N/A        52.68     52.60     52.69     52.66     52.68     52.65     52.68     52.68
+        CPU 01 <->    160.27       N/A        94.95     94.99     95.19     95.15     95.17     95.11     95.11     95.14
+        GPU 00  ->     52.72     52.61          N/A     54.77     54.78     54.61     54.66     54.58     54.51     54.85
+        GPU 00 <-      42.48     42.34          N/A     54.77     54.72     54.52     54.58     54.53     54.57     54.72
+        GPU 00 <->     95.20     94.95          N/A    109.54    109.51    109.13    109.23    109.11    109.08    109.57
+        GPU 01  ->     52.68     52.69        54.75       N/A     54.66     54.51     54.54     54.57     54.74     54.70
+        GPU 01 <-      42.43     42.40        54.84       N/A     54.46     54.55     54.45     54.61     54.82     54.79
+        GPU 01 <->     95.11     95.09       109.59       N/A    109.12    109.06    108.99    109.18    109.56    109.50
+        GPU 02  ->     52.72     52.59        54.80     54.52       N/A     54.62     54.87     54.86     54.64     54.53
+        GPU 02 <-      42.48     42.36        54.80     54.62       N/A     54.71     54.79     54.75     54.59     54.56
+        GPU 02 <->     95.20     94.94       109.60    109.15       N/A    109.33    109.66    109.61    109.23    109.09
+        GPU 03  ->     52.61     52.59        54.43     54.52     54.64       N/A     54.80     54.82     54.61     54.59
+        GPU 03 <-      42.49     42.38        54.63     54.53     54.63       N/A     54.79     54.73     54.47     54.49
+        GPU 03 <->     95.09     94.97       109.06    109.05    109.28       N/A    109.59    109.56    109.08    109.08
+        GPU 04  ->     52.69     52.59        54.56     54.50     54.74     54.76       N/A     54.75     54.57     54.64
+        GPU 04 <-      41.98     42.47        54.66     54.53     54.82     54.81       N/A     54.56     54.74     54.53
+        GPU 04 <->     94.67     95.06       109.22    109.03    109.56    109.57       N/A    109.31    109.31    109.17
+        GPU 05  ->     52.71     52.58        54.54     54.56     54.78     54.71     54.55       N/A     54.59     54.73
+        GPU 05 <-      42.33     42.36        54.59     54.58     54.85     54.83     54.74       N/A     54.50     54.68
+        GPU 05 <->     95.04     94.94       109.13    109.14    109.64    109.55    109.29       N/A    109.09    109.41
+        GPU 06  ->     52.64     52.70        54.56     54.82     54.63     54.53     54.61     54.59       N/A     54.82
+        GPU 06 <-      42.37     42.52        54.53     54.83     54.66     54.56     54.60     54.59       N/A     54.75
+        GPU 06 <->     95.02     95.22       109.10    109.65    109.28    109.09    109.21    109.18       N/A    109.57
+        GPU 07  ->     52.70     52.66        54.70     54.84     54.58     54.53     54.50     54.68     54.83       N/A
+        GPU 07 <-      42.16     42.45        54.88     54.72     54.55     54.63     54.61     54.73     54.73       N/A
+        GPU 07 <->     94.85     95.11       109.58    109.56    109.12    109.16    109.11    109.41    109.56       N/A
+                                CPU->CPU  CPU->GPU  GPU->CPU  GPU->GPU
+      Averages (During  BiDir):     80.22     47.49     47.51     54.66
+
+.. _scaling:
+
+Scaling preset (scaling)
+=========================
+
+The scaling preset runs a scaling test from one GPU to all other devices (CPUs and GPUs). It varies the number of subexecutors (CUs) from ``SWEEP_MIN`` to ``SWEEP_MAX`` and reports the bandwidth for each target device, which helps you find the optimal CU count per transfer.
+
+**Key features:**
+
+- Uses one GPU (``LOCAL_IDX``) as source.
+
+- Single transfer per target: Performs only one transfer at a time (one SRC to one DST) per cell.
+
+- Copies to each CPU NUMA node and every other GPU.
+
+- For each CU count (``SWEEP_MIN`` to ``SWEEP_MAX``), runs one transfer per target and reports bandwidth.
+
+- Prints a table: rows = CU count, columns = target device.
+
+- Reports best row: Shows peak bandwidth and optimal CU count per target.
+
+**Restrictions:**
+
+- Supports single node only: Multinode is not supported.
+
+- ``USE_FINE_GRAIN`` is deprecated: The preset returns an error if it is set. Use ``CPU_MEM_TYPE`` and ``GPU_MEM_TYPE`` instead.
+
+**Usage:**
+
+.. code-block:: shell
+
+  ./TransferBench scaling
+
+To run using GPU 2 as the SRC with a CU range from 4 to 64:
+
+.. code-block:: shell
+
+  LOCAL_IDX=2 SWEEP_MIN=4 SWEEP_MAX=64 ./TransferBench scaling
+
+Environment variables
+----------------------
+
+To modify the behavior of the scaling preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``CPU_MEM_TYPE``
+      - CPU memory type: 0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned. See :ref:`mem-type`.
+      - ``0``
+
+    * - ``GPU_MEM_TYPE``
+      - GPU memory type: 0=default, 1=fine-grained, 2=uncached, 3=managed. See :ref:`mem-type`.
+      - ``0``
+
+    * - ``LOCAL_IDX``
+      - Index of the GPU that performs the copies to the other devices.
+      - ``0``
+
+    * - ``NUM_CPU_DEVICES``
+      - Number of CPU NUMA nodes.
+      - (all detected)
+
+    * - ``NUM_GPU_DEVICES``
+      - Number of GPUs.
+      - (all detected)
+
+    * - ``SWEEP_MIN``
+      - Minimum subexecutors (CUs).
+      - ``1``
+
+    * - ``SWEEP_MAX``
+      - Maximum subexecutors (CUs).
+      - ``32``
+
+Example output
+---------------
+
+.. tab-set::
+
+  .. tab-item:: AMD Instinct MI300X
+
+    .. code-block:: shell
+
+      [Scaling Related]
+      CPU_MEM_TYPE         =            0 : Using default CPU (0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned)
+      GPU_MEM_TYPE         =            0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed)
+      LOCAL_IDX            =            0 : Local GPU index
+      NUM_CPU_DEVICES      =            2 : Using 2 CPUs
+      NUM_GPU_DEVICES      =            8 : Using 8 GPUs
+      SWEEP_MAX            =           32 : Max number of subExecutors to use
+      SWEEP_MIN            =            1 : Min number of subExecutors to use
+      GPU-GFX Scaling benchmark:
+      ==========================
+      - Copying 268435456 bytes from GPU 0 to other devices
+      - All numbers reported as GB/sec
+      NumCUs   CPU00        CPU01        GPU00        GPU01        GPU02        GPU03        GPU04        GPU05        GPU06        GPU07
+        1     20.22        20.41        18.68        25.91        26.08        26.06        25.95        26.04        26.01        26.02
+        2     37.37        37.03        36.88        48.65        48.33        49.24        47.91        47.72        48.48        47.13
+        3     52.96        51.92        55.74        48.92        48.43        49.22        47.45        47.47        48.11        47.88
+        4     56.38        53.18        73.05        49.41        49.19        49.34        47.84        47.79        48.46        48.05
+        5     54.61        52.96        91.57        45.23        44.60        49.03        47.73        44.32        48.57        44.11
+        6     54.61        53.78       109.70        48.98        48.82        49.13        47.84        47.46        48.38        48.19
+        7     56.48        54.27       127.43        49.15        49.14        49.15        48.00        47.85        48.50        48.05
+        8     56.60        54.71       142.35        49.14        49.37        49.31        47.86        47.98        48.39        48.13
+        9     56.66        54.93       161.43        49.13        49.58        49.16        47.77        47.92        48.44        48.18
+        10     56.83        55.31       178.08        49.17        49.33        49.07        47.79        47.99        48.33        48.23
+        11     56.84        55.63       195.82        49.36        49.43        49.10        47.56        47.96        48.54        48.37
+        12     57.12        55.83       210.39        49.50        48.97        49.43        47.97        47.73        48.63        48.27
+        13     56.91        55.65       226.79        49.52        48.86        49.22        47.63        47.92        48.60        48.16
+        14     57.10        55.83       238.49        49.26        49.42        49.13        48.08        48.18        48.44        48.18
+        15     57.09        55.86       258.25        49.23        49.19        49.42        47.74        47.96        48.68        48.11
+        16     57.11        55.98       271.55        49.62        49.25        49.54        47.84        47.75        48.39        47.93
+        17     57.10        55.82       287.98        49.10        49.36        49.35        47.64        47.97        48.81        48.28
+        18     57.10        55.81       306.06        49.33        49.14        49.34        47.81        47.99        48.47        48.05
+        19     56.94        55.69       319.71        49.20        49.14        49.32        48.13        47.93        48.61        48.30
+        20     57.14        55.88       334.89        49.35        49.25        49.22        48.19        47.97        48.62        48.24
+        21     57.12        55.94       346.59        49.13        49.23        49.19        48.24        47.84        48.52        48.16
+        22     57.13        56.01       362.42        49.34        49.39        49.09        47.95        48.00        48.53        48.20
+        23     57.13        56.17       375.70        49.10        49.22        49.43        47.98        48.14        48.58        48.46
+        24     57.14        56.23       388.97        49.25        49.24        49.52        47.72        48.06        48.67        48.31
+        25     57.14        56.30       403.32        49.04        49.20        49.42        48.05        48.01        48.51        47.97
+        26     57.14        56.17       417.88        49.57        49.59        49.57        47.89        48.04        48.79        48.34
+        27     57.12        56.02       426.76        49.32        49.24        49.29        48.14        48.01        48.50        48.04
+        28     57.13        56.05       444.58        49.31        49.37        49.12        48.00        47.96        48.44        47.99
+        29     57.14        56.07       453.05        49.55        49.40        49.56        48.16        47.78        48.18        48.17
+        30     57.14        56.12       462.74        49.11        49.27        49.33        47.97        48.20        48.63        48.26
+        31     57.13        56.12       478.60        49.35        48.96        49.06        47.94        48.33        48.43        48.35
+        32     57.15        56.35       493.17        49.23        49.55        49.33        47.77        48.28        48.56        48.22
+      Best    57.15( 32)   56.35( 32)  493.17( 32)   49.62( 16)   49.59( 26)   49.57( 26)   48.24( 21)   48.33( 31)   48.81( 17)   48.46( 23)
+
+  .. tab-item:: AMD Instinct MI350X
+
+    .. code-block:: shell
+
+      [Scaling Related]
+      CPU_MEM_TYPE         =            0 : Using default CPU (0=default, 1=coherent, 2=non-coherent, 3=uncached, 4=unpinned)
+      GPU_MEM_TYPE         =            0 : Using default GPU (0=default, 1=fine-grained, 2=uncached, 3=managed)
+      LOCAL_IDX            =            0 : Local GPU index
+      NUM_CPU_DEVICES      =            2 : Using 2 CPUs
+      NUM_GPU_DEVICES      =            8 : Using 8 GPUs
+      SWEEP_MAX            =           32 : Max number of subExecutors to use
+      SWEEP_MIN            =            1 : Min number of subExecutors to use
+      GPU-GFX Scaling benchmark:
+      ==========================
+      - Copying 268435456 bytes from GPU 0 to other devices
+      - All numbers reported as GB/sec
+      NumCUs   CPU00        CPU01        GPU00        GPU01        GPU02        GPU03        GPU04        GPU05        GPU06        GPU07
+        1     26.51        26.30        15.81        26.48        26.57        26.44        25.68        26.39        26.58        26.04
+        2     51.52        50.86        31.50        52.65        52.28        52.28        52.00        52.57        52.95        52.39
+        3     42.83        43.01        46.13        53.32        57.39        55.81        49.08        57.41        49.65        48.31
+        4     50.02        49.93        61.77        57.34        57.09        49.10        49.67        57.02        56.89        49.79
+        5     53.58        53.58        77.07        55.85        57.78        57.22        50.69        57.72        54.80        50.27
+        6     53.84        53.82        91.73        58.29        58.48        56.60        54.91        58.41        57.81        54.89
+        7     53.56        53.63       106.60        57.98        57.24        57.18        55.86        56.97        57.79        55.87
+        8     53.22        52.97       121.07        58.40        58.27        57.43        58.07        58.17        58.07        58.39
+        9     54.22        54.22       135.97        58.37        57.88        57.54        58.10        57.72        58.21        58.26
+        10     54.37        54.34       148.80        58.61        58.63        57.74        58.49        58.63        58.32        58.35
+        11     54.62        54.58       163.28        57.83        58.55        58.26        58.17        58.53        57.94        58.38
+        12     54.63        54.56       177.99        58.93        58.69        58.59        58.49        58.68        58.57        58.64
+        13     54.66        54.69       191.79        58.55        58.51        58.59        58.24        58.50        58.42        58.33
+        14     54.73        54.63       205.97        58.73        58.49        58.36        58.28        58.40        58.51        58.30
+        15     54.73        54.64       221.70        58.65        58.55        58.41        58.44        58.61        58.56        58.43
+        16     54.63        54.59       233.49        59.14        59.04        58.84        58.93        59.03        58.98        59.08
+        17     54.75        54.76       247.85        58.78        58.56        58.43        58.55        58.61        58.62        58.48
+        18     54.74        54.73       262.07        58.70        58.42        58.34        58.33        58.37        58.48        58.40
+        19     54.77        54.73       274.98        58.57        58.42        58.38        58.37        58.55        58.40        58.33
+        20     54.82        54.86       287.02        58.76        58.77        58.58        58.62        58.67        58.58        58.69
+        21     54.79        54.76       301.35        58.62        58.48        58.38        58.40        58.45        58.47        58.38
+        22     54.74        54.72       313.96        58.59        58.56        58.43        58.43        58.45        58.56        58.42
+        23     54.79        54.78       328.28        58.55        58.53        58.41        58.38        58.48        58.41        58.34
+        24     54.65        54.73       343.28        58.76        59.02        58.68        58.78        58.68        59.01        58.76
+        25     54.72        54.78       354.62        58.57        58.50        58.38        58.42        58.39        58.50        58.41
+        26     54.67        54.71       367.90        58.58        58.51        58.54        58.43        58.46        58.52        58.55
+        27     54.74        54.73       377.03        58.52        58.41        58.26        58.31        58.45        58.39        58.36
+        28     54.67        54.73       393.19        58.69        58.36        58.32        58.40        58.44        58.46        58.41
+        29     54.72        54.71       402.84        58.50        58.31        58.26        58.33        58.36        58.48        58.35
+        30     54.75        54.79       418.54        58.82        58.52        58.37        58.39        58.67        58.52        58.46
+        31     54.79        54.75       429.11        58.65        58.33        58.33        58.55        58.35        58.41        58.41
+        32     54.74        54.79       445.36        59.08        59.12        58.85        58.81        59.02        59.13        59.11
+      Best    54.82( 20)   54.86( 20)  445.36( 32)   59.14( 16)   59.12( 32)   58.85( 32)   58.93( 16)   59.03( 16)   59.13( 32)   59.11( 32)
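+
+The ``Best`` row can be reproduced from the per-CU rows by taking, for each target column, the highest bandwidth and the CU count that produced it. The following Python sketch illustrates this using the first three rows of the MI300X output for two of the targets. The ``best_per_target`` helper is hypothetical and not part of TransferBench, and keeping the lowest CU count on a tie is an assumption.
+
+.. code-block:: python
+
+  def best_per_target(rows):
+      """Return {target: (peak_bandwidth, cu_count)} from {cu_count: {target: bandwidth}}."""
+      best = {}
+      for cu_count in sorted(rows):
+          for target, bandwidth in rows[cu_count].items():
+              # Strict '>' keeps the lowest CU count on a tie (assumption).
+              if target not in best or bandwidth > best[target][0]:
+                  best[target] = (bandwidth, cu_count)
+      return best
+
+  # First three rows of the MI300X example, CPU00 and GPU01 columns only.
+  rows = {
+      1: {"CPU00": 20.22, "GPU01": 25.91},
+      2: {"CPU00": 37.37, "GPU01": 48.65},
+      3: {"CPU00": 52.96, "GPU01": 48.92},
+  }
+  for target, (bandwidth, cu_count) in best_per_target(rows).items():
+      print(f"{target}: {bandwidth:.2f} GB/s at {cu_count} CUs")
+
+Because the GPU-to-GPU columns flatten out after a few CUs, the peak there often differs from neighboring rows by only a small margin, so the reported optimal CU count is best read as approximate.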
+
+.. _schmoo:
+
+Schmoo preset (schmoo)
+=======================
+
+The schmoo preset runs scaling tests for local and remote read, write, and copy operations between two GPUs. For each CU count (``SWEEP_MIN`` to ``SWEEP_MAX``), it measures six bandwidth values: Local Read, Local Write, Local Copy, Remote Read, Remote Write, and Remote Copy.
+
+**Key features:**
+
+- Minimum 2 GPUs: Requires at least two GPUs, ``LOCAL_IDX`` (local) and ``REMOTE_IDX`` (remote).
+
+- Fixed topology: Always two GPUs (local and remote). No sweep over device count.
+
+- For each CU count, runs the following six tests. Each test measures bandwidth for the corresponding operation pattern:
+
+  - Local Read: Local GPU reads from local memory (SRC->G->null).
+
+  - Local Write: Local GPU writes to local memory (null->G->DST).
+
+  - Local Copy: Local GPU copies (local->local).
+
+  - Remote Read: Local GPU reads from remote memory.
+
+  - Remote Write: Local GPU writes to remote memory.
+
+  - Remote Copy: Local GPU copies (local->remote).
+
+- Outputs a table: rows = CU count, columns = the six operation types.
+
+- Supports single node only: Multinode is not supported.
+
+**Usage:**
+
+.. code-block:: shell
+
+  ./TransferBench schmoo
+
+To run using GPUs 0 and 3 with a CU range from 4 to 32:
+
+.. code-block:: shell
+
+  LOCAL_IDX=0 REMOTE_IDX=3 SWEEP_MIN=4 SWEEP_MAX=32 ./TransferBench schmoo
+
+To run using fine-grained memory:
+
+.. code-block:: shell
+
+  USE_FINE_GRAIN=1 ./TransferBench schmoo
+
+Environment variables
+----------------------
+
+To modify the behavior of the schmoo preset, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``LOCAL_IDX``
+      - Local GPU index.
+      - ``0``
+
+    * - ``REMOTE_IDX``
+      - Remote GPU index.
+      - ``1``
+
+    * - ``SWEEP_MIN``
+      - Minimum CUs.
+      - ``1``
+
+    * - ``SWEEP_MAX``
+      - Maximum CUs.
+      - ``32``
+
+    * - ``USE_FINE_GRAIN``
+      - To use fine-grained GPU memory, set to ``1``. For coarse-grained memory, set to ``0``.
+      - ``0``
+
+Example output
+---------------
+
+.. tab-set::
+
+  .. tab-item:: AMD Instinct MI300X
+
+    .. image:: /data/schmoo_MI300X.png
+      :width: 100%
+      :align: center
+
+  .. tab-item:: AMD Instinct MI350X
+
+    .. image:: /data/schmoo_MI350X.png
+      :width: 100%
+      :align: center
+
+.. _sweep:
+
+Sweep (sweep) and random sweep preset (rsweep)
+===============================================
+
+The sweep preset performs an ordered sweep through sets of transfers. It systematically tests combinations of (SRC, executor, DST) with varying parallelism (from ``SWEEP_MIN`` simultaneous transfers up to ``SWEEP_MAX``) using lexicographic permutation order. The rsweep preset is similar to the sweep preset, but runs in a random order.
+
+.. note::
+
+  These presets are primarily used for stress testing.
+
+**Key features:**
+
+- Possible set: Builds the set of all possible (SRC, EXE, DST) triplets from ``SWEEP_SRC``, ``SWEEP_EXE``, ``SWEEP_DST``, and the device counts as a Cartesian product (``srcList`` x ``exeList`` x ``dstList``), then applies filters such as the XGMI hop count and the CPU-on-GPU skip on NVIDIA.
+
+- Optionally filters using XGMI hop count (``SWEEP_XGMI_MIN``, ``SWEEP_XGMI_MAX``).
+
+- M increment: Selects M transfers for each test. M starts at ``SWEEP_MIN`` and is incremented each time all M-combinations are exhausted, until M exceeds ``SWEEP_MAX`` (``SWEEP_MAX`` = 0 means no limit).
+
+- Ordered permutation: Uses ``std::prev_permutation`` to iterate through M-combinations of the possible transfer set in a deterministic order.
+
+- Log format: Logs each test's transfers to ``SWEEP_FILE``. The ``SWEEP_FILE`` contains lines such as "# Test N" and "-M (src->exe->dst CUs bytes)...".
+
+- Limits: Honors ``SWEEP_TEST_LIMIT`` (maximum number of tests) and ``SWEEP_TIME_LIMIT`` (maximum duration in seconds).
+
+- Default executors: ``SWEEP_EXE`` defaults to ``CDG``, which includes the CPU, DMA, and GFX executors for broad coverage.
+
+- Supports single node only: Multinode is not supported.
+
+.. note::
+
+  Running the default sweep might never finish executing, especially on configurations with a large number of devices. It is highly recommended to set either ``SWEEP_TEST_LIMIT`` or ``SWEEP_TIME_LIMIT``.
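+
+To see why the default sweep can run for a very long time, the size of the candidate set and the number of M-combinations can be estimated. The following Python sketch is illustrative only. The ``build_triplets`` helper and the device labels are hypothetical, no filters are applied, and it assumes one CPU executor per NUMA node and one DMA and one GFX executor per GPU. It uses ``itertools.combinations`` in place of the ``std::prev_permutation`` iteration, so the actual test order might differ.
+
+.. code-block:: python
+
+  import itertools
+  import math
+
+  def build_triplets(num_cpus, num_gpus, src="CG", exe="CDG", dst="CG"):
+      """Cartesian product of (SRC, EXE, DST) labels, with no filters applied."""
+      def devices(kinds):
+          labels = []
+          for kind in kinds:
+              if kind == "N":
+                  labels.append("N")  # Null memory has no device index (assumption).
+                  continue
+              # "C" entries index CPU NUMA nodes; "G" and "D" entries index GPUs (assumption).
+              count = num_cpus if kind == "C" else num_gpus
+              labels += [f"{kind}{i}" for i in range(count)]
+          return labels
+      return list(itertools.product(devices(src), devices(exe), devices(dst)))
+
+  # Two CPU NUMA nodes and eight GPUs, as in the example outputs.
+  triplets = build_triplets(num_cpus=2, num_gpus=8)
+  print(len(triplets), "candidate transfers before filtering")
+
+  # Number of distinct tests that select M simultaneous transfers.
+  for m in (1, 2, 3):
+      print(f"M={m}: {math.comb(len(triplets), m)} combinations")
+
+  # Enumerate the first few 2-combinations of a small GPU-only set.
+  small = build_triplets(num_cpus=0, num_gpus=2, src="G", exe="G", dst="G")
+  for test in itertools.islice(itertools.combinations(small, 2), 3):
+      print(" + ".join(f"{s}->{e}->{d}" for s, e, d in test))
+
+Under these assumptions, the number of combinations grows very quickly with M, which is why a test or time limit is recommended.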
+
+**Usage:**
+
+.. code-block:: shell
+
+  ./TransferBench sweep
+
+To limit memory and executors to GPUs only, require at least one XGMI hop, and run at most 16 simultaneous transfers:
+
+.. code-block:: shell
+
+  SWEEP_SRC=G SWEEP_DST=G SWEEP_EXE=G SWEEP_XGMI_MIN=1 SWEEP_MAX=16 ./TransferBench sweep
+
+To limit the run to one hour (3600 seconds) and save the sweep configuration to a custom file:
+
+.. code-block:: shell
+
+  SWEEP_TIME_LIMIT=3600 SWEEP_FILE=/tmp/mySweep.cfg ./TransferBench sweep
+
+Environment variables
+----------------------
+
+To modify the behavior of the sweep and rsweep presets, use the following environment variables:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Environment variable
+      - Description
+      - Default value
+
+    * - ``CONTINUE_ON_ERROR``
+      - To continue despite a validation error, set to ``1``. To stop after the first error, set to ``0``.
+      - ``0``
+
+    * - ``NUM_CPU_DEVICES``
+      - Number of CPU NUMA nodes.
+      - (all detected)
+
+    * - ``NUM_CPU_SE``
+      - CPU threads per CPU-executed transfer.
+      - ``4``
+
+    * - ``NUM_GPU_DEVICES``
+      - Number of GPUs.
+      - (all detected)
+
+    * - ``NUM_GPU_SE``
+      - CUs per GPU-executed transfer.
+      - ``4``
+
+    * - ``SWEEP_SRC``
+      - Source memory types: C=CPU, G=GPU, N=Null.
+      - ``CG``
+
+    * - ``SWEEP_DST``
+      - Destination memory types.
+      - ``CG``
+
+    * - ``SWEEP_EXE``
+      - Executor types: C=CPU, D=DMA, G=GFX.
+      - ``CDG``
+
+    * - ``SWEEP_FILE``
+      - File where sweep configuration is saved.
+      - ``/tmp/lastSweep.cfg``
+
+    * - ``SWEEP_MIN``
+      - Minimum simultaneous transfers.
+      - ``1``
+
+    * - ``SWEEP_MAX``
+      - Maximum simultaneous transfers (0=no limit).
+      - ``24``
+
+    * - ``SWEEP_RAND_BYTES``
+      - To use random transfer size, set to ``1``. For constant, set to ``0``.
+      - ``0``
+
+    * - ``SWEEP_SEED``
+      - Random seed. Used for rsweep or ``SWEEP_RAND_BYTES``.
+      - time(NULL)
+
+    * - ``SWEEP_TEST_LIMIT``
+      - Maximum number of tests allowed to run. ``0`` = no limit.
+      - ``0``
+
+    * - ``SWEEP_TIME_LIMIT``
+      - Maximum allowed test duration (in seconds). ``0`` = no limit.
+      - ``0``
+
+    * - ``SWEEP_XGMI_MIN``
+      - Minimum XGMI hops for transfers.
+      - ``0``
+
+    * - ``SWEEP_XGMI_MAX``
+      - Maximum allowed XGMI hops. ``-1`` = no limit.
+      - ``-1``
+
+Example output
+---------------
+
+.. code-block:: shell
+
+  [Sweep Related]
+  CONTINUE_ON_ERROR    =            0 : Stop after first error
+  NUM_CPU_DEVICES      =            2 : Using 2 CPUs
+  NUM_CPU_SE           =            4 : Using 4 CPU threads per CPU executed Transfer
+  NUM_GPU_DEVICES      =            8 : Using 8 GPUs
+  NUM_GPU_SE           =            4 : Using 4 subExecutors/CUs per GPU executed Transfer
+  SWEEP_DST            =           CG : Destination Memory Types to sweep
+  SWEEP_EXE            =          CDG : Executor Types to sweep
+  SWEEP_FILE           = /tmp/lastSweep.cfg : File to store the executing sweep configuration
+  SWEEP_MAX            =           24 : Max simultaneous transfers (0 = no limit)
+  SWEEP_MIN            =            1 : Min simultaenous transfers
+  SWEEP_RAND_BYTES     =            0 : Using constant number of bytes per Transfer
+  SWEEP_SEED           =   1773692223 : Random seed set to 1773692223
+  SWEEP_SRC            =           CG : Source Memory Types to sweep
+  SWEEP_TEST_LIMIT     =            0 : Max number of tests to run during sweep (0 = no limit)
+  SWEEP_TIME_LIMIT     =            0 : Max number of seconds to run sweep for  (0 = no limit)
+  SWEEP_XGMI_MAX       =           -1 : Max number of XGMI hops for Transfers  (-1 = no limit)
+  SWEEP_XGMI_MIN       =            0 : Min number of XGMI hops for Transfers
+
+  Sweep configuration saved to: /tmp/lastSweep.cfg
+  Test 1:
+  -------------------┬--------------┬------------┬-------------------┬--------------------
+    Executor: CPU 00 │  30.660 GB/s │   8.755 ms │   268435456 bytes │  30.847 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 0    │  30.847 GB/s │   8.702 ms │   268435456 bytes │ C1 -> C0:4 -> G6
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: GPU 01 │  38.662 GB/s │   6.943 ms │   268435456 bytes │  38.669 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 8    │  38.669 GB/s │   6.942 ms │   268435456 bytes │ G2 -> G1:4 -> G1
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: GPU 02 │  61.598 GB/s │   4.358 ms │   268435456 bytes │  61.615 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 9    │  61.615 GB/s │   4.357 ms │   268435456 bytes │ G2 -> G2:4 -> G0
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: GPU 03 │  38.816 GB/s │   6.916 ms │   268435456 bytes │  38.826 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 10   │  38.826 GB/s │   6.914 ms │   268435456 bytes │ G2 -> G3:4 -> G7
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: GPU 06 │  44.298 GB/s │  12.120 ms │   536870912 bytes │  58.182 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 11   │  22.151 GB/s │  12.118 ms │   268435456 bytes │ G1 -> G6:4 -> C1
+       Transfer 12   │  36.030 GB/s │   7.450 ms │   268435456 bytes │ G2 -> G6:4 -> G5
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: GPU 07 │  37.963 GB/s │   7.071 ms │   268435456 bytes │  37.969 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 13   │  37.969 GB/s │   7.070 ms │   268435456 bytes │ G4 -> G7:4 -> G6
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: DMA 01 │  43.428 GB/s │  12.362 ms │   536870912 bytes │  77.585 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 1    │  55.481 GB/s │   4.838 ms │   268435456 bytes │ C0 -> D1:4 -> G0
+       Transfer 2    │  22.105 GB/s │  12.144 ms │   268435456 bytes │ G7 -> D1:4 -> C1
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: DMA 03 │  31.427 GB/s │   8.541 ms │   268435456 bytes │  32.353 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 3    │  32.353 GB/s │   8.297 ms │   268435456 bytes │ G4 -> D3:4 -> G6
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: DMA 04 │  22.214 GB/s │  12.084 ms │   268435456 bytes │  22.536 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 4    │  22.536 GB/s │  11.912 ms │   268435456 bytes │ C1 -> D4:4 -> G1
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: DMA 06 │  53.665 GB/s │  10.004 ms │   536870912 bytes │  72.749 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 5    │  27.768 GB/s │   9.667 ms │   268435456 bytes │ G2 -> D6:4 -> C0
+       Transfer 6    │  44.981 GB/s │   5.968 ms │   268435456 bytes │ G3 -> D6:4 -> G2
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+    Executor: DMA 07 │  57.440 GB/s │   4.673 ms │   268435456 bytes │  60.131 GB/s (sum)
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+       Transfer 7    │  60.131 GB/s │   4.464 ms │   268435456 bytes │ G7 -> D7:4 -> G5
+  -------------------┼--------------┼------------┼-------------------┼--------------------
+     Aggregate (CPU) │ 295.108 GB/s │  12.735 ms │  3758096384 bytes │ Overhead 0.372 ms
+  -------------------┴--------------┴------------┴-------------------┴--------------------
+
+The exact format depends on ``OUTPUT_TO_CSV`` and ``PrintResults``. The output typically shows the test number, transfer count, bandwidth, and timing for each test.
diff --git a/docs/reference/transfer-definition-syntax.rst b/docs/reference/transfer-definition-syntax.rst
new file mode 100644
index 00000000..694f6ab1
--- /dev/null
+++ b/docs/reference/transfer-definition-syntax.rst
@@ -0,0 +1,267 @@
+.. meta::
+  :description: Reference for TransferBench transfer definition syntax, including simple and advanced modes, memory and executor letter codes, and wildcard syntax.
+  :keywords: TransferBench transfer definition, TransferBench syntax, TransferBench memory types, TransferBench executor types, TransferBench wildcards
+
+.. _transfer-definition-syntax:
+
+Transfer definition syntax
+==========================
+
+A transfer is a single operation in which an executor reads values from the source (SRC) memory locations, sums them, and writes the result to the destination (DST) memory locations. When a transfer has a single SRC and a single DST, it is a copy operation.
+
+TransferBench supports two modes for defining transfers: simple mode and advanced mode.
+
+Simple mode
+-----------
+
+Use simple mode when all transfers in a test share the same number of subexecutors. The format is:
+
+.. code-block:: shell
+
+    #Transfers #SEs (srcMem1 Executor1 dstMem1) ... (srcMemL ExecutorL dstMemL)
+
+The format uses the following fields:
+
+- ``#Transfers``: A positive integer that specifies the number of parallel transfers.
+- ``#SEs``: The number of subexecutors (CUs, threads, or queue pairs) used by all transfers.
+- ``(srcMem Executor dstMem)``: A triplet that describes one transfer.
+
+The transfer size comes from the ``num_bytes`` command-line argument.
+
+The following examples show valid simple mode definitions:
+
+.. code-block:: shell
+
+    1 4 (G0->G0->G1)                  # Uses 4 CUs on GPU 0 to copy from GPU 0 to GPU 1
+    2 4 (G0 G0 G1) (G1 G1 G0)         # Two parallel transfers: GPU 0 to GPU 1 and GPU 1 to GPU 0
+    1 4 (G0->G0->G0G1G2G3)            # Reads GPU 0 then writes to GPU 0, GPU 1, GPU 2, and GPU 3
+
+Advanced mode
+-------------
+
+Use advanced mode when transfers in a test require different subexecutor counts or sizes. The format is:
+
+.. code-block:: shell
+
+    -#Transfers (srcMem1 Executor1 dstMem1 #SEs1 Bytes1) ... (srcMemL ExecutorL dstMemL #SEsL BytesL)
+
+The format uses the following fields:
+
+- ``-#Transfers``: A negative integer that specifies the number of parallel transfers.
+- ``(srcMem Executor dstMem #SEs Bytes)``: A quintuplet that describes one transfer.
+- ``Bytes``: The per-transfer size. Set to ``0`` to use the command-line ``num_bytes`` value. You can suffix this value with ``K``, ``M``, or ``G``. A non-zero value overrides the command-line value for that transfer only.
+
+The following example runs two transfers with different sizes and subexecutor counts:
+
+.. code-block:: shell
+
+    -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)   # 1 MB GPU 0 to GPU 1 with 4 CUs; 2 MB GPU 1 to GPU 0 with 2 CUs
+
+Memory and executor letter codes
+==================================
+
+Memory locations and executors are each written as a letter that identifies the memory or executor type, followed by an index. The following tables list the supported letters.
+
+Memory location letters
+------------------------
+
+Memory locations use the format ``[R<rank>]<MemType><Index>``, where ``MemType`` is a single letter and ``Index`` is the zero-based device index. If you omit the rank prefix ``R``, TransferBench uses the local rank.
+
+The following table lists the supported memory type letters:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Letter
+      - Memory type
+      - Indexed by
+
+    * - ``C``
+      - Coarse-grained pinned host
+      - NUMA node (0 to #NUMA - 1)
+
+    * - ``P``
+      - Pinned host (closest to GPU)
+      - GPU index (0 to #GPUs - 1)
+
+    * - ``B``
+      - Coherent pinned host
+      - NUMA node
+
+    * - ``D``
+      - Non-coherent pinned host
+      - NUMA node
+
+    * - ``K``
+      - Uncached pinned host
+      - NUMA node
+
+    * - ``H``
+      - Unpinned host
+      - NUMA node
+
+    * - ``G``
+      - Coarse-grained global device
+      - GPU (0 to #GPUs - 1)
+
+    * - ``F``
+      - Fine-grained device
+      - GPU
+
+    * - ``U``
+      - Uncached device
+      - GPU
+
+    * - ``M``
+      - Managed memory
+      - GPU
+
+    * - ``N``
+      - Null (empty). Use this to denote read-only or write-only transfers.
+      - Index ignored.
+
+You can concatenate multiple memory locations for broadcast or reduce operations. For example, ``G0G1`` refers to GPU 0 and GPU 1.
+
+Executor letters
+-----------------
+
+Executors use the format ``[R<rank>]<ExeType><Index>[Slot][.<SubIndex>][SubSlot]``.
+
+The following table lists the supported executor type letters:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Letter
+      - Executor type
+      - Subexecutor
+      - Indexed by
+
+    * - ``C``
+      - CPU
+      - CPU thread
+      - NUMA node (0 to #NUMA - 1)
+
+    * - ``G``
+      - GPU (GFX/kernel)
+      - Threadblock/CU
+      - GPU (0 to #GPUs - 1)
+
+    * - ``D``
+      - DMA (SDMA)
+      - N/A (streams)
+      - GPU
+
+    * - ``I``
+      - NIC RDMA
+      - Queue pair
+      - NIC index. Requires ``.SubIndex`` (for example, ``I0.2``).
+
+    * - ``N``
+      - Nearest NIC
+      - Queue pair
+      - GPU. Uses the closest NIC for SRC and DST. SubIndex is optional.
+
+The optional executor fields have the following meanings:
+
+- ``Slot`` (``A``, ``B``, ...): Selects which of the closest NICs to use, where ``A`` is the first, ``B`` is the second, and so on. Used for ``EXE_NIC_NEAREST``.
+- ``SubIndex`` (after ``.``): Specifies the queue pair or sub-index for the NIC.
+- ``SubSlot``: Specifies the alpha range for sub-slot selection.
+
+Wildcards
+==========
+
+Wildcards expand a single configuration file line into multiple concrete transfers at runtime.
+
+Numeric wildcards
+------------------
+
+Numeric wildcards apply to ranks (``R``), device indices (for example, ``G0`` or ``C1``), and executor indices.
+
+The following table describes the supported numeric wildcard syntax:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Syntax
+      - Meaning
+      - Example
+
+    * - ``*``
+      - All values in range
+      - ``R*G0``: GPU 0 on all ranks
+
+    * - ``[n]``
+      - Single value
+      - ``R[2]G0``: GPU 0 on rank 2
+
+    * - ``[n,m,...]``
+      - Comma-separated list
+      - ``G[0,2,4]``: GPUs 0, 2, and 4
+
+    * - ``[start..end]``
+      - Inclusive range
+      - ``G[1..3]``: GPUs 1, 2, and 3
+
+The following list shows additional examples:
+
+- ``G*``: All GPUs (for example, G0, G1, G2, ...)
+- ``G[0,2,4]``: GPUs 0, 2, and 4
+- ``R[1..3]G0``: GPU 0 on ranks 1, 2, and 3
+
+Alpha wildcards
+----------------
+
+Alpha wildcards apply to executor slots and subslots (``A``, ``B``, ``C``, ...).
+
+The following table describes the supported alpha wildcard syntax:
+
+.. list-table::
+    :header-rows: 1
+
+    * - Syntax
+      - Meaning
+      - Example
+
+    * - ``*``
+      - Full wildcard, resolved at runtime
+      - All available slots
+
+    * - ``[A]`` or ``A``
+      - Single letter
+      - First closest NIC
+
+    * - ``[A..C]``
+      - Letter range
+      - Slots A, B, and C
+
+    * - ``[A,D,F]``
+      - Comma-separated letters
+      - Slots A, D, and F
+
+Rank prefix (multinode)
+-------------------------
+
+The ``R`` prefix is optional and specifies the rank index for a memory location or executor:
+
+- ``R2G3``: GPU 3 on rank 2.
+- ``G3`` (no ``R``): GPU 3 on the local rank.
+
+The rank prefix accepts the same numeric wildcards as device indices, such as ``R*`` and ``R[0..1]``.
+
+Nearest NIC wildcard
+---------------------
+
+For the ``N`` (Nearest NIC) executor, you can omit the executor index and sub-index. TransferBench resolves the correct NICs for the SRC and DST memory locations at runtime.
+
+For example, consider the following definition:
+
+.. code-block:: shell
+
+  (R0G0 N R2G4)
+
+This definition expands to:
+
+.. code-block:: shell
+
+  (R0G0 N0.4 R2G4)
diff --git a/docs/sphinx/_toc.yml.in b/docs/sphinx/_toc.yml.in
index 3f09d2e5..4a0c1e16 100644
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
@@ -5,16 +5,24 @@ subtrees:
 - caption: Install
   entries:
   - file: install/install.rst
-    title: Installation
 
-- caption: API reference
+- caption: How to
   entries:
-  - file: reference/api.rst
-    title: API library
+  - file: how to/running-transferbench-customized.rst
+    title: Run custom tests using TransferBench
 
-- caption: How to
+- caption: Conceptual
+  entries:
+  - file: conceptual/transferbench-workflow.rst
+  - file: conceptual/transferbench-timing.rst
+  - file: conceptual/transferbench-data-validation.rst
+
+- caption: Reference
   entries:
-  - file: how to/use-transferbench.rst
+  - file: reference/presets.rst
+  - file: reference/transfer-definition-syntax.rst
+  - file: reference/environment-variables.rst
+  - file: reference/faq.rst
 
 - caption: About
   entries:

Category	Environment variable	Description	Default value
Paths and compilers - To customize which compiler to use or the library to link against.	`ROCM_PATH`	ROCm installation path for HIP compiler, includes, and libs.	`/opt/rocm`
	`CUDA_PATH`	CUDA installation path for NVCC when building `TransferBenchCuda`.	`/usr/local/cuda`
	`MPI_PATH`	MPI installation path (for `mpi.h` and MPI libraries).	`/usr/local/openmpi`
	`HIPCC`	HIP compiler. Falls back to `hipcc` if not found.	`$(ROCM_PATH)/bin/amdclang++`
	`NVCC`	NVIDIA CUDA compiler (for building `TransferBenchCuda`)	`$(CUDA_PATH)/bin/nvcc`
	`ROCM_DEVICE_LIB_PATH`	Path to `amdgcn` bitcode. Auto-detected from the ROCm layout.	`(auto)`
	`HIPCONFIG`	Path to `hipconfig`, which is used to query the HIP version (for pod communication support check).	`hipconfig`
Feature flags - To control enabling features that require compile-time support. By default, these are enabled under the right conditions.	`DISABLE_NIC_EXEC`	Disables NIC executor support.	`0`
	`DISABLE_DMA_BUF`	Disables `DMA-BUF` for GPU Direct RDMA. Requires NIC executor support.	`1`
	`DISABLE_MPI_COMM`	Disables MPI communication backend support for multinode TransferBench.	`0`
	`DISABLE_AMD_SMI`	Disables AMD-SMI pod membership checks.	`0`
	`DISABLE_POD_COMM`	Disables pod communication support (UALoE / MNNVL).	`0`
Build options	`SINGLE_KERNEL`	To compile with a single GFX kernel (faster build, but fewer kernel variants), set to 1. Used mostly for development and debug.	`0`
	`GPU_TARGETS`	Comma-separated GPU architecture targets such as gfx942, gfx950.	`native`
	`DEBUG`	Set to 1 to build in debug mode with debug symbols (-O0, -g). Otherwise, builds in release mode (-O3).	`0`
Category	Environment variable	Description	Default value
Paths and compilers - To customize which compiler to use or the library to link against.	`ROCM_PATH`	ROCm installation path.	`/opt/rocm`
	`CMAKE_TOOLCHAIN_FILE`	Toolchain file. Uses ROCM_PATH and CXX to select compiler.	`toolchain-linux.cmake`
	`CXX`	C++ compiler. If not set, `amdclang++` or `hipcc` is used.	Taken from the toolchain
	`MPI_PATH`	Path to MPI installation. Takes priority over `find_package(MPI)`.
Build options (ON/OFF) - Pass -DVAR=value to set	`BUILD_LOCAL_GPU_TARGET_ONLY`	Builds only for the GPUs detected on the given machine using `rocm_agent_enumerator`.	`OFF`
	`ENABLE_NIC_EXEC`	Enables RDMA NIC executor.	`OFF`
	`ENABLE_MPI_COMM`	Enables MPI communicator as backbone for multinode TransferBench.	`OFF`
	`ENABLE_DMA_BUF`	Enables DMA-BUF for GPU Direct RDMA (requires NIC).	`OFF`
	`ENABLE_AMD_SMI`	Enables AMD-SMI pod membership queries.	`OFF`
	`ENABLE_POD_COMM`	Enables pod communication (HIP >= 8.0).	`OFF`
CMake cache variables	`GPU_TARGETS`	Semicolon-separated GPU architectures. Overridden if `BUILD_LOCAL_GPU_TARGET_ONLY` is `ON`	`gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201;gfx1250`
	`AMD_SMI_EXECUTABLE`	Path to `amd-smi` for AMD-SMI version check.	`amd-smi`
	`HIPCONFIG_EXECUTABLE`	Path to `hipconfig` for HIP version or pod check.	`hipconfig`
Instruction order	Unroll 1	Unroll 2	Unroll 4
1	READ [A]	READ [A]	READ [A]
2	WRITE [A]	READ [B]	READ [B]
3	READ [B]	WRITE [A]	READ [C]
4	WRITE [B]	WRITE [B]	READ [D]
5	READ [C]	READ [C]	WRITE [A]
6	WRITE [C]	READ [D]	WRITE [B]
7	READ [D]	WRITE [C]	WRITE [C]
8	WRITE [D]	WRITE [D]	WRITE [D]
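
The ordering in the table above follows a simple pattern: for an unroll factor U, the reads for U consecutive items are issued back to back, followed by the matching writes. The following Python sketch reproduces the three columns. The function name `instruction_order` is hypothetical and is not part of TransferBench; it only illustrates the ordering and is not the kernel implementation.

```python
def instruction_order(items, unroll):
    """Return the READ/WRITE sequence for the given items and unroll factor.

    Illustrative only: items are processed in batches of `unroll`,
    with every read in a batch issued before any write in that batch.
    """
    order = []
    for start in range(0, len(items), unroll):
        batch = items[start:start + unroll]
        order.extend(f"READ [{name}]" for name in batch)
        order.extend(f"WRITE [{name}]" for name in batch)
    return order


if __name__ == "__main__":
    items = ["A", "B", "C", "D"]
    for unroll in (1, 2, 4):
        print(f"Unroll {unroll}:", ", ".join(instruction_order(items, unroll)))
```

With an unroll factor of 1 the reads and writes alternate, while with an unroll factor of 4 all four reads are issued before the first write.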