Ctrls performance #739 (Draft)

JPRichings wants to merge 3 commits into QuEST-Kit:devel from JPRichings:ctrls_performance

Conversation

@JPRichings
Contributor

Initial pass on performance: changes to remove the thrust::device_vector used to move ctrls to the device, which causes a performance impact because the device_vector copies data from host to device on construction.

@JPRichings
Contributor Author

Todo:

  • Confirm no horrible race condition is introduced by cudaMemcpyToSymbol
  • Set the ctrls buffer size programmatically, or to some sensible limit, say 50? (There is at least 64 KB of constant memory on the device, so no worries about giving ourselves some extra room.)
  • Apply fix to all kernels in QuEST

Assumptions in this code:

  • A single operation on a quantum register at a time, which allows ctrls to be overwritten before each subsequent gate application.
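For reference, the constant-memory approach under discussion can be sketched roughly as below. This is an illustrative sketch only: the buffer name `ctrl_device`, the 64-entry capacity, and the `uploadCtrls` helper are assumptions, not QuEST's actual code.

```cuda
// Sketch only: names and sizes are assumptions, not QuEST's actual API.
// A fixed __constant__ buffer holds the ctrls for the next kernel launch,
// refilled via cudaMemcpyToSymbol before each gate application.
__device__ __constant__ int ctrl_device[64];

void uploadCtrls(const int* ctrls, int numCtrls) {
    // Host-to-device copy into constant memory; this avoids constructing a
    // thrust::device_vector per gate, but is still one copy per gate, and
    // global state shared by all kernels (hence the race-condition TODO).
    cudaMemcpyToSymbol(ctrl_device, ctrls, numCtrls * sizeof(int));
}
```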

@otbrown
Collaborator

otbrown commented May 3, 2026

Alloc size: 64 qubits. Add as a macro (e.g. MAX_QUREG_SIZE) somewhere, if not done already.

Comment thread on quest/src/gpu/gpu_kernels.cuh (outdated):

    const int NUM_THREADS_PER_BLOCK = 128;

    __device__ __constant__ int ctrl_device[30];
Member

TODO: use a MAX_NUM_QUBITS = 64 or something constant in constants.hpp

@TysonRayJones
Member

Reminder of other stuff from meeting:

  • storing targets also in constant mem
  • retain passing ptrs to kernels; dispatcher passes constant mem ptr

@JPRichings
Contributor Author

JPRichings commented May 5, 2026

Way easier to do this instead:

Beginning with CUDA 12.1, you can now pass up to 32,764 bytes as kernel parameters on NVIDIA Volta and above

#define TOTAL_PARAMS (8000) // ints

typedef struct {
    int param[TOTAL_PARAMS];
} param_large_t;

__global__ void kernelLargeParam(__grid_constant__ const param_large_t p,...) {
    // access all parameters from p
}

int main() {
    param_large_t p_large;
    kernelLargeParam<<<GRIDDIM,BLOCKDIM>>>(p_large,...);
    cudaDeviceSynchronize();
}

Note that in the preceding example, the kernel parameter is annotated with the __grid_constant__ qualifier to indicate it is read-only.

reference: https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/

I think this solves our concerns about multi-qureg operations both accessing a common ctrls cache in future.

(this is also much better than the variadic kernel idea I had earlier today)

Other benefits:

  1. Performance: the sweet sweet 200 ns saving (more like a few microseconds now I've checked my profiling data, so there could be another factor of two here) is possible by removing the cudaMemcpyToSymbol call.
  2. Removes any risk of race condition

@JPRichings
Contributor Author

JPRichings commented May 5, 2026

Profiling to confirm but here is an extra factor of 2 over #729 (comment):

Total number of gates: 210
Measured probability amplitude of |0..0> state: 9.53674e-07
Calculated probability amplitude of |0..0>, C0 = 1 / 2^20: 9.53674e-07
Measuring final state: (all probabilities should be 0.5)
Qubit 0 measured in state 1 with probability 0.5
Qubit 1 measured in state 1 with probability 0.5
Qubit 2 measured in state 1 with probability 0.5
Qubit 3 measured in state 0 with probability 0.5
Qubit 4 measured in state 0 with probability 0.5
Qubit 5 measured in state 0 with probability 0.5
Qubit 6 measured in state 1 with probability 0.5
Qubit 7 measured in state 0 with probability 0.5
Qubit 8 measured in state 1 with probability 0.5
Qubit 9 measured in state 1 with probability 0.5
Qubit 10 measured in state 0 with probability 0.5
Qubit 11 measured in state 0 with probability 0.5
Qubit 12 measured in state 0 with probability 0.5
Qubit 13 measured in state 1 with probability 0.5
Qubit 14 measured in state 1 with probability 0.5
Qubit 15 measured in state 1 with probability 0.5
Qubit 16 measured in state 0 with probability 0.5
Qubit 17 measured in state 0 with probability 0.5
Qubit 18 measured in state 1 with probability 0.5
Qubit 19 measured in state 0 with probability 0.5

Final state:
|11100010110001110010>
QFT run time: 0.00141512s
Total run time: 2.36306s

@JPRichings
Contributor Author

JPRichings commented May 5, 2026

A profile confirming no data movement outside of the kernel launch at all:

(profiler screenshot)

@TysonRayJones
Member

TysonRayJones commented May 6, 2026

Is it possible to pass multiple, distinct arguments to the one function that way? E.g. so that the target qubits of kernel_statevec_anyCtrlAnyTargDiagMatr_sub are also __grid_constant__? It'd be gross (but not the end of the world) if each kernel must manually isolate the targets array from a single ctrlsAndTargs array (although sneakily, kernel_statevec_anyCtrlOneTargDenseMatr_subA already performs such a trick).

The pattern of gpu_subroutines.cpp "secretly" passing __device__ __constant__ to unchanged kernels which still receive a ctrls ptr (and separately, a targs ptr) is still more pleasant to me, because then only gpu_subroutines.cpp needs to know about the optimisation being done, at copy-time. But a 2x speedup is nothing to sneeze at (if I interpreted that right)!

Orthogonally, do we have consternations about bumping min CUDA to 12.1 (released 2023)?

@JPRichings
Contributor Author

I think it is possible to pass multiple arguments this way. All this is (and it's what I tried first for this optimisation) is capturing a C-style array by value in the kernel launch; the __grid_constant__ qualifier is just to make sure it is treated as read-only on the device.

I think this is much cleaner than the data movement to constant memory: instead of pointers to device constant memory, we just pass the array directly. The only change to the original kernel (before all this optimisation stuff) is that we need to access a data member of a struct. There is no other change inside the kernel, and we aren't accessing a global device variable out of the blue mid-kernel. In gpu_subroutines.cpp, if I can get util_getSorted to return directly into the struct we are using to hide the C array, then there will be no change to that code apart from a function returning a new small struct as opposed to a vector<int>.

We probably don't need to change CUDA versions unless we think we are passing more than 4096 bytes of kernel parameters. I don't think we are anywhere near this, but no objection to moving to CUDA 12.1.
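The fixed-size wrapper idea can be sketched in plain C++ as below. All names here (KernelIntArray, getSortedCtrls, MAX_QUREG_SIZE) are hypothetical stand-ins for util_getSorted and friends, not the actual QuEST code:

```cpp
#include <algorithm>

// Hypothetical fixed-capacity array that can be passed to a kernel by value
// (annotated __grid_constant__ on the device side); stands in for the
// std::vector<int> currently returned by util_getSorted.
constexpr int MAX_QUREG_SIZE = 64;

struct KernelIntArray {
    int elems[MAX_QUREG_SIZE];
    int size;
};

// Sketch of a util_getSorted-style helper returning directly into the
// struct, so calling code changes only in the return type it receives.
KernelIntArray getSortedCtrls(const int* ctrls, int numCtrls) {
    KernelIntArray out{};
    out.size = numCtrls;
    std::copy(ctrls, ctrls + numCtrls, out.elems);
    std::sort(out.elems, out.elems + numCtrls);
    return out;
}
```

On the device side, a kernel could then take two distinct by-value parameters, e.g. `__grid_constant__ const KernelIntArray ctrls, __grid_constant__ const KernelIntArray targs`, which is what would let ctrls and targs stay separate rather than being packed into one ctrlsAndTargs buffer.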

@JPRichings
Contributor Author

JPRichings commented May 6, 2026

Just to hammer home the performance improvement here: the red is the explicit cudaMemcpyToSymbol for the previous ctrls buffer, which is now eliminated in this version.
(profiler screenshot)

@JPRichings
Contributor Author

Finally, the usual caveat applies: correctness checking needs to take place with a full run of the test suite!
