CAMM - CUDA Accelerated Matrix Multiplication

A comprehensive CUDA implementation showcasing various matrix multiplication optimization techniques, from naive approaches to highly optimized kernels with register tiling and vectorization.

🚀 Features

9 Different Kernel Implementations with progressive optimizations
Auto-dispatch entry point that selects the best kernel per shape
Arbitrary-shape kernels (any M, N, K — square, rectangular, non-aligned)
Autotuner that sweeps tile geometries to find the best config per GPU
Comprehensive Benchmarking against cuBLAS and CUTLASS
Performance Analysis with detailed metrics
Modular Architecture for easy experimentation

📁 Project Structure

CAMM/
├── Kernel/                          # CUDA kernel implementations
│   ├── matmul_naive/               # Basic matrix multiplication
│   ├── mat_mul_coalesced/          # Memory coalescing optimization
│   ├── mat_mul_sharedmem/          # Shared memory optimization
│   ├── mat_mul_register_tiling/    # Register tiling with specialization
│   ├── mat_mul_vectorized/         # float4 vectorized memory access
│   ├── mat_mul_tuned/              # Autotuned 128x128 / BK16 / 16x8 reg tile
│   ├── mat_mul_doublebuffer/       # Software-pipelined double buffering
│   ├── mat_mul_general/            # Arbitrary M,N,K (bounds-checked)
│   ├── mat_mul_boundary/           # Arbitrary M,N,K (fast interior + masked edges)
│   └── mat_mul_auto/               # Auto-dispatch: best kernel per shape
├── Header/
│   └── matmul_kernels.cuh          # Kernel function declarations
├── utils/                          # Benchmarking and utility functions
│   ├── benchmark_matmul_*.cu       # Individual kernel benchmarks
│   ├── benchmark_matmul_shapes.cu  # Arbitrary-shape (general/boundary) benchmark
│   ├── autotune.cu                 # Tile-geometry sweep vs cuBLAS
│   ├── main.cu                     # Main benchmarking suite
│   └── cpu_benchmarking.cpp        # CPU reference implementation
├── Benchmarks/                     # Performance results (ignored by git)
└── cutlass/                        # NVIDIA CUTLASS library integration

🔧 Kernel Implementations

1. Naive Implementation (`matmul_naive`)

Description: Basic matrix multiplication without optimizations
Grid/Block: Standard 2D grid configuration
Use Case: Baseline performance reference

2. Coalesced Memory Access (`mat_mul_coalesced`)

Description: Optimized memory access patterns for better bandwidth utilization
Optimization: Ensures coalesced global memory access
Performance: ~2-3x improvement over naive implementation

3. Shared Memory (`mat_mul_sharedmem`)

Description: Utilizes shared memory to reduce global memory accesses
Optimization: Tile-based computation with shared memory blocking
Performance: ~4-6x improvement over naive implementation

4. Register Tiling (`mat_mul_register_tiling`)

Description: Advanced optimization using register-level tiling
Features:
- Base register tiling implementation
- Size-specialized kernels for 128x128 and 512x512 matrices
- Optimized grid dimensions: gridDim(16,16), blockDim(16,16)
Performance: ~8-12x improvement over naive implementation

5. Vectorized Operations (`matmul_vectorized`)

Description: Specialized kernels with vectorized memory operations
Performance: Highest performance Achieved Yet
⚠️ Note: Currently experiencing efficiency loss that requires optimization fixes

6. Tuned Register Tiling (`matmul_tuned`)

Description: Autotuned 128x128 block tile, BK=16, asymmetric 16x8 thread register tile (128 threads/block), transposed shared A, float4-vectorized loads/stores
Key finding: the asymmetric 16x8 register tile beats both 8x8 and 8x16
Constraint: square N, multiple of 128

7. Double Buffering (`matmul_doublebuffer`)

Description: Software-pipelined version of the tuned kernel — prefetches the next K-block into registers while the current block's FMAs run, overlapping global-load latency with compute
Constraint: square N, multiple of 128

8. General (`launch_matmul_general`)

Description: Arbitrary M, N, K matmul. Same register-blocking compute as the tuned kernel, but every global access is bounds-checked, so any shape runs correctly (square, rectangular, non-128-aligned). Uses a 64x64 tile for small dimensions (fills all SMs) and 128x128 otherwise.
Signature: launch_matmul_general(A, B, C, M, N, K) — C(MxN)=A(MxK)*B(KxN)

9. Boundary (`launch_matmul_boundary`)

Description: Arbitrary M, N, K, but interior 128x128 tiles take the fast float4 double-buffered path while only the partial edge tiles + K-remainder take the scalar masked path — no padding buffers, no extra copies. The fast interior path needs N%4==0 && K%4==0; otherwise it falls back to the scalar path.
Signature: launch_matmul_boundary(A, B, C, M, N, K)

⭐ Auto (`launch_matmul_auto`) — recommended entry point

Description: Picks the best kernel for the given shape automatically:
- square & multiple of 128, K in [5120, 8192) → tuned (single-buffer wins at deep K)
- square & multiple of 128, otherwise → doublebuffer (best general default)
- any other shape → boundary (handles arbitrary M, N, K)
Signature: launch_matmul_auto(A, B, C, M, N, K)
Use this if you just want the fastest correct result for any shape.

🏗️ Build Instructions

Prerequisites

NVIDIA GPU with CUDA Compute Capability 6.0+
CUDA Toolkit 11.0+
GCC/G++ compiler
CMake (optional)

Compilation

Individual Kernels

# Naive implementation
nvcc -o naive utils/benchmark_matmul_naive.cu Kernel/matmul_naive/*.cu

# Coalesced memory access
nvcc -o coalesced utils/benchmark_matmul_coalesced.cu Kernel/mat_mul_coalesced/*.cu

# Shared memory optimization
nvcc -o shared utils/benchmark_matmul_sharedmem.cu Kernel/mat_mul_sharedmem/*.cu

# Register tiling
nvcc -o register utils/benchmark_matmul_register_tiling.cu Kernel/mat_mul_register_tiling/*.cu

# Vectorized
nvcc -o vectorized utils/benchmark_matmul_vectorized.cu Kernel/mat_mul_vectorized/*.cu

# Tuned (128x128, BK=16, 16x8 register tile)
nvcc -O3 -arch=sm_86 -o tuned utils/benchmark_matmul_tuned.cu Kernel/mat_mul_tuned/*.cu

# Double-buffered
nvcc -O3 -arch=sm_86 -o doublebuffer utils/benchmark_matmul_doublebuffer.cu Kernel/mat_mul_doublebuffer/*.cu

# Arbitrary-shape kernels (general + boundary), square/rectangular/non-aligned
nvcc -O3 -arch=sm_86 -o shapes utils/benchmark_matmul_shapes.cu \
     Kernel/mat_mul_general/*.cu Kernel/mat_mul_boundary/*.cu

# Autotuner: sweep tile geometries vs cuBLAS (needs -lcublas)
nvcc -O3 -arch=sm_86 -o autotune utils/autotune.cu -lcublas

# ⭐ Auto kernel — picks the best kernel per shape automatically
nvcc -O3 -arch=sm_86 -o auto utils/benchmark_matmul_auto.cu \
     Kernel/mat_mul_auto/*.cu Kernel/mat_mul_tuned/*.cu \
     Kernel/mat_mul_doublebuffer/*.cu Kernel/mat_mul_boundary/*.cu

# Head-to-head: cuBLAS vs our auto kernel on matched sizes (needs -lcublas)
nvcc -O3 -arch=sm_86 -o compare utils/benchmark_compare.cu \
     Kernel/mat_mul_auto/*.cu Kernel/mat_mul_tuned/*.cu \
     Kernel/mat_mul_doublebuffer/*.cu Kernel/mat_mul_boundary/*.cu -lcublas

Complete Benchmarking Suite

# Compile main benchmarking application
nvcc -o benchmark utils/main.cu Kernel/*/*.cu -I./Header

# Compare against cuBLAS
nvcc -o cublas_bench utils/benchmark_matmul_cublas.cu -lcublas

# Compare against CUTLASS
nvcc -o cutlass_bench utils/benchmark_matmul_cutlass.cu -I./cutlass/include

Compilation Flags (Recommended)

nvcc -O3 -arch=sm_75 -use_fast_math -Xptxas -O3 -o <output> <source_files>

📊 Performance Benchmarking

Running Benchmarks

# Run individual kernel benchmark
./benchmark

# ⭐ Auto kernel — best kernel per shape, the recommended entry point
./auto              # default square sweep (128 .. 8192)
./auto 4096         # single square N=4096
./auto 1024 4096 2048   # rectangular M=1024 N=4096 K=2048

# Head-to-head: cuBLAS vs our auto kernel (matched sizes, prints comparison table)
./compare

# Tuned kernel sweep (N = 128 .. 8192)
./tuned

# Double-buffered kernel sweep
./doublebuffer

# Arbitrary-shape kernels: square, rectangular, non-aligned (with correctness check)
./shapes

# Autotuner — sweep tile geometries at the given sizes (defaults to 2048 4096)
./autotune 2048 4096

# Compare with cuBLAS
./cublas_bench

# Compare with CUTLASS
./cutlass_bench

Note on -arch: use sm_86 for Ampere consumer GPUs (e.g. RTX 30-series, RTX 2050), sm_80 for A100, sm_75 for Turing. The tuned/double-buffer/boundary kernels are square-tile-aligned to multiples of 128 for the fast path; the general and boundary kernels accept any M, N, K. Re-run ./autotune on a new GPU to re-find the best tile.

Expected Performance Characteristics

Kernel Type	Relative Performance	Memory Efficiency
Naive	1x (baseline)	Low
Coalesced	2-3x	Medium
Shared Memory	4-6x	High
Register Tiling	8-12x	Very High
Specialized	10-15x	Very High

🔬 Optimization Techniques Demonstrated

Memory Coalescing: Ensuring aligned memory access patterns
Shared Memory Utilization: Reducing global memory bandwidth requirements
Register Tiling: Maximizing register usage and reducing memory latency
Thread Block Optimization: Optimal thread block dimensions
Vectorized Operations: Using float4 vector load/store instructions
Asymmetric Register Tiles: 16x8 thread tile (autotuned winner on Ampere)
Double Buffering: Software-pipelined prefetch overlapping load with compute
Predicated Boundary Handling: fast interior tiles + masked edges for any shape
Autotuning: sweeping tile geometry to find the GPU-specific optimum

📈 Development and Testing

Adding New Kernels

Create kernel implementation in Kernel/<kernel_name>/
Add declaration to Header/matmul_kernels.cuh
Create benchmark in utils/benchmark_<kernel_name>.cu
Update main benchmarking suite

Performance Testing

Benchmark results are automatically saved to Benchmarks/ (git-ignored)
Use consistent matrix sizes for fair comparisons
Run multiple iterations for statistical significance

🤝 Contributing

Follow the existing code structure
Add comprehensive benchmarks for new implementations
Document optimization techniques used
Ensure compatibility with existing build system

📝 License

This project is for educational and research purposes, demonstrating CUDA optimization techniques for matrix multiplication.

🔗 References

Note: Performance results may vary based on GPU architecture, CUDA version, and system configuration. Benchmark on your target hardware for accurate performance characteristics.

Name		Name	Last commit message	Last commit date
Latest commit History 41 Commits
Benchmarks		Benchmarks
Header		Header
Kernel		Kernel
utils		utils
.gitignore		.gitignore
README.md		README.md
mylib.h		mylib.h

Folders and files

Latest commit

History

Repository files navigation

CAMM - CUDA Accelerated Matrix Multiplication

🚀 Features

📁 Project Structure

🔧 Kernel Implementations

1. Naive Implementation (matmul_naive)

2. Coalesced Memory Access (mat_mul_coalesced)

3. Shared Memory (mat_mul_sharedmem)

4. Register Tiling (mat_mul_register_tiling)

5. Vectorized Operations (matmul_vectorized)

6. Tuned Register Tiling (matmul_tuned)

7. Double Buffering (matmul_doublebuffer)

8. General (launch_matmul_general)

9. Boundary (launch_matmul_boundary)

⭐ Auto (launch_matmul_auto) — recommended entry point

🏗️ Build Instructions

Prerequisites

Compilation

Individual Kernels

Complete Benchmarking Suite

Compilation Flags (Recommended)

📊 Performance Benchmarking

Running Benchmarks

Expected Performance Characteristics

🔬 Optimization Techniques Demonstrated

📈 Development and Testing

Adding New Kernels

Performance Testing

🤝 Contributing

📝 License

🔗 References

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. Naive Implementation (`matmul_naive`)

2. Coalesced Memory Access (`mat_mul_coalesced`)

3. Shared Memory (`mat_mul_sharedmem`)

4. Register Tiling (`mat_mul_register_tiling`)

5. Vectorized Operations (`matmul_vectorized`)

6. Tuned Register Tiling (`matmul_tuned`)

7. Double Buffering (`matmul_doublebuffer`)

8. General (`launch_matmul_general`)

9. Boundary (`launch_matmul_boundary`)

⭐ Auto (`launch_matmul_auto`) — recommended entry point

Packages