libHPC is a high-performance computing library focused on Linux and Windows environments. It provides SIMD-optimized kernels, concurrent data structures, GPU utilities, and HPC-oriented memory management components.
This public archive preserves the state of libHPC at the point where its core HPC primitives — GPU radix sort, ABA-safe lock-free queue, SIMD kernels, and cache-hierarchy benchmarks — reached a stable, validated milestone.
Active development continues privately. The archive is retained for study, reference, and portfolio purposes; commercial use, redistribution, or derivative proprietary work without explicit permission is not permitted.
Platform support status, known limitations, and benchmarks are documented in the sections below.
| Platform | Status |
|---|---|
| Linux (x86_64 / CUDA) | ✓ Supported |
| Windows (MSVC / CUDA) | ✓ Supported |
| macOS (Intel) | ✓ Supported, limited |
| macOS (Apple Silicon / ARM64) | ✗ Not supported |
libHPC does not support macOS ARM (Apple Silicon).
The reason is simple:
Apple’s recent macOS / Xcode toolchain updates introduced ABI changes in libc++ that cause oneTBB and other HPC components to fail at link time.
Specifically, the symbol `std::__1::__hash_memory`, a critical dependency for oneTBB, has been removed or hidden at the SDK level. These issues do not occur on Linux or Windows, and they did not occur on older macOS versions.
Since the goal of libHPC is stable, reproducible high-performance computing, macOS ARM is excluded to avoid degraded reliability or performance.
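
One way such an exclusion could be enforced at build time is a compile-time guard. The sketch below is only an illustration: the names `kIsApple`, `kIsArm64`, and `is_excluded_target` are hypothetical and are not taken from libHPC. It relies only on the predefined compiler macros `__APPLE__`, `__aarch64__`, `__arm64__`, and `_M_ARM64`.

```cpp
// Illustrative compile-time guard; all names here are hypothetical, not libHPC's.
#if defined(__APPLE__)
constexpr bool kIsApple = true;
#else
constexpr bool kIsApple = false;
#endif

#if defined(__aarch64__) || defined(__arm64__) || defined(_M_ARM64)
constexpr bool kIsArm64 = true;
#else
constexpr bool kIsArm64 = false;
#endif

// Encodes only the one explicit exclusion from the platform table:
// an Apple target combined with an ARM64 architecture.
constexpr bool is_excluded_target(bool is_apple, bool is_arm64) {
    return is_apple && is_arm64;
}

// Stops the build with a readable message instead of a late link-time failure.
static_assert(!is_excluded_target(kIsApple, kIsArm64),
              "macOS ARM64 (Apple Silicon) is not a supported target");
```

Failing early in this way would surface the limitation as a clear diagnostic rather than as an unresolved-symbol error from the linker.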
libHPC includes GPU-accelerated kernels optimized for high-throughput computation on NVIDIA CUDA-compatible devices:
- Radix-Sort Kernel: Processes 500M elements in ~360 ms on an RTX 3080 Ti (laptop), sustaining ~1.39B elements/sec throughput.
- Warp-Synchronous & Tiled Memory Layouts: Maximize shared memory utilization and minimize global memory latency.
- Concurrent GPU Pipelines: Supports asynchronous kernel launches and stream-based scheduling for overlapping compute and memory operations.
- Profiling & Validation: Includes tools for measuring warp efficiency, analyzing memory access patterns, and checking synchronization correctness across GPU architectures.
- Realistic HPC Throughput: Designed for bulk-parallel computation and scientific workloads, not real-time ultra-low-latency trading systems.
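
To make the radix-sort and throughput figures above more concrete, here is a minimal CPU-side sketch of a least-significant-digit radix sort, of the kind that could serve as a reference when validating a GPU sort's output. It is not libHPC's CUDA kernel, and the GPU implementation may be structured quite differently. The function names `radix_sort_reference` and `throughput` are hypothetical, and the 8-bit digit width is an assumption chosen for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU reference: LSD radix sort on 32-bit keys, 8 bits per pass (4 passes).
// Hypothetical validation helper; not the library's GPU kernel.
std::vector<std::uint32_t> radix_sort_reference(std::vector<std::uint32_t> keys) {
    constexpr int kBitsPerPass = 8;
    constexpr std::size_t kBuckets = 1u << kBitsPerPass;
    std::vector<std::uint32_t> scratch(keys.size());

    for (int shift = 0; shift < 32; shift += kBitsPerPass) {
        // 1. Histogram of the current digit.
        std::array<std::size_t, kBuckets> count{};
        for (std::uint32_t k : keys) ++count[(k >> shift) & (kBuckets - 1)];

        // 2. Exclusive prefix sum turns counts into starting offsets.
        std::size_t offset = 0;
        for (std::size_t& c : count) {
            const std::size_t n = c;
            c = offset;
            offset += n;
        }

        // 3. Stable scatter into the scratch buffer, then swap buffers.
        for (std::uint32_t k : keys) scratch[count[(k >> shift) & (kBuckets - 1)]++] = k;
        keys.swap(scratch);
    }
    return keys;  // After an even number of passes the result is back in `keys`.
}

// Throughput in elements per second for `elements` processed in `seconds`.
double throughput(double elements, double seconds) {
    assert(seconds > 0.0);
    return elements / seconds;
}
```

Calling `throughput(500e6, 0.360)` gives roughly 1.39 × 10^9 elements per second, which is consistent with the figure quoted for the radix-sort kernel. The histogram, prefix-sum, and scatter steps are the same three phases a GPU radix sort typically parallelizes, which is why a sequential version like this is a convenient correctness baseline.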