Skip to content

TxsharDev/carbon

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

18 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CARBON

Same seed. Different GPU. Identical weights.

PyPI License Deterministic


Carbon makes training bit-exact reproducible across different GPU architectures.

Train on an RTX 4090. Train on an RTX 5090. Same SHA-256 hash on every weight tensor. Same optimizer state. Same loss. Different silicon, identical bits.

Built by Tushar Sharma at ALIA Labs.

Install

pip install alia-carbon

The Problem

RTX 4090:  hash = 6a6e2bc1...  loss = 0.069844
RTX 5090:  hash = 46681ef8...  loss = 0.069844

Same loss. Different weights. cuBLAS picked different internal algorithms on different silicon. Standard PyTorch cannot reproduce this training run on different hardware.

The Fix

import carbon
carbon.enable(seed=42)
RTX 4090:  hash = 62118e9c...  loss = 0.070026
RTX 5090:  hash = 62118e9c...  loss = 0.070026

Identical. 10 seeds tested, up to 500 steps, every configuration matches.

How It Works

Every non-deterministic op replaced with a deterministic one:

Op Standard Carbon
MatMul cuBLAS (arch-dependent) Tiled fp64 + Kahan accumulation
LayerNorm Parallel reduction (thread-dependent) fp64 mean/variance
AllReduce NCCL (arrival-order-dependent) AllGather + rank-order reduce
Scatter Atomic race conditions Sorted index ops

The mechanism: split every matmul into tiles, upcast to float64, accumulate with Kahan-Babushka-Neumaier compensation in a fixed order. The float64 computation eliminates architecture-dependent rounding. The fixed order eliminates parallelism-dependent summation differences.

The Proof

Toy Model (500K params, 50 steps, 5 seeds)

Seed Steps 4090 Hash 5090 Hash Match
42 500 d6830c89... d6830c89... yes
123 500 44843c64... 44843c64... yes
7 500 7bf2c902... 7bf2c902... yes
999 500 8cc60024... 8cc60024... yes
2024 500 1e420d04... 1e420d04... yes

10 out of 10 configurations. Every hash matches.

GPT-2 124M Fine-Tune (60M trainable, 20 steps)

Run Hash Loss
Standard PyTorch (5090) 85b72d9f... 7.1904
Carbon run 1 (5090) 995d4c9b... 7.1904
Carbon run 2 (5090) 995d4c9b... 7.1904
Carbon cross-GPU (4090) 995d4c9b... 7.1904

Three Carbon runs, two GPUs, one hash. Standard PyTorch produces a different hash.

Overhead

Scale Overhead
500K toy model 1.07x
GPT-2 124M (60M trainable) 10.1x

The cost of bit-exact determinism. fp64 Kahan-compensated matmul is slower than cuBLAS. For alignment research and debugging where you need exact reproducibility, it's worth it.

Important: What Carbon Requires

Cross-architecture determinism requires replacing nn.Linear with DeterministicLinear and nn.LayerNorm with CarbonLayerNorm. The carbon.enable() call patches torch.matmul globally, but standard PyTorch modules use internal C++ paths that bypass the patch.

This is not a one-line fix for existing training code. It's a mechanism that works when you build with Carbon's layers.

Tested On

RTX 4090 (Ada) | RTX 5090 (Blackwell) | H100 SXM | A100 SXM

Consumer GPUs match each other. Datacenter GPUs match each other. Cross-tier (consumer vs datacenter) produces different hashes. Documented, not hidden.

Citation

@article{sharma2026carbon,
  title={Carbon: Bit-Exact Deterministic Training Across Consumer GPU Architectures},
  author={Sharma, Tushar},
  year={2026},
  url={https://github.com/TxsharDev/carbon}
}

Roadmap

v0.1 (current) - Proof of concept. KBN summation, tiled fp64 matmul, Python-level tiling.

v0.2 - Performance. Replace Kahan accumulation with superaccumulators for order-independent exact summation. This eliminates the need for fixed tile order and could close the consumer-vs-datacenter hash gap. CUDA-native tiled matmul to cut the 10x overhead to under 2x.

v0.3 - Scale. Mixed precision (bf16 forward, fp32 master weights). Multi-GPU deterministic allreduce. FSDP/DDP integration. Target: deterministic fine-tuning of 7B+ models.

License

Apache-2.0 | ALIA Labs

About

Same seed. Different GPU. Identical weights. Bit-exact deterministic training across GPU architectures.

Topics

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors