Write the math. Get the kernel.
Molten turns mathematical operation specs into fused, portable CUDA kernels.
No tile loops. No schedules. No framework lock-in. The output is a .cu file. It compiles with nvcc. It runs without PyTorch.
Built by Tushar Sharma at ALIA Labs.
pip install alia-molten

from molten import ZeroCompiler
from molten.ir import DataflowGraph, TensorShape
g = DataflowGraph("fused_rmsnorm")
x = g.add_input("x", TensorShape([2048, 5120]))
w = g.add_input("w", TensorShape([5120]))
out = g.rms_norm(x, w, "norm")
g.add_output(out)
compiler = ZeroCompiler()
kernels = compiler.compile(g) # 3 ops -> 1 kernel
compiler.save(kernels, "output/") # standalone .cu file

That's it. Three operations. One kernel. Zero CUDA written by hand.
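For reference, here is a minimal pure-Python sketch of the math that `g.rms_norm(x, w, "norm")` stands for, written as the three steps that get fused (reduce, normalize, scale). The function name `rms_norm_reference` and the `eps` value are assumptions for illustration, not part of Molten's API, and the tiny inputs stand in for the real `[2048, 5120]` tensor.

```python
import math

def rms_norm_reference(x, w, eps=1e-6):
    """Row-wise RMSNorm: y = x / sqrt(mean(x^2) + eps) * w.

    eps is an assumed value; this README does not state the one Molten uses.
    """
    out = []
    for row in x:
        # Step 1 (reduce): mean of squares over the hidden dimension.
        mean_sq = sum(v * v for v in row) / len(row)
        # Step 2 (normalize): reciprocal root-mean-square of the row.
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        # Step 3 (scale): apply the learned per-channel weight.
        out.append([v * inv_rms * g for v, g in zip(row, w)])
    return out

x = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.5, -0.5]]
w = [1.0, 1.0, 1.0, 1.0]
y = rms_norm_reference(x, w)
```

A reference like this is handy for checking a generated kernel's output against known-good values.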
Math Spec -> DataflowGraph -> Optimizer -> Fusion Engine -> CUDA Codegen -> .cu
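To picture what the Fusion Engine stage does, here is a toy sketch that groups consecutive elementwise ops in a linear chain into single kernels. Everything in it (`ELEMENTWISE`, `fuse_elementwise_chains`, the op names) is hypothetical and is not Molten's internal representation. It only illustrates the idea of collapsing N ops into fewer kernels.

```python
from itertools import groupby

# Hypothetical set of ops treated as elementwise for this sketch.
ELEMENTWISE = {"add", "mul", "silu", "gelu", "exp"}

def fuse_elementwise_chains(ops):
    """Group each run of consecutive elementwise ops into one kernel."""
    kernels = []
    for is_elementwise, run in groupby(ops, key=lambda op: op in ELEMENTWISE):
        run = list(run)
        if is_elementwise:
            # The whole run becomes a single fused kernel.
            kernels.append(run)
        else:
            # Non-elementwise ops each keep their own kernel.
            kernels.extend([op] for op in run)
    return kernels

ops = ["matmul", "add", "gelu", "mul", "softmax", "exp"]
kernels = fuse_elementwise_chains(ops)
print(kernels)
# [['matmul'], ['add', 'gelu', 'mul'], ['softmax'], ['exp']]
```

Six ops become four kernels here. A real fusion pass works on a graph rather than a flat list and has to check shapes and data dependencies before merging.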
The fusion engine knows six rules, including:
| Pattern | What It Does |
|---|---|
| Elementwise chain | Fuses N ops into 1. Kills N-1 memory round-trips. |
| MatMul + bias + activation | Epilogue fusion. One kernel does matmul, adds bias, applies GELU. |
| RMSNorm | Fuses reduce + normalize + scale. One pass over the data. |
| Softmax | Fuses max + exp + sum + divide. Three passes become one. |
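As an illustration of the Softmax row, one well-known way to avoid separate max and sum sweeps is the online softmax recurrence, which keeps a running max and rescales a running sum as it goes. This README does not say which scheme Molten's generated kernel uses, so treat the sketch below as an assumption. The function names are made up for the example.

```python
import math

def softmax_three_pass(xs):
    """Textbook softmax: one sweep each for max, sum, and divide."""
    m = max(xs)                               # pass 1: max
    s = sum(math.exp(v - m) for v in xs)      # pass 2: sum of exponentials
    return [math.exp(v - m) / s for v in xs]  # pass 3: divide

def online_max_and_sum(xs):
    """Single sweep: running max plus a running sum rescaled on each new max."""
    m, s = float("-inf"), 0.0
    for v in xs:
        new_m = max(m, v)
        # Rescale the old sum to the new max, then add the current term.
        s = s * math.exp(m - new_m) + math.exp(v - new_m)
        m = new_m
    return m, s

def softmax_online(xs):
    """Softmax using the single-sweep statistics (finite inputs assumed)."""
    m, s = online_max_and_sum(xs)
    return [math.exp(v - m) / s for v in xs]

row = [1.0, 2.0, 3.0, -4.0]
a = softmax_three_pass(row)
b = softmax_online(row)
```

In a GPU kernel the row can stay in registers or shared memory between the statistics sweep and the final divide, so global memory is read only once.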
Benchmarks were run on an RTX 5090, all in the same session, with 19/19 correctness checks passing. Also validated on RTX 4090 and H100 SXM.
Molten-generated RMSNorm (zero hand-written CUDA):
| | Eager | torch.compile | Molten | vs Compile |
|---|---|---|---|---|
| decode (1 token) | 167 us | 127 us | 28 us | 4.6x |
| prefill (2048 tokens) | 160 us | 96 us | 55 us | 1.7x |
| long (8192 tokens) | 793 us | 323 us | 394 us | 0.82x |
Molten wins at decode and prefill. torch.compile wins at long sequences, where Molten v0.1's scalar loads lose to torch.compile's vectorized loads. That gap closes in v0.2.
Hand-written fused RMSNorm+SiLU*gate (the target Molten is closing in on):
| | Eager (3 ops) | Fused (1 kernel) | Speedup |
|---|---|---|---|
| decode | 207 us | 27 us | 7.6x |
| prefill | 347 us | 97 us | 3.6x |
| long | 1327 us | 403 us | 3.3x |
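To make the "3 ops vs 1 kernel" comparison concrete, here is a pure-Python sketch for a single row. It assumes the fused op computes `silu(rms_norm(x, w)) * gate`. The README does not spell out the exact composition, so that ordering, the `eps` value, and all function names are assumptions.

```python
import math

def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def eager_three_ops(x, w, gate, eps=1e-6):
    """Three separate ops, each materializing an intermediate list."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v * inv_rms * wi for v, wi in zip(x, w)]  # op 1: RMSNorm
    act = [silu(v) for v in normed]                     # op 2: SiLU
    return [a * g for a, g in zip(act, gate)]           # op 3: gate multiply

def fused_one_pass(x, w, gate, eps=1e-6):
    """Same math with no intermediates: one reduction, one output loop."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [silu(v * inv_rms * wi) * g for v, wi, g in zip(x, w, gate)]

x = [1.0, -2.0, 3.0, 0.5]
w = [1.0, 0.5, 2.0, 1.0]
gate = [0.1, 0.2, 0.3, 0.4]
eager_out = eager_three_ops(x, w, gate)
fused_out = fused_one_pass(x, w, gate)
```

The two versions return the same values. On a GPU the difference is that the eager path writes `normed` and `act` to global memory and reads them back, while the fused path does not.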
torch.compile generates Triton code tied to PyTorch. You can't deploy it without the full Python + PyTorch + Triton stack.
Molten generates a .cu file. Ship it to TensorRT, ONNX Runtime, a C++ server, a Jetson, whatever. It's just CUDA.
RTX 4090 | RTX 5090 | H100 SXM 80GB
Real model validation on Qwen2.5-7B (57 RMSNorm layers, hidden=3584).
@article{sharma2026molten,
title={Molten: Fused GPU Kernel Generation from Mathematical Specifications},
author={Sharma, Tushar},
year={2026},
url={https://github.com/TxsharDev/molten}
}

v0.1 (current) - IR, fusion engine, CUDA codegen, JIT runtime. RMSNorm and elementwise fusion proven. Scalar memory access.
v0.2 - Vectorized loads (float4/half2). This closes the gap where torch.compile currently wins at long sequences. fp16 I/O benchmarked end-to-end. @zero decorator dispatches generated kernels directly.
v0.3 - Attention fusion (softmax(Q @ K^T) @ V as one kernel). RoPE integration. Polyhedral loop optimization for complex fusion patterns. Auto-tuning via hardware counter feedback.
Apache-2.0 | ALIA Labs