MOLTEN

Write the math. Get the kernel.

Molten turns mathematical operation specs into fused, portable CUDA kernels.

No tile loops. No schedules. No framework lock-in. The output is a .cu file. It compiles with nvcc. It runs without PyTorch.

Built by Tushar Sharma at ALIA Labs.

Install

pip install alia-molten

30 Seconds to a Fused Kernel

from molten import ZeroCompiler
from molten.ir import DataflowGraph, TensorShape

g = DataflowGraph("fused_rmsnorm")
x = g.add_input("x", TensorShape([2048, 5120]))
w = g.add_input("w", TensorShape([5120]))
out = g.rms_norm(x, w, "norm")
g.add_output(out)

compiler = ZeroCompiler()
kernels = compiler.compile(g)        # 3 ops -> 1 kernel
compiler.save(kernels, "output/")    # standalone .cu file

That's it. Three operations. One kernel. Zero CUDA written by hand.
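For a sense of what "three operations, one kernel" means, here is a plain-Python reference of the math, written once as three separate passes and once as a single pass per row. This is only an illustrative sketch, not Molten's generated code: the function names and the `eps` value are assumptions chosen for the example.

```python
import math

def rms_norm_unfused(x, w, eps=1e-6):
    """Three separate ops, each materializing an intermediate result."""
    # Op 1 (reduce): mean of squares for every row.
    mean_sq = [sum(v * v for v in row) / len(row) for row in x]
    # Op 2 (normalize): divide each element by its row's RMS.
    normed = [[v / math.sqrt(ms + eps) for v in row]
              for row, ms in zip(x, mean_sq)]
    # Op 3 (scale): multiply by the per-column weight.
    return [[v * wi for v, wi in zip(row, w)] for row in normed]

def rms_norm_fused(x, w, eps=1e-6):
    """Same math with no intermediate tensors: each row is reduced,
    normalized and scaled before moving on to the next one."""
    out = []
    for row in x:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v * inv_rms * wi for v, wi in zip(row, w)])
    return out

# Hypothetical toy inputs (2 rows, hidden size 4).
x = [[1.0, 2.0, 3.0, 4.0], [0.5, -0.5, 0.25, -0.25]]
w = [1.0, 0.5, 2.0, 1.5]
unfused = rms_norm_unfused(x, w)
fused = rms_norm_fused(x, w)
```

Both functions return the same values up to floating-point rounding. The fused form is the shape a single kernel would take, since nothing is written out between the reduce, normalize and scale steps.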

What Happens Under the Hood

Math Spec -> DataflowGraph -> Optimizer -> Fusion Engine -> CUDA Codegen -> .cu
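The Fusion Engine stage can be pictured with a toy pass over a linear list of ops. The sketch below is an assumption-laden illustration, not Molten's implementation: the op names, the `ELEMENTWISE` set, and both helper functions are hypothetical, and a real dataflow graph is a DAG rather than a flat list.

```python
# Hypothetical set of ops treated as elementwise for this example.
ELEMENTWISE = {"add", "mul", "silu", "gelu", "exp"}

def fuse_elementwise_chains(ops):
    """Group consecutive elementwise ops into one kernel each.

    Non-elementwise ops stay in kernels of their own.
    """
    kernels, current = [], []
    for op in ops:
        if op in ELEMENTWISE:
            current.append(op)
        else:
            if current:          # close the running elementwise chain
                kernels.append(current)
                current = []
            kernels.append([op])
    if current:
        kernels.append(current)
    return kernels

def round_trips_saved(kernels):
    """A kernel that fuses N ops avoids N-1 intermediate write/read pairs."""
    return sum(len(k) - 1 for k in kernels)

ops = ["matmul", "add", "gelu", "mul", "reduce_sum"]
kernels = fuse_elementwise_chains(ops)
print(kernels)                     # [['matmul'], ['add', 'gelu', 'mul'], ['reduce_sum']]
print(round_trips_saved(kernels))  # 2
```

Five ops become three kernels here, and the three-op elementwise chain no longer writes two intermediate tensors to memory.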

The fusion engine knows six rules, including:

Pattern	What It Does
Elementwise chain	Fuses N ops into 1. Kills N-1 memory round-trips.
MatMul + bias + activation	Epilogue fusion. One kernel does matmul, adds bias, applies GELU.
RMSNorm	Fuses reduce + normalize + scale. One pass over the data.
Softmax	Fuses max + exp + sum + divide. Three passes become one.
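One well-known way to merge the softmax reductions is the "online softmax" recurrence, which tracks the running max and the running sum in the same sweep. Whether Molten uses this exact recurrence is not stated here, so treat the following as a sketch of the general technique in plain Python, with hypothetical function names.

```python
import math

def softmax_three_pass(xs):
    """Textbook version: one sweep per reduction, one for the divide."""
    m = max(xs)                                   # pass 1: max
    total = sum(math.exp(v - m) for v in xs)      # pass 2: exp + sum
    return [math.exp(v - m) / total for v in xs]  # pass 3: divide

def softmax_online(xs):
    """Max and sum are computed together in a single sweep (finite inputs)."""
    m, total = float("-inf"), 0.0
    for v in xs:
        new_m = max(m, v)
        # Rescale the sum accumulated so far to the new max, then add this term.
        total = total * math.exp(m - new_m) + math.exp(v - new_m)
        m = new_m
    # Final normalization. In a GPU kernel the row could stay in registers
    # or shared memory, so this step need not reread global memory.
    return [math.exp(v - m) / total for v in xs]

# Hypothetical toy input.
xs = [1.0, 3.0, -2.0, 0.5]
reference = softmax_three_pass(xs)
online = softmax_online(xs)
```

The two functions agree up to floating-point rounding. The difference is that the online version never needs the full row just to find the max before it can start summing.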

Benchmarks

Measured on an RTX 5090, all runs in the same session. 19/19 correctness checks pass. Also validated on RTX 4090 and H100 SXM.

Molten-generated RMSNorm (zero hand-written CUDA):

Workload	Eager	torch.compile	Molten	Speedup vs torch.compile
decode (1 token)	167 us	127 us	28 us	4.6x
prefill (2048 tokens)	160 us	96 us	55 us	1.7x
long (8192 tokens)	793 us	323 us	394 us	0.82x

Molten wins at decode and prefill. torch.compile wins at long sequences, where Molten v0.1's scalar loads lose to vectorized loads. That gap closes in v0.2.
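To see why scalar versus vectorized loads matters at long sequences, it helps to count load instructions. The sketch below assumes a hidden size of 5120 (borrowed from the quickstart shape; the benchmark's actual hidden size is not stated) and a hypothetical helper name. It counts instructions only and does not predict runtime.

```python
def load_instructions(rows, cols, lanes):
    """Load instructions needed to read a rows x cols tensor once,
    fetching `lanes` elements per instruction within each row."""
    per_row = -(-cols // lanes)  # ceiling division
    return rows * per_row

# 8192 tokens (the "long" row above); hidden size 5120 is an assumption.
scalar = load_instructions(8192, 5120, lanes=1)  # one float per load
float4 = load_instructions(8192, 5120, lanes=4)  # four floats per load
print(scalar, float4, scalar // float4)          # 41943040 10485760 4
```

A float4 load moves the same bytes with a quarter of the instructions. On a memory-bound kernel that usually helps, though the actual gain depends on the hardware and access pattern.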

Hand-written fused RMSNorm+SiLU*gate (the target Molten is closing in on):

Workload	Eager (3 ops)	Fused (1 kernel)	Speedup
decode	207 us	27 us	7.6x
prefill	347 us	97 us	3.6x
long	1327 us	403 us	3.3x

Why Not torch.compile?

torch.compile generates Triton code tied to PyTorch. You can't deploy it without the full Python + PyTorch + Triton stack.

Molten generates a .cu file. Ship it to TensorRT, ONNX Runtime, a C++ server, a Jetson, whatever. It's just CUDA.

Tested On

RTX 4090 | RTX 5090 | H100 SXM 80GB

Real model validation on Qwen2.5-7B (57 RMSNorm layers, hidden=3584).

Citation

@article{sharma2026molten,
  title={Molten: Fused GPU Kernel Generation from Mathematical Specifications},
  author={Sharma, Tushar},
  year={2026},
  url={https://github.com/TxsharDev/molten}
}

Roadmap

v0.1 (current) - IR, fusion engine, CUDA codegen, JIT runtime. RMSNorm and elementwise fusion proven. Scalar memory access.

v0.2 - Vectorized loads (float4/half2). This closes the gap where torch.compile currently wins at long sequences. fp16 I/O benchmarked end-to-end. @zero decorator dispatches generated kernels directly.

v0.3 - Attention fusion (softmax(Q @ K^T) @ V as one kernel). RoPE integration. Polyhedral loop optimization for complex fusion patterns. Auto-tuning via hardware counter feedback.

License

Apache-2.0 | ALIA Labs
