| Bilinear | Learned |
|---|---|
| ![]() | ![]() |

A tiny neural network that learns to reproduce a single image, written and trained end-to-end on the GPU in Mojo: the encoding, the fused MLP, the backward pass, and the optimizer are all written in one language. That is the point of the project: a worked example of the kind of neural-graphics technique Mojo lets you express and run on the GPU from a single, readable codebase. A learnable feature grid feeds a small MLP, and both the forward and backward passes run as fused kernels.
The network's only input is a pixel's (u, v) coordinate; its output is that
pixel's RGB. Train it on one image and it becomes a compact, continuous
representation of that image.
Neural rendering and ML-based graphics are a personal interest, and this project is a small experiment in writing one such technique end-to-end in Mojo.
The appeal is a single language for both the research code and the GPU kernels. Mojo's syntax stays close to Python, readable enough to nearly double as pseudocode, yet it compiles to and runs on the GPU, so a method can be easy to read and reproducible on real hardware from one self-contained repo. A lot of GPU neural-graphics work is split between Python and hand-written CUDA, which raises the bar for both reading and re-running it; closing that gap is what I think could empower this kind of research.
This implementation calls the Apple MMA directly, but the model code need not be tied to it: the same kernels could sit behind a small matrix-multiply abstraction that wraps the equivalent instructions on NVIDIA or AMD. A scalar-FMA variant is also provided to run on non-Apple backends.
Reconstruction of lake.ppm (800×550) over training. The top row is the
network's prediction; the bottom row is the learned 200×137 feature grid (its
three channels shown as RGB, min-max normalized per frame). Step 1 is the
randomly initialized network; the scene is legible within a few hundred steps;
step 40,000 resolves the chairs, foliage, and shoreline. The grid settles into a
coarse color-and-structure map that the MLP refines into the final image.
| | Step 1 | Step 10 | Step 100 | Step 1,000 | Step 40,000 |
|---|---|---|---|---|---|
| Prediction | ![]() | ![]() | ![]() | ![]() | ![]() |
| Feature grid | ![]() | ![]() | ![]() | ![]() | ![]() |

By 40k steps the reconstruction reaches ~24.6 dB PSNR, above the ~20.7 dB of a
bilinear baseline (lake.ppm downscaled to the feature-grid resolution and
upsampled back), so the network recovers detail a plain grid at that resolution
cannot. Snapshots are written at log-spaced steps during training. On
local runs, the final training loss settles at a full-image MSE of ~1e-2.
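For readers who want to reproduce the comparison, PSNR follows directly from MSE. The sketch below is plain Python rather than the repo's Mojo code, and it assumes pixel values normalized to [0, 1] (so the peak is 1.0); the function names are illustrative, not taken from the repo.

```python
import math


def mse(a, b):
    """Mean squared error between two equal-length sequences of floats."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def psnr_db(mse_value, peak=1.0):
    """Peak signal-to-noise ratio in dB for a given MSE.

    Assumes `peak` is the maximum possible pixel value (1.0 for
    normalized images, 255 for 8-bit data).
    """
    if mse_value <= 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse_value)
```

Because PSNR is logarithmic in MSE, the ~3.9 dB gap between the learned result and the bilinear baseline corresponds to roughly 2.5× lower mean squared error.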
Input encoding (encode): each sample's (u, v) bilinearly samples a
learnable feature grid (200×137×3) and is concatenated with a small positional
encoding (sub-cell sinusoids plus a global Fourier encoding: two octave
frequencies per axis, each a sin/cos pair), giving 13 logical channels padded to
16.
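A minimal CPU sketch of this encoding in plain Python may help make the channel count concrete. Several details are assumptions rather than facts about the repo: the corner-aligned mapping from (u, v) to grid coordinates, the exact form of the sub-cell sinusoids (taken here to be one sine per axis, which is what 13 − 3 − 8 = 2 channels suggests), and the base frequencies of the Fourier terms.

```python
import math

IN_WIDTH = 16  # padded channel count from the text


def bilinear_sample(grid, u, v):
    """Bilinearly sample grid[y][x][c] at (u, v) in [0, 1].

    Returns the interpolated features plus the fractional position
    inside the cell. Corner-aligned mapping is an assumption.
    """
    h, w = len(grid), len(grid[0])
    x, y = u * (w - 1), v * (h - 1)
    x0, y0 = min(int(x), w - 2), min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    feats = []
    for c in range(len(grid[0][0])):
        top = grid[y0][x0][c] * (1 - fx) + grid[y0][x0 + 1][c] * fx
        bot = grid[y0 + 1][x0][c] * (1 - fx) + grid[y0 + 1][x0 + 1][c] * fx
        feats.append(top * (1 - fy) + bot * fy)
    return feats, fx, fy


def encode(grid, u, v):
    """3 learned + 2 sub-cell + 8 Fourier = 13 channels, zero-padded to 16."""
    feats, fx, fy = bilinear_sample(grid, u, v)
    # Sub-cell sinusoids: assumed to be one sine of the in-cell offset per axis.
    sub_cell = [math.sin(2 * math.pi * fx), math.sin(2 * math.pi * fy)]
    # Global Fourier encoding: two octave frequencies per axis, sin/cos pairs.
    fourier = []
    for coord in (u, v):
        for octave in (0, 1):
            phase = 2 * math.pi * (2 ** octave) * coord
            fourier += [math.sin(phase), math.cos(phase)]
    channels = feats + sub_cell + fourier
    return channels + [0.0] * (IN_WIDTH - len(channels))
```

With a 200×137×3 grid, `encode(grid, u, v)` returns a 16-element list whose last three entries are padding.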
MLP (eval_layer): 16 → (ReLU) 32 → (ReLU) 32 → (Sigmoid) 16,
logically 13 → 32 → 32 → 3 (RGB): two hidden matmuls plus the output matmul
(NUM_LAYERS = 2). Each layer is one tiled X @ W over the MMA units: a 32-lane simdgroup
owns one 16×16 output tile, operands are F16/BF16, accumulation is F32. Per-lane
fragment coordinates are computed once (FragLayout) and reused across every
load/store/scatter.
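Stripped of tiling and half-precision operands, the forward pass reduces to three matrix products with activations in between. The following plain-Python sketch shows only that logical structure for a single sample; it assumes no bias terms and an arbitrary uniform initialization, neither of which is confirmed by the text, and it does not model the MMA tiles or `FragLayout`.

```python
import math
import random

IN_WIDTH, HIDDEN_WIDTH, OUTPUT_WIDTH = 16, 32, 16
NUM_LAYERS = 2  # hidden-layer count


def relu(x):
    return x if x > 0.0 else 0.0


def sigmoid(x):
    # Numerically stable for large |x|.
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(x, w):
    """Compute x @ W for a row vector x and W stored as w[in][out]."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def init_weights(seed=0, scale=0.1):
    """Hypothetical initializer: uniform in [-scale, scale], no biases."""
    rng = random.Random(seed)
    widths = [IN_WIDTH] + [HIDDEN_WIDTH] * NUM_LAYERS + [OUTPUT_WIDTH]
    return [
        [[rng.uniform(-scale, scale) for _ in range(n_out)] for _ in range(n_in)]
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]


def mlp_forward(x, weights):
    """16 -> ReLU 32 -> ReLU 32 -> Sigmoid 16; the first 3 outputs are RGB."""
    h = x
    for w in weights[:-1]:
        h = [relu(z) for z in matvec(h, w)]
    out = [sigmoid(z) for z in matvec(h, weights[-1])]
    return out[:3]
```

Calling `mlp_forward(encoded, init_weights())` on a 16-channel encoded sample returns an RGB triple in (0, 1). The GPU kernels compute the same products for a whole batch at once, one 16×16 output tile per simdgroup.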
Backward (bwd): a single fused kernel. It seeds the output gradient
(MSE × sigmoid derivative), walks the layers in reverse with transposed MMAs to
produce both weight gradients (atomic-accumulated) and input gradients, then
scatters the input-layer gradient back into the feature grid. On the MMA path
the gradient intermediates are cast to BF16 to run the backward matmuls at the
GPU's faster 16-bit rate (BF16's wider exponent range vs F16 holds the gradient
magnitudes); the scalar-FMA fallback keeps these intermediates in F32.
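The same chain of steps can be written out for one sample in plain Python. This is a reference sketch of the math only, not the fused kernel: it assumes a per-sample MSE averaged over the RGB channels, no bias terms, and full-precision arithmetic, and it returns per-sample gradients where the GPU version accumulates weight gradients across the batch with atomics.

```python
import math


def forward(x, weights):
    """Return all activations [x, h1, ..., y]; ReLU hidden, sigmoid output."""
    acts = [x]
    for k, w in enumerate(weights):
        z = [sum(acts[-1][i] * w[i][j] for i in range(len(w)))
             for j in range(len(w[0]))]
        if k == len(weights) - 1:
            acts.append([1.0 / (1.0 + math.exp(-v)) for v in z])
        else:
            acts.append([max(v, 0.0) for v in z])
    return acts


def backward(acts, weights, target):
    """Return (weight_grads, input_grad) for loss = mean((y - target)^2).

    Only the first len(target) output channels carry a loss; padded
    output channels get a zero gradient.
    """
    y, n = acts[-1], len(target)
    # Seed: dLoss/dy times the sigmoid derivative y * (1 - y).
    grad = [0.0] * len(y)
    for c in range(n):
        grad[c] = 2.0 * (y[c] - target[c]) / n * y[c] * (1.0 - y[c])

    weight_grads = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        a_in, w = acts[k], weights[k]
        # dW = a_in^T @ grad (outer product for a single sample).
        weight_grads[k] = [[a * g for g in grad] for a in a_in]
        # d(a_in) = grad @ W^T.
        grad_in = [sum(w[i][j] * grad[j] for j in range(len(grad)))
                   for i in range(len(a_in))]
        if k > 0:
            # ReLU derivative: pass the gradient only where the unit was active.
            grad_in = [g if a > 0.0 else 0.0 for g, a in zip(grad_in, a_in)]
        grad = grad_in
    return weight_grads, grad
```

The second return value is the gradient with respect to the encoded input; its learned-feature channels are what would be scattered back into the four grid cells each sample touched, weighted by the bilinear coefficients.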
Optimizer (update_weight, update_encoding): per-parameter Adam on F32
master weights, with the F16 matmul copy refreshed each step. The feature grid
uses per-cell Adam, stepping only the grid cells touched that iteration.
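As a rough illustration of the optimizer, the sketch below applies one Adam step to a list of master weights and then refreshes a half-precision copy. It is plain Python under stated assumptions: Python floats stand in for F32, the hyperparameters are common defaults rather than the repo's values, and the per-index step counters used for bias correction in the sparse case are a guess at how per-cell Adam might be handled.

```python
import math
import struct


def to_f16(x):
    """Round a float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


class AdamState:
    """First/second moments and a step counter per parameter."""

    def __init__(self, n):
        self.m = [0.0] * n
        self.v = [0.0] * n
        self.steps = [0] * n


def adam_step(master, grads, state, indices=None,
              lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update master weights in place and return a refreshed F16 copy.

    `indices` restricts the update to the listed parameters, mimicking
    a feature grid that only steps the cells touched this iteration.
    """
    if indices is None:
        indices = range(len(master))
    for i in indices:
        state.steps[i] += 1
        g = grads[i]
        state.m[i] = beta1 * state.m[i] + (1.0 - beta1) * g
        state.v[i] = beta2 * state.v[i] + (1.0 - beta2) * g * g
        m_hat = state.m[i] / (1.0 - beta1 ** state.steps[i])
        v_hat = state.v[i] / (1.0 - beta2 ** state.steps[i])
        master[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    # The matmul operands are a rounded copy; the master weights keep
    # the small updates that half precision would lose.
    return [to_f16(w) for w in master]
```

Keeping the update on the full-precision master copy matters because a step much smaller than the F16 spacing around a weight would otherwise round away to nothing.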
Each training step draws a fresh batch of 16,384 random UVs.
| File | Responsibility |
|---|---|
| `common.mojo` | Config, type aliases, activations, weight-layout helpers, and the backend-agnostic kernels (random, encode, update_weight, update_encoding) plus the ground-truth sampler and encoding scatter. |
| `fused_mlp_mma.mojo` | Apple-MMA backend: FragLayout and the fused tiled kernels (eval_layer, eval_layer_bwd, mlp_fwd_train, predict_mlp, bwd). |
| `fused_mlp_fma.mojo` | Scalar-FMA fallback kernels (one thread per row) for Metal GPUs without the 16×16 MMA. |
| `harness.mojo` | Host orchestration: device-capability detection, buffer init, launch grids, and the wrappers that dispatch MMA vs FMA. |
| `ppm.mojo` | Binary PPM (P6) reader / writer. |
| `main.mojo` | Driver: allocates device buffers, wires up initialization and the training loop, renders snapshots. |

- A GPU. The fused 16×16 `simdgroup_matrix` path needs an Apple M5 (developed on an M5 Max); any other GPU with F32 atomics falls back to the scalar-FMA kernels, chosen at runtime in `harness.mojo`.
- A Mojo toolchain targeting that GPU. On an Apple Silicon Mac a plain `mojo build` already targets `metal`.
`mojo main.mojo`

This trains for 40,000 steps and writes:

- `prediction_at_<step>.ppm`: log-spaced snapshots during training
- `feature_at_<step>.ppm`: the learned feature grid at each snapshot, normalized for viewing
- `prediction.ppm`: the final reconstruction
- `ground_truth.ppm`: the target, for side-by-side comparison

(`.ppm` opens directly in macOS Preview; to convert to PNG, run `sips -s format png in.ppm --out out.png`.)
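If you want to inspect or post-process the outputs outside the repo, binary PPM (P6) is simple enough to handle in a few lines. The following is a small Python sketch of a reader and writer, independent of `ppm.mojo`; it assumes 8-bit files (maxval 255), and the function names are illustrative.

```python
def write_ppm(path, width, height, rgb):
    """Write 8-bit RGB data (bytes, row-major, 3 bytes per pixel) as P6."""
    rgb = bytes(rgb)
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must hold width * height * 3 bytes")
    with open(path, "wb") as f:
        f.write(b"P6\n%d %d\n255\n" % (width, height))
        f.write(rgb)


def read_ppm(path):
    """Read an 8-bit P6 file; returns (width, height, rgb_bytes)."""
    with open(path, "rb") as f:
        data = f.read()
    tokens, pos = [], 0
    # Header: magic, width, height, maxval, separated by whitespace;
    # '#' starts a comment that runs to the end of the line.
    while len(tokens) < 4:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    if tokens[0] != b"P6":
        raise ValueError("not a binary PPM (P6) file")
    width, height, maxval = int(tokens[1]), int(tokens[2]), int(tokens[3])
    if maxval != 255:
        raise ValueError("this sketch only handles 8-bit PPM files")
    pos += 1  # exactly one whitespace byte separates header and pixels
    pixels = data[pos:pos + width * height * 3]
    if len(pixels) != width * height * 3:
        raise ValueError("truncated pixel data")
    return width, height, pixels
```

For example, `read_ppm("prediction.ppm")` returns the image size and its raw RGB bytes, which can then be compared against `ground_truth.ppm`.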
The knobs live at the top of common.mojo: BATCH_SIZE, the layer widths
(IN_WIDTH/HIDDEN_WIDTH/OUTPUT_WIDTH, all multiples of TILE=16),
NUM_LAYERS (hidden-layer count), the feature-grid resolution (ENC_W/ENC_H,
IN_LEARNED), and the activations (HIDDEN_ACT, OUTPUT_ACT). The step count
and learning rates are in main.mojo.
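To see how these knobs interact, here is a short Python sketch that mirrors the constant names with the values quoted elsewhere in this README and derives a few quantities from them. The helper functions are hypothetical, and the actual defaults in `common.mojo` may differ.

```python
TILE = 16
BATCH_SIZE = 16_384
IN_WIDTH, HIDDEN_WIDTH, OUTPUT_WIDTH = 16, 32, 16
NUM_LAYERS = 2                      # hidden-layer count
ENC_W, ENC_H, IN_LEARNED = 200, 137, 3


def check_widths():
    """Layer widths must be multiples of TILE for the tiled matmuls."""
    for name, width in (("IN_WIDTH", IN_WIDTH),
                        ("HIDDEN_WIDTH", HIDDEN_WIDTH),
                        ("OUTPUT_WIDTH", OUTPUT_WIDTH)):
        if width % TILE != 0:
            raise ValueError(f"{name}={width} is not a multiple of {TILE}")


def layer_shapes():
    """(in, out) shape of each matmul: NUM_LAYERS hidden plus one output."""
    widths = [IN_WIDTH] + [HIDDEN_WIDTH] * NUM_LAYERS + [OUTPUT_WIDTH]
    return list(zip(widths[:-1], widths[1:]))


def mlp_weight_count():
    return sum(n_in * n_out for n_in, n_out in layer_shapes())


def grid_value_count():
    return ENC_W * ENC_H * IN_LEARNED


def output_tiles(rows, cols):
    """Number of TILE x TILE output tiles for a rows x cols result."""
    return -(-rows // TILE) * (cols // TILE)  # ceil on rows, exact on cols
```

With these values the MLP holds 2,048 weights and the grid 82,200 values, against 1,320,000 channel values in the 800×550 RGB target, and one hidden-layer matmul over a full batch produces 2,048 output tiles.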
tiny-cuda-nn (Thomas Müller, NVIDIA; BSD-3-Clause): its fully-fused MLP and learnable feature-grid encoding, reimplemented here in Mojo for the Apple GPU.
Slang: its
automatic differentiation and atomic gradient accumulation informed the fused
bwd kernel.
Research and resources that inspired this work.
Feature grid + tiny MLP
- Instant Neural Graphics Primitives — Müller et al., SIGGRAPH 2022. Multiresolution hash grid + small MLP; the multi-level generalization of this repo's single learned grid.
- Compact Neural Graphics Primitives — Takikawa et al., SIGGRAPH Asia 2023. Shrinking the feature grid, which is the dominant memory cost.
- Variable Bitrate Neural Fields — Takikawa et al., SIGGRAPH 2022. Vector-quantized grids with level of detail.
Texture compression with random access
- Random-Access Neural Compression of Material Textures — Vaidyanathan et al. (NVIDIA), SIGGRAPH 2023. A per-material small MLP decodes texture sets at sample time, ~16× over block compression (RTXNTC SDK).
- Neural Graphics Texture Compression Supporting Random Access — Qualcomm, ECCV 2024. Sampled latents plus positional input feed a small fully-connected decoder.
Neural materials / appearance
- NeuMIP — Kuznetsov et al., SIGGRAPH 2021. Feature texture pyramids decoded by small per-material MLPs.
- Real-Time Neural Appearance Models — NVIDIA, SIGGRAPH 2024. Small MLPs plus learned latent textures, compiled to Slang.
- Real-Time Neural Materials using Block-Compressed Features — Weinreich et al., Eurographics 2024. Stores learned features in hardware BC texture format and decodes them with a lightweight shader.
On-GPU productization
- D3D12 Linear Algebra APIs
— Microsoft, Shader Model 6.10 (preview). Hardware-accelerated matrix
operations in HLSL at thread, wave, and threadgroup scope, intended for neural
rendering and in-shader network inference. The wave-scope matrix op is the
cross-vendor analog of the
`simdgroup_matrix` tiling used here.
For the same idea (miniature MLP) in Metal + Metal Shading Language: Apple's sample Training a neural network to render irradiance in real time. It trains on the GPU through Metal Performance Primitives operations.
The register-resident, no-shared-memory design follows Apple's Metal Performance Primitives Programming Guide, which recommends keeping matmul operands in registers and avoiding threadgroup (shared) memory on Apple GPUs.
MIT-0 (MIT No Attribution).
"Let's Learn a Lake in Mojo" is for learning Mojo and Neural Graphics. Be nice.











