pprovins/LearnALakeInMojo

Let's Learn a Lake in Mojo (Neural Image Fitting)

[Figure: bilinear baseline vs. learned reconstruction.]

A tiny neural network that learns to reproduce a single image, written and trained end-to-end on the GPU in Mojo. The encoding, the fused MLP, the backward pass, and the optimizer are all written in one language. That is the point of the project: a worked example of the kind of neural-graphics technique Mojo lets you express and run on the GPU from a single, readable codebase. A learnable feature grid feeds a small MLP, and both the forward and backward passes run as fused GPU kernels.

The network's only input is a pixel's (u, v) coordinate; its output is that pixel's RGB. Train it on one image and it becomes a compact, continuous representation of that image.

Motivation

Neural rendering and ML-based graphics are a personal interest, and this project is a small experiment in writing one such technique end-to-end in Mojo.

The appeal is a single language for both the research code and the GPU kernels. Mojo's syntax stays close to Python, readable enough to nearly double as pseudocode, yet it compiles to and runs on the GPU, so a method can be easy to read and reproducible on real hardware from one self-contained repo. A lot of GPU neural-graphics work is split between Python and hand-written CUDA, which raises the bar for both reading and re-running it; closing that gap is what I think could empower this kind of research.

This implementation calls Apple's MMA instructions directly, but the model code need not be tied to them: the same kernels could sit behind a small matrix-multiply abstraction that wraps the equivalent instructions on NVIDIA or AMD. An FMA variant is also provided to run on non-Apple backends.

Results

Reconstruction of lake.ppm (800×550) over training. The top row is the network's prediction; the bottom row is the learned 200×137 feature grid (its three channels shown as RGB, min-max normalized per frame). Step 1 is the randomly initialized network; the scene is legible within a few hundred steps; step 40,000 resolves the chairs, foliage, and shoreline. The grid settles into a coarse color-and-structure map that the MLP refines into the final image.

[Figure: predictions (top row) and feature grids (bottom row) at steps 1, 10, 100, 1,000, and 40,000.]

By 40k steps the reconstruction reaches ~24.6 dB PSNR, above the ~20.7 dB of a bilinear baseline (lake.ppm downscaled to the feature-grid resolution and upsampled back), so the network recovers detail that a plain grid at that resolution cannot. Snapshots are written at log-spaced steps during training. In local runs the final training loss settles at a full-image MSE of ~1e-2.
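For readers who want to relate the PSNR figures above to an error magnitude, here is a minimal sketch in plain Python. It assumes pixel values in [0, 1] (peak 1.0) and an MSE averaged over every pixel and channel; the repo may define its loss differently, and the function names are hypothetical.

```python
import math


def mse(pred, target):
    """Mean squared error over two equal-length sequences of floats."""
    assert len(pred) == len(target) and len(pred) > 0
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr_db(mse_value, peak=1.0):
    """PSNR in decibels. Assumption: the signal's maximum value is `peak`."""
    return 10.0 * math.log10(peak * peak / mse_value)


def mse_from_psnr(psnr, peak=1.0):
    """Invert psnr_db: the per-value MSE that corresponds to a PSNR."""
    return peak * peak / (10.0 ** (psnr / 10.0))
```

Under these assumptions a PSNR of 24.6 dB corresponds to a per-value MSE of roughly 3.5e-3, and 20.7 dB to roughly 8.5e-3. A loss that is summed rather than averaged over channels will not map onto these numbers directly.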

How it works

Input encoding (encode): each sample's (u, v) is used to bilinearly sample a learnable feature grid (200×137×3). The sampled features are concatenated with a small positional encoding (sub-cell sinusoids plus a global Fourier encoding: two octave frequencies per axis, each a sin/cos pair), giving 13 logical channels padded to 16.
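The encoding step can be sketched on the CPU in plain Python. This is a reference sketch, not the repo's encode kernel. The (u, v)-to-grid mapping, the exact form of the sub-cell sinusoids, and the 3 + 2 + 8 channel split are assumptions chosen to match the stated total of 13, and the function names are hypothetical.

```python
import math

TILE = 16  # matmul tile width named in the text


def bilinear_sample(grid, u, v):
    """Bilinearly sample grid[y][x] (a list of channels) at (u, v) in [0, 1].

    Returns the blended features plus the fractional position inside the cell.
    Assumption: u = 0 and u = 1 map to the first and last grid columns.
    """
    h, w = len(grid), len(grid[0])
    x = min(max(u, 0.0), 1.0) * (w - 1)
    y = min(max(v, 0.0), 1.0) * (h - 1)
    x0 = min(int(x), w - 2)  # keep x0 + 1 inside the grid
    y0 = min(int(y), h - 2)
    fx, fy = x - x0, y - y0
    g00, g01 = grid[y0][x0], grid[y0][x0 + 1]
    g10, g11 = grid[y0 + 1][x0], grid[y0 + 1][x0 + 1]
    feats = [
        (1 - fx) * (1 - fy) * g00[c] + fx * (1 - fy) * g01[c]
        + (1 - fx) * fy * g10[c] + fx * fy * g11[c]
        for c in range(len(g00))
    ]
    return feats, fx, fy


def encode(grid, u, v):
    """Build one padded input row: grid features + positional encoding."""
    feats, fx, fy = bilinear_sample(grid, u, v)
    # Assumption: one sinusoid per axis over the position inside the cell.
    sub_cell = [math.sin(2 * math.pi * fx), math.sin(2 * math.pi * fy)]
    # Two octave frequencies per axis, each a sin/cos pair: 8 channels.
    fourier = []
    for coord in (u, v):
        for octave in (0, 1):
            angle = 2 * math.pi * (2 ** octave) * coord
            fourier += [math.sin(angle), math.cos(angle)]
    row = feats + sub_cell + fourier          # 3 + 2 + 8 = 13 logical channels
    padding = (-len(row)) % TILE              # round up to a multiple of TILE
    return row + [0.0] * padding
```

With a three-channel grid, encode returns 16 values: 13 logical channels followed by three zero lanes that exist only to fill the 16-wide tile.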

MLP (eval_layer): 16 → (ReLU) 32 → (ReLU) 32 → (Sigmoid) 16, logically 13 → 32 → 32 → 3 (RGB): two hidden matmuls plus the output matmul (NUM_LAYERS = 2). Each layer is one tiled X @ W over the MMA units: a 32-lane simdgroup owns one 16×16 output tile, operands are F16/BF16, accumulation is F32. Per-lane fragment coordinates are computed once (FragLayout) and reused across every load/store/scatter.
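As a shape check on the layer description above, here is a scalar reference of the forward pass in plain Python. It is a sketch only: it ignores tiling, simdgroups, and reduced precision, assumes bias-free layers (the text only mentions X @ W), and uses a hypothetical uniform initializer.

```python
import math
import random

IN_WIDTH, HIDDEN_WIDTH, OUTPUT_WIDTH = 16, 32, 16
NUM_LAYERS = 2  # hidden-layer count


def relu(x):
    return x if x > 0.0 else 0.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def init_weights(rng):
    """One matrix per layer: 16x32, 32x32, 32x16. Init range is an assumption."""
    widths = [IN_WIDTH] + [HIDDEN_WIDTH] * NUM_LAYERS + [OUTPUT_WIDTH]
    return [
        [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
        for n_in, n_out in zip(widths[:-1], widths[1:])
    ]


def mlp_forward(x, weights):
    """ReLU on the hidden layers, sigmoid on the output, then keep RGB."""
    h = x
    for w in weights[:-1]:
        h = [relu(z) for z in matvec(h, w)]
    out = [sigmoid(z) for z in matvec(h, weights[-1])]
    return out[:3]  # lanes 3..15 are padding and carry no meaning
```

Under the tiling described above, a 16,384-row batch against a 32-wide layer decomposes into 1,024 × 2 output tiles of 16×16, each owned by one simdgroup.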

Backward (bwd): a single fused kernel. It seeds the output gradient (MSE × sigmoid derivative), walks the layers in reverse with transposed MMAs to produce both weight gradients (atomic-accumulated) and input gradients, then scatters the input-layer gradient back into the feature grid. On the MMA path the gradient intermediates are cast to BF16 to run the backward matmuls at the GPU's faster 16-bit rate (BF16's wider exponent range vs F16 holds the gradient magnitudes); the scalar-FMA fallback keeps these intermediates in F32.
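Two pieces of the backward pass can be written out in a few lines of plain Python: the output-gradient seed and the scatter back into the grid. This is a hedged sketch, not the fused kernel. It assumes the loss is the mean of squared errors over the output channels, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def mse_loss(z, target):
    """L = mean((sigmoid(z) - target)^2) over the output channels (assumed)."""
    return sum((sigmoid(v) - t) ** 2 for v, t in zip(z, target)) / len(z)


def seed_output_grad(z, target):
    """dL/dz: the MSE derivative times the sigmoid derivative y * (1 - y)."""
    n = len(z)
    grads = []
    for v, t in zip(z, target):
        y = sigmoid(v)
        grads.append((2.0 * (y - t) / n) * (y * (1.0 - y)))
    return grads


def scatter_to_grid(grid_grad, x0, y0, fx, fy, feat_grad):
    """Add one sample's feature gradient into the four cells it sampled.

    The weights are the same bilinear weights used in the forward pass.
    On a GPU, concurrent samples can hit the same cell, so this add would
    need to be atomic or otherwise serialized.
    """
    corners = [
        (y0, x0, (1 - fx) * (1 - fy)),
        (y0, x0 + 1, fx * (1 - fy)),
        (y0 + 1, x0, (1 - fx) * fy),
        (y0 + 1, x0 + 1, fx * fy),
    ]
    for y, x, weight in corners:
        for c, g in enumerate(feat_grad):
            grid_grad[y][x][c] += weight * g
```

Because the four bilinear weights sum to 1, each sample deposits exactly its feature gradient into the grid, split across the cell's corners. The seed can be checked against a central finite difference of mse_loss.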

Optimizer (update_weight, update_encoding): per-parameter Adam on F32 master weights, with the F16 matmul copy refreshed each step. The feature grid uses per-cell Adam, stepping only the grid cells touched that iteration.
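The optimizer logic can be illustrated with a scalar Adam step in plain Python. This is a sketch under stated assumptions: the hyperparameters are the common Adam defaults rather than the repo's values, each grid cell is assumed to keep its own step counter, and step_weight / step_touched_cells are hypothetical names, not the repo's update_weight / update_encoding kernels.

```python
import math
import struct


def to_f16(x):
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def adam_step(param, grad, m, v, t, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update for a scalar. Returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps), m, v


def new_state(value):
    """Full-precision master value plus Adam moments and a step counter."""
    return {"master": value, "m": 0.0, "v": 0.0, "t": 0}


def _advance(state, grad, lr):
    state["t"] += 1
    state["master"], state["m"], state["v"] = adam_step(
        state["master"], grad, state["m"], state["v"], state["t"], lr
    )


def step_weight(state, grad, lr=1e-2):
    """Update the master weight, then return the refreshed F16 matmul copy."""
    _advance(state, grad, lr)
    return to_f16(state["master"])


def step_touched_cells(cells, grads, lr=1e-2):
    """Step only the cells that received a gradient this iteration.

    `grads` maps cell index -> gradient; untouched cells keep their state.
    """
    for idx, g in grads.items():
        _advance(cells[idx], g, lr)
```

Keeping the master copy in full precision matters because a single small update can fall below half-precision resolution: to_f16(0.1) is already not equal to 0.1, so accumulating steps directly in F16 would lose them.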

Each training step draws a fresh batch of 16,384 random UVs.

Layout

| File | Responsibility |
| --- | --- |
| common.mojo | Config, type aliases, activations, weight-layout helpers, and the backend-agnostic kernels (random, encode, update_weight, update_encoding) plus the ground-truth sampler and encoding scatter. |
| fused_mlp_mma.mojo | Apple-MMA backend: FragLayout and the fused tiled kernels (eval_layer, eval_layer_bwd, mlp_fwd_train, predict_mlp, bwd). |
| fused_mlp_fma.mojo | Scalar-FMA fallback kernels (one thread per row) for Metal GPUs without the 16×16 MMA. |
| harness.mojo | Host orchestration: device-capability detection, buffer init, launch grids, and the wrappers that dispatch MMA vs FMA. |
| ppm.mojo | Binary PPM (P6) reader / writer. |
| main.mojo | Driver: allocates device buffers, wires up initialization and the training loop, renders snapshots. |

Requirements

  • A GPU. The fused 16×16 simdgroup_matrix path needs an Apple M5 (developed on an M5 Max); any other GPU with F32 atomics falls back to the scalar-FMA kernels, chosen at runtime in harness.mojo.
  • A Mojo toolchain targeting that GPU. On an Apple Silicon Mac a plain mojo build already targets Metal.

Run

mojo main.mojo

This trains for 40,000 steps and writes:

  • prediction_at_<step>.ppm: log-spaced snapshots during training
  • feature_at_<step>.ppm: the learned feature grid at each snapshot, normalized for viewing
  • prediction.ppm: the final reconstruction
  • ground_truth.ppm: the target, for side-by-side comparison

(.ppm files open directly in Preview on macOS; to convert one to PNG, run sips -s format png in.ppm --out out.png.)
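To inspect the output files programmatically, a binary PPM (P6) can be read and written in a few lines of plain Python. This is a minimal sketch of the file format with 8-bit samples (maxval 255), not a port of ppm.mojo; encode_ppm and decode_ppm are hypothetical names.

```python
def encode_ppm(width, height, pixels):
    """Serialize row-major RGB bytes (3 per pixel) as a binary P6 image."""
    pixels = bytes(pixels)
    if len(pixels) != width * height * 3:
        raise ValueError("pixel buffer does not match width * height * 3")
    return b"P6\n%d %d\n255\n" % (width, height) + pixels


def decode_ppm(data):
    """Parse a binary P6 image. Returns (width, height, pixel_bytes)."""
    tokens, pos = [], 0
    while len(tokens) < 4:  # magic, width, height, maxval
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":  # header comment runs to end of line
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    pos += 1  # exactly one whitespace byte separates maxval from pixel data
    if tokens[0] != b"P6":
        raise ValueError("not a binary PPM (P6) file")
    width, height, maxval = (int(tok) for tok in tokens[1:])
    if maxval != 255:
        raise ValueError("this sketch only handles 8-bit samples")
    pixels = data[pos:pos + width * height * 3]
    if len(pixels) != width * height * 3:
        raise ValueError("truncated pixel data")
    return width, height, pixels
```

To read a file from disk, pass its contents as bytes, for example decode_ppm(open("prediction.ppm", "rb").read()). An 800×550 image carries 1,320,000 bytes of pixel data after the header.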

Tuning

The knobs live at the top of common.mojo: BATCH_SIZE, the layer widths (IN_WIDTH/HIDDEN_WIDTH/OUTPUT_WIDTH, all multiples of TILE=16), NUM_LAYERS (hidden-layer count), the feature-grid resolution (ENC_W/ENC_H, IN_LEARNED), and the activations (HIDDEN_ACT, OUTPUT_ACT). The step count and learning rates are in main.mojo.

Acknowledgments

tiny-cuda-nn (Thomas Müller, NVIDIA; BSD-3-Clause): its fully-fused MLP and learnable feature-grid encoding, reimplemented here in Mojo for the Apple GPU.

Slang: its automatic differentiation and atomic gradient accumulation informed the fused bwd kernel.

Further reading

Research and resources that inspired this work.

Feature grid + tiny MLP

Texture compression with random access

Neural materials / appearance

On-GPU productization

  • D3D12 Linear Algebra APIs — Microsoft, Shader Model 6.10 (preview). Hardware-accelerated matrix operations in HLSL at thread, wave, and threadgroup scope, intended for neural rendering and in-shader network inference. The wave-scope matrix op is the cross-vendor analog of the simdgroup_matrix tiling used here.

See also

For the same idea (miniature MLP) in Metal + Metal Shading Language: Apple's sample Training a neural network to render irradiance in real time. It trains on the GPU through Metal Performance Primitives operations.

The register-resident, no-shared-memory design follows Apple's Metal Performance Primitives Programming Guide, which recommends keeping matmul operands in registers and avoiding threadgroup (shared) memory on Apple GPUs.

License

MIT-0 (MIT No Attribution).

"Let's Learn a Lake in Mojo" is for learning Mojo and Neural Graphics. Be nice.

About

Teaching a tiny MLP to paint a lake: end-to-end, on-GPU neural image fitting via fully fused kernels written entirely in Mojo.
