Let's Learn a Lake in Mojo (Neural Image Fitting)

[Figure: bilinear baseline and learned reconstruction, side by side.]

A tiny neural network that learns to reproduce a single image, written and trained end-to-end on the GPU in Mojo: the encoding, the fused MLP, the backward pass, and the optimizer are all written in one language. That is the point of the project: a worked example of the kind of neural-graphics technique Mojo lets you express and run on the GPU from a single, readable codebase. A learnable feature grid feeds a small MLP, and both the forward and backward passes run as fused GPU kernels.

The network's only input is a pixel's (u, v) coordinate; its output is that pixel's RGB. Train it on one image and it becomes a compact, continuous representation of that image.

Motivation

Neural rendering and ML-based graphics are a personal interest, and this project is a small experiment in writing one such technique end-to-end in Mojo.

The appeal is a single language for both the research code and the GPU kernels. Mojo's syntax stays close to Python, readable enough to nearly double as pseudocode, yet it compiles to and runs on the GPU, so a method can be easy to read and reproducible on real hardware from one self-contained repo. A lot of GPU neural-graphics work is split between Python and hand-written CUDA, which raises the bar for both reading and re-running it. Closing that gap is, I think, what could make this kind of research easier to read and to reproduce.

This implementation calls the Apple MMA instructions directly, but the model code need not be tied to them: the same kernels could sit behind a small matrix-multiply abstraction that wraps the equivalent instructions on NVIDIA or AMD. A scalar-FMA variant is also provided to run on non-Apple backends.

Results

Reconstruction of lake.ppm (800×550) over training. The top row is the network's prediction; the bottom row is the learned 200×137 feature grid (its three channels shown as RGB, min-max normalized per frame). Step 1 is the randomly initialized network; the scene is legible within a few hundred steps; step 40,000 resolves the chairs, foliage, and shoreline. The grid settles into a coarse color-and-structure map that the MLP refines into the final image.

[Figure grid: columns are steps 1, 10, 100, 1,000, and 40,000; rows are Prediction and Feature grid.]

By 40k steps the reconstruction reaches ~24.6 dB PSNR, above the ~20.7 dB of a bilinear baseline (lake.ppm downscaled to the feature-grid resolution and upsampled back), so the network recovers detail that a plain grid at that resolution cannot. Snapshots are written at log-spaced steps during training. In local runs the final training loss settles at a full-image MSE of ~1e-2.
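For readers who want to reproduce the PSNR comparison, here is a minimal sketch of the metric, written in Python rather than the repo's Mojo. It assumes pixel values normalized to [0, 1] and a peak of 1.0; the function names `mse` and `psnr_db` are illustrative and do not come from the repo.

```python
import math

def mse(pred, target):
    """Mean squared error over two equal-length sequences of values."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def psnr_db(pred, target, peak=1.0):
    """PSNR in decibels: 10 * log10(peak^2 / MSE). Assumes values in [0, peak]."""
    err = mse(pred, target)
    if err == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / err)
```

Both the learned reconstruction and the bilinear baseline would be scored against the same ground-truth pixels with a function of this shape.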

How it works

Input encoding (encode): each sample's (u, v) is used to bilinearly sample a learnable feature grid (200×137×3); the sampled features are concatenated with a small positional encoding of (u, v) (sub-cell sinusoids plus a global Fourier encoding: two octave frequencies per axis, each a sin/cos pair), giving 13 logical channels padded to 16.
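The bilinear lookup at the heart of the encoding can be sketched as follows. This is a Python illustration, not the repo's `encode` kernel: the grid is a nested list indexed as `grid[y][x][channel]`, and mapping u and v in [0, 1] onto the first and last texel centers is an assumption; the actual kernel may use a different texel convention.

```python
def bilinear_sample(grid, u, v):
    """Bilinearly interpolate a feature grid at (u, v) in [0, 1] x [0, 1].

    grid[y][x] is a list of channel values. Assumed convention: u = 0 maps to
    the first column and u = 1 to the last (likewise for v and rows).
    """
    h, w = len(grid), len(grid[0])
    x = min(max(u, 0.0), 1.0) * (w - 1)
    y = min(max(v, 0.0), 1.0) * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0

    channels = len(grid[0][0])
    out = []
    for c in range(channels):
        top = grid[y0][x0][c] * (1.0 - fx) + grid[y0][x1][c] * fx
        bottom = grid[y1][x0][c] * (1.0 - fx) + grid[y1][x1][c] * fx
        out.append(top * (1.0 - fy) + bottom * fy)
    return out
```

The four weights `(1 - fx)(1 - fy)`, `fx(1 - fy)`, `(1 - fx)fy`, and `fx * fy` are also the factors by which a gradient would be distributed to the four corner cells when the backward pass scatters into the grid.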

MLP (eval_layer): 16 → (ReLU) 32 → (ReLU) 32 → (Sigmoid) 16, logically 13 → 32 → 32 → 3 (RGB): two hidden matmuls plus the output matmul (NUM_LAYERS = 2). Each layer is one tiled X @ W over the MMA units: a 32-lane simdgroup owns one 16×16 output tile, operands are F16/BF16, accumulation is F32. Per-lane fragment coordinates are computed once (FragLayout) and reused across every load/store/scatter.
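As a reference for the layer structure only, the forward pass can be written as a plain scalar loop in Python. This sketch ignores tiling, simdgroups, and 16-bit operands entirely; the He-style initialization and the helper names (`init_weights`, `mlp_forward`) are assumptions made for illustration, and no bias terms are included because the description above does not mention any.

```python
import math
import random

def matvec(x, w):
    """Row vector x (length n_in) times matrix w (n_in x n_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def relu(values):
    return [max(0.0, a) for a in values]

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-a)) for a in values]

def init_weights(n_in, n_out, rng):
    """Hypothetical He-style init; the repo's initializer may differ."""
    scale = math.sqrt(2.0 / n_in)
    return [[rng.gauss(0.0, scale) for _ in range(n_out)] for _ in range(n_in)]

def mlp_forward(x16, weights):
    """16 -> 32 -> 32 -> 16 with ReLU hidden layers and a sigmoid output.

    Only the first 3 of the 16 padded output channels are returned as RGB.
    """
    h = x16
    for w in weights[:-1]:
        h = relu(matvec(h, w))
    out = sigmoid(matvec(h, weights[-1]))
    return out[:3]
```

With padded widths `[16, 32, 32, 16]`, `weights` holds three matrices, matching the two hidden matmuls plus the output matmul described above.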

Backward (bwd): a single fused kernel. It seeds the output gradient (the MSE derivative times the sigmoid derivative), walks the layers in reverse with transposed MMAs to produce both weight gradients (atomic-accumulated) and input gradients, then scatters the input-layer gradient back into the feature grid. On the MMA path the gradient intermediates are cast to BF16 to run the backward matmuls at the GPU's faster 16-bit rate (BF16's wider exponent range vs F16 holds the gradient magnitudes); the scalar-FMA fallback keeps these intermediates in F32.
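The seed gradient at the output can be checked in isolation. The Python sketch below assumes the loss is the mean of squared errors over the output channels of one sample; the normalization constant used in the repo (for example, an extra division by batch size) is not stated above, so the `1/n` factor here is an assumption, as are the function names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def mse_loss(z, target):
    """Mean squared error between sigmoid(z) and target, over channels."""
    n = len(z)
    return sum((sigmoid(zi) - ti) ** 2 for zi, ti in zip(z, target)) / n

def seed_output_grad(z, target):
    """dLoss/dz for each pre-activation z: MSE derivative times sigmoid derivative."""
    n = len(z)
    grads = []
    for zi, ti in zip(z, target):
        y = sigmoid(zi)
        dloss_dy = 2.0 * (y - ti) / n   # derivative of the mean squared error
        dy_dz = y * (1.0 - y)           # derivative of the sigmoid
        grads.append(dloss_dy * dy_dz)
    return grads
```

Comparing `seed_output_grad` against a central finite difference of `mse_loss` is a quick way to validate this step before testing the rest of a backward pass.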

Optimizer (update_weight, update_encoding): per-parameter Adam on F32 master weights, with the F16 matmul copy refreshed each step. The feature grid uses per-cell Adam, stepping only the grid cells touched that iteration.
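A scalar Python sketch of this update follows. It uses the standard Adam formulas with bias correction; the per-entry step counter `t`, the `touched` mask, and the default hyperparameters are assumptions made to illustrate "step only what was touched", not details confirmed by the repo. Half precision is emulated with the `struct` module's `e` format.

```python
import math
import struct

def to_f16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def adam_step(params, grads, m, v, t, touched=None,
              lr=1e-2, beta1=0.9, beta2=0.999, eps=1e-8):
    """Update full-precision master params in place; return a half-precision copy.

    t[i] counts how many times entry i has been stepped (assumed per-entry).
    Entries whose touched[i] is False are skipped entirely.
    """
    for i, g in enumerate(grads):
        if touched is not None and not touched[i]:
            continue
        t[i] += 1
        m[i] = beta1 * m[i] + (1.0 - beta1) * g
        v[i] = beta2 * v[i] + (1.0 - beta2) * g * g
        m_hat = m[i] / (1.0 - beta1 ** t[i])
        v_hat = v[i] / (1.0 - beta2 ** t[i])
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    # Refresh the low-precision copy that the matmuls would read.
    return [to_f16(p) for p in params]
```

Keeping the master copy in full precision means small updates are not lost to half-precision rounding, while the matmul operands stay 16-bit.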

Each training step draws a fresh batch of 16,384 random UVs.

Layout

File	Responsibility
`common.mojo`	Config, type aliases, activations, weight-layout helpers, and the backend-agnostic kernels (`random`, `encode`, `update_weight`, `update_encoding`) plus the ground-truth sampler and encoding scatter.
`fused_mlp_mma.mojo`	Apple-MMA backend: `FragLayout` and the fused tiled kernels (`eval_layer`, `eval_layer_bwd`, `mlp_fwd_train`, `predict_mlp`, `bwd`).
`fused_mlp_fma.mojo`	Scalar-FMA fallback kernels (one thread per row) for Metal GPUs without the 16×16 MMA.
`harness.mojo`	Host orchestration: device-capability detection, buffer init, launch grids, and the wrappers that dispatch MMA vs FMA.
`ppm.mojo`	Binary PPM (P6) reader / writer.
`main.mojo`	Driver: allocates device buffers, wires up initialization and the training loop, renders snapshots.
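For orientation, the binary PPM (P6) format handled by `ppm.mojo` is simple enough to sketch in a few lines. The Python below is an illustration, not a port of that file: it assumes 8-bit samples (maxval 255), and the names `encode_ppm` and `decode_ppm` are hypothetical.

```python
def encode_ppm(width, height, rgb):
    """Serialize interleaved 8-bit RGB bytes as a binary PPM (P6) file body."""
    if len(rgb) != width * height * 3:
        raise ValueError("rgb must contain width * height * 3 bytes")
    return b"P6\n%d %d\n255\n" % (width, height) + bytes(rgb)

def decode_ppm(data):
    """Parse a P6 file body into (width, height, rgb_bytes). Assumes maxval 255."""
    pos = 0
    tokens = []
    while len(tokens) < 4:
        # Skip whitespace and '#' comments between header tokens.
        while data[pos:pos + 1].isspace() or data[pos:pos + 1] == b"#":
            if data[pos:pos + 1] == b"#":
                while data[pos:pos + 1] not in (b"\n", b""):
                    pos += 1
            else:
                pos += 1
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        if start == pos:
            raise ValueError("truncated PPM header")
        tokens.append(data[start:pos])
    pos += 1  # exactly one whitespace byte separates the header from pixel data
    magic, width, height, maxval = tokens[0], int(tokens[1]), int(tokens[2]), int(tokens[3])
    if magic != b"P6" or maxval != 255:
        raise ValueError("only 8-bit P6 files are supported in this sketch")
    rgb = data[pos:pos + width * height * 3]
    if len(rgb) != width * height * 3:
        raise ValueError("truncated pixel data")
    return width, height, rgb
```

A round trip through `encode_ppm` and `decode_ppm` returns the original dimensions and bytes.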

Requirements

A GPU. The fused 16×16 simdgroup_matrix path needs an Apple M5 (developed on an M5 Max); any other GPU with F32 atomics falls back to the scalar-FMA kernels, chosen at runtime in harness.mojo.
A Mojo toolchain targeting that GPU. On an Apple Silicon Mac, a plain mojo build already targets Metal.

Run

mojo main.mojo

This trains for 40,000 steps and writes:

prediction_at_<step>.ppm: log-spaced snapshots during training
feature_at_<step>.ppm: the learned feature grid at each snapshot, normalized for viewing
prediction.ppm: the final reconstruction
ground_truth.ppm: the target, for side-by-side comparison
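One plausible way to generate a log-spaced snapshot schedule is sketched below in Python. The exact schedule in `main.mojo` is not documented here, so the choice of powers of ten plus the final step is an assumption, and `snapshot_steps` and `snapshot_names` are hypothetical names; only the `prediction_at_<step>.ppm` and `feature_at_<step>.ppm` patterns come from the list above.

```python
def snapshot_steps(total_steps):
    """Powers of ten up to total_steps, plus the final step (assumed schedule)."""
    steps = set()
    s = 1
    while s <= total_steps:
        steps.add(s)
        s *= 10
    steps.add(total_steps)
    return sorted(steps)

def snapshot_names(step):
    """File names written for one snapshot, following the patterns listed above."""
    return ("prediction_at_%d.ppm" % step, "feature_at_%d.ppm" % step)
```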

(.ppm files open directly in macOS Preview; to convert, use sips -s format png in.ppm --out out.png.)

Tuning

The knobs live at the top of common.mojo: BATCH_SIZE, the layer widths (IN_WIDTH/HIDDEN_WIDTH/OUTPUT_WIDTH, all multiples of TILE=16), NUM_LAYERS (hidden-layer count), the feature-grid resolution (ENC_W/ENC_H, IN_LEARNED), and the activations (HIDDEN_ACT, OUTPUT_ACT). The step count and learning rates are in main.mojo.
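To see how these knobs relate, the following Python sketch restates the values quoted in this README and derives the layer shapes and parameter counts from them. The constant names mirror the ones listed above, but the values are read off this document rather than from `common.mojo`; `IN_LEARNED = 3` is inferred from the 200×137×3 grid, and bias terms are not counted because none are described.

```python
TILE = 16
BATCH_SIZE = 16_384
IN_WIDTH, HIDDEN_WIDTH, OUTPUT_WIDTH = 16, 32, 16
NUM_LAYERS = 2                      # hidden-layer count
ENC_W, ENC_H, IN_LEARNED = 200, 137, 3

def layer_shapes():
    """(rows, cols) of each weight matrix, input to output."""
    widths = [IN_WIDTH] + [HIDDEN_WIDTH] * NUM_LAYERS + [OUTPUT_WIDTH]
    if any(w % TILE != 0 for w in widths):
        raise ValueError("every layer width must be a multiple of TILE")
    return list(zip(widths, widths[1:]))

def parameter_counts():
    """Return (MLP weight count, feature-grid value count)."""
    mlp = sum(rows * cols for rows, cols in layer_shapes())
    grid = ENC_W * ENC_H * IN_LEARNED
    return mlp, grid
```

With these values the MLP holds 2,048 padded weights and the grid 82,200 values, about 84 thousand in total, against the 800×550×3 = 1,320,000 values of the source image.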

Acknowledgments

tiny-cuda-nn (Thomas Müller, NVIDIA; BSD-3-Clause): its fully-fused MLP and learnable feature-grid encoding, reimplemented here in Mojo for the Apple GPU.

Slang: its automatic differentiation and atomic gradient accumulation informed the fused bwd kernel.

License

MIT-0 (MIT No Attribution).

"Let's Learn a Lake in Mojo" is for learning Mojo and Neural Graphics. Be nice.
