diff --git a/Cargo.lock b/Cargo.lock
index fe0ad1f86..b395bdcb9 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -3939,6 +3939,7 @@ name = "lpir"
 version = "40.0.0"
 dependencies = [
  "libm",
+ "log",
  "lps-q32",
 ]
 
@@ -4072,6 +4073,7 @@ dependencies = [
  "cranelift-native 0.127.0",
  "cranelift-object",
  "libm",
+ "log",
  "lp-riscv-elf",
  "lpir",
  "lps-builtin-ids",
diff --git a/docs/design/optimization/inline.md b/docs/design/optimization/inline.md
new file mode 100644
index 000000000..c78c6a930
--- /dev/null
+++ b/docs/design/optimization/inline.md
@@ -0,0 +1,326 @@
+# LPIR inlining pass
+
+Function inlining for LPIR. Lives in `lp-shader/lpir/src/inline/`, exposed as
+`lpir::inline_module(&mut LpirModule, &InlineConfig) -> InlineResult`.
+
+## Goals
+
+1. **Reduce call overhead** on the rv32n target. Local LPIR calls lower to a
+   prologue / argument shuffle / `jal` / epilogue per call site; for tiny
+   helpers this overhead dominates the body.
+2. **Enable downstream constant folding.** Inlined parameters often become
+   constants at the call site, opening folding and dead-code opportunities
+   the const-fold pass alone cannot reach across a call boundary.
+3. **Stay embedded-friendly.** The pass mutates the module in place, is
+   allocation-bounded, and uses `BTreeMap` / `Vec`: no recursion in the
+   algorithm, no large temporaries.
+
+Non-goals: cross-module inlining (imports are never inlined), inlining
+through indirect calls (LPIR has none), removing functions that became
+unreachable after inlining (handled separately by a future pass).
+
+## Algorithm
+
+Bottom-up over the local call graph. Each callee is considered exactly
+once, after every function it calls has been processed.
+
+1. **Build the call graph** from `LpirOp::Call` ops. Imports are excluded
+   (`CalleeRef::Import` does not introduce an edge). The graph stores
+   callees-of, callers-of, and `(op_idx, callee)` call sites per caller —
+   all keyed by `BTreeMap<FuncId, …>` for deterministic iteration over a
+   sparse `FuncId` space.
+2. **Topological sort** (Kahn's, leaves first). Nodes whose `callees_of[g]`
+   is empty come first; cycles are extracted separately and reported as
+   `functions_skipped_recursive`. Isolated functions (no incoming or
+   outgoing local calls) are still emitted in the order so that orphan
+   leaves are not lost.
+3. For each callee in topo order:
+   - Apply the [heuristic](#heuristic) to decide whether to inline.
+   - If yes, [splice](#splicer) the callee body into every caller call
+     site. The callee `IrFunction` itself is left in the module.
+4. After the loop, recompute control-flow offsets once per mutated caller
+   (see [offset recompute](#offset-recompute)).
+
+The callee is *not* deleted after inlining — call sites are replaced but
+the body remains addressable via `FuncId`. A future "dead function" pass
+may sweep what is no longer reachable from entry points.
+
+## Splicer
+
+`splice::inline_call_site(caller, callee, call_op_idx)` replaces a single
+`LpirOp::Call` with a remapped copy of the callee body.
+
+### Steps
+
+1. **Arity check** between the call's `args` / `results` and the callee's
+   `param_count` / `return_types`. On mismatch the call site is left
+   unchanged (a `debug_assert!` fires in debug builds).
+2. **Param-write scan** (`scan_param_writes`) walks the callee body and
+   marks any parameter VReg that is the destination of any op via
+   `LpirOp::def_vreg`. Read-only params can be aliased; written params
+   need a private copy.
+3. **Build the remap** (`build_remap`):
+   - `vreg_table[0]` always maps to `VMCTX_VREG` in the caller (`vmctx`
+     is a process-wide singleton; aliasing is safe and required for
+     pointer identity through chained calls).
+   - Each read-only param maps to the matching argument VReg in the
+     caller (alias).
+   - Each mutated param allocates a fresh caller VReg of the callee's
+     type, emitted as a leading `LpirOp::Copy { dst: new, src: arg }`
+     in the spliced scratch.
+   - Non-param VRegs map to fresh caller VRegs of matching type.
+   - Slots are translated by `slot_offset`, the value of `caller.slots.len()`
+     before `caller.slots` is extended with the callee's slots.
+   - `vreg_pool` ranges from the callee are appended to
+     `caller.vreg_pool` and recorded as a base offset for `VRegRange`
+     translation.
+4. **Classify return shape** of the callee body:
+   - `None` — body has no `LpirOp::Return`.
+   - `SingleAtEnd` — exactly one `Return` and it is the last op.
+   - `Multi` — anything else (early returns or multiple returns).
+5. **Build the scratch** `Vec<LpirOp>`:
+   - Emit `Copy` ops for each mutated parameter.
+   - Walk the callee body, emit `remap_op(op)` for each non-`Return`.
+     `Return` ops are emitted as the appropriate `Copy`s into the
+     caller's `results` VRegs:
+     - `SingleAtEnd` and `None`: a flat sequence of `Copy { dst: results[i], src: ret_vals[i] }`.
+     - `Multi`: the entire spliced body is wrapped as
+       `Block { end_offset: 0 } … ExitBlock End` and each in-body
+       `Return` becomes the `Copy` sequence followed by `ExitBlock`.
+       This preserves early-exit semantics in structured control flow.
+6. **Splice** the scratch into `caller.body` at `call_op_idx`,
+   replacing the `Call` op (`Vec::splice` of length 1).
+
+`end_offset` fields on `Block` / `IfStart` / `LoopStart` and the `Switch`
+family are left set to `0` in the splicer; the [offset recompute](#offset-recompute)
+fixes them after all splicing for that caller is done.
+
+### Why scan-then-alias-or-copy
+
+GLSL by-value parameters are mutable inside the function. A naive "always
+copy" strategy spends `Copy` ops the const-folder can rarely remove. A
+naive "always alias" strategy is unsound when the callee writes through
+the param. The scan is `O(callee.body.len())` and is the cheapest way to
+get aliasing for the common read-only case (the majority of helpers) and
+correctness for the rest.
+
+`vmctx` (`VReg(0)`) is a special case: it is never written by any
+function and aliases unconditionally.
+
+## Offset recompute
+
+Control-flow ops carry cached offsets — `IfStart::else_offset`,
+`IfStart::end_offset`, `LoopStart::end_offset`,
+`LoopStart::continuing_offset`, `Block::end_offset`,
+`SwitchStart::end_offset`, etc. Splicing inserts ops at arbitrary
+positions and invalidates every offset in or around the spliced range.
+
+Rather than thread incremental fixups through the splicer,
+`offsets::recompute_offsets(&mut Vec<LpirOp>)` runs once per mutated
+caller after all splicing for that caller is complete. It does a
+single stack-walk of the body and re-derives every offset structurally,
+matching `FunctionBuilder` conventions.
+
+This requires structural markers for every control region. The
+`continuing` block of a loop previously had only a cached
+`continuing_offset` and no marker op, which made structural recompute
+ambiguous. Stage III (M2.5) added [`LpirOp::Continuing`](#continuing-marker)
+to fix this.
+
+### `Continuing` marker
+
+`LpirOp::Continuing` is emitted at the start of a loop's continuing
+block. Backends still consume `LoopStart::continuing_offset` for fast
+branch-target lookup; the marker is what lets the recompute pass
+re-derive that cached value structurally. The marker is a no-op at
+runtime and lowers to nothing on every backend.
+
+## Configuration
+
+`InlineConfig` (`lp-shader/lpir/src/compiler_config.rs`):
+
+| field | default | meaning |
+|---|---|---|
+| `mode` | `Auto` | `Never` skips everything; `Always` ignores the size threshold; `Auto` consults `small_func_threshold`. |
+| `always_inline_single_site` | `true` | When `Auto`, inline a callee that has exactly one call site even if it is over `small_func_threshold`. |
+| `small_func_threshold` | `16` | Maximum `func_weight` for "small" callees that are inlined unconditionally under `Auto`. See [empirical tuning](#empirical-tuning). |
+| `max_growth_budget` | `None` | Per-callee cap on `weight × callsite_count`; on overflow the callee is skipped and processing continues. |
+| `module_op_budget` | `None` | Module-wide cap on total ops projected after inlining a callee; on overflow the pass stops early and `InlineResult::budget_exceeded = true`. |
+
+Fields are settable via `// compile-opt(inline.<field>, <value>)` directives
+in shader source.
+
+## Heuristic
+
+`should_inline(weight, callsite_count, current_module_op_count, config)`
+returns one of:
+
+| decision | when |
+|---|---|
+| `Inline` | All gates pass. |
+| `SkipMode` | `mode == Never`. |
+| `SkipTooLarge { weight, threshold }` | `Auto`, `weight > threshold`, and not (single call site with `always_inline_single_site`). |
+| `SkipBudget { reason: MaxGrowth, … }` | `weight × sites > max_growth_budget`. Per-callee skip; pass continues. |
+| `SkipBudget { reason: ModuleTotal, … }` | Projected module ops would exceed `module_op_budget`. Pass stops; further callees not considered this run. |
+
+The two skip-budget variants behave differently because per-callee
+budgeting is a local decision (other callees may still fit), while
+module-total budgeting is monotonic over remaining work — there is no
+point continuing once we've crossed it.
+
+## `func_weight`
+
+Production weight is the simplest possible:
+
+```rust
+fn func_weight(func: &IrFunction) -> usize {
+    func.body.len()
+}
+```
+
+Three candidates were evaluated empirically in M3.1; all three remain
+public under `lpir::inline_weights::{weight_body_len, weight_markers_zero, weight_heavy_bias}`
+and a `WeightKind` dispatcher, retained for re-tuning when the cost
+model shifts (e.g. switching to a different rv32 backend).
+
+| candidate | rule | combined Pearson r vs `rv32n_insns` |
+|---|---|---|
+| `body_len` (production) | `func.body.len()` | **0.980** |
+| `markers_zero` | All ops weight 1 except structural markers (`IfStart`, `Else`, `Continuing`, `LoopStart`, `*Start`, `End`, `Block`, `ExitBlock`, `Break`, `Continue`, `Return`) which weight 0. | 0.974 |
+| `heavy_bias` | `markers_zero` + `Call=5`, `Memcpy=4`, `Fsqrt=4`, `Fdiv`/`IdivS`/`IdivU`/`IremS`/`IremU`=3. | 0.962 |
+
+`body_len` won linear correlation and is the simplest. `markers_zero`
+adds branching for negligible gain — structural ops are a small fraction
+of body length for typical code. `heavy_bias` over-penalizes ops that
+lower to one hardware instruction (e.g. `FSQRT.S`) on the rv32n backend;
+the resulting weight distorts the cliff at which the threshold sits.
+
+## Empirical tuning
+
+`small_func_threshold = 16` was picked from the M3.1 corpus
+(`lp-shader/lps-filetests/filetests/debug/inline-weights.glsl` plus the
+existing `rainbow.glsl`) by mapping `body_len` to measured rv32n
+instruction count. Selected representative rows:
+
+| function | body_len | rv32n insns |
+|---|---|---|
+| `iw_clamp01` | 7 | 25 |
+| `iw_lerp` | 10 | 33 |
+| `iw_mul3` | 12 | 46 |
+| `iw_add3` | **16** | 51 |
+| `iw_fold_rgb` | 18 | **85** |
+| `paletteFire` | 22 | 104 |
+| `applyPalette` | 42 | 148 |
+| `rainbow_main` | 154 | 541 |
+
+`body_len ≤ 16` cleanly captures every corpus function that lowers to
+≤ ~50 rv32n insns (well under the M3.1 target of ≤ 64) without picking
+up `iw_fold_rgb` at 85.
+
+Re-tune by running `lp-cli shader-debug --weights …` against the
+corpus; the flag emits `body_len` / `mz` / `hb` columns next to the
+existing rv32n / rv32c counts. See
+`docs/roadmaps/2026-04-15-lpir-inliner/m3.1-tune-inline-weights.md` for
+the methodology.
+
+## Recursion
+
+Local call graphs may contain cycles (GLSL 4.50 forbids recursion, but
+LPIR does not). The inliner detects cycles during the topological sort
+and counts their members in `InlineResult::functions_skipped_recursive`.
+Bodies of recursive functions are not modified.
+
+Imports never participate in the call graph and are never inlined.
+
+## Determinism
+
+All adjacency structures are `BTreeMap<FuncId, …>` and call-site lists
+are sorted by op index (descending, so splicing one site never shifts
+the sites still to be processed). Topological sort, splicer, and offset
+recompute are all deterministic functions of the input module. Re-running
+`inline_module` on identical input yields byte-identical output.
+
+## Logging
+
+Decisions and a per-run summary are emitted via the `log` crate at
+`debug` and `info` levels. Embedded builds depend on `log` with
+`default-features = false`; when no logger is installed, the calls
+dispatch to `log`'s built-in no-op logger.
+
+```
+inline: callee=FuncId(3) weight=12 sites=2 module_ops=87 decision=inline
+inline: callee=FuncId(7) skip too_large weight=42 threshold=16
+inline: callee=FuncId(9) skip budget projected=400 budget=300 reason=ModuleTotal
+inline: done inlined=4 sites=11 skipped_recursive=1 budget_exceeded=false
+```
+
+## File layout
+
+```
+lp-shader/lpir/src/inline/
+├── mod.rs          # InlineResult, inline_module orchestration
+├── callgraph.rs    # CallGraph, build, topo_order
+├── heuristic.rs    # func_weight, weight candidates, should_inline
+├── offsets.rs      # recompute_offsets
+├── remap.rs        # ParamWriteMask, scan_param_writes, Remap, remap_op
+└── splice.rs       # inline_call_site
+```
+
+Public surface from `lpir`:
+
+```rust
+pub fn inline_module(&mut LpirModule, &InlineConfig) -> InlineResult;
+pub struct InlineResult { … }            // counters
+pub mod inline_weights {                 // M3.1 candidates, re-tuning
+    pub enum WeightKind { BodyLen, MarkersZero, HeavyBias }
+    pub fn weight(WeightKind, &IrFunction) -> usize;
+    pub fn weight_body_len(&IrFunction) -> usize;
+    pub fn weight_markers_zero(&IrFunction) -> usize;
+    pub fn weight_heavy_bias(&IrFunction) -> usize;
+}
+```
+
+Everything else (`CallGraph`, `Remap`, `splice::*`, `should_inline`,
+`Decision`) is `pub(crate)`.
+
+## Alternatives considered
+
+### Top-down inlining
+
+Walking from entry points down would let the heuristic see specialized
+parameters (constants flowing through) before deciding. It would also
+make budget accounting easier (you stop when you hit the budget at any
+depth). Bottom-up was chosen because it composes: by the time we
+consider `f`, every callee inside `f` has already been processed, so
+`weight(f)` reflects the *post-inline* size of `f`. Top-down would
+require either a fixed-point loop or per-call-site re-evaluation.
+
+### Inlining-with-deletion
+
+Removing a callee `IrFunction` from the module after every call site
+has been spliced would shrink the module and reduce subsequent
+serialization cost. It would also require fixing every other reference
+to that `FuncId` (none exist in LPIR today, but a future pass could
+add them) and would make incremental recompilation harder. The chosen
+design leaves the function in place; a separate "dead function" pass
+can sweep unreachable functions when needed.
+
+### Per-call-site cost model
+
+A more accurate heuristic would weight each call site by the cost of
+the surrounding call (argument shuffle, return-value movement) so that
+a 3-op leaf inlined twenty times in the same loop is preferred over a
+3-op leaf called once in cold code. The current pass treats every site
+uniformly. The simpler model is sufficient at present module sizes;
+revisiting requires a profile-driven workflow that does not yet exist.
+
+### Smarter weight functions
+
+`weight_markers_zero` and `weight_heavy_bias` were designed to better
+predict rv32n instruction count. Empirically (M3.1) they did not beat
+`body.len()` as a linear predictor, and `heavy_bias`'s non-linearity
+distorts the threshold cliff in the wrong direction (over-penalizing
+fast hardware ops like `FSQRT.S`). They remain available as public
+candidates so a future cost-model change (different backend, SIMD
+expansion, etc.) can be evaluated without re-deriving the
+infrastructure.
diff --git a/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/00-design.md b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/00-design.md
new file mode 100644
index 000000000..477dd3046
--- /dev/null
+++ b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/00-design.md
@@ -0,0 +1,93 @@
+# Design — `lpir-inliner` stage ii (M1 `CompilerConfig` + filetest `compile-opt`)
+
+## Scope of work
+
+Implement **M1** from `docs/roadmaps/2026-04-15-lpir-inliner/m1-optpass-filetests.md`:
+
+- Introduce **`lpir::CompilerConfig`** (and **`InlineConfig`**, **`InlineMode`**, **`ConfigError`**) as **`no_std` + `alloc`** middle-end options for LPIR optimization passes.
+- Thread **`config: CompilerConfig`** through **`lpvm-native`**, **`lpvm-cranelift`**, and **`lpvm-wasm`** option structs; **`Default`** uses **`CompilerConfig::default()`**.
+- Add filetest directive **`// compile-opt(key, value)`**, **`TestFile::config_overrides`**, duplicate-key errors, and merge overrides before compilation in **`filetest_lpvm`** for **all** backends.
+
+**No behavior change** for existing tests until files add **`compile-opt`** and later milestones wire the inliner to read **`InlineConfig`**.
+
+**Out of scope:** M0 **`CalleeRef`** refactor (parallel plan); inliner body; tagging **`filetests/function/*.glsl`** with **`compile-opt`** (optional follow-up).
+
+See **`00-notes.md`** for resolved questions.
+
+## Implementation granularity
+
+Prefer **keeping the workspace building and tests passing after each phase** (additive `Default` fields and plumbing). If M0 lands in parallel and causes transient conflicts, resolve before declaring the plan complete.
+
+## File structure (relevant areas)
+
+```
+lp-shader/lpir/src/
+├── compiler_config.rs          # NEW: CompilerConfig, InlineConfig, InlineMode, ConfigError, apply, FromStr
+└── lib.rs                      # UPDATE: mod + re-exports
+
+lp-shader/lpvm-native/src/
+├── native_options.rs           # UPDATE: + config; Clone not Copy
+├── compile.rs                  # UPDATE: pass options.config where passes need it (inline = later; may no-op for M1)
+└── …                           # UPDATE: any NativeCompileOptions { … } literals
+
+lp-shader/lpvm-cranelift/src/
+├── compile_options.rs          # UPDATE: + config; likely Clone only
+└── …                           # UPDATE: struct literals, engine paths
+
+lp-shader/lpvm-wasm/src/
+├── options.rs                  # UPDATE: + config; likely Clone only
+└── …
+
+lp-shader/lps-filetests/src/parse/
+├── parse_compile_opt.rs        # NEW: // compile-opt(key, value)
+├── mod.rs                      # UPDATE: try compile-opt before @ annotations; duplicate keys
+├── test_type.rs                # UPDATE: TestFile::config_overrides
+└── parse_annotation.rs         # (unchanged kinds — no Config on AnnotationKind)
+
+lp-shader/lps-filetests/src/test_run/
+└── filetest_lpvm.rs           # UPDATE: build CompilerConfig, set on FaCompileOptions, CompileOptions, WasmOptions
+
+lp-shader/lps-frontend / lp-engine / fw / tests
+└── UPDATE: any ..Default::default() or struct copies that assumed Copy on option structs
+```
+
+## Conceptual architecture
+
+```
+┌──────────────────────────────────────────────────────────────────────┐
+│  lps-frontend (GLSL → LPIR)                                          │
+└────────────────────────────┬─────────────────────────────────────────┘
+                             ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│  LPIR module                                                         │
+│  ──────────────────────────────────────────────────────────────────  │
+│  CompilerConfig  ← middle-end: inline mode, budgets, future passes   │
+│       ▲                                                              │
+│       │  filetest: // compile-opt(k, v) → apply() on defaults        │
+│       │  production: NativeCompileOptions / CompileOptions / …       │
+└───────┼──────────────────────────────────────────────────────────────┘
+        │
+        ▼  LPIR passes (const_fold today; inline when wired) read config
+┌──────────────────────────────────────────────────────────────────────┐
+│  Backend lowering                                                    │
+│  NativeCompileOptions │ CompileOptions │ WasmOptions                 │
+│  (+ float_mode, emu_trace, q32_options, … per backend)               │
+└──────────────────────────────────────────────────────────────────────┘
+```
+
+**Separation:** **`CompilerConfig`** does not subsume backend flags (**`FloatMode`**, debug, WASM-only knobs). It only groups **shared LPIR pass** settings so every codegen path sees the same middle-end choices.
+
+## Main components and interactions
+
+| Piece | Role |
+|-------|------|
+| **`CompilerConfig::apply`** | Single namespace for **`compile-opt`** string keys → field updates; unknown key / bad value → error |
+| **`TestFile::config_overrides`** | Raw **`(key, value)`** from file; duplicate keys rejected in **`parse_test_file`** |
+| **`CompiledShader::compile_glsl`** | After **`lower_glsl`**, merge overrides into **`CompilerConfig::default()`**, install on each backend’s options before **`compile`** |
+
+## Phases
+
+1. **`01-lpir-compiler-config.md`** — `compiler_config.rs`, tests for **`apply`** / **`InlineMode::from_str`**
+2. **`02-thread-config-through-backends.md`** — **`NativeCompileOptions`**, **`CompileOptions`**, **`WasmOptions`**, fix **`Copy`/`Clone`** and all call sites
+3. **`03-filetests-compile-opt.md`** — parsing, **`TestFile`**, **`filetest_lpvm`** wiring
+4. **`04-cleanup-and-validation.md`** — diff hygiene, full test matrix, **`summary.md`**, move to **`plans-done/`**, commit template
diff --git a/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/00-notes.md b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/00-notes.md
new file mode 100644
index 000000000..d2268b7ca
--- /dev/null
+++ b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/00-notes.md
@@ -0,0 +1,57 @@
+# Plan notes — `lpir-inliner` stage ii (M1 compiler config + filetest `compile-opt`)
+
+## Scope of work
+
+Implement **M1 — Compiler config + per-file opt overrides** from
+`docs/roadmaps/2026-04-15-lpir-inliner/m1-optpass-filetests.md`, with the **syntax decision below** (replaces roadmap’s `@config` spelling).
+
+- Add **`no_std` + `alloc`** `CompilerConfig` / `InlineConfig` / `InlineMode` / `ConfigError` in `lpir`, with `CompilerConfig::apply` for string key/value overrides (canonical key namespace for opt passes).
+- Add **`config: CompilerConfig`** to **`NativeCompileOptions`**, Cranelift **`CompileOptions`**, and **`WasmOptions`**; passes read their slice of config (inline consumes when wired in later milestones).
+- Extend **filetest parsing** with **`// compile-opt(key, value)`** (e.g. `// compile-opt(inline.mode, never)`), typically **at the top of the file**; store as **`TestFile::config_overrides`**, duplicate-key detection, merge into defaults before compilation in **`filetest_lpvm`** / compile path.
+- **No intended behavior change** for existing tests: no new directive lines until we add them in a later milestone; inliner not wired until later roadmap work — defaults only.
+
+Explicitly **out of scope** for this plan: M0 `CalleeRef` work (parallel track), actual inliner implementation, tagging individual `.glsl` files with `compile-opt` until a later milestone (e.g. M4) unless we add optional tagging in cleanup.
+
+## Current state of the codebase (relevant to this scope)
+
+- **Paths**: Shader stack lives under `lp-shader/` (`lpir`, `lpvm-native`, `lps-filetests`, etc.).
+- **`lpir`**: `#![no_std]` + `alloc`; has `const_fold`, no `compiler_config` module yet. `FloatMode` already lives here and is reused by backends.
+- **`NativeCompileOptions`** (`lp-shader/lpvm-native/src/native_options.rs`): `float_mode`, `debug_info`, `emu_trace_instructions`, `alloc_trace`; **`Copy`** + **`Default`**. Will likely **`Clone`** instead of **`Copy`** once it holds `CompilerConfig` (unless config is behind `Arc` — unlikely for tiny structs).
+- **Filetest parse loop** (`lp-shader/lps-filetests/src/parse/mod.rs`): Lines matching `parse_annotation_line` are **target-scoped** (`@unimplemented(target)`, etc.) and accumulate in **`pending_annotations`**, then attach to the **next** `// run:`. File-level **`compile-opt`** must **not** use that pipeline.
+- **New directive**: parse **`// compile-opt(...)`** in a dedicated path (comma-separated key/value inside parens, same logical shape as the old roadmap `@config` examples).
+- **`Annotation` / `AnnotationKind`**: Keep **`AnnotationKind`** `Copy` for run annotations; **do not** add config here — use **`config_overrides`** on **`TestFile`**.
+- **`CompiledShader::compile_glsl`** (`filetest_lpvm.rs`): builds **`FaCompileOptions`**, Cranelift **`CompileOptions`**, **`WasmOptions`** per target. **`CompilerConfig`** is **middle-end** (LPIR opts); it must thread into **all** of these so filetests and prod behave consistently on every backend (see updated **`m1-optpass-filetests.md`**).
+
+## Questions (planning)
+
+| # | Question | Status |
+|---|----------|--------|
+| 1 | Model config as `AnnotationKind::Config` vs **`TestFile::config_overrides`** + dedicated parse? | **Resolved** |
+| 2 | Directive spelling for file-level overrides? | **Resolved** |
+| 3 | Thread **`CompilerConfig`** only through native vs **all** backends? | **Resolved** |
+
+### Suggested directions (for discussion)
+
+_(Q1–Q3 resolved; see Answers.)_
+
+## Answers (from chat)
+
+### Q1 — Modeling
+
+**Answer:** **`TestFile::config_overrides: Vec<(String, String)>`** plus a **dedicated** parse branch (e.g. `parse_compile_opt_line`), **not** `Annotation` / `AnnotationKind`. Do not push these lines into **`pending_annotations`**.
+
+### Q2 — Syntax
+
+**Answer:** Use **`// compile-opt(key, value)`** — file-level compiler / LPIR opt overrides, conventionally **at the top of the file**. Example: `// compile-opt(inline.mode, never)`.
+
+**Rationale:** Keeps **`// @…(target)`** meaning “target-scoped, attaches to next **`// run:`**”; **`compile-opt`** reads as “how this file is compiled,” distinct from per-run annotations.
+
+### Q3 — Where does `CompilerConfig` live conceptually, and who gets a field?
+
+**Answer:** **`CompilerConfig` is middle-end (LPIR optimization pipeline)** — not **`lps-frontend`**, not backend-specific codegen toggles. **Thread `config: CompilerConfig` through every backend option struct** that compiles LPIR (`NativeCompileOptions`, **`CompileOptions`**, **`WasmOptions`**) so overrides apply everywhere; backend crates remain responsible for their **own** non-LPIR fields.
+
+## Notes
+
+- **Roadmap** `docs/roadmaps/2026-04-15-lpir-inliner/m1-optpass-filetests.md` is updated for **`compile-opt`**, middle-end framing, and **everywhere** threading.
+- **Parallel work with M0 (stage i)**: M0 and M1 both touch **`lpvm-native`** and possibly **`lps-filetests`**; **`lpir`** gains new files in both. Expect occasional rebase conflicts; **M1 does not depend on enum `CalleeRef`** for `CompilerConfig` itself. Merge order: land M0 first if both touch the same lines, or coordinate.
+- **`NativeCompileOptions` non-`Copy`**: All struct literals and `#[derive(Copy)]` call sites need review after adding **`CompilerConfig`**.
diff --git a/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/01-lpir-compiler-config.md b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/01-lpir-compiler-config.md
new file mode 100644
index 000000000..6e4984645
--- /dev/null
+++ b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/01-lpir-compiler-config.md
@@ -0,0 +1,30 @@
+# Phase 1 — LPIR `CompilerConfig`
+
+## Scope of phase
+
+Add **`lpir::compiler_config`**: **`CompilerConfig`**, **`InlineConfig`**, **`InlineMode`**, **`ConfigError`**, and **`CompilerConfig::apply`**, matching the data layout and key set in `docs/roadmaps/2026-04-15-lpir-inliner/m1-optpass-filetests.md`. Implement **`core::str::FromStr`** for **`InlineMode`** (`auto`, `always`, `never` — pick consistent lowercase spelling in **`from_str`** and document it).
+
+Export from **`lib.rs`**. No backend or filetest changes yet.
+
+## Code Organization Reminders
+
+- Prefer one concept per file; **`compiler_config.rs`** holds the whole public surface for this phase.
+- Entry points and types first; helper fns at the bottom if any.
+- Keep **`#![no_std]`** + **`alloc`** only as needed (e.g. **`String`** in errors — use **`&str`** / static messages if avoiding **`String`**, or align with existing **`lpir`** error patterns).
+
+## Implementation Details
+
+- **`ConfigError`**: support at least **`UnknownKey`**, **`InvalidValue`** (duplicate keys are enforced in the **filetest harness**, not in **`apply`**).
+- **`CompilerConfig::default()`** / **`InlineConfig::default()`** per roadmap defaults.
+- **`apply(&mut self, key: &str, value: &str)`** — match arms for keys listed in roadmap (`inline.mode`, `inline.small_func_threshold`, `inline.max_growth_budget`, `inline.module_op_budget`). Either add **`inline.always_inline_single_site` → `bool`** or document that it is default-only until a key exists.
+
+### Tests (`lpir` crate)
+
+- **`apply`** success for valid pairs; failure for unknown key and bad parse.
+- **`InlineMode`** parsing round-trip.
+
+## Validate
+
+```bash
+cargo test -p lpir
+```
diff --git a/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/02-thread-config-through-backends.md b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/02-thread-config-through-backends.md
new file mode 100644
index 000000000..1b55b361f
--- /dev/null
+++ b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/02-thread-config-through-backends.md
@@ -0,0 +1,50 @@
+# Phase 2 — Thread `CompilerConfig` through backends
+
+## Scope of phase
+
+Add **`pub config: lpir::CompilerConfig`** to:
+
+- **`lpvm_native::NativeCompileOptions`** (`native_options.rs`)
+- **`lpvm_cranelift::CompileOptions`** (`compile_options.rs`)
+- **`lpvm_wasm::WasmOptions`** (`options.rs`)
+
+Update **`Default`** impls to set **`config: CompilerConfig::default()`**. Replace **`Copy`** with **`Clone`** (and **`PartialEq`/`Eq`** as needed) wherever **`CompilerConfig`** prevents **`Copy`**.
+
+Update **every** construction site: **`..Default::default()`**, field updates, and any code that assumed **`Copy`** (e.g. pass-by-value patterns may become **`.clone()`**).
+
+**Passes:** thread **`options.config`** into **`compile_module` / `compile`** paths so **future** passes (inliner) can read it. For M1, if no pass consumes **`InlineConfig`** yet, wiring is still “plumbing only” with no semantic change.
+
+## Code Organization Reminders
+
+- Touch only what **`grep`** / the compiler flags for **`NativeCompileOptions`**, **`CompileOptions`**, **`WasmOptions`**.
+- Keep **`CompilerConfig`** ownership clear: one **`Clone`** per compile from options is fine; no need for **`Arc`** unless profiling says otherwise.
+
+## Implementation Details
+
+- **`lp-core/lp-engine/src/gfx/native_jit.rs`** and any **`fw-*` / tests** that build **`NativeCompileOptions`** — add **`..Default::default()`** or explicit **`config`** fields.
+- **`lps-filetests/tests/rv32n_smoke.rs`** and similar — update struct literals.
+- **`lpvm_native::compile.rs`**: forward **`config`** only where the roadmap expects (inline in M4); optional comment **`// M1: config available on options`** if no consumer yet.
+
+### Tests
+
+```bash
+cargo test -p lpvm-native
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-wasm
+```
+
+Fix any **`cargo check -p lp-engine`** / **`fw-esp32`** breakage from option type changes before phase 3.
+
+## Validate
+
+```bash
+cargo test -p lpvm-native
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-wasm
+cargo test -p lps-frontend
+cargo check -p lp-engine
+cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf \
+  --profile release-esp32 --features esp32c6,server
+```
+
+Adjust crate paths if the repo workspace layout differs.
diff --git a/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/03-filetests-compile-opt.md b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/03-filetests-compile-opt.md
new file mode 100644
index 000000000..93ec44356
--- /dev/null
+++ b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/03-filetests-compile-opt.md
@@ -0,0 +1,33 @@
+# Phase 3 — Filetests `compile-opt`
+
+## Scope of phase
+
+- Add **`parse_compile_opt.rs`** (or equivalent) that recognizes trimmed lines of the form **`// compile-opt(key, value)`**. Balanced-paren matching or a simpler rule is fine: take the text inside **`(` `)`**, split it on the **first comma**, and trim **key** and **value**. MVP: no commas in **value** (a first-comma split would tolerate them, and a last-comma split is the alternative, but either must be documented first); align with the roadmap “two-part” mental model.
+- In **`parse_test_file`**: handle **`compile-opt`** **before** the branch that treats lines as **`// @…`** target annotations, so **`compile-opt`** is never pushed to **`pending_annotations`**.
+- Add **`config_overrides: Vec<(String, String)>`** to **`TestFile`**; on duplicate **key**, return **`Err`** with line number.
+- **`filetest_lpvm`**: from **`TestFile`**, build **`CompilerConfig::default()`**, **`apply`** each pair (or merge after duplicate check), pass **`config`** into **`FaCompileOptions`**, Cranelift **`CompileOptions`**, and **`WasmOptions`** in **`compile_glsl`**.
+
+Thread **`&TestFile`** or **`CompilerConfig`** through **`run_test_file` → `compile`** as needed so **`compile_glsl`** receives overrides.
+
+## Code Organization Reminders
+
+- Parser tests live next to **`parse_compile_opt`** (unit tests) and optionally one integration test on a temp **`.glsl`** file in **`parse/mod.rs`** tests.
+- **`AnnotationKind`** / **`parse_annotation.rs`** remain unchanged.
+
+## Implementation Details
+
+- **Whitespace:** allow **`// compile-opt( inline.mode , never )`** style trimming.
+- **Errors:** unknown key from **`apply`** should surface with file context (path + line) when merging in the harness.
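+
+A minimal sketch of the line rule described above, assuming a first-comma split and plain `String` errors. The names `parse_compile_opt_line` and `collect_overrides` are hypothetical, not the shipped API; the real harness would also attach the file path when reporting errors.
+
+```rust
+/// Hypothetical parser for a single source line.
+/// Returns `Ok(None)` when the line is not a `compile-opt` directive,
+/// `Ok(Some((key, value)))` on success, and `Err` on malformed syntax.
+pub fn parse_compile_opt_line(line: &str) -> Result<Option<(String, String)>, String> {
+    let trimmed = line.trim();
+    let Some(rest) = trimmed.strip_prefix("//") else {
+        return Ok(None);
+    };
+    let Some(rest) = rest.trim_start().strip_prefix("compile-opt") else {
+        return Ok(None);
+    };
+    // Everything after the keyword must be `( ... )`.
+    let inner = rest
+        .trim()
+        .strip_prefix('(')
+        .and_then(|r| r.strip_suffix(')'))
+        .ok_or_else(|| String::from("expected `compile-opt(key, value)`"))?;
+    // Assumption: split on the first comma; the MVP forbids commas in the value.
+    let (key, value) = inner
+        .split_once(',')
+        .ok_or_else(|| String::from("expected `key, value` inside parentheses"))?;
+    let (key, value) = (key.trim(), value.trim());
+    if key.is_empty() || value.is_empty() {
+        return Err(String::from("empty key or value"));
+    }
+    Ok(Some((key.to_string(), value.to_string())))
+}
+
+/// Hypothetical collector: gathers overrides in source order and rejects
+/// duplicate keys, prefixing errors with the 1-based line number.
+pub fn collect_overrides(source: &str) -> Result<Vec<(String, String)>, String> {
+    let mut out: Vec<(String, String)> = Vec::new();
+    for (idx, line) in source.lines().enumerate() {
+        let line_no = idx + 1;
+        match parse_compile_opt_line(line) {
+            Ok(None) => {}
+            Ok(Some((key, value))) => {
+                if out.iter().any(|(k, _)| *k == key) {
+                    return Err(format!("line {line_no}: duplicate compile-opt key `{key}`"));
+                }
+                out.push((key, value));
+            }
+            Err(e) => return Err(format!("line {line_no}: {e}")),
+        }
+    }
+    Ok(out)
+}
+```
+
+In this sketch a line such as `// compile-opt( inline.mode , never )` yields the pair `("inline.mode", "never")`, while `// @…` annotation lines return `Ok(None)` and fall through to the existing annotation branch.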
+
+### Tests
+
+- Parse single and multiple **`compile-opt`** lines.
+- Duplicate key error.
+- Invalid line syntax error (missing parens, empty key).
+- End-to-end: optional minimal **`.glsl`** under **`filetests/`** with one **`compile-opt`** only if we want coverage without changing expectations — otherwise rely on parser + harness unit tests until M4 adds real tagged files.
+
+## Validate
+
+```bash
+cargo test -p lps-filetests -- --test-threads=4
+```
diff --git a/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/04-cleanup-and-validation.md b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/04-cleanup-and-validation.md
new file mode 100644
index 000000000..4b389081e
--- /dev/null
+++ b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/04-cleanup-and-validation.md
@@ -0,0 +1,42 @@
+# Phase 4 — Cleanup & validation
+
+## Scope of phase
+
+- Grep the working tree for **`TODO`**, **`FIXME`**, stray **`dbg!`**, debug **`println!`** introduced during this plan.
+- Fix warnings: remove unused imports left over from plumbing; for **`dead_code`** on legitimately unused stubs, prefer **`allow`** with a one-line reason, or remove the stub.
+- Run the **full M1 validation matrix** from the roadmap.
+
+## Cleanup & validation
+
+```bash
+cargo test -p lpir
+cargo test -p lpvm-native
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-wasm
+cargo test -p lps-filetests -- --test-threads=4
+cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf \
+  --profile release-esp32 --features esp32c6,server
+```
+
+Add **`cargo check -p fw-emu`** or **`lp-server`** if this workspace’s AGENTS checklist applies to these crates.
+
+## Plan cleanup
+
+- Write **`docs/plans/2026-04-15-lpir-inliner-stage-ii/summary.md`**: bullets — what shipped (`CompilerConfig`, three backends, **`compile-opt`** parsing + harness), crates touched, follow-ups (M4 inliner reads **`InlineConfig`**; tag **`filetests/function/*.glsl`**).
+- Move **`docs/plans/2026-04-15-lpir-inliner-stage-ii/`** → **`docs/plans-done/2026-04-15-lpir-inliner-stage-ii/`** when implementation is complete.
+
+## Commit (when requested)
+
+Conventional Commits example:
+
+```
+feat(lpir): add CompilerConfig and filetest compile-opt directive
+
+- Add CompilerConfig / InlineConfig / InlineMode in lpir
+- Thread config through native, Cranelift, and WASM compile options
+- Parse // compile-opt(key, value) into TestFile and apply before compile
+```
+
+## Code Organization Reminders
+
+- Final pass: no temporary hacks without **`TODO(plan):`** if something must remain.
diff --git a/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/summary.md b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/summary.md
new file mode 100644
index 000000000..f3dd82b0a
--- /dev/null
+++ b/docs/plans-done/2026-04-15-lpir-inliner-stage-ii/summary.md
@@ -0,0 +1,34 @@
+# Summary — `lpir-inliner` stage ii (M1 `CompilerConfig` + `compile-opt`)
+
+## Shipped
+
+- **`lpir::compiler_config`**: `CompilerConfig`, `InlineConfig`, `InlineMode`, `ConfigError`, and `CompilerConfig::apply` for string keys (`inline.mode`, `inline.always_inline_single_site`, thresholds, optional budgets). `InlineMode`: `FromStr` / `Display` (`auto`, `always`, `never`). `no_std` + `alloc`.
+- **Middle-end threading**: `config: CompilerConfig` on `NativeCompileOptions`, Cranelift `CompileOptions`, and `WasmOptions` (defaults via `CompilerConfig::default()`; options structs use `Clone` where `Copy` no longer applies).
+- **Filetests**: `// compile-opt(key, value)` parsed in `parse_compile_opt.rs`; `TestFile::config_overrides`; duplicate keys rejected at parse time; `build_compiler_config` + merge before `compile_for_target`; GLSL output strips `compile-opt` lines; all backends in `filetest_lpvm` receive the merged config.
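+
+As an illustration of the string-key surface listed above, here is a minimal sketch covering only `inline.mode`. The field layout (`config.inline.mode`), the `Auto` default, and the `String` error type are assumptions made for this example; the shipped code uses `ConfigError` and supports more keys.
+
+```rust
+use std::fmt;
+use std::str::FromStr;
+
+#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
+pub enum InlineMode {
+    #[default] // assumption: `auto` is the default mode
+    Auto,
+    Always,
+    Never,
+}
+
+impl FromStr for InlineMode {
+    type Err = String;
+
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        match s {
+            "auto" => Ok(Self::Auto),
+            "always" => Ok(Self::Always),
+            "never" => Ok(Self::Never),
+            other => Err(format!("unknown inline mode `{other}`")),
+        }
+    }
+}
+
+impl fmt::Display for InlineMode {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.write_str(match self {
+            Self::Auto => "auto",
+            Self::Always => "always",
+            Self::Never => "never",
+        })
+    }
+}
+
+// Hypothetical, reduced config types: only the inline mode is modeled here.
+#[derive(Clone, Debug, Default, PartialEq, Eq)]
+pub struct InlineConfig {
+    pub mode: InlineMode,
+}
+
+#[derive(Clone, Debug, Default, PartialEq, Eq)]
+pub struct CompilerConfig {
+    pub inline: InlineConfig,
+}
+
+impl CompilerConfig {
+    /// Applies one `key, value` override; unknown keys and bad values are errors.
+    pub fn apply(&mut self, key: &str, value: &str) -> Result<(), String> {
+        match key {
+            "inline.mode" => {
+                self.inline.mode = value.parse()?;
+                Ok(())
+            }
+            other => Err(format!("unknown config key `{other}`")),
+        }
+    }
+}
+```
+
+A filetest harness built along these lines would start from `CompilerConfig::default()` and call `apply` once per parsed `compile-opt` pair.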
+
+## Crates touched (main)
+
+- `lp-shader/lpir` — `compiler_config.rs`, `lib.rs`
+- `lp-shader/lpvm-native`, `lp-shader/lpvm-cranelift`, `lp-shader/lpvm-wasm`, `lp-shader/lpvm-emu` — options + clone/move fixes
+- `lp-shader/lps-filetests` — parse, source strip, compile harness, `run_detail`
+- `lp-core/lp-engine`, `lp-app/web-demo` — option struct literals
+
+## Follow-ups
+
+- **M4+**: Wire the inliner (and any other LPIR pass) to read `options.config.inline` (and friends).
+- **Roadmap tagging**: Add `// compile-opt(inline.mode, never)` / `always` to the listed `filetests/function/*.glsl` when inliner behavior must be pinned.
+- **`lp-server` / `fw-emu`**: Run `cargo check` if the full AGENTS matrix is required for a release; stage-ii phase 4 matrix covered shader pipeline crates + `fw-esp32` when run in CI.
+
+## Validation (recorded at completion)
+
+```bash
+cargo test -p lpir
+cargo test -p lpvm-native
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-wasm
+cargo test -p lps-filetests -- --test-threads=4
+cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf \
+  --profile release-esp32 --features esp32c6,server
+```
+
+All of the above completed successfully before this summary was added.
diff --git a/docs/plans/2026-04-15-lpir-inliner-stage-i/00-design.md b/docs/plans/2026-04-15-lpir-inliner-stage-i/00-design.md
new file mode 100644
index 000000000..30e256345
--- /dev/null
+++ b/docs/plans/2026-04-15-lpir-inliner-stage-i/00-design.md
@@ -0,0 +1,109 @@
+# Design — `lpir-inliner` stage i (M0 stable `CalleeRef`)
+
+## Scope of work
+
+Replace flat `CalleeRef(u32)` with `CalleeRef::Import(ImportId)` / `CalleeRef::Local(FuncId)`, store local functions in `BTreeMap<FuncId, IrFunction>` with stable ids (no redundant `func_id` on `IrFunction`), keep `imports: Vec<ImportDecl>` with `ImportId` = vector index. Update all `lpir` and downstream crates. **No intentional semantic change**; validate with full test matrix from M0 roadmap.
+
+See `00-notes.md` for resolved planning questions.
+
+## Implementation granularity
+
+Intermediate phases **do not need to keep the workspace building**. It is fine if `cargo check` fails after an early phase until downstream crates are updated. The **contract is end-to-end green** after phase **5** (full test matrix + firmware `cargo check` in `05-cleanup-and-validation.md`). Phases are organizational slices, not merge checkpoints.
+
+## File structure (relevant areas)
+
+```
+lp-shader/lpir/src/
+├── types.rs                    # UPDATE: ImportId, FuncId, CalleeRef enum
+├── lpir_module.rs              # UPDATE: BTreeMap functions; import helpers
+├── builder.rs                  # UPDATE: ModuleBuilder next_func_id; add_* returns
+├── lpir_op.rs                  # (Call shape unchanged; CalleeRef type only)
+├── print.rs                    # UPDATE: callee + function iteration
+├── parse.rs                    # UPDATE: CalleeRef construction
+├── validate.rs                 # UPDATE: local lookup by FuncId
+├── interp.rs                   # UPDATE: callee resolution + callee body fetch
+├── lib.rs                      # UPDATE: re-export ImportId, FuncId
+└── tests/                      # UPDATE: CalleeRef construction
+
+lp-shader/lpvm-native/src/
+├── lower.rs                    # UPDATE: resolve_callee_name, sret path
+├── compile.rs, link.rs         # UPDATE: iterate functions / indices
+├── regalloc/render.rs          # UPDATE: comment / clone path for map
+├── debug_asm.rs, rt_emu/*.rs, rt_jit/*.rs, …  # UPDATE: ir.functions access
+
+lp-shader/lpvm-wasm/src/
+├── emit/mod.rs, emit/imports.rs, emit/ops.rs
+├── compile.rs                  # zip IR funcs with meta — order contract
+└── rt_*/instance.rs
+
+lp-shader/lpvm-cranelift/src/
+└── module_lower.rs, emit/call.rs, call.rs, …  # UPDATE: index→FuncId; alias cranelift FuncId
+
+lp-shader/lps-frontend/src/
+├── lower.rs, lower_ctx.rs, lower_lpfx.rs
+
+lp-shader/lpvm-emu/src/
+└── instance.rs, emu_run.rs
+
+lp-shader/lpvm/src/debug.rs     # (verify; may be HashMap name→, not LpirModule)
+
+lp-shader/lps-filetests, …      # indirect via frontend
+```
+
+## Conceptual architecture
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│ LpirModule                                                   │
+│   imports: Vec<ImportDecl>     ImportId(i) ↔ imports[i]      │
+│   functions: BTreeMap<FuncId, IrFunction>  (stable keys)     │
+└─────────────────────────────────────────────────────────────┘
+              │
+              │  CalleeRef::Import(id) ──► ImportDecl + index in imports
+              │  CalleeRef::Local(id)  ──► functions.get(&id)
+              ▼
+┌─────────────────────────────────────────────────────────────┐
+│ ModuleBuilder                                                │
+│   next_func_id: u16 (or u32) monotonic for new locals       │
+│   add_function → insert map, return Local(FuncId)            │
+└─────────────────────────────────────────────────────────────┘
+```
+
+**Id allocation:** each `add_function` allocates the next unused `FuncId` (wrapper type over incrementing counter). **Deletion** is out of scope for M0, but the map + stable ids is the intended contract for M5.
+
+**Name collision:** Cranelift uses `cranelift_module::FuncId`; LPIR gains `lpir::FuncId`. Use explicit qualification or `use lpir::FuncId as LpirFuncId` in files where both appear.
+
+## Main components and interactions
+
+| Component | Role |
+|-----------|------|
+| `ImportId` / `FuncId` | Newtype wrappers (`u16`); `Hash`, `Ord` for map keys |
+| `CalleeRef` | Enum; all `Call` and name resolution match on it |
+| `LpirModule::callee_as_*` | Becomes `callee_as_import` → `Option<ImportId>` + slice access, or match-only helpers; local path returns `Option<&IrFunction>` via `FuncId` |
+| `ModuleBuilder` | Owns `next_func_id`; `finish()` moves map into `LpirModule` |
+| Backends | Replace `functions[i]` / `enumerate()` with map iteration or sorted `Vec<FuncId>` for deterministic codegen order matching existing behavior |
+
+## Suggested implementation phases
+
+Listed as separate files `01-*.md` … `05-*.md` in this directory.
+
+1. **LPIR core** — types, `LpirModule`, `ModuleBuilder`, `lib` exports; compile `lpir` only.
+2. **LPIR surface** — print, parse, validate, interp, unit tests.
+3. **Primary backends** — `lpvm-native`, `lpvm-wasm`, `lps-frontend` (+ `lower` paths).
+4. **Remaining runtimes** — `lpvm-cranelift` (index/order maps; `FuncId` alias), `lpvm-emu`, JIT/EMU instances, `link.rs` / `compile.rs` ordering vs `LpsModuleSig`.
+5. **Cleanup & validation** — `cargo test` / `cargo check` matrix from M0, fix warnings, `summary.md`, move plan to `docs/plans-done/` when done.
+
+## Validate (full stage)
+
+From M0 roadmap (run from workspace root):
+
+```bash
+cargo test -p lpir
+cargo test -p lpvm-native
+cargo test -p lpvm-wasm
+cargo test -p lps-frontend
+cargo test -p lps-filetests -- --test-threads=4
+cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf --profile release-esp32 --features esp32c6,server
+```
+
+Add `cargo test -p lpvm-cranelift` / `cargo test -p lpvm-emu` if those crates cover changed paths.
diff --git a/docs/plans/2026-04-15-lpir-inliner-stage-i/00-notes.md b/docs/plans/2026-04-15-lpir-inliner-stage-i/00-notes.md
new file mode 100644
index 000000000..e920650cc
--- /dev/null
+++ b/docs/plans/2026-04-15-lpir-inliner-stage-i/00-notes.md
@@ -0,0 +1,63 @@
+# Plan notes — `lpir-inliner` stage i (M0 stable `CalleeRef`)
+
+## Scope of work
+
+Implement the **M0 — Stable CalleeRef refactor** from
+`docs/roadmaps/2026-04-15-lpir-inliner/m0-stable-callee-ref.md`:
+
+- Replace flat `CalleeRef(pub u32)` (imports first, then locals in one index space) with a typed enum `CalleeRef::Import(ImportId)` / `CalleeRef::Local(FuncId)`.
+- Add `ImportId(u16)` and `FuncId(u16)` with stable identity (safe for future dead-function elimination).
+- Update `lpir` (types, module, builder, parse, print, validate, interp, tests) and downstream crates (`lpvm-native`, `lpvm-wasm`, `lps-frontend`) per the roadmap.
+- **No intended behavior change**: same IR semantics and test expectations; mechanical migration off index arithmetic.
+
+Out of scope for this stage: inliner, `Block` ops, filetest `@config`, dead-function elimination (later milestones).
+
+## Current state of the codebase (relevant to this scope)
+
+- **Layout**: LPIR lives under `lp-shader/lpir/` (not the repo root crate name alone).
+- **`CalleeRef`**: `lp-shader/lpir/src/types.rs` defines `pub struct CalleeRef(pub u32)` with comment “imports first, then local functions”.
+- **`LpirModule`**: `lp-shader/lpir/src/lpir_module.rs` holds `imports: Vec<ImportDecl>` and `functions: Vec<IrFunction>`. Helpers `callee_ref_import`, `callee_ref_function`, `callee_as_import`, `callee_as_function` implement the flat index split.
+- **`ModuleBuilder`**: `add_import` / `add_function` return `CalleeRef` using the same flat encoding (`lp-shader/lpir/src/builder.rs`).
+- **`IrFunction`**: has `name`, `is_entry`, `vmctx_vreg`, params, body, etc.; **no** `FuncId` field today.
+- **Consumers**: `CalleeRef` appears in `print`, `parse`, `validate`, `interp` (uses `callee_as_import` / `callee_as_function`), `lpvm-native` `lower.rs`, `lpvm-wasm` `emit/ops.rs` and `emit/imports.rs`, `lps-frontend` `lower.rs` / `lower_ctx.rs` / `lower_lpfx.rs`, tests in `lpir/src/tests/validate.rs`. **`lpvm-cranelift` has no `CalleeRef` string matches** in a quick grep — may not need changes for M0.
+- **Roadmap validation commands** assume workspace crates; commands should be run from the workspace that contains `lp-shader` members (see root `Cargo.toml` / workspace structure when validating).
+
+## Questions (planning)
+
+Answers will be appended below as we resolve them in chat.
+
+| # | Question | Status |
+|---|----------|--------|
+| 1 | How should `LpirModule` store local functions so `FuncId` stays stable across future deletion without renumbering `Call` sites? | **Resolved** |
+| 2 | Should each `IrFunction` store a `func_id: FuncId` field (redundant with map keys), or only the `BTreeMap` key? | **Resolved** |
+| 3 | Imports: keep `Vec<ImportDecl>` + `ImportId` as vec index vs symmetric map? | **Resolved** |
+
+### Suggested directions (for discussion)
+
+- **Storage**: Options include `(a)` `Vec<IrFunction>` with `FuncId` **not** equal to vec index + side map `FuncId -> usize`, `(b)` `BTreeMap<FuncId, IrFunction>`, `(c)` `Vec<Option<IrFunction>>` with `FuncId` as slot index (sparse, deletion = `None`). Roadmap allows “simpler option” for small counts.
+- **`IrFunction`**: Optional `func_id: FuncId` field for debugging and map-free reverse lookup — roadmap says “consider”.
+- **Width**: Roadmap uses `u16` for ids; confirm vs existing counts (imports + functions) in largest modules.
+
+## Answers (from chat)
+
+### Q1 — Local function storage
+
+**Answer:** Use **`BTreeMap<FuncId, IrFunction>`** (option `(b)`).
+
+**Implications:**
+
+- Iteration order is sorted by **`FuncId`**, not insertion order. With monotonic id assignment (`0, 1, 2, …`), codegen order usually matches old `Vec` order; after deletes + new inserts, new ids should be ordered consistently if we allocate ids from a counter.
+- All call sites and builders must construct **`CalleeRef::Local(FuncId)`** instead of flat indices.
+
+### Q2 — `FuncId` on `IrFunction`
+
+**Answer:** **No redundant field** — single source of truth: **`FuncId` only as the `BTreeMap` key**. APIs that need both pass **`(FuncId, &IrFunction)`** or look up with **`module.functions.get(&id)`**.
+
+### Q3 — Import storage
+
+**Answer:** Keep **`imports: Vec<ImportDecl>`** with **`ImportId(u16)`** equal to the **index** in that vector (same model as today, but typed). No `BTreeMap` for imports in M0.
+
+## Notes
+
+- **Cranelift / native / interp** iterate `module.functions` today as a `Vec`; they will iterate **`BTreeMap`** entries or collect sorted ids—small mechanical updates alongside `CalleeRef` migration.
+- **Build granularity:** Intermediate steps do not need to keep `cargo check` green; only the **end of the plan** (phase 5 / full validation) must pass. Phases are logical slices, not per-commit merge requirements.
diff --git a/docs/plans/2026-04-15-lpir-inliner-stage-i/01-lpir-core-types-and-module.md b/docs/plans/2026-04-15-lpir-inliner-stage-i/01-lpir-core-types-and-module.md
new file mode 100644
index 000000000..5d3db4df0
--- /dev/null
+++ b/docs/plans/2026-04-15-lpir-inliner-stage-i/01-lpir-core-types-and-module.md
@@ -0,0 +1,33 @@
+# Phase 1 — LPIR core: types, module, builder
+
+## Scope of phase
+
+Introduce `ImportId`, `FuncId`, and `CalleeRef` enum in `lpir`. Replace `LpirModule::functions: Vec<IrFunction>` with `BTreeMap<FuncId, IrFunction>`. Extend `ModuleBuilder` with monotonic `FuncId` allocation (`add_function`). Update `callee_*` helpers to the new model. Print/parse/validate/interp are **phase 2**; it is OK if **`lpir` does not compile** until those files are updated—no requirement to stub just to keep the build green mid-plan.
+
+## Code organization reminders
+
+- One concept per file where it already exists (`types.rs`, `lpir_module.rs`, `builder.rs`).
+- Entry points: public types and `LpirModule` / `ModuleBuilder` APIs first.
+- Helper constructors (`CalleeRef::import`, `local`) at bottom if useful.
+
+## Implementation details
+
+- **`FuncId` / `ImportId`:** `#[repr(transparent)]` `u16` (or plain newtype); implement `Debug`, `Display`, and `Ord` (with the `PartialOrd` / `Eq` / `PartialEq` impls that `Ord` requires). `FromStr` is not needed for ids.
+- **`CalleeRef`:** `Import(ImportId)` | `Local(FuncId)`; derive `Clone`, `Copy`, `PartialEq`, `Eq`, `Hash`.
+- **`LpirModule`:** `functions: BTreeMap<FuncId, IrFunction>`; remove flat `CalleeRef` index helpers; add:
+  - `fn local_function(&self, id: FuncId) -> Option<&IrFunction>`
+  - iterators as needed for phase 2 (`functions.values()`, `functions.iter()`).
+- **`function_count`:** `self.functions.len()` as `u32`.
+- **`ModuleBuilder`:** field `next_func_id: u32` (or u16 with overflow check); `add_function`: `let id = FuncId(...); self.functions.insert(id, func);` return `CalleeRef::Local(id)`.
+- **`lib.rs`:** `pub use types::{ImportId, FuncId, CalleeRef, ...}`.
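+
+The shape described above could look roughly like the following sketch. It uses `std` for brevity, a placeholder `IrFunction`, and a `u16` counter with an overflow check; all of these are illustrative assumptions, not the final `lpir` code.
+
+```rust
+use std::collections::BTreeMap;
+
+#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
+pub struct FuncId(pub u16);
+
+#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
+pub struct ImportId(pub u16);
+
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
+pub enum CalleeRef {
+    Import(ImportId),
+    Local(FuncId),
+}
+
+// Placeholder: the real `IrFunction` carries params, body, and more.
+#[derive(Debug, Default)]
+pub struct IrFunction {
+    pub name: String,
+}
+
+#[derive(Debug, Default)]
+pub struct LpirModule {
+    pub functions: BTreeMap<FuncId, IrFunction>,
+}
+
+impl LpirModule {
+    /// Looks a local function up by its stable id.
+    pub fn local_function(&self, id: FuncId) -> Option<&IrFunction> {
+        self.functions.get(&id)
+    }
+}
+
+#[derive(Debug, Default)]
+pub struct ModuleBuilder {
+    next_func_id: u16,
+    functions: BTreeMap<FuncId, IrFunction>,
+}
+
+impl ModuleBuilder {
+    /// Allocates the next unused `FuncId` and stores the function under it.
+    pub fn add_function(&mut self, func: IrFunction) -> CalleeRef {
+        let id = FuncId(self.next_func_id);
+        // Overflow check: panics instead of silently reusing an id.
+        self.next_func_id = self
+            .next_func_id
+            .checked_add(1)
+            .expect("FuncId space exhausted");
+        self.functions.insert(id, func);
+        CalleeRef::Local(id)
+    }
+
+    /// Moves the accumulated map into the finished module.
+    pub fn finish(self) -> LpirModule {
+        LpirModule { functions: self.functions }
+    }
+}
+```
+
+Because ids come from a monotonic counter and the map iterates in key order, iteration in this sketch matches insertion order as long as nothing is deleted.
+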
+## Tests to write
+
+- (Defer to phase 2 if core lands first without a compiling `lpir` crate.) Unit tests on builder: two `add_function` calls receive distinct `FuncId`s and both appear in the finished module’s map.
+
+## Validate
+
+Optional until the crate compiles again (usually after phase 2):
+
+```bash
+cargo test -p lpir
+```
diff --git a/docs/plans/2026-04-15-lpir-inliner-stage-i/02-lpir-print-parse-validate-interp.md b/docs/plans/2026-04-15-lpir-inliner-stage-i/02-lpir-print-parse-validate-interp.md
new file mode 100644
index 000000000..6593a4806
--- /dev/null
+++ b/docs/plans/2026-04-15-lpir-inliner-stage-i/02-lpir-print-parse-validate-interp.md
@@ -0,0 +1,26 @@
+# Phase 2 — LPIR print, parse, validate, interpreter, tests
+
+## Scope of phase
+
+Complete `lpir` crate: printing and parsing `CalleeRef`, validation of `Call` targets via `ImportId`/`FuncId`, interpreter local call dispatch, and update all `lpir` unit tests (`src/tests/*.rs`).
+
+## Code organization reminders
+
+- Match existing style in `print.rs` / `parse.rs` (indentation, keyword names).
+- Validation control-flow stack unchanged except `Call` target check (match enum, bounds on ImportId, key present for Local).
+
+## Implementation details
+
+- **`print.rs`:** `callee_name` match on enum; iterate `module.functions` with `.iter()` (pairs `(FuncId, &IrFunction)`). Preserve any ordering expectations (e.g. sorted by `FuncId` for stable output).
+- **`parse.rs`:** build `CalleeRef::Import(ImportId(i))` / `Local(FuncId(i))` per name table; remove `import_count + local_index` flat math.
+- **`validate.rs`:** resolve local callee via `FuncId`; `total` / indexing fixes where it assumed `Vec` index space.
+- **`interp.rs`:** replace `callee_as_function` + `functions[fi]` with `FuncId` map lookup; dereference `callee` op field (may need `*` if pattern matched refs).
+- **Tests:** replace every `CalleeRef(n)` with enum constructors; fix `m.functions[0]` → get by `FuncId` or iterate.
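+
+The `Call` target check and `callee_name` resolution both reduce to one match on the enum. A minimal sketch, with placeholder `ImportDecl` / `IrFunction` types and a hypothetical `String` error (the real validator has its own error type):
+
+```rust
+use std::collections::BTreeMap;
+
+#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
+pub struct FuncId(pub u16);
+
+#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord, Hash)]
+pub struct ImportId(pub u16);
+
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
+pub enum CalleeRef {
+    Import(ImportId),
+    Local(FuncId),
+}
+
+// Placeholders: only the name matters for this example.
+pub struct ImportDecl {
+    pub name: String,
+}
+
+pub struct IrFunction {
+    pub name: String,
+}
+
+pub struct LpirModule {
+    pub imports: Vec<ImportDecl>,
+    pub functions: BTreeMap<FuncId, IrFunction>,
+}
+
+/// Resolves a callee to its name, or reports why the reference is invalid:
+/// imports are bounds-checked against the vector, locals must be map keys.
+pub fn callee_name(module: &LpirModule, callee: CalleeRef) -> Result<&str, String> {
+    match callee {
+        CalleeRef::Import(ImportId(i)) => module
+            .imports
+            .get(usize::from(i))
+            .map(|decl| decl.name.as_str())
+            .ok_or_else(|| format!("import id {i} out of bounds")),
+        CalleeRef::Local(id) => module
+            .functions
+            .get(&id)
+            .map(|func| func.name.as_str())
+            .ok_or_else(|| format!("unknown local function id {}", id.0)),
+    }
+}
+```
+
+No `import_count + local_index` arithmetic is left in this version; each arm checks its own index space.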
+
+## Tests to write
+
+- Existing tests updated; add one test that parses/reprints a module with mixed import + local call if not already covered.
+
+## Validate
+
+Target state for this phase: **`cargo test -p lpir` passes.** (Still OK if the rest of the workspace is red until later phases.)
diff --git a/docs/plans/2026-04-15-lpir-inliner-stage-i/03-lpvm-native-wasm-frontend.md b/docs/plans/2026-04-15-lpir-inliner-stage-i/03-lpvm-native-wasm-frontend.md
new file mode 100644
index 000000000..05dfad465
--- /dev/null
+++ b/docs/plans/2026-04-15-lpir-inliner-stage-i/03-lpvm-native-wasm-frontend.md
@@ -0,0 +1,30 @@
+# Phase 3 — lpvm-native, lpvm-wasm, lps-frontend
+
+## Scope of phase
+
+Migrate the main compiler path: native lowering and compile pipeline, WASM emit/compile (IR↔meta ordering), and GLSL lowering that builds `CalleeRef` / iterates IR functions.
+
+## Code organization reminders
+
+- In `lower.rs`, keep `resolve_callee_name` and `callee_return_uses_sret` structure; swap implementation to enum match.
+- For `lpvm-wasm/compile.rs`, document or preserve **zip order** between `ir.functions` and `meta.functions`—after map change, define order explicitly (e.g. sort by `FuncId` then zip with meta sorted the same way, or match by **name** if that is the existing contract—**verify in code before shipping**).
+
+## Implementation details
+
+- **`lpvm-native`:** `lower.rs`, `compile.rs`, `link.rs`, `regalloc/render.rs` (clone or iterate map), `debug_asm.rs`, `rt_jit/*`, `rt_emu/*`—replace `functions[idx]` with `FuncId`→lookup or ordered vec of `(FuncId, &IrFunction)` where linear index is still needed for ABI tables.
+- **`lpvm-wasm`:** `emit/mod.rs`, `emit/imports.rs`, `emit/ops.rs`, `compile.rs`, runtime `instance.rs` files.
+- **`lps-frontend`:** `lower.rs`, `lower_ctx.rs`, `lower_lpfx.rs`—construct typed `CalleeRef`; any `ir.functions.len()` / indexing in tests (`lib.rs`).
+
+## Tests to write
+
+- Rely on crate tests; fix breakages from API change.
+
+## Validate
+
+When these crates compile again:
+
+```bash
+cargo test -p lpvm-native
+cargo test -p lpvm-wasm
+cargo test -p lps-frontend
+```
diff --git a/docs/plans/2026-04-15-lpir-inliner-stage-i/04-cranelift-emu-instances.md b/docs/plans/2026-04-15-lpir-inliner-stage-i/04-cranelift-emu-instances.md
new file mode 100644
index 000000000..97699bb75
--- /dev/null
+++ b/docs/plans/2026-04-15-lpir-inliner-stage-i/04-cranelift-emu-instances.md
@@ -0,0 +1,32 @@
+# Phase 4 — lpvm-cranelift, lpvm-emu, remaining instances
+
+## Scope of phase
+
+Update `lpvm-cranelift` (`module_lower.rs`, `emit/call.rs`, `call.rs`, `lpvm_instance.rs`) and `lpvm-emu` / any remaining `ir.functions[usize]` paths. Disambiguate **`cranelift_module::FuncId`** vs **`lpir::FuncId`** using imports (`use lpir::FuncId as LpirFuncId` or fully qualified paths).
+
+## Code organization reminders
+
+- `LpirFuncEmitOrder::Source` today means vec order; redefine as **sorted `FuncId` order** (matches monotonic assignment) or explicit vec of ids—**document in code comment** so JIT/object order stays deterministic.
+
+## Implementation details
+
+- **`module_lower.rs`:** `indices: Vec<usize>` becomes `Vec<FuncId>` or `Vec<(FuncId, usize)>`; `ir.functions[i]` → `ir.functions.get(&id)`; `id_at_ir` keyed by something stable—may become `BTreeMap<LpirFuncId, cranelift_module::FuncId>` or vec indexed by emit order with parallel `LpirFuncId` list.
+- **`emit/call.rs`:** local callee index → `FuncId` + map lookup.
+- **`lpvm-emu` / instances:** same patterns as `rt_emu` (phase 3); ensure name→IR lookup still works.
+
+## Tests to write
+
+```bash
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-emu
+```
+
+## Validate
+
+When applicable:
+
+```bash
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-emu
+cargo test -p lpvm
+```
diff --git a/docs/plans/2026-04-15-lpir-inliner-stage-i/05-cleanup-and-validation.md b/docs/plans/2026-04-15-lpir-inliner-stage-i/05-cleanup-and-validation.md
new file mode 100644
index 000000000..8a3a20ebe
--- /dev/null
+++ b/docs/plans/2026-04-15-lpir-inliner-stage-i/05-cleanup-and-validation.md
@@ -0,0 +1,41 @@
+# Phase 5 — Cleanup, filetests, firmware check, summary
+
+## Scope of phase
+
+Remove `TODO` / stray debug, fix warnings introduced by the refactor, run full validation from M0 roadmap, write `summary.md`, and move this plan directory to `docs/plans-done/` per project convention. Optional: **commit** with Conventional Commits message when implementation is complete.
+
+This phase is the **first gate** where the **entire workspace** touched by the refactor must be green (see commands below). Earlier phases may leave the build broken.
+
+## Cleanup & validation
+
+- Grep diff for `FIXME`, `TODO`, `dbg!`, `println!` used for debugging.
+- Ensure no unused imports after renames (especially `FuncId` in cranelift files).
+- **Full matrix:**
+
+```bash
+cargo test -p lpir
+cargo test -p lpvm-native
+cargo test -p lpvm-wasm
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-emu
+cargo test -p lps-frontend
+cargo test -p lps-filetests -- --test-threads=4
+cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf --profile release-esp32 --features esp32c6,server
+```
+
+Adjust the commands if the workspace uses different crate paths or feature flags.
+
+## Plan cleanup
+
+- Add `summary.md` bullet list: what merged, crates touched, any follow-ups (e.g. M5 dead elim).
+- Move `docs/plans/2026-04-15-lpir-inliner-stage-i/` → `docs/plans-done/2026-04-15-lpir-inliner-stage-i/` when work is complete.
+
+## Commit (when requested)
+
+```
+refactor(lpir): stable CalleeRef with ImportId and FuncId
+
+- Replace flat CalleeRef(u32) with enum Import/Local
+- Store local functions in BTreeMap<FuncId, IrFunction>
+- Update backends and frontend for new module layout
+```
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/00-design.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/00-design.md
new file mode 100644
index 000000000..2e9f13b10
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/00-design.md
@@ -0,0 +1,182 @@
+# LPIR Inliner — Stage III Design (M3 + M2.5)
+
+Source roadmap: `docs/roadmaps/2026-04-15-lpir-inliner/m3-inlining-pass.md`
+plus `m2.5-continuing-marker.md` (folded in as Phase 1).
+
+Question / answer trail in [00-notes.md](00-notes.md).
+
+## Scope of work
+
+Implement the LPIR inlining pass: `lpir::inline_module(&mut LpirModule,
+&InlineConfig) -> InlineResult`. Bottom-up, never deletes functions, never
+hard-errors, fully structural offset recompute, per-param scan-then-alias-or-copy,
+heuristic-driven with debug-level decision logging.
+
+Bundled prerequisite: M2.5's `LpirOp::Continuing` marker, which adds the
+final piece of structural symmetry needed for offset recompute (loops gain
+a marker for the start of their continuing block, mirroring `Else` for
+ifs). Backends and interpreter keep using the cached
+`LoopStart::continuing_offset` field unchanged.
+
+Out of scope (deferred): wiring into `lpvm-native::compile_module` (M4),
+GLSL filetests with `compile-opt` annotations (M4), perf measurement on
+real shaders (M4 step 3), `func_weight` empirical tuning (M3.1), dead
+function elimination (M5), inline-and-delete-as-we-go (future-work),
+removing offset fields entirely (future-work).
+
+## File structure
+
+```
+lp-shader/
+├── lpir/
+│   └── src/
+│       ├── lib.rs                            # UPDATE: re-export inline_module / InlineResult
+│       ├── lpir_op.rs                        # UPDATE (M2.5): + LpirOp::Continuing variant
+│       ├── builder.rs                        # UPDATE (M2.5): push_continuing emits the marker
+│       ├── parse.rs                          # UPDATE (M2.5): existing `continuing:` token → marker
+│       ├── print.rs                          # UPDATE (M2.5): print marker, drop offset detection
+│       ├── validate.rs                       # UPDATE (M2.5): exhaustive matches + nesting check
+│       ├── interp.rs                         # UPDATE (M2.5): Continuing => pc += 1
+│       ├── const_fold.rs                     # UPDATE (M2.5): conservative-clear arm
+│       ├── inline/                           # NEW: the inliner
+│       │   ├── mod.rs                        #   public API + orchestration loop
+│       │   ├── callgraph.rs                  #   callees-of, callers-of, topological order, cycle detection
+│       │   ├── offsets.rs                    #   recompute_offsets(&mut [LpirOp]) — reusable
+│       │   ├── remap.rs                      #   scan_param_writes, build_remap, remap_op
+│       │   ├── splice.rs                     #   inline_call_site (the splicer)
+│       │   └── heuristic.rs                  #   func_weight, Decision, should_inline
+│       └── tests/
+│           ├── inline_basic.rs               # NEW: void / single-return / multi-return / nested
+│           ├── inline_callgraph.rs           # NEW: cycles, diamond, chains
+│           ├── inline_remap.rs               # NEW: vmctx alias, slot remap, pool splice via imports
+│           ├── inline_heuristic.rs           # NEW: thresholds, budgets, mode=Never/Always/Auto
+│           ├── inline_offsets.rs             # NEW: recompute_offsets correctness
+│           └── inline_param_writes.rs        # NEW: read-only alias vs mutated copy
+├── lpvm-native/
+│   └── src/
+│       └── lower.rs                          # UPDATE (M2.5): no-op match arm for Continuing
+├── lpvm-wasm/
+│   └── src/
+│       └── emit/
+│           └── ops.rs                        # UPDATE (M2.5): no-op match arm for Continuing
+└── lpvm-cranelift/
+    └── src/
+        └── emit/
+            └── control.rs                    # UPDATE (M2.5): no-op match arm for Continuing
+```
+
+## Conceptual architecture
+
+```
+                  inline_module(&mut LpirModule, &InlineConfig) -> InlineResult
+                                          │
+                                          ▼
+        ┌──────────────────────── inline/mod.rs ────────────────────────┐
+        │                                                                │
+        │   ┌───────────────┐   ┌───────────────────────────────────┐   │
+        │   │  callgraph.rs │   │            heuristic.rs           │   │
+        │   │  build_graph  │──▶│  func_weight  │  should_inline    │   │
+        │   │  topo_order   │   │               │  Decision         │   │
+        │   │  detect_cycles│   └───────────────────────────────────┘   │
+        │   └───────┬───────┘              │                             │
+        │           │                      │                             │
+        │           ▼                      ▼                             │
+        │   For each callee in topo order, for each caller of that      │
+        │   callee, if Decision::Inline:                                 │
+        │           │                                                    │
+        │           ▼                                                    │
+        │   ┌─────────────────────── splice.rs ──────────────────────┐  │
+        │   │  inline_call_site(caller, callee, call_op_idx, …):     │  │
+        │   │      ① scan_param_writes(callee)        (remap.rs)     │  │
+        │   │      ② build_remap(...)                 (remap.rs)     │  │
+        │   │      ③ analyze return shape (0 / 1-at-end / multi)     │  │
+        │   │      ④ build scratch Vec<LpirOp>:                      │  │
+        │   │           - per-param Copy (if written) or alias       │  │
+        │   │           - clone+remap callee body, splicing pool     │  │
+        │   │             entries into caller.vreg_pool              │  │
+        │   │           - rewrite Return → Copy (+ ExitBlock if      │  │
+        │   │             multi); wrap in Block { _ } / End if multi │  │
+        │   │      ⑤ caller.body.splice(call_idx..=call_idx, scratch)│  │
+        │   └────────────────────────────────────────────────────────┘  │
+        │           │                                                    │
+        │           ▼                                                    │
+        │   After all call sites of all callees processed:               │
+        │   For each mutated function:                                   │
+        │       recompute_offsets(&mut func.body)   (offsets.rs)         │
+        │           │                                                    │
+        │           ▼                                                    │
+        │   Return InlineResult { functions_inlined, ... }               │
+        └────────────────────────────────────────────────────────────────┘
+```
+
+## Key invariants enforced by the orchestration
+
+- **Bottom-up topological order:** callee fully inlined before caller
+  processes it. Single bottom-up pass.
+- **Cycle nodes left alone** (Q3); counted in
+  `result.functions_skipped_recursive`. Logged at `debug!`.
+- **`module_op_budget`** checked between callees; sets `budget_exceeded`
+  on overflow and stops the pass. Bottom-up means partial result still
+  has the highest-leverage inlinings.
+- **`growth_used`** accumulated across multi-callsite inlinings (Q11).
+- **All original `IrFunction`s retained** in `module.functions`. No
+  deletion. M5's job.
+- **`debug_assert!`s** on internal invariants: remap arity matches
+  `callee.vreg_types.len()`, control-flow stack empty at end of recompute,
+  pool splice arity matches, vmctx slot of `param_writes` is `false`,
+  every spliced `Call` op's `args.start` points inside `caller.vreg_pool`.
+
+## Component responsibilities
+
+| Module | Inputs | Outputs / Side effects | Reusable? |
+|--------|--------|------------------------|-----------|
+| `callgraph.rs` | `&LpirModule` | `CallGraph { callers_of, callees_of, topo_order, cyclic_set }` | yes — useful for any module-level pass |
+| `heuristic.rs` | callgraph, `&InlineConfig`, `&mut growth_used`, callee id | `Decision { Inline { extra_growth }, Skip(reason) }` | inliner-specific |
+| `remap.rs` | `&IrFunction` (callee), caller arg vregs, vmctx | `Remap { table: Vec<VReg>, param_copies: Vec<LpirOp> }` | inliner-specific |
+| `splice.rs` | `&mut IrFunction` (caller), callee, call op idx, remap, return-shape | mutates caller body + pool | inliner-specific |
+| `offsets.rs` | `&mut [LpirOp]` | patches all opener offsets in place | yes — also useful for any future structural transform |
+| `mod.rs` | `&mut LpirModule`, `&InlineConfig` | `InlineResult`, mutates module | public API |
+
+## Public API
+
+```rust
+// In lpir/src/inline/mod.rs, re-exported from lib.rs.
+
+pub struct InlineResult {
+    /// Distinct callees whose body was spliced into ≥1 caller this run.
+    pub functions_inlined: usize,
+    /// Total `Call` ops replaced.
+    pub call_sites_replaced: usize,
+    /// Distinct functions skipped due to call-graph cycles.
+    pub functions_skipped_recursive: usize,
+    /// True iff `module_op_budget` was hit and the pass stopped early.
+    pub budget_exceeded: bool,
+}
+
+pub fn inline_module(
+    module: &mut LpirModule,
+    config: &InlineConfig,
+) -> InlineResult;
+```
+
+## Logging contract
+
+- `log::debug!` per-callee decision line (callee name, id, sites, size,
+  decision, reason, growth deltas).
+- `log::info!` end-of-pass summary line (totals + budget usage).
+- No `log::warn!` / `log::error!` — recursion is silently skipped per Q3,
+  budget overflow is signaled via the result field.
+
+## Validation
+
+```bash
+cargo test -p lpir
+cargo test -p lpvm-native
+cargo test -p lpvm-wasm
+cargo test -p lpvm-cranelift
+cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf \
+    --profile release-esp32 --features esp32c6,server
+```
+
+All existing tests must still pass — M3 doesn't wire the inliner into any
+production compile path (that's M4). Behavior is purely additive.
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/00-notes.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/00-notes.md
new file mode 100644
index 000000000..de4bd5464
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/00-notes.md
@@ -0,0 +1,545 @@
+# LPIR Inliner — Stage III Notes (M3: Inlining Pass)
+
+## Summary (shipped 2026-04-17)
+
+**M2.5:** `LpirOp::Continuing` marks the start of a loop’s continuing block; builder, parse, print, validate, interpreter, const-fold, and all three backends handle it (marker is structural; backends still use cached `LoopStart::continuing_offset`).
+
+**M3:** `lpir::inline_module(&mut LpirModule, &InlineConfig) -> InlineResult` plus the crate-private `lpir/src/inline/` submodule (`callgraph`, `offsets`, `remap`, `splice`, `heuristic`): bottom-up topo order, cycle skip, per-param scan with alias-or-copy remap, multi-return `Block`/`ExitBlock`/`End` wrapping, structural `recompute_offsets`, heuristic + `log::debug!` / `log::info!`. Only `inline_module` and `InlineResult` are public from `lpir`; the `inline` module is not re-exported as a path.
+
+Source roadmap: `docs/roadmaps/2026-04-15-lpir-inliner/m3-inlining-pass.md`.
+
+This is the meat of the inliner work. M0 (stable `CalleeRef`) and M2 (`Block` /
+`ExitBlock` ops) are landed; M1 (`compile-opt` + `CompilerConfig`) is landed
+in `lpir`. This stage adds the `lpir/src/inline/` submodule: a module-level pass
+that replaces each eligible local `Call` with the callee's body, in place,
+never deleting functions. Wiring (M4) and dead-function elimination (M5) are out of scope.
+
+## Scope of work
+
+Build `lpir::inline_module(&mut LpirModule, &InlineConfig) ->
+InlineResult` plus everything it needs:
+
+1. Call-graph construction (callees-of, callers-of, call-site count).
+2. Bottom-up topological order (leaves first), with a cycle-skip safety net.
+3. Per-function inlining transform:
+   - VReg remap (vmctx → caller vmctx; params → arg vregs; rest → fresh).
+   - Slot remap (append callee slots to caller, offset by `caller.slots.len()`).
+   - VReg-pool splice for any remaining (import) `Call` ops in the inlined body.
+   - Body splicing with multi-return wrapping (`Block` / `ExitBlock` / `End`).
+4. Single offset-recomputation pass per mutated function (`else_offset`,
+   `end_offset`, `continuing_offset`).
+5. Heuristic decision (`InlineMode::Auto` / `Always` / `Never` + budgets).
+6. Unit tests covering: single-return callee, multi-return callee, callee
+   that calls an import, callee with slots, diamond call graph (A→B,C; B→C),
+   void callee, recursion-skip, post-condition that all original functions
+   remain.
+7. Round-trip safety: parse → inline → validate must succeed for every
+   passing test.
+
+Out of scope (deferred): wiring into `lpvm-native::compile_module`, filetest
+`compile-opt` tagging, perf measurement on `rainbow.glsl`, dead function
+elimination.
+
+## Current state of the codebase
+
+### What's already in place (M0/M1/M2)
+
+- `CalleeRef = Import(ImportId(u16)) | Local(FuncId(u16))`. Stable ids.
+  `LpirModule.functions` is a `BTreeMap<FuncId, IrFunction>` keyed by stable id.
+- `LpirOp::Block { end_offset }` and `LpirOp::ExitBlock` exist with full
+  parser/printer/interp/validator support and lower in all three backends.
+- `CompilerConfig { inline: InlineConfig, .. }` lives in
+  `lpir/src/compiler_config.rs` with `apply(key, value)` plus `FromStr` for
+  `InlineMode`. `InlineConfig` has all the knobs the M3 doc calls for
+  (`mode`, `always_inline_single_site`, `small_func_threshold`,
+  `max_growth_budget`, `module_op_budget`).
+- `FunctionBuilder` already has `push_block` / `push_exit_block` / `end_block`,
+  so the inliner's emitted IR is constructable through normal channels (good
+  for tests).
+- `IrFunction` shape: flat `body: Vec<LpirOp>`; per-function `vreg_types`,
+  `slots`, `vreg_pool`. `vmctx_vreg = VReg(0)`, user params at `v1..=v(param_count)`.
+- `Call { callee, args, results }`: `args` is a `VRegRange` into the caller's
+  `vreg_pool` and **includes vmctx as the first entry** (so for a callee with
+  `param_count = N`, `args.count = 1 + N`). `results` does not include vmctx.
+- `LpirOp::SlotAddr` is the **only** op that references a `SlotId` (slot remap
+  is therefore very targeted).
+
+### What's missing
+
+- No `lpir::inline` module exists today (a glob for `lpir/src/inline*` matches nothing).
+- `LpirOp` has no general "iterate uses / remap vregs" helper — only
+  `def_vreg()`. The inliner needs a `for_each_vreg_mut` (or equivalent
+  per-arm rewrite). const_fold avoids this by replacing-in-place without
+  remap.
+- `validate_module` has no recursion check today; the M3 doc assumes
+  recursion is forbidden upstream (GLSL frontend), but our inliner must
+  defend itself anyway because a malformed test or hand-written LPIR could
+  contain it. We detect cycles in topo-sort and skip those nodes.
+- No offset-recompute helper exists today. The builder patches offsets as it
+  goes; const_fold preserves length so doesn't need it. We have to write one.
+
+### Existing call-overhead context
+
+`rainbow.glsl` is the canonical perf target (many tiny helper calls). M3
+doesn't measure perf — that's M4 — but the design must allow significant
+shrinkage there. Per-call overhead on rv32n.q32 is ~18-24 instructions today.
+
+### Pipeline integration (preview)
+
+`lpvm-native/src/compile.rs::compile_module` clones the IR module before
+per-function compilation; that's the natural place to insert
+`inline_module(&mut ir_opt, &options.config.inline)` once M4 lands. We don't
+modify `compile.rs` in this stage; the unit tests call `inline_module`
+directly.
+
+## Questions
+
+### Q1: Where to compute callee body length for the heuristic, and what counts as an "op"?
+
+**Context.** `InlineConfig::small_func_threshold` and `max_growth_budget` are
+phrased in "ops". Some `LpirOp` variants are pure markers (`Else`, `End`,
+`Break`, `Continue`, `ExitBlock`, the `*Start` openers); some lower to many
+machine instructions (`Call` to an import, `Memcpy`). Definition matters for
+threshold tuning later.
+
+**Resolution.** Land M3 with the simplest possible metric and defer
+weighting to a small empirical follow-up:
+
+- Single private function `func_weight(&IrFunction) -> u32` whose body is
+  `f.body.len() as u32`. The heuristic and budgets all go through it.
+- Tracked as **M3.1** (`docs/roadmaps/2026-04-15-lpir-inliner/m3.1-tune-inline-weights.md`):
+  build a small `filetests/debug/inline-weights.glsl` corpus, dump
+  `lp-cli shader-debug --lpir --asm`, tabulate `lpir_ops` vs candidate
+  `weighted_ops` vs `rv32n_insns`, pick the simplest weighting that
+  correlates well, swap the body of `func_weight`, retune
+  `small_func_threshold`. Independent of M4 (no inliner wiring required).
+- Default `small_func_threshold` stays at 20 in M3; M3.1 will revise.
+
+### Q2: How to lay out the `inline` module — single file or submodule?
+
+**Context.** The roadmap says `lpir/src/inline.rs`. The transform has several
+distinct concerns: call-graph build, topo order, vreg/slot/pool remap,
+splice, offset recompute, heuristic. Keeping them in one file is fine if it
+stays under ~600 lines, otherwise it gets unwieldy.
+
+**Resolution.** Submodule layout:
+
+```
+lpir/src/inline/
+├── mod.rs         # public API: inline_module, InlineResult; orchestration
+├── callgraph.rs   # build callees-of / callers-of, topological order, cycle detection
+├── remap.rs       # VReg + SlotId + vreg_pool remap helpers
+├── splice.rs      # body cloning + multi-return Block/ExitBlock wrapping
+├── offsets.rs     # single-pass offset recompute (reusable)
+└── heuristic.rs   # InlineConfig decisions, func_weight, budget accounting
+```
+
+Each helper file is small and individually unit-testable. Tests live in
+`lpir/src/tests/inline_*.rs` mirroring the `block_ops.rs` pattern.
+
+### Q3: Recursion / cycle handling — error or skip silently?
+
+**Context.** GLSL forbids recursion, so the frontend should never produce a
+cycle. But the inliner gets handed an `LpirModule`, not GLSL. The M3 doc
+says "If cycles exist (shouldn't in GLSL — recursion is forbidden), skip
+them." There's no `ValidationError::Recursion` today.
+
+**Resolution.** Skip silently and log at `debug!`. Detect cycles by spotting
+any function that remains unprocessable once all leaves are exhausted in the
+topological walk; leave its `Call` ops untouched. Record the count in
+`InlineResult.functions_skipped_recursive` for visibility. Other (non-GLSL)
+frontends writing to LPIR are theoretically possible, so failing hard would
+be punishing — defense-in-depth without breakage. Adding a validator check
+belongs in a separate change.
+
+### Q4: When the call-site arg vreg already matches the remapped param, do we still emit `Mov`?
+
+**Context.** The roadmap's 3a says "`v1..v(param_count)` → map to the actual
+argument vregs from the `Call`'s `args` range" — i.e., **no `Mov`**, the
+remap table just aliases the callee param vreg to the caller arg vreg. The
+roadmap's 3c then says "Argument moves: For each user parameter, emit `Mov
+{ dst: remapped_param_vreg, src: arg_vreg }`. (If remapping maps params
+directly to arg vregs, these can be skipped.)" These two statements are
+consistent only if you pick one strategy.
+
+**Resolution.** Per-param scan-then-alias-or-copy. LPIR is **not SSA** and
+the frontend's `param_aliases` optimization (`lps-frontend/src/lower_ctx.rs`
+`scan_param_argument_indices`) deliberately makes by-value GLSL params
+mutable in LPIR — `t = t * 2.0` lowers to `v1 = fmul v1, v2_const` writing
+the param vreg in place. Blind aliasing (strategy A) is therefore a
+correctness bug. Blanket copying (strategy B) is safe but leaves easy
+performance wins on the table (constant args don't const-fold through the
+inserted `Copy`).
+
+The scan:
+
+```rust
+fn scan_param_writes(callee: &IrFunction) -> Vec<bool> {
+    let n = 1 + callee.param_count as usize;
+    let mut written = vec![false; n];
+    for op in &callee.body {
+        if let Some(v) = op.def_vreg() {
+            let i = v.0 as usize;
+            if i < n { written[i] = true; }
+        }
+        if let LpirOp::Call { results, .. } = op {
+            for v in callee.pool_slice(*results) {
+                let i = v.0 as usize;
+                if i < n { written[i] = true; }
+            }
+        }
+    }
+    written
+}
+```
+
+Per-param remap decision:
+
+- `remap[0] = caller_vmctx_vreg` always (vmctx is opaque pointer; user code
+  never writes it; `debug_assert!(!written[0])`).
+- For each user param `i`:
+  - `written[1 + i] == false` → alias `remap[1 + i] = caller_arg_vreg[1 + i]`.
+    Zero overhead, const-fold sees through.
+  - `written[1 + i] == true`  → allocate fresh vreg in caller, prepend
+    `Copy { dst: fresh, src: caller_arg }` to spliced body, set
+    `remap[1 + i] = fresh`. Correctness guaranteed.
+- `remap[rest] = fresh` (callee locals always get fresh caller vregs).
+
+Properties: one extra O(n) pass per callee, ~50 LOC, tested via dedicated
+unit tests (`scan_param_writes_*`, `inline_aliases_readonly_params`,
+`inline_copies_mutated_param_only`). Bottom-up traversal keeps the analysis
+correct even for callees that already had their own callees spliced in —
+splices add fresh vregs only, never write to the outer callee's params.
+
+### Q5: Use `LpirOp::Copy` or `LpirOp::Mov` for the return-value plumbing?
+
+**Context.** I keep saying "Mov" but LPIR's actual move op is `LpirOp::Copy
+{ dst, src }` (verified in `lpir_op.rs` and `const_fold.rs`). There is no
+`Mov`.
+
+**Resolution.** Use `LpirOp::Copy` everywhere — for the per-param
+pre-copies (Q4) and for the result moves at the end of the inlined body.
+No new opcode. Mentally substitute `Copy` wherever the M3 doc says `Mov`.
+
+### Q6: Multi-return wrapping — when exactly do we need `Block` / `ExitBlock`?
+
+**Context.** The M3 doc says "If the callee has exactly one `Return` at the
+end, no `Block`/`EndBlock` wrapper is needed." Otherwise we wrap the body
+in `Block { end_offset: _ }` and rewrite each `Return` as
+"copies to caller results, then `ExitBlock`". The trailing `End` falls
+through to the post-call moves.
+
+**Resolution.** Three cases, decided by a single piggybacked scan on the
+callee body (same pass as Q4's `scan_param_writes` — count `Return` ops,
+note the position of the last one):
+
+| Callee return shape | Splice strategy |
+|--|--|
+| **0 returns** (void) | Splice body. No wrapper. No result `Copy`s. |
+| **Exactly 1 `Return` and it's the last op** | Splice body without the trailing `Return`. Replace it with `Copy { dst: caller_result_vreg[k], src: remap[callee_return_vreg[k]] }` for each return value. No wrapper. **Most common case.** |
+| **≥1 `Return`, not the unique-final pattern** | Emit `Block { end_offset: 0 }`. Splice body; replace each `Return` with the result `Copy`s followed by `ExitBlock`. Close with `End`. Caller's fall-through is the op after `End`. |
+
+Notes:
+
+- The `end_offset` on the opened `Block` gets patched by the offset-recompute
+  pass (Q10), not the splicer — splicer emits `Block { end_offset: 0 }`.
+- "1 return at the end of the body" is the GLSL pattern for almost every
+  helper (`paletteHeatmap`, `paletteRainbow`, `applyPalette`'s arms, etc.),
+  so the wrapper-free path is the hot one.
+- Multi-return case correctly handles GLSL early-return idioms
+  (`if (cond) return X; ... return Y;`).
+
+### Q7: Do we re-validate after inlining inside the pass, or trust the contract?
+
+**Context.** const_fold doesn't re-validate. But inlining does much more
+structural work and is much easier to get subtly wrong (offset patching,
+vreg remap arity, slot count).
+
+**Resolution.** Tiered validation:
+
+- **Production callers (M4 wiring):** no `validate_module` after the pass —
+  doubles work for no benefit; the pass owns its output's correctness.
+- **Unit tests:** always call `validate_module` after `inline_module`. Cheap
+  insurance with good error messages.
+- **Inside the pass:** `debug_assert!`s on internal invariants the validator
+  doesn't know about (remap table size = `callee.vreg_types.len()`,
+  control-flow stack empty at end of offset recompute, pool splice arity
+  matches, vmctx slot of `written` bitset is `false`, etc.). Free in
+  release, loud in debug.
+
+### Q8: What happens to the bottom-up order when a function calls itself indirectly via an import?
+
+**Context.** Imports are external; we never inline them. Calls to imports
+are leaves of the local call graph regardless of what the import does.
+
+**Resolution.** The call graph only tracks `CalleeRef::Local` edges. Import
+`Call` ops are leaves; they stay as-is in the inlined body with `vreg_pool`
+entries remapped and appended (Q9). LPIR has no re-entrant import path
+today, and even if a host did re-enter, we'd have no IR to optimize against
+— so this is the only sensible policy.
+
+### Q9: How do we splice the callee's `vreg_pool` entries safely?
+
+**Context.** The callee's body contains `Call` ops (to imports — local ones
+are already inlined since we go bottom-up) and `Return` ops, both of which
+reference `vreg_pool` slices via `VRegRange { start, count }`. When we
+copy the callee's body into the caller, those `start` offsets are wrong.
+
+**Resolution.** Single linear pass through the callee body, cooperating
+with the splicer's main loop:
+
+- **`Call { callee, args, results }`** — read both callee pool slices,
+  remap each `VReg`, append remapped vregs to the *caller's* `vreg_pool`.
+  Rewrite the op with `start = new pool position`; counts unchanged.
+- **`Return { values }`** — never appears in spliced body verbatim. Read
+  the callee pool slice once, remap, use values directly to emit result
+  `Copy`s (and `ExitBlock` in multi-return case per Q6). Nothing appended
+  to caller's pool for this op.
+- **All other ops** — no pool references. Just remap `VReg` fields in
+  place.
+
+Implementation pattern: emit spliced ops into a `Vec<LpirOp>` scratch
+buffer, growing `caller.vreg_pool` as we go. Then a single `splice` on
+`caller.body` replaces the original `Call` op with the scratch contents.
+Pool entries become valid the moment the scratch op gets its `start`
+offset, so there's no "patch start offsets after the fact" step.
+
+The caller's existing pool entries (for ops outside the spliced range) are
+unaffected — `vreg_pool` is append-only from the inliner's POV.
+
+### Q10: How do we recompute control-flow offsets after splicing?
+
+**Context.** After splicing, every `IfStart`, `LoopStart`, `SwitchStart`,
+`CaseStart`, `DefaultStart`, and `Block` op may have stale `else_offset` /
+`end_offset` / `continuing_offset` values, since we've inserted ops.
+
+**Resolution.** Fully structural recompute pass, made possible by the
+**M2.5 prerequisite** (`docs/roadmaps/2026-04-15-lpir-inliner/m2.5-continuing-marker.md`):
+
+- M2.5 adds `LpirOp::Continuing` as a marker op so loops have parity with
+  if-else (which has the `Else` marker). Backends keep using the cached
+  `LoopStart::continuing_offset` field unchanged. The marker is purely so
+  any pass that reshapes the body (today: the inliner) can rebuild every
+  cached offset structurally with no special cases.
+- M3 then ships `inline/offsets.rs` with one function:
+
+```
+fn recompute_offsets(body: &mut [LpirOp]):
+  stack: Vec<(Kind, idx)> = []
+  for (i, op) in body.iter().enumerate():
+    match op:
+      IfStart       -> push (If, i)
+      LoopStart     -> push (Loop, i, continuing=None)
+      SwitchStart   -> push (Switch, i, pending_case=None)
+      Block         -> push (Block, i)
+      Else          -> top must be (If, i0); body[i0].else_offset = i;
+                       replace top with (Else, i0)
+      Continuing    -> top must be (Loop, i0, _); store i in stack frame
+      CaseStart     -> patch top.pending_case.end_offset = i;
+                       set top.pending_case = i
+      DefaultStart  -> same
+      End           -> pop top:
+        (If, i0)        -> body[i0].else_offset = i; body[i0].end_offset = i+1
+        (Else, i0)      -> body[i0].end_offset = i+1
+        (Loop, i0, c)   -> body[i0].continuing_offset = c.unwrap_or(i0+1);
+                            body[i0].end_offset = i+1
+        (Switch, i0, p) -> patch p.end_offset = i (if any);
+                            body[i0].end_offset = i+1
+        (Block, i0)     -> body[i0].end_offset = i+1
+  debug_assert!(stack.is_empty())
+```
+
+Single forward pass, O(body.len()); the small stack is the only allocation. Lives in
+`inline/offsets.rs`. Reusable by any future structural transform.
+
+The M3 plan **depends on M2.5 landing first** — M2.5 is a small,
+mechanical change (~9 files, similar shape to M2 itself).
+
+### Q11: How do we handle the heuristic budgets (multi-call-site growth)?
+
+**Context.** `max_growth_budget` caps total growth from multi-site
+inlining; `module_op_budget` aborts entirely if module total exceeds it.
+Single-site inlining is always free in code-size terms because the original
+will (eventually) be deleted by M5.
+
+**Resolution.** Per-callee in topological (bottom-up) order:
+
+- `body_size = func_weight(callee)` (M3: `body.len()`; M3.1 will tune).
+- `local_call_sites` = number of local call sites of the callee in the call graph.
+- `extra_growth = max(0, local_call_sites - 1) * (body_size - 1)`
+  (first site is "free" because the original gets pruned by M5; each
+  subsequent site replaces a `Call` op with `body_size` ops).
+
+Decision in `heuristic.rs::should_inline(callee_id, callgraph, config,
+growth_used) -> Decision` returning the verdict + projected delta:
+
+```
+match config.mode:
+  Never  -> Skip("config: mode=never")
+  Always -> Inline { extra_growth: 0 }     // budgets ignored
+  Auto:
+    if local_call_sites == 0:
+        return Skip("no callers")
+    if local_call_sites == 1 && always_inline_single_site:
+        return Inline { extra_growth: 0 }   // single-site is free
+    if body_size <= small_func_threshold:
+        return Inline { extra_growth }      // small enough regardless
+    if let Some(budget) = max_growth_budget:
+        if growth_used + extra_growth > budget:
+            return Skip("max_growth_budget exhausted")
+    Inline { extra_growth }
+```
+
+Caller updates `growth_used += extra_growth` only on `Inline`.
+
+`module_op_budget`: check the running sum of all functions' `func_weight`
+before processing each callee. If exceeded, set
+`result.budget_exceeded = true` and stop the pass entirely. Bottom-up
+order means we've already done the leaves (highest leverage), so a
+partial result is still useful.
+
+**Debug logging.** At `log::debug!` level in the orchestration loop, emit
+one line per decision so a future debugging session has a paper trail:
+
+```
+[lpir-inline] callee=@paletteHeatmap (id=3) sites=4 size=14
+              decision=inline reason=small_func_threshold
+              extra_growth=39 growth_used=0 -> 39
+[lpir-inline] callee=@bigHelper (id=11) sites=3 size=180
+              decision=skip   reason=max_growth_budget_exhausted
+              would_grow=358 budget=300 used=212
+[lpir-inline] callee=@only_caller_helper (id=7) sites=1 size=92
+              decision=inline reason=single_site
+              extra_growth=0
+```
+
+Plus a single `log::info!` summary at the end:
+
+```
+[lpir-inline] inlined 12 functions across 38 call sites,
+              skipped 2 (1 recursive, 1 over budget),
+              growth_used=412 / module_total=2104 ops
+```
+
+The structured fields make it grep-friendly without needing a parser.
+Logging lives in `inline/mod.rs` (the orchestrator), not in
+`heuristic.rs` — the heuristic returns enough info (`Decision` carries
+the reason) for the orchestrator to log.
+
+### Q12: Result shape — what does `InlineResult` track?
+
+**Context.** Roadmap declares:
+
+```rust
+pub struct InlineResult {
+    pub functions_inlined: usize,
+    pub call_sites_replaced: usize,
+    pub budget_exceeded: bool,
+}
+```
+
+**Resolution.** The roadmap shape plus `functions_skipped_recursive`:
+
+```rust
+pub struct InlineResult {
+    /// Distinct callees whose body was spliced into ≥1 caller this run.
+    pub functions_inlined: usize,
+    /// Total `Call` ops replaced.
+    pub call_sites_replaced: usize,
+    /// Distinct functions skipped due to call-graph cycles (Q3).
+    pub functions_skipped_recursive: usize,
+    /// True iff `module_op_budget` was hit and the pass stopped early (Q11).
+    pub budget_exceeded: bool,
+}
+```
+
+No `Result<_, InlineError>` — we never hard-error (Q3 silently skips
+recursion; Q11 signals budget overrun via the field, not an error).
+
+### Q13: How do we want to test? In-process LPIR or via parser?
+
+**Context.** Tests can build `LpirModule` either via `ModuleBuilder` (Rust
+API, terse, type-safe) or by parsing LPIR text (matches what production
+sees). M2 tests parsed text for round-trip and built directly for in-depth
+work.
+
+**Resolution.** Mix per concern, all in-process LPIR (no GLSL compile):
+
+```
+lpir/src/tests/
+├── inline_basic.rs        # parser-based: void, single-return, multi-return, nested
+├── inline_callgraph.rs    # builder-based: cycles, diamond (A→B,C; B→C), chains
+├── inline_remap.rs        # parser-based: vmctx alias, slot remap, pool splice via imports
+├── inline_heuristic.rs    # builder-based: thresholds, budgets, mode=Never/Always/Auto
+├── inline_offsets.rs      # builder-based: hand-built bodies, run recompute_offsets, assert
+└── inline_param_writes.rs # parser-based: read-only params alias, mutated params copy (Q4)
+```
+
+All wired via `lpir/src/tests.rs`. Pattern matches `block_ops.rs`:
+parse → inline → validate → interp → assert.
+
+**GLSL filetests are M4.** That's where `compile_module` gets the inliner
+wired in and we get end-to-end semantic coverage on real shaders
+(`rainbow.glsl` etc.) and where `// compile-opt(inline.mode, …)`
+annotations come into play.
+
+### Q14: Should `inline_module` clone the input or always mutate?
+
+**Context.** The roadmap's signature is `inline_module(&mut LpirModule,
+&InlineMode)`. `lpvm-native::compile_module` already does
+`let mut ir_opt = ir.clone();` before per-function compile.
+
+**Resolution.** Take `&mut LpirModule` as the roadmap declares. Mutating
+in place is critical for embedded targets where every clone of an
+`LpirModule` is a real cost on a constrained heap. The caller (M4 wiring)
+clones once at the start of `compile_module` if it needs the original
+preserved; the inliner does no internal cloning of the module structure.
+
+Future optimization (out of scope for M3, captured here for M5): delete
+orphaned functions as we go to keep peak memory low — currently a fully
+inlined helper sticks around in `LpirModule.functions` until M5's
+DeadFuncElim runs separately. Inline-and-delete-as-we-go would be one
+pass instead of two and would lower peak module size during compilation,
+which matters for big shaders on the ESP32. Stays as a separate pass for
+now to keep the inliner focused and the M5 deletion logic reusable.
+
+### Q15: Naming — `inline_module` vs `run`?
+
+**Context.** Other LPIR passes use snake_case verbs (`fold_constants`,
+`validate_module`, `parse_module`).
+
+**Resolution.** `pub fn inline_module(module: &mut LpirModule, config:
+&InlineConfig) -> InlineResult`. Re-exported from `lpir/src/lib.rs` so
+callers say `lpir::inline_module(..)`, matching `lpir::validate_module`,
+`lpir::parse_module`, `lpir::print_module`, `lpir::interpret`.
+
+## Notes
+
+- The roadmap says the `mode` parameter is `&InlineMode` in one place and
+  `&InlineConfig` in another. Use `&InlineConfig` (it's the richer struct
+  and includes `mode`) — that matches the M4 wiring snippet
+  (`&options.config.inline`) which is the production caller.
+- `InlineConfig` already has a `Default` impl, so nothing needs adding there.
+- The M3 doc mentions `EndBlock`; M2 closed `Block` with the existing `End`
+  op instead. Treat all "EndBlock" mentions in the M3 doc as `End`.
+- We do **not** delete or rename any function in this stage. After
+  `inline_module`, every `IrFunction` previously in the module is still
+  present, with the same `FuncId`. Functions that were fully inlined now
+  have zero remaining callers but are still compilable; M5 will prune them.
+- Const-fold runs per-function *after* inlining (M4 pipeline). Inlining
+  exposes new constants (e.g. `paletteHeatmap(0.0)`), so this is the
+  intended order. M3 doesn't need to invoke const_fold itself.
+
+## Execution notes (implementation vs plan)
+
+Appendix for Phase 7 — deviations and concrete choices during build-out:
+
+- **`topo_order` direction:** Kahn’s algorithm uses **in-degree = number of distinct local callees** per function. The queue seeds functions with in-degree 0 (no local calls). Peeling a callee decrements its callers’ in-degrees. The resulting `Vec` is **bottom-up** (leaves first), matching the design intent; early sketches that treated “out-degree” were corrected during implementation.
+
+- **Adjacency keyed by `BTreeMap`:** `callees_of`, `callers_of`, and `call_sites_of` use `BTreeMap<FuncId, …>` for deterministic iteration order (stable tests and logs).
+
+- **`Decision::SkipBudget`:** Split budget motivation into `BudgetReason` (`MaxGrowth` vs `ModuleTotal`) so the orchestrator can set `budget_exceeded` only when the **module total** cap trips (multi-site growth cap does not abort the whole pass).
+
+- **Multi-return `Block`:** The splicer emits **`ExitBlock` after each rewritten `Return`**, and ensures a trailing **`ExitBlock`** before **`End`** when the last body op is not already an exit (so `Block` always pairs with `ExitBlock` + `End` as required by LPIR structure).
+
+- **Param scan / remap:** `scan_param_writes` tracks **only defs via `def_vreg()`** for user params (`v1..=vN`); vmctx is asserted never defined. Read-only params **alias** caller arg vregs; written params get a fresh vreg plus a leading **`Copy`**. Callee locals and appended slots get fresh caller indices; import `Call` pool slices are remapped in `remap_op`.
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/01-continuing-marker.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/01-continuing-marker.md
new file mode 100644
index 000000000..c00368a2e
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/01-continuing-marker.md
@@ -0,0 +1,119 @@
+# Phase 1 — `LpirOp::Continuing` marker (M2.5)
+
+## Scope of phase
+
+Add **`LpirOp::Continuing`** as a structural marker for the start of a
+loop's continuing block, mirroring how **`Else`** marks the start of an
+if's else arm. The cached **`LoopStart::continuing_offset`** field is
+**kept** — backends and the interpreter keep using it unchanged. The
+marker exists purely so structural passes (Phase 2's
+**`recompute_offsets`**) can rebuild the cache after body mutation.
+
+This phase is the M2.5 prerequisite from
+`docs/roadmaps/2026-04-15-lpir-inliner/m2.5-continuing-marker.md`,
+folded in as the first phase of stage III.
+
+## Code Organization Reminders
+
+- One concept per change: a single new variant + one no-op arm per
+  consumer. No drive-by refactoring of nearby code.
+- Backend changes are minimal — every consumer just needs to *not panic*
+  on the new variant. Don't restructure existing match logic.
+- Keep **`#![no_std]`** + **`alloc`** — no new heap usage required.
+
+## Implementation Details
+
+### `lpir/src/lpir_op.rs`
+
+- Add **`Continuing`** variant to **`LpirOp`** enum (no fields).
+- Update **`LpirOp::def_vreg(&self)`** to return **`None`** for it
+  (matches other markers like **`Else`** / **`End`**).
+
+### `lpir/src/builder.rs`
+
+- **`FunctionBuilder::push_continuing()`**: prepend
+  **`self.body.push(LpirOp::Continuing)`** before the existing
+  **`continuing_offset`** patch on the open **`LoopStart`**. The patched
+  offset must equal the index of the just-pushed **`Continuing`** op.
+
+### `lpir/src/parse.rs`
+
+- The existing **`continuing:`** text token already triggers the
+  **`continuing_offset`** patch. Update that path to also call
+  **`fb.push_continuing()`** so the marker lands in the body.
+
+### `lpir/src/print.rs`
+
+- Add a match arm for **`LpirOp::Continuing`** that prints
+  **`continuing:`** (no trailing brace, like **`else:`**).
+- Remove the existing logic that conditionally prints **`continuing:`**
+  based on whether **`continuing_offset != start_pc + 1`**. The marker
+  is now the single source of truth for placement; just print it where
+  it appears in the body.
+
+### `lpir/src/validate.rs`
+
+- Add **`Continuing`** arms to all exhaustive matches that mention
+  **`Else`** / **`End`** / opener variants.
+- Structural check: **`Continuing`** is only legal directly inside a
+  **`LoopStart`** … **`End`** pair, i.e. not nested within an
+  **`IfStart`** / **`SwitchStart`** / **`Block`** / inner **`LoopStart`**
+  opened inside that loop. Reuse the existing control-flow stack walk;
+  on encountering **`Continuing`**, assert the top of the stack is the
+  expected **`LoopStart`**.
+- Validate **`LoopStart::continuing_offset`** points at a **`Continuing`**
+  op when present, **or** at **`start_pc + 1`** if no marker is in the
+  body (legacy behavior — keep both legal).
+
+### `lpir/src/interp.rs`
+
+- One arm in the dispatch loop:
+  **`LpirOp::Continuing => { pc += 1; }`**.
+
+### `lpir/src/const_fold.rs`
+
+- Add **`| LpirOp::Continuing`** to the conservative-clear arm next to
+  the other markers (**`Else`** / **`End`** / opener variants), so
+  constant propagation state is reset across the boundary, matching how
+  control-flow joins are handled today.
+
+### `lpvm-native/src/lower.rs`
+
+- One match arm: **`LpirOp::Continuing => { /* structural marker, no
+  emit */ }`**. The existing range-based continuing-block lowering
+  already starts at **`continuing_offset`** which now points at the
+  marker, so the marker is naturally inside the lowered slice and the
+  no-op arm makes it skip cleanly.
+
+### `lpvm-wasm/src/emit/ops.rs`
+
+- One match arm: same no-op pattern as native.
+
+### `lpvm-cranelift/src/emit/control.rs`
+
+- One match arm: same no-op pattern as native.
+
+## Tests (`lpir` crate)
+
+Extend existing test files; do not add a new module just for this.
+
+- `tests/all_ops_roundtrip.rs`: add a loop with an explicit
+  **`continuing:`** body to the round-trip set.
+- `tests/block_ops.rs` (or wherever loop validation tests live —
+  inspect first; create a small new file only if no good home exists):
+  one test asserting that after `parse → build`, the
+  **`LoopStart::continuing_offset`** value equals the index of the
+  **`Continuing`** op in the body.
+
+## Validate
+
+```bash
+cargo test -p lpir
+cargo test -p lpvm-native
+cargo test -p lpvm-wasm
+cargo test -p lpvm-cranelift
+cargo test -p lps-filetests -- --test-threads=4
+```
+
+No behavioral change is expected — every existing test must pass
+unchanged. The marker is purely additive.
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/02-inline-scaffold-and-offsets.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/02-inline-scaffold-and-offsets.md
new file mode 100644
index 000000000..5d06910c5
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/02-inline-scaffold-and-offsets.md
@@ -0,0 +1,120 @@
+# Phase 2 — Inline scaffold + `recompute_offsets`
+
+## Scope of phase
+
+Stand up the empty **`lpir::inline`** module with the public API stubs
+(returning **`InlineResult::default()`**) and the first real piece of
+machinery: **`recompute_offsets(&mut [LpirOp])`**. The orchestration
+loop, callgraph, splicer, and heuristic come in later phases.
+
+`recompute_offsets` is the foundational reusable utility — it walks a
+mutated body, matches structural markers to their openers via a stack,
+and patches **`else_offset`** / **`end_offset`** /
+**`continuing_offset`** in place. Once Phase 1's **`Continuing`** marker
+exists, every offset on every opener is recoverable purely from
+markers.
+
+## Code Organization Reminders
+
+- New submodule layout per Q2 in **`00-design.md`**:
+  - `lpir/src/inline/mod.rs`
+  - `lpir/src/inline/offsets.rs`
+- Re-export only the public surface from **`lpir/src/lib.rs`**:
+  **`inline_module`**, **`InlineResult`**. Internal helpers stay
+  crate-private.
+- One concept per file; **`offsets.rs`** is just the recompute helper
+  and its tests-of-record (full coverage lives in
+  `tests/inline_offsets.rs`).
+
+## Implementation Details
+
+### `lpir/src/inline/mod.rs`
+
+```rust
+//! LPIR inlining pass — bottom-up, never deletes functions, structural
+//! offset recompute. See docs/plans/2026-04-17-lpir-inliner-stage-iii.
+
+mod offsets;
+
+pub(crate) use offsets::recompute_offsets;
+
+#[derive(Debug, Default, Clone, Copy)]
+pub struct InlineResult {
+    pub functions_inlined: usize,
+    pub call_sites_replaced: usize,
+    pub functions_skipped_recursive: usize,
+    pub budget_exceeded: bool,
+}
+
+pub fn inline_module(
+    _module: &mut crate::LpirModule,
+    _config: &crate::InlineConfig,
+) -> InlineResult {
+    // Filled in by Phase 6.
+    InlineResult::default()
+}
+```
+
+### `lpir/src/inline/offsets.rs`
+
+- **`pub(crate) fn recompute_offsets(body: &mut [LpirOp])`**.
+- Walk forward over **`body`**. Maintain a stack of **`(opener_idx,
+  Opener)`** entries where **`Opener`** is a small internal enum
+  capturing which opener variant we're inside (**`If`** / **`Loop`** /
+  **`Switch`** / **`Block`**).
+- On **`Else`**: peek top, must be **`If`**, patch
+  **`body[opener_idx].as_if_mut().else_offset = current_idx`** (or
+  whatever the existing field name is — match the struct shape exactly).
+- On **`Continuing`**: peek top, must be **`Loop`**, patch
+  **`continuing_offset = current_idx`**.
+- On **`End`** / **`ExitBlock`**: pop. For the matching opener, patch
+  **`end_offset = current_idx`** (or **`exit_offset`** for **`Block`** —
+  match existing field names).
+- On any opener: push **`(current_idx, kind)`**. Inner offsets are
+  patched by inner pops first, so an outer recompute is correct as long
+  as we patch on the way *up* (i.e. when we see the marker, not when
+  we push).
+- Debug-assert the stack is empty at end-of-body.
+
+This function never reads existing offset values — it always overwrites
+from the markers. That makes it idempotent and order-independent within
+a single call.
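+
+A minimal sketch of the stack walk described above, assuming a
+simplified stand-in `Op` enum (hypothetical: the real `LpirOp` field
+names may differ, and the `Switch` / `Block` openers are not modelled):
+
+```rust
+/// Hypothetical stand-in for the structural subset of `LpirOp`.
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum Op {
+    IfStart { else_offset: usize, end_offset: usize },
+    LoopStart { continuing_offset: usize, end_offset: usize },
+    Else,
+    Continuing,
+    End,
+    Nop,
+}
+
+/// Rebuild every cached offset purely from marker positions.
+fn recompute_offsets(body: &mut [Op]) {
+    // Indices of openers that have not yet seen their `End`.
+    let mut open: Vec<usize> = Vec::new();
+    for idx in 0..body.len() {
+        let op = body[idx]; // `Op` is `Copy`, so `body` is not borrowed here
+        match op {
+            Op::IfStart { .. } | Op::LoopStart { .. } => open.push(idx),
+            Op::Else => {
+                let top = *open.last().expect("Else with no open construct");
+                match &mut body[top] {
+                    Op::IfStart { else_offset, .. } => *else_offset = idx,
+                    _ => panic!("Else outside an If"),
+                }
+            }
+            Op::Continuing => {
+                let top = *open.last().expect("Continuing with no open construct");
+                match &mut body[top] {
+                    Op::LoopStart { continuing_offset, .. } => *continuing_offset = idx,
+                    _ => panic!("Continuing outside a Loop"),
+                }
+            }
+            Op::End => {
+                let top = open.pop().expect("unbalanced End");
+                match &mut body[top] {
+                    Op::IfStart { end_offset, .. } | Op::LoopStart { end_offset, .. } => {
+                        *end_offset = idx;
+                    }
+                    _ => unreachable!("only openers are pushed"),
+                }
+            }
+            Op::Nop => {}
+        }
+    }
+    debug_assert!(open.is_empty(), "unclosed opener at end of body");
+}
+```
+
+Because the sketch only reads marker positions and never the old offset
+values, running it a second time leaves the body unchanged.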
+
+### `lpir/src/lib.rs`
+
+- **`pub mod inline;`** (or `mod inline;` + targeted `pub use`).
+- **`pub use inline::{inline_module, InlineResult};`**.
+
+## Tests (`lpir` crate)
+
+`tests/inline_offsets.rs` (new):
+
+- **`if_else_end`**: build via **`FunctionBuilder`**, then *zero out*
+  every offset field, call **`recompute_offsets`**, assert they match
+  the original.
+- **`loop_with_continuing_marker`**: same, including a **`Continuing`**
+  marker midway through the body.
+- **`loop_without_continuing_marker`**: legacy form (no marker) — the
+  recomputed **`continuing_offset`** should equal **`loop_start_pc + 1`**
+  (i.e. unchanged from the legacy convention; verify the helper handles
+  this either by leaving the existing offset alone or by patching to the
+  same value).
+- **`switch_multi_arm`**: nested case if **`SwitchStart`** carries
+  per-arm offsets — match whatever shape exists today.
+- **`block_exit`**: one **`Block`** + **`ExitBlock`**; assert
+  **`end_offset`** patched.
+- **`nested_loop_in_if_in_block`**: stress nesting; offsets must all
+  match a fresh build of the same structure.
+- **`mutated_body_grows`**: take a built body, splice in extra
+  no-op-ish ops between an opener and its closer, run
+  **`recompute_offsets`**, assert offsets shifted correctly.
+
+## Validate
+
+```bash
+cargo test -p lpir
+```
+
+The scaffold's stub **`inline_module`** is a no-op; nothing else in the
+workspace can depend on it yet, so only the **`lpir`** crate needs to
+build/test in this phase.
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/03-callgraph.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/03-callgraph.md
new file mode 100644
index 000000000..0017a4542
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/03-callgraph.md
@@ -0,0 +1,102 @@
+# Phase 3 — Call graph + topological order
+
+## Scope of phase
+
+Add **`lpir/src/inline/callgraph.rs`**: the data the orchestrator needs
+to walk functions bottom-up and to skip recursive cycles cleanly.
+
+This phase is purely additive analysis — it does not mutate the
+module — so it can be tested in isolation against parsed LPIR
+fixtures.
+
+## Code Organization Reminders
+
+- One file: `lpir/src/inline/callgraph.rs`. Internal to **`inline`**;
+  not re-exported.
+- Use **`alloc::vec::Vec`** + **`alloc::collections::BTreeMap`** /
+  **`BTreeSet`** for determinism in **`#![no_std]`**. Avoid hash maps
+  in core data.
+- Edges only follow **`CalleeRef::Local`**; **`CalleeRef::Import`**
+  is treated as an external leaf (no edge added).
+
+## Implementation Details
+
+### Public surface (crate-private)
+
+```rust
+pub(crate) struct CallGraph {
+    /// callees_of[caller] = sorted, deduplicated list of local FuncIds called.
+    pub callees_of: BTreeMap<FuncId, Vec<FuncId>>,
+    /// callers_of[callee] = sorted, deduplicated list of local FuncIds calling it.
+    pub callers_of: BTreeMap<FuncId, Vec<FuncId>>,
+    /// `(op_idx, callee)` per call site, in body order, for splicer iteration.
+    pub call_sites_of: BTreeMap<FuncId, Vec<(usize, FuncId)>>,
+}
+
+pub(crate) fn build(module: &LpirModule) -> CallGraph;
+
+/// Returns (topo_order, cyclic_set).
+/// topo_order: leaves-first ordering of FuncIds reachable in a DAG.
+/// cyclic_set: FuncIds participating in any cycle (skipped by inliner).
+pub(crate) fn topo_order(g: &CallGraph) -> (Vec<FuncId>, BTreeSet<FuncId>);
+```
+
+`LpirModule::functions` is `BTreeMap<FuncId, IrFunction>` keyed by sparse
+`FuncId(u16)` ids, so `BTreeMap<FuncId, _>` is the correct adjacency
+shape. `CalleeRef::Local(FuncId)` is the local-call variant
+(`CalleeRef::Import(ImportId)` is the external one — skipped here).
+
+### `build`
+
+- Iterate **`module.functions`**; for each function id **`f`**, walk
+  **`func.body`** and collect every **`LpirOp::Call { callee:
+  CalleeRef::Local(g), .. }`** along with its op index.
+- Populate **`call_sites_of[f]`** in body order (no dedup — every call
+  site is a distinct splice target).
+- Populate **`callees_of[f]`** as the deduplicated, sorted set of
+  `FuncId`s called from `f`. Same for **`callers_of`** in reverse.
+
+### `topo_order`
+
+- Kahn's algorithm; **leaves-first** = functions with **no outgoing
+  local edges** come first.
+- **`in_degree[g] = callees_of[g].len()`** (count of distinct local
+  callees). Initial queue: all `g` with `in_degree == 0`.
+- Pop the smallest `FuncId` from the queue into `topo_order`. For each
+  `caller ∈ callers_of[g]`, decrement `in_degree[caller]`; push
+  to the queue when it hits zero.
+- Anything left with **`in_degree > 0`** after the queue drains is in a
+  cycle (self-loops, mutual recursion, larger SCCs) or transitively
+  calls into one; collect those into **`cyclic_set`**.
+- Determinism: process the queue in ascending **`FuncId`** order (use
+  **`BTreeSet`** as the queue).
+
+### Self-recursion is a cycle
+
+A function that calls itself directly is a 1-cycle and lands in
+**`cyclic_set`**. No special-casing needed — Kahn's handles it.
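+
+The ordering above can be sketched as follows. This is an illustration
+under assumptions, not the crate's code: `FuncId` is modelled as a bare
+`u16`, the adjacency maps are passed directly instead of a `CallGraph`,
+and `build` is a hypothetical edge-list helper that expects every id to
+be listed in `funcs`. Note that with plain Kahn's algorithm the leftover
+set also contains functions that merely call into a cycle.
+
+```rust
+use std::collections::{BTreeMap, BTreeSet};
+
+type FuncId = u16; // stand-in for the real `FuncId` newtype
+type Adj = BTreeMap<FuncId, Vec<FuncId>>;
+
+/// Hypothetical helper: build deduplicated adjacency maps from an edge list.
+/// Assumes every id used in `calls` also appears in `funcs`.
+fn build(funcs: &[FuncId], calls: &[(FuncId, FuncId)]) -> (Adj, Adj) {
+    let mut callees_of: Adj = funcs.iter().map(|&f| (f, Vec::new())).collect();
+    let mut callers_of: Adj = callees_of.clone();
+    for &(caller, callee) in calls {
+        let out = callees_of.entry(caller).or_default();
+        if !out.contains(&callee) {
+            out.push(callee);
+            callers_of.entry(callee).or_default().push(caller);
+        }
+    }
+    (callees_of, callers_of)
+}
+
+/// Leaves-first Kahn ordering; returns `(topo_order, leftover)`.
+fn topo_order(callees_of: &Adj, callers_of: &Adj) -> (Vec<FuncId>, BTreeSet<FuncId>) {
+    // "In-degree" here counts distinct local callees still unprocessed.
+    let mut in_degree: BTreeMap<FuncId, usize> =
+        callees_of.iter().map(|(&f, cs)| (f, cs.len())).collect();
+    // A BTreeSet as the queue gives ascending-FuncId, deterministic pops.
+    let mut queue: BTreeSet<FuncId> = in_degree
+        .iter()
+        .filter(|&(_, &d)| d == 0)
+        .map(|(&f, _)| f)
+        .collect();
+    let mut order = Vec::new();
+    while let Some(g) = queue.pop_first() {
+        order.push(g);
+        for caller in callers_of.get(&g).into_iter().flatten() {
+            let d = in_degree.get_mut(caller).expect("caller missing from callees_of");
+            *d -= 1;
+            if *d == 0 {
+                queue.insert(*caller);
+            }
+        }
+    }
+    // Whatever never reached zero is in a cycle or calls into one.
+    let leftover: BTreeSet<FuncId> = in_degree
+        .into_iter()
+        .filter(|&(_, d)| d > 0)
+        .map(|(f, _)| f)
+        .collect();
+    (order, leftover)
+}
+```
+
+For the chain A→B→C with ids 0, 1, 2, `topo_order` yields `[2, 1, 0]`
+and an empty leftover set.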
+
+## Tests (`lpir` crate)
+
+`tests/inline_callgraph.rs` (new):
+
+- **`leaf`**: function calling no one → in `topo_order`, not in
+  `cyclic_set`.
+- **`linear_chain_a_b_c`**: A→B→C → topo order is `[C, B, A]`.
+- **`diamond_a_bc_d`**: A→{B,C}, B→D, C→D → topo order is `[D, B, C, A]`
+  when B has the smaller `FuncId`, `[D, C, B, A]` otherwise.
+- **`self_recursive`**: A→A → A in `cyclic_set`, not in `topo_order`.
+- **`mutual_recursion`**: A→B, B→A → both in `cyclic_set`.
+- **`recursion_with_acyclic_tail`**: A→B, B→A, A→C → A and B in
+  `cyclic_set`; C in `topo_order`.
+- **`import_only_callee`**: A calls only an `ImportId` → A is a leaf
+  (no edges out), in `topo_order`.
+- **`multiple_call_sites_same_callee`**: A calls B twice →
+  `callees_of[A] = [B]` (deduped); `call_sites_of[A]` has two entries
+  with distinct op indices.
+
+## Validate
+
+```bash
+cargo test -p lpir
+```
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/04-remap-and-param-scan.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/04-remap-and-param-scan.md
new file mode 100644
index 000000000..80b267b9e
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/04-remap-and-param-scan.md
@@ -0,0 +1,161 @@
+# Phase 4 — Remap helpers + param-write scan
+
+## Scope of phase
+
+Add **`lpir/src/inline/remap.rs`**: the per-call-site machinery that
+prepares the callee body for splicing into a caller. Three pieces:
+
+1. **`scan_param_writes(callee) -> ParamWriteMask`** — which params are
+   written by the callee body (for the per-param alias-or-copy
+   strategy from Q4 in `00-design.md`).
+2. **`build_remap(...)`** — produce the **`VReg`** translation table
+   plus the list of preamble **`Copy`** ops needed for mutated params.
+3. **`remap_op(...)`** — clone a single callee op with **`VReg`** /
+   **`SlotId`** / **`vreg_pool`** fixups applied.
+
+Splicing itself (Phase 5) drives these helpers; this phase tests them
+in isolation.
+
+## Code Organization Reminders
+
+- One file: `lpir/src/inline/remap.rs`. Crate-private.
+- Keep helpers pure: no module mutation here. `build_remap` and
+  `remap_op` produce data; the splicer (Phase 5) applies it.
+- Use **`alloc::vec::Vec`** indexed by callee **`VReg::index()`** for
+  the translation table — dense, **`O(1)`** lookup, deterministic.
+
+## Implementation Details
+
+### `ParamWriteMask`
+
+```rust
+/// One flag per callee param (excluding vmctx). True = param is written
+/// somewhere in the callee body, so a preamble `Copy` is needed.
+pub(crate) struct ParamWriteMask {
+    /// One bool per param in callee param order (params live in
+    /// VReg(1)..=VReg(param_count); index 0 here = first non-vmctx).
+    pub written: Vec<bool>,
+}
+
+pub(crate) fn scan_param_writes(callee: &IrFunction) -> ParamWriteMask;
+```
+
+- Iterate **`callee.body`**. For each op, ask
+  **`op.def_vreg() -> Option<VReg>`** (already exists in
+  **`lpir_op.rs`**).
+- If the defined **`VReg`** falls in the param range
+  (**`1..=callee.param_count`**), mark
+  **`written[idx_of(vreg)] = true`**.
+- Skip **`VReg(0)`** — vmctx is read-only by construction; debug-assert
+  it never appears as **`def_vreg`**.
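+
+A rough sketch of the scan, assuming a tiny stand-in op set (the `Op`
+enum and the tuple-struct `VReg` below are hypothetical; the real scan
+takes an `IrFunction` and returns a `ParamWriteMask`):
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+struct VReg(u32);
+
+/// Hypothetical three-op stand-in for `LpirOp`.
+#[allow(dead_code)]
+enum Op {
+    Add { dst: VReg, lhs: VReg, rhs: VReg },
+    Copy { dst: VReg, src: VReg },
+    Return { value: VReg },
+}
+
+impl Op {
+    /// The vreg this op defines, if any (mirrors `LpirOp::def_vreg`).
+    fn def_vreg(&self) -> Option<VReg> {
+        match self {
+            Op::Add { dst, .. } | Op::Copy { dst, .. } => Some(*dst),
+            Op::Return { .. } => None,
+        }
+    }
+}
+
+/// One flag per user param; params occupy VReg(1)..=VReg(param_count).
+fn scan_param_writes(body: &[Op], param_count: u32) -> Vec<bool> {
+    let mut written = vec![false; param_count as usize];
+    for op in body {
+        if let Some(VReg(v)) = op.def_vreg() {
+            debug_assert_ne!(v, 0, "vmctx (VReg 0) must never be defined");
+            if (1..=param_count).contains(&v) {
+                written[(v - 1) as usize] = true;
+            }
+        }
+    }
+    written
+}
+```
+
+Only definitions count: a param that is read many times but never
+defined stays `false` and can alias the caller's argument vreg.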
+
+### `build_remap`
+
+```rust
+pub(crate) struct Remap {
+    /// callee VReg index → caller VReg.
+    pub vreg_table: Vec<VReg>,
+    /// Preamble `Copy` ops (param mutated → fresh caller vreg from arg).
+    /// Empty for read-only params (those alias arg vreg directly).
+    pub param_copies: Vec<LpirOp>,
+    /// Slot offset to add to callee SlotId references.
+    pub slot_offset: u32,
+}
+
+pub(crate) fn build_remap(
+    caller: &mut IrFunction,
+    callee: &IrFunction,
+    call_args: &[VReg],          // resolved from call site's VRegRange
+    call_results: &[VReg],       // resolved from call site's result range
+    param_writes: &ParamWriteMask,
+) -> Remap;
+```
+
+- Allocate **`vreg_table`** sized to **`callee.vreg_count`**. Initialize
+  to a sentinel (e.g. **`VReg::INVALID`** or `VReg::from_index(u32::MAX)`).
+- **`vreg_table[0] = VMCTX_VREG`** — vmctx always aliases.
+- For each param **`i`** in `1..=callee.param_count`:
+  - Caller's arg vreg for that param is `call_args[i]` (call_args[0] is
+    vmctx, by Q4 convention; verify against existing call lowering).
+  - If **`!param_writes.written[i-1]`**: alias —
+    `vreg_table[i] = call_args[i]`.
+  - Else: allocate fresh `caller.alloc_vreg()` → `vreg_table[i] = new`,
+    push **`LpirOp::Copy { dst: new, src: call_args[i] }`** into
+    `param_copies`.
+- For each non-param vreg **`v`** in
+  `callee.param_count+1..callee.vreg_count`:
+  - If `v` is one of the callee's return vregs **and** the corresponding
+    `call_results[k]` slot exists, alias to that result vreg (Phase 5
+    rewrites Returns to write directly there). Otherwise allocate fresh.
+  - For now (this phase), allocate fresh for *all* non-params; the
+    Return-to-result aliasing is decided by Phase 5's return-shape
+    analysis based on the actual `Return` operand list. Keep
+    `build_remap` shape-agnostic.
+- **`slot_offset = caller.slot_count`**; reserve `callee.slot_count`
+  fresh slots in caller (call `caller.alloc_slot()` in a loop, or bump
+  the count directly — match the existing API in `IrFunction`).
+
+Debug-assert: every entry in `vreg_table` is non-sentinel before
+returning.
+
+### `remap_op`
+
+```rust
+pub(crate) fn remap_op(
+    op: &LpirOp,
+    remap: &Remap,
+    caller_vreg_pool: &mut Vec<VReg>,
+    callee_vreg_pool: &[VReg],
+) -> LpirOp;
+```
+
+- Clone **`op`**, then for each **`VReg`** field replace with
+  **`remap.vreg_table[v.index()]`**.
+- For each **`SlotId`** field, add **`remap.slot_offset`**.
+- For any **`VRegRange`** that indexes into the callee's `vreg_pool`
+  (e.g. **`Call { args, results }`** for nested calls inside the
+  callee body): read the slice from `callee_vreg_pool`, remap each
+  vreg through `vreg_table`, append to `caller_vreg_pool`, and rewrite
+  the `VRegRange` to point at the new caller-pool location.
+- Markers (**`Else`** / **`End`** / **`ExitBlock`** / **`Continuing`**)
+  and openers' offset fields: leave offsets at zero / placeholder.
+  Phase 5 splices the body, Phase 6's
+  **`recompute_offsets`** call (after splice) fixes them.
+- Don't touch **`Return`** here — Phase 5's splicer handles return
+  rewriting before calling `remap_op` (or skips Returns entirely and
+  emits the rewritten form directly).
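+
+The `vreg_pool` fixup is the least obvious step, so here is a small
+sketch of it in isolation. The `VRegRange { start, len }` layout and the
+`remap_range` helper are assumptions made for illustration; the real
+range type and pool API may look different.
+
+```rust
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+struct VReg(u32);
+
+/// Hypothetical layout: a window into a function's `vreg_pool`.
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+struct VRegRange {
+    start: u32,
+    len: u32,
+}
+
+/// Copy one callee pool slice into the caller pool, translating each vreg
+/// through the callee-index -> caller-vreg table, and return the new range.
+fn remap_range(
+    range: VRegRange,
+    vreg_table: &[VReg],
+    callee_pool: &[VReg],
+    caller_pool: &mut Vec<VReg>,
+) -> VRegRange {
+    let new_start = caller_pool.len() as u32;
+    let src = &callee_pool[range.start as usize..(range.start + range.len) as usize];
+    caller_pool.extend(src.iter().map(|v| vreg_table[v.0 as usize]));
+    VRegRange { start: new_start, len: range.len }
+}
+```
+
+Appending rather than rewriting in place keeps existing caller ranges
+valid, at the cost of the caller pool growing by `len` entries per
+remapped range.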
+
+## Tests (`lpir` crate)
+
+`tests/inline_param_writes.rs` (new):
+
+- **`vmctx_never_written`**: assert via debug-build test that scanning
+  any well-formed callee never marks `VReg(0)` as written; trivial
+  callees produce all-false masks.
+- **`single_param_read_only`**: callee `fn(a) -> a + 1` → mask
+  `[false]`.
+- **`single_param_mutated`**: callee where `a` is the dst of an `Add`
+  → mask `[true]`.
+- **`multi_param_mixed`**: 3 params, second one mutated → `[false,
+  true, false]`.
+
+`tests/inline_remap.rs` (new):
+
+- **`alias_for_readonly_param`**: `build_remap` produces empty
+  `param_copies` and aliases vreg directly.
+- **`copy_for_mutated_param`**: `param_copies` length 1, fresh dst
+  vreg, src is caller arg vreg.
+- **`vmctx_aliases`**: `vreg_table[0] == VMCTX_VREG` regardless of
+  param-write mask.
+- **`slot_offset_applied`**: callee with 2 slots inlined into caller
+  with 3 slots → remapped slot ids are 3 and 4.
+- **`vreg_pool_splice`**: callee body contains a `Call` to an import
+  with multiple args; after `remap_op`, caller's `vreg_pool` has the
+  spliced entries with translated vregs and the new `Call`'s
+  `VRegRange` points at them.
+
+## Validate
+
+```bash
+cargo test -p lpir
+```
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/05-splicer.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/05-splicer.md
new file mode 100644
index 000000000..28de4d327
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/05-splicer.md
@@ -0,0 +1,174 @@
+# Phase 5 — Body splicer
+
+## Scope of phase
+
+Add **`lpir/src/inline/splice.rs`**: the function that actually replaces
+one **`LpirOp::Call`** in a caller with the cloned, remapped body of a
+callee. This is where the **return-shape analysis** from `00-design.md`
+lives, and it's the only place that mutates `caller.body` for inlining.
+
+Per Q14, the splicer is **mutative on the caller** — it does not
+allocate a parallel `Vec<IrFunction>`. Memory for the call-site `Call`
+op is reclaimed by `Vec::splice`.
+
+The orchestration loop that *calls* this for every site comes in
+Phase 6; tests in this phase exercise the splicer directly.
+
+## Code Organization Reminders
+
+- One file: `lpir/src/inline/splice.rs`. Crate-private.
+- `inline_call_site` is the only public-to-`inline` function.
+- All offset patching is deferred to a single
+  **`recompute_offsets(&mut caller.body)`** call by the orchestrator
+  after *all* of a caller's sites are spliced (Phase 6). The splicer
+  itself never touches offsets.
+
+## Implementation Details
+
+### Signature
+
+```rust
+pub(crate) fn inline_call_site(
+    caller: &mut IrFunction,
+    callee: &IrFunction,
+    call_op_idx: usize,
+);
+```
+
+The caller, callee, and call-site index are picked by Phase 6. The
+function must not panic on any well-formed input.
+
+### Step 1 — Read & destructure the call site
+
+- Snapshot the **`Call`** op: extract **`args: VRegRange`** and
+  **`results: VRegRange`**, resolve to **`Vec<VReg>`** via
+  `caller.vreg_pool`.
+- Validate against callee shape: `args.len() == 1 +
+  callee.param_count` (the `+1` is vmctx); `results.len() ==
+  callee.return_count`. Debug-assert; in release, log and bail (return
+  without splicing) — the orchestrator counts this as "not inlined".
+
+### Step 2 — Param-write scan + remap
+
+```rust
+let pw   = scan_param_writes(callee);
+let rmap = build_remap(caller, callee, &call_args, &call_results, &pw);
+```
+
+### Step 3 — Return-shape analysis
+
+Walk **`callee.body`** once and classify:
+
+```rust
+enum ReturnShape {
+    /// Zero `Return` ops (unreachable terminator) OR void return.
+    None,
+    /// Exactly one `Return` and it's the very last op of callee.body.
+    SingleAtEnd,
+    /// Anything else: multiple Returns, or a Return not at the end.
+    Multi,
+}
+```
+
+This decides how `Return` ops are rewritten and whether the inlined
+body needs a `Block { … } / ExitBlock` wrapper.
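+
+One way the classification could look, sketched over a hypothetical
+four-variant `Op` enum. The sketch assumes that `None` means "no
+`Return` op appears in the body"; how a void callee that ends in an
+explicit `Return` should be classified is not settled by this sketch.
+
+```rust
+#[derive(Debug, PartialEq)]
+enum ReturnShape {
+    None,
+    SingleAtEnd,
+    Multi,
+}
+
+/// Hypothetical stand-in for `LpirOp`; only `Return` matters here.
+#[allow(dead_code)]
+enum Op {
+    IfStart,
+    End,
+    Return,
+    Other,
+}
+
+/// Single pass over the body: count Returns, then look at the last op.
+fn classify(body: &[Op]) -> ReturnShape {
+    let returns = body.iter().filter(|op| matches!(op, Op::Return)).count();
+    match (returns, body.last()) {
+        (0, _) => ReturnShape::None,
+        (1, Some(Op::Return)) => ReturnShape::SingleAtEnd,
+        // Several Returns, or a lone Return that is not the final op.
+        _ => ReturnShape::Multi,
+    }
+}
+```
+
+A lone early `Return` inside an `If` falls into `Multi`, which matches
+the `single_return_not_at_end` test case listed for this phase.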
+
+### Step 4 — Build the scratch `Vec<LpirOp>`
+
+In order:
+
+1. **Param copies**: extend with `rmap.param_copies` (already in
+   correct form, vregs already in caller-space).
+2. **`Block` opener** (only if `ReturnShape::Multi`):
+   `LpirOp::Block { end_offset: 0 }` — placeholder offset, fixed by
+   `recompute_offsets`.
+3. **Cloned + remapped body**: walk `callee.body` op by op:
+   - If op is **`LpirOp::Return { values }`**:
+     - Resolve each return value vreg through `rmap.vreg_table`.
+     - Emit `LpirOp::Copy { dst: call_results[k], src: remapped }` for
+       each `k` (or whatever the multi-return primitive is — match
+       existing return-handling lowering; if a single move-list op
+       exists, use that instead of N `Copy` ops).
+     - If `ReturnShape::Multi`: append `LpirOp::ExitBlock`.
+     - If `ReturnShape::SingleAtEnd`: no `ExitBlock` needed; this is
+       the last op anyway.
+     - If `ReturnShape::None`: no Returns to rewrite — but if we hit
+       one, classification was wrong → debug-assert.
+   - Else: push `remap_op(op, &rmap, &mut caller.vreg_pool,
+     &callee.vreg_pool)`.
+4. **`ExitBlock` close** (only if `ReturnShape::Multi`): append one
+   final `LpirOp::ExitBlock` to terminate the wrapper if the last
+   callee op was *not* a Return (otherwise step 3 already emitted it).
+   - Cleaner formulation: track `last_was_exit_block: bool` while
+     building; emit a trailing `ExitBlock` iff
+     `Multi && !last_was_exit_block`.
+
+### Step 5 — Splice into caller
+
+```rust
+caller.body.splice(call_op_idx..=call_op_idx, scratch);
+```
+
+Single splice replaces the `Call` op in place. Capacity reclamation is
+implicit; for embedded targets we may want a follow-up
+`caller.body.shrink_to_fit()` once per caller after all sites are done
+(Phase 6 calls it once at the end).
+
+### Step 6 — Slot/vreg counts
+
+After splice, ensure:
+
+- `caller.slot_count` already incremented by `build_remap`.
+- `caller.vreg_count` reflects fresh allocations made by `build_remap`.
+
+The splicer doesn't touch these directly — they were updated when
+`build_remap` allocated.
+
+### What the splicer does *not* do
+
+- Does **not** call `recompute_offsets`. Phase 6 batches that per
+  caller after all sites are processed (avoids `O(sites × body_len)`
+  re-walks).
+- Does **not** validate the result. Phase 6's orchestrator runs
+  validation in debug builds.
+- Does **not** delete the callee. Per Q14, dead-function elimination is
+  M5.
+
+## Tests (`lpir` crate)
+
+`tests/inline_basic.rs` (new): drive `inline_call_site` directly with
+hand-built modules; after each splice, run **`recompute_offsets`** then
+**`validate`** then **`interp::run_function`** (or whatever the
+existing test harness uses) and compare results with the same module
+*pre*-inlining.
+
+- **`void_callee`**: callee returns nothing, single statement body
+  (e.g. write to a slot). Result: same observable side effect, no
+  result vreg writes.
+- **`single_return_at_end`**: `fn add1(a) -> a + 1`. Inlining produces
+  no `Block`, no `ExitBlock`. Verify caller body shape and result.
+- **`single_return_not_at_end`**: callee with an early `Return` inside
+  an `If`. Should classify as `Multi`, wrap in `Block`/`ExitBlock`.
+- **`multiple_returns`**: callee with two `Return`s in different `If`
+  arms. Wrapped in `Block`; both Returns become `Copy + ExitBlock`.
+- **`nested_call_in_callee`**: callee body itself contains a `Call`
+  to an import. Verify `vreg_pool` splice happens correctly via
+  `remap_op` and the inlined call still references the right import.
+- **`mutated_param`**: callee writes to its first param. Verify a
+  `Copy` is emitted into a fresh vreg and subsequent reads use that.
+- **`readonly_param`**: callee never writes its params. Verify zero
+  `Copy` ops, direct alias.
+- **`vmctx_propagation`**: any callee op that reads `VReg(0)` (vmctx)
+  remains reading `VReg(0)` post-splice.
+- **`slot_remap`**: callee uses 2 slots; caller has 3 pre-inlining.
+  Post-inlining, callee's slot uses are at 3, 4.
+
+For each test: build module via `FunctionBuilder`, snapshot expected
+behavior via `interp::run_function`, splice, recompute offsets,
+validate, re-interp, compare.
+
+## Validate
+
+```bash
+cargo test -p lpir
+```
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/06-heuristic-and-orchestration.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/06-heuristic-and-orchestration.md
new file mode 100644
index 000000000..04536ebd3
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/06-heuristic-and-orchestration.md
@@ -0,0 +1,276 @@
+# Phase 6 — Heuristic + orchestration
+
+## Scope of phase
+
+Tie everything together: add **`lpir/src/inline/heuristic.rs`** and
+fill in **`lpir/src/inline/mod.rs::inline_module`** with the full
+orchestration loop. After this phase, calling
+**`lpir::inline_module(&mut module, &config)`** actually inlines.
+
+Per Q1 / M3.1, the **`func_weight`** heuristic uses `body.len()` as a
+first-pass approximation; empirical tuning is deferred to M3.1.
+
+Per Q11, the orchestrator emits **`log::debug!`** for every
+inlining decision (inline / skip-budget / skip-recursive /
+skip-too-large) so behavior is debuggable from CLI tools.
+
+## Code Organization Reminders
+
+- Two files: `lpir/src/inline/heuristic.rs` (new) and
+  `lpir/src/inline/mod.rs` (fill in the stub from Phase 2).
+- **`log`** crate: confirm it's already a dependency of **`lpir`**
+  (other crates in the workspace use it). If not, add with
+  `default-features = false` for **`#![no_std]`** compatibility.
+- Keep heuristic decisions pure functions: input is `(callee_size,
+  call_count, current_module_size, config)`, output is `Decision`.
+
+## Implementation Details
+
+### `lpir/src/inline/heuristic.rs`
+
+```rust
+pub(crate) fn func_weight(func: &IrFunction) -> usize {
+    func.body.len()
+}
+
+#[derive(Debug, Clone, Copy)]
+pub(crate) enum Decision {
+    Inline,
+    SkipTooLarge { weight: usize, threshold: usize },
+    SkipBudget { projected: usize, budget: usize },
+    SkipMode,
+}
+
+pub(crate) fn should_inline(
+    callee_weight: usize,
+    callsite_count_at_callee: usize,
+    current_module_op_count: usize,
+    config: &InlineConfig,
+) -> Decision {
+    use crate::InlineMode::*;
+    match config.mode {
+        Never => return Decision::SkipMode,
+        Always => { /* fall through; only budget can stop us */ }
+        Auto => {
+            if callee_weight > config.small_func_threshold
+                && callsite_count_at_callee > 1
+            {
+                return Decision::SkipTooLarge {
+                    weight: callee_weight,
+                    threshold: config.small_func_threshold,
+                };
+            }
+        }
+    }
+
+    // max_growth_budget is per callee: inlining grows the module by ~weight per site.
+    let projected_growth = callee_weight.saturating_mul(callsite_count_at_callee);
+    if projected_growth > config.max_growth_budget {
+        return Decision::SkipBudget {
+            projected: projected_growth,
+            budget: config.max_growth_budget,
+        };
+    }
+
+    // module_op_budget: hard cap on total module ops post-inline.
+    let projected_total =
+        current_module_op_count.saturating_add(projected_growth);
+    if projected_total > config.module_op_budget {
+        return Decision::SkipBudget {
+            projected: projected_total,
+            budget: config.module_op_budget,
+        };
+    }
+
+    Decision::Inline
+}
+```
+
+> Confirm field names against `CompilerConfig` / `InlineConfig`
+> (added in stage II); rename above if they differ.
+
+### `lpir/src/inline/mod.rs` — full orchestration
+
+```rust
+pub fn inline_module(
+    module: &mut LpirModule,
+    config: &InlineConfig,
+) -> InlineResult {
+    let graph = callgraph::build(module);
+    let (topo, cyclic) = callgraph::topo_order(&graph);
+
+    let mut result = InlineResult {
+        functions_skipped_recursive: cyclic.len(),
+        ..Default::default()
+    };
+
+    for &cyc in &cyclic {
+        log::debug!("inline: skip recursive func={:?}", cyc);
+    }
+
+    let mut current_op_count = total_op_count(module);
+    let mut inlined_callees = BTreeSet::new();
+    let mut mutated_callers = BTreeSet::new();
+
+    'outer: for callee_id in topo {
+        if cyclic.contains(&callee_id) { continue; }
+
+        let callee_weight = heuristic::func_weight(&module.functions[&callee_id]);
+        let sites: Vec<(FuncId, usize)> = graph
+            .callers_of
+            .get(&callee_id)
+            .into_iter()
+            .flat_map(|callers| callers.iter())
+            .flat_map(|&caller| {
+                graph
+                    .call_sites_of
+                    .get(&caller)
+                    .into_iter()
+                    .flat_map(move |sites| {
+                        sites.iter().filter_map(move |&(idx, c)| {
+                            (c == callee_id).then_some((caller, idx))
+                        })
+                    })
+            })
+            .collect();
+
+        if sites.is_empty() { continue; }
+
+        let decision = heuristic::should_inline(
+            callee_weight, sites.len(), current_op_count, config,
+        );
+
+        match decision {
+            Decision::Inline => {
+                log::debug!(
+                    "inline: callee={:?} weight={} sites={} module_ops={}",
+                    callee_id, callee_weight, sites.len(), current_op_count,
+                );
+                // Splice each site. Process within a caller in DESCENDING
+                // op_idx order so earlier indices stay valid as later ones
+                // are spliced in place.
+                let by_caller = group_by_caller_desc(&sites);
+                // Take the callee out of the map so we can freely &mut
+                // every caller; put it back when done with this callee.
+                let callee = module.functions.remove(&callee_id)
+                    .expect("topo callee must exist");
+                for (caller_id, indices) in by_caller {
+                    let caller = module.functions.get_mut(&caller_id)
+                        .expect("caller must exist");
+                    for op_idx in indices {
+                        splice::inline_call_site(caller, &callee, op_idx);
+                        result.call_sites_replaced += 1;
+                    }
+                    mutated_callers.insert(caller_id);
+                }
+                module.functions.insert(callee_id, callee);
+                inlined_callees.insert(callee_id);
+                current_op_count = total_op_count(module);
+            }
+            Decision::SkipTooLarge { weight, threshold } => log::debug!(
+                "inline: skip callee={:?} too_large weight={} threshold={}",
+                callee_id, weight, threshold,
+            ),
+            Decision::SkipBudget { projected, budget } => {
+                log::debug!(
+                    "inline: skip callee={:?} budget projected={} budget={}",
+                    callee_id, projected, budget,
+                );
+                if projected > config.module_op_budget {
+                    result.budget_exceeded = true;
+                    break 'outer;
+                }
+            }
+            Decision::SkipMode => log::debug!(
+                "inline: skip callee={:?} mode=Never", callee_id,
+            ),
+        }
+    }
+
+    // Recompute offsets once per mutated caller.
+    for caller_id in mutated_callers {
+        let f = module.functions.get_mut(&caller_id)
+            .expect("mutated caller must exist");
+        recompute_offsets(&mut f.body);
+        // Optional: shrink_to_fit for embedded RAM hygiene.
+        f.body.shrink_to_fit();
+    }
+
+    result.functions_inlined = inlined_callees.len();
+    result
+}
+```
+
+Helpers:
+
+- **`total_op_count(module) -> usize`**: sum of `body.len()` across
+  functions. Cheap; recompute on each iteration is fine.
+- **`borrow_two_mut(map, a, b)`** (conceptual): the splice needs the
+  caller as `&mut IrFunction` and the callee as `&IrFunction` out of
+  the same `BTreeMap<FuncId, IrFunction>`. Options: `remove` one entry
+  into a local, mutate the other in place via `get_mut`, then
+  re-insert (what the loop above does); unsafe pointer math through
+  two `get_mut` calls (avoid); or restructure the loop so each splice
+  borrows only one function at a time. Prefer the remove/insert dance
+  for clarity; the performance impact is negligible since it happens
+  once per inlined callee.
+- **`group_by_caller_desc`**: bucket `(caller, op_idx)` pairs by
+  caller into `Vec<(FuncId, Vec<usize>)>` with each inner vec sorted
+  descending. Iteration order across callers is not material.
+
+### Determinism notes
+
+- Topo order is deterministic (Kahn with `BTreeSet` queue).
+- For each callee, the set of call sites comes from
+  `callers_of[callee]` (sorted) cross `call_sites_of[caller]` (body
+  order); descending splice order within a caller keeps op indices
+  stable.
+- `inline_module` is therefore deterministic across runs given the
+  same input module + config.
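+
+The Kahn-with-`BTreeSet` ordering mentioned above can be sketched as
+follows. This is an illustration, not the actual
+`callgraph::topo_order`: the `callees_of` input shape and the `u32` id
+are assumptions, every function is assumed to appear as a key, and in
+this simplified version a function that calls into a cycle also ends up
+in the leftover list.
+
+```rust
+use std::collections::{BTreeMap, BTreeSet};
+
+/// Stand-in for the crate's `FuncId` (assumption).
+type FuncId = u32;
+
+/// Returns (bottom-up order, ids that could not be ordered).
+fn topo_order(
+    callees_of: &BTreeMap<FuncId, BTreeSet<FuncId>>,
+) -> (Vec<FuncId>, Vec<FuncId>) {
+    // pending[f] = number of distinct callees of f not yet emitted.
+    let mut pending: BTreeMap<FuncId, usize> = BTreeMap::new();
+    let mut callers_of: BTreeMap<FuncId, BTreeSet<FuncId>> = BTreeMap::new();
+    for (&caller, callees) in callees_of {
+        pending.insert(caller, callees.len());
+        for &callee in callees {
+            callers_of.entry(callee).or_default().insert(caller);
+        }
+    }
+    // The ready queue is a BTreeSet, so the smallest id always pops first.
+    let mut ready: BTreeSet<FuncId> = pending
+        .iter()
+        .filter(|&(_, &n)| n == 0)
+        .map(|(&f, _)| f)
+        .collect();
+    let mut order = Vec::new();
+    while let Some(f) = ready.pop_first() {
+        order.push(f);
+        for &caller in callers_of.get(&f).into_iter().flatten() {
+            let n = pending.get_mut(&caller).expect("caller is a graph node");
+            *n -= 1;
+            if *n == 0 {
+                ready.insert(caller);
+            }
+        }
+    }
+    let emitted: BTreeSet<FuncId> = order.iter().copied().collect();
+    let leftover: Vec<FuncId> = pending
+        .keys()
+        .copied()
+        .filter(|f| !emitted.contains(f))
+        .collect();
+    (order, leftover)
+}
+```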
+
+### `lpir/src/lib.rs`
+
+- Should already re-export `inline_module` and `InlineResult` from Phase 2.
+- If the `InlineResult` re-export is missing, add `pub use inline::InlineResult;`.
+
+## Tests (`lpir` crate)
+
+`tests/inline_basic.rs` (extend from Phase 5): add end-to-end tests
+that go through `inline_module` rather than calling `inline_call_site`
+directly:
+
+- **`leaf_inlined_into_caller`**: 2-function module, default config.
+  After `inline_module`: 1 call site replaced, caller body grew
+  appropriately, callee still present (M5 will delete it).
+- **`chain_inlined_bottom_up`**: A→B→C. Expect C inlined into B first,
+  then B (with C inlined inside it) inlined into A.
+- **`recursive_skipped`**: A→A. Expect `functions_skipped_recursive ==
+  1`, `call_sites_replaced == 0`, A's body unchanged.
+
+`tests/inline_heuristic.rs` (new):
+
+- **`mode_never`**: any callee → `SkipMode`, no inlining.
+- **`mode_always_inlines_huge_callee`**: huge callee (weight ≫
+  threshold) called once → still inlined under `Always` (only budget
+  can stop it).
+- **`auto_skips_large_multi_site`**: weight > threshold, 2 call sites
+  → `SkipTooLarge`, not inlined.
+- **`auto_inlines_large_single_site`**: weight > threshold, 1 call
+  site → inlined (single-site exception per `should_inline` logic).
+- **`module_op_budget_hit`**: tiny budget → `budget_exceeded == true`,
+  partial work preserved.
+- **`max_growth_budget_per_callee`**: callee weight × sites exceeds
+  per-callee growth → `SkipBudget`, other callees still considered.
+- **`debug_log_contains_decisions`**: capture `log` output (use
+  `log::set_logger` with a test sink), assert one line per decision
+  category.
+
+## Validate
+
+```bash
+cargo test -p lpir
+```
+
+Other crates (lpvm-native / wasm / cranelift / lps-filetests) can
+build but are not exercised here — `inline_module` is opt-in and not
+yet wired into the compile pipeline (M4).
diff --git a/docs/plans/2026-04-17-lpir-inliner-stage-iii/07-cleanup-and-validation.md b/docs/plans/2026-04-17-lpir-inliner-stage-iii/07-cleanup-and-validation.md
new file mode 100644
index 000000000..9779ce074
--- /dev/null
+++ b/docs/plans/2026-04-17-lpir-inliner-stage-iii/07-cleanup-and-validation.md
@@ -0,0 +1,91 @@
+# Phase 7 — Cleanup & validation
+
+## Scope of phase
+
+- Grep the working tree for **`TODO`**, **`FIXME`**, stray **`dbg!`**,
+  debug **`println!`** introduced during this plan.
+- Fix warnings: unused imports left over from scaffold phases,
+  **`dead_code`** on test-only helpers (prefer **`#[allow(dead_code)]`**
+  with a one-line reason or remove).
+- Re-skim the public surface in `lpir/src/lib.rs` — only
+  **`inline_module`** and **`InlineResult`** should be exported from
+  the inliner; everything else stays crate-private.
+- Confirm `log::debug!` calls are at the right level (decisions =
+  debug; per-op chatter, if any was added during bring-up, must be
+  removed or downgraded to `trace`).
+- Run the **full validation matrix** below.
+
+## Cleanup & validation
+
+```bash
+# Per-crate tests.
+cargo test -p lpir
+cargo test -p lpvm-native
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-wasm
+
+# Filetests (M2.5 backend no-op arms must not regress anything).
+cargo test -p lps-filetests -- --test-threads=4
+
+# Embedded build path — required by no-std-compile-path rule.
+cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf \
+  --profile release-esp32 --features esp32c6,server
+
+# Other consumers of lpir if applicable to this workspace's AGENTS list
+# (e.g. fw-emu, lp-server). Add as required.
+cargo check -p fw-emu
+cargo check -p lp-server
+```
+
+Expected results:
+
+- All existing tests pass — the inliner is opt-in and not yet wired
+  into `compile_module` (M4 wires it).
+- M2.5 marker round-trips through parse/print and validate stays
+  silent on legacy loops (no `Continuing` marker) and on new loops
+  (with `Continuing` marker).
+- No new warnings under `-D warnings` if the workspace enforces it.
+
+## Plan cleanup
+
+- Write **`docs/plans/2026-04-17-lpir-inliner-stage-iii/summary.md`**:
+  bullets — what shipped (`LpirOp::Continuing` marker, `inline` module,
+  `inline_module` public API, callgraph + topo order, per-param scan +
+  alias-or-copy remap, body splicer, heuristic with `func_weight =
+  body.len()`, structural `recompute_offsets`), crates touched
+  (`lpir`, `lpvm-native`, `lpvm-cranelift`, `lpvm-wasm`), follow-ups
+  (M3.1 empirical `func_weight` tuning, M4 wire into
+  `compile_module` + GLSL filetests with `compile-opt`, M5 dead-func
+  elimination, future-work removal of denormalized offset fields).
+- Move **`docs/plans/2026-04-17-lpir-inliner-stage-iii/`** →
+  **`docs/plans-done/2026-04-17-lpir-inliner-stage-iii/`** when
+  implementation is complete.
+
+## Commit (when requested)
+
+Single Conventional Commits message covering both M2.5 and M3:
+
+```
+feat(lpir): inliner pass + Continuing marker op (M3 + M2.5)
+
+- Add LpirOp::Continuing structural marker for loop continuing block;
+  cached LoopStart::continuing_offset retained for backend efficiency.
+  No-op arms in lpvm-native, lpvm-cranelift, lpvm-wasm.
+- Add lpir::inline module: inline_module public API, call graph with
+  bottom-up topological order and cycle skipping, per-param
+  scan-then-alias-or-copy remap, body splicer with return-shape
+  analysis, structural recompute_offsets, heuristic gated by
+  InlineConfig (mode + thresholds + budgets).
+- Decisions emitted at log::debug for CLI observability.
+- Empirical func_weight tuning deferred to M3.1; dead-func elimination
+  deferred to M5; pipeline wiring + GLSL filetests deferred to M4.
+```
+
+## Code Organization Reminders
+
+- Final pass: no temporary hacks without **`TODO(plan):`** if something
+  must remain. Any remaining TODOs must reference a follow-up
+  milestone (M3.1 / M4 / M5 / future-work).
+- Keep the inliner crate-private surface tight — future contributors
+  should be able to refactor `inline/` internals without touching
+  any other crate.
diff --git a/docs/plans/2026-04-19-lpir-inliner-m5-dead-func-elim/00-notes.md b/docs/plans/2026-04-19-lpir-inliner-m5-dead-func-elim/00-notes.md
new file mode 100644
index 000000000..0ab8f9690
--- /dev/null
+++ b/docs/plans/2026-04-19-lpir-inliner-m5-dead-func-elim/00-notes.md
@@ -0,0 +1,85 @@
+# M5 — LPIR Dead Function Elimination — Notes
+
+Plan for the `dead_func_elim` pass: a small post-inline cleanup that drops
+local functions with zero remaining call sites that aren't in the
+caller-supplied root set. Implements
+[m5-dead-func-elim.md](../../roadmaps/2026-04-15-lpir-inliner/m5-dead-func-elim.md).
+
+## Scope of work
+
+1. **`dead_func_elim` pass** in `lpir/src/dead_func_elim.rs`:
+   - Inputs: `&mut LpirModule`, `roots: &[FuncId]`.
+   - Algorithm: count local call sites per function (walk all bodies),
+     mark reachable transitively from roots, remove unmarked entries
+     from `module.functions`. Stable `FuncId` (M0) makes deletion safe.
+   - Returns `DeadFuncElimResult { functions_removed: usize }` plus a
+     `log::info!` summary like the inliner.
+2. **`DeadFuncElimConfig`** added to `CompilerConfig`, mirroring
+   `InlineConfig`:
+   - `mode: DeadFuncElimMode` ∈ {`Auto`, `Never`}, default `Never`.
+   - String keys `dead_func_elim.mode` plumbed through
+     `CompilerConfig::apply` and `COMPILER_CONFIG_APPLY_HELP`.
+3. **Backend wiring** (4 spots — same shape as M4):
+   - `lpvm-native::compile_module`,
+     `lpvm-cranelift::build_jit_module`,
+     `lpvm-cranelift::object_bytes_from_ir`,
+     `lpvm-wasm::compile_lpir`.
+   - After the existing `inline_module` call, when `mode != Never`,
+     compute roots and call `dead_func_elim`.
+4. **Roots resolution.** GLSL frontend currently does **not** set
+   `is_entry`. Production wiring needs an explicit signal. Two clean
+   options (Q2): wire `is_entry` in `lps-frontend`, or carry an
+   `entry_names` list in `CompilerConfig`. Filetests stay on `Never`.
+5. **Tests:**
+   - Rust unit tests in `lpir/src/tests/dead_func_elim.rs` (BTreeMap
+     module, root reachability, multiple roots, no-op when nothing
+     dead, removal of import-callers preserved).
+   - One filetest under `filetests/optimizer/dead_func_elim/` exercising
+     the `compile-opt(dead_func_elim.mode, auto)` + forced inline path
+     end-to-end.
+6. **Docs:** update `m5-dead-func-elim.md` to match current code shape
+   (BTreeMap, no `OptPass` enum, roots-by-name in callers).
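+
+The mark-and-sweep core of the pass can be sketched as follows.
+`Module` here is a hypothetical stand-in that reduces each function to
+the list of local callees found in its body; the real pass would walk
+`LpirOp::Call` ops with `CalleeRef::Local` targets instead.
+
+```rust
+use std::collections::{BTreeMap, BTreeSet};
+
+/// Stand-in for the crate's `FuncId` (assumption).
+type FuncId = u32;
+
+/// Hypothetical reduced module: function id -> local callees in its body.
+struct Module {
+    functions: BTreeMap<FuncId, Vec<FuncId>>,
+}
+
+/// Remove every function not reachable from `roots`.
+/// Returns the number of functions removed.
+fn dead_func_elim(module: &mut Module, roots: &[FuncId]) -> usize {
+    let mut live: BTreeSet<FuncId> = BTreeSet::new();
+    let mut worklist: Vec<FuncId> = roots.to_vec();
+    while let Some(f) = worklist.pop() {
+        // Skip roots that are not in the module and ids already marked.
+        if !module.functions.contains_key(&f) || !live.insert(f) {
+            continue;
+        }
+        worklist.extend(module.functions[&f].iter().copied());
+    }
+    let before = module.functions.len();
+    module.functions.retain(|id, _| live.contains(id));
+    before - module.functions.len()
+}
+```
+
+Because `FuncId`s are stable, `retain` can drop entries without
+touching any surviving `Call` op.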
+
+## Current state of the codebase
+
+- `LpirModule { imports: Vec<ImportDecl>, functions: BTreeMap<FuncId,
+  IrFunction> }` — keyed by stable `FuncId` (M0 done).
+- `IrFunction { is_entry: bool, ... }` — set by `parse.rs` from textual
+  `is_entry` directives and by some hand-rolled builder paths, but
+  **not** by the GLSL frontend (`lps-frontend`).
+- `CalleeRef::Local(FuncId)` references survive arbitrary
+  insertion/removal in `module.functions` (no renumbering).
+- `CompilerConfig { inline: InlineConfig }` lives in
+  `lpir/src/compiler_config.rs`; `apply(key, value)` parses string
+  overrides; `COMPILER_CONFIG_APPLY_HELP` documents them for
+  `shader-debug --compiler-opt`.
+- `inline_module(&mut module, &config.inline) -> InlineResult` is wired
+  into all 4 backend entry points (M4). Each clones the IR, runs the
+  inliner, then proceeds. Same pattern fits dead-func-elim.
+- Filetests directly invoke arbitrary user functions by name (e.g.
+  `test_call_simple_single_arg()`). Anything dead-func-elim removes
+  that the harness wanted to call would break the test.
+- Runtime instances also look up entries by name
+  (`module.entry_offset(name)`), not by `is_entry`.
+
+## Questions & Answers
+
+- **Q1 — pass takes `roots: &[FuncId]` (not `&[&str]`).** ✓
+  Pass works in `FuncId` space; provide a small `roots_by_name(&module,
+  &[&str]) -> Vec<FuncId>` helper for callers with names.
+- **Q2 — root resolution:** **(A) `is_entry` flag**, with prerequisite
+  fix to `lps-frontend/src/lower.rs` that marks the GLSL entry point
+  function with `is_entry = true`. Backends call
+  `roots_from_is_entry(&module)` to populate the root set.
+- **Q3 — default `dead_func_elim.mode = Never`.** ✓ Production opts in;
+  filetests work unchanged.
+- **Q4 — leave `LpsModuleSig` alone.** ✓ Sig is name-keyed; staleness
+  is harmless.
+- **Q5 — defer "inline-and-delete-as-we-go".** ✓ Already captured in
+  `future-work.md`; revisit if peak memory becomes a real problem.
+- **Q6 — filetest uses `compile-opt(inline.mode, always)` +
+  `compile-opt(dead_func_elim.mode, auto)`.** ✓ Realistic production
+  combo; harness asserts correctness; `functions_removed` visible via
+  `log::info!`.
+- **Q7 — `lp-cli shader-debug` prints `functions_removed`.** ✓ One
+  extra log line, gated on `mode != Never`.
diff --git a/docs/roadmaps/2026-04-15-lpir-inliner/future-work.md b/docs/roadmaps/2026-04-15-lpir-inliner/future-work.md
new file mode 100644
index 000000000..ad2f9c214
--- /dev/null
+++ b/docs/roadmaps/2026-04-15-lpir-inliner/future-work.md
@@ -0,0 +1,231 @@
+# LPIR Inliner — Future Work
+
+Things surfaced while planning M0–M5 that are real wins but not blocking
+the inliner. Capture here so they don't get forgotten.
+
+## Remove denormalized control-flow offsets
+
+### Problem
+
+`LpirOp::IfStart`, `LoopStart`, `SwitchStart`, `CaseStart`, `DefaultStart`,
+and `Block` all carry `else_offset` / `end_offset` / `continuing_offset`
+fields. These are **caches of structural information** — they can be fully
+recomputed by walking the body and matching openers to their closers
+(`Else`, `End`, the new `Continuing` marker from M2.5).
+
+Storing them in the IR is denormalization. The cost shows up every time a
+pass mutates the body:
+
+- M3 (inliner) needs a recompute pass over the entire body of every
+  function it transforms.
+- Every future structural transform (loop unrolling, dead-code-elim, peephole
+  on control flow, etc.) inherits the same maintenance burden.
+- Bugs in offset maintenance are subtle: tests pass for "happy path" code
+  shapes and explode on contrived nesting. Hard to fuzz.
+
+The inliner conversation made this concrete: even after M2.5 adds the
+`Continuing` marker for parity, every consumer that mutates the body has
+to remember to call `recompute_offsets` or the cached fields go stale.
+
+### Proposal
+
+1. Drop `else_offset`, `end_offset`, `continuing_offset` from
+   `LpirOp`. The opener variants become e.g.:
+
+   ```rust
+   IfStart   { cond: VReg }
+   LoopStart {}                  // no fields at all
+   SwitchStart { selector: VReg }
+   CaseStart   { value: i32 }
+   DefaultStart {}
+   Block     {}
+   ```
+
+2. Add a single `lpir::offsets` module exposing:
+
+   ```rust
+   /// Side-table keyed by op index (`body[i]`) → derived offsets.
+   pub struct OffsetMap {
+       /// Per-index entry; only populated for opener ops.
+       entries: Vec<Option<Offsets>>,
+   }
+
+   pub enum Offsets {
+       If    { else_pc: u32, end_pc: u32 },
+       Loop  { continuing_pc: u32, end_pc: u32 },
+       Switch { end_pc: u32, /* per-arm: */ arm_ends: SmallVec<...> },
+       Case  { end_pc: u32 },
+       Block { end_pc: u32 },
+   }
+
+   pub fn compute_offsets(body: &[LpirOp]) -> OffsetMap;
+   ```
+
+   Single O(n) pass, identical to the M3 inliner's recompute pass. No
+   allocation per op for non-opener positions (use `Option<Offsets>` or
+   a sparse map).
+
+3. Each backend / interpreter / validator calls `compute_offsets(&body)`
+   exactly once at function entry, then looks up by `pc` as needed.
+
+   - Cost: one extra O(n) walk per function compile. Negligible compared
+     to actual codegen.
+   - Benefit: zero maintenance burden for any pass that mutates the
+     body. Inliner becomes simpler. Any future transform (loop fusion,
+     control-flow simplification, predicate hoisting, …) becomes
+     trivially correct w.r.t. offsets.
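+
+A reduced sketch of the proposed `compute_offsets`, assuming a toy `Op`
+enum with only `If` / `Loop` / `Block` regions (no `Switch` arms) and a
+plain `Vec<Option<Offsets>>` in place of `OffsetMap`. Two conventions
+below are assumptions, not settled design: `end_pc` is the index of the
+matching `End`, and a region without an `Else` / `Continuing` marker
+reuses `end_pc` for the middle offset.
+
+```rust
+/// Toy stand-in for `LpirOp` (hypothetical subset).
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum Op { IfStart, Else, LoopStart, Continuing, Block, End, Other }
+
+#[derive(Debug, Clone, Copy, PartialEq)]
+enum Offsets {
+    If { else_pc: u32, end_pc: u32 },
+    Loop { continuing_pc: u32, end_pc: u32 },
+    Block { end_pc: u32 },
+}
+
+/// One O(n) walk: keep a stack of open regions and fill in each
+/// opener's entry when its closing `End` is reached.
+fn compute_offsets(body: &[Op]) -> Vec<Option<Offsets>> {
+    let mut entries: Vec<Option<Offsets>> = vec![None; body.len()];
+    // (opener index, index of the Else / Continuing marker if seen)
+    let mut open: Vec<(usize, Option<u32>)> = Vec::new();
+    for (pc, op) in body.iter().enumerate() {
+        match op {
+            Op::IfStart | Op::LoopStart | Op::Block => open.push((pc, None)),
+            Op::Else | Op::Continuing => {
+                if let Some(top) = open.last_mut() {
+                    top.1 = Some(pc as u32);
+                }
+            }
+            Op::End => {
+                // An unmatched End is ignored here; a validator would reject it.
+                let Some((start, mid)) = open.pop() else { continue };
+                let end_pc = pc as u32;
+                let mid_pc = mid.unwrap_or(end_pc);
+                entries[start] = Some(match body[start] {
+                    Op::IfStart => Offsets::If { else_pc: mid_pc, end_pc },
+                    Op::LoopStart => Offsets::Loop { continuing_pc: mid_pc, end_pc },
+                    _ => Offsets::Block { end_pc },
+                });
+            }
+            Op::Other => {}
+        }
+    }
+    entries
+}
+```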
+
+### Scope estimate
+
+Touches all three backends + interpreter + validator + parser/printer
+(printer needs to walk and find positions to print `else:` / `end` text;
+parser already builds without offsets, just patches at end). Roughly the
+same shape as M2 + M2.5 combined. ~12-15 files.
+
+### When to do it
+
+- **Not** during M3-M5 — those should stay focused.
+- After M5 lands, when we're touching backends for other reasons (more
+  passes, perf tuning, etc.) and the velocity benefit of "no offset
+  bookkeeping in transforms" starts compounding.
+- Prerequisite: M2.5 must land first (or the two land together as a
+  combined cleanup).
+
+### Acceptance criteria
+
+- All filetests pass with no behavioral change.
+- A representative pass that mutates the body (could be the inliner
+  itself, after M3) becomes shorter — measure LOC delta on `inline/`.
+- A new test category: "structural mutation" — randomly insert/remove
+  `Copy` ops in valid loop nests and assert behavior is preserved
+  without any offset bookkeeping.
+
+## Inline-and-delete-as-we-go (peak-memory optimization)
+
+### Problem
+
+Today (M3 + M5):
+
+1. M3 inlines all `Call` ops, leaving fully-inlined helpers in
+   `LpirModule.functions` with zero remaining callers.
+2. M5 (DeadFuncElim) runs as a separate pass and deletes them.
+
+In between, the module holds **both** the original helpers *and* the
+inlined-into callers. Peak memory during compile is roughly
+`sizeof(callers post-inline) + sizeof(unused helpers)`. On embedded
+targets (ESP32, ~120 KB heap budget for compile state), this matters for
+shaders with many helpers.
+
+### Proposal
+
+When the inliner finishes a callee `f` (i.e. has spliced into all
+callers), and `f` is not in the configured root set / entry set, delete
+`f` from `LpirModule.functions` immediately.
+
+- Saves peak memory ≈ `sizeof(f.body) + sizeof(f.vreg_pool)` per fully
+  inlined helper, summed over all helpers that today stay resident
+  between the M3 and M5 passes.
+- Bottom-up topological order makes this safe: `f` is processed only
+  after all its own callees have been inlined into `f`'s body, and `f`
+  is deleted only after all *its* callers have been processed.
+
+### Why not now (M3)
+
+- M5's deletion logic is non-trivial (root set, sig filtering, `FuncId`
+  hygiene). Building it first as a standalone pass and then optionally
+  collapsing into M3 is the safer path.
+- M3 staying read-only at the function-set level (only mutates `body` /
+  `vreg_pool` / `vreg_types` / `slots`) keeps tests simple — every
+  function the test set up is still there to be inspected after the
+  pass.
+
+### When
+
+After M5 lands and is well-tested. Add an `InlineConfig` knob like
+`prune_during_inline: bool` (default `false` for filetests, `true` for
+production callers with a configured root set).
+
+## Other follow-ups
+
+### CI optimization-profile sweeps (Target × OptProfile axis)
+
+Today `Target` only encodes backend / ISA / float mode. To get automatic
+regression detection on the inliner perf signal, we want the filetest
+harness to be able to run the same test under multiple
+`(target, opt-profile)` combinations and emit deltas.
+
+Concrete shape: extend `Target` (or add a parallel `OptProfile` axis)
+with named profiles like `o0` (no inlining, no const-fold), `o1`
+(default Auto), `o2` (always inline). CI runs the suite under each
+profile and asserts no unexpected pass/fail flips. Output table gets a
+new column or row per profile.
+
+Deferred from M4 because the surface area was larger than the ad-hoc
+`--force-opt` flag we ended up shipping (which is sufficient for
+human-driven A/B today).
+
+### Grow `examples/` corpus with more representative shaders
+
+The M4 outcome measurement leaned on a single shader
+(`examples/rainbow.glsl`). That's enough to confirm the pipeline works
+but not enough to drive heuristic tuning or catch regressions on real
+content. Write 3–5 more shaders that exercise different code-shapes:
+heavy palette/lookup, math-heavy fragment work, control-flow-heavy
+animation, etc. Bonus: include a shader that mirrors a real artist's
+output.
+
+### Inliner: refresh stale call-graph indices between callees
+
+Surfaced during M5 filetest design. `inline_module` builds the call
+graph once at the start of the pass and uses the cached
+`(caller, op_idx)` pairs unchanged for every callee. Splicing a call
+site mutates the caller body and shifts every subsequent op's index, so
+when a single caller has Calls to **two distinct local callees** the
+second callee's recorded `op_idx` is stale by the time we get there.
+`splice::inline_call_site` then sees a non-`Call` op at that index and
+silently returns; the inliner reports `inlined=N` but the second callee
+isn't actually spliced.
+
+Workarounds today: filetests avoid the pattern (see
+`optimizer/dead_func_elim/dfe-removes-unreachable.glsl`, where `render`
+calls only one local function under `inline.mode=always`).
+
+Fix options:
+1. Rebuild the call graph after each callee is processed (simplest,
+   O(n) per callee).
+2. Maintain a small per-caller index-shift vector during splicing and
+   apply it when looking up subsequent sites.
+3. Refresh sites for a caller lazily right before splicing, by
+   re-walking that caller's body once per (caller, callee) pair.
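+
+Option 3 can be sketched with a toy op type. `Op`, `fresh_sites_desc`,
+and `splice` below are hypothetical stand-ins for `LpirOp`, the site
+lookup, and `splice::inline_call_site`; the point is only that indices
+are re-derived from the caller's current body right before splicing.
+
+```rust
+/// Stand-in for the crate's `FuncId` (assumption).
+type FuncId = u32;
+
+/// Toy stand-in for `LpirOp`.
+#[derive(Debug, Clone, PartialEq)]
+enum Op {
+    Call(FuncId),
+    Other(&'static str),
+}
+
+/// Current indices of calls to `callee` in `body`, in descending order.
+fn fresh_sites_desc(body: &[Op], callee: FuncId) -> Vec<usize> {
+    let mut sites: Vec<usize> = body
+        .iter()
+        .enumerate()
+        .filter_map(|(i, op)| matches!(op, Op::Call(c) if *c == callee).then_some(i))
+        .collect();
+    sites.reverse();
+    sites
+}
+
+/// Toy splice: replace the call at `idx` with a copy of the callee body.
+fn splice(body: &mut Vec<Op>, idx: usize, callee_body: &[Op]) {
+    // The replacement takes effect when the returned iterator is dropped.
+    drop(body.splice(idx..=idx, callee_body.iter().cloned()));
+}
+```
+
+With a caller body of `[Call(1), x, Call(2)]`, inlining a two-op callee
+1 moves `Call(2)` from index 2 to index 3; a lookup done after that
+splice returns 3, where the cached graph would still say 2.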
+
+Acceptance: a filetest like `dfe-after-inline.glsl` (small `helper`,
+small `test_dfe_*`, `render` calls both `pipeline(...)` *and*
+`test_dfe_*` directly) compiles and `// run:` lines pass on every
+backend with `inline.mode=always`.
+
+### Mark `test_*` functions as `is_entry` in the filetest path
+
+Surfaced during M5. The harness invokes user functions by name (e.g.
+`test_dfe_after_inline`). With `inline.mode=always` the inliner copies
+small `test_*` bodies into `render` and removes the original call site;
+DFE then drops the now-orphan `test_*`, and the harness fails with
+"symbol not found".
+
+Cheapest fix: have either the filetest harness or the GLSL frontend
+mark every function named `test_*` as `is_entry`, so it survives DFE
+even after being inlined. Alternative: extend
+`CompilerConfig`/`DeadFuncElimConfig` with an explicit `entry_names:
+Vec<String>` knob that the harness populates from the parsed `// run:`
+directives.
+
+### Triage `function/call-order.glsl` under `--force-opt inline.mode=always`
+
+Surfaced during M4 Phase 4 acceptance: this test is annotated
+`@unimplemented` for some target but starts passing when inlining is
+forced on. Either inlining is accidentally working around a real bug,
+or the `@unimplemented` annotation is stale. Quick triage:
+1. Run the file under default Auto and confirm the same `@unimplemented`
+   assertion still fires.
+2. Diff the LPIR between Auto and Always to identify which call site
+   gets inlined.
+3. Either delete the stale annotation or file a real bug.
diff --git a/docs/roadmaps/2026-04-15-lpir-inliner/impl-notes.md b/docs/roadmaps/2026-04-15-lpir-inliner/impl-notes.md
new file mode 100644
index 000000000..60c1da9cd
--- /dev/null
+++ b/docs/roadmaps/2026-04-15-lpir-inliner/impl-notes.md
@@ -0,0 +1,91 @@
+# Implementation notes
+
+Cross-cutting context for the LPIR inliner work that doesn't belong in any
+single milestone doc.
+
+## Unified `lps-shader` crate (parallel branch)
+
+A separate in-flight branch introduces a new top-level **`lps-shader`** crate
+that consolidates the LPIR-side compile pipeline. Today, three backends each
+have their own entry point with their own options struct and their own copy
+of the "lower GLSL → optimize LPIR → emit" wiring:
+
+```
+lps_frontend::compile  +  lps_frontend::lower
+    ↓
+LpirModule
+    ↓
+lpvm-cranelift  (CraneliftEngine::compile,  CompileOptions)
+lpvm-native     (NativeFaEngine::compile,   NativeCompileOptions)
+lpvm-wasm       (WasmLpvmEngine::compile,   WasmOptions)
+```
+
+The unified crate will own the LPIR-side pipeline once and let each backend
+plug in only its target-specific bits:
+
+```
+lps_shader::compile(source, target, options)
+    ↓
+lps_frontend → LpirModule
+    ↓  ←  shared mid-end (inline, const_fold, future passes)
+LpirModule (post-mid-end)
+    ↓  →  one of: cranelift / native / wasm backend
+```
+
+That branch is **waiting on this one** (the inliner). Once both land:
+
+- The inliner call site moves from three places (one per backend's
+  `compile_module` / equivalent) to a single place in `lps-shader`.
+- `CompilerConfig` lives at the `lps-shader` API boundary; backend
+  `CompileOptions` / `NativeCompileOptions` / `WasmOptions` lose the
+  `config: CompilerConfig` field they all carry today.
+- The filetest harness's `CompiledShader::compile_glsl` (which currently
+  dispatches per backend and threads `compiler_config` into each options
+  struct) collapses into a single call.
+
+### Implications for M4
+
+We're wiring `inline_module` into all three backends in M4 (per the
+"all backends for consistency" decision). That means M4 lands three call
+sites — one in each backend's compile entry — that the unified-crate
+branch will later consolidate into one.
+
+This is intentional. The alternatives were worse:
+
+- Wait for the unified crate before wiring inlining → blocks the
+  unified-crate branch on the inliner *and* delays the rv32n perf win.
+- Native-only in M4 → leaves cranelift/wasm divergent from native, which
+  defeats the "preview matches device, reference matches optimization
+  semantics" rationale that motivated the all-backends decision.
+
+The duplication is mechanical and cheap to remove. Each call site is one
+function-call's worth of code. The unified-crate PR can rip them out as
+part of its consolidation step with no behavior change.
+
+### Guidance for the unified-crate agent
+
+When consolidating:
+
+1. The inliner is **mid-end**, not backend-specific. It runs once per
+   compile, on a clone of `LpirModule`, before per-function passes
+   (`const_fold` then backend-specific lowering).
+2. `inline_module` is mutative; clone the module before passing it in
+   (the backends do `let mut ir_opt = ir.clone();` today).
+3. The current pipeline order on each backend is:
+   `inline_module` (module) → `const_fold` (per function) → backend
+   lower / emit. Preserve this order in the unified crate.
+4. `CompilerConfig` is `Clone`, `no_std`-compatible, and lives in
+   `lpir`. It already carries everything every backend needs at the
+   mid-end layer (`inline: InlineConfig`; future passes will add
+   sibling fields).
+5. The filetest annotations that already exist
+   (`compile-opt(inline.mode, never)` and `compile-opt(inline.mode, always)`
+   sprinkled across `filetests/function/`, `filetests/lpvm/native/`,
+   and the new `filetests/inline/` dir) are file-scoped and apply to
+   every backend invocation for that test. The unified `lps-shader`
+   entry will see them through the same `CompilerConfig` channel.
+
+If the unified-crate branch lands first for any reason, the M4 work
+slots in trivially: one call to `inline_module` at the top of the
+shared `compile` function, and the per-backend wiring this milestone
+adds becomes a no-op delete.
diff --git a/docs/roadmaps/2026-04-15-lpir-inliner/m1-optpass-filetests.md b/docs/roadmaps/2026-04-15-lpir-inliner/m1-optpass-filetests.md
index f8f492ab9..5e7bbcb54 100644
--- a/docs/roadmaps/2026-04-15-lpir-inliner/m1-optpass-filetests.md
+++ b/docs/roadmaps/2026-04-15-lpir-inliner/m1-optpass-filetests.md
@@ -1,25 +1,39 @@
-# M1 — Compiler Config + Filetest `@config` Annotation
+# M1 — Compiler Config + Filetest `compile-opt`
 
-Add a `@config(key, value)` annotation to filetests for controlling
-compiler options per file. All optimizations are always in the pipeline —
-they disable themselves via their own config (e.g. `inline.mode = never`).
+Add a **`// compile-opt(key, value)`** file directive to filetests for controlling
+**LPIR optimization** options per file. Passes stay in the pipeline and consult
+**`CompilerConfig`** (e.g. `inline.mode = never` skips inlining in the pass).
+
+`CompilerConfig` is **not** part of the GLSL frontend (`lps-frontend`). It is a
+**middle-end** concern: options for **LPIR-level** transforms (inline, future
+passes) that run **after** lowering to LPIR and **before or during** lowering to
+each backend. Backend-only knobs (native float mode / debug flags, Cranelift
+memory strategy, WASM emit details) stay on each backend’s option struct and are
+**layered** beside `CompilerConfig`, not merged into it.
 
 ## Design
 
-### `@config` annotation
+### `compile-opt` directive
 
-Single annotation syntax for all compiler options:
+Single directive syntax for all **string-configurable** compiler (middle-end)
+options. Conventionally placed **at the top of the file** (before `// run:` and
+`// @…` lines):
 
 ```glsl
-// @config(inline.mode, never)
+// compile-opt(inline.mode, never)
 ```
 
 Parsed as a key-value pair: `key = "inline.mode"`, `value = "never"`.
-The harness maps these to the appropriate config structs before compilation.
+The harness maps these to **`CompilerConfig`** before compilation.
+
+This is **not** the same family as **`// @unimplemented(target)`** / etc.:
+those are **target-scoped** and attach to the **next** `// run:`.
+**`compile-opt`** is **file-scoped** and applies to **how the whole module is
+compiled** on every backend path that runs the LPIR pipeline.
 
 ### CompilerConfig
 
-Top-level config struct that holds all optimization configs. Lives in
+Top-level config struct that holds all **LPIR** optimization configs. Lives in
 `lpir` (since passes live there). Must be `no_std`-compatible (`lpir` is
 `#![no_std]` + `alloc`).
 
@@ -31,13 +45,14 @@ pub struct CompilerConfig {
 }
 ```
 
-`CompilerConfig` is about LPIR-level optimization passes. It's separate
-from backend-specific options (`NativeCompileOptions` has float_mode,
-debug_info, etc.). They're layered, not merged:
+Layering vs backends:
 
 ```
-CompilerConfig             (LPIR-level: inline, const_fold, future passes)
-  └─ NativeCompileOptions  (backend-level: float_mode, debug_info, emu_trace)
+CompilerConfig          (LPIR passes: inline, const_fold config, …)  ← middle-end
+  used alongside:
+  NativeCompileOptions  (RV32 native: float_mode, debug_info, emu_trace, …)
+  CompileOptions        (Cranelift: q32_options, memory_strategy, …)
+  WasmOptions           (WASM: float_mode, …)
 ```
 
 ### InlineConfig
@@ -71,7 +86,7 @@ names — no `std` dependency needed for parsing.
 
 ### Config application from key-value pairs
 
-`CompilerConfig` has an `apply` method for mapping annotation strings to
+`CompilerConfig` has an `apply` method for mapping directive strings to
 fields:
 
 ```rust
@@ -103,13 +118,23 @@ impl CompilerConfig {
 }
 ```
 
-Unknown keys are parse errors (catches typos like `inlien.mode`). This
-is the single place that knows the full key namespace — adding a new pass
-means adding match arms here.
+Unknown keys are errors (catches typos like `inlien.mode`). This is the single
+place that knows the full key namespace — adding a new pass means adding match
+arms here.
+
+### Threading through compile options (everywhere)
+
+**`CompilerConfig` must be available on every path that compiles LPIR** so
+filetests and production agree regardless of target (JIT, RV32 Cranelift, RV32
+native, WASM).
+
+Add a `config: CompilerConfig` field to:
 
-### Threading through compile options
+- **`NativeCompileOptions`** (`lpvm-native`)
+- **`CompileOptions`** (`lpvm-cranelift`)
+- **`WasmOptions`** (`lpvm-wasm`)
 
-`NativeCompileOptions` gets a `config: CompilerConfig` field:
+Example (native):
 
 ```rust
 pub struct NativeCompileOptions {
@@ -121,56 +146,49 @@ pub struct NativeCompileOptions {
 }
 ```
 
-Each pass checks its own config. The const_fold and imm_fold passes
-can remain unconditional for now (no config needed — they're cheap and
-always beneficial). Add configs for them later if needed.
+These structs may need to drop **`Copy`** where they previously derived it,
+since `CompilerConfig` is only **`Clone`**. **`Default`** continues to use
+**`CompilerConfig::default()`** for `config`.
 
-### Annotation parsing
-
-Extend `parse_annotation.rs` to handle `@config`:
-
-```rust
-// @config(inline.mode, never)
-//         ^key          ^value
-```
+Each pass reads the shared **`CompilerConfig`**. The const_fold and imm_fold
+passes can remain unconditional for now (no config — cheap and always
+beneficial). Add configs for them later if needed.
 
-New annotation kind: `AnnotationKind::Config { key: String, value: String }`.
+### Parsing (`compile-opt`)
 
-`@config` is **not target-scoped** (unlike `@unimplemented(target)`).
-It applies to the LPIR-level source, not a specific backend. If
-target-specific config is ever needed, a third parameter can be added
-later.
+Implement a **dedicated** parser (e.g. `parse_compile_opt_line`) — **not** an
+`AnnotationKind` variant on `// @…(target)` lines.
 
-### Duplicate key handling
-
-If a file has two `@config` lines with the same key, that's an error:
-
-```glsl
-// @config(inline.mode, never)
-// @config(inline.mode, always)   // ERROR: duplicate key 'inline.mode'
+```text
+// compile-opt(inline.mode, never)
+//             ^key         ^value
 ```
 
-The harness tracks seen keys and rejects duplicates before calling
+**Duplicate keys:** two `compile-opt` lines with the same key → error before
 `CompilerConfig::apply`.
 
 ### Changes to TestFile
 
-Add `config_overrides: Vec<(String, String)>` to `TestFile`. The compile
-path merges these into the default `CompilerConfig` before compilation.
+Add `config_overrides: Vec<(String, String)>` to `TestFile`. The compile path
+merges these into **`CompilerConfig::default()`** and passes the result into
+**each** backend’s options struct when building engines in the filetest
+harness.
 
 ### Filetest harness flow
 
 ```
-parse_annotation_line
-    │  @config(key, value) → AnnotationKind::Config { key, value }
+parse_compile_opt_line (or shared trim → try compile-opt first)
+    │  // compile-opt(key, value) → push onto TestFile.config_overrides
     ▼
 TestFile { config_overrides: Vec<(key, value)> }
     │
-    ▼  (in compile_glsl)
+    ▼  (in compile_glsl, for every backend)
 CompilerConfig::default()
-    │  .apply(key, value) for each override
+    │  .apply(key, value) for each override (duplicate keys rejected earlier)
     ▼
-NativeCompileOptions { config, float_mode, .. }
+CompileOptions { config, float_mode, .. }           // Jit / Rv32 Cranelift
+NativeCompileOptions { config, float_mode, .. }     // Rv32 native
+WasmOptions { config, float_mode }                  // wasm
     │
     ▼
 compile_module(ir, sig, options)
@@ -181,45 +199,56 @@ compile_module(ir, sig, options)
 Once the inliner is wired in (M4):
 
 **Call-semantics tests** (keep real calls):
+
 ```glsl
-// @config(inline.mode, never)
+// compile-opt(inline.mode, never)
 ```
+
 - `filetests/function/call-simple.glsl`
 - `filetests/function/call-multiple.glsl`
 - `filetests/function/call-order.glsl`
 - `filetests/function/call-return-value.glsl`
 
 **Inliner correctness tests** (always inline, heuristic-proof):
+
 ```glsl
-// @config(inline.mode, always)
+// compile-opt(inline.mode, always)
 ```
+
 - New tests added in M4 specifically for inliner validation.
 
-**Everything else:** No annotation. Uses defaults (`Auto`).
+**Everything else:** No directive. Uses defaults (`Auto`).
 
 ## Changes by file
 
 | File | Change |
 |------|--------|
-| `lpir/src/compiler_config.rs` (new) | `CompilerConfig`, `InlineConfig`, `InlineMode`, `ConfigError`, `apply()` method. `InlineMode` impls `FromStr`. All `no_std`. |
+| `lpir/src/compiler_config.rs` (new) | `CompilerConfig`, `InlineConfig`, `InlineMode`, `ConfigError`, `apply()`. `InlineMode` impls `FromStr`. All `no_std`. |
 | `lpir/src/lib.rs` | `pub mod compiler_config;` + re-exports |
-| `lpvm-native/src/native_options.rs` | Add `config: CompilerConfig` field to `NativeCompileOptions` |
-| `lpvm-native/src/compile.rs` | Pass config to inline pass (M4). Guard const_fold/imm_fold behind config checks if configs are added for them. |
-| `lps-filetests/src/parse/parse_annotation.rs` | Add `Config` annotation kind, parse `@config(key, value)` |
-| `lps-filetests/src/parse/mod.rs` | Collect config annotations into `TestFile`, check for duplicate keys |
+| `lpvm-native/src/native_options.rs` | Add `config: CompilerConfig` |
+| `lpvm-cranelift/src/compile_options.rs` | Add `config: CompilerConfig` (may drop `Copy` on `CompileOptions`) |
+| `lpvm-wasm/src/options.rs` | Add `config: CompilerConfig` (may drop `Copy` on `WasmOptions`) |
+| `lpvm-native/src/compile.rs` | Pass `config` to inline pass (M4). Optional: const_fold behind config later. |
+| `lpvm-cranelift` / `lpvm-wasm` compile paths | Thread `config` through to wherever LPIR passes run (same as native when added) |
+| `lps-filetests/src/parse/parse_compile_opt.rs` (new) or inline in `mod.rs` | Parse `// compile-opt(key, value)`; validate duplicate keys in `parse_test_file` |
+| `lps-filetests/src/parse/mod.rs` | Recognize `compile-opt` before `@` annotations; collect into `TestFile` |
 | `lps-filetests/src/parse/test_type.rs` | Add `config_overrides: Vec<(String, String)>` to `TestFile` |
-| `lps-filetests/src/test_run/filetest_lpvm.rs` | Build `CompilerConfig` from overrides, thread into compile options |
-| `lps-filetests/src/targets/mod.rs` | Add `Config` to `AnnotationKind` |
+| `lps-filetests/src/test_run/filetest_lpvm.rs` | Build `CompilerConfig`, thread into **all** `CompileOptions` / native / WASM builds |
+
+Do **not** add `compile-opt` to `AnnotationKind` — keep `// @…` for
+per-target / per-run annotations only.
 
 ## Validation
 
 ```bash
 cargo test -p lpir
 cargo test -p lpvm-native
+cargo test -p lpvm-cranelift
+cargo test -p lpvm-wasm
 cargo test -p lps-filetests -- --test-threads=4
 cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf \
     --profile release-esp32 --features esp32c6,server
 ```
 
-All existing filetests pass — no behavioral change since no files have
-`@config` annotations yet, and the inliner isn't wired in until M4.
+All existing filetests pass — no behavioral change until files use
+`compile-opt` and the inliner is wired in (M4).
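Reviewer sketch for the "Parsing (`compile-opt`)" section of the M1 doc above (not part of the patch): one possible shape for the dedicated line parser plus the duplicate-key check. The name `parse_compile_opt_line` comes from the doc, but its signature here, the helper `collect_overrides`, and the error strings are assumptions. Validation of the key namespace is left to `CompilerConfig::apply`, as the doc describes.

```rust
/// Parse a `// compile-opt(key, value)` line.
///
/// Returns `None` for lines that are not `compile-opt` directives, and
/// `Some(Err(..))` for directives that are present but malformed.
pub fn parse_compile_opt_line(line: &str) -> Option<Result<(String, String), String>> {
    let rest = line.trim().strip_prefix("//")?.trim_start();
    let rest = rest.strip_prefix("compile-opt")?.trim_start();
    let rest = rest.strip_prefix('(')?;
    let inner = match rest.trim_end().strip_suffix(')') {
        Some(inner) => inner,
        None => return Some(Err(format!("missing ')' in directive: {line}"))),
    };
    // Split on the first comma only; key and value are trimmed.
    let mut parts = inner.splitn(2, ',');
    let key = parts.next().unwrap_or("").trim();
    let value = parts.next().unwrap_or("").trim();
    if key.is_empty() || value.is_empty() {
        return Some(Err(format!("compile-opt needs a key and a value: {line}")));
    }
    Some(Ok((key.to_string(), value.to_string())))
}

/// Collect all directives in a file into `config_overrides`, rejecting
/// duplicate keys before anything is applied to the config.
pub fn collect_overrides(source: &str) -> Result<Vec<(String, String)>, String> {
    let mut overrides: Vec<(String, String)> = Vec::new();
    for line in source.lines() {
        if let Some(parsed) = parse_compile_opt_line(line) {
            let (key, value) = parsed?;
            if overrides.iter().any(|(seen, _)| *seen == key) {
                return Err(format!("duplicate key '{key}'"));
            }
            overrides.push((key, value));
        }
    }
    Ok(overrides)
}
```

In this sketch a harness would feed each `(key, value)` pair from `collect_overrides` into `CompilerConfig::apply`. Unknown keys such as `inlien.mode` are not caught here and would surface as errors at that step.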
diff --git a/docs/roadmaps/2026-04-15-lpir-inliner/m2.5-continuing-marker.md b/docs/roadmaps/2026-04-15-lpir-inliner/m2.5-continuing-marker.md
new file mode 100644
index 000000000..cef31da3b
--- /dev/null
+++ b/docs/roadmaps/2026-04-15-lpir-inliner/m2.5-continuing-marker.md
@@ -0,0 +1,138 @@
+# M2.5 — `Continuing` marker op
+
+Tiny structural-symmetry milestone. Lands before M3 implementation.
+
+## Why
+
+`LoopStart` carries `continuing_offset: u32`, a cached pointer into the body
+that says "the continuing block starts here". `IfStart`'s analogue
+(`else_offset`) has a partner marker op (`Else`) — you can find the else
+position by scanning for the marker. `LoopStart` has no such marker; the
+continuing position is **only** discoverable from the cached offset.
+
+That asymmetry hurts the inliner (M3): when splicing changes body indices,
+`else_offset` and `end_offset` can be recomputed from structure (find
+matching `Else` / `End`), but `continuing_offset` cannot. The inliner would
+either need ugly position-tracking bookkeeping or every loop in the
+inlined IR has stale offsets.
+
+Adding `LpirOp::Continuing` as a marker op (no fields) closes the gap.
+Backends and interpreter keep using the cached offset — zero perf change —
+but the offset is now structurally derivable for any pass that mutates the
+body.
+
+## Design
+
+### New op
+
+```rust
+pub enum LpirOp {
+    // ...
+    /// Marker for the start of the continuing block of the enclosing
+    /// `LoopStart`. Position cached in `LoopStart::continuing_offset` for
+    /// fast backend access; recomputable structurally by scanning the body
+    /// for this op.
+    Continuing,
+    // ...
+}
+```
+
+`def_vreg()` returns `None`. No new fields.
+
+### Cache stays
+
+`LoopStart::continuing_offset` is **kept**. Backends and the interpreter
+keep using it. The marker is purely for structural recompute — it lets
+mutating passes (today: the inliner; tomorrow: any other transform that
+reshapes the body) rebuild the cache without bookkeeping.
+
+### Backend / interpreter touch
+
+All consumers add a one-arm handler for `LpirOp::Continuing` that does
+nothing semantically (just advances iteration):
+
+- `lpir/src/interp.rs`: `LpirOp::Continuing => { pc += 1; }`
+- `lpvm-native/src/lower.rs`: continuing-block range slicing already
+  starts at `continuing_offset`, which now points *at* the marker; the
+  range lowerer treats it as a no-op (skip).
+- `lpvm-wasm/src/emit/ops.rs`: no-op match arm.
+- `lpvm-cranelift/src/emit/control.rs`: no-op match arm.
+
+The existing `End`-handling logic in each backend that watches for
+`pc == continuing_offset` is **unchanged**.
+
+### Builder / parser / printer
+
+- `builder.rs::push_continuing()`: pushes `LpirOp::Continuing` *and*
+  patches `continuing_offset` on the open `LoopStart` (same as today, plus
+  one line for the op).
+- `parse.rs`: existing `continuing:` text token now constructs the marker.
+  No grammar change.
+- `print.rs`: emits `continuing:` for `LpirOp::Continuing`. Drop the
+  current "did the offset move from `start_pc + 1`?" detection — just
+  print the marker wherever it appears.
+
+### Validator
+
+- `Continuing` must appear inside a `LoopStart`/`End` pair, not nested
+  inside another control construct of that loop. Easy stack-based check.
+- `LoopStart::continuing_offset` must point at a `Continuing` op (or at
+  `start_pc + 1` if no `Continuing` is present, matching today's
+  default-on-missing semantics — the validator can permit either).
+- All exhaustive matches in `validate.rs` get the `Continuing` arm.
+
+### `const_fold`
+
+Add `Continuing` to the conservative-clear arm (treat like other markers).
+
+## Files touched (estimate ~10)
+
+| File | Change |
+|------|--------|
+| `lpir/src/lpir_op.rs` | Add `Continuing` variant; update `def_vreg`. |
+| `lpir/src/builder.rs` | `push_continuing()` emits the op. |
+| `lpir/src/parse.rs` | Existing `continuing:` token → emit marker. |
+| `lpir/src/print.rs` | Print `continuing:` for the marker; remove offset-detection. |
+| `lpir/src/validate.rs` | New variant in all matches; nesting check. |
+| `lpir/src/interp.rs` | One-line `Continuing => pc += 1`. |
+| `lpir/src/const_fold.rs` | Add to conservative-clear arm. |
+| `lpvm-native/src/lower.rs` | Handle `Continuing` (no-op in body lowering). |
+| `lpvm-wasm/src/emit/ops.rs` | Handle `Continuing` (no-op). |
+| `lpvm-cranelift/src/emit/control.rs` | Handle `Continuing` (no-op). |
+
+## Tests
+
+Use the same `tests/all_ops_roundtrip.rs` pattern as M2:
+
+- Add a loop with explicit `continuing:` body to the round-trip set.
+- One unit test asserting `continuing_offset` matches the position of the
+  `Continuing` op in the body.
+- All existing loop tests still pass (no behavioral change).
+
+## Validation
+
+```bash
+cargo test -p lpir
+cargo test -p lpvm-native
+cargo test -p lpvm-wasm
+cargo test -p lpvm-cranelift
+cargo test -p lps-filetests -- --test-threads=4
+```
+
+No test should change behavior — `continuing_offset` is still set the same
+way at parse / build time. The marker is purely additive.
+
+## Why now (not in M3)
+
+- Keeps M3's diff focused on the inliner itself.
+- Lets the inliner's recompute pass be fully structural with no special
+  cases — a clear, simple algorithm.
+- This is a small, mechanical, easily-reviewable change. Bundling it into
+  M3 would add ~10 unrelated files to that diff.
+
+## Out of scope
+
+- Removing `continuing_offset` field (and likewise `else_offset` /
+  `end_offset`). That's the long-term cleanup tracked in
+  [future-work.md](future-work.md) — separate, much bigger refactor
+  across all backends. Land well after M5.
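Reviewer sketch for the M2.5 doc above (not part of the patch): how a mutating pass could rebuild `continuing_offset` purely from structure once the `Continuing` marker exists. The `Op` enum is a hypothetical, cut-down stand-in for `LpirOp`, and the sketch assumes that `continuing_offset` is an absolute index into the body and that a loop without a marker falls back to `start_pc + 1`, as the validator section suggests.

```rust
/// Hypothetical, cut-down stand-in for `LpirOp`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    LoopStart { continuing_offset: u32 },
    IfStart,
    Continuing,
    End,
    Other,
}

/// Rebuild every `LoopStart::continuing_offset` by scanning for the
/// `Continuing` marker that belongs directly to that loop.
pub fn recompute_continuing_offsets(body: &mut [Op]) -> Result<(), String> {
    // One entry per open control construct.
    enum Open {
        Loop { start: usize, continuing: Option<usize> },
        Other,
    }
    let mut stack: Vec<Open> = Vec::new();

    for pc in 0..body.len() {
        let op = body[pc];
        match op {
            Op::LoopStart { .. } => stack.push(Open::Loop { start: pc, continuing: None }),
            Op::IfStart => stack.push(Open::Other),
            Op::Continuing => match stack.last_mut() {
                // The marker must sit directly inside a loop, at most once.
                Some(Open::Loop { continuing, .. }) => {
                    if continuing.is_some() {
                        return Err(format!("second Continuing at {pc}"));
                    }
                    *continuing = Some(pc);
                }
                _ => return Err(format!("misplaced Continuing at {pc}")),
            },
            Op::End => match stack.pop() {
                Some(Open::Loop { start, continuing }) => {
                    // Assumed default when the marker is absent: start_pc + 1.
                    let target = continuing.unwrap_or(start + 1);
                    body[start] = Op::LoopStart { continuing_offset: target as u32 };
                }
                Some(Open::Other) => {}
                None => return Err(format!("unmatched End at {pc}")),
            },
            Op::Other => {}
        }
    }
    if stack.is_empty() {
        Ok(())
    } else {
        Err("unclosed control construct".to_string())
    }
}
```

The same stack walk doubles as the nesting check described under "Validator": a `Continuing` nested inside an `IfStart` of the loop is rejected because the innermost open construct is not the loop itself.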
diff --git a/docs/roadmaps/2026-04-15-lpir-inliner/m3.1-tune-inline-weights.md b/docs/roadmaps/2026-04-15-lpir-inliner/m3.1-tune-inline-weights.md
new file mode 100644
index 000000000..6256c6325
--- /dev/null
+++ b/docs/roadmaps/2026-04-15-lpir-inliner/m3.1-tune-inline-weights.md
@@ -0,0 +1,99 @@
+# M3.1 — Tune `func_weight` empirically
+
+Tiny follow-up to M3. Runs independently of M4 (uses the un-inlined per-function
+output of `lp-cli shader-debug`).
+
+## Why
+
+M3 lands with the simplest possible size metric:
+
+```rust
+fn func_weight(f: &IrFunction) -> u32 { f.body.len() as u32 }
+```
+
+…wired through every consumer of `InlineConfig::small_func_threshold` /
+`max_growth_budget` / `module_op_budget`. The `20`-op default for
+`small_func_threshold` is a guess. We don't want to ship a guess as the
+production threshold.
+
+## What
+
+Build a tiny benchmark corpus and a one-shot script that prints
+
+```
+function          lpir_ops   weighted_ops   rv32n_insns
+paletteHeatmap    14         14             52
+paletteRainbow    27         27             88
+applyPalette      19         19             71
+…
+```
+
+then pick the weighting that best correlates with `rv32n_insns` for the
+shapes of code we actually compile. Re-tune `small_func_threshold` so a
+"small" function lines up with whatever rv32n size we want to always inline
+(say ≤ 64 instructions).
+
+## Steps
+
+1. Add `lp-shader/lps-filetests/filetests/debug/inline-weights.glsl` (or a
+   small set of files in `filetests/debug/inline-weights/`) covering:
+   - tiny scalar helpers (`float lerp(float a, float b, float t)`),
+   - vec3 arithmetic helpers (`vec3 mul3(vec3 v, float s)`),
+   - branchy helpers (`applyPalette`-style if-chains),
+   - helpers that call builtins (`sqrt`, `mix`, `clamp`, `cos`),
+   - one larger helper (~50 LPIR ops) for the upper end.
+2. For each function, run:
+   ```bash
+   cargo run -p lp-cli -- shader-debug --lpir --asm \
+       lp-shader/lps-filetests/filetests/debug/inline-weights.glsl \
+       > /tmp/inline-weights.txt
+   ```
+   and tabulate `lpir_ops`, candidate `weighted_ops`, `rv32n_insns`.
+   (A small awk / Python one-liner over the output is fine — no need to
+   build a full harness.)
+3. Compare candidate weight functions:
+   - `body.len()` (current).
+   - markers-zero: structural ops (`Else`, `End`, `*Start`, `Block`,
+     `ExitBlock`, `Break`, `Continue`) weighted 0.
+   - heavy-bias: as above plus `Call` = 5, `Memcpy` = 4, `Fsqrt` = 4.
+4. Pick the simplest one that correlates well, replace the body of
+   `func_weight` in `lpir/src/inline/heuristic.rs` (or wherever it lands),
+   and re-tune `InlineConfig::small_func_threshold` accordingly.
+5. Drop the corpus into `filetests/debug/` so it stays as a regression
+   reference for future tuning.
+
+## Validation
+
+```bash
+cargo test -p lpir
+cargo test -p lps-filetests -- --test-threads=4
+```
+
+Behavior should be unchanged for files without local function calls.
+Files that do change should improve (fewer rv32n instructions per
+inlined call site or no change).
+
+## Out of scope
+
+- Wiring the inliner into `lpvm-native::compile_module` — that's M4.
+- Comparing inlined-vs-not perf — that's M4 step 3.
+
+## Outcome (2026-04-17)
+
+We measured `body_len` (raw LPIR op count), markers-zero (`mz`), and heavy-bias (`hb`) weights against `rv32n_insns` on the new `inline-weights.glsl` corpus plus representative functions from `rainbow.glsl`. `body.len()` stayed the strongest linear correlate on the combined set while staying the simplest implementation, so production `func_weight` remains `func.body.len()`. The default `small_func_threshold` was lowered from 20 to 16: every corpus function with `body_len` ≤ 16 lowered to at most 51 rv32n instructions, while the next step up (`iw_fold_rgb` at body 18) jumps to 85 — giving the cleanest cut under the informal “always inline ≤ 64 rv32n insns” target.
+
+| function (corpus) | body_len | rv32n_insns |
+| --- | ---: | ---: |
+| iw_step01 | 11 | 19 |
+| iw_clamp01 | 7 | 25 |
+| iw_lerp | 10 | 33 |
+| iw_add3 | 16 | 51 |
+| iw_fold_rgb | 18 | 85 |
+| paletteFire | 22 | 104 |
+| rainbow_main | 154 | 541 |
+
+**Chosen:** `func_weight` = `func.body.len()`; **`small_func_threshold` = 16** (production default in `InlineConfig`).
+
+**Pearson r** (combined corpus, vs `rv32n_insns`): body_len **0.980**, markers-zero **0.974**, heavy-bias **0.962**.
+
+Three weight candidates remain available as `lpir::inline_weights::{weight_body_len, weight_markers_zero, weight_heavy_bias}` plus the `lp-cli shader-debug --weights` flag for future re-tuning.
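Reviewer sketch for the M3.1 outcome above (not part of the patch): a small Pearson correlation helper of the kind the comparison relies on, applied to the seven rows of the outcome table. `pearson_r` and `CORPUS` are hypothetical names and this is not the script that produced the reported figures. The reported values were computed on the combined corpus, so the seven-row subset is not expected to reproduce them exactly.

```rust
/// (function, body_len, rv32n_insns) rows copied from the outcome table.
pub const CORPUS: [(&str, f64, f64); 7] = [
    ("iw_step01", 11.0, 19.0),
    ("iw_clamp01", 7.0, 25.0),
    ("iw_lerp", 10.0, 33.0),
    ("iw_add3", 16.0, 51.0),
    ("iw_fold_rgb", 18.0, 85.0),
    ("paletteFire", 22.0, 104.0),
    ("rainbow_main", 154.0, 541.0),
];

/// Pearson correlation coefficient of two equally long samples.
/// Returns `None` when the inputs are too short, differ in length, or
/// either sample has zero variance.
pub fn pearson_r(xs: &[f64], ys: &[f64]) -> Option<f64> {
    if xs.len() != ys.len() || xs.len() < 2 {
        return None;
    }
    let n = xs.len() as f64;
    let mean_x = xs.iter().sum::<f64>() / n;
    let mean_y = ys.iter().sum::<f64>() / n;

    // Accumulate the covariance and both variances (all unnormalized).
    let (mut sxy, mut sxx, mut syy) = (0.0, 0.0, 0.0);
    for (x, y) in xs.iter().zip(ys) {
        let (dx, dy) = (x - mean_x, y - mean_y);
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    if sxx == 0.0 || syy == 0.0 {
        return None;
    }
    Some(sxy / (sxx * syy).sqrt())
}
```

To compare weight candidates, one would compute a weight column per candidate for the same functions and call `pearson_r` once per column against the `rv32n_insns` column. Note that `rainbow_main` is far larger than the other rows and dominates the coefficient on a sample this small.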
diff --git a/docs/roadmaps/2026-04-15-lpir-inliner/m4-wire-and-validate.md b/docs/roadmaps/2026-04-15-lpir-inliner/m4-wire-and-validate.md
index f71f0ba8d..737ed1133 100644
--- a/docs/roadmaps/2026-04-15-lpir-inliner/m4-wire-and-validate.md
+++ b/docs/roadmaps/2026-04-15-lpir-inliner/m4-wire-and-validate.md
@@ -1,168 +1,391 @@
 # M4 — Wire Inliner + Full Validation
 
-Connect the inlining pass to the native compilation pipeline, tag filetests
-with disable annotations where needed, and run the full suite.
-
-## Wire into `lpvm-native`
-
-### `compile.rs` changes
-
-The inlining pass runs on the **module** before per-function compilation
-(unlike const_fold and imm_fold which run per-function). Add it to
-`compile_module`:
-
-```rust
-pub fn compile_module(
-    ir: &LpirModule,
-    sig: &lps_shared::LpsModuleSig,
-    float_mode: FloatMode,
-    options: NativeCompileOptions,
-) -> Result<CompiledModule, NativeError> {
-    let mut ir_opt = ir.clone();
-    let inline_result = lpir::inline::inline_module(
-        &mut ir_opt,
-        &options.config.inline,
-    );
-    if inline_result.call_sites_replaced > 0 {
-        log::debug!(
-            "[native-fa] inline: {} calls replaced across {} functions",
-            inline_result.call_sites_replaced,
-            inline_result.functions_inlined,
-        );
-    }
-
-    let module_abi = ModuleAbi::from_ir_and_sig(&ir_opt, sig);
-    let mut session = CompileSession::new(module_abi, float_mode, options);
-
-    // ... compile each function in ir_opt.functions ...
-}
-```
-
-### Signature handling
-
-When functions are deleted from the module, the `LpsModuleSig` still has
-entries for them. Two options:
-
-A. Filter `sig.functions` to only include functions still present in the
-   inlined module. Match by name.
-B. Have `inline_module` return a list of deleted function names so the
-   caller can filter.
-
-Option A is simplest and sufficient.
-
-### Per-function passes
-
-After inlining, each function's body may be larger (inlined code). The
-existing per-function passes (const_fold, imm_fold) run on the inlined
-bodies — this is desirable since inlining exposes new constant folding
-opportunities (e.g. `paletteHeatmap(0.0)` — the constant `0.0` flows
-into the inlined body).
-
-Pipeline order:
+Connect the M3 inlining pass into all three backend compile pipelines, give
+operators an A/B switch (CLI + filetest harness) so the suite itself becomes
+a perf signal, add a small set of inliner-specific filetests, run the full
+suite under both configurations, and document the result.
+
+## Decisions (Q1–Q5)
+
+See conversation transcript `5a8829f9-bf7c-4f6e-9340-7e4b3be3626c` for the full
+discussion. Summary:
+
+- **Q1 — Wire scope.** Wire `inline_module` into all three backends
+  (`lpvm-native`, `lpvm-cranelift`, `lpvm-wasm`). Native is the prime path;
+  cranelift is the correctness/perf reference; wasm is the editor preview path.
+  We want one consistent LPIR-side optimization story across all three so
+  cross-backend correctness comparison is meaningful and the editor preview
+  matches device behavior. Note: the upcoming unified `lps-shader` crate (see
+  `impl-notes.md`) will absorb this duplication.
+
+- **Q2 — Filetest tagging.** Surgical: tag only the files that exist
+  specifically to exercise call/return mechanics. ~54 files total. Insert
+  `// compile-opt(inline.mode, never)` as line 1.
+
+- **Q3 — Perf A/B.** Add `--compiler-opt key=value` to `lp-cli shader-debug`
+  for single-file inspection, and `--force-opt key=value` to the filetest
+  harness for whole-suite A/B (with env-var fallback `LPS_FILETEST_FORCE_OPT`
+  and a `scripts/glsl-filetests.sh --force-opt` passthrough). Force semantics:
+  the flag/env wins over per-file `compile-opt(...)` directives. Move
+  `debug/rainbow.glsl` → `examples/rainbow.glsl`. Defer Target × OptProfile
+  axis to `future-work.md`.
+
+- **Q4 — Firmware code-size.** Ship + measure with abort threshold. Land M4
+  with `InlineMode::Auto` everywhere (firmware too). Measure
+  `lpir_ops` and `rv32n_insns` growth on `examples/`. If median growth
+  exceeds **25%**, add a one-liner override in
+  `lp-core/lp-engine/src/gfx/native_jit.rs::NativeJitGraphics::new` to set
+  `config.inline.mode = InlineMode::Never` until M5 lands DCE.
+
+- **Q5 — New filetests.** Minimal 4-file set in `filetests/optimizer/inline/`
+  for inliner-specific behaviors. The ~700 untagged filetests running under
+  default Auto are the bulk of the correctness coverage.
+
+## Phase plan
+
+Sized for `composer-2` sub-agents. Phases are listed in dependency order;
+phases 3, 4, 5a can run in parallel after phase 2.
+
+### Phase 1 — Surgical filetest tagging (Q2)
+
+Mechanical pass: insert `// compile-opt(inline.mode, never)` as line 1 of
+each listed file. The directive is parsed today (M3 landed `compiler_config`)
+but is a no-op until phase 2 wires the inliner, so this phase is safe to
+land standalone.
+
+Files to tag (54):
+- `lp-shader/lps-filetests/filetests/function/call-*.glsl` (5)
+- `lp-shader/lps-filetests/filetests/function/param-*.glsl` (10)
+- `lp-shader/lps-filetests/filetests/function/return-*.glsl` (13)
+- `lp-shader/lps-filetests/filetests/function/edge-*.glsl` (8 — all runtime-semantic)
+- `lp-shader/lps-filetests/filetests/function/forward-declare.glsl`
+- `lp-shader/lps-filetests/filetests/function/declare-prototype.glsl`
+- `lp-shader/lps-filetests/filetests/lpvm/native/native-call-*.glsl` (7)
+- `lp-shader/lps-filetests/filetests/lpvm/native/perf/*.glsl` (9)
+
+Skip:
+- `function/scope-*` (scope semantics, unrelated to call mechanics)
+- `function/define-simple` (definition only, no call)
+- `function/recursive-static-error` (static error path)
+- `function/overload-*` (overload resolution, unrelated)
+
+Acceptance:
+- `cargo test -p lps-filetests` passes unchanged (directives are parsed but
+  inert pre-phase-2).
+- `git diff --stat` shows 54 files each with 1–2 lines added at top.
+
+### Phase 2 — Wire `inline_module` into all three backends
+
+2a. Fix `lp-shader/lpvm-native/src/rt_jit/compiler.rs::compile_module_jit` to
+    thread `NativeCompileOptions.config` through instead of discarding it.
+    Currently it builds a default `CompilerConfig` regardless of input.
+
+2b. Add `lpir::inline_module(&mut module, &config.inline)` at the top of LPIR
+    processing in each backend's compile entry. Recommended location: just
+    before per-function lowering, after parsing/validation, before
+    `const_fold` and other per-function passes.
+    - `lp-shader/lpvm-native/src/compile.rs::compile_module`
+    - `lp-shader/lpvm-cranelift/src/...` (entry: `LpvmEngine::compile`)
+    - `lp-shader/lpvm-wasm/src/...` (entry: `LpvmEngine::compile`)
+
+2c. Filter `LpsModuleSig` entries to match the post-inline function set if
+    a backend uses sig entries to drive function compilation. Match by name.
+    (Inliner doesn't delete functions today, so this is a no-op until M5,
+    but the plumbing should exist.)
+
+Pipeline order per backend (after phase 2):
 1. `inline_module` (module-level)
-2. For each function:
-   a. `const_fold` (LPIR)
-   b. `lower_ops` (LPIR → VInst)
-   c. `fold_immediates` (VInst)
-   d. `emit` (VInst → machine code)
-
-## Filetest annotations
-
-### Files to tag with `// @config(inline.mode, never)`
-
-These tests exist specifically to validate call/return mechanics:
-
-```
-filetests/function/call-simple.glsl
-filetests/function/call-multiple.glsl
-filetests/function/call-order.glsl
-filetests/function/call-return-value.glsl
-```
-
-Review all files under `filetests/function/` and tag any that test call
-semantics specifically. Files that test parameter passing (param-in,
-param-out, param-inout) should also keep real calls since inlining would
-eliminate the parameter passing path being tested.
-
-### Files to tag with `// @config(inline.mode, always)`
-
-New inliner correctness tests added in this milestone. Forces inlining
-regardless of heuristic, so tests don't break when thresholds change.
-
-### No annotation needed
-
-Most filetests (arithmetic, control flow, builtins, etc.) should work
-identically with or without inlining. The inliner only affects files that
-define helper functions, and even then the results should be numerically
-identical.
-
-## Validation plan
-
-### Step 1: Correctness
+2. For each function: `const_fold` → backend-specific lowering → emit.
+
+Logging: `inline_module` already emits `log::debug!` decisions. Each backend
+should emit a single `log::info!` summary line with
+`inline_result.call_sites_replaced` and `inline_result.functions_inlined`
+when non-zero, prefixed with the backend name (`[native-fa]`,
+`[cranelift]`, `[wasm]`).
+
+Acceptance:
+- `cargo build --workspace` succeeds.
+- `cargo test -p lps-filetests` passes for all three backends. Some tests
+  may now exercise the inliner end-to-end; if any fail, that's a real
+  inliner bug to triage (don't paper over with `compile-opt(inline.mode,
+  never)`).
+
+### Phase 3 — `lp-cli shader-debug --compiler-opt`
+
+Add a repeatable `--compiler-opt key=value` flag to `lp-cli shader-debug`
+that builds `CompilerConfig` from defaults and applies each `key=value` via
+the existing `CompilerConfig::apply(&str, &str)` API.
+
+Files:
+- `lp-cli/src/commands/shader_debug/args.rs` — add the flag.
+- `lp-cli/src/commands/shader_debug/handler.rs` — apply overrides when
+  building `CompilerConfig`.
+
+Acceptance:
+- `lp-cli shader-debug --compiler-opt inline.mode=never <file>` runs and
+  shows fewer/no inlines in the LPIR dump.
+- `lp-cli shader-debug --compiler-opt inline.mode=never --compiler-opt
+  inline.small_func_threshold=8 <file>` parses both correctly.
+- Invalid keys return a clear error (delegates to `CompilerConfig::apply`).
+
+### Phase 4 — Filetest harness `--force-opt`
+
+Add the suite-level A/B switch with three equivalent surfaces. Force semantics:
+flag/env wins over per-file `compile-opt(...)` directives.
+
+4a. CLI flag on `lps-filetests-app`:
+    - `lp-shader/lps-filetests-app/src/main.rs` — add `--force-opt
+      key=value` (repeatable) to `TestOptions`. Pass-through to
+      `lps_filetests::run`.
+    - `lp-shader/lps-filetests/src/lib.rs` — extend `run` signature.
+    - `lp-shader/lps-filetests/src/test_run/compile.rs::build_compiler_config`
+      — apply force-overrides AFTER per-file directives so they win.
+
+4b. Env var fallback:
+    - `LPS_FILETEST_FORCE_OPT="key1=value1,key2=value2"` (comma-separated).
+    - Read in `main.rs` if `--force-opt` not provided; merge if both present
+      (CLI flag wins on conflict).
+
+4c. Wrapper script:
+    - `scripts/glsl-filetests.sh` — add `--force-opt KEY=VALUE` (repeatable)
+      that translates to env var `LPS_FILETEST_FORCE_OPT`. Update help text.
+
+Acceptance:
+- `scripts/glsl-filetests.sh --force-opt inline.mode=never function/` runs
+  the function test corpus with inlining forced off, overriding the phase-1
+  surgical tags.
+- `LPS_FILETEST_FORCE_OPT="inline.mode=never" cargo test -p lps-filetests`
+  produces the same effect as the CLI flag.
+- The output table (the `pass / fail / unimpl / unsupported / compile-fail
+  / total inst` summary) renders identically; only the `total inst` numbers
+  shift between runs.
+
+### Phase 5 — Move rainbow + add inliner filetests
+
+5a. `git mv lp-shader/lps-filetests/filetests/debug/rainbow.glsl
+    lp-shader/lps-filetests/filetests/examples/rainbow.glsl`. Verify
+    filetest discovery still finds it (the harness recurses into all
+    subdirs of `filetests/`). Update any references in docs (grep for
+    `debug/rainbow`).
+
+5b. Add 4 inliner-specific filetests under
+    `lp-shader/lps-filetests/filetests/optimizer/inline/`:
+    - `inline-mode-flag.glsl` — same shader, three `// run:` blocks under
+      `compile-opt(inline.mode, auto)`, `always`, `never`. All three must
+      produce the same output. Tests mode-flag plumbing end-to-end.
+    - `inline-recursion.glsl` — `factorial(n)` or `fib(n)`. Must produce
+      correct output regardless of inline policy. If a self-recursive call
+      is wrongly inlined, the inliner will panic or hang.
+    - `inline-many-small.glsl` — module with ~10 small interdependent
+      helpers chained together. Stresses the call-graph topo-order +
+      orchestration loop.
+    - `inline-control-flow.glsl` — single callee with nested
+      `if`/`for`/`break`/`continue`. Stresses param/vreg remap and offset
+      recompute under realistic control flow.
+
+Acceptance:
+- All 4 new tests pass on all three default targets (`rv32n.q32`,
+  `rv32c.q32`, `wasm.q32`).
+- `inline-mode-flag.glsl` produces identical output across the three runs.
+
+### Phase 6 — Measurement, write-up, conditional firmware override
+
+6a. Run the full filetest suite twice and capture the summary table:
+    - `scripts/glsl-filetests.sh --summary` (default — Auto)
+    - `scripts/glsl-filetests.sh --summary --force-opt inline.mode=never`
+
+6b. Run the `examples/` corpus twice and capture per-file `lpir_ops` and
+    `rv32n_insns` from `lp-cli shader-debug`:
+    - `lp-cli shader-debug examples/rainbow.glsl`
+    - `lp-cli shader-debug --compiler-opt inline.mode=never examples/rainbow.glsl`
+
+6c. Append an `## Outcome (YYYY-MM-DD)` section to this doc with:
+    - Both summary tables (default vs `inline.mode=never`).
+    - Per-file `examples/` numbers and computed % growth in `rv32n_insns`.
+    - Decision: shipped as-is OR triggered the firmware override.
+
+6d. Conditional firmware override: if median growth on `examples/` exceeds
+    25% in `rv32n_insns`, add a one-liner in
+    `lp-core/lp-engine/src/gfx/native_jit.rs::NativeJitGraphics::new`:
+
+    ```rust
+    let mut config = CompilerConfig::default();
+    config.inline.mode = InlineMode::Never; // TODO(M5): remove once dead-func elim lands
+    ```
+
+    and thread it into the `NativeCompileOptions`. Document the override
+    decision in the outcome section.
+
+6e. Update `docs/roadmaps/2026-04-15-lpir-inliner/future-work.md`:
+    - Add "CI optimization-profile sweeps (Target × OptProfile axis)".
+    - Add "Grow `examples/` corpus with more representative shaders".
+
+Acceptance:
+- Outcome section is filled in with real numbers.
+- `cargo build --workspace` and `cargo test -p lps-filetests` pass.
+- If override applied: firmware build succeeds and uses
+  `InlineMode::Never`.
+
+## Validation summary
+
+After all phases:
 
 ```bash
-# Full filetest suite — all targets
-cargo test -p lps-filetests -- --test-threads=4
-```
-
-Every test must pass. Any failure indicates a bug in the inliner (vreg
-remap, control flow offset, slot remap, etc.).
-
-### Step 2: Firmware builds
+# Correctness — full filetest suite, all three backends
+cargo test -p lps-filetests
 
-```bash
+# Firmware builds (esp32 + emu)
 cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf \
     --profile release-esp32 --features esp32c6,server
 cargo check -p fw-emu --target riscv32imac-unknown-none-elf \
     --profile release-emu
-```
-
-### Step 3: Performance comparison
-
-Run filetests with instruction counting and compare before/after:
 
-```bash
-# Before (disable inlining via @disable or env flag)
-# After (default — inlining on)
-```
-
-Key files to measure:
-- `debug/rainbow.glsl` — many helper calls, significant call overhead.
-- `function/call-*` (with `// @disable(inline)`) — baseline for call cost.
-- Any test with deep call chains.
-
-Expected: measurable instruction count reduction for files with helper
-functions. No change for files without calls (arithmetic, control flow).
-
-### Step 4: Host still works
-
-```bash
+# Host still works
 cargo check -p lp-server
 cargo test -p lp-server --no-run
+
+# Perf A/B
+scripts/glsl-filetests.sh --summary
+scripts/glsl-filetests.sh --summary --force-opt inline.mode=never
 ```
 
 ## Rollback
 
-If the inliner introduces correctness issues:
-- Set `InlineConfig { mode: Never, .. }` globally in `NativeCompileOptions`.
-- Individual tests can use `// @config(inline.mode, never)`.
-- No structural changes to the pipeline — removing the `inline_module`
-  call restores the previous behavior exactly.
+If the inliner introduces correctness issues post-merge:
+- Change `InlineConfig::default()` to return `mode: Never` to disable the
+  pass globally. Removing the `inline_module` calls is also possible but
+  not required, since `Never` mode short-circuits the pass.
+- Individual tests already have `compile-opt(inline.mode, never)` available.
+- The `--force-opt` flag lets operators disable the inliner without rebuilding.
 
 ## Note on dead function elimination
 
 The inliner does NOT delete functions. After inlining, helper functions
-still exist and get compiled (they just have zero local call sites).
-This is intentional — filetests need all functions to remain callable.
-
-Dead function elimination is a separate pass (M5) that runs in production
-with a known root set. It is not part of this milestone.
+still exist and get compiled (they just have zero local call sites). This is
+intentional — filetests need all functions to remain callable. Dead function
+elimination is M5 and runs in production with a known root set.
 
 ## Success criteria
 
-1. All filetests pass (4400+ pass, 0 fail).
-2. Firmware builds succeed.
-3. `debug/rainbow.glsl` shows measurable instruction reduction on `rv32n.q32`.
-4. Compile time may increase slightly (inlined functions are larger, and
-   originals are still compiled). DeadFuncElim (M5) addresses this for
-   production.
+1. Phase 2 passes the full filetest suite on all three backends with default
+   `InlineMode::Auto`.
+2. `--force-opt inline.mode=never` produces the pre-inliner numbers (sanity
+   check that the override truly bypasses the pass).
+3. `examples/rainbow.glsl` shows measurable `rv32n_insns` reduction with
+   inlining on, vs. with `--compiler-opt inline.mode=never`.
+4. Firmware builds succeed (with the override applied if measured growth
+   exceeds the 25% threshold).
+5. The 4 new tests in `filetests/optimizer/inline/` pass on all default
+   targets.
+
+## Outcome (2026-04-17)
+
+### What landed
+
+All six phases shipped. The inliner now runs by default
+(`InlineMode::Auto`) on all three LPIR backends — `lpvm-native`,
+`lpvm-cranelift` (both the in-process JIT path and the RV32 object-emitter
+used by the emulator), and `lpvm-wasm`. The full filetest suite (14,033
+tests across 701 files, 3 backends) passes with both default Auto and
+forced Never settings. Operators can A/B-compare via `--force-opt
+key=value` on the harness or `--compiler-opt key=value` on `lp-cli
+shader-debug`.
+
+### Phase 2 wiring discovery (post-handoff)
+
+The first wiring pass missed `lpvm-cranelift`'s `object_bytes_from_ir`
+entry, which is the path used by `Backend::Rv32` (the RV32 emulator
+backend). It only wired `build_jit_module` (the in-process Cranelift JIT
+path used by `Backend::Jit`). Symptom: `rv32c.q32` instruction counts were
+exactly identical pre/post wiring, while `rv32n.q32` showed the expected
+reduction. Fix: added the same clone-and-`inline_module` block at the top
+of `object_bytes_from_ir`. Both backends now show matched inliner activity.
+
+### Filetest suite A/B (full corpus, dynamic instruction count)
+
+| Target     | Default (Auto) | `inline.mode=never` | Δ (Auto − Never) | % change |
+| ---------- | -------------: | ------------------: | ---------------: | -------: |
+| `rv32c.q32` |    575,330 inst |          578,367 inst |        −3,037 inst |   −0.52% |
+| `rv32n.q32` |    595,922 inst |          598,950 inst |        −3,028 inst |   −0.51% |
+| `wasm.q32`  | (no inst count) |       (no inst count) |               n/a |      n/a |
+
+All 14,033 tests pass under both configurations. The ~0.5% suite-wide
+dynamic reduction is small because (a) 54 surgically-tagged files are
+fixed at `inline.mode=never` so they don't change, (b) most filetests are
+math/scalar/vec ops with no helper-function calls, and (c) the inliner's
+small-function threshold (16 ops) keeps it conservative — it fires only on
+the smallest helpers. The wins concentrate in the helper-call-heavy
+shaders.
+
+### Per-shader: `examples/rainbow.glsl`
+
+**Static code size** (LPIR ops + RV32 instructions per function, summed):
+
+| Metric        | Default (Auto) | `inline.mode=never` | Δ      | % change |
+| ------------- | -------------: | ------------------: | -----: | -------: |
+| LPIR ops      |            572 |                 548 |    +24 |   +4.4%  |
+| `rv32n` insns |          2,161 |               2,084 |    +77 |   +3.7%  |
+
+Inline log: `inlined=3 sites=3` — the three smallest helpers got pulled
+into `applyPalette` (whose body grew 42 → 66 LPIR ops, 148 → 225 rv32n
+insns). The five palette functions (`paletteHeatmap` etc.) are 22+ ops
+each, above the 16-op threshold, so they were not inlined. The original
+helpers also remain in the module (M5 will DCE them).
+
+**Dynamic instruction count** (7 test runs from the file, executed under
+the emulator):
+
+| Target     | Default (Auto) | `inline.mode=never` | Δ       |
+| ---------- | -------------: | ------------------: | ------: |
+| `rv32c.q32` |    24,420 inst |          24,402 inst |    +18 |
+| `rv32n.q32` |    24,594 inst |          24,582 inst |    +12 |
+
+Effectively neutral on rainbow. The per-call overhead saved by inlining
+is offset by the slightly larger inlined body executing each iteration.
+
+### Firmware code-size decision (Q4)
+
+**Threshold**: 25% median growth in `rv32n` static instructions on the
+`examples/` corpus.
+
+**Measured**: 3.7% growth on `examples/rainbow.glsl` (the only file in
+the corpus today).
+
+**Decision**: **Ship as-is** with `InlineMode::Auto` in firmware. No
+override applied. 3.7% is far below the 25% threshold, so the firmware
+flash budget impact is negligible. The neutral dynamic perf on rainbow
+means the inliner is not yet earning its keep on real-world content, but
+it is not regressing either, and once M5 lands DCE the static cost will
+drop to zero or go negative.
+
+### What this validates
+
+- The inliner pipeline is correctly wired across all three LPIR
+  backends.
+- The `--force-opt` / `--compiler-opt` A/B switch works end-to-end
+  (CLI, env var, wrapper script).
+- The four new inliner-specific tests (`filetests/optimizer/inline/`)
+  pass on all backends, including the deep-call-chain test for the
+  recursion guard.
+- M3.1's `small_func_threshold = 16` produces conservative behavior:
+  small wins, no surprises, no regressions. Tightening or loosening this
+  threshold is a future tuning lever.
+
+### What's blocked on M5 (DCE)
+
+The biggest available inlining win — eliminating helper functions that
+become dead after inlining — requires dead function elimination. Today
+inlining strictly grows code size because the originals stay. M5 lands
+next; revisit the firmware override decision then if static growth
+becomes a real concern on broader corpora.
+
+### Known follow-ups (added to `future-work.md`)
+
+- Grow the `examples/` corpus with more representative shaders so the
+  measurement above is more robust.
+- CI optimization-profile sweeps (Target × OptProfile axis) for
+  automated regression detection on the perf signal.
+- Investigate `function/call-order.glsl` — flips from `@unimplemented`
+  failure to passing under `--force-opt inline.mode=always`. Either a
+  real bug that inlining accidentally papers over, or an `@unimplemented`
+  annotation that's stale. Worth a quick triage.
diff --git a/docs/roadmaps/2026-04-15-lpir-inliner/m5-dead-func-elim.md b/docs/roadmaps/2026-04-15-lpir-inliner/m5-dead-func-elim.md
index 7ec35d1e3..640269e45 100644
--- a/docs/roadmaps/2026-04-15-lpir-inliner/m5-dead-func-elim.md
+++ b/docs/roadmaps/2026-04-15-lpir-inliner/m5-dead-func-elim.md
@@ -1,8 +1,7 @@
 # M5 — Dead Function Elimination
 
-Remove functions from the module that have zero remaining local call sites
-and aren't in the root set. Separate from inlining — the inliner (M3)
-never deletes functions.
+Remove local functions that aren't reachable from a caller-supplied root
+set. Separate from inlining (M3) — the inliner never deletes functions.
 
 ## Motivation
 
@@ -11,83 +10,135 @@ callers. The originals still exist in the module and still get compiled.
 In production (single entry point), these are pure waste — removing them
 saves compile time and code size.
 
-Filetests don't use this pass (every function is potentially callable by
-the test harness).
+Filetests may opt into the pass via `compile-opt(dead_func_elim.mode,
+auto)`. The harness looks up entries by name, so anything DFE removes
+that the test wants to call by name will fail with "symbol not found".
+Mark functions you need preserved with `is_entry`, or keep them
+reachable from an `is_entry` root.
 
 ## API
 
+`lp-shader/lpir/src/dead_func_elim.rs`:
+
 ```rust
 pub struct DeadFuncElimResult {
     pub functions_removed: usize,
 }
 
-/// Remove functions with zero local call sites that aren't in `roots`.
+/// Remove local functions not transitively reachable from any root.
 pub fn dead_func_elim(
     module: &mut LpirModule,
-    roots: &[usize],  // indices into module.functions
-) -> DeadFuncElimResult {
-    // ...
-}
+    roots: &[FuncId],
+) -> DeadFuncElimResult;
+
+/// Helper: every function with `is_entry == true`.
+pub fn roots_from_is_entry(module: &LpirModule) -> Vec<FuncId>;
+
+/// Helper: look up roots by name.
+pub fn roots_by_name(module: &LpirModule, names: &[&str]) -> Vec<FuncId>;
 ```
 
 `roots` identifies the externally callable functions. Everything else is
-a candidate for removal if it has zero remaining local call sites.
+a candidate for removal if not transitively reachable from a root via
+`CalleeRef::Local` Call ops.
 
-## Algorithm
+## Configuration
 
-1. **Count local call sites.** Walk every function body, count how many
-   `Call` ops target each local function.
+`CompilerConfig::dead_func_elim: DeadFuncElimConfig`:
 
-2. **Mark reachable.** Starting from roots, transitively mark any function
-   that has a non-zero call count. (After full inlining, local call counts
-   should be zero for all non-import callees. But partial inlining or
-   disabled inlining could leave some calls.)
+```rust
+pub enum DeadFuncElimMode {
+    Auto,   // run when roots are available
+    Never,  // skip the pass (default)
+}
 
-3. **Remove unmarked.** Delete functions not in the reachable set. With
-   stable `FuncId` (M0), deletion doesn't invalidate any references.
+pub struct DeadFuncElimConfig {
+    pub mode: DeadFuncElimMode,
+}
+```
 
-4. **Update module signature.** Remove corresponding `LpsFnSig` entries
-   from `LpsModuleSig`.
+String key: `dead_func_elim.mode = auto | never`. Plumbed through
+`CompilerConfig::apply` and surfaced by `lp-cli shader-debug
+--compiler-opt`.
 
-## Integration
+Default `Never` means existing filetests behave exactly as before.
 
-### Production path
+## Algorithm
 
-The engine knows the shader entry point name. Before compilation:
+1. **Build local-call adjacency.** For each function, walk the body and
+   collect the set of `CalleeRef::Local(FuncId)` it calls.
 
-```rust
-if options.opt.is_enabled(OptPass::DeadFuncElim) {
-    let root_indices = find_roots_by_name(&ir, &["main"]);
-    lpir::dead_func_elim::dead_func_elim(&mut ir, &root_indices);
-}
-```
+2. **BFS from roots.** Starting from `roots`, follow the adjacency to
+   find every transitively reachable function.
 
-### Filetest path
+3. **Remove unreachable.** Delete from `module.functions` any local that
+   is not reachable. Stable `FuncId` (M0) makes deletion safe — no other
+   ref needs renumbering.
 
-DeadFuncElim is OFF by default in filetests (or roots = all functions).
-Either way, no functions are removed.
+4. **`LpsModuleSig` is left alone.** It's name-keyed and harmless if
+   stale; the runtime resolves entries by name and skips missing ones.
 
-### OptPass
+## Integration
 
-Add `OptPass::DeadFuncElim` to the enum. Default: ON in production, OFF
-in filetests.
+Wired in after `inline_module` at the four backend entry points and in `lp-cli`:
 
-## Dependencies
+- `lpvm-native::compile_module`
+- `lpvm-cranelift::build_jit_module`
+- `lpvm-cranelift::object_bytes_from_ir`
+- `lpvm-wasm::compile_lpir`
+- `lp-cli shader-debug` (`collect_fa_data`, `collect_cranelift_data`)
 
-- **M0 (Stable CalleeRef):** Required so deletion doesn't break references.
-- **M4 (Inliner wired in):** Without inlining, there are few dead functions
-  to eliminate. DeadFuncElim is most useful after inlining has created dead
-  functions.
+Each gates the call on `mode != Never`, computes
+`roots_from_is_entry(&ir)`, and skips the pass without error when the root
+set is empty (e.g. unit-test harnesses that build raw modules).
+
+The GLSL frontend (`lps-frontend/src/lower.rs`) sets `is_entry = true`
+on the user-defined `render` function and on the synthesized
+`__shader_init` so they survive DFE.
+
+## WASM emitter dependency
+
+DFE leaves gaps in the `FuncId` space. The WASM emitter previously
+assumed `Local(FuncId(id))` could be turned into a WASM function index
+by `filtered_import_count + id`, which only holds when FuncIds are
+contiguous starting at 0. M5 fixes this by threading a `BTreeMap<FuncId,
+u32>` through `EmitCtx` and looking up the WASM index by FuncId.
 
 ## Validation
 
 ```bash
+cargo build
 cargo test -p lpir
-cargo test -p lps-filetests -- --test-threads=4
-cargo check -p fw-esp32 --target riscv32imac-unknown-none-elf \
-    --profile release-esp32 --features esp32c6,server
+./scripts/glsl-filetests.sh optimizer/dead_func_elim/
+./scripts/glsl-filetests.sh           # full suite, no regressions
 ```
 
+End-to-end filetest:
+`lp-shader/lps-filetests/filetests/optimizer/dead_func_elim/dfe-removes-unreachable.glsl`
+runs across `rv32n.q32`, `rv32c.q32`, and `wasm.q32` with
+`compile-opt(inline.mode, never)` + `compile-opt(dead_func_elim.mode,
+auto)` and asserts `unused_dead` / `also_dead` are removed while
+`render`, `test_dfe_basic`, and `helper` survive.
+
+## Known limitations / follow-ups
+
+- **`inline.mode=always` + DFE on a small `test_*` function removes the
+  test.** When the inliner inlines a small `test_*` function into
+  `render`, no caller remains, so DFE drops it. The harness then can't
+  call it by name. The clean fix is to mark `test_*` functions as
+  `is_entry` in the frontend (or equivalently extend the root set in
+  the filetest path). Tracked in
+  [`future-work.md`](./future-work.md).
+- **Inliner stale call-graph indices when a single caller has multiple
+  distinct local callees.** The bottom-up inliner builds the call graph
+  once and never refreshes the per-caller op indices, so after the
+  first callee is spliced into a caller the recorded sites for
+  subsequent callees in the same caller are stale and silently skipped
+  by `splice::inline_call_site`. Pre-existing M3 bug, exposed by the
+  M5 filetest design exploration. Tracked in
+  [`future-work.md`](./future-work.md).
+
 ## Estimated scope
 
-Small pass — ~50-100 lines. The hard part (stable ids) is in M0.
+Pass itself ~120 lines. Backend wiring ~30 lines per entry point.
+Stable ids (M0) and inliner (M3/M4) did the heavy lifting.
diff --git a/docs/roadmaps/2026-04-15-lpir-inliner/notes.md b/docs/roadmaps/2026-04-15-lpir-inliner/notes.md
index 5f7d73342..a72a2aa00 100644
--- a/docs/roadmaps/2026-04-15-lpir-inliner/notes.md
+++ b/docs/roadmaps/2026-04-15-lpir-inliner/notes.md
@@ -15,7 +15,9 @@ in LPIR to handle multi-return callees without fake-loop overhead.
 | M0 | Stable CalleeRef refactor | [m0](m0-stable-callee-ref.md) | — |
 | M1 | OptPass enum + filetest annotations | [m1](m1-optpass-filetests.md) | — |
 | M2 | Block/EndBlock/ExitBlock LPIR ops | [m2](m2-block-ops.md) | — |
-| M3 | LPIR inlining pass | [m3](m3-inlining-pass.md) | M2 |
+| M2.5 | `Continuing` marker op | [m2.5](m2.5-continuing-marker.md) | M2 |
+| M3 | LPIR inlining pass | [m3](m3-inlining-pass.md) | M2.5 |
+| M3.1 | Tune `func_weight` empirically | [m3.1](m3.1-tune-inline-weights.md) | M3 |
 | M4 | Wire into native + validation | [m4](m4-wire-and-validate.md) | M1, M3 |
 | M5 | Dead function elimination | [m5](m5-dead-func-elim.md) | M0, M4 |
 
@@ -49,7 +51,7 @@ functions to eliminate).
 - 53 tests under `filetests/function/` covering call semantics
 - `call-simple`, `call-nested`, `call-multiple`, `call-order`, `call-return-value`
   are the direct call-graph tests
-- `debug/rainbow.glsl` is a real shader with many small helper calls
+- `examples/rainbow.glsl` is a real shader with many small helper calls
 - One compile per file per target; no per-test compile flag mechanism today
 - `NativeCompileOptions` has float_mode, debug_info, emu_trace, alloc_trace
 - Env var pattern exists: `LPVM_ALLOC_TRACE=1` → option field
diff --git a/lp-cli/src/commands/shader_debug/args.rs b/lp-cli/src/commands/shader_debug/args.rs
index 03ca66eb0..470c92537 100644
--- a/lp-cli/src/commands/shader_debug/args.rs
+++ b/lp-cli/src/commands/shader_debug/args.rs
@@ -65,6 +65,22 @@ pub struct Args {
         default_missing_value = "",
     )]
     pub opt: Vec<String>,
+
+    /// Add inline weight columns (`body_len`, `mz`, `hb`) to the summary table
+    #[arg(long)]
+    pub weights: bool,
+
+    /// Override compiler options. Format: `key=value`. Repeatable.
+    /// Use `--compiler-opt` alone (no value) to print valid keys and values.
+    /// Example: `--compiler-opt inline.mode=never --compiler-opt inline.small_func_threshold=8`.
+    #[arg(
+        long = "compiler-opt",
+        value_name = "KEY=VALUE",
+        action = clap::ArgAction::Append,
+        num_args = 0..=1,
+        default_missing_value = "",
+    )]
+    pub compiler_opt: Vec<String>,
 }
 
 impl Args {
diff --git a/lp-cli/src/commands/shader_debug/collect.rs b/lp-cli/src/commands/shader_debug/collect.rs
index 36c4eada9..791646390 100644
--- a/lp-cli/src/commands/shader_debug/collect.rs
+++ b/lp-cli/src/commands/shader_debug/collect.rs
@@ -22,14 +22,37 @@ pub fn collect_fa_data(
     use lpvm_native::regalloc::allocate;
     use lpvm_native::regalloc::render::render_interleaved;
 
-    let module_abi = ModuleAbi::from_ir_and_sig(IsaTarget::Rv32imac, ir, sig);
+    let mut ir_opt = ir.clone();
+    lpir::inline_module(&mut ir_opt, &compiler_config.inline);
+    if !matches!(
+        compiler_config.dead_func_elim.mode,
+        lpir::DeadFuncElimMode::Never
+    ) {
+        let roots = lpir::roots_from_is_entry(&ir_opt);
+        if roots.is_empty() {
+            log::info!(
+                "[shader-debug] dead_func_elim: skipped (no is_entry roots); kept={}",
+                ir_opt.functions.len(),
+            );
+        } else {
+            let dfe = lpir::dead_func_elim(&mut ir_opt, &roots);
+            log::info!(
+                "[shader-debug] dead_func_elim: removed={} kept={} roots={}",
+                dfe.functions_removed,
+                ir_opt.functions.len(),
+                roots.len(),
+            );
+        }
+    }
+
+    let module_abi = ModuleAbi::from_ir_and_sig(IsaTarget::Rv32imac, &ir_opt, sig);
 
     let sig_map: std::collections::BTreeMap<&str, &lps_frontend::LpsFnSig> =
         sig.functions.iter().map(|s| (s.name.as_str(), s)).collect();
 
     let mut backend_data = BackendDebugData::new("rv32n");
 
-    for func in ir.functions.values() {
+    for func in ir_opt.functions.values() {
         // Filter if specified
         if let Some(name) = func_filter {
             if func.name != name {
@@ -55,7 +78,7 @@ pub fn collect_fa_data(
             float_mode,
             q32: &compiler_config.q32,
         };
-        let lowered = lower_ops(func, ir, &module_abi, &lower_opts)
+        let lowered = lower_ops(func, &ir_opt, &module_abi, &lower_opts)
             .map_err(|e| anyhow::anyhow!("lower: {e:?}"))?;
 
         let slots = func.total_param_slots() as usize;
@@ -66,7 +89,7 @@ pub fn collect_fa_data(
         // Generate interleaved output
         let interleaved = render_interleaved(
             func,
-            ir,
+            &ir_opt,
             &lowered.vinsts,
             &lowered.vreg_pool,
             &alloc_result.output,
@@ -124,6 +147,29 @@ pub fn collect_cranelift_data(
 ) -> Result<BackendDebugData> {
     use lpvm_cranelift::{CompileOptions, link_object_with_builtins, object_bytes_from_ir};
 
+    let mut ir_metrics = ir.clone();
+    lpir::inline_module(&mut ir_metrics, &compiler_config.inline);
+    if !matches!(
+        compiler_config.dead_func_elim.mode,
+        lpir::DeadFuncElimMode::Never
+    ) {
+        let roots = lpir::roots_from_is_entry(&ir_metrics);
+        if roots.is_empty() {
+            log::info!(
+                "[shader-debug] dead_func_elim: skipped (no is_entry roots); kept={}",
+                ir_metrics.functions.len(),
+            );
+        } else {
+            let dfe = lpir::dead_func_elim(&mut ir_metrics, &roots);
+            log::info!(
+                "[shader-debug] dead_func_elim: removed={} kept={} roots={}",
+                dfe.functions_removed,
+                ir_metrics.functions.len(),
+                roots.len(),
+            );
+        }
+    }
+
     let options = CompileOptions {
         float_mode,
         config: compiler_config.clone(),
@@ -139,7 +185,7 @@ pub fn collect_cranelift_data(
     let backend_name = if is_emu { "emu" } else { "rv32c" };
     let mut backend_data = BackendDebugData::new(backend_name);
 
-    for func in ir.functions.values() {
+    for func in ir_metrics.functions.values() {
         if let Some(name) = func_filter {
             if func.name != name {
                 continue;
diff --git a/lp-cli/src/commands/shader_debug/comparison_table.rs b/lp-cli/src/commands/shader_debug/comparison_table.rs
index c74acaf8f..adf1db7b3 100644
--- a/lp-cli/src/commands/shader_debug/comparison_table.rs
+++ b/lp-cli/src/commands/shader_debug/comparison_table.rs
@@ -132,7 +132,11 @@ fn legend_line(use_color: bool) -> String {
 }
 
 /// Render the summary block (title + table + optional legend), or `None` if there is nothing to show.
-pub fn render_summary_table(report: &DebugReport, use_color: bool) -> Option<String> {
+pub fn render_summary_table(
+    report: &DebugReport,
+    use_color: bool,
+    show_weights: bool,
+) -> Option<String> {
     if report.backends.is_empty() {
         return None;
     }
@@ -145,9 +149,17 @@ pub fn render_summary_table(report: &DebugReport, use_color: bool) -> Option<Str
     let n = report.backends.len();
 
     let mut align = vec![ColAlign::Left, ColAlign::Right];
+    if show_weights {
+        align.extend(std::iter::repeat(ColAlign::Right).take(3));
+    }
     align.extend(std::iter::repeat(ColAlign::Right).take(n));
 
     let mut header: Vec<String> = vec!["Function".to_string(), "LPIR".to_string()];
+    if show_weights {
+        header.push("body_len".to_string());
+        header.push("mz".to_string());
+        header.push("hb".to_string());
+    }
     for b in &report.backends {
         header.push(b.backend.clone());
     }
@@ -155,17 +167,33 @@ pub fn render_summary_table(report: &DebugReport, use_color: bool) -> Option<Str
     let mut rows: Vec<Vec<String>> = vec![header];
 
     let mut total_lpir = 0usize;
+    let mut total_body_len = 0usize;
+    let mut total_mz = 0usize;
+    let mut total_hb = 0usize;
     let mut total_disasm: Vec<usize> = vec![0; n];
 
     for func_name in &func_names {
-        let lpir_count = report
+        let first_fd = report
             .backends
             .first()
-            .and_then(|b| b.get_function(func_name))
-            .map(|f| f.lpir_count)
-            .unwrap_or(0);
+            .and_then(|b| b.get_function(func_name));
+
+        let lpir_count = first_fd.map(|f| f.lpir_count).unwrap_or(0);
         total_lpir += lpir_count;
 
+        let (w_bl, w_mz, w_hb) = if show_weights {
+            first_fd
+                .map(|f| (f.weight_body_len, f.weight_mz, f.weight_hb))
+                .unwrap_or((0, 0, 0))
+        } else {
+            (0usize, 0usize, 0usize)
+        };
+        if show_weights {
+            total_body_len += w_bl;
+            total_mz += w_mz;
+            total_hb += w_hb;
+        }
+
         let mut disasm = Vec::with_capacity(n);
         for backend in &report.backends {
             let d = backend
@@ -182,6 +210,11 @@ pub fn render_summary_table(report: &DebugReport, use_color: bool) -> Option<Str
         let multi = n > 1;
 
         let mut row: Vec<String> = vec![(*func_name).to_string(), lpir_count.to_string()];
+        if show_weights {
+            row.push(w_bl.to_string());
+            row.push(w_mz.to_string());
+            row.push(w_hb.to_string());
+        }
         for d in &disasm {
             row.push(format_count_with_ratio(*d, min_d, multi, use_color));
         }
@@ -192,6 +225,11 @@ pub fn render_summary_table(report: &DebugReport, use_color: bool) -> Option<Str
     let multi = n > 1;
 
     let mut total_row: Vec<String> = vec!["TOTAL".to_string(), total_lpir.to_string()];
+    if show_weights {
+        total_row.push(total_body_len.to_string());
+        total_row.push(total_mz.to_string());
+        total_row.push(total_hb.to_string());
+    }
     for t in &total_disasm {
         total_row.push(format_count_with_ratio(*t, min_t, multi, use_color));
     }
@@ -246,11 +284,47 @@ mod tests {
         r.backends.push(rv32c);
         r.backends.push(rv32n);
 
-        let s = render_summary_table(&r, false).expect("table");
+        let s = render_summary_table(&r, false, false).expect("table");
         assert!(!s.contains('\x1b'), "no ansi when use_color=false:\n{s}");
         assert!(s.contains("callee_identity"));
         assert!(s.contains("2 (1.00×)"));
         assert!(s.contains("9 (4.50×)"));
         assert!(s.contains("TOTAL"));
     }
+
+    #[test]
+    fn summary_includes_weight_columns_when_requested() {
+        let mut rv32c = BackendDebugData::new("rv32c");
+        let mut f0 = FunctionDebugData::new("foo".to_string());
+        f0.lpir_count = 10;
+        f0.weight_body_len = 10;
+        f0.weight_mz = 6;
+        f0.weight_hb = 8;
+        f0.disasm_count = 3;
+        rv32c.functions.push(f0);
+
+        let mut rv32n = BackendDebugData::new("rv32n");
+        let mut f1 = FunctionDebugData::new("foo".to_string());
+        f1.lpir_count = 10;
+        f1.weight_body_len = 10;
+        f1.weight_mz = 6;
+        f1.weight_hb = 8;
+        f1.disasm_count = 12;
+        rv32n.functions.push(f1);
+
+        let mut r = DebugReport::new();
+        r.backends.push(rv32c);
+        r.backends.push(rv32n);
+
+        let s = render_summary_table(&r, false, true).expect("table");
+        assert!(s.contains("body_len"));
+        assert!(s.contains("mz"));
+        assert!(s.contains("hb"));
+        assert!(s.contains("foo"));
+        assert!(s.contains("TOTAL"));
+        assert!(s.lines().any(|line| line.contains("foo")
+            && line.contains("10")
+            && line.contains("6")
+            && line.contains("8")));
+    }
 }
diff --git a/lp-cli/src/commands/shader_debug/display.rs b/lp-cli/src/commands/shader_debug/display.rs
index 237fe2730..ea653f49a 100644
--- a/lp-cli/src/commands/shader_debug/display.rs
+++ b/lp-cli/src/commands/shader_debug/display.rs
@@ -8,8 +8,9 @@ fn should_color() -> bool {
 }
 
 /// Print comparison table across all backends.
-pub fn print_comparison_table(report: &DebugReport) {
-    if let Some(text) = comparison_table::render_summary_table(report, should_color()) {
+pub fn print_comparison_table(report: &DebugReport, show_weights: bool) {
+    if let Some(text) = comparison_table::render_summary_table(report, should_color(), show_weights)
+    {
         print!("{text}");
     }
 }
diff --git a/lp-cli/src/commands/shader_debug/handler.rs b/lp-cli/src/commands/shader_debug/handler.rs
index ea42ec08c..3220dc5f4 100644
--- a/lp-cli/src/commands/shader_debug/handler.rs
+++ b/lp-cli/src/commands/shader_debug/handler.rs
@@ -2,6 +2,7 @@
 
 use anyhow::{Context, Result};
 use lp_shader::synth::{SynthError, synthesise_render_texture};
+use lpir::inline_weights::{weight_body_len, weight_heavy_bias, weight_markers_zero};
 use lpir::{CompilerConfig, FloatMode, LpirModule, validate_module};
 use lps_frontend::LpsModuleSig;
 use lps_shared::TextureStorageFormat;
@@ -12,14 +13,18 @@ use super::display::{print_comparison_table, print_detailed_view, print_help_tex
 use super::types::{BackendTarget, DebugReport};
 
 pub fn handle_shader_debug(args: Args) -> Result<()> {
-    let has_empty_opt = args.opt.iter().any(String::is_empty);
+    let compiler_opt_sources: Vec<&String> =
+        args.opt.iter().chain(args.compiler_opt.iter()).collect();
+    let has_empty_opt = compiler_opt_sources.iter().any(|s| s.is_empty());
     if has_empty_opt {
-        if args.opt.iter().any(|s| !s.is_empty()) {
+        if compiler_opt_sources.iter().any(|s| !s.is_empty()) {
             anyhow::bail!(
-                "`--opt` without KEY=value prints valid keys and values; do not mix with other `--opt` flags on the same command"
+                "`--opt` / `--compiler-opt` without KEY=value prints valid keys and values; do not mix empty and non-empty entries on the same command"
             );
         }
-        eprintln!("Valid keys for `-o KEY=VALUE` / `--opt KEY=VALUE`:");
+        eprintln!(
+            "Valid keys for `-o KEY=VALUE` / `--opt KEY=VALUE` / `--compiler-opt KEY=VALUE`:"
+        );
         eprintln!();
         eprintln!("  inline.mode                          auto | always | never  (default auto)");
         eprintln!("  inline.always_inline_single_site     true | false           (default true)");
@@ -30,6 +35,7 @@ pub fn handle_shader_debug(args: Args) -> Result<()> {
         eprintln!(
             "  inline.module_op_budget              <usize>                (default unlimited)"
         );
+        eprintln!("  dead_func_elim.mode                  auto | never           (default never)");
         eprintln!(
             "  q32.add_sub                          saturating | wrapping  (default saturating)"
         );
@@ -70,15 +76,15 @@ pub fn handle_shader_debug(args: Args) -> Result<()> {
     };
 
     let mut compiler_config = CompilerConfig::default();
-    for opt in &args.opt {
+    for opt in compiler_opt_sources {
         let (key, value) = opt.split_once('=').ok_or_else(|| {
             anyhow::anyhow!(
-                "--opt expects KEY=VALUE, got: {opt:?} (use `--opt` alone to list valid keys and values)"
+                "--opt / --compiler-opt expects KEY=VALUE, got: {opt:?} (use `--opt` or `--compiler-opt` alone to list valid keys and values)"
             )
         })?;
         compiler_config
             .apply(key, value)
-            .map_err(|e| anyhow::anyhow!("invalid --opt: {e}"))?;
+            .map_err(|e| anyhow::anyhow!("invalid compiler option: {e}"))?;
     }
 
     // Parse targets
@@ -110,6 +116,23 @@ pub fn handle_shader_debug(args: Args) -> Result<()> {
         report.backends.push(backend_data);
     }
 
+    if args.weights {
+        let by_name: std::collections::BTreeMap<&str, &lpir::IrFunction> = ir
+            .functions
+            .values()
+            .map(|f| (f.name.as_str(), f))
+            .collect();
+        for backend in &mut report.backends {
+            for fd in &mut backend.functions {
+                if let Some(func) = by_name.get(fd.name.as_str()) {
+                    fd.weight_body_len = weight_body_len(func);
+                    fd.weight_mz = weight_markers_zero(func);
+                    fd.weight_hb = weight_heavy_bias(func);
+                }
+            }
+        }
+    }
+
     // Print detailed view first (unless summary-only mode)
     if !args.summary {
         print_detailed_view(&report, &sections);
@@ -121,7 +144,7 @@ pub fn handle_shader_debug(args: Args) -> Result<()> {
     }
 
     // Print comparison table at the bottom (always shown)
-    print_comparison_table(&report);
+    print_comparison_table(&report, args.weights);
 
     Ok(())
 }
diff --git a/lp-cli/src/commands/shader_debug/types.rs b/lp-cli/src/commands/shader_debug/types.rs
index 0a6a17924..ce068f51e 100644
--- a/lp-cli/src/commands/shader_debug/types.rs
+++ b/lp-cli/src/commands/shader_debug/types.rs
@@ -4,6 +4,12 @@
 pub struct FunctionDebugData {
     pub name: String,
     pub lpir_count: usize,
+    /// `weight_body_len` from lpir inline_weights when `--weights` is used; otherwise 0.
+    pub weight_body_len: usize,
+    /// Markers-zero weight (`mz` column).
+    pub weight_mz: usize,
+    /// Heavy-bias weight (`hb` column).
+    pub weight_hb: usize,
     pub disasm_count: usize,
     pub spill_slots: Option<usize>,  // FA only
     pub interleaved: Option<String>, // FA only
@@ -16,6 +22,9 @@ impl FunctionDebugData {
         Self {
             name,
             lpir_count: 0,
+            weight_body_len: 0,
+            weight_mz: 0,
+            weight_hb: 0,
             disasm_count: 0,
             spill_slots: None,
             interleaved: None,
diff --git a/lp-core/lp-engine/src/gfx/cranelift.rs b/lp-core/lp-engine/src/gfx/cranelift.rs
new file mode 100644
index 000000000..a0865c91a
--- /dev/null
+++ b/lp-core/lp-engine/src/gfx/cranelift.rs
@@ -0,0 +1,139 @@
+//! Cranelift JIT backend for [`super::LpGraphics`].
+
+use crate::error::Error;
+use crate::gfx::lp_gfx::LpGraphics;
+use crate::gfx::lp_shader::{LpShader, ShaderCompileOptions};
+use alloc::boxed::Box;
+use alloc::format;
+use alloc::string::String;
+use lp_shared::Texture;
+use lpvm::{LpvmEngine, VmContextHeader};
+use lpvm_cranelift::{
+    CompileOptions, CraneliftEngine, CraneliftModule, DirectCall, FloatMode, MemoryStrategy,
+};
+
+/// Graphics backend using on-device/host Cranelift JIT.
+pub struct CraneliftGraphics;
+
+impl CraneliftGraphics {
+    #[must_use]
+    pub fn new() -> Self {
+        Self
+    }
+}
+
+impl Default for CraneliftGraphics {
+    fn default() -> Self {
+        Self::new()
+    }
+}
+
+impl LpGraphics for CraneliftGraphics {
+    fn compile_shader(
+        &self,
+        source: &str,
+        options: &ShaderCompileOptions,
+    ) -> Result<Box<dyn LpShader>, Error> {
+        // Frontend: GLSL -> LPIR (using lps_frontend)
+        let naga = lps_frontend::compile(source).map_err(|e| Error::Other {
+            message: format!("{e}"),
+        })?;
+        let (ir, meta) = lps_frontend::lower(&naga).map_err(|e| Error::Other {
+            message: format!("{e}"),
+        })?;
+        drop(naga);
+
+        // Backend: LPIR -> machine code (using CraneliftEngine)
+        let compile = CompileOptions {
+            float_mode: FloatMode::Q32,
+            q32_options: options.q32_options,
+            memory_strategy: MemoryStrategy::Default,
+            max_errors: options.max_errors,
+            emu_trace_instructions: false,
+            ..Default::default()
+        };
+        let engine = CraneliftEngine::new(compile);
+        let module = engine.compile(&ir, &meta).map_err(|e| Error::Other {
+            message: format!("{e}"),
+        })?;
+        let direct_call = module.direct_call("render");
+        Ok(Box::new(CraneliftShader {
+            _module: module,
+            direct_call,
+        }))
+    }
+
+    fn backend_name(&self) -> &'static str {
+        "cranelift"
+    }
+}
+
+struct CraneliftShader {
+    _module: CraneliftModule,
+    direct_call: Option<DirectCall>,
+}
+
+impl LpShader for CraneliftShader {
+    fn render(&mut self, texture: &mut Texture, time: f32) -> Result<(), Error> {
+        let dc = self.direct_call.as_ref().ok_or_else(|| Error::Other {
+            message: String::from("Shader has no render entry point"),
+        })?;
+        render_direct_call(dc, texture.width(), texture.height(), time, texture)
+    }
+
+    fn has_render(&self) -> bool {
+        self.direct_call.is_some()
+    }
+}
+
+fn render_direct_call(
+    dc: &DirectCall,
+    width: u32,
+    height: u32,
+    time: f32,
+    texture: &mut Texture,
+) -> Result<(), Error> {
+    const Q32_SCALE: i32 = 65536;
+    let time_q32 = (time * Q32_SCALE as f32 + 0.5) as i32;
+    let output_size_q32 = [(width as i32) * Q32_SCALE, (height as i32) * Q32_SCALE];
+    let vmctx = VmContextHeader::default();
+    let vmctx_ptr = core::ptr::from_ref(&vmctx).cast::<u8>();
+
+    for y in 0..height {
+        for x in 0..width {
+            let frag_coord_q32 = [(x as i32) * Q32_SCALE, (y as i32) * Q32_SCALE];
+            let args = [
+                frag_coord_q32[0],
+                frag_coord_q32[1],
+                output_size_q32[0],
+                output_size_q32[1],
+                time_q32,
+            ];
+            let mut rgba_q32 = [0i32; 4];
+            unsafe {
+                dc.call_i32_buf(vmctx_ptr, &args, &mut rgba_q32)
+                    .map_err(|e| Error::Other {
+                        message: format!("Shader direct call failed: {e}"),
+                    })?;
+            }
+
+            let clamp_q32 = |v: i32| -> i32 {
+                if v < 0 {
+                    0
+                } else if v > Q32_SCALE {
+                    Q32_SCALE
+                } else {
+                    v
+                }
+            };
+
+            let r = ((clamp_q32(rgba_q32[0]) as i64 * 65535) / Q32_SCALE as i64) as u16;
+            let g = ((clamp_q32(rgba_q32[1]) as i64 * 65535) / Q32_SCALE as i64) as u16;
+            let b = ((clamp_q32(rgba_q32[2]) as i64 * 65535) / Q32_SCALE as i64) as u16;
+            let a = ((clamp_q32(rgba_q32[3]) as i64 * 65535) / Q32_SCALE as i64) as u16;
+
+            texture.set_pixel_u16(x, y, [r, g, b, a]);
+        }
+    }
+    Ok(())
+}
diff --git a/lp-shader/lpir/Cargo.toml b/lp-shader/lpir/Cargo.toml
index def5bd69c..edd1f6e8e 100644
--- a/lp-shader/lpir/Cargo.toml
+++ b/lp-shader/lpir/Cargo.toml
@@ -10,4 +10,5 @@ workspace = true
 
 [dependencies]
 libm = "0.2"
+log = { workspace = true, default-features = false }
 lps-q32 = { path = "../lps-q32" }
diff --git a/lp-shader/lpir/src/builder.rs b/lp-shader/lpir/src/builder.rs
index 09a43e295..164217539 100644
--- a/lp-shader/lpir/src/builder.rs
+++ b/lp-shader/lpir/src/builder.rs
@@ -171,7 +171,8 @@ impl FunctionBuilder {
     }
 
     pub fn push_continuing(&mut self) {
-        let cur = self.body.len() as u32;
+        self.body.push(LpirOp::Continuing);
+        let cur = (self.body.len() - 1) as u32;
         let top = self
             .block_stack
             .last_mut()
diff --git a/lp-shader/lpir/src/compiler_config.rs b/lp-shader/lpir/src/compiler_config.rs
index 8f2b6ec6f..721364bf0 100644
--- a/lp-shader/lpir/src/compiler_config.rs
+++ b/lp-shader/lpir/src/compiler_config.rs
@@ -11,6 +11,7 @@ use core::str::FromStr;
 #[derive(Clone, Debug, PartialEq, Eq)]
 pub struct CompilerConfig {
     pub inline: InlineConfig,
+    pub dead_func_elim: DeadFuncElimConfig,
     pub q32: lps_q32::q32_options::Q32Options,
 }
 
@@ -18,6 +19,7 @@ impl Default for CompilerConfig {
     fn default() -> Self {
         Self {
             inline: InlineConfig::default(),
+            dead_func_elim: DeadFuncElimConfig::default(),
             q32: lps_q32::q32_options::Q32Options::default(),
         }
     }
@@ -47,14 +49,19 @@ impl fmt::Display for InlineMode {
 impl FromStr for InlineMode {
     type Err = ();
 
-    /// Accepts lowercase names: `auto`, `always`, `never`.
+    /// Accepts `auto`, `always`, `never` (ASCII case-insensitive).
     fn from_str(s: &str) -> Result<Self, Self::Err> {
-        match s.trim() {
-            "auto" => Ok(InlineMode::Auto),
-            "always" => Ok(InlineMode::Always),
-            "never" => Ok(InlineMode::Never),
-            _ => Err(()),
+        let s = s.trim();
+        if s.eq_ignore_ascii_case("auto") {
+            return Ok(InlineMode::Auto);
         }
+        if s.eq_ignore_ascii_case("always") {
+            return Ok(InlineMode::Always);
+        }
+        if s.eq_ignore_ascii_case("never") {
+            return Ok(InlineMode::Never);
+        }
+        Err(())
     }
 }
 
@@ -63,6 +70,9 @@ impl FromStr for InlineMode {
 pub struct InlineConfig {
     pub mode: InlineMode,
     pub always_inline_single_site: bool,
+    /// Maximum `func_weight` for "small" callees that are inlined unconditionally
+    /// (subject to budgets). Empirically tuned against the rv32n cost model on the
+    /// `inline-weights.glsl` corpus — see `docs/roadmaps/2026-04-15-lpir-inliner/m3.1-tune-inline-weights.md`.
     pub small_func_threshold: usize,
     pub max_growth_budget: Option<usize>,
     pub module_op_budget: Option<usize>,
@@ -73,37 +83,180 @@ impl Default for InlineConfig {
         Self {
             mode: InlineMode::Auto,
             always_inline_single_site: true,
-            small_func_threshold: 20,
+            small_func_threshold: 16,
             max_growth_budget: None,
             module_op_budget: None,
         }
     }
 }
 
+/// Controls dead function elimination.
+#[derive(Clone, Copy, Debug, PartialEq, Eq, Default)]
+pub enum DeadFuncElimMode {
+    /// Run the pass when explicit roots exist (production).
+    Auto,
+    /// Skip the pass entirely (default — keeps filetests safe).
+    #[default]
+    Never,
+}
+
+impl fmt::Display for DeadFuncElimMode {
+    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
+        f.write_str(match self {
+            DeadFuncElimMode::Auto => "auto",
+            DeadFuncElimMode::Never => "never",
+        })
+    }
+}
+
+impl FromStr for DeadFuncElimMode {
+    type Err = ();
+
+    /// Accepts `auto`, `never` (ASCII case-insensitive).
+    fn from_str(s: &str) -> Result<Self, Self::Err> {
+        let s = s.trim();
+        if s.eq_ignore_ascii_case("auto") {
+            return Ok(DeadFuncElimMode::Auto);
+        }
+        if s.eq_ignore_ascii_case("never") {
+            return Ok(DeadFuncElimMode::Never);
+        }
+        Err(())
+    }
+}
+
+/// Options for the dead function elimination pass.
+#[derive(Clone, Debug, PartialEq, Eq)]
+pub struct DeadFuncElimConfig {
+    pub mode: DeadFuncElimMode,
+}
+
+impl Default for DeadFuncElimConfig {
+    fn default() -> Self {
+        Self {
+            mode: DeadFuncElimMode::Never,
+        }
+    }
+}
+
+/// Keys accepted by [`CompilerConfig::apply`] (for error messages and tooling).
+pub const COMPILER_CONFIG_KEYS_HELP: &str = "inline.mode, inline.always_inline_single_site, inline.small_func_threshold, inline.max_growth_budget, inline.module_op_budget, dead_func_elim.mode";
+
+/// Multi-line listing of keys and allowed values (e.g. `shader-debug --compiler-opt` with no value).
+pub const COMPILER_CONFIG_APPLY_HELP: &str = r#"Valid `--compiler-opt` entries use KEY=value. Repeat the flag for multiple overrides.
+
+Keys and values:
+
+  inline.mode
+      auto | always | never   (ASCII case-insensitive; default: auto)
+
+  inline.always_inline_single_site
+      true | false | 1 | 0   (default: true)
+
+  inline.small_func_threshold
+      non-negative integer   (default: 16)
+
+  inline.max_growth_budget
+      non-negative integer   (optional per-module growth cap)
+
+  inline.module_op_budget
+      non-negative integer   (optional whole-module op budget)
+
+  dead_func_elim.mode
+      auto | never   (ASCII case-insensitive; default: never)
+
+Examples:
+  --compiler-opt inline.mode=never
+  --compiler-opt inline.mode=always --compiler-opt inline.small_func_threshold=8
+"#;
+
 /// Error applying a single `compile-opt` key/value pair.
 #[derive(Debug, PartialEq, Eq)]
 pub enum ConfigError {
-    UnknownKey { key: String },
-    InvalidValue { key: String, value: String },
+    UnknownKey {
+        key: String,
+    },
+    InvalidValue {
+        key: String,
+        value: String,
+        expected: &'static str,
+    },
 }
 
 impl fmt::Display for ConfigError {
     fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
         match self {
-            ConfigError::UnknownKey { key } => write!(f, "unknown config key {key:?}"),
-            ConfigError::InvalidValue { key, value } => {
-                write!(f, "invalid value {value:?} for config key {key:?}")
-            }
+            ConfigError::UnknownKey { key } => write!(
+                f,
+                "unknown config key {key:?} (valid keys: {COMPILER_CONFIG_KEYS_HELP})"
+            ),
+            ConfigError::InvalidValue {
+                key,
+                value,
+                expected,
+            } => write!(
+                f,
+                "invalid value {value:?} for config key {key:?} (expected {expected})"
+            ),
         }
     }
 }
 
 impl core::error::Error for ConfigError {}
 
-fn invalid(key: &str, value: &str) -> ConfigError {
+fn invalid_usize(key: &str, value: &str) -> ConfigError {
+    ConfigError::InvalidValue {
+        key: String::from(key),
+        value: String::from(value),
+        expected: "a non-negative integer",
+    }
+}
+
+fn invalid_bool(key: &str, value: &str) -> ConfigError {
     ConfigError::InvalidValue {
         key: String::from(key),
         value: String::from(value),
+        expected: "true, false, 1, or 0",
+    }
+}
+
+fn invalid_inline_mode(key: &str, value: &str) -> ConfigError {
+    ConfigError::InvalidValue {
+        key: String::from(key),
+        value: String::from(value),
+        expected: "one of: auto, always, never (ASCII case-insensitive)",
+    }
+}
+
+fn invalid_dead_func_elim_mode(key: &str, value: &str) -> ConfigError {
+    ConfigError::InvalidValue {
+        key: String::from(key),
+        value: String::from(value),
+        expected: "one of: auto, never (ASCII case-insensitive)",
+    }
+}
+
+fn invalid_q32_addsub(key: &str, value: &str) -> ConfigError {
+    ConfigError::InvalidValue {
+        key: String::from(key),
+        value: String::from(value),
+        expected: "one of: saturating, wrapping",
+    }
+}
+
+fn invalid_q32_mul(key: &str, value: &str) -> ConfigError {
+    ConfigError::InvalidValue {
+        key: String::from(key),
+        value: String::from(value),
+        expected: "one of: saturating, wrapping",
+    }
+}
+
+fn invalid_q32_div(key: &str, value: &str) -> ConfigError {
+    ConfigError::InvalidValue {
+        key: String::from(key),
+        value: String::from(value),
+        expected: "one of: saturating, reciprocal",
     }
 }
 
@@ -112,32 +265,60 @@ impl CompilerConfig {
     pub fn apply(&mut self, key: &str, value: &str) -> Result<(), ConfigError> {
         match key.trim() {
             "inline.mode" => {
-                self.inline.mode = value.trim().parse().map_err(|_| invalid(key, value))?;
+                self.inline.mode = value
+                    .trim()
+                    .parse()
+                    .map_err(|_| invalid_inline_mode(key, value))?;
             }
             "inline.always_inline_single_site" => {
                 self.inline.always_inline_single_site =
-                    parse_bool(value).ok_or_else(|| invalid(key, value))?;
+                    parse_bool(value).ok_or_else(|| invalid_bool(key, value))?;
             }
             "inline.small_func_threshold" => {
-                self.inline.small_func_threshold =
-                    value.trim().parse().map_err(|_| invalid(key, value))?;
+                self.inline.small_func_threshold = value
+                    .trim()
+                    .parse()
+                    .map_err(|_| invalid_usize(key, value))?;
             }
             "inline.max_growth_budget" => {
-                self.inline.max_growth_budget =
-                    Some(value.trim().parse().map_err(|_| invalid(key, value))?);
+                self.inline.max_growth_budget = Some(
+                    value
+                        .trim()
+                        .parse()
+                        .map_err(|_| invalid_usize(key, value))?,
+                );
             }
             "inline.module_op_budget" => {
-                self.inline.module_op_budget =
-                    Some(value.trim().parse().map_err(|_| invalid(key, value))?);
+                self.inline.module_op_budget = Some(
+                    value
+                        .trim()
+                        .parse()
+                        .map_err(|_| invalid_usize(key, value))?,
+                );
+            }
+            "dead_func_elim.mode" => {
+                self.dead_func_elim.mode = value
+                    .trim()
+                    .parse()
+                    .map_err(|_| invalid_dead_func_elim_mode(key, value))?;
             }
             "q32.add_sub" => {
-                self.q32.add_sub = value.trim().parse().map_err(|_| invalid(key, value))?;
+                self.q32.add_sub = value
+                    .trim()
+                    .parse()
+                    .map_err(|_| invalid_q32_addsub(key, value))?;
             }
             "q32.mul" => {
-                self.q32.mul = value.trim().parse().map_err(|_| invalid(key, value))?;
+                self.q32.mul = value
+                    .trim()
+                    .parse()
+                    .map_err(|_| invalid_q32_mul(key, value))?;
             }
             "q32.div" => {
-                self.q32.div = value.trim().parse().map_err(|_| invalid(key, value))?;
+                self.q32.div = value
+                    .trim()
+                    .parse()
+                    .map_err(|_| invalid_q32_div(key, value))?;
             }
             _ => {
                 return Err(ConfigError::UnknownKey {
@@ -196,8 +377,14 @@ mod tests {
     #[test]
     fn apply_unknown_key_errors() {
         let mut c = CompilerConfig::default();
-        let r = c.apply("inline.unknown", "x");
-        assert!(matches!(r, Err(ConfigError::UnknownKey { .. })));
+        let err = c.apply("inline.unknown", "x").unwrap_err();
+        assert!(matches!(err, ConfigError::UnknownKey { .. }));
+        let msg = err.to_string();
+        assert!(
+            msg.contains("inline.mode"),
+            "error should list valid keys: {msg}"
+        );
+        assert!(msg.contains("inline.unknown"));
     }
 
     #[test]
@@ -205,6 +392,25 @@ mod tests {
         let mut c = CompilerConfig::default();
         assert!(c.apply("inline.mode", "bogus").is_err());
         assert!(c.apply("inline.small_func_threshold", "nope").is_err());
+        let msg = c.apply("inline.mode", "bogus").unwrap_err().to_string();
+        assert!(msg.contains("auto"));
+        assert!(msg.contains("always"));
+        assert!(msg.contains("never"));
+        let dfe = c
+            .apply("dead_func_elim.mode", "bogus")
+            .unwrap_err()
+            .to_string();
+        assert!(dfe.contains("auto"));
+        assert!(dfe.contains("never"));
+    }
+
+    #[test]
+    fn apply_inline_mode_case_insensitive() {
+        let mut c = CompilerConfig::default();
+        c.apply("inline.mode", "Never").unwrap();
+        assert_eq!(c.inline.mode, InlineMode::Never);
+        c.apply("inline.mode", "AUTO").unwrap();
+        assert_eq!(c.inline.mode, InlineMode::Auto);
     }
 
     #[test]
@@ -213,6 +419,38 @@ mod tests {
             let m: InlineMode = s.parse().expect(s);
             assert_eq!(m.to_string(), s);
         }
+        let m: InlineMode = "Never".parse().unwrap();
+        assert_eq!(m, InlineMode::Never);
+        assert_eq!(m.to_string(), "never");
+    }
+
+    #[test]
+    fn apply_dead_func_elim_mode() {
+        let mut c = CompilerConfig::default();
+        c.apply("dead_func_elim.mode", "auto").unwrap();
+        assert_eq!(c.dead_func_elim.mode, DeadFuncElimMode::Auto);
+        c.apply("dead_func_elim.mode", "never").unwrap();
+        assert_eq!(c.dead_func_elim.mode, DeadFuncElimMode::Never);
+    }
+
+    #[test]
+    fn apply_dead_func_elim_mode_case_insensitive() {
+        let mut c = CompilerConfig::default();
+        c.apply("dead_func_elim.mode", "Never").unwrap();
+        assert_eq!(c.dead_func_elim.mode, DeadFuncElimMode::Never);
+        c.apply("dead_func_elim.mode", "AUTO").unwrap();
+        assert_eq!(c.dead_func_elim.mode, DeadFuncElimMode::Auto);
+    }
+
+    #[test]
+    fn dead_func_elim_mode_from_str_and_display_round_trip() {
+        for s in ["auto", "never"] {
+            let m: DeadFuncElimMode = s.parse().expect(s);
+            assert_eq!(m.to_string(), s);
+        }
+        let m: DeadFuncElimMode = "Never".parse().unwrap();
+        assert_eq!(m, DeadFuncElimMode::Never);
+        assert_eq!(m.to_string(), "never");
     }
 
     #[test]
diff --git a/lp-shader/lpir/src/const_fold.rs b/lp-shader/lpir/src/const_fold.rs
index ecbe7e0c0..88b943f9f 100644
--- a/lp-shader/lpir/src/const_fold.rs
+++ b/lp-shader/lpir/src/const_fold.rs
@@ -328,6 +328,7 @@ pub fn fold_constants(func: &mut IrFunction) -> usize {
             | LpirOp::Else
             | LpirOp::End
             | LpirOp::LoopStart { .. }
+            | LpirOp::Continuing
             | LpirOp::Block { .. }
             | LpirOp::Break
             | LpirOp::Continue
diff --git a/lp-shader/lpir/src/dead_func_elim.rs b/lp-shader/lpir/src/dead_func_elim.rs
new file mode 100644
index 000000000..c8e21ee04
--- /dev/null
+++ b/lp-shader/lpir/src/dead_func_elim.rs
@@ -0,0 +1,105 @@
+//! Remove local functions that are not transitively reachable from the root set.
+
+use alloc::collections::{BTreeMap, BTreeSet, VecDeque};
+use alloc::vec::Vec;
+
+use crate::lpir_module::LpirModule;
+use crate::lpir_op::LpirOp;
+use crate::types::{CalleeRef, FuncId};
+
+/// Counters returned by [`dead_func_elim`].
+#[derive(Debug, Default, Clone, Copy)]
+pub struct DeadFuncElimResult {
+    pub functions_removed: usize,
+}
+
+/// Caller → callees adjacency over [`CalleeRef::Local`] calls only (imports excluded, deduplicated per caller).
+fn build_local_adjacency(module: &LpirModule) -> BTreeMap<FuncId, BTreeSet<FuncId>> {
+    let mut adj: BTreeMap<FuncId, BTreeSet<FuncId>> = BTreeMap::new();
+    for (&caller_id, func) in &module.functions {
+        for op in &func.body {
+            if let LpirOp::Call {
+                callee: CalleeRef::Local(callee_id),
+                ..
+            } = op
+            {
+                adj.entry(caller_id).or_default().insert(*callee_id);
+            }
+        }
+    }
+    adj
+}
+
+/// Remove functions that aren't transitively reachable from `roots`.
+///
+/// Stable [`FuncId`] (M0) means deletion never invalidates surviving call sites.
+/// Re-entry / cycles among reachable functions are handled by transitive marking.
+pub fn dead_func_elim(module: &mut LpirModule, roots: &[FuncId]) -> DeadFuncElimResult {
+    let adj = build_local_adjacency(module);
+
+    let mut reachable: BTreeSet<FuncId> = BTreeSet::new();
+    let mut work: VecDeque<FuncId> = VecDeque::new();
+    for &r in roots {
+        if module.functions.contains_key(&r) {
+            if reachable.insert(r) {
+                work.push_back(r);
+            }
+        } else {
+            log::warn!("dead_func_elim: root func={r:?} not in module, ignoring");
+        }
+    }
+
+    while let Some(f) = work.pop_front() {
+        if let Some(callees) = adj.get(&f) {
+            for &c in callees {
+                if reachable.insert(c) {
+                    work.push_back(c);
+                }
+            }
+        }
+    }
+
+    let mut to_remove: Vec<FuncId> = module
+        .functions
+        .keys()
+        .filter(|id| !reachable.contains(*id))
+        .copied()
+        .collect();
+
+    to_remove.sort();
+    let removed = to_remove.len();
+
+    for id in to_remove {
+        if let Some(f) = module.functions.remove(&id) {
+            log::debug!("dead_func_elim: drop func={id:?} name={:?}", f.name);
+        }
+    }
+
+    let kept = module.functions.len();
+    let roots_n = roots.len();
+    log::info!("dead_func_elim: removed={removed} kept={kept} roots={roots_n}");
+    DeadFuncElimResult {
+        functions_removed: removed,
+    }
+}
+
+/// Convenience: build a roots vector from `IrFunction::is_entry`.
+pub fn roots_from_is_entry(module: &LpirModule) -> Vec<FuncId> {
+    module
+        .functions
+        .iter()
+        .filter(|(_, f)| f.is_entry)
+        .map(|(&id, _)| id)
+        .collect()
+}
+
+/// Convenience: build a roots vector by function name (silently skips unknown names).
+pub fn roots_by_name(module: &LpirModule, names: &[&str]) -> Vec<FuncId> {
+    let mut out = Vec::with_capacity(names.len());
+    for &name in names {
+        if let Some((&id, _)) = module.functions.iter().find(|(_, f)| f.name == name) {
+            out.push(id);
+        }
+    }
+    out
+}
diff --git a/lp-shader/lpir/src/inline/callgraph.rs b/lp-shader/lpir/src/inline/callgraph.rs
new file mode 100644
index 000000000..b77ecf809
--- /dev/null
+++ b/lp-shader/lpir/src/inline/callgraph.rs
@@ -0,0 +1,95 @@
+//! Local call graph for module-level bottom-up passes.
+
+use alloc::collections::{BTreeMap, BTreeSet};
+use alloc::vec::Vec;
+
+use crate::lpir_module::LpirModule;
+use crate::lpir_op::LpirOp;
+use crate::types::{CalleeRef, FuncId};
+
+pub(crate) struct CallGraph {
+    /// `callees_of[caller]` = sorted, deduplicated list of local [`FuncId`]s called.
+    pub callees_of: BTreeMap<FuncId, Vec<FuncId>>,
+    /// `callers_of[callee]` = sorted, deduplicated list of local [`FuncId`]s calling it.
+    pub callers_of: BTreeMap<FuncId, Vec<FuncId>>,
+    /// Per caller: `(op_index, callee)` in body order (one entry per call site).
+    pub call_sites_of: BTreeMap<FuncId, Vec<(usize, FuncId)>>,
+}
+
+pub(crate) fn build(module: &LpirModule) -> CallGraph {
+    let mut callees_raw: BTreeMap<FuncId, BTreeSet<FuncId>> = BTreeMap::new();
+    let mut callers_raw: BTreeMap<FuncId, BTreeSet<FuncId>> = BTreeMap::new();
+    let mut call_sites_of: BTreeMap<FuncId, Vec<(usize, FuncId)>> = BTreeMap::new();
+
+    for (&caller_id, func) in &module.functions {
+        for (idx, op) in func.body.iter().enumerate() {
+            if let LpirOp::Call {
+                callee: CalleeRef::Local(callee_id),
+                ..
+            } = op
+            {
+                callees_raw.entry(caller_id).or_default().insert(*callee_id);
+                callers_raw.entry(*callee_id).or_default().insert(caller_id);
+                call_sites_of
+                    .entry(caller_id)
+                    .or_default()
+                    .push((idx, *callee_id));
+            }
+        }
+    }
+
+    let callees_of = callees_raw
+        .into_iter()
+        .map(|(k, v)| (k, v.into_iter().collect()))
+        .collect();
+    let callers_of = callers_raw
+        .into_iter()
+        .map(|(k, v)| (k, v.into_iter().collect()))
+        .collect();
+
+    CallGraph {
+        callees_of,
+        callers_of,
+        call_sites_of,
+    }
+}
+
+/// Kahn topological order (leaves / callees first). Remaining nodes sit on a cycle or call into one.
+/// `module` supplies every [`FuncId`] so isolated functions (no calls / not called) participate.
+pub(crate) fn topo_order(g: &CallGraph, module: &LpirModule) -> (Vec<FuncId>, BTreeSet<FuncId>) {
+    let mut in_degree: BTreeMap<FuncId, usize> = BTreeMap::new();
+    for &f in module.functions.keys() {
+        let d = g.callees_of.get(&f).map(|v| v.len()).unwrap_or(0);
+        in_degree.insert(f, d);
+    }
+
+    let mut queue: BTreeSet<FuncId> = in_degree
+        .iter()
+        .filter(|(_, deg)| **deg == 0)
+        .map(|(&f, _)| f)
+        .collect();
+
+    let mut topo = Vec::new();
+    while let Some(gid) = queue.iter().next().copied() {
+        queue.remove(&gid);
+        topo.push(gid);
+        if let Some(callers) = g.callers_of.get(&gid) {
+            for &caller in callers {
+                if let Some(deg) = in_degree.get_mut(&caller) {
+                    *deg = deg.saturating_sub(1);
+                    if *deg == 0 {
+                        queue.insert(caller);
+                    }
+                }
+            }
+        }
+    }
+
+    let cyclic: BTreeSet<FuncId> = in_degree
+        .into_iter()
+        .filter(|(_, d)| *d > 0)
+        .map(|(f, _)| f)
+        .collect();
+
+    (topo, cyclic)
+}
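Review note on `topo_order`: the sketch below restates the callees-first Kahn ordering on plain integers, to show which nodes end up in the second return value. It is a minimal illustration, not the crate's code. `Id`, `topo_leaves_first`, and `demo_graph` are hypothetical names, and the sketch assumes every function has an entry in `callees_of` (an empty list for leaves) with deduplicated callees.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for `FuncId`.
type Id = u32;

/// Returns (callees-first order, ids the sort never released).
fn topo_leaves_first(callees_of: &BTreeMap<Id, Vec<Id>>) -> (Vec<Id>, BTreeSet<Id>) {
    // Count of not-yet-emitted callees per function, plus reverse edges.
    let mut pending: BTreeMap<Id, usize> = BTreeMap::new();
    let mut callers_of: BTreeMap<Id, Vec<Id>> = BTreeMap::new();
    for (&f, callees) in callees_of {
        pending.insert(f, callees.len());
        for &c in callees {
            callers_of.entry(c).or_default().push(f);
        }
    }

    // Functions that call nothing are ready immediately.
    let mut ready: BTreeSet<Id> = pending
        .iter()
        .filter(|(_, n)| **n == 0)
        .map(|(&f, _)| f)
        .collect();

    let mut order = Vec::new();
    while let Some(f) = ready.pop_first() {
        order.push(f);
        for &caller in callers_of.get(&f).into_iter().flatten() {
            let n = pending.get_mut(&caller).expect("caller has an entry");
            *n -= 1;
            if *n == 0 {
                ready.insert(caller);
            }
        }
    }

    // Anything still pending is on a cycle or calls into one.
    let stuck = pending
        .into_iter()
        .filter(|(_, n)| *n > 0)
        .map(|(f, _)| f)
        .collect();
    (order, stuck)
}

/// 0 -> {1, 2}, 1 -> {2}, 2 is a leaf; 3 <-> 4 form a cycle; 5 -> 3.
fn demo_graph() -> BTreeMap<Id, Vec<Id>> {
    BTreeMap::from([
        (0, vec![1, 2]),
        (1, vec![2]),
        (2, vec![]),
        (3, vec![4]),
        (4, vec![3]),
        (5, vec![3]),
    ])
}

fn main() {
    let (order, stuck) = topo_leaves_first(&demo_graph());
    println!("order = {order:?}, stuck = {stuck:?}");
}
```

In this graph, function 5 is not itself recursive, yet it is reported alongside 3 and 4 because its only callee never becomes ready.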
diff --git a/lp-shader/lpir/src/inline/heuristic.rs b/lp-shader/lpir/src/inline/heuristic.rs
new file mode 100644
index 000000000..e1d24cdfb
--- /dev/null
+++ b/lp-shader/lpir/src/inline/heuristic.rs
@@ -0,0 +1,169 @@
+//! Size / budget gating for inlining.
+
+use crate::compiler_config::{InlineConfig, InlineMode};
+use crate::lpir_module::IrFunction;
+use crate::lpir_op::LpirOp;
+
+/// LPIR-op count of `func.body`. Empirically the best simple correlate of
+/// rv32n instruction count on the `inline-weights.glsl` corpus
+/// (Pearson r ≈ 0.98 vs `mz`/`hb` candidates evaluated in M3.1).
+/// See `docs/roadmaps/2026-04-15-lpir-inliner/m3.1-tune-inline-weights.md`.
+pub(crate) fn func_weight(func: &IrFunction) -> usize {
+    func.body.len()
+}
+
+/// Which candidate [`weight`] function to use (M3.1 tuning; not wired to [`func_weight`] yet).
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub enum WeightKind {
+    BodyLen,
+    MarkersZero,
+    HeavyBias,
+}
+
+/// Dispatch for candidate inline size metrics.
+pub fn weight(kind: WeightKind, func: &IrFunction) -> usize {
+    match kind {
+        WeightKind::BodyLen => weight_body_len(func),
+        WeightKind::MarkersZero => weight_markers_zero(func),
+        WeightKind::HeavyBias => weight_heavy_bias(func),
+    }
+}
+
+/// Baseline: raw LPIR op count (same as production [`func_weight`] today).
+pub fn weight_body_len(func: &IrFunction) -> usize {
+    func.body.len()
+}
+
+/// Count each op as 1 except structural / pure-marker ops weighted 0 (M3.1 plan):
+/// [`LpirOp::IfStart`], [`LpirOp::Else`], [`LpirOp::Continuing`], [`LpirOp::LoopStart`],
+/// [`LpirOp::SwitchStart`], [`LpirOp::CaseStart`], [`LpirOp::DefaultStart`], [`LpirOp::End`],
+/// [`LpirOp::Block`], [`LpirOp::ExitBlock`], [`LpirOp::Break`], [`LpirOp::Continue`],
+/// [`LpirOp::Return`]. Rationale: no standalone RV32 lowering for these; [`LpirOp::Return`]
+/// is an epilogue / lifetime boundary for sizing, not a counted “op” in this metric.
+pub fn weight_markers_zero(func: &IrFunction) -> usize {
+    func.body.iter().map(weight_op_markers_zero).sum()
+}
+
+/// Like [`weight_markers_zero`], with extra cost on ops that tend to expand to more
+/// machine code or helper calls: [`LpirOp::Call`] (call/return and arg shuffle),
+/// [`LpirOp::Memcpy`] (loop-bodied helper), [`LpirOp::Fsqrt`] (multi-cycle / lib helper),
+/// and slow div/rem helpers ([`LpirOp::IdivS`], [`LpirOp::IdivU`], [`LpirOp::IremS`],
+/// [`LpirOp::IremU`], [`LpirOp::Fdiv`]) for empirical correlation tests.
+pub fn weight_heavy_bias(func: &IrFunction) -> usize {
+    func.body.iter().map(weight_op_heavy_bias).sum()
+}
+
+fn weight_op_markers_zero(op: &LpirOp) -> usize {
+    match op {
+        LpirOp::IfStart { .. }
+        | LpirOp::Else
+        | LpirOp::Continuing
+        | LpirOp::LoopStart { .. }
+        | LpirOp::SwitchStart { .. }
+        | LpirOp::CaseStart { .. }
+        | LpirOp::DefaultStart { .. }
+        | LpirOp::End
+        | LpirOp::Block { .. }
+        | LpirOp::ExitBlock
+        | LpirOp::Break
+        | LpirOp::Continue
+        | LpirOp::Return { .. } => 0,
+        _ => 1,
+    }
+}
+
+fn weight_op_heavy_bias(op: &LpirOp) -> usize {
+    match op {
+        LpirOp::IfStart { .. }
+        | LpirOp::Else
+        | LpirOp::Continuing
+        | LpirOp::LoopStart { .. }
+        | LpirOp::SwitchStart { .. }
+        | LpirOp::CaseStart { .. }
+        | LpirOp::DefaultStart { .. }
+        | LpirOp::End
+        | LpirOp::Block { .. }
+        | LpirOp::ExitBlock
+        | LpirOp::Break
+        | LpirOp::Continue
+        | LpirOp::Return { .. } => 0,
+        LpirOp::Call { .. } => 5,
+        LpirOp::Memcpy { .. } => 4,
+        LpirOp::Fsqrt { .. } => 4,
+        LpirOp::IdivS { .. }
+        | LpirOp::IdivU { .. }
+        | LpirOp::IremS { .. }
+        | LpirOp::IremU { .. }
+        | LpirOp::Fdiv { .. } => 3,
+        _ => 1,
+    }
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub(crate) enum BudgetReason {
+    MaxGrowth,
+    ModuleTotal,
+}
+
+#[derive(Debug, Clone, Copy, PartialEq, Eq)]
+pub(crate) enum Decision {
+    Inline,
+    SkipTooLarge {
+        weight: usize,
+        threshold: usize,
+    },
+    SkipBudget {
+        projected: usize,
+        budget: usize,
+        reason: BudgetReason,
+    },
+    SkipMode,
+}
+
+pub(crate) fn should_inline(
+    callee_weight: usize,
+    callsite_count: usize,
+    current_module_op_count: usize,
+    config: &InlineConfig,
+) -> Decision {
+    use InlineMode::*;
+
+    if matches!(config.mode, Never) {
+        return Decision::SkipMode;
+    }
+
+    if matches!(config.mode, Auto) {
+        if callee_weight > config.small_func_threshold
+            && (callsite_count > 1 || !config.always_inline_single_site)
+        {
+            return Decision::SkipTooLarge {
+                weight: callee_weight,
+                threshold: config.small_func_threshold,
+            };
+        }
+    }
+
+    let projected = callee_weight.saturating_mul(callsite_count);
+    if let Some(b) = config.max_growth_budget {
+        if projected > b {
+            return Decision::SkipBudget {
+                projected,
+                budget: b,
+                reason: BudgetReason::MaxGrowth,
+            };
+        }
+    }
+
+    if let Some(b) = config.module_op_budget {
+        let projected_total = current_module_op_count.saturating_add(projected);
+        if projected_total > b {
+            return Decision::SkipBudget {
+                projected: projected_total,
+                budget: b,
+                reason: BudgetReason::ModuleTotal,
+            };
+        }
+    }
+
+    Decision::Inline
+}
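Review note on `should_inline`: the order of the three gates (size, per-callee growth, module total) is easier to check with concrete numbers. The sketch below is a simplified standalone restatement that assumes `InlineMode::Auto`. `Config`, `Verdict`, `gate`, and `demo_config` are hypothetical names that only mirror the fields read in the function above, and the budget values are made up for the example.

```rust
/// Hypothetical mirror of the `InlineConfig` fields used by the gating logic.
struct Config {
    small_func_threshold: usize,
    always_inline_single_site: bool,
    max_growth_budget: Option<usize>,
    module_op_budget: Option<usize>,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    Inline,
    TooLarge,
    MaxGrowth,
    ModuleTotal,
}

/// Same gate order as the diff, with the mode fixed to `Auto` (assumption).
fn gate(weight: usize, sites: usize, module_ops: usize, cfg: &Config) -> Verdict {
    // Size gate: large callees pass only via the single-call-site exemption.
    let single_site_pass = sites <= 1 && cfg.always_inline_single_site;
    if weight > cfg.small_func_threshold && !single_site_pass {
        return Verdict::TooLarge;
    }

    // Growth gate: ops this callee would add across all of its call sites.
    let projected = weight.saturating_mul(sites);
    if let Some(budget) = cfg.max_growth_budget {
        if projected > budget {
            return Verdict::MaxGrowth;
        }
    }

    // Module gate: total op count after the projected growth.
    if let Some(budget) = cfg.module_op_budget {
        if module_ops.saturating_add(projected) > budget {
            return Verdict::ModuleTotal;
        }
    }
    Verdict::Inline
}

/// Illustrative values only; not the crate's defaults.
fn demo_config() -> Config {
    Config {
        small_func_threshold: 16,
        always_inline_single_site: true,
        max_growth_budget: Some(64),
        module_op_budget: Some(1000),
    }
}

fn main() {
    let cfg = demo_config();
    println!("{:?}", gate(8, 3, 100, &cfg));
    println!("{:?}", gate(40, 2, 100, &cfg));
}
```

Note that the single-site exemption only bypasses the size gate; a large single-site callee can still be rejected by either budget.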
diff --git a/lp-shader/lpir/src/inline/mod.rs b/lp-shader/lpir/src/inline/mod.rs
new file mode 100644
index 000000000..fe52bec8b
--- /dev/null
+++ b/lp-shader/lpir/src/inline/mod.rs
@@ -0,0 +1,157 @@
+//! LPIR inlining pass — bottom-up, never deletes functions, structural
+//! offset recompute. See docs/plans/2026-04-17-lpir-inliner-stage-iii.
+
+pub(crate) mod callgraph;
+pub(crate) mod heuristic;
+mod offsets;
+pub(crate) mod remap;
+pub(crate) mod splice;
+
+pub(crate) use offsets::recompute_offsets;
+
+use alloc::collections::{BTreeMap, BTreeSet};
+use alloc::vec::Vec;
+
+use crate::InlineConfig;
+use crate::inline::callgraph::CallGraph;
+use crate::inline::heuristic::{BudgetReason, Decision};
+use crate::lpir_module::LpirModule;
+use crate::types::FuncId;
+
+/// Counters and flags returned by [`inline_module`].
+#[derive(Debug, Default, Clone, Copy)]
+pub struct InlineResult {
+    /// Distinct callees inlined into at least one caller this run.
+    pub functions_inlined: usize,
+    /// `Call` sites replaced with callee bodies.
+    pub call_sites_replaced: usize,
+    /// Functions on a local call cycle or calling into one (never inlined as callees).
+    pub functions_skipped_recursive: usize,
+    /// True when an inline would push the module past `InlineConfig::module_op_budget`; the pass stops early.
+    pub budget_exceeded: bool,
+}
+
+fn total_op_count(module: &LpirModule) -> usize {
+    module.functions.values().map(|f| f.body.len()).sum()
+}
+
+fn call_sites_for_callee(graph: &CallGraph, callee_id: FuncId) -> Vec<(FuncId, usize)> {
+    let mut out = Vec::new();
+    for (&caller_id, sites) in &graph.call_sites_of {
+        for &(op_idx, c) in sites {
+            if c == callee_id {
+                out.push((caller_id, op_idx));
+            }
+        }
+    }
+    out
+}
+
+fn group_by_caller_desc(sites: &[(FuncId, usize)]) -> Vec<(FuncId, Vec<usize>)> {
+    let mut map: BTreeMap<FuncId, Vec<usize>> = BTreeMap::new();
+    for &(caller, idx) in sites {
+        map.entry(caller).or_default().push(idx);
+    }
+    let mut out: Vec<(FuncId, Vec<usize>)> = map.into_iter().collect();
+    for (_, indices) in &mut out {
+        indices.sort_by(|a, b| b.cmp(a));
+    }
+    out
+}
+
+/// Bottom-up local inlining pass: mutates `module` in place, never removes functions.
+pub fn inline_module(module: &mut LpirModule, config: &InlineConfig) -> InlineResult {
+    let graph = callgraph::build(module);
+    let (topo, cyclic) = callgraph::topo_order(&graph, module);
+
+    let mut result = InlineResult {
+        functions_skipped_recursive: cyclic.len(),
+        ..Default::default()
+    };
+    for &cyc in &cyclic {
+        log::debug!("inline: skip recursive func={cyc:?}");
+    }
+
+    let mut current_op_count = total_op_count(module);
+    let mut inlined_callees = BTreeSet::new();
+    let mut mutated_callers = BTreeSet::new();
+
+    'outer: for callee_id in topo {
+        if cyclic.contains(&callee_id) {
+            continue;
+        }
+        let Some(callee_fn) = module.functions.get(&callee_id) else {
+            continue;
+        };
+        let weight = heuristic::func_weight(callee_fn);
+        let sites = call_sites_for_callee(&graph, callee_id);
+        if sites.is_empty() {
+            continue;
+        }
+
+        match heuristic::should_inline(weight, sites.len(), current_op_count, config) {
+            Decision::Inline => {
+                log::debug!(
+                    "inline: callee={:?} weight={} sites={} module_ops={} decision=inline",
+                    callee_id,
+                    weight,
+                    sites.len(),
+                    current_op_count
+                );
+                let by_caller = group_by_caller_desc(&sites);
+                let callee = module.functions.remove(&callee_id).expect("topo callee");
+                for (caller_id, indices) in by_caller {
+                    let caller = module.functions.get_mut(&caller_id).expect("caller");
+                    for op_idx in indices {
+                        splice::inline_call_site(caller, &callee, op_idx);
+                        result.call_sites_replaced += 1;
+                    }
+                    mutated_callers.insert(caller_id);
+                }
+                module.functions.insert(callee_id, callee);
+                inlined_callees.insert(callee_id);
+                current_op_count = total_op_count(module);
+            }
+            Decision::SkipTooLarge { weight, threshold } => {
+                log::debug!(
+                    "inline: callee={callee_id:?} skip too_large weight={weight} threshold={threshold}"
+                );
+            }
+            Decision::SkipBudget {
+                projected,
+                budget,
+                reason,
+            } => {
+                log::debug!(
+                    "inline: callee={callee_id:?} skip budget projected={projected} budget={budget} reason={reason:?}"
+                );
+                if matches!(reason, BudgetReason::ModuleTotal) {
+                    result.budget_exceeded = true;
+                    break 'outer;
+                }
+            }
+            Decision::SkipMode => {
+                log::debug!("inline: callee={callee_id:?} skip mode=Never");
+            }
+        }
+    }
+
+    for caller_id in mutated_callers {
+        let f = module
+            .functions
+            .get_mut(&caller_id)
+            .expect("mutated caller");
+        recompute_offsets(&mut f.body);
+        f.body.shrink_to_fit();
+    }
+
+    result.functions_inlined = inlined_callees.len();
+    log::info!(
+        "inline: done inlined={} sites={} skipped_recursive={} budget_exceeded={}",
+        result.functions_inlined,
+        result.call_sites_replaced,
+        result.functions_skipped_recursive,
+        result.budget_exceeded
+    );
+    result
+}
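Review note on `group_by_caller_desc`: sorting each caller's call-site indices in descending order means that expanding one site never shifts the index of a site still waiting to be processed for the same callee. The sketch below shows the effect on a plain integer vector. `replace_sites` is a hypothetical helper, and the real `splice::inline_call_site` is not shown in this part of the diff, so this only illustrates the index-ordering argument.

```rust
/// Replace each element at `sites` with `replacement`, highest index first,
/// so that lower indices stay valid while the vector grows.
fn replace_sites(body: &mut Vec<i32>, mut sites: Vec<usize>, replacement: &[i32]) {
    sites.sort_by(|a, b| b.cmp(a));
    for idx in sites {
        // `Vec::splice` applies the replacement when the returned iterator is dropped.
        drop(body.splice(idx..=idx, replacement.iter().copied()));
    }
}

fn main() {
    // 0 marks a call site; 7, 8 stand in for the inlined callee body.
    let mut body = vec![1, 0, 2, 0, 3];
    replace_sites(&mut body, vec![1, 3], &[7, 8]);
    println!("{body:?}");
}
```

Processing the same sites in ascending order would expand index 1 first and leave the recorded index 3 pointing at the wrong element.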
diff --git a/lp-shader/lpir/src/inline/offsets.rs b/lp-shader/lpir/src/inline/offsets.rs
new file mode 100644
index 000000000..cb5758146
--- /dev/null
+++ b/lp-shader/lpir/src/inline/offsets.rs
@@ -0,0 +1,211 @@
+//! Structural control-flow offset recompute for flat [`LpirOp`] bodies.
+
+use alloc::vec::Vec;
+
+use crate::lpir_op::LpirOp;
+
+enum Frame {
+    If {
+        start: usize,
+    },
+    Else {
+        if_start: usize,
+    },
+    Loop {
+        start: usize,
+        had_continuing: bool,
+    },
+    Block {
+        start: usize,
+    },
+    Switch {
+        start: usize,
+        /// Index of `CaseStart` / `DefaultStart` whose `end_offset` points to the next arm opener
+        /// or the switch's closing `End`.
+        pending_case: Option<usize>,
+    },
+    /// Inside a `case` / `default` arm (closed by one `End` per arm).
+    Arm,
+}
+
+/// Recompute all control-flow offset fields in `body`. Idempotent; overwrites existing offsets.
+pub(crate) fn recompute_offsets(body: &mut [LpirOp]) {
+    let mut stack: Vec<Frame> = Vec::new();
+
+    for idx in 0..body.len() {
+        let after = (idx + 1) as u32;
+
+        match &mut body[idx] {
+            LpirOp::IfStart {
+                else_offset,
+                end_offset,
+                ..
+            } => {
+                *else_offset = 0;
+                *end_offset = 0;
+                stack.push(Frame::If { start: idx });
+            }
+            LpirOp::Else => {
+                let top = stack.pop().expect("Else without matching IfStart");
+                match top {
+                    Frame::If { start } => {
+                        if let LpirOp::IfStart {
+                            else_offset,
+                            end_offset: _,
+                            ..
+                        } = &mut body[start]
+                        {
+                            *else_offset = idx as u32;
+                        } else {
+                            panic!("Else: expected IfStart at {start}");
+                        }
+                        stack.push(Frame::Else { if_start: start });
+                    }
+                    _ => panic!("Else: expected If frame"),
+                }
+            }
+            LpirOp::Continuing => {
+                let top = stack.last_mut().expect("Continuing outside loop");
+                match top {
+                    Frame::Loop {
+                        start,
+                        had_continuing,
+                    } => {
+                        assert!(!*had_continuing, "duplicate Continuing in same loop");
+                        *had_continuing = true;
+                        if let LpirOp::LoopStart {
+                            continuing_offset, ..
+                        } = &mut body[*start]
+                        {
+                            *continuing_offset = idx as u32;
+                        } else {
+                            panic!("Continuing: expected LoopStart");
+                        }
+                    }
+                    _ => panic!("Continuing: expected Loop frame"),
+                }
+            }
+            LpirOp::LoopStart {
+                continuing_offset,
+                end_offset,
+            } => {
+                *continuing_offset = 0;
+                *end_offset = 0;
+                stack.push(Frame::Loop {
+                    start: idx,
+                    had_continuing: false,
+                });
+            }
+            LpirOp::SwitchStart { end_offset, .. } => {
+                *end_offset = 0;
+                stack.push(Frame::Switch {
+                    start: idx,
+                    pending_case: None,
+                });
+            }
+            LpirOp::CaseStart { end_offset, .. } | LpirOp::DefaultStart { end_offset } => {
+                *end_offset = 0;
+                let pending = if let Some(Frame::Switch { pending_case, .. }) = stack.last_mut() {
+                    pending_case.take()
+                } else {
+                    panic!("Case/Default outside Switch");
+                };
+                if let Some(pc) = pending {
+                    match &mut body[pc] {
+                        LpirOp::CaseStart { end_offset: eo, .. }
+                        | LpirOp::DefaultStart { end_offset: eo } => {
+                            *eo = idx as u32;
+                        }
+                        _ => {}
+                    }
+                }
+                if let Some(Frame::Switch { pending_case, .. }) = stack.last_mut() {
+                    *pending_case = Some(idx);
+                }
+                stack.push(Frame::Arm);
+            }
+            LpirOp::Block { end_offset } => {
+                *end_offset = 0;
+                stack.push(Frame::Block { start: idx });
+            }
+            LpirOp::ExitBlock => {}
+            LpirOp::End => {
+                let end_idx = idx;
+                let frame = stack.pop().expect("End without matching opener");
+                match frame {
+                    Frame::Arm => {}
+                    Frame::Else { if_start } => {
+                        if let LpirOp::IfStart { end_offset, .. } = &mut body[if_start] {
+                            *end_offset = after;
+                        } else {
+                            panic!("End: expected IfStart");
+                        }
+                    }
+                    Frame::If { start } => {
+                        if let LpirOp::IfStart {
+                            else_offset,
+                            end_offset,
+                            ..
+                        } = &mut body[start]
+                        {
+                            *else_offset = end_idx as u32;
+                            *end_offset = after;
+                        } else {
+                            panic!("End: expected IfStart");
+                        }
+                    }
+                    Frame::Loop {
+                        start,
+                        had_continuing,
+                    } => {
+                        if let LpirOp::LoopStart {
+                            continuing_offset,
+                            end_offset,
+                        } = &mut body[start]
+                        {
+                            if !had_continuing {
+                                *continuing_offset = (start + 1) as u32;
+                            }
+                            *end_offset = after;
+                        } else {
+                            panic!("End: expected LoopStart");
+                        }
+                    }
+                    Frame::Block { start } => {
+                        if let LpirOp::Block { end_offset } = &mut body[start] {
+                            *end_offset = after;
+                        } else {
+                            panic!("End: expected Block");
+                        }
+                    }
+                    Frame::Switch {
+                        start,
+                        pending_case,
+                    } => {
+                        if let Some(pc) = pending_case {
+                            match &mut body[pc] {
+                                LpirOp::CaseStart { end_offset: eo, .. }
+                                | LpirOp::DefaultStart { end_offset: eo } => {
+                                    *eo = end_idx as u32;
+                                }
+                                _ => {}
+                            }
+                        }
+                        if let LpirOp::SwitchStart { end_offset, .. } = &mut body[start] {
+                            *end_offset = after;
+                        } else {
+                            panic!("End: expected SwitchStart");
+                        }
+                    }
+                }
+            }
+            _ => {}
+        }
+    }
+
+    debug_assert!(
+        stack.is_empty(),
+        "recompute_offsets: unclosed frames: {:?}",
+        stack.len()
+    );
+}
diff --git a/lp-shader/lpir/src/inline/remap.rs b/lp-shader/lpir/src/inline/remap.rs
new file mode 100644
index 000000000..b50422bab
--- /dev/null
+++ b/lp-shader/lpir/src/inline/remap.rs
@@ -0,0 +1,559 @@
+//! Per-call-site vreg / slot remapping for inlined callees.
+
+use alloc::vec::Vec;
+
+use crate::lpir_module::{IrFunction, VMCTX_VREG};
+use crate::lpir_op::LpirOp;
+use crate::types::{IrType, SlotId, VReg, VRegRange};
+
+const VREG_SENTINEL: VReg = VReg(u32::MAX);
+
+/// One bool per user param (index `i` = param `VReg(i + 1)`).
+pub(crate) struct ParamWriteMask {
+    pub written: Vec<bool>,
+}
+
+pub(crate) fn scan_param_writes(callee: &IrFunction) -> ParamWriteMask {
+    let n = callee.param_count as usize;
+    let mut written = alloc::vec![false; n];
+    for op in &callee.body {
+        if let Some(def) = op.def_vreg() {
+            debug_assert_ne!(def, VMCTX_VREG, "vmctx should never be defined");
+            let i = def.0 as usize;
+            if i >= 1 && i <= callee.param_count as usize {
+                written[i - 1] = true;
+            }
+        }
+    }
+    ParamWriteMask { written }
+}
+
+pub(crate) struct Remap {
+    pub vreg_table: Vec<VReg>,
+    pub param_copies: Vec<LpirOp>,
+    pub slot_offset: u32,
+}
+
+fn alloc_caller_vreg(caller: &mut IrFunction, ty: IrType) -> VReg {
+    let idx = caller.vreg_types.len() as u32;
+    caller.vreg_types.push(ty);
+    VReg(idx)
+}
+
+pub(crate) fn build_remap(
+    caller: &mut IrFunction,
+    callee: &IrFunction,
+    call_args: &[VReg],
+    _call_results: &[VReg],
+    param_writes: &ParamWriteMask,
+) -> Remap {
+    let n = callee.vreg_types.len();
+    debug_assert_eq!(
+        call_args.len(),
+        1 + callee.param_count as usize,
+        "call args arity"
+    );
+
+    let mut vreg_table = alloc::vec![VREG_SENTINEL; n];
+    let mut param_copies = Vec::new();
+
+    vreg_table[0] = VMCTX_VREG;
+
+    for i in 1..=callee.param_count as usize {
+        let idx = i;
+        if !param_writes.written[i - 1] {
+            vreg_table[idx] = call_args[i];
+        } else {
+            let ty = callee.vreg_types[idx];
+            let dst = alloc_caller_vreg(caller, ty);
+            vreg_table[idx] = dst;
+            param_copies.push(LpirOp::Copy {
+                dst,
+                src: call_args[i],
+            });
+        }
+    }
+
+    for idx in (callee.param_count as usize + 1)..n {
+        let ty = callee.vreg_types[idx];
+        vreg_table[idx] = alloc_caller_vreg(caller, ty);
+    }
+
+    debug_assert!(!vreg_table.iter().any(|&v| v == VREG_SENTINEL));
+
+    let slot_offset = caller.slots.len() as u32;
+    for s in &callee.slots {
+        caller.slots.push(s.clone());
+    }
+
+    Remap {
+        vreg_table,
+        param_copies,
+        slot_offset,
+    }
+}
+
+fn map_vreg(table: &[VReg], v: VReg) -> VReg {
+    table[v.0 as usize]
+}
+
+fn map_slot(off: u32, s: SlotId) -> SlotId {
+    SlotId(s.0 + off)
+}
+
+fn remap_vreg_range(
+    range: VRegRange,
+    remap: &Remap,
+    caller_pool: &mut Vec<VReg>,
+    callee_pool: &[VReg],
+) -> VRegRange {
+    let start_idx = range.start as usize;
+    let count = range.count as usize;
+    let end = start_idx + count;
+    let slice = &callee_pool[start_idx..end];
+    let start = caller_pool.len() as u32;
+    for &v in slice {
+        caller_pool.push(map_vreg(&remap.vreg_table, v));
+    }
+    VRegRange {
+        start,
+        count: range.count,
+    }
+}
+
+pub(crate) fn remap_op(
+    op: &LpirOp,
+    remap: &Remap,
+    caller_vreg_pool: &mut Vec<VReg>,
+    callee_vreg_pool: &[VReg],
+) -> LpirOp {
+    let m = |v: VReg| map_vreg(&remap.vreg_table, v);
+    let ms = |s: SlotId| map_slot(remap.slot_offset, s);
+
+    match op {
+        LpirOp::Fadd { dst, lhs, rhs } => LpirOp::Fadd {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fsub { dst, lhs, rhs } => LpirOp::Fsub {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fmul { dst, lhs, rhs } => LpirOp::Fmul {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fdiv { dst, lhs, rhs } => LpirOp::Fdiv {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fneg { dst, src } => LpirOp::Fneg {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Fabs { dst, src } => LpirOp::Fabs {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Fsqrt { dst, src } => LpirOp::Fsqrt {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Fmin { dst, lhs, rhs } => LpirOp::Fmin {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fmax { dst, lhs, rhs } => LpirOp::Fmax {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Ffloor { dst, src } => LpirOp::Ffloor {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Fceil { dst, src } => LpirOp::Fceil {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Ftrunc { dst, src } => LpirOp::Ftrunc {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Fnearest { dst, src } => LpirOp::Fnearest {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Iadd { dst, lhs, rhs } => LpirOp::Iadd {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Isub { dst, lhs, rhs } => LpirOp::Isub {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Imul { dst, lhs, rhs } => LpirOp::Imul {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IdivS { dst, lhs, rhs } => LpirOp::IdivS {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IdivU { dst, lhs, rhs } => LpirOp::IdivU {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IremS { dst, lhs, rhs } => LpirOp::IremS {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IremU { dst, lhs, rhs } => LpirOp::IremU {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Ineg { dst, src } => LpirOp::Ineg {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Feq { dst, lhs, rhs } => LpirOp::Feq {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fne { dst, lhs, rhs } => LpirOp::Fne {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Flt { dst, lhs, rhs } => LpirOp::Flt {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fle { dst, lhs, rhs } => LpirOp::Fle {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fgt { dst, lhs, rhs } => LpirOp::Fgt {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Fge { dst, lhs, rhs } => LpirOp::Fge {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Ieq { dst, lhs, rhs } => LpirOp::Ieq {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Ine { dst, lhs, rhs } => LpirOp::Ine {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IltS { dst, lhs, rhs } => LpirOp::IltS {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IleS { dst, lhs, rhs } => LpirOp::IleS {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IgtS { dst, lhs, rhs } => LpirOp::IgtS {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IgeS { dst, lhs, rhs } => LpirOp::IgeS {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IltU { dst, lhs, rhs } => LpirOp::IltU {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IleU { dst, lhs, rhs } => LpirOp::IleU {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IgtU { dst, lhs, rhs } => LpirOp::IgtU {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IgeU { dst, lhs, rhs } => LpirOp::IgeU {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Iand { dst, lhs, rhs } => LpirOp::Iand {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Ior { dst, lhs, rhs } => LpirOp::Ior {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Ixor { dst, lhs, rhs } => LpirOp::Ixor {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::Ibnot { dst, src } => LpirOp::Ibnot {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Ishl { dst, lhs, rhs } => LpirOp::Ishl {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IshrS { dst, lhs, rhs } => LpirOp::IshrS {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::IshrU { dst, lhs, rhs } => LpirOp::IshrU {
+            dst: m(*dst),
+            lhs: m(*lhs),
+            rhs: m(*rhs),
+        },
+        LpirOp::FconstF32 { dst, value } => LpirOp::FconstF32 {
+            dst: m(*dst),
+            value: *value,
+        },
+        LpirOp::IconstI32 { dst, value } => LpirOp::IconstI32 {
+            dst: m(*dst),
+            value: *value,
+        },
+        LpirOp::IaddImm { dst, src, imm } => LpirOp::IaddImm {
+            dst: m(*dst),
+            src: m(*src),
+            imm: *imm,
+        },
+        LpirOp::IsubImm { dst, src, imm } => LpirOp::IsubImm {
+            dst: m(*dst),
+            src: m(*src),
+            imm: *imm,
+        },
+        LpirOp::ImulImm { dst, src, imm } => LpirOp::ImulImm {
+            dst: m(*dst),
+            src: m(*src),
+            imm: *imm,
+        },
+        LpirOp::IshlImm { dst, src, imm } => LpirOp::IshlImm {
+            dst: m(*dst),
+            src: m(*src),
+            imm: *imm,
+        },
+        LpirOp::IshrSImm { dst, src, imm } => LpirOp::IshrSImm {
+            dst: m(*dst),
+            src: m(*src),
+            imm: *imm,
+        },
+        LpirOp::IshrUImm { dst, src, imm } => LpirOp::IshrUImm {
+            dst: m(*dst),
+            src: m(*src),
+            imm: *imm,
+        },
+        LpirOp::IeqImm { dst, src, imm } => LpirOp::IeqImm {
+            dst: m(*dst),
+            src: m(*src),
+            imm: *imm,
+        },
+        LpirOp::FtoiSatS { dst, src } => LpirOp::FtoiSatS {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::FtoiSatU { dst, src } => LpirOp::FtoiSatU {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::ItofS { dst, src } => LpirOp::ItofS {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::ItofU { dst, src } => LpirOp::ItofU {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::FfromI32Bits { dst, src } => LpirOp::FfromI32Bits {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::FtoUnorm16 { dst, src } => LpirOp::FtoUnorm16 {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::FtoUnorm8 { dst, src } => LpirOp::FtoUnorm8 {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Unorm16toF { dst, src } => LpirOp::Unorm16toF {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Unorm8toF { dst, src } => LpirOp::Unorm8toF {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::Select {
+            dst,
+            cond,
+            if_true,
+            if_false,
+        } => LpirOp::Select {
+            dst: m(*dst),
+            cond: m(*cond),
+            if_true: m(*if_true),
+            if_false: m(*if_false),
+        },
+        LpirOp::Copy { dst, src } => LpirOp::Copy {
+            dst: m(*dst),
+            src: m(*src),
+        },
+        LpirOp::SlotAddr { dst, slot } => LpirOp::SlotAddr {
+            dst: m(*dst),
+            slot: ms(*slot),
+        },
+        LpirOp::Load { dst, base, offset } => LpirOp::Load {
+            dst: m(*dst),
+            base: m(*base),
+            offset: *offset,
+        },
+        LpirOp::Store {
+            base,
+            offset,
+            value,
+        } => LpirOp::Store {
+            base: m(*base),
+            offset: *offset,
+            value: m(*value),
+        },
+        LpirOp::Store8 {
+            base,
+            offset,
+            value,
+        } => LpirOp::Store8 {
+            base: m(*base),
+            offset: *offset,
+            value: m(*value),
+        },
+        LpirOp::Store16 {
+            base,
+            offset,
+            value,
+        } => LpirOp::Store16 {
+            base: m(*base),
+            offset: *offset,
+            value: m(*value),
+        },
+        LpirOp::Load8U { dst, base, offset } => LpirOp::Load8U {
+            dst: m(*dst),
+            base: m(*base),
+            offset: *offset,
+        },
+        LpirOp::Load8S { dst, base, offset } => LpirOp::Load8S {
+            dst: m(*dst),
+            base: m(*base),
+            offset: *offset,
+        },
+        LpirOp::Load16U { dst, base, offset } => LpirOp::Load16U {
+            dst: m(*dst),
+            base: m(*base),
+            offset: *offset,
+        },
+        LpirOp::Load16S { dst, base, offset } => LpirOp::Load16S {
+            dst: m(*dst),
+            base: m(*base),
+            offset: *offset,
+        },
+        LpirOp::Memcpy {
+            dst_addr,
+            src_addr,
+            size,
+        } => LpirOp::Memcpy {
+            dst_addr: m(*dst_addr),
+            src_addr: m(*src_addr),
+            size: *size,
+        },
+        LpirOp::IfStart {
+            cond,
+            else_offset: _,
+            end_offset: _,
+        } => LpirOp::IfStart {
+            cond: m(*cond),
+            else_offset: 0,
+            end_offset: 0,
+        },
+        LpirOp::Else => LpirOp::Else,
+        LpirOp::Continuing => LpirOp::Continuing,
+        LpirOp::LoopStart {
+            continuing_offset: _,
+            end_offset: _,
+        } => LpirOp::LoopStart {
+            continuing_offset: 0,
+            end_offset: 0,
+        },
+        LpirOp::SwitchStart {
+            selector,
+            end_offset: _,
+        } => LpirOp::SwitchStart {
+            selector: m(*selector),
+            end_offset: 0,
+        },
+        LpirOp::CaseStart {
+            value,
+            end_offset: _,
+        } => LpirOp::CaseStart {
+            value: *value,
+            end_offset: 0,
+        },
+        LpirOp::DefaultStart { end_offset: _ } => LpirOp::DefaultStart { end_offset: 0 },
+        LpirOp::End => LpirOp::End,
+        LpirOp::Block { end_offset: _ } => LpirOp::Block { end_offset: 0 },
+        LpirOp::Break => LpirOp::Break,
+        LpirOp::Continue => LpirOp::Continue,
+        LpirOp::BrIfNot { cond } => LpirOp::BrIfNot { cond: m(*cond) },
+        LpirOp::ExitBlock => LpirOp::ExitBlock,
+        LpirOp::Call {
+            callee,
+            args,
+            results,
+        } => {
+            let callee = *callee;
+            let args = remap_vreg_range(*args, remap, caller_vreg_pool, callee_vreg_pool);
+            let results = remap_vreg_range(*results, remap, caller_vreg_pool, callee_vreg_pool);
+            LpirOp::Call {
+                callee,
+                args,
+                results,
+            }
+        }
+        LpirOp::Return { .. } => op.clone(), // passthrough: `inline_call_site` lowers returns itself
+    }
+}
diff --git a/lp-shader/lpir/src/inline/splice.rs b/lp-shader/lpir/src/inline/splice.rs
new file mode 100644
index 000000000..1f679bb70
--- /dev/null
+++ b/lp-shader/lpir/src/inline/splice.rs
@@ -0,0 +1,120 @@
+//! Replace a [`LpirOp::Call`] with an inlined, remapped callee body.
+
+use alloc::vec::Vec;
+
+use crate::inline::remap::{build_remap, remap_op, scan_param_writes};
+use crate::lpir_module::IrFunction;
+use crate::lpir_op::LpirOp;
+use crate::types::VReg;
+
+enum ReturnShape {
+    None,
+    SingleAtEnd,
+    Multi,
+}
+
+fn classify_return_shape(body: &[LpirOp]) -> ReturnShape {
+    let mut return_indices = Vec::new();
+    for (i, op) in body.iter().enumerate() {
+        if matches!(op, LpirOp::Return { .. }) {
+            return_indices.push(i);
+        }
+    }
+    match return_indices.len() {
+        0 => ReturnShape::None,
+        1 => {
+            let ri = return_indices[0];
+            if ri + 1 == body.len() {
+                ReturnShape::SingleAtEnd
+            } else {
+                ReturnShape::Multi
+            }
+        }
+        _ => ReturnShape::Multi,
+    }
+}
+
+pub(crate) fn inline_call_site(caller: &mut IrFunction, callee: &IrFunction, call_op_idx: usize) {
+    let (args_range, results_range) = match &caller.body.get(call_op_idx) {
+        Some(LpirOp::Call { args, results, .. }) => (*args, *results),
+        _ => return,
+    };
+
+    let call_args: Vec<VReg> = caller.pool_slice(args_range).to_vec();
+    let call_results: Vec<VReg> = caller.pool_slice(results_range).to_vec();
+
+    debug_assert_eq!(
+        call_args.len(),
+        1 + callee.param_count as usize,
+        "inline call args arity"
+    );
+    debug_assert_eq!(
+        call_results.len(),
+        callee.return_types.len(),
+        "inline call results arity"
+    );
+    if call_args.len() != 1 + callee.param_count as usize
+        || call_results.len() != callee.return_types.len()
+    {
+        return;
+    }
+
+    let pw = scan_param_writes(callee);
+    let rmap = build_remap(caller, callee, &call_args, &call_results, &pw);
+
+    let shape = classify_return_shape(&callee.body);
+    let needs_block = matches!(shape, ReturnShape::Multi);
+
+    let mut scratch: Vec<LpirOp> = Vec::new();
+    scratch.extend_from_slice(&rmap.param_copies);
+
+    if needs_block {
+        scratch.push(LpirOp::Block { end_offset: 0 });
+    }
+
+    let mut last_was_exit_block = false;
+
+    for op in &callee.body {
+        match op {
+            LpirOp::Return { values } => {
+                let vals = callee.pool_slice(*values);
+                debug_assert_eq!(vals.len(), call_results.len());
+                if vals.len() != call_results.len() {
+                    return;
+                }
+                for (k, &src_raw) in vals.iter().enumerate() {
+                    let src = rmap.vreg_table[src_raw.0 as usize];
+                    scratch.push(LpirOp::Copy {
+                        dst: call_results[k],
+                        src,
+                    });
+                }
+                if needs_block {
+                    scratch.push(LpirOp::ExitBlock);
+                    last_was_exit_block = true;
+                } else {
+                    last_was_exit_block = false;
+                }
+            }
+            _ => {
+                last_was_exit_block = false;
+                scratch.push(remap_op(
+                    op,
+                    &rmap,
+                    &mut caller.vreg_pool,
+                    &callee.vreg_pool,
+                ));
+            }
+        }
+    }
+
+    if needs_block && !last_was_exit_block {
+        scratch.push(LpirOp::ExitBlock);
+    }
+
+    if needs_block {
+        scratch.push(LpirOp::End);
+    }
+
+    caller.body.splice(call_op_idx..=call_op_idx, scratch);
+}
diff --git a/lp-shader/lpir/src/interp.rs b/lp-shader/lpir/src/interp.rs
index 678026032..0260ba134 100644
--- a/lp-shader/lpir/src/interp.rs
+++ b/lp-shader/lpir/src/interp.rs
@@ -196,6 +196,9 @@ fn exec_func(
                     return Err(InterpError::Internal("exit_block outside block".into()));
                 }
             }
+            LpirOp::Continuing => {
+                pc += 1;
+            }
             LpirOp::End => match ctrl.last() {
                 Some(Ctrl::Loop { exit, head, .. }) if *exit == pc + 1 => {
                     pc = *head + 1;
diff --git a/lp-shader/lpir/src/lib.rs b/lp-shader/lpir/src/lib.rs
index b23de03d5..455fb07b3 100644
--- a/lp-shader/lpir/src/lib.rs
+++ b/lp-shader/lpir/src/lib.rs
@@ -9,6 +9,8 @@ extern crate alloc;
 pub mod builder;
 pub mod compiler_config;
 pub mod const_fold;
+pub mod dead_func_elim;
+mod inline;
 pub mod interp;
 pub mod lpir_module;
 pub mod lpir_op;
@@ -21,7 +23,12 @@ pub mod validate;
 mod tests;
 
 pub use builder::{FunctionBuilder, ModuleBuilder};
-pub use compiler_config::{CompilerConfig, ConfigError, InlineConfig, InlineMode};
+pub use compiler_config::{
+    COMPILER_CONFIG_APPLY_HELP, COMPILER_CONFIG_KEYS_HELP, CompilerConfig, ConfigError,
+    DeadFuncElimConfig, DeadFuncElimMode, InlineConfig, InlineMode,
+};
+pub use dead_func_elim::{DeadFuncElimResult, dead_func_elim, roots_by_name, roots_from_is_entry};
+pub use inline::{InlineResult, inline_module};
 pub use interp::{ImportHandler, InterpError, Value, interpret, interpret_with_depth};
 pub use lpir_module::{ImportDecl, IrFunction, LpirModule, SlotDecl, VMCTX_VREG};
 pub use lpir_op::LpirOp;
@@ -29,3 +36,10 @@ pub use parse::{ParseError, parse_module};
 pub use print::print_module;
 pub use types::{CalleeRef, FloatMode, FuncId, ImportId, IrType, SlotId, VReg, VRegRange};
 pub use validate::{ValidationError, validate_function, validate_module};
+
+/// Candidate inline size metrics for M3.1 (`func_weight` tuning), re-exported from the private `inline::heuristic` module.
+pub mod inline_weights {
+    pub use crate::inline::heuristic::{
+        WeightKind, weight, weight_body_len, weight_heavy_bias, weight_markers_zero,
+    };
+}
diff --git a/lp-shader/lpir/src/lpir_op.rs b/lp-shader/lpir/src/lpir_op.rs
index cbd402085..f3b3ea2a0 100644
--- a/lp-shader/lpir/src/lpir_op.rs
+++ b/lp-shader/lpir/src/lpir_op.rs
@@ -404,6 +404,9 @@ pub enum LpirOp {
     },
     /// False branch target; if reached by fall-through from the then-arm, jump to the enclosing `IfStart`'s `end_offset`.
     Else,
+    /// Marker for the start of the continuing block of the enclosing [`LpirOp::LoopStart`].
+    /// Position is cached in [`LpirOp::LoopStart::continuing_offset`] for fast backend access.
+    Continuing,
     LoopStart {
         continuing_offset: u32,
         end_offset: u32,
@@ -531,6 +534,7 @@ impl LpirOp {
             | LpirOp::Call { .. }
             | LpirOp::IfStart { .. }
             | LpirOp::Else
+            | LpirOp::Continuing
             | LpirOp::End
             | LpirOp::LoopStart { .. }
             | LpirOp::Break
diff --git a/lp-shader/lpir/src/print.rs b/lp-shader/lpir/src/print.rs
index 88d217f3c..dcbd9a6f6 100644
--- a/lp-shader/lpir/src/print.rs
+++ b/lp-shader/lpir/src/print.rs
@@ -8,8 +8,7 @@ use core::fmt::Write as _;
 
 use crate::lpir_module::{ImportDecl, IrFunction, LpirModule, VMCTX_VREG};
 use crate::lpir_op::LpirOp;
-use crate::types::ImportId;
-use crate::types::{CalleeRef, IrType, VReg};
+use crate::types::{CalleeRef, ImportId, IrType, VReg};
 
 fn callee_needs_vmctx_operand(module: &LpirModule, callee: CalleeRef) -> bool {
     match callee {
@@ -37,6 +36,7 @@ enum Block {
     If,
     Else,
     Loop {
+        #[allow(dead_code)]
         start_pc: usize,
     },
     Switch,
@@ -176,17 +176,6 @@ fn print_op_at(
     pc: &mut usize,
     depth: &mut usize,
 ) {
-    if let Some(Block::Loop { start_pc }) = stack.last() {
-        if let LpirOp::LoopStart {
-            continuing_offset, ..
-        } = &body[*start_pc]
-        {
-            let co = *continuing_offset as usize;
-            if co != *start_pc + 1 && *pc == co {
-                let _ = writeln!(out, "{}continuing:", indent_str(*depth));
-            }
-        }
-    }
     let ind = indent_str(*depth);
     match &body[*pc] {
         LpirOp::IfStart { cond, .. } => {
@@ -205,6 +194,10 @@ fn print_op_at(
             let _ = writeln!(out, "{}}} else {{", indent_str(*depth - 1));
             *pc += 1;
         }
+        LpirOp::Continuing => {
+            let _ = writeln!(out, "{ind}continuing:");
+            *pc += 1;
+        }
         LpirOp::LoopStart { .. } => {
             let _ = writeln!(out, "{ind}loop {{");
             stack.push(Block::Loop { start_pc: *pc });
diff --git a/lp-shader/lpir/src/tests.rs b/lp-shader/lpir/src/tests.rs
index b3bd3dd76..8d8400641 100644
--- a/lp-shader/lpir/src/tests.rs
+++ b/lp-shader/lpir/src/tests.rs
@@ -9,6 +9,30 @@ mod block_ops;
 #[path = "tests/interp.rs"]
 mod interp;
 
+#[path = "tests/inline_offsets.rs"]
+mod inline_offsets;
+
+#[path = "tests/inline_callgraph.rs"]
+mod inline_callgraph;
+
+#[path = "tests/inline_param_writes.rs"]
+mod inline_param_writes;
+
+#[path = "tests/inline_remap.rs"]
+mod inline_remap;
+
+#[path = "tests/inline_basic.rs"]
+mod inline_basic;
+
+#[path = "tests/inline_heuristic.rs"]
+mod inline_heuristic;
+
+#[path = "tests/inline_weights.rs"]
+mod inline_weights;
+
+#[path = "tests/dead_func_elim.rs"]
+mod dead_func_elim;
+
 #[path = "tests/validate.rs"]
 mod validate;
 
diff --git a/lp-shader/lpir/src/tests/all_ops_roundtrip.rs b/lp-shader/lpir/src/tests/all_ops_roundtrip.rs
index be6b1b995..46f79afb3 100644
--- a/lp-shader/lpir/src/tests/all_ops_roundtrip.rs
+++ b/lp-shader/lpir/src/tests/all_ops_roundtrip.rs
@@ -333,6 +333,16 @@ pub(crate) fn module_all_ops() -> LpirModule {
     b.push(LpirOp::Break);
     b.end_loop();
 
+    b.push_loop();
+    b.push(LpirOp::BrIfNot { cond: i0 });
+    b.push_continuing();
+    let cont_v = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 {
+        dst: cont_v,
+        value: 42,
+    });
+    b.end_loop();
+
     b.push_switch(i1);
     b.push_case(0);
     let z0 = b.alloc_vreg(IrType::I32);
diff --git a/lp-shader/lpir/src/tests/block_ops.rs b/lp-shader/lpir/src/tests/block_ops.rs
index cbe8ddf7b..df7ee40e8 100644
--- a/lp-shader/lpir/src/tests/block_ops.rs
+++ b/lp-shader/lpir/src/tests/block_ops.rs
@@ -4,6 +4,7 @@ use alloc::string::String;
 use alloc::vec::Vec;
 
 use crate::interp::{ImportHandler, InterpError, Value, interpret};
+use crate::lpir_op::LpirOp;
 use crate::parse::parse_module;
 use crate::print::print_module;
 use crate::validate::validate_module;
@@ -72,6 +73,41 @@ fn block_exit_from_inside_if() {
     assert_eq!(run_i32(ir, "f", &[Value::I32(0)]), 7);
 }
 
+#[test]
+fn loop_continuing_offset_points_at_marker_op() {
+    let ir = "func @f(v1:i32) -> i32 {
+  v2:i32 = iconst.i32 0
+  loop {
+    v3:i32 = iconst.i32 1
+    continuing:
+    v2 = iadd v2, v3
+    br_if_not v1
+  }
+  return v2
+}
+";
+    let module = parse_module(ir).unwrap_or_else(|e| panic!("parse: {e:?}"));
+    validate_module(&module).unwrap_or_else(|e| panic!("validate: {e:?}"));
+    let f = module.functions.values().next().expect("one func");
+    let (loop_pc, co) = f
+        .body
+        .iter()
+        .enumerate()
+        .find_map(|(i, op)| {
+            if let LpirOp::LoopStart {
+                continuing_offset, ..
+            } = op
+            {
+                Some((i, *continuing_offset as usize))
+            } else {
+                None
+            }
+        })
+        .expect("LoopStart");
+    assert!(matches!(f.body.get(co), Some(LpirOp::Continuing)));
+    assert_eq!(co, loop_pc + 2);
+}
+
 #[test]
 fn block_text_round_trip() {
     let src = "func @f(v1:i32) -> i32 {
diff --git a/lp-shader/lpir/src/tests/dead_func_elim.rs b/lp-shader/lpir/src/tests/dead_func_elim.rs
new file mode 100644
index 000000000..6eb28b5b2
--- /dev/null
+++ b/lp-shader/lpir/src/tests/dead_func_elim.rs
@@ -0,0 +1,313 @@
+//! Tests for [`crate::dead_func_elim`].
+
+use alloc::string::String;
+use alloc::vec;
+
+use crate::builder::{FunctionBuilder, ModuleBuilder};
+use crate::dead_func_elim::{dead_func_elim, roots_by_name, roots_from_is_entry};
+use crate::lpir_module::{ImportDecl, VMCTX_VREG};
+use crate::lpir_op::LpirOp;
+use crate::print::print_module;
+use crate::types::{CalleeRef, FuncId, IrType};
+use crate::validate::validate_module;
+
+#[test]
+fn removes_unreachable_leaf() {
+    let mut mb = ModuleBuilder::new();
+    let mut dead_helper = FunctionBuilder::new("dead_helper", &[IrType::I32]);
+    let _ = dead_helper.add_param(IrType::I32);
+    let v = dead_helper.alloc_vreg(IrType::I32);
+    dead_helper.push(LpirOp::IconstI32 { dst: v, value: 42 });
+    dead_helper.push_return(&[v]);
+    mb.add_function(dead_helper.finish());
+
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let _p = main.add_param(IrType::I32);
+    let o = main.alloc_vreg(IrType::I32);
+    main.push(LpirOp::IconstI32 { dst: o, value: 0 });
+    main.push_return(&[o]);
+    mb.add_function(main.finish());
+
+    let mut module = mb.finish();
+    let roots = roots_by_name(&module, &["main"]);
+    let r = dead_func_elim(&mut module, &roots);
+    assert_eq!(r.functions_removed, 1);
+    assert_eq!(module.function_count(), 1);
+    assert!(module.functions.values().all(|f| f.name == "main"));
+    validate_module(&module).unwrap();
+}
+
+#[test]
+fn keeps_transitively_reachable() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("c", &[IrType::I32]);
+    let _ = c.add_param(IrType::I32);
+    let v = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: v, value: 1 });
+    c.push_return(&[v]);
+    mb.add_function(c.finish());
+    let id_c = FuncId(0);
+
+    let mut b = FunctionBuilder::new("b", &[IrType::I32]);
+    let pb = b.add_param(IrType::I32);
+    let o = b.alloc_vreg(IrType::I32);
+    b.push_call(
+        CalleeRef::Local(id_c),
+        &[VMCTX_VREG, pb],
+        core::slice::from_ref(&o),
+    );
+    b.push_return(&[o]);
+    mb.add_function(b.finish());
+    let id_b = FuncId(1);
+
+    let mut a = FunctionBuilder::new("a", &[IrType::I32]);
+    let pa = a.add_param(IrType::I32);
+    let o = a.alloc_vreg(IrType::I32);
+    a.push_call(
+        CalleeRef::Local(id_b),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o),
+    );
+    a.push_return(&[o]);
+    mb.add_function(a.finish());
+    let id_a = FuncId(2);
+
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let p = main.add_param(IrType::I32);
+    let o = main.alloc_vreg(IrType::I32);
+    main.push_call(
+        CalleeRef::Local(id_a),
+        &[VMCTX_VREG, p],
+        core::slice::from_ref(&o),
+    );
+    main.push_return(&[o]);
+    mb.add_function(main.finish());
+
+    let mut module = mb.finish();
+    let roots = roots_by_name(&module, &["main"]);
+    let n = module.function_count();
+    let r = dead_func_elim(&mut module, &roots);
+    assert_eq!(r.functions_removed, 0);
+    assert_eq!(module.function_count(), n);
+    validate_module(&module).unwrap();
+}
+
+#[test]
+fn removes_unreachable_cycle() {
+    let mut mb = ModuleBuilder::new();
+    let mut a = FunctionBuilder::new("a", &[IrType::I32]);
+    let p = a.add_param(IrType::I32);
+    let o = a.alloc_vreg(IrType::I32);
+    a.push_call(
+        CalleeRef::Local(FuncId(1)),
+        &[VMCTX_VREG, p],
+        core::slice::from_ref(&o),
+    );
+    a.push_return(&[o]);
+    let mut b = FunctionBuilder::new("b", &[IrType::I32]);
+    let p = b.add_param(IrType::I32);
+    let o = b.alloc_vreg(IrType::I32);
+    b.push_call(
+        CalleeRef::Local(FuncId(0)),
+        &[VMCTX_VREG, p],
+        core::slice::from_ref(&o),
+    );
+    b.push_return(&[o]);
+    mb.add_function(a.finish());
+    mb.add_function(b.finish());
+
+    let mut module = mb.finish();
+    let r = dead_func_elim(&mut module, &[]);
+    assert_eq!(r.functions_removed, 2);
+    assert!(module.functions.is_empty());
+    validate_module(&module).unwrap();
+}
+
+#[test]
+fn multiple_roots() {
+    let mut mb = ModuleBuilder::new();
+    for name in ["h_main", "h_init", "dead_orphan"] {
+        let mut f = FunctionBuilder::new(name, &[IrType::I32]);
+        let _ = f.add_param(IrType::I32);
+        let o = f.alloc_vreg(IrType::I32);
+        f.push(LpirOp::IconstI32 { dst: o, value: 0 });
+        f.push_return(&[o]);
+        mb.add_function(f.finish());
+    }
+    let h_main = FuncId(0);
+    let h_init = FuncId(1);
+
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let p = main.add_param(IrType::I32);
+    let o = main.alloc_vreg(IrType::I32);
+    main.push_call(
+        CalleeRef::Local(h_main),
+        &[VMCTX_VREG, p],
+        core::slice::from_ref(&o),
+    );
+    main.push_return(&[o]);
+    mb.add_function(main.finish());
+
+    let mut shader_init = FunctionBuilder::new("__shader_init", &[IrType::I32]);
+    let p = shader_init.add_param(IrType::I32);
+    let o = shader_init.alloc_vreg(IrType::I32);
+    shader_init.push_call(
+        CalleeRef::Local(h_init),
+        &[VMCTX_VREG, p],
+        core::slice::from_ref(&o),
+    );
+    shader_init.push_return(&[o]);
+    mb.add_function(shader_init.finish());
+
+    let mut module = mb.finish();
+    let roots = roots_by_name(&module, &["main", "__shader_init"]);
+    let r = dead_func_elim(&mut module, &roots);
+    assert_eq!(r.functions_removed, 1);
+    assert_eq!(module.function_count(), 4);
+    assert!(!module.functions.values().any(|f| f.name == "dead_orphan"));
+    validate_module(&module).unwrap();
+}
+
+#[test]
+fn no_roots_removes_everything() {
+    let mut mb = ModuleBuilder::new();
+    for name in ["a", "b"] {
+        let mut f = FunctionBuilder::new(name, &[IrType::I32]);
+        let _ = f.add_param(IrType::I32);
+        let o = f.alloc_vreg(IrType::I32);
+        f.push(LpirOp::IconstI32 { dst: o, value: 0 });
+        f.push_return(&[o]);
+        mb.add_function(f.finish());
+    }
+    let mut module = mb.finish();
+    let r = dead_func_elim(&mut module, &[]);
+    assert_eq!(r.functions_removed, 2);
+    assert!(module.functions.is_empty());
+    validate_module(&module).unwrap();
+}
+
+#[test]
+fn roots_from_is_entry_picks_marked() {
+    let mut mb = ModuleBuilder::new();
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    main.set_entry();
+    let _ = main.add_param(IrType::I32);
+    let o = main.alloc_vreg(IrType::I32);
+    main.push(LpirOp::IconstI32 { dst: o, value: 0 });
+    main.push_return(&[o]);
+    mb.add_function(main.finish());
+    let mut other = FunctionBuilder::new("other", &[IrType::I32]);
+    let _ = other.add_param(IrType::I32);
+    let o = other.alloc_vreg(IrType::I32);
+    other.push(LpirOp::IconstI32 { dst: o, value: 1 });
+    other.push_return(&[o]);
+    mb.add_function(other.finish());
+
+    let module = mb.finish();
+    let roots = roots_from_is_entry(&module);
+    assert_eq!(roots, vec![FuncId(0)]);
+}
+
+#[test]
+fn roots_by_name_skips_unknown() {
+    let mut mb = ModuleBuilder::new();
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let _ = main.add_param(IrType::I32);
+    let o = main.alloc_vreg(IrType::I32);
+    main.push(LpirOp::IconstI32 { dst: o, value: 0 });
+    main.push_return(&[o]);
+    mb.add_function(main.finish());
+    let module = mb.finish();
+    let roots = roots_by_name(&module, &["main", "missing"]);
+    assert_eq!(roots, vec![FuncId(0)]);
+}
+
+#[test]
+fn noop_when_all_reachable() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("c", &[IrType::I32]);
+    let _ = c.add_param(IrType::I32);
+    let v = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: v, value: 1 });
+    c.push_return(&[v]);
+    mb.add_function(c.finish());
+    let id_c = FuncId(0);
+
+    let mut b = FunctionBuilder::new("b", &[IrType::I32]);
+    let pb = b.add_param(IrType::I32);
+    let o = b.alloc_vreg(IrType::I32);
+    b.push_call(
+        CalleeRef::Local(id_c),
+        &[VMCTX_VREG, pb],
+        core::slice::from_ref(&o),
+    );
+    b.push_return(&[o]);
+    mb.add_function(b.finish());
+    let id_b = FuncId(1);
+
+    let mut a = FunctionBuilder::new("a", &[IrType::I32]);
+    let pa = a.add_param(IrType::I32);
+    let o = a.alloc_vreg(IrType::I32);
+    a.push_call(
+        CalleeRef::Local(id_b),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o),
+    );
+    a.push_return(&[o]);
+    mb.add_function(a.finish());
+    let id_a = FuncId(2);
+
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let p = main.add_param(IrType::I32);
+    let o = main.alloc_vreg(IrType::I32);
+    main.push_call(
+        CalleeRef::Local(id_a),
+        &[VMCTX_VREG, p],
+        core::slice::from_ref(&o),
+    );
+    main.push_return(&[o]);
+    mb.add_function(main.finish());
+
+    let mut module = mb.finish();
+    let before = print_module(&module);
+    let roots = roots_by_name(&module, &["main"]);
+    let r = dead_func_elim(&mut module, &roots);
+    assert_eq!(r.functions_removed, 0);
+    assert_eq!(print_module(&module), before);
+    validate_module(&module).unwrap();
+}
+
+#[test]
+fn import_calls_dont_count_as_local_edges() {
+    let mut mb = ModuleBuilder::new();
+    let imp = mb.add_import(ImportDecl {
+        module_name: String::from("g"),
+        func_name: String::from("sin"),
+        param_types: vec![IrType::F32],
+        return_types: vec![IrType::F32],
+        lpfn_glsl_params: None,
+        needs_vmctx: true,
+    });
+
+    let mut only_import = FunctionBuilder::new("only_import_caller", &[IrType::F32]);
+    let p = only_import.add_param(IrType::F32);
+    let out = only_import.alloc_vreg(IrType::F32);
+    only_import.push_call(imp, &[VMCTX_VREG, p], &[out]);
+    only_import.push_return(&[out]);
+    mb.add_function(only_import.finish());
+
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let _p = main.add_param(IrType::I32);
+    let o = main.alloc_vreg(IrType::I32);
+    main.push(LpirOp::IconstI32 { dst: o, value: 0 });
+    main.push_return(&[o]);
+    mb.add_function(main.finish());
+
+    let mut module = mb.finish();
+    let roots = roots_by_name(&module, &["main"]);
+    let r = dead_func_elim(&mut module, &roots);
+    assert_eq!(r.functions_removed, 1);
+    assert_eq!(module.function_count(), 1);
+    assert!(module.functions.values().all(|f| f.name == "main"));
+    validate_module(&module).unwrap();
+}
diff --git a/lp-shader/lpir/src/tests/inline_basic.rs b/lp-shader/lpir/src/tests/inline_basic.rs
new file mode 100644
index 000000000..aed2285ab
--- /dev/null
+++ b/lp-shader/lpir/src/tests/inline_basic.rs
@@ -0,0 +1,558 @@
+//! Tests for [`crate::inline::splice::inline_call_site`] and the Phase 6 module inliner, [`crate::inline_module`].
+
+use alloc::string::String;
+use alloc::vec;
+use alloc::vec::Vec;
+
+use crate::builder::{FunctionBuilder, ModuleBuilder};
+use crate::inline::recompute_offsets;
+use crate::inline::splice::inline_call_site;
+use crate::interp::{ImportHandler, InterpError, Value, interpret};
+use crate::lpir_module::{ImportDecl, LpirModule, VMCTX_VREG};
+use crate::lpir_op::LpirOp;
+use crate::types::{CalleeRef, IrType};
+use crate::validate::validate_module;
+use crate::{InlineConfig, inline_module};
+
+struct NoImports;
+
+impl ImportHandler for NoImports {
+    fn call(&mut self, _: &str, _: &str, _: &[Value]) -> Result<Vec<Value>, InterpError> {
+        Err(InterpError::Import(String::from("no imports")))
+    }
+}
+
+struct SinImport;
+
+impl ImportHandler for SinImport {
+    fn call(
+        &mut self,
+        module: &str,
+        name: &str,
+        args: &[Value],
+    ) -> Result<Vec<Value>, InterpError> {
+        if module == "g" && name == "sin" {
+            let x = args.get(1).and_then(|v| v.as_f32()).unwrap_or(0.0);
+            return Ok(vec![Value::F32(libm::sinf(x))]);
+        }
+        Err(InterpError::Import(String::from("bad import")))
+    }
+}
+
+fn find_local_call(f: &crate::lpir_module::IrFunction) -> Option<usize> {
+    f.body.iter().enumerate().find_map(|(i, o)| {
+        matches!(
+            o,
+            LpirOp::Call {
+                callee: CalleeRef::Local(_),
+                ..
+            }
+        )
+        .then_some(i)
+    })
+}
+
+fn run_i32(module: &LpirModule, name: &str, args: &[Value]) -> i32 {
+    let out = interpret(module, name, args, &mut NoImports).unwrap();
+    assert_eq!(out.len(), 1);
+    out[0].as_i32().expect("i32")
+}
+
+#[test]
+fn void_callee() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("c", &[]);
+    let _ = c.add_param(IrType::I32);
+    let s0 = c.alloc_slot(4);
+    let base = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::SlotAddr {
+        dst: base,
+        slot: s0,
+    });
+    let z = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: z, value: 99 });
+    c.push(LpirOp::Store {
+        base,
+        offset: 0,
+        value: z,
+    });
+    c.push_return(&[]);
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::I32]);
+    let p = t.add_param(IrType::I32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[]);
+    let r = t.alloc_vreg(IrType::I32);
+    t.push(LpirOp::IconstI32 { dst: r, value: 0 });
+    t.push_return(&[r]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    let before = run_i32(&module, "t", &[Value::I32(1)]);
+    let caller_id = module.functions.keys().nth(1).copied().expect("caller");
+    let callee_fn = module.functions.values().nth(0).expect("callee").clone();
+    let idx = find_local_call(module.functions.get(&caller_id).expect("t")).expect("call");
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    let after = run_i32(&module, "t", &[Value::I32(1)]);
+    assert_eq!(before, after);
+}
+
+#[test]
+fn single_return_at_end() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("add1", &[IrType::I32]);
+    let a = c.add_param(IrType::I32);
+    let one = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: one, value: 1 });
+    let r = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::Iadd {
+        dst: r,
+        lhs: a,
+        rhs: one,
+    });
+    c.push_return(&[r]);
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::I32]);
+    let p = t.add_param(IrType::I32);
+    let out = t.alloc_vreg(IrType::I32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[out]);
+    t.push_return(&[out]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    let before = run_i32(&module, "t", &[Value::I32(41)]);
+    let caller_id = *module.functions.keys().nth(1).unwrap();
+    let callee_fn = module.functions.values().next().unwrap().clone();
+    let idx = find_local_call(module.functions.get(&caller_id).unwrap()).unwrap();
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    assert_eq!(run_i32(&module, "t", &[Value::I32(41)]), before);
+    assert!(
+        !module.functions[&caller_id]
+            .body
+            .iter()
+            .any(|o| matches!(o, LpirOp::Block { .. }))
+    );
+}
+
+#[test]
+fn single_return_not_at_end() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("early", &[IrType::I32]);
+    let a = c.add_param(IrType::I32);
+    c.push_if(a);
+    let neg = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::Ineg { dst: neg, src: a });
+    c.push_return(&[neg]);
+    c.push_else();
+    let z = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: z, value: 0 });
+    c.push_return(&[z]);
+    c.end_if();
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::I32]);
+    let p = t.add_param(IrType::I32);
+    let out = t.alloc_vreg(IrType::I32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[out]);
+    t.push_return(&[out]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    let before = run_i32(&module, "t", &[Value::I32(-5)]);
+    let caller_id = *module.functions.keys().nth(1).unwrap();
+    let callee_fn = module.functions.values().next().unwrap().clone();
+    let idx = find_local_call(module.functions.get(&caller_id).unwrap()).unwrap();
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    assert_eq!(run_i32(&module, "t", &[Value::I32(-5)]), before);
+    assert!(
+        module.functions[&caller_id]
+            .body
+            .iter()
+            .any(|o| matches!(o, LpirOp::Block { .. }))
+    );
+}
+
+#[test]
+fn multiple_returns() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("two_ret", &[IrType::I32]);
+    let a = c.add_param(IrType::I32);
+    c.push_if(a);
+    let one = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: one, value: 1 });
+    c.push_return(&[one]);
+    c.push_else();
+    let two = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: two, value: 2 });
+    c.push_return(&[two]);
+    c.end_if();
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::I32]);
+    let p = t.add_param(IrType::I32);
+    let out = t.alloc_vreg(IrType::I32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[out]);
+    t.push_return(&[out]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    let before0 = run_i32(&module, "t", &[Value::I32(0)]);
+    let before1 = run_i32(&module, "t", &[Value::I32(7)]);
+    let caller_id = *module.functions.keys().nth(1).unwrap();
+    let callee_fn = module.functions.values().next().unwrap().clone();
+    let idx = find_local_call(module.functions.get(&caller_id).unwrap()).unwrap();
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    assert_eq!(run_i32(&module, "t", &[Value::I32(0)]), before0);
+    assert_eq!(run_i32(&module, "t", &[Value::I32(7)]), before1);
+}
+
+#[test]
+fn nested_call_in_callee() {
+    let mut mb = ModuleBuilder::new();
+    let imp = mb.add_import(ImportDecl {
+        module_name: String::from("g"),
+        func_name: String::from("sin"),
+        param_types: vec![IrType::F32],
+        return_types: vec![IrType::F32],
+        lpfn_glsl_params: None,
+        needs_vmctx: true,
+    });
+
+    let mut c = FunctionBuilder::new("c", &[IrType::F32]);
+    let a = c.add_param(IrType::F32);
+    let out = c.alloc_vreg(IrType::F32);
+    c.push_call(imp, &[VMCTX_VREG, a], &[out]);
+    c.push_return(&[out]);
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::F32]);
+    let p = t.add_param(IrType::F32);
+    let r = t.alloc_vreg(IrType::F32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[r]);
+    t.push_return(&[r]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    let before = interpret(&module, "t", &[Value::F32(0.3)], &mut SinImport).unwrap();
+    let caller_id = *module.functions.keys().nth(1).unwrap();
+    let callee_fn = module.functions.values().next().unwrap().clone();
+    let idx = find_local_call(module.functions.get(&caller_id).unwrap()).unwrap();
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    let after = interpret(&module, "t", &[Value::F32(0.3)], &mut SinImport).unwrap();
+    assert_eq!(before.len(), after.len());
+    assert!((before[0].as_f32().unwrap() - after[0].as_f32().unwrap()).abs() < 1e-5);
+}
+
+#[test]
+fn mutated_param() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("c", &[IrType::I32]);
+    let a = c.add_param(IrType::I32);
+    let one = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: one, value: 1 });
+    c.push(LpirOp::Iadd {
+        dst: a,
+        lhs: a,
+        rhs: one,
+    });
+    c.push_return(&[a]);
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::I32]);
+    let p = t.add_param(IrType::I32);
+    let out = t.alloc_vreg(IrType::I32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[out]);
+    t.push_return(&[out]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    let caller_id = *module.functions.keys().nth(1).unwrap();
+    let callee_fn = module.functions.values().next().unwrap().clone();
+    let idx = find_local_call(module.functions.get(&caller_id).unwrap()).unwrap();
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    let copy_count = module.functions[&caller_id]
+        .body
+        .iter()
+        .filter(|o| matches!(o, LpirOp::Copy { .. }))
+        .count();
+    assert!(copy_count >= 2, "param Copy plus return Copy");
+}
+
+#[test]
+fn readonly_param() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("c", &[IrType::I32]);
+    let a = c.add_param(IrType::I32);
+    let r = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::Iadd {
+        dst: r,
+        lhs: a,
+        rhs: a,
+    });
+    c.push_return(&[r]);
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::I32]);
+    let p = t.add_param(IrType::I32);
+    let out = t.alloc_vreg(IrType::I32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[out]);
+    t.push_return(&[out]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    let caller_id = *module.functions.keys().nth(1).unwrap();
+    let callee_fn = module.functions.values().next().unwrap().clone();
+    let idx = find_local_call(module.functions.get(&caller_id).unwrap()).unwrap();
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    let copy_count = module.functions[&caller_id]
+        .body
+        .iter()
+        .filter(|o| matches!(o, LpirOp::Copy { .. }))
+        .count();
+    assert_eq!(
+        copy_count, 1,
+        "only return lowering Copy, no param preamble"
+    );
+}
+
+#[test]
+fn vmctx_propagation() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("c", &[IrType::I32]);
+    let _ = c.add_param(IrType::I32);
+    let r = c.alloc_vreg(IrType::I32);
+    // `Load` from VMCTX is not interpreted meaningfully in the harness, but it
+    // keeps `v0` as the base pointer through validation + remap.
+    c.push(LpirOp::Load {
+        dst: r,
+        base: VMCTX_VREG,
+        offset: 0,
+    });
+    c.push_return(&[r]);
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::I32]);
+    let p = t.add_param(IrType::I32);
+    let out = t.alloc_vreg(IrType::I32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[out]);
+    t.push_return(&[out]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    validate_module(&module).unwrap();
+    let caller_id = *module.functions.keys().nth(1).unwrap();
+    let callee_fn = module.functions.values().next().unwrap().clone();
+    let idx = find_local_call(module.functions.get(&caller_id).unwrap()).unwrap();
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    assert!(module.functions[&caller_id].body.iter().any(|o| matches!(
+        o,
+        LpirOp::Load {
+            base: VMCTX_VREG,
+            ..
+        }
+    )));
+}
+
+#[test]
+fn slot_remap() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("c", &[IrType::I32]);
+    let _ = c.add_param(IrType::I32);
+    let s0 = c.alloc_slot(4);
+    let s1 = c.alloc_slot(4);
+    let a0 = c.alloc_vreg(IrType::I32);
+    let a1 = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::SlotAddr { dst: a0, slot: s0 });
+    c.push(LpirOp::SlotAddr { dst: a1, slot: s1 });
+    let forty_two = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 {
+        dst: forty_two,
+        value: 42,
+    });
+    c.push(LpirOp::Store {
+        base: a0,
+        offset: 0,
+        value: forty_two,
+    });
+    let r = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::Load {
+        dst: r,
+        base: a0,
+        offset: 0,
+    });
+    c.push_return(&[r]);
+    let cref = mb.add_function(c.finish());
+
+    let mut t = FunctionBuilder::new("t", &[IrType::I32]);
+    for _ in 0..3 {
+        t.alloc_slot(1);
+    }
+    let p = t.add_param(IrType::I32);
+    let out = t.alloc_vreg(IrType::I32);
+    t.push_call(cref, &[VMCTX_VREG, p], &[out]);
+    t.push_return(&[out]);
+    mb.add_function(t.finish());
+
+    let mut module = mb.finish();
+    let caller_id = *module.functions.keys().nth(1).unwrap();
+    let callee_fn = module.functions.values().next().unwrap().clone();
+    let idx = find_local_call(module.functions.get(&caller_id).unwrap()).unwrap();
+    {
+        let caller = module.functions.get_mut(&caller_id).unwrap();
+        inline_call_site(caller, &callee_fn, idx);
+        recompute_offsets(&mut caller.body);
+    }
+    validate_module(&module).unwrap();
+    assert_eq!(run_i32(&module, "t", &[Value::I32(0)]), 42);
+    let slots: Vec<_> = module.functions[&caller_id]
+        .body
+        .iter()
+        .filter_map(|o| {
+            if let LpirOp::SlotAddr { slot, .. } = o {
+                Some(slot.0)
+            } else {
+                None
+            }
+        })
+        .collect();
+    assert!(slots.contains(&3));
+    assert!(slots.contains(&4));
+}
+
+#[test]
+fn leaf_inlined_into_caller() {
+    let mut mb = ModuleBuilder::new();
+    let mut leaf = FunctionBuilder::new("leaf", &[IrType::I32]);
+    let _ = leaf.add_param(IrType::I32);
+    let v = leaf.alloc_vreg(IrType::I32);
+    leaf.push(LpirOp::IconstI32 { dst: v, value: 99 });
+    leaf.push_return(&[v]);
+    let cref = mb.add_function(leaf.finish());
+
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let p = main.add_param(IrType::I32);
+    let out = main.alloc_vreg(IrType::I32);
+    main.push_call(cref, &[VMCTX_VREG, p], &[out]);
+    main.push_return(&[out]);
+    mb.add_function(main.finish());
+
+    let mut module = mb.finish();
+    let want = run_i32(&module, "main", &[Value::I32(0)]);
+    let cfg = InlineConfig::default();
+    let r = inline_module(&mut module, &cfg);
+    assert!(r.call_sites_replaced >= 1);
+    validate_module(&module).unwrap();
+    assert_eq!(run_i32(&module, "main", &[Value::I32(0)]), want);
+}
+
+#[test]
+fn chain_inlined_bottom_up() {
+    let mut mb = ModuleBuilder::new();
+    let mut c = FunctionBuilder::new("c", &[IrType::I32]);
+    let _ = c.add_param(IrType::I32);
+    let v = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: v, value: 1 });
+    c.push_return(&[v]);
+    mb.add_function(c.finish());
+    let id_c = crate::types::FuncId(0);
+
+    let mut b = FunctionBuilder::new("b", &[IrType::I32]);
+    let pb = b.add_param(IrType::I32);
+    let o = b.alloc_vreg(IrType::I32);
+    b.push_call(
+        CalleeRef::Local(id_c),
+        &[VMCTX_VREG, pb],
+        core::slice::from_ref(&o),
+    );
+    b.push_return(&[o]);
+    mb.add_function(b.finish());
+    let id_b = crate::types::FuncId(1);
+
+    let mut a = FunctionBuilder::new("a", &[IrType::I32]);
+    let pa = a.add_param(IrType::I32);
+    let o = a.alloc_vreg(IrType::I32);
+    a.push_call(
+        CalleeRef::Local(id_b),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o),
+    );
+    a.push_return(&[o]);
+    mb.add_function(a.finish());
+
+    let mut module = mb.finish();
+    let want = run_i32(&module, "a", &[Value::I32(0)]);
+    let r = inline_module(&mut module, &InlineConfig::default());
+    assert!(r.call_sites_replaced >= 2);
+    validate_module(&module).unwrap();
+    assert_eq!(run_i32(&module, "a", &[Value::I32(0)]), want);
+}
+
+#[test]
+fn recursive_skipped() {
+    let mut mb = ModuleBuilder::new();
+    let id = mb.next_local_func_id();
+    let mut a = FunctionBuilder::new("a", &[IrType::I32]);
+    let p = a.add_param(IrType::I32);
+    let o = a.alloc_vreg(IrType::I32);
+    a.push_call(
+        CalleeRef::Local(id),
+        &[VMCTX_VREG, p],
+        core::slice::from_ref(&o),
+    );
+    a.push_return(&[o]);
+    mb.add_function(a.finish());
+
+    let mut module = mb.finish();
+    let fid = *module.functions.keys().next().unwrap();
+    let len_before = module.functions[&fid].body.len();
+    let r = inline_module(&mut module, &InlineConfig::default());
+    assert_eq!(r.functions_skipped_recursive, 1);
+    assert_eq!(r.call_sites_replaced, 0);
+    validate_module(&module).unwrap();
+    assert_eq!(module.functions[&fid].body.len(), len_before);
+    assert!(
+        find_local_call(&module.functions[&fid]).is_some(),
+        "self-call still present"
+    );
+}
diff --git a/lp-shader/lpir/src/tests/inline_callgraph.rs b/lp-shader/lpir/src/tests/inline_callgraph.rs
new file mode 100644
index 000000000..14010adab
--- /dev/null
+++ b/lp-shader/lpir/src/tests/inline_callgraph.rs
@@ -0,0 +1,322 @@
+//! Tests for [`crate::inline::callgraph`].
+
+use alloc::string::String;
+use alloc::vec;
+
+use crate::builder::{FunctionBuilder, ModuleBuilder};
+use crate::inline::callgraph::{self, CallGraph};
+use crate::lpir_module::VMCTX_VREG;
+use crate::lpir_op::LpirOp;
+use crate::types::{CalleeRef, FuncId, IrType};
+
+fn assert_sorted_dedup_eq(v: &[FuncId], expected: &[FuncId]) {
+    assert_eq!(v, expected);
+}
+
+#[test]
+fn leaf() {
+    let mut mb = ModuleBuilder::new();
+    let mut f = FunctionBuilder::new("leaf", &[IrType::I32]);
+    let p = f.add_param(IrType::I32);
+    let tmp = f.alloc_vreg(IrType::I32);
+    f.push(LpirOp::IconstI32 { dst: tmp, value: 0 });
+    f.push_return(&[p]);
+    mb.add_function(f.finish());
+
+    let module = mb.finish();
+    let g = callgraph::build(&module);
+    let (topo, cyclic) = callgraph::topo_order(&g, &module);
+    assert!(cyclic.is_empty());
+    assert_eq!(topo, vec![FuncId(0)]);
+    assert!(
+        g.callees_of
+            .get(&FuncId(0))
+            .map(|v| v.is_empty())
+            .unwrap_or(true)
+    );
+}
+
+#[test]
+fn linear_chain_a_b_c() {
+    let mut mb = ModuleBuilder::new();
+    // C: id 0
+    let mut c = FunctionBuilder::new("c", &[IrType::I32]);
+    let _pc = c.add_param(IrType::I32);
+    let r = c.alloc_vreg(IrType::I32);
+    c.push(LpirOp::IconstI32 { dst: r, value: 7 });
+    c.push_return(&[r]);
+    mb.add_function(c.finish());
+    let id_c = FuncId(0);
+
+    // B: id 1 — calls C
+    let mut b = FunctionBuilder::new("b", &[IrType::I32]);
+    let pb = b.add_param(IrType::I32);
+    let out = b.alloc_vreg(IrType::I32);
+    b.push_call(
+        CalleeRef::Local(id_c),
+        &[VMCTX_VREG, pb],
+        core::slice::from_ref(&out),
+    );
+    b.push_return(&[out]);
+    mb.add_function(b.finish());
+    let id_b = FuncId(1);
+
+    // A: id 2 — calls B
+    let mut a = FunctionBuilder::new("a", &[IrType::I32]);
+    let pa = a.add_param(IrType::I32);
+    let out = a.alloc_vreg(IrType::I32);
+    a.push_call(
+        CalleeRef::Local(id_b),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&out),
+    );
+    a.push_return(&[out]);
+    mb.add_function(a.finish());
+
+    let module = mb.finish();
+    let g = callgraph::build(&module);
+    let (topo, cyclic) = callgraph::topo_order(&g, &module);
+    assert!(cyclic.is_empty());
+    assert_eq!(topo, vec![id_c, id_b, FuncId(2)]);
+    assert_sorted_dedup_eq(&g.callees_of[&FuncId(2)], &[id_b]);
+    assert_sorted_dedup_eq(&g.callees_of[&id_b], &[id_c]);
+}
+
+#[test]
+fn diamond_a_bc_d() {
+    let mut mb = ModuleBuilder::new();
+    // D: 0
+    let mut d_fn = FunctionBuilder::new("d", &[IrType::I32]);
+    let pd = d_fn.add_param(IrType::I32);
+    d_fn.push_return(&[pd]);
+    mb.add_function(d_fn.finish());
+    let id_d = FuncId(0);
+
+    // B: 1 → D
+    let mut b_fn = FunctionBuilder::new("b", &[IrType::I32]);
+    let pb = b_fn.add_param(IrType::I32);
+    let ob = b_fn.alloc_vreg(IrType::I32);
+    b_fn.push_call(
+        CalleeRef::Local(id_d),
+        &[VMCTX_VREG, pb],
+        core::slice::from_ref(&ob),
+    );
+    b_fn.push_return(&[ob]);
+    mb.add_function(b_fn.finish());
+    let id_b = FuncId(1);
+
+    // C: 2 → D
+    let mut c_fn = FunctionBuilder::new("c", &[IrType::I32]);
+    let pc = c_fn.add_param(IrType::I32);
+    let oc = c_fn.alloc_vreg(IrType::I32);
+    c_fn.push_call(
+        CalleeRef::Local(id_d),
+        &[VMCTX_VREG, pc],
+        core::slice::from_ref(&oc),
+    );
+    c_fn.push_return(&[oc]);
+    mb.add_function(c_fn.finish());
+    let id_c = FuncId(2);
+
+    // A: 3 → B, C
+    let mut a_fn = FunctionBuilder::new("a", &[IrType::I32]);
+    let pa = a_fn.add_param(IrType::I32);
+    let o1 = a_fn.alloc_vreg(IrType::I32);
+    let o2 = a_fn.alloc_vreg(IrType::I32);
+    a_fn.push_call(
+        CalleeRef::Local(id_b),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o1),
+    );
+    a_fn.push_call(
+        CalleeRef::Local(id_c),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o2),
+    );
+    a_fn.push_return(&[o1]);
+    mb.add_function(a_fn.finish());
+    let id_a = FuncId(3);
+
+    let module = mb.finish();
+    let g = callgraph::build(&module);
+    let (topo, cyclic) = callgraph::topo_order(&g, &module);
+    assert!(cyclic.is_empty());
+    assert_eq!(topo[0], id_d);
+    assert_eq!(topo[3], id_a);
+    assert_sorted_dedup_eq(&g.callees_of[&id_a], &[id_b, id_c]);
+}
+
+#[test]
+fn self_recursive() {
+    let mut mb = ModuleBuilder::new();
+    let mut f = FunctionBuilder::new("rec", &[IrType::I32]);
+    let p = f.add_param(IrType::I32);
+    let id = mb.next_local_func_id();
+    let out = f.alloc_vreg(IrType::I32);
+    f.push_call(
+        CalleeRef::Local(id),
+        &[VMCTX_VREG, p],
+        core::slice::from_ref(&out),
+    );
+    f.push_return(&[out]);
+    mb.add_function(f.finish());
+
+    let module = mb.finish();
+    let g = callgraph::build(&module);
+    let (_topo, cyclic) = callgraph::topo_order(&g, &module);
+    assert_eq!(cyclic.len(), 1);
+    assert!(cyclic.contains(&FuncId(0)));
+}
+
+#[test]
+fn mutual_recursion() {
+    let mut mb = ModuleBuilder::new();
+    let id_a = mb.next_local_func_id();
+    let id_b = FuncId(id_a.0 + 1);
+
+    let mut a_fn = FunctionBuilder::new("a", &[IrType::I32]);
+    let pa = a_fn.add_param(IrType::I32);
+    let oa = a_fn.alloc_vreg(IrType::I32);
+    a_fn.push_call(
+        CalleeRef::Local(id_b),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&oa),
+    );
+    a_fn.push_return(&[oa]);
+    mb.add_function(a_fn.finish());
+
+    let mut b_fn = FunctionBuilder::new("b", &[IrType::I32]);
+    let pb = b_fn.add_param(IrType::I32);
+    let ob = b_fn.alloc_vreg(IrType::I32);
+    b_fn.push_call(
+        CalleeRef::Local(id_a),
+        &[VMCTX_VREG, pb],
+        core::slice::from_ref(&ob),
+    );
+    b_fn.push_return(&[ob]);
+    mb.add_function(b_fn.finish());
+
+    let module = mb.finish();
+    let g = callgraph::build(&module);
+    let (topo, cyclic) = callgraph::topo_order(&g, &module);
+    assert!(topo.is_empty());
+    assert_eq!(cyclic.len(), 2);
+}
+
+#[test]
+fn recursion_with_acyclic_tail() {
+    let mut mb = ModuleBuilder::new();
+    // C leaf: 0
+    let mut c_fn = FunctionBuilder::new("c", &[IrType::I32]);
+    let pc = c_fn.add_param(IrType::I32);
+    c_fn.push_return(&[pc]);
+    mb.add_function(c_fn.finish());
+    let id_c = FuncId(0);
+
+    let id_a = FuncId(1);
+    let id_b = FuncId(2);
+
+    // A: calls B and C (B added after A)
+    let mut a_fn = FunctionBuilder::new("a", &[IrType::I32]);
+    let pa = a_fn.add_param(IrType::I32);
+    let o1 = a_fn.alloc_vreg(IrType::I32);
+    let o2 = a_fn.alloc_vreg(IrType::I32);
+    a_fn.push_call(
+        CalleeRef::Local(id_b),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o1),
+    );
+    a_fn.push_call(
+        CalleeRef::Local(id_c),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o2),
+    );
+    a_fn.push_return(&[o1]);
+    mb.add_function(a_fn.finish());
+
+    // B: calls A
+    let mut b_fn = FunctionBuilder::new("b", &[IrType::I32]);
+    let pb = b_fn.add_param(IrType::I32);
+    let ob = b_fn.alloc_vreg(IrType::I32);
+    b_fn.push_call(
+        CalleeRef::Local(id_a),
+        &[VMCTX_VREG, pb],
+        core::slice::from_ref(&ob),
+    );
+    b_fn.push_return(&[ob]);
+    mb.add_function(b_fn.finish());
+
+    let module = mb.finish();
+    let g = callgraph::build(&module);
+    let (topo, cyclic) = callgraph::topo_order(&g, &module);
+    assert!(cyclic.contains(&id_a) && cyclic.contains(&id_b));
+    assert!(!cyclic.contains(&id_c));
+    assert_eq!(topo, vec![id_c]);
+}
+
+#[test]
+fn import_only_callee() {
+    let mut mb = ModuleBuilder::new();
+    let imp = mb.add_import(crate::lpir_module::ImportDecl {
+        module_name: String::from("m"),
+        func_name: String::from("f"),
+        param_types: vec![IrType::I32],
+        return_types: vec![IrType::I32],
+        lpfn_glsl_params: None,
+        needs_vmctx: true,
+    });
+    let mut f = FunctionBuilder::new("a", &[IrType::I32]);
+    let p = f.add_param(IrType::I32);
+    let o = f.alloc_vreg(IrType::I32);
+    f.push_call(imp, &[VMCTX_VREG, p], core::slice::from_ref(&o));
+    f.push_return(&[o]);
+    mb.add_function(f.finish());
+
+    let module = mb.finish();
+    let g = callgraph::build(&module);
+    let (topo, cyclic) = callgraph::topo_order(&g, &module);
+    assert!(cyclic.is_empty());
+    assert_eq!(topo, vec![FuncId(0)]);
+    assert!(
+        g.callees_of
+            .get(&FuncId(0))
+            .map(|v| v.is_empty())
+            .unwrap_or(true)
+    );
+}
+
+#[test]
+fn multiple_call_sites_same_callee() {
+    let mut mb = ModuleBuilder::new();
+    let mut callee = FunctionBuilder::new("c", &[IrType::I32]);
+    let pc = callee.add_param(IrType::I32);
+    callee.push_return(&[pc]);
+    mb.add_function(callee.finish());
+    let id_c = FuncId(0);
+
+    let mut a = FunctionBuilder::new("a", &[IrType::I32]);
+    let pa = a.add_param(IrType::I32);
+    let o1 = a.alloc_vreg(IrType::I32);
+    let o2 = a.alloc_vreg(IrType::I32);
+    a.push_call(
+        CalleeRef::Local(id_c),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o1),
+    );
+    a.push_call(
+        CalleeRef::Local(id_c),
+        &[VMCTX_VREG, pa],
+        core::slice::from_ref(&o2),
+    );
+    a.push_return(&[o1]);
+    mb.add_function(a.finish());
+
+    let module = mb.finish();
+    let g: CallGraph = callgraph::build(&module);
+    assert_eq!(g.callees_of[&FuncId(1)], vec![id_c]);
+    let sites = &g.call_sites_of[&FuncId(1)];
+    assert_eq!(sites.len(), 2);
+    assert_eq!(sites[0].1, id_c);
+    assert_eq!(sites[1].1, id_c);
+    assert_ne!(sites[0].0, sites[1].0);
+}
diff --git a/lp-shader/lpir/src/tests/inline_heuristic.rs b/lp-shader/lpir/src/tests/inline_heuristic.rs
new file mode 100644
index 000000000..f7fbd5942
--- /dev/null
+++ b/lp-shader/lpir/src/tests/inline_heuristic.rs
@@ -0,0 +1,177 @@
+//! Tests for [`crate::inline::heuristic::should_inline`] and budget behavior via [`crate::inline_module`].
+
+use crate::builder::{FunctionBuilder, ModuleBuilder};
+use crate::inline::heuristic::{BudgetReason, Decision, should_inline};
+use crate::lpir_module::VMCTX_VREG;
+use crate::lpir_op::LpirOp;
+use crate::types::{CalleeRef, IrType};
+use crate::{InlineConfig, InlineMode, inline_module};
+
+#[test]
+fn mode_never_is_skip_mode() {
+    let mut c = InlineConfig::default();
+    c.mode = InlineMode::Never;
+    assert_eq!(should_inline(1, 99, 0, &c), Decision::SkipMode);
+}
+
+#[test]
+fn mode_always_inlines_huge_callee() {
+    let mut c = InlineConfig::default();
+    c.mode = InlineMode::Always;
+    c.small_func_threshold = 1;
+    assert_eq!(should_inline(10_000, 1, 1_000_000, &c), Decision::Inline);
+}
+
+#[test]
+fn auto_skips_large_multi_site() {
+    let mut c = InlineConfig::default();
+    c.mode = InlineMode::Auto;
+    c.small_func_threshold = 5;
+    c.always_inline_single_site = true;
+    assert!(matches!(
+        should_inline(10, 2, 0, &c),
+        Decision::SkipTooLarge { .. }
+    ));
+}
+
+#[test]
+fn auto_inlines_large_single_site() {
+    let mut c = InlineConfig::default();
+    c.mode = InlineMode::Auto;
+    c.small_func_threshold = 5;
+    c.always_inline_single_site = true;
+    assert_eq!(should_inline(10, 1, 0, &c), Decision::Inline);
+}
+
+#[test]
+fn auto_skips_large_single_site_when_disabled() {
+    let mut c = InlineConfig::default();
+    c.mode = InlineMode::Auto;
+    c.small_func_threshold = 5;
+    c.always_inline_single_site = false;
+    assert!(matches!(
+        should_inline(10, 1, 0, &c),
+        Decision::SkipTooLarge { .. }
+    ));
+}
+
+#[test]
+fn max_growth_budget_per_callee() {
+    let mut c = InlineConfig::default();
+    c.mode = InlineMode::Always;
+    c.max_growth_budget = Some(20);
+    assert!(matches!(
+        should_inline(11, 2, 0, &c),
+        Decision::SkipBudget {
+            reason: BudgetReason::MaxGrowth,
+            ..
+        }
+    ));
+    assert_eq!(should_inline(10, 2, 0, &c), Decision::Inline);
+}
+
+#[test]
+fn module_op_budget_on_should_inline() {
+    let mut c = InlineConfig::default();
+    c.mode = InlineMode::Always;
+    c.module_op_budget = Some(15);
+    assert!(matches!(
+        should_inline(5, 2, 6, &c),
+        Decision::SkipBudget {
+            reason: BudgetReason::ModuleTotal,
+            ..
+        }
+    ));
+}
+
+#[test]
+fn module_op_budget_hit_inline_module() {
+    let mut mb = ModuleBuilder::new();
+    let mut leaf = FunctionBuilder::new("leaf", &[IrType::I32]);
+    let p = leaf.add_param(IrType::I32);
+    let v = leaf.alloc_vreg(IrType::I32);
+    leaf.push(LpirOp::IconstI32 { dst: v, value: 1 });
+    leaf.push_return(&[p]);
+    let cref = mb.add_function(leaf.finish());
+
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let pm = main.add_param(IrType::I32);
+    let out = main.alloc_vreg(IrType::I32);
+    main.push_call(cref, &[VMCTX_VREG, pm], &[out]);
+    main.push_return(&[out]);
+    mb.add_function(main.finish());
+
+    let mut module = mb.finish();
+    let mut cfg = InlineConfig::default();
+    cfg.mode = InlineMode::Always;
+    cfg.module_op_budget = Some(3);
+    let r = inline_module(&mut module, &cfg);
+    assert!(r.budget_exceeded);
+}
+
+#[test]
+fn debug_decisions_use_mode_never_no_inline() {
+    let mut mb = ModuleBuilder::new();
+    let mut leaf = FunctionBuilder::new("leaf", &[IrType::I32]);
+    let _ = leaf.add_param(IrType::I32);
+    let v = leaf.alloc_vreg(IrType::I32);
+    leaf.push(LpirOp::IconstI32 { dst: v, value: 7 });
+    leaf.push_return(&[v]);
+    let cref = mb.add_function(leaf.finish());
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let p = main.add_param(IrType::I32);
+    let o = main.alloc_vreg(IrType::I32);
+    main.push_call(cref, &[VMCTX_VREG, p], &[o]);
+    main.push_return(&[o]);
+    mb.add_function(main.finish());
+    let mut module = mb.finish();
+    let mut cfg = InlineConfig::default();
+    cfg.mode = InlineMode::Never;
+    let r = inline_module(&mut module, &cfg);
+    assert_eq!(r.call_sites_replaced, 0);
+    assert_eq!(r.functions_inlined, 0);
+}
+
+#[test]
+fn max_growth_still_allows_other_callees_orchestration() {
+    let mut mb = ModuleBuilder::new();
+    let mut always_small = FunctionBuilder::new("small", &[IrType::I32]);
+    let ps = always_small.add_param(IrType::I32);
+    always_small.push_return(&[ps]);
+    let small_ref = mb.add_function(always_small.finish());
+
+    let mut huge = FunctionBuilder::new("huge", &[IrType::I32]);
+    let ph = huge.add_param(IrType::I32);
+    for _ in 0..40 {
+        let t = huge.alloc_vreg(IrType::I32);
+        huge.push(LpirOp::IconstI32 { dst: t, value: 0 });
+    }
+    huge.push_return(&[ph]);
+    let huge_ref = mb.add_function(huge.finish());
+    let id_huge = match huge_ref {
+        CalleeRef::Local(id) => id,
+        _ => unreachable!(),
+    };
+
+    let mut main = FunctionBuilder::new("main", &[IrType::I32]);
+    let pm = main.add_param(IrType::I32);
+    let o1 = main.alloc_vreg(IrType::I32);
+    let o2 = main.alloc_vreg(IrType::I32);
+    main.push_call(huge_ref, &[VMCTX_VREG, pm], &[o1]);
+    main.push_call(small_ref, &[VMCTX_VREG, pm], &[o2]);
+    main.push_return(&[o2]);
+    mb.add_function(main.finish());
+
+    let mut module = mb.finish();
+    let mut cfg = InlineConfig::default();
+    cfg.mode = InlineMode::Always;
+    cfg.max_growth_budget = Some(30);
+    let r = inline_module(&mut module, &cfg);
+    assert_eq!(r.functions_inlined, 1);
+    let still_calls_huge = module.functions.values().any(|f| {
+        f.body.iter().any(
+            |op| matches!(op, LpirOp::Call { callee: CalleeRef::Local(id), .. } if *id == id_huge),
+        )
+    });
+    assert!(still_calls_huge);
+}
diff --git a/lp-shader/lpir/src/tests/inline_offsets.rs b/lp-shader/lpir/src/tests/inline_offsets.rs
new file mode 100644
index 000000000..c4592cd85
--- /dev/null
+++ b/lp-shader/lpir/src/tests/inline_offsets.rs
@@ -0,0 +1,228 @@
+//! Tests for [`crate::inline::recompute_offsets`].
+
+use crate::builder::FunctionBuilder;
+use crate::inline::recompute_offsets;
+use crate::lpir_op::LpirOp;
+use crate::types::{CalleeRef, ImportId, IrType};
+
+fn zero_all_offsets(body: &mut [LpirOp]) {
+    for op in body.iter_mut() {
+        match op {
+            LpirOp::IfStart {
+                else_offset,
+                end_offset,
+                ..
+            } => {
+                *else_offset = 0;
+                *end_offset = 0;
+            }
+            LpirOp::LoopStart {
+                continuing_offset,
+                end_offset,
+            } => {
+                *continuing_offset = 0;
+                *end_offset = 0;
+            }
+            LpirOp::SwitchStart { end_offset, .. } => *end_offset = 0,
+            LpirOp::CaseStart { end_offset, .. } | LpirOp::DefaultStart { end_offset } => {
+                *end_offset = 0;
+            }
+            LpirOp::Block { end_offset } => *end_offset = 0,
+            _ => {}
+        }
+    }
+}
+
+/// Collects every `u32` offset field from control ops, in body order, so two bodies can be compared.
+fn flatten_control_offset_words(body: &[LpirOp]) -> alloc::vec::Vec<u32> {
+    let mut w = alloc::vec::Vec::new();
+    for op in body {
+        match op {
+            LpirOp::IfStart {
+                else_offset,
+                end_offset,
+                ..
+            } => {
+                w.push(*else_offset);
+                w.push(*end_offset);
+            }
+            LpirOp::LoopStart {
+                continuing_offset,
+                end_offset,
+            } => {
+                w.push(*continuing_offset);
+                w.push(*end_offset);
+            }
+            LpirOp::SwitchStart { end_offset, .. } => w.push(*end_offset),
+            LpirOp::CaseStart { end_offset, .. } | LpirOp::DefaultStart { end_offset } => {
+                w.push(*end_offset);
+            }
+            LpirOp::Block { end_offset } => w.push(*end_offset),
+            _ => {}
+        }
+    }
+    w
+}
+
+fn assert_recompute_matches_original(mut original: alloc::vec::Vec<LpirOp>) {
+    let expected = flatten_control_offset_words(&original);
+    zero_all_offsets(&mut original);
+    recompute_offsets(&mut original);
+    assert_eq!(flatten_control_offset_words(&original), expected);
+}
+
+#[test]
+fn if_else_end() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let _v = b.add_param(IrType::I32);
+    let c = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: c, value: 1 });
+    b.push_if(c);
+    let t = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: t, value: 10 });
+    b.push_else();
+    let e = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: e, value: 20 });
+    b.end_if();
+    let f = b.finish();
+    assert_recompute_matches_original(f.body);
+}
+
+#[test]
+fn loop_with_continuing_marker() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let _v = b.add_param(IrType::I32);
+    b.push_loop();
+    let x = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: x, value: 0 });
+    b.push_continuing();
+    let y = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: y, value: 1 });
+    b.end_loop();
+    let f = b.finish();
+    assert_recompute_matches_original(f.body);
+}
+
+#[test]
+fn loop_without_continuing_marker() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let _v = b.add_param(IrType::I32);
+    b.push_loop();
+    let x = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: x, value: 0 });
+    b.end_loop();
+    let f = b.finish();
+    let loop_pc = f
+        .body
+        .iter()
+        .position(|op| matches!(op, LpirOp::LoopStart { .. }))
+        .expect("LoopStart");
+    let expected_co = (loop_pc + 1) as u32;
+    let mut body = f.body;
+    zero_all_offsets(&mut body);
+    recompute_offsets(&mut body);
+    if let LpirOp::LoopStart {
+        continuing_offset, ..
+    } = &body[loop_pc]
+    {
+        assert_eq!(*continuing_offset, expected_co);
+    } else {
+        panic!("expected LoopStart");
+    }
+}
+
+#[test]
+fn switch_multi_arm() {
+    let mut b = FunctionBuilder::new("f", &[IrType::F32]);
+    let sel = b.add_param(IrType::I32);
+    b.push_switch(sel);
+    b.push_case(0);
+    let a = b.alloc_vreg(IrType::F32);
+    b.push(LpirOp::FconstF32 { dst: a, value: 1.0 });
+    b.end_switch_arm();
+    b.push_case(1);
+    let c = b.alloc_vreg(IrType::F32);
+    b.push(LpirOp::FconstF32 { dst: c, value: 2.0 });
+    b.end_switch_arm();
+    b.push_default();
+    let d = b.alloc_vreg(IrType::F32);
+    b.push(LpirOp::FconstF32 {
+        dst: d,
+        value: -1.0,
+    });
+    b.end_switch_arm();
+    b.end_switch();
+    let f = b.finish();
+    assert_recompute_matches_original(f.body);
+}
+
+#[test]
+fn block_exit() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let _v = b.add_param(IrType::I32);
+    b.push_block();
+    b.push_exit_block();
+    let x = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: x, value: 1 });
+    b.end_block();
+    let f = b.finish();
+    assert_recompute_matches_original(f.body);
+}
+
+#[test]
+fn nested_loop_in_if_in_block() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let p = b.add_param(IrType::I32);
+    b.push_block();
+    b.push_if(p);
+    b.push_loop();
+    let x = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: x, value: 0 });
+    b.end_loop();
+    b.end_if();
+    b.end_block();
+    let f = b.finish();
+    assert_recompute_matches_original(f.body);
+}
+
+#[test]
+fn mutated_body_grows() {
+    let mut b_ref = FunctionBuilder::new("f", &[IrType::I32]);
+    let p = b_ref.add_param(IrType::I32);
+    b_ref.push_if(p);
+    let a = b_ref.alloc_vreg(IrType::I32);
+    b_ref.push(LpirOp::IconstI32 { dst: a, value: 1 });
+    let b_reg = b_ref.alloc_vreg(IrType::I32);
+    b_ref.push(LpirOp::IconstI32 {
+        dst: b_reg,
+        value: 2,
+    });
+    b_ref.end_if();
+    let reference = b_ref.finish();
+    let expected_words = flatten_control_offset_words(&reference.body);
+
+    let mut b_small = FunctionBuilder::new("f2", &[IrType::I32]);
+    let p2 = b_small.add_param(IrType::I32);
+    b_small.push_if(p2);
+    let a2 = b_small.alloc_vreg(IrType::I32);
+    b_small.push(LpirOp::IconstI32 { dst: a2, value: 1 });
+    b_small.end_if();
+    let mut grown = b_small.finish();
+    // Grow to the reference's length: insert a placeholder import call just before the closing `End`.
+    let insert_at = grown.body.len() - 1;
+    grown.body.insert(
+        insert_at,
+        LpirOp::Call {
+            callee: CalleeRef::Import(ImportId(0)),
+            args: crate::types::VRegRange::EMPTY,
+            results: crate::types::VRegRange::EMPTY,
+        },
+    );
+    zero_all_offsets(&mut grown.body);
+    recompute_offsets(&mut grown.body);
+    assert_eq!(
+        flatten_control_offset_words(&grown.body),
+        expected_words,
+        "recomputed offsets should match a fresh build of the same control shape"
+    );
+}
diff --git a/lp-shader/lpir/src/tests/inline_param_writes.rs b/lp-shader/lpir/src/tests/inline_param_writes.rs
new file mode 100644
index 000000000..8a128e0ef
--- /dev/null
+++ b/lp-shader/lpir/src/tests/inline_param_writes.rs
@@ -0,0 +1,79 @@
+//! Tests for [`crate::inline::remap::scan_param_writes`].
+
+use alloc::vec;
+
+use crate::builder::FunctionBuilder;
+use crate::inline::remap::scan_param_writes;
+use crate::lpir_op::LpirOp;
+use crate::types::IrType;
+
+#[test]
+fn vmctx_never_written() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let _ = b.add_param(IrType::I32);
+    let r = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: r, value: 1 });
+    b.push_return(&[r]);
+    let f = b.finish();
+    let m = scan_param_writes(&f);
+    assert!(
+        m.written.is_empty() || !m.written.iter().any(|&x| x),
+        "no params written"
+    );
+}
+
+#[test]
+fn single_param_read_only() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let a = b.add_param(IrType::I32);
+    let r = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::Iadd {
+        dst: r,
+        lhs: a,
+        rhs: a,
+    });
+    b.push_return(&[r]);
+    let f = b.finish();
+    let m = scan_param_writes(&f);
+    assert_eq!(m.written, vec![false]);
+}
+
+#[test]
+fn single_param_mutated() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let a = b.add_param(IrType::I32);
+    let one = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: one, value: 1 });
+    b.push(LpirOp::Iadd {
+        dst: a,
+        lhs: a,
+        rhs: one,
+    });
+    b.push_return(&[a]);
+    let f = b.finish();
+    let m = scan_param_writes(&f);
+    assert_eq!(m.written, vec![true]);
+}
+
+#[test]
+fn multi_param_mixed() {
+    let mut b = FunctionBuilder::new("f", &[IrType::I32]);
+    let p0 = b.add_param(IrType::I32);
+    let p1 = b.add_param(IrType::I32);
+    let _p2 = b.add_param(IrType::I32);
+    b.push(LpirOp::Iadd {
+        dst: p1,
+        lhs: p1,
+        rhs: p0,
+    });
+    let r = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::Iadd {
+        dst: r,
+        lhs: p0,
+        rhs: p0,
+    });
+    b.push_return(&[r]);
+    let f = b.finish();
+    let m = scan_param_writes(&f);
+    assert_eq!(m.written, vec![false, true, false]);
+}
diff --git a/lp-shader/lpir/src/tests/inline_remap.rs b/lp-shader/lpir/src/tests/inline_remap.rs
new file mode 100644
index 000000000..f5b73e23f
--- /dev/null
+++ b/lp-shader/lpir/src/tests/inline_remap.rs
@@ -0,0 +1,153 @@
+//! Tests for [`crate::inline::remap::build_remap`] and [`crate::inline::remap::remap_op`].
+
+use alloc::string::String;
+use alloc::vec;
+
+use crate::builder::FunctionBuilder;
+use crate::inline::remap::{build_remap, remap_op, scan_param_writes};
+use crate::lpir_module::{IrFunction, SlotDecl, VMCTX_VREG};
+use crate::lpir_op::LpirOp;
+use crate::types::{IrType, VReg};
+
+#[test]
+fn alias_for_readonly_param() {
+    let mut b = FunctionBuilder::new("c", &[IrType::I32]);
+    let a = b.add_param(IrType::I32);
+    b.push_return(&[a]);
+    let callee = b.finish();
+    let pw = scan_param_writes(&callee);
+    let mut caller = FunctionBuilder::new("caller", &[IrType::I32]).finish();
+    let arg = VReg(5);
+    let r = build_remap(&mut caller, &callee, &[VMCTX_VREG, arg], &[], &pw);
+    assert!(r.param_copies.is_empty());
+    assert_eq!(r.vreg_table[1], arg);
+}
+
+#[test]
+fn copy_for_mutated_param() {
+    let mut b = FunctionBuilder::new("c", &[IrType::I32]);
+    let a = b.add_param(IrType::I32);
+    let one = b.alloc_vreg(IrType::I32);
+    b.push(LpirOp::IconstI32 { dst: one, value: 1 });
+    b.push(LpirOp::Iadd {
+        dst: a,
+        lhs: a,
+        rhs: one,
+    });
+    b.push_return(&[a]);
+    let callee = b.finish();
+    let pw = scan_param_writes(&callee);
+    let mut caller = FunctionBuilder::new("caller", &[IrType::I32]).finish();
+    let arg = VReg(9);
+    let r = build_remap(&mut caller, &callee, &[VMCTX_VREG, arg], &[], &pw);
+    assert_eq!(r.param_copies.len(), 1);
+    match &r.param_copies[0] {
+        LpirOp::Copy { dst, src } => {
+            assert_eq!(*src, arg);
+            assert_eq!(caller.vreg_types[dst.0 as usize], IrType::I32);
+        }
+        _ => panic!("expected Copy"),
+    }
+}
+
+#[test]
+fn vmctx_aliases() {
+    let mut b = FunctionBuilder::new("c", &[IrType::I32]);
+    let a = b.add_param(IrType::I32);
+    b.push_return(&[a]);
+    let callee = b.finish();
+    let pw = scan_param_writes(&callee);
+    let mut caller1 = FunctionBuilder::new("caller1", &[IrType::I32]).finish();
+    let r = build_remap(&mut caller1, &callee, &[VMCTX_VREG, VReg(3)], &[], &pw);
+    assert_eq!(r.vreg_table[0], VMCTX_VREG);
+    let mut caller2 = FunctionBuilder::new("caller2", &[IrType::I32]).finish();
+    let r2 = build_remap(&mut caller2, &callee, &[VMCTX_VREG, VReg(3)], &[], &pw);
+    assert_eq!(r2.vreg_table[0], VMCTX_VREG);
+}
+
+#[test]
+fn slot_offset_applied() {
+    let callee = IrFunction {
+        name: String::from("c"),
+        is_entry: false,
+        vmctx_vreg: VMCTX_VREG,
+        param_count: 0,
+        return_types: vec![],
+        vreg_types: vec![IrType::Pointer],
+        slots: vec![SlotDecl { size: 4 }, SlotDecl { size: 8 }],
+        body: vec![],
+        vreg_pool: vec![],
+    };
+    let mut caller = IrFunction {
+        name: String::from("x"),
+        is_entry: false,
+        vmctx_vreg: VMCTX_VREG,
+        param_count: 0,
+        return_types: vec![],
+        vreg_types: vec![IrType::Pointer],
+        slots: vec![
+            SlotDecl { size: 1 },
+            SlotDecl { size: 2 },
+            SlotDecl { size: 3 },
+        ],
+        body: vec![],
+        vreg_pool: vec![],
+    };
+    let pw = scan_param_writes(&callee);
+    let r = build_remap(&mut caller, &callee, &[VMCTX_VREG], &[], &pw);
+    assert_eq!(r.slot_offset, 3);
+    assert_eq!(caller.slots.len(), 5);
+
+    let mut pool = caller.vreg_pool.clone();
+    let op = LpirOp::SlotAddr {
+        dst: VReg(0),
+        slot: crate::types::SlotId(0),
+    };
+    let out = remap_op(&op, &r, &mut pool, &callee.vreg_pool);
+    match out {
+        LpirOp::SlotAddr { slot, .. } => assert_eq!(slot.0, 3),
+        _ => panic!("expected SlotAddr"),
+    }
+}
+
+#[test]
+fn vreg_pool_splice() {
+    let mut mb = crate::builder::ModuleBuilder::new();
+    let imp = mb.add_import(crate::lpir_module::ImportDecl {
+        module_name: String::from("g"),
+        func_name: String::from("sin"),
+        param_types: vec![IrType::F32],
+        return_types: vec![IrType::F32],
+        lpfn_glsl_params: None,
+        needs_vmctx: true,
+    });
+    let mut b = FunctionBuilder::new("c", &[IrType::F32]);
+    let a = b.add_param(IrType::F32);
+    b.push_call(imp, &[VMCTX_VREG, a], &[]);
+    let r = b.alloc_vreg(IrType::F32);
+    b.push(LpirOp::FconstF32 { dst: r, value: 0.0 });
+    b.push_return(&[r]);
+    let callee = b.finish();
+    let pw = scan_param_writes(&callee);
+    let mut caller = FunctionBuilder::new("caller", &[IrType::F32]).finish();
+    let arg = VReg(100);
+    let remap = build_remap(&mut caller, &callee, &[VMCTX_VREG, arg], &[], &pw);
+    let call_op = callee
+        .body
+        .iter()
+        .find(|o| matches!(o, LpirOp::Call { .. }))
+        .expect("call")
+        .clone();
+    let mut pool = caller.vreg_pool.clone();
+    let before_len = pool.len();
+    let mapped = remap_op(&call_op, &remap, &mut pool, &callee.vreg_pool);
+    assert!(pool.len() > before_len);
+    match mapped {
+        LpirOp::Call { args, .. } => {
+            let slice = &pool[args.start as usize..args.start as usize + args.count as usize];
+            assert_eq!(slice[0], VMCTX_VREG);
+            assert_eq!(slice[1], arg);
+        }
+        _ => panic!("expected Call"),
+    }
+}
diff --git a/lp-shader/lpir/src/tests/inline_weights.rs b/lp-shader/lpir/src/tests/inline_weights.rs
new file mode 100644
index 000000000..bb23aff51
--- /dev/null
+++ b/lp-shader/lpir/src/tests/inline_weights.rs
@@ -0,0 +1,49 @@
+//! Candidate inline weight metrics (M3.1).
+
+use crate::inline_weights::{
+    WeightKind, weight, weight_body_len, weight_heavy_bias, weight_markers_zero,
+};
+use crate::parse::parse_module;
+use crate::validate::validate_module;
+
+const HANDCRAFTED: &str = r#"import @glsl::fsin(f32) -> f32
+
+func @handcrafted(v1:f32) -> f32 {
+  slot ss0, 8
+  v2:i32 = slot_addr ss0
+  v3:i32 = iconst.i32 0
+  v4:f32 = fconst.f32 1.0
+  v5:i32 = flt v1, v4
+  if v5 {
+    v6:f32 = fsqrt v1
+    v7:f32 = call @glsl::fsin(v6)
+    return v7
+  } else {
+    memcpy v2, v3, 8
+    return v1
+  }
+}
+"#;
+
+#[test]
+fn handcrafted_three_weights_and_dispatcher() {
+    let m = parse_module(HANDCRAFTED).expect("parse");
+    validate_module(&m).expect("validate");
+    let f = m
+        .functions
+        .values()
+        .find(|g| g.name == "handcrafted")
+        .expect("func");
+
+    let bl = weight_body_len(f);
+    let mz = weight_markers_zero(f);
+    let hb = weight_heavy_bias(f);
+
+    assert_eq!(bl, 12, "body_len");
+    assert_eq!(mz, 7, "markers_zero");
+    assert_eq!(hb, 17, "heavy_bias");
+
+    assert_eq!(weight(WeightKind::BodyLen, f), bl);
+    assert_eq!(weight(WeightKind::MarkersZero, f), mz);
+    assert_eq!(weight(WeightKind::HeavyBias, f), hb);
+}
diff --git a/lp-shader/lpir/src/validate.rs b/lp-shader/lpir/src/validate.rs
index 81a7c6e6c..69694b423 100644
--- a/lp-shader/lpir/src/validate.rs
+++ b/lp-shader/lpir/src/validate.rs
@@ -219,6 +219,16 @@ fn validate_function_inner(
                         "LoopStart continuing_offset before body start",
                     ));
                 }
+                if co != i + 1 {
+                    match func.body.get(co) {
+                        Some(LpirOp::Continuing) => {}
+                        _ => errs.push(err_in_func(
+                            fname,
+                            op_i,
+                            "LoopStart continuing_offset must point at `continuing:` marker unless it is the first body op (legacy)",
+                        )),
+                    }
+                }
                 if *end_offset > 0 && *continuing_offset >= *end_offset {
                     errs.push(err_in_func(
                         fname,
@@ -231,6 +241,15 @@ fn validate_function_inner(
                     continuing_offset: *continuing_offset,
                 });
             }
+            LpirOp::Continuing => {
+                if !matches!(stack.last(), Some(StackEntry::Loop { .. })) {
+                    errs.push(err_in_func(
+                        fname,
+                        op_i,
+                        "`continuing:` must be directly inside a loop body (not nested in if/switch/block/inner loop)",
+                    ));
+                }
+            }
             LpirOp::Block { end_offset } => {
                 if *end_offset == 0 {
                     errs.push(err_in_func(
@@ -621,6 +640,7 @@ fn check_op_operands_defined(
         | LpirOp::FconstF32 { .. }
         | LpirOp::IconstI32 { .. }
         | LpirOp::Else
+        | LpirOp::Continuing
         | LpirOp::LoopStart { .. }
         | LpirOp::CaseStart { .. }
         | LpirOp::DefaultStart { .. }
@@ -791,6 +811,7 @@ fn check_opcode_dst_types(
         | LpirOp::Memcpy { .. }
         | LpirOp::IfStart { .. }
         | LpirOp::Else
+        | LpirOp::Continuing
         | LpirOp::LoopStart { .. }
         | LpirOp::SwitchStart { .. }
         | LpirOp::CaseStart { .. }
@@ -894,6 +915,7 @@ fn mark_op_defs(func: &IrFunction, op: &LpirOp, defined: &mut [bool]) {
         | LpirOp::Memcpy { .. }
         | LpirOp::IfStart { .. }
         | LpirOp::Else
+        | LpirOp::Continuing
         | LpirOp::LoopStart { .. }
         | LpirOp::SwitchStart { .. }
         | LpirOp::CaseStart { .. }
diff --git a/lp-shader/lps-filetests/filetests/debug/inline-weights.glsl b/lp-shader/lps-filetests/filetests/debug/inline-weights.glsl
new file mode 100644
index 000000000..dfe17b373
--- /dev/null
+++ b/lp-shader/lps-filetests/filetests/debug/inline-weights.glsl
@@ -0,0 +1,99 @@
+// Debug corpus for M3.1 inline `func_weight` tuning (`lp-cli shader-debug --weights`).
+// Many small helpers + entry points; no `// run:` expectations (validate-only).
+
+float iw_lerp(float a, float b, float t) {
+    return mix(a, b, t);
+}
+
+float iw_clamp01(float x) {
+    return clamp(x, 0.0, 1.0);
+}
+
+vec3 iw_mul3(vec3 v, float s) {
+    return v * s;
+}
+
+vec3 iw_add3(vec3 a, vec3 b) {
+    return a + b;
+}
+
+vec3 iw_palette_dispatch(float t, float k) {
+    if (k < 0.5) {
+        return mix(vec3(0.0), vec3(1.0), t);
+    }
+    if (k < 1.5) {
+        return iw_add3(vec3(t), vec3(0.1));
+    }
+    if (k < 2.5) {
+        return iw_mul3(vec3(1.0 - t), 0.5);
+    }
+    if (k < 3.5) {
+        return vec3(iw_clamp01(t * 2.0));
+    }
+    return vec3(sqrt(iw_clamp01(t)));
+}
+
+float iw_step01(float x, float edge) {
+    if (x < edge) {
+        return 0.0;
+    }
+    return 1.0;
+}
+
+vec3 iw_builtin_stack(float u, float v) {
+    float a = sqrt(clamp(u, 0.0, 1.0));
+    float b = cos(v * 3.14159265);
+    float c = mix(a, b, 0.37);
+    float d = sqrt(clamp(mix(u, v, c), 0.0, 1.0));
+    float e = cos(d * 2.0);
+    return vec3(mix(c, e, 0.2), sqrt(abs(b)), clamp(a * d, 0.0, 1.0));
+}
+
+float iw_vec3_len_custom(vec3 v) {
+    float s = v.x * v.x + v.y * v.y + v.z * v.z;
+    return sqrt(s);
+}
+
+vec3 iw_color_grade(vec3 rgb, float exposure, float lift, float sat) {
+    vec3 lifted = rgb * exposure + vec3(lift);
+    float luma = dot(lifted, vec3(0.299, 0.587, 0.114));
+    vec3 chroma = lifted - vec3(luma);
+    vec3 adj = vec3(luma) + chroma * sat;
+    return clamp(mix(lifted, adj, 0.65), vec3(0.0), vec3(1.0));
+}
+
+vec3 iw_noise_blend(vec3 p, float blend, float mode) {
+    vec3 a = iw_builtin_stack(p.x, p.y);
+    vec3 b = iw_color_grade(a, 1.1, 0.02, 1.05);
+    vec3 c = iw_palette_dispatch(blend, mode);
+    vec3 d = iw_mul3(iw_add3(b, c), 0.5);
+    float len = iw_vec3_len_custom(d + vec3(0.01));
+    vec3 e = iw_builtin_stack(len, p.z);
+    float edge = iw_step01(blend, 0.33);
+    vec3 f = mix(d, e, edge);
+    return clamp(f, vec3(0.0), vec3(1.0));
+}
+
+float iw_twist(float x, float amt) {
+    float y = fract(x + amt);
+    return iw_lerp(x, y, 0.5);
+}
+
+vec3 iw_fold_rgb(vec3 v) {
+    return abs(v * 2.0 - vec3(1.0));
+}
+
+vec3 test_inline_weights_entry_a() {
+    return iw_noise_blend(vec3(0.2, 0.7, 0.3), 0.4, 1.0);
+}
+
+vec3 test_inline_weights_entry_b() {
+    vec3 p = iw_palette_dispatch(0.5, 2.0);
+    vec3 q = iw_builtin_stack(0.25, 0.5);
+    float t = iw_twist(0.3, 0.11);
+    return iw_add3(iw_mul3(p, 0.9), iw_mul3(q, 0.1 + t * 0.02));
+}
+
+vec3 test_inline_weights_entry_c() {
+    return iw_fold_rgb(iw_color_grade(vec3(0.4, 0.5, 0.6), 0.95, 0.03, 1.2));
+}
diff --git a/lp-shader/lps-filetests/filetests/debug/rainbow.glsl b/lp-shader/lps-filetests/filetests/debug/rainbow.glsl
index 82d66f497..75ebf9199 100644
--- a/lp-shader/lps-filetests/filetests/debug/rainbow.glsl
+++ b/lp-shader/lps-filetests/filetests/debug/rainbow.glsl
@@ -131,4 +131,4 @@ vec4 test_rainbow_main_corner_t5() {
     return rainbow_main(vec2(0.0, 0.0), vec2(64.0, 64.0), 5.0);
 }
 
-// run: test_rainbow_main_corner_t5() ~= vec4(0.3924713, 0.63394165, 0.14109802, 1.0) (tolerance: 0.002)
+// run: rainbow_main(vec2(0.0, 0.0), vec2(64.0, 64.0), 5.0) ~= vec4(0.3924713, 0.63394165, 0.14109802, 1.0) (tolerance: 0.002)
diff --git a/lp-shader/lps-filetests/filetests/examples/rainbow.glsl b/lp-shader/lps-filetests/filetests/examples/rainbow.glsl
new file mode 100644
index 000000000..8242787f7
--- /dev/null
+++ b/lp-shader/lps-filetests/filetests/examples/rainbow.glsl
@@ -0,0 +1,94 @@
+// test run
+//
+// Integration-style checks mirroring examples/basic/src/rainbow.shader/main.glsl.
+// Expectations are blessed from jit.q32; wasm.q32 must match within tolerance.
+
+const bool CYCLE_PALETTE = true;
+
+vec3 paletteHeatmap(float t) {
+    vec3 r = t * 2.1 - vec3(1.8, 1.14, 0.3);
+    return clamp(1.0 - r * r, 0.0, 1.0);
+}
+
+vec3 paletteRainbow(float t) {
+    float r = 0.33333;
+    vec3 v = abs(mod(fract(1.0 - t) + vec3(0.0, 1.0, 2.0) * r, 1.0) * 2.0 - 1.0);
+    return v * v * (3.0 - 2.0 * v);
+}
+
+vec3 paletteFire(float t) {
+    return clamp(vec3(1.0, 0.25, 0.0625) * exp(4.0 * t - 1.0), 0.0, 1.0);
+}
+
+vec3 paletteCool(float t) {
+    vec3 a = vec3(0.5, 0.5, 0.5);
+    vec3 b = vec3(0.5, 0.5, 0.5);
+    vec3 c = vec3(1.0, 1.0, 1.0);
+    vec3 d = vec3(0.25, 0.25, 0.25);
+    return clamp(a + b * cos(6.28318530718 * (c * t + d)), 0.0, 1.0);
+}
+
+vec3 paletteWarm(float t) {
+    vec3 a = vec3(0.5, 0.5, 0.5);
+    vec3 b = vec3(0.5, 0.5, 0.5);
+    vec3 c = vec3(1.0, 1.0, 1.0);
+    vec3 d = vec3(0.0, 0.1, 0.2);
+    return clamp(a + b * cos(6.28318530718 * (c * t + d)), 0.0, 1.0);
+}
+
+vec3 applyPalette(float t, float palette) {
+    float p = floor(palette + 0.001);
+    if (p < 0.5) return paletteHeatmap(t);
+    if (p < 1.5) return paletteRainbow(t);
+    if (p < 2.5) return paletteFire(t);
+    if (p < 3.5) return paletteCool(t);
+    return paletteWarm(t);
+}
+
+vec2 prsd_demo(vec2 scaledCoord, float time) {
+    vec2 gradient;
+    float noiseValue = lpfx_psrdnoise(
+        scaledCoord,
+        vec2(0.0),
+        time,
+        gradient,
+        0u
+    );
+
+    float hue = (cos(noiseValue * 3.1415 + time) + 1.0) * 0.5;
+    float gradientAngle = atan(gradient.y, gradient.x) / (2.0 * 3.14159) + 0.5;
+    float t = mod(time * 0.1 + hue / 3.0, 1.0);
+    float v = mix(0.5, 1.0, gradientAngle);
+    return vec2(t, v);
+}
+
+vec4 rainbow_main(vec2 fragCoord, vec2 outputSize, float time) {
+    float cyclePhase = mod(time, 5.0);
+    float palette = min(floor(mod(time * 0.2, 5.0)), 4.0);
+    float nextPalette = mod(palette + 1.0, 5.0);
+    float blend = smoothstep(4.0, 5.0, cyclePhase);
+
+    float panSpeed = .3;
+    float pan = mix(1.0, 8.0, 0.5 * (sin(time * panSpeed) + 1.0));
+
+    float scaleSpeed = .7;
+    float scale = mix(.04, .06, 0.5 * (sin(time * scaleSpeed) + 1.0));
+
+    vec2 center = outputSize * 0.5;
+    vec2 dir = fragCoord - center;
+    vec2 scaledCoord = center + dir * scale;
+
+    vec2 tv = prsd_demo(scaledCoord, time);
+
+    if (CYCLE_PALETTE) {
+        return vec4(mix(
+            applyPalette(tv.x, palette),
+            applyPalette(tv.x, nextPalette),
+            blend
+        ) * tv.y, 1.0);
+    } else {
+        return vec4(applyPalette(tv.x, 0) * tv.y, 1.0);
+    }
+}
+
+// run: rainbow_main(vec2(0.0, 0.0), vec2(64.0, 64.0), 5.0) ~= vec4(0.3924713, 0.63394165, 0.14109802, 1.0) (tolerance: 0.002)
diff --git a/lp-shader/lps-filetests/filetests/function/call-multiple.glsl b/lp-shader/lps-filetests/filetests/function/call-multiple.glsl
index 74ea3aa59..b88ebfec4 100644
--- a/lp-shader/lps-filetests/filetests/function/call-multiple.glsl
+++ b/lp-shader/lps-filetests/filetests/function/call-multiple.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/call-nested.glsl b/lp-shader/lps-filetests/filetests/function/call-nested.glsl
index b42466520..6039e04a9 100644
--- a/lp-shader/lps-filetests/filetests/function/call-nested.glsl
+++ b/lp-shader/lps-filetests/filetests/function/call-nested.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/call-order.glsl b/lp-shader/lps-filetests/filetests/function/call-order.glsl
index 99e105636..055822f7a 100644
--- a/lp-shader/lps-filetests/filetests/function/call-order.glsl
+++ b/lp-shader/lps-filetests/filetests/function/call-order.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/call-return-value.glsl b/lp-shader/lps-filetests/filetests/function/call-return-value.glsl
index c3d258d57..85d670886 100644
--- a/lp-shader/lps-filetests/filetests/function/call-return-value.glsl
+++ b/lp-shader/lps-filetests/filetests/function/call-return-value.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/call-simple.glsl b/lp-shader/lps-filetests/filetests/function/call-simple.glsl
index 7893921c1..be0b6c57b 100644
--- a/lp-shader/lps-filetests/filetests/function/call-simple.glsl
+++ b/lp-shader/lps-filetests/filetests/function/call-simple.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/declare-prototype.glsl b/lp-shader/lps-filetests/filetests/function/declare-prototype.glsl
index 734598ffa..883a4cd53 100644
--- a/lp-shader/lps-filetests/filetests/function/declare-prototype.glsl
+++ b/lp-shader/lps-filetests/filetests/function/declare-prototype.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/edge-array-size-match.glsl b/lp-shader/lps-filetests/filetests/function/edge-array-size-match.glsl
index 563879b27..4c6cc2533 100644
--- a/lp-shader/lps-filetests/filetests/function/edge-array-size-match.glsl
+++ b/lp-shader/lps-filetests/filetests/function/edge-array-size-match.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/edge-const-out-error.glsl b/lp-shader/lps-filetests/filetests/function/edge-const-out-error.glsl
index e4789cc18..95c4087d8 100644
--- a/lp-shader/lps-filetests/filetests/function/edge-const-out-error.glsl
+++ b/lp-shader/lps-filetests/filetests/function/edge-const-out-error.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/edge-inout-both.glsl b/lp-shader/lps-filetests/filetests/function/edge-inout-both.glsl
index 5621068ff..93bdc7ef0 100644
--- a/lp-shader/lps-filetests/filetests/function/edge-inout-both.glsl
+++ b/lp-shader/lps-filetests/filetests/function/edge-inout-both.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/edge-lvalue-out.glsl b/lp-shader/lps-filetests/filetests/function/edge-lvalue-out.glsl
index 415626bc2..2e1abd75a 100644
--- a/lp-shader/lps-filetests/filetests/function/edge-lvalue-out.glsl
+++ b/lp-shader/lps-filetests/filetests/function/edge-lvalue-out.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/edge-out-not-read.glsl b/lp-shader/lps-filetests/filetests/function/edge-out-not-read.glsl
index c17346c22..b974f3bb0 100644
--- a/lp-shader/lps-filetests/filetests/function/edge-out-not-read.glsl
+++ b/lp-shader/lps-filetests/filetests/function/edge-out-not-read.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/edge-out-uninitialized.glsl b/lp-shader/lps-filetests/filetests/function/edge-out-uninitialized.glsl
index 962ad9195..8ff5b4003 100644
--- a/lp-shader/lps-filetests/filetests/function/edge-out-uninitialized.glsl
+++ b/lp-shader/lps-filetests/filetests/function/edge-out-uninitialized.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/edge-return-type-match.glsl b/lp-shader/lps-filetests/filetests/function/edge-return-type-match.glsl
index 1bfbd4e86..8edad7390 100644
--- a/lp-shader/lps-filetests/filetests/function/edge-return-type-match.glsl
+++ b/lp-shader/lps-filetests/filetests/function/edge-return-type-match.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/edge-void-return-value.glsl b/lp-shader/lps-filetests/filetests/function/edge-void-return-value.glsl
index 3c0149fab..44eaa8dba 100644
--- a/lp-shader/lps-filetests/filetests/function/edge-void-return-value.glsl
+++ b/lp-shader/lps-filetests/filetests/function/edge-void-return-value.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/forward-declare.glsl b/lp-shader/lps-filetests/filetests/function/forward-declare.glsl
index e7115730d..23d9137e7 100644
--- a/lp-shader/lps-filetests/filetests/function/forward-declare.glsl
+++ b/lp-shader/lps-filetests/filetests/function/forward-declare.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-array.glsl b/lp-shader/lps-filetests/filetests/function/param-array.glsl
index 92f938cf4..00320ef21 100644
--- a/lp-shader/lps-filetests/filetests/function/param-array.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-array.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-const.glsl b/lp-shader/lps-filetests/filetests/function/param-const.glsl
index 9bbc2a5b7..4f5f5251c 100644
--- a/lp-shader/lps-filetests/filetests/function/param-const.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-const.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-default-in.glsl b/lp-shader/lps-filetests/filetests/function/param-default-in.glsl
index 56578a478..3f42c8ab1 100644
--- a/lp-shader/lps-filetests/filetests/function/param-default-in.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-default-in.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-in.glsl b/lp-shader/lps-filetests/filetests/function/param-in.glsl
index 723289c51..7dfd9ab0a 100644
--- a/lp-shader/lps-filetests/filetests/function/param-in.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-in.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-inout.glsl b/lp-shader/lps-filetests/filetests/function/param-inout.glsl
index e3b9b959d..35f1244d5 100644
--- a/lp-shader/lps-filetests/filetests/function/param-inout.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-inout.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-many.glsl b/lp-shader/lps-filetests/filetests/function/param-many.glsl
index 76d663937..3fd6e7303 100644
--- a/lp-shader/lps-filetests/filetests/function/param-many.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-many.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-mixed.glsl b/lp-shader/lps-filetests/filetests/function/param-mixed.glsl
index 96632d36d..6085ec34a 100644
--- a/lp-shader/lps-filetests/filetests/function/param-mixed.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-mixed.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-out-array.glsl b/lp-shader/lps-filetests/filetests/function/param-out-array.glsl
index 175cc73e1..7e98c7c00 100644
--- a/lp-shader/lps-filetests/filetests/function/param-out-array.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-out-array.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-out.glsl b/lp-shader/lps-filetests/filetests/function/param-out.glsl
index a34ebd1fb..d4178acff 100644
--- a/lp-shader/lps-filetests/filetests/function/param-out.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-out.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/param-struct.glsl b/lp-shader/lps-filetests/filetests/function/param-struct.glsl
index c91035e32..1242ed5f9 100644
--- a/lp-shader/lps-filetests/filetests/function/param-struct.glsl
+++ b/lp-shader/lps-filetests/filetests/function/param-struct.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-array.glsl b/lp-shader/lps-filetests/filetests/function/return-array.glsl
index ed686d3f3..fb592dc72 100644
--- a/lp-shader/lps-filetests/filetests/function/return-array.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-array.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-early.glsl b/lp-shader/lps-filetests/filetests/function/return-early.glsl
index 44b323fe5..3ff69b6ae 100644
--- a/lp-shader/lps-filetests/filetests/function/return-early.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-early.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-exact-match.glsl b/lp-shader/lps-filetests/filetests/function/return-exact-match.glsl
index c69c9f0b7..1e2d56c48 100644
--- a/lp-shader/lps-filetests/filetests/function/return-exact-match.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-exact-match.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-matrix.glsl b/lp-shader/lps-filetests/filetests/function/return-matrix.glsl
index 0e77c1a80..9cd41dba3 100644
--- a/lp-shader/lps-filetests/filetests/function/return-matrix.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-matrix.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-multiple.glsl b/lp-shader/lps-filetests/filetests/function/return-multiple.glsl
index 8d2cb4bad..868ba2682 100644
--- a/lp-shader/lps-filetests/filetests/function/return-multiple.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-multiple.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 // test run
 
diff --git a/lp-shader/lps-filetests/filetests/function/return-nested-deep.glsl b/lp-shader/lps-filetests/filetests/function/return-nested-deep.glsl
index de2dfa58d..165766195 100644
--- a/lp-shader/lps-filetests/filetests/function/return-nested-deep.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-nested-deep.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // ============================================================================
 // Deeply Nested Return Tests: Return from various depths of nested ifs
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-nested-minimal.glsl b/lp-shader/lps-filetests/filetests/function/return-nested-minimal.glsl
index 8ba54cc36..416da01d9 100644
--- a/lp-shader/lps-filetests/filetests/function/return-nested-minimal.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-nested-minimal.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-scalar.glsl b/lp-shader/lps-filetests/filetests/function/return-scalar.glsl
index ac216fb8a..995bffa22 100644
--- a/lp-shader/lps-filetests/filetests/function/return-scalar.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-scalar.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-simple-if.glsl b/lp-shader/lps-filetests/filetests/function/return-simple-if.glsl
index 1f0fc0fd7..818b27f66 100644
--- a/lp-shader/lps-filetests/filetests/function/return-simple-if.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-simple-if.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // ============================================================================
 // Simplest Early Return: The minimal case that fails
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-struct.glsl b/lp-shader/lps-filetests/filetests/function/return-struct.glsl
index 86ac5eb59..ad6b6fcd7 100644
--- a/lp-shader/lps-filetests/filetests/function/return-struct.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-struct.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-vector.glsl b/lp-shader/lps-filetests/filetests/function/return-vector.glsl
index d89b9fed0..17dfb7f6f 100644
--- a/lp-shader/lps-filetests/filetests/function/return-vector.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-vector.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-void.glsl b/lp-shader/lps-filetests/filetests/function/return-void.glsl
index 6fadc0e0e..d3264276d 100644
--- a/lp-shader/lps-filetests/filetests/function/return-void.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-void.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/function/return-while-loop.glsl b/lp-shader/lps-filetests/filetests/function/return-while-loop.glsl
index 4def3b35a..ec5fab72e 100644
--- a/lp-shader/lps-filetests/filetests/function/return-while-loop.glsl
+++ b/lp-shader/lps-filetests/filetests/function/return-while-loop.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // ============================================================================
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-control-flow.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-control-flow.glsl
index 1cb0c26c6..362649c74 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-control-flow.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-control-flow.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // User calls from if/else and from a for-loop.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-mat4-return.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-mat4-return.glsl
index f8bdd4cfb..c26b54a93 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-mat4-return.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-mat4-return.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // Large sret (mat4): stress max callee buffer sizing on native path.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-multi-args.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-multi-args.glsl
index f9c2fcc94..867be0fb9 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-multi-args.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-multi-args.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // Six user float args (+ vmctx): register args a1–a7 on RV32 when no caller sret.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-nested.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-nested.glsl
index b1b8ddff0..dc4cd2f7c 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-nested.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-nested.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // Nested user calls (multiple callees in one expression).
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-simple.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-simple.glsl
index 67764bd85..6c7b47cb0 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-simple.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-simple.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // Native / multi-backend: user function call, scalar float return (direct registers).
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-vec2-return.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-vec2-return.glsl
index 1552c4aab..f4ebb480f 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-vec2-return.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-vec2-return.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // Two scalar return words (a0–a1 direct return).
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-vec4-return.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-vec4-return.glsl
index 78170b93f..52813a71a 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/native-call-vec4-return.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/native-call-vec4-return.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 
 // Four-word return (sret on RV32): caller-side buffer + callee stores.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/call-clobber-correctness.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/call-clobber-correctness.glsl
index fe9e13841..3e3cd9a6c 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/call-clobber-correctness.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/call-clobber-correctness.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Call clobber + spill slot correctness: sequential calls, evictions during arg
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/caller-save-pressure.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/caller-save-pressure.glsl
index ce6e2d4cd..fd251d1f5 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/caller-save-pressure.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/caller-save-pressure.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Performance: caller-saved register preservation across calls.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/live-range-interference.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/live-range-interference.glsl
index 106de94b6..68aec2e5e 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/live-range-interference.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/live-range-interference.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Performance: live range interference patterns.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/mat4-reg-pressure.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/mat4-reg-pressure.glsl
index 9fd61a98c..238b9da8b 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/mat4-reg-pressure.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/mat4-reg-pressure.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Performance: mat4 register pressure (16 scalars each).
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/nested-call-overhead.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/nested-call-overhead.glsl
index a7be3c21c..b68f90067 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/nested-call-overhead.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/nested-call-overhead.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Performance: register pressure across nested/cascaded calls.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/spill-density.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/spill-density.glsl
index e504bf550..16ea57968 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/spill-density.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/spill-density.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Performance: spill/reload density in tight computation sequences.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-incoming-16.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-incoming-16.glsl
index 523880ee6..a1633ba28 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-incoming-16.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-incoming-16.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Performance: incoming stack parameter load overhead.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-incoming.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-incoming.glsl
index c0de49ffe..758916938 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-incoming.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-incoming.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Performance: incoming stack parameter load overhead.
diff --git a/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-outgoing.glsl b/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-outgoing.glsl
index 4bde55d8e..387b44d82 100644
--- a/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-outgoing.glsl
+++ b/lp-shader/lps-filetests/filetests/lpvm/native/perf/stack-args-outgoing.glsl
@@ -1,3 +1,5 @@
+// compile-opt(inline.mode, never)
+
 // test run
 //
 // Performance: outgoing stack argument store overhead.
diff --git a/lp-shader/lps-filetests/filetests/optimizer/dead_func_elim/dfe-removes-unreachable.glsl b/lp-shader/lps-filetests/filetests/optimizer/dead_func_elim/dfe-removes-unreachable.glsl
new file mode 100644
index 000000000..4b6344f9e
--- /dev/null
+++ b/lp-shader/lps-filetests/filetests/optimizer/dead_func_elim/dfe-removes-unreachable.glsl
@@ -0,0 +1,33 @@
+// compile-opt(inline.mode, never)
+// compile-opt(dead_func_elim.mode, auto)
+
+// test run
+
+// ============================================================================
+// DFE end-to-end smoke test.
+//
+// `render` is the only `is_entry` root. Inliner is disabled so we isolate
+// DFE behavior:
+//   - reachable from render: `render`, `test_dfe_basic`, `helper` (kept)
+//   - unreachable from render: `unused_dead`, `also_dead` (removed)
+//
+// `// run:` calls `test_dfe_basic` directly by name; DFE must keep it
+// because `render` reaches it. The runtime looks up entries by name, so
+// kept-but-not-`is_entry` functions remain harness-callable.
+// ============================================================================
+
+float helper(float x) { return x * x; }
+
+float unused_dead(float x) { return x + 1.0; }
+float also_dead(float x) { return x - 1.0; }
+
+float test_dfe_basic() {
+    return helper(5.0);
+}
+
+// run: test_dfe_basic() ~= 25.0
+
+vec4 render(vec2 pos) {
+    float keep = test_dfe_basic() + helper(pos.x);
+    return vec4(keep, 0.0, 0.0, 1.0);
+}
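Review note (not part of the patch): the keep/remove split this filetest expects can be illustrated with a plain reachability walk from the entry roots. The sketch below is a minimal standalone illustration, not the implementation behind `lpir::dead_func_elim`. `FuncId` as `u32`, the `reachable` and `dfe_example` helpers, and the numeric ids assigned to the GLSL functions are all hypothetical names introduced for this example.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a function id; the real pass has its own id type.
type FuncId = u32;

/// Collects every function reachable from `roots` by following call edges.
/// Uses an explicit worklist, so the walk itself does not recurse.
fn reachable(callees: &BTreeMap<FuncId, Vec<FuncId>>, roots: &[FuncId]) -> BTreeSet<FuncId> {
    let mut seen = BTreeSet::new();
    let mut work: Vec<FuncId> = roots.to_vec();
    while let Some(f) = work.pop() {
        // `insert` returns false when `f` was already visited.
        if !seen.insert(f) {
            continue;
        }
        if let Some(cs) = callees.get(&f) {
            work.extend(cs.iter().copied());
        }
    }
    seen
}

/// Models the call graph of the filetest above with made-up ids:
/// 0 = render, 1 = test_dfe_basic, 2 = helper, 3 = unused_dead, 4 = also_dead.
fn dfe_example() -> (BTreeSet<FuncId>, Vec<FuncId>) {
    let mut callees: BTreeMap<FuncId, Vec<FuncId>> = BTreeMap::new();
    callees.insert(0, vec![1, 2]); // render calls test_dfe_basic and helper
    callees.insert(1, vec![2]); // test_dfe_basic calls helper
    callees.insert(2, vec![]);
    callees.insert(3, vec![]);
    callees.insert(4, vec![]);

    // `render` (id 0) is the only root, mirroring the single `is_entry` function.
    let kept = reachable(&callees, &[0]);
    let removed: Vec<FuncId> = callees
        .keys()
        .copied()
        .filter(|f| !kept.contains(f))
        .collect();
    (kept, removed)
}
```

In this model, calling `dfe_example()` keeps ids 0, 1 and 2 and reports ids 3 and 4 as removable, which matches the kept and removed lists in the test's header comment.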
diff --git a/lp-shader/lps-filetests/filetests/optimizer/inline/inline-control-flow.glsl b/lp-shader/lps-filetests/filetests/optimizer/inline/inline-control-flow.glsl
new file mode 100644
index 000000000..3f66ddb88
--- /dev/null
+++ b/lp-shader/lps-filetests/filetests/optimizer/inline/inline-control-flow.glsl
@@ -0,0 +1,51 @@
+// test run
+
+// ============================================================================
+// Inliner: callee with nested if / for / break / continue (remap stress).
+// ============================================================================
+
+int sum_evens_with_cap(int n) {
+    int total = 0;
+    for (int i = 0; i < n; i++) {
+        if (i > 100) {
+            break;
+        }
+        if ((i % 2) == 1) {
+            continue;
+        }
+        total = total + i;
+    }
+    return total;
+}
+
+int test_inline_control_flow_sum() {
+    return sum_evens_with_cap(10) + sum_evens_with_cap(5);
+}
+
+// 0+2+4+6+8 = 20; 0+2+4 = 6 -> 26
+// run: test_inline_control_flow_sum() == 26
+
+int mixed_loop(int n, int skip_below) {
+    int acc = 0;
+    for (int j = 0; j < n; j++) {
+        if (j < skip_below) {
+            continue;
+        }
+        if (j > 50) {
+            break;
+        }
+        if ((j % 3) == 0) {
+            acc = acc + j;
+        } else {
+            acc = acc - 1;
+        }
+    }
+    return acc;
+}
+
+int test_inline_control_flow_mixed() {
+    return mixed_loop(12, 2) + sum_evens_with_cap(4);
+}
+
+// mixed_loop(12,2)=11; sum_evens_with_cap(4)=0+2=2 -> 13
+// run: test_inline_control_flow_mixed() == 13
diff --git a/lp-shader/lps-filetests/filetests/optimizer/inline/inline-many-small.glsl b/lp-shader/lps-filetests/filetests/optimizer/inline/inline-many-small.glsl
new file mode 100644
index 000000000..02740610b
--- /dev/null
+++ b/lp-shader/lps-filetests/filetests/optimizer/inline/inline-many-small.glsl
@@ -0,0 +1,52 @@
+// test run
+
+// ============================================================================
+// Inliner: many small helpers with interleaved call graph (topo stress).
+// ============================================================================
+
+float m1(float x) {
+    return x + 1.0;
+}
+
+float m2(float x) {
+    return m1(x) * 2.0;
+}
+
+float m3(float x) {
+    return m2(x) - m1(0.0);
+}
+
+float m4(float x) {
+    return m3(x) + m2(0.5);
+}
+
+float m5(float x) {
+    return m4(x) * m1(0.0);
+}
+
+float m6(float x) {
+    return m5(x) + m3(0.0);
+}
+
+float m7(float x) {
+    return m6(x) + m4(0.0);
+}
+
+float m8(float x) {
+    return m7(m2(x));
+}
+
+float m9(float x) {
+    return m8(x) - m5(0.0);
+}
+
+float m10(float x) {
+    return m9(x) + m6(0.0);
+}
+
+float test_inline_many_small() {
+    return m10(1.0);
+}
+
+// x=1: m10=18 (traced from m1..m9 definitions).
+// run: test_inline_many_small() ~= 18.0
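Review note (not part of the patch): the "topo stress" in this filetest exercises the bottom-up ordering described in the design doc, where each callee is processed after every function it calls. The sketch below shows one way such an order could be computed with an explicit stack instead of recursion. It is an illustration under stated assumptions, not the code in `lpir`: `FuncId` as `u32`, `bottom_up_order`, `many_small_order`, and the numeric ids are hypothetical, and the graph is assumed acyclic (GLSL forbids recursion).

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical stand-in for a function id.
type FuncId = u32;

/// Returns functions in callee-first (post-order) order.
/// Assumes the call graph is acyclic; a cycle would make this loop forever.
fn bottom_up_order(callees: &BTreeMap<FuncId, Vec<FuncId>>) -> Vec<FuncId> {
    let mut order = Vec::new();
    let mut done = BTreeSet::new();
    for &start in callees.keys() {
        // Each stack entry is (function, index of the next callee to visit).
        let mut stack = vec![(start, 0usize)];
        while let Some((f, next)) = stack.pop() {
            if done.contains(&f) {
                continue;
            }
            let cs: &[FuncId] = match callees.get(&f) {
                Some(v) => v,
                None => &[],
            };
            if next < cs.len() {
                // Come back to `f` after its next callee has been handled.
                stack.push((f, next + 1));
                stack.push((cs[next], 0));
            } else {
                // All callees of `f` are done, so `f` can be emitted.
                done.insert(f);
                order.push(f);
            }
        }
    }
    order
}

/// Models a slice of the filetest above with made-up ids:
/// 0 = a caller of m3, 1 = m1, 2 = m2 (calls m1), 3 = m3 (calls m2 and m1).
fn many_small_order() -> Vec<FuncId> {
    let mut callees: BTreeMap<FuncId, Vec<FuncId>> = BTreeMap::new();
    callees.insert(0, vec![3]);
    callees.insert(1, vec![]);
    callees.insert(2, vec![1]);
    callees.insert(3, vec![2, 1]);
    bottom_up_order(&callees)
}
```

For this small graph the order comes out as m1, m2, m3, then the caller, so every function appears after all of its callees. Keying the map with `BTreeMap` keeps the iteration order deterministic, in line with the design doc's stated preference.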
diff --git a/lp-shader/lps-filetests/filetests/optimizer/inline/inline-mode-flag.glsl b/lp-shader/lps-filetests/filetests/optimizer/inline/inline-mode-flag.glsl
new file mode 100644
index 000000000..8ab3168a8
--- /dev/null
+++ b/lp-shader/lps-filetests/filetests/optimizer/inline/inline-mode-flag.glsl
@@ -0,0 +1,37 @@
+// compile-opt(inline.mode, always)
+
+// test run
+
+// ============================================================================
+// Inliner: compile-opt(inline.mode, always) plumbs through; results match Auto.
+// ============================================================================
+
+float square(float x) {
+    return x * x;
+}
+
+float add(float a, float b) {
+    return a + b;
+}
+
+float compose(float x, float y) {
+    return square(add(x, y));
+}
+
+float test_inline_mode_flag_chain() {
+    return compose(2.0, 3.0);
+}
+
+// run: test_inline_mode_flag_chain() ~= 25.0
+
+float test_inline_mode_flag_compose_small() {
+    return compose(1.0, 1.0);
+}
+
+// run: test_inline_mode_flag_compose_small() ~= 4.0
+
+float test_inline_mode_flag_square_of_sum() {
+    return square(add(1.0, 2.0));
+}
+
+// run: test_inline_mode_flag_square_of_sum() ~= 9.0
diff --git a/lp-shader/lps-filetests/filetests/optimizer/inline/inline-recursion.glsl b/lp-shader/lps-filetests/filetests/optimizer/inline/inline-recursion.glsl
new file mode 100644
index 000000000..42c6c6998
--- /dev/null
+++ b/lp-shader/lps-filetests/filetests/optimizer/inline/inline-recursion.glsl
@@ -0,0 +1,79 @@
+// test run
+
+// ============================================================================
+// Inliner: deep call chain (no cycles). GLSL forbids recursion; a mistaken
+// "recursive" inline would miscompile or panic — this chain stresses that.
+// ============================================================================
+
+int chain9(int x) {
+    return x;
+}
+
+int chain8(int x) {
+    return chain9(x + 1);
+}
+
+int chain7(int x) {
+    return chain8(x + 1);
+}
+
+int chain6(int x) {
+    return chain7(x + 1);
+}
+
+int chain5(int x) {
+    return chain6(x + 1);
+}
+
+int chain4(int x) {
+    return chain5(x + 1);
+}
+
+int chain3(int x) {
+    return chain4(x + 1);
+}
+
+int chain2(int x) {
+    return chain3(x + 1);
+}
+
+int chain1(int x) {
+    return chain2(x + 1);
+}
+
+int chain0(int x) {
+    return chain1(x + 1);
+}
+
+int test_inline_deep_chain() {
+    return chain0(0);
+}
+
+// run: test_inline_deep_chain() == 9
+
+float tail(float x) {
+    return x * 2.0;
+}
+
+float step4(float x) {
+    return tail(x + 1.0);
+}
+
+float step3(float x) {
+    return step4(x) + 1.0;
+}
+
+float step2(float x) {
+    return step3(x * 2.0);
+}
+
+float step1(float x) {
+    return step2(x + 0.5);
+}
+
+float test_inline_deep_chain_float() {
+    return step1(1.0);
+}
+
+// step1(1)=step2(1.5)=step3(3.0)=step4(3.0)+1=tail(4.0)+1=8+1=9
+// run: test_inline_deep_chain_float() ~= 9.0
diff --git a/lp-shader/lps-frontend/src/lib.rs b/lp-shader/lps-frontend/src/lib.rs
index 47dc9778f..f3881cc65 100644
--- a/lp-shader/lps-frontend/src/lib.rs
+++ b/lp-shader/lps-frontend/src/lib.rs
@@ -146,6 +146,59 @@ mod tests {
             .find(|f| f.name == "add")
             .expect("add fn");
         assert_eq!(add.param_count, 2);
+        assert!(!add.is_entry);
+    }
+
+    #[test]
+    fn lower_marks_only_render_as_entry_among_user_functions() {
+        let src = r#"
+float helper(float x) { return x + 1.0; }
+vec4 render(vec2 pos) { return vec4(helper(pos.x)); }
+"#;
+        let naga = compile(src).unwrap();
+        let (ir, _) = super::lower(&naga).expect("lower");
+        let render = ir
+            .functions
+            .values()
+            .find(|f| f.name == "render")
+            .expect("render");
+        let helper = ir
+            .functions
+            .values()
+            .find(|f| f.name == "helper")
+            .expect("helper");
+        assert!(render.is_entry);
+        assert!(!helper.is_entry);
+    }
+
+    #[test]
+    fn lower_shader_init_ir_is_entry() {
+        let src = "float my_global = 42.0; float test() { return my_global; }";
+        let naga = compile(src).unwrap();
+        let (ir, _) = super::lower(&naga).expect("lower");
+        let init = ir
+            .functions
+            .values()
+            .find(|f| f.name == "__shader_init")
+            .expect("__shader_init");
+        assert!(init.is_entry);
+        let test_fn = ir
+            .functions
+            .values()
+            .find(|f| f.name == "test")
+            .expect("test");
+        assert!(!test_fn.is_entry);
+    }
+
+    #[test]
+    fn lower_helper_only_module_has_no_entry_functions() {
+        let src = "float foo(float x) { return x; }";
+        let naga = compile(src).unwrap();
+        let (ir, _) = super::lower(&naga).expect("lower");
+        assert!(
+            ir.functions.values().all(|f| !f.is_entry),
+            "no production roots without render or __shader_init"
+        );
     }
 
     #[test]
diff --git a/lp-shader/lps-frontend/src/lower.rs b/lp-shader/lps-frontend/src/lower.rs
index e97417993..260c7ac33 100644
--- a/lp-shader/lps-frontend/src/lower.rs
+++ b/lp-shader/lps-frontend/src/lower.rs
@@ -49,7 +49,7 @@ pub fn lower(naga_module: &NagaModule) -> Result<(LpirModule, LpsModuleSig), Low
     // Lower user functions.
     for (handle, info) in &naga_module.functions {
         let func = &naga_module.module.functions[*handle];
-        let ir = lower_function(
+        let mut ir = lower_function(
             &naga_module.module,
             func,
             info.name.as_str(),
@@ -62,6 +62,9 @@ pub fn lower(naga_module: &NagaModule) -> Result<(LpirModule, LpsModuleSig), Low
             name: info.name.clone(),
             inner: Box::new(e),
         })?;
+        if info.name == "render" {
+            ir.is_entry = true;
+        }
         glsl_meta.functions.push(LpsFnSig {
             name: info.name.clone(),
             parameters: info.params.clone(),
@@ -252,6 +255,7 @@ fn synthesize_shader_init(module: &Module, global_map: &GlobalVarMap) -> Option<
     }
 
     let mut fb = FunctionBuilder::new("__shader_init", &[]);
+    fb.set_entry();
     let mut emitted_any = false;
 
     // For each global with an initializer, evaluate it and store to VMContext.
diff --git a/lp-shader/lpvm-cranelift/Cargo.toml b/lp-shader/lpvm-cranelift/Cargo.toml
index bfe366fd0..470581efa 100644
--- a/lp-shader/lpvm-cranelift/Cargo.toml
+++ b/lp-shader/lpvm-cranelift/Cargo.toml
@@ -35,6 +35,7 @@ riscv32-object = [
 ]
 
 [dependencies]
+log = { workspace = true, default-features = false }
 libm = "0.2"
 spin = { workspace = true }
 lpvm = { path = "../lpvm", default-features = false }
diff --git a/lp-shader/lpvm-cranelift/src/emit/control.rs b/lp-shader/lpvm-cranelift/src/emit/control.rs
index 62b1bba5f..8469b933b 100644
--- a/lp-shader/lpvm-cranelift/src/emit/control.rs
+++ b/lp-shader/lpvm-cranelift/src/emit/control.rs
@@ -125,6 +125,7 @@ pub(crate) fn emit_control(
             });
             Ok(true)
         }
+        LpirOp::Continuing => Ok(true),
         LpirOp::Break => {
             let exit = find_innermost_loop_exit(ctrl_stack)?;
             builder.ins().jump(exit, &[]);
diff --git a/lp-shader/lpvm-cranelift/src/emit/mod.rs b/lp-shader/lpvm-cranelift/src/emit/mod.rs
index 917314e2a..d1edf82dd 100644
--- a/lp-shader/lpvm-cranelift/src/emit/mod.rs
+++ b/lp-shader/lpvm-cranelift/src/emit/mod.rs
@@ -1,8 +1,8 @@
 //! LPIR → CLIF translation: scalar ops, structured control flow, memory, and local calls.
 
+use alloc::collections::BTreeMap;
 use alloc::vec::Vec;
 
-use alloc::collections::BTreeMap;
 use cranelift_codegen::ir::{AbiParam, ArgumentPurpose, Signature, types};
 use cranelift_codegen::ir::{Block, FuncRef, InstBuilder, StackSlot, TrapCode, Value};
 use cranelift_codegen::isa::{CallConv, TargetIsa};
diff --git a/lp-shader/lpvm-cranelift/src/jit_module.rs b/lp-shader/lpvm-cranelift/src/jit_module.rs
index b393a6303..9db3f9706 100644
--- a/lp-shader/lpvm-cranelift/src/jit_module.rs
+++ b/lp-shader/lpvm-cranelift/src/jit_module.rs
@@ -115,6 +115,30 @@ pub(crate) fn build_jit_module(
     glsl_meta: LpsModuleSig,
     options: CompileOptions,
 ) -> Result<JitModule, CompilerError> {
+    let mut ir_opt = ir.clone();
+    let inline_result = lpir::inline_module(&mut ir_opt, &options.config.inline);
+    if inline_result.call_sites_replaced > 0 {
+        log::info!(
+            "[cranelift] inline: replaced {} call sites",
+            inline_result.call_sites_replaced
+        );
+    }
+    if !matches!(
+        options.config.dead_func_elim.mode,
+        lpir::DeadFuncElimMode::Never
+    ) {
+        let roots = lpir::roots_from_is_entry(&ir_opt);
+        if !roots.is_empty() {
+            let dfe = lpir::dead_func_elim(&mut ir_opt, &roots);
+            if dfe.functions_removed > 0 {
+                log::info!(
+                    "[cranelift] dead_func_elim: removed {} functions",
+                    dfe.functions_removed
+                );
+            }
+        }
+    }
+
     let _codegen_guard = process_sync::codegen_guard();
 
     let mut flag_builder = settings::builder();
@@ -140,7 +164,8 @@ pub(crate) fn build_jit_module(
 
     let mut jit_module = JITModule::new(jit_builder);
 
-    let lowered = lower_lpir_into_module(&mut jit_module, ir, options, LpirFuncEmitOrder::Source)?;
+    let lowered =
+        lower_lpir_into_module(&mut jit_module, &ir_opt, options, LpirFuncEmitOrder::Source)?;
 
     jit_module.finalize_definitions().map_err(|e| {
         CompilerError::Codegen(CompileError::cranelift(alloc::format!(
diff --git a/lp-shader/lpvm-cranelift/src/object_module.rs b/lp-shader/lpvm-cranelift/src/object_module.rs
index 1555e2f7a..6cf8c7715 100644
--- a/lp-shader/lpvm-cranelift/src/object_module.rs
+++ b/lp-shader/lpvm-cranelift/src/object_module.rs
@@ -72,6 +72,30 @@ pub fn object_bytes_from_ir(
     ir: &LpirModule,
     options: &CompileOptions,
 ) -> Result<Vec<u8>, CompilerError> {
+    let mut ir_opt = ir.clone();
+    let inline_result = lpir::inline_module(&mut ir_opt, &options.config.inline);
+    if inline_result.call_sites_replaced > 0 {
+        log::info!(
+            "[cranelift] inline: replaced {} call sites",
+            inline_result.call_sites_replaced
+        );
+    }
+    if !matches!(
+        options.config.dead_func_elim.mode,
+        lpir::DeadFuncElimMode::Never
+    ) {
+        let roots = lpir::roots_from_is_entry(&ir_opt);
+        if !roots.is_empty() {
+            let dfe = lpir::dead_func_elim(&mut ir_opt, &roots);
+            if dfe.functions_removed > 0 {
+                log::info!(
+                    "[cranelift] dead_func_elim: removed {} functions",
+                    dfe.functions_removed
+                );
+            }
+        }
+    }
+
     let _codegen_guard = process_sync::codegen_guard();
 
     let isa = riscv32_owned_isa()?;
@@ -84,7 +108,7 @@ pub fn object_bytes_from_ir(
     let mut object_module = ObjectModule::new(builder);
     lower_lpir_into_module(
         &mut object_module,
-        ir,
+        &ir_opt,
         options.clone(),
         LpirFuncEmitOrder::Name,
     )?;
diff --git a/lp-shader/lpvm-native/src/compile.rs b/lp-shader/lpvm-native/src/compile.rs
index 492391c97..68ad78838 100644
--- a/lp-shader/lpvm-native/src/compile.rs
+++ b/lp-shader/lpvm-native/src/compile.rs
@@ -167,22 +167,46 @@ pub fn compile_module(
     options: crate::native_options::NativeCompileOptions,
     isa: IsaTarget,
 ) -> Result<CompiledModule, NativeError> {
+    let mut ir_opt = ir.clone();
+    let inline_result = lpir::inline_module(&mut ir_opt, &options.config.inline);
+    if inline_result.call_sites_replaced > 0 {
+        log::info!(
+            "[native-fa] inline: replaced {} call sites",
+            inline_result.call_sites_replaced
+        );
+    }
+    if !matches!(
+        options.config.dead_func_elim.mode,
+        lpir::DeadFuncElimMode::Never
+    ) {
+        let roots = lpir::roots_from_is_entry(&ir_opt);
+        if !roots.is_empty() {
+            let dfe = lpir::dead_func_elim(&mut ir_opt, &roots);
+            if dfe.functions_removed > 0 {
+                log::info!(
+                    "[native-fa] dead_func_elim: removed {} functions",
+                    dfe.functions_removed
+                );
+            }
+        }
+    }
+
     log::debug!(
         "[native-fa] compile_module: building ABI for {n} functions",
-        n = ir.functions.len(),
+        n = ir_opt.functions.len(),
     );
-    let module_abi = ModuleAbi::from_ir_and_sig(isa, ir, sig);
+    let module_abi = ModuleAbi::from_ir_and_sig(isa, &ir_opt, sig);
     let mut session = CompileSession::new(module_abi, isa, float_mode, options);
 
     let sig_map: alloc::collections::BTreeMap<&str, &LpsFnSig> =
         sig.functions.iter().map(|s| (s.name.as_str(), s)).collect();
 
-    let mut functions = Vec::with_capacity(ir.functions.len());
-    for (idx, func) in ir.functions.values().enumerate() {
+    let mut functions = Vec::with_capacity(ir_opt.functions.len());
+    for (idx, func) in ir_opt.functions.values().enumerate() {
         log::debug!(
             "[native-fa] compile_module: compiling function {cur}/{total}: {name}",
             cur = idx + 1,
-            total = ir.functions.len(),
+            total = ir_opt.functions.len(),
             name = func.name,
         );
         let default_sig = LpsFnSig {
@@ -195,7 +219,7 @@ pub fn compile_module(
             .get(func.name.as_str())
             .copied()
             .unwrap_or(&default_sig);
-        let compiled = compile_function(&mut session, func, ir, fn_sig)?;
+        let compiled = compile_function(&mut session, func, &ir_opt, fn_sig)?;
         functions.push(compiled);
         log::debug!(
             "[native-fa] compile_module: function {name} complete",
diff --git a/lp-shader/lpvm-native/src/lower.rs b/lp-shader/lpvm-native/src/lower.rs
index 0407f14dd..1001afebd 100644
--- a/lp-shader/lpvm-native/src/lower.rs
+++ b/lp-shader/lpvm-native/src/lower.rs
@@ -1231,6 +1231,9 @@ pub fn lower_lpir_op(
                 "structural control-flow op must be lowered via lower_ops (IfStart/LoopStart/Block/Else/End/ExitBlock)",
             ),
         }),
+        LpirOp::Continuing => Err(LowerError::UnsupportedOp {
+            description: String::from("Continuing is a structural marker (skipped in lower_range)"),
+        }),
         LpirOp::Break | LpirOp::Continue | LpirOp::BrIfNot { .. } => {
             Err(LowerError::UnsupportedOp {
                 description: String::from(
@@ -1693,7 +1696,7 @@ impl<'a> LowerCtx<'a> {
                     });
                     i += 1;
                 }
-                LpirOp::Else | LpirOp::End => {
+                LpirOp::Else | LpirOp::End | LpirOp::Continuing => {
                     i += 1;
                 }
                 other => {
diff --git a/lp-shader/lpvm-native/src/regalloc/mod.rs b/lp-shader/lpvm-native/src/regalloc/mod.rs
index 71d2419e8..fa6675425 100644
--- a/lp-shader/lpvm-native/src/regalloc/mod.rs
+++ b/lp-shader/lpvm-native/src/regalloc/mod.rs
@@ -399,6 +399,7 @@ mod tests {
     // Snapshot test helpers for allocator
     fn expect_alloc(input: &str, expected: &str) {
         use crate::debug::vinst;
+        use crate::isa::rv32::abi;
         use crate::regalloc::render::render_alloc_output;
         use crate::regalloc::test::abi_fixtures;
         use crate::regalloc::walk::walk_linear;
diff --git a/lp-shader/lpvm-wasm/src/compile.rs b/lp-shader/lpvm-wasm/src/compile.rs
index 8ae47f0d0..eb903c5ea 100644
--- a/lp-shader/lpvm-wasm/src/compile.rs
+++ b/lp-shader/lpvm-wasm/src/compile.rs
@@ -1,9 +1,9 @@
 //! Compile LPIR (+ module metadata) to WASM.
 
-use alloc::{format, vec::Vec};
+use alloc::{collections::BTreeMap, format, vec::Vec};
 
 use lpir::LpirModule;
-use lps_shared::LpsModuleSig;
+use lps_shared::{LpsFnSig, LpsModuleSig};
 
 use crate::emit;
 use crate::error::WasmError;
@@ -41,10 +41,34 @@ pub fn compile_lpir(
     meta: &LpsModuleSig,
     options: &WasmOptions,
 ) -> Result<WasmArtifact, WasmError> {
-    validate_metadata(ir, meta)?;
+    let mut ir_opt = ir.clone();
+    let inline_result = lpir::inline_module(&mut ir_opt, &options.config.inline);
+    if inline_result.call_sites_replaced > 0 {
+        log::info!(
+            "[wasm] inline: replaced {} call sites",
+            inline_result.call_sites_replaced
+        );
+    }
+    if !matches!(
+        options.config.dead_func_elim.mode,
+        lpir::DeadFuncElimMode::Never
+    ) {
+        let roots = lpir::roots_from_is_entry(&ir_opt);
+        if !roots.is_empty() {
+            let dfe = lpir::dead_func_elim(&mut ir_opt, &roots);
+            if dfe.functions_removed > 0 {
+                log::info!(
+                    "[wasm] dead_func_elim: removed {} functions",
+                    dfe.functions_removed
+                );
+            }
+        }
+    }
+
+    validate_metadata(&ir_opt, meta)?;
     let (wasm_bytes, shadow_stack_base, env_memory) =
-        emit::emit_module(ir, options).map_err(WasmError::emit)?;
-    let exports = collect_exports(ir, meta, options);
+        emit::emit_module(&ir_opt, options).map_err(WasmError::emit)?;
+    let exports = collect_exports(&ir_opt, meta, options);
     Ok(WasmArtifact {
         module: WasmModule {
             bytes: wasm_bytes,
@@ -57,18 +81,16 @@ pub fn compile_lpir(
 }
 
 fn validate_metadata(ir: &LpirModule, meta: &LpsModuleSig) -> Result<(), WasmError> {
-    if ir.functions.len() != meta.functions.len() {
-        return Err(WasmError::metadata_mismatch(format!(
-            "IR has {} functions but metadata has {}",
-            ir.functions.len(),
-            meta.functions.len()
-        )));
-    }
-    for (ir_f, sig) in ir.functions.values().zip(meta.functions.iter()) {
-        if ir_f.name != sig.name {
+    let sig_map: BTreeMap<&str, &LpsFnSig> = meta
+        .functions
+        .iter()
+        .map(|s| (s.name.as_str(), s))
+        .collect();
+    for ir_f in ir.functions.values() {
+        if !sig_map.contains_key(ir_f.name.as_str()) {
             return Err(WasmError::metadata_mismatch(format!(
-                "function name mismatch: IR {:?} vs metadata {:?}",
-                ir_f.name, sig.name
+                "IR function {:?} has no metadata entry",
+                ir_f.name
             )));
         }
     }
@@ -76,10 +98,24 @@ fn validate_metadata(ir: &LpirModule, meta: &LpsModuleSig) -> Result<(), WasmErr
 }
 
 fn collect_exports(ir: &LpirModule, meta: &LpsModuleSig, options: &WasmOptions) -> Vec<WasmExport> {
+    let sig_map: BTreeMap<&str, &LpsFnSig> = meta
+        .functions
+        .iter()
+        .map(|s| (s.name.as_str(), s))
+        .collect();
     ir.functions
         .values()
-        .zip(meta.functions.iter())
-        .map(|(ir_f, sig)| {
+        .map(|ir_f| {
+            let default_sig = LpsFnSig {
+                name: ir_f.name.clone(),
+                return_type: lps_shared::LpsType::Void,
+                parameters: Vec::new(),
+                kind: lps_shared::LpsFnKind::UserDefined,
+            };
+            let sig = sig_map
+                .get(ir_f.name.as_str())
+                .copied()
+                .unwrap_or_else(|| &default_sig);
             let mut params: Vec<_> = alloc::vec![WasmValType::I32];
             params.extend(sig.parameters.iter().flat_map(|p| {
                 crate::module::glsl_type_to_wasm_components(&p.ty, options.float_mode)
diff --git a/lp-shader/lpvm-wasm/src/emit/mod.rs b/lp-shader/lpvm-wasm/src/emit/mod.rs
index a57e138c4..1fe20b1f9 100644
--- a/lp-shader/lpvm-wasm/src/emit/mod.rs
+++ b/lp-shader/lpvm-wasm/src/emit/mod.rs
@@ -8,11 +8,12 @@ mod memory;
 mod ops;
 mod q32;
 
+use alloc::collections::BTreeMap;
 use alloc::string::String;
 use alloc::vec::Vec;
 
 use lpir::FloatMode;
-use lpir::LpirModule;
+use lpir::{FuncId, LpirModule};
 use lps_q32::q32_options::Q32Options;
 
 use crate::module::EnvMemorySpec;
@@ -38,7 +39,9 @@ pub(crate) struct FdivRecipLocals {
 pub(crate) struct EmitCtx<'a> {
     pub options: &'a crate::options::WasmOptions,
     pub import_remap: &'a [Option<u32>],
-    pub filtered_import_count: u32,
+    /// Maps `FuncId` → WASM function index. Required because DFE may leave
+    /// gaps in the `FuncId` space, but WASM function indices are dense.
+    pub local_func_index: &'a BTreeMap<FuncId, u32>,
     /// Copied from [`lpir::CompilerConfig::q32`] for Q32 opcode lowering.
     pub q32: Q32Options,
 }
@@ -157,10 +160,15 @@ pub(crate) fn emit_module(
         exports.export("render_frame", ExportKind::Func, render_fn_index);
     }
 
+    let mut local_func_index: BTreeMap<FuncId, u32> = BTreeMap::new();
+    for (i, &fid) in ir.functions.keys().enumerate() {
+        local_func_index.insert(fid, filtered_fn_count + i as u32);
+    }
+
     let ctx = EmitCtx {
         options,
         import_remap: &filtered.remap,
-        filtered_import_count: filtered_fn_count,
+        local_func_index: &local_func_index,
         q32: options.config.q32,
     };
 
diff --git a/lp-shader/lpvm-wasm/src/emit/ops.rs b/lp-shader/lpvm-wasm/src/emit/ops.rs
index 9b92d1022..9e5ae78e8 100644
--- a/lp-shader/lpvm-wasm/src/emit/ops.rs
+++ b/lp-shader/lpvm-wasm/src/emit/ops.rs
@@ -5,7 +5,7 @@ use alloc::string::String;
 use alloc::vec::Vec;
 
 use lpir::FloatMode;
-use lpir::{CalleeRef, FuncId, ImportId, IrFunction, IrType, LpirModule, LpirOp};
+use lpir::{CalleeRef, ImportId, IrFunction, IrType, LpirModule, LpirOp};
 use lps_q32::q32_options::{AddSubMode, DivMode, MulMode};
 use wasm_encoder::{BlockType, Ieee32, InstructionSink, ValType};
 
@@ -26,7 +26,11 @@ fn wasm_func_index(ctx: &FuncEmitCtx<'_>, callee: CalleeRef) -> Result<u32, Stri
             let k = i as usize;
             m.import_remap[k].ok_or_else(|| format!("call to pruned import {k}"))
         }
-        CalleeRef::Local(FuncId(id)) => Ok(m.filtered_import_count + id as u32),
+        CalleeRef::Local(fid) => m
+            .local_func_index
+            .get(&fid)
+            .copied()
+            .ok_or_else(|| format!("call to unknown local function {fid:?}")),
     }
 }
 
@@ -373,6 +377,7 @@ pub(crate) fn emit_op(
                 outer_open_depth: outer_open + 1,
             });
         }
+        LpirOp::Continuing => {}
         LpirOp::SwitchStart { selector, .. } => {
             sink.block(BlockType::Empty);
             *wasm_open += 1;
diff --git a/run-tests.sh b/run-tests.sh
index 67e51f7a0..43ee92e57 100644
--- a/run-tests.sh
+++ b/run-tests.sh
@@ -53,7 +53,7 @@ export DEBUG=1
 (target/debug/lps-filetests-app test --target rv32fa.q32 control/while/nested_if.glsl &> docs/fa3-errors/control/while/nested_if.glsl)&
 (target/debug/lps-filetests-app test --target rv32fa.q32 debug/palette-rainbow.glsl &> docs/fa3-errors/debug/palette-rainbow.glsl)&
 (target/debug/lps-filetests-app test --target rv32fa.q32 debug/rainbow-noctrl.glsl &> docs/fa3-errors/debug/rainbow-noctrl.glsl)&
-(target/debug/lps-filetests-app test --target rv32fa.q32 debug/rainbow.glsl &> docs/fa3-errors/debug/rainbow.glsl)&
+(target/debug/lps-filetests-app test --target rv32fa.q32 examples/rainbow.glsl &> docs/fa3-errors/examples/rainbow.glsl)&
 (target/debug/lps-filetests-app test --target rv32fa.q32 function/call-multiple.glsl &> docs/fa3-errors/function/call-multiple.glsl)&
 (target/debug/lps-filetests-app test --target rv32fa.q32 function/call-order.glsl &> docs/fa3-errors/function/call-order.glsl)&
 (target/debug/lps-filetests-app test --target rv32fa.q32 function/call-return-value.glsl &> docs/fa3-errors/function/call-return-value.glsl)&
diff --git a/scripts/glsl-filetests.sh b/scripts/glsl-filetests.sh
index 243d36fa9..0eb8aba4b 100755
--- a/scripts/glsl-filetests.sh
+++ b/scripts/glsl-filetests.sh
@@ -44,6 +44,7 @@ SHOW_LIST=false
 REGEN_GEN_FILES=false
 TARGET_ARG=()
 TEST_ARGS=()
+FORCE_OPTS=()
 
 while [[ $# -gt 0 ]]; do
   case $1 in
@@ -92,7 +93,7 @@ while [[ $# -gt 0 ]]; do
     shift
     ;;
   --force-opt)
-    TEST_ARGS+=("--force-opt" "$2")
+    FORCE_OPTS+=("$2")
     shift 2
     ;;
   *)
@@ -169,6 +170,9 @@ EXAMPLES:
     # Baseline: mark all current failures @unimplemented(backend=jit), then re-run to get exit 0
     glsl-filetests.sh --target jit.q32 --mark-unimplemented --assume-yes
 
+    # A/B test: run with inlining disabled
+    glsl-filetests.sh --force-opt inline.mode=never examples/
+
 PATTERN SYNTAX:
     *         Matches any sequence of characters
     ?         Matches any single character
@@ -276,4 +280,8 @@ fi
 # This ensures cargo run picks up all compilation changes in the lps workspace
 # Pass all remaining arguments directly to the test runner
 # Pass through DEBUG environment variable for debug logging
+if [ ${#FORCE_OPTS[@]} -gt 0 ]; then
+  LPS_FILETEST_FORCE_OPT="$(IFS=','; echo "${FORCE_OPTS[*]}")"
+  export LPS_FILETEST_FORCE_OPT
+fi
 cargo run -p lps-filetests-app --bin lps-filetests-app -- test "${TARGET_ARG[@]}" "${TEST_ARGS[@]}"
diff --git a/scripts/shader-debug.sh b/scripts/shader-debug.sh
index 44850cac9..cf35f2357 100755
--- a/scripts/shader-debug.sh
+++ b/scripts/shader-debug.sh
@@ -48,6 +48,8 @@ OPTIONS:
     --vinst            Show VInst/interleaved section
     --asm              Show assembly/disasm section
     --summary          Summary only (no detailed function output)
+    --compiler-opt     KEY=value LPIR compiler override (repeatable).
+                        Use `--compiler-opt` with no value after a FILE path to print valid keys.
 
 EXAMPLES:
     # Show debug output for all functions (rv32n backend)