Native frame generation (optical-flow interpolation) #537
Compositor frame generation on open GLSL shaders: block-matching motion estimation and fragment-shader interpolation wired into the Vulkan present path. A headroom-driven scheduler posts at the target rate under a non-blocking present mode (so an adaptive panel ramps up) and passes through under FIFO; the real frame always presents, so it never drops below native. Frame Gen controls live in the FX tab as an expanding toggle (like SGSR): 2x/3x/4x multiplier, quality preset and smoothness; Other-settings toggle. The HUD FPS reports the output rate (real + generated) while FG is active.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 83c2550805
…s idle Under FG only real game presents set fgNewScene, so cursor and window changes (paused, idle menus, static scenes) never reached a HOLD and the compositor stayed frozen on the last frame. Mark a scene-dirty flag on non-game render requests; the pump recomposites and presents it without interpolating.
Replace the full per-submit GPU drain (wait_inflight_frames) in fg_submit with a targeted wait on only the history slot a HOLD is about to overwrite. Grow the history ring 2->3 slots so the overwritten slot is never the pair an in-flight INTERP is sampling, and track a fence per slot. Add a fragment->compute WAR barrier so a new pair's motion recompute waits for the prior pair's interp reads on the single graphics queue. This lets CPU record/submit overlap GPU execution, cutting frametime variance under the Balanced/Quality presets.
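A small sketch of why the third history slot matters, assuming frames are staged round-robin (`frame % ring_size`) and an in-flight INTERP samples the two newest frames. The names `slot_for_frame` and `hold_overwrites_live_pair` are hypothetical and not taken from the renderer.

```c
#include <assert.h>

/* Assumption: frames are staged round-robin into the history ring. */
int slot_for_frame(unsigned frame, unsigned ring_size)
{
    assert(ring_size > 0);
    return (int)(frame % ring_size);
}

/* Returns 1 when staging frame `newest + 1` would overwrite a slot that an
 * in-flight interpolation of the pair (newest - 1, newest) still samples. */
int hold_overwrites_live_pair(unsigned newest, unsigned ring_size)
{
    assert(newest >= 1);
    int target = slot_for_frame(newest + 1, ring_size);
    return target == slot_for_frame(newest - 1, ring_size) ||
           target == slot_for_frame(newest, ring_size);
}
```

With two slots the incoming frame always lands on half of the live pair, so a full drain is the only safe option. With three slots it lands on the oldest frame, and only that slot's fence needs waiting on.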
…droom Refine the block-matching result with a per-axis parabolic SSD fit so motion vectors are sub-pixel instead of quantized to 2 full-res px (removes slow-pan wobble); guarded to keep the +/-1 taps inside the search tile. Apply a separable 3x3 component-wise median to the flow field in interpolate.frag to reject outlier vectors without blurring motion edges. Raise the non-FIFO swapchain image floor 3->4 so interpolated frames stop being dropped at acquire, and count those drops in the FG cadence log.
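The per-axis parabolic fit can be sketched as below. This is a minimal CPU-side illustration, not the shader code: `refine_axis_parabolic` is a hypothetical name, and the layout of `ssd[]` (one SSD cost per integer candidate offset along one axis) is an assumption.

```c
#include <assert.h>

/* Refine an integer best-match index along one axis to sub-pixel precision.
 * ssd[] holds the SSD cost at each integer candidate in the search tile
 * (assumed layout); best is the index of the minimum cost. */
float refine_axis_parabolic(const float *ssd, int n, int best)
{
    assert(ssd && n > 0 && best >= 0 && best < n);

    /* Guard: the +/-1 taps must stay inside the search tile. */
    if (best == 0 || best == n - 1)
        return (float)best;

    float m = ssd[best - 1], c = ssd[best], p = ssd[best + 1];
    float denom = m - 2.0f * c + p;       /* second difference of the cost */
    if (denom <= 1e-6f)                   /* flat or degenerate: keep integer */
        return (float)best;

    float off = 0.5f * (m - p) / denom;   /* vertex of the fitted parabola */
    if (off < -0.5f) off = -0.5f;         /* never leave the winning cell */
    if (off > 0.5f) off = 0.5f;
    return (float)best + off;
}
```

For costs sampled from `(x - 0.25)^2` at x = -1, 0, 1 the fit recovers the 0.25 offset exactly, which is the kind of fractional displacement an integer block match cannot represent.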
…acer Hold the adaptive (LTPO) panel high by over-posting under MAILBOX, and pace presents evenly with a clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME) deadline loop anchored to the Choreographer vsync grid -- the smoothness comes from the absolute-time sleep, not the present mode or VK_GOOGLE_display_timing (Android ignores desiredPresentTime under MAILBOX). The display-timing extension is kept only for read-back telemetry (FG timing: avgInterval log). The FPS limiter pins the engine rate; the surface frame-rate vote targets engine*multiplier with a never-below-native floor; the FIFO cadence branch gains an epsilon so an exact ratio doesn't drop a multiplier.
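A minimal sketch of an absolute-time deadline sleep of the kind described, using the POSIX `clock_nanosleep` call with `TIMER_ABSTIME`. The helper names (`monotonic_now_ns`, `sleep_until_ns`, `next_grid_deadline_ns`) and the grid arithmetic are illustrative assumptions; the real pacer anchors to the Choreographer vsync grid, which is not modelled here.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

#define NS_PER_SEC 1000000000LL

int64_t monotonic_now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * NS_PER_SEC + ts.tv_nsec;
}

/* Sleep until an absolute CLOCK_MONOTONIC instant. clock_nanosleep returns
 * the error number directly, so EINTR is retried with the same deadline. */
int sleep_until_ns(int64_t deadline_ns)
{
    struct timespec ts;
    ts.tv_sec = (time_t)(deadline_ns / NS_PER_SEC);
    ts.tv_nsec = (long)(deadline_ns % NS_PER_SEC);
    int rc;
    do {
        rc = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
    } while (rc == EINTR);
    return rc;
}

/* Hypothetical helper: first grid point strictly after now_ns on an evenly
 * spaced grid anchor_ns + k * period_ns. */
int64_t next_grid_deadline_ns(int64_t anchor_ns, int64_t period_ns, int64_t now_ns)
{
    assert(period_ns > 0);
    if (now_ns < anchor_ns)
        return anchor_ns;
    int64_t k = (now_ns - anchor_ns) / period_ns + 1;
    return anchor_ns + k * period_ns;
}
```

Because each deadline is an absolute instant, a late wake-up on one frame does not shift the following deadlines, which a chain of relative sleeps would do.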
Wake ~120us before the deadline so the present latches the current vblank; compute the interp phase from real-frame arrival times vs the present deadline instead of a fixed k/(n+1) fraction; replace the 4-sample avgInterval log with windowed CoV/min/max present-interval stats.
…rder-independent mode resolve
…ings
- FIFO presents with deterministic slot phases; real frames present sharp
- Bidirectional warp at every multiplier; median occlusion fallback; static HUD mask
- Extrapolation present path (no added latency)
- Runtime frames-in-flight (Buffering 1-3)
- Per-game FG settings; Smoothest / Low Latency presets in the drawer
Interpolated frame generation on a dedicated worker thread with a deferred-promote history ring and a changed-pixel content-dedup, so distinct frames are kept and the source rate is measured cleanly. Slot-grid cadence with even-hold pacing for high-refresh panels, native-max refresh pinning, and deadline-paced presents. Occlusion-gated interpolation with motion-faded detail. Fixes: enable/disable crash (stale worker fence left in the shared slot-fence array), drawer-overlay pause/resume, and low-motion content-rate collapse. Adds compute shaders for the flow-generation path.
Brings in the 7 upstream commits (Fx tab effects, frontend support, controller auto-hide touchscreen controls, wrapper update, DXVK/D7VK ddraw, settings strings) on top of the frame-generation work. Conflicts in vk_renderer.c, vk_state.h, and VulkanRenderer.java resolved by keeping both sides.
Brings in the 5 upstream commits (Z drive boot & shortcut fix, touch controls refinement, Boot-to-Desktop/Graphics-test hero buttons, Box64 env vars, README) on top of the frame-generation work. Conflicts in XServerDisplayActivity.java, XServerDrawerMenu.kt, and 14 locale strings.xml resolved by keeping both sides.
Pace each presented frame at its true temporal position instead of quantizing to vblanks:
- fg_compute_deadline no longer snaps the present target to the panel vsync grid; the target is the free, evenly-spaced instant on the measured content-rate grid (vsync-snapping was the 3:2-pulldown-style judder source for non-integer/variable content:panel ratios).
- The worker requests that instant from the display via VkPresentTimeGOOGLE.desiredPresentTime (was hardcoded 0), so a panel that honours display-timing latches each frame on the correct vblank.
- FG presents under a non-blocking mode on the toggle path (MAILBOX, IMMEDIATE fallback) to match the attach path, so the deadline nanosleep drives the present instant instead of FIFO vsync-blocking.
- One present per pump tick (fgComputePerTick=1): the pump fires once per vblank, so emitting more posted multiple frames into a single vblank, which showed as an uneven ">panel" present rate.
Replace the open-loop slot-grid/tick-counter cadence with a time-based rule and lock pacing to the source clock:
- fgEmitOne derives the output sub-frame from elapsed time since the last real frame (frac*M), not a tick counter, so irregular promote arrival can't misalign placement. One present per vblank; sub 0 = real frame, sub 1..M-1 = tweens at phase sub/M.
- Divisor-snap the cadence multiplier to the largest divisor of the panel:content ratio so output divides the panel evenly (e.g. 3x of 30Hz = 90Hz holds frames 1,1,2 vblanks on a 120Hz panel = judder; snaps to 2x). fgTargetHz + the adaptive ratchet use the snapped multiplier; diag reports it as cad=.
- Stabilise the rate lock with a light EMA toward the measured content rate instead of the drift-relock threshold that left the lock stale.
- Native fg_compute_deadline anchors each present to curr_arrival + phase*(curr-prev) (the source clock) instead of a per-enqueue period accumulator that drifted when holds emit nothing; nativePresentLast now carries phase + arrivals.
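The divisor-snap rule can be illustrated with a short sketch. `snap_multiplier` is a hypothetical name, and rounding the panel:content ratio to the nearest integer is an assumption made here for the example; the commit does not say how a non-integer ratio is handled.

```c
#include <assert.h>

/* Largest multiplier <= requested that divides the panel:content ratio, so
 * every output frame is held for the same number of vblanks. */
int snap_multiplier(int panel_hz, int content_hz, int requested)
{
    assert(panel_hz > 0 && content_hz > 0 && requested >= 1);
    if (panel_hz < content_hz)
        return 1;                               /* nothing to multiply into */

    /* Assumption: the ratio is rounded to the nearest integer. */
    int ratio = (panel_hz + content_hz / 2) / content_hz;

    int start = requested < ratio ? requested : ratio;
    for (int m = start; m > 1; --m) {
        if (ratio % m == 0)
            return m;                           /* divides the panel evenly */
    }
    return 1;
}
```

For the example in the commit, `snap_multiplier(120, 30, 3)` sees a ratio of 4, rejects 3 because 4 is not divisible by it, and returns 2.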
The floor(frac*M) sub-frame rule skipped tweens when the source rate jittered against the EMA period, so the output ran far below target (interp ~16/s vs 30 expected at 2x, ~80 missing at 4x) and the adaptive ratchet misread that as a GPU limit and dropped the multiplier (4x ran as 2x) even though generation is cheap (over-budget ~7/240, interp ~0.07ms). Replace it with a deterministic steady gate: present a new frame every hold=slots/M vblanks and sample the tween phase continuously from the content clock, so a fresh frame is produced on every gate vblank. 4x now holds (eff=4x cad=4x) and the output tracks the target (~110fps from a 30fps source) instead of collapsing to ~48fps. Real frame shown sharp on promote; gate restarts there.
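A rough sketch of the steady gate, assuming `slots` is the number of panel vblanks per content interval and that the gate counter restarts at 0 on each promote. All function names are hypothetical, and the phase formula is one plausible reading of "sampled continuously from the content clock".

```c
#include <assert.h>
#include <stdint.h>

/* Vblanks each presented frame is held for: hold = slots / M, at least 1. */
int gate_hold_vblanks(int slots, int multiplier)
{
    assert(slots >= 1 && multiplier >= 1);
    int hold = slots / multiplier;
    return hold < 1 ? 1 : hold;
}

/* The gate fires on every hold-th vblank after the last promote, so a fresh
 * frame is produced on each gate vblank regardless of source-rate jitter. */
int gate_fires(int vblanks_since_promote, int slots, int multiplier)
{
    assert(vblanks_since_promote >= 0);
    return vblanks_since_promote % gate_hold_vblanks(slots, multiplier) == 0;
}

/* Assumed phase rule: time elapsed since the newest real frame, as a fraction
 * of the last measured content interval, clamped to [0, 1]. */
double content_phase(int64_t now_ns, int64_t prev_arrival_ns, int64_t curr_arrival_ns)
{
    int64_t period = curr_arrival_ns - prev_arrival_ns;
    if (period <= 0)
        return 0.0;
    double phase = (double)(now_ns - curr_arrival_ns) / (double)period;
    if (phase < 0.0) phase = 0.0;
    if (phase > 1.0) phase = 1.0;
    return phase;
}
```

Unlike the `floor(frac*M)` rule, the gate decision here depends only on the vblank count, so jitter in the source rate moves the sampled phase slightly but never suppresses a present.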
- fg_sig_delta noise floor lowered 4->2 so subtly-moving frames (3-4/channel change) count as distinct instead of being held as duplicates. Exact re-presents still score 0 and drop, so the rate measurement is unaffected.
- Wire the Max preset (index 5) to enable deep (bidirectional) flow in both the startup load and the in-game preset handler; it was hardcoded off for all presets. Other presets stay single-flow.
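To make the effect of the noise floor concrete, here is a hypothetical stand-in for a signature delta. The real `fg_sig_delta` is not shown in this PR text, so the scoring rule below (sum of per-channel differences that exceed the floor) is purely an assumption for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical signature delta: per-channel differences at or below the
 * noise floor are ignored; larger ones are accumulated into the score. */
unsigned sig_delta_sketch(const uint8_t *a, const uint8_t *b, size_t n, int noise_floor)
{
    assert(a && b && noise_floor >= 0);
    unsigned score = 0;
    for (size_t i = 0; i < n; ++i) {
        int d = abs((int)a[i] - (int)b[i]);
        if (d > noise_floor)
            score += (unsigned)d;
    }
    return score;
}
```

Under this rule a frame whose channels change by 3 to 4 scores 0 with a floor of 4 (held as a duplicate) but scores above 0 with a floor of 2, while an exact re-present scores 0 at either setting.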
The cadence anchored to the native content-promote time, which is quantized to the pump vblank. Anchor it instead to the precise onFramePresented arrival timestamp (fgLastGameNs) - the game's own buffer-swap/present clock. Content-dedup still runs upstream to drop redundant re-presents (the compositor sees some), but the cadence timing now comes from the real arrival instants. Measured on a 30fps source at 4x: present-interval cov ~37% -> ~22%, worst-case gap 24.9ms -> 16.6ms (no more 3-vblank holds).
Brings in the upstream commits synced on the remote (Fx effects per shortcut, glass control opacity, drive-creation fix) on top of the frame-generation pacing work. Clean auto-merge.
Default-on selectable CNN flow producer with per-pair flow caching, decimated coarse-to-fine flow, bidirectional deep mode, and 2x/3x/4x multipliers. Set debug.winnative.fgcnn=0 to force the classical block-match path.
Deadline-free flow job computes each pair's flow ahead of the warp-only present jobs, lowering present-interval jitter. CNN flow is now unconditional (only an fp16-incapable device falls back to classical).
Per-history-slot feature cache eliminates redundant ingest (was 2-4x per pair); slot invalidated on re-stage. Lowers per-pair flow cost and present-interval jitter. Also fixes the forward-flow direction so deep mode does true bidirectional occlusion.
…ized FIFO blocks the present at vsync after the precise nanosleep pacing, re-quantizing it; MAILBOX honors the paced timestamp and reaches a flat present cadence when the source is steady.
The CNN flow stores backward-flow components in the z/w channels, which the warp shader read as static-mask/confidence and used to snap pixels to the sharp current frame; noisy low-res flow made that toggle per frame (the sharp/unsharp flicker, worst at low presets). interpolate.frag now ignores those channels on the CNN path. Also stopped forcing the forward flow, which interpolate.frag never samples.
kdet was gated on flow magnitude and blend weight, both noisy with CNN flow, causing residual shimmer on motion; on the CNN path it is now a stable per-frame value.
…the drawer Generate path: warp prev/curr by the refined flow at three pyramid scales and select per-pixel with a numerically-stable softmax (max-subtract + temperature) and a flow-consistency occlusion term, guarded so static content is held unwarped. Delta5-9 warp-follow refinement feeds the flow. Worker keeps generating while the drawer overlay is up (it only composites the game content; the menu is a separate layer). Adds a prop-gated consecutive-frame dump for offline temporal verification. Verified via dumps across all presets and 2x/3x/4x: steady pacing, no strobe, single clean objects.
The interp phase was computed from the vsync clock relative to the frame-arrival clock; those grids are unaligned, so each pair's interps landed at jittery phases (0.6/0.48 instead of clean fractions), placing moving objects ahead/behind their correct position frame-to-frame. Phase is now vi/slots: clean, evenly-spaced, ordered (0.25/0.5/0.75 at 4x), verified by per-frame phase logging in the dump.
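The `vi/slots` rule is simple enough to state in a few lines. This sketch assumes `vi` counts slots since the last real frame; `slot_phase` is a hypothetical name.

```c
#include <assert.h>

/* Phase of the frame presented in slot vi of a pair split into `slots`
 * evenly spaced slots. vi = 0 is the real frame. */
float slot_phase(int vi, int slots)
{
    assert(slots >= 1 && vi >= 0 && vi <= slots);
    return (float)vi / (float)slots;
}
```

At 4x this yields 0.25, 0.5 and 0.75 for the three tweens of every pair, independent of how the vsync and frame-arrival clocks happen to be aligned.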
… convention The generate path warped by a coarse pyramid flow level with mvScale hardcoded to 1.0, which is only correct at flow_scale=0.5, so objects moved too far or too short at other presets (the incorrect rate of movement). It now warps the fine fg_motion by the same convention interpolate.frag uses (prev +flow*t, curr -flow*(1-t), one .xy field) with mvScale = gw/(2*flow_width), correct at every preset. The dump gains a phase-0.25-aligned start for interp-fraction measurement.
…ant 0.5 mvScale was gw/(2*flow_width), assuming flow stored in flow-resolution pixels; it is actually full-resolution pixels, so that formula scaled the warp by 1/flow_scale -> 4x overshoot at Eco (flow_scale 0.2), objects torn/shrunk, while Max (0.8) was near-correct. Constant 0.5 (warp = flow*t) is preset-independent and correct; verified by object-pixel integrity (Eco gen objects 6021->7971 px, matching real ~8000).
The phase>0.35 burst-start gate only matched 4x (which has phase 0.25); 2x/3x never started a burst so the dump returned stale frames. Now gates on phase not increasing vs the previous frame (a pair/wrap boundary), valid at 2x/3x/4x.
mvScale 0.5 (warp=flow*t) assumed the flow equals the true displacement, but the CNN flow chain overestimates by ~25%, so objects overshot the real frames. Swept mvScale by the object-integrity metric (gen-vs-real saturated-pixel ratio, which peaks where the two warps converge = least tearing): clear peak at 0.40 on both Eco (0.954) and Max (0.948), 0.50 worse (0.908/0.917). Added a debug.winnative.fgmvscale override for future tuning.
The bidirectional blend averaged back-warp and fwd-warp with phase weights, so where the imperfect flow makes them land at different spots the object smeared and shifted as the phase changed (perceived overshoot/jitter). Replaced the logit+temperature softmax with a sharp reliability select: each warp is weighted by exp(-0.5*|flow(landing)-warp|^2), decisively picking the warp whose landing point has consistent flow. Object integrity rose to 0.958 (Max 4x). Negative mvScale outputs |back-fwd| for diagnostics; debug.winnative.fgmvscale tunes magnitude live (default 0.40).
The dump was hardcoded 480x270 (landscape 16:9) but the gen image is 1080x2376 (portrait) -> dumps were rotated sideways AND squished 2.2:1->1.78:1, distorting every measurement. Now dumps at 270x594 (true aspect); analysis rotates +90 CCW to landscape. Logs gen WxH per dump.
…nfg_04 (structure)" This reverts commit d3a25e0f4fa43acc4ada7cf8130ab2df85ce4bf4.
…wnfg_04 (structure)" This reverts commit b7d8bc67428d24961a8cba112314734650c96a27.
Built GT validation harness (fgtest cnn_flow_run + wnfg_53): the real GT chain produces NON-DEGENERATE logits at all 7 levels (out ch ranges 1.5-5.2, maxsep 1.9-4.1) with b32=hD8(D)/b33=hD7(C) pair + b34=seed. PROVES the wnfg_53 kernel + wiring + chain structure are correct. Renderer wired to match. Renderer logits still degenerate ONLY on the sparse D3D11 test scene (mostly-black low-variance features); delta5-8 wiring is identical to the validated harness, so this is a feature-sparsity artifact of the synthetic test content, not a port bug — needs a real game (dense features, what wnfg_53 was trained on) to confirm 1:1.
MHS (and games presenting CPU-accessible linear swapchains, (usage & 0xFF) != 0, allocationSize == w*h*4) showed vertical blue/cyan stripe corruption on the new Adreno 840 phone (CPH2749) but not Adreno 830. Root cause: vkr_texture_import_ahb hardcoded VK_IMAGE_TILING_OPTIMAL; on Adreno 840 OPTIMAL is a real tile swizzle, so linear buffer data was read through a tiling pattern. Fix: import CPU-accessible (linear) AHBs as VK_IMAGE_TILING_LINEAR, keep OPTIMAL for GPU-only buffers. NOT yet on-device-verified (phone disconnected mid-test). FG kept on two-flow (wnfg_53 logits preserved in history at 37518545).
RE (workflow): MHS vertical-stripe corruption = the game writes its swapchain AHB via the guest WN-Turnip in Adreno-840 macro-tile order, but winnative's compositor read it via the system Qualcomm driver -> tile-order mismatch (coincided on Adreno 830). The compositor driver came from graphicsDriverConfig 'version' and fell back to 'System' when unset, ignoring the game's actual driver. Fix: when no explicit compositor version is set, match the GAME's driver (graphicsDriver) so writer==reader, per the existing 'match guest libvulkan' intent. Also revert the inert CPU-access LINEAR import toggle (dedicated AHB layout comes from gralloc metadata, not ic.tiling) and add an OPTIMAL->LINEAR vkCreateImage fallback so no driver can black-screen. Diagnostic: XServerDisplayActivity logs Compositor graphics driver='...'. UNVERIFIED on-device (phone disconnected).
Reverts the unconditional guest-driver-matching for the compositor (a6acd95c) for FG-off games; on Adreno 840 reading the AHB via guest Turnip OR System both stripe, so this isn't the full fix but restores the pre-FG default. Regression hunt continues.
User-confirmed: the libvulkan_wrapper BCn emulation (auto/full) patches the Adreno driver and corrupts the whole frame as vertical stripes on Adreno 840; setting bcnEmulation=none fixes it. Mesa Turnip decodes BC natively, so force none on Adreno automatically. Also revert the wrong-diagnosis COLOR_ATTACHMENT import change.
…hen static) On a real game (MHS) camera spin, interp dropped 30->0/s: X11 damage events fire requestRenderCoalesced() -> fgSceneDirty every vblank during motion, and fgEmitOne's 'dirty && !newGame' branch presented sharp instead of running the interp cadence. Fix: defer the UI-only recomposite to the END and only take it when the content interval is fully spanned (static); mid-interval, run the interpolation cadence so motion is actually smoothed. Isolated via dumps/20260620_2047_mhs_spin_BEFORE_interp0.
…+ frame-sequence dump
… harness Warp: signed mvScale so the midpoint gather uses -0.5*flow. The old +0.25 was wrong-sign and caused ghosting/shake at intermediate phases (endpoints were unaffected, which is why raising m0 made it worse). Debug-viz moved to a flags bit; default m0 -0.25. Pacing: snap the interp present deadline to the even vblank grid; complete the motion to curr on a late-frame hold. Harness (debug, default-off): debug.winnative.fgsynth rigid-shift / fgpat noise field + full-res crop + flow-field dump for ground-truth warp verification.
conv36 was dispatched cinT=4 but the trained wnfg_36 is cinT=2 (8ch = prev4 + curr4), so it read past its weight file -> bias-only, content-independent features that froze the whole flow chain. Fixed: cnn_concat2 feeds prev.L0 ++ curr.L0 at cinT=2, plus the missing trained wnfg_45 (8->16) expansion before conv42. Verified on-device the gamma now produces content-varying features. The flow magnitude is still a fixed ~19 regardless of motion - the producer never tracked motion. Root-caused to the cost volume (wnfg_14) self-correlating combined prev+curr features plus a dominant flow-regression bias, with the coarsest pyramid level sub-pixel for the test shifts. Feeding the cost stage separated per-frame features is correct but insufficient alone; the full fix is a multi-stage RE, documented separately. Flow images get TRANSFER_SRC so the controlled-motion harness can dump them.
Resolves the sync (was 57 ahead / 13 behind upstream). Conflicts:
- vk_renderer.c: kept the FG-integrated renderer (manage_scene_targets helper + record_and_submit_frame which already carries the scene-snapshot-under-mutex + graveyard improvement); merged upstream's no-surface present-mode guard with the FG worker stop/restart.
- Steam DB converters (AppConverter, UserFileInfoListConverter): took upstream's schema-drift tolerance fixes (no FG changes there).
- strings.xml: kept both the FG strings and upstream's file-manager strings.
Going to need to ensure that any change in the xserver menu has the new PANE_NAV