-
Notifications
You must be signed in to change notification settings - Fork 0
docs(splat-native): address review feedback on #212 (ray segmentation + Cholesky scratch buffer) #213
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs(splat-native): address review feedback on #212 (ray segmentation + Cholesky scratch buffer) #213
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -48,16 +48,18 @@ pub fn batched_cholesky_3x3( | |
| ); | ||
|
|
||
| pub fn batched_mahalanobis( | ||
| query_xyz: &[f32], // M × 3 query points | ||
| mu_xyz: &[f32], // N × 3 Gaussian centroids | ||
| sigma_packed: &[f32], // N × 6 packed Σ | ||
| out_dist_sq: &mut [f32], // M × N output (squared Mahalanobis) | ||
| query_xyz: &[f32], // M × 3 query points | ||
| mu_xyz: &[f32], // N × 3 Gaussian centroids | ||
| sigma_packed: &[f32], // N × 6 packed Σ | ||
| cholesky_scratch: &mut [f32], // N × 6 — caller-provided packed-L scratch (24 MiB @ N=1M); function MUST NOT allocate (see §4.2) | ||
| out_dist_sq: &mut [f32], // M × N output (squared Mahalanobis) | ||
| ); | ||
|
|
||
| pub fn batched_opacity_blend( | ||
| sorted_amplitudes: &[f32], // N (front-to-back along view ray) | ||
| opacity_lut: &[u8; 256], // amplitude → opacity LUT | ||
| out_alpha: &mut [u8], // composited alpha per pixel | ||
| sorted_amplitudes: &[f32], // flat; all rays' samples concatenated (front-to-back per ray) | ||
| ray_offsets: &[u32], // length = n_rays + 1 (CSR-style); ray r's range is [ray_offsets[r]..ray_offsets[r+1]) (see §4.3) | ||
| opacity_lut: &[u8; 256], // amplitude → opacity LUT | ||
| out_alpha: &mut [u8], // length = n_rays — composited alpha per ray | ||
| ); | ||
|
|
||
| pub fn batched_sh_eval_l3( | ||
|
|
@@ -194,11 +196,15 @@ L₂₂ = √(Σ₂₂ - L₂₀² - L₂₁²) | |
| /// - Degenerate Σ (Cholesky NaN) yields f32::INFINITY (sortable; never NaN). | ||
| /// - SIMD-batched 16×16 on AVX-512, 4×4 on NEON. | ||
| pub fn batched_mahalanobis( | ||
| query_xyz: &[f32], mu_xyz: &[f32], sigma_packed: &[f32], out_dist_sq: &mut [f32], | ||
| query_xyz: &[f32], // length 3*M | ||
| mu_xyz: &[f32], // length 3*N | ||
| sigma_packed: &[f32], // length 6*N (upper-triangle Σ per Gaussian) | ||
| cholesky_scratch: &mut [f32], // length 6*N (caller-provided; holds packed L per Gaussian) | ||
| out_dist_sq: &mut [f32], // length M*N (row-major) | ||
| ); | ||
| ``` | ||
|
|
||
| **Implementation note:** internally calls `batched_cholesky_3x3` once on `sigma_packed`, caches L (heap-free via stack or caller-provided scratch), then triangular-solve + squared norm per (m, n) pair. | ||
| **Implementation note:** internally calls `batched_cholesky_3x3` on `sigma_packed` once per call, writing packed L into `cholesky_scratch` (caller-provided; zero-allocation contract). The caller sizes the buffer as `6 * N * size_of::<f32>()` — for `N = 1_000_000` Gaussians this is **24 MiB**, which is not stack-feasible; callers must allocate it once at engine init and re-use across frames (matches the `splat-fit` / registration loop pattern). For small `N` (e.g. `N ≤ 8192`) callers MAY pass a stack-resident buffer. The function MUST NOT allocate internally. | ||
|
|
||
| **Tests:** | ||
| - Reference comparison against scipy `scipy.spatial.distance.mahalanobis` on random points + Σ. | ||
|
|
@@ -226,14 +232,26 @@ pub fn batched_mahalanobis( | |
| /// - Composition: α_new = α_old + (1 - α_old) · α_i. | ||
| /// - Internal accumulator: u16 with saturation; truncate to u8 at end. | ||
| pub fn batched_opacity_blend( | ||
| sorted_amplitudes: &[f32], opacity_lut: &[u8; 256], out_alpha: &mut [u8], | ||
| sorted_amplitudes: &[f32], // flat; contains all rays' samples concatenated | ||
| ray_offsets: &[u32], // length = n_rays + 1 (CSR-style); ray r's range is [ray_offsets[r]..ray_offsets[r+1]) | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
The per-primitive spec now makes Useful? React with 👍 / 👎.
Owner
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Addressed in commit pub fn batched_opacity_blend(
sorted_amplitudes: &[f32], // flat; all rays' samples concatenated (front-to-back per ray)
ray_offsets: &[u32], // length = n_rays + 1 (CSR-style); ray r's range is [ray_offsets[r]..ray_offsets[r+1]) (see §4.3)
opacity_lut: &[u8; 256], // amplitude → opacity LUT
out_alpha: &mut [u8], // length = n_rays — composited alpha per ray
);Closing — skeleton and §4.3 detailed spec are now in sync. |
||
| opacity_lut: &[u8; 256], | ||
| out_alpha: &mut [u8], // length = n_rays | ||
| ); | ||
| ``` | ||
|
|
||
| **Per-ray segmentation contract.** A renderer composites N independent view rays per frame; each ray has its own front-to-back-sorted Gaussian sequence. `ray_offsets` is a CSR-style prefix-sum (length `n_rays + 1`) so ray `r`'s amplitudes are `sorted_amplitudes[ray_offsets[r] as usize..ray_offsets[r+1] as usize]` and `out_alpha[r]` is its composited alpha. Constraints: | ||
| - `ray_offsets[0] == 0` and `ray_offsets[n_rays] == sorted_amplitudes.len() as u32` (assert-on-debug). | ||
| - A ray with `ray_offsets[r] == ray_offsets[r+1]` (empty) yields `out_alpha[r] = 0`. | ||
| - Per-frame amplitude quantization (the 256-bucket LUT input) is computed by the caller from the per-frame max amplitude; `opacity_lut` is a frame-global constant for that pass. | ||
|
|
||
| **Implementation note:** the SIMD inner loop processes one ray's range as a contiguous front-to-back sweep; rays are independent (no cross-ray data dependence) so the outer ray loop is trivially parallelizable. | ||
|
|
||
| **Tests:** | ||
| - Reference comparison against scalar reference for known sequences. | ||
| - Reference comparison against scalar reference for known sequences (single-ray + multi-ray). | ||
| - Saturation at full opacity (sequence of high-amplitude Gaussians → α = 255). | ||
| - Empty sequence → α = 0. | ||
| - Empty ray (`ray_offsets[r] == ray_offsets[r+1]`) → α = 0. | ||
| - Multi-ray independence (concatenated rays produce same per-ray output as separate single-ray calls). | ||
| - `ray_offsets` invariant violations (debug assert on `ray_offsets[0] != 0` or `ray_offsets[last] != amplitudes.len()`). | ||
| - SIMD parity. | ||
|
|
||
| ### 4.4 `batched_sh_eval_l3` | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The detailed spec now requires
cholesky_scratch, but the module skeleton at the top of this same plan still advertisesbatched_mahalanobis(query_xyz, mu_xyz, sigma_packed, out_dist_sq)without any scratch buffer. Since this document is the handoff for implementingsrc/simd_splat.rs, an implementer following the skeleton would recreate the exact no-scratch API this change is meant to eliminate, leaving the zero-allocation contract unenforceable.Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in commit
3c78901on the same PR branch before merge — the module skeleton at line 50 now shows thecholesky_scratch: &mut [f32]parameter with the inline note pointing to §4.2 for sizing + the "function MUST NOT allocate" contract. Verified on merged master tipe2ec430:Closing — skeleton and §4.2 detailed spec are now in sync.