Multi-target ports + unified bench infrastructure #3
Open
navado wants to merge 7 commits into facex-engine:main
Introduces a single source of truth for FaceX latency numbers across
build flavours, OSes, and stages — replaces scattered ad-hoc benches
each with its own format. Also lays the AArch64 build foundation so
`make` succeeds on Apple Silicon and Linux ARM hosts.
Benchmark tooling
-----------------
* tools/bench.c — cross-platform synthetic latency bench. Same source
compiles on macOS arm64/x86, Linux aarch64, future i.MX targets.
Three output formats (md/csv/json) emitting the same data; reports
compiled-in vs runtime-active backends. Stages: `embed`, `e2e`,
`both`.
* tools/bench_camera_mac.swift — live AVFoundation camera bench
(Mac-only). `--summary` mode emits one CSV row at exit using the
same schema as facex-bench, so live-camera and engine numbers can
join the unified table.
* tools/build_bench_camera_mac.sh — swiftc invocation + bridging
header. Auto-detects optional libfacex symbols (Accelerate /
Core ML) and links matching frameworks so a stale lib doesn't
silently break the build.
* scripts/bench_all.sh — sweeps build-flag combos, runs facex-bench
against each, emits unified Markdown / CSV.
* scripts/test_all.sh — single-host test harness. Topic commits
amend it with their own checks.
* docs/benchmarking.md — which-tool-answers-which-question matrix,
CSV schema reference, recipe for combining engine + camera output.
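The tools above lean on one idea: every bench emits the same record, only serialized differently. A hypothetical C sketch of that one-record, many-formats pattern (struct and field names are illustrative, not the real facex-bench schema):

```c
#include <stdio.h>
#include <string.h>

/* One record struct, one emitter per format. Field names are
 * illustrative, not the actual facex-bench CSV schema. */
typedef struct {
    const char *host, *backend, *stage;
    double median_ms, p99_ms;
} BenchRow;

static void emit_csv(const BenchRow *r, char *out, size_t n) {
    snprintf(out, n, "%s,%s,%s,%.2f,%.2f",
             r->host, r->backend, r->stage, r->median_ms, r->p99_ms);
}

static void emit_json(const BenchRow *r, char *out, size_t n) {
    snprintf(out, n,
        "{\"host\":\"%s\",\"backend\":\"%s\",\"stage\":\"%s\","
        "\"median_ms\":%.2f,\"p99_ms\":%.2f}",
        r->host, r->backend, r->stage, r->median_ms, r->p99_ms);
}

int bench_row_selftest(void) {
    BenchRow r = { "mac-m2", "neon", "embed", 4.60, 5.10 };
    char csv[128], json[256];
    emit_csv(&r, csv, sizeof csv);
    emit_json(&r, json, sizeof json);
    if (strcmp(csv, "mac-m2,neon,embed,4.60,5.10") != 0) return 1;
    if (strstr(json, "\"median_ms\":4.60") == NULL) return 2;
    return 0;
}
```

Adding a fourth output format is then one new emitter, not a new bench.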
AArch64 / Mac build foundation
------------------------------
Without this, `make` on Apple Silicon or Linux aarch64 produced a
broken binary (silently wrong output via the column-panel scalar
fallback in transformer_ops.c).
* Makefile arch detection via `uname -m`. arm64 path links
src/gemm_stub.c (the existing INT8 GEMM is x86-only) and
src/threadpool_pthread.c (linux/futex / win/WaitOnAddress aren't
portable). Defines FACEX_NO_INT8 so the engine takes the
FP32-packed path.
* src/threadpool_pthread.c — pthread + condvar pool (~80 LOC).
* src/edgeface_engine.c — gates the INT8 weight-packing block on
!FACEX_NO_INT8 so mm->packed stays NULL on ARM and the matmul
dispatch falls cleanly through to FP32.
* src/transformer_ops.c — fixes column-panel scalar fallbacks for
matmul_fp32_packed{,_bias,_bias_gelu} that previously fed packed B
into matmul_fp32 (wrong layout, garbage output on every non-x86
host). Adds hand-written AArch64 NEON kernels (NR=8, MR=4,
FMA-based; same packed format as AVX2). Output matches the scalar
path to within 1 ULP. NEON is portable AArch64 and helps i.MX too —
not Mac-specific.
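To make the layout bug concrete: below is an illustrative (not in-tree) scalar sketch of the column-panel packing plus a kernel that consumes it correctly. Feeding this packed B into a plain row-major matmul is exactly the garbage-output failure described above.

```c
#include <math.h>
#include <stddef.h>

/* NR=8 to match the packed format described above. Assumes N % NR == 0. */
enum { NR = 8 };

/* Pack row-major B[K][N] into column panels: panel p holds columns
 * p*NR .. p*NR+NR-1, stored as K rows of NR floats each. */
static void pack_b(const float *B, float *Bp, int K, int N) {
    for (int p = 0; p < N / NR; p++)
        for (int k = 0; k < K; k++)
            for (int j = 0; j < NR; j++)
                Bp[(p * K + k) * NR + j] = B[k * N + p * NR + j];
}

/* Scalar kernel that walks the PACKED layout, not row-major B. */
static void matmul_packed_scalar(const float *A, const float *Bp,
                                 float *C, int M, int K, int N) {
    for (int p = 0; p < N / NR; p++)
        for (int i = 0; i < M; i++)
            for (int j = 0; j < NR; j++) {
                float acc = 0.0f;
                for (int k = 0; k < K; k++)
                    acc += A[i * K + k] * Bp[(p * K + k) * NR + j];
                C[i * N + p * NR + j] = acc;
            }
}

int packed_matmul_selftest(void) {
    enum { M = 4, K = 16, N = 8 };
    float A[M * K], B[K * N], Bp[K * N], C[M * N];
    for (int i = 0; i < M * K; i++) A[i] = (float)(i % 7) - 3.0f;
    for (int i = 0; i < K * N; i++) B[i] = (float)(i % 5) - 2.0f;
    pack_b(B, Bp, K, N);
    matmul_packed_scalar(A, Bp, C, M, K, N);
    /* reference: naive row-major matmul */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float ref = 0.0f;
            for (int k = 0; k < K; k++) ref += A[i * K + k] * B[k * N + j];
            if (fabsf(C[i * N + j] - ref) > 1e-4f) return 1;
        }
    return 0;
}
```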
Documentation
-------------
* docs/implementation.md — new file replacing the forward-looking
docs/plan/embedded_port_plan.md. This is the implementation-details
document; each topic commit (this one, Mac, i.MX, ESP32)
amends a new section.
* docs/coverage_matrix.md — initial table (CPU library, bench
infrastructure). Subsequent topic commits append rows.
* CLAUDE.md — repo conventions, build/test commands, architecture
summary. Touches Bench / Mac / i.MX / ESP32 surfaces incidentally;
topic commits amend their own sections.
Verification on mac-m2
----------------------
* `make` builds clean (`Built libfacex.a (arm64)`).
* `make test` golden test passes (`||emb||² = 0.076`, sim 1.000).
* `make bench && ./facex-bench` produces md/csv/json output;
~4.6 ms median embed, ~8.4 ms e2e (NEON FP32 packed).
* `make bench-camera && ./facex-camera-bench` 29 fps end-to-end.
* `scripts/test_all.sh` all checks PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three are opt-in build flags, not the default. Default `make`
still produces the same portable NEON-only artifact for distribution
to any Mac. Each flag composes cleanly with the others; the
dispatcher in `matmul_fp32_packed` chains them
Accelerate → SME → NEON.
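The chain works because each backend either handles a call or returns -1 to decline. A hedged sketch with stand-in backends (names and heuristics are illustrative, not the in-tree dispatcher):

```c
#include <stddef.h>

typedef int (*matmul_fn)(const float *A, const float *B, float *C,
                         int M, int K, int N);

static int decline_small(const float *A, const float *B, float *C,
                         int M, int K, int N) {
    (void)A; (void)B; (void)C; (void)K; (void)N;
    return M < 4 ? -1 : 0;           /* stand-in for the AMX size heuristic */
}

static int always_handle(const float *A, const float *B, float *C,
                         int M, int K, int N) {
    (void)A; (void)B; (void)C; (void)M; (void)K; (void)N;
    return 0;                        /* stand-in for the NEON baseline */
}

/* Walk the chain in order; first backend that returns 0 wins. */
static int dispatch(matmul_fn *chain, int n_backends,
                    const float *A, const float *B, float *C,
                    int M, int K, int N) {
    for (int i = 0; i < n_backends; i++)
        if (chain[i](A, B, C, M, K, N) == 0)
            return i;                /* index of backend that handled it */
    return -1;
}

int chain_selftest(void) {
    matmul_fn chain[] = { decline_small, always_handle };
    /* M=1: first backend declines, call falls through to the second */
    if (dispatch(chain, 2, NULL, NULL, NULL, 1, 16, 8) != 1) return 1;
    /* M=8: first backend takes it */
    if (dispatch(chain, 2, NULL, NULL, NULL, 8, 16, 8) != 0) return 2;
    return 0;
}
```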
Apple Accelerate / AMX (`make ACCELERATE=1`)
---------------------------------------------
* src/backend_accelerate.c — wraps cblas_sgemm. The wrapper unpacks
the column-panel B back to row-major, dispatches via Accelerate;
AMX wins for M ≥ 4 and M*K*N ≥ 4096, otherwise returns -1 so the
in-tree NEON kernel handles the warmup-dominated calls.
* Self-check on first matmul: 4×16×8 cblas vs scalar reference,
1e-4 relative tolerance. Mismatch → facex_disable_accelerate()
and the rest of the process stays on NEON.
* Measured on M2: 4.6 → 3.5 ms / embed (-24%), 9 → 7.5 ms e2e (-17%).
* Embedding bytes identical to NEON within ULP.
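The first-matmul self-check boils down to a relative-tolerance compare that latches a disable flag. An illustrative sketch (not the in-tree code; names are stand-ins):

```c
#include <math.h>

static int g_backend_disabled = 0;

/* Relative tolerance, with a floor so tiny reference values don't
 * blow up the denominator. */
static int close_rel(float got, float want, float rtol) {
    float denom = fabsf(want) > 1.0f ? fabsf(want) : 1.0f;
    return fabsf(got - want) / denom <= rtol;
}

/* Compare fast-path output against a scalar reference; one bad
 * value latches the disable flag for the rest of the process. */
static void self_check(const float *fast, const float *ref, int n) {
    for (int i = 0; i < n; i++)
        if (!close_rel(fast[i], ref[i], 1e-4f)) {
            g_backend_disabled = 1;
            return;
        }
}

int selfcheck_selftest(void) {
    float ref[4]  = { 1.0f, -2.0f, 3.0f, 4.0f };
    float good[4] = { 1.00001f, -2.0f, 3.0f, 4.0f };
    float bad[4]  = { 1.0f, -2.0f, 3.5f, 4.0f };
    g_backend_disabled = 0;
    self_check(good, ref, 4);
    if (g_backend_disabled) return 1;   /* within tolerance: stays enabled */
    self_check(bad, ref, 4);
    if (!g_backend_disabled) return 2;  /* divergence: disabled */
    return 0;
}
```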
SME / SME2 (`make SME=1`) — Apple M4 and newer
------------------------------------------------
* src/transformer_ops_sme.c — __arm_locally_streaming __arm_new("za")
matmul_fp32_packed using FMOPA outer products. Pre-transposes the
row tile of A into a [K, SVL] scratch (gather not allowed in
streaming mode); zeroes ZA tile 0; accumulates K outer products;
reads the rows back with svread_hor_za32_f32_m. NR=8 to match
the existing FP32 packed format.
* src/cpu_features.{h,c} — sysctl-based runtime probe for FEAT_SME /
FEAT_SME2; cached, atomic, lock-free, no external deps. Designed
to host future runtime probes (FP16 / BF16 / dotprod) too.
* Build isolation: -march=armv9-a+sme is applied PER-FILE
(transformer_ops_sme.c only). Without that, clang auto-vectorizes
plain C in transformer_ops.c using SVE/SME instructions that trap
on M1/M2/M3. Verified post-fix: transformer_ops.o has zero
rdvl/smstart/fmopa; transformer_ops_sme.o has the expected
fmopa za0.s.
* Self-check: tiny 4×8 SME-vs-scalar consistency test on first
matmul. Mismatch → facex_disable_sme() and stay on NEON.
* Hardware status: COMPILES + emits real SME asm; NOT directly
hardware-tested (no M4 here). Self-check guards correctness on
M4 — owners get NEON speed if SME has a bug, never wrong output.
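The cached, lock-free probe pattern from cpu_features.{h,c} can be sketched as a tri-state atomic. Assumption: the real code differs in detail, and the sysctl call (hw.optional.arm.FEAT_SME on macOS) is stubbed out here so the sketch compiles anywhere:

```c
#include <stdatomic.h>

/* -1 = not yet probed, 0 = absent, 1 = present. */
static _Atomic int g_sme_cached = -1;

static int probe_hw(void) {
    /* Stand-in for the sysctl query; always "absent" in this sketch. */
    return 0;
}

int cpu_has_sme(void) {
    int v = atomic_load_explicit(&g_sme_cached, memory_order_relaxed);
    if (v < 0) {
        v = probe_hw();
        /* Benign race: concurrent first callers store the same value. */
        atomic_store_explicit(&g_sme_cached, v, memory_order_relaxed);
    }
    return v;
}

int probe_selftest(void) {
    int a = cpu_has_sme();
    int b = cpu_has_sme();          /* second call hits the cache */
    return (a == b && (a == 0 || a == 1)) ? 0 : 1;
}
```

The expensive query runs at most a handful of times; every later call is a relaxed atomic load.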
Core ML / Apple Neural Engine (`make COREML=1`)
------------------------------------------------
* src/backend_coreml.m — ARC-managed Obj-C bridge that loads a
precompiled `.mlpackage` (auto-compiles to .mlmodelc on first
call) and dispatches MLModel prediction. Runtime compute_units
hint: ALL / CPU+GPU / CPU-only / CPU+ANE.
* include/facex_coreml.h — public C API: facex_coreml_init,
facex_coreml_embed, facex_coreml_last_dispatch, facex_coreml_free.
* Output L2-normalized so cosine similarity stays comparable to the
CPU backend regardless of whether the .mlpackage itself normalizes
its output.
* Graceful failure: missing .mlpackage returns NULL with a clear
stderr message, no crash. (Validated by scripts/test_all.sh.)
* tools/export_coreml.py — ONNX → .mlpackage via
coremltools.convert(convert_to="mlprogram") with optional INT8
palettization (default 6 bits/weight via kmeans, drops package
size to ~1.8 MB and unlocks ANE INT8 dispatch).
* Hardware status: COMPILE-TESTED. Runtime ANE dispatch is not
end-to-end validated — that requires running export_coreml.py
against an actual EdgeFace ONNX export.
Universal Mac binary (`make mac-universal`)
--------------------------------------------
* Cross-compiles arm64 + x86_64 slices with target-specific flags,
stashes each in /tmp (the in-Makefile clean target wipes its own
artifacts), then `lipo`s them into libfacex-universal.a.
* Verified: arm64 slice has 293 NEON insts (fmla/fmul/fadd) in
transformer_ops; x86_64 slice has 786 AVX2 insts (vfmadd/vmovups).
Real arch-specific code in both halves.
Smoke test (`make mac-test` → tests/test_mac.c)
------------------------------------------------
* Loads weights, embed sanity, determinism, self/cross similarity,
latency stats (min/median/p99 over 50 iters), end-to-end
detect+align+embed on tests/test_face_160.raw.
* Now reports both COMPILED-IN and RUNTIME-ACTIVE backends so the
same binary tells you what will actually dispatch:
Backends compiled in: Accelerate SME NEON
Backends active at runtime: Accelerate(AMX) NEON
(correctly shows SME compiled but inert on M2 — sysctl FEAT_SME=0).
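The compiled-in vs runtime-active split can be sketched as two lists built from the same #ifdef blocks, with the active list additionally consulting runtime probes. Macro and probe names here are stand-ins, not the real build flags:

```c
#include <string.h>
#include <stddef.h>

#define HAVE_ACCELERATE 1   /* pretend build-time flags for the sketch */
#define HAVE_SME        1

static int fake_runtime_has_sme(void) { return 0; }  /* e.g. an M2 host */

/* Build both report strings from one set of #ifdef blocks so the
 * two lists can never drift apart. */
static void report(char *compiled, char *active, size_t n) {
    compiled[0] = active[0] = '\0';
#if HAVE_ACCELERATE
    strncat(compiled, "Accelerate ", n - strlen(compiled) - 1);
    strncat(active,   "Accelerate ", n - strlen(active) - 1);
#endif
#if HAVE_SME
    strncat(compiled, "SME ", n - strlen(compiled) - 1);
    if (fake_runtime_has_sme())
        strncat(active, "SME ", n - strlen(active) - 1);
#endif
    strncat(compiled, "NEON", n - strlen(compiled) - 1);
    strncat(active,   "NEON", n - strlen(active) - 1);
}

int report_selftest(void) {
    char compiled[64], active[64];
    report(compiled, active, sizeof compiled);
    if (!strstr(compiled, "SME")) return 1;   /* compiled in */
    if (strstr(active, "SME"))    return 2;   /* but inert at runtime */
    return 0;
}
```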
Documentation
-------------
* docs/mac.md — full Mac story: build modes, runtime fallback chain,
permissions, perf reference table, troubleshooting.
* docs/implementation.md — appended "Apple Silicon / Mac perf paths"
section.
* docs/coverage_matrix.md — appended Mac rows (mac-test, ACCELERATE,
SME, COREML, mac-universal, export_coreml.py).
* scripts/test_all.sh — appended the entire Mac variants block:
builds with each flag combo, validates symbol presence, links the
expected frameworks, runs mac-test, checks fmopa-in / rdvl-out
isolation, lipo per-slice instruction-count probes.
* CLAUDE.md — make-target list extended with Mac options; new
bullet documenting the opt-in flag policy.
* README.md — short Mac section + link to docs/mac.md.
* examples/example.c — orphaned 1-line API fix
(facex_init went from 2 to 3 args repo-side; example didn't follow).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A second build of FaceX (libfacex_npu.{so,dylib}) that dispatches
inference through the TensorFlow Lite C API to a runtime-selected
delegate. Same source / same artefact targets three NXP SoCs:
i.MX 8M Plus → NXP VxDelegate (libvx_delegate.so)
i.MX 93 → Arm Ethos-U external (libethosu_delegate.so)
i.MX 95 → Arm Ethos-U external (libethosu_delegate.so)
any AArch64 → XNNPACK CPU fallback (built into TFLite)
Code
----
* include/facex_backend.h — pluggable FacexBackend vtable. i.MX is
the first concrete consumer beyond the in-tree CPU backend.
* include/facex_npu.h — facex_npu_init / _embed / _detect / _free,
mirrors facex.h shape so callers can swap CPU/NPU backends.
* src/backend_tflite.c — dlopen-based delegate loader (vx →
ethos-u → armnn fallback chain), TfLiteModel + Interpreter setup,
INT8 quantize/dequantize for the embedder, L2 normalize.
Detector path is intentionally -ENOSYS — the recommended
deployment is the hybrid pipeline (CPU detect via libfacex.a,
NPU embed via this backend). 80% of the perf benefit, none of
the post-processing risk.
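A hedged sketch of the vtable-plus-ENOSYS shape (struct and function names are assumptions, not the real facex_backend.h): callers that probe detect() get -ENOSYS and can fall back to the CPU detector, which is the hybrid pipeline.

```c
#include <errno.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int (*embed)(const unsigned char *face, float *out, int dim);
    int (*detect)(const unsigned char *frame, void *faces, int max);
} FacexBackendSketch;

static int npu_embed(const unsigned char *face, float *out, int dim) {
    (void)face;
    for (int i = 0; i < dim; i++) out[i] = 0.0f;  /* stand-in result */
    return 0;
}

static int npu_detect(const unsigned char *frame, void *faces, int max) {
    (void)frame; (void)faces; (void)max;
    return -ENOSYS;   /* detector deliberately unimplemented on the NPU */
}

static const FacexBackendSketch npu_backend = { "npu", npu_embed, npu_detect };

int vtable_selftest(void) {
    float emb[8];
    if (npu_backend.embed(NULL, emb, 8) != 0) return 1;
    if (npu_backend.detect(NULL, NULL, 0) != -ENOSYS) return 2;
    return 0;
}
```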
Tooling (offline model conversion)
----------------------------------
* tools/onnx_to_tflite.py — ONNX → SavedModel (via onnx2tf) →
INT8 TFLite (via tf.lite). Accepts a calibration directory of
representative face crops; falls back to noise calibration with
a warning. Uses subprocess.run with arg lists, not os.system.
* tools/compile_vela.sh — wraps Arm's Vela compiler to produce an
Ethos-U65 command stream from an INT8 .tflite. Defaults to
ethos-u65-256 (i.MX 93 / 95). Prints op coverage from Vela's
summary CSV so any CPU-fallback layers are visible.
Build
-----
Makefile gains four new targets:
make imx-npu — host build (e.g. for XNNPACK fallback test)
make imx93 SDK=… — cross-compile for i.MX 93 (A55 + Ethos-U65)
make imx95 SDK=… — cross-compile for i.MX 95 (A55 + Ethos-U65)
make imx8mp SDK=… — cross-compile for i.MX 8M Plus (A53 + VIP9000)
All four produce libfacex_npu.{so,dylib}; the difference is the
-mcpu tuning and which delegate the runtime picks at first init.
TFLite lives behind FACEX_BACKEND_TFLITE so the existing CPU build
(`make`) is unchanged — no new mandatory dependency.
Test
----
* tests/test_imx_npu_compile.c — API surface compile + link smoke,
runs without an actual NPU device. With one or two .tflite paths,
also reports the active delegate. Useful for CI on hosts without
NXP / Arm hardware.
* scripts/test_all.sh — appended NPU compile-check section using a
minimal TFLite header stub so the syntax check works on any host.
Validation
----------
* Default `make` still builds and `make mac-test` still passes
byte-identical (the NPU code is gated on FACEX_BACKEND_TFLITE).
* src/backend_tflite.c + tests/test_imx_npu_compile.c syntax-check
cleanly against the minimal TFLite C-API header stub.
* Hardware bring-up on real i.MX EVK is the next milestone — see
docs/imx_npu.md §5 "Hardware bring-up checklist".
Documentation
-------------
* docs/imx_npu.md — full deployment guide: model conversion
pipeline, host vs cross-compile builds, hybrid pipeline wiring,
per-SoC bring-up checklist, known limitations.
* docs/implementation.md — appended "i.MX NPU library" section.
* docs/coverage_matrix.md — appended NPU rows.
* CLAUDE.md — make-target list extended with imx-* options; new
bullet documenting the libfacex_npu library, hybrid pipeline,
and -ENOSYS detector behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A complete ESP-IDF project that brings up the MIPI-CSI camera on an
ESP32-P4 and feeds frames into the FaceX detection wrapper. Capture
path is real and runnable; the face-detection backend ships in three
selectable forms (stub / native / espnn) gated by Kconfig.
Reasonable assumptions baked in
-------------------------------
The camera bridge ships complete; the model story is staged because:
1. Bring-up first, model second. Customers integrating FaceX on P4
   need to first prove camera + downscale + UART work. The default
   `stub` backend emits a deterministic synthetic face per frame so
   every code path is exercised without committing to a model.
2. Native backend is for evaluation only. EdgeFace-XS compiles and
   runs on P4 but at 1-3 s/frame — demonstrably not a product.
   Provided so partners can verify "the engine technically works".
3. EdgeFace-Nano is future work. Distilled model (~300 K params,
   64×64 input, 256-d, no XCA attn) plus an ESP-NN backend (PIE-SIMD
   INT8 conv) is the production target. Kconfig slot
   CONFIG_FACEX_BACKEND_ESPNN is reserved (`depends on 0`) so
   adopters can see the eventual shape.
components/facex/
-----------------
* Kconfig — choice between Stub / Native / ESPNN(reserved) backends,
  detector input W/H, optional per-frame log.
* CMakeLists.txt — registers the component, conditionally pulls
  src/edgeface_engine.c et al. when FACEX_BACKEND_NATIVE is set.
* include/facex_esp.h — small init / detect / free API. Mirrors
  FaceXResult (minus full embedding) so applications don't need to
  know which backend is running.
* src/facex_esp.c — dispatches detect calls. Stub backend emits one
  deterministic synthetic face per frame with smooth bbox jitter.
  Native backend forwards into the existing C engine — works but
  ~1-3 s/frame on P4, evaluation only.
examples/esp32p4_camera/
------------------------
* main/app_main.c — full ESP-IDF camera_driver recipe per
  https://docs.espressif.com/.../camera_driver.html — LDO 2.5 V for
  the CSI PHY, SCCB I2C, esp_cam_sensor_detect (auto-picks SC2336
  etc.), set_format, esp_cam_new_csi_ctlr, on_get_new_trans /
  on_trans_finished callbacks (IRAM_ATTR), PSRAM frame buffer ring,
  capture_task that downscales RGB565 → RGB888 and calls
  facex_esp_detect, requeues. Logs FPS + detection latency once per
  second.
* main/Kconfig.projbuild — sensor resolution, lane count, lane
  bit-rate, SCCB pins, sensor reset+pwdn GPIOs.
* main/idf_component.yml — pulls esp_cam_sensor + esp_video.
* sdkconfig.defaults — target esp32p4, PSRAM hex, CPU @ 360 MHz,
  main task stack 8 KB.
* README.md — build + flash, expected console output, backend
  selection table.
Documentation
-------------
* docs/esp32p4.md — full deployment guide: status table, prereqs
  (IDF v5.4+), backend selection, resource budget on the
  Function-EV-Board with the SC2336 sensor, troubleshooting.
* docs/implementation.md — appended "ESP32-P4 ESP-IDF component"
  section.
* docs/coverage_matrix.md — appended ESP32 rows.
* scripts/test_all.sh — appended ESP32-P4 syntax check using
  synthesized IDF header stubs (esp_err / esp_log / esp_timer /
  sdkconfig) so the wrapper compiles cleanly without a full IDF
  install.
* CLAUDE.md — note added so future sessions know this is an IDF
  component (idf.py), not a Makefile target.
Honest scope
------------
Camera + dispatch bridge is complete and runnable. Model story
(EdgeFace-Nano distillation, ESP-NN backend) is the follow-up. Once
it lands, only the Kconfig backend toggle changes; everything else
in this commit stays. Default host build (libfacex.a, mac-test)
untouched and still passing byte-identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous code claimed i.MX 95 used Arm Ethos-U65, but it actually ships NXP's eIQ Neutron N3 NPU (Ethos-U65 is i.MX 93 only). Register libneutron_delegate.so in the TFLite delegate loader and fix the documentation across CLAUDE.md, the Makefile, the public header, and docs/imx_npu.md (per-SoC bring-up table, offline compiler note for neutron-converter vs vela). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Neutron delegate (and Ethos-U, for that matter) silently produces
"0 nodes delegated" when handed a .tflite that wasn't pre-compiled by
the matching offline tool — same latency as XNNPACK, no NPU offload.
Surface this failure mode two ways:
- tools/compile_neutron.sh — thin wrapper around NXP's
  neutron-converter (eIQ Toolkit), mirroring tools/compile_vela.sh:
  same args shape, same output naming convention
  (<base>_neutron.tflite).
- backend_tflite.c — when verbose=1 and a Neutron/Ethos-U delegate is
  picked, print a one-shot hint at init time pointing at the right
  offline tool, so users can immediately interpret a subsequent
  "0 nodes delegated" line from TFLite.
Also expand docs/imx_npu.md §1 with full instructions for obtaining
neutron-converter (nxp.com download path, host-OS install matrix,
env-script activation, BSP/toolkit version pinning).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets us compare CPU NEON, XNNPACK, eIQ Neutron, Ethos-U, and
VxDelegate side-by-side in a single CSV instead of running NXP's
benchmark_model separately and reconciling output formats. Pieces:
- tools/bench_npu.c — synthetic-input bench that mirrors
  facex-bench's CSV/MD/JSON schema. Embed stage only
  (facex_npu_detect is -ENOSYS).
- include/facex_npu.h — extend FaceXNpuOptions with
  external_delegate_path so callers (the bench, eventually
  production apps) can dlopen any TFLite-external-delegate-ABI .so
  by absolute path, matching how benchmark_model exposes
  --external_delegate_path.
- src/backend_tflite.c — derive_path_name() picks a tidy logging
  name from a delegate path (libneutron_delegate.so → "neutron",
  libarmnnDelegate.so → "armnn"), then select_delegate honours a
  non-NULL path before walking the registry. preferred_delegate
  continues to work unchanged.
- Makefile — facex-bench-npu target (depends on libfacex_npu.so),
  added to clean.
Docs:
- docs/benchmarking.md — new section for facex-bench-npu, three-way
  comparison recipe (NEON / XNNPACK / Neutron in one /tmp/cmp.csv).
- docs/imx_npu.md — testing section gains a bench subsection
  cross-linking to docs/benchmarking.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
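The job derive_path_name() performs can be illustrated with a small reimplementation (an assumption about its behaviour, not the in-tree code): take the basename, strip a "lib" prefix, cut at the first '_' or '.'.

```c
#include <string.h>
#include <stddef.h>

/* "/usr/lib/libneutron_delegate.so" -> "neutron",
 * "libvx_delegate.so"               -> "vx". */
static void derive_name_sketch(const char *path, char *out, size_t n) {
    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;               /* basename */
    if (strncmp(base, "lib", 3) == 0) base += 3; /* strip "lib" prefix */
    size_t i = 0;
    while (i + 1 < n && base[i] && base[i] != '_' && base[i] != '.') {
        out[i] = base[i];                        /* copy up to '_' or '.' */
        i++;
    }
    out[i] = '\0';
}

int name_selftest(void) {
    char buf[32];
    derive_name_sketch("/usr/lib/libneutron_delegate.so", buf, sizeof buf);
    if (strcmp(buf, "neutron") != 0) return 1;
    derive_name_sketch("libvx_delegate.so", buf, sizeof buf);
    if (strcmp(buf, "vx") != 0) return 2;
    return 0;
}
```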
Multi-target ports + unified bench infrastructure
Adds first-class build paths for Apple Silicon, i.MX 8/93/95 (NXP
NPU), and ESP32-P4 (Espressif RISC-V MCU), plus a cross-platform
benchmark tool and a unified coverage matrix. Default `make` is
unchanged for existing x86 users — every new path is opt-in.
Four commits, each self-contained and amending its own section in
the new docs/implementation.md.
What's added
------------
1. Bench foundation (7afb4f7) — also makes `make` work on ARM hosts
* tools/bench.c — cross-platform synthetic latency bench, md/csv/json
  output. Same source compiles on macOS arm64/x86, Linux aarch64,
  future i.MX targets.
* tools/bench_camera_mac.swift + `make bench-camera` — live
  AVFoundation camera benchmark. `--summary` mode emits one CSV row
  in the same schema as facex-bench so the two can be merged into
  one table.
* scripts/bench_all.sh — sweeps build-flag combinations, emits
  unified Markdown comparison.
* scripts/test_all.sh — single-host test harness; topic commits
  amend it with their own checks (51/51 PASS on M2).
* Makefile arch detection via `uname -m`, src/threadpool_pthread.c
  (futex/WaitOnAddress aren't portable), FACEX_NO_INT8 flag for the
  engine, hand-written NEON kernels for
  matmul_fp32_packed{,_bias,_bias_gelu} (output matches scalar
  within ULP), and a fix for the column-panel scalar fallbacks
  that were silently wrong on every non-x86 host.
2. Apple Silicon perf paths (f75fd64) — opt-in, never default
* make ACCELERATE=1 — cblas_sgemm via Accelerate.framework (AMX)
* make SME=1 — FMOPA outer-product kernels (M4 and newer)
* make COREML=1 — .mlpackage for Apple Neural Engine
* make mac-universal — fat arm64 + x86_64 archive
Each opt-in dispatch is gated at compile time AND runtime via a
self-check (e.g. SME runs a tiny FMOPA-vs-scalar consistency test
and disables itself on output divergence). Critical:
-march=armv9-a+sme is applied per-file to transformer_ops_sme.c
only — applying it globally would let clang auto-vectorize plain C
using SVE/SME and trap on M1/M2/M3.
3. i.MX NPU library (6bc5f99) — separate libfacex_npu.{so,dylib}
* make imx-npu — host build
* make imx93 SDK=… — libethosu_delegate.so
* make imx95 SDK=… — libneutron_delegate.so (eIQ Neutron; corrected
  from Ethos-U by a later commit in this PR)
* make imx8mp SDK=… — libvx_delegate.so
src/backend_tflite.c is a TFLite C-API wrapper with a dlopen-based
delegate loader (vendor .sos aren't a hard dep — auto-fallback to
XNNPACK if missing). One source path, three deployment targets. New
public API in include/facex_npu.h, mirrors facex.h shape.
Detector path is intentionally -ENOSYS. Anchor decode + NMS for
arbitrary YuNet/SCRFD topology is too fragile to ship blind. The
recommended deployment is the hybrid pipeline: CPU detect via
libfacex.a, NPU embed via libfacex_npu.so. ~80% of the perf
benefit, none of the post-processing risk. Documented in
docs/imx_npu.md.
Offline tooling: tools/onnx_to_tflite.py (PyTorch/ONNX → INT8
TFLite) and tools/compile_vela.sh (Arm Vela for Ethos-U65).
4. ESP32-P4 ESP-IDF component (83aeee7)
components/facex/ (IDF wrapper) + examples/esp32p4_camera/ (full
runnable IDF project).
The MIPI-CSI capture path is complete and runnable — follows the
official ESP-IDF camera_driver recipe verbatim (LDO 2.5 V on the
CSI PHY, SCCB I2C, esp_cam_sensor_detect, esp_cam_new_csi_ctlr,
IRAM-safe callbacks, PSRAM frame buffer ring, capture task that
downscales RGB565 → RGB888 and calls the FaceX backend, requeues).
Logs FPS + per-detection latency once per second.
The face-detection backend is a Kconfig three-way choice:
* stub (default) — synthetic deterministic face per frame for
  bring-up. Works on the EV-Board today.
* native — links the existing C engine. Compiles, runs, but at
  1-3 s/frame the EdgeFace-XS model is too large for production on
  P4. Provided for evaluation only.
* espnn — reserved Kconfig slot. Production target needs a
  distilled EdgeFace-Nano (~300 K params, 64×64 input, 256-d
  embedding, no XCA attention) plus an ESP-NN backend (PIE-SIMD
  INT8). Not in this PR — model + kernel work is its own milestone.
Honest scope: the camera + dispatch bridge ships now; production-fit
model is the follow-up. Once it exists, only the Kconfig backend
toggle changes; everything else in this commit stays.
Coverage at a glance (full table in docs/coverage_matrix.md)
-----------------------------------------------------------
* make (Apple Silicon arm64 / NEON) — verified on M2
* make ACCELERATE=1 (AMX) — verified on M2
* make SME=1 (M4+ FMOPA) — compiles, inert on M2 (FEAT_SME=0);
  self-check guards M4 correctness
* make COREML=1 (ANE bridge) — .mlpackage smoke passes; ANE dispatch
  needs ONNX export
* make mac-universal (fat archive) — verified
* make imx-npu / imx93 / imx95 / imx8mp — compile-checked
scripts/test_all.sh runs every check that's executable on the host:
51/51 PASS on M2. Topic commits each register their own checks so
the runner stays exhaustive as features land.
🧪 / 🚫 honesty markers: where I couldn't validate on hardware (M4,
i.MX EVK, P4 dev kit), the code follows the documented vendor APIs
(esp_cam_ctlr_*, TFLite C-API + delegate ABI, ACLE 2024 SME
intrinsics). Self-checks at runtime guard correctness for the SME
path. Bring-up checklists are documented in
docs/{mac,imx_npu,esp32p4}.md.
Benchmark results (M2, n=100, scripts/bench_all.sh)
---------------------------------------------------
[latency table: default / ACCELERATE=1 / SME=1 / SME=1 ACCELERATE=1;
figures not recoverable from this rendering]
Same ||emb||² = 0.0756, same self-similarity 1.0000, same bbox
across all four — backend choice never changes the embedding bytes
beyond ULP.
WASM (Node.js, node wasm/bench.js): 5.27 ms median, 190 fps.
Embedding output matches native byte-for-byte.
Reproduce locally
-----------------
Compatibility / risk
--------------------
Existing x86 users see no change: same paths, same flags, same
artifacts. libfacex.a still has zero external runtime deps
(otool -L → libSystem only on macOS, libc elsewhere). TFLite is
gated behind FACEX_BACKEND_TFLITE; Mac perf paths are opt-in via
per-flag build vars. New public headers are additive:
include/facex_npu.h, include/facex_coreml.h,
include/facex_backend.h. The one touch to existing code is the
facex_init signature fix in examples/example.c — the example was
orphaned relative to the current API.
What's NOT in this PR
---------------------
* A checked-in .mlpackage artefact (needs an EdgeFace ONNX export
  not in this repo)
* WASM build wiring (no checked-in wasm/*.wasm; no make wasm target
  wired)
Each of these is documented in docs/implementation.md with the
unblock path for whoever picks it up next.
Bisectability
-------------
Every commit builds clean on M2 (make succeeds, scripts/test_all.sh
passes its own commit's checks). The four are also intentionally
decoupled: Mac, i.MX, and ESP32 don't depend on each other; they
all sit on the Bench foundation.