Multi-target ports + unified bench infrastructure #3
Open
navado wants to merge 7 commits into facex-engine:main
Introduces a single source of truth for FaceX latency numbers across
build flavours, OSes, and stages — replaces scattered ad-hoc benches
each with its own format. Also lays the AArch64 build foundation so
`make` succeeds on Apple Silicon and Linux ARM hosts.
Benchmark tooling
-----------------
* tools/bench.c — cross-platform synthetic latency bench. Same source
compiles on macOS arm64/x86, Linux aarch64, future i.MX targets.
Three output formats (md/csv/json) emitting the same data; reports
compiled-in vs runtime-active backends. Stages: `embed`, `e2e`,
`both`.
* tools/bench_camera_mac.swift — live AVFoundation camera bench
(Mac-only). `--summary` mode emits one CSV row at exit using the
same schema as facex-bench, so live-camera and engine numbers can
join the unified table.
* tools/build_bench_camera_mac.sh — swiftc invocation + bridging
header. Auto-detects optional libfacex symbols (Accelerate /
Core ML) and links matching frameworks so a stale lib doesn't
silently break the build.
* scripts/bench_all.sh — sweeps build-flag combos, runs facex-bench
against each, emits unified Markdown / CSV.
* scripts/test_all.sh — single-host test harness. Topic commits
amend it with their own checks.
* docs/benchmarking.md — which-tool-answers-which-question matrix,
CSV schema reference, recipe for combining engine + camera output.
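The tools above lean on one idea: every bench emits the same record, only serialized differently. A hypothetical C sketch of that one-record, many-formats pattern (struct and field names are illustrative, not the real facex-bench schema):

```c
#include <stdio.h>
#include <string.h>

/* One record struct, one emitter per format. Field names are
 * illustrative, not the actual facex-bench CSV schema. */
typedef struct {
    const char *host, *backend, *stage;
    double median_ms, p99_ms;
} BenchRow;

static void emit_csv(const BenchRow *r, char *out, size_t n) {
    snprintf(out, n, "%s,%s,%s,%.2f,%.2f",
             r->host, r->backend, r->stage, r->median_ms, r->p99_ms);
}

static void emit_json(const BenchRow *r, char *out, size_t n) {
    snprintf(out, n,
        "{\"host\":\"%s\",\"backend\":\"%s\",\"stage\":\"%s\","
        "\"median_ms\":%.2f,\"p99_ms\":%.2f}",
        r->host, r->backend, r->stage, r->median_ms, r->p99_ms);
}

int bench_row_selftest(void) {
    BenchRow r = { "mac-m2", "neon", "embed", 4.60, 5.10 };
    char csv[128], json[256];
    emit_csv(&r, csv, sizeof csv);
    emit_json(&r, json, sizeof json);
    if (strcmp(csv, "mac-m2,neon,embed,4.60,5.10") != 0) return 1;
    if (strstr(json, "\"median_ms\":4.60") == NULL) return 2;
    return 0;
}
```

Adding a fourth output format is then one new emitter, not a new bench.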
AArch64 / Mac build foundation
------------------------------
Without this, `make` on Apple Silicon or Linux aarch64 produced a
broken binary (silently wrong output via the column-panel scalar
fallback in transformer_ops.c).
* Makefile arch detection via `uname -m`. arm64 path links
src/gemm_stub.c (the existing INT8 GEMM is x86-only) and
src/threadpool_pthread.c (linux/futex / win/WaitOnAddress aren't
portable). Defines FACEX_NO_INT8 so the engine takes the
FP32-packed path.
* src/threadpool_pthread.c — pthread + condvar pool (~80 LOC).
* src/edgeface_engine.c — gates the INT8 weight-packing block on
!FACEX_NO_INT8 so mm->packed stays NULL on ARM and the matmul
dispatch falls cleanly through to FP32.
* src/transformer_ops.c — fixes column-panel scalar fallbacks for
matmul_fp32_packed{,_bias,_bias_gelu} that previously fed packed B
into matmul_fp32 (wrong layout, garbage output on every non-x86
host). Adds hand-written AArch64 NEON kernels (NR=8, MR=4,
FMA-based; same packed format as AVX2). Output matches the scalar
path to within 1 ULP. NEON is portable AArch64 and helps i.MX too —
not Mac-specific.
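To make the layout bug concrete: below is an illustrative (not in-tree) scalar sketch of the column-panel packing plus a kernel that consumes it correctly. Feeding this packed B into a plain row-major matmul is exactly the garbage-output failure described above.

```c
#include <math.h>
#include <stddef.h>

/* NR=8 to match the packed format described above. Assumes N % NR == 0. */
enum { NR = 8 };

/* Pack row-major B[K][N] into column panels: panel p holds columns
 * p*NR .. p*NR+NR-1, stored as K rows of NR floats each. */
static void pack_b(const float *B, float *Bp, int K, int N) {
    for (int p = 0; p < N / NR; p++)
        for (int k = 0; k < K; k++)
            for (int j = 0; j < NR; j++)
                Bp[(p * K + k) * NR + j] = B[k * N + p * NR + j];
}

/* Scalar kernel that walks the PACKED layout, not row-major B. */
static void matmul_packed_scalar(const float *A, const float *Bp,
                                 float *C, int M, int K, int N) {
    for (int p = 0; p < N / NR; p++)
        for (int i = 0; i < M; i++)
            for (int j = 0; j < NR; j++) {
                float acc = 0.0f;
                for (int k = 0; k < K; k++)
                    acc += A[i * K + k] * Bp[(p * K + k) * NR + j];
                C[i * N + p * NR + j] = acc;
            }
}

int packed_matmul_selftest(void) {
    enum { M = 4, K = 16, N = 8 };
    float A[M * K], B[K * N], Bp[K * N], C[M * N];
    for (int i = 0; i < M * K; i++) A[i] = (float)(i % 7) - 3.0f;
    for (int i = 0; i < K * N; i++) B[i] = (float)(i % 5) - 2.0f;
    pack_b(B, Bp, K, N);
    matmul_packed_scalar(A, Bp, C, M, K, N);
    /* reference: naive row-major matmul */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++) {
            float ref = 0.0f;
            for (int k = 0; k < K; k++) ref += A[i * K + k] * B[k * N + j];
            if (fabsf(C[i * N + j] - ref) > 1e-4f) return 1;
        }
    return 0;
}
```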
Documentation
-------------
* docs/implementation.md — new file replacing the forward-looking
docs/plan/embedded_port_plan.md. This is the implementation-details
document; each topic commit (this one, Mac, i.MX, ESP32)
amends a new section.
* docs/coverage_matrix.md — initial table (CPU library, bench
infrastructure). Subsequent topic commits append rows.
* CLAUDE.md — repo conventions, build/test commands, architecture
summary. Touches Bench / Mac / i.MX / ESP32 surfaces incidentally;
topic commits amend their own sections.
Verification on mac-m2
----------------------
* `make` builds clean (`Built libfacex.a (arm64)`).
* `make test` golden test passes (`||emb||² = 0.076`, sim 1.000).
* `make bench && ./facex-bench` produces md/csv/json output;
~4.6 ms median embed, ~8.4 ms e2e (NEON FP32 packed).
* `make bench-camera && ./facex-camera-bench` 29 fps end-to-end.
* `scripts/test_all.sh` all checks PASS.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three are opt-in build flags, not the default. Default `make`
still produces the same portable NEON-only artifact for distribution
to any Mac. Each flag composes cleanly with the others; the
dispatcher in `matmul_fp32_packed` chains them
Accelerate → SME → NEON.
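The chain works because each backend either handles a call or returns -1 to decline. A hedged sketch with stand-in backends (names and heuristics are illustrative, not the in-tree dispatcher):

```c
#include <stddef.h>

typedef int (*matmul_fn)(const float *A, const float *B, float *C,
                         int M, int K, int N);

static int decline_small(const float *A, const float *B, float *C,
                         int M, int K, int N) {
    (void)A; (void)B; (void)C; (void)K; (void)N;
    return M < 4 ? -1 : 0;           /* stand-in for the AMX size heuristic */
}

static int always_handle(const float *A, const float *B, float *C,
                         int M, int K, int N) {
    (void)A; (void)B; (void)C; (void)M; (void)K; (void)N;
    return 0;                        /* stand-in for the NEON baseline */
}

/* Walk the chain in order; first backend that returns 0 wins. */
static int dispatch(matmul_fn *chain, int n_backends,
                    const float *A, const float *B, float *C,
                    int M, int K, int N) {
    for (int i = 0; i < n_backends; i++)
        if (chain[i](A, B, C, M, K, N) == 0)
            return i;                /* index of backend that handled it */
    return -1;
}

int chain_selftest(void) {
    matmul_fn chain[] = { decline_small, always_handle };
    /* M=1: first backend declines, call falls through to the second */
    if (dispatch(chain, 2, NULL, NULL, NULL, 1, 16, 8) != 1) return 1;
    /* M=8: first backend takes it */
    if (dispatch(chain, 2, NULL, NULL, NULL, 8, 16, 8) != 0) return 2;
    return 0;
}
```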
Apple Accelerate / AMX (`make ACCELERATE=1`)
---------------------------------------------
* src/backend_accelerate.c — wraps cblas_sgemm. The wrapper unpacks
the column-panel B back to row-major, dispatches via Accelerate;
AMX wins for M ≥ 4 and M*K*N ≥ 4096, otherwise returns -1 so the
in-tree NEON kernel handles the warmup-dominated calls.
* Self-check on first matmul: 4×16×8 cblas vs scalar reference,
1e-4 relative tolerance. Mismatch → facex_disable_accelerate()
and the rest of the process stays on NEON.
* Measured on M2: 4.6 → 3.5 ms / embed (-24%), 9 → 7.5 ms e2e (-17%).
* Embedding bytes identical to NEON within ULP.
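The first-matmul self-check boils down to a relative-tolerance compare that latches a disable flag. An illustrative sketch (not the in-tree code; names are stand-ins):

```c
#include <math.h>

static int g_backend_disabled = 0;

/* Relative tolerance, with a floor so tiny reference values don't
 * blow up the denominator. */
static int close_rel(float got, float want, float rtol) {
    float denom = fabsf(want) > 1.0f ? fabsf(want) : 1.0f;
    return fabsf(got - want) / denom <= rtol;
}

/* Compare fast-path output against a scalar reference; one bad
 * value latches the disable flag for the rest of the process. */
static void self_check(const float *fast, const float *ref, int n) {
    for (int i = 0; i < n; i++)
        if (!close_rel(fast[i], ref[i], 1e-4f)) {
            g_backend_disabled = 1;
            return;
        }
}

int selfcheck_selftest(void) {
    float ref[4]  = { 1.0f, -2.0f, 3.0f, 4.0f };
    float good[4] = { 1.00001f, -2.0f, 3.0f, 4.0f };
    float bad[4]  = { 1.0f, -2.0f, 3.5f, 4.0f };
    g_backend_disabled = 0;
    self_check(good, ref, 4);
    if (g_backend_disabled) return 1;   /* within tolerance: stays enabled */
    self_check(bad, ref, 4);
    if (!g_backend_disabled) return 2;  /* divergence: disabled */
    return 0;
}
```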
SME / SME2 (`make SME=1`) — Apple M4 and newer
------------------------------------------------
* src/transformer_ops_sme.c — __arm_locally_streaming __arm_new("za")
matmul_fp32_packed using FMOPA outer products. Pre-transposes the
row tile of A into a [K, SVL] scratch (gather not allowed in
streaming mode); zeroes ZA tile 0; accumulates K outer products;
reads the rows back with svread_hor_za32_f32_m. NR=8 to match
the existing FP32 packed format.
* src/cpu_features.{h,c} — sysctl-based runtime probe for FEAT_SME /
FEAT_SME2; cached, atomic, lock-free, no external deps. Designed
to host future runtime probes (FP16 / BF16 / dotprod) too.
* Build isolation: -march=armv9-a+sme is applied PER-FILE
(transformer_ops_sme.c only). Without that, clang auto-vectorizes
plain C in transformer_ops.c using SVE/SME instructions that trap
on M1/M2/M3. Verified post-fix: transformer_ops.o has zero
rdvl/smstart/fmopa; transformer_ops_sme.o has the expected
fmopa za0.s.
* Self-check: tiny 4×8 SME-vs-scalar consistency test on first
matmul. Mismatch → facex_disable_sme() and stay on NEON.
* Hardware status: COMPILES + emits real SME asm; NOT directly
hardware-tested (no M4 here). Self-check guards correctness on
M4 — owners get NEON speed if SME has a bug, never wrong output.
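The cached, lock-free probe pattern from cpu_features.{h,c} can be sketched as a tri-state atomic. Assumption: the real code differs in detail, and the sysctl call (hw.optional.arm.FEAT_SME on macOS) is stubbed out here so the sketch compiles anywhere:

```c
#include <stdatomic.h>

/* -1 = not yet probed, 0 = absent, 1 = present. */
static _Atomic int g_sme_cached = -1;

static int probe_hw(void) {
    /* Stand-in for the sysctl query; always "absent" in this sketch. */
    return 0;
}

int cpu_has_sme(void) {
    int v = atomic_load_explicit(&g_sme_cached, memory_order_relaxed);
    if (v < 0) {
        v = probe_hw();
        /* Benign race: concurrent first callers store the same value. */
        atomic_store_explicit(&g_sme_cached, v, memory_order_relaxed);
    }
    return v;
}

int probe_selftest(void) {
    int a = cpu_has_sme();
    int b = cpu_has_sme();          /* second call hits the cache */
    return (a == b && (a == 0 || a == 1)) ? 0 : 1;
}
```

The expensive query runs at most a handful of times; every later call is a relaxed atomic load.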
Core ML / Apple Neural Engine (`make COREML=1`)
------------------------------------------------
* src/backend_coreml.m — ARC-managed Obj-C bridge that loads a
precompiled `.mlpackage` (auto-compiles to .mlmodelc on first
call) and dispatches MLModel prediction. Runtime compute_units
hint: ALL / CPU+GPU / CPU-only / CPU+ANE.
* include/facex_coreml.h — public C API: facex_coreml_init,
facex_coreml_embed, facex_coreml_last_dispatch, facex_coreml_free.
* Output L2-normalized so cosine similarity stays comparable to the
CPU backend regardless of whether the .mlpackage itself normalizes
its output.
* Graceful failure: missing .mlpackage returns NULL with a clear
stderr message, no crash. (Validated by scripts/test_all.sh.)
* tools/export_coreml.py — ONNX → .mlpackage via
coremltools.convert(convert_to="mlprogram") with optional INT8
palettization (default 6 bits/weight via kmeans, drops package
size to ~1.8 MB and unlocks ANE INT8 dispatch).
* Hardware status: COMPILE-TESTED. Runtime ANE dispatch is not
end-to-end validated — that requires running export_coreml.py
against an actual EdgeFace ONNX export.
Universal Mac binary (`make mac-universal`)
--------------------------------------------
* Cross-compiles arm64 + x86_64 slices with target-specific flags,
stashes each in /tmp (the in-Makefile clean target wipes its own
artifacts), then `lipo`s them into libfacex-universal.a.
* Verified: arm64 slice has 293 NEON insts (fmla/fmul/fadd) in
transformer_ops; x86_64 slice has 786 AVX2 insts (vfmadd/vmovups).
Real arch-specific code in both halves.
Smoke test (`make mac-test` → tests/test_mac.c)
------------------------------------------------
* Loads weights, embed sanity, determinism, self/cross similarity,
latency stats (min/median/p99 over 50 iters), end-to-end
detect+align+embed on tests/test_face_160.raw.
* Now reports both COMPILED-IN and RUNTIME-ACTIVE backends so the
same binary tells you what will actually dispatch:
Backends compiled in: Accelerate SME NEON
Backends active at runtime: Accelerate(AMX) NEON
(correctly shows SME compiled but inert on M2 — sysctl FEAT_SME=0).
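The compiled-in vs runtime-active split can be sketched as two lists built from the same #ifdef blocks, with the active list additionally consulting runtime probes. Macro and probe names here are stand-ins, not the real build flags:

```c
#include <string.h>
#include <stddef.h>

#define HAVE_ACCELERATE 1   /* pretend build-time flags for the sketch */
#define HAVE_SME        1

static int fake_runtime_has_sme(void) { return 0; }  /* e.g. an M2 host */

/* Build both report strings from one set of #ifdef blocks so the
 * two lists can never drift apart. */
static void report(char *compiled, char *active, size_t n) {
    compiled[0] = active[0] = '\0';
#if HAVE_ACCELERATE
    strncat(compiled, "Accelerate ", n - strlen(compiled) - 1);
    strncat(active,   "Accelerate ", n - strlen(active) - 1);
#endif
#if HAVE_SME
    strncat(compiled, "SME ", n - strlen(compiled) - 1);
    if (fake_runtime_has_sme())
        strncat(active, "SME ", n - strlen(active) - 1);
#endif
    strncat(compiled, "NEON", n - strlen(compiled) - 1);
    strncat(active,   "NEON", n - strlen(active) - 1);
}

int report_selftest(void) {
    char compiled[64], active[64];
    report(compiled, active, sizeof compiled);
    if (!strstr(compiled, "SME")) return 1;   /* compiled in */
    if (strstr(active, "SME"))    return 2;   /* but inert at runtime */
    return 0;
}
```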
Documentation
-------------
* docs/mac.md — full Mac story: build modes, runtime fallback chain,
permissions, perf reference table, troubleshooting.
* docs/implementation.md — appended "Apple Silicon / Mac perf paths"
section.
* docs/coverage_matrix.md — appended Mac rows (mac-test, ACCELERATE,
SME, COREML, mac-universal, export_coreml.py).
* scripts/test_all.sh — appended the entire Mac variants block:
builds with each flag combo, validates symbol presence, links the
expected frameworks, runs mac-test, checks fmopa-in / rdvl-out
isolation, lipo per-slice instruction-count probes.
* CLAUDE.md — make-target list extended with Mac options; new
bullet documenting the opt-in flag policy.
* README.md — short Mac section + link to docs/mac.md.
* examples/example.c — orphaned 1-line API fix
(facex_init went from 2 to 3 args repo-side; example didn't follow).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A second build of FaceX (libfacex_npu.{so,dylib}) that dispatches
inference through the TensorFlow Lite C API to a runtime-selected
delegate. Same source / same artefact targets three NXP SoCs:
i.MX 8M Plus → NXP VxDelegate (libvx_delegate.so)
i.MX 93 → Arm Ethos-U external (libethosu_delegate.so)
i.MX 95 → Arm Ethos-U external (libethosu_delegate.so)
any AArch64 → XNNPACK CPU fallback (built into TFLite)
Code
----
* include/facex_backend.h — pluggable FacexBackend vtable. i.MX is
the first concrete consumer beyond the in-tree CPU backend.
* include/facex_npu.h — facex_npu_init / _embed / _detect / _free,
mirrors facex.h shape so callers can swap CPU/NPU backends.
* src/backend_tflite.c — dlopen-based delegate loader (vx →
ethos-u → armnn fallback chain), TfLiteModel + Interpreter setup,
INT8 quantize/dequantize for the embedder, L2 normalize.
Detector path is intentionally -ENOSYS — the recommended
deployment is the hybrid pipeline (CPU detect via libfacex.a,
NPU embed via this backend). 80% of the perf benefit, none of
the post-processing risk.
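A hedged sketch of the vtable-plus-ENOSYS shape (struct and function names are assumptions, not the real facex_backend.h): callers that probe detect() get -ENOSYS and can fall back to the CPU detector, which is the hybrid pipeline.

```c
#include <errno.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int (*embed)(const unsigned char *face, float *out, int dim);
    int (*detect)(const unsigned char *frame, void *faces, int max);
} FacexBackendSketch;

static int npu_embed(const unsigned char *face, float *out, int dim) {
    (void)face;
    for (int i = 0; i < dim; i++) out[i] = 0.0f;  /* stand-in result */
    return 0;
}

static int npu_detect(const unsigned char *frame, void *faces, int max) {
    (void)frame; (void)faces; (void)max;
    return -ENOSYS;   /* detector deliberately unimplemented on the NPU */
}

static const FacexBackendSketch npu_backend = { "npu", npu_embed, npu_detect };

int vtable_selftest(void) {
    float emb[8];
    if (npu_backend.embed(NULL, emb, 8) != 0) return 1;
    if (npu_backend.detect(NULL, NULL, 0) != -ENOSYS) return 2;
    return 0;
}
```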
Tooling (offline model conversion)
----------------------------------
* tools/onnx_to_tflite.py — ONNX → SavedModel (via onnx2tf) →
INT8 TFLite (via tf.lite). Accepts a calibration directory of
representative face crops; falls back to noise calibration with
a warning. Uses subprocess.run with arg lists, not os.system.
* tools/compile_vela.sh — wraps Arm's Vela compiler to produce an
Ethos-U65 command stream from an INT8 .tflite. Defaults to
ethos-u65-256 (i.MX 93 / 95). Prints op coverage from Vela's
summary CSV so any CPU-fallback layers are visible.
Build
-----
Makefile gains four new targets:
make imx-npu — host build (e.g. for XNNPACK fallback test)
make imx93 SDK=… — cross-compile for i.MX 93 (A55 + Ethos-U65)
make imx95 SDK=… — cross-compile for i.MX 95 (A55 + Ethos-U65)
make imx8mp SDK=… — cross-compile for i.MX 8M Plus (A53 + VIP9000)
All four produce libfacex_npu.{so,dylib}; the difference is the
-mcpu tuning and which delegate the runtime picks at first init.
TFLite lives behind FACEX_BACKEND_TFLITE so the existing CPU build
(`make`) is unchanged — no new mandatory dependency.
Test
----
* tests/test_imx_npu_compile.c — API surface compile + link smoke,
runs without an actual NPU device. With one or two .tflite paths,
also reports the active delegate. Useful for CI on hosts without
NXP / Arm hardware.
* scripts/test_all.sh — appended NPU compile-check section using a
minimal TFLite header stub so the syntax check works on any host.
Validation
----------
* Default `make` still builds and `make mac-test` still passes
byte-identical (the NPU code is gated on FACEX_BACKEND_TFLITE).
* src/backend_tflite.c + tests/test_imx_npu_compile.c syntax-check
cleanly against the minimal TFLite C-API header stub.
* Hardware bring-up on real i.MX EVK is the next milestone — see
docs/imx_npu.md §5 "Hardware bring-up checklist".
Documentation
-------------
* docs/imx_npu.md — full deployment guide: model conversion
pipeline, host vs cross-compile builds, hybrid pipeline wiring,
per-SoC bring-up checklist, known limitations.
* docs/implementation.md — appended "i.MX NPU library" section.
* docs/coverage_matrix.md — appended NPU rows.
* CLAUDE.md — make-target list extended with imx-* options; new
bullet documenting the libfacex_npu library, hybrid pipeline,
and -ENOSYS detector behaviour.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A complete ESP-IDF project that brings up the MIPI-CSI camera on an
ESP32-P4 and feeds frames into the FaceX detection wrapper. Capture
path is real and runnable; the face-detection backend ships in three
selectable forms (stub / native / espnn) gated by Kconfig.
Reasonable assumptions baked in
-------------------------------
The camera bridge ships complete; the model story is staged because:
1. Bring-up first, model second. Customers integrating FaceX on P4
   need to first prove camera + downscale + UART work. The default
   `stub` backend emits a deterministic synthetic face per frame so
   every code path is exercised without committing to a model.
2. Native backend is for evaluation only. EdgeFace-XS compiles and
   runs on P4 but at 1-3 s/frame — demonstrably not a product.
   Provided so partners can verify "the engine technically works".
3. EdgeFace-Nano is future work. Distilled model (~300 K params,
   64×64 input, 256-d, no XCA attn) plus an ESP-NN backend (PIE-SIMD
   INT8 conv) is the production target. Kconfig slot
   CONFIG_FACEX_BACKEND_ESPNN is reserved (`depends on 0`) so
   adopters can see the eventual shape.
components/facex/
-----------------
* Kconfig — choice between Stub / Native / ESPNN(reserved) backends,
  detector input W/H, optional per-frame log.
* CMakeLists.txt — registers the component, conditionally pulls
  src/edgeface_engine.c et al. when FACEX_BACKEND_NATIVE is set.
* include/facex_esp.h — small init / detect / free API. Mirrors
  FaceXResult (minus full embedding) so applications don't need to
  know which backend is running.
* src/facex_esp.c — dispatches detect calls. Stub backend emits one
  deterministic synthetic face per frame with smooth bbox jitter.
  Native backend forwards into the existing C engine — works but
  ~1-3 s/frame on P4, evaluation only.
examples/esp32p4_camera/
------------------------
* main/app_main.c — full ESP-IDF camera_driver recipe per
  https://docs.espressif.com/.../camera_driver.html — LDO 2.5 V for
  the CSI PHY, SCCB I2C, esp_cam_sensor_detect (auto-picks SC2336
  etc.), set_format, esp_cam_new_csi_ctlr, on_get_new_trans /
  on_trans_finished callbacks (IRAM_ATTR), PSRAM frame buffer ring,
  capture_task that downscales RGB565 → RGB888 and calls
  facex_esp_detect, requeues. Logs FPS + detection latency once per
  second.
* main/Kconfig.projbuild — sensor resolution, lane count, lane
  bit-rate, SCCB pins, sensor reset+pwdn GPIOs.
* main/idf_component.yml — pulls esp_cam_sensor + esp_video.
* sdkconfig.defaults — target esp32p4, PSRAM hex, CPU @ 360 MHz,
  main task stack 8 KB.
* README.md — build + flash, expected console output, backend
  selection table.
Documentation
-------------
* docs/esp32p4.md — full deployment guide: status table, prereqs
  (IDF v5.4+), backend selection, resource budget on the
  Function-EV-Board with the SC2336 sensor, troubleshooting.
* docs/implementation.md — appended "ESP32-P4 ESP-IDF component"
  section.
* docs/coverage_matrix.md — appended ESP32 rows.
* scripts/test_all.sh — appended ESP32-P4 syntax check using
  synthesized IDF header stubs (esp_err / esp_log / esp_timer /
  sdkconfig) so the wrapper compiles cleanly without a full IDF
  install.
* CLAUDE.md — note added so future sessions know this is an IDF
  component (idf.py), not a Makefile target.
Honest scope
------------
Camera + dispatch bridge is complete and runnable. Model story
(EdgeFace-Nano distillation, ESP-NN backend) is the follow-up. Once
it lands, only the Kconfig backend toggle changes; everything else
in this commit stays. Default host build (libfacex.a, mac-test)
untouched and still passing byte-identical.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous code claimed i.MX 95 used Arm Ethos-U65, but it actually ships NXP's eIQ Neutron N3 NPU (Ethos-U65 is i.MX 93 only). Register libneutron_delegate.so in the TFLite delegate loader and fix the documentation across CLAUDE.md, the Makefile, the public header, and docs/imx_npu.md (per-SoC bring-up table, offline compiler note for neutron-converter vs vela). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Neutron delegate (and Ethos-U, for that matter) silently produces
"0 nodes delegated" when handed a .tflite that wasn't pre-compiled by
the matching offline tool — same latency as XNNPACK, no NPU offload.
Surface this failure mode two ways:
- tools/compile_neutron.sh — thin wrapper around NXP's
  neutron-converter (eIQ Toolkit), mirroring tools/compile_vela.sh:
  same args shape, same output naming convention
  (<base>_neutron.tflite).
- backend_tflite.c — when verbose=1 and a Neutron/Ethos-U delegate is
  picked, print a one-shot hint at init time pointing at the right
  offline tool, so users can immediately interpret a subsequent
  "0 nodes delegated" line from TFLite.
Also expand docs/imx_npu.md §1 with full instructions for obtaining
neutron-converter (nxp.com download path, host-OS install matrix,
env-script activation, BSP/toolkit version pinning).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lets us compare CPU NEON, XNNPACK, eIQ Neutron, Ethos-U, and
VxDelegate side-by-side in a single CSV instead of running NXP's
benchmark_model separately and reconciling output formats. Pieces:
- tools/bench_npu.c — synthetic-input bench that mirrors
  facex-bench's CSV/MD/JSON schema. Embed stage only
  (facex_npu_detect is -ENOSYS).
- include/facex_npu.h — extend FaceXNpuOptions with
  external_delegate_path so callers (the bench, eventually
  production apps) can dlopen any TFLite-external-delegate-ABI .so
  by absolute path, matching how benchmark_model exposes
  --external_delegate_path.
- src/backend_tflite.c — derive_path_name() picks a tidy logging
  name from a delegate path (libneutron_delegate.so → "neutron",
  libarmnnDelegate.so → "armnn"), then select_delegate honours a
  non-NULL path before walking the registry. preferred_delegate
  continues to work unchanged.
- Makefile — facex-bench-npu target (depends on libfacex_npu.so),
  added to clean.
Docs:
- docs/benchmarking.md — new section for facex-bench-npu, three-way
  comparison recipe (NEON / XNNPACK / Neutron in one /tmp/cmp.csv).
- docs/imx_npu.md — testing section gains a bench subsection
  cross-linking to docs/benchmarking.md.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
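The job derive_path_name() performs can be illustrated with a small reimplementation (an assumption about its behaviour, not the in-tree code): take the basename, strip a "lib" prefix, cut at the first '_' or '.'.

```c
#include <string.h>
#include <stddef.h>

/* "/usr/lib/libneutron_delegate.so" -> "neutron",
 * "libvx_delegate.so"               -> "vx". */
static void derive_name_sketch(const char *path, char *out, size_t n) {
    const char *base = strrchr(path, '/');
    base = base ? base + 1 : path;               /* basename */
    if (strncmp(base, "lib", 3) == 0) base += 3; /* strip "lib" prefix */
    size_t i = 0;
    while (i + 1 < n && base[i] && base[i] != '_' && base[i] != '.') {
        out[i] = base[i];                        /* copy up to '_' or '.' */
        i++;
    }
    out[i] = '\0';
}

int name_selftest(void) {
    char buf[32];
    derive_name_sketch("/usr/lib/libneutron_delegate.so", buf, sizeof buf);
    if (strcmp(buf, "neutron") != 0) return 1;
    derive_name_sketch("libvx_delegate.so", buf, sizeof buf);
    if (strcmp(buf, "vx") != 0) return 2;
    return 0;
}
```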
Multi-target ports + unified bench infrastructure
Adds first-class build paths for Apple Silicon, i.MX 8/93/95 (NXP
NPU), and ESP32-P4 (Espressif RISC-V MCU), plus a cross-platform
benchmark tool and a unified coverage matrix. Default `make` is
unchanged for existing x86 users — every new path is opt-in.
Four commits, each self-contained and amending its own section in
the new docs/implementation.md.
What's added
------------
1. Bench foundation (7afb4f7) — also makes `make` work on ARM hosts
* tools/bench.c — cross-platform synthetic latency bench, md/csv/json
  output. Same source compiles on macOS arm64/x86, Linux aarch64,
  future i.MX targets.
* tools/bench_camera_mac.swift + `make bench-camera` — live
  AVFoundation camera benchmark. `--summary` mode emits one CSV row
  in the same schema as facex-bench so the two can be merged into
  one table.
* scripts/bench_all.sh — sweeps build-flag combinations, emits
  unified Markdown comparison.
* scripts/test_all.sh — single-host test harness; topic commits
  amend it with their own checks (51/51 PASS on M2).
* Makefile arch detection via `uname -m`, src/threadpool_pthread.c
  (futex/WaitOnAddress aren't portable), FACEX_NO_INT8 flag for the
  engine, hand-written NEON kernels for
  matmul_fp32_packed{,_bias,_bias_gelu} (output matches scalar
  within ULP), and a fix for the column-panel scalar fallbacks
  that were silently wrong on every non-x86 host.
2. Apple Silicon perf paths (f75fd64) — opt-in, never default
* make ACCELERATE=1 — cblas_sgemm via Accelerate.framework (AMX)
* make SME=1 — FMOPA outer-product kernels (M4 and newer)
* make COREML=1 — .mlpackage for Apple Neural Engine
* make mac-universal — fat arm64 + x86_64 archive
Each opt-in dispatch is gated at compile time AND runtime via a
self-check (e.g. SME runs a tiny FMOPA-vs-scalar consistency test
and disables itself on output divergence). Critical:
-march=armv9-a+sme is applied per-file to transformer_ops_sme.c
only — applying it globally would let clang auto-vectorize plain C
using SVE/SME and trap on M1/M2/M3.
3. i.MX NPU library (6bc5f99) — separate libfacex_npu.{so,dylib}
* make imx-npu — host build
* make imx93 SDK=… — libethosu_delegate.so
* make imx95 SDK=… — libneutron_delegate.so (eIQ Neutron; corrected
  from Ethos-U by a later commit in this PR)
* make imx8mp SDK=… — libvx_delegate.so
src/backend_tflite.c is a TFLite C-API wrapper with a dlopen-based
delegate loader (vendor .sos aren't a hard dep — auto-fallback to
XNNPACK if missing). One source path, three deployment targets. New
public API in include/facex_npu.h, mirrors facex.h shape.
Detector path is intentionally -ENOSYS. Anchor decode + NMS for
arbitrary YuNet/SCRFD topology is too fragile to ship blind. The
recommended deployment is the hybrid pipeline: CPU detect via
libfacex.a, NPU embed via libfacex_npu.so. ~80% of the perf
benefit, none of the post-processing risk. Documented in
docs/imx_npu.md.
Offline tooling: tools/onnx_to_tflite.py (PyTorch/ONNX → INT8
TFLite) and tools/compile_vela.sh (Arm Vela for Ethos-U65).
4. ESP32-P4 ESP-IDF component (83aeee7)
components/facex/ (IDF wrapper) + examples/esp32p4_camera/ (full
runnable IDF project).
The MIPI-CSI capture path is complete and runnable — follows the
official ESP-IDF camera_driver recipe verbatim (LDO 2.5 V on the
CSI PHY, SCCB I2C, esp_cam_sensor_detect, esp_cam_new_csi_ctlr,
IRAM-safe callbacks, PSRAM frame buffer ring, capture task that
downscales RGB565 → RGB888 and calls the FaceX backend, requeues).
Logs FPS + per-detection latency once per second.
The face-detection backend is a Kconfig three-way choice:
* stub (default) — synthetic deterministic face per frame for
  bring-up. Works on the EV-Board today.
* native — links the existing C engine. Compiles, runs, but at
  1-3 s/frame the EdgeFace-XS model is too large for production on
  P4. Provided for evaluation only.
* espnn — reserved Kconfig slot. Production target needs a
  distilled EdgeFace-Nano (~300 K params, 64×64 input, 256-d
  embedding, no XCA attention) plus an ESP-NN backend (PIE-SIMD
  INT8). Not in this PR — model + kernel work is its own milestone.
Honest scope: the camera + dispatch bridge ships now; production-fit
model is the follow-up. Once it exists, only the Kconfig backend
toggle changes; everything else in this commit stays.
Coverage at a glance (full table in docs/coverage_matrix.md)
-----------------------------------------------------------
* make (Apple Silicon arm64 / NEON) — verified on M2
* make ACCELERATE=1 (AMX) — verified on M2
* make SME=1 (M4+ FMOPA) — compiles, inert on M2 (FEAT_SME=0);
  self-check guards M4 correctness
* make COREML=1 (ANE bridge) — .mlpackage smoke passes; ANE dispatch
  needs ONNX export
* make mac-universal (fat archive) — verified
* make imx-npu / imx93 / imx95 / imx8mp — compile-checked
scripts/test_all.sh runs every check that's executable on the host:
51/51 PASS on M2. Topic commits each register their own checks so
the runner stays exhaustive as features land.
🧪 / 🚫 honesty markers: where I couldn't validate on hardware (M4,
i.MX EVK, P4 dev kit), the code follows the documented vendor APIs
(esp_cam_ctlr_*, TFLite C-API + delegate ABI, ACLE 2024 SME
intrinsics). Self-checks at runtime guard correctness for the SME
path. Bring-up checklists are documented in
docs/{mac,imx_npu,esp32p4}.md.
Benchmark results (M2, n=100, scripts/bench_all.sh)
---------------------------------------------------
[latency table: default / ACCELERATE=1 / SME=1 / SME=1 ACCELERATE=1;
figures not recoverable from this rendering]
Same ||emb||² = 0.0756, same self-similarity 1.0000, same bbox
across all four — backend choice never changes the embedding bytes
beyond ULP.
WASM (Node.js, node wasm/bench.js): 5.27 ms median, 190 fps.
Embedding output matches native byte-for-byte.
Reproduce locally
-----------------
Compatibility / risk
--------------------
Existing x86 users see no change: same paths, same flags, same
artifacts. libfacex.a still has zero external runtime deps
(otool -L → libSystem only on macOS, libc elsewhere). TFLite is
gated behind FACEX_BACKEND_TFLITE; Mac perf paths are opt-in via
per-flag build vars. New public headers are additive:
include/facex_npu.h, include/facex_coreml.h,
include/facex_backend.h. The one touch to existing code is the
facex_init signature fix in examples/example.c — the example was
orphaned relative to the current API.
What's NOT in this PR
---------------------
* A checked-in .mlpackage artefact (needs an EdgeFace ONNX export
  not in this repo)
* WASM build wiring (no checked-in wasm/*.wasm; no make wasm target
  wired)
Each of these is documented in docs/implementation.md with the
unblock path for whoever picks it up next.
Bisectability
-------------
Every commit builds clean on M2 (make succeeds, scripts/test_all.sh
passes its own commit's checks). The four are also intentionally
decoupled: Mac, i.MX, and ESP32 don't depend on each other; they
all sit on the Bench foundation.