diff --git a/docs/running-on-jetson.md b/docs/running-on-jetson.md
new file mode 100644
index 0000000..4d21bdc
--- /dev/null
+++ b/docs/running-on-jetson.md
@@ -0,0 +1,326 @@
+# Running Modly headless on an NVIDIA Jetson (AGX Orin, JetPack 6)
+
+This guide explains how to run Modly's image-to-3D generation on an **NVIDIA Jetson**,
+a platform Modly does not officially target. Jetson is `aarch64` with NVIDIA's
+Tegra/L4T CUDA stack, so the desktop installers and the stock extension `setup.py`
+(which assume `x86_64` + the standard CUDA wheels) don't work as-is.
+
+The good news: **Modly's FastAPI backend is fully standalone and HTTP-driven**, so you
+don't need Electron, a display, or the GUI. You run the backend on the Jetson and drive
+it with `curl`. The only real work is building the model extension's venv with the
+**Jetson-native** PyTorch and working around a broken ONNX Runtime.
+
+> Verified on: **Jetson AGX Orin Developer Kit, 64 GB, JetPack 6.2 (L4T R36.4),
+> Python 3.10, CUDA 12.6, compute capability sm_87.**
+> Model: the base **Hunyuan3D-2 Mini** extension (`hunyuan3d-mini`), mesh-only
+> (no texture). ~6 GB VRAM-class model; comfortable on a 32–64 GB Orin.
+
+---
+
+## TL;DR
+
+The stock setup gets three things wrong on Jetson. After a normal headless install you must:
+
+1. **Replace PyTorch.** The pinned `torch 2.5.1+cu124` SBSA wheel has **no `sm_87`
+   kernels** (loads, but every kernel fails with *"no kernel image is available"*).
+   Install the Jetson build from the [jetson-ai-lab](https://pypi.jetson-ai-lab.io) index.
+2. **Pin `numpy < 2`.** The Jetson torch wheel is compiled against NumPy 1.x, so
+   `torch.from_numpy` raises *"Numpy is not available"* under NumPy 2.
+3. **Bypass `rembg` / ONNX Runtime.** ONNX Runtime crashes hard on the Tegra CPU
+   (*"Unknown CPU vendor"* → C++ abort / segfault). Patch the extension's background
+   removal to a NumPy/SciPy implementation.
+
+Full steps below.
+
+---
+
+## 1. System prerequisites
+
+```bash
+sudo apt update
+sudo apt install -y git python3-venv python3-pip
+python3 --version                                  # expect 3.10.x on JetPack 6
+cat /etc/nv_tegra_release                          # confirm L4T R36.x (JetPack 6)
+/usr/local/cuda/bin/nvcc --version | grep release  # confirm CUDA 12.x
+free -h                                            # confirm enough unified memory (16 GB+ recommended)
+```
+
+Set up working directories and environment:
+
+```bash
+export EXTENSIONS_DIR=$HOME/.modly/extensions
+export MODELS_DIR=$HOME/.modly/models
+export WORKSPACE_DIR=$HOME/.modly/workspace
+mkdir -p "$EXTENSIONS_DIR" "$MODELS_DIR" "$WORKSPACE_DIR"
+```
+
+---
+
+## 2. Clone Modly and build the API backend venv
+
+The backend itself contains **no PyTorch**; it's just the FastAPI orchestrator plus
+mesh post-processing (`trimesh`, `pymeshlab`). `pymeshlab` has `aarch64` wheels, so this
+installs cleanly.
+
+```bash
+git clone https://github.com/lightningpixel/modly.git ~/modly
+cd ~/modly/api
+python3 -m venv .venv
+./.venv/bin/pip install -U pip
+./.venv/bin/pip install -r requirements.txt
+```
+
+---
+
+## 3. Fetch the model extension
+
+```bash
+git clone https://github.com/lightningpixel/modly-hunyuan3d-mini-extension.git \
+  "$EXTENSIONS_DIR/hunyuan3d-mini"
+```
+
+The base **Hunyuan3D-2 Mini** extension is the best starting point on Jetson:
+
+- mesh-only generation needs **no native/C++ build** (the texture path's
+  `custom_rasterizer` / `differentiable_renderer` are only imported when
+  `enable_texture=true`);
+- the `hy3dgen` source and model weights **auto-download at runtime**;
+- it has no `diso` dependency.
+
+---
+
+## 4. Build the extension venv (with the Jetson fixes)
+
+Create the venv:
+
+```bash
+EXT="$EXTENSIONS_DIR/hunyuan3d-mini"
+python3 -m venv "$EXT/venv"
+"$EXT/venv/bin/pip" install -U pip
+```
+
+### 4a. Install Jetson-native PyTorch (not the cu124 SBSA wheels)
+
+> Do **not** run the extension's stock `setup.py` on Jetson; its ARM64 branch installs
+> generic server-ARM (SBSA) `torch 2.5.1+cu124` wheels that **do not contain `sm_87`
+> kernels**. They import fine and report `cuda available: True`, but the first real
+> kernel throws `RuntimeError: CUDA error: no kernel image is available for execution on the device`.
+
+Install from the jetson-ai-lab index instead (pick the channel matching your JetPack:
+`jp6/cu126` for JetPack 6.2):
+
+```bash
+"$EXT/venv/bin/pip" install --no-deps \
+  --index-url https://pypi.jetson-ai-lab.io/jp6/cu126 \
+  torch==2.8.0 torchvision==0.23.0
+```
+
+> **Do not add `--extra-index-url https://pypi.org/simple` here.** pip will prefer the
+> newer PyPI `torch` (e.g. 2.12 with a CUDA-13 dependency stack) which needs a CUDA-13
+> driver JetPack 6.2 doesn't have. Keep it to the jetson-ai-lab index only.
+> Other available pairings on that index: torch `2.9.1`/`2.10.0`/`2.11.0` ↔
+> torchvision `0.24.1`/`0.25.0`/`0.26.0`.
+
+### 4b. Install the rest of the dependencies, pinned for NumPy 1.x
+
+The Jetson torch wheel is built against NumPy 1.x. Under NumPy 2 you get
+`RuntimeError: Numpy is not available` from `torch.from_numpy` (the diffusion scheduler
+uses it during model load). Pin NumPy `<2` and an OpenCV build that allows it:
+
+```bash
+"$EXT/venv/bin/pip" install \
+  Pillow "numpy==1.26.4" trimesh pymeshlab "opencv-python-headless==4.10.0.84" \
+  huggingface_hub "diffusers>=0.31.0" "transformers>=4.46.0" accelerate \
+  einops scipy scikit-image
+```
+
+> `rembg` / `onnxruntime` are intentionally **omitted**; see §4c.
+
+### 4c. Bypass `rembg` / ONNX Runtime (background removal)
+
+ONNX Runtime is unusable on this Tegra CPU. The generic PyPI `aarch64` wheel aborts on
+import/inference with:
+
+```
+onnxruntime cpuid_info warning: Unknown CPU vendor. cpuinfo_vendor value: 0
+Assertion '__n < this->size()' failed.
+```
+
+and the jetson-ai-lab `onnxruntime-gpu` build segfaults during inference. Because these
+are native C++ aborts (not Python exceptions), the extension's `try/except` fallback
+can't catch them, so the whole subprocess dies (*"Subprocess died during generation"*).
+
+`rembg` is only used for background removal in `_preprocess`. Replace it with a
+dependency-free NumPy/SciPy remover. Save this as `patch_preprocess.py`:
+
+```python
+import re, sys
+GEN = sys.argv[1]
+src = open(GEN, encoding="utf-8").read()
+NEW = '''    def _preprocess(self, image_bytes: bytes) -> Image.Image:
+        # rembg/onnxruntime is unusable on Jetson's Tegra CPU; use a
+        # dependency-free background remover (numpy + scipy, no onnxruntime).
+        import numpy as np
+        img = Image.open(io.BytesIO(image_bytes)).convert("RGBA")
+        arr = np.array(img)
+        if arr.shape[2] == 4 and int(arr[..., 3].min()) < 250:
+            return img  # already has an alpha cutout
+        rgb = arr[..., :3].astype(np.float32)
+        h, w = rgb.shape[:2]
+        b = max(2, min(h, w) // 50)
+        border = np.concatenate([
+            rgb[:b, :, :].reshape(-1, 3), rgb[-b:, :, :].reshape(-1, 3),
+            rgb[:, :b, :].reshape(-1, 3), rgb[:, -b:, :].reshape(-1, 3),
+        ], axis=0)
+        bg = np.median(border, axis=0)
+        dist = np.sqrt(((rgb - bg) ** 2).sum(axis=2))
+        thr = max(25.0, float(np.percentile(dist, 35)))
+        alpha = (dist > thr).astype(np.uint8) * 255
+        try:
+            from scipy import ndimage
+            lbl, n = ndimage.label(alpha > 0)
+            if n > 1:
+                sizes = ndimage.sum(np.ones_like(lbl), lbl, range(1, n + 1))
+                keep = int(np.argmax(sizes)) + 1
+                alpha = np.where(lbl == keep, 255, 0).astype(np.uint8)
+            alpha = ndimage.binary_fill_holes(alpha > 0).astype(np.uint8) * 255
+        except Exception:
+            pass
+        out = arr.copy()
+        out[..., 3] = alpha
+        return Image.fromarray(out, "RGBA")
+
+'''
+pat = re.compile(r"    def _preprocess\(self, image_bytes: bytes\) -> Image\.Image:.*?(?=\n    def )", re.DOTALL)
+assert pat.search(src), "_preprocess block not found"
+open(GEN, "w", encoding="utf-8").write(pat.sub(NEW.rstrip("\n") + "\n", src, count=1))
+print("patched")
+```
+
+Apply it (a backup is kept):
+
+```bash
+cp -n "$EXT/generator.py" "$EXT/generator.py.orig"
+python3 patch_preprocess.py "$EXT/generator.py"
+"$EXT/venv/bin/python" -c "import ast; ast.parse(open('$EXT/generator.py').read()); print('parses OK')"
+```
+
+> **Caveat:** this remover is a simple border-colour key + largest-blob + hole-fill. It
+> works well on clean/plain backgrounds (product shots, objects on a wall/floor) but
+> poorly on busy scenes. A proper fix would be a working Jetson ONNX Runtime, or making
+> background removal optional in the extension.
+
+### 4d. Verify the GPU actually runs a kernel
+
+```bash
+"$EXT/venv/bin/python" - <<'PY'
+import numpy as np, torch
+print("torch", torch.__version__, "| cuda", torch.cuda.is_available(),
+      "| dev", torch.cuda.get_device_name(0), "| cap", torch.cuda.get_device_capability(0))
+a = torch.randn(512, 512, device="cuda", dtype=torch.float16)
+print("matmul OK:", float((a @ a).float().sum()))
+print("from_numpy OK:", torch.from_numpy(np.arange(3, dtype="float32")).cuda().sum().item())
+PY
+```
+
+Expect `cap (8, 7)`, a finite `matmul OK`, and `from_numpy OK: 3.0`. If `matmul`
+fails with *"no kernel image"*, your torch wheel is wrong (see 4a). If `from_numpy`
+fails, NumPy is ≥2 (see 4b).
+
+---
+
+## 5. Run the backend (headless)
+
+```bash
+cd ~/modly/api
+EXTENSIONS_DIR=$HOME/.modly/extensions \
+MODELS_DIR=$HOME/.modly/models \
+WORKSPACE_DIR=$HOME/.modly/workspace \
+./.venv/bin/uvicorn main:app --host 0.0.0.0 --port 8000
+```
+
+> Optional: `export HF_TOKEN=hf_...` before launching to avoid HuggingFace rate-limits
+> on the one-time ~4 GB weight download.
+
+Check it (from the Jetson, or from another machine using the Jetson's IP):
+
+```bash
+curl http://127.0.0.1:8000/health             # {"status":"ok"}
+curl http://127.0.0.1:8000/model/all          # lists hunyuan3d-mini/generate
+curl http://127.0.0.1:8000/extensions/errors  # {} == no load errors
+```
+
+---
+
+## 6. Generate a mesh
+
+```bash
+# Submit (first run downloads weights + hy3dgen, then loads on CUDA)
+curl -s -X POST http://127.0.0.1:8000/generate/from-image \
+  -F image=@your_object.png \
+  -F model_id=hunyuan3d-mini/generate \
+  -F remesh=none -F enable_texture=false \
+  -F 'params={"num_inference_steps":50,"octree_resolution":512,"guidance_scale":5.5,"seed":42}'
+# -> {"job_id":"..."}
+
+# Poll until "done"
+curl http://127.0.0.1:8000/generate/status/
+
+# The result is served from the workspace; download it
+curl -O http://127.0.0.1:8000/workspace/Default/.glb
+```
+
+`params` options for this model: `num_inference_steps` (10/30/50),
+`octree_resolution` (256/380/512; higher = more detail + VRAM), `guidance_scale`,
+`seed`. A 64 GB Orin handles `50 / 512` comfortably.
+
+---
+
+## 7. Performance: max out the Orin
+
+A fresh Jetson is often in a power-limited `nvpmodel` (fewer CPU cores, lower clocks),
+which slows the CPU-bound model-load and pre/post-processing. For full speed:
+
+```bash
+sudo nvpmodel -m 0   # MAXN (all cores)
+sudo jetson_clocks   # lock max clocks
+```
+
+Watch utilisation during a run with `tegrastats` (expect `GR3D_FREQ` near 100% during
+diffusion).
+
+---
+
+## Troubleshooting
+
+| Symptom | Cause | Fix |
+|---|---|---|
+| `CUDA error: no kernel image is available for execution on the device` | torch wheel has no `sm_87` kernels (generic SBSA build) | Install Jetson torch from jetson-ai-lab (§4a) |
+| `RuntimeError: Numpy is not available` (in `from_numpy`) | torch built against NumPy 1.x, but NumPy ≥2 installed | `pip install "numpy==1.26.4"` + OpenCV `4.10.0.84` (§4b) |
+| `Subprocess died during generation` at *"Removing background"*; log shows `Unknown CPU vendor` / `Assertion '__n < this->size()'` / segfault | ONNX Runtime broken on Tegra CPU | Patch `_preprocess` to drop `rembg` (§4c) |
+| `opencv-python-headless ... requires numpy>=2` | newer OpenCV forces NumPy 2 | pin `opencv-python-headless==4.10.0.84` |
+| Edits to the extension venv don't take effect | the backend keeps a long-lived extension subprocess that survives errors | restart the backend so it respawns the subprocess |
+
+---
+
+## Known limitations
+
+- **No textures.** This guide covers mesh-only generation. `enable_texture=true`
+  requires building `custom_rasterizer` and `differentiable_renderer` (CUDA C++
+  extensions) from source for `aarch64`, not covered here.
+- **Background removal is approximate** (see §4c caveat).
+- Single-image 3D infers unseen sides; the back/underside are model guesses. Output
+  meshes are typically not watertight.
+
+---
+
+## Notes for a proper upstream fix
+
+If Modly wants first-class Jetson support, the cleanest changes would be:
+
+1. In the extension `setup.py`, detect Tegra (`/etc/nv_tegra_release`) and install torch
+   from the jetson-ai-lab index for the detected JetPack/CUDA, pinning `numpy<2`.
+2. Make background removal pluggable / optional, or ship a non-ONNX fallback, so a broken
+   ONNX Runtime can't hard-crash generation.
+3. Document the headless backend (`uvicorn main:app` + the three env vars) as a supported
+   way to run generation without Electron.
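
Note on item 1 of the upstream-fix list: the Tegra detection could look roughly like the sketch below. This is not code from the extension's `setup.py`. The function names, the `JETSON_INDEXES` table, and the assumed first-line format of `/etc/nv_tegra_release` (`# R36 (release), REVISION: 4.x, ...`) are assumptions for illustration. Only the `jp6/cu126` channel and the `torch==2.8.0` / `torchvision==0.23.0` pairing come from the guide itself.

```python
import re
from pathlib import Path
from typing import List, Optional, Tuple

# Assumption: only the channel verified in this guide (L4T R36.4, JetPack 6.2) is
# mapped. Other releases would have to be added after checking the index.
JETSON_INDEXES = {
    (36, 4): "https://pypi.jetson-ai-lab.io/jp6/cu126",
}


def read_tegra_release(path: str = "/etc/nv_tegra_release") -> Optional[str]:
    """Return the file contents on a Jetson, or None on any other host."""
    p = Path(path)
    return p.read_text() if p.is_file() else None


def parse_l4t_release(text: str) -> Optional[Tuple[int, int]]:
    """Extract (major, minor) from a line like '# R36 (release), REVISION: 4.3, ...'."""
    m = re.search(r"R(\d+)\s+\(release\),\s+REVISION:\s+(\d+)", text)
    return (int(m.group(1)), int(m.group(2))) if m else None


def torch_install_args(release_text: Optional[str]) -> List[str]:
    """Build pip arguments for the Jetson torch wheels.

    Returns an empty list on non-Tegra hosts, so the caller can fall back
    to the stock install path.
    """
    if release_text is None:
        return []
    version = parse_l4t_release(release_text)
    index = JETSON_INDEXES.get(version)
    if index is None:
        raise RuntimeError(f"no known jetson-ai-lab channel for L4T {version}")
    # --no-deps and a single --index-url, as in section 4a of the guide.
    return ["install", "--no-deps", "--index-url", index,
            "torch==2.8.0", "torchvision==0.23.0"]


# Illustrative, abbreviated release line (not copied from a real device).
print(torch_install_args("# R36 (release), REVISION: 4.3"))
```

A real `setup.py` would pass these arguments to the venv's `pip` and then install the remaining dependencies with `numpy<2` pinned, as in §4b. Raising on an unmapped release is a deliberate choice in this sketch: silently falling back to the generic SBSA wheels would reproduce the "no kernel image" failure described in §4a.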