2 changes: 1 addition & 1 deletion .github/workflows/test-gpu.yml
@@ -22,7 +22,7 @@ jobs:
- id: get_pr_info
if: github.event_name == 'push'
continue-on-error: true
uses: nv-gha-runners/get-pr-info@main
uses: nv-gha-runners/get-pr-info@090577647b8ddc4e06e809e264f7881650ecdccf

- id: gate
shell: bash
11 changes: 11 additions & 0 deletions Cargo.lock

Some generated files are not rendered by default.

2 changes: 2 additions & 0 deletions architecture/README.md
@@ -301,4 +301,6 @@ This opens an interactive SSH session into the sandbox, with all provider creden
| [Inference Routing](inference-routing.md) | Transparent interception and sandbox-local routing of AI inference API calls to configured backends. |
| [System Architecture](system-architecture.md) | Top-level system architecture diagram with all deployable components and communication flows. |
| [Gateway Settings Channel](gateway-settings.md) | Runtime settings channel: two-tier key-value configuration, global policy override, settings registry, CLI/TUI commands. |
| [Custom VM Runtime](custom-vm-runtime.md) | Dual-backend VM runtime (libkrun / QEMU), kernel configuration, and build pipeline. |
| [VM GPU Passthrough](vm-gpu-passthrough.md) | VFIO GPU passthrough for VMs: host preparation, safety checks, nvidia driver hardening, and troubleshooting. |
| [TUI](tui.md) | Terminal user interface for sandbox interaction. |
197 changes: 172 additions & 25 deletions architecture/custom-vm-runtime.md
@@ -1,29 +1,43 @@
# Custom libkrunfw VM Runtime
# Custom VM Runtime

> Status: Experimental and work in progress (WIP). VM support is under active development and may change.

## Overview

The OpenShell gateway VM uses [libkrun](https://github.com/containers/libkrun) to boot a
lightweight microVM with Apple Hypervisor.framework (macOS) or KVM (Linux). The kernel
is embedded inside `libkrunfw`, a companion library that packages a pre-built Linux kernel.
The OpenShell gateway VM supports two hypervisor backends:

The stock `libkrunfw` from Homebrew ships a minimal kernel without bridge, netfilter, or
conntrack support. This is insufficient for Kubernetes pod networking.
- **libkrun** (default) — lightweight VMM using Apple Hypervisor.framework (macOS) or KVM
(Linux). The kernel is embedded inside `libkrunfw`. Uses virtio-MMIO device transport and
gvproxy for user-space networking.
- **QEMU** — Linux-only VMM used for GPU passthrough (VFIO). Uses virtio-PCI device transport,
TAP networking, and requires a separate `vmlinux` kernel and `virtiofsd` for rootfs access.
  The QEMU binary is not embedded and must be installed on the host.

Backend selection is automatic: `--gpu` selects QEMU, otherwise libkrun is used. The `--backend`
flag provides explicit control (`auto`, `libkrun`, `qemu`).
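
The selection rule above can be sketched as a small shell function (hypothetical helper name and argument encoding; the real logic lives inside the `openshell-vm` binary):

```shell
# Sketch of the backend selection rule (assumption: simplified flag handling).
# $1 = value of --backend (auto|libkrun|qemu), $2 = 1 if --gpu was passed.
select_backend() {
  backend="$1"; gpu="$2"
  if [ "$backend" = "auto" ]; then
    if [ "$gpu" = "1" ]; then
      backend=qemu     # GPU passthrough requires the QEMU backend
    else
      backend=libkrun  # default path
    fi
  fi
  echo "$backend"
}

select_backend auto 1  # -> qemu
select_backend auto 0  # -> libkrun
```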

When `--gpu` is passed, `openshell-vm` automatically binds an eligible GPU to `vfio-pci`
and restores it to the original driver on shutdown. See
[vm-gpu-passthrough.md](vm-gpu-passthrough.md) for the full lifecycle description.

The custom libkrunfw runtime adds bridge CNI, iptables/nftables, and conntrack support to
the VM kernel, enabling standard Kubernetes networking.
Both backends share the same guest kernel (built from a single `openshell.kconfig` fragment)
and rootfs.

The stock `libkrunfw` from Homebrew ships a minimal kernel without bridge, netfilter, or
conntrack support. This is insufficient for Kubernetes pod networking. The custom kconfig
adds bridge CNI, iptables/nftables, conntrack, and QEMU compatibility.

## Architecture

```mermaid
graph TD
subgraph Host["Host (macOS / Linux)"]
BIN[openshell-vm binary]
EMB["Embedded runtime (zstd-compressed)\nlibkrun · libkrunfw · gvproxy"]
EMB["Embedded runtime (zstd-compressed)\nlibkrun · libkrunfw · gvproxy · rootfs"]
CACHE["~/.local/share/openshell/vm-runtime/{version}/"]
PROV[Runtime provenance logging]
GVP[gvproxy networking proxy]
QEMU_BIN["qemu-system-x86_64 · virtiofsd · vmlinux\n(GPU runtime bundle)"]

BIN --> EMB
BIN -->|extracts to| CACHE
@@ -44,8 +58,9 @@ graph TD
INIT --> VAL --> CNI --> EXECA --> PKI --> K3S
end

BIN -- "fork + krun_start_enter" --> INIT
GVP -- "virtio-net" --> Guest
BIN -- "libkrun: fork + krun_start_enter" --> INIT
BIN -- "QEMU: qemu-system-x86_64 + virtiofsd" --> INIT
GVP -- "virtio-net (libkrun only)" --> Guest
```

## Embedded Runtime
Expand All @@ -67,9 +82,22 @@ these to XDG cache directories with progress bars:
└── ...
```

This eliminates the need for separate bundles or downloads - a single ~120MB binary
provides everything needed to run the VM. Old cache versions are automatically
cleaned up when a new version is extracted.
When using QEMU for GPU passthrough, an additional runtime bundle is required alongside
the binary:

```
target/debug/openshell-vm.runtime/ (or alongside the installed binary)
├── virtiofsd # virtio-fs daemon
└── vmlinux # extracted guest kernel
```

This bundle is built with `mise run vm:bundle-runtime` and is separate from the
embedded runtime because virtiofsd is Linux-only and not embedded in the
self-extracting binary.

This eliminates the need for separate bundles or downloads for the default (libkrun)
path — a single ~120MB binary provides everything needed. Old cache versions are
automatically cleaned up when a new version is extracted.
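
The version-keyed cache layout makes that cleanup a simple directory sweep. A minimal sketch (hypothetical helper; the binary performs this internally):

```shell
# Remove cached runtime versions other than the one just extracted.
# $1 = cache base directory, $2 = version directory to keep.
clean_old_runtimes() {
  base="$1"; keep="$2"
  for d in "$base"/*/; do
    [ -d "$d" ] || continue                     # skip if glob matched nothing
    [ "$(basename "$d")" = "$keep" ] && continue
    rm -rf "$d"                                 # stale version: delete it
  done
}
```

For example, `clean_old_runtimes ~/.local/share/openshell/vm-runtime 1.2.3` would keep only the `1.2.3` directory.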

### Hybrid Approach

@@ -86,6 +114,34 @@ mise run vm:rootfs # Full rootfs (~2GB, includes images)
mise run vm:build # Rebuild binary with full rootfs
```

## Backend Comparison

| Aspect | libkrun (default) | QEMU |
|---|---|---|
| Platforms | macOS (Hypervisor.framework), Linux (KVM) | Linux (KVM) only |
| Device transport | virtio-MMIO | virtio-PCI |
| Networking | gvproxy (user-space, no root needed) | TAP (requires root/CAP_NET_ADMIN) |
| Rootfs delivery | In-process (krun API) | virtiofsd (virtio-fs daemon) |
| Kernel delivery | Embedded in libkrunfw | Separate `vmlinux` file |
| Console | virtio-console (`hvc0`) | 8250 UART (`ttyS0`) |
| Shutdown | Automatic on PID 1 exit | ACPI poweroff (`poweroff -f`) |
| GPU passthrough | Not supported | VFIO PCI |
| Vsock | libkrun built-in | `AF_VSOCK` (kernel `vhost_vsock`) |
| VM control | krun C API | Command-line args |
| Binary source | Embedded in runtime | Host-installed |
| `--exec` mode | Direct init replacement | Wrapper script with ACPI shutdown |
| CLI flag | `--backend libkrun` | `--backend qemu` or `--gpu` |

### Exec Mode Differences

With libkrun, when `--exec <cmd>` is used, the command replaces the init process and
the VM exits when PID 1 exits.

With QEMU, the VM does not exit automatically when PID 1 terminates. Instead, a wrapper
init script is written to the guest rootfs; it mounts the necessary filesystems, executes
the user command, captures the exit code, and calls `poweroff -f` to trigger an ACPI
shutdown that the hypervisor detects.
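
The wrapper's shape can be illustrated with a generator that emits the script text (illustrative sketch; the mount list and exit-code file path are assumptions, not the actual generated script):

```shell
# Emit a PID-1 wrapper script for the QEMU --exec path.
# $1 = user command to run inside the guest.
gen_exec_wrapper() {
cat <<EOF
#!/bin/sh
mount -t proc proc /proc
mount -t sysfs sysfs /sys
$1
echo \$? > /run/exec-exit-code
poweroff -f
EOF
}

gen_exec_wrapper "kubectl get nodes"
```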

## Network Profile

The VM uses the bridge CNI profile, which requires a custom libkrunfw with bridge and
@@ -100,6 +156,26 @@ fast with an actionable error if they are missing.
- Service VIPs: functional (ClusterIP, NodePort)
- hostNetwork workarounds: not required

### Networking by Backend

- **libkrun**: Uses gvproxy for user-space virtio-net networking. No root privileges
needed. Port forwarding is handled via gvproxy configuration.
- **QEMU**: Uses TAP networking (requires root or CAP_NET_ADMIN). When `--net none`
is passed, networking is disabled entirely (useful for `--exec` mode tests). gvproxy
is not used with QEMU.
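
The TAP path boils down to a handful of `ip` commands. The helper below prints them rather than executing them (hypothetical device name and address; applying them requires root):

```shell
# Print the TAP setup a QEMU launch would need (dry run).
# $1 = TAP device name, $2 = host-side IP/prefix.
tap_setup_cmds() {
  dev="$1"; host_ip="$2"
  echo "ip tuntap add dev $dev mode tap"
  echo "ip addr add $host_ip dev $dev"
  echo "ip link set $dev up"
}

tap_setup_cmds tap0 192.168.127.1/24
```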

## Guest Init Script

The init script (`openshell-vm-init.sh`) runs as PID 1 in the guest. After mounting essential filesystems, it performs:

1. **Kernel cmdline parsing** — exports environment variables passed via the kernel command line (`GPU_ENABLED`, `OPENSHELL_VM_STATE_DISK_DEVICE`, `VM_NET_IP`, `VM_NET_GW`, `VM_NET_DNS`). This runs after `/proc` is mounted so `/proc/cmdline` is available.

2. **Cgroup v2 controller enablement** — enables `cpu`, `cpuset`, `memory`, `pids`, and `io` controllers in the root cgroup hierarchy (`cgroup.subtree_control`). k3s/kubelet requires these controllers; the `cpu` controller depends on `CONFIG_CGROUP_SCHED` in the kernel.

3. **Networking** — detects `eth0` and attempts DHCP (via `udhcpc`). On failure, falls back to static IP configuration using `VM_NET_IP` and `VM_NET_GW` from the kernel cmdline (set by the QEMU backend for TAP networking). DNS is configured from `VM_NET_DNS` if set, overriding any stale `/etc/resolv.conf` entries.

4. **Capability validation** — verifies required kernel features (bridge networking, netfilter, cgroups) and fails fast with actionable errors if missing.
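
Steps 1 and 2 above can be sketched over a cmdline string (hypothetical helper name; the real init is `openshell-vm-init.sh` reading `/proc/cmdline`):

```shell
# Step 1 sketch: extract the recognized KEY=VALUE tokens from a kernel cmdline.
parse_vm_cmdline() {
  for tok in $1; do  # rely on word splitting of the cmdline string
    case "$tok" in
      GPU_ENABLED=*|OPENSHELL_VM_STATE_DISK_DEVICE=*|VM_NET_IP=*|VM_NET_GW=*|VM_NET_DNS=*)
        echo "$tok" ;;
    esac
  done
}

parse_vm_cmdline "console=ttyS0 GPU_ENABLED=1 VM_NET_IP=192.168.127.2/24 quiet"

# Step 2 sketch: the controller set written to cgroup.subtree_control.
echo "+cpu +cpuset +memory +pids +io"
```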

## Runtime Provenance

At boot, the openshell-vm binary logs provenance metadata about the loaded runtime bundle:
@@ -128,21 +204,46 @@ graph LR
BUILD_M["Build libkrunfw.dylib + libkrun.dylib"]
end

subgraph GPU["Linux CI (build-gpu-deps.sh)"]
BUILD_GPU["Build virtiofsd\n(for QEMU backend)"]
end

subgraph NV["Linux CI (build-nvidia-modules.sh)"]
BUILD_NV["Compile NVIDIA .ko against VM kernel"]
end

subgraph QEMU["Host-installed"]
QEMU_BIN["qemu-system-x86_64\n(not built — must be on host PATH)"]
end

subgraph Output["target/libkrun-build/"]
LIB_SO["libkrunfw.so + libkrun.so\n(Linux)"]
LIB_DY["libkrunfw.dylib + libkrun.dylib\n(macOS)"]
VIRTIOFSD["virtiofsd\n(QEMU backend)"]
VMLINUX["vmlinux\n(shared by QEMU)"]
NV_KO["nvidia-modules/*.ko\n(GPU builds only)"]
end

KCONF --> BUILD_L
BUILD_L --> LIB_SO
BUILD_L --> VMLINUX
BUILD_L -->|kernel source tree| BUILD_NV
BUILD_NV --> NV_KO
KCONF --> BUILD_M
BUILD_M --> LIB_DY
BUILD_GPU --> VIRTIOFSD
```

The `vmlinux` kernel is extracted from the libkrunfw build and reused by QEMU.
Both backends boot the same kernel — the kconfig fragment includes drivers for
both virtio-MMIO (libkrun) and virtio-PCI (QEMU) transports.

## Kernel Config Fragment

The `openshell.kconfig` fragment enables these kernel features on top of the stock
libkrunfw kernel:
libkrunfw kernel. A single kernel binary is shared by both backends (libkrun and
QEMU) — backend-specific drivers coexist safely (the kernel probes whichever
transport the hypervisor provides).

| Feature | Key Configs | Purpose |
|---------|-------------|---------|
@@ -158,11 +259,18 @@ libkrunfw kernel:
| IP forwarding | `CONFIG_IP_ADVANCED_ROUTER`, `CONFIG_IP_MULTIPLE_TABLES` | Pod-to-pod routing |
| IPVS | `CONFIG_IP_VS`, `CONFIG_IP_VS_RR`, `CONFIG_IP_VS_NFCT` | kube-proxy IPVS mode (optional) |
| Traffic control | `CONFIG_NET_SCH_HTB`, `CONFIG_NET_CLS_CGROUP` | Kubernetes QoS |
| Cgroups | `CONFIG_CGROUPS`, `CONFIG_CGROUP_DEVICE`, `CONFIG_MEMCG`, `CONFIG_CGROUP_PIDS` | Container resource limits |
| Cgroups | `CONFIG_CGROUPS`, `CONFIG_CGROUP_DEVICE`, `CONFIG_CGROUP_CPUACCT`, `CONFIG_MEMCG`, `CONFIG_CGROUP_PIDS`, `CONFIG_CGROUP_FREEZER` | Container resource limits |
| Cgroup CPU | `CONFIG_CGROUP_SCHED`, `CONFIG_FAIR_GROUP_SCHED`, `CONFIG_CFS_BANDWIDTH` | cgroup v2 `cpu` controller for k3s/kubelet |
| TUN/TAP | `CONFIG_TUN` | CNI plugin support |
| Dummy interface | `CONFIG_DUMMY` | Fallback networking |
| Landlock | `CONFIG_SECURITY_LANDLOCK` | Filesystem sandboxing support |
| Seccomp filter | `CONFIG_SECCOMP_FILTER` | Syscall filtering support |
| PCI / GPU | `CONFIG_PCI`, `CONFIG_PCI_MSI`, `CONFIG_DRM` | GPU passthrough via VFIO |
| Kernel modules | `CONFIG_MODULES`, `CONFIG_MODULE_UNLOAD` | Loading NVIDIA drivers in guest |
| virtio-PCI transport | `CONFIG_VIRTIO_PCI` | QEMU device bus (libkrun uses MMIO) |
| Serial console | `CONFIG_SERIAL_8250`, `CONFIG_SERIAL_8250_CONSOLE` | QEMU console (`ttyS0`) |
| ACPI | `CONFIG_ACPI` | QEMU power management / clean shutdown |
| x2APIC | `CONFIG_X86_X2APIC` | Multi-vCPU support (QEMU uses x2APIC MADT entries) |

See `crates/openshell-vm/runtime/kernel/openshell.kconfig` for the full fragment with
inline comments explaining why each option is needed.
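
An illustrative excerpt of the fragment's shape, built from options listed in the table above (the file itself is the authoritative list):

```
# Kubernetes pod networking
CONFIG_TUN=y
CONFIG_DUMMY=y

# QEMU backend compatibility (libkrun uses virtio-MMIO instead)
CONFIG_VIRTIO_PCI=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_ACPI=y
```
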
@@ -189,13 +297,22 @@ The standalone `openshell-vm` binary supports `openshell-vm exec -- <command...>
`openshell-vm exec` also injects `KUBECONFIG=/etc/rancher/k3s/k3s.yaml` by default so kubectl-style
commands work the same way they would inside the VM shell.

### Vsock by Backend

- **libkrun**: Uses libkrun's built-in vsock port mapping, which transparently
bridges the guest vsock port to a host Unix socket.
- **QEMU**: Uses `vhost-vsock-pci` with kernel `AF_VSOCK` sockets. The exec
bridge opens a kernel `AF_VSOCK` socket to the guest CID and bridges it to
the same Unix domain socket path used by the other backend. Requires the
`vhost_vsock` kernel module on the host.
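
On hosts with `socat` available, an equivalent bridge can be expressed as a one-liner; the helper below just prints it (hypothetical CID, port, and socket path; the real bridge runs in-process):

```shell
# Print a socat command bridging a host Unix socket to a guest vsock port.
# $1 = guest CID, $2 = vsock port, $3 = host Unix socket path.
vsock_bridge_cmd() {
  cid="$1"; port="$2"; sock="$3"
  echo "socat UNIX-LISTEN:${sock},fork VSOCK-CONNECT:${cid}:${port}"
}

vsock_bridge_cmd 3 1024 /tmp/openshell-exec.sock
```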

## Build Commands

```bash
# One-time setup: download pre-built runtime (~30s)
mise run vm:setup

# Build and run
# Build and run (libkrun, default)
mise run vm

# Build embedded binary with base rootfs (~120MB, recommended)
@@ -210,6 +327,29 @@ mise run vm:build # Rebuild binary
FROM_SOURCE=1 mise run vm:setup # Build runtime from source
mise run vm:build # Then build embedded binary

# Build GPU runtime bundle (Linux only)
mise run vm:bundle-runtime # Builds virtiofsd + extracts vmlinux

# Validate QEMU host prerequisites
mise run vm:qemu-check

# Install QEMU if not present (Ubuntu/Debian)
sudo apt install qemu-system-x86

# Load vhost-vsock kernel module (required for QEMU vsock)
sudo modprobe vhost_vsock
echo "vhost_vsock" | sudo tee /etc/modules-load.d/vhost_vsock.conf

# Build with GPU support (Linux x86_64 only)
FROM_SOURCE=1 mise run vm:setup # Build kernel from source (module compilation needs it)
mise run vm:nvidia-modules # Compile NVIDIA .ko files against VM kernel
mise run vm:rootfs -- --base --gpu # Build GPU rootfs with injected kernel modules
mise run vm:build # Rebuild binary with GPU rootfs

# Run with QEMU backend
openshell-vm --backend qemu # Requires qemu-system-x86_64 on host
openshell-vm --gpu # Auto-selects QEMU for GPU passthrough

# Wipe everything and start over
mise run vm:clean
```
@@ -221,20 +361,23 @@ rolling `vm-dev` GitHub Release:

### Kernel Runtime (`release-vm-kernel.yml`)

Builds the custom libkrunfw (kernel firmware), libkrun (VMM), and gvproxy for all
supported platforms. Runs on-demand or when the kernel config / pinned versions change.
Builds the custom libkrunfw (kernel firmware), libkrun (VMM), gvproxy, and virtiofsd
for all supported platforms. Runs on-demand or when the kernel config / pinned versions
change.

| Platform | Runner | Build Method |
|----------|--------|-------------|
| Linux ARM64 | `build-arm64` (self-hosted) | Native `build-libkrun.sh` |
| Linux x86_64 | `build-amd64` (self-hosted) | Native `build-libkrun.sh` |
| macOS ARM64 | `macos-latest-xlarge` (GitHub-hosted) | `build-libkrun-macos.sh` |
| Linux ARM64 | `build-arm64` (self-hosted) | `build-libkrun.sh` + `build-gpu-deps.sh` |
| Linux x86_64 | `build-amd64` (self-hosted) | `build-libkrun.sh` + `build-gpu-deps.sh` |
| macOS ARM64 | `macos-latest-xlarge` (GitHub-hosted) | `build-libkrun-macos.sh` (no GPU support) |

Artifacts: `vm-runtime-{platform}.tar.zst` containing libkrun, libkrunfw, gvproxy, and
provenance metadata.
Artifacts: `vm-runtime-{platform}.tar.zst` containing libkrun, libkrunfw, gvproxy,
and provenance metadata. Linux artifacts additionally include virtiofsd and the
extracted `vmlinux` kernel.

Each platform builds its own libkrunfw and libkrun natively. The kernel inside
libkrunfw is always Linux regardless of host platform.
libkrunfw is always Linux regardless of host platform. Virtiofsd is
Linux-only (macOS does not support VFIO/KVM passthrough).

### VM Binary (`release-vm-dev.yml`)

@@ -263,6 +406,10 @@ macOS binaries produced via osxcross are not codesigned. Users must self-sign:
codesign --entitlements crates/openshell-vm/entitlements.plist --force -s - ./openshell-vm
```

> **Note:** QEMU smoke tests (`vm_boot_smoke.rs`) are gated on `OPENSHELL_VM_BACKEND=qemu`.
> These tests require `qemu-system-x86_64` on the runner and are currently manual-only.
> Run `mise run vm:qemu-check` to validate prerequisites before running QEMU tests.

## Rollout Strategy

1. Custom runtime is embedded by default when building with `mise run vm:build`.
Expand Down