A minimal Linux container runtime written in C (~250 lines), implementing the core isolation mechanisms that production runtimes like runc use under the hood.
This is not a wrapper around Docker or containerd. It directly invokes Linux kernel APIs to build a container from scratch.
| Isolation layer | Kernel mechanism | Effect |
|---|---|---|
| Hostname isolation | `unshare(CLONE_NEWUTS)` | Container has its own hostname |
| Filesystem isolation | `unshare(CLONE_NEWNS)` | Container has its own mount tree |
| IPC isolation | `unshare(CLONE_NEWIPC)` | Container has its own SysV IPC / POSIX MQ |
| PID isolation | `unshare(CLONE_NEWPID)` | Container process becomes PID 1 |
| Root filesystem swap | `mount --bind` + `pivot_root(2)` | Full filesystem root replacement |
| Memory limit | cgroups v2 `memory.max` | 100 MB hard cap enforced by the kernel |
| Capability drop | libcap-ng | All capabilities dropped except a small whitelist |
| Syscall filtering | libseccomp (BPF) | `reboot`, `swapon`, module loading blocked |
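As a rough illustration of the namespace rows in the table, the sketch below unshares the UTS, mount, and IPC namespaces in one call and then sets the hostname. `isolate_namespaces` is a hypothetical name chosen for this example, not necessarily how the runtime structures its own code, and the call is expected to fail with `EPERM` unless the process is root or holds `CAP_SYS_ADMIN`.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: detach from the host's UTS, mount, and IPC
 * namespaces, then give the new UTS namespace its own hostname.
 * Returns 0 on success, -1 with errno set on failure. */
int isolate_namespaces(const char *hostname)
{
    if (unshare(CLONE_NEWUTS | CLONE_NEWNS | CLONE_NEWIPC) == -1)
        return -1;

    /* Only affects the new UTS namespace; the host keeps its name. */
    return sethostname(hostname, strlen(hostname));
}
```

`CLONE_NEWPID` is deliberately left out of this helper: unsharing the PID namespace does not move the caller, it only affects children created afterwards, so it has to happen in the parent before `fork()`.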
chroot(2) only redirects pathname lookups — a process with CAP_SYS_CHROOT can escape it. pivot_root(2) replaces the root mount point of the entire mount namespace, then the old root is unmounted with MNT_DETACH, making it genuinely unreachable from inside the container. This is what runc does.
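The root swap described above might look roughly like the following. This is a sketch under assumptions rather than the project's actual code: `swap_root` is a hypothetical name, the `MS_PRIVATE` remount is an extra step added here because `pivot_root(2)` rejects shared mount propagation on many distributions, and it should only ever run inside a fresh mount namespace. glibc does not provide a `pivot_root` wrapper, so the call goes through `syscall(2)`.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: make `rootfs` the root of the current mount
 * namespace and detach the old root. Returns 0 or -1 with errno set. */
int swap_root(const char *rootfs)
{
    struct stat st;

    /* Validate the target before touching any mounts. */
    if (stat(rootfs, &st) == -1)
        return -1;
    if (!S_ISDIR(st.st_mode)) {
        errno = ENOTDIR;
        return -1;
    }

    /* Assumption: keep mount events from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
        return -1;

    /* pivot_root needs new_root to be a mount point: bind rootfs onto itself. */
    if (mount(rootfs, rootfs, NULL, MS_BIND | MS_REC, NULL) == -1)
        return -1;

    if (chdir(rootfs) == -1)
        return -1;
    if (mkdir("old_root", 0700) == -1 && errno != EEXIST)
        return -1;

    /* New root is ".", the previous root is parked at ./old_root. */
    if (syscall(SYS_pivot_root, ".", "old_root") == -1)
        return -1;
    if (chdir("/") == -1)
        return -1;

    /* Lazily detach the old root so nothing under it stays reachable. */
    if (umount2("/old_root", MNT_DETACH) == -1)
        return -1;
    return rmdir("/old_root");
}
```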

    Parent                                   Child (PID 1 in new PID ns)
    ──────────────────────────────────       ──────────────────────────────────
    unshare(CLONE_NEWPID)
    fork() ─────────────────────────────►    block on sync pipe
    mkdir /sys/fs/cgroup/simple_container
    write memory.max = 100000000
    write <child_pid> → cgroup.procs
    signal child via pipe ──────────────►    unshare(UTS | NS | IPC)
    waitpid()                                sethostname("mycontainer")
    rmdir cgroup dir                         mount --bind rootfs → rootfs
                                             pivot_root(., old_root)
                                             umount2(old_root, MNT_DETACH)
                                             mount /proc
                                             drop capabilities (libcap-ng)
                                             seccomp BPF filter (libseccomp)
                                             execvp(cmd)

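The parent-side cgroup steps in the flow above (create the directory, write `memory.max`, write the child's PID to `cgroup.procs`) could be sketched as follows. The helper names `write_control_file` and `cgroup_limit_memory` are hypothetical, and taking the cgroup directory as a parameter is a choice made for this example so it can be pointed at `/sys/fs/cgroup/simple_container` or at a scratch directory. On a real cgroup v2 hierarchy the kernel creates the control files when the directory is made, which is why they are opened without `O_CREAT`.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a short string to an existing control file inside `dir`.
 * Returns 0 on success, -1 with errno set on failure. */
static int write_control_file(const char *dir, const char *name,
                              const char *value)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path) {
        errno = ENAMETOOLONG;
        return -1;
    }

    int fd = open(path, O_WRONLY);
    if (fd == -1)
        return -1;

    size_t len = strlen(value);
    ssize_t written = write(fd, value, len);
    int saved = errno;
    close(fd);

    if (written != (ssize_t)len) {
        errno = (written == -1) ? saved : EIO;
        return -1;
    }
    return 0;
}

/* Hypothetical helper: create the cgroup, cap its memory, move `pid` in. */
int cgroup_limit_memory(const char *cgroup_dir, pid_t pid, long long max_bytes)
{
    char buf[32];

    if (mkdir(cgroup_dir, 0755) == -1 && errno != EEXIST)
        return -1;

    snprintf(buf, sizeof buf, "%lld", max_bytes);
    if (write_control_file(cgroup_dir, "memory.max", buf) == -1)
        return -1;

    snprintf(buf, sizeof buf, "%ld", (long)pid);
    return write_control_file(cgroup_dir, "cgroup.procs", buf);
}
```

With the values from the flow above, the call would be `cgroup_limit_memory("/sys/fs/cgroup/simple_container", child_pid, 100000000LL)`, made before the child is released.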
The sync pipe ensures the child does not exec before the parent places it inside the cgroup — avoiding a race where the process runs unrestricted before memory limits are applied.
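A minimal sketch of that handshake is shown below, assuming a one-byte message over an ordinary `pipe(2)`. `run_with_sync_pipe` and the two `demo_*` callbacks are hypothetical names for illustration: in the real runtime the parent-side step would be the cgroup placement and the child would go on to set up namespaces and exec. Here the demo callbacks use a marker file as a stand-in, so the ordering can be observed without root.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define DEMO_MARKER "/tmp/sync_pipe_demo.marker"

/* Hypothetical helper: fork a child that stays blocked until the parent
 * has finished `parent_setup`. Returns the child's exit code, or -1. */
int run_with_sync_pipe(void (*parent_setup)(pid_t child),
                       int (*child_main)(void))
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* Child: drop the write end, then block until one byte arrives. */
        char go;
        close(fds[1]);
        ssize_t n = read(fds[0], &go, 1);
        close(fds[0]);
        if (n != 1)
            _exit(127);          /* EOF: parent gave up before signalling */
        _exit(child_main());
    }

    /* Parent: do the setup first, only then release the child. */
    close(fds[0]);
    if (parent_setup)
        parent_setup(pid);
    int released = (write(fds[1], "x", 1) == 1);
    close(fds[1]);               /* on failure the child sees EOF */

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    if (!released || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* Stand-in for the cgroup placement: waits 100 ms, then leaves a marker. */
void demo_parent_setup(pid_t child)
{
    (void)child;
    struct timespec ts = { 0, 100 * 1000 * 1000 };
    nanosleep(&ts, NULL);
    int fd = open(DEMO_MARKER, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (fd != -1)
        close(fd);
}

/* Stand-in for the container process: succeeds only if setup already ran. */
int demo_child_main(void)
{
    return access(DEMO_MARKER, F_OK) == 0 ? 0 : 1;
}
```

Because the child cannot get past `read()` until the parent writes, the 100 ms delay in the demo setup does not let the child run ahead; without the pipe, the child would normally reach its check before the marker exists.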
- Linux kernel ≥ 4.6 (cgroups v2 with `memory.max`)
- `libcap-ng`: `libcap-ng-devel` on Fedora/RHEL, `libcap-ng-dev` on Debian/Ubuntu
- `libseccomp`: `libseccomp-devel` / `libseccomp-dev`
- `gcc`, `make`
- Must run as root (or with `CAP_SYS_ADMIN`)

Build:

    make

You need a root filesystem directory. One way to get a minimal one:

    mkdir alpine-rootfs
    docker export $(docker create alpine) | tar -xC alpine-rootfs

Then run:

    sudo ./simple_container ./alpine-rootfs /bin/sh

Inside the container:

    / # hostname
    mycontainer
    / # echo $$
    1

Limitations:

- No network namespace (container shares the host network stack)
- No user namespace (requires root on the host)
- cgroups v2 only — no v1 fallback
- Memory limit only; no CPU or I/O constraints
- Single-container use; no lifecycle management

License: MIT