abed252/simple-container-runtime


simple-container-runtime

A minimal Linux container runtime written in C (~250 lines), implementing the core isolation mechanisms that production runtimes like runc use under the hood.

This is not a wrapper around Docker or containerd. It directly invokes Linux kernel APIs to build a container from scratch.

What it implements

Isolation layer       Kernel mechanism              Effect
────────────────────  ────────────────────────────  ─────────────────────────────────────────────────
Hostname isolation    unshare(CLONE_NEWUTS)         Container has its own hostname
Filesystem isolation  unshare(CLONE_NEWNS)          Container has its own mount tree
IPC isolation         unshare(CLONE_NEWIPC)         Container has its own SysV IPC / POSIX MQ
PID isolation         unshare(CLONE_NEWPID)         Container process becomes PID 1
Root filesystem swap  mount --bind + pivot_root(2)  Full filesystem root replacement
Memory limit          cgroups v2 memory.max         100 MB hard cap enforced by the kernel
Capability drop       libcap-ng                     All capabilities dropped except a small whitelist
Syscall filtering     libseccomp (BPF)              reboot, swapon, module loading blocked
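
As an illustration of the memory-limit row, here is a minimal sketch of the three cgroup steps (create the directory, write memory.max, write the PID to cgroup.procs). It is not the repository's code: the names write_control_file and cgroup_limit_memory are hypothetical, and the cgroup directory is passed in as a parameter so the helper can be tried against a scratch directory. On a real cgroup v2 hierarchy, memory.max only exists in the new directory if the memory controller is enabled in the parent's cgroup.subtree_control.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: write `value` to the file `name` inside `dir`.
 * O_CREAT is harmless on cgroupfs (the control files already exist) and
 * lets the sketch run against an ordinary directory.
 * Returns 0 on success, -1 on error. */
int write_control_file(const char *dir, const char *name, const char *value)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd == -1)
        return -1;

    size_t len = strlen(value);
    ssize_t written = write(fd, value, len);
    close(fd);
    return (written == (ssize_t)len) ? 0 : -1;
}

/* Hypothetical helper: create the cgroup directory, cap its memory at
 * `max_bytes`, then move `pid` into it.
 * Returns 0 on success, -1 on error. */
int cgroup_limit_memory(const char *cgroup_dir, long long max_bytes, pid_t pid)
{
    char buf[32];

    if (mkdir(cgroup_dir, 0755) == -1 && errno != EEXIST)
        return -1;

    snprintf(buf, sizeof buf, "%lld", max_bytes);
    if (write_control_file(cgroup_dir, "memory.max", buf) == -1)
        return -1;

    snprintf(buf, sizeof buf, "%ld", (long)pid);
    return write_control_file(cgroup_dir, "cgroup.procs", buf);
}
```

With the values from this README, the call would look like cgroup_limit_memory("/sys/fs/cgroup/simple_container", 100000000LL, child_pid), where child_pid is whatever fork() returned in the parent.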

Why pivot_root instead of chroot

chroot(2) only changes the directory used as the starting point for pathname lookups, and a process that holds CAP_SYS_CHROOT can escape it. pivot_root(2) instead replaces the root mount of the entire mount namespace. The old root is then unmounted with MNT_DETACH, which makes it genuinely unreachable from inside the container. This is what runc does.
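
The root swap described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's code: enter_rootfs is a hypothetical name, the caller is assumed to be root inside a fresh mount namespace, and mount propagation is assumed to have been made private already (pivot_root(2) fails with EINVAL when the relevant parent mounts are shared). glibc provides no pivot_root wrapper, so the sketch goes through syscall(2).

```c
#include <errno.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: make `new_root` the root of the current mount
 * namespace and detach the old root.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int enter_rootfs(const char *new_root)
{
    /* pivot_root(2) requires new_root to be a mount point, so bind-mount
     * the directory onto itself. */
    if (mount(new_root, new_root, NULL, MS_BIND | MS_REC, NULL) == -1)
        return -1;

    /* Work relative to the new root from here on. */
    if (chdir(new_root) == -1)
        return -1;

    /* Directory inside the new root where the old root will be parked. */
    if (mkdir("old_root", 0700) == -1 && errno != EEXIST)
        return -1;

    /* Swap roots: "." becomes "/", the previous root moves to /old_root. */
    if (syscall(SYS_pivot_root, ".", "old_root") == -1)
        return -1;
    if (chdir("/") == -1)
        return -1;

    /* Lazily unmount the old root so the host tree is no longer reachable,
     * then remove the now-empty mount point. */
    if (umount2("/old_root", MNT_DETACH) == -1)
        return -1;
    return rmdir("/old_root");
}
```

A chroot(2)-based version would skip the bind mount and the unmount, but it would leave the host's mounts attached to the namespace, which is the escape route this section describes.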

Execution flow

Parent                                  Child (PID 1 in new PID ns)
──────────────────────────────────      ──────────────────────────────────
unshare(CLONE_NEWPID)
fork() ─────────────────────────────►  block on sync pipe
mkdir /sys/fs/cgroup/simple_container
write memory.max = 100000000
write <child_pid> → cgroup.procs
signal child via pipe ──────────────►  unshare(UTS | NS | IPC)
waitpid()                               sethostname("mycontainer")
rmdir cgroup dir                        mount --bind rootfs → rootfs
                                        pivot_root(., old_root)
                                        umount2(old_root, MNT_DETACH)
                                        mount /proc
                                        drop capabilities (libcap-ng)
                                        seccomp BPF filter (libseccomp)
                                        execvp(cmd)

The sync pipe keeps the child from calling exec before the parent has placed it inside the cgroup. Without it there is a race in which the process runs unrestricted before the memory limit is applied.
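
A minimal sketch of this handshake, assuming a plain pipe(2) with a one-byte release message. The function names spawn_blocked and release_and_wait are hypothetical, and the namespace, mount, capability, and seccomp steps from the flow above are left out so that only the synchronization remains.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that blocks until the parent writes
 * one byte to a pipe, then execs argv.
 * On success returns the child's PID and stores the pipe's write end in
 * *release_fd. Returns -1 on error. */
pid_t spawn_blocked(char *const argv[], int *release_fd)
{
    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        char go;
        close(fds[1]);              /* the child only reads */
        /* Blocks here. read() returns 0 (EOF) if the parent closes the
         * pipe without releasing us, for example because setup failed. */
        if (read(fds[0], &go, 1) != 1)
            _exit(126);
        close(fds[0]);
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    close(fds[0]);                  /* the parent only writes */
    *release_fd = fds[1];
    return pid;
}

/* Hypothetical helper: release the blocked child and wait for it.
 * Returns the child's exit code, or -1 on error or abnormal exit. */
int release_and_wait(pid_t pid, int release_fd)
{
    int status;
    ssize_t n = write(release_fd, "x", 1);
    close(release_fd);

    if (waitpid(pid, &status, 0) == -1 || n != 1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The parent would do its cgroup setup between the two calls. If that setup fails, closing the write end without writing makes the child exit instead of running unrestricted.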

Requirements

  • Linux kernel ≥ 4.6 (cgroups v2 with memory.max)
  • libcap-ng (libcap-ng-devel on Fedora/RHEL, libcap-ng-dev on Debian/Ubuntu)
  • libseccomp (libseccomp-devel on Fedora/RHEL, libseccomp-dev on Debian/Ubuntu)
  • gcc, make
  • Must run as root (or with CAP_SYS_ADMIN)

Build

make

Usage

You need a root filesystem directory. One way to get a minimal one:

mkdir alpine-rootfs
docker export $(docker create alpine) | tar -xC alpine-rootfs

Then run:

sudo ./simple_container ./alpine-rootfs /bin/sh

Inside the container:

/ # hostname
mycontainer
/ # echo $$
1

Limitations

  • No network namespace (container shares the host network stack)
  • No user namespace (requires root on the host)
  • cgroups v2 only — no v1 fallback
  • Memory limit only; no CPU or I/O constraints
  • Single-container use; no lifecycle management

License

MIT
