A decentralized peer-to-peer pipeline-parallel inference network written in Rust. Split large language models across machines and run inference collaboratively — BitTorrent-style.
Kademlia DHT (bootstrap node)
│
┌────────────┼────────────┐
│ │ │
[Worker A] [Worker B] [Worker C]
Layers 0-7 Layers 8-15 Layers 16-23
│ │ │
└────────────┴────────────┘
│
[Client]
(embeddings + LM head)
Each worker downloads a subset of transformer blocks, announces them on the DHT, and processes hidden states piped through the chain over QUIC or TCP.
git clone git@github.com:Shashank-k15/Inference-Mesh.git
cd Inference-Mesh
cargo build --releaseThe binary is at target/release/inferencemesh.
The bootstrap node is the rendezvous point. Start one on a machine with a public IP:
inferencemesh bootstrap --port 9000Note the peer ID printed on startup — workers and clients will need the bootstrap address.
InferenceMesh uses HuggingFace-format models with safetensors weights. Download any Llama-architecture model:
# Option A: Use huggingface-cli
pip install huggingface_hub
huggingface-cli download meta-llama/Llama-3.2-1B --local-dir ./models/llama-3.2-1b
# Option B: Clone via git (requires token for gated models)
git clone https://huggingface.co/meta-llama/Llama-3.2-1B ./models/llama-3.2-1bThe directory must contain config.json and model.safetensors (or a safetensors index plus shards).
Supported architectures: Any model using the standard Llama layout (RMSNorm, RoPE, GQA, SwiGLU MLP). This includes Llama 3/3.1/3.2, Mistral, Qwen, and Phi — anything with
model.layers.{N}.self_attn.q_projweight naming.
Run one worker per GPU or machine. Each worker loads a subset of layers determined by --layers:
Here is a possible configuration.
Machine A (layers 0–7):
inferencemesh worker \
--bootstrap /ip4/192.168.1.10/tcp/9000 \
--layers 0-7 \
--model-path ./models/llama-3.2-1b \
--device cuda \
--port 9100Machine B (layers 8–15):
inferencemesh worker \
--bootstrap /ip4/192.168.1.10/tcp/9000 \
--layers 8-15 \
--model-path ./models/llama-3.2-1b \
--device cuda \
--port 9200Machine C (layers 16–23):
inferencemesh worker \
--bootstrap /ip4/192.168.1.10/tcp/9000 \
--layers 16-23 \
--model-path ./models/llama-3.2-1b \
--device cuda \
--port 9300How many blocks to host? The system auto-selects based on available GPU memory. Omit
--layersto fill the GPU automatically, or specify a range to control it manually. A Llama 3.2 1B model has 16 hidden layers — split across 3 machines at ~5–6 layers each.
CPU-only fallback: Omit
--device cudato run on CPU. Useful for testing but slow for real inference.
Echo mode (no model): Omit
--model-pathto run in echo mode — the worker mirrors hidden states back untouched. Perfect for network testing without downloading models.
Once workers announce their layers, the swarm is live. Each worker will log:
Worker announcing layers 0-7
Worker bootstrapped
inferencemesh client \
--bootstrap /ip4/127.0.0.1/tcp/9000 \
--chain 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 \
--model-path ./Models \
--prompt "repeat a 5 times" \
--max-tokens 20The client discovers which workers host each layer via the DHT, builds a chain, and sends a test payload through the pipeline. You'll see:
Client: found provider for layer 0: PeerId(...)
Client: resolved chain: [PeerId(...), PeerId(...), PeerId(...)]
Client: sending forward pass to PeerId(...)
Client: received response for request_id 1