Teaching a model to become the router — replacing hand-crafted heuristics with learned tool-selection behavior via QLoRA distillation.
📖 Read the blog post: From Heuristics to Fine-Tuning
Autonomous AI agents need to decide which tool to call for every user query. Most implementations rely on:
- ❌ Regex/keyword matching (brittle, unmaintainable)
- ❌ Zero-shot LLM prompting (expensive, slow, inconsistent)
- ❌ Embedding similarity (loses argument extraction)
ToolForge solves this by fine-tuning a small LLM (7-8B) via QLoRA on synthetic tool-call traces — replacing a heuristic router with a learned one. On a clean, hand-written, non-circular test set with format-agnostic grading, fine-tuning improves tool-routing accuracy from 75.0% (base Qwen2.5-7B) to 83.3% (+8.3 pp) — and the gain comes from better routing decisions, not output formatting (see Honest Evaluation).
| Run | Base Model | LoRA r | LR | Test Accuracy | Eval Loss |
|---|---|---|---|---|---|
| 🥇 qwen7b-r64 | Qwen2.5-7B-Instruct | 64 | 2e-4 | 86.2% | 0.141 |
| 🥈 mistral-r64 | Mistral-7B-Instruct-v0.3 | 64 | 2e-4 | 82.8% | 0.670 |
| 🥉 mistral-r16 | Mistral-7B-Instruct-v0.3 | 16 | 2e-4 | 81.9% | 0.648 |
| ❌ mistral-lr5e4 | Mistral-7B-Instruct-v0.3 | 64 | 5e-4 | 60.3% | 0.730 |
Note: These accuracies are measured on a held-out split of the same synthetic distribution used for training (labels partly from the Gemini teacher). Because the teacher labels both train and test, this number is partly circular and should be read as an internal ablation comparing hyperparameters — not as an unbiased estimate of routing quality. For the unbiased, non-circular measurement, see Honest Evaluation below.
The ablation above answers "which hyperparameters are best?" but not "did fine-tuning actually add anything over the base model, or is Qwen2.5-7B already good at this?" — and it's measured on teacher-labeled data, which is circular.
To answer honestly, I built a separate, hand-written, non-circular test set (36 realistic, indirectly-phrased queries, labeled by hand — no teacher involved) and compared the base model against the fine-tuned adapter on identical inputs.
Fair grading. The fine-tuned model is trained to emit a specific <tool_calls>[...]</tool_calls> format. To avoid penalizing the base model purely for using a different format, grading is format-agnostic: a prediction counts if the correct tool is identified in any recognizable format (the trained format, Qwen-native <tool_call>, raw JSON, or function-call style).
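A minimal sketch of what format-agnostic grading could look like, assuming each call is a JSON object with a `name` (or `tool`) key. Those key names, the `KNOWN_TOOLS` set, and both helper functions are illustrative assumptions, not the project's actual grader. It scores routing only (which tools were picked), not arguments.

```python
import json
import re

KNOWN_TOOLS = {
    "web_search", "calculator", "weather", "wikipedia", "datetime",
    "dictionary", "translate", "unit_converter", "web_reader",
}

def _names_from_json(blob):
    """Return known tool names found in a JSON call or list of calls."""
    try:
        data = json.loads(blob)
    except json.JSONDecodeError:
        return []
    calls = data if isinstance(data, list) else [data]
    names = []
    for call in calls:
        if isinstance(call, dict):
            # Assumed key names; adjust to the real trace schema.
            name = call.get("name") or call.get("tool")
            if isinstance(name, str) and name in KNOWN_TOOLS:
                names.append(name)
    return names

def extract_tools(output):
    """Try each recognizable format in turn and return the tool names found."""
    # 1) Trained format: <tool_calls>[...]</tool_calls>
    match = re.search(r"<tool_calls>(.*?)</tool_calls>", output, re.DOTALL)
    if match:
        return _names_from_json(match.group(1).strip())
    # 2) Qwen-native: one or more <tool_call>{...}</tool_call> blocks
    blocks = re.findall(r"<tool_call>(.*?)</tool_call>", output, re.DOTALL)
    if blocks:
        return [n for b in blocks for n in _names_from_json(b.strip())]
    # 3) Raw JSON
    names = _names_from_json(output.strip())
    if names:
        return names
    # 4) Function-call style, e.g. calculator(expression="2+2")
    return [n for n in re.findall(r"\b([a-z_]+)\s*\(", output) if n in KNOWN_TOOLS]

def is_correct(output, gold_tools):
    """A prediction counts if it names exactly the gold tools, in any format."""
    return sorted(extract_tools(output)) == sorted(gold_tools)
```

With this shape, a plain-text answer yields an empty tool list, so it is graded correct only when the gold label is no-tool.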
| Model | Routing accuracy (format-agnostic) | Strict-format accuracy | Avg latency* |
|---|---|---|---|
| Base Qwen2.5-7B-Instruct | 75.0% | 75.0% | 3,457 ms |
| Fine-tuned (QLoRA r=64) | 83.3% | 83.3% | 5,322 ms |
| Gain | +8.3 pp | +8.3 pp | — |
\* Unbatched Hugging Face `generate()` on a single T4. This is not a production serving setup and not a fair latency comparison.
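For scale, the percentages in the table can be translated back into query counts. The counts below are inferred from the reported scores and the 36-query test set (they are the only integers that round to those values), not read from the evaluation logs.

```python
N_QUERIES = 36  # size of the hand-written test set

def pct(correct, total=N_QUERIES):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)

# Inferred counts: the only integers out of 36 matching the reported scores.
base_correct, tuned_correct = 27, 30

print(pct(base_correct))   # 75.0
print(pct(tuned_correct))  # 83.3
print(round(pct(tuned_correct) - pct(base_correct), 1))  # 8.3
```

Each query is worth about 2.8 pp, so the +8.3 pp gain corresponds to three additional queries routed correctly.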
- The gain is real routing, not formatting. Strict and lenient scores are identical for both models — base Qwen already emits parseable formats. So fine-tuning improved which tool the model picks, not how it writes the call.
- Gains concentrate on disambiguation: `web_search` vs `wikipedia` (33% → 100%), `unit_converter` vs `calculator` (67% → 100%), and multi-tool selection (67% → 100%).
- Honest tradeoff: fine-tuning slightly increases over-triggering on no-tool conversational queries (e.g. "what is 2 plus 2", "I'm bored"), a precision/recall cost of teaching the model to reach for tools. This is a known side effect of tool-routing fine-tuning, reported here rather than hidden.
- Fixed tool set. ToolForge learns 9 specific tools baked into the prompt; adding a tool requires retraining. It is a specialist router, trading the generality of frontier function-calling for a small, cheap, self-hostable model — the right tradeoff when the tool set is known and fixed.
- Latency above is not a serving benchmark (unbatched HF generate). A vLLM/batched setup is the correct way to measure production latency; that comparison is future work.
| Tool | Accuracy |
|---|---|
| datetime | 100% |
| unit_converter | 100% |
| web_reader | 100% |
| calculator | 94.1% |
| dictionary | 93.8% |
| weather | 92.3% |
| web_search | 91.7% |
| wikipedia | 86.7% |
| translate | 80.0% |
| multi_tool | 50.0% |
| no_tool | 41.7% |
Per-tool numbers are from the internal (synthetic) test split — see the circularity note above. The no_tool / multi_tool figures in particular are affected by teacher-label noise; the Honest Evaluation section is the unbiased measurement.
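A per-category breakdown like the one above could be computed roughly as follows. The `category_of` rule (empty gold list → `no_tool`, more than one tool → `multi_tool`) and the `(gold, pred)` pair layout are assumptions for illustration, not taken from `evaluate.py`.

```python
from collections import defaultdict

def category_of(gold_tools):
    """Assumed bucketing rule for the breakdown table."""
    if not gold_tools:
        return "no_tool"
    if len(gold_tools) > 1:
        return "multi_tool"
    return gold_tools[0]

def per_category_accuracy(pairs):
    """pairs: iterable of (gold_tools, predicted_tools) lists of tool names."""
    hits, totals = defaultdict(int), defaultdict(int)
    for gold, pred in pairs:
        cat = category_of(gold)
        totals[cat] += 1
        # Order-insensitive match on the set of tools selected.
        hits[cat] += int(sorted(gold) == sorted(pred))
    return {cat: round(100 * hits[cat] / totals[cat], 1) for cat in totals}
```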
- 7 of 9 tools score above 90% on the internal split: single-tool routing is close to production quality
- Adapter rank has minimal impact: r=16 (81.9%) vs r=64 (82.8%) on Mistral-7B, so the smaller adapter is a reasonable choice when deployment size matters
- Learning rate is critical: 5e-4 sharply degrades test accuracy (60.3% vs 82.8% at 2e-4 on Mistral-7B); 2e-4 is the sweet spot among the values tried
- Student surpasses teacher on some queries: on the hand-written test set, the fine-tuned model correctly routed several queries that the Gemini teacher would mislabel (e.g. current-events → `web_search` rather than `wikipedia`), confirmed by manual review of disagreements
┌─────────────────────────────────────────────────────────────┐
│ ToolForge Pipeline │
│ │
│ Phase 1: Data Generation │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ Template │ + │ Gemini │ → │ 1,173 labeled │ │
│ │ Generator │ │ Teacher │ │ examples │ │
│ │ (498 seed) │ │ (679 dist.) │ │ (train/val/ │ │
│ │ │ │ flash+lite │ │ test/hard) │ │
│ └──────────────┘ └──────────────┘ └────────────────┘ │
│ │
│ Phase 2: QLoRA Training │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ Base Model │ + │ LoRA r=64 │ → │ Fine-tuned │ │
│ │ (4-bit NF4) │ │ Adapter │ │ Router │ │
│ │ Qwen/Mistral │ │ ~335-646 MB │ │ +8.3pp vs base│ │
│ └──────────────┘ └──────────────┘ └────────────────┘ │
│ │
│ Phase 3: Evaluation │
│ ┌──────────────┐ ┌──────────────┐ ┌────────────────┐ │
│ │ Tool Acc. │ │ Per-Category│ │ W&B │ │
│ │ Arg Match │ │ Breakdown │ │ Dashboard │ │
│ │ Multi-Tool │ │ Error │ │ 4 ablation │ │
│ │ Latency │ │ Analysis │ │ runs │ │
│ └──────────────┘ └──────────────┘ └────────────────┘ │
└─────────────────────────────────────────────────────────────┘
The model learns to route queries to these tools (or respond directly):
| Tool | Description | Input Schema |
|---|---|---|
| `web_search` | Search the internet | `{query: str}` |
| `calculator` | Mathematical expressions | `{expression: str}` |
| `weather` | Current weather data | `{location: str}` |
| `wikipedia` | Encyclopedia lookup | `{query: str}` |
| `datetime` | Date/time operations | `{action: str, ...}` |
| `dictionary` | Word definitions | `{word: str}` |
| `translate` | Language translation | `{text: str, to_lang: str}` |
| `unit_converter` | Unit conversion | `{value: float, from: str, to: str}` |
| `web_reader` | Extract webpage content | `{url: str}` |
Plus no_tool (direct response) and multi_tool (chained calls).
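The input schemas above can be written down as data and used to sanity-check a predicted call's arguments. This is a sketch under assumptions: `TOOL_SCHEMAS` and `validate_call` are hypothetical names, and only the `action` field of `datetime` is checked because its remaining fields are elided in the table.

```python
TOOL_SCHEMAS = {
    "web_search": {"query": str},
    "calculator": {"expression": str},
    "weather": {"location": str},
    "wikipedia": {"query": str},
    "datetime": {"action": str},  # further fields are elided in the table
    "dictionary": {"word": str},
    "translate": {"text": str, "to_lang": str},
    "unit_converter": {"value": float, "from": str, "to": str},
    "web_reader": {"url": str},
}

def validate_call(name, arguments):
    """Return a list of problems; an empty list means the call fits the schema."""
    schema = TOOL_SCHEMAS.get(name)
    if schema is None:
        return [f"unknown tool: {name}"]
    problems = []
    for field, expected in schema.items():
        if field not in arguments:
            problems.append(f"missing field: {field}")
            continue
        value = arguments[field]
        if expected is float:
            # JSON has a single number type, so accept ints but not booleans.
            ok = isinstance(value, (int, float)) and not isinstance(value, bool)
        else:
            ok = isinstance(value, expected)
        if not ok:
            problems.append(f"wrong type for field: {field}")
    return problems
```

A check like this separates routing errors (wrong tool) from argument-extraction errors (right tool, malformed arguments).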
toolforge/
├── README.md
├── requirements.txt
├── configs/
│ ├── mistral_r64.yaml # Default training config
│ ├── mistral_r16.yaml # Small adapter ablation
│ └── llama_r64.yaml # Alternative base model
├── data/
│ ├── synthetic/
│ │ ├── queries.json # 1,894 generated queries
│ │ └── teacher.jsonl # 679 Gemini-labeled examples
│ ├── train.jsonl # 918 training examples
│ ├── val.jsonl # 114 validation examples
│ ├── test.jsonl # 116 test examples
│ └── hard_test.jsonl # 25 multi-tool edge cases
├── src/
│ ├── data_gen/
│ │ ├── template_generator.py # Deterministic seed data (498 examples)
│ │ ├── teacher_labeler.py # Gemini distillation with multi-key rotation
│ │ └── build_dataset.py # Merge, dedup, split into train/val/test
│ ├── training/
│ │ ├── train.py # QLoRA fine-tuning with SFTTrainer
│ │ └── merge.py # LoRA → base model merge for deployment
│ └── eval/
│ └── evaluate.py # Tool accuracy, F1, per-category breakdown
├── kaggle_ablation.py # Self-contained Kaggle notebook with W&B
└── kaggle_notebook.py # Single-run training notebook
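The merge, dedup, and split step attributed to `build_dataset.py` might look roughly like the sketch below. The `query` field name, the normalisation rule, the seed, and the 10% validation and test fractions are assumptions (the fractions are chosen to be close to the 918 / 114 / 116 split sizes listed above); the real script may differ.

```python
import random

def dedup_and_split(examples, seed=42, val_frac=0.1, test_frac=0.1):
    """Drop duplicate queries, shuffle deterministically, and split."""
    seen, unique = set(), []
    for ex in examples:
        # Normalise case and whitespace so trivial variants count as duplicates.
        key = " ".join(ex["query"].lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    random.Random(seed).shuffle(unique)
    n_val = round(len(unique) * val_frac)
    n_test = round(len(unique) * test_frac)
    return {
        "val": unique[:n_val],
        "test": unique[n_val:n_val + n_test],
        "train": unique[n_val + n_test:],
    }
```

Deduplicating before splitting matters here: a query that appears in both train and test would inflate the internal test accuracy.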
# Install dependencies
pip install -r requirements.txt
# Generate seed queries + label with Gemini
# (requires API keys in .env — get free keys at aistudio.google.com)
python -m src.data_gen.teacher_labeler --n 2500
# Build final dataset splits
python -m src.data_gen.build_dataset

- Upload `data/*.jsonl` as a Kaggle Dataset
- Create a new notebook with GPU T4 enabled
- Paste cells from `kaggle_ablation.py` and run

# Or train locally with a GPU
python -m src.training.train --config configs/mistral_r64.yaml

python -m src.eval.evaluate \
--checkpoint checkpoints/qwen7b-r64-lr2e4/final \
--test-set data/test.jsonl

| Source | Count | Method | Quality |
|---|---|---|---|
| Template Generator | 498 | Deterministic rules, 100% clean labels | ⭐⭐⭐ |
| Gemini Distillation | 679 | `gemini-2.5-flash` + `flash-lite` function calling | ⭐⭐ |
The teacher labeler (teacher_labeler.py) is designed for zero-cost, zero-data-loss operation:
- Multi-key round-robin: 6 API keys × 2 models = 12 independent quota slots
- Incremental saves: Every label is flushed to disk immediately
- Smart retry logic: Distinguishes daily quota (mark key dead) vs transient 503 (exponential backoff)
- Resume support: the `--resume` flag continues from exactly where you left off
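The rotation and retry behaviour described in the list above can be sketched as follows. Everything named here is hypothetical: `call_fn` stands in for whatever Gemini client wrapper `teacher_labeler.py` uses, the two exception classes are placeholders for however the real code detects daily-quota and 503 errors, and the second model name is assumed from "flash-lite".

```python
import itertools
import time

class QuotaExhausted(Exception):
    """Daily quota used up for one (key, model) slot."""

class TransientError(Exception):
    """Temporary failure such as an HTTP 503."""

MODELS = ["gemini-2.5-flash", "gemini-2.5-flash-lite"]  # second name is assumed

def make_slots(api_keys, models=MODELS):
    """6 keys x 2 models gives 12 independent quota slots."""
    return list(itertools.product(api_keys, models))

def label_with_rotation(query, slots, call_fn, max_retries=3,
                        base_delay=1.0, sleep=time.sleep):
    """Try slots round-robin; return a label, or None if every slot failed.

    Exhausted slots are removed from `slots` in place (marked dead).
    """
    tried = 0
    while slots and tried < len(slots):
        api_key, model = slots[0]
        try:
            for attempt in range(max_retries):
                try:
                    label = call_fn(query, api_key, model)
                except TransientError:
                    sleep(base_delay * 2 ** attempt)  # exponential backoff
                else:
                    slots.append(slots.pop(0))  # rotate to the next slot
                    return label
        except QuotaExhausted:
            slots.pop(0)  # daily quota hit: drop this slot for the run
            continue
        slots.append(slots.pop(0))  # retries used up: move on to the next slot
        tried += 1
    return None
```

Returning `None` rather than raising lets the caller stop cleanly and pick up later, which fits the incremental-save and resume design.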
# Resume after quota exhaustion — add fresh keys to .env and re-run
python -m src.data_gen.teacher_labeler --resume

| Parameter | Value |
|---|---|
| Quantization | 4-bit NF4, double quantization |
| LoRA rank | 64 (best), 16 (ablation) |
| LoRA alpha | 128 |
| Target modules | q, k, v, o, gate, up, down projections |
| Optimizer | AdamW |
| Learning rate | 2e-4 (cosine schedule) |
| Batch size | 4 × 4 gradient accumulation = 16 effective |
| Epochs | 3 |
| Trainable params | ~335M / 7.2B (4.6%) |
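As a rough illustration, the table above maps onto `transformers` and `peft` configuration objects along these lines. This is a sketch, not the contents of `train.py`: the `HPARAMS` dict and function names are hypothetical, the `*_proj` module names are the usual ones for Mistral and Qwen2 architectures, and the float16 compute dtype is an assumption based on the T4's lack of bfloat16 support.

```python
HPARAMS = {
    "lora_r": 64,
    "lora_alpha": 128,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 3,
}

def effective_batch_size(h=HPARAMS):
    """Examples contributing to each optimizer step."""
    return h["per_device_train_batch_size"] * h["gradient_accumulation_steps"]

def build_configs(h=HPARAMS):
    """Build quantization and adapter configs (needs torch, transformers, peft)."""
    import torch
    from peft import LoraConfig
    from transformers import BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.float16,  # assumption: T4 lacks bfloat16
    )
    lora = LoraConfig(
        r=h["lora_r"],
        lora_alpha=h["lora_alpha"],
        target_modules=h["target_modules"],
        task_type="CAUSAL_LM",
    )
    return bnb, lora
```

With an effective batch of 16 and 918 training examples, one epoch is roughly 57 to 58 optimizer steps, so three epochs come to a little over 170 steps.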
| Step | Train Loss | Eval Loss |
|---|---|---|
| 50 | 0.724 | 0.698 |
| 100 | 0.581 | 0.687 |
| 150 | 0.495 | 0.672 |
All runs are logged to Weights & Biases under the toolforge project:
- Training loss curves (per-step)
- Validation loss at each checkpoint
- Test accuracy and per-category breakdown
- Hyperparameter comparison across ablation runs
- System metrics (GPU utilization, memory)
With only 918 training examples and a 7B model, full fine-tuning would very likely overfit. QLoRA freezes over 95% of the weights and trains only ~335M adapter parameters, which is enough capacity for tool routing without destroying the base model's knowledge.
Cost. Gemini's free tier provides 20+ requests/day per model per key. With 6 keys × 2 models = 12 quota slots, we labeled 679 examples at zero cost. The multi-key rotation system makes this fully automated.
The model sees 27/30 correct labels for patterns like "define X → dictionary" and learns the dominant signal. The 3 inconsistent Gemini labels act as label noise that the majority signal outweighs; tolerance to a modest fraction of noisy labels is a well-known property of neural network training.
- Python 3.12+
- PyTorch 2.x with CUDA
- transformers, peft, trl, bitsandbytes
- Google API keys (free tier) for data generation
- GPU: T4 (16GB) minimum for training
MIT