Product @ Pre6 AI · I build production-grade AI agent systems.
Product by title, builder by craft — I design AI products and ship the engineering behind them: multi-agent orchestration, agent evaluation, and AI safety infrastructure.
I care about the unglamorous half of AI products: the part that decides whether they survive contact with real users. Most demos make a single LLM call. Production systems need orchestration, evaluation, safety gates, and observability. That gap is where I build.
- Multi-agent orchestration — supervisor/specialist architectures with typed state, tool binding, and streaming traces.
- Agent reliability — measurable, auditable evaluation of agent runs across reliability, safety, latency, and cost.
- LLM safety — scanning retrieval context for prompt injection, secret leakage, PII, and exfiltration before it reaches a model.
- Developer tooling — sharp CLIs that turn fuzzy engineering signals into decisions teams can act on.
| Project | What it is | Stack | Links |
|---|---|---|---|
| winnow | Budget-aware context compression for RAG and agents — BM25 relevance + MMR diversity packs the highest-signal context into a token budget. Deterministic, zero runtime deps, no API keys, with a reproducible benchmark. | Python · CI | Code |
| gemma4-multi-agent | Multi-agent system — a Supervisor routes work across 4 specialist agents with live reasoning traces and sandboxed tool execution. | Python · LangGraph · Gemini · Streamlit | Code |
| agent-evals-lab | Evaluation workbench for agent reliability — typed scoring engine, policy rules, regression detection, and a trace-inspection dashboard. | TypeScript · React · CI | Live Demo · Code |
| verdict | Adversarial LLM red-teaming platform — runs PAIR, Crescendo, and injection attacks against any model, then reports attack-success-rate metrics with per-category breakdowns and HTML reports. | Python · CI | Code |
| rag-safety-gateway | AI security gateway that scans RAG context for prompt injection, secrets, PII, and exfiltration risk, producing deterministic allow/redact/quarantine decisions. | TypeScript · React · CI | Live Demo · Code |
| hermes | Test-time compute scaling engine — gives any LLM o1-style reasoning search via Process Reward Models, MCTS, and beam search. | Python · CI | Code |
Every featured project ships with tests, CI, and documentation — clone, run, and review the design in minutes.
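
To make the winnow row concrete, here is a minimal sketch of budget-aware packing with BM25 relevance and MMR diversity. It is an illustration under stated assumptions, not winnow's actual code: the function names (`bm25_scores`, `pack_context`), the whitespace-style token count, the Jaccard redundancy measure, and the `lam` trade-off parameter are all hypothetical choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Crude lowercase word tokenizer; a real system would count model tokens.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def jaccard(a, b):
    # Token-set overlap, used here as the redundancy measure (an assumption).
    sa, sb = set(tokenize(a)), set(tokenize(b))
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def pack_context(query, chunks, budget, lam=0.7):
    """Greedy MMR selection under a token budget.

    Returns the indices of the chosen chunks in the order they were picked.
    """
    rel = bm25_scores(query, chunks)
    top = max(rel, default=0.0) or 1.0
    rel = [r / top for r in rel]  # scale to [0, 1] so it is comparable to Jaccard
    cost = [len(tokenize(c)) for c in chunks]
    chosen, remaining, used = [], set(range(len(chunks))), 0

    def mmr(i):
        redundancy = max((jaccard(chunks[i], chunks[j]) for j in chosen), default=0.0)
        return lam * rel[i] - (1 - lam) * redundancy

    while remaining:
        fits = [i for i in remaining if used + cost[i] <= budget]
        if not fits:
            break
        # Iterating in index order makes ties resolve to the lowest index.
        best = max(sorted(fits), key=mmr)
        chosen.append(best)
        remaining.discard(best)
        used += cost[best]
    return chosen
```

Calling `pack_context(query, chunks, budget=200)` returns chunk indices rather than text, so the caller decides how to order and join them. Everything here is pure standard library with no randomness, so the same inputs always give the same selection.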
repo-pulse, one of my CLIs, generates an engineering-health report with no keys or config.
- Typed contracts first → domain models before logic, so behavior is auditable
- Deterministic by default → scoring and decisions reproducible without a live model
- Measurable, then pretty → evals and telemetry before dashboards
- Reviewable in 60 seconds → clone, run, understand, with no API keys to start
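
As a rough illustration of the first two principles, the sketch below defines the domain types first and then a rule-based scan that needs no live model. It is not taken from rag-safety-gateway or any other project listed here: the type names (`Action`, `Finding`, `Decision`), the `scan` function, and the three regex rules are all hypothetical and far too small for real use.

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Action(Enum):
    ALLOW = "allow"
    REDACT = "redact"
    QUARANTINE = "quarantine"


@dataclass(frozen=True)
class Finding:
    rule: str
    start: int
    end: int


@dataclass(frozen=True)
class Decision:
    action: Action
    findings: Tuple[Finding, ...]
    text: str  # text that may be passed on; empty when quarantined


# Hypothetical rules for illustration only. They are assumed not to
# produce overlapping matches, which keeps the redaction step simple.
REDACT_RULES = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}
QUARANTINE_RULES = {
    "prompt_injection": re.compile(
        r"ignore (all )?(previous|prior) instructions", re.IGNORECASE
    ),
}


def _find(rules, text):
    # Sort by position, then rule name, so the output order is stable.
    hits = [
        Finding(name, m.start(), m.end())
        for name, rx in rules.items()
        for m in rx.finditer(text)
    ]
    return tuple(sorted(hits, key=lambda f: (f.start, f.rule)))


def scan(text: str) -> Decision:
    """Return the same Decision for the same input, with no model call."""
    blocked = _find(QUARANTINE_RULES, text)
    if blocked:
        return Decision(Action.QUARANTINE, blocked, "")
    found = _find(REDACT_RULES, text)
    if not found:
        return Decision(Action.ALLOW, (), text)
    redacted = text
    # Replace from the end so earlier offsets stay valid.
    for f in reversed(found):
        redacted = redacted[: f.start] + f"[{f.rule.upper()}]" + redacted[f.end :]
    return Decision(Action.REDACT, found, redacted)
```

Because `Decision` is a frozen dataclass, two scans of the same input compare equal, which makes the behavior straightforward to check in a unit test or to diff across versions.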
Python · TypeScript · LangGraph · LangChain · React · Streamlit · Google Gemini · OpenAI · pytest · Vitest · GitHub Actions · uv
Open to conversations on AI agent engineering, evals, and LLM safety.
