AFS Evaluation

Reproducible evaluation suites and benchmark results for AFS (Agentic File System) — a virtual file system abstraction layer that gives AI agents a unified, path-based interface to heterogeneous storage backends.

This repository accompanies the paper "Everything is a Path: An Asset Substrate for Agent Runtimes" (working title) and the AFS-UI paper, providing the runnable code, fixtures, methodology notes, and result artefacts behind every numerical claim.

About AFS. AFS unifies access to filesystems, databases, key-value stores, APIs, and cloud services behind a path-based protocol that any AI agent can navigate. The core repo is at github.com/AIGNE-io/afs (Apache-2.0). This repo contains the evaluations only — the AFS implementation itself lives there.

At-a-glance — what's in here

| Suite | Question it answers | Headline result |
|---|---|---|
| `agent-substrate-bench/` | How does AFS compare to MCP / LangChain / FS-CLI / Raw-SDK as an agent substrate? | AFS + FS-CLI: deterministic 0% leak; others 15–85% on the substrate-neutral ACL prompt across 1,200 trials |
| `afs-protocol-evaluations/` | Protocol-level properties: provider conformance density, schema cost, scaling | E1–E11 experiments — protocol-shape numbers for the paper's evaluation section |
| `afs-ui-evaluations/` | How do three UI-generation paradigms (AUP vs HTML vs Markdown) compare? | Performance, cost, interoperability, and maintainability across the 3 paradigms |
| `afs-runtime-gpt-5.5/` | Live-runtime regression: does the codebase actually deliver the paper's runtime claims? | RQ1 20/20, RQ2 5/5, RQ3 5/5, RQ11 6/6; conformance 1669/0 |
| `locomo/` | Long-term conversational memory recall (Snap, ACL'24) | Recall@5 67.5% with embeddings, 57.8% FTS-only |
| `longmemeval/` | Memory abilities + abstention (U. Edinburgh + Tsinghua, ICLR'25) | Recall@5 99.1% with embeddings; QA accuracy 67.6% with claude-haiku-4-5 reader |
| `perltqa/` | Personal long-term memory, multi-category (Du et al., NAACL'24) | Recall@5 86.6% English, 98.4% Chinese (FTS) |
| `dmr/` | Deep Memory Retrieval — canonical multi-session test (MemGPT, 2023) | Recall@5 96.6% with embeddings |
| `MEMORY-BENCHMARKS-SUMMARY.md` | Cross-benchmark headline | One table comparing all four memory benchmarks |

The cumulative claim:

AFS is a substrate-uniform asset protocol — its access-control, provenance, and discoverability properties hold across heterogeneous backend shapes (FS, KV, SQLite, JSON, HTTP, vault, …) where POSIX chmod cannot. The agent-substrate-bench data above is the empirical evidence; structural conformance tests in the AFS core repo (suites visibility-acl, canonical-paths, search-provenance) protect those properties from regression.

Getting started

Prerequisites

  • Bun ≥ 1.3 (used as the test runner across all suites)
  • Node.js ≥ 20 (for some scripts that shell out to npm packages)
  • pnpm 10.x (workspace package manager)
  • Anthropic / OpenAI API keys for suites that drive real LLMs

Layout

```
evaluation/
├── _shared/                          shared LLM bridge + QA reader/judge utilities
├── MEMORY-BENCHMARKS-SUMMARY.md      cross-benchmark headline table
├── agent-substrate-bench/            paper §IX — AFS vs MCP/LangChain/FS-CLI/Raw-SDK
│   ├── platforms/                    one mock substrate adapter per platform
│   ├── tasks/                        afs-intrinsic, hotpot-multi, swe-bench
│   ├── runners/                      run-suite, plus result analyzers
│   ├── planning/                     design.md, task-selection, conformance-promotions
│   └── results/                      v1/v2/v3 result CSVs + reports
├── afs-protocol-evaluations/         paper §IX — protocol-level (E1–E11)
├── afs-ui-evaluations/               AUP paper — RQ1/RQ2/RQ3
├── afs-runtime-gpt-5.5/              live-runtime regression on real codebase
└── locomo/, longmemeval/,            public memory benchmarks
    perltqa/, dmr/                      (data must be downloaded separately, see below)
```

Running a suite

Every suite has its own README with reproducible commands. The simplest entry points:

```bash
# Substrate comparison (agent-substrate-bench v3 — 1,200 trials)
cd agent-substrate-bench
bun runners/run-suite.ts --suite afs-intrinsic \
  --platforms afs,mcp,langchain,fs-cli,raw-sdk \
  --trials 10 \
  --models claude-haiku-4-5,claude-sonnet-4-5 \
  --out results/your-rerun

# Memory benchmark (e.g. LongMemEval)
cd longmemeval
bun scripts/run.ts --mode s --limit 100   # see longmemeval/README.md
```

Third-party datasets — download yourself

Memory benchmarks (locomo, longmemeval, perltqa, dmr) require their upstream datasets, which are NOT bundled here for license/size reasons:

| Suite | Dataset source | License |
|---|---|---|
| LOCOMO | snap-research/locomo | per upstream |
| LongMemEval | xiaowu0162/LongMemEval | per upstream |
| PerLTQA | Elvin-Yiming-Du/PerLTQA | per upstream |
| DMR | MemGPT paper appendix — MemGPT codebase | per upstream |

Each suite's README documents the exact path to drop the downloaded data into (<suite>/data/...).

Trial transcripts — partially bundled

The agent-substrate-bench/results/ and afs-ui-evaluations/results/ directories contain full transcripts (the v1/v2/v3 paper-grade evidence — small enough to bundle). The memory benchmarks' raw trials.jsonl files are not bundled (each is hundreds of MB to several GB); their summary reports (.md, .csv) are bundled. To regenerate transcripts, re-run the suite locally.

Methodology highlights

Substrate-bench afs-intrinsic

  • 5 substrates under test: AFS (production @aigne/afs), MCP-fair (mock MCP server), LangChain-style (BaseRetriever + BaseStore mget), FS-CLI (POSIX chmod + grep + cat), Raw-SDK (no ACL primitive)
  • 4 tasks probing distinct AFS-intrinsic properties: conformance discovery, access control, canonical-path provenance, cross-aggregate
  • Substrate-neutral prompts — entries are named by bare key and the agent is told to "use whatever fetch primitive your substrate provides" — removing any bias toward AFS-style path phrasing
  • Strict + lenient verifiers — strict checks the paper-grade canonical form; lenient checks whether the agent got the right idea (see the sketch after this list)
  • Intrinsic probes — substrate-property metrics (canonical-path-rate, namespace-acl, failure-envelope-rate, cross-provider-density) measured alongside extrinsic success
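
To make the strict/lenient split concrete, here is a minimal sketch of a dual verifier. The names (`Verdict`, `verifyAnswer`) and the canonical-path check are illustrative assumptions, not the actual harness code in `runners/`:

```ts
// Hypothetical dual verifier: strict requires the paper-grade canonical
// form; lenient only requires that the gold value appears at all.
interface Verdict {
  strict: boolean;  // exact canonical citation, e.g. the full canonical path
  lenient: boolean; // "got the right idea": gold value present after normalisation
}

const normalise = (s: string) => s.toLowerCase().replace(/\s+/g, " ").trim();

function verifyAnswer(
  answer: string,
  gold: { value: string; canonicalPath: string },
): Verdict {
  const lenient = normalise(answer).includes(normalise(gold.value));
  // Strict additionally demands the canonical path be cited verbatim.
  const strict = lenient && answer.includes(gold.canonicalPath);
  return { strict, lenient };
}
```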

See agent-substrate-bench/planning/design.md and agent-substrate-bench/results/v3-PAPER_REPORT.md for the full methodology + collaborator-reviewed honest data.

Memory benchmarks — apples-to-apples convention

  • Same hit rule everyone publishes: a retrieved chunk counts as a hit if (a) the gold answer appears as a substring after normalisation, OR (b) ≥ 50% of its meaningful tokens (length > 3, stopwords removed) appear in the chunk — see the sketch after this list
  • Mirrors YourMemory's methodology documentation
  • Scoreable subset — excludes adversarial-refusal categories (LOCOMO category 5, LongMemEval _abs); including them would conflate retrieval quality with refusal logic
  • Top-line headline metric: Recall@5 (the paper convention)
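
A minimal sketch of that hit rule and the Recall@5 roll-up. The stopword list, tokenizer, and names (`isHit`, `recallAt5`) are simplified stand-ins, not the shared utilities in `_shared/`:

```ts
// Hypothetical implementation of the cross-suite hit rule.
const STOPWORDS = new Set(["the", "and", "that", "with", "from", "this"]);

const normalise = (s: string) =>
  s.toLowerCase().replace(/[^\p{L}\p{N}\s]/gu, " ").replace(/\s+/g, " ").trim();

function isHit(chunk: string, goldAnswer: string): boolean {
  const c = normalise(chunk);
  const g = normalise(goldAnswer);
  // (a) substring match after normalisation
  if (c.includes(g)) return true;
  // (b) ≥ 50% of meaningful tokens (length > 3, stopwords removed) appear
  const tokens = g.split(" ").filter((t) => t.length > 3 && !STOPWORDS.has(t));
  if (tokens.length === 0) return false;
  const found = tokens.filter((t) => c.includes(t)).length;
  return found / tokens.length >= 0.5;
}

// Recall@5: fraction of questions where any of the top-5 chunks is a hit.
function recallAt5(results: { top5: string[]; gold: string }[]): number {
  const hits = results.filter((r) => r.top5.some((ch) => isHit(ch, r.gold)));
  return hits.length / results.length;
}
```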

See MEMORY-BENCHMARKS-SUMMARY.md for cross-suite comparison.

Evaluation-as-self-improvement loop

A central charter of this evaluation effort is that paper findings → real codebase changes. The agent-substrate-bench v1 → v2 → v3 progression closed that loop end-to-end:

  • v1 (700 trials) surfaced 6 substrate-adapter improvements
  • v2 shipped 5 of the 6 to the substrate adapter, confirmed over another 700 trials
  • v3 collaborator review removed 2 methodology biases; 5/5 paper-relevant changes were promoted to @aigne/afs core; 1,200 trials confirmed AFS + FS-CLI as the only deterministic 0%-leak substrates

Then the structural backing was filled in:

| Layer | Artifact in AFS core repo |
|---|---|
| Protocol primitive | `packages/core/src/afs.ts` — `visibility:meta` enforcement; `MountOptions.visibility` universal hook |
| L5 conformance | `packages/testing/src/suites/visibility-acl.ts` (Proxy invariant: `provider.search` never invoked) |
| L1 conformance | `packages/testing/src/suites/canonical-paths.ts`, `search-provenance.ts` |
| Production substrates | `providers/core/vault/test/visibility.test.ts`, `providers/core/kv/test/visibility-mount.test.ts` |
| Sweep | 30+ providers verified clean (core / platform / cost / messaging / iot / runtime) |
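
The Proxy invariant named above is a general JavaScript technique. A sketch of the idea, with an assumed provider shape rather than the real `visibility-acl.ts` wiring:

```ts
// Illustrative only: wrap a provider in a Proxy and assert that a
// meta-visibility mount never routes an agent search to provider.search.
type Provider = {
  search(q: string): Promise<string[]>;
  stat(path: string): Promise<unknown>;
};

function guardSearch<T extends Provider>(provider: T): T {
  return new Proxy(provider, {
    get(target, prop, receiver) {
      if (prop === "search") {
        throw new Error(
          "invariant violated: provider.search invoked on a meta-visibility mount",
        );
      }
      return Reflect.get(target, prop, receiver);
    },
  });
}

// In a conformance test, the guarded provider is mounted and agent-facing
// search is exercised; if the Proxy ever throws, content-level search is
// leaking through a meta-only visibility boundary.
```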

Contributing

This is an evaluation repo — issues, methodology critiques, and reproductions are welcome. PRs that add a benchmark / substrate / metric in the existing methodology style are happily reviewed.

For changes to the AFS core itself (new providers, protocol features, conformance suites), please open issues / PRs in AIGNE-io/afs instead. Keep discussion of what AFS does in the core repo and how AFS measures up in this one.

Citation

```bibtex
@misc{afs-evaluation,
  title  = {AFS Evaluation: Benchmark Suites for the Agentic File System},
  author = {ArcBlock},
  year   = {2026},
  howpublished = {\url{https://github.com/ArcBlock/afs-evaluation}},
  note   = {Reproducible evaluation harnesses and result artefacts for AFS
            (Agentic File System) — accompanies the "Everything is a Path"
            paper.}
}
```

License

MIT — see LICENSE.

The benchmark code, methodology, and result analyses in this repository are ours. The third-party dataset references (LOCOMO, LongMemEval, PerLTQA, DMR) remain under their original licenses; we link to them rather than redistribute.
