Skip to content

001TMF/pinakes

Pinakes

Deterministic, verifiable data for AI agents.

Version v0.1.0 CI Install MCP server Signed releases License: Apache-2.0

Pinakes — Deterministic, verifiable data for AI agents


Pinakes is a single signed static Go binary — with zero runtime dependencies — that gives AI agents reproducible, complete, and verifiable access to public scientific databases. It is a CLI, a Model Context Protocol stdio server, and a local REST server, all in one executable.

Named for the Pínakes, the catalogue of the Library of Alexandria — the first index of all knowledge.

The problem: ask a public biological database the same question twice and you can get different result sets. Tools paginate and truncate differently, and the databases expose no stable order. An agent's analysis silently inherits that drift, and nothing flags it.

The fix: every Pinakes query pins a content-addressed snapshot, so results are byte-identical forever; retrieval is complete-or-fail (the returned count is reconciled against the source's authoritative total — a short set fails loudly, never silently); and every result ships a re-runnable manifest that pinakes verify re-derives offline to prove the result is exactly what it claims. It is git for scientific data.


Install

Install the single static binary:

curl -fsSL https://get.pinakes.sh | sh

No token, no Docker, no network at query time beyond the source databases themselves. The install script verifies the release's SHA-256 checksum (and its cosign keyless signature when cosign is present) before installing anything.

Then register pinakes mcp with your MCP client — any client (Claude, Cursor, Cline, Windsurf, Zed, …); see Add to any MCP client below. (In Claude Code, for example, that's one line: claude mcp add --scope user pinakes -- pinakes mcp.)

Homebrew (coming soon). brew install pinakes-sh/tap/pinakes is pending the public Homebrew tap and is not live yet — use the curl | sh line above for now.

Add to any MCP client

Most clients (Claude Desktop, Cursor, Windsurf, Zed, VS Code via .mcp.json, …) read a standard config block. Add Pinakes to it:

{
  "mcpServers": {
    "pinakes": {
      "command": "pinakes",
      "args": ["mcp"]
    }
  }
}

No env, no token, no Docker — fully local and offline. If your GUI app can't find pinakes on its PATH, replace "command": "pinakes" with the absolute path from which pinakes.

Other ways to install

Homebrew (coming soon)
brew install pinakes-sh/tap/pinakes

Pending the public Homebrew tap — not live yet. Until then, use the curl | sh line above.

Go toolchain
go install pinakes.sh/pinakes/cmd/pinakes@latest

After installing, confirm the server is wired up by listing the catalogue:

pinakes catalog

API keys (optional)

Pinakes works fully without any key. The only sources that benefit today are the two NCBI sources (ncbi-protein, ncbi-virus): supplying a free NCBI api key raises your own upstream rate limit (E-utilities and the NCBI Datasets API both lift to ~10 requests/sec with a key, vs 3–5 without).

A key affects rate limits only — never the results, their order, or their hashes. Pinakes attaches the key to the outbound HTTP request alone; it never enters a snapshot, a record, or the reproducibility manifest, so a query is byte-identical and reproducible with or without a key.

The single ncbi alias covers both NCBI sources. Set it via the scoped CLI (the key is read from stdin, never passed as an argument, and stored at ~/.config/pinakes/config.yaml with 0600 permissions):

pinakes config set-key ncbi      # paste at the prompt, or: < keyfile
pinakes config list              # shows which sources have a key (never the value)

Or supply it via the environment (overrides the file; nothing on disk):

export PINAKES_NCBI_API_KEY=…    # honored by both NCBI sources

Get a free key from your NCBI accountSettingsAPI Key Management. The key is never exposed to the model or the MCP surface — pinakes config is a local CLI only, and get/list redact the value unless you pass --reveal.


Local state

Pinakes stores its content-addressed snapshots and reproducibility manifests under your OS cache directory by default — ~/Library/Caches/pinakes on macOS, $XDG_CACHE_HOME/pinakes (else ~/.cache/pinakes) on Linux. It is created on first use; back it up or clear it like any cache. Relocate it with PINAKES_STATE_ROOT:

export PINAKES_STATE_ROOT=/path/to/pinakes-state   # where the local snapshot store lives

Everything stays local: a pinned query and its pinakes verify re-run read this directory with no network access.


Why Pinakes

Three guarantees, none of which a raw API call or a typical MCP wrapper gives you:

  • Determinism. Every query pins a content-addressed snapshot of the source. The same pin returns one byte-identical set of normalized records — every run, forever — even when the underlying database reorders or grows. Where a source exposes no stable sort, Pinakes imposes one before hashing, so a pinned re-run cannot diverge.
  • Complete-or-fail. At capture, the retrieved count is reconciled against the source's own authoritative total. If the set is not provably complete, the query fails loudly instead of quietly handing back a shorter one. Silent truncation is the bug Pinakes exists to kill.
  • Verify, don't trust. Every result ships a re-runnable provenance manifest. pinakes verify re-derives the count and the record hash from the pinned snapshot offline — exit 0 means the snapshot reproduces exactly the result the manifest claims. A shortened or altered snapshot is refused.

Pinakes fixes the input layer. It does not run analysis, does not do phylogenetics, and never claims a source is "wrong" — the public databases are authoritative. It fixes naive client retrieval: no imposed order, no reconciliation, no provenance.


Tools

The MCP server exposes eight verbs. Each is also a CLI subcommand and a REST endpoint.

Tool What it does
catalog List available sources and their filter schema.
search Run a filter-driven query against a source.
get Fetch a record by identifier.
resolve Map an identifier from one namespace into another.
estimate Dry-run a plan — counts and cost — with no retrieval.
export Materialize a result set to a format.
jobs Check the status of an async job.
verify Prove that a manifest reproduces — offline.

Sources

26 public databases — proteins and structures (UniProt, PDB, AlphaFold, InterPro), variants and genetics (ClinVar, gnomAD, GWAS Catalog, Ensembl), chemistry and drugs (ChEMBL, PubChem, Guide to Pharmacology, openFDA), pathways and ontologies (Reactome, Gene Ontology, HPO, Rhea, Monarch), expression and disease (GTEx, Open Targets, cBioPortal, ClinicalTrials.gov), and more.

pinakes catalog is the authoritative, live list — every source with its filter schema, maturity, license, and required attribution. The README does not duplicate it; run it to see exactly what your binary serves:

pinakes catalog

Trust & verify

Pinakes offers a trust chain that no npx- or Docker-based MCP server can:

  • Signed binary. Releases are cosign keyless-signed (Sigstore + GitHub OIDC) and ship with an SBOM. The install script verifies the checksum (and the cosign signature when cosign is present) before anything lands on your machine.
  • Offline data provenance. After install, pinakes verify proves data provenance — it re-derives a result from its pinned snapshot with no network access and refuses any manifest whose snapshot has been tampered with.

Case study: Ebola, reproduced

case-studies/ebola/ is a worked case study on real data — a pinned, Complete (32/32), byte-identical, offline-verify-proven snapshot of the complete-genome records for Zaire ebolavirus from NCBI Virus. It runs the pinned query twice (hashes identical) and verifies it, entirely offline. CI denies all network access and asserts that the result reproduces byte-for-byte, is Complete, verifies, and that a tampered manifest is refused.

./case-studies/ebola/run.sh

Usage

From the CLI

A pinned, complete, verifiable search — and the offline proof:

pinakes search --source-id ncbi-virus \
  --filters organism_taxon_id:eq:186538 \
  --filters complete_only:eq:true \
  --filters released_since:gte:2023-01-01 \
  --snapshot-version sha256:b1983c09…      # → 32 records, Complete (32/32)

pinakes verify --manifest @manifest.json    # → {"verified": true}, offline

From an agent (MCP tool call)

Once registered, an agent calls the verbs by name. A typical search call:

{
  "name": "pinakes_search",
  "arguments": {
    "source_id": "uniprot",
    "filters": [
      { "field": "reviewed", "operator": "eq", "value": "true" },
      { "field": "organism_taxon_id", "operator": "in", "value": "9606" }
    ]
  }
}

The agent gets back normalized records plus a manifest it can later hand to pinakes_verify to prove the set is exactly what it claims — reproducibly, offline.

For agents

Pinakes is self-describing — an agent needs no external docs beyond the MCP tool schemas:

  1. Call catalog to discover the live sources and each source's filter schema (field names, allowed operators, enums) — this is how an agent learns what it can query.
  2. Call search / get / resolve with filters validated against that schema. An unknown field or bad value is rejected locally with a precise code (UNKNOWN_FIELD / INVALID_FILTER / BAD_ENUM) before any network call — an actionable correction, not a stack trace.
  3. Keep the returned manifest and call verify to prove the result reproduces.

Every tool's name, parameters, and description are served via MCP tools/list, so the full contract is in-band — no out-of-band documentation required.


Docs & links

A future hosted tier (accounts, metering) will use PINAKES_API_KEY. The local binary in this repository needs no key — it is the open, local engine, fully usable offline.


Versioning & stability

Pinakes follows SemVer. Tags are vMAJOR.MINOR.PATCH.

Two things version separately:

  • The determinism contract is frozen at ManifestSchemaVersion 1.0.0. A pinned query and the manifest pinakes verify re-derives stay reproducible across releases. See CONTRACTS.md for what the contract covers and the rules for ever changing it.
  • The CLI, MCP, and REST surfaces are still on the 0.x line. Flags, tool names, and endpoints may change before 1.0.0; surface stability is promised only at 1.0.0. The contract above does not move when a surface does.

License

Apache License 2.0. See LICENSE.

About

Pinakes — a deterministic, verifiable data layer for AI agents across domains (biology + literature). Same query → byte-identical records, complete-or-fail-loudly, re-runnable provenance manifest + verify().

Topics

Resources

License

Code of conduct

Contributing

Security policy

Stars

Watchers

Forks

Packages

 
 
 

Contributors

Languages