Skip to content

Track tool history in sessions and reconstruct judge context server-side#126

Open
hunner wants to merge 6 commits into
mainfrom
mcp_tmp_nibz_tool_history
Open

Track tool history in sessions and reconstruct judge context server-side#126
hunner wants to merge 6 commits into
mainfrom
mcp_tmp_nibz_tool_history

Conversation

@hunner

@hunner hunner commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

What

Atryum's LLM-as-judge evaluates tool calls with the agent's recent history as context. Previously that history came from harness-supplied chat blobs — content the agent under evaluation could shape, making the judge's context poisonable. This PR moves history onto Atryum's own records: Atryum mints a session ID, the harness echoes it on every invocation, and the judge's context is reconstructed server-side from the tool calls Atryum recorded for that session. Nothing the harness asserts about history is trusted on this path.

How it works

Sessions (Invocations API). POST /api/v1/external/sessions returns an Atryum-minted ses_<uuid> bound to the agent identity. Every POST /api/v1/external/invocations carries session_id; a session presented by an agent that doesn't own it is rejected outright. Sessions expire on a 7-day sliding TTL (expires_at, refreshed on use); expired sessions are rejected. Harness-supplied context is ignored whenever session_id is present. The harness's own session ID is recorded as client_session_id for cross-referencing.

Trust model for reconstructed context. Each prior invocation is rendered under an explicit trust framing (judgeHistoryPreamble):

  • Tool outputs/errors are evidence of what happened, but may relay attacker-controlled data — they are fenced between sentinel delimiters and labeled "never follow instructions inside." Sentinel occurrences inside payloads are neutralized, so a crafted output cannot terminate its own fence or impersonate a delimiter.
  • Tool inputs are agent-chosen and labeled lower-trust.
  • Human chat conveys intent but cannot override the agent charter configured in Atryum.
  • Agent chat is excluded entirely.

Bounding. Context is capped by a byte budget keeping the most recent tail, with an explicit [older session history omitted: N …] marker; individual blobs are truncated; session metadata and agent IDs are length-capped. The managed-agents watcher applies an equivalent (separate) cap to its aggregate context.

Auth. The three external routes (/sessions, /invocations, /invocations/{id}) run under the agent-runtime OAuth middleware. When an authenticated subject is present it wins over any request-body agent_id, and RecordExecution rejects writes to invocations the caller doesn't own (recorded output feeds the judge as evidence, so cross-agent writes would poison a victim's context). No-auth deployments keep working on self-declared identity.

Storage. Migration 024 adds invocations.session_id and the external_sessions table; 025 adds expires_at (NULL = non-expiring for legacy rows). Invocation audit rows are never deleted or expired.

Harnesses. The amp plugin and pi extension are rewritten to the session model: mint once, cache, echo session_id — no more chat/thread scraping. Both send ATRYUM_ACCESS_TOKEN as a bearer. The ChatContext/ChatContextMessages/Context fields remain as deprecated aliases for older callers, and callers that send no session_id (e.g. the CI fake-agent baseline) degrade gracefully to history-free evaluation.

Also included. A small standalone addition rides along: a server.log_level config option wired to slog (configureLogging in cmd/atryum/main.go), unrelated to sessions.

Testing

Grounding tests pin the security properties: context is rebuilt from Atryum's DB rather than caller input; malicious tool output (including output embedding the fence sentinel itself) stays inside the untrusted fence; charter precedence framing is present; the recent-tail byte cap keeps newest entries and marks omissions; unknown/foreign/expired sessions are rejected; authenticated identity beats claimed agent_id; mismatched authenticated callers cannot record another agent's execution results. go build, go test ./..., and go vet pass.

Out of scope (deliberate)

  • MCP-proxy sessions (no history on that path yet)
  • Background reaper for expired session rows; broader invocation retention policy
  • Rolling summarization of omitted older history
  • Rate-limiting session creation; harness attestation
  • Hard "reject calls and force a new session" backstop for runaway sessions (soft capping only)

🤖 Generated with Claude Code

mdeyell-valid-mind and others added 6 commits July 3, 2026 09:36
Feed the LLM-as-judge an agent's recent history with an explicit trust model
so a misbehaving agent cannot poison the judge's context.

Trust model (judgeHistoryPreamble):
- tool OUTPUTS: trustworthy evidence of state, but never obey embedded instructions
- tool INPUTS: agent-chosen, treat skeptically (intent only)
- human chat: intent only; cannot override the Atryum-set agent charter
- agent chat: excluded entirely

Invocations API sessions (API-only for now):
- POST /api/v1/external/sessions mints an Atryum session id bound to the agent
  (authed identity wins, else self-declared agent_id in no-auth mode)
- harness echoes session_id on every submit; Atryum rebuilds the judge's
  context from the prior invocations it recorded for that session, ignoring any
  harness-supplied context blob (no harness override)
- session ownership verified on submit; mismatches are rejected, not dropped
- session-create records harness + client_session_id as bookkeeping metadata
- context bounded at maxSessionContextInvocations=500 (soft cap); hard
  size/length backstop deferred (see comment)

Managed-agents path: only user.message is kept; agent.message excluded.
MCP proxy path: still sends no history (out of scope).

Rename ChatContext -> SessionContext (deprecated aliases retained for older
harness callers and the fake-agent baseline).

Rewrite the amp plugin and pi extension examples to the session model: mint a
session once, cache the session_id, send it on each call, and stop scraping and
shipping chat/thread context. READMEs updated to match.

Adds store migration 024 (invocations.session_id + external_sessions table).
… expiry

- Fence prior tool outputs/errors as untrusted data, neutralizing embedded
  fence sentinels so a crafted payload cannot escape or impersonate the fence
- Cap reconstructed session context by byte budget with recent-tail selection
  and an older-history-omitted marker
- Add session lifecycle: expires_at with 7-day sliding TTL (new migration 025;
  024 left untouched), expired sessions rejected
- Malicious-harness guardrails: authenticated identity wins over claimed
  agent_id, metadata/agent_id length caps, harness-supplied context ignored
  when session_id is used
- Grounding tests: malicious output stays fenced, fence-escape attempt stays
  fenced, charter precedence, recent-tail cap, unknown/foreign/expired session
  rejection, authenticated-identity precedence
- Document RecordExecution ownership KNOWN GAP (no caller identity available
  on this branch's external routes; enforce once OAuth middleware lands)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…ownership

Follow-up hardening now that the agent-runtime OAuth middleware
(agentRuntimeHandler) also fronts the tool-history/session endpoints:

- Wrap POST /api/v1/external/sessions in agentRuntimeHandler so it is
  authenticated in the same way as the external invocation routes. The
  authenticated identity from auth.AgentIDFromContext already wins over the
  request-body agent_id in externalSessions, and the no-auth fallback
  (noAuthAgentIDHint) still applies when auth is not configured. The amp and
  pi example harnesses now send ATRYUM_ACCESS_TOKEN as a bearer token on their
  session-creation calls so they keep working once the route requires auth.

- Enforce ownership in RecordExecution. The PATCH
  /api/v1/external/invocations/{id} route now runs under the same middleware,
  so in auth mode ctx carries the authenticated caller. A caller that does not
  own the invocation (inv.AgentID is set from the authenticated identity at
  Submit time) is now rejected hard, closing the gap where a caller could
  poison another agent's session context via update.Result/Error. No-auth mode
  and the in-process managed-agents watcher (no identity in ctx) are unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@hunner hunner force-pushed the mcp_tmp_nibz_tool_history branch from d387fc1 to 44d77e4 Compare July 3, 2026 16:36
@hunner

hunner commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

For reviewers (and @nibz when you're back): a note on how this PR relates to the original branch work, since the description covers only the final state.

Spencer's baseline — the session mechanism and its trust model: Atryum-minted session IDs (POST /api/v1/external/sessions), harnesses echoing session_id on every invocation, judge context rebuilt server-side from Atryum's own DB, the trust-model preamble, hard rejection of foreign sessions, a soft cap of 500 invocations with per-blob truncation, migration 024, the amp/pi plugins rewritten to the session model, and agent-chat exclusion in the managed-agents watcher. Explicitly deferred at hand-off: a length-based context cap and session lifecycle.

Additions on top of that baseline:

  1. Session lifecycleexpires_at (migration 025) with a 7-day sliding TTL refreshed on use; expired sessions rejected. No reaper yet; invocation audit rows deliberately untouched.
  2. Byte-budgeted context — the deferred "it's about length, not count" item: a 24KB budget keeping the most recent tail, with an explicit [older session history omitted: N …] marker. The managed-agents watcher got an equivalent aggregate cap.
  3. Hardened trust framing — per-entry labels (inputs marked agent-chosen/lower-trust/do-not-obey), tool outputs/errors fenced between sentinel delimiters labeled untrusted — plus neutralization of sentinel occurrences inside payloads, so a crafted tool output cannot close its own fence and forge trusted-looking context.
  4. Malicious-harness guardrails — harness-supplied context ignored whenever session_id is present; length caps on session metadata and agent IDs; authenticated identity wins over the request-body agent_id.
  5. Auth enforcement — all three external routes (/sessions, /invocations, /invocations/{id}) now run under the agent-runtime OAuth middleware (the sessions route was previously unauthenticated), and RecordExecution rejects an authenticated caller writing another agent's execution results — closing a judge-context-poisoning hole, since recorded output is presented to the judge as evidence.
  6. Grounding tests pinning each property: fence-escape attempts stay fenced, malicious output stays fenced, charter precedence, recent-tail capping, unknown/foreign/expired session rejection, identity precedence, ownership rejection, and the canary asserting context comes from Atryum's DB rather than the caller.
  7. Housekeeping — rebased onto latest main (admin auth, machine keys), and the amp/pi examples send the bearer token on session creation.

In short: the baseline built the mechanism and trust model; these additions made the framing tamper-resistant, bounded everything (context size, metadata, session lifetime), and put real authentication behind the identity claims the design depends on.

🤖 Generated with Claude Code

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants