Skip to content

fix(broker): reclaim agent on 409 instead of crashing the broker (#797)#830

Merged
khaliqgant merged 1 commit into
mainfrom
fix/issue-797-agent-name-collision
May 10, 2026
Merged

fix(broker): reclaim agent on 409 instead of crashing the broker (#797)#830
khaliqgant merged 1 commit into
mainfrom
fix/issue-797-agent-name-collision

Conversation

@miyaontherelay
Copy link
Copy Markdown
Contributor

Summary

Fixes #797. When agent-relay up --workspace-key … is run against a workspace where an agent with the cwd-derived name already exists — same broker restarting, or a second machine joining the workspace whose prior offline record is still in the cloud — register_agent_with_workspace_key was calling relay.register_agent and bailing on 409 with strict_name=true. The broker exited during connect_relay(); the SDK's in-flight getSession() saw a dying broker and surfaced Bun's misleading "Unable to connect. Is the computer able to access the url?" error, which is what the issue reporter saw.

On 409 we now reclaim the existing agent: rotate its bearer token, then fetch the agent record to pluck the existing id. The non-strict path keeps its single suffix retry for ephemeral callers.

Reproduction

Reliable repro on Ubuntu aarch64 in Docker (matches the reporter's VPS): two back-to-back agent-relay up --workspace-key … invocations against the same workspace from cwds with the same basename. The second crashes during the initial handshake. v6.0.13's better launcher diagnostics surface the real cause as agent name 'X' already exists.

After the fix: three back-to-back invocations all start cleanly, all reuse the same agent_id, each rotates to a fresh at_live_* token.

Why not relaycast::RelayCast::register_or_get_agent?

Its internal get_agent call deserializes into the strict Agent struct, which requires workspace_id, token_hash, and created_at — fields the live cloud's GET /v1/agents/{name} does not return. The decode fails with HTTP error: error decoding response body. We instead call rotate_agent_token (whose TokenRotateResponse decodes fine) and use a raw HttpClient::get::<serde_json::Value> to read just the id field — every other field is non-essential for startup.

Changes

  • src/auth.rs — replace fatal bail! on 409 (strict_name) with reclaim_agent_via_rotate_token. New helper does rotate-token + lenient GET.
  • Cargo.toml — add urlencoding = "2.1" for safe agent-name encoding in the GET path.
  • Test strict_name_returns_conflict_error removed (asserted the broken behavior); replaced with strict_name_conflict_reclaims_via_rotate_token. Mock GET payload deliberately omits the cloud-missing fields, so the test locks in the lenient deserialization.

Test plan

  • cargo test --lib auth:: — 14/14 pass
  • cargo test --lib — 232/232 pass
  • cargo clippy --bin agent-relay-broker --lib — clean
  • End-to-end: 3× back-to-back agent-relay up --workspace-key with same name against same live workspace, in Ubuntu aarch64 Docker, all start, all reuse the same agent_id, all rotate token. Same scenario crashed pre-fix.
  • Reviewer to validate against another live workspace if desired

🤖 Generated with Claude Code

When `agent-relay up --workspace-key …` is run on a workspace where an
agent with the cwd-derived name already exists (the same broker
restarting, or a second machine joining the workspace and its prior
offline record still present cloud-side), `register_agent_with_workspace_key`
called `relay.register_agent` and bailed on 409 with strict_name=true. The
broker then exited during `connect_relay()`; the SDK's getSession() saw a
killed broker and reported the misleading "Unable to connect. Is the
computer able to access the url?" Bun fetch error.

Reproduce on Ubuntu aarch64 in Docker by running two `agent-relay up
--workspace-key …` invocations against the same workspace from cwds with
the same basename — the second one dies during the initial handshake.
v6.0.13's better launcher diagnostics surface the real cause as
"agent name 'X' already exists".

Fix: on 409 with strict_name, reclaim the existing agent. We can't use
the relaycast crate's `register_or_get_agent` helper because its internal
`get_agent` deserializes into a strict `Agent` struct that doesn't match
the live cloud's GET /v1/agents/{name} response (no workspace_id /
token_hash / created_at). Instead we call rotate_agent_token directly to
get a fresh bearer token, then GET the agent record as serde_json::Value
to pluck the existing id. The non-strict path keeps its single suffix
retry for ephemeral callers.

End-to-end verified in Docker: three back-to-back attempts on the same
workspace with the same name now all start successfully, reusing the
agent_id and rotating the token each time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 9, 2026

Review Change Stack

📝 Walkthrough

Walkthrough

This PR enables agent reclamation when a broker with strict naming attempts to register but encounters a 409 name conflict. Instead of failing, it now rotates the existing agent's token and refetches the agent record to recover the agent ID, allowing multi-machine workspace sharing via shared API keys.

Changes

Agent Reclamation on Name Conflict

Layer / File(s) Summary
Dependencies
Cargo.toml
Added urlencoding = "2.1" for URL encoding of agent names in HTTP requests.
Imports & HTTP Client Types
src/auth.rs
Updated relaycast imports to include ClientOptions and HttpClient for HTTP-based agent metadata operations.
Conflict Reclamation & Token Rotation
src/auth.rs
When strict_name is true and agent registration returns 409 conflict, routing now calls reclaim_agent_via_rotate_token instead of failing. New helper rotates agent token via POST /v1/agents/{name}/rotate-token, fetches agent record via GET /v1/agents/{name} as generic JSON, and returns reclaimed (agent_id, agent_name, token, None) tuple.
Tests & Validation
src/auth.rs
Replaced strict-name conflict test; new strict_name_conflict_reclaims_via_rotate_token mocks token rotation and agent fetch endpoints, asserts startup succeeds, and validates returned token and agent ID match the reclaimed agent.

Sequence Diagram

sequenceDiagram
  participant Client as AuthClient
  participant API as Relay API
  participant Store as Agent Store
  Client->>API: POST /v1/agents/{name}
  API-->>Client: 409 Conflict (name exists)
  Client->>API: POST /v1/agents/{name}/rotate-token
  API-->>Store: rotate token
  Store-->>API: new token
  API-->>Client: rotated token
  Client->>API: GET /v1/agents/{name}
  API-->>Client: agent record {id, ...}
  Client->>Client: reclaim with agent_id + rotated token
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • willwashburn
  • khaliqgant

Poem

🐰 A conflict arose, but the relay stands tall,
Rotating the token to reclaim it all,
When names collide across distant machines,
One agent persists through the agent's routines—
Now workspaces unite, no more shall we stall! 🚀

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed Title clearly summarizes the main change: fixing a broker crash by reclaiming agents on HTTP 409 instead of failing, which directly addresses the issue referenced.
Description check ✅ Passed Description comprehensively covers the problem, solution, reproduction steps, code changes, and a detailed test plan with both unit and end-to-end validation.
Linked Issues check ✅ Passed Code changes in auth.rs implement agent reclamation via token rotation and lenient GET, directly resolving the broker crash scenario described in #797 when agents with duplicate names already exist.
Out of Scope Changes check ✅ Passed All changes are directly scoped to fixing the agent name collision issue: auth.rs implements reclamation logic, Cargo.toml adds urlencoding dependency, and tests validate the fix.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/issue-797-agent-name-collision

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/auth.rs (1)

780-811: 💤 Low value

Consider order of operations: rotate completes before GET.

The function rotates the token first (line 786-790), then fetches the agent record (line 794-802). If the GET fails after a successful rotate, the old token is invalidated but the new token isn't returned to the caller. This leaves the agent in a state where:

  • The broker startup fails
  • The old token is invalid
  • Any retry will rotate again (recoverable, but wasteful)

This is likely acceptable since startup failures are retriable, but consider swapping the order: GET first to confirm agent exists, then rotate. This would avoid unnecessary token churn on transient GET failures.

♻️ Optional: Reorder to GET before rotate
 async fn reclaim_agent_via_rotate_token(
     workspace_key: &str,
     base_url: &str,
     name: &str,
 ) -> Result<(String, String, String, Option<String>)> {
-    let relay = build_relay_client(workspace_key, base_url)?;
-    let token_response = relay
-        .rotate_agent_token(name)
-        .await
-        .map_err(relay_error_to_anyhow)
-        .with_context(|| format!("failed to rotate token for existing agent '{}'", name))?;
-
     let http = HttpClient::new(ClientOptions::new(workspace_key).with_base_url(base_url))
         .map_err(|e| anyhow::anyhow!("failed to build http client: {e}"))?;
     let agent_record = http
         .get::<Value>(
             &format!("/v1/agents/{}", urlencoding::encode(name)),
             None,
             None,
         )
         .await
         .map_err(relay_error_to_anyhow)
         .with_context(|| format!("failed to fetch existing agent '{}'", name))?;

     let agent_id = agent_record
         .get("id")
         .and_then(Value::as_str)
         .map(str::to_string)
         .with_context(|| format!("agent '{}' record missing 'id' field", name))?;

+    let relay = build_relay_client(workspace_key, base_url)?;
+    let token_response = relay
+        .rotate_agent_token(name)
+        .await
+        .map_err(relay_error_to_anyhow)
+        .with_context(|| format!("failed to rotate token for existing agent '{}'", name))?;
+
     Ok((agent_id, token_response.name, token_response.token, None))
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/auth.rs` around lines 780 - 811, The rotate-first flow in
reclaim_agent_via_rotate_token can invalidate the old token before confirming
the agent exists; swap the operations so the GET (the HttpClient.get call that
produces agent_record) runs before calling relay.rotate_agent_token
(rotate_agent_token) so you confirm the agent and its id (agent_id from
agent_record.get("id")) succeeds prior to rotating; adjust error
handling/with_context messages accordingly so if the GET fails you return early
without rotating and only call rotate_agent_token when the agent record is
present, then return (agent_id, token_response.name, token_response.token,
None).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/auth.rs`:
- Around line 780-811: The rotate-first flow in reclaim_agent_via_rotate_token
can invalidate the old token before confirming the agent exists; swap the
operations so the GET (the HttpClient.get call that produces agent_record) runs
before calling relay.rotate_agent_token (rotate_agent_token) so you confirm the
agent and its id (agent_id from agent_record.get("id")) succeeds prior to
rotating; adjust error handling/with_context messages accordingly so if the GET
fails you return early without rotating and only call rotate_agent_token when
the agent record is present, then return (agent_id, token_response.name,
token_response.token, None).

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 03c45463-48be-418e-94ba-0c536c97e924

📥 Commits

Reviewing files that changed from the base of the PR and between 272cc13 and 7715cfb.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • Cargo.toml
  • src/auth.rs

@khaliqgant khaliqgant merged commit b9060a4 into main May 10, 2026
47 checks passed
@khaliqgant khaliqgant deleted the fix/issue-797-agent-name-collision branch May 10, 2026 05:05
@willwashburn
Copy link
Copy Markdown
Member

Why not relaycast::RelayCast::register_or_get_agent?
Its internal get_agent call deserializes into the strict Agent struct, which requires workspace_id, token_hash, and created_at — fields the live cloud's GET /v1/agents/{name} does not return. The decode fails with HTTP error: error decoding response body. We instead call rotate_agent_token (whose TokenRotateResponse decodes fine) and use a raw HttpClient::get::<serde_json::Value> to read just the id field — every other field is non-essential for startup.

@khaliqgant I'm not sure I agreee with this reasoning. we should fix this upstream in the relaycast sdk

@khaliqgant
Copy link
Copy Markdown
Member

Why not relaycast::RelayCast::register_or_get_agent?
Its internal get_agent call deserializes into the strict Agent struct, which requires workspace_id, token_hash, and created_at — fields the live cloud's GET /v1/agents/{name} does not return. The decode fails with HTTP error: error decoding response body. We instead call rotate_agent_token (whose TokenRotateResponse decodes fine) and use a raw HttpClient::get::<serde_json::Value> to read just the id field — every other field is non-essential for startup.

@khaliqgant I'm not sure I agreee with this reasoning. we should fix this upstream in the relaycast sdk

@willwashburn #835

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(v6.0.2): shared RELAY_API_KEY across machines — broker exits at WebSocket subscription with 'Unable to connect'

3 participants