fix(broker): reclaim agent on 409 instead of crashing the broker (#797) by miyaontherelay · Pull Request #830 · AgentWorkforce/relay

miyaontherelay · 2026-05-09T19:00:22Z

Summary

Fixes #797. When agent-relay up --workspace-key … is run against a workspace where an agent with the cwd-derived name already exists — same broker restarting, or a second machine joining the workspace whose prior offline record is still in the cloud — register_agent_with_workspace_key was calling relay.register_agent and bailing on 409 with strict_name=true. The broker exited during connect_relay(); the SDK's in-flight getSession() saw a dying broker and surfaced Bun's misleading "Unable to connect. Is the computer able to access the url?" error, which is what the issue reporter saw.

On 409 we now reclaim the existing agent: rotate its bearer token, then fetch the agent record to pluck the existing id. The non-strict path keeps its single suffix retry for ephemeral callers.

Reproduction

Reliable repro on Ubuntu aarch64 in Docker (matches the reporter's VPS): two back-to-back agent-relay up --workspace-key … invocations against the same workspace from cwds with the same basename. The second crashes during the initial handshake. v6.0.13's better launcher diagnostics surface the real cause as agent name 'X' already exists.

After the fix: three back-to-back invocations all start cleanly, all reuse the same agent_id, each rotates to a fresh at_live_* token.

Why not `relaycast::RelayCast::register_or_get_agent`?

Its internal get_agent call deserializes into the strict Agent struct, which requires workspace_id, token_hash, and created_at — fields the live cloud's GET /v1/agents/{name} does not return. The decode fails with HTTP error: error decoding response body. We instead call rotate_agent_token (whose TokenRotateResponse decodes fine) and use a raw HttpClient::get::<serde_json::Value> to read just the id field — every other field is non-essential for startup.

Changes

src/auth.rs — replace fatal bail! on 409 (strict_name) with reclaim_agent_via_rotate_token. New helper does rotate-token + lenient GET.
Cargo.toml — add urlencoding = "2.1" for safe agent-name encoding in the GET path.
Test strict_name_returns_conflict_error removed (asserted the broken behavior); replaced with strict_name_conflict_reclaims_via_rotate_token. Mock GET payload deliberately omits the cloud-missing fields, so the test locks in the lenient deserialization.

Test plan

cargo test --lib auth:: — 14/14 pass
cargo test --lib — 232/232 pass
cargo clippy --bin agent-relay-broker --lib — clean
End-to-end: 3× back-to-back agent-relay up --workspace-key with same name against same live workspace, in Ubuntu aarch64 Docker, all start, all reuse the same agent_id, all rotate token. Same scenario crashed pre-fix.
Reviewer to validate against another live workspace if desired

🤖 Generated with Claude Code

When `agent-relay up --workspace-key …` is run on a workspace where an agent with the cwd-derived name already exists (the same broker restarting, or a second machine joining the workspace and its prior offline record still present cloud-side), `register_agent_with_workspace_key` called `relay.register_agent` and bailed on 409 with strict_name=true. The broker then exited during `connect_relay()`; the SDK's getSession() saw a killed broker and reported the misleading "Unable to connect. Is the computer able to access the url?" Bun fetch error. Reproduce on Ubuntu aarch64 in Docker by running two `agent-relay up --workspace-key …` invocations against the same workspace from cwds with the same basename — the second one dies during the initial handshake. v6.0.13's better launcher diagnostics surface the real cause as "agent name 'X' already exists". Fix: on 409 with strict_name, reclaim the existing agent. We can't use the relaycast crate's `register_or_get_agent` helper because its internal `get_agent` deserializes into a strict `Agent` struct that doesn't match the live cloud's GET /v1/agents/{name} response (no workspace_id / token_hash / created_at). Instead we call rotate_agent_token directly to get a fresh bearer token, then GET the agent record as serde_json::Value to pluck the existing id. The non-strict path keeps its single suffix retry for ephemeral callers. End-to-end verified in Docker: three back-to-back attempts on the same workspace with the same name now all start successfully, reusing the agent_id and rotating the token each time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

coderabbitai · 2026-05-09T19:00:34Z

📝 Walkthrough

Walkthrough

This PR enables agent reclamation when a broker with strict naming attempts to register but encounters a 409 name conflict. Instead of failing, it now rotates the existing agent's token and refetches the agent record to recover the agent ID, allowing multi-machine workspace sharing via shared API keys.

Changes

Agent Reclamation on Name Conflict

Layer / File(s)	Summary
Dependencies `Cargo.toml`	Added `urlencoding = "2.1"` for URL encoding of agent names in HTTP requests.
Imports & HTTP Client Types `src/auth.rs`	Updated `relaycast` imports to include `ClientOptions` and `HttpClient` for HTTP-based agent metadata operations.
Conflict Reclamation & Token Rotation `src/auth.rs`	When `strict_name` is true and agent registration returns 409 conflict, routing now calls `reclaim_agent_via_rotate_token` instead of failing. New helper rotates agent token via `POST /v1/agents/{name}/rotate-token`, fetches agent record via `GET /v1/agents/{name}` as generic JSON, and returns reclaimed `(agent_id, agent_name, token, None)` tuple.
Tests & Validation `src/auth.rs`	Replaced strict-name conflict test; new `strict_name_conflict_reclaims_via_rotate_token` mocks token rotation and agent fetch endpoints, asserts startup succeeds, and validates returned token and agent ID match the reclaimed agent.

Sequence Diagram

sequenceDiagram
  participant Client as AuthClient
  participant API as Relay API
  participant Store as Agent Store
  Client->>API: POST /v1/agents/{name}
  API-->>Client: 409 Conflict (name exists)
  Client->>API: POST /v1/agents/{name}/rotate-token
  API-->>Store: rotate token
  Store-->>API: new token
  API-->>Client: rotated token
  Client->>API: GET /v1/agents/{name}
  API-->>Client: agent record {id, ...}
  Client->>Client: reclaim with agent_id + rotated token

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

willwashburn
khaliqgant

Poem

🐰 A conflict arose, but the relay stands tall,
Rotating the token to reclaim it all,
When names collide across distant machines,
One agent persists through the agent's routines—
Now workspaces unite, no more shall we stall! 🚀

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	Title clearly summarizes the main change: fixing a broker crash by reclaiming agents on HTTP 409 instead of failing, which directly addresses the issue referenced.
Description check	✅ Passed	Description comprehensively covers the problem, solution, reproduction steps, code changes, and a detailed test plan with both unit and end-to-end validation.
Linked Issues check	✅ Passed	Code changes in auth.rs implement agent reclamation via token rotation and lenient GET, directly resolving the broker crash scenario described in `#797` when agents with duplicate names already exist.
Out of Scope Changes check	✅ Passed	All changes are directly scoped to fixing the agent name collision issue: auth.rs implements reclamation logic, Cargo.toml adds urlencoding dependency, and tests validate the fix.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch fix/issue-797-agent-name-collision

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

devin-ai-integration

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

coderabbitai

🧹 Nitpick comments (1)

src/auth.rs (1)

780-811: 💤 Low value

Consider order of operations: rotate completes before GET.

The function rotates the token first (line 786-790), then fetches the agent record (line 794-802). If the GET fails after a successful rotate, the old token is invalidated but the new token isn't returned to the caller. This leaves the agent in a state where:

The broker startup fails
The old token is invalid
Any retry will rotate again (recoverable, but wasteful)

This is likely acceptable since startup failures are retriable, but consider swapping the order: GET first to confirm agent exists, then rotate. This would avoid unnecessary token churn on transient GET failures.

♻️ Optional: Reorder to GET before rotate

 async fn reclaim_agent_via_rotate_token(
     workspace_key: &str,
     base_url: &str,
     name: &str,
 ) -> Result<(String, String, String, Option<String>)> {
-    let relay = build_relay_client(workspace_key, base_url)?;
-    let token_response = relay
-        .rotate_agent_token(name)
-        .await
-        .map_err(relay_error_to_anyhow)
-        .with_context(|| format!("failed to rotate token for existing agent '{}'", name))?;
-
     let http = HttpClient::new(ClientOptions::new(workspace_key).with_base_url(base_url))
         .map_err(|e| anyhow::anyhow!("failed to build http client: {e}"))?;
     let agent_record = http
         .get::<Value>(
             &format!("/v1/agents/{}", urlencoding::encode(name)),
             None,
             None,
         )
         .await
         .map_err(relay_error_to_anyhow)
         .with_context(|| format!("failed to fetch existing agent '{}'", name))?;

     let agent_id = agent_record
         .get("id")
         .and_then(Value::as_str)
         .map(str::to_string)
         .with_context(|| format!("agent '{}' record missing 'id' field", name))?;

+    let relay = build_relay_client(workspace_key, base_url)?;
+    let token_response = relay
+        .rotate_agent_token(name)
+        .await
+        .map_err(relay_error_to_anyhow)
+        .with_context(|| format!("failed to rotate token for existing agent '{}'", name))?;
+
     Ok((agent_id, token_response.name, token_response.token, None))
 }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/auth.rs` around lines 780 - 811, The rotate-first flow in
reclaim_agent_via_rotate_token can invalidate the old token before confirming
the agent exists; swap the operations so the GET (the HttpClient.get call that
produces agent_record) runs before calling relay.rotate_agent_token
(rotate_agent_token) so you confirm the agent and its id (agent_id from
agent_record.get("id")) succeeds prior to rotating; adjust error
handling/with_context messages accordingly so if the GET fails you return early
without rotating and only call rotate_agent_token when the agent record is
present, then return (agent_id, token_response.name, token_response.token,
None).

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/auth.rs`:
- Around line 780-811: The rotate-first flow in reclaim_agent_via_rotate_token
can invalidate the old token before confirming the agent exists; swap the
operations so the GET (the HttpClient.get call that produces agent_record) runs
before calling relay.rotate_agent_token (rotate_agent_token) so you confirm the
agent and its id (agent_id from agent_record.get("id")) succeeds prior to
rotating; adjust error handling/with_context messages accordingly so if the GET
fails you return early without rotating and only call rotate_agent_token when
the agent record is present, then return (agent_id, token_response.name,
token_response.token, None).

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 03c45463-48be-418e-94ba-0c536c97e924

📥 Commits

Reviewing files that changed from the base of the PR and between 272cc13 and 7715cfb.

⛔ Files ignored due to path filters (1)

Cargo.lock is excluded by !**/*.lock

📒 Files selected for processing (2)

Cargo.toml
src/auth.rs

willwashburn · 2026-05-10T06:07:14Z

Why not relaycast::RelayCast::register_or_get_agent?
Its internal get_agent call deserializes into the strict Agent struct, which requires workspace_id, token_hash, and created_at — fields the live cloud's GET /v1/agents/{name} does not return. The decode fails with HTTP error: error decoding response body. We instead call rotate_agent_token (whose TokenRotateResponse decodes fine) and use a raw HttpClient::get::<serde_json::Value> to read just the id field — every other field is non-essential for startup.

@khaliqgant I'm not sure I agreee with this reasoning. we should fix this upstream in the relaycast sdk

khaliqgant · 2026-05-10T18:59:05Z

Why not relaycast::RelayCast::register_or_get_agent?
Its internal get_agent call deserializes into the strict Agent struct, which requires workspace_id, token_hash, and created_at — fields the live cloud's GET /v1/agents/{name} does not return. The decode fails with HTTP error: error decoding response body. We instead call rotate_agent_token (whose TokenRotateResponse decodes fine) and use a raw HttpClient::get::<serde_json::Value> to read just the id field — every other field is non-essential for startup.

@khaliqgant I'm not sure I agreee with this reasoning. we should fix this upstream in the relaycast sdk

@willwashburn #835

miyaontherelay requested review from khaliqgant and willwashburn as code owners May 9, 2026 19:00

devin-ai-integration Bot reviewed May 9, 2026

View reviewed changes

coderabbitai Bot reviewed May 9, 2026

View reviewed changes

willwashburn approved these changes May 9, 2026

View reviewed changes

khaliqgant merged commit b9060a4 into main May 10, 2026
47 checks passed

khaliqgant deleted the fix/issue-797-agent-name-collision branch May 10, 2026 05:05

willwashburn mentioned this pull request May 10, 2026

bug(v6.0.2): shared RELAY_API_KEY across machines — broker exits at WebSocket subscription with 'Unable to connect' #797

Closed

khaliqgant mentioned this pull request May 10, 2026

fix(sdk-rust): tolerate public agent payloads AgentWorkforce/relaycast#127

Merged

9 tasks

coderabbitai Bot mentioned this pull request May 10, 2026

Use published relaycast reclaim helper #835

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(broker): reclaim agent on 409 instead of crashing the broker (#797)#830

fix(broker): reclaim agent on 409 instead of crashing the broker (#797)#830
khaliqgant merged 1 commit into
mainfrom
fix/issue-797-agent-name-collision

miyaontherelay commented May 9, 2026

Uh oh!

coderabbitai Bot commented May 9, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

devin-ai-integration Bot left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

willwashburn commented May 10, 2026

Uh oh!

khaliqgant commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

miyaontherelay commented May 9, 2026

Summary

Reproduction

Why not relaycast::RelayCast::register_or_get_agent?

Changes

Test plan

Uh oh!

coderabbitai Bot commented May 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

✅ Devin Review: No Issues Found

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

willwashburn commented May 10, 2026

Uh oh!

khaliqgant commented May 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Why not `relaycast::RelayCast::register_or_get_agent`?

coderabbitai Bot commented May 9, 2026 •

edited

Loading