fix(broker): reclaim agent on 409 instead of crashing the broker (#797) #830
When `agent-relay up --workspace-key …` is run on a workspace where an
agent with the cwd-derived name already exists (the same broker
restarting, or a second machine joining the workspace whose prior
offline record is still present cloud-side), `register_agent_with_workspace_key`
called `relay.register_agent` and bailed on the 409 when `strict_name=true`. The
broker then exited during `connect_relay()`; the SDK's `getSession()` saw a
killed broker and reported Bun's misleading "Unable to connect. Is the
computer able to access the url?" fetch error.
Reproduce on Ubuntu aarch64 in Docker by running two `agent-relay up
--workspace-key …` invocations against the same workspace from cwds with
the same basename — the second one dies during the initial handshake.
v6.0.13's better launcher diagnostics surface the real cause as
"agent name 'X' already exists".
Fix: on a 409 with `strict_name`, reclaim the existing agent. We can't use
the relaycast crate's `register_or_get_agent` helper because its internal
`get_agent` deserializes into a strict `Agent` struct that doesn't match
the live cloud's `GET /v1/agents/{name}` response (no `workspace_id` /
`token_hash` / `created_at`). Instead we call `rotate_agent_token` directly to
get a fresh bearer token, then GET the agent record as `serde_json::Value`
to pluck the existing `id`. The non-strict path keeps its single suffix
retry for ephemeral callers.
End-to-end verified in Docker: three back-to-back attempts on the same
workspace with the same name now all start successfully, reusing the
agent_id and rotating the token each time.
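The overall 409 handling described above reduces to a dependency-free control-flow sketch. The names `RegisterOutcome`, `RegisterError`, and `register_or_reclaim` are invented for this illustration and are not the crate's API:

```rust
#[derive(Debug, PartialEq)]
enum RegisterOutcome {
    Created(String),   // fresh registration succeeded
    Reclaimed(String), // 409 hit; existing agent reclaimed
}

#[derive(Debug)]
enum RegisterError {
    Conflict, // stand-in for an HTTP 409 name conflict
    Other(String),
}

fn register_or_reclaim(
    register: impl Fn() -> Result<String, RegisterError>,
    reclaim: impl Fn() -> Result<String, RegisterError>,
    strict_name: bool,
) -> Result<RegisterOutcome, RegisterError> {
    match register() {
        Ok(id) => Ok(RegisterOutcome::Created(id)),
        // Previously a 409 with strict_name=true was fatal; now we reclaim.
        // (The real non-strict path instead retries once with a suffix — elided here.)
        Err(RegisterError::Conflict) if strict_name => reclaim().map(RegisterOutcome::Reclaimed),
        Err(e) => Err(e),
    }
}

fn main() {
    // Registration hits a 409 name conflict; with strict naming we reclaim.
    let outcome = register_or_reclaim(
        || Err(RegisterError::Conflict),
        || Ok("agt_existing".to_string()),
        true,
    );
    assert_eq!(
        outcome.unwrap(),
        RegisterOutcome::Reclaimed("agt_existing".to_string())
    );
}
```

In the real code the `reclaim` step is the rotate-token-plus-lenient-GET helper; the sketch only shows where it slots into the error handling.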
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
📝 Walkthrough

This PR enables agent reclamation when a broker with strict naming attempts to register but encounters a 409 name conflict. Instead of failing, it now rotates the existing agent's token and refetches the agent record to recover the agent ID, allowing multi-machine workspace sharing via shared API keys.

Changes: Agent Reclamation on Name Conflict
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client as AuthClient
    participant API as Relay API
    participant Store as Agent Store
    Client->>API: POST /v1/agents/{name}
    API-->>Client: 409 Conflict (name exists)
    Client->>API: POST /v1/agents/{name}/rotate-token
    API-->>Store: rotate token
    Store-->>API: new token
    API-->>Client: rotated token
    Client->>API: GET /v1/agents/{name}
    API-->>Client: agent record {id, ...}
    Client->>Client: reclaim with agent_id + rotated token
```
🧹 Nitpick comments (1)
src/auth.rs (1)
Lines 780-811: 💤 Low value — Consider order of operations: rotate completes before GET.
The function rotates the token first (line 786-790), then fetches the agent record (line 794-802). If the GET fails after a successful rotate, the old token is invalidated but the new token isn't returned to the caller. This leaves the agent in a state where:
- The broker startup fails
- The old token is invalid
- Any retry will rotate again (recoverable, but wasteful)
This is likely acceptable since startup failures are retriable, but consider swapping the order: GET first to confirm agent exists, then rotate. This would avoid unnecessary token churn on transient GET failures.
♻️ Optional: Reorder to GET before rotate
```diff
 async fn reclaim_agent_via_rotate_token(
     workspace_key: &str,
     base_url: &str,
     name: &str,
 ) -> Result<(String, String, String, Option<String>)> {
-    let relay = build_relay_client(workspace_key, base_url)?;
-    let token_response = relay
-        .rotate_agent_token(name)
-        .await
-        .map_err(relay_error_to_anyhow)
-        .with_context(|| format!("failed to rotate token for existing agent '{}'", name))?;
-
     let http = HttpClient::new(ClientOptions::new(workspace_key).with_base_url(base_url))
         .map_err(|e| anyhow::anyhow!("failed to build http client: {e}"))?;
     let agent_record = http
         .get::<Value>(
             &format!("/v1/agents/{}", urlencoding::encode(name)),
             None,
             None,
         )
         .await
         .map_err(relay_error_to_anyhow)
         .with_context(|| format!("failed to fetch existing agent '{}'", name))?;
     let agent_id = agent_record
         .get("id")
         .and_then(Value::as_str)
         .map(str::to_string)
         .with_context(|| format!("agent '{}' record missing 'id' field", name))?;
+    let relay = build_relay_client(workspace_key, base_url)?;
+    let token_response = relay
+        .rotate_agent_token(name)
+        .await
+        .map_err(relay_error_to_anyhow)
+        .with_context(|| format!("failed to rotate token for existing agent '{}'", name))?;
+
     Ok((agent_id, token_response.name, token_response.token, None))
 }
```
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro Plus
Run ID: 03c45463-48be-418e-94ba-0c536c97e924
⛔ Files ignored due to path filters (1)
- `Cargo.lock` is excluded by `!**/*.lock`

📒 Files selected for processing (2)
- `Cargo.toml`
- `src/auth.rs`
@khaliqgant I'm not sure I agree with this reasoning. We should fix this upstream in the relaycast SDK.
Summary
Fixes #797. When `agent-relay up --workspace-key …` is run against a workspace where an agent with the cwd-derived name already exists — same broker restarting, or a second machine joining the workspace whose prior offline record is still in the cloud — `register_agent_with_workspace_key` was calling `relay.register_agent` and bailing on 409 with `strict_name=true`. The broker exited during `connect_relay()`; the SDK's in-flight `getSession()` saw a dying broker and surfaced Bun's misleading "Unable to connect. Is the computer able to access the url?" error, which is what the issue reporter saw.

On 409 we now reclaim the existing agent: rotate its bearer token, then fetch the agent record to pluck the existing `id`. The non-strict path keeps its single suffix retry for ephemeral callers.

Reproduction
Reliable repro on Ubuntu aarch64 in Docker (matches the reporter's VPS): two back-to-back `agent-relay up --workspace-key …` invocations against the same workspace from cwds with the same basename. The second crashes during the initial handshake. v6.0.13's better launcher diagnostics surface the real cause as `agent name 'X' already exists`.

After the fix: three back-to-back invocations all start cleanly, all reuse the same `agent_id`, each rotates to a fresh `at_live_*` token.

Why not `relaycast::RelayCast::register_or_get_agent`?

Its internal `get_agent` call deserializes into the strict `Agent` struct, which requires `workspace_id`, `token_hash`, and `created_at` — fields the live cloud's `GET /v1/agents/{name}` does not return. The decode fails with `HTTP error: error decoding response body`. We instead call `rotate_agent_token` (whose `TokenRotateResponse` decodes fine) and use a raw `HttpClient::get::<serde_json::Value>` to read just the `id` field — every other field is non-essential for startup.

Changes
- `src/auth.rs` — replace fatal `bail!` on 409 (strict_name) with `reclaim_agent_via_rotate_token`. New helper does rotate-token + lenient GET.
- `Cargo.toml` — add `urlencoding = "2.1"` for safe agent-name encoding in the GET path.
- `strict_name_returns_conflict_error` removed (it asserted the broken behavior); replaced with `strict_name_conflict_reclaims_via_rotate_token`. The mock GET payload deliberately omits the cloud-missing fields, so the test locks in the lenient deserialization.

Test plan
- `cargo test --lib auth::` — 14/14 pass
- `cargo test --lib` — 232/232 pass
- `cargo clippy --bin agent-relay-broker --lib` — clean
- `agent-relay up --workspace-key` with the same name against the same live workspace, in Ubuntu aarch64 Docker: all start, all reuse the same `agent_id`, all rotate the token. The same scenario crashed pre-fix.

🤖 Generated with Claude Code