feat(registry): paginate discovery/listDevices to bound NATS payloads by atsyplikhin · Pull Request #38 · arm/device-connect

atsyplikhin · 2026-05-21T00:32:53Z

Summary

This PR addresses a class of fleet-scale failures we hit at ~1400 phones
on a single registry. The headline fix is server-side pagination of
discovery/listDevices, but the same outage also exposed several
adjacent bottlenecks (a registration thundering-herd, an undersized
etcd HTTP pool, an unbounded portal NATS reconnect path, a dashboard
that fetched every device's detail block on every poll). Each is fixed
in its own commit; they're shipped together because they were the same
incident response and they interact (e.g. pagination is much less
valuable if the registry still hangs on a registration herd).

Splitting was considered and rejected as churn — each change is small
and they share tests/fixtures. Reviewers should still feel free to
critique any commit individually.

Discovery pagination (the headline)

Problem

At ~1400 devices the JSON-encoded fleet snapshot exceeds NATS's default
1 MB max_payload and any agent-tools call that resolves a
selector (discover_labels, discover('device(category:phone)'),
broadcast, even narrow selectors) fails with
nats: maximum payload exceeded. The registry loads the entire tenant
from etcd before any filter is applied, so even fine-grained selectors
hit the wall.

Wire contract (additive, backward compatible)

Request params for discovery/listDevices gain optional
offset (int, default 0) and limit (int).
Reply gains next_offset (int | null) and total_matched (int)
alongside the existing devices array — only when the caller
opted in by passing limit.

Compatibility matrix

Client behavior	Server behavior
Old client, no `limit`	Server takes the legacy unpaged path — returns the full filtered fleet in one reply. No `next_offset`/`total_matched` fields. Small fleets keep working untouched; large fleets fail loudly with the original `max_payload exceeded` error rather than silently truncating. Operators hitting the ceiling must upgrade clients.
New client, passes `limit`	Server paginates, silently clamps the page to `_LIST_DEVICES_MAX_LIMIT` (env `DC_LIST_DEVICES_MAX_LIMIT`, default 200), returns `next_offset`/`total_matched`. A client asking for `limit=10000` gets the first 200 plus a forward-pointing `next_offset` — no error, no truncation; the client just paginates more aggressively than it expected. Client loops until `next_offset is None`.
New client, old server	Server ignores unknown params and returns the legacy single-reply shape. Client sees `next_offset` absent, exits the loop after one request — forward-compatible.
Malformed `offset` / `limit`	Server returns a clean JSON-RPC `-32602` error instead of a 500.

Server

DeviceRegistry.list_devices_page(tenant, *, device_type, location, offset, limit) returns (devices_page, next_offset, total_matched). The existing list_devices(...) is untouched and remains the legacy path. Stability and ACL caveats documented in the docstring.
discovery/listDevices handler routes to the paged variant only when limit is in params; otherwise keeps the legacy reply shape. Both offset and limit are validated up front.

Clients (edge + server `RegistryClient`)

RegistryClient.list_devices() loops 100-device pages internally. External signature unchanged; callers keep getting a flat List[Dict]. Page size tunable via DEVICE_CONNECT_LIST_PAGE_SIZE.
New RegistryClient.list_devices_page(...) returns (devices, next_offset, total_matched) for callers that want explicit paging (e.g. a UI). Docstrings call out the ACL caveat: total_matched is pre-ACL and pages may be shorter than limit for restricted callers.

Defense in depth

security_infra/setup_deployment.sh bakes max_payload: 8MB into the regenerated NATS config so operators rerunning the script raise the broker ceiling alongside the protocol-level fix.

Other bundled fixes (same incident)

1. Registration thundering herd (`device.py`)

Problem: 1000+ phones spinning up in lockstep hit a 2s request
timeout, every late reply triggered a retry, and the queue collapsed
into a sustained storm.
Fix: registration request timeout now DEVICE_CONNECT_REGISTER_TIMEOUT
(default 15s, was 2s) and a one-shot pre-registration jitter window
DEVICE_CONNECT_REGISTER_JITTER (default 2s, set to 0 to disable;
bump to 10s+ for fleet scale) decorrelates the initial herd. 2s still
spreads ~1000 lockstep devices into roughly 500/sec while staying
tolerable for single-device dev. Subsequent retries keep their existing
exponential backoff.

2. etcd HTTP pool exhaustion (`registry/service/registry.py`)

Problem: etcd3gw's default urllib3 pool of 10 sockets bottlenecked
the registry under the herd — every lease+put is two sequential HTTP
calls, so thousands of requests queued behind 10 sockets.
Fix: new _enlarge_etcd_pool() mounts a larger HTTPAdapter
(DC_ETCD_POOL_SIZE, default 64) onto the etcd3gw client's
underlying requests.Session. Implemented via post-construction mount
(not etcd3gw.client(session=...)) so it works on 2.5.x and 2.6+.
Logs a warning if etcd3gw ever stops exposing client.session — the
fix would otherwise silently regress.

3. Portal NATS reconnect storm (`portal/services/nats_rpc.py`)

Problem: every "Run" click on the dashboard opened a fresh NATS
connection (TCP + JWT-auth handshake) per RPC.
Fix: long-lived cached client reused across invoke() calls. The
NATS client is concurrent-safe (nc.request allocates its own inbox
sub) so one shared client serves the whole portal. On hard transport
failure the cached client is dropped and best-effort closed before
the next call reconnects. The module-level asyncio.Lock is created
lazily inside the first awaiter so it binds to the live event loop,
not import-time.

4. Dashboard fan-out (`portal/views/dashboard.py`, templates)

Problem: the live-devices fragment rendered every device's full
detail panel (functions, events, raw JSON) on every 3s htmx poll,
turning the dashboard into a fleet-sized data fan-out.
Fix:

Detail markup moved to a new lazy endpoint
/api/devices/{device_id}/live-detail, fetched on row expand. The
main poll now returns only summary rows.
etcd get_prefix and credential listing moved off the event loop
via asyncio.to_thread so other portal requests don't queue behind
the poll.
Header counters (creds/online/registered) refresh in place via
hx-swap-oob. Previously they froze at first-render values and
showed "1401 online" long after the swarm shut down.
Poll cadence eased from 3s -> 10s.
revealed-based lazy loading was rejected: htmx's revealed
implementation calls getBoundingClientRect, and display:none
rows return a zero-rect at (0,0) which falsely registers as
in-viewport — every hidden row would fetch on page load.

5. `@on` handler collection (`drivers/base.py`)

Problem: _collect_event_subscriptions skipped every single-
underscore-prefixed method, so drivers that wrote @on async def _on_phone_state(...) (Python's conventional "this is private") had
their event handlers silently dropped.
Fix: scan everything except dunders. The _on_meta attribute
gate filters out unrelated private attributes; new test covers it.

6. `@on(device_type=...)` in registry/portal mode (`drivers/base.py`)

Problem: device_type filtering relies on the D2D peer cache. In
portal/registry mode there is no peer cache, so the filter silently
passed every device's matching event through.
Fix: warn once at subscription setup so subscribers don't silently
get cross-device events. Strict filtering can be added in-handler.

7. Portal RPC timing logs (`portal/views/dashboard.py`)

Added structured timing logs around backend.rpc_invoke so the next
slow-dashboard incident has data on whether the latency is in the
portal handler, the pre-RPC setup, or the device round-trip.

Tests

New

test_registry_service.py — 6 tests in TestListDevicesPage:
slice + next_offset, final page sets next_offset=None, offset
past end returns empty, limit=None returns the remainder, a full
1400-device round trip, and filter-before-pagination.
test_rpc_handlers.py — handler-level coverage rewritten around the
dual-path behavior. New regression tests:
- test_list_devices_legacy_no_limit_returns_unbounded — old-client
  path returns full fleet and emits no pagination metadata.
- test_list_devices_with_limit_paginates — new-client path emits
  next_offset/total_matched.
- test_list_devices_paged_acl_total_is_unfiltered — documents and
  locks in the ACL × pagination caveat (page shorter than limit,
  total_matched reflects pre-ACL count).
- test_list_devices_invalid_offset_returns_error /
  test_list_devices_negative_offset_returns_error /
  test_list_devices_invalid_limit_returns_error — malformed
  pagination params surface as JSON-RPC -32602, not 500s.
test_registry_client.py (edge) — 5 tests in
TestListDevicesPagination: full-fleet paging across 14 round
trips, wire-level offset/limit round-trip, legacy-server
single-reply fallback, explicit list_devices_page() metadata
exposure, filter forwarding.
test_drivers.py — test_underscore_prefixed_handler_is_still_collected
for the _collect_event_subscriptions fix.

Regressions

All existing registry / edge / agent-tools tests remain green; the only
test-file edits are in test_rpc_handlers.py to reflect the new
dual-path behavior of the handler.

Out of scope (deliberately)

Selector pushdown to the registry / etcd indexes. That would let
the server filter by category:phone etc. before paginating and
reduce wire traffic further, but is a much bigger change.
Pagination alone is enough to keep individual replies under
max_payload regardless of fleet size.
Keyset pagination. Offset pagination has known stability gaps
under concurrent registrations (documented in the
list_devices_page docstring); a device_id-cursor version is the
next iteration.
Streaming replies.

Test plan

pytest packages/device-connect-server/tests/device_connect_server/test_registry_service.py -v — 43 passed
pytest packages/device-connect-server/tests/device_connect_server/test_rpc_handlers.py -v — 41 passed
pytest packages/device-connect-edge/tests/test_registry_client.py -v — passes
pytest packages/device-connect-edge/tests/test_drivers.py -v — passes
pytest packages/device-connect-agent-tools/tests/test_connection_unit.py tests/test_tools_unit.py -v — 65 passed (from previous run; no semantic change to that surface)
Integration test (Docker) on the 1400-device deployment that originally hit the cliff: re-run discover('device(category:phone)'), discover_labels(), and broadcast(...) end-to-end.

🤖 Generated with Claude Code

At fleet scale (~1400 devices) the JSON-encoded reply from `discovery/listDevices` crosses NATS's default 1 MB max_payload, breaking every agent-tools call that resolves a selector — `discover()`, `discover_labels()`, and `broadcast()` all fail because the registry loads the full fleet before any filter and tries to publish it in a single message. This change makes the reply size independent of fleet size: - New optional `offset` / `limit` params on `discovery/listDevices`; new reply fields `next_offset` and `total_matched`. Omitting `limit` preserves the legacy single-shot reply for older clients. - `DeviceRegistry.list_devices_page(...)` returns a slice plus metadata alongside the existing `list_devices(...)` (untouched). - Both `RegistryClient` implementations (edge + server-internal) loop internally over 100-device pages — externally unchanged, but no single reply ever carries the whole fleet. Page size is tunable via `DEVICE_CONNECT_LIST_PAGE_SIZE`. A new `list_devices_page()` exposes metadata to callers that want explicit paging. - Client backward compat: a server that doesn't return `next_offset` causes the page loop to exit after one request. - Defense-in-depth: `setup_deployment.sh` now bakes `max_payload: 8MB` into the NATS config it generates, so operators who rerun it preserve the ceiling. Tests: 6 new in test_registry_service.py (including a full 1400-device walk), 5 new in test_registry_client.py covering paging, legacy-server fallback, filter forwarding, and explicit-page metadata. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The constant was sitting between the stdlib and device_connect_edge import groups, which trips ruff's "module level import not at top of file" rule. Move it after the imports to match the edge client layout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ayload overflow At fleet scale (1400+ devices, ~13KB per record) the legacy unpaginated listDevices path would emit a single multi-megabyte reply that exceeded the NATS broker's max_payload, killing every concurrent caller. The new _LIST_DEVICES_MAX_LIMIT (default 200, env DC_LIST_DEVICES_MAX_LIMIT) clamps every reply regardless of what the caller passed, so an unpaginated or misconfigured client now gets the first page plus next_offset instead of an overflow error. The dead "legacy single-shot" else-branch is removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously the live-devices fragment rendered hidden detail markup (functions, events, identity, raw-JSON pre-block) for every device in every poll. At 1400 phones each poll returned ~18 MB on a 3-second interval, blocking the portal event loop and saturating client traffic. The fragment now ships only summary rows (~500 KB at 1400 devices) plus an htmx placeholder per row; the detail markup is fetched from a new GET /api/devices/{device_id}/live-detail endpoint when a row is expanded. The etcd prefix-scan is wrapped in asyncio.to_thread so it no longer blocks other portal requests. Poll cadence is relaxed from 3 s to 10 s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

At 1400+ phones spinning up in lockstep, every phone's _register hit the registry inside the same microsecond window. The 2 s request timeout guaranteed most callers timed out before the registry's etcd writes returned, every timeout fired a retry, and the retries re-entered the queue — classic congestion collapse, which left ~1200 phones invisible to the registry while still alive on the fabric. Edge SDK (device.py): - DEVICE_CONNECT_REGISTER_TIMEOUT (default 15 s, was hardcoded 2 s) lets the registry process a queued request before any retry fires. - DEVICE_CONNECT_REGISTER_JITTER (default 5 s) sleeps a uniform-random 0..jitter before the first request, decorrelating arrivals so the registry never sees a synchronized burst. Skipped when set to 0. Registry (registry.py): - DC_ETCD_POOL_SIZE (default 64) hands etcd3gw a requests.Session with an oversized urllib3 HTTPAdapter. The default pool of 10 was the actual ceiling on concurrent lease+put round-trips. End-to-end at 1400 phones: 0 registration failures, 0 publish-overflow errors, all phones in etcd within ~30 s of launch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…5.x; align listDevices tests with server-side pagination The previous change passed ``session=`` to ``etcd3gw.client(...)``, which only works on 2.6+; CI runs against 2.5.x and broke at module import. Mount the larger urllib3 pool onto the already-constructed ``client.session`` instead, which works on both versions. TestListDevicesHandler now mocks ``list_devices_page`` (the only path the handler takes after the server-side cap landed) and asserts the ``next_offset``/``total_matched`` shape in the reply. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@on

Three small clarity fixes uncovered while wiring an out-of-process camera driver to subscribe to phone state_changed events at fleet scale (1400 devices across 7 shards) against the DC Portal. 1. _collect_event_subscriptions no longer skips single-underscore method names. Previously an @on-decorated `_on_foo` handler was silently never registered (the attribute-name check at base.py:1058 dropped it before the `_is_event_subscription` marker check could even run). The collector now only skips dunders. Behavior is still gated by the `_is_event_subscription` marker, so the public-API surface is unchanged. New test: test_underscore_prefixed_handler_is_still_collected. 2. Renamed the two INFO log lines in _setup_agentic_driver that called the path "D2D capabilities" / "D2D setup complete" -- the same code path runs in both d2d and portal/registry mode, so the label was wrong half the time. The completion line now reports the actual registry kind (D2DRegistry vs RegistryClient). 3. @on(device_type=...) filtering is best-effort in portal/registry mode because the D2D peer cache used to resolve a source device's type is None. The handler's cache-miss path passes events through (false negatives are worse than false positives), but until now that limitation only surfaced via a per-event debug log. Emit a WARNING at subscription-setup time when device_type is set and there is no peer collector, so subscribers see the gotcha once instead of silently receiving events from other device types. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

invoke_device_rpc used to open and close a fresh NATS connection per dashboard "Run" click, paying a full TCP+JWT-auth handshake on every request. At fleet scale, that handshake is the dominant cost — the RPC itself is ~12 ms once a client is connected. nats_rpc.invoke() now holds a single long-lived client under an asyncio.Lock and reuses it across requests; transport errors drop the cache so the next call reconnects. The HTTP handler and the NATS layer both log handler/RPC timings so the source of any future latency is visible from docker logs alone: invoke alpha/phone-0001.get_position handler=11.5ms (pre-rpc=0.1ms rpc=11.5ms) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous lazy-load wired ``hx-trigger="revealed once"`` on every hidden detail ``<div>``. htmx's revealed implementation polls getBoundingClientRect, and ``display:none`` elements report a zero-size rect at (0,0) — which counts as "in the viewport." That meant the trigger fired for every hidden row immediately after the table fragment rendered: at 1400 phones, the browser fan-out queued ~1400 parallel fetches behind its 6-connection-per-origin pool, and a subsequent "Run" click on a row's RPC sat at the back of that queue for many seconds. Replace the htmx trigger with a plain fetch invoked from toggleDetail() and a ``_detailLoaded`` Set that dedupes re-expands. The detail URL is carried on the parent ``<tr>`` as ``data-detail-url`` so the JS doesn't have to reconstruct it. Net result: zero detail fetches at page load, one fetch on first expand per device, and "Run" feels instant again. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Credentials Created / Devices Online / Devices Registered cards were rendered once in dashboard_page and never updated. After the swarm shut down, the table emptied via its 10s htmx poll but the cards stayed at the last value seen during the initial page render — at fleet scale that produced an obviously-wrong "1401 online" while the table read "No devices registered yet." Piggyback on the existing live-devices poll using ``hx-swap-oob``: the fragment now appends three tiny ``<span hx-swap-oob="innerHTML">`` blobs that target ``#online-count``, ``#registered-count``, and ``#creds-count`` in the dashboard header. Counts and table swap in lockstep with no new endpoint and no extra request. The header span IDs are added in dashboard.html so the OOB swap has somewhere to land. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously the handler always called list_devices_page and clamped to _LIST_DEVICES_MAX_LIMIT (200), even for callers that omitted ``limit``. At fleet scale an old client would silently receive only the first 200 devices with no next_offset signal and act on a partial view. Now the handler branches on ``limit``: if absent, return the legacy unbounded reply via list_devices (no next_offset/total_matched leaks); if present, paginate as before. Old clients keep working on small fleets and fail loudly with max_payload exceeded at large scale, the same behavior they had before this branch. New clients opt into pagination by passing limit. Also validates ``offset`` (was implicit int() that 500'd on garbage) and rejects negative values; both bad offset and bad limit now return a clean JSON-RPC -32602 error. Tests rewritten to cover both paths and add regressions for the ACL x pagination caveat (total_matched is pre-ACL; pages may be shorter than limit after ACL). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two latent issues in the cached-NATS-client path: 1. The module-level asyncio.Lock() bound to whichever event loop was current at import. Tests (and any future code) that runs this module under a fresh loop would have hit a cross-loop usage error. The lock is now created lazily inside _get_invoke_lock() on first await. 2. On a hard transport failure the cached client was dropped to None but its sockets were not closed. The dead client could leak FDs until garbage-collected. Now we drop and best-effort await stale.close() in a new _drop_invoke_client() helper. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

list_devices_page docstrings (edge + server RegistryClient and the service-layer registry) now spell out the ACL caveat: ACL filtering runs after slicing, so total_matched is the pre-ACL count and pages may be shorter than limit for ACL-restricted callers. UIs that consume these values must not assume len(devices) == limit means a full page. The service-layer docstring also documents the offset-pagination stability gap (concurrent registrations can shift records across pages mid-walk) and points at keyset pagination as the next step. _enlarge_etcd_pool now logs a warning when etcd3gw stops exposing client.session, so a future internal refactor doesn't silently revert the pool back to the urllib3 default of 10 and reintroduce the registration-storm bottleneck without any operator signal. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- Reject `limit <= 0` on discovery/listDevices with JSON-RPC -32602 rather than silently mapping it to the server cap (surprising and masked client bugs that passed unintentional zero/negative values). - Narrow the portal `invoke()` exception handler so only transport errors (ConnectionClosed, Stale, NoServers, OSError, ...) drop the cached NATS client. ProtocolError, BadSubject, MaxPayload etc. are payload/programmer bugs and no longer churn the socket on every malformed request. - Return HTTP 404 (not 200) from `/api/devices/{id}/live-detail` on device-not-found so htmx and future API consumers can detect it. - Lower `DEVICE_CONNECT_REGISTER_JITTER` default 5.0s -> 2.0s. 2s still decorrelates ~1000 devices into ~500/sec (much better than lockstep) while being tolerable for single-device dev. Fleet operators should bump via env to 10s+. - Fix comment in `_LIST_DEVICES_MAX_LIMIT`: said "32MB broker ceiling" but `setup_deployment.sh` sets `max_payload: 8MB`. Tests: - New `test_nats_rpc.py`: client reuse, drop on transport error, reconnect on next call, and explicit non-drop on ProtocolError / NoRespondersError. - `test_rpc_handlers.py`: `limit=0` and `limit=-3` rejection. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

- Dashboard detail rows fetched on click but cached via a side-channel `window._detailLoaded` Set keyed by device_id. The live-devices poll re-renders the detail markup on every tick and reset `.detail-content` back to the "Loading details..." placeholder, but the Set persisted across re-renders. Result: expand -> collapse -> next poll re-renders -> re-expand showed "Loading details..." forever. Replace the Set with a `data-loaded` / `data-loading` attribute on the slot element itself, so re-renders naturally invalidate the cache. Also added an in-flight guard against duplicate concurrent fetches and a `r.ok` check so a 404 surfaces as a visible error rather than rendering "Loading details..." with the 404 body text injected. - Both client page loops (`device_connect_edge.registry_client` and `device_connect_server.registry.client`) trusted the server's `next_offset` unconditionally. A buggy or future server returning a non-advancing cursor (`next_offset <= offset`) would have spun forever. Bail out with a warning instead — turns a fleet-scale incident into a recoverable log line. The current registry can't produce this, but the guard is cheap. - Drop the dead `total = len(page)` assignment in the legacy no-`limit` branch of `discovery/listDevices` — it was never read on the legacy reply path and just confused readers. Tests: - `test_registry_client.py`: - `test_list_devices_handles_empty_page_with_next_offset` — the documented ACL-after-pagination caveat: server can return an empty page with `next_offset` still pointing forward, and the loop must advance, not stall. - `test_list_devices_breaks_on_non_advancing_next_offset` — pins the new monotonicity guard via caplog. - `test_rpc_handlers.py`: - `test_list_devices_limit_above_cap_is_clamped` — pins the silent-clamp contract end-to-end: `limit=10000` -> registry called with `_LIST_DEVICES_MAX_LIMIT`, reply carries a forward-pointing `next_offset` so the client paginates without truncation. - `test_device.py` (`TestRegisterStartupJitter`): - `test_jitter_zero_skips_sleep_and_random` — `JITTER=0` is a clean disable: no `random.uniform` call, no jitter sleep. - `test_jitter_positive_sleeps_once_before_first_request` — pre-registration jitter fires exactly once before the first request. - `test_registry_service.py` (`TestEnlargeEtcdPool`): - `test_mounts_adapter_on_existing_session` — both http:// and https:// must be mounted; we don't want a surviving small-pool adapter on either scheme. - `test_logs_warning_when_session_missing` — pins the fallback branch so a future etcd3gw refactor that hides `client.session` doesn't silently regress to the urllib3 default of 10 without any operator-visible signal. - `test_drivers.py`: - `test_portal_mode_device_type_filter_warns_at_setup` — pins the portal-mode `@on(device_type=...)` setup-time warning that was introduced in 0673652 but not previously covered. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Reported live: after a portal restart, re-logging in dropped users on the bare HTML fragment at /api/devices/live?tenant=<...> instead of /dashboard. Chain: 1. /dashboard polls /api/devices/live every 10s via htmx. 2. Portal restarts -> in-memory session cookies invalidated. 3. Next poll fires GET /api/devices/live with HX-Request: true. 4. auth_middleware sees no session, captures the poll URL as next_url, returns HX-Redirect: /login?next=/api/devices/live... 5. htmx follows it as a top-level navigation; login form passes `next` through; login_submit redirects there. 6. User lands on the raw fragment. The `next` capture was added for the CLI-approval flow (a top-level page that requires login). It was applied indiscriminately, including to htmx polls and any GET under /api/. The fix is two-layered: - auth_middleware: skip the `next` capture when the request has HX-Request: true or the path is under /api/. Top-level CLI-approval navigations still capture `next` because they don't carry HX-Request and aren't under /api/. - _safe_next: also reject /api/ and /static/ prefixes belt-and- suspenders, so any stale ?next= link can't bounce the user into a fragment or static asset. Tests (test_portal_auth_redirect.py, 11 cases): - _safe_next accepts full-page paths, rejects external URLs, protocol-relative URLs, CRLF, and /api/ + /static/ paths. - auth_middleware on an HX-Request to /api/devices/live returns HX-Redirect: /login (no `next`). - A non-htmx GET to /api/devices/live still redirects to /login without capturing `next`. - Top-level /dashboard and /cli/approval/{token} still capture the page URL as `next` so CLI approval and post-login bounce still work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@on

asyncio.Lock has no acquire_nowait() (that's threading.Lock). The existing call raised AttributeError on every NATS reconnect, so the SDK never actually re-established @on subscriptions after a messaging hiccup. Latent until you hit a reconnect storm -- at N=1400 across 7 shards during a broadcast burst, ~24 phones per shard tripped it and their state_changed listeners on the co-located camera went dead. Replace with the standard asyncio.Lock pattern: locked() to fast-skip when another caller is already mid-resubscribe, then await acquire() to actually take the lock. Safe under single-loop asyncio (no preempt between locked() and acquire()). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… decisions in comments Three small fixes from review round 4, plus "review notes" comments at spots prior rounds have already churned so future reviewers don't re-litigate the same decisions. Fixes: - ConnectionReconnectingError removed from the transport-fatal set in the portal invoke() path. Dropping the cached client in that state preempts nats-py's own reconnect machinery and amplifies broker flaps. The next nc.request either succeeds post-reconnect or raises something more terminal that *is* in the fatal set. - Cached NATS invoke client now closed on portal shutdown via app.on_cleanup. Without this the long-lived socket leaks on graceful exit because the client is module-level state in nats_rpc, not tied to the aiohttp Application lifecycle. New close_invoke_client() is idempotent. - _collect_event_subscriptions now tolerates @Property descriptors that raise. dir() surfaces every attribute on the class and getattr will invoke descriptors; a lazy-init or hardware-probe property that raises before connect() would otherwise break subscription setup for unrelated handlers. Review notes (anti-circling comments): - device.py _resubscribe_after_reconnect: why asyncio.Lock has no acquire_nowait() and why locked()+acquire() is safe under single-loop asyncio (refs commit 1716f8d). - nats_rpc.py _TRANSPORT_FATAL_ERRORS: why ConnectionReconnectingError, ProtocolError, NoRespondersError are deliberately absent. - registry/service/main.py listDevices handler: why legacy callers fail loudly instead of being silently clamped, why streaming was rejected, why limit<=0 returns -32602 instead of mapping to the cap (refs commits 2349130, 79247e6). - registry/service/registry.py list_devices_page: explicit cost model — O(N * walks_per_second) etcd traffic. NATS payload is bounded but etcd/CPU load isn't; selector pushdown / keyset pagination is the documented next iteration. - drivers/base.py _collect_event_subscriptions: why single-underscore methods are scanned, refs commit 0673652. Tests: - test_drivers.py: test_collector_survives_raising_property. - test_nats_rpc.py: test_connection_reconnecting_keeps_cached_client, test_close_invoke_client_closes_cached, test_close_invoke_client_idempotent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The dashboard paused all polling of /api/devices/live as soon as any detail row was expanded, so it could preserve the expanded state across the swap. PR 38 later wired the header counters (#online-count / #registered-count / #creds-count) to refresh via hx-swap-oob spans that piggyback on that same poll fragment, so the pause silently froze the counters too: operators saw "Devices Online" stuck at its first-render value the whole time they were inspecting any device, and offline devices never dropped from the table. Replace the htmx:beforeRequest pause with an htmx:afterSwap restore: the poll fires normally (counters refresh, offline devices drop), and any rows the user had expanded are re-opened and their detail slot re-fetched via the existing loadDetailIfNeeded path. Devices that disappear from the table get pruned from _expandedDevices. Apply the same fix to the admin "view as user" tenant detail page, which had three related defects: 1. Counter cards were plain <div>{{ count }}</div> with no IDs, so the OOB swaps from _live_table.html had no target — silently dropped. Wrap each in <span id="..."> matching the OOB markup. 2. loadDetailIfNeeded was never copied over after PR 38 moved detail markup to the lazy /api/devices/{id}/live-detail endpoint, so expanding a row on this page just revealed an empty "Loading details..." placeholder forever. Copy the function and call it from toggleDetail. 3. Same beforeRequest pause as the dashboard had — replace with the afterSwap restore loop. Follow-up worth doing in a separate PR: the toggleDetail / loadDetailIfNeeded / RPC invoke / SSE event log scripts are duplicated between dashboard.html and admin/tenant_detail.html, which is exactly why fix #2 above drifted in the first place. Extract a shared _dashboard_scripts.html partial parameterized by tenant URL suffix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

When a caller passes a ``limit`` larger than ``_LIST_DEVICES_MAX_LIMIT`` the handler silently clamps the page size and relies on ``next_offset`` to let the client paginate naturally. The wire contract is correct -- the caller just paginates more aggressively than it expected -- but the clamp is invisible to operators. Anyone tuning ``DEVICE_CONNECT_LIST_PAGE_SIZE`` above the server cap gets no signal that their knob is being ignored, and the only symptom is "discovery walks take more round-trips than the page-size suggests." Log a single WARNING the first time we see a given over-cap value, then stay quiet on subsequent requests with the same limit. Dedup is keyed by the requested value, not by caller -- the warning is operator-facing ("your server cap no longer matches a tuned client knob"), not per-request observability, so a steady-state misconfigured client logs once instead of once per page. The dedup set is bounded by the number of distinct over-cap values any caller asks for, which in practice is 1-2. Tests: - new ``test_list_devices_clamp_warns_once_per_requested_limit``: first over-cap request warns; a repeat with the same value does not; a new over-cap value warns again. - new ``test_list_devices_at_or_under_cap_does_not_warn``: pins that the warn gate is strictly ``effective < requested`` so at-cap and under-cap requests stay silent. - existing ``test_list_devices_limit_above_cap_is_clamped`` still passes; the clamp behavior itself is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

atsyplikhin and others added 15 commits May 20, 2026 17:32

fix(tests): drop unused json import flagged by ruff

e234278

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

atsyplikhin mentioned this pull request May 26, 2026

feat(portal): mark UI as testing-only, not for production #40

Open

4 tasks

atsyplikhin and others added 4 commits May 26, 2026 23:53

atsyplikhin requested a review from kavya-chennoju May 27, 2026 19:19

atsyplikhin and others added 2 commits May 27, 2026 21:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(registry): paginate discovery/listDevices to bound NATS payloads#38

feat(registry): paginate discovery/listDevices to bound NATS payloads#38
atsyplikhin wants to merge 21 commits into
mainfrom
feat/registry-pagination

atsyplikhin commented May 21, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

atsyplikhin commented May 21, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Discovery pagination (the headline)

Problem

Wire contract (additive, backward compatible)

Compatibility matrix

Server

Clients (edge + server RegistryClient)

Defense in depth

Other bundled fixes (same incident)

1. Registration thundering herd (device.py)

2. etcd HTTP pool exhaustion (registry/service/registry.py)

3. Portal NATS reconnect storm (portal/services/nats_rpc.py)

4. Dashboard fan-out (portal/views/dashboard.py, templates)

5. @on handler collection (drivers/base.py)

6. @on(device_type=...) in registry/portal mode (drivers/base.py)

7. Portal RPC timing logs (portal/views/dashboard.py)

Tests

New

Regressions

Out of scope (deliberately)

Test plan

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

atsyplikhin commented May 21, 2026 •

edited

Loading

Clients (edge + server `RegistryClient`)

1. Registration thundering herd (`device.py`)

2. etcd HTTP pool exhaustion (`registry/service/registry.py`)

3. Portal NATS reconnect storm (`portal/services/nats_rpc.py`)

4. Dashboard fan-out (`portal/views/dashboard.py`, templates)

5. `@on` handler collection (`drivers/base.py`)

6. `@on(device_type=...)` in registry/portal mode (`drivers/base.py`)

7. Portal RPC timing logs (`portal/views/dashboard.py`)