Skip to content

feat(registry): paginate discovery/listDevices to bound NATS payloads#38

Open
atsyplikhin wants to merge 21 commits into
mainfrom
feat/registry-pagination
Open

feat(registry): paginate discovery/listDevices to bound NATS payloads#38
atsyplikhin wants to merge 21 commits into
mainfrom
feat/registry-pagination

Conversation

@atsyplikhin
Copy link
Copy Markdown
Collaborator

@atsyplikhin atsyplikhin commented May 21, 2026

Summary

This PR addresses a class of fleet-scale failures we hit at ~1400 phones
on a single registry. The headline fix is server-side pagination of
discovery/listDevices
, but the same outage also exposed several
adjacent bottlenecks (a registration thundering-herd, an undersized
etcd HTTP pool, an unbounded portal NATS reconnect path, a dashboard
that fetched every device's detail block on every poll). Each is fixed
in its own commit; they're shipped together because they were the same
incident response and they interact (e.g. pagination is much less
valuable if the registry still hangs on a registration herd).

Splitting was considered and rejected as churn — each change is small
and they share tests/fixtures. Reviewers should still feel free to
critique any commit individually.


Discovery pagination (the headline)

Problem

At ~1400 devices the JSON-encoded fleet snapshot exceeds NATS's default
1 MB max_payload and any agent-tools call that resolves a
selector (discover_labels, discover('device(category:phone)'),
broadcast, even narrow selectors) fails with
nats: maximum payload exceeded. The registry loads the entire tenant
from etcd before any filter is applied, so even fine-grained selectors
hit the wall.

Wire contract (additive, backward compatible)

  • Request params for discovery/listDevices gain optional
    offset (int, default 0) and limit (int).
  • Reply gains next_offset (int | null) and total_matched (int)
    alongside the existing devices array — only when the caller
    opted in by passing limit.

Compatibility matrix

Client behavior Server behavior
Old client, no limit Server takes the legacy unpaged path — returns the full filtered fleet in one reply. No next_offset/total_matched fields. Small fleets keep working untouched; large fleets fail loudly with the original max_payload exceeded error rather than silently truncating. Operators hitting the ceiling must upgrade clients.
New client, passes limit Server paginates, silently clamps the page to _LIST_DEVICES_MAX_LIMIT (env DC_LIST_DEVICES_MAX_LIMIT, default 200), returns next_offset/total_matched. A client asking for limit=10000 gets the first 200 plus a forward-pointing next_offset — no error, no truncation; the client just paginates more aggressively than it expected. Client loops until next_offset is None.
New client, old server Server ignores unknown params and returns the legacy single-reply shape. Client sees next_offset absent, exits the loop after one request — forward-compatible.
Malformed offset / limit Server returns a clean JSON-RPC -32602 error instead of a 500.

Server

  • DeviceRegistry.list_devices_page(tenant, *, device_type, location, offset, limit) returns (devices_page, next_offset, total_matched). The existing list_devices(...) is untouched and remains the legacy path. Stability and ACL caveats documented in the docstring.
  • discovery/listDevices handler routes to the paged variant only when limit is in params; otherwise keeps the legacy reply shape. Both offset and limit are validated up front.

Clients (edge + server RegistryClient)

  • RegistryClient.list_devices() loops 100-device pages internally. External signature unchanged; callers keep getting a flat List[Dict]. Page size tunable via DEVICE_CONNECT_LIST_PAGE_SIZE.
  • New RegistryClient.list_devices_page(...) returns (devices, next_offset, total_matched) for callers that want explicit paging (e.g. a UI). Docstrings call out the ACL caveat: total_matched is pre-ACL and pages may be shorter than limit for restricted callers.

Defense in depth

  • security_infra/setup_deployment.sh bakes max_payload: 8MB into the regenerated NATS config so operators rerunning the script raise the broker ceiling alongside the protocol-level fix.

Other bundled fixes (same incident)

1. Registration thundering herd (device.py)

Problem: 1000+ phones spinning up in lockstep hit a 2s request
timeout, every late reply triggered a retry, and the queue collapsed
into a sustained storm.
Fix: registration request timeout now DEVICE_CONNECT_REGISTER_TIMEOUT
(default 15s, was 2s) and a one-shot pre-registration jitter window
DEVICE_CONNECT_REGISTER_JITTER (default 2s, set to 0 to disable;
bump to 10s+ for fleet scale) decorrelates the initial herd. 2s still
spreads ~1000 lockstep devices into roughly 500/sec while staying
tolerable for single-device dev. Subsequent retries keep their existing
exponential backoff.

2. etcd HTTP pool exhaustion (registry/service/registry.py)

Problem: etcd3gw's default urllib3 pool of 10 sockets bottlenecked
the registry under the herd — every lease+put is two sequential HTTP
calls, so thousands of requests queued behind 10 sockets.
Fix: new _enlarge_etcd_pool() mounts a larger HTTPAdapter
(DC_ETCD_POOL_SIZE, default 64) onto the etcd3gw client's
underlying requests.Session. Implemented via post-construction mount
(not etcd3gw.client(session=...)) so it works on 2.5.x and 2.6+.
Logs a warning if etcd3gw ever stops exposing client.session — the
fix would otherwise silently regress.

3. Portal NATS reconnect storm (portal/services/nats_rpc.py)

Problem: every "Run" click on the dashboard opened a fresh NATS
connection (TCP + JWT-auth handshake) per RPC.
Fix: long-lived cached client reused across invoke() calls. The
NATS client is concurrent-safe (nc.request allocates its own inbox
sub) so one shared client serves the whole portal. On hard transport
failure the cached client is dropped and best-effort closed before
the next call reconnects. The module-level asyncio.Lock is created
lazily inside the first awaiter so it binds to the live event loop,
not import-time.

4. Dashboard fan-out (portal/views/dashboard.py, templates)

Problem: the live-devices fragment rendered every device's full
detail panel (functions, events, raw JSON) on every 3s htmx poll,
turning the dashboard into a fleet-sized data fan-out.
Fix:

  • Detail markup moved to a new lazy endpoint
    /api/devices/{device_id}/live-detail, fetched on row expand. The
    main poll now returns only summary rows.
  • etcd get_prefix and credential listing moved off the event loop
    via asyncio.to_thread so other portal requests don't queue behind
    the poll.
  • Header counters (creds/online/registered) refresh in place via
    hx-swap-oob. Previously they froze at first-render values and
    showed "1401 online" long after the swarm shut down.
  • Poll cadence eased from 3s -> 10s.
  • revealed-based lazy loading was rejected: htmx's revealed
    implementation calls getBoundingClientRect, and display:none
    rows return a zero-rect at (0,0) which falsely registers as
    in-viewport — every hidden row would fetch on page load.

5. @on handler collection (drivers/base.py)

Problem: _collect_event_subscriptions skipped every single-
underscore-prefixed method, so drivers that wrote @on async def _on_phone_state(...) (Python's conventional "this is private") had
their event handlers silently dropped.
Fix: scan everything except dunders. The _on_meta attribute
gate filters out unrelated private attributes; new test covers it.

6. @on(device_type=...) in registry/portal mode (drivers/base.py)

Problem: device_type filtering relies on the D2D peer cache. In
portal/registry mode there is no peer cache, so the filter silently
passed every device's matching event through.
Fix: warn once at subscription setup so subscribers don't silently
get cross-device events. Strict filtering can be added in-handler.

7. Portal RPC timing logs (portal/views/dashboard.py)

Added structured timing logs around backend.rpc_invoke so the next
slow-dashboard incident has data on whether the latency is in the
portal handler, the pre-RPC setup, or the device round-trip.


Tests

New

  • test_registry_service.py — 6 tests in TestListDevicesPage:
    slice + next_offset, final page sets next_offset=None, offset
    past end returns empty, limit=None returns the remainder, a full
    1400-device round trip, and filter-before-pagination.
  • test_rpc_handlers.py — handler-level coverage rewritten around the
    dual-path behavior. New regression tests:
    • test_list_devices_legacy_no_limit_returns_unbounded — old-client
      path returns full fleet and emits no pagination metadata.
    • test_list_devices_with_limit_paginates — new-client path emits
      next_offset/total_matched.
    • test_list_devices_paged_acl_total_is_unfiltered — documents and
      locks in the ACL × pagination caveat (page shorter than limit,
      total_matched reflects pre-ACL count).
    • test_list_devices_invalid_offset_returns_error /
      test_list_devices_negative_offset_returns_error /
      test_list_devices_invalid_limit_returns_error — malformed
      pagination params surface as JSON-RPC -32602, not 500s.
  • test_registry_client.py (edge) — 5 tests in
    TestListDevicesPagination: full-fleet paging across 14 round
    trips, wire-level offset/limit round-trip, legacy-server
    single-reply fallback, explicit list_devices_page() metadata
    exposure, filter forwarding.
  • test_drivers.pytest_underscore_prefixed_handler_is_still_collected
    for the _collect_event_subscriptions fix.

Regressions

All existing registry / edge / agent-tools tests remain green; the only
test-file edits are in test_rpc_handlers.py to reflect the new
dual-path behavior of the handler.

Out of scope (deliberately)

  • Selector pushdown to the registry / etcd indexes. That would let
    the server filter by category:phone etc. before paginating and
    reduce wire traffic further, but is a much bigger change.
    Pagination alone is enough to keep individual replies under
    max_payload regardless of fleet size.
  • Keyset pagination. Offset pagination has known stability gaps
    under concurrent registrations (documented in the
    list_devices_page docstring); a device_id-cursor version is the
    next iteration.
  • Streaming replies.

Test plan

  • pytest packages/device-connect-server/tests/device_connect_server/test_registry_service.py -v — 43 passed
  • pytest packages/device-connect-server/tests/device_connect_server/test_rpc_handlers.py -v — 41 passed
  • pytest packages/device-connect-edge/tests/test_registry_client.py -v — passes
  • pytest packages/device-connect-edge/tests/test_drivers.py -v — passes
  • pytest packages/device-connect-agent-tools/tests/test_connection_unit.py tests/test_tools_unit.py -v — 65 passed (from previous run; no semantic change to that surface)
  • Integration test (Docker) on the 1400-device deployment that originally hit the cliff: re-run discover('device(category:phone)'), discover_labels(), and broadcast(...) end-to-end.

🤖 Generated with Claude Code

atsyplikhin and others added 15 commits May 20, 2026 17:32
At fleet scale (~1400 devices) the JSON-encoded reply from
`discovery/listDevices` crosses NATS's default 1 MB max_payload, breaking
every agent-tools call that resolves a selector — `discover()`,
`discover_labels()`, and `broadcast()` all fail because the registry
loads the full fleet before any filter and tries to publish it in a
single message.

This change makes the reply size independent of fleet size:

- New optional `offset` / `limit` params on `discovery/listDevices`; new
  reply fields `next_offset` and `total_matched`. Omitting `limit`
  preserves the legacy single-shot reply for older clients.
- `DeviceRegistry.list_devices_page(...)` returns a slice plus metadata
  alongside the existing `list_devices(...)` (untouched).
- Both `RegistryClient` implementations (edge + server-internal) loop
  internally over 100-device pages — externally unchanged, but no single
  reply ever carries the whole fleet. Page size is tunable via
  `DEVICE_CONNECT_LIST_PAGE_SIZE`. A new `list_devices_page()` exposes
  metadata to callers that want explicit paging.
- Client backward compat: a server that doesn't return `next_offset`
  causes the page loop to exit after one request.
- Defense-in-depth: `setup_deployment.sh` now bakes
  `max_payload: 8MB` into the NATS config it generates, so operators who
  rerun it preserve the ceiling.

Tests: 6 new in test_registry_service.py (including a full 1400-device
walk), 5 new in test_registry_client.py covering paging, legacy-server
fallback, filter forwarding, and explicit-page metadata.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The constant was sitting between the stdlib and device_connect_edge
import groups, which trips ruff's "module level import not at top of
file" rule. Move it after the imports to match the edge client layout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ayload overflow

At fleet scale (1400+ devices, ~13KB per record) the legacy unpaginated
listDevices path would emit a single multi-megabyte reply that exceeded
the NATS broker's max_payload, killing every concurrent caller. The new
_LIST_DEVICES_MAX_LIMIT (default 200, env DC_LIST_DEVICES_MAX_LIMIT)
clamps every reply regardless of what the caller passed, so an
unpaginated or misconfigured client now gets the first page plus
next_offset instead of an overflow error. The dead "legacy single-shot"
else-branch is removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the live-devices fragment rendered hidden detail markup
(functions, events, identity, raw-JSON pre-block) for every device in
every poll. At 1400 phones each poll returned ~18 MB on a 3-second
interval, blocking the portal event loop and saturating client traffic.

The fragment now ships only summary rows (~500 KB at 1400 devices) plus
an htmx placeholder per row; the detail markup is fetched from a new
GET /api/devices/{device_id}/live-detail endpoint when a row is
expanded. The etcd prefix-scan is wrapped in asyncio.to_thread so it
no longer blocks other portal requests. Poll cadence is relaxed from
3 s to 10 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At 1400+ phones spinning up in lockstep, every phone's _register hit the
registry inside the same microsecond window. The 2 s request timeout
guaranteed most callers timed out before the registry's etcd writes
returned, every timeout fired a retry, and the retries re-entered the
queue — classic congestion collapse, which left ~1200 phones invisible
to the registry while still alive on the fabric.

Edge SDK (device.py):
  - DEVICE_CONNECT_REGISTER_TIMEOUT (default 15 s, was hardcoded 2 s)
    lets the registry process a queued request before any retry fires.
  - DEVICE_CONNECT_REGISTER_JITTER (default 5 s) sleeps a uniform-random
    0..jitter before the first request, decorrelating arrivals so the
    registry never sees a synchronized burst. Skipped when set to 0.

Registry (registry.py):
  - DC_ETCD_POOL_SIZE (default 64) hands etcd3gw a requests.Session with
    an oversized urllib3 HTTPAdapter. The default pool of 10 was the
    actual ceiling on concurrent lease+put round-trips.

End-to-end at 1400 phones: 0 registration failures, 0 publish-overflow
errors, all phones in etcd within ~30 s of launch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…5.x; align listDevices tests with server-side pagination

The previous change passed ``session=`` to ``etcd3gw.client(...)``, which
only works on 2.6+; CI runs against 2.5.x and broke at module import.
Mount the larger urllib3 pool onto the already-constructed
``client.session`` instead, which works on both versions.

TestListDevicesHandler now mocks ``list_devices_page`` (the only path
the handler takes after the server-side cap landed) and asserts the
``next_offset``/``total_matched`` shape in the reply.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three small clarity fixes uncovered while wiring an out-of-process
camera driver to subscribe to phone state_changed events at fleet scale
(1400 devices across 7 shards) against the DC Portal.

1. _collect_event_subscriptions no longer skips single-underscore method
   names. Previously an @on-decorated `_on_foo` handler was silently
   never registered (the attribute-name check at base.py:1058 dropped
   it before the `_is_event_subscription` marker check could even
   run). The collector now only skips dunders. Behavior is still
   gated by the `_is_event_subscription` marker, so the public-API
   surface is unchanged. New test:
   test_underscore_prefixed_handler_is_still_collected.

2. Renamed the two INFO log lines in _setup_agentic_driver that called
   the path "D2D capabilities" / "D2D setup complete" -- the same code
   path runs in both d2d and portal/registry mode, so the label was
   wrong half the time. The completion line now reports the actual
   registry kind (D2DRegistry vs RegistryClient).

3. @on(device_type=...) filtering is best-effort in portal/registry
   mode because the D2D peer cache used to resolve a source device's
   type is None. The handler's cache-miss path passes events through
   (false negatives are worse than false positives), but until now
   that limitation only surfaced via a per-event debug log. Emit a
   WARNING at subscription-setup time when device_type is set and
   there is no peer collector, so subscribers see the gotcha once
   instead of silently receiving events from other device types.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
invoke_device_rpc used to open and close a fresh NATS connection per
dashboard "Run" click, paying a full TCP+JWT-auth handshake on every
request. At fleet scale, that handshake is the dominant cost — the
RPC itself is ~12 ms once a client is connected.

nats_rpc.invoke() now holds a single long-lived client under an
asyncio.Lock and reuses it across requests; transport errors drop the
cache so the next call reconnects. The HTTP handler and the NATS layer
both log handler/RPC timings so the source of any future latency is
visible from docker logs alone:

  invoke alpha/phone-0001.get_position handler=11.5ms (pre-rpc=0.1ms rpc=11.5ms)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous lazy-load wired ``hx-trigger="revealed once"`` on every
hidden detail ``<div>``. htmx's revealed implementation polls
getBoundingClientRect, and ``display:none`` elements report a zero-size
rect at (0,0) — which counts as "in the viewport." That meant the
trigger fired for every hidden row immediately after the table fragment
rendered: at 1400 phones, the browser fan-out queued ~1400 parallel
fetches behind its 6-connection-per-origin pool, and a subsequent
"Run" click on a row's RPC sat at the back of that queue for many
seconds.

Replace the htmx trigger with a plain fetch invoked from
toggleDetail() and a ``_detailLoaded`` Set that dedupes re-expands.
The detail URL is carried on the parent ``<tr>`` as
``data-detail-url`` so the JS doesn't have to reconstruct it. Net
result: zero detail fetches at page load, one fetch on first expand
per device, and "Run" feels instant again.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Credentials Created / Devices Online / Devices Registered cards
were rendered once in dashboard_page and never updated. After the
swarm shut down, the table emptied via its 10s htmx poll but the cards
stayed at the last value seen during the initial page render — at
fleet scale that produced an obviously-wrong "1401 online" while the
table read "No devices registered yet."

Piggyback on the existing live-devices poll using ``hx-swap-oob``:
the fragment now appends three tiny ``<span hx-swap-oob="innerHTML">``
blobs that target ``#online-count``, ``#registered-count``, and
``#creds-count`` in the dashboard header. Counts and table swap in
lockstep with no new endpoint and no extra request. The header span
IDs are added in dashboard.html so the OOB swap has somewhere to land.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the handler always called list_devices_page and clamped to
_LIST_DEVICES_MAX_LIMIT (200), even for callers that omitted ``limit``.
At fleet scale an old client would silently receive only the first 200
devices with no next_offset signal and act on a partial view.

Now the handler branches on ``limit``: if absent, return the legacy
unbounded reply via list_devices (no next_offset/total_matched leaks);
if present, paginate as before. Old clients keep working on small
fleets and fail loudly with max_payload exceeded at large scale, the
same behavior they had before this branch. New clients opt into
pagination by passing limit.

Also validates ``offset`` (was implicit int() that 500'd on garbage)
and rejects negative values; both bad offset and bad limit now return
a clean JSON-RPC -32602 error.

Tests rewritten to cover both paths and add regressions for the ACL x
pagination caveat (total_matched is pre-ACL; pages may be shorter than
limit after ACL).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two latent issues in the cached-NATS-client path:

1. The module-level asyncio.Lock() bound to whichever event loop was
   current at import. Tests (and any future code) that runs this
   module under a fresh loop would have hit a cross-loop usage error.
   The lock is now created lazily inside _get_invoke_lock() on first
   await.

2. On a hard transport failure the cached client was dropped to None
   but its sockets were not closed. The dead client could leak FDs
   until garbage-collected. Now we drop and best-effort await
   stale.close() in a new _drop_invoke_client() helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
list_devices_page docstrings (edge + server RegistryClient and the
service-layer registry) now spell out the ACL caveat: ACL filtering
runs after slicing, so total_matched is the pre-ACL count and pages
may be shorter than limit for ACL-restricted callers. UIs that
consume these values must not assume len(devices) == limit means a
full page.

The service-layer docstring also documents the offset-pagination
stability gap (concurrent registrations can shift records across
pages mid-walk) and points at keyset pagination as the next step.

_enlarge_etcd_pool now logs a warning when etcd3gw stops exposing
client.session, so a future internal refactor doesn't silently
revert the pool back to the urllib3 default of 10 and reintroduce
the registration-storm bottleneck without any operator signal.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- Reject `limit <= 0` on discovery/listDevices with JSON-RPC -32602
  rather than silently mapping it to the server cap (surprising and
  masked client bugs that passed unintentional zero/negative values).
- Narrow the portal `invoke()` exception handler so only transport
  errors (ConnectionClosed, Stale, NoServers, OSError, ...) drop the
  cached NATS client. ProtocolError, BadSubject, MaxPayload etc. are
  payload/programmer bugs and no longer churn the socket on every
  malformed request.
- Return HTTP 404 (not 200) from `/api/devices/{id}/live-detail` on
  device-not-found so htmx and future API consumers can detect it.
- Lower `DEVICE_CONNECT_REGISTER_JITTER` default 5.0s -> 2.0s. 2s
  still decorrelates ~1000 devices into ~500/sec (much better than
  lockstep) while being tolerable for single-device dev. Fleet
  operators should bump via env to 10s+.
- Fix comment in `_LIST_DEVICES_MAX_LIMIT`: said "32MB broker ceiling"
  but `setup_deployment.sh` sets `max_payload: 8MB`.

Tests:
- New `test_nats_rpc.py`: client reuse, drop on transport error,
  reconnect on next call, and explicit non-drop on ProtocolError /
  NoRespondersError.
- `test_rpc_handlers.py`: `limit=0` and `limit=-3` rejection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
atsyplikhin and others added 4 commits May 26, 2026 23:53
- Dashboard detail rows fetched on click but cached via a side-channel
  `window._detailLoaded` Set keyed by device_id. The live-devices poll
  re-renders the detail markup on every tick and reset
  `.detail-content` back to the "Loading details..." placeholder, but
  the Set persisted across re-renders. Result: expand -> collapse ->
  next poll re-renders -> re-expand showed "Loading details..."
  forever. Replace the Set with a `data-loaded` / `data-loading`
  attribute on the slot element itself, so re-renders naturally
  invalidate the cache. Also added an in-flight guard against
  duplicate concurrent fetches and a `r.ok` check so a 404 surfaces
  as a visible error rather than rendering "Loading details..." with
  the 404 body text injected.
- Both client page loops (`device_connect_edge.registry_client` and
  `device_connect_server.registry.client`) trusted the server's
  `next_offset` unconditionally. A buggy or future server returning a
  non-advancing cursor (`next_offset <= offset`) would have spun
  forever. Bail out with a warning instead — turns a fleet-scale
  incident into a recoverable log line. The current registry can't
  produce this, but the guard is cheap.
- Drop the dead `total = len(page)` assignment in the legacy
  no-`limit` branch of `discovery/listDevices` — it was never read on
  the legacy reply path and just confused readers.

Tests:
- `test_registry_client.py`:
  - `test_list_devices_handles_empty_page_with_next_offset` — the
    documented ACL-after-pagination caveat: server can return an
    empty page with `next_offset` still pointing forward, and the
    loop must advance, not stall.
  - `test_list_devices_breaks_on_non_advancing_next_offset` — pins
    the new monotonicity guard via caplog.
- `test_rpc_handlers.py`:
  - `test_list_devices_limit_above_cap_is_clamped` — pins the
    silent-clamp contract end-to-end: `limit=10000` -> registry
    called with `_LIST_DEVICES_MAX_LIMIT`, reply carries a
    forward-pointing `next_offset` so the client paginates without
    truncation.
- `test_device.py` (`TestRegisterStartupJitter`):
  - `test_jitter_zero_skips_sleep_and_random` — `JITTER=0` is a
    clean disable: no `random.uniform` call, no jitter sleep.
  - `test_jitter_positive_sleeps_once_before_first_request` —
    pre-registration jitter fires exactly once before the first
    request.
- `test_registry_service.py` (`TestEnlargeEtcdPool`):
  - `test_mounts_adapter_on_existing_session` — both http:// and
    https:// must be mounted; we don't want a surviving small-pool
    adapter on either scheme.
  - `test_logs_warning_when_session_missing` — pins the fallback
    branch so a future etcd3gw refactor that hides `client.session`
    doesn't silently regress to the urllib3 default of 10 without
    any operator-visible signal.
- `test_drivers.py`:
  - `test_portal_mode_device_type_filter_warns_at_setup` — pins the
    portal-mode `@on(device_type=...)` setup-time warning that was
    introduced in 0673652 but not previously covered.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reported live: after a portal restart, re-logging in dropped users on
the bare HTML fragment at /api/devices/live?tenant=<...> instead of
/dashboard. Chain:

  1. /dashboard polls /api/devices/live every 10s via htmx.
  2. Portal restarts -> in-memory session cookies invalidated.
  3. Next poll fires GET /api/devices/live with HX-Request: true.
  4. auth_middleware sees no session, captures the poll URL as
     next_url, returns HX-Redirect: /login?next=/api/devices/live...
  5. htmx follows it as a top-level navigation; login form passes
     `next` through; login_submit redirects there.
  6. User lands on the raw fragment.

The `next` capture was added for the CLI-approval flow (a top-level
page that requires login). It was applied indiscriminately, including
to htmx polls and any GET under /api/. The fix is two-layered:

- auth_middleware: skip the `next` capture when the request has
  HX-Request: true or the path is under /api/. Top-level
  CLI-approval navigations still capture `next` because they don't
  carry HX-Request and aren't under /api/.
- _safe_next: also reject /api/ and /static/ prefixes belt-and-
  suspenders, so any stale ?next= link can't bounce the user into a
  fragment or static asset.

Tests (test_portal_auth_redirect.py, 11 cases):
- _safe_next accepts full-page paths, rejects external URLs,
  protocol-relative URLs, CRLF, and /api/ + /static/ paths.
- auth_middleware on an HX-Request to /api/devices/live returns
  HX-Redirect: /login (no `next`).
- A non-htmx GET to /api/devices/live still redirects to /login
  without capturing `next`.
- Top-level /dashboard and /cli/approval/{token} still capture
  the page URL as `next` so CLI approval and post-login bounce
  still work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
asyncio.Lock has no acquire_nowait() (that's threading.Lock). The
existing call raised AttributeError on every NATS reconnect, so the
SDK never actually re-established @on subscriptions after a messaging
hiccup. Latent until you hit a reconnect storm -- at N=1400 across 7
shards during a broadcast burst, ~24 phones per shard tripped it and
their state_changed listeners on the co-located camera went dead.

Replace with the standard asyncio.Lock pattern: locked() to fast-skip
when another caller is already mid-resubscribe, then await acquire()
to actually take the lock. Safe under single-loop asyncio (no preempt
between locked() and acquire()).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… decisions in comments

Three small fixes from review round 4, plus "review notes" comments at
spots prior rounds have already churned so future reviewers don't
re-litigate the same decisions.

Fixes:
- ConnectionReconnectingError removed from the transport-fatal set in
  the portal invoke() path. Dropping the cached client in that state
  preempts nats-py's own reconnect machinery and amplifies broker
  flaps. The next nc.request either succeeds post-reconnect or raises
  something more terminal that *is* in the fatal set.
- Cached NATS invoke client now closed on portal shutdown via
  app.on_cleanup. Without this the long-lived socket leaks on
  graceful exit because the client is module-level state in
  nats_rpc, not tied to the aiohttp Application lifecycle. New
  close_invoke_client() is idempotent.
- _collect_event_subscriptions now tolerates @Property descriptors
  that raise. dir() surfaces every attribute on the class and
  getattr will invoke descriptors; a lazy-init or hardware-probe
  property that raises before connect() would otherwise break
  subscription setup for unrelated handlers.

Review notes (anti-circling comments):
- device.py _resubscribe_after_reconnect: why asyncio.Lock has no
  acquire_nowait() and why locked()+acquire() is safe under
  single-loop asyncio (refs commit 1716f8d).
- nats_rpc.py _TRANSPORT_FATAL_ERRORS: why ConnectionReconnectingError,
  ProtocolError, NoRespondersError are deliberately absent.
- registry/service/main.py listDevices handler: why legacy callers
  fail loudly instead of being silently clamped, why streaming was
  rejected, why limit<=0 returns -32602 instead of mapping to the
  cap (refs commits 2349130, 79247e6).
- registry/service/registry.py list_devices_page: explicit cost
  model — O(N * walks_per_second) etcd traffic. NATS payload is
  bounded but etcd/CPU load isn't; selector pushdown / keyset
  pagination is the documented next iteration.
- drivers/base.py _collect_event_subscriptions: why single-underscore
  methods are scanned, refs commit 0673652.

Tests:
- test_drivers.py: test_collector_survives_raising_property.
- test_nats_rpc.py: test_connection_reconnecting_keeps_cached_client,
  test_close_invoke_client_closes_cached,
  test_close_invoke_client_idempotent.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
atsyplikhin and others added 2 commits May 27, 2026 21:57
The dashboard paused all polling of /api/devices/live as soon as any
detail row was expanded, so it could preserve the expanded state across
the swap. PR 38 later wired the header counters (#online-count /
#registered-count / #creds-count) to refresh via hx-swap-oob spans that
piggyback on that same poll fragment, so the pause silently froze the
counters too: operators saw "Devices Online" stuck at its first-render
value the whole time they were inspecting any device, and offline
devices never dropped from the table.

Replace the htmx:beforeRequest pause with an htmx:afterSwap restore:
the poll fires normally (counters refresh, offline devices drop),
and any rows the user had expanded are re-opened and their detail slot
re-fetched via the existing loadDetailIfNeeded path. Devices that
disappear from the table get pruned from _expandedDevices.

Apply the same fix to the admin "view as user" tenant detail page,
which had three related defects:

  1. Counter cards were plain <div>{{ count }}</div> with no IDs, so
     the OOB swaps from _live_table.html had no target — silently
     dropped. Wrap each in <span id="..."> matching the OOB markup.
  2. loadDetailIfNeeded was never copied over after PR 38 moved
     detail markup to the lazy /api/devices/{id}/live-detail endpoint,
     so expanding a row on this page just revealed an empty "Loading
     details..." placeholder forever. Copy the function and call it
     from toggleDetail.
  3. Same beforeRequest pause as the dashboard had — replace with the
     afterSwap restore loop.

Follow-up worth doing in a separate PR: the toggleDetail /
loadDetailIfNeeded / RPC invoke / SSE event log scripts are duplicated
between dashboard.html and admin/tenant_detail.html, which is exactly
why fix #2 above drifted in the first place. Extract a shared
_dashboard_scripts.html partial parameterized by tenant URL suffix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When a caller passes a ``limit`` larger than ``_LIST_DEVICES_MAX_LIMIT``
the handler silently clamps the page size and relies on ``next_offset``
to let the client paginate naturally. The wire contract is correct --
the caller just paginates more aggressively than it expected -- but the
clamp is invisible to operators. Anyone tuning
``DEVICE_CONNECT_LIST_PAGE_SIZE`` above the server cap gets no signal
that their knob is being ignored, and the only symptom is "discovery
walks take more round-trips than the page-size suggests."

Log a single WARNING the first time we see a given over-cap value, then
stay quiet on subsequent requests with the same limit. Dedup is keyed
by the requested value, not by caller -- the warning is operator-facing
("your server cap no longer matches a tuned client knob"), not
per-request observability, so a steady-state misconfigured client logs
once instead of once per page. The dedup set is bounded by the number
of distinct over-cap values any caller asks for, which in practice is
1-2.

Tests:
- new ``test_list_devices_clamp_warns_once_per_requested_limit``: first
  over-cap request warns; a repeat with the same value does not; a new
  over-cap value warns again.
- new ``test_list_devices_at_or_under_cap_does_not_warn``: pins that
  the warn gate is strictly ``effective < requested`` so at-cap and
  under-cap requests stay silent.
- existing ``test_list_devices_limit_above_cap_is_clamped`` still
  passes; the clamp behavior itself is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant