From 92f99f6e31fa2dbed3323194227239c1c21d031e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabr=C3=ADcio=20Bracht?= Date: Sun, 3 May 2026 11:27:52 -0700 Subject: [PATCH 1/8] =?UTF-8?q?add=20=C2=A712=20mode=20interaction=20and?= =?UTF-8?q?=20migration=20to=20peer-mode=20design?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/design/reactive-peer-mode.md | 73 ++++++++++++++++++++++++++++--- 1 file changed, 68 insertions(+), 5 deletions(-) diff --git a/docs/design/reactive-peer-mode.md b/docs/design/reactive-peer-mode.md index b9ec5a6..0886602 100644 --- a/docs/design/reactive-peer-mode.md +++ b/docs/design/reactive-peer-mode.md @@ -169,13 +169,13 @@ if (typeof remoteVersion === 'number') { } ``` -In peer mode, replace this with an HLC compare. The entity's `_version` field becomes an HLC object. The compare is lexicographic on `(ts, counter, nodeId)`. +In peer mode, the LWW compare reads a dedicated `_hlc` field instead of `_version`. The compare is lexicographic on `(ts, counter, nodeId)`. MQDB mode continues to use numeric `_version`. The two fields coexist on the type but only one is written per mode — see §12.2 for why we keep both rather than overloading `_version`. -For backward compat with MQDB mode: keep numeric `_version` as the comparison in `'mqdb'` mode, switch to HLC compare only in `'peer'` mode. The mode flag is read once and cached. +The mode flag is read once and cached at construction; the LWW branch in `applyMutationToDb` dispatches on it. ### 5.4 HLC location -HLC lives **per-entity record**, stored on the record itself (replacing `_version`). Not per-scope. This means cross-entity causality is not tracked — but that's fine because LWW only cares about same-entity ordering. +HLC lives **per-entity record**, stored on the record itself in a dedicated `_hlc` field (see §12.2). Not per-scope. This means cross-entity causality is not tracked — but that's fine because LWW only cares about same-entity ordering. ## 6. API surface changes @@ -187,7 +187,7 @@ Add: syncMode?: 'mqdb' | 'peer'; // default 'mqdb' for backward compat ``` -No other config additions. `syncTopicPrefix`, `responseTopicPrefix`, etc. still work. +No other config additions. `syncTopicPrefix`, `responseTopicPrefix`, etc. still work. `syncMode` is immutable per store — see §12.1. ### 6.2 `replaceScope` semantics @@ -338,7 +338,7 @@ Rough LOC budget for v1 (single-scope, no top-level entity discovery): Once this design is approved: 1. **Phase 0**: this doc, plus the `$SYS` fix (already done in `fix/response-topic-default`). -2. **Phase 1**: HLC implementation + replace `_version` compare in `applyMutationToDb` (gated by mode flag, default `mqdb` keeps numeric `_version`). Tests. +2. **Phase 1**: HLC type, `_hlc` field on peer-mode records, mode-gated branch in `applyMutationToDb` (default `mqdb` keeps numeric `_version`). Mode persistence + `ModeMismatchError` on boot (§12.1). Tests. 3. **Phase 2**: Tombstone storage in PersistenceLayer. Tests. 4. **Phase 3**: Peer-mode SyncEngine — mutations as fire-and-forget events, `request()` throws. 5. **Phase 4**: `hello` protocol — client publishes, server reacts, manifest diffing. @@ -347,6 +347,69 @@ Once this design is approved: Each phase is independently shippable behind the mode flag. +## 12. Mode interaction and migration + +The two modes share a codebase, a config type, an IndexedDB schema, and (potentially) a broker. None of that is enough to make them interoperable. This section spells out the invariants the implementation must enforce so that "switching mode" or "running mixed clients" produces a loud error rather than silent data corruption. + +### 12.1 `syncMode` is per-store, boot-time, immutable + +Set once when `createStore()` is first called against a given local database, and persisted in store metadata on first init. On subsequent boots, the store reads the persisted mode and asserts it matches the config-provided mode: + +- Match → proceed. +- Mismatch → throw `ModeMismatchError` with guidance: "Local data was created in mode X; either revert to X, or call `clearLocalData()` to start fresh in mode Y." + +This is cheaper than a real migration and matches the target use cases (a project picks peer at the start, or it doesn't). Apps that legitimately need to switch take the data loss explicitly. + +A fresh-install database has no persisted mode; the first init writes whatever is configured. + +### 12.2 Versioning fields are mode-segregated + +| Mode | Field on each record | +|---|---| +| `mqdb` | `_version: number` (server-managed, monotonic) | +| `peer` | `_hlc: { ts, counter, nodeId }` (client-managed) | + +`_version` and `_hlc` are independent fields. Peer-mode records do not write `_version`; MQDB-mode records do not write `_hlc`. `applyMutationToDb`'s LWW compare dispatches on mode and reads only the corresponding field — no runtime polymorphism, no Number-vs-Object narrowing. + +This supersedes earlier wording in §5.3 and §5.4 that described HLC as "replacing" `_version`. The earlier draft was wrong: keeping both fields is simpler, isolates the modes' on-disk representations, and means a stray cross-mode record (which 12.1 should prevent but defense in depth is cheap) doesn't silently miscompare. + +### 12.3 A scope is single-mode, not mixed + +All clients participating in a scope must use the same `syncMode`. The wire formats are not interoperable: + +- MQDB clients publish CRUD requests on `$DB/{entity}/create` and consume server-emitted events carrying numeric `_version`. +- Peer clients publish events directly on `$DB/{root}/{scope}/events/{type}` carrying `_hlc`. + +If both run on the same broker scope, MQDB clients ignore peer events (no HLC handler; numeric compare against an object is undefined) and peer clients see server-mediated events that lack `_hlc`. Each side becomes invisible to the other and their writes diverge. + +Two enforcement levels: + +1. **Documentation** — name this in user-facing docs. +2. **Mechanical** — peer mode defaults `syncTopicPrefix` to a distinct value (e.g. `$DB-peer`), making cross-mode topic collisions impossible at the broker level. Operators with custom prefixes already handle their own namespacing. + +Recommendation: ship both. Default-prefix isolation for the common case; documentation for the override case. + +### 12.4 Failure modes if invariants are violated + +| Scenario | Symptom | +|---|---| +| Same client switches modes without `clearLocalData()` | `ModeMismatchError` on init (12.1) — loud, recoverable | +| Mixed clients on same scope, default prefixes (12.3 enforced mechanically) | Impossible — topic prefixes don't overlap | +| Mixed clients on same scope, both overriding `syncTopicPrefix` to the same value | Each side silently drops the other's events; offline-queue mutations queued under one mode flush against the other's wire format and vanish | +| Peer mode reads MQDB-shaped records (only via 12.1 bypass) | `_hlc` undefined → treated as "lowest possible HLC" → incoming remote always wins → local edits look reverted | +| MQDB mode reads peer-shaped records (only via 12.1 bypass) | `_version` undefined → server reconciliation overwrites local on first sync | +| Peer-mode `_tombstones` table persists into a later MQDB-mode run | Inert; never read in MQDB. No corruption risk. (Cleared by `clearLocalData()` in any case.) | + +### 12.5 Offline queue and mode + +The offline queue (`pending_sync` table) is local-only and mode-flavored: pending mutations carry the wire shape of whichever mode was active when queued. The queue is persisted alongside the store metadata; on boot, if `syncMode` mismatches, the same `ModeMismatchError` from 12.1 fires before any flush is attempted. Users who run `clearLocalData()` to switch modes lose pending offline mutations — documented behavior, not a bug. + +### 12.6 What this section does NOT cover + +- **Migrating an existing MQDB deployment to peer.** Out of scope. The MQDB server has no HLC concept, so any migration would require a one-time export from MQDB, manufacture HLCs at import time (e.g. `{ts: server_updated_at, counter: 0, nodeId: 'mqdb-import'}`), and accept that the synthetic HLCs become the floor for all subsequent compares. Not v1. +- **Per-scope mode within a single store.** Not supported — `syncMode` is store-wide. +- **CRDT-backed records (Appendix B)** if added later: would replace `_hlc` with the CRDT's internal versioning. The mode-segregation invariant in 12.2 still applies; `_version` and the CRDT field stay distinct. + --- ## Appendix A: Why not retained topics From 53b6b9ac9b1aba3ba2f307288ca3e9f51111b4a5 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabr=C3=ADcio=20Bracht?= Date: Sun, 3 May 2026 11:35:03 -0700 Subject: [PATCH 2/8] =?UTF-8?q?address=20review=20on=20=C2=A712:=20non-$?= =?UTF-8?q?=20peer=20prefix,=20clearLocalData=20scope,=20refine=20failure-?= =?UTF-8?q?mode=20wording?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/design/reactive-peer-mode.md | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/docs/design/reactive-peer-mode.md b/docs/design/reactive-peer-mode.md index 0886602..55223aa 100644 --- a/docs/design/reactive-peer-mode.md +++ b/docs/design/reactive-peer-mode.md @@ -338,7 +338,7 @@ Rough LOC budget for v1 (single-scope, no top-level entity discovery): Once this design is approved: 1. **Phase 0**: this doc, plus the `$SYS` fix (already done in `fix/response-topic-default`). -2. **Phase 1**: HLC type, `_hlc` field on peer-mode records, mode-gated branch in `applyMutationToDb` (default `mqdb` keeps numeric `_version`). Mode persistence + `ModeMismatchError` on boot (§12.1). Tests. +2. **Phase 1**: HLC type, `_hlc` field on peer-mode records, mode-gated branch in `applyMutationToDb` (default `mqdb` keeps numeric `_version`). Mode persistence + `ModeMismatchError` on boot + `Store.clearLocalData()` recovery API (§12.1). Tests. 3. **Phase 2**: Tombstone storage in PersistenceLayer. Tests. 4. **Phase 3**: Peer-mode SyncEngine — mutations as fire-and-forget events, `request()` throws. 5. **Phase 4**: `hello` protocol — client publishes, server reacts, manifest diffing. @@ -356,12 +356,14 @@ The two modes share a codebase, a config type, an IndexedDB schema, and (potenti Set once when `createStore()` is first called against a given local database, and persisted in store metadata on first init. On subsequent boots, the store reads the persisted mode and asserts it matches the config-provided mode: - Match → proceed. -- Mismatch → throw `ModeMismatchError` with guidance: "Local data was created in mode X; either revert to X, or call `clearLocalData()` to start fresh in mode Y." +- Mismatch → throw `ModeMismatchError` with guidance pointing the caller at the recovery path described below. This is cheaper than a real migration and matches the target use cases (a project picks peer at the start, or it doesn't). Apps that legitimately need to switch take the data loss explicitly. A fresh-install database has no persisted mode; the first init writes whatever is configured. +**Recovery path** — stitch does not currently expose an API to wipe its local state. Phase 1 will add `Store.clearLocalData(): Promise` (deletes the IndexedDB database underlying `MemoryStore` + `PersistenceLayer`, including the offline queue and persisted mode metadata). Until that lands, callers can use the platform escape hatch — `indexedDB.deleteDatabase(dbName)` followed by reload — but the supported flow once Phase 1 ships is `await store.clearLocalData()` then re-`createStore()` with the new `syncMode`. + ### 12.2 Versioning fields are mode-segregated | Mode | Field on each record | @@ -385,24 +387,26 @@ If both run on the same broker scope, MQDB clients ignore peer events (no HLC ha Two enforcement levels: 1. **Documentation** — name this in user-facing docs. -2. **Mechanical** — peer mode defaults `syncTopicPrefix` to a distinct value (e.g. `$DB-peer`), making cross-mode topic collisions impossible at the broker level. Operators with custom prefixes already handle their own namespacing. +2. **Mechanical** — peer mode defaults `syncTopicPrefix` to a non-`$` value (e.g. `stitch/peer`), making cross-mode topic collisions impossible at the broker level. The prefix deliberately avoids the `$` reservation called out by MQTT 5 §4.7.2 — the same reason PR #11 moved `responseTopicPrefix` off `$SYS`. Operators with custom prefixes already handle their own namespacing. Recommendation: ship both. Default-prefix isolation for the common case; documentation for the override case. +> **Note on the existing `$DB` default.** MQDB-mode `syncTopicPrefix` still defaults to `$DB`, which is technically also under the §4.7.2 reservation but works on most production brokers in practice. Migrating MQDB-mode off `$DB` is out of scope for this design — peer mode just shouldn't inherit the same liability for a brand-new wire protocol. + ### 12.4 Failure modes if invariants are violated | Scenario | Symptom | |---|---| -| Same client switches modes without `clearLocalData()` | `ModeMismatchError` on init (12.1) — loud, recoverable | +| Same client switches modes without wiping local state | `ModeMismatchError` on init (12.1) — loud, recoverable via `clearLocalData()` | | Mixed clients on same scope, default prefixes (12.3 enforced mechanically) | Impossible — topic prefixes don't overlap | -| Mixed clients on same scope, both overriding `syncTopicPrefix` to the same value | Each side silently drops the other's events; offline-queue mutations queued under one mode flush against the other's wire format and vanish | +| Mixed clients on same scope, both overriding `syncTopicPrefix` to the same value | Silent overwrite, not data loss: each side receives the other's events but skips its version check (the relevant field is absent), so `applyMutationToDb` applies the foreign mutation unconditionally. Last-write-wins ordering is broken in both directions and local edits get clobbered by stale records. | | Peer mode reads MQDB-shaped records (only via 12.1 bypass) | `_hlc` undefined → treated as "lowest possible HLC" → incoming remote always wins → local edits look reverted | -| MQDB mode reads peer-shaped records (only via 12.1 bypass) | `_version` undefined → server reconciliation overwrites local on first sync | -| Peer-mode `_tombstones` table persists into a later MQDB-mode run | Inert; never read in MQDB. No corruption risk. (Cleared by `clearLocalData()` in any case.) | +| MQDB mode reads peer-shaped records (only via 12.1 bypass) | `_version` undefined → MQDB's `typeof remoteVersion === 'number'` guard skips and the remote mutation is applied without version check; server reconciliation later overwrites local | +| Peer-mode `_tombstones` table persists into a later MQDB-mode run | Inert; never read in MQDB. No corruption risk. (Wiped along with the rest of the local DB by `clearLocalData()`.) | ### 12.5 Offline queue and mode -The offline queue (`pending_sync` table) is local-only and mode-flavored: pending mutations carry the wire shape of whichever mode was active when queued. The queue is persisted alongside the store metadata; on boot, if `syncMode` mismatches, the same `ModeMismatchError` from 12.1 fires before any flush is attempted. Users who run `clearLocalData()` to switch modes lose pending offline mutations — documented behavior, not a bug. +The offline queue (`pending_sync` table) is local-only and mode-flavored: pending mutations carry the wire shape of whichever mode was active when queued. The queue is persisted alongside the store metadata; on boot, if `syncMode` mismatches, the same `ModeMismatchError` from 12.1 fires before any flush is attempted. Users who wipe local state to switch modes lose pending offline mutations — documented behavior, not a bug. ### 12.6 What this section does NOT cover From c5d492afe817a0b87dadda2ba8b5a0b98ed4e0e1 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabr=C3=ADcio=20Bracht?= Date: Sun, 3 May 2026 15:32:48 -0700 Subject: [PATCH 3/8] =?UTF-8?q?apply=20editorial=20fixes=20to=20=C2=A75.3,?= =?UTF-8?q?=20=C2=A712.2,=20=C2=A712.6=20of=20peer-mode=20design?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/design/reactive-peer-mode.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/design/reactive-peer-mode.md b/docs/design/reactive-peer-mode.md index 55223aa..e961c7b 100644 --- a/docs/design/reactive-peer-mode.md +++ b/docs/design/reactive-peer-mode.md @@ -171,7 +171,7 @@ if (typeof remoteVersion === 'number') { In peer mode, the LWW compare reads a dedicated `_hlc` field instead of `_version`. The compare is lexicographic on `(ts, counter, nodeId)`. MQDB mode continues to use numeric `_version`. The two fields coexist on the type but only one is written per mode — see §12.2 for why we keep both rather than overloading `_version`. -The mode flag is read once and cached at construction; the LWW branch in `applyMutationToDb` dispatches on it. +At runtime, the mode flag is read once and cached at construction. The LWW branch in `applyMutationToDb` dispatches on the cached value. ### 5.4 HLC location @@ -369,7 +369,7 @@ A fresh-install database has no persisted mode; the first init writes whatever i | Mode | Field on each record | |---|---| | `mqdb` | `_version: number` (server-managed, monotonic) | -| `peer` | `_hlc: { ts, counter, nodeId }` (client-managed) | +| `peer` | `_hlc: { ts: number; counter: number; nodeId: string }` (client-managed) | `_version` and `_hlc` are independent fields. Peer-mode records do not write `_version`; MQDB-mode records do not write `_hlc`. `applyMutationToDb`'s LWW compare dispatches on mode and reads only the corresponding field — no runtime polymorphism, no Number-vs-Object narrowing. @@ -391,8 +391,6 @@ Two enforcement levels: Recommendation: ship both. Default-prefix isolation for the common case; documentation for the override case. -> **Note on the existing `$DB` default.** MQDB-mode `syncTopicPrefix` still defaults to `$DB`, which is technically also under the §4.7.2 reservation but works on most production brokers in practice. Migrating MQDB-mode off `$DB` is out of scope for this design — peer mode just shouldn't inherit the same liability for a brand-new wire protocol. - ### 12.4 Failure modes if invariants are violated | Scenario | Symptom | @@ -411,6 +409,7 @@ The offline queue (`pending_sync` table) is local-only and mode-flavored: pendin ### 12.6 What this section does NOT cover - **Migrating an existing MQDB deployment to peer.** Out of scope. The MQDB server has no HLC concept, so any migration would require a one-time export from MQDB, manufacture HLCs at import time (e.g. `{ts: server_updated_at, counter: 0, nodeId: 'mqdb-import'}`), and accept that the synthetic HLCs become the floor for all subsequent compares. Not v1. +- **Migrating MQDB-mode off `$DB`.** MQDB-mode `syncTopicPrefix` still defaults to `$DB`, which is technically also under the §4.7.2 reservation but works on most production brokers in practice. Out of scope for this design — peer mode just shouldn't inherit the same liability for a brand-new wire protocol. - **Per-scope mode within a single store.** Not supported — `syncMode` is store-wide. - **CRDT-backed records (Appendix B)** if added later: would replace `_hlc` with the CRDT's internal versioning. The mode-segregation invariant in 12.2 still applies; `_version` and the CRDT field stay distinct. From 8e072c257c51e39eb92e33cf190ad83da23171cb Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabr=C3=ADcio=20Bracht?= Date: Sun, 3 May 2026 15:37:28 -0700 Subject: [PATCH 4/8] =?UTF-8?q?specify=20ModeMismatchError=20contract,=20c?= =?UTF-8?q?learLocalData=20scope,=20peer=20prefix=20default,=20=C2=A712.4/?= =?UTF-8?q?=C2=A712.5=20wording?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/design/reactive-peer-mode.md | 32 +++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/docs/design/reactive-peer-mode.md b/docs/design/reactive-peer-mode.md index e961c7b..0ed1ae7 100644 --- a/docs/design/reactive-peer-mode.md +++ b/docs/design/reactive-peer-mode.md @@ -362,7 +362,27 @@ This is cheaper than a real migration and matches the target use cases (a projec A fresh-install database has no persisted mode; the first init writes whatever is configured. -**Recovery path** — stitch does not currently expose an API to wipe its local state. Phase 1 will add `Store.clearLocalData(): Promise` (deletes the IndexedDB database underlying `MemoryStore` + `PersistenceLayer`, including the offline queue and persisted mode metadata). Until that lands, callers can use the platform escape hatch — `indexedDB.deleteDatabase(dbName)` followed by reload — but the supported flow once Phase 1 ships is `await store.clearLocalData()` then re-`createStore()` with the new `syncMode`. +**`ModeMismatchError` contract** — a named subclass of `Error` exported from the package root: + +```ts +class ModeMismatchError extends Error { + readonly name: 'ModeMismatchError'; + readonly expected: 'mqdb' | 'peer'; // mode persisted in the local DB + readonly actual: 'mqdb' | 'peer'; // mode requested by current config + readonly dbName: string; // IndexedDB name, for the recovery call +} +``` + +Message format: `` `stitch: store at "${dbName}" was initialized in ${expected} mode but config requested ${actual}; call store.clearLocalData() and re-create the store to switch modes` ``. The structured fields let callers branch programmatically without parsing the message. + +**Recovery path** — stitch does not currently expose an API to wipe its local state. Phase 1 will add `Store.clearLocalData(): Promise`, with this contract: + +- Disconnects the MQTT client (idempotent — safe to call when already disconnected). +- Deletes the IndexedDB database underlying `MemoryStore` + `PersistenceLayer`, including the offline queue and persisted mode metadata. +- Clears the stitch-owned `sessionStorage` keys: `stitch_client_id`, `stitch_cached_user`, `stitch_pending_logout`. +- After the promise resolves, the existing `Store` instance is **unusable** — all further method calls reject with `StoreDisposedError`. The caller must obtain a new instance via `createStore()`. + +Until Phase 1 ships, callers can use the platform escape hatch — `indexedDB.deleteDatabase(dbName)` followed by reload — but the supported flow is `await store.clearLocalData()` then re-`createStore()` with the new `syncMode`. ### 12.2 Versioning fields are mode-segregated @@ -387,7 +407,7 @@ If both run on the same broker scope, MQDB clients ignore peer events (no HLC ha Two enforcement levels: 1. **Documentation** — name this in user-facing docs. -2. **Mechanical** — peer mode defaults `syncTopicPrefix` to a non-`$` value (e.g. `stitch/peer`), making cross-mode topic collisions impossible at the broker level. The prefix deliberately avoids the `$` reservation called out by MQTT 5 §4.7.2 — the same reason PR #11 moved `responseTopicPrefix` off `$SYS`. Operators with custom prefixes already handle their own namespacing. +2. **Mechanical** — peer mode defaults `syncTopicPrefix` to `stitch/peer`, a non-`$` value chosen so cross-mode topic collisions are impossible at the broker level. The prefix deliberately avoids the `$` reservation called out by MQTT 5 §4.7.2 — the same reason PR #11 moved `responseTopicPrefix` off `$SYS`. Operators with custom prefixes already handle their own namespacing. Recommendation: ship both. Default-prefix isolation for the common case; documentation for the override case. @@ -397,14 +417,18 @@ Recommendation: ship both. Default-prefix isolation for the common case; documen |---|---| | Same client switches modes without wiping local state | `ModeMismatchError` on init (12.1) — loud, recoverable via `clearLocalData()` | | Mixed clients on same scope, default prefixes (12.3 enforced mechanically) | Impossible — topic prefixes don't overlap | -| Mixed clients on same scope, both overriding `syncTopicPrefix` to the same value | Silent overwrite, not data loss: each side receives the other's events but skips its version check (the relevant field is absent), so `applyMutationToDb` applies the foreign mutation unconditionally. Last-write-wins ordering is broken in both directions and local edits get clobbered by stale records. | +| Mixed clients on same scope, both overriding `syncTopicPrefix` to the same value | Silent data loss: each side receives the other's events but skips its version check (the relevant field is absent), so `applyMutationToDb` applies the foreign mutation unconditionally. Last-write-wins ordering is broken in both directions and local edits get clobbered by stale records. | | Peer mode reads MQDB-shaped records (only via 12.1 bypass) | `_hlc` undefined → treated as "lowest possible HLC" → incoming remote always wins → local edits look reverted | | MQDB mode reads peer-shaped records (only via 12.1 bypass) | `_version` undefined → MQDB's `typeof remoteVersion === 'number'` guard skips and the remote mutation is applied without version check; server reconciliation later overwrites local | | Peer-mode `_tombstones` table persists into a later MQDB-mode run | Inert; never read in MQDB. No corruption risk. (Wiped along with the rest of the local DB by `clearLocalData()`.) | ### 12.5 Offline queue and mode -The offline queue (`pending_sync` table) is local-only and mode-flavored: pending mutations carry the wire shape of whichever mode was active when queued. The queue is persisted alongside the store metadata; on boot, if `syncMode` mismatches, the same `ModeMismatchError` from 12.1 fires before any flush is attempted. Users who wipe local state to switch modes lose pending offline mutations — documented behavior, not a bug. +The offline queue (`pending_sync` table) is local-only and mode-flavored: pending mutations carry the wire shape of whichever mode was active when queued. The queue is persisted alongside the store metadata; on boot, if `syncMode` mismatches, the same `ModeMismatchError` from 12.1 fires before any flush is attempted. + +A fresh-install database is consistent by construction: no persisted mode means no queue (the queue table doesn't exist until first init writes the mode), so there is no "pending mutations under an unknown mode" state to reason about. The mismatch check is reachable only after at least one successful init has written both the mode and the queue table together. + +Users who wipe local state to switch modes lose any pending offline mutations — documented behavior, not a bug. ### 12.6 What this section does NOT cover From 3dac67b031bfb126c93b7a190baf12338afd6f48 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabr=C3=ADcio=20Bracht?= Date: Tue, 5 May 2026 09:10:57 -0700 Subject: [PATCH 5/8] add reactive scope-open design doc --- docs/design/reactive-scope-open.md | 275 +++++++++++++++++++++++++++++ 1 file changed, 275 insertions(+) create mode 100644 docs/design/reactive-scope-open.md diff --git a/docs/design/reactive-scope-open.md b/docs/design/reactive-scope-open.md new file mode 100644 index 0000000..f9a8720 --- /dev/null +++ b/docs/design/reactive-scope-open.md @@ -0,0 +1,275 @@ +# Design: Reactive scope-open for server-backed sync + +**Status**: Draft for review. Not implemented. + +**Purpose**: Remove the blocking `Promise.all` of N request/reply round-trips inside `replaceScope` by having the server stream per-record events instead of replying to batched `fetchList` queries. The client subscribes once and reacts to whatever arrives — no `Promise.all`, no per-call timeout, no "loaded" gate. + +This is a **design proposal** for the existing server-backed sync flow. It is intentionally orthogonal to the peer-mode design in `reactive-peer-mode.md` — they share a target shape (one streaming channel, no client-side deadlines) but address different deployments. + +--- + +## 1. Motivation + +### 1.1 What `replaceScope` does today + +`store.replaceScope` (`store.ts:484`) → `sync.openScope` (`sync-engine.ts:563`) does: + +1. Subscribe to `${prefix}/${rootEntity}/${scopeId}/#` (streaming, reactive). +2. `await Promise.all([fetchOne(rootEntity, scopeId), ...childEntities.map(e => fetchList(e, scopeId))])` — N parallel one-shot request/reply round-trips, each capped at 10 s by the timeout in `request()` (`sync-engine.ts:752`). +3. While step 2 is in flight, mutation events arriving on the events topic are buffered in `this.buffered` (line 552) instead of being applied immediately. +4. After step 2 resolves, `openScope` drains the buffer, returns the assembled `ScopeState`, and `replaceScope` reconciles it into IndexedDB + MemoryStore. + +The user-visible "wait" on `await store.replaceScope(id)` is step 2: stitch holds the promise until every child entity's list reply has arrived. With slow networks or large scopes this dominates time-to-interactive. + +### 1.2 Why this is the wrong shape + +Stitch already has a fully reactive subscription path (Pattern A in §2 of `reactive-peer-mode.md`). The events topic streams mutations with no timer, no deadline, callback-per-message. `openScope` *also* uses it, but only for events that occur *during* step 2 — once step 2 resolves, the buffer drains and from then on the subscription is the only path. + +There is no good reason for the initial state to come through a different channel from subsequent mutations. The split exists because the server emits a batched reply to `fetchList` instead of a stream of per-record events. + +Collapsing both paths onto the events topic produces: + +- One channel for both bootstrap and ongoing sync. +- No `Promise.all`, no 10 s deadline on scope-open. +- `replaceScope` returns as soon as the subscription is established. UI populates reactively as records stream in. +- Same bootstrap shape as peer mode (§4.2 of `reactive-peer-mode.md`), with the server filling the peer role. + +### 1.3 What this design does NOT change + +- **Mutations stay request/reply.** Writes need synchronous validation (ownership, constraint violations, server-canonical fields). Removing the await from `createEntity` / `updateEntity` / `deleteEntity` is a separate, harder design problem and is explicitly out of scope here. +- **The 10 s `request()` timeout stays.** It still applies to mutations and to any other request/reply call. This design simply removes scope-open from the set of callers. +- **The events topic structure stays.** Same `${prefix}/${rootEntity}/${scopeId}/events/{type}` topics, same payload shape, same `x-origin-client-id` filter. Only the *bootstrap delivery* changes. + +## 2. Wire-level flow + +### 2.1 Client publishes a hello + +When `replaceScope(scopeId)` is called and the client is connected, stitch publishes one message to a new topic: + +``` +${prefix}/${rootEntity}/${scopeId}/hello +``` + +Payload: + +```json +{ + "clientId": "", + "manifest": [ + { "id": "", "version": }, + ... + ] +} +``` + +The manifest is the union of root + children currently in local IndexedDB for `scopeId`. For a fresh device, the manifest is `[]`. + +QoS 1, `retain=false`. No response topic, no correlation data — this is fire-and-forget from the client's perspective. + +### 2.2 Server reacts + +The server (MQDB or any compatible backend) subscribes to `${prefix}/+/+/hello` (or its tenant-scoped equivalent). On receiving a hello: + +1. Validate the requesting client's auth and access to `scopeId` using existing `userScopeField` rules. If denied, publish a single error event on the client's response topic (`${responsePrefix}/${clientId}/scope-open-error`) and stop. The error event carries `{scopeId, code, message}`. (Mechanism mirrors today's 401/403 paths in `checkResponseAndAuth` at `sync-engine.ts:628`.) +2. Diff the requesting client's manifest against server-side state for that scope: + - For each record the server has that the client lacks, or has at a lower `_version`: publish a synthetic `events/created` (or `events/updated`) event on the scope topic carrying the full record. The `sender` field on these synthetic events is `'__server_replay__'` (a reserved value clients treat as non-self). + - For each record in the client's manifest that the server has marked deleted: publish a synthetic `events/deleted` event. + - For each record in the client's manifest that the server's state matches at the same `_version`: emit nothing. +3. Once the diff is fully published, the server is done. There is no "I'm finished" message — the client never asks for one. + +The synthetic events are indistinguishable on the wire from real mutation events. They flow through `handleWatchMessage` (`sync-engine.ts:447`) and into `applyMutationToDb`, where the existing `_version` LWW compare resolves any conflict against locally-pending writes. + +### 2.3 Client reacts + +After publishing the hello, `replaceScope` returns. The client's events-topic subscription (already established in step 1 of `openScope`) delivers each replay event as it arrives. Each event is applied via the existing `applyMutationToDb` path — same code as a normal remote mutation. + +There is no "scope is loaded" promise resolution. The UI populates as records stream in. Apps that need a "loaded" gate implement one themselves (e.g., observe MemoryStore for stability) — same recommendation as peer mode (§6.2 of `reactive-peer-mode.md`). + +### 2.4 Concurrent mutations during replay + +A peer (or the user themselves on another device) mutates a record while the server is mid-replay. The mutation event flows on the same scope events topic. The client receives both the replay event and the live mutation event in publish order, applies both via `applyMutationToDb`, and `_version` LWW resolves any overlap. No special handling needed. + +### 2.5 `bumpScopeVersion` goes away + +Today every child mutation triggers a *second* request/reply (`sync-engine.ts:668, 685, 695`) to bump the root entity's version field. In the new flow: + +- The server bumps the scope version implicitly when it processes any child mutation. +- The server emits an `events/updated` event for the root entity on the same topic, carrying the new `_version` (and `updatedAt`). +- The client receives that event reactively. No round-trip. + +This halves the per-mutation latency and removes a hidden second 10 s timeout from every awaited write. + +## 3. API surface changes + +### 3.1 `StoreConfig` + +No new config field. The existing `syncMode` field from peer mode could in principle gain a `'mqdb-reactive'` value, but a cleaner cut is: + +- This change replaces `syncMode: 'mqdb'` semantics. Old MQDB servers that don't support hello are handled via capability negotiation (§5). +- `syncMode: 'peer'` is unchanged. + +### 3.2 `replaceScope` semantics + +| Before | After | +|---|---| +| Awaits `openScope` → server returns full ScopeState → reconcile + load → resolve. UI sees complete state on resolve. | Subscribes to events, publishes hello, returns immediately. UI starts with whatever's in local IndexedDB. State populates over time as server replay events arrive. | + +This is a deliberate API contract change. App code that today does `await store.replaceScope(id); /* assume loaded */` must handle "still populating" UI states. Same trade-off and same guidance as peer mode. + +### 3.3 `request()` + +Unchanged. Mutations still use it. The 10 s timeout still applies to mutations, `bumpScopeVersion` (now unused — see §2.5), and any other one-shot request/reply. + +### 3.4 `fetchList` / `fetchOne` + +No longer called by `openScope`. They remain available for ad-hoc queries that don't want to subscribe (rare today, may be removed entirely if no caller remains). + +### 3.5 `OpenScopeResult` + +`openScope` returns immediately. The current `ScopeState` shape (`root`, `children`, `version`, `bufferedMutations`) is replaced by `Promise` — the engine has nothing meaningful to return synchronously. State populates via the existing mutation handler. + +## 4. What changes in the existing code + +### 4.1 `src/sync-engine.ts` + +- `openScope`: drop the `Promise.all` block (lines 576–589). Subscribe to the scope topic, publish a hello message, return. The `awaitingState` / `buffered` machinery (lines 568–570, 595–610) becomes unnecessary because there is no longer an "await snapshot" phase to buffer around. +- `bumpScopeVersion`: delete. The server handles version bumping implicitly. +- `createEntity` / `updateEntity` / `deleteEntity`: drop the trailing `await this.bumpScopeVersion(scopeId)` call (lines 668, 685, 695). One round-trip per mutation instead of two. +- New: handle a `scope-open-error` response on the response topic so that scope-open auth failures still surface as a rejected promise on `replaceScope` (or as an emitted error event on the store, depending on the chosen contract — see Q1). +- `subscribeToTopLevel`: same hello-based pattern applied at the top-level entity discovery topic (`${prefix}/${rootEntity}/+/hello` for the server's perspective, `${prefix}/${rootEntity}/discovery/hello` or similar for the client). Open question — see §5. + +Estimated diff: net negative LOC. The subscribe-and-replay path is mostly *removal* of the buffering / Promise.all logic. + +### 4.2 `src/remote-sync-layer.ts` + +- `openScope` callers: no longer receive a fully-populated `ScopeState`. Reconciliation work currently in `replaceScope` (lines 504–545) — `reconcileChildren`, the `bufferedMutations` drain — moves into the streaming path. Each replay event flows through `applyMutationToDb` like any other remote mutation. +- `applyMutationToDb`: unchanged. Treats replay events identically to live mutations. The `sender === '__server_replay__'` value is not own-mutation; the existing filter at `isOwnMutation` (sync-engine.ts:417) already accepts it. +- `reconcileChildren`: only meaningful in the old batched-snapshot flow. Likely deletable. The same job — "delete locally records the server doesn't have" — still needs to happen, but it now happens reactively as the client receives `events/deleted` for those records during replay. + +### 4.3 `src/store.ts` + +- `replaceScope` (line 484): the long path from line 502 onward (snapshot reconciliation, `loadScopeFromPersistence`, `loadRootIntoMemory`) collapses to: ensure subscription is established, ensure local IndexedDB state is loaded into MemoryStore, return. State updates flow through the existing mutation handler. +- The `setSuppressNotifications(true)` block (line 499) becomes unnecessary — there is no longer a batched-load-then-notify cycle to suppress around. + +### 4.4 Server (MQDB) + +This is the load-bearing dependency. The server must: + +- Subscribe to `${prefix}/+/+/hello` (or per-tenant equivalent). +- Implement the manifest diff and replay logic in §2.2. +- Emit per-record events on the scope topic instead of (or in addition to) replying to `fetchList` / `fetchOne` requests. +- Emit version-bump events on child mutations. +- Surface auth failures via `scope-open-error` events on the response topic. + +The server changes are larger than the client changes. This design only works if the server is willing to make them. + +## 5. Open questions + +### Q1: Auth failure surfacing + +Today `openScope` rejects with the server's 401/403 reply. In the reactive flow, `replaceScope` has already returned by the time the server validates the hello. + +Options: + +- **Reject `replaceScope` only on transport errors; surface auth failures via a store-level event** (e.g. `store.onScopeError(cb)`). Apps that care register a listener. +- **Hold `replaceScope`'s promise open until either the first replay event or an error arrives, with a short timeout** — sneaks a timer back in. Reject. +- **Pre-validate auth at connect time** — server publishes a list of accessible scopes on a per-client topic at session start; `replaceScope` rejects synchronously if `scopeId` is not in that list. Adds a connect-time round-trip but no per-scope-open round-trip. + +Recommendation: store-level event. Matches the reactive philosophy (no synthetic "loaded" gate, no synthetic "load failed" gate). + +### Q2: Backwards compatibility with old MQDB servers + +A client running new stitch against an old MQDB server that doesn't subscribe to hello will receive no replay events. Scope appears empty. + +Options: + +- **Capability negotiation via MQTT 5 User Property on CONNECT.** Server advertises `stitch-protocol-version: 2` (or similar). Client falls back to the old `fetchList` flow if absent. +- **Hard cutover.** Bump stitch's required server version. Document the dependency. +- **Probe-and-fallback.** Client publishes hello with a timeout; if no events arrive in N ms, falls back to `fetchList`. Reintroduces the timer this design is trying to remove. + +Recommendation: capability negotiation. The fallback path keeps the old code intact for one major version, then deletes it. + +### Q3: Top-level entity discovery + +`syncRootEntityList` today iterates all root entities a user can access via `fetchList(rootEntity)` with `userScopeField` filtering on the server. Same problem at this level. + +Recommendation: same hello mechanism at the discovery topic. Client publishes a discovery hello with whatever roots it already has locally; server replays the delta. Out-of-scope details mirror peer mode's §10 (top-level entities). + +### Q4: Replay ordering + +The server publishes replay events in some order. If a child references a root that hasn't replayed yet, the client may briefly see a child without its root in MemoryStore. + +Options: + +- **Server publishes root first, then children** — natural ordering. Documented contract. +- **Client tolerates out-of-order** — already partially true (the existing mutation handler doesn't enforce parent-before-child). Document as "same guarantees as live mutations: eventual consistency, brief inconsistency tolerated." + +Recommendation: document the second. Mirrors normal mutation behavior; no special handling. + +### Q5: Replay flooding for large scopes + +A scope with 10,000 records produces 10,000 replay messages. At small message size this is tens of MB and seconds of broker fan-out. + +Options: + +- **Batched replay events** — one message containing N records. Adds a wire-format variant. +- **Manifest diffing reduces volume** — clients with mostly-up-to-date local state only get the delta. Already in the design. +- **Accept it for v1** — document the limit; revisit if real deployments hit it. + +Recommendation: accept for v1. The manifest diff keeps the common case (returning user) cheap. Cold start of a 10k-record scope is a real cost but bounded. + +## 6. What this design does NOT solve + +- **Reactive mutations.** Writes still use request/reply. Surfacing write validation results without an await is a separate, harder problem. +- **Offline scope-open.** If the client is disconnected, `replaceScope` falls back to whatever's in local IndexedDB. Same as today. Reconnect republishes the hello. +- **Cross-scope queries.** Same as today; out of scope. +- **Strong consistency on scope-open.** The reactive flow is eventually consistent. Apps requiring "see all data before render" need an app-level loaded-gate (observe MemoryStore for stability). + +## 7. Estimated scope + +Rough LOC budget for the client-side change: + +| File | Net change | +|---|---| +| `src/sync-engine.ts` | -150 (remove Promise.all + buffering + bumpScopeVersion) | +| `src/remote-sync-layer.ts` | -100 (remove reconcileChildren snapshot path) | +| `src/store.ts` | -60 (collapse replaceScope reconcile path) | +| `src/types.ts` | ±10 (ScopeState removal, OpenScopeResult shape) | +| Tests | +400 (new replay tests, capability fallback tests) | + +Net: roughly LOC-neutral but structurally simpler. + +Server-side scope is **not estimated here** — depends on MQDB's existing architecture and whoever owns it. This document captures only the protocol contract. + +## 8. Implementation phasing + +1. **Phase 0**: this doc + agreement with MQDB owner on protocol. +2. **Phase 1**: server implements hello handling and per-record replay events behind a feature flag. Existing `fetchList` path stays in parallel. +3. **Phase 2**: capability negotiation. Stitch detects support and routes scope-open through the new path when available. +4. **Phase 3**: `bumpScopeVersion` deleted from stitch; server handles version bump implicitly. +5. **Phase 4**: deprecation window for old `fetchList`-based scope-open. Eventually removed. +6. **Phase 5**: top-level entity discovery moved to same hello pattern (§ Q3). + +Each phase is independently shippable behind capability negotiation. + +## 9. Relationship to peer mode + +The two designs converge on the same client shape: + +- One topic per scope (`events/#`). +- Per-record events for both bootstrap and live updates. +- Hello-driven manifest diff for bootstrap. +- No client-side deadlines. + +The differences are entirely upstream of the events topic: + +| | This design | Peer mode | +|---|---|---| +| Source of truth | Server | Distributed (each peer's IndexedDB) | +| Versioning | `_version: number` (server-managed) | `_hlc: { ts, counter, nodeId }` (client-managed) | +| Auth validation | Server-side at hello + per-mutation reply | None (each peer trusts its own state) | +| Mutation channel | Request/reply (this design preserves it) | Fire-and-forget events | + +Once both ship, the events-topic stream is identical from the client's perspective in both modes. The mode flag (`syncMode`) controls only the source-of-truth and versioning concerns spelled out in §12 of `reactive-peer-mode.md`. + +This is the practical argument for sequencing: ship reactive scope-open first, then peer mode reuses the streaming infrastructure with only the source-of-truth and versioning differences to design. From 785fa6ab033a4751b861fb5f6d2798c8db5af169 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabr=C3=ADcio=20Bracht?= Date: Tue, 5 May 2026 11:50:51 -0700 Subject: [PATCH 6/8] fix MemoryStore filtering, clearScope, create, delete, loadScope, and event scopeId for top-level entities --- src/memory-store.ts | 50 +++++++++++++++++++++++++++++++++++---------- 1 file changed, 39 insertions(+), 11 deletions(-) diff --git a/src/memory-store.ts b/src/memory-store.ts index 73095ce..bbf7400 100644 --- a/src/memory-store.ts +++ b/src/memory-store.ts @@ -50,6 +50,7 @@ class MemoryStoreImpl implements MemoryStore { private readonly config: StoreConfig; private readonly allEntities: string[]; + private readonly topLevelEntitySet: Set; private versions: Map> = new Map(); private listSnapshots: Map[]>>> = @@ -75,13 +76,18 @@ class MemoryStoreImpl implements MemoryStore { constructor(config: StoreConfig) { this.config = config; + this.topLevelEntitySet = new Set(config.topLevelEntities?.map((t) => t.entity) ?? []); this.allEntities = [ config.scope.rootEntity, ...config.scope.childEntities, - ...(config.topLevelEntities?.map((t) => t.entity) ?? []), + ...this.topLevelEntitySet, ]; } + private isTopLevelEntity(entity: string): boolean { + return this.topLevelEntitySet.has(entity); + } + ensureReady(): Promise { if (this._dbReady) return Promise.resolve(); if (!this._initPromise) { @@ -275,13 +281,17 @@ class MemoryStoreImpl implements MemoryStore { this.lastOriginTag = this.originTag; const { rootEntity, scopeField } = this.config.scope; + const isTopLevel = this.isTopLevelEntity(entity); let scopeId: string | undefined; if (event.operation === 'delete') { scopeId = - this.pendingDeleteContext?.scopeId ?? (entity === rootEntity ? event.id : undefined); - if (!scopeId) return; + this.pendingDeleteContext?.scopeId ?? + (entity === rootEntity ? event.id : isTopLevel ? '' : undefined); + if (scopeId === undefined) return; } else if (entity === rootEntity) { scopeId = event.id; + } else if (isTopLevel) { + scopeId = ''; } else { scopeId = event.data?.[scopeField] as string | undefined; if (!scopeId) return; @@ -386,11 +396,20 @@ class MemoryStoreImpl implements MemoryStore { private listRecords(entity: string, scopeId: string): Record[] { if (!this._dbReady || this._corrupted) return []; const { rootEntity, scopeField } = this.config.scope; - const filterField = entity === rootEntity ? 'id' : scopeField; + const isTopLevel = this.isTopLevelEntity(entity); + const listOptions = isTopLevel + ? {} + : { + filters: [ + { + field: entity === rootEntity ? 'id' : scopeField, + op: 'eq', + value: scopeId, + }, + ], + }; try { - return this._db!.listSync(entity, { - filters: [{ field: filterField, op: 'eq', value: scopeId }], - }) as Record[]; + return this._db!.listSync(entity, listOptions) as Record[]; } catch (err) { const wrapped = wrapWasmError(`listSync:${entity}`, err); if (this.isWasmCorrupted(err)) { @@ -447,7 +466,9 @@ class MemoryStoreImpl implements MemoryStore { this.originTag = tag ?? null; const { rootEntity, scopeField } = this.config.scope; const record = - entity === rootEntity ? stripNulls(data) : stripNulls({ ...data, [scopeField]: scopeId }); + entity === rootEntity || this.isTopLevelEntity(entity) + ? stripNulls(data) + : stripNulls({ ...data, [scopeField]: scopeId }); this.db.createSync(entity, record); } catch (err) { console.error(wrapWasmError(`createSync:${entity}`, err)); @@ -483,8 +504,13 @@ class MemoryStoreImpl implements MemoryStore { this.originTag = tag ?? null; const record = this.read(entity, id); if (!record) return; - const scopeField = this.config.scope.scopeField; - this.pendingDeleteContext = { scopeId: record[scopeField] as string }; + const { rootEntity, scopeField } = this.config.scope; + const scopeId = this.isTopLevelEntity(entity) + ? '' + : entity === rootEntity + ? id + : (record[scopeField] as string); + this.pendingDeleteContext = { scopeId }; this.db.deleteSync(entity, id); } catch (err) { console.error(wrapWasmError(`deleteSync:${entity}`, err)); @@ -510,9 +536,10 @@ class MemoryStoreImpl implements MemoryStore { const scopeField = this.config.scope.scopeField; for (const [entity, records] of Object.entries(data)) { + const isTopLevel = this.isTopLevelEntity(entity); for (const record of records) { try { - const rec = { ...record, [scopeField]: scopeId }; + const rec = isTopLevel ? record : { ...record, [scopeField]: scopeId }; newDb.createSync(entity, rec); } catch (err) { console.error(wrapWasmError(`loadScope.createSync:${entity}`, err)); @@ -579,6 +606,7 @@ class MemoryStoreImpl implements MemoryStore { recoveredDuringClear = true; break; } + if (this.isTopLevelEntity(entity)) continue; const records = this.listRecords(entity, scopeId); if (this._corrupted) { recoveredDuringClear = true; From 183c8c476c638dc60e5b24175c7c394e21f076e8 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabr=C3=ADcio=20Bracht?= Date: Tue, 5 May 2026 19:15:18 -0700 Subject: [PATCH 7/8] =?UTF-8?q?consolidate=20server-side=20requirements=20?= =?UTF-8?q?into=20=C2=A74.4.1?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- docs/design/reactive-scope-open.md | 52 ++++++++++++++++++++++++++---- 1 file changed, 46 insertions(+), 6 deletions(-) diff --git a/docs/design/reactive-scope-open.md b/docs/design/reactive-scope-open.md index f9a8720..874bbef 100644 --- a/docs/design/reactive-scope-open.md +++ b/docs/design/reactive-scope-open.md @@ -153,15 +153,55 @@ Estimated diff: net negative LOC. The subscribe-and-replay path is mostly *remov ### 4.4 Server (MQDB) -This is the load-bearing dependency. The server must: +This is the load-bearing dependency. The server changes are larger than the client changes; this design only works if the server is willing to make them. §4.4.1 below consolidates the full set of requirements so the MQDB owner has one place to scan rather than reassembling them from §2.2, §5, and §8. -- Subscribe to `${prefix}/+/+/hello` (or per-tenant equivalent). -- Implement the manifest diff and replay logic in §2.2. -- Emit per-record events on the scope topic instead of (or in addition to) replying to `fetchList` / `fetchOne` requests. -- Emit version-bump events on child mutations. +Headline list (full breakdown in §4.4.1): + +- Subscribe to `${prefix}/+/+/hello` and implement the manifest diff + replay logic in §2.2. +- Emit per-record events on the scope topic instead of replying to `fetchList` / `fetchOne` requests. +- Emit version-bump events on child mutations (replaces the per-mutation client `bumpScopeVersion` round-trip). - Surface auth failures via `scope-open-error` events on the response topic. +- Retain tombstones for deleted records so deletions can be replayed. +- Advertise protocol version on CONNACK so clients can branch. +- Throttle hello requests per client per scope. + +### 4.4.1 Server-side requirements (consolidated) + +For a fresh reader: this is the complete server-facing contract. Each item is sourced from elsewhere in the doc; the cross-references identify where the requirement is motivated. + +**New protocol handlers** + +- [ ] **Subscribe to `${prefix}/+/+/hello`** (per-tenant equivalent acceptable). Source: §2.1. +- [ ] **On hello: validate auth.** Use existing `userScopeField` rules. On failure, publish `{scopeId, code, message}` to `${responsePrefix}/${clientId}/scope-open-error`. Do not publish replay events. Source: §2.2 step 1. +- [ ] **On hello: compute manifest diff.** For each record server-side: if absent from client manifest or at lower `_version`, queue for replay. For each record in client manifest: if server has marked deleted (tombstone) at any version newer than client's, queue a delete-replay. If versions match, emit nothing. Source: §2.2 step 2. +- [ ] **Replay is stateless.** No per-client progress tracking. If the client reconnects mid-replay and republishes hello, the server replays the diff from scratch. Source: §2.4 (implicit; surface explicitly). + +**New broadcast behaviors** + +- [ ] **Synthetic replay events use real `events/{type}` topics** — not a per-client replay channel. They are indistinguishable on the wire from live mutation events, so existing subscribers see them too and apply them via LWW (idempotent — `_version` matches → no-op). The fan-out cost to existing peers is accepted as the price of keeping a single channel. Source: §2.2 step 2 (implicit). +- [ ] **Synthetic events carry `sender: '__server_replay__'`** as an MQTT user property (and/or payload field, matching the existing `sender` shape). This reserved value is not "own mutation" for any client, so all clients receive and apply it. The server is responsible for setting it. Source: §2.2 step 2 + §4.2. +- [ ] **Emit version-bump events on child mutations.** When the server processes a child create / update / delete, it bumps the root's `_version` (and `updatedAt`) and publishes an `events/updated` for the root entity carrying the new version. Replaces the client-driven `bumpScopeVersion` round-trip. Source: §2.5. +- [ ] **Emit per-record events instead of (or in addition to) batched `fetchList` / `fetchOne` replies.** During the deprecation window (§8 Phase 4), both paths can coexist behind capability negotiation. After the window, `fetchList` / `fetchOne` handlers are removed. Source: §3.4 + §8. + +**New persistence requirements** + +- [ ] **Tombstone tracking for deleted records.** The server must retain *some* record of deletions (e.g., a `_deleted_at` column with rows kept indefinitely, or a tombstone table) so it can publish `events/deleted` to a client whose manifest still lists a record the server has since removed. If MQDB today does hard deletes, this is a new schema requirement. Tombstone retention policy (forever vs TTL) is open — see Q2 of `reactive-peer-mode.md` §8 for the same trade-off; same recommendation (TTL with documented offline-window limit). Source: §2.2 step 2. + +**New connection-time advertisement** + +- [ ] **Capability advertisement on CONNACK** via MQTT 5 User Property: e.g. `stitch-protocol-version: 2`. Clients that do not see this property fall back to the legacy `fetchList` flow. This makes incremental rollout possible without a hard cutover. Source: §5/Q2. + +**New operational concerns** + +- [ ] **Hello rate limiting** per client per scope. A misbehaving or malicious client publishing hello in a tight loop triggers replay every time, multiplied by the fan-out to all current subscribers. Suggested floor: at most one replay per client per scope per N seconds (N to be tuned; start at 5–10 s). Server should silently drop or NACK hellos beyond the limit. Source: this section (not previously called out — gap closed here). + +**Deprecation track** + +- [ ] **Phase 4 deletes legacy handlers.** Once capability negotiation has been live for one major version and telemetry confirms no clients are falling back, the server's `fetchList`, `fetchOne`, and (once `bumpScopeVersion` is removed client-side) the `bumpScopeVersion` request handlers can be deleted. Source: §8 Phase 4–5. + +**Out of scope for v1** -The server changes are larger than the client changes. This design only works if the server is willing to make them. +- [ ] **Top-level entity discovery via hello.** Same shape as the per-scope hello but on the discovery topic — server replays the user's accessible scopes. Deferred to a follow-up; v1 keeps `fetchList(rootEntity)` for top-level discovery even after per-scope `replaceScope` goes reactive. Source: §5/Q3. ## 5. Open questions From ccf5fdf0b5c3fe74a54a86d243d8cbca88cfd147 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Fabr=C3=ADcio=20Bracht?= Date: Wed, 6 May 2026 07:12:22 -0700 Subject: [PATCH 8/8] add MQDB agent brief consolidating wire format and server requirements --- docs/design/MQDB-AGENT-BRIEF.md | 296 ++++++++++++++++++++++++++++++++ 1 file changed, 296 insertions(+) create mode 100644 docs/design/MQDB-AGENT-BRIEF.md diff --git a/docs/design/MQDB-AGENT-BRIEF.md b/docs/design/MQDB-AGENT-BRIEF.md new file mode 100644 index 0000000..4a0ca53 --- /dev/null +++ b/docs/design/MQDB-AGENT-BRIEF.md @@ -0,0 +1,296 @@ +# MQDB agent brief — reactive scope-open + +**Audience:** whoever (human or agent) is implementing the MQDB server-side changes for stitch's reactive `replaceScope`. This document is **self-contained**: all wire-format excerpts and protocol contracts are quoted inline so you do not need to read stitch's source tree to do the work. Pointers into stitch source are provided for verification only. + +**Scope:** the changes needed in MQDB to support `docs/design/reactive-scope-open.md`. Peer mode (`docs/design/reactive-peer-mode.md`) is a separate workstream and out of scope for this brief. + +--- + +## 1. What you are implementing in one sentence + +When a stitch client publishes a `hello` message naming a scope, MQDB should diff the client's manifest against server state and emit per-record events on the existing scope events topic, instead of waiting for the client to poll via `fetchList` / `fetchOne` request/reply. + +This collapses bootstrap and live-sync onto one streaming channel and removes ~10s of blocking time from `replaceScope` on slow networks. + +## 2. Why this is needed (60-second version) + +Today every stitch `replaceScope(scopeId)` does: + +1. Subscribe to the scope's events topic (already streaming). +2. **Block** on `Promise.all([fetchOne(root), fetchList(child) × N])` — N parallel request/reply round-trips, each with a 10s client-side timeout. +3. Buffer events arriving during step 2 so they don't get lost. +4. After step 2 resolves, drain the buffer and return. + +Step 2 is the user-visible wait. The new flow removes step 2 entirely: client publishes `hello`, server replays per-record events on the same topic the client is already subscribed to. Client never blocks on a snapshot. + +Full motivation, file:line analysis, and trade-offs: `docs/design/reactive-scope-open.md` §1. + +## 3. Wire format you must produce and consume + +### 3.1 Topic structure (existing — unchanged) + +`syncTopicPrefix` defaults to `$DB` (configurable). `responseTopicPrefix` defaults to `$DB/clients` (configurable). + +``` +$DB/{rootEntity}/{scopeId}/events/{created|updated|deleted} # root mutations +$DB/{rootEntity}/{scopeId}/{childEntity}/events/{created|updated|deleted} # child mutations +$DB/{rootEntity}/{scopeId}/hello # NEW — client subscribe-and-replay request +$DB/clients/{clientId}/{requestId} # response topic for one-shot replies +``` + +`{eventType}` in the topic and `operation` in the payload must agree per the mapping in §3.2. + +Source of truth in stitch: `parseScopedTopic` at `src/sync-engine.ts:487-507` (regex matches both root-own and child topics). + +### 3.2 ChangeEvent payload (the one MQDB must emit on event topics) + +This is the JSON shape stitch parses on every `events/{type}` topic. **Both real mutation events and synthetic replay events must conform.** + +```ts +interface ChangeEvent { + operation: 'Create' | 'Update' | 'Delete'; // capitalized + entity: string; // entity type, e.g. "task" + id: string; // record id (also appears in topic) + data: Record | null; // full record for Create/Update; null OK for Delete + operation_id?: string; // optional idempotency key + sequence?: number; // optional, currently unused by client + sender?: string; // origin client id; see §3.4 +} +``` + +Validation in stitch (`src/sync-engine.ts:460-470`): + +- `id` is a non-empty string. +- `operation` is exactly one of `'Create' | 'Update' | 'Delete'` (note capitalization). +- `data` is an object, `null`, or absent. + +Anything that fails validation is silently dropped by stitch. Make sure replay events conform. + +**Topic-to-operation mapping** (`src/sync-engine.ts:540-545`): the topic suffix is lowercase past tense, the payload field is capitalized infinitive: + +| Topic suffix | Payload `operation` | +|---|---| +| `events/created` | `Create` | +| `events/updated` | `Update` | +| `events/deleted` | `Delete` | + +Stitch overrides the payload's `operation` with the topic-derived value, so they must agree. Use both consistently. + +### 3.3 MQTT user properties + +Stitch sets these on outgoing publishes and reads them on incoming messages. MQDB must do the same. + +| User property | Set by stitch on publish? | Read by stitch on receive? | Purpose | +|---|---|---|---| +| `x-origin-client-id` | Yes (own clientId) | Yes — used in `isOwnMutation` filter | Prevents loop-back: stitch ignores any event whose `x-origin-client-id` matches its own clientId. | + +`isOwnMutation` (`src/sync-engine.ts:519-523`) returns true if **either** `event.sender === clientId` **or** the user property `x-origin-client-id === clientId`. Both mechanisms are honored; either is sufficient. + +For replay events MQDB emits, see §3.5 — the `sender` value is reserved. + +### 3.4 Request/reply envelope (existing — unchanged) + +Mutation requests (which stay request/reply per `reactive-scope-open.md` §1.3) use: + +- **Publish:** topic `$DB/{entity}/create` (or `update` / `delete`), payload is the JSON record. +- **MQTT properties on the request:** `responseTopic = $DB/clients/{clientId}/{requestId}`, `correlationData = utf8({requestId})` where `{requestId}` is a UUID generated by stitch. +- **Reply:** server publishes to the `responseTopic` the JSON shape: + +```ts +interface RequestResponse { + status: 'ok' | 'error'; + code?: number; // 401 = session invalid, 403 = ownership violation, others = generic error + message?: string; // human-readable error + data?: unknown; // payload on success (e.g. created record with server-assigned fields) +} +``` + +Source: `request()` at `src/sync-engine.ts:745-783` (request side), `checkResponseAndAuth` at `src/sync-engine.ts:628-640` (response side). + +### 3.5 NEW: Synthetic replay events + +When MQDB processes a `hello` (§4) and emits replay events, those events: + +- Use the **same topics** as live mutations (`events/created` / `events/updated` / `events/deleted`). +- Use the **same payload shape** as §3.2. +- Carry `sender: '__server_replay__'` in the payload **AND** set the MQTT user property `x-origin-client-id` to a server-fixed sentinel (suggest: `'__mqdb_server__'`). Neither value matches any real client id, so all clients (including the requesting one) will accept the events. +- The fan-out cost is intentional: existing peers on the scope receive replay events too. They apply them via the existing `_version` LWW compare (`src/remote-sync-layer.ts:386-399`), which no-ops when the version matches. Bandwidth cost accepted as the price of one channel. + +### 3.6 NEW: `hello` payload (client → server) + +Client publishes one message: + +- **Topic:** `${syncTopicPrefix}/{rootEntity}/{scopeId}/hello` +- **QoS:** 1, `retain: false` +- **No `responseTopic`, no `correlationData`** — fire-and-forget. +- **Payload:** + +```ts +interface HelloPayload { + clientId: string; // requesting client id + manifest: Array<{ id: string; version: number }>; // root + children currently in client's local IndexedDB for this scope +} +``` + +`version` is the numeric `_version` field on each record. For a fresh client, manifest is `[]`. + +### 3.7 NEW: `scope-open-error` reply (server → client, error path) + +When MQDB rejects a `hello` due to auth (the requesting client has no access to `scopeId`), MQDB publishes one message and stops: + +- **Topic:** `${responseTopicPrefix}/{clientId}/scope-open-error` +- **Payload:** + +```ts +interface ScopeOpenError { + scopeId: string; + code: number; // 401 | 403 | other + message: string; // human-readable +} +``` + +This is the only out-of-band reply during scope-open. There is no "I'm done replaying" message on the success path. + +## 4. Server behavior on `hello` + +Pseudo-code for the handler: + +``` +on_hello(message): + payload = parse_json(message.payload) + client_id = payload.clientId + scope_id = extract_scope_id_from_topic(message.topic) + manifest = payload.manifest + + # 1. Auth + if not session_has_access(client_id, scope_id): + publish( + topic = "$DB/clients/{client_id}/scope-open-error", + payload = { scopeId: scope_id, code: 403, message: "..." } + ) + return + + # 2. Rate limit (see §6 of this brief) + if not rate_limiter.allow(client_id, scope_id): + return # silent drop + + # 3. Diff + client_versions = { entry.id: entry.version for entry in manifest } + server_state = load_scope(scope_id) # root + all children + any tombstones + + for record in server_state.records: + client_v = client_versions.get(record.id) + if client_v is None or client_v < record._version: + emit_replay_event(record, "Create" if client_v is None else "Update") + + for tombstone in server_state.tombstones: + client_v = client_versions.get(tombstone.id) + if client_v is not None and client_v < tombstone.deleted_at_version: + emit_replay_event_delete(tombstone) + + # No "done" message. Client never expects one. + +emit_replay_event(record, op): + topic = "$DB/{root}/{scope}/events/{op_lower}" # or .../{child_entity}/events/{op_lower} + payload = { + operation: op, # capitalized + entity: record.entity_type, + id: record.id, + data: record.full_data, + sender: "__server_replay__", + } + user_properties = { "x-origin-client-id": "__mqdb_server__" } + publish(topic, payload, user_properties, qos=1, retain=false) +``` + +**Statelessness:** if the client disconnects mid-replay and republishes `hello` on reconnect, run the handler from scratch. Do not track per-client replay progress. + +## 5. Server behavior on mutations (changes from today) + +Per-mutation request/reply stays. **Two changes:** + +### 5.1 Emit `events/{type}` for every mutation (already done?) + +Whatever path MQDB uses today to broadcast mutations to peers stays in place. Replay events use the same path; just with the `__server_replay__` sender. + +### 5.2 Emit a `events/updated` for the root entity on every child mutation + +Today the client calls `bumpScopeVersion` after each child create/update/delete to bump the root's `_version` and `updatedAt`. In the new flow, **the server does this implicitly** when it processes the child mutation: + +1. Process the child mutation as today. +2. Increment the root entity's `_version`, set `updatedAt = now()`. +3. Emit a synthetic `events/updated` for the root with the new version. + +Removes one round-trip per child mutation from the client. Stitch will stop calling `bumpScopeVersion` (it's deleted in the client-side patch — see `reactive-scope-open.md` §4.1). + +## 6. Server-side requirements (full checklist) + +This is the consolidated checklist from `reactive-scope-open.md` §4.4.1. Each item references its motivating section in that doc. + +**New protocol handlers** + +- [ ] Subscribe to `${syncTopicPrefix}/+/+/hello` (per-tenant equivalent acceptable). [§2.1] +- [ ] On hello: auth-validate against `userScopeField`. On failure, publish `ScopeOpenError` to `${responseTopicPrefix}/{clientId}/scope-open-error`. [§2.2 step 1] +- [ ] On hello: compute manifest diff (server-has-newer + server-has-tombstone-newer). Emit replay events. [§2.2 step 2] +- [ ] Replay is stateless. No per-client progress tracking; reconnect republishes hello and the server replays from scratch. [§2.4] + +**New broadcast behaviors** + +- [ ] Synthetic replay events use real `events/{type}` topics. Fan-out to existing peers is accepted as the cost of one channel. [§2.2] +- [ ] Synthetic events carry `sender: '__server_replay__'` in the payload and `x-origin-client-id: '__mqdb_server__'` as a user property. [§4.2] +- [ ] On every child mutation: bump root `_version` and `updatedAt`, emit a synthetic `events/updated` for the root. Replaces client-driven `bumpScopeVersion`. [§2.5] +- [ ] During the deprecation window (§8 Phase 4): emit per-record events while keeping `fetchList` / `fetchOne` handlers working. After the window, remove the handlers. [§3.4 + §8] + +**New persistence requirements** + +- [ ] Tombstone tracking. The server must retain *some* record of deletions (e.g. a `_deleted_at` column kept indefinitely, or a tombstone table) so it can publish `events/deleted` to a client whose manifest still lists a record the server has since removed. If MQDB today does hard deletes, this is a real schema change. Retention policy (forever vs TTL) is open — see `reactive-peer-mode.md` §8/Q2 for the same trade-off. [§2.2] + +**New connection-time advertisement** + +- [ ] On CONNACK, set the MQTT 5 User Property `stitch-protocol-version: 2` (or whatever version label is agreed). Stitch falls back to the legacy `fetchList` flow if this property is absent. [§5/Q2] + +**New operational concerns** + +- [ ] Hello rate limiting per `(clientId, scopeId)`. Suggested floor: at most one replay per pair per 5–10s. Drop or silently NACK excess. [§4.4.1] + +**Out of scope for v1** + +- [ ] Top-level entity discovery via hello. Same shape but on the discovery topic. Deferred — keep `fetchList(rootEntity)` working for v1. [§5/Q3] + +## 7. Implementation phasing (server side) + +This mirrors `reactive-scope-open.md` §8. The split lets you ship server changes incrementally without a hard cutover on the client. + +1. **Phase 0** — agree on this brief. Confirm tombstone-tracking approach, capability label, rate-limit defaults. +2. **Phase 1** — implement `hello` subscription and replay logic. `fetchList` / `fetchOne` handlers stay alongside. Behind a feature flag if you want to dark-launch. +3. **Phase 2** — emit `stitch-protocol-version: 2` on CONNACK. Stitch starts routing scope-open through the new path. +4. **Phase 3** — implement implicit version-bump on child mutations. Coordinate with stitch removing `bumpScopeVersion` client-side. +5. **Phase 4** — telemetry confirms no clients fall back. Remove `fetchList` / `fetchOne` handlers (and `bumpScopeVersion` request handler) from MQDB. + +Each phase is independently shippable. + +## 8. Open questions you may need to escalate + +These are flagged in `reactive-scope-open.md` §5 but worth pulling forward: + +1. **Tombstone retention TTL.** Forever is unbounded growth. TTL means clients offline longer than TTL see deleted records resurrect. Recommendation: configurable, default 30 days. Document the limit. +2. **Auth-failure surfacing on the client.** Should `replaceScope` reject when a `scope-open-error` arrives, or fire a store-level event? Affects only the client API but you'll want to know which contract the client picks. Recommendation: store-level event. +3. **Rate-limit response on excess.** Silent drop or publish a `ScopeOpenError` with a `code: 429`-style payload? Silent drop is simpler; explicit reply lets buggy clients surface the issue. Recommendation: explicit `code: 429` after the third drop in a window. + +## 9. References (for verification only — this brief is self-contained) + +- `docs/design/reactive-scope-open.md` — full design rationale, client-side changes, phasing, open questions. +- `docs/design/reactive-peer-mode.md` — adjacent design (peer-coordinated mode); §8/Q2 has tombstone retention discussion that applies here. +- `src/types.ts` — `ChangeEvent`, `SyncMutation`, `OwnershipError` definitions. +- `src/sync-engine.ts` — wire-format truth: + - L18-26 — `ChangeEvent` interface. + - L368-421 — root + top-level event handlers (parsing). + - L423-437 — response message handler (for the request/reply path). + - L439-457 — scope subscription (`subscribeToScope`). + - L460-470 — `isValidChangeEvent` validator. + - L487-507 — `parseScopedTopic` regex. + - L509-523 — `extractOriginClientId` / `isOwnMutation`. + - L525-557 — main scoped event handler (`handleWatchMessage`). + - L628-640 — response status/error handling (`checkResponseAndAuth`). + - L654-711 — current mutation request/reply (`createEntity`, `updateEntity`, `deleteEntity`, `bumpScopeVersion`). + - L745-783 — `request()` envelope.