
Add FaceTime command flow and auto-approval improvements #236

Closed
cameronaaron wants to merge 2334 commits into mautrix:master from cameronaaron:refactor

Conversation

@cameronaaron
Contributor

Summary

  • add and wire FaceTime command handlers, including facetime-send behavior updates
  • improve FaceTime incoming call/link handling in connector flow
  • add rustpush FaceTime Let Me In auto-approval path for bridge-owned links
  • include related connector/login wiring updates

Notes

  • source branch: cameronaaron/refactor
  • requested upstream refactor base does not exist; using master

@cameronaaron requested a review from cnuss as a code owner April 15, 2026 07:03
Copilot AI review requested due to automatic review settings April 15, 2026 07:03

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR refactors the bridge toward a bridgev2-based architecture with updated CLI tooling and macOS integration, while adding new crypto utilities/tests and introducing a new Rust crate for generating Apple APNs NAC validation data.

Changes:

  • Introduces bridgev2 entrypoint/commands and a bundled bbctl for Beeper self-host workflows (register/stop/delete/login).
  • Adds CardDAV credential encryption (AES-256-GCM) + tests, plus new capability/tapback unit tests.
  • Adds macOS-specific wiring (Darwin build tags, chat.db/contacts integration tweaks, setup permissions UX) and introduces a new nac-validation Rust crate + C FFI interface.

Reviewed changes

Copilot reviewed 127 out of 237 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| pkg/connector/chatdb_darwin.go | Darwin-only side-effect import to register macOS chat.db support |
| pkg/connector/carddav_crypto.go | Adds AES-256-GCM encryption/decryption for CardDAV config secrets |
| pkg/connector/carddav_crypto_test.go | Unit tests for CardDAV key management + encrypt/decrypt behavior |
| pkg/connector/capabilities.go | Defines bridge capability descriptors (room/general) for bridgev2 |
| pkg/connector/capabilities_test.go | Adds tests validating capability invariants |
| pkg/connector/bridgeadapter.go | Adds adapter implementing legacy interface for mac connector reuse |
| nac-validation/src/validation_data.h | Adds C header for NAC validation data generation and 3-step NAC API |
| nac-validation/src/lib.rs | Adds Rust wrapper for NAC generation + context-based step API |
| nac-validation/build.rs | Compiles Objective-C NAC implementation and links Foundation |
| nac-validation/Cargo.toml | Declares nac-validation crate deps and build deps |
| imessage/tapback_test.go | Adds unit tests for tapback parsing/mapping behavior |
| imessage/mutation_test.go | Adds mutation testing harness (build-tagged) |
| imessage/struct.go | Extends Contact/Attachment structs and adjusts identifier parsing |
| imessage/interface.go | Extends API interface with GetMessageGUIDsSince |
| imessage/mac/messages.go | Adds attachment fields, new queries, and message GUID query method |
| imessage/mac/database.go | Stores prepared stmt for message GUIDs since query |
| imessage/mac/send.go | Adds darwin build tag, updates imports, improves cleanup logging |
| imessage/mac/contacts.go | Adds darwin build tag and improves permission detection + CString lifetime |
| imessage/mac/meowContacts.m | Adds native “test query” helper for Contacts permission verification |
| imessage/mac/meowContacts.h | Exposes meowTestContactQuery in header |
| imessage/mac/debug.go | Adds darwin build tag and updates import path |
| imessage/mac/groups.go | Adds darwin build tag |
| imessage/mac/attributedstring.go | Adds darwin build tag and updates import path |
| imessage/mac/sleepdetect.go | Adds darwin build tag |
| cmd/mautrix-imessage/main.go | New bridgev2 main entrypoint + subcommands and permissions repair |
| cmd/mautrix-imessage/login_cli.go | Adds interactive terminal login flow driving bridge login steps |
| cmd/mautrix-imessage/carddav_setup.go | Adds CLI subcommand for CardDAV discovery + password encryption |
| cmd/mautrix-imessage/setup_darwin.go | Adds macOS --setup permission prompts and checks |
| cmd/mautrix-imessage/setup_other.go | Stub setup helpers for non-darwin builds |
| cmd/bbctl/main.go | Adds bbctl CLI entrypoint and command registration |
| cmd/bbctl/auth.go | Adds bbctl auth config handling + login/logout/whoami |
| cmd/bbctl/register.go | Adds bbctl config command to register appservice + generate bridge config |
| cmd/bbctl/stop.go | Adds bbctl stop command to announce stopped bridge state |
| cmd/bbctl/delete.go | Adds bbctl delete command to delete appservice + Beeper API bridge |
| docs/cloudkit-guide.md | Adds CloudKit backfill design/operations documentation |
| .github/workflows/ci.yml | New CI pipeline for lint/test/build (Linux default, macOS on dispatch) |
| .github/workflows/security.yml | Adds govulncheck + cargo-audit security workflows |
| .github/workflows/release.yml | Adds release workflow producing artifacts and GitHub release |
| .github/dependabot.yml | Enables Dependabot for Go/Rust/GitHub Actions |
| AGENTS.md | Adds dev notes for UniFFI binding generation |
| Info.plist | Adds macOS app bundle metadata + Contacts usage description |
| go.mod | Changes module path and updates Go/toolchain + dependency set |
| no-mac.go | Removes legacy non-mac permissions checker stub |
| mac-permissions.go | Removes legacy mac permissions checker (replaced by new setup flow) |
| no-heif.go | Removes legacy HEIF conversion stubs |
| heif.go | Removes libheif-based HEIF conversion implementation |
| mediaviewer.go | Removes legacy media viewer URL generation path |
| findrooms.go | Removes legacy portal discovery implementation |
| commands.go | Removes legacy bridgev1 command handlers |
| config/config.go | Removes legacy config structs (bridgev1) |
| config/bridge.go | Removes legacy bridge config definitions (bridgev1) |
| config/download.go | Removes legacy config download helper |
| config/upgrade.go | Removes legacy config upgrader |
| database/database.go | Removes legacy DB wrapper (bridgev1) |
| database/user.go | Removes legacy user query model |
| database/portal.go | Removes legacy portal query model |
| database/message.go | Removes legacy message query model |
| database/tapback.go | Removes legacy tapback query model |
| database/puppet.go | Removes legacy puppet query model |
| database/mergedchat.go | Removes legacy merged chat query model |
| database/kvstore.go | Removes legacy kv store model |
| database/upgrades/upgrades.go | Removes legacy DB upgrade table registration |
| database/upgrades/00-latest-schema.sql | Removes legacy schema snapshot |
| database/upgrades/02-avatar-optional.go | Removes legacy upgrade step |
| database/upgrades/03-message-part-index.go | Removes legacy upgrade step |
| database/upgrades/04-portal-backfill-start-ts.sql | Removes legacy upgrade step |
| database/upgrades/05-message-on-update-cascade.go | Removes legacy upgrade step |
| database/upgrades/06-crypto-store-last-used.sql | Removes legacy upgrade step |
| database/upgrades/07-tapback-guids.sql | Removes legacy upgrade step |
| database/upgrades/08-remove-management-room.sql | Removes legacy upgrade step |
| database/upgrades/09-add-kv-store.sql | Removes legacy upgrade step |
| database/upgrades/10-personal-filtering-spaces.sql | Removes legacy upgrade step |
| database/upgrades/11-splitcrypto-store-handling-split.sql | Removes legacy upgrade step |
| database/upgrades/12-management-room.sql | Removes legacy upgrade step |
| database/upgrades/13-displayname-override.sql | Removes legacy upgrade step |
| database/upgrades/14-correlation-id.sql | Removes legacy upgrade step |
| database/upgrades/15-thread-id.sql | Removes legacy upgrade step |
| database/upgrades/16-remove-correlation-id.sql | Removes legacy upgrade step |
| database/upgrades/17-batch-send-ids.sql | Removes legacy upgrade step |
| database/upgrades/18-chat-merges.sql | Removes legacy upgrade step |
| database/upgrades/19-add-contact-info.sql | Removes legacy upgrade step |
| database/upgrades/20-thread-id-index.sql | Removes legacy upgrade step |
| database/upgrades/21-prioritized-backfill.sql | Removes legacy upgrade step |
| imessage/ios/requests.go | Removes iOS IPC request types (legacy codepath) |
| imessage/mac-nosip/contactproxy.go | Removes legacy mac-nosip proxy implementation |
| imessage/mac-nosip/nocontactproxy.go | Removes non-darwin mac-nosip stub |
| imessage/bluebubbles/interface.go | Removes legacy BlueBubbles API interface types |
| imessage/bluebubbles/README.md | Removes BlueBubbles docs file |
| docker-run.sh | Removes legacy Docker entrypoint script |
| Dockerfile.ci | Removes legacy CI Dockerfile |
| build.sh | Removes legacy build script |
| clangwrap.sh | Removes legacy iOS clang wrapper |
| bridgeinfo.go | Removes legacy bridge info event mapping |
| chatmerging.go | Removes legacy chat merge/split logic |
| ROADMAP.md | Removes outdated roadmap doc |
| example-registration.yaml | Removes legacy appservice registration example |
| .pre-commit-config.yaml | Removes pre-commit hooks config |
| .gitlab-ci.yml | Removes GitLab CI configuration |
| .github/workflows/go.yml | Removes legacy GitHub Actions Go workflow |
| .github/CODEOWNERS | Removes CODEOWNERS file |
| .github/FUNDING.yml | Removes funding config |
| .github/ISSUE_TEMPLATE/bug.md | Removes bug issue template |
| .github/ISSUE_TEMPLATE/enhancement.md | Removes enhancement issue template |
| .github/ISSUE_TEMPLATE/config.yml | Removes issue template config |

Comment on lines +71 to +78
// Try to load existing key, generate if missing
key, err := loadCardDAVKey()
if err != nil {
key, err = generateCardDAVKey()
if err != nil {
return "", err
}
}

Copilot AI Apr 15, 2026

EncryptCardDAVPassword generates a new key on any loadCardDAVKey() error (including wrong-size key file, permission errors, transient IO errors). That can silently rotate the key and make already-encrypted passwords undecryptable. Prefer generating a new key only when the key file is missing (e.g., errors.Is(err, os.ErrNotExist)), and return the error for other failure modes.

for i := range newKey {
newKey[i] = byte(i)
}
os.WriteFile(keyPath, newKey, 0600)

Copilot AI Apr 15, 2026

These tests ignore the return errors from os.WriteFile/os.MkdirAll. If the writes fail (permissions, disk issues), the test may pass/fail for the wrong reason. Capture and assert the returned errors (e.g., if err := os.WriteFile(...); err != nil { t.Fatal(...) }) to make failures deterministic.

Suggested change
os.WriteFile(keyPath, newKey, 0600)
if err := os.WriteFile(keyPath, newKey, 0600); err != nil {
t.Fatalf("WriteFile error: %v", err)
}

Comment on lines +114 to +115
os.MkdirAll(dir, 0700)
os.WriteFile(filepath.Join(dir, cardDAVKeyFileName), []byte("too-short"), 0600)

Copilot AI Apr 15, 2026

These tests ignore the return errors from os.WriteFile/os.MkdirAll. If the writes fail (permissions, disk issues), the test may pass/fail for the wrong reason. Capture and assert the returned errors (e.g., if err := os.WriteFile(...); err != nil { t.Fatal(...) }) to make failures deterministic.

return false
}
defer db.Close()
_, err = db.Query("SELECT 1 FROM message LIMIT 1")

Copilot AI Apr 15, 2026

canReadChatDB() uses db.Query(...) but never closes the returned *sql.Rows. Because runSetupPermissions() can call this repeatedly in a loop, this can leak resources and eventually fail the check. Use QueryRow (preferred here) or close the rows handle before returning.

Suggested change
_, err = db.Query("SELECT 1 FROM message LIMIT 1")
var probe int
err = db.QueryRow("SELECT 1 FROM message LIMIT 1").Scan(&probe)

Comment thread imessage/struct.go
Comment on lines +240 to 252
  	if len(parts) < 3 {
  		return Identifier{LocalID: guid}
  	}
+ 	localID := parts[2]
+ 	// Detect groups by the separator character ("+") or by LocalID pattern.
+ 	// The GUID format is "service;+;localID" for groups and "service;-;localID" for DMs.
+ 	// Group LocalIDs can be "chat..." (iMessage), hex UUIDs (SMS/RCS), or other formats.
+ 	isGroup := parts[1] == "+" || strings.HasPrefix(localID, "chat")
  	return Identifier{
  		Service: parts[0],
- 		IsGroup: parts[1] == "+",
- 		LocalID: parts[2],
+ 		IsGroup: isGroup,
+ 		LocalID: localID,
  	}

Copilot AI Apr 15, 2026

Using strings.HasPrefix(localID, "chat") to infer IsGroup can misclassify DMs whose local ID happens to start with chat (e.g., an email address like chat@example.com). Since the comment states the format is service;+;localID for groups and service;-;localID for DMs, relying on parts[1] == "+" is both simpler and more accurate. If there are known real-world GUIDs that violate the +/- separator rule, consider handling those explicitly and add unit tests covering the edge cases.
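A separator-only variant could look like the sketch below. The `Identifier` fields come from the diff above; `parseIdentifier` and the use of `strings.SplitN` are assumptions, since the diff does not show how `parts` is produced.

```go
package main

import (
	"fmt"
	"strings"
)

// Identifier mirrors the three fields visible in the diff.
type Identifier struct {
	Service string
	IsGroup bool
	LocalID string
}

// parseIdentifier is a hypothetical stand-in for the parsing function under
// review. Group detection relies only on the "+" separator, so a DM whose
// local ID starts with "chat" is not misclassified.
func parseIdentifier(guid string) Identifier {
	// SplitN keeps any further ";" characters inside the local ID (assumed behavior).
	parts := strings.SplitN(guid, ";", 3)
	if len(parts) < 3 {
		return Identifier{LocalID: guid}
	}
	return Identifier{
		Service: parts[0],
		IsGroup: parts[1] == "+",
		LocalID: parts[2],
	}
}

func main() {
	guids := []string{
		"iMessage;+;chat123456",
		"iMessage;-;chat@example.com",
		"SMS;-;+15551234567",
	}
	for _, guid := range guids {
		fmt.Printf("%q -> %+v\n", guid, parseIdentifier(guid))
	}
}
```

The second GUID is the edge case from the comment: with the separator rule alone it stays a DM.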

Comment thread nac-validation/build.rs
Comment on lines +1 to +15
fn main() {
println!("cargo:rerun-if-changed=src/validation_data.m");
println!("cargo:rerun-if-changed=src/validation_data.h");

// Compile the Objective-C file
cc::Build::new()
.file("src/validation_data.m")
.flag("-fobjc-arc")
.flag("-fmodules") // for @import if needed
.define("NAC_NO_MAIN", None) // exclude main() when building as a library
.compile("validation_data");

// Link with Foundation framework
println!("cargo:rustc-link-lib=framework=Foundation");
}

Copilot AI Apr 15, 2026

This build script unconditionally compiles Objective-C and links the macOS Foundation framework. That will fail on non-macOS targets if this crate is ever built in CI or as part of a workspace build. Consider gating the build steps with cfg!(target_os = "macos") (and emitting a clear error or doing nothing on other OSes), and similarly gating any tests that require Apple frameworks/network access.

Comment thread nac-validation/src/lib.rs
Comment on lines +121 to +124
// The underlying AAAbsintheContext is not Send/Sync by default; upstream
// rustpush uses it from a single async task so we mirror that pattern.
unsafe impl Send for NacContext {}


Copilot AI Apr 15, 2026

unsafe impl Send for NacContext is a strong guarantee: it allows moving the underlying Apple framework handle across threads, which may be undefined behavior if AAAbsintheContext is not thread-safe. If the intent is to keep it single-threaded, avoid implementing Send (or enforce single-thread use via a non-Send marker). If it truly is safe to send across threads, add a concrete justification (docs/experiments) explaining why the underlying object is thread-safe for cross-thread moves.

Suggested change
// The underlying AAAbsintheContext is not Send/Sync by default; upstream
// rustpush uses it from a single async task so we mirror that pattern.
unsafe impl Send for NacContext {}
// Do not implement Send/Sync for this wrapper: the underlying
// AAAbsintheContext is an opaque Apple framework object and we do not have
// a documented guarantee that moving it across threads is safe. Keep usage
// thread-confined unless and until that guarantee is established.

Comment thread cmd/bbctl/register.go

func generateSecret(n int) string {
b := make([]byte, n)
_, _ = rand.Read(b)

Copilot AI Apr 15, 2026

The result of rand.Read is ignored. If the read fails, this will silently return a low-entropy secret (likely all-zero bytes). Handle and propagate the error so config generation fails closed rather than producing a weak provisioning secret.

Suggested change
_, _ = rand.Read(b)
if _, err := rand.Read(b); err != nil {
panic(fmt.Errorf("failed to generate secret: %w", err))
}
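If panicking inside a CLI command is undesirable, an alternative is to return the error and let the caller abort config generation. The sketch below assumes a changed signature and hex encoding; the original function's encoding is not shown in the excerpt.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// generateSecret returns n random bytes, hex-encoded (encoding is an
// assumption). It fails closed: if the system random source errors, no
// secret is returned at all.
func generateSecret(n int) (string, error) {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		return "", fmt.Errorf("failed to generate secret: %w", err)
	}
	return hex.EncodeToString(b), nil
}

func main() {
	secret, err := generateSecret(32)
	if err != nil {
		fmt.Println("refusing to write config:", err)
		return
	}
	fmt.Println(len(secret))
}
```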

Comment on lines +142 to +145
fmt.Fprintf(os.Stderr, "[permissions] IsConfigured=%v entries=%d\n", configured, len(br.Config.Bridge.Permissions))
for key := range br.Config.Bridge.Permissions {
fmt.Fprintf(os.Stderr, "[permissions] %q\n", key)
}

Copilot AI Apr 15, 2026

This emits permissions diagnostics to stderr unconditionally on every start (including listing permission keys). That can be noisy in production logs and may leak configuration details. Prefer using the bridge logger at a debug level (or only printing when an actual repair occurs / when a verbose flag is enabled).

Suggested change
fmt.Fprintf(os.Stderr, "[permissions] IsConfigured=%v entries=%d\n", configured, len(br.Config.Bridge.Permissions))
for key := range br.Config.Bridge.Permissions {
fmt.Fprintf(os.Stderr, "[permissions] %q\n", key)
}

David and others added 22 commits April 15, 2026 18:10
…'s ring

Stripping the bridge owner's handle from session.members around
respond_letmein was meant to suppress the AddMember wire fanout to the
owner's own devices on link tap, so the Mac wouldn't ring when the
caller joined via web FT. But the strip appeared to also break the
initial ring on subsequent !im facetime invocations — wife stopped
ringing at all.

The mechanism isn't obvious: respond_letmein only fires after a link
tap, well after wife is supposed to have already started ringing from
create_session. So either there's a state race I'm not seeing, or
something about the mutated session.members is persisting in a way
that affects the next outbound call.

Backing out the strip until we can either reproduce the regression
cleanly or expose IDSSendMessage from upstream so we can craft a
self-only RespondedElsewhere instead of mutating members.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

BindBridgeLinkToSession sets link.session_link on the persistent
"bridge" link immediately after CreateSession, so the letmein approver's
linked_group branch matches deterministically on the first tap — no
more falling through to member/ringing heuristics that miss under
cold-start / stale-state and fabricate an empty session (the "0 people"
symptom).

Also adds two info logs to diagnose the "wife's phone doesn't ring"
side: create_session now logs ring_targets + is_propped +
is_ringing_inaccurate after prop_up_conv, and auto_approve logs
match_kind (linked | member | ringing | cold-start) so we can tell on
the next run whether the pin took and which branch routed the tap.

…irect

82e002a wrapped upstream's create_session with a strip that pulled the
caller's handle out of session.members around prop_up_conv, so the wire
ring wouldn't fan out to the owner's other Apple devices (Mac, iPad).
The original motivation was that a Mac auto-answer sent
RespondedElsewhere back to the bridge, cleared is_ringing_inaccurate,
and broke auto_approve_bridge_letmein's ringing-group fallback for link
taps.

583369d's bind_bridge_link_to_session pins link.session_link to the
outgoing session id immediately after create, so the approver's
linked_group branch matches deterministically regardless of
is_ringing_inaccurate (confirmed on the last test run —
match_kind=linked). The strip's original justification is moot.

Empirically the strip also correlated with the callee not ringing on
outbound (log showed prop_ok=true, ring_targets=[wife], is_propped=true,
but wife's phone never rang — own was absent from update_context.members
and fanout_groupmembers in the Invitation wire, which Apple's FT routing
appears to reject as malformed). Calling upstream directly sends a
well-formed Invitation.

Side effect: the owner's devices will ring too. Acceptable for now;
future work is a targeted prop_up_conv(false) nudge once the callee
ring is confirmed stable.

Also: inbound-call join link now gets the same &n=<base64-handle>
pre-fill that outbound !im facetime applies (client.go:2870-ish), so
the user lands on the web FT join page with their display name already
populated instead of blank.

Upstream's FTClient::handle() hard-requires decoded_context.message to
be Some on command 207 (someone joined) and command 209 (group updated)
— see facetime.rs:1272 and :1344. Apple has started sending at least
some of these with message=None (server-originated state updates after
link-tap joins, plus the callee's answer ack), and upstream BadMsg's
out. The bridge never records the joiner in session.participants, the
local session state diverges from Apple's authoritative copy, and the
visible symptom on the callee's device is "this call is not available"
when answering.

The fix stays entirely in our wrapper (no upstream source changes — see
feedback_no_patch_rustpush):

- Wrap the receive-loop's ft.handle(msg) call with
  ft_handle_with_join_recovery.
- On any non-BadMsg result (success or other error) return unchanged.
- On BadMsg: re-run identity.receive_message on the cloned msg (it's
  side-effect-free beyond decryption, so a second call is safe); if
  cmd is 207 or 209, deserialize the wire plist into a locally-mirrored
  struct (FTWireMessage's fields are private upstream, but the schema
  is stable — we redeclare the fields we need with the same serde
  rename attrs); insert the joiner into session.participants with
  sensible defaults; emit a synthetic FTMessage::JoinEvent so the
  bridge's downstream pipeline still fires.

Skipped: session.unpack_members (private upstream helper). Member-list
drift is cosmetic — the load-bearing piece for Apple-side state is the
participants map, and that's what we populate.

Pairs with ba96333 (strip removal — wife's phone rings) to close the
outbound call loop end-to-end: she rings, she answers, her answer
no longer trips BadMsg, session state stays consistent.

Old flow: !im facetime → CreateSession (upstream prop_up_conv(ring=true))
→ wife rings immediately → she answers before the caller is in the
session → Apple sees no live participant → "call not available" /
"request declined." Even when the caller tapped the join link, the
race was too tight.

New flow (restored from PR 39's pending-ring design):

1. `!im facetime` calls CreateSessionNoRing — allocates the session and
   propagates to Apple's quickrelay, but prop_up_conv(ring=false) +
   is_ringing_inaccurate=false means no Invitation wire goes out. Nobody
   rings at this point.
2. RegisterPendingRing queues the callee's handle keyed on the session
   guid, filtered so the caller's own implicit self-join doesn't fire.
3. The bridge replies with the join link. The caller taps it. The
   letmein approver adds their web-FT temp pseud as a session member;
   Apple echoes back a JoinEvent.
4. maybe_fire_pending_ring in the receive loop sees the temp-pseud join
   (not the caller's own handle → not filtered), pops the queue, and
   calls ft.ring() against the callee. Her phone rings.
5. She answers. The caller is already a live participant, so Apple's
   side has a real session to connect her to.

Rust changes (pkg/rustpushgo/src/lib.rs):
- New FFI method WrappedFaceTimeClient::create_session_no_ring mirroring
  upstream's create_session skeleton but with is_ringing_inaccurate=false
  and prop_up_conv(ring=false).
- Pending-ring machinery (PendingFTRing, maybe_fire_pending_ring,
  register_pending_ring) was already in place from an earlier PR;
  nothing to add there.

Go changes (pkg/connector/facetime.go):
- fnFaceTimeCallInPortal swaps ft.CreateSession → ft.CreateSessionNoRing,
  then ft.RegisterPendingRing(sessionID, caller_handle, [target], 60s).
- Reply copy updated to match: "Tapping this link will ring <contact>'s
  phone" instead of "their phone is ringing now."

Regenerated uniffi bindings.
Two follow-ups on ee1ee6f (the pending-ring gate for outbound calls):

1. Restore missed-call detection. create_session_no_ring starts the
   session with is_ringing_inaccurate=false so prop_up_conv's
   RespondedElsewhere diversion doesn't fire. That also meant the
   upstream "no participants active + ringing" branch at
   facetime.rs:1411 never tripped, so if the callee declined or timed
   out, the session silently closed instead of marking Missed.

   maybe_fire_pending_ring now flips is_ringing_inaccurate=true at the
   moment the Invitation actually leaves — which is the semantically
   correct point, since that's when the callee's phone starts ringing.
   Upstream's Missed-marker path now trips normally.

2. Missed-call notice uses the bridge flow instead of facetime://.
   The old notice gave the user `facetime://<handle>` and
   `facetime-audio://<handle>` links that only worked on native
   iOS/macOS — tap on Android/web and nothing happened. Now the notice
   posts the same bridge link as `!im facetime`:

   - Mint a no-ring session targeting the caller we missed.
   - Queue a pending ring (1-hour TTL since the user may not see the
     notice immediately).
   - Fetch + pin the persistent bridge link; prefill &n= with the
     owner's handle.
   - On tap, letmein approve adds the owner to the session; their
     JoinEvent fires ft.ring() against the original caller.

   Copy mirrors the outbound command: "Tapping this link will ring X's
   phone … open the link when you're ready to be on camera." Falls
   back to facetime:// only if the bridge-link arm fails, so native
   users don't lose functionality on transient errors.

Refactoring: factored the session+link+pending-ring dance into
armBridgeFaceTimeCall so fnFaceTimeCallInPortal and
handleFaceTimeMissedNotice share one implementation and stay in sync.

facetime:// / facetime-audio:// URL schemes only worked on native
iOS/macOS clients — Android/web saw the raw URL and could do nothing
with it, and the native path bypassed the bridge entirely anyway.

If armBridgeFaceTimeCall fails (session mint, link fetch, etc.), post
the missed-call notice without a callback button instead of degrading
to the native scheme. User can still `!im facetime` in the portal
manually to place the callback, and the notice surfaces the miss
either way.

Switch from GetLinkForUsage (persistent bridge link + letmein indirection)
to GetSessionLink (session-specific). With the persistent link, tapping
routed the caller through auto_approve_bridge_letmein, and the JoinEvent
that drives maybe_fire_pending_ring had to match through the linked_group
fallback chain. With a session-specific link the caller joins the session
directly and the JoinEvent fires cleanly for the pending ring to target
wife's phone.

Matches the pattern from PR39 which worked end-to-end. BindBridgeLinkToSession
is no longer called from Go; the FFI method stays in place as a harmless
unused helper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The !im facetime setup chains CreateSessionNoRing → RegisterPendingRing →
GetSessionLink, and the two APNs-backed calls (create + get_session_link)
both surface transient SendTimedOut when APNs drops mid-flight. Our bridge
had been hitting this window repeatedly — the APNs reconnect grace is 30s
on our side, so a short bounded retry lands on the restored connection
instead of returning an error to the user.

GetSessionLink's retry is safe: the session.link is persisted before the
message_session fanout, so the second call returns via the early-return
branch without re-sending.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs preventing both outbound and inbound FaceTime from connecting:

1. Outbound: GetSessionLink creates links with usage=None (upstream
   behavior). auto_approve_bridge_letmein gated on usage=="bridge" so
   session-specific links never got approved — user taps link, LetMeIn
   fires, bridge ignores it, web FT hangs forever, no JoinEvent, no
   pending ring, wife never rings. Fix: widen the gate to also accept
   links where session_link.is_some() (these are bridge-created
   session-specific links, equally safe to auto-approve).

2. Inbound: handleFaceTimeRingNotice fell back to the persistent bridge
   link when the caller didn't embed a URL. That link's stale
   session_link (from a prior auto_approve) routed the user to the
   wrong session, so "answer" connected to a dead call. Fix: extract
   the session guid from the marker text and call GetSessionLink(guid)
   to mint a link that joins the caller's actual session.

Also reorder auto_approve fallback to ringing > linked > member. An
actively-ringing session (inbound call) is always the user's immediate
concern; a stale linked_group from a prior outbound would otherwise win
and route to the wrong session.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…otency

Three retry layers for the FT LetMeIn path that was dying on APNs flaps:

1. FT handle level: upstream's handle_letmein sends a delegation
   message_session INSIDE handle() before our auto_approve even runs. If
   APNs flaps there, the entire LetMeIn drops. Now retries once after 2s.

2. respond_letmein level: retries up to 3x with backoff. On retry, strips
   delegation_uuid so respond_letmein doesn't hit the "Already responded"
   early-return (first call removed it from delegated_requests but failed
   at the subsequent send). Duplicate LetMeInResponse is harmless; web
   client decrypts the first. add_members is idempotent (already-present
   member triggers ring instead of re-add).

3. Go-side armBridgeFaceTimeCall: retryOnAPNsFlap already covers
   CreateSessionNoRing and GetSessionLink (prior commit).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

SMS relays may normalize UUID case, causing delivery receipts to miss
their target message in the bridge DB. Fall back to upper/lower case
lookup before dropping the receipt.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
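The case-fallback lookup from the commit above can be sketched as follows. `lookupWithCaseFallback` and the map-backed lookup are hypothetical; the bridge's actual database query is not shown in this excerpt.

```go
package main

import (
	"fmt"
	"strings"
)

// lookupWithCaseFallback tries the GUID as received, then its upper-case and
// lower-case forms, and returns the first hit. lookup stands in for the
// bridge's message lookup by GUID.
func lookupWithCaseFallback(guid string, lookup func(string) (string, bool)) (string, bool) {
	for _, candidate := range []string{guid, strings.ToUpper(guid), strings.ToLower(guid)} {
		if v, ok := lookup(candidate); ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	// Stored under the upper-case form; the receipt arrives lower-cased.
	db := map[string]string{"A1B2C3D4-0000-4000-8000-000000000001": "event-1"}
	lookup := func(guid string) (string, bool) {
		v, ok := db[guid]
		return v, ok
	}
	fmt.Println(lookupWithCaseFallback("a1b2c3d4-0000-4000-8000-000000000001", lookup))
}
```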
When APNs drops mid-send ("early eof"), the connection reconnects within
seconds but the in-flight send times out. Wrap all outbound send paths
(message, attachment, edit, unsend, tapback, read receipt, typing) with
retrySendOnAPNsFlap — same pattern already used for FaceTime calls.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the raw UUID argument with an interactive numbered list matching
the contact-search and restore-chat UX patterns. Users now type `off`
and pick from Do Not Disturb, Sleep, Driving, Personal, or Work instead
of memorising Apple mode identifiers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ists

Prevents the bridge from subscribing to or inviting its own handles in
StatusKit operations, which wastes APNs quota and can cause self-loops.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
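Filtering the bridge's own handles out of a subscribe or invite list could be done with a helper like the one below. `withoutOwnHandles` is a hypothetical name, and the comparison is an exact string match, so it would not catch the same contact under a different handle form.

```go
package main

import "fmt"

// withoutOwnHandles returns handles with every entry that also appears in
// own removed, preserving the original order.
func withoutOwnHandles(handles, own []string) []string {
	ownSet := make(map[string]struct{}, len(own))
	for _, h := range own {
		ownSet[h] = struct{}{}
	}
	out := make([]string, 0, len(handles))
	for _, h := range handles {
		if _, isOwn := ownSet[h]; !isOwn {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	handles := []string{"mailto:owner@example.com", "tel:+15550100", "mailto:friend@example.com"}
	own := []string{"mailto:owner@example.com"}
	fmt.Println(withoutOwnHandles(handles, own))
}
```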
Contacts may key back under a different handle form than their ghost ID
(e.g. mailto: vs tel:). request_handles does exact string matching
against the ghost list, so cross-form keys were silently unsubscribed —
no APNs channel, no presence updates, ever.

Augment the ghost handle list with every "from" handle persisted in
statuskit-state.plist so request_handles matches all available channels
regardless of handle form.

Also add missing bridge_id filter to the ghost query in
subscribeToContactPresence (the other two StatusKit ghost queries
already had it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fill allowed_modes with standard iOS Focus mode IDs (DND, Sleep,
  Driving, Personal, Work) instead of sending an empty list. iOS may
  silently ignore key-sharing invites with no allowed modes.
- Add per-handle target breakdown logging so we can see which contacts
  have IDS delivery targets and which don't.
- Log invite_to_channel completion for end-to-end send confirmation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace opaque UUID-based !shared-albums and !shared-photos commands
with a 3-step numbered picker: browse albums by name, browse assets
by filename/date/dimensions, then download selected assets into a
dedicated deletable Matrix room through the bridge's full media
pipeline (HEIC→JPEG, video transcoding, thumbnails).

Rust FFI additions: list_albums(), get_album_assets(), download_file()
with new SharedAlbumInfo and SharedAssetInfo uniffi records.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…g level

The manual !statuskit-invite-channel command uses
WrappedStatusKitClient.invite_to_channel, not
Client.invite_to_status_sharing. Previous diagnostic logging only
covered the automated path. Add info-level logging to both paths.

Also raise rustpush crate log level from warn to info so upstream
IDS send/receive diagnostics (target counts, delivery confirmations,
key lookups) are visible in the journal.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The wrapper's targets_for_handles finds 71 delivery targets, but the
upstream invite_to_channel does its own get_participants_targets lookup
internally. If that internal lookup returns empty (different cache key
path), the IDS send is silently skipped — explaining why invites appear
to succeed but contacts never respond.

Add a pre-send diagnostic that compares wrapper vs internal target
counts and warns on mismatch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The dedicated album room was created with IsDirect: true but wasn't
registered in the user's m.direct account data, so Beeper treated it
as a group room (can't delete, only leave). Now calls MarkAsDM() via
the double puppet so the room appears as a true DM that users can
delete from Beeper.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Strip per-handle target breakdown, internal mismatch detection, manual
invite tracing, and rustpush=info log level bump. Keep the allowed_modes
and subscribe augmentation fixes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mackid1993

I believe this pull request was opened accidentally. It is not meant for this repository.
I'm not sure how it keeps getting pushed to because the pull request in our repository is already closed.

Perhaps someone should close this.

The bot-created room couldn't be deleted from Beeper because the user
didn't own it. Now the double puppet (user's own Matrix identity)
creates the room and invites the bot, so it behaves like a real DM
between the user and the bot — deletable from Beeper like any other
DM conversation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
David and others added 28 commits May 12, 2026 14:30
Adds a read-only dump of the in-memory keychain state per pass to
investigate the upstream "PCS master key verification failed" warning
that fires once before the StatusKit-CloudKit DONE line. The warning
comes from rustpush's PCSPrivateKey::from_dict swallowing a
MasterKeyNotFound from verify_with_keychain, and it's unclear without
introspection whether the missing master is a benign condition (master
genuinely not provisioned for com.apple.statuskit on this account, or
orphaned-but-valid service key) or a sign of a sync gap worth fixing.

The diagnostic dumps view sizes, labels, and atyp keyids for both
ProtectedCloudStorage (where masters live) and LimitedPeersAllowed
(where the StatusKit service key lives). One info!() line per pass
prefixed STATUSKIT-CLOUDKIT-DIAG. No keychain writes, no Apple
network traffic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…cause

The dump from one pass confirmed the "PCS master key verification failed"
warning is from re-registration history: the user's account has five
distinct com.apple.statuskit service keys in LimitedPeersAllowed (each
created by a separate MBA registration over time) plus two PCS MasterKey
entries from rotations. Upstream's get_service_key picks one whose
parent reference doesn't match any current PCS atyp, so verification
fails — but per-record decryption still succeeds (decode_failed=0)
because crypto only needs the service key's private part.

Question answered, removing the diagnostic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
decode_invitation_record was running get_zone_encryption_config and
pcs_keys_for_record on every record, then checking !has_payload AFTER
that work. 10-field-variant records (CD_peerKey/CD_serverKey/
CD_channelToken instead of CD_invitationPayload + CD_incomingRatchetState)
have no assembly path yet and were always returned Ok(None) anyway.

Move the !has_payload skip up before any PCS / keychain work. Pure
hygiene — the wasted unwrap calls are local in-memory operations and
don't generate Apple-visible failed-protection events, but doing less
work for records we'll discard removes ambiguity and saves CPU.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
try_fetch_zone treated records.is_empty() as "this candidate didn't
hit" and fell through to the next zone. When all candidates returned
empty (the steady-state shape on a quiet pass with an up-to-date
since_token), the FFI returned ResolvedZone=None and the Go side
cleared the cached zone row — forcing the next pass into fresh
discovery and a from-scratch over-fetch.

Treat an empty page from the cached zone as a legitimate "no new
changes" response: return Some(DiscoveryHit { records: empty,
next_token }) so the cached zone stays cached and since_token
advances normally. Discovery candidates (no cache, or non-cached
fallbacks) still fall through on empty.

Real recovery paths (ZoneNotFound, explicit fetch errors) still
trigger re-discovery as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The bootstrap-side runCloudSyncOnce was firing the StatusKit-CloudKit
pull at the end of its phase sequence, before createPortalsFromCloudSync
ran. subscribeToContactPresence (which the pull triggers when keys are
injected) queries the ghost table for handles to subscribe to — at
bootstrap time that table was empty, so the call subscribed to nothing
and the freshly-injected keys never got wired up. The 12h success floor
then blocked subsequent passes from re-trying, leaving the bridge with
keys it never used.

Defer the bootstrap-side StatusKit pull to the end of
runCloudSyncController, after createPortalsFromCloudSync and the
post-sync housekeeping steps. By that point ghosts exist and the
subscribe call has handles to act on. Steady-state cloud-sync cycles
continue to fire the pull from inside runCloudSyncOnce as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OnStatusUpdate was falling back to "now - 1ms" when the target portal
had no prior message (freshly-created portal, or initial backfill
hadn't loaded its messages yet). Clients that don't fully honor the
com.beeper.action_message extension would then treat the notice as a
new tip-of-timeline event and bump the room to the top of the room
list. During initial backfill, presence broadcasts arriving for many
half-backfilled portals at once scrambled chat order in random
arrival sequence.

Skip the notice entirely when lastMsg is nil. The next presence
broadcast from the same peer (after backfill catches up and there's
a real anchor message) will create the notice properly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previous attempt (cc80fe1) skipped the notice when the target portal
had no anchor message — wrong direction; the notice should still be
sent. Replace with: stamp the notice at the same timestamp as the
last message in the portal (drop the prior -1ms offset). Matching
the last message's timestamp keeps room ordering stable on clients
that ignore the com.beeper.action_message=presence_update extension.

In practice the no-anchor-message case shouldn't occur (no messages
means nothing to backfill, so no portal), but the now-1ms fallback
is preserved as a defensive default.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Earlier attempt placed the deferred StatusKit-CloudKit pull at the
end of runCloudSyncController (after runPostSyncHousekeeping). That's
still during the bridge sync phase — forward backfill is queued
asynchronously by createPortalsFromCloudSync and runs after the
controller returns. Firing the pull at controller exit time means
subscribeToContactPresence still queries an incomplete ghost table:
DM ghosts created during forward backfill aren't there yet.

Move the pull to onForwardBackfillDone at counter==0, alongside the
existing inviteContactsToStatusSharing trigger that already uses this
hook for the same "ghosts now fully exist" reason. Drop the
controller-side call I added previously.

The runCloudSyncOnce defer-on-bootstrap gate stays — bootstrap flow
now routes entirely through this post-backfill hook; steady-state
cycles continue to fire from inside runCloudSyncOnce as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triggering the StatusKit-CloudKit pull from onForwardBackfillDone at
counter==0 still fired too early on initial bootstrap. Forward backfill's
last batch can be in flight to Matrix and the bridge DB at that exact
moment; presence broadcasts arriving for not-yet-committed portals
hit the OnStatusUpdate `lastMsg=nil` fallback and bumped chats to the
top of the room list. Warm restart works because prior-session messages
are already in the DB to anchor against.

Drop the onForwardBackfillDone-triggered pull. Instead, gate
syncCloudStatusKitPeers itself: skip when initial forward backfill
hasn't completed (apnsBufferFlushedAt == 0) or completed within the
last 60s (settle window for any straggler DB writes). The natural
runCloudSyncOnce cadence (delayed re-syncs, APNs nudges) will fire
the pull as soon as the gate clears.

Presence isn't time-critical — taking an extra minute to make sure
backfill is fully committed is the right trade.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original design assumed the outer cloud-sync orchestrator would
call syncCloudStatusKitPeers many times across a session, naturally
draining via the persisted continuation token (the pattern the
chat/message/attachment backfill paths use). For those paths the
assumption holds — they get hit hundreds of times per session by
APNs nudges and other triggers. For StatusKit the assumption broke
down: runCloudSyncOnce only fires from the bootstrap retry loop and
the three delayed re-syncs at 15s/60s/3min. After that no further
trigger exists until the next bridge restart, so undrained pages were stranded.

Loop the FFI call within a single pass until either the response
returns no continuation token, returns no records, or hits the 30-
page safety cap. Each successful page persists its zone+token before
the next iteration so a crash mid-drain resumes correctly. 1-second
pause between pages keeps the per-pass CKKS round-trip rate gentle.

Bounded cost per pass: ~5 base round-trips + (N-1) FetchRecordChanges
where N is pages drained. For typical accounts that's 2-3 total.
Strictly less aggressive than the existing chat/message backfill paths
which have no per-page cap and no inter-page pacing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The cloud-sync controller's three delayed re-syncs (15s/75s/4m15s
after bootstrap) all fire well before forward backfill completes on
accounts with substantial history — backfill regularly takes 20+
minutes and can run hours on the heaviest accounts. The internal
settle-window gate in syncCloudStatusKitPeers was correctly skipping
those premature attempts, but no later trigger existed: after the
4m15s re-sync, runCloudSyncOnce isn't called again until the next
bridge restart. So the StatusKit drain only ever ran on restart for
those accounts.

Add a post-backfill trigger in onForwardBackfillDone at counter==0
that sleeps slightly longer than the gate's settle window (75s vs
60s) and then calls syncCloudStatusKitPeers. The drain now fires
after backfill completes, always, regardless of duration. On warm
restart with fast backfill the 12h success floor short-circuits
this call after the delayed re-sync has already drained.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…lowing

Cached-path DOSYNC and FetchRecordChangesOperation errors were returning
Ok(None), which the FFI wrapped as a clean empty page. The Go-side gate
then recorded ConsecutiveErrs=0, applied the 12h success floor, and
cleared the cached zone — forcing the next pass into fresh discovery
(the burstiest pattern) and locking key population behind a 12h gate per
cycle on persistent failure.

Surface those errors as WrappedError::GenericError so the inter-pass
backoff schedule (15m → 30m → 1h → 2h, retry-after honoring) actually
fires on the signal it was built to act on. Discovery-mode and
non-cached fallback-zone errors keep the prior fall-through behavior so
first-pass discovery and the cached-zone-disappeared re-discovery path
are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Companion to be61a85 — the cached-path DOSYNC and FETCH paths now
propagate Apple-side errors, but candidate.init failures were still
swallowed via `continue`. With a cached path, candidates_to_try has
only one entry, so `continue` exits the loop into an empty success
page → same 12h-floor lockout pattern.

Discovery mode keeps the prior fall-through behavior so init failures
on alternate candidates still let the loop try the next candidate.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ence senders

Peer iOS fans every reshare across all of the peer's registered handles —
same channel id, different sender per alias. When a presence update
arrives from an alias that's missing from contacts and that Apple's IDS
refuses to correlate (LookupFailed/6001), the standard chain
(learned-cache → contacts → IDS → mailto: portal) drops the notice with
"no DM portal found".

Add a persistent (channel_id ↔ alias) cluster store that captures the
full reshare alias graph and powers a transitive resolver: for an
unknown handle X, look up the channel_ids X has been observed on, list
sibling handles in those clusters, and resolve through the persistent
alias→portal map (or the live chain on a sibling). The first sibling
that resolves hands X its portal too — and the mapping is persisted
for O(1) future lookups.

Data sources feeding the cluster:
  - APNs reshares: on_reshare_sender now carries channel_id (rust trait
    + both call sites updated). Live observations land immediately.
  - StatusKit-CloudKit pull: every successfully-decoded
    CD_ReceivedInvitation contributes (channel_id, sender) via a new
    cluster_observations field on the FFI page return — catches peers
    keyed via offline reshares that never fired the live callback.

Persistence:
  - statuskit.alias_portal.<handle>     → portalID
  - statuskit.channel_cluster.<channel> → JSON [handles]
  - statuskit.alias_channels.<handle>   → JSON [channel_ids]
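
A rough sketch of the transitive lookup over those three key families, using in-memory maps in place of the KV store. The type and method names are hypothetical, and the live-chain fallback on a sibling is omitted.

```go
package main

import "fmt"

// clusterStore mirrors the three persisted key families with plain maps.
type clusterStore struct {
	aliasPortal    map[string]string   // handle -> portal ID
	channelCluster map[string][]string // channel ID -> handles seen on it
	aliasChannels  map[string][]string // handle -> channel IDs it was seen on
}

// resolveViaCluster finds a portal for handle through any sibling that
// shares a channel with it, then remembers the mapping for next time.
func (s *clusterStore) resolveViaCluster(handle string) (string, bool) {
	if portal, ok := s.aliasPortal[handle]; ok {
		return portal, true
	}
	for _, ch := range s.aliasChannels[handle] {
		for _, sibling := range s.channelCluster[ch] {
			if sibling == handle {
				continue
			}
			if portal, ok := s.aliasPortal[sibling]; ok {
				s.aliasPortal[handle] = portal // persist for direct lookup later
				return portal, true
			}
		}
	}
	return "", false
}

func main() {
	s := &clusterStore{
		aliasPortal:    map[string]string{"tel:+15550100": "portal-1"},
		channelCluster: map[string][]string{"chan-a": {"tel:+15550100", "mailto:hidden@example.com"}},
		aliasChannels:  map[string][]string{"mailto:hidden@example.com": {"chan-a"}},
	}
	fmt.Println(s.resolveViaCluster("mailto:hidden@example.com"))
}
```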

statusKitPortalCache is now KV-backed via rememberAliasPortal, and the
in-memory map is hydrated from KV on Connect. A second pre-warm pass
scans bridge ghosts and seeds (handle → portal) via the cheap
non-IDS chain so the very first presence update after a restart
resolves known peers without round-trips.

Resolver code lives in its own file (statuskit_alias_resolver.go) so it
survives a future cutover to upstream rustpush's native StatusKit-
CloudKit pull — it consumes from durable callback shapes, not from
the current pull's internals.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d filter)

Initial hydration query used the wrong table name (kv) and missed the
required bridge_id filter, producing "no such table: kv" on every
restart.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OnStatusUpdate calls resolveViaCluster with the raw user form
("aap724@icloud.com"), but the CloudKit pull stores cluster
observations with the prefixed form ("mailto:aap724@icloud.com") that
Apple's records carry. Lookups missed every time.

Normalize the alias inside recordReshareObservation,
resolveViaCluster, lookupAliasPortal, and rememberAliasPortal so all
paths converge on the canonical prefixed key regardless of caller.
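
A minimal sketch of such a normalizer. `normalizeAlias` is a hypothetical name, and treating any handle without an `@` as a `tel:` handle is an assumption; the commit message only shows the `mailto:` case.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeAlias returns the prefixed form of a handle so every store
// and lookup agrees on one key. Already-prefixed handles pass through.
func normalizeAlias(handle string) string {
	handle = strings.TrimSpace(handle)
	if strings.HasPrefix(handle, "mailto:") || strings.HasPrefix(handle, "tel:") {
		return handle
	}
	if strings.Contains(handle, "@") {
		return "mailto:" + handle
	}
	return "tel:" + handle // assumption: non-email handles are phone numbers
}

func main() {
	fmt.Println(normalizeAlias("aap724@icloud.com"))
	fmt.Println(normalizeAlias("mailto:aap724@icloud.com"))
}
```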

Also promote the cluster-observation log line from Debug to Info so
the pull's contribution to the cluster is visible without flipping
log levels.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ns land

Previously the alias→portal mapping for an unknown handle was only
materialized when a presence update arrived from that handle (the
on-arrival cluster transitive lookup). This meant the bridge waited
for the peer to publish before binding the alias.

Now every observation that grows a cluster runs eagerLinkClusterToPortal:
walks the cluster, finds the first sibling that resolves (via the
persistent alias-portal map or the live non-IDS chain), and maps every
unmapped sibling to that portal. By the time presence arrives from a
hidden alias, step 0 of the OnStatusUpdate chain (statusKitPortalCache)
already has it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…or unknown aliases

When a presence update arrives from a handle that isn't in contacts,
isn't in the bridge's IDS cache, and doesn't share a CloudKit cluster
with a known sibling, the chain dropped the notice. Now the bridge
asks Apple via IDS, with a wider service list and a persistent
negative cooldown so the lookup happens at most once per
six-hour window per unmapped handle.

Changes:
  - Extend resolve_handle / resolve_handle_cached SERVICES list to
    include com.apple.private.alloy.status.personal and
    com.apple.icloud.presence.mode.status. Hidden Apple-ID-linked
    aliases publish on these but aren't registered for Madrid or
    status.keysharing, so validate_targets returned LookupFailed
    (6001). The presence services catch them and surface a correlation
    id we can match against known siblings.
  - resolveStatusPortalViaIDSCached wraps the existing IDS resolver
    with two cache layers: alias_portal KV short-circuits prior
    successes (in-memory + persistent), and statuskit.ids_attempt.<h>
    records a 6-hour negative cooldown so a stuck handle doesn't
    re-trigger an IDS round-trip on every Focus toggle.
  - OnStatusUpdate's chain and eagerResolveReshareSender both now go
    through the cached wrapper.

Cutover note: when upstream rustpush ships its native correlation
helper, the cache wrapper stays — only the underlying ResolveHandle
target swaps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PascalCase Go filenames are non-standard; snake_case lowercase matches
the rest of the package.
…ered topic

Two fixes in one commit, both surfaced by the aap724 hidden-alias case:

1. Drop com.apple.icloud.presence.mode.status from the SERVICES list in
   resolve_handle / resolve_handle_cached. That topic is APNs-interest-only,
   not a registered IDSService — passing it to validate_targets makes
   IdentityManager::get_main_service panic via .expect("Topic ... not
   found!"), crashing the bridge process. The bridge registers MADRID and
   MULTIPLEX (which sub_serves status.keysharing and status.personal); the
   trimmed three-topic list covers all valid IDS lookups.

2. Restore the original Madrid batch validate_targets in resolve_handle.
   Hidden Apple-ID aliases (e.g. mailto:aap724@icloud.com) return
   LookupFailed (6001) when queried alone, but get their correlation_id
   populated alongside successful sibling lookups when the batch includes
   known ghost handles. The single-handle refactor — intended to cap a 15s
   block on bridges with hundreds of handles — broke the aap724 → wife
   correlation entirely. Restore [unknown_handle, ...known_ghosts] for
   Madrid (15s timeout, ample for typical ghost counts), keep single-handle
   for the other services. resolve_handle_cached is unchanged: cache reads
   are correct as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add a fair-slot rate gate around the alias-resolver's batch
validate_targets path. Per-handle 6h negative cache already prevents
re-querying the same unknown twice; this layer protects against a
*burst* of distinct unknowns (e.g. 50 reshares dropping in within a
minute after a CloudKit pull) firing parallel IDS calls.

Three-layer defense:
  - Concurrency cap (1): callers reserve slots serially.
  - Min interval (3s): base spacing between batch calls. Quieter than
    real iPhone bursts when sending to groups.
  - Adaptive multiplier (×N consecutive failures, capped at 8 → 24s):
    backs off further while results keep coming back empty. Resets on any
    successful resolution.

The gate uses slot-reservation rather than mutex-then-sleep so context
cancellation interrupts cleanly and back-to-back callers fairly receive
distinct future slots. Steady-state cost is zero — typical resolver
runs are minutes apart, the 3s pacing is invisible.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
resolve_handle was calling validate_targets, which uses refresh=false in
upstream IDS. With refresh=false, a previously-empty IDS result stays
"fresh" for EMPTY_REFRESH (1h) and is filtered out of the HTTP fetch —
Apple is never re-queried for that handle within the window.

This silently broke hidden-alias resolution: aap724@icloud.com (and
similar) hits LookupFailed once, gets cached as empty, then is excluded
from every subsequent batch lookup for the next hour even though the
batch itself runs fine for sibling handles.

Switch the resolver to cache_keys(refresh=true) directly. With refresh=
true, is_dirty drops the cutoff to REFRESH_MIN (60s), so the unknown
handle is included in the fetch on every resolver pass. The rate gate
already in place (3s min interval, adaptive backoff on failure) is
the safety net against pounding Apple.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a local-first, Apple-fallback alias-link orchestrator that runs at
the end of every successful StatusKit-CloudKit drain. Walks state.keys,
links each peer handle to a portal using cached/local data first, and
sends only residual unknowns to Apple in a single batched IDS call.

Resolution order per handle:
  1. alias_portal cache (in-memory + KV) — already linked, skip
  2. Cluster store — sibling on a shared channel id
  3. Contacts + direct tel:/mailto: portal lookup
  4. Batched IDS — one cache_keys covers every residual unknown plus
     every known portal-bearing ghost; siblings matched via correlation_id
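
The ordering above can be sketched as a chain of local resolvers followed by a single residual batch. `resolver` and `linkLocalFirst` are hypothetical names; the batched IDS call itself is not shown.

```go
package main

import "fmt"

// resolver is one local resolution step (cache, cluster store, contacts).
type resolver func(handle string) (portal string, ok bool)

// linkLocalFirst runs each handle through the local steps in order and
// returns the links it found plus the handles that still need the one
// batched remote lookup.
func linkLocalFirst(handles []string, local []resolver) (map[string]string, []string) {
	linked := make(map[string]string)
	var residual []string
	for _, h := range handles {
		found := false
		for _, step := range local {
			if portal, ok := step(h); ok {
				linked[h] = portal
				found = true
				break
			}
		}
		if !found {
			residual = append(residual, h)
		}
	}
	return linked, residual
}

func main() {
	cache := map[string]string{"tel:+15550100": "portal-1"}
	fromCache := func(h string) (string, bool) { p, ok := cache[h]; return p, ok }
	linked, residual := linkLocalFirst(
		[]string{"tel:+15550100", "mailto:hidden@example.com"},
		[]resolver{fromCache},
	)
	fmt.Println(linked, residual) // residual goes to the single batched call
}
```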

Why this shape:
- Idempotent: re-running confirms link state, doesn't re-do work.
- Cheap: most handles resolve from local data; Apple only sees the
  residual after that filter.
- Self-healing: each pass picks up handles that became resolvable
  (new ghost, new cluster observation, Apple finally publishing a
  correlation) since the prior pass.
- Bootstrap-safe: hooks at the END of syncCloudStatusKitPeers, which
  already gates on apnsBufferFlushedAt + 60s settle window + 12h
  success floor. No separate trigger; bridge start runs the pass via
  the cloudkit cycle's natural startup invocation.

Rust side: new batch_resolve_handles uniffi method that vectorizes
resolve_handle. One cache_keys(refresh=true) call per service across
unknowns ∪ known siblings, then walks the cache once to match
correlation_ids. 90s timeout cap.

Go side: new batchLinkStatusKitAliases hooked at the end of
syncCloudStatusKitPeers, plus collectKnownPortalHandles helper that
scans the ghost table for tel:/mailto: ids to feed the IDS batch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: batch link logged "resolved DM portal via IDS correlation" for
aap724 (and presumably others), but only some entries actually showed up
in the kv_store after a bridge restart. cameronaaron's four aliases
landed; aap724 didn't.

Root cause: bridgev2's KV.Set silently drops writes when ctx is canceled.
It logs via zerolog.Ctx(ctx), which returns a disabled logger when no
logger is attached to ctx, so the failure leaves no trace. The cloudkit
cycle ctx CAN cancel mid-iteration (orchestrator deadline, shutdown,
delayed-resync race), and Go map iteration order is random — so
whichever resolved entries happen to land late in the loop are silently
lost while earlier ones commit.

Fix: persist via context.Background() inside the batch link. The IDS
call upstream still respects the cycle ctx (rust-side 90s timeout
caps it), so cancellation propagates to the network call but not to the
local SQL UPSERT for entries that already returned a result.

Also adds:
- Read-after-write verification that warns if a write didn't land,
  so any future regression is immediately visible.
- Negative-attempt stamp clearing on batch-link success, mirroring
  what resolveStatusPortalViaIDSCached already does.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Symptom: aap724 was correctly mapped to wife's portal in alias_portal,
but presence updates from her never reached the matrix portal — log
showed "presence unchanged (restored from DB), skipping notice."

Root cause: presence dedupe is keyed on the raw (prefix-stripped) handle
and persists in `statuskit.presence.<raw>` plus an in-memory sync.Map.
When aap724's first presence arrived BEFORE alias_portal had a mapping
for her, the notice got dropped (no portal) but the cached presence
state was still recorded as "available." Subsequent updates with the
same mode then hit the dedupe and got skipped, even after the mapping
was created — wife's portal never saw the indicator.

Fix: when batch link writes a NEW mapping (alias_portal entry didn't
match the new value before the write), clear:
- in-memory dedupe via c.statusKitPresence.Delete(raw)
- KV `statuskit.presence.<raw>` to ""

Then trigger c.subscribeToContactPresence so APNs replays recent
presence — the now-routed handle's availability re-delivers and lands
in the matrix portal without waiting for a peer-side state change.

The "raw" form is the handle without mailto:/tel: prefix, matching the
format the presence handler receives directly from rust.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously batch_resolve_handles fed unknowns ∪ known_handles into a
single cache_keys call with refresh=true, which forced Apple to
re-query EVERY handle on every cycle — ~28 lookups per cycle for a
typical bridge, with the matching "IDS returned zero keys" warning
flood for any sibling that's genuinely unregistered.

Split into two calls per service:

  1. cache_keys(refresh=true, unknowns) — small set (~4), forces
     Apple re-query, bypassing EMPTY_REFRESH on stale empty results.
     This is the only path that needs a fresh fetch.

  2. cache_keys(refresh=false, known_siblings) — top-up only. Most
     siblings have correlation_ids cached from prior message traffic;
     refresh=false filters fresh entries out so Apple only sees the
     few siblings with genuinely missing/stale cache entries. Steady
     state is zero Apple traffic for this call.

Net effect: Apple traffic per cycle drops from O(unknowns + siblings)
to O(unknowns) once siblings are warmed up. The "zero keys" warnings
for known-empty siblings stop firing every cycle.

Sibling-promotion logic unchanged — both lists are queried into the
same cache state, then walked once per service for correlation matches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Flip the FaceTime install prompt to "Enable FaceTime Bridge?" so the
default-yes flow stays consistent across all install prompts. Also add
a note to the StatusKit notifications prompt that posting a notice
unarchives the destination chat — limitation is external to the bridge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CloudKit-pull inject skipped any peer whose channel_id was already
in state.keys, counting the record as already_known. That's correct for
the steady-state case (CloudKit reading back state we already learned
from the APNs reshare path) but wrong when a peer rotates device
material under the same channel id while the bridge was offline — the
fresh device from CloudKit would be silently dropped and state.keys
would stay pinned to the stale key.

Compare canonical binary-plist bytes of the existing and incoming
StatusKitSharedDevice. Same bytes → already_known, skip. Different
bytes → log a peer-key-rotation line and overwrite, counting toward
inserted so subscribeToContactPresence re-fires for the channel and
the alias-resolver observation callback still runs. Serialize failure
on either side forces an overwrite — safer than retaining a possibly-
stale key (and cannot occur in practice for a value that round-tripped
through plist to be constructed in the first place).

Upstream's StatusKitSharedDevice does not derive PartialEq, so the
bytewise comparison goes through plist::to_writer_binary — already the
canonical serialization used to persist state.keys to disk, so two
devices that compare equal here will also persist identically.
@cameronaaron cameronaaron deleted the refactor branch May 12, 2026 21:41