Summary
In a passkey-only multi-tenant system, users who lose access to all their passkeys permanently lose their account. This issue defines a recovery code mechanism to mitigate that risk while maintaining strong security guarantees.
Relates to: #20 (Passkey auth), #17 (Multi-tenant)
Recovery Flow
A. Registration — Recovery Code Generation
During passkey registration, generate a high-entropy recovery code (e.g. 128-bit, displayed as grouped alphanumeric characters). The user must copy it to the clipboard (or confirm they've saved it) before registration completes. The recovery code is stored only as a bcrypt or argon2 hash, never in plaintext. Recovery codes must be globally unique across all accounts.
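A minimal sketch of the generation step, assuming a 32-symbol alphabet with look-alike characters removed and a 7x4 grouping. The names `ALPHABET`, `generate_recovery_code`, and `normalize` are illustrative and not part of the proposal. Hashing is left out here; only the code's shape and entropy are shown.

```python
import secrets

# Assumed alphabet: 24 letters (no I or O) plus digits 2-9 = 32 symbols,
# so every character carries exactly 5 bits.
ALPHABET = "ABCDEFGHJKLMNPQRSTUVWXYZ23456789"
CODE_LENGTH = 28  # 28 chars * 5 bits = 140 bits, above the 128-bit floor
GROUP_SIZE = 4    # displayed as 7 groups of 4


def generate_recovery_code() -> str:
    """Return a fresh code such as 'ABCD-EFGH-...' drawn from a CSPRNG."""
    chars = [secrets.choice(ALPHABET) for _ in range(CODE_LENGTH)]
    groups = [
        "".join(chars[i:i + GROUP_SIZE])
        for i in range(0, CODE_LENGTH, GROUP_SIZE)
    ]
    return "-".join(groups)


def normalize(code: str) -> str:
    """Upper-case and drop separators/whitespace so input matches what was hashed."""
    return "".join(ch for ch in code.upper() if ch in ALPHABET)
```

Normalizing before hashing means a user who types the code in lower case or with spaces instead of dashes still matches the stored hash.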
B. Initiate Recovery
When a user has lost all passkeys, they enter only their recovery code on an unauthenticated open endpoint (POST /api/auth/recover). No username or account identifier is provided — the code itself identifies the account.
C. Rate Limiting & Timing Protection
The recovery endpoint is heavily rate-limited to 1 request per hour per IP. Since recovery codes are global (no user identifier is submitted), rate limiting can only be scoped to IP. Constant-time comparison is used against all stored hashes to prevent timing attacks. Consider adding CAPTCHA or proof-of-work to further protect this unauthenticated endpoint.
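The per-IP limit could look roughly like the following. This is an in-memory, single-process sketch with a hypothetical `PerIpRateLimiter` class; a real deployment would likely need shared storage and a decision on how client IPs are derived behind proxies.

```python
import time
from typing import Callable, Dict


class PerIpRateLimiter:
    """Allow at most one request per `window_seconds` for each client IP."""

    def __init__(
        self,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._clock = clock  # injectable so the limiter can be tested
        self._last_allowed: Dict[str, float] = {}

    def allow(self, ip: str) -> bool:
        now = self._clock()
        last = self._last_allowed.get(ip)
        if last is not None and now - last < self._window:
            # Rejected attempts do not extend the window.
            return False
        self._last_allowed[ip] = now
        return True
```

The limiter is consulted before any hash comparison, so rejected requests never reach the expensive verification path.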
D. 24h Freeze Period
On successful recovery code entry:
- The account enters a 24-hour freeze — no login, no passkey changes.
- All connected devices/sessions receive a push notification (mechanism TBD) with a secret cancel link that requires no authentication.
- The cancel link allows any current session holder to abort the recovery if it was unauthorized.
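The freeze and its cancel token can be sketched as below. The `Freeze` record and function names are hypothetical. SHA-256 is used for the token hash on the assumption that a 256-bit random token does not need a slow password hash. Delivery of the link is not covered, since the notification mechanism is still TBD.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Tuple

FREEZE_DURATION = timedelta(hours=24)


@dataclass
class Freeze:
    frozen_until: datetime
    cancel_token_hash: str  # only the hash is stored; the raw token lives in the link
    cancel_used: bool = False


def _hash_token(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def start_freeze(now: datetime) -> Tuple[Freeze, str]:
    """Begin the 24h freeze and return the raw token to embed in the cancel link."""
    token = secrets.token_urlsafe(32)
    return Freeze(now + FREEZE_DURATION, _hash_token(token)), token


def is_frozen(freeze: Freeze, now: datetime) -> bool:
    return now < freeze.frozen_until


def redeem_cancel_token(freeze: Freeze, token: str, now: datetime) -> bool:
    """Single-use and valid only while the freeze window is open."""
    if freeze.cancel_used or not is_frozen(freeze, now):
        return False
    if not hmac.compare_digest(freeze.cancel_token_hash, _hash_token(token)):
        return False
    freeze.cancel_used = True
    return True
```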
E. Cancel Flow
If the recovery is cancelled via the secret link:
- The recovery request is voided.
- The user is guided through generating a new recovery code (old one is invalidated).
- Login is NOT granted — the legitimate owner must initiate a fresh recovery flow with the new code if they still need access.
- This prevents an attacker from using the cancel flow as a login bypass.
F. Post-Freeze Completion
After the 24h freeze expires (uncancelled):
- The user enters the recovery code again (second verification).
- On success, they register a new passkey.
- A new recovery code is generated and must be saved (old one is invalidated).
- Normal account access is restored.
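Steps D through F can be read as a small state machine. The sketch below is one possible encoding; the state and event names are assumptions made for illustration. It makes two properties from the flow explicit: cancelling never grants login, and the old recovery code is invalidated on both cancellation and completion.

```python
from enum import Enum


class RecoveryState(Enum):
    FROZEN = "frozen"                      # step D: code accepted, freeze running
    CANCELLED = "cancelled"                # step E: voided via the cancel link
    AWAITING_REENTRY = "awaiting_reentry"  # step F: freeze expired, second entry pending
    COMPLETED = "completed"                # step F: new passkey registered


class RecoveryEvent(Enum):
    CANCEL_LINK_USED = "cancel_link_used"
    FREEZE_EXPIRED = "freeze_expired"
    PASSKEY_REGISTERED = "passkey_registered"  # after successful code re-entry


TRANSITIONS = {
    (RecoveryState.FROZEN, RecoveryEvent.CANCEL_LINK_USED): RecoveryState.CANCELLED,
    (RecoveryState.FROZEN, RecoveryEvent.FREEZE_EXPIRED): RecoveryState.AWAITING_REENTRY,
    (RecoveryState.AWAITING_REENTRY, RecoveryEvent.PASSKEY_REGISTERED): RecoveryState.COMPLETED,
}


def advance(state: RecoveryState, event: RecoveryEvent) -> RecoveryState:
    """Apply an event, rejecting anything the flow does not allow."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"{event.name} is not allowed in state {state.name}") from None


def grants_login(state: RecoveryState) -> bool:
    return state is RecoveryState.COMPLETED


def invalidates_old_code(state: RecoveryState) -> bool:
    return state in (RecoveryState.CANCELLED, RecoveryState.COMPLETED)
```

Because no transition leads out of `AWAITING_REENTRY` on `CANCEL_LINK_USED`, the cancel link stops working once the freeze window closes.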
Open Questions / Design Decisions
- Global code lookup cost: Since no user identifier is submitted, the server must check the recovery code against all stored hashes. With argon2id this is expensive at scale. Options: (a) use a keyed HMAC prefix as a fast lookup index (leaks timing info, needs care), (b) limit total user count (acceptable for ShellWatch's scale), (c) use a less expensive hash for the lookup step with argon2id for final verification.
- Notification channel: Push notifications require a registered device/app. If this is a web-only app, what's the notification mechanism? WebSocket to active sessions? Email (adds a second factor but contradicts passkey-only purity)? Needs decision.
- Cancel link security: Must be single-use and time-bound (valid only during the freeze window). If an attacker has access to a connected device, they could intercept the cancel link and obtain the new recovery code. At minimum: single-use, HTTPS-only, short cryptographic token.
- Recovery code not rotated between steps D→F: The same code is valid at initiation (step B) and post-freeze re-entry (step F). An attacker who observed the code once can use it at both steps. The 24h freeze + cancel notification is the mitigation. Acceptable tradeoff?
- No connected devices scenario: If the user genuinely lost everything (no active sessions, no devices), the freeze period is just a 24h wait with no one to notify. This is by design: the freeze still protects against rushed takeovers.
- Recovery code entropy: Must be high enough to prevent brute-force across all accounts (not just one). Suggest minimum 128-bit entropy. With N accounts, the chance of a random guess hitting any account is N/keyspace, which must remain negligible even under sustained 1-req/hour attempts.
- Account lockout after N failed recovery attempts: Since rate limiting is per-IP only, a distributed attacker could attempt more frequently. Consider additional defenses: global rate limiting, exponential backoff, or temporary lockout after repeated failures from different IPs targeting the same code hash.
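To put rough numbers on the entropy question, the N/keyspace estimate can be extended to many guesses with a union bound: the chance that any of g guesses hits any of N accounts is at most g·N/2^bits. The figures below (one million accounts, one million attacker IPs, ten years at one request per hour per IP) are purely illustrative assumptions, not ShellWatch projections.

```python
from fractions import Fraction


def hit_probability_bound(accounts: int, guesses: int, entropy_bits: int = 128) -> Fraction:
    """Union bound on P(at least one guess matches any account's code)."""
    return min(Fraction(1), Fraction(accounts * guesses, 2 ** entropy_bits))


HOURS_PER_YEAR = 24 * 365

# Assumed attacker: 1,000,000 IPs, each making 1 request per hour for 10 years.
guesses = 1_000_000 * HOURS_PER_YEAR * 10
bound = hit_probability_bound(accounts=1_000_000, guesses=guesses)
print(f"upper bound on success probability: {float(bound):.2e}")
```

Under these assumptions the bound is on the order of 1e-22, so at 128 bits a distributed attacker gains little from guessing, and the remaining risk is mostly the cost the attempts impose on the server.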
Implementation Notes
- Recovery code hash stored alongside user record (argon2id recommended)
- Codes must be globally unique — check for collisions at generation time
- Freeze state stored as a timestamp + recovery session token in DB
- Cancel link token: cryptographically random, stored hashed, single-use
- All recovery-related endpoints must be audit-logged
- Consider a dedicated recovery_sessions table tracking attempts, freezes, and outcomes
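One way the keyed-HMAC lookup index from the open questions could combine with the collision check above is sketched here. Everything named in it is hypothetical: `INDEX_KEY`, the in-memory `_store`, and the helper functions. The standard library's scrypt stands in for argon2id so the example runs without third-party packages. A miss still performs one slow hash so that hits and misses take comparable time, which is only a partial answer to the timing concern.

```python
import hashlib
import hmac
import secrets
from typing import Dict, Optional, Tuple

INDEX_KEY = secrets.token_bytes(32)   # hypothetical server secret; load from config in practice
_DUMMY_SALT = secrets.token_bytes(16)

# lookup index -> (user_id, salt, slow hash); stands in for columns on the user record
_store: Dict[str, Tuple[str, bytes, bytes]] = {}


def lookup_index(code: str) -> str:
    """Keyed HMAC prefix (64 bits) used only to find the candidate row."""
    return hmac.new(INDEX_KEY, code.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def slow_hash(code: str, salt: bytes) -> bytes:
    # scrypt is a stand-in here; the notes above recommend argon2id.
    return hashlib.scrypt(code.encode("utf-8"), salt=salt, n=2**14, r=8, p=1, dklen=32)


def enroll(user_id: str, code: str) -> bool:
    """Store a code; return False on an index collision so the caller regenerates."""
    idx = lookup_index(code)
    if idx in _store:
        return False
    salt = secrets.token_bytes(16)
    _store[idx] = (user_id, salt, slow_hash(code, salt))
    return True


def find_account(code: str) -> Optional[str]:
    """Resolve a submitted code to an account with a single slow-hash verification."""
    entry = _store.get(lookup_index(code))
    if entry is None:
        slow_hash(code, _DUMMY_SALT)  # spend comparable time on a miss
        return None
    user_id, salt, expected = entry
    if hmac.compare_digest(expected, slow_hash(code, salt)):
        return user_id
    return None
```

With an index like this, uniqueness at generation time reduces to checking whether the index already exists, rather than hashing the new code against every stored record.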