feat: Recovery code flow for passkey-only accounts #26

Description

@rado0x54

Summary

In a passkey-only multi-tenant system, users who lose access to all their passkeys permanently lose their account. This issue defines a recovery code mechanism to mitigate that risk while maintaining strong security guarantees.

Relates to: #20 (Passkey auth), #17 (Multi-tenant)

Recovery Flow

A. Registration — Recovery Code Generation

During passkey registration, generate a high-entropy recovery code (e.g. 128-bit, displayed as grouped alphanumeric). The user must copy it to clipboard (or confirm they've saved it) before registration completes. The recovery code is stored as a bcrypt/argon2 hash — never in plaintext. Recovery codes must be globally unique across all accounts.
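A minimal sketch of the generation step, assuming a Crockford-style base32 alphabet and a 6×5 grouping. The function names, the alphabet, and the grouping are illustrative choices, not part of this proposal. Hashing with argon2id is left out because it needs a third-party library.

```python
import secrets

# Crockford-style base32 alphabet (no I, L, O, U) to reduce transcription errors.
# Assumption: the issue only asks for "grouped alphanumeric".
ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"


def generate_recovery_code(groups: int = 6, group_len: int = 5) -> str:
    """Return a random code formatted as hyphen-separated groups.

    Each character carries 5 bits, so 6 groups of 5 characters give
    150 bits, which is above the 128-bit target.
    """
    chars = [secrets.choice(ALPHABET) for _ in range(groups * group_len)]
    return "-".join(
        "".join(chars[i:i + group_len])
        for i in range(0, len(chars), group_len)
    )


def normalize_recovery_code(user_input: str) -> str:
    """Canonicalize input before hashing: uppercase and drop anything
    outside the alphabet (hyphens, spaces, and so on)."""
    return "".join(ch for ch in user_input.upper() if ch in ALPHABET)
```

Normalizing before hashing means a user who types the code in lowercase or with spaces instead of hyphens still produces the same hash input.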

B. Initiate Recovery

When a user has lost all passkeys, they enter only their recovery code on an unauthenticated open endpoint (POST /api/auth/recover). No username or account identifier is provided — the code itself identifies the account.

C. Rate Limiting & Timing Protection

The recovery endpoint is heavily rate-limited to 1 request per hour per IP. Since recovery codes are global (no user identifier is submitted), rate limiting can only be scoped to IP. Constant-time comparison is used against all stored hashes to prevent timing attacks. Consider adding CAPTCHA or proof-of-work to further protect this unauthenticated endpoint.
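The per-IP limit could look roughly like the following in-memory sketch. The class name and the injectable clock are assumptions made for illustration; a real deployment would more likely keep this state in a shared store so that it survives restarts and works across instances.

```python
import time
from typing import Callable, Dict


class PerIpRateLimiter:
    """Allow at most one request per `window_seconds` for each IP."""

    def __init__(self, window_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.clock = clock
        self._last_allowed: Dict[str, float] = {}

    def allow(self, ip: str) -> bool:
        now = self.clock()
        last = self._last_allowed.get(ip)
        if last is not None and now - last < self.window_seconds:
            # Rejected requests do not extend the window.
            return False
        self._last_allowed[ip] = now
        return True
```

Whether a rejected request should reset the window is a design choice this issue does not settle; the sketch leaves the window untouched on rejection.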

D. 24h Freeze Period

On successful recovery code entry:

  • The account enters a 24-hour freeze — no login, no passkey changes.
  • All connected devices/sessions receive a push notification (mechanism TBD) with a secret cancel link that requires no authentication.
  • The cancel link allows any current session holder to abort the recovery if it was unauthorized.
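One way the freeze and the cancel token could be created is sketched below. The function names and the dictionary layout are hypothetical. SHA-256 is used for the token hash on the assumption that the token is a high-entropy random value, so a slow password hash is not needed for it. Single-use enforcement is not shown here.

```python
import hashlib
import hmac
import secrets
from datetime import datetime, timedelta, timezone

FREEZE_DURATION = timedelta(hours=24)


def start_freeze(now: datetime) -> dict:
    """Create the freeze window and a secret cancel token."""
    token = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoding
    return {
        "cancel_token": token,  # goes only into the notification link
        "cancel_token_hash": hashlib.sha256(token.encode()).hexdigest(),
        "freeze_until": now + FREEZE_DURATION,
    }


def cancel_token_valid(presented: str, stored_hash: str,
                       freeze_until: datetime, now: datetime) -> bool:
    """Accept the token only while the freeze window is still open."""
    if now >= freeze_until:
        return False
    presented_hash = hashlib.sha256(presented.encode()).hexdigest()
    return hmac.compare_digest(presented_hash, stored_hash)
```

Only `cancel_token_hash` and `freeze_until` would be persisted; the raw token exists only in the link sent to connected sessions.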

E. Cancel Flow

If the recovery is cancelled via the secret link:

  • The recovery request is voided.
  • The user is guided through generating a new recovery code (old one is invalidated).
  • Login is NOT granted — the legitimate owner must initiate a fresh recovery flow with the new code if they still need access.
  • This prevents an attacker from using the cancel flow as a login bypass.

F. Post-Freeze Completion

After the 24h freeze expires (uncancelled):

  • The user enters the recovery code again (second verification).
  • On success, they register a new passkey.
  • A new recovery code is generated and must be saved (old one is invalidated).
  • Normal account access is restored.
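Steps D through F can be read as a small state machine. The sketch below is one possible encoding; the state names and the `RecoverySession` class are assumptions for illustration, and verification of the recovery code itself is reduced to a boolean argument.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum


class RecoveryState(Enum):
    FROZEN = "frozen"
    CANCELLED = "cancelled"
    COMPLETED = "completed"


@dataclass
class RecoverySession:
    freeze_until: datetime
    state: RecoveryState = RecoveryState.FROZEN

    def cancel(self, now: datetime) -> bool:
        """Step E: void the request, only while the freeze is active."""
        if self.state is RecoveryState.FROZEN and now < self.freeze_until:
            self.state = RecoveryState.CANCELLED
            return True
        return False

    def complete(self, now: datetime, code_verified: bool) -> bool:
        """Step F: allowed only after the freeze has expired uncancelled
        and the recovery code has been verified a second time."""
        if (self.state is RecoveryState.FROZEN
                and now >= self.freeze_until
                and code_verified):
            self.state = RecoveryState.COMPLETED
            return True
        return False
```

Because `cancel` never moves the session toward `COMPLETED`, the cancel path cannot be used to gain access, which matches the intent of step E.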

Open Questions / Design Decisions

  1. Global code lookup cost: Since no user identifier is submitted, the server must check the recovery code against all stored hashes. With argon2id this is expensive at scale. Options: (a) use a keyed HMAC prefix as a fast lookup index (leaks timing info — needs care), (b) limit total user count (acceptable for ShellWatch's scale), (c) use a less expensive hash for the lookup step with argon2id for final verification.

  2. Notification channel: Push notifications require a registered device/app. If this is a web-only app, what's the notification mechanism? WebSocket to active sessions? Email (adds a second factor but contradicts passkey-only purity)? Needs decision.

  3. Cancel link security: Must be single-use and time-bound (valid only during the freeze window). If an attacker has access to a connected device, they could intercept the cancel link and obtain the new recovery code. At minimum: single-use, HTTPS-only, and backed by a short-lived, cryptographically random token.

  4. Recovery code not rotated between steps B→F: The same code is valid at initiation (step B) and at post-freeze re-entry (step F). An attacker who observed the code once can use it at both steps. The 24h freeze + cancel notification is the mitigation. Acceptable tradeoff?

  5. No connected devices scenario: If the user genuinely lost everything (no active sessions, no devices), the freeze period is just a 24h wait with no one to notify. This is by design — the freeze still protects against rushed takeovers.

  6. Recovery code entropy: Must be high enough to prevent brute-force across all accounts (not just one). Suggest minimum 128-bit entropy. With N accounts, the chance of a random guess hitting any account is N/keyspace — must remain negligible even under sustained 1-req/hour attempts.

  7. Account lockout after N failed recovery attempts: Since rate limiting is per-IP only, a distributed attacker could attempt more frequently. Consider additional defenses: global rate limiting, exponential backoff, or temporary lockout after repeated failures from different IPs targeting the same code hash.
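Option (a) from question 1 could be sketched as follows. Everything here is an assumption for illustration: the names, the 8-byte prefix length, the in-memory `STORE`, and the use of PBKDF2 from the standard library as a stand-in for argon2id. The miss path runs a dummy slow hash so that hits and misses take roughly the same time, which is one way to address the timing concern noted in that option.

```python
import hashlib
import hmac
import os
from typing import Dict, Optional, Tuple

# Hypothetical server-side secret; in practice loaded from a secret store.
INDEX_KEY = os.urandom(32)
DUMMY_SALT = os.urandom(16)

# lookup index -> (account_id, salt, slow hash of the code)
STORE: Dict[str, Tuple[str, bytes, bytes]] = {}


def lookup_index(code: str) -> str:
    """Keyed HMAC prefix used only to find the candidate record."""
    mac = hmac.new(INDEX_KEY, code.encode(), hashlib.sha256).digest()
    return mac[:8].hex()


def slow_hash(code: str, salt: bytes) -> bytes:
    # Stand-in for argon2id, which is not in the standard library.
    return hashlib.pbkdf2_hmac("sha256", code.encode(), salt, 100_000)


def enroll(account_id: str, code: str) -> None:
    salt = os.urandom(16)
    STORE[lookup_index(code)] = (account_id, salt, slow_hash(code, salt))


def find_account(code: str) -> Optional[str]:
    record = STORE.get(lookup_index(code))
    if record is None:
        slow_hash(code, DUMMY_SALT)  # keep miss timing close to hit timing
        return None
    account_id, salt, expected = record
    if hmac.compare_digest(slow_hash(code, salt), expected):
        return account_id
    return None
```

With this shape the server performs one slow hash per request regardless of user count. The index is only as secret as `INDEX_KEY`, so a leaked key plus a leaked database would let an attacker test guesses at HMAC speed against the index.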

Implementation Notes

  • Recovery code hash stored alongside user record (argon2id recommended)
  • Codes must be globally unique — check for collisions at generation time
  • Freeze state stored as a timestamp + recovery session token in DB
  • Cancel link token: cryptographically random, stored hashed, single-use
  • All recovery-related endpoints must be audit-logged
  • Consider a dedicated recovery_sessions table tracking attempts, freezes, and outcomes
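A possible shape for the recovery_sessions table, sketched with SQLite from the Python standard library. The column names, the state values, and the `consume_cancel_token` helper are assumptions, not a settled schema. The single `UPDATE` shows how single-use and time-bound checks on the cancel token could be enforced atomically.

```python
import sqlite3

SCHEMA = """
CREATE TABLE recovery_sessions (
    id                INTEGER PRIMARY KEY,
    user_id           INTEGER NOT NULL,
    state             TEXT NOT NULL
                      CHECK (state IN ('frozen', 'cancelled', 'completed')),
    freeze_until      TEXT NOT NULL,         -- ISO 8601 UTC timestamp
    cancel_token_hash TEXT NOT NULL UNIQUE,  -- hash only, never the raw token
    cancel_used_at    TEXT,                  -- set once to enforce single use
    created_at        TEXT NOT NULL
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(SCHEMA)


def consume_cancel_token(conn: sqlite3.Connection,
                         token_hash: str, now_iso: str) -> bool:
    """Mark the session cancelled if the token is unused and the freeze
    is still active. Returns True only for the first valid use."""
    cur = conn.execute(
        "UPDATE recovery_sessions "
        "SET state = 'cancelled', cancel_used_at = ? "
        "WHERE cancel_token_hash = ? AND cancel_used_at IS NULL "
        "AND state = 'frozen' AND freeze_until > ?",
        (now_iso, token_hash, now_iso),
    )
    conn.commit()
    return cur.rowcount == 1
```

The timestamp comparison relies on all values using the same ISO 8601 format, so that string ordering matches chronological ordering.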
