feat: Recovery code flow for passkey-only accounts #26

## Summary

In a passkey-only multi-tenant system, users who lose access to all their passkeys permanently lose their account. This issue defines a recovery code mechanism to mitigate that risk while maintaining strong security guarantees.

Relates to: #20 (Passkey auth), #17 (Multi-tenant)

## Recovery Flow

### A. Registration — Recovery Code Generation

During passkey registration, generate a high-entropy recovery code (e.g. 128 bits, displayed as grouped alphanumeric characters). The user **must** copy it to the clipboard (or confirm they've saved it) before registration completes. Only a hash of the recovery code is stored (bcrypt or argon2; argon2id is the recommendation under Implementation Notes), never the plaintext. Recovery codes must be **globally unique** across all accounts.
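A minimal sketch of the generation step, assuming Python's standard `secrets` module and a Base32 alphabet for display. The function names, the 4-character grouping, and the hyphen separator are illustrative choices, not part of this spec. Hashing is left out because argon2id needs a third-party library.

```python
import base64
import secrets


def generate_recovery_code(n_bytes: int = 16, group_size: int = 4) -> str:
    """Return a random code carrying n_bytes * 8 bits of entropy, grouped for display."""
    raw = secrets.token_bytes(n_bytes)  # CSPRNG output; 16 bytes = 128 bits
    # Base32 yields only A-Z and 2-7, so the digits 0 and 1 never appear.
    text = base64.b32encode(raw).decode("ascii").rstrip("=")
    groups = [text[i:i + group_size] for i in range(0, len(text), group_size)]
    return "-".join(groups)


def normalize_recovery_code(user_input: str) -> str:
    """Undo display formatting so hyphens, spaces, and case do not matter on entry."""
    return "".join(ch for ch in user_input.upper() if ch.isalnum())
```

With the defaults, the code is 26 Base32 characters shown as six groups of four plus one group of two. Normalizing before hashing keeps the stored hash independent of how the user types the code back in.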

### B. Initiate Recovery

When a user has lost all passkeys, they enter **only** their recovery code on an **unauthenticated** open endpoint (`POST /api/auth/recover`). No username or account identifier is provided — the code itself identifies the account.
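One way the code alone could identify an account is a keyed lookup index, which is option (a) under Open Questions. The sketch below assumes a server-side secret (`INDEX_KEY`, hypothetical) and uses an in-memory dictionary as a stand-in for a database column with a unique index. It covers only the lookup step; the slow-hash verification and the timing concerns raised elsewhere in this issue still apply.

```python
import hashlib
import hmac
from typing import Dict, Optional

# Hypothetical server-side secret; in practice it would come from configuration.
INDEX_KEY = b"replace-with-32-random-bytes-from-config"


def lookup_index(normalized_code: str, key: bytes = INDEX_KEY) -> str:
    """Deterministic keyed digest of the code, usable as a unique lookup column."""
    return hmac.new(key, normalized_code.encode("utf-8"), hashlib.sha256).hexdigest()


class RecoveryIndex:
    """In-memory stand-in for a table with a UNIQUE index on the lookup digest."""

    def __init__(self) -> None:
        self._by_index: Dict[str, int] = {}

    def register(self, user_id: int, normalized_code: str) -> bool:
        idx = lookup_index(normalized_code)
        if idx in self._by_index:
            return False  # collision: the caller should generate a fresh code
        self._by_index[idx] = user_id
        return True

    def find_user(self, normalized_code: str) -> Optional[int]:
        """Return the matching account id, or None if no code matches."""
        return self._by_index.get(lookup_index(normalized_code))
```

Because the digest is deterministic, the same structure also gives the global uniqueness check at generation time: `register` returning `False` signals a collision.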

### C. Rate Limiting & Timing Protection

The recovery endpoint is **heavily rate-limited** to **1 request per hour per IP**. Since recovery codes are global (no user identifier is submitted), rate limiting can only be scoped to IP. Constant-time comparison is used against all stored hashes to prevent timing attacks. Consider adding CAPTCHA or proof-of-work to further protect this unauthenticated endpoint.
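The per-IP limit can be sketched as a small in-memory limiter. This is an illustration under simplifying assumptions: a single process, no eviction of old entries, and a class name (`PerIpRateLimiter`) invented for the example. A real deployment would more likely keep this state in a shared store.

```python
import time
from typing import Callable, Dict


class PerIpRateLimiter:
    """Allow at most one request per window for each client IP."""

    def __init__(
        self,
        window_seconds: float = 3600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._window = window_seconds
        self._clock = clock  # injectable so tests can control time
        self._last_allowed: Dict[str, float] = {}

    def allow(self, ip: str) -> bool:
        now = self._clock()
        last = self._last_allowed.get(ip)
        if last is not None and now - last < self._window:
            return False  # still inside the window: reject
        self._last_allowed[ip] = now
        return True
```

In this sketch a rejected request does not extend the window. Whether rejected attempts should reset the timer is a policy choice this issue does not specify.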

### D. 24h Freeze Period

On successful recovery code entry:
- The account enters a **24-hour freeze** — no login, no passkey changes.
- All connected devices/sessions receive a **push notification** (mechanism TBD) with a **secret cancel link** that requires no authentication.
- The cancel link allows any current session holder to abort the recovery if it was unauthorized.
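The freeze and its cancel token could be represented roughly as follows. The names (`Freeze`, `start_freeze`, `is_frozen`) are hypothetical, and SHA-256 is assumed as the token hash on the grounds that the token is already high-entropy random data. Delivering the link to connected sessions is out of scope here, since that mechanism is still TBD.

```python
import hashlib
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Tuple

FREEZE_DURATION = timedelta(hours=24)


@dataclass(frozen=True)
class Freeze:
    frozen_until: datetime
    cancel_token_hash: str  # only the hash is persisted, never the token


def start_freeze(now: datetime) -> Tuple[Freeze, str]:
    """Create the freeze record plus the plaintext token to embed in the cancel link."""
    token = secrets.token_urlsafe(32)  # 32 random bytes as URL-safe text
    token_hash = hashlib.sha256(token.encode("ascii")).hexdigest()
    freeze = Freeze(frozen_until=now + FREEZE_DURATION, cancel_token_hash=token_hash)
    return freeze, token


def is_frozen(freeze: Freeze, now: datetime) -> bool:
    """True while login and passkey changes must be refused."""
    return now < freeze.frozen_until
```

The plaintext token is returned once so it can be placed in the notification, and it is not recoverable from the stored record afterwards.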

### E. Cancel Flow

If the recovery is cancelled via the secret link:
- The recovery request is voided.
- The user is guided through generating a **new recovery code** (old one is invalidated).
- **Login is NOT granted** — the legitimate owner must initiate a fresh recovery flow with the new code if they still need access.
- This prevents an attacker from using the cancel flow as a login bypass.
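The cancel rules above can be sketched as a single handler. Everything named here (`RecoveryRequest`, `CancelResult`, `cancel_recovery`) is hypothetical, and the token is assumed to be stored as a SHA-256 hex digest. The point of the sketch is that a successful cancel voids the request and forces code rotation but never produces a session.

```python
import hashlib
import hmac
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class RecoveryRequest:
    cancel_token_hash: str
    frozen_until: datetime
    cancelled: bool = False
    token_used: bool = False


@dataclass(frozen=True)
class CancelResult:
    cancelled: bool
    must_rotate_recovery_code: bool
    session_granted: bool = False  # the cancel flow never logs anyone in


def cancel_recovery(req: RecoveryRequest, presented_token: str, now: datetime) -> CancelResult:
    presented_hash = hashlib.sha256(presented_token.encode("utf-8")).hexdigest()
    token_ok = hmac.compare_digest(presented_hash, req.cancel_token_hash)
    in_window = now < req.frozen_until  # link is valid only during the freeze
    if not token_ok or not in_window or req.token_used:
        return CancelResult(cancelled=False, must_rotate_recovery_code=False)
    req.token_used = True   # single-use
    req.cancelled = True    # the recovery request is voided
    return CancelResult(cancelled=True, must_rotate_recovery_code=True)
```

Returning a result object with `session_granted` fixed to `False` makes the "no login bypass" rule explicit to callers.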

### F. Post-Freeze Completion

After the 24h freeze expires (uncancelled):
- The user enters the recovery code **again** (second verification).
- On success, they register a **new passkey**.
- A **new recovery code** is generated and must be saved (old one is invalidated).
- Normal account access is restored.
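Steps D through F amount to a small state machine. The sketch below is one possible encoding, with state names chosen for illustration; `READY` is an assumed name for "freeze expired uncancelled, awaiting the second code entry".

```python
from enum import Enum


class RecoveryState(Enum):
    FROZEN = "frozen"        # step D: code accepted, 24h freeze running
    CANCELLED = "cancelled"  # step E: aborted via the cancel link
    READY = "ready"          # freeze expired uncancelled, awaiting second code entry
    COMPLETED = "completed"  # step F: new passkey and new recovery code issued


ALLOWED_TRANSITIONS = {
    RecoveryState.FROZEN: {RecoveryState.CANCELLED, RecoveryState.READY},
    RecoveryState.READY: {RecoveryState.COMPLETED},
    RecoveryState.CANCELLED: set(),  # terminal
    RecoveryState.COMPLETED: set(),  # terminal
}


def advance(current: RecoveryState, target: RecoveryState) -> RecoveryState:
    """Return the new state, or raise if the flow does not permit the move."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Making `FROZEN -> COMPLETED` illegal encodes the requirement that recovery cannot finish before the freeze expires.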

## Open Questions / Design Decisions

1. **Global code lookup cost**: Since no user identifier is submitted, the server must check the recovery code against all stored hashes. With argon2id this is expensive at scale. Options: (a) use a keyed HMAC prefix as a fast lookup index (leaks timing info — needs care), (b) limit total user count (acceptable for ShellWatch's scale), (c) use a less expensive hash for the lookup step with argon2id for final verification.

2. **Notification channel**: Push notifications require a registered device/app. If this is a web-only app, what's the notification mechanism? WebSocket to active sessions? Email (adds a second factor but contradicts passkey-only purity)? Needs decision.

3. **Cancel link security**: The link must be single-use and time-bound (valid only during the freeze window). If an attacker has access to a connected device, they could intercept the cancel link and, through the cancel flow, obtain the new recovery code. At minimum: single-use, HTTPS-only, and backed by a cryptographically random token that expires with the freeze window.

4. **Recovery code not rotated between steps B→F**: The same code is valid at initiation (step B) and post-freeze re-entry (step F). An attacker who observed the code once can use it at both steps. The 24h freeze + cancel notification is the mitigation. Is this an acceptable tradeoff?

5. **No connected devices scenario**: If the user genuinely lost everything (no active sessions, no devices), the freeze period is just a 24h wait with no one to notify. This is by design — the freeze still protects against rushed takeovers.

6. **Recovery code entropy**: Must be high enough to prevent brute-force across all accounts (not just one). Suggest minimum 128-bit entropy. With N accounts, the chance of a random guess hitting any account is N/keyspace — must remain negligible even under sustained 1-req/hour attempts.

7. **Account lockout after N failed recovery attempts**: Since rate limiting is per-IP only, a distributed attacker could attempt more frequently. Consider additional defenses: global rate limiting, exponential backoff, or temporary lockout after repeated failures from different IPs targeting the same code hash.
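To put numbers on the N/keyspace estimate in question 6, here is a rough calculation using a union bound. The account count, the number of attacking IPs, and the ten-year horizon are illustrative assumptions, not figures from this issue.

```python
from fractions import Fraction

KEYSPACE = 2 ** 128  # 128-bit recovery codes


def hit_probability_bound(accounts: int, attempts: int, keyspace: int = KEYSPACE) -> Fraction:
    """Union bound: P(at least one guess hits any account) <= attempts * accounts / keyspace."""
    return min(Fraction(1), Fraction(attempts * accounts, keyspace))


# Illustrative assumptions: one million accounts and one million attacking IPs,
# each IP limited to 1 request per hour, sustained for ten years.
ACCOUNTS = 1_000_000
ATTEMPTS = 1_000_000 * 24 * 365 * 10

single_guess = Fraction(ACCOUNTS, KEYSPACE)
ten_year_bound = hit_probability_bound(ACCOUNTS, ATTEMPTS)

print(f"single guess:   {float(single_guess):.2e}")
print(f"ten-year bound: {float(ten_year_bound):.2e}")
```

Under these assumptions a single guess succeeds with probability on the order of 1e-33, and the ten-year bound stays around 1e-22. This suggests that at 128 bits the rate limit matters more for load and abuse control than for guessing resistance.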

## Implementation Notes

- Recovery code hash stored alongside user record (argon2id recommended)
- Codes must be globally unique — check for collisions at generation time
- Freeze state stored as a timestamp + recovery session token in DB
- Cancel link token: cryptographically random, stored hashed, single-use
- All recovery-related endpoints must be audit-logged
- Consider a dedicated `recovery_sessions` table tracking attempts, freezes, and outcomes
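A possible shape for the `recovery_sessions` table, sketched with SQLite through Python's standard `sqlite3` module. Every column name and the set of state values are assumptions made for illustration; the real schema depends on the project's database and on decisions still open above.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS recovery_sessions (
    id                INTEGER PRIMARY KEY,
    user_id           INTEGER NOT NULL,
    state             TEXT NOT NULL
                      CHECK (state IN ('frozen', 'cancelled', 'ready', 'completed')),
    frozen_until      TEXT NOT NULL,          -- ISO 8601 UTC timestamp
    cancel_token_hash TEXT NOT NULL UNIQUE,   -- hash only, never the token itself
    cancel_token_used INTEGER NOT NULL DEFAULT 0,
    source_ip         TEXT,                   -- kept for audit logging
    created_at        TEXT NOT NULL,
    resolved_at       TEXT                    -- set on cancel or completion
);
"""


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a database and make sure the recovery_sessions table exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

The `CHECK` constraint rejects unknown states at the database level, and the `UNIQUE` constraint on `cancel_token_hash` lets a cancel request be resolved to exactly one session.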
