Problem
VaultBackendImpl::admin_change (crates/mqdb-vault/src/backend.rs:540-648) rotates the passphrase in three writes:
migration_start — install new salt + new check, plus old check/salt as resume markers, set vault_migration_status: pending, vault_migration_mode: re_encrypt.
batch_vault_re_encrypt — decrypt each record with old_crypto, re-encrypt with new_crypto, write back.
migration_done — clear vault_migration_status, vault_migration_mode, vault_old_check, vault_old_salt.
After PR #65 surfaces step-3 failures, a real corruption path becomes visible: if step 2 succeeds and step 3 fails, the identity is left with status: pending, mode: re_encrypt, both checks present, and records encrypted under the new passphrase.
On the user's next unlock with the new passphrase, resume_pending_migration (crates/mqdb-vault/src/ops.rs:122-196) fires:
let old_crypto = VaultCrypto::derive(passphrase, &old_salt); // passphrase is now the NEW one
batch_vault_re_encrypt(db, ownership, id, &old_crypto, crypto).await;
old_crypto here is derived from the new passphrase plus the old salt — it cannot decrypt records that are already encrypted with new_crypto. The decrypt step produces garbage, then encrypts that garbage with new_crypto and writes it back. Data corruption.
admin_disable has the same structural risk: if the post-batch identity update (backend.rs:519-526) fails after batch_vault_operation has decrypted records, identity still reports vault_enabled: true while records are plaintext. Next unlock will derive a key, verify_check_token succeeds against the stale vault_check, and reads will treat plaintext as ciphertext.
Reproduction
Inject a failure on the migration_done update_entity call (e.g., DB transient error or Database::update returning Err(Internal)) after batch_vault_re_encrypt completes. The user then unlocks normally with the new passphrase. Verify subsequent reads of owned records return garbage.
Proposed fix
The resume path assumes the passphrase passed to it can decrypt records under `old_salt`. That is only true while `status == pending` and the user is supplying the old passphrase. Possible approaches:
- Atomic finalize: write `migration_done` before the last record is re-encrypted is unsafe; instead, make migration_done idempotent and retry it under a bounded loop before returning success. If retry exhausts, return Internal and leave the records re-encrypted but the markers stale — the next unlock with the new passphrase should then detect this state (records re-encrypted, status pending) and just clear the markers, not run the resume.
- Detect-and-skip: on resume, attempt to decrypt a probe record with `old_crypto` before calling `batch_vault_re_encrypt`. If decrypt fails the integrity check, skip the resume and clear markers.
- Two-phase commit on the identity: split the markers so `migration_done` only clears `vault_migration_status`, but `vault_old_check`/`vault_old_salt` are cleared in a separate idempotent step. The resume path can then use `vault_old_check` to verify that the supplied passphrase still matches the old salt before running re-encrypt.
Approach (2) is the cheapest and probably correct: `VaultCrypto::verify_check_token` on `vault_old_check` with the supplied passphrase + old_salt would tell us whether the resume is legitimate. If not, we're in the half-finalized state — just clear the markers.
Scope
Problem
VaultBackendImpl::admin_change(crates/mqdb-vault/src/backend.rs:540-648) rotates the passphrase in three writes:migration_start— install new salt + new check, plus old check/salt as resume markers, setvault_migration_status: pending, vault_migration_mode: re_encrypt.batch_vault_re_encrypt— decrypt each record withold_crypto, re-encrypt withnew_crypto, write back.migration_done— clearvault_migration_status,vault_migration_mode,vault_old_check,vault_old_salt.After PR #65 surfaces step-3 failures, a real corruption path becomes visible: if step 2 succeeds and step 3 fails, the identity is left with
status: pending, mode: re_encrypt, both checks present, and records encrypted under the new passphrase.On the user's next unlock with the new passphrase,
resume_pending_migration(crates/mqdb-vault/src/ops.rs:122-196) fires:old_cryptohere is derived from the new passphrase plus the old salt — it cannot decrypt records that are already encrypted withnew_crypto. The decrypt step produces garbage, then encrypts that garbage withnew_cryptoand writes it back. Data corruption.admin_disablehas the same structural risk: if the post-batch identity update (backend.rs:519-526) fails afterbatch_vault_operationhas decrypted records, identity still reportsvault_enabled: truewhile records are plaintext. Next unlock will derive a key,verify_check_tokensucceeds against the stalevault_check, and reads will treat plaintext as ciphertext.Reproduction
Inject a failure on the
migration_doneupdate_entitycall (e.g., DB transient error orDatabase::updatereturningErr(Internal)) afterbatch_vault_re_encryptcompletes. The user then unlocks normally with the new passphrase. Verify subsequent reads of owned records return garbage.Proposed fix
The resume path assumes the passphrase passed to it can decrypt records under `old_salt`. That is only true while `status == pending` and the user is supplying the old passphrase. Possible approaches:
Approach (2) is the cheapest and probably correct: `VaultCrypto::verify_check_token` on `vault_old_check` with the supplied passphrase + old_salt would tell us whether the resume is legitimate. If not, we're in the half-finalized state — just clear the markers.
Scope