Skip to content

feat(checkpoint): state migration for checkpoints (proposal 0014)#46

Open
chris-colinsky wants to merge 4 commits into
mainfrom
feature/0014-state-migration
Open

feat(checkpoint): state migration for checkpoints (proposal 0014)#46
chris-colinsky wants to merge 4 commits into
mainfrom
feature/0014-state-migration

Conversation

@chris-colinsky
Copy link
Copy Markdown
Member

Summary

  • Implements spec proposal 0014 (state-migration hooks for checkpoints) in openarmature.checkpoint + openarmature.graph. PR-4 of the five-PR batch following PR-1 (feat(llm): structured output (proposal 0016) #42), PR-2 (feat(llm): image content blocks (proposal 0015) #44), PR-3 (feat(prompts): prompt-management core (proposal 0017) #45).
  • Repurposes CheckpointRecord.schema_version from the prior backend-internal record-shape role to a user-facing state-schema identifier per spec §10.2. Saved records carry whatever the user declared on type(state).schema_version at save time; on resume, version mismatches route through a registered migration chain instead of failing.
  • New surface in openarmature.checkpoint: StateMigration (frozen dataclass — from_version / to_version / migrate callable) and MigrationRegistry (BFS chain resolution + duplicate-edge AND multi-shortest-path ambiguity detection per §10.12.2).
  • GraphBuilder.with_state_migration(from, to, migrate) and with_state_migrations(*migrations) populate the registry; compile() threads it through to the CompiledGraph. On resume, the engine reads record.schema_version, compares against type(state_cls).schema_version, and applies the resolved chain (or raises one of two new error categories).
  • Two new error categories: CheckpointStateMigrationMissing (no chain bridges the versions, or chain ambiguous) and CheckpointStateMigrationFailed (a migration function raised). Both non-transient. Post-migration deserialization failures still surface as CheckpointRecordInvalid per §10.12.4 — the three categories are mutually exclusive on any given resume.
  • Checkpointer.supports_state_migration: bool marks whether a backend can expose a structural intermediate form of the loaded state (the dict the migration registry consumes). SQLiteCheckpointer(serialization="json") opts in; SQLiteCheckpointer(serialization="pickle") and InMemoryCheckpointer opt out. Mismatched version against a non-migration-eligible backend raises CheckpointRecordInvalid per §10.12.1.

What's new

Surface Change
StateMigration, MigrationRegistry New types exported from openarmature.checkpoint. Registry's resolve_chain runs BFS and raises ValueError on ambiguous-shortest-paths per §10.12.2's load-time-detection allowance.
CheckpointStateMigrationMissing, CheckpointStateMigrationFailed Two new canonical categories per spec §10.10. The first carries from_version / to_version / registered_migrations_count / registry_description for actionable diagnostics.
State.schema_version: ClassVar[str] = "" Empty-string sentinel per spec §10.2. Declare a non-empty value to opt the state class into the migration registry.
GraphBuilder.with_state_migration / with_state_migrations Registration surface per §10.12.1. Duplicate (from, to) pairs raise ValueError at registration.
Checkpointer.supports_state_migration Protocol attribute (not ClassVar) so SQLite can compute it per-instance based on serialization mode.
CompiledGraph._migrate_record Resume-path routing: unsupported-backend → registry lookup → chain application → final deserialization. Order matters; documented inline.
Parent-state migration Same chain applied to each entry of parent_states in lockstep with the outer state per §10.12.2. Code comment records the spec-mandated equivalence so future contributors don't add per-parent metadata without a follow-on proposal.
CHECKPOINT_SCHEMA_VERSION Removed. The constant served the prior backend-internal record-shape role that the spec moved out of schema_version. If future record-shape evolution needs versioning, backends add their own private field.

Release gate

PR-4 of a five-PR batch (00160015001700140011). Do not tag a release until all five land — the CHANGELOG [Unreleased] Notes section carries the gate from PR-1. PR-5 (proposal 0011, parallel branches) is the only remaining piece.

Commits

  1. feat(checkpoint): state migration registry, types, errors, builder surface
  2. test(conformance): drive 0014 state-migration fixtures 039-046
  3. test(unit): state migration + docs(concepts) state migrations section + CHANGELOG

Notable implementation details

  • Multi-shortest-path detection lands at resolve-time, not compile-time. Spec §10.12.2 accepts both; resolve-time matches the BFS shape closely (one extra path-collection pass during the search) without needing a separate static analysis. Resolve-time misses ambiguity in chains that never trigger on a real resume, but that's a deferred cost.
  • schema_version semantic shift is a clean break. Pre-PR-4 records carrying the old "1" constant get reinterpreted as user-facing v1 identifiers. Per Chris's note, no compat shim — single-user dev, no production migration. CHANGELOG flags the shift under Changed.
  • supports_state_migration is an instance attribute, not ClassVar. Started as ClassVar[bool] per the original plan, but SQLiteCheckpointer's answer depends on the constructor's serialization arg. The Protocol declares it as bool = False; backends override either at the class level (InMemoryCheckpointer = False) or per-instance (SQLite computed in __init__). Pyright accepts both forms.
  • Parent-state migration is lockstep, not sequential. Each migration step runs once on the outer state AND once per parent_states[i] before advancing to the next step. End state is identical to sequential ordering given pure migrations (spec §10.12.2 mandates purity), but lockstep gives cleaner failure attribution (step k failed on outer-or-parent-i) and sets up per-migration spans for a follow-on. Documented in the engine path's code comment.
  • InMemoryCheckpointer is no longer migration-eligible. It holds live typed-state references; there's no class-independent intermediate. Mismatched-version resume against in-memory backend raises CheckpointRecordInvalid per §10.12.1 — same category as before, slightly different reason.

Conformance fixtures (039–046)

All 8 fixtures pass:

Fixture Asserts
039 additive-field Single v1 → v2 migration populates a new field; one migration ran
040 chain Two-step v1 → v2 → v3 chain runs in order
041 missing Empty registry on mismatch raises CheckpointStateMigrationMissing with registered_migrations_count: 0
042 versions-match-no-op Registry NOT consulted when versions match; should_not_run mock never called
043 parent-states-migrated Migration runs once per outer state + once per parent_states[i]
044 post-migration-deserialization-fails Migration ran successfully but produced output missing a required field → CheckpointRecordInvalid (per §10.12.4)
045 no-path-in-registry Migrations registered but none bridge the versions → CheckpointStateMigrationMissing with registered_migrations_count: 1 and the registered set in registry_description
046 function-raises Migration raises KeyErrorCheckpointStateMigrationFailed with the KeyError preserved as __cause__

Harness extensions

  • Migration mock library under tests/conformance/test_state_migration.py: add_new_field_default, add_v2_field, add_v3_field, identity_passthrough, raises_keyerror, should_not_run, irrelevant. Each fixture's migrate: <name> resolves through the library.
  • _MigrationTrace wraps every mock to capture invocation order, supporting migrations_run / migration_count / single_migration_invocation / migration_order_matches_chain assertions across all 8 fixtures.
  • _seed_record persists a CheckpointRecord matching the fixture's seeded_record: block before invoke(resume_invocation=...) so the resume path has data to load.
  • StateSchema model in tests/conformance/harness/directives.py gains optional schema_version (default "") and a required: bool knob for fixture 044's required_v2_field no-default field.
  • _DEFERRED_FIXTURES in test_fixture_parsing.py loses the 039-046 rows; the existing CasesFixture permissive-extras path picks up the new shape unchanged.

Test plan

  • uv run pytest — 644 pass, 64 skipped (down from 79; 8 fixtures × 2 = 16 fewer skips, minus 1 new docs example for the state-migrations concept section), 0 failed.
  • uv run pyright — clean.
  • uv run ruff check + uv run ruff format — clean.
  • uv run --group docs mkdocs build --strict — clean.
  • All 8 state-migration conformance fixtures pass.
  • 16 unit tests under tests/unit/test_state_migration.py covering BFS edge cases, multi-shortest-path ambiguity, GraphBuilder ergonomics, and __cause__ preservation.
  • Manual: walk docs/concepts/checkpointing.md#state-migrations via mkdocs serve to confirm rendering.

Pre-1.0 SemVer

Breaking-ish: CheckpointRecord.schema_version semantic shift means existing records with the prior "1" constant value get reinterpreted as user-facing v1 identifiers. The CHECKPOINT_SCHEMA_VERSION module constant is removed. Pre-1.0 batch + single-user dev means no compatibility shim; CHANGELOG flags the shift under Changed.

Additive otherwise: new types, new errors, new builder method, new state ClassVar, new Protocol attribute. Existing free-form callers (no schema_version declared on their state, no migrations registered) see no behavior change — records carry "" and the engine never consults the registry.

…rface

Implements pipeline-utilities spec §10.12 (proposal 0014).

- New errors: CheckpointStateMigrationMissing,
  CheckpointStateMigrationFailed. Both non-transient per §10.10.
  The missing-chain error carries from_version / to_version /
  registered_migrations_count / registry_description for
  actionable operator diagnostics.
- New types: StateMigration (frozen dataclass — from_version,
  to_version, migrate callable) and MigrationRegistry (BFS
  chain resolution + ambiguity detection per §10.12.2).
- Multi-shortest-path detection: when BFS finds a shortest
  path AND a second distinct path of equal length exists,
  the registry raises ValueError per the spec's ambiguous-chain
  rule. Resume surfaces this as CheckpointStateMigrationMissing
  with the ambiguity description in the payload.
- State.schema_version: ClassVar[str] = '' (per spec §10.2's
  per-language carve-out). Empty-string sentinel; the framework
  reads type(state).schema_version at save time.
- Checkpointer Protocol: supports_state_migration: ClassVar[bool]
  marker per §10.12.1. InMemoryCheckpointer: False (typed in-
  memory references can't expose a class-independent
  intermediate). SQLiteCheckpointer: True in JSON mode, False
  in pickle mode (pickle holds class identity and round-trips
  to typed instances; can't bridge versions).
- GraphBuilder.with_state_migration / with_state_migrations
  thread a populated MigrationRegistry into CompiledGraph at
  compile time.
- Resume-path routing (compiled.py): version mismatch →
  unsupported-backend check → registry lookup → chain
  application (with per-migration failure wrap) → final
  deserialization. The post-migration deserialization failure
  still surfaces as CheckpointRecordInvalid per §10.12.4;
  pre-migration version mismatch routes through the new two
  categories. Order matters; documented inline so a future
  reader doesn't swap it back.
- Parent-state migration: same chain applied to each entry of
  parent_states in lockstep with the outer state per §10.12.2.
  Code comment records the spec-mandated equivalence so
  future contributors don't add per-parent metadata without a
  follow-on proposal.
- Drop the CHECKPOINT_SCHEMA_VERSION = '1' constant: per Q1
  spec answer, the old backend-internal record-shape role had
  no spec slot anyway. SQLiteCheckpointer no longer rejects
  records with non-default versions on load — that routing
  is now the engine's concern at resume time. Existing
  records carrying schema_version='1' get reinterpreted as
  user-facing v1 identifiers (single-user dev, no compat
  shim needed per Chris's note).
New tests/conformance/test_state_migration.py drives all 8 spec
state-migration fixtures end-to-end against the real engine
(SQLite JSON-mode backend, real graph compile, real resume path).

Harness pieces:

- Migration mock library: add_new_field_default, add_v2_field,
  add_v3_field, identity_passthrough, raises_keyerror,
  should_not_run, irrelevant. Each fixture's migrate: <name>
  resolves through the library.
- _MigrationTrace wraps each mock to capture invocation order
  for the migrations_run / migration_count /
  single_migration_invocation / migration_order_matches_chain
  assertions. Consecutive duplicates collapse (fixture 043 runs
  each step once for outer + once for each parent under the
  lockstep ordering; the assertion is per-step, not per-entity).
- _seed_record persists a checkpoint matching the fixture's
  seeded_record: block before invoke(resume_invocation=...) so
  the resume path has data to load.

Harness/model adjustments:

- StateSchema in tests/conformance/harness/directives.py gains
  optional schema_version (default '') and the required field
  knob (no-default Pydantic field, used by fixture 044's
  required_v2_field deserialization-failure case).
- _DEFERRED_FIXTURES in test_fixture_parsing.py loses the
  039-046 rows; the CasesFixture model parses them via permissive
  extras on CaseSpec.
- Initial-state construction in the resume path uses
  state_cls.model_construct() so fixtures with required fields
  (044) can pass a placeholder past Pydantic's validator before
  the resume even starts; the engine loads state from the
  checkpoint, not from the placeholder.

Protocol-attribute shape:

- Checkpointer.supports_state_migration declared as
  bool = False (not ClassVar) so SQLiteCheckpointer can set
  the value per-instance in __init__ based on serialization mode.
  Backends with a static answer (InMemoryCheckpointer) override
  at the class level with bool = False — Pyright accepts
  either shape because Protocol attribute conformance ignores
  the ClassVar marker on subclasses.
… + CHANGELOG

tests/unit/test_state_migration.py (16 tests) covers gaps the
conformance fixtures don't exercise directly: BFS edge cases on
the registry, multi-shortest-path ambiguity detection, GraphBuilder
ergonomics (singular + plural registration, duplicate-pair
ValueError), and error attribute carriage including __cause__
preservation on CheckpointStateMigrationFailed.

docs/concepts/checkpointing.md gains a State migrations section
covering: the State.schema_version declaration, the registration
surface (with_state_migration / with_state_migrations), BFS chain
resolution and ambiguity cases (duplicate edges + multi-shortest-
path), the two new error categories and how they relate to
CheckpointRecordInvalid (§10.12.4), backend support and why
SQLite-pickle / InMemory aren't migration-eligible, the lockstep
parent_states migration rule, and the migrations-MUST-be-pure
contract.

CHANGELOG.md gains two new Added entries (state-migration
surface + Checkpointer.supports_state_migration Protocol
attribute) plus a Changed entry documenting the
CheckpointRecord.schema_version semantic shift and the
CHECKPOINT_SCHEMA_VERSION constant removal. Pre-1.0 breaking
change covered by the consolidated-release flag.
Copilot AI review requested due to automatic review settings May 16, 2026 06:07
Spec proposal 0018 (chain-ambiguity category) landed in spec
v0.16.0 between PR-4 push and the spec agent's code review.
This commit adds the new error category and the required
routing swaps; PR-4 stays open as the carrier.

- Submodule pinned to v0.16.0; pyproject + __spec_version__ +
  test_smoke pin assertions follow.
- New canonical error class CheckpointStateMigrationChainAmbiguous
  (and the matching category string) added to checkpoint/errors.py,
  re-exported from openarmature.checkpoint. Non-transient. Carries
  optional from_version / to_version when known (always set for
  the registration-time case; resume-time multi-shortest-path also
  populates).
- MigrationRegistry.register raises
  CheckpointStateMigrationChainAmbiguous directly at registration
  time per spec §10.10 (proposal 0018), so the canonical category
  surfaces at the registration boundary without wrapping. The
  internal BFS keeps raising ValueError for multi-shortest-path
  detection; CompiledGraph._migrate_record's except branch now
  routes that to CheckpointStateMigrationChainAmbiguous at the
  resume boundary.
- Routing precedence on resume per §10.10 (v0.16.0): chain-ambiguous
  → missing → failed → record-invalid. Code-comment in
  _migrate_record documents the ordering so a future reader
  doesn't swap it back.
- Code comments record two design-choice notes the spec agent
  flagged: the load-bearing equal-depth-re-entry behavior in BFS
  cycle protection (tightening to <= breaks multi-shortest-path
  detection), and the registration-order ordering of describe()
  output.
- Conformance harness gains the expected_chain_ambiguity_error
  primitive. Accepts the canonical category at EITHER the case
  top-level (registration-time detection: registration raises,
  resume block unreachable) OR inside resume (load-time detection
  during BFS). Spec §10.12.2's compile-time-SHOULD /
  load-time-acceptable carve-out lands in the driver as a
  single try/except wrap around both phases.
- Fixture 047 (state-migration-chain-ambiguous) covered; all 9
  state-migration fixtures (039-047) pass.
- 3 new unit tests for the new category's class shape +
  attribute carriage; existing duplicate-pair tests updated to
  assert the canonical category instead of ValueError.
- docs/concepts/checkpointing.md updated for the third category +
  the v0.16.0 routing precedence. CHANGELOG entry updated and
  Pinned-version line bumped to 0.10.0 → 0.16.0.

OTel openarmature.checkpoint.migrate span emission deferred to
a follow-on; documented inline in _migrate_record. Spec §6
cross-ref is SHOULD-level, and the emission needs the
_InvocationContext that's currently created AFTER the migration
path runs. The follow-on can either restructure resume to build
the context first or use a side-channel observer dispatch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant