feat(checkpoint): state migration for checkpoints (proposal 0014)#46
Open
chris-colinsky wants to merge 4 commits into
Open
feat(checkpoint): state migration for checkpoints (proposal 0014)#46chris-colinsky wants to merge 4 commits into
chris-colinsky wants to merge 4 commits into
Conversation
…rface Implements pipeline-utilities spec §10.12 (proposal 0014). - New errors: CheckpointStateMigrationMissing, CheckpointStateMigrationFailed. Both non-transient per §10.10. The missing-chain error carries from_version / to_version / registered_migrations_count / registry_description for actionable operator diagnostics. - New types: StateMigration (frozen dataclass — from_version, to_version, migrate callable) and MigrationRegistry (BFS chain resolution + ambiguity detection per §10.12.2). - Multi-shortest-path detection: when BFS finds a shortest path AND a second distinct path of equal length exists, the registry raises ValueError per the spec's ambiguous-chain rule. Resume surfaces this as CheckpointStateMigrationMissing with the ambiguity description in the payload. - State.schema_version: ClassVar[str] = '' (per spec §10.2's per-language carve-out). Empty-string sentinel; the framework reads type(state).schema_version at save time. - Checkpointer Protocol: supports_state_migration: ClassVar[bool] marker per §10.12.1. InMemoryCheckpointer: False (typed in- memory references can't expose a class-independent intermediate). SQLiteCheckpointer: True in JSON mode, False in pickle mode (pickle holds class identity and round-trips to typed instances; can't bridge versions). - GraphBuilder.with_state_migration / with_state_migrations thread a populated MigrationRegistry into CompiledGraph at compile time. - Resume-path routing (compiled.py): version mismatch → unsupported-backend check → registry lookup → chain application (with per-migration failure wrap) → final deserialization. The post-migration deserialization failure still surfaces as CheckpointRecordInvalid per §10.12.4; pre-migration version mismatch routes through the new two categories. Order matters; documented inline so a future reader doesn't swap it back. - Parent-state migration: same chain applied to each entry of parent_states in lockstep with the outer state per §10.12.2. Code comment records the spec-mandated equivalence so future contributors don't add per-parent metadata without a follow-on proposal. - Drop the CHECKPOINT_SCHEMA_VERSION = '1' constant: per Q1 spec answer, the old backend-internal record-shape role had no spec slot anyway. SQLiteCheckpointer no longer rejects records with non-default versions on load — that routing is now the engine's concern at resume time. Existing records carrying schema_version='1' get reinterpreted as user-facing v1 identifiers (single-user dev, no compat shim needed per Chris's note).
New tests/conformance/test_state_migration.py drives all 8 spec state-migration fixtures end-to-end against the real engine (SQLite JSON-mode backend, real graph compile, real resume path). Harness pieces: - Migration mock library: add_new_field_default, add_v2_field, add_v3_field, identity_passthrough, raises_keyerror, should_not_run, irrelevant. Each fixture's migrate: <name> resolves through the library. - _MigrationTrace wraps each mock to capture invocation order for the migrations_run / migration_count / single_migration_invocation / migration_order_matches_chain assertions. Consecutive duplicates collapse (fixture 043 runs each step once for outer + once for each parent under the lockstep ordering; the assertion is per-step, not per-entity). - _seed_record persists a checkpoint matching the fixture's seeded_record: block before invoke(resume_invocation=...) so the resume path has data to load. Harness/model adjustments: - StateSchema in tests/conformance/harness/directives.py gains optional schema_version (default '') and the required field knob (no-default Pydantic field, used by fixture 044's required_v2_field deserialization-failure case). - _DEFERRED_FIXTURES in test_fixture_parsing.py loses the 039-046 rows; the CasesFixture model parses them via permissive extras on CaseSpec. - Initial-state construction in the resume path uses state_cls.model_construct() so fixtures with required fields (044) can pass a placeholder past Pydantic's validator before the resume even starts; the engine loads state from the checkpoint, not from the placeholder. Protocol-attribute shape: - Checkpointer.supports_state_migration declared as bool = False (not ClassVar) so SQLiteCheckpointer can set the value per-instance in __init__ based on serialization mode. Backends with a static answer (InMemoryCheckpointer) override at the class level with bool = False — Pyright accepts either shape because Protocol attribute conformance ignores the ClassVar marker on subclasses.
… + CHANGELOG tests/unit/test_state_migration.py (16 tests) covers gaps the conformance fixtures don't exercise directly: BFS edge cases on the registry, multi-shortest-path ambiguity detection, GraphBuilder ergonomics (singular + plural registration, duplicate-pair ValueError), and error attribute carriage including __cause__ preservation on CheckpointStateMigrationFailed. docs/concepts/checkpointing.md gains a State migrations section covering: the State.schema_version declaration, the registration surface (with_state_migration / with_state_migrations), BFS chain resolution and ambiguity cases (duplicate edges + multi-shortest- path), the two new error categories and how they relate to CheckpointRecordInvalid (§10.12.4), backend support and why SQLite-pickle / InMemory aren't migration-eligible, the lockstep parent_states migration rule, and the migrations-MUST-be-pure contract. CHANGELOG.md gains two new Added entries (state-migration surface + Checkpointer.supports_state_migration Protocol attribute) plus a Changed entry documenting the CheckpointRecord.schema_version semantic shift and the CHECKPOINT_SCHEMA_VERSION constant removal. Pre-1.0 breaking change covered by the consolidated-release flag.
Spec proposal 0018 (chain-ambiguity category) landed in spec v0.16.0 between PR-4 push and the spec agent's code review. This commit adds the new error category and the required routing swaps; PR-4 stays open as the carrier. - Submodule pinned to v0.16.0; pyproject + __spec_version__ + test_smoke pin assertions follow. - New canonical error class CheckpointStateMigrationChainAmbiguous (and the matching category string) added to checkpoint/errors.py, re-exported from openarmature.checkpoint. Non-transient. Carries optional from_version / to_version when known (always set for the registration-time case; resume-time multi-shortest-path also populates). - MigrationRegistry.register raises CheckpointStateMigrationChainAmbiguous directly at registration time per spec §10.10 (proposal 0018), so the canonical category surfaces at the registration boundary without wrapping. The internal BFS keeps raising ValueError for multi-shortest-path detection; CompiledGraph._migrate_record's except branch now routes that to CheckpointStateMigrationChainAmbiguous at the resume boundary. - Routing precedence on resume per §10.10 (v0.16.0): chain-ambiguous → missing → failed → record-invalid. Code-comment in _migrate_record documents the ordering so a future reader doesn't swap it back. - Code comments record two design-choice notes the spec agent flagged: the load-bearing equal-depth-re-entry behavior in BFS cycle protection (tightening to <= breaks multi-shortest-path detection), and the registration-order ordering of describe() output. - Conformance harness gains the expected_chain_ambiguity_error primitive. Accepts the canonical category at EITHER the case top-level (registration-time detection: registration raises, resume block unreachable) OR inside resume (load-time detection during BFS). Spec §10.12.2's compile-time-SHOULD / load-time-acceptable carve-out lands in the driver as a single try/except wrap around both phases. - Fixture 047 (state-migration-chain-ambiguous) covered; all 9 state-migration fixtures (039-047) pass. - 3 new unit tests for the new category's class shape + attribute carriage; existing duplicate-pair tests updated to assert the canonical category instead of ValueError. - docs/concepts/checkpointing.md updated for the third category + the v0.16.0 routing precedence. CHANGELOG entry updated and Pinned-version line bumped to 0.10.0 → 0.16.0. OTel openarmature.checkpoint.migrate span emission deferred to a follow-on; documented inline in _migrate_record. Spec §6 cross-ref is SHOULD-level, and the emission needs the _InvocationContext that's currently created AFTER the migration path runs. The follow-on can either restructure resume to build the context first or use a side-channel observer dispatch.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
openarmature.checkpoint+openarmature.graph. PR-4 of the five-PR batch following PR-1 (feat(llm): structured output (proposal 0016) #42), PR-2 (feat(llm): image content blocks (proposal 0015) #44), PR-3 (feat(prompts): prompt-management core (proposal 0017) #45).CheckpointRecord.schema_versionfrom the prior backend-internal record-shape role to a user-facing state-schema identifier per spec §10.2. Saved records carry whatever the user declared ontype(state).schema_versionat save time; on resume, version mismatches route through a registered migration chain instead of failing.openarmature.checkpoint:StateMigration(frozen dataclass —from_version/to_version/migratecallable) andMigrationRegistry(BFS chain resolution + duplicate-edge AND multi-shortest-path ambiguity detection per §10.12.2).GraphBuilder.with_state_migration(from, to, migrate)andwith_state_migrations(*migrations)populate the registry;compile()threads it through to theCompiledGraph. On resume, the engine readsrecord.schema_version, compares againsttype(state_cls).schema_version, and applies the resolved chain (or raises one of two new error categories).CheckpointStateMigrationMissing(no chain bridges the versions, or chain ambiguous) andCheckpointStateMigrationFailed(a migration function raised). Both non-transient. Post-migration deserialization failures still surface asCheckpointRecordInvalidper §10.12.4 — the three categories are mutually exclusive on any given resume.Checkpointer.supports_state_migration: boolmarks whether a backend can expose a structural intermediate form of the loaded state (the dict the migration registry consumes).SQLiteCheckpointer(serialization="json")opts in;SQLiteCheckpointer(serialization="pickle")andInMemoryCheckpointeropt out. Mismatched version against a non-migration-eligible backend raisesCheckpointRecordInvalidper §10.12.1.What's new
StateMigration,MigrationRegistryopenarmature.checkpoint. Registry'sresolve_chainruns BFS and raisesValueErroron ambiguous-shortest-paths per §10.12.2's load-time-detection allowance.CheckpointStateMigrationMissing,CheckpointStateMigrationFailedfrom_version/to_version/registered_migrations_count/registry_descriptionfor actionable diagnostics.State.schema_version: ClassVar[str] = ""GraphBuilder.with_state_migration/with_state_migrations(from, to)pairs raiseValueErrorat registration.Checkpointer.supports_state_migrationClassVar) so SQLite can compute it per-instance based on serialization mode.CompiledGraph._migrate_recordparent_statesin lockstep with the outer state per §10.12.2. Code comment records the spec-mandated equivalence so future contributors don't add per-parent metadata without a follow-on proposal.CHECKPOINT_SCHEMA_VERSIONschema_version. If future record-shape evolution needs versioning, backends add their own private field.Release gate
PR-4 of a five-PR batch (
0016→0015→0017→0014→0011). Do not tag a release until all five land — the CHANGELOG[Unreleased]Notes section carries the gate from PR-1. PR-5 (proposal 0011, parallel branches) is the only remaining piece.Commits
feat(checkpoint): state migration registry, types, errors, builder surfacetest(conformance): drive 0014 state-migration fixtures 039-046test(unit): state migration + docs(concepts) state migrations section + CHANGELOGNotable implementation details
schema_versionsemantic shift is a clean break. Pre-PR-4 records carrying the old"1"constant get reinterpreted as user-facing v1 identifiers. Per Chris's note, no compat shim — single-user dev, no production migration. CHANGELOG flags the shift underChanged.supports_state_migrationis an instance attribute, notClassVar. Started asClassVar[bool]per the original plan, butSQLiteCheckpointer's answer depends on the constructor'sserializationarg. The Protocol declares it asbool = False; backends override either at the class level (InMemoryCheckpointer = False) or per-instance (SQLite computed in__init__). Pyright accepts both forms.parent_states[i]before advancing to the next step. End state is identical to sequential ordering given pure migrations (spec §10.12.2 mandates purity), but lockstep gives cleaner failure attribution (step k failed on outer-or-parent-i) and sets up per-migration spans for a follow-on. Documented in the engine path's code comment.InMemoryCheckpointeris no longer migration-eligible. It holds live typed-state references; there's no class-independent intermediate. Mismatched-version resume against in-memory backend raisesCheckpointRecordInvalidper §10.12.1 — same category as before, slightly different reason.Conformance fixtures (039–046)
All 8 fixtures pass:
v1 → v2migration populates a new field; one migration ranv1 → v2 → v3chain runs in orderCheckpointStateMigrationMissingwithregistered_migrations_count: 0should_not_runmock never calledparent_states[i]CheckpointRecordInvalid(per §10.12.4)CheckpointStateMigrationMissingwithregistered_migrations_count: 1and the registered set inregistry_descriptionKeyError→CheckpointStateMigrationFailedwith theKeyErrorpreserved as__cause__Harness extensions
tests/conformance/test_state_migration.py:add_new_field_default,add_v2_field,add_v3_field,identity_passthrough,raises_keyerror,should_not_run,irrelevant. Each fixture'smigrate: <name>resolves through the library._MigrationTracewraps every mock to capture invocation order, supportingmigrations_run/migration_count/single_migration_invocation/migration_order_matches_chainassertions across all 8 fixtures._seed_recordpersists aCheckpointRecordmatching the fixture'sseeded_record:block beforeinvoke(resume_invocation=...)so the resume path has data to load.StateSchemamodel intests/conformance/harness/directives.pygains optionalschema_version(default"") and arequired: boolknob for fixture 044'srequired_v2_fieldno-default field._DEFERRED_FIXTURESintest_fixture_parsing.pyloses the 039-046 rows; the existingCasesFixturepermissive-extras path picks up the new shape unchanged.Test plan
uv run pytest— 644 pass, 64 skipped (down from 79; 8 fixtures × 2 = 16 fewer skips, minus 1 new docs example for the state-migrations concept section), 0 failed.uv run pyright— clean.uv run ruff check+uv run ruff format— clean.uv run --group docs mkdocs build --strict— clean.tests/unit/test_state_migration.pycovering BFS edge cases, multi-shortest-path ambiguity, GraphBuilder ergonomics, and__cause__preservation.docs/concepts/checkpointing.md#state-migrationsviamkdocs serveto confirm rendering.Pre-1.0 SemVer
Breaking-ish:
CheckpointRecord.schema_versionsemantic shift means existing records with the prior"1"constant value get reinterpreted as user-facing v1 identifiers. TheCHECKPOINT_SCHEMA_VERSIONmodule constant is removed. Pre-1.0 batch + single-user dev means no compatibility shim; CHANGELOG flags the shift underChanged.Additive otherwise: new types, new errors, new builder method, new state ClassVar, new Protocol attribute. Existing free-form callers (no
schema_versiondeclared on their state, no migrations registered) see no behavior change — records carry""and the engine never consults the registry.