feat(import): add ImportState + --skip-state-load flag for write-only import#560
Merged
Conversation
State.__init__ unconditionally loads from storage. Import discards prior per-type source dict before the API fetch and dump_state flushes the writes at the end, so the boot-time read is dead weight for import. New ImportState class exposes only write methods (set_source, clear_source_type, mark/clear_source_authoritative, dump_state). No source / destination property is defined; reads fail at the language level with AttributeError. dump_state rejects DESTINATION / ALL since import never populates destination state. New SourceStateWriter Protocol describes the shared write surface. State adds set_source / clear_source_type passthroughs so both classes satisfy it. Configuration.state is typed as Union[State, ImportState] so existing read access on State stays type-correct for sync/diffs/migrate/reset/prune code paths. New --skip-state-load Click flag on import only. build_config validates (requires --resource-per-file + --resources; not combinable with --minimize-reads; only valid on import) and dispatches to ImportState. Three import-path writer sites migrated to the Protocol methods: base_resource.py, resources_handler.py, synthetics_private_locations.py. Sync/diffs/migrate/reset/prune continue to use State's read accessors. The conflict checks for deprecated resources (logs_custom_pipelines+logs_pipelines, downtimes+downtime_schedules) still apply to ImportState callers; only the read-state fallback branch in _handle_deprecated is type-guarded with an isinstance check. Storage backend construction shared via build_storage_backend in _base_storage.py so State and ImportState cannot drift on the per-backend init dispatch. Diagnostic timing logs emit from a finally block in each backend's _list_and_load and put, so the log fires even when an SDK call raises and the exception propagates out. Cloud backends carry pages_listed, blobs_listed, blobs_downloaded, transient_errors, aborted, list_ms, download_ms, wall_ms — the list_ms vs download_ms split lets operators distinguish pagination cost from per-blob fetch cost when reading and writing race on the same bucket prefix; aborted distinguishes a successful empty result from an uncaught exception. transient_errors counts per-blob race-deletes (NoSuchKey on AWS, ResourceNotFoundError on Azure, NotFound on GCS) and JSON decode failures with the same semantics across all backends; SDK-internal retries below this layer are not visible. LocalFile emits the same schema (pages_listed=1) for cross-backend parity. Test plan: 556 tests pass (514 existing + 42 new); black-clean at 120 cols. Failure-path tests assert the timing log still emits with aborted=1 when each backend's list call raises. A paged-iterator test exercises the real-SDK .pages branch on GCS. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4bafe54 to
bb068a5
Compare
heyronhay
approved these changes
May 13, 2026
Collaborator
heyronhay
left a comment
There was a problem hiding this comment.
Nothing major that I can find, looks good!
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
State.__init__unconditionally loads existing state from storage. Theimportcommand discards prior per-type source state before its API fetch (in_import_get_resources_cb) and writes fresh data viadump_stateat the end — the boot-time load is dead weight for that command. On populated buckets the load can dominate per-run wall-clock.This PR adds an opt-in
--skip-state-loadflag toimportthat constructs a newImportStateinstead ofState.ImportStateexposes only write methods and has nosource/destinationproperty, so any attempt to read state fails at the language level withAttributeErrorrather than silently returning empty data.Adds diagnostic timing logs around state init, load, dump, and each storage backend's
_list_and_loadandput. The logs emit from afinallyblock so they fire even when an SDK call raises mid-iteration, and carry alist_msvsdownload_mssplit plus pagination and blob counts. Useful for diagnosing where wall-clock goes on populated buckets.What changes
ImportStateclass (utils/import_state.py) — write-only state for the import command.set_source,clear_source_type,dump_state(Origin.SOURCE). No read accessors;dump_staterejects DESTINATION / ALL.SourceStateWriterProtocol (utils/source_state_writer.py) — shared write surface implemented by bothStateandImportState.--skip-state-loadClick option onimportonly. Click rejects the flag onsync/migrate/diffs/reset/prune;build_configadds a redundant validation guard.base_resource._import_resource,resources_handler._import_get_resources_cb,model/synthetics_private_locations.import_resource.build_storage_backendhelper in_base_storage.pysoStateandImportStatecannot drift on per-backend init.state.pyand all four storage backends, with atry/finallyso the log emits even when the underlying SDK raises. New fields:pages_listed,blobs_listed,blobs_downloaded,transient_errors,aborted,list_ms,download_ms,wall_ms. Race-delete semantics (NoSuchKeyon AWS,ResourceNotFoundErroron Azure,NotFoundon GCS) are caught uniformly across backends and counted intransient_errors.What does not change
--skip-state-loadis opt-in (default=False); existing callers see no behavior change.sync/diffs/migrate/reset/prunepaths. These continue to useStatewith the existing read+write surface.dump_statepatches per-blob exactly as before.Constraints
--skip-state-loadrequires--resource-per-fileand--resources, and is not combinable with--minimize-reads(which serves a different optimization goal). Validated inbuild_config.Test plan
test_import_state.py)black --line-length 120clean--skip-state-load, timing-log schema fields present on all four backends, aborted=1 fires when each backend's SDK call raises, paged-iterator branch on GCS, round-trip read of--skip-state-load-written state via a freshState.Rollout
Flag is opt-in (
default=False). Reverting the commit removes the option fromimport --helpand doesn't break any caller that wasn't using it. No on-disk or wire-format effect.Out of scope
migratedoes not gain--skip-state-load(apply-phase reads destination).synckeeps its existing state-load path._list_and_loadis a separate effort.