ci(infra): Kaniko layer caching + age-based registry cleanup #399
Merged
Cloud Build runs each build on a fresh ephemeral VM, so the previous `docker build` steps always started from a cold layer cache and recompiled every Rust dependency from scratch (~6 min for the API image). The Dockerfiles are carefully structured around a dependency-caching layer (dummy-source trick), but with no persistent cache that optimization was dead on arrival.

Replace the `docker build` + `docker push` steps in all three build modules (overslash-api, oversla-sh, metrics-exporter) with a single Kaniko step using `--cache=true`. Kaniko stores each layer, including the builder-stage dependency layer, as a content-addressed blob in a dedicated cache repo (`<dest>/cache`), keyed by command+input hash, so unchanged layers are reused across builds. It also pushes the image directly, so the separate push step is no longer needed.

The cache repo lives in the same Artifact Registry repository and is derived from project_id/repository_name, so dev and prod caches are isolated automatically and no extra IAM is required. `--cache-ttl=168h` bounds staleness to one week. Caching does not depend on a `:latest` tag existing, so first builds (and fresh environments) work unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The registry previously had only a KEEP cleanup policy (keep 10 most recent), which in Artifact Registry is a no-op without a paired DELETE policy, so nothing was ever deleted and the repo grew unbounded (~11.6 GB on dev). With Kaniko layer caching now pushing cache blobs on every build, that growth would accelerate.

Add a DELETE policy that removes artifact versions older than a configurable threshold (default 30 days), covering both deployed images and Kaniko cache layers. The threshold must stay above the Kaniko `--cache-ttl` (168h): any cache layer Kaniko would reuse is re-pushed within the TTL and is therefore always younger than the delete age, so in-use cache is never pruned. The existing keep-recent policy still protects the 10 newest versions per package as a rollback safety net and takes precedence over the DELETE policy.

Both the delete age and a dry-run toggle are exposed as variables with defaults, so callers (dev/prod) need no changes; flip `cleanup_dry_run` to preview deletion scope before enforcing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Kaniko otherwise infers the cache repository from a `--destination`, but both destinations carry a dynamic `$COMMIT_SHA` tag. Relying on Kaniko's implicit tag-stripping to land on a stable cache path is fragile; pin `--cache-repo` to a fixed `<repo>/<image>/cache` path in all three build modules so layer caching reliably reuses blobs across commits.

Addresses Sentry review on #399.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two paired Cloud Build / Artifact Registry infra changes.
1. Kaniko layer caching for Cloud Build images
Problem: Cloud Build runs each build on a fresh ephemeral VM, so the existing `docker build` steps always start from a cold layer cache and recompile every Rust dependency from scratch. The API image build step alone is ~6 min of a 7–9 min build. The Dockerfiles are carefully structured around a dependency-caching layer (dummy-source trick), but with no persistent cache that optimization was dead on arrival.

Change: Replace `docker build` + `docker push` in all three build modules with a single Kaniko step using `--cache=true`:

- `infra/modules/cloud-build` (overslash-api)
- `infra/modules/cloud-build-shortener` (oversla-sh)
- `infra/modules/cloud-build-metrics-exporter` (overslash-metrics-exporter)

Kaniko stores each layer, including the builder-stage dependency layer, as a content-addressed blob in `<dest>/cache`, keyed by `(command + input files)` hash, so unchanged layers are reused across builds. It also pushes the image directly (the separate push step is removed).

Why Kaniko over `--cache-from :latest`: these are multi-stage builds; the expensive `cargo build` lives in the builder stage, whose layers never reach the final pushed image, so `--cache-from` inline cache would cache almost nothing here. Kaniko also doesn't depend on a `:latest` tag existing, so first builds / fresh environments work unchanged.

Isolation/safety: cache repo is derived from `project_id/repository_name` → dev and prod caches are isolated automatically; cache hits are content-addressed (safe for prod); same AR repo → no extra IAM; `--cache-ttl=168h` bounds staleness.

Expected impact: code-only builds drop from ~6 min to ~1–2 min once the cache is warm.
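A minimal sketch of what the Kaniko step could look like as an inline Cloud Build trigger in Terraform. It is not the modules' actual code: the resource name, the `github_owner`/`github_repo`/`region` variables, the image name, the branch filter, and the `gcr.io/kaniko-project/executor:latest` image reference are all assumptions for illustration. Only the `--cache=true`, `--cache-ttl=168h`, `--cache-repo`, and `$COMMIT_SHA`-tagged `--destination` arguments come from the PR, and only one destination is shown.

```hcl
# Hypothetical inputs; the real modules may name and wire these differently.
variable "project_id" { type = string }
variable "region" { type = string }
variable "repository_name" { type = string }
variable "github_owner" { type = string }
variable "github_repo" { type = string }

locals {
  # Image path inside the Artifact Registry repository. Because it is built
  # from project_id/repository_name, dev and prod get separate cache paths.
  image = "${var.region}-docker.pkg.dev/${var.project_id}/${var.repository_name}/overslash-api"
}

resource "google_cloudbuild_trigger" "api" {
  name = "overslash-api"

  github {
    owner = var.github_owner
    name  = var.github_repo
    push {
      branch = "^main$" # assumed branch filter
    }
  }

  build {
    # A single Kaniko step builds and pushes, replacing docker build + docker push.
    step {
      name = "gcr.io/kaniko-project/executor:latest" # assumed image, unpinned
      args = [
        # $COMMIT_SHA is a Cloud Build substitution, left literal by Terraform.
        "--destination=${local.image}:$COMMIT_SHA",
        "--cache=true",
        "--cache-ttl=168h",
        # Fixed cache path so reuse does not depend on tag-stripping of --destination.
        "--cache-repo=${local.image}/cache",
      ]
    }
  }
}
```

With no `--context` or `--dockerfile` given, Kaniko falls back to its defaults (the `/workspace` directory and a `Dockerfile` at its root); a module building from a subdirectory would need to set them explicitly.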
2. Age-based cleanup policy to bound registry growth
Problem: the registry had only a KEEP policy (keep 10 most recent), which in Artifact Registry is a no-op without a paired DELETE policy — nothing was ever deleted and the repo grew unbounded (~11.6 GB on dev). Kaniko caching pushes new cache blobs on every build, which would accelerate that growth.
Change: add a DELETE policy in `infra/modules/artifact-registry` removing versions older than a configurable threshold (default 30 days), covering both images and Kaniko cache layers. The keep-recent (10 newest) policy stays as a rollback safety net and takes precedence.

Why it's cache-safe: the delete age (30d) stays well above the Kaniko `--cache-ttl` (7d). Any cache layer Kaniko would reuse is re-pushed within the TTL, so it's always younger than the delete threshold and never pruned. Both the age and a `cleanup_dry_run` toggle are variables with defaults (no caller changes needed); flip dry-run to preview deletion scope before enforcing.

Notes

- … `gcr.io/cloud-builders/docker` was also unpinned). Pinning to a digest is a reasonable follow-up.
- `terraform fmt` passes on all four modules. Full `terraform validate`/`plan` not run here (no provider/backend access in this environment); worth a `plan` on dev before merge.

🤖 Generated with Claude Code
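The paired DELETE/KEEP cleanup policies from section 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the module's actual code: `cleanup_dry_run` is the only variable name taken from the PR, while `cleanup_delete_older_than_days`, the resource name, the policy ids, and the `false` dry-run default are hypothetical. The `validation` block is also an addition, shown as one possible way to enforce the "delete age must exceed the Kaniko cache TTL" invariant that the PR states in prose.

```hcl
# Hypothetical inputs; only cleanup_dry_run is named in the PR.
variable "project_id" { type = string }
variable "region" { type = string }
variable "repository_name" { type = string }

variable "cleanup_delete_older_than_days" {
  type        = number
  default     = 30
  description = "Delete artifact versions older than this many days."

  validation {
    # Kaniko runs with --cache-ttl=168h (7 days). Keeping the delete age above
    # that means any layer Kaniko would still reuse is younger than the cutoff.
    condition     = var.cleanup_delete_older_than_days > 7
    error_message = "The delete age must exceed the 7-day Kaniko cache TTL."
  }
}

variable "cleanup_dry_run" {
  type        = bool
  default     = false # assumed default; true previews deletions without enforcing
  description = "When true, cleanup policies are evaluated but nothing is deleted."
}

resource "google_artifact_registry_repository" "images" {
  project       = var.project_id
  location      = var.region
  repository_id = var.repository_name
  format        = "DOCKER"

  cleanup_policy_dry_run = var.cleanup_dry_run

  # DELETE: age-based pruning of images and Kaniko cache layers alike.
  cleanup_policies {
    id     = "delete-old-versions"
    action = "DELETE"
    condition {
      # Durations are given in seconds, e.g. 30 days -> "2592000s".
      older_than = "${var.cleanup_delete_older_than_days * 86400}s"
    }
  }

  # KEEP: the 10 newest versions per package survive the DELETE policy.
  cleanup_policies {
    id     = "keep-recent-versions"
    action = "KEEP"
    most_recent_versions {
      keep_count = 10
    }
  }
}
```

A caller wanting to preview the deletion scope first would set `cleanup_dry_run = true` on the module call and switch it back to enforce once the result looks right.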