
ci(infra): Kaniko layer caching + age-based registry cleanup #399

Merged: angel-manuel merged 3 commits into dev from infra/cloudbuild-kaniko-cache on Jun 14, 2026


@angel-manuel angel-manuel commented Jun 14, 2026


Two paired Cloud Build / Artifact Registry infra changes.

1. Kaniko layer caching for Cloud Build images

Problem: Cloud Build runs each build on a fresh ephemeral VM, so the existing docker build steps always start from a cold layer cache and recompile every Rust dependency from scratch. The API image build step alone is ~6 min of a 7–9 min build. The Dockerfiles are carefully structured around a dependency-caching layer (dummy-source trick), but with no persistent cache that optimization was dead on arrival.

Change: Replace docker build + docker push in all three build modules with a single Kaniko step using --cache=true:

  • infra/modules/cloud-build (overslash-api)
  • infra/modules/cloud-build-shortener (oversla-sh)
  • infra/modules/cloud-build-metrics-exporter (overslash-metrics-exporter)

Kaniko stores each layer — including the builder-stage dependency layer — as a content-addressed blob in <dest>/cache, keyed by (command + input files) hash, so unchanged layers are reused across builds. It also pushes the image directly (the separate push step is removed).

Why Kaniko over --cache-from :latest: these are multi-stage builds; the expensive cargo build lives in the builder stage, whose layers never reach the final pushed image — --cache-from inline cache would cache almost nothing here. Kaniko also doesn't depend on a :latest tag existing, so first builds / fresh environments work unchanged.

Isolation/safety: the cache repo is derived from project_id/repository_name, so dev and prod caches are isolated automatically. Cache hits are content-addressed, which makes them safe for prod. The cache lives in the same Artifact Registry repo, so no extra IAM is needed, and --cache-ttl=168h bounds staleness to one week.
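A minimal sketch of what the single Kaniko step might look like in one of the trigger modules. Only the image name, `--cache=true`, and `--cache-ttl=168h` come from the description above; the resource and trigger names, the `region` variable, the GitHub owner/repo values, the branch filter, and the Dockerfile/context paths are assumptions made for illustration, and the real module may be laid out differently.

```hcl
# Hypothetical inputs; the real module's variable names may differ.
variable "project_id" { type = string }
variable "region" { type = string }
variable "repository_name" { type = string }

locals {
  # Artifact Registry image path: <region>-docker.pkg.dev/<project>/<repo>/<image>
  image = "${var.region}-docker.pkg.dev/${var.project_id}/${var.repository_name}/overslash-api"
}

resource "google_cloudbuild_trigger" "api" {
  name     = "overslash-api-build" # hypothetical trigger name
  project  = var.project_id
  location = var.region

  github {
    owner = "example-org"  # hypothetical
    name  = "example-repo" # hypothetical
    push {
      branch = "^dev$" # hypothetical branch filter
    }
  }

  build {
    # One Kaniko step builds and pushes, replacing the docker build + docker push pair.
    step {
      name = "gcr.io/kaniko-project/executor:latest" # unpinned, per the Notes section
      args = [
        "--dockerfile=Dockerfile",
        "--context=dir:///workspace",
        "--destination=${local.image}:$COMMIT_SHA", # $COMMIT_SHA is substituted by Cloud Build, not Terraform
        "--cache=true",     # reuse layers from the cache repo
        "--cache-ttl=168h", # ignore cached layers older than one week
      ]
    }
  }
}
```

Because the image path is built from `project_id` and `repository_name`, a dev and a prod instantiation of this module would write to separate cache paths without any extra configuration.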

Expected impact: code-only builds drop from ~6 min to ~1–2 min once the cache is warm.

2. Age-based cleanup policy to bound registry growth

Problem: the registry had only a KEEP policy (keep 10 most recent), which in Artifact Registry is a no-op without a paired DELETE policy — nothing was ever deleted and the repo grew unbounded (~11.6 GB on dev). Kaniko caching pushes new cache blobs on every build, which would accelerate that growth.

Change: add a DELETE policy in infra/modules/artifact-registry removing versions older than a configurable threshold (default 30 days), covering both images and Kaniko cache layers. The keep-recent (10 newest) policy stays as a rollback safety net and takes precedence.

Why it's cache-safe: the delete age (30d) stays well above the Kaniko --cache-ttl (7d). Kaniko only reuses cache layers younger than the TTL; an older layer counts as expired and is rebuilt and re-pushed. Every layer Kaniko could still reuse is therefore younger than the delete threshold and is never pruned. Both the age and a cleanup_dry_run toggle are variables with defaults (no caller changes needed); flip dry-run to preview deletion scope before enforcing.
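The policy pair described above could be sketched roughly as follows. The variable name `cleanup_delete_after_days`, the policy ids, the resource name, the `region` variable, the `cleanup_dry_run` default, and the `validation` block are assumptions for illustration and are not taken from the PR; only `cleanup_dry_run`, the 30-day default, and the keep count of 10 come from the description.

```hcl
variable "project_id" { type = string }
variable "region" { type = string } # hypothetical
variable "repository_name" { type = string }

variable "cleanup_delete_after_days" { # hypothetical name
  type    = number
  default = 30

  # Assumed guard: keep the delete age above the 7-day Kaniko --cache-ttl.
  validation {
    condition     = var.cleanup_delete_after_days > 7
    error_message = "The delete age must exceed the Kaniko --cache-ttl of 7 days."
  }
}

variable "cleanup_dry_run" {
  type    = bool
  default = false # assumed default; set to true to preview deletion scope
}

resource "google_artifact_registry_repository" "images" {
  project       = var.project_id
  location      = var.region
  repository_id = var.repository_name
  format        = "DOCKER"

  # When true, policies are evaluated but nothing is deleted.
  cleanup_policy_dry_run = var.cleanup_dry_run

  # DELETE: removes versions (images and Kaniko cache layers) older than the threshold.
  cleanup_policies {
    id     = "delete-old-versions"
    action = "DELETE"
    condition {
      older_than = "${var.cleanup_delete_after_days * 86400}s" # 30 days -> "2592000s"
    }
  }

  # KEEP: protects the 10 newest versions per package; KEEP wins over DELETE.
  cleanup_policies {
    id     = "keep-recent"
    action = "KEEP"
    most_recent_versions {
      keep_count = 10
    }
  }
}
```

Expressing the age in seconds keeps the duration string in the form the API has always accepted; callers that rely on the defaults would not need to pass either variable.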

Notes

  • Kaniko executor image is unpinned, matching the repo's existing convention (gcr.io/cloud-builders/docker was also unpinned). Pinning to a digest is a reasonable follow-up.
  • terraform fmt passes on all four modules. Full terraform validate/plan not run here (no provider/backend access in this environment) — worth a plan on dev before merge.

🤖 Generated with Claude Code

Cloud Build runs each build on a fresh ephemeral VM, so the previous
`docker build` steps always started from a cold layer cache and
recompiled every Rust dependency from scratch (~6 min for the API
image). The Dockerfiles are carefully structured around a dependency-
caching layer (dummy-source trick), but with no persistent cache that
optimization was dead on arrival.

Replace the `docker build` + `docker push` steps in all three build
modules (overslash-api, oversla-sh, metrics-exporter) with a single
Kaniko step using `--cache=true`. Kaniko stores each layer — including
the builder-stage dependency layer — as a content-addressed blob in a
dedicated cache repo (<dest>/cache), keyed by command+input hash, so
unchanged layers are reused across builds. It also pushes the image
directly, so the separate push step is no longer needed.

The cache repo lives in the same Artifact Registry repository and is
derived from project_id/repository_name, so dev and prod caches are
isolated automatically and no extra IAM is required. --cache-ttl=168h
bounds staleness to one week. Caching does not depend on a :latest tag
existing, so first builds (and fresh environments) work unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

vercel Bot commented Jun 14, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| overslash | Ready | Preview, Comment | Jun 14, 2026 1:02pm |


Comment thread infra/modules/cloud-build/main.tf
The registry previously had only a KEEP cleanup policy (keep 10 most
recent), which in Artifact Registry is a no-op without a paired DELETE
policy — so nothing was ever deleted and the repo grew unbounded
(~11.6 GB on dev). With Kaniko layer caching now pushing cache blobs on
every build, that growth would accelerate.

Add a DELETE policy that removes artifact versions older than a
configurable threshold (default 30 days), covering both deployed images
and Kaniko cache layers. The threshold must stay above the Kaniko
--cache-ttl (168h): any cache layer Kaniko would reuse is re-pushed
within the TTL and is therefore always younger than the delete age, so
in-use cache is never pruned. The existing keep-recent policy still
protects the 10 newest versions per package as a rollback safety net and
takes precedence over the DELETE policy.

Both the delete age and a dry-run toggle are exposed as variables with
defaults, so callers (dev/prod) need no changes; flip cleanup_dry_run to
preview deletion scope before enforcing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@angel-manuel angel-manuel changed the title from "ci(infra): use Kaniko with layer caching for Cloud Build images" to "ci(infra): Kaniko layer caching + age-based registry cleanup" on Jun 14, 2026
Kaniko otherwise infers the cache repository from a --destination, but
both destinations carry a dynamic $COMMIT_SHA tag. Relying on Kaniko's
implicit tag-stripping to land on a stable cache path is fragile; pin
--cache-repo to a fixed <repo>/<image>/cache path in all three build
modules so layer caching reliably reuses blobs across commits.

Addresses Sentry review on #399.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
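
The `--cache-repo` pinning described in the commit message above might be expressed along these lines. This is a sketch under assumptions: the local names (`image`, `cache_repo`, `kaniko_args`) and the `region` variable are hypothetical, and only the flags and the `<repo>/<image>/cache` shape come from the commit message.

```hcl
variable "project_id" { type = string }
variable "region" { type = string } # hypothetical
variable "repository_name" { type = string }

locals {
  # <repo>/<image>: stable across commits because it carries no tag.
  image = "${var.region}-docker.pkg.dev/${var.project_id}/${var.repository_name}/overslash-api"

  # Fixed cache path, independent of the $COMMIT_SHA tag on the destination.
  cache_repo = "${local.image}/cache"

  # Argument list for the Kaniko step (hypothetical local, for illustration).
  kaniko_args = [
    "--destination=${local.image}:$COMMIT_SHA",
    "--cache=true",
    "--cache-ttl=168h",
    "--cache-repo=${local.cache_repo}", # explicit, so no reliance on tag-stripping
  ]
}
```

With the cache path spelled out, every build reads and writes the same cache location regardless of how the destination tags change.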
Comment thread infra/modules/artifact-registry/main.tf
@angel-manuel angel-manuel merged commit 42a43ed into dev Jun 14, 2026
10 checks passed
@angel-manuel angel-manuel deleted the infra/cloudbuild-kaniko-cache branch June 14, 2026 13:06