Wire cd_cache into the COG read path so repeated builds read locally (kill recurring S3 egress) #76

Description

@NewGraphEnvironment

Problem

The consumer read path re-downloads every COG from S3 on every call. cd_extract() loops each catalog row through cd_crop(), which calls terra::rast(href) directly on a /vsicurl/ URL:

# R/cd_crop.R
cd_crop <- function(href, aoi) {
  r <- terra::rast(href)   # /vsicurl/ — fetched fresh every call
  ...
}

GDAL's /vsicurl/ only keeps a small in-memory chunk cache per session (~16 MB default) with no on-disk persistence across R sessions. So every separate report render, appendix knit, or vignette build re-pulls the full overviews + tiles for each AOI from scratch. Multiple dev iterations × multiple AOIs × all variables/periods → hundreds of GB of repeated downloads.

This is the likely dominant driver of S3 egress on the account (~$17 / ~290 GB in May 2026 — see NewGraphEnvironment/rtj#168). It's self-inflicted, recurring, and avoidable.

The fix is already half-built

R/cd_cache.R ships cd_cache_path() / cd_cache_clear() / cd_cache_info() backed by rappdirs::user_cache_dir("cd") — but nothing in R/ calls them. The cache module is orphaned. Wiring it into the read path turns repeated builds from network pulls into local reads.

Proposed approach

  • Add a fetch-through-cache layer the read path uses: given a remote href, download once to cd_cache_path() (keyed by a stable hash of the URL / STAC item id + etag), and on subsequent calls read the local copy
  • Route cd_crop() (and therefore cd_extract()) through it for remote hrefs; local paths pass through untouched
  • Honour an explicit opt-out / refresh (e.g. a cache = TRUE argument that can be set to FALSE, or a cd_cache_clear() before a run) so users can force a fresh pull when the published data updates
  • Validate the cached file (size / etag) so a partial download isn't served as complete
  • Vignettes (kootenay-lake.Rmd, peace-fwcp.Rmd) read from cache on rebuild — confirm a second knit does ~zero egress
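
A minimal sketch of the fetch-through-cache layer described above, in base R only. The function name cd_fetch_cached(), its arguments, and the ".tif" suffix are assumptions for illustration, not existing package API. The cache directory defaults to a temp folder here to keep the sketch self-contained; in the package it would presumably come from cd_cache_path(). The key is a hash of the URL alone, so change detection (etag) is not covered.

```r
# Sketch only: cd_fetch_cached() and its arguments are hypothetical names,
# not part of the package today.
cd_fetch_cached <- function(href,
                            cache_dir = file.path(tempdir(), "cd"),
                            refresh = FALSE,
                            downloader = function(url, dest) {
                              utils::download.file(url, dest, mode = "wb", quiet = TRUE)
                            }) {
  # Drop GDAL's /vsicurl/ prefix to recover the plain URL, if there is one.
  url <- sub("^/vsicurl/", "", href)

  # Local paths pass through untouched.
  if (!grepl("^https?://", url)) return(href)

  dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)

  # Key the cached file on an MD5 of the URL. Base R hashes files, not
  # strings, so the URL is written to a temp file first.
  key_file <- tempfile()
  on.exit(unlink(key_file), add = TRUE)
  writeLines(url, key_file)
  dest <- file.path(cache_dir, paste0(unname(tools::md5sum(key_file)), ".tif"))

  if (refresh || !file.exists(dest)) {
    # Download to a ".part" file and move it into place only on success,
    # so an interrupted download is never served as a complete COG.
    part <- paste0(dest, ".part")
    on.exit(unlink(part), add = TRUE)
    downloader(url, part)
    if (!file.exists(part) || file.size(part) == 0) {
      stop("download failed or produced an empty file: ", url)
    }
    unlink(dest)
    if (!file.rename(part, dest)) stop("could not move download into cache: ", dest)
  }

  dest
}
```

With something like this in place, cd_crop() could call terra::rast(cd_fetch_cached(href)) instead of terra::rast(href). The injectable downloader argument is there so the caching logic can be tested without touching the network.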

Design notes / open questions

  • Granularity: cache whole COGs, or AOI-cropped subsets? Whole-COG is simpler and dedupes across AOIs that overlap; cropped subsets are smaller but key on (href + AOI). Whole-COG is probably the right default.
  • Invalidation: the monthly Action republishes the catalog. Cache key should include something that changes when the underlying COG changes (S3 ETag / last-modified, or STAC item updated) so a stale local copy isn't silently reused.
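  • Stopgap available now: even before this lands, report-dev egress can be cut by pre-downloading AOI COGs and reading them from local paths. Enlarging GDAL's in-memory /vsicurl/ caches (CPL_VSIL_CURL_CACHE_SIZE, VSI_CACHE / VSI_CACHE_SIZE) reduces re-reads within a single session, but those caches are not written to disk, so they do not help across R sessions, and the GDAL_HTTP_* options tune requests rather than caching. Worth a README note.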
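
One way to approach the invalidation point above is to fold the S3 ETag into the cache key, so a republished COG lands under a new key instead of reusing the stale file. The sketch below is an illustration under assumptions: cd_parse_etag() and cd_cache_key() are hypothetical names, and the header lines are assumed to come from something like base R's curlGetHeaders(url), which is left out of the code so it runs offline.

```r
# Sketch only: both helpers are hypothetical, not existing package functions.

# Pull the ETag value out of raw HTTP header lines
# (e.g. the character vector returned by curlGetHeaders(url)).
cd_parse_etag <- function(headers) {
  hit <- grep("^etag:", headers, ignore.case = TRUE, value = TRUE)
  if (length(hit) == 0) return(NA_character_)
  # Use the last match (the final response after any redirects),
  # then strip the header name, line endings, and surrounding quotes.
  etag <- sub("^etag:[[:space:]]*", "", hit[length(hit)], ignore.case = TRUE)
  etag <- trimws(etag)
  gsub('^(W/)?"|"$', "", etag)
}

# Build a cache key from the URL plus its ETag. A changed ETag gives a
# different key, so the old cached file is simply never looked up again.
cd_cache_key <- function(url, etag = NA_character_) {
  tmp <- tempfile()
  on.exit(unlink(tmp), add = TRUE)
  writeLines(c(url, if (is.na(etag)) "" else etag), tmp)
  unname(tools::md5sum(tmp))
}

# Example with made-up header lines standing in for a real response
headers <- c("HTTP/1.1 200 OK\r\n", "ETag: \"abc123\"\r\n")
key <- cd_cache_key("https://example.com/a.tif", cd_parse_etag(headers))
```

The trade-off is one small header request per COG per run to learn the current ETag, and superseded files stay on disk until cd_cache_clear() is called or a pruning step is added.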

Why it matters

This is a real recurring cost reducer, not just a tidy-up: it should drop our own monthly S3 egress from ~$10–17 toward ~$0 on repeat builds, and it shrinks the blast radius of the rtj#168 cost-guardrail work (most of the egress we were about to alarm on is our own un-cached re-pulls).

References

  • NewGraphEnvironment/rtj#168 — account-wide S3 cost guardrails (this is the source-side fix for the egress that issue alarms on)
