Problem
The consumer read path re-downloads every COG from S3 on every call. cd_extract() loops each catalog row through cd_crop(), which does terra::rast(href) directly on a /vsicurl/ URL:
# R/cd_crop.R
cd_crop <- function(href, aoi) {
  r <- terra::rast(href)  # /vsicurl/ URL, fetched fresh on every call
  ...
}
GDAL's /vsicurl/ only keeps a small in-memory chunk cache per session (~16 MB default) with no on-disk persistence across R sessions. So every separate report render, appendix knit, or vignette build re-pulls the full overviews + tiles for each AOI from scratch. Multiple dev iterations × multiple AOIs × all variables/periods → hundreds of GB of repeated downloads.
This is the likely dominant driver of S3 egress on the account (~$17 / ~290 GB in May 2026 — see NewGraphEnvironment/rtj#168). It's self-inflicted, recurring, and avoidable.
The fix is already half-built
R/cd_cache.R ships cd_cache_path() / cd_cache_clear() / cd_cache_info() backed by rappdirs::user_cache_dir("cd") — but nothing in R/ calls them. The cache module is orphaned. Wiring it into the read path turns repeated builds from network pulls into local reads.
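Proposed approach
- Add a read-through cache for remote hrefs: given an href, download once to cd_cache_path() (keyed by a stable hash of the URL / STAC item id + etag), and on subsequent calls read the local copy
- Route cd_crop() (and therefore cd_extract()) through it for remote hrefs; local paths pass through untouched
- Provide a refresh path (a cache = TRUE arg or a cd_cache_clear() before a run) so users can force a fresh pull when the published data updates
- Have the Rmd builds (kootenay-lake.Rmd, peace-fwcp.Rmd) read from cache on rebuild; confirm a second knit does ~zero egress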
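A minimal sketch of the read-through cache proposed here, under stated assumptions. The helper name cd_cache_fetch() and its arguments (cache_dir, etag, refresh, download) are hypothetical and not part of the package. The cache directory is passed in explicitly because the signature of cd_cache_path() is not shown in this issue; in the package it would presumably come from that function.

```r
# Hypothetical helper (not in the package): read-through cache for remote COGs.
# `cache_dir` would presumably come from cd_cache_path(); it is an argument
# here because that function's signature is not shown in this issue.
cd_cache_fetch <- function(href, cache_dir, etag = NULL, refresh = FALSE,
                           download = utils::download.file) {
  # Local paths pass through untouched.
  if (!grepl("^(/vsicurl/)?https?://", href)) {
    return(href)
  }
  url <- sub("^/vsicurl/", "", href)

  # Stable key: MD5 of "url|etag". Base R hashes files, so use a temp file.
  key_src <- tempfile()
  on.exit(unlink(key_src), add = TRUE)
  writeLines(paste0(url, "|", if (is.null(etag)) "" else etag), key_src)
  key <- unname(tools::md5sum(key_src))

  dest <- file.path(cache_dir, paste0(key, ".tif"))
  if (refresh || !file.exists(dest)) {
    dir.create(cache_dir, recursive = TRUE, showWarnings = FALSE)
    # Download under a temporary name so an interrupted transfer never
    # leaves a truncated file at the final cache path.
    part <- paste0(dest, ".part")
    status <- download(url, part, mode = "wb", quiet = TRUE)
    if (!identical(as.integer(status), 0L)) {
      unlink(part)
      stop("download failed for ", url)
    }
    if (file.exists(dest)) unlink(dest)
    file.rename(part, dest)
  }
  dest
}
```

Inside cd_crop() the call would then look something like terra::rast(cd_cache_fetch(href, cache_dir)), so the first read pays for the download and later reads stay local. Including the etag in the key means a republished COG lands at a new cache path instead of overwriting, or being shadowed by, the old copy.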
Design notes / open questions
- Granularity: cache whole COGs, or AOI-cropped subsets? Whole-COG is simpler and dedupes across AOIs that overlap; cropped subsets are smaller but key on (href + AOI). Whole-COG is probably the right default.
- Invalidation: the monthly Action republishes the catalog. The cache key should include something that changes when the underlying COG changes (S3 ETag / Last-Modified, or the STAC item's updated field) so a stale local copy isn't silently reused.
- Stopgap available now: even before this lands, report-dev egress can be cut by tuning GDAL's /vsicurl/ settings (VSI_CACHE, GDAL_HTTP_*) or pre-downloading AOI COGs; worth a README note. GDAL's caches are in-memory, so the first option only helps within a single session.
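For the invalidation question, one option is to read the object's ETag from a HEAD-style request and fold it into whatever cache key is chosen. The sketch below only covers parsing; cd_etag_from_headers() is a hypothetical name, and it assumes the headers arrive as a character vector of raw header lines, which is what base R's curlGetHeaders() returns.

```r
# Hypothetical helper: pull the ETag out of HTTP response header lines,
# e.g. the character vector returned by base::curlGetHeaders(url).
cd_etag_from_headers <- function(headers) {
  # Header names are case-insensitive; if redirects produced several
  # header blocks, keep the last match (the final response).
  hits <- grep("^etag:", headers, ignore.case = TRUE, value = TRUE)
  if (length(hits) == 0) {
    return(NULL)
  }
  value <- sub("^etag:\\s*", "", hits[length(hits)], ignore.case = TRUE)
  # Drop the trailing CR/LF, an optional weak-validator prefix, and quotes.
  value <- trimws(value)
  value <- sub("^W/", "", value)
  gsub('"', "", value, fixed = TRUE)
}

# Usage (needs network access, so left commented out):
# etag <- cd_etag_from_headers(curlGetHeaders("https://example.com/a.tif"))
```

A NULL result (no ETag header) would need a fallback such as Last-Modified or the STAC item's updated field; which one to prefer is still an open design question.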
Why it matters
This is a real recurring cost reducer, not just a tidy-up: it should drop our own monthly S3 egress from ~$10–17 toward ~$0 on repeat builds, and it shrinks the blast radius of the rtj#168 cost-guardrail work (most of the egress we were about to alarm on is our own un-cached re-pulls).
References
- NewGraphEnvironment/rtj#168 — account-wide S3 cost guardrails (this is the source-side fix for the egress that issue alarms on)