37 changes: 19 additions & 18 deletions .claude/skills/indexing-diagnostics/SKILL.md
Original file line number Diff line number Diff line change
Expand Up @@ -868,30 +868,31 @@ When the summary signals (Mode A's `cpuTopFrames`, Mode D) name a hot or looping

### Two tiers — pick by what you need

| Tier | What | Where it lands | Captures the hard wedge? |
|------|------|----------------|--------------------------|
| 1 (always on for targeted realms) | Top-N self-time **summary** | `prerenderer` log: `affinity CPU profile …` | No — `Profiler.stop` needs the renderer thread |
| 2 — `.cpuprofile` | Full V8 CPU profile (whole call tree) | S3 `…/<ts>.cpuprofile` | No — same `Profiler.stop` limit |
| 2 — trace (`.trace.json`) | CDP/Perfetto trace, **streamed** — separates JS / GC / compile / layout / paint | S3 `…/<ts>.trace.json` | **Yes** — buffered on browser threads, drained out-of-band; the one capture that survives a fully-pegged renderer |
| 2 — heap (`.heapprofile`) | Cumulative allocation-sampling profile, flushed per render | S3 `…/<ts>.heapprofile` | No — `getSamplingProfile` needs the renderer thread |
| Tier | What | Where it lands | Captures the hard wedge? |
| --------------------------------- | ------------------------------------------------------------------------------- | ------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- |
| 1 (always on for targeted realms) | Top-N self-time **summary** | `prerenderer` log: `affinity CPU profile …` | No — `Profiler.stop` needs the renderer thread |
| 2 — `.cpuprofile` | Full V8 CPU profile (whole call tree) | S3 `…/<ts>.cpuprofile` | No — same `Profiler.stop` limit |
| 2 — trace (`.trace.json`) | CDP/Perfetto trace, **streamed** — separates JS / GC / compile / layout / paint | S3 `…/<ts>.trace.json` | **Yes** — buffered on browser threads, drained out-of-band; the one capture that survives a fully-pegged renderer |
| 2 — heap (`.heapprofile`) | Cumulative allocation-sampling profile, flushed per render | S3 `…/<ts>.heapprofile` | No — `getSamplingProfile` needs the renderer thread |

Rule of thumb: a render that **completes but is heavy** → `.cpuprofile` (+ heap for allocation growth). A render that **fully wedges** (no `cpuTopFrames`, `scriptBusy=<unknown>`) → the **trace**, which is the only thing that comes back. If the trace returns idle (no hot frame), the wedge isn't CPU-spinning — pivot to "what is it blocked on" (Mode A's `pendingFetches`).
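
The rule of thumb above can be written down as a small decision helper. This is only an illustrative sketch: the function names and the string labels are hypothetical and do not come from the prerenderer code.

```python
from typing import List


def pick_captures(fully_wedged: bool, suspect_allocation_growth: bool = False) -> List[str]:
    """Return the Tier 2 captures worth enabling for a given symptom.

    fully_wedged: no cpuTopFrames and scriptBusy=<unknown> in the summary.
    """
    if fully_wedged:
        # Only the streamed trace survives a fully-pegged renderer.
        return ["trace"]
    captures = ["cpuprofile"]
    if suspect_allocation_growth:
        captures.append("heap")
    return captures


def next_step_after_trace(trace_has_hot_frame: bool) -> str:
    """Decide where to look once a trace has come back."""
    if trace_has_hot_frame:
        return "inspect the hot frame in the trace"
    # An idle trace means the wedge is not CPU-spinning.
    return "check pendingFetches (Mode A) for what the render is blocked on"
```

For example, `pick_captures(fully_wedged=False, suspect_allocation_growth=True)` suggests the `.cpuprofile` plus the heap profile.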

### The knobs (SSM parameters)

All live at `/<env>/boxel/<NAME>` (Systems Manager → Parameter Store). The bucket itself (`PRERENDER_ARTIFACTS_BUCKET`) and the key prefix (`PRERENDER_ARTIFACTS_ENV`) are wired by Terraform — don't set them by hand.

| Parameter | Values | Effect |
|-----------|--------|--------|
| `PRERENDER_PROFILE_AFFINITY` | comma-separated affinity keys, e.g. `realm:https://realms.cardstack.com/team/foo/` | **Required to target.** Only renders whose affinity key exactly matches are profiled at all (Tier 1 + Tier 2). Empty / `off` → everything inert. |
| `PRERENDER_PROFILE_CPUPROFILE` | `true` / `false` | Persist the full `.cpuprofile` for targeted renders. |
| `PRERENDER_PROFILE_TRACE` | `true` / `false` | Capture the streaming trace for targeted renders. |
| `PRERENDER_PROFILE_HEAP` | `true` / `false` | Capture the heap allocation-sampling profile for targeted renders. |
| `PRERENDER_PROFILE_MAX_SESSION_BYTES` | positive integer, or `0` for the default | Soft per-process byte budget across all artifacts. `0`/unset → 5 GiB. Once spent, the task declines further uploads (in-flight ones finish, so blobs are never truncated). |
| Parameter | Values | Effect |
| ------------------------------------- | ---------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `PRERENDER_PROFILE_AFFINITY` | comma-separated affinity keys, e.g. `realm:https://realms.cardstack.com/team/foo/` | **Required to target.** Only renders whose affinity key exactly matches are profiled at all (Tier 1 + Tier 2). Empty / `off` → everything inert. |
| `PRERENDER_PROFILE_CPUPROFILE` | `true` / `false` | Persist the full `.cpuprofile` for targeted renders. |
| `PRERENDER_PROFILE_TRACE` | `true` / `false` | Capture the streaming trace for targeted renders. |
| `PRERENDER_PROFILE_HEAP` | `true` / `false` | Capture the heap allocation-sampling profile for targeted renders. |
| `PRERENDER_PROFILE_MAX_SESSION_BYTES` | positive integer, or `0` for the default | Soft per-process byte budget across all artifacts. `0`/unset → 5 GiB. Once spent, the task declines further uploads (in-flight ones finish, so blobs are never truncated). |
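
The budget semantics of `PRERENDER_PROFILE_MAX_SESSION_BYTES` can be sketched as follows. This is a minimal model, not the actual sink: the class and function names are hypothetical, and it assumes the check happens before an upload starts, that a started upload is counted in full, and that a negative value is rejected.

```python
from typing import Optional

DEFAULT_SESSION_BYTES = 5 * 1024**3  # 5 GiB, the documented default


def effective_budget(raw: Optional[str]) -> int:
    """Map the raw parameter value to a byte budget (0 or unset means default)."""
    if raw is None or raw.strip() == "":
        return DEFAULT_SESSION_BYTES
    value = int(raw)
    if value < 0:
        # Assumption: the table only allows 0 or a positive integer.
        raise ValueError("budget must be 0 or a positive integer")
    return DEFAULT_SESSION_BYTES if value == 0 else value


class SessionBudget:
    """Soft per-process budget: checked before an upload, never mid-upload."""

    def __init__(self, budget_bytes: int) -> None:
        self.budget_bytes = budget_bytes
        self.spent_bytes = 0

    def may_start_upload(self) -> bool:
        # Decline new uploads once the budget is spent.
        return self.spent_bytes < self.budget_bytes

    def record_upload(self, size_bytes: int) -> None:
        # An in-flight upload always finishes, so spending can overshoot.
        self.spent_bytes += size_bytes
```

Because the budget is soft, the last accepted artifact may push the total past the limit, which is what keeps blobs from being truncated.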

The mode flags are Terraform-seeded sentinels (default `false` / `0`); `PRERENDER_PROFILE_AFFINITY` is operator-managed and must already exist. The affinity key is `realm:` + the realm's canonical URL **with trailing slash** — the same value Mode A/B logs print as `affinity=…`.
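
As a sketch of how the affinity value is assembled, assuming you already have the realm's canonical URL (the helper names are hypothetical, and nothing here canonicalizes the URL beyond adding the trailing slash):

```python
from typing import Iterable


def affinity_key(realm_url: str) -> str:
    """Build 'realm:' + URL, ensuring the trailing slash the match requires."""
    url = realm_url.strip()
    if not url.endswith("/"):
        url += "/"
    return "realm:" + url


def affinity_param_value(realm_urls: Iterable[str]) -> str:
    """Join several keys into the comma-separated PRERENDER_PROFILE_AFFINITY value."""
    return ",".join(affinity_key(u) for u in realm_urls)
```

Since the match is exact, it is worth comparing the result against the `affinity=…` value printed in the Mode A/B logs before restarting the service.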

> **The container reads these at task start.** ECS injects SSM values when a task launches, so a change only takes effect on a fresh task. After editing the parameters, force a new deployment of the prerender service so tasks restart with the new env:
>
> ```
> aws ecs update-service --cluster <env> --service boxel-prerender-server-<env> --force-new-deployment
> ```
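
The edit-then-restart sequence could be scripted roughly like this. The sketch only builds the AWS CLI argument lists and does not run them. It assumes the parameters already exist (so `put-parameter` is called with `--overwrite` and no `--type`), and the helper names are hypothetical.

```python
from typing import List


def parameter_path(env: str, name: str) -> str:
    """Parameter Store path, following the /<env>/boxel/<NAME> layout."""
    return f"/{env}/boxel/{name}"


def put_parameter_cmd(env: str, name: str, value: str) -> List[str]:
    """argv for overwriting an existing SSM parameter."""
    return [
        "aws", "ssm", "put-parameter",
        "--name", parameter_path(env, name),
        "--value", value,
        "--overwrite",
    ]


def redeploy_cmd(env: str) -> List[str]:
    """argv for the forced redeploy that makes tasks re-read the parameters."""
    return [
        "aws", "ecs", "update-service",
        "--cluster", env,
        "--service", f"boxel-prerender-server-{env}",
        "--force-new-deployment",
    ]
```

Each list can be passed to `subprocess.run(cmd, check=True)` once you are satisfied the values are right.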
Expand All @@ -900,7 +901,7 @@ The mode flags are Terraform-seeded sentinels (default `false` / `0`); `PRERENDE

1. **Target the realm.** Set `PRERENDER_PROFILE_AFFINITY` to its affinity key and turn on the mode flag(s) you need (start with `PRERENDER_PROFILE_TRACE` for a wedge, `PRERENDER_PROFILE_CPUPROFILE` for a heavy-but-completing render).
2. **Restart the service** (`--force-new-deployment` above) so the tasks pick up the values.
3. **Generate renders.** Trigger a reindex of the targeted realm (see *Triggering a reindex* below) — the indexer's per-card visits are what produce artifacts. Confirm captures are happening in the `prerenderer` log: `artifact-sink uploaded <kind> key=… bytes=… sessionBytes=…/…`.
3. **Generate renders.** Trigger a reindex of the targeted realm (see _Triggering a reindex_ below) — the indexer's per-card visits are what produce artifacts. Confirm captures are happening in the `prerenderer` log: `artifact-sink uploaded <kind> key=… bytes=… sessionBytes=…/…`.
4. **Pull the artifacts** (below).
5. **Turn it off.** Set the mode flags back to `false` (and clear `PRERENDER_PROFILE_AFFINITY` if done), then force one more deployment. Leftover artifacts auto-expire after 14 days regardless.
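
To confirm captures from step 3 without reading the log by eye, a parser along these lines may help. It assumes the elided fields in the `artifact-sink uploaded` line are whitespace-free tokens and that the byte counts are plain integers. The real log format may differ, so treat the regular expression as a starting point.

```python
import re
from typing import Dict, Optional, Union

# Assumed shape: artifact-sink uploaded <kind> key=<key> bytes=<n> sessionBytes=<spent>/<budget>
UPLOAD_RE = re.compile(
    r"artifact-sink uploaded (?P<kind>\S+) key=(?P<key>\S+) "
    r"bytes=(?P<bytes>\d+) sessionBytes=(?P<spent>\d+)/(?P<budget>\d+)"
)


def parse_upload(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Extract the upload fields from one log line, or None if it does not match."""
    match = UPLOAD_RE.search(line)
    if match is None:
        return None
    return {
        "kind": match.group("kind"),
        "key": match.group("key"),
        "bytes": int(match.group("bytes")),
        "session_spent": int(match.group("spent")),
        "session_budget": int(match.group("budget")),
    }
```

Watching `session_spent` approach `session_budget` is a quick way to tell whether the byte budget, rather than the affinity filter, is why uploads stopped.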

Expand All @@ -917,14 +918,14 @@ The key schema is `env/realm/jobId/card/step/<timestamp>-<seq>.<suffix>` — eve

### Reading each artifact

- **`.cpuprofile`** — Chrome DevTools (Performance panel → *Load profile…*) or [speedscope](https://www.speedscope.app/). Self-time flame graph of the whole render; the summary's top frames are just the peak of this.
- **`.trace.json`** — [Perfetto UI](https://ui.perfetto.dev/) or Chrome DevTools Performance → *Load profile…*. Separate tracks for JS execution, V8 GC, compile, and layout/paint — this is how you tell a JS spin (`v8.execute` saturated) from GC thrash (`v8.gc` saturated) when the summary couldn't say.
- **`.heapprofile`** — Chrome DevTools Memory → *Allocation sampling* → *Load profile…*. Each upload is the cumulative profile **at that render**, so download two from different points in the session and compare to see which call sites kept allocating.
- **`.cpuprofile`** — Chrome DevTools (Performance panel → _Load profile…_) or [speedscope](https://www.speedscope.app/). Self-time flame graph of the whole render; the summary's top frames are just the peak of this.
- **`.trace.json`** — [Perfetto UI](https://ui.perfetto.dev/) or Chrome DevTools Performance → _Load profile…_. Separate tracks for JS execution, V8 GC, compile, and layout/paint — this is how you tell a JS spin (`v8.execute` saturated) from GC thrash (`v8.gc` saturated) when the summary couldn't say.
- **`.heapprofile`** — Chrome DevTools Memory → _Allocation sampling_ → _Load profile…_. Each upload is the cumulative profile **at that render**, so download two from different points in the session and compare to see which call sites kept allocating.
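
The two-profile comparison for `.heapprofile` files can also be done offline. The sketch below assumes the files follow the Chrome sampling heap profile JSON shape (a `head` node tree whose nodes carry `callFrame`, `selfSize`, and `children`). The function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def self_size_by_call_site(profile: dict) -> Counter:
    """Sum selfSize per call site by walking the profile's node tree."""
    totals: Counter = Counter()
    stack = [profile["head"]]
    while stack:
        node = stack.pop()
        frame = node["callFrame"]
        label = "{} {}:{}".format(
            frame.get("functionName") or "(anonymous)",
            frame.get("url", ""),
            frame.get("lineNumber", 0),
        )
        totals[label] += node.get("selfSize", 0)
        stack.extend(node.get("children", []))
    return totals


def allocation_growth(earlier: dict, later: dict, top: int = 10) -> List[Tuple[str, int]]:
    """Call sites whose cumulative sampled bytes grew between two profiles."""
    before = self_size_by_call_site(earlier)
    after = self_size_by_call_site(later)
    grown = [(label, after[label] - before.get(label, 0)) for label in after]
    grown = [(label, delta) for label, delta in grown if delta > 0]
    return sorted(grown, key=lambda item: item[1], reverse=True)[:top]
```

Load each downloaded file with `json.load` and pass the earlier one first; the call sites at the top of the result are the ones that kept allocating.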

### What Mode H can't tell you

- The `.cpuprofile` and `.heapprofile` need the renderer thread to serialize, so a **fully-wedged** render produces neither — only the streaming trace comes back. That's by design (Tier 1's summary has the same limit); the trace is the wedge tool.
- Browser-wide tracing is **single-flight** — only one trace runs at a time across the whole pool. Concurrent targeted renders skip their trace (logged at `debug`), so don't expect a trace for *every* render under load; constrain concurrency or accept the gaps. The summary and the cpuprofile/heap captures are per-render and unaffected.
- Browser-wide tracing is **single-flight** — only one trace runs at a time across the whole pool. Concurrent targeted renders skip their trace (logged at `debug`), so don't expect a trace for _every_ render under load; constrain concurrency or accept the gaps. The summary and the cpuprofile/heap captures are per-render and unaffected.
- Captured artifacts are anonymized only at the key level (host stripped). The blobs themselves contain card URLs and code paths — treat them as you would any prerender diagnostic.

## Field-by-field reading
1 change: 1 addition & 0 deletions .claude/skills/pr-screenshots/SKILL.md
Expand Up @@ -23,6 +23,7 @@ Visual changes benefit enormously from a screenshot — reviewers react to the a
```
![preview](https://raw.githubusercontent.com/<owner>/<repo>/<commit-sha>/.pr-images/<slug>/<name>.png)
```

5. **Followup commit** `git rm`s the image and pushes. GitHub still serves the blob from the named commit, so the SHA-pinned reference keeps working.

## Critical gotcha: pin to the commit SHA, never the branch
62 changes: 31 additions & 31 deletions .claude/skills/prerender-sizing/SKILL.md
Expand Up @@ -8,14 +8,14 @@ allowed-tools: Read, Grep, Glob, Bash

The prerender pool's tab capacity is governed by a small set of SSM-driven knobs:

| Env var | What it controls |
|---|---|
| `PRERENDER_PAGE_POOL_MIN` | Idle floor — pool never contracts below this. |
| `PRERENDER_PAGE_POOL_MAX` | Burst ceiling reachable by any priority. |
| `PRERENDER_PAGE_POOL_HIGH_PRIORITY_MAX` | Extra ceiling, reachable only when caller `priority >= HIGH_PRIORITY_THRESHOLD`. |
| `PRERENDER_HIGH_PRIORITY_THRESHOLD` | Priority bar that unlocks the upper tier. |
| `PRERENDER_PAGE_POOL_IDLE_CONTRACTION_MS` | Hysteresis window before each contraction tick. |
| `PRERENDER_SHARED_CONTEXT_CAP` | Absolute LRU cap for cached BrowserContexts. |
| Env var | What it controls |
| ----------------------------------------- | -------------------------------------------------------------------------------- |
| `PRERENDER_PAGE_POOL_MIN` | Idle floor — pool never contracts below this. |
| `PRERENDER_PAGE_POOL_MAX` | Burst ceiling reachable by any priority. |
| `PRERENDER_PAGE_POOL_HIGH_PRIORITY_MAX` | Extra ceiling, reachable only when caller `priority >= HIGH_PRIORITY_THRESHOLD`. |
| `PRERENDER_HIGH_PRIORITY_THRESHOLD` | Priority bar that unlocks the upper tier. |
| `PRERENDER_PAGE_POOL_IDLE_CONTRACTION_MS` | Hysteresis window before each contraction tick. |
| `PRERENDER_SHARED_CONTEXT_CAP` | Absolute LRU cap for cached BrowserContexts. |

Plus the ECS task definition's `cpu` and `memory`. All these together form the **memory envelope** that bounds how many warmed BrowserContexts the system can hold and how much burst headroom it has.
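
How the two ceilings interact with caller priority can be sketched as below. The function is hypothetical and only mirrors the table's wording. Taking the larger of the two ceilings for high-priority callers is a defensive assumption in case `HIGH_PRIORITY_MAX` is configured below `MAX`.

```python
def tab_ceiling(
    priority: int,
    pool_max: int,
    high_priority_max: int,
    high_priority_threshold: int,
) -> int:
    """Highest pool size a caller with this priority can drive the pool to."""
    if priority >= high_priority_threshold:
        # The upper tier is unlocked only at or above the threshold.
        return max(pool_max, high_priority_max)
    return pool_max
```

With `pool_max=6` and `high_priority_max=8`, a caller below the threshold stops at 6 tabs while one at or above it can reach 8.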

Expand All @@ -31,7 +31,7 @@ Trigger on any of:
- "Why does the dashboard show prerender memory peak at X%?"
- "Should I bump `PRERENDER_PAGE_POOL_MAX` from N to M?"

If the user is asking "why did this single render time out", that's the `indexing-diagnostics` skill, not this one. This skill is for *capacity planning*.
If the user is asking "why did this single render time out", that's the `indexing-diagnostics` skill, not this one. This skill is for _capacity planning_.

## The sizing model

Expand All @@ -47,7 +47,7 @@ where:
- `N`: number of warmed pool entries (active tabs + standby contexts the LRU is holding).
- `marginal_per_tab`: cost of one additional warmed BrowserContext + its cached fetches + tab queue state. Empirically derived per environment.

**CPU follows a different shape.** Each *actively rendering* tab consumes approximately one busy CPU core (Chromium docs / observed). But tabs alternate between rendering, host-side waits (fetches, store loads), and idle. So:
**CPU follows a different shape.** Each _actively rendering_ tab consumes approximately one busy CPU core (Chromium docs / observed). But tabs alternate between rendering, host-side waits (fetches, store loads), and idle. So:

```
cpu_peak ≈ (# tabs rendering simultaneously) × 1 vCPU
Expand Down Expand Up @@ -148,7 +148,7 @@ Confirms whether the system held under pressure (zero render-timeouts) or was at
-- skip the malformed rows: e.g. `AND diagnostics->'waits' ?
-- 'tabQueueMs'` (the JSONB `?` operator tests for a key) keeps
-- only rows with that key present.
SELECT
SELECT
count(*) AS rows_with_diag,
count(*) FILTER (WHERE (diagnostics->>'totalElapsedMs')::int >= 145000) AS at_or_over_timeout,
percentile_cont(0.95) WITHIN GROUP (ORDER BY (diagnostics->>'totalElapsedMs')::int) AS p95_total_ms,
Expand All @@ -167,7 +167,7 @@ WHERE diagnostics IS NOT NULL

Key signals to look for:

- `at_or_over_timeout > 0`: the system is *already* dropping renders. Sizing change is needed urgently.
- `at_or_over_timeout > 0`: the system is _already_ dropping renders. Sizing change is needed urgently.
- `max_tabq_ms` of seconds-to-tens-of-seconds: the user was waiting for a tab. This is the UX-visible pressure that priority routing + dynamic expansion exists to mitigate.
- `max_sem_ms` of seconds-to-tens-of-seconds: global render-semaphore saturation. Indicates pool is too small or fleet is too small.
- `p99_total_ms` near `145000` (the timeout budget): system was at the edge. Even if no timeouts fired, you're one bad burst from a 504.
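
The signals above can be turned into a quick triage check over one row of query output. The thresholds are assumptions chosen for illustration: "seconds" is taken as 1000 ms or more, and "near" the timeout as 90 % of the 145000 ms budget. The function name is hypothetical.

```python
from typing import List

TIMEOUT_MS = 145_000  # the render timeout budget used in the query


def sizing_signals(
    at_or_over_timeout: int,
    max_tabq_ms: float,
    max_sem_ms: float,
    p99_total_ms: float,
    wait_threshold_ms: float = 1_000,   # assumed cut-off for "seconds"
    near_timeout_fraction: float = 0.9,  # assumed meaning of "near"
) -> List[str]:
    """List the pressure signals present in one row of query results."""
    findings = []
    if at_or_over_timeout > 0:
        findings.append("renders are already being dropped: resize urgently")
    if max_tabq_ms >= wait_threshold_ms:
        findings.append("callers waited for a tab: tab-queue pressure")
    if max_sem_ms >= wait_threshold_ms:
        findings.append("render semaphore saturated: pool or fleet too small")
    if p99_total_ms >= near_timeout_fraction * TIMEOUT_MS:
        findings.append("p99 is at the edge of the timeout budget")
    return findings
```

An empty list means none of the listed signals fired for that window, not that the sizing is necessarily right.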
Expand Down Expand Up @@ -230,13 +230,13 @@ If the resize affects task size, do a Fargate pricing comparison. us-east-1 on-d

So:

| Task size | $/hr | /month per task |
|---|---:|---:|
| 1 vCPU / 4 GB | $0.058 | $42 |
| 2 vCPU / 8 GB | $0.117 | $85 |
| 2 vCPU / 16 GB | $0.152 | $111 |
| 4 vCPU / 8 GB | $0.197 | $144 |
| 4 vCPU / 16 GB | $0.233 | $170 |
| Task size | $/hr | /month per task |
| -------------- | -----: | --------------: |
| 1 vCPU / 4 GB | $0.058 | $42 |
| 2 vCPU / 8 GB | $0.117 | $85 |
| 2 vCPU / 16 GB | $0.152 | $111 |
| 4 vCPU / 8 GB | $0.197 | $144 |
| 4 vCPU / 16 GB | $0.233 | $170 |

If the resize is "swap memory for CPU" (the typical case for prerender — memory-bound, CPU over-provisioned), the cost may actually drop. **Always show the pricing delta in the PR description.** It's a meaningful data point for the resize decision.
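
The monthly figures in the table are consistent with the hourly rate multiplied by about 730 hours (24 × 365 / 12). A sketch of the delta calculation for a PR description, using only the hourly rates from the table (the function names are hypothetical, and rates should be re-checked against current Fargate pricing):

```python
from typing import Tuple

HOURS_PER_MONTH = 730  # assumed: 24 * 365 / 12

# (vCPU, memory GB) -> on-demand $/hr, copied from the table above
HOURLY_RATE = {
    (1, 4): 0.058,
    (2, 8): 0.117,
    (2, 16): 0.152,
    (4, 8): 0.197,
    (4, 16): 0.233,
}


def monthly_cost(size: Tuple[int, int], tasks: int = 1) -> float:
    """Approximate monthly cost in dollars for a task size."""
    return HOURLY_RATE[size] * HOURS_PER_MONTH * tasks


def monthly_delta(current: Tuple[int, int], proposed: Tuple[int, int], tasks: int = 1) -> float:
    """Positive means the proposed size costs more per month."""
    return monthly_cost(proposed, tasks) - monthly_cost(current, tasks)
```

For instance, `monthly_delta((4, 8), (2, 16))` is negative: trading over-provisioned CPU for memory lowers the bill.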

Expand All @@ -246,12 +246,12 @@ Captured on 2026-04-30 ~20:00 UTC for the CS-10976 PR 12 staging activation.

### Telemetry

| Metric | 24 h | 7 d |
|---|---:|---:|
| CPU avg of 5-min Avg | 1.1 % | 1.5 % |
| CPU 5-min peak | 67.5 % | 97.5 % |
| Memory avg of 5-min Avg | 35 % | 39 % |
| Memory 5-min peak | 64 % | 98.3 % |
| Metric | 24 h | 7 d |
| ----------------------- | -----: | -----: |
| CPU avg of 5-min Avg | 1.1 % | 1.5 % |
| CPU 5-min peak | 67.5 % | 97.5 % |
| Memory avg of 5-min Avg | 35 % | 39 % |
| Memory 5-min peak | 64 % | 98.3 % |

7-d render-timing histogram from `boxel_index.diagnostics`:

Expand Down Expand Up @@ -285,12 +285,12 @@ Queue-snapshot at the memory peak:

### Memory projection

| N tabs | Memory used | 8 GB (today) | 16 GB (resized) |
|---:|---:|:---:|:---:|
| 2 (MIN) | 3.7 GB | 46 % ✓ | 23 % ✓ |
| 4 | 5.4 GB | 67 % ✓ | 34 % ✓ |
| **6 (MAX)** | **7.1 GB** | **89 % ✗** | **44 % ✓** |
| **8 (HP_MAX)** | **8.7 GB** | **109 % ✗ OOM** | **55 % ✓** |
| N tabs | Memory used | 8 GB (today) | 16 GB (resized) |
| -------------: | ----------: | :-------------: | :-------------: |
| 2 (MIN) | 3.7 GB | 46 % ✓ | 23 % ✓ |
| 4 | 5.4 GB | 67 % ✓ | 34 % ✓ |
| **6 (MAX)** | **7.1 GB** | **89 % ✗** | **44 % ✓** |
| **8 (HP_MAX)** | **8.7 GB** | **109 % ✗ OOM** | **55 % ✓** |

The 16 GB resize is what makes HP_MAX=8 safe. On the existing 8 GB task, MAX=6 is already tight; HP_MAX=8 would OOM.
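
Assuming the sizing model is linear in `N` (a fixed baseline plus `N` × `marginal_per_tab`), the projection table can be reproduced or extended with a least-squares fit over its own rows. The fitted coefficients are derived from the rounded table values here, not taken from the original measurements, and the function names are hypothetical.

```python
from typing import List, Tuple

# (N tabs, memory used in GB), copied from the projection table
PROJECTION: List[Tuple[int, float]] = [(2, 3.7), (4, 5.4), (6, 7.1), (8, 8.7)]


def fit_linear(points: List[Tuple[int, float]]) -> Tuple[float, float]:
    """Ordinary least squares: returns (baseline_gb, marginal_gb_per_tab)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def projected_gb(n_tabs: int, baseline_gb: float, marginal_gb: float) -> float:
    """Memory projected for n_tabs warmed pool entries."""
    return baseline_gb + n_tabs * marginal_gb


def utilization_pct(used_gb: float, task_gb: float) -> float:
    """Share of the task's memory that a projection would occupy."""
    return 100 * used_gb / task_gb
```

Fitting the four rows gives roughly a 2.05 GB baseline and 0.835 GB per tab, which is a convenient way to sanity-check a proposed `MAX` or `HP_MAX` before editing the parameters.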

2 changes: 1 addition & 1 deletion .github/workflows/build-host.yml
Expand Up @@ -26,7 +26,7 @@ jobs:
- name: Restore boxel-icons build cache
id: icons-cache
uses: ./.github/actions/restore-icons-cache
- name: Build boxel-icons and boxel-ui
- name: Build boxel-icons
run: mise run build:ui
env:
SKIP_ICONS_BUILD: ${{ steps.icons-cache.outputs.cache-hit }}
11 changes: 1 addition & 10 deletions .github/workflows/ci-lint.yaml
Expand Up @@ -51,16 +51,7 @@ jobs:
- name: Lint Boxel UI
if: ${{ !cancelled() }}
run: pnpm run lint
working-directory: packages/boxel-ui/addon
- name: Build Boxel UI
# To faciliate linting of projects that depend on Boxel UI
if: ${{ !cancelled() }}
run: pnpm run build
working-directory: packages/boxel-ui/addon
- name: Lint Boxel UI Test App
if: ${{ !cancelled() }}
run: pnpm run lint
working-directory: packages/boxel-ui/test-app
working-directory: packages/boxel-ui
- name: Lint Host
if: ${{ !cancelled() }}
run: pnpm run lint