
[analyze 1/3] analyze-gather: collect git/PR/codebase/session signal into JSON #75

@willwashburn

Description


Part of the agentworkforce analyze feature. Issue 1 of 3. Foundation for #76 (persona-discoverer) and #77 (CLI subcommand).

Depends on #71 — start this after the persona-kit migration (#64–#71) ships. Pre-migration PersonaSpec / runAgentSelector go away, and any work built on top of them will need to be rewritten.

Goal

Add a pure-TS (no LLM) signal-gathering module that produces a single bounded JSON document describing how the team actually works in a given repo. This JSON becomes the input the persona-discoverer persona will read in issue 2.

Synthesis is a judgment task best done by an LLM, but gathering is mechanical and deterministic — keep it in TS so it's fast, cheap, testable, and produces the same output every run.

Files to touch

New:

  • packages/cli/src/analyze-gather.ts — the gather module.
  • packages/cli/src/analyze-gather.test.ts — Node test runner (match packages/cli/src/cli.test.ts style).

Modify:

  • listSessionTranscriptsForCwd() — a new shared helper. Post-migration location depends on who ends up owning session-transcript discovery:
    • If session lifecycle follows the spawn flow into @agentworkforce/persona-kit (the handle returned by executePersonaSpawnPlan per [persona-kit 4/8] Migrate workforce CLI to use @agentworkforce/persona-kit #67 is the natural owner), add the helper there and import it from analyze-gather.ts. This is the more likely landing spot.
    • If it stays CLI-layer (because burn-stamp ledger + filesystem walk feel like CLI concerns), extract from the auto-improve flow that currently lives near the bottom of packages/cli/src/cli.ts into packages/cli/src/session-transcripts.ts and import from there.
    • Either way: today's findSessionTranscriptViaStamps + cwd-content scan return one transcript for a just-ended run; the new helper is an enumerator returning all transcripts for a cwd. Both auto-improve and analyze should consume it.
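Wherever the helper lands, its signature could look like the following sketch. The names and fields here are illustrative only, chosen to line up with the sessions entries in the JSON shape; the real helper will walk the burn-stamp ledger / transcript directories, and its discovery logic is elided here.

```typescript
// Hypothetical sketch only — names and fields are illustrative; the real
// helper (exported from persona-kit or session-transcripts.ts) will contain
// the actual burn-stamp / filesystem discovery logic.
interface SessionTranscriptRef {
  harness: 'claude' | 'codex' | 'opencode';
  sessionId: string;
  transcriptPath: string;
  startedAt: string; // ISO8601
}

// Unlike today's findSessionTranscriptViaStamps (one transcript for a
// just-ended run), this enumerates every transcript recorded for a cwd,
// newest first. Body elided in this sketch; returns an empty list.
function listSessionTranscriptsForCwd(cwd: string): SessionTranscriptRef[] {
  void cwd; // cwd will drive the filesystem walk in the real implementation
  return [];
}
```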

JSON shape

```ts
interface AnalyzeGatherResult {
  cwd: string;
  generatedAt: string; // ISO8601
  commits: Array<{ sha: string; author: string; date: string; subject: string; files: Array<{ path: string; added: number; deleted: number }> }>;
  hotFiles: Array<{ path: string; commits: number; added: number; deleted: number }>;
  prs: Array<{ number: number; title: string; author: string; labels: string[]; mergedAt: string; additions: number; deletions: number; changedFiles: number }> | { skipped: true; reason: string };
  codebase: {
    tree: Array<{ dir: string; fileCounts: Record<string /* ext */, number>; isPackageRoot: boolean }>;
    packages: Array<{ path: string; name: string; scripts: string[]; depCount: number; devDepCount: number }>;
  };
  sessions: Array<{ harness: 'claude' | 'codex' | 'opencode'; sessionId: string; transcriptPath: string; startedAt: string; headLines: string[]; tailLines: string[] }> | { skipped: true; reason: string };
}
```

Bounds (do not exceed)

| Section | Bound |
| --- | --- |
| commits | min(--max-commits flag, last --lookback-days window). Defaults: 500 / 90. |
| hotFiles | top 80 by total churn (added + deleted) |
| prs | 200 most recent merged |
| codebase.tree | walk depth 6; skip node_modules, dist, .git, anything matched by the repo's .gitignore |
| codebase.packages | every package.json found within the depth bound |
| sessions | 30 most recent for this cwd; head 40 lines + tail 40 lines of each transcript |

These bounds keep the eventual analyzer prompt under ~100k tokens.
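To make the hotFiles bound concrete, here is a sketch of the churn aggregation (field names follow the AnalyzeGatherResult shape; the 80-entry default matches the bounds table, and per the Tasks below it consumes already-parsed commit deltas rather than shelling out again):

```typescript
// Per-file delta as parsed from one commit's numstat output.
interface FileDelta { path: string; added: number; deleted: number }

// Aggregate churn across commits, then keep the top `limit` files by
// total churn (added + deleted).
function topHotFiles(
  commits: Array<{ files: FileDelta[] }>,
  limit = 80,
): Array<{ path: string; commits: number; added: number; deleted: number }> {
  const byPath = new Map<string, { commits: number; added: number; deleted: number }>();
  for (const c of commits) {
    for (const f of c.files) {
      const cur = byPath.get(f.path) ?? { commits: 0, added: 0, deleted: 0 };
      cur.commits += 1;
      cur.added += f.added;
      cur.deleted += f.deleted;
      byPath.set(f.path, cur);
    }
  }
  return [...byPath.entries()]
    .map(([path, v]) => ({ path, ...v }))
    .sort((a, b) => (b.added + b.deleted) - (a.added + a.deleted))
    .slice(0, limit);
}
```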

Tasks

  • Public API: export async function gather(opts: GatherOptions): Promise<AnalyzeGatherResult> plus export interface GatherOptions { cwd: string; lookbackDays: number; maxCommits: number; includePrs: boolean; includeSessions: boolean; runner?: CommandRunner }.
  • Inject the command runner (so tests can stub git / gh deterministically). Default runner uses child_process.spawnSync — match the pattern in packages/harness-kit/src/detect.ts and cli.ts:748.
  • git log --since=<lookback> --no-merges --numstat --pretty=format:'…' — single invocation, parse all commits + file deltas in one pass.
  • Aggregate hotFiles from commit deltas (do not re-shell out).
  • gh pr list --state merged --json number,title,author,labels,mergedAt,additions,deletions,changedFiles --limit 200. Detect gh absence (spawnSync ENOENT) and unauthenticated state (gh auth status non-zero) — return { skipped: true, reason } rather than crashing.
  • Codebase walk via readdirSync({ withFileTypes: true }) (existing pattern in cli.ts:1307). Skip directories per the bounds table. Tag any dir containing package.json as a package root.
  • Per package: read package.json, extract name, scripts keys, count dependencies + devDependencies. Don't include version strings or dep names (too much noise for too little value at this stage).
  • Sessions: implement listSessionTranscriptsForCwd(cwd) by extracting/generalizing the existing logic at cli.ts:2540-2815. For each transcript file, read head -40 + tail -40 lines without slurping the whole file (sessions can be huge).
  • Pass --no-prs / --no-sessions plumbing through GatherOptions.
  • Write the result to a caller-supplied path (issue 3 will manage the path lifecycle); export the in-memory result too so tests can assert without disk I/O.
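The single-invocation git log parse could look like the sketch below. The NUL-separated --pretty format is one possible choice — the issue leaves the exact format string open — and the parser is an assumption, not the final implementation. Binary files report "-" in numstat and are recorded as 0/0 here; renamed paths ({old => new}) pass through verbatim and a real implementation would normalize them.

```typescript
// One-pass parse of output from (format string is an assumption):
//   git log --since=<lookback> --no-merges --numstat \
//     --pretty=format:'%H%x00%an%x00%aI%x00%s'
// Header lines carry NUL separators; numstat lines are "added<TAB>deleted<TAB>path".
interface ParsedCommit {
  sha: string; author: string; date: string; subject: string;
  files: Array<{ path: string; added: number; deleted: number }>;
}

function parseGitLog(raw: string): ParsedCommit[] {
  const commits: ParsedCommit[] = [];
  let cur: ParsedCommit | undefined;
  for (const line of raw.split('\n')) {
    if (line.includes('\0')) {
      // Commit header line.
      const [sha, author, date, subject] = line.split('\0');
      cur = { sha, author, date, subject, files: [] };
      commits.push(cur);
    } else if (cur && line.trim() !== '') {
      // numstat line; "-" means binary file (recorded as 0/0 here).
      const [added, deleted, ...rest] = line.split('\t');
      cur.files.push({
        path: rest.join('\t'), // rename syntax {old => new} kept verbatim
        added: added === '-' ? 0 : Number(added),
        deleted: deleted === '-' ? 0 : Number(deleted),
      });
    }
  }
  return commits;
}
```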

Tests

  • Stub a CommandRunner returning canned git log output; assert the parsed commit + numstat structure matches expected shape, including multiline subjects and renamed files ({old => new} syntax).
  • Stub gh returning the canned JSON shape; assert prs parses correctly.
  • Stub gh runner that throws ENOENT → assert prs: { skipped: true, reason: /not installed/i }.
  • Stub gh auth status returning non-zero → assert prs: { skipped: true, reason: /not authenticated/i }.
  • Build a temp directory tree (real fs, but in os.tmpdir()) with nested package.json files and a node_modules to skip → assert codebase walk shape and bounds.
  • includePrs: false → prs: { skipped: true, reason: /disabled/i }, no gh invocation.
  • includeSessions: false → sessions: { skipped: true, reason: /disabled/i }, no session enumeration.
  • Bounds enforced: feed 1000 fake commits → result has exactly maxCommits entries; feed 200 hot file candidates → result has top 80.
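The gh absence and auth cases these tests exercise could be detected roughly as follows. CommandRunner's shape is an assumption here, and this sketch assumes the runner throws on ENOENT — a spawnSync-based runner surfaces it via result.error instead, so the real check will differ in that detail.

```typescript
// Assumed runner shape, mirroring the useful parts of a spawnSync result.
type CommandRunner = (cmd: string, args: string[]) =>
  { status: number; stdout: string; stderr: string };

// Returns a `prs` skip marker when gh is missing or unauthenticated,
// or null when it is safe to proceed with `gh pr list`.
function detectGhSkip(run: CommandRunner): { skipped: true; reason: string } | null {
  try {
    const auth = run('gh', ['auth', 'status']);
    if (auth.status !== 0) return { skipped: true, reason: 'gh not authenticated' };
    return null;
  } catch (err) {
    if ((err as { code?: string }).code === 'ENOENT') {
      return { skipped: true, reason: 'gh not installed' };
    }
    throw err; // anything else is a real failure, not a skip
  }
}
```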

Verification

  • corepack pnpm --filter @agentworkforce/cli test passes (this issue's tests included).
  • Build is clean: corepack pnpm -r build.
  • Manual: write a small harness script that calls gather({ cwd: process.cwd(), … }) against this repo and pretty-prints the result. Eyeball it for sanity — commits cover the lookback window, hot files match what you'd expect, no node_modules in the codebase tree.

Constraints

  • No new runtime deps. Use child_process + fs + path — matches the rest of the CLI.
  • No gh requirement. Gather must produce a useful result on a machine without gh installed.
  • Don't slurp giant files. Some session JSONL transcripts are tens of MB; use streaming reads for head/tail extraction.
  • Don't leak secrets. Skip env files, .npmrc, .git/config from any sampling. We only read package.json (script names + dep counts, not values).
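One way to satisfy the no-slurp constraint, as a sketch: read the head through a readline stream and stop early, and read the tail from a single bounded chunk at the end of the file. The 64 KiB tail cap is an arbitrary choice here, not a number from this issue.

```typescript
import { createReadStream, openSync, readSync, closeSync, statSync } from 'node:fs';
import { createInterface } from 'node:readline';

// First n lines, streaming; stops reading once n lines are collected.
async function headLines(path: string, n: number): Promise<string[]> {
  const lines: string[] = [];
  const stream = createReadStream(path);
  const rl = createInterface({ input: stream });
  for await (const line of rl) {
    lines.push(line);
    if (lines.length >= n) break;
  }
  rl.close();
  stream.destroy();
  return lines;
}

// Last n lines, via one bounded read from the end of the file.
// Assumes the last n lines fit in `chunk` bytes (64 KiB here).
function tailLines(path: string, n: number, chunk = 64 * 1024): string[] {
  const size = statSync(path).size;
  const start = Math.max(0, size - chunk);
  const buf = Buffer.alloc(size - start);
  const fd = openSync(path, 'r');
  try {
    readSync(fd, buf, 0, buf.length, start);
  } finally {
    closeSync(fd);
  }
  const lines = buf.toString('utf8').split('\n');
  if (lines[lines.length - 1] === '') lines.pop(); // drop trailing newline artifact
  return lines.slice(-n);
}
```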
