
[analyze 1/3] analyze-gather: collect git/PR/codebase/session signal into JSON #75

@willwashburn

Description


Part of the agentworkforce analyze feature. Issue 1 of 3. Foundation for #76 (persona-discoverer) and #77 (CLI subcommand).

Depends on #71 — start this after the persona-kit migration (#64–#71) ships. Pre-migration PersonaSpec / runAgentSelector go away, and any work built on top of them will need to be rewritten.

Goal

Add a pure-TS (no LLM) signal-gathering module that produces a single bounded JSON document describing how the team actually works in a given repo. This JSON becomes the input the persona-discoverer persona will read in issue 2.

Synthesis is a judgment task best done by an LLM, but gathering is mechanical and deterministic — keep it in TS so it's fast, cheap, testable, and produces the same output every run.

Files to touch

New:

  • packages/cli/src/analyze-gather.ts — the gather module.
  • packages/cli/src/analyze-gather.test.ts — Node test runner (match packages/cli/src/cli.test.ts style).

Modify:

  • listSessionTranscriptsForCwd() — a new shared helper. Post-migration location depends on who ends up owning session-transcript discovery:
    • If session lifecycle follows the spawn flow into @agentworkforce/persona-kit (the handle returned by executePersonaSpawnPlan per [persona-kit 4/8] Migrate workforce CLI to use @agentworkforce/persona-kit #67 is the natural owner), add the helper there and import it from analyze-gather.ts. This is the more likely landing spot.
    • If it stays CLI-layer (because burn-stamp ledger + filesystem walk feel like CLI concerns), extract from the auto-improve flow that currently lives near the bottom of packages/cli/src/cli.ts into packages/cli/src/session-transcripts.ts and import from there.
    • Either way: today's findSessionTranscriptViaStamps + cwd-content scan return one transcript for a just-ended run; the new helper is an enumerator returning all transcripts for a cwd. Both auto-improve and analyze should consume it.
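Wherever the helper lands, its signature could look like the following sketch. The names and fields here are illustrative only, chosen to line up with the sessions entries in the JSON shape; the real helper will walk the burn-stamp ledger / transcript directories, and its discovery logic is elided here.

```typescript
// Hypothetical sketch only — names and fields are illustrative; the real
// helper (exported from persona-kit or session-transcripts.ts) will contain
// the actual burn-stamp / filesystem discovery logic.
interface SessionTranscriptRef {
  harness: 'claude' | 'codex' | 'opencode';
  sessionId: string;
  transcriptPath: string;
  startedAt: string; // ISO8601
}

// Unlike today's findSessionTranscriptViaStamps (one transcript for a
// just-ended run), this enumerates every transcript recorded for a cwd,
// newest first. Body elided in this sketch; returns an empty list.
function listSessionTranscriptsForCwd(cwd: string): SessionTranscriptRef[] {
  void cwd; // cwd will drive the filesystem walk in the real implementation
  return [];
}
```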

JSON shape

```ts
interface AnalyzeGatherResult {
  cwd: string;
  generatedAt: string; // ISO8601
  commits: Array<{ sha: string; author: string; date: string; subject: string; files: Array<{ path: string; added: number; deleted: number }> }>;
  hotFiles: Array<{ path: string; commits: number; added: number; deleted: number }>;
  prs: Array<{ number: number; title: string; author: string; labels: string[]; mergedAt: string; additions: number; deletions: number; changedFiles: number }> | { skipped: true; reason: string };
  codebase: {
    tree: Array<{ dir: string; fileCounts: Record<string /* ext */, number>; isPackageRoot: boolean }>;
    packages: Array<{ path: string; name: string; scripts: string[]; depCount: number; devDepCount: number }>;
  };
  sessions: Array<{ harness: 'claude' | 'codex' | 'opencode'; sessionId: string; transcriptPath: string; startedAt: string; headLines: string[]; tailLines: string[] }> | { skipped: true; reason: string };
}
```

Bounds (do not exceed)

| Section | Bound |
| --- | --- |
| commits | min(--max-commits flag, last --lookback-days window). Defaults: 500 / 90. |
| hotFiles | top 80 by total churn (added + deleted) |
| prs | 200 most recent merged |
| codebase.tree | walk depth 6; skip node_modules, dist, .git, anything matched by the repo's .gitignore |
| codebase.packages | every package.json found within the depth bound |
| sessions | 30 most recent for this cwd; head 40 lines + tail 40 lines of each transcript |

These bounds keep the eventual analyzer prompt under ~100k tokens.
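To make the hotFiles bound concrete, here is a sketch of the churn aggregation (field names follow the AnalyzeGatherResult shape; the 80-entry default matches the bounds table, and per the Tasks below it consumes already-parsed commit deltas rather than shelling out again):

```typescript
// Per-file delta as parsed from one commit's numstat output.
interface FileDelta { path: string; added: number; deleted: number }

// Aggregate churn across commits, then keep the top `limit` files by
// total churn (added + deleted).
function topHotFiles(
  commits: Array<{ files: FileDelta[] }>,
  limit = 80,
): Array<{ path: string; commits: number; added: number; deleted: number }> {
  const byPath = new Map<string, { commits: number; added: number; deleted: number }>();
  for (const c of commits) {
    for (const f of c.files) {
      const cur = byPath.get(f.path) ?? { commits: 0, added: 0, deleted: 0 };
      cur.commits += 1;
      cur.added += f.added;
      cur.deleted += f.deleted;
      byPath.set(f.path, cur);
    }
  }
  return [...byPath.entries()]
    .map(([path, v]) => ({ path, ...v }))
    .sort((a, b) => (b.added + b.deleted) - (a.added + a.deleted))
    .slice(0, limit);
}
```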

Tasks

  • Public API: export async function gather(opts: GatherOptions): Promise<AnalyzeGatherResult> plus export interface GatherOptions { cwd: string; lookbackDays: number; maxCommits: number; includePrs: boolean; includeSessions: boolean; runner?: CommandRunner }.
  • Inject the command runner (so tests can stub git / gh deterministically). Default runner uses child_process.spawnSync — match the pattern in packages/harness-kit/src/detect.ts and cli.ts:748.
  • git log --since=<lookback> --no-merges --numstat --pretty=format:'…' — single invocation, parse all commits + file deltas in one pass.
  • Aggregate hotFiles from commit deltas (do not re-shell out).
  • gh pr list --state merged --json number,title,author,labels,mergedAt,additions,deletions,changedFiles --limit 200. Detect gh absence (spawnSync ENOENT) and unauthenticated state (gh auth status non-zero) — return { skipped: true, reason } rather than crashing.
  • Codebase walk via readdirSync({ withFileTypes: true }) (existing pattern in cli.ts:1307). Skip directories per the bounds table. Tag any dir containing package.json as a package root.
  • Per package: read package.json, extract name, scripts keys, count dependencies + devDependencies. Don't include version strings or dep names (too much noise for too little value at this stage).
  • Sessions: implement listSessionTranscriptsForCwd(cwd) by extracting/generalizing the existing logic at cli.ts:2540-2815. For each transcript file, read head -40 + tail -40 lines without slurping the whole file (sessions can be huge).
  • Pass --no-prs / --no-sessions plumbing through GatherOptions.
  • Write the result to a caller-supplied path (issue 3 will manage the path lifecycle); export the in-memory result too so tests can assert without disk I/O.
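The single-invocation git log parse could look like the sketch below. The NUL-separated --pretty format is one possible choice — the issue leaves the exact format string open — and the parser is an assumption, not the final implementation. Binary files report "-" in numstat and are recorded as 0/0 here; renamed paths ({old => new}) pass through verbatim and a real implementation would normalize them.

```typescript
// One-pass parse of output from (format string is an assumption):
//   git log --since=<lookback> --no-merges --numstat \
//     --pretty=format:'%H%x00%an%x00%aI%x00%s'
// Header lines carry NUL separators; numstat lines are "added<TAB>deleted<TAB>path".
interface ParsedCommit {
  sha: string; author: string; date: string; subject: string;
  files: Array<{ path: string; added: number; deleted: number }>;
}

function parseGitLog(raw: string): ParsedCommit[] {
  const commits: ParsedCommit[] = [];
  let cur: ParsedCommit | undefined;
  for (const line of raw.split('\n')) {
    if (line.includes('\0')) {
      // Commit header line.
      const [sha, author, date, subject] = line.split('\0');
      cur = { sha, author, date, subject, files: [] };
      commits.push(cur);
    } else if (cur && line.trim() !== '') {
      // numstat line; "-" means binary file (recorded as 0/0 here).
      const [added, deleted, ...rest] = line.split('\t');
      cur.files.push({
        path: rest.join('\t'), // rename syntax {old => new} kept verbatim
        added: added === '-' ? 0 : Number(added),
        deleted: deleted === '-' ? 0 : Number(deleted),
      });
    }
  }
  return commits;
}
```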

Tests

  • Stub a CommandRunner returning canned git log output; assert the parsed commit + numstat structure matches expected shape, including multiline subjects and renamed files ({old => new} syntax).
  • Stub gh returning the canned JSON shape; assert prs parses correctly.
  • Stub gh runner that throws ENOENT → assert prs: { skipped: true, reason: /not installed/i }.
  • Stub gh auth status returning non-zero → assert prs: { skipped: true, reason: /not authenticated/i }.
  • Build a temp directory tree (real fs, but in os.tmpdir()) with nested package.json files and a node_modules to skip → assert codebase walk shape and bounds.
  • includePrs: false → prs: { skipped: true, reason: /disabled/i }, no gh invocation.
  • includeSessions: false → sessions: { skipped: true, reason: /disabled/i }, no session enumeration.
  • Bounds enforced: feed 1000 fake commits → result has exactly maxCommits entries; feed 200 hot file candidates → result has top 80.
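The gh absence and auth cases these tests exercise could be detected roughly as follows. CommandRunner's shape is an assumption here, and this sketch assumes the runner throws on ENOENT — a spawnSync-based runner surfaces it via result.error instead, so the real check will differ in that detail.

```typescript
// Assumed runner shape, mirroring the useful parts of a spawnSync result.
type CommandRunner = (cmd: string, args: string[]) =>
  { status: number; stdout: string; stderr: string };

// Returns a `prs` skip marker when gh is missing or unauthenticated,
// or null when it is safe to proceed with `gh pr list`.
function detectGhSkip(run: CommandRunner): { skipped: true; reason: string } | null {
  try {
    const auth = run('gh', ['auth', 'status']);
    if (auth.status !== 0) return { skipped: true, reason: 'gh not authenticated' };
    return null;
  } catch (err) {
    if ((err as { code?: string }).code === 'ENOENT') {
      return { skipped: true, reason: 'gh not installed' };
    }
    throw err; // anything else is a real failure, not a skip
  }
}
```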

Verification

  • corepack pnpm --filter @agentworkforce/cli test passes (this issue's tests included).
  • Build is clean: corepack pnpm -r build.
  • Manual: write a small harness script that calls gather({ cwd: process.cwd(), … }) against this repo and pretty-prints the result. Eyeball it for sanity — commits cover the lookback window, hot files match what you'd expect, no node_modules in the codebase tree.

Constraints

  • No new runtime deps. Use child_process + fs + path — matches the rest of the CLI.
  • No gh requirement. Gather must produce a useful result on a machine without gh installed.
  • Don't slurp giant files. Some session JSONL transcripts are tens of MB; use streaming reads for head/tail extraction.
  • Don't leak secrets. Skip env files, .npmrc, .git/config from any sampling. We only read package.json (script names + dep counts, not values).
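One way to satisfy the no-slurp constraint, as a sketch: read the head through a readline stream and stop early, and read the tail from a single bounded chunk at the end of the file. The 64 KiB tail cap is an arbitrary choice here, not a number from this issue.

```typescript
import { createReadStream, openSync, readSync, closeSync, statSync } from 'node:fs';
import { createInterface } from 'node:readline';

// First n lines, streaming; stops reading once n lines are collected.
async function headLines(path: string, n: number): Promise<string[]> {
  const lines: string[] = [];
  const stream = createReadStream(path);
  const rl = createInterface({ input: stream });
  for await (const line of rl) {
    lines.push(line);
    if (lines.length >= n) break;
  }
  rl.close();
  stream.destroy();
  return lines;
}

// Last n lines, via one bounded read from the end of the file.
// Assumes the last n lines fit in `chunk` bytes (64 KiB here).
function tailLines(path: string, n: number, chunk = 64 * 1024): string[] {
  const size = statSync(path).size;
  const start = Math.max(0, size - chunk);
  const buf = Buffer.alloc(size - start);
  const fd = openSync(path, 'r');
  try {
    readSync(fd, buf, 0, buf.length, start);
  } finally {
    closeSync(fd);
  }
  const lines = buf.toString('utf8').split('\n');
  if (lines[lines.length - 1] === '') lines.pop(); // drop trailing newline artifact
  return lines.slice(-n);
}
```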
