Part of the agentworkforce analyze feature. Issue 1 of 3. Foundation for #76 (persona-discoverer) and #77 (CLI subcommand).
Depends on #71 — start this after the persona-kit migration (#64–#71) ships. Pre-migration PersonaSpec / runAgentSelector go away and any work built on top of them will need to be rewritten.
Goal
Add a pure-TS (no LLM) signal-gathering module that produces a single bounded JSON document describing how the team actually works in a given repo. This JSON becomes the input the persona-discoverer persona will read in issue 2.
The synthesizer is a judgment task best done by an LLM, but the gathering is mechanical and deterministic — keep it in TS so it's fast, cheap, testable, and works the same every run.
Files to touch
New:
packages/cli/src/analyze-gather.ts — the gather module.
packages/cli/src/analyze-gather.test.ts — Node test runner (match packages/cli/src/cli.test.ts style).
Modify:
listSessionTranscriptsForCwd() — a new shared helper. Post-migration location depends on who ends up owning session-transcript discovery:
If session lifecycle follows the spawn flow into @agentworkforce/persona-kit (the handle returned by executePersonaSpawnPlan, per #67 "[persona-kit 4/8] Migrate workforce CLI to use @agentworkforce/persona-kit", is the natural owner), add the helper there and import it from analyze-gather.ts. This is the more likely landing spot.
If it stays CLI-layer (because burn-stamp ledger + filesystem walk feel like CLI concerns), extract from the auto-improve flow that currently lives near the bottom of packages/cli/src/cli.ts into packages/cli/src/session-transcripts.ts and import from there.
Either way: today's findSessionTranscriptViaStamps + cwd-content scan return one transcript for a just-ended run; the new helper is an enumerator returning all transcripts for a cwd. Both auto-improve and analyze should consume it.
Bounds (do not exceed)
commits: min(--max-commits flag, last --lookback-days window). Defaults: 500 / 90.
hotFiles: top 80 by total churn (added+deleted)
prs: 200 most recent merged
codebase.tree: walk depth 6; skip node_modules, dist, .git, anything matched by repo's .gitignore
codebase.packages: every package.json found within depth bound
sessions: 30 most recent for this cwd; head 40 lines + tail 40 lines of each transcript
These bounds keep the eventual analyzer prompt under ~100k tokens.
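The JSON shape itself did not survive in this copy of the issue. As a rough sketch only, the result type could look something like the following. The top-level keys (commits, hotFiles, prs, codebase.tree, codebase.packages, sessions) and the { skipped, reason } form come from this issue; every other field name, the Skippable helper, and the BOUNDS constant are assumptions made for illustration.

```typescript
// Hypothetical result shape. Only the top-level keys and the
// { skipped: true, reason } form are taken from the issue text.
type Skippable<T> = T | { skipped: true; reason: string };

interface CommitSummary {
  sha: string;
  author: string;
  date: string; // ISO 8601
  subject: string;
  files: { path: string; added: number; deleted: number }[];
}

interface HotFile {
  path: string;
  churn: number; // added + deleted, summed over the gathered commits
}

interface PackageSummary {
  dir: string;
  name: string;
  scripts: string[]; // script names only, never the commands
  dependencyCount: number;
  devDependencyCount: number;
}

interface SessionSample {
  path: string;
  head: string[];
  tail: string[];
}

interface AnalyzeGatherResult {
  commits: CommitSummary[];
  hotFiles: HotFile[];
  prs: Skippable<unknown[]>;
  codebase: { tree: string[]; packages: PackageSummary[] };
  sessions: Skippable<SessionSample[]>;
}

// The bounds listed above, collected in one place.
const BOUNDS = {
  maxCommitsDefault: 500,
  lookbackDaysDefault: 90,
  hotFiles: 80,
  prs: 200,
  treeDepth: 6,
  sessions: 30,
  sessionHeadLines: 40,
  sessionTailLines: 40,
} as const;

// An empty but well-typed result, e.g. for a repo with no history.
const emptyResult: AnalyzeGatherResult = {
  commits: [],
  hotFiles: [],
  prs: { skipped: true, reason: "disabled" },
  codebase: { tree: [], packages: [] },
  sessions: { skipped: true, reason: "disabled" },
};
```

Keeping the bounds in a single constant would let the tests assert against the same numbers the gather code uses.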
Tasks
Public API: export async function gather(opts: GatherOptions): Promise<AnalyzeGatherResult> plus export interface GatherOptions { cwd: string; lookbackDays: number; maxCommits: number; includePrs: boolean; includeSessions: boolean; runner?: CommandRunner }.
Inject the command runner (so tests can stub git / gh deterministically). Default runner uses child_process.spawnSync — match the pattern in packages/harness-kit/src/detect.ts and cli.ts:748.
git log --since=<lookback> --no-merges --numstat --pretty=format:'…' — single invocation, parse all commits + file deltas in one pass.
Aggregate hotFiles from commit deltas (do not re-shell out).
gh pr list --state merged --json number,title,author,labels,mergedAt,additions,deletions,changedFiles --limit 200. Detect gh absence (spawnSync ENOENT) and unauthenticated state (gh auth status non-zero) — return { skipped: true, reason } rather than crashing.
Codebase walk via readdirSync({ withFileTypes: true }) (existing pattern in cli.ts:1307). Skip directories per the bounds table. Tag any dir containing package.json as a package root.
Per package: read package.json, extract name, scripts keys, count dependencies + devDependencies. Don't include version strings or dep names (too much noise for too little value at this stage).
Sessions: implement listSessionTranscriptsForCwd(cwd) by extracting/generalizing the existing logic at cli.ts:2540-2815. For each transcript file, read head -40 + tail -40 lines without slurping the whole file (sessions can be huge).
Pass --no-prs / --no-sessions plumbing through GatherOptions.
Write the result to a caller-supplied path (issue 3 will manage the path lifecycle); export the in-memory result too so tests can assert without disk I/O.
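The runner injection, single-pass git log parse, and hotFiles aggregation from the tasks above could be sketched roughly as follows. The CommandRunner signature, the PRETTY format string (the issue elides the real one), and all helper names are assumptions; the sketch uses control bytes 0x1e and 0x1f as record and field separators so that subjects containing ordinary punctuation cannot break the parse.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical runner contract; the issue names CommandRunner but not its signature.
interface RunResult { status: number | null; stdout: string; error?: Error | undefined }
type CommandRunner = (cmd: string, args: string[], cwd: string) => RunResult;

interface FileDelta { path: string; added: number; deleted: number }
interface Commit { sha: string; author: string; date: string; subject: string; files: FileDelta[] }
interface HotFile { path: string; churn: number }

// Assumed format: 0x1e starts each commit, 0x1f separates the header fields.
const PRETTY = "%x1e%H%x1f%an%x1f%aI%x1f%s";

const defaultRunner: CommandRunner = (cmd, args, cwd) => {
  const r = spawnSync(cmd, args, { cwd, encoding: "utf8", maxBuffer: 64 * 1024 * 1024 });
  return { status: r.status, stdout: r.stdout ?? "", error: r.error };
};

// Map numstat rename syntax ("src/{old => new}" or "old => new") to the new path.
function resolveRename(path: string): string {
  const braced = path.replace(/\{(.*?) => (.*?)\}/, "$2").replace(/\/\//g, "/");
  if (braced !== path) return braced;
  const arrow = path.indexOf(" => ");
  return arrow === -1 ? path : path.slice(arrow + 4);
}

// Parse every commit and its file deltas in one pass over the raw output.
function parseGitLog(raw: string): Commit[] {
  const commits: Commit[] = [];
  for (const record of raw.split("\x1e")) {
    if (record.trim() === "") continue;
    const [header = "", ...rest] = record.split("\n");
    const [sha = "", author = "", date = "", subject = ""] = header.split("\x1f");
    const files: FileDelta[] = [];
    for (const line of rest) {
      const m = /^(\d+|-)\t(\d+|-)\t(.+)$/.exec(line);
      if (!m) continue; // blank separator lines and anything unexpected
      const [, added = "-", deleted = "-", path = ""] = m;
      files.push({
        path: resolveRename(path),
        added: added === "-" ? 0 : Number(added), // "-" marks a binary file
        deleted: deleted === "-" ? 0 : Number(deleted),
      });
    }
    commits.push({ sha, author, date, subject, files });
  }
  return commits;
}

// Aggregate churn from already-parsed commits; no second git invocation.
function topHotFiles(commits: Commit[], limit = 80): HotFile[] {
  const totals = new Map<string, number>();
  for (const commit of commits) {
    for (const f of commit.files) {
      totals.set(f.path, (totals.get(f.path) ?? 0) + f.added + f.deleted);
    }
  }
  return [...totals]
    .map(([path, churn]) => ({ path, churn }))
    .sort((a, b) => b.churn - a.churn)
    .slice(0, limit);
}

function gatherCommits(
  cwd: string,
  lookbackDays: number,
  maxCommits: number,
  runner: CommandRunner = defaultRunner,
): Commit[] {
  const args = ["log", `--since=${lookbackDays}.days.ago`, "--no-merges", "--numstat", `--pretty=format:${PRETTY}`];
  const r = runner("git", args, cwd);
  if (r.error || r.status !== 0) return [];
  return parseGitLog(r.stdout).slice(0, maxCommits);
}
```

Because gatherCommits takes the runner as a parameter, a test can pass a stub that returns a canned string and never touches a real repository.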
Tests
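Stub a CommandRunner returning canned git log output; assert the parsed commit + numstat structure matches expected shape, including multiline subjects and renamed files ({old => new} syntax).
Stub gh returning the canned JSON shape; assert prs parses correctly.
Stub a gh runner that throws ENOENT → assert prs: { skipped: true, reason: /not installed/i }.
Stub gh auth status returning non-zero → assert prs: { skipped: true, reason: /not authenticated/i }.
Build a temp directory tree (real fs, but in os.tmpdir()) with nested package.json files and a node_modules to skip → assert codebase walk shape and bounds.
includePrs: false → prs: { skipped: true, reason: /disabled/i }, no gh invocation.
includeSessions: false → sessions: { skipped: true, reason: /disabled/i }, no session enumeration.
Bounds enforced: feed 1000 fake commits → result has exactly maxCommits entries; feed 200 hot file candidates → result has top 80.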
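For the gh handling described in the tasks, one possible sketch is shown below. It assumes a hypothetical CommandRunner that reports spawn failures through an error field (as spawnSync does) rather than throwing, and the gatherPrs name and reason strings are illustrative, chosen so they would satisfy "not installed", "not authenticated", and "disabled" style assertions.

```typescript
// Hypothetical runner contract; the issue names CommandRunner but not its signature.
interface RunResult { status: number | null; stdout: string; error?: Error | undefined }
type CommandRunner = (cmd: string, args: string[], cwd: string) => RunResult;

type Skipped = { skipped: true; reason: string };
type Prs = unknown[] | Skipped;

const PR_FIELDS = "number,title,author,labels,mergedAt,additions,deletions,changedFiles";

// spawnSync sets error.code to "ENOENT" when the binary cannot be found.
function isEnoent(err: Error | undefined): boolean {
  return (err as { code?: string } | undefined)?.code === "ENOENT";
}

function gatherPrs(cwd: string, includePrs: boolean, runner: CommandRunner): Prs {
  if (!includePrs) return { skipped: true, reason: "disabled (includePrs: false)" };

  // Probe first so a missing or logged-out gh degrades gracefully.
  const auth = runner("gh", ["auth", "status"], cwd);
  if (isEnoent(auth.error)) return { skipped: true, reason: "gh is not installed" };
  if (auth.error || auth.status !== 0) return { skipped: true, reason: "gh is not authenticated" };

  const list = runner(
    "gh",
    ["pr", "list", "--state", "merged", "--json", PR_FIELDS, "--limit", "200"],
    cwd,
  );
  if (list.error || list.status !== 0) return { skipped: true, reason: "gh pr list failed" };

  try {
    const parsed: unknown = JSON.parse(list.stdout);
    return Array.isArray(parsed) ? parsed : { skipped: true, reason: "unexpected gh output" };
  } catch {
    return { skipped: true, reason: "gh output was not valid JSON" };
  }
}
```

Returning a skipped marker instead of throwing keeps the rest of the gather result usable on machines without gh.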
Verification
corepack pnpm --filter @agentworkforce/cli test passes (this issue's tests included).
Build is clean: corepack pnpm -r build.
Manual: write a small harness script that calls gather({ cwd: process.cwd(), … }) against this repo and pretty-prints the result. Eyeball it for sanity — commits cover the lookback window, hot files match what you'd expect, no node_modules in the codebase tree.
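To make the "no node_modules in the codebase tree" check concrete, here is a minimal sketch of a depth-bounded walk over readdirSync({ withFileTypes: true }). The walkCodebase and demoWalk names and the return shape are assumptions, and .gitignore matching is deliberately left out of this sketch.

```typescript
import { mkdirSync, mkdtempSync, readdirSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join, relative, sep } from "node:path";

interface CodebaseWalk { tree: string[]; packageRoots: string[] }

// Directories that are never listed or entered. .gitignore handling is omitted here.
const SKIP_DIRS = new Set(["node_modules", "dist", ".git"]);

const toPosix = (p: string): string => p.split(sep).join("/");

function walkCodebase(root: string, maxDepth = 6): CodebaseWalk {
  const tree: string[] = [];
  const packageRoots: string[] = [];

  const visit = (dir: string, depth: number): void => {
    const entries = readdirSync(dir, { withFileTypes: true })
      .sort((a, b) => a.name.localeCompare(b.name)); // stable output across runs
    for (const entry of entries) {
      const full = join(dir, entry.name);
      const rel = toPosix(relative(root, full));
      if (entry.isDirectory()) {
        if (SKIP_DIRS.has(entry.name)) continue;
        tree.push(rel + "/");
        if (depth < maxDepth) visit(full, depth + 1); // stop descending at the bound
      } else if (entry.isFile()) {
        tree.push(rel);
        if (entry.name === "package.json") {
          // Tag the containing directory as a package root.
          packageRoots.push(toPosix(relative(root, dir)) || ".");
        }
      }
    }
  };

  visit(root, 1);
  return { tree, packageRoots };
}

// Usage: build a throwaway tree under os.tmpdir() and walk it.
function demoWalk(): CodebaseWalk {
  const root = mkdtempSync(join(tmpdir(), "gather-walk-"));
  mkdirSync(join(root, "packages", "cli", "src"), { recursive: true });
  mkdirSync(join(root, "node_modules", "left-pad"), { recursive: true });
  writeFileSync(join(root, "package.json"), "{}");
  writeFileSync(join(root, "packages", "cli", "package.json"), "{}");
  writeFileSync(join(root, "node_modules", "left-pad", "package.json"), "{}");
  return walkCodebase(root);
}
```

Sorting the directory entries is a design choice in this sketch: it keeps the output identical across runs, which fits the goal of deterministic gathering.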
Constraints
No new runtime deps. Use child_process + fs + path — matches the rest of the CLI.
No gh requirement. Gather must produce a useful result on a machine without gh installed.
Don't slurp giant files. Some session JSONL transcripts are tens of MB; use streaming reads for head/tail extraction.
Don't leak secrets. Skip env files, .npmrc, .git/config from any sampling. We only read package.json (script names + dep counts, not values).
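For the "don't slurp giant files" constraint, a streaming head/tail read might look like the sketch below. The headTail and demoHeadTail names, the return shape, and the per-line clip of 2000 characters are assumptions. Note that this version still scans the whole file once; it only keeps memory bounded. Seeking backwards from the end of the file would avoid the full scan at the cost of more code.

```typescript
import { createReadStream, mkdtempSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";
import { createInterface } from "node:readline";

interface HeadTail { head: string[]; tail: string[]; totalLines: number }

// Keep the first n and last n lines while holding at most 2n lines in memory.
async function headTail(path: string, n = 40, maxLineChars = 2000): Promise<HeadTail> {
  const head: string[] = [];
  const tail: string[] = [];
  let totalLines = 0;

  const rl = createInterface({
    input: createReadStream(path, { encoding: "utf8" }),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });

  for await (const line of rl) {
    totalLines++;
    // Single JSONL records can be very long, so clip each retained line.
    const clipped = line.length > maxLineChars ? line.slice(0, maxLineChars) : line;
    if (head.length < n) {
      head.push(clipped);
      continue;
    }
    tail.push(clipped);
    if (tail.length > n) tail.shift(); // sliding window over the last n lines
  }

  return { head, tail, totalLines };
}

// Usage: a 100-line file yields lines 1-40 as head and 61-100 as tail.
async function demoHeadTail(): Promise<HeadTail> {
  const dir = mkdtempSync(join(tmpdir(), "gather-session-"));
  const file = join(dir, "session.jsonl");
  const lines = Array.from({ length: 100 }, (_, i) => `line ${i + 1}`);
  writeFileSync(file, lines.join("\n"));
  return headTail(file);
}
```

Files shorter than n lines end up entirely in head with an empty tail, so no line is reported twice.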