Skip to content
Draft
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1,957 changes: 1,957 additions & 0 deletions agent-runtime-budget-governance-design.md

Large diffs are not rendered by default.

1,143 changes: 1,143 additions & 0 deletions context-reliability-architecture.md

Large diffs are not rendered by default.

1,159 changes: 1,159 additions & 0 deletions docs/deep-review-design.md

Large diffs are not rendered by default.

194 changes: 194 additions & 0 deletions docs/deep-review-phase2-addendum.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,194 @@
# Deep Review Phase 2 Addendum Implementation Plan

> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.

**Goal:** Close the semantic gaps found after comparing the latest Deep Review implementation with `docs/deep-review-design.md` and `docs/deep-review-phase2-plan.md`.

**Architecture:** Keep the existing prompt-driven DeepReview orchestrator, but move the riskiest guarantees into deterministic runtime boundaries where the current documents overstate completion. The addendum does not expand Deep Review into a new scheduler architecture; it tightens cache correctness, launch backpressure, retry semantics, observability, and documentation truthfulness.

**Tech Stack:** Rust core (`bitfun-core`), TypeScript React frontend (`src/web-ui`), Vitest, Cargo tests.

---

## Source Documents

- `docs/deep-review-design.md`: original strategy-engine and reviewer-role design.
- `docs/deep-review-phase2-plan.md`: Phase 2 implementation plan and status table.
- This addendum only covers gaps discovered after the latest implementation. It does not replace either source document.

## Current Truth Model

| Area | Current code truth | Addendum stance |
|---|---|---|
| Conditional Frontend Reviewer | Diff/path classification and conditional manifest behavior exist | Keep as implemented; improve only if future conditional reviewers need the same manifest contract |
| Custom review subagents | Required tooling is centralized and invalid agents are skipped with `invalid_tooling` | Keep as implemented; add regression coverage only when touching this area |
| Dynamic concurrency | TaskTool has a hard cap safety net with structured rejection text | Keep queue/wait deferred unless hard rejection causes real launch failures |
| Retry | Budget tracking and retry guidance exist | Keep retry prompt-guided; automatic backend redispatch remains deferred |
| Partial timeout | Grace-period final-message capture exists | Keep limitation documented; do not claim arbitrary stream-fragment capture |
| Incremental cache | Per-session cache struct, metadata field, `packet_id` read path, completed-reviewer write-through, and hit/miss report signals exist | Keep project-level cache deferred |
| Token budget | File-split/max-file style guardrails exist | Defer prompt-byte enforcement until prompt-size estimation is available |
| Shared context cache | Prompt-only | Keep deferred unless duplicate Read/GetFileDiff cost becomes a measured bottleneck |
| Observability | User/report-facing reliability summaries exist for cache, skipped-reviewer, token-budget, partial-timeout, retry, and structured cap signals | Keep external telemetry deferred unless diagnostics require it |
| First-run consent dialog | Dialog shows cost/time/lineup details and has missing skipped-reason locale keys | Simplify to key reminders, preserve strategy choice and skipped-reviewer warnings, and complete locale coverage |

## Scope Boundaries

In scope:

- Correct documentation claims so they match current code.
- Simplify first-run Deep Review consent so users see only the key launch reminders and all visible strings are localized.
- Close incremental review cache correctness for per-session continuation.
- Add a deterministic backpressure path for DeepReview reviewer launch caps if hard rejection proves too brittle.
- Make retry status explicit: either implement automatic redispatch or keep the code/documentation as prompt-guided retry.
- Add metrics/report surfaces for cap rejection, partial timeout, retry, cache hit/miss, skipped reviewers, and token-budget decisions.

Out of scope:

- Project-level cross-session review cache.
- Programmatic shared Read/GetFileDiff cache.
- Full prompt-byte budget enforcement.
- Large diff summary generation as a mandatory pre-review step.
- Replacing the prompt-driven DeepReview orchestrator with a full backend-owned DAG scheduler.

## Risk Register

| Risk | Scenario | Mitigation |
|---|---|---|
| Hard cap rejection causes missed reviewers | Weak orchestrator launches too many reviewers and does not recover from tool error | Add queue/wait behavior or keep rejection surfaced in report coverage notes |
| Cache returns stale review output | Fingerprint misses strategy, roster, renamed path, or model changes | Use `packet_id` keys and invalidate on target files, reviewer roster, strategy, model, and relevant manifest changes |
| Cache key mismatch | Writer stores by `packet_id`, reader checks `subagent_type` | Resolve packet id from Task description/run manifest before reading cache |
| Retry loops grow token cost | Model retries broad scope repeatedly | Enforce max retries per role and require reduced scope plus `retry: true` |
| Partial output is overtrusted | Timed-out reviewer emits incomplete analysis | Preserve `partial_timeout` and reliability signals; judge/report treats it as lower confidence |
| Observability adds noise | Reports become too dense | Keep raw counters in metadata and show only summarized user-facing notes |
| Privacy retention is unclear | Cached reviewer output stores sensitive code findings | Keep first closure per-session only; document retention and deletion semantics before project-level cache |

## File Map

Documentation:

- `docs/deep-review-design.md`: source-of-truth design status and remaining work.
- `docs/deep-review-phase2-plan.md`: reconciled Phase 2 implementation status.
- `docs/deep-review-phase2-addendum.md`: this addendum and follow-up implementation plan.

Rust core:

- `src/crates/core/src/agentic/deep_review_policy.rs`: policy helpers, concurrency policy, retry budget, incremental cache helpers.
- `src/crates/core/src/agentic/tools/implementations/task_tool.rs`: reviewer launch enforcement, cache hit lookup, retry guidance.
- `src/crates/core/src/agentic/tools/implementations/code_review_tool.rs`: normalized reviewer packet metadata and completed-reviewer cache write-through boundary.
- `src/crates/core/src/agentic/coordination/coordinator.rs`: subagent execution, timeout/partial-result handling, possible future queue/wait boundary.
- `src/crates/core/src/service/session/types.rs`: session metadata fields for run manifest and cache data.
- `src/crates/core/src/agentic/persistence/manager.rs`: session metadata persistence.

Frontend:

- `src/web-ui/src/shared/services/reviewTeamService.ts`: manifest construction, work packets, token budget plan, skipped reviewer reporting.
- `src/web-ui/src/shared/services/reviewTargetClassifier.ts`: review target classification and conditional reviewer applicability.
- `src/web-ui/src/shared/services/reviewSubagentCapabilities.ts`: custom review subagent tool contract.
- `src/web-ui/src/flow_chat/utils/codeReviewReport.ts`: report formatting for packet, partial, cache, and reliability metadata.
- `src/web-ui/src/flow_chat/components/DeepReviewConsentDialog.tsx`: user-facing run strategy override and launch summary.

## Implementation Rounds

### Round 1: Documentation Reconciliation

**Goal:** Make the docs truthful before more code changes.

- [x] Update `docs/deep-review-design.md` so completed/remaining work distinguishes complete runtime behavior from safety-net or prompt-guided behavior.
- [x] Update `docs/deep-review-phase2-plan.md` so Phase B and Phase C are marked partial where the code has cap/guidance/read-path behavior but not full scheduler/retry/cache semantics.
- [x] Add this addendum with scope boundaries, risk register, file map, and follow-up rounds.

Verification:

- `rg -n "PARTIAL|Safety net|write-through|queue|redispatch|prompt-guided" docs/deep-review-design.md docs/deep-review-phase2-plan.md docs/deep-review-phase2-addendum.md`
- Expected: the three docs explicitly describe partial completion and do not claim queued dispatch, automatic retry, or cache write-through as complete.

### Round 2: First-Run Consent Dialog I18n And Density

**Goal:** Keep the first Deep Review confirmation dialog useful without making it feel like a full report.

- [x] Replace the two cost/time fact cards and active reviewer chip list with a compact reminder area.
- [x] Keep only these visible decisions and warnings:
- Deep Review can take longer and use more tokens than a standard review.
- The first pass is read-only.
- The selected run strategy can be changed before launch.
- Skipped reviewers are shown only when there are skipped reviewers.
- [x] Keep strategy override behavior unchanged, including project-level persistence when the user confirms.
- [x] Add missing locale keys for skipped reviewer reasons in `en-US`, `zh-CN`, and `zh-TW`.
- [x] Update dialog tests so they assert compact summary behavior and no longer require active reviewer names to be shown.

Verification:

- `pnpm --dir src/web-ui run test:run -- src/flow_chat/components/DeepReviewConsentDialog.test.tsx`
- Expected: the dialog still opens when skipped reviewers exist, skipped reason text is localized through locale keys, and strategy override persistence still works.

### Round 3: Incremental Cache Correctness Closure

**Goal:** Make per-session incremental review cache safe enough to claim as implemented.

- [x] Add tests in `src/crates/core/src/agentic/deep_review_policy.rs` for cache invalidation on fingerprint mismatch and for packet keys that include split-review suffixes such as `reviewer:ReviewSecurity:group-1-of-3`.
- [x] Add TaskTool tests in `src/crates/core/src/agentic/tools/implementations/task_tool.rs` for resolving cache keys from the Task description pattern `[packet reviewer:ReviewSecurity]` and from unique run-manifest work packets.
- [x] Change TaskTool cache lookup from `subagent_type` to resolved `packet_id`, with no cache hit when multiple packets match and no packet id is present.
- [x] Add a write-through boundary after `submit_code_review` normalizes reviewer packet metadata. Store only reviewers with `status=completed` and a non-empty `packet_id`.
- [x] Preserve per-session scope only. Do not add project-level storage in this round.

Verification:

- `cargo test -p bitfun-core deep_review -- --nocapture`
- Expected: cache helper tests pass, ambiguous packet lookup does not produce a false cache hit, and completed reviewer output can be written and read by the same packet id.

### Round 4: Launch Backpressure And Retry Semantics

**Goal:** Remove model-dependent ambiguity around concurrency cap and retry behavior without replacing the whole orchestrator.

- [x] Add a focused DeepReview TaskTool test showing current cap rejection behavior for excess reviewer launches.
- [x] Decide implementation mode:
- Preserved rejection for now and kept the existing structured `code`/`message` tool-error payload.
- Deferred bounded backpressure until real launch failures show hard rejection is too brittle.
- [x] Keep `staggerSeconds` as a documented future scheduler field unless this round implements actual wait spacing.
- [x] For retry, keep the current prompt-guided path unless structured partial coverage extraction is added. If automatic redispatch is added later, require `retry: true`, lower timeout, and reduced assigned scope.

Verification:

- `cargo test -p bitfun-core deep_review -- --nocapture`
- `cargo test -p bitfun-core coordination -- --nocapture`
- Expected: cap behavior is deterministic and tested; retry status is either prompt-guided and documented, or automatic redispatch is covered by tests.

### Round 5: Observability, Report Surfacing, And UI Stability

**Goal:** Make runtime tradeoffs visible without making the Deep Review report noisy.

- [x] Add structured counters or report metadata for cache hit/miss, concurrency cap rejection/wait, partial timeout, retry guidance, skipped reviewers, and token-budget skipped extras.
- [x] Update `src/web-ui/src/flow_chat/utils/codeReviewReport.ts` to summarize only meaningful reliability signals in exported Markdown.
- [x] Keep dense details collapsed or metadata-only in UI components so report readability does not regress.
- [x] Confirm all new user-facing strings are localized in `src/web-ui/src/locales/en-US/flow-chat.json`, `zh-CN`, and `zh-TW` when UI text is added.

Verification:

- `pnpm run lint:web`
- `pnpm run type-check:web`
- `pnpm --dir src/web-ui run test:run`
- Expected: locale coverage remains complete, report formatting tests cover the new summary rows, and no layout text overflow is introduced.

### Round 6: Final Verification And Documentation Freeze

**Goal:** Prove code and docs still match after the addendum work.

- [x] Re-read `docs/deep-review-design.md`, `docs/deep-review-phase2-plan.md`, and this addendum against the implemented code paths.
- [x] Update only status wording if implementation choices differ from this plan.
- [x] Run the smallest complete verification set for touched areas.

Verification:

- For Rust-only changes: `cargo test -p bitfun-core deep_review -- --nocapture`
- For frontend/report changes: `pnpm run lint:web && pnpm run type-check:web && pnpm --dir src/web-ui run test:run`
- For cross-boundary behavior: `cargo check --workspace --exclude bitfun-cli`

Expected: no document claims exceed code behavior, and every intentionally deferred item remains marked deferred.

## Completion Criteria

- Existing docs no longer state that queued dispatch, automatic retry redispatch, cache write-through, or prompt-byte budget enforcement are complete unless code implements them.
- Per-session incremental cache can only return data by a verified packet id and matching fingerprint.
- Concurrency cap behavior is either user/report-visible as a skipped/rejected reviewer or is converted into bounded backpressure.
- Retry behavior is explicitly either prompt-guided or automatic; there is no mixed wording.
- User-facing additions remain localized and report density remains controlled.
Loading
Loading