docs: add async scheduling epic plan#658
Conversation
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Greptile SummaryThis PR adds the source-of-truth architecture plan for the async scheduling epic (#645): a PlantUML diagram source with rendered PNGs, and ten Markdown spec files covering architecture, contracts, module ownership, capacity model, task admission, request admission, observability, benchmarks, migration/cleanup, and issue mapping.
|
| Filename | Overview |
|---|---|
| plans/645/contracts.md | Normative DTO/event/config vocabulary; contains a schema inconsistency where RowGroupAdmission.observed_in_flight is placed under the configured section but represents observed runtime data, with no CapacityValue wrapper and a near-duplicate in observed_maxima. |
| plans/645/task-admission.md | Scheduler dispatch pseudocode merges queue-empty and admission-blocked conditions into a single emit call without showing the branching required to emit the correct canonical event_kind. |
| plans/645/architecture.md | Target shape and layer responsibilities; fully consistent with contracts.md, with try_acquire now correctly showing selection.queue_view in both the runtime flow and README spine. |
| plans/645/async-scheduling-epic.puml | PlantUML source for all six diagrams; previously-flagged issues (QueueSelection label, is_eligible param count, EvidenceStage label) appear to be addressed in this revision. |
| plans/645/benchmark-plan.md | Artifact schema, scenario matrix, and evidence gates; correlation/sequence fields and hidden-waiter derivation are well-specified and consistent with observability.md event taxonomy. |
| plans/645/request-admission.md | Request admission runtime shape, AIMD V1 semantics, and release classification; consistent with contracts.md throughout. |
| plans/645/observability.md | Event DTOs, runtime correlation, cardinality rules, and OTel boundary; canonical event kinds and snapshot fields are consistent with contracts.md. |
| plans/645/migration-and-cleanup.md | Legacy-name search gates and documentation cleanup rules; search commands are clear and cover all replaced naming patterns. |
| plans/645/module-ownership.md | Target module layout with class-to-file mapping; complete and consistent with architecture.md package boundaries. |
| plans/645/issue-map.md | GitHub issue-to-spec mapping and dependency order; evidence phasing and issue body cleanup pattern are clearly specified. |
Sequence Diagram
sequenceDiagram
participant CT as CompletionTracker
participant S as AsyncTaskScheduler
participant Q as FairTaskQueue
participant TAC as TaskAdmissionController
participant ME as ModelRequestExecutor
participant RAC as RequestAdmissionController
participant P as Provider/Endpoint
S->>CT: ready_frontier()
CT-->>S: Sequence[SchedulableTask]
S->>Q: enqueue(tasks)
Q-->>S: accepted task_ids
S->>CT: mark_enqueued(task_ids)
S->>Q: select_next(is_eligible callback)
Note over Q,TAC: Queue calls TAC.is_eligible(item, view) per candidate
Q-->>S: QueueSelection (item, queue_view, sequence_version)
S->>TAC: try_acquire(selection.item, selection.queue_view)
TAC-->>S: TaskAdmissionLease
S->>Q: commit(selection)
Q-->>S: SchedulableTask
S->>S: spawn_worker(task, lease)
Note over S,P: Inside admitted task execution
ME->>RAC: acquire_async(RequestAdmissionItem)
RAC-->>ME: RequestAdmissionLease
ME->>P: model call
P-->>ME: response
ME->>RAC: release(lease, RequestReleaseOutcome)
S->>TAC: release(task_lease)
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
plans/645/contracts.md:463-465
`RowGroupAdmission.observed_in_flight` is typed as `int | None` (not `CapacityValue[T]`) and sits inside the `configured` block, which `capacity-model.md` explicitly defines as "values computed before or at run start." Observed row-group concurrency is a runtime measurement, not a pre-run configuration. There is already a field `observed_maxima.row_groups_in_flight: int` that covers this correctly. An implementer reading the schema has no guidance on which field to populate, whether to write to both, or what the semantic difference is between them — and two independent implementers could produce incompatible artifacts.
```suggestion
RowGroupAdmission:
row_group_concurrency: CapacityValue[int]
```
### Issue 2 of 2
plans/645/task-admission.md:11-15
The pseudocode emits a single `emit_queue_empty_or_blocked` regardless of whether `select_next` returned `None` because the queue was empty or because every queued candidate failed eligibility. These are two distinct canonical `event_kind` values (`queue_empty` vs `admission_blocked` / `group_capped`) that are versioned as part of the benchmark artifact schema — emitting the wrong kind would break machine-gated benchmark checks. `contracts.md` is explicit that `explain_blocked` is only called "while queued work exists," so the empty-queue path should be branched before that call.
```suggestion
if selection is None:
if queue.is_empty():
emit_queue_empty(...)
wait_for_wake()
else:
block_summary = admission.explain_blocked(queue.view())
# dominant_reasons drives admission_blocked vs group_capped event_kind
emit_admission_blocked_or_group_capped(block_summary)
wait_for_wake_or_deadline(block_summary.available_after)
return
```
Reviews (8): Last reviewed commit: "docs: add async scheduling module owners..." | Re-trigger Greptile
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Resolve the second review pass over plans/645 by making the Markdown spec and UML source agree on task admission, request admission, capacity, telemetry, benchmark, migration, and issue-map contracts. Key updates include canonical event names, richer AsyncCapacityPlan fields, request waiter and cancellation semantics, timed wakeups, retry/salvage lease ordering, and clearer public/internal documentation boundaries. Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Feedback on PR #658: Spell Out Repository and Module OwnershipOverall, I think the #645 architecture plan is directionally strong. The diagrams correctly separate task admission, request admission, metadata resolution, telemetry, capacity diagnostics, and future policy work. That separation is useful because the current async engine already has these concerns; they are just partially implicit and concentrated in a few large modules. The main thing I would ask PR #658 to add is a concrete repository/module ownership section. The UML names the objects and the Markdown specs define their contracts, but I do not yet see enough guidance on where each object should live in the repo. Without that, the first implementation PRs may accidentally put too much into Requested ChangePlease add a section to The section should answer:
This does not need to lock every filename forever, but it should give implementers a default home for each object before implementation starts. Suggested Module LayoutOne possible organization: The exact filenames can change, but the ownership should stay clear:
Current Module Migration To Spell OutIt would also help to explicitly map current modules to their target homes:
Import Boundary RequestPlease also state import rules. For example:
Why This MattersThe plan is intentionally rich, but implementation needs boring, enforceable module boundaries. A module ownership table would make the sequence in #645 easier to follow and would reduce the risk that future PRs accidentally recreate the current coupling under new names. In short: the diagrams explain what the objects are and #645 explains when to implement them. Please also spell out where each object should live in the repository. |
| RowGroupAdmission: | ||
| row_group_concurrency: CapacityValue[int] | ||
| observed_in_flight: int | None |
There was a problem hiding this comment.
RowGroupAdmission.observed_in_flight is typed as int | None (not CapacityValue[T]) and sits inside the configured block, which capacity-model.md explicitly defines as "values computed before or at run start." Observed row-group concurrency is a runtime measurement, not a pre-run configuration. There is already a field observed_maxima.row_groups_in_flight: int that covers this correctly. An implementer reading the schema has no guidance on which field to populate, whether to write to both, or what the semantic difference is between them — and two independent implementers could produce incompatible artifacts.
| RowGroupAdmission: | |
| row_group_concurrency: CapacityValue[int] | |
| observed_in_flight: int | None | |
| RowGroupAdmission: | |
| row_group_concurrency: CapacityValue[int] |
Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/645/contracts.md
Line: 463-465
Comment:
`RowGroupAdmission.observed_in_flight` is typed as `int | None` (not `CapacityValue[T]`) and sits inside the `configured` block, which `capacity-model.md` explicitly defines as "values computed before or at run start." Observed row-group concurrency is a runtime measurement, not a pre-run configuration. There is already a field `observed_maxima.row_groups_in_flight: int` that covers this correctly. An implementer reading the schema has no guidance on which field to populate, whether to write to both, or what the semantic difference is between them — and two independent implementers could produce incompatible artifacts.
```suggestion
RowGroupAdmission:
row_group_concurrency: CapacityValue[int]
```
How can I resolve this? If you propose a fix, please make it concise.| if selection is None: | ||
| block_summary = admission.explain_blocked(queue.view()) | ||
| emit_queue_empty_or_blocked(block_summary) | ||
| wait_for_wake_or_deadline(block_summary.available_after) | ||
| return |
There was a problem hiding this comment.
The pseudocode emits a single
emit_queue_empty_or_blocked regardless of whether select_next returned None because the queue was empty or because every queued candidate failed eligibility. These are two distinct canonical event_kind values (queue_empty vs admission_blocked / group_capped) that are versioned as part of the benchmark artifact schema — emitting the wrong kind would break machine-gated benchmark checks. contracts.md is explicit that explain_blocked is only called "while queued work exists," so the empty-queue path should be branched before that call.
| if selection is None: | |
| block_summary = admission.explain_blocked(queue.view()) | |
| emit_queue_empty_or_blocked(block_summary) | |
| wait_for_wake_or_deadline(block_summary.available_after) | |
| return | |
| if selection is None: | |
| if queue.is_empty(): | |
| emit_queue_empty(...) | |
| wait_for_wake() | |
| else: | |
| block_summary = admission.explain_blocked(queue.view()) | |
| # dominant_reasons drives admission_blocked vs group_capped event_kind | |
| emit_admission_blocked_or_group_capped(block_summary) | |
| wait_for_wake_or_deadline(block_summary.available_after) | |
| return |
Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/645/task-admission.md
Line: 11-15
Comment:
The pseudocode emits a single `emit_queue_empty_or_blocked` regardless of whether `select_next` returned `None` because the queue was empty or because every queued candidate failed eligibility. These are two distinct canonical `event_kind` values (`queue_empty` vs `admission_blocked` / `group_capped`) that are versioned as part of the benchmark artifact schema — emitting the wrong kind would break machine-gated benchmark checks. `contracts.md` is explicit that `explain_blocked` is only called "while queued work exists," so the empty-queue path should be branched before that call.
```suggestion
if selection is None:
if queue.is_empty():
emit_queue_empty(...)
wait_for_wake()
else:
block_summary = admission.explain_blocked(queue.view())
# dominant_reasons drives admission_blocked vs group_capped event_kind
emit_admission_blocked_or_group_capped(block_summary)
wait_for_wake_or_deadline(block_summary.available_after)
return
```
How can I resolve this? If you propose a fix, please make it concise.
Summary
Adds the source-of-truth architecture plan and rendered UML reference artifacts for async scheduling epic #645. The plan has been expanded after reviewer passes so implementation issues can reference the Markdown specs for architecture, contracts, capacity, telemetry, benchmarks, cleanup, and issue sequencing instead of carrying the full design in issue bodies.
Related Issue
Part of #645.
Changes
Added
plans/645/README.mdas the plan index and diagram entrypoint.plans/645/async-scheduling-epic.pumlwith component, class, sequence, and issue dependency diagrams.Changed
TaskAdmissionBlockSummary, side-effect-free policy evaluation, timed wakeups, queue locking, accepted-id enqueue handoff, and retry/salvage lease ordering.RequestResourceResolver,RequestWaiter, no-bypass fairness, post-acquire cancellation handling, cooldown timed waits, stable AIMD ceilings, and release outcome taxonomy.AsyncCapacityPlanwith row-group, provider alias/raw-cap, AIMD config snapshot, runtime snapshot, and observed maxima fields.Rendered Diagrams
Component View
Task Admission Contracts
Request Admission Contracts
Capacity, Telemetry, and Evidence Contracts
Runtime Sequence
Issue Dependency Map
Attention Areas
plans/645/contracts.md- Normative DTO/protocol/event/config vocabulary for implementation issues.plans/645/async-scheduling-epic.puml- Diagram source that must stay aligned with the Markdown specs.plans/645/benchmark-plan.md- Machine-readable artifact schema and gates intended to replace one-off evidence.plans/645/migration-and-cleanup.md- Legacy-name cleanup gates and public/internal documentation boundary.Testing
plantuml -checkonly plans/645/async-scheduling-epic.pumlplantuml plans/645/async-scheduling-epic.pumlgit diff --checkmake testpasses - N/A, reference-only plan artifactsChecklist
Description updated with AI