Skip to content

docs: add async scheduling epic plan#658

Open
eric-tramel wants to merge 10 commits into
epic/645-async-schedulingfrom
codex/645-async-scheduling-uml-pr0
Open

docs: add async scheduling epic plan#658
eric-tramel wants to merge 10 commits into
epic/645-async-schedulingfrom
codex/645-async-scheduling-uml-pr0

Conversation

@eric-tramel
Copy link
Copy Markdown
Contributor

@eric-tramel eric-tramel commented May 14, 2026

Summary

Adds the source-of-truth architecture plan and rendered UML reference artifacts for async scheduling epic #645. The plan has been expanded after reviewer passes so implementation issues can reference the Markdown specs for architecture, contracts, capacity, telemetry, benchmarks, cleanup, and issue sequencing instead of carrying the full design in issue bodies.

Related Issue

Part of #645.

Changes

Added

  • Adds plans/645/README.md as the plan index and diagram entrypoint.
  • Adds plans/645/async-scheduling-epic.puml with component, class, sequence, and issue dependency diagrams.
  • Adds durable Markdown specs for architecture, contracts, capacity model, task admission, request admission, observability, benchmark artifacts, migration cleanup, and issue mapping.
  • Adds rendered PNG review artifacts for every PlantUML diagram.

Changed

Rendered Diagrams

Component View

Component view

Task Admission Contracts

Task admission class model

Request Admission Contracts

Request admission class model

Capacity, Telemetry, and Evidence Contracts

Support contracts class model

Runtime Sequence

Runtime sequence

Issue Dependency Map

Issue dependency map

Attention Areas

Reviewers: Please pay special attention to the following:

Testing

  • plantuml -checkonly plans/645/async-scheduling-epic.puml
  • plantuml plans/645/async-scheduling-epic.puml
  • git diff --check
  • Stale pre-epic request-control/semaphore terminology sweep; remaining hits are limited to explicit cleanup/search-gate sections
  • Commit hooks passed
  • make test passes - N/A, reference-only plan artifacts
  • Unit tests added/updated - N/A, no executable code
  • E2E tests added/updated - N/A, no executable code

Checklist

  • Follows commit message conventions
  • Commits are signed off (DCO)
  • Plan artifacts updated

Description updated with AI

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
@eric-tramel eric-tramel self-assigned this May 14, 2026
@eric-tramel eric-tramel added the plan Agent-assisted development plan label May 14, 2026
@eric-tramel eric-tramel marked this pull request as ready for review May 14, 2026 14:54
@eric-tramel eric-tramel requested a review from a team as a code owner May 14, 2026 14:54
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 14, 2026

Greptile Summary

This PR adds the source-of-truth architecture plan for the async scheduling epic (#645): a PlantUML diagram source with rendered PNGs, and ten Markdown spec files covering architecture, contracts, module ownership, capacity model, task admission, request admission, observability, benchmarks, migration/cleanup, and issue mapping.

  • The control-flow direction in the README spine, the selection.queue_view argument in try_acquire, the EvidenceStage issue label, and previously-flagged PUML association labels have all been corrected in this revision.
  • Two spec inconsistencies remain: RowGroupAdmission.observed_in_flight is placed in the configured section (defined as pre-run values) rather than observed_maxima, and the dispatch-loop pseudocode merges queue_empty and admission_blocked/group_capped into a single unnamed emit call, omitting the branching logic required to emit the correct versioned event kinds.

Confidence Score: 4/5

Safe to merge after the two spec inconsistencies are resolved; no executable code is changed.

Two spec inconsistencies would cause divergent or incorrect implementations: RowGroupAdmission.observed_in_flight sits in the configured section despite being runtime-observed data, creating ambiguity against observed_maxima.row_groups_in_flight; and the dispatch-loop pseudocode omits the branch between queue_empty and admission_blocked/group_capped, which are versioned event kinds in the machine-gated benchmark schema. The rest of the plan is internally consistent and the previously-flagged issues appear corrected.

plans/645/contracts.md (RowGroupAdmission schema placement) and plans/645/task-admission.md (dispatch pseudocode event branching)

Important Files Changed

Filename Overview
plans/645/contracts.md Normative DTO/event/config vocabulary; contains a schema inconsistency where RowGroupAdmission.observed_in_flight is placed under the configured section but represents observed runtime data, with no CapacityValue wrapper and a near-duplicate in observed_maxima.
plans/645/task-admission.md Scheduler dispatch pseudocode merges queue-empty and admission-blocked conditions into a single emit call without showing the branching required to emit the correct canonical event_kind.
plans/645/architecture.md Target shape and layer responsibilities; fully consistent with contracts.md, with try_acquire now correctly showing selection.queue_view in both the runtime flow and README spine.
plans/645/async-scheduling-epic.puml PlantUML source for all six diagrams; previously-flagged issues (QueueSelection label, is_eligible param count, EvidenceStage label) appear to be addressed in this revision.
plans/645/benchmark-plan.md Artifact schema, scenario matrix, and evidence gates; correlation/sequence fields and hidden-waiter derivation are well-specified and consistent with observability.md event taxonomy.
plans/645/request-admission.md Request admission runtime shape, AIMD V1 semantics, and release classification; consistent with contracts.md throughout.
plans/645/observability.md Event DTOs, runtime correlation, cardinality rules, and OTel boundary; canonical event kinds and snapshot fields are consistent with contracts.md.
plans/645/migration-and-cleanup.md Legacy-name search gates and documentation cleanup rules; search commands are clear and cover all replaced naming patterns.
plans/645/module-ownership.md Target module layout with class-to-file mapping; complete and consistent with architecture.md package boundaries.
plans/645/issue-map.md GitHub issue-to-spec mapping and dependency order; evidence phasing and issue body cleanup pattern are clearly specified.

Sequence Diagram

sequenceDiagram
    participant CT as CompletionTracker
    participant S as AsyncTaskScheduler
    participant Q as FairTaskQueue
    participant TAC as TaskAdmissionController
    participant ME as ModelRequestExecutor
    participant RAC as RequestAdmissionController
    participant P as Provider/Endpoint

    S->>CT: ready_frontier()
    CT-->>S: Sequence[SchedulableTask]
    S->>Q: enqueue(tasks)
    Q-->>S: accepted task_ids
    S->>CT: mark_enqueued(task_ids)

    S->>Q: select_next(is_eligible callback)
    Note over Q,TAC: Queue calls TAC.is_eligible(item, view) per candidate
    Q-->>S: QueueSelection (item, queue_view, sequence_version)

    S->>TAC: try_acquire(selection.item, selection.queue_view)
    TAC-->>S: TaskAdmissionLease

    S->>Q: commit(selection)
    Q-->>S: SchedulableTask
    S->>S: spawn_worker(task, lease)

    Note over S,P: Inside admitted task execution
    ME->>RAC: acquire_async(RequestAdmissionItem)
    RAC-->>ME: RequestAdmissionLease
    ME->>P: model call
    P-->>ME: response
    ME->>RAC: release(lease, RequestReleaseOutcome)
    S->>TAC: release(task_lease)
Loading
Prompt To Fix All With AI
Fix the following 2 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 2
plans/645/contracts.md:463-465
`RowGroupAdmission.observed_in_flight` is typed as `int | None` (not `CapacityValue[T]`) and sits inside the `configured` block, which `capacity-model.md` explicitly defines as "values computed before or at run start." Observed row-group concurrency is a runtime measurement, not a pre-run configuration. There is already a field `observed_maxima.row_groups_in_flight: int` that covers this correctly. An implementer reading the schema has no guidance on which field to populate, whether to write to both, or what the semantic difference is between them — and two independent implementers could produce incompatible artifacts.

```suggestion
RowGroupAdmission:
  row_group_concurrency: CapacityValue[int]
```

### Issue 2 of 2
plans/645/task-admission.md:11-15
The pseudocode emits a single `emit_queue_empty_or_blocked` regardless of whether `select_next` returned `None` because the queue was empty or because every queued candidate failed eligibility. These are two distinct canonical `event_kind` values (`queue_empty` vs `admission_blocked` / `group_capped`) that are versioned as part of the benchmark artifact schema — emitting the wrong kind would break machine-gated benchmark checks. `contracts.md` is explicit that `explain_blocked` is only called "while queued work exists," so the empty-queue path should be branched before that call.

```suggestion
if selection is None:
    if queue.is_empty():
        emit_queue_empty(...)
        wait_for_wake()
    else:
        block_summary = admission.explain_blocked(queue.view())
        # dominant_reasons drives admission_blocked vs group_capped event_kind
        emit_admission_blocked_or_group_capped(block_summary)
        wait_for_wake_or_deadline(block_summary.available_after)
    return
```

Reviews (8): Last reviewed commit: "docs: add async scheduling module owners..." | Re-trigger Greptile

Comment thread plans/645/README.md Outdated
Comment thread architecture/async-scheduling/async-scheduling-epic.puml Outdated
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Comment thread plans/645/async-scheduling-epic.puml Outdated
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Comment thread plans/645/async-scheduling-epic.puml
Comment thread plans/645/async-scheduling-epic.puml Outdated
eric-tramel and others added 2 commits May 14, 2026 12:44
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Comment thread plans/645/async-scheduling-epic.puml
Comment thread plans/645/async-scheduling-epic.puml Outdated
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Comment thread plans/645/architecture.md Outdated
Resolve the second review pass over plans/645 by making the Markdown spec and UML source agree on task admission, request admission, capacity, telemetry, benchmark, migration, and issue-map contracts.

Key updates include canonical event names, richer AsyncCapacityPlan fields, request waiter and cancellation semantics, timed wakeups, retry/salvage lease ordering, and clearer public/internal documentation boundaries.

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Comment thread plans/645/architecture.md Outdated
@nabinchha
Copy link
Copy Markdown
Contributor

Feedback on PR #658: Spell Out Repository and Module Ownership

Overall, I think the #645 architecture plan is directionally strong. The diagrams correctly separate task admission, request admission, metadata resolution, telemetry, capacity diagnostics, and future policy work. That separation is useful because the current async engine already has these concerns; they are just partially implicit and concentrated in a few large modules.

The main thing I would ask PR #658 to add is a concrete repository/module ownership section. The UML names the objects and the Markdown specs define their contracts, but I do not yet see enough guidance on where each object should live in the repo. Without that, the first implementation PRs may accidentally put too much into async_scheduler.py, keep request admission buried in models/clients/throttle_manager.py, or split related DTOs across packages in ways that make the new boundaries harder to enforce.

Requested Change

Please add a section to plans/645/architecture.md, plans/645/contracts.md, or a new plans/645/module-ownership.md that spells out where the new contracts and controllers are expected to live.

The section should answer:

  • Which objects live in data-designer-config versus data-designer-engine?
  • Which task-admission objects live near dataset_builders?
  • Which request-admission objects live near models?
  • Which current modules are being replaced, moved, or collapsed?
  • Which objects are public/plugin-facing, operator-facing, or engine-internal?
  • Which modules are allowed to import each other?
  • Where should tests and benchmark helpers live?

This does not need to lock every filename forever, but it should give implementers a default home for each object before implementation starts.

Suggested Module Layout

One possible organization:

packages/data-designer-config/src/data_designer/config/
  scheduling.py
    SchedulingMetadata
    SchedulingMetadataError

packages/data-designer-engine/src/data_designer/engine/
  dataset_builders/
    async_scheduler.py
      AsyncTaskScheduler remains the runtime coordinator only.

    scheduling/
      __init__.py
      resolver.py
        TaskSchedulingResolver
        ResolvedTaskScheduling

      resources.py
        TaskGroupSpec
        TaskGroupKey
        SchedulerResourceKey
        SchedulerResourceRequest
        SchedulableTask

      queue.py
        FairTaskQueue
        QueueView
        QueueSelection

      task_admission.py
        TaskAdmissionController
        TaskAdmissionLease
        TaskAdmissionDecision
        TaskAdmissionDenied
        TaskAdmissionView
        TaskAdmissionBlockSummary
        ReleaseResult

      task_policies.py
        TaskAdmissionPolicy
        StrictFairTaskAdmissionPolicy
        BoundedBorrowTaskAdmissionPolicy

      scheduler_events.py
        SchedulerAdmissionEvent
        SchedulerAdmissionEventSink

  models/
    request_executor.py
      ModelRequestExecutor

    request_admission/
      __init__.py
      resources.py
        ProviderModelKey
        RequestResourceKey
        RequestDomain
        RequestGroupSpec
        RequestAdmissionItem
        RequestEventContext

      resolver.py
        RequestResourceResolver

      controller.py
        RequestAdmissionController
        AdaptiveRequestAdmissionController
        RequestAdmissionLease
        RequestAdmissionDecision
        RequestAdmissionDenied
        RequestAdmissionError

      queue.py
        RequestFairQueue
        RequestWaiter
        RequestQueueView
        RequestQueueSelection

      limits.py
        AdaptiveRequestLimitState
        request AIMD state
        provider/model aggregate cap state

      outcomes.py
        RequestReleaseOutcome
        ReleaseResult

      pressure.py
        RequestPressureSnapshotProvider
        request pressure snapshot DTOs

      request_events.py
        RequestAdmissionEvent
        RequestAdmissionEventSink

  capacity.py
    AsyncCapacityPlan
    CapacityValue
    CapacityObservedMaxima
    ProviderModelStaticCap
    request/task capacity snapshots

  context.py
    RuntimeCorrelation
    RuntimeCorrelationProvider
    existing row-group context

  testing/
    async_scheduling.py
      fake admission controllers
      fake event sinks
      benchmark/harness fixtures

The exact filenames can change, but the ownership should stay clear:

  • dataset_builders/scheduling/ owns scheduler-level task readiness, queueing, selection, task admission, and scheduler admission telemetry.
  • models/request_admission/ owns provider/model/domain request admission at concrete model-call time.
  • models/request_executor.py owns the exact acquire/call/release boundary around outbound model attempts.
  • engine/capacity.py owns cross-cutting capacity diagnostics because it spans row groups, task admission, request admission, provider caps, and transport pools.
  • engine/context.py is the natural home for primitive runtime correlation because it already owns row-group context.

Current Module Migration To Spell Out

It would also help to explicitly map current modules to their target homes:

Current module Target direction
dataset_builders/async_scheduler.py Keep as coordinator; remove direct ownership of submission/LLM-wait accounting over time.
dataset_builders/utils/fair_task_queue.py Move or evolve into dataset_builders/scheduling/queue.py; remove admitted/running count ownership.
dataset_builders/utils/scheduling_hints.py Replace with TaskSchedulingResolver; remove independent introspection fallback.
models/clients/throttle_manager.py Rename/reshape into models/request_admission/ objects.
models/clients/throttled.py Collapse into or delegate through ModelRequestExecutor; avoid durable Throttle* production names.
models/telemetry.py Keep product telemetry separate from scheduler/request event DTOs unless an explicit bridge is added.

Import Boundary Request

Please also state import rules. For example:

  • Config-layer metadata DTOs must not import engine runtime protocols.
  • Engine scheduler modules may import config metadata DTOs.
  • Task admission must not import request admission controllers.
  • Request admission must not import dataset-builder scheduler types.
  • The two layers may meet through primitive correlation context, capacity snapshots, and read-only pressure snapshots.
  • Public plugin-facing APIs should expose SchedulingMetadata, not TaskGroupSpec, FairTaskQueue, TaskAdmissionController, or request admission internals.

Why This Matters

The plan is intentionally rich, but implementation needs boring, enforceable module boundaries. A module ownership table would make the sequence in #645 easier to follow and would reduce the risk that future PRs accidentally recreate the current coupling under new names.

In short: the diagrams explain what the objects are and #645 explains when to implement them. Please also spell out where each object should live in the repository.

Comment thread plans/645/contracts.md
Comment on lines +463 to +465
RowGroupAdmission:
row_group_concurrency: CapacityValue[int]
observed_in_flight: int | None
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 RowGroupAdmission.observed_in_flight is typed as int | None (not CapacityValue[T]) and sits inside the configured block, which capacity-model.md explicitly defines as "values computed before or at run start." Observed row-group concurrency is a runtime measurement, not a pre-run configuration. There is already a field observed_maxima.row_groups_in_flight: int that covers this correctly. An implementer reading the schema has no guidance on which field to populate, whether to write to both, or what the semantic difference is between them — and two independent implementers could produce incompatible artifacts.

Suggested change
RowGroupAdmission:
row_group_concurrency: CapacityValue[int]
observed_in_flight: int | None
RowGroupAdmission:
row_group_concurrency: CapacityValue[int]
Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/645/contracts.md
Line: 463-465

Comment:
`RowGroupAdmission.observed_in_flight` is typed as `int | None` (not `CapacityValue[T]`) and sits inside the `configured` block, which `capacity-model.md` explicitly defines as "values computed before or at run start." Observed row-group concurrency is a runtime measurement, not a pre-run configuration. There is already a field `observed_maxima.row_groups_in_flight: int` that covers this correctly. An implementer reading the schema has no guidance on which field to populate, whether to write to both, or what the semantic difference is between them — and two independent implementers could produce incompatible artifacts.

```suggestion
RowGroupAdmission:
  row_group_concurrency: CapacityValue[int]
```

How can I resolve this? If you propose a fix, please make it concise.

Comment on lines +11 to +15
if selection is None:
block_summary = admission.explain_blocked(queue.view())
emit_queue_empty_or_blocked(block_summary)
wait_for_wake_or_deadline(block_summary.available_after)
return
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The pseudocode emits a single emit_queue_empty_or_blocked regardless of whether select_next returned None because the queue was empty or because every queued candidate failed eligibility. These are two distinct canonical event_kind values (queue_empty vs admission_blocked / group_capped) that are versioned as part of the benchmark artifact schema — emitting the wrong kind would break machine-gated benchmark checks. contracts.md is explicit that explain_blocked is only called "while queued work exists," so the empty-queue path should be branched before that call.

Suggested change
if selection is None:
block_summary = admission.explain_blocked(queue.view())
emit_queue_empty_or_blocked(block_summary)
wait_for_wake_or_deadline(block_summary.available_after)
return
if selection is None:
if queue.is_empty():
emit_queue_empty(...)
wait_for_wake()
else:
block_summary = admission.explain_blocked(queue.view())
# dominant_reasons drives admission_blocked vs group_capped event_kind
emit_admission_blocked_or_group_capped(block_summary)
wait_for_wake_or_deadline(block_summary.available_after)
return
Prompt To Fix With AI
This is a comment left during a code review.
Path: plans/645/task-admission.md
Line: 11-15

Comment:
The pseudocode emits a single `emit_queue_empty_or_blocked` regardless of whether `select_next` returned `None` because the queue was empty or because every queued candidate failed eligibility. These are two distinct canonical `event_kind` values (`queue_empty` vs `admission_blocked` / `group_capped`) that are versioned as part of the benchmark artifact schema — emitting the wrong kind would break machine-gated benchmark checks. `contracts.md` is explicit that `explain_blocked` is only called "while queued work exists," so the empty-queue path should be branched before that call.

```suggestion
if selection is None:
    if queue.is_empty():
        emit_queue_empty(...)
        wait_for_wake()
    else:
        block_summary = admission.explain_blocked(queue.view())
        # dominant_reasons drives admission_blocked vs group_capped event_kind
        emit_admission_blocked_or_group_capped(block_summary)
        wait_for_wake_or_deadline(block_summary.available_after)
    return
```

How can I resolve this? If you propose a fix, please make it concise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

plan Agent-assisted development plan

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants