feat: implement async scheduling admission control #661
Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Resolve the second review pass over plans/645 by making the Markdown spec and UML source agree on task admission, request admission, capacity, telemetry, benchmark, migration, and issue-map contracts.

Key updates include canonical event names, richer AsyncCapacityPlan fields, request waiter and cancellation semantics, timed wakeups, retry/salvage lease ordering, and clearer public/internal documentation boundaries.

Signed-off-by: Eric W. Tramel <eric.tramel@gmail.com>
MkDocs preview: https://f0b2b167.dd-docs-preview.pages.dev
Fern preview: https://nvidia-preview-pr-661.docs.buildwithfern.com/nemo/datadesigner
Thanks for putting this together, @eric-tramel. This is a substantial reshaping of the runtime control surfaces, and the new module ownership reads cleanly. Here are my thoughts.
Summary
This PR splits runtime control into explicit scheduler task admission (
Findings
Warnings: Worth addressing
from data_designer.engine.dataset_builders.scheduling.resources import stable_task_id
# ...
self._frontier = {task for task in self._frontier if stable_task_id(task) not in wanted}
except Exception:
    logger.warning("Admission event sink raised; dropping event", exc_info=True)
    return
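The handler quoted above logs and drops an event when the sink raises. As a minimal sketch of that best-effort pattern, assuming a plain callable sink and a dict-shaped event (`emit_best_effort` and both types are hypothetical, not the PR's API), the caller can also be told whether delivery succeeded:

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("admission_events_sketch")


def emit_best_effort(sink: Callable[[dict[str, Any]], None], event: dict[str, Any]) -> bool:
    """Hand `event` to `sink`; return False if the sink raised and the event was dropped."""
    try:
        sink(event)
    except Exception:
        # Telemetry must not break admission, so log the failure and move on.
        logger.warning("Admission event sink raised; dropping event", exc_info=True)
        return False
    return True


def failing_sink(event: dict[str, Any]) -> None:
    """Hypothetical sink that always fails, used to exercise the drop path."""
    raise ValueError("sink unavailable")


received: list[dict[str, Any]] = []
delivered = emit_best_effort(received.append, {"kind": "example"})
dropped = emit_best_effort(failing_sink, {"kind": "example"})
```

Returning a boolean is one design choice among several; a counter of dropped events would serve the same purpose if callers should stay unaware of sink failures.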
def acquire_sync(self, item: RequestAdmissionItem) -> RequestAdmissionLease:
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        pass
    else:
        raise RuntimeError(
            "acquire_sync would block the running event loop; use acquire_async instead."
        )
    ...
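The guard in the snippet above relies on `asyncio.get_running_loop()` raising `RuntimeError` when no loop is running in the current thread. A self-contained sketch of the same check follows; the helper names `ensure_no_running_loop`, `blocking_acquire`, and `misuse_from_coroutine` are illustrative assumptions, and the returned string stands in for a real lease:

```python
import asyncio


def ensure_no_running_loop(caller: str = "acquire_sync") -> None:
    """Raise if called from a thread that is currently running an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return  # No loop is running here, so a blocking wait is safe.
    raise RuntimeError(
        f"{caller} would block the running event loop; use acquire_async instead."
    )


def blocking_acquire() -> str:
    """Stand-in for a synchronous acquire path; returns a placeholder lease."""
    ensure_no_running_loop()
    return "lease"


async def misuse_from_coroutine() -> str:
    """Call the blocking path from inside a coroutine to show the rejection."""
    try:
        blocking_acquire()
    except RuntimeError as exc:
        return f"rejected: {exc}"
    return "unexpectedly acquired"


sync_result = blocking_acquire()
async_result = asyncio.run(misuse_from_coroutine())
```

Called from ordinary synchronous code, the guard passes; called from a coroutine, it raises before any blocking wait can stall the loop.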
@dataclass
class _EndpointBucket:
    aliases: list[str] = field(default_factory=list)
    caps: list[int] = field(default_factory=list)

endpoints: dict[tuple[str, str, str], _EndpointBucket] = {}
...
bucket = endpoints.setdefault(endpoint, _EndpointBucket())
bucket.aliases.append(alias)
bucket.caps.append(cap)
This drops the
Suggestions — Take it or leave it
Duplication between
What Looks Good
Verdict
Needs changes: the duplicated
This review was generated by an AI assistant.
📋 Summary
Implements the issue 645 async scheduling epic by splitting runtime control into explicit scheduler task admission and concrete model-request admission, with typed scheduling metadata, AIMD-backed request leases, capacity snapshots, and correlated observability. This PR also updates architecture/docs/Fern assets and includes benchmark evidence from live GPT-5.5 and GPT-5 Nano traffic.
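For readers unfamiliar with AIMD (additive increase, multiplicative decrease), the following is a minimal sketch of how such a concurrency limit could behave. It is not the PR's `AdaptiveRequestAdmissionController`: the class `AimdLimit`, its fields, and the default constants are assumptions chosen for illustration, and it uses a simplified fixed increase per successful release.

```python
from dataclasses import dataclass


@dataclass
class AimdLimit:
    """Hypothetical AIMD concurrency limit for in-flight model requests."""

    limit: float = 4.0            # current allowed concurrency
    min_limit: float = 1.0        # floor after repeated decreases
    max_limit: float = 64.0       # ceiling after repeated increases
    increase: float = 1.0         # additive step applied on success
    decrease_factor: float = 0.5  # multiplicative factor applied on throttling
    in_flight: int = 0

    def try_acquire(self) -> bool:
        """Admit one request if current concurrency is below the limit."""
        if self.in_flight < int(self.limit):
            self.in_flight += 1
            return True
        return False

    def release(self, *, throttled: bool) -> None:
        """Return one slot and adapt the limit based on the request outcome."""
        if self.in_flight <= 0:
            raise RuntimeError("release called without a matching acquire")
        self.in_flight -= 1
        if throttled:
            self.limit = max(self.min_limit, self.limit * self.decrease_factor)
        else:
            self.limit = min(self.max_limit, self.limit + self.increase)


controller = AimdLimit(limit=4.0)
admitted = [controller.try_acquire() for _ in range(5)]  # fifth attempt is refused
controller.release(throttled=True)                       # provider pushed back: halve
limit_after_throttle = controller.limit
controller.release(throttled=False)                      # success: add one
limit_after_success = controller.limit
```

A throttled release halves the limit while a successful one raises it by a fixed step, so the limit backs off quickly under provider pressure and recovers gradually.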
🔗 Related Issue
Refs #645
🔄 Changes
✨ Added
- SchedulingMetadata and validation in the config package.
- ModelRequestExecutor and AdaptiveRequestAdmissionController for per-attempt provider/model/domain admission.
- artifacts/645-live-bench*.
🔧 Changed
🗑️ Removed
🔍 Attention Areas
- packages/data-designer-engine/src/data_designer/engine/dataset_builders/async_scheduler.py: central runtime control flow and lease lifecycle.
- packages/data-designer-engine/src/data_designer/engine/models/clients/request_admission.py: AIMD request admission state machine and exact request lease accounting.
- packages/data-designer-engine/src/data_designer/engine/models/clients/model_request_executor.py: concrete model-call attempt boundary and release outcome classification.
- packages/data-designer-engine/src/data_designer/engine/observability.py: scheduler/request event contracts used by benchmark evidence.
- reports/async-scheduling-epic-benchmark-report.html: high-level QA and live benchmark report.
🧪 Testing
- .venv/bin/ruff check packages scripts tests_e2e
- .venv/bin/ruff format --check packages scripts tests_e2e
- git diff --check
- make test as a single aggregate command was not rerun; equivalent package suites above passed
✅ Checklist
Notes
Raw live benchmark traces were left local because the full artifact tree is roughly 519 MB. This PR includes the condensed README/combined-summary artifacts and the standalone HTML report so reviewers can inspect the benchmark evidence without committing the full JSONL timelines.