Add SLO checks with a SQLAlchemy read/write workload by vgvoleg · Pull Request #116 · ydb-platform/ydb-sqlalchemy

vgvoleg · 2026-06-17T09:45:22Z

What

Adds SLO (Service Level Objective) testing on top of ydb-platform/ydb-slo-action, following the ydb-python-sdk SLO example but expressed entirely in terms of SQLAlchemy.

Workload (`tests/slo/`)

A parallel read/write load generator driving the ydb_sqlalchemy dialect:

- read: SELECT ... WHERE object_id = :id for a random id;
- write: UPSERT INTO ... VALUES (...) for a fresh id;
- dedicated reader/writer thread pools plus a metrics thread;
- every operation is wrapped in an idempotent retry loop, so transient errors injected by the action's chaos layer become latency rather than availability drops.
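
The reader/writer pool layout above can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory dict stands in for the YDB table, the names `store`, `read_op`, `write_op` and `worker` are hypothetical, and fixed operation counts replace the real rate- and time-bounded loop.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for the key-value table (assumption, not the real schema).
store = {i: f"payload-{i}" for i in range(100)}
store_lock = threading.Lock()
next_id = [len(store)]
counters = {"read": 0, "write": 0}


def read_op():
    """Read a random existing id, like SELECT ... WHERE object_id = :id."""
    key = random.randrange(100)
    with store_lock:
        _ = store.get(key)
        counters["read"] += 1


def write_op():
    """Write a fresh id, like UPSERT INTO ... VALUES (...)."""
    with store_lock:
        new_id = next_id[0]
        next_id[0] += 1
        store[new_id] = f"payload-{new_id}"
        counters["write"] += 1


def worker(op, count):
    """Run one operation `count` times on the calling pool thread."""
    for _ in range(count):
        op()


# Separate pools so slow writes cannot starve reads (and vice versa).
with ThreadPoolExecutor(max_workers=4) as readers, \
        ThreadPoolExecutor(max_workers=2) as writers:
    jobs = [readers.submit(worker, read_op, 50) for _ in range(4)]
    jobs += [writers.submit(worker, write_op, 10) for _ in range(2)]
    for job in jobs:
        job.result()  # re-raises any worker exception
```

Keeping the pools separate means the read and write rates can be tuned independently, which is what the read/write RPS settings in the workflow rely on.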

Two modes, selected by WORKLOAD_NAME / --mode:

| mode | read | write |
|---|---|---|
| `core` | `Connection.execute(select())` | `Connection.execute(upsert())` |
| `orm` | `Session.get(KeyValueRow, id)` | `Session.execute(upsert())` + commit |

Metrics are emitted via OTLP with names matching the action's default metrics.yaml (sdk_operations_total, sdk_operation_latency_p{50,95,99}_seconds, sdk_retry_attempts_total, ...).

Workflow (`.github/workflows/slo.yml`)

Runs on PRs labelled SLO:

- builds current (PR) and baseline (merge-base) workload images;
- runs ydb-slo-action/init@v2 for the core and orm workloads in parallel;
- publishes a current-vs-baseline comparison with ydb-slo-action/report@v2 and gates the PR on regressions.

The cluster is trimmed to fit a GitHub-hosted runner via disable_compose_profiles: extra-nodes (chaos and telemetry stay enabled).

How to run

Label this PR with SLO to trigger the checks. Locally:

python ./tests/slo/src create grpc://localhost:2136 /local --mode core
python ./tests/slo/src run    grpc://localhost:2136 /local --mode core --time 60

Notes

The in-run report job needs pull-requests: write, which same-repo PRs have. For fork PRs the report can be moved to a separate workflow_run-triggered workflow.
The workload source under tests/slo/ is outside the existing test/ lint scope, so it doesn't affect the style/tests workflows.

Introduce a parallel read/write SLO workload built on the ydb_sqlalchemy dialect (SQLAlchemy Core and ORM modes) and wire it into ydb-slo-action via a label-gated GitHub workflow.

- tests/slo: workload runner, Dockerfile, entrypoint, requirements, README
- .github/workflows/slo.yml: build current+baseline images, run init@v2 and publish report@v2 on PRs labelled "SLO"

The dialect integration tests now live in tests/integration/ alongside tests/slo/, so the repo no longer has both a test/ and a tests/ directory. tox.ini (lint + dialect pytest paths) and setup.cfg (profile_file) are updated accordingly.

The dialect runs in AUTOCOMMIT, so each single-statement read/write already goes through the YDB SDK's retry_operation_sync inside ydb-dbapi. The workload now performs one attempt per operation and records any surfaced exception as a real SLO failure, instead of a broad app-level retry loop that masked non-retryable errors. Removes the now-unused timeout/max-retries flags.
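
A minimal sketch of the "one attempt, record what surfaces" approach described above. The `OpStats` class and `run_once` helper are hypothetical stand-ins for the workload's OTLP counters and operation wrapper, not the code in this PR.

```python
import time
from dataclasses import dataclass, field


@dataclass
class OpStats:
    """In-memory stand-in for the workload's metrics (hypothetical)."""
    ok: int = 0
    failed: int = 0
    latencies: list = field(default_factory=list)


def run_once(operation, stats):
    """Run `operation` exactly once and record the outcome.

    There is no app-level retry here: an exception that reaches this
    point has already survived the driver's own retrying, so it is
    counted as a real failure instead of being hidden.
    """
    started = time.perf_counter()
    try:
        result = operation()
    except Exception:
        stats.failed += 1
        return None
    else:
        stats.ok += 1
        return result
    finally:
        # Latency is recorded for both successes and failures.
        stats.latencies.append(time.perf_counter() - started)


def _failing_read():
    raise RuntimeError("simulated non-retryable error")


stats = OpStats()
run_once(lambda: "row", stats)
run_once(_failing_read, stats)
```

With this shape, availability is simply `ok / (ok + failed)`, and nothing the driver could not recover from is silently retried away.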

Align workload_duration and read/write RPS with ydb-python-sdk's tests/slo workflow. extra-nodes stays disabled to fit a GitHub-hosted runner.

…cluster, 600s, 1000/100 rps) Run the workload job on the large-runner-sqlalchemy self-hosted runner with the full YDB cluster (all compose profiles), and align workload_duration and read/write RPS with ydb-python-sdk's tests/slo workflow. The report job stays on ubuntu-latest.

A curl (28) SSL connection timeout while downloading docker-compose failed the whole job. Wrap the yq/buildx/compose downloads in a retrying curl (--retry --retry-all-errors with connect/max timeouts) so a transient network blip retries instead of failing the run.

Mirror the ydb-python-sdk reference behaviour: SLO is opt-in per label, so drop the label once the run finishes (any conclusion) to avoid re-running the heavy suite on every push. Re-add the label to trigger another run.

Switch the trigger to pull_request: [labeled] and gate every job on github.event.label.name == 'SLO', so the suite runs exactly when the label is attached and not on ordinary pushes. Matches the django-ydb-backend setup.

pool_pre_ping added a SELECT 1 round-trip on every checkout (~+3ms/op, measured); ydb-dbapi already retries transient errors and re-acquires sessions, so it is unnecessary. ORM mode now uses a sessionmaker with session.get() for reads and session.add()+commit() for writes (a real unit-of-work INSERT) instead of a Core upsert through the session; ORM inserts use a random id to stay collision-free and avoid a hot last partition.

New 'shared' mode runs the same Core read/write path as 'core' but builds the engine with a single ydb.QuerySessionPool shared across all connections (connect_args={'ydb_session_pool': ...}), instead of every pooled DBAPI connection creating its own driver and session pool. Added as a third matrix entry to compare session-pool strategies. Local 12-thread read benchmark: ~+14% throughput and ~-30% p95/p99 vs per-connection pools.

The shared mode ran the same Core operations as core, only with a shared ydb.QuerySessionPool, so it added no distinct load signal; its benefit is resource efficiency at high connection counts, which this SLO does not exercise and cannot measure cleanly across separate matrix runners. The shared-pool feature stays covered by the dialect unit tests. Drops a third full-cluster run from every SLO invocation.

With no app-level retry, attempts is always 1, so sdk_retry_attempts_total just mirrored sdk_operations_total. The action's *_retry_attempts metric (increase(retry) - increase(ops)) was then pure PromQL noise around zero, and comparing that noise current-vs-baseline produced arbitrary swings (43.8% warning one run, 100% critical the next). Drop the counter so the metric is empty and no longer gates.
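
The swings described above are what a relative comparison does to a value that is structurally zero. A small arithmetic sketch, assuming the current-vs-baseline comparison is a plain percentage change; the residual values are hypothetical and chosen only to show the effect, they are not measurements from these runs.

```python
def relative_change_pct(current, baseline):
    """Percent change of `current` against `baseline`."""
    if baseline == 0:
        raise ZeroDivisionError("relative change is undefined for a zero baseline")
    return (current - baseline) / abs(baseline) * 100.0


# Hypothetical residuals of increase(retry) - increase(ops): both are
# "zero plus noise", yet the relative change between them is large.
baseline_residual = 0.016
run_a = relative_change_pct(0.023, baseline_residual)  # about 43.75 percent
run_b = relative_change_pct(0.032, baseline_residual)  # about 100 percent
```

Both runs differ from the baseline by a few hundredths of an operation, but the percentage is dominated by the tiny denominator, so the threshold result carries no signal.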

… failures" This reverts commit c215323.

Unwrap a SQLAlchemy DBAPIError to the underlying ydb.Error and run the operation under the SDK retry policy (retry_ydb_operation / _async, plus a retry_ydb decorator for sync and async functions). Parameters are max_retries and idempotent; no ydb objects are exposed. Lives in the sqlalchemy subpackage with unit tests and docs.
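
A rough sketch of the unwrap-then-retry idea, using stand-in classes so it runs without ydb or SQLAlchemy installed. `FakeYdbError`, `FakeDBAPIError`, `unwrap` and `retry_operation_sketch` are hypothetical names for illustration; the only fact borrowed from SQLAlchemy is that `DBAPIError` keeps the driver exception in `.orig`. The actual helper in this PR delegates to the SDK retry policy, which this sketch does not reproduce.

```python
import time


class FakeYdbError(Exception):
    """Hypothetical stand-in for ydb.Error."""

    def __init__(self, message, retryable=True, only_if_idempotent=False):
        super().__init__(message)
        self.retryable = retryable
        self.only_if_idempotent = only_if_idempotent


class FakeDBAPIError(Exception):
    """Hypothetical stand-in for sqlalchemy.exc.DBAPIError (has `.orig`)."""

    def __init__(self, orig):
        super().__init__(str(orig))
        self.orig = orig


def unwrap(exc):
    """Follow `.orig` / `__cause__` links to a FakeYdbError, else None."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, FakeYdbError):
            return exc
        seen.add(id(exc))
        exc = getattr(exc, "orig", None) or exc.__cause__
    return None


def retry_operation_sketch(fn, max_retries=3, idempotent=False, backoff=0.0):
    """Call fn(), retrying only errors that unwrap to a retryable error."""
    retries = 0
    while True:
        try:
            return fn()
        except Exception as exc:
            inner = unwrap(exc)
            can_retry = (
                inner is not None
                and inner.retryable
                and (idempotent or not inner.only_if_idempotent)
            )
            if not can_retry or retries >= max_retries:
                raise
            retries += 1
            time.sleep(backoff * retries)


calls = {"n": 0}


def flaky_write():
    """Fail twice with a wrapped retryable error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise FakeDBAPIError(FakeYdbError("session busy"))
    return "ok"


result = retry_operation_sketch(flaky_write, max_retries=5)
```

Unwrapping first matters because the retry decision depends on the driver-level error class, which the SQLAlchemy wrapper would otherwise hide.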

New tx mode runs read-modify-write in a SERIALIZABLE interactive transaction wrapped in retry_ydb_operation (where ydb-dbapi does not retry on its own). core/orm stay autocommit with no app-level retry. The retry counter now emits retries (attempts beyond the first), so it is exactly 0 for autocommit instead of PromQL noise.
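
To make the counter semantics concrete, here is a small sketch of counting retries as attempts beyond the first. `TransientConflict`, `call_counting_retries` and `make_flaky` are hypothetical names; the real workload counts retries around retry_ydb_operation rather than in a hand-written loop.

```python
class TransientConflict(Exception):
    """Hypothetical retryable error, e.g. a serialization conflict."""


def call_counting_retries(fn, max_retries=3):
    """Return (result, retries), where retries = attempts beyond the first."""
    retries = 0
    while True:
        try:
            return fn(), retries
        except TransientConflict:
            if retries >= max_retries:
                raise
            retries += 1


def make_flaky(failures):
    """Build a callable that raises `failures` times, then succeeds."""
    state = {"left": failures}

    def op():
        if state["left"] > 0:
            state["left"] -= 1
            raise TransientConflict("transaction aborted, retry")
        return "committed"

    return op


# An operation that never fails reports exactly 0 retries.
autocommit_result, autocommit_retries = call_counting_retries(make_flaky(0))
# A transaction that conflicts twice reports 2 retries (3 attempts total).
tx_result, tx_retries = call_counting_retries(make_flaky(2))
```

Emitting retries instead of total attempts keeps the counter at exactly zero for paths that never retry, so there is nothing to subtract and no residual noise to compare.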

Switch to ydb-slo-action@feat/per-scenario-thresholds. Override the *_retry_attempts metric to count the retry counter directly (the new emission is retries, not total attempts). Make retry_attempts informational for the tx scenario only; core/orm keep the strict default (their value is structurally 0).

github-actions · 2026-06-19T11:12:25Z

🌋 SLO Test Results

🟢 3 workload(s) tested — All thresholds passed

Commit: 4b80edf · View run

| Workload | Thresholds | Duration | Report |
|---|---|---|---|
| tx | 🟢 OK | 10m 5s | 📄 Report |
| orm | 🟢 OK | 10m 5s | 📄 Report |
| core | 🟢 OK | 10m 5s | 📄 Report |

Generated by ydb-slo-action

vgvoleg added the SLO Run SLO checks label Jun 17, 2026

vgvoleg added 7 commits June 17, 2026 12:48

SLO: create baseline tests/ parent dir before copying runner

5e90c6b

SLO: match the python-sdk reference config (600s, 1000/100 rps)

86f8e29

Align workload_duration and read/write RPS with ydb-python-sdk's tests/slo workflow. extra-nodes stays disabled to fit a GitHub-hosted runner.

SLO: remove the SLO label after a run

c2bf6fd

Mirror the ydb-python-sdk reference behaviour: SLO is opt-in per label, so drop the label once the run finishes (any conclusion) to avoid re-running the heavy suite on every push. Re-add the label to trigger another run.

github-actions Bot removed the SLO Run SLO checks label Jun 18, 2026

vgvoleg added 2 commits June 18, 2026 15:16

SLO: trigger only when the SLO label is added

4ab593b

Switch the trigger to pull_request: [labeled] and gate every job on github.event.label.name == 'SLO', so the suite runs exactly when the label is attached and not on ordinary pushes. Matches the django-ydb-backend setup.

vgvoleg added the SLO Run SLO checks label Jun 18, 2026

vgvoleg mentioned this pull request Jun 18, 2026

ci(slo): align SLO workflow with the canonical ydb-python-sdk setup ydb-platform/django-ydb-backend#109

Merged

github-actions Bot removed the SLO Run SLO checks label Jun 18, 2026

vgvoleg added the SLO Run SLO checks label Jun 18, 2026

github-actions Bot removed the SLO Run SLO checks label Jun 18, 2026

vgvoleg added the SLO Run SLO checks label Jun 18, 2026

github-actions Bot removed the SLO Run SLO checks label Jun 18, 2026

vgvoleg added the SLO Run SLO checks label Jun 19, 2026

github-actions Bot removed the SLO Run SLO checks label Jun 19, 2026

vgvoleg added the SLO Run SLO checks label Jun 19, 2026

Revert "SLO: stop emitting retry_attempts to avoid spurious threshold…

b1fe4b0

… failures" This reverts commit c215323.

vgvoleg removed the SLO Run SLO checks label Jun 19, 2026

vgvoleg added 3 commits June 19, 2026 13:49

vgvoleg added the SLO Run SLO checks label Jun 19, 2026

github-actions Bot removed the SLO Run SLO checks label Jun 19, 2026

vgvoleg wants to merge 17 commits into main from add-slo-checks
