Add SLO checks with a SQLAlchemy read/write workload #116
Open
vgvoleg wants to merge 17 commits into
Introduce a parallel read/write SLO workload built on the ydb_sqlalchemy dialect (SQLAlchemy Core and ORM modes) and wire it into ydb-slo-action via a label-gated GitHub workflow.
- tests/slo: workload runner, Dockerfile, entrypoint, requirements, README
- .github/workflows/slo.yml: build current+baseline images, run init@v2 and publish report@v2 on PRs labelled "SLO"
The dialect integration tests now live in tests/integration/ alongside tests/slo/, so the repo no longer has both a test/ and a tests/ directory. tox.ini (lint + dialect pytest paths) and setup.cfg (profile_file) are updated accordingly.
The dialect runs in AUTOCOMMIT, so each single-statement read/write already goes through the YDB SDK's retry_operation_sync inside ydb-dbapi. The workload now performs one attempt per operation and records any surfaced exception as a real SLO failure, instead of a broad app-level retry loop that masked non-retryable errors. Removes the now-unused timeout/max-retries flags.
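The one-attempt-per-operation accounting described above can be sketched in a few lines. This is a minimal illustration, not the workload's actual code: `run_once`, the `stats` counter, and the `"read"`/`"write"` labels are hypothetical names, and the two lambdas stand in for real dialect calls.

```python
import time
from collections import Counter
from typing import Callable


def run_once(op: Callable[[], object], kind: str, stats: Counter) -> float:
    """Run `op` exactly once, count the outcome, and return elapsed seconds."""
    started = time.perf_counter()
    try:
        op()
    except Exception:
        # No app-level retry: anything that surfaces here already went
        # through the driver's own retries, so it counts as a failure.
        stats[(kind, "error")] += 1
    else:
        stats[(kind, "success")] += 1
    return time.perf_counter() - started


def failing_write() -> None:
    # Stand-in for a write whose error the driver could not recover from.
    raise RuntimeError("non-retryable error")


stats: Counter = Counter()
read_latency = run_once(lambda: "row", "read", stats)
write_latency = run_once(failing_write, "write", stats)
```

Because the exception is swallowed only after it has been counted, a non-retryable error shows up in the failure total once instead of being hidden behind repeated attempts.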
Align workload_duration and read/write RPS with ydb-python-sdk's tests/slo workflow. extra-nodes stays disabled to fit a GitHub-hosted runner.
…cluster, 600s, 1000/100 rps) Run the workload job on the large-runner-sqlalchemy self-hosted runner with the full YDB cluster (all compose profiles), and align workload_duration and read/write RPS with ydb-python-sdk's tests/slo workflow. The report job stays on ubuntu-latest.
A curl (28) SSL connection timeout while downloading docker-compose failed the whole job. Wrap the yq/buildx/compose downloads in a retrying curl (--retry --retry-all-errors with connect/max timeouts) so a transient network blip retries instead of failing the run.
Mirror the ydb-python-sdk reference behaviour: SLO is opt-in per label, so drop the label once the run finishes (any conclusion) to avoid re-running the heavy suite on every push. Re-add the label to trigger another run.
Switch the trigger to pull_request: [labeled] and gate every job on github.event.label.name == 'SLO', so the suite runs exactly when the label is attached and not on ordinary pushes. Matches the django-ydb-backend setup.
pool_pre_ping added a SELECT 1 round-trip on every checkout (~+3ms/op, measured); ydb-dbapi already retries transient errors and re-acquires sessions, so it is unnecessary. ORM mode now uses a sessionmaker with session.get() for reads and session.add()+commit() for writes (a real unit-of-work INSERT) instead of a Core upsert through the session; ORM inserts use a random id to stay collision-free and avoid a hot last partition.
New 'shared' mode runs the same Core read/write path as 'core' but builds the engine with a single ydb.QuerySessionPool shared across all connections (connect_args={'ydb_session_pool': ...}), instead of every pooled DBAPI connection creating its own driver and session pool. Added as a third matrix entry to compare session-pool strategies. Local 12-thread read benchmark: ~+14% throughput and ~-30% p95/p99 vs per-connection pools.
The shared mode ran the same Core operations as core, only with a shared ydb.QuerySessionPool, so it added no distinct load signal; its benefit is resource efficiency at high connection counts, which this SLO does not exercise and cannot measure cleanly across separate matrix runners. The shared-pool feature stays covered by the dialect unit tests. Drops a third full-cluster run from every SLO invocation.
With no app-level retry, attempts is always 1, so sdk_retry_attempts_total just mirrored sdk_operations_total. The action's *_retry_attempts metric (increase(retry) - increase(ops)) was then pure PromQL noise around zero, and comparing that noise current-vs-baseline produced arbitrary swings (43.8% warning one run, 100% critical the next). Drop the counter so the metric is empty and no longer gates.
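A small worked example shows why comparing a near-zero derived metric between current and baseline is unstable. The numbers below are made up for illustration (they are not taken from any SLO run), and `derived_retry_metric` and `relative_change` are hypothetical helpers that only mirror the arithmetic described above.

```python
def derived_retry_metric(retry_increase: float, ops_increase: float) -> float:
    """increase(retry) - increase(ops), as described for the action's metric."""
    return retry_increase - ops_increase


def relative_change(current: float, baseline: float) -> float:
    """Percentage change of `current` relative to `baseline`."""
    return (current - baseline) / abs(baseline) * 100.0


# Both counters grow by the same true amount, so the true difference is 0.
# Assume the two increase() estimates differ by a fraction of an operation.
run_a_current = derived_retry_metric(60000.5, 60000.0)     # 0.5
run_a_baseline = derived_retry_metric(59999.75, 59999.5)   # 0.25
run_b_current = derived_retry_metric(60000.25, 60000.0)    # 0.25
run_b_baseline = derived_retry_metric(60000.5, 60000.0)    # 0.5

swing_a = relative_change(run_a_current, run_a_baseline)   # 100.0
swing_b = relative_change(run_b_current, run_b_baseline)   # -50.0
```

The absolute values stay below one operation in both runs, yet the relative comparison moves from +100% to -50%, which is the kind of arbitrary swing that makes the metric unsuitable as a gate.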
… failures" This reverts commit c215323.
Unwrap a SQLAlchemy DBAPIError to the underlying ydb.Error and run the operation under the SDK retry policy (retry_ydb_operation / _async, plus a retry_ydb decorator for sync and async functions). Parameters are max_retries and idempotent; no ydb objects are exposed. Lives in the sqlalchemy subpackage with unit tests and docs.
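The unwrap-then-retry idea can be sketched without any database dependency. This is not the helper added in this PR: `WrapperError` stands in for `sqlalchemy.exc.DBAPIError` (which keeps the driver exception in `.orig`), `TransientError` stands in for a retryable driver error, and a plain loop replaces the SDK retry policy. The `idempotent` flag is left out because its semantics belong to that policy.

```python
import functools
import time


class TransientError(Exception):
    """Hypothetical stand-in for a retryable driver-level error."""


class WrapperError(Exception):
    """Hypothetical stand-in for DBAPIError: driver error kept in `.orig`."""

    def __init__(self, orig: BaseException):
        super().__init__(str(orig))
        self.orig = orig


def unwrap(exc: BaseException) -> BaseException:
    """Follow `.orig` links down to the innermost exception."""
    while isinstance(getattr(exc, "orig", None), BaseException):
        exc = exc.orig
    return exc


def retry_transient(max_retries: int = 3, backoff: float = 0.0):
    """Retry the wrapped function while the unwrapped error is transient."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            retries = 0
            while True:
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    transient = isinstance(unwrap(exc), TransientError)
                    if not transient or retries >= max_retries:
                        raise
                    retries += 1
                    time.sleep(backoff * retries)
        return wrapper
    return decorator


calls = {"flaky": 0, "broken": 0}


@retry_transient(max_retries=3)
def flaky() -> str:
    calls["flaky"] += 1
    if calls["flaky"] < 3:
        raise WrapperError(TransientError("session busy"))
    return "ok"


@retry_transient(max_retries=3)
def broken() -> str:
    calls["broken"] += 1
    raise WrapperError(ValueError("bad query"))


result = flaky()
```

Classifying on the unwrapped error rather than the wrapper is the point of the sketch: the wrapper type is the same for transient and permanent failures, so only the inner exception says whether another attempt makes sense.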
New tx mode runs read-modify-write in a SERIALIZABLE interactive transaction wrapped in retry_ydb_operation (where ydb-dbapi does not retry on its own). core/orm stay autocommit with no app-level retry. The retry counter now emits retries (attempts beyond the first), so it is exactly 0 for autocommit instead of PromQL noise.
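The difference between counting attempts and counting retries can be shown with a toy read-modify-write. Everything here is hypothetical: `run_counting_retries` is not the PR's `retry_ydb_operation`, `Conflict` stands in for a serialization failure, and a dict plays the role of the table.

```python
class Conflict(Exception):
    """Hypothetical stand-in for a retryable transaction conflict."""


def run_counting_retries(op, is_retryable, max_retries: int = 5):
    """Run `op` until it succeeds; return (result, retries).

    `retries` counts attempts beyond the first, so a call that succeeds
    immediately reports 0.
    """
    retries = 0
    while True:
        try:
            return op(), retries
        except Exception as exc:
            if retries >= max_retries or not is_retryable(exc):
                raise
            retries += 1


store = {"k": 0}
simulated_conflicts = iter([True, True, False])  # first two attempts fail


def increment() -> int:
    value = store["k"]                 # read inside the "transaction"
    if next(simulated_conflicts):
        raise Conflict("simulated conflict before commit")
    store["k"] = value + 1             # write is applied only on success
    return store["k"]


def is_conflict(exc: Exception) -> bool:
    return isinstance(exc, Conflict)


tx_value, tx_retries = run_counting_retries(increment, is_conflict)
_, autocommit_retries = run_counting_retries(lambda: "row", is_conflict)
```

With this definition a single-attempt operation contributes exactly 0 to the counter, so the core/orm scenarios stay at zero by construction while the tx scenario reports only genuine re-executions.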
Switch to ydb-slo-action@feat/per-scenario-thresholds. Override the *_retry_attempts metric to count the retry counter directly (the new emission is retries, not total attempts). Make retry_attempts informational for the tx scenario only; core/orm keep the strict default (their value is structurally 0).
What
Adds SLO (Service Level Objective) testing on top of ydb-platform/ydb-slo-action, following the ydb-python-sdk SLO example but expressed entirely in terms of SQLAlchemy.

Workload (tests/slo/)
A parallel read/write load generator driving the ydb_sqlalchemy dialect:
- reads: SELECT ... WHERE object_id = :id for a random id;
- writes: UPSERT INTO ... VALUES (...) for a fresh id.

Two modes, selected by WORKLOAD_NAME / --mode:
- core: reads via Connection.execute(select()), writes via Connection.execute(upsert())
- orm: reads via Session.get(KeyValueRow, id), writes via Session.execute(upsert()) + commit

Metrics are emitted via OTLP with names matching the action's default metrics.yaml (sdk_operations_total, sdk_operation_latency_p{50,95,99}_seconds, sdk_retry_attempts_total, ...).

Workflow (.github/workflows/slo.yml)
Runs on PRs labelled SLO:
- builds current (PR) and baseline (merge-base) workload images;
- runs ydb-slo-action/init@v2 for the core and orm workloads in parallel;
- publishes ydb-slo-action/report@v2 and gates the PR on regressions.

The cluster is trimmed to fit a GitHub-hosted runner via disable_compose_profiles: extra-nodes (chaos and telemetry stay enabled).

How to run
Label this PR with SLO to trigger the checks. Locally:

Notes
- The report job needs pull-requests: write, which same-repo PRs have. For fork PRs the report can be moved to a separate workflow_run-triggered workflow.
- tests/slo/ is outside the existing test/ lint scope, so it doesn't affect the style/tests workflows.