
Add SLO checks with a SQLAlchemy read/write workload#116

Open
vgvoleg wants to merge 17 commits into main from add-slo-checks



@vgvoleg vgvoleg commented Jun 17, 2026


What

Adds SLO (Service Level Objective) testing on top of ydb-platform/ydb-slo-action, following the ydb-python-sdk SLO example but expressed entirely in terms of SQLAlchemy.

Workload (tests/slo/)

A parallel read/write load generator driving the ydb_sqlalchemy dialect:

  • read: SELECT ... WHERE object_id = :id for a random id;
  • write: UPSERT INTO ... VALUES (...) for a fresh id;
  • dedicated reader/writer thread pools plus a metrics thread;
  • every operation is wrapped in an idempotent retry loop, so transient errors injected by the action's chaos layer become latency rather than availability drops.
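
The pool layout described above can be sketched roughly as follows. This is a minimal illustration, not the PR's workload code: the dict stands in for the YDB table, `run_workload` and its parameters are hypothetical names, and the metrics thread is left out.

```python
import itertools
import random
import threading
from concurrent.futures import ThreadPoolExecutor


def run_workload(reads, writes, read_threads=4, write_threads=2):
    """Run a fixed number of reads and writes on dedicated thread pools."""
    store = {}                      # in-memory stand-in for the key-value table
    lock = threading.Lock()         # guards the store, the counters and the id source
    counts = {"read": 0, "write": 0}
    fresh_ids = itertools.count()   # every write gets a fresh id

    def write_one():
        with lock:
            object_id = next(fresh_ids)
            store[object_id] = f"payload-{object_id}"
            counts["write"] += 1

    def read_one():
        with lock:
            # Look up a random id; a miss still counts as a completed read.
            object_id = random.randrange(max(len(store), 1))
            store.get(object_id)
            counts["read"] += 1

    # Separate pools, so a backlog of writes cannot occupy the reader threads.
    with ThreadPoolExecutor(read_threads) as readers, \
            ThreadPoolExecutor(write_threads) as writers:
        futures = [writers.submit(write_one) for _ in range(writes)]
        futures += [readers.submit(read_one) for _ in range(reads)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return counts


if __name__ == "__main__":
    print(run_workload(reads=100, writes=20))
```

A real load generator would pace submissions to a target RPS and run for a duration rather than a fixed operation count.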

Two modes, selected by WORKLOAD_NAME / --mode:

| mode | read | write |
| --- | --- | --- |
| core | Connection.execute(select()) | Connection.execute(upsert()) |
| orm | Session.get(KeyValueRow, id) | Session.execute(upsert()) + commit |

Metrics are emitted via OTLP with names matching the action's default metrics.yaml (sdk_operations_total, sdk_operation_latency_p{50,95,99}_seconds, sdk_retry_attempts_total, ...).
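
As an illustration of what the p50/p95/p99 latency gauges represent, here is a small sketch that derives them from a window of latency samples. The nearest-rank method and the helper names are assumptions; the PR does not say how the workload computes percentiles.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def latency_gauges(samples):
    """Map a window of latencies (in seconds) to the three gauge names."""
    return {
        f"sdk_operation_latency_p{q}_seconds": percentile(samples, q)
        for q in (50, 95, 99)
    }


if __name__ == "__main__":
    window = [i / 1000 for i in range(1, 101)]  # 1 ms .. 100 ms
    for name, value in latency_gauges(window).items():
        print(name, value)
```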

Workflow (.github/workflows/slo.yml)

Runs on PRs labelled SLO:

  1. builds current (PR) and baseline (merge-base) workload images;
  2. runs ydb-slo-action/init@v2 for the core and orm workloads in parallel;
  3. publishes a current-vs-baseline comparison with ydb-slo-action/report@v2 and gates the PR on regressions.

The cluster is trimmed to fit a GitHub-hosted runner via disable_compose_profiles: extra-nodes (chaos and telemetry stay enabled).

How to run

Label this PR with SLO to trigger the checks. Locally:

python ./tests/slo/src create grpc://localhost:2136 /local --mode core
python ./tests/slo/src run    grpc://localhost:2136 /local --mode core --time 60
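
For orientation, a command line of this shape could be parsed with argparse along the following lines. This is only a sketch inferred from the two commands above: the defaults, the `db` argument name, and the `build_parser` helper are assumptions, not the workload's actual parser.

```python
import argparse


def build_parser():
    """Parser for `create|run <endpoint> <db> --mode ... [--time ...]`."""
    parser = argparse.ArgumentParser(prog="slo")
    commands = parser.add_subparsers(dest="command", required=True)
    for name in ("create", "run"):
        command = commands.add_parser(name)
        command.add_argument("endpoint")   # e.g. grpc://localhost:2136
        command.add_argument("db")         # e.g. /local
        # Assumed choices and default; the PR body lists the core and orm modes.
        command.add_argument("--mode", choices=("core", "orm"), default="core")
        if name == "run":
            # Assumed to be the run duration in seconds.
            command.add_argument("--time", type=int, default=60)
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```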

Notes

  • The in-run report job needs pull-requests: write, which same-repo PRs have. For fork PRs the report can be moved to a separate workflow_run-triggered workflow.
  • The workload source under tests/slo/ is outside the existing test/ lint scope, so it doesn't affect the style/tests workflows.

Introduce a parallel read/write SLO workload built on the ydb_sqlalchemy dialect (SQLAlchemy Core and ORM modes) and wire it into ydb-slo-action via a label-gated GitHub workflow.

- tests/slo: workload runner, Dockerfile, entrypoint, requirements, README
- .github/workflows/slo.yml: build current+baseline images, run init@v2 and publish report@v2 on PRs labelled "SLO"
@vgvoleg vgvoleg added the SLO Run SLO checks label Jun 17, 2026
vgvoleg added 7 commits June 17, 2026 12:48
The dialect integration tests now live in tests/integration/ alongside tests/slo/, so the repo no longer has both a test/ and a tests/ directory. tox.ini (lint + dialect pytest paths) and setup.cfg (profile_file) are updated accordingly.
The dialect runs in AUTOCOMMIT, so each single-statement read/write already goes through the YDB SDK's retry_operation_sync inside ydb-dbapi. The workload now performs one attempt per operation and records any surfaced exception as a real SLO failure, instead of a broad app-level retry loop that masked non-retryable errors. Removes the now-unused timeout/max-retries flags.
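
The one-attempt-per-operation accounting described in this commit might look roughly like the sketch below. `run_once`, the counter keys, and the latency list are hypothetical names used for illustration only.

```python
import time
from collections import Counter


def run_once(operation, kind, counters, latencies):
    """Run an operation exactly once and record its outcome and latency."""
    started = time.perf_counter()
    try:
        operation()
    except Exception:
        # No app-level retry: an exception that surfaces here has already been
        # through the driver-level retries, so it counts as a real failure.
        counters[(kind, "error")] += 1
        return False
    else:
        counters[(kind, "success")] += 1
        return True
    finally:
        latencies.append(time.perf_counter() - started)


if __name__ == "__main__":
    counters, latencies = Counter(), []
    run_once(lambda: None, "read", counters, latencies)
    print(dict(counters), len(latencies))
```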
Align workload_duration and read/write RPS with ydb-python-sdk's tests/slo workflow. extra-nodes stays disabled to fit a GitHub-hosted runner.
…cluster, 600s, 1000/100 rps)

Run the workload job on the large-runner-sqlalchemy self-hosted runner with the full YDB cluster (all compose profiles), and align workload_duration and read/write RPS with ydb-python-sdk's tests/slo workflow. The report job stays on ubuntu-latest.
A curl (28) SSL connection timeout while downloading docker-compose failed the whole job. Wrap the yq/buildx/compose downloads in a retrying curl (--retry --retry-all-errors with connect/max timeouts) so a transient network blip retries instead of failing the run.
Mirror the ydb-python-sdk reference behaviour: SLO is opt-in per label, so drop the label once the run finishes (any conclusion) to avoid re-running the heavy suite on every push. Re-add the label to trigger another run.
@github-actions github-actions Bot removed the SLO Run SLO checks label Jun 18, 2026
vgvoleg added 2 commits June 18, 2026 15:16
Switch the trigger to pull_request: [labeled] and gate every job on github.event.label.name == 'SLO', so the suite runs exactly when the label is attached and not on ordinary pushes. Matches the django-ydb-backend setup.
pool_pre_ping added a SELECT 1 round-trip on every checkout (~+3ms/op, measured); ydb-dbapi already retries transient errors and re-acquires sessions, so it is unnecessary. ORM mode now uses a sessionmaker with session.get() for reads and session.add()+commit() for writes (a real unit-of-work INSERT) instead of a Core upsert through the session; ORM inserts use a random id to stay collision-free and avoid a hot last partition.
New 'shared' mode runs the same Core read/write path as 'core' but builds the engine with a single ydb.QuerySessionPool shared across all connections (connect_args={'ydb_session_pool': ...}), instead of every pooled DBAPI connection creating its own driver and session pool. Added as a third matrix entry to compare session-pool strategies. Local 12-thread read benchmark: ~+14% throughput and ~-30% p95/p99 vs per-connection pools.
@vgvoleg vgvoleg added the SLO Run SLO checks label Jun 18, 2026
@github-actions github-actions Bot removed the SLO Run SLO checks label Jun 18, 2026
@vgvoleg vgvoleg added the SLO Run SLO checks label Jun 18, 2026
@github-actions github-actions Bot removed the SLO Run SLO checks label Jun 18, 2026
The shared mode ran the same Core operations as core, only with a shared ydb.QuerySessionPool, so it added no distinct load signal; its benefit is resource efficiency at high connection counts, which this SLO does not exercise and cannot measure cleanly across separate matrix runners. The shared-pool feature stays covered by the dialect unit tests. Drops a third full-cluster run from every SLO invocation.
@vgvoleg vgvoleg added the SLO Run SLO checks label Jun 19, 2026
@github-actions github-actions Bot removed the SLO Run SLO checks label Jun 19, 2026
With no app-level retry, attempts is always 1, so sdk_retry_attempts_total just mirrored sdk_operations_total. The action's *_retry_attempts metric (increase(retry) - increase(ops)) was then pure PromQL noise around zero, and comparing that noise current-vs-baseline produced arbitrary swings (43.8% warning one run, 100% critical the next). Drop the counter so the metric is empty and no longer gates.
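
A short arithmetic sketch of why comparing near-zero noise produces large swings. The input values are invented for illustration and are not measurements from these runs.

```python
def relative_change_percent(baseline, current):
    """Percentage change of current relative to baseline."""
    if baseline == 0:
        raise ValueError("relative change is undefined for a zero baseline")
    return (current - baseline) / abs(baseline) * 100


if __name__ == "__main__":
    # The same absolute wobble of 0.007 is negligible on a large value...
    print(round(relative_change_percent(1000.0, 1000.007), 4))
    # ...but dominates when the metric itself is only noise around zero.
    print(round(relative_change_percent(0.016, 0.009), 2))
    print(round(relative_change_percent(0.016, 0.032), 2))
```

When the baseline is close to zero, any residual noise is amplified into a large relative change, so a percentage threshold on such a metric cannot gate meaningfully.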
@vgvoleg vgvoleg added the SLO Run SLO checks label Jun 19, 2026
@vgvoleg vgvoleg removed the SLO Run SLO checks label Jun 19, 2026
vgvoleg added 3 commits June 19, 2026 13:49
Unwrap a SQLAlchemy DBAPIError to the underlying ydb.Error and run the operation under the SDK retry policy (retry_ydb_operation / _async, plus a retry_ydb decorator for sync and async functions). Parameters are max_retries and idempotent; no ydb objects are exposed. Lives in the sqlalchemy subpackage with unit tests and docs.
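
The unwrapping step can be pictured with the sketch below. It is not the helper added in this PR: `find_cause` and both exception classes are hypothetical stand-ins, relying only on the fact that SQLAlchemy's DBAPIError keeps the driver exception in its `orig` attribute.

```python
class FakeYdbError(Exception):
    """Stand-in for ydb.Error."""


class FakeDBAPIError(Exception):
    """Stand-in for sqlalchemy.exc.DBAPIError, which stores the driver error in .orig."""

    def __init__(self, message, orig):
        super().__init__(message)
        self.orig = orig


def find_cause(exc, target_type):
    """Return the first exception of target_type in the wrapper chain, or None."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, target_type):
            return exc
        seen.add(id(exc))  # protect against cyclic exception chains
        # Follow .orig first, then the standard exception chaining attributes.
        exc = getattr(exc, "orig", None) or exc.__cause__ or exc.__context__
    return None


if __name__ == "__main__":
    inner = FakeYdbError("unavailable")
    outer = FakeDBAPIError("wrapped by the ORM layer", inner)
    print(find_cause(outer, FakeYdbError) is inner)
```

Once the underlying error is found, a retry policy can decide from its type whether the operation is worth repeating.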
New tx mode runs read-modify-write in a SERIALIZABLE interactive transaction wrapped in retry_ydb_operation (where ydb-dbapi does not retry on its own). core/orm stay autocommit with no app-level retry. The retry counter now emits retries (attempts beyond the first), so it is exactly 0 for autocommit instead of PromQL noise.
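
The retries-versus-attempts distinction can be sketched as follows. This is an illustration under stated assumptions: `TransientError` and `run_with_retries` are hypothetical, the dict stands in for the table, and backoff is omitted. Each attempt re-runs the whole read-modify-write, as a restarted transaction would.

```python
class TransientError(Exception):
    """Stand-in for a retryable error such as a serialization conflict."""


def run_with_retries(operation, max_retries=3):
    """Call operation until it succeeds; return (result, retries beyond the first attempt)."""
    retries = 0
    while True:
        try:
            return operation(), retries
        except TransientError:
            if retries >= max_retries:
                raise
            retries += 1


def make_flaky_increment(table, key, failures):
    """Build a read-modify-write that fails `failures` times before it commits."""
    remaining = {"failures": failures}

    def increment():
        value = table.get(key, 0)          # read
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise TransientError("conflict, transaction must restart")
        table[key] = value + 1             # modify and write
        return table[key]

    return increment


if __name__ == "__main__":
    table = {"counter": 41}
    print(run_with_retries(make_flaky_increment(table, "counter", failures=2)))
```

An operation that succeeds on its first attempt reports zero retries, which is why the counter stays at exactly 0 for the autocommit modes.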
Switch to ydb-slo-action@feat/per-scenario-thresholds. Override the *_retry_attempts metric to count the retry counter directly (the new emission is retries, not total attempts). Make retry_attempts informational for the tx scenario only; core/orm keep the strict default (their value is structurally 0).
@vgvoleg vgvoleg added the SLO Run SLO checks label Jun 19, 2026
@github-actions


🌋 SLO Test Results

🟢 3 workload(s) tested — All thresholds passed

Commit: 4b80edf · View run

| Workload | Thresholds | Duration | Report |
| --- | --- | --- | --- |
| tx | 🟢 OK | 10m 5s | 📄 Report |
| orm | 🟢 OK | 10m 5s | 📄 Report |
| core | 🟢 OK | 10m 5s | 📄 Report |

Generated by ydb-slo-action

@github-actions github-actions Bot removed the SLO Run SLO checks label Jun 19, 2026