Skip to content

Add read-replica routing for Perfherder read-only endpoints#9528

Open
camd wants to merge 9 commits into
masterfrom
camd/read-replica-queries
Open

Add read-replica routing for Perfherder read-only endpoints#9528
camd wants to merge 9 commits into
masterfrom
camd/read-replica-queries

Conversation

@camd
Copy link
Copy Markdown
Collaborator

@camd camd commented May 18, 2026

Summary

Routes 11 read-only Perfherder GET endpoints through a separate read_replica PostgreSQL connection to relieve write-side contention on the primary database. Gated behind two env vars (READ_REPLICA_DATABASE_URL + READ_REPLICA_ENABLED) so it ships as a no-op until ops flips the switch.

  • New ReadReplicaRouter (Django DATABASE_ROUTERS) routes reads to the replica only when a thread-local flag is set and the model's app label is in an explicit allow-list ({"perf", "model"}).
  • New ReadReplicaMixin flips that flag for GET/HEAD/OPTIONS only, on a one-shot retry against primary if the replica raises OperationalError/InterfaceError, and emits a db_routing_fallback warning log. The mixin is a true no-op when the alias isn't configured.
  • Applied to 11 read-only viewsets in treeherder/webapp/api/performance_data.py. The 3 mutating viewsets (PerformanceAlertSummaryViewSet, PerformanceAlertViewSet, PerformanceTagViewSet) are intentionally left on primary so create→read flows stay consistent.

Design doc: .claude/plans/READ_REPLICA_DESIGN.md (local). Mechanism is reusable — applying it to Logviewer / Intermittent Failures View later is a one-line viewset change each.

Test plan

  • Unit tests for router contract (allow-list, read/write/relation/migrate behavior) — 7 tests.
  • Unit tests for mixin (safe-method flip, no-flip on POST, state cleared on exception, one-shot retry, fallback log, no-op when alias absent) — 6 tests.
  • Integration tests with CaptureQueriesContext asserting the replica alias actually serves opted-in endpoints and does NOT serve excluded ones — 4 tests.
  • Full backend suite: 934 passed, 2 skipped, 10 xfailed — no regressions vs master.
  • Ops: set READ_REPLICA_DATABASE_URL in stage; flip READ_REPLICA_ENABLED=true; soak; watch db_routing_fallback log volume + primary DB CPU; then prod.
  • Kill switch: set READ_REPLICA_ENABLED=false (one env-var flip) and verify all routed endpoints continue returning correct data via primary.

🤖 Generated with Claude Code

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 18, 2026

Codecov Report

❌ Patch coverage is 96.95652% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 83.02%. Comparing base (6e680ff) to head (4eeff5a).
⚠️ Report is 2 commits behind head on master.

Files with missing lines Patch % Lines
treeherder/config/db_routing.py 92.30% 4 Missing ⚠️
treeherder/config/settings.py 60.00% 2 Missing ⚠️
tests/config/test_db_routing.py 99.20% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9528      +/-   ##
==========================================
+ Coverage   82.94%   83.02%   +0.08%     
==========================================
  Files         613      616       +3     
  Lines       35372    35588     +216     
  Branches     3208     3278      +70     
==========================================
+ Hits        29338    29547     +209     
+ Misses       5880     5672     -208     
- Partials      154      369     +215     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Comment thread treeherder/config/db_routing.py Outdated
into reading from the ``read_replica`` database alias. The router only routes
models whose Django app label is in :data:`READ_REPLICA_APP_ALLOW_LIST`.

Design: .claude/plans/READ_REPLICA_DESIGN.md
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment thread docs/accessing_data.md Outdated
The router and mixin are then active in dev. Routing is a no-op unless an
endpoint opts in via `ReadReplicaMixin`; see
`treeherder/config/db_routing.py` and the design at
`.claude/plans/READ_REPLICA_DESIGN.md`.
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed

Comment thread tests/config/test_db_routing.py Outdated
_RecordingView.raise_on_call = 0
_RecordingView.call_count = 0
_RecordingView.saw_use_replica = []
yield
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the benefit of this?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. That yield is a no-op. Removed.

camd and others added 8 commits May 23, 2026 10:44
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Routes 10 read-only Perfherder viewsets to the read replica (PerformanceSignatureViewSet,
PerformancePlatformViewSet, PerformanceJobViewSet, PerformanceDatumViewSet,
PerformanceBugTemplateViewSet, PerformanceIssueTrackerViewSet, PerformanceSummary,
PerformanceAlertSummaryTasks, PerfCompareResults, TestSuiteHealthViewSet).
Adds integration tests for signature list and summary endpoints. Updates
test_perfcompare_api.py, test_performance_data_api.py, and
test_performance_bug_template_api.py to declare read_replica database access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two review fixes:

1. When the kill switch is off (READ_REPLICA_ENABLED=false or
   READ_REPLICA_DATABASE_URL unset), the mixin previously still wrapped
   dispatch, set the thread-local (a no-op without a router), and would
   emit a misleading "db_routing_fallback" log on any primary failure.
   The mixin now checks connections.databases for the alias and bypasses
   wrapping when absent.

2. The three test files that gained databases=["default","read_replica"]
   implicitly relied on fixtures pulling in transactional_db. Make the
   transactional requirement explicit with transaction=True so the
   behavior survives future fixture refactors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The feature is a no-op locally with the default .env (both env vars
unset → no read_replica alias → mixin early-returns). Standard local
dev doesn't need to know about the routing. The design, code-level
docstring, and PR deploy notes cover the cases that do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@camd camd force-pushed the camd/read-replica-queries branch from fb7dc10 to 5ed5343 Compare May 23, 2026 17:44
@camd
Copy link
Copy Markdown
Collaborator Author

camd commented May 23, 2026

I looked at the 5 slowest queries from CloudSQL. Here is the synopsis:

Short answer

This branch will not touch 4 of the 5. The branch only routes web GET requests whose viewset opts in via ReadReplicaMixin. Four of your five heaviest queries don't come from web requests at all — they come from Celery tasks and management commands, which the design explicitly lists as a non-goal ("Not a change to Celery, ETL, or management commands. They continue to use the primary connection"). Only the bugscache query is web-borne, and even that one isn't routed yet (but easily could be).

Where each query actually comes from

# Query Origin Trigger Routed by this branch?
1 performance_datum … signature_id = … push_timestamp >= … push_timestamp > generate_new_alerts_in_series alerts.py:83+92 Celery task perf/tasks.py:21 / generate_alerts mgmt cmd ❌ No
2 performance_datum … signature_id = … push_timestamp >= generate_new_alerts_in_series alerts.py:83 same Celery / mgmt ❌ No
4 performance_datum_replicate … IN (SELECT … EXISTS(replicate)) generate_new_alerts_in_series alerts.py:96‑111 same Celery / mgmt ❌ No
3 group ⋈ group_status ⋈ job ⋈ job_type ⋈ push … check_and_mark_intermittent intermittents.py:144 Celery task log_parser/tasks.py:86 ❌ No
5 bugscache … summary LIKE … LIMIT Bugscache.searchget_error_summary Web GET JobsProjectViewSet.bug_suggestions jobs.py:392 ❌ Not yet (viewset lacks mixin)

How I know #1/#2/#4 are the alert pipeline and not the graphs endpoints: they SELECT the full model row and filter signature_id = $2 (equality). The web endpoints (PerformanceDatumViewSet, PerformanceSummary) use .values()/.values_list() with a column subset and signature__id__in (an IN), and the web replicate path is a JOIN, not the IN (SELECT … EXISTS …) subquery in #4.

Can they be addressed?

#1, #2, #4 (perf alert generation) and #3 (intermittent classification): Not by the current viewset-mixin mechanism — there's no view in the path. They're also read-then-write tasks (read the datum/group history, then write alerts / classifications), so they're exactly the read-after-write case the design says to keep on primary. Replication lag could make alert generation miss the newest datapoints or mark intermittents off stale group history. Routing them is possible but needs a different, deliberate tool — explicit .using("read_replica") on just the read querysets inside those tasks — plus a conscious decision to accept lag on a correctness-sensitive write path. That's a separate project, not something this branch's design intends to cover.

Perf Alert Generation PR
Intermittent Failure PR

Worth noting: these four dominate your list precisely because they're the ingestion pipeline — alert generation runs across essentially every signature continuously, so the cumulative cost is huge. That's load on the writer that this branch, by design, doesn't move.

#5 (bugscache search): This is the one the current approach can capture. It's a read-only GET, and Bugscache lives in the model app, which is already in READ_REPLICA_APP_ALLOW_LIST. Adding ReadReplicaMixin to JobsProjectViewSet (line 161) would route it. I checked the read-after-write concern: get_error_summary does write a cache entry during the request, but via the db_cache alias (a Django cache table, not perf/model), and db_for_write always returns None anyway — so writes stay on primary and there's no inconsistency. Caveat: the mixin routes all of that viewset's GET actions (jobs/list/retrieve too), so if you want to limit blast radius you'd scope it to the bug_suggestions action rather than the whole viewset.

@camd camd self-assigned this May 25, 2026
Successful replica routing was previously silent — only the fallback path
emitted a log line. Adds a single DEBUG-level log per routed request so a
developer (or ops) can grep \`db_routing routed_to=read_replica\` to confirm
which requests hit the replica.

Prod-silent by default (logger threshold is INFO); enable locally with
LOGGING_LEVEL=DEBUG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@camd camd force-pushed the camd/read-replica-queries branch from 5ed5343 to 4eeff5a Compare May 25, 2026 16:45
@camd
Copy link
Copy Markdown
Collaborator Author

camd commented May 25, 2026

Good candidates for RO replica:

  • Jobs endpoint
  • Groups summary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants