Add read-replica routing for Perfherder read-only endpoints #9528
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #9528      +/-   ##
==========================================
+ Coverage   82.94%   83.02%   +0.08%
==========================================
  Files         613      616       +3
  Lines       35372    35588     +216
  Branches     3208     3278      +70
==========================================
+ Hits        29338    29547     +209
+ Misses       5880     5672     -208
- Partials      154      369     +215
into reading from the ``read_replica`` database alias. The router only routes
models whose Django app label is in :data:`READ_REPLICA_APP_ALLOW_LIST`.

Design: .claude/plans/READ_REPLICA_DESIGN.md
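The routing rule described in this docstring can be sketched in plain Python. This is a minimal illustration, not the code from the PR: the allow-list values are taken from the PR summary, while the thread-local attribute name and the `set_use_replica` / `use_replica` helpers are hypothetical. The method names follow Django's database-router interface, where returning `None` means "no opinion, fall through to the default".

```python
import threading

# Alias and allow-list mirror the PR text; everything else here is assumed.
READ_REPLICA_ALIAS = "read_replica"
READ_REPLICA_APP_ALLOW_LIST = frozenset({"perf", "model"})

_state = threading.local()


def set_use_replica(enabled):
    """Mark the current thread as opted in to (or out of) replica reads."""
    _state.use_replica = bool(enabled)


def use_replica():
    """Return True only if the current thread has opted in."""
    return getattr(_state, "use_replica", False)


class ReadReplicaRouter:
    """Router sketch: suggest the replica only for opted-in, allow-listed reads."""

    def db_for_read(self, model, **hints):
        if use_replica() and model._meta.app_label in READ_REPLICA_APP_ALLOW_LIST:
            return READ_REPLICA_ALIAS
        return None  # defer to the default database

    def db_for_write(self, model, **hints):
        # Writes are never sent to the replica.
        return None
```

Because the flag lives in a thread-local, a request that never opts in sees `None` from both methods and behaves exactly as it would without the router.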
The router and mixin are then active in dev. Routing is a no-op unless an
endpoint opts in via `ReadReplicaMixin`; see
`treeherder/config/db_routing.py` and the design at
`.claude/plans/READ_REPLICA_DESIGN.md`.
_RecordingView.raise_on_call = 0
_RecordingView.call_count = 0
_RecordingView.saw_use_replica = []
yield
What's the benefit of this?
Good catch. That yield is a no-op. Removed.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Routes 10 read-only Perfherder viewsets to the read replica (PerformanceSignatureViewSet, PerformancePlatformViewSet, PerformanceJobViewSet, PerformanceDatumViewSet, PerformanceBugTemplateViewSet, PerformanceIssueTrackerViewSet, PerformanceSummary, PerformanceAlertSummaryTasks, PerfCompareResults, TestSuiteHealthViewSet).

Adds integration tests for signature list and summary endpoints. Updates test_perfcompare_api.py, test_performance_data_api.py, and test_performance_bug_template_api.py to declare read_replica database access.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two review fixes:

1. When the kill switch is off (READ_REPLICA_ENABLED=false or READ_REPLICA_DATABASE_URL unset), the mixin previously still wrapped dispatch, set the thread-local (a no-op without a router), and would emit a misleading "db_routing_fallback" log on any primary failure. The mixin now checks connections.databases for the alias and bypasses wrapping when absent.
2. The three test files that gained databases=["default","read_replica"] implicitly relied on fixtures pulling in transactional_db. Make the transactional requirement explicit with transaction=True so the behavior survives future fixture refactors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
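The first fix, combined with the fallback behaviour from the PR summary, can be sketched as a single dispatch helper. This is an illustration under stated assumptions rather than the mixin itself: `dispatch_with_replica`, `ReplicaError`, `set_flag`, and `on_fallback` are hypothetical names, and `ReplicaError` stands in for Django's `OperationalError` / `InterfaceError` so the example runs without Django.

```python
SAFE_METHODS = frozenset({"GET", "HEAD", "OPTIONS"})
REPLICA_ALIAS = "read_replica"


class ReplicaError(Exception):
    """Stand-in for django.db.OperationalError / InterfaceError."""


def dispatch_with_replica(method, handler, databases, set_flag, on_fallback):
    """Run handler() with replica routing only when it is configured and safe."""
    # Kill switch off (alias absent) or a mutating method: skip the wrapper,
    # so no flag is set and no fallback log can be emitted.
    if REPLICA_ALIAS not in databases or method.upper() not in SAFE_METHODS:
        return handler()
    set_flag(True)
    try:
        return handler()
    except ReplicaError:
        on_fallback()      # e.g. emit the "db_routing_fallback" warning
        set_flag(False)
        return handler()   # one-shot retry against primary
    finally:
        set_flag(False)    # never leak the flag into the next request
```

With the alias missing, the handler runs exactly once and any error it raises propagates unchanged, which is what removes the misleading fallback log described above.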
The feature is a no-op locally with the default .env (both env vars unset → no read_replica alias → mixin early-returns). Standard local dev doesn't need to know about the routing. The design, code-level docstring, and PR deploy notes cover the cases that do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
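The "both env vars unset → no read_replica alias" chain could look roughly like the following. This is a sketch only: the two variable names come from the PR, but `build_databases`, the `DATABASE_URL` key, the `{"url": ...}` dict shape, and the rule that only the string `true` enables the switch are all assumptions made for illustration.

```python
def build_databases(env):
    """Return a DATABASES-like dict; add the replica alias only when gated on."""
    databases = {"default": {"url": env.get("DATABASE_URL", "")}}
    replica_url = env.get("READ_REPLICA_DATABASE_URL")
    # Assumed parsing rule: only the literal string "true" (any case) enables it.
    enabled = env.get("READ_REPLICA_ENABLED", "").strip().lower() == "true"
    if replica_url and enabled:
        databases["read_replica"] = {"url": replica_url}
    return databases
```

Requiring both variables means a stray replica URL in the environment cannot activate routing on its own.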
fb7dc10 to 5ed5343
I looked at the 5 slowest queries from CloudSQL. Here is the synopsis:

Short answer
This branch will not touch 4 of the 5. The branch only routes web GET requests whose viewset opts in via

Where each query actually comes from

How I know #1/#2/#4 are the alert pipeline and not the graphs endpoints: they

Can they be addressed?
#1, #2, #4 (perf alert generation) and #3 (intermittent classification): Not by the current viewset-mixin mechanism, because there's no view in the path. They're also read-then-write tasks (read the datum/group history, then write alerts / classifications), so they're exactly the read-after-write case the design says to keep on primary. Replication lag could make alert generation miss the newest datapoints or mark intermittents off stale group history. Routing them is possible but needs a different, deliberate tool: explicit

Perf Alert Generation PR

Worth noting: these four dominate your list precisely because they're the ingestion pipeline. Alert generation runs across essentially every signature continuously, so the cumulative cost is huge. That's load on the writer that this branch, by design, doesn't move.

#5 (bugscache search): This is the one the current approach can capture. It's a read-only GET, and
Successful replica routing was previously silent; only the fallback path emitted a log line. Adds a single DEBUG-level log per routed request so a developer (or ops) can grep `db_routing routed_to=read_replica` to confirm which requests hit the replica. Prod-silent by default (logger threshold is INFO); enable locally with LOGGING_LEVEL=DEBUG.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
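The "silent at INFO, visible at DEBUG" behaviour follows directly from the standard `logging` module and can be sketched as below. Only the `db_routing routed_to=read_replica` prefix comes from the commit message; the logger name, the `method` / `path` fields, and the `ListHandler` helper are hypothetical and exist just to make the example self-contained.

```python
import logging

logger = logging.getLogger("db_routing_sketch")  # hypothetical logger name


class ListHandler(logging.Handler):
    """Collects formatted messages so the example can inspect them."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def log_routed(method, path):
    """Emit one DEBUG line per request that was routed to the replica."""
    logger.debug("db_routing routed_to=read_replica method=%s path=%s", method, path)
```

With the logger threshold at INFO the call is dropped before any handler sees it, so the line costs almost nothing in production until someone lowers the level.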
5ed5343 to 4eeff5a
Good candidates for RO replica:
Summary
Routes 11 read-only Perfherder GET endpoints through a separate read_replica PostgreSQL connection to relieve write-side contention on the primary database. Gated behind two env vars (READ_REPLICA_DATABASE_URL + READ_REPLICA_ENABLED) so it ships as a no-op until ops flips the switch.

- ReadReplicaRouter (Django DATABASE_ROUTERS) routes reads to the replica only when a thread-local flag is set and the model's app label is in an explicit allow-list ({"perf", "model"}).
- ReadReplicaMixin flips that flag for GET/HEAD/OPTIONS only, performs a one-shot retry against primary if the replica raises OperationalError/InterfaceError, and emits a db_routing_fallback warning log. The mixin is a true no-op when the alias isn't configured.
- treeherder/webapp/api/performance_data.py. The 3 mutating viewsets (PerformanceAlertSummaryViewSet, PerformanceAlertViewSet, PerformanceTagViewSet) are intentionally left on primary so create→read flows stay consistent.

Design doc: .claude/plans/READ_REPLICA_DESIGN.md (local). Mechanism is reusable: applying it to Logviewer / Intermittent Failures View later is a one-line viewset change each.

Test plan
- CaptureQueriesContext asserting the replica alias actually serves opted-in endpoints and does NOT serve excluded ones (4 tests).
- READ_REPLICA_DATABASE_URL in stage; flip READ_REPLICA_ENABLED=true; soak; watch db_routing_fallback log volume + primary DB CPU; then prod.
- READ_REPLICA_ENABLED=false (one env-var flip) and verify all routed endpoints continue returning correct data via primary.

🤖 Generated with Claude Code