Skip to content

Optimize performance_datum queries in alert generation#9550

Open
camd wants to merge 2 commits into
masterfrom
camd/query-optimization-performance-datum
Open

Optimize performance_datum queries in alert generation#9550
camd wants to merge 2 commits into
masterfrom
camd/query-optimization-performance-datum

Conversation

@camd
Copy link
Copy Markdown
Collaborator

@camd camd commented May 23, 2026

Bug 2041948

generate_new_alerts_in_series and generate_new_test_alerts_in_series run on every signature during ingestion. Their PerformanceDatum reads were not using the (repository, signature, push_timestamp) index:

  • The series query filtered only on signature + push_timestamp, so Postgres could not seek the composite index (its leading column, repository, was unconstrained). Add repository=signature.repository; a signature has exactly one repository, so results are unchanged.

  • The replicate lookup used IN (SELECT ... WHERE EXISTS (...)). The EXISTS was redundant -- a replicate row's parent datum is in the set by definition. Replace with a plain join via the new shared helper _collect_replicates_map(), removing the duplicated block in both functions and the now-unused Exists/OuterRef/Subquery imports.

  • De-duplicated some code that was identical. This streamlines the code where both queries had the same issues, so this was optimal.

Add a characterization test pinning the replicate-map behavior.

generate_new_alerts_in_series and generate_new_test_alerts_in_series run
on every signature during ingestion. Their PerformanceDatum reads were
not using the (repository, signature, push_timestamp) index:

- The `series` query filtered only on signature + push_timestamp, so
  Postgres could not seek the composite index (its leading column,
  repository, was unconstrained). Add repository=signature.repository;
  a signature has exactly one repository, so results are unchanged.

- The replicate lookup used `IN (SELECT ... WHERE EXISTS (...))`. The
  EXISTS was redundant -- a replicate row's parent datum is in the set
  by definition. Replace with a plain join via the new shared helper
  _collect_replicates_map(), removing the duplicated block in both
  functions and the now-unused Exists/OuterRef/Subquery imports.

Add a characterization test pinning the replicate-map behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@camd camd self-assigned this May 23, 2026
@camd camd requested review from Archaeopteryx and gmierz May 23, 2026 18:31
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 82.94%. Comparing base (6e680ff) to head (d96ff26).
⚠️ Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #9550    +/-   ##
========================================
  Coverage   82.94%   82.94%            
========================================
  Files         613      613            
  Lines       35372    35376     +4     
  Branches     3208     3278    +70     
========================================
+ Hits        29338    29342     +4     
+ Misses       5880     5665   -215     
- Partials      154      369   +215     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Trim the PerformanceDatum series query to the 4 columns actually used
(.only id/push_id/push_timestamp/value) and drop the no-op
select_related on the latest-alert-timestamp lookup in both
generate_new_alerts_in_series and generate_new_test_alerts_in_series.

Extract the now-identical data-loading prologue into _load_revision_data,
parameterized by the latest-alert-timestamp queryset and the
RevisionDatum class, so the two alerting paths share one copy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@camd
Copy link
Copy Markdown
Collaborator Author

camd commented May 25, 2026

Commit 2 - extra perf enhancements

File: treeherder/perf/alerts.py

Two nearly-identical functions — generate_new_alerts_in_series and generate_new_test_alerts_in_series (its *Testing-table twin) — each got the same two query optimizations:

1. Trimmed over-fetching on the series queryset

Added .only("id", "push_id", "push_timestamp", "value") to the PerformanceDatum query. Previously it fetched full model instances (10 columns), but the consuming loop only ever reads those 4 attributes. This drops the SELECT from 10 columns to 4 per row — eliminating 6 unused columns including three VARCHARs (os_name, platform_version, application_version). Savings scale with series size (many datapoints per signature in the alert window).

2. Removed dead select_related

Deleted .select_related("summary__push__time") from the latest-alert-timestamp query. Django silently ignores select_related when .values_list() is chained, so it produced byte-for-byte identical SQL — it was misleading dead code (and an invalid path, since time is a field, not a relation).

Behavior preserved

  • .only(...) covers every attribute the loops touch, so model instances behave identically.
  • The select_related removal changes no generated SQL.
  • No index changes required — the WHERE clause is unchanged and the existing (repository, signature, push_timestamp) composite index still applies.

Verification

  • ruff check + ruff format --check: clean
  • 36 tests passing across tests/perfalert/test_methods_alerts.py, tests/perfalert/test_alerts.py, and tests/etl/test_perf_data_load.py

Copy link
Copy Markdown
Contributor

@junngo junngo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At that time, we spent quite a bit of effort tuning the datum replicate query shape.
The Subquery and Exists conditions may look logically redundant, but they helped the DB optimizer choose the intended index path for PerformanceDatumReplicate.
I’m happy to hear any ideas for improving it further, and I’ll also try to double-check whether there’s a better query shape here.

Comment thread treeherder/perf/alerts.py
performance_datum__signature=signature,
performance_datum__repository=signature.repository,
performance_datum__push_timestamp__gte=alert_after_ts,
).values_list("performance_datum_id", "value")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m concerned this may introduce a query plan regression.
This query [0] takes over 2 minutes per call in Redash. From the query plan [1], performance_datum uses the intended index quickly, but performance_datum_replicate still performs a full scan.

[0]
SELECT
    pdr.performance_datum_id,
    pdr.value
FROM
    performance_datum_replicate pdr
INNER JOIN
    performance_datum pd
    ON pdr.performance_datum_id = pd.id
WHERE pd.push_timestamp >= '2026-05-11'
and pd.repository_id = 77
and pd.signature_id = 5763463

[1]
Execution Time: 138,480 ms ≈ 2m 18s
Gather
└─ Hash Join
   Join: pdr.performance_datum_id = pd.id
   ├─ Outer: Parallel Seq Scan on performance_datum_replicate pdr
   │  ├─ ....
   │  ├─ Total scanned: ~1.36 billion rows
   │
   └─ Inner: Hash
      └─ Index Scan on performance_datum pd
         ├─ Index: performance_reposit_c9d328_idx
         ├─ Cond:
         │  repository_id = 77
         │  signature_id = 5763463
         │  push_timestamp >= 2026-05-11
         ├─ Actual rows: 55,900
         └─ Actual time: ~49ms

Comment thread treeherder/perf/alerts.py
)
replicates = PerformanceDatumReplicate.objects.filter(
performance_datum_id__in=Subquery(datum_with_replicates.values("id"))
).values_list("performance_datum_id", "value")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original query takes about 100 ms per call, which looks reasonable to me.

Execution Time: 107 ms
Nested Loop
├─ Outer: Aggregate / Gather
│  └─ Nested Loop Semi
│     ├─ Bitmap Heap Scan on performance_datum
│     │  ├─ Index: performance_reposit_c9d328_idx
│     │  ├─ Actual rows: ~55,494 total
│     │  └─ Uses repository/signature/push_timestamp condition
│     │
│     └─ Index Only Scan on performance_datum_replicate u0
│        ├─ Index: performance_datum_replicate_performance_datum_id_fe2ed518
│        ├─ Cond: performance_datum_id = v0.id
│        ├─ Loops: 55,494
│        └─ Heap Fetches: 0
│
└─ Inner: Index Scan on performance_datum_replicate
   └─ Not executed because no matching datum ids with replicates

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants