Centralize materialized view registry and refactor refresh task by shlokgilda · Pull Request #292 · chaoss/CollectOSS

shlokgilda · 2026-05-05T05:02:07Z

Description:

Introduces a single source of truth for all 15 existing PostgreSQL materialized view definitions and replaces a fragile hardcoded refresh task with a dynamic loop.

add collectoss/application/db/materialized_views.py — registry of all 15 existing materialized views (SQL definitions + unique index columns). get_refresh_sql() helper builds REFRESH MATERIALIZED VIEW statements.
refactor collectoss/tasks/db/refresh_materialized_views.py — replaces 14 hardcoded try/except: pass blocks with a dynamic loop over the registry. Uses an AUTOCOMMIT connection so REFRESH CONCURRENTLY actually works
(it cannot run inside a transaction block, which the old code silently violated). Raises RuntimeError at the end if any view failed both concurrent and non-concurrent refresh, so failures are visible in Celery instead of swallowed.
wire alembic_utils into collectoss/application/schema/alembic/env.py — registers all views from the registry so alembic revision --autogenerate detects SQL definition changes automatically (phase 2 of Add new materialized views for heatmaps on 8knot (and make a system for it) #243).
add alembic-utils==0.8.8 to pyproject.toml and regenerate uv.lock.

No schema changes in this PR. The 3 new heatmap views for 8Knot follow in a separate incremental PR on top of this one.

This PR fixes #243 (partial — heatmap views follow separately)

Notes for Reviewers:

The key correctness fix worth understanding: REFRESH MATERIALIZED VIEW CONCURRENTLY which wraps everything in engine.begin(), so every concurrent refresh was silently failing and falling back to a blocking refresh on every run. This PR fixes that by using engine.connect().execution_options(isolation_level="AUTOCOMMIT") directly.

The ; COMMIT; embedded in the old SQL strings has also been removed — those were breaking SQLAlchemy's transaction management.

First post-deploy alembic revision --autogenerate may propose a no-op normalization revision for views whose SQL Postgres normalizes differently from what's in the registry (Postgres reformats SQL on storage). Safe to discard that revision.

Signed commits

Yes, I signed my commits.

AI Disclosure: Claude Code was used to draft this PR description and write docstrings in the new materialized_views.py module. I reviewed and verified all code changes, SQL definitions, and fixes before committing.

- add collectoss/application/db/materialized_views.py: single source of truth for all 15 existing materialized view definitions (SQL + unique index columns). get_refresh_sql() helper constructs REFRESH statements. - refactor collectoss/tasks/db/refresh_materialized_views.py: replace 14 hardcoded try/except-pass blocks with a dynamic loop over the registry. uses an AUTOCOMMIT connection so REFRESH CONCURRENTLY works correctly (it cannot run inside a transaction block). raises RuntimeError if any view fails both concurrent and non-concurrent refresh. - wire alembic_utils into collectoss/application/schema/alembic/env.py: register all views from the registry so that alembic revision --autogenerate detects SQL definition changes automatically. - add alembic-utils==0.8.8 to pyproject.toml and regenerate uv.lock. closes chaoss#243 (partial — heatmap views follow in a separate PR) Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>

MoralCode · 2026-05-05T14:46:16Z

+MATERIALIZED_VIEWS = [
+    # --- View 1: legacy DDL (augur_full.sql), no unique index ---
+    {
+        "name": "issue_reporter_created_at",
+        "schema": "augur_data",
+        "sql": _ISSUE_REPORTER_CREATED_AT,
+        "unique_index_columns": [],  # only a non-unique btree on repo_id
+    },
+    # --- Views 2-6: from migration 4, indexes from migration 25 ---
+    {


This feels like a very different paradigm from the class-based modeling in the other SQLAlchemy models.

In developing this, did you come to realize that text-based SQL is the only way to do it? I think it would be awesome if there was a class-based approach to this (ideally passing SQLAlchemy Table's around or something, but if we have to DIY our own classes that mirror the rest of our models and just encapsulate this metadata alongside the SQL query, that could work too)

This feels like a very different paradigm from the class-based modeling in the other SQLAlchemy models...

Fair callout. pushed a follow-up (56647597a).

On Table-based modeling: not viable here. Table describes column/constraint shape, not SELECT semantics, and several views are 100+-line UNION ALL / CTE queries (explorer_new_contributors is ~164 lines). SQLAlchemy also has no first-class materialized-view construct.

What I did: went with the DIY-class fallback. New frozen MaterializedView dataclass with fqn, refresh_sql(), and to_pg_view():

@dataclass(frozen=True) class MaterializedView: name: str schema: str sql: str unique_index_columns: tuple[str, ...] = ()

All 15 registry entries converted from dicts to instances.

Refresh task uses view.fqn / view.refresh_sql(...); get_refresh_sql removed.

env.py collapses to [v.to_pg_view() for v in MATERIALIZED_VIEWS]. Composition over subclassing PGMaterializedView since unique_index_columns is metadata alembic_utils doesn't track.

tuple (not list) so frozen=True is actually immutable. Custom __repr__ keeps the SQL out of logs.

Verified the emitted REFRESH MATERIALIZED VIEW [CONCURRENTLY] ... is byte-identical to before across all 15 views, both modes. No SQL or migration changes.

Do you think its worth going a step further and using MaterializedView as a base class, allowing each mat view to be defined as its own class (and maybe including the SQL within it?)

The tricky part there would be recreating the way that sqlalchemy allows you to simply import the file containing your models and it can just... find them all (maybe thats something they do thats super custom and not worth replicating).

my main motivation being that i want to have the SQL next to/ideally after the name of the mat view, so that its clear which one is being modified. Even if we cant do the auto-discovery, having each in a subclass with its own name, and being able to just

MATERIALIZED_VIEWS = [ ContributorActionsView(), FileChangesView(), ... ]

May be helpful in terms of code organization for anyone looking to change the SQL for a single view (i.e. a ctrl-f for the views name in the DB would take you to the right class, rather than spreading out the code in different parts of the file)

Genuinely fair point on locality. I worked through three ways to fix it and don't think any is a clear win for this PR:

Inline SQL into the registry list entries. Solves ctrl-F locality, but bloats MATERIALIZED_VIEWS = [...] from a ~60-line scannable index of names + unique-index columns into an 800-line block. Also a subtler concern: the SQL strings currently sit at module indent-0; moving them into an indented MaterializedView(...) call inside a list forces a choice between flush-left SQL (visually jarring) and indented SQL (which changes the literal that alembic_utils round-trips against PostgreSQL — easy to
introduce spurious autogenerate diffs).

Per-view subclasses (your proposal). Same locality win. Costs are 15 class definitions instead of 15 instances, awkward override semantics on @dataclass(frozen=True) (subclasses can't redeclare fields without unfreezing or switching off dataclass), and the auto-discovery question you flagged. SQLAlchemy's "import-and-find" is module-level instantiation registering into a MetaData singleton. so the equivalent here is just the explicit MATERIALIZED_VIEWS = [...] at the bottom of the same module. Replicating it via __init_subclass__ is achievable but more machinery than 15 entries warrant.

Module-level assignments, e.g.

EXPLORER_PR_RESPONSE_TIMES_VIEW = MaterializedView( name="explorer_pr_response_times", schema="augur_data", sql="""SELECT ...""", unique_index_columns=("repo_id", "pr_src_id", "pr_src_meta_label"), )

then MATERIALIZED_VIEWS = [EXPLORER_PR_RESPONSE_TIMES_VIEW, ...] at the bottom. Co-locates name + SQL + index metadata, keeps the registry compact, no indentation trap. Closest in spirit to what's there today.

Option 3 is the one I'd actually recommend if we do this. But I'd rather land it on top, not in this PR. the goal here is centralization + the REFRESH CONCURRENTLY correctness fix, and a layout pass that doesn't change semantics adds review surface without changing behavior.

Replace the list-of-dicts registry with a frozen MaterializedView dataclass exposing fqn, refresh_sql(), and to_pg_view(). Brings the registry's shape in line with the declarative ORM style used elsewhere in the codebase and gives callers attribute access + type checking instead of string-keyed dict lookups. unique_index_columns is a tuple so frozen=True actually means immutable. __repr__ is overridden to keep the multi-hundred-line view SQL out of debug logs. Refresh task and alembic env.py updated to use the new API; get_refresh_sql free function removed (only two call sites). Emitted REFRESH SQL is byte-identical to the previous version.

Signed-off-by: Adrian Edwards <adredwar@redhat.com>

shlokgilda requested a review from MoralCode as a code owner May 5, 2026 05:02

MoralCode requested changes May 5, 2026

View reviewed changes

MoralCode added this to the v1.1 Migration Release milestone May 5, 2026

MoralCode added 3 commits May 5, 2026 16:41

the suggested fix

233d0da

Signed-off-by: Adrian Edwards <adredwar@redhat.com>

migration to adjust the schema

4590054

Signed-off-by: Adrian Edwards <adredwar@redhat.com>

ensure facade checking process always gets rescheduled

dc4fa34

Signed-off-by: Adrian Edwards <adredwar@redhat.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Centralize materialized view registry and refactor refresh task#292

Centralize materialized view registry and refactor refresh task#292
shlokgilda wants to merge 5 commits intochaoss:mainfrom
shlokgilda:feature/materialized-views-registry

shlokgilda commented May 5, 2026

Uh oh!

MoralCode May 5, 2026

Uh oh!

shlokgilda May 5, 2026

Uh oh!

MoralCode May 5, 2026

Uh oh!

MoralCode May 5, 2026 •

edited

Loading

Uh oh!

shlokgilda May 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

shlokgilda commented May 5, 2026

Uh oh!

MoralCode May 5, 2026

Choose a reason for hiding this comment

Uh oh!

shlokgilda May 5, 2026

Choose a reason for hiding this comment

Uh oh!

MoralCode May 5, 2026

Choose a reason for hiding this comment

Uh oh!

MoralCode May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

shlokgilda May 5, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

MoralCode May 5, 2026 •

edited

Loading