Centralize materialized view registry and refactor refresh task#292
Centralize materialized view registry and refactor refresh task#292shlokgilda wants to merge 5 commits intochaoss:mainfrom
Conversation
- add collectoss/application/db/materialized_views.py: single source of truth for all 15 existing materialized view definitions (SQL + unique index columns). get_refresh_sql() helper constructs REFRESH statements. - refactor collectoss/tasks/db/refresh_materialized_views.py: replace 14 hardcoded try/except-pass blocks with a dynamic loop over the registry. uses an AUTOCOMMIT connection so REFRESH CONCURRENTLY works correctly (it cannot run inside a transaction block). raises RuntimeError if any view fails both concurrent and non-concurrent refresh. - wire alembic_utils into collectoss/application/schema/alembic/env.py: register all views from the registry so that alembic revision --autogenerate detects SQL definition changes automatically. - add alembic-utils==0.8.8 to pyproject.toml and regenerate uv.lock. closes chaoss#243 (partial — heatmap views follow in a separate PR) Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
| MATERIALIZED_VIEWS = [ | ||
| # --- View 1: legacy DDL (augur_full.sql), no unique index --- | ||
| { | ||
| "name": "issue_reporter_created_at", | ||
| "schema": "augur_data", | ||
| "sql": _ISSUE_REPORTER_CREATED_AT, | ||
| "unique_index_columns": [], # only a non-unique btree on repo_id | ||
| }, | ||
| # --- Views 2-6: from migration 4, indexes from migration 25 --- | ||
| { |
There was a problem hiding this comment.
This feels like a very different paradigm from the class-based modeling in the other SQLAlchemy models.
In developing this, did you come to realize that text-based SQL is the only way to do it? I think it would be awesome if there was a class-based approach to this (ideally passing SQLAlchemy Table's around or something, but if we have to DIY our own classes that mirror the rest of our models and just encapsulate this metadata alongside the SQL query, that could work too)
There was a problem hiding this comment.
This feels like a very different paradigm from the class-based modeling in the other SQLAlchemy models...
Fair callout. pushed a follow-up (56647597a).
On Table-based modeling: not viable here. Table describes column/constraint shape, not SELECT semantics, and several views are 100+-line UNION ALL / CTE queries (explorer_new_contributors is ~164 lines). SQLAlchemy also has no first-class materialized-view construct.
What I did: went with the DIY-class fallback. New frozen MaterializedView dataclass with fqn, refresh_sql(), and to_pg_view():
@dataclass(frozen=True)
class MaterializedView:
name: str
schema: str
sql: str
unique_index_columns: tuple[str, ...] = ()- All 15 registry entries converted from dicts to instances.
- Refresh task uses
view.fqn/view.refresh_sql(...);get_refresh_sqlremoved. env.pycollapses to[v.to_pg_view() for v in MATERIALIZED_VIEWS]. Composition over subclassingPGMaterializedViewsinceunique_index_columnsis metadata alembic_utils doesn't track.tuple(notlist) sofrozen=Trueis actually immutable. Custom__repr__keeps the SQL out of logs.
Verified the emitted REFRESH MATERIALIZED VIEW [CONCURRENTLY] ... is byte-identical to before across all 15 views, both modes. No SQL or migration changes.
There was a problem hiding this comment.
Do you think its worth going a step further and using MaterializedView as a base class, allowing each mat view to be defined as its own class (and maybe including the SQL within it?)
The tricky part there would be recreating the way that sqlalchemy allows you to simply import the file containing your models and it can just... find them all (maybe thats something they do thats super custom and not worth replicating).
There was a problem hiding this comment.
my main motivation being that i want to have the SQL next to/ideally after the name of the mat view, so that its clear which one is being modified. Even if we cant do the auto-discovery, having each in a subclass with its own name, and being able to just
MATERIALIZED_VIEWS = [
ContributorActionsView(),
FileChangesView(),
...
]
May be helpful in terms of code organization for anyone looking to change the SQL for a single view (i.e. a ctrl-f for the views name in the DB would take you to the right class, rather than spreading out the code in different parts of the file)
There was a problem hiding this comment.
Genuinely fair point on locality. I worked through three ways to fix it and don't think any is a clear win for this PR:
- Inline SQL into the registry list entries. Solves ctrl-F locality, but bloats
MATERIALIZED_VIEWS = [...]from a ~60-line scannable index of names + unique-index columns into an 800-line block. Also a subtler concern: the SQL strings currently sit at module indent-0; moving them into an indentedMaterializedView(...)call inside a list forces a choice between flush-left SQL (visually jarring) and indented SQL (which changes the literal thatalembic_utilsround-trips against PostgreSQL — easy to
introduce spurious autogenerate diffs). - Per-view subclasses (your proposal). Same locality win. Costs are 15 class definitions instead of 15 instances, awkward override semantics on
@dataclass(frozen=True)(subclasses can't redeclare fields without unfreezing or switching offdataclass), and the auto-discovery question you flagged. SQLAlchemy's "import-and-find" is module-level instantiation registering into aMetaDatasingleton. so the equivalent here is just the explicitMATERIALIZED_VIEWS = [...]at the bottom of the same module. Replicating it via__init_subclass__is achievable but more machinery than 15 entries warrant. - Module-level assignments, e.g.
EXPLORER_PR_RESPONSE_TIMES_VIEW = MaterializedView(
name="explorer_pr_response_times",
schema="augur_data",
sql="""SELECT ...""",
unique_index_columns=("repo_id", "pr_src_id", "pr_src_meta_label"),
)
then MATERIALIZED_VIEWS = [EXPLORER_PR_RESPONSE_TIMES_VIEW, ...] at the bottom. Co-locates name + SQL + index metadata, keeps the registry compact, no indentation trap. Closest in spirit to what's there today.
Option 3 is the one I'd actually recommend if we do this. But I'd rather land it on top, not in this PR. the goal here is centralization + the REFRESH CONCURRENTLY correctness fix, and a layout pass that doesn't change semantics adds review surface without changing behavior.
Replace the list-of-dicts registry with a frozen MaterializedView dataclass exposing fqn, refresh_sql(), and to_pg_view(). Brings the registry's shape in line with the declarative ORM style used elsewhere in the codebase and gives callers attribute access + type checking instead of string-keyed dict lookups. unique_index_columns is a tuple so frozen=True actually means immutable. __repr__ is overridden to keep the multi-hundred-line view SQL out of debug logs. Refresh task and alembic env.py updated to use the new API; get_refresh_sql free function removed (only two call sites). Emitted REFRESH SQL is byte-identical to the previous version.
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Description:
Introduces a single source of truth for all 15 existing PostgreSQL materialized view definitions and replaces a fragile hardcoded refresh task with a dynamic loop.
collectoss/application/db/materialized_views.py— registry of all 15 existing materialized views (SQL definitions + unique index columns).get_refresh_sql()helper buildsREFRESH MATERIALIZED VIEWstatements.collectoss/tasks/db/refresh_materialized_views.py— replaces 14 hardcodedtry/except: passblocks with a dynamic loop over the registry. Uses anAUTOCOMMITconnection soREFRESH CONCURRENTLYactually works(it cannot run inside a transaction block, which the old code silently violated). Raises
RuntimeErrorat the end if any view failed both concurrent and non-concurrent refresh, so failures are visible in Celery instead of swallowed.alembic_utilsintocollectoss/application/schema/alembic/env.py— registers all views from the registry soalembic revision --autogeneratedetects SQL definition changes automatically (phase 2 of Add new materialized views for heatmaps on 8knot (and make a system for it) #243).alembic-utils==0.8.8topyproject.tomland regenerateuv.lock.No schema changes in this PR. The 3 new heatmap views for 8Knot follow in a separate incremental PR on top of this one.
This PR fixes #243 (partial — heatmap views follow separately)
Notes for Reviewers:
The key correctness fix worth understanding:
REFRESH MATERIALIZED VIEW CONCURRENTLYwhich wraps everything inengine.begin(), so every concurrent refresh was silently failing and falling back to a blocking refresh on every run. This PR fixes that by usingengine.connect().execution_options(isolation_level="AUTOCOMMIT")directly.The
; COMMIT;embedded in the old SQL strings has also been removed — those were breaking SQLAlchemy's transaction management.First post-deploy
alembic revision --autogeneratemay propose a no-op normalization revision for views whose SQL Postgres normalizes differently from what's in the registry (Postgres reformats SQL on storage). Safe to discard that revision.Signed commits
AI Disclosure: Claude Code was used to draft this PR description and write docstrings in the new
materialized_views.pymodule. I reviewed and verified all code changes, SQL definitions, and fixes before committing.