Skip to content

Centralize materialized view registry and refactor refresh task#292

Open
shlokgilda wants to merge 5 commits intochaoss:mainfrom
shlokgilda:feature/materialized-views-registry
Open

Centralize materialized view registry and refactor refresh task#292
shlokgilda wants to merge 5 commits intochaoss:mainfrom
shlokgilda:feature/materialized-views-registry

Conversation

@shlokgilda
Copy link
Copy Markdown

Description:

Introduces a single source of truth for all 15 existing PostgreSQL materialized view definitions and replaces a fragile hardcoded refresh task with a dynamic loop.

  • add collectoss/application/db/materialized_views.py — registry of all 15 existing materialized views (SQL definitions + unique index columns). get_refresh_sql() helper builds REFRESH MATERIALIZED VIEW statements.
  • refactor collectoss/tasks/db/refresh_materialized_views.py — replaces 14 hardcoded try/except: pass blocks with a dynamic loop over the registry. Uses an AUTOCOMMIT connection so REFRESH CONCURRENTLY actually works
    (it cannot run inside a transaction block, which the old code silently violated). Raises RuntimeError at the end if any view failed both concurrent and non-concurrent refresh, so failures are visible in Celery instead of swallowed.
  • wire alembic_utils into collectoss/application/schema/alembic/env.py — registers all views from the registry so alembic revision --autogenerate detects SQL definition changes automatically (phase 2 of Add new materialized views for heatmaps on 8knot (and make a system for it) #243).
  • add alembic-utils==0.8.8 to pyproject.toml and regenerate uv.lock.

No schema changes in this PR. The 3 new heatmap views for 8Knot follow in a separate incremental PR on top of this one.

This PR fixes #243 (partial — heatmap views follow separately)

Notes for Reviewers:

The key correctness fix worth understanding: REFRESH MATERIALIZED VIEW CONCURRENTLY which wraps everything in engine.begin(), so every concurrent refresh was silently failing and falling back to a blocking refresh on every run. This PR fixes that by using engine.connect().execution_options(isolation_level="AUTOCOMMIT") directly.

The ; COMMIT; embedded in the old SQL strings has also been removed — those were breaking SQLAlchemy's transaction management.

First post-deploy alembic revision --autogenerate may propose a no-op normalization revision for views whose SQL Postgres normalizes differently from what's in the registry (Postgres reformats SQL on storage). Safe to discard that revision.

Signed commits

  • Yes, I signed my commits.

AI Disclosure: Claude Code was used to draft this PR description and write docstrings in the new materialized_views.py module. I reviewed and verified all code changes, SQL definitions, and fixes before committing.

- add collectoss/application/db/materialized_views.py: single source of
  truth for all 15 existing materialized view definitions (SQL + unique
  index columns). get_refresh_sql() helper constructs REFRESH statements.

- refactor collectoss/tasks/db/refresh_materialized_views.py: replace 14
  hardcoded try/except-pass blocks with a dynamic loop over the registry.
  uses an AUTOCOMMIT connection so REFRESH CONCURRENTLY works correctly
  (it cannot run inside a transaction block). raises RuntimeError if any
  view fails both concurrent and non-concurrent refresh.

- wire alembic_utils into collectoss/application/schema/alembic/env.py:
  register all views from the registry so that alembic revision
  --autogenerate detects SQL definition changes automatically.

- add alembic-utils==0.8.8 to pyproject.toml and regenerate uv.lock.

closes chaoss#243 (partial — heatmap views follow in a separate PR)

Signed-off-by: Shlok Gilda <gildashlok@hotmail.com>
@shlokgilda shlokgilda requested a review from MoralCode as a code owner May 5, 2026 05:02
Comment on lines +692 to +701
MATERIALIZED_VIEWS = [
# --- View 1: legacy DDL (augur_full.sql), no unique index ---
{
"name": "issue_reporter_created_at",
"schema": "augur_data",
"sql": _ISSUE_REPORTER_CREATED_AT,
"unique_index_columns": [], # only a non-unique btree on repo_id
},
# --- Views 2-6: from migration 4, indexes from migration 25 ---
{
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a very different paradigm from the class-based modeling in the other SQLAlchemy models.

In developing this, did you come to realize that text-based SQL is the only way to do it? I think it would be awesome if there was a class-based approach to this (ideally passing SQLAlchemy Table's around or something, but if we have to DIY our own classes that mirror the rest of our models and just encapsulate this metadata alongside the SQL query, that could work too)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feels like a very different paradigm from the class-based modeling in the other SQLAlchemy models...

Fair callout. pushed a follow-up (56647597a).

On Table-based modeling: not viable here. Table describes column/constraint shape, not SELECT semantics, and several views are 100+-line UNION ALL / CTE queries (explorer_new_contributors is ~164 lines). SQLAlchemy also has no first-class materialized-view construct.

What I did: went with the DIY-class fallback. New frozen MaterializedView dataclass with fqn, refresh_sql(), and to_pg_view():

@dataclass(frozen=True)
class MaterializedView:
    name: str
    schema: str
    sql: str
    unique_index_columns: tuple[str, ...] = ()
  • All 15 registry entries converted from dicts to instances.
  • Refresh task uses view.fqn / view.refresh_sql(...); get_refresh_sql removed.
  • env.py collapses to [v.to_pg_view() for v in MATERIALIZED_VIEWS]. Composition over subclassing PGMaterializedView since unique_index_columns is metadata alembic_utils doesn't track.
  • tuple (not list) so frozen=True is actually immutable. Custom __repr__ keeps the SQL out of logs.

Verified the emitted REFRESH MATERIALIZED VIEW [CONCURRENTLY] ... is byte-identical to before across all 15 views, both modes. No SQL or migration changes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think its worth going a step further and using MaterializedView as a base class, allowing each mat view to be defined as its own class (and maybe including the SQL within it?)

The tricky part there would be recreating the way that sqlalchemy allows you to simply import the file containing your models and it can just... find them all (maybe thats something they do thats super custom and not worth replicating).

Copy link
Copy Markdown
Contributor

@MoralCode MoralCode May 5, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my main motivation being that i want to have the SQL next to/ideally after the name of the mat view, so that its clear which one is being modified. Even if we cant do the auto-discovery, having each in a subclass with its own name, and being able to just

MATERIALIZED_VIEWS = [
ContributorActionsView(),
FileChangesView(),
...
]

May be helpful in terms of code organization for anyone looking to change the SQL for a single view (i.e. a ctrl-f for the views name in the DB would take you to the right class, rather than spreading out the code in different parts of the file)

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Genuinely fair point on locality. I worked through three ways to fix it and don't think any is a clear win for this PR:

  1. Inline SQL into the registry list entries. Solves ctrl-F locality, but bloats MATERIALIZED_VIEWS = [...] from a ~60-line scannable index of names + unique-index columns into an 800-line block. Also a subtler concern: the SQL strings currently sit at module indent-0; moving them into an indented MaterializedView(...) call inside a list forces a choice between flush-left SQL (visually jarring) and indented SQL (which changes the literal that alembic_utils round-trips against PostgreSQL — easy to
    introduce spurious autogenerate diffs).
  2. Per-view subclasses (your proposal). Same locality win. Costs are 15 class definitions instead of 15 instances, awkward override semantics on @dataclass(frozen=True) (subclasses can't redeclare fields without unfreezing or switching off dataclass), and the auto-discovery question you flagged. SQLAlchemy's "import-and-find" is module-level instantiation registering into a MetaData singleton. so the equivalent here is just the explicit MATERIALIZED_VIEWS = [...] at the bottom of the same module. Replicating it via __init_subclass__ is achievable but more machinery than 15 entries warrant.
  3. Module-level assignments, e.g.
EXPLORER_PR_RESPONSE_TIMES_VIEW = MaterializedView(                                                                             
    name="explorer_pr_response_times",
    schema="augur_data",                                                                                                        
    sql="""SELECT ...""",                                                                                                       
    unique_index_columns=("repo_id", "pr_src_id", "pr_src_meta_label"),
    )                                                                                                                               

then MATERIALIZED_VIEWS = [EXPLORER_PR_RESPONSE_TIMES_VIEW, ...] at the bottom. Co-locates name + SQL + index metadata, keeps the registry compact, no indentation trap. Closest in spirit to what's there today.

Option 3 is the one I'd actually recommend if we do this. But I'd rather land it on top, not in this PR. the goal here is centralization + the REFRESH CONCURRENTLY correctness fix, and a layout pass that doesn't change semantics adds review surface without changing behavior.

Replace the list-of-dicts registry with a frozen MaterializedView dataclass
exposing fqn, refresh_sql(), and to_pg_view(). Brings the registry's shape
in line with the declarative ORM style used elsewhere in the codebase and
gives callers attribute access + type checking instead of string-keyed
dict lookups.

unique_index_columns is a tuple so frozen=True actually means immutable.
__repr__ is overridden to keep the multi-hundred-line view SQL out of
debug logs. Refresh task and alembic env.py updated to use the new API;
get_refresh_sql free function removed (only two call sites).

Emitted REFRESH SQL is byte-identical to the previous version.
@MoralCode MoralCode added this to the v1.1 Migration Release milestone May 5, 2026
MoralCode added 3 commits May 5, 2026 16:41
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Signed-off-by: Adrian Edwards <adredwar@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add new materialized views for heatmaps on 8knot (and make a system for it)

2 participants