feat(multi-value): first-class support for array-typed dimensions (ClickHouse MVP)#41279
Draft
thedeceptio wants to merge 7 commits into
…y contract

Introduce first-class multi-value (array-typed) column support, designed to work across any SQL dialect that supports arrays. This stage adds the semantic type and the dialect-agnostic capability layer, with ClickHouse as the first concrete implementation.

- Activate GenericDataType.MULTI_VALUE (= 4) on both the backend enum (superset/utils/core.py) and the frontend enum (superset-core/common).
- Add an opt-in capability contract to BaseEngineSpec: supports_multivalue_columns flag (default False) plus array_contains, array_length and array_explode methods (default NotImplementedError) so engines that have not opted in keep treating arrays as strings.
- ClickHouse: set the flag, reclassify Array(...) columns as MULTI_VALUE, and implement has()/length()/arrayJoin() as bound SQLAlchemy expressions.

Tests: flip the existing Array->STRING assertion to MULTI_VALUE, add nested array variants, per-capability SQL-compilation tests, a bound-parameter (injection-safety) test, and negative tests asserting the base spec stays disabled and raises.

array_explode is implemented and unit-tested but not yet wired into the query builder (Stage C).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>
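The opt-in contract described in this commit can be sketched in plain Python. This is a minimal illustration, not the code in the PR: the class names `BaseSpecSketch` and `ClickHouseSpecSketch` and the helper `raises_not_implemented` are hypothetical, and the methods return `(sql, params)` tuples as a stand-in for the bound SQLAlchemy expressions the real methods return.

```python
from typing import Any, Tuple

# Stand-in for a bound SQL expression: SQL text plus its parameters.
Expr = Tuple[str, Tuple[Any, ...]]


class BaseSpecSketch:
    """Hypothetical stand-in for BaseEngineSpec; multi-value support is opt-in."""

    supports_multivalue_columns = False

    @classmethod
    def array_contains(cls, column: str, value: Any) -> Expr:
        raise NotImplementedError

    @classmethod
    def array_length(cls, column: str) -> Expr:
        raise NotImplementedError

    @classmethod
    def array_explode(cls, column: str) -> Expr:
        raise NotImplementedError


class ClickHouseSpecSketch(BaseSpecSketch):
    """Hypothetical opt-in engine using the ClickHouse function names."""

    supports_multivalue_columns = True

    @classmethod
    def array_contains(cls, column: str, value: Any) -> Expr:
        # The value is carried as a parameter, never spliced into the SQL text.
        return f"has({column}, ?)", (value,)

    @classmethod
    def array_length(cls, column: str) -> Expr:
        return f"length({column})", ()

    @classmethod
    def array_explode(cls, column: str) -> Expr:
        return f"arrayJoin({column})", ()


def raises_not_implemented(spec: type) -> bool:
    """Return True if the given spec has not implemented array_length."""
    try:
        spec.array_length("skills")
    except NotImplementedError:
        return True
    return False
```

Keeping the flag `False` and the methods unimplemented on the base class means an engine that never opts in behaves exactly as before.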
Wire a new CONTAINS filter operator end-to-end so users can filter multi-value (array) columns by membership.

- Add FilterOperator.CONTAINS (and the FilterStringOperators mirror). The charts API schema derives its allowed operators from the enum, so CONTAINS is accepted without any extra allow-list change.
- Translate CONTAINS in the query builder (models/helpers.py): route to db_engine_spec.array_contains() so each dialect emits its native membership call, and raise a clear QueryObjectValidationError when the engine does not support multi-value columns.

Tests: operator round-trip, an unsupported-engine guard (sqlite raises), and a positive path that drives the real get_sqla_query pipeline with a ClickHouse engine spec and asserts the generated SQL contains has(...). The dataset compiles against sqlite so no ClickHouse driver is required (func.has renders dialect-agnostically).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>
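The routing and the unsupported-engine guard described above might look roughly like the following. Everything here is an assumption made for illustration: `FilterOperatorSketch`, `ValidationErrorSketch`, the two spec classes and `translate_filter` are hypothetical names, the enum values are guesses, and the `(sql, params)` tuple again stands in for a SQLAlchemy expression.

```python
from enum import Enum
from typing import Any, Tuple

Expr = Tuple[str, Tuple[Any, ...]]


class FilterOperatorSketch(str, Enum):
    """Hypothetical two-member subset of the filter operator enum."""

    EQUALS = "=="
    CONTAINS = "CONTAINS"


class ValidationErrorSketch(Exception):
    """Stand-in for QueryObjectValidationError."""


class NoArraysSpec:
    """Engine that has not opted in (for example, sqlite in the tests)."""

    supports_multivalue_columns = False


class ArraysSpec:
    """Engine that has opted in and emits a ClickHouse-style has() call."""

    supports_multivalue_columns = True

    @staticmethod
    def array_contains(column: str, value: Any) -> Expr:
        return f"has({column}, ?)", (value,)


def translate_filter(spec: Any, op: FilterOperatorSketch, column: str, value: Any) -> Expr:
    """Translate one filter clause, delegating CONTAINS to the engine spec."""
    if op is FilterOperatorSketch.CONTAINS:
        if not spec.supports_multivalue_columns:
            # Fail early with a clear message instead of emitting invalid SQL.
            raise ValidationErrorSketch(
                f"CONTAINS is not supported on this engine (column: {column})"
            )
        return spec.array_contains(column, value)
    return f"{column} = ?", (value,)
```

Because the operator is looked up from the enum by value, a payload carrying the string `"CONTAINS"` round-trips to the enum member without a separate allow-list.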
Let users group/aggregate by the element count of an array column, produced dialect-agnostically via the engine-spec capability layer.

- Add a multi-value modifier column shape to AdhocColumn: a base `column` plus a `columnOperation` (e.g. "LENGTH") instead of a literal sqlExpression. Add the MultiValueColumnOperation enum and an is_multivalue_operation_column detector.
- Resolve the shape in SqlaTable via a new _multivalue_column_to_sqla helper: it looks up the base array column and calls db_engine_spec.array_length(), typing the result NUMERIC. Unsupported engines, unknown base columns and unknown operations all raise a clear QueryObjectValidationError.
- Route the shape through both the grouped (adhoc_column_to_sqla) and raw-columns branches of the query builder.

Because the derived column is computed live from the engine spec there is no stored virtual column to persist, so nothing special is needed to survive "Sync columns from database".

Tests: end-to-end length(skills) generation through get_sqla_query, NUMERIC typing, and the unsupported-engine / unknown-column / unknown-operation guards.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>
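A rough sketch of the modifier column shape and its resolution follows. The detector body and the resolver are assumptions about how such a shape could be handled, not the PR's implementation: `multivalue_column_to_sql` is a hypothetical name, it returns plain SQL text rather than a SQLAlchemy column, and it raises `ValueError` where the real helper raises QueryObjectValidationError.

```python
from enum import Enum
from typing import Any, Dict, Set


class MultiValueColumnOperation(str, Enum):
    LENGTH = "LENGTH"


# ClickHouse-style function name per operation (illustrative only).
FUNCTION_BY_OPERATION = {MultiValueColumnOperation.LENGTH: "length"}


def is_multivalue_operation_column(column: Any) -> bool:
    """Assumed detector: a dict carrying a base column and an operation."""
    return (
        isinstance(column, dict)
        and "column" in column
        and "columnOperation" in column
    )


def multivalue_column_to_sql(column: Dict[str, str], dataset_columns: Set[str]) -> str:
    """Resolve {column, columnOperation} to SQL text, rejecting bad input."""
    base = column["column"]
    if base not in dataset_columns:
        raise ValueError(f"Unknown base column: {base}")
    try:
        operation = MultiValueColumnOperation(column["columnOperation"])
    except ValueError:
        raise ValueError(
            f"Unknown column operation: {column['columnOperation']}"
        ) from None
    return f"{FUNCTION_BY_OPERATION[operation]}({base})"
```

For example, `multivalue_column_to_sql({"column": "skills", "columnOperation": "LENGTH"}, {"skills"})` yields `length(skills)`, and nothing about the derived column needs to be stored on the dataset.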
Add an EXPLODE operation to the multi-value modifier column so users can group by individual elements of an array column.

- Wire EXPLODE into _multivalue_column_to_sqla: it calls db_engine_spec.array_explode(), which on ClickHouse is the scalar arrayJoin(col) and projects directly into SELECT/GROUP BY (no JOIN needed).
- Future-safe guard: set-returning UNNEST dialects (Postgres/Trino/BigQuery) require CROSS JOIN UNNEST plumbing and are deferred to phase 2 (post live validation). Those engines leave array_explode unimplemented, so we catch NotImplementedError and raise a clear QueryObjectValidationError instead of emitting invalid SQL.

The explode modifier is part of the query object, so exploded and non-exploded queries naturally get distinct cache keys.

Note (ClickHouse semantics): arrayJoin behaves like an INNER JOIN: rows with empty arrays drop out, changing totals. To be confirmed against a live instance in Stage E.

Tests: arrayJoin(skills) generation through get_sqla_query, explode-vs-plain SQL divergence, the unsupported-engine guard, and a simulated UNNEST-only dialect getting a clean error rather than bad SQL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>
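The empty-array drop-out noted above can be reproduced in a few lines of plain Python. This only imitates the row-multiplying behaviour the commit attributes to arrayJoin; the sample rows and the `array_join` helper are invented for illustration and nothing here talks to ClickHouse.

```python
from collections import Counter
from typing import Any, Dict, List

# Invented sample data: user "b" has an empty array.
rows: List[Dict[str, Any]] = [
    {"user": "a", "skills": ["sql", "python"]},
    {"user": "b", "skills": []},
    {"user": "c", "skills": ["sql"]},
]


def array_join(table: List[Dict[str, Any]], key: str) -> List[Dict[str, Any]]:
    """Emit one output row per array element; empty arrays emit nothing."""
    return [{**row, key: element} for row in table for element in row[key]]


exploded = array_join(rows, "skills")

# Grouping by the exploded element, as a GROUP BY on the explode column would.
rows_per_skill = Counter(row["skills"] for row in exploded)

# Distinct users before and after: "b" is gone once the column is exploded.
users_before = {row["user"] for row in rows}
users_after = {row["user"] for row in exploded}
```

Row counts after exploding reflect elements rather than source rows, which is why totals computed on an exploded query differ from totals on the plain one.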
…rator

Surface multi-value columns and the membership filter in Explore.

- ColumnTypeLabel renders a distinct list icon for GenericDataType.MultiValue so array columns are visually identifiable in the column tree.
- Add the Contains operator (CONTAINS) to the Explore operator enum and the operator->SQL label map.
- Gate operators by column type in the adhoc filter editor: CONTAINS is shown only for multi-value columns, and multi-value columns expose only CONTAINS / IS NULL / IS NOT NULL (scalar comparators are hidden). Capability is implied by the backend classifying the column as MULTI_VALUE, so no separate frontend flag is needed.

Tests: multi-value icon rendering, a GenericDataType enum-parity test guarding the values shared with the backend, and operator-visibility tests (CONTAINS shown for array columns, hidden for scalar ones).

Note: the Length/Explode dimension modifier UI (controls that emit the columnOperation payload) is a separate follow-up; those operations are already exercisable via the query API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>
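The gating rule is implemented in the frontend, but its logic is easy to state in Python for consistency with the other sketches here. This is an illustration under assumptions: `GenericDataTypeSketch` mirrors the enum values (only `MULTI_VALUE = 4` is stated in this PR), the scalar operator list is an arbitrary subset, and `operators_for` is a hypothetical helper.

```python
from enum import IntEnum
from typing import List


class GenericDataTypeSketch(IntEnum):
    NUMERIC = 0
    STRING = 1
    TEMPORAL = 2
    BOOLEAN = 3
    MULTI_VALUE = 4


# Illustrative subset of scalar comparators, not the full Explore list.
SCALAR_OPERATORS = ["==", "!=", "IN", "NOT IN", "LIKE"]
NULL_OPERATORS = ["IS NULL", "IS NOT NULL"]


def operators_for(type_generic: int) -> List[str]:
    """Return the operators the filter editor should offer for a column type."""
    if type_generic == GenericDataTypeSketch.MULTI_VALUE:
        # Array columns: membership and null checks only.
        return ["CONTAINS"] + NULL_OPERATORS
    # Scalar columns never see CONTAINS.
    return SCALAR_OPERATORS + NULL_OPERATORS
```

Keying the decision on the column's generic type alone is what lets the frontend avoid a separate capability flag.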
…ation

Found during live validation against a real ClickHouse instance: executing a query with a Length/Explode modifier column failed with "Columns missing in dataset", even though SQL generation worked. The query-context validation in query_context_processor.py extracts physical column names via get_column_name_from_column, which returned the modifier dict (treating it as a physical column) instead of None.

- get_column_name_from_column now returns None for multi-value operation columns, matching how adhoc (sqlExpression) columns are already handled, so validation ignores them.

Also make the multi-value model tests import the ClickHouse engine spec lazily inside their helpers. clickhouse.py reads app.config at module-import time (inside a try/except ImportError around clickhouse-connect); once the driver is installed that import-time access fails at pytest collection without an app context. Deferring the import matches the existing pattern in tests/.../db_engine_specs/test_clickhouse.py.

Adds a regression test asserting modifier columns are excluded from physical column-name extraction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>
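The fix can be pictured with a small stand-alone model of the validation step. The functions below are hypothetical simplifications (`get_column_name` and `missing_columns` are not the real Superset helpers): any dict-shaped column, whether an adhoc `sqlExpression` column or a `{column, columnOperation}` modifier, yields no physical name and is therefore skipped by the missing-column check.

```python
from typing import Any, Dict, List, Optional, Set, Union

Column = Union[str, Dict[str, Any]]


def get_column_name(column: Column) -> Optional[str]:
    """Return the physical column name, or None for derived columns."""
    if isinstance(column, dict):
        # Adhoc and multi-value modifier columns are computed, not physical.
        return None
    return column


def missing_columns(requested: List[Column], dataset_columns: Set[str]) -> Set[str]:
    """Names requested as physical columns that the dataset does not have."""
    names = {
        name
        for name in (get_column_name(column) for column in requested)
        if name is not None
    }
    return names - dataset_columns
```

With this shape, a query requesting `["country", {"column": "skills", "columnOperation": "LENGTH"}]` validates cleanly, while a genuinely unknown physical column is still reported.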
…ttier)

- Type the multi-value test helpers with AdhocColumn/Column and QueryObjectDict, and exercise the raise paths via get_query_str_extended() instead of get_sqla_query(**dict) (avoids mypy TypedDict-unpacking false positives).
- Import ClickHouseEngineSpec lazily inside tests: clickhouse.py reads app.config at import time, which is unavailable at pytest collection once clickhouse-connect is installed.
- Apply prettier formatting to the adhoc filter simple-tab test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>
SUMMARY
Adds first-class support for array-typed (multi-value) columns in Explore, designed to be dialect-agnostic via an engine-spec capability layer. ClickHouse is the first concrete implementation; other array dialects opt in by implementing three methods.
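Three operations are supported on array columns:
- Contains (membership filter): `has(col, v)`
- Length (element count): `length(col)`
- Explode (one row per element): `arrayJoin(col)`

Design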
- `GenericDataType.MULTI_VALUE` (= 4), synced across the backend enum (`superset/utils/core.py`) and the frontend enum (`@apache-superset/core/common`).
- `BaseEngineSpec`: a `supports_multivalue_columns` flag (default `False`, so existing engines are unaffected) plus `array_contains` / `array_length` / `array_explode`, returning SQLAlchemy expressions (proper binding/quoting per dialect, no raw SQL strings).
- ClickHouse maps `Array(...)` columns to `MULTI_VALUE` and implements the three methods.
- A modifier column shape (`{column, columnOperation}`) resolved through the engine spec, so the same payload works on any array dialect.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
UI changes (screenshots to be attached):
- Before: `Array(...)` columns were treated as generic `STRING`, with no distinct icon and no array-aware operators.

TESTING INSTRUCTIONS
Prerequisites: a database with array columns. Validated live against ClickHouse (`clickhouse-connect`) on a 22.8M-row table with 11 `Array(...)` columns.

Backend (unit):
Frontend (unit):
Manual (ClickHouse):
1. Use a dataset with an `Array(...)` column.
2. Confirm the column is classified as `MULTI_VALUE` (`type_generic == 4`) and shows the list icon in Explore.
3. In the adhoc filter editor for that column, only `Contains` / `Is null` / `Is not null` are offered. Filter with a value → SQL emits `has(col, '<value>')`.
4. Send `{"column": "<arr>", "columnOperation": "LENGTH"}` (via the query payload) → SQL emits `length(col)`, returns a numeric distribution.
5. Send `{"column": "<arr>", "columnOperation": "EXPLODE"}` → SQL emits `arrayJoin(col)`, one row per element. Note that rows with empty arrays drop out (arrayJoin behaves like an INNER JOIN); this is expected.

Live validation results: classification, `has()` / `length()` / `arrayJoin()` rendering and execution all confirmed through the real ClickHouse dialect; the empty-array drop-out (~42% of rows in the test table) behaved as documented.

ADDITIONAL INFORMATION
Scope / follow-ups (phase 2):
- Set-returning `UNNEST` dialects (PostgreSQL/Trino/BigQuery) require `CROSS JOIN UNNEST` plumbing in the query builder; they are guarded to raise a clear error instead of emitting invalid SQL, and are deferred.
- The Length/Explode dimension modifier UI (controls that emit the `columnOperation` payload) is not yet built; the operations are usable via the query API today.
- The filter editor's default operator for a multi-value column is still `IN` (the dropdown correctly hides scalar operators, but the stale default should reset to `Contains`).

🤖 Generated with Claude Code