feat(multi-value): first-class support for array-typed dimensions (ClickHouse MVP)#41279

Draft
thedeceptio wants to merge 7 commits into
apache:masterfrom
thedeceptio:feat/multi-value-dimension-support

@thedeceptio thedeceptio commented Jun 22, 2026

SUMMARY

Adds first-class support for array-typed (multi-value) columns in Explore, designed to be dialect-agnostic via an engine-spec capability layer. ClickHouse is the first concrete implementation; other dialects with array types opt in by setting the supports_multivalue_columns flag and implementing three methods.

Three operations are supported on array columns:

  • Contains — filter rows where the array includes a value → has(col, v)
  • Length — numeric dimension of element count → length(col)
  • Explode — group by individual elements → arrayJoin(col)

Design

  • New semantic type GenericDataType.MULTI_VALUE (= 4), synced across the backend enum (superset/utils/core.py) and the frontend enum (@apache-superset/core/common).
  • Opt-in capability contract on BaseEngineSpec: a supports_multivalue_columns flag (default False, so existing engines are unaffected) plus array_contains / array_length / array_explode, returning SQLAlchemy expressions (proper binding/quoting per dialect — no raw SQL strings).
  • ClickHouse reclassifies Array(...) columns to MULTI_VALUE and implements the three methods.
  • Length/Explode use a small adhoc-column shape ({column, columnOperation}) resolved through the engine spec, so the same payload works on any array dialect.
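
The opt-in contract described above could look roughly like the following. This is a minimal sketch, not the code in this PR: `FuncCall`, `BaseSpecSketch` and `ClickHouseSpecSketch` are hypothetical stand-ins, and `FuncCall` only imitates the SQLAlchemy expressions that the real methods return.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FuncCall:
    """Hypothetical stand-in for a SQLAlchemy function expression."""

    name: str
    args: tuple


class BaseSpecSketch:
    """Simplified stand-in for BaseEngineSpec; not the real class."""

    # Off by default, so engines that do not opt in are unaffected.
    supports_multivalue_columns = False

    @classmethod
    def array_contains(cls, column: str, value: Any) -> FuncCall:
        raise NotImplementedError

    @classmethod
    def array_length(cls, column: str) -> FuncCall:
        raise NotImplementedError

    @classmethod
    def array_explode(cls, column: str) -> FuncCall:
        raise NotImplementedError


class ClickHouseSpecSketch(BaseSpecSketch):
    """Opts in and maps each capability to its ClickHouse function name."""

    supports_multivalue_columns = True

    @classmethod
    def array_contains(cls, column: str, value: Any) -> FuncCall:
        return FuncCall("has", (column, value))

    @classmethod
    def array_length(cls, column: str) -> FuncCall:
        return FuncCall("length", (column,))

    @classmethod
    def array_explode(cls, column: str) -> FuncCall:
        return FuncCall("arrayJoin", (column,))
```

Keeping the value as a separate argument rather than formatting it into a string mirrors the intent of returning bound expressions instead of raw SQL.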

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

UI changes (screenshots to be attached):

  • Before: Array(...) columns were treated as generic STRING — no distinct icon, no array-aware operators.
  • After:
    • Array columns render a distinct list icon in the column tree.
    • The adhoc filter editor offers a Contains operator for array columns and hides scalar operators (Equals/Greater than/Like/…), exposing only Contains / Is null / Is not null.

TESTING INSTRUCTIONS

Prerequisites: a database with array columns. Validated live against ClickHouse (clickhouse-connect) on a 22.8M-row table with 11 Array(...) columns.

Backend (unit):

pytest tests/unit_tests/db_engine_specs/test_clickhouse.py \
       tests/unit_tests/db_engine_specs/test_base.py \
       tests/unit_tests/models/test_multivalue_filter.py \
       tests/unit_tests/models/test_multivalue_length.py \
       tests/unit_tests/models/test_multivalue_explode.py

Frontend (unit):

cd superset-frontend
npx jest packages/superset-ui-chart-controls/test/components/ColumnTypeLabel.test.tsx
npx jest src/explore/components/controls/FilterControl/AdhocFilterEditPopoverSimpleTabContent

Manual (ClickHouse):

  1. Connect a ClickHouse database and create a dataset over a table with an Array(...) column.
  2. Confirm the array column is classified MULTI_VALUE (type_generic == 4) and shows the list icon in Explore.
  3. Contains: add an adhoc filter on the array column → only Contains / Is null / Is not null are offered. Filter with a value → SQL emits has(col, '<value>').
  4. Length: add a dimension {"column": "<arr>", "columnOperation": "LENGTH"} (via the query payload) → SQL emits length(col), returns a numeric distribution.
  5. Explode: add a dimension {"column": "<arr>", "columnOperation": "EXPLODE"} → SQL emits arrayJoin(col), one row per element. Note that rows with empty arrays drop out of the result; this is expected, because arrayJoin behaves like an INNER JOIN.
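
To make the empty-array behaviour in step 5 concrete, here is a small pure-Python imitation of the explode semantics. It is only an illustration: the `array_join` helper and the sample rows are made up for this sketch and do not touch ClickHouse.

```python
from collections import Counter


def array_join(rows: list, column: str) -> list:
    """Emit one output row per array element; rows with empty arrays emit nothing."""
    return [{**row, column: element} for row in rows for element in row[column]]


# Hypothetical sample data: row 2 has an empty array.
rows = [
    {"id": 1, "skills": ["python", "sql"]},
    {"id": 2, "skills": []},
    {"id": 3, "skills": ["sql"]},
]

exploded = array_join(rows, "skills")

# Grouping by the exploded element, as a chart dimension would.
counts = Counter(row["skills"] for row in exploded)
```

Row 2 contributes nothing to `exploded`, so any total computed over the exploded result excludes it. That is the shift in totals to keep in mind when comparing exploded and non-exploded charts.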

Live validation results: classification, has()/length()/arrayJoin() rendering and execution all confirmed through the real ClickHouse dialect; the empty-array drop-out (~42% of rows in the test table) behaved as documented.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

Scope / follow-ups (phase 2):

  • Explode is ClickHouse-only. Set-returning UNNEST dialects (PostgreSQL/Trino/BigQuery) require CROSS JOIN UNNEST plumbing in the query builder; they are guarded to raise a clear error instead of emitting invalid SQL, and are deferred.
  • Length/Explode modifier UI (controls that emit the columnOperation payload) is not yet built; the operations are usable via the query API today.
  • Minor polish: switching a filter's subject to a multi-value column leaves the operator defaulted to IN (the dropdown correctly hides scalar operators, but the stale default should reset to Contains).

🤖 Generated with Claude Code

thedeceptio and others added 7 commits June 22, 2026 14:31
…y contract

Introduce first-class multi-value (array-typed) column support, designed to
work across any SQL dialect that supports arrays. This stage adds the semantic
type and the dialect-agnostic capability layer, with ClickHouse as the first
concrete implementation.

- Activate GenericDataType.MULTI_VALUE (= 4) on both the backend enum
  (superset/utils/core.py) and the frontend enum (superset-core/common).
- Add an opt-in capability contract to BaseEngineSpec:
  supports_multivalue_columns flag (default False) plus array_contains,
  array_length and array_explode methods (default NotImplementedError) so
  engines that have not opted in keep treating arrays as strings.
- ClickHouse: set the flag, reclassify Array(...) columns as MULTI_VALUE, and
  implement has()/length()/arrayJoin() as bound SQLAlchemy expressions.

Tests: flip the existing Array->STRING assertion to MULTI_VALUE, add nested
array variants, per-capability SQL-compilation tests, a bound-parameter
(injection-safety) test, and negative tests asserting the base spec stays
disabled and raises. array_explode is implemented and unit-tested but not yet
wired into the query builder (Stage C).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>

Wire a new CONTAINS filter operator end-to-end so users can filter
multi-value (array) columns by membership.

- Add FilterOperator.CONTAINS (and the FilterStringOperators mirror). The
  charts API schema derives its allowed operators from the enum, so CONTAINS is
  accepted without any extra allow-list change.
- Translate CONTAINS in the query builder (models/helpers.py): route to
  db_engine_spec.array_contains() so each dialect emits its native membership
  call, and raise a clear QueryObjectValidationError when the engine does not
  support multi-value columns.
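
The routing described in the second bullet can be sketched as below. This is an assumption-laden illustration rather than the code in models/helpers.py: `QueryValidationError`, `FuncCall`, `NoArraySpec`, `HasSpec` and `translate_contains` are all hypothetical names used only here.

```python
from dataclasses import dataclass
from typing import Any


class QueryValidationError(Exception):
    """Local stand-in for Superset's QueryObjectValidationError."""


@dataclass(frozen=True)
class FuncCall:
    """Hypothetical stand-in for a SQLAlchemy function expression."""

    name: str
    args: tuple


class NoArraySpec:
    """An engine that has not opted in."""

    supports_multivalue_columns = False


class HasSpec:
    """An engine that opted in and uses a has(col, value) membership call."""

    supports_multivalue_columns = True

    @classmethod
    def array_contains(cls, column: str, value: Any) -> FuncCall:
        return FuncCall("has", (column, value))


def translate_contains(spec: type, column: str, value: Any) -> FuncCall:
    """Route a CONTAINS filter to the engine, or fail with a clear error."""
    if not getattr(spec, "supports_multivalue_columns", False):
        raise QueryValidationError(
            f"CONTAINS is not supported by {spec.__name__}"
        )
    return spec.array_contains(column, value)
```

Checking the flag before calling the method means an unsupported engine fails at validation time instead of producing SQL that the database rejects.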

Tests: operator round-trip, an unsupported-engine guard (sqlite raises), and a
positive path that drives the real get_sqla_query pipeline with a ClickHouse
engine spec and asserts the generated SQL contains has(...). The dataset
compiles against sqlite so no ClickHouse driver is required (func.has renders
dialect-agnostically).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>

Let users group/aggregate by the element count of an array column, produced
dialect-agnostically via the engine-spec capability layer.

- Add a multi-value modifier column shape to AdhocColumn: a base `column` plus a
  `columnOperation` (e.g. "LENGTH") instead of a literal sqlExpression. Add the
  MultiValueColumnOperation enum and an is_multivalue_operation_column detector.
- Resolve the shape in SqlaTable via a new _multivalue_column_to_sqla helper:
  it looks up the base array column and calls db_engine_spec.array_length(),
  typing the result NUMERIC. Unsupported engines, unknown base columns and
  unknown operations all raise a clear QueryObjectValidationError.
- Route the shape through both the grouped (adhoc_column_to_sqla) and raw-columns
  branches of the query builder.
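
A rough sketch of detecting and validating the {column, columnOperation} shape follows. The enum name and the detector name come from this commit message; their bodies, the `resolve_modifier` helper, and the use of ValueError in place of QueryObjectValidationError are assumptions made for illustration.

```python
from enum import Enum
from typing import Any


class MultiValueColumnOperation(str, Enum):
    LENGTH = "LENGTH"
    EXPLODE = "EXPLODE"


def is_multivalue_operation_column(column: Any) -> bool:
    """True for dict-shaped columns that carry a base column and an operation."""
    return (
        isinstance(column, dict)
        and "column" in column
        and "columnOperation" in column
    )


def resolve_modifier(column: dict, known_columns: set) -> tuple:
    """Validate the shape and return (base column name, operation)."""
    base = column["column"]
    if base not in known_columns:
        raise ValueError(f"Unknown column: {base}")
    try:
        operation = MultiValueColumnOperation(column["columnOperation"])
    except ValueError:
        # Re-raise with a message that names the offending operation.
        raise ValueError(
            f"Unknown column operation: {column['columnOperation']}"
        ) from None
    return base, operation
```

For example, `resolve_modifier({"column": "skills", "columnOperation": "LENGTH"}, {"skills"})` yields the base column name together with `MultiValueColumnOperation.LENGTH`, which the caller can then hand to the engine spec.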

Because the derived column is computed live from the engine spec there is no
stored virtual column to persist, so nothing special is needed to survive
"Sync columns from database".

Tests: end-to-end length(skills) generation through get_sqla_query, NUMERIC
typing, and the unsupported-engine / unknown-column / unknown-operation guards.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>

Add an EXPLODE operation to the multi-value modifier column so users can group
by individual elements of an array column.

- Wire EXPLODE into _multivalue_column_to_sqla: it calls
  db_engine_spec.array_explode(), which on ClickHouse is the scalar
  arrayJoin(col) and projects directly into SELECT/GROUP BY (no JOIN needed).
- Future-safe guard: set-returning UNNEST dialects (Postgres/Trino/BigQuery)
  require CROSS JOIN UNNEST plumbing and are deferred to phase 2 (post live
  validation). Those engines leave array_explode unimplemented, so we catch
  NotImplementedError and raise a clear QueryObjectValidationError instead of
  emitting invalid SQL.

The explode modifier is part of the query object, so exploded and non-exploded
queries naturally get distinct cache keys.
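
The cache-key point can be illustrated with a toy key function. This assumes a key derived by hashing the serialized query object; the `cache_key` helper and the two sample payloads are hypothetical, and Superset's actual key derivation is not shown in this PR.

```python
import hashlib
import json


def cache_key(query_object: dict) -> str:
    """Hash a canonical JSON serialization of the query object."""
    payload = json.dumps(query_object, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Same base column, with and without the explode modifier.
plain = {"columns": ["skills"], "metrics": ["count"]}
exploded = {
    "columns": [{"column": "skills", "columnOperation": "EXPLODE"}],
    "metrics": ["count"],
}

plain_key = cache_key(plain)
exploded_key = cache_key(exploded)
```

Because the modifier changes the serialized payload, the two keys differ without any explode-specific handling in the cache layer.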

Note (ClickHouse semantics): arrayJoin behaves like an INNER JOIN — rows with
empty arrays drop out, changing totals. To be confirmed against a live instance
in Stage E.

Tests: arrayJoin(skills) generation through get_sqla_query, explode-vs-plain SQL
divergence, the unsupported-engine guard, and a simulated UNNEST-only dialect
getting a clean error rather than bad SQL.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>

…rator

Surface multi-value columns and the membership filter in Explore.

- ColumnTypeLabel renders a distinct list icon for GenericDataType.MultiValue so
  array columns are visually identifiable in the column tree.
- Add the Contains operator (CONTAINS) to the Explore operator enum and the
  operator->SQL label map.
- Gate operators by column type in the adhoc filter editor: CONTAINS is shown
  only for multi-value columns, and multi-value columns expose only
  CONTAINS / IS NULL / IS NOT NULL (scalar comparators are hidden). Capability is
  implied by the backend classifying the column as MULTI_VALUE, so no separate
  frontend flag is needed.

Tests: multi-value icon rendering, a GenericDataType enum-parity test guarding
the values shared with the backend, and operator-visibility tests (CONTAINS
shown for array columns, hidden for scalar ones).

Note: the Length/Explode dimension modifier UI (controls that emit the
columnOperation payload) is a separate follow-up; those operations are already
exercisable via the query API.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>

…ation

Found during live validation against a real ClickHouse instance: executing a
query with a Length/Explode modifier column failed with "Columns missing in
dataset", even though SQL generation worked. The query-context validation in
query_context_processor.py extracts physical column names via
get_column_name_from_column, which returned the modifier dict (treating it as a
physical column) instead of None.

- get_column_name_from_column now returns None for multi-value operation columns,
  matching how adhoc (sqlExpression) columns are already handled, so validation
  ignores them.
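
The fix amounts to treating modifier columns like adhoc columns during physical-name extraction. A simplified sketch, assuming dict-shaped columns: the function names follow the commit message, but the bodies and the `physical_column_names` helper are illustrative and not copied from the codebase.

```python
from typing import Any, Optional


def is_adhoc_column(column: Any) -> bool:
    """Simplified check for a free-form SQL expression column."""
    return isinstance(column, dict) and "sqlExpression" in column


def is_multivalue_operation_column(column: Any) -> bool:
    """Simplified check for the {column, columnOperation} modifier shape."""
    return (
        isinstance(column, dict)
        and "column" in column
        and "columnOperation" in column
    )


def get_column_name_from_column(column: Any) -> Optional[str]:
    """Return the physical column name, or None for derived columns."""
    if is_adhoc_column(column) or is_multivalue_operation_column(column):
        return None
    return column


def physical_column_names(columns: list) -> list:
    """Names that validation should check against the dataset."""
    names = (get_column_name_from_column(column) for column in columns)
    return [name for name in names if name is not None]
```

With this in place, a payload that mixes `"country"` with a LENGTH modifier on an array column only asks validation to look up `"country"`.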

Also make the multi-value model tests import the ClickHouse engine spec lazily
inside their helpers. clickhouse.py reads app.config at module-import time
(inside a try/except ImportError around clickhouse-connect); once the driver is
installed that import-time access fails at pytest collection without an app
context. Deferring the import matches the existing pattern in
tests/.../db_engine_specs/test_clickhouse.py.

Adds a regression test asserting modifier columns are excluded from physical
column-name extraction.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>

…ttier)

- Type the multi-value test helpers with AdhocColumn/Column and QueryObjectDict,
  and exercise the raise paths via get_query_str_extended() instead of
  get_sqla_query(**dict) (avoids mypy TypedDict-unpacking false positives).
- Import ClickHouseEngineSpec lazily inside tests: clickhouse.py reads app.config
  at import time, which is unavailable at pytest collection once
  clickhouse-connect is installed.
- Apply prettier formatting to the adhoc filter simple-tab test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: thedeceptio <thedeceptio@gmail.com>
@thedeceptio thedeceptio force-pushed the feat/multi-value-dimension-support branch from 5eb1fcb to 66a9e99 Compare June 22, 2026 09:01
netlify Bot commented Jun 22, 2026

Deploy Preview for superset-docs-preview ready!

Name Link
🔨 Latest commit 66a9e99
🔍 Latest deploy log https://app.netlify.com/projects/superset-docs-preview/deploys/6a38fa053c9dcc00088e8d07
😎 Deploy Preview https://deploy-preview-41279--superset-docs-preview.netlify.app