feat: Hive parser hardening (STRUCT/DIV/OFFSET) + Matrix CTE collapse #1

Merged: cooker-code merged 4 commits into master from feat/hive-struct-named-fields on May 14, 2026

@cooker-code
Owner

Summary

Bundles two independent workstreams that were developed on the same branch:

1. Hive SQL parser hardening (3 commits)

Targets 5 of the 6 real parse failures in the conan SQL corpus (auditId 2479, 2482, 2497*, 2568, 2571). Four of them are fixed here; 2497 is deferred. Each root cause gets its own focused commit, and all three fixes share one FlowscopeHiveDialect wrapper introduced under crates/flowscope-core/src/dialect_ext/.

| Commit | Root cause | Fix | Affected auditIds |
|---|---|---|---|
| 753f20c | sqlparser-rs 0.61 HiveDialect inherits supports_struct_literal() = false, so STRUCT(a AS w, ...) (Hive/Spark named-field syntax) fails with "Expected: ), found: AS" | New FlowscopeHiveDialect wraps upstream HiveDialect and enables struct literals; wired through Dialect::to_sqlparser_dialect() | 2479, 2482, 2571 |
| 2c25fe7 | a DIV b (Hive integer division) is only wired into MySqlDialect's parse_infix upstream | Mirror MySQL impl in FlowscopeHiveDialect, lower to existing BinaryOperator::MyIntegerDivide so analyzer code stays dialect-agnostic | 2568 |
| 8b8448c | sqlparser-rs keeps OFFSET in global RESERVED_FOR_TABLE_ALIAS, but Hive grammar has no OFFSET clause, so CROSS JOIN (...) offset is legal alias usage | Override Dialect::is_table_alias in FlowscopeHiveDialect to remove only OFFSET from the reserved set; negative test guards against accidental loosening | 2482 |

*auditId=2497 (INSERT ... PARTITION (...) (SELECT ...)) is NOT in this PR — to be handled separately.
auditId=2550 is intentionally skipped (user SQL bug: CTE missing as keyword).

2. Matrix CTE collapse (1 commit, 2ec9324)

Tables sub-mode previously rendered WITH-clause CTE aliases as first-class rows/columns alongside physical tables, diluting the cross-script blueprint signal (~30%+ noise in typical scripts).

  • New Layers toggle (default ON) hides CTE rows/columns.
  • BFS over write cells reconstructs physical→physical dependencies that were previously reachable only via CTE chains; rebuilt edges render with dashed half-opacity arrows plus a tooltip showing the CTE hop chain (visually distinct from direct edges).
  • Toggle is hidden in Scripts sub-mode (no overlap with CTE keys).
  • Worker payload gains cteItemKeys: string[]; collapse + metric recomputation happen on the main thread for instant toggle response.
  • Refactor: MatrixMetrics + computeMatrixMetrics extracted to matrixUtils as single source of truth, eliminating a 49-line duplicate that lived in both worker and MatrixView.
  • Docs: CLAUDE.md and ui-change-protocol.md updated to clarify that the Cursor browser MCP (Playwright) reliably operates Radix DropdownMenu, while the external agent-browser CLI is still unreliable.
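
The collapse step described in the bullets above can be sketched as follows. This is a minimal, std-only illustration written in Rust for consistency with the parser side of this PR; the actual implementation lives in the TypeScript matrix code (`collapseCteFromMatrix`) and may differ. The names `collapse_ctes` and `CollapsedEdge`, and the modelling of `write` cells as `(source, target)` pairs, are assumptions made for the example.

```rust
use std::collections::{BTreeMap, HashSet, VecDeque};

/// A physical -> physical dependency after CTEs are collapsed.
/// `via_ctes` is empty for a direct edge, otherwise the CTE hop chain.
/// (Hypothetical type; not taken from the FlowScope codebase.)
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CollapsedEdge {
    pub source: String,
    pub target: String,
    pub via_ctes: Vec<String>,
}

/// Forward BFS over `writes` (source -> target pairs), walking only
/// through keys listed in `cte_keys` and stopping at physical tables.
pub fn collapse_ctes(writes: &[(&str, &str)], cte_keys: &[&str]) -> Vec<CollapsedEdge> {
    let ctes: HashSet<&str> = cte_keys.iter().copied().collect();

    // Adjacency list: source key -> keys it writes into.
    let mut adj: BTreeMap<&str, Vec<&str>> = BTreeMap::new();
    for &(src, dst) in writes {
        adj.entry(src).or_default().push(dst);
    }

    let mut found: BTreeMap<(String, String), Vec<String>> = BTreeMap::new();
    for &src in adj.keys() {
        if ctes.contains(src) {
            continue; // only physical tables start a walk
        }
        let mut visited: HashSet<&str> = HashSet::new();
        let mut queue: VecDeque<(&str, Vec<String>)> = VecDeque::new();
        queue.push_back((src, Vec::new()));

        while let Some((node, path)) = queue.pop_front() {
            let Some(nexts) = adj.get(node) else { continue };
            for &next in nexts {
                if ctes.contains(next) {
                    // Keep walking through the CTE, remembering the hop.
                    if visited.insert(next) {
                        let mut longer = path.clone();
                        longer.push(next.to_string());
                        queue.push_back((next, longer));
                    }
                } else {
                    // Physical target reached. The first hit wins, so a
                    // direct edge (empty path) is never overwritten by a
                    // longer CTE chain found later in the BFS.
                    found
                        .entry((src.to_string(), next.to_string()))
                        .or_insert_with(|| path.clone());
                }
            }
        }
    }

    found
        .into_iter()
        .map(|((source, target), via_ctes)| CollapsedEdge { source, target, via_ctes })
        .collect()
}
```

Because the walk is breadth-first and a source's direct neighbours are processed before any CTE hop, direct edges take priority and a rebuilt edge carries the shortest CTE chain. A renderer could then treat a non-empty `via_ctes` as the cue for the dashed arrow and tooltip.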

Test plan

  • cargo test --workspace passes (Hive parser changes)
  • yarn workspace @pondpilot/flowscope-react test --run — 201/201 passing, includes 5 new collapseCteFromMatrix cases (single-hop, multi-hop, direct-edge-priority, empty CTE set, chain-isolation)
  • yarn workspace @pondpilot/flowscope-react lint && typecheck clean
  • End-to-end: dev server + Cursor browser MCP: the toggle hides 18 CTEs in 07_cohort_analysis.sql; turning it off restores them; Scripts sub-mode hides the toggle button
  • gitnexus impact analysis: risk_level=medium, only MatrixView → UseLineageStore and MatrixView → GetWorker processes affected, both expected
  • Reviewer to confirm that the four conan corpus SQLs fixed here (auditId 2479/2482/2568/2571) now parse successfully against this branch

Made with Cursor

wangliangbj01 and others added 4 commits May 14, 2026 12:21
The Tables sub-mode of MatrixView previously surfaced WITH-clause CTE
aliases as first-class rows/columns alongside physical tables, diluting
the cross-script blueprint signal (typical scripts have 30%+ CTE noise).

Add a Layers toggle (default ON) that:
- Filters CTE rows/columns out of the rendered matrix.
- Reconstructs physical->physical dependencies that were previously
  reachable only via CTE chains, using a forward BFS over `write` cells
  with the visited-CTE path captured in `viaCtes`.
- Renders rebuilt edges with a dashed half-opacity arrow and a tooltip
  showing the CTE hop chain, distinct from direct edges.
- Hides itself in Scripts sub-mode (no overlap with CTE keys).

Worker payload gains `cteItemKeys: string[]`; collapse + metric
recomputation happen on the main thread for instant toggle response.

Refactor: extract `MatrixMetrics` and `computeMatrixMetrics` into
matrixUtils as the single source of truth, eliminating a 49-line
duplicate that previously lived in both the worker and MatrixView.

Docs: clarify in CLAUDE.md and ui-change-protocol.md that the Cursor
browser MCP (Playwright-based) reliably operates Radix DropdownMenu
buttons, while the external agent-browser CLI is still unreliable.

Tests: 5 new cases cover single-hop, multi-hop, direct-edge-priority,
empty CTE set, and chain-isolation scenarios. 201/201 passing.

Co-authored-by: Cursor <cursoragent@cursor.com>
…support

Real-world Hive / Spark SQL uses `struct(field1 AS name1, ...)` (e.g.
`collect_list(struct(...))`) but sqlparser-rs 0.61's `HiveDialect`
inherits `supports_struct_literal()` = false from the base `Dialect`
trait, so `STRUCT(a AS w)` fails to parse with "Expected: ), found: AS".

Upstream BigQuery / Databricks / Generic dialects all override this
to `true`; HiveDialect simply forgot to. Fix it by introducing a thin
wrapper dialect (`crates/flowscope-core/src/dialect_ext/`) that
composes the upstream `HiveDialect` and re-enables this feature, and
wire it through `Dialect::to_sqlparser_dialect()`.

Affects auditId=2479, 2482, 2571 in the conan SQL corpus (3 of 6 real
parse failures resolved by this single change).

Co-authored-by: Cursor <cursoragent@cursor.com>
Hive defines `a DIV b` as BIGINT integer division (see
<https://cwiki.apache.org/confluence/display/hive/languagemanual+udf>),
but sqlparser-rs 0.61 only wires up `DIV` parsing inside MySqlDialect
via `Dialect::parse_infix`. HiveDialect doesn't override `parse_infix`,
so `SELECT x DIV 1000 FROM t` fails with a generic parse error.

Mirror the MySQL implementation in FlowscopeHiveDialect: parse `DIV`
as an infix operator and lower it to the same `BinaryOperator::MyIntegerDivide`
node MySQL uses, keeping downstream analyzer code dialect-agnostic.
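
The idea of lowering `DIV` to one shared AST node can be illustrated with a toy, std-only model. This is a sketch under stated assumptions, not the sqlparser-rs API: the `Expr` and `BinaryOperator` enums below are simplified stand-ins that only borrow the variant name from the text above, and `parse_div_chain` is a hypothetical helper that handles nothing but whitespace-separated `<operand> DIV <operand>` chains.

```rust
/// Simplified stand-in for an expression AST (NOT the sqlparser-rs type).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Ident(String),
    Number(i64),
    BinaryOp {
        left: Box<Expr>,
        op: BinaryOperator,
        right: Box<Expr>,
    },
}

/// Only the one operator this example needs; the variant name mirrors
/// the node mentioned in the commit message.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BinaryOperator {
    MyIntegerDivide,
}

/// An operand is an integer literal if it parses as one, else an identifier.
fn parse_operand(tok: &str) -> Expr {
    match tok.parse::<i64>() {
        Ok(n) => Expr::Number(n),
        Err(_) => Expr::Ident(tok.to_string()),
    }
}

/// Parses `<operand> (DIV <operand>)*` left-associatively.
/// Returns None on any other infix word or on a dangling `DIV`.
pub fn parse_div_chain(input: &str) -> Option<Expr> {
    let mut tokens = input.split_whitespace();
    let mut left = parse_operand(tokens.next()?);
    while let Some(word) = tokens.next() {
        // Keyword match is case-insensitive, as SQL keywords usually are.
        if !word.eq_ignore_ascii_case("DIV") {
            return None;
        }
        let right = parse_operand(tokens.next()?);
        // Lower to the same node regardless of which dialect saw `DIV`,
        // so downstream consumers never need to branch on dialect.
        left = Expr::BinaryOp {
            left: Box::new(left),
            op: BinaryOperator::MyIntegerDivide,
            right: Box::new(right),
        };
    }
    Some(left)
}
```

In this model `x DIV 1000` and `x div 1000` produce the same `BinaryOp` tree, which is the property that keeps analyzer code dialect-agnostic.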

Affects auditId=2568 in the conan SQL corpus.

Co-authored-by: Cursor <cursoragent@cursor.com>
Hive's SELECT grammar has no OFFSET clause (only LIMIT, see
<https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select>),
so writing `CROSS JOIN (SELECT ...) offset` to name a derived table
`offset` is perfectly legal Hive syntax. sqlparser-rs, however, keeps
`OFFSET` in its global `RESERVED_FOR_TABLE_ALIAS` list and HiveDialect
inherits that default, so the construct fails with
"Expected: ), found: <next token>" at the join site.

Override `Dialect::is_table_alias` in FlowscopeHiveDialect to remove
only `OFFSET` from the reserved set; other reserved keywords (SELECT,
FROM, WHERE, ...) are still rejected, so disambiguation of the surrounding
grammar is preserved. A negative test (`SELECT * FROM t CROSS JOIN (...) select`)
guards against accidental loosening.
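
The "remove only `OFFSET`" rule and its negative test can be modelled without the parser crate. The snippet below is a sketch, assuming a tiny illustrative subset of the reserved list (just the keywords named in this message); the real upstream list is longer, and the function names `default_is_table_alias` and `hive_is_table_alias` are hypothetical rather than sqlparser-rs signatures.

```rust
/// Illustrative subset only; the upstream reserved list is much longer.
const RESERVED_FOR_TABLE_ALIAS: &[&str] = &["SELECT", "FROM", "WHERE", "OFFSET"];

/// Models the default rule: a word may be a table alias unless reserved.
pub fn default_is_table_alias(word: &str) -> bool {
    !RESERVED_FOR_TABLE_ALIAS
        .iter()
        .any(|kw| kw.eq_ignore_ascii_case(word))
}

/// Models the Hive override: identical to the default rule, except that
/// `OFFSET` is allowed because Hive's SELECT grammar has no OFFSET clause.
pub fn hive_is_table_alias(word: &str) -> bool {
    word.eq_ignore_ascii_case("OFFSET") || default_is_table_alias(word)
}
```

Delegating to the default rule for everything except `OFFSET` is what keeps the change narrow: `hive_is_table_alias("offset")` is accepted, while `hive_is_table_alias("select")` is still rejected, mirroring the negative test.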

Affects auditId=2482 in the conan SQL corpus.

Co-authored-by: Cursor <cursoragent@cursor.com>
@cooker-code merged commit 1ec1b02 into master on May 14, 2026
5 of 7 checks passed
@cooker-code deleted the feat/hive-struct-named-fields branch on May 14, 2026 06:03