Skip to content

actionGrammar: compile-time grammar optimizer (inline + prefix factoring)#2247

Merged
curtisman merged 16 commits intomicrosoft:mainfrom
curtisman:opt
Apr 24, 2026
Merged

actionGrammar: compile-time grammar optimizer (inline + prefix factoring)#2247
curtisman merged 16 commits intomicrosoft:mainfrom
curtisman:opt

Conversation

@curtisman
Copy link
Copy Markdown
Member

Summary

Adds a compile-time grammar optimizer to actionGrammar that reduces NFA/DFA size by inlining single-alternative rules and factoring common prefixes across alternatives, plus a generalization that factors common prefixes across top-level rules. Optimizations are now enabled by default in actionGrammarCompiler's compile command (with a new --debug flag to disable for diagnostics).

What's in here

  • New optimizer (grammarOptimizer.ts) with three composable passes:
    • Inline single-alternative rules — substitutes a referenced rule's body into call sites, with α-renaming of child top-level bindings to opaque __opt_inline_<n> names to avoid collisions. Hoist / Substitute / Drop sub-strategies handle value-producing vs. valueless children.
    • Factor common prefixes across alternatives within a RulesPart, via a trie keyed by part-merge keys. Variable-bearing edges get fresh opaque __opt_v_<n> canonical names, eliminating outer-name shadowing and first-inserter-wins collisions.
    • Factor across top-level rules as a final pass (with a documented caveat that this destroys the 1:1 source-rule-index correspondence).
  • recommendedOptimizations preset re-exported from the package index for runtime callers.
  • agc compile defaults to recommendedOptimizations; --debug disables them and preserves the unoptimized AST for diagnostics.

Correctness

  • Parity check: optimizer pass results are validated against the unoptimized grammar for representative inputs.
  • Local bailouts when a fork's members all lack a value expression (would otherwise produce a missing-value-for-default error at finalize).
  • Opaque canonical names (__opt_v_*, __opt_inline_*) cannot collide with any user-supplied variable name.
  • New fuzz spec (grammarOptimizerFuzz.spec.ts) generates random grammars + inputs and asserts matchGrammar agrees with and without optimizations.

Tests

Adds ~2300 cases across 6 spec files covering inlining, factoring, sharing, value expressions, integration, and fuzzing. grammarOptimizer.ts line coverage is 96.7%, branch coverage 90%, function coverage 100%. Full action-grammar suite passes; prettier clean.

Benchmarks

Benchmarks (src/bench/grammarOptimizer*Benchmark.ts) are standalone scripts (not jest), runnable via pnpm run bench:synthetic / bench:real after build. README has an "Optimizer benchmarks" section.

Docs

  • docs/architecture/actionGrammar.md — new "Compile-time optimizations" section describing strategies, opaque canonicals, parity check, top-level-index caveat, and bench layout.
  • packages/actionGrammar/README.md — optimizer overview and benchmark instructions.

Adds an optimizer pipeline (`optimizeGrammar`) gated by per-grammar
options, currently exposing two passes:

  - inlineSingleAlternatives: collapse single-alternative RulesParts
    into their parent, removing a layer of backtracking nesting in
    the matcher.  Handles child.value via a flat 3-case decision —
    substitute into parent.value, hoist onto a single-part parent
    without its own value, or drop when unobservable — and
    propagates captured variables onto direct-capture parts when
    the wrapper carries a binding.

  - factorCommonPrefixes: factor common leading parts (including
    partial leading string tokens) shared by alternatives within
    a RulesPart, with conservative refusals on binding collisions,
    cross-scope value references, mixed value-presence, and
    multi-part suffixes that would lose the matcher's single-part
    default-value rule.

Both passes preserve shared-array identity via an identity memo
so the dedup invariant in grammarSerializer.ts holds.  Tests
cover the spacing-mode tightening, value substitution/hoist/drop
cases, and the prefix-factoring guards.
…scripts

- Move grammarOptimizerBenchmark and grammarOptimizerSyntheticBenchmark
  from test/*.spec.ts into src/bench/ as plain node entry points.
- Extract shared CONFIGS, runBenchmark, timing, and colored-speedup
  helpers into src/bench/benchUtil.ts.
- Color the speedup column with chalk: green when >1.10x, red when
  <0.90x, plain otherwise.
- Add 'bench', 'bench:synthetic', and 'bench:real' npm scripts; exclude
  dist/bench from the published package.
The inliner spreads a child rule's parts into the parent at the call
site. When the child's GrammarRule[] array was referenced from multiple
RulesParts (named-rule sharing established by the compiler), inlining at
each site duplicated the child's parts, defeating the serializer's
identity-based dedup and bloating .ag.json output proportional to the
reference count.

Add a one-time reference-count pre-pass over the input AST and refuse to
inline any RulesPart whose body has more than one incoming reference.
Single-reference single-alternative rules continue to be inlined.
The matcher treats top-level alternatives the same way it treats inner
RulesPart alternatives — each is queued as its own MatchState and
produces its own GrammarMatchResult. So factoring shared prefixes
across top-level rules is semantically safe (cardinality and per-rule
values are preserved via the existing __opt_factor capture).

After nested factoring completes, wrap the top-level Grammar.rules in a
synthetic RulesPart and reuse factorRulesPart with the same
fixed-point iteration used for nested groups.

This destroys the 1:1 correspondence between top-level rule indices and
the original source. That mapping must be recovered via separate
metadata if anything downstream depends on it.

Also fuse the map+some pass in inlineRulesArray and factorRulesArray
into a single loop that only allocates a new array once an element
actually changes.
The loop existed defensively but the current grouping/intersection logic
converges in a single pass: groups are seeded from rules that pairwise
fail the share check, so each group's canonical prefix can't share with
any other group's prefix on a second pass. Newly synthesized suffix
RulesParts are intentionally not re-walked, and single-rule wrappers hit
the early return.

Call factorRulesPart once at each site; remove factorRulesPartToFixedPoint.
Replaces the pairwise-shape-comparison factoring with an incremental trie:

rules are inserted edge-by-edge (per-token for StringPart; conservative

match by typeName / array identity / matcherName for other parts), and the

trie is then walked post-order to emit factored rules. Variables on

wildcard / number / rules edges are canonicalized to the first inserter's

name, with later inserters' value expressions remapped at emission.

Behavioral notes:

- Suffix sub-prefixes are now factored (e.g. 'play song x | play song y |

  play album z' factors both 'play' and inner 'song').

- A failed eligibility check at one fork now triggers a local bailout

  (members are flattened with their prefix prepended) instead of refusing

  to factor the whole group, allowing factoring above the failing fork.

- Strengthened the wholly-consumed check to refuse empty-parts members

  regardless of value-presence (the matcher's default-value resolver

  cannot handle 'parts:[], value: undefined' inside a wrapped RulesPart).

Tests: existing 90 optimizer tests pass; added 10 new tests covering the

suffix-sharing win and 9 risk categories (cross-scope bailout, RulesPart

array-identity preservation, strict-prefix overlap with mixed values,

multi-level factoring, variable canonicalization including object-shorthand,

interleaved match-order preservation, nested wrapper-variable scoping).
- Split TrieRoot from TrieNode so non-root edges are non-optional,

  removing four non-null assertions and the redundant edge guard.

- Collapse single-line edgeKeyMatches branches; keep block scope only

  for the variants that read multiple fields.

- Use items.flatMap(it => it.rules) at the two emission sites.
Allocate fresh '__opt_v_<n>' canonical names for every variable-bearing

edge in the factoring trie instead of using the first inserter's user-

supplied variable name. Eliminates two collision classes:

- Outer-name shadow: under first-inserter-wins, a non-lead alternative

  whose value referenced an outer free variable matching the lead's

  local binding name would silently alias the lead's binding. Opaque

  canonicals cannot collide with any user name.

- Bound vs. unbound RulesPart references: parity check in

  edgeKeyMatches keeps '<Inner>' and '$(v:<Inner>)' as separate

  trie children rather than silently merging (which would either

  invent or drop a binding).

Other changes:

- Split TrieStep (yielded by partsToEdgeSteps with 'local') from

  TrieEdge (stored in trie with 'canonical').

- Lead inserter now records its remap too (previously was a no-op).

- recordStepRemap throws on conflicting overwrite of the same local

  to two different canonicals (defensive; unreachable in well-formed

  input).

- Removed binding-shadow check in checkFactoringEligible (impossible

  with opaque canonicals); kept cross-scope-ref check, now expressed

  in canonical-name terms — it remains necessary because nested rule

  scope is fresh in the matcher (entering RulesPart resets valueIds).

Tests: optimizer suite 100 \u2192 102; new tests cover outer-name shadow

and bound-vs-unbound parity. Full action-grammar suite 2322 pass.
…e walks

- Inline pass: unconditionally α-rename child rule top-level bindings to fresh __opt_inline_<n> names (per-parent counter), eliminating collisions in both substitute and drop branches.
- Unify Substitute and Drop branches in tryInlineRulesPart; Hoist remains a separate fast path.
- Batch per-parent value substitutions into a single Map and a single AST walk in inlineRule.
- Unify remapValueVariables and substituteValueVariable into substituteValueVariables(node, Map<string, CompiledValueNode>); remapValueVariables is now a thin wrapper.
- Add tests for branch-3 collision rename and per-parent counter uniqueness.
- inlineSingleAlternatives: handle multi-part parents that capture a
  value-producing child via part.variable by hoisting child.value into
  a synthesized valueAssignment (matches the matcher's default-value
  rule).  Drop withPropagatedVariable / findExistingVariable in favor
  of in-place re-targeting of the single binding-friendly child part.
- Tests: add coverage for optional-flag preservation, spacing-mode
  mismatches, refCount>1 inline refusal, bound-nested-rules-part
  regression, plus factoring eligibility and trie-edge variants.
- factorRulesPart split into thin RulesPart wrapper + flat factorRules core; top-level factoring now reuses factorRules directly instead of wrapping Grammar.rules in a synthetic RulesPart.
- Trie children switched from TrieNode[] (linear edgeKeyMatches scan) to Map<string, TrieNode> keyed by stepMergeKey for O(1) insertion. GrammarRule[] array identity preserved via a per-BuildState WeakMap interner so the rules-edge merge key stays primitive without losing array-identity sharing. Removed now-unused edgeKeyMatches.
- emitFromNode prefix construction made linear: appendPart (returned a fresh array per step → O(depth²)) replaced with in-place appendPartInPlace.
- factorCommonPrefixes JSDoc warns that top-level factoring destroys the 1:1 source-rule index correspondence.
- Inliner second-pass comment corrected: factoring never emits single-alternative wrappers itself; the second inline pass exists to collapse sub-RulesParts inside emitted suffixes that became inline-eligible.
- BuildState JSDoc clarifies its scope and contrasts with RenameState.
- bench/grammarOptimizerBenchmark.ts: comment documents the dist/bench path-relative grammar-file assumption.
- README adds an 'Optimizer benchmarks' section explaining the build-then-run flow for pnpm run bench:synthetic / bench:real.
- docs/architecture/actionGrammar.md: rewrote the Compile-time optimizations section to match current code (Hoist/Substitute/Drop sub-strategies, opaque __opt_v_/__opt_inline_ canonicals, parity check, local bailout, no fixed-point loop, top-level index caveat, factorRules vs. factorRulesPart, Map-keyed children, src/bench/ standalone scripts).
- Test consolidation: merged grammarOptimizerFactoringRepro.spec.ts and grammarOptimizerTrieRisks.spec.ts into grammarOptimizerFactoring.spec.ts (28 tests preserved exactly; 5 spec files → 3).
- Full action-grammar suite passes (2339 tests); prettier clean.
Adds tests for previously uncovered paths in grammarOptimizer.ts:

- grammarOptimizerValueExpressions.spec.ts: exercises every
  CompiledValueNode arm in substituteValueVariables (inliner Substitute
  branch) and collectVariableReferences (factorer cross-scope-ref check)
  -- array, binary/unary/conditional/member/call expressions, spread
  element, template literal, object spread, object shorthand-with-sub.

- grammarOptimizerFactoring.spec.ts: number-edge factoring (with and
  without optional flag) and wrapper-rule spacingMode propagation
  (positive + negative).

- grammarOptimizerIntegration.spec.ts: optimizeGrammar early-return
  paths, inliner refusal when child has multiple variable-bearing parts,
  compiler skipping optimization on grammars with errors, and observable
  effect of the post-factor inline pass.

grammarOptimizer.ts coverage: 86.18% -> 96.73% lines, 75.45% -> 90%
branches, 98% -> 100% functions.  Tests: 119 -> 142, all passing.
- Fix correctness bug in factorCommonPrefixes: bail out at any fork
  whose members all lack a value expression. The matcher's implicit
  text-default only fires for single-StringPart rules, so a wrapper
  rule [prefix..., suffixRulesPart] with no value would throw
  'missing value for default' at finalize. Caught by the new fuzzer.
- Unify factor-pass naming: wrapper bindings now use freshWrapperBinding
  threaded through BuildState; reserved-set scan removed.
- Defer-allocate in factorParts to match factorRulesArray style.
- Cleaner onlyChild (no non-null chain).
- Drop duplicate JSDoc on BuildState.
- Add recommendedOptimizations preset (and re-export from index).
- Add grammarOptimizerFuzz.spec.ts: deterministic random grammars +
  inputs, asserts matchGrammar agrees with and without optimizations.
Default the agc compile command to recommendedOptimizations (matches
what runtime callers should use).  Add a --debug flag that disables
optimizations, producing an unoptimized AST that preserves the 1:1
correspondence between top-level rules and the original source for
diagnostics.
The cross-scope-ref check in checkFactoringEligible only compared
member values against the immediate prefix at the current trie fork.
It missed the case where a deeper bailout had already prepended an
inner edge to each member, dragging in references to ancestor-prefix
canonicals (variables bound several edges up the trie).  An outer
fork at a literal edge would then factor those members into a fresh
wrapper-rule scope where the ancestor canonical was no longer
visible, producing a runtime

  "Internal error: No value for variable '__opt_v_*'"

This surfaced on playerSchema.agr's

  play <TrackPhrase> by <ArtistName>                   (alt1)
  play <TrackPhrase> from album <AlbumName>            (alt2)
  play <TrackPhrase> by <ArtistName> from album ...    (alt3)

trie shape: the inner <ArtistName> fork bails (alt1's terminal lands
with empty parts), prepending <ArtistName> to alt3's suffix; the
outer 'by' fork's eligibility check then sees no canonicals in its
'by' prefix and factored anyway, lifting alt1/alt3 into a wrapper
where <TrackPhrase>'s canonical was lost.

Fix: tighten the check to require every member's value to only
reference variables bound in *that member's own parts*.  This
subsumes the simpler immediate-prefix check and catches the
bailout-then-factor scenario.  prefix parameter is no longer needed.

Add a regression test covering the exact playerSchema trie shape.
@curtisman curtisman added this pull request to the merge queue Apr 24, 2026
Merged via the queue into microsoft:main with commit 20b4727 Apr 24, 2026
13 of 15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant