actionGrammar: compile-time grammar optimizer (inline + prefix factoring) by curtisman · Pull Request #2247 · microsoft/TypeAgent

curtisman · 2026-04-24T06:08:22Z

Summary

Adds a compile-time grammar optimizer to actionGrammar that reduces NFA/DFA size by inlining single-alternative rules and factoring common prefixes across alternatives, plus a generalization that factors common prefixes across top-level rules. Optimizations are now enabled by default in actionGrammarCompiler's compile command (with a new --debug flag to disable for diagnostics).

What's in here

New optimizer (grammarOptimizer.ts) with three composable passes:
- Inline single-alternative rules — substitutes a referenced rule's body into call sites, with α-renaming of child top-level bindings to opaque __opt_inline_<n> names to avoid collisions. Hoist / Substitute / Drop sub-strategies handle value-producing vs. valueless children.
- Factor common prefixes across alternatives within a RulesPart, via a trie keyed by part-merge keys. Variable-bearing edges get fresh opaque __opt_v_<n> canonical names, eliminating outer-name shadowing and first-inserter-wins collisions.
- Factor across top-level rules as a final pass (with a documented caveat that this destroys the 1:1 source-rule-index correspondence).
recommendedOptimizations preset re-exported from the package index for runtime callers.
agc compile defaults to recommendedOptimizations; --debug disables them and preserves the unoptimized AST for diagnostics.

Correctness

Parity check: optimizer pass results are validated against the unoptimized grammar for representative inputs.
Local bailouts when a fork's members all lack a value expression (would otherwise produce a missing-value-for-default error at finalize).
Opaque canonical names (__opt_v_*, __opt_inline_*) cannot collide with any user-supplied variable name.
New fuzz spec (grammarOptimizerFuzz.spec.ts) generates random grammars + inputs and asserts matchGrammar agrees with and without optimizations.

Tests

Adds ~2300 cases across 6 spec files covering inlining, factoring, sharing, value expressions, integration, and fuzzing. grammarOptimizer.ts line coverage is 96.7%, branch coverage 90%, function coverage 100%. Full action-grammar suite passes; prettier clean.

Benchmarks

Benchmarks (src/bench/grammarOptimizer*Benchmark.ts) are standalone scripts (not jest), runnable via pnpm run bench:synthetic / bench:real after build. README has an "Optimizer benchmarks" section.

Docs

docs/architecture/actionGrammar.md — new "Compile-time optimizations" section describing strategies, opaque canonicals, parity check, top-level-index caveat, and bench layout.
packages/actionGrammar/README.md — optimizer overview and benchmark instructions.

Adds an optimizer pipeline (`optimizeGrammar`) gated by per-grammar options, currently exposing two passes: - inlineSingleAlternatives: collapse single-alternative RulesParts into their parent, removing a layer of backtracking nesting in the matcher. Handles child.value via a flat 3-case decision — substitute into parent.value, hoist onto a single-part parent without its own value, or drop when unobservable — and propagates captured variables onto direct-capture parts when the wrapper carries a binding. - factorCommonPrefixes: factor common leading parts (including partial leading string tokens) shared by alternatives within a RulesPart, with conservative refusals on binding collisions, cross-scope value references, mixed value-presence, and multi-part suffixes that would lose the matcher's single-part default-value rule. Both passes preserve shared-array identity via an identity memo so the dedup invariant in grammarSerializer.ts holds. Tests cover the spacing-mode tightening, value substitution/hoist/drop cases, and the prefix-factoring guards.

…scripts - Move grammarOptimizerBenchmark and grammarOptimizerSyntheticBenchmark from test/*.spec.ts into src/bench/ as plain node entry points. - Extract shared CONFIGS, runBenchmark, timing, and colored-speedup helpers into src/bench/benchUtil.ts. - Color the speedup column with chalk: green when >1.10x, red when <0.90x, plain otherwise. - Add 'bench', 'bench:synthetic', and 'bench:real' npm scripts; exclude dist/bench from the published package.

The inliner spreads a child rule's parts into the parent at the call site. When the child's GrammarRule[] array was referenced from multiple RulesParts (named-rule sharing established by the compiler), inlining at each site duplicated the child's parts, defeating the serializer's identity-based dedup and bloating .ag.json output proportional to the reference count. Add a one-time reference-count pre-pass over the input AST and refuse to inline any RulesPart whose body has more than one incoming reference. Single-reference single-alternative rules continue to be inlined.

The matcher treats top-level alternatives the same way it treats inner RulesPart alternatives — each is queued as its own MatchState and produces its own GrammarMatchResult. So factoring shared prefixes across top-level rules is semantically safe (cardinality and per-rule values are preserved via the existing __opt_factor capture). After nested factoring completes, wrap the top-level Grammar.rules in a synthetic RulesPart and reuse factorRulesPart with the same fixed-point iteration used for nested groups. This destroys the 1:1 correspondence between top-level rule indices and the original source. That mapping must be recovered via separate metadata if anything downstream depends on it. Also fuse the map+some pass in inlineRulesArray and factorRulesArray into a single loop that only allocates a new array once an element actually changes.

The loop existed defensively but the current grouping/intersection logic converges in a single pass: groups are seeded from rules that pairwise fail the share check, so each group's canonical prefix can't share with any other group's prefix on a second pass. Newly synthesized suffix RulesParts are intentionally not re-walked, and single-rule wrappers hit the early return. Call factorRulesPart once at each site; remove factorRulesPartToFixedPoint.

Replaces the pairwise-shape-comparison factoring with an incremental trie: rules are inserted edge-by-edge (per-token for StringPart; conservative match by typeName / array identity / matcherName for other parts), and the trie is then walked post-order to emit factored rules. Variables on wildcard / number / rules edges are canonicalized to the first inserter's name, with later inserters' value expressions remapped at emission. Behavioral notes: - Suffix sub-prefixes are now factored (e.g. 'play song x | play song y | play album z' factors both 'play' and inner 'song'). - A failed eligibility check at one fork now triggers a local bailout (members are flattened with their prefix prepended) instead of refusing to factor the whole group, allowing factoring above the failing fork. - Strengthened the wholly-consumed check to refuse empty-parts members regardless of value-presence (the matcher's default-value resolver cannot handle 'parts:[], value: undefined' inside a wrapped RulesPart). Tests: existing 90 optimizer tests pass; added 10 new tests covering the suffix-sharing win and 9 risk categories (cross-scope bailout, RulesPart array-identity preservation, strict-prefix overlap with mixed values, multi-level factoring, variable canonicalization including object-shorthand, interleaved match-order preservation, nested wrapper-variable scoping).

- Split TrieRoot from TrieNode so non-root edges are non-optional, removing four non-null assertions and the redundant edge guard. - Collapse single-line edgeKeyMatches branches; keep block scope only for the variants that read multiple fields. - Use items.flatMap(it => it.rules) at the two emission sites.

Allocate fresh '__opt_v_<n>' canonical names for every variable-bearing edge in the factoring trie instead of using the first inserter's user- supplied variable name. Eliminates two collision classes: - Outer-name shadow: under first-inserter-wins, a non-lead alternative whose value referenced an outer free variable matching the lead's local binding name would silently alias the lead's binding. Opaque canonicals cannot collide with any user name. - Bound vs. unbound RulesPart references: parity check in edgeKeyMatches keeps '<Inner>' and '$(v:<Inner>)' as separate trie children rather than silently merging (which would either invent or drop a binding). Other changes: - Split TrieStep (yielded by partsToEdgeSteps with 'local') from TrieEdge (stored in trie with 'canonical'). - Lead inserter now records its remap too (previously was a no-op). - recordStepRemap throws on conflicting overwrite of the same local to two different canonicals (defensive; unreachable in well-formed input). - Removed binding-shadow check in checkFactoringEligible (impossible with opaque canonicals); kept cross-scope-ref check, now expressed in canonical-name terms — it remains necessary because nested rule scope is fresh in the matcher (entering RulesPart resets valueIds). Tests: optimizer suite 100 \u2192 102; new tests cover outer-name shadow and bound-vs-unbound parity. Full action-grammar suite 2322 pass.

…e walks - Inline pass: unconditionally α-rename child rule top-level bindings to fresh __opt_inline_<n> names (per-parent counter), eliminating collisions in both substitute and drop branches. - Unify Substitute and Drop branches in tryInlineRulesPart; Hoist remains a separate fast path. - Batch per-parent value substitutions into a single Map and a single AST walk in inlineRule. - Unify remapValueVariables and substituteValueVariable into substituteValueVariables(node, Map<string, CompiledValueNode>); remapValueVariables is now a thin wrapper. - Add tests for branch-3 collision rename and per-parent counter uniqueness.

- inlineSingleAlternatives: handle multi-part parents that capture a value-producing child via part.variable by hoisting child.value into a synthesized valueAssignment (matches the matcher's default-value rule). Drop withPropagatedVariable / findExistingVariable in favor of in-place re-targeting of the single binding-friendly child part. - Tests: add coverage for optional-flag preservation, spacing-mode mismatches, refCount>1 inline refusal, bound-nested-rules-part regression, plus factoring eligibility and trie-edge variants.

- factorRulesPart split into thin RulesPart wrapper + flat factorRules core; top-level factoring now reuses factorRules directly instead of wrapping Grammar.rules in a synthetic RulesPart. - Trie children switched from TrieNode[] (linear edgeKeyMatches scan) to Map<string, TrieNode> keyed by stepMergeKey for O(1) insertion. GrammarRule[] array identity preserved via a per-BuildState WeakMap interner so the rules-edge merge key stays primitive without losing array-identity sharing. Removed now-unused edgeKeyMatches. - emitFromNode prefix construction made linear: appendPart (returned a fresh array per step → O(depth²)) replaced with in-place appendPartInPlace. - factorCommonPrefixes JSDoc warns that top-level factoring destroys the 1:1 source-rule index correspondence. - Inliner second-pass comment corrected: factoring never emits single-alternative wrappers itself; the second inline pass exists to collapse sub-RulesParts inside emitted suffixes that became inline-eligible. - BuildState JSDoc clarifies its scope and contrasts with RenameState. - bench/grammarOptimizerBenchmark.ts: comment documents the dist/bench path-relative grammar-file assumption. - README adds an 'Optimizer benchmarks' section explaining the build-then-run flow for pnpm run bench:synthetic / bench:real. - docs/architecture/actionGrammar.md: rewrote the Compile-time optimizations section to match current code (Hoist/Substitute/Drop sub-strategies, opaque __opt_v_/__opt_inline_ canonicals, parity check, local bailout, no fixed-point loop, top-level index caveat, factorRules vs. factorRulesPart, Map-keyed children, src/bench/ standalone scripts). - Test consolidation: merged grammarOptimizerFactoringRepro.spec.ts and grammarOptimizerTrieRisks.spec.ts into grammarOptimizerFactoring.spec.ts (28 tests preserved exactly; 5 spec files → 3). - Full action-grammar suite passes (2339 tests); prettier clean.

Adds tests for previously uncovered paths in grammarOptimizer.ts: - grammarOptimizerValueExpressions.spec.ts: exercises every CompiledValueNode arm in substituteValueVariables (inliner Substitute branch) and collectVariableReferences (factorer cross-scope-ref check) -- array, binary/unary/conditional/member/call expressions, spread element, template literal, object spread, object shorthand-with-sub. - grammarOptimizerFactoring.spec.ts: number-edge factoring (with and without optional flag) and wrapper-rule spacingMode propagation (positive + negative). - grammarOptimizerIntegration.spec.ts: optimizeGrammar early-return paths, inliner refusal when child has multiple variable-bearing parts, compiler skipping optimization on grammars with errors, and observable effect of the post-factor inline pass. grammarOptimizer.ts coverage: 86.18% -> 96.73% lines, 75.45% -> 90% branches, 98% -> 100% functions. Tests: 119 -> 142, all passing.

- Fix correctness bug in factorCommonPrefixes: bail out at any fork whose members all lack a value expression. The matcher's implicit text-default only fires for single-StringPart rules, so a wrapper rule [prefix..., suffixRulesPart] with no value would throw 'missing value for default' at finalize. Caught by the new fuzzer. - Unify factor-pass naming: wrapper bindings now use freshWrapperBinding threaded through BuildState; reserved-set scan removed. - Defer-allocate in factorParts to match factorRulesArray style. - Cleaner onlyChild (no non-null chain). - Drop duplicate JSDoc on BuildState. - Add recommendedOptimizations preset (and re-export from index). - Add grammarOptimizerFuzz.spec.ts: deterministic random grammars + inputs, asserts matchGrammar agrees with and without optimizations.

Default the agc compile command to recommendedOptimizations (matches what runtime callers should use). Add a --debug flag that disables optimizations, producing an unoptimized AST that preserves the 1:1 correspondence between top-level rules and the original source for diagnostics.

The cross-scope-ref check in checkFactoringEligible only compared member values against the immediate prefix at the current trie fork. It missed the case where a deeper bailout had already prepended an inner edge to each member, dragging in references to ancestor-prefix canonicals (variables bound several edges up the trie). An outer fork at a literal edge would then factor those members into a fresh wrapper-rule scope where the ancestor canonical was no longer visible, producing a runtime "Internal error: No value for variable '__opt_v_*'" This surfaced on playerSchema.agr's play <TrackPhrase> by <ArtistName> (alt1) play <TrackPhrase> from album <AlbumName> (alt2) play <TrackPhrase> by <ArtistName> from album ... (alt3) trie shape: the inner <ArtistName> fork bails (alt1's terminal lands with empty parts), prepending <ArtistName> to alt3's suffix; the outer 'by' fork's eligibility check then sees no canonicals in its 'by' prefix and factored anyway, lifting alt1/alt3 into a wrapper where <TrackPhrase>'s canonical was lost. Fix: tighten the check to require every member's value to only reference variables bound in *that member's own parts*. This subsumes the simpler immediate-prefix check and catches the bailout-then-factor scenario. prefix parameter is no longer needed. Add a regression test covering the exact playerSchema trie shape.

curtisman added 14 commits April 23, 2026 22:35

curtisman had a problem deploying to development-fork April 24, 2026 06:08 — with GitHub Actions Error

lint

b57f6cc

curtisman had a problem deploying to development-fork April 24, 2026 06:13 — with GitHub Actions Error

curtisman requested a deployment to development-fork April 24, 2026 06:49 — with GitHub Actions Waiting

curtisman added this pull request to the merge queue Apr 24, 2026

Merged via the queue into microsoft:main with commit 20b4727 Apr 24, 2026
13 of 15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

actionGrammar: compile-time grammar optimizer (inline + prefix factoring)#2247

actionGrammar: compile-time grammar optimizer (inline + prefix factoring)#2247
curtisman merged 16 commits intomicrosoft:mainfrom
curtisman:opt

curtisman commented Apr 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

curtisman commented Apr 24, 2026

Summary

What's in here

Correctness

Tests

Benchmarks

Docs

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant