Upstream sync, mostly eager output interface cleanup by silentbicycle · Pull Request #37 · fastly/libfsm

silentbicycle · 2026-02-09T17:57:01Z

This had quite a lot of conflicts, largely because katef#513 cleaned up the eager output interfaces that were originally merged here (with more narrowly scoped functionality).

Missing documentation for -x in the rx(1) manpage

The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.

I verified that LeakSanitizer catches the leak during this test when the fix is reverted. This commit adds a test directory, `tests/regressions`, for misc. regression tests that need a .c file and aren't easily testable via fsm, re, etc. inputs.
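The shape of this kind of fix can be sketched as follows: every exit path, including the early exit for a state limit, goes through one cleanup label. This is a minimal illustration only. `build_with_limit`, `struct edge_set`, the result enum, and the allocation counter are hypothetical stand-ins, not libfsm's code or API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a per-state edge set; not libfsm's type. */
struct edge_set {
    unsigned *edges;
};

enum build_result {
    BUILD_OK,
    BUILD_STATE_LIMIT_REACHED,
    BUILD_ERROR_ALLOC
};

/* Counts outstanding allocations, so a test can check for leaks
 * without needing LeakSanitizer. */
int live_allocs;

static void *
tracked_calloc(size_t n, size_t size)
{
    void *p = calloc(n, size);
    if (p != NULL) { live_allocs++; }
    return p;
}

static void
tracked_free(void *p)
{
    if (p != NULL) { live_allocs--; free(p); }
}

/* Allocate one edge set per state, halting early once `state_limit`
 * states are in use. Every exit after the first allocation goes
 * through the cleanup label, so the early exit frees whatever was
 * allocated so far. */
enum build_result
build_with_limit(size_t needed, size_t state_limit)
{
    enum build_result res = BUILD_OK;
    struct edge_set *sets;
    size_t i, used = 0;

    assert(state_limit > 0);

    sets = tracked_calloc(state_limit, sizeof *sets);
    if (sets == NULL) { return BUILD_ERROR_ALLOC; }

    for (i = 0; i < needed; i++) {
        if (used == state_limit) {
            res = BUILD_STATE_LIMIT_REACHED;
            goto cleanup;   /* the early exit still frees the edge sets */
        }
        sets[used].edges = tracked_calloc(4, sizeof *sets[used].edges);
        if (sets[used].edges == NULL) {
            res = BUILD_ERROR_ALLOC;
            goto cleanup;
        }
        used++;
    }

cleanup:
    for (i = 0; i < used; i++) {
        tracked_free(sets[i].edges);
    }
    tracked_free(sets);
    return res;
}
```

A regression test in this style calls the function with a limit low enough to trigger the early exit and then checks that nothing is still allocated.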

…minise-with-config Fix a memory leak during `fsm_determinise_with_config`'s early exit

…clang and fail builds

Update CI to use Ubuntu 22.04

While discussing other issues, kate said she'd rather not have this have different behavior when stdout is / isn't a tty.

retest: Remove `isatty` check and extra logging output.

print/ir.h only depended on it for FSM_SIGMA_COUNT in `struct ir_state_table`, and IR_TABLE isn't actually implemented, so this can be removed for now. lx/print/c.c only needed internal.h because of print/ir.h, and because of several direct accesses to fsm->statecount, which are easily replaced by calls to fsm_countstates.

This will miss failures in prefixed res files, such as build/tests/lxpos/dyn-fdgetc-getc-res0

Changing the leaf and endleaf callbacks to accept and reject in katef#485 broke lx, but it went unnoticed for a while. This fixes it.

libfsm's normal execution mode evaluates a DFA, character by character, terminating either when the next character isn't a valid edge or when the end of input is reached (in which case it checks end state metadata). lx's execution mode is a little different, because it's tokenizing: instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then push back the last character read (so it can resume with it as context for the next token), yield the token type, and suspend.

lx used to work by breaking abstraction and calling directly into `fsm_print_cfrag` (overriding the leaf behavior to yield token types, and adding an extra 'NONE' state to the generated state machine code), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept state, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.

This is only implemented for the "c" output format, but similar changes could possibly make others usable without a lot more work. In particular, kate mentioned it'd be good to be able to use "vmc" output instead of "c" moving forward.

Most of the code changes happen inside of lx's code generation, but there are a few elsewhere:

- The reject hook now has a state_metadata pointer, so update the callers for all the output formats.
- libfsm's 'c' output now includes a macro `FSM_ADVANCE_HOOK(C)`, which is called with the next character read in the FSM_IO_STR and FSM_IO_PAIR io modes immediately after advancing. This is used to inform lx's internal bookkeeping about token positions and buffering token names. FSM_IO_GETC doesn't need it, because its getc callback manages the character stream. The macro defaults to a no-op when undefined.
- libfsm's 'c' output also includes a flag, `has_consumed_input`, so the code expanded in place from the reject/accept hooks can determine whether the state machine input handler loop has consumed any input. This was previously encoded by the extra NONE state. lx's code generation using this flag is a bit cluttered, because the reject hook doesn't know whether it's expanding for the end states, but it's probably not worth changing the reject hook type signature to add another flag. This results in checks for has_consumed_input in code paths where trivial static analysis would show it to be dead code, and some extra unreachable code at the end of the function.
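The tokenizing behavior described above (consume the longest run of matching input, leave the first non-matching character unread, yield a token, and suspend) can be sketched with a small hand-written DFA. This is only an illustration, not the code lx generates: `struct lexer`, `step`, `next_token`, and the token set are hypothetical. The `FSM_ADVANCE_HOOK` default and the `has_consumed_input` flag are modelled on the description in this commit message.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Defaults to a no-op when the including code has not defined it. */
#ifndef FSM_ADVANCE_HOOK
#define FSM_ADVANCE_HOOK(C)
#endif

enum token { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_UNKNOWN };

/* Hypothetical lexer state; lx's generated struct differs. */
struct lexer {
    const char *input;
    size_t pos;     /* index of the next unread character */
};

enum state { S_START, S_IDENT, S_NUMBER };

/* Return the next state, or -1 if `c` is not a valid edge from `s`. */
static int
step(enum state s, int c)
{
    switch (s) {
    case S_START:
        if (islower(c)) { return S_IDENT; }
        if (isdigit(c)) { return S_NUMBER; }
        return -1;
    case S_IDENT:  return islower(c) ? S_IDENT : -1;
    case S_NUMBER: return isdigit(c) ? S_NUMBER : -1;
    }
    return -1;
}

enum token
next_token(struct lexer *lx)
{
    enum state s = S_START;
    int has_consumed_input = 0;

    assert(lx != NULL && lx->input != NULL);

    /* Spaces are matched but ignored. */
    while (lx->input[lx->pos] == ' ') { lx->pos++; }

    for (;;) {
        int c = (unsigned char) lx->input[lx->pos];
        int next = (c == '\0') ? -1 : step(s, c);

        if (next == -1) {
            /* No valid edge: leave `c` unread (the "push back")
             * so it starts the next token, then yield. */
            break;
        }
        s = (enum state) next;
        lx->pos++;
        FSM_ADVANCE_HOOK(c);
        has_consumed_input = 1;
    }

    if (!has_consumed_input) {
        if (lx->input[lx->pos] == '\0') { return TOK_EOF; }
        lx->pos++;  /* skip the character no pattern accepts */
        return TOK_UNKNOWN;
    }
    return s == S_IDENT ? TOK_IDENT : TOK_NUMBER;
}
```

Each call resumes where the previous one stopped, so the character that ended one token is the first one examined for the next.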

Instead of having the EOF token occupy the same byte, line, and column position as the last token, it should immediately follow. The new lx codegen behaves this way, and katef and I decided that it made sense to keep it like that, as long as it's consistent.
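One way to picture the positioning rule is the sketch below. It assumes tokens carry a start and an end position, where the end is one past the last character. `struct pos`, `struct token_span`, and `eof_span` are hypothetical names for illustration, and lx's own position bookkeeping may be laid out differently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical position record; lx's own struct may differ. */
struct pos {
    unsigned byte;
    unsigned line;
    unsigned col;
};

struct token_span {
    struct pos start;   /* first character of the token */
    struct pos end;     /* assumed: one past the last character */
};

/* Place a zero-width EOF token immediately after the last token,
 * rather than reusing the last token's own start position. */
struct token_span
eof_span(const struct token_span *last)
{
    struct token_span eof;

    assert(last != NULL);
    eof.start = last->end;
    eof.end = last->end;
    return eof;
}
```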

Add `${LX}` as a dependency for the targets using it. Remove the `getcio=${io}` and `io=${io}` arguments to cat. Those may be a merge error? They just produce a warning.

Some of the CI test matrix builds set LX to 'true; echo lx', but that obviously won't work for tests that actually need to run lx in order to exercise its output.

Add a couple test cases (in8-10.txt) with an unexpected end of input, either in the middle of a pattern, or after matching the first pattern in a .. pair, but without matching the second. Supporting this changes the expected result for in6.txt: Previously it resulted in TOK_EOF, now it leads to TOK_UNKNOWN and produces a "lexically uncategorised" error message for the unexpected end of input. This change is necessary for fixing katef#386 / katef#508, and more generally to detect things like unterminated string literals.

This was previously ending up with a useless call to the current zone after returning the token ("case S1: return TOK_UNKNOWN; lx->z(lx);"), which led to a warning in CI [-Werror=implicit-fallthrough=].
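For illustration, the shape that avoids this warning has each case end in exactly one return, with nothing after it. The states and tokens below are hypothetical and do not reproduce lx's generated output.

```c
#include <assert.h>

enum token { TOK_EOF, TOK_UNKNOWN, TOK_WORD };
enum state { S0, S1, S2 };

/* Each case ends in exactly one return: no statement follows a
 * return, and no case falls through into the next one. */
enum token
token_for_end_state(enum state s)
{
    switch (s) {
    case S0: return TOK_EOF;
    case S1: return TOK_UNKNOWN;
    case S2: return TOK_WORD;
    }

    assert(!"unreachable");
    return TOK_UNKNOWN;
}
```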

These may or may not be called, depending on the input.

In some cases this was hardcoding "lx_" in the generated code, which could lead to build failures if 'lx -e' was used to override the default prefix.
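A rough sketch of the idea is to pass the configured prefixes to every place that emits an identifier. `struct prefix` and `print_next_decl` are hypothetical, loosely modelled on the prefix.api and prefix.lx names in the commit titles, and the emitted declaration is made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical prefix settings; not lx's actual struct. */
struct prefix {
    const char *api;
    const char *lx;
};

/* Emit a declaration using the configured prefixes instead of a
 * hardcoded "lx_". Returns the number of characters written, or -1
 * if the buffer is too small or formatting fails. */
int
print_next_decl(char *buf, size_t size, const struct prefix *p)
{
    int n;

    assert(buf != NULL && p != NULL);
    n = snprintf(buf, size, "enum %stoken %snext(struct %slx *lx);",
        p->api, p->api, p->lx);
    return (n < 0 || (size_t) n >= size) ? -1 : n;
}
```

With a non-default prefix, the generated text should contain no leftover "lx_", which is the property the fix restores.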

This avoids cluttering libfsm's print output with `has_consumed_input`, which is specific to lx.

Tested with every combination of (dyn+fgetc, fixed+fgetc, pair, str), with and without '-x buf', '-x pos', or both.

Fix lx token identification

PR katef#509 introduced a bug: it didn't distinguish between an unexpected end of input and an end of input in a zone that matches but ignores its input. This caused several lxpos tests to fail due to getting a TOK_UNKNOWN rather than a TOK_EOF when the input has trailing whitespace, but I didn't notice until after merging because the normal build doesn't regenerate the code for src/lx/lexer.lx or src/libfsm/lexer.lx. (I had ensured all the libre dialect lexers and parsers were regenerated, but missed those.)

Instead of always printing TOK_UNKNOWN, this inspects the zone mappings to determine whether the current end ID represents a dead end for the zone. If not, it should instead print TOK_EOF.
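The decision can be pictured as below. This is a deliberately simplified sketch: `struct end_info` and `token_at_end_of_input` are hypothetical, and treating "dead end" as "the input stopped partway through a pattern" is an assumption rather than a statement of how lx's zone mappings are represented.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum token { TOK_EOF, TOK_UNKNOWN };

/* Hypothetical per-end-ID record; the real zone mappings in lx carry
 * more information than this. */
struct end_info {
    bool dead_end;  /* assumed: input stopped partway through a pattern */
};

/* Token to yield when the input ends in the state described by `info`:
 * a dead end is an unexpected end of input, anything else (for
 * example, ignored trailing whitespace) is a clean EOF. */
enum token
token_at_end_of_input(const struct end_info *info)
{
    assert(info != NULL);
    return info->dead_end ? TOK_UNKNOWN : TOK_EOF;
}
```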

(Contributed by Scott)

Advice on how to use libfsm for generating performant pattern matchers

Spotted by both June and Scott, thanks

Suggested by Scott

Github actions cache fluffery

Add re_interpolate groups()

This had a lot of conflicts, because katef#513 cleaned up the eager output code throughout.

I ran into a test failure:

    grep FAIL build/tests/*/*res*; [ $? -ne 0 ]
    grep: build/tests/eager_output/run_mixed_start_anchor_regression: binary file matches
    *** Error code 1

because tests/eager_output/eager_output_mixed_start_anchor_regression.c contains the substring "res" in "regression", and the binary file contains "FAIL":

    $ strings build/tests/eager_output/run_mixed_start_anchor_regression | grep FAIL
    VM_END_FAIL
    VM_FAIL

This shouldn't matter for testing purposes; the 'test' target only cares about the test result files containing "PASS" or "FAIL". The change in cb42d58 to allow prefixed result files (e.g. "dyn-fdgetc-getc-res0") made this check ANY files with "res" in the name, but grep should ignore binary files.

Makefile: grep for 'FAIL' should use -I to ignore binary files.

silentbicycle · 2026-02-09T18:56:56Z

Adding katef#519, because that issue is failing CI here.

…r-output

katef and others added 30 commits October 23, 2024 11:15

Missing documentation for -x

98dfb2d

Too much documentation for -u.

280ca7c

Merge pull request katef#502 from katef/kate/missing-manpage-flag

6dc7305

Missing documentation for -x in the rx(1) manpage

Fix memory leak.

7ebc1eb

The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.

Merge pull request katef#504 from katef/sv/fix-memory-leak-from-deter…

3cf9ee7

…minise-with-config Fix a memory leak during `fsm_determinise_with_config`'s early exit

Update CI to use Ubuntu 22.04 for now so we do not get bleeding edge …

3572258

…clang and fail builds

Merge pull request katef#505 from deg4uss3r/rth/fix-ci

db6dcdd

Update CI to use Ubuntu 22.04

retest: Remove isatty check and extra logging output.

c89c15d

While discussing other issues, kate said she'd rather not have this have different behavior when stdout is / isn't a tty.

Merge pull request katef#506 from katef/sv/remove-isatty-check

f6fc836

retest: Remove `isatty` check and extra logging output.

Makefile: Check '*res*' not 'res*' for tests.

cb42d58

This will miss failures in prefixed res files, such as build/tests/lxpos/dyn-fdgetc-getc-res0

Re-enable lxpos tests.

862a68c

Add `${LX}` as a dependency for the targets using it. Remove the `getcio=${io}` and `io=${io}` arguments to cat. Those may be a merge error? They just produce a warning.

Use $LX_BIN instead of $LX in lxpos makefile.

affca78

Some of the CI test matrix builds set LX to 'true; echo lx', but that obviously won't work for tests that actually need to run lx in order to exercise its output.

lx: Make -l dump's output call lx.free() when using dynamic buffer.

a8f0c59

lx: Use prefix.tok, not "TOK_".

552aa01

lx: return TOK_ERROR if reaching the end of a zone function.

beebd1b

lx: Rewrite logic to make the four cases explicit, fix dead code.

ea9c90b

This was previously ending up with a useless call to the current zone after returning the token ("case S1: return TOK_UNKNOWN; lx->z(lx);"), which led to a warning in CI [-Werror=implicit-fallthrough=].

lx: Only gen fixedpop / dynpop & calls to them when buffer mode is set.

17d415d

lx: Suppress warning for possibly unused function.

7ed18b9

These may or may not be called, depending on the input.

lx: Ensure prefix.api & prefix.lx are used in the generated code.

1e55db8

In some cases this was hardcoding "lx_" in the generated code, which could lead to build failures if 'lx -e' was used to override the default prefix.

Replace FSM_ADVANCE_HOOK macro with optional hooks->advance callback.

4a5ca84

The advance hook should also be called for FSM_IO_STR.

f25e8b7

Move setting has_consumed_input flag into lx's advance hook.

051aaf0

This avoids cluttering libfsm's print output with `has_consumed_input`, which is specific to lx.

lx: Avoid useless call to pop and some other 'unused' warnings.

08fd72c

Tested with every combination of (dyn+fgetc, fixed+fgetc, pair, str), with and without '-x buf', '-x pos', or both.

Merge pull request katef#509 from katef/sv/fix-lx-token-identification

c897e9d

Fix lx token identification

katef and others added 22 commits November 27, 2025 16:21

Blurb on calling the generated code.

bf867f2

Blurb on bounded repetition.

4351273

(Contributed by Scott)

Markup.

d8ab92e

Merge pull request katef#515 from katef/gsusanto-docs

27802dc

Advice on how to use libfsm for generating performant pattern matchers

First cut at re_interpolate_groups()

e6683da

Add start,end error reporting

2f0c772

Clarification.

1a6f007

Fill out placeholders for writing out output.

cf2bb0a

Convincing myself string offsets are convenient

3234a7c

Allow a NULL output string.

433c8b8

Clarification.

4ae257e

Spotted by both June and Scott, thanks

Defensively terminate the output buffer on error.

0d487a6

Suggested by Scott

Update to actions/cache@v5

8247320

fail-on-cache-miss: for grabbing arbitrary builds.

6d1bd9b

Explicitly allow build cache miss for makefile tests.

c1203e3

Explicitly fail-on-cache-miss for other things too.

3f69a7c

cache/restore where possible.

661105e

Merge branch 'kate/actions-fluffery' into kate/interpolate_groups

74ab9a7

Merge pull request katef#517 from katef/kate/actions-fluffery

d817464

Github actions cache fluffery

Merge pull request katef#516 from katef/kate/interpolate_groups

a1526ab

Add re_interpolate groups()

Merge mishap, accidentally @v4

c798ec1

Merge branch 'main' into upstream-sync

051be1a

This had a lot of conflicts, because katef#513 cleaned up the eager output code throughout.

silentbicycle requested a review from katef February 9, 2026 17:57

silentbicycle and others added 2 commits February 9, 2026 12:57

Merge pull request katef#519 from katef/sv/grep-test-ignore-binary-files

97cdb4e

Makefile: grep for 'FAIL' should use -I to ignore binary files.

Merge remote-tracking branch 'origin/main' into sv/upstream-sync-eage…

5c0ed62

…r-output

katef approved these changes Feb 9, 2026


katef merged commit 00ffc70 into main Feb 9, 2026
4 checks passed

katef deleted the sv/upstream-sync-eager-output branch February 9, 2026 23:33

katef merged 97 commits into main from sv/upstream-sync-eager-output


5 participants
