Upstream sync, mostly eager output interface cleanup #37
Merged
Missing documentation for -x in the rx(1) manpage
The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.
I verified that LeakSanitizer catches the leak during this test when the fix is reverted. This commit adds a test directory, `tests/regressions`, for misc. regression tests that need a .c file and aren't easily testable via fsm, re, etc. inputs.
…minise-with-config Fix a memory leak during `fsm_determinise_with_config`'s early exit
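As a rough illustration of the kind of bug described above, here is a minimal, self-contained sketch of an early-exit path that still releases its working storage. It does not use libfsm: `build_with_limit`, `tracked_alloc`, `tracked_free`, `live_allocs`, and the `RESULT_*` codes are all hypothetical names invented for this example, and the allocation counter only stands in for what LeakSanitizer would report.

```c
#include <stdlib.h>

/* Hypothetical result codes, loosely modelled on an "OK / limit reached" API. */
enum result { RESULT_OK, RESULT_STATE_LIMIT_REACHED, RESULT_ERRNO };

/* Counts live allocations so a test can observe a leak without a sanitizer. */
static int live_allocs;

static void *
tracked_alloc(size_t n)
{
	void *p = malloc(n);
	if (p != NULL) { live_allocs++; }
	return p;
}

static void
tracked_free(void *p)
{
	if (p != NULL) { live_allocs--; free(p); }
}

/* Fill in up to wanted_states entries, halting early once state_limit is hit.
 * Every exit after the allocation goes through the cleanup label. */
enum result
build_with_limit(size_t wanted_states, size_t state_limit)
{
	enum result res = RESULT_ERRNO;
	size_t i;
	int *edge_sets;

	edge_sets = tracked_alloc(sizeof *edge_sets * (wanted_states ? wanted_states : 1));
	if (edge_sets == NULL) {
		return RESULT_ERRNO;
	}

	for (i = 0; i < wanted_states; i++) {
		if (i >= state_limit) {
			res = RESULT_STATE_LIMIT_REACHED;
			goto cleanup; /* early exit still frees edge_sets */
		}
		edge_sets[i] = (int) i;
	}
	res = RESULT_OK;

cleanup:
	tracked_free(edge_sets);
	return res;
}
```

A regression test in this style calls the function with a limit low enough to force the early exit and then checks that nothing is left allocated.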
…clang and fail builds
Update CI to use Ubuntu 22.04
While discussing other issues, kate said she'd rather not have this have different behavior when stdout is / isn't a tty.
retest: Remove `isatty` check and extra logging output.
print/ir.h only depended on it for FSM_SIGMA_COUNT in `struct ir_state_table`, and IR_TABLE isn't actually implemented, so this can be removed for now. lx/print/c.c only needed internal.h because of print/ir.h, and because of several direct accesses to fsm->statecount, which are easily replaced by calls to fsm_countstates.
This will miss failures in prefixed res files, such as
build/tests/lxpos/dyn-fdgetc-getc-res0
Changing the leaf and endleaf callbacks to accept and reject in katef#485 broke lx, but it went unnoticed for a while. This fixes it.
libfsm's normal execution mode evaluates a DFA character by character, terminating either when the next character isn't a valid edge or when the end of input is reached (in which case it checks end state metadata). lx's execution mode is a little different, because it's tokenizing: instead of reading to the end of input, it should consume as much consecutive input as matches a particular token, then push back the last character read (so it can resume with it as context for the next token), yield the token type, and suspend.
lx used to work by breaking abstraction and calling directly into `fsm_print_cfrag` (overriding the leaf behavior to yield token types, and adding an extra 'NONE' state to the generated state machine code), but when the callback interfaces shifted, its internals no longer fit what lx expected. Now the reject hook is passed the same state metadata as the accept hook, and the reject hook in lx checks whether the end id is associated with a particular AST mapping and token type.
This is only implemented for the "c" output format, but similar changes could possibly make others usable without a lot more work. In particular, kate mentioned it'd be good to be able to use "vmc" output instead of "c" moving forward.
Most of the code changes happen inside of lx's code generation, but there are a few elsewhere:
- The reject hook now has a state_metadata pointer, so update the callers for all the output formats.
- libfsm's 'c' output now includes a macro `FSM_ADVANCE_HOOK(C)`, which is called with the next character read in the FSM_IO_STR and FSM_IO_PAIR io modes immediately after advancing. This is used to inform lx's internal bookkeeping about token positions and buffering token names. FSM_IO_GETC doesn't need it, because its getc callback manages the character stream. The macro defaults to a no-op when undefined.
- libfsm's 'c' output also includes a flag, `has_consumed_input`, so the code expanded in place from the reject/accept hooks can determine whether the state machine input handler loop has consumed any input. This was previously encoded by the extra NONE state. lx's code generation using this flag is a bit cluttered, because the reject hook doesn't know whether it's expanding for the end states, but it's probably not worth changing the reject hook type signature to add another flag. This results in checks for has_consumed_input in code paths where trivial static analysis would show it to be dead code, and some extra unreachable code at the end of the function.
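The tokenizing mode described above can be sketched with a small hand-written state machine. This is not lx's generated code: the token set, the `next_token` function, and the `ADVANCE_HOOK` macro are assumptions made for illustration. It only mirrors the ideas of a hook that defaults to a no-op, a `has_consumed_input` flag, and leaving the first non-matching character unread for the next call.

```c
#include <stddef.h>

/* Hypothetical hook, called with each character right after it is consumed.
 * Like the macro described above, it defaults to a no-op when undefined. */
#ifndef ADVANCE_HOOK
#define ADVANCE_HOOK(C) ((void) (C))
#endif

enum token { TOK_EOF, TOK_UNKNOWN, TOK_IDENT, TOK_NUMBER };
enum state { S_START, S_IDENT, S_NUMBER };

static int is_lower(int c) { return c >= 'a' && c <= 'z'; }
static int is_digit(int c) { return c >= '0' && c <= '9'; }

/* Scan one token starting at *p. On return, *p points at the first character
 * that was not part of the token, so the next call resumes from it. */
enum token
next_token(const char **p)
{
	enum state s = S_START;
	int has_consumed_input = 0;

	while (**p == ' ') { (*p)++; } /* blanks separate tokens */

	for (;;) {
		int c = (unsigned char) **p;
		enum state next = s;

		switch (s) {
		case S_START:
			if (is_lower(c)) { next = S_IDENT; }
			else if (is_digit(c)) { next = S_NUMBER; }
			else { goto reject; }
			break;
		case S_IDENT:
			if (!is_lower(c)) { goto reject; }
			break;
		case S_NUMBER:
			if (!is_digit(c)) { goto reject; }
			break;
		}

		(*p)++;
		has_consumed_input = 1;
		ADVANCE_HOOK(c);
		s = next;
	}

reject:
	if (!has_consumed_input) {
		if (**p == '\0') { return TOK_EOF; }
		(*p)++; /* skip the offending character so callers make progress */
		return TOK_UNKNOWN;
	}
	return s == S_IDENT ? TOK_IDENT : TOK_NUMBER;
}
```

In string mode, "pushing back" the last character amounts to not advancing the pointer past it; a getc-style reader would need an explicit one-character pushback buffer instead.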
Instead of having the EOF token occupy the same byte, line, and column position as the last token, it should immediately follow. The new lx codegen behaves this way, and katef and I decided that it made sense to keep it like that, as long as it's consistent.
Add `${LX}` as a dependency for the targets using it.
Remove the `getcio=${io}` and `io=${io}` arguments to cat. Those may be
a merge error? They just produce a warning.
Some of the CI test matrix builds set LX to 'true; echo lx', but that obviously won't work for tests that actually need to run lx in order to exercise its output.
Add a couple test cases (in8-10.txt) with an unexpected end of input, either in the middle of a pattern, or after matching the first pattern in a .. pair, but without matching the second. Supporting this changes the expected result for in6.txt: Previously it resulted in TOK_EOF, now it leads to TOK_UNKNOWN and produces a "lexically uncategorised" error message for the unexpected end of input. This change is necessary for fixing katef#386 / katef#508, and more generally to detect things like unterminated string literals.
This was previously ending up with a useless call to the current zone
after returning the token ("case S1: return TOK_UNKNOWN; lx->z(lx);"),
which led to a warning in CI [-Werror=implicit-fallthrough=].
These may or may not be called, depending on the input.
In some cases this was hardcoding "lx_" in the generated code, which could lead to build failures if 'lx -e' was used to override the default prefix.
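A minimal sketch of the fix's general shape, assuming a code generator that builds identifiers from a configurable prefix instead of a literal "lx_". The `emit_symbol` helper and the symbol name "next" are hypothetical and are not taken from lx's source.

```c
#include <stdio.h>

/* Write prefix followed by name into buf. Returns the number of characters
 * the full identifier needs (snprintf's return value), so the caller can
 * detect truncation by comparing it against size. */
int
emit_symbol(char *buf, size_t size, const char *prefix, const char *name)
{
	return snprintf(buf, size, "%s%s", prefix, name);
}
```

Routing every generated identifier through one helper like this keeps the prefix consistent across the emitted code, whatever prefix the user selects.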
This avoids cluttering libfsm's print output with `has_consumed_input`, which is specific to lx.
Tested with every combination of (dyn+fgetc, fixed+fgetc, pair, str), with and without '-x buf', '-x pos', or both.
Fix lx token identification
PR katef#509 introduced a bug: it didn't distinguish between an unexpected end of input and an end of input in a zone that matches but ignores its input. This caused several lxpos tests to fail due to getting a TOK_UNKNOWN rather than a TOK_EOF when the input has trailing whitespace, but I didn't notice until after merging because the normal build doesn't regenerate the code for src/lx/lexer.lx or src/libfsm/lexer.lx. (I had ensured all the libre dialect lexers and parsers were regenerated, but missed those.) Instead of always printing TOK_UNKNOWN, this inspects the zone mappings to determine whether the current end ID represents a dead end for the zone. If not, it should instead print TOK_EOF.
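The distinction can be sketched as a small decision function. This is an assumption-laden illustration, not lx's implementation: `struct mapping`, its fields, `TOK_WORD`, and `token_at_end_of_input` are hypothetical stand-ins for whatever the zone mappings actually record.

```c
#include <stddef.h>

enum token { TOK_EOF, TOK_UNKNOWN, TOK_WORD };

/* Hypothetical per-state record; not lx's real data structure. */
struct mapping {
	int is_end;       /* the state completes some pattern in the zone */
	int has_token;    /* the pattern yields a token (0: matched but ignored) */
	enum token token; /* token to yield when has_token is set */
};

/* Decide what to report when the input ends while the lexer is in the
 * state described by m. */
enum token
token_at_end_of_input(const struct mapping *m, int has_consumed_input)
{
	if (!has_consumed_input) {
		return TOK_EOF;     /* nothing pending: a clean end of input */
	}
	if (!m->is_end) {
		return TOK_UNKNOWN; /* input stopped in the middle of a pattern */
	}
	if (m->has_token) {
		return m->token;    /* a complete token precedes the end of input */
	}
	return TOK_EOF;         /* matched-but-ignored input, e.g. trailing whitespace */
}
```

Under these assumptions, trailing whitespace reaches the last branch and yields TOK_EOF, while an unterminated pattern is reported as TOK_UNKNOWN.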
(Contributed by Scott)
Advice on how to use libfsm for generating performant pattern matchers
Spotted by both June and Scott, thanks
Suggested by Scott
Github actions cache fluffery
Add re_interpolate groups()
This had a lot of conflicts, because katef#513 cleaned up the eager output code throughout.
I ran into a test failure:
grep FAIL build/tests/*/*res*; [ $? -ne 0 ]
grep: build/tests/eager_output/run_mixed_start_anchor_regression: binary file matches
*** Error code 1
because tests/eager_output/eager_output_mixed_start_anchor_regression.c
contains the substring "res" in "regression", and the binary file
contains "FAIL":
$ strings build/tests/eager_output/run_mixed_start_anchor_regression | grep FAIL
VM_END_FAIL
VM_FAIL
This shouldn't matter for testing purposes; the 'test' target only cares
about the test result files containing "PASS" or "FAIL". The change in
cb42d58 to allow prefixed result files (e.g. "dyn-fdgetc-getc-res0")
made this check ANY files with "res" in the name, but grep should ignore
binary files.
Makefile: grep for 'FAIL' should use -I to ignore binary files.
Adding katef#519, because that issue is failing CI here.
katef
approved these changes
Feb 9, 2026
This had quite a lot of conflicts, largely because katef#513 cleaned up the eager output interfaces that were originally merged here (with more narrowly scoped functionality).