
Upstream sync, mostly eager output interface cleanup #37

Merged
katef merged 97 commits into main from sv/upstream-sync-eager-output
Feb 9, 2026
Conversation

@silentbicycle

This had quite a lot of conflicts, largely because katef#513 cleaned up the eager output interfaces that were originally merged here (with more narrowly scoped functionality).

katef and others added 30 commits October 23, 2024 11:15
Missing documentation for -x in the rx(1) manpage
The edge sets leak when halting with FSM_DETERMINISE_WITH_CONFIG_STATE_LIMIT_REACHED.
I verified that LeakSanitizer catches the leak during this test when the
fix is reverted.

This commit adds a test directory, `tests/regressions`, for misc.
regression tests that need a .c file and aren't easily testable via fsm,
re, etc. inputs.
…minise-with-config

Fix a memory leak during `fsm_determinise_with_config`'s early exit
While discussing other issues, kate said she'd rather not have this
have different behavior when stdout is / isn't a tty.
retest: Remove `isatty` check and extra logging output.
print/ir.h only depended on it for FSM_SIGMA_COUNT in `struct
ir_state_table`, and IR_TABLE isn't actually implemented, so
this can be removed for now.

lx/print/c.c only needed internal.h because of print/ir.h, and
because of several direct accesses to fsm->statecount, which are
easily replaced by calls to fsm_countstates.
This will miss failures in prefixed res files, such as

    build/tests/lxpos/dyn-fdgetc-getc-res0
Changing the leaf and endleaf callbacks to accept and reject in katef#485
broke lx, but it went unnoticed for a while. This fixes it.

libfsm's normal execution mode evaluates a DFA, character by character,
terminating either when the next character isn't a valid edge or the end
of input is reached (in which case it checks end state metadata). lx's
execution mode is a little different, because it's tokenizing -- instead
of reading to the end of input, it should consume as much consecutive
input as matches a particular token, then push back the last character
read (so it can resume with it as context for the next token), yield the
token type, and suspend.
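
As a rough illustration of the tokenizing mode described above, here is a minimal hand-written sketch. It is not lx's generated code: `struct lexer`, `lexer_getc`, `lexer_ungetc`, `lexer_next`, the token names, and the whitespace skipping are all assumptions made for the example. It consumes the longest run of input matching a token, pushes back the one character that ended the run, and returns the token type.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token types, for illustration only. */
enum tok { TOK_EOF, TOK_IDENT, TOK_NUMBER, TOK_UNKNOWN };

/* Hypothetical lexer state: a NUL-terminated input plus a read offset. */
struct lexer {
	const char *input;
	size_t pos;
};

/* Read one character, or return -1 at the end of input. */
static int lexer_getc(struct lexer *lx)
{
	const unsigned char c = (unsigned char)lx->input[lx->pos];
	if (c == '\0') {
		return -1;
	}
	lx->pos++;
	return c;
}

/* Push back the most recently read character. */
static void lexer_ungetc(struct lexer *lx)
{
	lx->pos--;
}

/* Yield the next token: consume the longest matching run, then push
 * back the character that ended it so the next call starts there. */
static enum tok lexer_next(struct lexer *lx)
{
	int c;

	/* Skip whitespace between tokens (an assumption of this sketch). */
	do { c = lexer_getc(lx); } while (c != -1 && isspace(c));
	if (c == -1) {
		return TOK_EOF;
	}

	if (isalpha(c)) {
		do { c = lexer_getc(lx); } while (c != -1 && isalnum(c));
		if (c != -1) { lexer_ungetc(lx); }
		return TOK_IDENT;
	}

	if (isdigit(c)) {
		do { c = lexer_getc(lx); } while (c != -1 && isdigit(c));
		if (c != -1) { lexer_ungetc(lx); }
		return TOK_NUMBER;
	}

	return TOK_UNKNOWN;
}
```

Calling `lexer_next` repeatedly on `"ab12 34+"` yields an identifier, a number, an unknown token for `+`, and then the end of input. Each call resumes at the character that was pushed back by the previous one.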

lx used to work by breaking abstraction and calling directly into
`fsm_print_cfrag` (overriding the leaf behavior to yield token types,
and adding an extra 'NONE' state to the generated state machine code),
but when the callback interfaces shifted, its internals no longer fit
what lx expected. Now the reject hook is passed the same state
metadata as the accept hook, and the reject hook in lx checks whether
the end id is associated with a particular AST mapping and token type.

This is only implemented for the "c" output format, but similar changes
could possibly make others usable without a lot more work. In
particular, kate mentioned it'd be good to be able to use "vmc" output
instead of "c" moving forward.

Most of the code changes happen inside of lx's code generation, but
there are a few elsewhere:

- The reject hook now has a state_metadata pointer, so update the
callers for all the output formats.

- libfsm's 'c' output now includes a macro `FSM_ADVANCE_HOOK(C)`,
which is called with the next character read in the FSM_IO_STR and
FSM_IO_PAIR io modes immediately after advancing. This is used to inform
lx's internal bookkeeping about token positions and buffering token
names. FSM_IO_GETC doesn't need it, because its getc callback manages
the character stream. The macro defaults to a no-op when undefined.

- libfsm's 'c' output also includes a flag, `has_consumed_input`, so the
code expanded in place from the reject/accept hooks can determine when
the state machine input handler loop has consumed any input. This was
previously encoded by the extra NONE state.
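
The last two items can be sketched roughly as follows. This is a hand-written stand-in, not libfsm's actual 'c' output: only the names `FSM_ADVANCE_HOOK` and `has_consumed_input` come from the description above, while `match_a_plus_b`, the `advanced` counter, and the pattern `a+b` are assumptions for the example. The caller defines the hook before the matcher code, the `#ifndef` default keeps it a no-op for callers that do not, and the flag records whether the loop consumed any input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical caller-side bookkeeping: count characters advanced over. */
static size_t advanced;
#define FSM_ADVANCE_HOOK(C) do { (void)(C); advanced++; } while (0)

/* Default for callers that do not define the hook: expand to nothing.
 * It is skipped here because the hook is already defined above. */
#ifndef FSM_ADVANCE_HOOK
#define FSM_ADVANCE_HOOK(C) /* no-op */
#endif

/* Hand-written string-mode matcher for the pattern a+b.
 * Returns 1 on a match and 0 otherwise; *consumed reports whether
 * the loop read at least one character. */
static int match_a_plus_b(const char *s, int *consumed)
{
	enum { S0, S1, S2 } state = S0;
	int has_consumed_input = 0;
	int ok = 1;
	const char *p;

	for (p = s; ok && *p != '\0'; p++) {
		const char c = *p;

		/* Invoke the hook with the character just read. */
		FSM_ADVANCE_HOOK(c);
		has_consumed_input = 1;

		switch (state) {
		case S0:
			if (c == 'a') { state = S1; } else { ok = 0; }
			break;
		case S1:
			if (c == 'b') { state = S2; } else if (c != 'a') { ok = 0; }
			break;
		case S2:
			ok = 0; /* nothing may follow the final 'b' */
			break;
		}
	}

	*consumed = has_consumed_input;
	return ok && state == S2;
}
```

With the flag, code expanded at a reject or accept point can tell "rejected before reading anything" apart from "rejected partway through a token" without needing a separate sentinel state.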

lx's code generation using this flag is a bit cluttered, because the
reject hook doesn't know whether it's expanding for the end states, but
it's probably not worth changing the reject hook type signature to add
another flag. This results in checks for has_consumed_input in code
paths where trivial static analysis would show it to be dead code, and
some extra unreachable code at the end of the function.
Instead of having the EOF token occupy the same byte, line, and column
position as the last token, it should immediately follow.

The new lx codegen behaves this way, and katef and I decided that it
made sense to keep it like that, as long as it's consistent.
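
A small sketch of the "immediately follows" convention, under assumptions that may not match lx's own bookkeeping: `struct pos`, `pos_advance`, and `eof_pos` are hypothetical names, bytes are counted from 0, and lines and columns are counted from 1. The EOF token starts at the position just past the last character consumed, rather than sharing the last token's position.

```c
#include <assert.h>

/* Hypothetical position record; lx's own struct may differ. */
struct pos {
	unsigned byte; /* 0-based byte offset */
	unsigned line; /* 1-based line number */
	unsigned col;  /* 1-based column number */
};

/* Advance a position past one character. A newline moves to
 * column 1 of the next line. */
static struct pos pos_advance(struct pos p, char c)
{
	p.byte++;
	if (c == '\n') {
		p.line++;
		p.col = 1;
	} else {
		p.col++;
	}
	return p;
}

/* Position just past the end of the input, which is where a trailing
 * EOF token would start under this convention. */
static struct pos eof_pos(const char *input)
{
	struct pos p = { 0, 1, 1 };
	for (; *input != '\0'; input++) {
		p = pos_advance(p, *input);
	}
	return p;
}
```

For the input `"ab cd"`, the last token `cd` starts at byte 3, column 4, and the EOF position computed here is byte 5, column 6.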
Add `${LX}` as a dependency for the targets using it.

Remove the `getcio=${io}` and `io=${io}` arguments to cat. Those may be
a merge error? They just produce a warning.
Some of the CI test matrix builds set LX to 'true; echo lx', but that
obviously won't work for tests that actually need to run lx in order to
exercise its output.
Add a couple of test cases (in8-10.txt) with an unexpected end of input,
either in the middle of a pattern, or after matching the first pattern
in a .. pair, but without matching the second.

Supporting this changes the expected result for in6.txt: Previously it
resulted in TOK_EOF, now it leads to TOK_UNKNOWN and produces a
"lexically uncategorised" error message for the unexpected end of input.
This change is necessary for fixing katef#386 / katef#508, and more generally
to detect things like unterminated string literals.
This was previously ending up with a useless call to the current zone
after returning the token ("case S1: return TOK_UNKNOWN; lx->z(lx);"),
which led to a warning in CI [-Werror=implicit-fallthrough=].
These may or may not be called, depending on the input.
In some cases this was hardcoding "lx_" in the generated code, which
could lead to build failures if 'lx -e' was used to override the default
prefix.
This avoids cluttering libfsm's print output with `has_consumed_input`,
which is specific to lx.
Tested with every combination of (dyn+fgetc, fixed+fgetc, pair, str),
with and without '-x buf', '-x pos', or both.
PR katef#509 introduced a bug: It didn't distinguish between an unexpected
end of input and an end of input in a zone that matches but ignores its
input. This caused several lxpos tests to fail due to getting a
TOK_UNKNOWN rather than a TOK_EOF when the input has trailing
whitespace, but I didn't notice until after merging because the normal
build doesn't regenerate the code for src/lx/lexer.lx or
src/libfsm/lexer.lx. (I had ensured all the libre dialect lexers and
parsers were regenerated, but missed those.)

Instead of always printing TOK_UNKNOWN, this inspects the zone
mappings to determine whether the current end ID represents a dead end
for the zone. If not, it should instead print TOK_EOF.
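
To make the distinction concrete, here is a minimal sketch with two hypothetical zones. It does not reproduce lx's zone mappings or end IDs: `enum zone`, `dead_end_at_eof`, and `token_at_end_of_input` are assumptions for the example. Running out of input in the top-level zone (for instance after trailing whitespace) gives TOK_EOF, while running out of input inside an unterminated string gives TOK_UNKNOWN.

```c
#include <assert.h>

enum tok { TOK_EOF, TOK_UNKNOWN };

/* Hypothetical zones: top level, and inside a double-quoted string. */
enum zone { Z_TOP, Z_STRING, Z_COUNT };

/* Hypothetical table: is the end of input a dead end in this zone? */
static const int dead_end_at_eof[Z_COUNT] = {
	0, /* Z_TOP: input may legitimately end here */
	1  /* Z_STRING: the closing quote is still missing */
};

/* Scan the whole input, switching zones on each double quote, and
 * report which token the end of input should produce. */
static enum tok token_at_end_of_input(const char *input)
{
	enum zone z = Z_TOP;

	for (; *input != '\0'; input++) {
		if (*input == '"') {
			z = (z == Z_TOP) ? Z_STRING : Z_TOP;
		}
		/* Everything else is ignored: whitespace at the top
		 * level, string body inside Z_STRING. */
	}

	return dead_end_at_eof[z] ? TOK_UNKNOWN : TOK_EOF;
}
```

Here `"\"abc\"  \n"` ends in the top-level zone and yields TOK_EOF despite the trailing whitespace, whereas `"\"abc"` ends inside the string zone and yields TOK_UNKNOWN.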
@silentbicycle silentbicycle requested a review from katef February 9, 2026 17:57
silentbicycle and others added 2 commits February 9, 2026 12:57
I ran into a test failure:

    grep FAIL build/tests/*/*res*; [ $? -ne 0 ]
    grep: build/tests/eager_output/run_mixed_start_anchor_regression: binary file matches
    *** Error code 1

because tests/eager_output/eager_output_mixed_start_anchor_regression.c
contains the substring "res" in "regression", and the binary file
contains "FAIL":

    $ strings build/tests/eager_output/run_mixed_start_anchor_regression | grep FAIL
    VM_END_FAIL
    VM_FAIL

This shouldn't matter for testing purposes: the 'test' target only cares
about the test result files containing "PASS" or "FAIL". The change in
cb42d58 to allow prefixed result files (e.g. "dyn-fdgetc-getc-res0")
made this check ANY files with "res" in the name, but grep should ignore
binary files.
Makefile: grep for 'FAIL' should use -I to ignore binary files.
@silentbicycle
Author

Adding katef#519, because that issue is failing CI here.

@katef katef merged commit 00ffc70 into main Feb 9, 2026
4 checks passed
@katef katef deleted the sv/upstream-sync-eager-output branch February 9, 2026 23:33

5 participants