Integrate combinable DFA capture resolution for our PCRE dialect regexes, part 1. by silentbicycle · Pull Request #440 · katef/libfsm

silentbicycle · 2023-07-24T21:04:22Z

This adds an extra compilation step for PCRE regexes, generating programs for a group capture resolution abstract machine. The implementation builds on the approach described in "Regular Expression Matching: the Virtual Machine Approach", with adaptations to reduce the runtime memory footprint, handle some edge cases the same way PCRE does, explicitly detect some other PCRE edge cases and reject them as unsupported, and continue supporting group captures as the resulting DFAs are combined with other DFAs into a single DFA that matches them all in one pass

I have fuzzed this code quite a lot, both comparing group capture results to PCRE's and comparing libfsm's handling of individual regexes' DFAs with the DFA produced my combining them. Many of the test cases in tests/capture/capture_test_case_list.c came from fuzzing. My working branch diverged from main for a while during this work, in particular the metadata associated with end states, but after integrating upstream changes I fuzzed it further. I think I integrated things properly, but it's something to be mindful of during review. Some unrelated issues discovered during fuzzing have already been posted as separate PRs.

Worst-case memory usage is proportional to the size of the capvm program's length (known at compile time) and the input length -- we record a bit for each branch taken while evaluating the regex (e.g. either greedily jumping to repeat a subexpression or non-greedily advancing in the regex), so memory usage will slowly increase as inputs are evaluated. Common path prefixes are shared between threads, and the total thread count (and divergence) is bounded by the size of the opcode program. As a special case, a long path prefix of all zero bits is collapsed down to a counter; this usually happens because of an unanchored start loop, where each zero bit represents a path continuing to advance through the input without starting the match yet.

This PR targets an integration branch because some work is not yet complete:

Some quick cleanup of CI issues -- the submodule's build config is more strict about variables that are set but unused (because their use is in logging code that compiles away).
A few pathological cases having to do with repetition in combination with anchors are rejected as unsupported, such as ^(()($)|x)+$. These are probably not worth the trouble to support. It also doesn't support the \z operator yet.
Some of the combinations of CLI flags for re aren't enabled yet, such as capture resolution for multiple files, and the re docs need updating for some new flags.
We should be able to calculate likely memory usage info at regex compile time, which could be used by the caller to stack-allocate and reuse a preallocated buffer rather than using dynamic allocation at runtime, but I haven't implemented that yet. The memory usage should small in practice, but it would be nice to completely eliminate runtime dynamic allocation. The path metadata table currently grows on demand, the other data structures are either fixed-size or the same length as the opcode count. Cases where it would hit the caller's memory limit for growing the path table are likely to hit PCRE's limits much sooner.
Code generation does not output the capture resolution abstract machine's opcodes yet. When using capture handling, libfsm's caller will need to either link a small implementation of the abstract machine or have its code generation produce a standalone implementation, but because that brings some architectural trade-offs I wanted to wait until this round has been reviewed.

silentbicycle · 2023-09-27T18:41:21Z

I've merged the current main into this and fuzzed it for another 20 minutes, on 8 CPU cores.

katef · 2023-10-01T14:44:43Z

+	MODE_REGEX,
+	MODE_REGEX_SINGLE_ONLY,
+	MODE_REGEX_MULTI_ONLY,
+	MODE_IDEMPOTENT_DET_MIN,


Can you add these new modes to ci.yml too please? where currently we have:

mode: [ m, p, d ]

Updated in 239927c.

katef · 2023-10-01T14:47:47Z

+
+static int
+compare_with_pcre(const char *pattern, struct fsm *fsm)
+{


katef · 2023-10-01T14:50:26Z

+			/* This triggers an "unreached" assertion in the parser.
+			 * It's already been reported (issue #386), but once the
+			 * fuzzer finds it, it will report it over and over.
+			 * Exit here so that the fuzzer considers it uninteresting. */


There's extra error codes in #440 for regexes that aren't UNSATISFIABLE per se, but depend on particular corner cases in PCRE that probably aren't worth supporting in an automata-based implementation. Add a test case for one, tests/pcre/in48.re: ^a|$[^x]b* This is a tricky one to handle properly; according to PCRE it should match either "a<anything...>" OR "\n", but nothing else. The newline match is because $ is a non-input-consuming check that evaluation is either at the end of input, or at a newline immediately before the end. In this case `$[^x]b*` matches exactly one newline; it's equivalent to "$\n". This probably isn't worth supporting, but we can detect cases where a potential newline match appears after a $ and reject them as an unsupported PCRE behavior.

Endleaf rejiggle, .fsm syntax for endids, refactor endid get/set api, various related bugfixes

…ssions.

…sages.

…ng at \n-terminated patterns in-situ. This is just so much less complicated.

I think at the time of writing, fsm_example() is broken, because I get no output here. Possibly related to #438

…input files.

…ray. This sets the scene for multi-file pattern lists, later on. But also seems to keep scopes tighter and make things simpler as a byproduct.

(Contributed by Scott)

Advice on how to use libfsm for generating performant pattern matchers

Spotted by both June and Scott, thanks

Suggested by Scott

Github actions cache fluffery

Add re_interpolate groups()

I ran into a test failure: grep FAIL build/tests/*/*res*; [ $? -ne 0 ] grep: build/tests/eager_output/run_mixed_start_anchor_regression: binary file matches *** Error code 1 because tests/eager_output/eager_output_mixed_start_anchor_regression.c contains the substring "res" in "regression", and the binary file contains "FAIL: $ strings build/tests/eager_output/run_mixed_start_anchor_regression | grep FAIL VM_END_FAIL VM_FAIL This shouldn't matter for testing purposes, the 'test' target only cares about the test result files containing "PASS" or "FAIL". The change in cb42d58 to allow prefixed result files (e.g. "dyn-fdgetc-getc-res0") made this check ANY files with "res" in the name, but grep should ignore binary files.

Makefile: grep for 'FAIL' should use -I to ignore binary files.

…resolution' into sv/integrate-combinable-DFA-capture-resolution This was a really gnarly merge and took a while to get through. Both the branch implementing captures and the branch implementing `fsm_union_repeated_pattern_group` made interface changes and added significant behavior to `ast_analysis.c`, in ways that were at times tricky to reconcile. `ast_compile.c` was also restructured, but some of the changes were necessary to inform capture and unioning functionality. As of now _almost_ all the tests are passing, but there are a couple specific things to note: - This brings in an interface change to `fsm_endid_set`, it now returns an enum rather than just an int. Because if you squint enough they are both just numbers (according to the C language spec) code using `if (!fsm_endid_set(...)) { ... }` will not get a warning for the changed meaning of the return code. I prefer having this be an enum, but it IS an interface change, and I'm not opposed to changing it back in a later commit. - `build/tests/capture/res_test_case_list:FAIL` I'm going to fix this in a later commit. I'm not 100% sure yet, but I think it's related to conflicting changes in the parser code, which I'm waiting to regenerate until after this merge commit. - `build/tests/endids/res10_minimise_partial_overlap:FAIL` This has to do with `AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE` or `AST_ANALYSIS_ERROR_UNSUPPORTED_CAPTURE` not being handled yet by code specific to the native dialect. I'm going to handle that later, but will have to check whether native should behave like PCRE in those particular cases or not. Either the error handling code needs to be updated, or the code raising the UNSUPPORTED error needs to check which regex dialect is in effect. - fuzz/target.c I have confirmed that the merged fuzzer harness code builds, but haven't yet spend time re-fuzzing anything. In my experience libfuzzer has bit-rotted over the last couple clang/LLVM releases and tends to nondeterministically crash in combination with some of the clang sanitizers now, so we may want to retarget this to using AFL-Fuzz++. That's well outside the scope of this PR, though. - src/lx/parser.act I updated this but haven't re-generated the parsers yet, and I updated the generated code directly with a one-line change to reflect the `fsm_endid_set` interface change. As mentioned above, I'll re-generate that code in a separate commit. Conflicts: - fuzz/target.c - include/adt/stateset.h - include/re/re.h - src/adt/stateset.c - src/fsm/main.c - src/libfsm/Makefile - src/libfsm/capture.c - src/libfsm/clone.c - src/libfsm/closure.c - src/libfsm/consolidate.c - src/libfsm/determinise.c - src/libfsm/determinise_internal.h - src/libfsm/endids.c - src/libfsm/epsilons.c - src/libfsm/exec.c - src/libfsm/internal.h - src/libfsm/merge.c - src/libfsm/minimise.c - src/libfsm/state.c - src/libre/ast.h - src/libre/ast_analysis.c - src/libre/ast_analysis.h - src/libre/ast_compile.c - src/libre/ast_rewrite.c - src/libre/re.c - src/lx/parser.act - src/re/main.c - tests/capture/captest.c - tests/capture/captest.h - tests/capture/capture3.c - tests/capture/capture4.c - tests/capture/capture5.c - tests/capture/capture_concat1.c - tests/capture/capture_concat2.c - tests/capture/capture_union1.c - tests/minimise/minimise_test_case_list.c

silentbicycle requested review from katef and sfstewman July 24, 2023 21:04

silentbicycle marked this pull request as ready for review July 25, 2023 19:18

This was referenced Sep 11, 2023

Integrate misc commits from #440 #442

Merged

Integrate PCRE capture handling fastly/libfsm#12

Closed

katef reviewed Oct 1, 2023

View reviewed changes

Comment thread fuzz/target.c

static int

compare_with_pcre(const char *pattern, struct fsm *fsm)

{

Copy link
Copy Markdown

Owner

katef Oct 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fantastic

katef reviewed Oct 1, 2023

View reviewed changes

silentbicycle mentioned this pull request Oct 19, 2023

Add error code for unsupported PCRE cases (RE_EUNSUPPPCRE), reject one. #447

Merged

katef added 17 commits June 15, 2024 21:10

Getting 0 end IDs is not an error.

99fa19c

Naming; ids[] for endids.

f663cd4

Merge pull request #479 from katef/kate/endleaf-rejiggle

0402f39

Endleaf rejiggle, .fsm syntax for endids, refactor endid get/set api, various related bugfixes

First cut at introducing rx, a tool to compile a set of regular expre…

a39f723

…ssions.

Move to structs for literal and id sets.

994390e

Fold together literal sets into an array indexed by flags.

976313f

Clarification.

f234a1c

Port to fsm_intersect_charset.

e74a9d4

Rework categorisation for (hopefully) clearer outputs and verbose mes…

ee57739

…sages.

Back to POSIX for this, I messed up strndup.

237544a

Tidying up.

708f7f8

Clarification, resolving a few TODOs, smaller CLI options etc.

3ab2029

Allocate and duplicate for \0-terminated patterns, rather than pointi…

37d34c8

…ng at \n-terminated patterns in-situ. This is just so much less complicated.

Simplification, deal with -n earlier.

aa0eca1

Give an example for AMBIG_ERROR.

b6fe284

I think at the time of writing, fsm_example() is broken, because I get no output here. Possibly related to #438

Move declined-file to an optarg, we need to keep argv[] for multiple …

7290f01

…input files.

Store patterns per set, rather than indexing into a global pattern ar…

798f8e7

…ray. This sets the scene for multi-file pattern lists, later on. But also seems to keep scopes tighter and make things simpler as a byproduct.

gsusanto-fastly and others added 30 commits November 27, 2025 12:40

Add documentation guide on how to use Libfsm effectively

8debf24

Update docs based on reviews

7306570

Revision #2

20f262c

Naming.

227f4ac

Markup.

c4d4ffe

Blurb on calling the generated code.

bf867f2

Blurb on bounded repetition.

4351273

(Contributed by Scott)

Markup.

d8ab92e

Merge pull request #515 from katef/gsusanto-docs

27802dc

Advice on how to use libfsm for generating performant pattern matchers

First cut at re_interpolate_groups()

e6683da

Add start,end error reporting

2f0c772

Clarification.

1a6f007

Fill out placeholders for writing out output.

cf2bb0a

Convincing myself string offsets are convenient

3234a7c

Allow a NULL output string.

433c8b8

Clarification.

4ae257e

Spotted by both June and Scott, thanks

Defensively terminate the output buffer on error.

0d487a6

Suggested by Scott

Update to actions/cache@v5

8247320

fail-on-cache-miss: for grabbing arbitrary builds.

6d1bd9b

Explicitly allow build cache miss for makefile tests.

c1203e3

Explicitly fail-on-cache-miss for other things too.

3f69a7c

cache/restore where possible.

661105e

Merge branch 'kate/actions-fluffery' into kate/interpolate_groups

74ab9a7

Merge pull request #517 from katef/kate/actions-fluffery

d817464

Github actions cache fluffery

Merge pull request #516 from katef/kate/interpolate_groups

a1526ab

Add re_interpolate groups()

Merge mishap, accidentally @v4

c798ec1

Merge pull request #519 from katef/sv/grep-test-ignore-binary-files

97cdb4e

Makefile: grep for 'FAIL' should use -I to ignore binary files.

Oops... missed installing <re/groups.h>

7ea5421

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Integrate combinable DFA capture resolution for our PCRE dialect regexes, part 1.#440

Integrate combinable DFA capture resolution for our PCRE dialect regexes, part 1.#440
silentbicycle wants to merge 418 commits into
sv/integration-target--combinable-DFA-capture-resolutionfrom
sv/integrate-combinable-DFA-capture-resolution

silentbicycle commented Jul 24, 2023 •

edited

Loading

Uh oh!

silentbicycle commented Sep 27, 2023

Uh oh!

katef Oct 1, 2023

Uh oh!

silentbicycle Oct 2, 2023

Uh oh!

katef Oct 1, 2023

Uh oh!

katef Oct 1, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Uh oh!

Conversation

silentbicycle commented Jul 24, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

silentbicycle commented Sep 27, 2023

Uh oh!

katef Oct 1, 2023

Choose a reason for hiding this comment

Uh oh!

silentbicycle Oct 2, 2023

Choose a reason for hiding this comment

Uh oh!

katef Oct 1, 2023

Choose a reason for hiding this comment

Uh oh!

katef Oct 1, 2023

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

silentbicycle commented Jul 24, 2023 •

edited

Loading