Integrate combinable DFA capture resolution for our PCRE dialect regexes, part 1.#440
Open
silentbicycle wants to merge 418 commits into
Conversation
This was referenced Sep 11, 2023
Collaborator
Author
|
I've merged the current |
katef
reviewed
Oct 1, 2023
| MODE_REGEX, | ||
| MODE_REGEX_SINGLE_ONLY, | ||
| MODE_REGEX_MULTI_ONLY, | ||
| MODE_IDEMPOTENT_DET_MIN, |
Owner
There was a problem hiding this comment.
Can you add these new modes to ci.yml too please? where currently we have:
mode: [ m, p, d ]
katef
reviewed
Oct 1, 2023
|
|
||
| static int | ||
| compare_with_pcre(const char *pattern, struct fsm *fsm) | ||
| { |
katef
reviewed
Oct 1, 2023
| /* This triggers an "unreached" assertion in the parser. | ||
| * It's already been reported (issue #386), but once the | ||
| * fuzzer finds it, it will report it over and over. | ||
| * Exit here so that the fuzzer considers it uninteresting. */ |
silentbicycle
added a commit
that referenced
this pull request
Oct 19, 2023
There's extra error codes in #440 for regexes that aren't UNSATISFIABLE per se, but depend on particular corner cases in PCRE that probably aren't worth supporting in an automata-based implementation. Add a test case for one, tests/pcre/in48.re: ^a|$[^x]b* This is a tricky one to handle properly; according to PCRE it should match either "a<anything...>" OR "\n", but nothing else. The newline match is because $ is a non-input-consuming check that evaluation is either at the end of input, or at a newline immediately before the end. In this case `$[^x]b*` matches exactly one newline; it's equivalent to "$\n". This probably isn't worth supporting, but we can detect cases where a potential newline match appears after a $ and reject them as an unsupported PCRE behavior.
silentbicycle
added a commit
that referenced
this pull request
Dec 20, 2023
There's extra error codes in #440 for regexes that aren't UNSATISFIABLE per se, but depend on particular corner cases in PCRE that probably aren't worth supporting in an automata-based implementation. Add a test case for one, tests/pcre/in48.re: ^a|$[^x]b* This is a tricky one to handle properly; according to PCRE it should match either "a<anything...>" OR "\n", but nothing else. The newline match is because $ is a non-input-consuming check that evaluation is either at the end of input, or at a newline immediately before the end. In this case `$[^x]b*` matches exactly one newline; it's equivalent to "$\n". This probably isn't worth supporting, but we can detect cases where a potential newline match appears after a $ and reject them as an unsupported PCRE behavior.
katef
pushed a commit
that referenced
this pull request
Jan 3, 2024
There's extra error codes in #440 for regexes that aren't UNSATISFIABLE per se, but depend on particular corner cases in PCRE that probably aren't worth supporting in an automata-based implementation. Add a test case for one, tests/pcre/in48.re: ^a|$[^x]b* This is a tricky one to handle properly; according to PCRE it should match either "a<anything...>" OR "\n", but nothing else. The newline match is because $ is a non-input-consuming check that evaluation is either at the end of input, or at a newline immediately before the end. In this case `$[^x]b*` matches exactly one newline; it's equivalent to "$\n". This probably isn't worth supporting, but we can detect cases where a potential newline match appears after a $ and reject them as an unsupported PCRE behavior.
Endleaf rejiggle, .fsm syntax for endids, refactor endid get/set api, various related bugfixes
…ng at \n-terminated patterns in-situ. This is just so much less complicated.
I think at the time of writing, fsm_example() is broken, because I get no output here. Possibly related to #438
…ray. This sets the scene for multi-file pattern lists, later on. But also seems to keep scopes tighter and make things simpler as a byproduct.
(Contributed by Scott)
Advice on how to use libfsm for generating performant pattern matchers
Spotted by both June and Scott, thanks
Suggested by Scott
Github actions cache fluffery
Add re_interpolate groups()
I ran into a test failure:
grep FAIL build/tests/*/*res*; [ $? -ne 0 ]
grep: build/tests/eager_output/run_mixed_start_anchor_regression: binary file matches
*** Error code 1
because tests/eager_output/eager_output_mixed_start_anchor_regression.c
contains the substring "res" in "regression", and the binary file
contains "FAIL:
$ strings build/tests/eager_output/run_mixed_start_anchor_regression | grep FAIL
VM_END_FAIL
VM_FAIL
This shouldn't matter for testing purposes, the 'test' target only cares
about the test result files containing "PASS" or "FAIL". The change in
cb42d58 to allow prefixed result files (e.g. "dyn-fdgetc-getc-res0")
made this check ANY files with "res" in the name, but grep should ignore
binary files.
Makefile: grep for 'FAIL' should use -I to ignore binary files.
…resolution' into sv/integrate-combinable-DFA-capture-resolution
This was a really gnarly merge and took a while to get through.
Both the branch implementing captures and the branch implementing
`fsm_union_repeated_pattern_group` made interface changes and added
significant behavior to `ast_analysis.c`, in ways that were at times
tricky to reconcile. `ast_compile.c` was also restructured, but some
of the changes were necessary to inform capture and unioning
functionality.
As of now _almost_ all the tests are passing, but there are a couple
specific things to note:
- This brings in an interface change to `fsm_endid_set`, it now returns
an enum rather than just an int. Because if you squint enough they are
both just numbers (according to the C language spec) code using
`if (!fsm_endid_set(...)) { ... }` will not get a warning for the
changed meaning of the return code. I prefer having this be an enum,
but it IS an interface change, and I'm not opposed to changing it back
in a later commit.
- `build/tests/capture/res_test_case_list:FAIL`
I'm going to fix this in a later commit. I'm not 100% sure yet, but
I think it's related to conflicting changes in the parser code, which
I'm waiting to regenerate until after this merge commit.
- `build/tests/endids/res10_minimise_partial_overlap:FAIL`
This has to do with `AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE` or
`AST_ANALYSIS_ERROR_UNSUPPORTED_CAPTURE` not being handled yet by code
specific to the native dialect. I'm going to handle that later, but
will have to check whether native should behave like PCRE in those
particular cases or not. Either the error handling code needs to be
updated, or the code raising the UNSUPPORTED error needs to check
which regex dialect is in effect.
- fuzz/target.c
I have confirmed that the merged fuzzer harness code builds, but
haven't yet spend time re-fuzzing anything. In my experience libfuzzer
has bit-rotted over the last couple clang/LLVM releases and tends to
nondeterministically crash in combination with some of the clang
sanitizers now, so we may want to retarget this to using AFL-Fuzz++.
That's well outside the scope of this PR, though.
- src/lx/parser.act
I updated this but haven't re-generated the parsers yet, and I
updated the generated code directly with a one-line change to
reflect the `fsm_endid_set` interface change. As mentioned above,
I'll re-generate that code in a separate commit.
Conflicts:
- fuzz/target.c
- include/adt/stateset.h
- include/re/re.h
- src/adt/stateset.c
- src/fsm/main.c
- src/libfsm/Makefile
- src/libfsm/capture.c
- src/libfsm/clone.c
- src/libfsm/closure.c
- src/libfsm/consolidate.c
- src/libfsm/determinise.c
- src/libfsm/determinise_internal.h
- src/libfsm/endids.c
- src/libfsm/epsilons.c
- src/libfsm/exec.c
- src/libfsm/internal.h
- src/libfsm/merge.c
- src/libfsm/minimise.c
- src/libfsm/state.c
- src/libre/ast.h
- src/libre/ast_analysis.c
- src/libre/ast_analysis.h
- src/libre/ast_compile.c
- src/libre/ast_rewrite.c
- src/libre/re.c
- src/lx/parser.act
- src/re/main.c
- tests/capture/captest.c
- tests/capture/captest.h
- tests/capture/capture3.c
- tests/capture/capture4.c
- tests/capture/capture5.c
- tests/capture/capture_concat1.c
- tests/capture/capture_concat2.c
- tests/capture/capture_union1.c
- tests/minimise/minimise_test_case_list.c
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This adds an extra compilation step for PCRE regexes, generating programs for a group capture resolution abstract machine. The implementation builds on the approach described in "Regular Expression Matching: the Virtual Machine Approach", with adaptations to reduce the runtime memory footprint, handle some edge cases the same way PCRE does, explicitly detect some other PCRE edge cases and reject them as unsupported, and continue supporting group captures as the resulting DFAs are combined with other DFAs into a single DFA that matches them all in one pass
I have fuzzed this code quite a lot, both comparing group capture results to PCRE's and comparing libfsm's handling of individual regexes' DFAs with the DFA produced my combining them. Many of the test cases in
tests/capture/capture_test_case_list.ccame from fuzzing. My working branch diverged frommainfor a while during this work, in particular the metadata associated with end states, but after integrating upstream changes I fuzzed it further. I think I integrated things properly, but it's something to be mindful of during review. Some unrelated issues discovered during fuzzing have already been posted as separate PRs.Worst-case memory usage is proportional to the size of the capvm program's length (known at compile time) and the input length -- we record a bit for each branch taken while evaluating the regex (e.g. either greedily jumping to repeat a subexpression or non-greedily advancing in the regex), so memory usage will slowly increase as inputs are evaluated. Common path prefixes are shared between threads, and the total thread count (and divergence) is bounded by the size of the opcode program. As a special case, a long path prefix of all zero bits is collapsed down to a counter; this usually happens because of an unanchored start loop, where each zero bit represents a path continuing to advance through the input without starting the match yet.
This PR targets an integration branch because some work is not yet complete:
^(()($)|x)+$. These are probably not worth the trouble to support. It also doesn't support the\zoperator yet.rearen't enabled yet, such as capture resolution for multiple files, and theredocs need updating for some new flags.