
Upstream sync #25

Merged: 54 commits into fastly:main, May 8, 2024
Conversation

silentbicycle

This syncs some changes from upstream, mostly bugfixes:

  • Construct group #0 at the AST root for all dialects.
  • Allow bases=NULL for fsm_union_array()
  • Bugfix: Skipping group #0 only applied for pcre dialect.
  • Minimisation should not combine end states with distinct sets of endids.
  • Add assert for unreachable branch.
  • Remove duplicated struct field introduced during merge.
  • Add error code for unsupported PCRE cases (RE_EUNSUPPPCRE), reject one.
  • Missing free.
  • determinise: Ensure that the analysis_cache pair ordering is consistent.
  • Add EXPENSIVE_CHECK for ids[] being in ascending order.
  • Ensure pair order is consistent for hashing.
  • hash_id: Rather than adding the IDs, combine into one uint64_t.
  • Handling for the initial token being TOK_UNKNOWN.

silentbicycle and others added 30 commits October 19, 2023 13:34
There are extra error codes in #440 for regexes that aren't UNSATISFIABLE
per se, but that depend on particular corner cases in PCRE which probably
aren't worth supporting in an automata-based implementation.

Add a test case for one, tests/pcre/in48.re: ^a|$[^x]b*

This is a tricky one to handle properly; according to PCRE it should
match either "a<anything...>" OR "\n", but nothing else. The newline
match is because $ is a non-input-consuming check that evaluation is
either at the end of input, or at a newline immediately before the end.
In this case `$[^x]b*` matches exactly one newline; it's equivalent to
"$\n". This probably isn't worth supporting, but we can detect cases
where a potential newline match appears after a $ and reject them as
an unsupported PCRE behavior.
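A rough sketch of such a check during AST analysis might look like the
following; both helpers are hypothetical names, and only the
AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE result is named by these commits:

    /* Hypothetical sketch: reject a node that could match '\n' when it
     * appears directly after a '$' anchor, rather than trying to
     * reproduce PCRE's end-anchor semantics. prev_node_is_end_anchor()
     * and node_can_match_newline() are assumed names. */
    if (prev_node_is_end_anchor(env) && node_can_match_newline(node)) {
            return AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE;
    }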
These are anchoring edge cases found by fuzzing. They have to do
with incomplete pruning of repeated nodes that are unsatisfiable due to
anchoring, nesting of anchors inside groups and/or ALT subtrees, etc.

The fixes will go in their own commit.
`^(?:$^|a)(?:$^|^)*`: This should be equivalent to `^(?:^$|a)` because
the first group is always either fully anchored (making the second
redundant) or must match 'a' (making anchoring at the start
impossible), but that information wasn't being passed properly to the
second group.

`$[^a]*` was returning AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE while
inside the repeat node (`[]*`), so it didn't flag the `^` immediately
after. If a repeat node that can match 0 times is unsupported,
unsatisfiable, etc., just set its max count to 0 and continue. With
this change, analysis finds the `^` and treats the whole regex
like `$^|$^`, which reduces to `^$`.

The fixes for both of these will be integrated in later commits -- I'm
integrating changes from a working branch that had several intermediate
steps committed to push to another computer for fuzzing.
If a repeat node's subtree returns an error but has a min count of
0, then prune it by setting its max count to 0. If its error is
returned directly then the analysis won't see nodes that appear
after it.
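A minimal sketch of that pruning step, with assumed field and function
names (only the min/max repeat-count behavior comes from the commit
message):

    /* Sketch: if the repeat subtree reports an error but the node can
     * match zero times, prune it by setting max to 0 and continue, so
     * analysis still visits the nodes that follow. Names are assumed. */
    res = analysis_iter(env, n->u.repeat.e);
    if (res != AST_ANALYSIS_OK) {
            if (n->u.repeat.min == 0) {
                    n->u.repeat.max = 0;    /* pruned: matches zero times */
            } else {
                    return res;
            }
    }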
This should normally be set when allocation fails, but when built
for fuzzing, analysis can pre-emptively reject inputs that would
need excessive memory due to large repetition counts (for example),
so in that case make sure an error code is set, to avoid triggering
an assertion later on.
Lots of other bugs in linking have been indirect consequences of an
ancestor ALT node that changed its top-down x,y linkage to global
or the global start loop instead. Avoiding modification here eliminates
a lot of noise. While trying to simultaneously satisfy the pcre-anchor
tests 92, 99, and 100 I took a step back, thought through the system as
a whole, and realized this was where their linkage was getting messed up.
Rather than modifying this directly, add a helper function that has
its own asserts and logging.
Add probe count logging to `to_set_htab_check`.

Switch from adding each PHI_64*(id+1) term to XORing them, which has
better collision behavior.

This change makes

    time ./re -rpcre 'a[^bz][^bz][^bz][^b][^bz][^bz][^bz][^bz][^bz][^bz][^b][^bz][^bz][^bz][^bz][^bz]'

go from taking ~14 seconds to ~6 when I run it locally, because
adding leads to much higher probe counts on average.
katef's testing with words.sh found some suspicious timing; profiling
with callgrind showed there's still some kind of bad collision behavior
when computing PHI64(a) ^ PHI64(b) with exactly two IDs. It's probably
a bad idea to combine multiple Fibonacci hashes at all, even with
XORing rather than adding.

Changing to xorshift* (another fast, high quality hash function for
64-bit ints) immediately makes the issue go away, so do that.
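For reference, a xorshift*-style mixer for a 64-bit ID looks roughly
like this; the shift amounts and multiplier below are the commonly
published xorshift64* constants, not necessarily the ones the commit
uses:

    #include <stdint.h>

    /* Sketch of a xorshift*-style hash for a 64-bit ID. */
    static uint64_t
    hash_id(uint64_t id)
    {
            uint64_t x = id + 1;    /* avoid the zero fixed point */
            x ^= x >> 12;
            x ^= x << 25;
            x ^= x >> 27;
            return x * 0x2545F4914F6CDD1DULL;
    }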
My intent was to have all syntax errors go through the same actions, driven by parser.sid. But in this case SID's generated parser bails out immediately on TOK_UNKNOWN, and we never reach an ## alt. So we never reach <make-ast>, and thus never <err-syntax> about it.

That's what the assertion was about; the generated parser should never return with a NULL AST.

My workaround for this is just to recreate the equivalent syntax error at the top-level, which isn't very satisfying at all. But it does what we need here.

Spotted by @cinco-de-mayonnaise, thank you.
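The workaround presumably reduces to something like this sketch at the
top-level wrapper around the generated parser; the error field and
error code names here are assumptions, not the actual API:

    /* Sketch: the generated parser can return a NULL AST when the very
     * first token is TOK_UNKNOWN; synthesise the equivalent syntax
     * error at the top level instead of asserting. 'err->e' and
     * RE_EXUNKNOWN are hypothetical names. */
    if (ast == NULL) {
            if (err != NULL) {
                    err->e = RE_EXUNKNOWN;  /* hypothetical error code */
            }
            return NULL;
    }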
- hash_id should take a uint64_t argument, rather than unsigned.

- Instead of adding them or hashing them separately and combining,
  pack both into the uint64_t argument for hash_id, since each is
  a 32-bit ID (see the sketch after this list). Further
  experimentation supports that this has better collision behavior.
- Add a function to hash a series of IDs, rather than doing it with
  a loop combining the intermediate hashes in inconsistent ways.

- Add a #define to log a probe count and a limit to assert, and
  rework the hashing code control flow for tracking/asserting on
  the probe count.

- Remove `hash_fnv1a_64` from a logging statement (normally not
  compiled); since that was the only remaining use, remove the FNV1a
  functions from hash.h.

- Remove FSM_PHI_32 and FSM_PHI_64, now unused.

- parser.act still uses 32-bit FNV1a, but that's for character-by-
  character hashing, so that still makes sense, and it has its
  own copy of fnv1a.
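A sketch of the packing described in the second item above; hash_id is
the function named in the list, while the wrapper is made up for
illustration:

    #include <stdint.h>

    /* Sketch: pack two 32-bit IDs into hash_id's single uint64_t
     * argument instead of hashing each separately and combining. */
    static uint64_t
    hash_id_pair(uint32_t a, uint32_t b)
    {
            return hash_id(((uint64_t)a << 32) | (uint64_t)b);
    }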
silentbicycle and others added 20 commits January 8, 2024 17:46
Ranges using named endpoints are currently rejected with "Unsupported
operator".
Spotted by ASAN.
I'm still experimenting with the limit, but it's common for
sort_and_dedup_dst_buf to be called with a fairly small input array. The
bitset approach's cost is proportional to the delta between the max and
min values, so when there are only a few entries it may be faster to do
an insertion sort instead -- 16 entries will probably fit in a single
cache line.

Changing SMALL_INPUT_LIMIT changes the cutoff, and commenting it out
disables that approach outright. Setting SMALL_INPUT_CHECK to 1 enables
a pass at the end that verifies the array is actually sorted and unique.

Also, check upfront whether the input array is already sorted and
unique. Experiments so far suggest that this doesn't pay off very often,
but it's very cheap since we're already sweeping over the input for the
min and max values.
I benchmarked with a range of values from 8 to 4096; somewhere around
32 - 256 worked particularly well, and it started to become slower
beyond that, so set it to <32 for now.
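The small-input path is presumably a plain insertion sort, roughly as
below; the element type and the limit's value are assumptions (the
commit only says the limit is set to <32):

    #include <stddef.h>
    #include <stdint.h>

    #define SMALL_INPUT_LIMIT 16    /* assumed value, per the cache-line note */

    /* Sketch: insertion sort for small inputs, where the bitset pass
     * (cost proportional to max - min) isn't worth setting up. */
    static void
    small_input_sort(uint64_t *buf, size_t used)
    {
            size_t i, j;
            for (i = 1; i < used; i++) {
                    const uint64_t v = buf[i];
                    for (j = i; j > 0 && buf[j - 1] > v; j--) {
                            buf[j] = buf[j - 1];
                    }
                    buf[j] = v;
            }
    }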
…ons-during-ast-analysis

Add error code for unsupported PCRE cases (RE_EUNSUPPPCRE), reject one.
This was adapted from the group capture integration branch. It isn't
specific to group capture.

Currently, metadata on end states will not prohibit merging, which can
lead to false positives when using endids to detect which of the
original regexes are matching when several regexes are compiled and then
unioned into a single DFA. This adds a pass before the main minimisation
loop that splits the ECs of end states with distinct sets of end IDs,
which is sufficient to prevent them from merging later.

I reworked the test tests/endids/endids2_union_many_endids.c somewhat,
since it was previously checking that the end states WERE merged.
Unfortunately the other checks are somewhat weakened -- I removed the
part checking that `endids[j] == j+1` because that ordering isn't
guaranteed anymore: the patterns aren't anchored, so some examples
will match more than one of them, and the endids collected can be
nonconsecutive.

I added a separate test, endids10_minimise_partial_overlap.c, which
checks the minimal case that when the two regexes /^abc$/ and /^ab*c$/
are combined and a distinct endid is set on each, matching "ac", "abc",
and "abbc" gets one or both endids as appropriate. Ideally, we'd have a
property test checking that any arbitrary set of regex patterns yields
the same mapping of inputs to endids when the patterns are run
individually vs. when combined into a single FSM, determinised, and
minimised.
These got missed.

Also, rename EXPENSIVE_INTEGRITY_CHECKS in minimise.c to
EXPENSIVE_CHECKS and move the <fsm/print.h> header inclusion.
If two end states have distinct sets of endids associated with
them, mark them as distinguishable before entering the main
fixpoint loop. This brings the results from the test oracle back
in line with fsm_minimise.
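A sketch of that pre-pass, where fsm_isend is libfsm's existing
end-state predicate and the other two helpers are assumed names:

    /* Sketch: before the fixpoint loop, mark every pair of end states
     * whose endid sets differ as distinguishable, so the main loop can
     * never merge them. same_endid_set() and mark_distinguishable()
     * are assumed names. */
    fsm_state_t i, j;
    for (i = 0; i < state_count; i++) {
            if (!fsm_isend(fsm, i)) { continue; }
            for (j = i + 1; j < state_count; j++) {
                    if (fsm_isend(fsm, j) && !same_endid_set(fsm, i, j)) {
                            mark_distinguishable(i, j);
                    }
            }
    }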
Here I'd assumed #0 would always be present at the root, but that's actually only the case for a dialect which provides capturing groups. Instead of switching on the dialect, I'm just coding this to treat #0 as optional.

This does have the semantic effect of removing the "whatever we parsed, it wasn't a literal" part. But I actually can't find a way to reach that path anyway.
Minimisation should not combine end states with distinct sets of endids.
My thinking here is that capturing the entire expression is a property of the runtime, regardless of whether a particular dialect offers syntax for group capture for subexpressions. Practically, it seems like a user might want to know what a pattern matched (i.e. ignoring the unanchored prefix and suffix).
Break a dependency on group #0 for re_is_literal()
This adds an extra parameter to `re_strings_add_str` and
`re_strings_add_raw` that (if non-NULL) associates a single endid
with the string being added. When `re_strings_build` constructs the
DFA it will produce a separate end state for each endid.

This needs further testing with multiple overlapping patterns. When
multiple literal strings appear in the input only the latest match
will be reported.
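Usage would look roughly like the following; the exact signatures are
an assumption based on the description above:

    /* Sketch: associate an endid with each added literal string;
     * passing NULL adds a string with no endid, as before. */
    fsm_end_id_t id_foo = 1, id_bar = 2;

    re_strings_add_str(s, "foo", &id_foo);
    re_strings_add_str(s, "bar", &id_bar);
    re_strings_add_str(s, "baz", NULL);     /* no endid */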
This uses a `struct state_set` since sizeof(fsm_state) ==
sizeof(fsm_end_id_t), and it's probably not worth making a separate ADT
just for these.

The second test checks that duplicated strings get all their endids set.
The previous implementation (a single endid, or ENDID_NONE) dropped all
but the last endid defined.
silentbicycle requested a review from katef on May 7, 2024 20:32
The generated functions for FSM_IO_STR and FSM_IO_PAIR didn't include
a `void *opaque` to pass other state, so add it. For example, this could
be used to pass info about which endids matched.
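Concretely, for FSM_IO_STR the generated entry point gains one
parameter, something like this (the exact generated prototype is an
assumption):

    /* Before (sketch): */
    int fsm_main(const char *s);

    /* After (sketch): caller state threaded through, e.g. to record
     * which endids matched. */
    int fsm_main(const char *s, void *opaque);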
silentbicycle (Author)

This will also need katef#465.

katef and others added 3 commits May 8, 2024 04:37
…odegen-in-IO-STR

src/libfsm/print/c.c: Add opaque argument to generated DFA functions.
Add endid support to `re_strings`.
silentbicycle merged commit 80b80a6 into fastly:main on May 8, 2024
161 checks passed