
Upstream sync #25

Merged: 54 commits into fastly:main, May 8, 2024
Conversation

silentbicycle

This syncs some changes from upstream, mostly bugfixes:

  • Construct group #0 at the AST root for all dialects.
  • Allow bases=NULL for fsm_union_array()
  • Bugfix: Skipping group #0 only applied for pcre dialect.
  • Minimisation should not combine end states with distinct sets of endids.
  • Add assert for unreachable branch.
  • Remove duplicated struct field introduced during merge.
  • Add error code for unsupported PCRE cases (RE_EUNSUPPPCRE), reject one.
  • Missing free.
  • determinise: Ensure that the analysis_cache pair ordering is consistent.
  • Add EXPENSIVE_CHECK for ids[] being in ascending order.
  • Ensure pair order is consistent for hashing.
  • hash_id: Rather than adding the IDs, combine into one uint64_t.
  • Handling for the initial token being TOK_UNKNOWN.

silentbicycle and others added 30 commits October 19, 2023 13:34
There are extra error codes in #440 for regexes that aren't UNSATISFIABLE
per se, but that depend on particular corner cases in PCRE which probably
aren't worth supporting in an automata-based implementation.

Add a test case for one, tests/pcre/in48.re: ^a|$[^x]b*

This is a tricky one to handle properly; according to PCRE it should
match either "a<anything...>" OR "\n", but nothing else. The newline
match is because $ is a non-input-consuming check that evaluation is
either at the end of input, or at a newline immediately before the end.
In this case `$[^x]b*` matches exactly one newline; it's equivalent to
"$\n". This probably isn't worth supporting, but we can detect cases
where a potential newline match appears after a $ and reject them as
an unsupported PCRE behavior.
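A rough sketch of such a check during AST analysis might look like the
following; both helpers are hypothetical names, and only the
AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE result is named by these commits:

    /* Hypothetical sketch: reject a node that could match '\n' when it
     * appears directly after a '$' anchor, rather than trying to
     * reproduce PCRE's end-anchor semantics. prev_node_is_end_anchor()
     * and node_can_match_newline() are assumed names. */
    if (prev_node_is_end_anchor(env) && node_can_match_newline(node)) {
            return AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE;
    }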
These are anchoring edge cases found by fuzzing. They have to do
with incomplete pruning of repeated nodes that are unsatisfiable due to
anchoring, nesting of anchors inside groups and/or ALT subtrees, etc.

The fixes will go in their own commit.
`^(?:$^|a)(?:$^|^)*`: This should be equivalent to `^(?:^$|a)` because
the first group is always either fully anchored (making the second
redundant) or must match 'a' (making anchoring at the start
impossible), but that information wasn't being passed properly to the
second group.

`$[^a]*` was returning AST_ANALYSIS_ERROR_UNSUPPORTED_PCRE while
inside the repeat node (`[]*`), so it didn't flag the `^` immediately
after. If a repeat node that can match 0 times is unsupported,
unsatisfiable, etc., just set its max count to 0 and continue. With
this change, analysis finds the `^` and treats the whole regex
like `$^|$^`, which reduces to `^$`.

The fixes for both of these will be integrated in later commits -- I'm
integrating changes from a working branch that had several intermediate
steps committed to push to another computer for fuzzing.
If a repeat node's subtree returns an error but has a min count of
0, then prune it by setting its max count to 0. If its error is
returned directly then the analysis won't see nodes that appear
after it.
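A minimal sketch of that pruning step, with assumed field and function
names (only the min/max repeat-count behavior comes from the commit
message):

    /* Sketch: if the repeat subtree reports an error but the node can
     * match zero times, prune it by setting max to 0 and continue, so
     * analysis still visits the nodes that follow. Names are assumed. */
    res = analysis_iter(env, n->u.repeat.e);
    if (res != AST_ANALYSIS_OK) {
            if (n->u.repeat.min == 0) {
                    n->u.repeat.max = 0;    /* pruned: matches zero times */
            } else {
                    return res;
            }
    }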
This should normally be set when allocation fails, but when built
for fuzzing, analysis can pre-emptively reject inputs that would
need excessive memory due to large repetition counts (for example),
so in that case make sure an error code is set, to avoid triggering
an assertion later on.
Lots of other bugs in linking have been indirect consequences of an
ancestor ALT node that changed its top-down x,y linkage to global
or the global start loop instead. Avoiding modification here eliminates
a lot of noise. While trying to simultaneously satisfy the pcre-anchor
tests 92, 99, and 100 I took a step back, thought through the system as
a whole, and realized this was where their linkage was getting messed up.
Rather than modifying this directly, add a helper function that has
its own asserts and logging.
Add probe count logging to `to_set_htab_check`.

Switch from adding each PHI_64*(id+1) term to XORing them, which has
better collision behavior.

This change makes

    time ./re -rpcre 'a[^bz][^bz][^bz][^b][^bz][^bz][^bz][^bz][^bz][^bz][^b][^bz][^bz][^bz][^bz][^bz]'

go from taking ~14 seconds to ~6 when I run it locally, because
adding leads to much higher probe counts on average.
katef's testing with words.sh found some suspicious timing; profiling
with callgrind showed there's still some kind of bad collision behavior
when computing PHI64(a) ^ PHI64(b) with exactly two IDs. It's probably
a bad idea to combine multiple Fibonacci hashes at all, even with
XORing rather than adding.

Changing to xorshift* (another fast, high quality hash function for
64-bit ints) immediately makes the issue go away, so do that.
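For reference, a xorshift*-style mixer for a 64-bit ID looks roughly
like this; the shift amounts and multiplier below are the commonly
published xorshift64* constants, not necessarily the ones the commit
uses:

    #include <stdint.h>

    /* Sketch of a xorshift*-style hash for a 64-bit ID. */
    static uint64_t
    hash_id(uint64_t id)
    {
            uint64_t x = id + 1;    /* avoid the zero fixed point */
            x ^= x >> 12;
            x ^= x << 25;
            x ^= x >> 27;
            return x * 0x2545F4914F6CDD1DULL;
    }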
My intent was to have all syntax errors go through the same actions, driven by parser.sid. But in this case SID's generated parser bails out immediately on TOK_UNKNOWN, and we never reach an ## alt. So we never reach <make-ast>, and thus never <err-syntax> about it.

That's what the assertion was about; the generated parser should never return with a NULL AST.

My workaround for this is just to recreate the equivalent syntax error at the top-level, which isn't very satisfying at all. But it does what we need here.

Spotted by @cinco-de-mayonnaise, thank you.
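The workaround presumably reduces to something like this sketch at the
top-level wrapper around the generated parser; the error field and
error code names here are assumptions, not the actual API:

    /* Sketch: the generated parser can return a NULL AST when the very
     * first token is TOK_UNKNOWN; synthesise the equivalent syntax
     * error at the top level instead of asserting. 'err->e' and
     * RE_EXUNKNOWN are hypothetical names. */
    if (ast == NULL) {
            if (err != NULL) {
                    err->e = RE_EXUNKNOWN;  /* hypothetical error code */
            }
            return NULL;
    }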
- hash_id should take a uint64_t argument, rather than unsigned.

- Instead of adding them or hashing them separately and combining,
  pack both into the uint64_t argument for hash_id, since each is
  a 32-bit ID (see the sketch after this list). Further
  experimentation supports that this has better collision behavior.
- Add a function to hash a series of IDs, rather than doing it with
  a loop combining the intermediate hashes in inconsistent ways.

- Add a #define to log a probe count and a limit to assert, and
  rework the hashing code control flow for tracking/asserting on
  the probe count.

- Remove `hash_fnv1a_64` from a logging statement (normally not
  compiled); since that was the only remaining use, remove the FNV1a
  functions from hash.h.

- Remove FSM_PHI_32 and FSM_PHI_64, now unused.

- parser.act still uses 32-bit FNV1a, but that's for character-by-
  character hashing, so that still makes sense, and it has its
  own copy of fnv1a.
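A sketch of the packing described in the second item above; hash_id is
the function named in the list, while the wrapper is made up for
illustration:

    #include <stdint.h>

    /* Sketch: pack two 32-bit IDs into hash_id's single uint64_t
     * argument instead of hashing each separately and combining. */
    static uint64_t
    hash_id_pair(uint32_t a, uint32_t b)
    {
            return hash_id(((uint64_t)a << 32) | (uint64_t)b);
    }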
silentbicycle and others added 20 commits January 8, 2024 17:46
Ranges using named endpoints are currently rejected with "Unsupported
operator".
Spotted by ASAN.
I'm still experimenting with the limit, but it's common for
sort_and_dedup_dst_buf to be called with a fairly small input array. The
bitset approach's cost is proportional to the delta between the max and
min values, so when there are only a few entries it may be faster to do
an insertion sort instead -- 16 entries will probably fit in a single
cache line.

Changing SMALL_INPUT_LIMIT changes the cutoff, and commenting it out
disables that approach outright. Setting SMALL_INPUT_CHECK to 1 enables
a pass at the end that verifies the array is actually sorted and unique.

Also, check upfront whether the input array is already sorted and
unique. Experiments so far suggest that this doesn't pay off very often,
but it's very cheap since we're already sweeping over the input for the
min and max values.
I benchmarked with a range of values from 8 to 4096; somewhere around
32 - 256 worked particularly well, and it started to become slower
beyond that, so set it to <32 for now.
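The small-input path is presumably a plain insertion sort, roughly as
below; the element type and the limit's value are assumptions (the
commit only says the limit is set to <32):

    #include <stddef.h>
    #include <stdint.h>

    #define SMALL_INPUT_LIMIT 16    /* assumed value, per the cache-line note */

    /* Sketch: insertion sort for small inputs, where the bitset pass
     * (cost proportional to max - min) isn't worth setting up. */
    static void
    small_input_sort(uint64_t *buf, size_t used)
    {
            size_t i, j;
            for (i = 1; i < used; i++) {
                    const uint64_t v = buf[i];
                    for (j = i; j > 0 && buf[j - 1] > v; j--) {
                            buf[j] = buf[j - 1];
                    }
                    buf[j] = v;
            }
    }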
…ons-during-ast-analysis

Add error code for unsupported PCRE cases (RE_EUNSUPPPCRE), reject one.
This was adapted from the group capture integration branch. It isn't
specific to group capture.

Currently, metadata on end states will not prohibit merging, which can
lead to false positives when using endids to detect which of the
original regexes are matching when several regexes are compiled and then
unioned into a single DFA. This adds a pass before the main minimisation
loop that splits the ECs of end states with distinct sets of end IDs,
which is sufficient to prevent them from merging later.

I reworked the test tests/endids/endids2_union_many_endids.c somewhat,
since it was previously checking that the end states WERE merged.
Unfortunately the other checks are somewhat weakened -- I removed the
part checking that `endids[j] == j+1` because that ordering isn't
guaranteed anymore: the patterns aren't anchored, so some examples
will match more than one of them, and the endids collected can be
nonconsecutive.

I added a separate test, endids10_minimise_partial_overlap.c, which
checks the minimal case that when the two regexes /^abc$/ and /^ab*c$/
are combined and a distinct endid is set on each, matching "ac", "abc",
and "abbc" gets one or both endids as appropriate. Ideally, we'd have a
property test checking that any arbitrary set of regex patterns yields
the same mapping of inputs to endids when the patterns are run
individually vs. when combined into a single FSM, determinised, and
minimised.
These got missed.

Also, rename EXPENSIVE_INTEGRITY_CHECKS in minimise.c to
EXPENSIVE_CHECKS and move the <fsm/print.h> header inclusion.
If two end states have distinct sets of endids associated with
them, mark them as distinguishable before entering the main
fixpoint loop. This brings the results from the test oracle back
in line with fsm_minimise.
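A sketch of that pre-pass, where fsm_isend is libfsm's existing
end-state predicate and the other two helpers are assumed names:

    /* Sketch: before the fixpoint loop, mark every pair of end states
     * whose endid sets differ as distinguishable, so the main loop can
     * never merge them. same_endid_set() and mark_distinguishable()
     * are assumed names. */
    fsm_state_t i, j;
    for (i = 0; i < state_count; i++) {
            if (!fsm_isend(fsm, i)) { continue; }
            for (j = i + 1; j < state_count; j++) {
                    if (fsm_isend(fsm, j) && !same_endid_set(fsm, i, j)) {
                            mark_distinguishable(i, j);
                    }
            }
    }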
Here I'd assumed #0 would always be present at the root, but that's actually only the case for a dialect which provides capturing groups. Instead of switching on the dialect, I'm just coding this to treat #0 as optional.

This does have the semantic effect of removing the "whatever we parsed, it wasn't a literal" part. But I actually can't find a way to reach that path anyway.
Minimisation should not combine end states with distinct sets of endids.
My thinking here is that capturing the entire expression is a property of the runtime, regardless of whether a particular dialect offers syntax for group capture for subexpressions. Practically, it seems like a user might want to know what a pattern matched (i.e. ignoring the unanchored prefix and suffix).
Break a dependency on group #0 for re_is_literal()
This adds an extra parameter to `re_strings_add_str` and
`re_strings_add_raw` that (if non-NULL) associates a single endid
with the string being added. When `re_strings_build` constructs the
DFA it will produce a separate end state for each endid.

This needs further testing with multiple overlapping patterns. When
multiple literal strings appear in the input only the latest match
will be reported.
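Usage would look roughly like the following; the exact signatures are
an assumption based on the description above:

    /* Sketch: associate an endid with each added literal string;
     * passing NULL adds a string with no endid, as before. */
    fsm_end_id_t id_foo = 1, id_bar = 2;

    re_strings_add_str(s, "foo", &id_foo);
    re_strings_add_str(s, "bar", &id_bar);
    re_strings_add_str(s, "baz", NULL);     /* no endid */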
This uses a `struct state_set` since sizeof(fsm_state) ==
sizeof(fsm_end_id_t), and it's probably not worth making a separate ADT
just for these.

The second test checks that duplicated strings get all their endids set.
The previous implementation (a single endid, or ENDID_NONE) dropped all
but the last endid defined.
silentbicycle requested a review from katef on May 7, 2024 20:32
The generated functions for FSM_IO_STR and FSM_IO_PAIR didn't include
a `void *opaque` to pass other state, so add it. For example, this could
be used to pass info about which endids matched.
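Concretely, for FSM_IO_STR the generated entry point gains one
parameter, something like this (the exact generated prototype is an
assumption):

    /* Before (sketch): */
    int fsm_main(const char *s);

    /* After (sketch): caller state threaded through, e.g. to record
     * which endids matched. */
    int fsm_main(const char *s, void *opaque);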
silentbicycle (Author)

This will also need katef#465.

katef and others added 3 commits May 8, 2024 04:37
…odegen-in-IO-STR

src/libfsm/print/c.c: Add opaque argument to generated DFA functions.
Add endid support to `re_strings`.
silentbicycle merged commit 80b80a6 into fastly:main on May 8, 2024
161 checks passed