experimental: Add eager outputs, similar to endids but eagerly matched. #29

silentbicycle · 2024-10-10T18:27:29Z

When combining several unanchored regexes it becomes VERY expensive to handle combinations of matches via the end state -- essentially, the whole reachable DFA gets separate matching and non-matching copies for each pattern, leading to a DFA whose size is proportional to the number of possible combinations of matches. With eager outputs, we can set a flag for matching as we reach the end of the original pattern (before looping back and possibly also matching other patterns), which keeps the state count from blowing up in fsm_determinise.

To see how much difference this makes, the test tests/eager_output/run7 combines 26 different patterns. It should finish very quickly (~50 msec, just now). Try running it with env FORCE_ENDIDS=N for N increasing from 4 to 26. Around 10-11 it will start taking several seconds, and memory usage will roughly double with each step.

This PR adds fsm_union_repeated_pattern_group, a variant of fsm_union_array that combines a set of DFAs into a single NFA, but correctly handles a mix of anchored and unanchored ends without the state count blowing up. It currently needs flags passed in for each fsm indicating whether the start and/or end are anchored, and there is a hacky special case that removes self-edges from states with eager outputs and instead connects them to a single overall unanchored end loop. I haven't yet figured out how to handle this properly in the general case, but it works for this specific use case, provided all the DFAs are combined at once. (Combining multiple DFAs each produced by determinising fsm_union_repeated_pattern_group's result probably won't work correctly.) I have tried detecting and ignoring those edges in fsm_determinise, after epsilon removal, but so far either it still causes the graph size to blow up or subtly breaks something else.

This is still experimental, and the code generation for -lc here is quite hacky -- it expects the caller to define a FSM_SET_EAGER_OUTPUT acro, since the code generation interface doesn't define where the match info will go yet. A later PR will add a new code generation mode with better support for eager outputs, and I plan to eventually integrate this better with rx, AMBIG_MULTIPLE, and so on.

(This squashes down a couple false starts. Sorry, it's a large PR, but there isn't a meaningful intermediate state here.)

When combining several unanchored regexes it becomes VERY expensive to handle combinations of matches via the end state -- essentially, the whole reachable DFA gets separate matching and non-matching copies for each pattern, leading to a DFA whose size is proportional to the number of *possible combinations* of matches. With eager outputs, we can set a flag for matching as we reach the end of the original pattern (before looping back and possibly also matching other patterns), which keeps the state count from blowing up in fsm_determinise. To see how much difference this makes, the test tests/eager_output/run7 combines 26 different patterns. It should finish very quickly (~50 msec, just now). Try running it with `env FORCE_ENDIDS=N` for N increasing from 4 to 26. Around 10-11 it will start taking several seconds, and memory usage will roughly double with each step. This PR adds `fsm_union_repeated_pattern_group`, a variant of `fsm_union_array` that combines a set of DFAs into a single NFA, but correctly handles a mix of anchored and unanchored ends without the state count blowing up. It currently needs flags passed in for each fsm indicating whether the start and/or end are anchored, and there is a hacky special case that removes self-edges from states with eager outputs and instead connects them to a single overall unanchored end loop. I haven't yet figured out how to handle this properly in the general case, but it works for this specific use case, provided all the DFAs are combined at once. (Combining multiple DFAs each produced by determinising fsm_union_repeated_pattern_group's result probably won't work correctly.) I have tried detecting and ignoring those edges in fsm_determinise, after epsilon removal, but so far either it still causes the graph size to blow up or subtly breaks something else. This is still experimental, and the code generation for `-lc` here is quite hacky -- it expects the caller to define a `FSM_SET_EAGER_OUTPUT` acro, since the code generation interface doesn't define where the match info will go yet. A later PR will add a new code generation mode with better support for eager outputs, and I plan to eventually integrate this better with rx, AMBIG_MULTIPLE, and so on. (This squashes down a couple false starts.)

fsm_eager_output_count and fsm_eager_output_get aligns better with the endid interface, and now fsm_eager_output_get ensures the buffer contents are sorted,

Do the same check with eager outputs as it does with endids -- any two states with distinct sets for either should end up in different equivalence classes.

silentbicycle · 2024-10-10T18:45:42Z

src/libfsm/print/c.c

+		/* If .fragment is set and the state has eager outputs, then emit a call to a
+		 * macro (the caller is expected to define). This is a temporary interface. */
+		for (size_t i = 0; i < cs->eager_outputs->count; i++) {
+			fprintf(f, "\t\t\tFSM_SET_EAGER_OUTPUT(%u);\n", cs->eager_outputs->ids[i]);


This is really ugly. I figured we should talk about what the code generation interface should look like. It'd be awkward to generate non-fragment function prototypes with or without extra argument(s) for the eager outputs, but I'm not sure always adding it is a good plan either -- it probably won't be used as widely as endids. There could be a flag in the print options for whether to include it, and require the flag to be set when eager outputs are used?

Also, this interface passes the buck on whether the IDs are dense or sparse. In DD's case it's a regex ID, so there's a bit array 0..N, but the API doesn't prevent someone from using non-continguous values like 32-bit hashes. Here, the macro can expand to something that includes other variables from the caller's scope, whether that's a bit vector or some other kind of collection.

katef · 2024-10-12T15:02:53Z

include/fsm/bool.h

+struct fsm *
+fsm_union_repeated_pattern_group(size_t entry_count,
+    struct fsm_union_entry *entries, struct fsm_combined_base_pair *bases);
+


would we be able to use this from the <re/strings.h> interfaces?

katef · 2024-10-12T16:00:52Z

include/re/re.h

+re_is_anchored(enum re_dialect dialect, re_getchar_fun *f, void *opaque,
+	enum re_flags flags, struct re_err *err,
+	struct re_anchoring_info *info);
+


note to self; I requested pulling this out to its own PR, straight to upstream

katef · 2024-10-12T16:13:18Z

src/libfsm/determinise.c

@@ -405,7 +428,7 @@ hash_iss(interned_state_set_id iss)
 }

 static struct mapping *
-map_first(struct map *map, struct map_iter *iter)
+map_first(const struct map *map, struct map_iter *iter)


always makes me happy to see a const

katef · 2024-10-12T17:11:14Z

src/libfsm/union.c

+		 * as the combined pattern count increases; it's essentially what I'm
+		 * trying to avoid by adding eager output support in the first place.
+		 *
+		 * FIXME: instead of actively removing these, filter in fsm_determinise? */


great comment

katef

👍 subject to syncing from upstream etc

Addressed merge conflict on upstreamed variant of re_is_anchored.

silentbicycle · 2024-10-12T17:46:01Z

Fixed merge conflict with changed version of fsm_is_anchored from upstream (katef#500).

Use the first byte of input as a seed, when available.

silentbicycle added 5 commits October 10, 2024 12:16

gen: No end states means nothing to do. (todo: cherry pick)

796c6ab

Ensure .has_eager_outputs is zeroed on new states. (msan)

9fcbdf5

eager_output interface cleanup: Replace _any with _count and _get.

1baee53

fsm_eager_output_count and fsm_eager_output_get aligns better with the endid interface, and now fsm_eager_output_get ensures the buffer contents are sorted,

minimise_test_oracle.c: mismatched eager outputs also prevent merging.

8fd728e

Do the same check with eager outputs as it does with endids -- any two states with distinct sets for either should end up in different equivalence classes.

silentbicycle requested a review from katef October 10, 2024 18:27

silentbicycle commented Oct 10, 2024

View reviewed changes

katef reviewed Oct 12, 2024

View reviewed changes

silentbicycle mentioned this pull request Oct 12, 2024

Upstream sync: re_is_anchored and a few more misc. changes #31

Merged

katef reviewed Oct 12, 2024

View reviewed changes

katef approved these changes Oct 12, 2024

View reviewed changes

Merge branch 'f/main' into sv/eager-outputs

a779057

Addressed merge conflict on upstreamed variant of re_is_anchored.

silentbicycle added 2 commits October 12, 2024 13:54

fuzz/target.c: re_is_anchor interface changes.

bbcac92

fuzz/target: fsm_generate_matches has randomized flag.

e133a74

Use the first byte of input as a seed, when available.

silentbicycle mentioned this pull request Oct 12, 2024

Add cdata codegen, with eager output support #30

Merged

katef approved these changes Oct 12, 2024

View reviewed changes

silentbicycle merged commit c7b9e1d into main Oct 12, 2024
349 checks passed

silentbicycle deleted the sv/eager-outputs branch October 12, 2024 19:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

experimental: Add eager outputs, similar to endids but eagerly matched. #29

experimental: Add eager outputs, similar to endids but eagerly matched. #29

silentbicycle commented Oct 10, 2024

silentbicycle Oct 10, 2024

katef Oct 12, 2024

katef Oct 12, 2024

katef Oct 12, 2024

katef Oct 12, 2024

katef left a comment

silentbicycle commented Oct 12, 2024

experimental: Add eager outputs, similar to endids but eagerly matched. #29

experimental: Add eager outputs, similar to endids but eagerly matched. #29

Conversation

silentbicycle commented Oct 10, 2024

silentbicycle Oct 10, 2024

Choose a reason for hiding this comment

katef Oct 12, 2024

Choose a reason for hiding this comment

katef Oct 12, 2024

Choose a reason for hiding this comment

katef Oct 12, 2024

Choose a reason for hiding this comment

katef Oct 12, 2024

Choose a reason for hiding this comment

katef left a comment

Choose a reason for hiding this comment

silentbicycle commented Oct 12, 2024