Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

experimental: Add eager outputs, similar to endids but eagerly matched. #29

Merged
merged 8 commits into from
Oct 12, 2024

Conversation

silentbicycle
Copy link

When combining several unanchored regexes it becomes VERY expensive to handle combinations of matches via the end state -- essentially, the whole reachable DFA gets separate matching and non-matching copies for each pattern, leading to a DFA whose size is proportional to the number of possible combinations of matches. With eager outputs, we can set a flag for matching as we reach the end of the original pattern (before looping back and possibly also matching other patterns), which keeps the state count from blowing up in fsm_determinise.

To see how much difference this makes, the test tests/eager_output/run7 combines 26 different patterns. It should finish very quickly (~50 msec, just now). Try running it with env FORCE_ENDIDS=N for N increasing from 4 to 26. Around 10-11 it will start taking several seconds, and memory usage will roughly double with each step.

This PR adds fsm_union_repeated_pattern_group, a variant of fsm_union_array that combines a set of DFAs into a single NFA, but correctly handles a mix of anchored and unanchored ends without the state count blowing up. It currently needs flags passed in for each fsm indicating whether the start and/or end are anchored, and there is a hacky special case that removes self-edges from states with eager outputs and instead connects them to a single overall unanchored end loop. I haven't yet figured out how to handle this properly in the general case, but it works for this specific use case, provided all the DFAs are combined at once. (Combining multiple DFAs each produced by determinising fsm_union_repeated_pattern_group's result probably won't work correctly.) I have tried detecting and ignoring those edges in fsm_determinise, after epsilon removal, but so far either it still causes the graph size to blow up or subtly breaks something else.

This is still experimental, and the code generation for -lc here is quite hacky -- it expects the caller to define a FSM_SET_EAGER_OUTPUT acro, since the code generation interface doesn't define where the match info will go yet. A later PR will add a new code generation mode with better support for eager outputs, and I plan to eventually integrate this better with rx, AMBIG_MULTIPLE, and so on.

(This squashes down a couple false starts. Sorry, it's a large PR, but there isn't a meaningful intermediate state here.)

When combining several unanchored regexes it becomes VERY expensive
to handle combinations of matches via the end state -- essentially,
the whole reachable DFA gets separate matching and non-matching copies
for each pattern, leading to a DFA whose size is proportional to the
number of *possible combinations* of matches. With eager outputs,
we can set a flag for matching as we reach the end of the original
pattern (before looping back and possibly also matching other patterns),
which keeps the state count from blowing up in fsm_determinise.

To see how much difference this makes, the test tests/eager_output/run7
combines 26 different patterns. It should finish very quickly (~50 msec,
just now). Try running it with `env FORCE_ENDIDS=N` for N increasing from
4 to 26. Around 10-11 it will start taking several seconds, and memory
usage will roughly double with each step.

This PR adds `fsm_union_repeated_pattern_group`, a variant of
`fsm_union_array` that combines a set of DFAs into a single NFA, but
correctly handles a mix of anchored and unanchored ends without the
state count blowing up. It currently needs flags passed in for each fsm
indicating whether the start and/or end are anchored, and there is a
hacky special case that removes self-edges from states with eager
outputs and instead connects them to a single overall unanchored end
loop. I haven't yet figured out how to handle this properly in the
general case, but it works for this specific use case, provided all the
DFAs are combined at once. (Combining multiple DFAs each produced by
determinising fsm_union_repeated_pattern_group's result probably won't
work correctly.) I have tried detecting and ignoring those edges in
fsm_determinise, after epsilon removal, but so far either it still
causes the graph size to blow up or subtly breaks something else.

This is still experimental, and the code generation for `-lc` here is
quite hacky -- it expects the caller to define a `FSM_SET_EAGER_OUTPUT`
acro, since the code generation interface doesn't define where the
match info will go yet. A later PR will add a new code generation mode
with better support for eager outputs, and I plan to eventually
integrate this better with rx, AMBIG_MULTIPLE, and so on.

(This squashes down a couple false starts.)
fsm_eager_output_count and fsm_eager_output_get aligns better with
the endid interface, and now  fsm_eager_output_get ensures the buffer
contents are sorted,
Do the same check with eager outputs as it does with endids -- any
two states with distinct sets for either should end up in different
equivalence classes.
/* If .fragment is set and the state has eager outputs, then emit a call to a
* macro (the caller is expected to define). This is a temporary interface. */
for (size_t i = 0; i < cs->eager_outputs->count; i++) {
fprintf(f, "\t\t\tFSM_SET_EAGER_OUTPUT(%u);\n", cs->eager_outputs->ids[i]);
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really ugly. I figured we should talk about what the code generation interface should look like. It'd be awkward to generate non-fragment function prototypes with or without extra argument(s) for the eager outputs, but I'm not sure always adding it is a good plan either -- it probably won't be used as widely as endids. There could be a flag in the print options for whether to include it, and require the flag to be set when eager outputs are used?

Also, this interface passes the buck on whether the IDs are dense or sparse. In DD's case it's a regex ID, so there's a bit array 0..N, but the API doesn't prevent someone from using non-continguous values like 32-bit hashes. Here, the macro can expand to something that includes other variables from the caller's scope, whether that's a bit vector or some other kind of collection.

struct fsm *
fsm_union_repeated_pattern_group(size_t entry_count,
struct fsm_union_entry *entries, struct fsm_combined_base_pair *bases);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would we be able to use this from the <re/strings.h> interfaces?

include/re/re.h Outdated
re_is_anchored(enum re_dialect dialect, re_getchar_fun *f, void *opaque,
enum re_flags flags, struct re_err *err,
struct re_anchoring_info *info);

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self; I requested pulling this out to its own PR, straight to upstream

@@ -405,7 +428,7 @@ hash_iss(interned_state_set_id iss)
}

static struct mapping *
map_first(struct map *map, struct map_iter *iter)
map_first(const struct map *map, struct map_iter *iter)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

always makes me happy to see a const

* as the combined pattern count increases; it's essentially what I'm
* trying to avoid by adding eager output support in the first place.
*
* FIXME: instead of actively removing these, filter in fsm_determinise? */
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great comment

Copy link

@katef katef left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 subject to syncing from upstream etc

Addressed merge conflict on upstreamed variant of re_is_anchored.
@silentbicycle
Copy link
Author

Fixed merge conflict with changed version of fsm_is_anchored from upstream (katef#500).

@silentbicycle silentbicycle merged commit c7b9e1d into main Oct 12, 2024
349 checks passed
@silentbicycle silentbicycle deleted the sv/eager-outputs branch October 12, 2024 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants