Parameterize regex engine #5

Open
masklinn opened this issue Aug 17, 2024 · 0 comments

This is a long-term thing so I don't forget about it.

This is related to #4: the standard regex::Regex is fast, convenient, and feature-rich, and I think it makes a good default, but there's no denying that with the number of regexes you'd put in a regex-filtered set it can get rather memory-intensive. So some users may want to trade performance and/or convenience for lower memory use. Possibilities here are:

  • regex-lite: that is the attempt in #4 (Switch from regex crate to regex-lite). The memory savings are tremendous (for about the same features minus rich Unicode support), but performance is terrible unless the prefilter has extremely high discriminatory power. For more resource-constrained uses, or for users who are already on regex-lite (and don't mind the lower performance), it could be a nice option.
  • regex::bytes: the memory savings are much smaller than with regex-lite, but they can still be quite respectable. This trades away a lot of convenience, as you get bytes out.
  • lazy compilation of any of those, using std::sync::LazyLock (or once_cell::sync::Lazy for a lower MSRV). This is for highly biased sets which have a very small number of "hot" regexes and a much larger set of regexes which are essentially never used: the engine would keep the much more compact String or even &str around until the regex is actually needed for post-filtering and matching. The trade-off is less consistent behaviour (memory will grow over time, and any match can take arbitrarily long if it triggers the compilation of several regexes). This is especially attractive when the regex set is static and embedded in the binary (so the source strings are "free").
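
To make the lazy option concrete, here is a minimal sketch, not a proposed implementation. It uses std::sync::OnceLock rather than LazyLock, only because the source string can then sit next to the cell without boxing a closure. `Compiled` and `LazyRegex` are hypothetical names, and `Compiled` is a substring matcher standing in for a real compiled regex so the sketch has no dependencies.

```rust
use std::sync::OnceLock;

/// Hypothetical stand-in for a compiled regex (an assumption for this sketch);
/// a real version would wrap `regex::Regex`, `regex_lite::Regex`, etc.
struct Compiled {
    needle: String,
}

impl Compiled {
    /// Pretend "compilation": this is where the expensive work would happen.
    fn new(source: &str) -> Self {
        Compiled { needle: source.to_owned() }
    }

    fn is_match(&self, haystack: &str) -> bool {
        haystack.contains(&self.needle)
    }
}

/// Keeps only the (cheap) source string until a match is first requested.
struct LazyRegex {
    source: &'static str,
    compiled: OnceLock<Compiled>,
}

impl LazyRegex {
    /// `const` so a whole static table of these can be embedded in the binary.
    const fn new(source: &'static str) -> Self {
        LazyRegex { source, compiled: OnceLock::new() }
    }

    /// True once the first match has forced compilation.
    fn is_compiled(&self) -> bool {
        self.compiled.get().is_some()
    }

    /// Compiles on first use, then reuses the compiled form.
    fn is_match(&self, haystack: &str) -> bool {
        self.compiled
            .get_or_init(|| Compiled::new(self.source))
            .is_match(haystack)
    }
}
```

The first `is_match` call on a given entry pays the compilation cost, which is the inconsistent-latency trade-off described above; entries that are never hit stay as plain `&'static str`.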

This would likely require a trait per crate:

  • regex-filtered needs to parameterize on the "regex" object being stored (which may be lifetime-parameterized), how to construct it, and the interface to match it
  • ua-parser further needs the extracted data kind and a way to mix that and the replacement values
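
As a rough illustration of the regex-filtered side only, the trait could look something like the sketch below. Everything here is hypothetical: `Engine`, `Literal`, and `Filtered` are made-up names, `Literal` is a toy substring engine so the example is self-contained, and lifetime-parameterized engines and the ua-parser extraction side are not covered.

```rust
/// Hypothetical trait the filtered set could be generic over:
/// the stored object, how to construct it, and how to match it.
trait Engine: Sized {
    type Error;

    /// Build the stored object from a pattern source.
    fn compile(source: &str) -> Result<Self, Self::Error>;

    /// Test the stored object against a haystack.
    fn is_match(&self, haystack: &str) -> bool;
}

/// Toy engine (an assumption for this sketch): literal substring search.
/// Real implementations would wrap `regex::Regex`, `regex_lite::Regex`, etc.
struct Literal(String);

impl Engine for Literal {
    type Error = String;

    fn compile(source: &str) -> Result<Self, Self::Error> {
        if source.is_empty() {
            Err("empty pattern".to_owned())
        } else {
            Ok(Literal(source.to_owned()))
        }
    }

    fn is_match(&self, haystack: &str) -> bool {
        haystack.contains(&self.0)
    }
}

/// Simplified stand-in for the filtered set, generic over the engine.
struct Filtered<E: Engine> {
    regexes: Vec<E>,
}

impl<E: Engine> Filtered<E> {
    /// Compiles every source with the chosen engine, failing on the first error.
    fn new<'a>(sources: impl IntoIterator<Item = &'a str>) -> Result<Self, E::Error> {
        let regexes = sources
            .into_iter()
            .map(E::compile)
            .collect::<Result<Vec<_>, _>>()?;
        Ok(Filtered { regexes })
    }

    /// Indices of all regexes matching the haystack (no prefiltering here).
    fn matching(&self, haystack: &str) -> Vec<usize> {
        self.regexes
            .iter()
            .enumerate()
            .filter(|(_, r)| r.is_match(haystack))
            .map(|(i, _)| i)
            .collect()
    }
}
```

With something of this shape, swapping engines would be a type parameter change at the call site, for example `Filtered::<Literal>::new([...])`.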