\b does not behave like it does with java.util.regex.Pattern #116
Comments
I saw that PR #100 was closed, but java.util.regex.Pattern behaves as it should here. Take the French word "été": it is a real word, so word boundaries should match around it, shouldn't they? Current re2j does not match them. The wiki page should be updated to reflect this, not used as a reason to refuse improvements. PR #100 should be applied (its new unit tests show the behaviour mismatches).
As we describe in the package documentation, RE2/J implements the behavior specified by https://github.com/google/re2/wiki/Syntax. As noted on the GitHub page, RE2/J is not a drop-in replacement for java.util.regex. I raised https://groups.google.com/u/1/g/re2-dev/c/nyGkxcJKExY with re2-dev to see why RE2 does not implement word boundary matching in this way. I'm not in a position to document every way in which RE2/J differs from java.util.regex. Some of the differences are noted on the GitHub page; others will be described in the RE2 syntax document (e.g. \b unambiguously implements ASCII word boundary matching, which is different from java.util.regex).
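For readers following along, here is a minimal sketch of the difference being discussed. It uses only java.util.regex, so it does not exercise RE2/J itself. The ASCII boundary is spelled out with lookarounds over [0-9A-Za-z_], which is my reading of "ASCII word boundary" in the RE2 syntax page. The Unicode case uses `Pattern.UNICODE_CHARACTER_CLASS` so that the result does not depend on the JDK version (as far as I know, plain `\b` became ASCII-only by default in JDK 19). The class and method names are made up for illustration.

```java
import java.util.regex.Pattern;

class WordBoundaryDemo {
    // ASCII word boundary written out by hand: a position where exactly one
    // neighbour is in [0-9A-Za-z_]. This is an assumption about what "ASCII \b"
    // means, expressed with lookarounds so it behaves the same on every JDK.
    static final String ASCII_B =
        "(?:(?<![0-9A-Za-z_])(?=[0-9A-Za-z_])|(?<=[0-9A-Za-z_])(?![0-9A-Za-z_]))";

    // True if `word` occurs in `text` delimited by ASCII word boundaries.
    static boolean asciiBoundaryFind(String word, String text) {
        return Pattern.compile(ASCII_B + Pattern.quote(word) + ASCII_B)
                      .matcher(text).find();
    }

    // True if `word` occurs in `text` delimited by Unicode-aware \b.
    static boolean unicodeBoundaryFind(String word, String text) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                               Pattern.UNICODE_CHARACTER_CLASS)
                      .matcher(text).find();
    }

    public static void main(String[] args) {
        // "\u00e9t\u00e9" is the word "été", escaped to avoid source-encoding issues.
        String ete = "\u00e9t\u00e9";
        String text = "un " + ete + " chaud";
        System.out.println(asciiBoundaryFind(ete, text));              // false
        System.out.println(unicodeBoundaryFind(ete, text));            // true
        System.out.println(asciiBoundaryFind("summer", "a summer day")); // true
    }
}
```

With the ASCII definition, the space and "é" are both non-word characters, so there is no boundary between them and the search fails. The Unicode-aware version treats "é" as a word character and succeeds.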
I guess/hope the wiki page documents what the code does, but it should not limit what the code should do. IMHO the page should be updated. Word boundaries for French words, and for many other languages, are mandatory for my usage, and accents are part of that. I moved to re2j because I need longest matches, and that was easier to implement/patch there than with Java's regex. Why not try to make both libraries almost equivalent? Why not make re2j the optimal regex library, with both functionalities? (US users will not notice. This is the same "fight" again and again: ASCII vs Unicode, the same fight that brought us "code pages", which we would both like, I hope, to forget forever.)
Left a comment on the closed original PR: a hack for people who want the PR's behaviour. I still need it :-/
Same issue exists in RE2 as well: google/re2#344
Word boundaries should use \p{L}, not just A-Za-z, to behave like the default regex in Java. I added some tests showing the issue and fixed it in this PR: #100 (but I had to disable a large unit test that I don't know how to adapt to support this change).
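As an illustration of the \p{L} idea without patching the library, here is a rough sketch that replaces `\b` with explicit "start, end, or a non-letter/non-digit character" alternatives. It is not the change made in PR #100 and not the hack mentioned earlier in the thread; `wholeWord` and `containsWord` are hypothetical helpers. The example runs on java.util.regex so it can be checked easily. The pattern avoids lookarounds and uses only constructs that I believe the RE2 syntax page also lists (non-capturing groups, `^`, `$`, `\p{L}`, `\p{N}`, `\Q...\E`), but it has not been run against RE2/J here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LetterBoundary {
    // Anything that is not a Unicode letter, a Unicode number, or underscore.
    static final String NOT_WORD = "[^\\p{L}\\p{N}_]";

    // Hypothetical helper: builds a pattern that matches `word` only when it is
    // delimited by start/end of input or by a non-word character.
    // The word itself is captured in group 1; the delimiters are consumed.
    static Pattern wholeWord(String word) {
        return Pattern.compile(
            "(?:^|" + NOT_WORD + ")(" + Pattern.quote(word) + ")(?:$|" + NOT_WORD + ")");
    }

    static boolean containsWord(String word, String text) {
        return wholeWord(word).matcher(text).find();
    }

    public static void main(String[] args) {
        String ete = "\u00e9t\u00e9"; // "été"
        Matcher m = wholeWord(ete).matcher("un " + ete + " chaud");
        if (m.find()) {
            // Group 1 holds the word without the surrounding delimiters.
            System.out.println(m.group(1).equals(ete));
        }
        // "répété" ends in "été" but is preceded by a letter, so it is rejected.
        System.out.println(containsWord(ete, "r\u00e9p\u00e9t\u00e9"));
    }
}
```

One limitation of this approach: the delimiters are consumed by the match, so when iterating with `find()`, two occurrences separated by a single character (for example "été été") will only report the first one. A real zero-width boundary does not have that problem.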