\b does not behave like it does with java.util.regex.Pattern #116

Open
mykeul opened this issue Jun 10, 2020 · 6 comments

Comments

@mykeul
Contributor

mykeul commented Jun 10, 2020

Word boundaries should use \p{L}, not just A-Za-z, to behave like the default regex in Java. I added some tests showing the issue and fixed it in PR #100 (but I had to disable a large unit test that I don't know how to adapt to this change).
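
To make the two definitions concrete, here is a minimal sketch of a boundary check under each of them. The class and method names (`WordBoundarySketch`, `isAsciiWord`, `isUnicodeWord`, `boundaries`) are hypothetical and are not taken from RE2/J or from PR #100; treating "Unicode word character" as `Character.isLetter` plus digits and underscore is an assumption based on the `\p{L}` suggestion above.

```java
import java.util.ArrayList;
import java.util.List;

final class WordBoundarySketch {
    // ASCII definition of a word character: [0-9A-Za-z_].
    static boolean isAsciiWord(int cp) {
        return (cp >= '0' && cp <= '9') || (cp >= 'A' && cp <= 'Z')
            || (cp >= 'a' && cp <= 'z') || cp == '_';
    }

    // Assumed Unicode-aware variant: any letter (\p{L}) instead of A-Za-z.
    static boolean isUnicodeWord(int cp) {
        return Character.isLetter(cp) || (cp >= '0' && cp <= '9') || cp == '_';
    }

    static boolean isWord(int cp, boolean unicode) {
        return unicode ? isUnicodeWord(cp) : isAsciiWord(cp);
    }

    // Returns the char offsets where exactly one neighbour is a word character.
    static List<Integer> boundaries(String s, boolean unicode) {
        List<Integer> out = new ArrayList<>();
        int i = 0;
        while (true) {
            boolean before = i > 0 && isWord(s.codePointBefore(i), unicode);
            boolean after = i < s.length() && isWord(s.codePointAt(i), unicode);
            if (before != after) out.add(i);
            if (i >= s.length()) break;
            i += Character.charCount(s.codePointAt(i));
        }
        return out;
    }

    public static void main(String[] args) {
        String ete = "\u00e9t\u00e9"; // the French word "été"
        System.out.println(boundaries(ete, false)); // [1, 2]
        System.out.println(boundaries(ete, true));  // [0, 3]
    }
}
```

Under the ASCII definition the only boundaries fall around the inner "t", so a pattern that requires a boundary on both sides of the whole word cannot match; under the assumed Unicode definition the boundaries sit at the start and end of the word.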

@mykeul mykeul changed the title from "\b does not behave like java.util.regex.Pattern does" to "\b does not behave like it does with java.util.regex.Pattern" Jun 10, 2020
@mykeul
Contributor Author

mykeul commented Jun 23, 2020

I saw that PR #100 was closed, but java.util.regex.Pattern behaves as it should. Take the French word "été": it is a real word, so word boundaries should match around it, shouldn't they? Current re2j doesn't match them. The wiki page should be updated to reflect this, not used to refuse improvements: PR #100 should be applied (the new unit tests show the behaviour mismatches).

@sjamesr
Contributor

sjamesr commented Jun 26, 2020

As we describe in the package documentation, RE2/J implements the behavior specified by https://github.com/google/re2/wiki/Syntax. As noted on the GitHub page, RE2/J is not a drop-in replacement for java.util.regex.Pattern, for this and a host of other reasons.

I raised https://groups.google.com/u/1/g/re2-dev/c/nyGkxcJKExY with re2-dev to see why RE2 does not implement word boundary matching in this way.

I'm not in a position to document every way in which RE2/J differs from java.util.regex. Some of the differences are noted on the GitHub page; others are described in the RE2 syntax document (e.g. \b unambiguously implements ASCII word boundary matching, which is different from java.util.regex).
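
The difference can be reproduced with java.util.regex alone. The sketch below is illustrative only: the class name `BoundaryModes` and its methods are hypothetical, the ASCII boundary is emulated with lookarounds (which RE2 does not support) rather than produced by RE2/J itself, and `Pattern.UNICODE_CHARACTER_CLASS` is set explicitly so the result does not rely on the default definition of `\b` in a particular JDK release.

```java
import java.util.regex.Pattern;

final class BoundaryModes {
    // ASCII word class, spelled out so the result does not depend on how a
    // given engine defines its word-boundary escape.
    static final String W = "[0-9A-Za-z_]";

    // Lookaround-based emulation of an ASCII word boundary.
    static final String ASCII_B =
        "(?:(?<!" + W + ")(?=" + W + ")|(?<=" + W + ")(?!" + W + "))";

    // True if `word` occurs in `input` with an ASCII boundary on both sides.
    static boolean asciiBoundaryMatch(String word, String input) {
        return Pattern.compile(ASCII_B + Pattern.quote(word) + ASCII_B)
            .matcher(input).find();
    }

    // True if `word` occurs in `input` between Unicode-aware \b boundaries.
    static boolean unicodeBoundaryMatch(String word, String input) {
        return Pattern.compile("\\b" + Pattern.quote(word) + "\\b",
                Pattern.UNICODE_CHARACTER_CLASS)
            .matcher(input).find();
    }
}
```

With the word "été" as both pattern and input, `asciiBoundaryMatch` finds nothing because "é" is not an ASCII word character, while `unicodeBoundaryMatch` succeeds.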

@mykeul
Contributor Author

mykeul commented Jun 29, 2020

I guess (and hope) the wiki page documents what the code does, but that should not limit what the code should do. IMHO the page should be updated. Word boundaries for French words, and for many other languages, are mandatory for my usage, and accents are part of that. I moved to re2j because I need longest matches, and that was easier to implement by patching re2j than with Java's regex. Why not try to make both libraries almost equivalent? Why not make re2j the optimal regex library, with both functionalities? (US users will not notice. This is the same "fight" again and again: ASCII vs. Unicode, the same fight that brought us "code pages", which I hope we would both like to forget forever.)

@mykeul
Contributor Author

mykeul commented Jun 15, 2021

I left a comment on the closed original PR: a hack for people who want the PR. I still need it :-/

@mbasmanova

We ran into this problem in Velox when matching German strings:

SELECT REGEXP_LIKE('Insidern auch als Grenzenüberschreiter bekannt', '(?i)(\b)Grenzen(\b)')

This query returns 'true' while we (and our users) expect 'false'.

Is there any workaround?

CC: @zacw7
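
One possible workaround, offered as a sketch rather than an endorsed fix, is to avoid `\b` and spell out the delimiters explicitly: require the word to be preceded by the start of input or a character that is not a letter, digit or underscore, and followed by the end of input or another such character. The pattern uses only constructs listed in the RE2 syntax document (`\p{L}`, `\p{N}`, non-capturing groups, `\Q...\E`); whether the engine behind a given `REGEXP_LIKE` accepts it is an assumption. The example is written against java.util.regex so that it is self-contained, and the names `UnicodeBoundaryWorkaround`, `delimited` and `containsWord` are hypothetical.

```java
import java.util.regex.Pattern;

final class UnicodeBoundaryWorkaround {
    // Anything that is not a letter, digit or underscore counts as a delimiter.
    static final String NON_WORD = "[^\\p{L}\\p{N}_]";

    // Wraps a literal word so that it must be delimited by a non-word
    // character or by the start/end of the input. No lookarounds are used.
    static String delimited(String word) {
        return "(?i)(?:^|" + NON_WORD + ")" + Pattern.quote(word)
            + "(?:$|" + NON_WORD + ")";
    }

    // Boolean test in the spirit of REGEXP_LIKE.
    static boolean containsWord(String input, String word) {
        return Pattern.compile(delimited(word)).matcher(input).find();
    }
}
```

For the German sentence above, `containsWord(sentence, "Grenzen")` is false because "Grenzen" is followed by the letter "ü". Note that, unlike `\b`, this pattern consumes the delimiter characters, so it suits yes/no matching better than extraction or replacement.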

@mbasmanova

The same issue exists in RE2 as well: google/re2#344
