Improve caseless character range processing when utf is enabled #477

Merged
merged 1 commit into from
Sep 18, 2024

zherczeg
Collaborator

Currently, when UTF caseless matching is requested, each character in a class range is checked one by one to find its other cases. I have always thought this was inefficient, even though it is only done by the parser.

To speed things up, I have added a data structure that contains ranges of characters that have no other case. Each stored range has a minimum size. The table size in uint32_t units for different minimum range sizes is shown below (the number of ranges is half of that size):

minimum size 4: 124; /* 1111114 characters */
minimum size 8: 74; /* 1110983 characters */
minimum size 16: 60; /* 1110911 characters */

I chose 8 in the end, because only 74 values are needed, and these ranges cover 1110983 characters, which is 99.7% of all UTF characters.
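
As a rough illustration of the idea described above, here is a minimal sketch of a sorted table of caseless-free ranges stored as uint32_t pairs, with a lookup that lets a scanner jump over a whole range instead of checking each character. This is not the code from the patch: the names `no_case_ranges`, `skip_no_case`, and `count_checked` are hypothetical, and the three table entries are illustrative blocks chosen for the example, not the 74 values the patch generates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative table only, NOT the data from the patch. Each pair is an
   inclusive [first, last] range of code points assumed to have no other
   case. Pairs are sorted by start and do not overlap. */
static const uint32_t no_case_ranges[] = {
  0x3400, 0x4dbf,   /* CJK Unified Ideographs Extension A */
  0x4e00, 0x9fff,   /* CJK Unified Ideographs */
  0xac00, 0xd7a3    /* Hangul Syllables */
};

/* Number of ranges: half the number of uint32_t values. */
#define NO_CASE_PAIR_COUNT \
  (sizeof(no_case_ranges) / sizeof(no_case_ranges[0]) / 2)

/* If c lies inside a listed range, return the first code point after that
   range, capped at end + 1. Otherwise return c unchanged, meaning the
   caller still has to check c individually. */
static uint32_t skip_no_case(uint32_t c, uint32_t end)
{
  size_t lo = 0, hi = NO_CASE_PAIR_COUNT;

  assert(c <= end);

  /* Binary search for a pair with first <= c <= last. */
  while (lo < hi)
    {
    size_t mid = lo + (hi - lo) / 2;
    uint32_t first = no_case_ranges[2 * mid];
    uint32_t last = no_case_ranges[2 * mid + 1];

    if (c < first) hi = mid;
    else if (c > last) lo = mid + 1;
    else return (last < end) ? last + 1 : end + 1;
    }

  return c;
}

/* Count how many code points in [start, end] would still need a
   per-character other-case lookup after skipping listed ranges. */
static uint32_t count_checked(uint32_t start, uint32_t end)
{
  uint32_t checked = 0;
  uint32_t c = start;

  while (c <= end)
    {
    uint32_t next = skip_no_case(c, end);
    if (next == c)
      {
      checked++;   /* a real parser would do the slow lookup here */
      c++;
      }
    else c = next;
    }

  return checked;
}
```

With this toy table, scanning U+4DFF to U+A000 touches only the two code points outside the listed range rather than all 20994. A larger minimum range size keeps the table shorter at the cost of covering slightly fewer characters, which is the trade-off the size comparison above is weighing.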

Performance improvement by the patch:

Original (-O3):

./pcre2test -t 100
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[\x{100}-\x{100000}]/iB,utf
Compile time 8957.8600 microseconds
------------------------------------------------------------------
        Bra
        [KSks\xb5\xc5\xdf\xe5\xff\x{100}-\x{100000}]
        Ket
        End
------------------------------------------------------------------

New (-O3):

./pcre2test -t 100
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[\x{100}-\x{100000}]/iB,utf
Compile time  83.0600 microseconds
------------------------------------------------------------------
        Bra
        [KSks\xb5\xc5\xdf\xe5\xff\x{100}-\x{100000}]
        Ket
        End
------------------------------------------------------------------

The new one is roughly 108 times faster than the old (8957.86 vs. 83.06 microseconds), which I think is a nice speedup.

@zherczeg zherczeg marked this pull request as draft September 17, 2024 12:41
@zherczeg
Collaborator Author

I have some ideas to further improve this patch.


while ((rc = get_othercase_range(&c, end, &oc, &od,
         (xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0)) >= 0)
  {
  if (c > skip_start)
Collaborator Author

I thought get_othercase_range returns with c set to the character just after the one that has an other case, but sometimes it moves c even further. I do not fully understand this.

@zherczeg zherczeg marked this pull request as ready for review September 17, 2024 15:49
@zherczeg
Collaborator Author

I have updated the code. @PhilipHazel what do you think?

@PhilipHazel
Collaborator

Very nice idea! I am happy with it. Is it ready for merging now?

@zherczeg
Collaborator Author

It is ready. The code is fail-safe, although it would be great if I understood the purpose of the extra increase.

@PhilipHazel PhilipHazel merged commit 0333a78 into PCRE2Project:master Sep 18, 2024
12 checks passed
@zherczeg zherczeg deleted the utf_caseless_ranges branch September 18, 2024 16:50