Improve caseless character range processing when utf is enabled #477

zherczeg · 2024-09-17T11:28:39Z

Currently, when UTF caseless matching is requested, each character in a class range is checked one by one to find its other cases. I have always thought this is inefficient, even though it is only done by the parser.

To speed things up, I have added a data structure that contains ranges in which no character has another case. Each range has a minimum size. The sizes of the structure in uint32_t units for different minimum range sizes (the number of ranges is half of that size), with the number of characters covered in the comment:

minimum size 4: 124; /* 1111114 characters */
minimum size 8: 74; /* 1110983 characters */
minimum size 16: 60; /* 1110911 characters */

I chose 8 in the end, because only 74 values (37 ranges) are needed, and these ranges cover 1110983 characters, which is 99.7% of all UTF characters.
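The idea of a table of "no other case" ranges can be sketched as follows. This is only an illustration under stated assumptions, not the code from the patch: the table contents are a toy ASCII-only example, and the names `caseless_ranges`, `find_caseless_range`, and `count_individual_checks` are hypothetical. The table is a flat uint32_t array of inclusive start/end pairs, so the number of ranges is half the array size, as described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy table (NOT the real PCRE2 data): sorted, inclusive
   [start, end] pairs of characters that have no other case. */
static const uint32_t caseless_ranges[] = {
  0x00, 0x40,   /* controls, digits, punctuation up to '@' */
  0x5b, 0x60,   /* '[' .. '`' */
  0x7b, 0x7f    /* '{' .. DEL */
};

#define RANGE_COUNT (sizeof(caseless_ranges) / sizeof(caseless_ranges[0]) / 2)

/* Binary search over the pairs. Returns 1 and stores the inclusive end of
   the containing range in *end if c lies in a caseless range, else 0. */
static int find_caseless_range(uint32_t c, uint32_t *end)
{
  size_t lo = 0, hi = RANGE_COUNT;
  while (lo < hi)
    {
    size_t mid = lo + (hi - lo) / 2;
    if (c < caseless_ranges[2 * mid]) hi = mid;
    else if (c > caseless_ranges[2 * mid + 1]) lo = mid + 1;
    else { *end = caseless_ranges[2 * mid + 1]; return 1; }
    }
  return 0;
}

/* Counts how many characters in [first, last] still need an individual
   other-case check after whole caseless ranges have been skipped. */
static uint32_t count_individual_checks(uint32_t first, uint32_t last)
{
  uint32_t checks = 0;
  uint32_t c = first;
  assert(first <= last);
  while (c <= last)
    {
    uint32_t end;
    if (find_caseless_range(c, &end))
      {
      if (end >= last) break;   /* the rest of the input range is caseless */
      c = end + 1;              /* jump over the whole caseless range */
      }
    else
      {
      checks++;                 /* this character must be looked up */
      if (c == last) break;     /* avoid wrapping past the last character */
      c++;
      }
    }
  return checks;
}
```

With this toy table, scanning 0x00 to 0x7f leaves only the 52 ASCII letters to be checked individually, and a range that lies entirely inside one caseless range needs no per-character checks at all.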

Performance improvement by the patch:

Original (-O3):

./pcre2test -t 100
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[\x{100}-\x{100000}]/iB,utf
Compile time 8957.8600 microseconds
------------------------------------------------------------------
        Bra
        [KSks\xb5\xc5\xdf\xe5\xff\x{100}-\x{100000}]
        Ket
        End
------------------------------------------------------------------

New (-O3):

./pcre2test -t 100
PCRE2 version 10.45-DEV 2024-06-09 (8-bit)
  re> /[\x{100}-\x{100000}]/iB,utf
Compile time  83.0600 microseconds
------------------------------------------------------------------
        Bra
        [KSks\xb5\xc5\xdf\xe5\xff\x{100}-\x{100000}]
        Ket
        End
------------------------------------------------------------------

The new one is more than 107 times faster than the old one (8957.86 vs. 83.06 microseconds); I think this is a nice speedup.

zherczeg · 2024-09-17T12:41:57Z

I have some ideas to further improve this patch.

zherczeg · 2024-09-17T14:49:21Z

src/pcre2_compile.c


    while ((rc = get_othercase_range(&c, end, &oc, &od,
             (xoptions & PCRE2_EXTRA_CASELESS_RESTRICT) != 0)) >= 0)
      {
+      if (c > skip_start)


I thought get_othercase_range returned with c pointing just after the character that has an other case, but sometimes it moves even further. I do not fully understand this.
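One plausible reading of the extra movement, stated here as an assumption rather than a confirmed description of the PCRE2 function, is that the finder coalesces a run of consecutive characters whose other cases are also consecutive, and returns the whole run as one other-case range. In that case the input position ends up after the entire run, not after the first character with an other case. The sketch below shows this behaviour with a toy ASCII-only lookup; `toy_othercase` and `toy_othercase_range` are hypothetical names and do not reproduce the real signature.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for a Unicode other-case lookup: ASCII letters only. */
static uint32_t toy_othercase(uint32_t c)
{
  if (c >= 'a' && c <= 'z') return c - 32;
  if (c >= 'A' && c <= 'Z') return c + 32;
  return c;
}

/* Hypothetical simplified range finder. Searches [*cptr, end] for the next
   character that has an other case, then extends the result while the
   following characters map to consecutive other cases. On success it
   returns 0, stores the other-case range in [*oc, *od], and advances *cptr
   past the whole run. Returns -1 if no such character remains. */
static int toy_othercase_range(uint32_t *cptr, uint32_t end,
  uint32_t *oc, uint32_t *od)
{
  uint32_t c = *cptr;
  uint32_t next;
  assert(end < UINT32_MAX);   /* keeps c++ from wrapping around */

  /* Skip characters that have no other case. */
  while (c <= end && toy_othercase(c) == c) c++;
  if (c > end) return -1;

  *oc = toy_othercase(c);
  next = *oc + 1;

  /* Extend the run while the other cases stay consecutive. */
  for (c++; c <= end && toy_othercase(c) == next; c++) next++;

  *od = next - 1;   /* inclusive end of the other-case range */
  *cptr = c;        /* first character after the coalesced run */
  return 0;
}
```

Starting at '0' with an end of 0x7f, the first call reports the other-case range 'a' to 'z' and leaves the position at 0x5b, which is 26 characters past 'A', the first character that has an other case.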

zherczeg · 2024-09-17T15:51:29Z

I have updated the code. @PhilipHazel what do you think?

PhilipHazel · 2024-09-17T16:02:29Z

Very nice idea! I am happy with it. Is it ready for merging now?

zherczeg · 2024-09-17T16:20:14Z

It is ready. The code is fail-safe, although it would be great if I understood the purpose of the extra increase.

zherczeg marked this pull request as draft September 17, 2024 12:41

Improve caseless character range processing when utf is enabled

05db400

zherczeg force-pushed the utf_caseless_ranges branch from 8bae1b1 to 05db400 Compare September 17, 2024 14:46

zherczeg marked this pull request as ready for review September 17, 2024 15:49

PhilipHazel merged commit 0333a78 into PCRE2Project:master Sep 18, 2024
12 checks passed

zherczeg deleted the utf_caseless_ranges branch September 18, 2024 16:50
