diff --git a/doc/html/pcre2_set_optimize.html b/doc/html/pcre2_set_optimize.html
new file mode 100644
index 000000000..764b7f77c
--- /dev/null
+++ b/doc/html/pcre2_set_optimize.html
@@ -0,0 +1,47 @@
+
+
+pcre2_set_optimize specification
+
+

pcre2_set_optimize man page

+

+Return to the PCRE2 index page. +

+

+This page is part of the PCRE2 HTML documentation. It was generated +automatically from the original man page. If there is any nonsense in it, +please consult the man page, in case the conversion went wrong. +
+
+SYNOPSIS +
+

+#include <pcre2.h> +

+

+int pcre2_set_optimize(pcre2_compile_context *ccontext,
+  uint32_t directive);
+

+
+DESCRIPTION +
+

+This function controls which performance optimizations will be applied
+by pcre2_compile. It can be called multiple times with the same compile
+context; the effects are cumulative, with the effects of later calls taking
+precedence over earlier ones.
+

+

+The result is zero for success, PCRE2_ERROR_NULL if ccontext is NULL,
+or PCRE2_ERROR_BADOPTION if directive is unknown. This can be used to
+detect when the available version of PCRE2 does not implement a certain
+optimization.
+
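Reviewer sketch (not part of the patch): the "cumulative, later calls take precedence" rule and the two error returns described above can be modelled with a small bitmask. Everything named here (model_context, model_set_optimize, the DIR_*, OPT_* and MODEL_* constants) is hypothetical and chosen for illustration; the real constants and error codes live in pcre2.h, and the library's internal representation may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins; the real values are defined by pcre2.h. */
#define OPT_AUTO_POSSESS    0x1u
#define OPT_DOTSTAR_ANCHOR  0x2u
#define OPT_START_OPTIMIZE  0x4u
#define OPT_ALL (OPT_AUTO_POSSESS | OPT_DOTSTAR_ANCHOR | OPT_START_OPTIMIZE)

enum { DIR_NONE, DIR_FULL,
       DIR_AUTO_POSSESS, DIR_AUTO_POSSESS_OFF,
       DIR_DOTSTAR_ANCHOR, DIR_DOTSTAR_ANCHOR_OFF,
       DIR_START_OPTIMIZE, DIR_START_OPTIMIZE_OFF };

enum { MODEL_OK = 0, MODEL_ERROR_NULL = -1, MODEL_ERROR_BADOPTION = -2 };

typedef struct { uint32_t enabled; } model_context;

/* Apply one directive on top of whatever earlier calls left behind. */
int model_set_optimize(model_context *ctx, uint32_t directive)
{
  if (ctx == NULL) return MODEL_ERROR_NULL;
  switch (directive) {
    case DIR_NONE:               ctx->enabled = 0;                    break;
    case DIR_FULL:               ctx->enabled = OPT_ALL;              break;
    case DIR_AUTO_POSSESS:       ctx->enabled |= OPT_AUTO_POSSESS;    break;
    case DIR_AUTO_POSSESS_OFF:   ctx->enabled &= ~OPT_AUTO_POSSESS;   break;
    case DIR_DOTSTAR_ANCHOR:     ctx->enabled |= OPT_DOTSTAR_ANCHOR;  break;
    case DIR_DOTSTAR_ANCHOR_OFF: ctx->enabled &= ~OPT_DOTSTAR_ANCHOR; break;
    case DIR_START_OPTIMIZE:     ctx->enabled |= OPT_START_OPTIMIZE;  break;
    case DIR_START_OPTIMIZE_OFF: ctx->enabled &= ~OPT_START_OPTIMIZE; break;
    default: return MODEL_ERROR_BADOPTION;  /* unknown: state untouched */
  }
  return MODEL_OK;
}

/* Usage: switch everything off, then re-enable a single optimization. */
uint32_t model_only_auto_possess(void)
{
  model_context ctx = { OPT_ALL };          /* default: all enabled */
  int rc = model_set_optimize(&ctx, DIR_NONE);
  assert(rc == MODEL_OK);
  rc = model_set_optimize(&ctx, DIR_AUTO_POSSESS);
  assert(rc == MODEL_OK);
  return ctx.enabled;
}
```

In this model an unknown directive leaves the state unchanged, which is what would let a caller probe for an optimization and carry on if it is absent; whether the real function guarantees that is an assumption here.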

+

+There is a complete description of the PCRE2 native API, including all +permitted values for the directive parameter of pcre2_set_optimize, +in the +pcre2api +page.

+Return to the PCRE2 index page. +

diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index 9766f454d..08ba35811 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -179,6 +179,10 @@

pcre2api man page


int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, int (*guard_function)(uint32_t, void *), void *user_data); +
+
+int pcre2_set_optimize(pcre2_compile_context *ccontext, + uint32_t directive);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

@@ -808,6 +812,7 @@

pcre2api man page

   The compile time nested parentheses limit
   The maximum length of the pattern string
   The extra options bits (none set by default)
+  Which performance optimizations the compiler should apply
 A compile context is also required if you are using custom memory management.
 If none of these apply, just pass NULL as the context argument of
@@ -952,6 +957,110 @@

pcre2api man page

nesting, and the second is user data that is set up by the last argument of pcre2_set_compile_recursion_guard(). The callout function should return zero if all is well, or non-zero to force an error. +
+
+int pcre2_set_optimize(pcre2_compile_context *ccontext, + uint32_t directive); +
+
+PCRE2 can apply various performance optimizations during compilation, in order +to make matching faster. For example, the compiler might convert some regex +constructs into an equivalent construct which pcre2_match() can execute +faster. By default, all available optimizations are enabled. However, in rare +cases, one might wish to disable specific optimizations. For example, if it is +known that some optimizations cannot benefit a certain regex, it might be +desirable to disable them, in order to speed up compilation. +

+

+The permitted values of directive are as follows: +

+  PCRE2_OPTIMIZATION_NONE
+
+Disable all optional performance optimizations. +
+  PCRE2_OPTIMIZATION_FULL
+
+Enable all optional performance optimizations. This is the default value. +
+  PCRE2_AUTO_POSSESS
+  PCRE2_AUTO_POSSESS_OFF
+
+Enable/disable "auto-possessification" of variable quantifiers such as * and +. +This optimization, for example, turns a+b into a++b in order to avoid +backtracks into a+ that can never be successful. However, if callouts are in +use, auto-possessification means that some callouts are never taken. You can +disable this optimization if you want the matching functions to do a full, +unoptimized search and run all the callouts. +
+  PCRE2_DOTSTAR_ANCHOR
+  PCRE2_DOTSTAR_ANCHOR_OFF
+
+Enable/disable an optimization that is applied when .* is the first significant +item in a top-level branch of a pattern, and all the other branches also start +with .* or with \A or \G or ^. Such a pattern is automatically anchored if +PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any +^ items. Otherwise, the fact that any match must start either at the start of +the subject or following a newline is remembered. Like other optimizations, +this can cause callouts to be skipped. +

+

+Dotstar anchor optimization is automatically disabled for .* if it is inside an +atomic group or a capture group that is the subject of a backreference, or if +the pattern contains (*PRUNE) or (*SKIP). +

+  PCRE2_START_OPTIMIZE
+  PCRE2_START_OPTIMIZE_OFF
+
+Enable/disable optimizations which cause matching functions to scan the subject +string for specific code unit values before attempting a match. For example, if +it is known that an unanchored match must start with a specific value, the +matching code searches the subject for that value, and fails immediately if it +cannot find it, without actually running the main matching function. This means +that a special item such as (*COMMIT) at the start of a pattern is not +considered until after a suitable starting point for the match has been found. +Also, when callouts or (*MARK) items are in use, these "start-up" optimizations +can cause them to be skipped if the pattern is never actually used. The start-up +optimizations are in effect a pre-scan of the subject that takes place before +the pattern is run. +

+

+Disabling start-up optimizations ensures that in cases where the result is "no +match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are +considered at every possible starting position in the subject string. +

+

+Disabling start-up optimizations may change the outcome of a matching operation. +Consider the pattern +

+  (*COMMIT)ABC
+
+When this is compiled, PCRE2 records the fact that a match must start with the +character "A". Suppose the subject string is "DEFABC". The start-up +optimization scans along the subject, finds "A" and runs the first match +attempt from there. The (*COMMIT) item means that the pattern must match the +current starting position, which in this case, it does. However, if the same +match is run without start-up optimizations, the initial scan along the subject +string does not happen. The first match attempt is run starting from "D" and +when this fails, (*COMMIT) prevents any further matches being tried, so the +overall result is "no match". +
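Reviewer sketch (not part of the patch): the (*COMMIT)ABC case above can be traced with a toy matcher. This is plain C that models only this one pattern; toy_commit_abc is a hypothetical name and nothing here calls into or reflects the internals of PCRE2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of the pattern (*COMMIT)ABC.
   Returns the match offset, or -1 for "no match". */
long toy_commit_abc(const char *subject, bool start_optimize)
{
  assert(subject != NULL);
  size_t len = strlen(subject);
  size_t start = 0;

  if (start_optimize) {
    /* Pre-scan: a match must begin with 'A', so skip straight to it.
       If there is no 'A', fail without running any match attempt. */
    const char *a = strchr(subject, 'A');
    if (a == NULL) return -1;
    start = (size_t)(a - subject);
  }

  /* (*COMMIT) is passed at the very start of the first attempt, so if
     this attempt fails, no later starting position is tried. */
  if (len - start >= 3 && memcmp(subject + start, "ABC", 3) == 0)
    return (long)start;
  return -1;
}
```

With the subject "DEFABC", toy_commit_abc("DEFABC", true) finds the match at offset 3, while toy_commit_abc("DEFABC", false) makes its only attempt at "D" and reports no match, mirroring the two outcomes in the text.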

+

+Another start-up optimization makes use of a minimum length for a matching +subject, which is recorded when possible. Consider the pattern +

+  (*MARK:1)B(*MARK:2)(X|Y)
+
+The minimum length for a match is two characters. If the subject is "XXBB", the +"starting character" optimization skips "XX", then tries to match "BB", which +is long enough. In the process, (*MARK:2) is encountered and remembered. When +the match attempt fails, the next "B" is found, but there is only one character +left, so there are no more attempts, and "no match" is returned with the "last +mark seen" set to "2". Without start-up optimizations, however, matches are +tried at every possible starting position, including at the end of the subject, +where (*MARK:1) is encountered, but there is no "B", so the "last mark seen" +that is returned is "1". In this case, the optimizations do not affect the +overall match result, which is still "no match", but they do affect the +auxiliary information that is returned.
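Reviewer sketch (not part of the patch): the "last mark seen" difference can be traced the same way. The function below is a toy model of (*MARK:1)B(*MARK:2)(X|Y) only; toy_mark_match and its parameters are hypothetical, and the pre-scan is simplified to "skip to the next B, then check the minimum length of two".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of (*MARK:1)B(*MARK:2)(X|Y).
   Returns true on a match; *last_mark receives the number of the last
   mark passed during the whole run (0 if none was reached). */
bool toy_mark_match(const char *subject, bool start_optimize, int *last_mark)
{
  assert(subject != NULL && last_mark != NULL);
  const size_t min_length = 2;
  size_t len = strlen(subject);
  *last_mark = 0;

  for (size_t pos = 0; pos <= len; pos++) {
    if (start_optimize) {
      /* Pre-scan: move to the next 'B'; give up once the remaining
         subject is shorter than the minimum match length. */
      while (pos < len && subject[pos] != 'B') pos++;
      if (len - pos < min_length) return false;
    }
    *last_mark = 1;                                   /* (*MARK:1) */
    if (pos >= len || subject[pos] != 'B') continue;
    *last_mark = 2;                                   /* (*MARK:2) */
    if (pos + 1 < len &&
        (subject[pos + 1] == 'X' || subject[pos + 1] == 'Y'))
      return true;
  }
  return false;
}
```

For the subject "XXBB" both modes report no match, but the model leaves the last mark at 2 with the pre-scan and at 1 without it, because only the unoptimized run makes a final attempt at the end of the subject.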


The match context @@ -1807,85 +1916,55 @@

pcre2api man page

   PCRE2_NO_AUTO_POSSESS
 
-If this option is set, it disables "auto-possessification", which is an
-optimization that, for example, turns a+b into a++b in order to avoid
+If this (deprecated) option is set, it disables "auto-possessification", which
+is an optimization that, for example, turns a+b into a++b in order to avoid
 backtracks into a+ that can never be successful. However, if callouts are in
 use, auto-possessification means that some callouts are never taken. You can
 set this option if you want the matching functions to do a full unoptimized
 search and run all the callouts, but it is mainly provided for testing
 purposes.
+

+

+It is recommended to use pcre2_set_optimize with the directive +PCRE2_AUTO_POSSESS_OFF rather than the compile option PCRE2_NO_AUTO_POSSESS. +Note that PCRE2_NO_AUTO_POSSESS takes precedence over the +pcre2_set_optimize optimization directives PCRE2_AUTO_POSSESS and +PCRE2_AUTO_POSSESS_OFF.

   PCRE2_NO_DOTSTAR_ANCHOR
 
-If this option is set, it disables an optimization that is applied when .* is
-the first significant item in a top-level branch of a pattern, and all the
-other branches also start with .* or with \A or \G or ^. The optimization is
-automatically disabled for .* if it is inside an atomic group or a capture
-group that is the subject of a backreference, or if the pattern contains
-(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
-automatically anchored if PCRE2_DOTALL is set for all the .* items and
-PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
-must start either at the start of the subject or following a newline is
+If this (deprecated) option is set, it disables an optimization that is applied
+when .* is the first significant item in a top-level branch of a pattern, and
+all the other branches also start with .* or with \A or \G or ^. The
+optimization is automatically disabled for .* if it is inside an atomic group
+or a capture group that is the subject of a backreference, or if the pattern
+contains (*PRUNE) or (*SKIP). When the optimization is not disabled, such a
+pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items
+and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any
+match must start either at the start of the subject or following a newline is
 remembered. Like other optimizations, this can cause callouts to be skipped.
+(It is recommended to use pcre2_set_optimize instead.)
   PCRE2_NO_START_OPTIMIZE
 
 This is an option whose main effect is at matching time. It does not change
 what pcre2_compile() generates, but it does affect the output of the JIT
-compiler.
+compiler. Setting this option is equivalent to calling pcre2_set_optimize
+with the directive parameter set to PCRE2_START_OPTIMIZE_OFF.

 There are a number of optimizations that may occur at the start of a match, in
 order to speed up the process. For example, if it is known that an unanchored
 match must start with a specific code unit value, the matching code searches
 the subject for that value, and fails immediately if it cannot find it, without
-actually running the main matching function. This means that a special item
-such as (*COMMIT) at the start of a pattern is not considered until after a
-suitable starting point for the match has been found. Also, when callouts or
-(*MARK) items are in use, these "start-up" optimizations can cause them to be
-skipped if the pattern is never actually used. The start-up optimizations are
+actually running the main matching function. The start-up optimizations are
 in effect a pre-scan of the subject that takes place before the pattern is run.

-The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations, -possibly causing performance to suffer, but ensuring that in cases where the -result is "no match", the callouts do occur, and that items such as (*COMMIT) -and (*MARK) are considered at every possible starting position in the subject -string. -

-

-Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation. -Consider the pattern -

-  (*COMMIT)ABC
-
-When this is compiled, PCRE2 records the fact that a match must start with the -character "A". Suppose the subject string is "DEFABC". The start-up -optimization scans along the subject, finds "A" and runs the first match -attempt from there. The (*COMMIT) item means that the pattern must match the -current starting position, which in this case, it does. However, if the same -match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the -subject string does not happen. The first match attempt is run starting from -"D" and when this fails, (*COMMIT) prevents any further matches being tried, so -the overall result is "no match". -

-

-As another start-up optimization makes use of a minimum length for a matching -subject, which is recorded when possible. Consider the pattern -

-  (*MARK:1)B(*MARK:2)(X|Y)
-
-The minimum length for a match is two characters. If the subject is "XXBB", the
-"starting character" optimization skips "XX", then tries to match "BB", which
-is long enough. In the process, (*MARK:2) is encountered and remembered. When
-the match attempt fails, the next "B" is found, but there is only one character
-left, so there are no more attempts, and "no match" is returned with the "last
-mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
-at every possible starting position, including at the end of the subject, where
-(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
-returned is "1". In this case, the optimizations do not affect the overall
-match result, which is still "no match", but they do affect the auxiliary
-information that is returned.
+Disabling the start-up optimizations may cause performance to suffer. However,
+this may be desirable for patterns which contain callouts or items such as
+(*COMMIT) and (*MARK). See the above description of PCRE2_START_OPTIMIZE_OFF
+for further details.
   PCRE2_NO_UTF_CHECK
 
@@ -2312,6 +2391,7 @@

pcre2api man page

   PCRE2_DOTALL is in force for .*
   Neither (*PRUNE) nor (*SKIP) appears in the pattern
   PCRE2_NO_DOTSTAR_ANCHOR is not set
+  Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF
 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
 options returned for PCRE2_INFO_ALLOPTIONS.
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 1902f1030..fe7c42eef 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -2243,7 +2243,7 @@

pcre2pattern man page

 PCRE2 has an optimization that automatically "possessifies" certain simple
 pattern constructs. For example, the sequence A+B is treated as A++B because
 there is no point in backtracking into a sequence of A's when B must follow.
-This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting
+This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or starting
 the pattern with (*NO_AUTO_POSSESS).

diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html
index a519793c7..b38869e10 100644
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@@ -681,6 +681,23 @@

pcre2test man page

brackets. Setting utf in 16-bit or 32-bit mode also causes pattern and subject strings to be translated to UTF-16 or UTF-32, respectively, before being passed to library functions. +
+
+The following modifiers enable or disable performance optimizations by +calling pcre2_set_optimize() before invoking the regex compiler. +
+      optimization_full      enable all optional optimizations
+      optimization_none      disable all optional optimizations
+      auto_possess           auto-possessify variable quantifiers
+      auto_possess_off       don't auto-possessify variable quantifiers
+      dotstar_anchor         anchor patterns starting with .*
+      dotstar_anchor_off     don't anchor patterns starting with .*
+      start_optimize         enable pre-scan of subject string
+      start_optimize_off     disable pre-scan of subject string
+
+See the +pcre2_set_optimize +documentation for details on these optimizations.


Setting compilation controls diff --git a/doc/pcre2.txt b/doc/pcre2.txt index d6b8d0b34..42c2f452b 100644 --- a/doc/pcre2.txt +++ b/doc/pcre2.txt @@ -296,6 +296,9 @@ PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext, int (*guard_function)(uint32_t, void *), void *user_data); + int pcre2_set_optimize(pcre2_compile_context *ccontext, + uint32_t directive); + PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS @@ -978,6 +981,110 @@ PCRE2 CONTEXTS ment of pcre2_set_compile_recursion_guard(). The callout function should return zero if all is well, or non-zero to force an error. + int pcre2_set_optimize(pcre2_compile_context *ccontext, + uint32_t directive); + + PCRE2 can apply various performance optimizations during compilation, + in order to make matching faster. For example, the compiler might con‐ + vert some regex constructs into an equivalent construct which + pcre2_match() can execute faster. By default, all available optimiza‐ + tions are enabled. However, in rare cases, one might wish to disable + specific optimizations. For example, if it is known that some optimiza‐ + tions cannot benefit a certain regex, it might be desirable to disable + them, in order to speed up compilation. + + The permitted values of directive are as follows: + + PCRE2_OPTIMIZATION_NONE + + Disable all optional performance optimizations. + + PCRE2_OPTIMIZATION_FULL + + Enable all optional performance optimizations. This is the default + value. + + PCRE2_AUTO_POSSESS + PCRE2_AUTO_POSSESS_OFF + + Enable/disable "auto-possessification" of variable quantifiers such as + * and +. This optimization, for example, turns a+b into a++b in order + to avoid backtracks into a+ that can never be successful. However, if + callouts are in use, auto-possessification means that some callouts are + never taken. You can disable this optimization if you want the matching + functions to do a full, unoptimized search and run all the callouts. 
+ + PCRE2_DOTSTAR_ANCHOR + PCRE2_DOTSTAR_ANCHOR_OFF + + Enable/disable an optimization that is applied when .* is the first + significant item in a top-level branch of a pattern, and all the other + branches also start with .* or with \A or \G or ^. Such a pattern is + automatically anchored if PCRE2_DOTALL is set for all the .* items and + PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that + any match must start either at the start of the subject or following a + newline is remembered. Like other optimizations, this can cause call‐ + outs to be skipped. + + Dotstar anchor optimization is automatically disabled for .* if it is + inside an atomic group or a capture group that is the subject of a + backreference, or if the pattern contains (*PRUNE) or (*SKIP). + + PCRE2_START_OPTIMIZE + PCRE2_START_OPTIMIZE_OFF + + Enable/disable optimizations which cause matching functions to scan the + subject string for specific code unit values before attempting a match. + For example, if it is known that an unanchored match must start with a + specific value, the matching code searches the subject for that value, + and fails immediately if it cannot find it, without actually running + the main matching function. This means that a special item such as + (*COMMIT) at the start of a pattern is not considered until after a + suitable starting point for the match has been found. Also, when call‐ + outs or (*MARK) items are in use, these "start-up" optimizations can + cause them to be skipped if the pattern is never actually used. The + start-up optimizations are in effect a pre-scan of the subject that + takes place before the pattern is run. + + Disabling start-up optimizations ensures that in cases where the result + is "no match", the callouts do occur, and that items such as (*COMMIT) + and (*MARK) are considered at every possible starting position in the + subject string. + + Disabling start-up optimizations may change the outcome of a matching + operation. 
Consider the pattern + + (*COMMIT)ABC + + When this is compiled, PCRE2 records the fact that a match must start + with the character "A". Suppose the subject string is "DEFABC". The + start-up optimization scans along the subject, finds "A" and runs the + first match attempt from there. The (*COMMIT) item means that the pat‐ + tern must match the current starting position, which in this case, it + does. However, if the same match is run without start-up optimizations, + the initial scan along the subject string does not happen. The first + match attempt is run starting from "D" and when this fails, (*COMMIT) + prevents any further matches being tried, so the overall result is "no + match". + + Another start-up optimization makes use of a minimum length for a + matching subject, which is recorded when possible. Consider the pattern + + (*MARK:1)B(*MARK:2)(X|Y) + + The minimum length for a match is two characters. If the subject is + "XXBB", the "starting character" optimization skips "XX", then tries to + match "BB", which is long enough. In the process, (*MARK:2) is encoun‐ + tered and remembered. When the match attempt fails, the next "B" is + found, but there is only one character left, so there are no more at‐ + tempts, and "no match" is returned with the "last mark seen" set to + "2". Without start-up optimizations, however, matches are tried at ev‐ + ery possible starting position, including at the end of the subject, + where (*MARK:1) is encountered, but there is no "B", so the "last mark + seen" that is returned is "1". In this case, the optimizations do not + affect the overall match result, which is still "no match", but they do + affect the auxiliary information that is returned. 
+ The match context A match context is required if you want to: @@ -1775,86 +1882,55 @@ COMPILING A PATTERN PCRE2_NO_AUTO_POSSESS - If this option is set, it disables "auto-possessification", which is an - optimization that, for example, turns a+b into a++b in order to avoid - backtracks into a+ that can never be successful. However, if callouts - are in use, auto-possessification means that some callouts are never - taken. You can set this option if you want the matching functions to do - a full unoptimized search and run all the callouts, but it is mainly - provided for testing purposes. + If this (deprecated) option is set, it disables "auto-possessifica‐ + tion", which is an optimization that, for example, turns a+b into a++b + in order to avoid backtracks into a+ that can never be successful. How‐ + ever, if callouts are in use, auto-possessification means that some + callouts are never taken. You can set this option if you want the + matching functions to do a full unoptimized search and run all the + callouts, but it is mainly provided for testing purposes. + + It is recommended to use pcre2_set_optimize with the directive + PCRE2_AUTO_POSSESS_OFF rather than the compile option + PCRE2_NO_AUTO_POSSESS. Note that PCRE2_NO_AUTO_POSSESS takes prece‐ + dence over the pcre2_set_optimize optimization directives + PCRE2_AUTO_POSSESS and PCRE2_AUTO_POSSESS_OFF. PCRE2_NO_DOTSTAR_ANCHOR - If this option is set, it disables an optimization that is applied when - .* is the first significant item in a top-level branch of a pattern, - and all the other branches also start with .* or with \A or \G or ^. - The optimization is automatically disabled for .* if it is inside an - atomic group or a capture group that is the subject of a backreference, - or if the pattern contains (*PRUNE) or (*SKIP). 
When the optimization - is not disabled, such a pattern is automatically anchored if + If this (deprecated) option is set, it disables an optimization that is + applied when .* is the first significant item in a top-level branch of + a pattern, and all the other branches also start with .* or with \A or + \G or ^. The optimization is automatically disabled for .* if it is in‐ + side an atomic group or a capture group that is the subject of a back‐ + reference, or if the pattern contains (*PRUNE) or (*SKIP). When the op‐ + timization is not disabled, such a pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match must start either at the start of the subject or following a newline is remembered. Like - other optimizations, this can cause callouts to be skipped. + other optimizations, this can cause callouts to be skipped. (It is + recommended to use pcre2_set_optimize instead.) PCRE2_NO_START_OPTIMIZE - This is an option whose main effect is at matching time. It does not + This is an option whose main effect is at matching time. It does not change what pcre2_compile() generates, but it does affect the output of - the JIT compiler. + the JIT compiler. Setting this option is equivalent to calling + pcre2_set_optimize with the directive parameter set to PCRE2_START_OP‐ + TIMIZE_OFF. There are a number of optimizations that may occur at the start of a match, in order to speed up the process. For example, if it is known that an unanchored match must start with a specific code unit value, - the matching code searches the subject for that value, and fails imme- - diately if it cannot find it, without actually running the main match- - ing function. This means that a special item such as (*COMMIT) at the - start of a pattern is not considered until after a suitable starting - point for the match has been found. 
Also, when callouts or (*MARK) - items are in use, these "start-up" optimizations can cause them to be - skipped if the pattern is never actually used. The start-up optimiza- - tions are in effect a pre-scan of the subject that takes place before - the pattern is run. - - The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations, - possibly causing performance to suffer, but ensuring that in cases - where the result is "no match", the callouts do occur, and that items - such as (*COMMIT) and (*MARK) are considered at every possible starting - position in the subject string. - - Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching - operation. Consider the pattern - - (*COMMIT)ABC - - When this is compiled, PCRE2 records the fact that a match must start - with the character "A". Suppose the subject string is "DEFABC". The - start-up optimization scans along the subject, finds "A" and runs the - first match attempt from there. The (*COMMIT) item means that the pat- - tern must match the current starting position, which in this case, it - does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE - set, the initial scan along the subject string does not happen. The - first match attempt is run starting from "D" and when this fails, - (*COMMIT) prevents any further matches being tried, so the overall re- - sult is "no match". - - As another start-up optimization makes use of a minimum length for a - matching subject, which is recorded when possible. Consider the pattern - - (*MARK:1)B(*MARK:2)(X|Y) + the matching code searches the subject for that value, and fails imme‐ + diately if it cannot find it, without actually running the main match‐ + ing function. The start-up optimizations are in effect a pre-scan of + the subject that takes place before the pattern is run. - The minimum length for a match is two characters. 
If the subject is - "XXBB", the "starting character" optimization skips "XX", then tries to - match "BB", which is long enough. In the process, (*MARK:2) is encoun- - tered and remembered. When the match attempt fails, the next "B" is - found, but there is only one character left, so there are no more at- - tempts, and "no match" is returned with the "last mark seen" set to - "2". If NO_START_OPTIMIZE is set, however, matches are tried at every - possible starting position, including at the end of the subject, where - (*MARK:1) is encountered, but there is no "B", so the "last mark seen" - that is returned is "1". In this case, the optimizations do not affect - the overall match result, which is still "no match", but they do affect - the auxiliary information that is returned. + Disabling the start-up optimizations may cause performance to suffer. + However, this may be desirable for patterns which contain callouts or + items such as (*COMMIT) and (*MARK). See the above description of + PCRE2_START_OPTIMIZE_OFF for further details. PCRE2_NO_UTF_CHECK @@ -2261,6 +2337,7 @@ INFORMATION ABOUT A COMPILED PATTERN PCRE2_DOTALL is in force for .* Neither (*PRUNE) nor (*SKIP) appears in the pattern PCRE2_NO_DOTSTAR_ANCHOR is not set + Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the options returned for PCRE2_INFO_ALLOPTIONS. diff --git a/doc/pcre2_set_optimize.3 b/doc/pcre2_set_optimize.3 new file mode 100644 index 000000000..1a51cc27e --- /dev/null +++ b/doc/pcre2_set_optimize.3 @@ -0,0 +1,33 @@ +.TH PCRE2_SET_OPTIMIZE 3 "16 September 2024" "PCRE2 10.45" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH SYNOPSIS +.rs +.sp +.B #include <pcre2.h> +.PP +.nf +.B int pcre2_set_optimize(pcre2_compile_context *\fIccontext\fP, +.B " uint32_t \fIdirective\fP);" +.fi +. 
+.SH DESCRIPTION +.rs +.sp +This function controls which performance optimizations will be applied +by \fBpcre2_compile\fP. It can be called multiple times with the same compile +context; the effects are cumulative, with the effects of later calls taking +precedence over earlier ones. +.P +The result is zero for success, PCRE2_ERROR_NULL if \fIccontext\fP is NULL, +or PCRE2_ERROR_BADOPTION if \fIdirective\fP is unknown. This can be used to +detect when the available version of PCRE2 does not implement a certain +optimization. +.P +There is a complete description of the PCRE2 native API, including all +permitted values for the \fIdirective\fP parameter of \fBpcre2_set_optimize\fP, +in the +.\" HREF +\fBpcre2api\fP +.\" +page. \ No newline at end of file diff --git a/doc/pcre2api.3 b/doc/pcre2api.3 index a362982d8..026e85d0b 100644 --- a/doc/pcre2api.3 +++ b/doc/pcre2api.3 @@ -115,6 +115,9 @@ document for an overview of all the PCRE2 documentation. .sp .B int pcre2_set_compile_recursion_guard(pcre2_compile_context *\fIccontext\fP, .B " int (*\fIguard_function\fP)(uint32_t, void *), void *\fIuser_data\fP);" +.sp +.B int pcre2_set_optimize(pcre2_compile_context *\fIccontext\fP, +.B " uint32_t \fIdirective\fP);" .fi . . @@ -738,6 +741,7 @@ following compile-time parameters: The compile time nested parentheses limit The maximum length of the pattern string The extra options bits (none set by default) + Which performance optimizations the compiler should apply .sp A compile context is also required if you are using custom memory management. If none of these apply, just pass NULL as the context argument of @@ -881,6 +885,105 @@ The first argument to the callout function gives the current depth of nesting, and the second is user data that is set up by the last argument of \fBpcre2_set_compile_recursion_guard()\fP. The callout function should return zero if all is well, or non-zero to force an error. 
+.sp +.nf +.B int pcre2_set_optimize(pcre2_compile_context *\fIccontext\fP, +.B " uint32_t \fIdirective\fP);" +.fi +.sp +PCRE2 can apply various performance optimizations during compilation, in order +to make matching faster. For example, the compiler might convert some regex +constructs into an equivalent construct which \fBpcre2_match()\fP can execute +faster. By default, all available optimizations are enabled. However, in rare +cases, one might wish to disable specific optimizations. For example, if it is +known that some optimizations cannot benefit a certain regex, it might be +desirable to disable them, in order to speed up compilation. +.P +The permitted values of \fIdirective\fP are as follows: +.sp + PCRE2_OPTIMIZATION_NONE +.sp +Disable all optional performance optimizations. +.sp + PCRE2_OPTIMIZATION_FULL +.sp +Enable all optional performance optimizations. This is the default value. +.sp + PCRE2_AUTO_POSSESS + PCRE2_AUTO_POSSESS_OFF +.sp +Enable/disable "auto-possessification" of variable quantifiers such as * and +. +This optimization, for example, turns a+b into a++b in order to avoid +backtracks into a+ that can never be successful. However, if callouts are in +use, auto-possessification means that some callouts are never taken. You can +disable this optimization if you want the matching functions to do a full, +unoptimized search and run all the callouts. +.sp + PCRE2_DOTSTAR_ANCHOR + PCRE2_DOTSTAR_ANCHOR_OFF +.sp +Enable/disable an optimization that is applied when .* is the first significant +item in a top-level branch of a pattern, and all the other branches also start +with .* or with \eA or \eG or ^. Such a pattern is automatically anchored if +PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any +^ items. Otherwise, the fact that any match must start either at the start of +the subject or following a newline is remembered. Like other optimizations, +this can cause callouts to be skipped. 
+.P +Dotstar anchor optimization is automatically disabled for .* if it is inside an +atomic group or a capture group that is the subject of a backreference, or if +the pattern contains (*PRUNE) or (*SKIP). +.sp + PCRE2_START_OPTIMIZE + PCRE2_START_OPTIMIZE_OFF +.sp +Enable/disable optimizations which cause matching functions to scan the subject +string for specific code unit values before attempting a match. For example, if +it is known that an unanchored match must start with a specific value, the +matching code searches the subject for that value, and fails immediately if it +cannot find it, without actually running the main matching function. This means +that a special item such as (*COMMIT) at the start of a pattern is not +considered until after a suitable starting point for the match has been found. +Also, when callouts or (*MARK) items are in use, these "start-up" optimizations +can cause them to be skipped if the pattern is never actually used. The start-up +optimizations are in effect a pre-scan of the subject that takes place before +the pattern is run. +.P +Disabling start-up optimizations ensures that in cases where the result is "no +match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are +considered at every possible starting position in the subject string. +.P +Disabling start-up optimizations may change the outcome of a matching operation. +Consider the pattern +.sp + (*COMMIT)ABC +.sp +When this is compiled, PCRE2 records the fact that a match must start with the +character "A". Suppose the subject string is "DEFABC". The start-up +optimization scans along the subject, finds "A" and runs the first match +attempt from there. The (*COMMIT) item means that the pattern must match the +current starting position, which in this case, it does. However, if the same +match is run without start-up optimizations, the initial scan along the subject +string does not happen. 
The first match attempt is run starting from "D" and +when this fails, (*COMMIT) prevents any further matches being tried, so the +overall result is "no match". +.P +Another start-up optimization makes use of a minimum length for a matching +subject, which is recorded when possible. Consider the pattern +.sp + (*MARK:1)B(*MARK:2)(X|Y) +.sp +The minimum length for a match is two characters. If the subject is "XXBB", the +"starting character" optimization skips "XX", then tries to match "BB", which +is long enough. In the process, (*MARK:2) is encountered and remembered. When +the match attempt fails, the next "B" is found, but there is only one character +left, so there are no more attempts, and "no match" is returned with the "last +mark seen" set to "2". Without start-up optimizations, however, matches are +tried at every possible starting position, including at the end of the subject, +where (*MARK:1) is encountered, but there is no "B", so the "last mark seen" +that is returned is "1". In this case, the optimizations do not affect the +overall match result, which is still "no match", but they do affect the +auxiliary information that is returned. . . .\" HTML @@ -1748,81 +1851,52 @@ though the reference can be by name or by number. .sp PCRE2_NO_AUTO_POSSESS .sp -If this option is set, it disables "auto-possessification", which is an -optimization that, for example, turns a+b into a++b in order to avoid +If this (deprecated) option is set, it disables "auto-possessification", which +is an optimization that, for example, turns a+b into a++b in order to avoid backtracks into a+ that can never be successful. However, if callouts are in use, auto-possessification means that some callouts are never taken. You can set this option if you want the matching functions to do a full unoptimized search and run all the callouts, but it is mainly provided for testing purposes. 
+.P +It is recommended to use \fBpcre2_set_optimize\fP with the \fIdirective\fP +PCRE2_AUTO_POSSESS_OFF rather than the compile option PCRE2_NO_AUTO_POSSESS. +Note that PCRE2_NO_AUTO_POSSESS takes precedence over the +\fBpcre2_set_optimize\fP optimization directives PCRE2_AUTO_POSSESS and +PCRE2_AUTO_POSSESS_OFF. .sp PCRE2_NO_DOTSTAR_ANCHOR .sp -If this option is set, it disables an optimization that is applied when .* is -the first significant item in a top-level branch of a pattern, and all the -other branches also start with .* or with \eA or \eG or ^. The optimization is -automatically disabled for .* if it is inside an atomic group or a capture -group that is the subject of a backreference, or if the pattern contains -(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is -automatically anchored if PCRE2_DOTALL is set for all the .* items and -PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match -must start either at the start of the subject or following a newline is +If this (deprecated) option is set, it disables an optimization that is applied +when .* is the first significant item in a top-level branch of a pattern, and +all the other branches also start with .* or with \eA or \eG or ^. The +optimization is automatically disabled for .* if it is inside an atomic group +or a capture group that is the subject of a backreference, or if the pattern +contains (*PRUNE) or (*SKIP). When the optimization is not disabled, such a +pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items +and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any +match must start either at the start of the subject or following a newline is remembered. Like other optimizations, this can cause callouts to be skipped. +(It is recommended to use \fBpcre2_set_optimize\fP instead.) .sp PCRE2_NO_START_OPTIMIZE .sp This is an option whose main effect is at matching time. 
It does not change what \fBpcre2_compile()\fP generates, but it does affect the output of the JIT -compiler. +compiler. Setting this option is equivalent to calling \fBpcre2_set_optimize\fP +with the \fIdirective\fP parameter set to PCRE2_START_OPTIMIZE_OFF. .P There are a number of optimizations that may occur at the start of a match, in order to speed up the process. For example, if it is known that an unanchored match must start with a specific code unit value, the matching code searches the subject for that value, and fails immediately if it cannot find it, without -actually running the main matching function. This means that a special item -such as (*COMMIT) at the start of a pattern is not considered until after a -suitable starting point for the match has been found. Also, when callouts or -(*MARK) items are in use, these "start-up" optimizations can cause them to be -skipped if the pattern is never actually used. The start-up optimizations are +actually running the main matching function. The start-up optimizations are in effect a pre-scan of the subject that takes place before the pattern is run. .P -The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations, -possibly causing performance to suffer, but ensuring that in cases where the -result is "no match", the callouts do occur, and that items such as (*COMMIT) -and (*MARK) are considered at every possible starting position in the subject -string. -.P -Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation. -Consider the pattern -.sp - (*COMMIT)ABC -.sp -When this is compiled, PCRE2 records the fact that a match must start with the -character "A". Suppose the subject string is "DEFABC". The start-up -optimization scans along the subject, finds "A" and runs the first match -attempt from there. The (*COMMIT) item means that the pattern must match the -current starting position, which in this case, it does. 
However, if the same -match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the -subject string does not happen. The first match attempt is run starting from -"D" and when this fails, (*COMMIT) prevents any further matches being tried, so -the overall result is "no match". -.P -As another start-up optimization makes use of a minimum length for a matching -subject, which is recorded when possible. Consider the pattern -.sp - (*MARK:1)B(*MARK:2)(X|Y) -.sp -The minimum length for a match is two characters. If the subject is "XXBB", the -"starting character" optimization skips "XX", then tries to match "BB", which -is long enough. In the process, (*MARK:2) is encountered and remembered. When -the match attempt fails, the next "B" is found, but there is only one character -left, so there are no more attempts, and "no match" is returned with the "last -mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried -at every possible starting position, including at the end of the subject, where -(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is -returned is "1". In this case, the optimizations do not affect the overall -match result, which is still "no match", but they do affect the auxiliary -information that is returned. +Disabling the start-up optimizations may cause performance to suffer. However, +this may be desirable for patterns which contain callouts or items such as +(*COMMIT) and (*MARK). See the above description of PCRE2_START_OPTIMIZE_OFF +for further details. .sp PCRE2_NO_UTF_CHECK .sp @@ -2272,6 +2346,7 @@ following are true: PCRE2_DOTALL is in force for .* Neither (*PRUNE) nor (*SKIP) appears in the pattern PCRE2_NO_DOTSTAR_ANCHOR is not set + Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF .sp For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the options returned for PCRE2_INFO_ALLOPTIONS. 
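Note on the pcre2api.3 text above: the "cumulative, later calls take precedence" behaviour can be checked with a small standalone model. This is only a sketch, not the library code. The directive and bit values are copied from the pcre2.h and pcre2_internal.h hunks of this patch, and the shift arithmetic follows the pcre2_context.c hunk. The names `apply_directive`, `demo_sequence`, `DIR_*` and `BIT_*` are hypothetical and exist only for illustration. The model returns -1 where the real function would return PCRE2_ERROR_BADOPTION.

```c
#include <assert.h>
#include <stdint.h>

/* Directive values as listed in the pcre2.h hunk of this patch. */
#define DIR_OPTIMIZATION_NONE    0u
#define DIR_OPTIMIZATION_FULL    1u
#define DIR_AUTO_POSSESS        64u
#define DIR_AUTO_POSSESS_OFF    65u
#define DIR_DOTSTAR_ANCHOR      66u
#define DIR_DOTSTAR_ANCHOR_OFF  67u
#define DIR_START_OPTIMIZE      68u
#define DIR_START_OPTIMIZE_OFF  69u

/* Bit values as listed in the pcre2_internal.h hunk of this patch. */
#define BIT_AUTO_POSSESS    0x1u
#define BIT_DOTSTAR_ANCHOR  0x2u
#define BIT_START_OPTIMIZE  0x4u
#define BIT_ALL             0x7u

/* Hypothetical model of the directive handling: returns 0 on success,
   -1 for an unknown directive (flags are left untouched in that case). */
static int apply_directive(uint32_t *flags, uint32_t directive)
{
  uint32_t bit;

  if (directive == DIR_OPTIMIZATION_NONE) { *flags = 0; return 0; }
  if (directive == DIR_OPTIMIZATION_FULL) { *flags = BIT_ALL; return 0; }
  if (directive < DIR_AUTO_POSSESS || directive > DIR_START_OPTIMIZE_OFF)
    return -1;

  /* 64-65 select bit 0, 66-67 bit 1, 68-69 bit 2. */
  bit = 1u << ((directive >> 1) - 32);

  if (directive & 1u)
    *flags &= ~bit;   /* odd directive: switch the optimization off */
  else
    *flags |= bit;    /* even directive: switch the optimization on */
  return 0;
}

/* Hypothetical walk-through: start from the default (all enabled), disable
   everything, then re-enable only the start-up optimizations. */
static uint32_t demo_sequence(void)
{
  uint32_t flags = BIT_ALL;
  int rc;

  rc = apply_directive(&flags, DIR_OPTIMIZATION_NONE);
  assert(rc == 0);
  rc = apply_directive(&flags, DIR_START_OPTIMIZE);
  assert(rc == 0);
  rc = apply_directive(&flags, DIR_AUTO_POSSESS_OFF);  /* already off */
  assert(rc == 0);
  (void)rc;

  return flags;   /* only BIT_START_OPTIMIZE remains set */
}
```

In this model the order of calls matters: applying DIR_OPTIMIZATION_NONE after DIR_START_OPTIMIZE would clear the bit again, which is what "later calls taking precedence" means in the DESCRIPTION text.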
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3 index 84e4aff47..b0936c91a 100644 --- a/doc/pcre2pattern.3 +++ b/doc/pcre2pattern.3 @@ -2242,7 +2242,7 @@ package, and PCRE1 copied it from there. It found its way into Perl at release PCRE2 has an optimization that automatically "possessifies" certain simple pattern constructs. For example, the sequence A+B is treated as A++B because there is no point in backtracking into a sequence of A's when B must follow. -This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting +This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with (*NO_AUTO_POSSESS). .P When a pattern contains an unlimited repeat inside a group that can itself be diff --git a/doc/pcre2test.1 b/doc/pcre2test.1 index 9b7d37598..378b5dced 100644 --- a/doc/pcre2test.1 +++ b/doc/pcre2test.1 @@ -636,6 +636,24 @@ notation. Otherwise, those less than 0x100 are output in hex without the curly brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and subject strings to be translated to UTF-16 or UTF-32, respectively, before being passed to library functions. +.sp +The following modifiers enable or disable performance optimizations by +calling \fBpcre2_set_optimize()\fP before invoking the regex compiler. +.sp + optimization_full enable all optional optimizations + optimization_none disable all optional optimizations + auto_possess auto-possessify variable quantifiers + auto_possess_off don't auto-possessify variable quantifiers + dotstar_anchor anchor patterns starting with .* + dotstar_anchor_off don't anchor patterns starting with .* + start_optimize enable pre-scan of subject string + start_optimize_off disable pre-scan of subject string +.sp +See the +.\" HREF +\fBpcre2_set_optimize\fP +.\" +documentation for details on these optimizations. . .
.\" HTML diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt index 30e16c8b5..a4cb50ad0 100644 --- a/doc/pcre2test.txt +++ b/doc/pcre2test.txt @@ -618,6 +618,21 @@ PATTERN MODIFIERS causes pattern and subject strings to be translated to UTF-16 or UTF-32, respectively, before being passed to library functions. + The following modifiers enable or disable performance optimizations by + calling pcre2_set_optimize() before invoking the regex compiler. + + optimization_full enable all optional optimizations + optimization_none disable all optional optimizations + auto_possess auto-possessify variable quantifiers + auto_possess_off don't auto-possessify variable quantifiers + dotstar_anchor anchor patterns starting with .* + dotstar_anchor_off don't anchor patterns starting with .* + start_optimize enable pre-scan of subject string + start_optimize_off disable pre-scan of subject string + + See the pcre2_set_optimize documentation for details on these optimiza‐ + tions. + Setting compilation controls The following modifiers affect the compilation process or request in- diff --git a/src/pcre2.h.generic b/src/pcre2.h.generic index a3341e6f5..0896b72ca 100644 --- a/src/pcre2.h.generic +++ b/src/pcre2.h.generic @@ -464,6 +464,18 @@ released, the numbers must not be changed. */ #define PCRE2_CONFIG_COMPILED_WIDTHS 14 #define PCRE2_CONFIG_TABLES_LENGTH 15 +/* Optimization directives for pcre2_set_optimize(). +For binary compatibility, only add to this list; do not renumber. */ + +#define PCRE2_OPTIMIZATION_NONE 0 +#define PCRE2_OPTIMIZATION_FULL 1 + +#define PCRE2_AUTO_POSSESS 64 +#define PCRE2_AUTO_POSSESS_OFF 65 +#define PCRE2_DOTSTAR_ANCHOR 66 +#define PCRE2_DOTSTAR_ANCHOR_OFF 67 +#define PCRE2_START_OPTIMIZE 68 +#define PCRE2_START_OPTIMIZE_OFF 69 /* Types for code units in patterns and subject strings. 
*/ @@ -617,7 +629,9 @@ PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \ pcre2_set_parens_nest_limit(pcre2_compile_context *, uint32_t); \ PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \ pcre2_set_compile_recursion_guard(pcre2_compile_context *, \ - int (*)(uint32_t, void *), void *); + int (*)(uint32_t, void *), void *); \ +PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \ + pcre2_set_optimize(pcre2_compile_context *, uint32_t); #define PCRE2_MATCH_CONTEXT_FUNCTIONS \ PCRE2_EXP_DECL pcre2_match_context *PCRE2_CALL_CONVENTION \ @@ -912,6 +926,7 @@ pcre2_compile are called by application code. */ #define pcre2_set_newline PCRE2_SUFFIX(pcre2_set_newline_) #define pcre2_set_parens_nest_limit PCRE2_SUFFIX(pcre2_set_parens_nest_limit_) #define pcre2_set_offset_limit PCRE2_SUFFIX(pcre2_set_offset_limit_) +#define pcre2_set_optimize PCRE2_SUFFIX(pcre2_set_optimize_) #define pcre2_set_substitute_callout PCRE2_SUFFIX(pcre2_set_substitute_callout_) #define pcre2_substitute PCRE2_SUFFIX(pcre2_substitute_) #define pcre2_substring_copy_byname PCRE2_SUFFIX(pcre2_substring_copy_byname_) diff --git a/src/pcre2.h.in b/src/pcre2.h.in index a19313c9e..9595a8540 100644 --- a/src/pcre2.h.in +++ b/src/pcre2.h.in @@ -464,6 +464,18 @@ released, the numbers must not be changed. */ #define PCRE2_CONFIG_COMPILED_WIDTHS 14 #define PCRE2_CONFIG_TABLES_LENGTH 15 +/* Optimization directives for pcre2_set_optimize(). +For binary compatibility, only add to this list; do not renumber. */ + +#define PCRE2_OPTIMIZATION_NONE 0 +#define PCRE2_OPTIMIZATION_FULL 1 + +#define PCRE2_AUTO_POSSESS 64 +#define PCRE2_AUTO_POSSESS_OFF 65 +#define PCRE2_DOTSTAR_ANCHOR 66 +#define PCRE2_DOTSTAR_ANCHOR_OFF 67 +#define PCRE2_START_OPTIMIZE 68 +#define PCRE2_START_OPTIMIZE_OFF 69 /* Types for code units in patterns and subject strings. 
*/ @@ -617,7 +629,9 @@ PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \ pcre2_set_parens_nest_limit(pcre2_compile_context *, uint32_t); \ PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \ pcre2_set_compile_recursion_guard(pcre2_compile_context *, \ - int (*)(uint32_t, void *), void *); + int (*)(uint32_t, void *), void *); \ +PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \ + pcre2_set_optimize(pcre2_compile_context *, uint32_t); #define PCRE2_MATCH_CONTEXT_FUNCTIONS \ PCRE2_EXP_DECL pcre2_match_context *PCRE2_CALL_CONVENTION \ @@ -912,6 +926,7 @@ pcre2_compile are called by application code. */ #define pcre2_set_newline PCRE2_SUFFIX(pcre2_set_newline_) #define pcre2_set_parens_nest_limit PCRE2_SUFFIX(pcre2_set_parens_nest_limit_) #define pcre2_set_offset_limit PCRE2_SUFFIX(pcre2_set_offset_limit_) +#define pcre2_set_optimize PCRE2_SUFFIX(pcre2_set_optimize_) #define pcre2_set_substitute_callout PCRE2_SUFFIX(pcre2_set_substitute_callout_) #define pcre2_substitute PCRE2_SUFFIX(pcre2_substitute_) #define pcre2_substring_copy_byname PCRE2_SUFFIX(pcre2_substring_copy_byname_) diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c index 48dae18fa..946198c0f 100644 --- a/src/pcre2_compile.c +++ b/src/pcre2_compile.c @@ -834,7 +834,8 @@ enum { PSO_OPT, /* Value is an option bit */ PSO_BSR, /* Value is a \R type */ PSO_LIMH, /* Read integer value for heap limit */ PSO_LIMM, /* Read integer value for match limit */ - PSO_LIMD /* Read integer value for depth limit */ + PSO_LIMD, /* Read integer value for depth limit */ + PSO_OPTMZ /* Value is an optimization bit */ }; typedef struct pso { @@ -852,10 +853,10 @@ static const pso pso_list[] = { { STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP }, { STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET }, { STRING_NOTEMPTY_ATSTART_RIGHTPAR, 17, PSO_FLG, PCRE2_NE_ATST_SET }, - { STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS }, - { STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPT, PCRE2_NO_DOTSTAR_ANCHOR }, + { 
STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPTMZ, PCRE2_OPTIM_AUTO_POSSESS }, + { STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPTMZ, PCRE2_OPTIM_DOTSTAR_ANCHOR }, { STRING_NO_JIT_RIGHTPAR, 7, PSO_FLG, PCRE2_NOJIT }, - { STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE }, + { STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPTMZ, PCRE2_OPTIM_START_OPTIMIZE }, { STRING_LIMIT_HEAP_EQ, 11, PSO_LIMH, 0 }, { STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 }, { STRING_LIMIT_DEPTH_EQ, 12, PSO_LIMD, 0 }, @@ -8883,13 +8884,14 @@ this prevents the number of characters it matches from being adjusted. cb points to the compile data block atomcount atomic group level inassert TRUE if in an assertion + dotstar_anchor TRUE if automatic anchoring optimization is enabled Returns: TRUE or FALSE */ static BOOL is_anchored(PCRE2_SPTR code, uint32_t bracket_map, compile_block *cb, - int atomcount, BOOL inassert) + int atomcount, BOOL inassert, BOOL dotstar_anchor) { do { PCRE2_SPTR scode = first_significant_code( @@ -8901,7 +8903,7 @@ do { if (op == OP_BRA || op == OP_BRAPOS || op == OP_SBRA || op == OP_SBRAPOS) { - if (!is_anchored(scode, bracket_map, cb, atomcount, inassert)) + if (!is_anchored(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor)) return FALSE; } @@ -8912,14 +8914,14 @@ do { { int n = GET2(scode, 1+LINK_SIZE); uint32_t new_map = bracket_map | ((n < 32)? (1u << n) : 1); - if (!is_anchored(scode, new_map, cb, atomcount, inassert)) return FALSE; + if (!is_anchored(scode, new_map, cb, atomcount, inassert, dotstar_anchor)) return FALSE; } /* Positive forward assertion */ else if (op == OP_ASSERT || op == OP_ASSERT_NA) { - if (!is_anchored(scode, bracket_map, cb, atomcount, TRUE)) return FALSE; + if (!is_anchored(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor)) return FALSE; } /* Condition. If there is no second branch, it can't be anchored. 
*/ @@ -8927,7 +8929,7 @@ do { else if (op == OP_COND || op == OP_SCOND) { if (scode[GET(scode,1)] != OP_ALT) return FALSE; - if (!is_anchored(scode, bracket_map, cb, atomcount, inassert)) + if (!is_anchored(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor)) return FALSE; } @@ -8935,7 +8937,7 @@ do { else if (op == OP_ONCE) { - if (!is_anchored(scode, bracket_map, cb, atomcount + 1, inassert)) + if (!is_anchored(scode, bracket_map, cb, atomcount + 1, inassert, dotstar_anchor)) return FALSE; } @@ -8950,8 +8952,7 @@ do { op == OP_TYPEPOSSTAR)) { if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 || - atomcount > 0 || cb->had_pruneorskip || inassert || - (cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0) + atomcount > 0 || cb->had_pruneorskip || inassert || !dotstar_anchor) return FALSE; } @@ -8988,13 +8989,14 @@ or *SKIP does not count, because once again the assumption no longer holds. cb points to the compile data atomcount atomic group level inassert TRUE if in an assertion + dotstar_anchor TRUE if automatic anchoring optimization is enabled Returns: TRUE or FALSE */ static BOOL is_startline(PCRE2_SPTR code, unsigned int bracket_map, compile_block *cb, - int atomcount, BOOL inassert) + int atomcount, BOOL inassert, BOOL dotstar_anchor) { do { PCRE2_SPTR scode = first_significant_code( @@ -9025,7 +9027,8 @@ do { return FALSE; default: /* Assertion */ - if (!is_startline(scode, bracket_map, cb, atomcount, TRUE)) return FALSE; + if (!is_startline(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor)) + return FALSE; do scode += GET(scode, 1); while (*scode == OP_ALT); scode += 1 + LINK_SIZE; break; @@ -9039,7 +9042,7 @@ do { if (op == OP_BRA || op == OP_BRAPOS || op == OP_SBRA || op == OP_SBRAPOS) { - if (!is_startline(scode, bracket_map, cb, atomcount, inassert)) + if (!is_startline(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor)) return FALSE; } @@ -9050,14 +9053,15 @@ do { { int n = GET2(scode, 1+LINK_SIZE); unsigned 
int new_map = bracket_map | ((n < 32)? (1u << n) : 1); - if (!is_startline(scode, new_map, cb, atomcount, inassert)) return FALSE; + if (!is_startline(scode, new_map, cb, atomcount, inassert, dotstar_anchor)) + return FALSE; } /* Positive forward assertions */ else if (op == OP_ASSERT || op == OP_ASSERT_NA) { - if (!is_startline(scode, bracket_map, cb, atomcount, TRUE)) + if (!is_startline(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor)) return FALSE; } @@ -9065,7 +9069,7 @@ do { else if (op == OP_ONCE) { - if (!is_startline(scode, bracket_map, cb, atomcount + 1, inassert)) + if (!is_startline(scode, bracket_map, cb, atomcount + 1, inassert, dotstar_anchor)) return FALSE; } @@ -9079,8 +9083,7 @@ do { else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR) { if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 || - atomcount > 0 || cb->had_pruneorskip || inassert || - (cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0) + atomcount > 0 || cb->had_pruneorskip || inassert || !dotstar_anchor) return FALSE; } @@ -10362,6 +10365,10 @@ int regexrc; /* Return from compile */ uint32_t i; /* Local loop counter */ +/* Enable all optimizations by default. */ +uint32_t optim_flags = ccontext != NULL ? ccontext->optimization_flags : + PCRE2_OPTIMIZATION_ALL; + /* Comments at the head of this file explain about these variables. */ uint32_t stack_groupinfo[GROUPINFO_DEFAULT_SIZE]; @@ -10432,6 +10439,18 @@ if (patlen > ccontext->max_pattern_length) return NULL; } +/* Optimization flags in 'options' can override those in the compile context. +This is because some options to disable optimizations were added before the +optimization flags word existed, and we need to continue supporting them +for backwards compatibility. 
*/ + +if ((options & PCRE2_NO_AUTO_POSSESS) != 0) + optim_flags &= ~PCRE2_OPTIM_AUTO_POSSESS; +if ((options & PCRE2_NO_DOTSTAR_ANCHOR) != 0) + optim_flags &= ~PCRE2_OPTIM_DOTSTAR_ANCHOR; +if ((options & PCRE2_NO_START_OPTIMIZE) != 0) + optim_flags &= ~PCRE2_OPTIM_START_OPTIMIZE; + /* From here on, all returns from this function should end up going via the EXIT label. */ @@ -10568,6 +10587,32 @@ if ((options & PCRE2_LITERAL) == 0) else limit_depth = c; skipatstart = ++pp; break; + + case PSO_OPTMZ: + optim_flags &= ~(p->value); + + /* For backward compatibility the three original VERBs to disable + optimizations need to also update the corresponding external option. */ + + switch(p->value) + { + case PCRE2_OPTIM_AUTO_POSSESS: + cb.external_options |= PCRE2_NO_AUTO_POSSESS; + break; + + case PCRE2_OPTIM_DOTSTAR_ANCHOR: + cb.external_options |= PCRE2_NO_DOTSTAR_ANCHOR; + break; + + case PCRE2_OPTIM_START_OPTIMIZE: + cb.external_options |= PCRE2_NO_START_OPTIMIZE; + break; + } + + break; + + default: + PCRE2_UNREACHABLE(); } break; /* Out of the table scan loop */ } @@ -10863,6 +10908,7 @@ re->top_bracket = 0; re->top_backref = 0; re->name_entry_size = cb.name_entry_size; re->name_count = cb.names_found; +re->optimization_flags = optim_flags; /* The basic block is immediately followed by the name table, and the compiled code follows after that. */ @@ -11005,7 +11051,7 @@ used in this code because at least one compiler gives a warning about loss of "const" attribute if the cast (PCRE2_UCHAR *)codestart is used directly in the function call. */ -if (errorcode == 0 && (re->overall_options & PCRE2_NO_AUTO_POSSESS) == 0) +if (errorcode == 0 && (optim_flags & PCRE2_OPTIM_AUTO_POSSESS) != 0) { PCRE2_UCHAR *temp = (PCRE2_UCHAR *)codestart; if (PRIV(auto_possessify)(temp, &cb) != 0) errorcode = ERR80; @@ -11022,17 +11068,17 @@ there are no occurrences of *PRUNE or *SKIP (though there is an option to disable this case). 
*/ if ((re->overall_options & PCRE2_ANCHORED) == 0 && - is_anchored(codestart, 0, &cb, 0, FALSE)) + is_anchored(codestart, 0, &cb, 0, FALSE, (optim_flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0)) re->overall_options |= PCRE2_ANCHORED; /* Set up the first code unit or startline flag, the required code unit, and -then study the pattern. This code need not be obeyed if PCRE2_NO_START_OPTIMIZE -is set, as the data it would create will not be used. Note that a first code +then study the pattern. This code need not be obeyed if PCRE2_OPTIM_START_OPTIMIZE +is disabled, as the data it would create will not be used. Note that a first code unit (but not the startline flag) is useful for anchored patterns because it can still give a quick "no match" and also avoid searching for a last code unit. */ -if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0) +if ((optim_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0) { int minminlength = 0; /* For minimal minlength from first/required CU */ @@ -11096,7 +11142,7 @@ if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0) that disables this case.) */ else if ((re->overall_options & PCRE2_ANCHORED) == 0 && - is_startline(codestart, 0, &cb, 0, FALSE)) + is_startline(codestart, 0, &cb, 0, FALSE, (optim_flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0)) re->flags |= PCRE2_STARTLINE; /* Handle the "required code unit", if one is set. 
In the UTF case we can diff --git a/src/pcre2_context.c b/src/pcre2_context.c index 84a967d7a..6b80f9e9f 100644 --- a/src/pcre2_context.c +++ b/src/pcre2_context.c @@ -141,7 +141,8 @@ pcre2_compile_context PRIV(default_compile_context) = { NEWLINE_DEFAULT, /* Newline convention */ PARENS_NEST_LIMIT, /* As it says */ 0, /* Extra options */ - MAX_VARLOOKBEHIND /* As it says */ + MAX_VARLOOKBEHIND, /* As it says */ + PCRE2_OPTIMIZATION_ALL /* All optimizations enabled */ }; /* The create function copies the default into the new memory, but must @@ -409,6 +410,42 @@ ccontext->stack_guard_data = user_data; return 0; } +PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION +pcre2_set_optimize(pcre2_compile_context *ccontext, uint32_t directive) +{ +if (ccontext == NULL) + return PCRE2_ERROR_NULL; + +switch (directive) + { + case PCRE2_OPTIMIZATION_NONE: + ccontext->optimization_flags = 0; + break; + + case PCRE2_OPTIMIZATION_FULL: + ccontext->optimization_flags = PCRE2_OPTIMIZATION_ALL; + break; + + case PCRE2_AUTO_POSSESS: + case PCRE2_AUTO_POSSESS_OFF: + case PCRE2_DOTSTAR_ANCHOR: + case PCRE2_DOTSTAR_ANCHOR_OFF: + case PCRE2_START_OPTIMIZE: + case PCRE2_START_OPTIMIZE_OFF: + /* Even directive numbers switch a bit on, odd numbers switch a bit off. + * 64-65 affect the LSB, 66-67 the 2 bit, 68-69 the 4 bit, and so on. */ + if (directive & 0x1) + ccontext->optimization_flags &= ~(1u << ((directive >> 1) - 32)); + else + ccontext->optimization_flags |= 1u << ((directive >> 1) - 32); + break; + + default: + return PCRE2_ERROR_BADOPTION; + } + +return 0; +} /* ------------ Match context ------------ */ diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c index 3e34c7ca5..d1d33ad5b 100644 --- a/src/pcre2_dfa_match.c +++ b/src/pcre2_dfa_match.c @@ -3432,7 +3432,7 @@ if ((re->flags & PCRE2_MODE_MASK) != PCRE2_CODE_UNIT_WIDTH/8) /* PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART are match-time flags in the options variable for this function. 
Users of PCRE2 who are not calling the function directly would like to have a way of setting these flags, in the same -way that they can set pcre2_compile() flags like PCRE2_NO_AUTOPOSSESS with -constructions like (*NO_AUTOPOSSESS). To enable this, (*NOTEMPTY) and +way that they can set pcre2_compile() flags like PCRE2_NO_AUTO_POSSESS with +constructions like (*NO_AUTO_POSSESS). To enable this, (*NOTEMPTY) and (*NOTEMPTY_ATSTART) set bits in the pattern's "flag" function which can now be transferred to the options for this function. The bits are guaranteed to be @@ -3699,7 +3699,7 @@ for (;;) these, for testing and for ensuring that all callouts do actually occur. The optimizations must also be avoided when restarting a DFA match. */ - if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0 && + if ((re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0 && (options & PCRE2_DFA_RESTART) == 0) { /* If firstline is TRUE, the start of the match is constrained to the first diff --git a/src/pcre2_internal.h b/src/pcre2_internal.h index 043d2c563..1b9bdc6a1 100644 --- a/src/pcre2_internal.h +++ b/src/pcre2_internal.h @@ -609,6 +609,13 @@ total length of the tables. */ #define ctypes_offset (cbits_offset + cbit_length) /* Character types */ #define TABLES_LENGTH (ctypes_offset + 256) +/* Private flags used in compile_context.optimization_flags */ + +#define PCRE2_OPTIM_AUTO_POSSESS 0x00000001u +#define PCRE2_OPTIM_DOTSTAR_ANCHOR 0x00000002u +#define PCRE2_OPTIM_START_OPTIMIZE 0x00000004u + +#define PCRE2_OPTIMIZATION_ALL 0x00000007u /* -------------------- Character and string names ------------------------ */ diff --git a/src/pcre2_intmodedep.h b/src/pcre2_intmodedep.h index a798cdd4f..6c14be8dc 100644 --- a/src/pcre2_intmodedep.h +++ b/src/pcre2_intmodedep.h @@ -579,6 +579,7 @@ typedef struct pcre2_real_compile_context { uint32_t parens_nest_limit; uint32_t extra_options; uint32_t max_varlookbehind; + uint32_t optimization_flags; } pcre2_real_compile_context; /* The real match context structure.
*/ @@ -646,6 +647,7 @@ typedef struct pcre2_real_code { uint16_t top_backref; /* Highest numbered back reference */ uint16_t name_entry_size; /* Size (code units) of table entries */ uint16_t name_count; /* Number of name entries in the table */ + uint32_t optimization_flags; /* Optimizations enabled at compile time */ } pcre2_real_code; /* The real match data structure. Define ovector as large as it can ever diff --git a/src/pcre2_jit_compile.c b/src/pcre2_jit_compile.c index 5de4666d1..328edbcd4 100644 --- a/src/pcre2_jit_compile.c +++ b/src/pcre2_jit_compile.c @@ -14474,7 +14474,9 @@ if (!check_opcode_types(common, common->start, ccend)) } /* Checking flags and updating ovector_start. */ -if (mode == PCRE2_JIT_COMPLETE && (re->flags & PCRE2_LASTSET) != 0 && (re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0) +if (mode == PCRE2_JIT_COMPLETE && + (re->flags & PCRE2_LASTSET) != 0 && + (re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0) { common->req_char_ptr = common->ovector_start; common->ovector_start += sizeof(sljit_sw); @@ -14534,7 +14536,9 @@ memset(common->private_data_ptrs, 0, total_length * sizeof(sljit_s32)); private_data_size = common->cbra_ptr + (re->top_bracket + 1) * sizeof(sljit_sw); -if ((re->overall_options & PCRE2_ANCHORED) == 0 && (re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0 && !common->has_skip_in_assert_back) +if ((re->overall_options & PCRE2_ANCHORED) == 0 && + (re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0 && + !common->has_skip_in_assert_back) detect_early_fail(common, common->start, &private_data_size, 0, 0); set_private_data_ptrs(common, &private_data_size, ccend); @@ -14600,7 +14604,7 @@ if ((re->overall_options & PCRE2_ANCHORED) == 0) mainloop_label = mainloop_entry(common); continue_match_label = LABEL(); /* Forward search if possible. 
*/ - if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0) + if ((re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0) { if (mode == PCRE2_JIT_COMPLETE && fast_forward_first_n_chars(common)) ; @@ -14615,7 +14619,8 @@ if ((re->overall_options & PCRE2_ANCHORED) == 0) else continue_match_label = LABEL(); -if (mode == PCRE2_JIT_COMPLETE && re->minlength > 0 && (re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0) +if (mode == PCRE2_JIT_COMPLETE && re->minlength > 0 && + (re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0) { OP1(SLJIT_MOV, SLJIT_RETURN_REG, 0, SLJIT_IMM, PCRE2_ERROR_NOMATCH); OP2(SLJIT_ADD, TMP2, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(re->minlength)); diff --git a/src/pcre2_match.c b/src/pcre2_match.c index f55410394..cb139658e 100644 --- a/src/pcre2_match.c +++ b/src/pcre2_match.c @@ -6788,7 +6788,7 @@ if ((re->flags & PCRE2_MODE_MASK) != PCRE2_CODE_UNIT_WIDTH/8) /* PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART are match-time flags in the options variable for this function. Users of PCRE2 who are not calling the function directly would like to have a way of setting these flags, in the same -way that they can set pcre2_compile() flags like PCRE2_NO_AUTOPOSSESS with -constructions like (*NO_AUTOPOSSESS). To enable this, (*NOTEMPTY) and +way that they can set pcre2_compile() flags like PCRE2_NO_AUTO_POSSESS with +constructions like (*NO_AUTO_POSSESS). To enable this, (*NOTEMPTY) and (*NOTEMPTY_ATSTART) set bits in the pattern's "flag" function which we now transfer to the options for this function. The bits are guaranteed to be @@ -7326,7 +7326,7 @@ for(;;) However, there is an option (settable at compile time) that disables these, for testing and for ensuring that all callouts do actually occur. */ - if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0) + if ((re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0) { /* If firstline is TRUE, the start of the match is constrained to the first line of a multiline string.
That is, the match must be before or at the diff --git a/src/pcre2test.c b/src/pcre2test.c index d8f5d6483..7e92ff19e 100644 --- a/src/pcre2test.c +++ b/src/pcre2test.c @@ -468,6 +468,7 @@ enum { MOD_CTC, /* Applies to a compile context */ MOD_NL, /* Is a newline value */ MOD_NN, /* Is a number or a name; more than one may occur */ MOD_OPT, /* Is an option bit */ + MOD_OPTMZ, /* Is an optimization directive */ MOD_SIZ, /* Is a PCRE2_SIZE value */ MOD_STR }; /* Is a string */ @@ -661,6 +662,8 @@ static modstruct modlist[] = { { "ascii_digit", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_DIGIT, CO(extra_options) }, { "ascii_posix", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_POSIX, CO(extra_options) }, { "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) }, + { "auto_possess", MOD_CTC, MOD_OPTMZ, PCRE2_AUTO_POSSESS, 0 }, + { "auto_possess_off", MOD_CTC, MOD_OPTMZ, PCRE2_AUTO_POSSESS_OFF, 0 }, { "bad_escape_is_literal", MOD_CTC, MOD_OPT, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL, CO(extra_options) }, { "bincode", MOD_PAT, MOD_CTL, CTL_BINCODE, PO(control) }, { "bsr", MOD_CTC, MOD_BSR, 0, CO(bsr_convention) }, @@ -688,6 +691,8 @@ static modstruct modlist[] = { { "disable_recurseloop_check", MOD_DAT, MOD_OPT, PCRE2_DISABLE_RECURSELOOP_CHECK, DO(options) }, { "dollar_endonly", MOD_PAT, MOD_OPT, PCRE2_DOLLAR_ENDONLY, PO(options) }, { "dotall", MOD_PATP, MOD_OPT, PCRE2_DOTALL, PO(options) }, + { "dotstar_anchor", MOD_CTC, MOD_OPTMZ, PCRE2_DOTSTAR_ANCHOR, 0 }, + { "dotstar_anchor_off", MOD_CTC, MOD_OPTMZ, PCRE2_DOTSTAR_ANCHOR_OFF, 0 }, { "dupnames", MOD_PATP, MOD_OPT, PCRE2_DUPNAMES, PO(options) }, { "endanchored", MOD_PD, MOD_OPT, PCRE2_ENDANCHORED, PD(options) }, { "escaped_cr_is_lf", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ESCAPED_CR_IS_LF, CO(extra_options) }, @@ -744,6 +749,8 @@ static modstruct modlist[] = { { "null_subject", MOD_DAT, MOD_CTL, CTL2_NULL_SUBJECT, DO(control2) }, { "offset", MOD_DAT, MOD_INT, 0, DO(offset) }, { "offset_limit", MOD_CTM, MOD_SIZ, 0, MO(offset_limit)}, + { 
"optimization_full", MOD_CTC, MOD_OPTMZ, PCRE2_OPTIMIZATION_FULL, 0 }, + { "optimization_none", MOD_CTC, MOD_OPTMZ, PCRE2_OPTIMIZATION_NONE, 0 }, { "ovector", MOD_DAT, MOD_INT, 0, DO(oveccount) }, { "parens_nest_limit", MOD_CTC, MOD_INT, 0, CO(parens_nest_limit) }, { "partial_hard", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) }, @@ -760,6 +767,8 @@ static modstruct modlist[] = { { "regerror_buffsize", MOD_PAT, MOD_INT, 0, PO(regerror_buffsize) }, { "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) }, { "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) }, + { "start_optimize", MOD_CTC, MOD_OPTMZ, PCRE2_START_OPTIMIZE, 0 }, + { "start_optimize_off", MOD_CTC, MOD_OPTMZ, PCRE2_START_OPTIMIZE_OFF, 0 }, { "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) }, { "startoffset", MOD_DAT, MOD_INT, 0, DO(offset) }, { "subject_literal", MOD_PATP, MOD_CTL, CTL2_SUBJECT_LITERAL, PO(control2) }, @@ -3884,7 +3893,7 @@ for (;;) when needed. */ m = modlist + index; /* Save typing */ - if (m->type != MOD_CTL && m->type != MOD_OPT && + if (m->type != MOD_CTL && m->type != MOD_OPT && m->type != MOD_OPTMZ && (m->type != MOD_IND || *pp == '=')) { if (*pp++ != '=') @@ -3925,6 +3934,21 @@ for (;;) else *((uint32_t *)field) |= m->value; break; + case MOD_OPTMZ: +#ifdef SUPPORT_PCRE2_8 + if (test_mode == PCRE8_MODE) + pcre2_set_optimize_8((pcre2_compile_context_8*)field, m->value); +#endif +#ifdef SUPPORT_PCRE2_16 + if (test_mode == PCRE16_MODE) + pcre2_set_optimize_16((pcre2_compile_context_16*)field, m->value); +#endif +#ifdef SUPPORT_PCRE2_32 + if (test_mode == PCRE32_MODE) + pcre2_set_optimize_32((pcre2_compile_context_32*)field, m->value); +#endif + break; + case MOD_BSR: if (len == 7 && strncmpic(pp, (const uint8_t *)"default", 7) == 0) { @@ -4361,6 +4385,33 @@ else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s", } +/************************************************* +* Show optimization flags * +*************************************************/ + +/* 
+Arguments: + flags an options word + before text to print before + after text to print after + +Returns: nothing +*/ + +static void +show_optimize_flags(uint32_t flags, const char *before, const char *after) +{ +if (flags == 0) fprintf(outfile, "%s%s", before, after); +else fprintf(outfile, "%s%s%s%s%s%s%s", + before, + ((flags & PCRE2_OPTIM_AUTO_POSSESS) != 0) ? "auto_possess" : "", + ((flags & PCRE2_OPTIM_AUTO_POSSESS) != 0 && (flags >> 1) != 0) ? "," : "", + ((flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0) ? "dotstar_anchor" : "", + ((flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0 && (flags >> 2) != 0) ? "," : "", + ((flags & PCRE2_OPTIM_START_OPTIMIZE) != 0) ? "start_optimize" : "", + after); +} + #ifdef SUPPORT_PCRE2_8 /************************************************* @@ -4777,6 +4828,9 @@ if ((pat_patctl.control & CTL_INFO) != 0) if (extra_options != 0) show_compile_extra_options(extra_options, "Extra options:", "\n"); + if (FLD(compiled_code, optimization_flags) != PCRE2_OPTIMIZATION_ALL) + show_optimize_flags(FLD(compiled_code, optimization_flags), "Optimizations: ", "\n"); + if (jchanged) fprintf(outfile, "Duplicate name status changes\n"); if ((pat_patctl.control2 & CTL2_BSR_SET) != 0 || @@ -4879,7 +4933,7 @@ if ((pat_patctl.control & CTL_INFO) != 0) } } - if ((FLD(compiled_code, overall_options) & PCRE2_NO_START_OPTIMIZE) == 0) + if ((FLD(compiled_code, optimization_flags) & PCRE2_OPTIM_START_OPTIMIZE) != 0) fprintf(outfile, "Subject length lower bound = %d\n", minlength); if (pat_patctl.jit != 0 && (pat_patctl.control & CTL_JITVERIFY) != 0) diff --git a/testdata/testinput2 b/testdata/testinput2 index 51e2095c8..5b82f7451 100644 --- a/testdata/testinput2 +++ b/testdata/testinput2 @@ -831,6 +831,16 @@ /x++/IB +# For comparison with the following test, which disables auto-possessification +# In this regex, x+ should be converted to x++ +/x+y/B,auto_possess + +# In this regex, x+ should not be converted to x++ +/x+y/B,auto_possess_off + +# Also in this regex, x+ 
should not be converted to x++ +/x+y/B,optimization_none + /x{1,3}+/B,no_auto_possess /x{1,3}+/Bi,no_auto_possess @@ -839,6 +849,8 @@ /[^x]{1,3}+/Bi,no_auto_possess +/x{1,3}+/IB,auto_possess_off + /(x)*+/IB /^(\w++|\s++)*$/I @@ -4056,10 +4068,16 @@ /(?(VERSION=10.101)yes|no)/ +# We should see the starting code unit, required code unit, and minimum length set for this regex: /abcd/I +# None of the following three should have the starting code unit, required code unit, and minimum length set: /abcd/I,no_start_optimize +/abcd/I,start_optimize_off + +/abcd/I,optimization_none + /(|ab)*?d/I abd xyd @@ -4224,6 +4242,19 @@ /^abc/info,no_dotstar_anchor +/^abc/info,dotstar_anchor_off + +# For comparison with the following tests, which disable automatic dotstar anchoring +/.*abc/BI + +/.*abc/BI,dotstar_anchor_off + +/.*abc/BI,start_optimize_off + +/.*abc/BI,optimization_none + +/.*abc/BI,no_dotstar_anchor + /.*\d/info,auto_callout \= Expect no match aaa @@ -6390,6 +6421,27 @@ a)"xI ab ac +# Tests for pcre2_set_optimize() + +/abc/I,optimization_none + +/abc/I,optimization_none,auto_possess + +/abc/I,optimization_none,dotstar_anchor,auto_possess + +/abc/I,optimization_none,start_optimize + +/abc/I,dotstar_anchor_off,optimization_full + +# If pcre2_set_optimize() is used to turn on some optimization, but at the same time, +# the compile options word turns it off... 
the compile options word "wins": + +/abc/I,no_auto_possess,auto_possess + +/abc/I,no_dotstar_anchor,dotstar_anchor + +/abc/I,no_start_optimize,start_optimize + # -------------- # End of testinput2 diff --git a/testdata/testoutput15 b/testdata/testoutput15 index f36faeeaf..892473bc9 100644 --- a/testdata/testoutput15 +++ b/testdata/testoutput15 @@ -477,6 +477,7 @@ Failed: error -52: nested recursion at the same subject position ------------------------------------------------------------------ Capture group count = 0 Options: no_auto_possess +Optimizations: dotstar_anchor,start_optimize Starting code units: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z Subject length lower bound = 1 @@ -501,6 +502,7 @@ No match Capture group count = 0 Compile options: Overall options: no_auto_possess +Optimizations: dotstar_anchor,start_optimize Starting code units: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z Subject length lower bound = 1 diff --git a/testdata/testoutput2 b/testdata/testoutput2 index eeb635d6d..f1f6a4f50 100644 --- a/testdata/testoutput2 +++ b/testdata/testoutput2 @@ -2942,6 +2942,37 @@ Capture group count = 0 First code unit = 'x' Subject length lower bound = 1 +# For comparison with the following test, which disables auto-possessification +# In this regex, x+ should be converted to x++ +/x+y/B,auto_possess +------------------------------------------------------------------ + Bra + x++ + y + Ket + End +------------------------------------------------------------------ + +# In this regex, x+ should not be converted to x++ +/x+y/B,auto_possess_off +------------------------------------------------------------------ + Bra + x+ + y + Ket + End +------------------------------------------------------------------ + +# Also in this regex, x+ should not be converted to x++ +/x+y/B,optimization_none 
+------------------------------------------------------------------ + Bra + x+ + y + Ket + End +------------------------------------------------------------------ + /x{1,3}+/B,no_auto_possess ------------------------------------------------------------------ Bra @@ -2978,6 +3009,19 @@ Subject length lower bound = 1 End ------------------------------------------------------------------ +/x{1,3}+/IB,auto_possess_off +------------------------------------------------------------------ + Bra + x + x{0,2}+ + Ket + End +------------------------------------------------------------------ +Capture group count = 0 +Optimizations: dotstar_anchor,start_optimize +First code unit = 'x' +Subject length lower bound = 1 + /(x)*+/IB ------------------------------------------------------------------ Bra @@ -13592,15 +13636,26 @@ Failed: error 179 at offset 16: syntax error or number too big in (?(VERSION con /(?(VERSION=10.101)yes|no)/ Failed: error 179 at offset 16: syntax error or number too big in (?(VERSION condition +# We should see the starting code unit, required code unit, and minimum length set for this regex: /abcd/I Capture group count = 0 First code unit = 'a' Last code unit = 'd' Subject length lower bound = 4 +# None of the following three should have the starting code unit, required code unit, and minimum length set: /abcd/I,no_start_optimize Capture group count = 0 Options: no_start_optimize +Optimizations: auto_possess,dotstar_anchor + +/abcd/I,start_optimize_off +Capture group count = 0 +Optimizations: auto_possess,dotstar_anchor + +/abcd/I,optimization_none +Capture group count = 0 +Optimizations: /(|ab)*?d/I Capture group count = 1 @@ -13616,6 +13671,7 @@ Subject length lower bound = 1 /(|ab)*?d/I,no_start_optimize Capture group count = 1 Options: no_start_optimize +Optimizations: auto_possess,dotstar_anchor abd 0: abd 1: ab @@ -13887,9 +13943,81 @@ Subject length lower bound = 3 Capture group count = 0 Compile options: no_dotstar_anchor Overall options: anchored 
no_dotstar_anchor +Optimizations: auto_possess,start_optimize +First code unit = 'a' +Subject length lower bound = 3 + +/^abc/info,dotstar_anchor_off +Capture group count = 0 +Compile options: +Overall options: anchored +Optimizations: auto_possess,start_optimize First code unit = 'a' Subject length lower bound = 3 +# For comparison with the following tests, which disable automatic dotstar anchoring +/.*abc/BI +------------------------------------------------------------------ + Bra + Any* + abc + Ket + End +------------------------------------------------------------------ +Capture group count = 0 +First code unit at start or follows newline +Last code unit = 'c' +Subject length lower bound = 3 + +/.*abc/BI,dotstar_anchor_off +------------------------------------------------------------------ + Bra + Any* + abc + Ket + End +------------------------------------------------------------------ +Capture group count = 0 +Optimizations: auto_possess,start_optimize +Last code unit = 'c' +Subject length lower bound = 3 + +/.*abc/BI,start_optimize_off +------------------------------------------------------------------ + Bra + Any* + abc + Ket + End +------------------------------------------------------------------ +Capture group count = 0 +Optimizations: auto_possess,dotstar_anchor + +/.*abc/BI,optimization_none +------------------------------------------------------------------ + Bra + Any* + abc + Ket + End +------------------------------------------------------------------ +Capture group count = 0 +Optimizations: + +/.*abc/BI,no_dotstar_anchor +------------------------------------------------------------------ + Bra + Any* + abc + Ket + End +------------------------------------------------------------------ +Capture group count = 0 +Options: no_dotstar_anchor +Optimizations: auto_possess,start_optimize +Last code unit = 'c' +Subject length lower bound = 3 + /.*\d/info,auto_callout Capture group count = 0 Options: auto_callout @@ -13908,6 +14036,7 @@ No match 
/.*\d/info,no_dotstar_anchor,auto_callout Capture group count = 0 Options: auto_callout no_dotstar_anchor +Optimizations: auto_possess,start_optimize Subject length lower bound = 1 \= Expect no match aaa @@ -13935,12 +14064,14 @@ Subject length lower bound = 1 /.*\d/dotall,no_dotstar_anchor,info Capture group count = 0 Options: dotall no_dotstar_anchor +Optimizations: auto_possess,start_optimize Subject length lower bound = 1 /(*NO_DOTSTAR_ANCHOR)(?s).*\d/info Capture group count = 0 Compile options: Overall options: no_dotstar_anchor +Optimizations: auto_possess,start_optimize Subject length lower bound = 1 '^(?:(a)|b)(?(1)A|B)' @@ -18049,12 +18180,14 @@ Subject length lower bound = 1 /a?(?=b(*COMMIT)c|)d/I,no_start_optimize Capture group count = 0 Options: no_start_optimize +Optimizations: auto_possess,dotstar_anchor bd No match /(?=b(*COMMIT)c|)d/I,no_start_optimize Capture group count = 0 Options: no_start_optimize +Optimizations: auto_possess,dotstar_anchor bd No match @@ -19060,6 +19193,57 @@ No match ac No match +# Tests for pcre2_set_optimize() + +/abc/I,optimization_none +Capture group count = 0 +Optimizations: + +/abc/I,optimization_none,auto_possess +Capture group count = 0 +Optimizations: auto_possess + +/abc/I,optimization_none,dotstar_anchor,auto_possess +Capture group count = 0 +Optimizations: auto_possess,dotstar_anchor + +/abc/I,optimization_none,start_optimize +Capture group count = 0 +Optimizations: start_optimize +First code unit = 'a' +Last code unit = 'c' +Subject length lower bound = 3 + +/abc/I,dotstar_anchor_off,optimization_full +Capture group count = 0 +First code unit = 'a' +Last code unit = 'c' +Subject length lower bound = 3 + +# If pcre2_set_optimize() is used to turn on some optimization, but at the same time, +# the compile options word turns it off... 
the compile options word "wins": + +/abc/I,no_auto_possess,auto_possess +Capture group count = 0 +Options: no_auto_possess +Optimizations: dotstar_anchor,start_optimize +First code unit = 'a' +Last code unit = 'c' +Subject length lower bound = 3 + +/abc/I,no_dotstar_anchor,dotstar_anchor +Capture group count = 0 +Options: no_dotstar_anchor +Optimizations: auto_possess,start_optimize +First code unit = 'a' +Last code unit = 'c' +Subject length lower bound = 3 + +/abc/I,no_start_optimize,start_optimize +Capture group count = 0 +Options: no_start_optimize +Optimizations: auto_possess,dotstar_anchor + # -------------- # End of testinput2 diff --git a/testdata/testoutput5 b/testdata/testoutput5 index 1b658f99e..befccd419 100644 --- a/testdata/testoutput5 +++ b/testdata/testoutput5 @@ -474,6 +474,7 @@ Subject length lower bound = 0 Capture group count = 0 Compile options: no_start_optimize utf Overall options: anchored no_start_optimize utf +Optimizations: auto_possess,dotstar_anchor /()()()()()()()()()() ()()()()()()()()()() diff --git a/testdata/testoutput6 b/testdata/testoutput6 index 283b00da0..63ec1ee29 100644 --- a/testdata/testoutput6 +++ b/testdata/testoutput6 @@ -6860,6 +6860,7 @@ No match /(abc|def|xyz)/I,no_start_optimize Capture group count = 1 Options: no_start_optimize +Optimizations: auto_possess,dotstar_anchor terhjk;abcdaadsfe 0: abc the quick xyz brown fox