diff --git a/doc/html/pcre2_set_optimize.html b/doc/html/pcre2_set_optimize.html
new file mode 100644
index 000000000..764b7f77c
--- /dev/null
+++ b/doc/html/pcre2_set_optimize.html
@@ -0,0 +1,47 @@
+
+
+pcre2_set_optimize specification
+
+
+pcre2_set_optimize man page
+
+Return to the PCRE2 index page.
+
+
+This page is part of the PCRE2 HTML documentation. It was generated
+automatically from the original man page. If there is any nonsense in it,
+please consult the man page, in case the conversion went wrong.
+
+
+SYNOPSIS
+
+
+#include <pcre2.h>
+
+
+int pcre2_set_optimize(pcre2_compile_context *ccontext,
+ uint32_t directive);
+
+
+DESCRIPTION
+
+
+This function controls which performance optimizations will be applied
+by pcre2_compile. It can be called multiple times with the same compile
+context; the effects are cumulative, with the effects of later calls taking
+precedence over earlier ones.
+
+
+The result is zero for success, PCRE2_ERROR_NULL if ccontext is NULL,
+or PCRE2_ERROR_BADOPTION if directive is unknown. This can be used to
+detect when the available version of PCRE2 does not implement a certain
+optimization.
+
+
+There is a complete description of the PCRE2 native API, including all
+permitted values for the directive parameter of pcre2_set_optimize,
+in the
+pcre2api
+page.
+Return to the PCRE2 index page.
+
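
Note on the page added above (commentary, not part of the patch): the "cumulative, later calls take precedence" rule and the three documented return values can be pictured with a small stand-alone model. Everything in this sketch is an assumption made for illustration: the `toy_` and `OPT_`/`BIT_`/`ERR_` names, the numeric values, and the bitmask representation are hypothetical and are not taken from the PCRE2 headers or sources.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for this sketch only. These are NOT the PCRE2
   constants, their values, or PCRE2's internal representation. */
enum {
  OPT_NONE, OPT_FULL,
  OPT_AUTO_POSSESS, OPT_AUTO_POSSESS_OFF,
  OPT_DOTSTAR_ANCHOR, OPT_DOTSTAR_ANCHOR_OFF,
  OPT_START_OPTIMIZE, OPT_START_OPTIMIZE_OFF
};

#define BIT_AUTO_POSSESS   0x1u
#define BIT_DOTSTAR_ANCHOR 0x2u
#define BIT_START_OPTIMIZE 0x4u
#define BIT_ALL            0x7u

#define ERR_NULL      (-1)   /* models "ccontext is NULL" */
#define ERR_BADOPTION (-2)   /* models "directive is unknown" */

typedef struct { uint32_t enabled; } toy_context;

/* A fresh context starts with every optimization enabled (the default). */
void toy_context_init(toy_context *ctx) { ctx->enabled = BIT_ALL; }

/* Apply one directive on top of whatever earlier calls left behind. */
int toy_set_optimize(toy_context *ctx, uint32_t directive)
{
  if (ctx == NULL) return ERR_NULL;
  switch (directive) {
    case OPT_NONE:               ctx->enabled = 0;                    break;
    case OPT_FULL:               ctx->enabled = BIT_ALL;              break;
    case OPT_AUTO_POSSESS:       ctx->enabled |= BIT_AUTO_POSSESS;    break;
    case OPT_AUTO_POSSESS_OFF:   ctx->enabled &= ~BIT_AUTO_POSSESS;   break;
    case OPT_DOTSTAR_ANCHOR:     ctx->enabled |= BIT_DOTSTAR_ANCHOR;  break;
    case OPT_DOTSTAR_ANCHOR_OFF: ctx->enabled &= ~BIT_DOTSTAR_ANCHOR; break;
    case OPT_START_OPTIMIZE:     ctx->enabled |= BIT_START_OPTIMIZE;  break;
    case OPT_START_OPTIMIZE_OFF: ctx->enabled &= ~BIT_START_OPTIMIZE; break;
    default: return ERR_BADOPTION;  /* unknown directive: state untouched */
  }
  return 0;
}
```

In this model, "none" followed by "dotstar anchor on" leaves only dotstar anchoring enabled, while the reverse order leaves nothing enabled, which is one way to read "later calls take precedence". An unknown directive is rejected without changing the state, matching the documented use of the error return to probe for optimizations that a given library version does not implement.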
diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
index 9766f454d..08ba35811 100644
--- a/doc/html/pcre2api.html
+++ b/doc/html/pcre2api.html
@@ -179,6 +179,10 @@ pcre2api man page
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
int (*guard_function)(uint32_t, void *), void *user_data);
+
+
+int pcre2_set_optimize(pcre2_compile_context *ccontext,
+ uint32_t directive);
PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
@@ -808,6 +812,7 @@
pcre2api man page
The compile time nested parentheses limit
The maximum length of the pattern string
The extra options bits (none set by default)
+ Which performance optimizations the compiler should apply
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
@@ -952,6 +957,110 @@ pcre2api man page
nesting, and the second is user data that is set up by the last argument of
pcre2_set_compile_recursion_guard(). The callout function should return
zero if all is well, or non-zero to force an error.
+
+
+int pcre2_set_optimize(pcre2_compile_context *ccontext,
+ uint32_t directive);
+
+
+PCRE2 can apply various performance optimizations during compilation, in order
+to make matching faster. For example, the compiler might convert some regex
+constructs into an equivalent construct which pcre2_match() can execute
+faster. By default, all available optimizations are enabled. However, in rare
+cases, one might wish to disable specific optimizations. For example, if it is
+known that some optimizations cannot benefit a certain regex, it might be
+desirable to disable them, in order to speed up compilation.
+
+
+The permitted values of directive are as follows:
+
+ PCRE2_OPTIMIZATION_NONE
+
+Disable all optional performance optimizations.
+
+ PCRE2_OPTIMIZATION_FULL
+
+Enable all optional performance optimizations. This is the default value.
+
+ PCRE2_AUTO_POSSESS
+ PCRE2_AUTO_POSSESS_OFF
+
+Enable/disable "auto-possessification" of variable quantifiers such as * and +.
+This optimization, for example, turns a+b into a++b in order to avoid
+backtracks into a+ that can never be successful. However, if callouts are in
+use, auto-possessification means that some callouts are never taken. You can
+disable this optimization if you want the matching functions to do a full,
+unoptimized search and run all the callouts.
+
+ PCRE2_DOTSTAR_ANCHOR
+ PCRE2_DOTSTAR_ANCHOR_OFF
+
+Enable/disable an optimization that is applied when .* is the first significant
+item in a top-level branch of a pattern, and all the other branches also start
+with .* or with \A or \G or ^. Such a pattern is automatically anchored if
+PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any
+^ items. Otherwise, the fact that any match must start either at the start of
+the subject or following a newline is remembered. Like other optimizations,
+this can cause callouts to be skipped.
+
+
+Dotstar anchor optimization is automatically disabled for .* if it is inside an
+atomic group or a capture group that is the subject of a backreference, or if
+the pattern contains (*PRUNE) or (*SKIP).
+
+ PCRE2_START_OPTIMIZE
+ PCRE2_START_OPTIMIZE_OFF
+
+Enable/disable optimizations which cause matching functions to scan the subject
+string for specific code unit values before attempting a match. For example, if
+it is known that an unanchored match must start with a specific value, the
+matching code searches the subject for that value, and fails immediately if it
+cannot find it, without actually running the main matching function. This means
+that a special item such as (*COMMIT) at the start of a pattern is not
+considered until after a suitable starting point for the match has been found.
+Also, when callouts or (*MARK) items are in use, these "start-up" optimizations
+can cause them to be skipped if the pattern is never actually used. The start-up
+optimizations are in effect a pre-scan of the subject that takes place before
+the pattern is run.
+
+
+Disabling start-up optimizations ensures that in cases where the result is "no
+match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are
+considered at every possible starting position in the subject string.
+
+
+Disabling start-up optimizations may change the outcome of a matching operation.
+Consider the pattern
+
+ (*COMMIT)ABC
+
+When this is compiled, PCRE2 records the fact that a match must start with the
+character "A". Suppose the subject string is "DEFABC". The start-up
+optimization scans along the subject, finds "A" and runs the first match
+attempt from there. The (*COMMIT) item means that the pattern must match the
+current starting position, which in this case, it does. However, if the same
+match is run without start-up optimizations, the initial scan along the subject
+string does not happen. The first match attempt is run starting from "D" and
+when this fails, (*COMMIT) prevents any further matches being tried, so the
+overall result is "no match".
+
+
+Another start-up optimization makes use of a minimum length for a matching
+subject, which is recorded when possible. Consider the pattern
+
+ (*MARK:1)B(*MARK:2)(X|Y)
+
+The minimum length for a match is two characters. If the subject is "XXBB", the
+"starting character" optimization skips "XX", then tries to match "BB", which
+is long enough. In the process, (*MARK:2) is encountered and remembered. When
+the match attempt fails, the next "B" is found, but there is only one character
+left, so there are no more attempts, and "no match" is returned with the "last
+mark seen" set to "2". Without start-up optimizations, however, matches are
+tried at every possible starting position, including at the end of the subject,
+where (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
+that is returned is "1". In this case, the optimizations do not affect the
+overall match result, which is still "no match", but they do affect the
+auxiliary information that is returned.
The match context
@@ -1807,85 +1916,55 @@ pcre2api man page
PCRE2_NO_AUTO_POSSESS
-If this option is set, it disables "auto-possessification", which is an
-optimization that, for example, turns a+b into a++b in order to avoid
+If this (deprecated) option is set, it disables "auto-possessification", which
+is an optimization that, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
purposes.
+
+
+It is recommended to use pcre2_set_optimize with the directive
+PCRE2_AUTO_POSSESS_OFF rather than the compile option PCRE2_NO_AUTO_POSSESS.
+Note that PCRE2_NO_AUTO_POSSESS takes precedence over the
+pcre2_set_optimize optimization directives PCRE2_AUTO_POSSESS and
+PCRE2_AUTO_POSSESS_OFF.
PCRE2_NO_DOTSTAR_ANCHOR
-If this option is set, it disables an optimization that is applied when .* is
-the first significant item in a top-level branch of a pattern, and all the
-other branches also start with .* or with \A or \G or ^. The optimization is
-automatically disabled for .* if it is inside an atomic group or a capture
-group that is the subject of a backreference, or if the pattern contains
-(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
-automatically anchored if PCRE2_DOTALL is set for all the .* items and
-PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
-must start either at the start of the subject or following a newline is
+If this (deprecated) option is set, it disables an optimization that is applied
+when .* is the first significant item in a top-level branch of a pattern, and
+all the other branches also start with .* or with \A or \G or ^. The
+optimization is automatically disabled for .* if it is inside an atomic group
+or a capture group that is the subject of a backreference, or if the pattern
+contains (*PRUNE) or (*SKIP). When the optimization is not disabled, such a
+pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items
+and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any
+match must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
+(It is recommended to use pcre2_set_optimize instead.)
PCRE2_NO_START_OPTIMIZE
This is an option whose main effect is at matching time. It does not change
what pcre2_compile() generates, but it does affect the output of the JIT
-compiler.
+compiler. Setting this option is equivalent to calling pcre2_set_optimize
+with the directive parameter set to PCRE2_START_OPTIMIZE_OFF.
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
match must start with a specific code unit value, the matching code searches
the subject for that value, and fails immediately if it cannot find it, without
-actually running the main matching function. This means that a special item
-such as (*COMMIT) at the start of a pattern is not considered until after a
-suitable starting point for the match has been found. Also, when callouts or
-(*MARK) items are in use, these "start-up" optimizations can cause them to be
-skipped if the pattern is never actually used. The start-up optimizations are
+actually running the main matching function. The start-up optimizations are
in effect a pre-scan of the subject that takes place before the pattern is run.
-The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
-possibly causing performance to suffer, but ensuring that in cases where the
-result is "no match", the callouts do occur, and that items such as (*COMMIT)
-and (*MARK) are considered at every possible starting position in the subject
-string.
-
-
-Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation.
-Consider the pattern
-
- (*COMMIT)ABC
-
-When this is compiled, PCRE2 records the fact that a match must start with the
-character "A". Suppose the subject string is "DEFABC". The start-up
-optimization scans along the subject, finds "A" and runs the first match
-attempt from there. The (*COMMIT) item means that the pattern must match the
-current starting position, which in this case, it does. However, if the same
-match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
-subject string does not happen. The first match attempt is run starting from
-"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
-the overall result is "no match".
-
-
-As another start-up optimization makes use of a minimum length for a matching
-subject, which is recorded when possible. Consider the pattern
-
- (*MARK:1)B(*MARK:2)(X|Y)
-
-The minimum length for a match is two characters. If the subject is "XXBB", the
-"starting character" optimization skips "XX", then tries to match "BB", which
-is long enough. In the process, (*MARK:2) is encountered and remembered. When
-the match attempt fails, the next "B" is found, but there is only one character
-left, so there are no more attempts, and "no match" is returned with the "last
-mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
-at every possible starting position, including at the end of the subject, where
-(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
-returned is "1". In this case, the optimizations do not affect the overall
-match result, which is still "no match", but they do affect the auxiliary
-information that is returned.
+Disabling the start-up optimizations may cause performance to suffer. However,
+this may be desirable for patterns which contain callouts or items such as
+(*COMMIT) and (*MARK). See the above description of PCRE2_START_OPTIMIZE_OFF
+for further details.
PCRE2_NO_UTF_CHECK
@@ -2312,6 +2391,7 @@ pcre2api man page
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern
PCRE2_NO_DOTSTAR_ANCHOR is not set
+ Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
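
Note on the (*COMMIT)ABC walk-through in the pcre2api.html changes above (commentary, not part of the patch): the difference between the two outcomes can be reproduced with a toy matcher that is hard-wired to that one pattern. This is a sketch under stated assumptions, not PCRE2 code: `toy_match_commit_abc` and its `start_optimize` flag are hypothetical names, and the pre-scan is modelled as a plain search for the first "A".

```c
#include <string.h>

/* Toy matcher for the single fixed pattern (*COMMIT)ABC.
   Returns the offset of the match, or -1 for "no match".
   start_optimize != 0 models the start-up pre-scan for the first
   code unit 'A'; start_optimize == 0 models the pre-scan disabled. */
int toy_match_commit_abc(const char *subject, int start_optimize)
{
  const char *p = subject;

  if (start_optimize) {
    /* Pre-scan: skip ahead to the first 'A'; fail at once if none. */
    p = strchr(subject, 'A');
    if (p == NULL) return -1;
  }

  /* First (and only) match attempt. (*COMMIT) is the first item, so it
     is passed immediately: if "ABC" does not match at this position,
     no later starting position is tried. */
  if (strncmp(p, "ABC", 3) == 0) return (int)(p - subject);
  return -1;
}
```

With the subject "DEFABC", the model finds the match at offset 3 when the pre-scan is on, and reports "no match" when it is off, because the only attempt starts at "D".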
diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
index 1902f1030..fe7c42eef 100644
--- a/doc/html/pcre2pattern.html
+++ b/doc/html/pcre2pattern.html
@@ -2243,7 +2243,7 @@ pcre2pattern man page
PCRE2 has an optimization that automatically "possessifies" certain simple
pattern constructs. For example, the sequence A+B is treated as A++B because
there is no point in backtracking into a sequence of A's when B must follow.
-This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting
+This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or by starting
the pattern with (*NO_AUTO_POSSESS).
diff --git a/doc/html/pcre2test.html b/doc/html/pcre2test.html
index a519793c7..b38869e10 100644
--- a/doc/html/pcre2test.html
+++ b/doc/html/pcre2test.html
@@ -681,6 +681,23 @@
pcre2test man page
brackets. Setting utf in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
+
+
+The following modifiers enable or disable performance optimizations by
+calling pcre2_set_optimize() before invoking the regex compiler.
+
+ optimization_full enable all optional optimizations
+ optimization_none disable all optional optimizations
+ auto_possess auto-possessify variable quantifiers
+ auto_possess_off don't auto-possessify variable quantifiers
+ dotstar_anchor anchor patterns starting with .*
+ dotstar_anchor_off don't anchor patterns starting with .*
+ start_optimize enable pre-scan of subject string
+ start_optimize_off disable pre-scan of subject string
+
+See the
+pcre2_set_optimize
+documentation for details on these optimizations.
Setting compilation controls
diff --git a/doc/pcre2.txt b/doc/pcre2.txt
index d6b8d0b34..42c2f452b 100644
--- a/doc/pcre2.txt
+++ b/doc/pcre2.txt
@@ -296,6 +296,9 @@ PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
int (*guard_function)(uint32_t, void *), void *user_data);
+ int pcre2_set_optimize(pcre2_compile_context *ccontext,
+ uint32_t directive);
+
PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
@@ -978,6 +981,110 @@ PCRE2 CONTEXTS
ment of pcre2_set_compile_recursion_guard(). The callout function
should return zero if all is well, or non-zero to force an error.
+ int pcre2_set_optimize(pcre2_compile_context *ccontext,
+ uint32_t directive);
+
+ PCRE2 can apply various performance optimizations during compilation,
+ in order to make matching faster. For example, the compiler might con‐
+ vert some regex constructs into an equivalent construct which
+ pcre2_match() can execute faster. By default, all available optimiza‐
+ tions are enabled. However, in rare cases, one might wish to disable
+ specific optimizations. For example, if it is known that some optimiza‐
+ tions cannot benefit a certain regex, it might be desirable to disable
+ them, in order to speed up compilation.
+
+ The permitted values of directive are as follows:
+
+ PCRE2_OPTIMIZATION_NONE
+
+ Disable all optional performance optimizations.
+
+ PCRE2_OPTIMIZATION_FULL
+
+ Enable all optional performance optimizations. This is the default
+ value.
+
+ PCRE2_AUTO_POSSESS
+ PCRE2_AUTO_POSSESS_OFF
+
+ Enable/disable "auto-possessification" of variable quantifiers such as
+ * and +. This optimization, for example, turns a+b into a++b in order
+ to avoid backtracks into a+ that can never be successful. However, if
+ callouts are in use, auto-possessification means that some callouts are
+ never taken. You can disable this optimization if you want the matching
+ functions to do a full, unoptimized search and run all the callouts.
+
+ PCRE2_DOTSTAR_ANCHOR
+ PCRE2_DOTSTAR_ANCHOR_OFF
+
+ Enable/disable an optimization that is applied when .* is the first
+ significant item in a top-level branch of a pattern, and all the other
+ branches also start with .* or with \A or \G or ^. Such a pattern is
+ automatically anchored if PCRE2_DOTALL is set for all the .* items and
+ PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that
+ any match must start either at the start of the subject or following a
+ newline is remembered. Like other optimizations, this can cause call‐
+ outs to be skipped.
+
+ Dotstar anchor optimization is automatically disabled for .* if it is
+ inside an atomic group or a capture group that is the subject of a
+ backreference, or if the pattern contains (*PRUNE) or (*SKIP).
+
+ PCRE2_START_OPTIMIZE
+ PCRE2_START_OPTIMIZE_OFF
+
+ Enable/disable optimizations which cause matching functions to scan the
+ subject string for specific code unit values before attempting a match.
+ For example, if it is known that an unanchored match must start with a
+ specific value, the matching code searches the subject for that value,
+ and fails immediately if it cannot find it, without actually running
+ the main matching function. This means that a special item such as
+ (*COMMIT) at the start of a pattern is not considered until after a
+ suitable starting point for the match has been found. Also, when call‐
+ outs or (*MARK) items are in use, these "start-up" optimizations can
+ cause them to be skipped if the pattern is never actually used. The
+ start-up optimizations are in effect a pre-scan of the subject that
+ takes place before the pattern is run.
+
+ Disabling start-up optimizations ensures that in cases where the result
+ is "no match", the callouts do occur, and that items such as (*COMMIT)
+ and (*MARK) are considered at every possible starting position in the
+ subject string.
+
+ Disabling start-up optimizations may change the outcome of a matching
+ operation. Consider the pattern
+
+ (*COMMIT)ABC
+
+ When this is compiled, PCRE2 records the fact that a match must start
+ with the character "A". Suppose the subject string is "DEFABC". The
+ start-up optimization scans along the subject, finds "A" and runs the
+ first match attempt from there. The (*COMMIT) item means that the pat‐
+ tern must match the current starting position, which in this case, it
+ does. However, if the same match is run without start-up optimizations,
+ the initial scan along the subject string does not happen. The first
+ match attempt is run starting from "D" and when this fails, (*COMMIT)
+ prevents any further matches being tried, so the overall result is "no
+ match".
+
+ Another start-up optimization makes use of a minimum length for a
+ matching subject, which is recorded when possible. Consider the pattern
+
+ (*MARK:1)B(*MARK:2)(X|Y)
+
+ The minimum length for a match is two characters. If the subject is
+ "XXBB", the "starting character" optimization skips "XX", then tries to
+ match "BB", which is long enough. In the process, (*MARK:2) is encoun‐
+ tered and remembered. When the match attempt fails, the next "B" is
+ found, but there is only one character left, so there are no more at‐
+ tempts, and "no match" is returned with the "last mark seen" set to
+ "2". Without start-up optimizations, however, matches are tried at ev‐
+ ery possible starting position, including at the end of the subject,
+ where (*MARK:1) is encountered, but there is no "B", so the "last mark
+ seen" that is returned is "1". In this case, the optimizations do not
+ affect the overall match result, which is still "no match", but they do
+ affect the auxiliary information that is returned.
+
The match context
A match context is required if you want to:
@@ -1775,86 +1882,55 @@ COMPILING A PATTERN
PCRE2_NO_AUTO_POSSESS
- If this option is set, it disables "auto-possessification", which is an
- optimization that, for example, turns a+b into a++b in order to avoid
- backtracks into a+ that can never be successful. However, if callouts
- are in use, auto-possessification means that some callouts are never
- taken. You can set this option if you want the matching functions to do
- a full unoptimized search and run all the callouts, but it is mainly
- provided for testing purposes.
+ If this (deprecated) option is set, it disables "auto-possessifica‐
+ tion", which is an optimization that, for example, turns a+b into a++b
+ in order to avoid backtracks into a+ that can never be successful. How‐
+ ever, if callouts are in use, auto-possessification means that some
+ callouts are never taken. You can set this option if you want the
+ matching functions to do a full unoptimized search and run all the
+ callouts, but it is mainly provided for testing purposes.
+
+ It is recommended to use pcre2_set_optimize with the directive
+ PCRE2_AUTO_POSSESS_OFF rather than the compile option
+ PCRE2_NO_AUTO_POSSESS. Note that PCRE2_NO_AUTO_POSSESS takes prece‐
+ dence over the pcre2_set_optimize optimization directives
+ PCRE2_AUTO_POSSESS and PCRE2_AUTO_POSSESS_OFF.
PCRE2_NO_DOTSTAR_ANCHOR
- If this option is set, it disables an optimization that is applied when
- .* is the first significant item in a top-level branch of a pattern,
- and all the other branches also start with .* or with \A or \G or ^.
- The optimization is automatically disabled for .* if it is inside an
- atomic group or a capture group that is the subject of a backreference,
- or if the pattern contains (*PRUNE) or (*SKIP). When the optimization
- is not disabled, such a pattern is automatically anchored if
+ If this (deprecated) option is set, it disables an optimization that is
+ applied when .* is the first significant item in a top-level branch of
+ a pattern, and all the other branches also start with .* or with \A or
+ \G or ^. The optimization is automatically disabled for .* if it is in‐
+ side an atomic group or a capture group that is the subject of a back‐
+ reference, or if the pattern contains (*PRUNE) or (*SKIP). When the op‐
+ timization is not disabled, such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
for any ^ items. Otherwise, the fact that any match must start either
at the start of the subject or following a newline is remembered. Like
- other optimizations, this can cause callouts to be skipped.
+ other optimizations, this can cause callouts to be skipped. (It is
+ recommended to use pcre2_set_optimize instead.)
PCRE2_NO_START_OPTIMIZE
- This is an option whose main effect is at matching time. It does not
+ This is an option whose main effect is at matching time. It does not
change what pcre2_compile() generates, but it does affect the output of
- the JIT compiler.
+ the JIT compiler. Setting this option is equivalent to calling
+ pcre2_set_optimize with the directive parameter set to PCRE2_START_OP‐
+ TIMIZE_OFF.
There are a number of optimizations that may occur at the start of a
match, in order to speed up the process. For example, if it is known
that an unanchored match must start with a specific code unit value,
- the matching code searches the subject for that value, and fails imme-
- diately if it cannot find it, without actually running the main match-
- ing function. This means that a special item such as (*COMMIT) at the
- start of a pattern is not considered until after a suitable starting
- point for the match has been found. Also, when callouts or (*MARK)
- items are in use, these "start-up" optimizations can cause them to be
- skipped if the pattern is never actually used. The start-up optimiza-
- tions are in effect a pre-scan of the subject that takes place before
- the pattern is run.
-
- The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
- possibly causing performance to suffer, but ensuring that in cases
- where the result is "no match", the callouts do occur, and that items
- such as (*COMMIT) and (*MARK) are considered at every possible starting
- position in the subject string.
-
- Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
- operation. Consider the pattern
-
- (*COMMIT)ABC
-
- When this is compiled, PCRE2 records the fact that a match must start
- with the character "A". Suppose the subject string is "DEFABC". The
- start-up optimization scans along the subject, finds "A" and runs the
- first match attempt from there. The (*COMMIT) item means that the pat-
- tern must match the current starting position, which in this case, it
- does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
- set, the initial scan along the subject string does not happen. The
- first match attempt is run starting from "D" and when this fails,
- (*COMMIT) prevents any further matches being tried, so the overall re-
- sult is "no match".
-
- As another start-up optimization makes use of a minimum length for a
- matching subject, which is recorded when possible. Consider the pattern
-
- (*MARK:1)B(*MARK:2)(X|Y)
+ the matching code searches the subject for that value, and fails imme‐
+ diately if it cannot find it, without actually running the main match‐
+ ing function. The start-up optimizations are in effect a pre-scan of
+ the subject that takes place before the pattern is run.
- The minimum length for a match is two characters. If the subject is
- "XXBB", the "starting character" optimization skips "XX", then tries to
- match "BB", which is long enough. In the process, (*MARK:2) is encoun-
- tered and remembered. When the match attempt fails, the next "B" is
- found, but there is only one character left, so there are no more at-
- tempts, and "no match" is returned with the "last mark seen" set to
- "2". If NO_START_OPTIMIZE is set, however, matches are tried at every
- possible starting position, including at the end of the subject, where
- (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
- that is returned is "1". In this case, the optimizations do not affect
- the overall match result, which is still "no match", but they do affect
- the auxiliary information that is returned.
+ Disabling the start-up optimizations may cause performance to suffer.
+ However, this may be desirable for patterns which contain callouts or
+ items such as (*COMMIT) and (*MARK). See the above description of
+ PCRE2_START_OPTIMIZE_OFF for further details.
PCRE2_NO_UTF_CHECK
@@ -2261,6 +2337,7 @@ INFORMATION ABOUT A COMPILED PATTERN
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern
PCRE2_NO_DOTSTAR_ANCHOR is not set
+ Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
the options returned for PCRE2_INFO_ALLOPTIONS.
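
Note on the (*MARK:1)B(*MARK:2)(X|Y) walk-through in the pcre2.txt changes above (commentary, not part of the patch): the "last mark seen" difference can be traced with a toy matcher hard-wired to that pattern. This is only a sketch under stated assumptions: `toy_last_mark` is a hypothetical name, marks are modelled as the integers 1 and 2 (0 meaning no mark seen), and the start-up optimizations are reduced to a search for "B" plus a minimum-length check of two characters.

```c
#include <string.h>

/* Toy matcher for the single fixed pattern (*MARK:1)B(*MARK:2)(X|Y).
   Returns the last mark encountered (0 if none) and sets *matched to 1
   on a match, 0 otherwise. start_optimize != 0 models the start-up
   optimizations: a pre-scan for 'B' and a minimum length of 2. */
int toy_last_mark(const char *subject, int start_optimize, int *matched)
{
  size_t len = strlen(subject);
  int last_mark = 0;

  *matched = 0;
  for (size_t start = 0; start <= len; start++) {
    if (start_optimize) {
      /* Skip to the next possible starting 'B'; stop if there is none. */
      const char *b = strchr(subject + start, 'B');
      if (b == NULL) break;
      start = (size_t)(b - subject);
      /* Too few characters left for any match: no more attempts. */
      if (len - start < 2) break;
    }

    /* One match attempt at this starting position. */
    last_mark = 1;                       /* (*MARK:1) */
    if (subject[start] != 'B') continue; /* B fails, try next position */
    last_mark = 2;                       /* (*MARK:2) */
    if (subject[start + 1] == 'X' || subject[start + 1] == 'Y') {
      *matched = 1;                      /* (X|Y) matched */
      break;
    }
  }
  return last_mark;
}
```

For the subject "XXBB" the model reports no match either way, but the last mark is 2 with the optimizations on and 1 with them off, because only the unoptimized run makes a final attempt at the very end of the subject.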
diff --git a/doc/pcre2_set_optimize.3 b/doc/pcre2_set_optimize.3
new file mode 100644
index 000000000..1a51cc27e
--- /dev/null
+++ b/doc/pcre2_set_optimize.3
@@ -0,0 +1,33 @@
+.TH PCRE2_SET_OPTIMIZE 3 "16 September 2024" "PCRE2 10.45"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH SYNOPSIS
+.rs
+.sp
+.B #include <pcre2.h>
+.PP
+.nf
+.B int pcre2_set_optimize(pcre2_compile_context *\fIccontext\fP,
+.B " uint32_t \fIdirective\fP);"
+.fi
+.
+.SH DESCRIPTION
+.rs
+.sp
+This function controls which performance optimizations will be applied
+by \fBpcre2_compile\fP. It can be called multiple times with the same compile
+context; the effects are cumulative, with the effects of later calls taking
+precedence over earlier ones.
+.P
+The result is zero for success, PCRE2_ERROR_NULL if \fIccontext\fP is NULL,
+or PCRE2_ERROR_BADOPTION if \fIdirective\fP is unknown. This can be used to
+detect when the available version of PCRE2 does not implement a certain
+optimization.
+.P
+There is a complete description of the PCRE2 native API, including all
+permitted values for the \fIdirective\fP parameter of \fBpcre2_set_optimize\fP,
+in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+page.
\ No newline at end of file
diff --git a/doc/pcre2api.3 b/doc/pcre2api.3
index a362982d8..026e85d0b 100644
--- a/doc/pcre2api.3
+++ b/doc/pcre2api.3
@@ -115,6 +115,9 @@ document for an overview of all the PCRE2 documentation.
.sp
.B int pcre2_set_compile_recursion_guard(pcre2_compile_context *\fIccontext\fP,
.B " int (*\fIguard_function\fP)(uint32_t, void *), void *\fIuser_data\fP);"
+.sp
+.B int pcre2_set_optimize(pcre2_compile_context *\fIccontext\fP,
+.B " uint32_t \fIdirective\fP);"
.fi
.
.
@@ -738,6 +741,7 @@ following compile-time parameters:
The compile time nested parentheses limit
The maximum length of the pattern string
The extra options bits (none set by default)
+ Which performance optimizations the compiler should apply
.sp
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
@@ -881,6 +885,105 @@ The first argument to the callout function gives the current depth of
nesting, and the second is user data that is set up by the last argument of
\fBpcre2_set_compile_recursion_guard()\fP. The callout function should return
zero if all is well, or non-zero to force an error.
+.sp
+.nf
+.B int pcre2_set_optimize(pcre2_compile_context *\fIccontext\fP,
+.B " uint32_t \fIdirective\fP);"
+.fi
+.sp
+PCRE2 can apply various performance optimizations during compilation, in order
+to make matching faster. For example, the compiler might convert some regex
+constructs into an equivalent construct which \fBpcre2_match()\fP can execute
+faster. By default, all available optimizations are enabled. However, in rare
+cases, one might wish to disable specific optimizations. For example, if it is
+known that some optimizations cannot benefit a certain regex, it might be
+desirable to disable them, in order to speed up compilation.
+.P
+The permitted values of \fIdirective\fP are as follows:
+.sp
+ PCRE2_OPTIMIZATION_NONE
+.sp
+Disable all optional performance optimizations.
+.sp
+ PCRE2_OPTIMIZATION_FULL
+.sp
+Enable all optional performance optimizations. This is the default value.
+.sp
+ PCRE2_AUTO_POSSESS
+ PCRE2_AUTO_POSSESS_OFF
+.sp
+Enable/disable "auto-possessification" of variable quantifiers such as * and +.
+This optimization, for example, turns a+b into a++b in order to avoid
+backtracks into a+ that can never be successful. However, if callouts are in
+use, auto-possessification means that some callouts are never taken. You can
+disable this optimization if you want the matching functions to do a full,
+unoptimized search and run all the callouts.
+.sp
+ PCRE2_DOTSTAR_ANCHOR
+ PCRE2_DOTSTAR_ANCHOR_OFF
+.sp
+Enable/disable an optimization that is applied when .* is the first significant
+item in a top-level branch of a pattern, and all the other branches also start
+with .* or with \eA or \eG or ^. Such a pattern is automatically anchored if
+PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any
+^ items. Otherwise, the fact that any match must start either at the start of
+the subject or following a newline is remembered. Like other optimizations,
+this can cause callouts to be skipped.
+.P
+Dotstar anchor optimization is automatically disabled for .* if it is inside an
+atomic group or a capture group that is the subject of a backreference, or if
+the pattern contains (*PRUNE) or (*SKIP).
+.sp
+ PCRE2_START_OPTIMIZE
+ PCRE2_START_OPTIMIZE_OFF
+.sp
+Enable/disable optimizations which cause matching functions to scan the subject
+string for specific code unit values before attempting a match. For example, if
+it is known that an unanchored match must start with a specific value, the
+matching code searches the subject for that value, and fails immediately if it
+cannot find it, without actually running the main matching function. This means
+that a special item such as (*COMMIT) at the start of a pattern is not
+considered until after a suitable starting point for the match has been found.
+Also, when callouts or (*MARK) items are in use, these "start-up" optimizations
+can cause them to be skipped if the pattern is never actually run. The start-up
+optimizations are in effect a pre-scan of the subject that takes place before
+the pattern is run.
+.P
+Disabling start-up optimizations ensures that in cases where the result is "no
+match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are
+considered at every possible starting position in the subject string.
+.P
+Disabling start-up optimizations may change the outcome of a matching operation.
+Consider the pattern
+.sp
+ (*COMMIT)ABC
+.sp
+When this is compiled, PCRE2 records the fact that a match must start with the
+character "A". Suppose the subject string is "DEFABC". The start-up
+optimization scans along the subject, finds "A" and runs the first match
+attempt from there. The (*COMMIT) item means that the pattern must match at the
+current starting position, which in this case it does. However, if the same
+match is run without start-up optimizations, the initial scan along the subject
+string does not happen. The first match attempt is run starting from "D" and
+when this fails, (*COMMIT) prevents any further matches being tried, so the
+overall result is "no match".
+.P
+Another start-up optimization makes use of a minimum length for a matching
+subject, which is recorded when possible. Consider the pattern
+.sp
+ (*MARK:1)B(*MARK:2)(X|Y)
+.sp
+The minimum length for a match is two characters. If the subject is "XXBB", the
+"starting character" optimization skips "XX", then tries to match "BB", which
+is long enough. In the process, (*MARK:2) is encountered and remembered. When
+the match attempt fails, the next "B" is found, but there is only one character
+left, so there are no more attempts, and "no match" is returned with the "last
+mark seen" set to "2". Without start-up optimizations, however, matches are
+tried at every possible starting position, including at the end of the subject,
+where (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
+that is returned is "1". In this case, the optimizations do not affect the
+overall match result, which is still "no match", but they do affect the
+auxiliary information that is returned.
.
.
.\" HTML
@@ -1748,81 +1851,52 @@ though the reference can be by name or by number.
.sp
PCRE2_NO_AUTO_POSSESS
.sp
-If this option is set, it disables "auto-possessification", which is an
-optimization that, for example, turns a+b into a++b in order to avoid
+If this (deprecated) option is set, it disables "auto-possessification", which
+is an optimization that, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
purposes.
+.P
+It is recommended to use \fBpcre2_set_optimize\fP with the \fIdirective\fP
+PCRE2_AUTO_POSSESS_OFF rather than the compile option PCRE2_NO_AUTO_POSSESS.
+Note that PCRE2_NO_AUTO_POSSESS takes precedence over the
+\fBpcre2_set_optimize\fP optimization directives PCRE2_AUTO_POSSESS and
+PCRE2_AUTO_POSSESS_OFF.
.sp
PCRE2_NO_DOTSTAR_ANCHOR
.sp
-If this option is set, it disables an optimization that is applied when .* is
-the first significant item in a top-level branch of a pattern, and all the
-other branches also start with .* or with \eA or \eG or ^. The optimization is
-automatically disabled for .* if it is inside an atomic group or a capture
-group that is the subject of a backreference, or if the pattern contains
-(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
-automatically anchored if PCRE2_DOTALL is set for all the .* items and
-PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
-must start either at the start of the subject or following a newline is
+If this (deprecated) option is set, it disables an optimization that is applied
+when .* is the first significant item in a top-level branch of a pattern, and
+all the other branches also start with .* or with \eA or \eG or ^. The
+optimization is automatically disabled for .* if it is inside an atomic group
+or a capture group that is the subject of a backreference, or if the pattern
+contains (*PRUNE) or (*SKIP). When the optimization is not disabled, such a
+pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items
+and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any
+match must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
+(It is recommended to use \fBpcre2_set_optimize\fP with PCRE2_DOTSTAR_ANCHOR_OFF instead.)
.sp
PCRE2_NO_START_OPTIMIZE
.sp
This is an option whose main effect is at matching time. It does not change
what \fBpcre2_compile()\fP generates, but it does affect the output of the JIT
-compiler.
+compiler. Setting this option is equivalent to calling \fBpcre2_set_optimize\fP
+with the \fIdirective\fP parameter set to PCRE2_START_OPTIMIZE_OFF.
.P
There are a number of optimizations that may occur at the start of a match, in
order to speed up the process. For example, if it is known that an unanchored
match must start with a specific code unit value, the matching code searches
the subject for that value, and fails immediately if it cannot find it, without
-actually running the main matching function. This means that a special item
-such as (*COMMIT) at the start of a pattern is not considered until after a
-suitable starting point for the match has been found. Also, when callouts or
-(*MARK) items are in use, these "start-up" optimizations can cause them to be
-skipped if the pattern is never actually used. The start-up optimizations are
+actually running the main matching function. The start-up optimizations are
in effect a pre-scan of the subject that takes place before the pattern is run.
.P
-The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
-possibly causing performance to suffer, but ensuring that in cases where the
-result is "no match", the callouts do occur, and that items such as (*COMMIT)
-and (*MARK) are considered at every possible starting position in the subject
-string.
-.P
-Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching operation.
-Consider the pattern
-.sp
- (*COMMIT)ABC
-.sp
-When this is compiled, PCRE2 records the fact that a match must start with the
-character "A". Suppose the subject string is "DEFABC". The start-up
-optimization scans along the subject, finds "A" and runs the first match
-attempt from there. The (*COMMIT) item means that the pattern must match the
-current starting position, which in this case, it does. However, if the same
-match is run with PCRE2_NO_START_OPTIMIZE set, the initial scan along the
-subject string does not happen. The first match attempt is run starting from
-"D" and when this fails, (*COMMIT) prevents any further matches being tried, so
-the overall result is "no match".
-.P
-As another start-up optimization makes use of a minimum length for a matching
-subject, which is recorded when possible. Consider the pattern
-.sp
- (*MARK:1)B(*MARK:2)(X|Y)
-.sp
-The minimum length for a match is two characters. If the subject is "XXBB", the
-"starting character" optimization skips "XX", then tries to match "BB", which
-is long enough. In the process, (*MARK:2) is encountered and remembered. When
-the match attempt fails, the next "B" is found, but there is only one character
-left, so there are no more attempts, and "no match" is returned with the "last
-mark seen" set to "2". If NO_START_OPTIMIZE is set, however, matches are tried
-at every possible starting position, including at the end of the subject, where
-(*MARK:1) is encountered, but there is no "B", so the "last mark seen" that is
-returned is "1". In this case, the optimizations do not affect the overall
-match result, which is still "no match", but they do affect the auxiliary
-information that is returned.
+Disabling the start-up optimizations may cause performance to suffer. However,
+this may be desirable for patterns which contain callouts or items such as
+(*COMMIT) and (*MARK). See the above description of PCRE2_START_OPTIMIZE_OFF
+for further details.
.sp
PCRE2_NO_UTF_CHECK
.sp
@@ -2272,6 +2346,7 @@ following are true:
PCRE2_DOTALL is in force for .*
Neither (*PRUNE) nor (*SKIP) appears in the pattern
PCRE2_NO_DOTSTAR_ANCHOR is not set
+ Dotstar anchoring has not been disabled with PCRE2_DOTSTAR_ANCHOR_OFF
.sp
For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in the
options returned for PCRE2_INFO_ALLOPTIONS.
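Review note: the (*COMMIT)ABC example in the new PCRE2_START_OPTIMIZE text can be followed with a small toy model. The sketch below is not PCRE2 code and does not call the library. The function name toy_commit_abc and its boolean parameter are made up for illustration; it only imitates the behaviour the documentation describes for this one pattern.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model only: this is NOT how PCRE2 is implemented. It imitates the
   documented behaviour of the single pattern (*COMMIT)ABC.
   Returns the offset of the match, or -1 for "no match". */
static int toy_commit_abc(const char *subject, bool start_optimize)
{
  size_t len = strlen(subject);
  size_t start = 0;

  if (start_optimize)
    {
    /* Start-up pre-scan: a match must begin with 'A', so skip ahead to it. */
    const char *a = strchr(subject, 'A');
    if (a == NULL) return -1;   /* fail without ever running the pattern */
    start = (size_t)(a - subject);
    }

  /* First match attempt. Because (*COMMIT) is passed at this position,
     it is also the only attempt. */
  if (len - start >= 3 && memcmp(subject + start, "ABC", 3) == 0)
    return (int)start;

  /* (*COMMIT) forbids advancing to another starting position. */
  return -1;
}
```

With the subject "DEFABC", the model finds the match at offset 3 when the pre-scan is enabled and reports no match when it is disabled, which is the difference in outcome that the man page text warns about.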
diff --git a/doc/pcre2pattern.3 b/doc/pcre2pattern.3
index 84e4aff47..b0936c91a 100644
--- a/doc/pcre2pattern.3
+++ b/doc/pcre2pattern.3
@@ -2242,7 +2242,7 @@ package, and PCRE1 copied it from there. It found its way into Perl at release
PCRE2 has an optimization that automatically "possessifies" certain simple
pattern constructs. For example, the sequence A+B is treated as A++B because
there is no point in backtracking into a sequence of A's when B must follow.
-This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting
+This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or by starting
the pattern with (*NO_AUTO_POSSESS).
.P
When a pattern contains an unlimited repeat inside a group that can itself be
diff --git a/doc/pcre2test.1 b/doc/pcre2test.1
index 9b7d37598..378b5dced 100644
--- a/doc/pcre2test.1
+++ b/doc/pcre2test.1
@@ -636,6 +636,24 @@ notation. Otherwise, those less than 0x100 are output in hex without the curly
brackets. Setting \fButf\fP in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
+.sp
+The following modifiers enable or disable performance optimizations by
+calling \fBpcre2_set_optimize()\fP before invoking the regex compiler.
+.sp
+ optimization_full enable all optional optimizations
+ optimization_none disable all optional optimizations
+ auto_possess auto-possessify variable quantifiers
+ auto_possess_off don't auto-possessify variable quantifiers
+ dotstar_anchor anchor patterns starting with .*
+ dotstar_anchor_off don't anchor patterns starting with .*
+ start_optimize enable pre-scan of subject string
+ start_optimize_off disable pre-scan of subject string
+.sp
+See the
+.\" HREF
+\fBpcre2_set_optimize\fP
+.\"
+documentation for details on these optimizations.
.
.
.\" HTML
diff --git a/doc/pcre2test.txt b/doc/pcre2test.txt
index 30e16c8b5..a4cb50ad0 100644
--- a/doc/pcre2test.txt
+++ b/doc/pcre2test.txt
@@ -618,6 +618,21 @@ PATTERN MODIFIERS
causes pattern and subject strings to be translated to UTF-16 or
UTF-32, respectively, before being passed to library functions.
+ The following modifiers enable or disable performance optimizations by
+ calling pcre2_set_optimize() before invoking the regex compiler.
+
+ optimization_full enable all optional optimizations
+ optimization_none disable all optional optimizations
+ auto_possess auto-possessify variable quantifiers
+ auto_possess_off don't auto-possessify variable quantifiers
+ dotstar_anchor anchor patterns starting with .*
+ dotstar_anchor_off don't anchor patterns starting with .*
+ start_optimize enable pre-scan of subject string
+ start_optimize_off disable pre-scan of subject string
+
+ See the pcre2_set_optimize documentation for details on these optimiza‐
+ tions.
+
Setting compilation controls
The following modifiers affect the compilation process or request in-
diff --git a/src/pcre2.h.generic b/src/pcre2.h.generic
index a3341e6f5..0896b72ca 100644
--- a/src/pcre2.h.generic
+++ b/src/pcre2.h.generic
@@ -464,6 +464,18 @@ released, the numbers must not be changed. */
#define PCRE2_CONFIG_COMPILED_WIDTHS 14
#define PCRE2_CONFIG_TABLES_LENGTH 15
+/* Optimization directives for pcre2_set_optimize().
+For binary compatibility, only add to this list; do not renumber. */
+
+#define PCRE2_OPTIMIZATION_NONE 0
+#define PCRE2_OPTIMIZATION_FULL 1
+
+#define PCRE2_AUTO_POSSESS 64
+#define PCRE2_AUTO_POSSESS_OFF 65
+#define PCRE2_DOTSTAR_ANCHOR 66
+#define PCRE2_DOTSTAR_ANCHOR_OFF 67
+#define PCRE2_START_OPTIMIZE 68
+#define PCRE2_START_OPTIMIZE_OFF 69
/* Types for code units in patterns and subject strings. */
@@ -617,7 +629,9 @@ PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_parens_nest_limit(pcre2_compile_context *, uint32_t); \
PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_compile_recursion_guard(pcre2_compile_context *, \
- int (*)(uint32_t, void *), void *);
+ int (*)(uint32_t, void *), void *); \
+PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
+ pcre2_set_optimize(pcre2_compile_context *, uint32_t);
#define PCRE2_MATCH_CONTEXT_FUNCTIONS \
PCRE2_EXP_DECL pcre2_match_context *PCRE2_CALL_CONVENTION \
@@ -912,6 +926,7 @@ pcre2_compile are called by application code. */
#define pcre2_set_newline PCRE2_SUFFIX(pcre2_set_newline_)
#define pcre2_set_parens_nest_limit PCRE2_SUFFIX(pcre2_set_parens_nest_limit_)
#define pcre2_set_offset_limit PCRE2_SUFFIX(pcre2_set_offset_limit_)
+#define pcre2_set_optimize PCRE2_SUFFIX(pcre2_set_optimize_)
#define pcre2_set_substitute_callout PCRE2_SUFFIX(pcre2_set_substitute_callout_)
#define pcre2_substitute PCRE2_SUFFIX(pcre2_substitute_)
#define pcre2_substring_copy_byname PCRE2_SUFFIX(pcre2_substring_copy_byname_)
diff --git a/src/pcre2.h.in b/src/pcre2.h.in
index a19313c9e..9595a8540 100644
--- a/src/pcre2.h.in
+++ b/src/pcre2.h.in
@@ -464,6 +464,18 @@ released, the numbers must not be changed. */
#define PCRE2_CONFIG_COMPILED_WIDTHS 14
#define PCRE2_CONFIG_TABLES_LENGTH 15
+/* Optimization directives for pcre2_set_optimize().
+For binary compatibility, only add to this list; do not renumber. */
+
+#define PCRE2_OPTIMIZATION_NONE 0
+#define PCRE2_OPTIMIZATION_FULL 1
+
+#define PCRE2_AUTO_POSSESS 64
+#define PCRE2_AUTO_POSSESS_OFF 65
+#define PCRE2_DOTSTAR_ANCHOR 66
+#define PCRE2_DOTSTAR_ANCHOR_OFF 67
+#define PCRE2_START_OPTIMIZE 68
+#define PCRE2_START_OPTIMIZE_OFF 69
/* Types for code units in patterns and subject strings. */
@@ -617,7 +629,9 @@ PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_parens_nest_limit(pcre2_compile_context *, uint32_t); \
PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
pcre2_set_compile_recursion_guard(pcre2_compile_context *, \
- int (*)(uint32_t, void *), void *);
+ int (*)(uint32_t, void *), void *); \
+PCRE2_EXP_DECL int PCRE2_CALL_CONVENTION \
+ pcre2_set_optimize(pcre2_compile_context *, uint32_t);
#define PCRE2_MATCH_CONTEXT_FUNCTIONS \
PCRE2_EXP_DECL pcre2_match_context *PCRE2_CALL_CONVENTION \
@@ -912,6 +926,7 @@ pcre2_compile are called by application code. */
#define pcre2_set_newline PCRE2_SUFFIX(pcre2_set_newline_)
#define pcre2_set_parens_nest_limit PCRE2_SUFFIX(pcre2_set_parens_nest_limit_)
#define pcre2_set_offset_limit PCRE2_SUFFIX(pcre2_set_offset_limit_)
+#define pcre2_set_optimize PCRE2_SUFFIX(pcre2_set_optimize_)
#define pcre2_set_substitute_callout PCRE2_SUFFIX(pcre2_set_substitute_callout_)
#define pcre2_substitute PCRE2_SUFFIX(pcre2_substitute_)
#define pcre2_substring_copy_byname PCRE2_SUFFIX(pcre2_substring_copy_byname_)
diff --git a/src/pcre2_compile.c b/src/pcre2_compile.c
index 48dae18fa..946198c0f 100644
--- a/src/pcre2_compile.c
+++ b/src/pcre2_compile.c
@@ -834,7 +834,8 @@ enum { PSO_OPT, /* Value is an option bit */
PSO_BSR, /* Value is a \R type */
PSO_LIMH, /* Read integer value for heap limit */
PSO_LIMM, /* Read integer value for match limit */
- PSO_LIMD /* Read integer value for depth limit */
+ PSO_LIMD, /* Read integer value for depth limit */
+ PSO_OPTMZ /* Value is an optimization bit */
};
typedef struct pso {
@@ -852,10 +853,10 @@ static const pso pso_list[] = {
{ STRING_UCP_RIGHTPAR, 4, PSO_OPT, PCRE2_UCP },
{ STRING_NOTEMPTY_RIGHTPAR, 9, PSO_FLG, PCRE2_NOTEMPTY_SET },
{ STRING_NOTEMPTY_ATSTART_RIGHTPAR, 17, PSO_FLG, PCRE2_NE_ATST_SET },
- { STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPT, PCRE2_NO_AUTO_POSSESS },
- { STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPT, PCRE2_NO_DOTSTAR_ANCHOR },
+ { STRING_NO_AUTO_POSSESS_RIGHTPAR, 16, PSO_OPTMZ, PCRE2_OPTIM_AUTO_POSSESS },
+ { STRING_NO_DOTSTAR_ANCHOR_RIGHTPAR, 18, PSO_OPTMZ, PCRE2_OPTIM_DOTSTAR_ANCHOR },
{ STRING_NO_JIT_RIGHTPAR, 7, PSO_FLG, PCRE2_NOJIT },
- { STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPT, PCRE2_NO_START_OPTIMIZE },
+ { STRING_NO_START_OPT_RIGHTPAR, 13, PSO_OPTMZ, PCRE2_OPTIM_START_OPTIMIZE },
{ STRING_LIMIT_HEAP_EQ, 11, PSO_LIMH, 0 },
{ STRING_LIMIT_MATCH_EQ, 12, PSO_LIMM, 0 },
{ STRING_LIMIT_DEPTH_EQ, 12, PSO_LIMD, 0 },
@@ -8883,13 +8884,14 @@ this prevents the number of characters it matches from being adjusted.
cb points to the compile data block
atomcount atomic group level
inassert TRUE if in an assertion
+ dotstar_anchor TRUE if automatic anchoring optimization is enabled
Returns: TRUE or FALSE
*/
static BOOL
is_anchored(PCRE2_SPTR code, uint32_t bracket_map, compile_block *cb,
- int atomcount, BOOL inassert)
+ int atomcount, BOOL inassert, BOOL dotstar_anchor)
{
do {
PCRE2_SPTR scode = first_significant_code(
@@ -8901,7 +8903,7 @@ do {
if (op == OP_BRA || op == OP_BRAPOS ||
op == OP_SBRA || op == OP_SBRAPOS)
{
- if (!is_anchored(scode, bracket_map, cb, atomcount, inassert))
+ if (!is_anchored(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor))
return FALSE;
}
@@ -8912,14 +8914,14 @@ do {
{
int n = GET2(scode, 1+LINK_SIZE);
uint32_t new_map = bracket_map | ((n < 32)? (1u << n) : 1);
- if (!is_anchored(scode, new_map, cb, atomcount, inassert)) return FALSE;
+ if (!is_anchored(scode, new_map, cb, atomcount, inassert, dotstar_anchor)) return FALSE;
}
/* Positive forward assertion */
else if (op == OP_ASSERT || op == OP_ASSERT_NA)
{
- if (!is_anchored(scode, bracket_map, cb, atomcount, TRUE)) return FALSE;
+ if (!is_anchored(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor)) return FALSE;
}
/* Condition. If there is no second branch, it can't be anchored. */
@@ -8927,7 +8929,7 @@ do {
else if (op == OP_COND || op == OP_SCOND)
{
if (scode[GET(scode,1)] != OP_ALT) return FALSE;
- if (!is_anchored(scode, bracket_map, cb, atomcount, inassert))
+ if (!is_anchored(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor))
return FALSE;
}
@@ -8935,7 +8937,7 @@ do {
else if (op == OP_ONCE)
{
- if (!is_anchored(scode, bracket_map, cb, atomcount + 1, inassert))
+ if (!is_anchored(scode, bracket_map, cb, atomcount + 1, inassert, dotstar_anchor))
return FALSE;
}
@@ -8950,8 +8952,7 @@ do {
op == OP_TYPEPOSSTAR))
{
if (scode[1] != OP_ALLANY || (bracket_map & cb->backref_map) != 0 ||
- atomcount > 0 || cb->had_pruneorskip || inassert ||
- (cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
+ atomcount > 0 || cb->had_pruneorskip || inassert || !dotstar_anchor)
return FALSE;
}
@@ -8988,13 +8989,14 @@ or *SKIP does not count, because once again the assumption no longer holds.
cb points to the compile data
atomcount atomic group level
inassert TRUE if in an assertion
+ dotstar_anchor TRUE if automatic anchoring optimization is enabled
Returns: TRUE or FALSE
*/
static BOOL
is_startline(PCRE2_SPTR code, unsigned int bracket_map, compile_block *cb,
- int atomcount, BOOL inassert)
+ int atomcount, BOOL inassert, BOOL dotstar_anchor)
{
do {
PCRE2_SPTR scode = first_significant_code(
@@ -9025,7 +9027,8 @@ do {
return FALSE;
default: /* Assertion */
- if (!is_startline(scode, bracket_map, cb, atomcount, TRUE)) return FALSE;
+ if (!is_startline(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor))
+ return FALSE;
do scode += GET(scode, 1); while (*scode == OP_ALT);
scode += 1 + LINK_SIZE;
break;
@@ -9039,7 +9042,7 @@ do {
if (op == OP_BRA || op == OP_BRAPOS ||
op == OP_SBRA || op == OP_SBRAPOS)
{
- if (!is_startline(scode, bracket_map, cb, atomcount, inassert))
+ if (!is_startline(scode, bracket_map, cb, atomcount, inassert, dotstar_anchor))
return FALSE;
}
@@ -9050,14 +9053,15 @@ do {
{
int n = GET2(scode, 1+LINK_SIZE);
unsigned int new_map = bracket_map | ((n < 32)? (1u << n) : 1);
- if (!is_startline(scode, new_map, cb, atomcount, inassert)) return FALSE;
+ if (!is_startline(scode, new_map, cb, atomcount, inassert, dotstar_anchor))
+ return FALSE;
}
/* Positive forward assertions */
else if (op == OP_ASSERT || op == OP_ASSERT_NA)
{
- if (!is_startline(scode, bracket_map, cb, atomcount, TRUE))
+ if (!is_startline(scode, bracket_map, cb, atomcount, TRUE, dotstar_anchor))
return FALSE;
}
@@ -9065,7 +9069,7 @@ do {
else if (op == OP_ONCE)
{
- if (!is_startline(scode, bracket_map, cb, atomcount + 1, inassert))
+ if (!is_startline(scode, bracket_map, cb, atomcount + 1, inassert, dotstar_anchor))
return FALSE;
}
@@ -9079,8 +9083,7 @@ do {
else if (op == OP_TYPESTAR || op == OP_TYPEMINSTAR || op == OP_TYPEPOSSTAR)
{
if (scode[1] != OP_ANY || (bracket_map & cb->backref_map) != 0 ||
- atomcount > 0 || cb->had_pruneorskip || inassert ||
- (cb->external_options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
+ atomcount > 0 || cb->had_pruneorskip || inassert || !dotstar_anchor)
return FALSE;
}
@@ -10362,6 +10365,10 @@ int regexrc; /* Return from compile */
uint32_t i; /* Local loop counter */
+/* Use the context's optimization flags; enable everything if there is none. */
+uint32_t optim_flags = ccontext != NULL ? ccontext->optimization_flags :
+ PCRE2_OPTIMIZATION_ALL;
+
/* Comments at the head of this file explain about these variables. */
uint32_t stack_groupinfo[GROUPINFO_DEFAULT_SIZE];
@@ -10432,6 +10439,18 @@ if (patlen > ccontext->max_pattern_length)
return NULL;
}
+/* Optimization flags in 'options' can override those in the compile context.
+This is because some options to disable optimizations were added before the
+optimization flags word existed, and we need to continue supporting them
+for backwards compatibility. */
+
+if ((options & PCRE2_NO_AUTO_POSSESS) != 0)
+ optim_flags &= ~PCRE2_OPTIM_AUTO_POSSESS;
+if ((options & PCRE2_NO_DOTSTAR_ANCHOR) != 0)
+ optim_flags &= ~PCRE2_OPTIM_DOTSTAR_ANCHOR;
+if ((options & PCRE2_NO_START_OPTIMIZE) != 0)
+ optim_flags &= ~PCRE2_OPTIM_START_OPTIMIZE;
+
/* From here on, all returns from this function should end up going via the
EXIT label. */
@@ -10568,6 +10587,32 @@ if ((options & PCRE2_LITERAL) == 0)
else limit_depth = c;
skipatstart = ++pp;
break;
+
+ case PSO_OPTMZ:
+ optim_flags &= ~(p->value);
+
+ /* For backward compatibility the three original VERBs to disable
+ optimizations need to also update the corresponding external option. */
+
+ switch(p->value)
+ {
+ case PCRE2_OPTIM_AUTO_POSSESS:
+ cb.external_options |= PCRE2_NO_AUTO_POSSESS;
+ break;
+
+ case PCRE2_OPTIM_DOTSTAR_ANCHOR:
+ cb.external_options |= PCRE2_NO_DOTSTAR_ANCHOR;
+ break;
+
+ case PCRE2_OPTIM_START_OPTIMIZE:
+ cb.external_options |= PCRE2_NO_START_OPTIMIZE;
+ break;
+ }
+
+ break;
+
+ default:
+ PCRE2_UNREACHABLE();
}
break; /* Out of the table scan loop */
}
@@ -10863,6 +10908,7 @@ re->top_bracket = 0;
re->top_backref = 0;
re->name_entry_size = cb.name_entry_size;
re->name_count = cb.names_found;
+re->optimization_flags = optim_flags;
/* The basic block is immediately followed by the name table, and the compiled
code follows after that. */
@@ -11005,7 +11051,7 @@ used in this code because at least one compiler gives a warning about loss of
"const" attribute if the cast (PCRE2_UCHAR *)codestart is used directly in the
function call. */
-if (errorcode == 0 && (re->overall_options & PCRE2_NO_AUTO_POSSESS) == 0)
+if (errorcode == 0 && (optim_flags & PCRE2_OPTIM_AUTO_POSSESS) != 0)
{
PCRE2_UCHAR *temp = (PCRE2_UCHAR *)codestart;
if (PRIV(auto_possessify)(temp, &cb) != 0) errorcode = ERR80;
@@ -11022,17 +11068,17 @@ there are no occurrences of *PRUNE or *SKIP (though there is an option to
disable this case). */
if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
- is_anchored(codestart, 0, &cb, 0, FALSE))
+ is_anchored(codestart, 0, &cb, 0, FALSE, (optim_flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0))
re->overall_options |= PCRE2_ANCHORED;
/* Set up the first code unit or startline flag, the required code unit, and
-then study the pattern. This code need not be obeyed if PCRE2_NO_START_OPTIMIZE
-is set, as the data it would create will not be used. Note that a first code
+then study the pattern. This code can be skipped if PCRE2_OPTIM_START_OPTIMIZE
+is not set, as the data it would create will not be used. Note that a first code
unit (but not the startline flag) is useful for anchored patterns because it
can still give a quick "no match" and also avoid searching for a last code
unit. */
-if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
+if ((optim_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0)
{
int minminlength = 0; /* For minimal minlength from first/required CU */
@@ -11096,7 +11142,7 @@ if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
that disables this case.) */
else if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
- is_startline(codestart, 0, &cb, 0, FALSE))
+ is_startline(codestart, 0, &cb, 0, FALSE, (optim_flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0))
re->flags |= PCRE2_STARTLINE;
/* Handle the "required code unit", if one is set. In the UTF case we can
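Review note: the precedence rule introduced in pcre2_compile() (legacy PCRE2_NO_* compile options can only clear optimization bits, never set them) can be sketched in isolation. This is a standalone approximation, not the library code. The OPTIM_* values mirror the pcre2_internal.h hunk of this patch, while the DEMO_NO_* option bits and the function name effective_optim_flags are placeholders invented for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Internal flag values as defined in the pcre2_internal.h hunk. */
#define OPTIM_AUTO_POSSESS    0x00000001u
#define OPTIM_DOTSTAR_ANCHOR  0x00000002u
#define OPTIM_START_OPTIMIZE  0x00000004u
#define OPTIM_ALL             0x00000007u

/* Placeholder compile-option bits: hypothetical values for this sketch,
   not the real PCRE2_NO_* constants. */
#define DEMO_NO_AUTO_POSSESS    0x10u
#define DEMO_NO_DOTSTAR_ANCHOR  0x20u
#define DEMO_NO_START_OPTIMIZE  0x40u

/* Combine the context's flags with the legacy compile options. With no
   context, every optimization starts out enabled. A legacy option always
   wins, because it only ever removes a bit. */
static uint32_t effective_optim_flags(bool has_context,
  uint32_t context_flags, uint32_t options)
{
  uint32_t flags = has_context ? context_flags : OPTIM_ALL;

  if ((options & DEMO_NO_AUTO_POSSESS) != 0)
    flags &= ~OPTIM_AUTO_POSSESS;
  if ((options & DEMO_NO_DOTSTAR_ANCHOR) != 0)
    flags &= ~OPTIM_DOTSTAR_ANCHOR;
  if ((options & DEMO_NO_START_OPTIMIZE) != 0)
    flags &= ~OPTIM_START_OPTIMIZE;

  return flags;
}
```

One consequence worth noting for the documentation: a context that enabled auto-possessification via PCRE2_AUTO_POSSESS is still overridden when the caller also passes the legacy option, which matches the statement added to the PCRE2_NO_AUTO_POSSESS section.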
diff --git a/src/pcre2_context.c b/src/pcre2_context.c
index 84a967d7a..6b80f9e9f 100644
--- a/src/pcre2_context.c
+++ b/src/pcre2_context.c
@@ -141,7 +141,8 @@ pcre2_compile_context PRIV(default_compile_context) = {
NEWLINE_DEFAULT, /* Newline convention */
PARENS_NEST_LIMIT, /* As it says */
0, /* Extra options */
- MAX_VARLOOKBEHIND /* As it says */
+ MAX_VARLOOKBEHIND, /* As it says */
+ PCRE2_OPTIMIZATION_ALL /* All optimizations enabled */
};
/* The create function copies the default into the new memory, but must
@@ -409,6 +410,42 @@ ccontext->stack_guard_data = user_data;
return 0;
}
+PCRE2_EXP_DEFN int PCRE2_CALL_CONVENTION
+pcre2_set_optimize(pcre2_compile_context *ccontext, uint32_t directive)
+{
+if (ccontext == NULL)
+ return PCRE2_ERROR_NULL;
+
+switch (directive)
+ {
+ case PCRE2_OPTIMIZATION_NONE:
+ ccontext->optimization_flags = 0;
+ break;
+
+ case PCRE2_OPTIMIZATION_FULL:
+ ccontext->optimization_flags = PCRE2_OPTIMIZATION_ALL;
+ break;
+
+ case PCRE2_AUTO_POSSESS:
+ case PCRE2_AUTO_POSSESS_OFF:
+ case PCRE2_DOTSTAR_ANCHOR:
+ case PCRE2_DOTSTAR_ANCHOR_OFF:
+ case PCRE2_START_OPTIMIZE:
+ case PCRE2_START_OPTIMIZE_OFF:
+ /* Even directive numbers switch a bit on; odd numbers switch the same bit off.
+ * 64-65 affect bit 0 (0x1), 66-67 bit 1 (0x2), 68-69 bit 2 (0x4), and so on. */
+ if (directive & 0x1)
+ ccontext->optimization_flags &= ~(1u << ((directive >> 1) - 32));
+ else
+ ccontext->optimization_flags |= 1u << ((directive >> 1) - 32);
+ break;
+
+ default:
+ return PCRE2_ERROR_BADOPTION;
+ }
+
+return 0;
+}
/* ------------ Match context ------------ */
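Review note: the directive-to-bit arithmetic in pcre2_set_optimize() is compact, so a standalone sketch may help when checking it. The code below copies the numbering from the pcre2.h hunk (0, 1, and 64 to 69) and the same shift expression, but it is only an approximation for review purposes. The names apply_directive and the DIR_* macros are invented here, and the -1 return stands in for PCRE2_ERROR_BADOPTION.

```c
#include <stdint.h>

/* Directive numbers as listed in the pcre2.h hunk of this patch. */
#define DIR_NONE                0u
#define DIR_FULL                1u
#define DIR_AUTO_POSSESS       64u
#define DIR_START_OPTIMIZE_OFF 69u   /* last directive currently defined */

#define ALL_FLAGS 0x00000007u

/* Apply one directive to *flags. Returns 0 on success, or -1 (a stand-in
   for PCRE2_ERROR_BADOPTION) when the directive is not recognised. */
static int apply_directive(uint32_t *flags, uint32_t directive)
{
  if (directive == DIR_NONE) { *flags = 0; return 0; }
  if (directive == DIR_FULL) { *flags = ALL_FLAGS; return 0; }

  if (directive < DIR_AUTO_POSSESS || directive > DIR_START_OPTIMIZE_OFF)
    return -1;

  /* 64-65 map to bit 0, 66-67 to bit 1, 68-69 to bit 2. */
  uint32_t bit = 1u << ((directive >> 1) - 32);

  if ((directive & 1u) != 0)
    *flags &= ~bit;   /* odd number: switch the optimization off */
  else
    *flags |= bit;    /* even number: switch the optimization on */

  return 0;
}
```

Starting from all flags set (7), applying 65 and then 69 leaves only the dotstar-anchor bit (2); applying 64 afterwards gives 3. This illustrates the "cumulative, later calls take precedence" wording in the man page.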
diff --git a/src/pcre2_dfa_match.c b/src/pcre2_dfa_match.c
index 3e34c7ca5..d1d33ad5b 100644
--- a/src/pcre2_dfa_match.c
+++ b/src/pcre2_dfa_match.c
@@ -3432,7 +3432,7 @@ if ((re->flags & PCRE2_MODE_MASK) != PCRE2_CODE_UNIT_WIDTH/8)
/* PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART are match-time flags in the
options variable for this function. Users of PCRE2 who are not calling the
function directly would like to have a way of setting these flags, in the same
-way that they can set pcre2_compile() flags like PCRE2_NO_AUTOPOSSESS with
+way that they can set pcre2_compile() flags like PCRE2_NO_AUTO_POSSESS with
constructions like (*NO_AUTOPOSSESS). To enable this, (*NOTEMPTY) and
(*NOTEMPTY_ATSTART) set bits in the pattern's "flag" function which can now be
transferred to the options for this function. The bits are guaranteed to be
@@ -3699,7 +3699,7 @@ for (;;)
these, for testing and for ensuring that all callouts do actually occur.
The optimizations must also be avoided when restarting a DFA match. */
- if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0 &&
+ if ((re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0 &&
(options & PCRE2_DFA_RESTART) == 0)
{
/* If firstline is TRUE, the start of the match is constrained to the first
diff --git a/src/pcre2_internal.h b/src/pcre2_internal.h
index 043d2c563..1b9bdc6a1 100644
--- a/src/pcre2_internal.h
+++ b/src/pcre2_internal.h
@@ -609,6 +609,13 @@ total length of the tables. */
#define ctypes_offset (cbits_offset + cbit_length) /* Character types */
#define TABLES_LENGTH (ctypes_offset + 256)
+/* Private flags used in compile_context.optimization_flags */
+
+#define PCRE2_OPTIM_AUTO_POSSESS 0x00000001u
+#define PCRE2_OPTIM_DOTSTAR_ANCHOR 0x00000002u
+#define PCRE2_OPTIM_START_OPTIMIZE 0x00000004u
+
+#define PCRE2_OPTIMIZATION_ALL 0x00000007u
/* -------------------- Character and string names ------------------------ */
diff --git a/src/pcre2_intmodedep.h b/src/pcre2_intmodedep.h
index a798cdd4f..6c14be8dc 100644
--- a/src/pcre2_intmodedep.h
+++ b/src/pcre2_intmodedep.h
@@ -579,6 +579,7 @@ typedef struct pcre2_real_compile_context {
uint32_t parens_nest_limit;
uint32_t extra_options;
uint32_t max_varlookbehind;
+ uint32_t optimization_flags;
} pcre2_real_compile_context;
/* The real match context structure. */
@@ -646,6 +647,7 @@ typedef struct pcre2_real_code {
uint16_t top_backref; /* Highest numbered back reference */
uint16_t name_entry_size; /* Size (code units) of table entries */
uint16_t name_count; /* Number of name entries in the table */
+ uint32_t optimization_flags; /* Optimizations enabled at compile time */
} pcre2_real_code;
/* The real match data structure. Define ovector as large as it can ever
diff --git a/src/pcre2_jit_compile.c b/src/pcre2_jit_compile.c
index 5de4666d1..328edbcd4 100644
--- a/src/pcre2_jit_compile.c
+++ b/src/pcre2_jit_compile.c
@@ -14474,7 +14474,9 @@ if (!check_opcode_types(common, common->start, ccend))
}
/* Checking flags and updating ovector_start. */
-if (mode == PCRE2_JIT_COMPLETE && (re->flags & PCRE2_LASTSET) != 0 && (re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
+if (mode == PCRE2_JIT_COMPLETE &&
+ (re->flags & PCRE2_LASTSET) != 0 &&
+ (re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0)
{
common->req_char_ptr = common->ovector_start;
common->ovector_start += sizeof(sljit_sw);
@@ -14534,7 +14536,9 @@ memset(common->private_data_ptrs, 0, total_length * sizeof(sljit_s32));
private_data_size = common->cbra_ptr + (re->top_bracket + 1) * sizeof(sljit_sw);
-if ((re->overall_options & PCRE2_ANCHORED) == 0 && (re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0 && !common->has_skip_in_assert_back)
+if ((re->overall_options & PCRE2_ANCHORED) == 0 &&
+ (re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0 &&
+ !common->has_skip_in_assert_back)
detect_early_fail(common, common->start, &private_data_size, 0, 0);
set_private_data_ptrs(common, &private_data_size, ccend);
@@ -14600,7 +14604,7 @@ if ((re->overall_options & PCRE2_ANCHORED) == 0)
mainloop_label = mainloop_entry(common);
continue_match_label = LABEL();
/* Forward search if possible. */
- if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
+ if ((re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0)
{
if (mode == PCRE2_JIT_COMPLETE && fast_forward_first_n_chars(common))
;
@@ -14615,7 +14619,8 @@ if ((re->overall_options & PCRE2_ANCHORED) == 0)
else
continue_match_label = LABEL();
-if (mode == PCRE2_JIT_COMPLETE && re->minlength > 0 && (re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
+if (mode == PCRE2_JIT_COMPLETE && re->minlength > 0 &&
+ (re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0)
{
OP1(SLJIT_MOV, SLJIT_RETURN_REG, 0, SLJIT_IMM, PCRE2_ERROR_NOMATCH);
OP2(SLJIT_ADD, TMP2, 0, STR_PTR, 0, SLJIT_IMM, IN_UCHARS(re->minlength));
diff --git a/src/pcre2_match.c b/src/pcre2_match.c
index f55410394..cb139658e 100644
--- a/src/pcre2_match.c
+++ b/src/pcre2_match.c
@@ -6788,7 +6788,7 @@ if ((re->flags & PCRE2_MODE_MASK) != PCRE2_CODE_UNIT_WIDTH/8)
/* PCRE2_NOTEMPTY and PCRE2_NOTEMPTY_ATSTART are match-time flags in the
options variable for this function. Users of PCRE2 who are not calling the
function directly would like to have a way of setting these flags, in the same
-way that they can set pcre2_compile() flags like PCRE2_NO_AUTOPOSSESS with
+way that they can set pcre2_compile() flags like PCRE2_NO_AUTO_POSSESS with
constructions like (*NO_AUTOPOSSESS). To enable this, (*NOTEMPTY) and
(*NOTEMPTY_ATSTART) set bits in the pattern's "flag" function which we now
transfer to the options for this function. The bits are guaranteed to be
@@ -7326,7 +7326,7 @@ for(;;)
However, there is an option (settable at compile time) that disables these,
for testing and for ensuring that all callouts do actually occur. */
- if ((re->overall_options & PCRE2_NO_START_OPTIMIZE) == 0)
+ if ((re->optimization_flags & PCRE2_OPTIM_START_OPTIMIZE) != 0)
{
/* If firstline is TRUE, the start of the match is constrained to the first
line of a multiline string. That is, the match must be before or at the
diff --git a/src/pcre2test.c b/src/pcre2test.c
index d8f5d6483..7e92ff19e 100644
--- a/src/pcre2test.c
+++ b/src/pcre2test.c
@@ -468,6 +468,7 @@ enum { MOD_CTC, /* Applies to a compile context */
MOD_NL, /* Is a newline value */
MOD_NN, /* Is a number or a name; more than one may occur */
MOD_OPT, /* Is an option bit */
+ MOD_OPTMZ, /* Is an optimization directive */
MOD_SIZ, /* Is a PCRE2_SIZE value */
MOD_STR }; /* Is a string */
@@ -661,6 +662,8 @@ static modstruct modlist[] = {
{ "ascii_digit", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_DIGIT, CO(extra_options) },
{ "ascii_posix", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ASCII_POSIX, CO(extra_options) },
{ "auto_callout", MOD_PAT, MOD_OPT, PCRE2_AUTO_CALLOUT, PO(options) },
+ { "auto_possess", MOD_CTC, MOD_OPTMZ, PCRE2_AUTO_POSSESS, 0 },
+ { "auto_possess_off", MOD_CTC, MOD_OPTMZ, PCRE2_AUTO_POSSESS_OFF, 0 },
{ "bad_escape_is_literal", MOD_CTC, MOD_OPT, PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL, CO(extra_options) },
{ "bincode", MOD_PAT, MOD_CTL, CTL_BINCODE, PO(control) },
{ "bsr", MOD_CTC, MOD_BSR, 0, CO(bsr_convention) },
@@ -688,6 +691,8 @@ static modstruct modlist[] = {
{ "disable_recurseloop_check", MOD_DAT, MOD_OPT, PCRE2_DISABLE_RECURSELOOP_CHECK, DO(options) },
{ "dollar_endonly", MOD_PAT, MOD_OPT, PCRE2_DOLLAR_ENDONLY, PO(options) },
{ "dotall", MOD_PATP, MOD_OPT, PCRE2_DOTALL, PO(options) },
+ { "dotstar_anchor", MOD_CTC, MOD_OPTMZ, PCRE2_DOTSTAR_ANCHOR, 0 },
+ { "dotstar_anchor_off", MOD_CTC, MOD_OPTMZ, PCRE2_DOTSTAR_ANCHOR_OFF, 0 },
{ "dupnames", MOD_PATP, MOD_OPT, PCRE2_DUPNAMES, PO(options) },
{ "endanchored", MOD_PD, MOD_OPT, PCRE2_ENDANCHORED, PD(options) },
{ "escaped_cr_is_lf", MOD_CTC, MOD_OPT, PCRE2_EXTRA_ESCAPED_CR_IS_LF, CO(extra_options) },
@@ -744,6 +749,8 @@ static modstruct modlist[] = {
{ "null_subject", MOD_DAT, MOD_CTL, CTL2_NULL_SUBJECT, DO(control2) },
{ "offset", MOD_DAT, MOD_INT, 0, DO(offset) },
{ "offset_limit", MOD_CTM, MOD_SIZ, 0, MO(offset_limit)},
+ { "optimization_full", MOD_CTC, MOD_OPTMZ, PCRE2_OPTIMIZATION_FULL, 0 },
+ { "optimization_none", MOD_CTC, MOD_OPTMZ, PCRE2_OPTIMIZATION_NONE, 0 },
{ "ovector", MOD_DAT, MOD_INT, 0, DO(oveccount) },
{ "parens_nest_limit", MOD_CTC, MOD_INT, 0, CO(parens_nest_limit) },
{ "partial_hard", MOD_DAT, MOD_OPT, PCRE2_PARTIAL_HARD, DO(options) },
@@ -760,6 +767,8 @@ static modstruct modlist[] = {
{ "regerror_buffsize", MOD_PAT, MOD_INT, 0, PO(regerror_buffsize) },
{ "replace", MOD_PND, MOD_STR, REPLACE_MODSIZE, PO(replacement) },
{ "stackguard", MOD_PAT, MOD_INT, 0, PO(stackguard_test) },
+ { "start_optimize", MOD_CTC, MOD_OPTMZ, PCRE2_START_OPTIMIZE, 0 },
+ { "start_optimize_off", MOD_CTC, MOD_OPTMZ, PCRE2_START_OPTIMIZE_OFF, 0 },
{ "startchar", MOD_PND, MOD_CTL, CTL_STARTCHAR, PO(control) },
{ "startoffset", MOD_DAT, MOD_INT, 0, DO(offset) },
{ "subject_literal", MOD_PATP, MOD_CTL, CTL2_SUBJECT_LITERAL, PO(control2) },
@@ -3884,7 +3893,7 @@ for (;;)
when needed. */
m = modlist + index; /* Save typing */
- if (m->type != MOD_CTL && m->type != MOD_OPT &&
+ if (m->type != MOD_CTL && m->type != MOD_OPT && m->type != MOD_OPTMZ &&
(m->type != MOD_IND || *pp == '='))
{
if (*pp++ != '=')
@@ -3925,6 +3934,21 @@ for (;;)
else *((uint32_t *)field) |= m->value;
break;
+ case MOD_OPTMZ:
+#ifdef SUPPORT_PCRE2_8
+ if (test_mode == PCRE8_MODE)
+ pcre2_set_optimize_8((pcre2_compile_context_8*)field, m->value);
+#endif
+#ifdef SUPPORT_PCRE2_16
+ if (test_mode == PCRE16_MODE)
+ pcre2_set_optimize_16((pcre2_compile_context_16*)field, m->value);
+#endif
+#ifdef SUPPORT_PCRE2_32
+ if (test_mode == PCRE32_MODE)
+ pcre2_set_optimize_32((pcre2_compile_context_32*)field, m->value);
+#endif
+ break;
+
case MOD_BSR:
if (len == 7 && strncmpic(pp, (const uint8_t *)"default", 7) == 0)
{
@@ -4361,6 +4385,33 @@ else fprintf(outfile, "%s%s%s%s%s%s%s%s%s%s%s%s%s%s",
}
+/*************************************************
+* Show optimization flags *
+*************************************************/
+
+/*
+Arguments:
+ flags the optimization flags word
+ before text to print before
+ after text to print after
+
+Returns: nothing
+*/
+
+static void
+show_optimize_flags(uint32_t flags, const char *before, const char *after)
+{
+if (flags == 0) fprintf(outfile, "%s%s", before, after);
+else fprintf(outfile, "%s%s%s%s%s%s%s",
+ before,
+ ((flags & PCRE2_OPTIM_AUTO_POSSESS) != 0) ? "auto_possess" : "",
+ ((flags & PCRE2_OPTIM_AUTO_POSSESS) != 0 && (flags >> 1) != 0) ? "," : "",
+ ((flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0) ? "dotstar_anchor" : "",
+ ((flags & PCRE2_OPTIM_DOTSTAR_ANCHOR) != 0 && (flags >> 2) != 0) ? "," : "",
+ ((flags & PCRE2_OPTIM_START_OPTIMIZE) != 0) ? "start_optimize" : "",
+ after);
+}
+
#ifdef SUPPORT_PCRE2_8
/*************************************************
@@ -4777,6 +4828,9 @@ if ((pat_patctl.control & CTL_INFO) != 0)
if (extra_options != 0)
show_compile_extra_options(extra_options, "Extra options:", "\n");
+ if (FLD(compiled_code, optimization_flags) != PCRE2_OPTIMIZATION_ALL)
+ show_optimize_flags(FLD(compiled_code, optimization_flags), "Optimizations: ", "\n");
+
if (jchanged) fprintf(outfile, "Duplicate name status changes\n");
if ((pat_patctl.control2 & CTL2_BSR_SET) != 0 ||
@@ -4879,7 +4933,7 @@ if ((pat_patctl.control & CTL_INFO) != 0)
}
}
- if ((FLD(compiled_code, overall_options) & PCRE2_NO_START_OPTIMIZE) == 0)
+ if ((FLD(compiled_code, optimization_flags) & PCRE2_OPTIM_START_OPTIMIZE) != 0)
fprintf(outfile, "Subject length lower bound = %d\n", minlength);
if (pat_patctl.jit != 0 && (pat_patctl.control & CTL_JITVERIFY) != 0)
diff --git a/testdata/testinput2 b/testdata/testinput2
index 51e2095c8..5b82f7451 100644
--- a/testdata/testinput2
+++ b/testdata/testinput2
@@ -831,6 +831,16 @@
/x++/IB
+# For comparison with the following test, which disables auto-possessification
+# In this regex, x+ should be converted to x++
+/x+y/B,auto_possess
+
+# In this regex, x+ should not be converted to x++
+/x+y/B,auto_possess_off
+
+# Also in this regex, x+ should not be converted to x++
+/x+y/B,optimization_none
+
/x{1,3}+/B,no_auto_possess
/x{1,3}+/Bi,no_auto_possess
@@ -839,6 +849,8 @@
/[^x]{1,3}+/Bi,no_auto_possess
+/x{1,3}+/IB,auto_possess_off
+
/(x)*+/IB
/^(\w++|\s++)*$/I
@@ -4056,10 +4068,16 @@
/(?(VERSION=10.101)yes|no)/
+# We should see the starting code unit, required code unit, and minimum length set for this regex:
/abcd/I
+# None of the following three should have the starting code unit, required code unit, and minimum length set:
/abcd/I,no_start_optimize
+/abcd/I,start_optimize_off
+
+/abcd/I,optimization_none
+
/(|ab)*?d/I
abd
xyd
@@ -4224,6 +4242,19 @@
/^abc/info,no_dotstar_anchor
+/^abc/info,dotstar_anchor_off
+
+# For comparison with the following tests, which disable automatic dotstar anchoring
+/.*abc/BI
+
+/.*abc/BI,dotstar_anchor_off
+
+/.*abc/BI,start_optimize_off
+
+/.*abc/BI,optimization_none
+
+/.*abc/BI,no_dotstar_anchor
+
/.*\d/info,auto_callout
\= Expect no match
aaa
@@ -6390,6 +6421,27 @@ a)"xI
ab
ac
+# Tests for pcre2_set_optimize()
+
+/abc/I,optimization_none
+
+/abc/I,optimization_none,auto_possess
+
+/abc/I,optimization_none,dotstar_anchor,auto_possess
+
+/abc/I,optimization_none,start_optimize
+
+/abc/I,dotstar_anchor_off,optimization_full
+
+# If pcre2_set_optimize() is used to turn on some optimization, but at the same time,
+# the compile options word turns it off... the compile options word "wins":
+
+/abc/I,no_auto_possess,auto_possess
+
+/abc/I,no_dotstar_anchor,dotstar_anchor
+
+/abc/I,no_start_optimize,start_optimize
+
# --------------
# End of testinput2
diff --git a/testdata/testoutput15 b/testdata/testoutput15
index f36faeeaf..892473bc9 100644
--- a/testdata/testoutput15
+++ b/testdata/testoutput15
@@ -477,6 +477,7 @@ Failed: error -52: nested recursion at the same subject position
------------------------------------------------------------------
Capture group count = 0
Options: no_auto_possess
+Optimizations: dotstar_anchor,start_optimize
Starting code units: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
Subject length lower bound = 1
@@ -501,6 +502,7 @@ No match
Capture group count = 0
Compile options:
Overall options: no_auto_possess
+Optimizations: dotstar_anchor,start_optimize
Starting code units: 0 1 2 3 4 5 6 7 8 9 A B C D E F G H I J K L M N O P
Q R S T U V W X Y Z _ a b c d e f g h i j k l m n o p q r s t u v w x y z
Subject length lower bound = 1
diff --git a/testdata/testoutput2 b/testdata/testoutput2
index eeb635d6d..f1f6a4f50 100644
--- a/testdata/testoutput2
+++ b/testdata/testoutput2
@@ -2942,6 +2942,37 @@ Capture group count = 0
First code unit = 'x'
Subject length lower bound = 1
+# For comparison with the following test, which disables auto-possessification
+# In this regex, x+ should be converted to x++
+/x+y/B,auto_possess
+------------------------------------------------------------------
+ Bra
+ x++
+ y
+ Ket
+ End
+------------------------------------------------------------------
+
+# In this regex, x+ should not be converted to x++
+/x+y/B,auto_possess_off
+------------------------------------------------------------------
+ Bra
+ x+
+ y
+ Ket
+ End
+------------------------------------------------------------------
+
+# Also in this regex, x+ should not be converted to x++
+/x+y/B,optimization_none
+------------------------------------------------------------------
+ Bra
+ x+
+ y
+ Ket
+ End
+------------------------------------------------------------------
+
/x{1,3}+/B,no_auto_possess
------------------------------------------------------------------
Bra
@@ -2978,6 +3009,19 @@ Subject length lower bound = 1
End
------------------------------------------------------------------
+/x{1,3}+/IB,auto_possess_off
+------------------------------------------------------------------
+ Bra
+ x
+ x{0,2}+
+ Ket
+ End
+------------------------------------------------------------------
+Capture group count = 0
+Optimizations: dotstar_anchor,start_optimize
+First code unit = 'x'
+Subject length lower bound = 1
+
/(x)*+/IB
------------------------------------------------------------------
Bra
@@ -13592,15 +13636,26 @@ Failed: error 179 at offset 16: syntax error or number too big in (?(VERSION con
/(?(VERSION=10.101)yes|no)/
Failed: error 179 at offset 16: syntax error or number too big in (?(VERSION condition
+# We should see the starting code unit, required code unit, and minimum length set for this regex:
/abcd/I
Capture group count = 0
First code unit = 'a'
Last code unit = 'd'
Subject length lower bound = 4
+# None of the following three should have the starting code unit, required code unit, and minimum length set:
/abcd/I,no_start_optimize
Capture group count = 0
Options: no_start_optimize
+Optimizations: auto_possess,dotstar_anchor
+
+/abcd/I,start_optimize_off
+Capture group count = 0
+Optimizations: auto_possess,dotstar_anchor
+
+/abcd/I,optimization_none
+Capture group count = 0
+Optimizations:
/(|ab)*?d/I
Capture group count = 1
@@ -13616,6 +13671,7 @@ Subject length lower bound = 1
/(|ab)*?d/I,no_start_optimize
Capture group count = 1
Options: no_start_optimize
+Optimizations: auto_possess,dotstar_anchor
abd
0: abd
1: ab
@@ -13887,9 +13943,81 @@ Subject length lower bound = 3
Capture group count = 0
Compile options: no_dotstar_anchor
Overall options: anchored no_dotstar_anchor
+Optimizations: auto_possess,start_optimize
+First code unit = 'a'
+Subject length lower bound = 3
+
+/^abc/info,dotstar_anchor_off
+Capture group count = 0
+Compile options:
+Overall options: anchored
+Optimizations: auto_possess,start_optimize
First code unit = 'a'
Subject length lower bound = 3
+# For comparison with the following tests, which disable automatic dotstar anchoring
+/.*abc/BI
+------------------------------------------------------------------
+ Bra
+ Any*
+ abc
+ Ket
+ End
+------------------------------------------------------------------
+Capture group count = 0
+First code unit at start or follows newline
+Last code unit = 'c'
+Subject length lower bound = 3
+
+/.*abc/BI,dotstar_anchor_off
+------------------------------------------------------------------
+ Bra
+ Any*
+ abc
+ Ket
+ End
+------------------------------------------------------------------
+Capture group count = 0
+Optimizations: auto_possess,start_optimize
+Last code unit = 'c'
+Subject length lower bound = 3
+
+/.*abc/BI,start_optimize_off
+------------------------------------------------------------------
+ Bra
+ Any*
+ abc
+ Ket
+ End
+------------------------------------------------------------------
+Capture group count = 0
+Optimizations: auto_possess,dotstar_anchor
+
+/.*abc/BI,optimization_none
+------------------------------------------------------------------
+ Bra
+ Any*
+ abc
+ Ket
+ End
+------------------------------------------------------------------
+Capture group count = 0
+Optimizations:
+
+/.*abc/BI,no_dotstar_anchor
+------------------------------------------------------------------
+ Bra
+ Any*
+ abc
+ Ket
+ End
+------------------------------------------------------------------
+Capture group count = 0
+Options: no_dotstar_anchor
+Optimizations: auto_possess,start_optimize
+Last code unit = 'c'
+Subject length lower bound = 3
+
/.*\d/info,auto_callout
Capture group count = 0
Options: auto_callout
@@ -13908,6 +14036,7 @@ No match
/.*\d/info,no_dotstar_anchor,auto_callout
Capture group count = 0
Options: auto_callout no_dotstar_anchor
+Optimizations: auto_possess,start_optimize
Subject length lower bound = 1
\= Expect no match
aaa
@@ -13935,12 +14064,14 @@ Subject length lower bound = 1
/.*\d/dotall,no_dotstar_anchor,info
Capture group count = 0
Options: dotall no_dotstar_anchor
+Optimizations: auto_possess,start_optimize
Subject length lower bound = 1
/(*NO_DOTSTAR_ANCHOR)(?s).*\d/info
Capture group count = 0
Compile options:
Overall options: no_dotstar_anchor
+Optimizations: auto_possess,start_optimize
Subject length lower bound = 1
'^(?:(a)|b)(?(1)A|B)'
@@ -18049,12 +18180,14 @@ Subject length lower bound = 1
/a?(?=b(*COMMIT)c|)d/I,no_start_optimize
Capture group count = 0
Options: no_start_optimize
+Optimizations: auto_possess,dotstar_anchor
bd
No match
/(?=b(*COMMIT)c|)d/I,no_start_optimize
Capture group count = 0
Options: no_start_optimize
+Optimizations: auto_possess,dotstar_anchor
bd
No match
@@ -19060,6 +19193,57 @@ No match
ac
No match
+# Tests for pcre2_set_optimize()
+
+/abc/I,optimization_none
+Capture group count = 0
+Optimizations:
+
+/abc/I,optimization_none,auto_possess
+Capture group count = 0
+Optimizations: auto_possess
+
+/abc/I,optimization_none,dotstar_anchor,auto_possess
+Capture group count = 0
+Optimizations: auto_possess,dotstar_anchor
+
+/abc/I,optimization_none,start_optimize
+Capture group count = 0
+Optimizations: start_optimize
+First code unit = 'a'
+Last code unit = 'c'
+Subject length lower bound = 3
+
+/abc/I,dotstar_anchor_off,optimization_full
+Capture group count = 0
+First code unit = 'a'
+Last code unit = 'c'
+Subject length lower bound = 3
+
+# If pcre2_set_optimize() is used to turn on some optimization, but at the same time,
+# the compile options word turns it off... the compile options word "wins":
+
+/abc/I,no_auto_possess,auto_possess
+Capture group count = 0
+Options: no_auto_possess
+Optimizations: dotstar_anchor,start_optimize
+First code unit = 'a'
+Last code unit = 'c'
+Subject length lower bound = 3
+
+/abc/I,no_dotstar_anchor,dotstar_anchor
+Capture group count = 0
+Options: no_dotstar_anchor
+Optimizations: auto_possess,start_optimize
+First code unit = 'a'
+Last code unit = 'c'
+Subject length lower bound = 3
+
+/abc/I,no_start_optimize,start_optimize
+Capture group count = 0
+Options: no_start_optimize
+Optimizations: auto_possess,dotstar_anchor
+
# --------------
# End of testinput2
diff --git a/testdata/testoutput5 b/testdata/testoutput5
index 1b658f99e..befccd419 100644
--- a/testdata/testoutput5
+++ b/testdata/testoutput5
@@ -474,6 +474,7 @@ Subject length lower bound = 0
Capture group count = 0
Compile options: no_start_optimize utf
Overall options: anchored no_start_optimize utf
+Optimizations: auto_possess,dotstar_anchor
/()()()()()()()()()()
()()()()()()()()()()
diff --git a/testdata/testoutput6 b/testdata/testoutput6
index 283b00da0..63ec1ee29 100644
--- a/testdata/testoutput6
+++ b/testdata/testoutput6
@@ -6860,6 +6860,7 @@ No match
/(abc|def|xyz)/I,no_start_optimize
Capture group count = 1
Options: no_start_optimize
+Optimizations: auto_possess,dotstar_anchor
terhjk;abcdaadsfe
0: abc
the quick xyz brown fox