Add new API function pcre2_set_optimize() for controlling enabled optimizations

It is anticipated that over time, more and more optimizations will be
added to PCRE2, and we want to be able to switch optimizations off/on,
both for testing purposes and to be able to work around bugs in a
released library version.

The number of free bits left in the compile options word is very small.
Hence, we will start putting all optimization enable/disable flags in
a separate word. To switch these off/on, the new API function
pcre2_set_optimize() will be used.

The values which can be passed to pcre2_set_optimize() are
different from the internal flag bit values. The values accepted by
pcre2_set_optimize() are contiguous integers, so there is no
danger of ever running out of them. This means that, in the future, the
internal representation can be changed at any time without breaking
backwards compatibility. Further, the 'directives' passed to
pcre2_set_optimize() are not restricted to controlling a single,
specific optimization. As an example, passing PCRE2_OPTIMIZATION_FULL
will turn on all optimizations supported by whatever version of
PCRE2 the client program happens to be linked with.
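
The separation between public directives and internal flag bits can be sketched as follows. This is only an illustration of the idea described above, not PCRE2's implementation: every `DEMO_` name, the numeric values, and the function `demo_set_optimize` are hypothetical and chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical public directive values: plain consecutive integers.
   These are NOT the values defined in pcre2.h. */
enum {
    DEMO_OPTIMIZATION_NONE = 0,
    DEMO_OPTIMIZATION_FULL,
    DEMO_AUTO_POSSESS,
    DEMO_AUTO_POSSESS_OFF,
    DEMO_DOTSTAR_ANCHOR,
    DEMO_DOTSTAR_ANCHOR_OFF,
    DEMO_START_OPTIMIZE,
    DEMO_START_OPTIMIZE_OFF
};

/* Hypothetical internal flag bits, free to change between releases. */
#define DEMO_FLAG_AUTO_POSSESS   0x1u
#define DEMO_FLAG_DOTSTAR_ANCHOR 0x2u
#define DEMO_FLAG_START_OPTIMIZE 0x4u
#define DEMO_FLAG_ALL (DEMO_FLAG_AUTO_POSSESS | DEMO_FLAG_DOTSTAR_ANCHOR | \
                       DEMO_FLAG_START_OPTIMIZE)

/* Apply one directive to the internal flags word.
   Returns 0 on success, -1 for an unknown directive. */
static int demo_set_optimize(uint32_t *flags, uint32_t directive)
{
    assert(flags != NULL);
    switch (directive) {
    case DEMO_OPTIMIZATION_NONE:  *flags = 0;                          return 0;
    case DEMO_OPTIMIZATION_FULL:  *flags = DEMO_FLAG_ALL;              return 0;
    case DEMO_AUTO_POSSESS:       *flags |= DEMO_FLAG_AUTO_POSSESS;    return 0;
    case DEMO_AUTO_POSSESS_OFF:   *flags &= ~DEMO_FLAG_AUTO_POSSESS;   return 0;
    case DEMO_DOTSTAR_ANCHOR:     *flags |= DEMO_FLAG_DOTSTAR_ANCHOR;  return 0;
    case DEMO_DOTSTAR_ANCHOR_OFF: *flags &= ~DEMO_FLAG_DOTSTAR_ANCHOR; return 0;
    case DEMO_START_OPTIMIZE:     *flags |= DEMO_FLAG_START_OPTIMIZE;  return 0;
    case DEMO_START_OPTIMIZE_OFF: *flags &= ~DEMO_FLAG_START_OPTIMIZE; return 0;
    default:                      return -1;
    }
}
```

Because callers only ever see the directive numbers, a sketch like this could renumber or widen the flag bits without affecting client code, and a "FULL" directive automatically covers any flag added later.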

Co-Authored-By: Carlo Marcelo Arenas Belón <carenas@gmail.com>
Co-Authored-by: Zoltan Herczeg <hzmester@freemail.hu>
3 people committed Sep 16, 2024
1 parent 5e75d9b commit a346039
Showing 25 changed files with 713 additions and 84 deletions.
121 changes: 121 additions & 0 deletions doc/html/pcre2_set_optimize.html
@@ -0,0 +1,121 @@
<html>
<head>
<title>pcre2_set_optimize specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcre2_set_optimize man page</h1>
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
<p>
This page is part of the PCRE2 HTML documentation. It was generated
automatically from the original man page. If there is any nonsense in it,
please consult the man page, in case the conversion went wrong.
<br>
<br><b>
SYNOPSIS
</b><br>
<P>
<b>#include &#60;pcre2.h&#62;</b>
</P>
<P>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
</P>
<br><b>
DESCRIPTION
</b><br>
<P>
This function controls which performance optimizations will be applied
by <b>pcre2_compile</b>. The permitted values of <i>directive</i> are as follows:
<pre>
PCRE2_OPTIMIZATION_NONE
</pre>
Disable all optional performance optimizations.
<pre>
PCRE2_OPTIMIZATION_FULL
</pre>
Enable all optional performance optimizations. This is the default value.
<pre>
PCRE2_AUTO_POSSESS
PCRE2_AUTO_POSSESS_OFF
</pre>
Enable/disable "auto-possessification" of variable quantifiers such as * and +.
This optimization, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
disable this optimization if you want the matching functions to do a full,
unoptimized search and run all the callouts.
<pre>
PCRE2_DOTSTAR_ANCHOR
PCRE2_DOTSTAR_ANCHOR_OFF
</pre>
Enable/disable an optimization that is applied when .* is the first significant
item in a top-level branch of a pattern, and all the other branches also start
with .* or with \A or \G or ^. Such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set for any
^ items. Otherwise, the fact that any match must start either at the start of
the subject or following a newline is remembered. Like other optimizations,
this can cause callouts to be skipped.
</P>
<P>
Dotstar anchor optimization is automatically disabled for .* if it is inside an
atomic group or a capture group that is the subject of a backreference, or if
the pattern contains (*PRUNE) or (*SKIP).
<pre>
PCRE2_START_OPTIMIZE
PCRE2_START_OPTIMIZE_OFF
</pre>
Enable/disable optimizations which cause matching functions to scan the subject
string for specific code unit values before attempting a match. For example, if
it is known that an unanchored match must start with a specific value, the
matching code searches the subject for that value, and fails immediately if it
cannot find it, without actually running the main matching function. This means
that a special item such as (*COMMIT) at the start of a pattern is not
considered until after a suitable starting point for the match has been found.
Also, when callouts or (*MARK) items are in use, these "start-up" optimizations
can cause them to be skipped if the pattern is never actually used. The start-up
optimizations are in effect a pre-scan of the subject that takes place before
the pattern is run.
</P>
<P>
Disabling start-up optimizations ensures that in cases where the result is "no
match", the callouts do occur, and that items such as (*COMMIT) and (*MARK) are
considered at every possible starting position in the subject string.
</P>
<P>
Disabling start-up optimizations may change the outcome of a matching operation.
Consider the pattern
<pre>
(*COMMIT)ABC
</pre>
When this is compiled, PCRE2 records the fact that a match must start with the
character "A". Suppose the subject string is "DEFABC". The start-up
optimization scans along the subject, finds "A" and runs the first match
attempt from there. The (*COMMIT) item means that the pattern must match the
current starting position, which in this case, it does. However, if the same
match is run without start-up optimizations, the initial scan along the subject
string does not happen. The first match attempt is run starting from "D" and
when this fails, (*COMMIT) prevents any further matches being tried, so the
overall result is "no match".
</P>
<P>
Another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
<pre>
(*MARK:1)B(*MARK:2)(X|Y)
</pre>
The minimum length for a match is two characters. If the subject is "XXBB", the
"starting character" optimization skips "XX", then tries to match "BB", which
is long enough. In the process, (*MARK:2) is encountered and remembered. When
the match attempt fails, the next "B" is found, but there is only one character
left, so there are no more attempts, and "no match" is returned with the "last
mark seen" set to "2". Without start-up optimizations, however, matches are
tried at every possible starting position, including at the end of the subject,
where (*MARK:1) is encountered, but there is no "B", so the "last mark seen"
that is returned is "1". In this case, the optimizations do not affect the
overall match result, which is still "no match", but they do affect the
auxiliary information that is returned.
<p>
Return to the <a href="index.html">PCRE2 index page</a>.
</p>
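
The two start-up optimization examples described in the page above can be mimicked with a toy model. The sketch below is plain C and does not use PCRE2 at all: `demo_commit_abc` and `demo_last_mark` are hypothetical helpers that hard-code the patterns (*COMMIT)ABC and (*MARK:1)B(*MARK:2)(X|Y), and the `startup_opt` flag stands in for whether the pre-scan of the subject is enabled. It assumes, as the text describes, that the pre-scan looks for the first required character and checks the minimum match length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy model of (*COMMIT)ABC (hypothetical helper, not PCRE2 code).
   Returns true if the pattern matches. */
static bool demo_commit_abc(const char *subject, bool startup_opt)
{
    size_t start = 0;
    assert(subject != NULL);
    if (startup_opt) {
        /* Pre-scan: any match must begin with 'A'. */
        const char *a = strchr(subject, 'A');
        if (a == NULL) return false;
        start = (size_t)(a - subject);
    }
    /* (*COMMIT) is passed on the first attempt, so if "ABC" does not
       match at this position, no later position is tried. */
    return strncmp(subject + start, "ABC", 3) == 0;
}

/* Toy model of (*MARK:1)B(*MARK:2)(X|Y) (hypothetical helper).
   Returns -1 on a match, otherwise the last mark seen (0 if none). */
static int demo_last_mark(const char *subject, bool startup_opt)
{
    size_t len, pos;
    int last_mark = 0;
    assert(subject != NULL);
    len = strlen(subject);
    for (pos = 0; pos <= len; pos++) {
        if (startup_opt) {
            /* Pre-scan: skip to the next 'B' and stop when fewer than
               two characters (the minimum match length) remain. */
            const char *b = strchr(subject + pos, 'B');
            if (b == NULL) break;
            pos = (size_t)(b - subject);
            if (len - pos < 2) break;
        }
        last_mark = 1;                        /* (*MARK:1) */
        if (subject[pos] != 'B') continue;
        last_mark = 2;                        /* (*MARK:2) */
        if (subject[pos + 1] == 'X' || subject[pos + 1] == 'Y') return -1;
    }
    return last_mark;
}
```

Under these assumptions, calling `demo_commit_abc("DEFABC", ...)` matches only when `startup_opt` is true, and `demo_last_mark("XXBB", ...)` reports mark 2 with the pre-scan and mark 1 without it, which mirrors the behaviour the page describes.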
54 changes: 41 additions & 13 deletions doc/html/pcre2api.html
@@ -179,6 +179,10 @@ <h1>pcre2api man page</h1>
<br>
<b>int pcre2_set_compile_recursion_guard(pcre2_compile_context *<i>ccontext</i>,</b>
<b> int (*<i>guard_function</i>)(uint32_t, void *), void *<i>user_data</i>);</b>
<br>
<br>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
</P>
<br><a name="SEC5" href="#TOC1">PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS</a><br>
<P>
@@ -808,6 +812,7 @@ <h1>pcre2api man page</h1>
The compile time nested parentheses limit
The maximum length of the pattern string
The extra options bits (none set by default)
Which performance optimizations the compiler should apply
</pre>
A compile context is also required if you are using custom memory management.
If none of these apply, just pass NULL as the context argument of
@@ -952,6 +957,24 @@ <h1>pcre2api man page</h1>
nesting, and the second is user data that is set up by the last argument of
<b>pcre2_set_compile_recursion_guard()</b>. The callout function should return
zero if all is well, or non-zero to force an error.
<br>
<br>
<b>int pcre2_set_optimize(pcre2_compile_context *<i>ccontext</i>,</b>
<b> uint32_t <i>directive</i>);</b>
<br>
<br>
PCRE2 can apply various performance optimizations during compilation, in order
to make matching faster. For example, the compiler might convert some regex
constructs into an equivalent construct which <b>pcre2_match</b> can execute
faster. By default, all available optimizations are enabled. However, in rare
cases, one might wish to disable specific optimizations. For example, if it is
known that some optimizations cannot benefit a certain regex, it might be
desirable to disable them, in order to speed up compilation.
</P>
<P>
For details on allowable values of <i>directive</i>, consult the
<a href="pcre2_set_optimize.html"><b>pcre2_set_optimize</b></a>
documentation.
<a name="matchcontext"></a></P>
<br><b>
The match context
@@ -1807,26 +1830,27 @@ <h1>pcre2api man page</h1>
<pre>
PCRE2_NO_AUTO_POSSESS
</pre>
-If this option is set, it disables "auto-possessification", which is an
-optimization that, for example, turns a+b into a++b in order to avoid
+If this (deprecated) option is set, it disables "auto-possessification", which
+is an optimization that, for example, turns a+b into a++b in order to avoid
backtracks into a+ that can never be successful. However, if callouts are in
use, auto-possessification means that some callouts are never taken. You can
set this option if you want the matching functions to do a full unoptimized
search and run all the callouts, but it is mainly provided for testing
-purposes.
+purposes. (It is recommended to use <b>pcre2_set_optimize</b> instead.)
<pre>
PCRE2_NO_DOTSTAR_ANCHOR
</pre>
-If this option is set, it disables an optimization that is applied when .* is
-the first significant item in a top-level branch of a pattern, and all the
-other branches also start with .* or with \A or \G or ^. The optimization is
-automatically disabled for .* if it is inside an atomic group or a capture
-group that is the subject of a backreference, or if the pattern contains
-(*PRUNE) or (*SKIP). When the optimization is not disabled, such a pattern is
-automatically anchored if PCRE2_DOTALL is set for all the .* items and
-PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
-must start either at the start of the subject or following a newline is
+If this (deprecated) option is set, it disables an optimization that is applied
+when .* is the first significant item in a top-level branch of a pattern, and
+all the other branches also start with .* or with \A or \G or ^. The
+optimization is automatically disabled for .* if it is inside an atomic group
+or a capture group that is the subject of a backreference, or if the pattern
+contains (*PRUNE) or (*SKIP). When the optimization is not disabled, such a
+pattern is automatically anchored if PCRE2_DOTALL is set for all the .* items
+and PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any
+match must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
+(It is recommended to use <b>pcre2_set_optimize</b> instead.)
<pre>
PCRE2_NO_START_OPTIMIZE
</pre>
@@ -1870,7 +1894,7 @@ <h1>pcre2api man page</h1>
the overall result is "no match".
</P>
<P>
-As another start-up optimization makes use of a minimum length for a matching
+Another start-up optimization makes use of a minimum length for a matching
subject, which is recorded when possible. Consider the pattern
<pre>
(*MARK:1)B(*MARK:2)(X|Y)
@@ -1886,6 +1910,10 @@ <h1>pcre2api man page</h1>
returned is "1". In this case, the optimizations do not affect the overall
match result, which is still "no match", but they do affect the auxiliary
information that is returned.
<br>
<br>
(It is recommended to use <b>pcre2_set_optimize</b> rather than the
PCRE2_NO_START_OPTIMIZE option.)
<pre>
PCRE2_NO_UTF_CHECK
</pre>
2 changes: 1 addition & 1 deletion doc/html/pcre2pattern.html
@@ -2243,7 +2243,7 @@ <h1>pcre2pattern man page</h1>
PCRE2 has an optimization that automatically "possessifies" certain simple
pattern constructs. For example, the sequence A+B is treated as A++B because
there is no point in backtracking into a sequence of A's when B must follow.
-This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting
+This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or starting
the pattern with (*NO_AUTO_POSSESS).
</P>
<P>
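
The reasoning behind treating A+B as A++B can be made concrete with a small backtracking model. The sketch below is a toy matcher written for this illustration, not PCRE2 code: the hypothetical function `demo_a_plus_b` matches a+b (or a++b when `possessive` is true) at the start of a string and counts how many times it tests for the trailing "b".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy matcher for a+b / a++b anchored at the start of s.
   Hypothetical helper for illustration only.
   *b_attempts receives the number of times 'b' was tested. */
static bool demo_a_plus_b(const char *s, bool possessive, int *b_attempts)
{
    size_t n = 0;
    assert(s != NULL && b_attempts != NULL);
    *b_attempts = 0;
    while (s[n] == 'a') n++;      /* greedy: consume every 'a' */
    if (n == 0) return false;     /* a+ needs at least one 'a' */
    for (;;) {
        (*b_attempts)++;
        if (s[n] == 'b') return true;
        /* Backtrack by giving back one 'a', unless the quantifier is
           possessive or only the mandatory 'a' is left. */
        if (possessive || n == 1) return false;
        n--;
    }
}
```

After backtracking, the character under test is always an "a" that was just given back, so it can never be "b". In this model the subject "aaaa" fails after four tests for a+b but after a single test for a++b, with the same overall result in both cases.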
17 changes: 17 additions & 0 deletions doc/html/pcre2test.html
@@ -681,6 +681,23 @@ <h1>pcre2test man page</h1>
brackets. Setting <b>utf</b> in 16-bit or 32-bit mode also causes pattern and
subject strings to be translated to UTF-16 or UTF-32, respectively, before
being passed to library functions.
<br>
<br>
The following modifiers enable or disable performance optimizations by
calling <b>pcre2_set_optimize()</b> before invoking the regex compiler.
<pre>
optimization_full enable all optional optimizations
optimization_none disable all optional optimizations
auto_possess auto-possessify variable quantifiers
auto_possess_off don't auto-possessify variable quantifiers
dotstar_anchor anchor patterns starting with .*
dotstar_anchor_off don't anchor patterns starting with .*
start_optimize enable pre-scan of subject string
start_optimize_off disable pre-scan of subject string
</pre>
See the
<a href="pcre2_set_optimize.html"><b>pcre2_set_optimize</b></a>
documentation for details on these optimizations.
<a name="controlmodifiers"></a></P>
<br><b>
Setting compilation controls
57 changes: 40 additions & 17 deletions doc/pcre2.txt
@@ -296,6 +296,9 @@ PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
int (*guard_function)(uint32_t, void *), void *user_data);

int pcre2_set_optimize(pcre2_compile_context *ccontext,
uint32_t directive);


PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS

@@ -978,6 +981,21 @@ PCRE2 CONTEXTS
ment of pcre2_set_compile_recursion_guard(). The callout function
should return zero if all is well, or non-zero to force an error.

int pcre2_set_optimize(pcre2_compile_context *ccontext,
uint32_t directive);

PCRE2 can apply various performance optimizations during compilation,
in order to make matching faster. For example, the compiler might con‐
vert some regex constructs into an equivalent construct which
pcre2_match can execute faster. By default, all available optimizations
are enabled. However, in rare cases, one might wish to disable specific
optimizations. For example, if it is known that some optimizations can‐
not benefit a certain regex, it might be desirable to disable them, in
order to speed up compilation.

For details on allowable values of directive, consult the
pcre2_set_optimize documentation.

The match context

A match context is required if you want to:
@@ -1775,31 +1793,33 @@ COMPILING A PATTERN

PCRE2_NO_AUTO_POSSESS

-If this option is set, it disables "auto-possessification", which is an
-optimization that, for example, turns a+b into a++b in order to avoid
-backtracks into a+ that can never be successful. However, if callouts
-are in use, auto-possessification means that some callouts are never
-taken. You can set this option if you want the matching functions to do
-a full unoptimized search and run all the callouts, but it is mainly
-provided for testing purposes.
+If this (deprecated) option is set, it disables "auto-possessifica‐
+tion", which is an optimization that, for example, turns a+b into a++b
+in order to avoid backtracks into a+ that can never be successful. How‐
+ever, if callouts are in use, auto-possessification means that some
+callouts are never taken. You can set this option if you want the
+matching functions to do a full unoptimized search and run all the
+callouts, but it is mainly provided for testing purposes. (It is recom‐
+mended to use pcre2_set_optimize instead.)

PCRE2_NO_DOTSTAR_ANCHOR

-If this option is set, it disables an optimization that is applied when
-.* is the first significant item in a top-level branch of a pattern,
-and all the other branches also start with .* or with \A or \G or ^.
-The optimization is automatically disabled for .* if it is inside an
-atomic group or a capture group that is the subject of a backreference,
-or if the pattern contains (*PRUNE) or (*SKIP). When the optimization
-is not disabled, such a pattern is automatically anchored if
+If this (deprecated) option is set, it disables an optimization that is
+applied when .* is the first significant item in a top-level branch of
+a pattern, and all the other branches also start with .* or with \A or
+\G or ^. The optimization is automatically disabled for .* if it is in‐
+side an atomic group or a capture group that is the subject of a back‐
+reference, or if the pattern contains (*PRUNE) or (*SKIP). When the op‐
+timization is not disabled, such a pattern is automatically anchored if
PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
for any ^ items. Otherwise, the fact that any match must start either
at the start of the subject or following a newline is remembered. Like
-other optimizations, this can cause callouts to be skipped.
+other optimizations, this can cause callouts to be skipped. (It is
+recommended to use pcre2_set_optimize instead.)

PCRE2_NO_START_OPTIMIZE

-This is an option whose main effect is at matching time. It does not
+This is an option whose main effect is at matching time. It does not
change what pcre2_compile() generates, but it does affect the output of
the JIT compiler.

@@ -1838,7 +1858,7 @@ COMPILING A PATTERN
(*COMMIT) prevents any further matches being tried, so the overall re-
sult is "no match".

-As another start-up optimization makes use of a minimum length for a
+Another start-up optimization makes use of a minimum length for a
matching subject, which is recorded when possible. Consider the pattern

(*MARK:1)B(*MARK:2)(X|Y)
@@ -1856,6 +1876,9 @@ COMPILING A PATTERN
the overall match result, which is still "no match", but they do affect
the auxiliary information that is returned.

(It is recommended to use pcre2_set_optimize rather than the
PCRE2_NO_START_OPTIMIZE option.)

PCRE2_NO_UTF_CHECK

When PCRE2_UTF is set, the validity of the pattern as a UTF string is