Rewrite regexes where common prefix can be pulled out from alternation branches

Thanks to Michael Voříšek for suggesting this optimization.

A new "rewrite" pass has been added to the regex compilation process.
For now, the rewrite pass only optimizes one type of regex: those where
all branches of an alternation construct share a common prefix.

In such cases, we rewrite the regex like so (for example):

    (abc|abd|abe) ⇒ (ab(?:c|d|e))

An extra non-capturing group is not introduced if the alternation
is already within a non-capturing group (one which is not itself
quantified using ?, *, or a similar suffix). In that case, the rewrite
is simply:

    (?:abc|abd|abe) ⇒ ab(?:c|d|e)
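
As a rough illustration of the idea, here is a minimal Python sketch that factors a shared literal prefix out of a list of alternatives. It is not the PCRE2 implementation: the helper name `factor_common_prefix` is hypothetical, it handles only plain literal branches, and it uses Python's `re` module rather than PCRE2 to check that the rewritten pattern matches the same strings.

```python
import os
import re

def factor_common_prefix(branches):
    """Return a regex for the literal alternatives with any shared prefix pulled out.

    Hypothetical helper for illustration; handles plain literal branches only.
    """
    prefix = os.path.commonprefix(branches)  # longest shared leading substring
    if not prefix:
        # Nothing to pull out: emit the alternation unchanged.
        return "(?:" + "|".join(re.escape(b) for b in branches) + ")"
    tails = [re.escape(b[len(prefix):]) for b in branches]
    return re.escape(prefix) + "(?:" + "|".join(tails) + ")"

original = "(?:abc|abd|abe)"
rewritten = factor_common_prefix(["abc", "abd", "abe"])
print(rewritten)  # ab(?:c|d|e)

# Both forms should accept and reject the same subjects.
for subject in ["abc", "abd", "abe", "abf", "ab", "xabd"]:
    assert bool(re.fullmatch(original, subject)) == bool(re.fullmatch(rewritten, subject))
```

The order of the remaining alternatives is preserved, so a backtracking engine still tries the branches in the same sequence.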

In some edge cases, it is possible that rewriting a group with common
alternation prefix might open up the opportunity to pull out more common
prefixes. For example:

    (a(b|c)d|(ab|ac)e)

In that case, if the group '(ab|ac)' were rewritten to pull out the
common prefix, it would then become possible to pull out a common
prefix from the top-level group. However, we do not take advantage of
that opportunity.

Further, we do not perform the rewrite in cases where the prefixes are
semantically equivalent, but parse to a different parsed_pattern
sequence.

Groups which the regex engine might need to backtrack into are never
pulled out, since this could change the order in which the regex
engine considers possible ways of matching the pattern against the
subject string, and could thus change the returned match. For
example, this pattern will not be rewritten:

    ((?:a|b)c|(?:a|b)d)
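
To see why this matters, the following Python sketch shows a case where pulling out a prefix that itself contains alternatives changes the returned match. The patterns are hypothetical ones chosen for illustration (they are not from the test suite), and Python's `re` module is used only as a stand-in for a backtracking engine.

```python
import re

# Hypothetical patterns, chosen so that the two try-orders disagree.
original = r"(?:a|ab)c|(?:a|ab)bcd"
factored = r"(?:a|ab)(?:c|bcd)"  # what a naive prefix extraction would produce

subject = "abcd"
m_original = re.match(original, subject).group()
m_factored = re.match(factored, subject).group()

# original: exhausts the first branch before the second, so 'ab' + 'c' wins.
print(m_original)  # abc
# factored: keeps 'a' and tries every tail first, so 'a' + 'bcd' wins.
print(m_factored)  # abcd
```

Both patterns accept the same set of strings, but the engine finds a different first match, which is exactly the kind of behavior change the rewrite pass must avoid.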

Also, callouts are never extracted even if they form a common prefix
to an alternation. Some backtracking control verbs, like (*SKIP) and
(*COMMIT), are never extracted either.

A different type of rewrite is performed if an alternation construct
matches only single, literal characters:

    (a|b|c) ⇒ ([a-c])
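
The character-class rewrite can be sketched in the same spirit. The Python helper below (the name `alternation_to_class` is hypothetical, and this is not how PCRE2 implements it) collects single-character alternatives and merges runs of three or more consecutive code points into ranges.

```python
import re

def alternation_to_class(chars):
    """Build one character class from single-character alternatives.

    Hypothetical helper for illustration; merges consecutive runs into ranges.
    """
    codes = sorted(set(ord(c) for c in chars))
    parts = []
    i = 0
    while i < len(codes):
        # Extend j to the end of the run of consecutive code points starting at i.
        j = i
        while j + 1 < len(codes) and codes[j + 1] == codes[j] + 1:
            j += 1
        if j - i >= 2:
            # Three or more consecutive characters: emit a range.
            parts.append(re.escape(chr(codes[i])) + "-" + re.escape(chr(codes[j])))
        else:
            parts.extend(re.escape(chr(c)) for c in codes[i:j + 1])
        i = j + 1
    return "[" + "".join(parts) + "]"

print(alternation_to_class("abc"))  # [a-c]
print(alternation_to_class("axb"))  # [abx]
```

Since every alternative consumes exactly one character, the order of the branches cannot affect which match is returned, which is why sorting them is safe in this sketch.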

A new compile option, PCRE2_NO_PATTERN_REWRITE, has been added to
skip the pattern rewrite phase when compiling a pattern.
alexdowad committed Sep 8, 2024
1 parent ef218fb commit 12147f7
Showing 28 changed files with 3,837 additions and 525 deletions.
21 changes: 19 additions & 2 deletions RunTest
@@ -89,7 +89,8 @@ title23="Test 23: \C disabled test"
title24="Test 24: Non-UTF pattern conversion tests"
title25="Test 25: UTF pattern conversion tests"
title26="Test 26: Auto-generated unicode property tests"
-maxtest=26
+title27="Test 27: Pattern rewriter tests"
+maxtest=27
titleheap="Test 'heap': Environment-specific heap tests"

if [ $# -eq 1 -a "$1" = "list" ]; then
@@ -120,6 +121,7 @@ if [ $# -eq 1 -a "$1" = "list" ]; then
echo $title24
echo $title25
echo $title26
+echo $title27
echo ""
echo $titleheap
echo ""
@@ -255,6 +257,7 @@ do23=no
do24=no
do25=no
do26=no
+do27=no
doheap=no

while [ $# -gt 0 ] ; do
@@ -286,6 +289,7 @@ while [ $# -gt 0 ] ; do
24) do24=yes;;
25) do25=yes;;
26) do26=yes;;
+27) do27=yes;;
heap) doheap=yes;;
-8) arg8=yes;;
-16) arg16=yes;;
@@ -437,7 +441,7 @@ if [ $do0 = no -a $do1 = no -a $do2 = no -a $do3 = no -a \
$do12 = no -a $do13 = no -a $do14 = no -a $do15 = no -a \
$do16 = no -a $do17 = no -a $do18 = no -a $do19 = no -a \
$do20 = no -a $do21 = no -a $do22 = no -a $do23 = no -a \
-$do24 = no -a $do25 = no -a $do26 = no -a $doheap = no \
+$do24 = no -a $do25 = no -a $do26 = no -a $do27 = no -a $doheap = no \
]; then
do0=yes
do1=yes
@@ -466,6 +470,7 @@ if [ $do0 = no -a $do1 = no -a $do2 = no -a $do3 = no -a \
do24=yes
do25=yes
do26=yes
+do27=yes
fi

# Handle any explicit skips at this stage, so that an argument list may consist
@@ -898,6 +903,18 @@ for bmode in "$test8" "$test16" "$test32"; do
fi
fi

+# Pattern rewriter tests
+
+if [ $do27 = yes ] ; then
+echo $title27
+if [ $utf -eq 0 ] ; then
+echo " Skipped because UTF-$bits support is not available"
+else
+$sim $valgrind ./pcre2test -q $setstack $bmode $testdata/testinput27 testtry
+checkresult $? 27 ""
+fi
+fi
+
# Manually selected heap tests - output may vary in different environments,
# which is why that are not automatically run.

1 change: 1 addition & 0 deletions doc/pcre2_compile.3
@@ -65,6 +65,7 @@ The primary option bits are:
theses (named ones available)
PCRE2_NO_AUTO_POSSESS Disable auto-possessification
PCRE2_NO_DOTSTAR_ANCHOR Disable automatic anchoring for .*
+PCRE2_NO_PATTERN_REWRITE Disable pattern rewriting optimizations
PCRE2_NO_START_OPTIMIZE Disable match-time start optimizations
PCRE2_NO_UTF_CHECK Do not check the pattern for UTF validity
(only relevant if PCRE2_UTF is set)
12 changes: 12 additions & 0 deletions doc/pcre2api.3
@@ -1768,6 +1768,18 @@ automatically anchored if PCRE2_DOTALL is set for all the .* items and
PCRE2_MULTILINE is not set for any ^ items. Otherwise, the fact that any match
must start either at the start of the subject or following a newline is
remembered. Like other optimizations, this can cause callouts to be skipped.
+.sp
+PCRE2_NO_PATTERN_REWRITE
+.sp
+This option disables all optimizations which occur during the pattern rewriting
+phase (after parsing but before compilation). Pattern rewriting may remove
+redundant items, coalesce items, adjust group structure, or replace some
+constructs with an equivalent construct. Pattern rewriting will never affect
+which strings are and are not matched, or what substrings are captured by
+capture groups. However, since it may change the structure of a pattern,
+if you are tracing the matching process, you might prefer PCRE2 to use the
+original pattern without rewriting. This option is also useful for testing.
+Pattern rewriting is also disabled if PCRE2_AUTO_CALLOUT is set.
.sp
PCRE2_NO_START_OPTIMIZE
.sp
4 changes: 3 additions & 1 deletion doc/pcre2callout.3
@@ -83,7 +83,9 @@ Callouts can be useful for tracking the progress of pattern matching. The
program has a pattern qualifier (/auto_callout) that sets automatic callouts.
When any callouts are present, the output from \fBpcre2test\fP indicates how
the pattern is being matched. This is useful information when you are trying to
-optimize the performance of a particular pattern.
+optimize the performance of a particular pattern. However, note that some
+optimizations which adjust the structure of the pattern are disabled when
+automatic callouts are enabled.
.
.
.SH "MISSING CALLOUTS"
1 change: 1 addition & 0 deletions doc/pcre2syntax.3
@@ -414,6 +414,7 @@ appear. For the first three, d is a decimal number.
(*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
(*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
(*NO_JIT) disable JIT optimization
+(*NO_PATTERN_REWRITE) disable pattern rewriting optimizations (PCRE2_NO_PATTERN_REWRITE)
(*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
(*UTF) set appropriate UTF mode for the library in use
(*UCP) set PCRE2_UCP (use Unicode properties for \ed etc)
1 change: 1 addition & 0 deletions doc/pcre2test.1
@@ -623,6 +623,7 @@ for a description of the effects of these options.
/n no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
+no_pattern_rewrite set PCRE2_NO_PATTERN_REWRITE
no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP
1 change: 1 addition & 0 deletions doc/pcre2test.txt
@@ -604,6 +604,7 @@ PATTERN MODIFIERS
/n no_auto_capture set PCRE2_NO_AUTO_CAPTURE
no_auto_possess set PCRE2_NO_AUTO_POSSESS
no_dotstar_anchor set PCRE2_NO_DOTSTAR_ANCHOR
+no_pattern_rewrite set PCRE2_NO_PATTERN_REWRITE
no_start_optimize set PCRE2_NO_START_OPTIMIZE
no_utf_check set PCRE2_NO_UTF_CHECK
ucp set PCRE2_UCP
1 change: 1 addition & 0 deletions src/pcre2.h.generic
@@ -143,6 +143,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_EXTENDED_MORE 0x01000000u /* C */
#define PCRE2_LITERAL 0x02000000u /* C */
#define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
+#define PCRE2_NO_PATTERN_REWRITE 0x08000000u /* C */

/* An additional compile options word is available in the compile context. */

1 change: 1 addition & 0 deletions src/pcre2.h.in
@@ -143,6 +143,7 @@ D is inspected during pcre2_dfa_match() execution
#define PCRE2_EXTENDED_MORE 0x01000000u /* C */
#define PCRE2_LITERAL 0x02000000u /* C */
#define PCRE2_MATCH_INVALID_UTF 0x04000000u /* J M D */
+#define PCRE2_NO_PATTERN_REWRITE 0x08000000u /* C */

/* An additional compile options word is available in the compile context. */
