Documentation for substitution processing changes

PCRE2Project · Sep 17, 2024 · d8b7f31
1 parent 12c8dbc

Showing 7 changed files with 117 additions and 88 deletions.
diff --git a/ChangeLog b/ChangeLog
@@ -91,6 +91,8 @@ Perl.
 
 17. Merged PR478, which disallows \x if not followed by { or a hex digit.
 
+18. Merged PR473, which implements Python-style backrefs in substitutions.
+
 
 Version 10.44 07-June-2024
 --------------------------

diff --git a/doc/html/pcre2api.html b/doc/html/pcre2api.html
@@ -3745,33 +3745,39 @@ <h1>pcre2api man page</h1>
 </P>
 <P>
 Firstly, backslash in a replacement string is interpreted as an escape
-character. The usual forms such as \n or \x{ddd} can be used to specify
-particular character codes, and backslash followed by any non-alphanumeric
-character quotes that character. Extended quoting can be coded using \Q...\E,
-exactly as in pattern strings.
+character. The usual forms such as \x{ddd} can be used to specify particular
+character codes, and backslash followed by any non-alphanumeric character
+quotes that character. Extended quoting can be coded using \Q...\E, exactly
+as in pattern strings.
+</P>
+<P>
+The interpretation of backslash followed by one or more digits is the same as
+in a pattern, which in Perl has some ambiguities. Details are given in the
+<a href="pcre2pattern.html"><b>pcre2pattern</b></a>
+page.
 </P>
 <P>
 There are also four escape sequences for forcing the case of inserted letters.
-The insertion mechanism has three states: no case forcing, force upper case,
-and force lower case. The escape sequences change the current state: \U and
-\L change to upper or lower case forcing, respectively, and \E (when not
-terminating a \Q quoted sequence) reverts to no case forcing. The sequences
-\u and \l force the next character (if it is a letter) to upper or lower
-case, respectively, and then the state automatically reverts to no case
-forcing.
+Case forcing applies to all inserted characters, including those from capture
+groups and letters within \Q...\E quoted sequences. The insertion mechanism
+has three states: no case forcing, force upper case, and force lower case. The
+escape sequences change the current state: \U and \L change to upper or lower
+case forcing, respectively, and \E (when not terminating a \Q quoted
+sequence) reverts to no case forcing. The sequences \u and \l force the next
+character (if it is a letter) to upper or lower case, respectively, and then
+the state automatically reverts to no case forcing.
 </P>
 <P>
 However, if \u is immediately followed by \L or \l is immediately followed
 by \U, the next character's case is forced by the first escape sequence, and
-subsequent characters by the second. This provides a "title casing" facility.
-For example, the string "\u\LheLLo" becomes "Hello".
+subsequent characters by the second. This provides a "title casing" facility
+that can be applied to group captures. For example, if group 1 has captured
+"heLLo", the replacement string "\u\L$1" becomes "Hello".
 </P>
 <P>
-Case forcing applies to all inserted characters, including those from capture
-groups and letters within \Q...\E quoted sequences. If either PCRE2_UTF or
-PCRE2_UCP was set when the pattern was compiled, Unicode properties are used
-for case forcing characters whose code points are greater than 127. However,
-only basic case folding, as determined by the Unicode file
+If either PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
+properties are used for case forcing characters whose code points are greater
+than 127. However, only basic case folding, as determined by the Unicode file
 <b>CaseFolding.txt</b> is supported. PCRE2 does not support language-specific
 special casing rules such as using different lower case Greek sigmas in the
 middle and ends of words (as defined in the Unicode file
@@ -4199,7 +4205,7 @@ <h1>pcre2api man page</h1>
 </P>
 <br><a name="SEC43" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 16 September 2024
+Last updated: 17 September 2024
 <br>
 Copyright &copy; 1997-2024 University of Cambridge.
 <br>

diff --git a/doc/html/pcre2pattern.html b/doc/html/pcre2pattern.html
@@ -439,11 +439,16 @@ <h1>pcre2pattern man page</h1>
   \x{hhh..}   character with hex code hhh..
   \N{U+hhh..} character with Unicode hex code point hhh..
 </pre>
-By default, after \x that is not followed by {, from zero to two hexadecimal
-digits are read (letters can be in upper or lower case). Any number of
-hexadecimal digits may appear between \x{ and }. If a character other than a
-hexadecimal digit appears between \x{ and }, or if there is no terminating },
-an error occurs.
+By default, after \x that is not followed by {, one or two hexadecimal
+digits are read (letters can be in upper or lower case). If the character that 
+follows \x is neither { nor a hexadecimal digit, an error occurs. This is 
+different from Perl's default behaviour, which generates a NUL character, but 
+is in line with Perl's "strict" behaviour. 
+</P>
+<P>
+Any number of hexadecimal digits may appear between \x{ and }. If a character
+other than a hexadecimal digit appears between \x{ and }, or if there is no
+terminating }, an error occurs.
 </P>
 <P>
 Characters whose code points are less than 256 can be defined by either of the
@@ -3944,7 +3949,7 @@ <h1>pcre2pattern man page</h1>
 </P>
 <br><a name="SEC33" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 04 September 2024
+Last updated: 17 September 2024
 <br>
 Copyright &copy; 1997-2024 University of Cambridge.
 <br>

diff --git a/doc/html/pcre2syntax.html b/doc/html/pcre2syntax.html
@@ -102,7 +102,7 @@ <h1>pcre2syntax man page</h1>
   \uhhhh     character with hex code hhhh
   \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX
 </pre>
-When \x is not followed by {, from zero to two hexadecimal digits are read,
+When \x is not followed by {, one or two hexadecimal digits are read,
 but in ALT_BSUX mode \x must be followed by two hexadecimal digits to be
 recognized as a hexadecimal escape; otherwise it matches a literal "x".
 Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits
@@ -650,7 +650,7 @@ <h1>pcre2syntax man page</h1>
 </P>
 <br><a name="SEC33" href="#TOC1">REVISION</a><br>
 <P>
-Last updated: 04 September 2024
+Last updated: 17 September 2024
 <br>
 Copyright &copy; 1997-2024 University of Cambridge.
 <br>

diff --git a/doc/pcre2.txt b/doc/pcre2.txt
@@ -3613,35 +3613,40 @@ CREATING A NEW STRING WITH SUBSTITUTIONS
        When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
 
        Firstly, backslash in a replacement string is interpreted as an  escape
-       character. The usual forms such as \n or \x{ddd} can be used to specify
-       particular  character codes, and backslash followed by any non-alphanu-
-       meric character quotes that character. Extended quoting  can  be  coded
-       using \Q...\E, exactly as in pattern strings.
+       character.  The usual forms such as \x{ddd} can be used to specify par-
+       ticular character codes, and backslash followed by any non-alphanumeric
+       character quotes that character. Extended quoting can  be  coded  using
+       \Q...\E, exactly as in pattern strings.
+
+       The  interpretation  of backslash followed by one or more digits is the
+       same as in a pattern, which in Perl has some ambiguities.  Details  are
+       given in the pcre2pattern page.
 
        There  are  also four escape sequences for forcing the case of inserted
-       letters.  The insertion mechanism has three states:  no  case  forcing,
-       force upper case, and force lower case. The escape sequences change the
-       current state: \U and \L change to upper or lower case forcing, respec-
-       tively,  and  \E (when not terminating a \Q quoted sequence) reverts to
-       no case forcing. The sequences \u and \l force the next  character  (if
-       it  is  a  letter)  to  upper or lower case, respectively, and then the
-       state automatically reverts to no case forcing.
+       letters.  Case forcing applies to all  inserted  characters,  including
+       those  from capture groups and letters within \Q...\E quoted sequences.
+       The insertion mechanism has three states: no case forcing, force  upper
+       case,  and  force  lower  case. The escape sequences change the current
+       state: \U and \L change to upper or lower case  forcing,  respectively,
+       and  \E  (when not terminating a \Q quoted sequence) reverts to no case
+       forcing. The sequences \u and \l force the next character (if it  is  a
+       letter)  to upper or lower case, respectively, and then the state auto-
+       matically reverts to no case forcing.
 
        However, if \u is immediately followed by \L or \l is immediately  fol-
        lowed  by  \U,  the next character's case is forced by the first escape
        sequence, and subsequent characters by the second. This provides a "ti-
-       tle casing" facility.  For  example,  the  string  "\u\LheLLo"  becomes
-       "Hello".
-
-       Case  forcing  applies to all inserted characters, including those from
-       capture groups and letters within \Q...\E quoted sequences.  If  either
-       PCRE2_UTF  or  PCRE2_UCP was set when the pattern was compiled, Unicode
-       properties are used for case forcing characters whose code  points  are
-       greater  than  127.  However, only basic case folding, as determined by
-       the Unicode file CaseFolding.txt is supported. PCRE2 does  not  support
-       language-specific  special  casing  rules such as using different lower
-       case Greek sigmas in the middle and ends of words (as  defined  in  the
-       Unicode file SpecialCasing.txt).
+       tle casing" facility that can be applied to group captures.  For  exam-
+       ple,  if  group 1 has captured "heLLo", the replacement string "\u\L$1"
+       becomes "Hello".
+
+       If either PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled,
+       Unicode properties are used for  case  forcing  characters  whose  code
+       points  are  greater than 127. However, only basic case folding, as de-
+       termined by the Unicode file CaseFolding.txt is supported.  PCRE2  does
+       not  support  language-specific special casing rules such as using dif-
+       ferent lower case Greek sigmas in the middle and ends of words (as  de-
+       fined in the Unicode file SpecialCasing.txt).
 
        Note that case forcing sequences such as \U...\E do not nest. For exam-
        ple,  the  result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
@@ -4030,11 +4035,11 @@ AUTHOR
 
 REVISION
 
-       Last updated: 16 September 2024
+       Last updated: 17 September 2024
        Copyright (c) 1997-2024 University of Cambridge.
 
 
-PCRE2 10.45                    16 September 2024                   PCRE2API(3)
+PCRE2 10.45                    17 September 2024                   PCRE2API(3)
 ------------------------------------------------------------------------------
 
 
@@ -6862,11 +6867,15 @@ BACKSLASH
          \x{hhh..}   character with hex code hhh..
          \N{U+hhh..} character with Unicode hex code point hhh..
 
-       By default, after \x that is not followed by {, from zero to two  hexa-
-       decimal  digits  are  read (letters can be in upper or lower case). Any
-       number of hexadecimal digits may appear between \x{ and }. If a charac-
-       ter other than a hexadecimal digit appears between \x{  and  },  or  if
-       there is no terminating }, an error occurs.
+       By default, after \x that is not followed by {, one or two  hexadecimal
+       digits are read (letters can be in upper or lower case). If the charac-
+       ter  that follows \x is neither { nor a hexadecimal digit, an error oc-
+       curs. This is different from Perl's default behaviour, which  generates
+       a NUL character, but is in line with Perl's "strict" behaviour.
+
+       Any  number  of  hexadecimal  digits may appear between \x{ and }. If a
+       character other than a hexadecimal digit appears between \x{ and },  or
+       if there is no terminating }, an error occurs.
 
        Characters whose code points are less than 256 can be defined by either
        of the two syntaxes for \x or by an octal sequence. There is no differ-
@@ -10146,11 +10155,11 @@ AUTHOR
 
 REVISION
 
-       Last updated: 04 September 2024
+       Last updated: 17 September 2024
        Copyright (c) 1997-2024 University of Cambridge.
 
 
-PCRE2 10.45                    04 Sepbember 2024               PCRE2PATTERN(3)
+PCRE2 10.45                    17 Sepbember 2024               PCRE2PATTERN(3)
 ------------------------------------------------------------------------------
 
 
@@ -11113,12 +11122,12 @@ ESCAPED CHARACTERS
          \uhhhh     character with hex code hhhh
          \u{hh..}   character with hex code hh.. but only for EXTRA_ALT_BSUX
 
-       When  \x  is not followed by {, from zero to two hexadecimal digits are
-       read, but in ALT_BSUX mode \x must be followed by two hexadecimal  dig-
-       its  to  be  recognized as a hexadecimal escape; otherwise it matches a
-       literal "x".  Likewise, if \u (in ALT_BSUX mode)  is  not  followed  by
-       four  hexadecimal  digits or (in EXTRA_ALT_BSUX mode) a sequence of hex
-       digits in curly brackets, it matches a literal "u".
+       When  \x  is not followed by {, one or two hexadecimal digits are read,
+       but in ALT_BSUX mode \x must be followed by two hexadecimal  digits  to
+       be  recognized  as a hexadecimal escape; otherwise it matches a literal
+       "x".  Likewise, if \u (in ALT_BSUX mode) is not followed by four  hexa-
+       decimal  digits or (in EXTRA_ALT_BSUX mode) a sequence of hex digits in
+       curly brackets, it matches a literal "u".
 
        Note that \0dd is always an octal code. The treatment of backslash fol-
        lowed by a non-zero digit is complicated; for details see  the  section
@@ -11651,11 +11660,11 @@ AUTHOR
 
 REVISION
 
-       Last updated: 04 September 2024
+       Last updated: 17 September 2024
        Copyright (c) 1997-2024 University of Cambridge.
 
 
-PCRE2 10.45                    04 September 2024                PCRE2SYNTAX(3)
+PCRE2 10.45                    17 September 2024                PCRE2SYNTAX(3)
 ------------------------------------------------------------------------------
 
 

diff --git a/doc/pcre2api.3 b/doc/pcre2api.3
@@ -1,4 +1,4 @@
-.TH PCRE2API 3 "16 September 2024" "PCRE2 10.45"
+.TH PCRE2API 3 "17 September 2024" "PCRE2 10.45"
 .SH NAME
 PCRE2 - Perl-compatible regular expressions (revised API)
 .sp
@@ -3743,30 +3743,37 @@ and only the group insertion forms listed above are valid. When
 PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
 .P
 Firstly, backslash in a replacement string is interpreted as an escape
-character. The usual forms such as \en or \ex{ddd} can be used to specify
-particular character codes, and backslash followed by any non-alphanumeric
-character quotes that character. Extended quoting can be coded using \eQ...\eE,
-exactly as in pattern strings.
+character. The usual forms such as \ex{ddd} can be used to specify particular
+character codes, and backslash followed by any non-alphanumeric character
+quotes that character. Extended quoting can be coded using \eQ...\eE, exactly
+as in pattern strings.
+.P
+The interpretation of backslash followed by one or more digits is the same as
+in a pattern, which in Perl has some ambiguities. Details are given in the
+.\" HREF
+\fBpcre2pattern\fP
+.\"
+page.
 .P
 There are also four escape sequences for forcing the case of inserted letters.
-The insertion mechanism has three states: no case forcing, force upper case,
-and force lower case. The escape sequences change the current state: \eU and
-\eL change to upper or lower case forcing, respectively, and \eE (when not
-terminating a \eQ quoted sequence) reverts to no case forcing. The sequences
-\eu and \el force the next character (if it is a letter) to upper or lower
-case, respectively, and then the state automatically reverts to no case
-forcing.
+Case forcing applies to all inserted characters, including those from capture
+groups and letters within \eQ...\eE quoted sequences. The insertion mechanism
+has three states: no case forcing, force upper case, and force lower case. The
+escape sequences change the current state: \eU and \eL change to upper or lower
+case forcing, respectively, and \eE (when not terminating a \eQ quoted
+sequence) reverts to no case forcing. The sequences \eu and \el force the next
+character (if it is a letter) to upper or lower case, respectively, and then
+the state automatically reverts to no case forcing.
 .P
 However, if \eu is immediately followed by \eL or \el is immediately followed
 by \eU, the next character's case is forced by the first escape sequence, and
-subsequent characters by the second. This provides a "title casing" facility.
-For example, the string "\eu\eLheLLo" becomes "Hello".
+subsequent characters by the second. This provides a "title casing" facility
+that can be applied to group captures. For example, if group 1 has captured
+"heLLo", the replacement string "\eu\eL$1" becomes "Hello".
 .P
-Case forcing applies to all inserted characters, including those from capture
-groups and letters within \eQ...\eE quoted sequences. If either PCRE2_UTF or
-PCRE2_UCP was set when the pattern was compiled, Unicode properties are used
-for case forcing characters whose code points are greater than 127. However,
-only basic case folding, as determined by the Unicode file
+If either PCRE2_UTF or PCRE2_UCP was set when the pattern was compiled, Unicode
+properties are used for case forcing characters whose code points are greater
+than 127. However, only basic case folding, as determined by the Unicode file
 \fBCaseFolding.txt\fP is supported. PCRE2 does not support language-specific
 special casing rules such as using different lower case Greek sigmas in the
 middle and ends of words (as defined in the Unicode file
@@ -4201,6 +4208,6 @@ Cambridge, England.
 .rs
 .sp
 .nf
-Last updated: 16 September 2024
+Last updated: 17 September 2024
 Copyright (c) 1997-2024 University of Cambridge.
 .fi
diff --git a/doc/pcre2demo.3 b/doc/pcre2demo.3
@@ -1,4 +1,4 @@
-.TH PCRE2DEMO 3 "16 September 2024" "PCRE2 10.44"
+.TH PCRE2DEMO 3 "17 September 2024" "PCRE2 10.44"
 .\"AUTOMATICALLY GENERATED BY PrepareRelease - do not EDIT!
 .SH NAME
 PCRE2DEMO - A demonstration C program for PCRE2
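
The case-forcing rules described in the pcre2api hunks above can be illustrated with a small model. This is only a sketch of the documented state machine, not PCRE2's implementation: the function name `force_case` and the `groups` mapping are hypothetical, only `\U`, `\L`, `\E`, `\u`, `\l` and `$n` are handled, and it assumes that a one-shot `\u` or `\l` is used up by the next inserted character even if that character is not a letter.

```python
def force_case(replacement, groups):
    """Toy model of the documented case-forcing states (not PCRE2 code)."""
    out = []
    mode = None  # persistent state: None, "upper" or "lower"
    once = None  # one-shot state set by \u or \l

    def emit(text):
        nonlocal once
        for ch in text:
            if once is not None:
                # Assumption: the one-shot force is consumed by the next
                # inserted character, whether or not it is a letter.
                ch = ch.upper() if once == "upper" else ch.lower()
                once = None
            elif mode == "upper":
                ch = ch.upper()
            elif mode == "lower":
                ch = ch.lower()
            out.append(ch)

    i = 0
    n = len(replacement)
    while i < n:
        ch = replacement[i]
        if ch == "\\" and i + 1 < n:
            esc = replacement[i + 1]
            i += 2
            if esc == "U":
                mode = "upper"
            elif esc == "L":
                mode = "lower"
            elif esc == "E":
                mode, once = None, None
            elif esc == "u":
                # \u reverts to no forcing unless \L follows immediately
                mode, once = None, "upper"
            elif esc == "l":
                mode, once = None, "lower"
            else:
                # Other escapes are not modelled; emit the character as-is.
                emit(esc)
        elif ch == "$" and i + 1 < n and replacement[i + 1].isdigit():
            j = i + 1
            while j < n and replacement[j].isdigit():
                j += 1
            # Captured text passes through the same case forcing.
            emit(groups[int(replacement[i + 1:j])])
            i = j
        else:
            emit(ch)
            i += 1
    return "".join(out)


print(force_case(r"\u\L$1", {1: "heLLo"}))   # title casing of group 1
print(force_case(r"\Uaa\LBB\Ecc\E", {}))     # \U...\E does not nest
```

Under these assumptions the two calls reproduce the examples given in the documentation text: the first yields "Hello" and the second yields "AAbbcc". Real substitutions go through pcre2_substitute() with PCRE2_SUBSTITUTE_EXTENDED set, which also handles the escapes this model leaves out.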