From 2458ee472b14758d1033af6b4a0e2798f31ab170 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Sat, 15 Jun 2024 17:59:36 -0700
Subject: [PATCH] pcre2grep: add --posix-pattern-file for compatibility with
other grep

Historically, pcre2grep has done minor processing of the patterns read
through the `-f` option: trailing white space is removed from each line
and blank lines are ignored.

As a result, some patterns give different results depending on whether
they were provided through `-e`, through `-f`, or as a parameter on the
command line.

Add a flag that skips that processing, so that a pattern file used with
other grep implementations can be used directly with the same result.
---
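Note for reviewers (not part of the commit message): the difference
between the two modes can be sketched outside pcre2grep roughly as below.
This is only an illustration under stated assumptions. The function
normalize_pattern_line() and its parameters are hypothetical names that do
not exist in src/pcre2grep.c; the real logic lives in read_pattern() and
in the loop that calls it.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of pcre2grep: decide what one line of a
   pattern file contributes. "line" holds the text as read from the file,
   including its terminating '\n' if there was one, and "len" is its
   length. On return, *patlen is the length of the resulting pattern.
   Returns true if the line yields a pattern, false if it is skipped. */
static bool
normalize_pattern_line(const char *line, size_t len, bool posix_mode,
                       size_t *patlen)
{
  assert(line != NULL && patlen != NULL);

  /* The line terminator is never part of the pattern. */
  if (len > 0 && line[len - 1] == '\n') len--;

  if (!posix_mode)
    {
    /* Default behaviour: trim trailing white space. */
    while (len > 0 && isspace((unsigned char)line[len - 1])) len--;
    }

  *patlen = len;

  /* Default behaviour skips blank lines; in POSIX mode an empty line is
     kept as an empty pattern. */
  return posix_mode || len > 0;
}
```

With the 8-byte line "spaces \n" the sketch reports a 6-byte pattern by
default and a 7-byte pattern in POSIX mode. A line holding only "\n" is
skipped by default and kept as an empty pattern in POSIX mode, which is
why Test 156 below matches every input line.
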
ChangeLog | 10 +++--
RunGrepTest | 19 +++++++++
...pcre2_set_max_pattern_compiled_length.html | 6 +--
doc/html/pcre2grep.html | 14 ++++---
doc/pcre2grep.1 | 8 +++-
src/config.h.in | 3 +-
src/pcre2grep.c | 41 +++++++++++++++++--
testdata/grepinputv | 1 +
testdata/grepoutput | 20 +++++++++
9 files changed, 105 insertions(+), 17 deletions(-)
diff --git a/ChangeLog b/ChangeLog
index e2e39bb45..667f70866 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -7,14 +7,18 @@ there is also the log of commit messages.
Version 10.45 xx-xxx-2024
-------------------------
-1. Change 6 of 10.44 broke 32-bit compiles because pcre2test's reporting of
-memory size was changed to the entire compiled data block, instead of just the
-pattern and tables data, so as to align with the new length restriction.
+1. Change 6 of 10.44 broke 32-bit tests because pcre2test's reporting of
+memory size was changed to the entire compiled data block, instead of just the
+pattern and tables data, so as to align with the new length restriction.
Because the block's header contains pointers, this meant the pcre2test output
was different in 32-bit mode. A patch by Carlo reverts to the preevious state
and makes sure that any limit set by pcre2_set_max_pattern_compiled_length()
also avoids the internal struct overhead.
+2. Add --posix-pattern-file to pcre2grep to allow processing of empty patterns
+through the -f option, as well as patterns that end in space characters for
+compatibility with other grep tools.
+
Version 10.44 07-June-2024
--------------------------
diff --git a/RunGrepTest b/RunGrepTest
index c38218710..aba256e04 100755
--- a/RunGrepTest
+++ b/RunGrepTest
@@ -861,6 +861,25 @@ echo "---------------------------- Test 153 -----------------------------" >>tes
(cd $srcdir; $valgrind $vjs $pcre2grep -nA3 --no-group-separator 'four' ./testdata/grepinputx) >>testtrygrep
echo "RC=$?" >>testtrygrep
+echo "---------------------------- Test 154 -----------------------------" >>testtrygrep
+>testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 155 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 156 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file --file $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 157 -----------------------------" >>testtrygrep
+echo "spaces " >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -o --posix-pattern-file --file=$builddir/testtemp1grep ./testdata/grepinputv >testtemp2grep && [ `wc -c >testtrygrep
+echo "RC=$?" >>testtrygrep
# Now compare the results.
diff --git a/doc/html/pcre2_set_max_pattern_compiled_length.html b/doc/html/pcre2_set_max_pattern_compiled_length.html
index ab570cf60..a40f41e45 100644
--- a/doc/html/pcre2_set_max_pattern_compiled_length.html
+++ b/doc/html/pcre2_set_max_pattern_compiled_length.html
@@ -27,9 +27,9 @@ pcre2_set_max_pattern_compiled_length man page
This function sets, in a compile context, the maximum size (in bytes) for the
-memory needed to hold the compiled version of a pattern that is compiled with
-this context. The result is always zero. If a pattern that is passed to
-pcre2_compile() with this context needs more memory, an error is
+memory needed to hold the compiled version of a pattern that is using this
+context. The result is always zero. If a pattern that is passed to
+pcre2_compile() referencing this context needs more memory, an error is
generated. The default is the largest number that a PCRE2_SIZE variable can
hold, which is effectively unlimited.
diff --git a/doc/html/pcre2grep.html b/doc/html/pcre2grep.html
index bd12246ae..d20fa796d 100644
--- a/doc/html/pcre2grep.html
+++ b/doc/html/pcre2grep.html
@@ -391,9 +391,10 @@ pcre2grep man page
command line, no delimiters should be used. What constitutes a newline when
reading the file is the operating system's default interpretation of \n. The
--newline option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+--posix-pattern-file option is also provided. An empty file contains no
patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
If this option is given more than once, all the specified files are read. A
@@ -408,9 +409,9 @@ pcre2grep man page
Read a list of files and/or directories that are to be scanned from the given
file, one per line. What constitutes a newline when reading the file is the
operating system's default. Trailing white space is removed from each line, and
-blank lines are ignored. These paths are processed before any that are listed
-on the command line. The file name can be given as "-" to refer to the standard
-input. If --file and --file-list are both specified as "-",
+blank lines are ignored. These paths are processed before any that are listed on the command
+line. The file name can be given as "-" to refer to the standard input.
+If --file and --file-list are both specified as "-",
patterns are read first. This is useful only when the standard input is a
terminal, from which further lines (the list of files) can be read after an
end-of-file indication. If this option is given more than once, all the
@@ -808,6 +809,9 @@ pcre2grep man page
allowing \w to match Unicode letters and digits.
+--posix-pattern-file
+When patterns are provided with the -f option, do not trim trailing
+spaces or ignore empty lines, matching the behaviour of other grep tools.
-q, --quiet
Work quietly, that is, display nothing except error messages. The exit
status indicates whether or not any matches were found.
diff --git a/doc/pcre2grep.1 b/doc/pcre2grep.1
index ffe9d397b..650f679fb 100644
--- a/doc/pcre2grep.1
+++ b/doc/pcre2grep.1
@@ -337,9 +337,10 @@ Read patterns from the file, one per line. As is the case with patterns on the
command line, no delimiters should be used. What constitutes a newline when
reading the file is the operating system's default interpretation of \en. The
\fB--newline\fP option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+\fB--posix-pattern-file\fP option is also provided. An empty file contains no
patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
.sp
If this option is given more than once, all the specified files are read. A
data line is output if any of the patterns match it. A file name can be given
@@ -701,6 +702,9 @@ option settings within patterns that affect individual classes. For example,
when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
allowing \ew to match Unicode letters and digits.
.TP
+\fB--posix-pattern-file\fP
+When patterns are provided with the \fB-f\fP option, do not trim trailing
+spaces or ignore empty lines, matching the behaviour of other grep tools.
\fB-q\fP, \fB--quiet\fP
Work quietly, that is, display nothing except error messages. The exit
status indicates whether or not any matches were found.
diff --git a/src/config.h.in b/src/config.h.in
index 8249182de..3bb01c83d 100644
--- a/src/config.h.in
+++ b/src/config.h.in
@@ -145,7 +145,8 @@ sure both macros are undefined; an emulation function will then be used. */
/* Define to 1 if you have the <unistd.h> header file. */
#undef HAVE_UNISTD_H
-/* Define to 1 if the compiler supports simple visibility declarations. */
+/* Define to 1 if the compiler supports GCC compatible visibility
+ declarations. */
#undef HAVE_VISIBILITY
/* Define to 1 if you have the header file. */
diff --git a/src/pcre2grep.c b/src/pcre2grep.c
index bb96067f0..be1afc506 100644
--- a/src/pcre2grep.c
+++ b/src/pcre2grep.c
@@ -290,6 +290,7 @@ static BOOL show_total_count = FALSE;
static BOOL silent = FALSE;
static BOOL utf = FALSE;
static BOOL posix_digit = FALSE;
+static BOOL posix_pattern_file = FALSE;
static uint8_t utf8_buffer[8];
@@ -428,6 +429,7 @@ used to identify them. */
#define N_POSIX_DIGIT (-26)
#define N_GROUP_SEPARATOR (-27)
#define N_NO_GROUP_SEPARATOR (-28)
+#define N_POSIX_PATFILE (-29)
static option_item optionlist[] = {
{ OP_NODATA, N_NULL, NULL, "", "terminate options" },
@@ -449,6 +451,7 @@ static option_item optionlist[] = {
{ OP_PATLIST, 'e', &match_patdata, "regex(p)=pattern", "specify pattern (may be used more than once)" },
{ OP_NODATA, 'F', NULL, "fixed-strings", "patterns are sets of newline-separated strings" },
{ OP_FILELIST, 'f', &pattern_files_data, "file=path", "read patterns from file" },
+ { OP_NODATA, N_POSIX_PATFILE, NULL, "posix-pattern-file", "use POSIX semantics for pattern files" },
{ OP_FILELIST, N_FILE_LIST, &file_lists_data, "file-list=path","read files to search from file" },
{ OP_NODATA, N_FOFFSETS, NULL, "file-offsets", "output file offsets, not text" },
{ OP_STRING, N_GROUP_SEPARATOR, &group_separator, "group-separator=text", "set separator between groups of lines" },
@@ -1448,7 +1451,34 @@ while ((c = fgetc(f)) != EOF)
return yield;
}
+/*************************************************
+* Read one pattern from file *
+*************************************************/
+/* Wrap around read_one_line() to make sure any terminating '\n' is not
+included in the pattern and empty patterns are correctly identified.
+
+Arguments:
+ buffer the buffer to read into
+ length on entry, the maximum number of characters to read; on return, the number read
+ f the file
+
+Returns: TRUE if a pattern was read into buffer
+*/
+
+static BOOL
+read_pattern(char *buffer, PCRE2_SIZE *length, FILE *f)
+{
+*buffer = '\0';
+*length = read_one_line(buffer, *length, f);
+if (*length > 0 && buffer[*length-1] == '\n') *length = *length - 1;
+if (!posix_pattern_file && *length > 0 && buffer[*length-1] == '\r')
+ {
+ *length = *length - 1;
+ if (*length == 0) return TRUE;
+ }
+return (*length > 0 || *buffer == '\n');
+}
/*************************************************
* Find end of line *
@@ -3598,6 +3628,7 @@ switch(letter)
case N_NOJIT: use_jit = FALSE; break;
case N_ALLABSK: extra_options |= PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK; break;
case N_NO_GROUP_SEPARATOR: group_separator = NULL; break;
+ case N_POSIX_PATFILE: posix_pattern_file = TRUE; break;
case 'a': binary_files = BIN_TEXT; break;
case 'c': count_only = TRUE; break;
case N_POSIX_DIGIT: posix_digit = TRUE; break;
@@ -3808,11 +3839,15 @@ else
filename = name;
}
-while ((patlen = read_one_line(buffer, sizeof(buffer), f)) > 0)
+while ((patlen = sizeof(buffer)) && read_pattern(buffer, &patlen, f))
{
- while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+ if (!posix_pattern_file)
+ {
+ while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+ }
+
linenumber++;
- if (patlen == 0) continue; /* Skip blank lines */
+ if (!posix_pattern_file && patlen == 0) continue; /* Skip blank lines */
/* Note: this call to add_pattern() puts a pointer to the local variable
"buffer" into the pattern chain. However, that pointer is used only when
diff --git a/testdata/grepinputv b/testdata/grepinputv
index 366d4fb49..1855ba87f 100644
--- a/testdata/grepinputv
+++ b/testdata/grepinputv
@@ -7,3 +7,4 @@ The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
+tailing spaces
diff --git a/testdata/grepoutput b/testdata/grepoutput
index d9233c26a..23ff75c59 100644
--- a/testdata/grepoutput
+++ b/testdata/grepoutput
@@ -464,6 +464,7 @@ The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
+tailing spaces
RC=0
---------------------------- Test 52 ------------------------------
fox [1;31mjumps[0m
@@ -1169,6 +1170,7 @@ The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
+tailing spaces
RC=0
---------------------------- Test 146 -----------------------------
(standard input):A123B
@@ -1253,3 +1255,21 @@ RC=0
36-sixteen
37-seventeen
RC=0
+---------------------------- Test 154 -----------------------------
+RC=1
+---------------------------- Test 155 -----------------------------
+RC=1
+---------------------------- Test 156 -----------------------------
+The quick brown
+fox jumps
+over the lazy dog.
+This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
+A buried feline in the syndicate
+tailing spaces
+RC=0
+---------------------------- Test 157 -----------------------------
+RC=0