From 2458ee472b14758d1033af6b4a0e2798f31ab170 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?= <carenas@gmail.com>
Date: Sat, 15 Jun 2024 17:59:36 -0700
Subject: [PATCH] pcre2grep: add --posix-pattern-file for compatibility with
 other grep

Historically, pcre2grep has done minor processing of the patterns that
were read through the `-f` option.

The end result is that some patterns produce different results
depending on whether they were provided through `-e`, `-f`, or as a
parameter on the command line.

Add a flag that can be provided to skip that processing, so that a
pattern file used with other grep implementations can be used directly
and give the same result.
---
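The default `-f` processing that the new flag turns off (trailing white
space removed, blank lines skipped) can be approximated outside
pcre2grep, which may help when comparing results with another grep. The
sketch below is only an illustration and assumes a POSIX `sed`; the
function name `legacy_normalize` is hypothetical and is not part of
pcre2grep or of this patch.

```shell
#!/bin/sh
# Hypothetical helper (not part of pcre2grep): approximate the default -f
# preprocessing by stripping trailing white space from each line and then
# dropping the lines that end up empty.
legacy_normalize() {
  sed -e 's/[[:space:]]*$//' -e '/^$/d'
}

# Example: of the four input lines, only "spaces" and "cat" survive.
printf 'spaces \n\n\t\ncat\n' | legacy_normalize
```

With --posix-pattern-file the file is used as written instead, so the
trailing space in "spaces " stays part of the pattern and each empty
line is treated as an empty pattern.
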
 ChangeLog                                     | 10 +++--
 RunGrepTest                                   | 19 +++++++++
 ...pcre2_set_max_pattern_compiled_length.html |  6 +--
 doc/html/pcre2grep.html                       | 14 ++++---
 doc/pcre2grep.1                               |  8 +++-
 src/config.h.in                               |  3 +-
 src/pcre2grep.c                               | 41 +++++++++++++++++--
 testdata/grepinputv                           |  1 +
 testdata/grepoutput                           | 20 +++++++++
 9 files changed, 105 insertions(+), 17 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index e2e39bb45..667f70866 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -7,14 +7,18 @@ there is also the log of commit messages.
 Version 10.45 xx-xxx-2024
 -------------------------
 
-1. Change 6 of 10.44 broke 32-bit compiles because pcre2test's reporting of 
-memory size was changed to the entire compiled data block, instead of just the 
-pattern and tables data, so as to align with the new length restriction. 
+1. Change 6 of 10.44 broke 32-bit tests because pcre2test's reporting of
+memory size was changed to the entire compiled data block, instead of just the
+pattern and tables data, so as to align with the new length restriction.
 Because the block's header contains pointers, this meant the pcre2test output
 was different in 32-bit mode. A patch by Carlo reverts to the preevious state
 and makes sure that any limit set by pcre2_set_max_pattern_compiled_length()
 also avoids the internal struct overhead.
 
+2. Add --posix-pattern-file to pcre2grep to allow processing of empty patterns
+through the -f option, as well as patterns that end in space characters, for
+compatibility with other grep tools.
+
 
 Version 10.44 07-June-2024
 --------------------------
diff --git a/RunGrepTest b/RunGrepTest
index c38218710..aba256e04 100755
--- a/RunGrepTest
+++ b/RunGrepTest
@@ -861,6 +861,25 @@ echo "---------------------------- Test 153 -----------------------------" >>tes
 (cd $srcdir; $valgrind $vjs $pcre2grep -nA3 --no-group-separator 'four' ./testdata/grepinputx) >>testtrygrep
 echo "RC=$?" >>testtrygrep
 
+echo "---------------------------- Test 154 -----------------------------" >>testtrygrep
+>testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 155 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 156 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file --file $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 157 -----------------------------" >>testtrygrep
+echo "spaces " >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -o --posix-pattern-file --file=$builddir/testtemp1grep ./testdata/grepinputv >testtemp2grep && [ `wc -c <testtemp2grep` -eq 8 ]) >>testtrygrep
+echo "RC=$?" >>testtrygrep
 
 # Now compare the results.
 
diff --git a/doc/html/pcre2_set_max_pattern_compiled_length.html b/doc/html/pcre2_set_max_pattern_compiled_length.html
index ab570cf60..a40f41e45 100644
--- a/doc/html/pcre2_set_max_pattern_compiled_length.html
+++ b/doc/html/pcre2_set_max_pattern_compiled_length.html
@@ -27,9 +27,9 @@ <h1>pcre2_set_max_pattern_compiled_length man page</h1>
 </b><br>
 <P>
 This function sets, in a compile context, the maximum size (in bytes) for the
-memory needed to hold the compiled version of a pattern that is compiled with
-this context. The result is always zero. If a pattern that is passed to
-<b>pcre2_compile()</b> with this context needs more memory, an error is
+memory needed to hold the compiled version of a pattern that is using this
+context. The result is always zero. If a pattern that is passed to
+<b>pcre2_compile()</b> referencing this context needs more memory, an error is
 generated. The default is the largest number that a PCRE2_SIZE variable can
 hold, which is effectively unlimited.
 </P>
diff --git a/doc/html/pcre2grep.html b/doc/html/pcre2grep.html
index bd12246ae..d20fa796d 100644
--- a/doc/html/pcre2grep.html
+++ b/doc/html/pcre2grep.html
@@ -391,9 +391,10 @@ <h1>pcre2grep man page</h1>
 command line, no delimiters should be used. What constitutes a newline when
 reading the file is the operating system's default interpretation of \n. The
 <b>--newline</b> option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+<b>--posix-pattern-file</b> option is also provided. An empty file contains no
 patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
 <br>
 <br>
 If this option is given more than once, all the specified files are read. A
@@ -408,9 +409,9 @@ <h1>pcre2grep man page</h1>
 Read a list of files and/or directories that are to be scanned from the given
 file, one per line. What constitutes a newline when reading the file is the
 operating system's default. Trailing white space is removed from each line, and
-blank lines are ignored. These paths are processed before any that are listed
-on the command line. The file name can be given as "-" to refer to the standard
-input. If <b>--file</b> and <b>--file-list</b> are both specified as "-",
+blank lines are ignored. These paths are processed before any that are listed on the command
+line. The file name can be given as "-" to refer to the standard input.
+If <b>--file</b> and <b>--file-list</b> are both specified as "-",
 patterns are read first. This is useful only when the standard input is a
 terminal, from which further lines (the list of files) can be read after an
 end-of-file indication. If this option is given more than once, all the
@@ -808,6 +809,9 @@ <h1>pcre2grep man page</h1>
 allowing \w to match Unicode letters and digits.
 </P>
 <P>
+<b>--posix-pattern-file</b>
+When patterns are provided with the <b>-f</b> option, do not trim trailing
+spaces or ignore empty lines, matching the behaviour of other grep tools.
 <b>-q</b>, <b>--quiet</b>
 Work quietly, that is, display nothing except error messages. The exit
 status indicates whether or not any matches were found.
diff --git a/doc/pcre2grep.1 b/doc/pcre2grep.1
index ffe9d397b..650f679fb 100644
--- a/doc/pcre2grep.1
+++ b/doc/pcre2grep.1
@@ -337,9 +337,10 @@ Read patterns from the file, one per line. As is the case with patterns on the
 command line, no delimiters should be used. What constitutes a newline when
 reading the file is the operating system's default interpretation of \en. The
 \fB--newline\fP option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+\fB--posix-pattern-file\fP option is also provided. An empty file contains no
 patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
 .sp
 If this option is given more than once, all the specified files are read. A
 data line is output if any of the patterns match it. A file name can be given
@@ -701,6 +702,9 @@ option settings within patterns that affect individual classes. For example,
 when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
 allowing \ew to match Unicode letters and digits.
 .TP
+\fB--posix-pattern-file\fP
+When patterns are provided with the \fB-f\fP option, do not trim trailing
+spaces or ignore empty lines, matching the behaviour of other grep tools.
 \fB-q\fP, \fB--quiet\fP
 Work quietly, that is, display nothing except error messages. The exit
 status indicates whether or not any matches were found.
diff --git a/src/config.h.in b/src/config.h.in
index 8249182de..3bb01c83d 100644
--- a/src/config.h.in
+++ b/src/config.h.in
@@ -145,7 +145,8 @@ sure both macros are undefined; an emulation function will then be used. */
 /* Define to 1 if you have the <unistd.h> header file. */
 #undef HAVE_UNISTD_H
 
-/* Define to 1 if the compiler supports simple visibility declarations. */
+/* Define to 1 if the compiler supports GCC compatible visibility
+   declarations. */
 #undef HAVE_VISIBILITY
 
 /* Define to 1 if you have the <wchar.h> header file. */
diff --git a/src/pcre2grep.c b/src/pcre2grep.c
index bb96067f0..be1afc506 100644
--- a/src/pcre2grep.c
+++ b/src/pcre2grep.c
@@ -290,6 +290,7 @@ static BOOL show_total_count = FALSE;
 static BOOL silent = FALSE;
 static BOOL utf = FALSE;
 static BOOL posix_digit = FALSE;
+static BOOL posix_pattern_file = FALSE;
 
 static uint8_t utf8_buffer[8];
 
@@ -428,6 +429,7 @@ used to identify them. */
 #define N_POSIX_DIGIT  (-26)
 #define N_GROUP_SEPARATOR (-27)
 #define N_NO_GROUP_SEPARATOR (-28)
+#define N_POSIX_PATFILE (-29)
 
 static option_item optionlist[] = {
   { OP_NODATA,     N_NULL,   NULL,              "",              "terminate options" },
@@ -449,6 +451,7 @@ static option_item optionlist[] = {
   { OP_PATLIST,    'e',      &match_patdata,    "regex(p)=pattern", "specify pattern (may be used more than once)" },
   { OP_NODATA,     'F',      NULL,              "fixed-strings", "patterns are sets of newline-separated strings" },
   { OP_FILELIST,   'f',      &pattern_files_data, "file=path",   "read patterns from file" },
+  { OP_NODATA, N_POSIX_PATFILE, NULL,           "posix-pattern-file", "use POSIX semantics for pattern files" },
   { OP_FILELIST,   N_FILE_LIST, &file_lists_data, "file-list=path","read files to search from file" },
   { OP_NODATA,     N_FOFFSETS, NULL,            "file-offsets",  "output file offsets, not text" },
   { OP_STRING,     N_GROUP_SEPARATOR, &group_separator, "group-separator=text", "set separator between groups of lines" },
@@ -1448,7 +1451,34 @@ while ((c = fgetc(f)) != EOF)
 return yield;
 }
 
+/*************************************************
+*           Read one pattern from file           *
+*************************************************/
 
+/* Wrap around read_one_line() to make sure any terminating '\n' is not
+included in the pattern and empty patterns are correctly identified.
+
+Arguments:
+  buffer     the buffer to read into
+  length     in: the buffer size; out: the number of characters read
+  f          the file
+
+Returns:     TRUE if a pattern was read into buffer
+*/
+
+static BOOL
+read_pattern(char *buffer, PCRE2_SIZE *length, FILE *f)
+{
+*buffer = '\0';
+*length = read_one_line(buffer, *length, f);
+if (*length > 0 && buffer[*length-1] == '\n') *length = *length - 1;
+if (!posix_pattern_file && *length > 0 && buffer[*length-1] == '\r')
+  {
+  *length = *length - 1;
+  if (*length == 0) return TRUE;
+  }
+return (*length > 0 || *buffer == '\n');
+}
 
 /*************************************************
 *             Find end of line                   *
@@ -3598,6 +3628,7 @@ switch(letter)
   case N_NOJIT: use_jit = FALSE; break;
   case N_ALLABSK: extra_options |= PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK; break;
   case N_NO_GROUP_SEPARATOR: group_separator = NULL; break;
+  case N_POSIX_PATFILE: posix_pattern_file = TRUE; break;
   case 'a': binary_files = BIN_TEXT; break;
   case 'c': count_only = TRUE; break;
   case N_POSIX_DIGIT: posix_digit = TRUE; break;
@@ -3808,11 +3839,15 @@ else
   filename = name;
   }
 
-while ((patlen = read_one_line(buffer, sizeof(buffer), f)) > 0)
+while ((patlen = sizeof(buffer)) && read_pattern(buffer, &patlen, f))
   {
-  while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+  if (!posix_pattern_file)
+    {
+    while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+    }
+
   linenumber++;
-  if (patlen == 0) continue;   /* Skip blank lines */
+  if (!posix_pattern_file && patlen == 0) continue; /* Skip blank lines */
 
   /* Note: this call to add_pattern() puts a pointer to the local variable
   "buffer" into the pattern chain. However, that pointer is used only when
diff --git a/testdata/grepinputv b/testdata/grepinputv
index 366d4fb49..1855ba87f 100644
--- a/testdata/grepinputv
+++ b/testdata/grepinputv
@@ -7,3 +7,4 @@ The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+tailing spaces 
diff --git a/testdata/grepoutput b/testdata/grepoutput
index d9233c26a..23ff75c59 100644
--- a/testdata/grepoutput
+++ b/testdata/grepoutput
@@ -464,6 +464,7 @@ The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+tailing spaces 
 RC=0
 ---------------------------- Test 52 ------------------------------
 fox [1;31mjumps[0m
@@ -1169,6 +1170,7 @@ The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+tailing spaces 
 RC=0
 ---------------------------- Test 146 -----------------------------
 (standard input):A123B
@@ -1253,3 +1255,21 @@ RC=0
 36-sixteen
 37-seventeen
 RC=0
+---------------------------- Test 154 -----------------------------
+RC=1
+---------------------------- Test 155 -----------------------------
+RC=1
+---------------------------- Test 156 -----------------------------
+The quick brown
+fox jumps
+over the lazy dog.
+This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
+A buried feline in the syndicate
+tailing spaces 
+RC=0
+---------------------------- Test 157 -----------------------------
+RC=0