From 2458ee472b14758d1033af6b4a0e2798f31ab170 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Carlo=20Marcelo=20Arenas=20Bel=C3=B3n?=
Date: Sat, 15 Jun 2024 17:59:36 -0700
Subject: [PATCH] pcre2grep: add --posix-pattern-file for compatibility with
 other grep

Historically, pcre2grep has done minor processing of the patterns that
were read through the `-f` option. The end result is that some patterns
give different results depending on whether they were provided through
`-e`, `-f` or as a parameter on the command line.

Add a flag that skips that processing, so that a pattern file written
for other grep implementations can be used directly with the same
result.
---
 ChangeLog                                     | 10 +++--
 RunGrepTest                                   | 19 +++++++++
 ...pcre2_set_max_pattern_compiled_length.html |  6 +--
 doc/html/pcre2grep.html                       | 14 ++++---
 doc/pcre2grep.1                               |  8 +++-
 src/config.h.in                               |  3 +-
 src/pcre2grep.c                               | 41 +++++++++++++++++--
 testdata/grepinputv                           |  1 +
 testdata/grepoutput                           | 20 +++++++++
 9 files changed, 105 insertions(+), 17 deletions(-)

diff --git a/ChangeLog b/ChangeLog
index e2e39bb45..667f70866 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -7,14 +7,18 @@ there is also the log of commit messages.
 Version 10.45 xx-xxx-2024
 -------------------------
 
-1. Change 6 of 10.44 broke 32-bit compiles because pcre2test's reporting of
-memory size was changed to the entire compiled data block, instead of just the
-pattern and tables data, so as to align with the new length restriction.
+1. Change 6 of 10.44 broke 32-bit tests because pcre2test's reporting of
+memory size was changed to the entire compiled data block, instead of just the
+pattern and tables data, so as to align with the new length restriction.
 Because the block's header contains pointers, this meant the pcre2test output
 was different in 32-bit mode. A patch by Carlo reverts to the preevious state
 and makes sure that any limit set by pcre2_set_max_pattern_compiled_length()
 also avoids the internal struct overhead.
 
+2. Add --posix-pattern-file to pcre2grep to allow processing of empty patterns
+through the -f option, as well as patterns that end in space characters for
+compatibility with other grep tools.
+
 Version 10.44 07-June-2024
 --------------------------
 
diff --git a/RunGrepTest b/RunGrepTest
index c38218710..aba256e04 100755
--- a/RunGrepTest
+++ b/RunGrepTest
@@ -861,6 +861,25 @@ echo "---------------------------- Test 153 -----------------------------" >>tes
 (cd $srcdir; $valgrind $vjs $pcre2grep -nA3 --no-group-separator 'four' ./testdata/grepinputx) >>testtrygrep
 echo "RC=$?" >>testtrygrep
 
+echo "---------------------------- Test 154 -----------------------------" >>testtrygrep
+>testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 155 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 156 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file --file $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 157 -----------------------------" >>testtrygrep
+echo "spaces " >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -o --posix-pattern-file --file=$builddir/testtemp1grep ./testdata/grepinputv >testtemp2grep && [ `wc -c >testtrygrep
+echo "RC=$?" >>testtrygrep
 
 # Now compare the results.
 
diff --git a/doc/html/pcre2_set_max_pattern_compiled_length.html b/doc/html/pcre2_set_max_pattern_compiled_length.html
index ab570cf60..a40f41e45 100644
--- a/doc/html/pcre2_set_max_pattern_compiled_length.html
+++ b/doc/html/pcre2_set_max_pattern_compiled_length.html
@@ -27,9 +27,9 @@

pcre2_set_max_pattern_compiled_length man page


 This function sets, in a compile context, the maximum size (in bytes) for the
-memory needed to hold the compiled version of a pattern that is compiled with
-this context. The result is always zero. If a pattern that is passed to
-pcre2_compile() with this context needs more memory, an error is
+memory needed to hold the compiled version of a pattern that is using this
+context. The result is always zero. If a pattern that is passed to
+pcre2_compile() referencing this context needs more memory, an error is
 generated. The default is the largest number that a PCRE2_SIZE variable can
 hold, which is effectively unlimited.

diff --git a/doc/html/pcre2grep.html b/doc/html/pcre2grep.html
index bd12246ae..d20fa796d 100644
--- a/doc/html/pcre2grep.html
+++ b/doc/html/pcre2grep.html
@@ -391,9 +391,10 @@

pcre2grep man page

 command line, no delimiters should be used. What constitutes a newline when
 reading the file is the operating system's default interpretation of \n. The
 --newline option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+--posix-pattern-file option is also provided. An empty file contains no
 patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.

 If this option is given more than once, all the specified files are read. A
@@ -408,9 +409,9 @@

pcre2grep man page

 Read a list of files and/or directories that are to be scanned from the given
 file, one per line. What constitutes a newline when reading the file is the
 operating system's default. Trailing white space is removed from each line, and
-blank lines are ignored. These paths are processed before any that are listed
-on the command line. The file name can be given as "-" to refer to the standard
-input. If --file and --file-list are both specified as "-",
+blank lines are ignored. These paths are processed before any that are listed on the command
+line. The file name can be given as "-" to refer to the standard input.
+If --file and --file-list are both specified as "-",
 patterns are read first. This is useful only when the standard input is a
 terminal, from which further lines (the list of files) can be read after an
 end-of-file indication. If this option is given more than once, all the
@@ -808,6 +809,9 @@

pcre2grep man page

allowing \w to match Unicode letters and digits.

+--posix-pattern-file
+When patterns are provided with the -f option, do not trim trailing
+spaces or ignore empty lines, for compatibility with other grep tools.
 -q, --quiet
 Work quietly, that is, display nothing except error messages. The exit status
 indicates whether or not any matches were found.
diff --git a/doc/pcre2grep.1 b/doc/pcre2grep.1
index ffe9d397b..650f679fb 100644
--- a/doc/pcre2grep.1
+++ b/doc/pcre2grep.1
@@ -337,9 +337,10 @@ Read patterns from the file, one per line. As is the case with patterns on the
 command line, no delimiters should be used. What constitutes a newline when
 reading the file is the operating system's default interpretation of \en. The
 \fB--newline\fP option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+\fB--posix-pattern-file\fP option is also provided. An empty file contains no
 patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
 .sp
 If this option is given more than once, all the specified files are read. A
 data line is output if any of the patterns match it. A file name can be given
@@ -701,6 +702,9 @@ option settings within patterns that affect individual classes. For example,
 when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
 allowing \ew to match Unicode letters and digits.
 .TP
+\fB--posix-pattern-file\fP
+When patterns are provided with the \fB-f\fP option, do not trim trailing
+spaces or ignore empty lines, for compatibility with other grep tools.
 \fB-q\fP, \fB--quiet\fP
 Work quietly, that is, display nothing except error messages. The exit status
 indicates whether or not any matches were found.
diff --git a/src/config.h.in b/src/config.h.in
index 8249182de..3bb01c83d 100644
--- a/src/config.h.in
+++ b/src/config.h.in
@@ -145,7 +145,8 @@ sure both macros are undefined; an emulation function will then be used. */
 /* Define to 1 if you have the header file. */
 #undef HAVE_UNISTD_H
 
-/* Define to 1 if the compiler supports simple visibility declarations. */
+/* Define to 1 if the compiler supports GCC compatible visibility
+   declarations. */
 #undef HAVE_VISIBILITY
 
 /* Define to 1 if you have the header file. */
diff --git a/src/pcre2grep.c b/src/pcre2grep.c
index bb96067f0..be1afc506 100644
--- a/src/pcre2grep.c
+++ b/src/pcre2grep.c
@@ -290,6 +290,7 @@ static BOOL show_total_count = FALSE;
 static BOOL silent = FALSE;
 static BOOL utf = FALSE;
 static BOOL posix_digit = FALSE;
+static BOOL posix_pattern_file = FALSE;
 
 static uint8_t utf8_buffer[8];
 
@@ -428,6 +429,7 @@ used to identify them. */
 #define N_POSIX_DIGIT (-26)
 #define N_GROUP_SEPARATOR (-27)
 #define N_NO_GROUP_SEPARATOR (-28)
+#define N_POSIX_PATFILE (-29)
 
 static option_item optionlist[] = {
   { OP_NODATA, N_NULL, NULL, "", "terminate options" },
@@ -449,6 +451,7 @@ static option_item optionlist[] = {
   { OP_PATLIST, 'e', &match_patdata, "regex(p)=pattern", "specify pattern (may be used more than once)" },
   { OP_NODATA, 'F', NULL, "fixed-strings", "patterns are sets of newline-separated strings" },
   { OP_FILELIST, 'f', &pattern_files_data, "file=path", "read patterns from file" },
+  { OP_NODATA, N_POSIX_PATFILE, NULL, "posix-pattern-file", "use POSIX semantics for pattern files" },
   { OP_FILELIST, N_FILE_LIST, &file_lists_data, "file-list=path","read files to search from file" },
   { OP_NODATA, N_FOFFSETS, NULL, "file-offsets", "output file offsets, not text" },
   { OP_STRING, N_GROUP_SEPARATOR, &group_separator, "group-separator=text", "set separator between groups of lines" },
@@ -1448,7 +1451,34 @@ while ((c = fgetc(f)) != EOF)
 return yield;
 }
 
+/*************************************************
+*           Read one pattern from file           *
+*************************************************/
+/* Wrap around read_one_line() to make sure any terminating '\n' is not
+included in the pattern and empty patterns are correctly identified.
+
+Arguments:
+  buffer     the buffer to read into
+  length     maximum number of characters to read and report how many were
+  f          the file
+
+Returns:     TRUE if a pattern was read into buffer
+*/
+
+static BOOL
+read_pattern(char *buffer, PCRE2_SIZE *length, FILE *f)
+{
+*buffer = '\0';
+*length = read_one_line(buffer, *length, f);
+if (*length > 0 && buffer[*length-1] == '\n') *length = *length - 1;
+if (!posix_pattern_file && *length > 0 && buffer[*length-1] == '\r')
+  {
+  *length = *length - 1;
+  if (*length == 0) return TRUE;
+  }
+return (*length > 0 || *buffer == '\n');
+}
 
 /*************************************************
 *                Find end of line                *
@@ -3598,6 +3628,7 @@ switch(letter)
   case N_NOJIT: use_jit = FALSE; break;
   case N_ALLABSK: extra_options |= PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK; break;
   case N_NO_GROUP_SEPARATOR: group_separator = NULL; break;
+  case N_POSIX_PATFILE: posix_pattern_file = TRUE; break;
   case 'a': binary_files = BIN_TEXT; break;
   case 'c': count_only = TRUE; break;
   case N_POSIX_DIGIT: posix_digit = TRUE; break;
@@ -3808,11 +3839,15 @@ else
   filename = name;
   }
 
-while ((patlen = read_one_line(buffer, sizeof(buffer), f)) > 0)
+while ((patlen = sizeof(buffer)) && read_pattern(buffer, &patlen, f))
   {
-  while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+  if (!posix_pattern_file)
+    {
+    while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+    }
+
   linenumber++;
-  if (patlen == 0) continue; /* Skip blank lines */
+  if (!posix_pattern_file && patlen == 0) continue; /* Skip blank lines */
 
   /* Note: this call to add_pattern() puts a pointer to the local variable
   "buffer" into the pattern chain. However, that pointer is used only when
diff --git a/testdata/grepinputv b/testdata/grepinputv
index 366d4fb49..1855ba87f 100644
--- a/testdata/grepinputv
+++ b/testdata/grepinputv
@@ -7,3 +7,4 @@ The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+tailing spaces
diff --git a/testdata/grepoutput b/testdata/grepoutput
index d9233c26a..23ff75c59 100644
--- a/testdata/grepoutput
+++ b/testdata/grepoutput
@@ -464,6 +464,7 @@ The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+tailing spaces
 RC=0
 ---------------------------- Test 52 ------------------------------
 fox jumps
@@ -1169,6 +1170,7 @@ The word is cat in this line
 The caterpillar sat on the mat
 The snowcat is not an animal
 A buried feline in the syndicate
+tailing spaces
 RC=0
 ---------------------------- Test 146 -----------------------------
 (standard input):A123B
@@ -1253,3 +1255,21 @@ RC=0
 36-sixteen
 37-seventeen
 RC=0
+---------------------------- Test 154 -----------------------------
+RC=1
+---------------------------- Test 155 -----------------------------
+RC=1
+---------------------------- Test 156 -----------------------------
+The quick brown
+fox jumps
+over the lazy dog.
+This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
+A buried feline in the syndicate
+tailing spaces
+RC=0
+---------------------------- Test 157 -----------------------------
+RC=0