pcre2grep: add --posix-pattern-file for compatibility with other grep
Historically, pcre2grep has done minor processing of the patterns that
were read through the `-f` option.

The end result is that some patterns give different results depending
on whether they are provided through `-e`, through `-f`, or as a
parameter on the command line.

Add a flag that skips that processing, so that a pattern file written
for other grep implementations can be used directly with the same
result.
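
The difference described in this message can be sketched in C. This is a minimal illustration, not code from this commit: `effective_pattern_length()` is a hypothetical helper that models, under that assumption, how one raw line already read from a pattern file is reduced to a pattern in the default mode and with `--posix-pattern-file`.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of pcre2grep. Given one raw line that was
   read from a pattern file (including its terminating '\n', if any), return
   the length of the pattern that would be used, or -1 if the line would be
   skipped. */
static long
effective_pattern_length(const char *line, size_t len, int posix_mode)
{
/* The terminating newline is never part of the pattern. */
if (len > 0 && line[len-1] == '\n') len--;

if (posix_mode)
  {
  /* Only a single '\r' (from a CRLF ending) is dropped; other trailing
     white space is kept and an empty pattern is still a pattern. */
  if (len > 0 && line[len-1] == '\r') len--;
  return (long)len;
  }

/* Default mode: trim all trailing white space and skip blank lines. */
while (len > 0 && isspace((unsigned char)line[len-1])) len--;
return (len == 0) ? -1 : (long)len;
}
```

For the line `spaces ` followed by a newline, this sketch yields a length of 6 in the default mode and 7 in POSIX mode. A blank line is skipped in the default mode and becomes an empty pattern in POSIX mode.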
carenas committed Jun 17, 2024
1 parent 3b90149 commit 95c9e33
Showing 9 changed files with 122 additions and 17 deletions.
10 changes: 7 additions & 3 deletions ChangeLog
@@ -7,14 +7,18 @@ there is also the log of commit messages.
Version 10.45 xx-xxx-2024
-------------------------

-1. Change 6 of 10.44 broke 32-bit compiles because pcre2test's reporting of
-memory size was changed to the entire compiled data block, instead of just the
-pattern and tables data, so as to align with the new length restriction.
+1. Change 6 of 10.44 broke 32-bit tests because pcre2test's reporting of
+memory size was changed to the entire compiled data block, instead of just the
+pattern and tables data, so as to align with the new length restriction.
Because the block's header contains pointers, this meant the pcre2test output
was different in 32-bit mode. A patch by Carlo reverts to the previous state
and makes sure that any limit set by pcre2_set_max_pattern_compiled_length()
also avoids the internal struct overhead.

+2. Add --posix-pattern-file to pcre2grep to allow processing of empty patterns
+through the -f option, as well as patterns that end in space characters for
+compatibility with other grep tools.
+

Version 10.44 07-June-2024
--------------------------
24 changes: 24 additions & 0 deletions RunGrepTest
@@ -861,6 +861,30 @@ echo "---------------------------- Test 153 -----------------------------" >>tes
(cd $srcdir; $valgrind $vjs $pcre2grep -nA3 --no-group-separator 'four' ./testdata/grepinputx) >>testtrygrep
echo "RC=$?" >>testtrygrep

+echo "---------------------------- Test 154 -----------------------------" >>testtrygrep
+>testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 155 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -f $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 156 -----------------------------" >>testtrygrep
+echo "" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file --file $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 156 -----------------------------" >>testtrygrep
+printf "\015\012" >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep --posix-pattern-file --file $builddir/testtemp1grep ./testdata/grepinputv) >>testtrygrep
+echo "RC=$?" >>testtrygrep
+
+echo "---------------------------- Test 157 -----------------------------" >>testtrygrep
+echo "spaces " >testtemp1grep
+(cd $srcdir; $valgrind $vjs $pcre2grep -o --posix-pattern-file --file=$builddir/testtemp1grep ./testdata/grepinputv >testtemp2grep && [ `wc -c <testtemp2grep` -eq 8 ]) >>testtrygrep
+echo "RC=$?" >>testtrygrep

# Now compare the results.

6 changes: 3 additions & 3 deletions doc/html/pcre2_set_max_pattern_compiled_length.html
@@ -27,9 +27,9 @@ <h1>pcre2_set_max_pattern_compiled_length man page</h1>
</b><br>
<P>
This function sets, in a compile context, the maximum size (in bytes) for the
-memory needed to hold the compiled version of a pattern that is compiled with
-this context. The result is always zero. If a pattern that is passed to
-<b>pcre2_compile()</b> with this context needs more memory, an error is
+memory needed to hold the compiled version of a pattern that is using this
+context. The result is always zero. If a pattern that is passed to
+<b>pcre2_compile()</b> referencing this context needs more memory, an error is
generated. The default is the largest number that a PCRE2_SIZE variable can
hold, which is effectively unlimited.
</P>
14 changes: 9 additions & 5 deletions doc/html/pcre2grep.html
@@ -391,9 +391,10 @@ <h1>pcre2grep man page</h1>
command line, no delimiters should be used. What constitutes a newline when
reading the file is the operating system's default interpretation of \n. The
<b>--newline</b> option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+<b>--posix-pattern-file</b> option is also provided. An empty file contains no
patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
<br>
<br>
If this option is given more than once, all the specified files are read. A
@@ -408,9 +409,9 @@ <h1>pcre2grep man page</h1>
Read a list of files and/or directories that are to be scanned from the given
file, one per line. What constitutes a newline when reading the file is the
operating system's default. Trailing white space is removed from each line, and
-blank lines are ignored. These paths are processed before any that are listed
-on the command line. The file name can be given as "-" to refer to the standard
-input. If <b>--file</b> and <b>--file-list</b> are both specified as "-",
+blank lines are ignored. These paths are processed before any that are listed on the command
+line. The file name can be given as "-" to refer to the standard input.
+If <b>--file</b> and <b>--file-list</b> are both specified as "-",
patterns are read first. This is useful only when the standard input is a
terminal, from which further lines (the list of files) can be read after an
end-of-file indication. If this option is given more than once, all the
@@ -808,6 +809,9 @@ <h1>pcre2grep man page</h1>
allowing \w to match Unicode letters and digits.
</P>
<P>
+<b>--posix-pattern-file</b>
+When patterns are provided with the <b>-f</b> option, do not trim trailing
+spaces or ignore empty lines, in a similar way to other grep tools.
<b>-q</b>, <b>--quiet</b>
Work quietly, that is, display nothing except error messages. The exit
status indicates whether or not any matches were found.
8 changes: 6 additions & 2 deletions doc/pcre2grep.1
@@ -337,9 +337,10 @@ Read patterns from the file, one per line. As is the case with patterns on the
command line, no delimiters should be used. What constitutes a newline when
reading the file is the operating system's default interpretation of \en. The
\fB--newline\fP option has no effect on this option. Trailing white space is
-removed from each line, and blank lines are ignored. An empty file contains no
+removed from each line, and blank lines are ignored unless the
+\fB--posix-pattern-file\fP option is also provided. An empty file contains no
patterns and therefore matches nothing. Patterns read from a file in this way
-may contain binary zeros, which are treated as ordinary data characters.
+may contain binary zeros, which are treated as ordinary character literals.
.sp
If this option is given more than once, all the specified files are read. A
data line is output if any of the patterns match it. A file name can be given
@@ -701,6 +702,9 @@ option settings within patterns that affect individual classes. For example,
when in UCP mode, the sequence (?aP) restricts [:word:] to ASCII letters, while
allowing \ew to match Unicode letters and digits.
.TP
+\fB--posix-pattern-file\fP
+When patterns are provided with the \fB-f\fP option, do not trim trailing
+spaces or ignore empty lines, in a similar way to other grep tools.
\fB-q\fP, \fB--quiet\fP
Work quietly, that is, display nothing except error messages. The exit
status indicates whether or not any matches were found.
3 changes: 2 additions & 1 deletion src/config.h.in
@@ -145,7 +145,8 @@ sure both macros are undefined; an emulation function will then be used. */
/* Define to 1 if you have the <unistd.h> header file. */
#undef HAVE_UNISTD_H

-/* Define to 1 if the compiler supports simple visibility declarations. */
+/* Define to 1 if the compiler supports GCC compatible visibility
+declarations. */
#undef HAVE_VISIBILITY

/* Define to 1 if you have the <wchar.h> header file. */
41 changes: 38 additions & 3 deletions src/pcre2grep.c
@@ -290,6 +290,7 @@ static BOOL show_total_count = FALSE;
static BOOL silent = FALSE;
static BOOL utf = FALSE;
static BOOL posix_digit = FALSE;
+static BOOL posix_pattern_file = FALSE;

static uint8_t utf8_buffer[8];

@@ -428,6 +429,7 @@ used to identify them. */
#define N_POSIX_DIGIT (-26)
#define N_GROUP_SEPARATOR (-27)
#define N_NO_GROUP_SEPARATOR (-28)
+#define N_POSIX_PATFILE (-29)

static option_item optionlist[] = {
{ OP_NODATA, N_NULL, NULL, "", "terminate options" },
@@ -449,6 +451,7 @@ static option_item optionlist[] = {
{ OP_PATLIST, 'e', &match_patdata, "regex(p)=pattern", "specify pattern (may be used more than once)" },
{ OP_NODATA, 'F', NULL, "fixed-strings", "patterns are sets of newline-separated strings" },
{ OP_FILELIST, 'f', &pattern_files_data, "file=path", "read patterns from file" },
+{ OP_NODATA, N_POSIX_PATFILE, NULL, "posix-pattern-file", "use POSIX semantics for pattern files" },
{ OP_FILELIST, N_FILE_LIST, &file_lists_data, "file-list=path","read files to search from file" },
{ OP_NODATA, N_FOFFSETS, NULL, "file-offsets", "output file offsets, not text" },
{ OP_STRING, N_GROUP_SEPARATOR, &group_separator, "group-separator=text", "set separator between groups of lines" },
@@ -1448,7 +1451,34 @@ while ((c = fgetc(f)) != EOF)
return yield;
}

+/*************************************************
+* Read one pattern from file *
+*************************************************/
+
+/* Wrap around read_one_line() to make sure any terminating '\n' is not
+included in the pattern and empty patterns are correctly identified.
+Arguments:
+  buffer     the buffer to read into
+  length     max characters to read on entry; characters read on return
+  f          the file
+Returns:     TRUE if a pattern was read into buffer
+*/
+
+static BOOL
+read_pattern(char *buffer, PCRE2_SIZE *length, FILE *f)
+{
+*buffer = '\0';
+*length = read_one_line(buffer, *length, f);
+if (*length > 0 && buffer[*length-1] == '\n') *length = *length - 1;
+if (posix_pattern_file && *length > 0 && buffer[*length-1] == '\r')
+  {
+  *length = *length - 1;
+  if (*length == 0) return TRUE;
+  }
+return (*length > 0 || *buffer == '\n');
+}
+
/*************************************************
* Find end of line *
@@ -3598,6 +3628,7 @@ switch(letter)
case N_NOJIT: use_jit = FALSE; break;
case N_ALLABSK: extra_options |= PCRE2_EXTRA_ALLOW_LOOKAROUND_BSK; break;
case N_NO_GROUP_SEPARATOR: group_separator = NULL; break;
+case N_POSIX_PATFILE: posix_pattern_file = TRUE; break;
case 'a': binary_files = BIN_TEXT; break;
case 'c': count_only = TRUE; break;
case N_POSIX_DIGIT: posix_digit = TRUE; break;
@@ -3808,11 +3839,15 @@ else
filename = name;
}

-while ((patlen = read_one_line(buffer, sizeof(buffer), f)) > 0)
+while ((patlen = sizeof(buffer)) && read_pattern(buffer, &patlen, f))
{
-while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+if (!posix_pattern_file)
+{
+while (patlen > 0 && isspace((unsigned char)(buffer[patlen-1]))) patlen--;
+}
+
linenumber++;
-if (patlen == 0) continue; /* Skip blank lines */
+if (!posix_pattern_file && patlen == 0) continue; /* Skip blank lines */

/* Note: this call to add_pattern() puts a pointer to the local variable
"buffer" into the pattern chain. However, that pointer is used only when
1 change: 1 addition & 0 deletions testdata/grepinputv
@@ -7,3 +7,4 @@ The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
+tailing spaces
32 changes: 32 additions & 0 deletions testdata/grepoutput
@@ -464,6 +464,7 @@ The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
+tailing spaces
RC=0
---------------------------- Test 52 ------------------------------
fox jumps
@@ -1169,6 +1170,7 @@ The word is cat in this line
The caterpillar sat on the mat
The snowcat is not an animal
A buried feline in the syndicate
+tailing spaces
RC=0
---------------------------- Test 146 -----------------------------
(standard input):A123B
@@ -1253,3 +1255,33 @@ RC=0
36-sixteen
37-seventeen
RC=0
+---------------------------- Test 154 -----------------------------
+RC=1
+---------------------------- Test 155 -----------------------------
+RC=1
+---------------------------- Test 156 -----------------------------
+The quick brown
+fox jumps
+over the lazy dog.
+This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
+A buried feline in the syndicate
+tailing spaces
+RC=0
+---------------------------- Test 156 -----------------------------
+The quick brown
+fox jumps
+over the lazy dog.
+This time it jumps and jumps and jumps.
+This line contains \E and (regex) *meta* [characters].
+The word is cat in this line
+The caterpillar sat on the mat
+The snowcat is not an animal
+A buried feline in the syndicate
+tailing spaces
+RC=0
+---------------------------- Test 157 -----------------------------
+RC=0
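
The expected results for Tests 154 to 156 can be reasoned about with a small model. The sketch below is an illustration under stated assumptions rather than pcre2grep code: `count_patterns()` is a hypothetical function that counts how many patterns the text of a pattern file would contribute in each mode. It works on an in-memory string, so unlike pcre2grep it cannot represent binary zeros.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model, not part of pcre2grep: count the patterns that the
   given pattern-file text would contribute. */
static int
count_patterns(const char *content, int posix_mode)
{
int count = 0;
while (*content != '\0')
  {
  size_t raw = strcspn(content, "\n");   /* line length without its '\n' */
  size_t len = raw;
  if (posix_mode)
    {
    /* Drop one '\r' from a CRLF ending; every line read is a pattern,
       including an empty one. */
    if (len > 0 && content[len-1] == '\r') len--;
    count++;
    }
  else
    {
    /* Default mode: trim trailing white space and skip blank lines. */
    while (len > 0 && isspace((unsigned char)content[len-1])) len--;
    if (len > 0) count++;
    }
  content += raw;
  if (*content == '\n') content++;       /* step over the line terminator */
  }
return count;
}
```

In this model an empty file contributes no patterns in either mode (Test 154, RC=1). A file holding one blank line contributes none in the default mode (Test 155, RC=1) but one empty pattern with `--posix-pattern-file`, and an empty pattern matches every line (Test 156, RC=0).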