-
-
Notifications
You must be signed in to change notification settings - Fork 49
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ParserConfig type changes may cause a regression in pattern matching #351
Comments
The changes in No Can you propose a pull request that fixes the issues you describe? I'll be working on TatSu in upcoming days, so I can test and merge. If you do provide a pull request, please try to include a unit test that demonstrates the issue. |
Thanks for taking the time to respond @apalala. Before I make a PR, I think the approach should maybe be discussed to avoid needless back and forth in the PR. It sounds like you're in favor dropping the diff --git a/tatsu/buffering.py b/tatsu/buffering.py
index 87358d9..5a2a91f 100644
--- a/tatsu/buffering.py
+++ b/tatsu/buffering.py
@@ -357,7 +357,7 @@ class Buffer(Tokenizer):
if isinstance(pattern, RETYPE):
cre = pattern
else:
- cre = re.compile(pattern, re.MULTILINE)
+ cre = re.compile(pattern)
return cre.match(self.text, self.pos)
@property I think this is mostly straightforward, though it breaks one of the tests and i'm not sure i'm fixing it correctly as i'm not an expert at regex or at tatsu. The below below passes, but it's predicated on the following assumptions:
(venv) vfazio@vfazio4 ~/development/TatSu $ git diff
diff --git a/grammar/tatsu.ebnf b/grammar/tatsu.ebnf
index 870caae..0c897a4 100644
--- a/grammar/tatsu.ebnf
+++ b/grammar/tatsu.ebnf
@@ -1,7 +1,7 @@
@@grammar :: TatSu
@@whitespace :: /\s+/
@@comments :: ?"(?sm)[(][*](?:.|\n)*?[*][)]"
-@@eol_comments :: ?"#[^\n]*$"
+@@eol_comments :: ?"#[^\n]*"
@@parseinfo :: True
@@left_recursion :: False
#Regen'd tatsu --name=EBNFBootstrap grammar/tatsu.ebnf -o tatsu/bootstrap.py
diff --git a/tatsu/bootstrap.py b/tatsu/bootstrap.py
index 4f656b2..510350e 100644
--- a/tatsu/bootstrap.py
+++ b/tatsu/bootstrap.py
@@ -35,8 +35,8 @@ class EBNFBootstrapBuffer(Buffer):
ignorecase=False,
namechars='',
parseinfo=True,
- comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]',
- eol_comments_re='#[^\\n]*$',
+ comments_re=re.compile('(?sm)[(][*](?:.|\\n)*?[*][)]', re.MULTILINE|re.DOTALL),
+ eol_comments_re=re.compile('#[^\\n]*'),
keywords=KEYWORDS,
start='start',
)
@@ -55,8 +55,8 @@ class EBNFBootstrapParser(Parser):
ignorecase=False,
namechars='',
parseinfo=True,
- comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]',
- eol_comments_re='#[^\\n]*$',
+ comments_re=re.compile('(?sm)[(][*](?:.|\\n)*?[*][)]', re.MULTILINE|re.DOTALL),
+ eol_comments_re=re.compile('#[^\\n]*'),
keywords=KEYWORDS,
start='start',
) This feels wrong, but does fix the test that fails subsequently diff --git a/test/grammar/pattern_test.py b/test/grammar/pattern_test.py
index 91094fa..c651baf 100644
--- a/test/grammar/pattern_test.py
+++ b/test/grammar/pattern_test.py
@@ -22,7 +22,7 @@ class PatternTests(unittest.TestCase):
blankline
=
- /^[^\\n]*\\n$/
+ /(?m)^[^\\n]*\\n$/
;
""" I'm still curious about the following however:
Thoughts? |
I agree with the comments above. Things like If you can come up with an update that solves your issues, gets rid of the legacy noise, and passes existing unit tests, I'll test it against complex grammars like COBOL and Java. Consider that TatSu already offered to leave users behind if they don't keep up with Python releases. Older versions of TatSu are still published and available for who need them. I vouch for getting rid of legacy stuff.
|
I'm going to push a draft PR. The commits won't be pretty or well documented, but it will "work". I'll rework the commits when I have time next week. |
Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. [0]: neogeny#338 [1]: neogeny#351 (comment)
…fig` (#352) * [buffering] drop forced multiline match for string patterns Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. [0]: #338 [1]: #351 (comment) * [grammar] make eol_comments multiline match Make the default eol_comments regex use multiline matching. Recent changes to `ParserConfig` [0] now use a precompiled regex (an `re.Pattern`) instead of compiling the `str` regex on the fly. The `Tokenizer` previously assumed `str` type regexes should all be `re.MULTILINE` regardless of options defined in the regex itself when compiling the pattern. This behavior has since changed to no longer automatically apply and thus requires configurations to specify the option in the pattern. [0]: #338 * [infos] make {eol_}comments_re read-only attributes Previously, the `eol_comments_re` and `comments_re` attributes were public init arguments, were modifiable, and could thus become out of sync with the `eol_comments` and `comments` attributes. Also, with recent changes to `ParserConfig` [0], there were two ways to initialize the regex values for comments and eol_comments directives; either via the constructor using the *_re variables or by using the sister string arguments and relying on `__post_init__` to compile the values which trumped the explicit *_re argument values. Now, the constructor interface has been simplified to not take either `eol_comments_re` or `comments_re` as arguments. Callers may only use `eol_comments` and `comments`. The `eol_comments_re` and `comments_re` attributes are still public, but are read-only so they are always a reflection of their sister string values passed into the constructor. [0]: #200 * [codegen] migrate to {eol_}comments * [ngcodegen] migrate to {eol_}comments * [bootstrap] migrate to {eol_}comments * [lint] resolve errors * [docs] note {eol_}comments directive behavior changes * [docs] update syntax to reflect {eol_}comments arguments * [test] fix test_parse_hash to use eol_comments * [test] explicitly use multiline match in test_patterns_with_newlines
@vfazio, Please check that this issue is gone with the latest commits? |
I'll have to test it on Monday, I do not have access to my code as until then. I'm pretty sure the commits introduce a performance regression by reverting part of #338. But I suppose we can fix that subsequently. Do you want a new issue tracking this and other cleanup I identify? Or do you want me to just make a PR and have you review? |
I agree. There are several ways to add the optimizations again without reusing the deprecated names ( The caching done by |
Can we close this one, @vfazio ? |
Sorry, I've been in meetings all day, I try to test this tomorrow. |
I'm going to close this with the following caveat: Old behavior was not reintroduced. However, this behavior is documented Lines 32 to 33 in f3fb92a
re.Pattern and str should be consistent.
#354 will be updating the note, but the change stays the same. I've tested both branches on the BSDL grammar and it seems to play nice. |
We can keep adding enhancements, @vfazio. It think this was a good cycle, with a good release. |
Recent type changes to
ParserConfig
(#338) have subsequently changed the parser configurations generated by tatsu which may cause problems for users.I've been testing a migration from 5.6 to 5.12+ and ran into my BSDLs failing to parse.
Previously, we had an EOL comment marker defined as
'--.*?$'
which seemed to work fine for BSDL style files (spec: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6515989)See:
After the migration to 5.12.1+, since
ParserConfig
is now defined:The generated parser now looks like:
However, the logic in the default tokenizer (
Buffer
) has branching logic and treats string patterns differently than RE patterns, making string based patterns usere.MULTILINE
:Changing the generated file to either:
or
Restores the previous behavior.
Note that defining the pattern as:
Causes it to be generated like so:
This is slightly redundant but is how
repr
decides to render the object.Despite the field's typing as
re.Pattern
, strings are still apparently being supported as there is no runtime validation.Bootstrap still apparently uses strings instead of
re.compile
so maybe it needs to be regenerated?TatSu/tatsu/bootstrap.py
Lines 38 to 59 in 1b632e8
All of the documentation in Tatsu still points to this being a regex string, not a compiled pattern, but maybe that's work planned for #315?
I don't see any mention of the
eol_comment
orcomments
arguments which are defined as strings and are compiled subsequently into the respective{var}_re
values. Even if they are used, they do not definere.MULTILINE
by default, so it's up to users to update their patterns to set this flag or update their patterns to not depend on it.The text was updated successfully, but these errors were encountered: