Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ParserConfig type changes may cause a regression in pattern matching #351

Closed
vfazio opened this issue Dec 26, 2024 · 11 comments · Fixed by #352
Closed

ParserConfig type changes may cause a regression in pattern matching #351

vfazio opened this issue Dec 26, 2024 · 11 comments · Fixed by #352

Comments

@vfazio
Copy link
Collaborator

vfazio commented Dec 26, 2024

Recent type changes to ParserConfig (#338) have subsequently changed the parser configurations generated by tatsu which may cause problems for users.

I've been testing a migration from 5.6 to 5.12+ and ran into my BSDLs failing to parse.

Previously, we had an EOL comment marker defined as '--.*?$' which seemed to work fine for BSDL style files (spec: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6515989)

See:

After the migration to 5.12.1+, since ParserConfig is now defined:

@dataclasses.dataclass
class ParserConfig:
    owner: Any = None
    name: str | None = 'Test'
    filename: str = ''
    encoding: str = 'utf-8'

    start: str | None = None  # FIXME
    start_rule: str | None = None  # FIXME
    rule_name: str | None = None  # Backward compatibility

    comments_re: re.Pattern | None = None
    eol_comments_re: re.Pattern | None = None

The generated parser now looks like:

class BSDLParser(Parser):
    def __init__(self, /, config: ParserConfig | None = None, **settings):
        config = ParserConfig.new(
            config,
            owner=self,
            whitespace=None,
            nameguard=None,
            ignorecase=True,
            namechars='',
            parseinfo=False,
            comments_re=None,
            eol_comments_re=re.compile('--.*?$'),
            keywords=KEYWORDS,
            start='start',
        )

However, the logic in the default tokenizer (Buffer) has branching logic and treats string patterns differently than RE patterns, making string based patterns use re.MULTILINE:

    def _scanre(self, pattern):
        if isinstance(pattern, RETYPE):
            cre = pattern
        else:
            cre = re.compile(pattern, re.MULTILINE)
        return cre.match(self.text, self.pos)

Changing the generated file to either:

class BSDLParser(Parser):
    def __init__(self, /, config: ParserConfig | None = None, **settings):
        config = ParserConfig.new(
            config,
            owner=self,
            whitespace=None,
            nameguard=None,
            ignorecase=True,
            namechars='',
            parseinfo=False,
            comments_re=None,
            eol_comments_re=re.compile('--.*?$', re.M),
            keywords=KEYWORDS,
            start='start',
        )

or

class BSDLParser(Parser):
    def __init__(self, /, config: ParserConfig | None = None, **settings):
        config = ParserConfig.new(
            config,
            owner=self,
            whitespace=None,
            nameguard=None,
            ignorecase=True,
            namechars='',
            parseinfo=False,
            comments_re=None,
            eol_comments_re='--.*?$',
            keywords=KEYWORDS,
            start='start',
        )

Restores the previous behavior.

Note that defining the pattern as:

@@eol_comments :: /(?m)--.*?$/

Causes it to be generated like so:

eol_comments_re=re.compile('(?m)--.*?$', re.MULTILINE),

This is slightly redundant but is how repr decides to render the object.

Despite the field's typing as re.Pattern, strings are still apparently being supported as there is no runtime validation.

Bootstrap still apparently uses strings instead of re.compile so maybe it needs to be regenerated?

TatSu/tatsu/bootstrap.py

Lines 38 to 59 in 1b632e8

comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]',
eol_comments_re='#[^\\n]*$',
keywords=KEYWORDS,
start='start',
)
config = config.replace(**settings)
super().__init__(text, config=config)
class EBNFBootstrapParser(Parser):
def __init__(self, /, config: ParserConfig | None = None, **settings):
config = ParserConfig.new(
config,
owner=self,
whitespace='\\s+',
nameguard=None,
ignorecase=False,
namechars='',
parseinfo=True,
comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]',
eol_comments_re='#[^\\n]*$',

All of the documentation in Tatsu still points to this being a regex string, not a compiled pattern, but maybe that's work planned for #315?

I don't see any mention of the eol_comment or comments arguments which are defined as strings and are compiled subsequently into the respective {var}_re values. Even if they are used, they do not define re.MULTILINE by default, so it's up to users to update their patterns to set this flag or update their patterns to not depend on it.

@apalala
Copy link
Collaborator

apalala commented Dec 27, 2024

The changes in ParserConfig were because the settings were not being applied consistency through parser layers.

No re flags should be used anywhere in TatSu because users can specify what they need with (?aiLmsux) patterns. Flags present in the code can be considered legay way of doing things.

Can you propose a pull request that fixes the issues you describe? I'll be working on TatSu in upcoming days, so I can test and merge.

If you do provide a pull request, please try to include a unit test that demonstrates the issue.

@vfazio
Copy link
Collaborator Author

vfazio commented Dec 27, 2024

Thanks for taking the time to respond @apalala.

Before I make a PR, I think the approach should maybe be discussed to avoid needless back and forth in the PR.

It sounds like you're in favor dropping the MULTILINE flag:

diff --git a/tatsu/buffering.py b/tatsu/buffering.py
index 87358d9..5a2a91f 100644
--- a/tatsu/buffering.py
+++ b/tatsu/buffering.py
@@ -357,7 +357,7 @@ class Buffer(Tokenizer):
         if isinstance(pattern, RETYPE):
             cre = pattern
         else:
-            cre = re.compile(pattern, re.MULTILINE)
+            cre = re.compile(pattern)
         return cre.match(self.text, self.pos)
 
     @property

I think this is mostly straightforward, though it breaks one of the tests and i'm not sure i'm fixing it correctly as i'm not an expert at regex or at tatsu.

The below below passes, but it's predicated on the following assumptions:

  1. bootstrap.py should be using compiled regex and not strings to abide by typing
  2. EOL comments should not be multiline match by default, otherwise I can change the pattern to '(?m)#[^\\n]*$' to restore previous behavior
(venv) vfazio@vfazio4 ~/development/TatSu $ git diff
diff --git a/grammar/tatsu.ebnf b/grammar/tatsu.ebnf
index 870caae..0c897a4 100644
--- a/grammar/tatsu.ebnf
+++ b/grammar/tatsu.ebnf
@@ -1,7 +1,7 @@
 @@grammar :: TatSu
 @@whitespace :: /\s+/
 @@comments :: ?"(?sm)[(][*](?:.|\n)*?[*][)]"
-@@eol_comments :: ?"#[^\n]*$"
+@@eol_comments :: ?"#[^\n]*"
 @@parseinfo :: True
 @@left_recursion :: False
 
#Regen'd tatsu --name=EBNFBootstrap grammar/tatsu.ebnf -o tatsu/bootstrap.py 

diff --git a/tatsu/bootstrap.py b/tatsu/bootstrap.py
index 4f656b2..510350e 100644
--- a/tatsu/bootstrap.py
+++ b/tatsu/bootstrap.py
@@ -35,8 +35,8 @@ class EBNFBootstrapBuffer(Buffer):
             ignorecase=False,
             namechars='',
             parseinfo=True,
-            comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]',
-            eol_comments_re='#[^\\n]*$',
+            comments_re=re.compile('(?sm)[(][*](?:.|\\n)*?[*][)]', re.MULTILINE|re.DOTALL),
+            eol_comments_re=re.compile('#[^\\n]*'),
             keywords=KEYWORDS,
             start='start',
         )
@@ -55,8 +55,8 @@ class EBNFBootstrapParser(Parser):
             ignorecase=False,
             namechars='',
             parseinfo=True,
-            comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]',
-            eol_comments_re='#[^\\n]*$',
+            comments_re=re.compile('(?sm)[(][*](?:.|\\n)*?[*][)]', re.MULTILINE|re.DOTALL),
+            eol_comments_re=re.compile('#[^\\n]*'),
             keywords=KEYWORDS,
             start='start',
         )

This feels wrong, but does fix the test that fails subsequently

diff --git a/test/grammar/pattern_test.py b/test/grammar/pattern_test.py
index 91094fa..c651baf 100644
--- a/test/grammar/pattern_test.py
+++ b/test/grammar/pattern_test.py
@@ -22,7 +22,7 @@ class PatternTests(unittest.TestCase):
 
             blankline
                 =
-                /^[^\\n]*\\n$/
+                /(?m)^[^\\n]*\\n$/
                 ;
         """

I'm still curious about the following however:

  1. ParserConfig has two ways to set the (eol_)comments_re, either via direct argument, or via (eol_)comments, with the latter trumping the first one as it's done in __post_init__. I think ideally these would not conflict with each other and there would only be one way to set these values so something can't accidentally be configured in a way that leads to weird behavior.

  2. ParserConfig does not validate that the (eol_)comments_re fields are actually the proper type. Since TatSu is largely untyped, we can't do a static analysis via mypy to ensure this is the case within the codebase and since the fields are not only public, but legacy config values, users may try to pass in an unexpected type. I think there's a few ways to handle that.

    • Validate the types as part of __post_init__.
    • Don't make these values part of the __init__ interface, update the code generator to use (eol_)comments and rely on dataclass.replace to update these calculated values.
    • Update the typing to allow strings and Patterns and drop (eol_)comments
  3. I think depending on the answers to the above, there will need to be documentation updates to reflect these changes. There should probably be a note that raw strings, if it's decided to allow them for (eol_)comments_re, will no longer match via re.MULTILINE

Thoughts?

@apalala
Copy link
Collaborator

apalala commented Dec 27, 2024

I agree with the comments above.

Things like comments_re are legacy with a history that I no longer remember.

If you can come up with an update that solves your issues, gets rid of the legacy noise, and passes existing unit tests, I'll test it against complex grammars like COBOL and Java.

Consider that TatSu already offered to leave users behind if they don't keep up with Python releases. Older versions of TatSu are still published and available for who need them. I vouch for getting rid of legacy stuff.

ParserConfig took a lot of work, and may still not be perfect, but it's the right way to get rid of many annoyances that have hindered professional use of TatSu.

@vfazio
Copy link
Collaborator Author

vfazio commented Dec 27, 2024

I'm going to push a draft PR. The commits won't be pretty or well documented, but it will "work". I'll rework the commits when I have time next week.

vfazio added a commit to vfazio/TatSu that referenced this issue Dec 29, 2024
Previously, when scanning for matches to a regex, if the type of the
pattern was `str`, the pattern was always compiled with `re.MULTILINE`.

Recent changes to `ParserConfig` [0] changed the type used for regex
matches in generated code from `str` to `re.Pattern` which could lead to
a difference in behavior from previous versions where a defined comments
or eol_comments may have been implicitly relying on the `re.MULTILINE`
flag.

After discussion [1], it has been determined that usage of `re` flags
within TatSu should be deprecated in favor of users specifying the
necessary flags within patterns.

As such, drop the `re.MULTILINE` flag for strings compiled on the fly.

[0]: neogeny#338
[1]: neogeny#351 (comment)
apalala pushed a commit that referenced this issue Dec 29, 2024
…fig` (#352)

* [buffering] drop forced multiline match for string patterns

Previously, when scanning for matches to a regex, if the type of the
pattern was `str`, the pattern was always compiled with `re.MULTILINE`.

Recent changes to `ParserConfig` [0] changed the type used for regex
matches in generated code from `str` to `re.Pattern` which could lead to
a difference in behavior from previous versions where a defined comments
or eol_comments may have been implicitly relying on the `re.MULTILINE`
flag.

After discussion [1], it has been determined that usage of `re` flags
within TatSu should be deprecated in favor of users specifying the
necessary flags within patterns.

As such, drop the `re.MULTILINE` flag for strings compiled on the fly.

[0]: #338
[1]: #351 (comment)

* [grammar] make eol_comments multiline match

Make the default eol_comments regex use multiline matching.

Recent changes to `ParserConfig` [0] now use a precompiled regex (an
`re.Pattern`) instead of compiling the `str` regex on the fly.

The `Tokenizer` previously assumed `str` type regexes should all be
`re.MULTILINE` regardless of options defined in the regex itself when
compiling the pattern. This behavior has since changed to no longer
automatically apply and thus requires configurations to specify the
option in the pattern.

[0]: #338

* [infos] make {eol_}comments_re read-only attributes

Previously, the `eol_comments_re` and `comments_re` attributes were
public init arguments, were modifiable, and could thus become out of
sync with the `eol_comments` and `comments` attributes.

Also, with recent changes to `ParserConfig` [0], there were two ways to
initialize the regex values for comments and eol_comments directives;
either via the constructor using the *_re variables or by using the
sister string arguments and relying on `__post_init__` to compile the
values which trumped the explicit *_re argument values.

Now, the constructor interface has been simplified to not take either
`eol_comments_re` or `comments_re` as arguments. Callers may only use
`eol_comments` and `comments`.

The `eol_comments_re` and `comments_re` attributes are still
public, but are read-only so they are always a reflection of their
sister string values passed into the constructor.

[0]: #200

* [codegen] migrate to {eol_}comments

* [ngcodegen] migrate to {eol_}comments

* [bootstrap] migrate to {eol_}comments

* [lint] resolve errors

* [docs] note {eol_}comments directive behavior changes

* [docs] update syntax to reflect {eol_}comments arguments

* [test] fix test_parse_hash to use eol_comments

* [test] explicitly use multiline match in test_patterns_with_newlines
@apalala
Copy link
Collaborator

apalala commented Jan 3, 2025

@vfazio, Please check that this issue is gone with the latest commits?

@vfazio
Copy link
Collaborator Author

vfazio commented Jan 3, 2025

@vfazio, Please check that this issue is gone with the latest commits?

I'll have to test it on Monday, I do not have access to my code as until then.

I'm pretty sure the commits introduce a performance regression by reverting part of #338. But I suppose we can fix that subsequently.

Do you want a new issue tracking this and other cleanup I identify? Or do you want me to just make a PR and have you review?

@apalala
Copy link
Collaborator

apalala commented Jan 4, 2025

I'm pretty sure the commits introduce a performance regression by reverting part of #338. But I suppose we can fix that subsequently.

I agree.

There are several ways to add the optimizations again without reusing the deprecated names (comments_re).

The caching done by re.compile and friends is not to be trusted.

@apalala
Copy link
Collaborator

apalala commented Jan 6, 2025

Can we close this one, @vfazio ?

@vfazio
Copy link
Collaborator Author

vfazio commented Jan 6, 2025

Can we close this one, @vfazio ?

Sorry, I've been in meetings all day, I try to test this tomorrow.

@vfazio
Copy link
Collaborator Author

vfazio commented Jan 6, 2025

I'm going to close this with the following caveat:

Old behavior was not reintroduced. However, this behavior is documented

.. note::
Prior to 5.12.1, comments implicitly had the `(?m) <https://docs.python.org/3/library/re.html#re.MULTILINE>`_ option defined. This is no longer the case.
and at least now the behavior between re.Pattern and str should be consistent.

#354 will be updating the note, but the change stays the same. I've tested both branches on the BSDL grammar and it seems to play nice.

@vfazio vfazio closed this as completed Jan 6, 2025
@apalala
Copy link
Collaborator

apalala commented Jan 7, 2025

We can keep adding enhancements, @vfazio. It think this was a good cycle, with a good release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants