ParserConfig type changes may cause a regression in pattern matching #351

vfazio · 2024-12-26T21:09:32Z

Recent type changes to ParserConfig (#338) have subsequently changed the parser configurations generated by tatsu which may cause problems for users.

I've been testing a migration from 5.6 to 5.12+ and ran into my BSDLs failing to parse.

Previously, we had an EOL comment marker defined as '--.*?$' which seemed to work fine for BSDL style files (spec: https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=6515989)

See:

After the migration to 5.12.1+, since ParserConfig is now defined:

@dataclasses.dataclass
class ParserConfig:
    owner: Any = None
    name: str | None = 'Test'
    filename: str = ''
    encoding: str = 'utf-8'

    start: str | None = None  # FIXME
    start_rule: str | None = None  # FIXME
    rule_name: str | None = None  # Backward compatibility

    comments_re: re.Pattern | None = None
    eol_comments_re: re.Pattern | None = None

The generated parser now looks like:

class BSDLParser(Parser):
    def __init__(self, /, config: ParserConfig | None = None, **settings):
        config = ParserConfig.new(
            config,
            owner=self,
            whitespace=None,
            nameguard=None,
            ignorecase=True,
            namechars='',
            parseinfo=False,
            comments_re=None,
            eol_comments_re=re.compile('--.*?$'),
            keywords=KEYWORDS,
            start='start',
        )

However, the logic in the default tokenizer (Buffer) has branching logic and treats string patterns differently than RE patterns, making string based patterns use re.MULTILINE:

    def _scanre(self, pattern):
        if isinstance(pattern, RETYPE):
            cre = pattern
        else:
            cre = re.compile(pattern, re.MULTILINE)
        return cre.match(self.text, self.pos)

Changing the generated file to either:

class BSDLParser(Parser):
    def __init__(self, /, config: ParserConfig | None = None, **settings):
        config = ParserConfig.new(
            config,
            owner=self,
            whitespace=None,
            nameguard=None,
            ignorecase=True,
            namechars='',
            parseinfo=False,
            comments_re=None,
            eol_comments_re=re.compile('--.*?$', re.M),
            keywords=KEYWORDS,
            start='start',
        )

or

class BSDLParser(Parser):
    def __init__(self, /, config: ParserConfig | None = None, **settings):
        config = ParserConfig.new(
            config,
            owner=self,
            whitespace=None,
            nameguard=None,
            ignorecase=True,
            namechars='',
            parseinfo=False,
            comments_re=None,
            eol_comments_re='--.*?$',
            keywords=KEYWORDS,
            start='start',
        )

Restores the previous behavior.

Note that defining the pattern as:

@@eol_comments :: /(?m)--.*?$/

Causes it to be generated like so:

eol_comments_re=re.compile('(?m)--.*?$', re.MULTILINE),

This is slightly redundant but is how repr decides to render the object.

Despite the field's typing as re.Pattern, strings are still apparently being supported as there is no runtime validation.

Bootstrap still apparently uses strings instead of re.compile so maybe it needs to be regenerated?

TatSu/tatsu/bootstrap.py

Lines 38 to 59 in 1b632e8

    
                       comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]', 
        
                       eol_comments_re='#[^\\n]*$', 
        
                       keywords=KEYWORDS, 
        
                       start='start', 
        
                   ) 
        
                   config = config.replace(**settings) 
        
                   super().__init__(text, config=config) 
        
           class EBNFBootstrapParser(Parser): 
        
               def __init__(self, /, config: ParserConfig | None = None, **settings): 
        
                   config = ParserConfig.new( 
        
                       config, 
        
                       owner=self, 
        
                       whitespace='\\s+', 
        
                       nameguard=None, 
        
                       ignorecase=False, 
        
                       namechars='', 
        
                       parseinfo=True, 
        
                       comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]', 
        
                       eol_comments_re='#[^\\n]*$',

All of the documentation in Tatsu still points to this being a regex string, not a compiled pattern, but maybe that's work planned for #315?

I don't see any mention of the eol_comment or comments arguments which are defined as strings and are compiled subsequently into the respective {var}_re values. Even if they are used, they do not define re.MULTILINE by default, so it's up to users to update their patterns to set this flag or update their patterns to not depend on it.

The text was updated successfully, but these errors were encountered:

apalala · 2024-12-27T12:04:37Z

The changes in ParserConfig were because the settings were not being applied consistency through parser layers.

No re flags should be used anywhere in TatSu because users can specify what they need with (?aiLmsux) patterns. Flags present in the code can be considered legay way of doing things.

Can you propose a pull request that fixes the issues you describe? I'll be working on TatSu in upcoming days, so I can test and merge.

If you do provide a pull request, please try to include a unit test that demonstrates the issue.

vfazio · 2024-12-27T16:43:02Z

Thanks for taking the time to respond @apalala.

Before I make a PR, I think the approach should maybe be discussed to avoid needless back and forth in the PR.

It sounds like you're in favor dropping the MULTILINE flag:

diff --git a/tatsu/buffering.py b/tatsu/buffering.py
index 87358d9..5a2a91f 100644
--- a/tatsu/buffering.py
+++ b/tatsu/buffering.py
@@ -357,7 +357,7 @@ class Buffer(Tokenizer):
         if isinstance(pattern, RETYPE):
             cre = pattern
         else:
-            cre = re.compile(pattern, re.MULTILINE)
+            cre = re.compile(pattern)
         return cre.match(self.text, self.pos)
 
     @property

I think this is mostly straightforward, though it breaks one of the tests and i'm not sure i'm fixing it correctly as i'm not an expert at regex or at tatsu.

The below below passes, but it's predicated on the following assumptions:

bootstrap.py should be using compiled regex and not strings to abide by typing
EOL comments should not be multiline match by default, otherwise I can change the pattern to '(?m)#[^\\n]*$' to restore previous behavior

(venv) vfazio@vfazio4 ~/development/TatSu $ git diff
diff --git a/grammar/tatsu.ebnf b/grammar/tatsu.ebnf
index 870caae..0c897a4 100644
--- a/grammar/tatsu.ebnf
+++ b/grammar/tatsu.ebnf
@@ -1,7 +1,7 @@
 @@grammar :: TatSu
 @@whitespace :: /\s+/
 @@comments :: ?"(?sm)[(][*](?:.|\n)*?[*][)]"
-@@eol_comments :: ?"#[^\n]*$"
+@@eol_comments :: ?"#[^\n]*"
 @@parseinfo :: True
 @@left_recursion :: False
 
#Regen'd tatsu --name=EBNFBootstrap grammar/tatsu.ebnf -o tatsu/bootstrap.py 

diff --git a/tatsu/bootstrap.py b/tatsu/bootstrap.py
index 4f656b2..510350e 100644
--- a/tatsu/bootstrap.py
+++ b/tatsu/bootstrap.py
@@ -35,8 +35,8 @@ class EBNFBootstrapBuffer(Buffer):
             ignorecase=False,
             namechars='',
             parseinfo=True,
-            comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]',
-            eol_comments_re='#[^\\n]*$',
+            comments_re=re.compile('(?sm)[(][*](?:.|\\n)*?[*][)]', re.MULTILINE|re.DOTALL),
+            eol_comments_re=re.compile('#[^\\n]*'),
             keywords=KEYWORDS,
             start='start',
         )
@@ -55,8 +55,8 @@ class EBNFBootstrapParser(Parser):
             ignorecase=False,
             namechars='',
             parseinfo=True,
-            comments_re='(?sm)[(][*](?:.|\\n)*?[*][)]',
-            eol_comments_re='#[^\\n]*$',
+            comments_re=re.compile('(?sm)[(][*](?:.|\\n)*?[*][)]', re.MULTILINE|re.DOTALL),
+            eol_comments_re=re.compile('#[^\\n]*'),
             keywords=KEYWORDS,
             start='start',
         )

This feels wrong, but does fix the test that fails subsequently

diff --git a/test/grammar/pattern_test.py b/test/grammar/pattern_test.py
index 91094fa..c651baf 100644
--- a/test/grammar/pattern_test.py
+++ b/test/grammar/pattern_test.py
@@ -22,7 +22,7 @@ class PatternTests(unittest.TestCase):
 
             blankline
                 =
-                /^[^\\n]*\\n$/
+                /(?m)^[^\\n]*\\n$/
                 ;
         """

I'm still curious about the following however:

ParserConfig has two ways to set the (eol_)comments_re, either via direct argument, or via (eol_)comments, with the latter trumping the first one as it's done in __post_init__. I think ideally these would not conflict with each other and there would only be one way to set these values so something can't accidentally be configured in a way that leads to weird behavior.
ParserConfig does not validate that the (eol_)comments_re fields are actually the proper type. Since TatSu is largely untyped, we can't do a static analysis via mypy to ensure this is the case within the codebase and since the fields are not only public, but legacy config values, users may try to pass in an unexpected type. I think there's a few ways to handle that.
- Validate the types as part of __post_init__.
- Don't make these values part of the __init__ interface, update the code generator to use (eol_)comments and rely on dataclass.replace to update these calculated values.
- Update the typing to allow strings and Patterns and drop (eol_)comments
I think depending on the answers to the above, there will need to be documentation updates to reflect these changes. There should probably be a note that raw strings, if it's decided to allow them for (eol_)comments_re, will no longer match via re.MULTILINE

Thoughts?

apalala · 2024-12-27T16:50:59Z

I agree with the comments above.

Things like comments_re are legacy with a history that I no longer remember.

If you can come up with an update that solves your issues, gets rid of the legacy noise, and passes existing unit tests, I'll test it against complex grammars like COBOL and Java.

Consider that TatSu already offered to leave users behind if they don't keep up with Python releases. Older versions of TatSu are still published and available for who need them. I vouch for getting rid of legacy stuff.

ParserConfig took a lot of work, and may still not be perfect, but it's the right way to get rid of many annoyances that have hindered professional use of TatSu.

vfazio · 2024-12-27T20:36:33Z

I'm going to push a draft PR. The commits won't be pretty or well documented, but it will "work". I'll rework the commits when I have time next week.

Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. [0]: neogeny#338 [1]: neogeny#351 (comment)

…fig` (#352) * [buffering] drop forced multiline match for string patterns Previously, when scanning for matches to a regex, if the type of the pattern was `str`, the pattern was always compiled with `re.MULTILINE`. Recent changes to `ParserConfig` [0] changed the type used for regex matches in generated code from `str` to `re.Pattern` which could lead to a difference in behavior from previous versions where a defined comments or eol_comments may have been implicitly relying on the `re.MULTILINE` flag. After discussion [1], it has been determined that usage of `re` flags within TatSu should be deprecated in favor of users specifying the necessary flags within patterns. As such, drop the `re.MULTILINE` flag for strings compiled on the fly. [0]: #338 [1]: #351 (comment) * [grammar] make eol_comments multiline match Make the default eol_comments regex use multiline matching. Recent changes to `ParserConfig` [0] now use a precompiled regex (an `re.Pattern`) instead of compiling the `str` regex on the fly. The `Tokenizer` previously assumed `str` type regexes should all be `re.MULTILINE` regardless of options defined in the regex itself when compiling the pattern. This behavior has since changed to no longer automatically apply and thus requires configurations to specify the option in the pattern. [0]: #338 * [infos] make {eol_}comments_re read-only attributes Previously, the `eol_comments_re` and `comments_re` attributes were public init arguments, were modifiable, and could thus become out of sync with the `eol_comments` and `comments` attributes. Also, with recent changes to `ParserConfig` [0], there were two ways to initialize the regex values for comments and eol_comments directives; either via the constructor using the *_re variables or by using the sister string arguments and relying on `__post_init__` to compile the values which trumped the explicit *_re argument values. Now, the constructor interface has been simplified to not take either `eol_comments_re` or `comments_re` as arguments. Callers may only use `eol_comments` and `comments`. The `eol_comments_re` and `comments_re` attributes are still public, but are read-only so they are always a reflection of their sister string values passed into the constructor. [0]: #200 * [codegen] migrate to {eol_}comments * [ngcodegen] migrate to {eol_}comments * [bootstrap] migrate to {eol_}comments * [lint] resolve errors * [docs] note {eol_}comments directive behavior changes * [docs] update syntax to reflect {eol_}comments arguments * [test] fix test_parse_hash to use eol_comments * [test] explicitly use multiline match in test_patterns_with_newlines

apalala · 2025-01-03T12:42:26Z

@vfazio, Please check that this issue is gone with the latest commits?

vfazio · 2025-01-03T13:28:03Z

@vfazio, Please check that this issue is gone with the latest commits?

I'll have to test it on Monday, I do not have access to my code as until then.

I'm pretty sure the commits introduce a performance regression by reverting part of #338. But I suppose we can fix that subsequently.

Do you want a new issue tracking this and other cleanup I identify? Or do you want me to just make a PR and have you review?

apalala · 2025-01-04T17:00:24Z

I'm pretty sure the commits introduce a performance regression by reverting part of #338. But I suppose we can fix that subsequently.

I agree.

There are several ways to add the optimizations again without reusing the deprecated names (comments_re).

The caching done by re.compile and friends is not to be trusted.

apalala · 2025-01-06T19:56:04Z

Can we close this one, @vfazio ?

vfazio · 2025-01-06T20:18:41Z

Can we close this one, @vfazio ?

Sorry, I've been in meetings all day, I try to test this tomorrow.

vfazio · 2025-01-06T20:44:48Z

I'm going to close this with the following caveat:

Old behavior was not reintroduced. However, this behavior is documented

TatSu/docs/directives.rst

Lines 32 to 33 in f3fb92a

    
           .. note:: 
        
              Prior to 5.12.1, comments implicitly had the `(?m) <https://docs.python.org/3/library/re.html#re.MULTILINE>`_ option defined. This is no longer the case.

and at least now the behavior between re.Pattern and str should be consistent.

#354 will be updating the note, but the change stays the same. I've tested both branches on the BSDL grammar and it seems to play nice.

apalala · 2025-01-07T12:33:38Z

We can keep adding enhancements, @vfazio. It think this was a good cycle, with a good release.

vfazio mentioned this issue Dec 27, 2024

Make {eol_}comments_re read-only and non-init arguments in ParserConfig #352

Merged

apalala mentioned this issue Dec 29, 2024

Make {eol_}comments_re read-only and non-init arguments in `ParserCon… #353

Merged

vfazio closed this as completed Jan 6, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ParserConfig type changes may cause a regression in pattern matching #351

ParserConfig type changes may cause a regression in pattern matching #351

vfazio commented Dec 26, 2024

apalala commented Dec 27, 2024

vfazio commented Dec 27, 2024 •

edited

Loading

apalala commented Dec 27, 2024

vfazio commented Dec 27, 2024

apalala commented Jan 3, 2025

vfazio commented Jan 3, 2025

apalala commented Jan 4, 2025 •

edited

Loading

apalala commented Jan 6, 2025

vfazio commented Jan 6, 2025

vfazio commented Jan 6, 2025

apalala commented Jan 7, 2025

ParserConfig type changes may cause a regression in pattern matching #351

ParserConfig type changes may cause a regression in pattern matching #351

Comments

vfazio commented Dec 26, 2024

apalala commented Dec 27, 2024

vfazio commented Dec 27, 2024 • edited Loading

apalala commented Dec 27, 2024

vfazio commented Dec 27, 2024

apalala commented Jan 3, 2025

vfazio commented Jan 3, 2025

apalala commented Jan 4, 2025 • edited Loading

apalala commented Jan 6, 2025

vfazio commented Jan 6, 2025

vfazio commented Jan 6, 2025

apalala commented Jan 7, 2025

vfazio commented Dec 27, 2024 •

edited

Loading

apalala commented Jan 4, 2025 •

edited

Loading