Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🚧 Update parser to support look-ahead of two (or more) tokens #9

Draft
wants to merge 16 commits into
base: master
Choose a base branch
from

Conversation

atifaziz
Copy link
Owner

This PR addresses the feature described in issue #7. It updates the parser to use an adaptive strategy depending on whether the parser is asked to look-ahead one, two or more tokens. This is done by abstracting over the storage used for stack operations to push and pop tokens and then using an optimal storage for grammars that require just look-ahead of one or two tokens. This avoids using a single and general storage class such as an array as well as paying the penalty of a heap allocation for the most common cases. Like it is now, for a single peek, the parser only uses a single field that's an optional token (or optuple). For peeking up to, it uses a triplet of count and two tokens. Beyond that, it moves to using a Stack<>.

@atifaziz
Copy link
Owner Author

atifaziz commented Jul 12, 2022

@springcomp Below are my replies to your review/comments:

@atifaziz here is a quick PR on top of your multi-peek branch.

First of all, thanks for the quick turnaround on the feedback.

My question is what is the purpose of the default case in the Peek() method ? It seems to me that support for more than two look-ahead tokens is guaranteed to throw an exception.

I added a simple test illustrate this.

You found an incomplete implementation and this PR is still in draft. I think the overall direction is there, but it needs some polish and tests. It would also help to get your feedback on whether it helps your case.

I wanted to optimise for the common case of look-aheads of one or two tokens. For the less common one, I plan to implement one based on a regular and heap-allocated stack.

I like the ITokenStream abstraction that as been introduced.

Glad you like it!

As a general question why limit the implementation to only two tokens? I guess there should not be a grammar that needs more than two tokens of lookahead.

It's not limited. I am just not done. Again, you discovered my PR that was still work-in-progress.

Also, it seems that maybe the design you chose is extensible to more than two tokens of lookahead in the future, maybe by introducing a future ThreeTokenStackOps, etc.

I think I will only optimise for peeking up to two tokens. For more complex and rare grammars, I'll probably just go with a regular stack but that's a decision that can be changed with time. The point is that the design is adaptive.

But what lead you to this design, with two different StoreStackOps implementations? Is this for performance reason ? Is this in an attempt to prevent more upfront allocations for multiple tokens if not eventually necessary?

Yes, exactly! You pay for the rare cases, but not for the common ones. The only new cost for the common cases is a virtual dispatch for the stack operations.

@codecov-commenter
Copy link

codecov-commenter commented Jul 12, 2022

Codecov Report

Merging #9 (a3cae65) into master (65b8da4) will increase coverage by 4.20%.
The diff coverage is 92.42%.

@@            Coverage Diff             @@
##           master       #9      +/-   ##
==========================================
+ Coverage   88.76%   92.96%   +4.20%     
==========================================
  Files           1        1              
  Lines          89      199     +110     
==========================================
+ Hits           79      185     +106     
- Misses         10       14       +4     
Flag Coverage Δ
unittests 92.96% <92.42%> (+4.20%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
src/Parser.cs 92.96% <92.42%> (+4.20%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 65b8da4...a3cae65. Read the comment docs.

@springcomp
Copy link

It would also help to get your feedback on whether it helps your case.

It totally does! In fact, I revisited my parser to make sure it did not require more than two tokens, this time. And I managed to do it. I suspect we will never need more than two, but you answered my question. Gratt will support anything.

You anwsered all my questions with the PR summary and follow up text. Using an adaptive strategy is a great design. I’m still getting used to C#’s pattern matching feature which evolved a lot with different versions of the language. That and the heavily generic nature of the library make it sometimes hard to exactly comprehend what’s going on.

But I was able to debug step by step and see for myself the switch from a OnTokenStackOps to a TwoTokenStackOps.

I’m looking forward to consuming the future NuGet once released.

On another note, after extensive work on redesigning the original JMESPath parser using Gratt and a hand-crafter automaton-based regex lexer, I could find no improvements in performance 😞. I also started a System.Text.Json-based JMESPath expression evaluator and, again, could not find any performance improvements using early tests. I wonder If I’m not missing something obvious. 🤔

Anyway, I learned so much by doing this that it will have been not for nothing.

@atifaziz
Copy link
Owner Author

My question is what is the purpose of the default case in the Peek() method ? It seems to me that support for more than two look-ahead tokens is guaranteed to throw an exception.

I added a simple test illustrate this.

@springcomp So see 25abe36 for the resolution where I added a stack-backed strategy when peeking into 3+ tokens.

@atifaziz
Copy link
Owner Author

atifaziz commented Jul 12, 2022

the heavily generic nature of the library make it sometimes hard to exactly comprehend what’s going on.

I can sympathize with that! I like to solve problems once, so it naturally leads to a more general design and generic code. In the case of Gratt, the parsing is mostly algorithmic so it can be made independent of the types and since there are many types involved in the data model (precedence, token, token kind, result, etc), it does make definitions long to read. What I've found is that aliases can help:

using Parser = Gratt.Parser<ParseContext, TokenKind, Token<TokenKind>, Precedence, bool>;
using PrefixParselet = System.Func<Token<TokenKind>, Gratt.Parser<ParseContext, TokenKind, Token<TokenKind>, Precedence, bool>, bool>;
using InfixParselet = System.Func<Token<TokenKind>, bool, Gratt.Parser<ParseContext, TokenKind, Token<TokenKind>, Precedence, bool>, bool>;

On another note, after extensive work on redesigning the original JMESPath parser using Gratt and a hand-crafter automaton-based regex lexer, I could find no improvements in performance 😞. I also started a System.Text.Json-based JMESPath expression evaluator and, again, could not find any performance improvements using early tests. I wonder If I’m not missing something obvious. 🤔

Sorry to hear about that. Unfortunately, I do not have the cycles right now to help there. I hope it's something simple and you find it soon. I can, however, commit to this PR and releasing a new version of Gratt in the coming weeks, as I find small bits of time.

@atifaziz atifaziz linked an issue Jul 12, 2022 that may be closed by this pull request
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support look-ahead of two tokens
3 participants