Replies: 10 comments 19 replies
-
@eregon, @iliabylich, @enebo would love your feedback on this. If this makes sense, I can get started on a basic shell of what this would look like and we can iterate on it from there. |
Beta Was this translation helpful? Give feedback.
-
What are advantages of using binary serialisation over standard way of writing bindings?
And we can use codegen (based on config.yml in this repo) to quickly generate them. |
Beta Was this translation helpful? Give feedback.
-
Some context from my perspective is about getting an AST from YARP into a managed runtime (JS, Java, etc.) Specifically, how will JRuby and TruffleRuby get these ASTs. Two options are a push approach, and a query approach. In the push approach, the managed runtime provides a visitor pattern interface with methods for each node type, and YARP walks the AST and calls the visitor methods. The managed runtime can either directly produce something like bytecode from those visit, or it can recreate the AST within the managed environment. In the query interface, the YARP provides the managed runtime an interface with methods to get handles to child nodes from a node, and to read properties from a node via the handle. Maybe the query could be advanced, allowing you to read grandchildren and things like that. Both of these have the same problem - an AST is comprised of a large number of very little objects. In both approaches that means many, many calls between YARP and the managed environment. Calls between native (YARP) and managed can be very expensive, as each call has to park the managed runtime and set up the expected native environment. We'll also have lots of strings, which will have to be individually copied. The benefit for the serialised approach for me isn't saving to disk or anything like that - it's to be able to have a single managed to native call, which gives you everything you need in one go. All strings are copied in one, as they're part of the stream. Then the managed runtime does the de-serialisation safely and simply within the managed environment, rebuilding the AST, or directly creating bytecode, or whatever it wants. For me, this means that I'm not looking for a compact representation - simple to read is more important, and ideally simple to seek. I'd like to keep this use-case in mind and co-develop the interface with JRuby and TruffleRuby as part of the serialisation format. |
Beta Was this translation helpful? Give feedback.
-
Another thought is that to save time, the parser could directly generate the serialised blob, rather than reifying the AST on the YARP side, only to serialise it, throw it away, send the serialised blob across the managed boundary, for it to be reified again. |
Beta Was this translation helpful? Give feedback.
-
In agreement with @chrisseaton and @eregon we need this to be a coarse single call for JRuby/Java. It should just be a single piece of data (array) which is easy to navigate. I also think space traded for simpler navigation is worth it but I still would like it as small as it can be. Because... My one other feature request is I would like to be able to skip past defs in the tree (just record its location for later). In JRuby our IR will process the AST of methods only if they are ever called. The number of methods called in a typical library are small (I remember ~80% unused in simple Rails commands) so this helps with both memory and time to process. Also our IR is more memory than our AST (which should be true of MRI as well). It looks like with this design I have to process it to a Java form. Even if it is simple to skip past defs and record them:
I am not saying this is unacceptable. I think the intention is that we will process the blob into our backend impl. For me, the dream was to walk the blob and directly create IR. Generating an intermediate AST feels like it would we be adding time and potentially additional memory (and for sure more allocations). Since we will no longer pay a warmup time at startup for our lexer/parser to get up to speed this design will definitely be faster than what we have now. So this is merely just a comment wondering if we can have a feature like deferred method processing. |
Beta Was this translation helpful? Give feedback.
-
Just to add the discussion here, I've added this kind of serialization/deserialization to syntax_tree-json here: https://github.com/ruby-syntax-tree/syntax_tree-json/blob/main/lib/syntax_tree/json/serialization.rb. I just wanted to have a concrete small example of what we're talking about here so we can further discuss. Obviously this is all in Ruby, but the same principles apply. |
Beta Was this translation helpful? Give feedback.
-
Also, @flavorjones pointed me to https://en.wikipedia.org/wiki/ASN.1, which could be useful if we want to roll with a standard. |
Beta Was this translation helpful? Give feedback.
-
This PR: #17 implements the very basics of what we're talking about here. Feel free to take a look yourself. To run it, first build the extension with "1 + 2".then { |source| YARP.load(source, YARP.dump(source)) }
# => Program(Statements([Binary(IntegerLiteral(INTEGER("1")), PLUS("+"), IntegerLiteral(INTEGER("2")))])) It works for all of the stuff we have right now, and should continue to work as long as we run |
Beta Was this translation helpful? Give feedback.
-
I'm going to close this discussion in favor of the documentation that I made here: https://github.com/Shopify/yarp/blob/main/docs/serialization.md. If there's any further discussion or issues, feel free to open up again. |
Beta Was this translation helpful? Give feedback.
-
Oh lol turns out there's no closing a discussion. Okay. |
Beta Was this translation helpful? Give feedback.
-
We want a binary serialization of the syntax tree such that it can be deserialized and read in any other language. This issue is to begin looking at how that would work and what the structure would be. I'm imagining something to the effect of the below design.
First, the header:
Next, the tree. For each node, it'll walk the tree starting from the top. It will dump:
Then for each child node from this node it will recurse. Each child node type will have to have its own kind of serialization. I'll discuss those next.
*I think this is a good idea to include so that folks can skip past this node if it's trivial for the implementation. For example, if it's a
Self
node and you don't care about offset in the source then you can just skip directly to the next node in the tree.For a child node that is itself a node, it will use the schema defined above. For a child node that is a string, it will use:
For a child that is a node list, it will use:
and then for each node it will recurse.
Finally, at the end it will have a section for comments, which will look like:
then for each comment:
I believe this is enough information to get started. Certainly we're going to iterate on this.
Beta Was this translation helpful? Give feedback.
All reactions