Serialization #11

kddnewton · 2022-09-28T23:46:41Z

kddnewton
Sep 28, 2022
Maintainer

We want a binary serialization of the syntax tree such that it can be deserialized and read in any other language. This issue is to begin looking at how that would work and what the structure would be. I'm imagining something to the effect of the below design.

First, the header:

# bytes	field
4	YARP
1	major version
1	minor version
1	patch version
4	comments offset into file
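
A minimal Ruby sketch of the header layout above, using `Array#pack` and `String#unpack`. The thread does not fix a byte order or any padding here, so little-endian with no padding is an assumption, and the helper names `dump_header` and `load_header` are hypothetical.

```ruby
# Layout from the table above: 4 bytes magic, three one-byte version
# components, 4-byte comments offset. Little-endian and unpadded (assumed).
# "a4" = 4 raw bytes, "C" = unsigned 8-bit, "L<" = unsigned 32-bit little-endian.
HEADER_FORMAT = "a4CCCL<"

# Hypothetical helper: build the header bytes.
def dump_header(major, minor, patch, comments_offset)
  ["YARP", major, minor, patch, comments_offset].pack(HEADER_FORMAT)
end

# Hypothetical helper: read the header back and check the magic bytes.
def load_header(bytes)
  magic, major, minor, patch, comments_offset = bytes.unpack(HEADER_FORMAT)
  raise ArgumentError, "not a YARP blob" unless magic == "YARP"

  { version: [major, minor, patch], comments_offset: comments_offset }
end

header = dump_header(0, 1, 0, 64)
parsed = load_header(header)
```

With this unpadded layout the header is 11 bytes long, so the 4-byte offset field starts at byte 7 and is not 4-byte aligned.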

Next, the tree. The serializer walks the tree starting from the top and, for each node, dumps:

# bytes	field
1	type
4	offset of next node in tree* / length of this node
4	start offset in source
4	end offset in source

Then for each child node from this node it will recurse. Each child node type will have to have its own kind of serialization. I'll discuss those next.

*I think this is a good idea to include so that folks can skip past this node if it's trivial for the implementation. For example, if it's a Self node and you don't care about offset in the source then you can just skip directly to the next node in the tree.
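
The skipping idea in the footnote can be sketched in Ruby as follows. This assumes the second field is the node's total length in bytes, header included (one of the two readings the table allows), and little-endian integers; the type numbers and the helper names are made up for illustration.

```ruby
# Node header from the table above: type, length, start offset, end offset.
NODE_HEADER = "CL<L<L<"
NODE_HEADER_SIZE = 13

# Hypothetical helper: serialize a node whose body (its already-serialized
# children) is a binary string. The length covers header plus body.
def dump_node(type, start_offset, end_offset, body = "".b)
  header = [type, NODE_HEADER_SIZE + body.bytesize, start_offset, end_offset]
  header.pack(NODE_HEADER) + body
end

# Visit each sibling in the blob without ever descending into children:
# the length field says exactly how far to jump.
def each_sibling(blob)
  pos = 0
  while pos < blob.bytesize
    type, length, start_offset, end_offset =
      blob.byteslice(pos, NODE_HEADER_SIZE).unpack(NODE_HEADER)
    yield type, start_offset, end_offset
    pos += length
  end
end

child = dump_node(2, 0, 1)
blob = dump_node(1, 0, 5, child) + dump_node(3, 6, 10)

types = []
each_sibling(blob) { |type, _start, _end| types << type }
```

Here the walk sees node types 1 and 3 only; the child of type 2 is skipped because it lives inside the first node's length.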

For a child node that is itself a node, it will use the schema defined above. For a child node that is a string, it will use:

# bytes	field
4	start offset
4	end offset

For a child that is a node list, it will use:

# bytes	field
4	number of elements in the list

and then for each node it will recurse.

Finally, at the end it will have a section for comments, which will look like:

# bytes	field
4	number of comments in the file

then for each comment:

# bytes	field
4	start offset
4	end offset
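
As a sketch of the comments section, assuming little-endian 32-bit integers, end offsets that are exclusive, and a `comments_offset` taken from the header (the helper names are hypothetical):

```ruby
# Hypothetical helper: a count followed by (start, end) pairs.
def dump_comments(ranges)
  [ranges.length, *ranges.flatten].pack("L<*")
end

# Hypothetical helper: read the pairs back, starting at comments_offset.
def load_comments(blob, comments_offset)
  count = blob.byteslice(comments_offset, 4).unpack1("L<")
  blob.byteslice(comments_offset + 4, count * 8).unpack("L<*").each_slice(2).to_a
end

source = "1 # one\n2 # two\n"
section = dump_comments([[2, 7], [10, 15]])

comments = load_comments(section, 0)
texts = comments.map { |from, to| source.byteslice(from, to - from) }
```

Only offsets are stored, so the comment text is recovered by slicing the original source.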

I believe this is enough information to get started. Certainly we're going to iterate on this.

kddnewton · 2022-09-28T23:47:36Z

kddnewton
Sep 28, 2022
Maintainer Author

@eregon, @iliabylich, @enebo would love your feedback on this. If this makes sense, I can get started on a basic shell of what this would look like and we can iterate on it from there.

2 replies

eregon Sep 29, 2022
Maintainer

major version

I guess encoded as an unsigned byte, so each version component can be between 0-255 (if as an ASCII digit it would only allow 0-9 which is IMHO too restricting).

Of course everything should be 4-byte or maybe even 8-byte aligned (needed to read the fields fast). If this is done with a struct in C, I think it should already align e.g. the comments offset into file automatically (i.e., we'd have 1 padding byte after patch version).
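
To make the padding arithmetic concrete, here is a small Ruby sketch. It assumes 4-byte alignment and a little-endian offset field; `padding_for` and `dump_aligned_header` are hypothetical names, not part of any existing API.

```ruby
# Number of zero bytes needed so that `offset` becomes a multiple of `alignment`.
def padding_for(offset, alignment = 4)
  (-offset) % alignment
end

# Hypothetical padded header: magic, three version bytes, padding, comments offset.
def dump_aligned_header(major, minor, patch, comments_offset)
  prefix = ["YARP", major, minor, patch].pack("a4CCC")
  prefix + ("\0".b * padding_for(prefix.bytesize)) + [comments_offset].pack("L<")
end

aligned_header = dump_aligned_header(0, 1, 0, 64)
```

The magic plus version bytes take 7 bytes, so one padding byte moves the offset field to byte 8 and the header to 12 bytes.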

offset of next node in tree*

Another way to express it is a length field, i.e., how many bytes does this node take from here. It's the same thing anyway.

8 start offset in source

Would 4 bytes be enough? I don't think we need to care about Ruby files larger than 2/4GB 😅
I think we can replace all 8 above by 4, should help quite a bit for the size of the output.

For a child node that is a string

So the string would be stored somewhere in the bytes (e.g., towards the end, after the nodes), and those offsets are offsets from the bytes returned by the parser, right?

Sounds great to me.

eregon Sep 29, 2022
Maintainer

We'll also need some flags for some nodes (e.g., is a CallNode using &.), those could be individual bytes mapping to the fields, or maybe some bitset as a single integer. The bitset would only be advantageous if we have more than 4 flags for a node, as we'd want to 4-byte align anyway what comes after the flags. The individual bytes seem simpler to serialize/deserialize and avoid needing hardcoded flag values, masks, etc.
I'd suggest putting the flags right after the type. The number of flags is of course decided by the type.
(with the bitset we could also do 2 bytes for the type and 2 bytes for the flags, since we have fewer than 65k node types, but it's probably overkill for now)
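
The two flag encodings can be compared with a short Ruby sketch. The flag names and bit positions below are invented for illustration; nothing in the thread fixes them.

```ruby
# Hypothetical flags for a call node.
SAFE_NAVIGATION = 1 << 0
VARIABLE_CALL   = 1 << 1

# Option A: one byte per flag. No masks needed on either side.
def dump_flag_bytes(safe_navigation, variable_call)
  [safe_navigation ? 1 : 0, variable_call ? 1 : 0].pack("CC")
end

def load_flag_bytes(bytes)
  bytes.unpack("CC").map { |byte| byte == 1 }
end

# Option B: a single bitset. Smaller, but both sides must agree on the masks.
def dump_flag_bitset(safe_navigation, variable_call)
  bits = 0
  bits |= SAFE_NAVIGATION if safe_navigation
  bits |= VARIABLE_CALL if variable_call
  [bits].pack("C")
end

def load_flag_bitset(bytes)
  bits = bytes.unpack1("C")
  [(bits & SAFE_NAVIGATION) != 0, (bits & VARIABLE_CALL) != 0]
end

as_bytes = dump_flag_bytes(true, false)
as_bitset = dump_flag_bitset(true, false)
```

Both decode to the same pair of booleans; the difference is one byte per flag versus shared mask constants.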

iliabylich · 2022-09-29T00:04:53Z

iliabylich
Sep 29, 2022

What are the advantages of using binary serialisation over the standard way of writing bindings?

For C++ we can wrap it with extern "C" { ... }
for Ruby we can either create something that converts C structs into pure Ruby objects or use something like shared_ptr (with ref counter that is incremented on .dup/.clone and decremented on GCing) and wrap individual C objects with Ruby objects (so that they have attached Data).
For Rust we can use bindgen crate
For Java .. I don't know much about Java, but there must be a way to generate Java src files and load dylib (similar to bindgen)

And we can use codegen (based on config.yml in this repo) to quickly generate them.

1 reply

eregon Sep 29, 2022
Maintainer

Essentially what @chrisseaton said below.

These are my thoughts/ideas:
I think using the C structs directly only works well in C (and maybe C++, not sure for Rust).

For other languages it's a big overhead to go between managed (e.g. Java, Ruby) and native (librubyparser) for every single node to convert.
The shared_ptr-like approach seems very brittle and dangerous.

So for Ruby I think we should also use the binary serialization, because:

A C extension which creates a lot of small nodes would be slow, it would be a lot of slow rb_funcall() and overhead
JRuby doesn't support C extensions (and using a C ext for this would also be a significant warmup overhead for TruffleRuby)
It should be fast and yet have a very usable Ruby API, so e.g. one Ruby class per node type.

Concretely for Java we'd get the result of the parser as a big char*+length and wrap that in a ByteBuffer, byte[] or long address with Unsafe.
For Ruby we'd get the result of the parser as a big char*+length and wrap that in a Ruby String and use unpack to deserialize.

The code to deserialize and serialize would be mostly automatically generated, using node types and config.yml, etc.
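
A rough Ruby sketch of what such a generated deserializer could look like. The `SCHEMA` hash is a hypothetical stand-in for whatever would be derived from config.yml, the type numbers are made up, and the byte layout follows the first post with little-endian integers assumed.

```ruby
# Hypothetical schema: type id => child fields and how each one is serialized.
SCHEMA = {
  1 => { body: :node_list },   # e.g. a statements node
  2 => { value: :string }      # e.g. an integer literal
}.freeze

# One class per node type (anonymous Structs keep the sketch short).
NODE_CLASSES = SCHEMA.transform_values do |fields|
  Struct.new(:start_offset, :end_offset, *fields.keys)
end

class Reader
  def initialize(source, blob)
    @source = source
    @blob = blob
    @pos = 0
  end

  def read_u32
    value = @blob.byteslice(@pos, 4).unpack1("L<")
    @pos += 4
    value
  end

  def read_node
    type = @blob.getbyte(@pos)
    @pos += 1
    read_u32 # node length; unused when every node is read eagerly
    start_offset = read_u32
    end_offset = read_u32
    children = SCHEMA.fetch(type).values.map { |kind| read_child(kind) }
    NODE_CLASSES.fetch(type).new(start_offset, end_offset, *children)
  end

  def read_child(kind)
    case kind
    when :node then read_node
    when :node_list then Array.new(read_u32) { read_node }
    when :string
      from = read_u32
      to = read_u32
      @source.byteslice(from, to - from)
    end
  end
end

# Hand-built blob for the source "1": a statements node holding one literal.
source = "1"
literal = [2, 21, 0, 1, 0, 1].pack("CL<*")
blob = [1, 13 + 4 + literal.bytesize, 0, 1, 1].pack("CL<*") + literal
tree = Reader.new(source, blob).read_node
```

The whole tree is rebuilt from one string with no per-node call into native code, which is the point being argued above.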

chrisseaton · 2022-09-29T09:50:14Z

chrisseaton
Sep 29, 2022

Some context from my perspective is about getting an AST from YARP into a managed runtime (JS, Java, etc.) Specifically, how will JRuby and TruffleRuby get these ASTs.

Two options are a push approach, and a query approach.

In the push approach, the managed runtime provides a visitor pattern interface with methods for each node type, and YARP walks the AST and calls the visitor methods. The managed runtime can either directly produce something like bytecode from those visits, or it can recreate the AST within the managed environment.

In the query approach, YARP provides the managed runtime an interface with methods to get handles to child nodes from a node, and to read properties from a node via the handle. Maybe the query could be advanced, allowing you to read grandchildren and things like that.

Both of these have the same problem - an AST is comprised of a large number of very little objects. In both approaches that means many, many calls between YARP and the managed environment. Calls between native (YARP) and managed can be very expensive, as each call has to park the managed runtime and set up the expected native environment. We'll also have lots of strings, which will have to be individually copied.

The benefit of the serialised approach for me isn't saving to disk or anything like that - it's to be able to have a single managed to native call, which gives you everything you need in one go. All strings are copied in one go, as they're part of the stream. Then the managed runtime does the de-serialisation safely and simply within the managed environment, rebuilding the AST, or directly creating bytecode, or whatever it wants.

For me, this means that I'm not looking for a compact representation - simple to read is more important, and ideally simple to seek.

I'd like to keep this use-case in mind and co-develop the interface with JRuby and TruffleRuby as part of the serialisation format.

1 reply

eregon Sep 29, 2022
Maintainer

For context, I initially thought a visitor-like pattern where the parser calls things like visitIntegerLiteral(position, literalValue) or so would be nice and make for a simple interface for the parser. But it seems not flexible enough (e.g. I need information from parent nodes such as the local scope and constant scope. And what if I need to visit nodes in a slightly different order? etc). It would likely just result in creating the tree on the other side before transforming it again to some other representation, with the significant overhead of thousands of such callbacks from C to Java (and returns to C after each Java callback returns).
So since we'd want an AST with a Java class per node type anyway, we might as well deserialize fast from a binary blob rather than slowly callback-by-callback.

chrisseaton · 2022-09-29T12:13:13Z

chrisseaton
Sep 29, 2022

Another thought is that to save time, the parser could directly generate the serialised blob, rather than reifying the AST on the YARP side, only to serialise it, throw it away, send the serialised blob across the managed boundary, for it to be reified again.

1 reply

eregon Sep 29, 2022
Maintainer

Yeah that'd be cool but I'm not sure how realistic it is (but maybe it is, I'm unsure).

For instance that prevents any desugaring or tree transformations/cleanups, which are likely needed. Although I guess at least some of them could be done on the fly.
For instance foo as a local var vs a method call is decided rather late in current Ruby parsers IIRC, i.e. I'm not sure the lexer can find out the list of local variables in a scope (but maybe yes?). And so the parser might need a second pass on the tree to resolve that (I might have this wrong, feel free to correct me).

Also the parser might need to backtrack or so, or might have some state and then serializing on the fly seems hard.

enebo · 2022-09-29T14:38:07Z

enebo
Sep 29, 2022
Collaborator

In agreement with @chrisseaton and @eregon we need this to be a coarse single call for JRuby/Java. It should just be a single piece of data (array) which is easy to navigate. I also think space traded for simpler navigation is worth it but I still would like it as small as it can be. Because...

My one other feature request is I would like to be able to skip past defs in the tree (just record its location for later).

In JRuby our IR will process the AST of methods only if they are ever called. The number of methods called in a typical library is small (I remember ~80% unused in simple Rails commands) so this helps with both memory and time to process. Also our IR is more memory than our AST (which should be true of MRI as well).

It looks like with this design I have to process it to a Java form. Even if it is simple to skip past defs and record them:

No line/column info so I have to process the def lightly to keep track of current position for the pieces I will be processing
I will save a copy of the original source and blob. This could just be the part for the def if this repr is a depth-first array (and the text from start-end offset). It becomes less clear this would save memory.

I am not saying this is unacceptable. I think the intention is that we will process the blob into our backend impl. For me, the dream was to walk the blob and directly create IR. Generating an intermediate AST feels like it would be adding time and potentially additional memory (and for sure more allocations).

Since we will no longer pay a warmup time at startup for our lexer/parser to get up to speed this design will definitely be faster than what we have now. So this is merely just a comment wondering if we can have a feature like deferred method processing.
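
One way to picture deferred method processing in Ruby: record where a def lives in the blob and only deserialize it on first use. `LazyDef` and its loader block are hypothetical; how the offset and length are obtained (for example from a per-node length field) is assumed rather than specified here.

```ruby
# Hypothetical lazy wrapper: remembers the byte range of a def and runs the
# caller-supplied deserializer only the first time the body is needed.
class LazyDef
  attr_reader :offset, :length

  def initialize(blob, offset, length, &loader)
    @blob = blob
    @offset = offset
    @length = length
    @loader = loader
    @body = nil
  end

  def loaded?
    !@body.nil?
  end

  def body
    @body ||= @loader.call(@blob.byteslice(@offset, @length))
  end
end

blob = "....DEFBYTES....".b
calls = 0
lazy = LazyDef.new(blob, 4, 8) do |bytes|
  calls += 1
  bytes # a real loader would build IR or an AST from these bytes
end

before = lazy.loaded?
first = lazy.body
second = lazy.body
```

The loader runs once, on the first `body` call; a def that is never called costs only the recorded range.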

8 replies

kddnewton Sep 29, 2022
Maintainer Author

A constant pool makes a lot of sense here also because it could potentially drastically cut down on the size of the serialized format, making interop faster.

enebo Sep 29, 2022
Collaborator

@kddnewton We do this in our IR serialization just to cut down on size. It makes a huge difference just from identifiers alone.

kddnewton Sep 29, 2022
Maintainer Author

Yeah that's a pretty convincing argument.

kddnewton Sep 29, 2022
Maintainer Author

It's not dissimilar to how Ruby interns strings.

eregon Sep 29, 2022
Maintainer

and the text of the source for that definition since at least string literals will not be stored in the serialized tree.

I'd think many string literals actually cannot just refer to the source because of escape sequences, "a" \ "b", etc.
So since we need string literals that are allocated outside the source I'd think it'd make sense to have all strings be part of the serialized blob, so there is no need to keep the source around while using the AST or at least there is no need to read stuff from it while using the AST.

Yeah I think a string pool would be pretty nice to cut down the size of the serialized blob.
Ruby implementations will have to re-intern on top though, because interning is done across files. But they might be able to cache offset (or much better as @enebo said, pool index as then it can be just an array for the cache) to runtime object (e.g. RubyString, RubySymbol) per file so there is no need to reread the string or to do a more expensive hash computation + lookup on every occurrence of the same string.
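
A small Ruby sketch of a string pool plus the per-file cache keyed by pool index. The pool layout (a count, then length-prefixed strings, little-endian) and all names here are assumptions for illustration, not the format the project settled on.

```ruby
# Hypothetical pool: each distinct string is stored once and referred to by index.
class StringPool
  def initialize
    @indexes = {}
    @strings = []
  end

  # Return the pool index for a string, adding it on first sight.
  def intern(string)
    @indexes[string] ||= begin
      @strings << string
      @strings.length - 1
    end
  end

  # Assumed layout: 4-byte count, then for each string a 4-byte size and its bytes.
  def dump
    entries = @strings.map { |string| [string.bytesize].pack("L<") + string.b }
    [@strings.length].pack("L<") + entries.join
  end
end

def load_pool(bytes)
  count = bytes.unpack1("L<")
  pos = 4
  Array.new(count) do
    size = bytes.byteslice(pos, 4).unpack1("L<")
    string = bytes.byteslice(pos + 4, size)
    pos += 4 + size
    string.force_encoding(Encoding::UTF_8)
  end
end

pool = StringPool.new
indexes = %w[foo bar foo].map { |name| pool.intern(name) }
strings = load_pool(pool.dump)

# Per-file cache from pool index to runtime object: a plain array lookup,
# so repeated identifiers need no hashing after the first use.
cache = Array.new(strings.length)
symbol_for = ->(index) { cache[index] ||= strings[index].to_sym }
symbols = indexes.map { |index| symbol_for.call(index) }
```

The repeated identifier `foo` is stored once in the blob and converted to a runtime symbol once per file.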

kddnewton · 2022-09-29T15:31:37Z

kddnewton
Sep 29, 2022
Maintainer Author

Just to add to the discussion here, I've added this kind of serialization/deserialization to syntax_tree-json here: https://github.com/ruby-syntax-tree/syntax_tree-json/blob/main/lib/syntax_tree/json/serialization.rb. I just wanted to have a concrete small example of what we're talking about here so we can further discuss. Obviously this is all in Ruby, but the same principles apply.

0 replies

kddnewton · 2022-09-29T16:12:39Z

kddnewton
Sep 29, 2022
Maintainer Author

Also, @flavorjones pointed me to https://en.wikipedia.org/wiki/ASN.1, which could be useful if we want to roll with a standard.

6 replies

eregon Sep 29, 2022
Maintainer

Interesting, it looks like we came up with exactly, or almost exactly, the same general layout (https://en.wikipedia.org/wiki/ASN.1#Example_encoded_in_DER, https://en.wikipedia.org/wiki/X.690#BER_encoding).
Since most of the deserializers would be generated, I don't think correctness is a big issue (should be easy), and we'd most likely not want to depend on some existing library to do that for us.
But it seems nice to conform to an existing standard.
This standard seems fairly flexible overall though, so it does not seem self-sufficient or so and we'll need to document the serialization format design in any case.

eregon Sep 29, 2022
Maintainer

As a concrete example, the length seems suboptimal to me: https://en.wikipedia.org/wiki/X.690#Length_octets
It seems to optimize size over efficiency to encode/decode (it causes more branches and non-alignment and some architectures like ARM really don't like unaligned reads/writes).
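
To illustrate the extra branching, here is a Ruby sketch of a definite-form X.690 length decoder next to a fixed 4-byte read. The fixed-width side assumes little-endian; both function names are hypothetical, and the indefinite form (which DER does not allow) is not handled.

```ruby
# X.690 definite-form length: a first byte below 0x80 is the length itself
# (short form); otherwise its low 7 bits give how many big-endian length
# bytes follow (long form). Returns [length, position after the length].
def read_der_length(bytes, pos)
  first = bytes.getbyte(pos)
  return [first, pos + 1] if first < 0x80

  count = first & 0x7f
  length = 0
  count.times { |i| length = (length << 8) | bytes.getbyte(pos + 1 + i) }
  [length, pos + 1 + count]
end

# Fixed-width alternative: always 4 bytes, no branching on the value.
def read_fixed_length(bytes, pos)
  [bytes.byteslice(pos, 4).unpack1("L<"), pos + 4]
end

short_form = read_der_length("\x05".b, 0)
long_form = read_der_length("\x82\x01\x00".b, 0)
fixed = read_fixed_length([256].pack("L<"), 0)
```

The variable-length form saves bytes for small values, but whatever follows it lands at a position that depends on the value, which is the alignment concern raised above.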

flavorjones Sep 30, 2022

I think I hear you saying that, if you had to make a choice between using a standard that makes deserialization simple, or creating a custom format that performs better for us, you'd choose the latter.

I think that's a defensible decision since we want to encourage most folks to use the Ruby API (and not the binary format)!

It's also worth noting that by choosing this path we're implicitly making a design decision to make consumption of the binary format more challenging in exchange for performance. Is this an acceptable design decision for everyone?

flavorjones Sep 30, 2022

If I was going to write this design decision out in full, it might be something like:

With the binary format, we're choosing to focus on performance over ease of consumption. We assume that casual users can use the Ruby API and only Ruby implementors who care deeply about performance will be interested in the binary format.

Does that sound about right?

eregon Sep 30, 2022
Maintainer

TBH I'm doubtful consumption is any easier by following such a standard. For instance it's easy in any language to read a 32-bit int from 4 bytes as native-endian (e.g., String#unpack), but it's already more work to deal with BER, even with https://ruby-doc.org/stdlib-2.4.0/libdoc/openssl/rdoc/OpenSSL/ASN1.html. And the meaning of the node type values is something specific to this parser, so not something ASN.1 can understand (except for a few primitive types but clearly not for node types).

You are right though that for the vast majority of (all?) applications the deserialized AST in that language is the API we expect people to use, not directly the serialization format which is basically internal since we generate both the serializer and deserializers for multiple languages in this repo.
And indeed I think the serialization format here should be geared towards performance more than compactness (since the typical use cases will both serialize+deserialize on top of parsing, so we want that first path as fast as possible), while still avoiding wasting too much memory of course (relevant for the lazy parsing use case, not so much for eager parsing as it'd just be a short-lived allocation).

kddnewton · 2022-09-29T19:15:09Z

kddnewton
Sep 29, 2022
Maintainer Author

This PR: #17 implements the very basics of what we're talking about here. Feel free to take a look yourself. To run it, first build the extension with bundle exec rake compile. Then run bin/console:

"1 + 2".then { |source| YARP.load(source, YARP.dump(source)) }
# => Program(Statements([Binary(IntegerLiteral(INTEGER("1")), PLUS("+"), IntegerLiteral(INTEGER("2")))]))

It works for all of the stuff we have right now, and should continue to work as long as we run bin/template.

0 replies

kddnewton · 2022-11-01T17:03:20Z

kddnewton
Nov 1, 2022
Maintainer Author

I'm going to close this discussion in favor of the documentation that I made here: https://github.com/Shopify/yarp/blob/main/docs/serialization.md. If there's any further discussion or issues, feel free to open up again.

0 replies

kddnewton · 2022-11-01T17:04:05Z

kddnewton
Nov 1, 2022
Maintainer Author

Oh lol turns out there's no closing a discussion. Okay.

0 replies
