Algebra of schema transformations #45
This probably needs a proper definition of forward/backward compatibility. Backward compatibility is achieved when processes using a newer version of a schema remain able to process data that was written using an older version of the same schema. Similarly, forward compatibility is achieved when processes using an older version of a schema remain able to process data emitted by processes that use a newer version of the said schema. More precisely, for any pair of functors
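As a concrete sketch of the two directions (all names here are hypothetical illustrations, not library code):

```scala
// Hypothetical illustration of the two compatibility directions.
// v2 adds an `age` field to v1, with a default value.
case class PersonV1(name: String)
case class PersonV2(name: String, age: Int)

// Backward compatibility: a process that knows only the NEW schema (v2)
// can still consume data written under the OLD schema (v1).
def upgrade(old: PersonV1): PersonV2 = PersonV2(old.name, age = 0)

// Forward compatibility: a process that knows only the OLD schema (v1)
// can still consume data written under the NEW schema (v2).
def downgrade(recent: PersonV2): PersonV1 = PersonV1(recent.name)
```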
Let me take a look at what I can find about this.
Note that in some cases, upgrading data written under an older schema is all that's necessary; it's not necessary to downgrade data from the latest schema to a predecessor. As for operations, I'd look for a small, orthogonal, economical set of operations and test them against specific use cases. Here are some ideas:
(1)-(4) cover simple additions and removals; (5)-(6) cover elaborations/de-elaborations; (7)-(10) cover moving information around without changing it. I have a feeling this is not quite right; something's fishy about (5)-(6). Maybe there's something more general or fundamental there. The other ones seem more solid.
I think there's a more fundamental idea here, namely that of a normal form. So my idea is that we formulate a set of reversible algebraic operations. Some of those are given by the fact that product and sum types form at least a commutative semiring (as long as you have some sort of unique tag for each term in the product/sum). Having these operations, I think we could define a normal form. I have the hunch that the steps that @jdegoes listed could be covered by adding additional, maybe non-reversible, operations to that set of algebraic transformations. This would mean that an `Iso[Schema[A], Schema[B]]` could be derived whenever two schemas reduce to the same normal form. However, I am massively lacking in the required theoretical knowledge to verify that any of the above pans out. E.g.:

- 2 & 4) not sure about these ones yet
- 5 & 6) are an application of the identity law
- 7-10) I think these would be an application of commutativity and associativity?
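A minimal sketch of what such reversible operations could look like, expressed as isomorphisms between the underlying Scala representations (the `Iso` type here is a simplified stand-in, not the library's):

```scala
// Simplified reversible transformation: a pair of mutually inverse functions.
case class Iso[A, B](to: A => B, from: B => A)

// Commutativity of products: A * B ≅ B * A
def prodComm[A, B]: Iso[(A, B), (B, A)] =
  Iso({ case (a, b) => (b, a) }, { case (b, a) => (a, b) })

// Associativity of products: (A * B) * C ≅ A * (B * C)
def prodAssoc[A, B, C]: Iso[((A, B), C), (A, (B, C))] =
  Iso({ case ((a, b), c) => (a, (b, c)) }, { case (a, (b, c)) => ((a, b), c) })

// Identity law: A * 1 ≅ A (Unit is the neutral element of products)
def prodUnit[A]: Iso[(A, Unit), A] =
  Iso({ case (a, _) => a }, a => (a, ()))

// Commutativity of sums: A + B ≅ B + A
def sumComm[A, B]: Iso[Either[A, B], Either[B, A]] =
  Iso(_.swap, _.swap)
```

Rewriting a schema modulo these laws toward a canonical ordering of fields/branches is one way a normal form could be computed.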
"Reversible" is too strong: it would prevent you from removing information. Remember the concrete problem:
Migration is generally a fit for self-describing formats like Avro, JSON, etc. Avro has some of this built in, at least the basics, but you can't do structural modifications that preserve information. If you look at these requirements, you can see that possibly you can get that with the operations sketched above. But this is not ideal. It means your schema definitions, which might consist of 100+ different structures, have to be replicated on every major version. You have the "old" schemas; when you produce a new version, you have the "new" schemas. This is not really much savings over copy/pasting your whole data model into a new package and writing conversions where necessary. Both are troubling from a maintenance perspective.

Now, at least with the Schema approach, you copy/paste the schemas, but in theory, because there's isolation between the schema and the ADT, you don't need a new ADT. So you modify your ADT but keep the old schemas the same. Maybe you version all the schemas independently so you only have to copy/paste the schemas that change. That starts to look like savings.

Even better is if you can take a schema, describe changes to that schema, and then dynamically produce a new schema based on the change list. In this case, you can imagine an append-only log of changes to a base schema, inserting version annotations as appropriate; and no schemas would ever have to be copy/pasted. Your "changelog" schema would only receive additions to some sort of list-like structure, and it can always be used to materialize the latest version given any older version. That's in the realm of magic. It's easy from a maintenance perspective. And with proper types, you have guarantees you aren't breaking backward compatibility. Your data model evolves as it needs to, and you just add entries to the changelog when you change the format of the data.
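The append-only changelog idea could be sketched roughly as follows. All names (`Change`, `AddField`, etc.) are illustrative assumptions; a real implementation would act on `Schema` values rather than untyped maps:

```scala
// Hypothetical sketch: an append-only log of schema changes, folded into an
// upgrade function on loosely-typed records.
sealed trait Change
case class AddField(name: String, default: Any)  extends Change
case class RemoveField(name: String)             extends Change
case class RenameField(from: String, to: String) extends Change

type Record = Map[String, Any]

def applyChange(r: Record, c: Change): Record = c match {
  case AddField(n, d)    => if (r.contains(n)) r else r + (n -> d)
  case RemoveField(n)    => r - n
  case RenameField(f, t) => r.get(f).fold(r)(v => (r - f) + (t -> v))
}

// Materialize the latest version from any older record by replaying the log.
def upgrade(log: List[Change])(r: Record): Record = log.foldLeft(r)(applyChange)
```

Because the log only ever grows, any older record can be brought to the latest version by replaying the suffix of changes it hasn't seen yet.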
I've drawn a sketchy diagram to illustrate backward/forward compatibility.

#### Diagram explanations
We have two target functors
In this setting, we name backward compatibility the ability to derive a morphism in one direction; that is, the identities in the diagram must hold. Similarly for forward compatibility in the other direction. A few things are worth noticing in this definition:
#### Implementation ideas

The above definition stresses the fact that successive versions of the ADT never coexist in the codebase and that we don't want to modify the wire-data. But modifying schemas is still allowed. Maybe b/f compatibility can be achieved by simply modifying the "current" schema in such a way that the resulting schema can still process data written by an older version.

Imagine for example that I have the following schema:

```scala
val sA1 = "age" -*>: prim(ScalaInt) :*: "name" -*>: prim(ScalaString) :*: "active" -*>: prim(ScalaBoolean)
```

and that I know that it is equal to the result of adding an `"age"` field to its previous version.

I can then (automatically) come up with an upgrading schema:

```scala
val sA0up = iso(
  "name" -*>: prim(ScalaString) :*: "active" -*>: prim(ScalaBoolean),
  Iso[(String, Boolean), (Int, (String, Boolean))](p => (42, p))(t => t._2)
)
```

The instance of `Iso` encodes the default value (`42`) used to fill the missing field. With the exact same information, we can also derive the downgrading version:

```scala
val sA0down = iso(sA1, Iso[(Int, (String, Boolean)), (String, Boolean)](t => t._2)(p => (42, p)))
```

#### Conclusion

I think that coming up with a solution for this issue is rather easy after all. It is "just" a matter of defining an ADT of schema transformations and a way to apply them to schemas.

But I also think that the provided solution will be quite hard to verify. By that I mean that it would be:
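The generic pattern behind the upgrading/downgrading pair above can be sketched as follows. `Iso` here is a simplified stand-in for the library's iso type, and `addFieldWithDefault` is a hypothetical helper, not existing API:

```scala
// Simplified stand-in for the library's isomorphism type.
case class Iso[A, B](to: A => B, from: B => A)

// Prepending a field with a known default to a product is reversible
// up to that default: knowing the default is enough for both directions.
def addFieldWithDefault[H, T](default: H): Iso[T, (H, T)] =
  Iso(t => (default, t), { case (_, t) => t })

// The iso used by sA0up/sA0down above, recovered generically:
val ageIso: Iso[(String, Boolean), (Int, (String, Boolean))] =
  addFieldWithDefault(42)
```

Note that `from . to` is the identity, but `to . from` only is when the dropped field carried the default value, which is exactly the sense in which downgrading loses information.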
¹ They are "imaginary" in the sense that they will never be implemented in production code.

² There are actually two diagrams overlaid on one another here. One could define backward and forward compatibility independently and draw two commuting diagrams, one with only the upgrading morphisms and the other with only the downgrading ones.

³ Although it would be possible to define upgrading or downgrading writers (e.g. some …).
Random thoughts:
I think this would deserve a whole discussion/issue on its own. My first intuition would be to aim for something like:

```scala
// In a JVM running the (new) version where A1 is defined but not A0
val a1: Schema[A1] = ???
val transfo: Transformation = ???
val upgradingA0: Schema[A1] = Schema.upgradingVia(transfo).to(a1)
val readA0asA1 = upgradingA0.to[Reads]
```

```scala
// In a JVM running the (old) version where A0 is defined but not A1
val a0: Schema[A0] = ???
val transfo: Transformation = ??? // the same as above, but here it must be obtained at run time
val downgradingA1: Schema[A0] = Schema.downgradingVia(transfo).to(a0)
val readAsA0 = downgradingA1.to[Reads]
```

Note that in each case, only the version of the ADT that exists in the running JVM is referenced.
Hi, your talk at Scalar was very interesting. I think forward compatibility would be very interesting, especially with streaming applications. With only backward compatibility, all consumers have to be updated before the producer is updated, which is quite painful.

As mentioned above, forward compatibility needs some way to get schemas at runtime. The two approaches I'm aware of are embedding them in the record (protobuf) or fetching them from a versioned repository (Confluent schema registry for Avro).

For doing the actual migration from the writer's schema to the reader's schema (so `up.down.???`), we would need some way to retrieve the writer's schema from serialized data without being able to fully deserialize it. This probably has to be something format-specific, like a version field in JSON or a magic byte in binary formats.

I think this would tie in nicely with the migration-step approach mentioned above, where every migration rule would map to a new version. So given a history of (version, migration, data schema):

```
(1, _, {"f0": {"type": "Int", "default": 1}})
```

we should be able to deserialize something like `{"version": 0}` in an application running version 3 as `{"version": 3, "f1": 1}`, with the compiler guaranteeing correctness. This is an example Avro struggles with: confluentinc/schema-registry#209
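The version-gated replay described above could look roughly like this (all names are hypothetical; a real implementation would be typed against schemas rather than raw maps):

```scala
// Sketch of version-gated migration replay, assuming every record embeds a
// "version" field and migration v turns a version v-1 record into a version v record.
type Record = Map[String, Any]

def migrateTo(latest: Int, history: Map[Int, Record => Record])(r: Record): Record = {
  val current = r.getOrElse("version", 0).asInstanceOf[Int]
  // Replay only the migrations the record hasn't seen yet, bumping the version tag.
  (current + 1 to latest).foldLeft(r) { (acc, v) =>
    history.getOrElse(v, (x: Record) => x)(acc) + ("version" -> v)
  }
}
```

With this shape, a `{"version": 0}` record replayed to version 3 through a migration that adds `"f1"` with default `1` yields the `{"version": 3, "f1": 1}` result described above.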
We call "schema transformation" any operation one can apply to a schema that maintains the guarantee that data written with the original schema can be read using the new one (backward compatibility), and vice versa (forward compatibility).
These transformations are implemented in Avro, although they are probably not identified as an algebra by Avro's authors.
So the two steps for solving this issue can be:
Alternatively, searching through the academic literature should also yield interesting results.