
Incremental Changes Support #245

Open
XedinUnknown opened this issue Dec 17, 2024 · 3 comments

Comments


XedinUnknown commented Dec 17, 2024

Problem

This library can be used to create migrations that set up a DB schema from scratch.
It cannot be used to create migrations that bring the DB from one non-empty state to another; that is, it cannot create migrations that apply incremental changes on top of an existing schema.

Solution

Add support for generating migrations that configure the schema starting from an arbitrary state, not only from an empty one.

States

A migration can be seen as a diff: it encodes all of the changes that have to be made to state A in order to bring it to state B. At present, state A = 0 (an empty schema), and there is no way to reliably use any other state as a base for changes.

Of course, even in the current implementation, migrations can be used on top of a non-empty DB. However, the existing state of the DB may not be compatible with these migrations: creating a table with a field of a specified name and type would fail if the DB already contains a table with that name and that field. In this scenario, the expected behaviour would be to alter the existing table: either add the new field (if it doesn't exist yet), or change the existing field to match the specified configuration. The way forward therefore seems to be to allow specifying an arbitrary state A, to which migrations could be applied in order to bring it into state B.

State Representation

At present, state B is represented by a live DB: the generated migrations execute queries that bring a non-conflicting live database A into the same state as an existing live database B. This is not ideal, because it requires an accessible database, but it works in practice: most Laravel projects already have such a database, and initializing one is trivial with supporting tools.

The same cannot be said about state A: the requirement to have two live databases in different states is not part of any usual workflow, and not something that can easily be achieved. Therefore, it could make sense to save a serialized representation of state A, in order to compare it to state B.

Note

When that is achieved, it appears relatively simple to serialize state B as well, which could make this package more versatile and efficient; however, that is outside of the scope of this proposal.

DBML

DBML (Database Markup Language) is the only widely used DSL for documenting cross-platform DB schemas, making it the de facto interop standard for exchanging such information. The butschster/dbml-parser package is a reasonably current parser for it, which makes reading a schema from this format easy. The egyjs/dbml-to-laravel package already uses DBML for code generation, and according to its docs it does a pretty good job of it. Therefore, I suggest using DBML for schema serialization.
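
For illustration, a serialized state A could be a small DBML file like the one below. The tables are hypothetical; the syntax follows the DBML docs:

```dbml
Table users {
  id int [pk, increment]
  email varchar(255) [unique, not null]
  created_at timestamp
}

Table posts {
  id int [pk, increment]
  user_id int [ref: > users.id]
  title varchar(255) [not null]
}
```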

Note

egyjs/dbml-to-laravel claims to generate enums that are then used in other generated code. This integrated approach is very beneficial, and I would love to see your tool move in the same direction. However, that is out of scope for this feature request.

Diffing

Assuming that the schemas of both states A and B are available via a PHP API, here is what diffing could look like (a PHP sketch follows the list).

  1. Drop every table in A that is not in B.
  2. For every table in B:
    1. Does a table with that name exist in A?
      1. No - build the table definition with fields and constraints, and generate code.
      2. Yes - alter that table:
        1. Drop every column in A that is not in B.
        2. For every column in B:
          1. Does a field with that name exist in the same table of A?
            1. No - add the new field.
            2. Yes - alter that column:
              1. Is the field definition (type/size/nullability) in B different from that in A?
                1. No - continue; nothing to do.
                2. Yes - create a new definition that is the same in all ways except what has changed.
        3. Drop every index in A that is not in B.
        4. For every index, compare similarly to fields.
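
To make this concrete, here is a minimal, framework-free sketch of such a diffing pass. The function name and array shapes are illustrative, not this package's API; a real implementation would build state A from a DBML parser and state B from the live database:

```php
<?php

/**
 * Sketch: compute human-readable operations that bring $a (base) to $b (target).
 * Schemas are plain arrays of table => [column => definition].
 *
 * @return string[]
 */
function diffSchemas(array $a, array $b): array
{
    $ops = [];

    // 1. Drop every table in A that is not in B.
    foreach (array_diff_key($a, $b) as $table => $_) {
        $ops[] = "drop table {$table}";
    }

    // 2. For every table in B...
    foreach ($b as $table => $columns) {
        if (!isset($a[$table])) {
            // 2.1.1 No such table in A: create it with all of its columns.
            $defs = array_map(fn ($name, $def) => "{$name} {$def}", array_keys($columns), $columns);
            $ops[] = "create table {$table} (" . implode(', ', $defs) . ')';
            continue;
        }

        // 2.1.2 The table exists in A: alter it column by column.
        foreach (array_diff_key($a[$table], $columns) as $column => $_) {
            $ops[] = "alter {$table}: drop column {$column}";
        }
        foreach ($columns as $column => $definition) {
            if (!isset($a[$table][$column])) {
                $ops[] = "alter {$table}: add column {$column} {$definition}";
            } elseif ($a[$table][$column] !== $definition) {
                // Type/size/nullability changed: emit a change, keep the rest.
                $ops[] = "alter {$table}: change column {$column} to {$definition}";
            }
        }
        // Indexes would be compared the same way as columns.
    }

    return $ops;
}

// Example: 'legacy' is dropped, 'users.email' changes, 'users.name' is added.
$stateA = ['users' => ['id' => 'bigint', 'email' => 'varchar(100)'], 'legacy' => ['id' => 'int']];
$stateB = ['users' => ['id' => 'bigint', 'email' => 'varchar(255)', 'name' => 'varchar(100)']];

print_r(diffSchemas($stateA, $stateB));
```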

Renaming

It isn't clear what to do about tables, fields, and indexes (collectively, "objects") that have been renamed (constraints contain no data, so they don't appear to be a concern here). If my_table has been renamed to my_cool_table, the diff will drop my_table and create my_cool_table; the same goes for other named objects. This is very sub-optimal, and somewhat defeats the point of having incremental migrations: to preserve existing data while changing the underlying schema. Renamed objects should result in an alteration in the migration, rather than being dropped and re-created under a different name.

At the same time, it is unclear how renaming should be implemented: it doesn't seem possible to compare 2 schemas and tell with certainty whether an object in B is a renamed object in A or not. They may be very similar in structure but have different purposes; they may also have a slightly different name but a significantly different structure.

This seems to dictate that additional information is necessary in order for the intended migration to be generated. Where this information comes from is a topic of its own, but it could be any combination of the options below (a configuration sketch follows the list).

  1. A separate configuration source, such as a file living in the config dir of a Laravel project.
    The problem is that such a file is not supposed to change frequently, while rename information differs between codegen runs.
  2. Command line args. When the tool is run, renamed objects could be specified by name, indicating that the generator should alter those objects instead of replacing them.
    This is good because every run that generates incremental migrations can specify its own renames, which may differ from one run to another.
  3. Schema metadata. Tables, fields, and indexes can all have comments associated with them. These comments could be compared to determine whether an object in B is the same object as in A, or a different one.
    This is good because a database object can carry an identifier representing its identity directly on the object itself (it would survive a re-import to/from a MySQL dump, etc.), and any run would yield the expected results without additional input.
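
As a sketch of option 1, a hypothetical Laravel config file could record the renames for the next codegen run. The file name and keys are illustrative; no such config exists in this package today:

```php
<?php

// config/migrations-generator-renames.php (hypothetical).
// Maps old names in state A to new names in state B, so the generator
// can emit rename/alter operations instead of drop + create.
return [
    'tables' => [
        'my_table' => 'my_cool_table',
    ],
    'columns' => [
        // 'table.old_column' => 'new_column'
        'my_cool_table.descr' => 'description',
    ],
    'indexes' => [
        'my_cool_table.descr_index' => 'description_index',
    ],
];
```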

XedinUnknown (Author) commented

Added "Diffing" section. Probably that's all for now; I'm ready for feedback.

kitloong (Owner) commented Jan 3, 2025

Thank you @XedinUnknown for the detailed description. I would like to raise the following questions from my perspective.

> Therefore, it could make sense to save a serialized representation of state A, in order to compare it to state B

I agree with you about the diffing part, which is quite similar to what I would imagine for incremental changes.

If this is possible, the migrations generator should run the incremental changes automatically; hence, the DBML creation should be handled automatically by the generator.

Assuming we could generate state B DBML from a live database, how can we create state A DBML? At the point of running migration generators, we have no reference to state A.

I imagine we could upgrade the generator to read the existing migration files to create state A DBML. However, at this point, the generator would need the following abilities:

  1. Generate migrations from an existing database. (AS-IS)
  2. Generate DBML from an existing database. (TO-BE)
  3. Generate DBML from existing migration files. (TO-BE)

I think this is a bit too much for the migrations generator, which should focus on 1.

Renaming

This is a huge challenge for implementation. We must never allow the generator to produce any drop commands in the up method of migrations. A single mistake could cause disastrous damage to the live database. I believe we are on the same page regarding this.

> A separate configuration source, such as a file living in the config dir of a Laravel project.
> The problem is that such a file is not supposed to change frequently, while rename information differs between codegen runs.

Yes, I agree on the problem.

> Command line args. When the tool is run, renamed objects could be specified by name, indicating that the generator should alter those objects instead of replacing them.
> This is good because every run that generates incremental migrations can specify its own renames, which may differ from one run to another.

Yes, but this requires too much rename/alter information to be provided to the generator for significant changes. If such information is already recorded, why not have the user create migration files manually instead?

> Schema metadata. Tables, fields, and indexes can all have comments associated with them. These comments could be compared to determine whether an object in B is the same object as in A, or a different one.
> This is good because a database object can carry an identifier representing its identity directly on the object itself (it would survive a re-import to/from a MySQL dump, etc.), and any run would yield the expected results without additional input.

We should not modify the comments of tables/fields/indexes, as the owner may utilize them for their own explanation or description purposes.

My thought

The proposed implementation brings an interesting way to utilize DBML for diff checking. However, it still has unresolved challenges that I believe are not easily solvable at this moment:

  1. Renaming Is Unachievable: Even with the second option, it requires too much manual information to communicate changes to the generator.
  2. Complexity of Reverse DBML Generation:
    • Generating DBML from a live database adds unnecessary complexity and shifts the focus away from migration generation.
    • Reversing DBML from migration files is even more complicated, as migration files may not provide a consistent or complete representation of the database state.

XedinUnknown (Author) commented Jan 3, 2025

> If this is possible, the migrations generator should run the incremental changes automatically; hence, the DBML creation should be handled automatically by the generator.

I'm not sure what you mean by running the incremental changes automatically, but I disagree that DBML creation should be handled automatically by the generator. Perhaps the rest of my response will provide more details as to why.

> Assuming we could generate state B DBML from a live database, how can we create state A DBML? At the point of running migration generators, we have no reference to state A.

Currently, state B is read from a live database, without any DBML. I would keep it that way, as changing this would change the workflow. So, before, you were migrating from state A = 0 to B; now you would be migrating from an arbitrary state A to B, not only from the zero state. For now, there is no need to serialize B, which is what my note is about:

> When that is achieved, it appears relatively simple to serialize state B as well, which could make this package more versatile and efficient; however, that is outside of the scope of this proposal.

Therefore, the only thing that remains to be serialized is state A. I don't see this as something that the generator should handle; rather, it should be left up to the consumer, who would enjoy great flexibility:

  1. Maybe they have already committed a DBML file as the single source of truth.
  2. Maybe they have committed an SQL dump file, which they can convert to DBML using the DBML CLI tool.
  3. Maybe they want to generate the SQL dump from another live database.
  • So, by default, the generator would operate as it does now: migrate from A = 0 to B, where B is the current live database.
  • If the consumer specifies state A, then the migrations would modify that state in order to get to B, instead of doing it from scratch.
  • In the future, you may want to add support for also specifying state B: perhaps as a DBML file rather than the current live DB.

The workflow for adding incremental changes could therefore be:

  1. Check out commit X in the codebase, which has migrations for state A.
  2. Generate a DBML file for state A, perhaps by:
    1. Running the migrations at that commit.
    2. Exporting an SQL file using mysqldump or artisan schema:dump, whichever they choose.
    3. Converting the dump to DBML.
  3. Check out commit Y in the codebase, which has migrations for state B.
  4. Run the migrations at that commit.
  5. Run the generator, specifying state A as the source, with state B being the current state (like it is now).

In other words, it is the responsibility of the consumer to provide state A in DBML format; they may take a snapshot of any DB state that comes from anywhere, whether migrations in commit X, or custom-written migrations, or just DBML that they have designed elsewhere.
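
For concreteness, here is what that workflow could look like on the command line, assuming a MySQL connection and the @dbml/cli tool; the --base-schema flag on the generator is hypothetical and does not exist today:

```sh
# Commit X: reproduce state A and snapshot it as DBML.
git checkout X
php artisan migrate:fresh    # run the migrations for state A
php artisan schema:dump      # writes database/schema/mysql-schema.sql for a mysql connection
npx -p @dbml/cli sql2dbml database/schema/mysql-schema.sql --mysql -o state-a.dbml

# Commit Y: bring the live database to state B.
git checkout Y
php artisan migrate:fresh

# Generate incremental migrations from A (the DBML file) to B (the live DB).
# NOTE: --base-schema is hypothetical, for illustration only.
php artisan migrate:generate --base-schema=state-a.dbml
```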

> We must never allow the generator to produce any drop commands in the up method of migrations

I disagree with the above statement. It's perfectly valid for a table added in v1 of an application to be removed in v2, with no renaming or anything like that.

> A single mistake could cause disastrous damage to the live database

The generator does not modify any database; it only writes changes to migrations. It is up to the consumer (and always has been) to review those changes before committing them and applying them to their database.

> Yes, but this requires too much rename/alter information to be provided to the generator for significant changes

It requires some information, yes. But if migrations from state A to B involve renaming, and this information is not obtainable from anywhere else (as discussed, renames cannot be inferred with any certainty from the states themselves), then this is exactly the information that must be provided. The metadata approach makes this easier, and the two can be used in conjunction.

> We should not modify the comments of tables/fields/indexes, as the owner may utilize them for their own explanation or description purposes.

I agree, we should not. But we should allow the consumer to specify that they want the generator to use metadata as a source of information about object identity, in which case they would specify that information themselves. We can provide a guide as to what works best.
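
For example, an opt-in identity-marker convention could look like this in DBML, using its note syntax (the marker format is hypothetical):

```dbml
// The marker stays attached to the object across renames, so the differ
// can match my_cool_table in state B to my_table in state A.
Table my_cool_table [note: 'identity: 1f2e3d'] {
  id int [pk, increment]
  description varchar(255) [note: 'identity: descr-field']
}
```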

So to address the points in your summary:

  1. Renaming seems very achievable using a combination of configuration, CLI args, and metadata. In fact, without renaming, the generator would never have enough utility to make the incremental-changes feature usable in a real project.
  2. Complexity of serialized-state generation is entirely avoidable by outsourcing the generation of state A to the consumer (which is good for them, too), and by continuing to use the live DB as state B for now.
