The Problem With Database Diffs

When talking about how LiquiBase works, I’m often asked “Why not just do a database diff to know what changes need to be applied to a new database?”

There are advantages to the database diff technique compared to LiquiBase’s method of tracking changes, mainly related to not relying on developers to remember to document the changes they made. You simply have everyone make whatever changes they want to the dev database and let the diff tool sort it out.

There are two fundamental flaw with diff tools, however. The first is is that while they do a good job of showing what the differences are syntactically, they don’t understand the semantics of the changes. For example, if you rename a column from “fname” to “firstname”, a diff will show you that there is a “fname” column that needs to be dropped and a “firstname” column that needs to be be added. If you follow these suggestions, you will end up with a database that is structured correctly, but you will loose your valuable data when the changes is applied to an existing production system. You need to not just understand what the structural differences are, you also need to understand why they are.

The second flaw is that it is impossible to diff the data so you find inserts, updates, and deletes that must be applied for the system to work correctly. These data changes can range from additional lookup table values to copying data in a de-normalizing process and can be just as important as the structural changes. Theoretically, a diff tool could also check for new, updated, and missing data between database, but in practice this cannot work for two reasons:

  1. Performance. As your data set grows, the amount of information to compare grows until it is unmanageable.
  2. Changing Data. During development, test data is often added to the development database that shouldn’t be copied into other databases. Also, new data may be added to testing and production databases that should not be deleted just because it doesn’t exist in the development database.

If you track your data changes along with your structural changes during development, however, you can be certain that they will be applied correctly down the road.

In the end, I feel that the problems associated with a diff tool outweighs the convenience and simplicity they offer. While they can have great value as a “did all the changes get stored in the change log correctly” safeguard, it should not be relied apon as your primary means of tracking database changes.

9 Comments

  1. Jeston:

    Did you intend that the last statement instead should have been: “… it should NOT be relied upon as your primary means of tracking database changes.” ?
    Or did I understand something incorrectly?

  2. Nathan Voxland:

    You are right. Thanks for pointing that out, I fixed the post.

  3. Jeremy:

    “but you will loose your valuable data”
    You’ll set it loose into the world?

  4. Boy:

    TL;DR:
    1. Structural diffing can’t detect a rename (would be added as drop and add, potentially losing data for that column / table).
    2. Data diffing is impractical due to performance and test data.

    Makes sense.

  5. Fabio Pisaruk:

    I agree with you but there are some scenarios where doing a diff would be the only practical way to get the changed information.
    Let’s suppose we were given a system, without its source code, that could be customized using a user interface and the customization information resides in some tables. After some customization done by the user interface we could perform a database diff and see what was changed so that we would be able to apply this change to another database without making use of the interface.

    Thanks.

  6. Dave Brokaw:

    I agree that the issues you note can be problems when you rename or refactor schema objects. However, I’d argue that the most common changes in a stable system are simply adding tables / columns, and auto-diffing works fine for these if you have reasonable default values (which you typically need for back-compat anyway). From this angle, I’d love to see a system that supports a diff-based approach for this common (80%? 50%?) case, and makes it easy to add optional transforms (e.g. SQL snippets) indicating how to move data for the rename / refactor cases if/when it’s actually needed.

  7. enorniel:

    I would like to know if is there a workaround for datas diffs.

    I have a CI structure and scripts to clean dev datas before prod but using git I would like my team to be able to commit changes in schemas and datas automatically.

    They works directly on sql and don’t report their changes in a separate changelog file (it is not intuitive at all), so I put in place a diff for schemas and I would like the same for datas.

    Thanks.

  8. nick:

    If you have 5 tables, yes, you can possibly do manual diff. But I have 500 and sifting through diff xmls is very painful. In 3.1.1 “diffChangeLog” for some strange reason converts all identity columns form numeric(X,X) to number.
    No, I don’t buy it that diff is bad.

  9. Nathan Voxland:

    Sifting through a large XML is ugly and the current liquibase diff implementation may have unexpected differences, but that is separate from the point I was trying to make. If you are using a diff based workflow (whether it is liquibase’s diff or another diff tool) you are at the mercy of the diff tool to figure out what changes actually happened. If it gets those changes wrong (like doing a drop/add instead of a rename) you will run into trouble.

    It is better to use a workflow where you build up your change script in a known and understood way as you are developing without relying on a diff process at all.

Leave a comment