Archive for June 2007

The Problem With Rails Active Migrations

Rails Active Migration is nice for its simplicity, but in non-trivial projects, it quickly falls apart due to limitations it has regarding multiple developers and/or branches. It’s a well known problem

The fundamental problem is that Rails tracks the “database version” as a single incrementing integer. That works fine when only one developer is adding migrations and when there is only one branch. When you add developers and branches, however, you quickly run into problems with duplicated version numbers, and missed migrations because the production “database version” is higher than a newly merged in migration.

The reason this is a problem is because there the reliance on a the “database version” concept. While there may be a logical database version that you can think about, really the state of the database is simply the set of all the applied migrations. Think of a file in source control. The source control system may generate “version” numbers for reference, but it doesn’t use them to decide what should be merged in. When you say “merge the “1_1″ branch into trunk, it simply takes all the changes in the “1_1″ branch and applies the same changes to trunk, regardless of whether the file in trunk has a higher “version number” or not. The same logic needs to be applied to database updates.

There are attempts to solve the situation in the Ruby community. The work-around solution works by naming each change by a timestamp and storing all the executed changes in the schema_migrations table. While it is a step in the right direction, if you need to re-order the execution of changes after a commit by multiple developers or after a merge it can be difficult. Hopefully a good solution to the problem is created soon, or we will have to develop a port of LiquiBase for Ruby :)

LiquiBase 1.0 Released

The LiquiBase team is proud to announce version 1.0. LiquiBase is an open source (LGPL) java-based tool for managing database changes and refactorings. It has been under active development for over a year and supports many features including:

  • Change tracking format that supports multiple developers and code branches
  • Thirty built-in refactorings including “Merge Columns” and “Add Lookup Table”
  • Can execute updates directly, or save SQL for review by DBAs
  • Can roll back databases to earlier versions based on dates, tags, or number of changes
  • Database independent. Currently supports MySQL, PostgreSQL, Oracle, and MSSQL with additional databases planned for version 1.1.
  • Can be executed as an Ant task, a Maven Plug-in, as a Servlet Listener, or though a command-line program
  • Changes can be tagged with “contexts” so not all changes need to be applied to all environments
  • Uses a distributed locking system to protect against machines upgrading the same database at the same time
  • Extensive documentation including a quick-start guide and manual

As a database change tracking tool, LiquiBase is useful for any project with a database, but is especially useful in an agile environment due to the large number of changes that are generated throughout the project’s lifecycle.

There are many post-1.0 features planned, including support for additional databases (DB2, Sybase, Derby and HSQL are already implemented in the 1.1 branch), a database-refactoring IDE plug-ins, additional refactorings, a database diff tool, and more.

We would like to thank everyone who helped us get to the point we are at today.

Managing Change Logs

While you can include all your change sets in one giant change log, there are many good reasons to split them up. While there are many ways to divide your change logs, the best strategy I have found is to create a change log per major java package that contains data access code. This approach has several advantages:

Easy to Know Where to Add (and Look For) Database Changes

If you are making a database change due to code in a package, you know exactly where to put the change.

Makes Code-Reuse and Code-Repackaging Easier

Since packages are often used as a logical code unit, the code to manage database changes can be moved and re-used along with the java code.

Keeps Database Changes Closer to the Code

You don’t need to go hunt through your file navigator as you switch between adding required databases changes and making the corresponding code changes.

Fewer File Conflicts

On multi-developer teams, the database change logs are a shared resource with many individuals editing them. Like any version controlled file, the more developers and the more branches that touch a file, the more chances for problems. Breaking up a change log into multiple files limits the changes and extent of merge issues.

There are, of course, other ways to break up your change log files including one change log per project and one change log per table. Depending on your requirements, these or other strategies may work better for you. The important thing is to find what works best for you.

The Problem With Database Diffs

When talking about how LiquiBase works, I’m often asked “Why not just do a database diff to know what changes need to be applied to a new database?”

There are advantages to the database diff technique compared to LiquiBase’s method of tracking changes, mainly related to not relying on developers to remember to document the changes they made. You simply have everyone make whatever changes they want to the dev database and let the diff tool sort it out.

There are two fundamental flaw with diff tools, however. The first is is that while they do a good job of showing what the differences are syntactically, they don’t understand the semantics of the changes. For example, if you rename a column from “fname” to “firstname”, a diff will show you that there is a “fname” column that needs to be dropped and a “firstname” column that needs to be be added. If you follow these suggestions, you will end up with a database that is structured correctly, but you will loose your valuable data when the changes is applied to an existing production system. You need to not just understand what the structural differences are, you also need to understand why they are.

The second flaw is that it is impossible to diff the data so you find inserts, updates, and deletes that must be applied for the system to work correctly. These data changes can range from additional lookup table values to copying data in a de-normalizing process and can be just as important as the structural changes. Theoretically, a diff tool could also check for new, updated, and missing data between database, but in practice this cannot work for two reasons:

  1. Performance. As your data set grows, the amount of information to compare grows until it is unmanageable.
  2. Changing Data. During development, test data is often added to the development database that shouldn’t be copied into other databases. Also, new data may be added to testing and production databases that should not be deleted just because it doesn’t exist in the development database.

If you track your data changes along with your structural changes during development, however, you can be certain that they will be applied correctly down the road.

In the end, I feel that the problems associated with a diff tool outweighs the convenience and simplicity they offer. While they can have great value as a “did all the changes get stored in the change log correctly” safeguard, it should be relied apon as your primary means of tracking database changes.

LiquiBase 1.0 RC2 Released

LiquiBase 1.0 RC2 has been released. Changes since RC2 are primarily minor bug fixes.

The reason for having a RC2 rather than 1.0 final is that we changed exceptions thrown by some methods and want to make sure that change does not cause any unexpected problems.

Please report all bugs you find with the release candidate so they can be addressed in the 1.0 release, particularly in the Maven support.

Unless there are show-stopper bugs found, the final 1.0 release should be out next week.

LiquiBase 1.0 RC1 Released

LiquiBase 1.0 RC1 has been released. Changes since 0.9.1 are primarily code refactoring to make sure it is a clean code base to work with post-1.0. The only new feature is a batch and shell script for easier running of the command line migrator. There were also the usual documentation improvements and bug fixes.

Please report all bugs you find with the release candidate so they can be addressed in the 1.0 release, particularly in the Maven support. Unless there are show-stopper bugs found, the final 1.0 release should be out in two weeks or less.

Building Database Tests that Don’t Break

As I discussed in Unit Testing the Database, a difficulty you run into when using a shared dataset for unit tests is making sure you write your test in such a way that changes to the dataset will not break older tests.

Note: depending on the database access framework you use, you will actually access the database through a Connection, EntityManager, Session, PersistenceManager etc. I’ll use the term “Connection” generically for any of these access types because the same pattern applies to them all.

Don’t Rely on Particular Data

The following is an example of a bad test that will break when the dataset changes:

public void testFindActive() {     Collection returnSet = new PersonDAO(connection).findActive();     assertEquals(5, returnSet.size());}

When you first write the above test, it will work fine because you have know your data set has 5 person rows in it. However, when you get to the next test and realize you need more test data and add a new person row, the testFindActive() method will suddenly fail when the method being tested isn’t actually broken.

What I have found to be the best way of writing better tests is to create a framework that lets you describe what type of rows you expect your method to return. Here is a code example:

public void testFindActive() {assertDataCorrect(”Did not return active people correctly”,    new DataComparator(entityManager, Person.class) {            public boolean include(Person person)  {                    return person.isActive();            }    }}

There are two advantages to this method of testing:

  1. It doesn’t break if person rows are added or removed
  2. The test better documents the intent of the isActive() method
  3. Since you are describing your filter via normal object methods, the test will stay up to date as you refactor your code.

Of course, you need to have an implementation of assertDataCorrect(). Unfortunately, although it is fairly straightforward to implement, how you actually implement it varies depending on your database access framework. Here is a pseudo-code sample to get you started:

public abstract class DataTestCase extends TestCase {assertDataCorrect(String message,            DataComparator comparator, Collection            collectionToTest) {

    assertNotNull(”Collection to test was null, poor test”, collectionToTest);    assertTrue(”Collection to test was empty, poor test”, collectionToTest.size() > 1);

    List allObjects = comparator.getEntityManager        .createQuery(”from ” + comparator.getEntityClass().getName());

    int matchingObjects = 0;    List filteredObjectsFromDB = filterObjects(comparator, allObjects);    for (Object objectFromDB : filteredObjectsFromDB) {      matchingObjects++;      if (!collectionToTest.contains(object)) {            fail (”Expected object ‘”+object.toString()+”‘ not found in collection”);      }    }    assertTrue(”Expected “+matchingObjects            +” objects, passed collection was “+collectionToTest.size(),            matchingObjects == collectionToTest.size());}}

Some notes on the above implementation:

  1. The class extends TestCase so as long as all your database tests extends the new DatabaseTestCase you will always have access to the new assertions.
  2. If the passed collection is null or empty, the test fails. The reason for this is because if your query is returning nothing from the database, you don’t really know if it is filtered down correctly. It could be working correctly, or it could be you have a typo in your filter that will always make it return no results.
  3. The assert function selects out all rows from the database and compares them to the returned objects. If you have a lot of rows, this will be a very poor-performing test. In practice, your test dataset will normally not grow so large this will become a problem, but it is something to keep in mind if you ever think about running your test suite against a copy of production data.

Eventually, I hope to create a database unit testing library that better encapsulates this logic, but have not had the time yet. If there is anyone who would like to volunteer to help out with creating one, please contact me at nathan [at] sundog.net.

Start a transaction in your setUp() method and roll it back in your tearDown() method

By running your test in a transaction, you ensure that (normally) whatever your code under test does to the database will have no lasting affect and will not cause later tests to fail. This is a simple extension of the unit testing principal of not assuming your tests will run in a particular order and to have no side effect of your tests.

If your code under test attempts to start and commit transactions, you may need to create a wrapper connection around the actual connection that intercepts calls to start and end transactions and simply logs that they happened. That way you can test that the commit you expected actually happened and still actually roll back the database in your tearDown() method.

The thing to watch out for is auto-committing code in your class under test. For example, if your tested logic creates a new table, many databases will auto-commit the transaction. In these cases, you will need to manually undo the changes in a finally block in your test so you still follow the “no side-effect” rule.

Unit Testing the Database Access Layer

The Problem

Writing unit tests for code that accesses a database is a difficult problem, and it’s one I’ve struggled with for several years. The problems come down to a balancing act between several (often competing) requirements:

  1. The unit test must not not break due to changes outside the class under test.
  2. The unit test must not break due to code refactoring that doesn’t break functionality.
  3. The unit test must run very fast.
  4. The unit test must remain valid through database refactorings.
  5. The unit test must fail if the code doesn’t function correctly.

I’ve tried several approaches in the past and was unsatisfied with them:

Mock the Database Connection

This is the most “pure” unit testing approach because it completely isolates the class under test from all external dependencies.

Connection conn = createMock(Connection.class);
Statement statement = createMock(Statement.class);
expect(conn.createStatement()).andReturn(statement);
expect(statement.execute(”SELECT ID FROM PERSON;”)).andStubReturn(true);
statement.close();
expectLastCall();
replay(conn);
replay(statement);

ClassUnderTest classUnderTest = new ClassUnderTest ();
classUnderTest .findPersonIds();

verify(conn);
verify(statement);

The advantages to mocking the database connection is that the tests run fast and don’t change to due to creating of other unit tests.

In practice, however, I’ve found these tests to be more or less worthless. They (like most mock tests) really just test the implementation of the method, not the behavior of the method, and therefore often fail due to code refactoring. The biggest problem, however, is that the access to the database is really the whole point of the method and there are so many ways that can fail.

I would argue, that really the method is implemented as SQL with a Java wrapper around it and in the mock unit test, only the Java code is really tested–the SQL is often just cut and pasted from what is in the method.

Throw in changes to the database over the life of the project (the “person” table name changes to “employee”) and you end up with tests that pass, but don’t really tell you if your code will work.

Create Test Data Sets For Each Test

The obvious response to the problems with mock database tests is to have your tests access an actual database. This method will create tests that actually fail if your SQL is wrong or your database schema changes without a corresponding code update, but brings with it its own difficulties, primarily with keeping the database in a known state. There are tools available such as DbUnit and Unitils to help with the test-data management, but these have disadvantages.

The main problem is that the way the test data is stored, it is very dependent on the exact schema when it was created, and if your database structure changes your test data can no longer be inserted and your tests are now worthless. For example, if your test is created with test data that inserts rows into a “person” table, but later that table is renamed to “employee”, your insert statements will no longer execute. Depending on the schema changes, you may be able to recover your test data with a search and replace, but often the changes are too much and it has been long enough since you wrote the test that you don’t remember exactly what was supposed to be in the data.

Additionally, the data is inserted as part of the setup method for each test. Accessing databases in Unit tests slows them down considerably already, we should work to minimize any calls to the database that we have to do.

Manage Test Data With All Other Database Changes

To solve the problem of keeping test data definitions from getting out of sync with the schema, you need to have your test data built up along with your database so it will be modified by database refactoring that were made after it was initially created. For example, here is a database change script:

  1. create table person
  2. insert row into person [test data]
  3. insert row into person [test data]
  4. rename table person to employee

By including the test data with your database changes, the data is automatically kept up in the same way the production data would be. Using this technique over the dataset per method also has the advantage of performing better because the data is only inserted once, but it has the disadvantage that you need to handle all the test data that any method would want in one place.
To facilitate this technique, I built the idea of execution contexts into LiquiBase so you can mark the test data changes and only apply them in environments where you run unit tests. So far I have been happy with the results. The tests fail when there is a differences between the database schema and what the code is expecting or when there is a bug in my SQL and I haven’t lost any tests due to database refactorings.

The tests don’t run as fast as mock connection tests would, but they aren’t prohibitively slow. We have over 1000 unit tests in one project that all run in about a minute. It would be better if they would all run in 10 seconds, but what good are fast tests that don’t actually test what’s important? Just remember to mock the (already tested) data access layer in higher layers of your code so you don’t take the database performance hit any more than you need to.

With this technique, you need to always remember that your test data set will evolve over time and write your tests accordingly, but how to do that will be covered in a later blog posting.

The Road to LiquiBase 1.0 and Beyond

There is a lot going on in LiquiBase-land right now.

We are finishing up code and documentation clean-up of the LiquiBase migrator and hope to have an initial release candidate available by the middle of June.

As long as no major bugs are found, we expect 1.0 to be released by the end of June.

Concurrent to the final code reviews and bug fixes leading up to 1.0, we have started on a SQuirreL plug-in for automatic creation of changelog files. There is no target date yet for an initial release, but it will probably be post 1.0. If there is anyone interested in working on a plug-in for other IDEs, just let us know.

We are also already looking at features to include in a 1.1 release. Items currently on the top of the list include an additional changeSet attribute for specifying the types of databases that it should be run against as well as support for additional databases (Sybase, DB2, Derby, perhaps more).