Tuesday, February 5, 2008

Remarks about Database Version Control

This post was triggered by a Coding Horror blog post in which Jeff Atwood pointed to a recent series of five articles on versioning databases by K. Scott Allen. Let's start with a brief synopsis of each article:

1. Three rules for database work
1. Never use a shared database server for development work.
"a shared database is a tar pit waiting to fossilize a project... they ultimately waste time and help produce bugs".
2. Always Have a Single, Authoritative Source For Your Schema
"
Everyone should ... have a frictionless experience in getting a fresh database setup."
3. Always Version Your Database
"to propagate changes from development, to test, and ultimately to production in a controlled and consistent manner. A second goal is to have the ability to recreate a database at any point in time."

2. The Baseline
Start with a baseline schema once the initial design is somewhat settled. What should it include? The creation of all relevant database objects (tables, constraints, functions, indexes), but also commands to populate lookup tables with static data and any bootstrap data needed by the application.
One script for everything vs. one file per object? A middle road makes sense: e.g. create tables, defaults and indexes in a single file, views and functions in another. Automation: prepare a script or application that installs the database without human intervention.
A reader's tip for a greenfield project: "We always rebuild dev db from scratch. Then baseline script (with test data) can be versioned. Start creating change scripts only after release one."
Important: use a SchemaChangeLog table, e.g. (id, release_number X.Y.Z, script_name, date_applied). The initial record should be written as the last step of the baseline script.
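A minimal sketch of such a table, and of that final step of the baseline script, might look like this (the column names follow the article's list; the types and concrete values are my assumptions):

    -- SchemaChangeLog table as described in the article; types are assumed
    CREATE TABLE SchemaChangeLog (
        id             INT          NOT NULL PRIMARY KEY,
        release_number VARCHAR(20)  NOT NULL,   -- X.Y.Z, e.g. '01.00.0000'
        script_name    VARCHAR(100) NOT NULL,
        date_applied   DATETIME     NOT NULL
    );

    -- last step of the baseline script: record the baseline itself
    INSERT INTO SchemaChangeLog (id, release_number, script_name, date_applied)
    VALUES (1, '01.00.0000', 'baseline.sql', CURRENT_TIMESTAMP);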

3. Change Scripts
Change scripts capture deltas, i.e. how to transform the schema from one version to the next. They often contain complex data migration or transformation statements. Use a script naming convention, e.g. sc.01.00.0001.sql, so that an automation tool can compare against SchemaChangeLog and select the scripts to apply. A commit means the schema change is officially published; as a consequence, committed change scripts should never be modified. What if the production db fails in the middle of an update? Always back up before... or write idempotent scripts, as advocated by Phil Haack (his implementation is SQL Server specific, but an Oracle solution can be found in the blog comments).
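For illustration, an idempotent change script might look roughly like this (SQL Server flavoured, since that is what Haack targets; the Customer table, the new column and the log values are invented for the example):

    -- sc.01.00.0001.sql: add a column only if it is not already there,
    -- so re-running the script after a partial failure does no harm
    IF NOT EXISTS (SELECT 1 FROM INFORMATION_SCHEMA.COLUMNS
                   WHERE TABLE_NAME = 'Customer'
                     AND COLUMN_NAME = 'MiddleName')
    BEGIN
        ALTER TABLE Customer ADD MiddleName NVARCHAR(50) NULL;
    END
    GO

    -- record the change so the automation tool can skip this script next time
    IF NOT EXISTS (SELECT 1 FROM SchemaChangeLog
                   WHERE script_name = 'sc.01.00.0001.sql')
    BEGIN
        INSERT INTO SchemaChangeLog (id, release_number, script_name, date_applied)
        VALUES (2, '01.00.0001', 'sc.01.00.0001.sql', GETDATE());
    END
    GO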

4. Views, Stored Procedures and the Like
Always drop all objects, then apply scripts to create the objects from scratch. Scott's strategy is to hold one view (trigger, stored procedure) per file. No change scripts are needed: delete the view's file from source control and the view is deleted. It's that simple.
This approach works because there is no data to migrate or transform. As a consequence, change scripts cannot rely on views/triggers (they can create and drop a temporary one, though). Pros: some problems show up early (e.g. a column used by a view was deleted), and it is easy to go backwards in time. Cons: every update requires downtime, which could be a problem in 24/7 production.
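A typical "one view per file" script could then look like this (SQL Server syntax; the view and the underlying table are made up):

    -- ActiveCustomers.sql: drop and recreate, so the file in source control
    -- is always the single authoritative definition of the view
    IF OBJECT_ID('dbo.ActiveCustomers', 'V') IS NOT NULL
        DROP VIEW dbo.ActiveCustomers;
    GO

    CREATE VIEW dbo.ActiveCustomers AS
        SELECT CustomerId, Name
        FROM dbo.Customer
        WHERE IsActive = 1;
    GO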

5. Branching and Merging
The technique described above is forward-only, i.e. branching is not easy... Use a "branch per release" approach and hope that schema changes in a branch will be rare. Otherwise look at Ruby Migrations by Elliot Smith and Rob Nichols.
After branching, create a new baseline in the trunk (update major or minor version number). If a fix is applied in the branch, then its schema change script can be adapted and renamed to fit the trunk numbering.

There are a few additional ideas in an older article by Martin Fowler and Pramod Sadalage: DBAs should collaborate closely with developers. Informal communication is crucial, both while developers are preparing changes in their own instance and when reviewing changes before they are committed to a shared master.
Scripts in the form of SQL DDL (for the schema change) and DML (for the data migration) also include data for a growing number of tests. Chaining together a sequence of small automated changes is much easier than applying one huge change. Automating reverse changes would be possible, but there was no demand for it. "We update all developer databases automatically when there is a commit to the master."

In my opinion, the ideal solution would keep the complete database structure under source control (e.g. DDL scripts capturing the current schema), with forward (and perhaps also backward) change scripts generated automatically from the diff. Several readers suggested keeping the schema in XML for this purpose. However, proper ordering can be tricky (a constraint on table1 may depend upon table2). Also, automating data migration is often difficult, if not impossible; Steve Campbell suggests hand-written versioned scripts that can be appended to the generated diff script before each run.
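As a rough illustration of that suggestion, the script actually run against a database could be a generated schema diff with a hand-written, versioned migration step appended (all names below are hypothetical):

    -- generated part: produced automatically by diffing old and new schema
    ALTER TABLE Customer ADD FullName NVARCHAR(101) NULL;

    -- hand-written part: appended before the run, because no diff tool
    -- can guess how to populate the new column from the existing data
    UPDATE Customer
    SET FullName = FirstName + ' ' + LastName
    WHERE FullName IS NULL;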

Looking at my personal experience, there is much to be improved: we used to have a shared developers' database. Initially, baseline scripts were maintained manually to match the current schema (there was only one version in production and very low risk of moving back to a previous version), until they fell out of sync. Nobody bothered to fix them, as all new databases were created by cloning the production schema. On the other hand, change scripts were very elaborate, with complex data manipulation, and well tested on the test and staging environments before going to production. Downtime was required because of replication (stop replication, change the schema in all databases, start replication, migrate data).

Many of our lookup and control tables were populated from Excel spreadsheets (a customer requirement), with relatively frequent updates (e.g. weekly). Data loading is often intermixed with data transformation scripts. Instead of using a Windows-specific solution, we opted for a JDBC-based loading mechanism. I wrote configuration-driven tools for copying data between any two databases, so using an Excel source does not require anything special.
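In practice the mix of loading and transformation looks roughly like this: the JDBC tool copies the spreadsheet rows into a staging table, and a SQL script then reshapes them into the real lookup table (table and column names here are invented; our real scripts were more involved):

    -- StagingCountry is filled 1:1 from the Excel sheet by the JDBC copy tool;
    -- the transformation step then refreshes the actual lookup table
    DELETE FROM CountryLookup;

    INSERT INTO CountryLookup (country_code, country_name)
    SELECT UPPER(LTRIM(RTRIM(iso_code))),
           LTRIM(RTRIM(name))
    FROM StagingCountry
    WHERE iso_code IS NOT NULL;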

Monday, February 4, 2008

What's on Jason's Hard Drive?

This title is borrowed from an interesting article by Jason Hunter, the well-known author of Java Servlet Programming, which describes his "paperless house" organization. It was published in November 2006. Main components:
  • A scanner, with the habit of scanning all relevant documents to TIFF (Jason keeps the paper originals in case of an audit)
  • All documents are kept in a version control system. His choice: Perforce (for historical reasons; it can be used for free with at most 2 users and up to 5 workspaces), or alternatively Subversion (CVS is not a good choice when most documents are binary). Even though many documents exist in only one version, version control brings portability
  • Hierarchical organization of files. Everything in one repository - personal, business, scans, photos, source code, articles, ebooks.
Cons: all files are present on every client (is this a drawback?). Subversion would even keep two copies on each client (working file plus pristine base). The repository will be backed up (or RAID-ed) as well, i.e. a lot of redundancy.
An explicit commit is needed.

There are also a few additional tips and links in the discussion section.

TODO: what is my organization of information?
logbooks, PeterWeekReports

As a side note: how did I run into Jason's article? Spring's multipart (file upload) documentation claims support for COS FileUpload (http://www.servlets.com/cos). Jason is the author of the COS (com.oreilly.servlet) support classes (last version from November 2002). However, COS support was removed from the Spring codebase starting with version 2.1. Currently, Spring supports only Commons FileUpload (http://jakarta.apache.org/commons/fileupload) in traditional mode, without a streaming API (note also the additional dependency on Commons IO).

Oh yes, and a brief explanation of changes across Servlet specifications can be found in Jason's articles on JavaWorld.