Tuesday, February 5, 2008

Remarks about Database Version Control

This post was triggered by a Coding Horror blog post, where Jeff Atwood pointed to a recent series of five articles on versioning databases by K. Scott Allen. Let's start with a brief synopsis of each article:

1. Three rules for database work
1. Never use a shared database server for development work.
"a shared database is a tar pit waiting to fossilize a project... they ultimately waste time and help produce bugs".
2. Always Have a Single, Authoritative Source For Your Schema
"Everyone should ... have a frictionless experience in getting a fresh database setup."
3. Always Version Your Database
"to propagate changes from development, to test, and ultimately to production in a controlled and consistent manner. A second goal is to have the ability to recreate a database at any point in time."

2. The Baseline
Start with a baseline schema once the initial design has somewhat settled. What should be included? Creation of all relevant db objects (tables, constraints, functions, indexes), but also commands to populate lookup tables with static data. Include bootstrap data needed by the application.
One script for all vs. one file per object? A middle road makes sense: e.g. create tables, defaults, and indexes in a single file, views and functions in another. Automation: prepare a script/application that installs the database without human intervention.
User tip for a green field project: "We always rebuild dev db from scratch. Then baseline script (with test data) can be versioned. Start creating change scripts only after release one."
Important: use a SchemaChangeLog table, e.g. (id, release_number X.Y.Z, script_name, date_applied). The initial record should be written as the last step of the baseline script.
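A minimal sketch of the SchemaChangeLog idea, using SQLite for illustration. The column names follow the post (id, release_number, script_name, date_applied); the exact DDL and types are my assumption, not Scott Allen's:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SchemaChangeLog (
        id             INTEGER PRIMARY KEY,
        release_number TEXT NOT NULL,      -- e.g. '1.0.0'
        script_name    TEXT NOT NULL,
        date_applied   TEXT NOT NULL
    )
""")

# ... the baseline script creates tables, constraints, indexes, lookup data ...

# Last step of the baseline script: record the baseline itself.
conn.execute(
    "INSERT INTO SchemaChangeLog (release_number, script_name, date_applied) "
    "VALUES (?, ?, datetime('now'))",
    ("1.0.0", "baseline.sql"),
)
conn.commit()

row = conn.execute(
    "SELECT release_number, script_name FROM SchemaChangeLog").fetchone()
print(row)  # ('1.0.0', 'baseline.sql')
```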

3. Change Scripts
Change scripts capture deltas, i.e. how to transform the schema from one version to the next. They often contain complex data migration or transformation statements. Use a script naming convention, e.g. sc.01.00.0001.sql, so that an automation tool can compare against SchemaChangeLog and select the scripts to apply. Committing a script means the schema change is officially published; as a consequence, update scripts should never be changed afterwards. What if the production db fails in the middle of an update? Always back up before... or write idempotent scripts, as advocated by Phil Haack (his implementation is SQL Server specific, but an Oracle solution can be found in the blog comments).
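The script-selection step can be sketched as follows. This is not the author's tool, just an illustration of how an update tool might parse the sc.&lt;major&gt;.&lt;minor&gt;.&lt;patch&gt;.sql convention and pick the scripts not yet recorded in SchemaChangeLog, in order:

```python
import re

SCRIPT_RE = re.compile(r"^sc\.(\d+)\.(\d+)\.(\d+)\.sql$")

def version_of(script_name):
    """Parse sc.MM.mm.pppp.sql into a sortable version tuple."""
    m = SCRIPT_RE.match(script_name)
    if m is None:
        raise ValueError("not a change script: " + script_name)
    return tuple(int(p) for p in m.groups())

def pending_scripts(available, applied):
    """Return change scripts not yet applied, sorted by version."""
    done = {version_of(s) for s in applied}
    todo = [s for s in available if version_of(s) not in done]
    return sorted(todo, key=version_of)

applied = ["sc.01.00.0001.sql"]   # would come from SchemaChangeLog
available = ["sc.01.00.0002.sql", "sc.01.00.0001.sql", "sc.01.00.0010.sql"]
print(pending_scripts(available, applied))
# ['sc.01.00.0002.sql', 'sc.01.00.0010.sql']
```

Note that sorting on parsed version tuples (rather than on the raw file names) keeps the ordering correct even when patch numbers reach different digit counts.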

4. Views, Stored Procedures and the Like
Always drop all objects, then apply scripts to create them from scratch. Scott's strategy is to hold one view (trigger, stored procedure) per file. No change scripts are needed: delete the view file from source control and the view is deleted. It's that simple.
This approach works because there is no data to migrate/transform. As a consequence, change scripts cannot use views/triggers (they can create and drop a temporary one, though). Pros: some problems show up early (e.g. a column used by a view was deleted); it is easy to go backwards in time. Cons: every update requires downtime, which could be a problem in 24/7 production.
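The drop-and-recreate cycle for views can be sketched like this (SQLite used for illustration; in practice each view would live in its own file under source control rather than in an inlined dict):

```python
import sqlite3

# One "file" per view; here inlined for the sake of a runnable example.
view_files = {
    "active_users.sql":
        "CREATE VIEW active_users AS SELECT id FROM users WHERE active = 1",
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, active INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 1), (2, 0)")

# Drop all existing views, then recreate each one from its file.
for (name,) in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'view'").fetchall():
    conn.execute("DROP VIEW " + name)
for sql in view_files.values():
    conn.execute(sql)

print(conn.execute("SELECT id FROM active_users").fetchall())  # [(1,)]
```

If a view referenced a column that no longer exists, the CREATE VIEW statement would fail here, which is exactly the "problems show up early" benefit mentioned above.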

5. Branching and Merging
The technique described above is forward-only, i.e. branching is not easy... Use a "branch per release" approach and hope that schema changes in a branch will be rare. Otherwise look at Ruby Migrations by Elliot Smith and Rob Nichols.
After branching, create a new baseline in the trunk (update major or minor version number). If a fix is applied in the branch, then its schema change script can be adapted and renamed to fit the trunk numbering.

There are a few additional ideas in an older article by Martin Fowler and Pramod Sadalage: DBAs should collaborate closely with developers. Informal communication is crucial while developers are preparing changes in their own instance, or for review before they commit to a shared master.
Scripts take the form of SQL DDL (for the schema change) and DML (for the data migration), and include data for a growing number of tests. Chaining together a sequence of small automated changes is much easier than one huge change. Automation of reverse changes would be possible, but there has been no demand for it. "We update all developer databases automatically when there is a commit to the master."

In my opinion, an ideal solution would keep the complete database structure under source control (e.g. DDL scripts capturing the current schema), with forward (and backward?) change scripts generated automatically from the diff. Several readers suggested keeping the schema in XML for this. However, proper ordering can be tricky (a table1 constraint may depend upon table2). Also, automation of data migration is often difficult (if not impossible) - Steve Campbell suggests using hand-written versioned scripts that can be appended to the generated diff script before each run.
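As a toy sketch of the diff idea: given two schema snapshots (here simply table name to column list), a tool could emit forward DDL statements. A real tool would also have to order dependent objects correctly and, as noted above, data migration would still need hand-written scripts:

```python
def diff_schemas(old, new):
    """Emit forward DDL to transform schema `old` into schema `new`."""
    statements = []
    for table, cols in new.items():
        if table not in old:
            statements.append(
                "CREATE TABLE %s (%s)" % (table, ", ".join(cols)))
        else:
            for col in cols:
                if col not in old[table]:
                    statements.append(
                        "ALTER TABLE %s ADD COLUMN %s" % (table, col))
    for table in old:
        if table not in new:
            statements.append("DROP TABLE %s" % table)
    return statements

old = {"users": ["id", "name"]}
new = {"users": ["id", "name", "email"], "orders": ["id", "user_id"]}
print(diff_schemas(old, new))
# ['ALTER TABLE users ADD COLUMN email', 'CREATE TABLE orders (id, user_id)']
```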

Looking at my personal experience, there is much to be improved: we used to have a shared developers' database. Initially, baseline scripts were maintained manually to correspond to the current schema (there was only one version in production and a very low risk of moving back to a previous version), until they became out of sync. Nobody bothered, as all new databases were created by cloning the production schema. On the other hand, change scripts were very elaborate, with complex data manipulation, and were well tested on test and staging before going to production. Downtime was required because of replication (stop replication, change the schema in all databases, start replication, migrate data).

Many of our lookup and control tables were populated from Excel spreadsheets (a customer requirement), with relatively frequent updates (e.g. weekly). Data loading is often intermixed with data transformation scripts. Instead of using a Windows-specific solution, we opted for a JDBC-based loading mechanism. I wrote configuration-driven tools for copying data between any two databases, so using an Excel source does not require anything special.
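The configuration-driven copy can be illustrated with a small Python/sqlite3 sketch (the original tool was JDBC-based; the config keys and function name here are invented for the example):

```python
import sqlite3

def copy_table(src, dst, config):
    """Copy the configured columns of one table from src to dst."""
    table, columns = config["table"], config["columns"]
    col_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    rows = src.execute("SELECT %s FROM %s" % (col_list, table)).fetchall()
    dst.executemany(
        "INSERT INTO %s (%s) VALUES (%s)" % (table, col_list, placeholders),
        rows)
    dst.commit()
    return len(rows)

src = sqlite3.connect(":memory:")
dst = sqlite3.connect(":memory:")
src.execute("CREATE TABLE lookup (code TEXT, label TEXT)")
src.executemany("INSERT INTO lookup VALUES (?, ?)",
                [("A", "Alpha"), ("B", "Beta")])
dst.execute("CREATE TABLE lookup (code TEXT, label TEXT)")

n = copy_table(src, dst, {"table": "lookup", "columns": ["code", "label"]})
print(n, dst.execute("SELECT code FROM lookup ORDER BY code").fetchall())
# 2 [('A',), ('B',)]
```

The point of driving the copy from configuration is that the source can be swapped (Excel via a JDBC-ODBC bridge, another database, a flat file) without changing the tool itself.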

Monday, February 4, 2008

What's on Jason's Hard Drive?

This title is borrowed from an interesting article by Jason Hunter, the well-known author of Java Servlet Programming, that describes his "paperless house" organization. Published in Nov 2006. Main components:
  • Scanner; with the habit of scanning all relevant documents to TIFF (Jason keeps the paper in case of an audit)
  • All documents are kept in a version control system. His choice: Perforce (for historic reasons; it can be used for free with max. 2 users and up to 5 workspaces), or Subversion (CVS is not a good choice when most documents are binary). Even if many documents exist in one version only, version control brings portability
  • Hierarchical organization of files. Everything in one repository - personal, business, scans, photos, source code, articles, ebooks.
Cons: all files on every client (is this a drawback?), and doesn't Subversion keep two copies on the client? The repository will be backed up (or RAID-ed) as well, i.e. a lot of redundancy.
An explicit commit is needed.

There are also a few additional tips and links in the discussion section.

TODO: what is my organization of information?
logbooks, PeterWeekReports

As a side note: how did I run into Jason's article? Spring's multipart (fileupload) documentation claims support for COS FileUpload (http://www.servlets.com/cos). Jason is the author of the COS (com.oreilly.servlet) support classes (last version Nov. 2002). However, COS support was removed from the Spring codebase starting with version 2.1. Currently, Spring supports only Commons FileUpload (http://jakarta.apache.org/commons/fileupload) in traditional mode, without a streaming API (note also the additional dependency upon Commons IO).

Oh yes, and a brief explanation of changes across Servlet specifications can be found in Jason's articles on JavaWorld.

Sunday, January 20, 2008

Web site prototyping: hand-crafted mockups in Denim

As a software developer, I'd like to concentrate on the interaction flow of a web application and leave visual design to a graphic designer.
However, there is a common problem when reviewing an early prototype with customers: they tend to focus on design issues instead of the interaction/navigation you are trying to address at the meeting. Apparently, this is the main reason even many visual designers prefer starting with paper-based wireframes, or rough ("low-fidelity") sketches in software (e.g. Visio, PowerPoint), instead of HTML. To understand the difference between a wireframe and a prototype, see e.g. the nice discussion here or follow the links.

Recently I found Denim, a web site prototyping tool aimed at this early stage of design. From their web site: "DENIM supports sketching input, allows design at different refinement levels, and unifies the levels through zooming."
Denim is an outcome of university research, a Java GUI application with a long history that never reached mainstream awareness. Primarily oriented toward pen tablet input, it "preserves the important properties of pencil and paper: a rough drawing can be produced quickly and the medium is flexible. However, unlike a paper sketch, this electronic sketch is interactive and can easily be annotated and modified using editing gestures."

Quick evaluation. Pros:
  • zooming from site maps to storyboards to details of an individual page
  • easy hyperlinks from a page component to another page
  • test the interactions in the Run mode, or export to HTML
  • conditional hyperlinks (e.g. depending upon checkbox in the mockup)
  • looks like paper sketches, so users are more eager to suggest modifications and are not distracted by design issues
  • ability to replace handwritten component with text input or an image
  • portable (requires only JRE 5), no installation needed.
Cons:
  • tablet oriented: mouse + keyboard are less intuitive to use (and my mouse-drawn text looks horrible)
  • no reusable page chunks such as header, menu, etc. (although promised in a future release)
  • design-oriented, not technology-oriented: great for working with customers, but I also need to communicate the design to the implementation team. I am missing a separate layer (or level of detail) that enables assigning technical comments to pages, components, and links. Something like tags in the commercial EasyPrototype, but with explicit control over visibility.
To form your own opinion, try an exported DENIM HTML site, or see the videos on the home page.

Sunday, January 13, 2008

SELinux Kiosk Mode in Fedora 8

How to create a secure account (an isolated sandbox) on a Fedora machine that could be useful for visitors, demo presentations, etc.:
Instructions on how to define a Kiosk User account, based on xguest.
The idea is to tightly secure a machine: just log in (locally, no password) and use the internet via Firefox. Any local changes made by the user, such as writes to $HOME or their desktop settings, will be lost after they log out.
Requires SELinux in enforcing mode.
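The setup itself boils down to a few commands on Fedora 8 (a sketch, to be run as root; see the linked instructions for the authoritative steps):

```shell
# Install the xguest package, which creates the confined kiosk account
# and the matching SELinux policy module.
yum install xguest

# xguest only confines the account when SELinux is enforcing.
getenforce                 # should report "Enforcing"
setenforce 1               # switch to enforcing until the next boot
# For a permanent setting, put SELINUX=enforcing in /etc/selinux/config
```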

BTW, I am still not using SELinux on my desktop (more precisely, it is running in permissive mode, i.e. violations are logged, but operations can continue). I believe it should be ON at least on a LAMP server in a DMZ. But I have always had applications that conflict with SE policies - and I've never mastered this topic (see Wikipedia/SELinux: "For me, given my threat model and how much my time is worth, life is too short for SELinux." — Theodore Ts'o).

On the other hand, Dan Walsh's blog seems to shed some light on the obscurity, with many configuration examples and troubleshooting tips. His post "New Features in Fedora 8 - policy for my wife ..." explains the motivation for a secure account in plain English - nice and easy reading for upper management :-)
It looks like I am running out of excuses, and I will have to set up SELinux properly in the upcoming upgrade of my desktop to Fedora 8.