Thursday, January 8, 2009

Time Synchronization across the lab

All servers and workstations within the lab need to keep their time in sync. This is needed to correctly detect changes in the automated build process, to properly manage change_date columns in a database, and for other forms of optimistic concurrency control.

The Network Time Protocol (NTP) is used to synchronize time on all lab systems. Our local NTP server is 192.168.8.5, set up to synchronize with time.chu.nrc.ca. All the other machines are configured as NTP clients using 192.168.8.5 as their time source.

How to set up the NTP client?
For Linux/Unix machines, add a "server 192.168.8.5" line to /etc/ntp.conf, restart the ntpd service, and make sure port 123/UDP is open in the firewall.

Procedure for Windows: open the Registry Editor (regedit) and modify the following values (note that SpecialPollInterval is a REG_DWORD, entered as a decimal value):
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\Parameters]
"NtpServer"="192.168.8.5"
"Type"="NTP"
[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\W32Time\TimeProviders\NtpClient]
"SpecialPollInterval"="900"

This sets the correct NTP server and changes the poll interval from the default decimal value 604800 (1 week) to 900 (15 minutes). Reboot the machine, or restart the Windows Time service (e.g. from the command line: net stop w32time && net start w32time); you can then force an immediate sync with w32tm /resync.

The VMware ESX servers (host machines) are synchronized to 192.168.8.5 as well. All virtual machines get their time from the ESX server on boot, migration, or when reverting snapshots. (You need to hack the vmx files if you want to disable this behaviour.) VMware Tools also allows you to enable time synchronization with the ESX host (this is off by default); you can change it in the virtual machine settings: Options - VMware Tools - check "Synchronize guest time with host".
Unfortunately, we didn't have success with either method of VMware synchronization. Switching on time sync in VMware Tools didn't help: we tried to shut down and power on the guest, but the same difference against the host time (~2 min) remained as before (with VMware Tools sync on). So we rely on NTP synchronization only.

Tuesday, February 5, 2008

Remarks about Database Version Control

This post was triggered by a Coding Horror blog post, where Jeff Atwood pointed to a recent series of five articles about versioning databases by K. Scott Allen. Let's start with a brief synopsis of each article:

1. Three rules for database work
1. Never use a shared database server for development work.
"a shared database is a tar pit waiting to fossilize a project... they ultimately waste time and help produce bugs".
2. Always Have a Single, Authoritative Source For Your Schema
"Everyone should ... have a frictionless experience in getting a fresh database setup."
3. Always Version Your Database
"to propagate changes from development, to test, and ultimately to production in a controlled and consistent manner. A second goal is to have the ability to recreate a database at any point in time."

2. The Baseline
Start with a baseline schema once the initial design has somewhat settled. What should be included? Creation of all relevant db objects (tables, constraints, functions, indexes), but also commands to populate lookup tables with static data. Include any bootstrap data needed by the application.
One script for everything vs. one file per object? A middle road makes sense: e.g. create tables, defaults and indexes in a single file, views and functions in another. Automation: prepare a script/application that installs the database without human intervention.
A reader's tip for a green-field project: "We always rebuild the dev db from scratch. Then the baseline script (with test data) can be versioned. Start creating change scripts only after release one."
Important: use a SchemaChangeLog table, e.g. (id, release_number X.Y.Z, script_name, date_applied). The initial record should be written as the last step of the baseline script.

3. Change Scripts
Change scripts capture deltas, i.e. how to transform the schema from the previous version to the next one. They often contain complex data migration or transformation statements. Use a script naming convention, e.g. sc.01.00.0001.sql, so that an automation tool can compare the file names against SchemaChangeLog and select the scripts to apply. A commit means the schema change is officially published; as a consequence, update scripts should never be changed afterwards. What if the production db fails in the middle of an update? Always back up before... or write idempotent scripts, as advocated by Phil Haack (his implementation is SQL Server specific, but an Oracle solution can be found in the blog comments).
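To make this concrete, here is a minimal sketch of such an automation tool in plain JDBC (the table layout, file handling and all names are my own illustration, not Scott's code): it creates SchemaChangeLog on first run, compares the recorded script names with the sc.XX.YY.ZZZZ.sql files in a directory, and applies and records the missing ones.

import java.io.File;
import java.sql.*;
import java.util.*;

public class SchemaUpgrader {

    // Simplified bookkeeping table (the article also adds a numeric id column)
    private static final String LOG_DDL =
        "CREATE TABLE SchemaChangeLog (" +
        "  script_name    VARCHAR(100) PRIMARY KEY," +
        "  release_number VARCHAR(20)," +
        "  date_applied   TIMESTAMP)";

    public static void upgrade(Connection con, File scriptDir) throws Exception {
        Statement st = con.createStatement();

        // Which scripts have already been applied?
        Set applied = new HashSet();
        try {
            ResultSet rs = st.executeQuery("SELECT script_name FROM SchemaChangeLog");
            while (rs.next()) {
                applied.add(rs.getString(1));
            }
            rs.close();
        } catch (SQLException e) {
            st.execute(LOG_DDL);    // assume first run: the log table does not exist yet
        }

        // Select the change scripts (sc.XX.YY.ZZZZ.sql) that are not recorded yet
        File[] scripts = scriptDir.listFiles();
        Arrays.sort(scripts);       // the naming convention gives the correct order
        for (int i = 0; i < scripts.length; i++) {
            String name = scripts[i].getName();
            if (!name.startsWith("sc.") || !name.endsWith(".sql") || applied.contains(name)) {
                continue;
            }
            String sql = new Scanner(scripts[i]).useDelimiter("\\Z").next();
            st.execute(sql);        // real scripts usually need splitting into single statements

            PreparedStatement log = con.prepareStatement(
                "INSERT INTO SchemaChangeLog (script_name, release_number, date_applied)" +
                " VALUES (?, ?, ?)");
            log.setString(1, name);
            log.setString(2, name.substring(3, name.length() - 4));  // the "01.00.0001" part
            log.setTimestamp(3, new Timestamp(System.currentTimeMillis()));
            log.executeUpdate();
            log.close();
        }
        st.close();
    }
}

A real tool would also wrap each script in a transaction (where the database supports transactional DDL) and handle failures more carefully.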

4. Views, Stored Procedures and the Like
Always drop all such objects, then apply scripts to create them from scratch. Scott's strategy is to keep one view (trigger, stored procedure) per file, so no change scripts are needed: delete the view file from source control and the view is deleted. It's that simple.
This approach works because there is no data to migrate or transform. As a consequence, change scripts cannot use views/triggers (they can create and drop a temporary one, though). Pros: some problems show up early (e.g. a column used by a view was deleted); it is easy to go backwards in time. Cons: every update requires downtime, which could be a problem for 24/7 production.

5. Branching and Merging
The technique described above is forward-only, i.e. branching is not easy... Use a "branch per release" approach and hope that schema changes in a branch will be rare. Otherwise look at Ruby Migrations by Elliot Smith and Rob Nichols.
After branching, create a new baseline in the trunk (update the major or minor version number). If a fix is applied in the branch, its schema change script can be adapted and renamed to fit the trunk numbering.

There are a few additional ideas in an older article by Martin Fowler and Pramod Sadalage: DBAs should collaborate closely with developers. Informal communication is crucial while developers are preparing changes in their own instance, and for review before they commit to the shared master.
Scripts in the form of SQL DDL (for the schema change) and DML (for the data migration) also include data for a growing number of tests. Chaining together a sequence of small automated changes is much easier than one huge change. Automating the reverse changes would be possible, but there was no demand for it. "We update all developer databases automatically when there is a commit to the master."

In my opinion, the ideal solution would keep the complete database structure under source control (e.g. DDL scripts capturing the current schema), with forward (and perhaps backward?) change scripts generated automatically from the diff. Several readers suggested keeping the schema in XML for this. However, proper ordering can be tricky (a table1 constraint may depend upon table2). Also, automation of data migration is often difficult (if not impossible) - Steve Campbell suggests using hand-written versioned scripts that can be appended to the generated diff script before each run.

Looking at my personal experience, there is much to be improved: we used to have a shared developers' database. Initially, baseline scripts were maintained manually to correspond to the current schema (there was only one version in production and a very low risk of moving back to a previous version), until they became out of sync. Nobody bothered, as all new databases were created by cloning the production schema. On the other hand, change scripts were very elaborate, with complex data manipulation, and were well tested on test and staging before going to production. Downtime was required because of replication (stop replication, change the schema in all databases, start replication, migrate data).

Many of our lookup and control tables were populated from Excel spreadsheets (a customer requirement), with relatively frequent updates (e.g. weekly). Data loading is often intermixed with data transformation scripts. Instead of using a Windows-specific solution, we opted for a JDBC-based loading mechanism: I wrote configuration-driven tools for copying data between any two databases, so using an Excel source does not require anything special.
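For illustration, the core of such a JDBC copy boils down to a few lines (a sketch with made-up parameters; in the real tool the driver, URLs, table name and column list come from configuration, and an Excel source is simply another Connection, e.g. via the JDBC-ODBC bridge):

import java.sql.*;

public class TableCopy {

    /** Copies all rows of one table between two JDBC connections. */
    public static void copy(Connection source, Connection target,
                            String table, String[] columns) throws SQLException {
        String columnList = join(columns);
        PreparedStatement insert = target.prepareStatement(
            "INSERT INTO " + table + " (" + columnList + ")" +
            " VALUES (" + placeholders(columns.length) + ")");
        Statement select = source.createStatement();
        ResultSet rs = select.executeQuery("SELECT " + columnList + " FROM " + table);
        int rows = 0;
        while (rs.next()) {
            for (int i = 0; i < columns.length; i++) {
                insert.setObject(i + 1, rs.getObject(i + 1));
            }
            insert.addBatch();
            if (++rows % 500 == 0) {
                insert.executeBatch();   // flush every 500 rows
            }
        }
        insert.executeBatch();           // flush the remaining rows
        rs.close();
        select.close();
        insert.close();
    }

    private static String join(String[] columns) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < columns.length; i++) {
            if (i > 0) sb.append(", ");
            sb.append(columns[i]);
        }
        return sb.toString();
    }

    private static String placeholders(int count) {
        StringBuffer sb = new StringBuffer();
        for (int i = 0; i < count; i++) {
            if (i > 0) sb.append(", ");
            sb.append("?");
        }
        return sb.toString();
    }
}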

Monday, February 4, 2008

What's on Jason's Hard Drive?

This title is borrowed from an interesting article by Jason Hunter, the well-known author of Java Servlet Programming, which describes his "paperless house" organization (published in Nov 2006). Main components:
  • A scanner, with a habit of scanning all relevant documents to TIFF (Jason keeps the paper in case of an audit)
  • All documents are kept in a version control system. His choice: Perforce (for historic reasons; it can be used for free with up to 2 users and 5 workspaces), or Subversion (CVS is not a good choice when most documents are binary). Even if many documents exist in only one version, version control brings portability
  • Hierarchical organization of files. Everything in one repository - personal, business, scans, photos, source code, articles, ebooks.
Cons: all files are on every client (is this a drawback?), and a Subversion working copy even keeps two copies on the client (the checked-out file plus a pristine base copy under .svn). The repository will be backed up (or RAID-ed) as well, i.e. a lot of redundancy.
An explicit commit is needed.

There are also a few additional tips and links in the discussion section.

TODO: what is my organization of information?
logbooks, PeterWeekReports

As a side note: how did I run into Jason's article? Spring's multipart (file upload) documentation claims support for COS FileUpload (http://www.servlets.com/cos). Jason is the author of the COS (com.oreilly.servlet) support classes (last version Nov. 2002). However, COS support was removed from the Spring codebase starting with version 2.1. Currently, Spring supports only Commons FileUpload (http://jakarta.apache.org/commons/fileupload) in traditional mode, without the streaming API (note also the additional dependency on Commons IO).

Oh yes, and a brief explanation of changes across Servlet specifications can be found in Jason's articles on JavaWorld.

Sunday, January 20, 2008

Web site prototyping: hand-crafted mockups in Denim

As a software developer, I'd like to concentrate on the interaction flow of a web application and leave the visual design to a graphic designer.
However, there is a common problem when reviewing an early prototype with customers: they tend to focus on design issues instead of the interaction/navigation you are trying to address at the meeting. Apparently, this is the main reason even many visual designers prefer starting with paper-based wireframes, or rough ("low-fidelity") sketches in software (e.g. Visio, PowerPoint), instead of HTML. To understand the difference between a wireframe and a prototype, see e.g. the nice discussion here, or follow the links.

Recently I found Denim, a web site prototyping tool that aims at exactly this early stage of design. From their web site: "DENIM supports sketching input, allows design at different refinement levels, and unifies the levels through zooming."
Denim is an outcome of university research, a Java GUI application with a long history that never reached mainstream awareness. Primarily oriented toward pen/tablet input, it "preserves the important properties of pencil and paper: a rough drawing can be produced quickly and the medium is flexible. However, unlike a paper sketch, this electronic sketch is interactive and can easily be annotated and modified using editing gestures."

Quick evaluation. Pros:
  • zooming from site maps to storyboards to details of an individual page
  • easy hyperlinks from a page component to another page
  • test the interactions in Run mode, or export to HTML
  • conditional hyperlinks (e.g. depending upon a checkbox in the mockup)
  • looks like paper sketches, so users are more eager to suggest modifications and are not distracted by design issues
  • ability to replace a handwritten component with a text input or an image
  • portable (requires only JRE 5), no installation needed.
Cons:
  • tablet oriented: mouse + keyboard are less intuitive to use (and my mouse-drawn text looks horrible)
  • no reusable page chunks such as a header, menu, etc. (although promised for a future release)
  • design-oriented, not technology-oriented: great for working with customers, but I also need to communicate the design to the implementation team. I am missing a separate layer (or level of detail) for attaching technical comments to pages, components and links. Something like the tags in the commercial EasyPrototype, but with explicit control over their visibility.
To form your own opinion, try an exported DENIM HTML site, or watch the videos on the home page.

Sunday, January 13, 2008

SELinux Kiosk Mode in Fedora 8

How to create a secure account (an isolated sandbox) on a Fedora machine, useful for visitors, demo presentations, etc.:
There are instructions on how to define a Kiosk User account, based on xguest.
The idea is to tightly lock down the machine: just log in (locally, no password) and use the internet via Firefox. Any local changes made by the user, such as writes to $HOME or changes to desktop settings, will be lost after they log out.
This requires SELinux in enforcing mode.

BTW, I am still not using SELinux on my desktop (more precisely, it is running in permissive mode, i.e. violations are logged but the operations can continue). I believe it should be ON at least on the LAMP server in the DMZ. But I have always had applications that conflict with SELinux policies - and I've never mastered this topic (see Wikipedia/SELinux: "For me, given my threat model and how much my time is worth, life is too short for SELinux." — Theodore Ts'o).

On the other hand, Dan Walsh's blog posts seem to shed some light on the obscurity, with many configuration examples and troubleshooting tips. His post "New Features in Fedora 8 - policy for my wife ..." explains the motivation for a secure account in plain English - nice and easy reading for upper management :-)
It looks like I am running out of excuses, and I will have to set up SELinux properly during the upcoming upgrade of my desktop to Fedora 8.

Sunday, December 16, 2007

How-to: Paging of a large result set

Problem: the result set of a database query contains too many rows (more than is acceptable/practical for the application). Only a subset is needed in memory ("from_row=X to_row=Y"). For example, a "window" into the result set is needed (show me page 13 of 25). Often, the next subset / next page will be needed soon (within seconds to minutes).

This post is not polished; I decided to write down what I learned, in case we need it again. Java technologies are mentioned, but the solutions are not Java specific.

Approaches:

1. Modify the query (design a more restrictive WHERE clause, e.g. "key > 'key1' AND key <= 'key2'", with a suitable ORDER BY).
Often possible, but neglected (it requires overcoming a mental block :-) A nice explanation, with source code (generic, but quite complicated), is here. See also the JdbcTemplate sketch under approach 6 below, which combines this idea with setMaxRows().

2. Keep the ResultSet open for a long time
a) batch applications: invoke processing for each row
b) user interaction: advance the cursor when the user requests the next page (partSMart)
Drawbacks: goes against the best practice of holding a connection only for the minimum time.
Wastes db and connection pool resources. Will not work in containers?
Encapsulation is difficult (batch processing could be done as a callback, but paging?).

3. Cache all results in memory and present only a subset (not actually a solution to the problem, but a commonly used approach for user presentation).
See for example the Value List Handler pattern in the Core J2EE Pattern Catalog.
Alternative (Miro): cache only the keys and retrieve the records on demand.

4. Iterate through the whole result set, but read only the relevant subset. Next time, execute the statement again and iterate to read the next subset. (This can be optimized to stop processing after reading the "to_row" record.)
Drawback: time + waste of db resources.
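A minimal sketch of this approach in plain JDBC (the query, table and column names are made up; fromRow and toRow are 1-based):

import java.sql.*;
import java.util.*;

public class SkipPager {

    /** Re-executes the query and returns only the rows fromRow..toRow (1-based). */
    public static List readPage(Connection con, int fromRow, int toRow) throws SQLException {
        List page = new ArrayList();
        Statement st = con.createStatement();
        ResultSet rs = st.executeQuery("SELECT id, name FROM mytable ORDER BY id");
        int row = 0;
        while (rs.next()) {
            row++;
            if (row < fromRow) continue;   // skip rows before the requested window
            if (row > toRow) break;        // stop early, no need to read the rest
            page.add(rs.getString("name"));
        }
        rs.close();
        st.close();
        return page;
    }
}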

A common problem when each "page" is the result of a separate select statement:
Can the result set change over time? The result of the next statement will be different if records were inserted, updated or deleted in the meantime.
The solution is application specific. What is better for your application: getting some rows twice, or missing a row?

5. Use a database-specific solution, if one exists.
E.g. Oracle. Note that a naive "WHERE ROWNUM > 11 AND ROWNUM <= 20" never returns any rows (ROWNUM is only incremented when a row passes the predicate, so the counter never gets past 1), and ROWNUM is assigned before the ORDER BY; the usual idiom is a nested subquery:
SELECT * FROM (
    SELECT e.*, ROWNUM rnum
    FROM (SELECT * FROM emp ORDER BY empno) e
    WHERE ROWNUM <= 20
) WHERE rnum > 10;   -- rows 11 to 20, Oracle counts from 1
or MySQL: SELECT * FROM MYTABLE LIMIT offset, row_count;
Non-portable, and not all databases have such a construct, but efficient.

There is a danger of skipping (missing) rows when using numeric positions if records can be deleted between two statements:
some rows will have a lower numeric position after the next statement execution.
This is a problem e.g. if the results are sorted by primary key, with new records inserted at the end. The situation is even worse if the results are sorted by something like changeDate - any update changes the order of the rows. Not only skipping a row, but also getting the same row twice is possible (this problem is not specific to the numeric position approach).

6. Use the JDBC method Statement.setMaxRows(max), a portable equivalent of WHERE ROWNUM <= max.
The Spring framework offers setMaxRows() in JdbcTemplate and derived classes.

This nicely restricts the number of records retrieved from the result set. However, I didn't find any JDBC equivalent for setting an offset, or "from_row".
Well, a numeric offset is dangerous anyway, so let's use a combination with approach 1 instead: modify the query to accept "key > ?" in a (parametrized) WHERE clause.
(Note that setFetchSize() is not usable for this purpose; the suggestion in this forum is misleading.)
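A minimal sketch of this combination with Spring's JdbcTemplate (2.x-era API; the table, columns and key handling are made up for illustration, not the actual application code):

import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.List;
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.core.RowMapper;

public class CustomerPager {

    private final JdbcTemplate jdbcTemplate;

    public CustomerPager(DataSource dataSource, int pageSize) {
        this.jdbcTemplate = new JdbcTemplate(dataSource);
        this.jdbcTemplate.setMaxRows(pageSize);   // portable equivalent of ROWNUM <= max
    }

    /** Returns the next page of customer names, starting after the given key. */
    public List nextPage(long lastSeenId) {
        return jdbcTemplate.query(
            "SELECT id, name FROM customer WHERE id > ? ORDER BY id",  // approach 1: keyed "from"
            new Object[] { new Long(lastSeenId) },
            new RowMapper() {
                public Object mapRow(ResultSet rs, int rowNum) throws SQLException {
                    return rs.getString("name");
                }
            });
    }
}

The caller passes in the key of the last row of the previous page (or the minimum key for the first page); setMaxRows() caps the page size without any numeric offset.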

7. TODO: Look at ORM approaches. Hibernate offers pagination (setFirstResult(), setMaxResults()) and scrollable iteration (but only if the JDBC driver supports scrollable ResultSets; an open database connection is required for this functionality). The discussions here and here actually do not mention whether they use caching in memory, database-specific constructs, or something really smart.
This Hibernate blog post investigates scrollable result sets, with a very nice overview of scrollable support across databases. It seems there are too many problems, and the encapsulation of db access is questionable. More reading needed.
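For reference, the Hibernate pagination API mentioned above looks roughly like this (assuming an open org.hibernate.Session and a mapped Customer entity; it is offset-based, so the same shifting-rows caveat applies):

// page 13 with 25 rows per page, i.e. rows 301-325 (zero-based offset 300)
List page = session.createQuery("from Customer c order by c.id")
        .setFirstResult(300)   // offset of the first row to retrieve
        .setMaxResults(25)     // page size
        .list();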

So, I implemented solution no. 6 today.
I modified the query to use a parametrized key as "from" (with proper ordering), plus JdbcTemplate.setMaxRows(). I am also fine with duplicated rows in this application (if they ever occur).
Works like a charm (actually, the different ORDER BY improved performance substantially).

This text started as a chat with Miro, who provided me with both the initial pointers and the encouragement to invent my own wheel. Thanks Miro :-)