Sunday, December 16, 2007

How-to: Paging of a large result set

Problem: result set of a database query contains too many rows (more than acceptable/practical for the application). Only a subset is needed in memory ("from_row=X to_row=Y"). For example, a "window" to the resultSet is needed (show me page 13 of 25). Often, next subset / next page will be needed soon (in order of seconds to minutes).

This post is not polished. I decided to write down what I learned, in case we'll need it again. Java technologies are mentioned, but solutions are not Java specific.

Approaches:

1. Modify the query (design more restrictive WHERE clause, e.g. "key>'key1' AND key<='key2'", with suitable ORDER BY).
Often possible, but neglected (requires to overcome mental block :-) Nice explanation, with source code (generic, but quite complicated) is here.

2. Keep ResultSet open for a long time
a) batch applications: invoke processing for each row
b) user interaction: progress when user requests next page (partSMart)
Drawbacks: against best practice to hold connection only for a minimum time.
Waste of db and connection pool resources. Will not work with containers?
Encapsulation is difficult (batch could be done as callback, but paging?).

3. Cache all results in memory, present only a subset (actually not a solution to the problem, but commonly used approach for user presentation).
See for example Value List Pattern in the Core J2EE Pattern Catalog.
Alternative (Miro): cache only keys, retrieve records on demand.

4. Iterate through the whole result set, but read only the relevant subset. Next time, execute statement again, iterate reading the next subset. (Can be optimized to stop processing after reading the "to_row" record).
Drawback: time + waste of db resources.

Common problem where each "page" is a result of separate select statement:
Can resultSet change over time? Result of next statement will be different if records were inserted, updated, deleted in the meantime.
Solution is application specific. What is better for your application: getting some rows twice, or missing a row?

5. Use database specific solution, if exists.
E.g. Oracle:
SELECT * FROM emp WHERE ROWNUM > 11 AND ROWNUM <= 20 ORDER BY empno; (Oracle counts from 1);
or MYSQL: SELECT * FROM MYTABLE LIMIT offset, row_count;
Non-portable, not all databases have such construct, but efficient.

There is a danger of skipping (missing) rows when using numeric positions if records could be deleted between two statements:
some rows will have lower numeric position after the next statement execution.
This is a problem e.g. if results are sorted by primary key, with new records inserted at the end. Situation is even worse if results are sorted by something like changeDate - any update will change order of rows. Not only skipping a row, but also getting the same row twice is possible (this problem is not specific to numeric position approach).

6. Use jdbc method Statement.setMaxRows(max), portable equivalent of WHERE ROWNUM <= max.
Spring framework offers setMaxRows() in JdbcTemplate and derived classes.

This will nicely restrict number of records retrieved from the resultset. However, I didn't find any jdbc equivalent of setting offset, or "from_row".
Well, numeric offset is dangerous anyway, so let's use combination with approach 1 instead: modify the query to accept "key > ?" in a (parametrized) WHERE clause.
(Note that setFetchSize() is not usable for this purpose, suggestion in this forum is misleading).

7. TODO: Look at ORM approaches. Hibernate offers Pagination (setFirstResult(), setMaxResults()), and Scrollable iteration (but only if JDBC driver supports scrollable ResultSets; open database connection is required for this functionality). Discussion here and here actually does not mention whether they are using caching in memory, database specific constructs, or something really smart.
This Hibernate blog investigates scrollable resultSets. Very nice overview of scrollable support across databases. Seems too many problems, plus encapsulation of db access is questionable. More reading needed.

So, I implemented the solution no.6 today.
I've modified the query to use parametrized key as "from" (with proper ordering), plus JdbcTemplate.setMaxRows(). I am also fine with duplicated rows in this application (if they ever occur).
Works like charm (actually, different ORDER BY improved performance substantially).

This text started as a chat with Miro, who provided me with both initial pointers and an encouragement to invent my own wheel. Thanks Miro :-)