Some scribbles detailing various SQL approaches for retrieving versioned data

A common theme in home-rolled retrofitted database data history versioning implementations is the presence of a table column within a database table named, for example, CURRENT or MOSTRECENTVERSION, or the addition of a column named VERSION, VERSIONNO, or VERSIONNUMBER.

Retrieval of information from a table in the first scenario would result in typical SQL with a syntax not dissimilar to

SELECT something
  FROM somewhere
    WHERE mostRecentVersion = ‘Y’
      AND a whole bunch of other predicates and stuff.

Clearly too, some supporting DML needs to be implemented to ensure that only a single row within any group has the MOSTRECENTVERSION flag set to Y.

The second scenario, where the table is endowed with a column typically named VERSIONNUMBER, appears to be the more common approach to the data versioning requirement. In the absence of a column for ordering the audit history by date modified/changed, the VERSIONNUMBER column can also substitute for this purpose too.

By means of an example, for the following test data set

postgres=#
postgres=#
postgres=# \connect testdb001
You are now connected to database "testdb001" as user "postgres".
testdb001=#
testdb001=# select * from tblData;
datatypeid  |  somedata  | versionnumber
------------+------------+---------------
          1 | foo        |             1
          1 | bar        |             2
          1 | baz        |             3
          2 | aaa1       |             1
          1 | qux        |             4
          5 | der Apfel  |             1
          5 | die Äpfel  |             2
          9 | die Birne  |             1
          9 | die Birnen |             2
        100 | <null>     |             1
(10 rows)
testdb001=#
testdb001=#

I have very regularly encountered the following SQL construct in production systems, used to retrieve the most recent version of some data set.

testdb001=#
testdb001=#
testdb001=# select dataTypeID, versionNumber, someData
testdb001-#  from tblData t
testdb001-#   where versionNumber = (
testdb001(#                           select max(versionNumber)
testdb001(#                             from tblData
testdb001(#                              where dataTypeID = t.dataTypeID
testdb001(#                                 group by dataTypeID
testdb001(#                         )
testdb001-#     order by 1 asc;
datatypeid | versionnumber |  somedata

------------+---------------+------------
          1 |             4 | qux
          2 |             1 | aaa1
          5 |             2 | die Äpfel
          9 |             2 | die Birnen
        100 |             1 | <null>
(5 rows)

testdb001=#

It works to be fair, but note that the most recent version number, the version number with the largest value, is obtained through a second query, more formally known as a correlated subquery. Alarm bells should ring regarding the performance implications of this approach, irrespective of whether you are using PostgreSQL, SQL Server, Oracle or some other vendor product.

Generically and minimally, and this model often appears in application architectures using micro-services, a (web)service will return, as shown above, the dataTypeID and the versionNumber (or perhaps a unique ID/PK) for the most recent version of the versioned data, and little else. Other services will then use this key/ID information to retrieve the specific data requested, invoking some other API method.

How is this minimal information obtained in SQL code? Well ….. this is the point of this blog post, and it describes the SQL variants behind the API request.

Pruning back the query above, I am going to outline the approaches I have encountered during my travels as a software developer.

Approach 1 : The Correlated Subquery

The correlated subquery is by far the most common construct observed in production non-performant code. This type of SQL is often generated on-the-fly by ORM frameworks too, so its existence is regularly hidden from plain sight.

Given that, by inspection, the subquery must be performed for each row returned, queries of this type often result in a high execution cost. Put another way, the query runs slowly and the resource time-slice detrimentally affects other server processes. Of course these are distant view throw-away comments, and the execution plan cost and server impact will clearly vary with each DBMS implementation, ie. different for PostgreSQL and SQL Server and MySQL and so on, their versions, the OS versions, hardware specifications, and so on.

testdb001=# 
testdb001=# 
testdb001=# select dataTypeID, versionNumber
testdb001-#   from tblData t
testdb001-#     where versionNumber = (
testdb001(#                             select max(versionNumber)
testdb001(#                               from tblData 
testdb001(#                                 where dataTypeID = t.dataTypeID
testdb001(#                                   group by dataTypeID
testdb001(#                           )
testdb001-#       order by 1 asc;
 datatypeid | versionnumber 
------------+---------------
          1 |              4
          2 |              1
          5 |              2
          9 |              2
        100 |              1
(5 rows)

testdb001=# 
testdb001=# 

Approach 2 : The TopN Query

TopN queries are another commonly observed way of retrieving the most recent (ie. the top, bottom, or offsetted) result representing the most recent versioned data result. The approach does however mandate that a full table, or full range, data scan being performed, twice, and often using some temporary or overflow table space void of intermediary result-set indexes.

The approach is however clear to read. In my book, code that is readable is also maintainable and will contain fewer bugs than complicated code written by a smart-alec that just never works right every time.

testdb001=# 
testdb001=# 
testdb001=# select dataTypeID, versionNumber
testdb001-#   from (
testdb001(#          select dataTypeID, 
testdb001(#                 versionNumber,
testdb001(#                 rank() over (partition by dataTypeID order by versionNumber desc) rn
testdb001(#            from tblData t
testdb001(#        ) t
testdb001-#     where t.rn = 1
testdb001-#       order by 1 asc;
 datatypeid | versionnumber 
------------+---------------
          1 |              4
          2 |              1
          5 |              2 
          9 |              2
        100 |              1
(5 rows)

testdb001=# 

Approach 3 : The Bullet-Proof INNER (disguised SELF) JOIN

The INNER JOIN – there is not much more to say. Peculiarly however, for this most-recent-data-version requirement, developers tend to shy away from using a join, despite all DBMS vendors implementing numerous efficient join optimising algorithms. From my desk, the INNER JOIN is my default without-thinking-much-more-about-it approach to the problem.

testdb001=# 
testdb001=# 
testdb001=# 
testdb001=# select t.dataTypeID, t.versionNumber
testdb001-#   from tblData t
testdb001-#     inner join (
testdb001(#                  select dataTypeID, max(versionNumber) versionNumber
testdb001(#                    from tblData
testdb001(#                      group by dataTypeID
testdb001(#                ) x
testdb001-#        on   t.dataTypeID = x.dataTypeID
testdb001-#         and t.versionNumber = x.versionNumber
testdb001-#     order by 1 asc;
 datatypeid | versionnumber 
------------+---------------
          1 |              4
          2 |              1
          5 |              2
          9 |              2
        100 |              1
(5 rows)

testdb001=# 

Approach 4 : INTERSECT Set Operation

The use of the INTERSECT set operation to meet this technical requirement is charming. The SQL code is very simple to read, and the code, as written, is also can be easily parallelised with a low execution plan cost. Like the INNER JOIN above, the code construct is rarely seen in production code, more than likely as the developer didn’t even consider using it to retrieve some most recent data version ID’s.

testdb001=# 
testdb001=# 
testdb001=#   select t.dataTypeID, versionNumber
testdb001-#     from tblData t
testdb001-# intersect
testdb001-#   select dataTypeID, max(versionNumber)
testdb001-#     from tblData
testdb001-#       group by dataTypeID
testdb001-#  order by 1 asc; 
 datatypeid | versionnumber 
------------+---------------
          1 |              4
          2 |              1
          5 |              2
          9 |              2
        100 |              1
(5 rows)

testdb001=# 

Approach 5 : GROUP BY HAVING

The obvious technical approach that satisfies the requirement is to use a SQL HAVING clause.

The ‘positives’ to using the GROUP BY HAVING pair include

  • it is generally efficient – note there is no disguised table self-join and tblData is being referenced in the SQL code once and only once
  • the code is minimal and very clear
  • the HAVING clause is designed for aggregation rollup – why reinvent the wheel with the other approaches shown herein?

The ‘negatives’ are

  • As written (see code excerpt below), the code is bugged for data holes – for example, if version numbers present in the data were 1, 2, 3, 5, then MAX(versionNumber) would be 5 and COUNT(*) would sum to 4, and the query result would have a valid piece of data omitted. Furthermore, and despite what you may believe, production data is always full of dirty data 🙁 Using the GROUP BY HAVING approach may result in a bug that is very hard to find. Last year I wrote a small blog article on resolving holes in versioned data.

The embarrassing ‘real reasons’ why the HAVING clause is infrequently used

  • software developers don’t understand the HAVING clause. It is a bitter pill to swallow, but it is true.

The textbook GROUP BY HAVING construct that satisfies the most recent version requirement is shown below:

testdb001=# 
testdb001=# 
testdb001=# 
testdb001=# select t.dataTypeID, max(versionNumber) versionNumber
testdb001-#   from tblData t
testdb001-#     group by dataTypeID
testdb001-#       having count(*) = max(versionNumber)
testdb001-#         order by 1 asc;
 datatypeid | versionnumber 
------------+---------------
          1 |              4
          2 |              1
          5 |              2
          9 |              2
        100 |              1
(5 rows)

testdb001=# 
testdb001=# 

There are two elephants in the room.

  • If the most recent version of the data is inserted into the table with versionNumber = 1, the necessity to introduce complexity into the SQL queries above is not required. Clearly some sort of version number shuffle would need to be performed each time a new row is inserted (ie. UPDATE tblData SET versionNumber = versionNumber + 1 WHERE dataTypeID = ??), but appropriately indexed, this should be a low cost piece of DML. Put another way, the most recent data version is always 1, the next most recent has a versionNumber of 2, …., and the most recent data can be retrieved with SELECT whatever FROM tblData WHERE versionNumber=1. Sadly this shuffling or data versioning is often performed in some comedy database trigger with slow convoluted logic, and the shuffle of existing versioned data also represents a data modification too, so irrelevant and misleading additional information is often appended to the audit log/version history.
  • If SQL were a declarative language, and no it is not, then it should not matter which of the 5 SQL constructs shown is executed, the same execution plan should result! After all, a declarative language is not supposed to describe the execution steps, but the action and results required. Sadly DBMS’s are jurassic in their implementation, despite what the Marketing Departments in these multinationals would prefer you to believe, and the way the SQL is written ultimately determines the execution plan and cost estimate. Put another way, the SQL executed determines the steps taken by the DBMS to retrieve your result set, or control-flow in the jargon. That is not a declarative language!

— Published by Mike, 09:56:51 17 August 2017 (GMT)

Leave a Reply