A common theme in home-rolled retrofitted database data history versioning implementations is the presence of a table column within a database table named, for example, CURRENT or MOSTRECENTVERSION, or the addition of a column named VERSION, VERSIONNO, or VERSIONNUMBER.
Retrieval of information from a table in the first scenario would result in typical SQL with a syntax not dissimilar to
SELECT something FROM somewhere WHERE mostRecentVersion = ‘Y’ AND a whole bunch of other predicates and stuff.
Clearly too, some supporting DML needs to be implemented to ensure that only a single row within any group has the MOSTRECENTVERSION flag set to Y.
The second scenario, where the table is endowed with a column typically named VERSIONNUMBER, appears to be the more common approach to the data versioning requirement. In the absence of a column for ordering the audit history by date modified/changed, the VERSIONNUMBER column can also substitute for this purpose too.
By means of an example, for the following test data set
postgres=# postgres=# postgres=# \connect testdb001 You are now connected to database "testdb001" as user "postgres". testdb001=# testdb001=# select * from tblData; datatypeid | somedata | versionnumber ------------+------------+--------------- 1 | foo | 1 1 | bar | 2 1 | baz | 3 2 | aaa1 | 1 1 | qux | 4 5 | der Apfel | 1 5 | die Äpfel | 2 9 | die Birne | 1 9 | die Birnen | 2 100 | <null> | 1 (10 rows) testdb001=# testdb001=#
I have very regularly encountered the following SQL construct in production systems, used to retrieve the most recent version of some data set.
testdb001=# testdb001=# testdb001=# select dataTypeID, versionNumber, someData testdb001-# from tblData t testdb001-# where versionNumber = ( testdb001(# select max(versionNumber) testdb001(# from tblData testdb001(# where dataTypeID = t.dataTypeID testdb001(# group by dataTypeID testdb001(# ) testdb001-# order by 1 asc; datatypeid | versionnumber | somedata ------------+---------------+------------ 1 | 4 | qux 2 | 1 | aaa1 5 | 2 | die Äpfel 9 | 2 | die Birnen 100 | 1 | <null> (5 rows) testdb001=#
It works to be fair, but note that the most recent version number, the version number with the largest value, is obtained through a second query, more formally known as a correlated subquery. Alarm bells should ring regarding the performance implications of this approach, irrespective of whether you are using PostgreSQL, SQL Server, Oracle or some other vendor product.
Generically and minimally, and this model often appears in application architectures using micro-services, a (web)service will return, as shown above, the dataTypeID and the versionNumber (or perhaps a unique ID/PK) for the most recent version of the versioned data, and little else. Other services will then use this key/ID information to retrieve the specific data requested, invoking some other API method.
How is this minimal information obtained in SQL code? Well ….. this is the point of this blog post, and it describes the SQL variants behind the API request.
Pruning back the query above, I am going to outline the approaches I have encountered during my travels as a software developer.
The correlated subquery is by far the most common construct observed in production non-performant code. This type of SQL is often generated on-the-fly by ORM frameworks too, so its existence is regularly hidden from plain sight.
Given that, by inspection, the subquery must be performed for each row returned, queries of this type often result in a high execution cost. Put another way, the query runs slowly and the resource time-slice detrimentally affects other server processes. Of course these are distant view throw-away comments, and the execution plan cost and server impact will clearly vary with each DBMS implementation, ie. different for PostgreSQL and SQL Server and MySQL and so on, their versions, the OS versions, hardware specifications, and so on.
testdb001=# testdb001=# testdb001=# select dataTypeID, versionNumber testdb001-# from tblData t testdb001-# where versionNumber = ( testdb001(# select max(versionNumber) testdb001(# from tblData testdb001(# where dataTypeID = t.dataTypeID testdb001(# group by dataTypeID testdb001(# ) testdb001-# order by 1 asc; datatypeid | versionnumber ------------+--------------- 1 | 4 2 | 1 5 | 2 9 | 2 100 | 1 (5 rows) testdb001=# testdb001=#
TopN queries are another commonly observed way of retrieving the most recent (ie. the top, bottom, or offsetted) result representing the most recent versioned data result. The approach does however mandate that a full table, or full range, data scan being performed, twice, and often using some temporary or overflow table space void of intermediary result-set indexes.
The approach is however clear to read. In my book, code that is readable is also maintainable and will contain fewer bugs than complicated code written by a smart-alec that just never works right every time.
testdb001=# testdb001=# testdb001=# select dataTypeID, versionNumber testdb001-# from ( testdb001(# select dataTypeID, testdb001(# versionNumber, testdb001(# rank() over (partition by dataTypeID order by versionNumber desc) rn testdb001(# from tblData t testdb001(# ) t testdb001-# where t.rn = 1 testdb001-# order by 1 asc; datatypeid | versionnumber ------------+--------------- 1 | 4 2 | 1 5 | 2 9 | 2 100 | 1 (5 rows) testdb001=#
The INNER JOIN – there is not much more to say. Peculiarly however, for this most-recent-data-version requirement, developers tend to shy away from using a join, despite all DBMS vendors implementing numerous efficient join optimising algorithms. From my desk, the INNER JOIN is my default without-thinking-much-more-about-it approach to the problem.
testdb001=# testdb001=# testdb001=# testdb001=# select t.dataTypeID, t.versionNumber testdb001-# from tblData t testdb001-# inner join ( testdb001(# select dataTypeID, max(versionNumber) versionNumber testdb001(# from tblData testdb001(# group by dataTypeID testdb001(# ) x testdb001-# on t.dataTypeID = x.dataTypeID testdb001-# and t.versionNumber = x.versionNumber testdb001-# order by 1 asc; datatypeid | versionnumber ------------+--------------- 1 | 4 2 | 1 5 | 2 9 | 2 100 | 1 (5 rows) testdb001=#
The use of the INTERSECT set operation to meet this technical requirement is charming. The SQL code is very simple to read, and the code, as written, is also can be easily parallelised with a low execution plan cost. Like the INNER JOIN above, the code construct is rarely seen in production code, more than likely as the developer didn’t even consider using it to retrieve some most recent data version ID’s.
testdb001=# testdb001=# testdb001=# select t.dataTypeID, versionNumber testdb001-# from tblData t testdb001-# intersect testdb001-# select dataTypeID, max(versionNumber) testdb001-# from tblData testdb001-# group by dataTypeID testdb001-# order by 1 asc; datatypeid | versionnumber ------------+--------------- 1 | 4 2 | 1 5 | 2 9 | 2 100 | 1 (5 rows) testdb001=#
The obvious technical approach that satisfies the requirement is to use a SQL HAVING clause.
The ‘positives’ to using the GROUP BY HAVING pair include
The ‘negatives’ are
The embarrassing ‘real reasons’ why the HAVING clause is infrequently used
The textbook GROUP BY HAVING construct that satisfies the most recent version requirement is shown below:
testdb001=# testdb001=# testdb001=# testdb001=# select t.dataTypeID, max(versionNumber) versionNumber testdb001-# from tblData t testdb001-# group by dataTypeID testdb001-# having count(*) = max(versionNumber) testdb001-# order by 1 asc; datatypeid | versionnumber ------------+--------------- 1 | 4 2 | 1 5 | 2 9 | 2 100 | 1 (5 rows) testdb001=# testdb001=#
There are two elephants in the room.
— Published by Mike, 09:56:51 17 August 2017 (GMT)
By Month: November 2022, October 2022, August 2022, February 2021, January 2021, December 2020, November 2020, March 2019, September 2018, June 2018, May 2018, April 2018
Apple, C#, Databases, Faircom, General IT Rant, German, Informatics, LINQ, MongoDB, Oracle, Perl, PostgreSQL, SQL, SQL Server, Unit Testing, XML/XSLT
Leave a Reply