Traditionally database indexes, a mechanism to quickly locate data within the database without significant performance penalty, have been limited to database data type primitives. In Oracle these primitives are typically DATEs, data types containing text such such as VARCHAR2 and CHAR, and NUMBERs. Little support for other data types is provided. Notably, out of the box, Oracle does not provide the facility to index and query information containing chemical reaction schemes or discrete chemical structures.
An Oracle Data Cartridge, a plugin component conceptually similar technology to IBM DB2 extenders, or IBM Informix Datablades, is a database extensible indexing technology that permits the cartridge developer to extend the DBMS feature set to recognise more complex data types. This can include chemical structure information. For chemoinformatic applications, the cartridge would inter alia extend the DBMS feature set to index and query information containing chemical structures and reaction schemes on-a-par with other Oracle data types such VARCHAR2, CHAR, DATE, and NUMBER.
Pharmaceutical partner “Chemical Services plc” has a collection of several hundred thousand internal scientific documents. In-house staff have imported these documents into the database. Three typical document excerpts are shown below.
These documents were then searched with the aim of finding subject matter containing all the words enantiomer, resolution, and tartaric.
A most inefficient way to search database tables containing this information would be to use this type of SQL construct:
SELECT whatever FROM tableContainingDocuments WHERE dbColumn LIKE '%tartaric%enantiomer%resolution%'
This Oracle SQL query would meet the search criteria requirements when performed against the database containing the documents above, only returning the last of the three in the result set. Despite meeting the search requirements, the query does have severe limitations however. These include:
Oracle addresses the types of issues raised above, and many more, in a no additional licence cost component known as Oracle Text (referred to as Intermedia Text or ConText in now unsupported versions of Oracle). Within Oracle, this component is implemented as a Data Cartridge. Oracle Text is incredibly feature rich and SQL constructs not unlike that shown below address many of the points raise above.
SELECT whatever FROM tableContainingDocuments WHERE CONTAINS(dbColumn,'<query> <textquery grammar="context"> enantiomer AND resolution AND tartaric </textquery> </query>') > 0
Notably
In summary extensible indexing components like Oracle Text allow Oracle to index data types of choice. In the case of Oracle Text, this includes content within Microsoft Word and Adobe PDF documents. The underlying database indexes are contructed through reverse-engineering and lexical analysis of the source document content. When database queries are performed against the document content, the indexes are used to retrieve the result set
candidates. As described, the necessity to perform a full table scan and expensively reverse-engineer each document on the fly has been obviated.
Extensible indexing technology also exists to manipulate discrete chemical structures, sub structures, and reaction schemes within the database. Enhancing Oracle via a chemistry aware data cartridge allows Oracle to manipulate NUMBERs, character types containing text such as VARCHARs and CHAR, DATEs, and chemical structures! The interface, as demonstrated above for rich data types such as Microsoft Word and Adobe PDF documents remains SQL.
A superb example of chemical structure searching using an online publically accessible system, keyed on chemical structure, is eMolecules (ooriginally known as Chmoogle but they were strong-armed intochanging their name by legal pressure from Google). EMolecules accepts a number of commonly used electronic formats for the representation of chemical structure information including SMILES strings, Cambridgesoft ChemDraw files, and IsisDraw files, and even InChi identifiers. Oracle Data Cartridges can be used for more than just Chemical Structures and documents containing texto. For example, C$WILDNA1 is a COTS product for indexing DNA profiles on-a-par with native Oracle data types.
This is the third of eight blog articles with a subtitle of “Chemical Structures In Databases”. The prior article can be located here and the next article in the series of eight here.
— Published by Mike, 14:58:23 06 April 2018 (BST)
By Month: November 2022, October 2022, August 2022, February 2021, January 2021, December 2020, November 2020, March 2019, September 2018, June 2018, May 2018, April 2018
Apple, C#, Databases, Faircom, General IT Rant, German, Informatics, LINQ, MongoDB, Oracle, Perl, PostgreSQL, SQL, SQL Server, Unit Testing, XML/XSLT
Leave a Reply