Design of data models for structure and reaction database systems and their related biological and physicochemical data for data registration and retrieval (data warehouses, data marts)

The design of molecule databases depends on how “compound”, “batch/lot” and “sample” are defined, as described for example in “Chemical drawing rules”. The “compound” may be seen as the idealization of the “biologically active structure”, or as a salt that may or may not include non-stoichiometric ratios of the components and residual solvents. Stereochemical information may be drawn as an integral part of the chemical structure or handled in explicit, structure-related text comments. Mixtures may be drawn implicitly or explicitly, or may be handled by relational tables that link the structure entries and their composition data to a common mixture entity. As these simple examples show, the compound side of the data model is mainly determined by how the idea of the “compound” is defined within the given context. The modeling of “batches” or “lots” and “samples”, on the other hand, is strongly influenced by the application environment, which describes the physical workflow of substances; the model has to ensure that other hardware and software modules can add data to the system or retrieve the information they need.
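One possible way to picture the compound, batch and sample levels is a three-table relational layout. The following is only a minimal sketch using Python's built-in sqlite3 module; every table name, column name and identifier (compound, batch, sample, CPD-1, BAT-1, SMP-1 and so on) is a hypothetical choice for illustration, and a real registration system would be shaped by its own definition of “compound” and its own workflow.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative assumptions.
SCHEMA = """
CREATE TABLE compound (
    compound_id TEXT PRIMARY KEY,
    structure   TEXT NOT NULL,   -- idealized structure, e.g. as SMILES or molfile
    stereo_note TEXT             -- optional structure-related text comment
);
CREATE TABLE batch (
    batch_id    TEXT PRIMARY KEY,
    compound_id TEXT NOT NULL REFERENCES compound(compound_id),
    salt        TEXT,            -- counter-ion or solvent, if any
    salt_ratio  REAL             -- ratios need not be whole numbers
);
CREATE TABLE sample (
    sample_id TEXT PRIMARY KEY,
    batch_id  TEXT NOT NULL REFERENCES batch(batch_id),
    location  TEXT               -- e.g. a vial or plate position
);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with one compound, two batches, three samples."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO compound VALUES (?, ?, ?)", ("CPD-1", "CCN", None))
    conn.executemany(
        "INSERT INTO batch VALUES (?, ?, ?, ?)",
        [("BAT-1", "CPD-1", None, None), ("BAT-2", "CPD-1", "HCl", 1.5)],
    )
    conn.executemany(
        "INSERT INTO sample VALUES (?, ?, ?)",
        [("SMP-1", "BAT-1", "vial A1"), ("SMP-2", "BAT-1", "vial A2"), ("SMP-3", "BAT-2", "plate P7")],
    )
    conn.commit()
    return conn


def samples_for_compound(conn: sqlite3.Connection, compound_id: str) -> list:
    """Return all sample identifiers that trace back to the given compound."""
    rows = conn.execute(
        "SELECT s.sample_id FROM sample s "
        "JOIN batch b ON b.batch_id = s.batch_id "
        "WHERE b.compound_id = ? ORDER BY s.sample_id",
        (compound_id,),
    )
    return [row[0] for row in rows]
```

In this sketch the salt information sits on the batch; a model that treats the salt form as the “compound” would move those columns to the compound table instead.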

Physicochemical and biological data are measured on physically existing samples of a compound and are therefore sample-related. For registration, generic data models help considerably in handling the wide variety of data types, especially in biology. For retrieval, however, most of these data must be pivoted against the sample (compound) identifier to produce the frequently required report format of compound against all related biological results. This is one of the reasons why data warehouses and data marts are so common for the retrieval, analysis and reporting of biological and physicochemical data.
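The pivoting step can be illustrated with a small sketch in plain Python. It assumes a generic “long” result layout of (sample identifier, assay name, value); the sample identifiers, assay names and numbers below are made up for the example, and a production data mart would normally do this in SQL or a reporting tool.

```python
from collections import defaultdict

# Made-up generic ("long") result rows: (sample_id, assay, value).
RESULTS = [
    ("SMP-1", "logD", 2.1),
    ("SMP-1", "IC50_nM", 35.0),
    ("SMP-2", "IC50_nM", 410.0),
    ("SMP-2", "solubility_uM", 120.0),
]


def pivot_by_sample(rows):
    """Turn long result rows into one row per sample with one column per assay.

    Returns (column_names, table), where table maps sample_id to a list of
    values aligned with column_names. Missing results become None. If a sample
    has several values for the same assay, the last one wins (a simplification).
    """
    assays = sorted({assay for _, assay, _ in rows})
    by_sample = defaultdict(dict)
    for sample_id, assay, value in rows:
        by_sample[sample_id][assay] = value
    table = {
        sample_id: [values.get(assay) for assay in assays]
        for sample_id, values in sorted(by_sample.items())
    }
    return assays, table


columns, table = pivot_by_sample(RESULTS)
print(["sample"] + columns)
for sample_id, values in table.items():
    print([sample_id] + values)
```

The long layout keeps registration flexible because a new assay needs no schema change, while the pivoted layout is what a report of compound against results usually needs.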

Data models for reactions have to handle the reactions and all their related components, such as reactants, products, catalysts, solvents and reagents. A quite common approach for reaction databases is taken from the former MDL REACCS program (the predecessor of ISIS for reactions; the ISIS/Host reaction databases use the REACCS format) and is used as the data model for RD files (Reaction Data files): each reaction consists of one or more variations. Reactants and products define the reaction part (two reactions are therefore identical if all reactants and all products are identical), while the variation contains the agent data (catalysts, solvents, reagents) and physical reaction conditions such as temperature or pressure. Accordingly, one reaction is related to one or more variations, giving a “1 to n” relationship between reactions and variations.
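The “1 to n” relationship between reactions and variations might be sketched as follows. This is an illustration of the idea only, not the REACCS or RD file format itself: the class names Variation and ReactionRegistry are hypothetical, structures are assumed to be already canonical strings, and stoichiometric duplicates are ignored.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple

# A reaction is identified only by its sets of reactants and products.
ReactionKey = Tuple[FrozenSet[str], FrozenSet[str]]


@dataclass(frozen=True)
class Variation:
    """Agents and physical conditions of one way of running a reaction."""
    agents: Tuple[str, ...]
    temperature_c: Optional[float] = None
    pressure_bar: Optional[float] = None


class ReactionRegistry:
    """Hypothetical in-memory store: one reaction key, many variations."""

    def __init__(self) -> None:
        self._variations: Dict[ReactionKey, List[Variation]] = {}

    def register(self, reactants: Iterable[str], products: Iterable[str],
                 variation: Variation) -> ReactionKey:
        # Order of components does not matter for reaction identity.
        key = (frozenset(reactants), frozenset(products))
        self._variations.setdefault(key, []).append(variation)
        return key

    def variations(self, key: ReactionKey) -> List[Variation]:
        return list(self._variations.get(key, []))

    def __len__(self) -> int:
        # Number of distinct reactions, not of variations.
        return len(self._variations)


registry = ReactionRegistry()
key_a = registry.register(["CC(=O)O", "OCC"], ["CC(=O)OCC", "O"],
                          Variation(agents=("OS(=O)(=O)O",), temperature_c=80.0))
key_b = registry.register(["OCC", "CC(=O)O"], ["O", "CC(=O)OCC"],
                          Variation(agents=("[H+]",), temperature_c=25.0))
```

Both registrations end up under the same reaction because reactants and products match; only the variation list grows.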
Reaction SMILES and RInChI have no concept of variations. Instead, a reaction consists of reactants, products and agents, so that the uniqueness of a reaction is defined by all participating structural components.
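For reaction SMILES, which use the layout reactants>agents>products with components separated by dots, the effect on reaction identity can be sketched like this. The helper names parse_reaction_smiles and reaction_key are hypothetical, and the parsing is deliberately naive: it does no canonicalization, no atom mapping and no handling of SMILES extensions.

```python
def parse_reaction_smiles(rxn: str):
    """Split 'reactants>agents>products' into three tuples of component strings."""
    parts = rxn.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators in a reaction SMILES")
    # An empty field (for example no agents in 'A>>B') becomes an empty tuple.
    return tuple(tuple(part.split(".")) if part else () for part in parts)


def reaction_key(rxn: str):
    """Identity key in which agents count, unlike a reaction/variation model."""
    reactants, agents, products = parse_reaction_smiles(rxn)
    return (frozenset(reactants), frozenset(agents), frozenset(products))


with_acid = "CC(=O)O.OCC>[H+]>CC(=O)OCC.O"
without_agent = "CC(=O)O.OCC>>CC(=O)OCC.O"

# Same reactants and products, different agents: two distinct reactions here.
print(reaction_key(with_acid) == reaction_key(without_agent))
```

Under a reaction/variation model the two strings above would describe one reaction with two variations; with agents included in the key they count as separate entries.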

Over decades, the members of StructurePendium have gained extensive experience in designing appropriate data models for registration and retrieval purposes for

  • Compound databases
  • Reaction databases
  • Databases for biological and physicochemical data
  • Data warehouses and data marts