CΞC StructurePendium Technologies GmbH was founded in 2010 to support the optimization of data management in R&D for the pharmaceutical, agrochemical, chemical, and cosmetics industries. It offers consulting services for the workflow management of biological, chemical, and physical data, as well as implementation services and training for standard software components, in the following fields:
- Representation of chemical structures and reactions
- Normalization of chemical structures and reactions
- Development of database-consistent company drawing rules
- Handling of biologics and biopolymers based on V2/V3 molfiles, Accelrys SCSR format, HELM, or Biochemfusion PLN notation
- Special compound classes like polymers, mixtures, formulations, or compounds with structure sections described by statistical distributions based on the V2/V3 molfile format
- Design of data models for structure and reaction database systems and their related biological and physicochemical data for data registration and retrieval (data warehouses, datamarts)
- Development of registration and retrieval systems for chemical and biological data
- Development of data mining and reporting tools, data analysis for compound or reaction related data with related alphanumeric data from physicochemistry and biology
- Migration and/or separation of databases containing chemical structures and/or reactions
- Maintenance of legacy systems
- Collecting user scenarios and defining user requirements
- Project Management
- Business consulting
These services use tools from Accelrys, ChemAxon, Biochemfusion, Knime, RDKit, Oracle, Microsoft, and others.
Representation of chemical structures and reactions
Traditional small pharma compounds, statistically distributed compound structures in the cosmetic or chemical industry, and peptide sequences for diagnostics are examples of the diverse world of chemical structures. The requirements for chemical representation vary considerably between industries and lead to individual solutions for each field. Besides consistent drawing rules that ensure the reproducibility of chemical depictions, the registration and retrieval systems for chemical compounds and reactions have to support the individual needs of each industry sector.
Most of the standard drawing tools (Accelrys/Draw, ChemDraw, Marvin) support the needs of small pharma molecules out of the box. However, differing opinions about the right way to draw functional groups or to handle stereochemistry, for example, make it necessary to have enterprise-wide drawing rules in place and to enforce them during registration (for more details see "Chemical Drawing Rules on Enterprise Level" and "Automated Structure Modifications and Normalizations").
Peptide sequences, DNA, and RNA, on the other hand, are mainly described as text, because the standard amino acids and nucleobases have standard abbreviations that sequence-based methods like BLAST or FASTA can handle. But as soon as amino acids or nucleobases are chemically modified, or linkers and protecting groups are introduced into the molecules, only structure-based DBMS provide unique registration and retrieval of these compound classes. HELM, Biochemfusion’s PLN, or Accelrys’ SCSR extension for the V3 molfile format handle this area. (More details will follow ….)
The most challenging part is the structural representation of polymers and other compound classes that are not precisely defined but are described by statistical distributions (see "Sgroups – Abbreviations, Mixtures, Formulations, Polymers, Structures with Statistical Distribution and Other Special Cases" for more details). In this case the drawing, registration, and retrieval tools must be packaged so that the unique representation of the structures is guaranteed while end users still get a simple, user-friendly interface for registration and retrieval.
In this context StructurePendium has gathered experience in the following domains:
- Development of database-consistent company drawing rules
- Normalization of chemical structures and reactions using tools like Cheshire and Pipeline Pilot, with other tools under investigation
- Handling of biologics and biopolymers based on V2/V3 molfiles, Accelrys SCSR format, HELM, or Biochemfusion’s PLN notation, and their storage in the related cartridge systems of Accelrys, Biochemfusion, and ChemAxon
- Special compound classes like polymers, mixtures, formulations, or compounds with structure sections described by statistical distributions based on the V2/V3 molfile format
On the application side, these services are mostly embedded into registration and retrieval systems or into data pipelining tools like Accelrys Pipeline Pilot and Knime.
Design of data models for structure and reaction database systems and their related biological and physicochemical data for data registration and retrieval (data warehouses, datamarts)
The design of molecule databases depends on the definition of “compound”, “batch/lot”, and “sample”, as described for example in “Chemical drawing rules”. The “compound” may be seen as the idealization of the “biologically active structure” or as a salt that may or may not include non-rational ratios of the components and residual solvents. Information about stereochemistry may be drawn as an integral part of the chemical structure or may be handled in explicit structure-related text comments. Mixtures may be drawn implicitly or explicitly, or may be handled by relational tables that link the structure entries and their composition data to a common mixture entity. As these simple examples show, the compound side of the data model is mainly determined by how the idea of the “compound” is defined within the given context. The modeling of “batches” or “lots” and “samples”, on the other hand, is strongly influenced by the application environment that describes the physical workflow of substances, and it has to ensure that other hardware and software modules can add data to the system or obtain the information they need.
Physicochemical and biological data are measured on physically existing samples of a compound and are therefore sample related. For registration, generic data models help a great deal in handling the wide variety of data types, especially in biology. For retrieval, however, most of these data must be pivoted against the sample (compound) identifier to produce the frequently required report format of one compound against all of its related biological results. This is one of the reasons why data warehouses and data marts are so common for the retrieval, analysis, and reporting of biological and physicochemical data.
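The pivoting step described above can be illustrated with a minimal Python sketch. The row layout (sample identifier, assay name, value), the assay names, and the function name `pivot_by_sample` are assumptions made for illustration only and do not describe a particular product or schema.

```python
from collections import defaultdict

# Hypothetical generic result rows as a generic registration model might
# store them: (sample_id, assay_name, value).
rows = [
    ("S-001", "IC50_nM", 12.5),
    ("S-001", "logP", 2.1),
    ("S-002", "IC50_nM", 340.0),
    ("S-002", "solubility_mg_mL", 0.8),
]

def pivot_by_sample(rows):
    """Turn long (sample, assay, value) rows into one wide record per sample."""
    assays = sorted({assay for _, assay, _ in rows})
    wide = defaultdict(dict)
    for sample_id, assay, value in rows:
        wide[sample_id][assay] = value
    # Missing measurements become None so every record has the same columns.
    table = [
        {"sample_id": sid, **{a: wide[sid].get(a) for a in assays}}
        for sid in sorted(wide)
    ]
    return assays, table

assays, table = pivot_by_sample(rows)
for record in table:
    print(record)
```

In a data warehouse or data mart this reshaping is typically precomputed, so that reports do not have to pivot the generic tables at query time.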
Data models for reactions have to handle the reactions and all their related components, such as reactants, products, catalysts, solvents, and reagents. A common approach for reaction databases comes from the former MDL REACCS program (the predecessor of ISIS for reactions; the ISIS/Host reaction databases use the REACCS format) and is also the data model behind RD files (Reaction Data files): each reaction consists of one or more variations. Reactants and products define the reaction itself, so two reactions are identical if all reactants and all products are identical. The variation contains the agent data (catalysts, solvents, reagents) and physical reaction conditions like temperature or pressure. Accordingly, one reaction is related to one or more variations, which gives a “1 to n” relationship between reactions and variations.
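The “1 to n” relationship between reactions and variations can be sketched in a few lines of Python. This is an illustrative in-memory model, not the REACCS or RD file schema: the class names, the field names, and the `register` helper are hypothetical, and structures are assumed to be available as already canonicalized SMILES strings.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Variation:
    # Agent data and physical conditions belong to the variation.
    agents: tuple = ()
    temperature_c: Optional[float] = None
    pressure_bar: Optional[float] = None

@dataclass
class Reaction:
    # Identity of a reaction: its reactants and products only.
    reactants: tuple
    products: tuple
    variations: list = field(default_factory=list)

def register(db, reactants, products, variation):
    """Attach a variation to its reaction, creating the reaction if needed."""
    key = (tuple(sorted(reactants)), tuple(sorted(products)))
    reaction = db.setdefault(key, Reaction(*key))
    reaction.variations.append(variation)
    return reaction

db = {}
register(db, ["CC(=O)O", "OCC"], ["CC(=O)OCC", "O"],
         Variation(agents=("OS(=O)(=O)O",), temperature_c=80.0))
# Same reactants and products in a different order, different conditions.
register(db, ["OCC", "CC(=O)O"], ["O", "CC(=O)OCC"],
         Variation(agents=("[H+]",), temperature_c=25.0))
print(len(db), "reaction,", len(next(iter(db.values())).variations), "variations")
```

Both registrations end up under the same reaction because only reactants and products form the key; the differing agents and temperatures are kept as two variations.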
Reaction SMILES and RInChI have no concept of variations. Instead, a reaction consists of reactants, products, and agents, so the uniqueness of a reaction is defined by all participating structural components.
Over decades the members of StructurePendium have gathered extensive experience in designing appropriate data models for registration and retrieval for
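The difference between the two identity concepts can be made concrete with a small Python sketch based on the reaction SMILES layout `reactants>agents>products`, with components separated by dots. The function name `reaction_key` is hypothetical, and the sketch assumes that the component SMILES are already canonical; a real system would canonicalize them with a cheminformatics toolkit first.

```python
def reaction_key(rxn_smiles):
    """Split a reaction SMILES into sorted (reactants, agents, products) tuples."""
    parts = rxn_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected 'reactants>agents>products'")
    return tuple(tuple(sorted(p.split("."))) if p else () for p in parts)

# Same esterification, two different acid catalysts as agents.
k1 = reaction_key("CC(=O)O.OCC>OS(=O)(=O)O>CC(=O)OCC.O")
k2 = reaction_key("OCC.CC(=O)O>[H+]>O.CC(=O)OCC")

# All components count: the two entries are different reactions.
print(k1 == k2)
# Reactants and products only, as in the variation model: the same reaction.
print((k1[0], k1[2]) == (k2[0], k2[2]))
```

The first comparison is false and the second is true, which is why a migration between the two models has to decide how agents contribute to the identity of a reaction.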
- Compound databases
- Reaction databases
- Databases for biological and physicochemical data
- Data warehouses and data marts
Development of registration and retrieval systems for chemical and biological data
The goal of StructurePendium’s developments in registration and retrieval systems for chemical and biological data is to deliver user-friendly systems that integrate fully into the customer’s existing landscape, so that a new system can be introduced seamlessly into the existing environment.
Over the years a broad variety of registration and retrieval systems has been developed, partly by implementing systems available on the market and partly by programming systems from scratch. Very early systems were based on MDL’s ISIS and its Hviews, later ones on Isentris using Integrated Data Sources (IDS), Pipeline Pilot, or Insight. Most of the registration systems for chemical structures and reactions contain calls to structure normalization software such as Accelrys Cheshire (see "Automated Structure Modifications and Normalizations"). Other routines were written for individual tools like Accelrys/Draw to build a direct link to centrally administered structure templates or to simplify peptide sequence drawing by keyboard input in the context of structure registration and retrieval.
On the biological side, Accelrys Assay Explorer was used for the registration of biological data, as were self-developed registration tools based on MS Excel. Most of the retrieval systems for biological data are part of the structure retrieval systems using ISIS, Isentris, Pipeline Pilot, or Insight.
Proprietary developments were built on Oracle PL/SQL using the Accelrys Direct Cartridge, with clients in VB/C# or Excel (for biological data).
Currently we are gaining more experience with the products delivered by ChemAxon and the latest versions of the Accelrys software.
StructurePendium provides the following services for data registration and retrieval:
- Implementation and integration of software packages by Accelrys or ChemAxon
- Development of retrieval or reporting workflows, for example based on Accelrys Isentris, Insight, or Pipeline Pilot, ChemAxon’s JChem, or Knime
- Development of in-house solutions using software components like the cartridges of Accelrys, ChemAxon, Biochemfusion or RDKit
- Programming of registration and retrieval tools
- Testing, training and maintenance (see “Project handling” for more details)
Development of data mining and reporting tools, data analysis for compound or reaction related data with related alphanumeric data from physicochemistry and biology
Building on existing retrieval systems, data mining and reporting tools use a set of different methods to evaluate the results of database searches or results stored in files. A good example is the clustering of chemical structures and the comparison of the alphanumeric results from biological or physicochemical measurements. Clustering in itself is a statistical process. To cluster structures, the depiction of each structure must be converted into arrays of numbers that statistical software packages can handle. The outcome of the process is a set of “bins” of similar structures, meaning that the structures within a bin have a high overlap in the numerical descriptors used for the statistical process. For the chemist this means that the structures within a bin show very similar structural elements, for example a five-membered heterocyclic ring or certain functional groups. Because chemical structures and their properties are related, the probability is quite high that structures in the same bin show similar activity. Each bin is therefore correlated with the related data (in practice mostly biological data) to identify active compounds as well as exceptions. Applying this process iteratively over multiple steps, combined with structure and data filtering, makes it possible to identify the active compound classes within a project.
As this example shows, data analysis and reporting require multiple tools. Besides scripting at the operating system level or Excel add-ins, CΞC StructurePendium Technologies GmbH uses data pipelining tools like Accelrys Pipeline Pilot or Knime for data analysis. These tools have “nodes” that each perform a single task, such as retrieving data from the database or reading an SD file, converting the structures into statistically meaningful sets of numbers, running the clustering (to stay with the example above), and returning the results to spreadsheets or other visualization tools for analysis.
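The binning idea can be sketched in plain Python. The example below is only an illustration under stated assumptions: the fingerprints are hand-made sets of "on" bit positions (in practice they would come from a toolkit such as RDKit or from a pipelining component), the similarity measure is the Tanimoto coefficient, and the simple leader algorithm with a threshold of 0.5 stands in for whatever clustering method a real workflow uses. All names and values are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def leader_cluster(fingerprints, threshold=0.5):
    """Assign each compound to the first bin whose leader is similar enough."""
    bins = []  # list of (leader_id, member_ids)
    for cid, fp in fingerprints.items():
        for leader, members in bins:
            if tanimoto(fp, fingerprints[leader]) >= threshold:
                members.append(cid)
                break
        else:
            bins.append((cid, [cid]))  # no similar leader: open a new bin
    return [members for _, members in bins]

# Hypothetical fingerprints and hypothetical IC50 values in nM.
fingerprints = {
    "C1": {1, 2, 3, 4},
    "C2": {1, 2, 3, 5},
    "C3": {10, 11, 12},
    "C4": {10, 11, 13},
    "C5": {1, 2, 3, 4, 5},
}
ic50_nm = {"C1": 15.0, "C2": 22.0, "C3": 5000.0, "C4": 4800.0, "C5": 900.0}

bins = leader_cluster(fingerprints)
for members in bins:
    # Correlate each bin with its data to spot actives and exceptions.
    print(members, [ic50_nm[c] for c in members])
```

With these made-up values, one bin groups mostly potent compounds plus one much weaker member, which is the kind of exception the correlation step is meant to surface.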
Migration and/or separation of databases containing chemical structures and/or reactions
Migrations or separations of structure and reaction databases are fairly straightforward as long as the drawing rules and data models of the source and destination databases are identical or at least very similar. In this special case, all structures and their related data that must be migrated can be stored in a flat SD file (or a simple data record) that is used for re-registration into the new database, for example with tools like Accelrys’ Pipeline Pilot or Knime, or with Oracle processes. But this simple case is the exception in the field of mergers and separations.
One of the most popular transfer file types for reactions, the Reaction Data file (RD file), is organized hierarchically, reflecting the fact that each reaction may have one or more variations. That makes RD files laborious to handle with data pipelining tools, because most of them do not handle hierarchical data formats by default. But even when the data model adaptation is solved, the drawing rules of the source and destination databases may differ, giving different representations for identical chemical structures or reactions on the two sides of the transfer process. And last but not least, in mergers between different DBMS each system provides properties that may not be transferable and need special handling so that no information is lost. (See Accord example).
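The mismatch between hierarchical reaction data and flat, row-oriented pipelining tools can be illustrated with a short Python sketch. It deliberately does not parse the RD file format itself; the nested dictionaries below are a hypothetical in-memory stand-in for already parsed reactions, and all field names are assumptions.

```python
# Hypothetical parsed reactions: each reaction carries a list of variations.
reactions = [
    {
        "id": "RXN-1",
        "reactants": ["CC(=O)O", "OCC"],
        "products": ["CC(=O)OCC", "O"],
        "variations": [
            {"catalyst": "OS(=O)(=O)O", "temperature_c": 80.0, "yield_pct": 65.0},
            {"catalyst": "[H+]", "temperature_c": 25.0, "yield_pct": 40.0},
        ],
    },
    {
        "id": "RXN-2",
        "reactants": ["CCO"],
        "products": ["CC=O"],
        "variations": [{"catalyst": None, "temperature_c": 60.0, "yield_pct": 72.0}],
    },
]

def flatten(reactions):
    """Emit one flat row per (reaction, variation) pair."""
    rows = []
    for rxn in reactions:
        for number, variation in enumerate(rxn["variations"], start=1):
            rows.append({
                "reaction_id": rxn["id"],
                "variation_no": number,
                "reactants": ".".join(rxn["reactants"]),
                "products": ".".join(rxn["products"]),
                **variation,
            })
    return rows

for row in flatten(reactions):
    print(row)
```

The flat rows repeat the reaction part for every variation, so the destination side has to regroup them by `reaction_id` (or by a structure-based key) to restore the hierarchy.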
StructurePendium supports the full migration and separation process including
- development of migration rules for the participating data models
- development of migration rules for chemical structures and/or reactions (for comparison see article about drawing rules)
- set up of fully automated migration processes for chemical structures/reactions and all related data (for comparison see article about automated transformation)
- migration of biological and physicochemical data between different databases / DBMS
- development of the requirements and project management
The members of StructurePendium have gathered experience with multiple tools, starting with the export/import tools of MDL’s ISIS and Isentris, working with Oracle PL/SQL procedures and VB.NET, and using Accelrys’ Pipeline Pilot or Knime alongside other tools provided by the DBMS vendors.
Maintenance of legacy systems
With decades of experience with MDL/Symyx/Accelrys/BioVia software, CΞC StructurePendium Technologies GmbH is an ideal partner for maintaining your legacy systems. ISIS-based systems in particular are increasingly being replaced by newer technology. While the main applications are renewed, smaller installations for special interest groups can still be found quite frequently in the software landscape of most R&D organizations. To keep these applications running without having to keep specially trained in-house personnel available, StructurePendium offers maintenance services that keep the systems up to date at the level offered by the software vendors involved in the applications.
The members of CΞC StructurePendium Technologies have developed expert knowledge of all steps in projects including
- Detailed business analysis
- Collecting user scenarios (stories) and capturing user requirements
- System architecture
- Development of RFPs
- Evaluating vendor proposals
- Development and/or implementation of software packages
- Training of the users involved including the development of training material
- Software maintenance
- Overall project management to keep time and materials in line with the plan
Having worked in the area of chemoinformatics and bioinformatics for a long time, StructurePendium Technologies has gathered the necessary expertise for business consulting.
Contact us for more details.