Development of data mining and reporting tools; analysis of compound- or reaction-related data together with the associated alphanumeric data from physical chemistry and biology

Building on existing retrieval systems, data mining and reporting tools use a set of different methods to evaluate the results of database searches or results stored in files. A good example is the clustering of chemical structures and the comparison of the clusters with alphanumeric results from biological or physicochemical measurements. Clustering is itself a statistical process. To cluster structures, each structure depiction must first be converted into arrays of numbers (descriptors) that statistical software packages can handle. The outcome of the process is a set of “bins” of similar structures, meaning that the structures within a bin show a high overlap in the numerical descriptors used for the statistical process. For the chemist, this means that the structures within a bin share very similar structural elements, for example a five-membered heterocyclic ring or certain functional groups. Because chemical structures and their properties are related, the probability is quite high that structures in the same bin show similar activity. Each bin is therefore correlated with the related data (in practice mostly biological data) to identify active compounds as well as exceptions. Applying this filtering process iteratively over multiple steps, combined with structure and data filtering, allows you to identify active compound classes within the project you are working on.
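The binning and correlation steps described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the fingerprints are hypothetical sets of "on" descriptor positions rather than output from a real descriptor calculation, the activity values are invented toy numbers, and the simple leader algorithm with a Tanimoto threshold stands in for whatever clustering method a statistical package would actually apply. All function and variable names are made up for this example.

```python
from statistics import median


def tanimoto(a, b):
    """Tanimoto (Jaccard) overlap of two sets of 'on' descriptor positions."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def leader_cluster(fingerprints, threshold):
    """Put each compound into the first bin whose leader is similar enough."""
    bins = []  # each bin is a list of compound ids; the first id is the leader
    for cid, fp in fingerprints.items():
        for members in bins:
            if tanimoto(fp, fingerprints[members[0]]) >= threshold:
                members.append(cid)
                break
        else:
            bins.append([cid])  # no similar leader found: open a new bin
    return bins


def summarize_bins(bins, activity, tolerance):
    """Per bin: median activity and members deviating by more than `tolerance`."""
    summary = []
    for members in bins:
        mid = median(activity[cid] for cid in members)
        exceptions = [cid for cid in members if abs(activity[cid] - mid) > tolerance]
        summary.append({"members": members, "median": mid, "exceptions": exceptions})
    return summary


# Hypothetical toy data: descriptor bit positions and measured activities.
fingerprints = {
    "cpd-1": frozenset({1, 2, 3, 4}),
    "cpd-2": frozenset({1, 2, 3, 5}),
    "cpd-3": frozenset({1, 2, 3, 4, 6}),
    "cpd-4": frozenset({10, 11, 12}),
    "cpd-5": frozenset({10, 11, 13}),
}
activity = {"cpd-1": 7.1, "cpd-2": 6.9, "cpd-3": 4.2, "cpd-4": 5.0, "cpd-5": 5.2}

bins = leader_cluster(fingerprints, threshold=0.5)
summary = summarize_bins(bins, activity, tolerance=1.0)
for entry in summary:
    print(entry)
```

With this toy data the first three compounds fall into one bin and the last two into another, and cpd-3 is flagged as an exception because its activity lies far from the median of its bin. Note that leader clustering depends on the input order; in a real project the descriptors would come from a cheminformatics toolkit and the clustering from an established statistical method.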

As this example shows, you need multiple tools for data analysis and reporting. Besides scripting at the operating-system level or Excel add-ins, CΞC StructurePendium Technologies GmbH uses data pipelining tools such as Accelrys Pipeline Pilot or KNIME for data analysis. These tools have “nodes” that each perform a single task, such as retrieving data from the database or reading an SD file, converting the structures into statistically meaningful sets of numbers, running the clustering (in our example above) and returning the results to spreadsheets or other visualization tools that allow these results to be analyzed.
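The node idea can be illustrated without either product. The following sketch is not the Pipeline Pilot or KNIME API; it only mimics the principle of chaining single-task steps in plain Python. The toy SD records, the `pIC50` field name, the cutoff and all function names are assumptions made for this example. The reader node collects only the record name and the data fields, and the descriptor and clustering nodes are left out because they would need a cheminformatics toolkit.

```python
import csv
import io
from functools import reduce


def read_sd_fields(sd_text):
    """Node: split SD text at '$$$$' and collect the '> <FIELD>' data items."""
    records = []
    for block in sd_text.split("$$$$"):
        lines = block.strip("\n").splitlines()
        if not any(line.strip() for line in lines):
            continue  # skip the empty remainder after the last delimiter
        record = {"name": lines[0].strip()}  # first line of a record is its name
        for i, line in enumerate(lines):
            if line.startswith(">") and "<" in line:
                field = line[line.index("<") + 1 : line.rindex(">")]
                record[field] = lines[i + 1].strip() if i + 1 < len(lines) else ""
        records.append(record)
    return records


def to_float(field):
    """Node factory: convert one text field of every record to a number."""
    return lambda records: [{**r, field: float(r[field])} for r in records]


def keep_at_least(field, cutoff):
    """Node factory: keep only records whose field value reaches the cutoff."""
    return lambda records: [r for r in records if r[field] >= cutoff]


def to_csv(fieldnames):
    """Node factory: write the records as CSV text that a spreadsheet can open."""
    def node(records):
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    return node


def run_pipeline(data, nodes):
    """Pass the data through each node in order."""
    return reduce(lambda acc, node: node(acc), nodes, data)


def toy_record(name, pic50):
    """Build a hypothetical SD record with an empty structure block."""
    return "\n".join([
        name, "  hypothetical", "",
        "  0  0  0  0  0  0  0  0  0  0999 V2000",
        "M  END", "> <pIC50>", str(pic50), "", "$$$$",
    ])


SD_TEXT = "\n".join(
    toy_record(n, v) for n, v in [("cpd-1", 7.1), ("cpd-2", 6.9), ("cpd-3", 4.2)]
) + "\n"

report = run_pipeline(SD_TEXT, [
    read_sd_fields,
    to_float("pIC50"),
    keep_at_least("pIC50", 6.0),
    to_csv(["name", "pIC50"]),
])
print(report)
```

Because every node takes the output of the previous one, steps can be swapped, inserted or removed without touching the others, which is the main appeal of the pipelining approach.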