Archive for July 24th, 2008

I have spoken on this blog many times about the challenges of cleaning up data in chemistry databases. We’re expending a lot of effort, with the assistance of many others, in cleaning up the data on ChemSpider and, as a benefit, assisting in cleaning up data in other databases also. The efforts to curate the chemical structure data on Wikipedia continue, and the work is now focused on delivering ‘bots that will drive a cleansed data file to the individual records. Over the past few months I have developed a great appreciation for the efforts, dedication and commitment of the many contributors to Wikipedia Chemistry. There are dozens of people editing and contributing to the articles, and then there is the “core WP:Chem team” who show up for the IRC chats most Tuesdays at noon. Much of the discussion over the past weeks has focused on how to curate the data, how to utilize ‘bots, and how to control curated data moving forward. I am honored to share “IRC-space” with them!

Over the past few weeks I have been similarly blessed to interact with the ChEBI team via email as we have done our work to deposit their Entities of the Month (1,2). In the process we have exchanged many emails and have cleaned up a number of errors in our mutual datasets. In my opinion, a PERFECT example of the results of such detailed efforts is vancomycin. One week ago a search for vancomycin would give a dozen hits, many of which had incomplete stereochemistry. Now a search on ChemSpider gives one hit for vancomycin here. This is the result of working with Kirill Degtyarenko at ChEBI. The conversation was initiated by my observation regarding the stereochemistry of the structure on ChEBI.

For details on how this was identified as the correct structure, read the description on that page. It is VERY DETAILED and includes links out to three publications.

Compare this with a search for vancomycin on PubChem, which gives 66 hits. Some of the difference is due to our different approaches to text searching: the PubChem results list includes VANCOMYCIN HYDROCHLORIDE and Gatifloxacin & Vancomycin, for example. However, there are a number of “vancomycins” also.

We believe we have the correct vancomycin identified at this point…we welcome any challengers!


Thanks to the efforts of contributors such as Heinz Kolshorn, new compounds and associated analytical data are finding their way onto ChemSpider on a regular basis. These are chemical compounds that have been synthesized and fully characterized. Unless they are published, they are unlikely to find their way into chemical registry systems or into training databases for the commercial NMR prediction packages such as those of ACD/Labs, Bio-Rad, Modgraph or Wolfgang Robien’s collection. As a result, this type of information would become “Lost Chemistry”. These particular data from Heinz will almost certainly find their way into the NMRShiftDB, since Heinz is hosting the database at his lab at the University of Mainz.

Heinz has been putting actual experimental spectra and the associated shift assignments onto ChemSpider of late. An example is here. This is enabled by our ability to upload and store both spectra and images. There are better ways to display the shift assignments, such as allowing mouseover display of the structure and peak associations. This is not yet available on the system but is clearly a nice-to-have. For now the information is there for others to use and is indicative of the value of integrating images and spectral data. I can envisage other pairings, such as a UV spectrum alongside a photo of the colored solution.
