I have spoken on this blog many times about the challenges of cleaning up data in chemistry databases. We’re expending a lot of effort, with the assistance of many others, in cleaning up the data on ChemSpider and, as a benefit, assisting in cleaning up data in other databases also. The efforts to curate the chemical structure data on Wikipedia continue, and the work is now focused on delivering ‘bots that will drive a cleansed data file to the individual records. Over the past few months I have developed a great appreciation for the efforts, dedication and commitment of the many contributors to Wikipedia Chemistry. There are many tens of people editing and contributing to the articles, and then there is the “core WP:Chem team” who show up for the IRC chats most Tuesdays at noon. Many of the chats in past weeks have focused on how to curate the data, utilize ‘bots and control curated data moving forward. I am honored to share “IRC-space” with them!

Over the past few weeks I have been similarly blessed to interact with the ChEBI team via email as we have done our work to deposit their Entities of the Month (1,2). During the process we have exchanged many emails and have cleaned up a number of errors in our mutual datasets. In my opinion a PERFECT example of the results of such detailed efforts is vancomycin. One week ago a search for vancomycin gave a dozen hits, many of which had incomplete stereochemistry. Now a search on ChemSpider gives one hit for vancomycin here. This is the result of working with Kirill Degtyarenko at ChEBI. The conversation was initiated by my observation regarding the stereochemistry of the structure on ChEBI.

For details on how this was identified as the correct structure, read the description on that page. It is VERY DETAILED and includes links out to three publications.

Compare this with a search for vancomycin on PubChem, which gives 66 hits. Some of these differences are due to the different approaches to our text searches: the PubChem results list includes VANCOMYCIN HYDROCHLORIDE and Gatifloxacin & Vancomycin, for example. However, there are a number of “vancomycins” also.

We believe we have the correct vancomycin identified at this point…we welcome any challengers!


One Response to “Collaboration, Community and Quality in Chemistry Databases”

  1. ChemSpider Blog » Blog Archive » STOP COUNTING the Number of Chemical Entities in Public Compound Databases and There are Ghosts in the Closet says:

    [...] issue. I’ll try again but I reference you to previous posts about Taxol (1,2,3), Vancomycin (4) and Ginkgolide B (5,6). I suggest you read these earlier posts but will try and explain again [...]
