Copyright©2009 Antony Williams
I gave a talk today at the ICIC 2009 meeting here in Sitges, Spain. It is an interesting meeting and I will report on some of the presentations later. I’m glad I am here. The presentation is here on Slideshare and is a modified version of a presentation I gave on Saturday at the Microsoft eScience conference in Pittsburgh. One of the questions that followed the presentation was whether ChemSpider could be used as a measuring stick for quality (I am paraphrasing). My response was that there are millions of errors on ChemSpider. That seemed to raise a giggle, and other people since then have seemed surprised.
In my opinion, as shocking as it sounds, it must be true. Why?
There are almost 23 million unique chemical entities on the database. Many of them have multiple associated names and experimental properties, and many have tens of links to external databases. The structural layouts have been created using algorithms. Algorithms have also been used to generate systematic names. There are spectra submitted by the public, and these can be mis-referenced, for example, or declared as run in one solvent when they were ACTUALLY run in another. There are sometimes multiple registry numbers associated with a compound…a CAS number for a salt associated with the neutral compound, for example. The links out to external resources number in the tens of millions, and these are changing daily as other websites and databases curate and edit their data. Errors are inevitable and, I judge, there must be millions of errors on ChemSpider. Just as there must be millions on Wikipedia and in the search results you get back from Google.
The question is: what counts as an error? I’m painting with a broad brush here…a structure with a poor depiction is an error. A misspelling is an error. A dead link to a database is an error. So…definitely millions. But we continue our work to whittle down the number, with the assistance of the community, every day. We’re doing it while we are depositing new compounds onto the database, so it’s an interesting challenge. Millions of errors don’t make ChemSpider less useful…we’re just realistic about the magnitude of the challenge!