I gave a talk today at the ICIC 2009 meeting here in Sitges, Spain. It is an interesting meeting and I will report on some of the presentations later. I’m glad I am here. The presentation is on Slideshare and is a modified version of a presentation I gave on Saturday at the Microsoft eScience conference in Pittsburgh. One of the questions that followed the presentation was whether ChemSpider could be used as a measuring stick for quality (I am paraphrasing). My response was that there are millions of errors on ChemSpider. That seemed to raise a giggle, and other people have seemed surprised since then.

In my opinion, as shocking as it sounds, it must be true. Why?

There are almost 23 million unique chemical entities on the database. Many of them have multiple names and experimental properties associated with them, and many have tens of links to external databases. The structural layout has been created using algorithms. Algorithms have been used to generate systematic names. There are spectra submitted by the public, and these can be mis-referenced, for example, or declared to have been run in one solvent when they were ACTUALLY run in another. There are sometimes multiple registry numbers associated with a compound…a CAS number for a salt associated with the neutral compound, for example. The links out to external resources number in the tens of millions, and these are changing daily as other websites and databases curate and edit their data. Errors are inevitable and, I judge, there must be millions of errors on ChemSpider. Just as there must be millions on Wikipedia and in the search results you get back from Google. The question is: what counts as an error? I’m painting errors with a broad brush…a structure with a poor depiction is an error. A misspelling is an error. A dead link to a database is an error. So…definitely millions. But we continue our work to whittle down the number, with the assistance of the community, every day. We’re doing it while we are depositing new compounds onto the database, so it’s an interesting challenge. Millions of errors don’t make ChemSpider less useful…we’re just realistic about the magnitude of the challenge!


6 Responses to “Does ChemSpider Have Millions of Errors?”

  1. Egon Willighagen says:

    Hi Tony,

    what happens with these ‘fixes’? Does ChemSpider have a mechanism of reporting them ‘upstream’ (where the error came from)? And, how is this handled when you pull in an updated upstream release, which may still lack the fix, or, and that’s where manual labor comes in again, have a *conflicting* fix?

  2. Antony Williams says:

    When I have reported errors to a number of the database hosts where errors have come into ChemSpider, I generally get no response. Not always, but generally. In most cases I see no adjustments made to the data in error, so I have to assume that either they are not receiving/reading my email or they are choosing not to do the work to remove the errors. Some groups do make adjustments: especially, of course, Wikipedia, where I make adjustments myself and where the WP:Chem team is very active, and also David Wishart’s group at DrugBank.

    When we change name-structure associations they are persisted because we keep the erroneous names on the database and even if they are redeposited they remain flagged as in error. The same is true for database IDs. We are instituting the same policies for experimental data moving forward but not right now as we are adjusting our data model to make such data more discoverable and available via web services. I’ll expand on this discussion in a separate blog post.

    Bottom line though is we cannot clean up everyone else’s data right now and, in many cases, people don’t seem to respond even when errors are pointed out.
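
The persistence policy described in the reply above (erroneous names are kept on the database and stay flagged even when redeposited) might be sketched roughly as follows. This is only an illustration under assumptions: `NameStore`, `deposit`, `flag_error` and `valid_names` are hypothetical names and do not reflect ChemSpider's actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class NameStore:
    # Hypothetical store: maps (structure_id, name) -> True when a curator
    # has flagged that name-structure association as erroneous.
    flagged: dict = field(default_factory=dict)

    def deposit(self, structure_id, name):
        # New pairs start unflagged; a redeposit of a known pair leaves
        # any existing error flag untouched.
        self.flagged.setdefault((structure_id, name), False)

    def flag_error(self, structure_id, name):
        # The erroneous name is kept, not deleted, so the flag can persist.
        self.flagged[(structure_id, name)] = True

    def valid_names(self, structure_id):
        # Only names that have not been flagged are shown as valid.
        return sorted(
            name
            for (sid, name), bad in self.flagged.items()
            if sid == structure_id and not bad
        )
```

Keeping the bad association in the store, rather than deleting it, is what lets a later redeposit from an uncorrected source be recognized and ignored.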

  3. Egon Willighagen says:

    “Bottom line though is we cannot clean up everyone else’s data right now and, in many cases, people don’t seem to respond even when errors are pointed out.”

    Indeed! That’s why I was so interested in learning if/how ChemSpider approaches this. Ubuntu is an example of an organization that has to deal with the same problem, and it often needs to file bug reports ‘upstream’ with Debian too.

    What this needs is a simple, clean ticket system that integrates with ChemSpider itself. If you file a ticket against a CS entry (make a modification), this should trigger this upstream reporting. There are two alternatives, but since ‘upstream’ in this case very likely does not have a ticket (or bug tracking) system anyway, I would propose the following:

    * for each ‘upstream’ resource, create a web page that lists all errors reported for each entry in that upstream resource.
    * possibly create a machine-readable version of it (RDFa perhaps), so that we can write a simple Userscript to make those reports *visible* on the upstream website too

    In that way, the manual work needed for upstream to take advantage of the ChemSpider work is minimized and the usefulness of ChemSpider maximized.

    This is beyond Open Data, and this goes into Open Projects and interoperability. If ChemSpider would adopt something along these lines, it would actually take chemical databases a step further instead of being yet another repository.
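
The per-resource error listing proposed in the bullets above could start from a grouping step like the one below. It is a minimal sketch under assumptions: the `(source, entry_id, message)` report shape and the `group_reports` name are hypothetical, and rendering to a web page or RDFa is left out.

```python
from collections import defaultdict


def group_reports(reports):
    """Group error reports by upstream resource, then by upstream entry.

    reports: iterable of (source, entry_id, message) tuples (assumed shape).
    Returns {source: {entry_id: [message, ...]}}.
    """
    by_source = defaultdict(lambda: defaultdict(list))
    for source, entry_id, message in reports:
        by_source[source][entry_id].append(message)
    # Convert to plain dicts so the result is easy to serialize.
    return {source: dict(entries) for source, entries in by_source.items()}
```

Each top-level key would then correspond to one page (or one machine-readable document) per upstream resource.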

  4. Michael Kuhn says:

    KEGG also seems to be responsive to fixes. However, I think that it is probably better to adjust your downstream system to deal with errors from upstream. When we looked into creating a good set of chemical synonyms for the STITCH database (based on PubChem) it seemed to us that once erroneous synonyms get into the system, they’ll just be re-shared between dbs even if you get some of them to fix the error. We thus prioritized the source databases, KEGG being one of the “good” ones – but even KEGG makes mistakes sometimes.

    So perhaps it would make sense to make “diffs” that can be re-applied to source databases when you import them again?
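
A rough sketch of the "diffs" idea in the comment above: corrections are stored separately and replayed every time a source database is re-imported. The record shape (id mapped to a set of synonyms), the `(record_id, op, synonym)` diff format and the `apply_diffs` name are all assumptions made for illustration.

```python
def apply_diffs(upstream_records, diffs):
    """Return a corrected copy of a fresh upstream import.

    upstream_records: dict mapping record id -> set of synonyms (assumed shape).
    diffs: list of (record_id, op, synonym), where op is "remove" or "add".
    """
    # Work on a copy so the raw import stays available for comparison.
    corrected = {rid: set(names) for rid, names in upstream_records.items()}
    for record_id, op, synonym in diffs:
        if record_id not in corrected:
            continue  # the record no longer exists upstream
        if op == "remove":
            # discard() is a no-op if upstream has already fixed the error
            corrected[record_id].discard(synonym)
        elif op == "add":
            corrected[record_id].add(synonym)
    return corrected
```

Because both operations are idempotent, the same diff list can be replayed against every new release regardless of whether upstream has adopted the fix.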

  5. Imants Zudans says:

    I understand well the magnitude of this problem as we are facing some similar issues. You can generate InChI Keys and IUPAC names from structures. But it is very difficult to ensure that information from the outside is correct. However, sometimes even wrong information is useful, so maybe you are too hard on yourself. A CAS number might be incorrectly associated with a particular structure, for example. But it is far better to get 3 structures for one CAS query and then decide which one is the correct one than not to get any results. This is why Google is so useful. A lot of websites in search results are somewhat irrelevant, but since they don’t censor the content, most of the time it is possible to find the needed information. Chemical databases are more restrictive about the information they let in, and that sometimes makes it harder to find the needed compound. That is even true for the structure search, as many databases don’t offer a tautomer search option.

    I am looking forward to seeing the follow-up post!

  6. Joerg Kurt Wegner says:

    I agree with Egon that ‘upstream’ or ‘remote curation/annotation’ would be really helpful. The Reflect service of EMBL/EBI now allows users to tag missed entries (e.g. gene identifiers) and to annotate wrong ones. With a little mashup and right mouse-click actions this could be done for ChemSpider, too.

    Input: URL, text (chemistry or identifier), meta-data (annotation details)
    Output in ChemSpider database – Annotation of basically any web-page out there allowing chemistry enrichment.

    In the long run, people could ‘remotely’ help curate chemistry and any associated data.
