Tonight I finished an article on Public Chemistry Databases. In that article I commented on the size of the public chemistry databases versus the commercial ones. There have been numerous discussions in the blogosphere about the size of databases such as PubChem relative to the CAS Registry. Recently PubChem and ChemSpider have headed towards 20 million structures. The CAS Registry is about 33 million.

Now, I don’t know how much duplication there is in the Registry, but I can comment on what is in ChemSpider and likely in PubChem. Here’s a basic comment about molecules with complex stereochemistry. They tend to exist MULTIPLE times in the database due to different variants of stereochemistry. Let’s examine Ginkgolide B. The structure below is taken from a recent RSC article. I was interested to see whether we had the “correct” structure of Ginkgolide B on ChemSpider, assuming that the correct structure is the one shown on the RSC webpage.

ginkgolide-b.png

A search on the name Ginkgolide B turned up a total of 6 structures. The connectivities are the same for all structures. The ONLY difference is in the stereochemistry. Take a look at the structures in Table View. There is one structure with full stereochemistry expressed. This one comes from PubChem, Thomson Pharma and xPharm. With full stereochemistry it might be safe to assume it is correct.

However, even for Taxol there are structures with complete stereochemistry and they are different: Structure 1, Structure 2, Structure 3, Structure 4 and Structure 5

I actually gave up looking eventually…here are the different complete stereochemistries. Look carefully…

t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36+,37+,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36-,37+,38+,40-,45-,46+,47-

t31-,32-,33+,35-,36-,37-,38-,40-,45-,46-,47-

t31-,32-,33+,35-,36+,37-,38-,40-,45-,46-,47-

Question for ChemSpider Users – there are actually WAY MORE than 10 Taxol skeletons on ChemSpider. Can anyone figure out how many? It actually takes one search to find them all!

We believe this is the correct structure of Taxol.

Back to Ginkgolide B. I redrew the structure shown in the RSC article (and as shown below).

ginkgolide-b_2.png

Generating the InChIKey for this structure and performing a search on ChemSpider gave me no hits. It looks like either the RSC structure is wrong OR all six structures from all of the different sources are wrong. As mentioned, there is actually only one Ginkgolide B structure (a structure with the associated identifier) on ChemSpider with full stereochemistry. The stereo for that structure is:

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+ ChemSpider

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20- RSC Stereo

There is ONE stereocenter difference.
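Spotting that one difference by eye is error-prone, so here is a minimal sketch of how the two stereo layers above could be compared automatically. The function names are my own, not part of any InChI toolkit, and the comparison assumes both strings come from InChIs with identical connectivity, so that the atom numbers refer to the same atoms. It also ignores any other stereo information that may be present elsewhere in the full InChI.

```python
def parse_stereo_layer(layer: str) -> dict[int, str]:
    """Map atom number -> parity sign for a stereo layer such as 't6-,7+,8-'."""
    body = layer.strip()
    if body.startswith("t"):
        body = body[1:]
    centers = {}
    for token in body.split(","):
        token = token.strip()
        if not token:
            continue
        # The last character is the parity sign, the rest is the atom number.
        centers[int(token[:-1])] = token[-1]
    return centers


def differing_centers(layer_a: str, layer_b: str) -> list[int]:
    """Return the atom numbers whose parity differs between two layers."""
    a = parse_stereo_layer(layer_a)
    b = parse_stereo_layer(layer_b)
    return sorted(atom for atom in a.keys() | b.keys() if a.get(atom) != b.get(atom))


chemspider = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+"
rsc = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-"

print(differing_centers(chemspider, rsc))  # prints [20]
```

A check like this only says where two records disagree. It cannot say which of the two is right.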

This is what curation is all about. The question now is which one is correct? Is it RSC? Is it the structure on ChemSpider? Can anyone validate? Did I mess something up in the comparison? (It happens!)

Now, for the LONG list of Ginkgolide B structures on ChemSpider shown here, what do users think we should do? If we simply remove ALL labels for the incorrect structures then we will remove all links into other databases that contain information for “their Ginkgolide B”. If we collapse all links into the correct Ginkgolide B on ChemSpider, and bring 6 records into one, then the correct structure will link to structures that are actually incorrect in the linked databases (even though useful information exists there).

Quite the conundrum. I’d appreciate feedback!

9 Responses to “How Big is the Challenge of Curation and What is the Structure of Ginkgolide B”

  1. Chris Rusbridge says:

    No comment on which is the correct structure, but some thoughts on the approach. Wikipedia has a reasonable model here, with a clear distinction between “latest” and any specific previous version in the history. So the curation process might be:

    a) decide which is the “correct” structure, and mark it as “curated, current”

    b) find all the “incorrect” structures, and mark them as “curated: deprecated (or wrong!)”. with a pointer to the correct structure.

    Now all your links will work, and those who follow them should eventually find the correct info. Would that work?
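A minimal sketch of the "mark as current, mark as deprecated with a pointer" process described in this comment might look like the following. The record class, the status names and the record identifiers are all hypothetical and chosen only for illustration. Nothing here reflects how ChemSpider actually stores its records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StructureRecord:
    record_id: str
    status: str = "uncurated"            # "uncurated", "current" or "deprecated"
    superseded_by: Optional[str] = None  # pointer to the correct record, if any


def curate(records: dict[str, StructureRecord], correct_id: str,
           incorrect_ids: list[str]) -> None:
    """Mark one record as current and point every incorrect record at it."""
    records[correct_id].status = "current"
    records[correct_id].superseded_by = None
    for rid in incorrect_ids:
        records[rid].status = "deprecated"
        records[rid].superseded_by = correct_id


def resolve(records: dict[str, StructureRecord], record_id: str) -> StructureRecord:
    """Follow the pointers from any record until a non-deprecated one is reached."""
    seen = set()
    record = records[record_id]
    while record.superseded_by is not None:
        if record.record_id in seen:
            raise ValueError("pointer cycle detected")
        seen.add(record.record_id)
        record = records[record.superseded_by]
    return record


# Six hypothetical records sharing the name "Ginkgolide B".
records = {f"rec{i}": StructureRecord(f"rec{i}") for i in range(1, 7)}
curate(records, "rec1", ["rec2", "rec3", "rec4", "rec5", "rec6"])

print(resolve(records, "rec4").record_id)  # prints rec1
```

The deprecated records stay in place, so existing links into them still work, and anyone following a link can be sent on to the current record.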

  2. Joerg Kurt Wegner says:

    1. Finding all structures? The first block of the InChIKey should do the trick, since it covers topology without stereo information
    http://www.chemspider.com/Search.aspx?q=SQOJOAFXDQDRGF

    2. @Chris: I love this approach ! Keeps all information and points to the good one.
    @Tony: It should be made very clear on the page that the whole page is deprecated, not just that a few entries need curation.
    Besides, this is one of those examples where the ChemSpider-InChIKey (CIK) might get enriched with a ‘curated’ tag, and all the others tagged too, e.g.
    SQOJOAFXDQDRGF-UHFFFAOYAY-C(urrent)
    SQOJOAFXDQDRGF-TWOSMEHPBO-D(eprecated)
    SQOJOAFXDQDRGF-KRNPTNFHBW-D(eprecated)
    and so on…
    Further expert/user tags might be possible, e.g. W(rong), N(eeds help/discussion)

    Especially the N(eeds help/discussion) is a great Wikipedia feature, because people can look at a special page listing which structures need urgent discussion/curation.
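The "first block" idea from point 1 of this comment can be sketched in a few lines: split each InChIKey at its first hyphen and group on what comes before it. The function names are hypothetical, and the three keys are the illustrative ones quoted in the comment above, so their second blocks should not be read as real registry values.

```python
from collections import defaultdict


def skeleton_block(inchikey: str) -> str:
    """Return the part of an InChIKey before the first hyphen."""
    return inchikey.split("-", 1)[0]


def group_by_skeleton(inchikeys: list[str]) -> dict[str, list[str]]:
    """Group InChIKeys that share the same first block."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[skeleton_block(key)].append(key)
    return dict(groups)


# Illustrative keys taken from the comment, not verified database entries.
keys = [
    "SQOJOAFXDQDRGF-UHFFFAOYAY",
    "SQOJOAFXDQDRGF-TWOSMEHPBO",
    "SQOJOAFXDQDRGF-KRNPTNFHBW",
]

for block, members in group_by_skeleton(keys).items():
    print(block, len(members))  # prints SQOJOAFXDQDRGF 3
```

Records that land in the same group share a skeleton and differ only in what the later part of the key encodes, which is where the stereo variants show up.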

  3. Joerg Kurt Wegner says:

    Oh, forgot the “notify on followup comments”, and can you turn this on by default ;-)

  4. Soaring Bear says:

    The various ginkgolides ought to be considered as a group rather than just the B variant. Add the bilobalides along with the ginkgolides since they share the biosynthetic path. These compounds are prominent enough that there’s probably been enough labs evaluating the NMR spectra, and maybe crystal structures, that there is likely to be some consensus on the “true” stereochemistry. Contacting such people directly may be the best way to evaluate what is true on this.

    More broadly, we all keep in the back of our minds the occasional corrections of structures which keep us humble in being too sure of what the true structure of anything is, especially compounds that are less studied than the ginkgolides.

  5. Antony Williams says:

    Chris: You might be aware I’ve been working on Wikipedia (http://www.chemconnector.com/chemunicating/dedicating-christmas-time-to-the-cause-of-curating-wikipedia.html)
    I’ve been wondering about the wiki type approach. We hope to get our hands on Microsoft’s Sharepoint shortly to add wiki capabilities… http://www.chemspider.com/blog/the-chemspider-team-chooses-our-future-platform-for-collaboration-microsoft-sharepoint.html

    Thanks for the very constructive feedback Chris…it is very much aligned with what we’ve been thinking and it’s good to get it independently validated.

  6. Antony Williams says:

    Joerg..you hit the nail on the head…so for Taxol:
    http://www.chemspider.com/Search.aspx?q=RCINICONZNJXQF

    FORTY-TWO structures!

    The extension of the InChI sounds interesting but I’m not sure how well that would go down with the InChI team etc. But let’s see if anyone responds…it might fit well into the InChIKey resolver.

  7. Antony Williams says:

    Joerg…we CAN turn on “Notify Me” by default but I need more than one vote for that!

  8. Antony Williams says:

    Bear..I agree that the ginkgolides are a group. The challenge is that the name Ginkgolide B is associated with 6 structures. I cannot prove it, but I suspect this situation is not one of a “changing structure” over time but rather an issue of drawing the structure correctly. Again..I cannot validate this perception. I’ve done a search on ChemRefer now to find some structures: http://www.chemrefer.com/search.cgi?zoom_query=ginkgolide+B

    I need to now sketch these representations in and validate. Thanks for your comments!

  9. David Barden says:

    Antony – I am an organic chemist working on the RSC journal in which the published structure of ginkgolide B appeared, and am pretty sure that it is correct, having been written by a regular author of ours familiar with the literature on the ginkgolides. I think the problem might lie with the representation (and/or conversion to InChI) of the structures – even in the one structure you indicated as having “full stereochemistry”, it seemed to me that 3 stereocentres were undefined, from a visual inspection of the structure. Apart from these stereocentres, the structure and InChI (generated myself) otherwise seem identical, so I’m not sure why the last part of the string in the ChemSpider entry is “20+” rather than “20-”. The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.
