Copyright©2008 Antony Williams
Tonight I finished an article on Public Chemistry Databases. In that article I commented on the size of the Public Chemistry Databases versus the commercial databases. There have been numerous discussions in the blogosphere about the size of databases such as PubChem relative to the CAS Registry. Recently PubChem and ChemSpider each headed towards 20 million structures. The CAS Registry holds about 33 million.
Now, I don’t know how much duplication there is in the Registry, but I can comment on what is in ChemSpider and likely in PubChem. Here’s a basic comment about molecules with complex stereochemistry. They tend to exist MULTIPLE times in the database due to different variants of stereochemistry. Let’s examine Ginkgolide B. The structure below is taken from a recent RSC article. I was interested to see whether we had the “correct” structure of Ginkgolide B on ChemSpider, assuming that the correct structure is the one shown on the RSC webpage.
A search on the name Ginkgolide B turned up a total of 6 structures. The connectivities are the same for all structures. The ONLY difference is in the stereochemistry. Take a look at the structures in Table View. There is one structure with full stereochemistry expressed. This one comes from PubChem, Thomson Pharma and xPharm. With full stereochemistry it might be safe to assume it is correct.
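The claim that all hits share one connectivity and differ only in stereochemistry can be checked mechanically from identifiers. The sketch below assumes each record carries an InChIKey whose first 14-character block hashes the connectivity, with stereochemistry reflected only in the second block. The function name and the keys are hypothetical placeholders, not real ChemSpider records.

```python
from collections import defaultdict


def group_by_skeleton(inchikeys):
    """Group InChIKeys by their first block (the connectivity hash).

    Assumes the usual 'XXXXXXXXXXXXXX-YYYYYYYYYY-Z' layout, where keys
    sharing the first block describe the same skeleton.
    """
    groups = defaultdict(list)
    for key in inchikeys:
        skeleton = key.split("-", 1)[0]
        groups[skeleton].append(key)
    return dict(groups)


# Placeholder keys for illustration only; these are NOT real identifiers.
example_keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBSA-N",
    "AAAAAAAAAAAAAA-CCCCCCCCSA-N",
    "ZZZZZZZZZZZZZZ-BBBBBBBBSA-N",
]

for skeleton, keys in group_by_skeleton(example_keys).items():
    print(skeleton, len(keys))
```

A skeleton with more than one key under it is a candidate set of stereo variants that a curator would need to review.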
I actually gave up looking eventually…here are the different complete stereochemistries. Look carefully…
Question for ChemSpider Users – there are actually WAY MORE than 10 Taxol skeletons on ChemSpider. Can anyone figure out how many? It actually takes one search to find them all!
We believe this is the correct structure of Taxol.
Back to Ginkgolide B. I redrew the structure shown in the RSC article (and as shown below).
Generating the InChIKey for this structure and performing a search on ChemSpider gave me no hits. It looks like either the RSC structure is wrong OR all six structures from all of the different sources are wrong. As mentioned, there is actually only one Ginkgolide B structure (a structure with the associated identifier) on ChemSpider with full stereochemistry. The stereo for that structure is:
t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20- RSC Stereo
There is ONE stereocenter difference.
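A comparison like this is easy to get wrong by eye, so here is a minimal sketch of doing it in code. It assumes two single-component InChI stereo layers in the "t6-,7+,..." form quoted above and compares only the per-atom signs, ignoring the /m and /s layers that a full comparison would also need. The function names are my own, and the second layer is a made-up variant used purely for demonstration, not the actual layer from either source.

```python
import re


def parse_t_layer(layer):
    """Parse a 't6-,7+,...' stereo layer into {atom_number: sign}."""
    text = layer.strip().lstrip("/")
    if text.startswith("t"):
        text = text[1:]
    centers = {}
    for token in text.split(","):
        match = re.fullmatch(r"(\d+)([+\-?])", token.strip())
        if match is None:
            raise ValueError(f"unrecognised stereo token: {token!r}")
        centers[int(match.group(1))] = match.group(2)
    return centers


def diff_centers(layer_a, layer_b):
    """Return {atom: (sign_a, sign_b)} for centers that disagree."""
    a, b = parse_t_layer(layer_a), parse_t_layer(layer_b)
    return {
        atom: (a.get(atom), b.get(atom))
        for atom in sorted(set(a) | set(b))
        if a.get(atom) != b.get(atom)
    }


# The layer quoted in the post.
quoted = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-"
# Hypothetical variant with center 10 flipped, for illustration only.
variant = "t6-,7+,8-,9+,10-,11+,15+,17+,18+,19-,20-"

print(diff_centers(quoted, variant))
```

A center present in only one layer shows up with None on the other side, which flags partially specified stereochemistry as well as outright disagreements.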
This is what curation is all about. The question now is which one is correct? Is it RSC? Is it the structure on ChemSpider? Can anyone validate? Did I mess something up in the comparison (it happens!)?
Now, for the LONG list of Ginkgolide B structures on ChemSpider shown here, what do users think we should do? If we simply remove ALL labels from the incorrect structures then we will remove all links into other databases that contain information for “their Ginkgolide B”. If we collapse all links into the correct Ginkgolide B on ChemSpider, and bring 6 records into one, then the correct structure will link to records in other databases whose structures are actually incorrect (but where useful information exists).
Quite the conundrum. I’d appreciate feedback!