STOP COUNTING the Number of Chemical Entities in Public Compound Databases and There are Ghosts in the Closet
Posted by: Antony Williams in Quality and Content
Let’s start off where I intend to finish. Bigger does not necessarily mean better. A large database of unique chemical entities does not necessarily mean a good database and accurate chemical representations of chemical entities can be pretty hard to find.
Few people realize how these simple statements are impacting the quality of what’s available online for chemists to use and how curation of data must occur in order to improve what’s available.
Now…what’s the basis for me to initiate this discussion and WHY would I prefer that ChemSpider was actually a smaller database?
Today on CHMINF Steve Heller posted the following review:
“PubChem is a search tool for chemical information, divided into three areas: Compounds, Substances, and BioAssays. Full entries provide detailed information with the most basic information – a general description, the molecular weight and formula, the structure, plus a Table of Contents (ToC) for the full entry – all easily found above the fold. Use the ToC or scroll down to retrieve more advanced information, such as bioactivity results, synonyms, chemical actions, detailed properties, and more. Each module is fully interlinked with the other sections of PubChem as well as resources in ToxNet and PubMed, providing full access to toxicology resources and the medical literature, and allowing users access to as much or as little of the chemical information as they need.
Author/Publisher: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Date reviewed: February 16, 2009
PS. PubChem now has 37,326,949 DIFFERENT structures”.
Bob Buntrock made the following statement: “Re the PS below, I find it difficult to believe that PubChem has 37.3 million “different” compounds. The figures from the CAS website show 48 million organic and inorganic compounds which excludes sequences but includes polymers, alloys, coordination compounds, minerals, and mixtures. Since PubChem aims to cover “small molecules”, it would seem that many compounds in these last 5 categories would not be present. Therefore, I assume that a significant number of the 37.3 million PubChem compounds are redundant.” All hell broke loose, with lots of posts discussing the uniqueness of chemical entities and the fact that PubChem compounds WERE unique. Okay, I’m not going to argue this for the moment, but I am going to agree with Bob that a significant number of the compounds are likely redundant. It is ALSO true of ChemSpider. Why?
I could write a multipage post, and I have already discussed this issue many times on this blog, but I am clearly failing to communicate it. I refer you to previous posts about Taxol (1,2,3), Vancomycin (4) and Ginkgolide B (5,6). I suggest you read these earlier posts, but I will try to explain again anyway.
Some general statements. Many complex chemical compounds, especially natural products, have timelines. When a compound is first elucidated, only its connectivity may be determined and reported. Then stereochemistry might be layered on later, and reported. Then stereochemistry might be adjusted, and reported. Through this whole timeline the compound might be referred to by a particular chemical name… let’s call it Afonwenium. So, based on the timeline for this molecule, there can be anywhere from one to four “versions” of the structure by that name. They are all unique chemical entities, but the “final structure” is the one that people will want. It’s the one that should be represented on Wikipedia, the one that should be drawn correctly in all publications following the final elucidation report and assertion of structure, and the one that should be found in the many “reference” databases such as KEGG, DrugBank etc.
Search Taxol on ChemSpider and Taxol on PubChem and compare the number of structures you get. I judge that there are MANY unique chemical entities on PubChem that are MEANT to be Taxol but are not. And I don’t mean the ones that are named as “Taxol derivative”, I mean the ones that may have the SAME molecular weight, formula and connectivity but have DIFFERENT stereo – no stereo, MULTIPLE partial stereo and MULTIPLE full stereo. These issues exist for compounds like Ginkgolide B and Vancomycin and many more structures. There is of course only one Taxol, a compound registered by Bristol Myers Squibb and asserted to have a specific constitution.
Just out of interest, let’s see how many compounds are on ChemSpider with a specific skeleton (ignoring stereo).
There are 54 compounds with the skeleton of Taxol: http://www.chemspider.com/InChIKey/RCINICONZNJXQF. These are all UNIQUE chemical entities, but some are C-11 and C-14 labeled, Deuterium and Tritium labeled and so on. Even so, over 30 of the compounds with the Taxol skeleton have no isotopically labeled sites. Maybe some of these are meant to be Taxol with different stereochemistry, but I judge that MOST of these are meant to be Taxol and are labeled as such, yet differ in terms of no, partial and full stereo at least. This is ONE example. To Bob’s question… is this redundancy? I say yes. How does this get solved? Curation will do it, but it’s expensive and time consuming, and the only way forward in my judgment is to crowdsource it. This problem is not going away anytime soon in PubChem or ChemSpider. We HAVE curated the name associations and removed the name Taxol from all structures sharing the skeleton that are not the asserted form of Taxol. But the structures do remain on the database and link back to the original sources. We will be working on ways to show on every search that there are associated skeletons, compounds related by isotopic labeling and the status of no, partial and full stereochemistry. All to come…
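The skeleton search above relies on the structure of the InChIKey: the first 14-character block is a hash of the connectivity, while stereochemistry and isotopic labeling are hashed into the second block. As a minimal sketch of how records sharing a skeleton could be grouped, assuming you already have a list of Standard InChIKeys in hand (the function names and the two placeholder keys below are hypothetical, not ChemSpider code or data):

```python
from collections import defaultdict

def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    key = inchikey.strip()
    # Some sources prefix the key with "InChIKey="; drop it if present.
    if key.startswith("InChIKey="):
        key = key[len("InChIKey="):]
    return key.split("-")[0]

def group_by_skeleton(inchikeys):
    """Map each connectivity block to the full keys that share it."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[skeleton_block(key)].append(key)
    return dict(groups)

records = [
    "RCINICONZNJXQF-AAAAAAAASA-N",  # hypothetical second block (placeholder)
    "RCINICONZNJXQF-BBBBBBBBSA-N",  # hypothetical second block (placeholder)
    "RCINICONZNJXQF-UHFFFAOYSA-N",  # Taxol skeleton with no stereo layer
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N",  # ethanol
]

for skeleton, keys in group_by_skeleton(records).items():
    print(skeleton, len(keys))
```

Grouping like this only tells you that the records share connectivity. Deciding which of them are isotopically labeled, which are deliberate stereoisomers and which are simply incomplete depictions of the same compound still needs curation.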
The ongoing “Bigger is Better” argument for Public Compound Databases is irrelevant at this point in my opinion. We can add 50 million new compounds with a simple enumeration exercise, but would it bring any value? I say no. We can add virtual libraries from a number of our collaborators, but I judge it to be of very limited value. The value of the Public Compound Databases is in what they connect to and whether there is an answer to a question at the end of the chain. If I search on a chemical and find it on ChemSpider, but there is no vendor for it, no analytical data, no properties of value, no manuscripts, no patents linked etc., then I have found a hit but derived no value. We are working on increasing the VALUE of our content: linking compounds to rich data sources, layering on additional properties, links to papers, blog entries and discussions and so on. If the result of a search is a hit but with no value, who cares? If the result of a search is a hit but with links to the wrong information, that’s worse. If I ask the question “What is Taxol” and get one hit, I need it to be right. If I ask the question and get tens of hits, now what?
Curation has been underway for 2 years. We’re not finished. It’s a massive task. In reality it will NEVER be finished – new chemistry comes in every day and more information gets associated. We don’t have answers to all of the issues that exist around these diverse datasets, but we are not naive: we understand that our database is polluted with issues inherited from many other sources. We have marked tens of thousands of structures for deprecation. We have likely added information into PubChem that has contributed to the issue of data quality. But we are working on it.
Meanwhile errors that exist in PubChem are proliferating. A simple example is that of methane in PubChem that I have blogged about many times…one example here. Here are some of the names associated with the structure of methane on PubChem: 1,3-DICHLORO-PROPAN-2-ONE, diamond, charcoal and many tens of other incorrect names.
The National Cancer Institute’s Chemical Structure Lookup Service has over 46 million unique chemical entities, and they offer a series of services to search by InChI, name and many other queries. A posting to CHMINF outlined the service:
“Chemical Identifier Resolver (beta):
This service is a resolver for different chemical structure representations and identifiers, including those that do not carry any information about the structure itself. For instance, it can work as a Standard InChIKey Resolver, an NCI/CADD Identifier Resolver or a Chemical Name Resolver. The service also allows one to convert a given structure identifier into another representation or structure identifier.
Representations/identifiers supported are: Standard InChI/InChIKey, NCI/CADD Identifiers (FICuS, FICTS, uuuuu), SMILES, SDF, names, and a few other types of IDs. See the web page for more information.
For those identifiers that require lookup, the underlying database currently contains about 67 million unique structure records, from which the respective Standard InChIKeys and NCI/CADD Identifiers have been calculated. For lookup by chemical names, 68 million names associated with 16 million unique structure records are currently available in the database. The database continues to grow.
Closely related are the new capabilities of resolving/converting chemical structure identifiers by simply using a URL adhering to the following scheme: http://cactus.nci.nih.gov/chemical/structure/"structure identifier"/"representation"[/xml]
We just list a few examples here that should give you an idea of what’s possible with this service. For more detailed explanations, see the above web page.
Example: Standard InChI for chemical name string “aspirin”: http://cactus.nci.nih.gov/chemical/structure/aspirin/stdinchi
Example: Standard InChIKey of “ethanol” specified as SMILES string “CCO”: http://cactus.nci.nih.gov/chemical/structure/CCO/stdinchikey
Example: Unique SMILES string of chemical name string “benzene”: http://cactus.nci.nih.gov/chemical/structure/benzene/smiles
Example: SD File for chemical name string “morphine”: http://cactus.nci.nih.gov/chemical/structure/morphine/sdf
Example: Chemical names for Standard InChIKey “InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N” (Standard InChIKey of “ethanol”): http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/names
Example: Synonyms for chemical name string “aspirin”: http://cactus.nci.nih.gov/chemical/structure/aspirin/names”
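The URL scheme in the posting is regular enough to build programmatically. The following is a rough sketch that only assembles the URLs and makes no network call. The function name `resolver_url` is hypothetical, and percent-encoding the identifier is my assumption about what the service will accept for identifiers containing special characters (as many SMILES strings do); the posting itself only shows plain identifiers.

```python
from urllib.parse import quote

# Base path taken from the examples in the posting above.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"

def resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Build a resolver URL of the form BASE/identifier/representation[/xml]."""
    # Keep "=" readable so "InChIKey=..." matches the posted example;
    # everything else outside the unreserved set is percent-encoded (assumption).
    encoded = quote(identifier, safe="=")
    url = f"{BASE_URL}/{encoded}/{representation}"
    return url + "/xml" if xml else url

print(resolver_url("aspirin", "stdinchi"))
print(resolver_url("CCO", "stdinchikey"))
print(resolver_url("InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "names"))
print(resolver_url("methane", "names", xml=True))
```

The first three calls reproduce the corresponding example URLs from the posting, and the last one shows the optional `/xml` suffix from the scheme.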
Unfortunately polluted names are finding their way across all of these databases which is why a lookup on methane gives us: http://cactus.nci.nih.gov/chemical/structure/methane/names including in the list:
1-Chlorobenzylethyl-3,5,7,9,11,13,15-heptaisobutylpentacyclo[9.5.1.1(3,9).1(5,15).1(7,13)]octasiloxane, mixture of isomers
The CAS database is highly curated, though not without errors, and is built up using robots and eyes. Public Compound Databases are built with the best intent and are useful. But they are not curated and are polluted. Bigger does NOT mean better and care is warranted. ChemSpider will likely stay smaller than many of the other Public Compound Databases moving forward as we remain focused on adding value and addressing the issues of inherited and future quality. It’s a long journey…