Archive for March 9th, 2008

The recent post regarding CAS numbers and Wikipedia has stirred up some great conversation and responses, and I point you to the comments to peruse. For now I want to comment on one made by Cameron Neylon on his blog. I point you to his post to read first rather than lifting it from him and posting it here; that is more respectful of his work. Also, you may choose to add him to your Google Reader. Cameron is a great advocate of Open Notebook Science and I encourage you to visit his site.

OK…Did you read it???

OK, I am now lifting certain comments and wish to state my own views.

Cameron said “So we would ideally want to expose InChi, InChiKey, SMILES, CML perhaps, PubChem Ids etc. These can all be converted one to the other using web services so we don’t need to type all of them in manually.”

At ChemSpider we likely have more experience than most in interconverting millions of structures from/to various other formats. They all have their own limitations.

InChI, in both of its formats, is limited in a number of ways. The limitations are acknowledged and being worked on. They include, for instance, polymers, inorganics, organometallics, mixtures of specific stoichiometries and Markush structures (not an issue for Cameron’s material in a bottle).

SMILES comes in so many different flavors that it can be very distressing. Even the most popular cheminformatics vendors can be incompatible. Believe me, we’ve seen it in many ways!

CML: we haven’t worked with it yet on ChemSpider but remain interested. However, uptake seems to be very limited after maybe a decade of being available to the domain. InChI took off and proliferated at an incredible rate, while CML has been around a lot longer and, as far as I know from 10 years in the cheminformatics business, has low adoption. That doesn’t mean it’s not the solution, but if it is then it needs to be adopted by the masses.

PubChem IDs: there are substance IDs (SIDs) and compound IDs (CIDs), so a decision will be necessary there. More and more I am seeing PubChem IDs listed anyway; they are in the Aldrich catalog, for example. The work will come with the curation of the data - making sure that people can find the “appropriate ID” for a compound. Check out my earlier posts about the need for curation (1,2,3,4 and many others). CAS is very highly curated and is the authority for the CAS numbers. PubChem is, of course, the authority for its IDs too, but compounds can be deprecated from time to time as depositors find their own errors, and there have been so many depositors with different quality standards to date that cleaning up the database is a major challenge. While PubChem could do it, it is not their mandate today, they are not funded to do so, and it would be an enormous undertaking that would likely need to involve some form of crowdsourcing via online curation, as we are doing here at ChemSpider.

Cameron said “The CAS number so appealing; it is short, easily typed in, and printed on most bottles.”

‘Tis so. And asking the vendors to move away from it won’t work. Adding a PubChem ID might work, but that’s a big shift too and I believe they would need guarantees about the long-term future of PubChem, its database and its funding to buy into that. It would also need a MASSIVE validation exercise. If the companies ended up depositing their OWN compounds to get PubChem IDs I believe all hell would break loose…of course, that is already going on whenever they deposit. Their compounds go through internal processes at PubChem and come out the other side as deposited structures. Is every one that went in deposited exactly as it was supplied? In theory, yes. In reality? We have the same issues at ChemSpider…not easy. By the way, the CAS number, with its check digit and specific format, is much nicer than “just a number”. Maybe we should do the same with ChemSpider IDs…new format, plus a check digit?
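To make the check digit point concrete, here is a minimal Python sketch of how a CAS number can be validated. It assumes the commonly documented rule: the number has the form of two to seven digits, two digits and one check digit, and the check digit equals the weighted sum of the preceding digits (weights 1, 2, 3, … counted from the right) modulo 10. The function name `is_valid_cas` is my own, chosen for illustration, and passing the check says nothing about whether the number is actually assigned to a substance.

```python
import re

# CAS layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)", re.ASCII)


def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` has the CAS layout and a matching check digit."""
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(w * int(d) for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit


print(is_valid_cas("33069-62-4"))  # Taxol number quoted in this post: True
print(is_valid_cas("33069-62-5"))  # same number, wrong check digit: False
```

A check like this catches many single-digit typing errors, which is exactly what a plain sequential ID cannot do.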

Cameron said “I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers”

Hmmm…really? Check out this link. Compare the snippet here from the Taxol Drugbox on Wikipedia

drugboxtaxol.png

with the taxol record here on PubChem and the synonyms list. You are looking for 33069-62-4.

taxolcas.png

When you find it, click on it and you will find 6 structures for Taxol. The issue is curation. Which structure is right?

Why does the story that there are no CAS numbers on PubChem continue to proliferate??? Sure, they are not called CAS Numbers but that’s what they are. Depositors simply put them in as identifiers and PubChem don’t have to remove them. They respect the depositors’ right to deposit their identifiers, whether they are CAS numbers or not.
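As a rough illustration of how such deposited identifiers can be spotted in a synonym list, the Python sketch below filters a list of strings by the CAS layout alone. The helper name `cas_like_synonyms` and the sample list are hypothetical and not taken from any real record. A format match only says that an entry looks like a CAS number, not that it is correct for the structure it is attached to.

```python
import re

# Format only: 2-7 digits, 2 digits, 1 digit, separated by hyphens.
CAS_LIKE = re.compile(r"\d{2,7}-\d{2}-\d", re.ASCII)


def cas_like_synonyms(synonyms):
    """Return the entries that have the CAS number layout (format check only)."""
    return [s for s in synonyms if CAS_LIKE.fullmatch(s.strip())]


# Illustrative list, not copied from any real database record.
sample = ["Taxol", "Paclitaxel", "33069-62-4", "2008-03-09"]
print(cas_like_synonyms(sample))  # ['33069-62-4']
```

Deciding which of the structures carrying that identifier is the right one is still the curation problem; the pattern match only narrows down where to look.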

Now, I agree robotic curation can help with these issues and the RDF approach already being discussed for Wikipedia and ChemSpider (with Egon) can be useful in helping to link together resources and, if adopted by companies such as Aldrich etc, can be of great value in helping to clean up some of the issues. But, it is only part of the solution. The need for manual curation is being missed. Robots are already making a bigger mess in my opinion. Manual curation is a must.

Cameron said “The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free then it would be driving traffic to its own value added services.”

I have made this statement to many people over the years. The Registry of chemical structures has little value as a collection of structures. It’s what’s connected that has value: the CAS Numbers, the patents, the literature articles, the vendors. It’s the same with PubChem and ChemSpider. Who cares how many structures we have? We can generate HUNDREDS OF MILLIONS for you! It’s the associated information. I have always thought that CAS should provide an internet service that is simply a CAS lookup. Search a number and see the structure and/or substance detail. End of story. You want patent details, papers, vendor details…you pay. It’s a transaction…same as STN now. In fact, do a study, see how many searches are done just to relate CAS numbers and structures, figure out the loss in revenue if you “give it away” and shift it over to the transaction charge to look at “more info”. Certainly the giveaway would help with the public relations! Oh, and you “could” make it a structure search to get a CAS number too…more work but of course possible.

Cameron said “…do people agree that CID is a good standard index to aggregate around?”

No…not yet. There is so much to be done before that can happen in my opinion. A lot.

What I’d rather do, and maybe I am a dreamer, is get into relationship with ACS/CAS and try to establish a tipping point of support where they see that it is good for their business and for the community as a whole. I prefer staying in relationship if possible. That said, the effort around CIFs from ACS via CrystalEye has not moved forward, as explained here, so maybe my vision will not work in the case of CAS numbers either. Either way, we made a decision not to scrape CrystalEye anyway, and this shows perfectly the problems SMILES and InChI have with structure representation!

I say let’s not abandon hope regarding CAS opening their numbers to the world just yet. This dialog is likely sparking discussions already. Let’s keep it out there and establish a groundswell of concern and support and hope that the right thing can happen for our good and for CAS. I have great respect for many of their people and their work and want the resolution to be appropriate for all parties. Let’s hope…and if hope doesn’t work then I encourage robotic and manual curation…the system is ready on ChemSpider. Come and help out!


Ever since ChemSpider went online we have been committed to allowing community curation and annotation. We have done this…in spades. We have introduced the ability to Post Comments for curator approval, and we have allowed the association of publications and the association of URLs. Data have been annotated with over 200 spectra, CIF files or images as described here and here.

Over 500 comments posted by users and administrators of ChemSpider have now been curated, fixed, acknowledged or rejected (use the filter at the top of the page to see the different types).

status.png.

Almost 200,000 identifiers have been curated (approved or removed). The necessary functionality to allow curation and annotation of the data has been delivered. We will now extend and enhance it specifically to handle batch depositions (already in beta testing).

Crowdsourced curation and annotation is going to be necessary to cleanse a lot of the material which has found its way out into the public domain and we have started.
