Archive for May 14th, 2008

American Chemical Society

I admit to not being fully knowledgeable in the details of CAS Numbers. If anyone has a short treatise regarding their history and breadth relative to generic/specific structures and “materials” I’d welcome getting pointed to it. That said, in the community in which I participate CAS Registry Numbers appear to be very confusing. One thing is for sure…the authority IS the Chemical Abstracts Service. They have the reference data collection of course.

In the public domain there is a “mess of data” and various parties are attempting to use it to full effect. It’s a problem. In a recent letter to C&E News (May 5, 2008, Volume 86, Number 18, pp. 4-7), Ms. Deanna Morrow Hall, from Stone Mountain, Georgia, commented on this confusion. I can’t paste the entire letter here because of copyright issues of course, but I will abstract it.

The most common problem is the confusion between the number for the generic formula of a compound (intended to be used for a chemical entity when its exact composition is unknown or variable) versus the number for a compound of specific known formula.

She gave propanol as an example:

Propanol (generic formula): 62309-51-7

1-propanol: 71-23-8

2-propanol: 67-63-0

First, a vendor (either in a product specification or in a material safety data sheet) uses 62309-51-7 as the registry number for one of the specific configurations. If a buyer uses the correct specific registry number to search for suppliers of one of the specific configurations, then he will not find that vendor.

Second, a vendor uses the correct specific registry number for one of the specific configurations. If a buyer uses 62309-51-7 to search for suppliers of one of the specific configurations, then he will not find that vendor.

Third, a vendor correctly uses 62309-51-7 to describe a mixture of the two specific configurations, but the buyer thinks he’s ordering one of the pure compositions.

Her letter concludes with:

“Given that these errors occur with greater frequency than one might anticipate and are not trivial in their consequences, it seems appropriate that ACS should initiate a study to quantify the extent of the problem and to identify solutions to it.”
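The three failure modes in the letter come down to exact-match lookups on the wrong number. Here is a minimal sketch in Python. The vendor names and the catalogue are hypothetical and only there for illustration; the three Registry Numbers are the ones quoted above.

```python
# Registry Numbers quoted in the letter.
GENERIC_PROPANOL = "62309-51-7"
ONE_PROPANOL = "71-23-8"
TWO_PROPANOL = "67-63-0"

# Hypothetical catalogue: vendor -> Registry Number printed on its spec sheet.
catalogue = {
    "Vendor A": GENERIC_PROPANOL,  # sells pure 1-propanol but lists the generic number
    "Vendor B": ONE_PROPANOL,      # sells pure 1-propanol and lists the specific number
    "Vendor C": GENERIC_PROPANOL,  # sells a mixture and correctly lists the generic number
}


def find_suppliers(catalogue, registry_number):
    """Return the vendors whose listed number matches exactly, in sorted order."""
    return sorted(
        vendor for vendor, listed in catalogue.items() if listed == registry_number
    )


# Case 1: a buyer searching on the specific number never sees Vendor A.
specific_hits = find_suppliers(catalogue, ONE_PROPANOL)

# Cases 2 and 3: a buyer searching on the generic number misses Vendor B
# and cannot tell Vendor A's pure product from Vendor C's mixture.
generic_hits = find_suppliers(catalogue, GENERIC_PROPANOL)

print(specific_hits)
print(generic_hits)
```

Nothing in the lookup itself is wrong. The number simply does not carry enough information for either party to notice the mismatch.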

Access to Registry Numbers and just the related structure/material would be a great service to chemists. It would likely have an enormous impact on the ACS/CAS bottom line though. This is understandable. But what about the bottom line of communication between chemists? Ms. Hall’s examples are definitely real.

In the Wikipedia curation project outlined on this blog we have run into issues with validating CAS Numbers. Fortunately CAS have offered to help. The project is now rolling again after a hiatus and we are presently preparing 500 structures to upload…hopefully more. We definitely found errors and the validation process will be possible only with their help. What do we do moving forward though?
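One check that anyone can run without CAS is the check digit built into the Registry Number format: the final digit is the weighted sum of the preceding digits (weights 1, 2, 3… counted from the right), modulo 10. The sketch below is only a format check; the function name is my own, and passing it says nothing about whether the number belongs to the structure it is attached to. That part still needs CAS.

```python
import re

# Registry Number layout: 2 to 7 digits, 2 digits, 1 check digit.
CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def cas_check_digit_ok(registry_number: str) -> bool:
    """Return True if the string is well formed and its check digit matches."""
    # Some sources typeset the separators as en-dashes; normalise them first.
    normalised = registry_number.strip().replace("\u2013", "-")
    match = CAS_PATTERN.fullmatch(normalised)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3... starting from the rightmost body digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == int(match.group(3))


print(cas_check_digit_ok("50-78-2"))   # aspirin's primary number
print(cas_check_digit_ok("50-78-3"))   # same number with a mistyped last digit
```

This catches transcription slips such as a mistyped or dropped digit. A perfectly well-formed number attached to the wrong compound passes just as easily, which is why human curation is still required.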

In a recent post Peter Murray-Rust discussed creation of semantic chemical information. I have a lot of comments to make on that post but they must wait. I’ll focus on the CAS Numbers for now.

Peter wrote: “A good example is Wikipedia. (…..) relies on the “wisdom of crowds”, but I think it works well in chemistry. Chemspider has harnessed the wisdom of crowds but I suspect that only a very small fraction of their entries have been human-curated and I give an example below which seems to need attention.”

The reality is that about 10X the number of chemicals on Wikipedia have been human-curated on ChemSpider…I estimate about 50,000. What does curation mean in this case? It means validating the consistency between the structure displayed and the numerous identifiers allocated to that structure. We cannot validate predicted values of course. 50,000 human-curated records is significant.

Peter went on to discuss identifiers: “Identifiers. Potentially identifiers are the easiest and most powerful tool. An identifier is a unique string associated by an authority with a substance (not necessarily pure). If an authority(X) asserts that substance A(X) and substance B(X) have the same identifier then they can be said to be equivalent. There are many authorities making such assertions. Ultimately it is only the authority(X) who can make assertions about its identifiers. To be widely useful the authority should provide a lookup (resolution) service which is both human- and machine-accessible. In practice many authorities don’t do this or provide only a toll-access service. The identifiers are also often copyright and may or may not be copied. This often leads to other authorities(Y) who copy identifiers without permission and make their own assertions which may or may not be compatible with the authority(X). Frequently also the source of the identifier is not given. Thus many people who submit information to Pubchem give identifiers and these are listed as “[RN]” = registry number. For aspirin for example, there seem to be many identifiers - in the Chemspider entry all the following link through to Pubchem, e.g. 2349-94-2[RN], 26914-13-6[RN], 98201-60-6[RN]”

When Peter commented “I give an example below which seems to need attention” I think he was pointing to the fact that aspirin has many Registry Numbers: “there seem to be many identifiers - in the Chemspider entry all the following link through to Pubchem, e.g. 2349-94-2[RN], 26914-13-6[RN], 98201-60-6[RN]”. Maybe that wasn’t the issue. Either way it’s a great foundation to examine CAS Numbers.

Are three RNs on ChemSpider appropriate? Well, we already know that MULTIPLE RNs are okay based on Ms. Morrow Hall’s comments. Is ChemSpider on target with these three?

Landolt-Bornstein’s Property Index is very well known. They have aspirin here. They list the following CAS Numbers: 50-78-2, 2349-94-2, 11126-35-5, 11126-37-7, 26914-13-6, 98201-60-6

An online MSDS for Aspirin is here and lists the registry numbers: 50-78-2, 98201-60-6, 26914-13-6, 2349-94-2, 11126-35-5, 11126-37-7

The German Institute of Medical Documentation and Information lists Aspirin here with the following CAS Numbers: 50-78-2, 2349-94-2, 11126-35-5, 11126-37-7, 26914-13-6, 98201-60-6.

The RTECS database lists for Aspirin:

The Registry of Toxic Effects of Chemical Substances

Salicylic acid, acetate

CAS #: 50-78-2


ALT CAS #: 2349-94-2
ALT CAS #: 11126-35-5
ALT CAS #: 11126-37-7
ALT CAS #: 26914-13-6
ALT CAS #: 98201-60-6

For the MSDS Sheet and the German Institute the CAS Numbers are the same as Landolt-Bornstein…maybe they were sourced there?

Peter had listed only three RNs on ChemSpider: “2349-94-2[RN], 26914-13-6[RN], 98201-60-6[RN]”. Checking ChemSpider showed we actually had the following there: one validated RN, 50-78-2 (the one declared as the primary number on the other sites), and the following list (NONE of them validated):

11126-35-5[RN]
11126-37-7[RN]
2349-94-2[RN]
26914-13-6[RN]
337376-15-5[RN]
98201-60-6[RN]

ALL of these are valid based on the other data sources EXCEPT for 337376-15-5, a totally unrelated compound detailed here. This one has been deleted using the usual synonyms curation process and the others approved.
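The cross-check I did by eye can be sketched as a set comparison. The numbers are the ones listed above; the dictionary keys are just my labels for the four sources, and agreement between sources is only supporting evidence, not confirmation from CAS.

```python
# The six numbers that all four reference sources list for aspirin.
SIX_ASPIRIN_RNS = frozenset({
    "50-78-2", "2349-94-2", "11126-35-5",
    "11126-37-7", "26914-13-6", "98201-60-6",
})

# Labels are illustrative; each source listed the same six numbers.
reference_sources = {
    "Landolt-Bornstein": SIX_ASPIRIN_RNS,
    "Online MSDS": SIX_ASPIRIN_RNS,
    "German Institute": SIX_ASPIRIN_RNS,
    "RTECS": SIX_ASPIRIN_RNS,
}

# What ChemSpider held before curation: one validated plus six unvalidated.
chemspider_rns = {
    "50-78-2", "11126-35-5", "11126-37-7", "2349-94-2",
    "26914-13-6", "337376-15-5", "98201-60-6",
}

# Numbers backed by every reference source, and numbers backed by none.
backed_by_all = frozenset.intersection(*reference_sources.values())
backed_by_any = frozenset().union(*reference_sources.values())

approve = sorted(chemspider_rns & backed_by_all)
flag_for_review = sorted(chemspider_rns - backed_by_any)

print(approve)
print(flag_for_review)
```

Only 337376-15-5 ends up flagged, which matches what the manual check found. A number flagged this way still needs a human to look at it before deletion.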

PubChem also lists ALL six Registry Numbers as shown below. There are those who believe registry numbers are not on PubChem. Not true.

50-78-2
11126-35-5
11126-37-7
2349-94-2
26914-13-6
98201-60-6

So, ChemSpider, PubChem, MSDS sheets and many others have a consistent set of 6 registry numbers for aspirin. Are they correct? Only CAS could confirm. I believe this shows that multiple CAS Numbers are appropriate. What I cannot comment on is what each one stands for. This brings us back to Ms. Morrow Hall’s comments.

Moving forward how will we stop the proliferation of errors? How can we reduce the potential cost of mistakes made as a result of CAS Number miscommunications?


John Wilbanks opened his blog post regarding the Erosion of the Public Domain with the statement “This Chemspider licensing brouhaha is generating some needed discussions around open data, and something I keep hearing about is that it is GPL v. BSD all over again”. This relates to my recent blog post regarding our renewed focus on our agenda of Building a Community for Chemists.

I cannot do justice to John’s manner of delivering his message. He hits the nail on the head. I quote “The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of “freedom” we’re playing a dangerous game.”

To be clear, I never felt a need to put licenses on ChemSpider. People are using the content on ChemSpider, grabbing it, reusing it. We have provided web services to help people get more value out of the content. We will add more as time, resources and needs require it. The only reason we added licenses was pressure to do so. What was the pressure about? None of the USERS of the site ever put pressure on us and I don’t think they CARE about licensing. They just use the content as is and seem happy to do so.

That said, I am looking for an education. Nay, REQUESTING it from people in the domain. Deepak Singh posted a comment on John’s blog post. “I do think that there is a lot of confusion around the differentiation around content (Creative Commons) and data (which is different). The data commons needs a different set of rules, and starting with a clear understanding of what Public Domain means and why it is a good thing.”

So, what is data, what is content? Is a structure and a series of chemical identifiers “Data”? Is a list of safety and toxicity information “Data”? Are a series of links to blog posts and articles “Data”? Wikipedia is defined as content I believe. So, out of all of this discussion my question is whether ChemSpider is Content or Data. (Yes..I have my own views already!)


There have been some follow-on comments from the recent Nature Article written about ChemSpider.

I was happy to see this comment from Jose Barros:

“As enthusiast of “Internet-aided Chemistry” subject I wish to congratulate Nature for mentioning the Chemspider initiative. To the best of our knowledge, Chemspider represents a reliable alternative for those who were not able to access commercial databases, thus contributing for scientific inclusion mainly in the less developed countries. As Chemspider grows up, it may also be used by the scientific community as a bargain tool for obtain better services or lower prices from suppliers of commercial databases.”

and a follow-on post tonight from Barrie Walker relative to his comment in the article. He was originally quoted as follows: “There’s an awful lot of chemical information, but there’s an awful lot of rubbish as well, says Barrie Walker, a retired industrial chemist in Yorkshire, UK, who helps maintain the site.” Barrie added a follow-up comment: “comments used by Nature applied to chemistry on the internet rather than anything to do with ChemSpider. As one of ChemSpider’s master curaters, I am very supportive of the project, otherwise I would not be spending time editing the data.

I have known and worked with Tony for many years and I believe the project has a great future and with further development will see an increasing number of users.”

Comments with context have a whole different meaning. I wonder what the context was when Bob Massie from Chemical Abstracts Service compared the Golfing Industry with the Drug Industry? Likely that whole comment was taken out of context …
