The recent post regarding CAS numbers and Wikipedia has stirred up some great conversation and responses, and I point you to the comments to peruse. For now I want to comment on one response made by Cameron Neylon on his blog. I point you to his post to read first, rather than lifting it from him and posting it here; that is more respectful of his work. Also, you may choose to add him to your Google Reader. Cameron is a great advocate of Open Notebook Science and I encourage you to visit his site.

OK…Did you read it???

OK…I am now lifting certain comments and wish to state my own views.

Cameron said “So we would ideally want to expose InChi, InChiKey, SMILES, CML perhaps, PubChem Ids etc. These can all be converted one to the other using web services so we don’t need to type all of them in manually.”

At ChemSpider we likely have more experience than most in interconverting millions of structures from/to various other formats. They all have their own limitations. InChI, in both of its formats, is limited in a number of ways, for instance in handling polymers, inorganics, organometallics, mixtures of specific stoichiometries and Markush structures (not an issue for Cameron’s material in a bottle). These limitations are acknowledged and being worked on. SMILES comes in so many different flavors that it can be very distressing. Even the most popular cheminformatics vendors can be incompatible…believe me, we’ve seen it in many ways! CML…we haven’t worked with it yet on ChemSpider but remain interested. However, uptake seems to be very limited after maybe a decade of being available to the domain. InChI took off and proliferated at an incredible rate, while CML has been around a lot longer and, as far as I know from 10 years in the cheminformatics business, has low adoption. That doesn’t mean it’s not the solution, but if it is then it needs to be adopted by the masses. PubChem IDs…there are substance IDs (SIDs) and compound IDs (CIDs), so a decision will be necessary there. More and more I am seeing PubChem IDs listed anyway…they are in the Aldrich catalog, for example. The work will come with the curation of the data: making sure that people can find the “appropriate ID” for a compound. Check out my earlier posts about the need for curation (1,2,3,4 and many others). CAS is very highly curated and is the authority for CAS numbers. PubChem is, of course, the authority for its IDs too, but compounds can be deprecated from time to time as depositors find their own errors, and there have been so many depositors with different quality standards to date that cleaning up the database is a major challenge. While they could do it, it is not their mandate today, they are not funded to do so, and it would be an enormous undertaking that would likely need to involve some form of crowdsourcing via online curation, as we are doing here at ChemSpider.

Cameron said “The CAS number so appealing; it is short, easily typed in, and printed on most bottles.”

‘Tis so. And asking the vendors to move away from it won’t work. Adding a PubChem ID might work, but that’s a big shift too, and I believe they would need guarantees about the long-term future of PubChem, its database and its funding to buy into that. Also, a MASSIVE validation exercise. If the companies ended up depositing their OWN compounds to get PubChem IDs I believe all hell would break loose…it’s already going on, by the way, when they deposit. Their compounds go through internal processes at PubChem and come out the other side as deposited structures. Is every one that went in deposited exactly as it was supplied? In theory, yes. In reality? We have the same issues at ChemSpider…not easy. By the way, the CAS number, with its check digit and specific format, is much nicer than “just a number”. Maybe we should do the same with ChemSpider IDs…new format, plus a check digit?
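
For anyone who hasn’t looked at how that check digit works, here is a minimal sketch in Python. It assumes the published CAS Registry Number layout (two to seven digits, then two digits, then one check digit, separated by hyphens) and the usual checksum rule: weight the digits before the check digit 1, 2, 3… from right to left, add them up, and take the result modulo 10. The function names are made up for illustration and are not part of any CAS or ChemSpider software.

```python
import re

# Assumed layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"([0-9]{2,7})-([0-9]{2})-([0-9])")


def cas_check_digit(body_digits: str) -> int:
    """Return the check digit for the digits that precede it (hyphens removed).

    The rightmost digit is weighted 1, the next 2, and so on; the
    check digit is the weighted sum modulo 10.
    """
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """Check the format and the check digit of a CAS-style number.

    This only says the string is well formed; it cannot say whether
    the number has actually been assigned to a substance.
    """
    match = CAS_PATTERN.fullmatch(cas)
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    return cas_check_digit(body) == int(match.group(3))


if __name__ == "__main__":
    # 33069-62-4 is the taxol number discussed in this post.
    print(is_valid_cas("33069-62-4"))
    # Same number with a mistyped final digit.
    print(is_valid_cas("33069-62-5"))
```

Passing the check only catches typing errors. Confirming that a number really belongs to a particular substance still means looking it up in a curated source.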

Cameron said “I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers”

Hmmm…really? Check out this link. Compare the snippet here from the Taxol Drugbox on Wikipedia

[Image: drugboxtaxol.png, the Taxol Drugbox snippet from Wikipedia]

with the taxol record here on PubChem and the synonyms list. You are looking for 33069-62-4.

[Image: taxolcas.png, the PubChem synonyms list for taxol]

When you find it click on it and you will find 6 structures for Taxol. The issue is curation. Which structure is right?

Why does the story that there are no CAS numbers on PubChem continue to proliferate??? Sure, they are not called CAS Numbers, but that’s what they are. Depositors simply put them in as identifiers and PubChem doesn’t have to remove them. They respect the depositors’ right to deposit their identifiers…whether they are CAS numbers or not.

Now, I agree robotic curation can help with these issues and the RDF approach already being discussed for Wikipedia and ChemSpider (with Egon) can be useful in helping to link together resources and, if adopted by companies such as Aldrich etc, can be of great value in helping to clean up some of the issues. But, it is only part of the solution. The need for manual curation is being missed. Robots are already making a bigger mess in my opinion. Manual curation is a must.

Cameron said “The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free then it would be driving traffic to its own value added services.”

I have made this statement to many people over the years. The Registry of chemical structures has little value as a collection of structures. It’s what’s connected that has value: the CAS Numbers, the patents, the literature articles, the vendors. It’s the same with PubChem and ChemSpider. Who cares how many structures we have? We can generate HUNDREDS OF MILLIONS for you! It’s the associated information. I have always thought that CAS should provide an internet service that is simply a CAS lookup. Search a number and see the structure and/or substance detail. End of story. You want patent details, papers, vendor details…you pay. It’s a transaction…same as STN now. In fact, do a study, see how many searches are done just to relate CAS numbers and structures, figure out the loss in revenue if you “give it away” and shift it over to the transaction charge to look at “more info”. Certainly the giveaway would help with the public relations! Oh…and you “could” make it a structure search to get a CAS number too…more work but of course possible.

Cameron said “…do people agree that CID is a good standard index to aggregate around?”

No…not yet. There is so much to be done before that can happen in my opinion. A lot.

What I’d rather do, and maybe I am a dreamer, is build a relationship with ACS/CAS and try to establish a tipping point of support where they see it is good for their business and the community as a whole. I prefer staying in relationship if possible. That said, the effort around CIFs from ACS via CrystalEye has not moved forward, as explained here, so maybe my vision will not work in the case of CAS numbers either. Either way, we made a decision not to scrape CrystalEye anyway, and this shows perfectly the problems SMILES and InChI have with structure representation!

I say let’s not abandon hope regarding CAS opening their numbers to the world just yet. This dialog is likely sparking discussions already. Let’s keep it out there and establish a groundswell of concern and support and hope that the right thing can happen for our good and for CAS. I have great respect for many of their people and their work and want the resolution to be appropriate for all parties. Let’s hope…and if hope doesn’t work then I encourage robotic and manual curation…the system is ready on ChemSpider. Come and help out!

9 Responses to “Enforcing Copyright of CAS Numbers”

  1. DrZZ says:

    I think this is an example of how the details can be confusing. The example you gave is actually an example of PubChem working perfectly. All six structures for taxol are identical, at least as far as the PubChem comparison is concerned. They were submitted by six different depositors, but they all agree on the structure. I just don’t see how any other behavior can work. What if I find out there was a mistake in our compound handling and the tests with NSC125973 really used a different compound? As PubChem is structured now, I can update the structure record for my substance ID and it will get a new structure ID, but it will be distinct and insulated from all the other people who have submitted data for taxol.

    That is not to say that there still isn’t a problem. The query above used the CID as a search term, so you got all substances that were given the CAS number and had CID = 36314. Erase the CID term and you search for just those substances where the CAS is 33069-62-4. Now you get 9 substance records with a total of 3 different CIDs, so there really are differences in assigned structures. But that’s a problem anyway. Every journal I know wants you to identify not just the structure of key compounds, but the source. If the source deposits its data in PubChem, the SID becomes a perfectly reasonable substitute, with the added advantage that the data exists in a framework that lets you check whether the name/structure/ID assigned is consistent with other organizations’ usages. Of course there is the very real problem that not all sources deposit data in PubChem, but that is a problem that exists to some extent with any external identifier.

  2. Science in the open » Who’s got the bottle? says:

    [...] I am with Antony Williams on this, if CAS got its act together and made their numbers an open standard then that would be the [...]

  3. David Bradley says:

    It’s possible that CAS numbers are not copyrightable…

    http://www.chemspy.com/chemistry-news/copyright-and-cas-numbers.html

    db

  4. Steven Bachrach says:

    The issue is not whether CAS numbers are copyrightable. (My guess is that they will have a hard time in court justifying that they are the result of some creative act.)

    The real issue is, as Peter MR has pointed out, that access to the CAS databases is largely through electronic means (SciFinder, STN) and a contract between the user and CAS restricts what can be done with the information in the database, including the CAS Number.

    Since the only way to verify that some string of numbers actually is a CAS number, and that it corresponds to a particular chemical substance, is through access to the CAS Registry file, one will have to abide by the contract that dictates how that information may be re-used. And that restriction can be very limiting.

  5. David Goodman says:

    Because of the cost of providing access to the general public without educational discounts via the electronic products, some large public libraries still have the print Chemical Abstracts. One that does is the New York Public Library.

  6. Wolfgang Robien says:

    In order to make things short:

    1) I like CAS/Scifinder and I like emolecules+Chemspider and I make use of them.
    2) My access to Scifinder was never denied (except when the number of concurrent users was exceeded) – the only downtime is in the night from SAT to SUN. The access to Chemspider was sometimes (not often) impossible (system maintenance, etc.), but in a more unpredictable way.
    3) Does Chemspider have a redundant hardware configuration and redundant access to the internet? Same question to CAS?
    What happens when Tony is less enthusiastic about this project? It will either be stopped or at least a visible interruption will occur. What happens when Mr. X from CAS is less enthusiastic and leaves the organization? Nothing!
    4) Does anybody here discuss ‘free chemicals, free solvents, free NMR-spectrometers’?
    5) You have to pay for watching television (at least here in Austria and many other countries) – should also be free, at least the news (=information!)
    6) I definitely do not agree with some parts of the politics of CAS – but without offering alternate business models any discussion will be impossible.
    7) The slogan ‘destroy them’ is no solution! What will happen (sorry, worst case): Industry will never use systems like Chemspider to the same extent as they use CAS, academia will switch to free systems (open is not so important, free of charge is important!) – long term perspective: academia will not use CAS, industry will use it, prices of CAS will go up, usage by academia will be prohibited by prices. The consequence: Industry will add 10 cents to every Viagra pill and overcompensate the costs – every person having ED will pay …… (feel free to substitute Viagra/ED by any other combination of tradename and disease)
    8) The end of the story: Academia now has open systems and free access – but the systems are incomplete and more or less a “private initiative” from a single person. Industry pays (more + (over)compensates!) and has all information available ….. in principle we pay again as customers!!!!!!! It’s just another channel!
    9) Sorry for being sarcastic, INFORMATION is like chemicals, spectrometers, etc. – you have to pay for it, then you can expect stable systems!

    ( I put my kevlar vest on !)

  7. Antony Williams says:

    Wolfgang..thanks for the message of support. I definitely want to comment on your questions…

    1) Excellent!
    2) ChemSpider goes offline for a number of reasons: interruptions with the ISP, power outages caused by weather systems and other related challenges.

    3) We have a level of redundancy now as a result of some contributions of servers and software. However, it is not where we want to be yet. I’ve blogged about this many times. I don’t think the challenge is going to be my enthusiasm for the project. It’s the reality of the need to pay my bills and, ultimately, to have some form of compensation for the team building ChemSpider. While we get much joy out of our efforts, the demands on us to keep producing, at no cost to the community, a Free Access site that is available 24/7 cannot be met without funding. I have contributed to grant applications with the hope of deriving some funds to accelerate the system and support full-time rather than night-time-only team members.

    7) I do NOT want to destroy CAS by any means. I want to work with, support, integrate and participate with CAS. ChemSpider has grown a lot in less than a year with limited resources and the challenges of not enough hardware and software. I don’t think a comparison of CAS vs ChemSpider is worth considering; the more appropriate comparison is likely CAS and PubChem, then PubChem and ChemSpider. If you have seen the statement now made by CAS regarding their support of Wikipedia, I can only say that I am joyful with their decision. What I hope for are similar collaborations and exposure of public service efforts, and I am excited by the opportunities.

  8. Wolfgang Robien says:

    The slogan ‘destroy them’ was the very short (maybe too short) summary of some posts I have seen on many different websites (not by you, Tony!) – hope you didn’t misunderstand this statement!

    In the long term, I am really afraid that we will divide the scientific community into the rich (= industry + a few universities) and the poor (= 90% of academia). The ‘rich’ have access to open systems and to systems like CAS/Scifinder; the poor have access only to free-of-charge systems, nice but incomplete! Even if prices at CAS go up, industry will pay them – the additional costs will simply be added to their products!

  9. Antony Williams says:

    Wolfgang…no offense taken by me at all. I had assumed your comments were not pointed at me.
