Archive for May 19th, 2008

I’ve blogged previously about the confusion that appears to persist around CAS Numbers. Most of you are probably aware of the Wikipedia Chemistry project currently under way to validate the set of structures on Wikipedia. The project is well advanced now, and I can comment that there were definitely incorrect CAS Numbers on Wikipedia associated with the chemical structures, but in reality the quality was very good. At present my estimate is about 20:1; that is, roughly 20 times more CAS Numbers on Wikipedia were CORRECT than incorrect. Very impressive, I say, considering how haphazardly they are used by people. More below...

Now, we DO have registry numbers on ChemSpider. Probably more than 10,000 of them too, with acknowledgment given to the statement on large electronic collections of structures and registry numbers. The majority of these are on PubChem too and proliferate across hundreds of chemical companies. Go and search for chemicals in China and see how they are listed! We have an increasing number of offshore companies depositing their data into ChemSpider, and we see issues showing up.

As a perfect example of the registry number confusion showing up on ChemSpider, check out this query: search for the number 1429-50-1 on ChemSpider and you will get this list of hits.

Confused? Well, it’s simply people misusing registry numbers when making their association. I suspect they are not getting the numbers issued by CAS and therefore label “a component” of their material with the CAS Number, throw in a few waters of hydration here and there, a counterion or two and whoops… proliferation of CAS numbers with the wrong association. For the example above the “primary component” should be clear and consistent between the six hits.

ChemSpider, just like PubChem, cannot be responsible for the quality of what’s deposited with us. What we can do is use processes, robots and manual curation efforts to help clean it up.
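To make the "robots" idea concrete, here is a minimal sketch of two automated checks one could run over deposited records. It is not ChemSpider's actual pipeline. The function names, the record layout (registry number paired with a structure identifier) and the sample records are all assumptions made up for illustration. The first check uses the standard CAS check-digit rule (the last digit is a weighted sum of the other digits, modulo 10); the second flags any registry number that has been attached to more than one distinct structure.

```python
import re
from collections import defaultdict

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")

def has_valid_check_digit(registry_number):
    """Return True if the string is well formed and its check digit is consistent."""
    match = CAS_PATTERN.match(registry_number.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit of the body.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check_digit

def find_conflicting_registry_numbers(records):
    """Map each registry number to its structures; keep those with more than one."""
    structures_by_number = defaultdict(set)
    for registry_number, structure_id in records:
        structures_by_number[registry_number.strip()].add(structure_id)
    return {
        number: sorted(structures)
        for number, structures in structures_by_number.items()
        if len(structures) > 1
    }

# Hypothetical deposited records: (registry number, structure identifier).
records = [
    ("1429-50-1", "structure-A"),
    ("1429-50-1", "structure-B"),
    ("1429-50-1", "structure-A"),
    ("50-78-2", "structure-C"),
]

conflicts = find_conflicting_registry_numbers(records)
print(conflicts)
```

Note that a number can pass the check-digit test and still be attached to the wrong structure, so the first check only catches typos. The second check only tells a curator where to look; deciding which association is correct still needs a human or an authoritative source.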

So, what IS the correct chemical associated with 1429-50-1? No idea!  Anybody else know?


I had previously posed the question “How many chemical names are contained in the short paragraph below?” Well, I have highlighted the “chemicals” contained in this paragraph. Click on the link to see what’s what.

“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”

OK, so you saw Aspirin immediately, right? Maybe you could have guessed that Advantage and Commando would be drugs? Some of you might have spotted “he” (helium) and “in” (indium). But did you expect “of” and “the”?

What was this challenge all about? It shows the need to do a good job of identifying chemical names when hunting for them in articles. With a dictionary of millions of systematic names, trade names, synonyms and database IDs, even the most general text is full of “chemicals”. So, the application of a dictionary of chemical names must be done very carefully. And the point is that matching the dictionary of names within ChemSpider at present to text contained within scientific articles will fail without either directly identifying chemical names OR identifying trade names etc. within an appropriate context.
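A small sketch shows how quickly naive dictionary matching goes wrong on the paragraph above. The dictionary here is a tiny stand-in containing only the entries named in this post, and the list of common English words is an assumption for illustration; neither reflects the real ChemSpider dictionary or any real text-mining tool.

```python
import re

# Stand-in dictionary: only the names called out in this post.
NAME_DICTIONARY = {"aspirin", "advantage", "commando", "he", "in", "of", "the"}

# Hypothetical stoplist of everyday words to suppress.
COMMON_WORDS = {"he", "in", "of", "the"}

TEXT = (
    "She had the drive to derive success in any venture and was well versed "
    "in Karate. When the man in the tartan shirt approached her with a dagger "
    "in his hand she spat in his face, took the stance of a commando and took "
    "advantage of his shock to release the dagger from his grip and causing "
    "him to recoil. He went home and took an aspirin after the beating."
)

def naive_matches(text, dictionary):
    """Return every distinct word in the text that appears in the dictionary."""
    tokens = re.findall(r"[A-Za-z]+", text)
    return sorted({token.lower() for token in tokens if token.lower() in dictionary})

def filtered_matches(text, dictionary, common_words):
    """Same lookup, but drop words on the stoplist."""
    return [name for name in naive_matches(text, dictionary) if name not in common_words]

print(naive_matches(TEXT, NAME_DICTIONARY))
print(filtered_matches(TEXT, NAME_DICTIONARY, COMMON_WORDS))
```

The naive lookup flags all seven dictionary entries, including “of” and “the”. The stoplist removes the obvious offenders, but “advantage” and “commando” still come through as hits even though neither is used as a product name here. A word list alone cannot separate those cases, which is why context matters.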

There are WAY more complexities than this though. A group at Cambridge has been working on SciBorg since 2005. The project description page outlines the project:

“SciBorg is a four-year project, starting October 1 2005, funded by the EPSRC under the programme for Computer Science for e-Science. The project is a collaboration between three groups at the University of Cambridge:

We are cooperating with three major publishers:

The project summary and objectives are below. For further information, please see the detailed project description , which was based on the project proposal, and the developing SciBorg project wiki pages.”

I have been following the project for a while and am getting much more interested in it right now. It makes for great reading about the challenges of text mining data. Peter Murray-Rust has made a couple of blog posts (1,2) over the weekend about the challenges of text mining, and I refer you there for a good overview of some of them. They are significant, but there are ways to deal with some of the issues.

I’ll blog more about text-mining and names in the next few weeks…
