I had previously posed the question: “How many chemical names are contained in the short paragraph below?” Well, I have highlighted the “chemicals” contained in this paragraph. Click on the link to see what’s what.

“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”

OK, so you saw aspirin immediately, right? Maybe you could have guessed that Advantage and Commando would be drugs? Some of you might have spotted “he” (helium) and “in” (indium). But did you expect “of” and “the”?

What was this challenge all about? It illustrates the need to do a good job of identifying chemical names when hunting for them in articles. With a dictionary of millions of systematic names, trade names, synonyms and database IDs, even the most general text is full of “chemicals”. So a dictionary of chemical names must be applied very carefully. The point is that matching the dictionary of names currently within ChemSpider against the text of scientific articles will fail unless chemical names are identified directly OR trade names and the like are identified within an appropriate context.
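To make the problem concrete, here is a minimal sketch of naive dictionary matching run against the paragraph above. The toy dictionary holds only the seven names mentioned in this post, and `COMMON_WORDS`, `naive_matches` and `filtered_matches` are hypothetical names invented for the illustration. This is not how ChemSpider does its matching.

```python
import re

# Toy dictionary: only the seven names this post itself calls out.
# A real dictionary would hold millions of entries.
TOY_DICTIONARY = {"aspirin", "advantage", "commando", "he", "in", "of", "the"}

# Hypothetical stop list: dictionary entries that are also everyday words.
COMMON_WORDS = {"he", "in", "of", "the"}

TEXT = (
    "She had the drive to derive success in any venture and was well versed "
    "in Karate. When the man in the tartan shirt approached her with a dagger "
    "in his hand she spat in his face, took the stance of a commando and took "
    "advantage of his shock to release the dagger from his grip and causing "
    "him to recoil. He went home and took an aspirin after the beating."
)


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def naive_matches(text, dictionary):
    """Return every distinct word that appears in the dictionary, sorted."""
    return sorted({token for token in tokenize(text) if token in dictionary})


def filtered_matches(text, dictionary, stop_words):
    """Same as naive_matches, but drop hits that are on the stop list."""
    return [name for name in naive_matches(text, dictionary)
            if name not in stop_words]


print(naive_matches(TEXT, TOY_DICTIONARY))
print(filtered_matches(TEXT, TOY_DICTIONARY, COMMON_WORDS))
```

Even with the stop list, “advantage” and “commando” still come through as hits. A word filter alone cannot tell a trade name from ordinary English, which is exactly why context matters.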

There are WAY more complexities than this, though. A group at Cambridge has been working on SciBorg since 2005. The project description page outlines the project:

“SciBorg is a four-year project, starting October 1 2005, funded by the EPSRC under the programme for Computer Science for e-Science. The project is a collaboration between three groups at the University of Cambridge:

We are cooperating with three major publishers:

The project summary and objectives are below. For further information, please see the detailed project description, which was based on the project proposal, and the developing SciBorg project wiki pages.”

I have been following the project for a while and am getting much more interested in it right now. It makes for great reading about the challenges of text mining data. Peter Murray-Rust has made a couple of blog posts (1,2) over the weekend about the challenges of text mining, and I refer you there for a good overview of some of them. They are significant, but there are ways to deal with some of the issues.

I’ll blog more about text-mining and names in the next few weeks…

