Copyright©2009 Antony Williams
Users of ChemSpider will likely know the name “SureChem” from when we linked their patent collection to ChemSpider. Their name is also associated with ChemMantis: the SureChem Entity Extractor is the “brains” behind locating the chemical names and entities inside chemistry articles. SureChem provided us with their Entity Extractor software development kit (SDK), which we extended with additional entity libraries to meet our particular needs. We also introduced additional filtering routines that run before the chemical names are fed to the name-to-structure algorithms we use.
A little bit about SureChem first: they offer an online portal for structure searching of patents and MEDLINE. Their website is at www.surechem.org, so be sure to try it out. There’s a presentation online here. The strengths of their approach include the completeness of their coverage, the speed at which they update their patent collection, the ability for users to download structures from patents of interest, the ease of their workflow and the quality of their markup and visualization tools. The foundation of their process involves two specific tool sets: the “entity extractor”, which finds chemical names in the text, and the name-to-structure tools, which convert those names to structures. They use three commercial name-to-structure tools: CambridgeSoft’s application, ACD/Labs’ NTS application and OpenEye’s application. SureChem also uses a set of heuristics to correct errors common to chemical names in patent text so that the names can be successfully converted to structures. The entity extractor is their own development, built up from years of research and experience in text mining.
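To make the two-stage idea concrete, here is a minimal sketch of the second stage: clean up a candidate name, then hand it to several name-to-structure converters until one succeeds. This is our own illustration and not SureChem's code. The function names, the single cleanup rule and the two toy "converters" (plain lookup tables returning SMILES strings) are all assumptions, and we are also assuming the tools are tried one after another, which the description above does not actually specify.

```python
import re
from typing import Callable, Optional, Sequence

# A converter takes a chemical name and returns a structure string
# (SMILES here), or None if it cannot interpret the name.
Converter = Callable[[str], Optional[str]]


def clean_name(name: str) -> str:
    """Illustrative cleanup only: remove stray whitespace after hyphens
    and collapse runs of whitespace. Real heuristics would be far richer."""
    name = re.sub(r"-\s+", "-", name)
    return re.sub(r"\s+", " ", name).strip()


def name_to_structure(name: str, converters: Sequence[Converter]) -> Optional[str]:
    """Try each converter in turn on the cleaned name; return the first hit."""
    cleaned = clean_name(name)
    for convert in converters:
        structure = convert(cleaned)
        if structure is not None:
            return structure
    return None


# Hypothetical stand-ins for real name-to-structure tools.
tool_a: Converter = {"ethanol": "CCO"}.get
tool_b: Converter = {"2-chloroethanol": "OCCCl"}.get

smiles = name_to_structure("2- chloroethanol", [tool_a, tool_b])
```

In this sketch the stray space in "2- chloroethanol" is repaired before conversion, the first toy converter fails and the second one supplies the structure. A name that none of the converters recognizes simply comes back as `None`.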
We have taken the SureChem entity extractor, originally optimized around patents, and have extended the source code to allow us to deal with a number of specific dictionaries (including chemical names, chemical families, reaction names, fragments, species, enzymes, hardware vendors, software vendors and so on). We have also tested the resulting entity extractor on almost a thousand chemistry publications and have optimized the detection of the beginning and end of a chemical name. Our work on optimizing the entity extractor for our needs on ChemMantis is not over. However, we have been able to get to where we are rather quickly because of the strong and capable technology provided to us by SureChem. We are grateful to them for their support and encourage you all to test out their service. We have just taken delivery of a new web service from them so that we can improve our integration from ChemSpider to their SureChem service. Watch this space.
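For readers curious what dictionary-based tagging looks like in principle, here is a small sketch of looking up terms from several typed dictionaries in a piece of text. It is not the SureChem SDK or the ChemMantis code: the dictionary contents, the entity type labels and the `tag_entities` function are all made up for illustration, and a real extractor does much more work to find where a chemical name begins and ends.

```python
import re
from typing import Dict, List, Set, Tuple

# Hypothetical typed dictionaries; real ones would hold many thousands of terms.
DICTIONARIES: Dict[str, Set[str]] = {
    "chemical_name": {"acetic acid", "benzene"},
    "reaction_name": {"diels-alder reaction"},
    "chemical_family": {"alkenes"},
}


def tag_entities(
    text: str, dictionaries: Dict[str, Set[str]]
) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, matched text, entity type) for each dictionary hit."""
    # Map every term to its entity type.
    lookup = {term: etype for etype, terms in dictionaries.items() for term in terms}
    # Put longer terms first so the alternation prefers the longest match.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Benzene undergoes a Diels-Alder reaction with alkenes in acetic acid."
entities = tag_entities(sentence, DICTIONARIES)
```

Each hit carries its character offsets, which is what makes it possible to mark up the original text, and its entity type, which tells a downstream step whether the term is worth sending to a name-to-structure tool at all.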