Copyright © 2010 Antony Williams
The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010
ChemSpider has been integrated with the SureChem service for almost three years. It’s a great service, and the SureChem team has worked very hard to provide a premier offering at an affordable price point. The approach SureChem has taken is an interesting one: they use entity extraction techniques to find chemical names in patents, then use various name-to-structure conversion engines to generate a consensus result for the most appropriate chemical structure associated with each name. From this data they assemble a rich database of chemical structures linked back to the associated patents. This database is, of course, both structure and substructure searchable. Using the web service provided to us by SureChem, we have been able to link chemical structures in ChemSpider out to SureChem. This can produce a lot of hits across the various patent databases, because the mere existence of a chemical name in the patent text is enough to produce a hit. Look under the patents tab for Xanax and you will see an abundance of related patents.
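To make the "consensus result" idea concrete, here is a minimal sketch of majority voting across several name-to-structure engines. It is not SureChem’s actual method, which is not described here: the function name, the `min_votes` threshold and the tie rule are all assumptions made for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_structure(candidates: Sequence[Optional[str]],
                        min_votes: int = 2) -> Optional[str]:
    """Pick the structure identifier that most engines agree on.

    `candidates` holds one result per (hypothetical) engine, e.g. an InChI
    string, or None when that engine failed to convert the name.
    Returns None when there is no clear winner.
    """
    votes = Counter(c for c in candidates if c)
    if not votes:
        return None  # every engine failed on this name
    ranked = votes.most_common(2)
    best, count = ranked[0]
    # Assumed rule: a tie between the top two answers means no consensus.
    if len(ranked) > 1 and ranked[1][1] == count:
        return None
    return best if count >= min_votes else None
```

With three engines returning `["A", "A", "B"]` the sketch picks `"A"`; with two engines disagreeing it returns `None`, so no structure would be stored for that name.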
Note that the link between ChemSpider and SureChem is based on the structure, using an InChI as the connection string. Structures will only exist in the SureChem database if the name-to-structure conversion software succeeded. There is a big upside to this, to be described shortly, but it also has a downside. Entity extraction identifies systematic names using tuned algorithms that SureChem has been optimizing for many years. For non-systematic names such as trade names, however, it depends on dictionaries, both within the entity extractor and underlying the name-to-structure conversion engines. So, for Cocaine to be found under its identifiers snow, berries and bernice (believe me…check the names on ChemSpider!), each of the three names would have to be extracted and then associated with the structure of cocaine. It is UNLIKELY that any patent will use these terms for cocaine, of course! This will all become clear now….
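The structure-based link can be pictured as a simple lookup keyed on the InChI string. The sketch below is only an illustration: the function, the dictionary layout and the patent identifiers are hypothetical, and the real link goes through SureChem’s web service rather than an in-memory table.

```python
from typing import Dict, List


def link_patents_by_inchi(compounds: Dict[str, str],
                          patents_by_inchi: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Map each compound to the patents indexed under the same InChI.

    `compounds` maps a compound label to its InChI; `patents_by_inchi`
    stands in for the patent-side index. A compound whose structure was
    never generated on the patent side simply gets an empty list.
    """
    links: Dict[str, List[str]] = {}
    for compound_id, inchi in compounds.items():
        # Exact string match on the InChI is the whole "connection string" idea.
        links[compound_id] = list(patents_by_inchi.get(inchi.strip(), []))
    return links


# Placeholder patent identifiers; water and methane are used only because
# their InChI strings are short.
index = {"InChI=1S/H2O/h1H2": ["PATENT-A", "PATENT-B"]}
compounds = {"water": "InChI=1S/H2O/h1H2", "methane": "InChI=1S/CH4/h1H4"}
links = link_patents_by_inchi(compounds, index)
```

Here `links["methane"]` comes back empty, which mirrors the downside above: no structure on the patent side, no link, however many patents mention the compound by a name that was never converted.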
We have decided to take advantage of the potential to integrate with Google Patents: it’s easy, it’s low-hanging fruit, and there may be advantages in bringing SureChem and Google Patents together for users to choose between. We have taken a similar approach to searching Google Patents as we do with PubMed, which we search using all validated synonyms associated with a structure. Therefore for Xanax, which is validated, we would get this hit list. However, if no names were validated against a structure then we would get NO Google Patents results. There has been an active project for three years to validate name-structure pairs across ChemSpider, and I have yet to see any incorrectly associated patents. Fortunately the names snow, bernice and bennies are NOT validated. If bernice were validated, we would get this list associated with Cocaine.
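The validated-synonym rule can be sketched in a few lines. This is an illustration under stated assumptions, not the ChemSpider implementation: the `Synonym` class, the function name and the quoted-phrase `OR` query format are all hypothetical, and the validation flags in the example are made up for the demonstration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Synonym:
    name: str
    validated: bool


def build_patent_query(synonyms: Iterable[Synonym]) -> Optional[str]:
    """Build a name-based search string from validated synonyms only.

    Returns None when nothing is validated, i.e. no search is run and
    no patents are shown for that structure.
    """
    seen = set()
    names = []
    for syn in synonyms:
        key = syn.name.strip().lower()
        if syn.validated and key and key not in seen:
            seen.add(key)  # drop case-insensitive duplicates, keep order
            names.append(syn.name.strip())
    if not names:
        return None
    # Assumed query syntax: exact phrases joined with OR.
    return " OR ".join(f'"{name}"' for name in names)


xanax = [Synonym("Xanax", True), Synonym("Alprazolam", True), Synonym("xanax", True)]
street_names = [Synonym("snow", False), Synonym("bernice", False)]
```

`build_patent_query(xanax)` gives `"Xanax" OR "Alprazolam"`, while `build_patent_query(street_names)` gives `None`, so unvalidated names like snow never reach the patent search.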
If you look at slide 11 from my presentation at ICIC, posted here, you will see how complex a NAME-based search can be for a particular component. All of those names for OEA had to be searched. In the world of entity extraction, every one of those names would have to be found and converted to the correct structure. A balance of name linking and entity extraction approaches will likely give an intersection.
The integration of Google Patents seems to show great promise. One advantage is that it offers access to the digitized US patents all the way back to 1790. However, SureChem covers WO, European and Japanese patents, while Google Patents is limited to US patents only. We have already learned a lot about how to reduce erroneous results and get the most value from the Google Patents service, but we look to you, the users, for initial feedback when we release at the ACS meeting. Google Patents will be available under the last tab in the Patents infobox.