Copyright © 2008 Antony Williams
Over the past few months we have been working hard to integrate the ChemRefer system into ChemSpider. I have reported on our recent rollout of the first level of integration and Will Griffiths has started a discussion at the Open Chemistry Web blogpage. Will has played a key role in facilitating our relationships with publishers in the Open Access domain. Many have had an appreciation for his ChemRefer website.
When we connected ChemRefer to ChemSpider, one of our first commitments was to facilitate direct structure/substructure searching of the IUCr publications. Will has indexed the publications from 1948 to the present day and has extracted chemical names and systematic names from the titles and abstracts of the articles. Since we have access to name-to-structure conversion capabilities from OpenEye, we can convert many of these names to chemical structures in an automated fashion. As a result of my work on the curation of Wikipedia chemical structures, I also have access to some of the tools I was involved in developing at ACD/Labs (ACD/ChemFolder and ACD/Name). This has allowed me to curate and convert other extracted names manually. This is NOT the most efficient way to conduct the process, as it requires a lot of eyeballing. However, it is this type of approach, reviewing many hundreds of extracted names, that has allowed us to optimize the process, recognize where potential failures could arise and improve our chances of extracting the best set of names for conversion. We remain at the whim of the name-to-structure conversion software in terms of the accuracy of the conversion, but this can be validated (see later). Since organometallic names are very difficult to convert to structures, we can only deal with organic structures at present. However, we also have many validated synonyms in hand on ChemSpider, and these provide a useful dictionary for conversion.
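To make the dictionary-plus-conversion idea above a little more concrete, here is a minimal Python sketch. It is an illustration only, not the ChemSpider code: the function name `resolve_name`, the `synonyms` mapping and the `converter` callable are all hypothetical, and the converter simply stands in for whatever name-to-structure tool is available (it is not the OpenEye or ACD/Labs API). Structures are assumed to be plain strings such as SMILES.

```python
from typing import Callable, Dict, Optional, Tuple


def normalize(name: str) -> str:
    """Lower-case a name and collapse runs of whitespace."""
    return " ".join(name.lower().split())


def resolve_name(
    name: str,
    synonyms: Dict[str, str],
    converter: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], str]:
    """Return (structure, source) for an extracted chemical name.

    synonyms  -- hypothetical dictionary of validated synonyms, keyed by
                 normalized name, with a structure string as the value
    converter -- hypothetical stand-in for a name-to-structure tool;
                 assumed to return None when it cannot convert the name
    """
    key = normalize(name)
    # 1. Prefer a validated synonym when one exists.
    if key in synonyms:
        return synonyms[key], "dictionary"
    # 2. Otherwise fall back to automated conversion.
    structure = converter(name)
    if structure is not None:
        return structure, "converted"
    # 3. Anything left over is flagged for manual curation.
    return None, "manual"
```

Checking the validated dictionary first means that names we have already curated never depend on the converter, and anything neither route can handle (organometallic names, for example) falls through to the manual pile.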
Ultimately I hope that other commercial providers of batch name-to-structure conversion software will see the potential value of collaborating with ChemSpider and give us access to their tools to assist with our project.
We are still curating and checking the names and chemical structures for everything we have extracted so far, but the depositions have already started. Of the structures converted, about half are already on ChemSpider (example) and the other half are new to the database (example). At present we have connected almost a thousand articles via DOIs from ChemSpider to the IUCr articles. Our best estimate at present is that we will be able to connect about 8500 structures to articles.
Since this is the IUCr collection, we will be able to validate our approach in the future: with their permission, we will use the CIFs to check our extracted structures against the CIF-based structures. This will give us an indication of the errors one could expect from automated extraction of chemical names from articles followed by name-to-structure conversion. We are also going to continue our curation process and allow people to curate structures from IUCr (and other) articles on ChemSpider. There will be cases where names from articles have not been converted to structures, and these will need manual submission. This will be possible through the manual deposition system.
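As a rough sketch of how such a validation could be scored, the Python below compares extracted structures with CIF-derived ones article by article. It rests on several assumptions that are mine rather than part of the project: each article is keyed by its DOI, each article has a single structure, and both structures have already been reduced to comparable identifier strings (for example InChI strings). The function name `validation_report` is hypothetical.

```python
from typing import Dict, List, Optional, TypedDict


class Report(TypedDict):
    checked: int
    mismatches: List[str]
    error_rate: Optional[float]


def validation_report(
    extracted: Dict[str, str],
    reference: Dict[str, str],
) -> Report:
    """Compare extracted structures with reference (CIF-derived) ones.

    Both arguments map an article DOI to a structure identifier string.
    Only DOIs present in both mappings can be checked.
    """
    shared = sorted(set(extracted) & set(reference))
    # A mismatch is any article whose two identifiers differ.
    mismatches = [doi for doi in shared if extracted[doi] != reference[doi]]
    checked = len(shared)
    # Leave the rate undefined when nothing could be compared.
    error_rate = len(mismatches) / checked if checked else None
    return {"checked": checked, "mismatches": mismatches, "error_rate": error_rate}
```

The list of mismatching DOIs is returned alongside the rate so that the failures can be fed straight back into manual curation.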
We are looking forward to providing similar services to other publishers should they be desired. This is our proof of concept… and it’s working.