Copyright©2008 Antony Williams
A couple of days ago I blogged about building the first dedicated website for Molecule of the Day. To continue our “proof of concept” demonstrations in this vein, we now unveil our first support of a free-access publisher. Wikipedia describes Molbank as an Open Access journal, but based on some of the conversations I have seen on Murray-Rust’s blog this is in question. As I have said previously, I hope to stay on good terms with publishers as we navigate our way through building our structure-centric community for chemists. I have exchanged numerous emails with the editorial team at Molbank and have found them very supportive of our integration, so away we went.
The data was scraped from the Molbank website, specifically the titles, authors, URL link to the article and the molfile itself. A couple of scripts later, an SDF file was constructed from the molfiles and the text. This SDF file was then opened and reviewed visually to remove “errors in the data”. There were a number of different types of errors; some examples are listed below:
http://www.mdpi.org/molbank/molbank2007/m558.htm includes HA and HB annotations
http://www.mdpi.org/molbank/molbank2007/m555.htm includes R groups – should be expanded
http://www.mdpi.org/molbank/molbank2005/m407.htm the mol file is for CH2=CH2
http://www.mdpi.org/molbank/molbank2005/m409.htm the mol file is for ethane
There are other examples, and Rich Apodaca has made a number of similar observations previously.
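For readers who want to try something similar, here is a minimal sketch in Python of the two steps described above: wrapping a molfile and its text fields into an SDF record, and flagging records that deserve a second look. This is not the script we used. The function names, the field names, the short list of "common" elements and the heavy-atom threshold are all assumptions made for illustration, and the flags only suggest what to check by eye. It assumes V2000 molfiles.

```python
# Illustrative sketch only: names, element list and threshold are assumptions.

# Small whitelist of elements; anything else (R#, HA, HB, ...) gets flagged.
COMMON_ELEMENTS = {"H", "B", "C", "N", "O", "F", "Na", "Si",
                   "P", "S", "Cl", "K", "Br", "I"}

# A tiny V2000 molfile for ethane, used as example input.
ETHANE = "\n".join([
    "",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
])


def sdf_record(molfile, fields):
    """Build one SDF record: the molfile, the data items, then '$$$$'."""
    lines = [molfile.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f"> <{name}>")
        lines.append(value)
        lines.append("")  # a blank line ends each data item
    lines.append("$$$$")
    return "\n".join(lines) + "\n"


def atom_symbols(molfile):
    """Return the atom labels from the atom block of a V2000 molfile."""
    lines = molfile.splitlines()
    n_atoms = int(lines[3][0:3])  # counts line: first three columns
    return [line.split()[3] for line in lines[4:4 + n_atoms]]


def review_flags(molfile, min_heavy_atoms=3):
    """List reasons a structure may need a manual look (heuristic)."""
    symbols = atom_symbols(molfile)
    flags = []
    odd = sorted({s for s in symbols if s not in COMMON_ELEMENTS})
    if odd:
        flags.append("unusual atom labels: " + ", ".join(odd))
    heavy = sum(1 for s in symbols if s != "H")
    if heavy < min_heavy_atoms:
        flags.append(f"only {heavy} heavy atoms")
    return flags


record = sdf_record(ETHANE, {"Title": "Example title", "URL": "http://example.org/"})
flags = review_flags(ETHANE)
```

A two-carbon structure like the one above gets flagged as suspiciously small, which is the kind of placeholder molfile seen in the m407 and m409 articles. Annotated atoms and R groups show up as unusual labels. A flag is not proof of an error, so everything still gets reviewed by a person.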
We believe we have created from this data a high-quality, curated (but likely not perfect) dataset, available as a subset at molbank.chemspider.com. The structures show names, identifiers, supplementary info where appropriate and a link to the original article. An example is shown below for the linkages.
Notice the link to the article from the data sources and from the supplementary info, as well as the miscellaneous safety and tox data scraped from MSDS sheets online. We will now keep this dataset updated as Molbank expands. With the permission of the editorial staff we would also be interested in extracting the analytical data.
Our proofs of concept have shown that we can host different datasets on ChemSpider, and we urge anybody interested in such a service to approach us for discussions.