PubChem is a very large source of compound structures and data, but their quality and reliability vary. Within it, however, some sets of compounds and substances can be trusted more than most because they were deposited by reliable data sources. One example is the set deposited by the Nature Publishing Group, which corresponds to compounds in Nature Chemistry, Nature Communications and Nature Chemical Biology articles.

We have developed an automated method that searches PubChem for substances deposited by the Nature Publishing Group, extracts their structures and properties in SDF format, and imports them into ChemSpider. The result is a newly imported set of 5525 molecules in ChemSpider. These compounds have been deposited in PubChem since 2005 and originate from over 400 articles. All imported compounds link back to the original article (see the example below).
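As a rough illustration of the search-and-extract step, here is a minimal Python sketch. It is not the code we run. It assumes PubChem's PUG REST interface is used, that the `substance/sourceall/<source name>` and `substance/sid/<sids>/SDF` paths behave as documented, and that the depositor name passed in (for example "Nature Chemistry") matches the name PubChem holds for that source. The function names are hypothetical.

```python
import urllib.request
from urllib.parse import quote

# Assumed base URL of PubChem's PUG REST service.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def sids_url(source_name):
    """URL listing all substance IDs (SIDs) deposited by one source.

    The source name is an assumption; it must match PubChem's depositor name.
    """
    return f"{PUG_REST}/substance/sourceall/{quote(source_name, safe='')}/sids/JSON"


def sdf_url(sids):
    """URL returning the SDF records for a batch of SIDs."""
    return f"{PUG_REST}/substance/sid/{','.join(str(s) for s in sids)}/SDF"


def fetch(url):
    """Download a URL and return its body as text (network access required)."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


def split_sdf(text):
    """Split SDF text into records; each record ends with a '$$$$' line."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def sdf_tags(record):
    """Collect the '> <TAG>' data items of one SDF record into a dict."""
    tags = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1:line.rindex(">")]
            i += 1
            values = []
            # A data item runs until the next blank line.
            while i < len(lines) and lines[i].strip():
                values.append(lines[i])
                i += 1
            tags[name] = "\n".join(values)
        i += 1
    return tags
```

In use, one would call `fetch(sids_url(...))` to get the SID list, request the SDF in batches with `fetch(sdf_url(batch))`, and pass each record from `split_sdf` on to the import step. Batching matters because a single URL can only carry a limited number of SIDs.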

Example compound from PubChem

The process is automated and can be scheduled to scrape PubChem for newly deposited compounds and stream them into ChemSpider, so this subset will be updated regularly.
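One simple way to make a scheduled run pick up only new deposits is to remember which PubChem substance IDs (SIDs) have already been imported and compare against the current list each time. The Python sketch below shows that idea only. The JSON state file and the function names are assumptions for illustration, not a description of our pipeline.

```python
import json
from pathlib import Path


def load_seen(path):
    """Read the set of already-imported SIDs; empty on the first run."""
    state = Path(path)
    if not state.exists():
        return set()
    return set(json.loads(state.read_text()))


def save_seen(path, sids):
    """Persist the imported SIDs as a sorted JSON list."""
    Path(path).write_text(json.dumps(sorted(sids)))


def select_new(current_sids, seen):
    """Return the SIDs present in PubChem now but not yet imported."""
    return sorted(set(current_sids) - seen)
```

A scheduled job would load the state, call `select_new` on the latest SID list, import those substances, and then save the union of old and new SIDs.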

This initial prototype could pave the way for other high-quality, consistently formatted subsets of PubChem to be identified and deposited into ChemSpider in a similar way. To suggest other subsets of PubChem that ChemSpider could use, join the discussion on the ChemSpider forum.
