Over the past year ChemSpider has been working hard to build a functional and stable platform for the hosting, deposition and curation of structure-based data. This is to form the foundation of our mission to build a Structure-Based Community for Chemists. Our deposition system is in place and well-tested. Our indexing of articles is proven, and continues. We have indexed multiple Open Access articles. We support the deposition of analytical data (spectra and CIF files) into ChemSpider.

It is now time to take this to the next level, and I would like to extend an invitation to Open Access publishers to work with us to design an interface (preferably a web service) to facilitate direct deposition of data into ChemSpider. We’d like to design an interface where you can feed your articles in with Title, Authors, Journal reference, DOI and Abstract. We would associate the article with chemical structures in one of two ways: 1) extract the chemical names from the title and/or abstract and convert them on the fly to deposit and/or associate with structures on ChemSpider, and 2) allow the publisher to pass us a series of SMILES strings, InChI strings, molfiles or chemical names to deposit on ChemSpider. Based on what we have already done it is clear this process is feasible, though it will require some manual intervention until we optimize our processes. If we do this we can design an interface and input format that can be made public, be reused by other groups for the deposition of information into their own systems and, potentially, remove the need to extract information from PDF files (and other formats). The outcome of this work would be a freely accessible, structure- and substructure-searchable index of Open Access articles with links back to each article. We are already indexing articles so, with permission, we could use similar processes to index abstracts from even the non-Open Access publishers and make their articles structure/substructure searchable based on titles and abstracts.
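To make the proposal a little more concrete, here is a minimal sketch in Python of what a deposition record might look like. Nothing here is an existing ChemSpider interface: the class names, field names, JSON layout and validation rules are all assumptions made for illustration, and the real format would be whatever publishers and ChemSpider agree on. The fields mirror the ones listed above (Title, Authors, Journal reference, DOI, Abstract) plus an optional list of structure identifiers.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List

# Identifier types named in the post; the lowercase labels are an assumption.
ALLOWED_KINDS = {"smiles", "inchi", "molfile", "name"}


@dataclass
class StructureIdentifier:
    kind: str   # one of ALLOWED_KINDS
    value: str  # the SMILES/InChI string, molfile text, or chemical name


@dataclass
class ArticleDeposition:
    """Hypothetical deposition record; not an actual ChemSpider format."""
    title: str
    authors: List[str]
    journal_reference: str
    doi: str
    abstract: str
    # Way 2 in the post: identifiers supplied by the publisher.
    # If left empty, only way 1 (name extraction from title/abstract) applies.
    structures: List[StructureIdentifier] = field(default_factory=list)


def validate(record: ArticleDeposition) -> List[str]:
    """Return a list of problems; an empty list means the record looks well-formed."""
    problems = []
    if not record.title.strip():
        problems.append("title is empty")
    if not record.authors:
        problems.append("no authors given")
    if not record.doi.startswith("10."):
        problems.append("DOI does not start with '10.'")
    for i, s in enumerate(record.structures):
        if s.kind not in ALLOWED_KINDS:
            problems.append(f"structure {i}: unknown kind {s.kind!r}")
        elif s.kind == "inchi" and not s.value.startswith("InChI="):
            problems.append(f"structure {i}: InChI must start with 'InChI='")
    return problems


def to_json(record: ArticleDeposition) -> str:
    """Serialize the record to the JSON text a publisher might send."""
    return json.dumps(asdict(record), indent=2)


if __name__ == "__main__":
    # Made-up article metadata, used purely as a placeholder.
    example = ArticleDeposition(
        title="A placeholder article about benzene",
        authors=["A. Author", "B. Author"],
        journal_reference="Placeholder Journal 1 (2008) 1-10",
        doi="10.0000/placeholder",
        abstract="Placeholder abstract mentioning benzene.",
        structures=[
            StructureIdentifier("smiles", "c1ccccc1"),
            StructureIdentifier("inchi", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),
        ],
    )
    print(validate(example))
    print(to_json(example))
```

A record like this could be checked on the publisher's side before sending, which might reduce some of the manual intervention mentioned above. The checks shown are deliberately shallow (a prefix test is not real InChI or DOI validation), and chemical correctness would still need curation after deposition.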

So, my question. Are there any Open Access/Free Access publishers willing to discuss the possibilities I have outlined? If any of you will be at the ACS meeting and would like to discuss this, please post a response here or contact me at the usual email address (antonyDOTwilliamsATchemspiderDOTcom) and let’s talk about building a disruptive and enabling technology for chemists around the world.

3 Responses to “An Invitation to Open Access Publishers to Develop a Deposition API with ChemSpider”

  1. Wolfgang Robien says:

    ad C-NMR:

    Deposition of spectra saves the first 10% of the ‘job’; the hard work starts when checking and curating!

    For details see:
    http://nmrpredict.orc.univie.ac.at/csearchlite/NMR_misinterpretation.html
    I have approx. 6,000 more examples in my queue, and I am definitely NOT talking about assignment problems below 5 ppm!

  2. Antony Williams says:

    Yes, you are of course correct. One would hope that some validation of the data is done by the reviewers, but how often do they have access to raw data? So, following deposition, some curation and checking would need to be done. That could be manual, by ChemSpider curators, or could be done by integrating with other tools/systems. Maybe this is an opportunity to send structures and spectra through an automated checking system. Does that sound like something of interest to set up?

  3. Eric Milgram says:

    Tony,

    The NIH’s public access policy “requires scientists to submit journal articles that arise from NIH funds to the digital archive PubMed Central (http://www.pubmedcentral.nih.gov/)…” The NIH has published a list of journals that, by default, will upload their publications to PubMed Central: http://publicaccess.nih.gov/submit_process_journals.htm. Perhaps some of these same journals would also require that manuscript authors submit their data electronically to a source such as ChemSpider.

    One of the key foundations of science is the exchange of ideas, results, and data. Some prestigious journals, such as Nature or Cell, require authors to make rare or precious reagents available to others for purposes of validating novel or controversial results.

    Many people are aware of a number of cases where high profile scientific results turned out to be fraudulent, as reported in the popular media. There is no reason why authors should not be required to submit electronic copies of their data along with their manuscripts. Before the advent of the WWW, publishers did not have a good mechanism for accepting this data and making it widely available.

    However, with sites such as Wikipedia or ChemSpider, the publishers do not need to incur the cost of maintaining a facility for data deposition and dissemination. Also, if the process for uploading and searching this data is well-constructed, then authors wouldn’t feel burdened by using it, and the scientific community would benefit immensely from having access to it.

    As scientific experiments become more complex, especially in fields like systems biology, where a small number of samples can generate a tremendous amount of data, being able to access the raw data will become more important than ever for a number of reasons. For one thing, these fields are evolving rapidly, but there are no widely accepted best practices or QA/QC processes, even though a number of proposals are being discussed. Thus, today, if one were to take the same data set (e.g. proteomics) and give it to 10 different groups to process, those groups could draw very different conclusions from it, depending upon how each group decides to process and interpret the data.

    A great example demonstrating this situation can be seen in the Jan 2008 issue of the journal Statistical Applications in Genetics and Molecular Biology (http://www.bepress.com/sagmb/vol7/iss2/#competition). In this issue, a clinical proteomics data set was made available to participants to analyze and submit their conclusions. The journal selected some of the submissions and published the results, which were very interesting.

    Making the raw data available has so many advantages. Other researchers can use the data to verify the original conclusions, as well as to draw new conclusions. The likelihood of fraud is decreased. Also, so long as the sites where the data is being submitted have an adequate storage and retention policy, data archival for posterity becomes more facile.

    I’m glad to see that you’re making available a facility for data storage. The scientific community is slowly making the transition from the Gutenberg printing press era to the modern digital era. At some point, having electronic data be available will certainly be required.