Peter Murray-Rust and Henry Rzepa have started an Open Notebook Science project on calculating NMR spectra.

In Peter’s own words:

“We are starting an experiment on Open Notebook Science. <..> ONS seems to be the generally agreed term for scientific endeavour where the experiments are rapidly posted in public view, possibly before being exhaustively checked. It takes bravery as it isn’t fun if you goof publicly.“
“The recent controversy over hexacyclinol – where a published structure seems to be “wrong” – has sparked one good development – the realisation that high-quality QM calculations can predict experimental data well enough to show whether the published structure is “correct”.
We’re now starting to do this for NMR spectra. Henry Rzepa has taken Scott Rychnovsky’s methods for calculating 13C spectra and refined the protocol. Christoph Steinbeck has helped us get 20,000 spectra from NMRShiftDB and Nick Day (of crystalEye fame) has amended the protocols so we can run hundreds of jobs per day.”

I commented on Peter’s post: “My intuition is that the HOSE-code approach, neural network approach and LSR approach will outperform the GIAO approach. Certainly these approaches would be much faster, I believe. It would be good to compare the outcome of your studies with these other prediction algorithms and if the data is open then it will make for a good study.”

Peter replied: “I am a supporter of the HOSE code and NN approach, but I have also been impressed with the GIAO method. The time taken is relatively unimportant. A 20-atom molecule takes a day or so, smaller ones are faster. We can run 100 jobs a day – so 30,000 a month. That’s larger than NMRShiftDB.”
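A quick check of that throughput figure, taking the quoted 100 jobs a day at face value and assuming a 30-day month (my assumption, not Peter's): that rate gives 3,000 jobs a month, and 30,000 a month would need about 1,000 jobs a day. The small sketch below only makes the arithmetic explicit.

```python
# Throughput check using the figure quoted above (100 jobs a day).
# The 30-day month is an assumption made for this illustration.
JOBS_PER_DAY = 100
DAYS_PER_MONTH = 30

jobs_per_month = JOBS_PER_DAY * DAYS_PER_MONTH

# Daily rate that would be needed to reach the quoted 30,000 a month.
QUOTED_MONTHLY_TOTAL = 30_000
required_jobs_per_day = QUOTED_MONTHLY_TOTAL / DAYS_PER_MONTH

print(f"{JOBS_PER_DAY} jobs/day over {DAYS_PER_MONTH} days: {jobs_per_month} jobs")
print(f"Rate needed for {QUOTED_MONTHLY_TOTAL} jobs/month: "
      f"{required_jobs_per_day:.0f} jobs/day")
```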

He also comments “I have extended George Whitesides’ ideas of writing papers that in doing research one should write the final paper first (and then of course modify it as you go along). So here’s what we will have accomplished (please correct mistakes)”.

The abstract is below. Rather than correct mistakes, I have added a passage (NON-bolded, indented below). I believe this project offers the ability to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science. There has never been a study of the magnitude being discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it!

PMR> “We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were “wrong” – i.e. the reported chemical shifts did not fit the reported spectra values.

    The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that they could handle. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
    A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. QM approaches were generally more capable of predicting exotic structures, while for the majority of the NMRShiftDB, made up of general organic chemicals, non-QM approaches were superior.

We argue that if spectra and compounds were published in CMLSpect in the supplemental data it would be possible for reviewers and editors to check the “correctness” on receipt of the manuscript. We wrote to all major editors. aa% agreed this was a good idea and asked us to help. bb% said they had no plans and the community liked things the way it is. cc% said that if we extracted data they would sue us. dd% failed to reply.
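The core arithmetic in the abstract (convert each calculated shielding to a shift against the calculated TMS value, take the RMS deviation from the observed shifts, then look at the largest deviations first) can be sketched roughly as follows. This is a minimal illustration, not the protocol itself: the function names are mine, and every number in it is made up purely to exercise the code.

```python
import math


def predicted_shift(sigma_tms, sigma_atom):
    """Shift (ppm) relative to TMS from calculated isotropic shieldings (ppm)."""
    return sigma_tms - sigma_atom


def rms_deviation(predicted, observed):
    """Root-mean-square deviation between two equal-length shift lists."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))


def rank_deviations(predicted_by_atom, observed_by_atom):
    """Atoms sorted by absolute deviation, largest first (candidate misassignments)."""
    deviations = {
        atom: predicted_by_atom[atom] - observed_by_atom[atom]
        for atom in predicted_by_atom
    }
    return sorted(deviations.items(), key=lambda item: abs(item[1]), reverse=True)


# All values below are hypothetical, chosen only for illustration.
SIGMA_TMS = 185.0                                   # calculated TMS shielding
shieldings = {"C1": 55.0, "C2": 160.5, "C3": 20.0}  # calculated shieldings
observed = {"C1": 128.5, "C2": 25.0, "C3": 170.0}   # "experimental" shifts

predicted = {atom: predicted_shift(SIGMA_TMS, s) for atom, s in shieldings.items()}
atoms = sorted(predicted)
rms = rms_deviation([predicted[a] for a in atoms], [observed[a] for a in atoms])
ranked = rank_deviations(predicted, observed)

print(f"RMS deviation: {rms:.2f} ppm")
print("Largest deviation first:", ranked)
```

A large deviation on a single atom does not say whether the prediction, the assignment or the structure is at fault; it only says where to look first.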

This has the potential to be a very exciting project. While I wouldn’t write the paper myself without doing the work, I’ll certainly try the approach. Let’s see what the truth is. The challenge now is to reach agreement on how to compare the performance of the algorithms. We are comparing very different beasts with the QM vs. non-QM approaches so, in many ways, this should be much easier than the challenges discussed so far around comparing non-QM approaches between vendors.
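One possible starting point for that agreement, sketched under my own assumptions: score every method on exactly the same atoms, so that a method restricted to small molecules is not compared against one scored on the whole database. The method labels, the (structure, atom) keys and all the numbers below are placeholders, not results.

```python
import math


def error_stats(predicted, observed, keys):
    """RMS, mean absolute and maximum absolute error over the given keys."""
    errors = [predicted[k] - observed[k] for k in keys]
    n = len(errors)
    return {
        "n": n,
        "rms": math.sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        "max": max(abs(e) for e in errors),
    }


def compare_methods(predictions_by_method, observed):
    """Score each method only on atoms that every method has predicted."""
    common = set(observed)
    for predictions in predictions_by_method.values():
        common &= set(predictions)
    if not common:
        raise ValueError("no atoms are covered by every method")
    keys = sorted(common)
    return {
        name: error_stats(predictions, observed, keys)
        for name, predictions in predictions_by_method.items()
    }


# Hypothetical data: keys are (structure id, atom number), values are ppm.
observed = {
    ("mol1", 1): 128.5, ("mol1", 2): 25.0,
    ("mol2", 1): 170.0, ("mol3", 1): 77.0,
}
predictions_by_method = {
    # "mol3" is left out here, as if it were too large for the QM run.
    "GIAO": {("mol1", 1): 130.0, ("mol1", 2): 24.0, ("mol2", 1): 168.0},
    "HOSE": {("mol1", 1): 129.0, ("mol1", 2): 25.5,
             ("mol2", 1): 171.0, ("mol3", 1): 76.0},
}

results = compare_methods(predictions_by_method, observed)
for name, stats in results.items():
    print(name, stats)
```

Restricting to the common subset is only one choice. Reporting each method on its full coverage as well would show what the size limit costs.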


6 Responses to “An Invitation to Collaborate on Open Notebook Science for an NMR Study”

  1. Brent Lefebvre says:

    Peter and Tony,

    I think this is a fantastic project and am very keen to see how accurate the QM techniques prove to be for the subset of structures that you choose from the NMRShiftDB, and then how helpful they can be in improving the accuracy of experimental shifts in this wonderful resource.

    For the purposes of this work, we would be willing to provide the chemical shift predictions from the ACD/Labs software if you would like to use them in your comparison. If, for instance, they prove to be accurate enough to find many of these problems without the need for time-consuming QM calculations, it may be preferable to use the faster calculation algorithms that are available in our software. It may turn out that the ACD/Labs predictions could serve as a pre-filter to define which structures need the QM calculations and which don’t. Many variations on this theme come to mind, but we won’t know which are useful until we do the work.

    Sincerely,

    Brent Lefebvre
    NMR Product Manager
    Advanced Chemistry Development, Inc.

  2. Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Open Notebook NMR - cont says:

    [...] Peter – FYI ACD/Labs are ready to participate in the work as discussed: http://www.chemspider.com/blog/?p=213#comment-3735 [...]

  3. Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Open Notebook NMR: Commercial re-use of data? says:

    [...] Antony Williams of Chemspider has offered to participate in our Open Notebook NMR experiment. Now this offer has been joined by ACDLabs – I am not sure of the formal relation between the companies but they have clear common interests. I had originally thought this was one individual making a personal offer – now there is a company that is requesting our data for them to work on. I have some genuine concerns about how we should proceed so am clearing my thoughts online. Readers will recall that I have been strongly critical of companies or nonprofits which use closed source and protect closed data. I have roughly equal numbers of correspondents who think I have been too hard on these organizations and those who think I need to be tougher. So I am taking a measured tone here. Peter – FYI ACD/Labs are ready to participate in the work as discussed: http://www.chemspider.com/blog/?p=213#comment-3735 [...]

  4. Antony Williams says:

    Response to blog posting by Peter Murray-Rust posted (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=715) here for completeness

    Peter, I’ll comment in more detail on this after your readers have a chance to comment. But, for clarity, I point your readers to http://www.chemspider.com/blog/?p=213 where I took your “George Whitesides approach to writing papers” and added to the conclusion. Here’s the piece I added.

    “The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that they could handle. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
    A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. QM approaches were generally more capable of predicting exotic structures, while for the majority of the NMRShiftDB, made up of general organic chemicals, non-QM approaches were superior.”

    There should be no surprise to you that ACD/Labs stepped forward to participate. I declared it explicitly in my blog posting.

    I also posted in that blog the following statement “I believe this project offers the ability to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science. There has never been a study of the magnitude being discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it!”

    I fully acknowledge your stance on commercial software companies. Also on publishers. And many other areas. You’re not shy with your judgments. Having worked in academia, a Fortune 500 company and a commercial software company, I can say that all three have good science going on, some excellent people in their organizations and certainly people committed to their roles and to science. I ask the question: why not help build the bridge rather than maintain the distance?

    Further clarification: ChemSpider does not have access to any NMR prediction algorithms. However, they would be willing to work on this project for the science. The hypothesis under question is whether HOSE, Neural Net or Increment based algorithms can outperform GIAO predictions. It is already known that they are faster; these are real numbers generated already: “on the dataset of 23475 structures on a cluster of computers a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.” What is the statement on accuracy? I believe it’s a valid scientific question to be answered.

    We have just submitted a publication regarding one aspect of this, validated on NMRShiftDB, with your collaborator, Christoph Steinbeck, as our collaborator. The title and authors are below; it should appear in JCIM shortly. I have already sent you a copy, I believe.

    The Performance Validation of Neural Network Based 13C NMR Prediction
    Using a Publicly Available Data Source.
    K.A. Blinov§, Y.D. Smurnyy§, M.E. Elyashberg§, T.S. Churanova§, M. Kvasha§, C. Steinbeck#, B.A. Lefebvre† and A.J. Williams‡
    § Advanced Chemistry Development, Moscow Department, 6 Akademik Bakulev Street, Moscow 117513, Russian Federation
    † Advanced Chemistry Development, Inc., 110 Yonge Street, 14th floor, Toronto, Ontario, Canada, M5C 1T4
    # Steinbeck Molecular Informatics, Franz-John-Str. 10, 77855 Achern, Germany.
    ‡ ChemZoo Inc., 904 Tamaras Circle, Wake Forest, NC-27587

  5. Wolfgang Robien says:

    Dear Tony;

    a searchable database of 16.4 million calculated C13-NMR spectra has been available for approx. 1 year at http://nmrpredict.orc.univie.ac.at/identify
    (Peter: it’s free of charge ! ;-) ) )

    The spectra have been calculated for 16.4 million of the PUBCHEM structures using the CSEARCH NN-approach. The search technology used is a modified SAHO-approach as implemented in CSEARCH.

    If there is more interest in using this, it is no problem to upgrade the data file to the current size of the PUBCHEM collection. The calculation of approx. 40 million spectra can be done in less than one week on a 4-processor box.

    Best regards, Wolfgang Robien

  6. Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Open NMR - again. Why we do it says:

    [...] read my highly opinionated views of what was originally entitled “Open Notebook Science NMR” (1,2). My views around that work were very strong…in fact I didn’t really “get it”. I didn’t [...]
