Previously I blogged about “An Invitation to Collaborate on Open Notebook Science for an NMR Study”. I judged it was a great opportunity to “help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science.” In particular, I believe the project offers an opportunity to answer a longstanding question of mine. I have seen a lot of publications in recent years utilizing complex, time-consuming GIAO NMR predictions. Having been involved with the development of NMR prediction algorithms for the past few years (while working with the scientists at ACD/Labs), my judgment is that these complex calculations can be replaced by calculations that take just a couple of seconds on a standard PC. I believe this to be true for most organic molecules. I do not believe such calculations would outperform GIAO predictions for inorganic molecules, organometallic complexes or solid state shift tensors. However, there has never been a rigorous examination comparing the performance of the two approaches. I believe this project offered an excellent opportunity to test the hypothesis that HOSE code/Neural Network/Increment based predictions could, in general, outperform GIAO predictions.

The study was to be performed on the NMRShiftDB now available on ChemSpider. I’ve blogged previously about the validation of the database (1,2). The conversation about the NMR project has continued, and Peter has talked about some of the challenges of Open Notebook Science based on Cameron Neylon’s comments. I’ve posted the comments below to his post and they will likely be moderated in shortly. I post them here by way of conclusion, since I don’t think my original hopes will come to fruition. Thanks to those of you who have been engaged both on and off blog. I suggest we all help with Peter’s intention to explain the identifiers that are being extracted in the work.

“Can you provide some more details regarding your concerns here: “it would be possible for someone to replicate the whole work in a day and submit it for publication (on the same day) and ostensibly legitimately claim that they had done this independently. They might, of course use a slightly different data set, and slightly different tweaks.”

I have two interpretations:

1) Someone could repeat the GIAO calculations in a day and identify outliers and submit for publication

2) Someone could do the calculations using other algorithms and identify outliers etc and submit for publication

Maybe you mean something else?

For 1) the GIAO calculations CANNOT be repeated, since no one has access to Henry’s algorithms and, based on your comments, he is modifying them on an ongoing basis as a result of this work. Even if they did have their own GIAO calculations, unless they have improved the performance dramatically or have access to a “boat load” of computers the calculations will take weeks (based on your own estimates). That said, comparing one GIAO algorithm to another is valid science and absolutely appropriate and publishable. Also, if they had used the same dataset as you, with another algorithm to check predictions and identify outliers, it WOULD be independent. Related to the work you are doing for sure, but independent.

For 2) using other algorithms on the same dataset is valid and appropriate science. This is what people do with logP prediction (or MANY other parameters)...they validate their algorithms on the same dataset many times over. It’s one of the most common activities in the QSAR and modeling world in my opinion. And people do use slightly different tweaks...it’s one of the primary ways to adjust the algorithms. Henry’s doing this right now to deal with halogens according to your earlier post (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=732). Wolfgang Robien at the University of Vienna, ACD/Labs and others use their own approaches, but both Robien and ACD/Labs can, at a minimum, use HOSE code and Neural Networks. Same general approaches with tweaks. They give different results...all is appropriate science.

Returning to the comment “it would be possible for someone to replicate the whole work in a day and submit it for publication (on the same day) and ostensibly legitimately claim that they had done this independently.”

Wolfgang Robien has taken the NMRShiftDB dataset and performed an analysis. It’s posted here: http://nmrpredict.orc.univie.ac.at/csearchlite/enjoy_its_free.html . ACD/Labs performed a similar analysis, as discussed on Ryan’s blog here: http://acdlabs.typepad.com/my_weblog/2007/05/nmrshiftdb_acdl.html. One of the outputs is this document: http://acdlabs.typepad.com/NMRShiftDB_Validation.pdf . This resulted in further exchanges and dialog: http://acdlabs.typepad.com/my_weblog/2007/06/robiens_and_mod.html . The parties have discussed this on the phone and face to face, with Ryan talking with Wolfgang recently in Europe at a conference.

This was heated and opinionated for sure. STRONG scientific wills and GREAT scientists defending their approaches and performance. Wolfgang is NOT an enemy of ACD/Labs...he has made some of the greatest contributions to the domain of NMR prediction and, in many ways, has been one to emulate in terms of his approach to quality and innovation to create breakthroughs in performance. He is a worthy colleague and drives progress through his ongoing search for improvements in his own algorithms. I honor him.

The bottom line is this: the identification of outliers in NMRShiftDB has been DONE already. It’s been discussed online for months...just do a search on “Robien NMRshiftDB” on Google or “ACD/Labs nmrshiftdb”. There are hundreds of pages. We/I just published on the validation of the NMRShiftDB. I blogged about it and you posted it here http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=673. Feedback on outliers has been returned to Christoph and changes made already. So in many ways you are doing repeat work – just using a different algorithm and identifying new outliers. Neither ACD/Labs’ nor Wolfgang’s work was exhaustive. It was very much a first cut but did help edit many records already. NO DOUBT you will find new outliers.

I’ve gone back to the original post at http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=671 and extracted two purposes of the work:

1) To perform Open Notebook Science

2) quote “To show that the philosophy works, that the method works, and that NMRShiftDB has a measurable high-quality.”

1) has already changed and is an appropriate outcome from the work (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=743).

2) The method of NMR prediction applied to NMRShiftDB to prove quality...high or not...has been done already. Wolfgang and ACD/Labs did it already. I judge you’ll have similar conclusions...it’s the same dataset.

Stated here http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=737 is “We shall continue on the project, one of whose purposes is to investigate the hypothesis that QM calculations can be used to evaluate the quality of NMR spectra to a useful level.” It’s a valid investigation, and it tests whether QM can provide good predictions. This is of course known already from the work done by Rychnovsky on hexacyclinol.

To summarize:

1) Using NMR predictions to identify outliers – already done (Robien and ACD/Labs)

2) Validating that GIAO predictions are useful to validate structures – already done (hexacyclinol study)

3) Validating the quality of NMRShiftDB – already done (Robien, ACD/Labs)

All this brings me down to what I “think” are the intentions or outcomes for the project at this point..but I likely have missed something..

1) Identify more outliers that were not identified by the studies of others

2) Deliver back to Christoph and the NMRShiftDB team a list of outliers/concerns/errors with annotations/metadata in order to improve the Open Data source of NMRShiftDB

3) Allow Nick Day to use a lot of what was learned delivering CrystalEye for a second application around NMR and useful for his thesis (A VERY valid goal..good luck Nick)

4) Show the power of blogging to drive Collaboration via Open Collaborative NMR

Some additional project deliverables I think include:

1) make online GIAO NMR predictions available

The project deliverables you are working on are defined here and I believe are consistent: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=742

* create a small subset of NMRShiftDB which has been freed from the main errors we – and hopefully the community – can identify.

* Use this to estimate the precision and variance of our QM-based protocol for calculating shifts.

* refine the protocol in the light of variance which can be scientifically explained.

What I still would like to see, BUT this project belongs to you/Henry/Nick of course and you define what it is, is:

1) “to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science.” Wolfgang is in academia, so are you, ACD/Labs is commercial and I’m independent (but of course am associated with ChemSpider...I am an NMR spectroscopist...it’s why I’m interested)

2) To validate the performance of GIAO vs HOSE/NN/Inc by providing the final dataset that you used and statistics of performance for GIAO on that dataset. I’d like to publish the results jointly, if you would be willing to work with the “dark side”

3) To identify where GIAO can outperform the HOSE/NN/Inc approaches
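As a rough illustration of what “statistics of performance” on a shared dataset could look like, here is a minimal Python sketch. It assumes each method supplies one predicted shift per experimental shift, in the same order. The function name `error_stats` and the numbers are hypothetical placeholders, not values from NMRShiftDB or from any of the programs discussed.

```python
from math import sqrt


def error_stats(predicted, experimental):
    """Summarise prediction errors (in ppm) for one method on one dataset."""
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty lists of equal length")
    errors = [p - e for p, e in zip(predicted, experimental)]
    n = len(errors)
    return {
        "n": n,
        # mean absolute error
        "mae": sum(abs(d) for d in errors) / n,
        # root mean square deviation
        "rmsd": sqrt(sum(d * d for d in errors) / n),
        # largest single deviation, a first hint at outliers
        "max_abs": max(abs(d) for d in errors),
    }


# Placeholder values only: NOT real shifts and NOT real predictions.
experimental = [10.0, 20.0, 30.0, 40.0]
method_a = [11.0, 19.0, 32.0, 40.0]  # stand-in for one prediction method
method_b = [10.5, 20.5, 29.5, 43.0]  # stand-in for another method

if __name__ == "__main__":
    for name, preds in (("method A", method_a), ("method B", method_b)):
        print(name, error_stats(preds, experimental))
```

Running both methods against exactly the same list of experimental shifts is what makes the numbers comparable; which statistics matter most is a choice for whoever does the study.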

Wolfgang also has thoughts, based on http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=742#comment-63089, where he says “What would be great to the scientific community: Do calculations on compounds where sophisticated NMR-techniques either fail or are very difficult to perform – e.g. proton-poor compounds or simply ask for a list of compounds which are really suspicious (either the structure is wrong or the assignment is strange, but the puzzle can’t be solved, because the compound is not available for additional measurements).”

I’ve put a lot of effort into blogging about this project over the past few days. I’m about to invest some time in making sure that you get information about outliers so you are not doing repeat work. I judge that my hopes for deeper collaboration will remain unfulfilled, so I’ll give up on asking.

I’ll do what I can to help from this point forward, and I’ll keep my own rhetoric off this blog and confine it to ChemSpider so as not to distract your readers. I look forward to helping for the benefit of the community.”


2 Responses to “Open Notebook Science NMR Study Part 2”

  1. Antony Williams says:

    I added a new comment to PMR’s blog today entitled Open NMR calculations: Intermediate conclusions at http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=750

    Peter, You have some interesting conclusions in this post and some are contrary to earlier observations made by others. First some comments:
    1) Regarding “It is also unclear how many of these could be robotically downloaded without the publishers sending lawyers – at least Elsevier allow us to do this.” This is excellent news that one of the biggest publishers around allows you to robotically download spectra from their papers. Very good indeed!
    2) Regarding “the only Open collection of spectra is NMRShiftDB – open nmr database on the web.” Just to clarify, these are NOT actually NMR spectra. Unless NMRShiftDB has a capability I am unaware of, NMRShiftDB is a database of molecular structures with associated assignments (and maybe in some cases just a list of shifts...maybe not all have to be assigned). To an NMR spectroscopist, the spectrum itself is what comes off the instrument, the one that can be re-referenced, phased, baseline corrected etc. NMRShiftDB is limited (I think) to a peak listing. This should not detract from the value of the data collection, but it may cause confusion. Certainly one conversation I have had in the past 24 hours suggests that people think that NMRShiftDB contains NMR “spectra”. But Christoph named it appropriately as a SHIFT database.

    3) Regarding “We downloaded the whole of NMRShiftDB. When we started we had NO idea of the quality.” I think you had an idea, and I point you to your own blog postings: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=278; http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=346. I recall you followed the scientific discourse between Wolfgang Robien and ACD/Labs regarding the quality and supported our conclusions that the data was of good quality. I recommend following the NMRShiftDB homepage (http://nmrshiftdb.ice.mpg.de/), where such reports get posted by Christoph as they occur:

    a) NMRShiftDB Critique 2007-04-05 02:01 – NMRShiftDB
    Prof. Wolfgang Robien from Vienna, maker of the CSearch system, has evaluated NMRShiftDB’s data quality and found a number of partly severe errors. Robien’s critique is summarized on his own site here.

    b) NMRShiftDB review 2007-05-03 04:12 – NMRShiftDB
    Antony Williams published an NMRShiftDB quality review in his ChemSpider blog. See here

    c) Quality Campaign 2007-07-02 08:03 – NMRShiftDB
    Between 2007-3-10 and today, altogether 72 spectra and/or structures in NMRShiftDB have been edited by the community to correct errors identified in analyses by Wolfgang Robien and Antony Williams as well as internal cross-checks.

    4) Regarding “We knew in advance that certain calculations would be inappropriate. Large molecules (> 20 heavy atoms) would take too long. ” The 20 heavy atom limit is a real constraint. I judge that most pharmaceuticals in use today are over 20 atoms (Xanax, sildenafil, ketoconazole, Singulair for example). I would hope that members of the NMR community are watching your work, as it should be of value to them, but I believe 20 atoms is a severe constraint. That said, I know that with more time you could do larger molecules, but a day per molecule is likely enough of a time investment.

    5) Regarding “Molecules with floppy groups cannot be easily analysed.” So, anything with a side chain then.

    6) Regarding “So we have a final list of about 300 candidates.” Out of a total of over 20000 individual structures, your analysis was performed on 1.5% of the dataset. How many data points was this, out of interest? A structure is clearly not a data point, since each structure has multiple nuclear centers and you are predicting individual shifts. I’ll estimate about 3000 shifts? The earlier validation I reported on was 214,000 shifts (http://www.chemspider.com/blog/?p=37), but that was an old version of the database and it has grown since then.

    7) Regarding “probably 20% of entries have misassignments and transcription errors. Difficult to say, but probably about 1-5%”. This suggests that about 25% of my estimated 3000 shifts are in error. This is about 750 data points, and this conclusion was made from the study of 300 molecules. For sure the 25% does not carry over to the entire database. It is of MUCH higher quality than that. My earlier posting suggested that there were about 250 BAD points. The subjective criteria are discussed here (http://www.chemspider.com/blog/?p=44). Wolfgang suggested about 300 bad points, but we were both being very conservative. You discussed the difference between 250 and 300 here on your blog, as you likely recall: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=346

    8) Regarding “We realise that other groups have access to larger and, they claim, better data sets. But they are closed. I shall argue in a later post that closed approaches hold back the quality of scientific data.” I think your comments are regarding Wolfgang Robien and ACD/Labs. It is true that we have access to larger datasets, but we can limit the conversation to NMRShiftDB since we ALL have access to that. Robien’s and ACD/Labs’ algorithms can adequately deal with the NMRShiftDB dataset. For the neural nets and Increment based approach, over 200,000 data points can be calculated in less than 5 minutes (http://www.chemspider.com/blog/?p=213). You have access to the same dataset and can handle 300 of the structures. Your statement is moot...it is NOT about database size but about algorithmic capabilities.
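The back-of-envelope numbers in points 6) and 7) above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the figure of 10 shifts per structure is an assumption implied by the estimate of roughly 3000 shifts for 300 structures, 20000 is used as a lower bound for “over 20000” structures, and the 25% error rate is the figure used in point 7). The variable names are made up for illustration.

```python
# Figures quoted in points 6) and 7); names are illustrative only.
structures_total = 20000      # "over 20000 individual structures" (lower bound)
structures_studied = 300      # "a final list of about 300 candidates"
shifts_per_structure = 10     # ASSUMPTION: implied by ~3000 shifts / 300 structures
error_rate_subset = 0.25      # error rate for the subset, as used in point 7)

# Fraction of the database covered by the 300-structure subset.
fraction_studied = structures_studied / structures_total

# Estimated shifts in the subset, and how many would be in error at 25%.
estimated_shifts = structures_studied * shifts_per_structure
estimated_bad_shifts = estimated_shifts * error_rate_subset

# Earlier whole-database validation: about 250 bad points out of 214,000 shifts.
earlier_shifts = 214000
earlier_bad_points = 250
earlier_bad_fraction = earlier_bad_points / earlier_shifts

if __name__ == "__main__":
    print(f"subset covers {fraction_studied:.1%} of structures")
    print(f"about {estimated_shifts} shifts, {estimated_bad_shifts:.0f} in error at 25%")
    print(f"earlier validation: {earlier_bad_fraction:.2%} of shifts flagged as bad")
```

On these figures the subset error rate and the earlier whole-database rate differ by more than two orders of magnitude, which is the gap point 7) is drawing attention to.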

  2. Sanford Dickert says:

    In regard to the Open Data question, we have been doing our own work on curating community data with the Red Hen Spectra project – you can see a version of it at alpha.redhenspectra.com:3005.

    We are still making minor modifications – and need to improve the dataset, but the development concepts are available for all to see.

    I would be happy to discuss with others about our design of our product as well as the web-service we make available for use outside the bounds of our website.

    Email me at sanford [AT] cooper dot edu
