Following on from a recent post estimating the number of potential errors in NMRShiftDB, a report has now been issued describing a deep analysis of the data rather than the cursory estimates I suggested. The details of that analysis have been exposed on Ryan Sasaki’s blog and in a separate report. By definition, an objective examination of algorithm performance by a software vendor will generally be challenged. So be it. However, if the assumption is that the examination is objective and driven by quality science, then any other flavors should be distasteful and the science is what it is.

The report analyzing the performance of both the ACD/CNMR predictor and Robien’s algorithm provided the statistics shown below:
Comparison of predictors

Previously I commented on this blog, based on an early analysis, that “The quality is excellent and there are ‘large errors’ but minimum in number.” This comment seemed to cause some confusion.

So, what do I mean by a “large error”?

What I do NOT mean is that a chemical shift at 120ppm predicted to be at 80ppm constitutes a large error. No, the chemical shift at 120ppm could be experimentally correct while the prediction algorithm simply fails to predict it correctly.

What I DO mean is that an assignment of a particular nucleus to 120ppm may be entered into the database when the ACTUAL shift should be 12ppm…that additional zero showing up as an error during the submission process. So, the errors I am pointing to are those of incorrectly drawn structures, mis-assignments, transcription errors and other potential sources of error. My estimates refer to the number of significant assignment or structural errors that were glaringly incorrect, and I was subjectively thinking of situations where the difference between the actual experimental shift value and the one assigned to the nucleus was >20ppm…this does not mean that mis-assignments of even 1ppm are any less important, just that they are not necessarily as easy to detect and were not part of my subjective criteria.
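
As an aside, here is a minimal sketch of how such glaring discrepancies might be screened for in bulk. It assumes a hypothetical predict_shift() function standing in for any C-13 prediction engine, and uses the 20ppm cut-off above as the flagging threshold; it is an illustration, not the actual workflow used to generate the report sent to Christoph.

```python
# Illustrative only: flag database assignments whose value deviates wildly
# from a predicted shift. predict_shift() is a hypothetical stand-in for
# any C-13 prediction engine; the 20ppm threshold mirrors the subjective
# criterion described above.

GLARING_ERROR_PPM = 20.0

def flag_glaring_errors(assignments, predict_shift, threshold=GLARING_ERROR_PPM):
    """assignments: iterable of (record_id, atom_index, assigned_shift_ppm)."""
    flagged = []
    for record_id, atom_index, assigned in assignments:
        predicted = predict_shift(record_id, atom_index)
        if abs(assigned - predicted) > threshold:
            flagged.append((record_id, atom_index, assigned, predicted))
    return flagged

# A 120ppm aromatic carbon mistakenly entered as 12ppm would be flagged,
# since |12 - ~120| is far larger than 20ppm.
```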

The data have now been examined in more detail and I believe I overestimated…a report of potential glaring errors has been returned to Christoph for him to examine and make changes to the database as he sees fit. Glaring errors number fewer than 250 based on my subjective criteria. Again, this does not mean that there aren’t hundreds or thousands of errors buried in the data…they are simply not obvious errors and require more manual examination.

Now, let’s return to some of the other comments about potential errors in the database. In his post Robien has highlighted some errors that are contained within the data. With his examination he has demonstrated the level of care and caution required to create the cleanest database possible, and he willingly admits there are likely still errors in his own database even after rigorous checking procedures. It would be naïve not to admit this for a large database, and certainly it is true for some of the NMR data contained in the millions of shifts used as the basis of the ACD/Labs NMR predictions. There are errors in there…we are just not aware of them yet.

Robien comments that “the total number of incorrect assignments exceeds the above mentioned limit of 250 significantly. The intermediate number is at the moment around 300, but about ca. 1,000 pages of printouts are waiting for visual inspection.” This is the nature of building high-quality databases…they need to be manually curated, and appropriate processes and tools need to be applied to ensure rigorous examination of the data. Ryan has already commented on the difference between 250 and 300…as has PMR in terms of a 1% error.

Robien also comments that “From my experience with C13-NMR databases a reasonable estimate for the number of mis-assigned shift values will be between 500 and 600, which corresponds to ca. 0.25-0.3%. Afterwards the data will reach the usual quality as within CSEARCH and NMRPredict.” Compare this with our experience of building databases of NMR shifts over the past decade. Year after year we have added somewhere close to 200,000 chemical shifts to the carbon NMR database. Each of the structures and individual chemical shifts passes through processes outlined elsewhere (this is an OLD presentation and the processes have since been tweaked for even greater control of data quality…). Over the years we have gathered statistics on the quality of data contained within the literature, almost exclusively peer-reviewed literature. Our statistics, on an annual basis, are that 8%, EIGHT PERCENT, of the chemical structures and associated shift assignments in the literature are in error. This can mean incorrectly elucidated structures, incorrectly assigned structures, transcription errors and so on. These are the errors we catch.

I reported previously on an example of the type of error that perpetuates in the literature. For a tosyl group as represented below, consider C-1 and C-4.

Tosyl structure

Without going into detail, when chemical shifts are predicted it is possible to examine which structures are used to generate the predictions. There are TWO distinct columns of related structures displayed, one around 132ppm and one around 145ppm. See below.

Calculation Protocols

The conclusion from these data is that assignments for C-1 and C-4 have been confused in many cases. These confusions did not arise from one particular publication that we can identify. We had to clean these up in our database, as Wolfgang likely did! “Error propagation”, as he comments, is dangerous! Finding the source of the errors is challenging.
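
For illustration only, here is a rough sketch of how a swapped-assignment pattern like this could be spotted: if the shifts assigned to what should be the same carbon environment split into two well-separated clusters (here, roughly 132ppm and 145ppm), the split itself is a warning sign. The gap threshold and the grouping logic below are assumptions for the sake of the example, not the method we actually used.

```python
from statistics import mean

def split_into_two_clusters(shifts_ppm, gap_ppm=5.0):
    """Very crude check: sort the shifts assigned to one atom environment and
    look for a single large gap. Two well-populated clusters separated by a
    big gap (e.g. ~132ppm vs ~145ppm for a confused C-1/C-4 pair) suggest
    systematically swapped assignments rather than random scatter."""
    values = sorted(shifts_ppm)
    if len(values) < 4:
        return None  # too few points to say anything
    # find the largest gap between consecutive sorted values
    gaps = [(values[i + 1] - values[i], i) for i in range(len(values) - 1)]
    largest_gap, split_at = max(gaps)
    if largest_gap < gap_ppm:
        return None  # looks unimodal; nothing suspicious
    low, high = values[: split_at + 1], values[split_at + 1:]
    return mean(low), mean(high), len(low), len(high)

# e.g. split_into_two_clusters([131.8, 132.1, 132.4, 144.9, 145.2, 145.6])
# -> (~132.1, ~145.2, 3, 3): two clusters, consistent with swapped C-1/C-4.
```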

Ryan Sasaki’s post and Peter Murray-Rust’s post have both commented on the fact that NMRShiftDB is a valuable resource. I’ve got my hand up in the air waving wildly with a “Yessir…it is”.

Christoph Steinbeck and the NMRShiftDB team have outlined their intentions admirably in at least two articles (J. Chem. Inf. Comput. Sci. 2003, 43, 1733-1739 and Phytochemistry 65 (2004) 2711–2717). They are well on their way to delivering on their vision. A detailed analysis of the performance of the ACD/Labs C-13 NMR NEURAL NETWORK based predictor using the NMRShiftDB dataset has been performed and a manuscript is presently in preparation. That work included a review of the impacts of overlap in the training set. The numbers are impressive and a teaser is available.

PMR commented on one of the capabilities of NMRShiftDB – “It contains mechanisms for assessing data quality automatically. For example software can be run that will indicate whether values are seriously in error.” NMRShiftDB is a research project and therefore “in development”. I’ve already provided a document outlining what appears to be a series of serious errors in the data (Robien identified some too). There is some work required on these algorithms. An example is shown below…notice the 8.30ppm shift on the carbonyl?

Assignment error

The assignments for this structure have already been reported, twice: once in the solid state, once in DMSO. The screenshot below shows the values. EVERY shift except for one as submitted to NMRShiftDB is in error. But this is ONE structure from one submitter. Maybe one approach is to have “faith criteria” for submitters after they have proven themselves with the submission of data? Maybe something untoward happened with the system this one time during submission. It does not take away from the effort…it’s just a sign that there is more work to be done. Here at ChemSpider…we know that world!

Database entry
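
To make the point about automated checks concrete: even a crude shift-range table would have caught the 8.30ppm carbonyl value in the example above. A minimal sketch follows; the ranges are rough rules of thumb of my own, not anything taken from NMRShiftDB’s actual checking code.

```python
# Rough, illustrative 13C shift ranges (ppm) for a few carbon types.
# These are ballpark rules of thumb, not a validated reference table.
TYPICAL_RANGES = {
    "carbonyl": (155.0, 225.0),   # ketones, aldehydes, acids, esters, amides
    "aromatic": (100.0, 160.0),
    "aliphatic": (0.0, 100.0),
}

def shift_is_plausible(carbon_type, shift_ppm):
    """Return True if the assigned shift falls inside the expected range."""
    low, high = TYPICAL_RANGES[carbon_type]
    return low <= shift_ppm <= high

# The example above: an 8.30ppm value assigned to a carbonyl carbon.
print(shift_is_plausible("carbonyl", 8.30))   # False -> flag for review
print(shift_is_plausible("carbonyl", 170.5))  # True
```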


3 Responses to “Further Comments on the Quality of NMRShiftDB and NMR Prediction Algorithm Validation”

  1. Ryan's Blog on NMR Software says:

    More Dialogue on NMRShiftDB Debate

    Peter Murray-Rust and Tony Williams have added their two cents to this debate on their respective blogs. Peter provides a great justification for providing open access to scientific information: In the case of NMRShiftDB I am firmly of the opinion that…

  2. Wolfgang Robien says:

    NMRShiftDB Debate

    CSEARCH performs at 2.22ppm/2.19ppm before/after correction with a certain
    network and a certain parameter setting.

    ACD’s CNMR-Predictor performs at 1.59ppm as can be seen from the table above.

    MODGRAPH’s NMRPredict performs at 1.40ppm as can be seen from their
    website.

    Those are the facts which have now come out of the discussion ….

    I think it’s good to know that, verified on the same dataset. I initiated
    this discussion, therefore I want to recall its starting point:

    The central question of my webpage at nmrpredict.orc.univie.ac.at was:
    (Statements copied from:
    nmrpredict.orc.univie.ac.at/csearchlite/Robien2Ryan_May31_2007.html
    nmrpredict.orc.univie.ac.at/csearchlite/enjoy_its_free.html)

    “Is there a visible improvement when you spent a few hours on
    data-correction, both on a statistical basis and also with a few specific
    examples?”

    The answer was:

    “This small improvement – which is far away from being perfect
    and/or complete – without the use of one, single literature citation
    improves the assignment quality by ca. 6,200 ppm !”

    “My finding was, that spending less than an afternoon you can improve a
    collection of some 20,000 NMR-spectra having more than 200,000 shiftvalues
    by 0.03ppm or ca. 6,200ppm in total.”

    My summary was:

    “That’s an amazing result ! NMRShiftDB has been online for about 5 years,
    the 20 most important contributors are mentioned on their homepage by name
    and I think many other people have contributed into this project. All
    these people together were unable to spend less than an afternoon on
    data-correction within 5 years! That’s exactly the point nobody seems to
    (be willing to) understand !”

    Peter Murray-Rust stated on his blog with respect to NMRShiftDB-data:

    “…… It contains mechanisms for assessing data quality automatically.
    For example software can be run that will indicate whether values are
    seriously in error………”

    My answer was:

    “That’s really a great idea AFTER being 5 years on the web ! ……. let
    me know which data-correction protocol has been applied to the
    NMRShiftDB-data leading to deviations of about 130ppm (= 2/3 of the usual
    C-NMR shift range) BEFORE you put them on the web ! ……”

    Final remarks:

    Only 4 main contributors are involved in this discussion:

    That’s me – I have started this ‘avalanche’
    Antony Williams and Ryan Sasaki from ACD
    Peter Murray-Rust from Cambridge

    I couldn’t find any comment by the NMRShiftDB people, including Christoph
    Steinbeck.

    In order to improve science with a project like NMRShiftDB, the first step
    should be to look around at what is already there (“State-of-the-art”), the
    second step should be to avoid the errors other people have already paid for
    (“Avoid starting from scratch”) and the third step should be to avoid
    hacking the same numbers/strings into a computer (“share resources”).
    The community needs chemical data; THEY SHOULD BE VALIDATED ACCORDING
    TO THE STATE-OF-THE-ART (when you cite me, please use both parts of this
    sentence!) – I have shown, by spending a few minutes of CPU-time, that the
    second part of this sentence was not true for NMRShiftDB as downloaded
    on March 10th, 2007.

    What’s the result of this discussion:

    a) We know how accurately 3 programs perform on the same dataset
    b) There was severe error-correction performed on NMRShiftDB after
    my analysis

    Both items are valuable contributions for the scientific community.

    At the end a personal remark to Christoph Steinbeck:

    Copied from: sourceforge.net/forum/forum.php?forum_id=681882

    Start citation —-

    ….. We also feel that this makes a strong case for our open access,
    open source policy, which gave our reviewer the chance to access our full
    material and run this test. As Eric Raymond puts it: “Given enough
    eyeballs, all bugs are shallow”

    End of citation —-

    I clearly state here: Don’t use the “open access, open source policy”
    as an excuse. You simply have the responsibility as database supplier
    and/or project manager to apply basic statistical tests on the data
    BEFORE you make them available to the scientific community, in order
    to obtain a reasonable quality of the data you provide. I am
    talking only about errors which can be found just by ‘snapping your
    fingers’. Obviously this point has been missed over 5 years.

    Wolfgang Robien, June 6th, 2007

  3. Ryan Sasaki says:

    Before anyone takes Robien’s results about the NMR prediction comparisons AS FACT…please read my latest post about the details of the comparison.

    http://acdlabs.typepad.com/my_weblog/2007/06/robiens_and_mod.html

    There are still some remaining questions about how Modgraph came to an average deviation of 1.40 ppm.

    So before we can make a final decision on performance, I think Modgraph needs to make very clear the following:

    1. What is the overlap between NMRShiftDB and Modgraph’s NMR prediction databases? Further, with several different database sources, how much duplication of data exists across the databases and within the entire package?

    2. Once that overlap is removed from the dataset, what is the final deviation produced by NMRPredict?

    I think this information needs to be made very clear by Modgraph before they can claim to be “the most accurate carbon 13 NMR predictor in an independent evaluation”.
