I give a thumbs up to the quality of the NMRSHIFTDB. We’ve validated it. Why would I care? I’m an NMR jock at heart. I also work for a commercial software company innovating NMR prediction software and compiling NMR databases as the basis of our work. Does this mean that commercial software vendors and the Open Source/Access communities can coexist and have mutual admiration? I believe so!

I finally signed off on one of those infamous copyright transfers for Elsevier, now the publishers of Progress in NMR. After over 18 months of work (the after-hours style…much like blogging) a 360 page review article is finally submitted: “Computer-Assisted Structure Verification and Elucidation Tools In NMR-Based Structure Elucidation”. Proofs will arrive before the end of the month. It’s the culmination of over ten years of our own work as well as that of many contributors in the domain of CASE (Computer Assisted Structure Elucidation) systems. The complexity of structures that can be solved by computer algorithms is impressive…see examples here. Recently the StrucEluc CASE system solved the structure of an antibiotic with MW > 1150. Three NMR spectroscopists couldn’t solve it…a symbiotic relationship with software is VERY enabling!

One very active player in CASE is Christoph Steinbeck, a member of Blue Obelisk, one of the more active blogging groups on the net today.

Christoph’s group hosts NMRSHIFTDB. Recently ChemSpider linked to NMRSHIFTDB. In parallel I took an interest in the recent critique of its data quality published by Wolfgang Robien, especially since during my “day job” I am directly involved with NMR prediction, structure verification using NMR and, of course, CASE systems.

What was interesting about Robien’s post was the fact that it focused on the application of Neural Networks to prediction. With the availability of a public dataset we were able to repeat the analysis using our own Neural Networks as well as our classical approaches. The results will be reported elsewhere. What I want to comment on is PMR’s post regarding the quality of data. Peter commented, relative to our own efforts at ChemSpider: “There is little point in collecting 10 million structures if you cannot rely on any of them. It actually detracts from the hard work of people like Stefan, Christoph and others on NMRShiftDB as the general user of the database will judge all entries by the lowest common denominator.” After analyzing the data, over 200,000 individual chemical shifts, I can say DON’T judge by the lowest common denominator. There is some junk in there..as seen by Wolfgang Robien. But our estimate after our analysis is that likely fewer than 250 data points are in error. These are truly excellent statistics if you consider that this is an open access system where people are depositing data and that these data are free to download and utilize, even for the development of derivative algorithms. It shows that such systems can work. The addition or improvement of rigorous checking algorithms to the NMRSHIFTDB is the next natural step, and flagging suspect data to the submitter will have them check and validate the quality of their input. This will catch many errors during the submission process.
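To make the numbers and the checking idea concrete, here is a minimal sketch in Python. It is not how NMRSHIFTDB or our own software does it. The `ShiftRecord` type, the `flag_suspect_shifts` function, the 20 ppm threshold and the example shift values are all hypothetical, made up purely for illustration. The idea is simply to compare each deposited shift with a predicted value and hand anything wildly off back to the submitter.

```python
from dataclasses import dataclass


@dataclass
class ShiftRecord:
    """One deposited chemical shift paired with a predicted value (hypothetical type)."""
    atom_id: int
    experimental_ppm: float
    predicted_ppm: float


def flag_suspect_shifts(records, threshold_ppm=20.0):
    """Return records whose deposited shift differs from the prediction
    by more than threshold_ppm. The 20 ppm default is an arbitrary
    illustrative cutoff, not a validated one."""
    return [
        r for r in records
        if abs(r.experimental_ppm - r.predicted_ppm) > threshold_ppm
    ]


def error_rate(n_errors, n_total):
    """Fraction of data points estimated to be in error."""
    return n_errors / n_total


# Invented example values, not taken from any database.
records = [
    ShiftRecord(atom_id=1, experimental_ppm=128.5, predicted_ppm=129.1),
    ShiftRecord(atom_id=2, experimental_ppm=21.3, predicted_ppm=170.2),  # looks like a mis-assignment
    ShiftRecord(atom_id=3, experimental_ppm=55.0, predicted_ppm=56.4),
]

suspects = flag_suspect_shifts(records)
print([r.atom_id for r in suspects])         # atoms to send back to the submitter
print(f"{error_rate(250, 200_000):.3%}")     # upper bound from the estimate above: 0.125%
```

With fewer than 250 suspect points in over 200,000 shifts, the estimated error rate is below 0.125%. A real checker would need a sensible prediction method and nucleus-specific thresholds; the sketch only shows the shape of the check.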

So, my compliments to Christoph and the team. The quality is excellent: there are “large errors” but they are minimal in number. I’ve already sent him a report to help cleanse the database, though I didn’t compare it with that of Robien…likely we saw the same things since they were very obvious. These errors should not detract from the effort..with >200,000 data points it is obvious that there would be some. For ChemSpider we have the same problem…with >10 million structures there are errors….lots of them. But it’s very useful all the same!

  4. D.Riuos says:

    Interested in your use of Neural Networks as a prediction system.

  5. Antony Williams says:

  6. Bill Nichols says:

    Hey Tony,

    I hope all is going well, and based on your LinkedIn activity I’m guessing things are brilliant.

    Like others I’m using empirical formula information including accurate mass MS/MS data. I recently reread the ChemSpider “Known Unknowns” paper and I’m still frustrated that there is no apparent means to use MS/MS substructure information as an orthogonal filter for the empirical formula search. I sent you a note on the subject approximately a year ago at the RSC website but I’m guessing you never saw or received the message. At any rate, is there some way to provide an empirical formula and known substructures??

