I give a thumbs up to the quality of the NMRSHIFTDB. We’ve validated it. Why would I care? I’m an NMR jock at heart. I also work for a commercial software company innovating NMR prediction software and compiling NMR databases as the basis of our work. Does this mean that the commercial software vendors and Open Source/Access communities can coexist and have mutual admiration? I believe so!

After over 18 months of work (the after-hours style…much like blogging) I finally signed off on one of those infamous copyright transfers for Elsevier, now the publishers of Progress in NMR. A 360 page review article is finally submitted – “Computer-Assisted Structure Verification and Elucidation Tools In NMR-Based Structure Elucidation”. Proofs will arrive before the end of the month. It’s the culmination of over ten years of our own work as well as that of many contributors in the domain of CASE (Computer Assisted Structure Elucidation) systems. The complexity of structures that can be solved by computer algorithms is impressive…see examples here. Recently the StrucEluc CASE system solved the structure of an antibiotic of Mw>1150. Three NMR spectroscopists couldn’t solve it…a symbiotic relationship with software is VERY enabling!

One very active player in CASE is Christoph Steinbeck, a member of Blue Obelisk, one of the more active blogging groups on the net today.

Christoph’s group hosts NMRSHIFTDB. Recently ChemSpider linked to NMRSHIFTDB. In parallel I took an interest in the recent critique of its data quality published by Wolfgang Robien, especially since during my “day job” I am directly involved with NMR prediction, structure verification using NMR and, of course, CASE systems.

What was interesting about Robien’s post was the fact that it focused on the application of Neural Networks to prediction. With the availability of a public dataset we were able to repeat the analysis using our own Neural Networks as well as our classical approaches. The results will be reported elsewhere. What I want to confirm is PMR’s post regarding the quality of data. Peter commented, relative to our own efforts at ChemSpider: “There is little point in collecting 10 million structures if you cannot rely on any of them. It actually detracts from the hard work of people like Stefan, Christoph and others on NMRShiftDB as the general user of the database will judge all entries by the lowest common denominator.” After analyzing the data, over 200,000 individual chemical shifts, I can say DON’T judge by the lowest common denominator. There is some junk in there, as seen by Wolfgang Robien. But our estimate after our analysis is that likely fewer than 250 data points are in error. These are truly excellent statistics if you consider that this is an open access system where people are depositing data, that these data are free to download and utilize even for the development of derivative algorithms, and that such systems can work. The addition or improvement of rigorous checking algorithms to the NMRSHIFTDB is the next natural step: flagging suspect data to the submitter will have them check and validate the quality of their input. This will catch many errors during the submission process.
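To make the idea of flagging data at submission time concrete, here is a minimal Python sketch. It is not the NMRSHIFTDB checking code or our own algorithms: the record fields, the `flag_suspect_shifts` helper, the 20 ppm cutoff and the example values are all invented for illustration, and the predicted shifts are assumed to come from whatever prediction engine is available.

```python
from dataclasses import dataclass


@dataclass
class ShiftRecord:
    """One assigned 13C chemical shift (all field names are hypothetical)."""
    atom_id: str
    experimental_ppm: float
    predicted_ppm: float  # assumed to come from some external predictor


def flag_suspect_shifts(records, tolerance_ppm=20.0):
    """Return records whose entered shift deviates from the prediction by
    more than tolerance_ppm. The 20 ppm default is an arbitrary illustrative
    cutoff for "large errors", not a validated value."""
    return [
        r for r in records
        if abs(r.experimental_ppm - r.predicted_ppm) > tolerance_ppm
    ]


# Invented example submission; the second entry mimics a keying error.
submission = [
    ShiftRecord("C1", 128.5, 129.1),
    ShiftRecord("C2", 17.2, 171.8),
    ShiftRecord("C3", 55.9, 56.4),
]

for record in flag_suspect_shifts(submission):
    print(f"Check {record.atom_id}: entered {record.experimental_ppm} ppm, "
          f"predicted {record.predicted_ppm} ppm")
```

A check like this would only ask the submitter to look again; a large deviation can also mean the prediction is wrong, so flagged values should be reviewed rather than rejected automatically.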

So, my compliments to Christoph and the team. The quality is excellent: there are “large errors” but they are minimal in number. I’ve already sent him a report to help cleanse the database, though I didn’t compare it with that of Robien…likely we saw the same things since they were very obvious. These errors should not detract from the effort…with >200,000 data points it is obvious that there would be some. For ChemSpider we have the same problem…with >10 million structures there are errors…lots of them. But it’s very useful all the same!

6 Responses to “Open Source Data, Testing Quality and Returning Value – Interactions with NMRSHIFTDB and the Blue Obelisk Community”

  1. Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » experiment and theory says:

    [...] Open Source Data, Testing Quality and Returning Value – Interactions with NMRSHIFTDB and the Blue … Posted by: Antony Williams in Quality and Content [...]

  2. Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Data validation and protcol validation says:

    [...] you will be aware of Wolfgang Robien’s critique of the NMRShiftDB. Following this critique, Tony Williams from the ChemSpider Blog and Peter Murray-Rust from the Unilever Cambridge Centre for Molecular Informatics replied to [...]

  3. ChemSpider Blog » Blog Archive » Curators Perform Heroic Duties. They Should be Celebrated! says:

    [...] used recently to provide feedback to the Blue Obelisk team member, Christoph Steinbeck, to help clean up errors in the NMRSHIFTDB. While others were attacking the open data effort those of us concerned with the details helped [...]

  4. D.Riuos says:

    Interested in your use of Neural Networks as a prediction system.

  5. Antony Williams says:

    It is clear that more and more chemists are starting to host their own websites. A good web site design (http://www.mywebmaster101.com/Web-Design-Info/) is certainly a key motivator to draw people to the site to expose products and services. These can be designed through microsoft templates (http://www.ewebskins.com/) but the advent of coldfusion (http://www.killerwebz.com/ColdFusion.html) tools offers greater levels of customization in most cases and after setting up the site only hosting becomes the issue. Reseller hosting (http://www.asharedhosting.com/reseller-web-hosting.html) companies such as startlogic (http://www.envisionwebhosting.com/reviews/startlogic-hosting.htm) provides very reasonably priced services.

  6. Bill Nichols says:

    Hey Tony,

    I hope all is going well, and based on your LinkedIn activity I’m guessing things are brilliant.

    Like others I’m using empirical formula information including accurate mass MS/MS data. I recently reread the ChemSpider “Known Unknowns” paper and I’m still frustrated that there is no apparent means to use MS/MS substructure information as an orthogonal filter for the empirical formula search. I sent you a note on the subject approximately a year ago at the RSC website but I’m guessing you never saw or received the message. At any rate, is there some way to provide an empirical formula and known substructures?