Open Source Data, Testing Quality and Returning Value – Interactions with NMRSHIFTDB and the Blue Obelisk Community
Posted by: Antony Williams in Quality and Content
Copyright©2007 Antony Williams
I give a thumbs up to the quality of the NMRSHIFTDB. We’ve validated it. Why would I care? I’m an NMR jock at heart. I also work for a commercial software company innovating NMR prediction software and compiling NMR databases as the basis of our work. Does this mean that commercial software vendors and the Open Source/Access communities can coexist and have mutual admiration? I believe so!
I finally signed off on one of those infamous copyright transfers for Elsevier, now the publishers of Progress in NMR. After over 18 months of work (in that after-hours style…much like blogging), a 360-page review article is finally submitted: “Computer-Assisted Structure Verification and Elucidation Tools In NMR-Based Structure Elucidation”. Proofs will arrive before the end of the month. It’s the culmination of over ten years of our own work as well as that of many contributors in the domain of CASE (Computer Assisted Structure Elucidation) systems. The complexity of structures that can be solved by computer algorithms is impressive…see examples here. Recently the StrucEluc CASE system solved the structure of an antibiotic of Mw > 1150. Three NMR spectroscopists couldn’t solve it…a symbiotic relationship with software is VERY enabling!
Christoph’s group hosts NMRSHIFTDB. Recently ChemSpider linked to NMRSHIFTDB. In parallel I took an interest in Wolfgang Robien’s recently published critique of the quality of the data, especially since during my “day job” I am directly involved with NMR prediction, structure verification using NMR and, of course, CASE systems.
What was interesting about Robien’s post was the fact that it focused on the application of Neural Networks to prediction. With the availability of a public dataset we were able to repeat the analysis using our own Neural Networks as well as our classical approaches. The results will be reported elsewhere. What I want to address is PMR’s post regarding the quality of data. Peter commented, relative to our own efforts at ChemSpider: “There is little point in collecting 10 million structures if you cannot rely on any of them. It actually detracts from the hard work of people like Stefan, Christoph and others on NMRShiftDB as the general user of the database will judge all entries by the lowest common denominator.” After analyzing the data, over 200,000 individual chemical shifts, I can say DON’T judge by the lowest common denominator. There is some junk in there, as seen by Wolfgang Robien. But our estimate after our analysis is that likely fewer than 250 data points are in error. These are truly excellent statistics if you consider that this is an open access system where people are depositing data, that these data are free to download and utilize even for the development of derivative algorithms, and that such systems can work. The addition or improvement of rigorous checking algorithms in the NMRSHIFTDB is the next natural step, and flagging suspect data to the submitter will have them check and validate the quality of their input. This will catch many errors during the submission process.
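To make the idea of a submission-time check concrete, here is a minimal sketch in Python. It is not how NMRSHIFTDB, Robien’s tools or our own software work; the record layout, the function names and the 20 ppm tolerance are all assumptions made purely for illustration. It flags any deposited shift that deviates strongly from a predicted value (supplied by whatever prediction method is at hand) and also works out the error rate implied by the figures above.

```python
from dataclasses import dataclass


@dataclass
class ShiftRecord:
    """One deposited chemical shift (hypothetical record layout)."""
    structure_id: str
    atom_index: int
    observed_ppm: float
    predicted_ppm: float  # assumed to come from some external prediction method


def flag_suspect_shifts(records, tolerance_ppm=20.0):
    """Return the records whose observed shift differs from the prediction
    by more than tolerance_ppm (an illustrative threshold, not a standard)."""
    return [
        r for r in records
        if abs(r.observed_ppm - r.predicted_ppm) > tolerance_ppm
    ]


def error_rate_percent(n_errors, n_total):
    """Share of data points in error, as a percentage."""
    return 100.0 * n_errors / n_total


# Made-up example data: the second entry looks like a typo or mis-assignment.
records = [
    ShiftRecord("mol-1", 1, 128.5, 128.9),
    ShiftRecord("mol-1", 2, 12.0, 170.3),
    ShiftRecord("mol-2", 1, 77.2, 75.0),
]

suspects = flag_suspect_shifts(records)
for r in suspects:
    print(f"Please check {r.structure_id}, atom {r.atom_index}: "
          f"observed {r.observed_ppm} ppm vs predicted {r.predicted_ppm} ppm")

# Upper bound implied by the post: fewer than 250 errors in over 200,000 shifts.
print(error_rate_percent(250, 200_000))  # 0.125
```

A check like this only flags a value for the submitter to review; it does not reject it, since a large deviation can also mean the prediction is the one that is wrong.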
So, my compliments to Christoph and the team. The quality is excellent; there are “large errors” but they are minimal in number. I’ve already sent him a report to help cleanse the database, though I didn’t compare it with Robien’s…likely we saw the same things since they were very obvious. These errors should not detract from the effort: with >200,000 data points it is obvious that there would be some. For ChemSpider we have the same problem…with >10 million structures there are errors…lots of them. But it’s very useful all the same!