Copyright©2009 Antony Williams
Today I received an email via the CHMINF list server pointing to the following Press Release. Part of the press release is shown here:
“In collaboration with the German National Library of Science and Technology (TIB) Thieme is the first publisher to make primary chemistry data accessible worldwide. Analytical data, from various experiments, is the foundation of research work and scientific papers. From now on, primary data will be registered and made available online via the Thieme eJournals website (www.thieme-connect.com/ejournals) using digital object recognition in the form of Digital Object Identifiers (DOI). This will enable scientists to easily locate research articles, including accompanying data, and make enhanced use of the scientific content.”
There has been a lot of discussion over the years about making “primary data” available. We offered to do this on the ChemSpider Journal of Chemistry: if people wanted to submit analytical data with an article that we published, we would post them as spectra associated with the article. Unfortunately, the general consensus from the few conversations I had was that preparing and depositing data is a lot of work. This is one of the reasons that, until now, publishers have generally made spectral data available only as plots and printouts, usually as electronic supplementary data. These data ARE valuable even in that form but, and I believe the majority of scientists would agree, they would be more valuable in a format that allows display in online applets and download for processing, expansion, etc. The RSC would certainly welcome the availability of spectral data associated with publications, especially since they can now be hosted on ChemSpider.
Thieme have actually managed to pull off quite a coup and I commend them for their efforts. The first example datasets are available here. The listing includes “FIDs and associated files for the 1H, 13C and DEPT NMR spectra for compounds 14, (SS)-23, (SS)-25, (RS)-26, 27, (SS)-28, (RS,SS)-29, 30, (RS)-36, (SS)-36, (SS)-37, 38, (RS)-39, (SS)-39, (SS)-44, (RS)-46, (SS)-46, (RS)-48, (SS)-48, (SS)-49, 52, (RS)-53, (RS)-55, (RS)-57, (SS)-57, (SS)-58, (RS)-61, (SS)-61, (RS)-62, (SS)-62, (RS)-65 and (SS)-65 are summarized.” That’s a lot of data.
Since these are primary data they cannot be copyrighted, so I chose to download the data, take a look and insert a couple into ChemSpider as an example of what can be done with these data. The associated PDF for the data says “The files can be processed using the following programs: MestReC, Bruker’s WINNMR and XWINNMR.” The files came as binary Bruker files, so they needed to be reprocessed and, in order to be deposited, converted to JCAMP-DX, the format supported by the JSpecView applet used on ChemSpider to display spectra. To do this I am fortunate to have access to ACD/NMR Processor, a product I managed for a few years while working at ACD/Labs. This product also supports the Bruker format, so I imported the data, processed it, exported it as JCAMP and imported it to ChemSpider. For compound 14 I have attached the H1 and C13 spectra and they can be seen here. I didn’t attach the “DEPT spectrum” yet. Downloading the spectra, redrawing the structure, processing the spectra, exporting as JCAMP and depositing to ChemSpider took me about 15 minutes. However, there are a lot of spectra and it will take me a while. There are 32 compounds and I assume 3 spectra per compound (HNMR, CNMR and DEPT), so that’s a total of 96 spectra. It’ll take me about 10-12 hours just to deposit this collection, so that’s a lot of work to do in my spare time. If anyone wants to help out and can process the spectra for deposition, please do!
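The arithmetic behind that estimate can be sketched as follows. This is only a rough check: the post gives 15 minutes for compound 14 with two spectra deposited, and the even split of that time per spectrum is an assumption made here, as are the names used in the code.

```python
# Rough workload estimate for depositing the Thieme NMR collection.
# Figures from the post: 32 compounds, an assumed 3 spectra each,
# and about 15 minutes for compound 14 (1H and 13C deposited).

COMPOUNDS = 32
SPECTRA_PER_COMPOUND = 3          # 1H, 13C and DEPT (the post's assumption)
MINUTES_FOR_COMPOUND_14 = 15      # covered two spectra
SPECTRA_DONE_FOR_COMPOUND_14 = 2


def estimate(compounds, spectra_per_compound, minutes_per_spectrum):
    """Return (total spectra, total hours) for the whole collection."""
    total_spectra = compounds * spectra_per_compound
    total_hours = total_spectra * minutes_per_spectrum / 60
    return total_spectra, total_hours


# Assumption: the 15 minutes divides evenly across the two spectra.
minutes_per_spectrum = MINUTES_FOR_COMPOUND_14 / SPECTRA_DONE_FOR_COMPOUND_14

total_spectra, total_hours = estimate(
    COMPOUNDS, SPECTRA_PER_COMPOUND, minutes_per_spectrum
)
print(f"{total_spectra} spectra, roughly {total_hours:.0f} hours")
```

Under that even-split assumption the figure lands at the upper end of the 10-12 hour range; steps that happen once per compound, such as redrawing the structure, would pull it down a little.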
One of the spectra is shown below using the Spectral Embed function we introduced previously:
This is a rich collection of data, and it can feed the Spectral Game described in this article. I look forward to getting the data onto ChemSpider and will be following up with Thieme to see if we can work together to host the data in a more generic format in the future. It’s a shame that the data are locked into a binary file format that needs reprocessing to view, and I believe display through the JSpecView applet is advantageous for all. I encourage Thieme to consider also making the structure collection available in molfile, SMILES, InChI and InChIKey formats: the InChIs will make the article discoverable via internet searches and through the InChI Resolver, while downloadable molfiles will speed up the loading process to ChemSpider and other systems.
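To show what a text-based alternative to the binary Bruker files looks like, here is a minimal sketch of writing an already-processed 1D spectrum as JCAMP-DX. It is not the export used by ACD/NMR Processor and has not been checked against JSpecView; the function name, its parameters and the choice of header records are assumptions for illustration, and a real export would carry more metadata.

```python
def to_jcamp_dx(title, first_x_hz, last_x_hz, intensities,
                observe_mhz, nucleus="^1H", per_line=8):
    """Serialise an evenly spaced 1D spectrum as a simplified JCAMP-DX text.

    Hypothetical helper: x values are in Hz and evenly spaced between
    first_x_hz and last_x_hz; intensities are plain numbers.
    """
    n = len(intensities)
    if n < 2:
        raise ValueError("need at least two data points")
    step = (last_x_hz - first_x_hz) / (n - 1)

    header = [
        f"##TITLE={title}",
        "##JCAMP-DX=5.01",
        "##DATA TYPE=NMR SPECTRUM",
        "##DATA CLASS=XYDATA",
        "##ORIGIN=unknown",   # placeholder, to be filled in by the depositor
        "##OWNER=unknown",    # placeholder, to be filled in by the depositor
        f"##.OBSERVE FREQUENCY={observe_mhz}",
        f"##.OBSERVE NUCLEUS={nucleus}",
        "##XUNITS=HZ",
        "##YUNITS=ARBITRARY UNITS",
        "##XFACTOR=1",
        "##YFACTOR=1",
        f"##FIRSTX={first_x_hz:.6f}",
        f"##LASTX={last_x_hz:.6f}",
        f"##FIRSTY={intensities[0]:.6g}",
        f"##NPOINTS={n}",
        "##XYDATA=(X++(Y..Y))",
    ]

    # Each data line starts with the x value of its first point,
    # followed by the y values for that line.
    data = []
    for start in range(0, n, per_line):
        x = first_x_hz + start * step
        ys = " ".join(f"{y:.6g}" for y in intensities[start:start + per_line])
        data.append(f"{x:.6f} {ys}")

    return "\n".join(header + data + ["##END="]) + "\n"


demo = to_jcamp_dx("demo spectrum", 0.0, 30.0, [1, 2, 3, 4],
                   observe_mhz=400.13, per_line=2)
print(demo)
```

The point of the sketch is that everything in the file is readable text: the header records describe the axis and the nucleus, and the data block can be inspected or parsed without vendor software.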