Archive for October 26th, 2009

I’ve been in discussions with JC Bradley and Andy Lang about the Open Notebook Science Solubility Data project. Specifically, we’ve been comparing logP predictions from the CDK with those listed on ChemSpider. We actually have six values of logP listed for some records. For example, for toluene we have 4 predicted values, 1 experimental value from a database and 1 experimental value from a publication. These are shown below:

[Figure: logP values listed for toluene on ChemSpider]

There are three predicted logP values from three different algorithms (ACD/LogP, XlogP and AlogPs), as shown at the top of the figure. There is a predicted value and a database value from the EPISuite from the EPA (middle of the figure), and there is a logP value from a publication, with the link out indicated by the arrow (this datum was deposited by Egon Willighagen when he deposited the data from his publication). If you examine the list of data, both experimental and predicted, you will see a consensus value of around 2.65 +/- error. This should be compared with the CDK value listed in the ONS spreadsheet, which gives a predicted value of 0.64. This was the primary reason that we were discussing the comparison: the values of predicted logP from CDK were different from the predicted values listed on ChemSpider for a number of examples in the spreadsheet.

Egon and I exchanged a couple of emails discussing the fact that logP predictions could be generated by a number of parties if there was a good Open Data training set available. A recent publication entitled “Calculation of Molecular Lipophilicity: State of the Art and Comparison of Log P Methods on More Than 96000 Compounds” performed a thorough analysis of different logP methods on a very large dataset. The publication is available online here. They compared “the predictive power of representative methods for one public (N = 266) and two in house datasets from Nycomed (N = 882) and Pfizer (N = 95 809). A total of 30 and 18 methods were tested for public and industrial datasets, respectively.” During the work they derived a simple equation based on the number of carbon atoms, NC, and the number of hetero atoms, NHET: log P = 1.46(±0.02) + 0.11(±0.001) NC – 0.11(±0.001) NHET. This equation was shown to outperform a large number of the programs benchmarked in the study. It would certainly be easy to implement on ChemSpider and, just out of interest, applying it to toluene (C7H8, so NC = 7 and NHET = 0) gives 1.46 + 0.11 × 7 = 2.23. Compare this with the values listed above.
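To show just how little is needed to implement that equation, here is a minimal Python sketch. It is not code from the publication or from ChemSpider. The function names are hypothetical, the formula parser only handles flat molecular formulas such as C7H8 (no brackets or charges), and it assumes that “hetero atoms” means every atom other than carbon and hydrogen. The error terms on the coefficients are ignored.

```python
import re


def count_atoms(formula):
    """Count atoms in a flat molecular formula such as 'C7H8' or 'C2H6O'.

    Assumption: no parentheses, charges or isotopes in the formula.
    """
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def simple_logp(formula):
    """Apply log P = 1.46 + 0.11*NC - 0.11*NHET using central coefficient values."""
    counts = count_atoms(formula)
    n_c = counts.get("C", 0)
    # Assumption: a hetero atom is any atom that is neither carbon nor hydrogen.
    n_het = sum(n for symbol, n in counts.items() if symbol not in ("C", "H"))
    return 1.46 + 0.11 * n_c - 0.11 * n_het


if __name__ == "__main__":
    toluene = simple_logp("C7H8")
    print(f"simple equation, toluene: {toluene:.2f}")
    # Values quoted in this post, for comparison.
    print(f"difference from the ~2.65 consensus: {toluene - 2.65:+.2f}")
    print(f"difference from the CDK value of 0.64: {toluene - 0.64:+.2f}")
```

For toluene the equation lands much closer to the consensus of about 2.65 than the CDK value of 0.64 does, which is the point of the comparison above.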

Unfortunately there don’t appear to be many Open logP datasets available for people to use as training sets. Also, given the thorough work reported in the publication above, is it necessary to build yet another logP prediction algorithm? ACD/Labs have made their logP prediction software free for download (http://www.acdlabs.com/download/logp.html), the VCCLab software is available for free (http://www.vcclab.org/lab/alogps/), the EPISuite software is available for free (http://www.epa.gov/oppt/exposure/pubs/episuite.htm) and, if you just want to predict a value for a compound not on ChemSpider, you can use the services here: http://www.chemspider.com/Services.aspx.

However, even though there are a lot of predictors available, it still makes sense to gather data and provide it as an experimental dataset, made available as Open Data so that the developers of such algorithms can take advantage of structural diversity and fresh data to potentially improve their models. If you have any logP data available please point me to the data to download or contact me offline to discuss. We are presently working on enhancing our data model to provide improved access to experimental data on ChemSpider as well as access to the predicted data via web services. More to follow…

Today I received an email via the CHMINF list server pointing to the following Press Release. Part of the press release is shown here:

“In collaboration with the German National Library of Science and Technology (TIB) Thieme is the first publisher to make primary chemistry data accessible worldwide. Analytical data, from various experiments, is the foundation of research work and scientific papers. From now on, primary data will be registered and made available online via the Thieme eJournals website (www.thieme-connect.com/ejournals) using digital object recognition in the form of Digital Object Identifiers (DOI). This will enable scientists to easily locate research articles, including accompanying data, and make enhanced use of the scientific content.”

There has been a lot of discussion over the years regarding making “primary data” available. We offered to do this on the ChemSpider Journal of Chemistry: if people wanted to submit analytical data with an article that we published then we would post them as spectra associated with the article. Unfortunately the general consensus, based on a few conversations that I had, is that it is a lot of work to prepare data and deposit it. This is one of the reasons that, until now, publishers have generally made spectral data available as plots and printouts, usually as electronic supplementary data. These data ARE valuable even in that form but, and I believe that the majority of scientists would agree, they would be more valuable if they were available in a format that allowed display in online applets, download for processing, expansions etc. The RSC would certainly welcome the availability of spectral data associated with publications, especially since they can now be hosted on ChemSpider.

Thieme have actually managed to pull off quite a coup and I commend them for their efforts. The first example datasets are available here. The listing includes “FIDs and associated files for the 1H, 13C and DEPT NMR spectra for compounds 14, (SS)-23, (SS)-25, (RS)-26, 27, (SS)-28, (RS,SS)-29, 30, (RS)-36, (SS)-36, (SS)-37, 38, (RS)-39, (SS)-39, (SS)-44, (RS)-46, (SS)-46, (RS)-48, (SS)-48, (SS)-49, 52, (RS)-53, (RS)-55, (RS)-57, (SS)-57, (SS)-58, (RS)-61, (SS)-61, (RS)-62, (SS)-62, (RS)-65 and (SS)-65 are summarized.” That’s a lot of data.

Since these are primary data they cannot be copyrighted, so I chose to download the data, take a look and insert a couple into ChemSpider as an example of what can be done with these data. The associated PDF for the data says “The files can be processed using the following programs: MestReC, Bruker’s WINNMR and XWINNMR.” The files came as binary Bruker files so they needed to be reprocessed and, in order to be deposited, had to be converted to JCAMP-DX format, the format supported by the JSpecView applet used on ChemSpider to display spectra. In order to do this I am fortunate to have access to ACD/NMR Processor, a product I managed for a few years while working at ACD/Labs. This product also supports the Bruker format, so I imported the data, processed it, exported it as JCAMP and imported it to ChemSpider. For compound 14 I have attached the H1 and C13 spectra and they can be seen here. I didn’t attach the “DEPT spectrum” yet. Downloading the spectra, redrawing the structure, processing the spectra, exporting as JCAMP and depositing to ChemSpider took me about 15 minutes. However, there are a lot of spectra and it will take me a while. There are 32 compounds and I assume 3 spectra per compound (HNMR, CNMR and DEPT), so that’s a total of 96 spectra. It’ll take me about 10-12 hours just to deposit this collection, so that’s a lot of work to do in my spare time. If anyone wants to help out and can process the spectra to deposit, please do!
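As a rough check on that estimate, the arithmetic can be sketched in a few lines of Python. The function name is hypothetical, and the per-spectrum time is my own assumption: it takes the 15 minutes spent on compound 14 (two spectra) and spreads it evenly, giving 7.5 minutes per spectrum, even though part of that time went into redrawing the structure.

```python
def deposit_estimate(compounds, spectra_per_compound, minutes_per_spectrum):
    """Return (total number of spectra, estimated hours to deposit them all)."""
    total_spectra = compounds * spectra_per_compound
    hours = total_spectra * minutes_per_spectrum / 60
    return total_spectra, hours


if __name__ == "__main__":
    # 32 compounds and 3 spectra each are the figures from the post;
    # 7.5 minutes per spectrum is the assumption described above.
    total, hours = deposit_estimate(32, 3, 15 / 2)
    print(f"{total} spectra, roughly {hours:.0f} hours")
```

Under those assumptions the total comes out at 96 spectra and about 12 hours, the upper end of the range quoted above.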

One of the spectra is shown below using the Spectral Embed function we introduced previously:

This is a rich collection of data…it can feed the Spectral Game described in this article. I look forward to getting the data onto ChemSpider and will be following up with Thieme to see if we can work together to host the data in a more generic format in the future. It’s a shame that the data are locked into a binary file format that needs reprocessing to view, and I believe display through the JSpecView applet is advantageous for all. I encourage Thieme to consider also making the structure collection available in molfile, SMILES, InChI and InChIKey formats: the InChIs will make the article discoverable via internet searches and through the InChI Resolver, while the download of molfiles will speed up the loading process to ChemSpider and other systems.