I’ve been in discussions with JC Bradley and Andy Lang about the Open Notebook Science Solubility Data project. Specifically we’ve been comparing logP predictions from the CDK versus those listed on ChemSpider. We actually have six values of logP listed for some records. For example, for toluene we have 4 predicted values, 1 experimental value from a database and 1 experimental value from a publication. These are shown below:
There are three predicted logP values from three different algorithms (ACD/LogP, XlogP and AlogPs) as shown at the top of the figure. There is a predicted value and a database value from the EPISuite from the EPA (middle of the figure) and there is a LogP value from a publication with the link out indicated by the arrow (this datum was deposited by Egon Willighagen when he deposited the data from his publication). If you examine the list of data, both experimental and predicted, you will see a general value of around 2.65+/- error. This should be compared with the CDK value listed in the ONS spreadsheet that gives a predicted value of 0.64. This was the primary reason that we were discussing the comparison…the values of predicted logP from CDK were different from the predicted values listed on ChemSpider for a number of examples in the spreadsheet.
Egon and I exchanged a couple of emails discussing the fact that logP predictions could be generated by a number of parties if there was a good Open Data training set available. A recent publication entitled “Calculation of Molecular Lipophilicity:State of the Art and Comparison of Log P Methods on More Than 96000 Compounds” performed a thorough analysis of different logP methods on a very large dataset. The publication is available online here. They compared “the predictive power of representative methods for one public (N = 266) and two in house datasets from Nycomed(N = 882) and Pfizer (N = 95 809). A total of 30 and 18 methods were tested for public and industrial datasets, respectively.” During the work they derived a simple equation based on the number of carbon atoms, NC, and the number of hetero atoms, NHET: log P = 1.46(±0.02) + 0.11(±0.001) NC – 0.11(±0.001) NHET. This equation was shown to outperform a large number of programs benchmarked in this study. This would certainly be easy to implement on ChemSpider and, just out of interest, applying this equation to toluene gives us a value of 2.23. Compare this with the values listed above.”
Unfortunately there doesn’t appear to be too many Open logP datasets available around for people to use as training sets. Also, with the thorough work reported in the publication above is it necessary to build yet another logP prediction algorithm? ACD/Labs have made their logP prediction software free for download (http://www.acdlabs.com/download/logp.html), the VCCLab software is available for free (http://www.vcclab.org/lab/alogps/), the EPISuite software is available for free (http://www.epa.gov/oppt/exposure/pubs/episuite.htm) and if you just want to predict a value for a compound not on ChemSpider then you can use the services here: http://www.chemspider.com/Services.aspx.
However, even though there are a lot of predictors available it still makes sense to gather data and provide it as an experimental dataset, made available as Open Data for the developers of such algorithms to ake the benefits of structural diversity and fresh data to potentially improve their models. If you have any logP data available please point me to the data to download or contact me offline to discuss. We are presently working on enhancing our data model to provide improved access to experimental data on ChemSpider as well as access to the predicted data via web services. More to follow…