I’ve been in discussions with JC Bradley and Andy Lang about the Open Notebook Science Solubility Data project. Specifically, we’ve been comparing logP predictions from the CDK with those listed on ChemSpider. We actually have six values of logP listed for some records. For example, for toluene we have four predicted values, one experimental value from a database and one experimental value from a publication. These are shown below:

[Figure: toluene logP values]

There are three predicted logP values from three different algorithms (ACD/LogP, XlogP and AlogPs), as shown at the top of the figure. There is a predicted value and a database value from the EPA’s EPISuite (middle of the figure), and there is a logP value from a publication, with the link out indicated by the arrow (this datum was deposited by Egon Willighagen when he deposited the data from his publication). If you examine the list of data, both experimental and predicted, you will see that the values cluster around 2.65 +/- error. This should be compared with the CDK value listed in the ONS spreadsheet, which gives a predicted value of 0.64. This was the primary reason that we were discussing the comparison: the values of predicted logP from CDK were different from the predicted values listed on ChemSpider for a number of examples in the spreadsheet.

Egon and I exchanged a couple of emails discussing the fact that logP predictions could be generated by a number of parties if there was a good Open Data training set available. A recent publication entitled “Calculation of Molecular Lipophilicity: State of the Art and Comparison of Log P Methods on More Than 96000 Compounds” performed a thorough analysis of different logP methods on a very large dataset. The publication is available online here. The authors compared “the predictive power of representative methods for one public (N = 266) and two in house datasets from Nycomed (N = 882) and Pfizer (N = 95 809). A total of 30 and 18 methods were tested for public and industrial datasets, respectively.” During the work they derived a simple equation based on the number of carbon atoms, NC, and the number of hetero atoms, NHET: log P = 1.46 (±0.02) + 0.11 (±0.001) NC – 0.11 (±0.001) NHET. This equation was shown to outperform a large number of the programs benchmarked in the study. It would certainly be easy to implement on ChemSpider and, just out of interest, applying it to toluene (NC = 7, NHET = 0) gives 1.46 + 0.77 = 2.23. Compare this with the values listed above.
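To show how little is involved in applying the equation, here is a minimal Python sketch. It is my own illustration, not code from the publication or from ChemSpider. The function names are hypothetical, the input is a plain molecular formula such as C7H8 (no brackets, charges or isotopes), and I am assuming that “hetero atoms” means every atom other than carbon and hydrogen. The quoted uncertainties on the coefficients are ignored.

```python
import re

# Coefficients from the equation quoted above (uncertainties omitted).
INTERCEPT = 1.46
COEFF_NC = 0.11
COEFF_NHET = 0.11


def count_atoms(formula):
    """Count atoms in a simple molecular formula such as 'C7H8'.

    Hypothetical helper: handles only element symbols followed by an
    optional count, with no brackets, charges or isotopes.
    """
    counts = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def simple_logp(formula):
    """Estimate logP as 1.46 + 0.11 * NC - 0.11 * NHET."""
    counts = count_atoms(formula)
    nc = counts.get("C", 0)
    # Assumption: hetero atoms are all atoms other than carbon and hydrogen.
    nhet = sum(n for element, n in counts.items() if element not in ("C", "H"))
    return INTERCEPT + COEFF_NC * nc - COEFF_NHET * nhet


print(round(simple_logp("C7H8"), 2))  # toluene: prints 2.23
```

Because the estimate depends only on atom counts, isomers with the same formula always get the same value, which is one reason to treat it as a rough baseline rather than a full prediction.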

Unfortunately, there don’t appear to be many Open logP datasets available for people to use as training sets. Also, given the thorough work reported in the publication above, is it necessary to build yet another logP prediction algorithm? ACD/Labs have made their logP prediction software free for download (http://www.acdlabs.com/download/logp.html), the VCCLab software is available for free (http://www.vcclab.org/lab/alogps/), the EPISuite software is available for free (http://www.epa.gov/oppt/exposure/pubs/episuite.htm) and, if you just want to predict a value for a compound not on ChemSpider, you can use the services here: http://www.chemspider.com/Services.aspx.

However, even though there are a lot of predictors available, it still makes sense to gather experimental data and make it available as Open Data, so that the developers of such algorithms can take advantage of the structural diversity and fresh data to potentially improve their models. If you have any logP data available, please point me to the data to download or contact me offline to discuss. We are presently working on enhancing our data model to provide improved access to experimental data on ChemSpider, as well as access to the predicted data via web services. More to follow…

4 Responses to “Gathering Physicochemical Data Onto ChemSpider”

  1. Egon Willighagen says:

    Hi Antony,

    I would even be interested in doing some data extraction from literature… I have done this for NMR data too, and drawing chemical structures is just fun. So, if people know a nice paper that reports experimental LogP values…

    The next question would be how I could enter this into ChemSpider… I would either need a webform or, alternatively, I could hack up a CML editor for Bioclipse to hold: 1. the chemical structure, 2. the InChI, 3. the experimental LogP value, and 4. the literature from which the data was taken. Such an editor is available in Bioclipse for submitting assigned NMR spectra with the matching molecular structure, and bibliography too.

    BTW, it is rather interesting to observe that the paper you published with the 96k data points is about comparing the modeling methods, which is rather meaningless and can be done by any student, whereas the (scientific) value in that paper is in the aggregation of the data. However, the latter is practically impossible to get awarded.

    Antony, when does ChemSpider step up and have their data entries indexed, so that listing experimental (curated, Open) data from a paper actually causes a citation of that paper? I believe that’s the kind of incentive we need to solve the lack of data moving from literature into databases. Can we please start talking to Thomson about this possibility?


  2. Antony Williams says:

    If drawing structures is fun you should have played in the digitonin curation (http://www.chemspider.com/blog/a-request-for-a-crowdsourced-investigation-of-digitonin.html) :-) I will summarize that situation today. I don’t think it’s over yet…

    Why not go ahead and get the data deposited into Bioclipse now? If I get any data then I’ll direct it to you for now. Let me know the URL for the NMR entry and I can try it out, and then the LogP one later.

    Afterwards we can move the data into ChemSpider, if there is any. We’re in the middle of delivering a development sprint so can’t hack up the form etc right now.

    We’ll be chatting with numerous publishers about harvesting data, and maybe Thomson when the time is right.

  3. Soaring Bear says:

    >with the thorough work reported in the publication above is it necessary to build yet another logP prediction algorithm?

    For the toluene example mentioned, 2.65 – 2.23 = 0.42 which on the log scale is huge. So although this new work is impressive, it is not the last word. Until we can better quantify the variables of solubility, and the biochemical relevance, there will be more of such thorough efforts.

  4. Antony Williams says:

    J. Med. Chem. 1969, Vol. 12, p. 692 has the logP of toluene as 2.11, which is even outside of that predictive spread. Based on my experiences of working with large pharma companies, the accurate and reproducible determination of logP is challenging. I would say that toluene is well characterized at this point and a settled value of around 2.75 +/- 0.1 is accepted. The “estimate” of 2.23 is good enough for binning but is not a prediction. But it’s way better than the AlogP value.
