During the past couple of weeks there have been a number of comments regarding the association of physicochemical properties with chemical structures in the ChemSpider database. The comments have focused specifically on inorganic and organometallic compounds.

Based on this feedback we are considering a series of actions regarding physicochemical predictions, outlined below.

1) Filter the ChemSpider database and remove the following PhysChem predictions (ACD/LogP, ACD/LogD (pH 5.5), ACD/LogD (pH 7.4), Number of Rule of 5 Violations, Number of H bond acceptors, Number of H bond donors, Number of Freely Rotating Bonds, Polar Surface Area) for any substance that falls into one of the following categories:

• Multi-component substances
• Substances represented as a single atom
• Radicals
• Structures with a delocalized charge
• Structures containing isotopes
• Substances containing elements other than As, B, Br, C, Cl, F, Ge, H, I, N, O, P, Pb, S, Se, Si, Sn (the elements supported by the ACD/PhysChem predictors)
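
The criteria above could be expressed as a simple pre-filter. The following is only a minimal sketch in Python, not the ChemSpider or ACD/Labs implementation: the `Atom` and `Structure` classes, their field names, and the function names are hypothetical, and the radical, isotope, and delocalized-charge flags are assumed to have been set already by whatever toolkit parsed the structure. Only the element list is taken from the post. The sketch also assumes that "a single atom" means a record whose atom list has exactly one entry.

```python
from dataclasses import dataclass
from typing import List, Optional

# Elements listed in the post as supported by the ACD/PhysChem predictors.
SUPPORTED_ELEMENTS = frozenset(
    "As B Br C Cl F Ge H I N O P Pb S Se Si Sn".split()
)


@dataclass
class Atom:
    """Hypothetical atom record; field names are assumptions."""
    element: str
    isotope: Optional[int] = None  # explicit mass number, if one was specified
    is_radical: bool = False


@dataclass
class Structure:
    """Hypothetical structure record: one list of atoms per component."""
    components: List[List[Atom]]
    has_delocalized_charge: bool = False  # assumed to be set by the parser


def exclusion_reasons(structure: Structure) -> List[str]:
    """Return every listed criterion that the structure triggers."""
    atoms = [atom for component in structure.components for atom in component]
    reasons = []
    if len(structure.components) > 1:
        reasons.append("multi-component")
    if len(atoms) == 1:
        reasons.append("single atom")
    if any(atom.is_radical for atom in atoms):
        reasons.append("radical")
    if structure.has_delocalized_charge:
        reasons.append("delocalized charge")
    if any(atom.isotope is not None for atom in atoms):
        reasons.append("isotope")
    unsupported = sorted({atom.element for atom in atoms} - SUPPORTED_ELEMENTS)
    if unsupported:
        reasons.append("unsupported elements: " + ", ".join(unsupported))
    return reasons


def should_predict(structure: Structure) -> bool:
    """Predict only when none of the exclusion criteria apply."""
    return not exclusion_reasons(structure)
```

Returning the list of reasons rather than a bare yes/no would make it possible to report why a given record has no predicted values, and to relax a single criterion (for example isotopes, or a lone water of solvation) later without touching the others.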

A question for readers: do you believe that structures containing isotopes should be excluded from PhysChem prediction of the properties listed? I have my own opinions but would like community feedback on whether this exclusion is necessary and whether it should apply to all isotopes. Also, should all multi-component systems be excluded? For example, if there is one water of solvation present, should the parameters NOT be calculated for the primary component?

2) For any future updates to the database, pre-filter using the criteria listed above.

3) Apply the criteria above before performing PhysChem predictions on the Services Page.

Our predicted properties for certain structures have prompted suggestions that we should change how we apply these algorithms. The proposed path forward is listed above. We welcome feedback from the community! These questions have also been posted to PMR’s blog supporting the suggestions. Please provide your feedback at either site; it will be important in helping us make the decision. There may not be sufficient feedback, but a decision does need to be made and we will determine a path forward regardless.


3 Responses to “Physical Property Predictions – Filtering Out Potential Problematic Data on ChemSpider…or is it NOT a problem?”

  1. Antony Williams says:

    This is a link back to Peter Murray-Rust’s comments at his blog to retain linkage.

    # pm286 Says:
    May 16th, 2007 at 7:39 am


    1) Do people believe that isotopes will make a difference (within prediction error) to the calculation of the physicochemical properties predicted? I have my own judgments but put this question out there for public feedback.

    PMR> Deuterium has a significant influence on many physical properties – e.g. boiling point of D2O – and obviously on vibrational frequencies. But in general it depends on the accuracy and precision of the property. We, for example, compute phonons of crystalline materials and these are certainly isotope dependent.

    2) Should all multi-component systems be excluded? I demonstrated clearly in an earlier post that prediction of LogP for CaCO3 was appropriate so should it be excluded or not?

    PMR> It depends what your properties are. Until you address the aspect of physical state I would strongly suggest you omit multi-component systems. For example, we are working with calcite, vaterite and other forms of CaCO3, and these have many properties that depend on the polymorph. In principle (log)P should be independent of polymorph, but I would be suspicious of this for many systems.

    You commented “These are very close to the filters based on molecular formula which I would recommend. Since I don’t have knowledge of your metadata (e.g. date, format, contributor, etc.) I can’t comment, but it may be that these are also useful filters.” So, the ones I have suggested are close…what additional ones would you suggest?

    PMR> I don’t know what your properties actually are – the only ones displayed are MW, (log)P, polar surface area and volume. Since I don’t know the algorithm for the last two I can’t comment, but I would expect both to depend on molecular flexibility.

    Also, we DO have the date, format and contributor data available. How would you use these data yourself to decide whether to predict physchem properties? Assuming all data are available as MOL/SDF files, how would date of submission and contributor info be used?

    PMR> Since I assume you have compared experiment with prediction I would look to see if outliers showed any predictability of source, date, etc. For example some submitters may routinely get molecular formulae garbled (e.g. hydrogen atoms). We have found that some have garbled Celsius and Kelvin – several crystallographic experiments were reported at 298 degC, which is almost certainly an error.

  2. Greg Pearl says:

    All of these questions essentially boil down to the philosophical question of whether or not any value is better than nothing. The problem is that the answer depends on the objective of the individual user and is hence much like an opinion: everyone has one.

    Predicting Isotopes: The only isotope that has the potential to significantly affect the calculated physical properties (excluding MW) would be deuterium when it is replacing hydrogen in a potential hydrogen bond.

    Multi-component systems: There is potential value in providing the predicted properties for these systems. It might be useful to add a text field that holds a count of the number of molecules/ions in a record. Users could then easily limit their search to compounds containing multiple molecules. Additionally, we can put a notice on the multi-component systems indicating that the predicted value is for the largest molecule/ion and is based upon protonation of that molecule/ion to form a neutral species.

    Filtering Data: Generally speaking, the more search options the better, with the caveat that the interface must remain uncluttered. So I would recommend creating an additional search panel that enables the user to search any data field or metadata that is available.

    So there are three possible solutions…
    1. Create 2 different systems (1 for curated data and 1 for non-curated)
    2. Design the data structure so that users can easily select which type of data they want to use
    3. Assign a weighting factor to the data that describes its quality. So, using the Wikipedia concept, allow the users to assign a grade to the data and then aggregate the data accordingly

    Good luck. Glad to see that progress is being made to improve access to chemical data on the internet….

  3. ChemSpider Blog » Blog Archive » Prediction Errors and Filtering the ChemSpider Database - How Accurate Does a Prediction Need to Be? says:

    [...] had previously added comments to his post regarding my questions. Based on this feedback and other comments on blog postings and email exchanges it’s time to summarize our path forward and the reasons [...]
