Copyright©2008 Antony Williams
Most readers of this blog will likely be aware of the recent Nature article about ChemSpider. PMR has recently posted the comments he made to the Nature reporter who interviewed him that did not make it into print.
I’ll clarify some of Peter’s statements and differentiate judgments from truths. Some of this is, again, a repeat.
1) “Firstly to say that I commented to Geoff before Chemspider’s announcement that it was adopting CC-SA licences. This is a major advance and has enhanced the importance of Chemspider.”
We have now REMOVED these licenses after the rather interesting situation that resulted from them. Peter had already commented on his own blog: “I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism – CC-SA is conformant to the OK definition, but difficult to operate for re-use). That’s why we use the OKF’s OpenData sticker on CrystalEye.”
2) “It’s (now) based on Web 2.0 principles in that it uses social computing for some of its content and can and has reacted to external changes.” ChemSpider has been based on Web 2.0 principles since the first rollout and I have commented on this previously.
3) “It’s not, however, based on semantic web technology such as RDF and XML and this may be a future limitation in managing some of the more complex content.” We use XML in many places on our site and some of this will be exposed in the future. We have discussed RDF’ing our system with Egon Willighagen but it’s not a priority for us at present. It’s on the list though.
4) “Although I’m not party to the internal design I’d guess it has a relational database, most of whose primary keys are the identifiers for chemical compounds. These identifiers map onto canonicalised chemical structures (one serialization of which is the InChI) and this is the primary mechanism for indexing compounds.” Yes, it’s a relational database, on Microsoft SQL Server. Primary keys are structures and we do use InChIs, a lot.
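To make the idea of indexing compounds by a canonical structure identifier concrete, here is a minimal sketch. It is NOT the ChemSpider schema: it uses an in-memory SQLite database rather than Microsoft SQL Server, and the table names (compound, synonym) and the functions open_db, register and names_for are hypothetical, chosen only for illustration. The point it shows is that when the InChI is the unique key, the same structure deposited under different names collapses onto a single record.

```python
import sqlite3

# Standard InChI strings for two simple molecules.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER = "InChI=1S/H2O/h1H2"


def open_db():
    """Create a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE compound (
            id    INTEGER PRIMARY KEY,
            inchi TEXT NOT NULL UNIQUE      -- one row per distinct structure
        );
        CREATE TABLE synonym (
            compound_id INTEGER NOT NULL REFERENCES compound(id),
            name        TEXT NOT NULL,
            UNIQUE (compound_id, name)
        );
        """
    )
    return conn


def register(conn, inchi, name):
    """Deposit a (structure, name) pair and return the compound id.

    A structure that is already present is not inserted again; the
    new name is simply attached to the existing record.
    """
    conn.execute("INSERT OR IGNORE INTO compound (inchi) VALUES (?)", (inchi,))
    (cid,) = conn.execute(
        "SELECT id FROM compound WHERE inchi = ?", (inchi,)
    ).fetchone()
    conn.execute(
        "INSERT OR IGNORE INTO synonym (compound_id, name) VALUES (?, ?)",
        (cid, name),
    )
    return cid


def names_for(conn, inchi):
    """Return all names attached to a structure, sorted alphabetically."""
    rows = conn.execute(
        "SELECT s.name FROM synonym s "
        "JOIN compound c ON c.id = s.compound_id "
        "WHERE c.inchi = ? ORDER BY s.name",
        (inchi,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = open_db()
    register(conn, ETHANOL, "ethanol")
    register(conn, ETHANOL, "ethyl alcohol")  # same structure, second name
    register(conn, WATER, "water")
    print(names_for(conn, ETHANOL))
```

The design choice worth noting is the UNIQUE constraint on the identifier: deduplication is enforced by the database itself rather than by the depositing code.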
5) “CS has ca 20 million compounds and the only way to manage these is robotically.” We have a hybrid model of robotic handling and human intervention and interaction with the data. To see human interaction in action visit the feedback page.
6) “there is no guarantee that the computation of properties is free from error – indeed it cannot be. Many physical properties depend on the physical form of the compound and this is often not recorded. I suspect most of the properties are computed by heuristic means (“QSPR”) rather than QM calculations. And many of them fail to take things like chemical stability and reactivity into account. (Examples are boiling points for compounds that decompose, flashpoints for things that could never burn). But how do you tell this robotically – I don’t have a good suggestion But one can guarantee that in 20 million calculations some will be meaningless.”
I agree with the scientific point that properties depend on the physical form of the compound. None of the predictions are QM-based. That is definitely not feasible for 20 million compounds, not only because of lack of access to software but more because of the time required, as discussed previously with regard to QM NMR predictions. I have 15 years’ experience with QSPR-type predictions. They are fast and are routinely applied at the desktop by the majority of chemists in Life Science environments (and others) for the prediction of logP, solubility, logD, pKa, NMR, etc. I GUARANTEE that in 20 million compounds some predictions will be meaningless. This definitely doesn’t mean the predicted values across the DB are of no value.
Despite some of the previous comments about the properties, property prediction is valid in the vast majority of cases. See such discussions here: “Calcium Carbonate is not soluble and can’t have a logP” PLUS “Lipinski says Calcium Carbonate CAN have a logP”.
We are presently adding MORE predicted properties. Check out the “EPI Summary” at the bottom of the page for this record and you will see the following (scroll inside the box):
Log Octanol-Water Partition Coef (SRC):
    Log Kow (KOWWIN v1.67 estimate) = 0.85

Boiling Pt, Melting Pt, Vapor Pressure Estimations (MPBPWIN v1.42):
    Boiling Pt (deg C): 290.82 (Adapted Stein & Brown method)
    Melting Pt (deg C): 80.58 (Mean or Weighted MP)
    VP(mm Hg,25 deg C): 3.99E-005 (Modified Grain method)
    Subcooled liquid VP: 0.000135 mm Hg (25 deg C, Mod-Grain method)

Water Solubility Estimate from Log Kow (WSKOW v1.41):
    Water Solubility at 25 deg C (mg/L): 3.574e+004
    log Kow used: 0.85 (estimated)
    no-melting pt equation used

Water Sol Estimate from Fragments:
    Wat Sol (v1.01 est) = 1e+006 mg/L

ECOSAR Class Program (ECOSAR v0.99h):
    Class(es) found: Neutral Organics-acid

Henrys Law Constant (25 deg C) [HENRYWIN v3.10]:
    Bond Method : 9.70E-009 atm-m3/mole
    Group Method: Incomplete
    Henrys LC [VP/WSol estimate using EPI values]: 2.176E-010 atm-m3/mole

Log Octanol-Air Partition Coefficient (25 deg C) [KOAWIN v1.10]:
    Log Kow used: 0.85 (KowWin est)
    Log Kaw used: -6.402 (HenryWin est)
    Log Koa (KOAWIN v1.10 estimate): 7.252
    Log Koa (experimental database): None

Probability of Rapid Biodegradation (BIOWIN v4.10):
    Biowin1 (Linear Model) : 0.7245
    Biowin2 (Non-Linear Model) : 0.7196
  Expert Survey Biodegradation Results:
    Biowin3 (Ultimate Survey Model): 3.1842 (weeks )
    Biowin4 (Primary Survey Model) : 3.9956 (days )
  MITI Biodegradation Probability:
    Biowin5 (MITI Linear Model) : 0.6808
    Biowin6 (MITI Non-Linear Model): 0.7604
  Anaerobic Biodegradation Probability:
    Biowin7 (Anaerobic Linear Model): 0.5224
  Ready Biodegradability Prediction: YES

Hydrocarbon Biodegradation (BioHCwin v1.01):
    Structure incompatible with current estimation method!

Sorption to aerosols (25 Dec C)[AEROWIN v1.00]:
    Vapor pressure (liquid/subcooled): 0.018 Pa (0.000135 mm Hg)
    Log Koa (Koawin est ): 7.252
    Kp (particle/gas partition coef. (m3/ug)):
        Mackay model : 0.000167
        Octanol/air (Koa) model: 4.39E-006
    Fraction sorbed to airborne particulates (phi):
        Junge-Pankow model : 0.00598
        Mackay model : 0.0132
        Octanol/air (Koa) model: 0.000351

Atmospheric Oxidation (25 deg C) [AopWin v1.92]:
    Hydroxyl Radicals Reaction:
        OVERALL OH Rate Constant = 24.3848 E-12 cm3/molecule-sec
        Half-Life = 0.439 Days (12-hr day; 1.5E6 OH/cm3)
        Half-Life = 5.264 Hrs
    Ozone Reaction:
        No Ozone Reaction Estimation
    Fraction sorbed to airborne particulates (phi): 0.00957 (Junge,Mackay)
    Note: the sorbed fraction may be resistant to atmospheric oxidation

Soil Adsorption Coefficient (PCKOCWIN v1.66):
    Koc : 1
    Log Koc: 0.000

Aqueous Base/Acid-Catalyzed Hydrolysis (25 deg C) [HYDROWIN v1.67]:
    Rate constants can NOT be estimated for this structure!

Bioaccumulation Estimates from Log Kow (BCFWIN v2.17):
    Log BCF from regression-based method = 0.500 (BCF = 3.162)
    log Kow used: 0.85 (estimated)

Volatilization from Water:
    Henry LC: 9.7E-009 atm-m3/mole (estimated by Bond SAR Method)
    Half-Life from Model River: 7.347E+004 hours (3061 days)
    Half-Life from Model Lake : 8.016E+005 hours (3.34E+004 days)

Removal In Wastewater Treatment:
    Total removal: 1.88 percent
    Total biodegradation: 0.09 percent
    Total sludge adsorption: 1.78 percent
    Total to Air: 0.00 percent
    (using 10000 hr Bio P,A,S)

Level III Fugacity Model:
                Mass Amount    Half-Life    Emissions
                 (percent)       (hr)        (kg/hr)
    Air            0.189         10.5         1000
    Water          37            360          1000
    Soil           62.7          720          1000
    Sediment       0.0722        3.24e+003    0
    Persistence Time: 547 hr
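Several of the numbers in the EPI Summary above are related to one another, and a reader can sanity-check them with a few lines of arithmetic. The sketch below recomputes five of the reported values from the others using textbook relations (Koa = Kow / Kaw, BCF = 10^(log BCF), 1 mm Hg ≈ 133.322 Pa, 24 hours per day). Whether EPI Suite applies exactly these steps internally is an assumption on my part; the dictionary keys and the function name recompute are hypothetical.

```python
# Values copied from the EPI Summary in the post.
INPUTS = {
    "log_kow": 0.85,         # KOWWIN estimate
    "log_kaw": -6.402,       # HenryWin estimate
    "log_bcf": 0.500,        # BCFWIN regression-based estimate
    "vp_mmhg": 0.000135,     # subcooled liquid vapor pressure
    "river_hours": 7.347e4,  # volatilization half-life, model river
    "lake_hours": 8.016e5,   # volatilization half-life, model lake
}

# The corresponding figures as printed in the summary.
REPORTED = {
    "log_koa": 7.252,
    "bcf": 3.162,
    "vp_pa": 0.018,
    "river_days": 3061,
    "lake_days": 3.34e4,
}

MMHG_TO_PA = 133.322  # pascals per mm Hg


def recompute(values):
    """Derive the dependent quantities from the independent ones."""
    return {
        # log(Kow / Kaw) = log Kow - log Kaw
        "log_koa": values["log_kow"] - values["log_kaw"],
        "bcf": 10 ** values["log_bcf"],
        "vp_pa": values["vp_mmhg"] * MMHG_TO_PA,
        "river_days": values["river_hours"] / 24,
        "lake_days": values["lake_hours"] / 24,
    }


if __name__ == "__main__":
    for name, value in recompute(INPUTS).items():
        print(f"{name}: recomputed {value:.4g}, reported {REPORTED[name]:.4g}")
```

Agreement here only shows that the summary is internally consistent; it says nothing about how close the predictions are to experiment.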
7) “Chemspider is using social computing (crowdsourcing) to clean up (curate) the information in the database. This works in Wikipedia, although the number of chemicals in in the thousands, not the millins, and there are still many data and chemical problems. Moreover WP shows that there are compounds – e.g. aluminium chloride – where there is no single structure.” Social computing curation is working well. It’s working on Wikipedia too; I am in the middle of that effort. There is no reason that ChemSpider cannot support multiple species for one compound either. For example, see the structure of Thymol Blue on Wikipedia and then look at this search on ChemSpider: http://www.chemspider.com/q/thymol%20blue. Two of the three structures in the scheme are recorded on ChemSpider. The third can be added. For aluminium chloride we link to Wikipedia to explain this. At present we show only the lede of the article, but we could host the entire article. Why not?
8) “What is Chemspider now is and where it may be going? It’s difficult to predict anything on the web but it’s also clear that chemists are one of the most conservative disciplines. Why use a free service when you can get your library to pay (a lot of money) for ACS or Beilstein services? So I wouldn’t predict explosive growth like Flickr or Google” Yup, I’d agree. But it’s not only conservatism. It’s marketing (we don’t do any paid marketing), and ChemSpider is for chemists. Flickr’s for everybody, and so is Google. How can it be as explosive? But can it grow, and is it growing? Yup.
9) “Nick found 26 sites displaying staurosporine and there were 19 different structures given. Some were incomplete and several were just crazily wrong. Clearly many chemical suppliers, journal editors, etc. do not care about chemical structures. So there is a huge amount of rubbish out there.” I’ve said the same many times (1,2, and others). But does it mean we should stop? I don’t think so…
And to conclude:
10) “PMR: At some stage, therefore, the community will react against this centralisation of information, but it could be a long time. I don’t think anyone should set up to duplicate what ACS does – I think we should use modern thinking to do things quicker, smarter, cheaper and in tune with the modern Web. Chemspider may have to make some choices soon – is it a company or a voluntary activity? does it concentrate on high volume and variable quality, or low volume and high quality – it cannot do both? What is the particular USP of its repository service ?- there may well be a role for a specialist chemical repository service but when? Is it different from Pubchem, and how…? ”
ChemSpider is not a company. ChemZoo is. We ARE using modern thinking in tune with the modern web. This is probably one of the fastest-moving efforts in this area. Are there others moving as fast at depositing? Curating? Integrating? So, we are a company, and the service comes at no cost to the users. Volunteers are helping. We are working on BOTH high volume and high quality. It is work. We are being successful on both. The Wikipedia collection, when finished, will only be a subset of ChemSpider. But structures and associated information (other than predictions!) are validated daily at present, and crowdsourcing can speed it up. And there WILL be disagreements between chemists, just like on Wikipedia! I am in those conversations too. I think there is a role for a free access chemical repository now. We may be surpassed at any time but for now our efforts are valid and valiant, in my opinion. What say you?