ChemSpider has been online since March 24th 2007, about 6 weeks. We opened the ability to curate the data one month later.
Is there a need to curate the data? ChemSpider is built up from a series of databases. The list of contributors continues to grow, and there will be some very exciting announcements in the next few days about new contributors. One of the largest components is the PubChem database. Peter Murray-Rust recently blogged about the quality of the name-structure pairs inside the PubChem database. He used methane as an example; I point you to the original blog for his comments. For my purposes I will use water. Here is the list of names, synonyms and registry numbers posted for water at PubChem. Certainly a number of these have carried over to ChemSpider. Out of interest, it is worth comparing the results of searching for the word “water” at both PubChem and ChemSpider. Search PubChem for water here and ChemSpider for water here: 228 hits versus 1. Looking at ChemSpider we get the following list of names, synonyms and registry numbers. The hyperlinks in the list below point to Wikipedia.

“water; Water vapor; Dihydrogen oxide; Distilled water; Purified water; Water, purified; hydrogen oxide; Deionized water; Oxygen atom; dihydridooxygen; ether; ethers; hydroxide; oxidane; Monooxygen; Photooxygen; Wasser; Singlet oxygen; Atomic oxygen; Deuterated water; Dihydrogen Monoxide; Oxygen, atomic; Water, mineral; Water, deionized; Water, distilled; Water, heavy; Water-t; DHMO; See Remark 8; HYDROXY GROUP; Water for injection; BOUND OXYGEN; BOUND WATER; Oxygen(sup 3P); 3H-Water; OXO GROUP; UNKNOWN; Water-18O; Sterile purified water; Tritiated water, mono-; Tritiated water (HTO); DISORDERED SOLVENT; Water, purified (JAN); Purified water (JP15); Water (JP15/USP); Type 2 Copper Site Water; Type 2 Copper Site Waters; CCRIS 6115; Oxygen O8 Of 8-Oxoadenine; GLUCOSE 4-O4 GROUP; Oxygen Of Oxidized
Methionine; Water for injection (JP15); Oxygen Bound To Cys 83 Sg; Oxygen Bond To Sg Cys A 67; Sterile purified water (JP15); CHEBI:15377; CHEBI:25698; [OH2]; H(2)O; Disordered Solvent – See Remark 8; EINECS 231-791-2; Oxygen Bound To +a B 17 At C8; Disordered Solvent – See Remark 10; Disordered Solvent – See Remark 11; Disordered Solvent – See Remark 12; NSC147337; NSC 147337; The Oxygen Is Linked To The Haem Iron; Hydroxy Group Bond To Sg Cys B 67.; Oxygen Bound To Cys 25 Sg – Remark 4; The Oxo Group Is Linked To The Haem Iron; C00001; D00001; 7732-18-5; Water, distilled, conductivity or of similar purity; H2O; HOH; Ice; DIS; GTE; H20; HYD; MTO; OX; OXO; UNK; 13670-17-2; 14314-42-2; 17778-80-2; 558440-22-5; DOD; DUM; glc; O2; OH; OX1; TIP; UNL”

This is NOT what I would call a quality set of names. These will be curated: some by appropriate robots and some manually.
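To give a feel for what a robot pass over such a list might look like, here is a minimal sketch in Python. The patterns and the function name `flag_suspicious` are illustrative assumptions of mine, not the rules ChemSpider's robots actually use; they simply pick out entries from the list above that read like crystallography annotations rather than chemical names.

```python
import re

# Hypothetical heuristics for illustration only; these are NOT
# ChemSpider's actual curation rules.
SUSPICIOUS_PATTERNS = [
    # Annotations such as "See Remark 8"
    re.compile(r"see remark", re.IGNORECASE),
    # Site descriptions such as "Oxygen Bound To Cys 83 Sg"
    re.compile(r"\b(?:bound|bond|linked)\s+to\b", re.IGNORECASE),
    # Placeholder entries that name nothing at all
    re.compile(r"^(?:unknown|disordered solvent)$", re.IGNORECASE),
]


def flag_suspicious(names):
    """Return the names that match any of the suspicious patterns."""
    flagged = []
    for name in names:
        stripped = name.strip()
        if any(p.search(stripped) for p in SUSPICIOUS_PATTERNS):
            flagged.append(stripped)
    return flagged


sample = [
    "water",
    "Dihydrogen oxide",
    "See Remark 8",
    "UNKNOWN",
    "Oxygen Bound To Cys 83 Sg",
    "7732-18-5",
]
flagged = flag_suspicious(sample)
```

A pass like this would only flag candidates for review. Names such as “Oxygen atom” or “ether” are plausible chemical names attached to the wrong structure, and catching those still needs a human curator.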
The water record is an extreme case. Let’s look at other examples already identified by curators. Below is an example of curation in progress.

Some examples of curated data

Returning to Peter’s blog…an excerpt states “Pubchem faithfully reflects the broken nature of chemical information. It cannot mend it – there are only ca. 20 people – and anyway the commercial chemical information world prefers to work with a broken system. But could social computing change it? Like Wikipedia has? [..] I think chemistry is different. And I think we could do it almost effortlessly - rather like the Internet Movie Database. Here every participant can vote for popularity or tomatos. A greasemonkey-like system could allow us to flag “unuseful names” or to vote for the preferred names and structures. And this doesn’t have to be done on PubChem – it could be a standoff site [..].”

I happen to agree. I believe social computing can change it. That is the purpose of the curating process on ChemSpider. When we set up the system we were not sure that people would care or help in curating the data. Why? Here’s why people might NOT want to help us curate the data:

  1. ChemSpider is not PubChem. The data cannot be downloaded.
  2. ChemSpider is a business…why should people help a business increase the quality of the data they host?
  3. ChemSpider is new. Who says that the efforts made to curate data will be of value to others? How long will ChemSpider be around to allow people’s work to benefit others?

All valid questions. And they likely ARE deterrents to people helping improve the quality of data on ChemSpider. So, what are the answers to these questions, and are they enough to convince ChemSpider users to assist in curating the database? Our responses to the questions above are as follows:

  1. We do not have permission from all depositors to ChemSpider to allow their data to be downloaded, only viewed. However, we WILL redeposit all curated data originally sourced from PubChem back to PubChem. In an email exchange this past week, Steve Bryant from PubChem commented that they would willingly accept curated data back into their database. We will also make available a downloadable database of all curated data originally sourced from public sources, and we will provide feedback to other depositors when we find errors.
  2. I have done my utmost to explain this in a previous post here.
  3. ChemSpider has traction. It is getting lots of use. Based on that interest, we believe our initial efforts have already drawn enough response to justify continuing this work. We have challenges, as discussed previously, but we are busily addressing these now. We believe that every effort made to improve the quality of data in the ChemSpider database will benefit all users and the community in general, since we will give the curated data back to PubChem and other database providers.

I have outlined only a small number of possible concerns above. There may be more. I welcome any other questions you may have about our intentions.


