Archive for the Quality and Content Category

As reported recently we have handed over the entire ChemSpider Database to PubChem for deposition. I did receive a number of offline requests asking when it would be deposited. I am so used to having us survive with a minimum number of servers, and in a world of developing processes to support close to 20 million compounds, that I was estimating at least end of year for deposition and exposure.

My favorite color is Green. And I am experiencing green now. Pure envy…but in a good way. The data were only delivered to PubChem late last week and I’ve been informed that they should have the data deposited by the end of this weekend/early next week. That is amazing. That is all about their experience from receiving data from many data sources and “learning lessons”. It’s all about access to “enough hardware”. It’s all about the commitment of the PubChem team to keep things moving forward, not lose momentum regarding the benefit of the PubChem project to the chemistry community at large and stay on task. I’m impressed…and green :-)

Keep your eye on the PubChem data sources page and you will see ChemSpider top out at about 17.8 Million structures when all are deposited. We’re proud and happy to have contributed!

We have also had requests from other people to access the ChemSpider database files. Yesterday an organization tried to download the files by FTP but unfortunately we have had to cut off access since there are so many files being downloaded, so much bandwidth being consumed, that we have decided we can only provide the data on DVD. Apologies…

For those of you who have been watching the blog of late you will be aware of the recent discussions about Open Data (1,2). We have offered the possibility to submitters of spectral data to declare their data either Open or Closed. Noel posted a comment on the blog asking the question “Why is the default Closed? Why even offer the option of Closed?”

So, my response to “Why not offer the option of Closed?” My opinion is that this is the submitter’s decision. It’s not our role to force “Openness” of data onto users. We are working to create an environment that provides value to ChemSpider users rather than one that forces them into a policy regarding openness. Personally, I would prefer to have access to data to help answer a question, even if they are NOT Open Data, than to not have access to those data. I have asked all of the people who have submitted data or had me submit data to ChemSpider whether they would like to have their data moved to open. Three said yes, two said no. I do NOT intend to force people to adhere to making their data Open. That is their choice, not mine. We are creating a community for collaboration. There is value in having access to data whether it is Open or not. If you look at the recent conversations about RSC and their Free Access versus Open Access we must agree that there IS value to Free Access to their articles despite the fact that they are not Open Access.

My friend Gary Martin has allowed us to deposit some of his data onto ChemSpider. He has commented twice (1,2) and I refer you to those blog postings for his opinions. They are interesting to read.

The reality is that our policies, even as they are, appear to be appropriate to have people deposit their data. We already have over 100 spectra deposited on ChemSpider and more to come based on recent conversations. Some of these ARE Open Data and the depositors are acknowledged for this. They are sharing their data with you through us. That’s the benefit of building a community for chemists.

Recently there was a commentary made about the “highly curated data” on Wikipedia. To me curators are heroes. They are detail oriented, committed to the cause and simply “care”.

As a result of reading that post you saw me go off and check on Taxol, post a few comments and come out the other end of the work with a “more highly curated record” on Wikipedia.

Then I commented that there are better ways to ensure the quality of structure drawings than redrawing them…specifically dictionary look-up and optical structure recognition.

I don’t mind being taken to task on my opinions. As my late father said…”Opinions are like nostrils, everybody has them”. Okay, the body cavity was a little more south but you get the point. However, this opinion stirred me…

“If you wish to spend your life recording typos in chemical documents, I hope it is fulfilling.”

Now, sometimes when you are stirred emotionally, it helps to sit down and think about it.

[Image: dog sitting on cat]

So, I’ve thought about it… and I’m happy about where I’ve ended up.

My life IS fulfilling. I might need therapy for this particular passion but I DO actually enjoy checking typos in “documents” – of course our conversations are about chemical documents (structures) and I DO confess I like it. Why? I care about Quality.

When I see an acknowledgment that Wikipedia is highly curated and I know I have contributed to that I have a certain pride in having contributed to community science. Those of us cleaning up the historical record for others to benefit are doing a lot of the grunt work that others talk about being necessary and espouse the need for platforms to do so. You can throw a palette of colors and a brush on a floor but someone has to pick it up and do something with it. Platforms, tools, visions are great…we need thinkers but we also need doers. Doers are important and necessary and people who find typos in chemical documents likely do find it fulfilling. I’m a thinker and a doer. Until I have experienced the challenges of curating historical records I do not feel I am sufficiently immersed in the challenge. Oh…there’s another nostril (opinion).

So, who are my heroes? Some of them in this domain are:

1) Barrie Walker, ChemSpider Advisory Group member and our KING OF QUALITY.

2) Ann Richards, EPA, founder of the DSSTox effort and quality guru extraordinaire. Ann and her team have taken on the task of assembling, from various sources (and of various quality levels), a public resource of incredible value to the Tox community. This paper explains in detail. With her fine eye for detail, commitment to detail (checking CAS numbers to the digit, stereochemistry of each bond and the accuracy of the chemical names) her databases are likely the cleanest and most highly curated databases from any government labs (no intention to offend others here and if your DB is as good as DSSTox you are my heroes too!) In particular I acknowledge Marti Wolf from Ann’s lab who has spent thousands of hours assembling data, “recording typos in chemical documents” and correcting them to the benefit of the community.

3) People like Peter Corbett. He really seems to care about what’s in a database and the quality of what’s there. He is discovering these issues by observation and checking. His careful eye, clearly necessary for the development of OSCAR, makes him a hero (I look forward to meeting him!)

4) The people I worked with at ACD/Labs in the database compilation office are heroes. This group of 10s of individuals over the years, have manually curated 100s of thousands of structures and associated properties (Physchem parameters, NMR shifts, name-structure pairs). They have done it with a fine eye. THEIR efforts were the basis of what led to industry leading NMR prediction algorithms which were used recently to provide feedback to the Blue Obelisk team member, Christoph Steinbeck, to help clean up errors in the NMRSHIFTDB. While others were attacking the open data effort those of us concerned with the details helped curate the data.

5) The curators at CAS, at MDL (now Symyx), at GVKBio, and in software houses and labs all over the world who manually curate data, and, from their experience, build robots to help their processes and improve the data for all.

For all of you who wish to spend your life recording typos in chemical documents, it is likely very fulfilling if you care about quality.

I find it fulfilling. It’s a necessary part of understanding the problem. Quality is hard to define. But, we’ve been challenged on the quality of our science on ChemSpider enough. We’ve been challenged for sodium chloride dimers and shown it’s valid science. We’ve been challenged for logP prediction of Calcium Carbonate and had an industry great acknowledge our attention to detail. We’ve been challenged on inorganic chemistry and compared ourselves to others.

We Monkeys have been told to close the gates of ChemZoo. We didn’t. Instead we are doing great things for the community I hope. We have opened up a series of services that the Open Access world likes (specifically the Blue Obelisk players), we are donating our database to PubChem shortly, and we are working with some of the best people on our advisory group to satiate their needs. It’s pretty damn fulfilling.

* I will acknowledge that the comment “If you wish to spend your life recording typos in chemical documents, I hope it is fulfilling.” is removed from the context of the entire post. So read the post. Then read all the others I’ve mentioned. I made my interpretation of the comment based on the ongoing flavor. Maybe my nostril was clogged…

In a couple of email exchanges this weekend the “Right to Use” FAQ regarding data provided on ChemSpider was under discussion. The FAQ page hasn’t been updated since we went live in March so, based on almost 6 months of experience with feedback and commentary, and stimulated by the exchange over the weekend I’ve updated the statement on the FAQ page to state the following:

May I download the data and use it in my own database(s)?

You have limited rights in this regard. You can only assemble a database of 5000 structures or fewer, and their associated properties, from our database without our permission. You can download up to 1000 structures per day from the website. Please contact us at feedbackATchemspiderDOTcom to request an extension outside this constraint. We are willing to provide the ENTIRE database of ChemSpider structures at your request – the file will consist of InChI Strings, InChIKeys and ChemSpider IDs. These constraints are under regular review so please feel free to engage us in conversation.

What we’re trying to do here is to stop the offshore raiding of the database that is going on. Certain groups are attempting to download the database and putting an incredible load on our server(s). So, please stop!

We are presently in the process of downloading the entire database into a series of SDF files so that we can provide the ENTIRE ChemSpider database to interested parties. We will cross the 20 million mark shortly in terms of unique structures on the database. Each structure will be accompanied by the InChI String, the InChIKey and the associated ChemSpider ID. We are INTENT on proliferating the value of InChI across the chemical community and expanding the value of InChI to the semantic web.
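For readers who have not handled SD files before, here is a minimal sketch of how the three identifiers could travel alongside each structure as SDF data items. It is an illustration only, not the actual ChemSpider export: the field names `InChI`, `InChIKey` and `CSID` and the ID value are assumptions made for the example, and the molfile block is treated as opaque text.

```python
import re


def format_sdf_record(molblock: str, fields: dict) -> str:
    """Append SD-file data items to a molfile block and close the record."""
    lines = [molblock.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f"> <{name}>")  # data header line
        lines.append(str(value))     # data value line
        lines.append("")             # a blank line ends each data item
    lines.append("$$$$")             # record delimiter
    return "\n".join(lines) + "\n"


def read_sdf_fields(record: str) -> dict:
    """Pull the data items back out of a single SD-file record."""
    fields = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        match = re.match(r">\s*<([^>]+)>", lines[i])
        if match is None:
            i += 1
            continue
        i += 1
        value_lines = []
        while i < len(lines) and lines[i] not in ("", "$$$$"):
            value_lines.append(lines[i])
            i += 1
        fields[match.group(1)] = "\n".join(value_lines)
    return fields


# Hypothetical example: the molblock is a stand-in and the CSID is made up.
water_fields = {
    "InChI": "InChI=1/H2O/h1H2",
    "InChIKey": "XLYOFNOQVPJJNP-UHFFFAOYSA-N",
    "CSID": "12345",
}
record = format_sdf_record("(molfile block goes here)\nM  END", water_fields)
print(read_sdf_fields(record))
```

Keeping the identifiers as named data items means a recipient can match structures by InChIKey without having to parse the connection table at all.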

So, a question to you, our readers…is there anyone out there who would like to receive the ChemSpider database when it is ready? Please let me know by responding to this blog post. Thanks

Egon Willighagen recently dealt with the question “What is dapagliflozin?” then later went on to expand on what the structure actually is including adding the structure to Wikipedia for future reference.

We are presently beta-testing the Structure Deposition process we have committed to in regards to allowing scientists around the world to expand ChemSpider by adding their own structures and associated information. The screenshot below shows the result of about 3 minutes work to add the structure, four identifiers (systematic name, CAS number, BMS number, trade name) and some short text about the drug to ChemSpider. Following submission the structure and associated text is reviewed and approved for exposure on the ChemSpider database.

[Image: The structure of dapagliflozin submitted to ChemSpider]

If you want to join the beta-testers let us know!

Last week we unveiled the ability to deposit spectra to ChemSpider and initiated some beta-testing. Based on our own testing and the feedback from our users we are rolling out the deposition of spectra to all users as described here.

We have also unveiled the ability to curate synonyms online as described on ChemSpider news here. This moves us further to our wiki-enabling of the ChemSpider system.

In order to use both of these capabilities you must be a registered user of the system. Sign up as described here.

We welcome your feedback on these efforts!

Since we went live in March we have had a link on the web page for Registration. It is only now enabled so our apologies for the delays. Was it worth the wait? Absolutely. Visit the registration page and sign up to benefit from a number of advantages as well as to become a “contributor” to the site.

I have posted a number of times about the intention for ChemSpider to become Wiki-enabled. While we have not yet layered the full wiki capability onto the system we are about to unveil single structure deposition, multiple-structure deposition and spectrum association with a structure. We already have beta testers testing the spectral association and over 40 spectra have already been added to the database.

The reason we require registration should be fairly obvious. For the purpose of providing access to beta-testers and granting the appropriate rights to submitters and curators we need to have traceability regarding who is making the submissions, the comments and the edits. While we understand this might be slightly invasive it is appropriate to retain a certain level of control over what might show up on the database. We hope we have your support.

Some side benefits of registration include an additional way for us to deliver you updated information regarding new capabilities being introduced to ChemSpider, updates regarding content enhancements and, when the time is right, delivering information to you via a ChemSpider newsletter.

Register today!

For those of you watching the progress of ChemSpider since its initial exposure in March of this year we have been incrementally adding new features and specifically integration to other rich sources of information. We have delivered integration to multiple data sources (Click on the Data Sources checkbox under the Advanced Search for the list) as well as the integration to text-based searching of 50,000 Open Access articles via the ChemRefer service. Now we have extended the ability to include review of Patents.

In a collaboration with Reel Two we have provided a way to perform structure and substructure searching across millions of chemical structures integrated to patents from the US, European and Asian Patent Offices via their SureChem Portal. Following a search simply click through to the Detailed Results page for a particular structure and look in the Data Sources list for the word SureChem. See below as an example…note SureChem blocked in red.

[Image: SureChem link]

Clicking on any of the names in the Data Sources link launches a new Browser Window containing the links to the External Substances links as shown below.

[Image: links to SureChem Data Sources]

Clicking on any of the External Links will take you to the actual patent sitting on the Patent Analysis website and identified via the SureChem query. For example, see here.

We have a number of ideas to enhance the delivery of patent information via ChemSpider but for the time being we believe that the ChemSpider and Reel Two SureChem integration offers a powerful means by which a chemist can navigate their way from a chemical structure to a patent. We welcome your feedback.

A number of new data sources have been added/updated to the ChemSpider database. A separate posting on the ChemSpider News blog outlines HOW to search only specific data sources rather than the entire dataset. This should be of interest to you.

Newly added databases are listed below:

1) A set of “synthesized peptides”. Over 168,000 peptide structures generated from all amino acids – from the individual amino acids up to and including all tetrapeptides. This database will likely be of value to mass spectrometrists examining biofluids.


2) The Human Metabolome Database – a database containing detailed information about small molecule metabolites found in the human body. It is intended to be used for applications in metabolomics, clinical chemistry, biomarker discovery and general education. The database is designed to contain or link three kinds of data: 1) chemical data, 2) clinical data, and 3) molecular biology/biochemistry data. An example is shown for phenylephrine which is linked through to the HMDB here.

3) The DrugBank database is a cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains nearly 4300 drug entries including >1,000 FDA-approved small molecule drugs, 113 FDA-approved biotech (protein/peptide) drugs, 62 nutraceuticals and >3,000 experimental drugs. The link through from ChemSpider to DrugBank is not yet enabled but will come shortly.


4) The CombiUgi set – Jean-Claude Bradley’s team, in collaboration with others, is working on the CombiUgi project as described on their wiki. The basis of the work is explained in Nature Precedings in the article “The CombiUgi Project and Closing the Open Science Loop” by Jean-Claude Bradley, Rikesh Parikh, Dan Zaharevitz and Rajarshi Guha. ChemSpider has added and indexed the 68,000 structures to the database, together with the predicted properties. We will be adding additional data to these structure records as the data become available.
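As a quick sanity check on the “over 168,000” figure quoted for the synthesized peptide set in item 1, the count can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: it assumes the 20 standard amino acids and linear peptides where each distinct sequence counts once, which is my reading of “all amino acids” and not a description of how the set was actually generated.

```python
from itertools import product

# Assumption: the 20 standard amino acids, written as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def count_peptides(max_length: int, alphabet_size: int = len(AMINO_ACIDS)) -> int:
    """Count all linear sequences of length 1 up to max_length."""
    return sum(alphabet_size ** n for n in range(1, max_length + 1))


def enumerate_peptides(max_length: int):
    """Yield every sequence, shortest first (amino acids, dipeptides, ...)."""
    for n in range(1, max_length + 1):
        for residues in product(AMINO_ACIDS, repeat=n):
            yield "".join(residues)


# 20 + 400 + 8,000 + 160,000 sequences up to and including tetrapeptides
print(count_peptides(4))
```

Under those assumptions the total is 20 + 400 + 8,000 + 160,000 = 168,420, which is consistent with the number quoted above.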

We have additional data to deposit onto the database and will keep you informed on new additions.

ChemSpider has expanded to over 16.5 Million compounds in the past week and is now, as far as we know, the largest collection of chemical structures and related information available online (If not…let us know what is).

At present we have a list of over 100 chemical vendors and chemistry database suppliers to approach regarding adding their data to the ChemSpider index. That said, we are interested in providing maximum value to our users. With that in mind we are looking to our users to help identify and prioritize the datasets that they find of highest value. Rich Apodaca has blogged previously on a series of thirty-two databases and these are not all indexed as yet. Let us know if any of these have particular interest or value to you or point us to others so we can contact the hosts/providers.

As promised we will be unveiling user registration and submission in the near future to facilitate the expansion of the database by direct user submissions. Watch this space….it’s coming! We have a vision of getting to 20 million compounds online on ChemSpider by end of year. Help us get there if you can…point us to the highest value resources.

The Journal of Heterocyclic Chemistry is one of my favorite journals. We publish NMR articles in this journal fairly regularly regarding complex structure elucidations performed with the Structure Elucidator software or, of late, regarding the applications of indirect covariance techniques. J. Het. Chem. is a “family owned” journal…one of the only ones not owned by the ACS or a major publishing house (other than Open Access Journals of course).

J Het Chem is one of the few journals offering online structure searching of its articles. Since the structures were already extracted in order to index the articles it made sense to suggest to Dr Lyle Castle, the manager of the journal, that the J Het Chem structures be indexed on ChemSpider. Within a couple of weeks, after a few email exchanges with Specs, the provider of the structure searching capability on the www.jhetchem.com site, we had received and indexed the structures and hooked up the results on ChemSpider to the website.

So, over 208,000 structures from the Journal of Heterocyclic Chemistry are now indexed on ChemSpider and linked through to the abstracts on that journal. Our intention with this work, and with the work to connect Open Access articles, is to extend the reach of ChemSpider out to the chemistry literature.

For those of you watching ChemSpider you may be aware of the structure counter on the Home Page. It looks like this…

Over 13 Million Compounds and Counting
The number in parentheses tracks the number of compounds contained in the ChemSpider Database. If you check it after you have read this post you may notice it will have incremented again. Recently we have added a number of databases as detailed on our News Blog. At present we have millions of structures being added to the database. The process includes a de-duplication process so that only unique compounds are added to the database, the generation of PhysChem properties and Systematic Identifiers and the generation of the chemical structure image and appropriate indices to enable structure and substructure searching. This process is going to take a few days to update the data we have available for adding to ChemSpider. Continue to watch the counter…you are likely to see it cross the next million structure increment very shortly.
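To make the de-duplication step a little more concrete, here is a minimal sketch of the idea. It assumes that two deposited structures count as the same compound when they share an identical InChI string; that matching rule, the record layout and the source names are all assumptions made for the illustration and not a description of the production process.

```python
def deposit(database: dict, source: str, inchis: list) -> int:
    """Add structures from one data source; return how many were new compounds.

    `database` maps an InChI string to a record holding a compound ID and the
    set of data sources that supplied that structure (hypothetical layout).
    """
    added = 0
    for inchi in inchis:
        record = database.get(inchi)
        if record is None:
            # First time this structure is seen: create a new compound.
            record = {"id": len(database) + 1, "sources": set()}
            database[inchi] = record
            added += 1
        # Seen before or not, remember who deposited it.
        record["sources"].add(source)
    return added


db = {}
print(deposit(db, "Source A", ["InChI=1/H2O/h1H2", "InChI=1/CH4/h1H4"]))
print(deposit(db, "Source B", ["InChI=1/H2O/h1H2"]))
print(len(db))
```

In this sketch the second deposit adds no new compound, only a second source for water, which is why the counter grows more slowly than the number of structures being submitted.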

A series of new databases has been deposited into ChemSpider and has helped to expand the reach of the ChemSpider services. The index has been expanded with additions from the following depositors:

1) Over 200,000 chemical structures from the Journal of Heterocyclic Chemistry have been added to the database and linked back to the publication website
2) UsefulChem molecules have been indexed and linked to the UsefulChem website. Example
3) The MDPI data have been indexed and link back to an information page. There are no molecules online so this is simply a connection to the data source.
4) Many hundreds of thousands of chemical structures supplied by Enamine have been added and linked to the structures on their website.
5) The Nanogens collection is now in ChemSpider and linked back to information about the data collection. Example

These structures are presently available for searching by text only as they need to be indexed for structure and substructure searching. The deposition system is a multi-stage process requiring generation of properties, systematic identifiers, addition to the database, indexing for searching and generation of images for review as a result of the search. These processes are not yet background processes. Please bear with us as we upgrade the index.
These data add over 1 million new structures to the database this month. We already have many more, multiple millions, to add to the database in the coming month.
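The reason a freshly deposited record is searchable by text but not yet by structure can be pictured with a small sketch of the stages listed above. The stage names and the rule linking stages to search types are assumptions for illustration; only the order of the stages is taken from the description.

```python
# Stages in the order described above; the names themselves are hypothetical.
STAGES = [
    "properties",
    "identifiers",
    "added_to_database",
    "search_index",
    "images",
]


def next_stage(completed: set):
    """Return the first stage not yet run for a record, or None when done."""
    for stage in STAGES:
        if stage not in completed:
            return stage
    return None


def available_searches(completed: set) -> list:
    """Assumed rule: text search needs the record in the database,
    structure and substructure search also need the search index."""
    searches = []
    if "added_to_database" in completed:
        searches.append("text")
    if "search_index" in completed:
        searches.extend(["structure", "substructure"])
    return searches


done_so_far = {"properties", "identifiers", "added_to_database"}
print(available_searches(done_so_far))
print(next_stage(done_so_far))
```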

At present the status is >11.7 million unique compounds and over 26 million substances from 65 total data sources.

Zen and the Art of Motorcycle Maintenance…for those of you interested in a discussion about Quality this is a great read (or listen, on a long ride with an audio book). Pirsig discusses the Metaphysics of Quality as a philosophy, a theory about reality and asks questions such as what is real, what is good and what is moral. The narrator of the book, Phaedrus, named after the character from the Plato dialogue of the same name, criticizes his instructors for poorly educating the students.

Now, poorly educating the students is an issue and certainly this concern has been raised recently in regards to the ChemSpider system. The question this leads to is about Quality. The Quality of LARGE public domain databases. ChemSpider is an example and not without challenges…to be expected with over 10 million compounds. However, as shown in a couple of specific posts about sodium chloride and, recently, Prussian blue, issues regarding the judgment of quality of both our system and other databases abound.

Rich Apodaca has recently posted a request for information about new free access/free speech/free beer databases to follow on from his very popular posting regarding 32 free chemistry databases. For those of you who do not frequent Depth-First I HIGHLY recommend a browse…one of my top sites for commentary on our domain. There will be a number of databases submitted for inclusion in Rich’s next list. However, the question I will have then will be about Quality. It is a concern as we choose to post certain content or not…there actually should be a quality flag depending on data sources in our opinion. Some are simply better than others.

We are already in the process of curating the ChemSpider content ourselves as well as with the assistance of some dedicated individuals. Clearly there are issues with some of the content within the index. With 10 million structures what is one to expect? The database is set to double in size over the next couple of months we believe and, in parallel, the number of potential errors will grow also.

So, the questions I have tumbling around my rather non-Zen brain this time of night are:

1) Assuming perfection is not feasible and errors will occur in a large free database of millions of structures, what level of error/misinformation is acceptable? This is after all an issue of cost versus quality in many cases. If you were paying $50 per search the expectations of quality would be much higher I would assume.

2) There are different criteria for quality for different data – it may be acceptable to have a poor predicted property for a compound since it IS a prediction but what if the structure itself is wrong, one stereocenter is mislabeled, one trade name is misspelled. Can you identify the highest quality data and for which is it acceptable to have errors?

3) What are people’s experiences of other large free databases…there are many out there as posted in Richard Apodaca’s list? What is the quality like?

4) Which public domain free online database is the gold standard by which others should be measured? How good is the database? What level of error content? What type of errors?

Any other commentary is welcomed. The question I posit is “How is Quality measured in terms of public domain free online databases?”

As posted on ChemSpider News we are presently adding and indexing a series of new databases to the ChemSpider database. We have an increasing number of contributors who can see value in exposing their data on ChemSpider. As well as those listed at the ChemSpider News site we have also received over 200,000 structures from a chemistry journal and these will be indexed and linked back to all abstracts on their website shortly enabling structure searching of their journal.

There has been an increasing number of downloads of the Internet Explorer and Firefox Add-ins as well as use of ACD/ChemSketch with the add-in. The majority of searches are still based on text-based searching thereby validating the value of the name-structure approach discussed elsewhere. Text searching is not the only capability enabled on ChemSpider: the structure and substructure searching capabilities through both Freeware ChemSketch and the structure drawing applet are also being utilized. The Prediction Services are also being utilized. At present the ChemSpider system appears to be delivering as we expected at this stage of our beta release.

We are fortunate to have some excellent beta testers providing feedback and the curation process is now fully active with 63 curated records as of 05/08/2007, an average of 5 structures per day since going live with the curation process. The majority of these records are contained within PubChem and ultimately will be returned to the PubChem database as explained previously.

ChemSpider has been set up with the intention of providing value to the chemical community in various ways. Direct usage of the ChemSpider system and its associated services is the most direct manner but we have made a commitment to ourselves to ensure that we return value to the data source providers supporting ChemSpider. We intend to do this through the process of curation as well as by providing data to the public domain.

In this regard ChemSpider has submitted our first database to PubChem. This database will be indexed on ChemSpider shortly. This database is generated from the conversion of a series of chemical names to chemical structures and has been performed in collaboration with Advanced Chemistry Development (ACD/Labs) using a number of their tools as detailed in a technical note posted at their site. ACD/Labs has also made the database available at their website in ChemFolder format. The database is available in SDF format by downloading the data from PubChem.

The database in question is a derivative work of a database downloaded in text file format from the U.S. Food and Drug Administration (FDA) site. ChemSpider has every intention to continue contributing such data back to the community on an ongoing basis and will post additional databases in the future.

ChemSpider has been online since March 24th 2007, about 6 weeks. We opened the ability to curate the data one month later.
Is there a need to curate the data? ChemSpider is built up of a series of databases. The list of contributors continues to increase and there will be some very exciting announcements made in the next few days about new contributors. One of the largest components is the PubChem database. Peter Murray-Rust recently blogged about the quality of the name-structure pairs inside the PubChem database. He used as an example methane… I point you to the original blog for his comments. For my purposes I will use water. Here is the list of names, synonyms and registry numbers posted for Water at PubChem. Certainly a number of these have carried over to ChemSpider. Out of interest it is worth comparing the results of the searches for the word “water” at both PubChem and ChemSpider. Search PubChem for water here and ChemSpider for water here. 228 hits versus 1. Looking at ChemSpider we get the following list of names, synonyms and registry numbers. The hyperlinks below link to Wikipedia.

“water; Water vapor; Dihydrogen oxide; Distilled water; Purified water; Water, purified; hydrogen oxide; Deionized water; Oxygen atom; dihydridooxygen; ether; ethers; hydroxide; oxidane; Monooxygen; Photooxygen; Wasser; Singlet oxygen; Atomic oxygen; Deuterated water; Dihydrogen Monoxide; Oxygen, atomic; Water, mineral; Water, deionized; Water, distilled; Water, heavy; Water-t; DHMO; See Remark 8; HYDROXY GROUP; Water for injection; BOUND OXYGEN; BOUND WATER; Oxygen(sup 3P); 3H-Water; OXO GROUP; UNKNOWN; Water-18O; Sterile purified water; Tritiated water, mono-; Tritiated water (HTO); DISORDERED SOLVENT; Water, purified (JAN); Purified water (JP15); Water (JP15/USP); Type 2 Copper Site Water; Type 2 Copper Site Waters; CCRIS 6115; Oxygen O8 Of 8-Oxoadenine; GLUCOSE 4-O4 GROUP; Oxygen Of Oxidized Methionine; Water for injection (JP15); Oxygen Bound To Cys 83 Sg; Oxygen Bond To Sg Cys A 67; Sterile purified water (JP15); CHEBI:15377; CHEBI:25698; [OH2]; H(2)O; Disordered Solvent – See Remark 8; EINECS 231-791-2; Oxygen Bound To +a B 17 At C8; Disordered Solvent – See Remark 10; Disordered Solvent – See Remark 11; Disordered Solvent – See Remark 12; NSC147337; NSC 147337; The Oxygen Is Linked To The Haem Iron; Hydroxy Group Bond To Sg Cys B 67.; Oxygen Bound To Cys 25 Sg – Remark 4; The Oxo Group Is Linked To The Haem Iron; C00001; D00001; 7732-18-5; Water, distilled, conductivity or of similar purity; H2O; HOH; Ice; DIS; GTE; H20; HYD; MTO; OX; OXO; UNK; 13670-17-2; 14314-42-2; 17778-80-2; 558440-22-5; DOD; DUM; glc; O2; OH; OX1; TIP; UNL”

NOT what I would call a quality set of names. These will be curated, some will be done with appropriate robots and some manually. This is an extreme. Let’s look at other examples already identified by curators. Below is an example of curation in process.

Some examples of curated data

Returning to Peter’s blog…an excerpt states “Pubchem faithfully reflects the broken nature of chemical infomation. It cannot mend it – there are only ca. 20 people – and anyway the commercial chemical information world prefers to work with a broken system. But could social computing change it? Like Wikipedia has? [..] I think chemistry is different. And I think we could do it almost effortlessly - rather like the Internet Movie Database. Here every participant can vote for popularity or tomatos. A greasemonkey-like system could allow us to flag “unuseful names” or to vote for the preferred names and structures. And this doesn’t have to be done on PubChem – it could be a standoff site [..].” I happen to agree. I believe social computing can change it. That is the purpose of the curating process on ChemSpider. When we set up the system we were not sure that people would care or help in curating the data. Why? Here’s why people might NOT want to help us curate the data:

  1. ChemSpider is not PubChem. The data cannot be downloaded.
  2. ChemSpider is a business…why should people help a business increase the quality of the data they host?
  3. ChemSpider is new. Who says that the efforts made to curate data will be of value to others? How long will ChemSpider be around to allow people’s work to benefit others?

All valid questions. And they likely ARE deterrents to people helping improve the quality of data on ChemSpider. So, what are the answers to these questions, and are they enough to convince ChemSpider users to assist in curating the database? Our responses to the questions above are as follows:

  1. We do not have permission from all depositors to ChemSpider to allow their data to be downloaded, only viewed. However, we WILL redeposit all curated data originally sourced from PubChem back to PubChem. In an email exchange this past week, Steve Bryant from PubChem commented that they would willingly accept curated data back into their database. We will also make available a downloadable database of all curated data originally sourced from public sources. We will also provide feedback to other depositors when we find errors.
  2. I have done my utmost to explain this in a previous post here.
  3. ChemSpider has traction. It is getting lots of use. Based on interest we believe that our initial efforts have already provided enough response to have us continue this work. We have challenges as discussed previously but we are busily addressing these now. We believe that every effort made to improve the quality of data on the ChemSpider database will benefit all users and the community in general with our giveaway to PubChem and other database providers of the curated data.

I have outlined only a small number of possible concerns above. There may be more. I welcome any other questions you may have about our intentions.

I give a thumbs up to the quality of the NMRSHIFTDB. We’ve validated it. Why would I care? I’m an NMR jock at heart. I also work for a commercial software company innovating NMR prediction software and compiling NMR databases as the basis of our work. Does this mean that the commercial software vendors and Open Source/Access communities can coexist and have mutual admiration? I believe so!

After 18 months’ work I finally signed off on one of those infamous copyright transfers for Elsevier, now the publishers of Progress in NMR. After over 18 months of work (in that after-hours style…much like blogging) a 360 page review article is finally submitted – “Computer-Assisted Structure Verification and Elucidation Tools In NMR-Based Structure Elucidation”. Proofs will arrive before the end of the month. It’s the culmination of over ten years of our own work as well as that of many contributors in the domain of CASE (Computer Assisted Structure Elucidation) systems. The complexity of structures that can be solved by computer algorithms is impressive…see examples here. Recently the StrucEluc CASE system solved the structure of an antibiotic of Mw>1150. Three NMR spectroscopists couldn’t solve it…a symbiotic relationship with software is VERY enabling!

One very active player in CASE is Christoph Steinbeck, a member of Blue Obelisk, one of the more active blogging groups on the net today.

Christoph’s group hosts NMRSHIFTDB. Recently ChemSpider linked to NMRSHIFTDB. In parallel I took an interest in Wolfgang Robien’s recently published critique of the quality of its data, especially since during my “day job” I am directly involved with NMR prediction, structure verification using NMR and, of course, CASE systems.

What was interesting about Robien’s post was the fact that it focused on the application of Neural Networks to prediction. With the availability of a public dataset we were able to repeat the analysis using our own Neural Networks as well as our classical approaches. The results will be reported elsewhere. What I want to comment on is PMR’s post regarding the quality of data. Peter commented, relative to our own efforts at ChemSpider: “There is little point in collecting 10 million structures if you cannot rely on any of them. It actually detracts from the hard work of people like Stefan, Christoph and others on NMRShiftDB as the general user of the database will judge all entries by the lowest common denominator.” After analyzing the data, over 200,000 individual chemical shifts, I can say DON’T judge by the lowest common denominator. There is some junk in there, as seen by Wolfgang Robien. But our estimate after our analysis is that likely fewer than 250 data points are in error. These are truly excellent statistics if you consider that this is an open access system where people are depositing data, that these data are free to download and utilize even for the development of derivative algorithms, and that such systems can work. The addition or improvement of rigorous checking algorithms to the NMRSHIFTDB is the next natural step, and flagging data to the submitter will have them check and validate the quality of their input. This will catch many errors during the submission process.
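To put those numbers in perspective: 250 errors in 200,000 shifts is 0.125%, or about one point in 800. The sketch below does that arithmetic and also shows the simplest possible form of a submission-time check, a plausibility window per nucleus. The window bounds and the function names are assumptions of mine for illustration; they are not NMRSHIFTDB’s actual validation rules, and a real checker would compare each shift against a prediction for the structure rather than a fixed range.

```python
# Hypothetical plausibility windows in ppm. These bounds are illustrative
# assumptions, not the checks NMRSHIFTDB actually performs.
PLAUSIBLE_PPM = {
    "13C": (-20.0, 250.0),
    "1H": (-5.0, 20.0),
}


def error_rate(n_errors, n_total):
    """Fraction of data points in error."""
    return n_errors / n_total


def flag_suspect_shifts(shifts):
    """Return the (nucleus, ppm) pairs that fall outside the window."""
    suspects = []
    for nucleus, ppm in shifts:
        low, high = PLAUSIBLE_PPM[nucleus]
        if not (low <= ppm <= high):
            suspects.append((nucleus, ppm))
    return suspects


# Upper-bound estimate from the analysis described above.
rate = error_rate(250, 200_000)

# A misplaced decimal point (1285.0 instead of 128.5) is caught at once.
submitted = [("13C", 128.5), ("13C", 1285.0), ("1H", 7.26)]
suspects = flag_suspect_shifts(submitted)
```

A range check like this only catches the “large errors”. Subtler problems, such as two swapped assignments, need the prediction-based comparison.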

So, my compliments to Christoph and the team. The quality is excellent; there are “large errors” but they are minimal in number. I’ve already sent him a report to help cleanse the database, though I didn’t compare it with that of Robien…likely we saw the same things since they were very obvious. These errors should not detract from the effort…with >200,000 data points it is obvious that there would be some. For ChemSpider we have the same problem…with >10 million structures there are errors…lots of them. But it’s very useful all the same!

In a recent blog post PMR commented on the quality of ChemSpider while focusing appropriately on the issue of quality in aggregated datasets. He commented specifically on a search performed on Sodium Chloride and the records located, shown here. His comment was that the third of these, the structure of Na2Cl2, was “rubbish”. PMR commented “There is little point in collecting 10 million structures if you cannot rely on any of them”. We’re struggling with the issue.

So here’s the question…is it rubbish? PMR commented of data collections that “…there is junk in the historical record. And there is junk in some of the links donated. That may be where the Na2Cl2 for sodium chloride came from.” But it’s not junk. It’s chemistry. The existence of Na2Cl2 microclusters has been reported (1). The sodium chloride dimer is also in the NIST WebBook and indexed into PubChem as record cid=6914545.

PMR went on to comment “But Pubchem is not, and should not be, a data repository except for NIH data. But nor should any other organisation try to aggregate all the data. What we should do is pool the metadata (InChIs, names, etc.) in pubchem and develop links and searches to distributed repositories and datasets elsewhere.” I love the vision!

Here’s the definition of what ChemSpider is trying to do from the What is ChemSpider page “There are tens if not hundreds of chemical structure databases and no single way to search across them. There are databases of curated literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data and on and on. The only way to know whether a specific piece of information is available for a chemical structure is to have simultaneous access to all of these databases. Since many of these databases are for profit there is no way to easily determine the availability of information within these commercial or even in the open access databases. With ChemSpider the intention is to aggregate into a single database all chemical structures available within open access and commercial databases and to provide the necessary pointers from the ChemSpider search engine to the information of interest. This service will allow users to either access the data immediately via open access links or have the information necessary to continue their searches into commercially available systems. The question “is there specific information about my chemical” will be answered. Accessing the information may require a commercial transaction with the appropriate provider.”

Our intention is to do exactly as suggested: “pool the metadata (InChIs, names, etc.) and develop links and searches to distributed repositories and datasets elsewhere.” It’s already started.

One comment…the molecular weight for Na2Cl2 in the record on ChemSpider IS incorrect. This is listed as a known bug. It’s fixed…we just need to recalculate properties for 10 million compounds. It’s underway.
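For anyone who wants to check the expected value by hand, the molecular weight of Na2Cl2 is simply twice that of NaCl, about 116.88 g/mol. Here is a minimal sketch of that calculation in Python. The function `molecular_weight` and its tiny atomic-weight table are hypothetical and for illustration only; this is not the property calculator ChemSpider uses, and it handles only simple formulae without brackets or charges.

```python
import re

# Rounded standard atomic weights (g/mol) for the few elements used here.
ATOMIC_WEIGHT = {
    "H": 1.008,
    "O": 15.999,
    "Na": 22.990,
    "Cl": 35.45,
}


def molecular_weight(formula):
    """Sum atomic weights for a simple formula such as 'Na2Cl2'."""
    total = 0.0
    # Each match is an element symbol followed by an optional count.
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


dimer = molecular_weight("Na2Cl2")  # 2 * (22.990 + 35.45)
monomer = molecular_weight("NaCl")
```

With these rounded weights the dimer comes out at twice the 58.44 g/mol of the monomer, which is a quick sanity check for any record showing the Na2Cl2 structure.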