I was in an exchange with a friend this weekend about his interest in depositing data onto ChemSpider. Due to our travel schedules and family commitments we rarely talk by phone. This gentleman is a retired chemist, though still highly active. He is an expert in nomenclature, has an incredible eye for quality, and is a master curator of chemistry databases.

So, he is very interested in ChemSpider and the potential of exposing his databases. However, his expressed concern is that he will lose the value of all the effort he has invested in developing the databases. Again, these are manually curated with an expert's eye and, based on my experience working with him, are of the highest quality. They amount to tens if not hundreds of hours of work and are a source of revenue for this gentleman.

With this in mind, and based on other blog posts I have seen, it appears that we have not clearly defined the intention of ChemSpider. What we are NOT doing is aggregating all data from all publicly available data sources or even supplied databases. Our intention for the immediate future is to form a structure-centric environment linking out to the original data source providers via the chemical structure. The individual providers continue to provide their content and retain their value proposition.

For example, the NIST WebBook is a container for a lot of information, including spectral data. As discussed in another post about the sodium chloride dimer, ChemSpider will provide the link to the WebBook to display relevant data for this gas phase species. A search for diazepam will provide links out to all original data sources, as shown here, and they include ChemBank, ChEBI, NIST WebBook and many others.

ChemSpider is an aggregator of chemical structures and associated identifiers (enabling connectivity to other sites). We are NOT duplicating all content available at other sites. This removes the burden of keeping associated data up to date across multiple data sources, since individual providers curate and update their own sources. It also keeps ChemSpider on the task of linking together multiple sources of data via chemical structures rather than grabbing the work of other groups and reposting it.

So, back to my friend who is worried about depositing data on ChemSpider. All we will be taking delivery of are the structures, the structure IDs (if available) and a link to information about the database. In this way we are directing ChemSpider users to rich sources of information to pursue as they see fit. Just as many depositions into the public online databases come from chemical vendors hoping to sell their materials, the same model applies to database providers. After all, if information content is of value, it is up to the user to choose to pay for the right to access it.
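To make the deposition model a little more concrete, here is a minimal sketch in Python of what such a record and a structure-keyed link index might look like. This is only an illustration and not how ChemSpider is actually implemented: the class names, field names, the use of a plain string as the structure key, and the example.org URLs are all assumptions made up for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Deposition:
    """One deposited record: a structure, an optional provider ID, and a link back."""
    structure_key: str               # assumed: some canonical structure string
    source_name: str                 # name of the depositing database (hypothetical field)
    source_url: str                  # link to information held by the provider
    source_id: Optional[str] = None  # the provider's own structure ID, if available


class LinkIndex:
    """Maps each structure to the providers that hold information about it."""

    def __init__(self):
        self._by_structure = defaultdict(list)

    def deposit(self, record):
        # Only the structure, ID and link are stored; the provider keeps its content.
        self._by_structure[record.structure_key].append(record)

    def links_for(self, structure_key):
        # Return (provider name, URL) pairs for a structure; empty list if unknown.
        records = self._by_structure.get(structure_key, [])
        return [(r.source_name, r.source_url) for r in records]


# Placeholder data: the structure keys and URLs below are invented.
index = LinkIndex()
index.deposit(Deposition("STRUCTURE-A", "Curated DB", "https://example.org/curated", "C-001"))
index.deposit(Deposition("STRUCTURE-A", "Vendor Catalog", "https://example.org/vendor"))
index.deposit(Deposition("STRUCTURE-B", "Curated DB", "https://example.org/curated", "C-002"))

for name, url in index.links_for("STRUCTURE-A"):
    print(name, url)
```

The point of the sketch is that nothing beyond the structure, the provider's ID and a link is held centrally. Whatever sits behind the link, free or paid, stays with the provider.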

Taking this one step further, one has to consider the following question. For the large database providers (Beilstein – now MDL, Derwent, CAS, Cambridge Crystallographic Databases, DiscoveryGate and others), why not put their structure collections into the public domain for the purpose of searching and connecting back to the actual content of value? The structures themselves, as far as I know, are in all cases in the public domain since they are published (I might be wrong here but I cannot find statements to the contrary). The value comes from the information associated with the structures: one or more publications, reaction details, experimental or predicted properties, connection to a patent, and other such associated content.

What’s at risk in providing public access to the structure database(s) for searching and charging the appropriate fees to access the information once identified? There is little value in simply knowing that a structure exists in a database, is there? Isn’t it the associated information that has value? If this wasn’t true then that would suggest that a large database of algorithmically generated structures created with something like MolGen or the structure generator in Structure Elucidator would have value. In fact it does: see the work of Reymond et al. in their “Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F”. The value, however, comes not from the computer-generated structure itself but rather from the virtual screening response.

I judge there are two challenges: a decision at the management level to expose the large structural repositories, and the enormous hurdles in migrating certain classes of chemical structures to SDF format to be hosted by general services, specifically polymers, organometallic complexes and inorganics (also all challenges for ChemSpider!). I think the primary challenge is the decision to expose the data. I judge it’s the right decision to make given the increasing availability of Open Access databases such as ChemSpider. It’s a BIG decision …
