Archive for November, 2009

My friend and colleague Sean Ekins and I wrote a perspective for the RSC’s Lab on a Chip journal and it was released as an Advance Article, as Free Access, this evening.

The perspective is entitled “Precompetitive preclinical ADME/Tox data: set it free on the web to facilitate computational model building and assist drug development”. The title is self-explanatory in terms of what we are trying to communicate. The paper is online now and available here.

ADME

I have been at the German Conference for Cheminformatics in Goslar for the past three days. I twittered the conference using #goslarcheminf, and it appears that there was little interest in twittering here…seems like it’s an “American” thing to do. I gave a presentation entitled “ChemSpider – Building a Foundation for the Semantic Web by Hosting a Crowd Sourced Databasing Platform for Chemistry” and have put it on SlideShare here. The abstract for the talk is below, as well as the embedded SlideShare widget for the talk. This talk was a lot less rushed than usual…not just 20 minutes, and I personally enjoyed giving it. Commonly I feel that the talks I give are very rushed and I only get to scratch the surface of what we are up to with ChemSpider. It’s amazing how an additional 15 minutes allowed me to expand on the issues and the work. The presentation drew a lot of questions and attention after the session, and I’m hoping that many of the discussions regarding collaboration and depositions of new data come to fruition.

Abstract

There is an increasing availability of free and open access resources for chemists to use on the internet. Coupled with the increasing availability of Open Source software tools we are in the middle of a revolution in data availability and tools to manipulate these data. ChemSpider is a free access website for chemists built with the intention of providing a structure centric community for chemists. It was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge.

There are tens if not hundreds of chemical structure databases covering literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data etc., and no single way to search across them. Although a large number of databases containing chemical compounds and data were available online, their inherent quality, accuracy and completeness were lacking in many regards. The intention with ChemSpider was to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data, experimental properties and links to other valuable resources. It has grown into a resource containing over 21 million unique chemical structures from over 200 data sources.

ChemSpider has enabled real time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including with activity data) and transaction-based predictions of physicochemical data. The social community aspects of the system demonstrate the potential of this approach. Curation of the data continues daily and thousands of edits and depositions by members of the community have dramatically improved the quality of the data relative to other public resources for chemistry.

This presentation will provide an overview of the history of ChemSpider, the present capabilities of the platform and how it can become one of the primary foundations of the semantic web for chemistry. It will also discuss some of the present projects underway since the acquisition of ChemSpider by the Royal Society of Chemistry.

If you read this blog you will be aware that it can take a lot of time just to get a single chemical curated against its correct associations of chemical names and synonyms. I’ve shown this for vancomycin, Taxol (1,2,3) and Ginkgolide B, and it is presently underway with Digitonin, though not yet complete. Working on one structure is hard enough. Building a database of a few thousand curated structures is difficult work, yet the EBI did it, and did it well, when they built ChEBI. ChEBI is not perfect either, as we discovered working on vancomycin, and I still find occasional small issues.

The EBI recently released the ChEMBL database. This is a much bigger resource, as described at the home page for the resource here. The site states “ChEMBL is a database of ca. 500,000 bioactive compounds, their quantitative properties and bioactivities (binding constants, pharmacology and ADMET, etc). The data is abstracted and curated from the primary scientific literature and the data made available due to funding by the Wellcome Trust.” It is MUCH harder to curate larger databases, and half a million records is a challenge.

I downloaded the data from the FTP site and browsed through it. There are definitely structures in the data file that we don’t have in ChemSpider, but I found an issue with charge balance for many hundreds of records where the counterions were charged (for example, chloride or bromide) but the primary component was neutral. An example is here, where the compound is named as a hydrochloride but the structure carries the chloride anion alongside a neutral primary component. I think this likely arises from treatment with some type of standardizer, so it should be a matter of changing the standardizer settings and regenerating. We deal with over 23 million compounds and have been through such issues ourselves when it comes to generation of structure images.
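To make the charge balance issue concrete, here is a minimal sketch of how such records could be flagged. It is not the procedure used by ChEMBL or ChemSpider. It assumes the structures are available as SMILES strings and simply sums the formal charges written on bracket atoms. The function names and the example SMILES are hypothetical and chosen purely for illustration; a proper cheminformatics toolkit would be the more robust choice for real data.

```python
import re

# Contents of one SMILES bracket atom, e.g. "NH3+" from "[NH3+]".
BRACKET_ATOM = re.compile(r"\[([^\]]*)\]")
# A charge token inside a bracket atom: a run of signs, optionally
# followed by digits (covers "+", "-", "++", "--", "+3", "-2").
CHARGE = re.compile(r"([+-]+)(\d*)")


def net_formal_charge(smiles: str) -> int:
    """Sum the formal charges written on bracket atoms (hypothetical helper).

    Assumption: in SMILES a charge can only appear inside square brackets,
    so a '-' outside brackets is a bond and is ignored here.
    """
    total = 0
    for atom in BRACKET_ATOM.findall(smiles):
        match = CHARGE.search(atom)
        if match is None:
            continue  # bracket atom with no charge, e.g. "[13C]"
        signs, digits = match.groups()
        magnitude = int(digits) if digits else len(signs)
        total += magnitude if signs[0] == "+" else -magnitude
    return total


def is_charge_balanced(smiles: str) -> bool:
    """Return True when the whole record has a net charge of zero."""
    return net_formal_charge(smiles) == 0


if __name__ == "__main__":
    # Illustrative SMILES only; these are not taken from the ChEMBL file.
    examples = {
        "neutral amine + chloride anion": "CCN.[Cl-]",
        "protonated amine + chloride anion": "CC[NH3+].[Cl-]",
        "neutral amine + neutral HCl": "CCN.Cl",
    }
    for label, smiles in examples.items():
        status = "balanced" if is_charge_balanced(smiles) else "UNBALANCED"
        print(f"{label}: {smiles} -> {status}")
```

The first example mirrors the pattern described above: a charged counterion next to a neutral primary component, which leaves the record with a net negative charge.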

For an example of a rich record in ChEMBL take a look at this record showing the target, assay, activity type, value and reference all listed. ChEMBL is sure to be an invaluable reference for the Life Sciences.

I have never met Warren DeLano, but I have respected him from afar for a long time. Warren is the developer of PyMOL, an Open Source molecular visualization system that has made enormous contributions to the community and can produce stunning visualizations of proteins. His impact on the field of protein visualization has been recognized many times by the community and his tools are used in labs all over the world. He has garnered respect across our community.

A few months ago I had the opportunity to spend an hour on the phone with him after he had made such positive comments when the RSC acquired ChemSpider. We talked about Open Science, Open Source and models of business. We talked about the adventure of trying to change the world one step at a time by making our humble contributions to the world of science. By the end of our conversation I knew that when I met Warren we would be able to talk for many more hours, as we shared many common views and, primarily, a desire to make a difference.

Today I learned of the sad news that Warren had passed away. Despite the fact that I hadn’t yet managed to sit with Warren face to face, I was immediately saddened. My truth is that there is a specific type of shock I feel when someone younger than myself passes away. Warren and I talked about the impact of our chosen career paths on our relationships with our wives and the hours spent in front of a screen instead of with those we share our lives with. We both reflected on the fact that we have given too much to the keyboard over the years, driven by our need to make a difference. Warren’s hard work and superior programming skills are paralleled by the fact that he was clearly a charitable contributor to science, giving his code away to the world, and was, even based on only one phone call, a kind man.

My thoughts go out to his wife and family for their loss.