My friend and colleague Sean Ekins and I wrote a perspective for the RSC’s Lab on a Chip journal and it was released as an Advance Article, as Free Access, this evening.

The perspective is entitled “Precompetitive preclinical ADME/Tox data: set it free on the web to facilitate computational model building and assist drug development”. The title is self-explanatory in terms of what we are trying to communicate. The paper is online now and available here.


I have been at the German Conference for Cheminformatics for the past three days. The conference is in Goslar. I twittered the conference using #goslarcheminf and it appears that there was little interest in twittering here…seems like it’s an “American” thing to do. I gave a presentation entitled “ChemSpider – Building a Foundation for the Semantic Web by Hosting a Crowd Sourced Databasing Platform for Chemistry” and have put it on SlideShare here. The abstract for the talk is below as well as the embedded Slideshare widget for the talk. This talk was a lot less rushed than usual…not just 20 minutes and I personally enjoyed giving this talk to the audience. Commonly I feel that the talks I give are very rushed and I only get to scratch the surface of what we are up to with ChemSpider. It’s amazing how an additional 15 minutes allowed me to expand on the issues and the work. The presentation drew a lot of questions and attention after the session and I’m hoping that many of the discussions regarding collaboration and depositions of new data come to fruition.


There is an increasing availability of free and open access resources for chemists to use on the internet. Coupled with the increasing availability of Open Source software tools we are in the middle of a revolution in data availability and tools to manipulate these data. ChemSpider is a free access website for chemists built with the intention of providing a structure centric community for chemists. It was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge.

There are tens if not hundreds of chemical structure databases such as literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data etc. and no single way to search across them. Despite the large number of databases containing chemical compounds and data available online, their inherent quality, accuracy and completeness were lacking in many regards. The intention with ChemSpider was to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data, experimental properties and linking to other valuable resources. It has grown into a resource containing over 21 million unique chemical structures from over 200 data sources.

ChemSpider has enabled real time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including with activity data) and transaction-based predictions of physicochemical data. The social community aspects of the system demonstrate the potential of this approach. Curation of the data continues daily and thousands of edits and depositions by members of the community have dramatically improved the quality of the data relative to other public resources for chemistry.

This presentation will provide an overview of the history of ChemSpider, the present capabilities of the platform and how it can become one of the primary foundations of the semantic web for chemistry. It will also discuss some of the present projects underway since the acquisition of ChemSpider by the Royal Society of Chemistry.

JC has given a great overview of how students might want to use ChemSpider for the purpose of chemical information retrieval on the internet. JC’s course lecture thoroughly exercises ChemSpider, in real time, to do searches across the internet. He posted his seminar to Scivee here and I have embedded the lecture below. It’s a good talk for students and I encourage you to share it and review how ChemSpider can be used in your classwork and in your laboratories.

I gave a talk today at the ICIC 2009 meeting here in Sitges, Spain. It is an interesting meeting and I will report on some of the presentations later. I’m glad I am here. The presentation is here on Slideshare and is a modified version of a presentation I gave on Saturday at the Microsoft eScience conference in Pittsburgh. One of the questions that followed the presentation was in regards to whether ChemSpider could be used as a measuring stick for quality (I am paraphrasing). My response was that there are millions of errors on ChemSpider. That seemed to raise a giggle, and other people since then have seemed surprised.

In my opinion, as shocking as it sounds, it must be true. Why?

There are almost 23 million unique chemical entities on the database. Many of them have multiple names associated, experimental properties, many have 10s of links to external databases. The structural layout has been created using algorithms. Algorithms have been used to generate systematic names. There are spectra submitted by the public and they can be mis-referenced, as an example, or declared to run in one solvent and ACTUALLY run in another. There are sometimes multiple registry numbers associated with a compound…a CAS number for a salt associated with the neutral compound for example. The multiple links out to external resources number in the 10s of millions and these are changing daily as other websites and databases curate and edit their data. Errors are inevitable and, I judge, there must be millions of errors on ChemSpider. Just as there must be millions on Wikipedia and in the search results you get back from Google. The question is what counts as an error? I’m using a broad stroke brush for an error…a structure with a poor depiction is an error. A misspelling is an error. A dead link to a database is an error. So…definitely millions. But we continue our work to whittle down the number, with the assistance of the community, every day. But we’re doing it while we are depositing new compounds onto the database so it’s an interesting challenge. Millions of errors doesn’t make ChemSpider less useful…we’re just realistic about the magnitude of the challenge!

Last week I had the pleasure of being on an agenda with a number of people whose work I applaud and who I genuinely enjoy spending time with and sharing thoughts about “what if?” Martin Walker, one of the people I collaborate with on Wikipedia, invited me to speak in his session “Publishing and Promoting Chemistry in the Internet Age”. Martin gave an introduction to the session and spoke about Chemistry on the Internet. Beth Brown gave an overview of the Chemist’s Toolkit for Publishing and Promoting your work on the Internet. I followed with an overview about what’s going on with ChemSpider and the issues of connectedness and quality of chemistry on the internet. JC Bradley spoke about transparency and Open Notebook Science. My hat’s off to Martin for arranging the speakers in that order. Considering we didn’t coordinate our talks it was an excellent trajectory throughout the session and very much an integrated overview of activities regarding chemistry on the internet.

My talk is posted on SlideShare here and is available below. Any comments and questions are welcomed.

Beth Brown has her talk online here and JC Bradley will post his online here.

JC Bradley and I had a good talk about ways we can collaborate together more closely on Open Notebook Science. We have a path forward so that ChemSpider can provide additional support and will be discussing the path forward offline.

It was a busy week at the ACS meeting in Washington. I gave three presentations and the title, abstracts and links to Slideshare are given below:

Oops and Downs of Resolving InChIs For the Chemistry Community (Link to Slideshare)

The InChI resolver was rolled out to the community in March 2009 with the purpose of providing a centralized resource for chemists to resolve InChIs (International Chemical Identifiers). This presentation will provide an overview of the development of the underlying technologies associated with the InChI resolver, and how the resolver is being used, integrated and enhanced to provide additional value to the chemistry community. We will discuss present limitations to application of the resolver for providing access to databases and chemistry information distributed across the internet and define our vision for enhancing interconnectivity across Open databases using the InChI resolver as the glue.
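For readers who have not seen an InChI up close, here is a minimal sketch of how the string is laid out. It assumes only the published string format (the `InChI=` prefix, a version token such as `1S` for a Standard InChI, then `/`-separated layers). The function name `split_inchi` is hypothetical and has nothing to do with how the resolver itself is implemented; treating the first layer as the formula is a simplification that does not hold for every InChI.

```python
def split_inchi(inchi: str) -> dict:
    """Split an InChI string into version, formula and prefixed layers.

    Illustrative only: assumes the first layer after the version is the
    chemical formula, which is true for typical organic compounds.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    parts = inchi[len(prefix):].split("/")
    version = parts[0]                       # "1S" marks a Standard InChI
    formula = parts[1] if len(parts) > 1 else ""

    # Remaining layers start with a one-letter tag (c = connectivity,
    # h = hydrogens, ...). Tags can repeat, so keep an ordered list.
    layers = [(layer[0], layer[1:]) for layer in parts[2:] if layer]

    return {
        "version": version,
        "standard": version.endswith("S"),
        "formula": formula,
        "layers": layers,
    }


# Standard InChI for ethanol
ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol["formula"], ethanol["layers"])
```

Because the identifier is computed from the structure rather than assigned by a registry, two databases that hold the same compound can arrive at the same string independently, which is what makes it usable as the glue between them.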

ChemSpider: Building a knowledge-based community for chemists using social and data networking technologies (Link to Slideshare)

In less than 2 years ChemSpider has become one of the primary online resources for chemists providing access to an unsurpassed aggregate of free-access knowledge and data. ChemSpider was developed with the intention of providing a structure centric community for chemists that would be enhanced by data depositions, curations and annotations by the community. The system presently hosts over 21.5 million chemical compounds from over 200 data sources. Working with a network of advisors, collaborators and data providers ChemSpider has created a unique resource of integrated information for chemists. These efforts have enabled us to support the curation of the Wikipedia chemistry pages, the production of a community supported Open Access chemistry journal and provision of web services integrated to spectrometer systems distributed around the world. This talk will provide an overview of how ChemSpider utilized social and data networking to create a community for chemistry.

Building an integrated system for chemistry markup and online publishing integrated to online chemistry resources (Link to Slideshare)

The extraction of chemical entities from documents such as patents and publications has been pursued for a number of years. We wish to report on ChemMantis, an integrated system for chemistry-based entity extraction and document mark-up enabling access to the rich resource of online chemistry known as ChemSpider. We will discuss the development of the platform from its inception as a series of dictionaries to the integration of an entity extraction algorithm and its expansion to a public deposition and publishing platform for chemistry. Chemistry articles can now be deposited, marked-up and exposed to the public within a few minutes in many cases making it an ideal platform for communicating research and providing integrated access to data sources including PubChem, ChEBI, Wikipedia and Entrez.

The Spectral Game is powered by chemical structures and spectra from ChemSpider. A provisional form of our manuscript about the game is now online at the Journal of Cheminformatics here:

The Spectral Game: leveraging Open Data and crowdsourcing for education

Jean-Claude Bradley, Robert J Lancashire, Andrew SID Lang and Antony J Williams

Journal of Cheminformatics 2009, 1:9 doi:10.1186/1758-2946-1-9

Published: 26 June 2009

Abstract (provisional)

We report on the implementation of the Spectral Game, a web-based game where players try to match molecules to various forms of interactive spectra including 1D/2D NMR, Mass Spectrometry and Infrared spectra. Each correct selection earns the player one point and play continues until the player supplies an incorrect answer. The game is usually played using a web browser interface, although a version has been developed in the virtual 3D environment of Second Life. Spectra uploaded as Open Data to ChemSpider in JCAMP-DX format are used for the problem sets together with structures extracted from the website. The spectra are displayed using JSpecView, an Open Source spectrum viewing applet which affords zooming and integration. The application of the game to the teaching of proton NMR spectroscopy in an undergraduate organic chemistry class and a 2D Spectrum Viewer are also presented.
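The scoring rule in the abstract (one point per correct match, play ends at the first wrong answer) can be sketched in a few lines. This is only an illustration of that rule as stated; the function name and the idea of passing answers as two lists are assumptions, not the game's actual code.

```python
def spectral_game_score(selected, correct):
    """Return the score for one game.

    `selected` holds the molecule the player picked for each spectrum and
    `correct` holds the right molecule, in the same order. Each correct
    pick earns one point; the game stops at the first incorrect pick.
    """
    score = 0
    for picked, expected in zip(selected, correct):
        if picked != expected:
            break          # first wrong answer ends play
        score += 1
    return score


# Two correct picks, then a miss: the fourth answer is never counted.
result = spectral_game_score(
    ["ethanol", "acetone", "toluene", "benzene"],
    ["ethanol", "acetone", "benzene", "benzene"],
)
print(result)
```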

I was in Boston for two days at the Bio-IT meeting. As usual this was a long list of conversations, meetings and chance introductions that will help ChemSpider grow in reputation. I had an opportunity to sit with Peter Murray-Rust, Steve Heller and Alex Tropsha (the latter two gentlemen on our advisory group). We discussed Open Data, specifically in terms of adding spectral data into the NIST MS Database and how the data on the NIST Webbook are NOT Open Data, they are copyrighted. They are FREE to access and even download. But they are copyrighted and not Open.

Peter showed me some of his recent research. It’s research and I won’t discuss the details but I am interested in helping. Suffice it to say that I am impressed by what Peter is doing and I look forward to seeing the results as the work progresses.

Peter, Rajarshi Guha and I were speakers in a session on Open Science. I opened up with my presentation “Crowdsourcing, Collaborations and Text-Mining in a World of Open Chemistry” on SlideShare here.

Peter talked about Open Data and Semantic Data and gave live demos of CrystalEye and Chem4Word. It’s always a risk to do live demos…I’ve done 100s over the years. Peter’s Chem4Word demo failed in the session but I had seen it the day before at the Microsoft BioIT Alliance luncheon and it worked well. The concept of semantic chemistry documents is clear and the alpha version will improve in stability and functionality. I had to leave for my plane before Rajarshi spoke and did not get to see his talk.

ChemSpider was nominated for a Bio-IT Award. There were 72 nominations in total and we did not win but we were up against some very significant projects from organizations such as GSK, AstraZeneca etc. The winners are listed here. In any case…it was nice to be nominated for a Best Practices Award at the Bio-IT meeting. We clearly have some fans.

I did see some of the “Swine Flu” worries…at least two people were wearing masks. How it all started is known. Patient Zero has been identified…


I gave my talk yesterday at CShals 2009, the conference on Semantics in Healthcare and Life Sciences. It was a great meeting for me (hindered by dismal access to wireless internet as a result of Marriott’s wish to make more money from the conference organizers. They should be ashamed of themselves in this day and age!) as it was not about Chemistry, not about spectroscopy, not even about Open Data, Open Access and Open Source. It was about Semantics. I learned a lot and got to hear Tim Berners-Lee talk about where the semantic web is and where it can go and how it can be disruptive in a good way while NOT being too disruptive to layer onto what already exists. The best part of the meeting for me was the clear passion for the InChI, as well as a lot of acknowledgement that it is not perfect and cannot presently compete with molfiles, commercial systems, CAS Numbers and so on. But, people are optimistic and are waiting and supportive. Overnight I inserted a lot more information about InChIs and how they can be useful, where some of the limitations are presently, how the StdInChI has now added a new level of complexity on one hand and simplification on the other. There have already been a number of requests for a copy of the talk so it is up on Slideshare for now (and linked below). I’ll do a voice over in the next few days and upload to Scivee. I unveiled the first version of the InChI Resolver at the conference and showed it to a couple of people. The general consensus is we are heading in the right direction. The timing on this conference was good because the intention is to layer on RDF before we release at the ACS, time allowing.

As posted previously I gave a talk on Monday at the Library of Congress. This meeting was about “Making the Web Work for Science and the Impact of e-Science and the Cyberinfrastructure.” It was one of the few occasions where I looked out at the audience, about 150 people, and didn’t know anyone (well, except for the person who invited me and one of my fellow bloggers, Michael Nielsen). I gave a talk of a very different flavor. It wasn’t about ChemSpider…it was about chemistry and access to information. I provided an overview of how access to information has changed over the past 20 years for me. I talked about the challenges for publishers serving the chemistry community and how their business models are being challenged and how I empathize with the struggle to figure out how to deal with it. I talked about quality and how care must be taken when using information online. We are ALL challenged with errors – whether you consider PubChem, ChemSpider, Wikipedia or any of the other online databases they all have errors – how do you find them? Some of them are obvious and I pointed to obvious examples in the talk. I hoped to educate the attendees in regards to the value of InChI which, while not a perfect fit yet, is a great start to structure-based communication of chemistry. I think I achieved my goals there.

I publicly blessed the efforts of publishers such as the RSC and Nature Publishing Group for the efforts they are making to support InChI and improve the quality of document presentation online. I blessed CAS as a treasure trove of information and the gold standard of curated chemistry. We need them all to be successful for the sake of our science. The challenge is how to fit into the ongoing proliferation of free access to information without modifying the business models.

I also announced the ChemSpider Journal to be released this month.

The movie has been posted to SciVee and the talk is on Slideshare here (it’s already been read 67 times in 48 hours). The movie is about an hour long compared to the 25 minute presentation I gave. Not sure how that happened…maybe I was more relaxed sitting on the couch than standing in front of the crowd. I struggled to upload the movie to SciVee and received for SURE the best technical support ever for a free service. For those of you not visiting SciVee I encourage you to patronize it.