ALPSP Publishing Innovation award 2010

Some of the team were present at the ALPSP Conference last Thursday – as the envelope was opened to announce ChemSpider as the winner of the ALPSP Publishing Innovation award for 2010! The judging panel commented that “[ChemSpider] has quickly become a highly valued and comprehensive community resource and has immense potential for future development”.

We’re especially proud as we were up against the other excellent shortlisted finalists of DataSalon’s Mastervision (which was highly commended, and we use it ourselves), the Semantic Biochemical Journal from Portland Press and the University of Manchester, and the AIP’s UniPHY social networking site.

We also managed to recreate the prize giving with Antony & Valery this morning – difficult to recreate the atmosphere of a conference dinner at 9am on an autumn Monday morning though…

Laying out intrinsically three-dimensional molecular structures in a readable way on a two-dimensional page is a hard problem for human beings, let alone for algorithms, which is why ChemSpider stores a 2D layout alongside the InChI, which only describes which atom is connected to which other atom.
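To make the point about connectivity concrete, here is a minimal sketch that splits an InChI string into its layers. The function name `split_inchi_layers` is hypothetical, and the parser is deliberately simplified: it assumes each layer prefix appears only once, which is not true of every InChI. Nothing in the layers carries atom coordinates, which is why a 2D layout has to be stored separately.

```python
from typing import Dict


def split_inchi_layers(inchi: str) -> Dict[str, str]:
    """Split an InChI string into its layers (simplified illustration)."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    # The first two fields are the version and the molecular formula.
    version, formula, *rest = inchi[len("InChI="):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # Every later layer starts with a one-letter prefix such as 'c' or 'h'.
        layers[layer[0]] = layer[1:]
    return layers


# Standard InChI for ethanol.
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
layers = split_inchi_layers(ethanol)
print(layers["c"])  # connection layer: which atom is bonded to which
print(layers["h"])  # hydrogen layer: how many hydrogens sit on each atom
```

The connection layer for ethanol says only that atom 1 is bonded to atom 2 and atom 2 to atom 3; where those atoms should be drawn on a page is left entirely to the layout.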

This is a really valuable resource for enhancing our RSC journal articles, so we’ve been experimenting with adding galleries to compounds, with examples here and here. Is this what PDFs you download from the website should look like? Would a digest gallery of the latest articles published be more useful? Do let us know.

As this is my first posting on the ChemSpider blog I should introduce myself. I’m Colin Batchelor and I’m in the Informatics team at the RSC. Some of my work is on ChemSpider, but I also work on informatics for RSC Publishing, and I’m a member of the InChI subcommittee.

We are building ChemSpider into the world’s leading resource for chemistry on the internet, and have a position for another team member at the RSC offices in Cambridge, UK.

We’re looking for someone with established cheminformatics and programming skills (including SQLServer, C#, ASP.NET, AJAX experience) to join a small team who work in both the UK and US. They need to have a track record in working in the field of cheminformatics, have knowledge of handling chemical structures, experience in working with web-based systems and, of course, have a big appetite for making a difference and working with a fast-moving team. The job holder will develop new database applications and tools to bring creators and users of public chemistry data together, and this position offers an unrivalled opportunity to contribute to this development. Join us and help change the way chemistry data is used.

Details here; closing date 1 April (really!)

We’ll be at the ACS Spring Meeting in San Francisco next week, so if you’re there and want to find out more, catch me (Richard Kidd) or one of the team at the RSC stand #310; I’ll be there when not in the CINF ‘Future of Scholarly Communication’ sessions.

I am shamelessly lifting this post from the wonderful blog of Cameron Neylon by way of advertising a symposium that will happen at the end of this week. Do come along if you are on the West Coast and want to hear about the changes going on in the world.

“One of the great things about being invited to speak that people don’t often emphasise is that it gives you space and time to hear other people speak. And sometimes someone puts together a programme that means you just have to shift the rest of the world around to make sure you can get there. Lisa Green and Hope Leman have put together the biggest concentration of speakers in the Open Science space that I think I have ever seen for the Science Commons Symposium – Pacific Northwest to be held on the Microsoft Campus in Redmond on 20 February. If you are in the Seattle area and have an interest in the future of science, whether pro- or anti- the “open” movement, or just want to hear some great talks you should be there. If you can’t be there then watch out for the video stream.

Along with me you’ll get Jean-Claude Bradley, Antony Williams, Peter Murray-Rust, Heather Joseph, Stephen Friend, Peter Binfield, and John Wilbanks. Everything from policy to publication, software development to bench work, and from capturing the work of a single researcher to the challenges of placing several hundred million dollars’ worth of drug discovery data into the public domain. All with a focus on how we make more science available and generate more and innovative. Not to be missed, in person or online – and if that sounds too much like self promotion then feel free to miss the first talk… ;-)”

I’m Aileen Day and I’m a member of the Royal Society of Chemistry’s Informatics team, who are working with ChemSpider. We can loosely be defined as chemists who have picked up enough computer programming to make our lives and those of people around us a bit more exciting and less tedious. Our job is to develop new tools to help viewers and authors of articles in our journals, and our publishing editors. Probably the most high profile example of these tools is the development of Project Prospect.
So the long-term plan for us and ChemSpider is to fully integrate Prospect (and RSC publications) with ChemSpider so that a user can seamlessly bounce back and forth between finding compounds of interest using the ChemSpider search and selection tools and finding more information about them in our journals amongst other sources. Also, to improve the functionality and content of everything we can along the way (ChemSpider, Prospect etc.).
As a first step of this I’m currently developing a way to automatically deposit the primary (most important) compounds in our prospected articles into ChemSpider, with publication information about the RSC article, including a link back. I’ll keep you posted as we make progress…

Over the past three years I’ve carried a double-edged sword on the ChemSpider Blog: the honor and the burden.

As anyone who runs a blog would likely tell you, hosting a blog can take a lot of time and effort, especially if you are passionate about communicating. Fortunately, since ChemSpider was acquired by the Royal Society of Chemistry we now have a lot more people involved with the platform, including support staff and our colleagues in the Cambridge, UK-based Informatics team. Since we are working hard to further integrate various processes, systems and projects it makes sense that more of the team discussing our activities around ChemSpider can post here. In particular there are a number of activities going on regarding the technical aspects of ChemSpider development that will start to show up on this blog and we encourage your participation, comments and feedback. ChemSpider is, after all, about community participation so do engage us!

Over the next few days a number of my colleagues will introduce themselves to the readers of this blog. I welcome them all to the “honor and the burden”…it’s a pleasure to share this space with them.

There have been other comments about Wolfram Alpha and its support for Chemistry (1,2 and others) but I have remained rather quiet until now about my experiences with Alpha for a couple of reasons. First of all I’d rather let the service settle down a bit before poking at it too hard. My experiences of going live with ChemSpider were definitely that it takes a while to stabilize the system and address some of the earliest feedback. Also, knowing that I would be at Scifoo and aware that Theodore Gray would be there I had hoped to see Alpha in action. I wasn’t disappointed. Yesterday Theodore drove the system in front of an audience including a number of interested scientists, members of Google, and Peter Murray-Rust and myself from Chemistry. Theo had no fear…essential for live demos. He was asked questions and he took the plunge, did the search and with the rest of us celebrated a successful search, a weird result and one that was just plain wrong. It was ALL good. I am impressed. I am impressed by what they are out to achieve with Wolfram Alpha. I am convinced that what they are doing with Alpha will contribute to science and mathematics in general and that Chemists will be using this system when they have more awareness of it.

For a general intro to Alpha see the presentation here.

So, some examples of interesting searches:

1) A guy in the room had asked the question “What is the largest land mammal?” and had not received an answer a few weeks earlier. Now Theo posed that question and got the answer here. Nice! Now, I took that to mean that they were keeping logs of failed queries and tweaking…confirmed by Theo. VERY nice.

2) Peter Murray-Rust had previously blogged about bad results from his searches (searching on dibromoethane for example). When he repeated his searches in the session hosted by Theo he acknowledged that he was pleased that they had fixed the issues he had previously blogged about. This is how modern systems should be…moving quickly.

3) Searching on names…for example, what is the number of people with my name…my spelling is Antony NOT Anthony. See here for the results.

4) What is the return per employee for Google versus IBM. It’s in this query:

5) What are the chemical structures of Taxol? Methamphetamine? Cholesterol? Buckminsterfullerene? You get answers for all. The organic molecules all give images of chemical structures. The connections in all cases are correct but I see no evidence of stereochemistry anywhere across the chemical structures. That doesn’t mean it’s not there but I couldn’t find it.

So, for chemistry, am I impressed? Yes I am. I’m not worried right now that Alpha is not dealing with stereochemistry…I am sure they will layer that on later. It is clear based on most of the results that I have seen that there is some GOOD curation of the data going on. According to Theo there are chemists on staff and they are curating the data coming in. Hallelujah! If you look in the Source Information for Taxol you see a LONG list of sources of chemical source information and the primary source is the Wolfram Alpha Curated Data.

There is much that can be done to help Wolfram Alpha to have better Chemistry. They have a HARD job ahead of them if they are going to sample the Public Databases to grab quality chemistry. It’s in there for sure but it’s hard to find. What could come out of ChemSpider and Wolfram Alpha working together?

1) If we could get the list of “compounds” in Wolfram Alpha then we can provide chemical compound connection tables with all necessary stereochemistry etc.

2) When we pass back the compound list then we can pass back ChemSpider IDs and get them listed as identifiers alongside the PubChem CID. In theory it would be good to get these linked back to ChemSpider so that a user can come and find associated articles, analytical data, the wikipedia article, predicted and experimental properties and so on. This is where ChemSpider’s integration would be of value.

3) There is an opportunity to expand the chemistry in Wolfram Alpha by passing a subset of ChemSpider compounds to be added to Alpha. Certainly I don’t think that Alpha should host all 21.5 million of our compounds for the reasons I have enumerated many times on this blog. See my last post about the 54 versions of the Taxol skeleton…there should be only one Taxol. But, there may be a way to subset “important chemistry” and get it into Alpha. OR, maybe they do want it all?

There are clearly opportunities to help expand the chemistry and I hope we have the chance. I think Alpha is incredibly ambitious. But why not be ambitious? ChemSpider was ambitious too and look what we have done with three servers in a basement…it’s a whole lot fewer resources than Wolfram are throwing at Alpha. I want them to be successful…a computational engine for the public. Why not….so many of us are asking questions using search engines right now and can’t get anywhere near an answer…

We continue to expand the ChemSpider Database with new depositions sourced from various collaborators. We are especially privileged to have received the RSC’s structure collection associated with their Project Prospect articles and have spent a couple of weeks working with the data prior to depositing onto ChemSpider. During the deposition process we have formed the link between the chemical structures and their articles via a DOI link. We have been able to deposit the title, an associated author and the DOI. In this way we have been able to link thousands of chemical structures to articles on the RSC website. On each record associated with an RSC article you will see both a link from the data source table and a link via DOI from the reference as shown here and in the figure below.

With the RSC depositions came many beautiful structures – highly symmetric, complex and just plain “pretty” to a chemist. But a high level of complexity also arrived with the collection. While many InChIs could be converted to their associated connection tables, the act of converting the InChIs could add additional stereochemistry and structure cleaning could change stereochemistry, so this was a long, tedious and mostly manual process I’m afraid. Nevertheless, a wonderful addition to the ChemSpider database and our sincere thanks, on behalf of the community too, to the Royal Society of Chemistry for sharing their data with us. The InChIs will be deposited into the InChI Resolver shortly.

Jean-Claude Bradley has recently posted about an NMR Game running on Second Life. Read his blog for details but I excerpt some of the comments here:

“Andy and I brainstormed some new chemistry games that we could introduce to Second Life to leverage our recent tools. One of the applications is the NMR game. By combining the orac molecule rezzer, the SL spectral viewing tool and ChemSpider Open Data spectra I think we have a pretty good game.

The idea is simple: click on the molecule that is represented by the spectrum. If it is correct you get 2 points and get another spectrum. You lose a point by clicking on an incorrect molecule. After going through all the spectra your score gets posted on the web to a top10 list. For equal scores the best time takes it.”
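The scoring rules quoted above can be sketched in a few lines. This is only an illustration of the rules as described, not the actual Second Life implementation; the names `Result`, `score_clicks` and `top_list` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List

CORRECT_POINTS = 2  # a correct click earns 2 points (from the quoted rules)
WRONG_PENALTY = 1   # an incorrect click loses 1 point (from the quoted rules)


@dataclass
class Result:
    player: str
    score: int
    seconds: float  # total time taken to work through all the spectra


def score_clicks(clicks: Iterable[bool]) -> int:
    """Total score for a sequence of clicks, True meaning a correct molecule."""
    return sum(CORRECT_POINTS if ok else -WRONG_PENALTY for ok in clicks)


def top_list(results: List[Result], size: int = 10) -> List[Result]:
    """Highest score first; for equal scores the best (shortest) time wins."""
    return sorted(results, key=lambda r: (-r.score, r.seconds))[:size]


board = top_list([
    Result("ann", score_clicks([True, True, False]), 95.0),
    Result("bob", score_clicks([True, False, True]), 80.0),
    Result("cat", score_clicks([True, True, True]), 120.0),
])
print([r.player for r in board])
```

Here ann and bob both finish on 3 points, so bob’s faster time puts him ahead of ann, with cat on 6 points at the top.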

So, here at ChemSpider we are delivering spectra as Open Data to help with the game. And we’re happy to do so. It’s always been our intention to have ChemSpider provide value like this. ANY registered user can upload spectra to the ChemSpider website. The details are outlined here (I just noticed the interface has changed since I wrote that but you should still be able to follow the process). We need the spectra to be in JCAMP format and if you want them to be available for the game, and for people to download, they MUST be declared as Open Data.

Right now we have 100s of spectra. You can find them here. But we need more. Much more! We’d like you to contribute them. If you don’t want to upload them yourself then contact us directly and we will process and upload them for you. We need the data and the name/structure of the associated molecule.

And how will the game be used on these spectra? The game will be used to “curate and validate” the spectra. As the game is being played a score of how many people say it is correct will be kept. And of course what is wrong. Based on these scores our curators will be directed to “problematic spectra” for their attention. This is true crowdsourcing and a great way to do spectral validation.
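As a rough sketch of how such a tally might direct curators, the snippet below counts correct and incorrect answers per spectrum and flags those that players mostly get wrong. The function name and both thresholds (`max_agreement`, `min_votes`) are assumptions made for illustration; the post does not say how the real scores are evaluated.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def problematic_spectra(
    votes: Iterable[Tuple[str, bool]],
    max_agreement: float = 0.5,  # assumed cut-off, not from the post
    min_votes: int = 5,          # assumed minimum before a spectrum is judged
) -> List[str]:
    """Return IDs of spectra whose share of correct answers is suspiciously low."""
    right: Counter = Counter()
    total: Counter = Counter()
    for spectrum_id, is_correct in votes:
        total[spectrum_id] += 1
        right[spectrum_id] += int(is_correct)
    return sorted(
        sid for sid, n in total.items()
        if n >= min_votes and right[sid] / n <= max_agreement
    )


votes = (
    [("spectrum-1", True)] * 5
    + [("spectrum-2", False)] * 4 + [("spectrum-2", True)]
    + [("spectrum-3", False)] * 2
)
flagged = problematic_spectra(votes)
print(flagged)
```

A spectrum that most players fail to match may be mislabelled or of poor quality, but it may equally just be a hard one, so the output is a worklist for a human curator rather than a verdict.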

We would like the spectral collection to grow and welcome contributions from anyone. They do NOT have to be just NMR. They can be IR, MS, Raman etc too. Ultimately a Spectral game will be unveiled. Please consider ChemSpider as a repository for your data as it will benefit the community of chemists and, in particular, the process of teaching students and allowing them to “game their way” through the process. Watch where this goes…it’s VERY interesting to consider how it can improve…there is an NMR game website in development so you won’t have to go just to Second Life.

Today I had the privilege of meeting with many members of the team creating the RCSB Protein Data Bank. This resulted from the wonderful networking opportunity offered by the Scifoo camp held earlier this year at Google where I met Helen Berman, director of the PDB team, part of the worldwide Protein Data Bank. Helen and I shared some conversations sitting outside the Google offices in California and shared our opinions and visions regarding the quality of small molecule data available online. Today was an opportunity to take those conversations further, meet with members of the team and determine whether ChemSpider’s efforts could bring benefit to the PDB in terms of our curation efforts and whether ChemSpider users could benefit from having access to information on the PDB via hosting of the PDB ligand dictionary.

I gave a presentation (online here and based on others I have delivered previously) and received a one on one review of the deposition and curation processes of the PDB, as well as participating in a group discussion about how to continue the stringent and exacting process of validation and curation associated with small molecule structure sets. We discussed the complex relationships between systematic names, trivial names, registry IDs, database IDs, tautomers, charged states, SMILES and InChIs. It was a particularly validating day to spend time with a group of people who have responsibility for building one of the most valuable resources in the world and have faced the many challenges associated with validating structure-based data. There is a distinction between people who talk about what it takes to curate structure collections and those who actually do the job for a living. This team is made up of dedicated, passionate and skilled individuals who deeply care about the quality of their data and who do the heavy lifting and grunt work so that the users of the PDB enjoy the benefits. They have been working on a multi-year process to curate and improve the PDB data and are in the final major phase of the effort to clean up the archive and apply the processes to all new data moving forward. ChemSpider and PDB will be more integrated in the near future and we look forward to supporting their efforts for providing high quality structure data to the community and continuing to expand the network of integrated online chemistry.

I think the press release here, and copied below, speaks for itself…When I posted the blog about the need for an InChIKey Resolver it resulted in a great discussion and series of comments. Since that time I’ve had many discussions with interested parties about the need. The RSC and ChemSpider share a mutual view regarding the need for the InChI resolver and we are honored to be entrusted to develop a resolver for the community. Will it be “the” resolver? Only time will tell. There are various ways to deliver a system to do this so we’ll start here and garner feedback. There are many ways to “hunt a Welshman” (I can say that since I’m Welsh!) so there may be other efforts to deliver a resolver coming too.

“RSC and ChemSpider develop InChI Resolver

01 December 2008

An InChI Resolver, a unique free service for scientists to share chemical structures and data, will be developed by a collaboration between ChemZoo Inc., host of ChemSpider, and the Royal Society of Chemistry. 

Using the InChI – an IUPAC standard identifier for compounds – scientists can share and contribute their own molecular data and search millions of others from many web sources. The RSC/ChemSpider InChI Resolver will give researchers the tools to create standard InChI data for their own compounds, create and use search engine-friendly InChIKeys to search for compounds, and deposit their data for others to use in the future. 

The future of publishing

‘The wider adoption and unambiguous use of the InChI standard will be an important development in the way chemistry is published in the future, and the further development of the semantic web,’ comments Robert Parker, Managing Director of RSC Publishing. 

The InChI Resolver will be based on ChemSpider’s existing database of over 21 million chemical compounds and will provide the first stable environment to promote the use and sharing of compound data. ‘ChemSpider hosts the largest and most diverse online database of chemical structures sourced from over 150 different data sources’ adds Antony Williams of ChemSpider, ‘We have embraced the InChI identifier as a key component of our platform and the basis of our structure searches and integration path to a number of other resources. We have delivered a number of InChI-based web services and, with the introduction of the InChI Resolver, we hope to continue to expand the utility and value of both InChI and the ChemSpider service.’ 

Society support

‘As a learned society publisher it is important that RSC provide support for the standard and contribute to the development of the resolver, which promises to be a valuable service for the chemical science community.’ continues Parker, ‘our collaboration with ChemSpider on this project will enable this to be delivered quickly and sustainably.’ 

The imminent adoption of the InChI generation protocol will be a welcome and necessary step to the wider adoption of the InChI standard. “
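An InChIKey is a hash of the InChI, so it cannot be converted back into a structure; a resolver is essentially a lookup table from key to InChI. The sketch below illustrates that idea only. The class `InChIResolver` and its methods are hypothetical and are not the RSC/ChemSpider service, and the format check assumes the 27-character standard InChIKey layout (14, 10 and 1 uppercase letters separated by hyphens). Keys are supplied precomputed; the code does not generate them.

```python
import re
from typing import Dict, Optional

# Assumed layout of a standard InChIKey: 14 letters, 10 letters, 1 letter.
INCHIKEY_RE = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


class InChIResolver:
    """Toy in-memory resolver mapping InChIKeys back to full InChI strings."""

    def __init__(self) -> None:
        self._by_key: Dict[str, str] = {}

    def deposit(self, inchikey: str, inchi: str) -> None:
        """Register a key/InChI pair; the key must already have been computed."""
        if not INCHIKEY_RE.fullmatch(inchikey):
            raise ValueError(f"malformed InChIKey: {inchikey!r}")
        self._by_key[inchikey] = inchi

    def resolve(self, inchikey: str) -> Optional[str]:
        """Return the InChI deposited under this key, or None if unknown."""
        return self._by_key.get(inchikey)


resolver = InChIResolver()
# Standard InChIKey and InChI for ethanol.
resolver.deposit("LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(resolver.resolve("LFQSCWFLJHTTHZ-UHFFFAOYSA-N"))
```

Because the key has a fixed length and contains no punctuation beyond hyphens, it is also the form that web search engines can index reliably, which is the “search engine-friendly” property the press release refers to.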

The Journal of Visual Experiments (JoVE) is “an online research journal employing visualization to increase reproducibility and transparency in biological sciences.” Recently, one of our collaborators (and one of my friends) Jean-Claude Bradley from Drexel University was “published” in JoVE with an article entitled “Optimization of the Ugi Reaction Using Parallel Synthesis and Automated Liquid Handling”. ChemSpider has been hosting some of the UsefulChem data as discussed elsewhere on this blog and it’s great to watch J.C. and his students continue to get exposure for their work in this new way. He is certainly leading the path forward for Open Notebook Science.

A lot of people have been helping to improve the quality of ChemSpider content by depositing new data and “cleaning up” errors in the data over the past few months. It’s been a long climb. Our thanks to all of you who have contributed. I’ll be the first one to put my hand up and acknowledge that in some ways I have not made the act of contributing to the curation process very easy since I’ve been feeding the data out via the blog in chunks, as it has developed. Following a recent “long flight” I am happy to announce that the Curators Handbook/Bible is now available in its first form online here. This document gives some pretty detailed guidance regarding how to curate the ChemSpider database. As always we welcome feedback. If something is not clear let us know and we will expand/enhance as appropriate.

What I also want to do is to thank those people who have commented on how truly impressed they are with the rate at which we are cleaning the data. In general most curation requests identified on the site are addressed within 24 hours. There are some issues hanging out there that we don’t have solutions for at present, specifically in regards to organometallic data handling, but we are still thinking about a path forward.

I recently started a discussion with the users of ChemSpider about how they use our system. There have already been two responses and I am hoping for more. Having sat in on an IUPAC InChI meeting in Washington last week I can honestly say that it was one of the most functional and on-task meetings I have sat in on in a long time. Decisions were made about how to move forward with the next release of the InChIKey and “standard versions” of both the InChIString and InChIKey.

The meeting has prompted the question how do you use InChI? For what purpose do you use InChI and do you use only the string? Do you use it for communication purposes and structure exchange? Do you use it in your internal databases? Is it a primary path to deduplication? What settings do you use for the InChIString?

I’m interested in how you are using InChI and how important it has become for you. Comments welcomed.
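For readers unfamiliar with the deduplication use mentioned above, here is a minimal sketch of the idea: records that share an InChI string are treated as the same compound and their names are merged. The function name `deduplicate` and the sample records are made up for illustration, and the approach assumes every InChI was generated with the same settings, since strings produced with different options will not compare equal.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def deduplicate(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Merge (inchi, name) records so each InChI appears once with all its names."""
    merged = defaultdict(set)
    for inchi, name in records:
        merged[inchi].add(name)
    return {inchi: sorted(names) for inchi, names in merged.items()}


records = [
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "ethanol"),
    ("InChI=1S/CH4/h1H4", "methane"),
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "ethyl alcohol"),
]
unique = deduplicate(records)
print(len(unique))  # two distinct compounds from three records
```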

I am very proud of the response from our user base to my request for assistance with curating ChemSpider in regards to carbohydrates. Carbohydrates are complex in nature. They can be represented in linear form and cyclic form, they exist in ChemSpider with a common name but no defined stereochemistry, there are pentoses, hexoses and many stereoisomers per skeleton. There are MANY common carbohydrates with trivial names: Ribose, Arabinose, Xylose, Lyxose, Allose, Altrose, Mannose, Gulose, Idose, Galactose, Talose.

Carbohydrates have been very challenging for us at ChemSpider…many depositors have not been careful with the  association between the chemical structure and the associated identifiers. With a chemical structure as the primary key on a record we find confusing associations with structures. For example, a search on Maltotriose as an identifier turns up 5 structures on ChemSpider. Maltotriose is defined on Wikipedia as “trisaccharide (three-part sugar) consisting of three glucose molecules linked with 1,4 glycosidic bonds.” This should mean that it is not appropriate for the identifier maltotriose to be associated with this structure. The registry number associated with this structure should be deleted also based on Wikipedia as a resource. How many of the other identifiers should be deleted? Maybe all???

Looking at this record we see identifiers such as: alpha-D-Glc-(1->4)-alpha-D-Glc-(1->4)-D-Glc; alpha-D-Glc, O-alpha-D-glc; GLC-(4-1)GLC-(4-1)GLC-(4-4)GTE and O-alpha-DD-Glucopyranosyl-(1->4)-O-alpha-D-glucopyranosyl-(1->4)-D-glucose. Are these appropriate for this compound?

The challenge for maltotriose is therefore to identify the CORRECT structure associated with that name. “Maybe” it is the structure on Wikipedia but don’t forget that we have an effort underway to validate the structures on Wikipedia and make sure they are correctly associated with the monograph title. Is Maltotriose an identifier for a unique stereoconfiguration or is there alpha- and beta-maltotriose?  I am not sure. What needs to be determined is the correct association between structures and identifiers. Incorrect associations should be removed so that they do not turn up the incorrect structures in ChemSpider when searched.

This is the start of the validation process for carbohydrates…it’s iterative, complex and hard work. It’s going to begin with giving the group of interested parties curator powers on ChemSpider and asking them to work on this challenge. We welcome their assistance. The efforts of contributors like this will be essential.

A link to the presentation I gave at ACS-Philly yesterday in Rajarshi Guha’s session is provided below. A lot changes between writing an abstract and writing a talk so I had the chance to expose an increasing number of papers ALREADY using ChemSpider as one of their platforms of choice to source information from.

Can a Free Access Structure-Centric Community for Chemists Benefit Drug Discovery?

ChemSpider is an online database of over 20 million chemical structures assembled from well over a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) In-silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples. QSAR analyses utilizing the ChemModLab environment for assessing quantitative structure-activity relationships will and screening using a molecular surface descriptor model.

Link to presentation

I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) PubMed is structure searchable from ChemSpider…we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction of documents SUBMITTED to our site – Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration to ChemSpider

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, deposit their structures to ChemSpider under embargo with the article title, author list and “fractional abstract”. When a publication goes live the author can login, associate a DOI or a URL for the publication (for non-DOI based Open Access publishers) and the structures and article details get lifted from embargo and are immediately available for searching to the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities

6) If we received community support for this it could be game-changing.
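The embargo workflow in point 4 can be sketched as a small state model: a deposition stays hidden until the author associates a DOI or URL with it. This is an illustration of the proposal as described, not ChemSpider code; the class `Deposition`, its fields, the helper `searchable` and the DOI value used in the example are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deposition:
    title: str
    authors: List[str]
    structures: List[str]                    # e.g. validated InChI strings
    publication_link: Optional[str] = None   # DOI or URL, set on publication

    @property
    def embargoed(self) -> bool:
        # A deposition is under embargo until a publication link is attached.
        return self.publication_link is None

    def publish(self, doi_or_url: str) -> None:
        """Associate the published article, which lifts the embargo."""
        link = doi_or_url.strip()
        if not link:
            raise ValueError("a DOI or URL is required to lift the embargo")
        self.publication_link = link


def searchable(depositions: List[Deposition]) -> List[Deposition]:
    """Only depositions that are out of embargo are visible to public search."""
    return [d for d in depositions if not d.embargoed]


paper = Deposition("An example article", ["A. Author"], ["InChI=1S/CH4/h1H4"])
before = len(searchable([paper]))
paper.publish("10.0000/example-doi")  # placeholder DOI, for illustration only
after = len(searchable([paper]))
print(before, after)
```

Deriving the embargo state from the presence of the publication link, rather than storing a separate flag, means the two can never disagree.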

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation

ChemSpider has been working hard to support Wikipedia for a number of months now. We have been curating the structures on Wikipedia, I have been an active member of the WP:Chem team, we have extended our integration of Wikipedia to show the lead of the Wikipedia article on associated record views, and we have a lot of background activities going on regarding Wikipedia at present (info will be released shortly). New articles are released on Wikipedia on an ongoing basis and we stay up to date as best we can by monitoring bots for updates. Harvesting monographs out of Wikipedia based only on ChemBoxes and Drugboxes is not sufficient, since not every article about drugs and chemicals on Wikipedia has an associated Drugbox or ChemBox.

For example… You have likely heard of Rember for Alzheimer’s already? A search on Google for Rember Alzheimers will give about 2 million hits. It’s already being discussed in the blogosphere, including on Derek Lowe’s In the Pipeline. Rember turns out to be methylene blue. There is already an article on Wikipedia about Rember but there is no chembox as yet. As I was researching Rember out of interest I noticed we did not have methylene blue linked to Wikipedia and Rember wasn’t associated with methylene blue. Adding the name was of course easy…5 seconds’ work after login.

We have now added the ability to associate data sources directly too. What does this mean? On a record view page is a list of “Data Sources” associated with a compound. This is where depositions about a compound came from and, generally, links back to the associated web pages. Previously, in order to populate the Data Source table, it would be necessary to deposit the structure and associated info as an SDF file. TOO MUCH work. So, now we have made it easy. To add a data source simply log in and select “Edit” (top right hand side of the data source table).
To add a new data source simply click Add and input the information into the pop-up box. The input is the name to be listed in the Data Source table, the URL to the information on the Data Source page (if info exists) and the name of the Data Source. There is one caveat to adding such links: the data source must exist. If you want to add data associated with your own website you need to register yourself, add a Data Source and wait for us to approve it. Wikipedia is a special case since, when the link is made, we grab the lead of the article directly and show it in the Record View. For methylene blue there are two related Wikipedia articles so we have linked to them both, as you can see on the record view. Simply go to ChemSpider and search for rember and you’ll see two linked Wikipedia articles.
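For readers who think in code, the three inputs and the "data source must exist" caveat can be pictured roughly as below. This is only an illustrative sketch in Python, not how ChemSpider is implemented: the `DataSourceLink` class, the `validate_link` function and the contents of `APPROVED_SOURCES` are all made up for the example.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import urlparse

# Hypothetical registry of approved data sources; the names are illustrative only.
APPROVED_SOURCES = {"Wikipedia", "PubChem"}


@dataclass(frozen=True)
class DataSourceLink:
    display_name: str  # name to be listed in the Data Source table
    url: str           # link to the information on the source's page (may be empty)
    source_name: str   # must match a registered, approved data source


def validate_link(link: DataSourceLink) -> List[str]:
    """Return a list of problems; an empty list means the link is acceptable."""
    problems = []
    if not link.display_name.strip():
        problems.append("display name is required")
    if link.source_name not in APPROVED_SOURCES:
        problems.append(
            f"data source '{link.source_name}' is not registered and approved"
        )
    if link.url:
        parsed = urlparse(link.url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append("URL must be an absolute http(s) address")
    return problems
```

A link pointing at an approved source passes with no problems, while a link naming an unregistered source is rejected until that source has been added and approved.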

For those of you who have been following the discussions of Stevan Harnad, Peter Suber and others regarding institutional repositories and Open Access, you will already be up to speed regarding OA mandates and what they could mean in terms of access to data. Rather than go into this area in detail myself I point you specifically to Stevan Harnad’s site to review ongoing discussions there (there are multiple parties exchanging views).

What I am going to do, though, is point you to this comment on Peter Suber’s blog regarding “Stanford Opens Access to All Its Education Studies”.

Specifically, the following comments are of interest “Under Stanford’s new policy, only the author’s final, peer-reviewed copy of the article would be posted online —in some cases, potentially months before the printed version becomes available….By early fall, the education school plans to have a Web site in place where the articles will be posted and archived in a searchable database. With approximately 50 scholars on Stanford’s education school faculty, the site could accumulate as many as 100 articles a year, by Mr. Willinsky’s estimate.”

Stanford is not alone in this type of shift. What does this mean for indexing of articles and availability for searching in terms of the work we are doing with ChemSpider right now (1,2,3)? Text-indexing of chemistry articles would simply mean turning our spider onto the repository. Using the tools we have available now and the database of 21 million compounds and associated dictionary we could also convert the chemical names to structures and make the articles searchable by both text and structure BEFORE publication, in theory, months before. With the work that is already underway on Open Access articles on ChemSpider and SOON to be unveiled, we could also provide tools for authors to mark up their own documents. My preference, as for many others, is that authors of Chemistry articles use semantic authoring tools to allow us to grab the appropriate information from the articles for linking as well as provide a path for semantic connectivity.
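To make the dictionary look-up idea concrete, here is a deliberately tiny sketch in Python. It is not our production text-mining pipeline: the `NAME_TO_RECORD` dictionary, the record identifiers and the `index_text` function are all invented for illustration, and real name-to-structure conversion handles far more than whole-word matches.

```python
import re

# Hypothetical dictionary: lower-cased chemical name -> record identifier.
# The identifiers are made up; two names can resolve to the same structure.
NAME_TO_RECORD = {
    "methylene blue": "record-001",
    "rember": "record-001",  # trade name resolved to the same record
    "taxol": "record-002",
}


def index_text(text):
    """Return {record_id: sorted names found} for dictionary names present in text."""
    hits = {}
    lowered = text.lower()
    for name, record in NAME_TO_RECORD.items():
        # Whole-word match so a name does not fire inside a longer token.
        if re.search(r"\b" + re.escape(name) + r"\b", lowered):
            hits.setdefault(record, []).append(name)
    return {record: sorted(names) for record, names in hits.items()}
```

Run over a sentence such as "Rember turns out to be methylene blue.", both names resolve to the same record, which is what lets an article become searchable by structure as well as by text.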

The question then is whether or not ChemSpider can index institutional repositories or authors’ self-archived collections on their university research group websites. The authors’ self-archived collections will be very valuable but are, of course, the most likely to upset the publishers. We’d like to do both.

I envisage a time when articles are indexed and searchable even before they are published and indexed by others. Why not? If there are changes to the article between pre- and post-publication both can be indexed.

We welcome your comments! Anyone want to introduce me to the host of an institutional repository?

I’ve been looking at various forms of communication to help people understand a little more about ChemSpider. I am presently investigating the production of online movies to assist users in understanding how to use the system to full effect and hope to roll out a few examples shortly. In parallel I’ve been looking at podcasting technology.

Serendipitously I was approached by Nature to be involved in one of their podcasts and went through the experience with them. Though you’d never know it from the podcast, it was done during vacation while trying to balance the energy of our boisterous twin boys in the room with a background noise of the ocean crashing on the shore. There are worse ways to be involved in a podcast for sure…balancing the nice view of the sea with two little boys desperately trying to stay quiet, and the professionalism and speed of Geoff Brumfiel at Nature, made this a very pleasant experience.

If you’re interested you can check out the podcast here. Based on the feedback we might add podcasting as one more way to communicate with our users. Thoughts?

Last week I had a pleasant chat with a reporter from Nature magazine, a Mr Geoff Brumfiel. Geoff was interested in ChemSpider…what it was, how it ran, who used it, who supported it, who liked it, who curated it, who didn’t like it and so on.

The results of that discussion, and others he spoke to about ChemSpider, are here in his article.

Chemists spin a web of data p139
Chemspider website provides free information on millions of molecules.
Geoff Brumfiel
It is a rule at Nature, at least for this type of article, that I could not see the article before it went to press and therefore I didn’t get the chance to proofread and comment. Geoff has accurately captured the spirit of our discussions but a few detailed clarifications are needed too. I have pasted in black the article content and in italics the clarification.

providing the community with an open-access source of chemical information

I giggled and commented please don’t say it’s Open Access. Say it’s Free Access. Say there are Open Data. And now we have Creative Commons licenses. But don’t say it’s Open Access, not Strong, not weak, not gold, not green. Just Free Access. No price barriers to usage.

Chemist Antony Williams is hoping to change this in a move likely to ruffle the feathers of the American Chemical Society.

I commented that we are not purposely in competition with anyone. It’s not what drives us to do this. Whether others see us as competitive is for them to decide, not us. We don’t intentionally try to ruffle feathers. That doesn’t mean that what we are doing won’t ruffle feathers of course, whether it’s ACS or others. It’s not the intent, but it might be an outcome.

The modest project has made chemists interested in open access take notice — last week, the number of daily users of the site surpassed 5,000.

We have crossed 5500 users for the past two nights. The trend is positive.

“Other potential sources of information, such as Wikipedia, lack the algorithms needed to search chemicals according to their structure.”

Structure searching is “feasible” of course with InChI Strings. But substructure searching isn’t, and Wikipedia is treated as a text-based search by almost all of its users.

“The site is maintained with modest profits from advertising and the work of about 30 active volunteers who double-check the data pulled in from outside.”

The original investment in hardware and software costs has finally been recouped. Modest profits? No one gets paid for the work we do. There is a phenomenal sweat equity investment in the platform, numbering many thousands of hours, to get here. We are indebted to the many software collaborators, providers of tools and the people curating and depositing to the system. There have BEEN about 30 active volunteers. Right now I would say the number of active depositors and curators is around 10. But it is growing. I hadn’t checked the number of REGISTERED users for a long time. We have over 1150 registered users…those who CAN login and curate data, deposit data, see new features etc. People do NOT have to register to use the site…but >1150 did. Wow. I didn’t know it was that many until I just checked (BIG SMILE).

““There’s an awful lot of chemical information, but there’s an awful lot of rubbish as well,” says Barrie Walker, a retired industrial chemist in Yorkshire, UK, who helps maintain the site.”

Don’t know whether Barrie said this or not. He IS an honest guy and he is our QUALITY GURU and we are proud that he is willing to give us his fine eyes. There IS garbage on the site still. But, after a year online and active curating, it has been much reduced. About 200 edits a day are made to the site: names changed/deleted/added, spectra/structures/URLs/Publications added etc. It’s quite the pace. We have cleaned up 100s of thousands of incorrect associations from the external data sources. It’s been and will remain an enormous task with an enormous payback for the community.

Williams adds that the site still has problems with certain searches. For example, it struggles to distinguish between isomers: molecules with the same chemical formula arranged in different structures.

We can distinguish isomers no problem. The PROBLEM is that there is a mixture of isomeric species submitted from multiple data sources and data are mixed and intermingled in a way that the user cannot get to the correct structure. Search taxol or Ginkgolide on the ChemSpider blog and read the multiple blog posts about this. We can of course search all isomers for a particular chemical formula…
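As a toy illustration of searching all isomers for a particular formula, grouping records by molecular formula is enough. The sketch below is Python with a handful of textbook compounds; the `RECORDS` list and the `isomers_by_formula` function are made up for the example and say nothing about how the ChemSpider database is organised.

```python
from collections import defaultdict

# Illustrative records: (name, molecular formula). Not taken from ChemSpider.
RECORDS = [
    ("ethanol", "C2H6O"),
    ("dimethyl ether", "C2H6O"),
    ("butane", "C4H10"),
    ("isobutane", "C4H10"),
    ("benzene", "C6H6"),
]


def isomers_by_formula(records):
    """Group names by formula, keeping only formulas with more than one structure."""
    groups = defaultdict(list)
    for name, formula in records:
        groups[formula].append(name)
    return {formula: names for formula, names in groups.items() if len(names) > 1}
```

Finding the isomers is the easy part. The hard part described above is that names and data deposited for one isomer end up attached to another, and no amount of grouping fixes that without curation.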

“But Williams nevertheless believes that the service may be able to compete with for-profit services. “What I’m doing is highly disruptive,” he says. “I think it can be done and it needs to be done.””

I think what WE are doing…it’s not I, it’s we…is disruptive. In a good way. Many chemists will benefit. Will it have an impact on for-profit services? Yes, maybe. As an outcome but not as the target. Our team of people, both internal to ChemSpider’s development and Advisory Group, and the people we don’t even know who are cleaning and depositing into the system for their colleagues in the community, are creating a powerful resource for Chemists. The FOCUS of this effort is to Build a Structure Centric Community for Chemists. We will change that soon…the focus on Structure-Centric will be to cover Chemistry in general and to Build a Community for Chemists.

We are well on our way, and thanks go to Nature, and Geoff in particular, for exposing it. My comments above are not meant to detract from Geoff’s reporting abilities but it was a long discussion and some clarification statements are of value, I believe.

Over the past few weeks I have had a few discussions with a member of the ChemSpider Advisory group regarding a concept to create WiChempedia. I’ve enjoyed these conversations with Alex Tropsha (professor and Chair in the Division of Medicinal Chemistry and Natural Products in the School of Pharmacy, UNC-Chapel Hill). We are like-minded in a number of ways but specifically in what can be done to facilitate delivery of quality information to the chemistry community.

As you will notice if you frequent this blog I am rather a stickler for accuracy and quality (1,2,3). I think it’s important (4). Over the past few weeks I’ve spent more time looking at the quality of data on Wikipedia and trying to figure out the best way to bring together our efforts on ChemSpider to enhance the capabilities of integrated information and to support the quality efforts being made by the WP:CHEM team. I also intend to facilitate the development of our own Wiki environment for chemistry and to generally enhance the tools available to chemists, not only for Wikipedia-type annotation but also to support Open Notebook Science.

Now, I don’t want to reinvent the wheel. Wikipedia has a lot of what is necessary in terms of being a known system, a following of people and committed supporters in the WP:CHEM team. What I have been hoping for was a shift around structure and substructure searching on the MediaWiki platform but I know that is a tough request as the platform is not built for that type of thing. The InChIKey holds some promise for exact structure searching but does not offer an opportunity for substructure searching without a lookup across a larger database. I want to facilitate information and data sharing further. I do want to provide the type of service that Wikipedia does in terms of general information but also layer cheminformatics tools onto that knowledge and information, allow addition of analytical data, analysis tools, real time predictions and analysis ultimately. This platform should certainly be wiki-enabled.
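The point about exact versus substructure searching can be shown in a few lines of Python. The three strings below are the standard InChIKeys for benzene, phenol and toluene; the page titles, `PAGES_BY_INCHIKEY`, `find_exact` and `shared_prefix_length` are just an illustration, not anything that exists on MediaWiki or ChemSpider.

```python
# Standard InChIKeys for three small aromatics; the page titles are illustrative.
PAGES_BY_INCHIKEY = {
    "UHOVQNZJYSORNB-UHFFFAOYSA-N": "Benzene",
    "ISWSIDIOOBJBQZ-UHFFFAOYSA-N": "Phenol",
    "YXFVVABEGXRONW-UHFFFAOYSA-N": "Toluene",
}


def find_exact(inchikey):
    """Exact structure search: a plain key look-up, which any text platform can do."""
    return PAGES_BY_INCHIKEY.get(inchikey)


def shared_prefix_length(a, b):
    """Count how many leading characters two keys have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count
```

Looking up the benzene key finds the Benzene page straight away. But the key is a hash of the structure, so although a benzene ring sits inside both phenol and toluene, their keys share not even a first character with benzene's. Substructure searching needs the structures themselves and a proper search engine behind them.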

Decision made. Our intention is to deliver wiki-capabilities in ChemSpider and to use the Open Content associated with chemicals and drugs on Wikipedia inside the system. We will then provide an environment for people to continue to add to, enhance and curate the Wikipedia content as well as add their own. Last night (and well into the early morning) I spent some time talking to Martin Walker from WP:CHEM regarding my concerns that we might offend the Wikipedians with our efforts and that I did not want them to feel that we were ripping off their hard work but rather have our efforts seen as supportive and enabling. My intention, as we work through downloading the data, is to check, validate and correct what is sitting on Wikipedia directly, for the benefit of the community. Also, we will of course need to leave all Wikipedia content under the appropriate licensing for others to use. Martin commented that there are tens of mirrors of Wikipedia out there, ripped purely for the purpose of exposure and getting ad revenue. We are not working from that model…our intention, as usual, is to build a structure centric community for chemists, and with so much excellent work done on Wikipedia I want to take advantage of it and also give back through the work we will do.

Two domain names have been grabbed for this project: WiChempedia, for compatibility with Wikipedia, and also WeChempedia, to emphasize the community aspects of the project.

If you frequent this blog you will recall that we have made a commitment to Microsoft SharePoint as our future platform for wiki’ing ChemSpider. That is where we believe this work will be done ultimately but we don’t have the platform in our hands yet.

The Xmas vacation is going to be full of holiday movies and manual examination and curation of the Wikipedia data. Wish us luck!

Yesterday I announced the availability of the MassSpec web services for ChemSpider and, less than 24 hours later, I am happy to announce that it is already integrated. Egon Willighagen, one of the members of our Advisory Group, has already reported on his integration to ChemSpider with the intention of speeding up metabolomics analysis. He has used Taverna, a workflow and pipelining tool to set up his workflows. What’s good to see is how easy this was for him to do …well, I assume it was easy since he didn’t need to consult with us. We released the MassSpec web service and voila, he was integrated.

This is what is happening with our other web services too. A number of organizations are now integrated to ChemSpider and using the services on a daily basis.

Recently there was a commentary made about the “highly curated data” on Wikipedia. To me curators are heroes. They are detail oriented, committed to the cause and simply “care”.

As a result of reading that post you saw me go off and check on Taxol, post a few comments and come out the other end of the work with a “more highly curated record” on Wikipedia.

Then I commented that there are better ways to ensure the quality of structure drawings than redrawing them…specifically dictionary look-up and optical structure recognition.

I don’t mind being taken to task on my opinions. As my late father said…“Opinions are like nostrils, everybody has them”. Okay, the body cavity was a little more south but you get the point. However, this opinion stirred me…

“If you wish to spend your life recording typos in chemical documents, I hope it is fulfilling.”

Now, sometimes when you are stirred emotionally, it helps to sit down and think about it.


So, I’ve thought about it… and I’m happy about where I’ve ended up.

My life IS fulfilling. I might need therapy for this particular passion but I DO actually enjoy checking typos in “documents” – of course our conversations are about chemical documents (structures) and I DO confess I like it. Why? I care about Quality.

When I see an acknowledgment that Wikipedia is highly curated and I know I have contributed to that, I have a certain pride in having contributed to community science. Those of us cleaning up the historical record for others to benefit are doing a lot of the grunt work that others talk about being necessary while espousing the need for platforms to do so. You can throw a palette of colors and a brush on a floor but someone has to pick it up and do something with it. Platforms, tools, visions are great…we need thinkers but we also need doers. Doers are important and necessary and people who find typos in chemical documents likely do find it fulfilling. I’m a thinker and a doer. Until I have experienced the challenges of curating historical records I do not feel I am sufficiently immersed in the challenge. Oh…there’s another nostril (opinion).

So, who are my heroes? Some of them in this domain are:

1) Barrie Walker, ChemSpider Advisory Group member and our KING OF QUALITY.

2) Ann Richard, EPA, founder of the DSSTox effort and quality guru extraordinaire. Ann and her team have taken on the task of assembling, from various sources (and of various quality levels), a public resource of incredible value to the Tox community. This paper explains in detail. With her fine eye and commitment to detail (checking CAS numbers to the digit, stereochemistry of each bond and the accuracy of the chemical names), her databases are likely the cleanest and most highly curated databases from any government labs (no intention to offend others here and if your DB is as good as DSSTox you are my heroes too!). In particular I acknowledge Marti Wolf from Ann’s lab who has spent thousands of hours assembling data, “recording typos in chemical documents” and correcting them to the benefit of the community.

3) People like Peter Corbett. He really seems to care about what’s in a database and the quality of what’s there. He is discovering these issues by observation and checking. His careful eye, clearly necessary for the development of OSCAR, makes him a hero (I look forward to meeting him!)

4) The people I worked with at ACD/Labs in the database compilation office are heroes. This group of 10s of individuals has, over the years, manually curated 100s of thousands of structures and associated properties (Physchem parameters, NMR shifts, name-structure pairs). They have done it with a fine eye. THEIR efforts were the basis of what led to industry-leading NMR prediction algorithms, which were used recently to provide feedback to the Blue Obelisk team member, Christoph Steinbeck, to help clean up errors in the NMRSHIFTDB. While others were attacking the open data effort those of us concerned with the details helped curate the data.

5) The curators at CAS, at MDL (now Symyx), at GVKBio, and in software houses and labs all over the world who manually curate data, and, from their experience, build robots to help their processes and improve the data for all.

For all of you who wish to spend your life recording typos in chemical documents, it is likely very fulfilling if you care about quality.

I find it fulfilling. It’s a necessary part of understanding the problem. Quality is hard to define. But, we’ve been challenged on the quality of our science on ChemSpider enough. We’ve been challenged for sodium chloride dimers and shown it’s valid science. We’ve been challenged for logP prediction of Calcium Carbonate and had an industry great acknowledge our attention to detail. We’ve been challenged on inorganic chemistry and compared ourselves to others.

We Monkeys have been told to close the gates of ChemZoo. We didn’t. Instead we are doing great things for the community, I hope. We have opened up a series of services that the Open Access world likes (specifically the Blue Obelisk players), we are donating our database to PubChem shortly, and we are working with some of the best people on our advisory group to satiate their needs. It’s pretty damn fulfilling.

* I will acknowledge that the comment “If you wish to spend your life recording typos in chemical documents, I hope it is fulfilling.” is removed from the context of the entire post. So read the post. Then read all the others I’ve mentioned. I made my interpretation of the comment based on the ongoing flavor. Maybe my nostril was clogged…

The ChemSpider team is a small group of passionate individuals. We all have day jobs. ChemSpider is innovated, extended and maintained during evenings and weekends, with the support of friends, family, collaborators and chemists wanting to make a difference. Here’s a disclaimer regarding one member of the team from the recent Press Release when ChemSpider integrated ACD/Labs properties: “Disclaimer: ChemZoo, Inc., is founded by Dr. Antony Williams, who is also serving as VP and Chief Science Officer for ACD/Labs. ChemZoo and ChemSpider services are not affiliated with Advanced Chemistry Development, Inc., (ACD/Labs) and are developed independently of ACD/Labs initiatives.”

Antony Williams…that’s me. To be clear, I work at ACD/Labs. I have just celebrated 10 years of employment with them, and am proud to have done so.

ChemSpider is a passion project, one that the team involved in producing it really cares about. It is NOT our “job”. It is something we want to do. We have a lot of the necessary skills to make a valuable contribution to the chemistry community so we are going to do what we can to make a difference. We will likely make mistakes along the way…we’ve already had many successes. Having worked at a commercial chemistry software company for 10 years it is clear that we will not make everyone happy – we will have our evangelists and our critics. We will navigate the community feedback and provide value where we can with the intention of making a difference. ChemSpider went live on March 24th…only one month ago. To date we have successfully performed thousands of searches and transactions for chemists around the world on a beta release. That is a result we are proud of. What is to come in the next month will only add to the value of ChemSpider. Watch this space…