For all you Tweeters out there following Science Online, the Twitter account for Aileen and Dave at the RSC is ChemSpider.

Not to be confused with that of Antony Williams, who is still very much ChemSpiderman.

Nature, Mendeley, and the British Library are excited to present Science Online London 2010. How is the web changing the way we conduct, communicate, share, and evaluate research? How can we employ these trends for the greater good? This September, a brilliant group of scientists, bloggers, web entrepreneurs, and publishers will be meeting for two days to address these very questions.

ChemSpider will be there to hear and record what is being said. If you are going to be there look out for David Sharpe and Aileen Day.

We will of course report back on topics that pertain to ChemSpider and the greater world of chemistry publishing.

Our Article regarding “Automatic vs. manual curation of a multi-source chemical dictionary: the impact on text mining” was published in Journal of Cheminformatics as an Open Access article a few weeks ago. The link is:

Background: Previously, we developed a combined dictionary dubbed Chemlist for the identification of small molecules and drugs in text based on a number of publicly available databases and tested it on an annotated corpus. To achieve an acceptable recall and precision we used a number of automatic and semi-automatic processing steps together with disambiguation rules. However, it remained to be investigated which impact an extensive manual curation of a multi-source chemical dictionary would have on chemical term identification in text. ChemSpider is a chemical database that has undergone extensive manual curation aimed at establishing valid chemical name-to-structure relationships.

After a long week at the ACS meeting, four presentations, one poster and a two hour training…I headed over to the Lawrence Berkeley National Lab (LBNL) to give a talk. I was hosted by Jeffrey Loo. And I mean hosted. I give a lot of talks in a lot of places but it is rare to experience the level of coordination and organization that Jeffrey provided. After what was almost a 2 hour presentation and discussion I had the opportunity to share dinner with a number of people who attended the presentation and to continue our discussions. It was great…I am used to being shuttled out of the building after a presentation but being asked to have dinner and continue discussing is a much more pleasant transition…especially prior to a red eye flight home.

The presentation was recorded and will be available shortly. For now the actual presentation is given below.

It was a busy week at the ACS meeting in Washington. I gave three presentations and the title, abstracts and links to Slideshare are given below:

Oops and Downs of Resolving InChIs For the Chemistry Community (Link to Slideshare)

The InChI resolver was rolled out to the community in March 2009 with the purpose of providing a centralized resource for chemists to resolve InChIs (International Chemical Identifiers). This presentation will provide an overview of the development of the underlying technologies associated with the InChI resolver, and how the resolver is being used, integrated and enhanced to provide additional value to the chemistry community. We will discuss present limitations to application of the resolver for providing access to databases and chemistry information distributed across the internet and define our vision for enhancing interconnectivity across Open databases using the InChI resolver as the glue.

ChemSpider: Building a knowledge-based community for chemists using social and data networking technologies (Link to Slideshare)

In less than 2 years ChemSpider has become one of the primary online resources for chemists providing access to an unsurpassed aggregate of free-access knowledge and data. ChemSpider was developed with the intention of providing a structure centric community for chemists that would be enhanced by data depositions, curations and annotations by the community. The system presently hosts over 21.5 million chemical compounds from over 200 data sources. Working with a network of advisors, collaborators and data providers ChemSpider has created a unique resource of integrated information for chemists. These efforts have enabled us to support the curation of the Wikipedia chemistry pages, the production of a community supported Open Access chemistry journal and provision of web services integrated to spectrometer systems distributed around the world. This talk will provide an overview of how ChemSpider utilized social and data networking to create a community for chemistry.

Building an integrated system for chemistry markup and online publishing integrated to online chemistry resources (Link to Slideshare)

The extraction of chemical entities from documents such as patents and publications has been pursued for a number of years. We wish to report on ChemMantis, an integrated system for chemistry-based entity extraction and document mark-up enabling access to the rich resource of online chemistry known as ChemSpider. We will discuss the development of the platform from its inception as a series of dictionaries to the integration of an entity extraction algorithm and its expansion to a public deposition and publishing platform for chemistry. Chemistry articles can now be deposited, marked-up and exposed to the public within a few minutes in many cases, making it an ideal platform for communicating research and providing integrated access to data sources including PubChem, ChEBI, Wikipedia and Entrez.

The Spectral Game is powered by chemical structures and spectra from ChemSpider. A provisional form of our manuscript about the game is now online at the Journal of Cheminformatics here:

The Spectral Game: leveraging Open Data and crowdsourcing for education

Jean-Claude Bradley, Robert J Lancashire, Andrew SID Lang and Antony J Williams

Journal of Cheminformatics 2009, 1:9 doi:10.1186/1758-2946-1-9

Published: 26 June 2009

Abstract (provisional)

We report on the implementation of the Spectral Game, a web-based game where players try to match molecules to various forms of interactive spectra including 1D/2D NMR, Mass Spectrometry and Infrared spectra. Each correct selection earns the player one point and play continues until the player supplies an incorrect answer. The game is usually played using a web browser interface, although a version has been developed in the virtual 3D environment of Second Life. Spectra uploaded as Open Data to ChemSpider in JCAMP-DX format are used for the problem sets together with structures extracted from the website. The spectra are displayed using JSpecView, an Open Source spectrum viewing applet which affords zooming and integration. The application of the game to the teaching of proton NMR spectroscopy in an undergraduate organic chemistry class and a 2D Spectrum Viewer are also presented.
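The scoring rule described in the abstract (one point per correct match, play ends at the first wrong answer) can be sketched in a few lines. This is only an illustration of the rule as stated; the function name and the way answers are represented are assumptions, not the game's actual code.

```python
def spectral_game_score(picked, expected):
    """Score a run of play: one point per correct match, stopping at the first miss.

    `picked` and `expected` are parallel sequences of molecule identifiers
    (a hypothetical representation chosen for this sketch).
    """
    score = 0
    for choice, answer in zip(picked, expected):
        if choice != answer:
            break  # an incorrect answer ends the game
        score += 1
    return score


if __name__ == "__main__":
    # Two correct picks, then a miss: the fourth (correct) pick is never counted.
    print(spectral_game_score(["a", "b", "x", "d"], ["a", "b", "c", "d"]))
```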

An article entitled “Computer-assisted methods for molecular structure elucidation: realizing a spectroscopist’s dream” has been published online at the Journal of Cheminformatics (Journal of Cheminformatics 2009, 1:3). This is a review article on what’s possible with computer-assisted structure elucidation, focused in particular on the ACD/Structure Elucidator software package I was involved with during my tenure at ACD/Labs. An outline of the article is provided below.

Computer-assisted methods for molecular structure elucidation: realizing a spectroscopist’s dream

Mikhail Elyashberg, Kirill Blinov, Sergey Molodtsov, Yegor Smurnyy, Antony J Williams and Tatiana Churanova

Journal of Cheminformatics 2009, 1:3 doi:10.1186/1758-2946-1-3

Published: 17 March 2009

Abstract (provisional)


This article coincides with the 40 year anniversary of the first published works devoted to the creation of algorithms for computer-aided structure elucidation (CASE). The general principles on which CASE methods are based will be reviewed and the present state of the art in this field will be described using, as an example, the expert system Structure Elucidator.


The developers of CASE systems have been forced to overcome many obstacles hindering the development of a software application capable of drastically reducing the time and effort required to determine the structures of newly isolated organic compounds. Large complex molecules of up to 100 or more skeletal atoms with topological peculiarity can be quickly identified using the expert system Structure Elucidator based on spectral data. Logical analysis of 2D NMR data frequently allows for the detection of the presence of COSY and HMBC correlations of “nonstandard” length. Fuzzy structure generation provides a possibility to obtain the correct solution even in those cases when an unknown number of nonstandard correlations of unknown length are present in the spectra. The relative stereochemistry of big rigid molecules containing many stereocenters can be determined using the StrucEluc system and NOESY/ROESY 2D NMR data for this purpose.


The StrucEluc system continues to be developed in order to expand the general applicability, provide improved workflows, usability of the system and increased reliability of the results. It is expected that expert systems similar to that described in this paper will receive increasing acceptance in the next decade and will ultimately be integrated directly to analytical instruments for the purpose of organic analysis. Work in this direction is in progress. In spite of the fact that many difficulties have already been overcome to deliver on the spectroscopist’s dream of “fully automated structure elucidation” there is still work to do. Nevertheless, as the efficiency of expert systems is enhanced the solution of increasingly complex structural problems will be achievable.

I’m sharing this news as a public service to the community…this is excerpted from GenomeWeb

“NEW YORK (GenomeWeb News) — A bill aimed at limiting the open-access publishing policy adopted by the National Institutes of Health has been re-introduced in the US House of Representatives by Rep. John Conyers (D – Mich.), after the same legislation expired at the end of the 110th Congress.

The law would effectively overturn the policy NIH put into effect last year mandating that all NIH-funded investigators must submit electronic versions of their final, peer-reviewed manuscripts to PubMed Central within a year after they are officially published.”

More details are here…

The ChemSpider Journal of Chemistry is an experiment. We intend to demonstrate how modern web technologies can be used to dramatically enhance the type of information that can be communicated using web-based tools over standard online publishing approaches. There are some publishers who are working on delivering additional value to their readers by providing enhanced HTML articles and adding information to their articles such as InChIs to allow structure-based queries online. These publishers include the Royal Society of Chemistry with their Project Prospect and the Nature Publishing Group with their Nature Chemical Biology papers. The majority of articles presented by the commercial publishers are not of a “just-in-time” nature and are delayed by the “processes of publishing”. They are generally fairly lengthy documents and report successful results. They are commonly peer-reviewed and have endured a significant timeline from initial writing to submission, publisher processing, review and publication. Science is, however, being reported in near real-time under Open Notebook Science (ONS) initiatives. We believe that an online journal can exist between the immediate nature of blogging and wiki tools hosting ONS efforts and the more standard processes of the scientific publishers. Some publishers are already allowing online and open peer-review whereby readers provide their feedback to the author in a public forum. Papers can enter a period of online peer review and commentary during which readers provide feedback to the author(s). As a result of this process the authors can engage in public discourse with the commentators and issue a final form of the manuscript. We will offer similar facilities.

We invite manuscripts from anybody interested in exposing their work in the field of chemistry and intersecting fields. In general we expect these communications to be 1500-3000 words in length but there is no limit. We encourage submissions relating to chemistry, biochemistry and chemical biology; regarding synthesis, the analytical sciences and computational chemistry; as research, as commentaries and as questions to the community. Provided the submission relates to the domain of the chemical sciences we will find a place for it within the ChemSpider Journal of Chemistry. We encourage submissions from academia and industry, from students and senior scientists, from individuals and teams, for successful research or failed experiments. We encourage submitters to challenge us to host your manuscripts in a manner which most clearly communicates your science. This may include hosting various forms of data made available to the public as Open Data, providing visualization tools for the display of molecules, spectra, images and videos. We intend to not be constrained and to make full use of web-based tools available today and coming online tomorrow.

All articles will be Open Access articles. We will abide by the Budapest Open Access Initiative which declares “By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” Authors must agree to allow unrestricted reading, downloading, distribution, printing, searching and linking to the published work.

Over the past 2 years we believe we have demonstrated our passion for public science, our willingness to serve the community, and integrity in our actions. We hope that the ChemSpider Journal of Chemistry will provide a vehicle to all scientists operating within the domain of the chemical sciences to expose their work and interests to the community. We intend to deliver a facile process of submission and superior tools for delivery. We welcome your support and look forward to expanding the communication of chemistry.

When ChemSpider was rolled out to the world as a part of ChemZoo we always knew we would be introducing more “critters”. We are happy to announce our progress with our new development, ChemMantis. Why Mantis? Well…it’s the Markup And Nomenclature Transformation Integrated System. Fits perfectly into our zoo!

We have been working on the markup of chemistry documents for a number of months and I unveiled the first aspects of our work at the ACS meeting in Philadelphia. The presentation is available online on my Slideshare account. What we are trying to do is to use our ChemSpider platform as the foundation of a document markup system whereby chemical names are automatically identified and can either be converted to chemical structures (possible using algorithms for name-to-structure conversion) or retrieved from our ChemSpider database. We have invested a lot of effort in curating and validating the ChemSpider database of over 21.5 million unique chemical entities over the past year and are now sitting on a foundation of information allowing us to connect between chemical identifiers, chemical structures and out to rich sources such as Wikipedia and PubChem and to provide information such as chemical vendors and other online systems. ChemMantis is well and truly woven into the web of ChemSpider now.

We are now in alpha release and are adding some finishing tweaks to the markup system, the visualization elements and the workflow. You can see the immediate effects of our recent work on improving the quality of structure images in the balloon below.

We would like to test the system on YOUR documents if you are willing to participate. What we are looking for are WORD documents for already published papers. They can be Open or Closed access papers. We are not expecting copyright transfer – we want to mark up the documents and return them to you for feedback. In the process we will be testing the quality of our Dictionary, our conversions, our visualizations and our process. We welcome your support. Feel free to connect with us at infoATchemspiderDOTcom. Over the next few weeks you will hear more about ChemMantis and our contributions to text mining and markup of chemistry documents.

I’ve been in a number of conversations of late about how Mass Spectrometrists might use ChemSpider and get value from our efforts. I recently gave a short PowerPoint presentation to a group about what ChemSpider is and the types of queries that ChemSpider users can conduct today. I’ve posted the presentation to Slideshare as usual so people can access it there if they are interested.

I’ve started wrapping my head around how we could provide more value to some of our users with regard to MS, HPLC and NMR. One of the things we could do is use our text mining skills to look for NMR or MS (LCMS) articles based on the use of those terms in the title or abstract, and then use the terms as tags against the chemical structures named in the abstract/title. So, from titles such as “High-Performance Liquid Chromatographic Method for Determination of Phenytoin in Rabbits Receiving Sildenafil” from our collaborator Libertas Academica we would extract HPLC and Phenytoin and connect the article to the structure as we have done here. In this way the article would be searchable by structure and associated analytical technique and we could even look at extracting the detailed experimental approach from Open Access articles. More work but feasible. Any comments???
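To make the idea concrete, here is a rough sketch of that tagging step in Python. It is not our production text-mining code: the pattern table, the tiny compound list and the function name `tag_title` are all assumptions made up for illustration, and a real system would draw names from the curated ChemSpider dictionary rather than a hard-coded set.

```python
import re

# Hypothetical technique patterns: abbreviations are matched case-sensitively,
# spelled-out forms case-insensitively (scoped inline flags, Python 3.6+).
TECHNIQUE_PATTERNS = {
    "HPLC": re.compile(r"\bHPLC\b|(?i:high[- ]performance liquid chromatograph)"),
    "NMR": re.compile(r"\bNMR\b|(?i:nuclear magnetic resonance)"),
    "MS": re.compile(r"\bLC-?MS\b|\bMS\b|(?i:mass spectromet)"),
}

# Hypothetical stand-in for a curated name dictionary.
COMPOUND_NAMES = {"phenytoin", "sildenafil", "taxol", "paclitaxel"}


def tag_title(title):
    """Return the analytical techniques and known compound names found in a title."""
    techniques = sorted(
        name for name, pattern in TECHNIQUE_PATTERNS.items() if pattern.search(title)
    )
    words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", title)
    compounds = sorted({w.lower() for w in words if w.lower() in COMPOUND_NAMES})
    return {"techniques": techniques, "compounds": compounds}


if __name__ == "__main__":
    title = ("High-Performance Liquid Chromatographic Method for Determination "
             "of Phenytoin in Rabbits Receiving Sildenafil")
    print(tag_title(title))
    # {'techniques': ['HPLC'], 'compounds': ['phenytoin', 'sildenafil']}
```

Each (technique, compound) pair could then be stored as a tag on the compound's record, which is what would make the article searchable by structure and technique together.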

Readers of this blog will know we have a focus on enabling chemists to source information via both Open AND Closed access publishers with the aim, ultimately, of providing a way to perform structure and substructure searching of these articles. This work is well underway.

If you visit our Literature Search Page you will see that we have recently added the ACS AuthorChoice Free Access articles to the index and we will continue to index on an ongoing basis. There are very few ACS AuthorChoice articles to search, but the usual validation search, “Searching Taxol”, does turn up one hit.

Herding Nanotransporters: Localized Activation via Release and Sequestration of Control Molecules (Nano Lett. 2007 Volume 8 Issue 1 Page 221) – American Chemical Society

R. Tucker, P. Katira, H. Hess

… 1 mM MgCl, 1 mM EGTA, pH 6 .9) containing 10 micromolar taxol for stabilization and kept at room temperature (20 C). Caged -ATP and “

I’ve had a number of questions about the presentation I gave at ACS Philly last week about document markup. The phrase I keep hearing is “very disruptive” followed by the question “will authors do more work and what’s in it for them?”.

The presentation here outlines the general concept that I talked about…

The basic concept I presented is as follows, with a focus on Chemistry Articles.

A lot of effort is being expended in “text-mining” publications, post-publication, to index these articles and make them searchable not only by text but by the specific language of chemistry, chemical structures. We are specifically asking the question “why extract chemical structures from articles using chemical name conversion approaches and chemical image conversion tools when the structures in the article were ORIGINALLY machine readable?”

We are considering a system whereby authors are asked to contribute to the availability of a free online service for performing structure and substructure-based searches of chemistry articles. While the submission of journal articles is already a lot of work (I know from experience of authoring/co-authoring about 10 a year) we hope that authors will support a service whereby they can upload their own articles to a “validation and mark-up service”. The upload capabilities will support upload of the primary document, chemical structures in standard formats and supplementary information of various types (to be defined).

This system will perform the following services:

1) semi-automated markup of a document – title, author(s), abstract and additional dictionary-based terms plus the ability to use the NLM-DTD markup
2) identification of chemical names and conversion to structures in an automated fashion
3) conversion of structure IMAGES to connection tables using optical structure recognition software (either commercial or open source)
4) ask authors to confirm whether the converted structures are appropriate
5) provide a structure validation service for submitted molecules checking for “accurate representation”
6) Deposit all structures associated with an article onto ChemSpider but under embargo. Associate the article Title, authors and “abstract snippet” with all structures.
7) Issue a set of ChemSpider IDs for the author to submit to the publisher with the article
8) When a publication has passed through review the author can release the structures from embargo using a DOI or an article URL (more common for Open Access articles)
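Steps 6 to 8 amount to a small state model: a deposit is hidden until the author attaches a DOI or article URL. The sketch below is only an illustration of that idea. The class, its fields and `searchable_ids` are hypothetical names, the IDs are placeholders, and nothing here reflects how ChemSpider actually stores deposits.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EmbargoedDeposit:
    """Structures deposited for one article, hidden until the article is published."""
    title: str
    authors: List[str]
    abstract_snippet: str
    structure_ids: List[int]                # the IDs issued to the author (step 7)
    publication_link: Optional[str] = None  # DOI or article URL, set on release

    @property
    def is_public(self) -> bool:
        return self.publication_link is not None

    def release(self, publication_link: str) -> None:
        """Lift the embargo once the author supplies a DOI or article URL (step 8)."""
        link = publication_link.strip()
        if not link:
            raise ValueError("a DOI or article URL is required to lift the embargo")
        if self.is_public:
            raise RuntimeError("this deposit has already been released")
        self.publication_link = link


def searchable_ids(deposits):
    """Only structures from released deposits are exposed to public search."""
    return [sid for d in deposits if d.is_public for sid in d.structure_ids]


if __name__ == "__main__":
    deposit = EmbargoedDeposit(
        title="Example article",
        authors=["A. Author"],
        abstract_snippet="First lines of the abstract...",
        structure_ids=[101, 102],  # placeholder IDs, not real records
    )
    print(searchable_ids([deposit]))    # [] while under embargo
    deposit.release("10.1000/example")  # placeholder DOI
    print(searchable_ids([deposit]))    # [101, 102] after release
```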

The result of this project will be a way for publishers to link their articles directly to a free access chemistry database and use a series of web services to enable other capabilities (to be defined). It will also allow articles in Open Access and non-Open Access publications to be searchable by the “language of chemistry”.

This is only a slice of the overall project but I think it may be of interest relative to the comments you have made below.

Parts of this were shown last week at Drexel University and a particular snippet is available online here:

We are also going to provide a Microsoft Word add-on which will allow users to prepare articles for publishing using similar technologies.

We think this IS disruptive…what say you?

I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) PubMed is structure searchable from ChemSpider…we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction of documents SUBMITTED to our site – Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration to ChemSpider

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, deposit their structures to ChemSpider under embargo with the article title, author list and “fractional abstract”. When a publication goes live the author can login, associate a DOI or a URL for the publication (for non-DOI based Open Access publishers) and the structures and article details get lifted from embargo and are immediately available for searching to the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities

6) If we received community support for this it could be game-changing.

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation

ChemSpider has been working hard to support Wikipedia for a number of months now. We have been curating the structures on Wikipedia, I have been an active member of the WP:Chem team, we have extended our integration of Wikipedia to show the lead of the Wikipedia article on associated record views, and we have a lot of background activities going on regarding Wikipedia at present (info will be released shortly). There are new articles released on Wikipedia on an ongoing basis and we stay up to date as best we can by monitoring bots for updates. Harvesting monographs out of Wikipedia based only on ChemBoxes and Drugboxes is certainly not sufficient, since not every article about drugs and chemicals on Wikipedia has an associated Drugbox or ChemBox.

For example… You have likely heard of Rember for Alzheimers already? A search on Google for Rember Alzheimers will give about 2 million hits. It’s already being discussed in the blogosphere, including Derek Lowe’s In the Pipeline. Rember turns out to be methylene blue. There is already an article on Wikipedia about Rember but there is no chembox as yet. As I was researching Rember out of interest I noticed we did not have methylene blue linked to Wikipedia, and Rember wasn’t associated with methylene blue. Adding the name was of course easy…5 seconds work after login.

We have now added the ability to associate data sources directly too. What does this mean? On a record view page is a list of “Data Sources” associated with a compound. This is where depositions about a compound came from and, generally, links back to the associated web pages. Previously, in order to populate the Data Source table it was necessary to deposit the structure and associated info as an SDF file. TOO MUCH work. So, now we have made it easy. To add a data source, simply log in and select “Edit” (top right-hand side of the data source table), then click Add and enter the information in the pop-up box. The input is the name to be listed in the Data Source table, the URL to the information on the Data Source page (if info exists) and the name of the Data Source.

There is one caveat to adding such links: the data source must exist. If you want to add data associated with your own website you need to register yourself, add a Data Source and wait for us to approve it. Wikipedia is a special case, since when the link is made we grab the lead of the article directly and show it in the Record View. For methylene blue there are two related Wikipedia articles, so we have linked to them both, as you can see on the record view. Simply go to ChemSpider and search for rember and you’ll see two linked Wikipedia articles.
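For anyone who thinks in code, the three inputs and the "data source must exist" caveat boil down to something like the following. This is a hypothetical sketch, not the ChemSpider implementation: the names `DataSourceLink` and `add_link`, the set of approved sources and the example URL are all invented for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class DataSourceLink:
    display_name: str  # the name listed in the Data Source table
    url: str           # link to the information at the data source, may be empty
    data_source: str   # the name of the (already approved) Data Source


def add_link(record_links, approved_sources, display_name, url, data_source):
    """Attach a data source link to a record, enforcing the caveat described above."""
    if data_source not in approved_sources:
        raise ValueError(
            f"data source {data_source!r} must be registered and approved first"
        )
    if url:
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not a usable web address: {url!r}")
    link = DataSourceLink(display_name, url, data_source)
    if link not in record_links:  # avoid listing the same link twice
        record_links.append(link)
    return link


if __name__ == "__main__":
    approved = {"Wikipedia"}  # hypothetical set of approved data sources
    links = []
    add_link(links, approved, "Rember", "https://example.org/rember", "Wikipedia")
    print(len(links))
```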

In keeping with our commitment to continue to index Open Access journals for searching on ChemSpider we are happy to announce our indexing of Libertas Academica. Most people I have spoken to about our indexing of Open Access journals have never heard of this Open Access publisher. Libertas Academica offers “Open access journals on clinical medicine, bioinformatics, biology, chemistry, pharmacology, gene signalling, systems biology, informatics, virology, substance abuse, translational science and complimentary medicine.” I know of LA-press because of their Analytical Chemistry Insights journal.

Their list of Popular Journals is given below and their full list of journals is given on the third tab.

The publisher allows direct commenting on articles on their website as shown here for their article on “High-Performance Liquid Chromatographic Method for Determination of Phenytoin in Rabbits Receiving Sildenafil” (This article is already linked from the structures of Phenytoin and Sildenafil)

Following our previous approach of using Taxol and Paclitaxel as a measure of potential contribution to search results on ChemSpider, searching Libertas Academica gives 6 hits on Taxol, while a search on Paclitaxel gives 23 hits.

Our growing list of Open Access Publishers is rather impressive at this point…see below. It will continue to grow.

Those of you who have been following the discussions of Stevan Harnad, Peter Suber and others regarding institutional repositories and Open Access will already be up to speed regarding OA mandates and what they could mean in terms of access to data. Rather than go into this area in detail myself I point you specifically to Stevan Harnad’s site to review ongoing discussions there (there are multiple parties exchanging views).

What I am going to do though is point you to this comment on Peter Suber’s blog regarding “Stanford Opens Access to All Its Education Studies“.

Specifically, the following comments are of interest “Under Stanford’s new policy, only the author’s final, peer-reviewed copy of the article would be posted online —in some cases, potentially months before the printed version becomes available….By early fall, the education school plans to have a Web site in place where the articles will be posted and archived in a searchable database. With approximately 50 scholars on Stanford’s education school faculty, the site could accumulate as many as 100 articles a year, by Mr. Willinsky’s estimate.”

Stanford is not alone in this type of shift. What does this mean for indexing of articles and availability for searching in terms of the work we are doing with ChemSpider right now (1,2,3)? Text-indexing of chemistry articles would simply mean turning our spider onto the repository. Using the tools we have available now and the database of 21 million compounds and associated dictionary we could also convert the chemical names to structures and make the articles searchable by both text and structure BEFORE publication, in theory, months before. With the work that is already underway on Open Access articles on ChemSpider and SOON to be unveiled, we could also provide tools for authors to mark up their own documents. My preference, as for many others, is that authors of Chemistry articles use semantic authoring tools to allow us to grab the appropriate information from the articles for linking as well as provide a path for semantic connectivity.

The question then is whether or not ChemSpider can index institutional repositories or authors’ self-archived collections on their university research group websites. The authors’ self-archived collections will be very valuable but are of course the most likely to upset the publishers. We’d like to do both.

I envisage a time when articles are indexed and searchable even before they are published and indexed by others. Why not? If there are changes to the article between pre- and post-publication both can be indexed.

We welcome your comments! Anyone want to introduce me to the host of an institutional repository?

Since originally incorporating Chemrefer text-based searches of Chemistry literature into ChemSpider we have continued to expand the list of supported publishers as described here. We added Royal Society of Chemistry articles in the past week. This weekend we indexed IUPAC’s Pure and Applied Chemistry journal and added that to our list of supported publishers as well. In keeping with my previous reports of contributions to text-based searching I performed a search on both Taxol and paclitaxel. Performing a search only on the IUPAC index provided 37 hits for Taxol and 11 hits for paclitaxel.

We now have 12 sources feeding our Literature Search. We are looking for others to index and add to the list. If you have any suggestions please let us know.

I have blogged previously about ChEBI entities of the month and our work to include the information in ChemSpider. In order to do so we had to introduce rich text support. This work is done and reported here. As of today nearly all ChEBI Entity of the Month information is posted to ChemSpider. During the process we have provided feedback to the team about suggested changes to some structure depictions and have also noted some differences in stereochemistry between our reference structures and those on ChEBI. This type of interaction keeps us all vigilant about accuracy, and it was great (and fast) to work with the group at ChEBI to cross-validate the limited dataset. Everyone gains.

The rich text editor worked without failure and, we think, is ready to roll out to the general public, but we would still like some beta-testers to help test it.

The DailyMed website is a valuable resource when it comes to chemical names, trade names, drug names and chemical structures. What is interesting is the quality of the information on the website. We were originally interested in using the website to expand our dictionary of drug names and associated chemical structures and to exercise our text-mining tools to recognize chemical structures. In order to test our text-mining capabilities and to tune the algorithms we had to examine every record for accuracy and appropriateness. This amounted to over 3000 records. During this process we were able to review every chemical structure diagram and the appropriateness of these diagrams. As part of the process we were able to build a highly validated dataset of chemical structures and their chemical/trade/drug names. These will be exposed on ChemSpider in the near future.

For now let's examine the quality of information on DailyMed.

The website is advertised as:

“DailyMed provides high quality information about marketed drugs. This information includes FDA approved labels (package inserts). This Web site provides health information providers and the public with a standard, comprehensive, up-to-date, look-up and download resource of medication content and labeling as found in medication package inserts. ”

So, what type of materials can we find on DailyMed?

Look at Soltamox here. What do you think about the chemical structure image below? Do you think that was drawn with a structure drawing software package?

What about the one for Clindamycin phosphate? Do you think there might be a lack of stereochemistry on this structure of norethindrone below? Same question for Trobicin.

My favorite “not drawn by a chemist” chemical structure is the one for cefobid shown below.

Many chemical structures on DailyMed are imperfect. What is quite shocking is that many of these are not even drawn with structure drawing packages. There are other issues…more to come.

COMMENT: DailyMed is a delivery vehicle for content provided by vendors (I believe). The site is a valuable public service and is applauded. The hope is that the work we are doing on DailyMed will be of similar value and might encourage some of the labels to be “cleaned up”.

Here at ChemSpider we’ve been working for almost a year and a half to build a structure-centric community for chemists. During this time we have been dabbling, in the background, with ChemSpider being not so structure-centric, but this has not been exposed yet. Of late we have been attracted to the possibilities around text-mining and mark-up of articles.

We are well underway in terms of providing tools for markup and they will be released incrementally. We have a lot more ideas and are interested in participating in the Article 2.0 contest to see what we can do. What is Article 2.0? Article 2.0 was announced by Elsevier here with the following statement:

“We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset. We will encourage the use of XQuery, but this will not be a mandate. By leveraging these APIs, the contestant becomes the publisher and can render scientific articles to meet their needs including integrating the article into existing applications or combining it with other web service APIs.”

7500 articles and complete freedom to present the articles as we see fit. Enticing! What do we already have on ChemSpider that we could reuse?

1) Structure deposition

2) Analytical data and image deposition

3) Integration to other data via URLs

4) Add comments/description

5) Text markup with “Chemical enhancements”

6) A dataset of >21 million structures and integration to over 120 data sources

7) Good ideas …

Article 2.0 looks interesting…we hope to be involved.

We are adding our finishing touches to some markup tools for Open Access articles at present and they will be unveiled shortly. In parallel we’ve been manually curating a series of articles about drugs, about 3000 of them, and will roll out these articles with similar markup using the tools we have developed. When rolled out we will have extended our ChemSpider toolkit to facilitate integration between “documents” and ChemSpider – watch this space…

A lot of text-indexing of publishers and journals has been underway over the past few weeks, with permission. The two latest additions are the Journal of Biological Chemistry (added over 122,000 new articles) and the Proceedings of the National Academy of Sciences (added over 50,000 new articles). Now on the Literature Search page you will see a series of checkboxes for you to choose the resources for text-searching (as shown below).

I have been testing the searches based on one of my adopted molecules, paclitaxel, sometimes referred to as Taxol.

Searching on paclitaxel without JBC and PNAS gives a total of 427 articles in 11 seconds.

Searching on Taxol without JBC and PNAS gives a total of 270 articles in 5 seconds.

Searching on paclitaxel with JBC and PNAS gives a total of 745 articles in 26 seconds.

Searching on Taxol with JBC and PNAS gives a total of 1192 articles in 35 seconds.

Clearly adding JBC and PNAS gives a lot more hits on both names: roughly 1.7x for paclitaxel and over 4x for Taxol. Just as clearly, the number of hits is highly dependent on the name used to perform the search. When we integrate chemical structure searching via linked identifiers this dependency should be dramatically reduced. This work is in development.
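For anyone who wants to check the arithmetic, the fold increases can be worked out directly from the four hit counts reported above. This is just a quick sketch; the function name `fold_increase` is my own and nothing here touches the actual search service.

```python
# Hit counts reported above: (without JBC and PNAS, with JBC and PNAS)
hits = {
    "paclitaxel": (427, 745),
    "Taxol": (270, 1192),
}

def fold_increase(before: int, after: int) -> float:
    """Ratio of hits after adding the new sources to hits before."""
    return after / before

for name, (before, after) in hits.items():
    ratio = fold_increase(before, after)
    print(f"{name}: {before} -> {after} articles ({ratio:.2f}x)")
```

The two names differ by more than a factor of two in how much they gain from the new sources, which is exactly the name dependency that structure-based searching should remove.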

ChemSpider has taken some thrashing over the past year. We’ve been hit on science (and proven our point many times), on Open Access versus Free Access statements, and on whether or not we have Open Data. There has been encouragement to define whether the data on our site is Open Data or not. After pressure there, we’ve adopted Open Data tags on data deposited by users. When I’ve asked more about Open Data I have heard that it is not ratified at the same level as Creative Commons licenses and that they would be better to use. A week ago we put up Creative Commons Licenses in what I hoped was a GOOD move for the ChemSpider site, one that would relax the criticism of our site and potentially receive their blessing and support.

We received a blessing for all of 72 hours. In his blog post Peter Murray-Rust was DELIGHTED with our decision to do this. I quote: “I am DELIGHTED to report that Chemspider has adopted a CC-SA licence for its data.” and espoused “PMR: This is wonderful. As far as I know Chemspider is the only commercial chemical information company offering data under this licence, which is completely compatible with the Open Knowledge Definition. (It is also BBB-compliant, though data and publications are different animals).”

I assumed therefore we’d done a good thing. There was no indication to me that our position was anything other than positive.

There has been a conversation going on in the blogosphere for a couple of weeks now about Strong and Weak Open Access. I’ve read, watched and simply let others share their opinions because they’ve been in Open Access discussions for a number of years and have more context, background and passion to stay engaged in these discussions. They ARE important discussions and will come to a conclusion.

It appears that “I” am confused by Creative Commons licenses. This is based on the fact that 72 hours earlier we had done a good thing and got a blessing, but 3 days later I read yet another post, this time with a comment from John Wilbanks stating “I’d like to see a meaningful discussion of the risks of Share Alike and Attribution on data integration. Chemspider’s move to CC BY SA fits into this discussion nicely – it’s a total violation of the open data protocol we laid out at SC, which says “Don’t Use CC Licenses on Data” – but it does conform inside the broader OKD.”

Uh-oh. ChemSpider is in Total Violation of Creative Commons Licenses. As we say in Wales in times of distress … “Hell’s Bells” (My dad was a builder..if you believe he taught me to curse like that well….)

Peter followed it up with a comment “PMR: I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism – CC-SA is conformant to the OK definition, but difficult to operate for re-use). That’s why we use the OKF’s OpenData sticker on CrystalEye.” Hmmm…

Again, when I’ve asked about the OpenData sticker I’ve been informed that this is not yet ratified.

I’ve been involved in many discussions about Openness…just one example here. It has been difficult. Openness and licensing remain confusing…see an example here, and this is just about a blogsite!

So the question is what now? Do we remove Creative Commons Licenses? Do we adopt Open Data licenses or do we just get ourselves out of the middle of this entire confusing discussion until all is resolved and settled? And IF we remove CC licenses and don’t post other licenses I know we’ll get criticized for that too. But let’s be honest…we’ve been highlighted for NOT having licenses up to this point. Now we are highlighted FOR having them. Maybe we can hope that no press is bad press. I’ll await feedback on this post and make a decision about what to do in the next 48 hours. Blog away…

I had commented recently on my pleasant experiences of working with MDPI regarding Molbank articles. See the posts here and here. Since Peter Murray-Rust and I had both blogged on this issue (from different points of view), Dietrich Rordorf from MDPI went out of his way to make us both aware, via email, of a recent publication they had posted on their site:

“Just for your information and in reply to the blog posts regarding use of Creative Commons By Attribution License v3.0: We recently published an editorial “Changes Coming to MDPI Journals: Digital Object Identifier (DOI) and Creative Commons Attribution License” at

The paper is entitled “Changes Coming to MDPI Journals: Digital Object Identifier (DOI) and Creative Commons Attribution License” and speaks for itself. I recommend that interested parties read the entire paper and commend MDPI on their decisions. EXCELLENT news.