Archive for March, 2010

If any readers will be attending the ENC conference in Daytona Beach next month I hope you will stop by and visit the two posters I will be presenting. The poster details are given below.

The Spectral Game – Teaching NMR Spectroscopy Via a Web Browser
Antony Williams; Jean-Claude Bradley; Robert Lancashire; Andrew Lang

We report on the implementation of the Spectral Game, a web-based game where players try to match molecules to various forms of interactive spectra including 1D/2D NMR. Each correct selection earns the player one point and play continues until the player supplies an incorrect answer. The game is played in a web browser, using spectra from the ChemSpider database (www.chemspider.com) for the problem sets together with structures extracted from the website. Spectra are displayed using JSpecView, an Open Source spectrum viewing applet which affords zooming and integration of JCAMP spectra. Players of the game provide both active and passive feedback regarding the quality of the spectral data, resulting in crowdsourced curation and validation of the data.
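The scoring rule described above (one point per correct match, play ends on the first wrong answer) can be sketched in a few lines. This is only an illustration: the function name `score_game` and the list-of-pairs input are hypothetical and are not taken from the Spectral Game's actual code.

```python
def score_game(answers):
    """Score a sequence of (chosen_molecule, correct_molecule) pairs.

    Hypothetical helper: one point per correct selection, and play
    stops at the first incorrect answer.
    """
    score = 0
    for chosen, correct in answers:
        if chosen != correct:
            break  # the first wrong answer ends the game
        score += 1
    return score
```

Answers given after the first miss are ignored, which is what makes the game a streak rather than a tally.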

ChemSpider – Building an Online Database of Open Spectra
Antony Williams; Valery Tkachenko

ChemSpider is an online database of over 20 million chemical compounds sourced from over 300 different sources including government laboratories, chemical vendors, public resources and publications. Developed with the intention of building community for chemists, ChemSpider allows its users to deposit data including structures, properties, links to external resources and various forms of spectral data. Over the past three years ChemSpider has aggregated almost 3000 high quality NMR spectra and continues to expand as the community deposits additional data. The majority of spectral data is licensed as Open Data, allowing it to be downloaded and reused in presentations, lesson plans and for teaching purposes.

After a long week at the ACS meeting, four presentations, one poster and a two hour training…I headed over to the Lawrence Berkeley National Lab (LBNL) to give a talk. I was hosted by Jeffrey Loo. And I mean hosted. I give a lot of talks in a lot of places but it is rare to experience the level of coordination and organization that Jeffrey provided. After what was almost a 2 hour presentation and discussion I had the opportunity to share dinner with a number of people who attended the presentation and to continue our discussions. It was great…I am used to being shuttled out of the building after a presentation but being asked to have dinner and continue discussing is a much more pleasant transition…especially prior to a red eye flight home.

The presentation was recorded and will be available shortly. For now the actual presentation is given below.

I gave a talk in the Future of Scholarly Communication session at the ACS Meeting yesterday. The abstract is below…and I DID talk about how Viagra keeps flower stems stiff. It was recorded and should go online soon and I will point to it again. For now I have linked the Slideshare presentation.

Enhancing discoverability across Royal Society of Chemistry content by integrating to ChemSpider, an online database of chemical structures

The ability to query across a chemistry publisher’s content using chemical structure searching can dramatically enhance discoverability. RSC has been applying a number of procedures to integrate RSC’s ChemSpider community resource with our published content and databases. These include: 1) entity extraction procedures 2) chemical name conversion procedures using software algorithms and curated dictionaries 3) semantic markup and 4) crowdsourced curation processes. This presentation will provide an overview of the processes we have utilized in order to provide structure-based integration to RSC content. We will discuss our ongoing efforts to extend the approaches to the mining of data from the rich supplementary information sections of many RSC publications. Our intention is to provide access to synthesis procedures and analytical data and further enrich the ChemSpider database for the benefit of the chemistry community.

Tuesday morning at the ACS meeting here in San Francisco…two talks done, one 2 hour training session completed, one poster presented and two talks left to give before heading off to the Lawrence Berkeley National Laboratory to give my final talk before the dreaded red-eye home. I am so looking forward to sitting on a cramped plane overnight…

My presentations delivered so far are already on SlideShare and are linked below for display.

We are building ChemSpider into the world’s leading resource for chemistry on the internet, and have a position for another team member at the RSC offices in Cambridge, UK.

We’re looking for someone with established cheminformatics and programming skills (including SQLServer, C#, ASP.NET, AJAX experience) to join a small team who work in both the UK and US. They need to have a track record in working in the field of cheminformatics, have knowledge of handling chemical structures, experience in working with web-based systems and, of course, have a big appetite for making a difference and working with a fast-moving team. The job holder will develop new database applications and tools to bring creators and users of public chemistry data together, and this position offers an unrivalled opportunity to contribute to this development. Join us and help change the way chemistry data is used.

Details here; closing date 1 April (really!)

We’ll be at the ACS Spring Meeting in San Francisco next week, so if you’re there and want to find out more, catch me (Richard Kidd) or one of the team at the RSC stand #310; I’ll be there when not in the CINF ‘Future of Scholarly Communication’ sessions.

When ChemSpider SyntheticPages is formally released next week at the American Chemical Society Spring National Meeting in San Francisco, we will also introduce both the ability to structure search ChemSpider SyntheticPages and a snippet of the reaction shown in ChemSpider itself. We have introduced a new reaction Infobox and display the snippet of the reaction based on compounds marked as primary reactants or products in the SyntheticPages article.

For example, see the reaction infobox for this record.

CSP

For the past few months we have been busily developing new functionality and capabilities for the ChemSpider platform with the intention of making navigation easier, enhancing integration to external resources, adding new rich data sources and providing access to brand new capabilities. This new functionality has been described in a series of blog posts today and is outlined below.

Improving the ChemSpider interface using tabbed infoboxes

Introducing NMR prediction capabilities to ChemSpider

Linking Google Patents searching to ChemSpider

Integrating RSC Databases into ChemSpider

Integrating RSC Publishing Beta into ChemSpider – includes integrations to Google Scholar, Google Books and Microsoft Academic Search

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

Following on from the last post regarding integrating to RSC Databases via the RSC Publishing Beta web services layer this post expands on the nature of the integration that we have been able to introduce. The RSC publishing beta gives us access to over 500,000 journal articles, book chapters and database records through one simple search interface. Using a similar approach to that outlined for the RSC database searches, that of using validated synonyms as the basis of the search for chemicals, we are able to search across the entire ePlatform of articles and retrieve hits as shown below. The hits are under the RSC journals tab.

Since the RSC publishing platform segregates the journals from the books the same search will return results from RSC books also. Our tests show that this is incredibly fast and highly accurate. This is our first venture into tapping into the chemical compounds sitting inside the RSC archive. More work is coming…

If you look at the tabs below you will also see that we have integrated to Google Books, Google Scholar and the Microsoft Academic Search. We are truly integrating to available internet resources to bring together the benefits of all of the primary search engines available.

eplatform

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

The Royal Society of Chemistry has a whole series of databases. None of them have been structure searchable…until now. As with our PubMed integration and our Google Patents integration rolling out shortly, just because a database hasn’t had the chemical structures extracted and indexed doesn’t mean that those resources cannot be made “structure searchable”. It’s not a subtle distinction however, as discussed in the Google Patents blog post. These types of integrations depend on the correct association between chemical names and structures, access to an API allowing facile and flexible searching and, something that is purely serendipitous in nature, the absence of overlaps between chemical names and common language.

We have used the recently announced RSC Publishing beta platform and the API made available to us to enable the searching. As my colleague Graham McCann announced recently “(the) platform gives access to over 500,000 journal articles, book chapters and database records through one simple search interface. The new platform delivers faster browsing, intelligent searching and more intuitive navigation and is open for beta testing now.”

Our approach has been to search the title and the abstract in each of the databases for all of the validated identifiers. It works. It is FAST and it provides “structure-related” access to all six RSC databases. An example screen shot is below where a search on chlorobenzene retrieves data from each of the following databases: Mass Spectrometry Bulletin, Laboratory Hazards Bulletin, Methods in Organic Synthesis, Catalysts and Catalysed Reactions, Natural Product Updates and Analytical Abstracts. The screen shot below shows the analytical abstracts linked by the term chlorobenzene in the title or abstract itself. 284 hits…in a fraction of a second. The abstract is linked out to the original article via DOI, where possible.

databases
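To make the title-and-abstract matching concrete, here is a minimal sketch of the idea. It is not the RSC Publishing beta API: the record layout (dicts with `title` and `abstract` keys) and the function name `find_hits` are assumptions made purely for illustration, and a real service would run this search on the server side.

```python
import re


def find_hits(records, validated_names):
    """Return records whose title or abstract mentions a validated name.

    `records` is assumed to be a list of dicts with 'title' and
    'abstract' keys (a hypothetical layout, not the RSC schema).
    """
    # Lookarounds stop a name matching inside a longer word, so
    # "chlorobenzene" does not match "dichlorobenzene".
    patterns = [
        re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for name in validated_names
    ]
    hits = []
    for record in records:
        text = record.get("title", "") + " " + record.get("abstract", "")
        if any(pattern.search(text) for pattern in patterns):
            hits.append(record)
    return hits
```

The whole-word check matters for chemistry in particular, where one valid name is very often a substring of another.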

My personal favorites in the set of databases are the Natural Product Updates (NPU) and the Methods in Organic Synthesis (MOS) databases. The NPU database contains tens of thousands of natural product chemical structures, together with chemical names, references and some physical properties. Rich resources for ChemSpider. MOS includes reaction schemes, title and bibliographic details. Rich resources to connect to ChemSpider SyntheticPages in the future.

We have only just started to tap into the riches contained within the RSC archive. It’s like stumbling across a roomful of rubies to pick up diamonds. There is content all around us waiting for us to connect. We will connect this up to ChemSpider and make it available. Access to the databases will be shown at the ACS Meeting in San Francisco.

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

ChemSpider has been integrated into the SureChem service for almost 3 years. It’s a great service and the SureChem team have worked very hard to provide a premier offering at an affordable price point. The approach SureChem has taken is an interesting one…using entity extraction techniques to find chemical names in patents and using various Name-to-Structure conversion engines to generate a consensus result of what is the most appropriate chemical structure associated with a name. From this set of data they assemble a rich database of chemical structures linked back to the associated patents. This database is, of course, both structure and substructure searchable. Using the webservice provided to us by SureChem we have been able to link chemical structures in ChemSpider out to SureChem. This can produce a lot of hits across the various patent databases because of the existence of a chemical name in the patent text. Look under the patents tab for Xanax and you will see an abundance of related patents.
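How SureChem actually forms its consensus is not described here, so the following is only one plausible reading: a simple majority vote over the structures returned by several name-to-structure engines. The function name `consensus_structure`, the `min_votes` threshold and the use of structure identifier strings (for example InChIs) as votes are all assumptions for illustration.

```python
from collections import Counter


def consensus_structure(candidates, min_votes=2):
    """Pick the structure most engines agree on, or None if no agreement.

    `candidates` holds one structure identifier per engine, with None
    where an engine failed to convert the name. Hypothetical sketch;
    not SureChem's actual method.
    """
    votes = Counter(c for c in candidates if c is not None)
    if not votes:
        return None
    # On a tie, most_common returns the candidate seen first.
    structure, count = votes.most_common(1)[0]
    return structure if count >= min_votes else None
```

Requiring at least two agreeing engines means a name that only one engine can convert yields no structure at all, trading coverage for precision.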

Note that the link between ChemSpider and SureChem is based on the structure using an InChI as the connection string. Structures will only exist in the SureChem database based on the success of the name to structure conversion software. There is a big upside to this, to be described shortly, but it also has a downside. The downside is that entity extraction depends on the identification of systematic names using tuned algorithms that SureChem have been optimizing for many years. However, for non-systematic names including trade names etc., the dependence will be on both dictionaries within the entity extractor as well as dictionaries underlying the name to structure conversion engines. So, the fact that Cocaine has the identifiers snow, berries and bernice (believe me…check the names on ChemSpider!) depends on the extraction of the three names and then association to the structure cocaine. It is UNLIKELY that any patent will use these terms for cocaine of course! This all will become clear now….

We have decided to take advantage of the potential to integrate to Google Patents as it’s easy, it’s low-hanging fruit, and it may be advantageous to bring together both SureChem and Google Patents for users to choose between. So, we have taken a similar approach to searching Google Patents as we do with PubMed. We search PubMed using all validated synonyms associated with a structure, and we do the same for Google Patents. Therefore for Xanax, which is validated, we would get this hit list. However, if there were no names validated against that structure then we would get NO Google Patents hits. There has been an active project for three years to validate name-structure pairs across ChemSpider and I am yet to see any incorrectly associated patents. Fortunately the names snow, bernice and bennies are NOT validated. If bernice were validated, we would get this list associated with Cocaine.
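The effect of restricting the search to validated synonyms can be sketched as below. The function name `build_patent_query`, the (name, is_validated) input pairs and the quoted-phrase OR syntax are assumptions for illustration; the query format actually sent to Google Patents is not described in this post.

```python
def build_patent_query(synonyms):
    """Build a name-based query from validated synonyms only.

    `synonyms` is a list of (name, is_validated) pairs. Returns None
    when nothing is validated, meaning no search is run at all.
    Hypothetical sketch of the filtering step only.
    """
    validated = [name for name, is_validated in synonyms if is_validated]
    if not validated:
        return None
    # Quote each name so multi-word synonyms are searched as phrases.
    return " OR ".join('"{}"'.format(name) for name in validated)
```

With illustrative flags, `build_patent_query([("cocaine", True), ("snow", False), ("bernice", False)])` searches on the single name cocaine, so the everyday words never reach the patent search.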

If you look at slide 11 from my presentation at ICIC posted here you will see how complex a NAME-based search can be for a particular component. All of those names for OEA had to be searched. In the world of entity extraction all of those names would have to be found and correctly converted to the correct structure. A balance of name linking and entity extraction approaches will likely give an intersection.

freepatents

___________________________________________________________________________

The intersection of Google Patents seems to show great promise. It has advantages in that it offers access to the digitized US patents all the way back to 1790. However, SureChem has coverage of WO, European and Japanese Patents while Google Patents is limited to US only. We have already learned a lot about how to reduce erroneous results and get back the most value from the Google Patents service but we look to you, the users, for initial feedback when we release at the ACS. Google Patents will be available under the last tab in the Patents infobox.

google patents

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

We had previously released NMR prediction on ChemSpider as announced here. Based on community feedback we later removed that connection and never reconnected it, despite reported improvements. I am an NMR spectroscopist by training…if you check out my Mendeley profile you’ll see that the majority of my papers are NMR-based. Because I am an NMR jock and, despite working in cheminformatics, still keep my hands in NMR research (NMR prediction and computer assisted structure elucidation), I really wanted to make sure that we deliver NMR prediction via ChemSpider. I was involved with the development of the ACD/Labs NMR prediction tools for H1, C13, N15, F19 and P31 nuclei. There are a number of other NMR prediction modules on the market including those of Bio-Rad (in the Know-It-All package), Modgraph and certainly the work of Wolfgang Robien, one of the founding fathers of NMR prediction. These are primarily commercial packages.

In the background we have been working on the introduction of NMR prediction to ChemSpider in time for the ACS. We were looking for a platform that we could integrate that involved community deposition of data to ensure there was a growing database to enhance the prediction algorithms. We also wanted to know that the underlying data quality was good. We wanted to integrate to an Open system that had support from both an active community of participants as well as at least one developer who could provide support if we needed it. All of these criteria point to only one resource, NMRShiftDB. There have been some heated discussions, including on this blog, regarding data quality, especially in NMRShiftDB. However, I co-authored a paper with Chris Steinbeck and colleagues from ACD/Labs validating the dataset as well as ACD/Labs’ NMR prediction approaches.

NMRShiftDB is a high quality data set and certainly contains enough data to provide a training set for NMR prediction algorithms. The NMR predictions provided by NMRShiftDB are used by many people and overall feedback seems to be very positive.  Based on our previous knowledge of the data in NMRShiftDB, and the availability of a well defined programming interface to connect ChemSpider, we have worked with Stefan Kuhn at the EBI to produce a first level integration.

As a result at the ACS meeting in San Francisco next week we will roll out NMR prediction integration. In keeping with the new layout model we have adopted for ChemSpider using tabbed approaches for display of data, we have bundled together all predictions. The first ACD/Labs tab provides access to ACD/Labs PhysChem properties, the EPI Summary provides access to the EPISuite and the NMRShiftDB provides access to the predicted NMR spectra. The left spectrum shows the Proton NMR spectrum and the right spectrum shows the C13 NMR spectrum.

NMRshiftDB

When the system is fully integrated the process will work as follows. Since NMRShiftDB already contains many thousands of assigned spectra we will retrieve the experimentally assigned spectra directly and display them. When we cannot retrieve the experimental spectra then we will predict the NMR spectra and display them.
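The retrieve-first, predict-second flow can be sketched as follows. The two callables stand in for the NMRShiftDB lookup and prediction calls, whose real interfaces are not shown in this post; `get_spectrum`, `fetch_experimental` and `predict` are hypothetical names.

```python
def get_spectrum(structure_id, fetch_experimental, predict):
    """Return (spectrum, source) for a structure.

    `fetch_experimental` returns an assigned experimental spectrum or
    None; `predict` returns a predicted spectrum. Both are placeholder
    callables, not real NMRShiftDB functions.
    """
    spectrum = fetch_experimental(structure_id)
    if spectrum is not None:
        return spectrum, "experimental"
    # Only fall back to prediction when no experimental data exists.
    return predict(structure_id), "predicted"
```

Returning the source alongside the spectrum lets the display label a prediction as such, so that users never mistake it for measured data.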

In the future we might pre-predict and store the NMR spectra for all structures on the NMR database. I am a little leery of doing this at present as we need to gather some basic feedback from the ChemSpider users regarding the performance of the NMR prediction algorithms and our existing implementation. In terms of predicting NMR spectra across a database of this size, a lot of consideration has to be given to domain applicability, i.e., what subset of structures should be excluded from having NMR predictions performed? For example, organometallic complexes, free radicals etc. CAS likely had to take this type of issue into account when they applied NMR predictions to their CAS registry.

If there are other NMR prediction algorithms or databases that you would be interested in integrating into ChemSpider please contact me. If you are a cheminformatics vendor selling NMR predictions/databases we would be VERY interested in receiving JUST the structures from your NMR databases. We will deposit them and link directly to your product page as an indicator that you have NMR data available.

As ChemSpider has grown in the amount and diversity of data that we link to, our interface has had to evolve. The reality is that our pages have started to become heavy with data and information and, in many ways, some of the pages can be unwieldy. As a result we have introduced Tabbed Infoboxes to make navigation much easier.

These tabbed infoboxes collapse the information into an infobox but keep it segregated under various tabs. The two examples below are from the present site that is online and show the data sources box (now it’s EASY to find all chemical vendors under an aggregated infobox tab called chemical vendors) and the patents infobox, using the SureChem service and separating the tabs into different patent classes.

What I will unveil in a later post regarding OTHER tabbed boxes will be more exciting and you will see why we have taken this path shortly.

data sources

patents

From the early days of the acquisition of ChemSpider by the RSC we have been focused on accessing the rich content that the RSC has contained in its databases and in its rich archive. We have been working hard for a number of months now to integrate systems, projects and processes into ChemSpider so that RSC chemistry is more discoverable. What we will be unveiling in the next few days we believe is big. We’ll roll it out one piece at a time. The last blog post discussed the deposition of new compounds from RSC prospected articles into ChemSpider. The email below results from the deposition of compounds from one article. One set of 10 structures from one article that are directly deposited into ChemSpider when the article goes live. These are compounds that are deposited and live immediately, not abstracted later. Imagine when we are doing this for all RSC articles, databases and books….

ALL of the compounds below are NEW to the ChemSpider database…every one of them. While not all RSC articles are only about novel compounds, clearly there are new compounds moving into the database from the RSC publications.

Dear RSC Prospect,

This email is to notify that your deposition (#3427) has been published. Below please find a list of links to the structures that belong to your deposition:

http://www.chemspider.com/Chemical-Structure.23558982.html

http://www.chemspider.com/Chemical-Structure.23558983.html

http://www.chemspider.com/Chemical-Structure.23558984.html

http://www.chemspider.com/Chemical-Structure.23558985.html

http://www.chemspider.com/Chemical-Structure.23558986.html

http://www.chemspider.com/Chemical-Structure.23558987.html

http://www.chemspider.com/Chemical-Structure.23558988.html

http://www.chemspider.com/Chemical-Structure.23558989.html

http://www.chemspider.com/Chemical-Structure.23558990.html

http://www.chemspider.com/Chemical-Structure.23558991.html

Cheers,

ChemSpider

The structures link back directly to the RSC article via DOI as shown below.

Prospectedarticle

We’ve taken the first step towards users being able to seamlessly bounce back and forth between finding compounds of interest using the ChemSpider search and selection tools and finding more information about them in RSC journals…

I’m pleased to announce that we’ve just switched on a deposition system which will take compounds from the prospected version of RSC articles as they are published and automatically deposit them into ChemSpider, making a link back to the original article from the new compound page. An example of a new compound is here which was generated when this article was prospected. The same deposition process is used to make links from existing ChemSpider compounds to new RSC articles, for example here was generated when this article was published.

This is basically a way to stick our toe in the water to investigate how much intervention and cleaning is necessary to deposit compounds when all the information that we have been storing for them is the InChI without any 2D layout information (which is an issue that other potential data sources may also face). To do this we’ve been making use of the ChemSpider webservices http://www.chemspider.com/InChI.asmx to download the mol files of InChIs already in ChemSpider, or using the InChItoMol webservice to generate new mol files where they don’t exist already. Tracking and fixing problems as they crop up at this manageable rate will help us when we face the larger task of importing all of the compounds that have been prospected in the past into ChemSpider.
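As a rough sketch of that workflow, the loop below reuses an existing mol file where ChemSpider already has one, generates a new one otherwise, and logs failures for later cleaning. The callables `lookup_existing` and `generate_new` are hypothetical stand-ins for the web service calls; the real request and response formats of InChI.asmx are not reproduced here, and using `ValueError` to signal a failed conversion is an assumption.

```python
def collect_molfiles(inchis, lookup_existing, generate_new):
    """Gather a mol file for each InChI and record any failures.

    `lookup_existing(inchi)` returns a mol file string or None;
    `generate_new(inchi)` returns a mol file string or raises
    ValueError. Both are placeholders for web service calls.
    """
    molfiles = {}
    problems = []
    for inchi in inchis:
        mol = lookup_existing(inchi)
        if mol is None:
            try:
                mol = generate_new(inchi)
            except ValueError as exc:
                # Keep the failure for manual tracking and fixing.
                problems.append((inchi, str(exc)))
                continue
        molfiles[inchi] = mol
    return molfiles, problems
```

Keeping the failures in a separate list, rather than stopping at the first one, matches the aim of tracking problems at a manageable rate while the rest of the deposition proceeds.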

OVERVIEW

The LBNL Library is hosting a seminar for researchers interested in online collaboration, data storage and curation, data exchange, crowdsourcing, and open access.

This seminar will explore ChemSpider (http://www.chemspider.com/) – a free access service providing a structure centric community for chemists and the richest single source of structure-based chemistry information.

EVENT DETAILS

March 24, 2010 – Wednesday
3:00 p.m. – 4:30 p.m.
Building 50 Auditorium, Lawrence Berkeley National Laboratory

Bring your laptop for a hands-on demo session. For non-Berkeley Lab personnel: Please contact Jeffery Loo (JLLoo@lbl.gov) by Monday, March 22, 12:00 p.m. for a visitor pass and shuttle bus directions. A visitor pass is required for entry into the Berkeley Lab by guests.

ABSTRACT

The increasing availability of free and open access resources for scientists on the internet presents us with a revolution in data availability. The Royal Society of Chemistry hosts ChemSpider, a free access website for chemists built with the intention of building community for chemists (http://www.chemspider.com/).

ChemSpider is an aggregator of chemistry-related information. With over 20 million unique chemical entities at present, linked out to over 300 separate data sources, ChemSpider has taken on the task of both robotically and manually curating publicly available data sources. It is also a public deposition platform where chemists can deposit their own data including novel structures, analytical data and synthesis procedures, and host data associated with the growing activities associated with Open Notebook Science.

This presentation will examine chemistry on the internet, the dubious quality of what is available and how the ChemSpider crowdsourced curation platform is fast becoming one of the centralized hubs for resourcing information about chemical entities.

We will also review our efforts to provide free resources for synthesis procedures, spectral data and structure-based searching of the chemistry literature and how chemists can contribute directly to each of these projects.

Following the presentation and a question and answer session, a hands-on session showing how to search for, curate and deposit data on ChemSpider will be given for interested parties.

SPEAKER


Antony Williams, PhD, is a leader in the domain of free access chemistry. He is the Vice President of Strategic Development at the Royal Society of Chemistry and is the host of ChemSpider, a free online structure centric community for chemists.

ChemSpider began as a hobby project in a basement and went on to become one of the most popular Chemistry websites with the highest quality of data available online. Antony spent over a decade in the commercial scientific software business as Chief Science Officer for ACD/Labs, one of the domain leaders in scientific software. He is an accomplished NMR spectroscopist with over 100 peer-reviewed publications. During his career he was the NMR Technology Leader for the Eastman-Kodak company and has worked in both academia and national government research institutions.

I had the pleasure of giving a presentation at the Science Commons Symposium a couple of weeks ago. The meeting was held at Microsoft Research and, as is to be expected, Microsoft had all of the necessary technologies in place to capture the video and provide access to the presentations online. My presentation is available here and all talks are available from this page. It was a great gathering and it was a privilege to spend time with the participants and supporters of Science In The Open.