Archive for March 16th, 2010

For the past few months we have been busily developing new functionality for the ChemSpider platform with the intention of making navigation easier, enhancing integration with external resources, adding rich new data sources and providing access to brand new capabilities. This new functionality is described in a series of blog posts published today and is outlined below.

Improving the ChemSpider interface using tabbed infoboxes

Introducing NMR prediction capabilities to ChemSpider

Linking Google Patents searching to ChemSpider

Integrating RSC Databases into ChemSpider

Integrating RSC Publishing Beta into ChemSpider – includes integrations to Google Scholar, Google Books and Microsoft Academic Search

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

Following on from the last post about integrating with the RSC databases via the RSC Publishing Beta web services layer, this post expands on the nature of the integration that we have been able to introduce. The RSC Publishing Beta gives us access to over 500,000 journal articles, book chapters and database records through one simple search interface. Using the same approach as for the RSC database searches, namely using validated synonyms as the basis of the search for chemicals, we are able to search across the entire ePlatform of articles and retrieve hits as shown below. The hits are under the RSC journals tab.

Since the RSC publishing platform segregates the journals from the books, the same search will also return results from RSC books. Our tests show that this is incredibly fast and highly accurate. This is our first venture into tapping the chemical compounds sitting inside the RSC archive. More work is coming…

If you look at the tabs below you will also see that we have integrated with Google Books, Google Scholar and Microsoft Academic Search. We are truly integrating with available internet resources to bring together the benefits of all of the primary search engines.

[Image: eplatform]

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

The Royal Society of Chemistry has a whole series of databases. None of them has been structure searchable…until now. As with our PubMed integration, and our Google Patents integration rolling out shortly, the fact that a database hasn’t had its chemical structures extracted and indexed doesn’t mean that it cannot be made “structure searchable”. It’s not a subtle distinction, however, as discussed in the Google Patents blog post. These types of integrations depend on the correct association between chemical names and structures, access to an API allowing facile and flexible searching and, something that is purely serendipitous in nature, the absence of overlaps between chemical names and common language.

We have used the recently announced RSC Publishing beta platform and the API made available to us to enable the searching. As my colleague Graham McCann announced recently “(the) platform gives access to over 500,000 journal articles, book chapters and database records through one simple search interface. The new platform delivers faster browsing, intelligent searching and more intuitive navigation and is open for beta testing now.”

Our approach has been to search the title and the abstract of each record in the databases for all of the validated identifiers. It works. It is FAST and it provides “structure-related” access to all six RSC databases. The example screen shot below shows a search on chlorobenzene retrieving data from each of the following databases: Mass Spectrometry Bulletin, Laboratory Hazards Bulletin, Methods in Organic Synthesis, Catalysts and Catalysed Reactions, Natural Product Updates and Analytical Abstracts. The screen shot shows the Analytical Abstracts records linked by the term chlorobenzene in the title or abstract itself: 284 hits in a fraction of a second. Each abstract is linked out to the original article via DOI, where possible.
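The title-and-abstract matching described above can be sketched roughly as follows. This is only an illustration and not the RSC or ChemSpider implementation: the `Record` class, the `find_hits` function and the sample records are all hypothetical, and the real search runs through the RSC Publishing Beta API rather than over local text.

```python
import re
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical stand-in for one database record."""
    database: str
    title: str
    abstract: str


def find_hits(records, validated_identifiers):
    """Return records whose title or abstract mentions any validated identifier."""
    # Whole-word, case-insensitive matching, so that "chlorobenzene"
    # does not match inside "dichlorobenzene".
    patterns = [
        re.compile(r"(?<!\w)" + re.escape(name) + r"(?!\w)", re.IGNORECASE)
        for name in validated_identifiers
    ]
    hits = []
    for record in records:
        text = record.title + "\n" + record.abstract
        if any(p.search(text) for p in patterns):
            hits.append(record)
    return hits


# Invented sample data, for illustration only.
sample_records = [
    Record("Analytical Abstracts", "Determination of chlorobenzene in water", "GC method."),
    Record("Methods in Organic Synthesis", "Synthesis of dichlorobenzene derivatives", "A new route."),
    Record("Natural Product Updates", "A new alkaloid", "Extracted with Chlorobenzene."),
]

for hit in find_hits(sample_records, ["chlorobenzene", "phenyl chloride"]):
    print(hit.database, "|", hit.title)
```

Matching on whole words is one way to limit false positives from longer chemical names that contain the search term. It does nothing about names that overlap with common language, which is why only validated identifiers are searched.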

[Image: databases]

My personal favorites in the set of databases are the Natural Product Updates (NPU) and the Methods in Organic Synthesis (MOS) databases. The NPU database contains tens of thousands of natural product chemical structures, together with chemical names, references and some physical properties. Rich resources for ChemSpider. MOS includes reaction schemes, titles and bibliographic details. Rich resources to connect to ChemSpider SyntheticPages in the future.

We have only just started to tap into the riches contained within the RSC archive. It’s like stumbling across a roomful of rubies to pick up diamonds. There is content all around us waiting for us to connect. We will connect this up to ChemSpider and make it available. Access to the databases will be shown at the ACS Meeting in San Francisco.

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

ChemSpider has been integrated into the SureChem service for almost 3 years. It’s a great service and the SureChem team have worked very hard to provide a premier offering at an affordable price point. The approach SureChem has taken is an interesting one: using entity extraction techniques to find chemical names in patents, then using various Name-to-Structure conversion engines to generate a consensus result for the most appropriate chemical structure associated with each name. From this set of data they assemble a rich database of chemical structures linked back to the associated patents. This database is, of course, both structure and substructure searchable. Using the web service provided to us by SureChem we have been able to link chemical structures in ChemSpider out to SureChem. This can produce a lot of hits across the various patent databases because of the existence of a chemical name in the patent text. Look under the patents tab for Xanax and you will see an abundance of related patents.
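To make the idea of a consensus result concrete, here is a minimal sketch using a simple majority vote. It is an assumption made purely for illustration: SureChem's actual consensus method is not described here, and the `consensus_structure` function, the `min_votes` parameter and the placeholder structure strings are all hypothetical.

```python
from collections import Counter


def consensus_structure(candidates, min_votes=2):
    """Pick the structure most engines agree on, or None if agreement is too weak.

    `candidates` holds one result per name-to-structure engine, for example
    an InChI string, with None where an engine failed to convert the name.
    """
    votes = Counter(c for c in candidates if c is not None)
    if not votes:
        return None
    structure, count = votes.most_common(1)[0]
    return structure if count >= min_votes else None


# Three hypothetical engines; the strings stand in for real structure identifiers.
print(consensus_structure(["structure-A", "structure-A", "structure-B"]))
print(consensus_structure(["structure-A", "structure-B", None]))
```

Requiring at least two agreeing engines means that a name converted by only one engine yields no structure at all, which trades coverage for precision.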

Note that the link between ChemSpider and SureChem is based on the structure, using an InChI as the connection string. Structures will only exist in the SureChem database where the name-to-structure conversion software has succeeded. There is a big upside to this, to be described shortly, but it also has a downside. The downside is that entity extraction depends on the identification of systematic names using tuned algorithms that SureChem have been optimizing for many years. For non-systematic names such as trade names, however, the result depends on both the dictionaries within the entity extractor and the dictionaries underlying the name-to-structure conversion engines. So, linking the identifiers snow, berries and bernice to cocaine (believe me…check the names on ChemSpider!) depends on the extraction of the three names and then their association with the structure of cocaine. It is UNLIKELY that any patent will use these terms for cocaine of course! This will all become clear now….

We have decided to take advantage of the potential to integrate with Google Patents as it’s easy, it’s low-hanging fruit, and there may be advantages in bringing together both SureChem and Google Patents for users to choose between. So, we have taken a similar approach to searching Google Patents as we do with PubMed, which we search using all validated synonyms associated with a structure. Therefore for Xanax, which is validated, we would get this hit list. However, if there were no names validated against that structure then we would get NO Google Patents hits. There has been an active project for three years to validate name-structure pairs across ChemSpider and I have yet to see any incorrectly associated patents. Fortunately the names snow, bernice and bennies are NOT validated. If they were, we would get results such as this list, which is associated with cocaine when bernice is validated.
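The rule that only validated synonyms are searched, and that no validated names means no patent hits, could be sketched like this. The function names, the OR-joined quoted-phrase query syntax and the `example.org` endpoint are assumptions for illustration, not the actual ChemSpider or Google Patents interface.

```python
from urllib.parse import urlencode


def build_patent_query(synonyms):
    """Build a search string from (name, is_validated) pairs.

    Returns None when no synonym is validated, mirroring the rule that a
    structure without validated names gets no patent hits.
    """
    validated = [name for name, is_validated in synonyms if is_validated]
    if not validated:
        return None
    # Quote each name so that multi-word names are searched as phrases.
    return " OR ".join(f'"{name}"' for name in validated)


def build_search_url(base_url, query):
    """Attach the query to a hypothetical search endpoint."""
    return base_url + "?" + urlencode({"q": query})


# Validation flags below are invented for the example.
xanax = [("Xanax", True), ("Alprazolam", True)]
cocaine_slang = [("snow", False), ("bernice", False)]

print(build_patent_query(xanax))
print(build_patent_query(cocaine_slang))
print(build_search_url("https://example.org/patents", build_patent_query(xanax)))
```

Filtering before the query is built keeps unvalidated slang such as snow from ever reaching the search engine, where it would match a great deal of unrelated text.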

If you look at slide 11 from my presentation at ICIC posted here you will see how complex a NAME-based search can be for a particular component. All of those names for OEA had to be searched. In the world of entity extraction all of those names would have to be found and correctly converted to the correct structure. A balance of name linking and entity extraction approaches will likely give an intersection.

[Image: freepatents]

___________________________________________________________________________

The integration of Google Patents seems to show great promise. It has the advantage of offering access to the digitized US patents all the way back to 1790. However, SureChem has coverage of WO, European and Japanese patents while Google Patents is limited to US only. We have already learned a lot about how to reduce erroneous results and get the most value from the Google Patents service, but we look to you, the users, for initial feedback when we release at the ACS. Google Patents will be available under the last tab in the Patents infobox.

[Image: google patents]

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

We had previously released NMR prediction on ChemSpider as announced here. Based on community feedback we later removed that connection and have never reconnected, despite reported improvements. I am an NMR spectroscopist by training…if you check out my Mendeley profile you’ll see that the majority of my papers are NMR-based. Because I am an NMR jock and, despite working in cheminformatics, still keep my hands in NMR research (NMR prediction and computer assisted structure elucidation), I really wanted to make sure that we deliver NMR prediction via ChemSpider. I was involved with the development of the ACD/Labs NMR prediction tools for H1, C13, N15, F19 and P31 nuclei. There are a number of other NMR prediction modules on the market including those of Bio-Rad (in the Know-It-All package), Modgraph and certainly the work of Wolfgang Robien, one of the founding fathers of NMR prediction. These are primarily commercial packages.

In the background we have been working on the introduction of NMR prediction to ChemSpider in time for the ACS. We were looking for a platform that we could integrate that involved community deposition of data to ensure there was a growing database to enhance the prediction algorithms. We also wanted to know that the underlying data quality was good. We wanted to integrate to an Open system that had support from both an active community of participants as well as at least one developer who could provide support if we needed it. All of these criteria point to only one resource, NMRShiftDB. There have been some heated discussions, including on this blog, regarding data quality, especially in NMRShiftDB. However, I co-authored a paper with Chris Steinbeck and colleagues from ACD/Labs validating the dataset as well as ACD/Labs’ NMR prediction approaches.

NMRShiftDB is a high-quality data set and certainly contains enough data to provide a training set for NMR prediction algorithms. The NMR predictions provided by NMRShiftDB are used by many people and overall feedback seems to be very positive. Based on our previous knowledge of the data in NMRShiftDB, and the availability of a well-defined programming interface to connect to ChemSpider, we have worked with Stefan Kuhn at the EBI to produce a first-level integration.

As a result, at the ACS meeting in San Francisco next week we will roll out NMR prediction integration. In keeping with the new layout model we have adopted for ChemSpider, which uses tabs to display data, we have bundled together all predictions. The first tab, ACD/Labs, provides access to ACD/Labs PhysChem properties, the EPI Summary tab provides access to EPISuite and the NMRShiftDB tab provides access to the predicted NMR spectra. The left spectrum shows the proton NMR spectrum and the right spectrum shows the C13 NMR spectrum.

[Image: NMRshiftDB]

When the system is fully integrated the process will work as follows. Since NMRShiftDB already contains many thousands of assigned spectra we will retrieve the experimentally assigned spectra directly and display them. When we cannot retrieve the experimental spectra then we will predict the NMR spectra and display them.
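The experimental-first, prediction-as-fallback flow described above might look something like the sketch below. The `get_nmr_spectrum` function, the two callables it takes and the toy data are hypothetical stand-ins, not the NMRShiftDB programming interface.

```python
def get_nmr_spectrum(structure_id, lookup_experimental, predict):
    """Return an assigned experimental spectrum if one exists, else a prediction.

    `lookup_experimental` and `predict` are caller-supplied callables
    (hypothetical stand-ins for calls to an NMR service).
    `lookup_experimental` returns None when no assigned spectrum is stored.
    """
    spectrum = lookup_experimental(structure_id)
    if spectrum is not None:
        return {"source": "experimental", "spectrum": spectrum}
    return {"source": "predicted", "spectrum": predict(structure_id)}


# Toy data: chemical shifts in ppm, invented for the example.
experimental_store = {"compound-1": [7.26, 7.31]}


def toy_predictor(structure_id):
    return [7.30]


print(get_nmr_spectrum("compound-1", experimental_store.get, toy_predictor))
print(get_nmr_spectrum("compound-2", experimental_store.get, toy_predictor))
```

Labelling each result with its source lets the display make clear to the user whether a spectrum was measured or predicted.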

In the future we might pre-predict and store the NMR spectra for all structures in the database. I am a little leery of doing this at present as we need to gather some basic feedback from ChemSpider users regarding the performance of the NMR prediction algorithms and our existing implementation. When predicting NMR spectra across a database of this size, a lot of consideration has to be given to domain applicability, i.e., what subset of structures should be excluded from having NMR predictions performed? Examples include organometallic complexes and free radicals. CAS likely had to take this type of issue into account when they applied NMR predictions to the CAS registry.
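As a rough illustration of a domain-applicability check, the sketch below excludes structures flagged as radicals or containing a transition metal. It is deliberately crude and entirely hypothetical: the metal list is a small, incomplete sample, the `is_radical` flag is assumed to come from elsewhere, and a real filter would work on the full structure rather than on a molecular formula.

```python
import re

# A small, incomplete sample of transition metals, for illustration only.
TRANSITION_METALS = {
    "Ti", "V", "Cr", "Mn", "Fe", "Co", "Ni", "Cu", "Zn",
    "Zr", "Mo", "Ru", "Rh", "Pd", "Ag", "W", "Os", "Ir", "Pt", "Au", "Hg",
}


def elements_in(formula):
    """Extract the set of element symbols from a formula such as 'C10H10Fe'."""
    return set(re.findall(r"[A-Z][a-z]?", formula))


def is_in_prediction_domain(formula, is_radical=False):
    """Return True if a structure looks suitable for NMR prediction."""
    if is_radical:
        return False
    return not (elements_in(formula) & TRANSITION_METALS)


print(is_in_prediction_domain("C6H5Cl"))                    # chlorobenzene
print(is_in_prediction_domain("C10H10Fe"))                  # ferrocene
print(is_in_prediction_domain("C9H18NO", is_radical=True))  # TEMPO radical
```

Running such a filter before any batch prediction would keep structures outside the training domain from receiving spectra that look authoritative but are unreliable.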

If there are other NMR prediction algorithms or databases that you would be interested in integrating into ChemSpider please contact me. If you are a cheminformatics vendor selling NMR predictions/databases we would be VERY interested in receiving JUST the structures from your NMR databases. We will deposit them and link directly to your product page as an indicator that you have NMR data available.

As ChemSpider has grown in the amount and diversity of data that we link to, our interface has had to evolve. The reality is that some of our pages have become heavy with data and information and, in many ways, unwieldy. As a result we have introduced Tabbed Infoboxes to make navigation much easier.

These tabbed infoboxes collapse the information into an infobox but keep it segregated under various tabs. The two examples below are from the site that is presently online. They show the data sources box (now it’s EASY to find all chemical vendors under an aggregated infobox tab called chemical vendors) and the patents infobox, which uses the SureChem service and separates the tabs into different patent classes.

What I will unveil in a later post regarding OTHER tabbed boxes will be more exciting and you will see why we have taken this path shortly.

[Image: data sources]

[Image: patents]