Archive for the Community Building Category

I had previously announced the integration of NMRShiftDB as a beta integration. I have received feedback both on-blog and off-blog about the performance of the algorithms and the need for improved display of results. Wolfgang Robien, one of the major contributors to the domain of curated NMR databases and NMR prediction, gave feedback in the comments section regarding the performance of the initial integration. There was a significant bug highlighted in the integration that resulted in dropping double bonds when passing structures from ChemSpider to the API for NMRShiftDB. This was clearly a very significant issue but Stefan Kuhn has fixed the issue. NMRShiftDB was taken offline for a couple of days while the error persisted but is back online today with this issue resolved.

What we KNOW we need to do to enhance the integration is as follows:

1) Always display the chemical structure and associated numbering scheme

2) Display the spectrum type so it is clear what nucleus is being displayed

3) Indicate whether the spectrum displayed is a database hit or is a predicted spectrum

4) Display the details of the assignments including the number of SHELLS used in the HOSE code based prediction.

Our work on the integration to NMRShiftDB will continue and we will enhance it moving forward. Thanks to all for the ongoing feedback and testing.

Andrea Wendel is a student at Potsdam University in upstate New York and is one of Martin Walker’s chemistry majors. Martin is on our advisory group for ChemSpider. Andrea was kind enough to answer the question “How do you use ChemSpider” and I felt it was of value to share the short report with the readers. We’d love to hear from other users about how you use ChemSpider and your feedback that could be shared with blog readers. Thanks

How I use ChemSpider

I use ChemSpider on a regular basis when I need additional or supplemental information about molecules or reactions. Homework or lab write-ups require the most need for this website. I use the main box in the center of the page to search for my information. I mainly use systematic names, trade names or formulas to search for my data. Most results appear in a very short time, which is very helpful. I rarely use the structure search option, although if my type of research was different I might find it more useful. For many of my lab write-ups and papers, the information I search for includes common physical properties such as boiling point, melting point, and density. The molecular weight and safety comments are also items I find extremely useful. The icon at the end of the information that leads you to the webpage it came from is another useful option, as the original source is very important to know. ChemSpider has many features that are useful to an undergraduate student.

By: Andrea Wendel

The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010

We had previously released NMR prediction on ChemSpider as announced here. Based on community feedback we later removed that connection and had never reconnected, despite reported improvements. I am an NMR spectroscopist by training …if you check out my Mendeley profile you’ll see that the majority of my papers are NMR-based. Because I am an NMR jock, and despite working in cheminformatics I do keep my hands in NMR research (NMR prediction and computer assisted structure elucidation) I really wanted to make sure that we deliver NMR prediction via ChemSpider. I was involved with the development of the ACD/Labs NMR prediction tools for H1, C13, N15, F19 and P31 nuclei. There are a number of other NMR prediction modules on the market including those of Bio-Rad (in the Know-It-All package), Modgraph and certainly the work of Wolfgang Robien, one of the founding fathers of NMR prediction. These are primarily commercial packages.

In the background we have been working on the introduction of NMR prediction to ChemSpider in time for the ACS. We were looking for a platform that we could integrate that involved community deposition of data to ensure there was a growing database to enhance the prediction algorithms. We also wanted to know that the underlying data quality was good. We wanted to integrate to an Open system that had support from both an active community of participants as well as at least one developer who could provide support if we needed it. All of these criteria point to only one resource, NMRShiftDB. There have been some heated discussions, including on this blog, regarding data quality, especially in NMRShiftDB. However, I co-authored a paper with Chris Steinbeck and colleagues from ACD/Labs validating the dataset as well as ACD/Labs’ NMR prediction approaches.

NMRShiftDB is a high quality data set and certainly contains enough data to provide a training set for NMR prediction algorithms. The NMR predictions provided by NMRShiftDB are used by many people and overall feedback seems to be very positive.  Based on our previous knowledge of the data in NMRShiftDB, and the availability of a well defined programming interface to connect ChemSpider, we have worked with Stefan Kuhn at the EBI to produce a first level integration.

As a result at the ACS meeting in San Francisco next week we will roll out NMR prediction integration. In keeping with the new layout model we have adopted for ChemSpider using tabbed approaches for display of data, we have bundled together all predictions. The first ACD/Labs tab provides access to ACD/Labs PhysChem properties, the EPI Summary provides access to the EPISuite and the NMRShiftDB provides access to the predicted NMR spectra. The left spectrum shows the Proton NMR spectrum and the right spectrum shows the C13 NMR spectrum.

NMRshiftDB

When the system is fully integrated the process will work as follows. Since NMRShiftDB already contains many thousands of assigned spectra we will retrieve the experimentally assigned spectra directly and display them. When we cannot retrieve the experimental spectra then we will predict the NMR spectra and display them.
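The retrieve-then-predict flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch_experimental` and `predict` are hypothetical stand-ins rather than NMRShiftDB or ChemSpider API calls, the dictionary takes the place of the real database, and the shift values are placeholders.

```python
from typing import List, Optional, Tuple

Peaks = List[Tuple[str, float]]

# Hypothetical in-memory stand-in for assigned experimental spectra,
# keyed by a structure identifier. The shift values are placeholders, not real data.
EXPERIMENTAL = {"structure-A": [("C1", 10.0), ("C2", 20.0)]}


def fetch_experimental(structure_id: str) -> Optional[Peaks]:
    """Return the stored assigned spectrum, or None when there is no database hit."""
    return EXPERIMENTAL.get(structure_id)


def predict(structure_id: str) -> Peaks:
    """Placeholder predictor; a real one would run the actual prediction algorithm."""
    return [("C1", 0.0)]


def get_spectrum(structure_id: str) -> dict:
    """Prefer an experimental database hit and fall back to prediction."""
    hit = fetch_experimental(structure_id)
    if hit is not None:
        return {"source": "database", "peaks": hit}
    return {"source": "predicted", "peaks": predict(structure_id)}
```

Carrying a `source` label with every result would also make it straightforward to tell the user whether a displayed spectrum is a database hit or a prediction.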

In the future we might pre-predict and store the NMR spectra for all structures on the NMR database. I am a little leery of doing this at present as we need to gather some basic feedback from the ChemSpider users regarding the performance of the NMR prediction algorithms and our existing implementation. In terms of predicting NMR spectra across a database of this size, a lot of consideration has to be given to domain applicability, i.e., what subset of structures should be excluded from having NMR predictions performed? For example, organometallic complexes, free radicals etc. CAS likely had to take this type of issue into account when they applied NMR predictions to their CAS registry.

If there are other NMR prediction algorithms or databases that you would be interested in integrating into ChemSpider please contact me. If you are a cheminformatics vendor selling NMR predictions/databases we would be VERY interested in receiving JUST the structures from your NMR databases. We will deposit them and link directly to your product page as an indicator that you have NMR data available.

For those of you looking for some assistance on either searching ChemSpider or how to become more involved in the community by depositing your own structures or adding links to your research there is now some updated documentation available under the support page of ChemSpider.

There is a Guide to Database Curation and Structure Deposition and four Quick Cards to help with chemical name searching, structure searching, single structure deposition and adding links to the record view.


Over the past three years we have received a lot of kudos from the users of ChemSpider, mostly via email. I’d now like to make a direct and some would say quite cheeky request for community participation. It will result in us immortalizing your participation on the pages of ChemSpider. We’d like to gather together some comments/quotes/statements from the community about how valuable ChemSpider has become for you and how you might be using it. We’d like to use these quotes as some sort of rolling banner (as yet to be designed!) as well as maybe in some presentations etc.

We’d appreciate your comments about ChemSpider and encourage you to comment on this blog if you would. Please leave your name and, if appropriate and should you wish, your organization name. We understand that we live in times where it is necessary to have the disclaimer “these views represent the view of the individual and not XXXX organization” so feel free to just sign yourself “a medicinal chemist” or “a Happy ChemSpider User” . We don’t ask for much in exchange for the work we’ve been doing for the past few years and yes, I agree it is cheeky, but as my son reminds me regularly …if I don’t ask I don’t get :-)

Or, feel free to leave the comments on LinkedIn, Facebook, Twitter or Friendfeed. I’ll find them! Thanks in advance

We mailed out the first issue of the ChemSpider Newsletter in January which was packed with info on what’s happening with ChemSpider and tips on how you can get the best out of ChemSpider. To make sure you receive your personal copy of future issues by email please make sure to register.

ChemSpider Newsletter

It’s one week to ScienceOnline 2010. Last year I missed it because of the threat of weather and this year I’ll likely be hobbling in on crutches. I’m listed to give a few presentations/demos and a joint session regarding Citizen Science and Students with Sandra Porter and Tara Richerson. I’m going to have the chance to catch up with people I know such as Cameron Neylon and JC Bradley who’ll be covering Open Notebook Science. I notice that Bill Hooker is in town and look forward to connecting with him too. I’ll get to meet Hope Leman who runs the blog “Significant Science” and released a blog post this weekend regarding an interview with me. It was a real pleasure to work on that with Hope.

ScienceOnline is THE place to be to discuss online science based on what I’ve heard from previous attendees. It always fills up early, is incredibly well organized in terms of workshops, guest attendees and social events based on what I’ve seen. I am looking forward to experiencing the event, sharing space with some of the leaders in the domain of online science and seeing some old friends. One week to go…lots of preparation work to do.

I’m presently building a list of examples of Citizen Science in Chemistry. If you have any examples you believe are worth highlighting please feel free to send them through. Thanks

social widget

Following on from other posts in this series from this week I’m going to continue to list new functionality over the holiday season. I’ll continue with the “Social Widget”. What IS the Social Widget? Well…it’s this thing to the left….it is an AddThis Button that is available for every compound page on ChemSpider now. If there is a particular chemical of interest on ChemSpider that you want to include into your social networking then you can do so by choosing the social networking site of interest and “adding” the link in there. For some it posts the link and for others it posts a thumbnail of the structure there that is linked back directly into ChemSpider.

So, if I posted to Friendfeed it will send the link directly into Friendfeed. I just did it..worked perfectly. For Facebook it actually carries the thumbnail as shown below on my Facebook page. SO, deposit some of your molecules onto ChemSpider and let the world know! Add some data, tell a story, post a reaction…and use AddThis to tell your network!

photochromism

My friend and colleague Sean Ekins and I wrote a perspective for the RSC’s Lab on a Chip journal and it was released as an Advance Article, as Free Access, this evening.

The perspective is entitled “Precompetitive preclinical ADME/Tox data: set it free on the web to facilitate computational model building and assist drug development“ . The title is self-explanatory in terms of what we are trying to communicate. The paper is online now and available here.

ADME

I have been at the German Conference for Cheminformatics for the past three days. The conference is in Goslar. I twittered the conference using #goslarcheminf and it appears that there was little interest in twittering here…seems like it’s an “American” thing to do. I gave a presentation entitled “ChemSpider – Building a Foundation for the Semantic Web by Hosting a Crowd Sourced Databasing Platform for Chemistry” and have put it on SlideShare here. The abstract for the talk is below as well as the embedded Slideshare widget for the talk. This talk was a lot less rushed than usual…not just 20 minutes and I personally enjoyed giving this talk to the audience. Commonly I feel that the talks I give are very rushed and I only get to scratch the surface of what we are up to with ChemSpider. It’s amazing how an additional 15 minutes allowed me to expand on the issues and the work. The presentation drew a lot of questions and attention after the session and I’m hoping that many of the discussions regarding collaboration and depositions of new data come to fruition.

Abstract

There is an increasing availability of free and open access resources for chemists to use on the internet. Coupled with the increasing availability of Open Source software tools we are in the middle of a revolution in data availability and tools to manipulate these data. ChemSpider is a free access website for chemists built with the intention of providing a structure centric community for chemists. It was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge.

There are tens if not hundreds of chemical structure databases such as literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data etc. and no single way to search across them.  Despite the fact that there were a large number of databases containing chemical compounds and data available online their inherent quality, accuracy and completeness was lacking in many regards. The intention with ChemSpider was to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data, experimental properties and linking to other valuable resources. It has grown into a resource containing over 21 million unique chemical structures from over 200 data sources.

ChemSpider has enabled real time curation of the data, association of analytical data with chemical structures, real-time deposition of single or batch chemical structures (including with activity data) and transaction-based predictions of physicochemical data. The social community aspects of the system demonstrate the potential of this approach. Curation of the data continues daily and thousands of edits and depositions by members of the community have dramatically improved the quality of the data relative to other public resources for chemistry.

This presentation will provide an overview of the history of ChemSpider, the present capabilities of the platform and how it can become one of the primary foundations of the semantic web for chemistry. It will also discuss some of the present projects underway since the acquisition of ChemSpider by the Royal Society of Chemistry.

Following on from my earlier post regarding our interest in aggregating physicochemical data for other groups to use in building their models and algorithms we announce that we are now depositing the data from QSAR World into ChemSpider and pointing back to the original sources on QSAR World. We harvest the SDF files, deposit onto ChemSpider and provide direct links into the original SDF file, with the appropriate titles, so that our users can proceed to gather the data for re-analysis if they find it of interest. An example record is here for Atovaquone where we list the links to data residing on QSAR World for download. The links can be seen under the supplemental information section as shown below where you can see links to seven different types of data. We have chosen, for the time being, to not deposit the values associated with these data onto ChemSpider as the data are very heterogeneous in representation even though they are all delivered as SDF files.

supplemental information

As an active member of the Wikipedia Chemistry team I continue to be impressed with the dedication and commitment that the members have to improving the quality AND quantity of information available on Wikipedia for chemists. The number of lost hours of sleep freely given to the benefit of Wikipedia, and in this specific case to the chemistry community, is immense. The number of “Compound Pages” on Wikipedia dedicated to drugs/chemicals has continued to grow and, despite a sincere effort on our part to keep everything linked up from ChemSpider to Wikipedia, it’s a little like chasing the Road Runner….we’re always behind!

We have been working with the WikiChem team of late to embed links from Wikipedia back to ChemSpider. I am humbled to know that our hard work to establish ChemSpider as a source of quality information has reached a level of trust such that Wikipedia now links from the ChemBoxes out to ChemSpider. The links are being updated on an ongoing basis at present, with hundreds of new links already established and more being generated all the time. Wikipedia User: Beetstra has written a ‘bot that is inserting ChemSpiderIDs across the database (see below) and we ARE doing rigorous checking of all of the links. This was using a file that we generated on our side showing links to Wikipedia from ChemSpider.

beetstra

We will then be able to generate a list of all ChemBoxes/DrugBoxes without links from Wikipedia to ChemSpider and we will then make the links on our side, manually curating the structures, and then hand back a file to finish all linking. At this point we will have the backfile under control and we can perform ongoing updates as new compound pages are created on ChemSpider and, if we curate and find errors on Wikipedia or ChemSpider, making a few manual edits is easy.

There are very dedicated teams on Wikipedia and ChemSpider carefully poring over data with their robots and eyeballs to create a linked data set of quality chemistry. It’s long, tedious AND important work. When it’s done we will have an expanded set of data to semantically link from RSC articles when we do markup.

I’ve been in discussions with JC Bradley and Andy Lang about the Open Notebook Science Solubility Data project. Specifically we’ve been comparing  logP predictions from the CDK versus those listed on ChemSpider. We actually have six values of logP listed for some records. For example, for toluene we have 4 predicted values, 1 experimental value from a database and 1 experimental value from a publication. These are shown below:

There are three predicted logP values from three different algorithms (ACD/LogP, XlogP and AlogPs) as shown at the top of the figure. There is a predicted value and a database value from the EPISuite from the EPA (middle of the figure) and there is a LogP value from a publication with the link out indicated by the arrow (this datum was deposited by Egon Willighagen when he deposited the data from his publication). If you examine the list of data, both experimental and predicted, you will see a general value of around 2.65+/- error. This should be compared with the CDK value listed in the ONS spreadsheet that gives a predicted value of 0.64. This was the primary reason that we were discussing the comparison…the values of predicted logP from CDK were different from the predicted values listed on ChemSpider for a number of examples in the spreadsheet.

Egon and I exchanged a couple of emails discussing the fact that logP predictions could be generated by a number of parties if there was a good Open Data training set available. A recent publication entitled “Calculation of Molecular Lipophilicity: State of the Art and Comparison of Log P Methods on More Than 96000 Compounds” performed a thorough analysis of different logP methods on a very large dataset. The publication is available online here. They compared “the predictive power of representative methods for one public (N = 266) and two in house datasets from Nycomed (N = 882) and Pfizer (N = 95 809). A total of 30 and 18 methods were tested for public and industrial datasets, respectively.” During the work they derived a simple equation based on the number of carbon atoms, NC, and the number of hetero atoms, NHET: log P = 1.46(±0.02) + 0.11(±0.001) NC – 0.11(±0.001) NHET. This equation was shown to outperform a large number of programs benchmarked in this study. This would certainly be easy to implement on ChemSpider and, just out of interest, applying this equation to toluene gives us a value of 2.23. Compare this with the values listed above.
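To show how little code the simple equation needs, here is a minimal sketch. The function name `simple_logp` is my own, the coefficients are the central values quoted above (the ± uncertainties are ignored), and counting the carbon and hetero atoms is left to the caller.

```python
def simple_logp(n_carbon: int, n_hetero: int) -> float:
    """log P = 1.46 + 0.11 * NC - 0.11 * NHET, using the quoted central values."""
    return 1.46 + 0.11 * n_carbon - 0.11 * n_hetero


# Toluene is C7H8: seven carbon atoms and no hetero atoms.
toluene_logp = simple_logp(n_carbon=7, n_hetero=0)
print(f"{toluene_logp:.2f}")  # 2.23
```

This reproduces the 2.23 quoted for toluene, since 1.46 + 7 × 0.11 = 2.23.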

Unfortunately there don’t appear to be many Open logP datasets available for people to use as training sets. Also, with the thorough work reported in the publication above, is it necessary to build yet another logP prediction algorithm? ACD/Labs have made their logP prediction software free for download (http://www.acdlabs.com/download/logp.html), the VCCLab software is available for free (http://www.vcclab.org/lab/alogps/), the EPISuite software is available for free (http://www.epa.gov/oppt/exposure/pubs/episuite.htm) and if you just want to predict a value for a compound not on ChemSpider then you can use the services here: http://www.chemspider.com/Services.aspx.

However, even though there are a lot of predictors available it still makes sense to gather data and provide it as an experimental dataset, made available as Open Data for the developers of such algorithms to take the benefits of structural diversity and fresh data to potentially improve their models. If you have any logP data available please point me to the data to download or contact me offline to discuss. We are presently working on enhancing our data model to provide improved access to experimental data on ChemSpider as well as access to the predicted data via web services. More to follow…

Today I received an email via the CHMINF list server pointing to the following Press Release. Part of the press release is shown here:

“In collaboration with the German National Library of Science and Technology (TIB) Thieme is the first publisher to make primary chemistry data accessible worldwide. Analytical data, from various experiments, is the foundation of research work and scientific papers. From now on, primary data will be registered and made available online via the Thieme eJournals website (www.thieme-connect.com/ejournals) using digital object recognition in the form of Digital Object Identifiers (DOI). This will enable scientists to easily locate research articles, including accompanying data, and make enhanced use of the scientific content.”

There has been a lot of discussion over the years regarding making available “primary data”. We offered to do this on the ChemSpider Journal of Chemistry: if people wanted to submit analytical data with an article that we published then we would post them as spectra associated with the article. Unfortunately the general consensus based on a few conversations that I had is that it is a lot of work to prepare data and deposit it. This is one of the reasons that, until now, publishers have generally made the spectral data available as plots and printouts of the data. These data are generally made available as electronic supplementary data. These data ARE valuable even in that form but, and I believe that the majority of scientists would agree, they would be more valuable if they were available in a format that would allow display in online applets, downloadable for processing and expansions etc. The RSC would certainly welcome the availability of spectral data associated with publications especially since they can now be hosted on ChemSpider.

Thieme have actually managed to pull off quite a coup and I commend them for their efforts. The first example datasets are available here. The listing includes “FIDs and associated files for the 1H, 13C and DEPT NMR spectra for compounds 14, (SS)-23, (SS)-25, (RS)-26, 27, (SS)-28, (RS,SS)-29, 30, (RS)-36, (SS)-36, (SS)-37, 38, (RS)-39, (SS)-39, (SS)-44, (RS)-46, (SS)-46, (RS)-48, (SS)-48, (SS)-49, 52, (RS)-53, (RS)-55, (RS)-57, (SS)-57, (SS)-58, (RS)-61, (SS)-61, (RS)-62, (SS)-62, (RS)-65 and (SS)-65 are summarized.” That’s a lot of data.

Since these are primary data they cannot be copyrighted so I chose to download the data, take a look and insert a couple into ChemSpider as an example of what can be done with these data. The associated PDF for the data says “The files can be processed using the following programs: MestReC, Bruker’s WINNMR and XWINNMR.” The files came as binary Bruker files so needed to be reprocessed and, in order to be deposited, had to be converted to JCAMP-DX format, the format supported by the JSpecView applet used on ChemSpider to display spectra. In order to do this I am fortunate to have access to ACD/NMR Processor, a product I managed for a few years while working at ACD/Labs. This product also supports the Bruker format so I imported the data, processed and exported as JCAMP and imported to ChemSpider. For compound 14 I have attached the H1 and C13 spectra and they can be seen here. I didn’t attach the “DEPT spectrum” yet. For me to download the spectra, redraw the structure, process the spectra, export as JCAMP and deposit to ChemSpider took about 15 minutes. However, there are a lot of spectra and it will take me a while. There are 32 compounds, I assume 3 spectra per compound (HNMR, CNMR and DEPT) so that’s a total of 96 spectra. It’ll take me about 10-12 hours just to deposit this collection so that’s a lot of work to do in my spare time. If anyone wants to help out and can process the spectra to deposit please do!
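The time estimate above can be checked with a few lines of arithmetic. This is a rough sketch only; it assumes the 15 minutes for compound 14 covered its two deposited spectra (H1 and C13), which is my reading of the numbers rather than something measured.

```python
compounds = 32
spectra_per_compound = 3  # HNMR, CNMR and DEPT, as assumed in the text
total_spectra = compounds * spectra_per_compound

# Assumption: 15 minutes covered the two spectra deposited for compound 14,
# i.e. roughly 7.5 minutes per spectrum.
minutes_per_spectrum = 15 / 2
total_hours = total_spectra * minutes_per_spectrum / 60

print(total_spectra, total_hours)  # 96 12.0
```

On that assumption the whole collection lands at the upper end of the 10-12 hour estimate.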

One of the spectra is shown below using the Spectral Embed function we introduced previously:

This is a rich collection of data…it can feed the Spectral Game described in this article. I look forward to getting the data onto ChemSpider and will be following up with Thieme to see if we can work together to host the data in a more generic format for the future. It’s a shame that the data are locked into a binary file format that needs reprocessing to view and I believe display through the JSpecView applet is advantageous for all. I encourage Thieme to consider also making the structure collection available in molfile, SMILES, InChI and InChIKey format – the InChIs will make the article discoverable via internet searches and through the InChI Resolver while the download of molfiles will speed up the loading process to ChemSpider and other systems.

We recently added the Borochem structure set to ChemSpider and asked whether they would be willing to provide us with a little quote for a brochure we are building to encourage other companies to deposit their data with us. We received a very nice response!

“ChemSpider is a great source of information for the chemical community.” says Alexandre Bouillon, C.E.O. and C.S.O. of organoboron building blocks provider BoroChem. “Our objective is to give a maximum number of chemists access to our catalogue of rare and innovative building blocks. Our chemical intermediates are available on our own corporate website and listed on the major online chemical directories, but we feel that getting exposed through the ChemSpider search engine will allow us to increase our visibility on the web and moreover, to contribute to one of the most complete databases of the chemical industry.”

Last week I had the pleasure of being on an agenda with a number of people whose work I applaud and who I genuinely enjoy spending time with and sharing thoughts about “what if?” Martin Walker, one of the people I collaborate with on Wikipedia, invited me to speak in his session “Publishing and Promoting Chemistry in the Internet Age“. Martin gave an introduction to the session and spoke about Chemistry on the Internet. Beth Brown gave an overview of the Chemist’s Toolkit for Publishing and Promoting your work on the Internet. I followed with an overview about what’s going on with ChemSpider and the issues of connectedness and quality of chemistry on the internet. JC Bradley spoke about transparency and Open Notebook Science. My hat’s off to Martin for arranging the speakers in that order. Considering we didn’t coordinate our talks it was an excellent trajectory throughout the session and very much an integrated overview of activities regarding chemistry on the internet.

My talk is posted on SlideShare here and is available below. Any comments and questions are welcomed.

Beth Brown has her talk online here and JC Bradley will post his online here.

JC Bradley and I had a good talk about ways we can collaborate together more closely on Open Notebook Science. We have a path forward so that ChemSpider can provide additional support and will be discussing the path forward offline.

Google are riding the surf associated with their release of Wave, even to a very small group of testers. Just do a search of Google Wave and you’ll see what I mean. There is a certain amount of “wave envy” in our domain right now as people want to get accounts to test. Test accounts are however being freed up quite quickly and there will be a number of cheminformaticians eager to insert their code into Wave as robots and enable specific integrations. When I was at Scifoo a few weeks ago we were granted Wave accounts to play around. I was impressed with the possibilities but found the system to be a little underwhelming in terms of stability and a little unfriendly in terms of usability. But, these are issues acknowledged by the team and, like many things Google, we are sure to see Wave get picked up by the masses when it’s released. And, it WILL release, with great fanfare.

Cameron Neylon has been the most vocal advocate of Google Wave ever since the first announcements were made about the platform. He has been pivotal in getting a voice for science with the Google Wave team and coordinated a meeting for us with members of the dev team at SciFoo. It was clear in that meeting that the meshing of ChemSpider web services into Google Wave would enable Waves to be enhanced with (semi-)semantic markups so that, at a minimum, chemical names could be used to look up chemicals on ChemSpider and retrieve a structure image so that hovering over the name in the document would show the structure image. Unfortunately we’ve been swamped with migrating ChemSpider to RSC servers and preparing for and attending the IUPAC Congress and ACS Fall Meeting in Washington. So, we got a grand sum of nothing done integrating Wave and ChemSpider.

Fortunately, we did well when the web services were built and Cameron has moved ahead with coding up ChemSpidey on his own. He announced that ChemSpidey is alive and kicking, with all eight legs, in his blog post here. Stealing shamelessly from Cameron’s post:

“If ChemSpidey is added to a wave it watches for text of the form “chem[ChemicalName{;weight {m}g}]” where the curly bracketed parts are optional. When a blip is submitted by hitting the “done” button ChemSpidey searches through the blip looking for this text and if it finds it, strips out the name and sends it to the ChemSpider SimpleSearch service. ChemSpider returns a list of database ids and the robot currently just pulls the top one off the list and adds the text ChemicalName (csid:####) to the wave, where the id is linked back to ChemSpider. If there is a weight present it asks the ChemSpider MassSpec API for the nominal molecular weight calculates the number of moles and inserts that. You can see video of it working here (look along the timeline for the ChemSpidey tag).”
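The text-matching step Cameron describes can be sketched along the following lines. This is not ChemSpidey’s actual code: the regular expression is my own guess at the “chem[ChemicalName{;weight {m}g}]” syntax (including the tolerance for spaces), the function names are hypothetical, and the calls to the SimpleSearch and MassSpec services are deliberately left out, so the molecular weight is simply passed in.

```python
import re
from typing import List, Optional, Tuple

# Assumed reading of the tag syntax: a name, then an optional ";<number> g" or
# ";<number> mg" part. Whitespace handling is a guess, not taken from the robot.
CHEM_TAG = re.compile(
    r"chem\[\s*(?P<name>[^\];]+?)\s*"
    r"(?:;\s*(?P<amount>\d+(?:\.\d+)?)\s*(?P<milli>m?)g\s*)?\]"
)


def parse_chem_tags(text: str) -> List[Tuple[str, Optional[float]]]:
    """Return (name, mass in grams or None) for every chem[...] tag in the text."""
    results = []
    for match in CHEM_TAG.finditer(text):
        grams = None
        if match.group("amount") is not None:
            grams = float(match.group("amount"))
            if match.group("milli"):  # "mg" rather than "g"
                grams /= 1000.0
        results.append((match.group("name"), grams))
    return results


def moles(grams: float, molecular_weight: float) -> float:
    """Number of moles for a mass in grams and a molecular weight in g/mol."""
    return grams / molecular_weight
```

For example, `parse_chem_tags("Add chem[toluene;500 mg] to chem[ethanol]")` gives one entry with a mass of 0.5 g and one with no mass at all; a robot would then look up each name and, where a mass is present, pass the returned molecular weight to `moles`.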

Go and watch the movie. You’ll likely have to watch it while zoomed in to see what is going on. Cameron went further than I’d originally considered by pulling back Mw from our MassSpec Web service in order to do calculations on the fly etc. The display of the structure by hovering over the CSID embedded in the Wave is not yet implemented and we need to cover this for sure.
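
For readers who want to see the moving parts, here is a minimal Python sketch of the tag handling Cameron describes. It is not ChemSpidey’s actual code: the regular expression is my reading of the chem[...] pattern, `lookup_stub` is a hypothetical stand-in for the ChemSpider SimpleSearch and MassSpec calls, and the CSID inside it is a placeholder rather than a real record.

```python
import re

# Matches chem[Name], chem[Name;780mg] or chem[Name; 2 g].
# The weight part is optional, as is the "m" prefix on the unit.
CHEM_TAG = re.compile(r"chem\[([^\];]+)(?:;\s*(\d+(?:\.\d+)?)\s*(m?)g)?\]")


def lookup_stub(name):
    """Hypothetical stand-in for the web service calls.

    Returns (csid, nominal molecular weight in g/mol) or None.
    The CSID below is a placeholder, not a real ChemSpider ID.
    """
    table = {"benzene": (12345, 78.0)}
    return table.get(name.lower())


def annotate(blip_text, lookup=lookup_stub):
    """Replace each chem[...] tag with 'Name (csid:N)' plus mmol if a weight was given."""

    def replace(match):
        name = match.group(1).strip()
        hit = lookup(name)
        if hit is None:
            return match.group(0)  # unknown name: leave the tag untouched
        csid, mol_weight = hit
        out = f"{name} (csid:{csid})"
        if match.group(2) is not None:
            grams = float(match.group(2))
            if match.group(3) == "m":
                grams /= 1000.0  # milligrams to grams
            out += f", {grams / mol_weight * 1000:.3g} mmol"
        return out

    return CHEM_TAG.sub(replace, blip_text)


print(annotate("Add chem[benzene;780mg] to the flask"))
```

Passing the lookup in as a parameter keeps the parsing separate from the network call, so the stub could later be swapped for a real client without touching the tag handling.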

This is a good start to build on, and here are some things that we have to work on…

1) If a call is made to retrieve a chemical based on a chemical name and there are MULTIPLE compounds with that name then figure out how to allow the user to select the one they want

2) Display the structure image with direct link back to ChemSpider – and if appropriate extend to include links to PubChem, Wikipedia, RSC journal articles etc, presence of analytical data etc. (all the things we were going to do with ChemMantis!)

3) Change data model to mark “Fully Curated”  structures so that when a structure image and associated meta data are passed to ChemSpidey the robot knows that this isn’t just a name-structure relationship but that humans have curated the data and say “it’s correct”. Then of course…humans can be wrong too!

4) Provide access to other services: from a structure in a Google Wave document allow generation of InChI, InChIKey, SMILES, search PubMed, search Patents, “world is my oyster.com”

We are now working in multiweek development sprints and will look to include some time for ChemSpidey enhancement/development in a future sprint. I have a lot of faith in what Google Wave will bring to us all and, despite the early teething troubles, as with all things Google (as far as I can tell) it will improve in terms of stability and usability but may be in perpetual beta for a few years!

We just returned from the IUPAC Congress in Glasgow Scotland where we unveiled the new ChemSpider logo, booth, catalog, ChemSpider game (!) and spent time talking to lots of delegates. I summarized the experience for the team involved with making RSC ChemSpider one of the most exciting things to see on the exhibition floor…the communication I sent around is below. I hope it conveys the pride we have for what was done in a very short period of time following ChemSpider joining RSC on June 1st. It was a focused team and we did great. ACS is one week away and we’re now focused on that meeting. See you there!

“Today was a moment of pride. It was the first day at the booth…the first time we unveiled RSC-ChemSpider to the world. The IUPAC Congress was busy for us. Graham, Richard and Tony became the representatives for the collective efforts of a VERY large team who, in less than 2 months, have taken a basement-hosted system developed by a small team to a system hosted on the RSC Servers in Cambridge, with a marketing profile that is much higher and more polished than anything previously associated with the “‘Spider”.

It was great. People got it. They searched for chemicals they didn’t expect us to have…and found them. They found their OWN papers on our system….three times just in my hands! They deposited data live. They curated data live. The system was responsive ..and that was on wireless! The ChemSpider Game got great attention…and Richard still holds the record! The ChemSpider shirts were smart. RSC employees were genuinely excited by the opportunity that ChemSpider offers for the society. People who’d never heard of us were rushing home to tell others.

People made interesting comments…”This is the best thing I have seen all day”. “Do you realize how much this will do for the world of chemistry?” We have new potential collaborators. We have people interested in depositing data at a “personal” level, managing their data on our system.

What did they like? Text searching, structure searching, the size and diversity of the database, the nature of what was on the DB (videos, reactions, blog posts etc), speed, the logo, the booth, the game.

Getting here in 2 months has been an amazing achievement and has required the effort of a very large and extended team. We’ve dealt with the geographical challenges (and its time zone challenges). In two months we’ve crossed through the forming and storming stages of functional team development (if this doesn’t ring a bell see: http://en.wikipedia.org/wiki/Forming-storming-norming-performing) and are now WELL on our way to having a highly functional and productive team to make ChemSpider sing at ACS and well into the future.

The last time I was this happy about a “first day in the booth” experience was many years ago and the product we rolled out then became the dominant standard in the industry and revolutionized the way scientists handled their data.

In my humble opinion the take up for ChemSpider today was BIGGER! People got it.”

DISTRIBUTED BY NATURE PUBLISHING GROUP ON BEHALF OF THE INCHI TRUST
21 July 2009

Contact: Grace Baynes
Corporate Public Relations, Nature Publishing Group
T:+44 (0)20 7014 4063
g.baynes@nature.com

The InChI Trust, a not-for-profit organisation to expand and develop the InChI Open Source chemical structure representation algorithm, is formally launched this week. Originally developed by the International Union of Pure and Applied Chemistry (IUPAC), the IUPAC International Chemical Identifier (InChI) is an alpha-numeric character string generated by an algorithm. The InChI was developed as a new, non-proprietary, international standard to represent chemical structures. The Trust aims to develop and improve on the current InChI standard, further enabling the interlinking of chemistry and chemical structures on the web. The connection with IUPAC is maintained through IUPAC’s InChI Subcommittee.

The InChI algorithm turns chemical structures into machine-readable strings of information. InChIs are unique to the compound they describe and can encode absolute stereochemistry. Machine-readable, the InChI allows chemistry and chemical structures to be navigable and discoverable. A simple analogy is that InChI is the bar-code for chemistry and chemical structures. The InChI format and algorithm are non-proprietary and the software is open source, with ongoing development done by the community.

“The goal of the InChI Trust”, says Project Director Stephen Heller “is to continue to develop the InChI and InChIKey, the condensed machine-searchable version, as a tool to enable widescale linking of chemical information.”

The InChI Trust was formally incorporated in the UK in May 2009, and now has 6 charter members: The Royal Society of Chemistry, Nature Publishing Group, FIZ-Chemie Berlin, Symyx Technologies, Taylor & Francis and OpenEye. Further organizations and publishers are in the process of joining the InChI Trust.

“Nature Publishing Group is delighted to be a charter member of the InChI Trust”, says Jason Wilde, Publisher for the Physical Sciences, Nature Publishing Group. “We view the ongoing maintenance of the InChI algorithm, and the resulting adoption of InChI, as important for the development of chemistry communication. The interlinking that the InChI offers between journal content and databases ensures that chemistry is the first truly web-enabled scientific discipline.”

“The InChI has already gained a wide user base,” says Richard Kidd, Informatics Manager at the Royal Society of Chemistry, “and the Trust will ensure continuing development and support for this key standard, helping to link together chemical resources across the internet. The RSC is proud to support the InChI Trust.”

Since the introduction of the InChI in 2005, there has been widespread take-up of InChI standards by public databases and journals. Today, there are more than 100 million InChIs in scientific literature and products.

To date, numerous databases, journals, and chemical structure drawing programs have incorporated the InChI algorithm. These include the NIST WebBook and mass spectral databases, the NIH/NCBI PubChem database, the NIH/NCI database, the EBI chemistry database, ChemSpider, Symyx Draw and many others.

The initiative serves chemists, publishers, chemical software companies, chemical structure drawing vendors, librarians, and intermediaries by creating an international standard to represent defined chemical structures. This provides a consistent, credible and compatible way for databases of chemical structures to be linked together for the benefit of users of chemical information around the world.

-ENDS-

For further information, please contact:

Project Director, Dr. Stephen Heller at steve@inchi-trust.org

Background notes:

The InChI project was initially undertaken by IUPAC with the cooperation of National Institute for Standards and Technology (NIST). In 2009, a standard version of InChI and the InChIKey were released. Members of the InChI Trust will pay annual dues to support the continued development of InChI, and maintenance of the InChI algorithm. This income will be used exclusively for InChI algorithm development, maintenance, outreach, and educational activities associated with the project.

Details of the up-take by many chemical database providers, software developers, and journal publishers are available at www.iupac.org/inchi/adopters.html

There have been other comments about Wolfram Alpha and its support for Chemistry (1,2 and others) but I have remained rather quiet until now about my experiences with Alpha for a couple of reasons. First of all I’d rather let the service settle down a bit before poking at it too hard. My experiences of going live with ChemSpider were definitely that it takes a while to stabilize the system and address some of the earliest feedback. Also, knowing that I would be at Scifoo and aware that Theodore Gray would be there I had hoped to see Alpha in action. I wasn’t disappointed. Yesterday Theodore drove the system in front of an audience including a number of interested scientists, members of Google, and Peter Murray-Rust and myself from Chemistry. Theo had no fear…essential for live demos. He was asked questions and he took the plunge, did the search and, with the rest of us, celebrated the outcome, whether a successful search, a weird result or something just plain wrong. It was ALL good. I am impressed. I am impressed by what they are out to achieve with Wolfram Alpha. I am convinced that what they are doing with Alpha will contribute to science and mathematics in general and that Chemists will be using this system when they have more awareness of it.

For a general intro to Alpha see the presentation here.

So, some examples of interesting searches:

1) A guy in the room had asked the question “What is the largest land mammal?” and had not received an answer a few weeks earlier. Now Theo posed that question and got the answer here. Nice! Now, I took that to mean that they were keeping logs of failed queries and tweaking…confirmed by Theo. VERY nice.

2) Peter Murray Rust had previously blogged about bad results from his searches (searching on dibromoethane for example). When he repeated his searches in the session hosted by Theo he acknowledged that he was pleased that they had fixed the issues he had previously blogged about. This is how modern systems should be …moving quickly.

3) Searching on names…for example, what is the number of people with my name…my spelling is Antony NOT Anthony. See here for the results.

4) What is the return per employee for Google versus IBM. It’s in this query: http://www35.wolframalpha.com/input/?i=GOOG+IBM

5) What are the chemical structures of Taxol? Methamphetamine? Cholesterol? Buckminsterfullerene? You get answers for all. The organic molecules all give images of chemical structures. The connections in all cases are correct but I see no evidence of stereochemistry anywhere across the chemical structures on the database..it doesn’t mean it’s not there but I couldn’t find it.

So, for chemistry, am I impressed? Yes I am. I’m not worried right now that Alpha is not dealing with stereochemistry…I am sure they will layer that on later. It is clear based on most of the results that I have seen that there is some GOOD curation of the data going on. According to Theo there are chemists on staff and they are curating the data coming in. Hallelujah! If you look in the Source Information for Taxol you see a LONG list of sources of chemical source information and the primary source is the Wolfram Alpha Curated Data.

There is much that can be done to help Wolfram Alpha to have better Chemistry. They have a HARD job ahead of them if they are going to sample the Public Databases to grab quality chemistry. It’s in there for sure but it’s hard to find. What could come out of ChemSpider and Wolfram Alpha working together?

1) If we could get the list of “compounds” in Wolfram Alpha then we can provide chemical compound connection tables with all necessary stereochemistry etc.

2) When we pass back the compound list then we can pass back ChemSpider IDs and get them listed as identifiers alongside the PubChem CID. In theory it would be good to get these linked back to ChemSpider so that a user can come and find associated articles, analytical data, the wikipedia article, predicted and experimental properties and so on. This is where ChemSpider’s integration would be of value.

3) There is an opportunity to expand the chemistry in Wolfram Alpha by passing a subset of ChemSpider compounds to be added to Alpha. Certainly I don’t think that Alpha should host all 21.5 million of our compounds for the reasons I have enumerated many times on this blog. See my last post about the 54 versions of the Taxol skeleton…there should be only one Taxol. But, there may be a way to subset “important chemistry” and get it into Alpha. OR, maybe they do want it all?

There are clearly opportunities to help expand the chemistry and I hope we have the chance. I think Alpha is incredibly ambitious. But why not be ambitious? ChemSpider was ambitious too and look what we have done with three servers in a basement…it’s a whole lot fewer resources than Wolfram are throwing at Alpha. I want them to be successful…a computational engine for the public. Why not….so many of us are asking questions using search engines right now and can’t get anywhere near an answer…

The ChEMBL blog is an excellent read and if you’re interested in “Open Access Drug Discovery And Medicinal Chemistry Data” this is one for you. We are shamelessly, and WITH permission, taking some of the blogposts about New Drug Approvals and adding them into the descriptions on ChemSpider. Some examples are here and here. To date, in all cases where we have added the description, the compound itself was already on ChemSpider and with the correct name. That’s good news based on some of our subjective measures of coverage for the database.

The Spectral Game at www.spectralgame.com is powered by chemical structures and spectra from ChemSpider. A provisional version of our manuscript describing the game is now online at the Journal of Cheminformatics here:

The Spectral Game: leveraging Open Data and crowdsourcing for education

Jean-Claude Bradley , Robert J Lancashire , Andrew SID Lang and Antony J Williams

Journal of Cheminformatics 2009, 1:9 doi:10.1186/1758-2946-1-9

Published: 26 June 2009

Abstract (provisional)

We report on the implementation of the Spectral Game, a web-based game where players try to match molecules to various forms of interactive spectra including 1D/2D NMR, Mass Spectrometry and Infrared spectra. Each correct selection earns the player one point and play continues until the player supplies an incorrect answer. The game is usually played using a web browser interface, although a version has been developed in the virtual 3D environment of Second Life. Spectra uploaded as Open Data to ChemSpider in JCAMP-DX format are used for the problem sets together with structures extracted from the website. The spectra are displayed using JSpecView, an Open Source spectrum viewing applet which affords zooming and integration. The application of the game to the teaching of proton NMR spectroscopy in an undergraduate organic chemistry class and a 2D Spectrum Viewer are also presented.

Scifoo is just a few weeks away and I was reviewing the list of attendees this evening to see who I would be sharing space with.

I am especially looking forward to spending time with Andrew Lang, one of the brains behind the Spectral Game. We’ve spoken on the phone, exchanged many emails and worked on a couple of projects together. But we get to meet at SciFoo!

Last time I was at SciFoo I spent time talking with Cameron Neylon and JC Bradley about Open Notebook Science. At that time I had lots of ideas about what we could do to support Open Notebook Science. We actually have done quite well but at that time we were severely resource constrained. Things are a little different now we have been acquired by the RSC and I am looking forward to talking about what’s necessary and possible now.

Nicko Goncharoff from SureChem will be there. Nicko and I have spent a lot of time together over the past few years, mostly by phone and over email as we worked to integrate SureChem into ChemSpider and use their software development kit under our ChemMantis semantic chemistry markup tool. It’s always good to see him.

Other people I hope to spend some time talking to: Peter Murray Rust from the University of Cambridge, Timo Hannay, Alf Eaton and Terry Sheppard from the Nature Publishing Group and Theodore Gray.

I have set up a LinkedIn Users and Advisors group today and welcome any LinkedIn users interested in ChemSpider to join the group and stay informed about our activities on ChemSpider. I hope that it also provides a useful environment for discussion and collaboration around ChemSpider.

The ChemSpider LinkedIn Group can be accessed here.

I have given a number of talks regarding ChemSpider over the past few months and generally comment “ChemSpider hosts almost 21.5 million unique chemical entities from over 200 data sources.” As of today it is over 21.5 million chemical entities. We have deposited data from a number of new contributors of late; many of these are smaller chemical vendors such as Bridge Organics and ExtraSynthese. However, we recently crossed the 21.5 million mark because we have started to take advantage of the eMolecules dataset made available as a downloadable set. There are over 5 million structures in the dataset.

Many, but not all of these, deduplicate onto the ChemSpider database. The 21.5 millionth structure links to this record on eMolecules.

When the data are added onto ChemSpider we automatically add SMILES, InChIs, MW, MF and a series of predicted physicochemical properties. This is for the new structures from eMolecules. In many cases however eMolecules is simply one more data source among many and information such as spectra, Wikipedia links, experimental data etc are all integrated. In this case though eMolecules can help you source a vendor for the material as is their strength.
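
To make “deduplicate” a little more concrete, here is a rough Python sketch of merging an incoming vendor set into an existing collection. It assumes each structure has already been reduced to a standard key such as an InChIKey; that is my assumption for illustration, and the function and source names are hypothetical. It is not a description of how the ChemSpider deposition pipeline actually works.

```python
def merge_by_key(existing, incoming):
    """Merge (key, source) pairs into existing, a dict of key -> set of sources.

    Returns the number of keys that were genuinely new. Records whose key
    is already present only gain an extra data source.
    """
    new_count = 0
    for key, source in incoming:
        if key not in existing:
            existing[key] = set()
            new_count += 1
        existing[key].add(source)
    return new_count


BENZENE = "UHOVQNZJYSORNB-UHFFFAOYSA-N"  # InChIKey for benzene
ETHANOL = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"  # InChIKey for ethanol

# Hypothetical starting state: benzene already known from one source.
database = {BENZENE: {"ExistingSource"}}
added = merge_by_key(database, [(BENZENE, "eMolecules"), (ETHANOL, "eMolecules")])
print(added, sorted(database[BENZENE]))
```

In this toy run benzene simply picks up eMolecules as one more data source, while ethanol is counted as a new structure, which mirrors the two situations described above.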