Archive for October, 2009

JC has given a great overview of how students might want to use ChemSpider for the purpose of chemical information retrieval on the internet. JC’s course lecture thoroughly exercises ChemSpider, in real time, to do searches across the internet. He posted his seminar to Scivee here and I have embedded the lecture below. It’s a good talk for students and I encourage you to share it and review how ChemSpider can be used in your classwork and in your laboratories.

Buy me a Coffee

Following on from my earlier post regarding our interest in aggregating physicochemical data for other groups to use in building their models and algorithms, we announce that we are now depositing the data from QSAR World into ChemSpider and pointing back to the original sources on QSAR World. We harvest the SDF files, deposit them onto ChemSpider and provide direct links to the original SDF files, with the appropriate titles, so that our users can gather the data for re-analysis if they find it of interest. An example record is here for Atovaquone, where we list the links to data residing on QSAR World for download. The links can be seen under the supplemental information section as shown below, where you can see links to seven different types of data. We have chosen, for the time being, not to deposit the values associated with these data onto ChemSpider as the data are very heterogeneous in representation even though they are all delivered as SDF files.

[Image: supplemental information section]


What’s your favorite flavor of mercury acetate: on Wikipedia here, on CAS Common Chemistry here or on ChemSpider here?

How would you represent this structure if you were to draw it as a 2D diagram?

[Image: mercury acetate]


As an active member of the Wikipedia Chemistry team I continue to be impressed with the dedication and commitment that the members have to improving the quality AND quantity of information available on Wikipedia for chemists. The number of lost hours of sleep freely given to the benefit of Wikipedia, and in this specific case to the chemistry community, is immense. The number of “Compound Pages” on Wikipedia dedicated to drugs/chemicals has continued to grow and, despite a sincere effort on our part to keep everything linked up from ChemSpider to Wikipedia, it’s a little like chasing the Road Runner…we’re always behind!

We have been working with the WikiChem team of late to embed links from Wikipedia back to ChemSpider. I am humbled to know that our hard work to establish ChemSpider as a source of quality information has reached a level of trust such that Wikipedia now links from the ChemBoxes out to ChemSpider. The links are being updated on an ongoing basis, with hundreds of new links already established and more being generated. Wikipedia User:Beetstra has written a bot that is inserting ChemSpiderIDs across the database (see below) and we ARE doing rigorous checking of all of the links. The bot works from a file that we generated on our side showing links to Wikipedia from ChemSpider.

[Image: Beetstra’s bot edits]

We will then be able to generate a list of all ChemBoxes/DrugBoxes without links from Wikipedia to ChemSpider. We will make the links on our side, manually curating the structures, and then hand back a file to finish all linking. At this point we will have the backfile under control and we can perform ongoing updates as new compound pages are created on ChemSpider. If we curate and find errors on Wikipedia or ChemSpider, making a few manual edits is easy.

There are very dedicated teams on Wikipedia and ChemSpider carefully poring over data with their robots and eyeballs to create a linked data set of quality chemistry. It’s long, tedious AND important work. When it’s done we will have an expanded set of data to semantically link from RSC articles when we do markup.


I’ve been in discussions with JC Bradley and Andy Lang about the Open Notebook Science Solubility Data project. Specifically we’ve been comparing logP predictions from the CDK with those listed on ChemSpider. We actually have six values of logP listed for some records. For example, for toluene we have four predicted values, one experimental value from a database and one experimental value from a publication. These are shown below:

[Image: toluene logP values]

There are three predicted logP values from three different algorithms (ACD/LogP, XlogP and AlogPs) as shown at the top of the figure. There is a predicted value and a database value from the EPISuite from the EPA (middle of the figure) and there is a logP value from a publication with the link out indicated by the arrow (this datum was deposited by Egon Willighagen when he deposited the data from his publication). If you examine the list of data, both experimental and predicted, you will see a general value of around 2.65, give or take the error. This should be compared with the CDK value listed in the ONS spreadsheet, which gives a predicted value of 0.64. This was the primary reason that we were discussing the comparison…the values of predicted logP from CDK were different from the predicted values listed on ChemSpider for a number of examples in the spreadsheet.

Egon and I exchanged a couple of emails discussing the fact that logP predictions could be generated by a number of parties if there was a good Open Data training set available. A recent publication entitled “Calculation of Molecular Lipophilicity: State of the Art and Comparison of Log P Methods on More Than 96000 Compounds” performed a thorough analysis of different logP methods on a very large dataset. The publication is available online here. They compared “the predictive power of representative methods for one public (N = 266) and two in house datasets from Nycomed (N = 882) and Pfizer (N = 95 809). A total of 30 and 18 methods were tested for public and industrial datasets, respectively.” During the work they derived a simple equation based on the number of carbon atoms, NC, and the number of hetero atoms, NHET: log P = 1.46(±0.02) + 0.11(±0.001) NC - 0.11(±0.001) NHET. This equation was shown to outperform a large number of programs benchmarked in this study. This would certainly be easy to implement on ChemSpider and, just out of interest, applying this equation to toluene (NC = 7, NHET = 0) gives us a value of 2.23. Compare this with the values listed above.
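To show just how little is involved, here is a minimal Python sketch of that equation. The function name and the comparison table are my own illustration, not anything running on ChemSpider; it uses only the central coefficient values and ignores the quoted uncertainties. The two reference numbers are the ones mentioned in this post (roughly 2.65 for the values listed on ChemSpider and 0.64 for the CDK prediction).

```python
def simple_logp(n_carbon: int, n_hetero: int) -> float:
    """Estimate logP from atom counts: 1.46 + 0.11*NC - 0.11*NHET.

    Hypothetical helper for illustration; uses the central coefficient
    values from the quoted equation and ignores their uncertainties.
    """
    return 1.46 + 0.11 * n_carbon - 0.11 * n_hetero


# Toluene is C7H8: seven carbons and no hetero atoms.
toluene = simple_logp(n_carbon=7, n_hetero=0)
print(f"toluene (NC/NHET equation): {toluene:.2f}")  # prints 2.23

# Values quoted in this post, for a rough side-by-side comparison.
reference = 2.65  # approximate value seen across the ChemSpider listings
candidates = {
    "NC/NHET equation": toluene,
    "CDK (ONS spreadsheet)": 0.64,
}
for name, value in candidates.items():
    print(f"{name}: {value:.2f} (difference from ~2.65: {value - reference:+.2f})")
```

For anything beyond a back-of-the-envelope check you would want to count the atoms from a structure file rather than by hand, which is an exercise I have left out here.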

Unfortunately there don’t appear to be many Open logP datasets available for people to use as training sets. Also, with the thorough work reported in the publication above, is it necessary to build yet another logP prediction algorithm? ACD/Labs have made their logP prediction software free for download (http://www.acdlabs.com/download/logp.html), the VCCLab software is available for free (http://www.vcclab.org/lab/alogps/), the EPISuite software is available for free (http://www.epa.gov/oppt/exposure/pubs/episuite.htm) and if you just want to predict a value for a compound not on ChemSpider then you can use the services here: http://www.chemspider.com/Services.aspx.

However, even though there are a lot of predictors available, it still makes sense to gather data and provide it as an experimental dataset, made available as Open Data, so that the developers of such algorithms can take advantage of structural diversity and fresh data to potentially improve their models. If you have any logP data available please point me to the data to download or contact me offline to discuss. We are presently working on enhancing our data model to provide improved access to experimental data on ChemSpider as well as access to the predicted data via web services. More to follow…


Today I received an email via the CHMINF list server pointing to the following Press Release. Part of the press release is shown here:

“In collaboration with the German National Library of Science and Technology (TIB) Thieme is the first publisher to make primary chemistry data accessible worldwide. Analytical data, from various experiments, is the foundation of research work and scientific papers. From now on, primary data will be registered and made available online via the Thieme eJournals website (www.thieme-connect.com/ejournals) using digital object recognition in the form of Digital Object Identifiers (DOI). This will enable scientists to easily locate research articles, including accompanying data, and make enhanced use of the scientific content.”

There has been a lot of discussion over the years regarding making “primary data” available. We offered to do this on the ChemSpider Journal of Chemistry: if people wanted to submit analytical data with an article that we published then we would post them as spectra associated with the article. Unfortunately the general consensus, based on a few conversations that I had, is that it is a lot of work to prepare data and deposit it. This is one of the reasons that, until now, publishers have generally made the spectral data available as plots and printouts, usually as electronic supplementary data. These data ARE valuable even in that form but, and I believe that the majority of scientists would agree, they would be more valuable if they were available in a format that would allow display in online applets, download for processing, expansions etc. The RSC would certainly welcome the availability of spectral data associated with publications, especially since they can now be hosted on ChemSpider.

Thieme have actually managed to pull off quite a coup and I commend them for their efforts. The first example datasets are available here. The listing includes “FIDs and associated files for the 1H, 13C and DEPT NMR spectra for compounds 14, (SS)-23, (SS)-25, (RS)-26, 27, (SS)-28, (RS,SS)-29, 30, (RS)-36, (SS)-36, (SS)-37, 38, (RS)-39, (SS)-39, (SS)-44, (RS)-46, (SS)-46, (RS)-48, (SS)-48, (SS)-49, 52, (RS)-53, (RS)-55, (RS)-57, (SS)-57, (SS)-58, (RS)-61, (SS)-61, (RS)-62, (SS)-62, (RS)-65 and (SS)-65 are summarized.” That’s a lot of data.

Since these are primary data they cannot be copyrighted, so I chose to download the data, take a look and insert a couple into ChemSpider as an example of what can be done with these data. The associated PDF for the data says “The files can be processed using the following programs: MestReC, Bruker’s WINNMR and XWINNMR.” The files came as binary Bruker files so needed to be reprocessed and, in order to be deposited, had to be converted to JCAMP-DX format, the format supported by the JSpecView applet used on ChemSpider to display spectra. In order to do this I am fortunate to have access to ACD/NMR Processor, a product I managed for a few years while working at ACD/Labs. This product also supports the Bruker format so I imported the data, processed it, exported it as JCAMP and imported it to ChemSpider. For compound 14 I have attached the H1 and C13 spectra and they can be seen here. I didn’t attach the “DEPT spectrum” yet. Downloading the spectra, redrawing the structure, processing the spectra, exporting as JCAMP and depositing to ChemSpider took me about 15 minutes. However, there are a lot of spectra and it will take me a while. There are 32 compounds and I assume 3 spectra per compound (HNMR, CNMR and DEPT), so that’s a total of 96 spectra. It’ll take me about 10-12 hours just to deposit this collection, so that’s a lot of work to do in my spare time. If anyone wants to help out and can process the spectra to deposit, please do!

One of the spectra is shown below using the Spectral Embed function we introduced previously:

This is a rich collection of data…it can feed the Spectral Game described in this article. I look forward to getting the data onto ChemSpider and will be following up with Thieme to see if we can work together to host the data in a more generic format for the future. It’s a shame that the data are locked into a binary file format that needs reprocessing to view and I believe display through the JSpecView applet is advantageous for all. I encourage Thieme to consider also making the structure collection available in molfile, SMILES, InChI and InChIKey format – the InChIs will make the article discoverable via internet searches and through the InChI Resolver while the download of molfiles will speed up the loading process to ChemSpider and other systems.


We recently added the BoroChem structure set to ChemSpider and asked whether they would be willing to provide us with a short quote for a brochure we are building to encourage other companies to deposit their data with us. We received a very nice response!

“ChemSpider is a great source of information for the chemical community.” says Alexandre Bouillon, C.E.O. and C.S.O. of organoboron building blocks provider BoroChem. “Our objective is to give a maximum number of chemists access to our catalogue of rare and innovative building blocks. Our chemical intermediates are available on our own corporate website and listed on the major online chemical directories, but we feel that getting exposed through the ChemSpider search engine will allow us to increase our visibility on the web and moreover, to contribute to one of the most complete databases of the chemical industry.”


We get a lot of kudos for what we do with ChemSpider and we appreciate it. Sometimes there is an email that comes in that just makes me smile. One from this week is shown below…it’s nice to be appreciated!

“Dr ChemSpider,
GOD BLESS you and your website! My classmate and I just wanted you to know that we appreciate your website to the UTMOST!! you saved us hours upon hours of work… we have been spending hours trying to figure out a structure from our lab reaction product. THANKS for the awesome website, we are now able to further our knowledge in organic chemistry!!!”


I gave a talk today at the ICIC 2009 meeting here in Sitges, Spain. It is an interesting meeting and I will report on some of the presentations later. I’m glad I am here. The presentation is here on Slideshare and is a modified version of a presentation I gave on Saturday at the Microsoft eScience conference in Pittsburgh. One of the questions that followed the presentation was whether ChemSpider could be used as a measuring stick for quality (I am paraphrasing). My response was that there are millions of errors on ChemSpider. That seemed to raise a giggle, and other people have since seemed surprised.

In my opinion, as shocking as it sounds, it must be true. Why?

There are almost 23 million unique chemical entities on the database. Many of them have multiple names associated, many have experimental properties and many have tens of links to external databases. The structural layout has been created using algorithms. Algorithms have been used to generate systematic names. There are spectra submitted by the public and they can be mis-referenced, as an example, or declared to have been run in one solvent and ACTUALLY run in another. There are sometimes multiple registry numbers associated with a compound…a CAS number for a salt associated with the neutral compound, for example. The multiple links out to external resources number in the tens of millions and these are changing daily as other websites and databases curate and edit their data. Errors are inevitable and, I judge, there must be millions of errors on ChemSpider, just as there must be millions on Wikipedia and in the search results you get back from Google. The question is what counts as an error? I’m using a broad brush for an error…a structure with a poor depiction is an error. A misspelling is an error. A dead link to a database is an error. So…definitely millions. But we continue our work to whittle down the number, with the assistance of the community, every day. We’re doing it while we are depositing new compounds onto the database, so it’s an interesting challenge. Millions of errors doesn’t make ChemSpider less useful…we’re just realistic about the magnitude of the challenge!


Last week I had the pleasure of being on an agenda with a number of people whose work I applaud and with whom I genuinely enjoy spending time and sharing thoughts about “what if?” Martin Walker, one of the people I collaborate with on Wikipedia, invited me to speak in his session “Publishing and Promoting Chemistry in the Internet Age”. Martin gave an introduction to the session and spoke about Chemistry on the Internet. Beth Brown gave an overview of the Chemist’s Toolkit for Publishing and Promoting your work on the Internet. I followed with an overview of what’s going on with ChemSpider and the issues of connectedness and quality of chemistry on the internet. JC Bradley spoke about transparency and Open Notebook Science. My hat’s off to Martin for arranging the speakers in that order. Considering we didn’t coordinate our talks, it was an excellent trajectory throughout the session and very much an integrated overview of activities regarding chemistry on the internet.

My talk is posted on SlideShare here and is available below. Any comments and questions are welcomed.

Beth Brown has her talk online here and JC Bradley will post his online here.

JC Bradley and I had a good talk about ways we can collaborate more closely on Open Notebook Science. We have a path forward so that ChemSpider can provide additional support and will be discussing the details offline.


Those of you who have watched the historical development of ChemSpider are likely aware of our development of the ChemMantis platform and our use of the system to deliver the Open Access journal “The ChemSpider Journal of Chemistry” (CJOC). Following the acquisition of ChemSpider by the RSC we have been extremely busy migrating ChemSpider onto the RSC infrastructure and working on a whole series of public-facing and internal projects. We simply haven’t had time to populate CJOC with new articles. That said, we have also been looking to bring more of a focus to both CJOC and ChemMantis.

The majority of interest we were getting for the platform, and the greatest benefits in terms of the semantic markup, came from discussions about organic chemistry and, specifically, the application of organic synthesis procedures. Many of the articles that we posted to CJOC as examples were sourced from the Molbank collection, an excellent Open Access journal focused on the synthesis of chemical compounds. ChemSpider is a database of chemical compounds. When we were developing the data model for ChemSpider we always knew that a time would come when we would need to support chemical reactions. CJOC became the container for those reactions in the initial phase of our work, housing only the textual description of the synthesis, semantically linked out to chemical compounds on ChemSpider, reaction articles on Wikipedia and other related information.

We have decided on a path forward for CJOC from here. That is a re-dedication of the platform to the support of synthesis procedures only. ChemMantis, or a variant of the initial platform, will be the basis of the new ChemSpider Syntheses Database (this is just an interim title for the project for now). We will host a growing collection of synthesis procedures from the community (providing a deposition platform for the community to use). We will source procedures from the RSC electronic supplementary information (ESI) provided for many of the RSC publications. We will work with collaborators, publishers and other reaction database providers to source synthesis procedures from their collections. The full details regarding this project are presently being fleshed out but the extension of ChemSpider to host chemical reactions is underway. We welcome your questions, thoughts and comments.


The ChemSpider blog has become very quiet in many ways. For that I am both saddened and realistic…we are very busy working on improvements to ChemSpider, both in the functionality and in the overall infrastructure. You will see these roll out in the near future. I personally am traveling a lot more than previously and am engaged in the writing of many articles and presentations. I have a backlog of over half a dozen articles and even more presentations to prepare. Add to that H1N1 through the household, one little boy in our family with pneumonia and my intention to participate in a mini-triathlon next year, and to say that I am distracted would be an understatement.

I hope this “bad news” post is the first of many to get me active on the blog. This bad news post is actually a good news post, we hope. We have been seeing some conflicts between backups and server performance and need to apply some Microsoft Hotfixes, so we will be taking the system down on Wednesday for about 30 minutes as announced on the HomePage. Our apologies if it causes a disruption.

Service Interruption 07/10/2009
Due to essential maintenance ChemSpider will be unavailable during the following period:
07/10/2009 from 10:30 GMT until 11:00am GMT
We apologise for any inconvenience this may cause.
