Archive for the ChemSpider Chemistry Category

Frequent users of ChemSpider who use the identifiers for searching will commonly find a mixture of “names” and “database IDs” as well as “registry numbers”. Since the number of database IDs can sometimes swamp the synonyms and chemical names, we chose to separate them. We have run some regular expressions across the database to separate database IDs out. We have left registry numbers (marked by [RN]), EINECS numbers (marked by [EINECS]) and Wiswesser Line Notation (marked by [WLN]) in with the synonyms.
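For readers curious what this kind of regular-expression pass might look like, here is a minimal sketch in Python. It is not the code we actually ran. The CAS Registry Number shape and its check digit are standard, and the EINECS shape is the usual three-three-one digit layout. The `DB_ID` pattern is purely an assumed example of what a "database ID" might look like (a short uppercase prefix followed by digits, or a long run of digits), and the function names are made up for illustration.

```python
import re

# CAS Registry Number: 2-7 digits, 2 digits, 1 check digit (e.g. 7732-18-5).
CAS_RN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")
# EINECS number: three digits, three digits, one check digit (e.g. 200-001-8).
EINECS = re.compile(r"^\d{3}-\d{3}-\d$")
# ASSUMPTION: an illustrative "database ID" shape, not ChemSpider's real rules.
DB_ID = re.compile(r"^(?:[A-Z]{2,10}[-_: ]?\d{3,}|\d{5,})$")


def cas_checksum_ok(identifier):
    """Return True if the text has CAS RN shape and a valid check digit."""
    match = CAS_RN.match(identifier)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


def classify(identifier):
    """Label an identifier as 'RN', 'EINECS', 'DBID' or 'synonym'."""
    text = identifier.strip()
    if cas_checksum_ok(text):
        return "RN"
    if EINECS.match(text):
        return "EINECS"
    if DB_ID.match(text):
        return "DBID"
    return "synonym"


for name in ["Paclitaxel", "33069-62-4", "200-001-8", "NSC125973"]:
    print(name, "->", classify(name))
```

Anything the patterns do not recognize falls through to "synonym", which is exactly why some database IDs will slip past a pass like this.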

database-ids.png

Unfortunately there are MANY flavors of Database IDs and we might have missed some. If you come across any “potential” DB IDs and think we should segregate them out, please use the POST COMMENTS ability to inform us. Simply Post a Comment to a record and suggest we check the identifiers out for potential DBIDs. Thanks!

Over the past few months we have been working hard to integrate the ChemRefer system into ChemSpider. I have reported on our recent rollout of the first level of integration and Will Griffiths has started a discussion at the Open Chemistry Web blogpage. Will has played a key role in facilitating our relationships with publishers in the Open Access domain. Many have had an appreciation for his ChemRefer website.

When we connected ChemRefer to ChemSpider one of our first commitments was to facilitate direct structure/substructure searching of the IUCr publications. Will has indexed the publications from 1948 to present day. He has extracted chemical names and systematic names from the titles and abstracts of the articles. Since we have access to name to structure conversion capabilities from OpenEye we can convert many of these names to chemical structures in an automated fashion. As a result of my work on the curation of Wikipedia chemical structures I also have access to some of the tools I was involved in developing at ACD/Labs (ACD/ChemFolder and ACD/Name). This has allowed me to curate and convert other extracted names in a manual manner. This is NOT the most efficient way to conduct this process as it requires a lot of eyeballing. However, it is this type of approach, reviewing many hundreds of extracted names, which has allowed us to optimize the process, recognize where potential failures could arise and improve the chance of extracting the best set of names to work with for conversion purposes. Now we are at the whim of name to structure conversion software in terms of the accuracy of the conversion but this can be validated (see later). Since organometallic names are very difficult to convert to structures we can only deal with organic structures at present. However, we also have many validated synonyms in our hands on ChemSpider and that provides a useful dictionary for conversion.

Ultimately I hope that other commercial providers of batch Name to Structure conversion software modules will see the potential value of collaboration with ChemSpider and give us access to their software tools to assist with our project.

We are in the process of curating and checking the names and chemical structures for all that we have extracted so far but the depositions have already started. Of the structures converted we have found about half of them are already on ChemSpider (example) but another half are unique (example). At present we have connected almost a thousand articles via DOIs from ChemSpider to the IUCr articles. The best estimates at present are that we will be able to connect about 8500 structures to articles.

Since this is the IUCr collection we will validate our approach in the future and, with their permission, we will use the CIFs to validate our extracted structures against the CIF-based structures. This will give us an indication of the errors one could expect from an automated extraction of chemical names from articles and conversion to structures using Name to Structure. We are also going to continue our curation process and allow people to curate structures from IUCr (and other articles) to ChemSpider. There will be cases where some names from articles have not been converted to structures and these will need manual submission. This will be possible through the manual deposition system.

We are looking forward to providing similar services to other publishers should they be desired. This is our proof of concept…and it’s working.

Two particular Open Access resources providing enormous value to Life Sciences nowadays are PubMed and PubChem. I’m sure everyone reading this blog has heard of them and used them both. We previously announced our deeper collaboration with ChemRefer. We have been busily working in the background to integrate ChemRefer into ChemSpider and the very “alpha” version of the integration is now available online. It can be accessed from the Search menu as shown below. Simply click on ChemRefer.

chemrefer.png

When selected it will open the ChemRefer search window and a list of Publishers who have allowed us to index them as shown below. All can be searched or the search can be limited by using the Check Boxes.

chemrefer2.png

The results of a search on taxol are shown here. The figure below, while too small to see detail, shows that the word searched is highlighted in the text.

chemrefer3.png

Notice that the RSC is no longer indexed at their request. We were sad to lose them from our searches.

We have also integrated to Entrez, for searching health sciences databases at the National Center for Biotechnology Information (NCBI) website. This can be searched by choosing NCBI Entrez from the Search drop down menu. It is available here. An example of the results is shown here. For this first integration we limit the results to 100 hits.

entrez1.png

Clicking on any of the titles is a direct hyperlink to the article in PubMed Central.
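As an aside for anyone who wants to run a similar capped text search themselves, the sketch below builds a query URL for NCBI's public E-utilities ESearch service with the result count limited to 100. This is an assumption about one way to do it, not a description of how the ChemSpider integration is wired up. The choice of the `pmc` database and the function name `build_esearch_url` are illustrative only.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pmc", max_hits=100):
    """Build an ESearch URL; retmax caps the number of IDs returned."""
    params = {"db": db, "term": term, "retmax": max_hits}
    return ESEARCH + "?" + urlencode(params)


print(build_esearch_url("taxol"))
```

The code only constructs the URL; fetching it and parsing the returned XML is left out to keep the sketch short.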

These integrations with text searching engines are only the first part of our work. We have already been extracting chemical names and linking them up to structures in the ChemSpider database. Our first efforts in this area will be unveiled shortly. In this case it will be possible to simultaneously perform structure/substructure and text-based searches. This is a very significant undertaking but we are well on our way to bringing our vision of structure and text based indexing of the Open Access literature to fruition.

Recently I posted about trying to identify the correct structure of Ginkgolide B and the need for curation of ChemSpider entries. David Barden from the RSC commented on my post:

“Antony – I am an organic chemist working on the RSC journal in which the published structure of ginkgolide B appeared, and am pretty sure that it is correct, having been written by a regular author of ours familiar with the literature on the ginkgolides. I think the problem might lie with the representation (and/or conversion to InChI) of the structures – even in the one structure you indicated as having “full stereochemistry”, it seemed to me that 3 stereocenters were undefined, from a visual inspection of the structure. Apart from these stereocenters, the structure and InChI (generated myself) otherwise seem identical, so I’m not sure why the last part of the string in the ChemSpider entry is “20+” rather than “20-”. The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.”

I have redrawn the structure of Ginkgolide to echo that shown in the RSC journal and it is shown below alongside a cropped image from the article:

compare-the-two.png

I’m pretty sure I have the structure correct. The InChIString is:

InChI=1/C20H24O10/c1-6-12(23)28-11-9(21)18-8-5-7(16(2,3)4)17(18)10(22)13(24)29-15(17)30-20(18,14(25)27-8)19(6,11)26/h6-11,15,21-22,26H,5H2,1-4H3/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1

and the InChIKey is:

SQOJOAFXDQDRGF-MMQTXUMRBS

In the previous post I searched on Ginkgolide B as an identifier to see how many Ginkgolide B’s there are. There are 6 as shown here.

I searched on the entire InChIKey and found no hits. This means that the structure is NOT on ChemSpider.

I then searched on the CONNECTIONs captured within the InChIKey and represented by: SQOJOAFXDQDRGF . I received 18 hits in total varying in completeness in terms of incomplete stereochemistry and DIFFERENT but fully assigned stereochemistry. I searched the entire InChIKey on Google (SQOJOAFXDQDRGF-MMQTXUMRBS) but received no hits. Just to check I then searched the InChIString shown above on Google. Surprisingly, I DID get a hit! It was for this structure. I was puzzled and a comparison of the strings showed a difference in ONE section of the string, the stereo layer.

Searched on Google: /t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1

Found by Google: /t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+/m1/s1

See the difference? ONE stereocenter… 20- versus 20+ . Thank goodness we are moving to InChIKeys rather than InChIStrings since the majority of people would likely miss the detail. I did the first time! So, based on all of my searches the structure of Ginkgolide B as represented in the article published by the RSC is NOT in the ChemSpider database. I agree with David Barden when he comments “The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.” It is very complex and time-consuming and the hope is that comparison of InChIKeys, specifically the second part of the key, will help catch the differences in a more facile manner.
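The two-part comparison described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the key is split at the first hyphen, the first block is treated as the connectivity part and everything after it as the stereo (and other) part. The second key in the example is made up purely to show a mismatch, and the function names are hypothetical.

```python
def split_inchikey(key):
    """Split an InChIKey into (connectivity block, remainder)."""
    connectivity, _, rest = key.strip().partition("-")
    return connectivity, rest


def compare_keys(key_a, key_b):
    """Say whether two keys match fully, on the skeleton only, or not at all."""
    conn_a, rest_a = split_inchikey(key_a)
    conn_b, rest_b = split_inchikey(key_b)
    if conn_a != conn_b:
        return "different skeleton"
    if rest_a != rest_b:
        return "same skeleton, different second block"
    return "identical"


redrawn = "SQOJOAFXDQDRGF-MMQTXUMRBS"   # key quoted in this post
made_up = "SQOJOAFXDQDRGF-AAAAAAAAAA"   # hypothetical key, same first block
print(split_inchikey(redrawn))
print(compare_keys(redrawn, made_up))
```

Searching on the first block alone is what returns every record sharing the skeleton, whatever its stereochemistry.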

The question, unfortunately, remains. What IS the correct structure of Ginkgolide B? For now I have assumed that the one in the RSC article is correct, have added the structure to the database using the normal deposition process and have associated it with the RSC article and the blog discussions on ChemSpider. If it turns out it is not correct then I will leave the structure and the connection to the article but remove the identifier Ginkgolide B.

suppinfo.png

We have started to introduce new capabilities onto ChemSpider in preparation for our shift from ChemSpider Beta to ChemSpider RELEASE VERSION to celebrate our one year anniversary. You should see a number of incremental improvements happening over the next few weeks. I’ll highlight as many of them as I can as we release them.

Let’s start with new Search Capabilities. On the search screen you will see a drop down menu. You can now select from 5 different types of searches. This post will highlight only the “Structure” type search. Details about the others will follow.

searches.png

It appears that many people believe that ChemSpider is only a text-based search. The reality is that we went live with structure and substructure search in version 1. For sure we have improved things over the months and the searches are better and faster but structure searching has always been present.

To perform structure searching choose the Structure Search and you will see:

structuresearch1.png

Now click on Input Structure and a screen for submitting structures will be opened. There are three ways to do this. The Convert option will allow you to input a SMILES string, InChI string or chemical name and convert it to a structure. If the structure is correct select ACCEPT and perform the search. Alternatively load a molfile by browsing and uploading, then click ACCEPT. If you want to draw a structure or edit one you have converted simply select edit and draw into the applet.

structuresearch2.png

The manual for the applet is here. The applet will look like this:

structuresearch3.png

These capabilities have been in place for a while. Now what we have done is enable you to search in different ways from this place. Specifically…

structuresearch4.png

These searches are very valuable, specifically in relation to the issue of tautomers and structure skeletons. As I have shown with a previous post on ginkgolide B, in some cases there are many similar structures on the database. Searching on the skeleton will find them all. A search on Taxol shows 1 exact structure, 1 tautomer of the exact structure and 42 structures with the same skeleton as shown below.

structuresearch5.png

Enjoy the new capabilities. We welcome your feedback.

Tonight I finished an article on Public Chemistry Databases. During that article I commented on the size of the Public Chemistry Databases versus the commercial databases. There have been numerous discussions in the blogosphere about the size of databases such as PubChem relative to the CAS Registry. Recently PubChem and ChemSpider headed towards 20 million structures. The CAS Registry is about 33 million.

Now, I don’t know how much duplication there is in the Registry but I can comment on what is in ChemSpider and likely in PubChem. Here’s a basic comment about molecules with complex stereochemistry. They tend to exist MULTIPLE times in the database due to different variants of stereochemistry. Let’s examine Ginkgolide B. The structure below is taken from a recent RSC article. I was interested to see whether we had the “correct” structure of Ginkgolide B on ChemSpider, assuming that the correct structure is the one shown on the RSC webpage.

ginkgolide-b.png

A search on the name Ginkgolide B turned up a total of 6 structures. The connectivities are the same for all structures. The ONLY difference is in the stereochemistry. Take a look at the structures in Table View. There is one structure with full stereochemistry expressed. This one comes from PubChem, Thomson Pharma and xPharm. With full stereochemistry it might be safe to assume it is correct.

However, even for Taxol there are structures with complete stereochemistry and they are different: Structure 1, Structure 2, Structure 3, Structure 4 and Structure 5

I actually gave up looking eventually…here are the different complete stereochemistries. Look carefully…

t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36+,37+,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36-,37+,38+,40-,45-,46+,47-

t31-,32-,33+,35-,36-,37-,38-,40-,45-,46-,47-

t31-,32-,33+,35-,36+,37-,38-,40-,45-,46-,47-

Question for ChemSpider Users – there are actually WAY MORE than 10 Taxol skeletons on ChemSpider. Can anyone figure out how many? It actually takes one search to find them all!

We believe this is the correct structure of Taxol.

Back to Ginkgolide B. I redrew the structure shown in the RSC article (and as shown below).

ginkgolide-b_2.png

Generating the InChIKey for this structure and performing a search on ChemSpider gave me no hits. It looks like either the RSC structure is wrong OR all of the six structures from all of the different sources are wrong. As mentioned, there is actually only one Ginkgolide B structure (a structure with the associated identifier) on ChemSpider with full stereochemistry. The stereo for that structure is:

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+ ChemSpider

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20- RSC Stereo

There is ONE stereocenter difference.
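Spotting a one-sign difference by eye is error-prone, so here is a minimal sketch of how the two stereo layers above could be compared programmatically. It assumes the layer is a simple comma-separated list of atom numbers each followed by a one-character parity mark, as in the two strings shown, and it ignores the /m and /s layers entirely. The function names are made up for this example.

```python
def stereo_centers(layer):
    """Parse a stereo layer such as 't6-,7+,8-' into {atom_number: mark}."""
    layer = layer.strip()
    if layer.startswith("/"):
        layer = layer[1:]
    if layer.startswith("t"):
        layer = layer[1:]
    centers = {}
    for token in layer.split(","):
        token = token.strip()
        if token:
            # Last character is the parity mark, the rest is the atom number.
            centers[int(token[:-1])] = token[-1]
    return centers


def stereo_diff(layer_a, layer_b):
    """Return {atom: (mark_in_a, mark_in_b)} for every centre that differs."""
    a, b = stereo_centers(layer_a), stereo_centers(layer_b)
    return {atom: (a.get(atom), b.get(atom))
            for atom in sorted(set(a) | set(b))
            if a.get(atom) != b.get(atom)}


chemspider = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+"
rsc = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-"
print(stereo_diff(chemspider, rsc))  # {20: ('+', '-')}
```

A centre present in one layer and missing from the other shows up with `None` on the missing side, which helps flag partially defined stereochemistry as well.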

This is what curation is all about. The question now is which one is correct? Is it RSC? Is it the structure on ChemSpider? Can anyone validate? Did I mess something up in the comparison? (It happens!)

Now, for the LONG list of Ginkgolide B structures on ChemSpider shown here what do users think we should do? If we simply remove ALL labels for the incorrect structures then we will remove all links into other databases that contain information for “their Ginkgolide B”. If we collapse all links into the correct Ginkgolide B on ChemSpider, and bring 6 records into one, then the structures to which the correct structure links are actually incorrect in the linked databases (but useful information exists there).

Quite the conundrum. I’d appreciate feedback!

I’ve been involved in a number of conversations recently around how monoisotopic masses can be used and the chance of “elucidating a structure” from a molecular formula. There are some shockingly naive views of this possibility. With the availability of accurate mass determinations by mass spectrometry, and the possibility to extract a molecular formula from the data, there are some who believe it is possible to “elucidate structures” using a monoisotopic mass. Let’s clear this naivety up…

Recently I gave a presentation at a local university regarding informatics. During the presentation I asked the students how many structures could be generated “within the rules of basic organic chemistry” for some very short elemental formulae. General rules means no inappropriate valences but no limitations on the nature of the rings (except none based on 2-carbons :-) ) etc. EVERYONE underestimated by many factors.

While working on a structure elucidation software program the issue of how many structures could be generated from some fairly nominal formulae became very clear. Below are some example formulae, the “correct” structure associated with the data under analysis and the number of chemical structures that can be generated from this formula. Notice those numbers….numbers like: 138,136,211,624 structures from a formula of C15H22O2 !

Therefore, the story that a monoisotopic mass, which can give a single molecular formula, can give you an unambiguous chemical structure needs to stop. Now, that said, since we have close to 20 million structures online at present, the question “What is the distribution of molecular formulae across ChemSpider?” was an interesting one. So, we ran a query to determine the highest frequency of formulae. The formula C18H20N2O3 occurred 5110 times in the database, 4804 times when looking at single components only. Some representative structures are shown:

mf-search-1.png

I imported the data into Excel (Office 2003) with a 65000 row limit. While there are single molecular formula compounds in the list at the end of the file (viewed in wordpad) at the 65000′th row the frequency was still 45 entries in the database. It’s a long tail..

mf-distribution.png
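For anyone wanting to repeat this kind of tally on their own structure file, the counting step is simple. The sketch below is an illustration only and says nothing about the query we ran against the ChemSpider database. The list of formulae is made-up toy data and the function name is hypothetical.

```python
from collections import Counter


def formula_frequencies(formulae):
    """Return (formula, count) pairs, most frequent formula first."""
    return Counter(formulae).most_common()


# Made-up toy data standing in for one formula per database record.
records = ["C6H12O6", "C2H4O2", "C6H12O6", "C5H10O5", "C6H12O6", "C2H4O2"]

ranked = formula_frequencies(records)
singletons = sum(1 for _, count in ranked if count == 1)
for formula, count in ranked:
    print(formula, count)
print("formulae seen only once:", singletons)
```

Because nothing is truncated at a fixed row count, the whole long tail, including formulae that occur only once, stays visible.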

Now, many people are using monoisotopic masses to examine metabonomics data so it may be more appropriate to do the analysis on a more restricted dataset. For example, databases of interest to metabonomics people include KEGG and HMDB. Isolating the search to such databases shows that while there is a much shorter list of unique formulae (8590) a similar distribution persists. The most common formula is C6H12O6 with 71 hits. Searching this in the database shows a number of linear and cyclic carbohydrates, some with stereo, some without as shown below. If you are confused about “linear versus cyclic” see this Wikipedia article.

mf-search-2.png

Monoisotopic mass isn’t going to provide the stereo information anyway and all you will get is a lot of similar structures…but of course there are MANY carbohydrates with that formula. I’ve listed a group of some of the top formulae here and leave it to you to investigate!

Formula Number

C12H22O11 = 55 hits

C6H8O7 = 52 hits

C5H10O5 = 46 hits

C20H32O5 = 46 hits

C8H8O3 = 40 hits

C20H32O3 = 39 hits

C20H32O4 = 38 hits

C2H4O2 = 38 hits

C24H40O4 = 37 hits

CH4O3S = 36 hits

Bottom line…even removing stereo issues and isolating to a small number of databases it is still an issue to declare that a structure is elucidated just from a mass and some form of prior knowledge or additional information such as elution order or time is necessary.
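To make the point concrete, here is a minimal sketch of a monoisotopic mass calculation from a molecular formula. The isotope masses are the standard values for the most abundant isotopes, rounded to eight decimals. The tiny parser is an assumption that only handles plain formulae like the ones listed above (no brackets, charges or isotope labels), and the function names are made up for this example.

```python
import re

# Masses of the most abundant isotope of each element (rounded).
MONOISOTOPIC = {
    "C": 12.0,
    "H": 1.00782503,
    "N": 14.00307401,
    "O": 15.99491462,
    "S": 31.97207117,
}

# An element symbol followed by an optional count, e.g. "C6" or "S".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Turn 'C6H12O6' into {'C': 6, 'H': 12, 'O': 6}."""
    counts = {}
    for element, number in TOKEN.findall(formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula):
    """Sum the most-abundant-isotope masses for a simple molecular formula."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


print(round(monoisotopic_mass("C6H12O6"), 4))
```

The function takes only the formula as input, so every structure sharing C6H12O6 gets the same number back. That is the whole problem: the mass narrows things down to a formula, not to a structure.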

Now, this observation may not be surprising to many people. The response may be that tandem Mass Spectrometry would give an unambiguous structure. This is also not true, unfortunately, and in general even tandem MS (MS^n) cannot give a conclusive structure. Certainly, if stereochemistry is involved (as with many carbohydrate molecules) you are still stuck. While library look-ups using monoisotopic mass ARE valuable, and tandem MS adds more criteria for structure identification, neither is unambiguous.

There is a new contributor to the blogosphere…SimBioSys. I recommend adding the blog to your Google Reader. There are some very exciting things going on there right now. I have commented previously about how high performance computing engines such as the Cell Broadband Engine are being brought to bear on scientific problems. SimBioSys appear to be the only group who have chosen the Cell processor to port their virtual high-throughput screening and docking solution to. Their white paper makes for an interesting read.

In their most recent post “Roping in your next scaffold hop with LASSO” they talked about their LASSO publication: LASSO—ligand activity by surface similarity order: a new tool for ligand based virtual screening”. We are presently in the middle of a very exciting project regarding LASSO. We have teamed up to provide the virtual screening results for 40 target families on the full ChemSpider Library, currently containing over 18 million molecules. Using the LASSO similarity search tool, SimBioSys has screened the ChemSpider database against all 40 target families from the Database of Useful Decoys (DUD) dataset.

LASSO descriptors (Ligand Activity by Surface Similarity Order) contain a count of the different Interacting Surface Point Types (ISPT) found on a molecule. LASSO descriptors use 23 different surface point types, ranging from hydrogen bond donors/acceptors, to hydrophobic sites, to pi stacking interactions. Figure 1 shows a “histidine-like” fragment of a molecule. The triangles are the surface point types of this fragment, colored by type. Based on the idea that ligands must have surface properties compatible with the target site in order to bind, LASSO uses a descriptor of Interacting Surface Point Types (ISPT) to find molecules with diverse chemical scaffolds but similar surface properties.

lasso1.png

We are presently populating the ChemSpider database with 10s of millions of LASSO descriptors and this will allow screening of the ChemSpider database to:

● Find molecules which have a higher likelihood of binding to targets.
● Find molecules with better selectivity for a target.
● Reduce toxicity issues.

The 40 Target receptor families included in the screening results were chosen to cover a wide range of receptor classes because of their interest in drug discovery. Each target family had 10s to 100s of known active molecules, which were used as the basis for the query files used by LASSO, one query for each family. The similarity screening was performed on the full ChemSpider database across all 40 targets and the similarity scores for each structure/target pair are available via the ChemSpider website. Thus for each structure in the ChemSpider database, you can find its similarity score (based on surface properties) relative to actives of each of the 40 target receptors. In addition to allowing instant ranking results for a particular target of interest (retrieving molecules that are likely to be active for a receptor) this matrix of screening results can be used to find molecules that have predicted affinity for a target but low predicted affinity for all other targets. Performing such searches promises to improve selectivity and can be a guide to reducing toxicity concerns. More detail about this collaborative project will be forthcoming but the overview is provided here.
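The selectivity search described above (a high score for one target, low scores for all the others) can be sketched as a simple filter over the score matrix. Everything specific here is an assumption made for illustration: the 0 to 1 score scale, the two threshold values, the target labels, the structure IDs and the function name are all made up and are not part of LASSO or ChemSpider.

```python
def selective_hits(scores, target, min_on_target=0.8, max_off_target=0.3):
    """Return (structure_id, score) pairs that score high on `target` only.

    `scores` maps structure_id -> {target_name: similarity_score}.
    The thresholds are illustrative assumptions, not recommended values.
    """
    hits = []
    for structure_id, by_target in scores.items():
        on_target = by_target.get(target)
        if on_target is None or on_target < min_on_target:
            continue
        off_target = [s for name, s in by_target.items() if name != target]
        if all(s <= max_off_target for s in off_target):
            hits.append((structure_id, on_target))
    # Best on-target score first.
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Made-up scores for three structures against three hypothetical targets.
scores = {
    101: {"target_A": 0.91, "target_B": 0.10, "target_C": 0.22},
    102: {"target_A": 0.95, "target_B": 0.85, "target_C": 0.05},
    103: {"target_A": 0.83, "target_B": 0.28, "target_C": 0.01},
}
print(selective_hits(scores, "target_A"))  # [(101, 0.91), (103, 0.83)]
```

Structure 102 has the best on-target score but is dropped because it also scores highly against a second target, which is the kind of promiscuity the selectivity search is meant to screen out.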

Watch this space for updates and an unveiling date.

I previously blogged about the fact that we had embedded a 3D optimizer under Jmol so that 3D molecules could be displayed. There were two problems with the approach we took: 1) It was very time-consuming to wait for real-time 3D optimization of molecules. 2) The 3D optimizer would sometimes fail to optimize a structure based on the starting geometry.

3dimage.png

We have just finished publishing millions of pre-optimized structures onto the ChemSpider database. In MOST, but not all, cases the molecules are now pre-optimized. This makes display of the 3D molecule in Jmol much faster. Since we were optimizing millions of molecules we did set a threshold for the time within which the molecule should reach some minimum. As a result some molecules were not optimized and their 3D coordinates are still not available, so a real time optimization is attempted using smi23d as discussed previously. If you find any structures which don’t optimize please send us the ChemSpiderID and comment to feedback|at|chemspider|DOT|com.

I have reported previously on the possibility to upload and display spectra on ChemSpider. JSpecView is our tool of choice for spectral display and the workflow is described here. I’m happy to announce that people have been uploading spectra and we have over a hundred online at present.

I am happy to announce that we have the first deposition of analytical data onto ChemSpider from a publisher. Bronwen Decker of the Nature Publishing Group (NPG) has facilitated the deposition of data onto the system. A paper entitled “Use of Raman spectroscopy as a tool for in situ monitoring of microwave-promoted reactions” has had two accompanying NMR spectra submitted online (and hooked back to the original article using the DOI number).

The record here shows the structure, connected data sources, the spectra and supplementary information accompanying the structure (part of the new deposition process). We’re looking forward to working with NPG to facilitate the process of deposition of more analytical data as we move forward.

Over the past few weeks I have had a few discussions with a member of the ChemSpider Advisory group regarding a concept to create WiChempedia. I’ve enjoyed these conversations with Alex Tropsha (professor and Chair in the Division of Medicinal Chemistry and Natural Products in the School of Pharmacy, UNC-Chapel Hill.) We are like-minded in a number of ways but specifically in what can be done to facilitate delivery of quality information to the chemistry community.

As you will notice if you frequent this blog I am rather a stickler for accuracy and quality (1,2,3). I think it’s important (4). Over the past few weeks I’ve spent more time looking at the quality of data on Wikipedia and trying to figure out the best way to bring together our efforts on ChemSpider to enhance the capabilities of integrated information and to support the quality efforts being made by the WP:CHEM team and help them. I also intend to facilitate the development of our own Wiki environment for chemistry and to generally enhance the tools available to chemists not only for Wikipedia type annotation but also to support Open notebook Science.

Now, I don’t want to reinvent the wheel. Wikipedia has a lot of what is necessary in terms of being a known system, a following of people and committed supporters in the WP:CHEM team. What I have been hoping for was a shift around structure and substructure searching on the MediaWiki platform but I know that is a tough request as the platform is not built for that type of thing. The InChIKey holds some promise for exact structure searching but does not offer an opportunity for substructure searching without a lookup across a larger database. I want to facilitate information and data sharing further. I do want to provide the type of service that Wikipedia does in terms of general information but also layer cheminformatics tools onto that knowledge and information, and ultimately allow addition of analytical data, analysis tools, real time predictions and analysis. This platform should certainly be wiki-enabled.

Decision made. Our intention is to deliver wiki-capabilities in ChemSpider and to use the Open Content associated with chemicals and drugs on Wikipedia inside the system. We will then provide an environment for people to continue to add to, enhance and curate the Wikipedia content as well as add their own. Last night (and well into the early morning) I spent some time talking to Martin Walker from WP:CHEM regarding my concerns that we might offend the Wikipedians with our efforts and that I did not want them to feel that we were ripping off their hard work but rather have our efforts seen as supportive and enabling. My intention, as we work through downloading the data, is to check, validate and correct what is sitting on Wikipedia directly for the benefit of the community. Also, we will of course need to leave all Wikipedia content under the appropriate licensing for others to use. Martin commented that there are tens of mirrors of Wikipedia out there ripped purely with the purpose of exposing the content and getting ad revenue. We are not working from that model….our intention, as usual, is to build a structure centric community for chemists and with so much excellent work done on Wikipedia I want to take advantage of it and also give back through the work we will do.

Two domain names have been grabbed for this project : WiChempedia, for compatibility with Wikipedia, and also WeChempedia, to emphasize the community aspects of the project.

If you frequent this blog you will recall that we have made a commitment to Microsoft Sharepoint as our future platform for wiki’ing ChemSpider. That is where we believe this work will be done ultimately but we don’t have the platform in our hands yet.

The Xmas vacation is going to be full of holiday movies and manual examination and curation of the Wikipedia data. Wish us luck!

 

Those of you frequenting this blog might have read my highly opinionated views of what was originally entitled “Open Notebook Science NMR” (1,2). My views around that work were very strong…in fact I didn’t really “get it”. I didn’t get why GIAO approaches for NMR prediction (with all of the stated limitations) would be used to prove that you could validate NMR assignments by comparing predictions with assignments made experimentally. It’s known that NMR prediction can validate structures – it’s done on a daily basis in commercial software tools. I was involved in building tools like that for over a decade so what was there to prove?

The work concluded, as I understand it, with the examination and “validation” of about 500 structures over a month, but with a limited set of elements, no flexible side chains and a limited Mw range. Meanwhile, chemists are doing it on a daily basis across industry in a few seconds using one button click (1,2). For one of the most impressive overviews of verification technology in the lab see this talk by Phil Keyes from Lexicon Pharmaceuticals.

I get that there is value in validating assignments extracted from the literature as was finally declared as the focus for the work…actually BEFORE they are put in the literature. But, to clarify, that’s not about prediction so much as workflow…prediction and validation have been proven for many years. It’s not perfect but really good. That will be reviewed in our future publication “The Performance Validation of Neural Network Based 13C NMR Prediction Using a Publicly Available Data Source” presently in review at JCIM. This paper will review neural network prediction applied to NMRShiftDB. Just as a reminder from previous posts…the ENTIRE dataset of over 200,000 shifts was calculated in less than 5 minutes…no limitations re. flexible chains, Mw limits etc.

With these comments as an intro I am pointing interested readers to an article that found its way onto JCIM ASAP Articles yesterday. This reviews the Neural Network prediction as well as other approaches. I will confess that the Neural Network approach far exceeded my own expectations of performance. But hey, that’s why we do research…we have opinions, expectations and hypotheses…and go off to prove them. And if we can’t, something fresh and exciting can show up anyway!

Toward More Reliable 13C and 1H Chemical Shift Prediction: A Systematic Comparison of Neural-Network and Least-Squares Regression Based Approaches 10.1021/ci700256n

The efficacy of neural network (NN) and partial least-squares (PLS) methods is compared for the prediction of NMR chemical shifts for both 1H and 13C nuclei using very large databases containing millions of chemical shifts. The chemical structure description scheme used in this work is based on individual atoms rather than functional groups. The performances of each of the methods were optimized in a systematic manner described in this work. Both of the methods, least-squares and neural network analyses, produce results of a very similar quality, but the least-squares algorithm is approximately 2-3 times faster.

If you have examined the predicted physchem properties associated with a structure on ChemSpider you will, on occasion, see something of this nature. Notice that there is a logP value but there is also a logD value…in fact two values, one at pH 5.5 and one at pH 7.4. These pH values were chosen as representative of typical physiological pH values of interest. But what IS logD?

Let’s visit the Wikipedia definition of both logP and logD to start:

logP – The partition coefficient is the ratio of concentrations of un-ionized compound between the two solutions. To measure the partition coefficient of ionizable solutes, the pH of the aqueous phase is adjusted such that the predominant form of the compound is un-ionized. The logarithm of the ratio of the concentrations of the un-ionized solute in the solvents is called log P

logD – The distribution coefficient is the ratio of the sum of the concentrations of all forms of the compound (ionized plus unionized) in each of the two phases. For measurements of distribution coefficient, the pH of the aqueous phase is buffered to a specific value such that the pH is not significantly perturbed by the introduction of the compound. The logarithm of the ratio of the sum of concentrations of the solute’s various forms in one solvent, to the sum of the concentrations of its forms in the other solvent is called Log D. In addition, log D is pH dependent, hence one must specify the pH at which the log D was measured. Of particular interest is the log D at pH = 7.4 (the physiological pH of blood serum). For un-ionizable compounds, log P = log D at any pH.
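The relationship between the two quantities can be sketched numerically. The snippet below is a minimal illustration, assuming a simple monoprotic acid or base for which only the un-ionized form partitions into the organic phase. The function name and the example logP and pKa values are hypothetical and are not taken from ChemSpider or from any particular compound; compounds with several ionizable groups, such as zwitterions, need a fuller treatment.

```python
import math

def log_d(log_p, pka, ph, acid=True):
    """Estimate logD at a given pH for a monoprotic acid or base.

    Assumes only the un-ionized form partitions into the organic phase.
    """
    # For an acid the ionized fraction grows as pH rises above the pKa;
    # for a base it grows as pH falls below the pKa.
    exponent = (ph - pka) if acid else (pka - ph)
    return log_p - math.log10(1.0 + 10.0 ** exponent)

# Hypothetical acid with logP = 3.0 and pKa = 4.5 (illustrative values only)
for ph in (2.0, 5.5, 7.4):
    print(f"pH {ph}: logD = {log_d(3.0, 4.5, ph):.2f}")
```

Well below the pKa of an acid the estimate approaches logP, and it falls by roughly one log unit for each pH unit above the pKa, which is why a single logP value can be misleading for an ionizable drug.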

With these definitions in mind what would we expect for an ionizable compound such as Zyrtec shown below?

The curve below shows the logD as a function of pH for Zyrtec.

At the two pHs of interest what are the structures of the ionized species? See below:

Why would this be important? In fact it’s crucial in the design of drugs in terms of how they act in the body across the physiological profile exhibited by the human body. But, I’m not going to tell you the story…instead I am going to point you to an article “The Rule of Five Revisited: Applying Log D in Place of Log P in Drug Likeness Filters”. This was one of the most accessed articles published in Molecular Pharmaceutics in 2007. It’s featured on the Most Accessed Articles website here.

I love Wikipedia. I use it at least half a dozen times a week…probably more of late. That said I have previously questioned the level of curation of the data on Wikipedia. (2,3) I DO believe that contributors to Wikipedia are making valiant efforts to ensure the quality of the data but I also believe that tools must be developed soon, or processes developed to ensure the quality of the data. Here’s why…

This is the chemical structure of Mupirocin on Wikipedia. Now, if you bothered to redraw that chemical structure in a drawing package showing the molecular mass (like I did) then you would see that the mass is NOT what is listed in the DrugBox.

The structure, molecular formula and molecular mass are shown below taken directly from Free ChemSketch but of course all the drawing packages can do this!
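The same sanity check can be scripted rather than done in a drawing package. The sketch below sums standard atomic weights for a simple molecular formula. The function name is hypothetical, the weights are rounded, and C26H44O9 is assumed here as the commonly cited formula for mupirocin; the formula itself is not quoted in this post.

```python
import re

# Standard atomic weights, rounded; only the elements needed here.
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}

def molecular_mass(formula):
    """Sum atomic weights for a simple formula such as 'C26H44O9'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total

# C26H44O9 is an assumption (commonly cited formula for mupirocin).
print(round(molecular_mass("C26H44O9"), 2))
```

Comparing a value computed this way against the mass listed in an infobox is a quick way to flag records where the drawn structure and the listed properties disagree.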

Looking on ChemSpider I found three structures (two are identical but not yet deduplicated – this is presently going on in the background). Two are shown below…

Structure 16739332, the top structure, is the correct one while the bottom one is in error. The structure comes from one data source only – DrugBank. Previously for Taxol, DrugBank contained the correct version of the structure. The problem is that ALL of our systems, including ChemSpider, have issues like this…we all have errors and they need curation. Wikipedia is great…the changes were made by me tonight…see here. I added an IUPAC name, removed the link to DrugBank and updated the molecular mass.

I am committed to assisting in the curating of Wikipedia…many of us are. However, I think there must be a better way and will continue my discussions with the Wikipedia Chemistry Team to get access to all of the chemical compounds on Wikipedia if possible and validate the data in a batch using ChemSpider and associated tools.

I was approached today with a question regarding the contents of the ChemSpider database. I have commented previously about the fact that there are quality issues based on some of the depositions but that these are being cleaned up fairly quickly because of the efforts of our curation processes, both robotic and manual. The question was regarding the fact that there were two structures on ChemSpider with the registry number 34090-76-1. This is not uncommon. There are occasions when a registry number is appropriate for a particular salt form while the associated structure is the neutral compound. So, the registry number will be on the database for both the neutral compound and the salt. However, this situation was different…it was down to the position of the double bond. The person was out to confirm the position of that double bond. It was not easy for me to confirm.

What was MORE confusing was the information the person had already extracted from an STN Registry search. That search provided the following information:

CAS Name: 1,3-Isobenzofurandione, tetrahydro-5-methyl- (CA INDEX NAME)
Other listed names:
Cyclohexene-1,2-dicarboxylic anhydride, 4-methyl- (8CI)
4-Methyltetrahydrophthalic acid anhydride
5-Methyltetrahydroisobenzofuran-1,3-dione

And the following structure:

Compare this structure with the other two off of ChemSpider shown below in the array of three.

Every single name from STN describes a “tetrahydro” compound, so there needs to be a double bond in the molecule by default. If there isn’t then the compound is a “hexahydro” compound.

Obviously one of the alternative names for the compound was derived from phthalic acid anhydride and this suggests that the “missing double bond” should be at the ring junction as shown.
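The tetrahydro versus hexahydro argument can be checked by counting degrees of unsaturation. The sketch below assumes the molecular formulas C9H10O3 (tetrahydro) and C9H12O3 (hexahydro), which are derived here from the names and are not given in the STN record quoted above; the function name is hypothetical.

```python
def degrees_of_unsaturation(c, h, n=0, halogens=0):
    """Rings plus pi bonds for a C/H/N/O/halogen formula (oxygen does not count)."""
    return (2 * c + 2 + n - h - halogens) // 2

RINGS = 2        # fused carbocycle plus anhydride ring
CARBONYLS = 2    # the two C=O groups of the anhydride

for name, (c, h) in {"tetrahydro (C9H10O3)": (9, 10),
                     "hexahydro (C9H12O3)": (9, 12)}.items():
    dou = degrees_of_unsaturation(c, h)
    print(name, "C=C bonds:", dou - RINGS - CARBONYLS)
```

Under these assumptions the tetrahydro formula leaves exactly one unsaturation unaccounted for by the rings and carbonyls, i.e. one C=C double bond, while the hexahydro formula leaves none. The count says nothing about where that double bond sits, which is the open question here.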

Included in the STN record is the “IDS” tag in the “CI” or “Chemical Indexing” field. The term IDS stands for “Incompletely Defined Substance”. So, this is an example of a registry number being allocated to a compound that, in this case, is known to have an additional double bond that is not shown on the chemical structure displayed in the STN search results; instead, the IDS tag declares it as being “incompletely defined”. Some might say that it is appropriate for ChemSpider to have two structures associated with the registry number, each with the double bond in a different position. But likely those specific compounds have their OWN registry numbers. So, what should we do?

1) Remove the registry number 34090-76-1 associated with both structures?
2) Leave as is?
3) Add a new term IDS for such records and submit the new incompletely defined substance as a new form of structure?
4) Add NEW registry numbers associated with the individual structures (which someone will need to source since I don’t have them)?
5) Something else?

I welcome any or all input. Based on input I will simply login to ChemSpider, make the edit and the information is changed (for addition or removal of identifiers). By working together like this there is an iterative improvement in the quality of structure-name pairs for the benefit of chemists, just as shown with the recent Wikipedia examination of Taxol.

This is a declaration of intent: ChemSpider will start hosting so-called “Focused Libraries” in the very near future. The focused libraries will contain sets of compounds with in silico predicted affinities for specific protein targets. The availability of focused libraries can dramatically reduce the number of compounds that require experimental examination for activity. The first sets of Focused Libraries have been supplied to us by Otava Chemicals.

This is part of our path to offering new services via ChemSpider. Discussions are underway to integrate online docking, to offer synthetic feasibility analysis and to expand the growing list of services integrated into ChemSpider thanks to the kindness and support of our collaborators.

Attention synthetic organic chemists. Most scientists have skeletons in the closet: problems they cannot solve and observations they cannot explain. A couple of years ago I was involved in a project to solve the structure of a compound. It had remained unresolved by manual interrogation of the NMR data for over a decade. The application of a computer-assisted structure elucidation system helped resolve the structure. It is described in detail in this publication.

Quindolinocryptotackieine: the elucidation of a novel indoloquinoline alkaloid structure through the use of computer-assisted structure elucidation and 2D NMR

Now, we THINK we have it elucidated correctly. However, we would like to confirm it. Synthesis of the molecule in question, further NMR data generation and a crystal structure would help finish this work fully. This is a call to organic chemists to participate in a hobby project. Anybody want to help? We guarantee a publication etc. The structure is shown below. Contact me at antonyDOTwilliamsATChemspiderDOTcom. Thanks!

Peter Murray-Rust and Henry Rzepa have started on an Open Notebook Science project around calculating NMR spectra.

In Peter’s own words:

“We are starting an experiment on Open Notebook Science. <..> ONS seems to be the generally agreed term for scientific endeavour where the experiments are rapidly posted in public view, possibly before being exhaustively checked. It takes bravery as it isn’t fun if you goof publicly.“
“The recent controversy over hexacyclinol – where a published structure seems to be “wrong” – has sparked one good development – the realisation that high-quality QM calculations can predict experimental data well enough to show whether the published structure is “correct”.
We’re now starting to do this for NMR spectra. Henry Rzepa has taken Scott Rychnovsky’s methods for calculating 13C spectra and refined the protocol. Christoph Steinbeck has helped us get 20,000 spectra from NMRShiftDB and Nick Day (of crystalEye fame) has amended the protocols so we can run hundreds of jobs per day.”

I posted comments to Peter’s post commenting “My intuition is that the HOSE-code approach, neural network approach and LSR approach will outperform the GIAO approach. Certainly these approaches would be much faster I believe. It would be good to compare the outcome of your studies with these other prediction algorithms and if the data is open then it will make for a good study.”

Peter comments “I am a supporter of the HOSE code and NN approach, but I have also been impressed with the GIAO method. The time taken is relatively unimportant. A 20-atom molecule takes a day or so, smaller ones are faster. We can run 100 jobs a day – so 30,000 a month. That’s larger than NMRShiftDB.”

He also comments “I have extended George Whitesides’ ideas of writing papers that in doing research one should write the final paper first (and then of course modify it as you go along). So here’s what we will have accomplished (please correct mistakes)”.

The abstract is below. Rather than correct mistakes I have added a paragraph (NON-bolded). I believe this project offers the ability to help build a bridge between the Open Data community, the academic community and the commercial software community for the benefit of science. There has never been a study of the magnitude being discussed here comparing quantum-mechanical NMR prediction methods with the methods represented by commercial software products. I look forward to it!

PMR> “We adapted Rychnovsky’s method of calculating 13C NMR shifts by adding (XXX) basis set and functionals (Henry has done this). We extracted 1234 spectra with predicted 3D geometries for rigid molecules in NMRShiftDB (no acyclic-acyclic bond nodes for heavy atoms). Molecules had <= 21 heavy atoms (<= Cl). These were optimised using Gaussian XXX and the isotropic magnetic tensors calculated using correction for the known solvent. The shift was subtracted from the calculated TMS shift (in the same solvent) and the predicted shift compared with the observed.
Initially the RMS deviation was xxx. This was due to a small number of structures where there appeared to be gross errors of assignment. These were exposed to the community who agreed that these should be removed. The RMS dropped to yyy. The largest deviations were then due to Y-C-X systems, where a correction was applied (with theoretical backing). The RMS then dropped to zzz. The main outliers then appeared to be from laboratory AAA to whom we wrote and they agreed that their output format introduced systematic errors. They have now corrected this. The RMS was now zzz. The deviations were analysed by standard chemoinformatics methods and were found to correlate with the XC(=Z)Y group which probably has two conformations. A conformational analysis of the system was undertaken for any system with this group and the contributions from different conformers averaged. The RMS now dropped to vvv.
This established a protocol for predicting NMR spectra to 99.3% confidence. We then applied this to spectra published in 2007 in major chemical journals. We found that aa% of spectra appeared to be misassigned, and that bb% of suggested structures were “wrong” – i.e. the reported chemical shifts did not fit the reported spectra values.

    The results from the GIAO calculations were compared with three other prediction approaches provided by Advanced Chemistry Development. These algorithms were not limited in the number of heavy atoms that could be handled. The algorithms were a HOSE-code based approach, a neural network approach and an “increment approach”. A distinct advantage of these approaches is the time for prediction relative to the quantum-mechanical calculations. The QM calculation took a number of weeks to perform on the dataset of 23475 structures on a cluster of computers. However, a standard PC enabled the HOSE code based predictions to be performed in a few hours, the Neural Net predictions in about 4 minutes and the Increment based predictions in less than 3 minutes.
    A comparison of the approaches gave statistics for the non-QM approaches superior to those of the QM approach. Outliers were observed in both cases and were traced to misassignments. QM approaches were generally more capable of predicting exotic structures while for the majority of the NMRShiftDB made up of general organic chemicals non-QM approaches were superior.

We argue that if spectra and compounds were published in CMLSpect in the supplemental data it would be possible for reviewers and editors to check the “correctness” on receipt of the manuscript. We wrote to all major editors. aa% agreed this was a good idea and asked us to help. bb% said they had no plans and the community liked things the way it is. cc% said that if we extracted data they would sue us. dd% failed to reply.

This has the potential to be a very exciting project. While I wouldn’t write the paper myself without doing the work I’ll certainly try the approach. Let’s see what the truth is. The challenge now is to get to agreement on how to compare the performance of the algorithms. We are comparing very different beasts with the QM vs. non-QM approaches so, in many ways, this should be much easier than the challenges discussed so far around comparing non-QM approaches between vendors.
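One simple way to put different predictors on the same footing is the RMS deviation already used in the draft abstract, together with a list of the largest errors as candidates for misassignment. The sketch below is illustrative only: the function names, the 10 ppm threshold and all shift values are made up and do not come from NMRShiftDB or from any of the prediction programs discussed.

```python
import math

def rms_deviation(predicted, observed):
    """Root-mean-square deviation between predicted and observed shifts (ppm)."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

def outliers(predicted, observed, threshold_ppm):
    """Indices whose absolute error exceeds the threshold; candidates for misassignment."""
    return [i for i, (p, o) in enumerate(zip(predicted, observed))
            if abs(p - o) > threshold_ppm]

# Made-up 13C shifts (ppm) for illustration only.
observed = [128.5, 77.2, 21.0, 170.3]
method_a = [129.1, 76.8, 22.4, 168.9]
method_b = [128.0, 77.9, 20.5, 150.3]

for label, pred in (("A", method_a), ("B", method_b)):
    print(label, round(rms_deviation(pred, observed), 2), outliers(pred, observed, 10.0))
```

A single large error can dominate the RMS value, so reporting the outliers alongside it helps separate a genuinely poor method from a dataset containing a few misassigned shifts.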

For those of you who have been watching the blog of late you will be aware of the recent discussions about Open Data (1,2). We have offered the possibility to submitters of spectral data to declare their data either Open or Closed. Noel posted a comment on the blog asking the question “Why is the default Closed? Why even offer the option of Closed?”

So…my response to “Why not offer the option of Closed?” My opinion is that this is the submitter’s decision. It’s not our role to force “Openness” of data onto users. We are working to create an environment that provides value to ChemSpider users rather than one that forces them into a policy regarding openness. Personally, I would prefer to have access to data to help answer a question, even if they are NOT Open Data, than to not have access to those data. I have asked all of the people who have submitted data or had me submit data to ChemSpider whether they would like to have their data moved to open. Three said yes and two said no. I do NOT intend to force people to adhere to making their data Open. That is their choice, not mine. We are creating a community for collaboration. There is value in having access to data whether it is Open or not. If you look at the recent conversations about RSC and their Free Access versus Open Access, we must agree that there IS value to Free Access to their articles despite the fact that they are not Open Access.

My friend Gary Martin has allowed us to deposit some of his data onto ChemSpider. He has commented twice (1,2) and I refer you to those blog postings for his opinions. They are interesting to read.

The reality is that our policies, even as they are, appear to be acceptable enough for people to deposit their data. We already have over 100 spectra deposited on ChemSpider and more to come based on recent conversations. Some of these ARE Open Data and the depositors are acknowledged for this. They are sharing their data with you through us. That’s the benefit of building a community for chemists.

Since ChemSpider went live in March of this year we have received a lot of feedback and questions regarding our understanding of science, our purpose and our passions. We have an excellent Advisory Group who participate in dialogs and constructive discussions. Much of the feedback we have received has been from one individual, Peter Murray-Rust (PMR).

Before proceeding with this post I want to clarify my perceptions. I believe PMR brings a lot of value to the Chemistry Blogosphere. Over the past decade I have watched Peter’s activities with interest as he has participated with many other evangelists to pursue the cause of ODOSOS (Open Data, Open Source and Open Standards). Over the years I will confess a level of hero-worship. I had enjoyed watching what he was doing in regards to enabling the web for chemists. He is prolific…I don’t know where he finds the time to write so much. He travels the world and informs us all of what is going on “out there”. He does a great service. In contrast to these positive traits, which I honor, I am of the opinion that Peter is overly harsh and judgmental in some cases. Often he posts without necessary research and his perceptions become the “truth”. This is dangerous when he has such a public profile and such influence. For evidence of influence visit the graph here and notice the incredible spike in traffic resulting from his post about the Monkeys at ChemZoo in April of this year. It is unlikely those visitors ever returned to our site or blog to hear our comments. Potential damage was done. This blog post is in regards to his most recent judgments of ChemSpider.

When ChemSpider was set up for the benefit of the chemistry community I had assumed that this humble effort by a small group of dedicated individuals would be welcomed by PMR and other Open Access advocates. In general I believe that’s true. Our actions, policies and status have drawn a significant amount of feedback from PMR on his blog. New feedback was posted late last week and I’ll get to that shortly. As a review, in keeping with the trend being set by Rich Apodaca (1,2,3), I am listing what’s happened to date.

“Constructive Feedback” for Newbies

The Challenge to ChemSpider Chemistry

When Sodium chloride dimers are bad science..but are on NIST Webbook and PubChem

Calcium Carbonate is not soluble and can’t have a logP PLUS Lipinski says Calcium Carbonate CAN have a logP

Prussian Blue on ChemSpider is Terrible…but still as good as Pubchem and Emolecules.

Is Stereochemistry on Taxol important? Should the public data be curated?

ChemSpider VERSUS PubChem or ChemSpider SUPPORTS PubChem

ChemSpider ripped off PubChem…damn them.

ChemSpider and Their Openness and non-Web 2.0

ChemSpider don’t understand what Web 2.0 is.

ChemSpider contribute to the community…and support PubChem

Spectral Data are Declared Open Data

Helping out the community with Web Services

There are a lot more…and so to the latest. I’ll identify the recent post comments in italics.

PMR> Recently the Chemspider company has announced an “Open Chemistry Web” which in my opinion misuses the word “Open”.

Open Chemistry Web is the name of a new blog set up and hosted by Will Griffiths. It’s not ODOSOS. It’s a NAME of a blog. If we are in an environment where the name of a blog cannot include the word “Open” then we are living in sad times. Will’s passion is in text-mining OPEN ACCESS Chemistry Articles..or others if people will allow it. Can he not name his own blog? Hmmm….

PMR> Chemspider.com and its associates are commercial organization which have aggregated a large number of chemical connection tables and have started by calculating their properties and extracting literature references which they make freely accessible but not Open. The freedom is for an unspecified timescale and you cannot download significant amounts of the data and you cannot re-use it without permission.

Yes we are “commercial”. I dealt with this same comment previously. If you have interest in this please browse it. A later post outlines the present status of the project and whether or not it will survive.

Yes, we have aggregated a large number of connection tables and have started by calculating their properties and extracting literature references which we make freely accessible. We have done a lot more. We have made multiple services available to the community (1,2,3,4) but, with no surprise, have received no acknowledgment.

Regarding “not open“. We are giving away the ChemSpider database to those who ask for it. It will be published in PubChem. We USE Open Source components (1,2,3,4). We have not generated any Open Source components yet and our source code is not Open. We index Open Access articles on ChemRefer. We work with the Open Source data community to help.

Regarding “you cannot download significant amounts of the data and you cannot re-use it without permission“. We are giving away the ChemSpider database to those who ask for it. We do NOT have a server farm to support downloads. The FAQ page says

“May I download the data and use it in my own database(s)?

You have limited rights in this regard. You can only assemble a database of 5000 structures or less, and their associated properties, from our database without our permission. You can download up to 1000 structures per day from the website. Please contact us at feedbackATchemspiderDOTcom to request an extension outside this constraint. We are willing to provide the ENTIRE database of ChemSpider structures at your request – the file will consist of InChI Strings, InChIKeys and ChemSpider IDs. These constraints are under regular review so please feel free to engage us in conversation.”

PMR>”Initially I was concerned about the complete lack of quality in these calculations and said so – I believe there has been some improvement in quality but I do not check and do not intend to do so. I do not follow Chemspider regularly but they appear to have added the ability for anyone to add annotations and curation. I have serious concerns about the lack of thought given to metadata and I do not expect Chemspider to be able to scale or to compete against modern approaches.”

I acknowledge the judgments and opinions. A question…in terms of online data sources for chemistry I believe that approximately 20 million structures ranks in the top 3. We have about 1500 chemists per day using the site with thousands of transactions including text and structure/substructure searching. Please compare with other services in this domain and, if you do this, provide quantitative information. We welcome any feedback on metadata. We are presently working on RDF’ing ChemSpider thanks to the guidance and support of Egon Willighagen. I have dealt with the metadata discussion previously here and abstracted below.

“Other comments include “I see very little difference between Chemfinder and Chemspider. They are both closed, proprietary, do not expose data, or metadata, or algorithms; have closed code, do not allow downloads or re-use. They lose metadata in their aggregation process. I have nothing personal against Chemspider (or, if they are associated, ACDLabs) – I just think the Web 1.0 model is out of date for chemistry.”

To respond…yes, the code is proprietary and closed…we don’t know of any Open Source code that would quickly search >10 million structures by structure and substructure (that will be covered in a separate blog as I have the utmost respect for the commercial entities that do this well! It’s DIFFICULT.) Oh…but Open Source isn’t part of the Web 2.0 definition. We don’t expose algorithms…correct…many are provided by collaborators and we do not have the right to expose their code. But that isn’t part of Web 2.0 either.

And next…the beloved “metadata” term. What exactly IS metadata? Let’s refer again to our web-friendly Wikipedia regarding metadata. In brief it’s “data about data” and a perfect example is an XML schema vs XML. An XML schema is metadata. According to my interpretation this means InChI and SMILES are not metadata since these data can be interchanged with the structure itself. I may be wrong. The hypothetical entity describing what data can be bound to a structure would be metadata not necessarily data related somehow to the structure, but rather more general data describing the datamodel – for example the source of the data – this IS metadata. ChemSpider doesn’t lose the metadata…we retain the only metadata currently available, the data source, and use it as our link out to the provider. Our primary role again, for now, is to connect silos of information via chemical structures.”

PMR> Chemspider also encourages Uploading Spectra Onto ChemSpider. These spectra by default all belong to Chemspider. They are not Open. If you can convince the world at large to donate IPR to you for free, you deserve some form of congratulations for sheer bravado. Note that even if you upload data and metadata you are not allowed to download it (there is a limit of 100 structures).

Thanks, again, for the judgments. We have been testing out the system with two of our advisory group and myself. Only JC Bradley’s Lab and Bob Lancashire have deposited and with the understanding, I believe, that the data would be “Open”. Since PMR’s blog posts continue to do damage to our reputation we have no choice but to respond. We do this with coding. Within 24 hours of his comments Open Data was declared, spectra can be downloaded. The intention was always there to do this…just we have higher priorities.

PMR>”We have ca. 250,000 calculations on molecules and 130,000 crystal structures which Chemspider have suggested we upload to them. I’m not yet sure why we should do this.”

Well, if they are Open Data, as marked at the CrystalEye website, and seeing that people would like to access the data via ChemSpider, we should just be able to download. But, we don’t want all the data…we just want the structures and the appropriate URL structure to link back to CrystalEye. This is what we do with all data sources including NMRShiftDB.

PMR>”Chemrefer appears to allow searching of Open chemistry articles by keyword. Unexceptional, but why shouldn’t we simply use Pubchem? AFAIK it will index all these journals.”

PubChem indexes these journals? No, I think it’s PubMed. We’ll check on whether everything ChemRefer indexes is in PubMed. However, what they don’t do, yet, or ever, is connect the chemical names in those journals to chemical structures. That’s what’s been done for patents.

“PMR> The IPR model of Chemspider seems clear. No data, metadata and author contributions are Open.

Incorrect.

“PMR>That allows them, at some stage in the future to close some or all of the site and to charge for data and services”

The site, as it exists today, is intended to stay free for all. We may, OPENLY acknowledged, open services that are for charge.

“PMR> and – like eMolecules and their tie-up with Wiley (Wiley and eMolecules: unacceptable; an explanation would be welcome) – I predict this will happen within 5 years (unless Chemspider fails to survive in its current form).

I have posted on what I believe is an inappropriate judgment by Peter that the data on Chemgate is extracted from the journals. I put a trackback to Peter’s original post. He never responded. He did comment separately though about busyness and commenting. Unfortunately Wiley and Chemgate now show up again…with no effort to clean up the previous comments and, unfortunately, more incorrect information about ChemSpider.

“PMR> So all the authors who are contributing metadata are, in effect, donating IP to Chemspider. I have no moral objection to this – it just seems retrograde when we have Open collections of molecules such as PubChem and our own crystalEye.”

ChemSpider data will all go onto PubChem shortly. This was announced at the recent PubChem meeting. I have asked PMR to point me to where I can download the CrystalEye collection if it is indeed Open Data.

“PMR>But a number of my friends in the Open Chemistry area are on the Chemspider advisory board, so I must be missing something. Perhaps they can show how donating IP to a commercial closed company advances the cause of Open Chemistry.”

I hope they discuss with you. This group is a powerful team of intellect, capabilities, insight and support. I value the opportunity to work with them.

“PMR> And I applaud Chemspiderman’s efforts to clean up chemistry. Sometimes this gets muddled with the association with a commercial organisation based on possessing chemical IP so sometimes my messages have been less than generous and I apologized.”

Yes, you did. And I accept it willingly. It was very gentlemanly.

“PMR> I am not anti-capitalist – I do not attack companies per se. But I do attack people who use the word “Open” incorrectly and to promote themselves. I have done this when publishers come up with “Open Access” offerings which appear to be less than satisfactory ( see “open access products” at Nature obscures the debate, Why Open Access metrics are necessary) and for which the community has to pay. “Open” is now used by commercial organisations in the same way as “healthy” – please feel good about us and our activities as we use the word “Open”. We know it’s meaningless, but it makes us look good. Well, it isn’t meaningless. A number of people are trying carefully to describe what is meant by Open access, Open Data, Open source and Open Services. And when others use it to mean something less, I take exception. If nothing else it makes our job much harder.”

I will comment on this in a couple of later posts. I do not support the “marketing” use of Open and do not believe we are doing so. However, I want to comment more on this, but at a later date. Marketing statements bug me too. You’d think that “…the world’s most comprehensive openly accessible search engine for chemical structures” would be PubChem. But it’s not according to this marketing statement …who is it?

There have been comments about PubChem being the model of Openness. I think the effort is great. FULLY support it. But let’s wake up. If funding ceases then PubChem could go away. The data is Open. The software is NOT. PubChem is built around some home-built services and on top of commercial modules such as CACTVS and OpenEye. I discussed it here and it has not been challenged. Am I wrong?

“PMR>: There is nothing Open about this. Even the blog is not Open (it does not carry a CC licence). The services may be free, and they may be useful, but they are not Open. The text that they index may indeed be Open Access in its own right (and probably is because otherwise the publishers will sue them) but this is no especial credit to Chemrefer. We also index Open resources but we make our results Open. Chemrefer could disappear tomorrow. Only if the data, and the source code are made Openly available under licence can they be called Open.”

There is a CC license on the page. Peter acknowledged this. Who said the services were Open? If we did, point me to it and we will rectify. I have asked Peter separately whether all articles linked to CrystalEye are Open Access or whether some are included with permission from the publishers. This is very interesting.

This has been a long post. I understand I have likely added fuel to the fire. I have done it in a public way. I judge that ChemSpider is being harmed by the ongoing misinformation. I wish it to stop. What I want is advice and support to make this a better service for our users. However, I refuse to make it my personal mission to satiate PMR’s requests and objectives. ChemSpider is developed for its users and the community in general, NOT for its non-users. PMR is not a user. Not everything has to be Open for it to be of high value. I believe we deliver value.

In his ongoing comments regarding our efforts here at ChemSpider Peter Murray-Rust has provided “interesting feedback”. This is one of MANY commentaries (1,2,3,….). Despite repeated requests to direct his comments directly to us/me, and suggestions to use the feedback form or participate on our blogs he prefers to comment on his blog.

For those of you who read our blogs and watch our service I hope that you see we are responsive to requests and feedback. I have chosen to take all comments, however directed, as opportunities for improvement so, in less than 24 hours, and because of the commitment of our development team to help preserve our reputation, here’s what we have done…

We looked at the CrystalEye system as a model of Open Data. CrystalEye is hosted by Peter’s Lab. We took this as an appropriate model and now when a user submits spectral data to ChemSpider they have a CHOICE to make the data Open or not. Their choice, not ours. This is a simple checkbox as shown below.

Open Data Spectra
The Open Data policies are those listed here. When a user allows us to host their data as Open Data then we allow others to download it and use it. When a spectrum is submitted as Open Data it can be downloaded by anyone. Simply click on the DOWNLOAD SPECTRUM button as shown in the image below (an example spectrum here) and you can download the JCAMP spectrum to your desktop.

Download Spectrum

Welcome to the world of Open Data on ChemSpider.

This is part 3 (1,2) in the mission to correctly identify the structure of Taxol. I’ve gone on the mission to identify the correct structure after noticing the contradiction via the Wikipedia page commentary PMR made earlier this week. As commented in Part 2 there is a direct contradiction in the two molecules linked off of Wikipedia…one linked to Drugbank and one linked to PubChem. My question is, which one is right? And if neither of them is correct, is the one in Wikipedia correct? And, if that isn’t correct, what IS the structure of Taxol?

I posted a request to CHMINF-L today with a link to the earlier blog postings defining the issue. I received a lot of suggestions and guidance. These included:

1) Check PubChem for the structure of Taxol. (Ahem…check out my statement that I don’t know which one is right!)

2) I got pointed to the C&E News story …very useful by the way

3) I received a link to the structure for Taxol as found in the MDL Compound Index on the DiscoveryGate platform. The CASRN of 33069-62-4 is associated with this structure in a variety of databases including Beilstein, MDL Available Chemicals Directory, MDL Toxicity, PubChem, PharmaPendium, xPharm and others.

4) I received a link to the structure of Taxol on DTP . This is the Developmental Therapeutics Program of the NCI.

5) I received directions about how to use STN-Easy and get the structure and the CAS number/synonyms etc for just a few dollars.

Using point 5 as a basis let’s start HERE and set this as the actual structure of Taxol. Let’s work from the point of view that THIS is highly curated data and is correct. It is an assumption but we need to start somewhere. The image is shown below and I hope that I haven’t broken “image copyright” posting it here (I’ve been watching all the discussions about this by PMR but am taking the risk nevertheless)

STN structure

Now, I happened to have used Taxol as an example in the chapter I wrote in the Third Edition of the ACS Style Guide. It’s a good example of the challenges of structure representation, stereochemistry and systematic nomenclature. When I wrote that article I used the structure taken from the ACD/Dictionary included with the commercial version of ACD/ChemSketch (at that time I was the product manager so I had easy access to the tool). In fact, the webpage advertising the ACD/Dictionary uses Taxol!

So, imagine my concern in wanting to compare the STN structure with the ACD/Dictionary structure…results below..deep breath…

ACD/Dictionary

The structure is identical, in terms of connectivity and stereochemistry, to that from STN Easy (but, in my opinion, much more attractive). Since STN doesn’t give me an InChI (which strikes me as strange when many of us in the domain support the shift and the industry will demand it shortly!) I will generate the InChI inside ACD/ChemSketch and work from there.

The InChI String is: InChI=1/C47H51NO14/c1-25-31(60-43(56)36(52)35(28-16-10-7-11-17-28)48-41(54)29-18-12-8-13-19-29)23-47(57)40(61-42(55)30-20-14-9-15-21-30)38-45(6,32(51)22-33-46(38,24-58-33)62-27(3)50)39(53)37(59-26(2)49)34(25)44(47,4)5/h7-21,31-33,35-38,40,51-52,57H,22-24H2,1-6H3,(H,48,54)/t31-,32-,33+,35-,36+,37+,38-,40-,45+,46-,47+/m0/s1/f/h48H

and the InChIKey is: RCINICONZNJXQF-MZXODVADBJ

I used the InChI String to search ChemSpider and found it here.

taxol.png

While Taxol is on ChemSpider 7 times, only this one record is the ACTUAL structure. One of the 7 is the incorrect structure just based on molecular formula (there are two data sources for this deposition: one from PubChem and one from ChemBlock). The other 6 structures all have the SAME connectivities but the stereochemistry is different. This is crystal clear in the new InChIKey comparison as shown below:

RCINICONZNJXQF-CLDWUXIMDD

RCINICONZNJXQF-SWYDOUDSDE

RCINICONZNJXQF-XIKKIZKTDF

RCINICONZNJXQF-GXKQXQCDDN

RCINICONZNJXQF-MZXODVADBJ

RCINICONZNJXQF-LOQTUHTGBW

Notice the first 14 characters are consistent…but the stereo layer is different. So, where does this get us with the Wikipedia entry shown below?
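For anyone who wants to repeat this comparison on their own list of keys, here is a minimal sketch in Python. It simply splits each of the six keys listed above at the hyphen and groups them by the first block. The function name split_key is my own for illustration and is not part of any InChI software.

```python
from collections import defaultdict

# The six InChIKeys listed above, copied as shown in the post.
keys = [
    "RCINICONZNJXQF-CLDWUXIMDD",
    "RCINICONZNJXQF-SWYDOUDSDE",
    "RCINICONZNJXQF-XIKKIZKTDF",
    "RCINICONZNJXQF-GXKQXQCDDN",
    "RCINICONZNJXQF-MZXODVADBJ",
    "RCINICONZNJXQF-LOQTUHTGBW",
]

def split_key(key):
    """Hypothetical helper: return (first block, remainder) of a key."""
    first, _, rest = key.partition("-")
    return first, rest

# Group the second blocks under their shared first block.
groups = defaultdict(set)
for key in keys:
    skeleton, remainder = split_key(key)
    groups[skeleton].add(remainder)

for skeleton, remainders in groups.items():
    print(skeleton, "->", len(remainders), "distinct second blocks")
```

One first block with six different second blocks is exactly the "same connectivity, different stereo" picture described above.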

Wikipedia

I can confirm at this point the CAS Number is correct. Comparing the structure we have identified as correct with that on Wikipedia here I can confirm the structure on Wikipedia IS Correct. Yay. The link to the PubChem record is to an INCORRECT structure and should be edited to link to CID:36314. The structure on DrugBank is CORRECT.

Now, what about the one sent to me by MDL and displayed here? It is CORRECT.

What about Taxol on DTP? I reviewed the 2D structure on the site but could not download it as a molfile. I would have had to download the entire 50Mbyte file. I downloaded the SMILES string and converted but it had no stereochemistry. So, I searched the ID number NSC125973 in PubChem since the DTP data is deposited there. The PubChem structure identified is 36314, the one in the Wikipedia DrugBox and the CORRECT structure.

What about Taxol in the C&E News story? The structure is CORRECT.

So, we know what the correct structure is…someone needs to confirm my findings and make the edit in Wikipedia please!

Ok, let’s look at systematic names for Taxol. There are so many variants and I really do have to question their quality!

This is not a systematic name for taxol in my opinion: 7,11-Methano-5H-cyclodeca[3, 4]benz[1,2-b]oxete,benzenepropanoic acid deriv.

Let’s look at the C&E News name here and shown below.

C&E News name

Compare the name generated by ACD/Name using the INDEX name generation

Index Name Text

The image below shows the name in the software interface.

IndexName

Here again we see a complication…the name in the C&E News article and the ACD/Name software differ in the stereo definitions in the block: (2aR,4S,4aS,6R,9S,11S,12S,12aR,12bS). The 4aS in ACD/Name is 4bS in the article. I am working to figure out the difference here right now.

PMR commented in a recent blog post “Naming is hard. Very hard. It’s been said that there are only two hard problems in computer science and naming is one of them.” I spent ten years at ACD/Labs working hard with some of the most skilled nomenclature specialists in the industry to make it simpler for chemists to generate systematic names. Would you want to MANUALLY name the structure above…really? There are many other systematic naming tools out there today from Cambridgesoft, ChemAxon, OEChem and others. I recommend using them!!! They can be very capable and include organometallics, some of the most challenging complexes to name.

What’s the bottom line about Taxol? Here’s the point…overall PMR was right about the fact that Wikipedia is highly curated. But not perfect. While the link to PubChem is to the wrong structure the right structure IS on PubChem…care must be taken to identify issues like this..more curation is required.

My biggest comment is that the quality of repositories such as PubChem and ChemSpider is going to degrade if there is no curation effort and if anybody and everybody is starting to deposit their data either as massive SDF deposits or as singletons. Curation efforts are essential.

I covered the issue of taxol a few weeks ago: http://www.chemspider.com/blog/?p=64. Today Taxol came up again on a post by Peter Murray-Rust. First of all a couple of comments re the post.

PMR commented “The intelligible Chemspider image was hand-drawn by the PNAS authors – I don’t know how it got to Chemspider. (Personally I think it’s pretty awful – I do not like stereo bonds which are rectangular rather than wedges. Why do people use them. And You only have to scale the image to corrupt this info). So we need an Open collection of chemical structures.” In case there is confusion please read the original post…the structure was grabbed from a PDF file (4 Total synthesis highlights (Annu. Rep. Prog. Chem., Sect. B: Org. Chem., 2004, 100, 91) – Royal Society of Chemistry)…it is NOT on ChemSpider. The structure was located by a search using Chemrefer, now on ChemSpider. It was not drawn by us, we’re not responsible for it and, to clarify, I don’t like it either.

Oh, and we do have an Open Collection of chemical structures. The deposition process is under beta-testing and anyone can download the data (we will give away the entire structure collection shortly).

Peter commented that Wikipedia is highly curated. I use it a lot. But, I am cautious…ESPECIALLY with stereochemistry. I’m trying to determine what the ACTUAL taxol structure is. My investigations suggest that one stereocenter is WRONG on the Wikipedia structure. The link to the PubChem record is therefore to the incorrect structure in theory.

Also, the systematic name is not what I would term as anywhere near IUPAC standard: β-(benzoylamino)-α-hydroxy-,6,12b-bis(acetyloxy)-12-(benzoyloxy)-2a,3,4,4a,5,6,9,10,11,12,12a,12b-dodecahydro-4,11-dihydroxy-4a,8,13,13-tetramethyl-5-oxo-7,11-methano-1H-cyclodeca(3,4)benz(1,2-b)oxet-9-ylester,(2aR-(2a-α,4-β,4a-β,6-β,9-α(α-R*,β-S*),11-α,12-α,12a-α,2b-α))-benzenepropanoic acid

By the way…the name on Drugbank is 5 beta,20-Epoxy-1,2a,4,7 beta,10 beta,13 alpha-hexahydroxytax-11-en-9-one 4,10-diacetate 2-benzoate 13-ester with (2 R,3S)-N-benzoyl-3-phenylisoserine….hmmm…

I would LOVE this post to get confirmation regarding what the right structure is…is Wikipedia CORRECT or Wrong? I THINK the structure on Drugbank is RIGHT. This DIFFERS from the Wikipedia structure by one stereocenter. Check out the InChIs below:

PUBCHEM
InChI=1/C47H51NO14/c1-25-31(60-43(56)36(52)35(28-16-10-7-11-17-28)48-41(54)29-18-12-8-13-19-29)23-47(57)40(61-42(55)30-20-14-9-15-21-30)38-45(6,32(51)22-33-46(38,24-58-33)62-27(3)50)39(53)37(59-26(2)49)34(25)44(47,4)5/h7-21,31-33,35-38,40,51-52,57H,22-24H2,1-6H3,(H,48,54)/t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+/m0/s1/f/h48H

DRUGBANK
InChI=1/C47H51NO14/c1-25-31(60-43(56)36(52)35(28-16-10-7-11-17-28)48-41(54)29-18-12-8-13-19-29)23-47(57)40(61-42(55)30-20-14-9-15-21-30)38-45(6,32(51)22-33-46(38,24-58-33)62-27(3)50)39(53)37(59-26(2)49)34(25)44(47,4)5/h7-21,31-33,35-38,40,51-52,57H,22-24H2,1-6H3,(H,48,54)/t31-,32-,33+,35-,36+,37+,38-,40-,45+,46-,47+/m0/s1/f/h48H

Compare the STEREO layer at:
t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+
t31-,32-,33+,35-,36+,37+,38-,40-,45+,46-,47+

and compare the stereo for stereo center 37… one is PLUS and one is MINUS. OOPS!
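Eyeballing two long stereo layers is error-prone, so here is a small sketch in Python that does the comparison for the two layers shown above. The helper names parse_stereo_layer and diff_centers are made up for this illustration, and the parsing assumes the simple "number followed by one mark" form seen in these strings.

```python
def parse_stereo_layer(layer):
    """Map atom number -> parity mark for a layer such as 't31-,32-,33+'."""
    centers = {}
    for token in layer.lstrip("t").split(","):
        # Each token is an atom number followed by a single mark (+, -, ? or u).
        centers[int(token[:-1])] = token[-1]
    return centers

def diff_centers(layer_a, layer_b):
    """Return {atom: (mark_a, mark_b)} for every center where the layers differ."""
    a, b = parse_stereo_layer(layer_a), parse_stereo_layer(layer_b)
    return {
        atom: (a.get(atom), b.get(atom))
        for atom in sorted(set(a) | set(b))
        if a.get(atom) != b.get(atom)
    }

# Stereo layers copied from the PubChem and DrugBank InChIs above.
pubchem = "t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+"
drugbank = "t31-,32-,33+,35-,36+,37+,38-,40-,45+,46-,47+"

print(diff_centers(pubchem, drugbank))  # {37: ('-', '+')}
```

This only tells us WHERE the two records disagree, not which one is right.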

I’m certainly willing to be wrong but the point is, right now, I am not sure what the right structure is. Can anyone out there confirm??? Can someone check “the” highly curated data source and tell us?

Until then I am in full agreement with Peter regarding what Wikipedia SHOULD be “It’s Open, re-usable, very highly curated, and the first place that students look. That – or a derivative – is where the world’s chemistry should reside. ” HOWEVER, I am calling for confirmation of the structure and correction if necessary. One of either DrugBank OR PubChem, both linked from Wikipedia, is wrong.

In terms of the comment “That – or a derivative – is where the world’s chemistry should reside”. I DO agree. We have committed to a wiki-environment for Chemistry. We are presently deciding on the appropriate wiki environment (NOT necessarily MediaWiki) to layer onto ChemSpider. Email exchanges are underway with some of the players in this domain at present – and a sincere thanks to Joerg Wegner for his support on this! With Martin Walker on our advisory group (Walkerma on Wikipedia…a very active player in this domain) we look forward to the best advice and guidance from our collaborators.

As discussed in an earlier blog I spent some time chatting with Paul Doherty and Peter Murray-Rust this weekend…specifically around InChIs and InChIKeys. I’d originally suggested to Paul that he put InChIs on the site so that I could use them to check for presence of the structures he draws in the ChemSpider database. Well, now that he’s started to include them in his postings I get to check them.

I started with a search on the term Diazonamide A in PubChem. Two hits….shown below.

Diazonamide A on PubChem 

Diazonamide A is a complex structure. Those structure representations drawn above are NASTY and do need cleaning for sure. Unfortunately ChemSpider has the same issues for these two structures (see below) since we did not CLEAN the PubChem dataset. Cleaning these structures is not an easy task. We are working on improving this as discussed previously.

The two shown on PubChem have different Isomeric SMILES and different InChIs. Why? ONE stereocenter difference…see the highlighted difference below (in red) and the arrow to the one stereocenter difference.

Stereo Differences

It is appropriate to have two structures in PubChem…they are unique. But now we don’t know which one is Diazonamide A. Shucks.

And so, let’s check eMolecules. I didn’t find any hits. eMolecules did take a lot of PubChem into their dataset. The reason for not finding it might be “it’s not there” because of eMolecules’ focus on commercial suppliers OR because “I queried incorrectly”. Don’t know.

So, to ChemSpider. A search on Diazonamide A gave SIX hits. Two of these are the exact ones from PubChem. The four others are shown below…

Marinlit Diazonamide A

Of these four, two have an “additional oxygen and two hydrogen atoms” in the molecular formula.

Is that right or wrong? TotallySynthetic has the formula as C40H34Cl2N6O6 so we’ll trust Paul and curate these two records as IN ERROR as shown below (a primary advantage of ChemSpider is that we allow curation!). I’ll also let Marinlit know…
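A formula mismatch like this is easy to flag automatically. The sketch below, in Python, parses two simple molecular formulas and reports the element-by-element difference. The "suspect" formula is my own assumption, built by adding one oxygen and two hydrogens to the TotallySynthetic formula as described above; the parser only handles flat formulas without brackets, charges or isotopes.

```python
import re
from collections import Counter

def parse_formula(formula):
    """Count atoms in a flat formula such as 'C40H34Cl2N6O6' (no brackets or charges)."""
    counts = Counter()
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(number) if number else 1
    return counts

def formula_difference(reference, other):
    """Return {element: other_count - reference_count} for elements that differ."""
    ref, oth = parse_formula(reference), parse_formula(other)
    return {
        el: oth[el] - ref[el]
        for el in sorted(set(ref) | set(oth))
        if oth[el] != ref[el]
    }

reference = "C40H34Cl2N6O6"  # formula given by TotallySynthetic
suspect = "C40H36Cl2N6O7"    # assumed: reference plus one O and two H

print(formula_difference(reference, suspect))  # {'H': 2, 'O': 1}
```

A non-empty difference means the two records cannot be the same compound, whatever the stereochemistry says.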

 

 Curate Out

 The differences between Diazonamide A and the “incorrect structure” are shown below just for information. SIGNIFICANTLY different.

Differences in Marinlit

Let’s take a look at the different InChIs for all structures we are considering: 2 from PubChem, 2 from ChemSpider (from Marinlit) and 1 from TotallySynthetic. They are ALL different and all differ in the stereochemistry layer.

PubChem CID: 395475
InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-16,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39-,40-/m0/s1/f/h44-45H
 

PubChem CID: 5492609
InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-16,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39-,40u/m0/s1/f/h44-45H
 

ChemSpider CID: 10478902 

InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-16,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23?,27-,30-,39?,40-/m0/s1/f/h44-45H  

ChemSpider CID: 17212293 

InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-16,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27+,30+,39-,40+/m1/s1

Totally Synthetic

InChI=1/C40H34Cl2N6O6/c1-15(2)27-37-46-29-32(54-37)40-20-9-5-8-19(18-7-6-10-22-25(18)26(33(41)43-22)31-34(42)48-38(29)53-31)28(20)47-39(40)52-24-12-11-17(13-21(24)40)14-23(35(50)45-27)44-36(51)30(49)16(3)4/h5-13,15-16,23,27,30,39,43,47,49H,14H2,1-4H3,(H,44,51)(H,45,50)/t23-,27-,30-,39?,40-/m0/s1

 

The TotallySynthetic structure has the stereo layer t23-,27-,30-,39?,40-. In real terms this relates to five stereo centers. At this point I have to question whether the structure as drawn is correct or not..what we have is five structures with the SAME connectivities but with different stereochemistries. This is similar to the issue I blogged about previously in regards to Taxol.
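To see the five records side by side, here is a rough sketch in Python that lines up the /t layers from the InChIs above and lists which centers each record leaves undefined (marked ? or u). The record labels and the parse helper are mine for illustration. Note the sketch looks at the /t layer only and ignores the /m and /s layers (one ChemSpider record carries /m1 rather than /m0), so on its own it does not say whether two records are the same stereoisomer.

```python
# /t layers copied from the five InChIs listed above.
layers = {
    "PubChem 395475": "t23-,27-,30-,39-,40-",
    "PubChem 5492609": "t23-,27-,30-,39-,40u",
    "ChemSpider 10478902": "t23?,27-,30-,39?,40-",
    "ChemSpider 17212293": "t23-,27+,30+,39-,40+",
    "TotallySynthetic": "t23-,27-,30-,39?,40-",
}

def parse(layer):
    """Hypothetical helper: map atom number -> mark, dropping the leading 't'."""
    return {int(tok[:-1]): tok[-1] for tok in layer[1:].split(",")}

parsed = {name: parse(layer) for name, layer in layers.items()}
atoms = sorted({atom for marks in parsed.values() for atom in marks})

# Print one row per record, one column per stereo center.
print("record".ljust(22) + " ".join(str(a).rjust(3) for a in atoms))
for name, marks in parsed.items():
    print(name.ljust(22) + " ".join(marks.get(a, " ").rjust(3) for a in atoms))

# Centers each record leaves undefined ('?' or 'u').
undefined = {
    name: [a for a, m in marks.items() if m in "?u"]
    for name, marks in parsed.items()
}
print(undefined)
```

Laid out this way it is obvious that all five records list the same five centers and that no two rows agree.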

 

I’m not sure that we have the right structure of Diazonamide A yet..but I’m sure we’ll get there very quickly with this question out there in the open. I HOPE SO!

I have been invited to write an article regarding Open Access Chemistry Databases and am in the process of gathering information. During one of my google searches I happened across a statement I was aware of but had forgotten until recently. It relates to the ability to use CAS numbers on a website. Specifically, from the CAS Information Use Policies of 2005 it says, quote:

“A User or Organization may include, without a license and without paying a fee, up to 10,000 CAS Registry Numbers or CASRNs in a catalog, website, or other product for which there is no charge. The following attribution should be referenced or appear with the use of each CASRN: CAS Registry Number® is a Registered Trademark of the American Chemical Society. CAS recommends the verification of the CASRNs through CAS Client ServicesSM.”

I interpret this as meaning that above 10,000 CAS numbers permission must be granted to the organization gathering together a data collection. Based on my experience there are a LOT of situations where collections of more than 10,000 CAS numbers exist. We are presently deduplicating and indexing another million structures on the ChemSpider index. We regularly receive SDF files (are these electronic “catalogs”?) containing structures and CAS numbers…and when these contain over 10,000 CAS numbers are they inadvertently going against CAS policy? Are all of those online databases with a large number of structures doing so with permission (for example ChemIDPlus, ZINC DB, eMolecules and, of course, PubChem)?
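If you want to know where your own collection stands against that 10,000 figure, a rough count is easy to script. The Python sketch below pulls CAS-number-shaped tokens out of a block of text, keeps only those whose check digit is consistent, and counts the distinct ones. The check digit rule used here (weight the digits from right to left by 1, 2, 3… and take the sum modulo 10) is general knowledge about the CAS number format rather than something from the policy text, and the function names and sample text are my own. A valid check digit only shows the number is well formed, not that it belongs to the structure it sits next to.

```python
import re

CAS_PATTERN = re.compile(r"\b(\d{2,7})-(\d{2})-(\d)\b")
FREE_LIMIT = 10_000  # figure quoted from the CAS policy text above

def is_valid_casrn(text):
    """True if text looks like a CAS number and its check digit is consistent."""
    match = CAS_PATTERN.fullmatch(text)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight digits 1, 2, 3... starting from the rightmost digit before the check digit.
    total = sum(int(d) * w for w, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))

def count_distinct_casrns(text):
    """Count distinct, check-digit-valid CAS-number-shaped tokens in text."""
    found = {m.group(0) for m in CAS_PATTERN.finditer(text)}
    return len({rn for rn in found if is_valid_casrn(rn)})

# Made-up sample text: two real numbers, one repeat and one with a bad check digit.
sample = "Taxol 33069-62-4; water 7732-18-5; repeat 33069-62-4; typo 33069-62-5"
n = count_distinct_casrns(sample)
print(n, "distinct CAS numbers;", "over" if n > FREE_LIMIT else "within", "the limit")
```

For a real SDF file you would feed in the file contents instead of the sample string.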

I can only imagine the consequences if these large collections/websites/databases do not have permission to expose over 10,000 CAS numbers. What a public relations nightmare that could open up! Since we deposited the PubChem dataset to ChemSpider that naturally includes any associated registry numbers. Since eMolecules has deposited portions (not all) of the PubChem dataset they also have deposited the registry numbers.

I may be lighting a fire here, and might get some interesting calls as a result, but I am publicly asking the question…if you are managing a website or public data collection of over 10,000 CAS numbers (read that as any site exposing PubChem data) have you asked permission to expose the data? And … did you get permission? CAS numbers are everywhere…they are “phone numbers” for chemistry. On cans and boxes in our kitchen and garage. On webpages all over the place. This is a very interesting situation for “large chemistry databases”…