Archive for February, 2008

Producing visually attractive chemical structures, with depictions that communicate enough detail about the molecule to be easy to interpret, is straightforward for simple molecules but challenging for complex structures and even for some of nominal complexity. Consider the images below.

clean3.png

Complex it’s not, but certain algorithms cannot clean even this fairly simple structure. Appropriate cleaning will give:

clean4.png

Performed incorrectly, cleaning can produce depictions that are very confusing and that communicate incorrect information to the user. Consider the structure below.

clean5.png

Inspection of the structure might initially lead the user to assume the presence of a naphthyl ring system. However, consider the bonds within the ring and you will notice it is not a naphthyl ring. CLEANing the molecule gives us:

clean6.png

We acknowledge that we have some “pretty ugly” structures online as a result of depositing structures from various sources, and with millions of structures we are sure to have our problems. For our curators and our depositors we have provided the ability to CLEAN molecules online. The capability has been introduced into the Structure Drawing Applet and is outlined in a technical note online.

CLEANing algorithms are not easy to perfect. None are ideal. But they ARE absolutely necessary. For those of you receiving the CrystalEye feed you will see a lot of interesting depictions, especially for the organometallics. The organic compounds should be fairly easy to CLEAN up under most conditions. The images below show examples.

clean9.png

clean8.png

Hopefully the CLEANing capabilities we have introduced will help our curators and depositors improve the depictions of structures on ChemSpider.


Those of you who have been reading the blog will likely recall my views on the Open Notebook NMR project. I have been trying to establish whether or not the time-consuming and computationally intensive GIAO calculations associated with Open Notebook NMR provide better performance than standard out-of-the-box PC-based desktop computer calculations taking just a few seconds. We didn’t get to examine that hypothesis out of that work but I have been working on that study with a couple of my ex-colleagues at ACD/Labs and hope to report on it in the future. A paper was released this week by Rzepa and Braddock in J Nat Prod. regarding the “Structural Reassignment of Obtusallenes V, VI, and VII by GIAO-Based Density Functional Prediction” doi:10.1021/np0705918

I want to comment on the new feature of Associating a Publication with a structure, in this case associating the J Nat Prod paper with this structure. In this case the structure of Obtusallene V is in the ChemSpider database.

obtusallenev.png

I added the appropriate identifiers using the usual process and then added the publication. To do this simply select “Add Publication” and the following screen will be displayed.

add_publication.png

About 60 seconds of work and I had added the article title, authors, DOI and a short description regarding the paper. When I submitted it the publication went to the curator for approval, in this case me. Once approved, the article is associated with the structure and linked to it via DOI.

add_publication3.png

As articles are associated in this way the literature, both open and closed, becomes structure and substructure searchable extending the reach of ChemSpider even further.

I then deposited the CORRECT structure of Obtusallene V as identified by Rzepa and Braddock. The structure is here and scrolling down shows you the annotation regarding it being the correct structure. I removed the name Obtusallene V from the old structure so that now the search on that name takes you to the new and correct structure. This URL is that query: http://www.chemspider.com/Search.aspx?q=obtusallene+V 

This blog post is a perfect example of what we have been promising since the beginning: structure deposition, curation and cleaning up the quality of data.


I have been asked a number of times what I would recommend for looking at SDF files. This is specifically because it is possible to download the information associated with a structure into one of these viewers and store/manipulate/print. The constraint with this question about SDF viewers is generally that the utility needs to be FREE. There are a number of good SDF viewing tools out there, especially in the commercial world. Ann Richard from EPA, who runs the DSSTox database, has commented on SDF file viewers here.

I use numerous tools for SDF viewing. My primary desktop tool is ACD/ChemFolder and I use it because I know it inside out since I managed the product for a number of years and ACD/Labs have been gracious enough to provide me with a copy to support the work I am doing to curate Wikipedia chemical structures. ChemSpider has also been given access to ChemAxon and OpenEye tools as a result of the philanthropic nature of these two organizations and their support of both academia and “no login” websites (in the case of ChemAxon). We also use these tools for the manipulation of SDF files and for confirming the results across multiple software products when we see issues with SDF file handling. I don’t have access to Symyx/MDL tools but, as the format was defined by MDL in the early days, they would also be ideal when there are issues with SDF file reads. Just today I was told by a company that their SDF files must be fine since ISIS reads them, but I have checked them in other tools and they are not readable. That makes managing the complexities of SDF files from different sources interesting at best, but such is the nature of the domain I’m afraid.
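When an SDF file reads in one tool but not another, a quick independent look at its records can help narrow down where the problem lies. The sketch below is a minimal, assumption-laden reader: it assumes V2000-style records (the counts line is the fourth line of each record), records terminated by a `$$$$` line, and data items introduced by a `> <FIELD>` header. The function names and the one-atom sample record are made up for illustration and are not taken from any of the tools mentioned above.

```python
def parse_record(lines):
    """Parse one SDF record (a list of lines, without the $$$$ terminator)."""
    if len(lines) < 4:
        raise ValueError("record too short to contain a molfile header")
    counts = lines[3]
    # V2000 counts line: atom count in columns 1-3, bond count in columns 4-6.
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    fields, current = {}, None
    for line in lines:
        if line.startswith(">"):
            # Data header such as "> <NAME>"; take the text between < and >.
            start, end = line.find("<"), line.rfind(">")
            current = line[start + 1:end] if start != -1 and end > start else None
            if current:
                fields[current] = []
        elif current is not None:
            if line.strip() == "":
                current = None  # a blank line ends the data item
            else:
                fields[current].append(line)
    data = {name: "\n".join(value) for name, value in fields.items()}
    return {"title": lines[0], "atoms": n_atoms, "bonds": n_bonds, "data": data}


def read_sdf(text):
    """Split SDF text into records on $$$$ lines and parse each one."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(parse_record(current))
            current = []
        else:
            current.append(line)
    return records


# A made-up single-atom record, used only to exercise the reader.
SAMPLE = "\n".join([
    "water",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <NAME>",
    "water",
    "",
    "$$$$",
]) + "\n"

for record in read_sdf(SAMPLE):
    print(record["title"], record["atoms"], record["bonds"], record["data"])
```

A record that raises an error here (for example a counts line that is not where it is expected) is a reasonable candidate for the kind of cross-tool disagreement described above, although a real reader would need to handle far more of the format.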

For a free tool I am presently using a tool from Hyleos called the ChemFileBrowser. Under the constraint of Free it is a great tool and is in very active development as detailed on their webpage. It’s stable, fast, gives me what I need to see in terms of SDF file browsing and is priced well. It doesn’t offer searching and is not connected to a drawing tool and those capabilities are available from the commercial products. However, for the distribution and viewing of SDF files I think it’s a great little utility. The screenshot below is for the single structure view but there is also a grid view.

hyleos.png

This structure is part of my viewing of the IUCr structure file. We are presently working through the extraction of chemical names from IUCr articles from 1948 onwards. These are being converted using name-to-structure tools available to us and then manually curated for accuracy…a long and tedious process. These structures will then be put onto the ChemSpider database and links will be formed directly to the associated IUCr articles via DOI. This is a fairly simple but time-consuming process for organics but very difficult for organometallic complexes for two reasons - support of conversion of complex organometallic names and the representation of those structures in SDF format.

I am also about to start work with the NISS’ PowerMV viewer as an alternative viewing tool for SDF files. I hope to report back on its capabilities should time allow but this will not be for a while since we are preparing for our shift from ChemSpider Beta to coincide with our first year Anniversary in March.


We have started to introduce new capabilities onto ChemSpider in preparation for our shift from ChemSpider Beta to ChemSpider RELEASE VERSION to celebrate our one year anniversary. You should see a number of incremental improvements happening over the next few weeks. I’ll highlight as many of them as I can as we release them.

Let’s start with new Search Capabilities. On the search screen you will see a drop down menu. You can now select from 5 different types of searches. This post will highlight only the “Structure” type search. Details about the others will follow.

searches.png

It appears that many people believe that ChemSpider is only a text-based search. The reality is that we went live with structure and substructure search in version 1. For sure we have improved things over the months and the searches are better and faster but structure searching has always been present.

To perform structure searching choose the Structure Search and you will see:

structuresearch1.png

Now click on Input Structure and a screen for submitting structures will be opened. There are three ways to do this. The Convert option will allow you to input a SMILES string, InChI string or chemical name and convert it to the structure. If the structure is correct select ACCEPT and perform the search. Alternatively load a molfile by browsing and uploading, then click ACCEPT. If you want to draw a structure or edit one you have converted simply select edit and draw into the applet.

structuresearch2.png

The manual for the applet is here. The applet will look like this:

structuresearch3.png

These capabilities have been in place for a while. Now what we have done is enable you to search in different ways from this place. Specifically…

structuresearch4.png

These searches are very valuable, specifically in relation to the issue of tautomers and structure skeletons. As I have shown with a previous post on ginkgolide B, in some cases there are many similar structures on the database. Searching on the skeleton will find them all. A search on Taxol shows 1 exact structure, 1 tautomer of the exact structure and 42 structures with the same skeleton as shown below.

structuresearch5.png

Enjoy the new capabilities. We welcome your feedback.


The recent blog posting on the InChIKey Resolver has sparked quite a lot of interest. I’ve been talking with Alex Tropsha, a member of our Advisory group, regarding hosting the InChIKey Resolver project at UNC-Chapel Hill and the decision is that we will move ahead with setting up a system under their control. They are presently looking for a developer interested in relocating to Chapel Hill to support both this project as well as some other exciting projects they have running in their laboratories. If anyone is interested in a Cheminformatics role please contact me at the usual email address and I will connect you to the appropriate person. I’m excited since we’d get to work together on the project. Don’t be shy…UNC Chapel Hill is a superb school, Alex and his team are doing excellent science and the environment is simultaneously one of fun, creativity and hard work.


Tonight I finished an article on Public Chemistry Databases. During that article I commented on the size of the Public Chemistry Databases versus the commercial databases. There have been numerous discussions in the blogosphere about the size of databases such as PubChem relative to the CAS Registry. Recently PubChem and ChemSpider headed towards 20 million structures. The CAS Registry is about 33 million.

Now, I don’t know how much duplication there is in the Registry but I can comment on what is in ChemSpider and likely in PubChem. Here’s a basic comment about molecules with complex stereochemistry. They tend to exist MULTIPLE times in the database due to different variants of stereochemistry. Let’s examine Ginkgolide B. The structure below is taken from a recent RSC article. I was interested to see whether we had the “correct” structure of Ginkgolide B on ChemSpider, assuming that the correct structure is the one shown on the RSC webpage.

ginkgolide-b.png

A search on the name Ginkgolide B turned up a total of 6 structures. The connectivities are the same for all structures. The ONLY difference is in the stereochemistry. Take a look at the structures in Table View. There is one structure with full stereochemistry expressed. This one comes from PubChem, Thomson Pharma and xPharm. With full stereochemistry it might be safe to assume it is correct.

However, even for Taxol there are structures with complete stereochemistry and they are different: Structure 1, Structure 2, Structure 3, Structure 4 and Structure 5

I actually gave up looking eventually…here are the different complete stereochemistries. Look carefully…

t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36+,37+,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36-,37+,38+,40-,45-,46+,47-

t31-,32-,33+,35-,36-,37-,38-,40-,45-,46-,47-

t31-,32-,33+,35-,36+,37-,38-,40-,45-,46-,47-

Question for ChemSpider Users - there are actually WAY MORE than 10 Taxol skeletons on ChemSpider. Can anyone figure out how many? It actually takes one search to find them all!

We believe this is the correct structure of Taxol.

Back to Ginkgolide B. I redrew the structure shown in the RSC article (and as shown below).

ginkgolide-b_2.png

Generating the InChIKey for this structure and performing a search on ChemSpider gave me no hits. It looks like either the RSC structure is wrong OR all of the six structures from all of the different sources are wrong. As mentioned, there is actually only one Ginkgolide B structure (a structure with the associated identifier) on ChemSpider with full stereochemistry. The stereo for that structure is:

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+ ChemSpider

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20- RSC Stereo

There is ONE stereocenter difference.
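Spotting a single differing sign by eye is error-prone, so a small script can do the comparison instead. The sketch below is only an illustration: it assumes a stereo layer is a simple comma-separated list of centre numbers, each followed by a one-character mark (`+`, `-` or `?`), as in the two strings above, and the function names are made up for this example.

```python
def parse_stereo_layer(layer):
    """Map each stereocentre number to its mark ('+', '-' or '?')."""
    body = layer.strip()
    if body.startswith("/"):
        body = body[1:]
    if body.startswith("t"):
        body = body[1:]
    centres = {}
    for token in body.split(","):
        token = token.strip()
        if token:
            # The last character is the mark, everything before it the number.
            centres[int(token[:-1])] = token[-1]
    return centres


def stereo_differences(layer_a, layer_b):
    """Return (centre, mark_a, mark_b) for every centre that differs."""
    a, b = parse_stereo_layer(layer_a), parse_stereo_layer(layer_b)
    return [
        (centre, a.get(centre), b.get(centre))
        for centre in sorted(set(a) | set(b))
        if a.get(centre) != b.get(centre)
    ]


chemspider = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+"
rsc = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-"

print(stereo_differences(chemspider, rsc))  # [(20, '+', '-')]
```

For the two Ginkgolide B layers the only difference reported is at centre 20, which agrees with the manual comparison.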

This is what curation is all about. The question now is which one is correct? Is it RSC? Is it the structure on ChemSpider? Can anyone validate? Did I mess something up in the comparison (it happens!)?

Now, for the LONG list of Ginkgolide B structures on ChemSpider shown here what do users think we should do? If we simply remove ALL labels for the incorrect structures then we will remove all links into other databases that contain information for “their Ginkgolide B”. If we collapse all links into the correct Ginkgolide B on ChemSpider, and bring 6 records into one, then the structures to which the correct structure links are actually incorrect in the linked databases (but useful information exists there).

Quite the conundrum. I’d appreciate feedback!


ChemSpider IS polluted with interesting identifiers associated with chemical structures and I have blogged many times about our efforts to clean it up. I’ve also suggested that systems such as ChemSpider, and there are many, need an easy way to provide feedback and we have done this as discussed here. All of us hosting such large data collections deal with these issues. Today I found a classic though. A search on a CAS Number brought me to this page:

estrone1.png

The information seems fair enough but the list of names is quite amusing:

estrone2.png

These  might be a new form of “International Name”. We have had disasters just like this on our own site. At the weekend I was informed by a user of one of our structures having over 70,000 identifiers! We looked at it. It was the ONLY structure on the database with more than 300 identifiers and this one user found it. We’ve cleaned it out now. Hosting services like this is a lot of fun :-)


An article entitled The Search for Unusual Suspects discusses the fact that scaffold hopping expands the range of core molecular shapes for lead generation. This is of particular interest to us here at ChemSpider because it discusses LASSO. For those of you watching the blog you will know that we are in the process of implementing LASSO here on ChemSpider. One of the specific advantages of the LASSO descriptor is the ability to scaffold-hop. This is defined in the article and quoted here:

The term “scaffold hopping” was coined by former Hoffmann-La Roche researcher Gisbert Schneider. “It defines the techniques used to identify isofunctional molecules — molecules that have the same bioactivity but different architecture — in other words, different chemotypes,”

Rather than try to do justice to the article I recommend reading the article here.

Also, expect to see LASSO roll out in the very near future here on ChemSpider.


I use Google Reader to read blog posts. It’s great. The PROBLEM I have with blogs is when I want to receive comments on comments I have made on a blog. I waste a lot of time going back to blog posts to see whether the author of the original post has commented on my comments. It is a real time hog. Maybe I should stop commenting…it would be less time-consuming.

I think ALL blog hosts should allow users to Subscribe for Comments. What is this? I noticed it on David Bradley’s ScienceBase blog and we added it tonight. Look at the screenshot below and notice the red highlight. Now if you post a comment to the blog check this box and if new comments are received on this particular blog post then you will receive an email about the new comment.

subscribe.png


The recent post regarding the InChIKey resolver has catalyzed a number of conversations. There have been just as many going on off the blog as well as comments on the original blog posting. One thing that came up a number of times was about how there is no such thing as a unique InChIKey.

One specific question asked whether or not the InChI was sensitive to tautomers. This is all down to option settings. There are a number of layers in the InChIString from which a key is derived. The InChIString (and therefore InChIKey) generated for a particular structure is dependent on the settings for the layers. I won’t review the layers again as it has been done many times elsewhere, especially at the unofficial InChI FAQ page.

Suffice it to say the mobile proton perception setting DOES allow individual InChIKeys to be generated for different tautomers when it is switched off. See below the 4 tautomers for guanine and the different InChIKeys.

guanine-inchikeys.png

Note that the first set of characters in front of the dash carry the “connectivity” information between atoms while the second set of characters carries the content of the layers - stereochemistry, mobile protons, charge and isotopes. In the four guanine structures the connectivities are identical.
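The two-part layout makes it easy to group keys that share the same connectivity. The sketch below assumes the 25-character key format used in this post (a connectivity block, a dash, then a layers block); the function names are invented for the example, and the second guanine-like key is a made-up placeholder used only to show the grouping, not a real InChIKey.

```python
from collections import defaultdict


def split_inchikey(key):
    """Split a key into (connectivity block, layers block) at the first dash."""
    connectivity, dash, layers = key.strip().partition("-")
    if not dash or not connectivity or not layers:
        raise ValueError("expected a key of the form CONNECTIVITY-LAYERS")
    return connectivity, layers


def group_by_connectivity(keys):
    """Group keys whose first block (the connectivity) is identical."""
    groups = defaultdict(list)
    for key in keys:
        groups[split_inchikey(key)[0]].append(key)
    return dict(groups)


keys = [
    "UYTPUPDQBNUYGX-UHFFFAOYAE",  # guanine key quoted in this post
    "UYTPUPDQBNUYGX-XXXXXXXXXX",  # placeholder second block, NOT a real key
    "XLYOFNOQVPJJNP-UHFFFAOYAF",  # water key quoted in this post
]

for connectivity, members in group_by_connectivity(keys).items():
    print(connectivity, len(members))
```

Grouping on the first block is essentially a text-only version of a "same skeleton" lookup: keys that differ only in stereochemistry, mobile protons, charge or isotopes fall into the same group.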

When the mobile proton perception is switched on then all tautomers give the SAME InChIKey, UYTPUPDQBNUYGX-UHFFFAOYAE. This type of capability can be very valuable when building a database for searching. For example, every structure could be populated into the database both with and without mobile proton perception. This would allow for searching of not only the individual tautomers but also all members of the same tautomer family.

What this means is that a whole series of InChIStrings and InChIKeys can be generated for a molecule dependent on settings. There are moves afoot to define a set of standard settings for the generation of InChIs. Until then variability is possible. This is compounded by the need to input the correct structures prior to generating InChIs. Perform a search for Taxol on ChemSpider and you will get three structures with the same mass and the same connectivity (check the keys in Table View).

taxol.png

Check the InChIKeys below and you will see the different layers. Check CAREFULLY for differences in stereochemistry and you will see question marks for undefined stereochemistry. The FULL stereochemistry is in the bottom InChI only.

InChI: InChI=1/C47H51NO14/c1-25-31(60-43(56)36(52)35(28-16-10-7-11-17-28)48-41(54)29-18-12-8-13-19-29)23-47(57)40(61-42(55)30-20-14-9-15-21-30)38-45(6,32(51)22-33-46(38,24-58-33)62-27(3)50)39(53)37(59-26(2)49)34(25)44(47,4)5/h7-21,31-33,35-38,40,51-52,57H,22-24H2,1-6H3,(H,48,54)/t31-,32-,33+,35-,36+,37-,38?,40-,45+,46-,47+/m0/s1
InChI: InChI=1/C47H51NO14/c1-25-31(60-43(56)36(52)35(28-16-10-7-11-17-28)48-41(54)29-18-12-8-13-19-29)23-47(57)40(61-42(55)30-20-14-9-15-21-30)38-45(6,32(51)22-33-46(38,24-58-33)62-27(3)50)39(53)37(59-26(2)49)34(25)44(47,4)5/h7-21,31-33,35-38,40,51-52,57H,22-24H2,1-6H3,(H,48,54)/t31-,32-,33?,35?,36?,37+,38?,40?,45+,46?,47+/m0/s1
InChI: InChI=1/C47H51NO14/c1-25-31(60-43(56)36(52)35(28-16-10-7-11-17-28)48-41(54)29-18-12-8-13-19-29)23-47(57)40(61-42(55)30-20-14-9-15-21-30)38-45(6,32(51)22-33-46(38,24-58-33)62-27(3)50)39(53)37(59-26(2)49)34(25)44(47,4)5/h7-21,31-33,35-38,40,51-52,57H,22-24H2,1-6H3,(H,48,54)/t31-,32-,33+,35-,36+,37+,38-,40-,45+,46-,47+/m0/s1

The following depositors do not have full stereochemistry for Taxol in their databases it appears. Maybe this is because the structure was drawn before full characterization?

ChemBank, ChemExper Chemical Directory, DiscoveryGate, Emory University Molecular Libraries Screening Center, KEGG, NINDS Approved Drug Screening Program, PubChem, San Diego Center for Chemical Genomics, Thomson Pharma, CambridgeSoft Corporation

I do not have access to all databases to confirm this but a search of the PubChem record to check for sources suggests this observation is true.

I believe the issue with appropriate InChI generation is not down to settings, as they could be set as defaults within the majority of InChI generators, especially when using a centralized InChI resolver where structures would be submitted and strings and keys could be generated on the fly. I believe the issue is primarily with the accuracy of structure drawings. A central service could provide tools to check for undefined stereochemistry, highlight it and ask for resolution or submission as is. There are many more questions…
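A check for undefined stereochemistry of the kind such a service might run can be sketched on the InChI string itself. This is only an illustration under stated assumptions: it looks at the first `/t` layer after the formula and treats any centre marked with a question mark as undefined. The function name is made up, and the example string is deliberately abbreviated (the connectivity and hydrogen layers are left out for brevity), so it is not a complete InChI.

```python
def undefined_stereocentres(inchi):
    """Return the centre numbers marked '?' in the first /t layer, if any."""
    # Skip the "InChI=1" prefix and the formula, then look for the /t layer.
    for layer in inchi.strip().split("/")[2:]:
        if layer.startswith("t"):
            return [
                int(token[:-1])
                for token in layer[1:].split(",")
                if token.endswith("?")
            ]
    return []


# Abbreviated string: formula plus the stereo layers only, for illustration.
partial = "InChI=1/C47H51NO14/t31-,32-,33?,35?,36?,37+,38?,40?,45+,46?,47+/m0/s1"

missing = undefined_stereocentres(partial)
if missing:
    print("Undefined stereocentres:", missing)
else:
    print("All perceived stereocentres are defined")
```

A submission tool could show the depositor this list and ask for the centres to be resolved, or accept the structure as is with the gaps recorded.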


There have been a number of discussions on this blog about Open Data, Free Access and Open Access. There continues to be confusion and I for one, UNFORTUNATELY, still interchange Free Access and Open Access while speaking. For those of you interested in this domain I highly recommend reading the recent posting by Stevan Harnad.


A list of links to articles on the ChemConnector Blog may be of interest to some:

The Full NMR Assignment of Hexacyclinol using CASE

Taking a Break from Wikipedia Curation

My friend, An American Citizen

Does the Power of Marketing Equate to the Stupidity of the Public?


I am happy to announce that the LASSO project previously announced on this blog is well underway. Last week I spent a day in Toronto visiting with SimBioSys regarding this project and other projects I am now consulting with them on. There are a whole series of exciting possibilities for mashing together some of their capabilities with ChemSpider. Think about Synthetic Accessibility and Retrosynthetic Analysis as future possibilities!

This weekend we have done a lot of work to prepare for the impending rollout of all of the search screens and a crowdsourcing project to help validate its performance, learn about the constraints under which it can be used and establish the most appropriate workflows to make the system easy to use. In setting up LASSO we have calculated millions of descriptors and have deposited them onto our database. We have now reached that interesting state of affairs called “hard drive envy”. We keep seeing big disks in the adverts and wishing we had them.

With LASSO depositions comes a request to our users. We would like to buy some new hardware to support the ongoing depositions. If you find our service of any value and would care to offer some “hardware support” please click on “Support ChemSpider” on any record or right here. We’ll take early Valentine’s Day presents from those of you who love us. And we won’t be upset if you don’t. We’ll keep bringing chocolates for you every day…

support.png


For the past few months a LOT of curating has been happening on ChemSpider. The majority of this has been conducted by a small number of committed individuals. Tens of thousands of identifiers have been removed, names have been added and the overall quality of the records has improved dramatically.

ANY registered user of ChemSpider can participate in curating the identifiers on ChemSpider. However, there is also a curator’s role and, now that we have rigorously tested the process with a group of individuals and for tens of thousands of actions, we are inviting people to join our team of curators. If we all apply ourselves collectively to enhancing the quality of what is on the database then it will benefit everybody.

For instructions see the online Technical Note.


By now you have likely all heard of a Digital Object Identifier. That’s that little number associated with doi: on the publications you author/read. This is the way the publishing community has come up with for being able to “resolve” an article online at a publisher’s website. It also offers a powerful way to perform a search online via any of the search engines. If the article is doi’ed then you will likely find it very quickly. The fastest way is to use a resolving service such as CrossRef whereby the doi number is input and an appropriate lookup directs you to the site at which the article is hosted. CrossRef’s mandate is “to connect users to primary research content, by enabling publishers to work collectively. CrossRef is also the official DOI® link registration agency for scholarly and professional publications. It operates a cross-publisher citation linking system that allows a researcher to click on a reference citation on one publisher’s platform and link directly to the cited content on another publisher’s platform, subject to the target publisher’s access control practices.”
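In practice "resolving" a DOI amounts to handing it to a resolver address and following the redirect. As a small sketch, the helper below turns a DOI (with or without a leading `doi:`) into a resolver URL. The helper name is made up for the example, and the base address `https://doi.org/` is the public DOI resolver rather than anything specific to CrossRef or ChemSpider.

```python
from urllib.parse import quote


def doi_to_url(doi):
    """Build a resolver URL for a DOI such as 'doi:10.1021/np0705918'."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    if not cleaned.startswith("10."):
        raise ValueError("a DOI is expected to start with '10.'")
    # Percent-encode unusual characters but keep the prefix/suffix slash.
    return "https://doi.org/" + quote(cleaned, safe="/")


print(doi_to_url("doi:10.1021/np0705918"))
```

Opening the resulting URL in a browser redirects to wherever the publisher currently hosts the article, which is exactly the lookup behaviour described above.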

So how does all this relate to InChIKeys? The InChIKey was only introduced recently but already tens of thousands of them can be found in Google searches and on databases. In fact, it’s already tens of millions…ChemSpider has contributed about 20 million of them to the soup of identifiers, while other sources, such as the one offered by Wolfgang Robien for searching his NMR database, also make them available. Literally, 10s of millions of InChIKeys. Then there are blog sites already using them…for example, totallysynthetic.com (one of my favorite reads). Is this good news? Yes. Is it bad news? Yes.

I’m not sure that people are considering the limitations of InChIKeys. Even some of my friends in the domain have missed the fact that the InChIKey is a hash of the original InChIString. This means that, unlike the InChIString, the InChIKey cannot be reversed to the chemical structure. What does this mean?

The structure of Xanax shown below has an InChIString shown immediately below the structure and the InChIKey, a hash of the string, below that.
inchikey-resolver.png

The InChIString contains details regarding the molecular formula of the molecule and the connectivity information for the atoms (what is connected to what). The InChIString also has additional layers available identifying stereochemistry, mobile protons to deal with tautomerization and other additional layers. All of this has been defined elsewhere and will not be discussed exhaustively here. Let’s just declare that the InChIString can represent a chemical structure in a linear text notation and therefore has value. By inputting the InChIString into an appropriate converter, either online here at ChemSpider or at the desktop through the majority of structure drawing packages or through some other package utilizing the InChI DLLs, the InChIString can be converted to the associated chemical structure. The same is NOT true of the hash. The hash cannot be reversed. While it does represent a concise and homogeneous format for the original InChIString it can ONLY be used as a look-up for the original InChIString or the original structure.
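The one-way, fixed-length nature of a hash is easy to see with a toy example. The sketch below is NOT the InChIKey algorithm: the real key is also derived from a SHA-256 digest, but its letter encoding and block layout are not reproduced here. The function `toy_key` and its 14-character length are assumptions made purely for illustration.

```python
import hashlib


def toy_key(inchi_string, length=14):
    """Toy stand-in for a hashed key: a truncated SHA-256 hex digest."""
    digest = hashlib.sha256(inchi_string.encode("utf-8")).hexdigest()
    return digest[:length].upper()


water = "InChI=1/H2O/h1H2"
methane = "InChI=1/CH4/h1H4"

# Every key has the same length, whatever the size of the input string.
for inchi in (water, methane):
    print(toy_key(inchi), len(toy_key(inchi)), inchi)

# The only way back from a key to a structure is a table built beforehand.
lookup = {toy_key(inchi): inchi for inchi in (water, methane)}
print(lookup[toy_key(water)])
```

Nothing in the digest itself allows the input to be rebuilt, which is why the lookup table, and not the key, is what carries the chemistry.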

What does this mean for the millions of InChIKeys already floating around cyberspace? Well, unless they can be used to look up the original InChIString or associated chemical structure the fact is that they cannot be converted. Let’s take a clear example. Paul Docherty’s TotallySynthetic.com is an excellent website for synthetic organic chemists. Paul puts a lot of time into discussing the recent literature around specific syntheses. He spends a lot of time drawing out the structures into a “beautiful format”, draws out the reactions and sometimes the mechanistic details. Very nice. A lot of work and likely of interest to the majority of organic chemists who would happen across his site. Recently I coauthored a paper regarding NMR and Vinblastine and I was interested to see whether there was anything in cyberspace about vinblastine. So, I navigated to Vinblastine in ChemSpider then clicked on the InChIKey to perform a search using Google (we are considering adding checkboxes to allow searches via Yahoo and Microsoft Live Search…would anyone use them or is Google enough?). Note, I clicked on the “layers” aspect of the InChIKey to search on all stereochemistry, not just the connectivities.

The results were interesting…an immediate, but small, list of hits on the Vinblastine InChIKey. Whoo-hoo. Did I just perform a chemical structure search across the web? Well…kind of. What we actually just did was a search of a text-string which is a hash representation of an alphanumeric string representing a molecule. So, yes and no to the structure search. There are links for Vinblastine on ChemSpider and that’s nice to see but we started there so that’s irrelevant really. I also see a link to a TotallySynthetic.com blog posting. Excellent. Clicking on the link I open the page and there it is. Vinblastine. Nice. Oh, and a list of 8 InChIKeys as shown below.

inchikeys.png

Excellent. Those must represent the structures in the rest of the blog post. Great. I think I’ll see whether those exist in ChemSpider by copy-paste-search in ChemSpider. Okay…2 do, 6 don’t. No problem…as a service to the community I’ll just add the structures we don’t have in ChemSpider but are on TotallySynthetic.com to the database via the new deposition system. But wait…where do I find the chemical structures associated with the InChIKeys on the page? I need them as InChIStrings or SMILES or molfiles or some structure format so I don’t have to go and draw them again to generate the InChIKey.

Exist:

NNRZTJAACCRFRV-ZCFIWIBFBY

CXBGOBGJHGGWIE-ACSXSLCXBW

Do not exist:

HQMRCIGBAXSBEP-CHKWXVPMBQ

IFGQFWXLBCFLAE-VEEOACQBBH

PRJFITCBVYTEFZ-XQRVVYSFBP

QGPJDRFASIBFLH-XDQBUYQUBY

PNSPYPFMBWOZOB-ZSSUYBNLBE

VCGOCYRZQLHTIN-SWPFSWIRBG

Oh dear…literally back to the drawing board. I have to redraw Paul’s chemical structures to regenerate the InChIKeys that are already on the page and already represent the chemical structures he drew. I thought we were supposed to get away from rework???

Here’s the point. Wouldn’t it be much easier if the InChIKey on TotallySynthetic.com could be pasted into a “resolver”, much like the doi is, so that the original structure could be identified, shown, downloaded, saved, reused, not redrawn? Of course it would! ChemSpider is already being used in that way. Earlier this week Rich Apodaca asked me how many daily transactions we have at ChemSpider. Taking the indexing hits into account I had estimated between 1000 and 2000. Sorry, I was wrong. It’s actually closer to 5000 per day now. An increasing number of those are actually people pasting InChIKeys to search the database. This is surprising to me since the InChIKey is so new. Maybe people are just testing? Who knows? But, I think with time this will become more popular as the InChI in both of its forms proliferates. The issue concerns InChIKeys generated for structures NOT present in the ChemSpider database. How will they be resolved?

Here we come to the need for the InChIKey resolver. There needs to be a public service whereby people can generate their InChIKeys and then resolve them in the future. When a structure is drawn, uploaded as a structure drawing, or input as an InChIString or SMILES string for the purpose of generating the InChIKey, the molecule needs to be saved to a database and stored with its InChIKey for future lookup. Can this be done via a series of distributed servers? Likely yes. Is this better done using a centralized service? I think so. Why?

While a search might give rise to a page such as that at TotallySynthetic.com and InChIKey resolving would allow you to quickly see the associated structures, I think the bigger picture is being missed. How would you SUBstructure search the web? How would you SIMILARITY search the web? Doing this using InChIKeys is simply not possible. The best approach is likely a centralized repository of chemical structures and their associated InChIKeys. This centralized repository of structures can be indexed for searching by substructure and similarity of structure. The results can be viewed and additional searches of the web can be spawned using other search engines. If InChIKeys proliferate across blogs, wikis, open electronic notebooks, embedded into Wikipedia pages and publications (both closed and open access) and even into institutional repositories, then a centralized system will allow access across these data sources. Filters can be used to differentiate publications from blogs, from closed databases, etc. Clearly, if anyone wants to search on Water as an InChIKey then you’ll be drowning (excuse the pun) in links as there will be “quite a few”. Just in case you missed it I’ll emphasize that the InChIKey is a homogeneous format. So, water, Mw of 18 and formula H2O, is XLYOFNOQVPJJNP-UHFFFAOYAF and erythromycin, with Mw of 734 and formula C37H67NO13, is ULGZDMOVFRHVEP-RWJQBGPGBH. On a page of InChIKeys how would you tell the difference in the structural nature without resolving?
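To make the “homogeneous format” point concrete, here is a small check. The pattern is an assumption inferred only from the keys quoted in this post (14 capital letters, a hyphen, 10 capital letters); it is not taken from the InChI specification.

```python
import re

# Shape inferred from the keys quoted in this post, not from the InChI spec.
KEY_SHAPE = re.compile(r"[A-Z]{14}-[A-Z]{10}")


def has_key_shape(text):
    """True if text looks like one of the InChIKeys shown on this page."""
    return KEY_SHAPE.fullmatch(text) is not None


water = "XLYOFNOQVPJJNP-UHFFFAOYAF"
erythromycin = "ULGZDMOVFRHVEP-RWJQBGPGBH"

# Both keys pass the same check and have the same length, although one
# molecule has three atoms and the other has well over a hundred.
same_shape = has_key_shape(water) and has_key_shape(erythromycin)
same_length = len(water) == len(erythromycin)
```

Nothing about size, formula or ring systems survives in the key itself, which is why a page of keys tells you nothing until they are resolved.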

So, who should build the InChIKey resolving service? Maybe the PubChem team are well positioned to do this? I don’t doubt they have the intellect, the skill sets, the computing power and maybe even the interest. However, I can imagine a certain collision prevailing should PubChem step forwards to take on this task. How about IUPAC? Well, I think IUPAC would like to see this done but they are not really positioned to run a service like this based on my understanding. Maybe it should be a community effort? Well, yes, I agree it should involve the community but the effort needs to be led, managed, overseen by a central body. It also needs to be paid for. Such a system could not be built and managed “for free”. Somebody would pay, whether it be through a granting body, sponsorship, philanthropy or via a combination of free and paid services (as with Crossref).

There will likely be responses to this blogpost insisting that such an effort “belong to the people”, be based on Open Source components only and be free for use. I agree with the statement that such a service should be free for use in general. I agree that this is how it should belong to the people. People should be able to use the system to generate InChIKeys and resolve InChIKeys and do so without any price barriers. Can it be based on Open Source software? Potentially yes. What does it need? A structure input method (including structure drawing), a database for storage (though a lookup table might suffice), the InChI DLLs from IUPAC for generation and reversal of the structures and InChIs, a structure display tool and a website to host it. There are a multitude of Open Source drawing packages already. There is certainly a good choice of open source databases to choose from. Structure rendering is not particularly difficult (though generating nice “clean” structures is not an easy task). The InChI DLLs are Open Source of course. So, it CAN be done with Open Source components, it can belong to the community and it can bring many additional benefits to the community when it is done.

Imagine the following set of components as the basis of the theoretical platform: JChemPaint for structure drawing and rendering, MySQL or PostgreSQL for databasing, and the InChI DLLs as the pivotal requirement. There are structure cleaning algorithms available but none are perfect. Maybe what’s available in Open Source could be modified by the team working on this project? Maybe one of the vendors would Open Source their structure cleaning algorithms to the community as part of a philanthropic contribution for the general good? The output of the project could include the “best” structure cleaning algorithm available in Sourceforge for anyone to use.

I judge this project is necessary. I judge the time is now. It’s a fulltime job for a small team. It will cost money to run it but not necessarily to use it. Wikipedia does not run for free. They recently ran a fundraising drive to support their efforts. Their development team is very small as far as I know. But what an impact! I believe a small team of individuals can get this done. It will take dedicated effort and resources. It will require the backing of organizations such as the W3C, IUPAC and certainly the participation of groups such as the Blue Obelisk group and PubChem. There will likely be a lot of politics in leading such an effort but it should not hinder getting it done. There will likely be barriers to attempting to proliferate the InChI as a means of connecting data.

During the past six months of my sabbatical I have had time to ponder how I would like to contribute to the community. This is it. I would like to lead this effort. I would like to take what has been learned using ChemSpider as a basis and apply it to this project. I want to build a team to get this done and, with the support of the community, provide the platform for hosting a centralized repository of chemical structures and associated identifiers to facilitate development of structure connectivity across the web. I can imagine certain groups, specifically from academia, wanting to jump on the opportunity to lead an effort like this. My belief is that this should be led by a not-for-profit established to deliver on this task and willing and able to call upon the passionate individuals and groups who would like to see this happen. This system should not belong to any particular university, group or entity other than itself. It should be independent of annual grants if possible and get to a place of being self-sustaining. It should establish a board of thought-leaders in this domain to establish the path forward to get this done.

I do not have all the answers. I’m not even sure of all of the questions. There are known challenges and unknown challenges ahead. The InChI is far from all-encompassing of Chemistry and is limited in terms of inorganics, organometallics, polymers, Markush and so on. But this can and will come I believe. There will be egos involved if we do this. Individual, group and organizational egos. Certain groups are going to feel threatened (and as some have told me already I should be wearing a Kevlar vest). But this needs to be done. We have made the first step. We have blocked the InChIs.org domain name in case we choose to use it as the resolving domain. This blogpost is a statement of intent to pursue this idea. Maybe this is already being done? Maybe behind closed doors? If it is underway please speak up. I welcome all comments – statements of support, detraction, why it won’t work, why it needs to be done. Let’s start the community dialog here.

For now feel free to use our web services, our InChIKey generation and even the URL structure we have provided for using InChIKeys to probe ChemSpider directly. For example, a URL of the form http://www.chemspider.com/InChIKey/ with an appended InChIKey takes you directly to the structure. Try this link: http://www.chemspider.com/InChIKey/CXBGOBGJHGGWIE-ACSXSLCXBW.
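As a minimal sketch, building such a link is plain string concatenation. The function name is hypothetical; only the URL prefix comes from the paragraph above.

```python
# URL prefix as given in the post; the helper name is illustrative only.
CHEMSPIDER_INCHIKEY_URL = "http://www.chemspider.com/InChIKey/"


def chemspider_link(inchikey):
    """Append an InChIKey to the ChemSpider prefix to form a direct link."""
    return CHEMSPIDER_INCHIKEY_URL + inchikey.strip()


link = chemspider_link("CXBGOBGJHGGWIE-ACSXSLCXBW")
```

Whether the link lands on a record still depends on the structure being in the database, as the “Do not exist” list above shows.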


ChemSpider has been helped by a number of organizations either with their guidance, their sponsorship support or, in the case of commercial software vendors, by the contribution of software “to the cause”. I have watched ChemAxon for a number of years. When I first saw them at an ACS meeting I was impressed not only with their relationship building and their growing following but also with their deep understanding of many of the challenges in delivering a cheminformatics toolkit. I commented then to “Watch these guys”. And I DID mean it in a “look over your shoulder” type of way. Over the years I have only been more impressed with their ability to deliver and especially with their contributions to academia with their Application Package software. Most people know them for their Marvin structure drawing package, now in its latest form. There are many websites where you will happen across their branding. I sat in recently on a presentation regarding their JChemBase platform and was very impressed with the flexibility, speed and interface.

We have now been given access to a number of their software components and, with time, you will see us use more of their software. We have to figure out integration etc and balance it with all of the other tasks waiting on us. For now I wanted to comment that we recently used ChemAxon’s 3D structure optimizer to produce 3D structures for two purposes…to put onto the database to pass to JMol for visualization purposes and to provide a feed to the LASSO algorithms. The algorithms were fast enough for us to pass through almost 20 million molecules. We were provided access to a computer cluster to do these calculations thankfully. We did set rather stringent criteria in terms of time so not every structure on the database is optimized - if they weren’t done in a certain period we would abort and move to the next. When there are resources available we will pass the incomplete set through again.
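The “abort and move to the next” policy can be sketched as below. This is not ChemAxon’s API and not our actual pipeline: the function name is hypothetical, each job is modelled as an iterator that yields once per optimization step, and the time limit is checked cooperatively between steps (a real cluster job would more likely be killed from outside).

```python
import time


def run_batch(jobs, budget_s, clock=time.monotonic):
    """Run each job until it finishes or exceeds its time budget.

    jobs maps a structure name to an iterator yielding once per step.
    Returns (done, incomplete); the incomplete names can be re-queued later.
    """
    done, incomplete = [], []
    for name, steps in jobs.items():
        deadline = clock() + budget_s
        finished = True
        for _ in steps:
            if clock() > deadline:
                # Over budget: give up on this structure and move to the next.
                finished = False
                break
        (done if finished else incomplete).append(name)
    return done, incomplete
```

Keeping the list of incomplete names is what makes the second pass possible once resources free up.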

My personal thanks to ChemAxon for their support. It’s greatly appreciated by us and, I am sure, our users too.


I’ve commented previously about the fact that Microsoft had used our web services to connect to Infomesa (1,2). Last week I had a chance to meet with Sam Batterman face-to-face for an overview of Infomesa and to see the services in action. Sam and I chatted for a couple of hours about his platform, the challenges of managing quality in publicly accessible data (and not proliferating errors) and the directions for ChemSpider in general and whether we could extend the services to support his needs.

Infomesa is a BIG whiteboard. And I mean BIG! During our discussions Sam threw up photos, charts, spreadsheets and videos onto the board and formed relationships between them. He demonstrated using our services to pull back/generate SMILES strings, structure images and InChIKeys. He demonstrated mapping relationships as I would do today in MindManager. I could see immediate utility to the approach of the giant whiteboard. I am tired of arranging relationships over multiple documents and while I like MindManager a lot (!) the utility of the enormous whiteboard approach (I think Sam mentioned “equivalent to 50,000 pixels”) became clear in a couple of minutes.

I look forward to playing with Infomesa myself when it’s available and doing what we can through our web services to help offer additional utility to chemists.


On a very regular basis I receive requests now from ChemSpider users for what I would call “General Help”. Here’s a request from a user to answer a Chemistry Question. Can the community help?

“We have detected 2,2 Dichloropropionic Acid  in several of our drinking water systems at levels of about 0.1 to 0.2 ug/L (above the EU MCL), but none in the source water (in some cases surface water, in others, groundwater under the direct influence of surface water).  The levels in the pipe rise after the addition of free chlorine.  This leads me to believe that either some sort of lab interference or some other compound is actually being detected.  We have sent samples to two separate labs using two different methods (blanks were not sent) and the results were confirmed.

Any thoughts on how 2,2 Dichloropropionic Acid might be naturally formed in the presence of chlorine in drinking water? Is it possible that other organics either degrade down to 2,2 Dichloropropionic Acid or are precursors for 2,2 Dichloropropionic Acid, in the presence of chlorine?”

Please post your comments on this blog and I will ask the user to watch this page.

This is part of our future endeavors. Shortly we will establish a discussion forum for ChemSpider users so that when users have scientific problems they can post them there and discuss them. Should the outcome of the discussion be science associated with a particular chemical we will add it to a record in the database for further reference.


On March 24th of this year ChemSpider will celebrate its one year anniversary. We have come a long way in just over a year. During this period we have stayed focused on our intention of “Building a Structure Centric Community for Chemists”. It actually feels as if we are successfully delivering on it too: the recent roll out of our deposition system, now being used by people to deposit structures, our contributions to PubChem, our participation with Wikipedia, our web services for people to integrate to us, the participation of the community in curating the data and the impressive growth in users (for us anyway!) as shown below.

number-of-users.png

We have had our fair share of detractors but have stayed focused on delivery. I was truly touched today to see a blog post acknowledging our efforts and what we are up to. Joerg Kurt Wegner commented on his Mining Drug Space blog about his views of what we have been working on. Seeing it in his words really brought home all that we are up to and the potential impact of our efforts. We are actually accelerating our efforts right now for our anniversary release in March and we hope to deliver some very exciting capabilities at that time.

We welcome your comments… ha