Copyright©2008 Antony Williams
By now you have likely all heard of a Digital Object Identifier. That’s that little number associated with doi: on the publications you author/read. This is the way the publishing community has come up with to “resolve” an article online at a publisher’s website. It also offers a powerful way to perform a search online via any of the search engines. If the article is doi’ed then you will likely find it very quickly. The fastest way is to use a resolving service such as CrossRef, whereby the doi number is input and an appropriate lookup directs you to the site at which the article is hosted. CrossRef’s mandate is “to connect users to primary research content, by enabling publishers to work collectively. CrossRef is also the official DOI® link registration agency for scholarly and professional publications. It operates a cross-publisher citation linking system that allows a researcher to click on a reference citation on one publisher’s platform and link directly to the cited content on another publisher’s platform, subject to the target publisher’s access control practices.”
So how does all this relate to InChIKeys? The InChIKey was only introduced recently but already tens of thousands of them can be found in Google searches and on databases. In fact, it’s already tens of millions…ChemSpider has contributed about 20 million of them to the soup of identifiers, while other databases, such as the one Wolfgang Robien offers for searching his NMR data, are also available. Literally, tens of millions of InChIKeys. Then there are blog sites already using them…for example, totallysynthetic.com (one of my favorite reads). Is this good news? Yes. Is it bad news? Yes.
I’m not sure that people are considering the limitations of InChIKeys. Even some of my friends in the domain have missed the fact that the InChIKey is a hash of the original InChIString. This means that, unlike the InChIString, the InChIKey cannot be reversed to the chemical structure. What does this mean?
The InChIString contains details regarding the molecular formula of the molecule and the connectivity information for the atoms (what is connected to what). The InChIString also has additional layers available identifying stereochemistry, mobile protons to deal with tautomerization, and so on. All of this has been defined elsewhere and will not be discussed exhaustively here. Let’s just declare that the InChIString can represent a chemical structure in a linear text notation and therefore has value. By inputting the InChIString into an appropriate converter, either online here at ChemSpider or at the desktop through the majority of structure drawing packages or through some other package utilizing the InChI DLLs, the InChIString can be converted to the associated chemical structure. The same is NOT true of the hash. The hash cannot be reversed. While it does represent a concise and homogeneous format for the original InChIString, it can ONLY be used as a look-up for the original InChIString or the original structure.
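To make the one-way point concrete, here is a minimal sketch in Python. It does not use the real InChIKey algorithm: `toy_key` is a hypothetical stand-in that simply truncates a SHA-256 digest, and the lookup table is an assumption about how a resolver might store things. It only illustrates that a hash can be looked up but not reversed.

```python
import hashlib

def toy_key(inchi_string: str) -> str:
    """Hypothetical stand-in for an InChIKey (NOT the real algorithm):
    a fixed-length digest of the InChIString."""
    digest = hashlib.sha256(inchi_string.encode("utf-8")).hexdigest()
    return digest[:24].upper()

# InChIStrings for water and ethanol.
water = "InChI=1/H2O/h1H2"
ethanol = "InChI=1/C2H6O/c1-2-3/h3H,2H2,1H3"

# The only way back from a key is a table that someone filled in earlier.
lookup = {toy_key(s): s for s in (water, ethanol)}

def resolve(key: str):
    """Return the stored InChIString for a key, or None if it was never stored."""
    return lookup.get(key)

# Methane was never stored, so its key resolves to nothing,
# even though the key itself is perfectly valid.
methane_key = toy_key("InChI=1/CH4/h1H4")
```

Calling `resolve(toy_key(water))` returns the water InChIString, while `resolve(methane_key)` returns `None`: nothing in the key itself lets you rebuild the structure.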
What does this mean for the millions of InChIKeys already floating around cyberspace? Well, unless they can be used to look up the original InChIString or associated chemical structure, the fact is that they cannot be converted. Let’s take a clear example. Paul Docherty’s TotallySynthetic.com is an excellent website for synthetic organic chemists. Paul puts a lot of time into discussing the recent literature around specific syntheses. He spends a lot of time drawing out the structures into a “beautiful format”, draws out the reactions and sometimes the mechanistic details. Very nice. A lot of work and likely of interest to the majority of organic chemists who would happen across his site. Recently I coauthored a paper regarding NMR and Vinblastine and I was interested to see whether there was anything in cyberspace about vinblastine. So, I navigated to Vinblastine in ChemSpider then clicked on the InChIKey to perform a search using Google (we are considering adding checkboxes to allow searches via Yahoo and Microsoft Live Search…would anyone use them or is Google enough?). Note, I clicked on the “layers” aspect of the InChIKey to search on all stereochemistry, not just the connectivities.
The results were interesting …an immediate, but small list, of hits on the Vinblastine InChIKey. Whoo-hoo. Did I just perform a chemical structure search across the web? Well…..kind of. What we actually just did was a search of a text-string which is a hash representation of an alphanumeric string representing a molecule. So, yes and no to the structure search. There are links for Vinblastine on ChemSpider and that’s nice to see but we started there so that’s irrelevant really. I also see a link to a TotallySynthetic.com blog posting. Excellent. Clicking on the link I open the page and there it is. Vinblastine. Nice. Oh, and a list of 8 InChIKeys as shown below.
Excellent. Those must represent the structures in the rest of the blog post. Great. I think I’ll see whether those exist in ChemSpider by copy-paste-search in ChemSpider. Okay…2 do, 6 don’t. No problem…as a service to the community I’ll just add the structures we don’t have in ChemSpider but are on TotallySynthetic.com to the database via the new deposition system. But wait…where do I find the chemical structures associated with the InChIKeys on the page? I need them as InChIStrings or SMILES or molfiles or some structure format so I don’t have to go and draw them again to generate the InChIKey.
Do not exist:
Oh dear…literally back to the drawing board. I have to redraw Paul’s chemical structures to regenerate the InChIKeys that are already on the page and already represent the chemical structures he drew. I thought we were supposed to get away from rework???
Here’s the point. Wouldn’t it be much easier if the InChIKey on TotallySynthetic.com could be pasted into a “resolver”, much like the doi is, so that the original structure could be identified, shown, downloaded, saved, reused, not redrawn? Of course it would! ChemSpider is already being used in that way. Earlier this week Rich Apodaca asked me how many daily transactions we have at ChemSpider. Taking the indexing hits into account I had estimated between 1000 and 2000. Sorry, I was wrong. It’s actually closer to 5000 per day now. An increasing number of those are actually people pasting InChIKeys to search the database. This is surprising to me since the InChIKey is so new. Maybe people are just testing? Who knows? But I think with time this will become more popular as the InChI in both of its forms proliferates. The issue concerns InChIKeys generated for structures NOT present in the ChemSpider database. How will they be resolved?
Here we come to the need for the InChIKey resolver. There needs to be a public service whereby people can generate their InChIKeys and then resolve them in the future. When a structure is drawn, uploaded as a structure drawing, or input as an InChIString or SMILES string for the purpose of generating the InChIKey, the molecule needs to be saved to a database and stored with its InChIKey for future lookup. Can this be done via a series of distributed servers? Likely yes. Is this better done using a centralized service? I think so. Why?
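The “save it when the key is generated” idea can be sketched in a few lines of Python using the standard-library `sqlite3` module. Everything named here is hypothetical: the `KeyRegistry` class, the table layout, and above all `placeholder_key`, which stands in for a call to the IUPAC InChI library and is not the real InChIKey algorithm.

```python
import hashlib
import sqlite3

def placeholder_key(inchi_string: str) -> str:
    """Hypothetical stand-in for real InChIKey generation (NOT the real algorithm)."""
    return hashlib.sha256(inchi_string.encode("utf-8")).hexdigest()[:24].upper()

class KeyRegistry:
    """Stores every structure at the moment its key is generated."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS structures ("
            "inchikey TEXT PRIMARY KEY, inchi TEXT NOT NULL)"
        )

    def generate(self, inchi_string: str) -> str:
        """Generate a key and remember the structure it came from."""
        key = placeholder_key(inchi_string)
        # INSERT OR IGNORE keeps the first record if the key is already present.
        self.db.execute(
            "INSERT OR IGNORE INTO structures (inchikey, inchi) VALUES (?, ?)",
            (key, inchi_string),
        )
        self.db.commit()
        return key

    def resolve(self, key: str):
        """Return the stored InChIString, or None for a key never generated here."""
        row = self.db.execute(
            "SELECT inchi FROM structures WHERE inchikey = ?", (key,)
        ).fetchone()
        return row[0] if row else None

registry = KeyRegistry()
water_key = registry.generate("InChI=1/H2O/h1H2")
```

The design choice worth noticing is that generation and storage happen in the same call: a key that was generated elsewhere, and never deposited, still cannot be resolved.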
While a search might give rise to a page such as that at TotallySynthetic.com and InChIKey resolving would allow you to quickly see the associated structures, I think the bigger picture is being missed. How would you SUBstructure search the web? How would you SIMILARITY search the web? Doing this using InChIKeys is simply not possible. The best approach is likely a centralized repository of chemical structures and their associated InChIKeys. This centralized repository of structures can be indexed for searching by substructure and similarity of structure. The results can be viewed and additional searches of the web can be spawned using other search engines. If InChIKeys proliferate across blogs, wikis, open electronic notebooks, Wikipedia pages, publications (both closed and open access) and even institutional repositories, then a centralized system will allow access across these data sources. Filters can be used to differentiate publications from blogs, from closed databases, etc. Clearly, if anyone wants to search on Water as an InChIKey then they’ll be drowning (excuse the pun) in links as there will be “quite a few”. Just in case you missed it I’ll emphasize that the InChIKey is a homogeneous format. So, water, Mw of 18 and formula H2O, is XLYOFNOQVPJJNP-UHFFFAOYAF and erythromycin, with Mw of 734 and formula C37H67NO13, is ULGZDMOVFRHVEP-RWJQBGPGBH. On a page of InChIKeys how would you tell the difference in the structural nature without resolving?
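A quick Python check shows what “homogeneous” means in practice. The pattern below is an assumption inferred only from the two keys quoted above (14 capital letters, a hyphen, 10 capital letters); it is not taken from the InChI specification.

```python
import re

# Pattern inferred from the two keys quoted in the text (an assumption,
# not a statement of the official InChIKey format).
KEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}")

keys = {
    "water": "XLYOFNOQVPJJNP-UHFFFAOYAF",
    "erythromycin": "ULGZDMOVFRHVEP-RWJQBGPGBH",
}

def shape(key: str):
    """Describe a key by the only things visible without resolving it."""
    return (len(key), KEY_PATTERN.fullmatch(key) is not None)

shapes = {name: shape(key) for name, key in keys.items()}
```

Both entries in `shapes` come out identical, so a three-atom molecule and a macrolide antibiotic are indistinguishable on the page until the keys are resolved.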
So, who should build the InChIKey resolving service? Maybe the PubChem team are well positioned to do this? I don’t doubt they have the intellect, the skill sets, the computing power and maybe even the interest. However, I can imagine a certain collision of interests should PubChem step forward to take on this task. How about IUPAC? Well, I think IUPAC would like to see this done but, based on my understanding, they are not really positioned to run a service like this. Maybe it should be a community effort? Well, yes, I agree it should involve the community but the effort needs to be led, managed and overseen by a central body. It also needs to be paid for. Such a system could not be built and managed “for free”. Somebody would pay, whether it be through a granting body, sponsorship, philanthropy or via a combination of free and paid services (as with CrossRef).
There will likely be responses to this blogpost insisting that such an effort “belong to the people”, be based on Open Source components only and be free for use. I agree with the statement that such a service should be free for use in general. I agree that this is how it should belong to the people. People should be able to use the system to generate InChIKeys and resolve InChIKeys and do so without any price barriers. Can it be based on Open Source software? Potentially yes. What does it need? A structure input method (including structure drawing), a database for storage (though a lookup table might suffice), the InChI DLLs from IUPAC for generation and reversal of the structures and InChIs, a structure display tool and a website to host. There are a multitude of Open Source drawing packages already. There is certainly a good choice of Open Source databases to choose from. Structure rendering is not particularly difficult (though generating nice “clean” structures is not an easy task). The InChI DLLs are Open Source of course. So, it CAN be done with Open Source components, it can belong to the community and it can bring many additional benefits to the community when it is done.
Imagine the following set of components as the basis of the theoretical platform: JChemPaint for structure drawing and rendering, MySQL or PostgreSQL for databasing, and the InChI DLLs as the pivotal requirement. There are structure cleaning algorithms available but none are perfect. Maybe what’s available in Open Source could be modified by the team working on this project? Maybe one of the vendors would Open Source their structure cleaning algorithms to the community as part of a philanthropic contribution for the general good? The output of the project could include the “best” structure cleaning algorithm available in Sourceforge for anyone to use.
I judge this project is necessary. I judge the time is now. It’s a full-time job for a small team. It will cost money to run it but not necessarily to use it. Wikipedia does not run for free. They have recently run a fundraising campaign to support their efforts. Their development team is very small as far as I know. But what an impact! I believe a small team of individuals can get this done. It will take dedicated effort and resources. It will require the backing of organizations such as the W3C, IUPAC and certainly the participation of groups such as the Blue Obelisk group and PubChem. There will likely be a lot of politics in leading such an effort but it should not hinder getting it done. There will likely be barriers to attempting to proliferate the InChI as a means of connecting data.
Over the past six months of my sabbatical I have had time to ponder how I would like to contribute to the community. This is it. I would like to lead this effort. I would like to take what has been learned using ChemSpider as a basis and apply it to this project. I want to build a team to get this done and, with the support of the community, provide the platform for hosting a centralized repository of chemical structures and associated identifiers to facilitate development of structure connectivity across the web. I can imagine certain groups, specifically from academia, wanting to jump on the opportunity to lead an effort like this. My belief is that this should be led by a not-for-profit established to deliver on this task and willing and able to call upon the passionate individuals and groups who would like to see this happen. This system should not belong to any particular university, group or entity other than itself. It should be independent of annual grants if possible and get to a place of being self-sustaining. It should establish a board of thought-leaders in this domain to establish the path forward to get this done.
I do not have all the answers. I’m not even sure of all of the questions. There are known challenges and unknown challenges ahead. The InChI is far from all-encompassing of Chemistry and is limited in terms of inorganics, organometallics, polymers, Markush and so on. But this can and will come, I believe. There will be egos involved if we do this. Individual, group and organizational egos. Certain groups are going to feel threatened (and as some have told me already, I should be wearing a Kevlar vest). But this needs to be done. We have made the first step. We have reserved the InChIs.org domain name in case we choose to use it as the resolving domain. This blogpost is a statement of intent to pursue this idea. Maybe this is already being done? Maybe behind closed doors? If it is underway please speak up. I welcome all comments – statements of support, detraction, why it won’t work, why it needs to be done. Let’s start the community dialog here.
For now feel free to use our web services, our InChIKey generation and even the URL structure we have provided for using InChIKeys to probe ChemSpider directly. For example, a URL of http://www.chemspider.com/InChIKey/ with an appended InChIKey takes you directly to the structure. Try this link: http://www.chemspider.com/InChIKey/CXBGOBGJHGGWIE-ACSXSLCXBW
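For anyone who wants to generate such links from a script, here is a minimal Python sketch of the URL pattern just described. The function name `chemspider_key_url` is hypothetical; the base URL and the example key are the ones given above, and the sketch only builds the link without checking that the page responds.

```python
from urllib.parse import quote

# Base URL as given in the text above.
BASE_URL = "http://www.chemspider.com/InChIKey/"

def chemspider_key_url(inchikey: str) -> str:
    """Append an InChIKey to the base URL (hypothetical helper name).

    Surrounding whitespace is stripped and the key is percent-encoded,
    so a sloppy copy-paste still yields a well-formed URL.
    """
    return BASE_URL + quote(inchikey.strip(), safe="")

url = chemspider_key_url(" CXBGOBGJHGGWIE-ACSXSLCXBW ")
```

The resulting `url` is the same link shown above, ready to drop into a blog post next to each key.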