By now you have likely all heard of a Digital Object Identifier. That’s the little number that follows “doi:” on the publications you author and read. It is the publishing community’s way of “resolving” an article online at a publisher’s website. It also offers a powerful way to search online via any of the search engines: if an article has a DOI, you will likely find it very quickly. The fastest way is to use a resolving service such as CrossRef, whereby the DOI is entered and a lookup directs you to the site at which the article is hosted. CrossRef’s mandate is “to connect users to primary research content, by enabling publishers to work collectively. CrossRef is also the official DOI® link registration agency for scholarly and professional publications. It operates a cross-publisher citation linking system that allows a researcher to click on a reference citation on one publisher’s platform and link directly to the cited content on another publisher’s platform, subject to the target publisher’s access control practices.”

So how does all this relate to InChIKeys? The InChIKey was only introduced recently but already tens of thousands of them can be found in Google searches and on databases. In fact, it’s already tens of millions…ChemSpider has contributed about 20 million of them to the soup of identifiers, and other databases, such as the one Wolfgang Robien offers for searching his NMR data, expose them too. Literally, tens of millions of InChIKeys. Then there are blog sites already using them…for example, totallysynthetic.com (one of my favorite reads). Is this good news? Yes. Is it bad news? Yes.

I’m not sure that people are considering the limitations of InChIKeys. Even some of my friends in the domain have missed the fact that the InChIKey is a hash of the original InChIString. This means that, unlike the InChIString, the InChIKey cannot be reversed to the chemical structure. What does this mean?

The structure of Xanax shown below has an InChIString shown immediately below the structure and the InChIKey, a hash of the string, below that.
inchikey-resolver.png

The InChIString contains details regarding the molecular formula of the molecule and the connectivity information for the atoms (what is connected to what). The InChIString also has additional layers available identifying stereochemistry, mobile protons to deal with tautomerization and other additional layers. All of this has been defined elsewhere and will not be discussed exhaustively here. Let’s just declare that the InChIString can represent a chemical structure in a linear text notation and therefore has value. By inputting the InChIString into an appropriate converter, either online here at ChemSpider or at the desktop through the majority of structure drawing packages or through some other package utilizing the InChI DLLs, the InChIString can be converted to the associated chemical structure. The same is NOT true of the hash. The hash cannot be reversed. While it does represent a concise and homogeneous format for the original InChIString, it can ONLY be used as a look-up for the original InChIString or the original structure.
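To make the one-way point concrete, here is a minimal Python sketch. It is an illustration only: `toy_key` is a hypothetical stand-in built on a truncated SHA-256 digest, not the real InChIKey algorithm, and the two InChIStrings are simply ones quoted in the comments on this post. The point is that a key can be turned back into a string only if someone stored the pair beforehand.

```python
import hashlib


def toy_key(inchi: str) -> str:
    """Toy stand-in for an InChIKey: a truncated SHA-256 digest.

    This is NOT the real InChIKey algorithm. It only illustrates that a
    hash has a fixed length and cannot be run backwards.
    """
    return hashlib.sha256(inchi.encode("utf-8")).hexdigest()[:24].upper()


# InChIStrings quoted in the comment thread of this post.
known_inchis = [
    "InChI=1/ClH.H3N/h1H;1H3",
    "InChI=1/C6H11NO/c8-6-4-2-1-3-5-7-6/h1-5H2,(H,7,8)",
]

# The only way "back" from a key is a table built while the string was in hand.
lookup = {toy_key(s): s for s in known_inchis}


def resolve(key: str):
    """Return the stored string for a key, or None if it was never stored."""
    return lookup.get(key)


stored = resolve(toy_key(known_inchis[0]))      # found: it was registered
missing = resolve(toy_key("InChI=1/H2O/h1H2"))  # None: nobody registered it
```

Every key has the same length whatever the size of the molecule, and a key that was never registered resolves to nothing at all.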

What does this mean for the millions of InChIKeys already floating around cyberspace? Well, unless they can be used to look up the original InChIString or associated chemical structure, the fact is that they cannot be converted. Let’s take a clear example. Paul Docherty’s TotallySynthetic.com is an excellent website for synthetic organic chemists. Paul puts a lot of time into discussing the recent literature around specific syntheses. He spends a lot of time drawing out the structures into a “beautiful format”, draws out the reactions and sometimes the mechanistic details. Very nice. A lot of work and likely of interest to the majority of organic chemists who would happen across his site. Recently I coauthored a paper regarding NMR and Vinblastine and I was interested to see whether there was anything in cyberspace about vinblastine. So, I navigated to Vinblastine in ChemSpider, then clicked on the InChIKey to perform a search using Google (we are considering adding checkboxes to allow searches via Yahoo and Microsoft Live Search…would anyone use them or is Google enough?). Note, I clicked on the “layers” aspect of the InChIKey to search on all stereochemistry, not just the connectivities.
The results were interesting…an immediate, but small, list of hits on the Vinblastine InChIKey. Whoo-hoo. Did I just perform a chemical structure search across the web? Well…..kind of. What we actually just did was a search of a text string which is a hash representation of an alphanumeric string representing a molecule. So, yes and no to the structure search. There are links for Vinblastine on ChemSpider and that’s nice to see but we started there so that’s irrelevant really. I also see a link to a TotallySynthetic.com blog posting. Excellent. Clicking on the link I open the page and there it is. Vinblastine. Nice. Oh, and a list of 8 InChIKeys as shown below.

inchikeys.png

Excellent. Those must represent the structures in the rest of the blog post. Great. I think I’ll see whether those exist in ChemSpider by copy-paste-search in ChemSpider. Okay…2 do, 6 don’t. No problem…as a service to the community I’ll just add the structures we don’t have in ChemSpider but are on TotallySynthetic.com to the database via the new deposition system. But wait…where do I find the chemical structures associated with the InChIKeys on the page? I need them as InChIStrings or SMILES or molfiles or some structure format so I don’t have to go and draw them again to generate the InChIKey.

Exist:

NNRZTJAACCRFRV-ZCFIWIBFBY

CXBGOBGJHGGWIE-ACSXSLCXBW

Do not exist:

HQMRCIGBAXSBEP-CHKWXVPMBQ

IFGQFWXLBCFLAE-VEEOACQBBH

PRJFITCBVYTEFZ-XQRVVYSFBP

QGPJDRFASIBFLH-XDQBUYQUBY

PNSPYPFMBWOZOB-ZSSUYBNLBE

VCGOCYRZQLHTIN-SWPFSWIRBG

Oh dear…literally back to the drawing board. I have to redraw Paul’s chemical structures to regenerate the InChIKeys that are already on the page and already represent the chemical structures he drew. I thought we were supposed to get away from rework???

Here’s the point. Wouldn’t it be much easier if the InChIKey on TotallySynthetic.com could be pasted into a “resolver”, much like the DOI is, so that the original structure could be identified, shown, downloaded, saved, reused, not redrawn? Of course it would! ChemSpider is already being used in that way. Earlier this week Rich Apodaca asked me how many daily transactions we have at ChemSpider. Taking the indexing hits into account I had estimated between 1000 and 2000. Sorry, I was wrong. It’s actually closer to 5000 per day now. An increasing number of those are actually people pasting InChIKeys to search the database. This is surprising to me since the InChIKey is so new. Maybe people are just testing? Who knows? But I think with time this will become more popular as the InChI in both of its forms proliferates. The issue is with InChIKeys generated for structures NOT present in the ChemSpider database. How will they be resolved?

Here we come to the need for the InChIKey resolver. There needs to be a public service whereby people can generate their InChIKeys and then resolve them in the future. When a structure is drawn, uploaded as a structure drawing, or input as an InChIString or SMILES string for the purpose of generating the InChIKey, the molecule needs to be saved to a database and stored with its InChIKey for future lookup. Can this be done via a series of distributed servers? Likely yes. Is this better done using a centralized service? I think so. Why?
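As a rough sketch of the “save it when the key is generated” idea, the following Python uses the standard-library `sqlite3` module. The class name `ResolverStore`, the table layout and the method names are all assumptions made for illustration, not a description of how ChemSpider works. Key generation itself is left to whatever InChI software produced the pair, so the caller supplies both the key and the string.

```python
import sqlite3


class ResolverStore:
    """Hypothetical minimal resolver: stores InChIKey -> InChIString pairs."""

    def __init__(self, path: str = ":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS structures ("
            "inchikey TEXT PRIMARY KEY, inchi TEXT NOT NULL)"
        )

    def register(self, inchikey: str, inchi: str) -> None:
        # Called at the moment a key is generated, so the pair is never lost.
        # INSERT OR IGNORE keeps the first deposit if the key already exists.
        self.db.execute(
            "INSERT OR IGNORE INTO structures (inchikey, inchi) VALUES (?, ?)",
            (inchikey, inchi),
        )
        self.db.commit()

    def resolve(self, inchikey: str):
        # Returns the stored InChIString, or None for an unknown key.
        row = self.db.execute(
            "SELECT inchi FROM structures WHERE inchikey = ?", (inchikey,)
        ).fetchone()
        return row[0] if row else None


store = ResolverStore()
# Key/string pair for ammonium chloride, as quoted in the comments on this post.
store.register("NLXLAEXVIDQMFP-UHFFFAOYAI", "InChI=1/ClH.H3N/h1H;1H3")

found = store.resolve("NLXLAEXVIDQMFP-UHFFFAOYAI")
# One of the keys from the blog page that was never deposited anywhere.
not_found = store.resolve("HQMRCIGBAXSBEP-CHKWXVPMBQ")
```

A real service would store the structure as drawn alongside the string, but the core of the resolver is no more than this keyed lookup.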

While a search might give rise to a page such as that at TotallySynthetic.com and InChIKey resolving would allow you to quickly see the associated structures, I think the bigger picture is being missed. How would you SUBstructure search the web? How would you SIMILARITY search the web? Doing this using InChIKeys is simply not possible. The best approach is likely a centralized repository of chemical structures and their associated InChIKeys. This centralized repository of structures can be indexed for searching by substructure and similarity of structure. The results can be viewed and additional searches of the web can be spawned using other search engines. If InChIKeys proliferate across blogs, wikis, open electronic notebooks, embedded into Wikipedia pages and publications (both closed and open access) and even into institutional repositories, then a centralized system will allow access across these data sources. Filters can be used to differentiate publications from blogs, from closed databases etc. Clearly, if anyone wants to search on Water as an InChIKey then you’ll be drowning (excuse the pun) in links as there will be “quite a few”. Just in case you missed it I’ll emphasize that the InChIKey is a homogeneous format. So, water, Mw of 18 and formula H2O, is XLYOFNOQVPJJNP-UHFFFAOYAF and erythromycin, with Mw of 734 and formula C37H67NO13, is ULGZDMOVFRHVEP-RWJQBGPGBH. On a page of InChIKeys how would you tell the difference in the structural nature without resolving?
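The homogeneity is easy to demonstrate. The short Python sketch below checks the two keys above against a pattern inferred purely from the examples in this post (14 capital letters, a hyphen, 10 capital letters); the pattern and the function name `looks_like_inchikey` are assumptions for illustration rather than an official specification.

```python
import re

# Shape inferred from the keys quoted in this post, not from a formal spec.
KEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}")


def looks_like_inchikey(text: str) -> bool:
    """True if the text has the same shape as the keys shown in this post."""
    return KEY_PATTERN.fullmatch(text) is not None


water = "XLYOFNOQVPJJNP-UHFFFAOYAF"
erythromycin = "ULGZDMOVFRHVEP-RWJQBGPGBH"

# Both keys have the same shape and length, so neither hints at molecular size.
same_shape = looks_like_inchikey(water) and looks_like_inchikey(erythromycin)
same_length = len(water) == len(erythromycin)
```

A three-atom molecule and a macrolide antibiotic are indistinguishable at this level, which is why the key alone tells you nothing until it is resolved.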

So, who should build the InChIKey resolving service? Maybe the PubChem team are well positioned to do this? I don’t doubt they have the intellect, the skill sets, the computing power and maybe even the interest. However, I can imagine a certain collision prevailing should PubChem step forwards to take on this task. How about IUPAC? Well, I think IUPAC would like to see this done but they are not really positioned to run a service like this based on my understanding. Maybe it should be a community effort? Well, yes, I agree it should involve the community but the effort needs to be led, managed, overseen by a central body. It also needs to be paid for. Such a system could not be built and managed “for free”. Somebody would pay, whether it be through a granting body, sponsorship, philanthropy or via a combination of free and paid services (as with Crossref).

There will likely be responses to this blogpost insisting that such an effort “belong to the people”, be based on Open Source components only and be free for use. I agree with the statement that such a service should be free for use in general. I agree that this is how it should belong to the people. People should be able to use the system to generate InChIKeys and resolve InChIKeys and do so without any price barriers. Can it be based on Open Source software? Potentially yes. What does it need? A structure input method (including structure drawing), a database for storage (though a lookup table might suffice), the InChI DLLs from IUPAC for generation and reversal of the structures and InChIs, a structure display tool and a website to host it. There are a multitude of Open Source drawing packages already. There is certainly a good choice of Open Source databases to choose from. Structure rendering is not particularly difficult (though generating nice “clean” structures is not an easy task). The InChI DLLs are Open Source of course. So, it CAN be done with Open Source components, it can belong to the community and it can bring many additional benefits to the community when it is done.

Imagine the following set of components as the basis of the theoretical platform: JChemPaint for structure drawing and rendering, MySQL or PostgreSQL for databasing, and the InChI DLLs as the pivotal requirement. There are structure cleaning algorithms available but none are perfect. Maybe what’s available in Open Source could be modified by the team working on this project? Maybe one of the vendors would Open Source their structure cleaning algorithms to the community as part of a philanthropic contribution for the general good? The output of the project could include the “best” structure cleaning algorithm available in Sourceforge for anyone to use.

I judge this project is necessary. I judge the time is now. It’s a full-time job for a small team. It will cost money to run it but not necessarily to use it. Wikipedia does not run for free. They recently ran a fundraising drive to support their work. Their development team is very small as far as I know. But what an impact! I believe a small team of individuals can get this done. It will take dedicated effort and resources. It will require the backing of organizations such as the W3C, IUPAC and certainly the participation of groups such as the Blue Obelisk group and PubChem. There will likely be a lot of politics in leading such an effort but it should not hinder getting it done. There will likely be barriers to attempting to proliferate the InChI as a means of connecting data.
During the past six months of my sabbatical I have had time to ponder how I would like to contribute to the community. This is it. I would like to lead this effort. I would like to take what has been learned using ChemSpider as a basis and apply it to this project. I want to build a team to get this done and, with the support of the community, provide the platform for hosting a centralized repository of chemical structures and associated identifiers to facilitate development of structure connectivity across the web. I can imagine certain groups, specifically from academia, wanting to jump on the opportunity to lead an effort like this. My belief is that this should be led by a not-for-profit established to deliver on this task and willing and able to call upon the passionate individuals and groups who would like to see this happen. This system should not belong to any particular university, group or entity other than itself. It should be independent of annual grants if possible and get to a place of being self-sustaining. It should establish a board of thought-leaders in this domain to establish the path forward to get this done.
I do not have all the answers. I’m not even sure of all of the questions. There are known challenges and unknown challenges ahead. The InChI is far from all-encompassing of Chemistry and is limited in terms of inorganics, organometallics, polymers, Markush and so on. But this can and will come I believe. There will be egos involved if we do this. Individual, group and organizational egos. Certain groups are going to feel threatened (and as some have told me already I should be wearing a Kevlar vest). But this needs to be done. We have made the first step. We have blocked the InChIs.org domain name in case we choose to use it as the resolving domain. This blogpost is a statement of intent to pursue this idea. Maybe this is already being done? Maybe behind closed doors? If it is underway please speak up. I welcome all comments – statements of support, detraction, why it won’t work, why it needs to be done. Let’s start the community dialog here.

For now feel free to use our web services, our InChIKey generation and even the URL scheme we have provided for using InChIKeys to probe ChemSpider directly. For example, a URL of the form http://www.chemspider.com/InChIKey/ with an appended InChIKey takes you directly to the structure. Try this link: http://www.chemspider.com/InChIKey/CXBGOBGJHGGWIE-ACSXSLCXBW.
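Building such a link programmatically is a one-liner. The Python sketch below only assembles the URL from the pattern described above; the helper name `chemspider_url` is made up for illustration, and nothing here actually contacts the site.

```python
from urllib.parse import quote

# Base address taken from the link pattern described in this post.
BASE_URL = "http://www.chemspider.com/InChIKey/"


def chemspider_url(inchikey: str) -> str:
    """Append an InChIKey to the base address (hypothetical helper)."""
    # quote() leaves letters and hyphens untouched and escapes anything odd.
    return BASE_URL + quote(inchikey.strip())


link = chemspider_url("CXBGOBGJHGGWIE-ACSXSLCXBW")
```

Pasting the resulting link into a browser is equivalent to clicking the example link above.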

32 Responses to “We Need an InChIKey Resolver and We Need It Now”

  1. Rich Apodaca says:

    Antony, interesting idea.

    Why does a Google search of the InChI in your last paragraph:

    CXBGOBGJHGGWIE-ACSXSLCXBW

    not turn up the ChemSpider page:

    http://www.chemspider.com/InChIKey/CXBGOBGJHGGWIE-ACSXSLCXBW

    Reverse lookup of InChI Key is useful, I agree.

    But knowing that regular search engines like Google can faithfully index all InChI Keys in a massive chemical database such as ChemSpider is, IMO, far more important right now for promoting the adoption of InChI Key by publishers, end users, and service providers.

  2. Tom Transue says:

    Tony,

    I absolutely agree that there needs to be a resolver, but to me the solution seems pretty simple from a technical standpoint:

    1) have a repository for InChI strings,
    2) generate the keys for each,
    3) serve the string on request when one provides a key

    There are some parallels here with the password encryption world (where lookup tables are naughty things). If one ignores the chemistry involved, the problem becomes similar, but much simpler because

    a) there are (arguably) fewer valid InChI strings than passwords
    b) there is not a “salt” in the InChIKey generation algorithm

    Therefore, it is pretty simple to just have a list of strings and keys. Even if the list gets to a billion, it should be pretty easy for a computer to serve a string from a key.

    So, I guess the hard part to me is the social problem: who can convince the community that THEIR repository is the one to send their InChI string to? The good news is that in this “database eat database world”, the InChI strings SHOULD get around pretty well, at least as long as InChIKeys don’t completely displace their parent strings (in which case we’ve lost the information that we worked so hard to gather).

    Tom

  3. Antony Williams says:

    Rich, CXBGOBGJHGGWIE-ACSXSLCXBW has simply not been indexed by Google yet. They were indexing us at an average of about 30,000 pages per day (out of 20 million). This one is not indexed. However, this blogpost is ALREADY indexed by Google so it has much higher priority ranking than individual records on the database. No surprise..there is a lot more traffic to the blog than to an individual record.

    So, my comment has immediately shown that while Google CAN faithfully index all InChIKeys in a massive chemical database it doesn’t mean they do. So, this InChIKey, SXCKAQFFUIJSCM-CPXHFIBYBD, is not indexed (until now :-) ) and you would have no chance of determining what it is until it is indexed. Not a good solution…

  4. Antony Williams says:

    Tom,
    All good suggestions….yes, the repository needs to include InChIstrings and associated InChIkeys. BUT, this does assume then that people have tools available to them for converting the InChIstring to a structure and display it in a “good way”. For example, look at the structure of Taxol here: http://www.chemspider.com/Chemical-Structure.10368587.html. Now convert this InChIString for Taxol in your favorite conversion tool…attractive? InChI=1/C47H51NO14/c1-25-31(60-43(56)36(52)35(28-16-10-7-11-17-28)48-41(54)29-18-12-8-13-19-29)23-47(57)40(61-42(55)30-20-14-9-15-21-30)38-45(6,32(51)22-33-46(38,24-58-33)62-27(3)50)39(53)37(59-26(2)49)34(25)44(47,4)5/h7-21,31-33,35-38,40,51-52,57H,22-24H2,1-6H3,(H,48,54)/t31-,32-,33+,35-,36+,37+,38-,40-,45+,46-,47+/m0/s1
    So, I think there is a reason to save “structures as drawn” if possible. Useful but not absolutely necessary.

    Now to your other comments…that there are arguably fewer valid InChI Strings than passwords. Well, there are no numbers in the InChIKey so those can be removed. But, how many people are generating 25 character passwords in reality? Also, read the post listed below and let me know whether you think there are fewer valid InChI strings :-)
    http://www.chemspider.com/blog/how-many-structures-can-you-generate-from-a-molecular-formula.html

    Okay…salts in InChIKey format. InChIKey is a hash of the InChIString. So, if you can generate an InChIString for a salt then no problem. And you can. For ammonium chloride the string and key are: InChI=1/ClH.H3N/h1H;1H3 and NLXLAEXVIDQMFP-UHFFFAOYAI. Check this search in ChemSpider: http://www.chemspider.com/Search.aspx?q=NLXLAEXVIDQMFP-UHFFFAOYAI

    Search this in Google: FAPWRFPIFSIZLT-REWHXWOFAE
    You’ll find an FDA manual already containing InChIKeys..this one includes a number of salts. The code you searched on is sodium chloride.

    In terms of a BILLION strings, review my post on “How Many Structures Can You Generate From A Molecular Formula?” and you will see that 1 billion is TINY. A fraction of what can be generated for a very simple molecular formula.

    The social problem DOES exist. I agree.

  5. Tom Transue says:

    Tony,

    Sorry, the “salt” I was referring to is a term in password encoding to make it harder for people to crack passwords (not to be confused with a chemical term).

    Given that by definition/design hash functions are one-way, the only solution is a lookup table, so regardless of how many structures have InChI strings, if use of InChIKeys is going to be useful, there will have to be a big lookup table (or lots of them, perish the thought) somewhere. I did not mean to suggest that there aren’t or won’t be more than a billion molecules (consider the combinatorics of DNA for a moment, and it becomes clear that there are far more possible molecules than atoms in the universe (even more than a googol), and moreover, we’ll start finding duplicate InChIKeys once we get to 10^20 or 10^30 or so). The good news, I think, is that it will probably take a little while for scientists to conceive of too many billion chemicals for which they need InChI strings to use and deposit somewhere. I’m betting that even a small effort could keep up with demand without much trouble, but I still think that the hard part is coercing the public to send their strings to some central place.

    As to drawings, I think that that is a separate issue from the central look-up table. 2D chemical structure is really only helpful for humans (as far as I can tell), and there are good and bad drawing tools and personal preferences and so on. A side project could be a table of preferred drawings of structures with InChI strings or Keys as a lookup. I’m not saying that this is not a reasonable thing to wish for, but as the point of your post states, we REALLY need a central InChIKey to InChI string lookup.

    Tom

  6. Antony Williams says:

    Tom…there’s a perfect example of the foibles of human communication and different skill sets. “Salt” for me versus “salt” for you. I’m still shaking my head at the misapprehension..but I am newly educated! Thanks.

    Just fyi, from the IUPAC website “There is a finite, but very small probability of finding two structures with the same InChIKey. For duplication of only the first block of 14 characters this is 1.3% in 10^9, equivalent to a single collision in one of 75 databases of 10^9 compounds each.”

    I ABSOLUTELY agree that the 2D structure depictions are a separate issue. Yes indeed. This certainly is an issue for humans only…shame there are so many of us interested in looking at depictions over strings! Yes..it can be a side project. There may be 10, 20, 100 acceptable images for a structure and people could choose the one they wanted from the table of depictions.

    We REALLY need the lookup though. Sounds like a good reason for a lunch meeting…my treat. Ping me offline…let’s discuss.

  7. Rich Apodaca says:

    Antony, any idea which InChIKeys/InChIs have been indexed so far? I’d really like a way to independently confirm that Google is indeed indexing the CS compound summary pages. NMRShiftDB (a much smaller database) has the same problem with its summary page InChIs not getting indexed.

    Clearly the Googlebot isn’t going to index a page it doesn’t know about… So how does Google know about the CS compound summary pages? Sitemap? Something else?

    A Google search for just the text “ChemSpider ID”, which appears on all of the CS compound summary pages, returns just a few hits, and none of them are compound summary pages:

    http://www.google.com/search?hl=en&q=%22ChemSpider+ID%22&btnG=Search

    Not a good sign…

  8. Rich Apodaca says:

    Antony, I answered my own question – partly.

    The second Google result is actually a nested set of about 7,000. You can get there by:

    http://www.google.com/search?hl=en&q=+site:www.chemspider.com+%22ChemSpider+ID%22

    For example, one of the InChIs indexed by Google is:

    http://www.google.com/search?hl=en&q=InChI%3D1%2FC6H11NO%2Fc8-6-4-2-1-3-5-7-6%2Fh1-5H2%2C%28H%2C7%2C8%29&btnG=Google+Search

    The corresponding InChI Key indexed by Google is:

    http://www.google.com/search?hl=en&q=JBKVHLHDHHXQEQ-UHFFFAOYAF&btnG=Google+Search

    It’s a start, but 7,000 is a long way from 30,000 (or even 15 million). I wonder where the other Compound summary pages are…

  9. Wolfgang Robien says:

    Tony; I like your idea of an Inchikey-resolver and I would like to support it. The only questions/remarks I have, deals with the efficiency of such a process. A few facts first:

    At the moment approx. 33 millions of organics are known
    Chemspider holds approx. 21M, PUBCHEM-Compounds approx. 18M structures, which represents 2/3 of known chemistry. I know that within chemspider structure correction is an ongoing process as it is e.g. within my own CSEARCH-project. NMRshiftdb has been severly improved over the last months, etc. – Now all these systems exchange structures ….. ‘A’ gets 10,000 structures from ‘B’, ‘A’ does some corrections and gives its structures to ‘C’. ‘B’ doesnt know A’s corrections and gives also its structures to ‘C’. Now ‘C’ has 2 “versions” of the same structure – in principle you can ignore that for an Inchikey-resolver, but the situation is much more complicated, because CHEMSPIDER, PUBCHEM, etc. have dozens of contributors. I have definitely understanding of data-curation and I know that data-curation is sometimes a work like ‘Sherlock Holmes’ has done, because experimental parts of publications (and especially NMR-assignments) tends to be cryptic. We have a lot of systems in parallel, everybody doing his/her job seriously, spends a lot of time on data-curation. What we need is not 10, 20, maybe 100 structure repositories – each of it is incomplete (see above). What we really need is ONE SINGLE STRUCTURE REPOSITORY ( we live on only ONE PLANET !) – now I also put my kevlar vest on and put 300 feet landmines around my house – we have it, its CAS ! Sorry to say so, but this is the most complete one. When you are interested in a specific structure and you dont find it in Chemspider, emolecules, Pubchem,etc. – what does this tell you. It simply tells you, it is not stored – it DOES NOT TELL: IT DOES NOT EXIST ! I am quite sure I will be (hopefully only virtually) beaten by the community for this statement, but please keep in mind the relationship between ‘new things’ (algorithm, data, new procedures, etc.) and ‘data-curation’ when hosting a large database. The ‘curation-effort’ doesnt linearly increase with the size of the database – its at least a quadratic relationship. 
    What we need is ONE, CENTRALIZED place for structures and ‘retrieval functionality’ (including this inchikey-resolver), which covers the COMPLETE KNOWN CHEMISTRY and NOT hundreds of incomplete and severly overlapping installations.
    Let me know, when I can put off my kevlar vest ;-) )

  10. Rajarshi Guha says:

    Wolfgang – rather than one monolithic repo, why not use a federated approach? Of course it implies that individual databases provide a standardized interface (amongst multiple possible), but that sounds far easier than mandating *one* database for everybody.

    Especially, for the issue of InChI key resolution, such a federated approach, could mandate a very simple REST interface. That way a single repo does not need to keep up with updates from individual databases.

    A central ‘federator’ sounds far more feasible than a central repository.

    Of course this doesn’t address the issue of structures not deposited in a DB (such as those on a blogpost etc)

  11. Wolfgang Robien says:

    An example in order to convince that this (highly desirable) curation-process leads to a lot of confusion (the example below has been identified within only 2 minutes of work):

    Globostellatic acid F: was drawn with C-O-O-H (hydroperoxide) instead of a carboxyl group in NMRSHIFTDB -> the data went to PUBCHEM ( CID: 15938977 / original NMRSHIFTDB-number was 22047)

    Within NMRShiftDB this entry has been corrected: NMRSHIFTDB-number 20093989 and went again to PUBCHEM: CID=11526176

    Do a search on PUBCHEM for the name ‘globostellatic’ – you end up with 2 ‘globostellatic acid F’ structures, one is correct, the other is a hydroperoxide instead of an acid – both are coming from NMRShiftDB …….

    What we need is either ONE, CENTRAL REPOSITORY which covers chemistry completely or at least a certain protocol between the 100s of structure-oriented databases ….. doing the valuable, but time-consuming job of data-curation at many different places with many different levels of sophistication steals time for creative work.

  12. hko says:

    Wolfgang,
    in my opinion, the primary discussion point is the InChIKey resolver
    and not the data errors or the curation of data.

  13. Antony Williams says:

    Rich, regarding your question “Any idea which InChIKeys/InChIs have been indexed so far?” I have no idea really. Considering the number of hits we get on a daily basis from search engines it actually seems to be pretty small. The majority of InChIKeys indexed are only because OTHER pages link to us via InChIKey. For example, http://scitoys.com/ has a lot of links to us. I think the number of indexed InChIKeys is only around 250,000 but that’s a gut instinct.

  14. Antony Williams says:

    Wolfgang, Thanks for the comments…I’ve addressed some here but will expand in a separate post.

    You said: “I know that within chemspider structure correction is an ongoing process as it is within my own CSEARCH-project.” Yes, there is a growing effort now around curation and comments. We have just rolled out an enhanced system this week and I will blog about it shortly once the manual is written. See http://www.chemspider.com/feedbackcurated.aspx

    You said “NMRshiftdb has been severly improved over the last months, etc. – Now all these systems exchange structures ….. ‘A’ gets 10,000 structures from ‘B’, ‘A’ does some corrections and gives its structures to ‘C’. ‘B’ doesnt know A’s corrections and gives also its structures to ‘C’. Now ‘C’ has 2 “versions” of the same structure – in principle you can ignore that for an Inchikey-resolver, but the situation is much more complicated, because CHEMSPIDER, PUBCHEM, etc. have dozens of contributors.” Yes, this is VERY complex. We are also making lots of edits to the PubChem dataset and they are not finding their way back. We end up making edits in ChemSpider and redepositing to PubCHem and withdrawing structures. We are tending NOT to remove any structures from the database but annotating them with information.

    You said “I have definitely understanding of data-curation and I know that data-curation is sometimes a work like ‘Sherlock Holmes’ has done, because experimental parts of publications (and especially NMR-assignments) tends to be cryptic. ” Absolutely yes!!!! There are examples which have taken a couple of hours to work through.

    You said “We have a lot of systems in parallel, everybody doing his/her job seriously, spends a lot of time on data-curation. What we need is not 10, 20, maybe 100 structure repositories – each of it is incomplete (see above). What we really need is ONE SINGLE STRUCTURE REPOSITORY ( we live on only ONE PLANET !) – now I also put my kevlar vest on and put 300 feet landmines around my house – we have it, its CAS ! Sorry to say so, but this is the most complete one.” Yes, I agree. It is the highest quality repository available and certainly the largest collection of quality data. I agree. But, like many chemists, I don’t have access to it. Also, the system is not indexing online materials which are not published. They have started hosting spectral data as you know but these are limited to commercial collections. There is no way to deposit data either. Also, for the purpose of this discussion, they do not support InChI.

    “When you are interested in a specific structure and you dont find it in Chemspider, emolecules, Pubchem,etc. – what does this tell you. It simply tells you, it is not stored – it DOES NOT TELL: IT DOES NOT EXIST ! ” This is also a true statement….not all chemicals studied to date are in the registry. Not all the chemicals in PubChem are in the registry (I think it’s between 1/2 and 2/3). Not all chemicals for sale in the marketplace are in the registry. Until recently prophetic compounds from patents were not supported. Nevertheless, I believe CAS IS the curated standard.

    You said “I am quite sure I will be (hopefully only virtually) beaten by the community for this statement, but please keep in mind the relationship between ‘new things’ (algorithm, data, new procedures, etc.) and ‘data-curation’ when hosting a large database. The ‘curation-effort’ doesn’t linearly increase with the size of the database – it’s at least a quadratic relationship. What we need is ONE, CENTRALIZED place for structures and ‘retrieval functionality’ (including this inchikey-resolver), which covers the COMPLETE KNOWN CHEMISTRY and NOT hundreds of incomplete and severely overlapping installations.” So, what I think you are suggesting is that CAS hosts the InChIKey Resolver. Now that’s a novel idea. A number of CAS people are registered on this blog and are likely reading it. But there’s never been a response from anyone at CAS or ACS, and I expect that will not change with this discussion. But… I like the idea…

    You said “Let me know, when I can put off my kevlar vest ;-) )” I have bulk-purchase pricing if you need another one…. :-)

  15. Wolfgang Robien says:

    ad hko: Yes you are right, the topic is the ‘InChIKey-resolver’ – this works via a lookup (Inchikey–>Inchi–>SD-file/png/CML, etc.). You can generate InchiKeys from every structure collection on the web (CHEMSPIDER, emolecules, PUBCHEM, etc.) and build this table. As I mentioned above, this will cover maybe 2/3 of known chemistry. The other choice is to run isomer generator programs and fill this table ‘artificially’; this will be impossible because of the ‘combinatorial explosion’ of isomer generation. The central question remains: What happens with ‘new’ chemistry – which is poorly reflected in one of these databases (CAS does a systematic extraction of approx. 10,000 journals, delay quite small, ca. days-weeks) – e.g. with CSEARCH I try to be up-to-date too, but there is always a much longer delay between the time of publication and the availability of the spectral data within my collection.

    Again, I like the idea, but I think the ‘statistical success rate’ will be around 2/3 (only).

    I DO NOT insist on ONE, CENTRALIZED solution, but there are a lot of arguments, which lead me to the conclusion, that data-handling becomes much easier, when you have only a few (or even one) installation.

    * With ‘N’ structure oriented databases, we have N**2-N pathways of possible data exchange. If N is very large (as it is), sophisticated data-exchange protocols (and concurrent updates) are necessary. They are not in place at the moment.
    * The investment (=time) for data curation into decentralized, but severely overlapping systems, is too high.
    * What about ‘new’ chemistry – what will be the delay between publication and availability of the data in an ‘InchiKey-resolver’?
    * CAS should also learn that web-publishing is upcoming and structure-deposition is a necessary task.

    I will keep my kevlar-vest on ;-) )
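
    To put numbers on the N**2-N figure above: each of N databases can pass data to each of the other N-1, which gives N(N-1) directed exchange pathways. A minimal Python sketch of that arithmetic follows; the function name is purely illustrative.

```python
def exchange_pathways(n: int) -> int:
    """Directed exchange pathways among n databases: n**2 - n."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * n - n


# How quickly the number of pathways grows with the number of databases.
for n in (2, 3, 10, 100):
    print(n, exchange_pathways(n))
```

    With 100 overlapping installations there are already 9,900 possible directions in which a correction might need to travel.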

  16. Jean-Claude Bradley says:

    How much of a problem this is depends on how you use the InChIKeys and what you expect to get from them. Except for the largest molecules, we still tag our experiment pages with InChIs, links back to ChemSpider and at least one common name, in addition to the InChIKey. This serves our purpose of getting indexed by Google and being able to link our various experiments to each other via a Google search. And because we have a redundancy, someone finding one of our pages with an InChIKey search could still figure out which compound it was based on the associated InChI or common name or link to ChemSpider, even if a resolving service is down.

    Insofar as we need to look up InChIKeys unassociated with any other identifiers, the ChemSpider service has been very helpful. If I understand what you are saying here, you are projecting into the future when demand will grow and another (funded) service will have to be put in place. I agree that is likely to happen, although how quickly I don’t have a good idea. In the meantime I would suggest that anyone using InChIKeys make sure they maintain a local look-up table and not rely exclusively on an online lookup service.
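
    The local look-up table suggested above could be as simple as the following Python sketch. It is only an illustration under stated assumptions: the two entries are the standard InChI/InChIKey pairs for water and methane, and `online_lookup` is a hypothetical stand-in for a remote resolver, not a real service API.

```python
from typing import Optional

# Local InChIKey -> InChI table. The two entries are the standard
# InChI/InChIKey pairs for water and methane; a real table would be
# filled from your own compound records.
LOCAL_TABLE = {
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "InChI=1S/H2O/h1H2",
    "VNWKTOKETHGBQD-UHFFFAOYSA-N": "InChI=1S/CH4/h1H4",
}


def online_lookup(inchikey: str) -> Optional[str]:
    """Hypothetical remote resolver; in this sketch it is always unreachable."""
    raise ConnectionError("resolver unavailable")


def resolve(inchikey: str) -> Optional[str]:
    """Return the InChI for a key, preferring the local table."""
    inchi = LOCAL_TABLE.get(inchikey)
    if inchi is not None:
        return inchi
    try:
        return online_lookup(inchikey)
    except ConnectionError:
        return None  # unknown locally and the online service is down
```

    Keys held locally still resolve when the online service is down; only keys that were never recorded alongside their InChI are lost.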

  17. Antony Williams says:

    Posted on behalf of Stan Young, one of the ChemSpider Advisory Group:

    Frank Burden, among many others, wanted a unique Structure-Number function, Burden 1989. The paper produced a small cottage industry. It turns out the relationship he described is not unique – drat, but similar compounds do map to similar numbers. Bob Pearlman, UTexas, expanded the single Burden number to a set of numbers, BCUTs. They are weird, but they have proven useful for placing a compound into a Chemistry Diversity Space. So we come to the inverse problem. It is not possible to go from the BCUT numbers back to the structure. I think the best that can be done is to have a master server and, as compounds are constructed, the structure and its InChIKey are deposited. There are numerous questions to pose and resolve, but a unique Structure-Number function is unlikely to be possible.

  18. Antony Williams says:

    These questions were posted by Rajarshi Guha on CHMINF-L and are pasted here for discussion:

    My question is: how many molecules are you going to store? More specifically, if I publish some structures on my website and provide the molfile and an InChI key, the resolver has to have my structures deposited, to resolve the key. How do those depositions get performed? Obviously I could deposit them into chemspider etc at the same time I publish them. But what if I don’t?

    Furthermore, say I’m generating a set of structures virtually (this refers to the usefulchem project) – I have 77K or 500K virtual structures, which say are available on my website. Out of these some (say a 100) may be of use. If this small subset is useful I could see depositing the structures by hand into a central repository. Then when a person searches by InChI key for the members of this subset the resolver would find it.

    But what if somebody refers to some other member of the library that was not deposited? The resolver does not find it. This would imply that I should deposit the whole virtual library. This is certainly doable, but is it a good idea? Wouldn’t it just be populating the database with possible junk?

    Following this thinking, one would need one (or both) of two things:

    * a way to enforce (?) that any web accessible structure be deposited in the central repository, so that the resolver can perform the lookup

    OR

    * the repository will contain ‘all’ of chemical space – this sounds infeasible

    On the centralized repository idea, I’m a little leery of this approach. Why couldn’t it be more of a federation layer? This would require that individual repositories store InChI keys and have some type of uniform search interface (REST sounds ideal). That way, rather than a central repository (which might not be all-encompassing), you’d have a federation layer, that really only needs to keep track of the individual databases.

    It seems that this approach would allow for a higher degree of flexibility than a central repo. Naturally, this would require a coordinated effort on the part of the community – but that’s what this is about anyway
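
    As a rough illustration of the federation idea, the Python sketch below keeps track of several repositories and forwards each InChIKey query to all of them, without storing any structures itself. The `Repository` and `Federation` classes are hypothetical; in practice each repository would sit behind a uniform web interface rather than an in-memory dictionary, and the repository names and contents are made up for the example.

```python
from typing import Dict, List, Optional, Tuple


class Repository:
    """Hypothetical repository exposing a uniform InChIKey lookup."""

    def __init__(self, name: str, records: Dict[str, str]) -> None:
        self.name = name
        self._records = records  # InChIKey -> InChI

    def lookup(self, inchikey: str) -> Optional[str]:
        return self._records.get(inchikey)


class Federation:
    """Tracks the individual repositories; holds no structures of its own."""

    def __init__(self, repositories: List[Repository]) -> None:
        self.repositories = repositories

    def resolve(self, inchikey: str) -> List[Tuple[str, str]]:
        """Ask every repository and collect (repository name, InChI) hits."""
        hits = []
        for repo in self.repositories:
            inchi = repo.lookup(inchikey)
            if inchi is not None:
                hits.append((repo.name, inchi))
        return hits


repo_a = Repository("repo-a", {"XLYOFNOQVPJJNP-UHFFFAOYSA-N": "InChI=1S/H2O/h1H2"})
repo_b = Repository("repo-b", {
    "XLYOFNOQVPJJNP-UHFFFAOYSA-N": "InChI=1S/H2O/h1H2",
    "VNWKTOKETHGBQD-UHFFFAOYSA-N": "InChI=1S/CH4/h1H4",
})
federation = Federation([repo_a, repo_b])
print(federation.resolve("VNWKTOKETHGBQD-UHFFFAOYSA-N"))
```

    A key held by several repositories comes back once per repository, so the caller can see where each copy lives; a key held by none returns an empty list.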

    On a side note, I must say that InChI keys are handy to copy/paste/search compared to InChI or SMILES – but in the end they seem to be a kludge. If Google did not mangle InChI strings we wouldn’t need InChI keys for searching purposes. I’d even go so far as to say that InChI keys are syntactic sugar. But I’d be happy to hear otherwise.

    PS. The last paragraph suggests that an extremely useful project would be a chemistry aware crawler. I wonder whether that’d be possible? Certainly the tools are there

  19. Joerg Kurt Wegner says:

    @Antony,

    yes, hash keys cannot be transformed, but a DOI cannot create an article either. The key is that it links to curated, peer-reviewed articles. This is its biggest difference to InChIKeys: they are not cleaned, curated, or confirmed.
    In an ideal world curated information would create a recalculatable index, where the confirmation-flag would either be part of the InChIKey or there would be a list of confirmed and curated InChIKeys.
    For me the bonus of CAS numbers is that they are curated and in that sense clean. And though CAS contains lots of information, it is likely that a ‘user-centric’ approach, better known as social software, will soon outperform CAS.
    So, I recommend two things:
    1. Create an approved flag for structures, which might be curated by users. This can then be kept in a list maintained in addition to the InChIKeys. In the ideal case, each InChIKey calculation would double-check with a central service (PubChem, IUPAC, Chemspider, or mirrors of those) to check if the structure exists already in one of those lookup services. If not, it might be added? I am happy to live in an online world. Maybe someone should talk to the IUPAC guys, they have indeed forgotten the storage concept in their approach !
    2. Provide/Push an InChIKey lookup similar to a DOI lookup. Here also, people would like to end up on a curated page, so whoever provides this service, people should be able to correct things. In contrast to peer-reviewed articles, we are living in the post-structure-review era ! Now, this principle should be pushed and supported by all services somehow, direct or indirect.

    @Wolfgang,

    keep the vest and open the cellar door, I might join with my cyber dragon and magic wand. (I am a fantasy fan and believe sometimes in miracles)
    I read somewhere that the only way of reaching independence is centralization. This might sound strange, but this is exactly the principle behind DOI or CAS numbers. Another good description for all other services out there might just be, ‘get the hell organized you stupid scientists, it’s not creative inventing your own InChIKey, but using it and creating creative forms of usage for it’.

  20. Antony Williams says:

    Responses to Rajarshi Guha :

    Q> My question is: how many molecules are you going to store?

    A> It has to be many millions. The largest public databases (NOT Virtual databases) are around 20 million. Sure, we could add many millions of virtual compounds and maybe someone would do that. I would suggest that the majority of large virtual libraries will likely not have InChIKeys generated as there will be little reason to communicate the virtual library to anyone…it will be project specific. So, how many novel structures per year deposited on top of the foundation collection of PubChem/ChemSpider… maybe >100,000 and …

    Q> More specifically, if I publish some structures on my website and provide the molfile and an InChI key, the resolver has to have my structures deposited, to resolve the key. How do those depositions get performed? Obviously I could deposit them into chemspider etc at the same time I publish them. But what if I don’t?

    A> Well, you can deposit at anytime..later if you choose. A web service can be set up to receive the InChIStrings and InChIKeys en masse, or simply an SDF file could be received and InChIKeys generated en masse and returned to the depositor and deposited to the system. And it’s opt-in…nobody has to deposit.

    Q> Furthermore, say I’m generating a set of structures virtually (this refers to the usefulchem project) – I have 77K or 500K virtual structures, which say are available on my website. Out of these some (say a 100) may be of use. If this small subset is useful I could see depositing the structures by hand into a central repository. Then when a person searches by InChI key for the members of this subset the resolver would find it. But what if somebody refers to some other member of the library that was not deposited? The resolver does not find it. This would imply that I should deposit the whole virtual library. This is certainly doable, but is it a good idea? Wouldn’t it just be populating the database with possible junk?

    A> Who defines junk? Tricky question. There is a lot of “interesting stuff” on PubChem and ChemSpider but neither is judged for it (well, at least PubChem isn’t). Enormous libraries could in theory be deposited, yes, but is there value? Well, we have hosted 70,000 virtual compounds for UsefulChem already at JC’s request. Are they junk? He needs to be asked. We did not host the 500,000. We have been offered other virtual libraries and said no “for now”. The UsefulChem Virtual Library was to support the particular project and as proof of concept.

    Q> Following this thinking, one would need one (or both) of two things:
    * a way to enforce (?) that any web accessible structure be deposited in the central repository, so that the resolver can perform the lookup
    OR
    * the repository will contain ‘all’ of chemical space – this sounds infeasible

    A> The repository will be opt-in. I doubt that commercial entities will add their structures and InChIKeys to the system because of IP issues. So enforcing is a no-go. We cannot get to the place of the DOI….when a DOI exists by default it has been registered.

    Q> On the centralized repository idea, I’m a little leery of this approach. Why couldn’t it be more of a federation layer? This would require that individual repositories store InChI keys and have some type of uniform search interface (REST sounds ideal). That way, rather than a central repository (which might not be all-encompassing), you’d have a federation layer, that really only needs to keep track of the individual databases.
    A> Indeed…why not! A federation layer would be an excellent approach.

    Q> It seems that this approach would allow for a higher degree of flexibility than a central repo. Naturally, this would require a coordinated effort on the part of the community – but that’s what this is about anyway
    A> Agreed. My statement is it is necessary to be proactive about this now/soon. And community support of the effort whether central or federated is necessary.

    Q> The last paragraph suggests that an extremely useful project would be a chemistry aware crawler. I wonder whether that’d be possible? Certainly the tools are there
    A> Indeed. Watch this space.

  21. hko says:

    Naive question about the InChI string: generating InChIs with ChemSketch (ACD Version 11), I get the SAME InChI for some (all ???) tautomeric structures.

    I have used structure:
    ChemSpider ID: 521745
    Empirical Formula: C14H12N4O
    AND the other pyrazole tautomer.
    I got the same InChI for both structures.

    Does InChI not distinguish between tautomers ???
    If this were true then I would get the SAME InChIKey for different tautomers.

  22. Rajarshi Guha says:

    In response to Stan’s post, it does appear that BCUT descriptors can be used to go back to the original structure (or a very similar structure). Pearlman recently put out a paper describing this – http://dx.doi.org/10.1021/ci600383v

  23. Richard says:

    Isn’t it just good practice for anyone publishing an InChIKey to publish the InChI in the same place? As the InChIKey is a fix for a particular problem (text search engines) why promote its general use beyond its capabilities? Find InChIKey, use InChI !

  24. Antony Williams says:

    Richard, yes it is a good practice. But not everyone is doing it. So maybe the request is this easy: post them together. But I don’t think this will ever happen exhaustively, though we can wait and see.

    Are the InChIStrings and Keys associated with Project Prospect available as an aggregate anywhere by the way?

  25. Antony Williams says:

    Posted on behalf of Martin Walker, one of our advisory group.

    I think an InChIString crawler could be a big part of the solution (though it fails if people post ONLY the InChIKey). It could record every InChIString it finds on the Internet (along with the link – to enhance the service), and then generate an InChIKey for that InChIString. It would be designed so that anyone searching for the InChIKey via a DOI-type lookup would be directed to an entry that would list the InChIKey and InChIString, as well as relevant links and ideally an automatically generated structure as well. I think ChemSpider could probably operate such a service very well, once the server was bought and the software written.

    I know someone briefly mentioned a crawler, is this what they had in mind?
    Is this a reasonable suggestion? I even have a great name for it – “ChemSpider” – oh dear, that domain’s been taken already…..

  26. Rajarshi says:

    Martin, yes that’s what I had in mind. I’ve been looking at Nutch (http://lucene.apache.org/nutch/) as a candidate crawler. It’s pretty featureful, but for the purposes of an InChI crawler, the problem is much simpler than generic web crawling – no need to make text indices of the page contents. Rather, it boils down to simply identifying InChIs on the page, storing the (link, InChI) tuple in a DB, and generating an InChI key as well.

    Of course I have not worked with web crawlers before, so my view might be naive!
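
    The extraction step described above can be sketched in a few lines of Python. This is an illustration only, not Nutch code: the regular expression is an assumed heuristic for spotting InChI strings in page text, the example URL and page are made up, and fetching pages, storing the tuples in a database and generating the InChIKey (which needs InChI software) are all left out.

```python
import re
from typing import List, Tuple

# Assumed heuristic: "InChI=1/" or "InChI=1S/" followed by characters that
# are not whitespace, quotes or angle brackets. Real pages may need more care.
INCHI_PATTERN = re.compile(r"InChI=1S?/[^\s\"'<>]+")


def extract_inchis(url: str, page_text: str) -> List[Tuple[str, str]]:
    """Return a (link, InChI) tuple for every InChI string found in the text."""
    results = []
    for match in INCHI_PATTERN.findall(page_text):
        # Drop sentence punctuation that may directly follow an InChI in prose.
        results.append((url, match.rstrip(".,;")))
    return results


page = "<p>Water: InChI=1S/H2O/h1H2</p> <p>Methane: InChI=1S/CH4/h1H4.</p>"
for link, inchi in extract_inchis("http://example.org/page", page):
    print(link, inchi)
```

    Each tuple would then be written to the database together with the InChIKey computed from the InChI.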

  27. Richard says:

    Q> Are the InChIStrings and Keys associated with Project Prospect available as an aggregate anywhere by the way?

    A> Not yet…

  28. Alan McNaught says:

    Tony

    The InChI team welcomes enthusiastically the initiative from ChemSpider to look for ways of providing an InChIKey resolver service; we’re grateful to you for starting the ball rolling. We had hoped that the chemical information community would pick up the idea of developing InChI/InChIKey look-up facilities; indeed this would represent an enormous increase in accessibility for web-based chemical information. IUPAC itself is not best constituted to develop this kind of service. Sorry we didn’t get round to expressing our support earlier.

  29. David Bradley says:

    I’d much prefer there to be a single, simple system that, as a journalist, I could use without having to worry about all the technicalities, like DOI…

    …incidentally, CrossRef just published a WordPress plugin that can help you do DOI look-ups and paste citations in blog posts

    http://www.sciencebase.com/science-blog/search-and-cite-for-science-bloggers.html

    Perhaps a plugin like this that worked with Chemspider would allow Inchikey and Inchi to become transparently interchangeable and so more widely useful. Like I say I haven’t gone into the tech of any of this, just want to be able to start working with a simple solution ;-)

    db

  30. alf says:

    ChemSpider is already an excellent InChIKey resolver, inasmuch as all a resolver has to do is reverse lookup of InChI strings from InChI keys.

    If the only problem is that there are InChI strings floating around that aren’t in ChemSpider, then that’s a separate problem – it means people aren’t depositing their structures in the right repositories, doesn’t it?

  31. Antony Williams says:

    Alf, Thanks for the comments. Yes, I think ChemSpider, as is, offers a certain level of InChIKey resolving. That said, I know that we are likely only half way there. There are quality issues we need to resolve with the entire process, we need to improve the workflows for deposition as well as some of the checking procedures and reporting of structural issues during deposition. We also need to build web services to facilitate deposition from other services. There’s a lot of work to be done. Also, we need to address the issue of 24/7 uptime and reduce dependency on Internet Service Providers and a support system of only two servers and one backup.

    There are definitely InChIStrings floating around which are not on ChemSpider yet and we are working on a crawler to find them and deposit them. We would also like people to deposit structures via our structure deposition system described here: http://www.chemspider.com/docs/Single_Structure_Depositions_on_ChemSpider.pdf

  32. Egon Willighagen says:

    The proper URL for the JChemPaint website is:

    http://jchempaint.sourceforge.net/

    Antony, can you please update the link in this blog item?
