Ian Mulvany recently posted on a lunch meeting with Egon Willighagen over at Nascent. Egon’s a member of our Advisory Group, is very supportive of our efforts, and provides great feedback on our questions. We haven’t yet met…but I look forward to sharing lunch with him one day…

I wish I’d been at that lunch, as I’d have had some comments to add. I’ve extracted passages from Ian’s post below, quoted his words, and added my comments after each one.

“One solution to marking up molecules is to use an InChI (an IUPAC International Chemical Identifier). These have been championed by Peter Murray Rust and there is an extensive InChI FAQ available. The short story is that an InChI is a character string which uniquely describes a chemical substance. From any chemical structure you can generate an InChI.”

AW> Peter has been a great advocate and champion for InChI and has definitely evangelized the value. But we should not forget those who have pushed the development and executed on delivering it. Specifically, Steve Heller, Steve Stein, Dmitrii Tchekhovskoi (all associated with NIST) and Alan McNaught (associated with IUPAC). The InChI was originally called the IUPAC-NIST Chemical Identifier. I’ve spoken previously about heroes and these people are truly the heroes of InChI. The rest of us get to use it, talk about it and celebrate it…they had the vision AND executed on it.

“Peter has a writeup on using InChI in blogs, and if every chemical that appeared everywhere was somehow marked up with its InChI, or the article referring to it tagged with them, then the findability problem would be solved by simple string searching.”

AW> Yes, this is true. BUT it is limited. And people don’t appear to be talking about the limitations. Chemists don’t necessarily want to search only on an exact structure (and don’t get me started on all of the various layers that can be added to an InChIString – stereo, fixed hydrogens etc). They may want to search on substructure and similarity of structure, and InChIs are going to have to be aggregated to allow this … I have blogged about an approach and Egon could help get us there!
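
To make the layering point concrete, here is a minimal sketch in plain Python. It assumes the two alanine InChIs shown are correct and uses nothing but string handling: the helper names (split_layers, strip_stereo) are made up for this illustration and are not part of any InChI toolkit, and the code does no validation. The two enantiomers fail an exact string match, yet match once the stereo layers are dropped.

```python
# Sketch only: plain string handling, no validation, not the official InChI software.
# Layer prefixes treated as stereo here: b (double bond), t and m (tetrahedral), s (type).
STEREO_PREFIXES = ("b", "t", "m", "s")


def split_layers(inchi):
    """Split an InChI string into (version, formula, list of remaining layers)."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    parts = inchi[len("InChI="):].split("/")
    if len(parts) < 2:
        raise ValueError("InChI has no formula layer")
    return parts[0], parts[1], parts[2:]


def strip_stereo(inchi):
    """Return the InChI with every stereo layer removed (hypothetical helper)."""
    version, formula, rest = split_layers(inchi)
    kept = [layer for layer in rest if not layer.startswith(STEREO_PREFIXES)]
    return "InChI=" + "/".join([version, formula] + kept)


l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(l_alanine == d_alanine)                              # exact string search misses
print(strip_stereo(l_alanine) == strip_stereo(d_alanine))  # connectivity-only match hits
```

Dropping layers only loosens an exact match. It is still nowhere near a substructure or similarity search, which needs the structures themselves rather than their identifier strings.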

“Egon suggested as a solution that journals should require papers dealing with chemicals to include InChIs. He said that every tool for drawing chemicals (standard issue for anyone writing a paper on the subject) can now output the InChI with the click of a button <…> you are Nature, you can make authors do anything in order to get a paper published so why not get them to do x. Well, for a start, that’s an editorial decision,<…>

“Journals are naturally shy of any step that can delay the publication time of an article, and so I am also skeptical that we would see such obligatory requirements. Better, I think, to have this step as a voluntary one. Practically all journals allow supplementary information and I am sure all of them would accept InChI as supplementary information.”

AW> I agree with Egon…I’ve written almost a dozen peer-reviewed articles this year. The instructions for authors demand systematic nomenclature and the authors are responsible for it. Demand InChI. Alternatively, the majority of papers have structures embedded as OLE-compatible objects. Develop a tool (not difficult) to generate InChIs from them. By the way, the InChIs COULD be embedded directly inside a PDF (I managed a product that generated PDF files that were STRUCTURE-SEARCHABLE, as well as images that were structure searchable). Yes, there is work to be done BUT it can be done. The challenge, I believe, is to get the primary societies to throw down the gauntlet. RSC are already using InChIs in Project Prospect. If Chemical Abstracts Service were to utilize and index InChIs, the American Chemical Society might be very interested in requiring InChIs for their manuscripts, whether directly embedded in the documents or as supplementary information. Paul Docherty over at TotallySynthetic has started tagging his posts with InChIKeys…not InChIStrings (I’ve talked about the value of this here and here).

“So what can we do now to help making connections between papers and molecules? Peter Corbett, who works with Peter Murray Rust, is working on automated methods of getting computers to read chemistry papers and output semantic markup of them.”

AW> Over at ChemSpider we are working with Will Griffiths, who developed ChemRefer. We have already extracted tens of thousands of chemical names and will be linking them up to ChemSpider structures to enable Open Access papers to be structure/substructure searchable. However, we’ve hit a bit of a hurdle…more details on this will follow shortly, but we have been asked to remove from the ChemRefer index thousands of articles that were indexed according to what we believe is a standard search engine policy. During our conversation today with the publisher, the conversion of chemical names to chemical structures to provide a structure-searchable index of the articles was deemed to be “re-purposing” of the Open Access articles and is NOT allowable. Peter Corbett and Peter Murray Rust are engaged in similar activities so will likely run into the same challenges. If they manage to get around this issue with this and other publishers then they will be working in a “permissive” role where they will need to get permission from publishers to perform semantic markup. Their semantic markup is also “re-purposing”. The “permissive challenge” is far away from Peter’s stance in terms of Open Data for all.

“Egon has now created RDF pages for molecules on openmolecules.net. These pages use the InChI in their structure, and now each molecule has its own web page.”

AW> We are now working with Egon to RDF our own ChemSpider pages. Watch this space…

“Egon’s pages check Connotea, and pull from Connotea co-tags of InChI tags (Here is a short description of this). If we work on this a bit more we should be able to set up a system where if you tag a paper with an InChI, that paper could appear on Egon’s pages.”
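
The tagging idea can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the bookmark records, the paper identifiers and the index_by_inchi helper are hypothetical, and this is not Connotea’s data format or API. The point is only that once papers carry InChI tags, grouping them by molecule is a simple exact-string lookup.

```python
from collections import defaultdict

# Hypothetical bookmark records: (paper identifier, list of tags).
# This is NOT Connotea's real data format, just an illustration.
bookmarks = [
    ("paper-A", ["InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "solvents"]),
    ("paper-B", ["InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H", "aromatics"]),
    ("paper-C", ["InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "InChI=1S/CH4O/c1-2/h2H,1H3"]),
]


def index_by_inchi(records):
    """Map each InChI tag to the papers carrying it; other tags are ignored."""
    index = defaultdict(list)
    for paper, tags in records:
        for tag in tags:
            if tag.startswith("InChI="):
                index[tag].append(paper)
    return dict(index)


index = index_by_inchi(bookmarks)
ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
print(index[ethanol])  # ['paper-A', 'paper-C']
```

A molecule page would then only need to look up its own InChI in such an index to list the papers tagged with it.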

AW> Not only Egon’s pages…we will index directly into ChemSpider also. The molecules will become part of an index of close to 20 million structures, including analytical data. It is one big web of chemistry, it is all coming together now, and Egon is a good guy to have lunch with. Wish I was there….

5 Responses to “One Day I’ll Have Lunch with Egon Willighagen Too…”

  1. Egon Willighagen says:

    In reply to:

    “During our conversation today with the publisher the conversion of chemical names to chemical structures to provide a structure searchable index of the articles was deemed to be “re-purposing” of the Open Access articles and is NOT allowable.”

    Antony, might you clarify the issue of OpenAccess and not being able to extract molecules? I am sure the Open-licensed papers with the CC license will allow that. So, I assume you refer here to more mainstream chemistry publishers which happen to do OpenAccess in the way they define that. Or… ?

  2. Antony Williams says:

    All of this will become clear in a separate posting. At present Will and I are summarizing our conversations with the publisher to ensure that we both agree what was said and agreed to during the conversation.

    There are too many presumptions and judgments being made right now in our limited blogosphere, certainly that you and I frequent. I want to ensure that what I “heard” was what was said before I go and taint a publisher.

    I believe the issue is going to come down to our interpretation of “Access” and the difference between Free and Open Access, as well as the policies that exist around spidering on different websites. Please give us time to deal with this in an appropriate and polite manner. My intention is to figure out how to co-exist with publishers and abide by policies. It does not mean I will not challenge policies, but when they exist and we are asked to respect them we will do so. I’ve already been beaten up over why ChemSpider is a “.com” and have explained that we live in a litigious society. I do not want to get into legal battles…we cannot afford it…either for our reputation or financially.

  3. Christopher Singleton says:

    A quick question for the more knowledgeable readers out there: why was the InChI developed, instead of just using SMILES?

  4. Antony Williams says:

    As far as I recall from the early conversations, where I actually suggested that we should just approach Daylight and ask them to make SMILES open and let us extend it, the decision was that it was a proprietary format. There are likely some politics in there but I just don’t recall them.

    Interestingly multiple years later EXACTLY this is underway now around SMILES being Open and a team extending it…see this discussion:


  5. Richard Kidd says:

    A problem with SMILES is that you can get more than one string from a compound.

