InChIs are a powerful way to communicate chemical structures. They are going to enable internet chemistry, and when we roll out the InChI Resolver shortly the community will have a resource for resolving InChIKeys and, ultimately, navigating chemistry on the web. We commonly receive chemical structures as InChIs, and to deposit them we have to convert the InChIs back into chemical structures, usually in SDF format for batch deposition. For simple organics this is not a difficult process…the tools at our disposal can handle their layout. For some of the structures we receive, however, optimizing the 2D layout is very challenging. Many of the issues come with fullerenes (see examples below), but not only those: carbohydrates, complex cycles and the like are big challenges too.

[Image: clean]

In building the InChI Resolver we hope to provide attractive visual depictions of the associated structures. Without AuxInfo data carrying the coordinates, or deposited SDF files containing the layout coordinates, we have a major challenge ahead of us. AuxInfo data are shown below for erythromycin. These data are rarely produced when people generate InChIKeys, so the issue of structure layout will dominate the interpretation of complex structures.

[Image: auxinfo]

Since beauty is in the eye of the beholder, my judgement is that automatic layout algorithms should only assist in producing an appropriate layout, and eyeballs will need to make the final decision. That is why it is better to deposit SDF files, or InChIs with AuxInfo carrying the coordinates, than to deposit InChIs only and leave the structure layout to an algorithm. It will fail.

I am interested in seeing what people can do with their structure cleaning algorithms on InChIs like this:

InChI=1/C66H103N17O16S/c1-9-35(6)52(69)66-72-32-48(100-66)63(97)80-43(26-34(4)5)59(93)75-42(22-23-50(85)86)58(92)83-53(36(7)10-2)64(98)76-40-20-15-16-25-71-55(89)46(29-49(68)84)78-62(96)47(30-51(87)88)79-61(95)45(28-39-31-70-33-73-39)77-60(94)44(27-38-18-13-12-14-19-38)81-65(99)54(37(8)11-3)82-57(91)41(21-17-24-67)74-56(40)90/h12-14,18-19,31,33-37,40-48,52-54H,9-11,15-17,20-30,32,67,69H2,1-8H3,(H2,68,84)(H,70,73)(H,71,89)(H,74,90)(H,75,93)(H,76,98)(H,77,94)(H,78,96)(H,79,95)(H,80,97)(H,81,99)(H,82,91)(H,83,92)(H,85,86)(H,87,88)/t35u,36u,37u,40-,41+,42+,43-,44+,45-,46-,47+,48u,52-,53-,54-/m0/s1
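To get a feel for what a layout algorithm has to preserve here, the InChI above can be pulled apart into its layers and the stereo (/t) layer tallied. This is only a minimal sketch using plain string handling: the function names `split_layers` and `stereo_parities` are made up for illustration, it assumes a single-component InChI with a formula layer, and it keeps only the first occurrence of each layer prefix. It does not validate the InChI or use any InChI library.

```python
from collections import Counter

# The InChI quoted above, split across lines only for readability.
INCHI = (
    "InChI=1/C66H103N17O16S"
    "/c1-9-35(6)52(69)66-72-32-48(100-66)63(97)80-43(26-34(4)5)59(93)75-42(22-23-50(85)86)"
    "58(92)83-53(36(7)10-2)64(98)76-40-20-15-16-25-71-55(89)46(29-49(68)84)78-62(96)"
    "47(30-51(87)88)79-61(95)45(28-39-31-70-33-73-39)77-60(94)44(27-38-18-13-12-14-19-38)"
    "81-65(99)54(37(8)11-3)82-57(91)41(21-17-24-67)74-56(40)90"
    "/h12-14,18-19,31,33-37,40-48,52-54H,9-11,15-17,20-30,32,67,69H2,1-8H3,"
    "(H2,68,84)(H,70,73)(H,71,89)(H,74,90)(H,75,93)(H,76,98)(H,77,94)(H,78,96)"
    "(H,79,95)(H,80,97)(H,81,99)(H,82,91)(H,83,92)(H,85,86)(H,87,88)"
    "/t35u,36u,37u,40-,41+,42+,43-,44+,45-,46-,47+,48u,52-,53-,54-"
    "/m0/s1"
)


def split_layers(inchi: str) -> dict[str, str]:
    """Split an InChI into {prefix: content} (hypothetical helper).

    Assumes a version and a formula layer are present; only the first
    occurrence of each single-letter prefix is kept.
    """
    body = inchi.split("=", 1)[1]
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for part in rest:
        if part:
            layers.setdefault(part[0], part[1:])
    return layers


def stereo_parities(inchi: str) -> dict[int, str]:
    """Map atom number -> parity mark ('+', '-', 'u', '?') from the /t layer.

    Assumes a single-component /t layer of comma-separated tokens.
    """
    t_layer = split_layers(inchi).get("t", "")
    parities = {}
    for token in t_layer.split(","):
        if token:
            parities[int(token[:-1])] = token[-1]
    return parities


parities = stereo_parities(INCHI)
counts = Counter(parities.values())
print(len(parities), "stereocenters:", dict(counts))
```

For the InChI above this reports 15 stereocenters in the /t layer, 4 of which (atoms 35, 36, 37 and 48) carry the "u" mark rather than a defined "+" or "-" parity.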

The images below show the iterative application of DIFFERENT structure layout algorithms. One caution…your layout algorithm should produce the SAME InChI at the end and NOT flip stereocenters. Interesting challenge. Who says cheminformatics isn’t challenging? And who thought building an InChI Resolver would be easy?

[Images: layout1, layout2, layout3, layout4]
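The caution about producing the SAME InChI can be turned into a simple check: regenerate the InChI from the cleaned structure with whatever toolkit you use, then compare it with the one you started from. The sketch below covers only the comparison step and assumes both strings are already in hand. The names `check_layout` and `changed_layers` are hypothetical, and the two alanine InChIs are small illustrative inputs, not structures from this post.

```python
def split_layers(inchi: str) -> dict[str, str]:
    """Split an InChI into {prefix: content}, first occurrence of each prefix only."""
    body = inchi.split("=", 1)[1]
    version, formula, *rest = body.split("/")
    layers = {"version": version, "formula": formula}
    for part in rest:
        if part:
            layers.setdefault(part[0], part[1:])
    return layers


def changed_layers(before: str, after: str) -> list[str]:
    """Diagnostic only: list the layer prefixes whose content differs."""
    a, b = split_layers(before), split_layers(after)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


def check_layout(before: str, after: str) -> tuple[bool, list[str]]:
    """Return (ok, changed): ok is True only if the full strings are identical."""
    return before == after, changed_layers(before, after)


# Illustrative inputs: L-alanine and its mirror image (assumed example data).
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(check_layout(L_ALA, L_ALA))
print(check_layout(L_ALA, D_ALA))
```

Note that in this small example the inversion shows up in the /m layer while /t is unchanged, which is why the pass/fail decision compares the whole string rather than the /t layer alone.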


5 Responses to “The Downside of InChI Depositions”

  1. Rich Apodaca says:

    @Tony, what do you think of the capabilities of the open source Structure Diagram Generation (SDG) tools discussed in this article?:

    http://depth-first.com/articles/2008/03/26/five-open-tools-for-2d-structure-layout-aka-structure-diagram-generation

  2. Joerg Kurt Wegner says:

    One potential solution could be to host templates of structures or (large) substructures, which would then be used to get a good look-and-feel. This avoids relying on any layout algorithm and takes away the burden of coordinate submission.

    Here too, users could do this in a social way: upload multiple templates and vote for them. There should be a maximum number, e.g. five. By default the highest-ranked template gets used, but users could also ask for something like ‘embed with template ranked fourth’.

  3. Antony Williams says:

    Rich…I haven’t looked at these tools. I am aware of a lot of the blog discussions about them but haven’t compared them myself. My judgment is that they will likely work well on basic organics but, like all cleaning algorithms, will be challenged by the more complex molecules. I think a good test would be to assemble an SDF file of challenging molecules and make it available for testing the algorithms. Are you aware whether any of the 5 open tools you discussed can accept an SDF file uploaded to a server and return an SDF file of cleaned structures? Thanks

  4. Antony Williams says:

    Joerg…In the past 2 days I have been working on publishing some very diverse structures…you’ll see some interesting news shortly. I agree that the templated approach underlying cleaning algorithms is a useful one. However, we don’t use our own cleaning algorithms, and those we do use would need to support the templates. Also, in my opinion this would only have a small impact on the dataset I have been working with. The structures are so diverse that, if they were templated, most would be unlikely to ever see a related compound. So…yes…good idea…worth pursuing…but not the panacea.

  5. Rich Apodaca says:

    @Tony, Noel has a nice series comparing performance with some tough cases – might be interesting to add your tool of choice to the mix to compare:

    http://baoilleach.blogspot.com/2008/10/cheminformatics-toolkit-face-off.html

    With the right glue code, all of the tools should be configurable as web services that can clean the structures in an sd file (or single molfile).

    P.S. – your comments posting follow-up page seems damaged. Actually gives an error saying content not available or some such.
