We’ve rejigged our data to make searching more reliable.

What have we done?

We’ve regenerated all of the InChIs in the database with version 1.03 of the InChI code.

What does that mean?

The InChI (IUPAC International Chemical Identifier) is a short piece of text that describes the structure of a molecule. Each one is generated by a free and open-source computer program, so the same molecule should always get the same InChI and there shouldn’t be conflicting InChIs for it. You can’t really write them by hand, because they look like this:

InChI=1S/C10H22ClN2O5PS/c1-3-10(9-18-20(2,15)16)12-19(14)13(7-5-11)6-4-8-17-19/h10,12H,3-9H2,1-2H3
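
To make that string a little less opaque, here is a minimal Python sketch that splits an InChI into its slash-separated layers. It is only an illustration, not part of the InChI software or of ChemSpider: the function name split_inchi is hypothetical, and it assumes the usual layout in which the version comes first ("1S", where the S marks a standard InChI), then the chemical formula, then layers that each start with a lowercase letter ("c" for connectivity, "h" for hydrogens).

```python
EXAMPLE = (
    "InChI=1S/C10H22ClN2O5PS/"
    "c1-3-10(9-18-20(2,15)16)12-19(14)13(7-5-11)6-4-8-17-19/"
    "h10,12H,3-9H2,1-2H3"
)


def split_inchi(inchi):
    """Split an InChI string into version, formula and prefixed layers.

    Hypothetical helper for illustration only; it does no chemical validation.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")

    # Everything after "InChI=" is a list of layers separated by "/".
    version, *layers = inchi[len(prefix):].split("/")
    layers = [layer for layer in layers if layer]

    # The formula layer has no lowercase prefix letter; every other layer does.
    formula = None
    if layers and not layers[0][0].islower():
        formula = layers.pop(0)

    return {
        "version": version,
        "standard": version.endswith("S"),
        "formula": formula,
        # Kept as (prefix, body) pairs because a prefix can occur more than once.
        "layers": [(layer[0], layer[1:]) for layer in layers],
    }


parts = split_inchi(EXAMPLE)
print(parts["version"], parts["formula"])
for layer_prefix, body in parts["layers"]:
    print(layer_prefix, body)
```

For the example above this reports the formula C10H22ClN2O5PS followed by one "c" layer and one "h" layer.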

ChemSpider is built on InChIs. If two molecules have the same InChI, then they’re the same record in ChemSpider, and if you can’t InChIfy it, you can’t put it in ChemSpider. That’s why we can’t do, for example, polymers yet.
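
As a rough illustration of "same InChI, same record", the idea can be sketched as a lookup keyed on the InChI string. This is not ChemSpider's actual code: the function group_by_inchi and the deposition data are made up for the example, and the InChIs used are the standard ones for ethanol and water.

```python
from collections import defaultdict

ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
WATER = "InChI=1S/H2O/h1H2"


def group_by_inchi(depositions):
    """Group (inchi, source) pairs so that each distinct InChI is one record.

    Hypothetical sketch: entries with no InChI are skipped, mirroring the
    rule that a structure that cannot be InChIfied cannot be registered.
    """
    records = defaultdict(list)
    for inchi, source in depositions:
        if not inchi:
            continue  # e.g. a polymer, for which no InChI was generated
        records[inchi].append(source)
    return dict(records)


depositions = [
    (ETHANOL, "depositor A"),
    (WATER, "depositor A"),
    (ETHANOL, "depositor B"),   # drawn separately, same InChI, same record
    (None, "depositor C"),      # no InChI available, so not registered
]

records = group_by_inchi(depositions)
for inchi, sources in records.items():
    print(inchi, sources)
```

Here the two ethanol depositions end up in a single record, and the entry without an InChI is left out.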

We’re proud to be founder members of the InChI Trust, which supports this critical element in the sharing of chemical compound information.

What does all this mean for ChemSpider?

Because there is an active community supporting InChI that looks out for these things, version 1.03 contains some bug fixes. As a result, a very small number of the InChIs themselves have changed: only a few dozen out of the whole database.

  • P+–O and P+–S bonds are now treated slightly differently. This means that it will be easier to find the exact molecule you’re looking for, regardless of how it’s been drawn. (In principle this will also apply to analogous bonds containing arsenic, selenium, tellurium and antimony, but I can’t see any examples of this in the database.)
  • There was a small bug where the InChI generated for a molecule with an azide group in it sometimes varied according to the input drawing. That no longer happens.

This regeneration has also allowed us to catch and clean up some errors in the data.

What happens next?

Version 1.04 of the InChI code will be released soon. With our new framework for processing large amounts of data we’ll be able to update our InChIs much more quickly. The main changes in 1.04 that affect the InChI are to how it handles radical atoms in aromatic rings and the elements nobelium, lawrencium and rutherfordium, so we anticipate that there shouldn’t be very many changed InChIs!

7 Responses to “We’ve updated our InChIs”

  1. Egon Willighagen says:

    “If two molecules have the same InChI, then they’re the same record in ChemSpider”

    Using what InChI options? Is that based on the Standard InChI, or an InChI with the FixedH option? Or with any of the optional tautomer options?

  2. Colin says:

    Dear Egon,

    Thank you for your swift response.

    The non-standard option we use is RecMet.

    We certainly don’t use KET or 15T because they’re still only experimental.

    Best wishes,
    Colin.

  3. Robert Kiss says:

    Dear Colin,

    While I totally agree that InChI is the best choice for database registration at the moment, there are some cases where InChI alone might not be enough. I list some of these cases in our blog (blog.mcule.com). Can I ask how you handle protomers with different InChIs for example? Also, do you apply any specific rules to avoid “double-registration” of tautomers due to simple/hard (de)protonation discrepancies?

    Can I also ask why you don’t prefer using the optional tautomer rules? Is it just because the InChI manual says it is in an experimental phase? According to the answers I got on the Blue Obelisk exchange forum (http://blueobelisk.shapado.com/questions/why-inchi-does-not-support-keto-enol-and-1-5-tautomerism-by-default) and on the InChI-discuss list, they are not part of the standard InChI, because the incorporation of these rules would cause a dramatic change in standard InChIs. There seems to be no harm, however, using them for database registration.

    Regards,
    Robert

    Robert Kiss
    http://mcule.com

  4. Colin says:

    Hello Robert,

    We’re not doing anything with protomers that have different InChIs at the moment, but we will be doing more normalization in the not-too-distant future as part of the Open PHACTS project.

    The optional tautomer rules will take a lot of work to investigate and validate against our records!

    However, 1,5-tautomerism, for example, does matter a lot for things like acac complexes, which we’re looking at in the InChI working group on coordination complexes and organometallic chemistry. So probably a lot of this will come out in the wash, so to speak, when we work out how InChI should handle that area of chemistry properly.

    I should add that when ChemSpider (and Prospect) started, the tautomer options were only accessible if you recompiled the InChI source yourself and set the appropriate options. They were only openly released later. There’s some experimental code there to do ring–chain tautomerism as well, but that’s still, if I remember rightly, something you can only switch on as a compilation option.

    Best wishes,
    Colin.

  5. Robert Kiss says:

    Thanks Colin! I totally agree it is very difficult to change even a single step in your registration process, as this can result in ID changes and might confuse your users. As for mcule, at the moment we are in a much better situation as we can make as many changes as we want before we launch :)
    BTW, can you give us some hint when the 1.04 InChI version which you were referring to above will become available?
    Another question arising from the RecMet option: applying this InChI option means you distinguish organometallics where the metal was connected to its ligands from those where they were not connected?
    Regards,
    Robert

  6. Colin says:

    I can’t say any more about 1.04 than “soon”. As for RecMet, you’re quite right. If you search for LDA, say, in ChemSpider, you’ll find two entries.

    Best wishes,
    Colin.

  7. Colin says:

    I can say more about 1.04! It’s out now: http://www.inchi-trust.org/index.php?q=node/14
