In January of this year IUPAC announced the release of version 1.02 of the InChI algorithms and software with the Standard InChI and InChIKey. The definition is given below.

Standard InChI

In response to user requests, a Standard InChI (i.e. without options for properties such as tautomerism and stereoconfiguration) has been defined as follows:

  • Standard InChI is for the purposes of interoperability/compatibility between large databases/web searching and information exchange.
  • Standard InChI and non-standard InChI are always distinguishable.
  • Standard InChI is a stable identifier; however, periodic updates may be necessary; they are reflected in the identifier version designation, which is included in the InChI string.
  • Any shortcomings in standard InChI may be addressed using non-standard InChI (currently obtainable using InChI version 1.02-beta).
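As a rough illustration of the "always distinguishable" point, the sketch below tells the two flavours apart by their prefix. It assumes, based only on the example strings shown later in this post, that a standard InChI starts with "InChI=1S/" while a non-standard one starts with "InChI=1/". The function name `classify_inchi` is hypothetical and is not part of the InChI software.

```python
import re

# Assumption (taken from the example strings in this post): the version
# number follows "InChI=", and an "S" right after it marks a standard InChI.
_PREFIX = re.compile(r"InChI=(\d+)(S?)/")


def classify_inchi(inchi: str) -> tuple[int, bool]:
    """Return (version, is_standard) read from the InChI prefix."""
    match = _PREFIX.match(inchi)
    if match is None:
        raise ValueError(f"not recognised as an InChI string: {inchi!r}")
    version, standard_marker = match.groups()
    return int(version), standard_marker == "S"


if __name__ == "__main__":
    print(classify_inchi("InChI=1S/C2H8N3S.HI/c1-6-2(3)5-4;/h5H,3-4H2,1H3;1H/b5-2-;"))
    print(classify_inchi("InChI=1/C17H13ClN4/c1-11-20-21-16-10-19-17/h2-9H,10H2,1H3"))
```

Only the prefix is inspected; the sketch does not check that the rest of the string is a valid InChI.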

Standard InChIKey

In response to user feedback the format of InChIKey has been changed; it is different from that in InChI software v. 1.02-beta, having 27 characters rather than 25.

Standard InChIKey has five distinct components.

  1. a 14-character hash of the basic (Mobile-H) InChI layer;
  2. an 8-character hash of the remaining layers (except for the “/p” segment, which accounts for added or removed protons; it is not hashed at all, and the number of protons is encoded at the end of the standard InChIKey);
  3. 1 flag character;
  4. 1 version character;
  5. a final character that serves as a [de]protonation indicator.

The overall length of InChIKey is fixed at 27 characters, including separators (dashes):

AAAAAAAAAAAAAA-BBBBBBBBFV-P

This is significantly shorter than a typical InChI string.

Here

(1) AAAAAAAAAAAAAA is a 14-character hash.

(2) BBBBBBBB is an 8-character hash.

(3) F is a flag indicating standard InChIKey (produced out of standard InChI): it always has the value ‘S’.

(4) V is the InChI version character: ‘A’ for version 1, ‘B’ for version 2, etc.

(5) P indicates the number of protons added or removed; this number is not encoded in the hash but appears as a separate 2-character block at the end, where the first character is a hyphen: -N for neutral, -M for -1 hydrogen, -O for +1 hydrogen, etc.
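The layout described above can be sketched as a small parser. This is a minimal illustration rather than the reference implementation: the names `StdInChIKey` and `parse_standard_inchikey` are hypothetical, the assumption that every position holds an uppercase letter is inferred from the example keys in this post, and only the three protonation codes named in the text (N, M, O) are decoded. Any other code is returned undecoded.

```python
import re
from dataclasses import dataclass
from typing import Optional

# 14 letters, dash, 8 letters + flag + version, dash, protonation character.
# Assumption: all positions are uppercase letters, as in the examples here.
_STD_KEY = re.compile(r"([A-Z]{14})-([A-Z]{8})([A-Z])([A-Z])-([A-Z])")

# Only the codes spelled out in the text; others are left undecoded.
_PROTON_CODES = {"N": 0, "M": -1, "O": 1}


@dataclass(frozen=True)
class StdInChIKey:
    first_hash: str              # hash of the basic (Mobile-H) layer
    second_hash: str             # hash of the remaining layers
    flag: str                    # 'S' for a standard InChIKey
    version: int                 # 'A' -> 1, 'B' -> 2, ...
    proton_char: str             # raw [de]protonation character
    proton_delta: Optional[int]  # None if the code is not one of N, M, O


def parse_standard_inchikey(key: str) -> StdInChIKey:
    match = _STD_KEY.fullmatch(key)
    if match is None:
        raise ValueError(f"not a 27-character standard InChIKey: {key!r}")
    first, second, flag, version_char, proton = match.groups()
    if flag != "S":
        raise ValueError(f"flag is {flag!r}, expected 'S' for a standard InChIKey")
    version = ord(version_char) - ord("A") + 1
    return StdInChIKey(first, second, flag, version, proton, _PROTON_CODES.get(proton))


if __name__ == "__main__":
    print(parse_standard_inchikey("NYMQJWVULPQXBK-UHFFFAOYSA-N"))
```

A 25-character key in the older 1.02-beta format does not match this pattern, so it is rejected with a `ValueError` instead of being misread.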

We have now generated Standard InChI strings and InChIKeys across the entire database, and each record now shows four variations of the InChI:

InChI: InChI=1/C17H13ClN4/c1-11-20-21-16-10-19-17(12-5-3-2-4-6-12)14-9-13(18)7-8-15(14)22(11)16/h2-9H,10H2,1H3
InChIKey: VREFGVBLTWBCJP-UHFFFAOYAT
Std. InChI: InChI=1S/C17H13ClN4/c1-11-20-21-16-10-19-17(12-5-3-2-4-6-12)14-9-13(18)7-8-15(14)22(11)16/h2-9H,10H2,1H3
Std. InChIKey: VREFGVBLTWBCJP-UHFFFAOYSA-N

It is now also possible to search across the entire database by Standard InChI.

5 Responses to “Standard InChIs and InChIKeys Populated to ChemSpider”

  1. Tore Eriksson says:

    There seems to be something murky with the standard InChI/InChIKeys produced by PubChem. Since I’m not a chemoinformatician, I would appreciate some insight. This might also be relevant for your outlinks. The PubChem compounds CID:9568014 and CID:15978272 are stereoisomers, but they have identical standard InChI/InChIKeys (NYMQJWVULPQXBK-UHFFFAOYSA-N) without any stereo layer. Searching ChemSpider with the isomeric SMILES or hand-drawn structures both lead to the entry ChemSpider ID:2054355, which also seems to lack stereochemistry. This entry contains outlinks to both CIDs above.

    Recomputing the InChI from PubChem SDF files gives the same InChIKey that is shown in all the above entries. However, when I convert the PubChem XML files to SDF (with Open Babel) and then recompute the standard InChI, the stereo layer is present, and the InChIKeys are of course different. (An extra proton is also added…) It seems like all PubChem InChIs are now computed without regard for stereochemistry, which seems a tad odd.

    Is this something that you have noticed as well? I am adding InChIKeys to my database now, and wonder if I should merge entries like this.

  2. alf says:

    It does seem that the PubChem SDF files that Tore refers to above are missing information that’s present in the corresponding PubChem XML files; the InChIs for those two molecules, using an XML->OpenBabel->SDF->StdInChI conversion, should be:
    InChI=1S/C2H8N3S.HI/c1-6-2(3)5-4;/h5H,3-4H2,1H3;1H/b5-2-;
    InChI=1S/C2H8N3S.HI/c1-6-2(3)5-4;/h5H,3-4H2,1H3;1H/b5-2+;

    I’d be interested to know why this is happening.

  3. alf says:

    The difference between PubChem SDF files and PubChem XML-derived SDF files that Tore mentions above seems to be due to OpenBabel, which doesn’t read the PC-Atoms_charge section from the PubChem XML file, so produces an SDF file which doesn’t include the charge information for each atom.

  4. Antony Williams says:

    CSID2054355 has a crossed bond indicating non-explicit orientation around the double bond…thereby allowing the link on PubChem to both the E and Z orientation. Our experience of OpenBabel is that it is buggy in a number of ways and we prefer to use OpenEye’s tools for this reason.

  5. Tore Eriksson says:

    Thanks for the replies. Is this what is happening, then: in the non-salt (CSID369757) the hydrogens are mobile and the tautomerization makes the sp2 stereoisomers interchangeable. However, if the molecule is protonated, the interconversion is stopped/slowed. Or perhaps the stereoisomers only exist in the solid salt (CSID2054355), where interconversion is sterically hindered?
