We’ve updated our InChIs
Posted by: Colin Batchelor in InChI
Copyright © 2011 Colin Batchelor
We’ve rejigged our data to make searching more reliable.
What have we done?
We’ve regenerated all of the InChIs in the database with version 1.03 of the InChI code.
What does that mean?
The InChI (international chemical identifier) is a short piece of text that describes the structure of a molecule. Each one is generated by a free and open-source computer program, so the same molecule should always get the same InChI and there shouldn’t be conflicting InChIs for it. You can’t really write them by hand, because they look like this:
InChI=1S/C10H22ClN2O5PS/c1-3-10(9-18-20(2,15)16)12-19(14)13(7-5-11)6-4-8-17-19/h10,12H,3-9H2,1-2H3
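To give a feel for what is packed into that string, here is a minimal Python sketch that splits an InChI into its layers. It is only an illustration: the function name split_inchi is made up for this example, it is not part of the InChI software, and it does no chemistry or validation. It relies only on the layers being separated by slashes, with each layer after the formula starting with a one-letter prefix (c for connections, h for hydrogens).

```python
def split_inchi(inchi):
    """Split an InChI string into (version, formula, layers).

    Hypothetical helper for illustration only; it does not check
    that the string is a chemically valid InChI.
    """
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    parts = inchi[len(prefix):].split("/")
    version = parts[0]                      # e.g. "1S": version 1, standard InChI
    formula = parts[1] if len(parts) > 1 else ""
    # Keep layers as an ordered list of (prefix letter, body) pairs,
    # because the same prefix letter can occur more than once.
    layers = [(p[0], p[1:]) for p in parts[2:] if p]
    return version, formula, layers


example = ("InChI=1S/C10H22ClN2O5PS/c1-3-10(9-18-20(2,15)16)12-19(14)"
           "13(7-5-11)6-4-8-17-19/h10,12H,3-9H2,1-2H3")
version, formula, layers = split_inchi(example)
print(version)
print(formula)
for letter, body in layers:
    print(letter, body)
```

For the example above this gives the version 1S (the S marks a standard InChI), the formula C10H22ClN2O5PS, and then a c layer and an h layer.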
ChemSpider is built on InChIs. If two molecules have the same InChI, then they’re the same record in ChemSpider, and if you can’t InChIfy it, you can’t put it in ChemSpider. That’s why we can’t do, for example, polymers yet.
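The “same InChI, same record” rule can be pictured as a lookup table keyed on the InChI string. The following is a toy Python sketch under that assumption, not ChemSpider’s actual registration code; the Registry class and its register method are hypothetical names.

```python
class Registry:
    """Toy compound registry keyed on InChI (hypothetical, for illustration)."""

    def __init__(self):
        self._ids = {}        # maps InChI string -> record ID
        self._next_id = 1

    def register(self, inchi):
        # No InChI means no record: this mirrors "if you can't InChIfy it,
        # you can't put it in".
        if not inchi:
            raise ValueError("no InChI, so no record")
        # A new InChI gets a new record; a known InChI maps to the existing one.
        if inchi not in self._ids:
            self._ids[inchi] = self._next_id
            self._next_id += 1
        return self._ids[inchi]


registry = Registry()
methane_1 = registry.register("InChI=1S/CH4/h1H4")
methane_2 = registry.register("InChI=1S/CH4/h1H4")   # drawn again, same InChI
water = registry.register("InChI=1S/H2O/h1H2")
print(methane_1 == methane_2)
print(methane_1 == water)
```

The two methane submissions collapse into one record and water gets its own, which is also why a change in the InChI code can merge or split records.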
We’re proud to be founder members of the InChI Trust, which supports this critical element in the sharing of chemical compound information.
What does all this mean for ChemSpider?
Because there is an active community supporting InChI that looks out for these things, version 1.03 contains some bug fixes. As a result, a very small number of the InChIs themselves, only a few dozen out of the whole database, have changed:
- P⁺–O⁻ and P⁺–S⁻ bonds are now treated slightly differently. This means that it will be easier to find the exact molecule you’re looking for, regardless of how it’s been drawn. (In principle this will also apply to analogous bonds containing arsenic, selenium, tellurium and antimony, but I can’t see any examples of this in the database.)
- There was a small bug where the InChI generated for a molecule with an azide group in it sometimes varied according to the input drawing. That no longer happens.
This regeneration has also allowed us to catch and clean up some errors in the data.
What happens next?
Version 1.04 of the InChI code will be released soon. With our new framework for processing large amounts of data we’ll be able to update our InChIs much more quickly. The main changes in 1.04 that affect the InChI are to how it handles radical atoms in aromatic rings, and to how it handles nobelium, lawrencium and rutherfordium, so we anticipate that there shouldn’t be very many changed InChIs!

August 19th, 2011 at 10:04 am
“If two molecules have the same InChI, then they’re the same record in ChemSpider”
Using what InChI options? Is that based on the Standard InChI, or an InChI with the FixedH option? Or with any of the optional tautomer options?
August 19th, 2011 at 10:19 am
Dear Egon,
Thank you for your swift response.
The non-standard option we use is RecMet. We certainly don’t use KET or 15T because they’re still only experimental.
Best wishes,
Colin.
August 19th, 2011 at 4:35 pm
Dear Colin,
While I totally agree that InChI is the best choice for database registration at the moment, there are some cases where InChI alone might not be enough. I list some of these cases in our blog (blog.mcule.com). Can I ask how you handle protomers with different InChIs for example? Also, do you apply any specific rules to avoid “double-registration” of tautomers due to simple/hard (de)protonation discrepancies?
Can I also ask why you don’t prefer using the optional tautomer rules? Is it just because the InChI manual says they are in an experimental phase? According to the answers I got on the Blue Obelisk exchange forum (http://blueobelisk.shapado.com/questions/why-inchi-does-not-support-keto-enol-and-1-5-tautomerism-by-default) and on the InChI-discuss list, they are not part of the standard InChI, because the incorporation of these rules would cause a dramatic change in standard InChIs. There seems to be no harm, however, in using them for database registration.
Regards,
Robert
–
Robert Kiss
http://mcule.com
August 24th, 2011 at 8:59 am
Hello Robert,
We’re not doing anything with protomers that have different InChIs at the moment, but we will be doing more normalization in the not-too-distant future as part of the Open PHACTS project.
The optional tautomer rules will take a lot of work to investigate and validate against our records!
However, 1,5-tautomerism, for example, does matter a lot for things like acac complexes, which we’re looking at in the InChI working group on coordination complexes and organometallic chemistry. So probably a lot of this will come out in the wash, so to speak, when we work out how InChI should handle that area of chemistry properly.
I should add that when ChemSpider (and Prospect) started, the tautomer options were only accessible if you recompiled the InChI source yourself and set the appropriate options. They were only openly released later. There’s some experimental code there to do ring–chain tautomerism as well, but that’s still, if I remember rightly, something you can only switch on as a compilation option.
Best wishes,
Colin.
August 24th, 2011 at 4:11 pm
Thanks Colin! I totally agree it is very difficult to change even a single step in your registration process, as this can result in ID changes and might confuse your users. As for mcule, at the moment we are in a much better situation as we can make as many changes as we want before we launch.
BTW, can you give us some hint as to when the 1.04 InChI version that you referred to above will become available?
Another question arising from the RecMet option: does applying this InChI option mean that you distinguish organometallics where the metal was connected to its ligands from those where it was not?
Regards,
Robert
August 25th, 2011 at 4:35 am
I can’t say any more about 1.04 than “soon”. As for RecMet, you’re quite right. If you search for LDA, say, in ChemSpider, you’ll find two entries.
Best wishes,
Colin.
October 7th, 2011 at 4:29 am
I can say more about 1.04! It’s out now: http://www.inchi-trust.org/index.php?q=node/14