Copyright©2008 Antony Williams
Frequent readers of this blog will recall the multiple exchanges which occurred around trying to get access to the “Open Data” on CrystalEye. I commented then that our intention was to: “… scrape the InChIs, the title of the article, the journal name, volume and page details and the DOI number. We will de-duplicate the structures onto the database or create new structure records as appropriate. My concern is whether or not the ACS will allow us to scrape their Open Data so I have issued the direct question to them below. I am hoping for an affirmative response and then I will move on to confirm with the other publishers.”
I have not been able to get an answer out of ACS about whether their data can be accepted as Open Data and keep us out of trouble if we publish it. It really should be a simple answer but Open Data is causing lots of issues nowadays and we are in for a rocky ride. However, NONE of the items on the list is copyrightable anyway: “the InChIs, the title of the article, the journal name, volume and page details and the DOI number”.
And so to work we went. Supposedly there are 130,000 structures on CrystalEye. Since we were scraping we had to source these ourselves. We could only find 93,000. It doesn’t mean they aren’t there but there’s no site map to help us find the rest.
We have scraped InChIs where they exist (and they don’t for many inorganics and organometallics) and have grabbed DOIs etc. We have extracted about 56,000 InChIs total. What we are trying to achieve with this approach is to provide a manner by which to link from a structure through to an article. However, I’ve made a decision NOT to do that. Here’s why:
1) There are many broken URLs associated with the InChIs, for example here and here. The URLs follow no standard format, unlike the standard InChIKey-based URL structures we have tried to achieve on ChemSpider. We are dealing with complex URLs such as
2) Looking through the data it is clear that there are issues with structures and the accuracy of what a structure is. Just looking at internal consistency within a record we see issues. Look at this record. The associated XML file is here. At the bottom of the file we see:
These are representations of the same structure in SMILES and InChI format. (I question the label Daylight SMILES as the SMILES string would have to be generated by Daylight, and maybe it is, for that to be true. Many software packages generate SMILES, some good, some bad. Daylight generate “theirs”.) I believe the process is that the structure is extracted from the CIF and then converted via CML to SMILES and InChI. The problem is in the internal consistency. The structures below were generated by converting the structures from InChI and SMILES to structures as shown. Notice that they are inconsistent in E/Z stereo.
Looking at the paper here I believe that the InChI is correct but the SMILES is incorrect. This is either “Daylight’s issue” or an issue with the tool converting the CML to SMILES. According to PMR it appears to be the SMILES conversion in either Jumbo or OpenBabel. This is not the only example; there are others.
3) Let me clarify before continuing by commenting that I am an NMR spectroscopist, not a crystallographer. And so, my knowledge of CIF files etc. is not at the level of the developers of CrystalEye. With this premise in mind I looked at the list of InChIs scraped from CrystalEye. While there are MANY examples of InChIs not accurately representing complex organometallics this IS to be expected since InChI has not been developed to deal with them yet. However, there are also many examples of what I “think” are strange InChIs. Taking the premise that on ChemSpider we want people to search a chemical structure and find their way to related information on other databases, let me provide some examples.
For this link the InChI is InChI=1/12H2O/h12*1H2. If you look at the image at the link you will see “12 waters”. It’s not a surprise as it links to an article entitled “A Blind Structure Prediction of Ice XIV” in the Journal of the American Chemical Society. I doubt that anyone would “draw” twelve molecules of water to generate an InChI to search for this article, but you never know. However, looking at other examples we find, for example, “InChI=1/3C6O2/c3*7-5-1-2-6(8)4-3-5”. The number 3 at the beginning of the formula 3C6O2 indicates there are three EQUIVALENT molecules and I assume these are contained within a unit cell (?) as shown in the figure below and at this link. I would expect a person to draw ONE of these molecules in order to perform a search and not have to draw all three to generate the InChI.
What is interesting about the article associated with this example is the nature of its commentary “The space groups of point group C3: some corrections, some comments” where the abstract says “A survey of the October 2001 release of the Cambridge Structural Database has uncovered approximately 675 separate apparently reliable entries under space groups P3, P31, P32 and R3; in approximately 100 of these entries, the space-group assignment appears to be incorrect. Other features of these space groups are also discussed.” I wonder whether this observation is related?
There are many more examples. Check out this CIF. Note the InChI, InChI=1/4C7H4ClN/c4*8-7-4-2-1-3-6(7)5-9/h4*1-4H, and the Figure below: 4 equivalent molecules.
I am assuming that the InChI is being derived from the unit cell. It is certainly being derived from the CIF. I don’t think of this as a “chemical structure” that I want to deposit to ChemSpider.
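As a rough illustration of how such “multiplied” InChIs could be flagged automatically, here is a minimal Python sketch. It is not code that we run on ChemSpider, and the function names are made up for this example. It assumes the multiplier always appears as a leading integer on a component of the formula layer, as in 3C6O2 or 12H2O above.

```python
import re

# Leading integer on a formula component, e.g. the "3" in "3C6O2".
# The lookahead requires an element symbol (capital letter) to follow.
_MULTIPLIER = re.compile(r"^(\d+)(?=[A-Z])")


def formula_multipliers(inchi):
    """Return the multiplier of each dot-separated formula component.

    A component without a leading integer counts as 1. This only reads
    the formula layer (the text between the first and second "/").
    """
    parts = inchi.split("/")
    if len(parts) < 2:
        return []
    counts = []
    for component in parts[1].split("."):
        match = _MULTIPLIER.match(component)
        counts.append(int(match.group(1)) if match else 1)
    return counts


def is_multiplicative(inchi):
    """True if any formula component is repeated more than once."""
    return any(n > 1 for n in formula_multipliers(inchi))


print(formula_multipliers("InChI=1/3C6O2/c3*7-5-1-2-6(8)4-3-5"))  # [3]
print(formula_multipliers("InChI=1/12H2O/h12*1H2"))               # [12]
print(is_multiplicative("InChI=1/H2O/h1H2"))                      # False
```

A check like this only tells us that a record describes several copies of the same molecule. It says nothing about whether that is the unit cell content or something else.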
What I have noticed about all the online databases I visit, other than ChemSpider and Wikipedia, is that there is no easy way to annotate a record with an error. On ChemSpider we allow people to click on “Post Comments” (it used to be called Curate Data) and add comments directly to a record. This means that if people HAVE spotted errors, those errors are visible for everyone else to see as well. If someone finds an error why not let them tell the world? I don’t believe PubChem, Drugbank, CrystalEye, blah, blah allow such comments (that I can see). If we can add comments to blogs for all to see why not comments to DB records? As it is I burn up hours of time trying to hunt down email addresses for contacts at the individual databases I find errors on and trying to inform people. The majority of people are grateful and respond. Some don’t respond, don’t make edits and leave errors online to proliferate. Such a simple capability as Post a Comment could really help identify these errors for other people.
So, there are InChIs on CrystalEye. I cannot speak to whether, en masse, they represent the structure in the article or the structure in the CIF, but I am concerned that we do not proliferate incorrect data. But the InChIs on CrystalEye are what they are and if we are only linking to them then we are identifying that there is expected to be related info on that database. However, because of the “multiplicative nature” of some of the InChIs I don’t want to index them either right now. I am presently defining some regular expressions to allow us to “refine” the InChIs for connecting to. An example is shown below; it should be clear what needs to be edited to remove the “multiples”.
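To give a feel for the kind of refinement involved, here is a minimal Python sketch. It is an illustration only, not the actual expressions being defined for ChemSpider, and the function name strip_multiples is hypothetical. It assumes the simple case seen in the examples above: a single formula component with a leading count N, and layers that start with the same “N*” prefix.

```python
import re


def strip_multiples(inchi):
    """Reduce e.g. 3C6O2 / c3*... to C6O2 / c... (simple cases only).

    Assumption: one formula component with a leading count N, and each
    later layer either starts with "<letter>N*" or carries no multiplier.
    Anything else is returned unchanged.
    """
    parts = inchi.split("/")
    if len(parts) < 2 or "." in parts[1]:
        return inchi  # no formula layer, or a mixture: leave alone
    match = re.match(r"^(\d+)(?=[A-Z])", parts[1])
    if match is None:
        return inchi  # nothing to strip
    n = match.group(1)
    # Drop the count from the formula layer.
    parts[1] = parts[1][len(n):]
    # Drop the matching "N*" that follows the layer letter (c, h, ...).
    prefix = re.compile(r"^([a-z])" + n + r"\*")
    parts[2:] = [prefix.sub(r"\1", layer) for layer in parts[2:]]
    return "/".join(parts)


print(strip_multiples("InChI=1/3C6O2/c3*7-5-1-2-6(8)4-3-5"))
# InChI=1/C6O2/c7-5-1-2-6(8)4-3-5
print(strip_multiples("InChI=1/12H2O/h12*1H2"))
# InChI=1/H2O/h1H2
```

String surgery like this is no guarantee of a valid InChI. Layers that express the repetition differently would be left untouched, so any “refined” string should be regenerated or checked with the InChI software before being used for linking.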
For now we will not index InChIs from CrystalEye on ChemSpider. The quality of what’s related to those InChIs from that point on will need to be checked further. This is NOT just a CrystalEye issue. It’s an issue for all databases including our own. However you arrive at a database, whether it’s ChemSpider, Wikipedia, PubChem or ChEBI, always check, if you can, the quality of the data. We are all contaminated in some way with errors. We hope that ChemSpider isn’t struggling with InChI and SMILES collisions (and we haven’t found any yet) but we might be. We ARE struggling with InChIs and organometallics in the same way all other public online databases are.
With time I will review a few tens of InChIs versus structures in the CIFs on CrystalEye and we might be able to decide whether or not to post the DOIs and titles etc. in the future. What we will NOT be doing is grabbing any “Open Data” for the time being until some more validation work has been done. It’s a great shame as I was hoping to link to crystal structures. I am not aware that anyone has linked to CrystalEye. If you have I welcome your guidance regarding how you got around broken links, InChI vs. SMILES conflicts and InChI complexities. Thanks!