Frequent readers of this blog will recall the multiple exchanges which occurred around trying to get access to the “Open Data” on CrystalEye. I commented then that our intention was to: “… scrape the InChIs, the title of the article, the journal name, volume and page details and the DOI number. We will de-duplicate the structures onto the database or create new structure records as appropriate. My concern is whether or not the ACS will allow us to scrape their Open Data so I have issued the direct question to them below. I am hoping for an affirmative response and then I will move on to confirm with the other publishers.”

I have not been able to get an answer out of ACS about whether their data can be treated as Open Data, so that publishing it would not land us in trouble. It really should be a simple answer but Open Data is causing lots of issues nowadays and we are in for a rocky ride. However, NONE of the items on the list are copyrightable anyway… “the InChIs, the title of the article, the journal name, volume and page details and the DOI number”.

And so to work we went. Supposedly there are 130,000 structures on CrystalEye. Since we were scraping we had to source these ourselves. We could only find 93,000. It doesn’t mean they aren’t there but there’s no site map to help us find the rest.

We have scraped InChIs where they exist (and they don’t for many inorganics and organometallics) and have grabbed DOIs etc. We have extracted about 56,000 InChIs total. What we are trying to achieve with this approach is to provide a manner by which to link from a structure through to an article. However, I’ve made a decision NOT to do that. Here’s why:

1) There are many broken URLs associated with the InChIs, for example here and here. There is no standard format to the URLs, unlike the standard URL structures we have tried to achieve on ChemSpider, which are keyed on the InChIKey for example. We are dealing with complex URLs such as

http://wwmm.ch.cam.ac.uk/crystaleye/summary/acs/cgdefu/2006 /1/data/cg050086d/cg050086dsup2_-aku———————-/ cg050086dsup2-aku———————-.cif.summary.html

2) Looking through the data it is clear that there are issues with structures and with the accuracy of what a structure is. Just looking at internal consistency within a record we see issues. Look at this record. The associated XML file is here. At the bottom of the file we see:

<identifier convention="daylight:smiles">
[H]C=2C([H])=C(C([H])=C([H])C=2(N=NC(=NN([H])C1=C([H])C([H])=C(C([H])=C1([H]))C([H])([H])[H])[N+](=O)[O-]))C([H])([H])[H]
</identifier>

<identifier convention="iupac:inchi">
InChI=1/C15H15N5O2/c1-11-3-7-13(8-4-11)16-18-15(20(21)22)19-17-14-9-5-12(2)6-10-14/h3-10,16H,1-2H3/b18-15+,19-17+
</identifier>

These are representations of the same structure in SMILES and InChI format. (I question the label Daylight SMILES as the SMILES string would have to be generated by Daylight, and maybe it is, for that to be true. Many software packages generate SMILES, some good, some bad. Daylight generate “theirs”.) I believe the process is that the structure is extracted from the CIF and then converted via CML to SMILES and InChI. The problem is in the internal consistency. The structures below were generated by converting the structures from InChI and SMILES to structures as shown. Notice that they are inconsistent in E/Z stereo.
smiles-and-inchi.png

Looking at the paper here I believe that the InChI is correct but the SMILES is incorrect. This is either “Daylight’s issue” or an issue with the tool converting the CML to SMILES. According to PMR it appears to be the SMILES conversion in either Jumbo or OpenBabel. This is not the only example; there are others.
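A cheap way to flag records like this one before indexing is to compare what each identifier says about double-bond stereo. The sketch below is only a string-level heuristic, not a replacement for regenerating both identifiers with a cheminformatics toolkit. It assumes that an InChI carrying a /b layer specifies double-bond stereo and that a SMILES specifies it only when it contains the directional bond symbols / or \. The function names are my own for illustration. The two strings are the ones from the record above with the line-wrap spaces removed.

```python
def inchi_layers(inchi):
    """Split an InChI string into {layer prefix letter: layer body}.

    The version and the formula (the first two fields) are skipped.
    If a prefix occurs more than once, only the first occurrence is kept.
    """
    body = inchi.split("=", 1)[1]
    layers = {}
    for part in body.split("/")[2:]:
        if part:
            layers.setdefault(part[0], part[1:])
    return layers


def stereo_mismatch(smiles, inchi):
    """True when only one of the two identifiers specifies double-bond stereo."""
    inchi_has_stereo = "b" in inchi_layers(inchi)
    smiles_has_stereo = "/" in smiles or "\\" in smiles
    return inchi_has_stereo != smiles_has_stereo


SMILES = (
    "[H]C=2C([H])=C(C([H])=C([H])C=2(N=NC(=NN([H])C1=C([H])C([H])="
    "C(C([H])=C1([H]))C([H])([H])[H])[N+](=O)[O-]))C([H])([H])[H]"
)
INCHI = (
    "InChI=1/C15H15N5O2/c1-11-3-7-13(8-4-11)16-18-15(20(21)22)19-17"
    "-14-9-5-12(2)6-10-14/h3-10,16H,1-2H3/b18-15+,19-17+"
)

print(inchi_layers(INCHI)["b"])        # the double-bond stereo layer
print(stereo_mismatch(SMILES, INCHI))  # True: only the InChI specifies E/Z
```

For this record the InChI has a /b layer while the SMILES contains no directional bonds at all, so any E/Z geometry drawn from the SMILES is arbitrary. That is an observation about the two strings only; it does not tell us which conversion step dropped the information.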

3) Let me clarify before continuing by commenting that I am an NMR spectroscopist, not a crystallographer. And so, my knowledge of CIF files etc. is not at the level of the developers of CrystalEye. With this premise in mind I looked at the list of InChIs scraped from CrystalEye. While there are MANY examples of InChIs not accurately representing complex organometallics, this IS to be expected since InChI has not been developed to deal with them yet. However, there are also many examples of what I “think” are strange InChIs. Taking the premise that on ChemSpider we want people to search a chemical structure and find their way to related information on other databases, let me provide some examples.

For this link the InChI is InChI=1/12H2O/h12*1H2. If you look at the image at the link you will see “12 waters”. It’s not a surprise as it links to an article entitled “A Blind Structure Prediction of Ice XIV” in the Journal of the American Chemical Society. I doubt that anyone would “draw” twelve molecules of water to generate an InChI to search for this article, but you never know. However, looking at other examples we find, for example, “InChI=1/3C6O2/c3*7-5-1-2-6(8)4-3-5”. The number 3 at the beginning of the formula 3C6O2 indicates there are three EQUIVALENT molecules, and I assume these are contained within a unit cell (?) as shown in the figure below and at this link. I would expect a person to draw ONE of these molecules in order to perform a search and not have to draw all three to generate the InChI.

3-molecules.png

What is interesting about the article associated with this example is the nature of its commentary “The space groups of point group C3: some corrections, some comments” where the abstract says “A survey of the October 2001 release of the Cambridge Structural Database has uncovered approximately 675 separate apparently reliable entries under space groups P3, P31, P32 and R3; in approximately 100 of these entries, the space-group assignment appears to be incorrect. Other features of these space groups are also discussed.” I wonder whether this observation is related?

There are many more examples. Check out this CIF. Note the InChI, InChI=1/4C7H4ClN/c4*8-7-4-2-1-3-6(7)5-9/h4*1-4H, and the Figure below. 4 equivalent molecules.

4-molecules-in-cell.png

I am assuming that the InChI is being derived from the unit cell. It is certainly being derived from the CIF. I don’t think of this as a “chemical structure” that I want to deposit to ChemSpider.
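To get a feel for how many of the scraped InChIs are of this “multiplied” kind, the formula layer can be inspected directly. The following is a minimal sketch, assuming that the formula is the second slash-separated field of the InChI, that components are separated by dots, and that a leading integer on a component is the number of equivalent copies. The function names are hypothetical.

```python
import re

# Optional leading count followed by the formula of one component.
_COMPONENT = re.compile(r"^(\d*)(.+)$")


def formula_components(inchi):
    """Return [(count, formula), ...] parsed from the formula layer of an InChI."""
    fields = inchi.split("=", 1)[-1].split("/")
    if len(fields) < 2:
        return []
    components = []
    for component in fields[1].split("."):
        match = _COMPONENT.match(component)
        if match is None:
            continue
        count = int(match.group(1)) if match.group(1) else 1
        components.append((count, match.group(2)))
    return components


def is_multiplied(inchi):
    """True when any component of the formula carries a count greater than one."""
    return any(count > 1 for count, _ in formula_components(inchi))


for inchi in (
    "InChI=1/12H2O/h12*1H2",
    "InChI=1/3C6O2/c3*7-5-1-2-6(8)4-3-5",
    "InChI=1/4C7H4ClN/c4*8-7-4-2-1-3-6(7)5-9/h4*1-4H",
):
    print(formula_components(inchi), is_multiplied(inchi))
```

A filter like this only identifies candidates. Whether a multiplied InChI should be skipped or reduced to a single molecule is a separate decision.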

What I have noticed about all the online databases I visit, other than ChemSpider and Wikipedia, is that there is no easy way to annotate a record with an error. On ChemSpider we allow people to click on “Post Comments” (it used to be called Curate Data) and add comments directly to a record. This means that if people HAVE spotted errors, those errors are visible for everyone else to see. If someone finds an error, why not let them tell the world? I don’t believe PubChem, Drugbank, CrystalEye, blah, blah allow such comments (that I can see). If we can add comments to blogs for all to see, why not comments to DB records? As it is, I burn up hours of time trying to hunt down email addresses for contacts at the individual databases I find errors on and trying to inform people. The majority of people are grateful and respond. Some don’t respond, don’t make edits and leave errors online to proliferate. Such a simple capability as Post a Comment could really help identify these errors for other people.

So, there are InChIs on CrystalEye. I cannot speak to whether, en masse, they represent the structure in the article or the structure in the CIF, but I am concerned that we do not proliferate incorrect data. But the InChIs on CrystalEye are what they are, and if we are only linking to them then we are indicating that related info is expected to be found on that database. However, because of the “multiplicative nature” of some of the InChIs I don’t want to index them either right now. I am presently defining some regular expressions to allow us to “refine” the InChIs for connecting to. An example is shown below; it should be clear what needs to be edited to remove the “multiples”.

multiplier1.png
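A minimal sketch of this kind of refinement is given here. It is an illustration under stated assumptions, not the expressions in use on ChemSpider. It only handles the simplest case, a single-component formula with a leading count N whose later layers start with “N*”; anything else is returned untouched. The function name is hypothetical, and a stripped InChI should still be checked against one regenerated from the single molecule before it is trusted.

```python
import re

# A formula layer that starts with a count, e.g. "4C7H4ClN".
_COUNTED_FORMULA = re.compile(r"^(\d+)([A-Z].*)$")


def strip_multiplier(inchi):
    """Reduce e.g. InChI=1/3C6O2/c3*7-5... to InChI=1/C6O2/c7-5...

    InChIs with several different components (a "." in the formula) or
    without a leading count are returned unchanged.
    """
    prefix, sep, body = inchi.partition("=")
    fields = body.split("/")
    if len(fields) < 2 or "." in fields[1]:
        return inchi
    match = _COUNTED_FORMULA.match(fields[1])
    if match is None:
        return inchi
    count, formula = match.groups()
    fields[1] = formula
    # Drop the "N*" that follows the layer letter, e.g. "c3*7-5" -> "c7-5".
    layer_multiplier = re.compile(r"^([a-z])" + count + r"\*")
    fields[2:] = [layer_multiplier.sub(r"\1", layer) for layer in fields[2:]]
    return prefix + sep + "/".join(fields)


print(strip_multiplier("InChI=1/4C7H4ClN/c4*8-7-4-2-1-3-6(7)5-9/h4*1-4H"))
```

Mixed cases, where the asymmetric unit holds different molecules, are deliberately left alone because there is no single “right” molecule to reduce them to.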

For now we will not index InChIs from CrystalEye on ChemSpider. The quality of what’s related to those InChIs from that point on will need to be checked further. This is NOT just a CrystalEye issue. It’s an issue for all databases including ourselves. However you arrive at a database, whether it’s ChemSpider, Wikipedia, PubChem or ChEBI, always check, if you can, the quality of the data. We are all contaminated in some way with errors. We hope that ChemSpider isn’t struggling with InChI and SMILES collisions (and we haven’t found any yet) but we might be. We ARE struggling with InChIs and organometallics in the same way all other public online databases are.

With time I will review a few tens of InChIs versus structures in the CIFs on CrystalEye and we might be able to decide whether or not to post the DOIs and titles etc. in the future. What we will NOT be doing is grabbing any “Open Data” for the time being until some more validation work has been done. It’s a great shame as I was hoping to link to crystal structures. I am not aware that anyone has linked to CrystalEye. If you have I welcome your guidance regarding how you got around broken links, InChI vs. SMILES conflicts and InChI complexities. Thanks!

8 Responses to “Struggling to Scrape CrystalEye”

  1. Egon Willighagen says:

    I will comment in my blog in more detail later, but want to put in a quick comment on InChI calculation… I agree that InChI’s are much more useful for isolated molecules. Therefore, when setting up our MetID database (part of MetWare [1]) instance with PubChem content, I make sure to put in molecules, not salts etc. From those, uncharged molecules are reconstructed which then act as input to MetID.

    [1] http://chem-bla-ics.blogspot.com/2007/11/metware-metabolomics-database-project.html

  2. Nick Day says:

    Hi Antony,

    Thanks a lot for your comments, these are certainly things that need resolving asap. I shan’t comment on them all right now, but will do at some point.

    I can explain the multiplicative nature of the InChIs though. When the CML is created, if the unit cell contains more than one unique structure (those that are not related by symmetry), then in the CML there will be an InChI for each structure *and* one for those symmetry unrelated structures as a whole (which is what you have been seeing). If you look again at the CML, there is one parent molecule element and four child molecule elements of that. Each child molecule contains a child identifier element for the SMILES and InChI for a particular structure, while the parent Molecule contains child identifiers for them combined.

    cheers,
    Nick

  3. baoilleach says:

    Regarding: “However, NONE of the list are copyrightable anyways… “the InChIs, the title of the article, the journal name, volume and page details and the DOI number“.”

    Not to agree or disagree, but have you tried extracting all bibliographic information from a journal RSS feed and making it available on another site? You would think that a journal would see this as free advertising.

  4. Antony Williams says:

    Nick, regarding the multiplicative nature of the InChIs, your comments make sense. What we are grabbing from the View page is the InChI as posted there. You likely get why we don’t want to post the multiplicative set into our database. I’m not sure what the best solution is for your side and maybe it’s best as it is. But I don’t think it’s a good way for people to hook up to CrystalEye via InChI or the resulting InChIKey. What do you think?

  5. Antony Williams says:

    Noel, Yes we have thought about grabbing RSS feeds but the ONLY one I can think of right now where the InChI is included is from Project Prospect. Are you aware of others?

  6. Noel O'Boyle says:

    I’m not aware of others. I was actually responding to the general idea of republishing bibliographic data extracted from journal RSS feeds, which I understand is not something that journals encourage.

  7. Antony Williams says:

    For the readers of this blog I have pasted comments from the CrystalEye team from here: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=946#comment-117418

    “(all) There has been some offline correspondence as well. Nick and Jim may wish to add things.

    The present situation is that CrystalEye robots continue to collect data, convert it and index it. We are aware of a few bugs and plan to list them. However we do not plan bug fixes immediately.

    Nick developed CrystalEye as part of his thesis – he is now writing this up and only does further software work if it impacts on his thesis. The purpose of the thesis was – inter alia – to see how well calculations can reproduce crystal structures and so far none of the bugs impinge on that. We’re pleased to have them reported, of course and thank you all.

    Please note that in reporting bugs we wish to use unit tests to identify them so it will help if we have one or more specific instances with details (“Entry ddddd depicts stereochemistry X but the data show Y”). We then write a unit test which depends on the specific bug.

    Please also note that in a project with >1000000 data items it is not likely that there are problems – it is certain. The “id”s that authors give are often full of strange characters (and this causes many of our problems). Similarly there are chemical compounds that are beyond our anticipation. And there are simple data errors – a recent one I looked at had a PF8 anion (it was a disordered PF6).

    WRT multiple molecules per asymmetric unit (not actually a unit cell). Many crystals have several different molecules per asymmetric unit – a typical example might be copper sulfate Cu(OH2)4.H2O.SO4. Here the water of solvation is different from the other waters so needs to be explicit.

    In some cases the asymmetric unit has more than one identical molecule. Perhaps we should normalize to a single molecule, perhaps not. We have chosen not to. After all InChI was not designed to deal with aggregated systems.

    We have plans for further development of CrystalEye which includes changes to the software and which should be”

  8. Antony Williams says:

    Jim Downing has posted on his blog about “Posting Comments” using Connotea. Visit: http://wwmm.ch.cam.ac.uk/blogs/downing/?p=171