Peter Murray-Rust responded to my recent comments about a Free Lunch. There are a number of comments to be made and an exciting opportunity to use Open Data and linking from ChemSpider.

I’d asked the question about how many records there were on CrystalEye. In our world a unique record is a unique InChI, not so on CrystalEye and appropriately so as the crystal structure itself is presumably the unique record. Makes sense.

PMR> We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) does a good job on disambiguating by cell dimensions but this is not foolproof and indeed no method is.

What we will do with multiple crystal structures for a single chemical structure is link all unique crystal structures from the unique chemical structure. In this way people can query the chemical structure and find all associated analytical data – spectra and crystallographic files. If we were to list the number of unique depositions on ChemSpider, I estimate we would be at around 40 million, though!
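As a rough illustration of that linking, here is a minimal sketch in Python. It assumes each crystal structure arrives as a pair of an InChI string and some crystal-structure identifier; the function name `link_crystal_structures` and the `cif-…` identifiers are made up for the example and are not taken from ChemSpider or CrystalEye.

```python
from collections import defaultdict


def link_crystal_structures(records):
    """Group crystal-structure identifiers under the InChI they share.

    `records` is an iterable of (inchi, crystal_id) pairs. Both the pair
    layout and the identifiers are assumptions made for this sketch.
    """
    links = defaultdict(list)
    for inchi, crystal_id in records:
        # Keep each crystal structure once per chemical structure.
        if crystal_id not in links[inchi]:
            links[inchi].append(crystal_id)
    return dict(links)


# Hypothetical depositions: two chemical structures, one deposited twice.
records = [
    ("InChI=1S/CH4/h1H4", "cif-0001"),
    ("InChI=1S/H2O/h1H2", "cif-0002"),
    ("InChI=1S/CH4/h1H4", "cif-0003"),
    ("InChI=1S/CH4/h1H4", "cif-0001"),  # repeated deposition, ignored
]

links = link_crystal_structures(records)
```

Looking up an InChI in `links` then returns every crystal structure attached to that one chemical structure, which is the query path described above.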

PMR> The main duplication comes from the Crystallography Open Database which has about 45,000 structures.

I looked at the Crystallography Open Database this morning. It states on the home page “Updated daily: 68268 entries in the COD”. We may have an opportunity with the COD to link up to their data and reduce the need for us to host CIFs. Excellent…we’re all for reducing workload and providing links into other systems. It’s what we do.

PMR> The only thing stopping us putting them (AJW> The structures from CrystalEye)  in Pubchem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon. When it’s finished it will be in RDF/CML.

This is great news. This means that after the summer we can download the data directly via PubChem and link up to CrystalEye that way. Perfect. We’ll stop working on integrating to CrystalEye now, wait for the integration path via PubChem and focus on other data sources. Thank you Peter, Nick, Andrew and Jim!!! That said, I don’t believe that PubChem will take CML; they will convert it using their tools to produce their own compatible formats, InChI being one of them. That will break organometallics etc. UNLESS PubChem is going to adopt CML now, which would be an interesting positive shift as a sign of support for the format. A strong positive. I’ll chat with the PubChem team so that if CML is coming we can consider adopting it in some way and be ready.

From my post “AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.”

PMR: Chicken and egg… :-) You won’t adopt it until other people adopt it and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents) as in semantic publishing and the results of text-mining.

It’s nice to know that ChemSpider has that type of influence now. It’s good to see it going into Accelrys’ software; I had heard that from Dan’s blog and had added the CML Blog to my reader. I’m definitely watching and willing to follow. We’re busy leading so many other things right now that we’ll wait for adoption and then jump on it like a “hobo on a muffin”.

2 Responses to “Open Data for Crystallography on ChemSpider”

  1. Egon Willighagen says:

    Haven’t had time to blog about it yet, but was this week looking with Rajarshi Guha and Dan Zaharevitz at a subset of PubChem element, and ran the CDK atom typer against it. Goals: 1. fix glitches in the CDK atom typer, but also 2. to filter out problems in the representation en entries in that subset.

    Now, one thing I did discover is that some complexation (coordination) bonds are correctly represented in PubChem’s ASN.1 format (both plain text and XML, binary probably too, but that is so difficult to read)… However, I spotted at least some MDL molfile versions (SD files, in case of search results), where these complexation bonds were represented as single, covalent bonds. That will surely mess up your chemoinformatics.

    Therefore, I can surely understand the desire to submit as CML, which has more power in expressing chemistry than MDL molfiles. Particularly important for organometallics… Alternatively, I guess one can submit to PubChem in the ASN.1 format too?

  2. Antony Williams says:

    NO DOUBT there are many issues with organometallics in both PubChem and ChemSpider. I’ve discussed that here previously :

    The primary issue is that many of the vendors/depositors are so used to dealing with SDF files that our mutual systems have to deal with problematic submissions. AND, they get passed on from system to system. Right now I want people to post a comment when they see issues with organometallics or ANY structures and then we can deal with them. Unfortunately, on the long list of issues to work on, this is not top, but it is in the top third. I’ve only had 2 comments regarding poor organometallic representations and I think I am more concerned about it than the majority of users. Of course, inorganic and organometallic chemists would want to shoot us for the poor representations; I acknowledge that, thank them and have to keep working on what we can.

    I don’t think there IS a good system out there dealing with organometallics other than maybe CrystalEye and that does use CML as part of the underlying technology so that makes sense. Jim Downing and I are in email exchange at present about using the atom feeds so maybe we can figure out how to get the value out of that.
