Archive for May 6th, 2008

One of the blogs I really enjoy reading is Deepak Singh’s Business,Bytes,Genes and Molecules. Today there was a blog post about ChemSpider but something strange happened…I could ONLY read it in Google Reader. When I tried to navigate to the actual website it asked me to Save a file. See below.

It may be harmless but I’ve suffered enough at the hands of “bad files” to not grab it. Anyone else seeing this symptom? It’s in both browsers (IE and FF) and on two computers.

Anyhow, thankfully I can read it in Google Reader. There’s a point Deepak raises and I insert it here:

“On the web, data should be available as an addressable resource. The fact that data is available as RDF is great (and I wish more data was available as such). However, my personal preference is that data, especially open data, needs to be accompanied by APIs and bindings that allow the data to be accessed in a number of formats (not a dump per se). I think over time the acceptable formats will be established, much like XML/JSON/RSS have become the standard transport formats. The key aspect here are the business models. Is the business in providing a service on top of the data? For example for more than X number of API calls, there could be a fee associated.”

Just in case people have missed them we have a whole series of Web Services available already and they are being used. You can find details about them here:

Mass Spec Web Services

Taverna Hooks to ChemSpider Web Services for Metabolomics

Web Services Demo Pages and Example Code

Microsoft Hook Web Services into Infomesa

Waters Deliver Integration Via Web Services

There are more examples. We have thousands of calls a day using the Web Services at present and welcome more feedback on them!


Jean-Claude Bradley was “asked by the Institute for the Future to highlight a dozen “Signals” that may point to new trends in science as part of the X2 Project”. He has listed his selections in a blog post and people are encouraged to vote. JC mentioned ChemSpider twice and I am honored and humbled that he feels our efforts deserve recognition.

JC has recognized our efforts in depositing analytical data on ChemSpider and our web services to generate InChI strings and InChIKeys.


In a recent post about ChemSpider we’ve been accused of wanting a Free Lunch. I copy a segment of the post and comment with insertions.

“Data are normally produced for a particular purpose and to reuse them for another costs money. I’ll exemplify this by taking CrystalEye data - about 120,000 crystal structures and 1 million molecular fragments - which were aggregated, transformed and validated by Nick Day as part of his thesis. (BTW Nick is writing up - it’s a tribute to his work that CrystalEye runs without attention for months on end).

AJW> It is true…it is a tribute to Nick that CrystalEye can run for months without attention. Kudos. I am interested in how much pressure the site is under. How many searches/users in a day etc.? We find that our struggles in uptime (and these are negligible) are primarily based on stress on the servers. For nighttime users tonight things will have been slow…we deposited over 100,000 new molecules from 5 new data sources. That does create some slowness. We will hit about 40,000 transactions today. Our problems are ISP issues and power cuts. But we are also not in a University using thick pipes etc.

One comment…it was 130,000 structures according to a previous blog and has been expanding since then from daily depositions. Right now I would expect it to be 140,000 rather than 120,000. When we did try scraping the data our best estimate was about 90,000. We might have missed something in our scraping and it’s why we asked for a dump of the data.

The primary purpose of CrystalEye was to allow Nick to test the validity of QM calculations in high-throughput mode. It turned out that the collection might be useful so we have posted it as Open Data. To add to its value we have made it browsable by journal and article, searchable by cell dimensions, searchable by chemical substructure and searchable by bond-length. This is a fair range of what the casual visitor might wish to have available. Andrew Walkingshaw has transformed it into RDF and built a SPARQL endpoint with the help of Talis. It has a Jmol applet and 2D diagrams, and links back to the papers. So there is a lot of functionality associated with it.

AJW> The team has done a good job in putting the site together. The Jmol applet is an excellent utility for us all to use and thanks to that team for sure! Egon has been challenging us to RDF the site and it’s on our list, but it keeps getting pushed down based on other requests. Since he’s the only voice asking it will keep getting pushed down unfortunately.

This has come under some criticism to the effect that we haven’t really made it Openly available. For example Antony Williams (Chemspider blog) writes (Acting as a Community Member to Help Open Access Authors and Publishers):

“This [interaction with MDPI] is contrary to some of my experiences with some other advocates of Open Data and Open Access where trying to get their “Open Data” is like pulling teeth.”

PMR: I assume this relates to CrystalEye - I don’t know of any other case.

AJW> There are other examples and he’s right. He doesn’t know of them and I’d prefer he not rant on my behalf so I’ll not name them.

Antony and I have had several discussions about CrystalEye - basically he would like to import it into his database (which is completely acceptable) but it’s not in the format he wants (multi-entry files in MDL’s SDF format, whereas CrystalEye is in CML and RDF).

AJW> To clarify, again. I DON’T want to import CrystalEye into ChemSpider. I DON’T! All I want is the set of structures and unique associated URLs so that users of ChemSpider can find that there is crystal structure information over on CrystalEye and can click the link and be on CrystalEye and get the benefit of Nick, Andrew and Peter’s work. I don’t want to reproduce their effort. I want to integrate to it. I’ve said it many times on Peter’s blog and on this one.

This type of problem arises everywhere in the data world. For example the problem of converting between map coordinates (especially in 3D) can be enormous. As Rich says, it costs money. There is generally no escape from the cost, but certain approaches such as using standards such as XML and RDF can dramatically lower the costs. Nevertheless there is a cost. Jim Downing made this investment by creating an Atom feed mechanism so that CrystalEye could be systematically downloaded but I don’t think Chemspider has used this.

AJW> If Jim can contact me by email and provide me with detailed instructions to download the entire file of structures ONLY and their associated URLs that would be excellent. I’ll send the request to him tonight.
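As a rough illustration of what pulling links out of an Atom feed could involve, here is a minimal sketch using only the Python standard library. The sample feed, its example.org URLs and the function name `entry_links` are hypothetical; the actual layout of the CrystalEye feed is not described here, so a real download would need adjusting to whatever the feed actually contains.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Atom syndication format.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical sample feed, used only for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example structure feed</title>
  <entry>
    <title>Structure A</title>
    <link href="http://example.org/structures/a"/>
  </entry>
  <entry>
    <title>Structure B</title>
    <link href="http://example.org/structures/b"/>
  </entry>
</feed>"""


def entry_links(atom_xml):
    """Return (title, href) pairs for every entry that has a link with an href."""
    root = ET.fromstring(atom_xml)
    pairs = []
    for entry in root.findall("atom:entry", ATOM_NS):
        title = entry.findtext("atom:title", default="", namespaces=ATOM_NS)
        link = entry.find("atom:link", ATOM_NS)
        href = link.get("href") if link is not None else None
        if href:
            pairs.append((title, href))
    return pairs


if __name__ == "__main__":
    for title, href in entry_links(SAMPLE_FEED):
        print(title, href)
```

This only reads the first link of each entry; a feed that paginates or attaches several files per entry would need more handling than is shown.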

The real point is that Chemspider wishes to use the data for a different purpose from which it was intended.

AJW> The problem is that stories keep getting made up about what we want. ALL I want to do is drive traffic to CrystalEye so that people who don’t know about it can use it. No more than that. I don’t get how trying to provide an integration path is so difficult. I’ll ask Jim to help.

That’s fine. But as Rich says it costs money. It’s unrealistic to expect we should carry out the conversion for a commercial company for free. We’d be happy to a mutually acceptable business proposition and it could probably be done by hiring a summer student.

AJW> I am interested in what commercial benefit integrating to CrystalEye can have. It’s work on our side. I’m not sure what a mutually acceptable business proposition would look like. It can’t be that much work to send us a set of InChI strings and URLs for the CrystalEye dataset…they already exist on CrystalEye. So, I’ll assume that this is a last comment on “No thanks to CrystalEye data in ChemSpider”. I have to ask why not put them in PubChem. Since PubChem is held as the standard of OpenData why not put CrystalEye there?
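To show how little is being asked for, here is a minimal sketch of the kind of linkout described above, assuming a hypothetical tab-separated file with one InChI string and one URL per line. The file layout, the function names and the example.org URLs are illustrative assumptions, not an existing CrystalEye export or ChemSpider code.

```python
import io


def load_linkouts(handle):
    """Build a dict mapping each InChI string to the list of URLs given for it."""
    linkouts = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or not fields[0].startswith("InChI="):
            continue  # skip blank, malformed or header lines
        inchi, url = fields[0].strip(), fields[1].strip()
        linkouts.setdefault(inchi, []).append(url)
    return linkouts


def links_for(linkouts, inchi):
    """Return the external URLs recorded for a structure, or an empty list."""
    return linkouts.get(inchi, [])


# Hypothetical sample data: methane appears in two entries, water in one.
SAMPLE = (
    "InChI=1/CH4/h1H4\thttp://example.org/crystaleye/entry-1\n"
    "InChI=1/H2O/h1H2\thttp://example.org/crystaleye/entry-2\n"
    "InChI=1/CH4/h1H4\thttp://example.org/crystaleye/entry-3\n"
)

LINKOUTS = load_linkouts(io.StringIO(SAMPLE))

if __name__ == "__main__":
    print(links_for(LINKOUTS, "InChI=1/CH4/h1H4"))
```

A record page would then only need to look up its own InChI string and render any returned URLs as links to the external site, with no structure data copied across.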

I continue to stress that CrystalEye is completely Open. If you want it enough and can make the investment then all the mechanisms are available. There’s a downloader and converters and they are all Open (though it may cost money to integrate them).

AJW> Just fyi ChemSpider has adopted Creative Commons licenses.

FWIW we are continuing to explore the ways in which CrystalEye is made available. We’re being funded by Microsoft as part of the OREChem project and the result of this could represent some of the way in which the Web technology is influencing scientific disciplines. We’d recommend that those interested in mashups and re-use in chemistry took a close look at RDF/SPARQL/CML/ORE as those are going to be standard in other fields.

AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.

I have spoken previously about the challenges of Scraping CrystalEye Content and staying in relationship with publishers. I have approached CAS and spoken with the Copyright team at ACS. In December of last year I spoke about the 5-month delay in getting ACS to discuss whether or not we could scrape CIF files from ACS journals directly. Well, I had a nice chat with two ACS people in New Orleans, one of them from ACS Pubs. We talked about ChemSpider and I answered a lot of questions about what we were doing, where we were going, how we are “funded” (we are not!) etc. Many pages of notes were taken. At the end of the meeting I asked the question “So, relative to my question about CrystalEye and scraping CIFs. Are Supplementary Data ok to scrape or not?”

The answer? “We haven’t made a decision yet. We need to discuss”.

Are crystal structures really that special? It’s been difficult to get JUST the structures associated with even Open Data. Now I’ve been waiting over 7 months for a question to be answered by ACS…and it’s binary. YES or NO.

At this point I give up. Peter Murray-Rust has had ACS CIFs scraped from their publications for a LONG time. And continues to scrape them. Cambridge University/Unilever School of Informatics didn’t get permission and have been very vocal about what they’ve done and no legal action re. copyright has been taken so I’ll assume it’s not an issue. If it’s not an issue then we can go ahead.

If we can go ahead then why wouldn’t we? We have…we already have scraped the collection of CIFs from ACS, from a broader range of ACS journals than CrystalEye taps into. It’s Supplementary Data, it’s non-copyrightable and now it’s ours to publish. We already support CIF displays on ChemSpider so what we need to do now is to mass convert/handle the data and deposit it onto ChemSpider. We also have the IUCR CIFs to deposit. I guess ChemSpider will soon become “CrystalEye 2” as we host the data. That said we are NOT crystallographers so I have an open request to the community for someone with interest/skills in crystallography to join our advisory group and support this effort. Feel free to ping me.


Over the past year ChemSpider has been challenged over the nature of our offering in terms of Open Data etc. A small number of people focused a lot of time talking about this while we remained focused on improving the website and having it available for people to use as a Free Access website. I spoke to Peter Suber about Open Access and then John Wilbanks about Creative Commons.

Since ChemSpider is the aggregate of a number of people’s work (including provision of software by collaborators) I had to get into conversation to see what licenses would be acceptable to those groups.

With the redesign of the website we have structured ourselves in a way that lets us add licenses as we see appropriate. So, as of today we have added the Creative Commons Attribution Share Alike 3.0 United States License and the appropriate logo is on all sections of a Record View except for the predicted properties. Once we get approval from our collaborators for this same license (and discussions are underway) then the whole record view will be licensed.

At that point, you are free:

  • to Remix — to make derivative works

Under the following conditions:

  • Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
  • Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.
