One blog I check out a few times per week is In the Pipeline, written by Derek Lowe. What makes Derek’s blog different, in my opinion, is that he has many years under his belt as a synthetic chemist in pharma companies. He watches what is going on in his industry and makes us aware of his opinions and those of others. He’s still active in the lab and makes us aware of the challenges of lab syntheses, often with some historical perspective on what it “was like” then versus now. Always well written and with plenty of reader feedback, In the Pipeline is likely one of the most frequented blogs out there.

Today I read about Schering-Plough’s thrombin receptor antagonist compound, SCH 530348. I am an SP shareholder and was rather disappointed by the recent news about Vytorin. Let’s hope the retrials provide new results. However, SCH 530348 looks more exciting according to Derek’s comments and talking to other people in the industry.

I searched ChemSpider for the structure since it seemed of interest. SCH 530348 is based on a natural product called himbacine and Derek had linked to it from his blog but, I assumed, he couldn’t find the structure of interest on the database. Neither could I. So, about 5 minutes work and it was on the database, with the link to the ASAP article and tagged etc as shown below.

Description

TRA-SCH 530348 is an oral antiplatelet drug under development by Schering-Plough for the treatment and prevention of atherothrombotic events in patients with Acute Coronary Syndrome, previous Myocardial Infarction, stroke, or existing peripheral arterial disease.

Tags

SCH 530348
Schering Plough
TRA-SCH 530348

Links & References

Samuel Chackalamannil, Yuguang Wang, William J. Greenlee, Zhiyong Hu, Yan Xia, Ho-Sam Ahn, George Boykow, Yunsheng Hsieh, Jairam Palamanda, Jacqueline Agans-Fantuzzi, Stan Kurowski, Michael Graziano, and Madhu Chintala. Discovery of a Novel, Orally Active Himbacine-Based Thrombin Receptor Antagonist (SCH 530348) with Potent Antiplatelet Activity, J. Med. Chem., ASAP Article
A potent series of thrombin receptor (PAR-1) antagonists based on the natural product himbacine is described.

Any user of ChemSpider can do this now. ANYONE. If you are interested in how, let me know! I have some documents already prepared to guide you through the process and need to update others, but the database can be expanded by everyone now…not quite Wikipedia but not bad at all! What do you think?


The past few days have been very interesting for ChemSpider and the discussions regarding the licensing of Open Data under Creative Commons. There have been a number of exchanges in the blogosphere commenting on what this means/should mean for Open Data.

data-should-be-public-domain-and-more-esoteric-blog-based

chemspider-good-intentions-and-the-fog-of-licensing

Does ChemSpider really violate Open Data with CC SA?

John Wilbanks replies to the ChemSpider/OpenData discussion

I am Still DELIGHTED with ChemSpider

more-on-the-science-exchance-or-building-and-capitalising-a-data-commons

The unfortunate public disclosure of an internal document regarding ChemSpider has resulted in some excellent public dialog about Open Data and licensing and I am glad that the outcome, while potentially very messy, has been controlled and appropriate dialog. I will not reiterate the details of each of the blogs but have generally commented on them directly. I thank John Wilbanks specifically for the gracious manner with which he addressed a potentially explosive situation.

This situation didn’t make me angry..just frustrated. ChemSpider has become a focus for attention over the past few months. Maybe it’s because we are doing the right things? Maybe it’s because we are challenging the status quo? Maybe it’s because people know me personally after 10 years in the commercial sector?

We are doing the best things we can - with no funding, “spare time resources” (all work is night and weekends by volunteers!), to make a difference and build a community. We have taken a lot of hits over the months and it has been distracting, especially for me who has chosen to remain engaged in a public discourse on the blogosphere. It is no longer healthy. It has distracted us from our mission..it is way too time-consuming and delivers little value to our users. Our users I care about. Our self-professed non-users I don’t.

We will likely pull all Creative Commons licenses shortly and resolve our “Openness” with a declaration as originally advised by Peter Suber. It was good advice then and is more pertinent now.

I have also posted a statement to Peter Murray Rust’s blog thanking him for his apology and declaring my intent. This may be misinterpreted, and will likely garner a reaction, but I hope my intent is clear here. If you deposit data to ChemSpider then your data remains yours. We are not assuming copyright. We will do our utmost to navigate the complexities of licensing as appropriate. We might fail. Our role is to serve our users.

We are getting out of serving any agendas other than our own…we are Building a Community for Chemists.

This is what I posted to Peter’s blog

“Peter, I thank you for the applause regarding our implementation of licensing on ChemSpider. I also acknowledge and accept the apology you have issued publicly to John Wilbanks, ChemSpider and members of the advisory board.

I believe that some good has likely come out of the conversations over the weekend - maybe a little more confusion, maybe a little more clarification (especially around John’s “data in the public domain” comments) and maybe a few more relationships. This latter part is especially of interest to me as we work on creating a community for chemists.

Now to the outcome for ChemSpider. ChemSpider went live in March of last year with a “who knows where it will go” approach. From the moment we went live you have paid attention. However, rarely has this been with any sense of support but, rather, a framework of negativity. You have criticized our science and our intent. You have projected your judgments as truths. I have addressed these judgments many times but rarely with acknowledgment from your side. It has been a lot of work for both of us. To be clear, I have judged your efforts around Open Notebook Science for NMR similarly.

ChemSpider appears to have a center spotlight now in terms of licensing and Open Data. I acknowledge these are significant parts of YOUR agenda and a key part of what you have worked on for many years. I judge your other agendas to be Open Access, Semantic Web and associated technologies. I honor your work in these areas and feel you have contributed and will continue to contribute to the ongoing shifts of Open science prevailing at present. Thank you.

Our agenda for ChemSpider is different. We are building a community for chemists. (Notice the recent shift from the original vision “Building a Structure Centric Community for Chemists” as we expand out of structures only.) At present, we are doing what we can to support the needs of chemists researching structure-based information. We are integrating information. We are more than a “linkbase”. We are actively supporting Open Notebook Science. We ARE listening to our users, the community, our collaborators and our advisory group. We have delivered a valuable solution in the past year with no cost to the users, to the tax-payers, with no grants and based on the hard work of a small dedicated team of volunteers only.

The past year has been very distracting for us, and me in particular, in terms of your comments and judgments about ChemSpider, and by association, about myself. I have tried to clean up a number of these on the ChemSpider blog (http://tinyurl.com/45acav) but acknowledge you might have different view points. These discussions have been draining and have distracted us from our core mission of serving our users.

At this point I need to withdraw from the dynamic we have co-created around our relative agenda(s). I honor and acknowledge what you are out to achieve. I wish for the same from you. I do not believe that our intentions are contrary or mutually exclusive to your own. I think that time will prove this to be true.

I believe that you judge our efforts to be in conflict with those of your WorldWide Molecular Matrix but I doubt that is true. I will respond shortly to some historical posts regarding your call for a structure collection for your eChemistry Project with Microsoft. We are willing to help and I am open to a discussion should you wish to collaborate. I am working with the Wikipedia:Chemistry team to build a validated SDF file for the public domain and we can make this available to you.”



I have essentially stayed away from my computer over the weekend pondering the Creative Commons discussions about ChemSpider. I’ve spent time playing with our twins, landscaping and “contemplating my navel”. After what happened at the end of last week regarding the shared opinions on our adoption of Creative Commons licenses I needed to spend some time away from the computer. My frustration level has been increasing of late since ChemSpider has been put in the spotlight for so many reasons, and very few of them positive. Even this is strange considering we have now topped 6000 users per day, we receive kudos and blessings from our users and, overall, I believe we are making a positive contribution to our intention to build a community for chemists.

Nevertheless, ChemSpider has become the “Free Access Chemistry Site du jour” for certain individuals to poke at. We appear to be the “model of concern” - lots of discussions going on about ChemSpider is “this” and “that”, we need to consider our business model, ChemSpider needs licenses and many other commentaries. Thanks for discussing us out there in the blogosphere. Would you like to talk TO US instead? I’d appreciate having someone advise us on licensing etc at a minimum.

Anyhow, I was in the mood for a good belly laugh tonight when someone pointed me to a recent post by Rich Apodaca entitled “Just a Flesh Wound”, which parodies the Semantic Web, RDF etc. using Monty Python and the Holy Grail’s infamous blood and gore scene. I say go visit the site and, if you’ve been watching the many discussions in this area, I think you’ll giggle too.


Like most users of WordPress we use Akismet to deal with our spam. Unfortunately it appears that it might be a little hungry and is chewing on some real comments too according to this blog post by John Wilbanks. John was responding to the Creative Commons on ChemSpider licensing discussion. I’ll comment more on that later.

Please know that I never willingly deny any posts made to the ChemSpider blog and moderate only to keep the “bad stuff” off of the site. If you try to post and it doesn’t show up within 48 hours (even I take downtime) please send the post to infoATchemspiderDOTcom and it will get posted on your behalf OR I will try and dig it out of Akismet. As you can see below though…we do get a LOT of Spam!


A picture tells it all. Look at the image below..the result of a Google search for ChemSpider. Click on the image to see the full size detail. The search gives almost 7.5 MILLION hits on ChemSpider. Hmmmm…is this the result of the very active discussion about ChemSpider and Creative Commons that has occurred over the weekend? It does seem to have generated a lot of interest. The reality…within a few minutes of this search it had settled back down to about 300,000 hits. Google was probably recalculating PageRank or something. Nevertheless, kind of fun to see!


There have been a number of discussions around the fact that we are adding Safety and Toxicity Data to ChemSpider. The comments section contains some cautions from Cameron Neylon about a disclaimer. We have had one online for a while but we have now updated the site and linked the disclaimer from above every Supplementary Info section. Thanks to Cameron for his support and suggestions in this area.

Disclaimer

For all documents, data and software available via this website and server, ChemZoo does not warrant or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed. Specifically, any information listed here should not be relied on for any assessment of the risks of any specific procedure or process. You should always carry out your own risk assessment for any experiment, procedure, or process. In particular the information here may not be correct or appropriate to local regulations, legal codes, practices or policies, or may be out of date. The authors will take no responsibility for any injury, loss, or damage caused in any manner whatsoever by the use of this information. Some ChemSpider web pages link to other web pages for the convenience of users. ChemZoo is not responsible for the availability or content of these web pages. ChemZoo does not endorse or warrant the services, products or information offered at these other web pages unless explicitly stated.

The ChemSpider website is maintained by ChemZoo Inc. For site security purposes we use software programs and algorithms to monitor traffic and to identify unauthorized attempts to upload or change information or otherwise cause damage to the ChemSpider service. In the event of unauthorized activities utilizing the ChemSpider server, information from these sources may be used to help identify an individual for prosecution.

The data on the ChemSpider web site are sourced from a number of contributors and collaborators. The majority of data were originally sourced from the NCBI-PubChem website and are made available under the explicit statements and disclaimers provided by NCBI. Specifically, NCBI places no restrictions on the use or distribution of the data contained within their database. ChemZoo have abided by the assumptions of NCBI. These are specifically “While some submitters of the original data (or the country of origin of such data) may claim patent, copyright, or other intellectual property rights in all or a portion of the data (that has been submitted), NCBI is not in a position to assess the validity of such claims and, therefore, cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.” In this regard, should any contributors to the NCBI-PubChem database wish to have their data removed from the ChemSpider website, please send a detailed request to info-at-chemspider-dot-com.

Revised: May 9th, 2008


ChemSpider has taken some thrashing over the past year. We’ve been hit on science (and proven our point many times), on Open Access versus Free Access statements, on whether or not we have Open Data or not. There has been encouragement to define what the data on our site is in terms of Open Data or not. We’ve adopted Open Data tags on deposited data from users after pressure there. When I’ve asked more about Open Data I have heard that it is not ratified at the same level as Creative Commons licenses and they would be better to use. A week ago we put up Creative Commons Licenses in what I hoped was a GOOD move for the ChemSpider site and would relax the criticism of our site and potentially receive their blessing and support.

We received a blessing for all of 72 hours. In his blog post Peter Murray-Rust was DELIGHTED with our decision to do this. I quote: “I am DELIGHTED to report that Chemspider has adopted a CC-SA licence for its data.” and espoused “PMR: This is wonderful. As far as I know Chemspider is the only commercial chemical information company offering data under this licence, which is completely compatible with the Open Knowledge Definition. (It is also BBB-compliant, though data and publications are different animals).”

I assumed therefore we’d done a good thing. There was no indication to me that our position was anything other than positive.

There has been a conversation going on in the blogosphere for a couple of weeks now about Strong and Weak Open Access. I’ve read, watched and simply let others share their opinions because they’ve been in Open Access discussions for a number of years and have more context, background and passion to stay engaged in these discussions. They ARE important discussions and will come to a conclusion.

It appears that “I” am confused by Creative Commons licenses. This is based on the fact that we had done a good thing and received a blessing, but 72 hours later I read yet another post, this time with a comment from John Wilbanks stating “I’d like to see a meaningful discussion of the risks of Share Alike and Attribution on data integration. Chemspider’s move to CC BY SA fits into this discussion nicely - it’s a total violation of the open data protocol we laid out at SC, which says “Don’t Use CC Licenses on Data” - but it does conform inside the broader OKD.”

Uh-oh. ChemSpider is in Total Violation of Creative Commons Licenses. As we say in Wales in times of distress … “Hell’s Bells” (My dad was a builder..if you believe he taught me to curse like that well….)

Peter followed it up with a comment “PMR: I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism - CC-SA is conformant to the OK definition, but difficult to operate for re-use). That’s why we use the OKF’s OpenData sticker on CrystalEye.” Hmmm…

Again, when I’ve asked about the OpenData sticker I’ve been informed that this is not yet ratified.

There have been many discussions about Openness I’ve been involved with..just one example here. It has been difficult. Openness and licensing remains confusing…see here an example and this is just about a blogsite!

So the question is what now? Do we remove Creative Commons Licenses? Do we adopt Open Data licenses or do we just get ourselves out of the middle of this entire confusing discussion until all is resolved and settled. And IF we remove CC licenses and don’t post other licenses I know we’ll get criticized for that too. But let’s be honest…we’ve been highlighted for NOT having licenses up to this point. Now we are highlighted FOR having them. Maybe we can hope that no press is bad press. I’ll await feedback on this post and make a decision about what to do in the next 48 hours. Blog away…


I am posting this in order to help one of my “neighbors”, IUPAC in Research Triangle Park. Their office is about 30 minutes from where I live. This is a beautiful area of the world and I encourage people to contact the Secretariat directly should you have an interest in this role.

Post-doctoral Position in Chemistry Informatics

Develop, implement, and support web based applications to enable IUPAC Staff and Committee members to work more effectively. The emphasis will be on development of tools for communication and collaboration to allow scientists working on IUPAC projects to accomplish their project goals while minimizing the need for travel. This will build on the new architecture of the IUPAC web site that uses XML technology to organize the information used by IUPAC members as well as the general scientific public. In addition, methods will be developed to organize and present IUPAC’s information, now contained in books and journal articles, to make it more accessible and more useful.

This position is located at the IUPAC Secretariat in Research Triangle Park, North Carolina, USA and will require considerable travel.

Required background: PhD or equivalent in Chemistry or a related discipline so as to combine a reasonable chemical knowledge with computing expertise; experience with SQL databases and XML coding; excellent written English and the ability to deal with multiple projects simultaneously.

Salary and benefits are competitive and will depend on experience and qualifications.

IUPAC was formed in 1919 by chemists from industry and academia. For almost nine decades, the Union has succeeded in fostering worldwide communications in the chemical sciences and in uniting academic, industrial and public sector chemistry in a common language. IUPAC is recognized as the world authority on chemical nomenclature, terminology, standardized methods for measurement, atomic weights and many other critically evaluated data. In more recent years, IUPAC has been pro-active in establishing a wide range of conferences and projects designed to promote and stimulate modern developments in chemistry, and also to assist in aspects of chemical education and the public understanding of chemistry.

More information about IUPAC and its activities is available at <www.iupac.org>.

Contact:

John W. Jost, Executive Director

IUPAC Secretariat

P.O. Box 13757

Research Triangle Park, NC 27709-3757, USA

E-mail: secretariatATiupacDOTorg


I had commented recently on my pleasant experiences of working with MDPI regarding Molbank articles. See the posts here and here. Since Peter Murray-Rust and I had both blogged on this issue (from different points of view), Dietrich Rordorf from MDPI went out of his way to make us both aware, via email, of a recent publication they had posted on their site:

“Just for your information and in reply to the blog posts regarding use of Creative Commons By Attribution License v3.0: We recently published an editorial “Changes Coming to MDPI Journals: Digital Object Identifier (DOI) and Creative Commons Attribution License” at http://www.mdpi.org/molecules/papers/13051079.pdf”

The paper is entitled “Changes Coming to MDPI Journals: Digital Object Identifier (DOI) and Creative Commons Attribution License” and speaks for itself. I recommend that interested parties read the entire paper and commend MDPI on their decisions. EXCELLENT news.


Last week I had a pleasant chat with a reporter from Nature magazine, a Mr Geoff Brumfiel. Geoff was interested in ChemSpider…what it was, how it ran, who used it, who supported it, who liked it, who curated it, who didn’t like it and so on.

The results of that discussion, and others he spoke to about ChemSpider, are here in his article.

Chemists spin a web of data p139
Chemspider website provides free information on millions of molecules.
Geoff Brumfiel
doi:10.1038/453139a
Full Text | PDF

It is a rule at Nature, at least for this type of article, that I could not see the article before it went to press and therefore I didn’t get the chance to proofread and comment. Geoff has accurately captured the spirit of our discussions but a few detailed clarifications are needed too. I have pasted in black the article content and in italics the clarification.

providing the community with an open-access source of chemical information

I giggled and commented please don’t say it’s Open Access. Say it’s Free Access. Say there are Open Data. And now we have Creative Commons licenses. But don’t say it’s Open Access, not Strong, not weak, not gold, not green. Just Free Access. No price barriers to usage.

Chemist Antony Williams is hoping to change this in a move likely to ruffle the feathers of the American Chemical Society.

I commented that we are not purposely in competition with anyone. It’s not what drives us to do this. Whether others see us to be competitive is for them not us. We don’t intentionally try to ruffle feathers. It doesn’t mean that what we are doing won’t ruffle feathers of course. Whether it’s ACS or others. It’s not the goal..it might be an outcome.

The modest project has made chemists interested in open access take notice — last week, the number of daily users of the site surpassed 5,000.

We have crossed 5500 users for the past two nights. The trend is positive.

“Other potential sources of information, such as Wikipedia, lack the algorithms needed to search chemicals according to their structure.”

Structure searching is “feasible” of course with InChI Strings, since an exact InChI String can be matched as plain text. But substructure searching isn’t, and Wikipedia is treated as a text-based search by almost all of its users.
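To illustrate the distinction, here is a minimal sketch of why text matching on InChI Strings supports exact-structure lookup but not substructure search. The page texts and the `pages_containing` helper are hypothetical stand-ins made up for this example, not anything Wikipedia or ChemSpider actually exposes; only the InChI Strings themselves are real.

```python
# Standard InChI Strings for three small molecules.
INCHI = {
    "benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "toluene": "InChI=1S/C7H8/c1-7-5-3-2-4-6-7/h2-6H,1H3",
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
}

# Hypothetical article texts: each page simply embeds its compound's InChI String.
PAGES = {name: f"{name.title()} is a compound. {inchi}" for name, inchi in INCHI.items()}


def pages_containing(query):
    """Plain text search: return the names of pages whose text contains the query."""
    return sorted(name for name, text in PAGES.items() if query in text)


# Exact-structure search works: the full InChI String is a unique text key.
exact_hits = pages_containing(INCHI["benzene"])

# A naive "substructure" attempt: search for the benzene layers without the prefix.
benzene_layers = INCHI["benzene"].split("/", 1)[1]
ring_hits = pages_containing(benzene_layers)

print(exact_hits)  # only the benzene page
print(ring_hits)   # still only benzene: toluene contains the ring but is not found
```

Toluene contains a benzene ring, yet its InChI String shares no usable text fragment with benzene's, which is why substructure search needs a structure-aware algorithm rather than string matching.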

“The site is maintained with modest profits from advertising and the work of about 30 active volunteers who double-check the data pulled in from outside.”

The original investment in hardware and software costs has finally been recouped. Modest profits? No one gets paid for the work we do. There is a phenomenal sweat equity investment in the platform numbering many thousands of hours to get here. We are indebted to the many software collaborators, providers of tools and the people curating and depositing to the system. There have BEEN about 30 active volunteers. Right now I would say the number of active depositors and curators is around 10. But it is growing. I hadn’t checked the number of REGISTERED users for a long time. We have over 1150 registered users…those who CAN login and curate data, deposit data, see new features etc. People do NOT have to register to use the site…but >1150 did. Wow. I didn’t know it was that many until I just checked (BIG SMILE)

““There’s an awful lot of chemical information, but there’s an awful lot of rubbish as well,” says Barrie Walker, a retired industrial chemist in Yorkshire, UK, who helps maintain the site.”

Don’t know whether Barrie said this or not. He IS an honest guy and he is our QUALITY GURU and we are proud that he is willing to give us his fine eyes. There IS garbage on the site still. But, after a year online and active curating, it has been much reduced. About 200 edits a day are made to the site: names changed/deleted/added, spectra/structures/URLs/Publications added etc. It’s quite the pace. We have cleaned up 100s of thousands of incorrect associations from the external data sources. It’s been and will remain an enormous task with an enormous payback for the community.

Williams adds that the site still has problems with certain searches. For example, it struggles to distinguish between isomers: molecules with the same chemical formula arranged in different structures.

We can distinguish isomers no problem. The PROBLEM is that there is a mixture of isomeric species submitted from multiple data sources and the data are mixed and intermingled in a way that means the user cannot get to the correct structure. Search taxol or Ginkgolide on the ChemSpider blog and read the multiple blog posts about this. We can of course search all isomers for a particular chemical formula…

“But Williams nevertheless believes that the service may be able to compete with for-profit services. “What I’m doing is highly disruptive,” he says. “I think it can be done and it needs to be done.”

I think what WE are doing…it’s not me..it’s we…is disruptive. In a good way. Many chemists will benefit. Will it have an impact on for-profit services? Yes, maybe. As an outcome but not as the target. Our team of people, both internal to ChemSpider’s development and Advisory Group, and the people we don’t even know who are cleaning and depositing into the system for their colleagues in the community, are creating a powerful resource for Chemists. The FOCUS of this effort is to Build a Structure Centric Community for Chemists. We will change that soon…the focus on Structure-Centric will broaden to cover Chemistry in general and to Build a Community for Chemists.

We are well on our way, and thanks go to Nature, and Geoff in particular, for exposing it. My comments above are not meant to detract from Geoff’s reporting abilities but it was a long discussion and some clarification statements are of value, I believe.


Peter Murray-Rust responded to my recent comments about a Free Lunch. There are a number of comments to be made and an exciting opportunity to use Open Data and linking from ChemSpider.

I’d asked the question about how many records there were on CrystalEye. In our world a unique record is a unique InChI, not so on CrystalEye and appropriately so as the crystal structure itself is presumably the unique record. Makes sense.

PMR> We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) does a good job on disambiguating by cell dimensions but this is not foolproof and indeed no method is.

What we will do with multiple crystal structures for a single chemical structure is link all unique crystal structures from the unique chemical structure. In this way people can query the chemical structure and find all associated analytical data - spectra and crystallographic files. If we were to list the number of unique depositions on ChemSpider I think we would be around 40 million depositions..an estimate though!
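The linking described above can be sketched as a simple grouping keyed on the InChI String. This is only an illustration under stated assumptions: the `link_by_inchi` function and the resource identifiers such as `crystal:entry-A` are hypothetical and do not reflect how ChemSpider or CrystalEye actually store their records.

```python
from collections import defaultdict


def link_by_inchi(depositions):
    """Group (inchi, resource) pairs so each unique InChI lists every linked resource."""
    linked = defaultdict(list)
    for inchi, resource in depositions:
        linked[inchi].append(resource)
    return dict(linked)


# Hypothetical depositions: several crystal structures and a spectrum for methane,
# and a single crystal structure for water.
depositions = [
    ("InChI=1S/CH4/h1H4", "crystal:entry-A"),
    ("InChI=1S/H2O/h1H2", "crystal:entry-B"),
    ("InChI=1S/CH4/h1H4", "crystal:entry-C"),
    ("InChI=1S/CH4/h1H4", "spectrum:nmr-1"),
]

records = link_by_inchi(depositions)

# Four depositions collapse to two unique chemical-structure records.
print(len(depositions), "depositions ->", len(records), "unique structures")
for inchi, resources in records.items():
    print(inchi, resources)
```

Counting records this way gives the number of unique chemical structures, while counting the depositions gives the larger number of individual entries, which is the difference between the two figures discussed above.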

PMR> The main duplication comes from the Crystallography Open Database which has about 45,000 structures.

I looked at the Crystallography Open Database this morning. It states on the home page “Updated daily: 68268 entries in the COD”. We may have an opportunity with the COD to link up to their data and reduce the need for us to host CIFs. Excellent…we’re all for reducing workload and providing links into other systems. It’s what we do.

PMR> The only thing stopping us putting them (AJW> The structures from CrystalEye) in Pubchem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon. When it’s finished it will be in RDF/CML.

This is great news. This means that after the summer we can download the data directly via PubChem and link up to CrystalEye that way. Perfect. We’ll stop working on integrating to CrystalEye now, wait for the integration path via PubChem and focus on other data sources. Thank you Peter, Nick, Andrew and Jim!!! That said, I don’t believe that PubChem will take CML; they will convert using their tools to produce their compatible formats, InChI being one of them. That will break organometallics etc. UNLESS PubChem are going to adopt CML now, and that would be an interesting positive shift in terms of a sign of support for the format. A strong positive. I’ll chat with the PubChem team so that if CML is coming we can consider adopting it in some way and be ready.

From my post “AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.”

PMR: Chicken and egg… :-) You won’t adopt it until other people adopt it and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents) as in semantic publishing and the results of text-mining.

It’s nice to know that ChemSpider has that type of influence now. It’s good to see it going into Accelrys’ software and I had heard that from Dan’s blog and had added the CML Blog to my reader. I’m definitely watching and willing to follow. We’re busy leading so many other things right now we’ll wait for adoption and then jump on it like a “hobo on a muffin”.


One of the blogs I really enjoy reading is Deepak Singh’s Business,Bytes,Genes and Molecules. Today there was a blog post about ChemSpider but something strange happened…I could ONLY read it in Google Reader. When I tried to navigate to the actual website it asked me to Save a file. See below.

It may be harmless but I’ve suffered enough at the hands of “bad files” to not grab it. Anyone else seeing this symptom? It’s in both browsers (IE and FF) and on two computers.

Anyhow, thankfully I can read it in Google Reader. There’s a point Deepak raises and I insert it here..

“On the web, data should be available as an addressable resource. The fact that data is available as RDF is great (and I wish more data was available as such). However, my personal preference is that data, especially open data, needs to be accompanied by APIs and bindings that allow the data to be accessed in a number of formats (not a dump per se). I think over time the acceptable formats will be established, much like XML/JSON/RSS have become the standard transport formats. The key aspect here are the business models. Is the business in providing a service on top of the data? For example for more than X number of API calls, there could be a fee associated.”

Just in case people have missed them we have a whole series of Web Services available already and they are being used. You can find details about them here:

Mass Spec Web Services

Taverna Hooks to ChemSpider Web Services for Metabolomics

Web Services Demo Pages and Example Code

Microsoft Hook Web Services into Infomesa

Waters Deliver Integration Via Web Services

There are more examples. We have thousands of calls a day using the Web Services at present and welcome more feedback on them!


Jean-Claude Bradley was “asked by the Institute for the Future to highlight a dozen ‘Signals’ that may point to new trends in science as part of the X2 Project”. He has listed his selections in a blog post and people are encouraged to vote. JC mentioned ChemSpider twice and I am honored and humbled that he feels our efforts deserve recognition.

JC has recognized our efforts in depositing analytical data on ChemSpider and our web services to generate InChIStrings and InChIKeys.


In a recent post about ChemSpider we’ve been accused of wanting a Free Lunch. I copy a segment of the post and comment with insertions.

“Data are normally produced for a particular purpose and to reuse them for another costs money. I’ll exemplify this by taking CrystalEye data - about 120,000 crystal structures and 1 million molecular fragments - which were aggregated, transformed and validated by Nick Day as part of his thesis. (BTW Nick is writing up - it’s a tribute to his work that CrystalEye runs without attention for months on end).

AJW> It is true…it is a tribute to Nick that CrystalEye can run for months without attention. Kudos. I am interested in how much pressure the site is under. How many searches/users in a day etc.? We find that our struggles in uptime (and these are negligible) are primarily based on stress on the servers. For nighttime users tonight things will have been slow…we deposited over 100,000 new molecules from 5 new data sources. That does create some slowness. We will hit about 40,000 transactions today. Our problems are ISP issues and powercuts. But we are also not in a University using thick pipes etc.

One comment…it was 130,000 structures according to a previous blog and has been expanding since then from daily depositions. Right now I would expect it to be 140,000 rather than 120,000. When we did try scraping the data our best estimate was about 90,000. We might have missed something in our scraping and it’s why we asked for a dump of the data.

The primary purpose of CrystalEye was to allow Nick to test the validity of QM calculations in high-throughput mode. It turned out that the collection might be useful so we have posted it as Open Data. To add to its value we have made it browsable by journal and article, searchable by cell dimensions, searchable by chemical substructure and searchable by bond-length. This is a fair range of what the casual visitor might wish to have available. Andrew Walkingshaw has transformed it into RDF and built a SPARQL endpoint with the help of Talis. It has a Jmol applet and 2D diagrams, and links back to the papers. So there is a lot of functionality associated with it.

AJW> The team has done a good job in putting the site together. The JMol applet is an excellent utility for us all to use and thanks to that team for sure! Egon has been challenging us to RDF the site and it’s on our list, but keeps getting pushed down based on other requests. Since he’s the only voice asking it will keep getting pushed down unfortunately.

This has come under some criticism to the effect that we haven’t really made it Openly available. For example Antony Williams (ChemSpider blog) writes (Acting as a Community Member to Help Open Access Authors and Publishers):

“This [interaction with MDPI] is contrary to some of my experiences with some other advocates of Open Data and Open Access where trying to get their “Open Data” is like pulling teeth.”

PMR: I assume this relates to CrystalEye - I don’t know of any other case.

AJW> There are other examples and he’s right. He doesn’t know of them and I’d prefer he not rant on my behalf so I’ll not name them.

Antony and I have had several discussions about CrystalEye - basically he would like to import it into his database (which is completely acceptable) but it’s not in the format he wants (multi-entry files in MDL’s SDF format, whereas CrystalEye is in CML and RDF).

AJW> To clarify, again. I DON’T want to import CrystalEye into ChemSpider. I DON’T! All I want is the set of structures and unique associated URLs so that users of ChemSpider can find that there is crystal structure information over on CrystalEye and can click the link and be on CrystalEye and get the benefit of Nick, Andrew and Peter’s work. I don’t want to reproduce their effort. I want to integrate to it. I’ve said it many times on Peter’s blog and on this one.
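To make the request concrete, here is a minimal sketch of the kind of link-out I have in mind, assuming the other site can export nothing more than a list of structure identifiers and their URLs. The InChI strings are just illustrative examples, the `example.org` URLs are placeholders rather than real CrystalEye addresses, and the function names are hypothetical.

```python
# Hypothetical (InChI, URL) pairs of the kind a crystal-structure site could export.
# The URLs are placeholders, not real CrystalEye addresses.
external_links = [
    ("InChI=1S/CH4/h1H4", "https://example.org/crystal/0001"),
    ("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3", "https://example.org/crystal/0002"),
    ("InChI=1S/CH4/h1H4", "https://example.org/crystal/0003"),
]


def build_link_index(pairs):
    """Group external URLs by InChI so one structure can point to many records."""
    index = {}
    for inchi, url in pairs:
        index.setdefault(inchi.strip(), []).append(url)
    return index


def links_for(index, inchi):
    """Return the external URLs for a structure, or an empty list if none exist."""
    return index.get(inchi.strip(), [])


link_index = build_link_index(external_links)

# A record page would simply render these as outbound links.
for url in links_for(link_index, "InChI=1S/CH4/h1H4"):
    print(url)
```

Nothing is copied except the identifier and the address. The crystal structure data itself stays on the originating site, and the user clicks through to see it there.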

This type of problem arises everywhere in the data world. For example the problem of converting between map coordinates (especially in 3D) can be enormous. As Rich says, it costs money. There is generally no escape from the cost, but certain approaches such as using standards such as XML and RDF can dramatically lower the costs. Nevertheless there is a cost. Jim Downing made this investment by creating an Atom feed mechanism so that CrystalEye could be systematically downloaded but I don’t think Chemspider has used this.

AJW> If Jim can contact me by email and provide me with detailed instructions to download the entire file of structures ONLY and their associated URLs that would be excellent. I’ll send the request to him tonight.
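For readers who have not worked with Atom feeds, a rough sketch of how a feed can be walked to collect one URL per entry is shown here. I have not seen the actual CrystalEye feed, so the feed content, the `rel="enclosure"` link to a data file and the function name are all assumptions for illustration. Only the Atom namespace and the Python standard library calls are real.

```python
import xml.etree.ElementTree as ET

# The Atom namespace is standard; everything in SAMPLE_FEED is made up.
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Hypothetical structure feed</title>
  <entry>
    <title>Structure A</title>
    <id>urn:example:structure-a</id>
    <link rel="alternate" href="https://example.org/structures/a"/>
    <link rel="enclosure" href="https://example.org/structures/a.cml"/>
  </entry>
  <entry>
    <title>Structure B</title>
    <id>urn:example:structure-b</id>
    <link rel="alternate" href="https://example.org/structures/b"/>
  </entry>
</feed>"""


def list_entries(feed_xml):
    """Return title, page URL and (optional) data-file URL for each feed entry."""
    root = ET.fromstring(feed_xml)
    entries = []
    for entry in root.findall("atom:entry", ATOM_NS):
        # An Atom link with no rel attribute is treated as rel="alternate".
        links = {
            link.get("rel", "alternate"): link.get("href")
            for link in entry.findall("atom:link", ATOM_NS)
        }
        entries.append({
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "page": links.get("alternate"),
            "data": links.get("enclosure"),
        })
    return entries


for item in list_entries(SAMPLE_FEED):
    print(item["title"], item["page"], item["data"])
```

A real harvest would also need to fetch the feed over HTTP and follow any paging links. Those details depend on how the feed is actually published.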

The real point is that Chemspider wishes to use the data for a different purpose from which it was intended.

AJW> The problem is that stories keep getting made up about what we want. ALL I want to do is drive traffic to CrystalEye so that people who don’t know about it can use it. No more than that. I don’t get how trying to provide an integration path is so difficult. I’ll ask Jim to help.

That’s fine. But as Rich says it costs money. It’s unrealistic to expect we should carry out the conversion for a commercial company for free. We’d be happy to consider a mutually acceptable business proposition and it could probably be done by hiring a summer student.

AJW> I am interested in what commercial benefit integrating to CrystalEye can have. It’s work on our side. I’m not sure what a mutually acceptable business proposition would look like. It can’t be that much work to send us a set of InChIStrings and URLs for the CrystalEye dataset..they already exist on CrystalEye. So, I’ll assume that this is a last comment on “No thanks to CrystalEye data in ChemSpider”. I have to ask why not put them in PubChem. Since PubChem is held as the standard of OpenData why not put CrystalEye there?

I continue to stress that CrystalEye is completely Open. If you want it enough and can make the investment then all the mechanisms are available. There’s a downloader and converters and they are all Open (though it may cost money to integrate them).

AJW> Just fyi ChemSpider has adopted Creative Commons licenses.

FWIW we are continuing to explore the ways in which CrystalEye is made available. We’re being funded by Microsoft as part of the OREChem project and the result of this could represent some of the way in which the Web technology is influencing scientific disciplines. We’d recommend that those interested in mashups and re-use in chemistry took a close look at RDF/SPARQL/CML/ORE as those are going to be standard in other fields.

AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.

I have spoken previously about the challenges of Scraping CrystalEye Content and staying in relationship with publishers. I have approached CAS and spoken with the Copyright team at ACS. In December of last year I wrote about the 5 month delay in getting an answer from ACS about whether or not we could scrape CIF files from ACS journals directly. Well, I had a nice chat with two ACS people in New Orleans, one of them from ACS Pubs. We talked about ChemSpider and I answered a lot of questions about what we were doing, where we were going, how we are “funded” (we are not!) etc. Many pages of notes were taken. At the end of the meeting I asked the question “So, relative to my question about CrystalEye and scraping CIFs. Are Supplementary Data ok to scrape or not?”

The answer? “We haven’t made a decision yet. We need to discuss”.

Are crystal structures really that special? It’s been difficult to get JUST the structures associated with even Open Data. Now I’ve been waiting over 7 months for a question to be answered by ACS…and it’s binary. YES or NO.

At this point I give up. Peter Murray-Rust has had ACS CIFs scraped from their publications for a LONG time. And continues to scrape them. Cambridge University/Unilever School of Informatics didn’t get permission and have been very vocal about what they’ve done and no legal action re. copyright has been taken so I’ll assume it’s not an issue. If it’s not an issue then we can go ahead.

If we can go ahead then why wouldn’t we? We have…we already have scraped the collection of CIFs from ACS, from a broader range of ACS journals than CrystalEye taps into. It’s Supplementary Data, it’s non-copyrightable and now it’s ours to publish. We already support CIF displays on ChemSpider so what we need to do now is to mass convert/handle the data and deposit onto ChemSpider. We also have the IUCR CIFs to deposit. I guess ChemSpider will soon become “CrystalEye 2” as we host the data. That said we are NOT crystallographers so I have an open request to the community for someone with interest/skills in crystallography to join our advisory group and support this effort. Feel free to ping me.


Over the past year ChemSpider has been challenged over the nature of our offering in terms of Open Data etc. A small number of people focused a lot of time talking about this while we remained focused on improving the website and having it available for people to use as a Free Access website. I spoke to Peter Suber about Open Access and then John Wilbanks about Creative Commons.

Since ChemSpider is the aggregate of a number of people’s work (including provision of software by collaborators) I had to get into conversation to see what licenses would be acceptable to those groups.

With the redesign of the website we have structured ourselves in a way to add licenses as we see appropriate now. So, as of today we have added the Creative Commons Attribution Share Alike 3.0 United States License and the appropriate logo is on all sections of a Record View except for the predicted properties. Once we get approval from our collaborators for this same license (and discussions are underway) then the whole record view will be Licensed.

At that point, you are free:

  • to Remix — to make derivative works

Under the following conditions:

  • Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
  • Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.


I am in catch up mode tonight reading a week long backlog of blog posts. I’ve caught up tonight with some of Peter’s posts about semantic chemical authoring (1,2). I’ll respond shortly with comments regarding our own efforts in pulling together the web. I agree with Peter that improved semantic chemical authoring tools are necessary but we are focused right now on doing what we can with what is already available online. What it takes is coding, some regular expressions, some visual inspection and work. Lots of work. More later…

One of the things we are working on is connecting blog posts and wiki pages to ChemSpider as evidenced by our work with Molecule of the Day and our integrations to TotallySynthetic posts on an ongoing basis. What we expect of the authors though is that they author with care. We are generally using name to structure conversion capabilities to generate the chemical structures for connecting to on ChemSpider. Paul Doherty at TotallySynthetic used to provide us with InChIStrings and InChIKeys to connect up to but stopped because it was a lot of work I believe. Molecule of the Day generally discusses fairly simple molecules relative to TotallySynthetic’s COMPLEX molecules. Manual inspection is unfortunately necessary even in the simplest of cases. And it IS time-consuming. Robots will gather information and, in my judgment, PROLIFERATE incorrect data unless someone is going to do the work to inspect OR the system provides a curation platform to quickly remove errors.

I blogged tonight on the ChemConnector blog about the importance of dashes and spaces in systematic names. It should be very clear from that post how important they are. It is a major challenge to use name to structure conversion tools on chemical names that are imperfect and do not represent the structure they are meant to represent. There needs to be respect for chemical names and as we move them from system to system, database to database we need to do our best to retain their integrity. This HAS BEEN a major challenge for us as we scrape data from variou