Archive for May, 2008

Over on my other blog I have recently posted some comments that may be of interest to ChemSpider Blog readers

Spaces, Dashes and Issues with Nomenclature Conversion

It appears that members of the text-mining for chemistry community are using one or more of the commercial name-to-structure software programs to convert chemical names to structures and, prior to feeding the algorithms, they are removing all white space from the names. In some cases they are also doing the same with dashes. How well is that going to work? Is it safe to remove spaces from chemical names and assume this has no effect? Is consideration being given more to the accuracy of the text-mining than to the nature of systematic nomenclature?

Let’s look at some examples of the result of removing spaces from chemical names. Consider the different results just from moving a space. READ MORE
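As a rough illustration of the concern, here is a minimal Python sketch of what a naive pre-processing step might do. The function name `strip_separators` and the name pair are my own illustrative choices, not taken from any particular text-mining or name-to-structure package. Phenyl acetate (the ester of phenol and acetic acid) and phenylacetate (derived from phenylacetic acid) are different species, yet they collapse to the same string once the space is removed.

```python
def strip_separators(name: str, drop_dashes: bool = False) -> str:
    """Remove all whitespace (and optionally dashes) from a chemical name.

    Hypothetical helper mimicking a naive clean-up step before conversion.
    """
    cleaned = "".join(name.split())
    if drop_dashes:
        cleaned = cleaned.replace("-", "")
    return cleaned


# Two names that refer to different chemical species.
names = ["phenyl acetate", "phenylacetate"]

# Group the original names by their space-stripped form.
collapsed = {}
for name in names:
    collapsed.setdefault(strip_separators(name), []).append(name)

# Any key with more than one original name is a collision.
collisions = {key: originals for key, originals in collapsed.items() if len(originals) > 1}
print(collisions)
```

Once the two names share one key, nothing downstream can tell which structure the author meant, which is why the clean-up step deserves as much scrutiny as the conversion itself.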

Hamburger PDFs and Making Them Structure Searchable

This is really just an FYI for the community, since there is a general assumption that Word documents and PDFs cannot be made structure-searchable. The truth is that both can be. How? Well, you need to write the correct information into the file to enable it, but it’s possible. There are a number of solutions out there allowing structure-based searching of Word document files. I believe the first one was originally from Oxford Molecular before being acquired by Accelrys. I think there are now several including, I believe, Cambridgesoft, ACD/Labs and probably others. READ MORE

Buy me a Coffee

We announced WiChempedia previously in terms of using our new dedicated website approach as a subset of ChemSpider. What we did there is show the leading part of the Wikipedia article and then ask people to click over to read the entire article on Wikipedia. We can support the entire article on ChemSpider if that is of interest to people but it’s easier, for now, to keep it as is. Comments?

What we are doing is trying to provide better integration to Wikipedia since I am working with Wikipedia on the curation of Wikipedia chemical structures and deeper integration seems appropriate. So, tonight we added the ability to Edit the Wikipedia article. In this way you can directly edit errors you might see in the lead of the article but you also get to edit the entire article if you are interested.

As an example see Taxol here. You will see the following Taxol lead for the article. Look at the END where you will see: Read more… or Edit at Wikipedia…

Paclitaxel is a mitotic inhibitor used in cancer chemotherapy. It was discovered in a National Cancer Institute program at the Research Triangle Institute in 1967 when Monroe E. Wall and Mansukh C. Wani isolated it from the bark of the Pacific yew tree, Taxus brevifolia, and named it ‘taxol’. When it was developed commercially by Bristol-Myers Squibb (BMS) the generic name was changed to ‘paclitaxel’ and the BMS compound is sold under the trademark ‘Taxol’. In this formulation paclitaxel is dissolved in Cremophor EL, a polyoxyethylated castor oil, as a delivery agent since paclitaxel is not soluble in water. A newer formulation, in which paclitaxel is bound to albumin as the delivery agent (Protein-bound paclitaxel), is sold commercially by Abraxis BioScience (http://www.abraxisbio.com) under the trademark Abraxane (http://www.abraxane.com). [Abraxane Drug Information, http://www.fda.gov/cder/foi/label/2005/021660lbl.pdf. Food and Drug Administration. January 7, 2005. Retrieved on March 9, 2007.] Paclitaxel is now used to treat patients with lung, ovarian, breast cancer, head and neck cancer, and advanced forms of Kaposi’s sarcoma. Paclitaxel is also used for the prevention of restenosis. Paclitaxel works by interfering with normal microtubule breakdown during cell division. Together with docetaxel, it forms the drug category of the taxanes. It was the subject of a notable total synthesis by Robert A. Holton. As well as offering substantial improvement in patient care, paclitaxel has been a relatively controversial drug. There was originally concern because of the environmental impact of its original sourcing, no longer used, from the Pacific yew. The assignment of rights, and even the name itself, to BMS were the subject of public debate and Congressional hearings. Read more… or Edit at Wikipedia…
There you have it…this type of integration is a joy to do. Literally a couple of minutes to make the connection and a few minutes to set the style et voila. Editing articles in Wikipedia.


American Chemical Society

I admit to not being fully knowledgeable in the details of CAS Numbers. If anyone has a short treatise regarding their history and breadth relative to generic/specific structures and “materials” I’d welcome getting pointed to it. That said, in the community in which I participate CAS Registry Numbers appear to be very confusing. One thing is for sure…the authority IS the Chemical Abstracts Service. They have the reference data collection of course.

In the public domain there is a “mess of data” and various parties attempting to use them for full effect. It’s a problem. In a recent letter to C&E News (May 5, 2008, Volume 86, Number 18, pp. 4-7) a Ms Deanna Morrow Hall, from Stone Mountain, Georgia commented on this confusion. I can’t paste the entire letter here because of copyright issues, of course, but will abstract.

The most common problem is the confusion between the number for the generic formula of a compound (intended to be used for a chemical entity when its exact composition is unknown or variable) versus the number for a compound of specific known formula.

She gave as an example, Propanol:

Propanol (generic formula) : 62309-51-7

1-propanol: 71-23-8

2-propanol: 67-63-0

First, a vendor (either in a product specification or in a material safety data sheet) uses 62309–51–7 as the registry number for one of the specific configurations. If a buyer uses the correct specific registry number to search for suppliers of one of the specific configurations, then he will not find that vendor.

Second, a vendor uses the correct specific registry number for one of the specific configurations. If a buyer uses 62309–51–7 to search for suppliers of one of the specific configurations, then he will not find that vendor.

Third, a vendor correctly uses 62309–51–7 to describe a mixture of the two specific configurations, but the buyer thinks he’s ordering one of the pure compositions.

Her letter concludes with

“Given that these errors occur with greater frequency than one might anticipate and are not trivial in their consequences, it seems appropriate that ACS should initiate a study to quantify the extent of the problem and to identify solutions to it.”

Access to Registry Numbers and just the related structure/material would be a great service to chemists. It would likely have an enormous impact on the ACS/CAS bottom line though. This is understandable. But what about the bottom line of communication between chemists? Ms. Hall’s examples are definitely real.
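The first two scenarios from the letter can be sketched in a few lines of Python. The vendor names and the catalogue below are entirely hypothetical; only the three propanol registry numbers come from the letter. The point is simply that an exact-match search on a registry number cannot connect the generic number to the specific ones.

```python
# Registry numbers quoted in the letter.
GENERIC_PROPANOL = "62309-51-7"
ONE_PROPANOL = "71-23-8"
TWO_PROPANOL = "67-63-0"

# Hypothetical catalogue: both vendors actually sell 1-propanol,
# but Vendor A lists it under the generic number.
vendor_catalogue = {
    "Vendor A": GENERIC_PROPANOL,
    "Vendor B": ONE_PROPANOL,
}


def find_vendors(catalogue: dict, registry_number: str) -> list:
    """Return vendors whose listed registry number matches exactly."""
    return sorted(vendor for vendor, rn in catalogue.items() if rn == registry_number)


# Scenario 1: the buyer searches with the correct specific number and misses Vendor A.
print(find_vendors(vendor_catalogue, ONE_PROPANOL))

# Scenario 2: the buyer searches with the generic number and misses Vendor B.
print(find_vendors(vendor_catalogue, GENERIC_PROPANOL))
```

Fixing this in software would need a trusted mapping between generic and specific numbers, and that mapping is exactly the information that sits with CAS.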

In the Wikipedia curation project outlined on this blog we have run into issues with validating CAS Numbers. Fortunately CAS have offered to help. The project is now rolling again after a hiatus and we are presently preparing 500 structures to upload…hopefully more. We definitely found errors and the validation process will be possible only with their help. What do we do moving forward though?
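One check that anyone can run without CAS is the check digit built into every Registry Number: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3… counting from the right) modulo 10. The Python sketch below is my own minimal version, with the function name `cas_check_digit_ok` chosen for illustration. It only catches typing and transcription errors. It says nothing about whether a number has been assigned to the right substance.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(rn: str) -> bool:
    """Return True if rn is well formed and its check digit is consistent."""
    match = CAS_PATTERN.match(rn.strip())
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(digits), start=1))
    return total % 10 == int(match.group(3))


for rn in ["62309-51-7", "71-23-8", "67-63-0", "71-23-9"]:
    print(rn, cas_check_digit_ok(rn))
```

The three propanol numbers from the letter pass and the deliberately altered 71-23-9 fails. A number can pass this test and still be attached to the wrong compound, so the check is a first filter and not a substitute for validation against the authority.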

In a recent post Peter Murray-Rust discussed creation of semantic chemical information. I have a lot of comments to make on that post but they must wait. I’ll focus on the CAS Numbers for now.

A good example is Wikipedia. (…..) relies on the “wisdom of crowds”, but I think it works well in chemistry. Chemspider has harnessed the wisdom of crowds but I suspect that only a very small fraction of their entries have been human-curated and I give an example below which seems to need attention.

The reality is that about 10X the number of chemicals on Wikipedia have been human-curated…I estimate about 50,000. Curation means what in this case? It means validation of the consistency between the structure displayed and the numerous identifiers allocated to that structure. We cannot validate predicted values of course. 50,000 human-curated records is significant.

Peter went on to discuss identifiers: “Identifiers. Potentially identifiers are the easiest and most powerful tool. An identifier is a unique string associated by an authority with a substance (not necessarily pure). If an authority(X) asserts that substance A(X) and substance B(X) have the same identifier then they can be said to be equivalent. There are many authorities making such assertions. Ultimately it is only the authority(X) who can make assertions about its identifiers. To be widely useful the authority should provide a lookup (resolution) service which is both human- and machine-accessible. In practice many authorities don’t do this or provide only a toll-access service. The identifiers are also often copyright and may or may not be copied. This often leads to other authorities(Y) who copy identifiers without permission and make their own assertions which may or may not be compatible with the authority(X). Frequently also the source of the identifier is not given. Thus many people who submit information to Pubchem give identifiers and these are listed as “[RN]” = registry number. For aspirin for example, there seem to be many identifiers - in the Chemspider entry all the following link through to Pubchem, e.g. 2349-94-2[RN], 26914-13-6[RN], 98201-60-6[RN]”

When Peter commented “I give an example below which seems to need attention” I think he was pointing to the fact that aspirin has many Registry Numbers: “there seem to be many identifiers - in the Chemspider entry all the following link through to Pubchem, e.g. 2349-94-2[RN], 26914-13-6[RN], 98201-60-6[RN]”. Maybe that wasn’t the issue. Either way it’s a great foundation to examine CAS Numbers.

Are three RNs on ChemSpider appropriate? Well, we already know that MULTIPLE RNs are okay based on Ms Morrow Hall’s comments. Is ChemSpider on target with these three?

Landolt-Bornstein’s Property Index is very well known. They have aspirin here. They list the following CAS Numbers: 50-78-2, 2349-94-2, 11126-35-5, 11126-37-7, 26914-13-6, 98201-60-6

An online MSDS sheet for Aspirin is here and lists the registry numbers: 50-78-2, 98201-60-6, 26914-13-6, 2349-94-2, 11126-35-5, 11126-37-7

The German Institute of Medical Documentation and Information lists Aspirin here and lists the following CAS Numbers: 50-78-2, 2349-94-2; 11126-35-5; 11126-37-7; 26914-13-6; 98201-60-6.

The RTECS database lists for Aspirin:

The Registry of Toxic Effects of Chemical Substances

Salicylic acid, acetate

CAS #: 50-78-2


ALT CAS #: 2349-94-2
ALT CAS #: 11126-35-5
ALT CAS #: 11126-37-7
ALT CAS #: 26914-13-6
ALT CAS #: 98201-60-6

For the MSDS Sheet and the German Institute the CAS Numbers are the same as Landolt-Bornstein…maybe they were sourced there?

Peter had listed only three RNs on ChemSpider: “2349-94-2[RN], 26914-13-6[RN], 98201-60-6[RN]”. Checking ChemSpider showed we actually had the following list there: one validated RN, 50-78-2 (the one declared as the Primary Number on the other sites), and the following list (NONE of them validated):

11126-35-5[RN]
11126-37-7[RN]
2349-94-2[RN]
26914-13-6[RN]
337376-15-5[RN]
98201-60-6[RN]

ALL of these are valid based on the other data sources EXCEPT for 337376-15-5, a totally unrelated compound detailed here. This one has been deleted using the usual synonyms curation process and the others approved.
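The cross-check described above amounts to a set comparison, and a minimal Python sketch of it follows. The registry numbers are the ones quoted in this post; the function name `split_by_support` and the dictionary layout are my own assumptions for illustration and do not reflect how ChemSpider's curation process is implemented.

```python
# Registry numbers for aspirin as quoted from each external source in this post.
SIX_NUMBERS = {"50-78-2", "2349-94-2", "11126-35-5", "11126-37-7", "26914-13-6", "98201-60-6"}
reference_sources = {
    "Landolt-Bornstein": set(SIX_NUMBERS),
    "MSDS": set(SIX_NUMBERS),
    "German Institute": set(SIX_NUMBERS),
    "RTECS": set(SIX_NUMBERS),
}

# The validated number plus the unvalidated list found on ChemSpider.
chemspider = SIX_NUMBERS | {"337376-15-5"}


def split_by_support(candidates: set, sources: dict) -> tuple:
    """Split candidates into those every source lists and those no source lists."""
    agreed = set.intersection(*sources.values())
    seen_anywhere = set.union(*sources.values())
    return candidates & agreed, candidates - seen_anywhere


supported, unsupported = split_by_support(chemspider, reference_sources)
print(sorted(unsupported))  # ['337376-15-5']
```

Agreement between secondary sources is only circumstantial evidence, since they may all have copied from one another. A number flagged as unsupported is a candidate for human review, not an automatic deletion.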

PubChem also lists ALL six Registry Numbers as shown below. There are those who believe registry numbers are not on PubChem. Not true.

50-78-2
11126-35-5
11126-37-7
2349-94-2
26914-13-6
98201-60-6

So, ChemSpider, PubChem, MSDS sheets and many others have a consistent set of 6 registry numbers for aspirin. Are they correct? Only CAS could confirm. I believe this shows that multiple CAS Numbers are appropriate. What I cannot comment on is what each one stands for. This brings us back to Ms Morrow Hall’s comments.

Moving forward how will we stop the proliferation of errors? How can we reduce the potential cost of mistakes made as a result of CAS Number miscommunications?


John Wilbanks opened his blog post regarding the Erosion of the Public Domain with the statement “This Chemspider licensing brouhaha is generating some needed discussions around open data, and something I keep hearing about is that it is GPL v. BSD all over again “. This relates to the recent blog post I posted regarding our renewed focus on our agenda of Building a Community for Chemists.

I cannot do justice to John’s manner of delivering his message. He hits the nail on the head. I quote “The public domain is not contractually constructed. It just is. It cannot be made more free, only less free. And if we start a culture of licensing and enclosing the public domain (stuff that is actually already free, like the human genome) in the name of “freedom” we’re playing a dangerous game.”

To be clear I never felt a need to put licenses on ChemSpider. People are using the content on ChemSpider, grabbing it, reusing it. We have provided web services to help people get more value out of the content. We will add more as time, resources and needs require it. The only reason we added licenses was pressure to do so. What was the pressure about? None of the USERS of the site ever put pressure on us and I don’t think CARE about licensing. They just use as is and seem happy to do so.

That said, I am looking for an education. Nay, REQUESTING it from people in the domain. Deepak Singh posted a comment on John’s blog post. “I do think that there is a lot of confusion around the differentiation around content (Creative Commons) and data (which is different). The data commons needs a different set of rules, and starting with a clear understanding of what Public Domain means and why it is a good thing.”

So, what is data, what is content? Is a structure and a series of chemical identifiers “Data”? Is a list of safety and toxicity information “Data”? Are a series of links to blog posts and articles “Data”? Wikipedia is defined as content I believe. So, out of all of this discussion my question is whether ChemSpider is Content or Data. (Yes…I have my own views already!)


There have been some follow-on comments from the recent Nature Article written about ChemSpider.

I was happy to see this comment from Jose Barros:

“As enthusiast of “Internet-aided Chemistry” subject I wish to congratulate Nature for mentioning the Chemspider initiative. To the best of our knowledge, Chemspider represents a reliable alternative for those who were not able to access commercial databases, thus contributing for scientific inclusion mainly in the less developed countries. As Chemspider grows up, it may also be used by the scientific community as a bargain tool for obtain better services or lower prices from suppliers of commercial databases.”

and a follow-on post tonight from Barrie Walker relative to his comment in the article. Originally quoted as saying “There’s an awful lot of chemical information, but there’s an awful lot of rubbish as well, says Barrie Walker, a retired industrial chemist in Yorkshire, UK, who helps maintain the site.”  Barrie added a follow up comment “comments used by Nature applied to chemistry on the internet rather than anything to do with ChemSpider. As one of ChemSpider’s master curaters, I am very supportive of the project, otherwise I would not be spending time editing the data.

I have known and worked with Tony for many years and I believe the project has a great future and with further development will see an increasing number of users.”

Comments with context have a whole different meaning. I wonder what the context was when Bob Massie from Chemical Abstracts Service compared the Golfing Industry with the Drug Industry? Likely that whole comment was taken out of context …


One blog I check out a few times per week is that of Derek Lowe, who writes In the Pipeline. What makes Derek’s blog different, in my opinion, is that he has many years under his belt as a synthetic chemist in pharma companies. He watches what is going on in his industry and makes us aware of his opinions and those of others. He’s still active in the lab and makes us aware of the challenges of lab syntheses, often, I find, with some historical perspective regarding what it “was like” then versus now. Always well written and with a high level of feedback, In the Pipeline is likely one of the most frequented blogs out there.

Today I read about Schering-Plough’s thrombin receptor antagonist compound, SCH 530348. I am an SP shareholder and was rather disappointed by the recent news about Vytorin. Let’s hope the retrials provide new results. However, SCH 530348 looks more exciting according to Derek’s comments and talking to other people in the industry.

I searched ChemSpider for the structure since it seemed of interest. SCH 530348 is based on a natural product called himbacine and Derek had linked to it from his blog but, I assumed, he couldn’t find the structure of interest on the database. Neither could I. So, about 5 minutes work and it was on the database, with the link to the ASAP article and tagged etc as shown below. The link is here.

Description

TRA-SCH 530348 is an oral antiplatelet drug under development by Schering-Plough for the treatment and prevention of atherothrombotic events in patients with Acute Coronary Syndrome, previous Myocardial Infarction, stroke, or existing peripheral arterial disease.

Tags

SCH 530348
Schering Plough
TRA-SCH 530348

Links & References

Samuel Chackalamannil, Yuguang Wang, William J. Greenlee, Zhiyong Hu, Yan Xia, Ho-Sam Ahn, George Boykow, Yunsheng Hsieh, Jairam Palamanda, Jacqueline Agans-Fantuzzi, Stan Kurowski, Michael Graziano, and Madhu Chintala . Discovery of a Novel, Orally Active Himbacine-Based Thrombin Receptor Antagonist (SCH 530348) with Potent Antiplatelet Activity, ASAP J. Med. Chem., ASAP Article
A potent series of thrombin receptor (PAR-1) antagonists based on the natural product himbacine is described.

Any user of ChemSpider can do this now. ANYONE. If you are interested in how, let me know! I have some documents already prepared to guide you through the process and need to update others, but the database can be expanded by everyone now…not quite Wikipedia but not bad at all! What do you think?


The past few days have been very interesting for ChemSpider and the discussions regarding the licensing of Open Data under Creative Commons. There have been a number of exchanges in the blogosphere commenting on what this means/should mean for Open Data.

data-should-be-public-domain-and-more-esoteric-blog-based

chemspider-good-intentions-and-the-fog-of-licensing

Does ChemSpider really violate Open Data with CC SA?

John Wilbanks replies to the ChemSpider/OpenData discussion

I am Still DELIGHTED with ChemSpider

more-on-the-science-exchance-or-building-and-capitalising-a-data-commons

The unfortunate public disclosure of an internal document regarding ChemSpider has resulted in some excellent public dialog about Open Data and licensing and I am glad that the outcome, while potentially very messy, has been controlled and appropriate dialog. I will not reiterate the details of each of the blogs but have generally commented on them directly. I thank John Wilbanks specifically for the gracious manner with which he addressed a potentially explosive situation.

This situation didn’t make me angry..just frustrated. ChemSpider has become a focus for attention over the past few months. Maybe it’s because we are doing the right things? Maybe it’s because we are challenging the status quo? Maybe it’s because people know me personally after 10 years in the commercial sector?

We are doing the best things we can - with no funding, “spare time resources” (all work is night and weekends by volunteers!), to make a difference and build a community. We have taken a lot of hits over the months and it has been distracting, especially for me who has chosen to remain engaged in a public discourse on the blogosphere. It is no longer healthy. It has distracted us from our mission..it is way too time-consuming and delivers little value to our users. Our users I care about. Our self-professed non-users I don’t.

We will likely pull all Creative Commons licenses shortly and resolve our “Openness” with a declaration as originally advised by Peter Suber. It was good advice then and is more pertinent now.

I have also posted a statement to Peter Murray Rust’s blog thanking him for his apology and declaring my intent. This may be misinterpreted, and will likely garner a reaction, but I hope my intent is clear here. If you deposit data to ChemSpider then your data remains yours. We are not assuming copyright. We will do our utmost to navigate the complexities of licensing as appropriate. We might fail. Our role is to serve our users.

We are getting out of serving any agendas other than our own…we are Building a Community for Chemists.

This is what I posted to Peter’s blog

“Peter, I thank you for the applause regarding our implementation of licensing on ChemSpider. I also acknowledge and accept the apology you have issued publicly to John Wilbanks, ChemSpider and members of the advisory board.

I believe that some good has likely come out of the conversations over the weekend - maybe a little more confusion, maybe a little more clarification (especially around John’s “data in the public domain” comments) and maybe a few more relationships. This latter part is especially of interest to me as we work on creating a community for chemists.

Now to the outcome for ChemSpider. ChemSpider went live in March of last year with a “who knows where it will go” approach. From the moment we went live you have paid attention. However, rarely has this been with any sense of support but, rather, a framework of negativity. You have criticized our science and our intent. You have projected your judgments as truths. I have addressed these judgments many times but rarely with acknowledgment from your side. It has been a lot of work for both of us. To be clear, I have judged your efforts around Open Notebook Science for NMR similarly.

ChemSpider appears to have a center spotlight now in terms of licensing and Open Data. I acknowledge these are significant parts of YOUR agenda and a key part of what you have worked on for many years. I judge your other agendas to be Open Access, Semantic Web and associated technologies. I honor your work in these areas and feel you have contributed and will continue to contribute to the ongoing shifts of Open science prevailing at present. Thank you.

Our agenda for ChemSpider is different. We are building a community for chemists (Notice the recent shift from the original vision “Building a Structure Centric Community for Chemists” as we expand out of structures only.) At present, we are doing what we can to support the needs of chemists researching structure-based information. We are integrating information. We are more than a “linkbase”. We are actively supporting Open Notebook Science. We ARE listening to our users, the community, our collaborators and our advisory group.We have delivered a valuable solution in the past year with no cost to the users, to the tax-payers, with no grants and based on the hard work of a small dedicated team of volunteers only.

The past year has been very distracting for us, and me in particular, in terms of your comments and judgments about ChemSpider, and by association, about myself. I have tried to clean up a number of these on the ChemSpider blog (http://tinyurl.com/45acav) but acknowledge you might have different view points. These discussions have been draining and have distracted us from our core mission of serving our users.

At this point I need to withdraw from the dynamic we have co-created around our relative agenda(s). I honor and acknowledge what you are out to achieve. I wish for the same from you. I do not believe that our intentions are contrary or mutually exclusive to your own. I think that time will prove this to be true.

I believe that you judge our efforts to be in conflict with those of your WorldWide Molecular Matrix but I doubt that is true. I will respond shortly to some historical posts regarding your call for a structure collection for your eChemistry Project with Microsoft. We are willing to help and I am open to a discussion should you wish to collaborate. I am working with the Wikipedia:Chemistry team to build a validated SDF file for the public domain and we can make this available to you. “


The attack of the Killer Rabbit (image via Wikipedia)

I have essentially stayed away from my computer over the weekend pondering the Creative Commons discussions about ChemSpider. I’ve spent time playing with our twins, landscaping and “contemplating my navel”. After what happened at the end of last week regarding the shared opinions on our adoption of Creative Commons licenses I needed to spend some time away from the computer. My frustration level has been increasing of late since ChemSpider has been put in the spotlight for so many reasons, and very few of them positive. Even this is strange considering we have now topped 6000 users per day, we receive kudos and blessings from our users and, overall, I believe we are making a positive contribution to our intention to build a community for chemists.

Nevertheless, ChemSpider has become the “Free Access Chemistry Site du jour” for certain individuals to poke at. We appear to be the “model of concern” - lots of discussions going on about ChemSpider is “this” and “that”, we need to consider our business model, ChemSpider needs licenses and many other commentaries. Thanks for discussing us out there in the blogosphere. Would you like to talk TO US instead? I’d appreciate having someone advise us on licensing etc at a minimum.

Anyhow, I was in the mood for a good belly laugh tonight when someone pointed me to a recent post by Rich Apodaca entitled “Just a Flesh Wound” and parodying the Semantic Web , RDFs etc. with Monty Python and the Holy Grail’s infamous blood and gore scene. I say go visit the site and, if you’ve been watching the many discussions in this area I think you’ll giggle too.


Like most users of WordPress we use Akismet to deal with our spam. Unfortunately it appears that it might be a little hungry and is chewing on some real comments too according to this blog post by John Wilbanks. John was responding to the Creative Commons on ChemSpider licensing discussion. I’ll comment more on that later.

Please  know that I never willingly deny any posts made to the ChemSpider blog and moderate only to keep the “bad stuff” off of the site. If you try to post and it doesn’t show up there in 48 hours (even I take downtime) please send the post to infoATchemspiderDOTcom and it will get posted on your behalf OR I will try and dig it out of Akismet. As you can see below though…we do get a LOT of Spam!


A picture tells it all. Look at the image below…the result of a search on ChemSpider on Google. Click on the image to see the full size detail. The search gives almost 7.5 MILLION hits on ChemSpider. Hmmmm…is this the result of the very active discussion about ChemSpider and Creative Commons that has occurred over the weekend? It does seem to have generated a lot of interest. The reality…within a few minutes of this search it had settled back down to about 300,000 hits. Google was probably recalculating PageRank or something. Nevertheless, kind of fun to see!


There have been a number of discussions around the fact that we are adding Safety and Toxicity Data to ChemSpider. The comments section contains some cautions from Cameron Neylon about a disclaimer. We have had one online for a while but we have now updated the site and linked the disclaimer from above every Supplementary Info section. Thanks to Cameron for his support and suggestions in this area.

Disclaimer

For all documents, data and software available via this website and server, ChemZoo does not warrant or assume any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed. Specifically, any information listed here should not be relied on for any assessment of the risks of any specific procedure or process. You should always carry out your own risk assessment for any experiment, procedure, or process. In particular the information here may not be correct or appropriate to local regulations, legal codes, practices or policies, or may be out of date. The authors will take no responsibility for any injury, loss, or damage caused in any manner whatsoever by the use of this information. Some ChemSpider Web pages link to other web pages for the convenience of users. ChemZoo is not responsible for the availability or content of these web pages. ChemZoo does not endorse or warrant the services, products or information offered at these other webpages unless explicitly stated.

The ChemSpider website is maintained by ChemZoo Inc. For site security purposes we use software programs and algorithms to monitor traffic and to identify unauthorized attempts to upload or change information or otherwise cause damage to the ChemSpider service. In the event of unauthorized activities utilizing the ChemSpider server, information from these sources may be used to help identify an individual for prosecution.

The data on the ChemSpider web site are sourced from a number of contributors and collaborators. The majority of data were originally sourced from the NCBI-Pubchem website and are made available under the explicit statements and disclaimers provided by NCBI. Specifically, NCBI places no restrictions on the use or distribution of the data contained within their database. ChemZoo have abided by the assumptions of NCBI. These are specifically “While some submitters of the original data (or the country of origin of such data) may claim patent, copyright, or other intellectual property rights in all or a portion of the data (that has been submitted). NCBI is not in a position to assess the validity of such claims and, therefore, cannot provide comment or unrestricted permission concerning the use, copying, or distribution of the information contained in the molecular databases.” In this regard should any contributors to the NCBI-Pubchem database wish to have their data removed from the ChemSpider website please send a detailed request to info-at-chemspider-dot-com.

Revised: May 9th, 2008


ChemSpider has taken some thrashing over the past year. We’ve been hit on science (and proven our point many times), on Open Access versus Free Access statements, and on whether or not we have Open Data. There has been encouragement to define the data on our site as Open Data or not. We’ve adopted Open Data tags on deposited data from users after pressure there. When I’ve asked more about Open Data I have heard that it is not ratified at the same level as Creative Commons licenses and that they would be better to use. A week ago we put up Creative Commons Licenses in what I hoped was a GOOD move for the ChemSpider site, one that would relax the criticism of our site and potentially earn their blessing and support.

We received a blessing for all of 72 hours. In his blog post Peter Murray-Rust was DELIGHTED with our decision to do this. I quote: “I am DELIGHTED to report that Chemspider has adopted a CC-SA licence for its data.” and espoused “PMR: This is wonderful. As far as I know Chemspider is the only commercial chemical information company offering data under this licence, which is completely compatible with the Open Knowledge Definition. (It is also BBB-compliant, though data and publications are different animals).”

I assumed therefore we’d done a good thing. There was no indication to me that our position was anything other than positive.

There has been a conversation going on in the blogosphere for a couple of weeks now about Strong and Weak Open Access. I’ve read, watched and simply let others share their opinions because they’ve been in Open Access discussions for a number of years and have more context, background and passion to stay engaged in these discussions. They ARE important discussions and will come to a conclusion.

It appears that “I” am confused by Creative Commons licenses. This is based on the fact that for 72 hours we had done a good thing and got a blessing, but 3 days later I read yet another post, this time with a comment from John Wilbanks stating “I’d like to see a meaningful discussion of the risks of Share Alike and Attribution on data integration. Chemspider’s move to CC BY SA fits into this discussion nicely - it’s a total violation of the open data protocol we laid out at SC, which says ‘Don’t Use CC Licenses on Data’ - but it does conform inside the broader OKD.”

Uh-oh. ChemSpider is in Total Violation of Creative Commons Licenses. As we say in Wales in times of distress… “Hell’s Bells” (My dad was a builder… if you believe he taught me to curse like that, well…)

Peter followed it up with a comment “PMR: I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism - CC-SA is conformant to the OK definition, but difficult to operate for re-use). That’s why we use the OKF’s OpenData sticker on CrystalEye.” Hmmm…

Again, when I’ve asked about the OpenData sticker I’ve been informed that this is not yet ratified.

I have been involved in many discussions about Openness; just one example is here. It has been difficult. Openness and licensing remain confusing… see an example here, and this is just about a blog site!

So the question is what now? Do we remove Creative Commons Licenses? Do we adopt Open Data licenses, or do we just get ourselves out of the middle of this entire confusing discussion until all is resolved and settled? And IF we remove CC licenses and don’t post other licenses I know we’ll get criticized for that too. But let’s be honest… we’ve been highlighted for NOT having licenses up to this point. Now we are highlighted FOR having them. Maybe we can hope that no press is bad press. I’ll await feedback on this post and make a decision about what to do in the next 48 hours. Blog away…


I am posting this in order to help one of my “neighbors”, IUPAC in Research Triangle Park. Their office is about 30 minutes from where I live. This is a beautiful area of the world and I encourage people to contact the Secretariat directly should you have an interest in this role.

Post-doctoral Position in Chemistry Informatics

Develop, implement, and support web based applications to enable IUPAC Staff and Committee members to work more effectively. The emphasis will be on development of tools for communication and collaboration to allow scientists working on IUPAC projects to accomplish their project goals while minimizing the need for travel. This will build on the new architecture of the IUPAC web site that uses XML technology to organize the information used by IUPAC members as well as the general scientific public. In addition, methods will be developed to organize and present IUPAC’s information, now contained in books and journal articles, to make it more accessible and more useful.

This position is located at the IUPAC Secretariat in Research Triangle Park, North Carolina, USA and will require considerable travel.

Required background: PhD or equivalent in Chemistry or a related discipline so as to combine a reasonable chemical knowledge with computing expertise; experience with SQL databases and XML coding; excellent written English and the ability to deal with multiple projects simultaneously.

Salary and benefits are competitive and will depend on experience and qualifications.

IUPAC was formed in 1919 by chemists from industry and academia. For almost nine decades, the Union has succeeded in fostering worldwide communications in the chemical sciences and in uniting academic, industrial and public sector chemistry in a common language. IUPAC is recognized as the world authority on chemical nomenclature, terminology, standardized methods for measurement, atomic weights and many other critically evaluated data. In more recent years, IUPAC has been pro-active in establishing a wide range of conferences and projects designed to promote and stimulate modern developments in chemistry, and also to assist in aspects of chemical education and the public understanding of chemistry.

More information about IUPAC and its activities is available at <www.iupac.org>.

Contact:

John W. Jost, Executive Director

IUPAC Secretariat

P.O. Box 13757

Research Triangle Park, NC 27709-3757, USA

E-mail: secretariatATiupacDOTorg


I had commented recently on my pleasant experiences of working with MDPI regarding Molbank articles. See the posts here and here. Since Peter Murray-Rust and I had both blogged on this issue (from different points of view), Dietrich Rordorf from MDPI went out of his way to make us both aware, via email, of a recent publication they had posted on their site:

“Just for your information and in reply to the blog posts regarding use of Creative Commons By Attribution License v3.0: We recently published an editorial “Changes Coming to MDPI Journals: Digital Object Identifier (DOI) and Creative Commons Attribution License” at http://www.mdpi.org/molecules/papers/13051079.pdf”

The paper is entitled “Changes Coming to MDPI Journals: Digital Object Identifier (DOI) and Creative Commons Attribution License” and speaks for itself. I recommend that interested parties read the entire paper and commend MDPI on their decisions. EXCELLENT news.


Last week I had a pleasant chat with a reporter from Nature magazine, a Mr Geoff Brumfiel. Geoff was interested in ChemSpider… what it was, how it ran, who used it, who supported it, who liked it, who curated it, who didn’t like it and so on.

The results of that discussion, and others he spoke to about ChemSpider, are here in his article.

Chemists spin a web of data p139
Chemspider website provides free information on millions of molecules.
Geoff Brumfiel
doi:10.1038/453139a

It is a rule at Nature, at least for this type of article, that I could not see the article before it went to press and therefore I didn’t get the chance to proofread and comment. Geoff has accurately captured the spirit of our discussions but a few detailed clarifications are needed too. I have pasted in black the article content and in italics the clarification.

providing the community with an open-access source of chemical information

I giggled and commented please don’t say it’s Open Access. Say it’s Free Access. Say there are Open Data. And now we have Creative Commons licenses. But don’t say it’s Open Access, not Strong, not weak, not gold, not green. Just Free Access. No price barriers to usage.

Chemist Antony Williams is hoping to change this in a move likely to ruffle the feathers of the American Chemical Society.

I commented that we are not purposely in competition with anyone. It’s not what drives us to do this. Whether others see us as competitive is for them to decide, not us. We don’t intentionally try to ruffle feathers. That doesn’t mean that what we are doing won’t ruffle feathers, of course, whether it’s ACS or others. It’s not the goal… it might be an outcome.

The modest project has made chemists interested in open access take notice — last week, the number of daily users of the site surpassed 5,000.

We have crossed 5500 users for the past two nights. The trend is positive.

“Other potential sources of information, such as Wikipedia, lack the algorithms needed to search chemicals according to their structure.”

Structure searching is “feasible” of course with InChI Strings. But substructure searching isn’t, and Wikipedia is treated as a text-based search by almost all of its users.
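As a minimal sketch of why exact structure search is “feasible” with InChI Strings: an InChI is plain text, so looking up one exact structure reduces to a string match. The article titles, the `ARTICLES` table and the function names below are hypothetical, and the Standard InChI strings are there for illustration only; nothing here reflects how Wikipedia or ChemSpider actually store or search their data.

```python
# Hypothetical data: article title -> InChI string for the compound it describes.
ARTICLES = {
    "Ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "Dimethyl ether": "InChI=1S/C2H6O/c1-3-2/h1-2H3",
    "Benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
}


def build_index(articles):
    """Invert the table so each InChI string maps to the titles that carry it."""
    index = {}
    for title, inchi in articles.items():
        index.setdefault(inchi, []).append(title)
    return index


def find_exact(index, inchi):
    """Exact-structure lookup: nothing more than a text match on the InChI."""
    return index.get(inchi.strip(), [])


index = build_index(ARTICLES)
hits = find_exact(index, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
```

Substructure search has no such shortcut. It needs graph matching against every candidate structure, which a text index alone cannot provide.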

“The site is maintained with modest profits from advertising and the work of about 30 active volunteers who double-check the data pulled in from outside.”

The original investment in hardware and software costs has finally been recouped. Modest profits? No one gets paid for the work we do. There is a phenomenal sweat equity investment in the platform, numbering many thousands of hours to get here. We are indebted to the many software collaborators, providers of tools and the people curating and depositing to the system. There have BEEN about 30 active volunteers. Right now I would say the number of active depositors and curators is around 10. But it is growing. I hadn’t checked the number of REGISTERED users for a long time. We have over 1150 registered users… those who CAN login and curate data, deposit data, see new features etc. People do NOT have to register to use the site… but >1150 did. Wow. I didn’t know it was that many until I just checked (BIG SMILE)

““There’s an awful lot of chemical information, but there’s an awful lot of rubbish as well,” says Barrie Walker, a retired industrial chemist in Yorkshire, UK, who helps maintain the site.”

Don’t know whether Barrie said this or not. He IS an honest guy and he is our QUALITY GURU and we are proud that he is willing to give us his fine eyes. There IS garbage on the site still. But, after a year online and active curation, it has been much reduced. About 200 edits a day are made to the site: names changed/deleted/added, spectra/structures/URLs/Publications added etc. It’s quite the pace. We have cleaned up hundreds of thousands of incorrect associations from the external data sources. It’s been and will remain an enormous task with an enormous payback for the community.

Williams adds that the site still has problems with certain searches. For example, it struggles to distinguish between isomers: molecules with the same chemical formula arranged in different structures.

We can distinguish isomers no problem. The PROBLEM is that there is a mixture of isomeric species submitted from multiple data sources, and data are mixed and intermingled in a way that the user cannot get to the correct structure. Search taxol or Ginkgolide on the ChemSpider blog and read the multiple blog posts about this. We can of course search all isomers for a particular chemical formula…
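A rough sketch of that formula-level search, assuming each record carries an InChI string: the first layer after the version prefix is the molecular formula, so isomers share that layer and differ in the layers after it. The records and function names are hypothetical and are not ChemSpider’s actual implementation.

```python
from collections import defaultdict

# Hypothetical records: compound name -> InChI string.
RECORDS = {
    "Ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "Dimethyl ether": "InChI=1S/C2H6O/c1-3-2/h1-2H3",
    "Benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
}


def formula_layer(inchi):
    """Return the formula layer, e.g. 'InChI=1S/C2H6O/c1-2-3/...' -> 'C2H6O'."""
    return inchi.split("/")[1]


def group_by_formula(records):
    """Group compound names by formula; each group holds the isomers on file."""
    groups = defaultdict(list)
    for name, inchi in records.items():
        groups[formula_layer(inchi)].append(name)
    return {formula: sorted(names) for formula, names in groups.items()}


isomers = group_by_formula(RECORDS)
```

Here ethanol and dimethyl ether land in the same C2H6O group while keeping distinct InChI strings, which is the sense in which isomers can be both found together and told apart.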

“But Williams nevertheless believes that the service may be able to compete with for-profit services. “What I’m doing is highly disruptive,” he says. “I think it can be done and it needs to be done.””

I think what WE are doing… it’s not me, it’s we… is disruptive. In a good way. Many chemists will benefit. Will it have an impact on for-profit services? Yes, maybe. As an outcome but not as the target. Our team of people, both internal to ChemSpider’s development and Advisory Group, and the people we don’t even know who are cleaning and depositing into the system for their colleagues in the community, are creating a powerful resource for Chemists. The FOCUS of this effort is to Build a Structure Centric Community for Chemists. We will change that soon… the focus on Structure-Centric will be to cover Chemistry in general and to Build a Community for Chemists.

We are well on our way, and thanks go to Nature, and Geoff in particular, for exposing it. My comments above are not meant to detract from Geoff’s reporting abilities, but it was a long discussion and some clarification statements are of value, I believe.


Peter Murray-Rust responded to my recent comments about a Free Lunch. There are a number of comments to be made and an exciting opportunity to use Open Data and linking from ChemSpider.

I’d asked the question about