Archive for the Vision Category

We have previously described initial steps, taken with IDBS, to integrate ChemSpider with electronic lab notebooks (ELNs) and to define the elnItemManifest metadata model.

We have now also taken further steps to integrate ChemSpider with Southampton University’s ELN, LabTrove, following on from an eScience tool that Stephen Wan from CSIRO had developed with the University of New South Wales to text mine LabTrove ELN blog posts, identify chemical names and link these to the relevant ChemSpider compounds. LabTrove is an open source blog-based system which can be used for recording and sharing experimental findings. Previously, if an image of a compound was to be added to an experiment blog post, it was necessary either to upload it as an image (after drawing it in a separate drawing package) or to paste in a link to the image on another website (after a separate internet search in another browser window). We have now added the ability to click a button directly when adding or editing an experiment to launch a search of ChemSpider. When the required compound is found, an image of it can be added to the blog post simply by clicking on it, as can be seen in this demonstration video:

The editing controls in LabTrove are based on TinyMCE, a WYSIWYG editor which is used in a range of blogs, including WordPress. This means that the same ChemSpider plugin can also be used to insert compound images from ChemSpider into any other blog or website that uses a TinyMCE editor.

If you have a LabTrove installation which you would like to add the ChemSpider plugin to then simply update your installation with the latest source code from LabTrove’s SourceForge website.

If you have a website or blog which uses a TinyMCE editor which you would like to add the ChemSpider plugin to then simply download this zip file, extract the folder in it and move the “chemspider” directory created to your tinymce plugins folder. Then, in your tinymce initialization process, add the plugin “chemspider” and the button “chemspider”.
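As a rough sketch of that last step, here is how the plugin and button might be added to an existing configuration. It assumes a TinyMCE 3.x-style setup in which plugins and toolbar buttons are comma-separated strings. The helper `withChemSpider` and the starting values are made up for illustration; `plugins` and `theme_advanced_buttons1` are TinyMCE 3 option names, so check them against the version you run (TinyMCE 4 and later use `toolbar` instead).

```typescript
// Minimal sketch: add the "chemspider" plugin and button to an existing
// TinyMCE 3.x-style configuration. `withChemSpider` is a hypothetical helper,
// not part of TinyMCE or of the ChemSpider plugin.
interface EditorConfig {
  mode: string;
  theme: string;
  plugins: string;                  // comma-separated plugin names
  theme_advanced_buttons1: string;  // comma-separated toolbar buttons
}

// Append `item` to a comma-separated list unless it is already present.
function appendItem(list: string, item: string): string {
  const items = list.split(",").map((s) => s.trim()).filter((s) => s.length > 0);
  if (items.indexOf(item) === -1) {
    items.push(item);
  }
  return items.join(",");
}

function withChemSpider(config: EditorConfig): EditorConfig {
  return {
    ...config,
    plugins: appendItem(config.plugins, "chemspider"),
    theme_advanced_buttons1: appendItem(config.theme_advanced_buttons1, "chemspider"),
  };
}

// Example starting configuration (values are placeholders).
const baseConfig: EditorConfig = {
  mode: "textareas",
  theme: "advanced",
  plugins: "table,paste",
  theme_advanced_buttons1: "bold,italic,underline,link,image",
};

const editorConfig = withChemSpider(baseConfig);
// In the page itself you would then call: tinyMCE.init(editorConfig);
```

The helper only appends to whatever lists you already have, so applying it twice does not add the plugin or button a second time.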

I’m sure that by now everyone has noticed that the ChemSpider homepage design changed just over a month ago. A few features moved around, the Molecules of Interest section was retired and perhaps most significantly the Search box was given a dose of CSID: 5791, becoming bigger and more prominent.

The reason for this wasn’t just to make the site more attractive (though I think it does look ‘prettier’). Our motivation for the change is to deliver a site that is easier for users to interact with and understand, and by doing so, hopefully make it quicker and simpler for you to get your tasks done using ChemSpider. The refresh of the homepage is hopefully illustrative of this: we think that, as most users come to ChemSpider to search for information, it should be easy to get straight into a search, hence the greater emphasis on this feature.

In the next few days we will release another upgrade to the interface which is centered on making it easier to understand the data presented in the compound Record View pages. I’ll post a blog entry dealing with some of the key features in the next few days.

The development of ChemSpider is an ongoing process, and we are aware that even after this upgrade there will be aspects of the compound Record View pages that will need more work (and also other parts of the site that still need development). It’s not going to be easy: ChemSpider brings together a rich and varied set of data from a large number of sources – this poses many challenges. We also realise that there are many different tasks that each of you – as users – want to perform, and it is always going to be difficult to reconcile all of the different opinions/needs.

However, we are trying to make the site better for you. And therefore, we’d really like to know your opinions on the changes (please test new features for a few days first). We welcome your feedback on the redesign either in the form of blog comments or email feedback (chemspider-at-rsc.org).

Over the next week – keep your eyes peeled for the upgrade and my accompanying blog post which will endeavor to give you a good introduction to the new features.

I recently talked about ChemSpider and some of our recent and future developments at the STM Innovations Seminar 2010, in a flash session of 20-slide talks timed @ 15 secs a slide. Great fun to do, and a format which ensures much less suffering for the audience – or at least shorter but more concentrated suffering. Anyway, here’s ChemSpider in 5 minutes for an audience of non-chemists. The other slides and videos from the Seminar are also available. Thanks to STM for hosting the event and River Valley for filming it.

Last night I gave a presentation at the BAGIM meeting in Boston. The abstract is below, together with the embedded presentation from Slideshare.

ChemSpider – Is This The Future of Linked Chemistry on the Internet?
ChemSpider was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge. There are now hundreds of chemical structure databases such as literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data etc. and no single way to search across them. Despite the diversity of databases available online their inherent quality, accuracy and completeness are lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of almost 25 million chemical substances, grows daily, and is integrated with over 400 sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for a linked web for chemistry and to provide access to a set of online tools and services to support access to these data.

ChemSpider has been around for about two and a half years. Based on the feedback we have received from the community regarding our humble offering, users like it. In most cases they “get it” too. They understand that we are working to provide information to them that can assist their work. We are hoping to provide some glimpse of data, some snippet of information, some link of value which can enable their studies/research/inquiry. And, in some cases, people want more. Let’s be honest…WE want more. We want to deliver more value, provide more impact and integrate more data for you, the community.

Some of the things that we have been asked for over the past few months are more web services to tap into the experimental data available on ChemSpider (grabbing experimental properties for QSAR modeling, for example), more reaction syntheses to peruse, improved speed for substructure searching, similarity searching, integration with more publishers’ literature and an easier-to-navigate website. Good list!

We have a set of priorities for the near term and will be doing our utmost to deliver them in time for the IUPAC congress in Glasgow in August and the ACS meeting in Washington later that month. But we want to hear from you. What do you, our users, want to see on ChemSpider? If you had your wishes, and resources were no object, there were no barriers to integration with any data source and you got to define the path forward for ChemSpider, what would it be?

Feel free to share it here on the blog or, if you’d prefer to be more anonymous with your comments, feel free to drop me an email at infoATchemspiderDOTcom. We want your input. Please don’t be shy…engage us and you might just get what you want (though some things might take a while!)


We’ve received a lot of kudos, congratulations and praise for our decision to become a part of the RSC. We thank everyone who has gone out of their way to acknowledge the shift in our circumstances. We did have some concern that some people would judge us on “selling out” rather than going it alone. Based on the feedback to date our worries were unfounded.

Tonight the comments of Warren DeLano, developer of the Open Source platform PyMOL (more details here), truly struck a chord with me. His comments are below.

“DeLano Scientific LLC congratulates Antony Williams et al. on the acquisition of ChemSpider by the Royal Society of Chemistry. This historic event provides a compelling example of how an independent open-minded project (open-access, open-data, open-source, etc.) can increase its resources and extend its longevity without compromising on its core mission, as is always necessary when a project “sells out” to a for-profit company beholden to narrow fiduciary objectives.

We hope that the ChemSpider / RSC example will both inspire more open-minded individuals to strike out on their own with similarly ambitious efforts and encourage various non-profit and government entities to actively recruit successful projects back into “the establishment” in ways which do not compromise project integrity and yet can enable even greater long-term positive societal impacts.”

Specifically the statement “without compromising on its core mission” hit me. It’s exactly why the fit with the RSC felt right. RSC are focused on Advancing the Chemical Sciences and look upon ChemSpider as a way to help the community to access information, data and knowledge and bring together chemists, publishers, vendors and other parties. It’s been our mission all along. So, we are not compromised as we have the same intentions. A great match.

Thanks to Warren for the recognition. Much appreciated.


The ChemSpider Journal of Chemistry is an experiment. We intend to demonstrate how modern web technologies can be used to dramatically enhance the type of information that can be communicated using web-based tools over standard online publishing approaches. There are some publishers who are working on delivering additional value to their readers by providing enhanced HTML articles and adding information to their articles such as InChIs to allow structure-based queries online. These publishers include the Royal Society of Chemistry with their Project Prospect and the Nature Publishing Group with their Nature Chemical Biology papers. The majority of articles presented by the commercial publishers are not of a “just-in-time” nature and are delayed by the “processes of publishing”. They are generally fairly lengthy documents and report successful results. They are commonly peer-reviewed and have endured a significant timeline from initial writing to submission, publisher’s processing, review and publication. Science is, however, being reported in near real-time under Open Notebook Science (ONS) initiatives. We believe that an online journal can co-exist between the immediate nature of blogging and wiki tools hosting ONS efforts and the more standard processes of the scientific publishers. Some publishers are already allowing online and open peer-review whereby readers provide their feedback to the author in a public forum. Papers can enter a period of online peer review and commentary during which readers provide feedback to the author(s). As a result of this process the authors can engage in public discourse with the commentators and issue a final form of the manuscript. We will offer similar facilities.

We invite manuscripts from anybody interested in exposing their work in the field of chemistry and intersecting fields. In general we expect these communications to be 1500-3000 words in length but there is no limit. We encourage submissions relating to chemistry, biochemistry and chemical biology; regarding synthesis, the analytical sciences and computational chemistry; as research, as commentaries and as questions to the community. Provided the submission relates to the domain of the chemical sciences we will find a place for it within the ChemSpider Journal of Chemistry. We encourage submissions from academia and industry, from students and senior scientists, from individuals and teams, for successful research or failed experiments. We encourage submitters to challenge us to host your manuscripts in a manner which most clearly communicates your science. This may include hosting various forms of data made available to the public as Open Data, providing visualization tools for the display of molecules, spectra, images and videos. We intend to not be constrained and to make full use of web-based tools available today and coming online tomorrow.

All articles will be Open Access articles. We will abide by the Budapest Open Access Initiative which declares “By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” Authors must agree to allow unrestricted reading, downloading, distribution, printing, searching and linking to the published work.

Over the past 2 years we believe we have demonstrated our passion for public science, our willingness to serve the community, and integrity in our actions. We hope that the ChemSpider Journal of Chemistry will provide a vehicle to all scientists operating within the domain of the chemical sciences to expose their work and interests to the community. We intend to deliver a facile process of submission and superior tools for delivery. We welcome your support and look forward to expanding the communication of chemistry.


There’s no shortage of possibilities regarding where we could go next with ChemSpider and we’re always thinking ahead. At present we are focused on chemistry document markup and the development of ChemMantis. Moving forward we are considering how chemists might want to use ChemSpider. Based on comments from organic chemists over the past few months a lot of chemists are using ChemSpider to source chemicals for purchase for screening and specifically to find starting materials for further reactions.

Recently we added the ChemSynthesis structure collection. That database offers links out to over 45,000 articles regarding reaction synthesis. We are now being encouraged to manage reactions directly on ChemSpider. While we of course have the skills to do so, it’s not in our near future. But what if we did? Then retrosynthetic analysis might be possible. At the ACS meeting in Philadelphia in August I gave a presentation on ARChem Route Designer, a software product marketed by SimBioSys. It was my privilege to give this presentation on behalf of one of the most respected chemists, Peter Johnson, someone who has been at the forefront of tools for synthesis design and structure based drug design. Take a look at the presentation about ARChem…for chemists interested in software tools for Retrosynthetic Analysis it may be of interest…and I wonder whether a platform like this might be of interest to integrate to ChemSpider…what do YOU think????

When ChemSpider was rolled out to the world as a part of ChemZoo we always knew we would be introducing more “critters”. We are happy to announce our progress with our new development, ChemMantis. Why Mantis? Well…it’s the Markup And Nomenclature Transformation Integrated System. Fits perfectly into our zoo!

We have been working on the markup of chemistry documents for a number of months and I unveiled the first aspects of our work at the ACS meeting in Philadelphia. The presentation is available online on my Slideshare account. What we are trying to do is to use our ChemSpider platform as the foundation of a document markup system whereby chemical names are automatically identified and can either be converted to chemical structures (possible using algorithms for name to structure conversion) or retrieved from our ChemSpider database. We have invested a lot of effort over the past year to curate and validate the ChemSpider database of over 21.5 million unique chemical entities and are now sitting on a foundation of information allowing us to connect between chemical identifiers, chemical structures and out to rich sources such as Wikipedia and PubChem and to provide information such as chemical vendors and other online systems. ChemMantis is well and truly woven into the web of ChemSpider now.
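To give a feel for the idea, here is a toy sketch of dictionary-based markup. It is not ChemMantis itself: it assumes a tiny hand-made dictionary mapping names to InChI strings, and the function name `markupNames` and the `[[name|identifier]]` output format are invented purely for illustration. The real system draws on the ChemSpider database and name-to-structure conversion rather than a three-entry lookup table.

```typescript
// Toy dictionary: chemical name (lower case) -> InChI string.
// A stand-in for a real lookup against a curated database.
const dictionary: { [name: string]: string } = {
  "water": "InChI=1S/H2O/h1H2",
  "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3",
  "benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
};

// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every dictionary name found in `text` as [[name|InChI]].
// Matching is case-insensitive and restricted to whole words.
function markupNames(text: string, dict: { [name: string]: string }): string {
  // Longest names first, so a longer name wins over a shorter one it contains.
  const names = Object.keys(dict).sort((a, b) => b.length - a.length);
  if (names.length === 0) {
    return text;
  }
  const pattern = new RegExp("\\b(" + names.map(escapeRegExp).join("|") + ")\\b", "gi");
  return text.replace(pattern, (match: string) => {
    return "[[" + match + "|" + dict[match.toLowerCase()] + "]]";
  });
}

const marked = markupNames(
  "The product was dissolved in Ethanol and washed with water.",
  dictionary
);
```

A plain dictionary lookup like this only finds names it already knows, which is why the quality of the underlying dictionary matters so much and why names missing from it need algorithmic name-to-structure conversion.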

We are now in alpha release and are adding some finishing tweaks to the markup system, the visualization elements and the workflow. You can see the immediate effects of our recent work on improving the quality of structure images in the balloon below.

We would like to test the system on YOUR documents if you are willing to participate. What we are looking for are WORD documents for already published papers. They can be Open or Closed access papers. We are not expecting copyright transfer – we want to mark up the documents and return them to you for feedback. In the process we will be testing the quality of our Dictionary, our conversions, our visualizations and our process. We welcome your support. Feel free to connect with us at infoATchemspiderDOTcom. Over the next few weeks you will hear more about ChemMantis and our contributions to text mining and markup of chemistry documents.

Most people reading this blog will know that we are advocates of the InChI standard for structure representation. I am aware of the intentions to extend the InChI into the world of reaction capture and look forward to testing it as it moves forward and providing feedback to the team. An announcement was made in the CSA Trust Newsletter and I’ve snipped it below.

“A project to develop a standard representation for chemical reactions was launched recently at a meeting in Berlin, Germany, hosted by René Deplanque of FIZ Chemie. The project is being led by Guenter Grethe.

The goal of this meeting was to develop the requirements for a proposal to be submitted to IUPAC to fund an Open Source, public domain ReactionML (IUPAC RML) standard to complement the IUPAC InChI chemical structure representation. The requirements would include what the community needs, technical and organisational issues and financial aspects.

The meeting was quite successful and an initial first stage of the project was agreed to and will include:

  • Reactants
  • Products
  • Reagents
  • Catalysts
  • Solvents

All the chemical structure representation will be based on and build upon the IUPAC InChI/InChIKey standards, which, since its introduction in August 2006, has become the international chemical structure representation standard for all large databases of chemical data. Some of these databases containing InChIs are in excess of 36 million unique structures.

It is expected a beta test release version of this new IUPAC standard will be available for public testing by the end of 2008.”

I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) PubMed is structure searchable from ChemSpider…we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction from documents SUBMITTED to our site – Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration to ChemSpider.

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, deposit their structures to ChemSpider under embargo with the article title, author list and “fractional abstract”. When a publication goes live the author can login, associate a DOI or a URL for the publication (for non-DOI based Open Access publishers) and the structures and article details get lifted from embargo and are immediately available for searching to the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities

6) If we received community support for this it could be game-changing.
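The workflow in point 4 can be sketched as a small state model. This is only an illustration under our own assumptions: the type and function names (`Deposition`, `confirmStructures`, `depositUnderEmbargo`, `liftEmbargo`) are hypothetical and do not describe an actual ChemSpider API, and the example link is a placeholder.

```typescript
// Hypothetical model of the pre-publication workflow described in point 4.
type Status = "submitted" | "confirmed" | "embargoed" | "public";

interface Deposition {
  title: string;
  authors: string[];
  structures: string[];      // structure identifiers validated from the text
  status: Status;
  publicationLink?: string;  // DOI or URL, known only once the paper is live
}

// Author submits the article ahead of submission to a publisher.
function submitArticle(title: string, authors: string[], structures: string[]): Deposition {
  return { title: title, authors: authors, structures: structures, status: "submitted" };
}

// Author confirms the structure-name associations found by the markup system.
function confirmStructures(d: Deposition): Deposition {
  if (d.status !== "submitted") {
    throw new Error("only a submitted article can be confirmed");
  }
  return { ...d, status: "confirmed" };
}

// Structures are deposited but hidden from public search.
function depositUnderEmbargo(d: Deposition): Deposition {
  if (d.status !== "confirmed") {
    throw new Error("structures must be confirmed before deposition");
  }
  return { ...d, status: "embargoed" };
}

// Once published, the author associates a DOI or URL and the embargo lifts.
function liftEmbargo(d: Deposition, publicationLink: string): Deposition {
  if (d.status !== "embargoed") {
    throw new Error("only an embargoed deposition can be released");
  }
  if (publicationLink.trim().length === 0) {
    throw new Error("a DOI or URL is required to lift the embargo");
  }
  return { ...d, status: "public", publicationLink: publicationLink };
}

// Only public depositions are visible to structure and text searches.
function isSearchable(d: Deposition): boolean {
  return d.status === "public";
}

const draft = submitArticle("Example article", ["A. Author"], ["InChI=1S/H2O/h1H2"]);
const embargoed = depositUnderEmbargo(confirmStructures(draft));
const released = liftEmbargo(embargoed, "https://example.org/article");
```

Each step returns a new record rather than changing the old one, so the embargoed and released versions can be compared side by side. The point of the ordering is that nothing becomes searchable until the author has both confirmed the structures and supplied a publication link.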

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation

ChemSpider has been working hard to support Wikipedia for a number of months now. We have been curating the structures on Wikipedia, I have been an active member of the WP:Chem team, we have extended our integration of Wikipedia to show the lead of the Wikipedia article on associated record views and have a lot of background activities going on re. Wikipedia at present (info will be released shortly). There are new articles released on Wikipedia on an ongoing basis and we stay up to date as best we can, monitoring bots for updates. Harvesting monographs out of Wikipedia based only on ChemBoxes and Drugboxes is not sufficient, since not every article about drugs and chemicals on Wikipedia has an associated Drugbox or ChemBox.

For example… You have likely heard of Rember for Alzheimer’s already? A search on Google for Rember Alzheimers will give about 2 million hits. It’s already being discussed in the blogosphere, including Derek Lowe’s In the Pipeline. Rember turns out to be methylene blue. There is already an article on Wikipedia about Rember but there is no chembox as yet. As I was researching Rember out of interest I noticed we did not have methylene blue linked to Wikipedia and Rember wasn’t associated with methylene blue. Adding the name was of course easy: 5 seconds’ work after login.

We have now added the ability to associate data sources directly too. What does this mean? On a record view page is a list of “Data Sources” associated with a compound. This is where depositions about a compound came from and, generally, links back to the associated web pages. Previously, in order to populate the Data Source table it would be necessary to deposit the structure and associated info as an SDF file. TOO MUCH work. So, now we have made it easy. To add a data source simply login and select “Edit” (top right hand side of the data source table).

To add a new data source simply click Add and input the information into the pop up box. The input is the name to be listed in the Data Source table, the URL to the information on the Data Source page (if info exists) and the name of the Data Source. There is one caveat to adding such links: the data source must exist. If you want to add data associated with your own website you need to register yourself, add a Data Source and wait for us to approve. Wikipedia is a special case since when the link is made we grab the lead of the article directly and show it in the Record View. For methylene blue there are two related Wikipedia articles so we have linked to them both, as you can see on the record view. Simply go to ChemSpider and search for rember and you’ll see two linked Wikipedia articles.

For those of you who have been following the discussions of Stevan Harnad, Peter Suber and others regarding institutional repositories and Open Access, you will already be up to speed regarding OA mandates and what they could mean in terms of access to data. Rather than go into this area in detail myself I point you specifically to Stevan Harnad’s site to review ongoing discussions there (there are multiple parties exchanging views).

What I am going to do, though, is point you to this comment on Peter Suber’s blog regarding “Stanford Opens Access to All Its Education Studies”.

Specifically, the following comments are of interest “Under Stanford’s new policy, only the author’s final, peer-reviewed copy of the article would be posted online —in some cases, potentially months before the printed version becomes available….By early fall, the education school plans to have a Web site in place where the articles will be posted and archived in a searchable database. With approximately 50 scholars on Stanford’s education school faculty, the site could accumulate as many as 100 articles a year, by Mr. Willinsky’s estimate.”

Stanford is not alone in this type of shift. What does this mean for indexing of articles and availability for searching in terms of the work we are doing with ChemSpider right now (1,2,3)? Text-indexing of chemistry articles would simply mean turning our spider onto the repository. Using the tools we have available now and the database of 21 million compounds and associated dictionary we could also convert the chemical names to structures and make the articles searchable by both text and structure BEFORE publication, in theory, months before. With the work that is already underway on Open Access articles on ChemSpider and SOON to be unveiled, we could also provide tools for authors to mark up their own documents. My preference, as for many others, is that authors of Chemistry articles use semantic authoring tools to allow us to grab the appropriate information from the articles for linking as well as provide a path for semantic connectivity.

The question then is whether or not ChemSpider can index institutional repositories or authors’ self-archived collections on their university research group websites. The authors’ self-archived collections will be very valuable but are of course the most likely to upset the publishers. We’d like to do both.

I envisage a time when articles are indexed and searchable even before they are published and indexed by others. Why not? If there are changes to the article between pre-and post-publication both can be indexed.

We welcome your comments! Anyone want to introduce me to the host of an institutional repository?

I’ve been looking at various forms of communication to help people understand a little more about ChemSpider. I am presently investigating the production of online movies to assist users in understanding how to use the system to full effect and hope to roll out a few examples shortly. In parallel I’ve been looking at podcasting technology.

Serendipitously I was approached by Nature to be involved in one of their podcasts and went through the experience with them. Though you’d never know it from the podcast, it was done during vacation while trying to balance the energy of our boisterous twin boys in the room with a background noise of the ocean crashing on the shore. There are worse ways to be involved in a podcast for sure…balancing the nice overview of the sea with two little boys desperately trying to stay quiet and the professionalism and speed of Geoff Brumfiel at Nature made this a very pleasant experience.

If you’re interested you can check out the podcast here. Based on the feedback we might add podcasting as one more way to communicate with our users. Thoughts?

Here at ChemSpider we’ve been working for almost a year and a half to build a structure centric community for chemists. During this time we have been dabbling, in the background, with ChemSpider being not so structure-centric, but this has not been exposed yet. Of late we have been attracted to the possibilities around text-mining and mark-up of articles.

We are well underway in terms of providing tools for markup and they will be released incrementally. We have a lot more ideas and are interested to participate in the Article 2.0 contest to see what we can do. What is Article 2.0? Article 2.0 was announced by Elsevier here with the following statement:

“We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset. We will encourage the use of XQuery, but this will not be a mandate. By leveraging these APIs, the contestant becomes the publisher and can render scientific articles to meet their needs including integrating the article into existing applications or combining it with other web service APIs.”

7500 articles and complete freedom to present the articles as we see fit. Enticing! What do we already have on ChemSpider that we could reuse?

1) Structure deposition

2) Analytical data and image deposition

3) Integration to other data via URLs

4) Add comments/description

5) Text markup with “Chemical enhancements”

6) A dataset of >21 million structures and integration to over 120 data sources

7) Good ideas …

Article 2.0 looks interesting…we hope to be involved

ChemSpider has taken some thrashing over the past year. We’ve been hit on science (and proven our point many times), on Open Access versus Free Access statements, on whether or not we have Open Data or not. There has been encouragement to define what the data on our site is in terms of Open Data or not. We’ve adopted Open Data tags on deposited data from users after pressure there. When I’ve asked more about Open Data I have heard that it is not ratified at the same level as Creative Commons licenses and they would be better to use. A week ago we put up Creative Commons Licenses in what I hoped was a GOOD move for the ChemSpider site and would relax the criticism of our site and potentially receive their blessing and support.

We received a blessing for all of 72 hours. In his blog post Peter Murray-Rust was DELIGHTED with our decision to do this. I quote: “I am DELIGHTED to report that Chemspider has adopted a CC-SA licence for its data.” and espoused “PMR: This is wonderful. As far as I know Chemspider is the only commercial chemical information company offering data under this licence, which is completely compatible with the Open Knowledge Definition. (It is also BBB-compliant, though data and publications are different animals).”

I assumed therefore we’d done a good thing. There was no indication to me that our position was anything other than positive.

There has been a conversation going on in the blogosphere for a couple of weeks now about Strong and Weak Open Access. I’ve read, watched and simply let others share their opinions because they’ve been in Open Access discussions for a number of years and have more context, background and passion to stay engaged in these discussions. They ARE important discussions and will come to a conclusion.

It appears that “I” am confused by Creative Commons licenses. This is based on the fact that for 72 hours we had done a good thing and had a blessing, but 3 days later I read yet another post, this time with a comment from John Wilbanks stating “I’d like to see a meaningful discussion of the risks of Share Alike and Attribution on data integration. Chemspider’s move to CC BY SA fits into this discussion nicely – it’s a total violation of the open data protocol we laid out at SC, which says “Don’t Use CC Licenses on Data” – but it does conform inside the broader OKD.”

Uh-oh. ChemSpider is in Total Violation of Creative Commons Licenses. As we say in Wales in times of distress … “Hell’s Bells” (My dad was a builder..if you believe he taught me to curse like that well….)

Peter followed it up with a comment “PMR: I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism – CC-SA is conformant to the OK definition, but difficult to operate for re-use). That’s why we use the OKF’s OpenData sticker on CrystalEye.” Hmmm…

Again, when I’ve asked about the OpenData sticker I’ve been informed that this is not yet ratified.

There have been many discussions about Openness that I’ve been involved with..just one example here. It has been difficult. Openness and licensing remain confusing…see here an example and this is just about a blogsite!

So the question is what now? Do we remove Creative Commons Licenses? Do we adopt Open Data licenses or do we just get ourselves out of the middle of this entire confusing discussion until all is resolved and settled? And IF we remove CC licenses and don’t post other licenses I know we’ll get criticized for that too. But let’s be honest…we’ve been highlighted for NOT having licenses up to this point. Now we are highlighted FOR having them. Maybe we can hope that no press is bad press. I’ll await feedback on this post and make a decision about what to do in the next 48 hours. Blog away…

Last week I had a pleasant chat with a reporter from Nature magazine, a Mr Geoff Brumfiel. Geoff was interested in ChemSpider…what it was, how it ran, who used it, who supported it, who liked it, who curated it, who didn’t like it and so on.

The results of that discussion, and others he spoke to about ChemSpider, are here in his article.

Chemists spin a web of data p139
Chemspider website provides free information on millions of molecules.
Geoff Brumfiel
doi:10.1038/453139a

It is a rule at Nature, at least for this type of article, that I could not see the article before it went to press and therefore I didn’t get the chance to proofread and comment. Geoff has accurately captured the spirit of our discussions but a few detailed clarifications are needed too. I have pasted in black the article content and in italics the clarification.

providing the community with an open-access source of chemical information

I giggled and commented please don’t say it’s Open Access. Say it’s Free Access. Say there are Open Data. And now we have Creative Commons licenses. But don’t say it’s Open Access, not Strong, not weak, not gold, not green. Just Free Access. No price barriers to usage.

Chemist Antony Williams is hoping to change this in a move likely to ruffle the feathers of the American Chemical Society.

I commented that we are not purposely in competition with anyone. It’s not what drives us to do this. Whether others see us to be competitive is for them not us. We don’t intentionally try to ruffle feathers. It doesn’t mean that what we are doing won’t ruffle feathers of course. Whether it’s ACS or others. It’s not the goal..it might be an outcome.

The modest project has made chemists interested in open access take notice — last week, the number of daily users of the site surpassed 5,000.

We have crossed 5500 users for the past two nights. The trend is positive.

“Other potential sources of information, such as Wikipedia, lack the algorithms needed to search chemicals according to their structure.”

Structure searching is “feasible” of course with InChI Strings. But substructure searching isn’t, and Wikipedia is treated as a text-based search by almost all of its users.
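To make the distinction concrete, here is a minimal sketch of why exact structure lookup is feasible with InChI Strings alone. The `articles` index and the `find_exact` helper are hypothetical names invented for illustration; nothing here reflects how Wikipedia or ChemSpider actually store their data. Because an InChI String is a canonical text identifier, an exact structure search reduces to a plain key lookup.

```python
# Hypothetical mini-index mapping an InChI String to an article title.
# The three InChI Strings are the standard identifiers for these compounds.
articles = {
    "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3": "Ethanol",
    "InChI=1S/C2H6O/c1-3-2/h1-2H3": "Dimethyl ether",
    "InChI=1S/CH4O/c1-2/h2H,1H3": "Methanol",
}


def find_exact(inchi):
    """Exact structure search: a plain dictionary lookup on the InChI String."""
    return articles.get(inchi)


if __name__ == "__main__":
    print(find_exact("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"))
    # A structure that is not in the index simply returns None.
    print(find_exact("InChI=1S/CH4/h1H4"))
```

Substructure searching does not fit this pattern: a query fragment has no single string that appears inside the InChI Strings of all the molecules containing it, so it needs a cheminformatics engine working over the structures themselves.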

“The site is maintained with modest profits from advertising and the work of about 30 active volunteers who double-check the data pulled in from outside.”

The original investment in hardware and software costs has finally been recouped. Modest profits? No one gets paid for the work we do. There is a phenomenal sweat equity investment in the platform numbering many thousands of hours to get here. We are indebted to the many software collaborators, providers of tools and the people curating and depositing to the system. There have BEEN about 30 active volunteers. Right now I would say the number of active depositors and curators is around 10. But it is growing. I hadn’t checked the number of REGISTERED users for a long time. We have over 1150 registered users…those who CAN login and curate data, deposit data, see new features etc. People do NOT have to register to use the site…but >1150 did. Wow. I didn’t know it was that many until I just checked (BIG SMILE)

“‘There’s an awful lot of chemical information, but there’s an awful lot of rubbish as well,’ says Barrie Walker, a retired industrial chemist in Yorkshire, UK, who helps maintain the site.”

Don’t know whether Barrie said this or not. He IS an honest guy and he is our QUALITY GURU and we are proud that he is willing to give us his fine eyes. There IS garbage on the site still. But, after a year online and active curating, it has been much reduced. About 200 edits a day are made to the site: names changed/deleted/added, spectra/structures/URLs/Publications added etc. It’s quite the pace. We have cleaned up 100s of thousands of incorrect associations from the external data sources. It’s been and will remain an enormous task with an enormous payback for the community.

Williams adds that the site still has problems with certain searches. For example, it struggles to distinguish between isomers: molecules with the same chemical formula arranged in different structures.

We can distinguish isomers no problem. The PROBLEM is that there is a mixture of isomeric species submitted from multiple data sources, and data are mixed and intermingled in a way that the user cannot get to the correct structure. Search taxol or Ginkgolide on the ChemSpider blog and read the multiple blog posts about this. We can of course search all isomers for a particular chemical formula…
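As an illustration of searching all isomers for a particular chemical formula, here is a small sketch that groups records by the formula layer of their InChI Strings. The record list and the helper names `formula_layer` and `group_isomers` are assumptions made for this example, not a description of how ChemSpider implements the search. It relies only on the fact that, in a standard single-component InChI, the formula is the layer immediately after the `InChI=1S/` prefix.

```python
from collections import defaultdict


def formula_layer(inchi):
    """Return the formula layer, e.g. 'C2H6O' from 'InChI=1S/C2H6O/c1-2-3/...'."""
    return inchi.split("/")[1]


def group_isomers(records):
    """Group (name, InChI) records by molecular formula."""
    groups = defaultdict(list)
    for name, inchi in records:
        groups[formula_layer(inchi)].append(name)
    return dict(groups)


# Illustrative records only: two isomers of C2H6O plus one unrelated compound.
records = [
    ("Ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ("Dimethyl ether", "InChI=1S/C2H6O/c1-3-2/h1-2H3"),
    ("Methanol", "InChI=1S/CH4O/c1-2/h2H,1H3"),
]

if __name__ == "__main__":
    for formula, names in group_isomers(records).items():
        print(formula, names)
```

Ethanol and dimethyl ether share a formula but keep distinct InChI Strings, so telling them apart is not the hard part. The difficulty described above is data from different sources being attached to the wrong isomer, which no amount of grouping can repair without curation.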

“But Williams nevertheless believes that the service may be able to compete with for-profit services. ‘What I’m doing is highly disruptive,’ he says. ‘I think it can be done and it needs to be done.’”

I think what WE are doing…it’s not me..it’s we…is disruptive. In a good way. Many chemists will benefit. Will it have an impact on for-profit services? Yes, maybe. As an outcome but not as the target. Our team of people, both internal to ChemSpider’s development and Advisory Group, and the people we don’t even know who are cleaning and depositing into the system for their colleagues in the community, are creating a powerful resource for Chemists. The FOCUS of this effort is to Build a Structure Centric Community for Chemists. We will change that soon…the focus on Structure-Centric will broaden to cover Chemistry in general and to Build a Community for Chemists.

We are well on our way, and thanks go to Nature, and Geoff in particular, for exposing it. My comments above are not meant to detract from Geoff’s reporting abilities but it was a long discussion and some clarification statements are of value I believe.

We have made significant advances in the structure deposition system on ChemSpider. We’ve reported on our advances previously and are working hard to polish it. In parallel we’ve done work to support deposition of batches of structures (100s to many thousands) as well as the deposition of CSV files to support Open Notebook Science. We are going to roll out deposition in phases – single deposition first, batch deposition next and then CSV file based batch deposition.

So…why are we encouraging the deposition of structures onto ChemSpider? We agree that we could accept RSS feeds (and we will). Our view is that people might want to have “bragging rights” on their latest synthesis, might want to expose their latest paper on ChemSpider, might have a link to an article online that they might want to expose to people. While there are MILLIONS of structures online there is new chemistry reported every day. What other system is there available as a structure-based community for chemists where people can deposit their structures, stories, links and comments to share with others? (And open up a conversation with others about synthesis, analysis etc.) Think of it a little like Flickr or YouTube for chemical structures. Anyone can post their structures for people to browse.

I’ve been doing some example depositions to show what’s feasible…these are simple to do…a few minutes work maximum.

1) I was a co-author of a publication and received a copy today. I wanted to put a link to the paper and associate it with the structure we analyzed. The structure already existed on the database so this was information to be added to the existing structure. Scroll down to the end of the page for this record to see this Supplemental information

Martin, G.E., Hilton, B.D., Blinov, K.A. and Williams, A.J. “Using indirect covariance spectra to identify artifact responses in unsymmetrical indirect covariance calculated spectra”, Magnetic Resonance in Chemistry

[DOI: 10.1002/mrc.2141]

2) A new publication was released this week regarding a new compound Quesnoin. David Bradley blogged about it on Spinneret. In this case I wanted to add the structure, information about the structure as well as a link to the recently published article. Scroll to the bottom of this record.

There are many other examples online too here (1,2,3). Look at the Supplemental Information in each case.

There are some final tweaks being made at present but single deposition is now rolled out. We are looking for people NOW to start using the system so please ping me. An overview of the system is available here.

deposition_workflow1.png

The future will include users creating their own “catalogs” of structures, “social networking” and discussions around structures, team-based discussions, public and private structure collections and so on. It’s coming…in stages. We start here with the single deposition process.

Over the past few weeks I have had a few discussions with a member of the ChemSpider Advisory group regarding a concept to create WiChempedia. I’ve enjoyed these conversations with Alex Tropsha (professor and Chair in the Division of Medicinal Chemistry and Natural Products in the School of Pharmacy, UNC-Chapel Hill.) We are like-minded in a number of ways but specifically in what can be done to facilitate delivery of quality information to the chemistry community.

As you will notice if you frequent this blog I am rather a stickler for accuracy and quality (1,2,3). I think it’s important (4). Over the past few weeks I’ve spent more time looking at the quality of data on Wikipedia and trying to figure out the best way to bring together our efforts on ChemSpider to enhance the capabilities of integrated information and to support the quality efforts being made by the WP:CHEM team and help them. I also intend to facilitate the development of our own Wiki environment for chemistry and to generally enhance the tools available to chemists not only for Wikipedia type annotation but also to support Open Notebook Science.

Now, I don’t want to reinvent the wheel. Wikipedia has a lot of what is necessary in terms of being a known system, a following of people and committed supporters in the WP:CHEM team. What I have been hoping for was a shift around structure and substructure searching on the MediaWiki platform but I know that is a tough request as the platform is not built for that type of thing. The InChIKey holds some promise for exact structure searching but does not offer an opportunity for substructure searching without a lookup across a larger database. I want to facilitate information and data sharing further. I do want to provide the type of service that Wikipedia does in terms of general information but also layer cheminformatics tools onto that knowledge and information, allow addition of analytical data, analysis tools, real time predictions and analysis ultimately. This platform should certainly be wiki-enabled.

Decision made. Our intention is to deliver wiki-capabilities in ChemSpider and to use the Open Content associated with chemicals and drugs on Wikipedia inside the system. We will then provide an environment for people to continue to add to, enhance and curate the Wikipedia content as well as add their own. Last night (and well into the early morning) I spent some time talking to Martin Walker from WP:CHEM regarding my concerns that we might offend the Wikipedians with our efforts and that I did not want them to feel that we were ripping off their hard work but rather have our efforts seen as supportive and enabling. My intention, as we work through downloading the data, is to check, validate and correct what is sitting on Wikipedia directly for the benefit of the community. Also, we will of course need to leave all Wikipedia content under the appropriate licensing for others to use. Martin commented that there are tens of mirrors of Wikipedia out there ripped purely with the purpose of exposing it and getting ad revenue. We are not working from that model….our intention, as usual, is to build a structure centric community for chemists and with so much excellent work done on Wikipedia I want to take advantage of it and also give back through the work we will do.

Two domain names have been grabbed for this project: WiChempedia, for compatibility with Wikipedia, and also WeChempedia, to emphasize the community aspects of the project.

If you frequent this blog you will recall that we have made a commitment to Microsoft Sharepoint as our future platform for wiki’ing ChemSpider. That is where we believe this work will be done ultimately but we don’t have the platform in our hands yet.

The Xmas vacation is going to be full of holiday movies and manual examination and curation of the Wikipedia data. Wish us luck!

Following my recent post on high performance computing and the Cell B.E I saw this today re. Gamers handing over their compute cycles to PS3GRID.
I excerpt here but point you to the full article for details:

“PS3GRID is coordinated by researchers at the Research Unit on Biomedical Informatics (GRIB) at the Instituto Municipal de Investigación Médica and the Universidad Pompeu Fabra in Barcelona, Spain. The distributed infrastructure enables any PS3 to do computations on atomic and molecular simulations.

The researchers, headed by GRIB scientist Gianni De Fabritiis, chose the PS3 because it is the first consumer device to contain the IBM Cell processor. “The Cell,” which is more than an order of magnitude faster than standard Intel or AMD processors, optimizes the types of computation commonly used in graphics applications. In addition, the Cell offers an inexpensive and powerful method to perform highly detailed molecular dynamics simulations of biomedical systems. Using the Cell, a PS3 has the computational power equivalent to about 20 PCs.”

I think the image below will tell the story of what’s coming soon to ChemSpider. As part of a collaboration with a member of our advisory group we will be unveiling this new capability for beta testing in the very near future. I’m sure some of you will see where we are going next…watch this space.

I subscribe to Scientific Computing so that it drops into my email inbox. I read Rob Farber’s article this week entitled “The Future Looks Bright for Teraflop Computing”. His opening question was “Wouldn’t it be great to have a teraflop of computing power sitting in your lab, desktop workstation, or remote instrument server?” What would that mean to your work?

For those of you using ChemSpider you will know that we have about 20 million compounds on the database. With that many compounds, population of the database with properties such as InChIStrings, InChIKeys, physchem properties and systematic names can take many days if not weeks. With only three computers in our hands, one of them a web server and one of them the database server, we are limited to one system for these calculations. Even that dual processor system provides slow throughput. Oh the joys of having access to teraflop processors!!!
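A back-of-the-envelope sketch shows how a database of this size turns into days of computing. The per-compound time of 0.05 seconds is purely an assumed figure for illustration, not a measured ChemSpider number, and `days_to_process` is a hypothetical helper that assumes the work splits evenly across workers.

```python
SECONDS_PER_DAY = 86_400


def days_to_process(n_compounds, seconds_per_compound, n_workers=1):
    """Estimate wall-clock days, assuming the work divides evenly across workers."""
    total_seconds = n_compounds * seconds_per_compound / n_workers
    return total_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed cost: 0.05 s per compound for one property on one machine.
    print(round(days_to_process(20_000_000, 0.05), 1), "days on one machine")
    # The same job spread over 20 hypothetical workers.
    print(round(days_to_process(20_000_000, 0.05, n_workers=20), 1), "days on 20 workers")
```

Under that assumption a single property takes a million seconds of compute, or about eleven and a half days on one machine, and each additional property multiplies the total. That is why extra processing power matters so much here.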

In my previous post on focused libraries I commented on ongoing discussions regarding the potential to perform online docking. Evangelists such as Jean-Claude Bradley (on our advisory group) have been talking about this possibility as part of his approach to Open Notebook Science. Docking can be very time consuming and the speed of calculations is very important. I have been working on a project regarding the value of porting docking software to the Cell Broadband Engine processor from IBM. The development of that processor is an interesting story in itself since it was driven specifically by the needs of the gaming industry for better performance in their calculations. Now SimBioSys are porting their docking software to the Cell processor as described in this White Paper. The improvement in performance is quite amazing!!!

While working for a commercial software company we saw productivity gains moving to clusters. Dual processors in our laptops and annual performance gains from the general technology shifts offer faster calculations every year. Teraflops on the desktop (and even laptop) are likely a few years away…but GFlops are here..

When we first started the ChemSpider project we made a commitment to “Build a Structure Centric Community for Chemists”. We believe we are well on the way to facilitating that. We have talked about a “wiki” environment for collaboration. In this framework we take wiki to indicate a “collaborative environment”, not necessarily adherence to a specific wiki-platform. Our intention is to provide the ability for users of ChemSpider to collaborate in the co-management of content on the ChemSpider site. A number of our readers have taken our statements to indicate that we will be using the same wiki platform as that utilized on Wikipedia. We have looked at and considered a number of “wiki” tools, platforms, interfaces and user-experiences. At this time we have made a decision to utilize Microsoft SharePoint as the platform on which to construct our wiki-environment. With a clear commitment to Web 2.0 already declared and our platform built on SQL Server and ASP.NET we feel it is the appropriate platform for us to build on. We believe our technology choices have already demonstrated that we can deploy a good solution very quickly.

Now, we realize that this might result in a series of jabs about us not using Open Source solutions and so on but we are more focused on delivering an appropriate scalable solution than building ChemSpider only on Open Source software. We will support anyone who wishes to do the same on Open Source though.

We will keep you informed of our progress. Now we need to migrate ourselves to .NET3 and we hope this will be a short term disruption in the future as we switch over. Watch this space.

For those of you who have been watching the blog of late you will be aware of the recent discussions about Open Data (1,2). We have offered the possibility to submitters of spectral data to declare their data either Open or Closed. Noel posted a comment on the blog asking the question “Why is the default Closed? Why even offer the option of Closed?”

So..my response to “Why not offer the option of Closed?” My opinion is that this is the submitter’s decision. It’s not our role to force “Openness” of data onto users. We are working to create an environment that provides value to ChemSpider users rather than one that forces them into a policy regarding openness. Personally, I would prefer to have access to data to help answer a question, even if they are NOT Open Data, than to not have access to those data. I have asked all of the people who have submitted data or had me submit data to ChemSpider whether they would like to have their data moved to open. 3 said yes, 2 said no. I do NOT intend to force people to adhere to making their data Open. That is their choice, not mine. We are creating a community for collaboration. There is value in having access to data whether it is Open or not. If you look at the recent conversations about RSC and their Free Access versus Open Access we must agree that there IS value to Free Access to their articles despite the fact that they are not Open Access.

My friend Gary Martin has allowed us to deposit some of his data onto ChemSpider. He has commented twice (1,2) and I refer you to those blog postings for his opinions. They are interesting to read.

The reality is that our policies, even as they are, appear to be appropriate to have people deposit their data. We already have over 100 spectra deposited on ChemSpider and more to come based on recent conversations. Some of these ARE Open Data and the depositors are acknowledged for this. They are sharing their data with you through us. That’s the benefit of building a community for chemists.

This week I was privileged to attend a PubChem Working Group meeting in Washington and sit around a table with interested parties discussing the present and future state of PubChem. I had the opportunity to give an overview of ChemSpider and our vision of ourselves and where we are going. If you are interested in reviewing the commentary please find a PDF file of the presentation here (shared with permission of PubChem). I welcome any comments, feedback or questions either as a blog response or offline.