Archive for the Vision Category

ChemSpider has been around for about two and a half years. Based on the feedback we have received from the community regarding our humble offering, users like it. In most cases they “get it” too. They understand that we are working to provide information that can assist their work. We are hoping to provide some glimpse of data, some snippet of information, some link of value which can enable their studies/research/inquiry. And, in some cases, people want more. Let’s be honest…WE want more. We want to deliver more value, provide more impact and integrate more data for you, the community.

Some of the things that we have been asked for over the past few months are more web services to tap into the experimental data available on ChemSpider (grabbing experimental properties for QSAR modeling, for example), more reaction syntheses to peruse, improved speed for substructure searching, similarity searching, integration with more publishers’ literature and an easier-to-navigate website. Good list!

We have a set of priorities for the near term and will be doing our utmost to deliver them in time for the IUPAC congress in Glasgow in August and the ACS meeting in Washington later that month. But we want to hear from you. What do you, our users, want to see on ChemSpider? If you had your wishes, resources were no object, there were no barriers to integration with any data source and you got to define the path forward for ChemSpider, what would it be?

Feel free to share it here on the blog or, if you’d prefer to be more anonymous with your comments, feel free to drop me an email at infoATchemspiderDOTcom. We want your input. Please don’t be shy…engage us and you might just get what you want (though some things might take a while!)


We’ve received a lot of kudos, congratulations and praise for our decision to become a part of the RSC. We thank everyone who has gone out of their way to acknowledge the shift in our circumstances. We did have some concern that some people would judge us on “selling out” rather than going it alone. Based on the feedback to date our worries were unfounded.

Tonight the comments of Warren DeLano, developer of the Open Source platform PyMOL (more details here), truly struck a chord with me. His comments are below.

“DeLano Scientific LLC congratulates Antony Williams et al. on the acquisition of ChemSpider by the Royal Society of Chemistry. This historic event provides a compelling example of how an independent open-minded project (open-access, open-data, open-source, etc.) can increase its resources and extend its longevity without compromising on its core mission, as is always necessary when a project “sells out” to a for-profit company beholden to narrow fiduciary objectives.

We hope that the ChemSpider / RSC example will both inspire more open-minded individuals to strike out on their own with similarly ambitious efforts and encourage various non-profit and government entities to actively recruit successful projects back into “the establishment” in ways which do not compromise project integrity and yet can enable even greater long-term positive societal impacts.”

Specifically the statement “without compromising on its core mission” hit me. It’s exactly why the fit with the RSC felt right. RSC are focused on Advancing the Chemical Sciences and look upon ChemSpider as a way to help the community to access information, data and knowledge and bring together chemists, publishers, vendors and other parties. It’s been our mission all along. So, we are not compromised as we have the same intentions. A great match.

Thanks to Warren for the recognition. Much appreciated.


The ChemSpider Journal of Chemistry is an experiment. We intend to demonstrate how modern web technologies can be used to dramatically enhance the type of information that can be communicated using web-based tools over standard online publishing approaches. There are some publishers who are working on delivering additional value to their readers by providing enhanced HTML articles and adding information to their articles such as InChIs to allow structure-based queries online. These publishers include the Royal Society of Chemistry with their Project Prospect and the Nature Publishing Group with their Nature Chemical Biology papers. The majority of articles presented by the commercial publishers are not of a “just-in-time” nature and are delayed by the “processes of publishing”. They are generally fairly lengthy documents and report successful results. They are commonly peer-reviewed and have endured a significant timeline from initial writing to submission, publisher processing, review and publication. Science is, however, being reported in near real-time under Open Notebook Science (ONS) initiatives. We believe that an online journal can co-exist between the immediate nature of blogging and wiki tools hosting ONS efforts and the more standard processes of the scientific publishers. Some publishers are already allowing online and open peer-review whereby readers provide their feedback to the author in a public forum. Papers can enter a period of online peer review and commentary during which readers provide feedback to the author(s). As a result of this process the authors can engage in public discourse with the commentators and issue a final form of the manuscript. We will offer similar facilities.

We invite manuscripts from anybody interested in exposing their work in the field of chemistry and intersecting fields. In general we expect these communications to be 1500-3000 words in length but there is no limit. We encourage submissions relating to chemistry, biochemistry and chemical biology; regarding synthesis, the analytical sciences and computational chemistry; as research, as commentaries and as questions to the community. Provided the submission relates to the domain of the chemical sciences we will find a place for it within the ChemSpider Journal of Chemistry. We encourage submissions from academia and industry, from students and senior scientists, from individuals and teams, for successful research or failed experiments. We encourage submitters to challenge us to host their manuscripts in a manner which most clearly communicates their science. This may include hosting various forms of data made available to the public as Open Data, and providing visualization tools for the display of molecules, spectra, images and videos. We intend to not be constrained and to make full use of web-based tools available today and coming online tomorrow.

All articles will be Open Access articles. We will abide by the Budapest Open Access Initiative which declares “By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” Authors must agree to allow unrestricted reading, downloading, distribution, printing, searching and linking to the published work.

Over the past 2 years we believe we have demonstrated our passion for public science, our willingness to serve the community, and integrity in our actions. We hope that the ChemSpider Journal of Chemistry will provide a vehicle to all scientists operating within the domain of the chemical sciences to expose their work and interests to the community. We intend to deliver a facile process of submission and superior tools for delivery. We welcome your support and look forward to expanding the communication of chemistry.


There’s no shortage of possibilities regarding where we could go next with ChemSpider and we’re always thinking ahead. At present we are focused on chemistry document markup and the development of ChemMantis. Moving forward we are considering how chemists might want to use ChemSpider. Based on comments from organic chemists over the past few months a lot of chemists are using ChemSpider to source chemicals for purchase for screening and specifically to find starting materials for further reactions.

Recently we added the ChemSynthesis structure collection. That database offers links out to over 45,000 articles regarding reaction synthesis. We are now being encouraged to manage reactions directly on ChemSpider. While we of course have the skills to do so it’s not in our near future. But, what if we did? Then retrosynthetic analysis might be possible. At the ACS meeting in Philadelphia in August I gave a presentation on ARChem Route Designer, a software product marketed by SimBioSys. It was my privilege to give this presentation on behalf of one of the most respected chemists, Peter Johnson, someone who has been at the forefront of tools for synthesis design and structure-based drug design. Take a look at the presentation about ARChem…for chemists interested in software tools for Retrosynthetic Analysis it may be of interest…and I wonder whether a platform like this might be of interest to integrate into ChemSpider…what do YOU think????


When ChemSpider was rolled out to the world as a part of ChemZoo we always knew we would be introducing more “critters”. We are happy to announce our progress with our new development, ChemMantis. Why Mantis? Well…it’s the Markup And Nomenclature Transformation Integrated System. Fits perfectly into our zoo!

We have been working on the markup of chemistry documents for a number of months and I unveiled the first aspects of our work at the ACS meeting in Philadelphia. The presentation is available online on my Slideshare account. What we are trying to do is to use our ChemSpider platform as the foundation of a document markup system whereby chemical names are automatically identified and can either be converted to chemical structures (possibly using algorithms for name to structure conversion) or retrieved from our ChemSpider database. We have invested a lot of effort over the past year to curate and validate the ChemSpider database of over 21.5 million unique chemical entities and are now sitting on a foundation of information allowing us to connect between chemical identifiers, chemical structures and out to rich sources such as Wikipedia and PubChem, and to provide information such as chemical vendors and other online systems. ChemMantis is well and truly woven into the web of ChemSpider now.
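To make the idea of dictionary-based markup concrete, here is a minimal sketch in Python. It is not how ChemMantis is implemented; the three-entry dictionary, the function names and the `<chem>` tag are all assumptions made up for illustration. It only shows the general pattern of spotting known names in text and attaching a structure identifier to them.

```python
import re

# Hypothetical mini-dictionary: lower-cased chemical name -> InChI string.
# A real system would look names up in a large curated database instead.
NAME_TO_INCHI = {
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "acetic acid": "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)",
    "water": "InChI=1S/H2O/h1H2",
}


def find_chemical_names(text, dictionary=NAME_TO_INCHI):
    """Return (start, end, matched_text, inchi) for each dictionary name in text."""
    # Longer names go first in the alternation so multi-word names are preferred.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.start(), m.end(), m.group(0), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def mark_up(text, dictionary=NAME_TO_INCHI):
    """Wrap each recognised name in a made-up <chem> tag carrying its InChI."""
    pieces, last = [], 0
    for start, end, matched, inchi in find_chemical_names(text, dictionary):
        pieces.append(text[last:start])
        pieces.append(f'<chem inchi="{inchi}">{matched}</chem>')
        last = end
    pieces.append(text[last:])
    return "".join(pieces)
```

Names that are not in the dictionary are simply left untouched here; in the approach described above those would be candidates for algorithmic name to structure conversion.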

We are now in alpha release and are adding some finishing tweaks to the markup system, the visualization elements and the workflow. You can see the immediate effects of our recent work on improving the quality of structure images in the balloon below.

We would like to test the system on YOUR documents if you are willing to participate. What we are looking for are WORD documents for already published papers. They can be Open or Closed access papers. We are not expecting copyright transfer - we want to mark up the documents and return them to you for feedback. In the process we will be testing the quality of our Dictionary, our conversions, our visualizations and our process. We welcome your support. Feel free to connect with us at infoATchemspiderDOTcom. Over the next few weeks you will hear more about ChemMantis and our contributions to text mining and markup of chemistry documents.


Most people reading this blog will know that we are advocates of the InChI standard for structure representation. I am aware of the intentions to extend the InChI into the world of reaction capture and look forward to testing it as it moves forward and providing feedback to the team. An announcement was made in the CSA Trust Newsletter and I’ve snipped it below.

“A project to develop a standard representation for chemical reactions was launched recently at a meeting in Berlin, Germany, hosted by René Deplanque of FIZ Chemie. The project is being led by Guenter Grethe.

The goal of this meeting was to develop the requirements for a proposal to be submitted to IUPAC to fund an Open Source, public domain ReactionML (IUPAC RML) standard to complement the IUPAC InChI chemical structure representation. The requirements would include what the community needs, technical and organisational issues and financial aspects.

The meeting was quite successful and an initial first stage of the project was agreed to and will include:

  • Reactants
  • Products
  • Reagents
  • Catalysts
  • Solvents

All the chemical structure representation will be based on and build upon the IUPAC InChI/InChIKey standards, which, since its introduction in August 2006, has become the international chemical structure representation standard for all large databases of chemical data. Some of these databases containing InChIs are in excess of 36 million unique structures.

It is expected a beta test release version of this new IUPAC standard will be available for public testing by the end of 2008.”
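As a rough illustration of what a reaction record built on InChIs might hold, here is a small Python sketch. It is not the proposed IUPAC standard, whose format is not described in the announcement; the class name, field names and validity check are assumptions chosen only to mirror the five roles listed above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReactionRecord:
    """Hypothetical container: every participant is stored as an InChI string."""

    reactants: List[str]
    products: List[str]
    reagents: List[str] = field(default_factory=list)
    catalysts: List[str] = field(default_factory=list)
    solvents: List[str] = field(default_factory=list)

    def all_inchis(self) -> List[str]:
        """Every structure mentioned in the record, in role order."""
        return (
            self.reactants
            + self.products
            + self.reagents
            + self.catalysts
            + self.solvents
        )

    def is_well_formed(self) -> bool:
        """Minimal sanity check: at least one reactant and one product,
        and every entry at least looks like an InChI string."""
        has_core = bool(self.reactants) and bool(self.products)
        return has_core and all(s.startswith("InChI=") for s in self.all_inchis())


# Example: acid-catalysed esterification of acetic acid with ethanol.
esterification = ReactionRecord(
    reactants=[
        "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)",  # acetic acid
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",      # ethanol
    ],
    products=[
        "InChI=1S/C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3",  # ethyl acetate
        "InChI=1S/H2O/h1H2",                        # water
    ],
    catalysts=["InChI=1S/H2O4S/c1-5(2,3)4/h(H2,1,2,3,4)"],  # sulfuric acid
)
```

Keeping each role as its own list means a search for a compound as a catalyst can be kept separate from a search for it as a product.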


I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) PubMed is structure searchable from ChemSpider…we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction of documents SUBMITTED to our site - Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration to ChemSpider.

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, deposit their structures to ChemSpider under embargo with the article title, author list and “fractional abstract”. When a publication goes live the author can login, associate a DOI or a URL for the publication (for non-DOI based Open Access publishers) and the structures and article details get lifted from embargo and are immediately available for searching to the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities

6) If we received community support for this it could be game-changing.
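The embargo workflow in point 4 can be sketched in a few lines of Python. This is only an illustration of the idea, not ChemSpider code: the class, its fields and the `searchable` helper are assumed names, and a real system would also need logins, name validation and persistent storage.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EmbargoedDeposit:
    """Hypothetical record for structures deposited ahead of publication."""

    title: str
    authors: List[str]
    structures: List[str]                    # e.g. InChI strings confirmed by the author
    publication_link: Optional[str] = None   # DOI or URL, set once the paper is live

    @property
    def is_public(self) -> bool:
        # A deposit stays under embargo until a publication link is attached.
        return self.publication_link is not None

    def release(self, doi_or_url: str) -> None:
        """Lift the embargo by associating a DOI or URL with the deposit."""
        link = doi_or_url.strip()
        if not link:
            raise ValueError("a DOI or URL is required to lift the embargo")
        self.publication_link = link


def searchable(deposits: List[EmbargoedDeposit]) -> List[str]:
    """Only structures from released deposits are exposed to public search."""
    return [s for d in deposits if d.is_public for s in d.structures]
```

Under these assumptions a deposit contributes nothing to public search until its author calls `release`, which matches the "lifted from embargo" step described above.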

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation


ChemSpider has been working hard to support Wikipedia for a number of months now. We have been curating the structures on Wikipedia, I have been an active member of the WP:Chem team, we have extended our integration of Wikipedia to show the lead of the Wikipedia article on associated record views, and we have a lot of background activities going on regarding Wikipedia at present (info will be released shortly). New articles are released on Wikipedia on an ongoing basis and we stay up to date as best we can monitoring bots for updates. Harvesting monographs out of Wikipedia based only on ChemBoxes and Drugboxes is not sufficient, since not every article about drugs and chemicals on Wikipedia has an associated Drugbox or ChemBox.

For example… You have likely heard of Rember for Alzheimer’s already? A search on Google for Rember Alzheimers will give about 2 million hits. It’s already being discussed in the blogosphere, including on Derek Lowe’s In the Pipeline. Rember turns out to be methylene blue. There is already an article on Wikipedia about Rember but there is no chembox as yet. As I was researching Rember out of interest I noticed we did not have methylene blue linked to Wikipedia and Rember wasn’t associated with methylene blue. Adding the name was of course easy…5 seconds’ work after login.

We have now added the ability to associate data sources directly too. What does this mean? On a record view page is a list of “Data Sources” associated with a compound. This is where depositions about a compound came from and, generally, it links back to the associated web pages. Previously, in order to populate the Data Source table it would be necessary to deposit the structure and associated info as an SDF file. TOO MUCH work. So, now we have made it easy. To add a data source simply log in and select “Edit” (top right hand side of the data source table).

To add a new data source simply click Add and input the information into the pop-up box. The input is the name to be listed in the Data Source table, the URL to the information on the Data Source page (if info exists) and the name of the Data Source. There is one caveat to adding such links: the data source must exist. If you want to add data associated with your own website you need to register yourself, add a Data Source and wait for us to approve it. Wikipedia is a special case, since when the link is made we grab the lead of the article directly and show it in the Record View. For methylene blue there are two related Wikipedia articles so we have linked to them both, as you can see on the record view. Simply go to ChemSpider and search for rember and you’ll see two linked Wikipedia articles.


For those of you who have been following the discussions of Stevan Harnad, Peter Suber and others regarding institutional repositories and Open Access, you will already be up to speed regarding OA mandates and what they could mean in terms of access to data. Rather than go into this area in detail myself I point you specifically to Stevan Harnad’s site to review ongoing discussions there (there are multiple parties exchanging views).

What I am going to do, though, is point you to this comment on Peter Suber’s blog regarding “Stanford Opens Access to All Its Education Studies”.

Specifically, the following comments are of interest “Under Stanford’s new policy, only the author’s final, peer-reviewed copy of the article would be posted online —in some cases, potentially months before the printed version becomes available….By early fall, the education school plans to have a Web site in place where the articles will be posted and archived in a searchable database. With approximately 50 scholars on Stanford’s education school faculty, the site could accumulate as many as 100 articles a year, by Mr. Willinsky’s estimate.”

Stanford is not alone in this type of shift. What does this mean for indexing of articles and availability for searching in terms of the work we are doing with ChemSpider right now (1,2,3)? Text-indexing of chemistry articles would simply mean turning our spider onto the repository. Using the tools we have available now and the database of 21 million compounds and associated dictionary, we could also convert the chemical names to structures and make the articles searchable by both text and structure BEFORE publication, in theory, months before. With the work that is already underway on Open Access articles on ChemSpider and SOON to be unveiled, we could also provide tools for authors to mark up their own documents. My preference, as for many others, is that authors of Chemistry articles use semantic authoring tools to allow us to grab the appropriate information from the articles for linking as well as provide a path for semantic connectivity.

The question then is whether or not ChemSpider can index institutional repositories or authors’ self-archived collections on their university research group websites. The authors’ self-archived collections will be very valuable but are, of course, the most likely to upset the publishers. We’d like to do both.

I envisage a time when articles are indexed and searchable even before they are published and indexed by others. Why not? If there are changes to the article between pre-and post-publication both can be indexed.

We welcome your comments! Anyone want to introduce me to the host of an institutional repository?


I’ve been looking at various forms of communication to help people understand a little more about ChemSpider. I am presently investigating the production of online movies to assist users in understanding how to use the system to full effect and hope to roll out a few examples shortly. In parallel I’ve been looking at podcasting technology.

Serendipitously I was approached by Nature to be involved in one of their podcasts and went through the experience with them. Though you’d never know it from the podcast, it was done during vacation while trying to balance the energy of our boisterous twin boys in the room with a background noise of the ocean crashing on the shore. There are worse ways to be involved in a podcast for sure…balancing the nice view of the sea with two little boys desperately trying to stay quiet and the professionalism and speed of Geoff Brumfiel at Nature made this a very pleasant experience.

If you’re interested you can check out the podcast here. Based on the feedback we might add podcasting as one more way to communicate with our users. Thoughts?


Here at ChemSpider we’ve been working for almost a year and a half to build a structure centric community for chemists. During this time we have been dabbling, in the background, with ChemSpider being not so structure-centric, but this has not been exposed yet. Of late we have been attracted to the possibilities around text-mining and mark-up of articles.

We are well underway in terms of providing tools for markup and they will be released incrementally. We have a lot more ideas and are interested in participating in the Article 2.0 contest to see what we can do. What is Article 2.0? Article 2.0 was announced by Elsevier here with the following statement:

“We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset. We will encourage the use of XQuery, but this will not be a mandate. By leveraging these APIs, the contestant becomes the publisher and can render scientific articles to meet their needs including integrating the article into existing applications or combining it with other web service APIs.”

7500 articles and complete freedom to present the articles as we see fit. Enticing! What do we already have on ChemSpider that we could reuse?

1) Structure deposition

2) Analytical data and image deposition

3) Integration to other data via URLs

4) Add comments/description

5) Text markup with “Chemical enhancements”

6) A dataset of >21 million structures and integration to over 120 data sources

7) Good ideas …

Article 2.0 looks interesting…we hope to be involved.


ChemSpider has taken some thrashing over the past year. We’ve been hit on science (and proven our point many times), on Open Access versus Free Access statements, and on whether or not we have Open Data. There has been encouragement to define whether the data on our site are Open Data or not. We’ve adopted Open Data tags on data deposited by users after pressure there. When I’ve asked more about Open Data I have heard that it is not ratified at the same level as Creative Commons licenses and that those would be better to use. A week ago we put up Creative Commons Licenses in what I hoped was a GOOD move for the ChemSpider site, one that would ease the criticism of our site and potentially earn the critics’ blessing and support.

We received a blessing for all of 72 hours. In his blog post Peter Murray-Rust was DELIGHTED with our decision to do this. I quote: “I am DELIGHTED to report that Chemspider has adopted a CC-SA licence for its data.” and espoused “PMR: This is wonderful. As far as I know Chemspider is the only commercial chemical information company offering data under this licence, which is completely compatible with the Open Knowledge Definition. (It is also BBB-compliant, though data and publications are different animals).”

I assumed therefore we’d done a good thing. There was no indication to me that our position was anything other than positive.

There has been a conversation going on in the blogosphere for a couple of weeks now about Strong and Weak Open Access. I’ve read, watched and simply let others share their opinions because they’ve been in Open Access discussions for a number of years and have more context, background and passion to stay engaged in these discussions. They ARE important discussions and will come to a conclusion.

It appears that “I” am confused by Creative Commons licenses. This is based on the fact that for 72 hours we had done a good thing and had a blessing, but 3 days later I read yet another post, this time with a comment from John Wilbanks stating “I’d like to see a meaningful discussion of the risks of Share Alike and Attribution on data integration. Chemspider’s move to CC BY SA fits into this discussion nicely - it’s a total violation of the open data protocol we laid out at SC, which says “Don’t Use CC Licenses on Data” - but it does conform inside the broader OKD.”

Uh-oh. ChemSpider is in Total Violation of Creative Commons Licenses. As we say in Wales in times of distress … “Hell’s Bells” (My dad was a builder..if you believe he taught me to curse like that well….)

Peter followed it up with a comment “PMR: I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism - CC-SA is conformant to the OK definition, but difficult to operate for re-use). That’s why we use the OKF’s OpenData sticker on CrystalEye.” Hmmm…

Again, when I’ve asked about the OpenData sticker I’ve been informed that this is not yet ratified.

There have been many discussions about Openness I’ve been involved with..just one example here. It has been difficult. Openness and licensing remains confusing…see here an example and this is just about a blogsite!

So the question is what now? Do we remove Creative Commons Licenses? Do we adopt Open Data licenses or do we just get ourselves out of the middle of this entire confusing discussion until all is resolved and settled? And IF we remove CC licenses and don’t post other licenses I know we’ll get criticized for that too. But let’s be honest…we’ve been highlighted for NOT having licenses up to this point. Now we are highlighted FOR having them. Maybe we can hope that no press is bad press. I’ll await feedback on this post and make a decision about what to do in the next 48 hours. Blog away…


Last week I had a pleasant chat with a reporter from Nature magazine, a Mr Geoff Brumfiel. Geoff was interested in ChemSpider…what it was, how it ran, who used it, who supported it, who liked it, who curated it, who didn’t like it and so on.

The results of that discussion, and others he spoke to about ChemSpider, are here in his article.

Chemists spin a web of data p139
Chemspider website provides free information on millions of molecules.
Geoff Brumfiel
doi:10.1038/453139a
Full Text | PDF

It is a rule at Nature, at least for this type of article, that I could not see the article before it went to press and therefore I didn’t get the chance to proofread and comment. Geoff has accurately captured the spirit of our discussions but a few detailed clarifications are needed too. I have pasted in black the article content and in italics the clarification.

providing the community with an open-access source of chemical information

I giggled and commented please don’t say it’s Open Access. Say it’s Free Access. Say there are Open Data. And now we have Creative Commons licenses. But don’t say it’s Open Access, not Strong, not weak, not gold, not green. Just Free Access. No price barriers to usage.

Chemist Antony Williams is hoping to change this in a move likely to ruffle the feathers of the American Chemical Society.

I commented that we are not purposely in competition with anyone. It’s not what drives us to do this. Whether others see us to be competitive is for them not us. We don’t intentionally try to ruffle feathers. It doesn’t mean that what we are doing won’t ruffle feathers of course. Whether it’s ACS or others. It’s not the goal..it might be an outcome.

The modest project has made chemists interested in open access take notice — last week, the number of daily users of the site surpassed 5,000.

We have crossed 5500 users for the past two nights. The trend is positive.

“Other potential sources of information, such as Wikipedia, lack the algorithms needed to search chemicals according to their structure.”

Structure searching is “feasible” of course with InChI Strings. But substructure isn’t, and Wikipedia is treated as a text-based search by almost all of its users.

“The site is maintained with modest profits from advertising and the work of about 30 active volunteers who double-check the data pulled in from outside.”

The original investment in hardware and software costs has finally been recouped. Modest profits? No one gets paid for the work we do. There is a phenomenal sweat equity investment in the platform numbering many thousands of hours to get here. We are indebted to the many software collaborators, providers of tools and the people curating and depositing to the system. There have BEEN about 30 active volunteers. Right now I would say the number of active depositors and curators is around 10. But it is growing. I hadn’t checked the number of REGISTERED users for a long time. We have over 1150 registered users…those who CAN login and curate data, deposit data, see new features etc. People do NOT have to register to use the site…but >1150 did. Wow. I didn’t know it was that many until I just checked (BIG SMILE).

“There’s an awful lot of chemical information, but there’s an awful lot of rubbish as well,” says Barrie Walker, a retired industrial chemist in Yorkshire, UK, who helps maintain the site.

Don’t know whether Barrie said this or not. He IS an honest guy and he is our QUALITY GURU and we are proud that he is willing to give us his fine eyes. There IS garbage on the site still. But, after a year online and active curating, it has been much reduced. About 200 edits a day are made to the site: names changed/deleted/added, spectra/structures/URLs/Publications added etc. It’s quite the pace. We have cleaned up 100s of thousands of incorrect associations from the external data sources. It’s been and will remain an enormous task with an enormous payback for the community.

Williams adds that the site still has problems with certain searches. For example, it struggles to distinguish between isomers: molecules with the same chemical formula arranged in different structures.

We can distinguish isomers no problem. The PROBLEM is that there is a mixture of isomeric species submitted from multiple data sources and the data are mixed and intermingled in a way that the user cannot get to the correct structure. Search taxol or Ginkgolide on the ChemSpider blog and read the multiple blog posts about this. We can of course search all isomers for a particular chemical formula…
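
As a rough illustration of the formula-based isomer search mentioned above, here is a minimal sketch that groups a few records by the molecular formula read from the formula layer of each InChI string. The sample records and function names are hypothetical and are not ChemSpider's implementation; the sketch also assumes every InChI carries a formula as its second layer, which holds for ordinary compounds but not for every possible InChI.

```python
from collections import defaultdict

# Hypothetical sample records (name, standard InChI); not ChemSpider data.
RECORDS = [
    ("ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ("dimethyl ether", "InChI=1S/C2H6O/c1-3-2/h1-2H3"),
    ("benzene", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),
]


def formula_from_inchi(inchi):
    """Return the formula layer, i.e. the second '/'-separated field."""
    parts = inchi.split("/")
    if len(parts) < 2 or not parts[0].startswith("InChI="):
        raise ValueError("not a recognisable InChI string: %r" % inchi)
    return parts[1]


def group_by_formula(records):
    """Map each molecular formula to the names of the records sharing it."""
    groups = defaultdict(list)
    for name, inchi in records:
        groups[formula_from_inchi(inchi)].append(name)
    return dict(groups)


def isomers_of(formula, records=RECORDS):
    """List every record whose formula layer equals the given formula."""
    return group_by_formula(records).get(formula, [])
```

Calling `isomers_of("C2H6O")` on these sample records returns both ethanol and dimethyl ether. That is the easy part; the hard part described above is deciding which names, links and data belong to which isomer.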

“But Williams nevertheless believes that the service may be able to compete with for-profit services. ‘What I’m doing is highly disruptive,’ he says. ‘I think it can be done and it needs to be done.’”

I think what WE are doing…it’s not me…it’s we…is disruptive. In a good way. Many chemists will benefit. Will it have an impact on for-profit services? Yes, maybe. As an outcome but not as the target. Our team of people, both internal to ChemSpider’s development and Advisory Group, and the people we don’t even know who are cleaning and depositing into the system for their colleagues in the community, are creating a powerful resource for Chemists. The FOCUS of this effort is to Build a Structure Centric Community for Chemists. We will change that soon…the focus on Structure-Centric will be to cover Chemistry in general and to Build a Community for Chemists.

We are well on our way, thanks to Nature, and to Geoff in particular, for exposing it. My comments above are not meant to detract from Geoff’s reporting abilities but it was a long discussion and some clarification statements are of value, I believe.

We have made significant advances in the structure deposition system on ChemSpider. We’ve reported on our advances previously and are working hard to polish it. In parallel we’ve done work to support deposition of batches of structures (100s to many thousands) as well as the deposition of CSV files to support Open Notebook Science. We are going to roll out deposition in phases - single deposition first, batch deposition next and then CSV file based batch deposition.

So…why are we encouraging the deposition of structures onto ChemSpider? We agree that we could accept RSS feeds (and we will). Our view is that people might want to have “bragging rights” on their latest synthesis, might want to expose their latest paper on ChemSpider, might have a link to an article online that they might want to expose to people. While there are MILLIONS of structures online there is new chemistry reported every day. What other system is there available as a structure-based community for chemists where people can deposit their structures, stories, links and comments to share with others? (And open up a conversation with others about synthesis, analysis etc.) Think of it a little like Flickr or YouTube for chemical structures. Anyone can post their structures for people to browse.

I’ve been doing some example depositions to show what’s feasible…these are simple to do…a few minutes work maximum.

1) I was a co-author of a publication and received a copy today. I wanted to put a link to the paper and associate it with the structure we analyzed. The structure already existed on the database so this was information to be added to the existing structure. Scroll down to the end of the page for this record to see this Supplemental Information.

Martin, G.E., Hilton, B.D., Blinov, K.A. and Williams, A.J. “Using indirect covariance spectra to identify artifact responses in unsymmetrical indirect covariance calculated spectra”, Magnetic Resonance in Chemistry

[DOI: 10.1002/mrc.2141]

2) A new publication was released this week regarding a new compound Quesnoin. David Bradley blogged about it on Spinneret. In this case I wanted to add the structure, information about the structure as well as a link to the recently published article. Scroll to the bottom of this record.

There are many other examples online too here (1,2,3). Look at the Supplemental Information in each case.

There are some final tweaks being made at present but single deposition is now rolled out. We are looking for people NOW to start using the system so please ping me. An overview of the system is available here.

deposition_workflow1.png

The future will include users creating their own “catalogs” of structures, “social networking” and discussions around structures, team-based discussions, public and private structure collections and so on. It’s coming…in stages. We start here with the single deposition process.

Over the past few weeks I have had a few discussions with a member of the ChemSpider Advisory Group regarding a concept to create WiChempedia. I’ve enjoyed these conversations with Alex Tropsha (Professor and Chair in the Division of Medicinal Chemistry and Natural Products in the School of Pharmacy, UNC-Chapel Hill). We are like-minded in a number of ways but specifically in what can be done to facilitate delivery of quality information to the chemistry community.

As you will notice if you frequent this blog I am rather a stickler for accuracy and quality (1,2,3). I think it’s important (4). Over the past few weeks I’ve spent more time looking at the quality of data on Wikipedia and trying to figure out the best way to bring together our efforts on ChemSpider to enhance the capabilities of integrated information and to support the quality efforts being made by the WP:CHEM team and help them. I also intend to facilitate the development of our own Wiki environment for chemistry and to generally enhance the tools available to chemists not only for Wikipedia type annotation but also to support Open Notebook Science.

Now, I don’t want to reinvent the wheel. Wikipedia has a lot of what is necessary in terms of being a known system, a following of people and committed supporters in the WP:CHEM team. What I have been hoping for was a shift around structure and substructure searching on the MediaWiki platform but I know that is a tough request as the platform is not built for that type of thing. The InChIKey holds some promise for exact structure searching but does not offer an opportunity for substructure searching without a lookup across a larger database. I want to facilitate information and data sharing further. I do want to provide the type of service that Wikipedia does in terms of general information but also layer cheminformatics tools onto that knowledge and information, allow addition of analytical data, analysis tools, real time predictions and analysis ultimately. This platform should certainly be wiki-enabled.

Decision made. Our intention is to deliver wiki-capabilities in ChemSpider and to use the Open Content associated with chemicals and drugs on Wikipedia inside the system. We will then provide an environment for people to continue to add to, enhance and curate the Wikipedia content as well as add their own. Last night (and well into the early morning) I spent some time talking to Martin Walker from WP:CHEM regarding my concerns that we might offend the Wikipedians with our efforts and that I did not want them to feel that we were ripping off their hard work but rather have our efforts seen as supportive and enabling. My intention, as we work through downloading the data, is to check, validate and correct what is sitting on Wikipedia directly, for the benefit of the community. Also, we will of course need to leave all Wikipedia content under the appropriate licensing for others to use. Martin commented that there are tens of mirrors of Wikipedia out there ripped purely with the purpose of exposing it and getting ad revenue. We are not working from that model…our intention, as usual, is to build a structure centric community for chemists and with so much excellent work done on Wikipedia I want to take advantage of it and also give back through the work we will do.

Two domain names have been grabbed for this project: WiChempedia, for compatibility with Wikipedia, and also WeChempedia, to emphasize the community aspects of the project.

If you frequent this blog you will recall that we have made a commitment to Microsoft Sharepoint as our future platform for wiki’ing ChemSpider. That is where we believe this work will be done ultimately but we don’t have the platform in our hands yet.

The Xmas vacation is going to be full of holiday movies and manual examination and curation of the Wikipedia data. Wish us luck!

Following my recent post on high performance computing and the Cell B.E. I saw this today re. gamers handing over their compute cycles to PS3GRID. I abstract here but point you to the full article for details:

“PS3GRID is coordinated by researchers at the Research Unit on Biomedical Informatics (GRIB) at the Instituto Municipal de Investigación Médica and the Universidad Pompeu Fabra in Barcelona, Spain. The distributed infrastructure enables any PS3 to do computations on atomic and molecular simulations.

The researchers, headed by GRIB scientist Gianni De Fabritiis, chose the PS3 because it is the first consumer device to contain the IBM Cell processor. “The Cell,” which is more than an order of magnitude faster than standard Intel or AMD processors, optimizes the types of computation commonly used in graphics applications. In addition, the Cell offers an inexpensive and powerful method to perform highly detailed molecular dynamics simulations of biomedical systems. Using the Cell, a PS3 has the computational power equivalent to about 20 PCs.”

I think the image below will tell the story of what’s coming soon to ChemSpider. As part of a collaboration with a member of our advisory group we will be unveiling this new capability for beta testing in the very near future. I’m sure some of you will see where we are going next…watch this space.

I subscribe to Scientific Computing so that it drops into my email inbox. I read Rob Farber’s article this week entitled “The Future Looks Bright for Teraflop Computing”. His opening question was “Wouldn’t it be great to have a teraflop of computing power sitting in your lab, desktop workstation, or remote instrument server?” What would that mean to your work?

For those of you using ChemSpider you will know that we have about 20 million compounds on the database. With that many compounds, population of the database with properties such as InChIStrings, InChIKeys, physchem properties and systematic names can take many days if not weeks. With only three computers in our hands, one of them a web server and one of them the database server, we are limited to one system for these calculations. Even that dual processor system provides slow throughput. Oh the joys of having access to teraflop processors!!!

In my previous post on focused libraries I commented on ongoing discussions regarding the potential to perform online docking. Evangelists such as Jean-Claude Bradley (on our advisory group) have been talking about this possibility as part of his approach to Open Notebook Science. Docking can be very time consuming and the speed of calculations is very important. I have been working on a project regarding the value of porting docking software to the Cell Broadband Engine processor from IBM. The development of that processor is an interesting story in itself since it was driven specifically by the needs of the gaming industry for better performance in their calculations. Now SimBioSys are porting their docking software to the Cell processor as described in this White Paper. The improvement in performance is quite amazing!!!

While working for a commercial software company we saw productivity gains moving to clusters. Dual processors in our laptops and annual performance gains from the general technology shifts offer faster calculations every year. Teraflops on the desktop (and even laptop) are likely a few years away…but GFlops are here.

When we first started the ChemSpider project we made a commitment to “Build a Structure Centric Community for Chemists”. We are well on the way to facilitating that we believe. We have talked about a “wiki” environment for collaboration. In this framework we see wiki to indicate a “collaborative environment”, not necessarily adherence to a specific wiki-platform. Our intention is to provide the ability for users of ChemSpider to collaborate in the co-management of content on the ChemSpider site. A number of our readers have taken our statements to indicate that we will be using the same wiki platform as that utilized on Wikipedia. We have looked at and considered a number of “wiki” tools, platforms, interfaces and user-experiences. At this time we have made a decision to utilize Microsoft SharePoint as the platform on which to construct our wiki-environment. With a clear commitment to Web 2.0 already declared and our platform built on SQL Server and ASP.NET we feel it is the appropriate platform for us to build on. We believe our technology choices have already demonstrated that we can deploy a good solution very quickly.

Now, we realize that this might result in a series of jabs about us not using Open Source solutions and so on but we are more focused on delivering an appropriate scalable solution than building ChemSpider only on Open Source software. We will support anyone who wishes to do the same on Open Source though.

We will keep you informed of our progress. Now we need to migrate ourselves to .NET3 and we hope this will be a short term disruption in the future as we switch over. Watch this space.

For those of you who have been watching the blog of late you will be aware of the recent discussions about Open Data (1,2). We have offered the possibility to submitters of spectral data to declare their data either Open or Closed. Noel posted a comment on the blog asking the question “Why is the default Closed? Why even offer the option of Closed?”

So…my response to “Why not offer the option of Closed?” My opinion is that this is the submitter’s decision. It’s not our role to force “Openness” of data onto users. We are working to create an environment that provides value to ChemSpider users rather than one that forces them into a policy regarding openness. Personally, I would prefer to have access to data to help answer a question, even if they are NOT Open Data, than to not have access to those data. I have asked all of the people who have submitted data or had me submit data to ChemSpider whether they would like to have their data moved to open. Three said yes, two said no. I do NOT intend to force people to adhere to making their data Open. That is their choice, not mine. We are creating a community for collaboration. There is value in having access to data whether it is Open or not. If you look at the recent conversations about RSC and their Free Access versus Open Access we must agree that there IS value to Free Access to their articles despite the fact that they are not Open Access.

My friend Gary Martin has allowed us to deposit some of his data onto ChemSpider. He has commented twice (1,2) and I refer you to those blog postings for his opinions. They are interesting to read.

The reality is that our policies, even as they are, appear to be appropriate to have people deposit their data. We already have over 100 spectra deposited on ChemSpider and more to come based on recent conversations. Some of these ARE Open Data and the depositors are acknowledged for this. They are sharing their data with you through us. That’s the benefit of building a community for chemists.

This week I was privileged to attend a PubChem Working Group meeting in Washington and sit around a table with interested parties discussing the present and future state of PubChem. I had the opportunity to give an overview of ChemSpider and our vision of ourselves and where we are going. If you are interested in reviewing the commentary please find a PDF file of the presentation here (shared with permission of PubChem). I welcome any comments, feedback or questions either as a blog response or offline.

Seth Godin is a mentor to many marketers out there today. I’ve read a number of his books over the years and he has many comments. He is a self-professed “idea-giver” …read his latest blog posting. I specifically like his comment “ideas are easy, doing stuff is hard”. How true that is. Over the years I’ve had lots of ideas. I’ve shared many “beverage-based conversations” where big ideas have been put out. The trick is in the “money where your mouth is” execution of these ideas. Over the years I’ve had the pleasure of working with people who tend to deliver as well as talk. WAY more motivating than just listening to the promises of what could be.

A few years ago at a meeting in Washington I sat in on probably the earliest public forum discussion on the potential of InChI. As a result of excellent teamwork between NIST and IUPAC, and doing rather than just talking they got it done. There was some negativity expressed during the initial meetings about InChI but it did not distract the team from producing the prototype versions, initial release and now the latest update with InChIKey support.

Now, I’ll guarantee that Seth Godin doesn’t know what an InChIKey is (Seth, if you’re reading this prove me wrong :-) ). But I want to take the position of supporting the Big Idea of structure searching the web and suggesting the InChIKey as one way to execute on this now. There is a lot of passion around doing this and it has shown up in a number of postings by Rich, by Joerg (in regards to Wikipedia in this discussion), by Egon (discussing RDF’ing molecular space) and Jim, among others.

I am reading and hearing exchanges about the web being made structure searchable and my mind drifts immediately to the “it’s not enough” stance. The InChIKey should address some of the issues seen with InChI string searches and likely will be way more popular with the search engines. As commented last night on ChemSpider news the InChI keys on ChemSpider now link directly to a Google search.

The challenge remains: once all of those keys are out there, how will the web be SUBstructure searchable or SIMILARITY searchable? The solution would appear to be a centralized repository of structures with their associated InChI strings and InChIKeys. The InChIKey cannot be reversed to the structure. A centralized repository of millions of structures and associated InChI strings and keys would allow that repository to be searched by substructure/similarity and then, when a structure (or structures) of interest is identified, the Google search on that string/key could be kicked off. Maybe the discussion regarding the creation of such a centralized repository has happened already so I’d be interested in hearing what the path forward for that is. If it’s happening then the questions are who will host, how will it be funded, is there a timeline etc. If it’s not happening or is way in the future then I have an interest in opening the discussion regarding using the ChemSpider database and appropriate services (presently under development) to provide an interim service.
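
To make the repository idea concrete, here is a minimal sketch of the lookup-then-search flow: because the InChIKey cannot be turned back into a structure, the repository keeps each key alongside its InChI string, a structure search runs against the repository, and each hit is handed to a web search as a quoted key. Everything named here is an assumption for illustration only: the two-entry repository, the function names, the Google URL pattern and the use of present-day standard InChIKeys. The `matches` argument stands in for a real substructure or similarity engine, which this sketch does not implement.

```python
from urllib.parse import quote_plus

# Hypothetical in-memory repository: InChIKey -> InChI string.
REPOSITORY = {
    "LFQSCWFLJHTTHZ-UHFFFAOYSA-N": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",   # ethanol
    "UHOVQNZJYSORNB-UHFFFAOYSA-N": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",  # benzene
}


def structure_for_key(key):
    """Look up the InChI for a key; the key alone cannot be reversed."""
    return REPOSITORY.get(key)


def find_keys(matches):
    """Return the keys of all structures accepted by the `matches` predicate.

    `matches` is a placeholder for a real substructure/similarity search
    that would run over the repository's structures.
    """
    return [key for key, inchi in REPOSITORY.items() if matches(inchi)]


def google_search_url(key):
    """Build a web search URL for the exact (quoted) InChIKey."""
    return "https://www.google.com/search?q=" + quote_plus('"' + key + '"')


def search_links(matches):
    """Repository search first, then one web search link per hit."""
    return [google_search_url(key) for key in find_keys(matches)]
```

For example, `search_links(lambda inchi: "/C6H6/" in inchi)` uses a crude formula test in place of a true structure search and yields a single link for the benzene key.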

Structure searching of the web is of course going to provide high value. It should not stop there of course. Let’s have the proactive dialog now about the next phase to facilitate substructure and similarity searching. If the conversations are going on elsewhere please post the links as comments so that the readers can follow them. I’m sure that Egon, Joerg, Rich, PMR will all have thoughts about how this should look. The bottom line is that, if this is the path, the underlying system needs to be able to handle at least 25 million structures (ChemSpider has 17 million already) in the short term and be scalable to many tens of millions. There aren’t too many open platforms that can do that yet. I am aware of commercial platforms supporting many millions but no Open Source platforms yet…

Recently I posted some statistics regarding traffic to the ChemSpider website examined using various tools…our own and the Alexa Rank engine. Peter Schneider has commented on the performance of the various rank engines. He also asked an interesting question: “But the real question is: Does emolecules generate more income with an Alexa Rank of 400 000? It is not the question, if a site has more visitors or not… The question is, which project will survive…” It’s definitely worth commenting on!

I am looking into the Alexa Toolbar issue and if Peter is correct in his judgment of its bias we will likely take it down. What we are looking for is accurate representation. We are now tracking google analytics and have signed up on compete.com as he suggested so only time will tell now.

I think Peter is right in that there needs to be some standard way to compare sites. Certainly ChemSpider is not out to “beat” eMolecules or PubChem, or any of the new systems which might come online in the near future. I believe we all share the same space and bring value in our own ways. I have great respect for what Klaus and the group are up to. I collaborated with the team directly while I was at ACD/Labs - integrating ChemSketch into Chmoogle (as it was then), arranging exposure at Reactive Reports and then again with the logP donations working with the PhysChem product manager at ACD/Labs.

Does eMolecules generate more revenue than ChemSpider with a lower Alexa rank? I would hope so…they are a business! I am not sure of their business plan but it does include exposing companies’ catalogs through their site (for revenue I should expect - see example with a NCH skin on top of eMolecules engine at http://nchlab.emolecules.com/). I have also heard that in certain cases compounds sold via the website result in a percentage going to eMolecules. I don’t know if it is true but it is rumored to be that way. (By the way…I suggested to Klaus that we exchange our relevant structure collections, index each other’s structure collections and link between the sites but haven’t got a response yet. This type of exchange/integration is what Joerg is talking about here.)

ChemSpider, on the other hand, is a passion project. Until about a month ago it was non-revenue generating…more bank account draining :-) All computer software, hardware, ISP fees etc were paid for out of our bank accounts. Yes, we founded a corporation to do this…we’re an overly “litigious society”.
Recently I chose a period of personal sabbatical so now I am the non-revenue generating member of the household (but a great chauffeur for the children). I am happy to say that now we actually have sponsors for the site. We did try the Adsense approach but the $2.50 per day wasn’t worth the reputation ding and the annoying screens. We’ve added “Buy me a Coffee” to the blogs…but so far we haven’t had one. So, we are depending on the kindness of our sponsors to keep the site going at present. If you look at the home page you will note that Waters was kind enough to sponsor the site and is a gold-level sponsor based on the magnitude of their support. We have recently received support from one of our other collaborators and their logo will post soon.

I can confirm that in my downtime I am looking for additional funding to keep ChemSpider going in whatever way it comes: sponsorship, anonymous donations, grants, collaborations, begging, borrowing (no stealing…). ChemSpider can continue to move while there are free cycles to support it and enough income (or family monies available) to keep it exposed. If there is no way to create a revenue stream from the system it will certainly suffer in terms of the pace it moves when those of us working on it now get tired and some of us “go back to work” and have new career objectives to distract us. ChemSpider IS still a passion project. The intention is that there will always be an Open Access ChemSpider for chemists to use. I see no reason that everything you have access to now will ever be taken away. The majority of what we have in our development plans is for the good of all. I don’t know how else to commit to a deeper level of permanence for the site. We are not yet done with the conversations about Open Sourcing the code in the future.

So, thanks Peter for asking the question about “which project will survive”. If any readers have thoughts about garnering financial support for the system through sponsorship, grants, collaborative work etc please contact me at the usual address (antony.williams AT chemspider DOT com) and open the discussion. What we want is for ChemSpider to be around for many years to come…and I believe we can make that happen even in our spare time. That said, with dedicated effort the reach of this project can be truly massive…

This past week I received some inquiries and comments regarding the traffic coming to the ChemSpider Site. It was commented that it was not possible to compare eMolecules traffic and ChemSpider traffic on Compete. I confirmed this and have now registered ChemSpider so that this should be possible in the future. There are many Analytics tools out there to measure traffic at a site. We use Weblog Expert at our site for our internal analytics tool. The plot below shows a fairly linear growth in the number of unique visitors to the ChemSpider site since we went live on March 27th, just in time for the Spring ACS.

WebLog Expert plot

We also use Alexa to browse our performance. The statistics are shown below for the increase in global users accessing the site, the overall traffic rank and the number of page views per user.

Alexa Rank 1

The geographical distribution of visitors is actually quite surprising. Until recently the UK was actually the most popular visiting country but the US visits increased dramatically when the announcement regarding Patent Searching went online. What is quite surprising is the low number of visitors from Germany, China and India. Based on my previous experiences in the chemoinformatics world I would expect Germany to be much HIGHER and certainly there should be increased traffic from India. That said, India wasn’t even on the list a week ago and is growing now as the message spreads. If any of you can help spread the message outside of the USA please do!

Alexa Rank 2

Addressing the original statement about being unable to compare stats on www.compete.com I’ve shown the geographical traffic ranks for eMolecules. Clearly there are a lot more countries for ChemSpider to provide value to! Hopefully our penetration will increase with time.

Emolecules on Alexa

Interestingly, there are also all types of rumors about the validity of Alexa but Alexa challenge this. It’s difficult to know what’s right so what’s reported here is simply what’s given online. What we are happy to report is an ongoing growth in the usage of the system. It validates our efforts.

For those of you watching the progress of ChemSpider since its initial exposure in March of this year we have been incrementally adding new features and specifically integration to other rich sources of information. We have delivered integration to multiple data sources (Click on the Data Sources checkbox under the Advanced Search for the list) as well as the integration to text-based searching of 50,000 Open Access articles via the ChemRefer service. Now we have extended the ability to include review of Patents.

In a collaboration with Reel Two we have delivered structure and substructure searching across millions of chemical structures linked to patents from the US, European and Asian Patent Offices via their SureChem Portal. Following a search simply click through to the Detailed Results page for a particular structure and look in the Data Sources list for the word SureChem. See below as an example…note Surechem blocked in red.

Surechem Link

Clicking on any of the names in the Data Sources link launches a new Browser Window containing the links to the External Substances links as shown below.

links to Surechem Data Sources

Clicking on any of the External Links will take you to the actual patent sitting on the Patent Analysis website and identified via the Surechem query. For example, see here.

We have a number of ideas to enhance the delivery of patent information via ChemSpider but for the time being we believe that the ChemSpider and Reel Two SureChem integration offers a powerful means by which a chemist can navigate their way from a chemical structure to a patent. We welcome your feedback.