Archive for August, 2008

MeSH is likely well known to anyone working in the life sciences and with PubMed. As defined on Wikipedia:

“Medical Subject Headings (MeSH) is a huge controlled vocabulary (or metadata system) for the purpose of indexing journal articles and books in the life sciences. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM’s catalog of book holdings.

…The 2005 version of MeSH contains a total of 22,568 subject headings, also known as descriptors. Most of these are accompanied by a short definition, links to related descriptors, and a list of synonyms or very similar terms (known as entry terms). Because of these synonym lists, MeSH can also be viewed as a thesaurus.”

We are presently moving further into integration with PubMed and, as part of this move, we have decided to integrate MeSH information at the structure level into the relevant record views. Now, when you visit a particular record where MeSH information is available, the data will be visible under the MeSH tab, open by default.

MeSH has been curated by a highly skilled team over a number of years. More information about MeSH can be found online. The contents of the MeSH table should be self-explanatory. Over the next few weeks watch as we do more with the integration of MeSH and PubMed.

Buy me a Coffee

Most people reading this blog will know that we are advocates of the InChI standard for structure representation. I am aware of the intentions to extend the InChI into the world of reaction capture and look forward to testing it as it moves forward and providing feedback to the team. An announcement was made in the CSA Trust Newsletter and I’ve snipped it below.

“A project to develop a standard representation for chemical reactions was launched recently at a meeting in Berlin, Germany, hosted by René Deplanque of FIZ Chemie. The project is being led by Guenter Grethe.

The goal of this meeting was to develop the requirements for a proposal to be submitted to IUPAC to fund an Open Source, public domain ReactionML (IUPAC RML) standard to complement the IUPAC InChI chemical structure representation. The requirements would include what the community needs, technical and organisational issues and financial aspects.

The meeting was quite successful and an initial first stage of the project was agreed to and will include:

  • Reactants
  • Products
  • Reagents
  • Catalysts
  • Solvents

All the chemical structure representation will be based on and build upon the IUPAC InChI/InChIKey standards, which, since its introduction in August 2006, has become the international chemical structure representation standard for all large databases of chemical data. Some of these databases containing InChIs are in excess of 36 million unique structures.

It is expected a beta test release version of this new IUPAC standard will be available for public testing by the end of 2008.”

I’ve had a number of questions about the presentation I gave at ACS Philly last week about document markup. The phrase I keep hearing is “very disruptive” followed by the question “will authors do more work and what’s in it for them?”.

The presentation here outlines the general concept that I talked about…

The basic concept I presented is as follows, with a focus on Chemistry Articles.

A lot of effort is being expended in “text-mining” publications, post-publication, to index these articles and make them searchable not only by text but by the specific language of chemistry, chemical structures. We are specifically asking the question “why extract chemical structures from articles using chemical name conversion approaches and chemical image conversion tools when the structures in the article were ORIGINALLY machine readable?”

We are considering a system whereby authors are asked to contribute to the availability of a free online service for performing structure and substructure-based searches of chemistry articles. While the submission of journal articles is already a lot of work (I know from experience of authoring/co-authoring about 10 a year), we hope that authors will support a service whereby they can upload their own articles to a “validation and mark-up service”. The upload capabilities will support upload of the primary document, chemical structures in standard formats and supplementary information of various types (to be defined).

This system will perform the following services:

1) Semi-automated markup of a document: title, author(s), abstract and additional dictionary-based terms, plus the ability to use the NLM-DTD markup
2) Identification of chemical names and conversion to structures in an automated fashion
3) Conversion of structure IMAGES to connection tables using optical structure recognition software (either commercial or open source)
4) Ask authors to confirm whether the converted structures are appropriate
5) Provide a structure validation service for submitted molecules, checking for “accurate representation”
6) Deposit all structures associated with an article onto ChemSpider but under embargo. Associate the article title, authors and “abstract snippet” with all structures.
7) Issue a set of ChemSpider IDs for the author to submit to the publisher with the article
8) When a publication has passed through review the author can release the structures from embargo using a DOI or an article URL (more common for Open Access articles)
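
As a rough illustration of steps 6 to 8 above, the embargo-and-release idea can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: the `Deposition` class, its field names and the `searchable_ids` helper are hypothetical names invented for this example and do not describe the actual ChemSpider implementation.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Deposition:
    """Hypothetical record: structures from one article, held under embargo."""
    title: str
    authors: List[str]
    abstract_snippet: str
    structure_ids: List[int] = field(default_factory=list)
    embargoed: bool = True               # step 6: deposited but hidden
    article_link: Optional[str] = None   # DOI or article URL, set on release

    def release(self, article_link: str) -> None:
        """Step 8: lift the embargo once the author supplies a DOI or URL."""
        link = article_link.strip()
        if not link:
            raise ValueError("a DOI or article URL is required to release")
        self.article_link = link
        self.embargoed = False


def searchable_ids(depositions: List[Deposition]) -> List[int]:
    """Return the structure IDs that are visible to public searches."""
    return [sid for d in depositions if not d.embargoed for sid in d.structure_ids]
```

In this sketch the IDs handed to the author (step 7) already exist while the deposition is embargoed; they only show up in `searchable_ids` after `release` has been called with a DOI or article URL.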

The result of this project will be a way for publishers to link their articles directly to a free access chemistry database and use a series of web services to enable other capabilities (to be defined). It will also allow articles in Open Access and non-Open Access publications to be searchable by the “language of chemistry”.

This is only a slice of the overall project but I think it may be of interest relative to the comments you have made below.

Parts of this were shown last week at Drexel University and a particular snippet is available online here:

We are also going to provide a Microsoft Word add-on which will allow users to prepare articles for publishing using similar technologies.

We think this IS disruptive..what say you?

There is an interesting post on the End of Cyberspace blog about whether online databases and journals are changing scientists’ reading habits. I always feel it’s more appropriate to give the original blogger the traffic, so pop over and read the post here.

I think the extract below might tempt you to do so…

“Searching online is more efficient and following hyperlinks quickly puts researchers in touch with prevailing opinion, but this may accelerate consensus and narrow the range of findings and ideas built upon.”

Oh bless the power of the blogosphere. I have about 250 blogs on my blogroll and one of them is Wikinomics. As with all things nowadays (email, voice mail, mail itself), I have to get through them all at a rather frantic pace just to leave time for real work. Reading through 135 new blog posts generated overnight, I hit a very interesting post by Jude Fiorillo all about the SearchMe.com search engine and its visual way of displaying hits. LOVE IT!!!

What does this “yet one more search engine” offer? Jude comments: “What’s special about this search engine is that when you query a topic, the websites returned to you are displayed visually (similar to Apple’s cover flow) rather than in list form. A picture is worth 1000 words so they say, so it makes sense that with 1 quick look at the preview pane of a website, you can better filter your results, and roughly gauge the quality of the website (by it’s professionalism and aesthetic, available content, and general message).”

Does it deliver…oh yes. I did a search on ChemSpider and I saw the results here. I “saw” web pages about ChemSpider I’d never seen before. Easy to navigate. Very sweet. These guys will get bought by Google…

I am looking for someone with a good understanding of carbohydrate chemistry to join the ChemSpider Advisory Group and help us “get carbohydrates right” on the ChemSpider database. It would be sweet if someone could help us clean up the abundance of data on the site and offer us their skills. It might be a good project for a student to work on with us as it will require some research to make sure that we end up with REFERENCE quality data for others to use. Anybody interested?

We are testing out a new 3D molecule optimizer on ChemSpider at present. It appears to be more rugged than our previous algorithm and, in our hands at least, has not yet failed to produce a 3D representation. On general small organics it appears to handle the optimization very well and is quite fast. We welcome your feedback if you have time to test. In order to use the optimizer simply go to a record view, click on Zoom on the Structure tab (see below) and then click on Show 3D as shown in the second image. Since all coordinates are calculated in real time, please note that it can take a few seconds from opening the JMol applet to display of the optimized structure (we need to add a “calculating….” display element when we have time).

Things look good for us but we have had a couple of reports of failures and are trying to trace whether it is a browser security setting or not. Please let us know if you have any issues. Thanks.

As the number of spectra uploaded to ChemSpider increases (and it is now increasing at quite a rate) we have noticed that the loading time for records with a large number of spectra can be very long, especially if the spectra are “heavy”, for example C13 spectra at high frequency and with zero-filling. When there are a number of spectra there are even more challenges.

With this in mind we have introduced the ability to Load a Spectrum when the user wants to see the spectrum and not automatically on loading the page. An example is shown here for recently uploaded spectra from the Drexel University laboratory of Jean-Claude Bradley.

Please test it out and let us know if you see any issues. The example listed above has a “heavy C13” spectrum so loading might take a while.
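
The load-on-demand idea described above can be sketched in Python as follows. This is a minimal illustration only, assuming a hypothetical `LazySpectrum` class and a caller-supplied `fetch` function; it is not the code running on ChemSpider, where the loading happens in the browser.

```python
from typing import Callable, List, Optional


class LazySpectrum:
    """Hypothetical wrapper that defers fetching spectrum data until requested."""

    def __init__(self, name: str, fetch: Callable[[], List[float]]) -> None:
        self.name = name
        self._fetch = fetch                       # called only on first load()
        self._data: Optional[List[float]] = None  # nothing fetched at page load

    @property
    def loaded(self) -> bool:
        """True once the spectrum data has been fetched."""
        return self._data is not None

    def load(self) -> List[float]:
        """Fetch the data on first use and reuse the cached copy afterwards."""
        if self._data is None:
            self._data = self._fetch()
        return self._data
```

A record page would create one such object per spectrum when it is rendered, which is cheap, and call `load()` only when the user clicks to view a particular spectrum, so a “heavy” spectrum costs nothing until someone actually asks for it.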

Josh Wilson, the Reference Librarian for the Physical and Mathematical Sciences at North Carolina State University Libraries, posted a comment on CHMINF this week. He commented:

“Recently I conducted an orientation session for new graduate students in chemistry, and I gave them a survey to determine their familiarity with some common databases and research tasks.  I thought you might be interested in seeing the results.  (I do not present them as scientific, it was a small sample and I didn’t have time to painstakingly construct the questions.)  To spare you a page-long e-mail, check the results and some observations here:”

The document gives the following information:

Question 1 was “Are you familiar with the following databases for finding chemistry information?”. Students answered on a scale from 1-4 (from not familiar to very familiar, so the closer the average is to 4, the more universally known the database, and the closer to 1, the less known). Average scores for 25 respondents:

Wikipedia - 3.24
SciFinder Scholar - 2.76
ChemFinder - 2.44
Google Scholar - 2.36
Sigma-Aldrich - 2.36
Chemical Abstracts (printed) - 2.20
Web of Science - 1.84
PubChem - 1.52
Beilstein/Gmelin - 1.28
ChemSpider - 1.08
CrossFire Commander - 1.04

Clearly we have work to do to improve awareness of ChemSpider for students (and I have already sent a note to Josh to see if I can provide an overview to students at NCSU some time in the future) but what is clear is how important Wikipedia is to students. This makes the curation work on Wikipedia all the more important!

Retrosynthetic Analysis Presentation at ACS-Philly

I had the pleasure of representing ARChem Route Designer, a retrosynthetic analysis tool from SimBioSys at the American Chemical Society meeting in Philadelphia last week. More…

Chem4Word Project from Microsoft and Murray-Rust

Following on from my presentation regarding text-mining and document mark-up at the ACS meeting in Philly, it was interesting to see the announcement about the Chem4Word project from Microsoft, a collaboration with the Unilever Centre for Molecular Science Informatics at Cambridge University, specifically working with Peter Murray-Rust and some of his team. The website announcement states: “Microsoft Research is investigating the introduction of chemistry-related features in Microsoft Office Word, including authoring and semantic annotations. More…

An announcement was made on the Blue Obelisk Discussion List this week regarding a new database of 4 million molecules at present but up to 50 million molecules in the future. It is called molecules.gnu-darwin.org/ and is described with the following comments:

“Some facts: The Molecules website contains more than 4 million small molecule structure files in pdb format, and molecular graphics representations. About 50 million molecules are still in the pipe, and they are expected to appear here over the course of the next few weeks and months. The pdb format is readable by common FOSS molecule viewer software, such as RasMol and PyMOL. In due course, we plan to provide high quality structures via energy minimization refinement, and additional resources.

Molecules@gnu-darwin.org is founded in the spirit of free software, open source, and public access. It is hoped that access to these files will be a wonderful community resource for science education, research, and entertainment as well. We are looking for investment or funding to expedite and expand this work, and lead the field, with an eye towards an advanced, complete, synthetic, structural, and informatical bioorganome. Meanwhile, the site is already an exceptional lab resource, and molecular catalog, providing the means and building blocks towards additional novel structures. We aim to be the best.

The structural biology, protein crystallography, and molecular graphics talent that is building the Molecules archive is available to work for you in a contract or consulting arrangement. Wide-ranging expertise is available. Molecules@gnu-darwin.org is built entirely with FOSS, free and open source software, GNU-Darwin OS, and it is under the aegis of The GNU-Darwin Distribution. Here is a link to the Distribution résumé. Our founder is an X-ray laboratory admin for the Department of Biophysics and Biophysical Chemistry of Johns Hopkins University School of Medicine. You can also read his CV. We would like to build a community around this website, and we are looking for volunteers and collaborators to help. Regarding any aspect of the work of this site, please feel free to contact us, molecules@gnu-darwin.org, with gdmolecules in the subject line. Cheers!”

I’m always interested in potential databases to connect to that will add additional capabilities and diversity to ChemSpider’s information. I have browsed the database and searched on some common molecules (Xanax, aspirin, Taxol and others) and found no hits. This seemed strange but it does say “Search warning: not yet fully spidered”.

The statement that there are 50 million molecules in total coming suggests that the database is a republication of PubChem and the SDF archives seem to suggest so too since they redirect to PubChem for the download: http://molecules.gnu-darwin.org/ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/

At present the database therefore appears to be the PubChem database in PDB format. I hope that there is some additional information added to warrant our linking to this new database.

I posted a couple of days ago about the talk I gave at Drexel University. Part of it was already uploaded to YouTube. The rest of the talk has been posted to Jean-Claude Bradley’s server at Drexel University and is over 1 hour and 20 mins long…it’s a general overview of ChemSpider and a user’s tour regarding “how to”. It’s not an Oscar winner but you might find some new info in there. Click HERE to see the movie.

The visit to the ACS Philly came to an end with a visit to see my friend and ChemSpider advisory group member Jean-Claude Bradley. I gave a live presentation of ChemSpider at the university and JC captured it with Camtasia and will be putting the full presentation on YouTube shortly. There were some funny moments including no hits on the first search I did only to note it was a misspelling.

Part of the presentation ended up in a discussion about Open Access publishers and I showed a live demo of our document markup system. JC posted this particular part of the presentation on YouTube already and named it the “ChemSpider revelation”.

I finished the day sitting in front of the Drexel chemistry department Varian Unity Inova 500 MHz instrument with Jean-Claude Bradley and his student Khalid Mizra. It’s been a dozen years since I’ve run an NMR spectrometer but it was like getting on a bike. They’d been having problems with S/N on C13 spectra and a couple of hours later we’d calibrated pulses, changed decoupler modulation, optimized the decoupler offset, calibrated the power and were getting spectra in a few dozen scans rather than overnight runs. It was great to be polarized again sitting near a magnet…I admit I have missed it!

The Distributed Structure-Searchable Toxicity (DSSTox) Database Network is a project of EPA’s National Center for Computational Toxicology, helping to build a public data foundation for improved structure-activity and predictive toxicology capabilities. The DSSTox website provides a public forum for publishing downloadable, structure-searchable, standardized chemical structure files associated with toxicity data.

Ann Richard is the principal for this project, someone I really respect and acknowledge as one of the people in the domain of open chemistry data who has a specific focus on quality. She has worked with a very small team over the past few years, and the data associated with DSSTox has been stringently examined.

Their EPA DSSTox Structure-Browser, developed from available structure-viewing freeware and open-source programming tools, delivers a simple, easy-to-use structure-searching capability through the chemical inventory of published DSSTox Data Files.

In their latest rollout of the DSSTox Browser they have provided new structure-based link-outs to external websites to assist users in directly accessing structure-related content and capabilities on external public websites of potential interest to the DSSTox user community. Link-out from the DSSTox Structure Browser is based on InChIKey (conforming to ChemSpider conventions for InChIKey generation).

External Resources: PubChem, ChemSpider, EPA ACToR, Lazar in silico tox
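
To illustrate what an InChIKey-based link-out amounts to, here is a minimal Python sketch. The URL templates and the names `LINK_TEMPLATES` and `build_linkouts` are assumptions made up for this example (the `example.org` addresses are placeholders); the real resources listed above each define their own link formats, which are not reproduced here.

```python
from urllib.parse import quote

# Hypothetical URL templates; real resources define their own link formats.
LINK_TEMPLATES = {
    "resource_a": "https://resource-a.example.org/search?inchikey={key}",
    "resource_b": "https://resource-b.example.org/compound/{key}",
}


def build_linkouts(inchikey: str, templates: dict = LINK_TEMPLATES) -> dict:
    """Return one link-out URL per external resource for the given InChIKey."""
    # Normalize the key and percent-encode it so it is safe inside a URL.
    key = quote(inchikey.strip().upper(), safe="")
    return {name: template.format(key=key) for name, template in templates.items()}


# Example key: the standard InChIKey for aspirin.
links = build_linkouts("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
```

Because the InChIKey is a fixed-length hash of the structure, the same key can be dropped into any resource's search URL without either site needing to know the other's internal record IDs, which is what makes this kind of structure-based linking attractive.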

A link to the presentation I gave at ACS-Philly yesterday in Rajarshi Guha’s session is provided below. A lot changes between writing an abstract and writing a talk so I had the chance to show an increasing number of papers ALREADY using ChemSpider as one of their platforms of choice to source information from.

Can a Free Access Structure-Centric Community for Chemists Benefit Drug Discovery?

ChemSpider is an online database of over 20 million chemical structures assembled from well over a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) In-silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples. QSAR analyses utilizing the ChemModLab environment for assessing quantitative structure-activity relationships will be presented, along with screening using a molecular surface descriptor model.

Link to presentation

I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) PubMed is structure searchable from ChemSpider…we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction from documents SUBMITTED to our site: Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration to ChemSpider

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, and deposit their structures to ChemSpider under embargo with the article title, author list and “fractional abstract”. When a publication goes live the author can log in and associate a DOI or a URL for the publication (for non-DOI based Open Access publishers), and the structures and article details get lifted from embargo and are immediately available for searching to the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities

6) If we receive community support for this it could be game-changing.

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation

It’s over a week since I was at Scifoo and I am finally coming back up for air. As it is, I spend every evening and weekend catching up with email, backlogs of tasks, blogging and so on, so take out a whole weekend, add two cross-USA flights, and catching up becomes suffering. Add to that all of the preparations and deadlines coming for ACS Philadelphia next week and…well, it’s not been pleasant.

So..Scifoo…was it all it was cracked up to be? HELL YES. Loved it. Why?

1) I got to spend hours chatting with Jean-Claude Bradley and Cameron Neylon and aggregate our thoughts about what next in regards to improving support for Open Notebook Science on ChemSpider. It’s clear and it is now just a matter of time and resources but we know what to do.

2) I got to put faces to names of some of the people I have connected with in the blogosphere.

3) I got to chat with Chris Anderson from Wired about provocative statements about data deluges.

4) I finally got it about Second Life when JC Bradley gave a presentation about how he uses the platform. Since then I’ve had a guided tour with JC, adopted the avatar ChemSpider Magic out in the virtual world, have flown around there and, post-ACS, hope to have a bigger presence. (Once I figure out why SL seems to think my processor is running at 1/3 of its clock speed.)

5) I met with Paul Stamets and chatted about fungi, about isolated small molecules with “magic properties” and a possibility to participate in some small molecule structure elucidations.

6) I got to lead a session about how to make the Internet Structure AND substructure Searchable and to talk not only about the technologies to do so but what it could mean for communication between chemists.

7) I spent a lot of time talking about “just in time” science and “open notebook chemistry” in particular. We talked about wiki-environments and how instant they can be. I JUST watched Michael Phelps take his eighth gold at the Olympics and checked Wikipedia almost immediately after..and it was already updated. What would this mean for science when data are flowing from instruments to webpages for all to see? I recently chatted about this with David Leahy from InkSpot.

8) I was happy, once again, to share space with Barend Mons of WikiProteins. Barend and I met recently for an evening to discuss how ChemSpider and WikiProteins could be meshed together and I thoroughly enjoyed our time together. His energy is infectious, and his passion for what the wiki environment could mean for science and our similar view of “stop talking about it, let’s start doing it” give us a shared platform for evangelizing eScience.

9) Google research data - Google are open to hosting terabytes, or 1000s of terabytes of data. I met with different members of the team, all great guys and truly passionate about their project. ChemSpider has a disk drive and will be sending our structure collection for hosting on Google’s research data repository.

10) The organization of Scifoo was superb. As free as the agenda was,…since it was created on the fly by the attendees…it went off without a hitch.  Kudos, thanks and high fives to the organizing committee and support team. Awesome.

11) I enjoyed Google’s It-its. Actually..I enjoyed them twice a day. But I was burning the calories anyway so no guilt here…

Would I go again? I wish it was still going on! I missed too many great talks and then sat around and listened to the reviews of the meetings. Scifoo…Scifood for the brain.

In the near future it is likely that ChemSpider will be taking on some projects requiring that we add some additional programming support to our team. If you have an interest in potentially assisting the ChemSpider team in some project work and would like to meet up and discuss at the ACS meeting in Philadelphia please let me know. Some of this work will be funded.

We have added the compound collection from Trans World Chemicals to ChemSpider. This is a collection of almost 1600 compounds. The collection can be viewed here.

Despite the ability of browsers to open up multiple tabs there really shouldn’t be any need to open a new browser window if we take advantage of balloon pop-ups. For example, if you hover over the name of the Data Source then information will be displayed. If you hover over the External ID in the data source table then you see a screenshot of the external record within the balloon. It can take a couple of seconds to load so be patient. The screenshot below shows the KEGG record highlighted from the Data Source external ID. You can of course click on the link to navigate to the external record if what you see pop up in the data balloon is of interest.

We will be using similar capabilities for our markup of chemistry articles. An example is shown below. Each chemical name is hyperlinked to a chemical structure in ChemSpider and displays the structure in a pop-up balloon.

ChemSpider has been working hard to support Wikipedia for a number of months now. We have been curating the structures on Wikipedia, I have been an active member of the WP:Chem team, we have extended our integration of Wikipedia to show the lead of the Wikipedia article on associated record views, and we have a lot of background activities going on regarding Wikipedia at present (info will be released shortly). New articles are released on Wikipedia on an ongoing basis and we stay up to date as best we can by monitoring bots for updates. Harvesting monographs out of Wikipedia based only on ChemBoxes and Drugboxes is not sufficient, since not every article about drugs and chemicals on Wikipedia has an associated Drugbox or ChemBox.

For example… You have likely heard of Rember for Alzheimer’s already? A search on Google for Rember Alzheimers will give about 2 million hits. It’s already being discussed in the blogosphere, including on Derek Lowe’s In the Pipeline. Rember turns out to be methylene blue. There is already an article on Wikipedia about Rember but there is no chembox as yet. As I was researching Rember out of interest I noticed we did not have methylene blue linked to Wikipedia and Rember wasn’t associated with methylene blue. Adding the name was of course easy: 5 seconds’ work after login.

We have now added the ability to associate data sources directly too. What does this mean? On a record view page is a list of “Data Sources” associated with a compound. This is where depositions about a compound came from and, generally, it links back to the associated web pages. Previously, in order to populate the Data Source table it was necessary to deposit the structure and associated info as an SDF file. TOO MUCH work. So, now we have made it easy. To add a data source simply log in and select “Edit” (top right-hand side of the data source table), then click Add and input the information into the pop-up box. The input is the name to be listed in the Data Source table, the URL to the information on the Data Source page (if info exists) and the name of the Data Source. There is one caveat to adding such links: the data source must exist. If you want to add data associated with your own website you need to register yourself, add a Data Source and wait for us to approve it.

Wikipedia is a special case since when the link is made we grab the lead of the article directly and show it in the Record View. For methylene blue there are two related Wikipedia articles so we have linked to them both, as you can see on the record view. Simply go to ChemSpider and search for rember and you’ll see two linked Wikipedia articles.

I have been talking with Brian Gilman at SciLink over the past few weeks. He recently interviewed me about my background and published the interview online here: “SciLink Spotlight - How Do You Say Founder Of ChemSpider In Welsh?” It might answer some of the questions you’ve been asking about the background of ChemSpider and maybe future directions.
