Archive for the How ChemSpider Runs Category

Following on from my previous post regarding new functionality on ChemSpider, the last one regarding improved integration to SureChem patents, I am happy to announce that we have improved the PubMed integration. Previously we would stream far too many articles from PubMed for some compounds. For example, retrieving articles about cholesterol would result in an overly long page of PubMed articles being displayed. We have now limited the number of articles retrieved and simply put a link at the top of the page for you to retrieve the rest of the links.


In this case when you click the link we initiate a full search using the Entrez Life Sciences API. The results for cholesterol are shown here. A small but user-friendly improvement…More to follow.
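For readers curious what a capped query against Entrez can look like, here is a minimal sketch using the public NCBI E-utilities ESearch endpoint and its `retmax`/`retstart` paging parameters. This is an illustration only: the function name and the page size of 10 are assumptions, and it is not necessarily how ChemSpider itself calls the service.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Public NCBI E-utilities ESearch endpoint (assumed here for illustration;
# not necessarily the call ChemSpider makes internally).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_search_url(term, retmax=10, retstart=0):
    """Build an ESearch URL returning at most `retmax` PubMed IDs for `term`.

    `retstart` is the offset of the first hit, so a "show the rest" link can
    simply ask for the next page.
    """
    if retmax < 0 or retstart < 0:
        raise ValueError("retmax and retstart must be non-negative")
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retstart": retstart}
    return ESEARCH_URL + "?" + urlencode(params)


# Short list shown on the record page (hypothetical page size of 10).
preview_url = build_pubmed_search_url("cholesterol", retmax=10)
# Link target for retrieving the next batch of results.
next_page_url = build_pubmed_search_url("cholesterol", retmax=10, retstart=10)
```

Only the URLs are built here; fetching and parsing the response is left out so the sketch runs without network access.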

While some say “Silence is Golden” some of us find it deafening! One of my common statements regarding Press Releases and political commentaries is that there is as much said in the “unsaid”. Why this lead-in to this blog post? Well… the truth is we haven’t been very productive in the past few weeks with the delivery of new functionality onto ChemSpider and people have been asking me why we haven’t been so prolific with our updates. Well… in this case Silence is Golden based on the new functionality and data rolling out soon!

Historically we were introducing new functionality every few days and rolling it out with a “continuous beta” approach to delivery. We were also working on only three computers and were challenged with issues of uptime and handling. At the RSC we have access to development, test and live environments, and we have a stable compute environment supporting the system that provides power support where previously we would have been at risk of outages. We have a support team who have “got our backs” and we are no longer dealing with all of the issues of keeping the environment healthy for the ChemSpider platform. With our new hosted environment and the drive to move away from our previous constant and ongoing updates to a more controlled process for rollout, specifically including internal testing prior to going Live, we have been working on procedures to ensure the best delivery. In parallel we have been working on a series of internal projects that are very exciting and you should see the results soon!

With our new processes in place, and our new systems now established, we have been working on new functionality development and are happy to announce that we will now be moving towards regular updates, every few weeks. We’re starting this week with the roll out of a set of new capabilities for you to try out. I’ll highlight these in a series of blog posts over the coming days. Let’s start with this one…

We are happy to announce an improved integration to the patent web service provided to us via our collaboration with SureChem. We announced our initial integration to this service at the ACS meeting last fall in Washington and received a lot of positive feedback regarding the implementation. That rollout only provided integration to a subset of the entire collection, the USPTO. SureChem host data from a number of patent agencies and the collection includes USPTO Granted, USPTO Applications, European Granted, European Applications, WO/PCT and Japanese Abstracts. Thanks to their web service we now have the ability to retrieve information regarding those sources also. The image below shows the patents retrieved for Xanax. Check it out…give us your feedback and extend holiday cheer to SureChem also for their contribution to the community.


For those of you who have been using ChemSpider for the past few months you will be aware that historically we had an integration in place to SureChem’s Patent Portal. A few months ago that integration was unfortunately broken as SureChem improved their service. Also, we were un-synchronized with their growing set of chemical structures as they updated their patents. The previous integration was very limited in nature anyway as it simply showed the presence of patents associated with the ChemSpider structure in the SureChem database. Certainly a more ideal solution is the one that we introduced just in time for the ACS meeting in Washington.

The new solution not only lists the number of patents containing the chemical compound shown in the ChemSpider record but also shows the first 10 patents, by title, and provides direct link-throughs to the patents on SureChem. This is a much improved integration and we hope you enjoy it. The next stage is to deposit the latest SureChem structure collection, which has grown significantly since our last deposition. Thanks to our collaborators at SureChem for offering you, our users, access to their service.


An article in the latest C&E News discusses the acquisition of ChemSpider by the Royal Society of Chemistry. I certainly appreciate the comments of Robert Massie, President of CAS who stated:  “CAS has worked with Williams in the past,” CAS President Robert J. Massie notes. “We join everyone who is interested in the advance of chemical information in recognizing his considerable contributions. We are delighted to see that his creativity and enthusiasm will continue to benefit the chemical enterprise.”

I worked a lot with CAS while I was at ACD/Labs (I was there for over 10.5 years and left as their Chief Science Officer). I was intimately involved in the development and deployment of a number of software tools and visited Columbus many times. I have many fond memories of working with the CAS team and there are some great people working at the organization. I hope that in my new role at the Royal Society of Chemistry I will have the opportunity to work with CAS again in a collaborative and cross-publisher manner to the benefit of the Chemistry Community.

In August 2007 I blogged about our Alexa statistics in terms of visits. There were comments about the appropriateness of Alexa as a measure of traffic. I was encouraged by one of the commenters to look at Compete:

  1. Peter Schneider says:
    Alexa can be manipulated to a certain extend by webmasters simply using the Alexa toolbar.

    Interestingly, chemspider does not appear in compete:

    So the Alexa Rank does not give a clear signal.

Since then I’ve been watching Compete as well as our own internal statistics. Both suggest the traffic continues to grow quite consistently.


InChIs are a powerful way to communicate chemical structures. They are going to enable internet chemistry, and when we roll out the InChI Resolver shortly the community will have access to a resource to resolve InChIKeys and ultimately navigate chemistry on the web. We commonly receive chemical structures in the form of InChIs and in order to deposit the structures we have to convert the InChIs back to chemical structures, commonly into SDF format for batch deposition. For simple organics this is not a difficult process… the tools we have at our disposal can deal with the layout of simple organics. However, for some of the chemical structures we receive, optimizing 2D layout is very challenging. Many of the issues come with fullerenes (see examples below), but they are not the only problem. Carbohydrates, complex cycles etc. are big challenges.


In building the InChI Resolver we hope to provide attractive visual depictions of the associated structures. Without AuxInfo data carrying the coordinates, or without the deposition of SDF files containing the layout coordinates, we have a major challenge ahead of us. AuxInfo data are shown below for erythromycin. These data are rarely generated when people generate InChIKeys and the issue of structure layout will dominate the interpretation of complex structures.


Since beauty is in the eye of the beholder my judgement is that automatic layout algorithms should only assist in the appropriate layout and eyeballs will need to make the final decision. That is why it is better to deposit SDF files or InChIs with AuxInfo carrying the coordinates than it is to deposit InChIs only and leave the structure layout to an algorithm. It will fail.

I am interested in seeing what people can do with their structure cleaning algorithms on InChIs like this:


The images below show the iterative application of DIFFERENT structure layout algorithms. One caution…your layout algorithm should produce the SAME InChI at the end and NOT flip stereocenters. Interesting challenge. Who says cheminformatics isn’t challenging? And who thought building an InChI Resolver would be easy?
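The round-trip caution above can be checked mechanically: generate the InChI before and after layout and compare. Plain string equality is enough for a pass/fail answer; the sketch below goes one step further and reports which layers differ, so a flipped stereocenter shows up as a change in the `/b`, `/t`, `/m` or `/s` layers. The helper names are my own, the parsing is deliberately simple, and the two alanine InChIs are there purely as an illustration.

```python
STEREO_LAYERS = ("b", "t", "m", "s")  # double-bond, tetrahedral, mirror, stereo type


def split_inchi_layers(inchi):
    """Split an InChI into (version, formula, layers).

    `layers` maps each one-letter layer prefix to the list of values seen for
    it (a list, because layers can repeat after a /f or /r layer).
    """
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string: %r" % inchi)
    parts = inchi[len("InChI="):].split("/")
    version, formula, layers = parts[0], "", {}
    for segment in parts[1:]:
        if not segment:
            continue
        if segment[0].islower():
            layers.setdefault(segment[0], []).append(segment[1:])
        else:
            formula = segment  # the formula layer carries no prefix letter
    return version, formula, layers


def changed_layers(before, after):
    """Return the names of the layers that differ between two InChIs."""
    v1, f1, l1 = split_inchi_layers(before)
    v2, f2, l2 = split_inchi_layers(after)
    changed = []
    if v1 != v2:
        changed.append("version")
    if f1 != f2:
        changed.append("formula")
    for key in sorted(set(l1) | set(l2)):
        if l1.get(key) != l2.get(key):
            changed.append(key)
    return changed


def stereo_was_altered(before, after):
    """True if any stereo layer differs, e.g. a layout step flipped a center."""
    return any(key in STEREO_LAYERS for key in changed_layers(before, after))


# Illustration: the two enantiomers of alanine differ only in the /m layer.
l_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_alanine = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
```

In practice `before` would be the deposited InChI and `after` the InChI regenerated from the cleaned structure by whatever toolkit did the layout.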


Following on from my recent post about “Why are structures like YouTube Videos?” I am now asking the same question about spectra.

The answer is simple. When people have deposited data as OPEN DATA on ChemSpider we are now providing the ability to embed the spectral data and display at other sites. This is different in that we are not just showing images but real live spectra in the JSpecView Java Applet so Java must be installed. Thanks to Cameron Neylon for asking the question about whether we could provide the service. Glad to help…

If all is well you should see an IR spectrum associated with the ChemSpider record here. In order to EMBED spectra simply Login to ChemSpider, find an Open Data spectrum of interest (you could browse) and then click on EMBED (left-hand corner below the spectral image). Do a left click to see additional features of JSpecView. We DO have some minor work to do with spectral plot reversal and improving the zoom display but we’re getting there. Enjoy.

We are focused on providing tools to our users to ensure that they can add information of interest to structure-based records in ChemSpider. We have introduced DOI-based associations recently allowing users to connect publications of interest to chemical compounds on our database. The process is simple. Find the structure record of interest, use the Add DOI function and Publish. The process is outlined graphically below.

First, Login then navigate to the article of interest. In this case we are interested in associating a publication with the structure of Chaetoglobin A.

Find the paper of interest and the associated DOI. In this case we will associate the following RSC article. Click on Add DOI and enter the DOI. Click on Lookup, confirm that the data is correct and click on OK or cancel as appropriate.

The associated DOI will be held in embargo until a curator confirms it, generally within a few hours. If we see no issues with the process we will remove the curation process. When approved you can see the information associated with the record as shown below. The DOI is linked directly to the article and will deliver traffic to the publishers, serving both the users of ChemSpider and the publishing community. Simple.
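If you are pasting DOIs in from a reference manager or a web page, it can help to tidy them up first. The sketch below strips the common `doi:` and resolver-URL prefixes, applies a loose syntax check (a `10.` prefix, a numeric registrant code, a slash and a suffix) and builds the `https://doi.org/` link. It is only an approximation of valid DOI syntax, the function names are my own, and the DOI used in the usage note is a made-up placeholder, not a real article.

```python
import re

# Loose approximation of DOI syntax: "10.", a 4-9 digit registrant code,
# a slash, then a non-empty suffix without whitespace.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly paste along with the DOI itself.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(raw):
    """Return the bare DOI from user input, or raise ValueError if it looks wrong."""
    doi = raw.strip()
    lowered = doi.lower()
    for prefix in PREFIXES:
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError("does not look like a DOI: %r" % raw)
    return doi


def doi_link(raw):
    """Build the resolver link for a DOI, i.e. the link through to the article."""
    return "https://doi.org/" + normalize_doi(raw)
```

For example, `doi_link("doi:10.1234/example.5678")` gives `https://doi.org/10.1234/example.5678` (a placeholder DOI), while `normalize_doi("not-a-doi")` raises `ValueError`.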

I announced in July of this year that we were performing predictions using the EPISuite of prediction tools. I’m glad to say that one of our servers is now in “cooling mode” after running red hot for over 4 months. We’ve been feeding it all single-component ChemSpider entities with Molecular Weight <500 (non-radicals). The results are now posted on ChemSpider under the EPISuite tab. We hope you find them of value and offer our thanks to the EPA for providing us access to the software.
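The selection rule described above (single component, molecular weight under 500, not a radical) is easy to express as a filter. The sketch below is only an illustration of that rule: the `Compound` record, its field names and the example values are assumptions of mine, not the ChemSpider schema or the actual batch job.

```python
from dataclasses import dataclass

MAX_MOLECULAR_WEIGHT = 500.0  # cut-off quoted in the post


@dataclass(frozen=True)
class Compound:
    """Hypothetical record; field names are illustrative only."""
    name: str
    molecular_weight: float
    component_count: int
    is_radical: bool


def eligible_for_prediction(compound):
    """Apply the three criteria: single component, MW < 500, non-radical."""
    return (
        compound.component_count == 1
        and compound.molecular_weight < MAX_MOLECULAR_WEIGHT
        and not compound.is_radical
    )


# Illustrative entries with approximate molecular weights.
candidates = [
    Compound("cholesterol", 386.65, 1, False),
    Compound("erythromycin", 733.93, 1, False),      # too heavy
    Compound("sodium chloride", 58.44, 2, False),    # two components
    Compound("hydroxyl radical", 17.01, 1, True),    # radical
]

# Names of the compounds that would be queued for prediction.
queue = [c.name for c in candidates if eligible_for_prediction(c)]
```

Of the four illustrative entries only cholesterol passes all three tests.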

As we work on ChemMantis it is clear that we want to expand the integration out to external sources of information as much as possible rather than limit the connectivities to the ChemSpider platform. We have started to build the necessary dictionaries to support bacteria, fungi, viruses etc. so it makes sense to connect these up to external resources. As a proof of concept we are using Wikipedia sources to directly feed the “Species Balloons” and have enabled searching of Wikipedia, Google and Entrez directly from the balloon. As an example of the integration we see below the species balloon filled with the lead of the article from Wikipedia for Zymomonas mobilis (click on the thumbnail).


From the balloon it is possible to search across Entrez, Google and directly into Wikipedia for more information. For this particular bacterium Entrez gives a list of results as shown below (click on the thumbnail). We are using a similar approach with elements now. Rather than show a “bare element” in a structure balloon (who needs to see Li for Lithium?) we will display the lead text from Wikipedia for that element. The near future will likely see us link to UniProt and PDB for proteins and out to similar rich sources for other species.

ChemMantis is now in alpha release and under test. ChemMantis is our Chemistry Markup And Nomenclature Transformation Integrated System. The movie below can likely tell a better story than I can write. So, let’s start with this movie… and more will follow. The premise is simple: upload a document, find chemical names, convert names/identifiers to chemical structures and find related information. In this case we are demonstrating how structures are linked to information on ChemSpider and from there out to other information on the web. There are more such displays to come…

I’ve posted over at the ChemConnector Blog about the potential need for a neutral review of the performance of Optical Structure Recognition algorithms. I’m interested in the technology because we are now using it on ChemSpider for our document markup and structure recognition. I’d welcome your thoughts and comments… visit the blog post.

When ChemSpider was rolled out to the world as a part of ChemZoo we always knew we would be introducing more “critters”. We are happy to announce our progress with our new development, ChemMantis. Why Mantis? Well… it’s the Markup And Nomenclature Transformation Integrated System. Fits perfectly into our zoo!

We have been working on the markup of chemistry documents for a number of months and I unveiled the first aspects of our work at the ACS meeting in Philadelphia. The presentation is available online on my Slideshare account. What we are trying to do is to use our ChemSpider platform as the foundation of a document markup system whereby chemical names are automatically identified and can either be converted to chemical structures (possibly using algorithms for name to structure conversion) or be retrieved from our ChemSpider database. We have invested a lot of effort over the past year to curate and validate the ChemSpider database of over 21.5 million unique chemical entities and are now sitting on a foundation of information allowing us to connect between chemical identifiers and chemical structures, out to rich sources such as Wikipedia and PubChem, and to provide information such as chemical vendors and other online systems. ChemMantis is well and truly woven into the web of ChemSpider now.

We are now in alpha release and are adding some finishing tweaks to the markup system, the visualization elements and the workflow. You can see the immediate effects of our recent work on improving the quality of structure images in the balloon below.

We would like to test the system on YOUR documents if you are willing to participate. What we are looking for are WORD documents for already published papers. They can be Open or Closed access papers. We are not expecting copyright transfer – we want to mark up the documents and return them to you for feedback. In the process we will be testing the quality of our Dictionary, our conversions, our visualizations and our process. We welcome your support. Feel free to connect with us at infoATchemspiderDOTcom. Over the next few weeks you will hear more about ChemMantis and our contributions to text mining and markup of chemistry documents.

We’ve been working on structure depictions on ChemSpider and overall we are very happy with where we have got to. These structure depictions are going to be showing up in various parts of our system now.

However, we should qualify the difference between structure images and structure layout. The depictions and the layout are governed by different algorithms. While a structure image can be attractive the layout may not be perfect. It is possible to improve the layout of the molecule deposited on ChemSpider. Notice for the structure on the left that there is overlap with the methyl group.

For details on how to CLEAN structures on ChemSpider please read the Technical Note here: Interactive Cleaning of Molecules During Curation and Deposition.

The result of performing cleaning is shown below. This layout may also not be the perfect layout but there is no overlap. The user can continue to manually optimize the structure for the preferred layout.

It is finally time to roll out more attractive structure depictions. We have needed them for a while, but they have become an absolute must-have as we roll out the following new capabilities:

1) The ability to make YOUR chemical blog structure searchable (watch this space…). We suggested one path previously…this is BETTER…

2) Structure balloons for using with our document markup tools, both browser-based and Microsoft Word based

We all judge quality of visual aesthetics quickly. We know a good structure when we see one. This is an announcement that we will be rolling out new structures across the site in the next few days. You will see better looking structures showing up across the site – during deposition, during service-based predictions, during searches and, well, everywhere. While not perfect as yet a little more tweaking and the entire database will be supported by the new structure depiction algorithms. As it is you should see some examples now on the database…one shown below. We welcome your feedback!

I recently started a discussion with the users of ChemSpider about how they use our system. There have already been two responses and I am hoping for more. Having sat in on an IUPAC InChI meeting in Washington last week I can honestly say that it was one of the most functional and on-task meetings I have sat in on in a long time. Decisions were made about how to move forward with the next release of the InChIKey and “standard versions” of both the InChIString and InChIKey.

The meeting has prompted the question how do you use InChI? For what purpose do you use InChI and do you use only the string? Do you use it for communication purposes and structure exchange? Do you use it in your internal databases? Is it a primary path to deduplication? What settings do you use for the InChIString?

I’m interested in how you are using InChI and how important it has become for you. Comments welcomed.

Those of you who have read the ChemSpider blog for a while will know the name of Paul Docherty, who writes on his TotallySynthetic blog. I have a great appreciation for Paul’s writings and, with permission, have started associating his blog posts directly with the structures, literally inserting the entire blog post, but NOT the comments, into a description for the molecule. For example, see the structure here. We’ve started to go back through Paul’s postings and make his entire collection structure searchable… There are a lot of old posts to deposit and time will get us there. Everything is posted under CC licenses and with permission.

Paul also writes for the RSC in Chemistry World as, for example, here. With permission from the RSC we are inserting snippets of the article into ChemSpider and linking out to the Chemistry World article, for example: here and here. If you are a registered user you can link your own articles on your websites using the Add URL functionality or you can deposit your own postings onto our site in the same way…feel free to ask us how.

I posted a request recently for people to share with us how they use ChemSpider. Two comments were submitted today and are given below. Thanks to Chris and Sean.

Sean Ekins commented “Over the past year I have used ChemSpider as a valuable resource for generating molecular properties that are then used to analyze specific enzymes substrate requirements (PMID: 18537573) and also for following up on hits derived from computational pharmacophore database searching (PMID: 18579710). The later searching helped find additional compounds that were purchased from vendors and tested in vitro, ultimately some were found to be active. These uses are in addition to finding structures SMILES for generating QSAR models and general molecule searching. I hope to build on this in the future!”

Chris Singleton commented “As a chromatographer, the predicted properties of the molecule are an invaluable tool.  With values such as the logP and logD, it is much easier to predict and estimate retention times relative to other similar compounds without having to spend the time to do an entire chromatographic run.  Of course, I still do the actual chromatography, but these tools let me ‘home in’ on an appropriate method much quicker.  Some of these predictions are available in chemical structure drawing programs, but Chemspider allows me to do one-stop shopping for most of the properties I’m interested in.  And if I don’t have a structure handy, it’s easy to look it up so I can see what type of MS ionization mode I want to use, based on the moieties and polarity of the molecule.

Secondly, I’m more involved in DMPK from the analytical side as opposed to the pharmacokinetic (PK) side, so I’m trying to learn more about the PK.  The fact that Chemspider has such a wide range of properties and links to outside sources really lets me correlate the PK properties of a drug to its structure and lets me be a better analyst.  I  also know that Chemspider is busy adding links to the infoboxes on Wikipedia (an effort that I am involved in), so I read about the properties of a drug and am then able to click through to look at the chemical properties and how the structure relates to function.  This is more of a professional development exercise for me, but having the wealth of information in one place make it a easier to get a big picture view of a drug, rather than just looking at the pharmacology properties alone.  Definitely a structure centric view of drugs.”

There has been an outpouring of offers from the ChemSpider community in terms of helping to examine/clean and enhance information regarding carbohydrates on ChemSpider. Almost 2 dozen users have now made an offer to help. Very exciting really!

I’ve already outlined the necessity to improve the quality of associations between structures and identifiers on the database. However, I am also hoping that users will write articles about carbohydrates using the rich-text formatting capabilities (ADD Description), will add spectra if they have them, will link up articles if they have interesting papers and will add URLs to interesting online content also.

We have now delivered the ability to curate and enhance records on ChemSpider and look forward to having our users help, starting with Carbohydrates…

MeSH is likely well known by anyone working in the Life Sciences and with PubMed. As defined on Wikipedia:

“Medical Subject Headings (MeSH) is a huge controlled vocabulary (or metadata system) for the purpose of indexing journal articles and books in the life sciences. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM’s catalog of book holdings.

…The 2005 version of MeSH contains a total of 22,568 subject headings, also known as descriptors. Most of these are accompanied by a short definition, links to related descriptors, and a list of synonyms or very similar terms (known as entry terms). Because of these synonym lists, MeSH can also be viewed as a thesaurus.”

We are presently moving further into integration with PubMed and as part of this move we have decided to integrate MeSH information at the structure level onto relevant record views. Now, when you visit a particular record where MeSH information is available, the data will be visible under the MeSH tab, open by default.

MeSH has been curated by a highly skilled team over a number of years. More information about MeSH can be found online. The contents of the MeSH table should be self-explanatory. Over the next few weeks watch as we do more with the integration of MeSH and PubMed.

We are testing out a new 3D molecule optimizer on ChemSpider at present. It appears to be more rugged than our previous algorithm and, in our hands at least, has not yet failed to produce a 3D representation. On general small organics it appears to handle the optimization very well and is quite fast. We welcome your feedback if you have time to test. In order to use the optimizer simply go to a record view, click on Zoom on the Structure tab (see below) and then click on Show 3D as shown in the second image. Since all coordinates are calculated in real time please note that it can take a few seconds from opening the Jmol applet to display of the optimized structure (we need to add a “calculating…” display element when we have time).

Things look good for us but we have had a couple of reports of failures and are trying to trace whether it is a browser security setting or not. Please let us know if you have any issues. Thanks.

I posted a couple of days ago about the talk I gave at Drexel University. Part of it was already uploaded to YouTube. The rest of the talk has been posted to Jean-Claude Bradley’s server at Drexel University and is over 1 hour and 20 mins long… it’s a general overview of ChemSpider and a user’s tour regarding “how to”. It’s not an Oscar winner but you might find some new info in there. Click HERE to see the movie.

I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) PubMed is structure searchable from ChemSpider… we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction from documents SUBMITTED to our site – Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration to ChemSpider

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, deposit their structures to ChemSpider under embargo with the article title, author list and “fractional abstract”. When a publication goes live the author can login, associate a DOI or a URL for the publication (for non-DOI based Open Access publishers) and the structures and article details get lifted from embargo and are immediately available for searching to the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities

6) If we received community support for this it could be game-changing.

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation

Despite the ability of browsers to open up multiple tabs there really shouldn’t be any need to open a new browser window if we take advantage of  balloon pop-ups. For example, if you hover over the name of the Data Source then information will be displayed. If you hover over the External ID in the data source table then you see a screenshot of the external record within the balloon. It can take a couple of seconds to load so be patient. The screenshot below shows the KEGG record highlighted from the Data Source external ID. You can of course click on the link to navigate to the external record if what you see pop-up in the data balloon is of interest.

We will be using similar capabilities for our markup of chemistry articles. An example is shown below. Each chemical name is hyperlinked to a chemical structure in ChemSpider and displays the structure in a pop-up balloon.

I have reported previously on rich text editing capabilities that we have been testing. The text-editing capabilities have been rolled out this evening to beta-testers (if you wish to be a beta-tester you must register on ChemSpider here and request beta-tester status).

The text editing capabilities show up when you are logged into ChemSpider. They are reached via the “Description” link, which is listed with all of the other capabilities that allow information to be associated with a record. At the top of a record view you will see

Click on Description and it will open up a simple Rich Text Editor, similar to the one you would see on Wikipedia, allowing you to copy-paste rich text or edit standard text and insert hyperlinks etc.

We would like people to test drive the ability to add descriptions to record views. Please go ahead and start working with it and provide us feedback.