Archive for July, 2009

I am currently writing an editorial piece that needs to explain what types of data we can host for users who choose ChemSpider as a platform for their data and interesting chemistry pieces. For example:

Hosting Reaction Details: The Synthesis of cis-Bicyclo[3.3.0]octane-3,7-dione

Chemistry Movie: Photochromism in action

Spectral data in abundance: Spectra of aspirin (click on the green image to view)

Open Notebook Science report: An analysis of the spectrum of Cholesterol

List of publications: A long list of publications associated with cholesterol

The Linked Wikipedia Article: Xanax

We are just about to head off to the IUPAC Congress in Glasgow and unveil a spiffing new booth. In preparation for the unveiling of our new logo we’ve done some editing to the website and changed the look and feel of some of the pages. These changes are mostly cosmetic at present and there is little change to the core functionality of the site, but we hope that some of them make the site a little easier to navigate.

This is the first work we are doing to improve the website and to roll out a redesign of the logo (look out for that logo at the ACS meeting in Washington in a couple of weeks…you’ll see it in a few places and we will have our own booth there too). Over the next few weeks we will be working further to improve the usability and flow of the website and to enhance the core functionality of the platform. Watch this space.

We welcome your feedback on the new logo. If you don’t see it on the ChemSpider website, please refresh the stylesheet using Ctrl-F5.

DISTRIBUTED BY NATURE PUBLISHING GROUP ON BEHALF OF THE INCHI TRUST
21 July 2009

Contact: Grace Baynes
Corporate Public Relations, Nature Publishing Group
T:+44 (0)20 7014 4063
g.baynes@nature.com

The InChI Trust, a not-for-profit organisation to expand and develop the InChI Open Source chemical structure representation algorithm, is formally launched this week. Originally developed by the International Union of Pure and Applied Chemistry (IUPAC), the IUPAC International Chemical Identifier (InChI) is an alpha-numeric character string generated by an algorithm. The InChI was developed as a new, non-proprietary, international standard to represent chemical structures. The Trust aims to develop and improve on the current InChI standard, further enabling the interlinking of chemistry and chemical structures on the web. The connection with IUPAC is maintained through IUPAC’s InChI Subcommittee.

The InChI algorithm turns chemical structures into machine-readable strings of information. InChIs are unique to the compound they describe and can encode absolute stereochemistry. Machine-readable, the InChI allows chemistry and chemical structures to be navigable and discoverable. A simple analogy is that InChI is the bar-code for chemistry and chemical structures. The InChI format and algorithm are non-proprietary and the software is open source, with ongoing development done by the community.

“The goal of the InChI Trust”, says Project Director Stephen Heller “is to continue to develop the InChI and InChIKey, the condensed machine-searchable version, as a tool to enable widescale linking of chemical information.”

The InChI Trust was formally incorporated in the UK in May 2009, and now has 6 charter members: The Royal Society of Chemistry, Nature Publishing Group, FIZ-Chemie Berlin, Symyx Technologies, Taylor & Francis and OpenEye. Further organizations and publishers are in the process of joining the InChI Trust.

“Nature Publishing Group is delighted to be a charter member of the InChI Trust”, says Jason Wilde, Publisher for the Physical Sciences, Nature Publishing Group. “We view the ongoing maintenance of the InChI algorithm, and the resulting adoption of InChI, as important for the development of chemistry communication. The interlinking that the InChI offers between journal content and databases ensures that chemistry is the first truly web-enabled scientific discipline.”

“The InChI has already gained a wide user base,” says Richard Kidd, Informatics Manager at the Royal Society of Chemistry, “and the Trust will ensure continuing development and support for this key standard, helping to link together chemical resources across the internet. The RSC is proud to support the InChI Trust.”

Since the introduction of the InChI in 2005, there has been widespread take-up of InChI standards by public databases and journals. Today, there are more than 100 million InChIs in scientific literature and products.

To date, numerous databases, journals, and chemical structure drawing programs have incorporated the InChI algorithm. These include the NIST WebBook and mass spectral databases, the NIH/NCBI PubChem database, the NIH/NCI database, the EBI chemistry database, ChemSpider, Symyx Draw and many others.

The initiative serves chemists, publishers, chemical software companies, chemical structure drawing vendors, librarians, and intermediaries by creating an international standard to represent defined chemical structures. This provides a consistent, credible and compatible way for databases of chemical structures to be linked together for the benefit of users of chemical information around the world.

-ENDS-

For further information, please contact:

Project Director, Dr. Stephen Heller at steve@inchi-trust.org

Background notes:

The InChI project was initially undertaken by IUPAC with the cooperation of the National Institute of Standards and Technology (NIST). In 2009, a standard version of InChI and the InChIKey were released. Members of the InChI Trust will pay annual dues to support the continued development of InChI, and maintenance of the InChI algorithm. This income will be used exclusively for InChI algorithm development, maintenance, outreach, and educational activities associated with the project.

Details of the up-take by many chemical database providers, software developers, and journal publishers are available at www.iupac.org/inchi/adopters.html


There have been other comments about Wolfram Alpha and its support for Chemistry (1,2 and others) but I have remained rather quiet until now about my experiences with Alpha for a couple of reasons. First of all I’d rather let the service settle down a bit before poking at it too hard. My experience of going live with ChemSpider was definitely that it takes a while to stabilize the system and address some of the earliest feedback. Also, knowing that I would be at Scifoo and aware that Theodore Gray would be there, I had hoped to see Alpha in action. I wasn’t disappointed. Yesterday Theodore drove the system in front of an audience including a number of interested scientists, members of Google, and Peter Murray-Rust and myself from Chemistry. Theo had no fear…essential for live demos. He was asked questions, took the plunge and ran the searches, and with the rest of us celebrated a successful search, a weird result and one that was just plain wrong. It was ALL good. I am impressed. I am impressed by what they are out to achieve with Wolfram Alpha. I am convinced that what they are doing with Alpha will contribute to science and mathematics in general and that Chemists will be using this system when they have more awareness of it.

For a general intro to Alpha see the presentation here.

So, some examples of interesting searches:

1) A guy in the room had asked the question “What is the largest land mammal?” and had not received an answer a few weeks earlier. Now Theo posed that question and got the answer here. Nice! Now, I took that to mean that they were keeping logs of failed queries and tweaking…confirmed by Theo. VERY nice.

2) Peter Murray-Rust had previously blogged about bad results from his searches (searching on dibromoethane for example). When he repeated his searches in the session hosted by Theo he acknowledged that he was pleased that they had fixed the issues he had previously blogged about. This is how modern systems should be…moving quickly.

3) Searching on names…for example, what is the number of people with my name…my spelling is Antony NOT Anthony. See here for the results.

4) What is the return per employee for Google versus IBM? It’s in this query: http://www35.wolframalpha.com/input/?i=GOOG+IBM

5) What are the chemical structures of Taxol? Methamphetamine? Cholesterol? Buckminsterfullerene? You get answers for all. The organic molecules all give images of chemical structures. The connections in all cases are correct but I see no evidence of stereochemistry anywhere across the chemical structures in the database. It doesn’t mean it’s not there, but I couldn’t find it.

So, for chemistry, am I impressed? Yes, I am. I’m not worried right now that Alpha is not dealing with stereochemistry…I am sure they will layer that on later. It is clear based on most of the results that I have seen that there is some GOOD curation of the data going on. According to Theo there are chemists on staff and they are curating the data coming in. Hallelujah! If you look in the Source Information for Taxol you see a LONG list of sources of chemical information and the primary source is the Wolfram Alpha Curated Data.

There is much that can be done to help Wolfram Alpha to have better Chemistry. They have a HARD job ahead of them if they are going to sample the Public Databases to grab quality chemistry. It’s in there for sure but it’s hard to find. What could come out of ChemSpider and Wolfram Alpha working together?

1) If we could get the list of “compounds” in Wolfram Alpha then we can provide chemical compound connection tables with all necessary stereochemistry etc.

2) When we pass back the compound list we can also pass back ChemSpider IDs and get them listed as identifiers alongside the PubChem CID. In theory it would be good to get these linked back to ChemSpider so that a user can come and find associated articles, analytical data, the Wikipedia article, predicted and experimental properties and so on. This is where ChemSpider’s integration would be of value.

3) There is an opportunity to expand the chemistry in Wolfram Alpha by passing a subset of ChemSpider compounds to be added to Alpha. Certainly I don’t think that Alpha should host all 21.5 million of our compounds for the reasons I have enumerated many times on this blog. See my last post about the 54 versions of the Taxol skeleton…there should be only one Taxol. But, there may be a way to subset “important chemistry” and get it into Alpha. OR, maybe they do want it all?

There are clearly opportunities to help expand the chemistry and I hope we have the chance. I think Alpha is incredibly ambitious. But why not be ambitious? ChemSpider was ambitious too and look what we have done with three servers in a basement…that’s a whole lot fewer resources than Wolfram are throwing at Alpha. I want them to be successful…a computational engine for the public. Why not? So many of us are asking questions using search engines right now and can’t get anywhere near an answer…


Let’s start off where I intend to finish. Bigger does not necessarily mean better. A large database of unique chemical entities does not necessarily mean a good database and accurate chemical representations of chemical entities can be pretty hard to find.

Few people realize how these simple statements are impacting the quality of what’s available online for chemists to use and how curation of data must occur in order to improve what’s available.

Now…what’s the basis for me to initiate this discussion and WHY would I prefer that ChemSpider was actually a smaller database?

Today on CHMINF Steve Heller posted the following review:

“From:http://www.ala.org/ala/mgrps/divs/rusa/sections/mars/marspubs/marsbestfreewebsites/marsbestfree2009.cfm

Title: PubChem
URL: http://pubchem.ncbi.nlm.nih.gov/

PubChem is a search tool for chemical information, divided into three areas: Compounds, Substances, and BioAssays. Full entries provide detailed information with the most basic information – a general description, the molecular weight and formula, the structure, plus a Table of Contents (ToC) for the full entry – all easily found above the fold. Use the ToC or scroll down to retrieve more advanced information, such as bioactivity results, synonyms, chemical actions, detailed properties, and more. Each module is fully interlinked with the other sections of PubChem as well as resources in ToxNet and PubMed, providing full access to toxicology resources and the medical literature, and allowing users access to as much or as little of the chemical information as they need.

Author/Publisher: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Date reviewed: February 16, 2009

PS. PubChem now has 37,326,949 DIFFERENT structures”.

Bob Buntrock made the following statement “Re the PS below, I find it difficult to believe that PubChem has 37.3 million “different” compounds.  The figures from the CAS website show 48 million organic and inorganic compounds which excludes sequences but includes polymers, alloys, coordination compounds, minerals, and mixtures. Since PubChem aims to cover “small molecules”, it would seem that many compounds in these last 5 categories would not be present.  Therefore, I assume that a significant number of the 37.3 million PubChem compounds are redundant.” All hell broke loose with lots of posts discussing the uniqueness of chemical entities and the fact that PubChem compounds WERE unique. Okay, I’m not going to argue this for the moment but I am going to agree with Bob that a significant number of the compounds are likely redundant. It is ALSO true of ChemSpider. Why?

I could write a multipage blog post, and I have already discussed this issue many times on this blog, but I am clearly failing to communicate it. I refer you to previous posts about Taxol (1,2,3), Vancomycin (4) and Ginkgolide B (5,6). I suggest you read these earlier posts but I will try to explain again anyway.

Some general statements. Many complex chemical compounds, especially natural products, have timelines. A compound when initially elucidated can give the connectivity only and get reported. Then stereochemistry might be layered on later, and reported. Then stereochemistry might be adjusted, and reported. Through this whole timeline the compound might be referred to by a particular chemical name….let’s call it Afonwenium. So, based on the timeline for this molecule there can be anywhere between 1-4 “versions” of the structure by that name. They are all unique chemical entities but the “final structure” is the one that people will want. It’s the one that should be represented on Wikipedia, the one that should correctly be drawn in all publications following the final elucidation report and assertion of structure and the one that should be found on many of the “reference” databases such as KEGG, DrugBank etc.

Search Taxol on ChemSpider and Taxol on PubChem and compare the number of structures you get. I judge that there are MANY unique chemical entities on PubChem that are MEANT to be Taxol but are not. And I don’t mean the ones that are named as “Taxol derivative”, I mean the ones that may have the SAME molecular weight, formula and connectivity but have DIFFERENT stereo – no stereo, MULTIPLE partial stereo and MULTIPLE full stereo. These issues exist for compounds like Ginkgolide B and Vancomycin and many more structures.  There is of course only one Taxol, a compound registered by Bristol Myers Squibb and asserted to have a specific constitution.

Just out of interest, let’s see how many compounds are on ChemSpider with a specific skeleton (ignoring stereo).

There are 54 compounds with the skeleton of Taxol: http://www.chemspider.com/InChIKey/RCINICONZNJXQF. These are all UNIQUE chemical entities but there are C-11 and C-14 labeled, Deuterium and Tritium labeled and so on. But there are over 30 compounds without isotopically labeled sites that still have the Taxol skeleton. Maybe some of these are meant to be Taxol with different stereochemistry but I judge that MOST of these are meant to be Taxol and are labeled as such but differ in terms of no, partial and full stereo at least. This is ONE example. To Bob’s question…is this redundancy? I say yes. How does this get solved? Curation will do it but it’s expensive and time consuming and the only way forward in my judgment is to crowdsource it. This problem is not going away anytime soon in PubChem or ChemSpider. We HAVE curated the name associations and removed the name of Taxol from all skeletons that are not the asserted form of Taxol. But the structures do remain on the database and link back to the original sources. We will be working on ways to show on every search that there are associated skeletons, compounds related by isotopic labeling and the status of no, partial and full stereochemistry. All to come…
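The skeleton search above keys on the first block of the InChIKey, which records with different stereo or isotopic labeling share. As a rough illustration only, here is a minimal Python sketch of grouping records that way. The record IDs and the second-block values on the Taxol-skeleton keys are placeholders invented for the example (not real ChemSpider data), and `skeleton_block` and `group_by_skeleton` are hypothetical helper names.

```python
from collections import defaultdict


def skeleton_block(inchikey: str) -> str:
    """Return the first block of an InChIKey (the part shared by one skeleton)."""
    key = inchikey.strip()
    # Accept keys written with the optional "InChIKey=" prefix.
    if key.startswith("InChIKey="):
        key = key[len("InChIKey="):]
    first = key.split("-", 1)[0]
    # The first block is 14 uppercase letters; reject anything else.
    if len(first) != 14 or not first.isalpha() or not first.isupper():
        raise ValueError(f"not a valid InChIKey first block: {inchikey!r}")
    return first


def group_by_skeleton(records):
    """Group (record_id, inchikey) pairs by their first InChIKey block."""
    groups = defaultdict(list)
    for record_id, inchikey in records:
        groups[skeleton_block(inchikey)].append(record_id)
    return dict(groups)


# Placeholder records: the second blocks "AAAAAAAAAA" and "BBBBBBBBBB" are
# made up to stand in for different stereo/isotope variants of one skeleton.
records = [
    ("rec-1", "RCINICONZNJXQF-AAAAAAAAAA-N"),
    ("rec-2", "RCINICONZNJXQF-BBBBBBBBBB-N"),
    ("rec-3", "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),  # ethanol
]
groups = group_by_skeleton(records)
```

Any group with more than one record is a candidate for the kind of curation discussed here: same connectivity, but possibly no, partial or full stereo.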

The ongoing “Bigger is Better” argument for Public Compound Databases is irrelevant at this point in my opinion. We can add 50 million new compounds with a simple enumeration exercise but would it bring any value? I say no. We can add virtual libraries from a number of our collaborators but I judge it to be of very limited value. The value of the Public Compound Databases is in what they connect to and whether there is an answer to a question at the end of the chain. If I search on a chemical and find it on ChemSpider but cannot find a vendor for it, analytical data, properties of value, manuscripts, linked patents etc. then I have done a search and found a hit but derived no value. We are working on increasing the VALUE of our content: linking compounds to rich data sources, layering on additional properties, links to papers, blog entries and discussions and so on. If the result of a search is a hit but with no value, who cares? If the result of a search is a hit but with links to the wrong information, that’s worse. If I ask the question “What is Taxol?” and get one hit, I need it to be right. If I ask the question and get tens of hits, now what?

Curation has been underway for 2 years. We’re not finished. It’s a massive task. In reality it will NEVER be finished – new chemistry comes in every day and more information gets associated. We don’t have answers to all of the issues that exist around these diverse datasets but we are not naive in our understanding that our database is polluted with issues inherited from many other sources. We have marked tens of thousands of structures for deprecation. We have likely added information into PubChem that has contributed to the issue of data quality. But we are working on it.

Meanwhile errors that exist in PubChem are proliferating. A simple example is that of methane in PubChem that I have blogged about many times…one example here. Here are some of the names associated with the structure of methane on PubChem: 1,3-DICHLORO-PROPAN-2-ONE, diamond, charcoal and many tens of other incorrect names.

The National Cancer Institute’s Chemical Structure Lookup Service has over 46 million unique chemical entities and they have offered a series of services to search by InChI, name and many other queries. A posting to CHMINF outlined the service:

“Chemical Identifier Resolver (beta):
—————————-

http://cactus.nci.nih.gov/chemical/structure

This service is a resolver for different chemical structure representations and identifiers, including those that do not carry any information about the structure itself. For instance, it can work as a Standard InChIKey Resolver, an NCI/CADD Identifier Resolver or a Chemical Name Resolver. The service also allows one to convert a given structure identifier into another representation or structure identifier.

Representations/identifiers supported are: Standard InChI/InChIKey, NCI/CADD Identifiers (FICuS, FICTS, uuuuu), SMILES, SDF, names, and a few other types of IDs. See the web page for more information.

For those identifiers that require lookup, the underlying database currently contains about 67 million unique structure records, from which the respective Standard InChIKeys and NCI/CADD Identifiers have been calculated. For lookup by chemical names, 68 million names associated with 16 million unique structure records are currently available in the database. The database continues to grow.

Closely related are the new capabilities of resolving/converting chemical structure identifiers by simply using a URL adhering to the following scheme: http://cactus.nci.nih.gov/chemical/structure/”structure identifier”/”representation”[/xml]

We just list a few examples here that should give you an idea of what’s possible with this service.  For more detailed explanations, see the above web page.

Example: Standard InChI for chemical name string “aspirin”: http://cactus.nci.nih.gov/chemical/structure/aspirin/stdinchi

Example: Standard InChIKey of “ethanol” specified as SMILES string “CCO”: http://cactus.nci.nih.gov/chemical/structure/CCO/stdinchikey

Example: Unique SMILES string of chemical name string “benzene”:http://cactus.nci.nih.gov/chemical/structure/benzene/smiles

Example: SD File for chemical name string “morphine”:http://cactus.nci.nih.gov/chemical/structure/morphine/sdf

Example: Chemical names for Standard InChIKey “InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N” (Standard InChIKey of “ethanol”): http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/names

Example: Synonyms for chemical name string “aspirin”:http://cactus.nci.nih.gov/chemical/structure/aspirin/names”
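The URL scheme described in the posting is simple enough to build programmatically. The following is a minimal Python sketch, not an official client: `resolver_url` is a hypothetical helper name, and percent-encoding the identifier is my assumption about how to pass SMILES characters such as `#` safely in a URL path.

```python
from urllib.parse import quote

# Base address taken from the posting quoted above.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Build a resolver URL of the form BASE/identifier/representation[/xml]."""
    # Percent-encode the identifier (assumption), but keep "=" so that
    # "InChIKey=..." identifiers look the same as in the posted examples.
    encoded_id = quote(identifier, safe="=")
    url = f"{BASE_URL}/{encoded_id}/{quote(representation, safe='')}"
    return url + "/xml" if xml else url


# The same requests as the examples in the posting.
aspirin_inchi = resolver_url("aspirin", "stdinchi")
ethanol_key = resolver_url("CCO", "stdinchikey")
ethanol_names = resolver_url("InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "names")
```

The sketch only builds the URLs and leaves out fetching them, so it runs offline. Any HTTP client could then retrieve the result.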

Unfortunately, polluted names are finding their way across all of these databases, which is why a lookup on methane (http://cactus.nci.nih.gov/chemical/structure/methane/names) returns a list that includes:
1-Chlorobenzylethyl-3,5,7,9,11,13,15-heptaisobutylpentacyclo[9.5.1.1(3,9).1(5,15).1(7,13)]octasiloxane, mixture of isomers
673323_ALDRICH
PSS-[2-[(Chloromethyl)phenyl]ethyl]-Heptaisobutyl substituted
675342_SIAL
(2R,3R)-Butanediol bis(methanesulfonate)

and DIAMOND…
(2R,3R)-Butanediol dimesylate

The CAS database is highly curated, not without errors, and built up using robots and eyes. Public Compound Databases are built with the best intent and are useful. But they are not curated and are polluted. Bigger does NOT mean better and care is warranted. ChemSpider will likely stay smaller than many of the other Public Compound Databases moving forward as we remain focused on adding value and addressing the issues of inherited and future quality. It’s a long journey…

ChemMobi, an application written by James Jack from Symyx, has finally been posted to the App Store. It can be downloaded for free and enables your iPhone to search both Symyx’s Discovery Gate and ChemSpider (using our web services). I’ve posted before about the work done by James (1,2) and it has now come to fruition with the first version of ChemMobi. If you are an iPhone user try it out and give us your feedback!
