Archive for March, 2008

ChemZoo Joins Microsoft BioIT Alliance

Wake Forest, NC, March 27th, 2008 — ChemZoo Inc. announced today that it has joined the Microsoft BioIT Alliance, a cross-industry group working to enhance collaboration among life sciences organizations to accelerate the pace of drug discovery and development.

“Collaborative science is facilitated by a common development platform and cooperation between informatics companies,” said Antony Williams, President of ChemZoo. “The ChemSpider Free Access website has been delivered with the intention of building a structure centric community for chemists and is happy to contribute to the Microsoft BioIT Alliance initiative. As a leader in the domain of online chemistry data management we look forward to collaborating with Microsoft BioIT alliance partners. In these efforts we believe that ChemSpider as a common public access data platform will allow BioIT members to source information and connect to a database allowing their users to solve problems more quickly. By providing a platform for crowdsourcing enhancement of an online chemistry resource researchers around the world will have access to community-based data and knowledge.”

Rudy Potenzone, Worldwide Industry Technology Strategist for Pharmaceuticals and BioIT Alliance Director, said: “We are thrilled to have ChemZoo join the Alliance.  The ChemSpider team has created in a short time an important free resource for the scientific community in an innovative and collaborative environment that will be a model for future projects in scientific information.  We welcome their participation.”

The BioIT Alliance is designed to enable collaboration among organizations in the life sciences field in order to shorten the time between discovery of new biological data and the application of that knowledge to human health. ChemZoo will enhance their ChemSpider platform in collaboration with other alliance members to provide access in a manner enabling collaborative science and, where appropriate, public chemistry-based knowledge management.

About ChemZoo

ChemZoo, Inc., was founded with the intention of providing online chemistry software and services to help build a chemical structure centric community for chemists. Their first offering, ChemSpider, is a chemistry search engine built with the intention of aggregating and indexing chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge. Founded in 2007 and located in Raleigh, North Carolina, ChemSpider intends to become a facilitator in the exchange of structure-based information between chemists worldwide. For further information, see www.chemspider.com

About BioIT Alliance

Formed in 2006, the BioIT Alliance is a cross-industry group working to integrate science and technology in order to accelerate the pace of drug discovery and realize the potential of personalized medicine. Founding members include Accelrys Software Inc., Affymetrix, Inc., Agilent Technologies Inc., Amylin Pharmaceuticals, Inc., Applied Biosystems, The BioTeam Inc., Digipede Technologies LLC, Discovery Biosciences Corporation, Geospiza Inc., Hewlett-Packard Development Company, L.P., Illumina Inc., InterKnowlogy, Microsoft Corporation, Sun Microsystems Inc., The Scripps Research Institute, VizX Labs LLC and other key companies in the pharmaceutical, biotech, hardware and software industries. Additional information about the BioIT Alliance can be found on the BioIT Alliance Web site at http://www.bioitalliance.org.

About Microsoft

Founded in 1975, Microsoft (NASDAQ: MSFT) is the worldwide leader in software, services and solutions that help people and businesses realize their full potential. For more information, please visit www.microsoft.com.

Frequent users of ChemSpider who use the identifiers for searching will commonly find a mixture of “names” and “database IDs” as well as “registry numbers”. Since the number of database IDs can sometimes swamp the synonyms and chemical names, we chose to separate them. We have run some regular expressions across the database to separate the database IDs out. We have left registry numbers (marked by [RN]), EINECS numbers (marked by [EINECS]) and Wiswesser Line Notation (marked by [WLN]) in with the synonyms.
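A minimal sketch of this kind of regular-expression pass, in Python. The patterns, category names and the `classify` function are assumptions made for illustration; they are not the expressions actually run across the ChemSpider database, and real database IDs come in many more flavors than the single pattern shown here.

```python
import re

# Hypothetical patterns, for illustration only.
CAS_RN = re.compile(r"^\d{2,7}-\d{2}-\d$")            # shape of a registry number, e.g. 33069-62-4
EINECS = re.compile(r"^\d{3}-\d{3}-\d$")              # shape of an EINECS number, e.g. 200-001-8
DB_ID = re.compile(r"^[A-Za-z]{2,10}[-_ ]?\d{3,}$")   # letters followed by digits, e.g. NSC125973


def classify(identifier: str) -> str:
    """Sort one identifier string into a coarse category."""
    s = identifier.strip()
    if CAS_RN.match(s):
        return "registry number"
    if EINECS.match(s):
        return "EINECS"
    if DB_ID.match(s):
        return "database ID"
    # Anything that matches none of the patterns stays with the names.
    return "name or synonym"


for ident in ["Taxol", "33069-62-4", "200-001-8", "NSC125973"]:
    print(ident, "->", classify(ident))
```

A pattern this simple will both miss some database IDs and misfile some names that happen to end in digits, which is why reports of missed IDs from users remain useful.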

database-ids.png

Unfortunately there are MANY flavors of Database IDs and we might have missed some. If you come across any “potential” DB IDs and think we should segregate them out, please use the POST COMMENTS ability to inform us. Simply Post a Comment to a record and suggest we check the identifiers out for potential DB IDs. Thanks!

On March 24th 2007 we gave birth to a “Spider”…it arrived as most newborns do…coughing, spluttering, belching and not too pretty. Nevertheless, it’s growing nicely, has been cleaned up a lot since day one and is developing new friends. ChemSpider is part of the family of Free Access databases and definitely has a place on the team.

With ChemSpider’s first birthday we should take time to acknowledge our firsts in the Free Access structure database environment (these are based on my knowledge of what’s available and I welcome any corrections).

1) First real time online curation system for structure-identifier pairs

2) First real time structure-deposition system - association with publications, analytical data and supplementary information

3) First system to use InChIStrings and Keys as a basis of structure deduplication and searching

4) First system to allow online deposition of analytical data, images and CIF files associated with structure

5) First system to “take a stab” at offering support for Open Notebook Science with supporting web services and integration to a structure collection.

We thank all of our supporters, users, advocates and naysayers for your feedback and comments regarding ChemSpider. Your input has been very valuable in helping to improve our service. In recognition of their contributions I name the following people:

Curator of the Year: Barrie Walker, UK

Depositor of the Year: Chris Singleton, US

Feedback of the Year: Heinz Kolshorn, Germany

Collaborator of the Year: Will Griffiths, Chemrefer (and now member of ChemSpider)

Community Builder of the Year: Joerg Wegner (Belgium?)

What are we working on now and in the near future?

1) A system for integrated text and structure searching of Open and Free Access scientific publications

2) Mounting multiple data sources online including an expanded dataset for IUCr integration, the Wikipedia Chemistry dataset, multiple vendor datasets (over a million additional compounds), a new SureChem patent integration dataset, and safety data from MSDS sheets

3) The division of chemical names and synonyms from database IDs for ease of navigation through the system

4) The scraping and aggregation of CIF files from multiple sources and publication online

5) The deposition of new spectral datasets supplied by collaborators and friends of ChemSpider.

6) Unveiling the batch deposition process to allow users to deposit tens of thousands of structures should they wish

7) Additional support for Open Notebook Science advocates

8) Supporting substances which cannot be easily represented in a structure format but can be represented using a registry number - minerals, polymers, mixtures etc

9) Website redesign for improved ease of use and navigation

Just like the Queen of England has two birthdays ChemSpider will, all being well, celebrate the first day of Spring ACS by lifting our Beta label from our website in April. We went live on the day before Chicago’s Spring ACS last year and have kept going ever since…

I am going to be in New Orleans from the 5th-8th of April before heading off to the SBS Conference in St Louis. If anyone is interested in chatting about ChemSpider and our future directions please ping me at antonyDOTwilliamsATchemspiderDOTcom and maybe we can get together for a coffee?

There are a number of groups in the “free access to chemistry information” domain at present and all are working hard to provide access to data, knowledge and connectivities to serve the chemistry community. Two of the most common questions I get asked are about the difference between ChemSpider and PubChem and between ChemSpider and eMolecules. Yesterday I was asked about the difference from eMolecules three times. So, overnight I put together what I hope is an objective comparison of capabilities. I welcome any feedback or additional questions. It’s a living document as both of our sites are changing (and I know ChemSpider better than I know eMolecules of course).

I know members of the eMolecules team from my previous role when I facilitated the connection between eMolecules (or Chmoogle as it was then) and ChemSketch.  I then had the pleasure of meeting with their VP of Sales at the Chicago ACS meeting one day after ChemSpider went live and we discussed our opinions about our mutual intentions to deliver value to the community. I like the eMolecules offering. There are some nice visual elements on the site. ChemSpider IS different and has a different focus based on what I see eMolecules delivering. We are out to build a structure centric community for chemists. I believe eMolecules is focused on delivering a centralized resource for sourcing chemicals for purchase and has a business model based on advertising and delivering websites for chemical vendors. We will shortly complete our first “depositor skin” that will provide to one of our collaborators a way to display their content only from ChemSpider and “branded” with their logo etc. so will offer a similar service to depositors.

A number of people now visit ChemSpider asking us where they can source a particular chemical. If we cannot find it on ChemSpider then I do visit eMolecules and point people to the link if I find it. At present that’s only about 10% of the time. Despite the fact that ChemSpider is about 2.5 times bigger as a database than eMolecules, their focus is commercial vendors and at present they do have more commercial vendors than us. Our collection is growing at about 2-3 new depositors per week, mostly chemical vendors requesting that we add them to our database. Some people think that ChemSpider is simply a rewrapping of the PubChem database. On day 1 we went live with only the PubChem collection but the data sources collection is much more diverse now and we have actually deposited back to PubChem (which I don’t believe eMolecules has yet?). Our structures are unique…but you MUST be careful with that consideration. For example, there ARE multiple flavors of Taxol on ChemSpider but the same is true of Taxol on eMolecules. Actually, the reality is that there are multiple flavors of the “Taxol Skeleton” on ChemSpider (42 to be precise! http://www.chemspider.com/q/RCINICONZNJXQF) but NOW, after our curation and redirection efforts, there is only one compound, the CORRECT one, that will be retrieved based on a search on the name Taxol (http://www.chemspider.com/q/taxol) relative to the seven on eMolecules where you have to determine for yourself which one is Taxol. The 42 Taxol skeletons include multiple stereochemistries and isotopically labeled compounds (C-13, C-11, Tritium, Deuterium etc.). So, be careful when people talk about unique structures!
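To make the “skeleton” idea concrete, here is a small Python sketch that groups InChIKeys by their first block, which is the part used in the skeleton search URL above. It assumes the hyphen-separated InChIKey layout; the first block RCINICONZNJXQF comes from the link above, while every second block below is a made-up placeholder rather than a real key. The function names are hypothetical too, and this is not a description of how ChemSpider implements the search.

```python
from collections import defaultdict


def skeleton(inchikey: str) -> str:
    """Return the first block of an InChIKey (the text before the first hyphen)."""
    return inchikey.split("-", 1)[0]


def group_by_skeleton(inchikeys):
    """Map each first block to the list of full keys that share it."""
    groups = defaultdict(list)
    for key in inchikeys:
        groups[skeleton(key)].append(key)
    return dict(groups)


# Placeholder keys: only the first block of the first three is taken from the post.
keys = [
    "RCINICONZNJXQF-AAAAAAAAAA-N",
    "RCINICONZNJXQF-BBBBBBBBBB-N",
    "RCINICONZNJXQF-CCCCCCCCCC-N",
    "ZZZZZZZZZZZZZZ-DDDDDDDDDD-N",
]

groups = group_by_skeleton(keys)
for block, members in groups.items():
    print(block, len(members))
```

Records that share a first block but differ in the rest of the key are distinct entries (for example different stereochemistry or isotopic labeling), which is why a count of “unique structures” depends on which level you count at.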

It would be great to have the eMolecules collection in ChemSpider and direct traffic to their site and to their commercial vendors and extend the community. What do you think?

Over the past few months we have been working hard to integrate the ChemRefer system into ChemSpider. I have reported on our recent rollout of the first level of integration and Will Griffiths has started a discussion at the Open Chemistry Web blogpage. Will has played a key role in facilitating our relationships with publishers in the Open Access domain. Many have had an appreciation for his ChemRefer website.

When we connected ChemRefer to ChemSpider one of our first commitments was to facilitate direct structure/substructure searching of the IUCr publications. Will has indexed the publications from 1948 to present day. He has extracted chemical names and systematic names from the titles and abstracts of the articles. Since we have access to name to structure conversion capabilities from OpenEye we can convert many of these names to chemical structures in an automated fashion. As a result of my work on the curation of Wikipedia chemical structures I also have access to some of the tools I was involved in developing at ACD/Labs (ACD/ChemFolder and ACD/Name). This has allowed me to curate and convert other extracted names in a manual manner. This is NOT the most efficient way to conduct this process as it requires a lot of eyeballing. However, it is this type of approach, reviewing many hundreds of extracted names, which has allowed us to optimize the process, recognize where potential failures could arise and improve the chance to extract the best set of names to work with for conversion purposes. Now we are at the whim of name to structure conversion software in terms of the accuracy of the conversion but this can be validated (see later). Since organometallic names are very difficult to convert to structures we can only deal with organic structures at present. However, we do also have many validated synonyms in our hands on ChemSpider and that provides a useful dictionary for conversion.

Ultimately I hope that other commercial providers of batch Name to Structure conversion software modules will see the potential value of collaboration with ChemSpider and give us access to their software tools to assist with our project.

We are in the process of curating and checking the names and chemical structures for all that we have extracted so far but the depositions have already started. Of the structures converted we have found about half of them are already on ChemSpider (example) but another half are unique (example). At present we have connected almost a thousand articles via DOIs from ChemSpider to the IUCr articles. The best estimates at present are that we will be able to connect about 8500 structures to articles.

Since this is the IUCr collection we will validate our approach in the future and, with their permission, we will use the CIFs to validate our extracted structures against the CIF-based structures. This will give us an indication of the errors one could expect from an automated extraction of chemical names from articles and conversion to structures using Name to Structure. We are also going to continue our curation process and allow people to curate structures from IUCr (and other articles) to ChemSpider. There will be cases where some names from articles have not been converted to structures and these will need manual submission. This will be possible through the manual deposition system.

We are looking forward to providing similar services to other publishers should they be desired. This is our proof of concept…and it’s working.

Late last week I posted about a message left by CAS on the Wikipedia pages regarding the CAS Number Validation project we had started at Wikipedia Chemistry to assist in producing a quality curated dataset. There was an interesting outpouring of judgement and negativity regarding the message posted by CAS, balanced by a hope for resolution and new connections. My hope in building the community of chemists at ChemSpider is to do so while staying in relationship with key players in the domain. I prefer to do things with permission and, while I will cajole and encourage shifts, I choose to respect the expectations of the players.

This week conversations have been ongoing between WP:Chem and CAS. The conversations have been conducted by Martin Walker, a member of the WP:Chem team as well as the ChemSpider Advisory Group. Martin and I have similar opinions in regards to how to participate in the community and I honor his approach in working through this potentially difficult situation. The outcome of the discussions is declared here on Wikipedia.

New announcement from CAS

CAS, a division of the American Chemical Society, is pleased to announce that it will contribute to the Wikipedia project. CAS will work with Wikipedia to help provide accurate CAS Registry Numbers® for current substances listed in Wikiprojects-Chemicals section of the Wikipedia Chemistry Portal that are of widespread general public interest.

The CAS Registry is the world’s most comprehensive collection of chemical substances and the CAS Registry Number is the recognized global standard for chemical substance identification.

CAS views Wikipedia as an important societal tool for the general public, and this collaboration with Wikipedia is in line with CAS’ mission as a Division of the American Chemical Society.

We look forward to working with the Wikipedia volunteers over the next few weeks to make this happen. Eshively (talk) 13:40, 12 March 2008 (UTC)

I think this is excellent. I implicitly agree with the statement “The CAS Registry is the world’s most comprehensive collection of chemical substances.” For CAS to offer support to the Wikipedia team for the curation project is, for me, an indication of commitment to public service and I am indebted to the participants in this decision. I’m excited to get back underway with the curation project and will start up my efforts again this weekend. This decision by CAS has invigorated me to keep eyeballing structures as fast (and carefully) as possible.

My sincere appreciation is extended to the CAS management team and decision-makers. My gratitude to WP:Chem for staying engaged in the conversation to get to this outcome. My encouragement to us all to get this project done and have a high quality validated dataset of chemicals available as a public resource. Onwards and upwards!

Previously we introduced the ability to submit chemical structures to the database using the Single Structure Deposition process. This allows users to submit single structures to the database and associate them with publications, URLs, PubMed IDs and so on. An example of the result can be seen here for Quesnoin…the structure and associated supplementary info were deposited online using the outlined process.

We have previously unveiled the ability to add publication details to existing structures on the database as outlined here. What we’ve heard is that it would be just as useful, and in the time of Web 2.0 even better, to allow connections to other web pages by letting URLs be connected to existing structures on the database. The process is easy.

You need to be logged in to Add URLs, Publications etc. The only action that can be done without logging in is the Posting of Comments. The reason we do this is to help protect us against vandalism, if possible. Once logged in, click on Add URL. The example below is for me wanting to form a link between the structure record of Xanax and the article on Wikipedia.

addurl1.png

A dialog box will be displayed. Input the Title that you want displayed in the supplementary info and the URL of the associated link. See below:

addurl2.png

Once filled in, the dialog will look as follows:

addurl3.png

Then Click OK. The submission will be sent to a curator for approval and should be approved very quickly. The reason for this process is to ensure that we don’t get adorned with “inappropriate links”. The information will show as “Supplementary info” at the bottom of the structure record as shown here.

Two particular Open Access resources providing enormous value to Life Sciences nowadays are PubMed and PubChem. I’m sure everyone reading this blog has heard of them and used them both. We previously announced our deeper collaboration with ChemRefer. We have been busily working in the background to integrate ChemRefer into ChemSpider and the very “alpha” version of the integration is now available online. It can be accessed from the Search menu as shown below. Simply click on ChemRefer

chemrefer.png

When selected it will open the ChemRefer search window and a list of Publishers who have allowed us to index them as shown below. All can be searched or the search can be limited by using the Check Boxes.

chemrefer2.png

The search results for searching on taxol are shown here. The figure below, while too small to see detail, shows that the word searched is highlighted in the text.

chemrefer3.png

Notice that the RSC is no longer indexed at their request. We were sad to lose them from our searches.

We have also integrated to Entrez, for searching health sciences databases at the National Center for Biotechnology Information (NCBI) website. This can be searched by choosing NCBI Entrez from the Search drop down menu. It is available here. An example of the results is shown here. For this first integration we limit the results to 100 hits.
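For readers who want to try a comparable capped query themselves, NCBI exposes Entrez searching through its public E-utilities interface. The Python sketch below only builds an ESearch URL with the result count capped at 100; the choice of the `pmc` database, the function name and the cap constant are assumptions for illustration, and this is not a description of how the ChemSpider integration is implemented.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
MAX_HITS = 100  # assumed cap, mirroring the 100-hit limit mentioned above


def build_esearch_url(term: str, db: str = "pmc", max_hits: int = MAX_HITS) -> str:
    """Build an ESearch URL for `term`, never asking for more than MAX_HITS results."""
    params = {"db": db, "term": term, "retmax": min(max_hits, MAX_HITS)}
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


print(build_esearch_url("taxol"))
```

Fetching the URL returns a list of record identifiers; retrieving titles and links would take a further E-utilities call, which is left out here to keep the sketch free of network access.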

entrez1.png

Clicking on any of the titles is a direct hyperlink to the article in PubMed Central.

These integrations with text searching engines are only the first part of our work. We have already been extracting chemical names and linking them up to structures in the ChemSpider database. Our first efforts in this area will be unveiled shortly. It will then be possible to simultaneously perform structure/substructure and text-based searches. This is a very significant undertaking but we are well on the way to bringing our vision of structure and text based indexing of the Open Access literature to fruition.

Adam Azman is building an Open Data Chemical Dictionary as discussed by David Bradley over on ChemSpy. With due respect to the work done by Adam and David Bradley to date I point you to ChemSpy to review the details.

Adam is local to me and attends UNC-Chapel Hill. Recently I had the opportunity to spend some time with Adam and talk about how the ChemSpider list of identifiers could be used by Adam to expand his dictionary. We have millions of identifiers and it is likely a rich resource for Adam to build his dictionary. Our contribution has already been made to Adam and he is presently working through extracting from the file. We’re looking forward to seeing the results!

The dictionary as it presently exists can be downloaded from the ChemSpy site here.

I have written previously about the joys and frustrations of participating in the blogging community. One of my greatest joys is what I learn from others in the domain. In regards to the discussions about CAS Numbers I was very interested to read this comment from PhysChim62, a member of the Wikipedia:Chem team, regarding the European Union Database Directive.

I have extracted certain statements from that comment: “The restrictions it purports to impose of the reuse of its product would appear to breach anti-trust legislation on both sides of the Atlantic. Users of CAS databases in the European Union can take heart from Art. 8.1 of the Database Directive (96/6/EC): The maker of a database which is made available to the public in whatever manner may not prevent a lawful user of the database from extracting and/or re-utilising insubstantial parts of its contents, evaluated qualitatively and/or quantitatively, for any purposes whatsoever.”

Now I get to go off and read up on the Database Directive…while ChemSpider is trying to build a structure centric community for chemists I have to acknowledge the benefits to myself personally in terms of what I am learning from the participants in this community and specifically the people commenting on the blog. Thank you to those of you who comment!

The recent post regarding CAS numbers and Wikipedia has stirred up some great conversation and responses and I point you to the comments to peruse. For now I want to comment on one made by Cameron Neylon on his blog. I point you to his post to read first rather than me lifting it from him and posting it here. It’s respectful of his work. Also, you may choose to add him to your Google Reader. Cameron is a great advocate of Open Notebook Science and I encourage you to visit his site.

OK…Did you read it???

Ok ..I am now lifting certain comments and wish to state my own views.

Cameron said “So we would ideally want to expose InChi, InChiKey, SMILES, CML perhaps, PubChem Ids etc. These can all be converted one to the other using web services so we don’t need to type all of them in manually.”

At ChemSpider we likely have more experience than most in interconverting millions of structures from/to various other formats. They all have their own limitations. InChI, in both of its formats, is limited in a number of ways. The limitations are acknowledged and being worked on: for instance, polymers, inorganics, organometallics, mixtures of specific stoichiometries and Markush structures (not an issue for Cameron’s material in a bottle). SMILES comes in so many different flavors that it can be very distressing. Even the most popular Cheminformatics vendors can be incompatible…believe me, we’ve seen it in many ways! CML…haven’t worked with it yet on ChemSpider but remain interested. However, uptake seems to be very limited after maybe a decade of being available to the domain. InChI took off and proliferated at an incredible rate while CML has been around a lot longer and, as far as I know from 10 years in the cheminformatics business, has low adoption. It doesn’t mean it’s not the solution but if it is then it needs to be adopted by the masses. PubChem IDs…there are substance IDs (SIDs) and compound IDs (CIDs). So, a decision will be necessary there. More and more I am seeing PubChem IDs listed anyway…they are in the Aldrich catalog for example. The work will come with the curation of the data, making sure that people can find the “appropriate ID” for a compound. Check out my earlier posts about the need for curation (1,2,3,4 and many others). CAS is very highly curated and is the authority for CAS numbers. PubChem are, of course, the authority for their IDs too but compounds can be deprecated from time to time as depositors find their own errors, and there have been so many depositors with different quality standards to date that cleaning up the database is a major challenge. While they could do it, it is not their mandate today, they are not funded to do so, and it would be an enormous undertaking that would likely need to involve some form of crowdsourcing via online curation as we are doing here at ChemSpider.

Cameron said “The CAS number so appealing; it is short, easily typed in, and printed on most bottles.”

‘Tis so. And asking the vendors to move away from it won’t work. Adding a PubChem ID might work but that’s a big shift too and I believe they would need to have guarantees about the long term future of PubChem and its database and funding to buy into that. Also, a MASSIVE validation exercise. If the companies ended up depositing their OWN compounds to get PubChem IDs I believe all hell will break loose…it’s already going on by the way of course, when they deposit. Their compounds go through internal processes at PubChem and come out the other side as deposited structures. Is every one that went in deposited exactly as it was supplied? In theory yes. In reality? We have the same issues at ChemSpider…not easy. By the way, the CAS number with check digit and specific format is much nicer than “just a number”. Maybe we should do the same with ChemSpider IDs…new format, plus a check digit?
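For anyone curious what the check digit buys you, here is a short Python sketch of the published CAS Registry Number check: the digits before the check digit are weighted 1, 2, 3… counting from the right, and the weighted sum modulo 10 must equal the final digit. The function names are my own for illustration. A valid check digit only catches typing errors; it says nothing about whether the number belongs to the structure it is attached to.

```python
def cas_check_digit(body_digits: str) -> int:
    """Weighted sum of the digits (rightmost digit has weight 1), modulo 10."""
    total = sum(int(d) * w for w, d in enumerate(reversed(body_digits), start=1))
    return total % 10


def is_valid_cas(rn: str) -> bool:
    """Check the NNNNNNN-NN-N format and the final check digit."""
    parts = rn.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    first, second, check = parts
    if not (2 <= len(first) <= 7 and len(second) == 2 and len(check) == 1):
        return False
    return cas_check_digit(first + second) == int(check)


print(is_valid_cas("33069-62-4"))  # the Taxol number discussed in this post
print(is_valid_cas("33069-62-5"))  # same number with the last digit mistyped
```

For 33069-62-4 the weighted sum is 2×1 + 6×2 + 9×3 + 6×4 + 0×5 + 3×6 + 3×7 = 104, and 104 mod 10 = 4, which matches the final digit.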

Cameron said “I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers”

Hmmm…really? Check out this link. Compare the snippet here from the Taxol Drugbox on Wikipedia

drugboxtaxol.png

with the taxol record here on PubChem and the synonyms list. You are looking for 33069-62-4.

taxolcas.png

When you find it click on it and you will find 6 structures for Taxol. The issue is curation. Which structure is right?

Why does the story that there are no CAS numbers on PubChem continue to proliferate??? Sure, they are not called CAS Numbers but that’s what they are. Depositors simply put them in as identifiers and PubChem don’t have to remove them. They respect the depositors’ right to deposit their identifiers…whether they are CAS numbers or not.

Now, I agree robotic curation can help with these issues and the RDF approach already being discussed for Wikipedia and ChemSpider (with Egon) can be useful in helping to link together resources and, if adopted by companies such as Aldrich etc, can be of great value in helping to clean up some of the issues. But, it is only part of the solution. The need for manual curation is being missed. Robots are already making a bigger mess in my opinion. Manual curation is a must.

Cameron said “The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free then it would be driving traffic to its own value added services.”

I have made this statement to many people over the years. The Registry of chemical structures has little value as a collection of structures. It’s what’s connected that has value. The CAS Numbers, the patents, the literature articles, the vendors. It’s the same with PubChem and ChemSpider. Who cares how many structures we have? We can generate HUNDREDS OF MILLIONS for you! It’s the associated information. I have always thought that CAS should provide an internet service that is simply a CAS lookup. Search a number and see the structure and/or substance detail. End of story. You want patent details, papers, vendor details…you pay. It’s a transaction…same as STN now. In fact, do a study, see how many searches are done just to relate CAS numbers and structures, figure out the loss in revenue if you “give it away” and shift it over to the transaction charge to look at “more info”. Certainly the giveaway would help with the public relations! Oh…and you “could” make it a structure search to get a CAS number too…more work but of course possible.

Cameron said “…do people agree that CID is a good standard index to aggregate around?”

No…not yet. There is so much to be done before that can happen in my opinion. A lot.

What I’d rather do, and maybe I am a dreamer, but I would say get into relationship with ACS/CAS and try and establish a tipping point of support to where they see it is good for their business and the community as a whole. I prefer staying in relationship if possible. That said the effort around CIFs from ACS via CrystalEye has not moved forward as explained here so maybe my vision will not work in the case of CAS numbers either. Either way, we made a decision not to scrape CrystalEye anyway and this shows perfectly the issues of SMILES and InChI giving issues of structure representation!

I say let’s not abandon hope regarding CAS opening their numbers to the world just yet. This dialog is likely sparking discussions already. Let’s keep it out there and establish a groundswell of concern and support and hope that the right thing can happen for our good and for CAS. I have great respect for many of their people and their work and want the resolution to be appropriate for all parties. Let’s hope…and if hope doesn’t work then I encourage robotic and manual curation…the system is ready on ChemSpider. Come and help out!

Ever since ChemSpider went online we have been committed to allowing community curation and annotation. We have done this…in spades. We have introduced the ability to Post Comments for a curator’s approval, and we have allowed the association of publications and the association of URLs. Data have been annotated with over 200 spectra, CIF files or images as described here and here.

Over 500 comments posted by users and administrators of ChemSpider have now been curated, fixed, acknowledged or rejected (use the filter at the top of the page to see the different types).

status.png.

Almost 200,000 identifiers have been curated (approved or removed). The necessary functionality to allow curation and annotation of the data has been delivered. We will now extend and enhance it specifically to handle batch depositions (already in beta testing).

Crowdsourced curation and annotation is going to be necessary to cleanse a lot of the material which has found its way out into the public domain and we have started.


The Wikipedia curation project I have been working on has been on hold from my side for the past few weeks as I focused on some other important tasks including some presentations, 5 peer-reviewed papers and articles and a whole series of technical advances on ChemSpider. It’s been good to get a break from eyeball curation but I am ready to start again in mid-March if no new nighttime distractions show up.

Tonight I was catching up with my Watchlist on Wikipedia for the first time in a long time and noted that a comment had been added to the Wikipedia Project: CAS Validation page. This discussion page was started to have a place to discuss a second validation of my work by other members of the WP:Chem team and especially to deal with my concerns about CAS numbers not matching the structure drawn in the Chemical Box or Drug Box. Sometimes the CAS number might be for the chloride salt but the structure would be the neutral form, for example. So, this was our discussion place. I believe there is general agreement by all participants at WP:Chem that CAS Numbers have value for the users of Wikipedia and chemists in general, so the presence of a CAS number in the boxes makes absolute sense and, of course, the correct CAS number for the structure makes sense in an encyclopedia. Therefore, validation and sourcing of CAS numbers has been pursued.
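As an aside for anyone doing this kind of validation: a CAS number carries its own check digit, so a mistyped number can be caught before any eyeball curation starts. The sketch below is only a format check under that well-known checksum rule (the last digit equals the weighted sum of the other digits, counted from the right, modulo 10). The function name is made up for illustration, and passing the check says nothing about whether the number belongs to the structure drawn in the box (salt versus neutral form, for example).

```python
import re

# A CAS number is written as 2-7 digits, 2 digits, and 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit_ok(cas_number: str) -> bool:
    """Return True if the string is well formed and its check digit matches.

    Hypothetical helper for illustration; it cannot tell whether the
    number is actually assigned to a given substance.
    """
    match = CAS_PATTERN.match(cas_number.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    expected = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == expected


if __name__ == "__main__":
    for candidate in ("7732-18-5", "58-08-2", "7732-18-4"):
        print(candidate, cas_check_digit_ok(candidate))
```

Water (7732-18-5) and caffeine (58-08-2) pass; changing the last digit of either makes the check fail.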

A comment from Eric Shively at CAS can be found here online at Wikipedia. He comments:

Chemical Abstracts Service (CAS) objects to anyone encouraging the use of SciFinder® and STN® to curate third-party databases or chemical substance collections, including the one found in Wikipedia. SciFinder and STN are provided to researchers under formal license agreements, under which the researchers agree to refrain from using these tools to build databases. We urge and expect those researchers to respect the explicit terms of the agreements they have entered into. CAS is a division of the American Chemical Society. Please contact CAS if you have questions. Eric Shively, CAS, eshively@cas.org Eshively (talk) 20:56, 5 March 2008 (UTC)

It’s an interesting stance. This at a time when there is more focus on facilitating information exchange. In an environment where people are using resources such as Wikipedia to source information one would assume that the availability of CAS numbers would actually be encouraged rather than so blatantly discouraged. It’s been said before that CAS numbers are like the phone numbers of the chemistry world, so if they were to be sourced from a vendor’s catalog would that be acceptable? And how would anybody know where they are sourced anyway? If they were sourced from a bottle of chemicals on the shelf and added to Wikipedia is that acceptable?

Nevertheless, as Mr Shively comments there are legal agreements in place and they are expected to be respected. Question: does every user of SciFinder read the agreement? When a large Pharma company licenses access to SciFinder for their users do they expect people to know the legalities of usage and train their users in such detail? Maybe…

As it is I am not a user of SciFinder…though I’d like to be. I think it’s an incredible resource. So, I don’t have to worry about the legal repercussions of using the system (yet). As it is I will continue my work of curating and I guess there will be a discussion now with the WP:Chem team about what to do about CAS Numbers.


Recently I posted about trying to identify the correct structure of Ginkgolide B and the need for curation of ChemSpider entries. David Barden from the RSC commented on my post:

“Antony - I am an organic chemist working on the RSC journal in which the published structure of ginkgolide B appeared, and am pretty sure that it is correct, having been written by a regular author of ours familiar with the literature on the ginkgolides. I think the problem might lie with the representation (and/or conversion to InChI) of the structures - even in the one structure you indicated as having “full stereochemistry”, it seemed to me that 3 stereocenters were undefined, from a visual inspection of the structure. Apart from these stereocenters, the structure and InChI (generated myself) otherwise seem identical, so I’m not sure why the last part of the string in the ChemSpider entry is “20+” rather than “20-”. The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.”

I have redrawn the structure of Ginkgolide to echo that shown in the RSC journal and it is shown below alongside a cropped image from the article:

compare-the-two.png

I’m pretty sure I have the structure correct. The InChIString is:

InChI=1/C20H24O10/c1-6-12(23)28-11-9(21)18-8-5-7(16(2,3)4)17(18)10(22)13(24)29-15(17)30-20(18,14(25)27-8)19(6,11)26/h6-11,15,21-22,26H,5H2,1-4H3/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1

and the InChIKey is:

SQOJOAFXDQDRGF-MMQTXUMRBS

In the previous post I searched on Ginkgolide B as an identifier to see how many Ginkgolide B’s there are. There are 6 as shown here.

I searched on the entire InChIKey and found no hits. This means that the structure is NOT on ChemSpider.

I then searched on the CONNECTIONs captured within the InChIKey and represented by: SQOJOAFXDQDRGF. I received 18 hits in total, varying in completeness in terms of incomplete stereochemistry and DIFFERENT but fully assigned stereochemistry. I searched the entire InChIKey on Google (SQOJOAFXDQDRGF-MMQTXUMRBS) but received no hits. Just to check I then searched the InChIString shown above on Google. Surprisingly, I DID get a hit! It was for this structure. I was puzzled and a comparison of the strings showed a difference in ONE section of the string, the stereo layer.

Searched on Google: /t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1

Found by Google: /t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+/m1/s1

See the difference? ONE stereocenter… 20- versus 20+ . Thank goodness we are moving to InChIKeys rather than InChIStrings since the majority of people would likely miss the detail. I did the first time! So, based on all of my searches the structure of Ginkgolide B as represented in the article published by the RSC is NOT in the ChemSpider database. I agree with David Barden when he comments “The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.” It is very complex and time-consuming and the hope is that comparison of InChIKeys, specifically the second part of the key, will help catch the differences in a more facile manner.
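The comparison I did by eye can be sketched in a few lines of code. This is a minimal illustration, not how ChemSpider does it: the function names are invented here, the key is simply split at the first hyphen, and only the first /t layer of an InChIString is read. The "found" string is built by flipping the 20- center of my InChIString to 20+, which is the one difference reported above.

```python
from typing import Dict, Optional, Tuple


def split_inchikey(key: str) -> Tuple[str, str]:
    """Split an InChIKey into (connectivity block, remainder of the key)."""
    connectivity, _, remainder = key.partition("-")
    return connectivity, remainder


def stereo_parities(inchi: str) -> Dict[int, str]:
    """Return {atom number: parity mark} from the first /t layer, if any."""
    # Skip the version prefix and the formula; layers start at index 2.
    for layer in inchi.split("/")[2:]:
        if layer.startswith("t"):
            parities = {}
            for item in layer[1:].split(","):
                if item:
                    # Each item looks like "20-" or "7+".
                    parities[int(item[:-1])] = item[-1]
            return parities
    return {}


def stereo_differences(a: str, b: str) -> Dict[int, Tuple[Optional[str], Optional[str]]]:
    """List the stereocenters whose parity differs between two InChIStrings."""
    pa, pb = stereo_parities(a), stereo_parities(b)
    return {
        atom: (pa.get(atom), pb.get(atom))
        for atom in sorted(pa.keys() | pb.keys())
        if pa.get(atom) != pb.get(atom)
    }


searched = (
    "InChI=1/C20H24O10/c1-6-12(23)28-11-9(21)18-8-5-7(16(2,3)4)17(18)10(22)"
    "13(24)29-15(17)30-20(18,14(25)27-8)19(6,11)26/h6-11,15,21-22,26H,5H2,"
    "1-4H3/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1"
)
# Same string with the single differing center flipped, as found by Google.
found = searched.replace(",20-/", ",20+/")

print(split_inchikey("SQOJOAFXDQDRGF-MMQTXUMRBS"))
print(stereo_differences(searched, found))  # {20: ('-', '+')}
```

Two keys that share the first block but differ in the second describe the same connections with different (or missing) stereochemistry, which is exactly the case a visual check of the full string tends to miss.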

The question, unfortunately, remains. What IS the correct structure of Ginkgolide B? For now I have assumed that the one in the RSC article is correct and have added the structure to the database using the normal deposition process and have associated it with the RSC article and the blog discussions on ChemSpider. If it turns out it is not correct then I will leave the structure and the connection to the article but remove the identifier Ginkgolide B.

suppinfo.png


I’ve reported previously on the fact that we are now adding publication details to chemical structures on ChemSpider. We have introduced the ability to do this in a manual fashion where anyone can associate a blog article, wiki article or scientific publication directly with one or more chemical structures, but we are also looking to do this in an automated fashion. Project Prospect from the RSC seemed like an ideal opportunity for us to consider using their InChI association to harvest the titles and DOIs to make the association. This would be done after a discussion with the RSC to receive their blessing if possible, based on our previous interactions.

Today I investigated the possibilities of using the information available. I started with this article and clicked on the Enhanced HTML View (Prospect View) and then used the Toolbox to Show the Compounds. The article is about the “Chemistry and biology of resorcylic acid lactones”. A partial screenshot of the type of molecules discussed in the article is shown below.

radicicol.png

For the purpose of this blogpost notice the visually appealing forms of the structures and the stereochemistry on the molecules. Now, each of the marked compounds in the article is linked to details of the molecule. See below: each of the pink highlights is linked to the molecule and pops up a new box.

radicicol3.png

Looking at radicicol on Project Prospect and on ChemSpider we see a difference. In fact, compare with the structure shown above from the article. The difference is one of stereochemistry: there is no stereochemistry in the InChI or in the SMILES string. There are also issues with the structure depiction shown below and this has been discussed before relative to “Cleaning”.

radicicol2.png

Zearalenone on Project Prospect and on ChemSpider

zearalenone.png

As previously discussed with Ginkgolide B there can be many versions of a structure on the ChemSpider database. We recently introduced the ability to search on a “skeleton” as shown below in the new Structure Search options.

search-options.png

When the skeleton for zearalenone is searched (same skeleton excluding H) I found 15 hits. Some are shown below. Notice the difference (highlighted with red boxes) structure to structure in terms of the presence/absence of the double bond inside the cycle, the difference between the OH and the =O and the specified stereochemistry. This search can be very useful for finding related structures and more examples will be given in the near future of using such searches. We DO find versions without specified stereochemistry but we are presently working on approaches to relate the stereo/non-stereo versions of structures to each other in a very visual manner. More will follow…

skeletons.png
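For readers who want a feel for what “same skeleton excluding H” means, here is a rough sketch. It is an assumption-laden stand-in, not the ChemSpider search: it treats two InChIStrings as sharing a skeleton when their heavy-atom formula and /c connection layer match, and it ignores multi-component formulas. The zearalenone InChIs are not reproduced in this post, so three small molecules are used instead; ethanol and acetaldehyde differ only in OH versus =O, while dimethyl ether has the same atoms connected differently.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple


def heavy_atom_formula(inchi: str) -> str:
    """Return the formula layer with the hydrogen count removed."""
    formula = inchi.split("/")[1]
    # Tokenise into (element, count) pairs so that Hg, He etc. are not
    # mistaken for hydrogen.
    tokens = re.findall(r"([A-Z][a-z]?)(\d*)", formula)
    return "".join(elem + count for elem, count in tokens if elem != "H")


def connection_layer(inchi: str) -> str:
    """Return the /c layer (heavy-atom connections), or '' if absent."""
    for layer in inchi.split("/")[2:]:
        if layer.startswith("c"):
            return layer[1:]
    return ""


def group_by_skeleton(inchis: Dict[str, str]) -> Dict[Tuple[str, str], List[str]]:
    """Group named InChIStrings by (heavy-atom formula, connection layer)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for name, inchi in inchis.items():
        groups[(heavy_atom_formula(inchi), connection_layer(inchi))].append(name)
    return dict(groups)


examples = {
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "acetaldehyde": "InChI=1S/C2H4O/c1-2-3/h2H,1H3",
    "dimethyl ether": "InChI=1S/C2H6O/c1-3-2/h1-2H3",
}

for skeleton, names in group_by_skeleton(examples).items():
    print(skeleton, names)
```

Ethanol and acetaldehyde land in one group and dimethyl ether in another, which mirrors the way the skeleton hits above differ in OH versus =O and in double bonds while sharing the same connections.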


I’ve posted previously about analytical data deposition on ChemSpider. There are now fairly regular depositions going on and an increasing number of spectra are available online as listed here.

We have now enabled the deposition of CIF files onto the system and the first example is shown here, an association with a UsefulChem Deposition from JC Bradley’s laboratory and using the excellent Jmol viewing capabilities.

cif.png

Previously I blogged about trying to use the CrystalEye Open Data on ChemSpider and recently about the struggles of scraping the data set. Based on our experiences we’ve chosen not to scrape CrystalEye but to grab the CIFs, interpret them ourselves and associate them with the appropriate chemical structures on our database. Expect to see this happen over the next few weeks.

We have also supported the submission of images to associate with structures. The first example is here. Scroll to the bottom. This is simply an example of a “pill pack” associated with the structure of the drug. Of course such images could include pictures of crystals, colored solutions, microscopy images etc. Both capabilities can be accessed by clicking on the appropriate link on the record View page and following the simple instructions.

additions.png
