Archive for the Uncategorized Category

I am posting this in order to help one of my “neighbors”, IUPAC in Research Triangle Park. Their office is about 30 minutes from where I live. This is a beautiful area of the world and I encourage people to contact the Secretariat directly should you have an interest in this role.

Post-doctoral Position in Chemistry Informatics

Develop, implement, and support web based applications to enable IUPAC Staff and Committee members to work more effectively. The emphasis will be on development of tools for communication and collaboration to allow scientists working on IUPAC projects to accomplish their project goals while minimizing the need for travel. This will build on the new architecture of the IUPAC web site that uses XML technology to organize the information used by IUPAC members as well as the general scientific public. In addition, methods will be developed to organize and present IUPAC’s information, now contained in books and journal articles, to make it more accessible and more useful.

This position is located at the IUPAC Secretariat in Research Triangle Park, North Carolina, USA and will require considerable travel.

Required background: PhD or equivalent in Chemistry or a related discipline so as to combine a reasonable chemical knowledge with computing expertise; experience with SQL databases and XML coding; excellent written English and the ability to deal with multiple projects simultaneously.

Salary and benefits are competitive and will depend on experience and qualifications.

IUPAC was formed in 1919 by chemists from industry and academia. For almost nine decades, the Union has succeeded in fostering worldwide communications in the chemical sciences and in uniting academic, industrial and public sector chemistry in a common language. IUPAC is recognized as the world authority on chemical nomenclature, terminology, standardized methods for measurement, atomic weights and many other critically evaluated data. In more recent years, IUPAC has been pro-active in establishing a wide range of conferences and projects designed to promote and stimulate modern developments in chemistry, and also to assist in aspects of chemical education and the public understanding of chemistry.

More information about IUPAC and its activities is available at <www.iupac.org>.

Contact:

John W. Jost, Executive Director

IUPAC Secretariat

P.O. Box 13757

Research Triangle Park, NC 27709-3757, USA

E-mail: secretariatATiupacDOTorg

Buy me a Coffee

Peter Murray-Rust responded to my recent comments about a Free Lunch. There are a number of comments to be made and an exciting opportunity to use Open Data and linking from ChemSpider.

I’d asked the question about how many records there were on CrystalEye. In our world a unique record is a unique InChI, not so on CrystalEye and appropriately so as the crystal structure itself is presumably the unique record. Makes sense.

PMR> We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) does a good job on disambiguating by cell dimensions but this is not foolproof and indeed no method is.

What we will do with multiple crystal structures for a single chemical structure is link all unique crystal structures from the unique chemical structure. In this way people can query the chemical structure and find all associated analytical data - spectra and crystallographic files. If we were to list the number of unique depositions on ChemSpider I think we would be around 40 million depositions..an estimate though!
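As a rough illustration of that linking, here is a minimal Python sketch. It assumes each crystal structure entry already carries an InChI for its chemical structure; the record layout, the `cif_id` field and the function name are hypothetical and are not ChemSpider's or CrystalEye's actual schema.

```python
from collections import defaultdict


def link_crystal_structures(entries):
    """Group crystal-structure entry IDs under the InChI of their chemical structure."""
    by_inchi = defaultdict(list)
    for entry in entries:
        by_inchi[entry["inchi"]].append(entry["cif_id"])
    return dict(by_inchi)


# Made-up example records: two crystal entries share one chemical structure.
entries = [
    {"cif_id": "cif-001", "inchi": "InChI=1S/CH4/h1H4"},
    {"cif_id": "cif-002", "inchi": "InChI=1S/H2O/h1H2"},
    {"cif_id": "cif-003", "inchi": "InChI=1S/CH4/h1H4"},
]

linked = link_crystal_structures(entries)
for inchi, cif_ids in linked.items():
    print(inchi, "->", cif_ids)
```

Querying by chemical structure then becomes a single dictionary lookup that returns every associated crystal entry.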

PMR> The main duplication comes from the Crystallography Open Database which has about 45,000 structures.

I looked at the Crystallography Open Database this morning. It states on the home page “Updated daily: 68268 entries in the COD”. We may have an opportunity with the COD to link up to their data and reduce the need for us to host CIFs. Excellent…we’re all for reducing workload and providing links into other systems. It’s what we do.

PMR> The only thing stopping us putting them (AJW> The structures from CrystalEye)  in Pubchem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon. When it’s finished it will be in RDF/CML.

This is great news. This means that after the summer we can download the data directly via PubChem and link up to CrystalEye that way. Perfect. We’ll stop working on integrating to CrystalEye now, wait for the integration path via PubChem and focus on other data sources. Thank you Peter, Nick, Andrew and Jim!!! That said, I don’t believe that PubChem will take CML; they will convert it using their tools to produce their own compatible formats, InChI being one of them. That will break organometallics etc. UNLESS PubChem are going to adopt CML now, and that would be an interesting positive shift in terms of a sign of support for the format. A strong positive. I’ll chat with the PubChem team so that if CML is coming we can consider adopting it in some way and be ready.

From my post “AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.”

PMR: Chicken and egg… :-) You won’t adopt it until other people adopt it and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents) as in semantic publishing and the results of test-mining.

It’s nice to know that ChemSpider has that type of influence now. It’s good to see it going into Accelrys’ software and I had heard that from Dan’s blog and had added the CML Blog to my reader. I’m definitely watching and willing to follow. We’re busy leading so many other things right now we’ll wait for adoption and then jump on it like a “hobo on a muffin”.


I am off to Bio-IT in Boston this coming week and I am honored to have been asked to talk on ChemSpider. I wasn’t on the agenda as of 72 hours ago but was offered an opportunity as a result of a cancellation in one of the sessions. I’m looking forward to seeing what’s new in the world of informatics this year and due to be unveiled at Bio-IT. At SBS there was only one person in the room when I talked who had even heard of ChemSpider. I didn’t take offense…about 60 people went away informed! I hope for a similar opportunity at Bio-IT. The blog will be fairly quiet this week. Catch-up time is next week.

As an FYI, I recently wrote an article entitled “Public Chemical Compound Databases” in Current Opinion in Drug Discovery & Development 2008 11(3). The abstract is:

“The internet has rapidly become the first port of call for all information searches. The increasing array of chemistry-related resources that are now available provides chemists with a direct path to the information that was previously accessed via library services and was limited by commercial and costly resources. The diversity of the information that can be accessed online is expanding at a dramatic rate, and the support for publicly available resources offers significant opportunities in terms of the benefits to science and society. While the data online do not generally meet the quality standards of manually curated sources, there are efforts underway to gather scientists together and ‘crowdsource’ an improvement in the quality of the available data. This review discusses the types of public compound databases that are available online and provides a series of examples. Focus is also given to the benefits and disruptions associated with the increased availability of such data and the integration of technologies to data mine this information.”

It’s not an Open Access article but it’s out there if anyone is interested or is subscribed. Enjoy.


I’ve blogged previously about my view of webstat monitors such as Alexa and Compete. I place more trust in the things that we measure ourselves and can objectively analyze. Three of those measures are our own webstats, our Feedburner count for the blog and the number of registered users on ChemSpider who register to curate, annotate and add content. The numbers are as follows now:

Website stats: Over 4000 users per day

Feedburner: Oscillating between 475 and 525 people on a weekly basis

Registered Users: 728 as of today

We are now starting to see traffic to WiChempedia, a consistent stream of depositions, new links to external websites and an increase in the number of DOI linked articles. Overall, traffic is where we hoped it would be at this stage of exposure.


I’ve blogged previously about the work being done on PlayStations to support computational chemistry. Now, Zsolt Zsoldos, CTO of SimBioSys, announces that their “eHITS Lightning” program will be unveiled at Bio-IT in Boston next week. I sat in on his talk at ACS where he reviewed the initial results on the Cell processor on both PlayStation 3 and IBM Blade servers with the Cell processor. His talk is online as discussed on the SimBioSys blog. They are VERY impressive initial numbers. Is this the future…computational chemistry on gaming machines that can be purchased for about $400 and run free Linux? Why not…


I have blogged previously about our integration to IUCr. I am happy to announce that we have been busily curating their data and are presently active in indexing thousands of chemical structures and articles. We will definitely have problems in accurately representing organometallics with our present structure rendering and handling and due to the limitations of InChI. For the time being we are focused on organic molecules. As we index the molecules we are adding the Author and Title, linked via the DOI to the original article on the IUCr website, to the record. So, you will see something like this in the Supplementary Data section.

iucr-link.png

Expanding the index will grow the connectivities to IUCr structure by structure. We presently have >1300 structures on the dedicated website, iucr.chemspider.com. Enjoy.


Last week I gave two presentations - one at the American Chemical Society meeting in New Orleans and one at the Society of Biomolecular Screening meeting in St Louis. As I give public presentations or put together general informational presentations I will publish them to our presentations page for review. Both the ACS and SBS presentations are now available online for anyone to review and comment. Feel free to post your comments here on this blog post or provide feedback offline.


With the rollout of the new website we are now set up to provide DEDICATED websites to subsets of data. We will unveil a number of these in the next few weeks as examples of how we can host data with a dedicated purpose. As an example of a website I point you to a dedicated Molecule of the Day website. I have discussed Molecule of the Day recently. Now we unveil the website here. Notice the logo on the upper right hand side.

motd.png


To celebrate our one year anniversary since going live at ACS Spring in Chicago in 2007 we lifted the title of beta from ChemSpider and released a new website design. I’ve already received some positive feedback on the new design and on us growing the service over the past year. We welcome any and all feedback though. What do you think of our contributions over the past year? What do you think of our new website? Let us know!

And Happy 1st Birthday ChemSpider…


For anyone with an interest in Chem(o)informatics and who lives in the Research Triangle area I highly recommend taking advantage of the lecture below to listen to “one of the greats” - Professor Johann Gasteiger. I guarantee you will not be disappointed.

Chemoinformatics - Addressing fundamental topics of a chemist

Prof. Dr. Johann Gasteiger

Computer-Chemie-Centrum
Institut für Organische Chemie I
Friedrich-Alexander-Universität Erlangen-Nürnberg
Nägelsbachstraße 25
91052 Erlangen
Deutschland / Germany

Thursday, April 17th, 2008

3:00PM -5:00PM

Kerr Hall, Rm. 2001, School of Pharmacy

University of North Carolina

Chapel Hill, NC 27599

Host: Prof. Alexander Tropsha

Sponsor: CECCR


During a recent discussion about ChemSpider, interest was expressed in whether or not ChemSpider would be supporting toxicity and safety data. It’s been on our list for a while but questions result in action…so, check out the following links. Scroll to the bottom to the supplementary information.

Sodium Acetate Trihydrate

sodium-acetate.png

Benzoyl Peroxide

benzoyl-peroxide.png

There are about another 3000 records with such information on the website now. Click on any of the question marks and up pops a dialog box explaining what the property is…and admittedly the example below is rather obvious!

help-boxes.png

Also, notice the wiki link wikilink.png which takes you out to the originating site for the data.

As an example of the process used to map the fields see below how we take the original fields and then map them to other fields to “homogenize”. Notice the meta info layer too, specifically the associated units.

mapping-fields.png

Mapping choices are made according to the pulldown menu shown below.

mapping-fields-2.png

The process to gather, map and publish this data has now been tested on two different datasets. It is not yet perfect but we improve with every iteration and, I believe, will shortly have an optimized process for scraping and publishing. I believe that the processes we are developing here will provide a smooth and highly functional system for gathering and depositing data, integrating it efficiently into our database, and making it easy to expand the data and knowledge associated with the structures on ChemSpider.
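To make the mapping step a little more concrete, here is a minimal Python sketch of field homogenization. The source field names, target names, units and values are all hypothetical placeholders, not the actual fields or mapping table used on ChemSpider; the point is only to show source fields being renamed to a common vocabulary while a unit is attached as meta information.

```python
# Hypothetical mapping table: source field name -> (homogenized name, unit).
FIELD_MAP = {
    "Melting point": ("melting_point", "°C"),
    "mp": ("melting_point", "°C"),
    "Flash pt.": ("flash_point", "°C"),
}


def homogenize(record, field_map=FIELD_MAP):
    """Split a scraped record into mapped fields (with units) and unmapped leftovers."""
    mapped, unmapped = {}, {}
    for name, value in record.items():
        if name in field_map:
            target, unit = field_map[name]
            mapped[target] = {"value": value, "unit": unit}
        else:
            unmapped[name] = value
    return mapped, unmapped


# Placeholder record, as it might come from a scraped page.
mapped, unmapped = homogenize({"mp": "58", "Colour": "white"})
print(mapped)
print(unmapped)
```

Keeping the unmapped fields separate, rather than discarding them, leaves them available for a later mapping decision.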


Molecule of the Day (MOTD) is one of those fun blogs that the public will likely enjoy…and there’s enough there for even us chemists to remember just how much fun chemistry is. It seemed like a good idea to try out the ability to link URLs to ChemSpider on a few of the Molecule of the Day articles. About an hour’s manual work and the entire MOTD blog archive could be made structure searchable. As it is I’ve linked up a few tonight….about 60 seconds each to click on the MOTD blog post, search the structure in ChemSpider and paste the article title and URL. Voila…searchable Molecule of the Day. Some examples linked up…scroll to the bottom of the record:

Sodium Acetate Trihydrate

Benzoyl Peroxide 

1,3,5-Trioxane


Excuse the silence of late regarding this blog. There has been, and continues to be, a lot of work going on in preparation for the rollout of a new look and feel for ChemSpider, hopefully to coincide with the anniversary of our rollout at Spring ACS in Chicago last year. Fingers crossed we will lift the beta label next week for New Orleans.

In celebration (it’s all serendipity actually!) I will be giving a talk on Sunday in New Orleans, coincident with the day we went live. The details and abstract are below (of course…when the abstract was written it was a placeholder..I had no idea what would be presented because we didn’t know where we’d be)

PAPER ID: 1168843
PAPER TITLE: “ChemSpider: Building a structure-centric community for chemists”

DIVISION: Division of Chemical Information
SESSION: Cheminformatics: From Teaching to Research
SESSION START TIME: Sunday, April 6, 2008, 1:40 PM
DAY & TIME OF PRESENTATION: Sunday, April 6, 2008 from 3:40 PM to 4:05 PM
LOCATION: Marriott Convention Center, Room: Blaine Kern D

ChemSpider – Building a Structure Centric Community for Chemists

Antony J. Williams

Scientists commonly find themselves in a state of overwhelm in regards to the availability of information accessible to them. The distribution of resources now includes the entire space of the worldwide web, access to primary databases such as CAS and, commonly, a plethora of internally developed systems. While the web has provided improved access to chemistry-related information there has not been an online central resource allowing integrated chemical structure-searching of chemistry databases, chemistry articles, patents and web pages such as blogs and wikis. ChemSpider has built a structure centric community for chemists by providing free access to an online database and collaboration tool for chemists. The online database offers an environment for curating the data on ChemSpider as well as the deposition of chemical structures, analytical data and associated information and provides a significant knowledge base and resource for chemists working in different domains. An overview of present and future capabilities will be given.

On Tuesday I fly off to St Louis to the SBS Meeting presenting on ChemSpider and the ongoing collaborations around NISS/NCSU’s ChemModLab and SimBioSys’ LASSO. My talk is on Thursday morning.

 

How a Structure-Centric Community for Chemists Can Benefit Drug Discovery - Virtual Screening Experiments Utilizing a Publicly Accessible Ligand Database, QSAR Modeling Tools and a Virtual Docking Software Package

 ChemSpider is an online database of over 20 million chemical structures assembled from almost a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) In-silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples using the software outlined below.

The QSAR analyses utilize the ChemModLab environment which is a free, web-based toolbox for fitting and assessing quantitative structure-activity relationships. Its elements include a cheminformatics front end to supply molecular descriptors, a set of statistical methods for fitting models, and methods for validating the resulting model. Five molecular descriptor sets are used with 16 math modeling methods to give a total of 80 QSAR models. The input is a file of compounds and a text file for biological activity.

The in-silico docking experiments are conducted using a combination QSAR/Docking approach using the SimBioSys eHITS and Lasso software programs. The docking procedure allows for the screening of a complete molecular database to obtain the correct binding poses and estimated binding affinities. The ligand based screening tool utilizes a novel conformation independent 3D QSAR descriptor, ideally suited for scaffold hopping.

If you are interested in what we’re up to and where we’re going I hope we get to meet-up at one of these gatherings!


ChemZoo Joins Microsoft BioIT Alliance

Wake Forest, NC, March 27th, 2008 — ChemZoo Inc. announced today that it has joined the Microsoft BioIT Alliance, a cross-industry group working to enhance collaboration among life sciences organizations to accelerate the pace of drug discovery and development.

“Collaborative science is facilitated by a common development platform and cooperation between informatics companies,” said Antony Williams, President of ChemZoo. “The ChemSpider Free Access website has been delivered with the intention of building a structure centric community for chemists and is happy to contribute to the Microsoft BioIT Alliance initiative. As a leader in the domain of online chemistry data management we look forward to collaborating with Microsoft BioIT alliance partners. In these efforts we believe that ChemSpider as a common public access data platform will allow BioIT members to source information and connect to a database allowing their users to solve problems more quickly. By providing a platform for crowdsourcing enhancement of an online chemistry resource researchers around the world will have access to community-based data and knowledge.”

Rudy Potenzone, Worldwide Industry Technology Strategist for Pharmaceuticals and BioIT Alliance Director, said: “We are thrilled to have ChemZoo join the Alliance.  The ChemSpider team has created in a short time an important free resource for the scientific community in an innovative and collaborative environment that will be a model for future projects in scientific information.  We welcome their participation.”

 The BioIT Alliance is designed to enable collaboration among organizations in the life sciences field in order to shorten the time between discovery of new biological data and the application of that knowledge to human health. ChemZoo will enhance their ChemSpider platform in collaboration with other alliance members to provide access in a manner enabling collaborative science and, where appropriate, public chemistry-based knowledge management.

About ChemZoo

ChemZoo, Inc., was founded with the intention of providing online chemistry software and services to help build a chemical structure centric community for chemists. Their first offering, ChemSpider, is a chemistry search engine built with the intention of aggregating and indexing chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge. Founded in 2007 and located in Raleigh, North Carolina, ChemSpider intends to become a facilitator in the exchange of structure-based information between chemists worldwide. For further information, see www.chemspider.com

About BioIT Alliance

Formed in 2006, the BioIT Alliance is a cross-industry group working to integrate science and technology in order to accelerate the pace of drug discovery and realize the potential of personalized medicine. Founding members include Accelrys Software Inc., Affymetrix, Inc., Agilent Technologies Inc., Amylin Pharmaceuticals, Inc., Applied Biosystems, The BioTeam Inc., Digipede Technologies LLC, Discovery Biosciences Corporation, Geospiza Inc., Hewlett-Packard Development Company, L.P., Illumina Inc., InterKnowlogy, Microsoft Corporation, Sun Microsystems Inc., The Scripps Research Institute, VizX Labs LLC and other key companies in the pharmaceutical, biotech, hardware and software industries. Additional information about the BioIT Alliance can be found on the BioIT Alliance Web site at http://www.bioitalliance.org.

About Microsoft

Founded in 1975, Microsoft (NASDAQ: MSFT) is the worldwide leader in software, services and solutions that help people and businesses realize their full potential. For more information, please visit www.microsoft.com.


On March 24th 2007 we gave birth to a “Spider”…it arrived as most newborns do..coughing, spluttering, belching and not too pretty. Nevertheless, it’s growing nicely, has been cleaned up a lot since day one and is developing new friends. ChemSpider is part of the family of Free Access databases and definitely has a place on the team.

With ChemSpider’s first birthday we should take time to acknowledge our firsts in the Free Access structure database environment (these are based on my knowledge of what’s available and I welcome any corrections).

1) First real time online curation system for structure-identifier pairs

2) First real time structure-deposition system - association with publications, analytical data and supplementary information

3) First system to use InChIStrings and Keys as a basis of structure deduplication and searching

4) First system to allow online deposition of analytical data, images and CIF files associated with structure

5) First system to “take a stab” at offering support for Open Notebook Science with supporting web services and integration to a structure collection.

We thank all of our supporters, users, advocates and naysayers for your feedback and comments regarding ChemSpider. Your input has been very valuable in helping to improve our service. In recognition of their contributions I name the following people:

Curator of the Year: Barrie Walker, UK

Depositor of the Year: Chris Singleton, US

Feedback of the Year: Heinz Kolshorn, Germany

Collaborator of the Year: Will Griffiths, Chemrefer (and now member of ChemSpider)

Community Builder of the Year: Joerg Wegner (Belgium?)

What are we working on now and in the near future?

1) A system for integrated text and structure searching of Open and Free Access scientific publications

2) Mounting multiple data sources online including an expanded dataset for IUCr integration, the Wikipedia Chemistry dataset, multiple vendor datasets (over a million additional compounds), a new SureChem patent integration dataset, safety data from MSDS sheets

3) The division of chemical names and synonyms from database IDs for ease of navigation through the system

4) The scraping and aggregation of CIF files from multiple sources and publication online

5) The deposition of new spectral datasets supplied by collaborators and friends of ChemSpider.

6) Unveiling the batch deposition process to allow users to deposit tens of thousands of structures should they wish

7) Additional support for Open Notebook Science advocates

8) Supporting substances which cannot be easily represented in a structure format but can be represented using a registry number - minerals, polymers, mixtures etc

9) Website redesign for improved ease of use and navigation

Just like the Queen of England has two birthdays ChemSpider will, all being well, celebrate the first day of Spring ACS by lifting our Beta label from our website in April. We went live on the day before Chicago’s Spring ACS last year and have kept going ever since…


I am going to be in New Orleans from the 5th-8th of April before heading off to the SBS Conference in St Louis. If anyone is interested in chatting about ChemSpider and our future directions please ping me at antonyDOTwilliamsATchemspiderDOTcom and maybe we can get together for a coffee?


There are a number of groups in the “free access to chemistry information” domain at present and all are working hard to provide access to data, knowledge and connectivities to serve the chemistry community. Two of the most common questions I get asked are in regards to the difference between ChemSpider and PubChem and between ChemSpider and eMolecules. Yesterday I was asked the question about the difference in regards to eMolecules three times. So, overnight I put together what I hope is an objective comparison of capabilities. I welcome any feedback or additional questions. It’s a living document as both of our sites are changing (and I know ChemSpider better than I know eMolecules of course).

I know members of the eMolecules team from my previous role when I facilitated the connection between eMolecules (or Chmoogle as it was then) and ChemSketch.  I then had the pleasure of meeting with their VP of Sales at the Chicago ACS meeting one day after ChemSpider went live and we discussed our opinions about our mutual intentions to deliver value to the community. I like the eMolecules offering. There are some nice visual elements on the site. ChemSpider IS different and has a different focus based on what I see eMolecules delivering. We are out to build a structure centric community for chemists. I believe eMolecules is focused on delivering a centralized resource for sourcing chemicals for purchase and has a business model based on advertising and delivering websites for chemical vendors. We will shortly complete our first “depositor skin” that will provide to one of our collaborators a way to display their content only from ChemSpider and “branded” with their logo etc. so will offer a similar service to depositors.

A number of people now visit ChemSpider asking us where they can source a particular chemical. If we cannot find it on ChemSpider then I do visit eMolecules and point people to the link if I find it. At present that’s only about 10% of the time. Despite the fact that ChemSpider is about 2.5 times bigger as a database than eMolecules, their focus is commercial vendors and at present they do have more commercial vendors than us. Our collection is growing at about 2-3 new depositors per week, mostly chemical vendors requesting that we add them to our database. Some people think that ChemSpider is simply a rewrapping of the PubChem database. On day 1 we went live with only the PubChem collection but the data sources collection is much more diverse now and we actually deposited back to PubChem (which I don’t believe eMolecules has yet?). Our structures are unique..but you MUST be careful with that consideration. For example, there ARE multiple flavors of Taxol on ChemSpider but the same is true of Taxol on eMolecules. Actually, the reality is that there are multiple flavors of the “Taxol Skeleton” on ChemSpider (42 to be precise! http://www.chemspider.com/q/RCINICONZNJXQF) but NOW, after our curation and redirection efforts, there is only one compound, the CORRECT one, that will be retrieved based on a search on the name Taxol (http://www.chemspider.com/q/taxol), relative to the seven on eMolecules where you have to determine for yourself which one is Taxol. The 42 Taxol skeletons include multiple stereochemistries and isotopically labeled compounds (C-13, C-11, Tritium, Deuterium etc). So, be careful when people talk about unique structures!
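To illustrate why “unique” needs care, here is a minimal Python sketch that groups structures by skeleton. It assumes InChIKey-style identifiers in which the first hyphen-separated block hashes the connectivity (the skeleton) and the later blocks cover details such as stereochemistry and isotopic labeling. The keys below are made-up placeholders, not real compounds, and the function names are hypothetical.

```python
from collections import defaultdict


def skeleton(inchikey):
    """Return the first hyphen-separated block, assumed to hash the connectivity."""
    return inchikey.split("-", 1)[0]


def group_by_skeleton(inchikeys):
    """Map each skeleton block to the sorted set of full keys that share it."""
    groups = defaultdict(set)
    for key in inchikeys:
        groups[skeleton(key)].add(key)
    return {block: sorted(keys) for block, keys in groups.items()}


# Placeholder keys: three distinct structures, two of which share a skeleton.
keys = [
    "AAAAAAAAAAAAAA-BBBBBBBBBB-N",
    "AAAAAAAAAAAAAA-CCCCCCCCCC-N",
    "ZZZZZZZZZZZZZZ-BBBBBBBBBB-N",
]

groups = group_by_skeleton(keys)
print(len(set(keys)), "unique full keys,", len(groups), "unique skeletons")
```

Counting by full key and counting by skeleton give different answers for the same collection, which is exactly the caveat above.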

It would be great to have the eMolecules collection in ChemSpider and direct traffic to their site and to their commercial vendors and extend the community. What do you think?


Late last week I posted about a message left by CAS on the Wikipedia pages regarding the CAS Number Validation project we had started at Wikipedia Chemistry to assist in producing a quality curated dataset. There was an interesting outpouring of judgement and negativity regarding the message posted by CAS, balanced by a hope for resolution and new connections. My hope in building the community of chemists at ChemSpider is to do so while staying in relationship with key players in the domain. I prefer to do things with permission and, while I will cajole and encourage shifts, I choose to respect the expectations of the players.

This week conversations have been ongoing between WP:Chem and CAS. The conversations have been conducted by Martin Walker, a member of the WP:Chem team as well as the ChemSpider Advisory Group. Martin and I have similar opinions in regards to how to participate in the community and I honor his approach in working through this potentially difficult situation. The outcome of the discussions is declared here on Wikipedia.

New announcement from CAS

CAS, a division of the American Chemical Society, is pleased to announce that it will contribute to the Wikipedia project. CAS will work with Wikipedia to help provide accurate CAS Registry Numbers® for current substances listed in Wikiprojects-Chemicals section of the Wikipedia Chemistry Portal that are of widespread general public interest.

The CAS Registry is the world’s most comprehensive collection of chemical substances and the CAS Registry Number is the recognized global standard for chemical substance identification.

CAS views Wikipedia as an important societal tool for the general public, and this collaboration with Wikipedia is in line with CAS’ mission as a Division of the American Chemical Society.

We look forward to working with the Wikipedia volunteers over the next few weeks to make this happen. Eshively (talk) 13:40, 12 March 2008 (UTC)

I think this is excellent. I implicitly agree with the statement “The CAS Registry is the world’s most comprehensive collection of chemical substances.” For CAS to offer support to the Wikipedia team for the curation project is, for me, an indication of commitment to public service and I am indebted to the participants in this decision. I’m excited to get back underway with the curation project and will start up my efforts again this weekend. This decision by CAS has invigorated me to keep eyeballing structures as fast (and carefully) as possible.

My sincere appreciation is extended to the CAS management team and decision-makers. My gratitude to WP:Chem for staying engaged in the conversation to get to this outcome. My encouragement to us all to get this project done and have a high quality validated dataset of chemicals available as a public resource. Onwards and upwards!


I have written previously about the joys and frustrations of participating in the blogging community. One of my greatest joys is what I learn from others in the domain. In regards to the discussions about CAS Numbers I was very interested to read this comment from PhysChim62, a member of the Wikipedia:Chem team, regarding the European Union Database Directive.

I have extracted certain statements from that comment: “The restrictions it purports to impose on the reuse of its product would appear to breach anti-trust legislation on both sides of the Atlantic. Users of CAS databases in the European Union can take heart from Art. 8.1 of the Database Directive (96/9/EC): The maker of a database which is made available to the public in whatever manner may not prevent a lawful user of the database from extracting and/or re-utilising insubstantial parts of its contents, evaluated qualitatively and/or quantitatively, for any purposes whatsoever.”

Now I get to go off and read up on the Database Directive…While ChemSpider is trying to build a structure-centric community for chemists, I have to acknowledge the benefits to myself personally in terms of what I am learning from the participants in this community, and specifically the people commenting on the blog. Thank you to those of you who comment!


The recent post regarding CAS numbers and Wikipedia has stirred up some great conversation and responses and I point you to the comments to peruse. For now I want to comment on one made by Cameron Neylon on his blog. I point you to his post to read first rather than me lifting it from him and posting it here. It’s respectful of his work. Also, you may choose to add him to your Google Reader. Cameron is a great advocate of Open Notebook Science and I encourage you to visit his site.

OK…Did you read it???

OK…I am now lifting certain comments and wish to state my own views.

Cameron said “So we would ideally want to expose InChi, InChiKey, SMILES, CML perhaps, PubChem Ids etc. These can all be converted one to the other using web services so we don’t need to type all of them in manually.”

At ChemSpider we likely have more experience than most in interconverting millions of structures from/to various other formats. They all have their own limitations.

InChI, in both of its formats, is limited in a number of ways. The limitations are acknowledged and being worked on: for instance, polymers, inorganics, organometallics, mixtures of specific stoichiometries and Markush structures (not an issue for Cameron’s material in a bottle).

SMILES comes in so many different flavors that it can be very distressing. Even the most popular cheminformatics vendors can be incompatible…believe me, we’ve seen it in many ways!

CML…we haven’t worked with it yet on ChemSpider but remain interested. However, uptake seems to be very limited after maybe a decade of being available to the domain. InChI took off and proliferated at an incredible rate while CML has been around a lot longer and, as far as I know from 10 years in the cheminformatics business, has low adoption. It doesn’t mean it’s not the solution but if it is then it needs to be adopted by the masses.

PubChem IDs…there are substance IDs (SIDs) and compound IDs (CIDs). So, a decision will be necessary there. More and more I am seeing PubChem IDs listed anyways…they are in the Aldrich catalog for example. The work will come with the curation of the data - making sure that people can find the “appropriate ID” for a compound. Check out my earlier posts about the need for curation (1,2,3,4 and many others).

The CAS Registry is very highly curated and CAS is the authority for the CAS numbers. PubChem are, of course, the authority for their IDs too, but compounds can be deprecated from time to time as depositors find their own errors, and there have been so many depositors with different quality standards to date that cleaning up the database is a major challenge. While they could do it, it is not their mandate today, they are not funded to do so, and it would be an enormous undertaking that would likely need to involve some form of crowdsourcing via online curation as we are doing here at ChemSpider.

Cameron said “The CAS number so appealing; it is short, easily typed in, and printed on most bottles.”

‘Tis so. And asking the vendors to move away from it won’t work. Adding a PubChem ID might work but that’s a big shift too and I believe they would need to have guarantees about the long term future of PubChem and its database and funding to buy into that. Also, a MASSIVE validation exercise. If the companies ended up depositing their OWN compounds to get PubChem IDs I believe all hell will break loose…it’s already going on by the way of course, when they deposit. Their compounds go through internal processes at PubChem and come out the other side as deposited structures. Is every one that went in deposited exactly as it was supplied? In theory yes. In reality? We have the same issues at ChemSpider…not easy. By the way, the CAS number with check digit and specific format is much nicer than “just a number”. Maybe we should do the same with ChemSpider IDs…new format, plus a check digit?
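For anyone who has not looked at how the check digit works, here is a minimal Python sketch of validating a CAS number, based on the publicly documented format (two to seven digits, two digits, one check digit) and checksum rule (weight the digits before the check digit 1, 2, 3… from right to left, sum them, take the result modulo 10). The function names are my own for illustration; this is not CAS’s code, and passing the check only means the number is well formed, not that it is assigned to the substance you think it is.

```python
import re

# CAS number layout: 2-7 digits, hyphen, 2 digits, hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def cas_check_digit(body_digits: str) -> int:
    """Compute the check digit for the digits that precede it.

    Digits are weighted 1, 2, 3, ... starting from the rightmost one;
    the check digit is the weighted sum modulo 10.
    """
    total = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(body_digits), start=1)
    )
    return total % 10


def is_valid_cas(cas: str) -> bool:
    """Return True if the string has the CAS format and a matching check digit."""
    match = CAS_PATTERN.match(cas.strip())
    if match is None:
        return False
    first, second, check = match.groups()
    return cas_check_digit(first + second) == int(check)


if __name__ == "__main__":
    # 33069-62-4 is the number quoted for Taxol later in this post.
    for candidate in ("33069-62-4", "33096-62-4", "33069624"):
        print(candidate, is_valid_cas(candidate))
```

A check like this catches most single-digit typos and many transpositions, which is exactly why a formatted identifier with a check digit is nicer than “just a number”.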

Cameron said “I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers”

Hmmm…really? Check out this link. Compare the snippet here from the Taxol Drugbox on Wikipedia

[Image: drugboxtaxol.png]

with the taxol record here on PubChem and the synonyms list. You are looking for 33069-62-4.

[Image: taxolcas.png]

When you find it click on it and you will find 6 structures for Taxol. The issue is curation. Which structure is right?

Why does the story that there are no CAS numbers on PubChem continue to proliferate??? Sure, they are not called CAS Numbers but it’s what they are. Depositors simply put them in as identifiers and PubChem don’t have to remove them. They respect the depositors’ right to deposit their identifiers…whether they are CAS numbers or not.

Now, I agree robotic curation can help with these issues and the RDF approach already being discussed for Wikipedia and ChemSpider (with Egon) can be useful in helping to link together resources and, if adopted by companies such as Aldrich etc, can be of great value in helping to clean up some of the issues. But, it is only part of the solution. The need for manual curation is being missed. Robots are already making a bigger mess in my opinion. Manual curation is a must.
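To make the division of labor concrete, here is a small Python sketch of the kind of robotic check I have in mind: it does not decide which structure is right, it only flags identifiers that have been deposited against more than one distinct structure so that a human curator can look at them. The function name, the input shape (pairs of identifier and structure key) and the sample records are all hypothetical placeholders, not ChemSpider or PubChem data or code.

```python
from collections import defaultdict


def flag_ambiguous_identifiers(deposits):
    """Return identifiers that map to more than one distinct structure.

    `deposits` is an iterable of (identifier, structure_key) pairs, where
    structure_key is any canonical structure string (for example an InChIKey).
    The result maps each ambiguous identifier to its sorted structure keys.
    """
    structures_by_id = defaultdict(set)
    for identifier, structure_key in deposits:
        structures_by_id[identifier].add(structure_key)
    return {
        identifier: sorted(keys)
        for identifier, keys in structures_by_id.items()
        if len(keys) > 1
    }


if __name__ == "__main__":
    # Hypothetical deposits: the structure keys are placeholders, not real data.
    sample_deposits = [
        ("33069-62-4", "STRUCT-A"),
        ("33069-62-4", "STRUCT-B"),
        ("33069-62-4", "STRUCT-A"),  # same structure deposited twice
        ("7732-18-5", "STRUCT-W"),
    ]
    for identifier, keys in flag_ambiguous_identifiers(sample_deposits).items():
        print(f"{identifier}: {len(keys)} candidate structures, needs manual review")
```

The robot builds the review queue; picking the correct structure out of that queue is still the manual part.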

Cameron said “The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free then it would be driving traffic to its own value added services.”

I have made this statement to many people over the years. The Registry of chemical structures has little value as a collection of structures. It’s what’s connected that has value. The CAS Numbers, the patents, the literature articles, the vendors. It’s the same with PubChem and ChemSpider. Who cares how many structures we have? We can generate HUNDREDS OF MILLIONS for you! It’s the associated information. I have always thought that CAS should provide an internet service that is simply a CAS lookup. Search a number and see the structure and/or substance detail. End of story. You want patent details, papers, vendor details…you pay. It’s a transaction…same as STN now. In fact, do a study, see how many searches are done just to relate CAS numbers and structures, figure out the loss in revenue if you “give it away” and shift it over to the transaction charge to look at “more info”. Certainly the giveaway would help with the public relations! Oh…and you “could” make it a structure search to get a CAS number too…more work but of course possible.

Cameron said “…do people agree that CID is a good standard index to aggregate around?”

No…not yet. There is so much to be done before that can happen in my opinion. A lot.

What I’d rather do, and maybe I am a dreamer, is get into relationship with ACS/CAS and try to establish a tipping point of support to where they see it is good for their business and the community as a whole. I prefer staying in relationship if possible. That said, the effort around CIFs from ACS via CrystalEye has not moved forward as explained here, so maybe my vision will not work in the case of CAS numbers either. Either way, we made a decision not to scrape CrystalEye anyway and this shows perfectly the issues of SMILES and InChI giving issues of structure representation!

I say let’s not abandon hope regarding CAS opening their numbers to the world just yet. This dialog is likely sparking discussions already. Let’s keep it out there and establish a groundswell of concern and support and hope that the right thing can happen for our good and for CAS. I have great respect for many of their people and their work and want the resolution to be appropriate for all parties. Let’s hope…and if hope doesn’t work then I encourage robotic and manual curation…the system is ready on ChemSpider. Come and help out!


Ever since ChemSpider went online we have been committed to allowing community curation and annotation. We have done this…in spades. We have introduced the ability to Post Comments for curators’ approval, we have allowed the