Archive for the Community Building Category

Today I had the privilege of meeting with many members of the team creating the RCSB Protein Data Bank. This resulted from the wonderful networking opportunity offered by the Scifoo camp held earlier this year at Google where I met Helen Berman, director of the PDB team, part of the worldwide Protein Data Bank. Helen and I shared some conversations sitting outside the Google offices in California and shared our opinions and visions regarding the quality of small molecule data available online. Today was an opportunity to take those conversations further, meet with members of the team and determine whether ChemSpider’s efforts could bring benefit to the PDB in terms of our curation efforts and whether ChemSpider users could benefit from having access to information on the PDB via hosting of the PDB ligand dictionary.

I gave a presentation (online here and based on others I have delivered previously) and received a one-on-one review of the deposition and curation processes of the PDB, as well as participating in a group discussion about how to continue the stringent and exacting process of validation and curation associated with small molecule structure sets. We discussed the complex relationships between systematic names, trivial names, registry IDs, database IDs, tautomers, charged states, SMILES and InChIs. It was a particularly validating day to spend time with a group of people who have responsibility for building one of the most valuable resources in the world and have faced the many challenges associated with validating structure-based data. There is a distinction between people who talk about what it takes to curate structure collections and those who actually do the job for a living. This team is made up of dedicated, passionate and skilled individuals who care deeply about the quality of their data and who do the heavy lifting and grunt work so that the users of the PDB enjoy the benefits. They have been working on a multi-year process to curate and improve the PDB data and are in the final major phase of the effort to clean up the archive and apply the processes to all new data moving forward. ChemSpider and PDB will be more integrated in the near future and we look forward to supporting their efforts for providing high quality structure data to the community and continuing to expand the network of integrated online chemistry.

I think the press release here, and copied below, speaks for itself… When I posted the blog about the need for an InChIKey Resolver it resulted in a great discussion and series of comments. Since that time I’ve had many discussions with interested parties about the need. The RSC and ChemSpider share a mutual view regarding the need for the InChI resolver and we are honored to be entrusted to develop a resolver for the community. Will it be “the” resolver… only time will tell. There are various ways to deliver a system to do this so we’ll start here and garner feedback. There are many ways to “hunt a Welshman” (I can say that since I’m Welsh!) so there may be other efforts to deliver a resolver coming too.

“RSC and ChemSpider develop InChI Resolver

01 December 2008

An InChI Resolver, a unique free service for scientists to share chemical structures and data, will be developed by a collaboration between ChemZoo Inc., host of ChemSpider, and the Royal Society of Chemistry. 

Using the InChI – an IUPAC standard identifier for compounds – scientists can share and contribute their own molecular data and search millions of others from many web sources. The RSC/ChemSpider InChI Resolver will give researchers the tools to create standard InChI data for their own compounds, create and use search engine-friendly InChIKeys to search for compounds, and deposit their data for others to use in the future. 

The future of publishing

‘The wider adoption and unambiguous use of the InChI standard will be an important development in the way chemistry is published in the future, and the further development of the semantic web,’ comments Robert Parker, Managing Director of RSC Publishing. 

The InChI Resolver will be based on ChemSpider’s existing database of over 21 million chemical compounds and will provide the first stable environment to promote the use and sharing of compound data. ‘ChemSpider hosts the largest and most diverse online database of chemical structures sourced from over 150 different data sources’ adds Antony Williams of ChemSpider, ‘We have embraced the InChI identifier as a key component of our platform and the basis of our structure searches and integration path to a number of other resources. We have delivered a number of InChI-based web services and, with the introduction of the InChI Resolver, we hope to continue to expand the utility and value of both InChI and the ChemSpider service.’ 

Society support

‘As a learned society publisher it is important that RSC provide support for the standard and contribute to the development of the resolver, which promises to be a valuable service for the chemical science community.’ continues Parker, ‘our collaboration with ChemSpider on this project will enable this to be delivered quickly and sustainably.’ 

The imminent adoption of the InChI generation protocol will be a welcome and necessary step to the wider adoption of the InChI standard.”

I want to acknowledge the support of one of our sponsors. For the past year or so we have had the following image on our home page. It’s pretty clear what this is for – molecular models. Very soon after we went live with ChemSpider, Stephan Logan of Indigo Instruments and I chatted about what our intention with ChemSpider was in terms of providing a community resource for details about chemistry. Stephan offered his full support to us and I’ve enjoyed all of our discussions since then. I thank him and Indigo Instruments for their support and, if you ever want a molecular modeling kit for your hands instead of your computer screen, I recommend contacting them! An example structure for Xanax, in one of its flavors, is shown below. We’re looking forward to a very “integrated” relationship with Indigo Instruments in the near future. Watch this space…

Chemical structure drawing packages are an essential tool for chemists in this day and age. There are many free offerings from chemistry software vendors given to the community. Companies such as ChemAxon, ACD/Labs, Cambridgesoft and others give their structure drawing packages to the community (as well as many other tools for some companies). Now a new structure drawing package has entered the freeware offerings and this is likely going to stir up the community. Many chemists will remember MDL and their ISIS Draw offering. For the longest period of time ISIS Draw and Cambridgesoft ChemDraw were the “most accepted” drawing packages – publishers would accept molecular structures submitted only from these applications. Despite the growth in popularity of other applications these remained the preferred tools. In industry these two held the primary position. In my previous role as Chief Science Officer at ACD/Labs I was very happy about the fact that we offered the ACD/ChemSketch package to the community; it is still available and new versions are released on an ongoing basis. It’s a great package and almost a million downloads have been given away.

Now, however, Symyx Draw is available for free for academic and personal use. “Symyx Technologies, Inc. (NASDAQ: SMMX) today announced that Symyx Draw 3.1, the chemical drawing application that replaces industry-leading ISIS/Draw, is now available for download at no charge for academic and non-commercial personal use. Symyx Draw 3.1 enables scientists to draw and edit complex chemical structures and reactions with ease, facilitating the collaborative searching, viewing, registering, and archiving of scientific information. To support academic researchers, Symyx Draw 3.1 also offers exceptional publication-quality drawing capabilities for presentations, reports and scientific papers, as well as improved integration with the Microsoft® Office suite of software applications.”

Symyx Draw is a great replacement for ISIS Draw. It offers capabilities not previously offered in ISIS Draw. In particular, in my own area of interest, it INCLUDES the generation of IUPAC Names and conversion of chemical names to structures. I tested some quite complex structures, converting in both directions. Name generation from complex structures was very good. Conversion of chemical names back to structures worked well but I could not see how to generate the stereocenters in the resulting structures. I like the ability to search for InChIKeys across the internet directly – good thinking (and of course the searches I did took me to ChemSpider in 90% of the cases…)

Symyx Draw is offered with a whole series of add-ins that will extend its already excellent capabilities. I won’t list them here but point you to the list online.

I’ve just scratched the surface in learning the software package but I’ll likely switch to it for the next month or so, as I did with Google Chrome. Then I can find its limitations and advantages and add it to the mix.

Maybe they’ll consider providing an add-in to search ChemSpider directly as ACD/Labs did with their add-on (the add-on does need updating by ACD/Labs however).

Welcome Symyx Draw to the mix…it’s a great offering

I’ve posted over at the ChemConnector Blog about the potential need for a neutral review of the performance of Optical Structure Recognition algorithms. I’m interested in the technology because we are now using it on ChemSpider for our document markup and structure recognition. I’d welcome your thoughts and comments…visit the blog post.


Recently I asked how people used ChemSpider. I received feedback from Jan Hummel from the Max Planck Institute of Molecular Plant Physiology and have posted it below for the blog readers.

Several years ago our institute was a pioneer in establishing GC MS-based approaches for metabolomic analysis in plants and also in other organisms. The GC MS-based approaches are mostly targeted since only compounds that have been previously measured as standard/reference substances can be reliably analyzed/identified in biological samples. Accordingly, we decided to expand our analysis strategies to more untargeted metabolite analysis approaches. For this purpose we considered what the best way would be to achieve this goal, and we decided that high resolution MS (eg. FT-ICR MS), might be the way to go. With these MS machines we can resolve thousands of masses extremely accurately with resolutions up to 1ppm. Combining this information with fragmentation data of individually measured masses, isotope labelling and retention times from the chromatographic separation means a plethora of data that has to be integrated into meaningful information is produced. Obviously this data is difficult to handle if there is no useful initial annotation. This is where ChemSpider comes into play. We use the immense repository of chemical data and knowledge provided by this well curated data collection as the entry point for the conversion of experimentally measured masses to possible chemical compounds. In an initial step we perform simple database matching of the measured masses to all the masses derived from the compounds present in ChemSpider. This allows us to associate a large number of measured masses to one or more possible chemical formulas. In a subsequent step we then make use of the structural information provided by the ChemSpider database to evaluate which of the initial considered compounds matches not only the measured mass, but also can explain the measured fragmentation pattern provided by the MS/MS data. For this purpose access to a large number of structural isomers is an invaluable tool.

Additionally, by using the structural data we can also make use of the collection of predicted properties of the compounds collected in ChemSpider by simply comparing them to the properties (mostly retention time in the LC run) of the measured compounds. This often helps us to sort out incorrectly annotated structures.

Even though many of these analyses are still manual and tedious, the huge amount of data collected and provided by ChemSpider allows us a straightforward spectrum annotation, which hopefully in the future will be performed in a more automated manner. A paper entitled “High-Resolution Direct Infusion-Based Mass Spectrometry in Combination with Whole 13C Metabolome Isotope Labeling Allowing Unambiguous Assignment of Chemical Sum Formulas” (Giavalisco P et al.) describing our approach was recently accepted in Analytical Chemistry. In this paper we used PubChem as the reference database.

In comparison to our studies performed using a PubChem based formula repository from May this year, a kindly provided data export from ChemSpider increased the amount of unique sum formulas in our system by more than 180,000 formulae. It appears that ChemSpider is growing at a very good rate!
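The initial database-matching step Jan describes, associating each measured mass with the candidate formulas whose mass falls within a small tolerance, can be sketched roughly as follows. This is only a minimal illustration and not the institute's actual pipeline: the function names, the ppm tolerance handling and the tiny candidate list are all assumptions, the masses are approximate monoisotopic values used as example inputs, and the measured values are treated as neutral masses (adducts and charge are ignored).

```python
from bisect import bisect_left, bisect_right


def ppm_window(mass, ppm):
    """Return the (low, high) mass range that lies within +/- ppm of mass."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def match_masses(measured, candidates, ppm=1.0):
    """Map each measured mass to the formulas within the ppm tolerance.

    candidates is a list of (monoisotopic_mass, formula) tuples; in practice
    it would come from a database export, here it is a hypothetical stand-in.
    """
    ordered = sorted(candidates)              # sort once by mass
    masses = [mass for mass, _ in ordered]
    hits = {}
    for mass in measured:
        low, high = ppm_window(mass, ppm)
        start = bisect_left(masses, low)      # first candidate >= low
        stop = bisect_right(masses, high)     # one past last candidate <= high
        hits[mass] = [formula for _, formula in ordered[start:stop]]
    return hits


# Illustrative candidates only (approximate monoisotopic masses).
candidates = [
    (180.06339, "C6H12O6"),    # hexoses such as glucose
    (180.04226, "C9H8O4"),     # e.g. aspirin, same nominal mass as the hexoses
    (194.08038, "C8H10N4O2"),  # e.g. caffeine
]

for mass, formulas in match_masses([180.06345, 194.0810], candidates).items():
    print(mass, formulas)
```

With a 1 ppm window the two formulas at nominal mass 180 are easily separated, which is the point of using high resolution instruments. A single formula such as C6H12O6 still corresponds to many structural isomers, and that is where the second step, checking candidates against the MS/MS fragmentation data, comes in.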

Recently a new website connecting chemicals to synthesis references went online. The site is ChemSynthesis and as well as synthesis references the database also contains physical properties for many of the listed substances. There are currently more than 40 000 compounds and more than 45 000 synthesis references in the database and there is an intention to keep the database growing with contributions from the community. Presently ChemSynthesis is indexing information from quite an extensive list of journals given below.

The Journal of the American Chemical Society, Canadian Journal of Chemistry, Chemical and Pharmaceutical Bulletin, Chemistry Letters, Journal of Heterocyclic Chemistry, Journal of Medicinal Chemistry, The Journal of Organic Chemistry, Organic Syntheses, Synthesis, Synthetic Communications, Tetrahedron Letters, Tetrahedron

An example record can be found here and a list of hits from a text search is shown below.

Linking from ChemSpider to ChemSynthesis seemed like a natural way to help our users source potential synthesis details. So, that’s done. We have also exchanged the appropriate information with ChemSynthesis so that we have completed the loop. Users searching ChemSynthesis can navigate directly to the ChemSpider record with one click.

To review the entire ChemSynthesis dataset on ChemSpider simply follow this link. It is >40,000 molecules so might take a while to load. Another contribution to the community of connected chemists….

A lot of people have been helping to improve the quality of ChemSpider content by depositing new data and “cleaning up” errors in the data over the past few months. It’s been a long climb. Our thanks to all of you who have contributed. I’ll be the first one to put my hand up and acknowledge that in some ways I have not made the act of contributing to the curation process very easy, since I’ve been feeding the data out via the blog in chunks, as it has developed. Following a recent “long flight” I am happy to announce that the Curators Handbook/Bible is now available in its first form and is available online here. This document gives some pretty detailed guidance regarding how to curate the ChemSpider database. As always we welcome feedback. If something is not clear let us know and we will expand/enhance as appropriate.

What I also want to do is to thank those people who have commented on how truly impressed they are with the rate at which we are cleaning the data. In general most curation requests identified on the site are addressed within 24 hours. There are some issues hanging out there that we don’t have solutions for at present, specifically in regards to organometallic data handling, but we are still thinking about a path forward.

There has been a conversation going on over on Wikipedia about supporting ChemSpider IDs in the ChemBox and DrugBox. ChemSpider IDs have been added to ChemBoxes over the past few weeks by a number of contributors and, based on the blessing of members of the Wikipedia community, they will now be displayed in Drugboxes also. The conclusion of the conversation today stated:

Done Thanks everyone – that seems clarification that people would find this helpful and, in particular, thanks for addressing ChemSpiderMan own reservation. I’ve added to {{drugbox}}, eg see Verapamil. David Ruben Talk 13:10, 22 September 2008 (UTC)

A Drugbox, for Xanax, is shown below. Note the number of outlinks to PubChem, Drugbank and now ChemSpider. 


What I am most proud of is some of the statements made in the discussion that validate our efforts to create high quality curated source of information. For example:

“I’d just like to add my voice to those that find value in linking Wikipedia articles to ChemSpider. I find this database to be reliable and information-rich in comparison to the other databases we link to already. I support adding a link from drugboxes and chemboxes. – Ed (Edgar181) 11:36, 22 September 2008 (UTC)”

“I think the effect of linking to ChemSpider would be to marry a well curated database (ChemSpider) with monographs (WP). To elaborate, the database contain various intrinsic properties (MW, isotopic composition, structure, stereo), experimentally-determined properties (bp/mp/appearance), experimentally-determined spectra (1H/13C NMR, IR, etc., e.g. [1]), apart from predicted data. Monographs: our articles discussing the synth, applications, chemistry, etc. of various compounds, drug-, drug-like, or otherwise. Seems like everything to gain and not much to lose, except for another entry in the drugbox and perhaps concerns of table creep. –Rifleman 82 (talk) 03:46, 22 September 2008 (UTC)”

“I personally support the addition of ChemSpider not because of the predicted properties—which are included in PubChem—, but because, so far, ChemSpider appears to be highly curated (and transparently so). PubChem has some serious if relatively infrequent reliability issues, which are well known to the WP chemistry/pharm community, and MeSH (to which CAS numbers in the Drugbox link) appears to lack information on many compounds. Fvasconcellos (t·c) 01:53, 22 September 2008 (UTC)”

It is validating to be embraced by the Wikipedia community in this way. What we commit to in return is to continue our efforts to expand the services and quality on ChemSpider. And presently we are working on a “little gift” to help Wikipedia. Watch this space.

I’ve been in a number of conversations of late about how Mass Spectrometrists might use ChemSpider and get value from our efforts. I recently gave a short Powerpoint presentation to a group about what ChemSpider is and the types of queries that ChemSpider users can conduct today. I’ve posted the presentation to Slideshare as usual so people can access it there if they are interested.

I’ve started wrapping my head around how we could provide more value to some of our users in regards to MS, HPLC and NMR. One of the things we could do is to use our known text mining skills to look for NMR or MS (LCMS) articles based on the use of the terms in the title or abstract and then using those terms as tags against chemical structures in the abstract/title. So, from titles such as “High-Performance Liquid Chromatographic Method for Determination of Phenytoin in Rabbits Receiving Sildenafil” from our collaborator Libertas Academica we would extract HPLC and Phenytoin and connect the article to the structure as we have done here. In this way the article would be searchable by structure and associated analytical technique and we could even look at extracting the detailed experimental approach from Open Access articles. More work but feasible. Any comments???
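To make the idea a little more concrete, here is a rough sketch of how a title could be tagged with analytical techniques and compound names. It is only an illustration under stated assumptions: the term patterns, the two-entry compound list and the `tag_title` function are hypothetical and are not our actual text mining code, which would draw on a full chemical name dictionary rather than a hand-written list.

```python
import re

# Hypothetical technique vocabulary: tag -> regular expressions that signal it.
# Spelled-out phrases are matched case-insensitively; abbreviations are not,
# so that ordinary words are less likely to be mistaken for a technique.
TECHNIQUE_TERMS = {
    "HPLC": [r"(?i)high[- ]performance liquid chromatograph\w*", r"\bHPLC\b"],
    "NMR": [r"(?i)nuclear magnetic resonance", r"\bNMR\b"],
    "MS": [r"(?i)mass spectromet\w*", r"\bLC[-/]?MS\b", r"\bMS\b"],
}

# Hypothetical stand-in for a dictionary of chemical names linked to structures.
COMPOUND_NAMES = ["Phenytoin", "Sildenafil"]


def tag_title(title):
    """Return the technique tags and known compound names found in a title."""
    techniques = [
        tag
        for tag, patterns in TECHNIQUE_TERMS.items()
        if any(re.search(pattern, title) for pattern in patterns)
    ]
    compounds = [
        name
        for name in COMPOUND_NAMES
        if re.search(r"\b" + re.escape(name) + r"\b", title, re.IGNORECASE)
    ]
    return {"techniques": techniques, "compounds": compounds}


title = ("High-Performance Liquid Chromatographic Method for Determination "
         "of Phenytoin in Rabbits Receiving Sildenafil")
print(tag_title(title))
```

A simple matcher like this picks up every known compound in the title, so a real system would still need some way to decide which compound the analytical method is actually about before connecting the article to a structure.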

Readers of this blog will know we have a focus on enabling chemists to source information via both Open AND Closed access publishers with the aim, ultimately, of providing a way to perform structure and substructure searching of these articles. This work is well underway.

If you visit our Literature Search Page you will see that we have recently added the ACS AuthorChoice Free Access articles to the index and we will continue to index on an ongoing basis. There are very few ACS AuthorChoice articles to search but the usual validation search, “Searching Taxol”, does turn up one hit.

Herding Nanotransporters: Localized Activation via Release and Sequestration of Control Molecules (Nano Lett. 2007 Volume 8 Issue 1 Page 221) – American Chemical Society

R. Tucker, P. Katira, H. Hess

… 1 mM MgCl, 1 mM EGTA, pH 6 .9) containing 10 micromolar taxol for stabilization and kept at room temperature (20 C). Caged -ATP and “

Those of you who have read the ChemSpider blog for a while will know the name of Paul Docherty, who writes on his TotallySynthetic blog. I have a great appreciation for Paul’s writings and, with permission, have started associating his blog posts directly with the structures, literally inserting the entire blog post, but NOT the comments, into a description for the molecule. For example, see the structure here. We’ve started to go back through Paul’s postings and make his entire collection structure searchable. There are a lot of old posts to deposit and time will get us there. Everything is posted under CC licenses and with permission.

Paul also writes for the RSC in Chemistry World as, for example, here. With permission from the RSC we are inserting snippets of the article into ChemSpider and linking out to the Chemistry World article, for example: here and here. If you are a registered user you can link your own articles on your websites using the Add URL functionality or you can deposit your own postings onto our site in the same way…feel free to ask us how.

I posted a request recently for people to share with us how they use ChemSpider. Two comments were submitted today and are given below. Thanks to Chris and Sean.

Sean Ekins commented “Over the past year I have used ChemSpider as a valuable resource for generating molecular properties that are then used to analyze specific enzymes’ substrate requirements (PMID: 18537573) and also for following up on hits derived from computational pharmacophore database searching (PMID: 18579710). The latter searching helped find additional compounds that were purchased from vendors and tested in vitro; ultimately some were found to be active. These uses are in addition to finding structure SMILES for generating QSAR models and general molecule searching. I hope to build on this in the future!”

Chris Singleton commented “As a chromatographer, the predicted properties of the molecule are an invaluable tool.  With values such as the logP and logD, it is much easier to predict and estimate retention times relative to other similar compounds without having to spend the time to do an entire chromatographic run.  Of course, I still do the actual chromatography, but these tools let me ‘home in’ on an appropriate method much quicker.  Some of these predictions are available in chemical structure drawing programs, but Chemspider allows me to do one-stop shopping for most of the properties I’m interested in.  And if I don’t have a structure handy, it’s easy to look it up so I can see what type of MS ionization mode I want to use, based on the moieties and polarity of the molecule.

Secondly, I’m more involved in DMPK from the analytical side as opposed to the pharmacokinetic (PK) side, so I’m trying to learn more about the PK. The fact that Chemspider has such a wide range of properties and links to outside sources really lets me correlate the PK properties of a drug to its structure and lets me be a better analyst. I also know that Chemspider is busy adding links to the infoboxes on Wikipedia (an effort that I am involved in), so I read about the properties of a drug and am then able to click through to look at the chemical properties and how the structure relates to function. This is more of a professional development exercise for me, but having the wealth of information in one place makes it easier to get a big picture view of a drug, rather than just looking at the pharmacology properties alone. Definitely a structure centric view of drugs.”

I talk with a lot of people about ChemSpider and how it is being used. It’s being used in many different ways based on the emails we receive from users and I’ve shared a number of the various ways through my public presentations. What I’d like to do is offer some space on this blog for users to share how YOU are using ChemSpider and what you are using the information on ChemSpider to do. Simply send a short story about how you use/have used ChemSpider and I’ll post it for you. We’ll of course put crosslinks back to your website if you have one etc. Thanks!

As ChemSpider has grown into an important part of the online community for providing access to information and data to chemists to assist them in their work there are many subjective criteria by which to be measured. We set some objectives early on in regards to how we would measure our own successes in the first couple of years. These included:

1) A result of >500,000 in a Google search (we have been at this number for over a month I believe)

2) Acknowledgment by our “peers”, another subjective criterion, by comments made in the blogosphere, recognized by invitations to speak, participate in panel discussions etc. No shortage here.

3) Reach 5000 unique users per day in our first year (already achieved)

4) Be reviewed in a mainstream publication (the Nature article written about ChemSpider does that)

5) Have over 150 data sources feed ChemSpider. We are close…145 data sources at present and more in the pipe to feed in shortly

6) Be indexed by Chemical Abstracts Service.

CAS has been indexing a number of web resources for a considerable time. Until today I didn’t know that we were one of these sources. It actually makes a lot of sense that we should be indexed. We have unique chemistry on our site since we host Open Notebook Science from groups such as that of Jean-Claude Bradley at Drexel University. But, we also have spectra and assignments from research compounds being deposited onto the database and are establishing relationships with Open Access publishers to index their chemical compounds connected directly to their articles. So, being indexed makes sense.

There has been a murmuring in the community that what ChemSpider is doing will collide with CAS. I have reiterated many times that I believe CAS offers the crown jewels in terms of quality and curated data. With what amounts to likely 1000s of person years of investment in building the registry we are unlikely to surpass CAS’ breadth of knowledge. Rather we are focused on providing a service to the community so that the community can participate in developing and growing the database. I believe CAS and ChemSpider are synergistic and have much to offer by being connected in this way.

Inserted above is a screen grab of part of a record showing the ChemSpider database as the source of the structure. CAS have rigorous expectations regarding how they select what chemical entities should be inserted into their database. While I don’t know this list of definitions this structure clearly meets it. The structure above is on ChemSpider here. We’re very happy that we are being indexed now in the CAS registry and will continue to enhance our “unique structure collection” working with chemical vendors, publishers and scientists to grow our database.


In the past 48 hours we have added six new depositor datasets to ChemSpider. Details of all of our data sources are listed here. The list of six new depositors and the number of compounds in each collection is given below. Click on the hyperlinks for more information. The number of compounds link will display the compound collection and the link to the title of the compound collection will list some details about the data source provider.

489 NIH Clinical Collection 9/7/2008
3080 Shanghai Institute of Organic Chemistry 9/7/2008
12356 HDH Pharma 9/7/2008
196 OmegaChem 9/6/2008
2110 Exclusive Chemistry 9/6/2008
13412 Oakwood 9/6/2008

We have put in place a simple way to link a chemical compound in a single record view out to an external data source. We made this a general solution but did it specifically to enable connections to be made quickly between new Wikipedia records and records on ChemSpider. We have become very experienced with the validation of data on both Wikipedia and ChemSpider over the past few months, so when we find new records on Wikipedia that are not already connected to ChemSpider we clean and validate the structures on ChemSpider while validating the compounds on Wikipedia. Then, when we are convinced of the validity of the compounds, we connect them. While it may take a long time to validate the data, associating the Wikipedia and ChemSpider records takes just a few seconds.

We have now established “Wikipedia on ChemSpider” so that Wikipedia can be searched by structure and substructure. We believe that people may be more likely to use this over WiChempedia but we will see.

The process for linking Data Sources directly to a record view is described in this Technical Note. We welcome feedback on the document in case it is difficult to follow.

There has been an outpouring of offers from the ChemSpider community in terms of helping to examine/clean and enhance information regarding carbohydrates on ChemSpider. Almost 2 dozen users have now made an offer to help. Very exciting really!

I’ve already outlined the necessity to improve the quality of associations between structures and identifiers on the database. However, I am also hoping that users will write articles about carbohydrates using the rich-text formatting capabilities (ADD Description), will add spectra if they have them, will link up articles if they have interesting papers and will add URLs to interesting online content also.

We have now delivered the ability to curate and enhance records on ChemSpider and look forward to having our users help, starting with Carbohydrates…

I am very proud of the response from our user base to my request for assistance with curating ChemSpider in regards to carbohydrates. Carbohydrates are complex in nature. They can be represented in linear form and cyclic form, they exist in ChemSpider with a common name but no defined stereochemistry, there are pentoses, hexoses and many stereoisomers per skeleton. There are MANY common carbohydrates with trivial names: Ribose, Arabinose, Xylose, Lyxose, Allose, Altrose, Mannose, Gulose, Idose, Galactose and Talose.

Carbohydrates have been very challenging for us at ChemSpider…many depositors have not been careful with the  association between the chemical structure and the associated identifiers. With a chemical structure as the primary key on a record we find confusing associations with structures. For example, a search on Maltotriose as an identifier turns up 5 structures on ChemSpider. Maltotriose is defined on Wikipedia as “trisaccharide (three-part sugar) consisting of three glucose molecules linked with 1,4 glycosidic bonds.” This should mean that it is not appropriate for the identifier maltotriose to be associated with this structure. The registry number associated with this structure should be deleted also based on Wikipedia as a resource. How many of the other identifiers should be deleted? Maybe all???

Looking at this record we see identifiers such as: alpha-D-Glc-(1->4)-alpha-D-Glc-(1->4)-D-Glc; alpha-D-Glc, O-alpha-D-glc; GLC-(4-1)GLC-(4-1)GLC-(4-4)GTE and O-alpha-DD-Glucopyranosyl-(1->4)-O-alpha-D-glucopyranosyl-(1->4)-D-glucose. Are these appropriate for this compound?

The challenge for maltotriose is therefore to identify the CORRECT structure associated with that name. “Maybe” it is the structure on Wikipedia but don’t forget that we have an effort underway to validate the structures on Wikipedia and make sure they are correctly associated with the monograph title. Is Maltotriose an identifier for a unique stereoconfiguration or is there alpha- and beta-maltotriose?  I am not sure. What needs to be determined is the correct association between structures and identifiers. Incorrect associations should be removed so that they do not turn up the incorrect structures in ChemSpider when searched.
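As a rough illustration of how candidates for this kind of review could be found, the sketch below flags any identifier that is associated with more than one structure. It is a minimal example under stated assumptions: the `ambiguous_identifiers` function and the record labels are hypothetical placeholders, not real ChemSpider IDs or our actual curation tooling, and a flagged name only tells a curator where to look, not which association is the correct one.

```python
from collections import defaultdict


def ambiguous_identifiers(associations):
    """Return identifiers that point at more than one structure.

    associations is an iterable of (identifier, structure_id) pairs.
    Identifiers are compared case-insensitively with surrounding whitespace
    removed, which is an assumption about how names should be normalised.
    """
    structures_by_name = defaultdict(set)
    for identifier, structure_id in associations:
        structures_by_name[identifier.strip().lower()].add(structure_id)
    return {
        name: sorted(ids)
        for name, ids in structures_by_name.items()
        if len(ids) > 1
    }


# Placeholder data: the record labels below are invented for illustration.
associations = [
    ("Maltotriose", "record-A"),
    ("maltotriose", "record-B"),
    ("Maltotriose ", "record-C"),
    ("Ribose", "record-D"),
    ("ribose", "record-D"),
]

for name, records in ambiguous_identifiers(associations).items():
    print(f"{name}: {len(records)} structures -> {records}")
```

Deciding which of the flagged structures should keep the name still needs a chemist's judgement, which is exactly why the help of curators is so valuable here.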

This is the start of the validation process for carbohydrates… it’s iterative, complex and hard work. It’s going to begin with giving the group of interested parties curator powers on ChemSpider and asking them to work on this challenge. We welcome their assistance. The efforts of contributors like this will be essential.

MeSH is likely well known by anyone working in the Life Sciences and with PubMed. As defined on Wikipedia:

“Medical Subject Headings (MeSH) is a huge controlled vocabulary (or metadata system) for the purpose of indexing journal articles and books in the life sciences. Created and updated by the United States National Library of Medicine (NLM), it is used by the MEDLINE/PubMed article database and by NLM’s catalog of book holdings.

…The 2005 version of MeSH contains a total of 22,568 subject headings, also known as descriptors. Most of these are accompanied by a short definition, links to related descriptors, and a list of synonyms or very similar terms (known as entry terms). Because of these synonym lists, MeSH can also be viewed as a thesaurus.”

We are presently moving further into integration with PubMed and as part of this move we have decided to integrate MeSH information at the structure level onto relevant record views. Now, when you visit a particular record where MeSH information is available, the data will be visible under the MeSH tab, open by default.

MeSH has been curated by a highly skilled team over a number of years. More information about MeSH can be found online. The contents of the MeSH table should be self-explanatory. Over the next few weeks watch as we do more with the integration of MeSH and PubMed.

Most people reading this blog will know that we are advocates of the InChI standard for structure representation. I am aware of the intention to extend InChI into the world of reaction capture and look forward to testing it as it moves forward and providing feedback to the team. An announcement was made in the CSA Trust Newsletter and I’ve snipped it below.

“A project to develop a standard representation for chemical reactions was launched recently at a meeting in Berlin, Germany, hosted by René Deplanque of FIZ Chemie. The project is being led by Guenter Grethe.

The goal of this meeting was to develop the requirements for a proposal to be submitted to IUPAC to fund an Open Source, public domain ReactionML (IUPAC RML) standard to complement the IUPAC InChI chemical structure representation. The requirements would include what the community needs, technical and organisational issues and financial aspects.

The meeting was quite successful and an initial first stage of the project was agreed to and will include:

  • Reactants
  • Products
  • Reagents
  • Catalysts
  • Solvents

All the chemical structure representation will be based on and build upon the IUPAC InChI/InChIKey standards, which, since its introduction in August 2006, has become the international chemical structure representation standard for all large databases of chemical data. Some of these databases containing InChIs are in excess of 36 million unique structures.

It is expected a beta test release version of this new IUPAC standard will be available for public testing by the end of 2008.”
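To make the five roles listed in the announcement concrete, here is a minimal sketch of what a reaction record built on InChI strings could look like. It is purely my own illustration and not the proposed IUPAC format: the ReactionRecord class, its field names and the simple prefix check are all assumptions, and the InChI strings for the esterification example should be checked against an authoritative source before reuse.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReactionRecord:
    """Hypothetical container: one list of InChI strings per role."""
    reactants: List[str]
    products: List[str]
    reagents: List[str] = field(default_factory=list)
    catalysts: List[str] = field(default_factory=list)
    solvents: List[str] = field(default_factory=list)

    def components(self) -> List[Tuple[str, str]]:
        """Flatten the record into (role, inchi) pairs."""
        roles = {
            "reactant": self.reactants,
            "product": self.products,
            "reagent": self.reagents,
            "catalyst": self.catalysts,
            "solvent": self.solvents,
        }
        return [(role, inchi) for role, items in roles.items() for inchi in items]

    def invalid_components(self) -> List[Tuple[str, str]]:
        """Very light sanity check: every entry should carry the InChI prefix."""
        return [(r, s) for r, s in self.components() if not s.startswith("InChI=")]


# Acid-catalysed esterification: acetic acid + ethanol -> ethyl acetate + water.
esterification = ReactionRecord(
    reactants=[
        "InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)",   # acetic acid
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",       # ethanol
    ],
    products=[
        "InChI=1S/C4H8O2/c1-3-6-4(2)5/h3H2,1-2H3", # ethyl acetate
        "InChI=1S/H2O/h1H2",                       # water
    ],
    catalysts=["InChI=1S/H2O4S/c1-5(2,3)4/h(H2,1,2,3,4)"],  # sulfuric acid
)

for role, inchi in esterification.components():
    print(f"{role:9s} {inchi}")
```

Keeping each role as its own list means a search for, say, every reaction that uses a given solvent would not need to parse the whole record.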

I’ve had a number of questions about the presentation I gave at ACS Philly last week about document markup. The phrase I keep hearing is “very disruptive” followed by the question “will authors do more work and what’s in it for them?”.

The presentation here outlines the general concept that I talked about…

The basic concept I presented is as follows, with a focus on Chemistry Articles.

A lot of effort is being expended in “text-mining” publications, post-publication, to index these articles and make them searchable not only by text but by the specific language of chemistry, chemical structures. We are specifically asking the question “why extract chemical structures from articles using chemical name conversion approaches and chemical image conversion tools when the structures in the article were ORIGINALLY machine readable?”

We are considering a system whereby authors are asked to contribute to the availability of a free online service for performing structure and substructure-based searches of chemistry articles. While the submission of journal articles is already a lot of work (I know from experience of authoring/co-authoring about 10 a year) we hope that authors will support a service whereby they can upload their own articles to a “validation and mark-up service”. The upload capabilities will support upload of the primary document, chemical structures in standard formats and supplementary information of various types (to be defined).

This system will perform the following services:

1) Semi-automated markup of a document – title, author(s), abstract and additional dictionary-based terms plus the ability to use the NLM-DTD markup
2) Identification of chemical names and conversion to structures in an automated fashion
3) Conversion of structure IMAGES to connection tables using optical structure recognition software (either commercial or open source)
4) Ask authors to confirm whether the converted structures are appropriate
5) Provide a structure validation service for submitted molecules checking for “accurate representation”
6) Deposit all structures associated with an article onto ChemSpider but under embargo. Associate the article title, authors and “abstract snippet” with all structures.
7) Issue a set of ChemSpider IDs for the author to submit to the publisher with the article
8) When a publication has passed through review the author can release the structures from embargo using a DOI or an article URL (more common for Open Access articles)
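Steps 6 to 8 can be sketched in a few lines of code. This is a toy, in-memory model under my own assumptions, not the actual ChemSpider implementation: the EmbargoStore class, its method names, the sequential IDs, the SMILES inputs and the placeholder DOI are all hypothetical.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional


@dataclass
class Deposit:
    deposit_id: int                      # stands in for an issued ChemSpider ID
    structure: str                       # e.g. a SMILES or InChI string
    title: str                           # article title tied to the structure
    embargoed: bool = True               # step 6: hidden until release
    article_link: Optional[str] = None   # DOI or article URL, set on release


class EmbargoStore:
    """Hypothetical store for structures deposited ahead of publication."""

    def __init__(self) -> None:
        self._ids = count(1)
        self._deposits: Dict[int, Deposit] = {}

    def deposit_article(self, title: str, structures: List[str]) -> List[int]:
        """Steps 6 and 7: deposit under embargo and hand back the issued IDs."""
        issued = []
        for structure in structures:
            deposit = Deposit(next(self._ids), structure, title)
            self._deposits[deposit.deposit_id] = deposit
            issued.append(deposit.deposit_id)
        return issued

    def release(self, deposit_ids: List[int], article_link: str) -> None:
        """Step 8: lift the embargo once a DOI or article URL is supplied."""
        if not article_link:
            raise ValueError("a DOI or article URL is required to release")
        for deposit_id in deposit_ids:
            deposit = self._deposits[deposit_id]
            deposit.embargoed = False
            deposit.article_link = article_link

    def public_structures(self) -> List[str]:
        """Only released structures are visible to searches."""
        return [d.structure for d in self._deposits.values() if not d.embargoed]


store = EmbargoStore()
ids = store.deposit_article("Hypothetical article title", ["CCO", "CC(=O)O"])
before = store.public_structures()          # nothing visible while embargoed
store.release(ids, "doi:10.0000/example")   # placeholder DOI, not a real one
after = store.public_structures()
print(ids, before, after)
```

The point of the design is that the IDs exist before publication, so the author can hand them to the publisher, while nothing becomes searchable until the article link is attached.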

The result of this project will be a way for publishers to link their articles directly to a free access chemistry database and use a series of web services to enable other capabilities (to be defined). It will also allow articles in Open Access and non-Open Access publications to be searchable by the “language of chemistry”.

This is only a slice of the overall project but I think it may be of interest relative to the comments you have made below.

Parts of this were shown last week at Drexel University and a particular snippet is available online here:

We are also going to provide a Microsoft Word add-on which will allow users to prepare articles for publishing using similar technologies.

We think this IS disruptive… what say you?

There is an interesting post on the End of Cyberspace blog regarding whether online databases and journals are changing scientists’ reading habits. I always feel it’s more appropriate to give the original blogger the traffic so pop over and read the post here.

I think the extract below might tempt you to do so…

“Searching online is more efficient and following hyperlinks quickly puts researchers in touch with prevailing opinion, but this may accelerate consensus and narrow the range of findings and ideas built upon.”

Josh Wilson, the Reference Librarian for the Physical and Mathematical Sciences at the North Carolina State University Libraries, posted a comment on CHMINF this week. He commented:

“Recently I conducted an orientation session for new graduate students in chemistry, and I gave them a survey to determine their familiarity with some common databases and research tasks.  I thought you might be interested in seeing the results.  (I do not present them as scientific, it was a small sample and I didn’t have time to painstakingly construct the questions.)  To spare you a page-long e-mail, check the results and some observations here:”

The document gives the following information:

Question 1 was “Are you familiar with the following databases for finding chemistry information?”. Students answered on a scale from 1-4 (from not familiar to very familiar, so the closer the average to 4, the more it was universally known, the closer to 1, the least known).  Average scores for 25 respondents:

Wikipedia – 3.24
SciFinder Scholar – 2.76
ChemFinder – 2.44
Google Scholar – 2.36
Sigma-Aldrich – 2.36
Chemical Abstracts (printed) – 2.20
Web of Science – 1.84
PubChem – 1.52
Beilstein/Gmelin – 1.28
ChemSpider – 1.08
CrossFire Commander – 1.04

Clearly we have work to do to improve awareness of ChemSpider for students (and I have already sent a note to Josh to see if I can provide an overview to students at NCSU some time in the future) but what is clear is how important Wikipedia is to students. This makes the curation work on Wikipedia all the more important!

Retrosynthetic Analysis Presentation at ACS-Philly

I had the pleasure of representing ARChem Route Designer, a retrosynthetic analysis tool from SimBioSys at the American Chemical Society meeting in Philadelphia last week. More…

Chem4Word Project from Microsoft and Murray-Rust

Following on from my presentation regarding text-mining and document mark-up at the ACS meeting in Philly, it was interesting to see the announcement about the Chem4Word project from Microsoft, a collaboration with the Unilever School of Informatics at Cambridge University, specifically working with Peter Murray-Rust and some of his team. The website announcement states: “Microsoft Research is investigating the introduction of chemistry-related features in Microsoft Office Word, including authoring and semantic annotations. More…