Archive for the ChemSpider Chemistry Category

I gave my talk yesterday at CShals 2009, the conference on Semantics in Healthcare and Life Sciences. It was a great meeting for me (hindered by dismal access to wireless internet as a result of Marriott’s wish to make more money from the conference organizers. They should be ashamed of themselves in this day and age!) as it was not about Chemistry, not about spectroscopy, not even about Open Data, Open Access and Open Source. It was about Semantics. I learned a lot and got to hear Tim Berners-Lee talk about where the semantic web is, where it can go and how it can be disruptive in a good way while NOT being too disruptive to layer onto what already exists. The best part of the meeting for me was the clear passion for the InChI, as well as a lot of acknowledgement that it is not perfect and cannot presently compete with molfiles, commercial systems, CAS Numbers and so on. But people are optimistic, waiting and supportive. Overnight I inserted a lot more information about InChIs and how they can be useful, where some of the limitations are presently, and how the StdInChI has now added a new level of complexity on one hand and simplification on the other. There have already been a number of requests for a copy of the talk so it is up on Slideshare for now (and linked below). I’ll do a voice over in the next few days and upload to Scivee. I unveiled the first version of the InChI Resolver at the conference and showed it to a couple of people. The general consensus is we are heading in the right direction. The timing of this conference was good because the intention is to layer on RDF before we release at the ACS, time allowing.

Where in the world is Carmen Sandiego and who and where is Katie Crow? We’re still looking for her ever since she put her photo on ChemSpider and took advantage of the new capability we have for depositing images.

Well, a more appropriate use of the function is to actually deposit images of appropriate data. JSpecView does not support 2D NMR data at present but such data can still be of value. Ryan Sasaki from ACD/Labs was kind enough to give me an example 2D COSY spectrum for strychnine so I could use it as a proof of concept. It is available under the spectra tab at this record (see the bottom of the page). This 2D spectrum could also show a structure with correlations etc.

Beauty is in the eye of the beholder. Something I see as stunningly beautiful can just as easily be unattractive to my peers. Such is the nature of Chemistry too. Some might find a particular reaction particularly elegant while others would argue it is mundane. I judge that when it comes to the depiction of chemical structures we would all have fairly consistent views of what are attractive and appropriate chemical structure depictions or “layouts”.

Structure layout is hard to do well and there is still a need for THE optimal layout algorithm. We still find some nightmare organic structure layouts on ChemSpider. When we push them through the layout algorithm we use now they are easily resolved so we’re not sure why some escape the layout algorithm first time but such it is. We have provided the ability to clean these individual records as we find them and it takes just a couple of seconds. The technical note explaining how is here.

Such an operation was applied here. The structure on the left is the “ugly” structure (does anyone think it’s pretty?) and the one on the right is the cleaned version using the online process.

Unfortunately it is NOT so easy to obtain such improved layouts for the MAJORITY of organometallic compounds. This can be seen on PubChem (here) and, similarly, on ChemSpider here. The example is shown below. Are we working on this problem? Not really…the layout for such complex systems has been a challenge for many years and the appropriate way to deal with such situations is to use the CIF file, if it’s available, and display it in JMol as we have enabled here. We are however still working on cleaning up the structures of organic molecules as we see them and still searching for the ultimate layout tool…

For those of you performing curation activities on ChemSpider you will likely have noticed the ability to mark a new type of identifier, a shorthand formula. We have enabled this because it has become clear that this could be a useful part of document markup as part of our ChemMantis system. For example, looking at an article let’s consider the excerpt shown below.

Regarding the excerpt you can see a number of highlighted terms, all being shorthand formulae and not depending on name to structure conversion algorithms but rather depending on a lookup dictionary. Each of these names is linked to ChemSpider for direct look up of information associated with the chemicals. The list of shorthand formulae extracted from a couple of hundred articles is actually only a couple of hundred formulae at present. It includes the most obvious compounds that we can all interpret: CH3OH, MeOH, CH3CN, MeCN, CH3COOH, NaCl, NaF, NaCN, KBr, KCl and so on. All of these are immediately interpretable by chemists. There are likely a few more to be found over the coming months but in the past week of reviewing articles from various sources we have actually only added a couple of new formulae. We have also seen value in linking up ions and elements as appropriate. We are likely to add filters for display/not display of elements and ions since we’re of the opinion that displaying every incidence of an element in an article is of little value…just imagine how many times you might see the word carbon or hydrogen in an article… carbon-carbon bonds, hydrogen bonding etc. So, we’re switching them off by default. We’ll keep reporting on how we are improving ChemMantis…based on the review of a stack of articles the system has improved dramatically. We are asking for your articles now…combining shorthand formulae and chemical name markup will highlight a document as shown below.
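To make the dictionary lookup idea concrete, here is a minimal sketch in Python. It is not the ChemMantis code: the dictionary contents, the `find_terms` function and the `show_elements` switch are hypothetical names chosen for illustration, and the tokenizer is deliberately naive. It only shows how shorthand formulae can be found by lookup rather than by name to structure conversion, with elements switched off by default.

```python
import re

# Hypothetical lookup dictionary: shorthand formula -> compound name.
# The formulae are the ones listed in the post.
SHORTHAND = {
    "CH3OH": "methanol",
    "MeOH": "methanol",
    "CH3CN": "acetonitrile",
    "MeCN": "acetonitrile",
    "CH3COOH": "acetic acid",
    "NaCl": "sodium chloride",
    "NaF": "sodium fluoride",
    "NaCN": "sodium cyanide",
    "KBr": "potassium bromide",
    "KCl": "potassium chloride",
}

# Element names are kept separate so that they can be filtered out.
ELEMENTS = {"carbon", "hydrogen"}

# Naive tokenizer (an assumption): runs of letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")


def find_terms(text, show_elements=False):
    """Return (term, meaning) pairs for dictionary hits, in document order."""
    hits = []
    for match in TOKEN.finditer(text):
        word = match.group()
        if word in SHORTHAND:
            hits.append((word, SHORTHAND[word]))
        elif show_elements and word.lower() in ELEMENTS:
            hits.append((word, "element"))
    return hits
```

Calling `find_terms("dissolved in MeOH, washed with NaCl")` would return the two formulae with their names. Passing `show_elements=True` would additionally flag every occurrence of words such as carbon, which is exactly the noise the default setting avoids.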

When ChemSpider was rolled out to the world as a part of ChemZoo we always knew we would be introducing more “critters”. We are happy to announce our progress with our new development ChemMantis. Why Mantis? Well…it’s the Markup And Nomenclature Transformation Integrated System. Fits perfectly into our zoo!

We have been working on the markup of chemistry documents for a number of months and I unveiled the first aspects of our work at the ACS meeting in Philadelphia. The presentation is available online on my Slideshare account. What we are trying to do is to use our ChemSpider platform as the foundation of a document markup system whereby chemical names are automatically identified and can either be converted to chemical structures (possibly using algorithms for name to structure conversion) or retrieved from our ChemSpider database. We have invested a lot of effort over the past year to curate and validate the ChemSpider database of over 21.5 million unique chemical entities and are now sitting on a foundation of information allowing us to connect between chemical identifiers, chemical structures and out to rich sources such as Wikipedia and PubChem and to provide information such as chemical vendors and other online systems. ChemMantis is well and truly woven into the web of ChemSpider now.

We are now in alpha release and are adding some finishing tweaks to the markup system, the visualization elements and the workflow. You can see the immediate effects of our recent work on improving the quality of structure images in the balloon below.

We would like to test the system on YOUR documents if you are willing to participate. What we are looking for are WORD documents for already published papers. They can be Open or Closed access papers. We are not expecting copyright transfer – we want to mark up the documents and return them to you for feedback. In the process we will be testing the quality of our Dictionary, our conversions, our visualizations and our process. We welcome your support. Feel free to connect with us at infoATchemspiderDOTcom. Over the next few weeks you will hear more about ChemMantis and our contributions to text mining and markup of chemistry documents.

Recently a new website connecting chemicals to synthesis references went online. The site is ChemSynthesis and as well as synthesis references the database also contains physical properties for many of the listed substances. There are currently more than 40 000 compounds and more than 45 000 synthesis references in the database and there is an intention to keep the database growing with contributions from the community. Presently ChemSynthesis is indexing information from quite an extensive list of journals given below.

The Journal of the American Chemical Society, Canadian Journal of Chemistry, Chemical and Pharmaceutical Bulletin, Chemistry Letters, Journal of Heterocyclic Chemistry, Journal of Medicinal Chemistry, The Journal of Organic Chemistry, Organic Syntheses, Synthesis, Synthetic Communications, Tetrahedron Letters, Tetrahedron

An example record can be found here and a list of hits from a text search is shown below.

Linking from ChemSpider to ChemSynthesis seemed like a natural way to help our users source potential synthesis details. So, that’s done. Also we have exchanged the appropriate information with ChemSynthesis so that we have completed the loop. Users searching ChemSynthesis can navigate directly to the ChemSpider record with one click.

To review the entire ChemSynthesis dataset on ChemSpider simply follow this link. It is >40,000 molecules so might take a while to load. Another contribution to the community of connected chemists….

We’ve been working on structure depictions on ChemSpider and overall we are very happy with where we have got to. These structure depictions are going to be showing up in various parts of our system now.

However, we should qualify the difference between structure images and structure layout. The depictions and the layout are governed by different algorithms. While a structure image can be attractive the layout may not be perfect. It is possible to improve the layout of the molecule deposited on ChemSpider. Notice for the structure on the left that there is overlap with the methyl group.

For details on how to CLEAN structures on ChemSpider please read the Technical Note here: Interactive Cleaning of Molecules During Curation and Deposition.

The result of performing cleaning is shown below. This layout may also not be the perfect layout but there is no overlap. The user can continue to manually optimize the structure for the preferred layout.

It is finally time to roll out more attractive structure depictions. We have needed some more attractive structure depictions for a while but they have become an absolute must-have as we roll out the following new capabilities:

1) The ability to make YOUR chemical blog structure searchable (watch this space…). We suggested one path previously…this is BETTER…

2) Structure balloons for using with our document markup tools, both browser-based and Microsoft Word based

We all judge quality of visual aesthetics quickly. We know a good structure when we see one. This is an announcement that we will be rolling out new structures across the site in the next few days. You will see better looking structures showing up across the site – during deposition, during service-based predictions, during searches and, well, everywhere. While not perfect as yet a little more tweaking and the entire database will be supported by the new structure depiction algorithms. As it is you should see some examples now on the database…one shown below. We welcome your feedback!

I recently started a discussion with the users of ChemSpider about how they use our system. There have already been two responses and I am hoping for more. Having sat in on an IUPAC InChI meeting in Washington last week I can honestly say that it was one of the most functional and on-task meetings I have sat in on in a long time. Decisions were made about how to move forward with the next release of the InChIKey and “standard versions” of both the InChIString and InChIKey.

The meeting has prompted the question how do you use InChI? For what purpose do you use InChI and do you use only the string? Do you use it for communication purposes and structure exchange? Do you use it in your internal databases? Is it a primary path to deduplication? What settings do you use for the InChIString?

I’m interested in how you are using InChI and how important it has become for you. Comments welcomed.

As ChemSpider has grown into an important part of the online community for providing access to information and data to chemists to assist them in their work there are many subjective criteria by which to be measured. We set some objectives early on in regards to how we would measure our own successes in the first couple of years. These included:

1) A result of >500,000 in a Google search (we have been at this number for over a month I believe)

2) Acknowledgment by our “peers”, another subjective criterion, by comments made in the blogosphere, recognized by invitations to speak, participate in panel discussions etc. No shortage here.

3) Reach 5000 unique users per day in our first year (already achieved)

4) Be reviewed in a mainstream publication (the Nature article written about ChemSpider does that)

5) Have over 150 data sources feed ChemSpider. We are close…145 data sources at present and more in the pipe to feed in shortly

6) Be indexed by Chemical Abstracts Service.

CAS has been indexing a number of web resources for a considerable time. Until today I didn’t know that we were one of these sources. It actually makes a lot of sense that we should be indexed. We have unique chemistry on our site since we host Open Notebook Science from groups such as that of Jean-Claude Bradley at Drexel University. But, we also have spectra and assignments from research compounds being deposited onto the database and are establishing relationships with Open Access publishers to index their chemical compounds connected directly to their articles. So, being indexed makes sense.

There has been a murmuring in the community that what ChemSpider is doing will collide with CAS. I have reiterated many times that I believe CAS offers the crown jewels in terms of quality and curated data. With what likely amounts to 1000s of person years of investment in building the registry we are unlikely to surpass CAS’ breadth of knowledge. Rather we are focused on providing a service to the community so that the community can participate in developing and growing the database. I believe CAS and ChemSpider are synergistic and have much to offer by being connected in this way.

Inserted above is a screen grab of part of a record showing the ChemSpider database as the source of the structure. CAS have rigorous expectations regarding how they select what chemical entities should be inserted into their database. While I don’t know this list of definitions this structure clearly meets it. The structure above is on ChemSpider here. We’re very happy that we are being indexed now in the CAS registry and will continue to enhance our “unique structure collection” working with chemical vendors, publishers and scientists to grow our database.


In the past 48 hours we have added six new depositor datasets to ChemSpider. Details of all of our data sources are listed here. The list of six new depositors and the number of compounds in each collection is given below. Click on the hyperlinks for more information. The number of compounds link will display the compound collection and the link to the title of the compound collection will list some details about the data source provider.

489 NIH Clinical Collection 9/7/2008
3080 Shanghai Institute of Organic Chemistry 9/7/2008
12356 HDH Pharma 9/7/2008
196 OmegaChem 9/6/2008
2110 Exclusive Chemistry 9/6/2008
13412 Oakwood 9/6/2008

We have put in place a simple way to associate a chemical compound in a single record view out to an external data source. We made this a general solution but did it specifically to enable connections to be made quickly between new Wikipedia records and records on ChemSpider. We have become very experienced with the validation of data on both Wikipedia and ChemSpider over the past few months so when we find new records on Wikipedia that are not already connected to ChemSpider we clean and validate structures on ChemSpider while validating the compounds on Wikipedia. Then, when we are convinced of the validity of the compounds, we connect them. While it may take a long time to validate the data, associating the Wikipedia and ChemSpider records takes just a few seconds.

We have now established “Wikipedia on ChemSpider”, making Wikipedia searchable by structure and substructure. We believe that people may be more likely to use this over WiChempedia but we will see.

The process for linking Data Sources directly to a record view is described in this Technical Note. We welcome feedback on the document in case it is difficult to follow.

Users of ChemSpider might have noticed some performance issues in the past 2-3 weeks with our web services, service availability and speed of searches. I put my hand in the air and say “Yup, acknowledged”. Hopefully they have not been too disruptive BUT it is for the overall benefit of the service ultimately. We have been streaming in 8 MILLION links to PubMed in order to make PubMed structure and substructure searchable. We are NOT rolling this out with full fanfare yet but I do want to explain the performance issues you might be experiencing. We work on Microsoft technology and while we are advocates for the platforms of .NET, IIS and SQL Server we definitely are putting them under pressure as we keep expanding the database and adding more value. We have thoughts about how to resolve this but want to finish populating the tables first.

The upside….the majority of links are already in place. For an example visit a structure and look for PubMed as a data source and click on one of the links. For example, for Valium here you will see in the datasource table a series of Pubmed IDs next to the PubMed datasource…

  16971504, 17673, 874970, 406430, 17881, 327854, 879884, 577681, 560225, 195649, …

These will link you out to PubMed directly. Try it out…

Now, do we have implementation issues? YES. The lists of external IDs can be long so right now we show only the first 10. We will deal with display of the others shortly. We need to provide a way to curate out “junk” entries. For example, “methyl” is on ChemSpider as a fragment and has links to PubMed IDs…you’ll see why if you click through: the linking was done with text mining. These issues will be resolved but for now we announce that PubMed is structure and substructure searchable via ChemSpider. We will explain how we did it shortly but for now we will acknowledge the massive contribution of our colleagues at SureChem. More to come…
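As an illustration of the “first 10 only” display rule mentioned above, here is a small Python sketch. It is not our production code (ChemSpider runs on .NET), and the URL pattern, the `visible_links` name and the `MAX_SHOWN` constant are assumptions made for the example. The IDs are the ten listed for Valium earlier in this post.

```python
# Assumed URL pattern for a PubMed record; the post does not specify one.
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/{pmid}/"
MAX_SHOWN = 10  # the post says only the first 10 external IDs are displayed


def visible_links(pmids, limit=MAX_SHOWN):
    """Return (links for the first `limit` IDs, count of IDs left hidden)."""
    shown = pmids[:limit]
    links = [PUBMED_URL.format(pmid=p) for p in shown]
    hidden = max(len(pmids) - limit, 0)
    return links, hidden


# The ten PubMed IDs shown for Valium in the post
valium_pmids = [16971504, 17673, 874970, 406430, 17881,
                327854, 879884, 577681, 560225, 195649]

links, hidden = visible_links(valium_pmids)
```

Returning the hidden count alongside the links is one possible design: it would let the record view show something like “and N more” until the full list can be displayed.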

I am very proud of the response from our user base to my request for assistance with curating ChemSpider in regards to carbohydrates. Carbohydrates are complex in nature. They can be represented in linear form and cyclic form, they exist in ChemSpider with a common name but no defined stereochemistry, there are pentoses, hexoses and many stereoisomers per skeleton. There are MANY common carbohydrates with trivial names: Ribose, Arabinose, Xylose, Lyxose, Allose, Altrose, Mannose, Gulose, Idose, Galactose, Talose.

Carbohydrates have been very challenging for us at ChemSpider…many depositors have not been careful with the association between the chemical structure and the associated identifiers. With a chemical structure as the primary key on a record we find confusing associations with structures. For example, a search on Maltotriose as an identifier turns up 5 structures on ChemSpider. Maltotriose is defined on Wikipedia as “trisaccharide (three-part sugar) consisting of three glucose molecules linked with 1,4 glycosidic bonds.” This should mean that it is not appropriate for the identifier maltotriose to be associated with this structure. Based on Wikipedia as a resource, the registry number associated with this structure should also be deleted. How many of the other identifiers should be deleted? Maybe all???

Looking at this record we see identifiers such as: alpha-D-Glc-(1->4)-alpha-D-Glc-(1->4)-D-Glc; alpha-D-Glc, O-alpha-D-glc; GLC-(4-1)GLC-(4-1)GLC-(4-4)GTE and O-alpha-DD-Glucopyranosyl-(1->4)-O-alpha-D-glucopyranosyl-(1->4)-D-glucose. Are these appropriate for this compound?

The challenge for maltotriose is therefore to identify the CORRECT structure associated with that name. “Maybe” it is the structure on Wikipedia but don’t forget that we have an effort underway to validate the structures on Wikipedia and make sure they are correctly associated with the monograph title. Is Maltotriose an identifier for a unique stereoconfiguration or are there alpha- and beta-maltotriose? I am not sure. What needs to be determined is the correct association between structures and identifiers. Incorrect associations should be removed so that they do not turn up the incorrect structures in ChemSpider when searched.
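One way to find candidates for this kind of review is to list every identifier that is attached to more than one structure. The Python sketch below is only an illustration under simple assumptions: the `ambiguous_identifiers` function is a hypothetical name, the record IDs are invented, and identifiers are compared case-insensitively. It does not decide which association is correct; that still needs a curator.

```python
from collections import defaultdict


def ambiguous_identifiers(associations):
    """Map each identifier that points at more than one structure to those structures.

    `associations` is an iterable of (identifier, structure_id) pairs.
    Identifiers are normalized by stripping whitespace and lowercasing.
    """
    by_name = defaultdict(set)
    for identifier, structure_id in associations:
        by_name[identifier.strip().lower()].add(structure_id)
    return {name: sorted(ids) for name, ids in by_name.items() if len(ids) > 1}


# Hypothetical (identifier, record ID) pairs, for illustration only
pairs = [
    ("Maltotriose", 101),
    ("maltotriose", 102),
    ("Maltotriose", 103),
    ("Strychnine", 200),
]

flagged = ambiguous_identifiers(pairs)
```

In this made-up data only maltotriose is flagged, since it is the only name attached to several records.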

This is the start of the validation process for carbohydrates…it’s iterative, complex and hard work. It’s going to begin with giving the group of interested parties curator power on ChemSpider and asking them to work on this challenge. We welcome their assistance. The efforts of contributors like this will be essential.

I’ve had a number of questions about the presentation I gave at ACS Philly last week about document markup. The phrase I keep hearing is “very disruptive” followed by the question “will authors do more work and what’s in it for them?”.

The presentation here outlines the general concept that I talked about…

The basic concept I presented is as follows, with a focus on Chemistry Articles.

A lot of effort is being expended in “text-mining” publications, post-publication, to index these articles and make them searchable not only by text but by the specific language of chemistry, chemical structures. We are specifically asking the question “why extract chemical structures from articles using chemical name conversion approaches and chemical image conversion tools when the structures in the article were ORIGINALLY machine readable?”

We are considering a system whereby authors are asked to contribute to the availability of a free online service for performing structure and substructure-based searches of chemistry articles. While the submission of journal articles is already a lot of work (I know from experience of authoring/co-authoring about 10 a year) we hope that authors will support a service whereby they can upload their own articles to a “validation and mark-up service”. The upload capabilities will support upload of the primary document, chemical structures in standard formats and supplementary information of various types (to be defined).

This system will perform the following services:

1) semi-automated markup of a document – title, author(s), abstract and additional dictionary-based terms plus the ability to use the NLM-DTD markup
2) identification of chemical names and conversion to structures in an automated fashion
3) conversion of structure IMAGES to connection tables using optical structure recognition software (either commercial or open source)
4) ask authors to confirm whether the converted structures are appropriate
5) provide a structure validation service for submitted molecules checking for “accurate representation”
6) Deposit all structures associated with an article onto ChemSpider but under embargo. Associate the article Title, authors and “abstract snippet” with all structures.
7) Issue a set of ChemSpider IDs for the author to submit to the publisher with the article
8) When a publication has passed through review the author can release the structures from embargo using a DOI or an article URL (more common for Open Access articles)
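Steps 6 to 8 above can be sketched as a small data model. This Python example is only an illustration of the embargo idea, not the planned implementation: the `Deposit` class, the function names and the in-memory ID counter are all hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional

_next_id = count(1)  # stand-in for real ChemSpider ID assignment


@dataclass
class Deposit:
    title: str
    authors: List[str]
    structures: List[str]                    # e.g. molfile or InChI strings
    ids: List[int] = field(default_factory=list)
    embargoed: bool = True
    publication_link: Optional[str] = None   # DOI or article URL


def deposit_under_embargo(title, authors, structures):
    """Steps 6-7: store the structures under embargo and issue one ID each."""
    dep = Deposit(title, authors, list(structures))
    dep.ids = [next(_next_id) for _ in dep.structures]
    return dep


def release(dep, publication_link):
    """Step 8: lift the embargo once a DOI or article URL is supplied."""
    if not publication_link:
        raise ValueError("a DOI or article URL is required to release")
    dep.publication_link = publication_link
    dep.embargoed = False
    return dep


def searchable(deposits):
    """Only released deposits are visible to public searches."""
    return [d for d in deposits if not d.embargoed]
```

The point of the sketch is the ordering: the author receives the IDs at deposit time, so they can go to the publisher with the article, while nothing becomes publicly searchable until `release` is called with a DOI or URL.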

The result of this project will be a way for publishers to link their articles directly to a free access chemistry database and use a series of web services to enable other capabilities (to be defined). It will also allow articles in Open Access and non-Open Access publications to be searchable by the “language of chemistry”.

This is only a slice of the overall project but I think it may be of interest relative to the comments you have made below.

Parts of this were shown last week at Drexel University and a particular snippet is available online here:

We are also going to provide a Microsoft Word add-on which will allow users to prepare articles for publishing using similar technologies.

We think this IS disruptive..what say you?

I am looking for someone with a good understanding of carbohydrate chemistry to join the ChemSpider Advisory Group and help us “get carbohydrates right” on the ChemSpider database. It would be sweet if someone could help us clean up the abundance of data on the site and offer us their skills. It might be a good project for a student to work on with us as it will require some research to make sure that we end up with REFERENCE quality data for others to use. Anybody interested?

We are testing out a new 3D molecule optimizer on ChemSpider at present. It appears to be more rugged than our previous algorithm and, in our hands at least, has not yet failed to produce a 3D representation. On general small organics it appears to handle the optimization very well and is quite fast. We welcome your feedback if you have time to test. In order to use the optimizer simply go to a record view, click on Zoom on the Structure tab (see below) and then click on Show 3D as shown in the second image. Since all coordinates are calculated in real time please note that it can take a few seconds from opening the JMol applet to display of the optimized structure (we need to add a “calculating….” display element when we have time).

Things look good for us but we have had a couple of reports of failures and are trying to trace whether it is a browser security setting or not. Please let us know if you have any issues. Thanks

A link to the presentation I gave at ACS-Philly yesterday in Rajarshi Guha’s session is provided below. A lot changes between writing an abstract and writing a talk so I had the chance to highlight an increasing number of papers ALREADY using ChemSpider as one of their platforms of choice to source information from.

Can a Free Access Structure-Centric Community for Chemists Benefit Drug Discovery?

ChemSpider is an online database of over 20 million chemical structures assembled from well over a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) In-silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples. QSAR analyses utilizing the ChemModLab environment for assessing quantitative structure-activity relationships will and screening using a molecular surface descriptor model.

Link to presentation

I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) Pubmed is structure searchable from ChemSpider…we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction of documents SUBMITTED to our site – Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration to ChemSpider

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, deposit their structures to ChemSpider under embargo with the article title, author list and “fractional abstract”. When a publication goes live the author can login, associate a DOI or a URL for the publication (for non-DOI based Open Access publishers) and the structures and article details get lifted from embargo and are immediately available for searching to the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities

6) If we received community support for this it could be game-changing.

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation

Following on from my post regarding copy-paste of “structure-based” blog entries Joerg Wegner has asked some very good questions.

First of all my thanks to him for enhancing what I copied and pasted. He linked to some Wikipedia articles and to some ChemSpider compounds. He is likely familiar with the use of the Wikipedia editor since no manual exists yet for the editor we implemented but he did just fine using it.

Joerg then asked a couple of questions:

1) Is there a version history, or what happens, if someone is messing-up or spamming a certain entry?
2) Can I put pages on my watchlist seeing when others change them?

The answers are as follows:
1) Yes, there is a version history. At present we can revert to the previous version (for now it is a manual operation for us, but work to change that is underway). The version history is presently visible to the curators only as we finish the development of this work. However, the screenshot below shows we know who you are…if someone spams us, and they have done, we issue a warning and ask them to cease and desist and if they don’t we cut off their ability to curate. (Click on the thumbnail to see.)

2) There is no record view watcher in place yet but it is of course on our list. We are more keen to do this when we get more people adopting molecules and taking care of them etc. See this previous blog post if you wish to adopt a molecule…by adopt we mean tend to and nurture – keep it updated with new articles, interesting info etc. I have three of my own.

Joerg also asked “Finally, is Paul’s blog now sub-structure searchable? Well, it links back via ChemSpider, but how does Paul now link to ChemSpider? Only via the comments? If there would be a version history, and he would have added this entry, then people could see it in his ChemSpider profile, like they can for Wikipedia edits.”

It is not possible to search Paul’s blog by substructure directly. Paul used to mark up his articles with InChIs but I think the process was problematic so he’s stopped. What would be excellent is to work with Paul to have him submit all of his structures as CDX files, or an SDF file, with his URL and blog post title to us as a deposition. We would then be able to deposit all of his molecules onto ChemSpider if they don’t exist and make the link to his blog post automatically. We could then set this up and, as his collection grows, people would be able to search all of his blog posts (all structures, not just the title structure). We could of course turn ChemSpider into a “structure-based blogging environment for chemists” by adding the ability to add comments (oh… we did that already). Hmmm… let me think about this. For Paul’s basic style of “structure centric blogging” we already have the ability to host his structures and to deposit and link all structures in his articles back to the central blog post. That blog post could be hosted on ChemSpider and people could comment on it. Probably we would need to lock down the initial blog post so that other people could not change it, but already we see the value of other people editing it, so should we?
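To make the deposition idea a little more concrete, here is a minimal sketch of what one record in such an SDF file might look like, with the blog URL and post title carried as data fields. The field names `BLOG_URL` and `BLOG_POST_TITLE`, the example URL and the hand-written methane molblock are all assumptions for illustration; this is not a description of the format ChemSpider actually requires.

```python
# Sketch only: build one SDF record (molblock + data fields + "$$$$" delimiter).
# Field names and the URL below are hypothetical placeholders.

# A hand-written V2000 molblock for methane, used purely as a stand-in structure.
METHANE_MOLBLOCK = "\n".join([
    "methane",
    "  hand-written placeholder",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
])


def sdf_record(molblock, fields):
    """Return one SDF record: the molblock, then each data item, then '$$$$'."""
    lines = [molblock.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f"> <{name}>")   # data header line
        lines.append(str(value))      # data value line
        lines.append("")              # blank line terminates the data item
    lines.append("$$$$")              # record delimiter
    return "\n".join(lines) + "\n"


record = sdf_record(
    METHANE_MOLBLOCK,
    {
        "BLOG_URL": "https://example.org/blog/some-post",   # hypothetical
        "BLOG_POST_TITLE": "Some structure-centric post",   # hypothetical
    },
)
print(record)
```

One record per structure, all pointing at the same URL and title, would be enough for every structure in a post to link back to it.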

Would anybody like to host a structure centric blog on ChemSpider? Let us know and we’ll help you. We already put links to Molecule of the Day hosted over on ScienceBlogs and have

Back to Joerg’s questions…
“Well, it links back via ChemSpider, but how does Paul now link to ChemSpider? Only via the comments?
If there would be a version history, and he would have added this entry, then people could see it in his ChemSpider profile, like they can for Wikipedia edits.”
Paul doesn’t link to ChemSpider to the best of my knowledge. Would be good if he did. Paul could embed links directly to ChemSpider in his article but would need to search for the matching molecules etc. If he worked with us to set up the deposition process then we could return a list of ChemSpider IDs and he could simply publish them.
Yes to the suggestion regarding having his postings in his ChemSpider profile. All in the plan but not there yet.

The Environmental Protection Agency has provided permission for ChemSpider to utilize their EPI Suite™ software to predict a number of physical properties for the chemicals on the ChemSpider database. The properties include:
KOWWIN™: Estimates the log octanol-water partition coefficient, log KOW, of chemicals using an atom/fragment contribution method.
AOPWIN™: Estimates the gas-phase reaction rate for the reaction between the most prevalent atmospheric oxidant, hydroxyl radicals, and a chemical. Gas-phase ozone radical reaction rates are also estimated for olefins and acetylenes. In addition, AOPWIN™ informs the user if nitrate radical reaction will be important. Atmospheric half-lives for each chemical are automatically calculated using assumed average hydroxyl radical and ozone concentrations.
HENRYWIN™: Calculates the Henry’s Law constant (air/water partition coefficient) using both the group contribution and the bond contribution methods.
MPBPWIN™: Melting point, boiling point, and vapor pressure of organic chemicals are estimated using a combination of techniques.  Included is the subcooled liquid vapor pressure, which is the vapor pressure a solid would have if it were liquid at room temperature.  It is important in fate modeling.
BIOWIN™: Estimates aerobic and anaerobic biodegradability of organic chemicals using 7 different models; two of these are the original Biodegradation Probability Program (BPP™).  The seventh and newest model estimates anaerobic biodegradation potential.
BioHCWIN: Estimates biodegradation half-life for compounds containing only carbon and hydrogen (i.e. hydrocarbons).
PCKOCWIN™: The ability of a chemical to sorb to soil and sediment, its soil adsorption coefficient (Koc), is estimated by this program. EPI’s Koc estimations are based on the Sabljic molecular connectivity method with improved correction factors.
WSKOWWIN™: Estimates an octanol-water partition coefficient using the algorithms in the KOWWIN™ program and estimates a chemical’s water solubility from this value. This method uses correction factors to modify the water solubility estimate based on regression against log Kow.
WATERNT™: Estimates water solubility directly using a “fragment constant” method similar to that used in the KOWWIN™ model.
HYDROWIN™: Acid- and base-catalyzed hydrolysis constants for specific organic classes are estimated by HYDROWIN™. A chemical’s hydrolytic half-life under typical environmental conditions is also determined. Neutral hydrolysis rates are currently not estimated.
BCFWIN™: This program calculates the BioConcentration Factor and its logarithm from the log Kow. The methodology is analogous to that for WSKOWWIN™. Both are based on log Kow and correction factors.
KOAWIN: KOA is the octanol/air partition coefficient and has multiple uses in chemical assessment.  The model estimates KOA using the ratio of the octanol/water partition coefficient (KOW) from KOWWIN™, and the dimensionless Henry’s Law constant (KAW) from HENRYWIN™.
AEROWIN™: Estimates the fraction of airborne substance sorbed to airborne particulates, i.e. the parameter phi (φ), using three different methods.  AEROWIN™ results are also displayed with AOPWIN™ output as an aid in interpretation of the latter.
WVOLWIN™: Estimates the rate of volatilization of a chemical from rivers and lakes; calculates the half-life for these two processes from their rates. The model makes certain default assumptions (water body depth, wind velocity, etc.).
STPWIN™: Using several outputs from EPI Suite™, this program predicts the removal of a chemical in a Sewage Treatment Plant; values are given for the total removal and three contributing processes (biodegradation, sorption to sludge, and stripping to air) for a standard system and set of operating conditions.
LEV3EPI™: This level III fugacity model predicts partitioning of chemicals between air, soil, sediment, and water under steady state conditions for a default model “environment”; various defaults can be changed by the user.
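As a rough illustration of the KOAWIN relationship described in the list above, the sketch below converts a Henry’s Law constant into the dimensionless KAW and then takes the ratio KOW/KAW in log space. The function names are my own, and the conversion KAW = H/(RT) at 25 deg C is an assumption about how the dimensionless constant is obtained; the input numbers are the Xanax values quoted in this post (bond-method Henry’s Law constant of 9.77E-012 atm-m3/mole and an experimental log Kow of 2.12).

```python
import math

# Gas constant in atm*m3/(mol*K), so that H / (R*T) is dimensionless.
R_ATM_M3 = 8.205736e-5


def log_kaw_from_henry(h_atm_m3_per_mol, temp_k=298.15):
    """log10 of the dimensionless air/water partition coefficient, Kaw = H/(R*T)."""
    return math.log10(h_atm_m3_per_mol / (R_ATM_M3 * temp_k))


def log_koa(log_kow, log_kaw):
    """Koa = Kow / Kaw, so the logarithms simply subtract."""
    return log_kow - log_kaw


log_kaw = log_kaw_from_henry(9.77e-12)   # Henry's LC, bond method
print(round(log_kaw, 3))                 # compare with "Log Kaw used" in the output
print(round(log_koa(2.12, log_kaw), 3))  # compare with the KOAWIN estimate
```

Under these assumptions the two printed values can be compared directly with the “Log Kaw used” and “Log Koa (KOAWIN v1.10 estimate)” lines in the Xanax output.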

The values for individual structures are available in the Record View under the EPI Summary.

For example, the information for Xanax is below.

 Log Octanol-Water Partition Coef (SRC):
    Log Kow (KOWWIN v1.67 estimate) =  3.87
    Log Kow (Exper. database match) =  2.12
       Exper. Ref:  BioByte (1995)

 Boiling Pt, Melting Pt, Vapor Pressure Estimations (MPBPWIN v1.42):
    Boiling Pt (deg C):  441.81  (Adapted Stein & Brown method)
    Melting Pt (deg C):  185.42  (Mean or Weighted MP)
    VP(mm Hg,25 deg C):  1.65E-008  (Modified Grain method)
    Subcooled liquid VP: 7.84E-007 mm Hg (25 deg C, Mod-Grain method)

 Water Solubility Estimate from Log Kow (WSKOW v1.41):
    Water Solubility at 25 deg C (mg/L):  13.1
       log Kow used: 2.12 (expkow database)
       no-melting pt equation used

 Water Sol Estimate from Fragments:
    Wat Sol (v1.01 est) =  0.15855 mg/L

 ECOSAR Class Program (ECOSAR v0.99h):
    Class(es) found:
       Aliphatic Amines
Henrys Law Constant (25 deg C) [HENRYWIN v3.10]:
   Bond Method :   9.77E-012  atm-m3/mole
   Group Method:   Incomplete
 Henrys LC [VP/WSol estimate using EPI values]:  5.117E-010 atm-m3/mole

 Log Octanol-Air Partition Coefficient (25 deg C) [KOAWIN v1.10]:
  Log Kow used:  2.12  (exp database)
  Log Kaw used:  -9.399  (HenryWin est)
      Log Koa (KOAWIN v1.10 estimate):  11.519
      Log Koa (experimental database):  None

 Probability of Rapid Biodegradation (BIOWIN v4.10):
   Biowin1 (Linear Model)         :   0.6009
   Biowin2 (Non-Linear Model)     :   0.2660
 Expert Survey Biodegradation Results:
   Biowin3 (Ultimate Survey Model):   2.2574  (weeks-months)
   Biowin4 (Primary Survey Model) :   3.1733  (weeks       )
 MITI Biodegradation Probability:
   Biowin5 (MITI Linear Model)    :  -0.1488
   Biowin6 (MITI Non-Linear Model):   0.0042
 Anaerobic Biodegradation Probability:
   Biowin7 (Anaerobic Linear Model): -0.4906
 Ready Biodegradability Prediction:   NO

Hydrocarbon Biodegradation (BioHCwin v1.01):
    Structure incompatible with current estimation method!

 Sorption to aerosols (25 Dec C)[AEROWIN v1.00]:
  Vapor pressure (liquid/subcooled):  0.000105 Pa (7.84E-007 mm Hg)
  Log Koa (Koawin est  ): 11.519
   Kp (particle/gas partition coef. (m3/ug)):
       Mackay model           :  0.0287
       Octanol/air (Koa) model:  0.0811
   Fraction sorbed to airborne particulates (phi):
       Junge-Pankow model     :  0.509
       Mackay model           :  0.697
       Octanol/air (Koa) model:  0.866 

 Atmospheric Oxidation (25 deg C) [AopWin v1.92]:
   Hydroxyl Radicals Reaction:
      OVERALL OH Rate Constant =   7.6246 E-12 cm3/molecule-sec
      Half-Life =     1.403 Days (12-hr day; 1.5E6 OH/cm3)
      Half-Life =    16.834 Hrs
   Ozone Reaction:
      No Ozone Reaction Estimation
   Fraction sorbed to airborne particulates (phi): 0.603 (Junge,Mackay)
    Note: the sorbed fraction may be resistant to atmospheric oxidation

 Soil Adsorption Coefficient (PCKOCWIN v1.66):
      Koc    :  2.151E+006
      Log Koc:  6.333 

 Aqueous Base/Acid-Catalyzed Hydrolysis (25 deg C) [HYDROWIN v1.67]:
    Rate constants can NOT be estimated for this structure!

 Bioaccumulation Estimates from Log Kow (BCFWIN v2.17):
   Log BCF from regression-based method = 0.932 (BCF = 8.559)
       log Kow used: 2.12 (expkow database)

 Volatilization from Water:
    Henry LC:  9.77E-012 atm-m3/mole  (estimated by Bond SAR Method)
    Half-Life from Model River: 1.053E+008  hours   (4.388E+006 days)
    Half-Life from Model Lake : 1.149E+009  hours   (4.786E+007 days)

 Removal In Wastewater Treatment:
    Total removal:               2.37  percent
    Total biodegradation:        0.10  percent
    Total sludge adsorption:     2.27  percent
    Total to Air:                0.00  percent
      (using 10000 hr Bio P,A,S)

 Level III Fugacity Model:
           Mass Amount    Half-Life    Emissions
            (percent)        (hr)       (kg/hr)
   Air       0.000217        33.7         1000
   Water     21              900          1000
   Soil      78.9            1.8e+003     1000
   Sediment  0.094           8.1e+003     0
     Persistence Time: 1.48e+003 hr
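For anyone wanting to sanity-check the atmospheric oxidation numbers above, here is a small sketch that treats the hydroxyl radical reaction as pseudo-first-order, so the half-life is ln 2 divided by the rate constant times the OH concentration. The function name is hypothetical, and the pseudo-first-order treatment is my assumption about how the reported half-life is derived; the rate constant and the 1.5E6 OH/cm3 concentration are taken from the AopWin block.

```python
import math


def oh_half_life_hours(k_oh_cm3_per_molecule_s, oh_per_cm3=1.5e6):
    """Half-life in hours for pseudo-first-order reaction with OH radicals."""
    seconds = math.log(2) / (k_oh_cm3_per_molecule_s * oh_per_cm3)
    return seconds / 3600.0


hours = oh_half_life_hours(7.6246e-12)  # OVERALL OH rate constant from the output
days_12h = hours / 12.0                 # the output counts a "day" as 12 sunlit hours

print(round(hours, 2), "hours")
print(round(days_12h, 3), "days (12-hr day)")
```

The result agrees with the reported 16.834 hours and 1.403 days to within rounding of the printed rate constant.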

We started the calculations a number of weeks ago and are updating our progress on the ChemSpider Forum here. We now have values predicted for 3 million compounds.

It is NOT possible at present to search on these properties in the same way that other properties can be searched on the Search Predicted Properties page as shown below.

After all EPI Suite properties are predicted we will selectively make some of these available for searching. The interest so far appears to be in Henry’s Law values, Water Solubility and Melting Point (something that is very difficult to predict with accuracy!). We welcome your comments.

We will be able to extract experimental values for some properties and display directly. For example, logP shows an “experimental database match” for Xanax.

Log Octanol-Water Partition Coef (SRC):
Log Kow (KOWWIN v1.67 estimate) = 3.87
Log Kow (Exper. database match) = 2.12

Exper. Ref: BioByte (1995)
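As a sketch of how such experimental values might be pulled out of the EPI summary text, the snippet below uses a regular expression to pick up both the estimated and the experimental log Kow lines. The function name and the dictionary keys are hypothetical, and this is only an illustration of the idea, not the extraction code we will actually run.

```python
import re

# The logP block from the Xanax EPI summary, as pasted in this post.
SUMMARY = """\
Log Octanol-Water Partition Coef (SRC):
    Log Kow (KOWWIN v1.67 estimate) =  3.87
    Log Kow (Exper. database match) =  2.12
       Exper. Ref:  BioByte (1995)
"""

# Matches lines such as "Log Kow (<label>) = <number>".
LOG_KOW_LINE = re.compile(
    r"Log Kow \((?P<label>[^)]*)\)\s*=\s*(?P<value>-?\d+(?:\.\d+)?)"
)


def parse_log_kow(text):
    """Return a dict with 'estimated' and/or 'experimental' log Kow values."""
    result = {}
    for match in LOG_KOW_LINE.finditer(text):
        is_experimental = match.group("label").startswith("Exper.")
        key = "experimental" if is_experimental else "estimated"
        result[key] = float(match.group("value"))
    return result


values = parse_log_kow(SUMMARY)
print(values)
# Gap between the KOWWIN estimate and the experimental database match.
print(round(values["estimated"] - values["experimental"], 2))
```

A record with no experimental match would simply come back without the 'experimental' key, which makes it easy to decide whether there is anything to display.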

It is going to take a number of weeks to generate EPI Suite values for 21.5 million molecules but we are moving in that direction. Our sincere thanks to the EPA for allowing us to use their EPI Suite software on ChemSpider for the benefit of the community.

I have spoken on this blog many times about the challenges of cleaning up data in chemistry databases. We’re expending a lot of effort, with the assistance of many others, in cleaning up the data on ChemSpider and, as a benefit, assisting in cleaning up data in other databases also. The efforts to curate the chemical structure data on Wikipedia continue and the work is now focused on delivering ‘bots that will drive a cleansed data file to the individual records. Over the past few months I have developed a great appreciation for the efforts, dedication and commitment of the many contributors to Wikipedia Chemistry. There are many tens of people editing and contributing to the articles and then there is the “core WP:Chem team” who show up for the IRC chats most Tuesdays at noon. Many of the past weeks have focused on how to curate the data, utilize ‘bots and control curated data moving forward. I am honored to share “IRC-space” with them!

Over the past few weeks I have been similarly blessed to interact with the ChEBI team via email as we have done our work to deposit their Entities of the Month (1,2). During the process of doing so we have exchanged many emails and have cleaned a number of errors in our mutual datasets. In my opinion a PERFECT example of the results of such detailed efforts is for Vancomycin. One week ago a search on vancomycin would give a dozen hits. Many of these had incomplete stereochemistry. Now a search on ChemSpider gives one hit for vancomycin here. This is the result of working with Kirill Degtyarenko at ChEBI. The conversation was initiated by my observation regarding stereo in the structure on ChEBI.

For details on how this is identified to be the correct structure read the description on that page. VERY DETAILED and includes links out to three publications.

Compare this with a search for vancomycin on PubChem giving 66 hits. Some of these differences are due to the different approaches for our text searches – the PubChem results list includes VANCOMYCIN HYDROCHLORIDE and Gatifloxacin & Vancomycin for example. However, there are a number of “vancomycins” also.

We believe we have the correct vancomycin identified at this point…we welcome any challengers!

Thanks to the efforts of contributors such as Heinz Kolshorn new compounds and associated analytical data are finding their way onto ChemSpider on a regular basis. These are chemical compounds that have been synthesized and fully characterized. Unless they are published they are unlikely to find their way into chemical registry systems or into training databases for the commercial NMR prediction packages such as those of ACD/Labs, Bio-Rad, Modgraph or Wolfgang Robien’s collection. As a result this type of information will be “Lost Chemistry“. These particular data from Heinz will almost certainly find their way into the NMRShiftDB since Heinz is hosting the database at his lab at the University of Mainz.

Heinz has been putting actual experimental spectra and the associated shift assignments onto ChemSpider of late. An example is here. This is enabled by our ability to upload and store both spectra and images. There are better ways to display the shift assignments, such as allowing mouseover display of the structure and peak associations. This is not yet available on the system but is clearly a nice-to-have. For now the information is there for others to use and is indicative of the value of integrating images and spectral data. I can envisage other pairings, such as UV spectra versus a photo of the colored solution, for example.

Over the past few months we have recognized those people who have spent their time contributing to the content of ChemSpider either as depositors or curators. Recently I commented about one of our Advisory Group, Chris Singleton, taking on a major project to deposit spectral data to ChemSpider. If you visit the spectral data page and scroll through you will see that there are now 33 pages of spectra, each page containing 20 spectra. The majority of these are NMR spectra and the largest single collection is that deposited by Chris over the past few weeks. The data were obtained from the Madison Metabolomics Consortium Database and described in a publication by Q. Cui, et al; “Metabolite identification via the Madison Metabolomics Consortium Database”, Nature Biotechnology, 26,162 (2008). Our sincere thanks to Chris for all of his work!

There is another raft of spectra waiting to be processed and deposited so the spectral data collection will continue to grow.

We have just completed the deposition of >180,000 compounds from Vitas-M onto our database. The data source details are here.

From the data sources page: “Vitas-M Laboratory, Ltd. is a major supplier of drug like organic compound libraries for High-Throughput Screening and Combinatorial Chemistry. They produce these compound libraries in their own laboratory located in Moscow and acquire them from their partners located at different chemical laboratories in the former USSR. They have well-established business relationships with many high professional chemists from the different scientific disciplines. All of the researchers have an extensive experience in synthetic organic chemistry or other related fields.”

Now that our batch deposition process is up and running for large depositions we are presently adding >50,000 compounds per day to the database until we have cleaned out our backlog of >1,000,000 compounds. Clearly not all of these will be unique but there are certainly many thousands of new structures being added daily. You might experience some minor detrimental impact in terms of search speed as these depositions run. Our apologies.