Archive for the Quality and Content Category

Many of us using ChemSpider are looking for compounds of interest to us. In some cases those chemical entities are not of fleeting interest but something that we are working on in our research, have a hobbyist interest in, or have some other reason to keep tracking.

With this in mind we have now allowed any user to “monitor an article”. What this means is that when new information is associated with an article (new outlinks, new forms of data, new publications, associated spectra etc.) an email will be sent to you making you aware of the new information. In order to monitor an article simply log in as a registered user and click on the “Monitor This Article” button. If you want to discontinue monitoring in the future simply return to the article and click on “Cancel Article Monitor”. We’d like a few people to help test this process for us and provide us with feedback. Keep your eye on those molecules of interest to you with Article Monitoring.

Today I had the privilege of meeting with many members of the team creating the RCSB Protein Data Bank. This resulted from the wonderful networking opportunity offered by the Scifoo camp held earlier this year at Google where I met Helen Berman, director of the PDB team, part of the worldwide Protein Data Bank. Helen and I shared some conversations sitting outside the Google offices in California and shared our opinions and visions regarding the quality of small molecule data available online. Today was an opportunity to take those conversations further, meet with members of the team and determine whether ChemSpider’s efforts could bring benefit to the PDB in terms of our curation efforts and whether ChemSpider users could benefit from having access to information on the PDB via hosting of the PDB ligand dictionary.

I gave a presentation (online here and based on others I have delivered previously) and received a one-on-one review of the deposition and curation processes of the PDB. I also participated in a group discussion about how to continue the stringent and exacting process of validation and curation associated with small molecule structure sets. We discussed the complex relationships between systematic names, trivial names, registry IDs, database IDs, tautomers, charged states, SMILES and InChIs. It was a particularly validating day to spend time with a group of people who have responsibility for building one of the most valuable resources in the world and have faced the many challenges associated with validating structure-based data. There is a distinction between people who talk about what it takes to curate structure collections and those who actually do the job for a living. This team is made up of dedicated, passionate and skilled individuals who deeply care about the quality of their data and who do the heavy lifting and grunt work so that the users of the PDB enjoy the benefits. They have been working on a multi-year process to curate and improve the PDB data and are in the final major phase of the effort to clean up the archive and apply the processes to all new data moving forward. ChemSpider and PDB will be more integrated in the near future and we look forward to supporting their efforts to provide high quality structure data to the community and continuing to expand the network of integrated online chemistry.

I announced in July of this year that we were performing predictions using the EPISuite of prediction tools. I’m glad to say that one of our servers is now in “cooling mode” after running red hot for over 4 months. We’ve been feeding in all single-component ChemSpider entities (non-radicals) with a molecular weight below 500. The results are now posted on ChemSpider under the EPISuite tab. We hope you find them of value and offer our thanks to the EPA for providing us access to the software.
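The selection rule described above (single component, molecular weight below 500, no radicals) can be sketched as a simple filter. This is only an illustration: the `Entity` record layout, the function names and the sample records are assumptions made for the example, not ChemSpider's actual schema or code.

```python
from dataclasses import dataclass

MAX_MOLECULAR_WEIGHT = 500.0  # cutoff quoted in the post


@dataclass(frozen=True)
class Entity:
    """Hypothetical record layout, not ChemSpider's real schema."""
    name: str
    component_count: int
    molecular_weight: float
    is_radical: bool


def eligible_for_epi_suite(entity: Entity) -> bool:
    """Apply the three criteria named in the post."""
    return (
        entity.component_count == 1
        and entity.molecular_weight < MAX_MOLECULAR_WEIGHT
        and not entity.is_radical
    )


def select_batch(entities):
    """Keep only the entities that would be sent for prediction."""
    return [e for e in entities if eligible_for_epi_suite(e)]


# Illustrative records with made-up values.
records = [
    Entity("small neutral molecule", 1, 308.8, False),
    Entity("large natural product", 1, 853.9, False),
    Entity("two-component salt", 2, 345.2, False),
    Entity("radical fragment", 1, 15.0, True),
]
batch = select_batch(records)
```

Only the first record passes all three checks in this made-up batch.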

A lot of people have been helping to improve the quality of ChemSpider content by depositing new data and “cleaning up” errors in the data over the past few months. It’s been a long climb. Our thanks to all of you who have contributed. I’ll be the first one to put my hand up and acknowledge that in some ways I have not made the act of contributing to the curation process very easy, since I’ve been feeding the guidance out via the blog in chunks as it has developed. Following a recent “long flight” I am happy to announce that the Curators Handbook/Bible is now available online here in its first form. This document gives some pretty detailed guidance regarding how to curate the ChemSpider database. As always we welcome feedback. If something is not clear let us know and we will expand/enhance as appropriate.

What I also want to do is to thank those people who have commented on how truly impressed they are with the rate at which we are cleaning the data. In general most curation requests identified on the site are addressed within 24 hours. There are some issues hanging out there that we don’t have solutions for at present, specifically in regards to organometallic data handling, but we are still thinking about a path forward.

It is finally time to roll out more attractive structure depictions. We have needed them for a while, but they have become an absolute must-have as we roll out the following new capabilities:

1) The ability to make YOUR chemical blog structure searchable (watch this space…). We suggested one path previously…this is BETTER…

2) Structure balloons for use with our document markup tools, both browser-based and Microsoft Word based

We all judge quality of visual aesthetics quickly. We know a good structure when we see one. This is an announcement that we will be rolling out new structures across the site in the next few days. You will see better looking structures showing up across the site – during deposition, during service-based predictions, during searches and, well, everywhere. While not perfect as yet a little more tweaking and the entire database will be supported by the new structure depiction algorithms. As it is you should see some examples now on the database…one shown below. We welcome your feedback!

Frequent users of ChemSpider might have noticed a change in the layout of the record view pages of late. As we layer more information onto a record view page (EPI Suite predictions, SimBioSys LASSO scores, spectral data, MORE predictions to come) the record view pages become increasingly heavy. As a result we have had to balance increasingly heavy pages against user experience. Since we have recently added the ability to perform structure searching on Pubmed and are now in the process of adding a new update for Patent searching, we have chosen to hide the Data Source outlinks until you choose to see them.

So, if you are looking for original data sources and a list of potential commercial vendors please click on the button indicated below to fold out the list. Commercial vendors are indicated as discussed previously here.

Users of ChemSpider might have noticed some performance issues in the past 2-3 weeks with our web services, service availability and speed of searches. I put my hand in the air and say “Yup, acknowledged”. Hopefully they have not been too disruptive, BUT it is ultimately for the overall benefit of the service. We have been streaming in 8 MILLION links to Pubmed in order to make Pubmed structure and substructure searchable. We are NOT rolling this out with full fanfare yet but I do want to explain the performance issues you might be experiencing. We work on Microsoft technology and while we are advocates for the platforms of .NET, IIS and SQL Server we definitely are putting them under pressure as we keep expanding the database and adding more value. We have thoughts about how to resolve this but want to finish populating the tables first.

The upside….the majority of links are already in place. For an example visit a structure and look for PubMed as a data source and click on one of the links. For example, for Valium here you will see in the datasource table a series of Pubmed IDs next to the PubMed datasource…

  16971504, 17673, 874970, 406430, 17881, 327854, 879884, 577681, 560225, 195649, …

These will link you out to PubMed directly. Try it out…

Now, do we have implementation issues? YES. The lists of external IDs can be long so right now we show only the first 10. We will deal with display of the others shortly. We also need to provide a way to curate out “junk” entries. For example, “methyl” is on ChemSpider as a fragment and has links to PubMed IDs…you’ll see why if you click them…it was done with text mining. These issues will be resolved but for now we announce that PubMed is structure and substructure searchable via ChemSpider. We will explain how we did it shortly but for now we will acknowledge the massive contribution of our colleagues at SureChem. More to come…
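The “show only the first 10” behaviour can be sketched as below. This is a minimal illustration, not the site's code: the function name is hypothetical, and the URL pattern is today's public PubMed address form, assumed here for the example. The IDs are the ones listed for Valium above.

```python
# Assumption: current public PubMed URL pattern, used only for illustration.
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


def preview_pubmed_links(pmids, limit=10):
    """Return links for the first `limit` IDs plus a count of the IDs left out."""
    shown = list(pmids[:limit])
    links = [PUBMED_URL.format(pmid=pmid) for pmid in shown]
    hidden_count = len(pmids) - len(shown)
    return links, hidden_count


# The ten PubMed IDs shown for Valium in the post.
valium_pmids = [16971504, 17673, 874970, 406430, 17881,
                327854, 879884, 577681, 560225, 195649]

# A smaller limit makes the truncation visible with only ten IDs to hand.
links, hidden = preview_pubmed_links(valium_pmids, limit=3)
```

Returning the hidden count alongside the links leaves room for a “show the rest” control once the display of longer lists is dealt with.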

There has been an outpouring of offers from the ChemSpider community in terms of helping to examine/clean and enhance information regarding carbohydrates on ChemSpider. Almost 2 dozen users have now made an offer to help. Very exciting really!

I’ve already outlined the necessity to improve the quality of associations between structures and identifiers on the database. However, I am also hoping that users will write articles about carbohydrates using the rich-text formatting capabilities (ADD Description), will add spectra if they have them, will link up articles if they have interesting papers and will add URLs to interesting online content also.

We have now delivered the ability to curate and enhance records on ChemSpider and look forward to having our users help, starting with Carbohydrates…

As the number of spectra uploaded to ChemSpider increases (and it is now increasing at quite a rate) we have noticed that the loading time for records with a large number of spectra can be very long, especially if the spectra are “heavy”, for example C13 spectra acquired at high frequency and with zero-filling. When there are a number of such spectra on one record the challenge is even greater.

With this in mind we have introduced the ability to Load a Spectrum when the user wants to see the spectrum and not automatically on loading the page. An example is shown here for recently uploaded spectra from the Drexel University laboratory of Jean-Claude Bradley.

Please test it out and let us know if you see any issues. The example listed above has a “heavy” C13 spectrum so loading might take a while.

An announcement was made on the Blue Obelisk Discussion List this week regarding a new database of 4 million molecules at present but up to 50 million molecules in the future. It is called molecules.gnu-darwin.org/ and is listed with the following comments:

“Some facts: The Molecules website contains more than 4 million small molecule structure files in pdb format, and molecular graphics representations. About 50 million molecules are still in the pipe, and they are expected to appear here over the course of the next few weeks and months. The pdb format is readable by common FOSS molecule viewer software, such as RasMol and PyMOL. In due course, we plan to provide high quality structures via energy minimization refinement, and additional resources.

Molecules@gnu-darwin.org is founded in the spirit of free software, open source, and public access. It is hoped that access to these files will be a wonderful community resource for science education, research, and entertainment as well. We are looking for investment or funding to expedite and expand this work, and lead the field, with an eye towards an advanced, complete, synthetic, structural, and informatical bioorganome. Meanwhile, the site is already an exceptional lab resource, and molecular catalog, providing the means and building blocks towards additional novel structures. We aim to be the best.

The structural biology, protein crystallography, and molecular graphics talent that is building the Molecules archive is available to work for you in a contract or consulting arrangement. Wide-ranging expertise is available. Molecules@gnu-darwin.org is built entirely with FOSS, free and open source software, GNU-Darwin OS, and it is under the aegis of The GNU-Darwin Distribution. Here is a link to the Distribution résumé. Our founder is an X-ray laboratory admin for the Department of Biophysics and Biophysical Chemistry of Johns Hopkins University School of Medicine. You can also read his CV. We would like to build a community around this website, and we are looking for volunteers and collaborators to help. Regarding any aspect of the work of this site, please feel free to contact us, molecules@gnu-darwin.org, with gdmolecules in the subject line. Cheers!”

I’m always interested in potential databases to connect to that will add additional capabilities and diversity to ChemSpider’s information. I have browsed the database and searched on some common molecules (Xanax, aspirin, Taxol and others) and found no hits. This seemed strange, but the site does say “Search warning: not yet fully spidered”.

The statement that there are 50 million molecules in total coming suggests that the database is a republication of PubChem and the SDF archives seem to suggest so too since they redirect to PubChem for the download: http://molecules.gnu-darwin.org/ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/

At present the database therefore appears to be the PubChem database in PDB format. I hope that there is some additional information added to warrant our linking to this new database.

We have added the compound collection from Trans World Chemicals to ChemSpider. This is a collection of almost 1600 compounds. The collection can be viewed here.

ChemSpider has been working hard to support Wikipedia for a number of months now. We have been curating the structures on Wikipedia, I have been an active member of the WP:Chem team, we have extended our integration of Wikipedia to show the lead of the Wikipedia article on associated record views, and we have a lot of background activities going on regarding Wikipedia at present (info will be released shortly). New articles are released on Wikipedia on an ongoing basis and we stay up to date as best we can by monitoring bots for updates. Harvesting monographs out of Wikipedia based only on ChemBoxes and Drugboxes is not sufficient, since not every article about drugs and chemicals on Wikipedia has an associated Drugbox or ChemBox.

For example… You have likely heard of Rember for Alzheimers already? A search on Google for Rember Alzheimers will give about 2 million hits. It’s already being discussed in the blogosphere, including Derek Lowe’s In the Pipeline. Rember turns out to be methylene blue. There is already an article on Wikipedia about Rember but there is no chembox as yet. As I was researching Rember out of interest I noticed we did not have methylene blue linked to Wikipedia and Rember wasn’t associated with methylene blue. Adding the name was of course easy: 5 seconds of work after login.

We have now added the ability to associate data sources directly too. What does this mean? On a record view page is a list of “Data Sources” associated with a compound. This is where depositions about a compound came from and, generally, links back to the associated web pages. Previously, in order to populate the Data Source table it was necessary to deposit the structure and associated info as an SDF file. TOO MUCH work. So, now we have made it easy. To add a data source simply log in and select “Edit” (top right hand side of the data source table). Then click Add and input the information into the pop-up box. The input is the name to be listed in the Data Source table, the URL to the information on the Data Source page (if info exists) and the name of the Data Source. There is one caveat when adding such links: the data source must exist. If you want to add data associated with your own website you need to register yourself, add a Data Source and wait for us to approve it.

Wikipedia is a special case since when the link is made we grab the lead of the article directly and show it in the Record View. For methylene blue there are two related Wikipedia articles so we have linked to them both, as you can see on the record view. Simply go to ChemSpider and search for rember and you’ll see two linked Wikipedia articles.

We’ve been enhancing our deposition system so that the addition of 10s of thousands of new compounds to ChemSpider doesn’t have too big an impact on the performance of ChemSpider. The deposition of every structure demands the calculation of associated properties and deduplication against the database and needed to be optimized. As a result of our improved processing we are now cleaning up our backlog of new structures, something which is well overdue we know but we didn’t want to overly stress the servers for our users. New data are now on the database from the following companies. There are more to come…

In keeping with our commitment to continue to index Open Access journals for searching on ChemSpider we are happy to announce our indexing of Libertas Academica. Most people I have spoken to about our indexing of Open Access journals have never heard of this Open Access publisher. Libertas Academica offers “Open access journals on clinical medicine, bioinformatics, biology, chemistry, pharmacology, gene signalling, systems biology, informatics, virology, substance abuse, translational science and complimentary medicine.” I know of LA-press because of their Analytical Chemistry Insights journal.

Their list of Popular Journals is given below and their full list of journals is given on the third tab.

The publisher allows direct commenting on articles on their website as shown here for their article on “High-Performance Liquid Chromatographic Method for Determination of Phenytoin in Rabbits Receiving Sildenafil” (This article is already linked from the structures of Phenytoin and Sildenafil)

Following our previous approach of using Taxol and Paclitaxel as a measure of potential contribution to search results on ChemSpider, searching Libertas Academica gives 6 hits on Taxol while a search on Paclitaxel gives 23 hits.

Our growing list of Open Access Publishers is rather impressive at this point…see below. It will continue to grow.

The Environmental Protection Agency has provided permission for ChemSpider to utilize their EPI Suite™ software to predict a number of physical properties for the chemicals on the ChemSpider database. The properties include:
KOWWIN™: Estimates the log octanol-water partition coefficient, log KOW, of chemicals using an atom/fragment contribution method.
AOPWIN™: Estimates the gas-phase reaction rate for the reaction between the most prevalent atmospheric oxidant, hydroxyl radicals, and a chemical. Gas-phase ozone radical reaction rates are also estimated for olefins and acetylenes. In addition, AOPWIN™ informs the user if nitrate radical reaction will be important. Atmospheric half-lives for each chemical are automatically calculated using assumed average hydroxyl radical and ozone concentrations.
HENRYWIN™: Calculates the Henry’s Law constant (air/water partition coefficient) using both the group contribution and the bond contribution methods.
MPBPWIN™: Melting point, boiling point, and vapor pressure of organic chemicals are estimated using a combination of techniques. Included is the subcooled liquid vapor pressure, which is the vapor pressure a solid would have if it were liquid at room temperature. It is important in fate modeling.
BIOWIN™: Estimates aerobic and anaerobic biodegradability of organic chemicals using 7 different models; two of these are the original Biodegradation Probability Program (BPP™).  The seventh and newest model estimates anaerobic biodegradation potential.
BioHCWIN: Estimates biodegradation half-life for compounds containing only carbon and hydrogen (i.e. hydrocarbons).
PCKOCWIN™: The ability of a chemical to sorb to soil and sediment, its soil adsorption coefficient (Koc), is estimated by this program. EPI’s Koc estimations are based on the Sabljic molecular connectivity method with improved correction factors.
WSKOWWIN™: Estimates an octanol-water partition coefficient using the algorithms in the KOWWIN™ program and estimates a chemical’s water solubility from this value. This method uses correction factors to modify the water solubility estimate based on regression against log Kow.
WATERNT™: Estimates water solubility directly using a “fragment constant” method similar to that used in the KOWWIN™ model.
HYDROWIN™: Acid- and base-catalyzed hydrolysis constants for specific organic classes are estimated by HYDROWIN™. A chemical’s hydrolytic half-life under typical environmental conditions is also determined. Neutral hydrolysis rates are currently not estimated.
BCFWIN™: This program calculates the BioConcentration Factor and its logarithm from the log Kow. The methodology is analogous to that for WSKOWWIN™. Both are based on log Kow and correction factors.
KOAWIN: KOA is the octanol/air partition coefficient and has multiple uses in chemical assessment. The model estimates KOA using the ratio of the octanol/water partition coefficient (KOW) from KOWWIN™, and the dimensionless Henry’s Law constant (KAW) from HENRYWIN™.
AEROWIN™: Estimates the fraction of airborne substance sorbed to airborne particulates, i.e. the parameter phi (φ), using three different methods. AEROWIN™ results are also displayed with AOPWIN™ output as an aid in interpretation of the latter.
WVOLWIN™: Estimates the rate of volatilization of a chemical from rivers and lakes; calculates the half-life for these two processes from their rates. The model makes certain default assumptions-water body depth; wind velocity; etc.
STPWIN™: Using several outputs from EPI Suite™, this program predicts the removal of a chemical in a Sewage Treatment Plant; values are given for the total removal and three contributing processes (biodegradation, sorption to sludge, and stripping to air.) for a standard system and set of operating conditions.
LEV3EPI™: This level III fugacity model predicts partitioning of chemicals between air, soil, sediment, and water under steady state conditions for a default model “environment”; various defaults can be changed by the user.
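As a worked illustration of the KOAWIN description in the list above: because KOA is the ratio KOW/KAW, on a log scale it is simply log Koa = log Kow − log Kaw. The sketch below reproduces the figures that appear in the Xanax output further down. It assumes the dimensionless KAW is obtained from the Henry's Law constant as H/(RT) at 25 deg C; that conversion is a standard relation rather than something stated in the text above, and the function names are made up for the example.

```python
import math

R_ATM_M3_PER_MOL_K = 8.205736e-5  # gas constant in m3*atm/(mol*K)
T_KELVIN = 298.15                 # 25 deg C


def log_kaw_from_henry(henry_atm_m3_per_mol):
    """Dimensionless air/water partition coefficient (log10), assuming Kaw = H/(RT)."""
    return math.log10(henry_atm_m3_per_mol / (R_ATM_M3_PER_MOL_K * T_KELVIN))


def log_koa(log_kow, log_kaw):
    """Koa = Kow / Kaw, so the logs subtract."""
    return log_kow - log_kaw


# Values taken from the Xanax output in this post.
henry_bond_method = 9.77e-12   # atm-m3/mole (HENRYWIN bond method)
log_kow_experimental = 2.12    # experimental database match

log_kaw = round(log_kaw_from_henry(henry_bond_method), 3)
log_koa_value = round(log_koa(log_kow_experimental, log_kaw), 3)
```

With these inputs `log_kaw` and `log_koa_value` come out at the -9.399 and 11.519 reported in the KOAWIN section of the output.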

The values for individual structures are available in the Record View under the EPI Summary.

For example, the information for Xanax is below.

 Log Octanol-Water Partition Coef (SRC):
    Log Kow (KOWWIN v1.67 estimate) =  3.87
    Log Kow (Exper. database match) =  2.12
       Exper. Ref:  BioByte (1995)

 Boiling Pt, Melting Pt, Vapor Pressure Estimations (MPBPWIN v1.42):
    Boiling Pt (deg C):  441.81  (Adapted Stein & Brown method)
    Melting Pt (deg C):  185.42  (Mean or Weighted MP)
    VP(mm Hg,25 deg C):  1.65E-008  (Modified Grain method)
    Subcooled liquid VP: 7.84E-007 mm Hg (25 deg C, Mod-Grain method)

 Water Solubility Estimate from Log Kow (WSKOW v1.41):
    Water Solubility at 25 deg C (mg/L):  13.1
       log Kow used: 2.12 (expkow database)
       no-melting pt equation used

 Water Sol Estimate from Fragments:
    Wat Sol (v1.01 est) =  0.15855 mg/L

 ECOSAR Class Program (ECOSAR v0.99h):
    Class(es) found:
       Aliphatic Amines
Henrys Law Constant (25 deg C) [HENRYWIN v3.10]:
   Bond Method :   9.77E-012  atm-m3/mole
   Group Method:   Incomplete
 Henrys LC [VP/WSol estimate using EPI values]:  5.117E-010 atm-m3/mole

 Log Octanol-Air Partition Coefficient (25 deg C) [KOAWIN v1.10]:
  Log Kow used:  2.12  (exp database)
  Log Kaw used:  -9.399  (HenryWin est)
      Log Koa (KOAWIN v1.10 estimate):  11.519
      Log Koa (experimental database):  None

 Probability of Rapid Biodegradation (BIOWIN v4.10):
   Biowin1 (Linear Model)         :   0.6009
   Biowin2 (Non-Linear Model)     :   0.2660
 Expert Survey Biodegradation Results:
   Biowin3 (Ultimate Survey Model):   2.2574  (weeks-months)
   Biowin4 (Primary Survey Model) :   3.1733  (weeks       )
 MITI Biodegradation Probability:
   Biowin5 (MITI Linear Model)    :  -0.1488
   Biowin6 (MITI Non-Linear Model):   0.0042
 Anaerobic Biodegradation Probability:
   Biowin7 (Anaerobic Linear Model): -0.4906
 Ready Biodegradability Prediction:   NO

Hydrocarbon Biodegradation (BioHCwin v1.01):
    Structure incompatible with current estimation method!

 Sorption to aerosols (25 Dec C)[AEROWIN v1.00]:
  Vapor pressure (liquid/subcooled):  0.000105 Pa (7.84E-007 mm Hg)
  Log Koa (Koawin est  ): 11.519
   Kp (particle/gas partition coef. (m3/ug)):
       Mackay model           :  0.0287
       Octanol/air (Koa) model:  0.0811
   Fraction sorbed to airborne particulates (phi):
       Junge-Pankow model     :  0.509
       Mackay model           :  0.697
       Octanol/air (Koa) model:  0.866 

 Atmospheric Oxidation (25 deg C) [AopWin v1.92]:
   Hydroxyl Radicals Reaction:
      OVERALL OH Rate Constant =   7.6246 E-12 cm3/molecule-sec
      Half-Life =     1.403 Days (12-hr day; 1.5E6 OH/cm3)
      Half-Life =    16.834 Hrs
   Ozone Reaction:
      No Ozone Reaction Estimation
   Fraction sorbed to airborne particulates (phi): 0.603 (Junge,Mackay)
    Note: the sorbed fraction may be resistant to atmospheric oxidation

 Soil Adsorption Coefficient (PCKOCWIN v1.66):
      Koc    :  2.151E+006
      Log Koc:  6.333 

 Aqueous Base/Acid-Catalyzed Hydrolysis (25 deg C) [HYDROWIN v1.67]:
    Rate constants can NOT be estimated for this structure!

 Bioaccumulation Estimates from Log Kow (BCFWIN v2.17):
   Log BCF from regression-based method = 0.932 (BCF = 8.559)
       log Kow used: 2.12 (expkow database)

 Volatilization from Water:
    Henry LC:  9.77E-012 atm-m3/mole  (estimated by Bond SAR Method)
    Half-Life from Model River: 1.053E+008  hours   (4.388E+006 days)
    Half-Life from Model Lake : 1.149E+009  hours   (4.786E+007 days)

 Removal In Wastewater Treatment:
    Total removal:               2.37  percent
    Total biodegradation:        0.10  percent
    Total sludge adsorption:     2.27  percent
    Total to Air:                0.00  percent
      (using 10000 hr Bio P,A,S)

 Level III Fugacity Model:
           Mass Amount    Half-Life    Emissions
            (percent)        (hr)       (kg/hr)
   Air       0.000217        33.7         1000
   Water     21              900          1000
   Soil      78.9            1.8e+003     1000
   Sediment  0.094           8.1e+003     0
     Persistence Time: 1.48e+003 hr
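A few of the derived numbers in the output above can be cross-checked with simple arithmetic. The sketch below is only a sanity check under stated assumptions, not a description of how EPI Suite computes anything internally: it assumes a pseudo-first-order half-life of ln 2 / (k·[OH]), the usual 133.322 Pa per mm Hg conversion, and that the “(Junge,Mackay)” phi value is the mean of those two model results.

```python
import math

# Atmospheric oxidation: half-life from the overall OH rate constant.
oh_rate_constant = 7.6246e-12  # cm3/molecule-sec
oh_concentration = 1.5e6       # OH/cm3, as stated in the output
half_life_hours = math.log(2) / (oh_rate_constant * oh_concentration) / 3600
half_life_12h_days = half_life_hours / 12  # "12-hr day" convention in the output

# Subcooled liquid vapor pressure: mm Hg to Pa.
PA_PER_MM_HG = 133.322
subcooled_vp_pa = 7.84e-7 * PA_PER_MM_HG

# Soil adsorption: Koc and its logarithm.
log_koc = math.log10(2.151e6)

# Fraction sorbed to particulates: mean of the Junge-Pankow and Mackay values.
phi_mean = (0.509 + 0.697) / 2
```

Each quantity agrees with the corresponding line of the output to within rounding of the displayed figures (the half-life in hours differs only in the last displayed digit).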

We started the calculations a number of weeks ago and are updating our progress on the ChemSpider Forum here. We now have values predicted for 3 million compounds.

It is NOT possible at present to search on these properties in the same way that other properties can be searched on the Search Predicted Properties page as shown below.

After all EPI Suite properties are predicted we will selectively make some of these available for searching. The interest so far appears to be in Henry’s Law values, Water Solubility and Melting Point (something that is very difficult to predict with accuracy!). We welcome your comments.

We will be able to extract experimental values for some properties and display directly. For example, logP shows an “experimental database match” for Xanax.

Log Octanol-Water Partition Coef (SRC):
Log Kow (KOWWIN v1.67 estimate) = 3.87
Log Kow (Exper. database match) = 2.12

Exper. Ref: BioByte (1995)
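Pulling the experimental value out of text like the excerpt above could be done along these lines. This is a hedged sketch: the regular expression is written against this one excerpt only, and the function names and the “prefer experimental” rule are assumptions for illustration rather than the actual extraction process.

```python
import re

# Matches lines such as "Log Kow (KOWWIN v1.67 estimate) = 3.87".
LOG_KOW_LINE = re.compile(
    r"Log Kow \((?P<source>[^)]*)\)\s*=\s*(?P<value>-?\d+(?:\.\d+)?)"
)


def parse_log_kow(text):
    """Return a dict with 'estimated' and/or 'experimental' log Kow values."""
    values = {}
    for match in LOG_KOW_LINE.finditer(text):
        is_experimental = match.group("source").startswith("Exper.")
        key = "experimental" if is_experimental else "estimated"
        values[key] = float(match.group("value"))
    return values


def preferred_log_kow(values):
    """Prefer the experimental value when one was matched."""
    return values.get("experimental", values.get("estimated"))


excerpt = """Log Octanol-Water Partition Coef (SRC):
Log Kow (KOWWIN v1.67 estimate) = 3.87
Log Kow (Exper. database match) = 2.12
"""
xanax_values = parse_log_kow(excerpt)
display_value = preferred_log_kow(xanax_values)
```

For the Xanax excerpt this keeps both numbers and selects the experimental 2.12 for display.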

It is going to take a number of weeks to generate EPI Suite values for 21.5 million molecules but we are moving in that direction. Our sincere thanks to the EPA for allowing us to use their EPI Suite software on ChemSpider for the benefit of the community.

I have spoken on this blog many times about the challenges of cleaning up data in chemistry databases. We’re expending a lot of effort, with the assistance of many others, in cleaning up the data on ChemSpider and, as a benefit, assisting in cleaning up data in other databases also. The efforts to curate the chemical structure data on Wikipedia continue and the work is now focused on delivering ‘bots that will drive a cleansed data file to the individual records. Over the past few months I have developed a great appreciation for the efforts, dedication and commitment of the many contributors to Wikipedia Chemistry. There are many 10s of people editing and contributing to the articles and then there is the “core WP:Chem team” who show up for the IRC chats most Tuesdays at noon. Many of the past weeks have focused on how to curate the data, utilize ‘bots and control curated data moving forward. I am honored to share “IRC-space” with them!

Over the past few weeks I have been similarly blessed to interact with the ChEBI team via email as we have done our work to deposit their Entities of the Month (1,2). During the process of doing so we have exchanged many emails and have cleaned a number of errors in our mutual datasets. In my opinion a PERFECT example of the results of such detailed efforts is for Vancomycin. One week ago a search on vancomycin would give a dozen hits. Many of these had incomplete stereochemistry. Now a search on ChemSpider gives one hit for vancomycin here. This is the result of working with Kirill Degtyarenko at ChEBI. The conversation was initiated by my observation regarding stereo in the structure on ChEBI.

For details on how this is identified to be the correct structure read the description on that page. VERY DETAILED and includes links out to three publications.

Compare this with a search for vancomycin on PubChem giving 66 hits. Some of these differences are due to the different approaches for our text searches – the PubChem results list includes VANCOMYCIN HYDROCHLORIDE and Gatifloxacin & Vancomycin for example. However, there are a number of “vancomycins” also.

We believe we have the correct vancomycin identified at this point…we welcome any challengers!

Thanks to the efforts of contributors such as Heinz Kolshorn new compounds and associated analytical data are finding their way onto ChemSpider on a regular basis. These are chemical compounds that have been synthesized and fully characterized. Unless they are published they are unlikely to find their way into chemical registry systems or into training databases for the commercial NMR prediction packages such as those of ACD/Labs, Bio-Rad, Modgraph or Wolfgang Robien’s collection. As a result this type of information will be “Lost Chemistry“. These particular data from Heinz will almost certainly find their way into the NMRShiftDB since Heinz is hosting the database at his lab at the University of Mainz.

Heinz has been putting actual experimental spectra and the associated shift assignments onto ChemSpider of late. An example is here. This is enabled by our ability to upload and store both spectra and images. There are better ways to display the shift assignments, such as allowing mouseover display of the structure and peak associations, but this is not yet available on the system. It is clearly a nice-to-have. For now the information is there for others to use and is indicative of the value of integrating images and spectral data. I can envisage other pairings, such as UV spectra versus a photo of the colored solution, for example.

Over the past few months we have recognized those people who have spent their time contributing to the content of ChemSpider either as depositors or curators. Recently I commented about one of our Advisory Group, Chris Singleton, taking on a major project to deposit spectral data to ChemSpider. If you visit the spectral data page and scroll through you will see that there are now 33 pages of spectra, each page containing 20 spectra. The majority of these are NMR spectra and the largest single collection is that deposited by Chris over the past few weeks. The data were obtained from the Madison Metabolomics Consortium Database and described in a publication by Q. Cui, et al; “Metabolite identification via the Madison Metabolomics Consortium Database”, Nature Biotechnology, 26,162 (2008). Our sincere thanks to Chris for all of his work!

There is another raft of spectra waiting to be processed and deposited so the spectral data collection will continue to grow.

I have blogged previously about ChEBI entities of the month and our work to include the information in ChemSpider. In order to do so we had to introduce rich text support. This work is done and reported here. As of today nearly all ChEBI Entity of the Month information is now posted to ChemSpider. During the process we have provided feedback to the team about suggested changes to some structure depictions and have also noted some differences in stereochemistry between our reference structures and those on ChEBI. This type of interaction keeps us all very vigilant about accuracy, and it was great (and fast) to work with the group at ChEBI to cross-validate the limited dataset. Everyone gains.

The rich text editor worked without failure and, we think, is ready to roll out to the general public, but we would still like some beta-testers to help test it.


Okay, this is clearly a rather tongue-in-cheek blog post but I couldn’t resist.

Search “sex” on ChemSpider and you get two hits…here

Click on the first structure and you will find that one of the identifiers for this compound is SEX, and it is an explosive.

Just READ the second structure and you will see it is SEX. It’s CLEAN sex though. The dirty sex was described in a recent C&E News article and points back to the poor image originally published by the New York Times when they issued a book review of Pamela Paul’s Book “Bonk, The Curious Coupling of Science and Sex“. In order to have CLEAN sex I removed inappropriate substitutions and bonds.

It still looks like sex though…

ChemSpider added the Directory of Useful Decoys over the weekend. This dataset is well known to the community of scientists performing computational docking experiments and is outlined below. The dataset contributed over 128,000 molecules to the collection.

“DUD, a directory of useful decoys for benchmarking virtual screening. DUD is designed to help test docking algorithms by providing challenging decoys. It contains:

  • A total of 2,950 active compounds against a total of 40 targets
  • For each active, 36 “decoys” with similar physical properties (e.g. molecular weight, calculated LogP) but dissimilar topology.

DUD is provided by the Shoichet Laboratory in the Department of Pharmaceutical Chemistry at the University of California, San Francisco (UCSF). To cite DUD, please reference Huang, Shoichet and Irwin, J. Med. Chem., 2006, 49(23), 6789-6801. doi 10.1021/jm0608356. There is a DUD wiki page where you can discuss DUD and an errata page where problems are reported and explained.”

In an ongoing commentary about the DailyMed dataset (1,2) I have been showing some of the struggles regarding creating curated datasets from publicly available data. This post shows an example of when trade names collide. The DailyMed record for sclerosol shows no chemical structure in the label….but describes the compound as follows:

“Sclerosol® Intrapleural Aerosol (sterile talc powder 4 g) is a sclerosing agent for intrapleural administration supplied as a single-use, pressurized spray canister with two delivery tubes of 15 cm and 25 cm in length. Each canister contains 4.0 g of talc, either white or off-white to light grey, asbestos-free, and brucite-free grade of talc of controlled granulometry. The composition of the talc is ≥ 95% talc as hydrated magnesium silicate. The empirical formula is Mg3 Si4 O10 (OH)2 with molecular weight of 379.3.”

Sclerosol is Talc. A search on Sclerosol online, however, brings up numerous hits for dimethyl sulfoxide on ChemIndustry, the Comparative Toxicogenomics Database and MeSH. So, is Sclerosol also DMSO?

The PubChem record merges the relationship between Talc and DMSO rather well. Visit the record here. The substance summary is as follows:

“A highly polar organic liquid, that is used widely as a chemical solvent. Because of its ability to penetrate biological membranes, it is used as a vehicle for topical application of pharmaceuticals. It is also used to protect tissue during CRYOPRESERVATION. Dimethyl sulfoxide shows a range of pharmacological activity including analgesia and anti-inflammation.”

Further information is in the MeSH details shown below.

The image of the associated structure is shown below…notice it’s representative of talc.

It appears that DMSO and Talc were meshed somehow.

Sclerosol on ChemSpider is Talc. I am not stating that the structure representation of talc is appropriate but it IS the same as the one displayed on PubChem. DMSO on ChemSpider is here and never had the name Sclerosol associated with it. Since we derived some of our data from PubChem I am not sure how we managed to separate the DMSO and Sclerosol association in our processes…but we did.

So, MAYBE Sclerosol is a name for DMSO…but I don’t think so.

Why is this important? We are working on text mining and will use a lookup dictionary of chemical names and structures as part of the process, so we are putting in the work to create a high quality dictionary. It’s important for us moving forward.

I’ve started a review of the DailyMed dataset as it is representative of some of the struggles with preparing a curated dataset of chemical structures, chemical names and trade names. In the first comment I pointed to issues with structure representations. I believe one of the worst is shown for qvar to the left. An examination of the qvar record gives the name as beclamethasone propionate. This particular compound has the chemical structure shown below. Not only is the stereochemistry missing from the structure on DailyMed but also half the ring has been lost, maybe during a scanning process? I wonder whether the label circulating out there to the public has this issue? Would the public care? Probably not. But when trying to build a curated dataset it’s rather important.


The past couple of days have seen an interesting exchange going on over on the SimBioSys blog.

Zsolt Zsoldos is someone I respect, not only for his passion for his science but also for his desire to educate others in the challenges of what he does in developing software. I believe his blog post entitled “Crystal Structure Errors in CSD too” was an honest attempt to tell people to be “careful” when using data from databases. I don’t care whether the database is ChemSpider, PubChem, the CAS Registry or any of the other databases available via free access or commercial transaction, they ALL have errors. It is inevitable. Zsolt’s attempt to highlight that such errors exist was done, I believe, with pedagogical intent.

“J” then came back and gave some appropriate comments in response to Zsolt’s post and they should be read in sequence. It appears there was some type of backroom conversation, likely with the CCDC, about how these comments were not prominent enough. Zsolt then posted this:

“Update: Since the posting of this blog entry, we have received 2 public comments — displayed in a standard way as all comments by the WordPress blog software, and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment — which explains the problems and directs the blame on my naivity for my wrong expectations about the data — was not displayed as prominently as the original article.”

He then posted the comment into the original article. Huh? Not sure why Zsolt should have felt obliged to do this for anyone. It’s a WordPress issue re how comments are displayed. He should not have felt obliged to insert the text into the article. Zsolt then went on to comment about the licence agreement and permission to use the CSD. What is more interesting to me is his view here:

“On a personal opinion: such restrictions on the use of scientific facts do not seem to make much sense to me. As the IUCr position paper explains: There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community. Also the FAQ on the CystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download“. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning for others who have not yet made such mistake to sign the draconian agreement. ”

Those of you who have been watching the discussion between myself and ACS over the past few months will know I have been trying to get confirmation that “supplementary data” are Open Data and that we could scrape the CIFs if we chose to…it’s a MANY month conversation at this point. The Unilever School at Cambridge, via Nick Day’s work, has generated CrystalEye and, after many conversations, we were provided the data source and have it on ChemSpider now. We are awaiting constructive feedback from Nick and Peter Murray-Rust regarding our implementation of their data on our site. This is especially important when there are licensing issues such as those that appear to have been enforced on SimBioSys, evidenced by this Public Apology to CCDC. Read the post for details. It is Zsolt’s concluding statement that feeds directly into the value of Open Data in science and the value of CrystalEye to the community.

He comments: “One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), e.g. such that is available from CrystalEye. When even non-profit organizations (registered as a charity) use draconian license agreements protecting data created and published by others, then fully commercial entities (like pharmaceutical companies) must be guarding their own data even stronger. It makes it difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the company who sells services on the data.”

As efforts like CrystalEye prevail, as the copyrightability and position of publishers regarding supplementary data is resolved, and the efforts of groups such as ChemSpider are applied to gathering Open Data and developing algorithms from these data, there is likely to be increasing tension showing up such as we see here.

There has been a response to my post about Chemical Names and Structures here.

PMR>”For certain purposes, it is valuable to collect as many names as possible, for example for location of lookup. But these should be accompanied with metadata. A similar example is from ChemSpiderMan (ed.):

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here: Looks fine in my browser and pasted in here too: N-{2-[({5-[(diméthylamino)méthyl]furan-2-yl}méthyl)sulfanyl]éthyl}-N’-méthyl-2-nitroéthène-1,1-diamine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the langauage information is losr. For example what does “pain” mean? If the language is not given there is a tendency to interpret this as english.  We have to acknowledge that the language of science is currently english (it wasn’t when I started and we had to read French and German  papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful. “

First of all, it’s interesting to note that the French name has been rendered as “junk” in Peter’s blog as shown here.

This probably relates to his original comment that the name is junk in his browser too…but acceptable in mine. On the other hand his blog post may look fine to him and looks bad in mine! Oh those dependencies…I see similar things show up in WordPress regularly.
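One plausible cause of this kind of browser-dependent rendering is a mismatch between the encoding the text was saved in and the encoding it is read with. That is an assumption on my part, not a diagnosis of either site. A minimal Python sketch, using a fragment of the French name purely for illustration:

```python
# Illustrative only: shows how an accented character can turn into a
# replacement character when bytes are decoded with the wrong encoding.
fragment = "méthyl"

# Assume the text was stored as Latin-1 (ISO-8859-1)...
raw_bytes = fragment.encode("latin-1")

# ...but is then decoded as UTF-8. The lone 0xE9 byte is not valid UTF-8,
# so it is swapped for U+FFFD, the "unknown character" symbol.
misread = raw_bytes.decode("utf-8", errors="replace")

# Decoding with the encoding that was actually used recovers the text.
recovered = raw_bytes.decode("latin-1")

print(misread)
print(recovered)
```

The same bytes give either a readable name or "junk" depending only on which encoding the reader assumes, which would be consistent with the name looking fine in one browser and broken in another.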

Peter suggests that there should be metadata giving the language information. Good idea. See my previous blog post about that particular issue and the fact that we allow curators to layer on metadata AND we capture and retain it WHEN it is available.

If you look at this record you will see that there are names labeled as Polish, German and Dutch.

Chloroprene [Wiki]

1,3-Butadiene, 2-chloro-

126-99-8 [RN]

204-818-0 [EINECS]

2-Chloor-1,3-butadieen [Dutch]

2-Chlor-1,3-butadien [German]

2-Chlorbuta-1,3-dien [German]

2-Chloro-1,3-butadiene

2-Chlorobutadiene

Chloropren [Polish]

Most labels were captured during the deposition process. One was added manually. Notice also the direct links to Wikipedia, the Registry number link to perform a search of PubChem and the link to EINECS.
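For anyone wanting to work with a list displayed in this "name [Label]" form, the label can be separated from the name with a few lines of code. The sketch below is only an illustration of the idea: the `LabeledName` type and `parse_labeled_name` function are hypothetical names invented here, and nothing about it reflects how ChemSpider stores the labels internally.

```python
import re
from typing import NamedTuple, Optional


class LabeledName(NamedTuple):
    name: str
    label: Optional[str]  # e.g. "German", "RN", "EINECS"; None if unlabeled


# Matches "some name [Label]" where the label sits in trailing square brackets.
_LABEL_PATTERN = re.compile(r"^(?P<name>.*?)\s*\[(?P<label>[^\[\]]+)\]$")


def parse_labeled_name(line: str) -> LabeledName:
    """Split a displayed synonym into its name and optional bracketed label."""
    text = line.strip()
    match = _LABEL_PATTERN.match(text)
    if match:
        return LabeledName(match.group("name"), match.group("label"))
    return LabeledName(text, None)


synonyms = [
    "Chloroprene [Wiki]",
    "1,3-Butadiene, 2-chloro-",
    "126-99-8 [RN]",
    "2-Chloor-1,3-butadieen [Dutch]",
    "2-Chlor-1,3-butadien [German]",
    "Chloropren [Polish]",
]

parsed = [parse_labeled_name(s) for s in synonyms]

# Collect the language labels seen in this small sample.
languages = {p.label for p in parsed if p.label in {"Dutch", "German", "Polish"}}
```

One caveat: a systematic name that itself ends in a square bracket would be misread by this simple pattern, which is a good argument for keeping the label as separate metadata rather than as part of the displayed string.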

As I commented in my post on ranitidine, and extracting from Peter’s post “Notice …….. that the name is labeled French on the record.” So, what Peter suggests is already in place on ChemSpider. I display below what is presently available to curators to label the names with. Notice this includes language, EINECS numbers, CAS Registry Numbers, INNs, JANs etc.


The list of languages is easy to expand. Anybody have any requests?

A further comment “PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide). We have to use authorities (provenance) in our information. Thus the statements: Ranitidine is the Z-isomer and Ranitidine is the E-isomer may be seen as contradictory. That’s why people have suggested that RDF should have quads, not triples, such as Antony_Williams asserts ranitidine hasIsomer Z Wikipedia asserts ranitidine hasIsomer E Both these are true. That is the language we should use in the semantic web PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.”
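To make the "quads, not triples" idea concrete, here is a small sketch in plain Python rather than in any real RDF toolkit. The two assertions are the ones given in Peter's quoted text and are used only as sample data, not as statements of fact; the `Quad` type and helper functions are hypothetical names made up for the illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Quad(NamedTuple):
    source: str      # who makes the assertion (the provenance)
    subject: str
    predicate: str
    value: str


# The two example assertions from the quoted passage, kept with their sources.
assertions = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]


def values_by_source(quads, subject, predicate):
    """Group the asserted values by the source that asserted them."""
    grouped = defaultdict(set)
    for q in quads:
        if q.subject == subject and q.predicate == predicate:
            grouped[q.source].add(q.value)
    return dict(grouped)


def has_conflict(quads, subject, predicate):
    """True when the sources, taken together, assert more than one value."""
    all_values = {
        q.value for q in quads
        if q.subject == subject and q.predicate == predicate
    }
    return len(all_values) > 1


by_source = values_by_source(assertions, "ranitidine", "hasIsomer")
conflict = has_conflict(assertions, "ranitidine", "hasIsomer")
```

Stripped of the source, the two statements simply contradict each other. With the source retained, each remains a true record of who said what, and the disagreement can be flagged for a curator instead of being resolved by hard boolean logic.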

This leads us into a deeper discussion about retention of metadata and authorities. We retain metadata when it is deposited or we can harvest it. Let’s consider the information below extracted from the same compound on ChemSpider:

Notice that the entries below all link through to the original source of information, in this case NIOSH.

  • Appearance: Colorless liquid with a pungent, ether-like odor.

  • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, skin, respiratory system; anxiety, irritability; dermatitis; alopecia; reproductive effects; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, reproductive system Cancer Site [lung & skin cancer]

  • Incompatibilities and Reactivities: Peroxides & other oxidizers [Note: Polymerizes at room temperature unless inhibited with antioxidants.]

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet (flammable) Change: No recommendation Provide: Eyewash, Quick drench

  • Exposure Limits: NIOSH REL: Ca C 1 ppm (3.6 mg/m³) [15-minute] See Appendix A; OSHA PEL: TWA 25 ppm (90 mg/m³) [skin]

There are also properties and each piece of data links out to the original source. For this record it is the same source. For some records it is already multiple sources.

Experimental physchem properties

  • Boiling Point: 139 °F

  • Flash Point: -4 °F

  • Freezing Point: -153 °F

  • Specific Gravity: 0.96

  • Solubility: Slight

  • Ionization Potential: 8.79 eV

  • Vapor Pressure: 188 mmHg
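The temperatures above appear to be in degrees Fahrenheit (reading the trailing "F" that way is my assumption). For readers who think in Celsius, a quick conversion sketch:

```python
def fahrenheit_to_celsius(deg_f: float) -> float:
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (deg_f - 32.0) * 5.0 / 9.0


# Values as listed above, assumed to be in degrees Fahrenheit.
listed_f = {
    "Boiling Point": 139.0,
    "Flash Point": -4.0,
    "Freezing Point": -153.0,
}

# Round to one decimal place; the listed values do not justify more precision.
converted_c = {
    name: round(fahrenheit_to_celsius(deg_f), 1)
    for name, deg_f in listed_f.items()
}

for name, deg_c in converted_c.items():
    print(f"{name}: {deg_c} °C")
```

This gives roughly 59.4 °C for the boiling point, -20.0 °C for the flash point and -102.8 °C for the freezing point.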

This particular structure has been deposited onto the ChemSpider database a total of 18 times from the source databases listed below. Where possible, i.e. when the structure is available online on the supplier’s website and can be hyperlinked to, each external ID links to the depositor. There is an error! The Aldrich depositions are for the polymer forms! Curators can remove this information.

Data Source                   External ID(s)
ChemDB                        6681768
ChemIDplus                    000126998, 014523898
DiscoveryGate                 31369
DTP/NCI                       18589
EINECS                        N/A
EPA DSSTox                    1084_NTPBSI_v2b, 325_CPDBAS_v5b, 326_CPDBAS_v5b, 724_HPVCSI_v2c
Istituto Superiore di Sanità  601
NIOSH                         EI9625000
NIST                          2143397875
NIST Chemistry WebBook        2143397875
PubChem                       31369
Sigma-Aldrich                 205397_ALDRICH, 205400_ALDRICH
Thomson Pharma                00243363
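As a quick cross-check of the figure of 18 depositions, the external IDs in the table can simply be counted. The sketch below copies the IDs from the table and assumes that the "N/A" entry for EINECS represents one deposition that has no linkable ID; whether ChemSpider counts it that way is my assumption.

```python
# External IDs per data source, copied from the table above.
# Assumption: "N/A" stands for one deposition that carries no linkable ID.
external_ids = {
    "ChemDB": ["6681768"],
    "ChemIDplus": ["000126998", "014523898"],
    "DiscoveryGate": ["31369"],
    "DTP/NCI": ["18589"],
    "EINECS": ["N/A"],
    "EPA DSSTox": [
        "1084_NTPBSI_v2b",
        "325_CPDBAS_v5b",
        "326_CPDBAS_v5b",
        "724_HPVCSI_v2c",
    ],
    "Istituto Superiore di Sanità": ["601"],
    "NIOSH": ["EI9625000"],
    "NIST": ["2143397875"],
    "NIST Chemistry WebBook": ["2143397875"],
    "PubChem": ["31369"],
    "Sigma-Aldrich": ["205397_ALDRICH", "205400_ALDRICH"],
    "Thomson Pharma": ["00243363"],
}

total_depositions = sum(len(ids) for ids in external_ids.values())
source_count = len(external_ids)

# Sources that deposited the structure more than once.
multi = sorted(name for name, ids in external_ids.items() if len(ids) > 1)

print(total_depositions, "depositions from", source_count, "sources")
```

Under that assumption the tally comes to 18 depositions from 13 sources, with ChemIDplus, EPA DSSTox and Sigma-Aldrich each contributing more than one.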

Also available to master curators is the ability to see who has been editing the names and synonyms and a full record of depositions, by whom and when.

So, names are labeled with language and links to Wikipedia and other info. The predicted properties and systematic name are generally labeled according to the provider of the algorithm(s). We keep track of every URL and publication deposition and know which user deposited what and when…if the site is “vandalized” then we know which user did so.

Overall I’d say we have a lot of metadata for this record. The same is true for tens of thousands of records on ChemSpider and the amount of such information is growing literally daily. We’re not done yet of course – there is much more to add. We put a lot of thought into the design of this system and associated metadata but we also chose to jump off the cliff and start “doing”. There is a lot to learn from managing 20 million molecules and the complexity that comes with doing so. We continue to morph and extend as necessary and welcome input.

To clarify re. ranitidine…. I am NOT asserting that ranitidine has the Z-isomer. I am stating that ranitidine has multiple names on ChemSpider, some with no stereochemistry and some with Z-stereochemistry. I also report that a published crystal structure reports a Z-orientation. I also report that a commercial software package suggests that the three tautomeric structures below are possible for ranitidine.

I also report, just for fun of course, that the InChI algorithm will declare two of these isomers, the bottom two, as equivalent when “mobile protons” are taken into account. Compare the ON InChIKeys below when mobile proton perception is detected by the InChI algorithm. Need more information?

With the curation capabilities we have in place, with the retained metadata, linkages to depositors and other sites and the revision history available, I would say that we are well equipped to manage the data for chemists and continue to enhance our platform for chemists worldwide.