Archive for the Quality and Content Category

Let’s start off where I intend to finish. Bigger does not necessarily mean better. A large database of unique chemical entities does not necessarily mean a good database and accurate chemical representations of chemical entities can be pretty hard to find.

Few people realize how these simple statements are impacting the quality of what’s available online for chemists to use and how curation of data must occur in order to improve what’s available.

Now…what’s the basis for me to initiate this discussion and WHY would I prefer that ChemSpider was actually a smaller database?

Today on CHMINF Steve Heller posted the following review:

“From:http://www.ala.org/ala/mgrps/divs/rusa/sections/mars/marspubs/marsbestfreewebsites/marsbestfree2009.cfm

Title: PubChem
URL: http://pubchem.ncbi.nlm.nih.gov/

PubChem is a search tool for chemical information, divided into three areas: Compounds, Substances, and BioAssays. Full entries provide detailed information with the most basic information – a general description, the molecular weight and formula, the structure, plus a Table of Contents (ToC) for the full entry – all easily found above the fold. Use the ToC or scroll down to retrieve more advanced information, such as bioactivity results, synonyms, chemical actions, detailed properties, and more. Each module is fully interlinked with the other sections of PubChem as well as resources in ToxNet and PubMed, providing full access to toxicology resources and the medical literature, and allowing users access to as much or as little of the chemical information as they need.

Author/Publisher: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Date reviewed: February 16, 2009

PS. PubChem now has 37,326,949 DIFFERENT structures”.

Bob Buntrock made the following statement “Re the PS below, I find it difficult to believe that PubChem has 37.3 million “different” compounds.  The figures from the CAS website show 48 million organic and inorganic compounds which excludes sequences but includes polymers, alloys, coordination compounds, minerals, and mixtures. Since PubChem aims to cover “small molecules”, it would seem that many compounds in these last 5 categories would not be present.  Therefore, I assume that a significant number of the 37.3 million PubChem compounds are redundant.” All hell broke loose with lots of posts discussing the uniqueness of chemical entities and the fact that PubChem compounds WERE unique. Okay, I’m not going to argue this for the moment but I am going to agree with Bob that a significant number of the compounds are likely redundant. It is ALSO true of ChemSpider. Why?

I could write a multipage post, and I have already discussed this issue many times on this blog, but I am clearly failing to communicate it. I refer you to previous posts about Taxol (1,2,3), Vancomycin (4) and Ginkgolide B (5,6). I suggest you read these earlier posts, but I will try to explain again anyway.

Some general statements. Many complex chemical compounds, especially natural products, have timelines. A compound when initially elucidated can give the connectivity only and get reported. Then stereochemistry might be layered on later, and reported. Then stereochemistry might be adjusted, and reported. Through this whole timeline the compound might be referred to by a particular chemical name….let’s call it Afonwenium. So, based on the timeline for this molecule there can be anywhere between 1-4 “versions” of the structure by that name. They are all unique chemical entities but the “final structure” is the one that people will want. It’s the one that should be represented on Wikipedia, the one that should correctly be drawn in all publications following the final elucidation report and assertion of structure and the one that should be found on many of the “reference” databases such as KEGG, DrugBank etc.

Search Taxol on ChemSpider and Taxol on PubChem and compare the number of structures you get. I judge that there are MANY unique chemical entities on PubChem that are MEANT to be Taxol but are not. And I don’t mean the ones that are named as “Taxol derivative”, I mean the ones that may have the SAME molecular weight, formula and connectivity but have DIFFERENT stereo – no stereo, MULTIPLE partial stereo and MULTIPLE full stereo. These issues exist for compounds like Ginkgolide B and Vancomycin and many more structures.  There is of course only one Taxol, a compound registered by Bristol Myers Squibb and asserted to have a specific constitution.

Just out of interest, let’s see how many compounds are on ChemSpider with a specific skeleton (ignoring stereo).

There are 54 compounds with the skeleton of Taxol: http://www.chemspider.com/InChIKey/RCINICONZNJXQF. These are all UNIQUE chemical entities, but some are C-11 and C-14 labeled, deuterium and tritium labeled and so on. Even setting those aside, there are over 30 compounds without isotopically labeled sites that still have the Taxol skeleton. Maybe some of these are meant to be Taxol with different stereochemistry, but I judge that MOST of these are meant to be Taxol and are labeled as such but differ in terms of no, partial and full stereo at least. This is ONE example. To Bob’s question…is this redundancy? I say yes. How does this get solved? Curation will do it, but it’s expensive and time consuming, and the only way forward in my judgment is to crowdsource it. This problem is not going away anytime soon in PubChem or ChemSpider. We HAVE curated the name associations and removed the name Taxol from all structures with this skeleton that are not the asserted form of Taxol. But the structures do remain on the database and link back to the original sources. We will be working on ways to show on every search that there are associated skeletons, compounds related by isotopic labeling and the status of no, partial and full stereochemistry. All to come…
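To make the "same skeleton, different stereo" idea concrete, here is a minimal sketch of grouping records by the first block of their InChIKey, which is what the URL above does. In a Standard InChIKey the first 14 letters are a hash of the connectivity layer, while stereochemistry and isotopic labeling are hashed into the second block. The record IDs and the function names below are hypothetical, the sample keys are only illustrative inputs, and this is not a description of how ChemSpider implements the search.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def skeleton_block(inchikey: str) -> str:
    """Return the 14-letter connectivity block of an InChIKey."""
    key = inchikey.strip()
    # Accept the optional "InChIKey=" prefix used in some listings.
    if key.upper().startswith("INCHIKEY="):
        key = key[len("InChIKey="):]
    block = key.split("-")[0].upper()
    if len(block) != 14 or not block.isalpha():
        raise ValueError(f"not a valid InChIKey first block: {inchikey!r}")
    return block


def group_by_skeleton(records: Iterable[Tuple[int, str]]) -> Dict[str, List[int]]:
    """Group (record_id, inchikey) pairs by shared connectivity block."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for record_id, inchikey in records:
        groups[skeleton_block(inchikey)].append(record_id)
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical record IDs; keys differ only after the first block.
    sample = [
        (1, "RCINICONZNJXQF-MZXODVADSA-N"),
        (2, "RCINICONZNJXQF-UHFFFAOYSA-N"),
        (3, "InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N"),
    ]
    for block, ids in group_by_skeleton(sample).items():
        print(block, ids)
```

Any group with more than one member is a candidate for a curator to review: the members share connectivity but differ in stereo or isotopic layers, and only a person can judge which one deserves the name.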

The ongoing “Bigger is Better” argument for Public Compound Databases is irrelevant at this point in my opinion. We can add 50 million new compounds with a simple enumeration exercise but would it bring any value? I say no. We can add virtual libraries from a number of our collaborators but I judge it to be of very limited value. The value of the Public Compound Databases is in what they connect to and whether there is an answer to a question at the end of the chain. If I search on a chemical and find it on ChemSpider but I cannot find a vendor for it, no analytical data, no properties of value, no manuscripts, no patents linked etc. then I have just done a search, found it on ChemSpider but have derived no value. We are working on increasing the VALUE of our content: linking compounds to rich data sources, layering on additional properties, links to papers, blog entries and discussions and so on. If the result of a search is a hit but with no value, who cares? If the result of a search is a hit but with links to the wrong information, that’s worse. If I ask the question “What is Taxol” and get one hit, I need it to be right. If I ask the question and get tens of hits, now what?

Curation has been underway for 2 years. We’re not finished. It’s a massive task. In reality it will NEVER be finished – new chemistry comes in every day and more information gets associated. We don’t have answers to all of the issues that exist around these diverse datasets but we are not naive in our understanding that our database is polluted with issues inherited from many other sources. We have marked tens of thousands of structures for deprecation. We have likely added information into PubChem that has contributed to the issue of data quality. But we are working on it.

Meanwhile, errors that exist in PubChem are proliferating. A simple example is that of methane in PubChem that I have blogged about many times…one example here. Here are some of the names associated with the structure of methane on PubChem: 1,3-DICHLORO-PROPAN-2-ONE, diamond, charcoal and many tens of other incorrect names.

The National Cancer Institute’s Chemical Structure Lookup Service has over 46 million unique chemical entities and they have offered a series of services to search by InChI, name and many other queries. A posting to CHMINF outlined the service:

“Chemical Identifier Resolver (beta):
—————————-

http://cactus.nci.nih.gov/chemical/structure

This service is a resolver for different chemical structure representations and identifiers, including those that do not carry any information about the structure itself. For instance, it can work as a Standard InChIKey Resolver, an NCI/CADD Identifier Resolver or a Chemical Name Resolver. The service also allows one to convert a given structure identifier into another representation or structure identifier.

Representations/identifiers supported are: Standard InChI/InChIKey, NCI/CADD Identifiers (FICuS, FICTS, uuuuu), SMILES, SDF, names, and a few other types of IDs. See the web page for more information.

For those identifiers that require lookup, the underlying database currently contains about 67 million unique structure records, from which the respective Standard InChIKeys and NCI/CADD Identifiers have been calculated. For lookup by chemical names, 68 million names associated with 16 million unique structure records are currently available in the database. The database continues to grow.

Closely related are the new capabilities of resolving/converting chemical structure identifiers by simply using a URL adhering to the following scheme: http://cactus.nci.nih.gov/chemical/structure/”structure identifier”/”representation”[/xml]

We just list a few examples here that should give you an idea of what’s possible with this service.  For more detailed explanations, see the above web page.

Example: Standard InChI for chemical name string “aspirin”: http://cactus.nci.nih.gov/chemical/structure/aspirin/stdinchi

Example: Standard InChIKey of “ethanol” specified as SMILES string “CCO”: http://cactus.nci.nih.gov/chemical/structure/CCO/stdinchikey

Example: Unique SMILES string of chemical name string “benzene”: http://cactus.nci.nih.gov/chemical/structure/benzene/smiles

Example: SD File for chemical name string “morphine”: http://cactus.nci.nih.gov/chemical/structure/morphine/sdf

Example: Chemical names for Standard InChIKey “InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N” (Standard InChIKey of “ethanol”): http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/names

Example: Synonyms for chemical name string “aspirin”: http://cactus.nci.nih.gov/chemical/structure/aspirin/names”
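As a rough illustration of the URL scheme quoted above, the sketch below only assembles resolver URLs; it does not call the service. The function name and the `xml` flag are my own naming, and percent-encoding the identifier (so that characters such as “#” in a SMILES string survive in a URL) is an assumption about what the service will accept rather than something stated in the posting.

```python
from urllib.parse import quote

# Base address taken from the posting quoted above.
BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Build a URL of the form BASE/"structure identifier"/"representation"[/xml]."""
    # Keep "=" readable (as in "InChIKey=..."), escape everything else unsafe,
    # including "/" and "#", which can occur in SMILES strings.
    parts = [BASE_URL, quote(identifier, safe="="), representation]
    if xml:
        parts.append("xml")
    return "/".join(parts)


if __name__ == "__main__":
    print(resolver_url("aspirin", "stdinchi"))
    print(resolver_url("CCO", "stdinchikey"))
    print(resolver_url("InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "names"))
```

Fetching any of these URLs with a browser or an HTTP client should return the requested representation as plain text, or as XML when the optional suffix is added.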

Unfortunately polluted names are finding their way across all of these databases which is why a lookup on methane gives us: http://cactus.nci.nih.gov/chemical/structure/methane/names including in the list:
1-Chlorobenzylethyl-3,5,7,9,11,13,15-heptaisobutylpentacyclo[9.5.1.1(3,9).1(5,15).1(7,13)]octasiloxane, mixture of isomers
673323_ALDRICH
PSS-[2-[(Chloromethyl)phenyl]ethyl]-Heptaisobutyl substituted
675342_SIAL
(2R,3R)-Butanediol bis(methanesulfonate)

and DIAMOND…
(2R,3R)-Butanediol dimesylate

The CAS database is highly curated, not without errors, and built up using robots and eyes. Public Compound Databases are built with the best intent and are useful. But they are not curated and are polluted. Bigger does NOT mean better and care is warranted. ChemSpider will likely stay smaller than many of the other Public Compound Databases moving forward as we remain focused on adding value and addressing the issues of inherited and future quality. It’s a long journey…

ChemSpider has been working on polishing both single structure and SDF file deposition. We are now using these tried and tested approaches to deposit large blocks of data, commonly many thousands of records. For depositions of hundreds of thousands of records we break the depositions into smaller chunks of 5-10 thousand each.
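The chunking itself is simple enough to sketch. The snippet below is only an illustration under my own assumptions, not our deposition code: it splits an SDF stream into records on the standard `$$$$` terminator line and then batches them, with the function names and the default chunk size of 5,000 chosen purely for the example.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_sdf_records(lines: Iterable[str]) -> Iterator[str]:
    """Yield one SDF record at a time; records end with a '$$$$' line.

    Lines are expected to keep their newline characters, as they do
    when iterating over an open text file.
    """
    buffer: List[str] = []
    for line in lines:
        buffer.append(line)
        if line.strip() == "$$$$":
            yield "".join(buffer)
            buffer = []
    # Yield any trailing record that lacks a terminator line.
    if "".join(buffer).strip():
        yield "".join(buffer)


def chunked(items: Iterable[T], chunk_size: int = 5000) -> Iterator[List[T]]:
    """Yield successive lists of at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

With an open file handle `fh`, `for batch in chunked(iter_sdf_records(fh), 10000): ...` would hand each batch of up to ten thousand records to whatever deposition step follows, without ever holding the whole file in memory.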

An example of depositing a couple of large SDF files was given to us when the following publication was released in JCIM.

Global Bayesian Models for the Prioritization of Antitubercular Agents
by Philip Prathipati, Ngai Ling Ma* and Thomas H. Keller
J. Chem. Inf. Model., 2008, 48 (12), pp 2362–2370
DOI: 10.1021/ci800143n

This paper offers us a few thousand SMILES strings in CSV files that we could deposit into ChemSpider and associate with the article. Visit an example here and you will see the article connected via DOI in the supplementary information.


It is easy for us to deposit such datasets so if you have publications with such datasets that you would like to see on ChemSpider send us the SDF file and the DOI and they will be deposited.


Where in the world is Carmen Sandiego and who and where is Katie Crow? We’re still looking for her ever since she put her photo on ChemSpider and took advantage of the new capability we have for depositing images.

Well, a more appropriate use of the function is to actually deposit images of appropriate data. JSpecView does not support 2D NMR data at present but such data can still be of value. Ryan Sasaki from ACD/Labs was kind enough to give me an example 2D COSY spectrum for strychnine so I could use it as a proof of concept. It is available under the spectra tab at this record (see the bottom of the page). This 2D spectrum could also show a structure with correlations etc.


Many of us using ChemSpider are looking for compounds of interest to us. In some cases those chemical entities are not of fleeting interest but something that we are working on in our research, have a hobbyist interest in or some other driving force encouraging us to track activity in.

With this in mind we have now allowed any user to “monitor an article”. What this means is that when new information is associated with an article (new outlinks, new forms of data, new publications, associated spectra etc.) an email will be sent to you making you aware of the new information. In order to monitor an article simply log in as a registered user and click on the “Monitor This Article” button. If you want to discontinue in the future simply return to the article and click on “Cancel Article Monitor”. We’d like a few people to help test this process for us and provide us with feedback. Keep your eye on those molecules of interest to you with Article Monitoring.


Today I had the privilege of meeting with many members of the team creating the RCSB Protein Data Bank. This resulted from the wonderful networking opportunity offered by the Scifoo camp held earlier this year at Google where I met Helen Berman, director of the PDB team, part of the worldwide Protein Data Bank. Helen and I shared some conversations sitting outside the Google offices in California and shared our opinions and visions regarding the quality of small molecule data available online. Today was an opportunity to take those conversations further, meet with members of the team and determine whether ChemSpider’s efforts could bring benefit to the PDB in terms of our curation efforts and whether ChemSpider users could benefit from having access to information on the PDB via hosting of the PDB ligand dictionary.

I gave a presentation (online here and based on others I have delivered previously), received a one-on-one review of the deposition and curation processes of the PDB, and participated in a group discussion about how to continue the stringent and exacting process of validation and curation associated with small molecule structure sets. We discussed the complex relationships between systematic names, trivial names, registry IDs, database IDs, tautomers, charged states, SMILES and InChIs. It was a particularly validating day to spend time with a group of people who have responsibility for building one of the most valuable resources in the world and have faced the many challenges associated with validating structure-based data. There is a distinction between people who talk about what it takes to curate structure collections and those who actually do the job for a living. This team is made up of dedicated, passionate and skilled individuals who deeply care about the quality of their data and who do the heavy lifting and grunt work so that the users of the PDB enjoy the benefits. They have been working on a multi-year process to curate and improve the PDB data and are in the final major phase of the effort to clean up the archive and apply the processes to all new data moving forward. ChemSpider and PDB will be more integrated in the near future and we look forward to supporting their efforts to provide high quality structure data to the community and to continuing to expand the network of integrated online chemistry.

I announced in July of this year that we were performing predictions using the EPISuite of prediction tools. I’m glad to say that one of our servers is now in “cooling mode” after running red hot for over 4 months. We’ve been feeding it all single-component ChemSpider entities with Molecular Weight <500 (non-radicals). The results are now posted on ChemSpider under the EPISuite tab. We hope you find them of value and offer our thanks to the EPA for providing us access to the software.
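The selection rule is easy to express in code. The following is a minimal sketch under my own assumptions: the `Entity` fields and the function name are hypothetical, the molecular weight and radical flag are assumed to have been calculated already, and “single component” is approximated by the absence of a “.” (the SMILES disconnection symbol) in the structure string.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    record_id: int          # hypothetical database identifier
    smiles: str             # structure as a SMILES string
    molecular_weight: float
    is_radical: bool        # assumed to be precomputed elsewhere


def eligible_for_prediction(entity: Entity, max_mw: float = 500.0) -> bool:
    """Single component, molecular weight below max_mw, and not a radical."""
    single_component = "." not in entity.smiles
    return (
        single_component
        and entity.molecular_weight < max_mw
        and not entity.is_radical
    )


if __name__ == "__main__":
    candidates = [
        Entity(1, "CCO", 46.07, False),           # ethanol
        Entity(2, "[Na+].[Cl-]", 58.44, False),   # two components
        Entity(3, "[CH3]", 15.03, True),          # methyl radical
        Entity(4, "C" * 40, 563.08, False),       # tetracontane, too heavy
    ]
    print([e.record_id for e in candidates if eligible_for_prediction(e)])
```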

A lot of people have been helping to improve the quality of ChemSpider content by depositing new data and “cleaning up” errors in the data over the past few months. It’s been a long climb. Our thanks to all of you who have contributed. I’ll be the first one to put my hand up and acknowledge that in some ways I have not made the act of contributing to the curation process very easy since I’ve been feeding the data out via the blog in chunks, as it has developed. Following a recent “long flight” I am happy to announce that the Curators Handbook/Bible is now available in its first form and is available online here. This document gives some pretty detailed guidance regarding how to curate the ChemSpider database. As always we welcome feedback. If something is not clear let us know and we will expand/enhance as appropriate.

What I also want to do is to thank those people who have commented on how truly impressed they are with the rate at which we are cleaning the data. In general most curation requests identified on the site are addressed within 24 hours. There are some issues hanging out there that we don’t have solutions for at present, specifically in regards to organometallic data handling, but we are still thinking about a path forward.

It is finally time to roll out more attractive structure depictions. We have needed them for a while but they have become an absolute must-have as we roll out the following new capabilities:

1) The ability to make YOUR chemical blog structure searchable (watch this space…). We suggested one path previously…this is BETTER…

2) Structure balloons for using with our document markup tools, both browser-based and Microsoft Word based

We all judge quality of visual aesthetics quickly. We know a good structure when we see one. This is an announcement that we will be rolling out new structures across the site in the next few days. You will see better looking structures showing up across the site – during deposition, during service-based predictions, during searches and, well, everywhere. While not perfect as yet a little more tweaking and the entire database will be supported by the new structure depiction algorithms. As it is you should see some examples now on the database…one shown below. We welcome your feedback!

Frequent users of ChemSpider might have noticed a change in layout of the record view pages of late. As we layer more information onto a record view page (EPI Suite predictions, SimBioSys LASSO scores, spectral data, MORE predictions to come) the record view pages become increasingly heavy. As a result we have had to balance increasingly heavy pages against user experience. Since we have added the ability to perform structure searching on Pubmed recently and are now in the process of adding a new update for Patent searching we have chosen to hide the Data Source outlinks until you choose to see them.

So, if you are looking for original data sources and a list of potential commercial vendors please click on the button indicated below to fold out the list. Commercial vendors are indicated as discussed previously here.

Users of ChemSpider might have noticed some performance issues in the past 2-3 weeks with our web services, service availability and speed of searches. I put my hand in the air and say “Yup, acknowledged”. Hopefully they have not been too disruptive BUT it is for the overall benefit of the service ultimately. We have been streaming in 8 MILLION links to Pubmed in order to make Pubmed structure and substructure searchable. We are NOT rolling this out with full fanfare yet but I do want to explain the performance issues you might be experiencing. We work on Microsoft technology and while we are advocates for the platforms of .NET, IIS and SQL Server we definitely are putting them under pressure as we keep expanding the database and adding more value. We have thoughts about how to resolve this but want to finish populating the tables first.

The upside….the majority of links are already in place. For an example visit a structure and look for PubMed as a data source and click on one of the links. For example, for Valium here you will see in the datasource table a series of Pubmed IDs next to the PubMed datasource…

  16971504, 17673, 874970, 406430, 17881, 327854, 879884, 577681, 560225, 195649, …

These will link you out to PubMed directly. Try it out…

Now, do we have implementation issues? YES. The lists of external IDs can be long so right now we show only the first 10. We will deal with display of others shortly. We need to provide a way to curate out “junk” entries. For example, “methyl” is on ChemSpider as a fragment and has links to PubMed IDs…you’ll see why if you click them…it was done with text mining. These issues will be resolved but for now we announce that PubMed is structure and substructure searchable via ChemSpider. We will explain how we did it shortly but for now we will acknowledge the massive contribution of our colleagues at SureChem. More to come…

There has been an outpouring of offers from the ChemSpider community in terms of helping to examine/clean and enhance information regarding carbohydrates on ChemSpider. Almost 2 dozen users have now made an offer to help. Very exciting really!

I’ve already outlined the necessity to improve the quality of associations between structures and identifiers on the database. However, I am also hoping that users will write articles about carbohydrates using the rich-text formatting capabilities (ADD Description), will add spectra if they have them, will link up articles if they have interesting papers and will add URLs to interesting online content also.

We have now delivered the ability to curate and enhance records on ChemSpider and look forward to having our users help, starting with Carbohydrates…

As the number of spectra uploaded to ChemSpider increases (and it is now increasing at quite a rate) we have noticed that the loading time for records with a large number of spectra can be very long, especially if the spectra are “heavy”, for example C13 spectra at high frequency and with zero-filling. When there are a number of spectra there are even more challenges.

With this in mind we have introduced the ability to Load a Spectrum when the user wants to see the spectrum and not automatically on loading the page. An example is shown here for recently uploaded spectra from the Drexel University laboratory of Jean-Claude Bradley.

Please test it out and let us know if you see any issues. The example listed above has a “heavy” C13 spectrum so loading might take a while.

An announcement was made on the Blue Obelisk Discussion List this week regarding a new database of 4 million molecules at present but up to 50 million molecules in the future. It is called molecules.gnu-darwin.org/ and is listed with the following comments:

“Some facts: The Molecules website contains more than 4 million small molecule structure files in pdb format, and molecular graphics representations. About 50 million molecules are still in the pipe, and they are expected to appear here over the course of the next few weeks and months. The pdb format is readable by common FOSS molecule viewer software, such as RasMol and PyMOL. In due course, we plan to provide high quality structures via energy minimization refinement, and additional resources.

Molecules@gnu-darwin.org is founded in the spirit of free software, open source, and public access. It is hoped that access to these files will be a wonderful community resource for science education, research, and entertainment as well. We are looking for investment or funding to expedite and expand this work, and lead the field, with an eye towards an advanced, complete, synthetic, structural, and informatical bioorganome. Meanwhile, the site is already an exceptional lab resource, and molecular catalog, providing the means and building blocks towards additional novel structures. We aim to be the best.

The structural biology, protein crystallography, and molecular graphics talent that is building the Molecules archive is available to work for you in a contract or consulting arrangement. Wide-ranging expertise is available. Molecules@gnu-darwin.org is built entirely with FOSS, free and open source software, GNU-Darwin OS, and it is under the aegis of The GNU-Darwin Distribution. Here is a link to the Distribution résumé. Our founder is an X-ray laboratory admin for the Department of Biophysics and Biophysical Chemistry of Johns Hopkins University School of Medicine. You can also read his CV. We would like to build a community around this website, and we are looking for volunteers and collaborators to help. Regarding any aspect of the work of this site, please feel free to contact us, molecules@gnu-darwin.org, with gdmolecules in the subject line. Cheers!”

I’m always interested in potential databases to connect to that will add additional capabilities and diversity to ChemSpider’s information. I have browsed the database and searched on some common molecules (Xanax, aspirin, Taxol and others) and found no hits. This seemed strange but it does say “Search warning: not yet fully spidered”.

The statement that there are 50 million molecules in total coming suggests that the database is a republication of PubChem and the SDF archives seem to suggest so too since they redirect to PubChem for the download: http://molecules.gnu-darwin.org/ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/

At present the database therefore appears to be the PubChem database in PDB format. I hope that there is some additional information added to warrant our linking to this new database.

We have added the compound collection from Trans World Chemicals to ChemSpider. This is a collection of almost 1600 compounds. The collection can be viewed here.

ChemSpider has been working hard to support Wikipedia for a number of months now. We have been curating the structures on Wikipedia, I have been an active member of the WP:Chem team, we have extended our integration of Wikipedia to show the lead of the Wikipedia article on associated record views and have a lot of background activities going on re. Wikipedia at present (info will be released shortly). There are new articles released on Wikipedia on an ongoing basis and we stay up to date as best we can by monitoring bots for updates. Harvesting monographs out of Wikipedia based only on ChemBoxes and Drugboxes is not sufficient since not every article about drugs and chemicals on Wikipedia has an associated Drugbox or ChemBox.

For example… You have likely heard of Rember for Alzheimer’s already? A search on Google for Rember Alzheimers will give about 2 million hits. It’s already being discussed in the blogosphere including Derek Lowe’s In the Pipeline. Rember turns out to be methylene blue. There is already an article on Wikipedia about Rember but there is no chembox as yet. As I was researching Rember out of interest I noticed we did not have methylene blue linked to Wikipedia and Rember wasn’t associated with methylene blue. Adding the name was of course easy…5 seconds work after login.

We have now added the ability to associate data sources directly too. What does this mean? On a record view page is a list of “Data Sources” associated with a compound. This is where depositions about a compound came from and, generally, links back to the associated web pages. Previously in order to populate the Data Source table it would be necessary to deposit the structure and associated info as an SDF file. TOO MUCH work. So, now we have made it easy. To add a data source simply log in and select “Edit” (top right hand side of the data source table). To add a new data source simply click Add and input the information into the pop-up box. The input is the name to be listed in the Data Source table, the URL to the information on the Data Source page (if info exists) and the name of the Data Source.

There is one caveat to adding such links…the data source must exist. If you want to add data associated with your own website you need to register yourself, add a Data Source and wait for us to approve. Wikipedia is a special case since when the link is made we grab the lead of the article directly and show it in the Record View. For methylene blue there are two related Wikipedia articles so we have linked to them both as you can see on the record view. Simply go to ChemSpider and search for rember and you’ll see two linked Wikipedia articles.

We’ve been enhancing our deposition system so that the addition of 10s of thousands of new compounds to ChemSpider doesn’t have too big an impact on the performance of ChemSpider. The deposition of every structure demands the calculation of associated properties and deduplication against the database and needed to be optimized. As a result of our improved processing we are now cleaning up our backlog of new structures, something which is well overdue we know but we didn’t want to overly stress the servers for our users. New data are now on the database from the following companies. There are more to come…

In keeping with our commitment to continue to index Open Access journals for searching on ChemSpider we are happy to announce our indexing of Libertas Academica. Most people I have spoken to about our indexing of Open Access journals have never heard of this Open Access publisher. Libertas Academica offers “Open access journals on clinical medicine, bioinformatics, biology, chemistry, pharmacology, gene signalling, systems biology, informatics, virology, substance abuse, translational science and complimentary medicine.” I know of LA-press because of their Analytical Chemistry Insights journal.

Their list of Popular Journals is given below and their full list of journals is given on the third tab.

The publisher allows direct commenting on articles on their website as shown here for their article on “High-Performance Liquid Chromatographic Method for Determination of Phenytoin in Rabbits Receiving Sildenafil” (This article is already linked from the structures of Phenytoin and Sildenafil)

Following our previous approach of using Taxol and Paclitaxel as a measure of potential contribution to search results on ChemSpider, searching Libertas Academica gives 6 hits on Taxol while a search on Paclitaxel gives 23 hits.

Our growing list of Open Access Publishers is rather impressive at this point…see below. It will continue to grow.

The Environmental Protection Agency has provided permission for ChemSpider to utilize their EPI Suite™ software to predict a number of physical properties for the chemicals on the ChemSpider database. The properties include:
KOWWIN™: Estimates the log octanol-water partition coefficient, log KOW, of chemicals using an atom/fragment contribution method.
AOPWIN™: Estimates the gas-phase reaction rate for the reaction between the most prevalent atmospheric oxidant, hydroxyl radicals, and a chemical. Gas-phase ozone radical reaction rates are also estimated for olefins and acetylenes. In addition, AOPWIN™ informs the user if nitrate radical reaction will be important. Atmospheric half-lives for each chemical are automatically calculated using assumed average hydroxyl radical and ozone concentrations.
HENRYWIN™: Calculates the Henry’s Law constant (air/water partition coefficient) using both the group contribution and the bond contribution methods.
MPBPWIN™: Melting point, boiling point, and vapor pressure of organic chemicals are estimated using a combination of techniques. Included is the subcooled liquid vapor pressure, which is the vapor pressure a solid would have if it were liquid at room temperature. It is important in fate modeling.
BIOWIN™: Estimates aerobic and anaerobic biodegradability of organic chemicals using 7 different models; two of these are the original Biodegradation Probability Program (BPP™).  The seventh and newest model estimates anaerobic biodegradation potential.
BioHCWIN: Estimates biodegradation half-life for compounds containing only carbon and hydrogen (i.e. hydrocarbons).
PCKOCWIN™: The ability of a chemical to sorb to soil and sediment, its soil adsorption coefficient (Koc), is estimated by this program. EPI’s Koc estimations are based on the Sabljic molecular connectivity method with improved correction factors.
WSKOWWIN™: Estimates an octanol-water partition coefficient using the algorithms in the KOWWIN™ program and estimates a chemical’s water solubility from this value. This method uses correction factors to modify the water solubility estimate based on regression against log Kow.
WATERNT™: Estimates water solubility directly using a “fragment constant” method similar to that used in the KOWWIN™ model.
HYDROWIN™: Acid- and base-catalyzed hydrolysis constants for specific organic classes are estimated by HYDROWIN™. A chemical’s hydrolytic half-life under typical environmental conditions is also determined. Neutral hydrolysis rates are currently not estimated.
BCFWIN™: This program calculates the BioConcentration Factor and its logarithm from the log Kow. The methodology is analogous to that for WSKOWWIN™. Both are based on log Kow and correction factors.
KOAWIN: KOA is the octanol/air partition coefficient and has multiple uses in chemical assessment. The model estimates KOA using the ratio of the octanol/water partition coefficient (KOW) from KOWWIN™, and the dimensionless Henry’s Law constant (KAW) from HENRYWIN™.
AEROWIN™: Estimates the fraction of airborne substance sorbed to airborne particulates, i.e. the parameter phi (φ), using three different methods. AEROWIN™ results are also displayed with AOPWIN™ output as an aid in interpretation of the latter.
WVOLWIN™: Estimates the rate of volatilization of a chemical from rivers and lakes; calculates the half-life for these two processes from their rates. The model makes certain default assumptions-water body depth; wind velocity; etc.
STPWIN™: Using several outputs from EPI Suite™, this program predicts the removal of a chemical in a Sewage Treatment Plant; values are given for the total removal and three contributing processes (biodegradation, sorption to sludge, and stripping to air.) for a standard system and set of operating conditions.
LEV3EPI™: This level III fugacity model predicts partitioning of chemicals between air, soil, sediment, and water under steady state conditions for a default model “environment”; various defaults can be changed by the user.
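The KOAWIN description in the list above is a ratio, so in log space it becomes a subtraction: log Koa = log Kow - log Kaw. The sketch below works that through with the Xanax figures from the example output further down (log Kow 2.12, Henry’s Law constant 9.77E-012 atm-m3/mole). The function names are hypothetical, and the conversion Kaw = H/(RT) at 25 deg C is an assumption about how the dimensionless constant is obtained. It is not taken from the EPI Suite documentation.

```python
import math

# Gas constant in m3*atm/(mol*K); chosen to match the atm-m3/mole units
# used for the Henry's Law constant in the EPI Suite output.
R_M3_ATM_PER_MOL_K = 8.205746e-5


def log_kaw_from_henry(h_atm_m3_per_mol: float, temp_k: float = 298.15) -> float:
    """Log of the dimensionless air/water partition coefficient.

    Assumes Kaw = H / (R * T); this conversion is an assumption here.
    """
    return math.log10(h_atm_m3_per_mol / (R_M3_ATM_PER_MOL_K * temp_k))


def log_koa(log_kow: float, log_kaw: float) -> float:
    """Koa = Kow / Kaw, so log Koa = log Kow - log Kaw."""
    return log_kow - log_kaw


# Values copied from the Xanax example output.
xanax_log_kaw = log_kaw_from_henry(9.77e-12)
xanax_log_koa = log_koa(2.12, round(xanax_log_kaw, 3))
print(round(xanax_log_kaw, 3), round(xanax_log_koa, 3))
```

Under these assumptions the two numbers agree with the “Log Kaw used” and “Log Koa (KOAWIN v1.10 estimate)” lines in the Xanax output, which suggests the modules really are chained the way the descriptions say.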

The values for individual structures are available in the Record View under the EPI Summary.

For example, the information for Xanax is below.

 Log Octanol-Water Partition Coef (SRC):
    Log Kow (KOWWIN v1.67 estimate) =  3.87
    Log Kow (Exper. database match) =  2.12
       Exper. Ref:  BioByte (1995)

 Boiling Pt, Melting Pt, Vapor Pressure Estimations (MPBPWIN v1.42):
    Boiling Pt (deg C):  441.81  (Adapted Stein & Brown method)
    Melting Pt (deg C):  185.42  (Mean or Weighted MP)
    VP(mm Hg,25 deg C):  1.65E-008  (Modified Grain method)
    Subcooled liquid VP: 7.84E-007 mm Hg (25 deg C, Mod-Grain method)

 Water Solubility Estimate from Log Kow (WSKOW v1.41):
    Water Solubility at 25 deg C (mg/L):  13.1
       log Kow used: 2.12 (expkow database)
       no-melting pt equation used

 Water Sol Estimate from Fragments:
    Wat Sol (v1.01 est) =  0.15855 mg/L

 ECOSAR Class Program (ECOSAR v0.99h):
    Class(es) found:
       Aliphatic Amines
Henrys Law Constant (25 deg C) [HENRYWIN v3.10]:
   Bond Method :   9.77E-012  atm-m3/mole
   Group Method:   Incomplete
 Henrys LC [VP/WSol estimate using EPI values]:  5.117E-010 atm-m3/mole

 Log Octanol-Air Partition Coefficient (25 deg C) [KOAWIN v1.10]:
  Log Kow used:  2.12  (exp database)
  Log Kaw used:  -9.399  (HenryWin est)
      Log Koa (KOAWIN v1.10 estimate):  11.519
      Log Koa (experimental database):  None

 Probability of Rapid Biodegradation (BIOWIN v4.10):
   Biowin1 (Linear Model)         :   0.6009
   Biowin2 (Non-Linear Model)     :   0.2660
 Expert Survey Biodegradation Results:
   Biowin3 (Ultimate Survey Model):   2.2574  (weeks-months)
   Biowin4 (Primary Survey Model) :   3.1733  (weeks       )
 MITI Biodegradation Probability:
   Biowin5 (MITI Linear Model)    :  -0.1488
   Biowin6 (MITI Non-Linear Model):   0.0042
 Anaerobic Biodegradation Probability:
   Biowin7 (Anaerobic Linear Model): -0.4906
 Ready Biodegradability Prediction:   NO

Hydrocarbon Biodegradation (BioHCwin v1.01):
    Structure incompatible with current estimation method!

 Sorption to aerosols (25 Dec C)[AEROWIN v1.00]:
  Vapor pressure (liquid/subcooled):  0.000105 Pa (7.84E-007 mm Hg)
  Log Koa (Koawin est  ): 11.519
   Kp (particle/gas partition coef. (m3/ug)):
       Mackay model           :  0.0287
       Octanol/air (Koa) model:  0.0811
   Fraction sorbed to airborne particulates (phi):
       Junge-Pankow model     :  0.509
       Mackay model           :  0.697
       Octanol/air (Koa) model:  0.866 

 Atmospheric Oxidation (25 deg C) [AopWin v1.92]:
   Hydroxyl Radicals Reaction:
      OVERALL OH Rate Constant =   7.6246 E-12 cm3/molecule-sec
      Half-Life =     1.403 Days (12-hr day; 1.5E6 OH/cm3)
      Half-Life =    16.834 Hrs
   Ozone Reaction:
      No Ozone Reaction Estimation
   Fraction sorbed to airborne particulates (phi): 0.603 (Junge,Mackay)
    Note: the sorbed fraction may be resistant to atmospheric oxidation

 Soil Adsorption Coefficient (PCKOCWIN v1.66):
      Koc    :  2.151E+006
      Log Koc:  6.333 

 Aqueous Base/Acid-Catalyzed Hydrolysis (25 deg C) [HYDROWIN v1.67]:
    Rate constants can NOT be estimated for this structure!

 Bioaccumulation Estimates from Log Kow (BCFWIN v2.17):
   Log BCF from regression-based method = 0.932 (BCF = 8.559)
       log Kow used: 2.12 (expkow database)

 Volatilization from Water:
    Henry LC:  9.77E-012 atm-m3/mole  (estimated by Bond SAR Method)
    Half-Life from Model River: 1.053E+008  hours   (4.388E+006 days)
    Half-Life from Model Lake : 1.149E+009  hours   (4.786E+007 days)

 Removal In Wastewater Treatment:
    Total removal:               2.37  percent
    Total biodegradation:        0.10  percent
    Total sludge adsorption:     2.27  percent
    Total to Air:                0.00  percent
      (using 10000 hr Bio P,A,S)

 Level III Fugacity Model:
           Mass Amount    Half-Life    Emissions
            (percent)        (hr)       (kg/hr)
   Air       0.000217        33.7         1000
   Water     21              900          1000
   Soil      78.9            1.8e+003     1000
   Sediment  0.094           8.1e+003     0
     Persistence Time: 1.48e+003 hr
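Several of the numbers in this output can be cross-checked against each other with simple arithmetic. The sketch below reproduces two of them: the atmospheric half-life from the hydroxyl rate constant, and the “VP/WSol” Henry’s Law estimate. The function names are hypothetical. The first-order half-life formula and the molecular weight of 308.77 g/mol for alprazolam (Xanax) are my assumptions; neither appears in the output itself.

```python
import math


def oh_half_life_hours(k_oh_cm3_per_molecule_s: float,
                       oh_per_cm3: float = 1.5e6) -> float:
    """Pseudo-first-order half-life, t1/2 = ln(2) / (k * [OH]), in hours."""
    rate_per_second = k_oh_cm3_per_molecule_s * oh_per_cm3
    return math.log(2) / rate_per_second / 3600.0


def henry_from_vp_and_solubility(vp_mm_hg: float,
                                 solubility_mg_per_l: float,
                                 mol_weight_g_per_mol: float) -> float:
    """Henry's Law constant in atm-m3/mole as vapor pressure / solubility."""
    vp_atm = vp_mm_hg / 760.0
    # 1 mg/L equals 1 g/m3, so dividing by g/mol gives mol/m3.
    solubility_mol_per_m3 = solubility_mg_per_l / mol_weight_g_per_mol
    return vp_atm / solubility_mol_per_m3


# Inputs copied from the Xanax output; 308.77 g/mol is an assumed value.
half_life_hr = oh_half_life_hours(7.6246e-12)
henry_vp_wsol = henry_from_vp_and_solubility(1.65e-8, 13.1, 308.77)

print(f"OH half-life: {half_life_hr:.2f} hr "
      f"({half_life_hr / 12:.2f} twelve-hour days)")
print(f"Henry's LC (VP/WSol): {henry_vp_wsol:.3e} atm-m3/mole")
```

The half-life comes out within a few thousandths of an hour of the reported 16.834 Hrs, and the Henry’s Law value matches the reported 5.117E-010 to the digits shown. Small differences of that size are expected if the program carries more digits internally than it prints.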

We started the calculations a number of weeks ago and are updating our progress on the ChemSpider Forum here. We now have values predicted for 3 million compounds.

It is NOT possible at present to search on these properties in the same way that other properties can be searched on the Search Predicted Properties page as shown below.

After all EPI Suite properties are predicted we will selectively make some of these available for searching. The interest so far appears to be in Henry’s Law values, Water Solubility and Melting Point (something that is very difficult to predict with accuracy!). We welcome your comments.

We will be able to extract experimental values for some properties and display them directly. For example, logP shows an “experimental database match” for Xanax.

Log Octanol-Water Partition Coef (SRC):
Log Kow (KOWWIN v1.67 estimate) = 3.87
Log Kow (Exper. database match) = 2.12

Exper. Ref: BioByte (1995)
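Because these are log values, the gap between the estimate and the experimental match is larger than it first looks. A small sketch of the arithmetic (the function name is hypothetical):

```python
def fold_difference(log_a: float, log_b: float) -> float:
    """Ratio between two quantities given as base-10 logarithms."""
    return 10 ** abs(log_a - log_b)


# KOWWIN estimate versus experimental database match for Xanax.
gap_in_log_units = 3.87 - 2.12
ratio = fold_difference(3.87, 2.12)
print(f"{gap_in_log_units:.2f} log units, roughly a {ratio:.0f}-fold difference in Kow")
```

A difference of 1.75 log units means the predicted partition coefficient is well over an order of magnitude away from the measured one, which is a good argument for showing the experimental value whenever one exists.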

It is going to take a number of weeks to generate EPI Suite values for 21.5 million molecules but we are moving in that direction. Our sincere thanks to the EPA for allowing us to use their EPI Suite software on ChemSpider for the benefit of the community.

I have spoken on this blog many times about the challenges of cleaning up data in chemistry databases. We’re expending a lot of effort, with the assistance of many others, in cleaning up the data on ChemSpider and, as a benefit, assisting in cleaning up data in other databases also. The efforts to curate the chemical structure data on Wikipedia continue and the work is now focused on delivering ‘bots that will drive a cleansed data file to the individual records. Over the past few months I have developed a great appreciation for the efforts, dedication and commitment of the many contributors to Wikipedia Chemistry. There are tens of people editing and contributing to the articles and then there is the “core WP:Chem team” who show up for the IRC chats most Tuesdays at noon. Many of the recent chats have focused on how to curate the data, how to utilize ‘bots and how to control curated data moving forward. I am honored to share “IRC-space” with them!

Over the past few weeks I have been similarly blessed to interact with the ChEBI team via email as we have done our work to deposit their Entities of the Month (1,2). During the process of doing so we have exchanged many emails and have cleaned a number of errors in our mutual datasets. In my opinion a PERFECT example of the results of such detailed efforts is for Vancomycin. One week ago a search on vancomycin would give a dozen hits. Many of these had incomplete stereochemistry. Now a search on ChemSpider gives one hit for vancomycin here. This is the result of working with Kirill Degtyarenko at ChEBI. The conversation was initiated by my observation regarding stereo in the structure on ChEBI.

For details on how this is identified to be the correct structure read the description on that page. VERY DETAILED and includes links out to three publications.

Compare this with a search for vancomycin on PubChem giving 66 hits. Some of these differences are due to the different approaches for our text searches – the PubChem results list includes VANCOMYCIN HYDROCHLORIDE and Gatifloxacin & Vancomycin for example. However, there are a number of “vancomycins” also.

We believe we have the correct vancomycin identified at this point…we welcome any challengers!

Thanks to the efforts of contributors such as Heinz Kolshorn, new compounds and associated analytical data are finding their way onto ChemSpider on a regular basis. These are chemical compounds that have been synthesized and fully characterized. Unless they are published, they are unlikely to find their way into chemical registry systems or into training databases for the commercial NMR prediction packages such as those of ACD/Labs, Bio-Rad, Modgraph or Wolfgang Robien’s collection. As a result, this type of information will be “Lost Chemistry”. These particular data from Heinz will almost certainly find their way into the NMRShiftDB since Heinz is hosting the database at his lab at the University of Mainz.

Heinz has been putting actual experimental spectra and the associated shift assignments onto ChemSpider of late. An example is here. This is enabled by our ability to upload and store both spectra and images. There are better ways to display the shift assignments, for example by allowing mouseover display of the structure and peak associations. That is not yet available on the system but is clearly a nice-to-have. For now the information is there for others to use and is indicative of the value of integrating images and spectral data. I can envisage other pairings, such as a UV spectrum alongside a photo of the colored solution.

Over the past few months we have recognized those people who have spent their time contributing to the content of ChemSpider, either as depositors or as curators. Recently I commented about one member of our Advisory Group, Chris Singleton, taking on a major project to deposit spectral data to ChemSpider. If you visit the spectral data page and scroll through you will see that there are now 33 pages of spectra, each page containing 20 spectra. The majority of these are NMR spectra and the largest single collection is that deposited by Chris over the past few weeks. The data were obtained from the Madison Metabolomics Consortium Database, described in a publication by Q. Cui et al., “Metabolite identification via the Madison Metabolomics Consortium Database”, Nature Biotechnology, 26, 162 (2008). Our sincere thanks to Chris for all of his work!

There is another raft of spectra waiting to be processed and deposited so the spectral data collection will continue to grow.

I have blogged previously about ChEBI Entities of the Month and our work to add the information to ChemSpider. In order to do so we had to introduce rich text support. This work is done and reported here. As of today nearly all ChEBI Entity of the Month information is posted to ChemSpider. During the process we have provided feedback to the team about suggested changes to some structure depictions and have also noted some differences in stereochemistry between our reference structures and those on ChEBI. This type of interaction keeps us all vigilant about accuracy and it was great (and fast) to work with the group at ChEBI to cross-validate the limited dataset. Everyone gains.

The rich text editor worked perfectly and without failure. We think it is ready to roll out to the general public, but we would still like some beta-testers to help test it please.

Okay, this is clearly a rather tongue-in-cheek blog post but I couldn’t resist.

Search “sex” on ChemSpider and you get two hits…here

Click on the first structure and you will find that one of the identifiers for this compound is SEX, and it is an explosive.

Just READ the second structure and you will see it is SEX. It’s CLEAN sex though. The dirty sex was described in a recent C&E News article, which points back to the poor image originally published by the New York Times when they issued a book review of Pamela Paul’s Book “Bonk, The Curious Coupling of Science and Sex”. In order to have CLEAN sex I removed inappropriate substitutions and bonds.

It still looks like sex though…

ChemSpider added the Directory of Useful Decoys over the weekend. This dataset is well known to the community of scientists performing computational docking experiments and is outlined below. The dataset contributed over 128,000 molecules to the collection.

“DUD, a directory of useful decoys for benchmarking virtual screening. DUD is designed to help test docking algorithms by providing challenging decoys. It contains:

  • A total of 2,950 active compounds against a total of 40 targets
  • For each active, 36 “decoys” with similar physical properties (e.g. molecular weight, calculated LogP) but dissimilar topology.

DUD is provided by the Shoichet Laboratory in the Department of Pharmaceutical Chemistry at the University of California, San Francisco (UCSF). To cite DUD, please reference Huang, Shoichet and Irwin, J. Med. Chem., 2006, 49(23), 6789-6801. doi 10.1021/jm0608356. There is a DUD wiki page where you can discuss DUD and an errata page where problems are reported and explained.”

In an ongoing commentary about the DailyMed dataset (1,2) I have been showing some of the struggles of creating curated datasets from publicly available data. This post shows an example of when trade names collide. The DailyMed record for Sclerosol shows no chemical structure in the label…but describes the compound as follows:

“Sclerosol® Intrapleural Aerosol (sterile talc powder 4 g) is a sclerosing agent for intrapleural administration supplied as a single-use, pressurized spray canister with two delivery tubes of 15 cm and 25 cm in length. Each canister contains 4.0 g of talc, either white or off-white to light grey, asbestos-free, and brucite-free grade of talc of controlled granulometry. The composition of the talc is ≥ 95% talc as hydrated magnesium silicate. The empirical formula is Mg3 Si4 O10 (OH)2 with molecular weight of 379.3.”
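The molecular weight quoted on the label can be checked directly from the empirical formula Mg3 Si4 O10 (OH)2. The sketch below does the sum by hand. The atomic weights are standard rounded values that I have supplied; they are not part of the label text, and the names are hypothetical.

```python
# Standard atomic weights in g/mol (rounded; supplied here as an assumption).
ATOMIC_WEIGHTS = {"Mg": 24.305, "Si": 28.0855, "O": 15.999, "H": 1.008}

# Mg3 Si4 O10 (OH)2 expanded to total atom counts:
# 10 oxygens in the silicate part plus 2 from the hydroxyl groups.
TALC_ATOM_COUNTS = {"Mg": 3, "Si": 4, "O": 12, "H": 2}


def molecular_weight(atom_counts: dict) -> float:
    """Sum of atomic weight times count over every element in the formula."""
    return sum(ATOMIC_WEIGHTS[element] * count
               for element, count in atom_counts.items())


talc_mw = molecular_weight(TALC_ATOM_COUNTS)
print(f"Talc molecular weight: {talc_mw:.1f}")
```

With these weights the result rounds to the 379.3 stated on the label, so the formula and the molecular weight in the DailyMed record are at least consistent with each other.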

Sclerosol is Talc. A search on Sclerosol online, however, brings us numerous hits for dimethyl sulfoxide on ChemIndustry, on the Comparative Toxicogenomics Database and on MeSH. So, is Sclerosol also DMSO?

The PubChem record merges the relationship between Talc and DMSO rather well. Visit the record here. The substance summary is as follows:

“A highly polar organic liquid, that is used widely as a chemical solvent. Because of its ability to penetrate biological membranes, it is used as a vehicle for topical application of pharmaceuticals. It is also used to protect tissue during CRYOPRESERVATION. Dimethyl sulfoxide shows a range of pharmacological activity including analgesia and anti-inflammation.”

Further information is the MeSH details shown below.

The image of the associated structure is shown below…notice it’s representative of talc.

It appears that DMSO and Talc were meshed somehow.

Sclerosol on ChemSpider is Talc. I am not stating that the structure representation of talc is appropriate but it IS the same as the one displayed on PubChem. DMSO on ChemSpider is here and never had the name Sclerosol associated with it. Since we derived some of our data from PubChem I am not sure how we managed to separate the DMSO and Sclerosol association in our processes…but we did.

So, MAYBE Sclerosol is a name for DMSO…but I don’t think so.

Why is this important? We are working on text mining and will use a lookup dictionary of chemical names and structures as part of the process, so we are putting in the work to create a high quality dictionary. It’s important for us moving forward.
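To make the problem concrete, here is a minimal sketch of how a name dictionary can flag collisions like the Sclerosol one before they reach a text-mining pipeline. It is only an illustration and not how ChemSpider builds its dictionary. The record layout, the identifiers and the function names are all hypothetical, and the name lists are trimmed to what this post mentions.

```python
from collections import defaultdict


def build_name_index(records):
    """Map each normalized name to the set of structure ids that claim it."""
    index = defaultdict(set)
    for structure_id, names in records:
        for name in names:
            # Normalize case and whitespace so "Sclerosol " and "sclerosol" collide.
            index[name.strip().lower()].add(structure_id)
    return index


def ambiguous_names(index):
    """Names attached to more than one structure; these need a curator."""
    return {name: ids for name, ids in index.items() if len(ids) > 1}


# Hypothetical records: (structure id, names asserted by some source).
records = [
    ("talc", ["Talc", "Sclerosol"]),
    ("dmso", ["Dimethyl sulfoxide", "DMSO", "Sclerosol"]),
]

for name, ids in ambiguous_names(build_name_index(records)).items():
    print(f"'{name}' maps to {sorted(ids)}")
```

A flag like this does not say which association is right. It only tells a curator where to look, which is the part that still takes a chemist.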