Archive for the Community Building Category

An article in the latest C&E News discusses the acquisition of ChemSpider by the Royal Society of Chemistry. I certainly appreciate the comments of Robert Massie, President of CAS, quoted in the article: “CAS has worked with Williams in the past,” CAS President Robert J. Massie notes. “We join everyone who is interested in the advance of chemical information in recognizing his considerable contributions. We are delighted to see that his creativity and enthusiasm will continue to benefit the chemical enterprise.”

I worked a lot with CAS during my time at ACD/Labs (I was there for over 10.5 years and left as their Chief Science Officer). I was intimately involved in the development and deployment of a number of software tools and visited Columbus many times. I have many fond memories of working with the CAS team and there are some great people working at the organization. I hope that in my new role at the Royal Society of Chemistry I will have the opportunity to work with CAS again in a collaborative and cross-publisher manner to the benefit of the chemistry community.


With the news about the RSC acquiring ChemSpider assets starting to settle it is time to get back to work. One of the things we are noticing is that people are really starting to take advantage of the ability to link their articles to ChemSpider via the Add DOI function that is available to registered users. If you want to associate a paper with a single chemical structure it is very easy: the function uses the CrossRef service to fetch the result of a DOI lookup and deposit it directly to ChemSpider. The four images below outline the way in which this can be done. In this example I want to associate two particular articles with the record for 1,2-dioxetanedione shown below.

It’s easy. Navigate to the record of interest, make sure that the structure is the correct one and simply click the Add DOI button above the chemical structure to the left. Don’t forget you must be logged in! Now enter the DOI, click on LookUp and confirm that the title retrieved is the correct publication. Then click on OK. The publication will then be submitted for a curator to confirm that it is appropriate, and it will show up online under the Supplementary Information when approved.
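
For anyone who wants to check a DOI before pasting it into the Add DOI box, here is a minimal Python sketch of the same kind of lookup. It is not ChemSpider's own code. It assumes the public Crossref REST API at `https://api.crossref.org/works/<doi>` and its usual JSON layout (a `message` object with `title`, `author` and `DOI`). The function names are made up for illustration, and the DOI in the usage note is a placeholder rather than a real article.

```python
import json
import re
import urllib.parse
import urllib.request

# Loose shape check only: "10.", a registrant code, a slash, then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Prefixes people commonly paste along with the DOI itself.
PREFIXES = ("https://doi.org/", "http://doi.org/",
            "https://dx.doi.org/", "http://dx.doi.org/", "doi:")


def normalize_doi(raw):
    """Strip common prefixes and reject strings that do not look like a DOI."""
    doi = raw.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a plausible DOI: {raw!r}")
    return doi


def crossref_url(doi):
    """Build the Crossref lookup URL for a DOI (assumed endpoint, see above)."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(
        normalize_doi(doi), safe="/")


def summarize(record):
    """Pull DOI, title and first author out of a Crossref-style response dict."""
    message = record["message"]
    titles = message.get("title") or [""]
    authors = message.get("author") or []
    first_author = ""
    if authors:
        parts = (authors[0].get("given"), authors[0].get("family"))
        first_author = " ".join(p for p in parts if p)
    return {"doi": message.get("DOI", ""),
            "title": titles[0],
            "first_author": first_author}


def lookup(doi):
    """Fetch and summarize one record. Needs network access; never run on import."""
    with urllib.request.urlopen(crossref_url(doi), timeout=10) as response:
        return summarize(json.load(response))
```

Calling `lookup("doi:10.1000/xyz123")` with a real DOI in place of the placeholder should return the title to compare against the one shown after clicking LookUp. The curator check described above still applies.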

There are also processes for depositing an SDF file with a single publication, and the SAME process is applied to connecting via PubMed IDs (Add PMID). Try it out. Help the community discover publications by adding appropriate DOIs to particular records. Look at how many there are associated with cholesterol already.




I blogged yesterday about our release of Wikipedia Services on ChemSpider and how we are working to support authors on Wikipedia articles. Of course there are MANY languages of Wikipedia (as shown below) and we are willing to produce multilingual support. All we need is someone from the specific language version of Wikipedia to contact us and map the ChemBoxes and Drugboxes into their relevant languages. Let us know if you are interested.



Wikipedia is great. I use it regularly. I’ve been working, with a team of experts, on curating and validating the “structure-based data” in the ChemBoxes and DrugBoxes for almost a year and a half. It’s been a long path and on the journey I have met some great people and made some true friends. I also HAVE NOT met most of the people I share the IRC chats with. We are a highly opinionated bunch of people but with a common focus of making Wikipedia better and making the data and content as accurate as possible.

We have the Wikipedia article lead in thousands of records on ChemSpider now. They are updated regularly as Wikipedia itself expands. One of the areas we have been focused on since the inception of the work was getting correct structures in place with the associated data. This includes the molecular formula, molecular weight, SMILES, InChI String, InChIKey, systematic name and so on. In order to help the process of expanding Wikipedia with new records and to provide a lot of these data automatically we have set about providing a Wikipedia Service so that Wikipedians can use ChemSpider as the source of the chemical structures of interest and generate the DrugBox and ChemBox content from ChemSpider. It’s a rather simple process…

Assume that you want to create a ChemBox for Domoic Acid. You would search for Domoic Acid on ChemSpider, validate whether the structure on ChemSpider named domoic acid is correct and, if so, generate the Wikibox by clicking on the link to the right of the Quick Links.


Following this simple button click the user is shown a new window displaying the “Design Wikibox” functionality. There are various flavors of ChemBoxes and Drugboxes which can be generated and the image below shows the “Simple ChemBox”


At present we fill the box with the data we have easy access to on ChemSpider, based on the chemical structure. We list all other fields for Wiki depositors to populate. For Domoic Acid the Simple ChemBox looks like this:

| ImageFile =
| ImageSize =
| IUPACName = (2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-6-hydroxy-1,5-dimethyl-6-oxo-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid
| OtherNames =
| Section1 = {{Chembox Identifiers
| CASNo =
| PubChem = 5282253
| ChemSpiderID = 4445428
| SMILES = O=C(O)[C@H]1NC[C@H](/C(=C\C=C\[C@H](C(=O)O)C)C)[C@@H]1CC(=O)O }}
| Section2 = {{Chembox Properties
| Formula = C15H21NO6
| MolarMass = 311.3303
| Appearance =
| Density =
| MeltingPt =
| BoilingPt =
| Solubility = }}
| Section3 = {{Chembox Hazards
| MainHazards =
| FlashPt =
| Autoignition = }}
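
As a sanity check on the MolarMass field, the value can be recomputed from the Formula field. The sketch below is not the ChemSpider calculation. It assumes a simple formula without brackets or charges, and it uses one common set of standard atomic weights, which happens to reproduce the number in the box.

```python
import re

# Assumed atomic weights (g/mol). Other tabulations differ in the last digits.
ATOMIC_WEIGHTS = {"C": 12.0107, "H": 1.00794, "N": 14.0067, "O": 15.9994}


def parse_formula(formula):
    """Count atoms in a flat formula such as 'C15H21NO6' (no brackets)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def molar_mass(formula):
    """Sum atomic weight times atom count over every element in the formula."""
    return sum(ATOMIC_WEIGHTS[symbol] * count
               for symbol, count in parse_formula(formula).items())


# 15 C + 21 H + 1 N + 6 O
print(round(molar_mass("C15H21NO6"), 4))
```

With these weights the sum is 180.1605 + 21.16674 + 14.0067 + 95.9964 = 311.33034, which rounds to the 311.3303 shown in the ChemBox.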

We insert the PubChem ID associated with the particular structure if there is a related PubChem record. We also insert the ChemSpider ID in case the user wants to link back to ChemSpider. A Full ChemBox is much longer:

| Name =
| ImageFile =
| ImageSize =
| IUPACName = (2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-6-hydroxy-1,5-dimethyl-6-oxo-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid
| SystematicName = (2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-6-hydroxy-1,5-dimethyl-6-oxo-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid
| OtherNames =
| Section1 = {{Chembox Identifiers
| Abbreviations =
| CASNo =
| PubChem = 5282253
| ChemSpiderID = 4445428
| SMILES = O=C(O)[C@H]1NC[C@H](/C(=C\C=C\[C@H](C(=O)O)C)C)[C@@H]1CC(=O)O
| InChI = InChI=1S/C15H21NO6/c1-8(4-3-5-9(2)14(19)20)11-7-16-13(15(21)22)10(11)6-12(17)18/h3-5,9-11,13,16H,6-7H2,1-2H3,(H,17,18)(H,19,20)(H,21,22)/b5-3+,8-4-/t9-,10+,11-,13+/m1/s1
| MeSHName = domoic acid
| ChEBI =
| KEGG = C13732
| ATCCode_prefix =
| ATCCode_suffix =
| ATC_Supplemental =}}
| Section2 = {{Chembox Properties
| Formula = C15H21NO6
| MolarMass = 311.3303
| Appearance =
| Density =
| MeltingPt =
| Melting_notes =
| BoilingPt =
| Boiling_notes =
| Solubility =
| SolubleOther =
| Solvent =
| LogP =
| VaporPressure =
| HenryConstant =
| AtmosphericOHRateConstant =
| pKa =
| pKb = }}
| Section3 = {{Chembox Structure
| CrystalStruct =
| Coordination =
| MolShape = }}
| Section4 = {{Chembox Thermochemistry
| DeltaHf =
| DeltaHc =
| Entropy =
| HeatCapacity = }}
| Section5 = {{Chembox Pharmacology
| AdminRoutes =
| Bioavail =
| Metabolism =
| HalfLife =
| ProteinBound =
| Excretion =
| Legal_status =
| Legal_US =
| Legal_UK =
| Legal_AU =
| Legal_CA =
| PregCat =
| PregCat_AU =
| PregCat_US = }}
| Section6 = {{Chembox Explosive
| ShockSens =
| FrictionSens =
| ExplosiveV =
| REFactor = }}
| Section7 = {{Chembox Hazards
| ExternalMSDS =
| EUClass =
| EUIndex =
| MainHazards =
| NFPA-H =
| NFPA-F =
| NFPA-R =
| NFPA-O =
| RPhrases =
| SPhrases =
| RSPhrases =
| FlashPt =
| Autoignition =
| ExploLimits =
| LD50 =
| PEL = }}
| Section8 = {{Chembox Related
| OtherAnions =
| OtherCations =
| OtherFunctn =
| Function =
| OtherCpds = }}
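
To show how boxes like the two above can be produced from a handful of known values, here is a small Python sketch that renders the same "| Field = value" layout with numbered sections. It is an illustration only and not the code behind the Design Wikibox page. The function name and the input structure are assumptions. On Wikipedia the whole thing normally sits inside an outer `{{Chembox ... }}` wrapper, which the listings above do not show and this sketch does not add.

```python
def render_chembox(top_fields, sections):
    """Render top-level fields plus numbered '{{Chembox <name>' sections.

    top_fields: dict of field name -> value (empty string for a blank field).
    sections:   list of (section name, dict of field name -> value) pairs.
    """
    lines = [f"| {key} = {value}".rstrip() for key, value in top_fields.items()]
    for index, (name, fields) in enumerate(sections, start=1):
        lines.append(f"| Section{index} = {{{{Chembox {name}")
        rendered = [f"| {key} = {value}".rstrip() for key, value in fields.items()]
        if rendered:
            rendered[-1] += " }}"   # close the section on its last field
        else:
            lines[-1] += " }}"      # a section with no fields closes at once
        lines.extend(rendered)
    return "\n".join(lines)


box = render_chembox(
    {"ImageFile": "", "OtherNames": ""},
    [
        ("Identifiers", {"CASNo": "", "PubChem": "5282253",
                         "ChemSpiderID": "4445428"}),
        ("Properties", {"Formula": "C15H21NO6", "MolarMass": "311.3303",
                        "Solubility": ""}),
    ],
)
print(box)
```

Fields the service cannot fill stay blank for the Wikipedia editor to complete, which matches the behavior described above.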

The user can also use the ChemSpider image, resize it and click on it to download it as a PNG file. We believe that our images are attractive and appropriate for web display. Wikipedia at present favors the ACS format, so based on feedback we can change the config file behind the image generator to produce a different format for display.

We are considering extending the system to support direct uploads of Molfiles and/or other structure formats rather than depending on a compound being on ChemSpider. However, it is VERY likely that chemical compounds of value to the Wikipedia encyclopedic content already exist on ChemSpider. The trick is to find them since they may not have the Wikipedia article chemical name associated with the record. An InChI-based, SMILES-based or alternative name search might help locate the record. Alternatively a full structure search via the applet will find the record OR the user can DEPOSIT the structure to ChemSpider and work from there. The system is flexible enough.

This is our first release of the Wikipedia Services so we welcome any and all feedback. It’s one more way we are giving back to the Wikipedia community for their service. The outcome for us will also be crowdsourced curation of ChemSpider…as Wikipedia articles are written we will clean up related structures on ChemSpider. Everyone wins.

By the way…check OUR structure for Domoic Acid with that one on ChemSpider. Does anyone know which is correct?


At ChemSpider we have built a system which can support the storage of structures, spectra, images, properties, and so on. In terms of supporting spectra we have noticed that people have been submitting spectral images (generally as JPEGs) and PDF files. So, rather than refuse these formats and force people to submit ONLY JCAMP spectra, we are now supporting additional forms of spectral data. These include images and PDF files. An example of PDF submission is shown here. Now, this opens interesting possibilities…we COULD allow deposition of Open Access PDF articles and association with chemical structures in a one to many relationship quite easily. Is it something we should do? We can extend it to many to many in the future. Feedback?


The Chemical Abstracts Service have announced their first foray into providing Public Domain data. Common Chemistry was announced at the ACS meeting and is now online for all to visit. From the “About Common Chemistry” webpage the site is defined as:

“This database contains the CAS Registry Number®, chemical names (both formal and common), molecular formulas, and structures or sequences for ~7800 chemicals of widespread general public interest. These substances are of global commercial use or importance and have been cited 1,000 or more times in the CAS databases. Examples of substances included are aspirin, biotin, benzoyl peroxide, and boric acid. The Common Chemistry database also includes all 118 elements of the Periodic Table, although not all of the elements may meet the 1,000 references threshold.

Links to Wikipedia records (when available) have been provided by the Wikipedia Chemicals WikiProject in collaboration with Chemical Abstracts Service.

You can quickly and easily confirm a chemical name, CAS Registry Number, or structure from this database of common, everyday chemicals.

You can search for substances in Common Chemistry by either their CAS Registry Number or by their chemical name. Chemical name searches can be by exact name if you have one or by name fragment. CAS Registry Number searches are exact search only. Consult the Help page for additional search tips and details.

This database will be updated periodically. Information such as Wikipedia links may be added on a more frequent basis as it becomes available.”

A search on Xanax or Aspirin produces a hit very quickly and the record example for Xanax is given here. The result is a validated CAS Number for Xanax, a list of chemical names and the chemical structure. You can compare that to the ChemSpider record for Xanax here. I personally prefer our structure images on ChemSpider. The comparison is below…ChemSpider is on the right. We have a lot more info on the ChemSpider website and a lot of it is validated by the community.


Of note is the fact that the CAS number provided with the CAS image is not separated by dashes. I had never seen that before.
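
A CAS Registry Number written without dashes can be put back into the usual form, because the last digit is always the check digit and the two digits before it form the middle group. The check digit itself is the weighted sum of the other digits (weights 1, 2, 3… counting from the right) modulo 10. Here is a small Python sketch. The function names are mine, and the example numbers are the well-known Registry Numbers for alprazolam (Xanax) and aspirin rather than anything copied from the Common Chemistry page.

```python
def format_cas(digits):
    """Turn '28981977' (or an already dashed string) into '28981-97-7'."""
    cleaned = digits.replace("-", "").strip()
    # A Registry Number has 2 to 7 leading digits, then 2 digits, then 1 check digit.
    if not cleaned.isdigit() or not 5 <= len(cleaned) <= 10:
        raise ValueError(f"not a CAS Registry Number: {digits!r}")
    return f"{cleaned[:-3]}-{cleaned[-3:-1]}-{cleaned[-1]}"


def cas_check_digit_ok(cas):
    """Validate the trailing check digit of a dashed or undashed number."""
    cleaned = cas.replace("-", "").strip()
    if not cleaned.isdigit() or len(cleaned) < 5:
        return False
    body, check = cleaned[:-1], int(cleaned[-1])
    # Rightmost body digit gets weight 1, the next one weight 2, and so on.
    total = sum(int(digit) * weight
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check


print(format_cas("28981977"), cas_check_digit_ok("28981977"))
print(format_cas("50782"), cas_check_digit_ok("50-78-2"))
```

For 28981-97-7 the weighted sum is 7 + 18 + 3 + 32 + 45 + 48 + 14 = 167, so the check digit is 7 as expected.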

We have already created the Data Source on ChemSpider in case anyone wants to connect up records from ChemSpider with CommonChemistry as they are curating our dataset. I’ve already linked a few records to CommonChemistry and maybe that will happen at Wikipedia too. Some basic checking on a few records shows that we have good validation on the registry numbers on ChemSpider already. I checked 5 records and we were correct in all cases. This is unlikely to hold true across the entire database but is a good sign.

It is unclear what licensing is on the data. I doubt it’s Open but that won’t matter to the majority of users…they are looking for a piece of information or to confirm something and are unlikely to be distracted by whether the data are Open or not…free access will suffice.

I haven’t tested the search capabilities too much and will do so in the next few days. I think that CAS should consider showing the lead of the Wikipedia article as well as linking out to other information. ChemSpider is a good one since we list articles, properties, analytical data etc. for a much enhanced record…see Cholesterol as an example. When the site is out of beta we’ll offer to produce ChemSpider IDs for the entire CommonChemistry database in case they want to link.

This website is an interesting shift for CAS and demonstrates a willingness to provide access to Public Domain data. It is a good start to open up the first 7800 structures with more than 1000 citations and there is much more that they can do in a similar vein, theoretically without threatening their business model. It’s going to be interesting to watch. Certainly CAS have helped in the validation of the CAS Numbers on Wikipedia and that has been an interesting project for all, with validated CAS numbers resulting. It has been a long and exacting project with many eyes poring over the data…all for the good of the community.


Some pleasant news was shared today on the CHMINF list server by Richard Kidd, the manager of informatics at RSC Publishing in Cambridge. Richard’s email to the community declared:

“We’ve just released a few new features based around the RSC Prospect project to enhance our articles. Some of these features are cosmetic and improve the look and feel, and some are deeply semantic and a bit specialist – but both take another step towards the future.

On the articles themselves, we now have mouseover popups of chemical structures, improved subject pages and compound pages, compound link through to ChemSpider, and better toolbar and page layout.” He also pointed to an example article. I went for a click around and found the article. I checked out the structures displayed by hovering over the numbers and then clicking through to the associated records page as shown here. Right there are the links to external resources for SEARCHING…the links do not mean that the structures are necessarily on ChemSpider but that they can be searched on ChemSpider.

I clicked on the search for that structure and did not find it unfortunately. However, I deposited it by pasting the InChI into our deposition entry box and converting it. It is now here. I then used the Add DOI function on that record to add the Author, Title and DOI information in about 15 seconds. It’s listed in the Supplementary Information now. A search on this compound will now find that record.

This is a very manual way of adding the information. The ideal is to deposit a datastream of structures and DOIs together for the “primary compounds”…I don’t want links to benzene for every record unless it is a primary compound in the article. We’ve already been doing that for the RSC Project Prospect backfile as described previously. I believe a solution for ongoing updates from RSC is feasible…


We continue to expand the ChemSpider Database with new depositions sourced from various collaborators. We are especially privileged to have received the RSC’s structure collection associated with their Project Prospect articles and have spent a couple of weeks working with the data prior to depositing onto ChemSpider. During the deposition process we have formed the link between the chemical structures and their articles via a DOI link. We have been able to deposit the title, an associated author and the DOI. In this way we have been able to link thousands of chemical structures to articles on the RSC website. On each record associated with an RSC article you will see both a link from the data source table and a link via DOI from the reference as shown here and in the figure below.

With the RSC depositions came many beautiful structures – highly symmetric, complex and just plain “pretty” to a chemist. But a high level of complexity also arrived with the collection. While many InChIs could be converted to their associated connection tables, the act of converting the InChIs could add additional stereochemistry, and structure cleaning could change stereochemistry, so this was a long, tedious and mostly manual process I’m afraid. Nevertheless, this is a wonderful addition to the ChemSpider database and our sincere thanks, on behalf of the community too, go to the Royal Society of Chemistry for sharing their data with us. The InChIs will be deposited into the InChI Resolver shortly.


There are a lot of conversations going on in the community about Open Data, specifically on the Open Knowledge Foundation email list. A recent blog post announces a working group on Open Data in Science and I’ve sent an email offering to provide input. Hosting ChemSpider has certainly allowed me to get engaged in frontline conversations regarding people’s willingness to share their data and what the perceived differences of Open vs Free are. I don’t have all the answers to all the questions but this area is a growing area of interest and concern for scientists and will likely remain in the spotlight for the foreseeable future. My judgment is that the majority of scientists do not care whether data are free or Open despite the potential repercussions in terms of reuse that this distinction will produce. Scientists do care about whether their own data are free or Open as soon as I discuss with them what the differences are (based on my own understanding of the differences!). See my previous post.

In this regard let’s chat about the Spectral Game for a moment. The Spectral Game is, even now, a resounding success and, in many ways, is surpassing our early expectations in terms of capability and usage. As of last week spectra had been viewed 20,261 times by 1305 unique visitors from 47 countries. That’s quite amazing for an online game for chemists that is proliferating through word of mouth (blogs, emails, RSS feeds) only. The Spectral Game is fed by Open Data and now has over 1000 spectra feeding into the game. These have been supplied by scientists willing to make their data Open and by myself, sourcing data and processing during long evenings in front of a good movie. Open Data has been the criterion we have used to feed the Spectral Game.

Recently Jean-Claude Bradley and I were talking about expanding the dataset on the Spectral Game to include more Mass Spectral, Infrared and UV-Vis data. The NIST Webbook is a rich source of such information and the data CAN be downloaded as JCAMP spectra for local processing. Due to the gracious nature of the people at NIST a request to allow us to download and use some of their data in the Spectral Game was greeted with full support. We have permission to do so and have already started the process. An example set of spectra can be found for Cholesterol (here) where there is now HNMR, CNMR, EI-MS, UV and IR data. The data were downloaded from the Webbook. The data are NOT Open Data however. If you visit the spectral pages you will see the ownership declared specifically. For the MS page it says:

Owner: NIST Mass Spectrometry Data Center Collection (C) 2007 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved.
NIST MS number: 67286

and for the IR spectra there are multiple sources:

Data compiled by: Coblentz Society, Inc.

* SOLID (MINERAL OIL MULL); Not specified, most likely a prism, grating, or hybrid spectrometer.; DIGITIZED BY NIST FROM HARD COPY; 4 cm-1 resolution
* SOLUTION 1% (CS2 FOR 2-15 microns, AND C2Cl4 FOR 5.5-7.2 microns)

Both NIST and the Coblentz Society generate revenue from some of their data collections despite the fact that these data on Webbook are offered free for viewing. The NIST MS database is the most widely distributed MS database in the world (I believe) and they also offer an IR database for sale. Other data are available. The Coblentz Society have been building their databases for decades and also offer them for sale. If you look at the prices of the Coblentz collection or the NIST IR collection they are one to two hundred dollars per collection. Maybe some rich uncle could write a check and release them into the world of Open Data for all to use? Otherwise the groups maintaining these collections deserve to have their costs covered at a minimum…which is probably what their revenue streams from these databases allow.

We are thankful to NIST for allowing us to upload spectra from the Webbook and we have started to upload data. It will only be a slice of the collection. We will flag the data on our side as NOT Open Data but “Can be accessed by Spectral Game”. In this way the game grows in its types of data but we respect the licenses of the contributors. Open Data vs Free Data vs Pragmatic Usability…maybe the OKF can participate in negotiating the release of such data sources into the public domain and, where appropriate, sourcing some funding to allow them to do it?
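
The flagging idea can be pictured as two independent switches on each spectrum record: one for whether the data are Open, one for whether the owner has allowed use in the game. The Python sketch below is purely illustrative. The class, the field names and the sample records are invented and do not reflect the ChemSpider database schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spectrum:
    compound: str
    technique: str
    source: str
    open_data: bool      # free to reuse and redistribute
    game_allowed: bool   # owner has permitted use inside the Spectral Game


def usable_in_game(spectra):
    """Open Data is always usable; other data only with explicit permission."""
    return [s for s in spectra if s.open_data or s.game_allowed]


def redistributable(spectra):
    """Only Open Data may be passed on to third parties."""
    return [s for s in spectra if s.open_data]


# Invented sample records for illustration.
collection = [
    Spectrum("cholesterol", "HNMR", "community deposition", True, True),
    Spectrum("cholesterol", "EI-MS", "NIST Webbook", False, True),
    Spectrum("cholesterol", "IR", "unlicensed upload", False, False),
]

print([s.technique for s in usable_in_game(collection)])
print([s.technique for s in redistributable(collection)])
```

Keeping the two flags separate means a permitted but non-Open spectrum can feed the game without ever appearing in an Open Data export.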


There are some interesting articles showing up on ChemSpider from across the blogosphere. We have just added to our list of high priorities to generate an RSS feed of structures, short descriptions and ChemSpider IDs so that anyone can access them. When we add new descriptions we will add snippets to the RSS feed.

New Articles include:

Teen Chemist and Splenda

A Discussion about the Synthesis of Spirangien A from the TotallySynthetic Blog by Paul Docherty

A Discussion about the Synthesis of Omaezakianol from the TotallySynthetic Blog by Paul Docherty


We’ve been working with Jean-Claude Bradley and his Open Notebook Solubility Challenge group to assist where we can. This has included enhancing some of our services (though there is more work to be done…), populating data into ChemSpider and, now, linking us up to the Data Tables built by Andy Lang (of The Spectral Game fame…we’re quite a team).

The Open Notebook Solubility Challenge is described here. The present list of compounds for which we have created the integration to be described below is here. When you open that link you’ll see the first bunch…notice the little icons showing patent links, Wikipedia links and the presence of spectra on those records.

What we have done now is deposit the links into the Data Source tables for these compounds, providing a direct link to the ONS tables. They can be viewed WITHOUT leaving the site simply by hovering over the link…OR you can click on the link to view the data directly. An example of the link view is shown below. To find these tables simply look up the Open Notebook Solubility Challenge data source in the table.



I was a generator of analytical data for a number of years. I am an analytical scientist by training…NMR jock to be precise (if you are interested in my work see here). However, during my tenure at Kodak our team ran a walkup laboratory and we generated a lot of data. In those days it was a few 10s of gigabytes per year. We generated NMR, IR, Chrom and LC-MS data…lots of it. At that time we had our own NMR processing software that was written in house. That was replaced later by commercial desktop processing software. We used instrument-based data processing and produced simple reports at the end of the work.

When I left Kodak for ACD/Labs I led the development of the SpecManager product extending it from 1D NMR processing software only into IR processing, chromatography processing, 2D NMR processing, LC-MS processing and into an even broader set of analytical techniques. By the time I left ACD/Labs, EACH of these areas had their own product managers – NMR, MS, UVIR and Chrom….4 in total.  My focus with these tools was always based on structure identification and qualitative data analysis.

LC-MS is one of the most useful, sensitive, high-throughput technologies available in the majority of laboratories and the data generated can be in the 100s of gigabytes every few months. High-throughput quantitative analysis software tools have generally been the domain of the hardware vendors but one person that I watched with interest as he took on the domain of high throughput quantitative analysis was Joe Simpkins. Joe was a co-founder of the Opans CRO lab and they needed BETTER software than was available from the vendors. So, Joe built it. Over the next few years the software was optimized for their lab, then the labs of their users. Ultimately their lab software was sold to their customers as it offered possibilities that the hardware vendors couldn’t address in terms of usability and specifically vendor neutrality. The software business grew and now Joe Simpkins has split off a company called Virscidian and has already added to the team. Components of his Analytical Studio software are already licensed to Agilent and it only makes sense that the other vendors will be looking at the Virscidian capabilities moving forwards.

Virscidian is one of those small companies I like to watch. They are small, nimble, highly skilled, networked in the industry and focused on providing the optimal solutions for the user. (Also they are my friends so I watch very closely…and help where I can). It just makes sense to connect ChemSpider to their solutions as we have to many of the other MS vendors….watch this space

I’ve been in a discussion with Mat Todd for a while…Mat is involved with The Synaptic Leap and is a supporter of what we are trying to do with ChemSpider. What we have built is a platform where registered users can post their own structures, spectra, images and so on. We have a number of chemists contributing to the database and as a result there are synthesis protocols showing up on ChemSpider, new structures added regularly and NMR spectra specifically being added rather frequently. ChemSpider is growing as a result of community contributions and improving in quality as a result of community curation. But…we want even more participation…if possible.

Now, I rent my DVDs from Netflix and buy my books and electronics from Amazon. I read the reviews of other people who have purchased books, CDs, electronics and rented DVDs. I am influenced by the masses. Doh. Now…I will admit that I DON’T contribute to those reviews. I’ve never written about my favorite movies (but do score them). I don’t review electronics or books but I do leave feedback on eBay. I judge I am like most other people. We have thousands of users per day that use ChemSpider now…the majority search, use and hopefully get value from our humble offering. Most people DON’T leave tracks…don’t add comments, don’t deposit data, don’t add information. Some of these people however do maintain their own blogs, they leave comments on other people’s blogs or happen across useful information. We want to know about interesting articles, blog posts and commentaries that might be of value to associate with structures on ChemSpider.

During a recent email exchange with Mat Todd we exchanged, at a very basic level, ideas about scraping information from sites and making it available on ChemSpider rather than having people deposit the data. The question was how to automatically scrape appropriate data from blogs etc. and maybe use tagging so that some robot could locate the data, scrape it and deposit it. Theoretically yes, this would work. But, honestly, I believe that developing some fancy technologies which might not bring benefit isn’t the way to go. Been there, done that. I realized there is a simpler way to do this AND, if there is a critical mass, then we will develop some fancy technologies (we’re good at that!).

So…welcome Google Alerts.


Google Alerts has been in beta for…hmmm…a while. Every day I get alerts hitting my email about things I care about. That’s how I find out about the latest comments about ChemSpider on blogs that I don’t have listed in my reader. So, a simple alert on “loadtochemspider” has been set up. If anyone tags a blog, wiki, “something” with that tag then we will check it out and if it fits we will add it to ChemSpider. Simple technology…1 minute to set up. No massive technology development…just eyeballs and copy-paste.

So…let’s try it. If you see something of interest add loadtochemspider as a tag, into a comment, into the original post. Leave the rest to us…


There has been encouragement that we look at Freebase as an additional online resource to integrate with. In terms of chemical entities some of the Wikipedia structure collection has made its way onto Freebase and has been enhanced to include InChIs and SMILES. It’s not clear to me whether the InChIs on Freebase are all obtained FROM Wikipedia or were layered onto Freebase later. So, I approached the Freebase group and asked if they could provide me with a dump of the InChI strings and the SMILES strings together with the associated Freebase IDs and the chemical names. In this way we would be able to generate SDF files for depositions and end up with the structures (converted from InChIs and SMILES) as well as the associated chemical names and Freebase IDs. Simple idea, right?

So, we converted InChIs and SMILES and generated the depositions. Freebase links now show up in the Data Sources section and, if you put your cursor over the GUID, you see an image of the page and can click through to the record on Freebase. See the image above. The Freebase GUID for Benzene is here: #9202a8c04000641f800000000000ac66

All seems well. I have a question though…I look at a structure like Dapagliflozin on Wikipedia here and see full stereochemistry explicitly defined in the name and in the image. However, on Freebase I note that the stereochemistry is NOT explicitly defined in the InChI. The InChI is:  1/C21H25ClO6/c1-2-27-15-6-3-12(4-7-15)9-14-10-13(5-8-16(14)22)21-20(26)19(25)18(24)17(11-23)28-21/h3-8,10,17-21,23-26H,2,9,11H2,1H3/t17?,18?,19?,20?,21-/m0/s1
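
The missing stereochemistry can be read straight off the InChI: the layer that starts with “t” lists each stereocentre by atom number followed by “+” or “-” when it is defined and “?” when it is not. Here is a short Python sketch that pulls this out. It is a plain string parse and not a replacement for the InChI software. It assumes a single-component InChI, and the function names are mine.

```python
def stereo_centres(inchi):
    """Map atom number -> parity mark ('+', '-', '?' or 'u') from the /t layer."""
    for layer in inchi.split("/")[1:]:      # skip the version prefix
        if layer.startswith("t"):
            return {int(token[:-1]): token[-1]
                    for token in layer[1:].split(",")}
    return {}                               # no tetrahedral stereo layer at all


def undefined_centres(inchi):
    """Atom numbers whose stereochemistry is flagged unknown or undefined."""
    return sorted(atom for atom, mark in stereo_centres(inchi).items()
                  if mark in "?u")


# The Freebase InChI for Dapagliflozin quoted above.
freebase_inchi = (
    "1/C21H25ClO6/c1-2-27-15-6-3-12(4-7-15)9-14-10-13(5-8-16(14)22)"
    "21-20(26)19(25)18(24)17(11-23)28-21/h3-8,10,17-21,23-26H,2,9,11H2,1H3"
    "/t17?,18?,19?,20?,21-/m0/s1"
)

print(stereo_centres(freebase_inchi))
print(undefined_centres(freebase_inchi))
```

For this string four of the five centres (atoms 17 to 20) come back as undefined and only atom 21 carries a parity.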

So, when we take the InChIs and the chemical names, convert the InChIs and deposit the chemical structures we end up with a “destruction” of the curation work we have done on ChemSpider. We end up with TWO structures for Dapagliflozin, not one (See below)


And now we need to start the curation efforts AGAIN to clean out misassociations of names and structures. So, what we are going to do is delete the deposition of Freebase structures and redeposit without the chemical names. In this case the outlinks to Freebase will be in place but the structures will not be found by a name search UNLESS the Freebase GUID is associated with an already curated name-structure pair that is coincident with the Freebase name.

I can say that the Freebase team were a pleasure to work with and, in theory, once the Wikipedia curation project is finished the SMILES and InChIs on Freebase will be correct and such linkages back to Freebase will be easier, and correct. In the meantime I am interested in where the Freebase SMILES and InChIs are coming from (I think a lot of them are from Wikipedia but am not sure) and we are going to make certain on our side that we remove the chemical names so as to not decrease the quality of our curation efforts.

I’ve posted previously about embedding structure images and spectra into blogs and webpages. One of the side effects of this is that for structure images specifically the ChemSpider record is linked back to the webpage that the structure is embedded into. Structures are embedded in various places now into wikis and blogs. An example of 11 embedded structures is shown here.

When Arvin Moser wrote in his blog about Letrozole and embedded the structure image into his blog post, a link BACK to his blog post was created in the Data Source table. See the image below.


With this capability, as more people embed structures from ChemSpider into their online pages/blogs more of the internet will become structure searchable and ultimately linked. It does not require adding InChIs to webpages (though that is encouraged for indexing by search engines).

(Caveat: The system is not yet optimal and we are working on filtering out comments on blogs that presently get added as additional links. All “doubles” will be filtered out later)
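
One possible way to filter out the “doubles” is sketched below in Python. This is only an illustration, not how ChemSpider does it: it assumes that links picked up from blog comments differ from the post URL only by a fragment (such as a comment anchor), by a trailing slash or by the case of the host name, and the example URLs are made up.

```python
from urllib.parse import urlsplit, urlunsplit


def normalise(url):
    """Reduce a URL to a comparison key: lower-case scheme and host,
    no trailing slash on the path, no fragment (assumed comment anchor)."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


def unique_backlinks(urls):
    """Keep the first occurrence of each normalised URL, preserving order."""
    seen = set()
    result = []
    for url in urls:
        key = normalise(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


if __name__ == "__main__":
    links = [
        "http://example.org/blog/letrozole/",
        "http://example.org/blog/letrozole/#comment-12",
        "http://EXAMPLE.org/blog/letrozole",
        "http://example.org/blog/other",
    ]
    for link in unique_backlinks(links):
        print(link)
```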

I gave my talk yesterday at CShals 2009, the conference on Semantics in Healthcare and Life Sciences. It was a great meeting for me (hindered by dismal access to wireless internet as a result of Marriott’s wish to make more money from the conference organizers. They should be ashamed of themselves in this day and age!) as it was not about Chemistry, not about spectroscopy, not even about Open Data, Open Access and Open Source. It was about Semantics. I learned a lot and got to hear Tim Berners-Lee talk about where the semantic web is, where it can go, and how it can be disruptive in a good way while NOT being too disruptive to layer onto what already exists. The best part of the meeting for me was the clear passion for the InChI, as well as a lot of acknowledgement that it is not perfect and cannot presently compete with molfiles, commercial systems, CAS Numbers and so on. But people are optimistic, waiting and supportive.

Overnight I inserted a lot more information about InChIs and how they can be useful, where some of the limitations are presently, and how the StdInChI has now added a new level of complexity on one hand and simplification on the other. There have already been a number of requests for a copy of the talk so it is up on Slideshare for now (and linked below). I’ll do a voice over in the next few days and upload to Scivee. I unveiled the first version of the InChI Resolver at the conference and showed it to a couple of people. The general consensus is we are heading in the right direction. The timing on this conference was good because the intention is to layer on RDF before we release at the ACS, time allowing.

I’ve previously posted about the work going on regarding the NMR Game…now morphed to the spectral game and described in detail by JC Bradley. We’ve been working hard to increase the number of spectra available as part of the game (now in the 100s of spectra!) and Andy has been working hard to improve the flow of data. The original structure images have been replaced with ChemSpider structure images and we have delivered a web service to allow Andy to continue to update the spectral collection as more data are added to the database.

When users see issues with the spectra they get to leave comments regarding their observations. This can be very valuable for us to curate the spectral data. This will allow us to perform game-based crowdsourcing of the spectral data and the feedback is already of value.

We have about another 30 spectra to add to the present collection of spectral Open Data and then we’ll take a break and I’ll be approaching the spectrometer vendors and a few other friends to see whether they have any data to contribute to the game. We are already considering adding the ability to add a “Company Logo” to be associated with a spectrum so that the vendors/contributors get fair recognition for their contribution to the game. If you are interested in providing data we will upload it for you. Contact us at infoATchemspiderDOTcom.

JC Bradley has now uploaded a short tutorial to YouTube regarding how to play the game and I have embedded it below. JC’s also announced a prize for the best player. Go test your skills.

Jean-Claude Bradley has recently posted about an NMR Game running on Second Life. Read his blog for details but I excerpt some of the comments here:

“Andy and I brainstormed some new chemistry games that we could introduce to Second Life to leverage our recent tools. One of the applications is the NMR game. By combining the orac molecule rezzer, the SL spectral viewing tool and ChemSpider Open Data spectra I think we have a pretty good game.

The idea is simple: click on the molecule that is represented by the spectrum. If it is correct you get 2 points and get another spectrum. You lose a point by clicking on an incorrect molecule. After going through all the spectra your score gets posted on the web to a top10 list. For equal scores the best time takes it.”
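
The scoring rules in the excerpt are simple enough to sketch in a few lines of Python. This is only my reading of the rules as quoted (2 points for a correct click, minus 1 for a wrong one, top 10 ordered by score and then by fastest time). The names below are invented for the illustration and have nothing to do with the actual Second Life scripts.

```python
from dataclasses import dataclass

CORRECT_POINTS = 2  # points gained for clicking the right molecule
WRONG_PENALTY = 1   # points lost for clicking a wrong molecule


@dataclass
class Result:
    player: str
    score: int
    seconds: float  # total time taken to go through all the spectra


def score_clicks(clicks):
    """clicks: iterable of booleans, True for a click on the correct molecule."""
    return sum(CORRECT_POINTS if hit else -WRONG_PENALTY for hit in clicks)


def top10(results):
    """Highest score first; for equal scores the best (shortest) time wins."""
    return sorted(results, key=lambda r: (-r.score, r.seconds))[:10]


if __name__ == "__main__":
    results = [
        Result("alice", score_clicks([True, False, True, True]), 40.0),
        Result("bob", score_clicks([True, True, False, True]), 30.0),
        Result("carol", score_clicks([True, True, True, False, True]), 90.0),
    ]
    for place, result in enumerate(top10(results), start=1):
        print(place, result.player, result.score, result.seconds)
```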

So, here at ChemSpider we are delivering spectra as Open Data to help with the game. And we’re happy to do so. It’s always been our intention to have ChemSpider provide value like this. ANY registered user can upload spectra to the ChemSpider website. The details are outlined here (I just noticed the interface has changed since I wrote that but you should still be able to follow the process). We need the spectra to be in JCAMP format and if you want them to be available for the game, and for people to download, they MUST be declared as Open Data.

Right now we have 100s of spectra. You can find them here. But we need more. Much more! We’d like you to contribute them. If you don’t want to upload them yourself then contact us directly and we will process and upload for you. We need the data and the name/structure of the associated molecule.

And how will the game be used on these spectra? The game will be used to “curate and validate” the spectra. As the game is being played a score of how many people say it is correct will be kept. And of course what is wrong. Based on these scores our curators will be directed to “problematic spectra” for their attention. This is true crowdsourcing and a great way to do spectral validation.
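
As a rough illustration of how such scores could point curators at “problematic spectra”, here is a small Python sketch. The thresholds (at least 5 answers, no more than half of them correct) and the spectrum identifiers are assumptions made up for the example, not settings used by the game.

```python
from collections import Counter


def flag_problematic(answers, min_answers=5, max_correct_rate=0.5):
    """answers: iterable of (spectrum_id, was_correct) pairs from game play.

    Returns (spectrum_id, correct_rate) pairs, worst first, for spectra that
    have enough answers and a low share of correct ones. Both thresholds are
    illustrative assumptions.
    """
    total = Counter()
    correct = Counter()
    for spectrum_id, was_correct in answers:
        total[spectrum_id] += 1
        if was_correct:
            correct[spectrum_id] += 1

    flagged = []
    for spectrum_id, count in total.items():
        rate = correct[spectrum_id] / count
        if count >= min_answers and rate <= max_correct_rate:
            flagged.append((spectrum_id, rate))
    return sorted(flagged, key=lambda pair: pair[1])


if __name__ == "__main__":
    answers = (
        [("ir-1", True)] * 5
        + [("nmr-2", False)] * 4
        + [("nmr-2", True)]
        + [("ms-3", False)] * 2
    )
    print(flag_problematic(answers))
```

A spectrum that most players get wrong is not necessarily a bad spectrum (the structure may simply be a hard one), which is why the output is a list for a curator to look at rather than an automatic rejection.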

We would like the spectral collection to grow and welcome contributions from anyone. They do NOT have to be just NMR. They can be IR, MS, Raman etc too. Ultimately a Spectral game will be unveiled. Please consider ChemSpider as a repository for your data as it will benefit the community of chemists and, in particular, the process of teaching students and allowing them to “game their way” through the process. Watch where this goes…it’s VERY interesting to consider how it can improve…there is an NMR game website in development so you won’t have to go just to Second Life.

Following on from my recent post about “Why are structures like YouTube Videos?” I am now asking the same question about spectra.

The answer is simple. When people have deposited data as OPEN DATA on ChemSpider we are now providing the ability to embed the spectral data and display at other sites. This is different in that we are not just showing images but real live spectra in the JSpecView Java Applet so Java must be installed. Thanks to Cameron Neylon for asking the question about whether we could provide the service. Glad to help…

If all is well you should see an IR spectrum associated with the ChemSpider record here. In order to EMBED spectra simply log in to ChemSpider, find an Open Data spectrum of interest (you could browse) and then click on EMBED (left-hand corner below the spectral image). Do a left click to see additional features of JSpecView. We DO have some minor work to do with spectral plot reversal and improving the zoom display but we’re getting there. Enjoy.

A few months ago I met with Adam Azman in Chapel Hill to discuss how the names in our ChemSpider database could be used to expand his Chemical Dictionary. It seemed that we would be sitting on a treasure trove of name fragments that could help him in his efforts. So, we supplied Adam with 1.3 million identifiers and Adam has worked for the last few months to generate his Chemical Dictionary. He extracted over 100,000 name fragments from our collection as he has described in his blogpost here.

Extracted from Adam’s blog are his so-called Administrivia: “The dictionary is licensed under the Creative Commons Attribution 3.0 License. … The dictionary is compatible for Microsoft Office (Windows or Mac), and Open Office (Windows or Linux). The install file includes instructions for upgrading old versions and installing it for the first time. The dictionary should be useful for all chemists. However, I am an organic chemist. Thus, the dictionary was created from an organic chemist’s mindset. It will probably be most useful for organic chemists.”

Adam has explained in detail how he did the work. I encourage you to read his post to fully understand the nature of the work and how much heavy-lifting he actually did.  It’s been a pleasure to help Adam and the community by supplying our own form of a “dictionary” to him for his particular treatment. It took a few hours of work from our side and months of hard work from him. I encourage you to take advantage of his efforts…if you are a chemist this is a real gift for the season. The dictionary can be downloaded from our site here.

Now I want you to consider timing. We are working hard on our ChemMantis project, a system for entity extraction and document markup. Part of this includes the generation of dictionaries for finding chemical names. We’ve already expanded our chemical dictionary using the database of identifiers from ChemSpider but for those of you working with other systems such as OSCAR3 or the other commercial markup systems dependent on chemical dictionaries you will likely find Adam’s contribution significant. Enjoy.

When I was at the Scifoo meeting earlier this year I got very excited about the Google Datasets project. I must admit that my creative spirit and need to hang out with innovators has, for years, called out to me to “Take Chemistry to Google”. When I left SciFoo I left with a hard drive to put data onto. I had great ideas about using the ChemSpider dataset of InChIs and CSIDs to connect chemists. I had hoped to put the data into the Google Datasets Project but actually work with Google to “do something” with them other than just host them for other people to download. If you do a search on Google today (at least if I do) I get the following result…let me know what you get! I’ll admit my naivety on this but maybe there is a limitation of hits shown etc (David Bradley..any ideas?)

Considering that part of the story for InChI, and I have given the story many times myself (!), is that the internet can be made structure searchable by InChI, this is a limited result set, especially considering that there are 21.5 million of them on ChemSpider. Then there’s PubChem, Drugbank, and so many more.

My hope was that Google might be interested in connecting Google Scholar to structure searching and work with us to enable it. Couldn’t get anyone interested. I was in California for a week and asked whether I could stop by and talk about ChemSpider and how we could help Google with Chemistry – no interest. Overall I will say that I couldn’t get any traction with Google about Chemistry and it’s a great shame. I’ve had similar things said by others. One guy who used to be at Google who WAS interested in Chemistry was Simon Quellen-Field who runs the Sci-Toys website. I think Google needed an advocate for Chemistry in their Datasets Team so that it could have been more than just hosting data but rather doing something WITH the data for the community.

I’m disappointed that the project has come to an end since I was hopeful for its purpose and its impact. I think that someone else will pick it up. If not, then they should…

The letter said…

“Thank you very much for trying out Google Research Datasets, providing interesting datasets, and giving us extremely useful feedback. We have learned a lot about the issues facing researchers and dataset producers from this testing period.

As you know, Google is a company that promotes experimentation with innovative new products and services. At the same time, we have to carefully balance that with ensuring that our resources are used in the most effective possible way to bring maximum value to our users.

It has been a difficult decision, but we have decided not to continue work on Google Research Datasets, but to instead focus our efforts on other activities such as Google Scholar, our Research Programs, and publishing papers about research here at Google.

The Google Research Datasets service will remain active until the end of January 2009 during which time any datasets may be downloaded. For those datasets that are impractical to download, we will also happily provide interested users with a copy via hard drive shipment.

Once again, we’d like to thank you for helping us test Google Research Datasets, it’s been a very useful experience, and we look forward to finding new ways to provide you with useful services in the future.”

Collaborative Drug Discovery is gaining increasing traction in terms of providing a collaborative platform for scientists to work together on Drug Discovery. They provide “a web-based software platform to organize preclinical research data to help scientists advance new drug candidates more effectively.” Certainly the support of the Gates Foundation and their investment of almost $1.9M validates their approach and the importance of their work: Collaborative Drug Discovery Receives Gates Foundation Grant to Support the Development of a Database to Accelerate Discovery of New Therapies Against Tuberculosis.

I have started to work with their platform and compliment them on the ease-of-use, the aesthetics and the intention of their work. I am not managing any of my own data on the platform yet but over the next couple of weeks hope to start actively managing some data that I am collaborating on with one of our editorial board. In order to execute on their mission CDD has to provide privacy and security for certain data but in parallel has made available public access data (Click on the thumbnail for a view of their present public access data). There will be those who will likely criticize that all of their data are not Open but, as I have explained myself previously, it is the decision of the depositor to declare data as Open Data. In the case of CDD their business model, the wishes of their users and the very nature of the drug discovery that their users are engaged in demand that they offer a secure and private platform in parallel to their Public Access offerings. Their approach works.

We have been working with CDD to allow their users to access ChemSpider directly from within the CDD platform. This is in place now and has been discussed in a recent blog post. The integration in their interface is clear. See the entire blogpost for more details. We look forward to working with CDD in the future. Their approach is fresh, innovative and gaining a lot of support from very significant names in the arena of drug discovery.

The ChemSpider Journal of Chemistry is an experiment. We intend to demonstrate how modern web technologies can be used to dramatically enhance the type of information that can be communicated using web-based tools over standard online publishing approaches. There are some publishers who are working on delivering additional value to their readers by providing enhanced HTML articles and adding information to their articles such as InChIs to allow structure-based queries online. These publishers include the Royal Society of Chemistry with their Project Prospect and the Nature Publishing Group with their Nature Chemical Biology papers. The majority of articles presented by the commercial publishers are not of a “just-in-time” nature and are delayed by the “processes of publishing”. They are generally fairly lengthy documents and report successful results. They are commonly peer-reviewed and have endured a significant timeline from initial writing to submission, publisher’s processing, review and publication.

Science is, however, being reported in near real-time under Open Notebook Science (ONS) initiatives. We believe that an online journal can co-exist between the immediate nature of blogging and wiki tools hosting ONS efforts and the more standard processes of the scientific publishers. Some publishers are already allowing online and open peer-review whereby readers provide their feedback to the author in a public forum. Papers can enter a period of online peer review and commentary during which readers provide feedback to the author(s). As a result of this process the authors can engage in public discourse with the commentators and issue a final form of the manuscript. We will offer similar facilities.

We invite manuscripts from anybody interested in exposing their work in the field of chemistry and intersecting fields. In general we expect these communications to be 1500-3000 words in length but there is no limit. We encourage submissions relating to chemistry, biochemistry and chemical biology; regarding synthesis, the analytical sciences and computational chemistry; as research, as commentaries and as questions to the community. Provided the submission relates to the domain of the chemical sciences we will find a place for it within the ChemSpider Journal of Chemistry. We encourage submissions from academia and industry, from students and senior scientists, from individuals and teams, for successful research or failed experiments. We encourage submitters to challenge us to host your manuscripts in a manner which most clearly communicates your science. This may include hosting various forms of data made available to the public as Open Data, providing visualization tools for the display of molecules, spectra, images and videos. We intend to not be constrained and to make full use of web-based tools available today and coming online tomorrow.

All articles will be Open Access articles. We will abide by the Budapest Open Access Initiative which declares “By ‘open access’ to this literature, we mean its free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself. The only constraint on reproduction and distribution, and the only role for copyright in this domain, should be to give authors control over the integrity of their work and the right to be properly acknowledged and cited.” Authors must agree to allow unrestricted reading, downloading, distribution, printing, searching and linking to the published work.

Over the past 2 years we believe we have demonstrated our passion for public science, our willingness to serve the community, and integrity in our actions. We hope that the ChemSpider Journal of Chemistry will provide a vehicle to all scientists operating within the domain of the chemical sciences to expose their work and interests to the community. We intend to deliver a facile process of submission and superior tools for delivery. We welcome your support and look forward to expanding the communication of chemistry.

I’ve blogged previously about the honor of ChemSpider starting to be indexed by the Chemical Abstracts Service. I take this as a blessing of the value we are offering to the community. Interestingly this has resulted in some confusion within the community. Now when people are finding structures of interest in the CAS registry they would like to hop over to our site for details. Based on what I’ve heard/seen that’s not so easy…there is no ChemSpider ID number to use and no link into our database. Hrmmpphhh..a little user friendliness would go a long way.

Anyhow, today a comment on one of the ChemSpider records caught my eye. “…why do these records not have a CAS RN associated with them? CAS acknowledges you...”. Well the answer is simple to that. We don’t receive CAS numbers. I am not aware that ANYONE who has chemicals indexed in the CAS registry receives CAS numbers back as an outcome of being indexed. Am I right? I think CAS numbers have to be paid for when you register a compound. And if they are registered for you so be it. No cost, but you don’t get the numbers for your own usage. Anybody know different?

I have received some interesting comments off-blog (consistent with the way I said blogs work for me) regarding how CAS’ indexing of us is helping people find information through us. For example, for the record at which the comment was made (CSID 9727274) the user could find their way out to PubChem, where we sourced the Thomson Pharma information that is also on the record. Hmmm…CAS is indexing PubChem via ChemSpider. That’s an interesting state of affairs considering the historical collisions regarding ACS and PubChem.

There have also been examples where people searching for IP issues about chemical structures have ended up on ChemSpider through our indexing on CAS and then from us out to the SureChem database of patents. The searchers have not been able to find information about certain structures in the CAS patent database but have found it in the registry and from there via ChemSpider out to the SureChem patent database. I’m not sure I fully understand the segregation of the data on the registry and the patent database since I don’t have access to CAS tools but I think it’s great that online resources such as ChemSpider, PubChem and SureChem are all now being brought together through the CAS registry and indexing processes. This is very beneficial to the community.

I am honored to be invited to join the Editorial Board of the Journal of Cheminformatics, a new journal from Chemistry Central. Over the years I’ve co-authored a lot of papers in ACS’ JCIM, previously JCICS (my list of publications is here) and my co-authors and I have always wondered about when another journal of the nature of JCIM would show up in the world of Open Access. Now it’s coming and I’m excited to be involved. We are writing a submission to the journal at present. The editors-in-chief and editorial board are listed below. I know the majority of these people personally and believe that this group will ensure the highest standards for the journal.

Editors-in-Chief
Christoph Steinbeck (United Kingdom)
David J. Wild (United States)

Editorial Board
Jean-Claude Bradley (United States)
Curt Breneman (United States)
Robert D. Clark (United States)
Jeremy Frey (United Kingdom)
Johann Gasteiger (Germany)
Val Gillet (United Kingdom)
Robert Glen (United Kingdom)
Jonathan Goodman (United Kingdom)
Rajarshi Guha (United States)
Mic Lajiness (United States)
Yvonne Martin (United States)
Peter Murray Rust (United Kingdom)
Alexander Tropsha (United States)
Wendy Warr (United Kingdom)
Ian Watson (United States)
Peter Willett (United Kingdom)
Antony Williams (United States)