Archive for the Community Building Category

The ChEMBL blog is an excellent read, and if you’re interested in “Open Access Drug Discovery and Medicinal Chemistry Data” this is one for you. We are shamelessly, and WITH permission, taking some of the blog posts about New Drug Approvals and adding them to the descriptions on ChemSpider. Some examples are here and here. To date, in every case where we have added a description, the compound itself was already on ChemSpider under the correct name. That’s good news given some of our subjective measures of coverage for the database.


The Spectral Game at www.spectralgame.com is powered by chemical structures and spectra from ChemSpider. A provisional version of our manuscript describing the game is now online at the Journal of Cheminformatics:

The Spectral Game: leveraging Open Data and crowdsourcing for education

Jean-Claude Bradley, Robert J Lancashire, Andrew SID Lang and Antony J Williams

Journal of Cheminformatics 2009, 1:9 doi:10.1186/1758-2946-1-9

 
Published: 26 June 2009

Abstract (provisional)

We report on the implementation of the Spectral Game, a web-based game where players try to match molecules to various forms of interactive spectra including 1D/2D NMR, Mass Spectrometry and Infrared spectra. Each correct selection earns the player one point and play continues until the player supplies an incorrect answer. The game is usually played using a web browser interface, although a version has been developed in the virtual 3D environment of Second Life. Spectra uploaded as Open Data to ChemSpider in JCAMP-DX format are used for the problem sets together with structures extracted from the website. The spectra are displayed using JSpecView, an Open Source spectrum viewing applet which affords zooming and integration. The application of the game to the teaching of proton NMR spectroscopy in an undergraduate organic chemistry class and a 2D Spectrum Viewer are also presented.
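The scoring rule in the abstract is simple enough to sketch in a few lines of Python. This is only an illustration of the rule as stated (one point per correct match, play stops at the first miss); the function name and the list-of-booleans input are assumptions, not the game's actual code.

```python
from typing import Iterable


def spectral_game_score(answers: Iterable[bool]) -> int:
    """Count correct answers up to, but not including, the first wrong one."""
    score = 0
    for correct in answers:
        if not correct:
            break  # the first incorrect answer ends the game
        score += 1
    return score


# Three correct matches, then a miss: the later correct answer never counts.
print(spectral_game_score([True, True, True, False, True]))
```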


SciFoo is just a few weeks away and I was reviewing the list of attendees this evening to see who I would be sharing space with.

I am especially looking forward to spending time with Andrew Lang, one of the brains behind the Spectral Game. We’ve spoken on the phone, exchanged many emails and worked on a couple of projects together. But we get to meet at SciFoo!

Last time I was at SciFoo I spent time talking with Cameron Neylon and JC Bradley about Open Notebook Science. At that time I had lots of ideas about what we could do to support Open Notebook Science. We have actually done quite well, but at that time we were severely resource constrained. Things are a little different now that we have been acquired by the RSC, and I am looking forward to talking about what’s necessary and possible now.

Nicko Goncharoff from SureChem will be there. Nicko and I have spent a lot of time together over the past few years, mostly by phone and over email as we worked to integrate SureChem into ChemSpider and use their software development kit under our ChemMantis semantic chemistry markup tool. It’s always good to see him.

Other people I hope to spend some time talking to: Peter Murray-Rust from the University of Cambridge, Timo Hannay, Alf Eaton and Terry Sheppard from the Nature Publishing Group and Theodore Gray.


I have set up a LinkedIn Users and Advisors group today and welcome any LinkedIn users interested in ChemSpider to join the group and stay informed about our activities on ChemSpider. I hope that it also provides a useful environment for discussion and collaboration around ChemSpider.

The ChemSpider LinkedIn Group can be accessed here.


I have given a number of talks regarding ChemSpider over the past few months and generally comment “ChemSpider hosts almost 21.5 million unique chemical entities from over 200 data sources.” As of today it is over 21.5 million chemical entities. We have deposited data from a number of new contributors of late, many of them smaller chemical vendors such as Bridge Organics and ExtraSynthese. However, we recently crossed the 21.5 million mark because we have started to take advantage of the eMolecules dataset, which is made available as a downloadable set. There are over 5 million structures in the dataset.

Many of these, but not all, deduplicate against records already in the ChemSpider database. The 21.5 millionth structure links to this record on eMolecules as shown below.

[Figure: the eMolecules record for the 21.5 millionth structure]

When data are added to ChemSpider we automatically generate SMILES, InChIs, molecular weight (MW), molecular formula (MF) and a series of predicted physicochemical properties. That applies to the structures that are new from eMolecules. In many cases, however, eMolecules is simply one more data source among many, and information such as spectra, Wikipedia links and experimental data is already integrated on the existing record. In those cases eMolecules can still help you source a vendor for the material, which is their strength.
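To make the deduplication step concrete, here is a minimal sketch in Python. It assumes records are matched on their InChIKey and that each record simply tracks the set of data sources pointing at it; the function name and the in-memory dictionary are hypothetical stand-ins, not how the ChemSpider deposition pipeline is actually implemented.

```python
from typing import Dict, List, Set


def deposit(database: Dict[str, Set[str]], source: str, inchikeys: List[str]) -> int:
    """Merge a vendor's structures into the database and return how many were new.

    Structures already present only gain one more data source; unseen
    structures create a new record. Matching on InChIKey is an assumption.
    """
    new_structures = 0
    for key in inchikeys:
        if key not in database:
            database[key] = set()
            new_structures += 1
        database[key].add(source)
    return new_structures


# Aspirin is already in the toy database; caffeine is not.
db = {"BSYNRYMUTXBXSQ-UHFFFAOYSA-N": {"Wikipedia"}}
added = deposit(
    db,
    "eMolecules",
    ["BSYNRYMUTXBXSQ-UHFFFAOYSA-N", "RYYVLZVUVIJVGH-UHFFFAOYSA-N"],
)
print(added, sorted(db["BSYNRYMUTXBXSQ-UHFFFAOYSA-N"]))
```

Only the second structure counts towards the total, while the first record picks up eMolecules as an additional source.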


An article in the latest C&E News discusses the acquisition of ChemSpider by the Royal Society of Chemistry. I certainly appreciate the comments of Robert Massie, President of CAS, quoted in the article: “CAS has worked with Williams in the past,” CAS President Robert J. Massie notes. “We join everyone who is interested in the advance of chemical information in recognizing his considerable contributions. We are delighted to see that his creativity and enthusiasm will continue to benefit the chemical enterprise.”

I worked a lot with CAS while I was at ACD/Labs (I spent over 10.5 years there and left as their Chief Science Officer). I was intimately involved in the development and deployment of a number of software tools and visited Columbus many times. I have many fond memories of working with the CAS team and there are some great people working at the organization. I hope that in my new role at the Royal Society of Chemistry I will have the opportunity to work with CAS again in a collaborative and cross-publisher manner to the benefit of the Chemistry Community.


With the news about the RSC acquiring ChemSpider assets starting to settle, it is time to get back to work. One of the things we are noticing is that people are really starting to take advantage of the ability to link their articles to ChemSpider via the Add DOI function that is available to registered users. If you want to associate a paper with a single chemical structure it is very easy: the function uses the CrossRef service to fetch the result of a DOI lookup and deposits it directly to ChemSpider. The four images below outline the way in which this can be done. In this example I want to associate two particular articles with the record for 1,2-dioxetanedione shown below.

It’s easy. Navigate to the record of interest, make sure that the structure is the correct one and simply click the Add DOI button above the chemical structure to the left. Don’t forget you must be logged in! Now enter the DOI, click on LookUp and confirm that the title retrieved is the correct publication. Then click on OK. The publication will be submitted for a curator to confirm that it is appropriate, and it will show up online under the Supplementary Information when approved.
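For anyone curious what the LookUp step amounts to, here is a hedged sketch in Python. It assumes the present-day public CrossRef REST endpoint (https://api.crossref.org/works/ followed by the DOI) and its JSON layout, which is not necessarily what ChemSpider calls behind the Add DOI button. The helper names are made up for illustration, and the payload below is a hand-written, trimmed example rather than a live response.

```python
from urllib.parse import quote

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the lookup URL for a DOI (slashes are kept, other unsafe characters escaped)."""
    return CROSSREF_WORKS + quote(doi, safe="/")


def summarize(payload: dict) -> dict:
    """Pull out the fields a curator would check: DOI, title and author names."""
    work = payload["message"]
    titles = work.get("title") or [""]
    authors = [
        f"{a.get('given', '')} {a.get('family', '')}".strip()
        for a in work.get("author", [])
    ]
    return {"doi": work.get("DOI", ""), "title": titles[0], "authors": authors}


# Hand-written sample in the shape of a CrossRef "works" response (trimmed).
SAMPLE_RESPONSE = {
    "status": "ok",
    "message": {
        "DOI": "10.1186/1758-2946-1-9",
        "title": ["The Spectral Game: leveraging Open Data and crowdsourcing for education"],
        "author": [
            {"given": "Jean-Claude", "family": "Bradley"},
            {"given": "Antony J", "family": "Williams"},
        ],
    },
}

print(crossref_url("10.1186/1758-2946-1-9"))
print(summarize(SAMPLE_RESPONSE))
```

In a real lookup the JSON would come from an HTTP GET on the URL; the title check in the dialog corresponds to comparing the returned title against the paper you meant to add.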

There are also processes for depositing an SDF file with a single publication, and the SAME process applies to connecting via PubMed IDs (Add PMID). Try it out. Help the community discover publications by adding appropriate DOIs to particular records. Look at how many there are associated with cholesterol already.

[Figures: four screenshots of the Add DOI workflow]


I blogged yesterday about our release of Wikipedia Services on ChemSpider and how we are working to support authors on Wikipedia articles. Of course there are MANY languages of Wikipedia (as shown below) and we are willing to produce multilingual support. All we need is someone from the specific language version of Wikipedia to contact us and map the ChemBoxes and Drugboxes into their relevant languages. Let us know if you are interested.

[Figure: the list of Wikipedia language editions]


Wikipedia is great. I use it regularly. I’ve been working, with a team of experts, on curating and validating the “structure-based data” in the ChemBoxes and DrugBoxes for almost a year and a half. It’s been a long path and on the journey I have met some great people and made some true friends. I also HAVE NOT met most of the people I share the IRC chats with. We are a highly opinionated bunch of people but with a common focus of making Wikipedia better and making the data and content as accurate as possible.

We have the Wikipedia article lead in thousands of records on ChemSpider now. They are updated regularly as Wikipedia itself expands. One of the areas we have been focused on since the inception of the work was getting correct structures in place with the associated data. This includes the molecular formula, molecular weight, SMILES, InChI String, InChIKey, systematic name and so on. In order to help the process of expanding Wikipedia with new records and to provide a lot of these data automatically we have set about providing a Wikipedia Service so that Wikipedians can use ChemSpider as the source of the chemical structures of interest and generate the DrugBox and ChemBox content from ChemSpider. It’s a rather simple process…

Assume that you wanted to create a ChemBox for Domoic Acid. You would search for Domoic Acid on ChemSpider. You would then check whether the structure on ChemSpider named domoic acid is correct and, if so, generate the Wikibox by clicking on the link to the right of the Quick Links.

[Figure: the Wikibox link next to the Quick Links on a ChemSpider record]

Following this simple button click the user is shown a new window displaying the “Design Wikibox” functionality. There are various flavors of ChemBoxes and Drugboxes which can be generated, and the image below shows the “Simple ChemBox”.

[Figure: the Design Wikibox window showing the Simple ChemBox]

At present we fill the box with the data we can easily derive from ChemSpider and from the chemical structure. We list all other fields for Wiki depositors to populate. For Domoic Acid the Simple ChemBox looks like this:

{{Chembox
| ImageFile =
| ImageSize =
| IUPACName = (2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-6-hydroxy-1,5-dimethyl-6-oxo-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid
| OtherNames =
| Section1 = {{Chembox Identifiers
| CASNo =
| PubChem = 5282253
| ChemSpiderID = 4445428
| SMILES = O=C(O)[C@H]1NC[C@H](/C(=C\C=C\[C@H](C(=O)O)C)C)[C@@H]1CC(=O)O }}
| Section2 = {{Chembox Properties
| Formula = C15H21NO6
| MolarMass = 311.3303
| Appearance =
| Density =
| MeltingPt =
| BoilingPt =
| Solubility = }}
| Section3 = {{Chembox Hazards
| MainHazards =
| FlashPt =
| Autoignition = }}
}}
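For readers who want to see how little is involved in filling such a template, here is a rough Python sketch that reproduces the Simple ChemBox layout above from a dictionary of known values. The field lists are copied from the box shown; the function names and the dictionary input are assumptions for illustration, not the code behind the Design Wikibox page.

```python
IDENTIFIER_FIELDS = ["CASNo", "PubChem", "ChemSpiderID", "SMILES"]
PROPERTY_FIELDS = ["Formula", "MolarMass", "Appearance", "Density",
                   "MeltingPt", "BoilingPt", "Solubility"]
HAZARD_FIELDS = ["MainHazards", "FlashPt", "Autoignition"]


def _field(name: str, record: dict) -> str:
    # Unknown fields are left empty for the Wikipedia editor to fill in.
    return f"| {name} = {record.get(name, '')}".rstrip()


def _section(number: int, title: str, fields: list, record: dict) -> list:
    lines = [f"| Section{number} = {{{{Chembox {title}"]
    lines += [_field(name, record) for name in fields]
    lines[-1] += " }}"  # close the nested template on its last field
    return lines


def simple_chembox(record: dict) -> str:
    lines = ["{{Chembox"]
    lines += [_field(n, record) for n in ("ImageFile", "ImageSize", "IUPACName", "OtherNames")]
    lines += _section(1, "Identifiers", IDENTIFIER_FIELDS, record)
    lines += _section(2, "Properties", PROPERTY_FIELDS, record)
    lines += _section(3, "Hazards", HAZARD_FIELDS, record)
    lines.append("}}")
    return "\n".join(lines)


domoic = {
    "IUPACName": "(2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-6-hydroxy-1,5-dimethyl-"
                 "6-oxo-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid",
    "PubChem": "5282253",
    "ChemSpiderID": "4445428",
    "SMILES": r"O=C(O)[C@H]1NC[C@H](/C(=C\C=C\[C@H](C(=O)O)C)C)[C@@H]1CC(=O)O",
    "Formula": "C15H21NO6",
    "MolarMass": "311.3303",
}
box = simple_chembox(domoic)
print(box)
```

Keeping the field order in plain lists makes it easy to add the other box flavors by swapping in longer lists.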

We insert the PubChem ID associated with the particular structure if there is a related PubChem record. We also insert the ChemSpider ID in case the user wants to link back to ChemSpider. A Full ChemBox is much longer:

{{Chembox
| Name =
| ImageFile =
| ImageSize =
| IUPACName = (2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-6-hydroxy-1,5-dimethyl-6-oxo-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid
| SystematicName = (2S,3S,4S)-3-(carboxymethyl)-4-[(1Z,3E,5R)-6-hydroxy-1,5-dimethyl-6-oxo-hexa-1,3-dienyl]pyrrolidine-2-carboxylic acid
| OtherNames =
| Section1 = {{Chembox Identifiers
| Abbreviations =
| CASNo =
| EINECS =
| EINECSCASNO =
| PubChem = 5282253
| ChemSpiderID = 4445428
| SMILES = O=C(O)[C@H]1NC[C@H](/C(=C\C=C\[C@H](C(=O)O)C)C)[C@@H]1CC(=O)O
| InChI = InChI=1S/C15H21NO6/c1-8(4-3-5-9(2)14(19)20)11-7-16-13(15(21)22)10(11)6-12(17)18/h3-5,9-11,13,16H,6-7H2,1-2H3,(H,17,18)(H,19,20)(H,21,22)/b5-3+,8-4-/t9-,10+,11-,13+/m1/s1
| RTECS =
| MeSHName = domoic acid
| ChEBI =
| KEGG = C13732
| ATCCode_prefix =
| ATCCode_suffix =
| ATC_Supplemental =}}
| Section2 = {{Chembox Properties
| Formula = C15H21NO6
| MolarMass = 311.3303
| Appearance =
| Density =
| MeltingPt =
| Melting_notes =
| BoilingPt =
| Boiling_notes =
| Solubility =
| SolubleOther =
| Solvent =
| LogP =
| VaporPressure =
| HenryConstant =
| AtmosphericOHRateConstant =
| pKa =
| pKb = }}
| Section3 = {{Chembox Structure
| CrystalStruct =
| Coordination =
| MolShape = }}
| Section4 = {{Chembox Thermochemistry
| DeltaHf =
| DeltaHc =
| Entropy =
| HeatCapacity = }}
| Section5 = {{Chembox Pharmacology
| AdminRoutes =
| Bioavail =
| Metabolism =
| HalfLife =
| ProteinBound =
| Excretion =
| Legal_status =
| Legal_US =
| Legal_UK =
| Legal_AU =
| Legal_CA =
| PregCat =
| PregCat_AU =
| PregCat_US = }}
| Section6 = {{Chembox Explosive
| ShockSens =
| FrictionSens =
| ExplosiveV =
| REFactor = }}
| Section7 = {{Chembox Hazards
| ExternalMSDS =
| EUClass =
| EUIndex =
| MainHazards =
| NFPA-H =
| NFPA-F =
| NFPA-R =
| NFPA-O =
| RPhrases =
| SPhrases =
| RSPhrases =
| FlashPt =
| Autoignition =
| ExploLimits =
| LD50 =
| PEL = }}
| Section8 = {{Chembox Related
| OtherAnions =
| OtherCations =
| OtherFunctn =
| Function =
| OtherCpds = }}
}}
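The MolarMass value in these boxes follows directly from the Formula field. As a quick check, the Python sketch below sums atomic weights over the formula C15H21NO6. The atomic weights are conventional standard values supplied here as an assumption (the exact table ChemSpider uses is not stated), and the parser only handles simple formulas without brackets or charges.

```python
import re

# Conventional standard atomic weights; assumed, not taken from ChemSpider.
ATOMIC_WEIGHTS = {"C": 12.0107, "H": 1.00794, "N": 14.0067, "O": 15.9994}


def molar_mass(formula: str) -> float:
    """Sum atomic weights for a simple formula such as 'C15H21NO6'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        # A missing count means a single atom, e.g. the N in C15H21NO6.
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


# 15(12.0107) + 21(1.00794) + 14.0067 + 6(15.9994) = 311.33034
print(round(molar_mass("C15H21NO6"), 4))
```

With these weights the result rounds to the 311.3303 shown in the box.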

The user can also use the ChemSpider image: it can be resized, and clicking on the image downloads it as a PNG file. We believe that our images are attractive and appropriate for web display. Wikipedia at present favors the ACS format, so based on feedback we can change the config file behind the image generator to produce a different format for display.

We are considering extending the system to support direct uploads of Molfiles and/or other structure formats rather than depending on a compound being on ChemSpider. However, it is VERY likely that chemical compounds of value to the Wikipedia encyclopedic content already exist on ChemSpider. The trick is to find them since they may not have the Wikipedia article chemical name associated with the record. An InChI-based, SMILES-based or alternative name search might help locate the record. Alternatively a full structure search via the applet will find the record OR the user can DEPOSIT the structure to ChemSpider and work from there. The system is flexible enough.

This is our first release of the Wikipedia Services so we welcome any and all feedback. It’s one more way we are giving back to the Wikipedia community for their service. The outcome for us will also be crowdsourced curation of ChemSpider…as Wikipedia articles are written we will clean up related structures on ChemSpider. Everyone wins.

By the way…compare OUR structure for Domoic Acid with the one on Wikipedia. Does anyone know which is correct?


At ChemSpider we have built a system which can support the storage of structures, spectra, images, properties, and so on. In terms of supporting spectra we have noticed that people have been submitting spectral images (generally as JPEGs) and PDF files. So, rather than refuse these formats and force people to submit ONLY JCAMP spectra, we now support additional forms of spectral data, namely images and PDF files. An example of PDF submission is shown here. Now, this opens interesting possibilities…we COULD quite easily allow deposition of Open Access PDF articles and associate each with chemical structures in a one-to-many relationship. Is it something we should do? We can extend it to many-to-many in the future. Feedback?


The Chemical Abstracts Service have announced their first foray into providing Public Domain data. CommonChemistry.org was announced at the ACS meeting and is now online for all to visit. From the “About Common Chemistry” webpage the site is defined as:

“This database contains the CAS Registry Number®, chemical names (both formal and common), molecular formulas, and structures or sequences for ~7800 chemicals of widespread general public interest. These substances are of global commercial use or importance and have been cited 1,000 or more times in the CAS databases. Examples of substances included are aspirin, biotin, benzoyl peroxide, and boric acid. The Common Chemistry database also includes all 118 elements of the Periodic Table, although not all of the elements may meet the 1,000 references threshold.

Links to Wikipedia records (when available) have been provided by the Wikipedia Chemicals WikiProject in collaboration with Chemical Abstracts Service.

You can quickly and easily confirm a chemical name, CAS Registry Number, or structure from this database of common, everyday chemicals.

You can search for substances in Common Chemistry by either their CAS Registry Number or by their chemical name. Chemical name searches can be by exact name if you have one or by name fragment. CAS Registry Number searches are exact search only. Consult the Help page for additional search tips and details.

This database will be updated periodically. Information such as Wikipedia links may be added on a more frequent basis as it becomes available.”

A search on Xanax or Aspirin produces a hit very quickly and the record example for Xanax is given here. The result is a validated CAS Number for Xanax, a list of chemical names and the chemical structure. You can compare that to the ChemSpider record for Xanax here. I personally prefer our structure images on ChemSpider. The comparison is below…ChemSpider is on the right. We have a lot more info on the ChemSpider website and a lot of it is validated by the community.

[Figure: Xanax structure images compared, Common Chemistry on the left and ChemSpider on the right]

Of note is the fact that the CAS number provided with the CAS image is not separated by dashes. I had never seen that before.

We have already created the CommonChemistry.org Data Source on ChemSpider in case anyone wants to connect up records from ChemSpider with CommonChemistry as they are curating our dataset. I’ve already linked a few records to CommonChemistry.org and maybe that will happen at Wikipedia too. Some basic checking on a few records shows that we have good validation on the registry numbers on ChemSpider already. I checked 5 records and we were correct in all cases. This is unlikely to hold true across the entire database but is a good sign.
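A cheap first-pass check on any registry number, with or without dashes, is the CAS check digit: the last digit must equal the weighted sum of the preceding digits (weights 1, 2, 3… counted from the right) modulo 10. The Python sketch below shows that rule and the dash placement. The function names are illustrative, and passing the check only means the number is well formed, not that it belongs to the right compound.

```python
def format_cas(digits: str) -> str:
    """Add the dashes to an undashed registry number: body, two digits, check digit."""
    return f"{digits[:-3]}-{digits[-3:-1]}-{digits[-1]}"


def is_valid_cas(cas: str) -> bool:
    """Verify the CAS check digit; dashes are optional."""
    digits = cas.replace("-", "")
    if not digits.isdigit() or not 5 <= len(digits) <= 10:
        return False
    body, check = digits[:-1], int(digits[-1])
    # Rightmost body digit has weight 1, the next weight 2, and so on.
    total = sum(int(d) * weight for weight, d in enumerate(reversed(body), start=1))
    return total % 10 == check


# Alprazolam (Xanax) and aspirin registry numbers as test values.
print(format_cas("28981977"))      # 28981-97-7
print(is_valid_cas("28981-97-7"))  # True
print(is_valid_cas("50-78-2"))     # True
print(is_valid_cas("50-78-3"))     # False
```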

It is unclear what licensing is on the data. I doubt it’s Open but that won’t matter to the majority of users…they are looking for a piece of information or to confirm something and are unlikely to be distracted by whether the data are Open or not…free access will suffice.

I haven’t tested the search capabilities too much and will do so in the next few days. I think that CAS should consider showing the lead of the Wikipedia article as well as linking out to other information. ChemSpider is a good one since we list articles, properties, analytical data etc for a much enhanced record…see Cholesterol as an example. When the site is out of beta we’ll offer to produce ChemSpider IDs for the entire CommonChemistry database in case they want to link.

This website is an interesting shift for CAS and demonstrates a willingness to provide access to Public Domain data. It is a good start to open up the first 7800 structures with more than 1000 citations and there is much more that they can do in a similar vein, theoretically without threatening their business model. It’s going to be interesting to watch. Certainly CAS have helped in the validation of the CAS Numbers on Wikipedia and that has been an interesting project for all, with validated CAS numbers resulting. It has been a long and exacting project with many eyes poring over the data…all for the good of the community.


Some pleasant news was shared today on the CHMINF list server by Richard Kidd, the manager of informatics at RSC Publishing in Cambridge. Richard’s email to the community declared:

“We’ve just released a few new features based around the RSC Prospect project to enhance our articles. Some of these features are cosmetic and improve the look and feel, and some are deeply semantic and a bit specialist - but both take another step towards the future.

On the articles themselves, we now have mouseover popups of chemical structures, improved subject pages and compound pages, compound link through to ChemSpider, and better toolbar and page layout.” and he pointed to an example article here:  http://tinyurl.com/dba3mv.

So…off I went for a click around and found the article. I checked out the structures displayed by hovering over the numbers and then clicking through to the associated records page as shown here. Right there are the links to external resources for SEARCHING…the links do not mean that the structures are necessarily on ChemSpider, only that they can be searched for on ChemSpider.

I clicked on the search for that structure and unfortunately did not find it. However, I deposited it by pasting the InChI into our deposition entry box and converting it. It is now here. I then used the Add DOI function on that record to add the Author, Title and DOI information in about 15 seconds. It’s listed in the Supplementary Information now. A search on this compound will now find that record.

This is a very manual way of adding the information. The ideal is to deposit a datastream of structures and DOIs together for the “primary compounds”…I don’t want links to benzene for every record unless it is a primary compound in the article. We’ve already been doing that for the RSC Project Prospect backfile as described previously. I believe a solution for ongoing updates from RSC is feasible…


We continue to expand the ChemSpider Database with new depositions sourced from various collaborators. We are especially privileged to have received the RSC’s structure collection associated with their Project Prospect articles and have spent a couple of weeks working with the data prior to depositing onto ChemSpider. During the deposition process we have formed the link between the chemical structures and their articles via a DOI link. We have been able to deposit the title, an associated author and the DOI. In this way we have been able to link thousands of chemical structures to articles on the RSC website. On each record associated with an RSC article you will see both a link from the data source table and a link via DOI from the reference as shown here and in the figure below.

[Figure: a ChemSpider record linked to an RSC article via the data source table and the DOI reference]

With the RSC depositions came many beautiful structures - highly symmetric, complex and just plain “pretty” to a chemist. But a high level of complexity also arrived with the collection. While many InChIs could be converted to their associated connection tables, the act of converting the InChIs could add additional stereochemistry, and structure cleaning could change stereochemistry, so this was a long, tedious and mostly manual process I’m afraid. Nevertheless, it is a wonderful addition to the ChemSpider database, and our sincere thanks, on behalf of the community too, go to the Royal Society of Chemistry for sharing their data with us. The InChIs will be deposited into the InChI Resolver shortly.


There are a lot of conversations going on in the community about Open Data, specifically on the Open Knowledge Foundation email list. A recent blog post announces a working group on Open Data in Science and I’ve sent an email offering to provide input. Hosting ChemSpider has certainly allowed me to get engaged in frontline conversations regarding people’s willingness to share their data and what the perceived differences between Open and Free are. I don’t have all the answers to all the questions, but this is a growing area of interest and concern for scientists and will likely remain in the spotlight for the foreseeable future. My judgment is that the majority of scientists do not care whether data are free or Open, despite the potential repercussions for reuse that this distinction will produce. Scientists do care about whether their own data are free or Open as soon as I discuss with them what the differences are (based on my own understanding of the differences!). See my previous post.

In this regard let’s chat about the Spectral Game for a moment. The Spectral Game is, even now, a resounding success and, in many ways, is surpassing our early expectations in terms of capability and usage. As of last week spectra had been viewed 20,261 times by 1305 unique visitors from 47 countries. That’s quite amazing for an online game for chemists that is proliferating through word of mouth (blogs, emails, RSS feeds) only. The Spectral Game is fed by Open Data and now has over 1000 spectra feeding into the game. These have been supplied by scientists willing to make their data Open and by myself, sourcing data and processing during long evenings in front of a good movie. Open Data has been the criterion we have used to feed the Spectral Game.

Recently Jean-Claude Bradley and I were talking about expanding the dataset on the Spectral Game to include more Mass Spectral, Infrared and UV-Vis data. The NIST Webbook is a rich source of such information and the data CAN be downloaded as JCAMP spectra for local processing. Thanks to the gracious nature of the people at NIST, a request to allow us to download and use some of their data in the Spectral Game was greeted with full support: we have permission to do so and have already started the process. An example set of spectra can be found for Cholesterol (here) where there is now HNMR, CNMR, EI-MS, UV and IR data. The data were downloaded via this page: http://webbook.nist.gov/cgi/cbook.cgi?Name=cholesterol&Units=SI . The data are NOT Open Data however. If you visit the spectral pages you will see the ownership declared specifically. For the MS page it says:

Owner    NIST Mass Spectrometry Data Center
Collection (C) 2007 copyright by the U.S. Secretary of Commerce on behalf of the United States of America. All rights reserved.
Origin    T.IIDA NIHON UNIVERSITY, KORIYAMA, FUKUSHIMA-KEN, JAPAN
NIST MS number    67286

and for the IR spectra there are multiple sources:

Data compiled by: Coblentz Society, Inc.

* SOLID (KBr DISC) VS KBr
$$SEE 5095 FOR SOLUTION; PERKIN-ELMER 21 (GRATING); DIGITIZED BY COBLENTZ SOCIETY (BATCH I) FROM HARD COPY; 2 cm-1 resolution
* SOLID (MINERAL OIL MULL); Not specified, most likely a prism, grating, or hybrid spectrometer.; DIGITIZED BY NIST FROM HARD COPY; 4 cm-1 resolution
* SOLUTION 1% (CS2 FOR 2-15 microns, AND C2Cl4 FOR 5.5-7.2 microns)
$$SEE 5106 FOR KBr DISC; PERKIN-ELMER 21 (GRATING); DIGITIZED BY COBLENTZ SOCIETY (BATCH I) FROM HARD COPY; 2 cm-1 resolution
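Ownership details like these also travel inside the JCAMP-DX files themselves, as header lines of the form ##LABEL=value. The Python sketch below reads such a header so that a depositor could inspect the owner and origin before deciding how to flag a spectrum. The sample text is hand-written for illustration (its values echo the MS listing above and are not a real download), and the function name is an assumption.

```python
DATA_TABLE_LABELS = {"XYDATA", "XYPOINTS", "PEAK TABLE"}


def parse_jcamp_header(text: str) -> dict:
    """Collect ##LABEL=value records up to the start of the data table."""
    header = {}
    for line in text.splitlines():
        if not line.startswith("##") or "=" not in line:
            continue  # blank lines and anything that is not a labeled record
        label, value = line[2:].split("=", 1)
        label = label.strip().upper()
        if label in DATA_TABLE_LABELS:
            break  # everything after this is spectral data, not metadata
        header[label] = value.strip()
    return header


# Hand-written sample in JCAMP-DX style; not an actual NIST file.
SAMPLE = """\
##TITLE=Cholesterol
##JCAMP-DX=4.24
##DATA TYPE=MASS SPECTRUM
##OWNER=NIST Mass Spectrometry Data Center
##ORIGIN=T.IIDA NIHON UNIVERSITY, KORIYAMA, FUKUSHIMA-KEN, JAPAN
##PEAK TABLE=(XY..XY)
41,120 43,999
##END=
"""

header = parse_jcamp_header(SAMPLE)
print(header["OWNER"])
print(header["ORIGIN"])
```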

Both NIST and the Coblentz Society generate revenue from some of their data collections despite the fact that these data on Webbook are offered free for viewing. The NIST MS database is the most widely distributed MS database in the world (I believe) and they also offer an IR database for sale. Other data are available (1). The Coblentz Society have been building their databases for decades and also offer them for sale. If you look at the prices of the Coblentz collection or the NIST IR collection they are one to two hundred dollars per collection. Maybe some rich uncle could write a check and release them into the world of Open Data for all to use? Otherwise the groups maintaining these collections deserve to have their costs covered at a minimum…which is probably what their revenue streams from these databases allow.

We are thankful to NIST for allowing us to upload spectra from the Webbook and we have started to upload data. It will only be a slice of the collection. We will flag the data on our side as NOT Open Data but “Can be accessed by Spectral Game”. In this way the game grows in its types of data but we respect the licenses of the contributors. Open Data vs Free Data vs Pragmatic Usability…maybe the OKF can participate in negotiating the release of such data sources into the public domain and, where appropriate, sourcing some funding to allow them to do it?


There are some interesting articles showing up on ChemSpider from across the blogosphere. We have just added a new item to our list of high priorities: generating an RSS feed of structures, short descriptions and ChemSpider IDs so that anyone can access them. When we add new descriptions we will add snippets to the RSS feed.

New Articles include:

Teen Chemist and Splenda

A Discussion about the Synthesis of Spirangien A from the TotallySynthetic Blog by Paul Docherty

A Discussion about the Synthesis of Omaezakianol from the TotallySynthetic Blog by Paul Docherty


We’ve been working with Jean-Claude Bradley and his Open Notebook Solubility Challenge group to assist where we can. This has included enhancing some of our services (though there is more work to be done…), populating data into ChemSpider and, now, linking us up to the Data Tables built by Andy Lang (of The Spectral Game fame…we’re quite a team).

The Open Notebook Solubility Challenge is described here. The present list of compounds for which we have created the integration described below is here. When you open that link you’ll see the first bunch…notice the little icons showing patent links, Wikipedia links and the presence of spectra on those records.

What we have done now is deposit the links into the Data Source tables for these compounds and provide the direct link to the ONS tables. They can be viewed WITHOUT leaving the site simply by hovering over the link…OR you can click on the link to view the data directly. An example of the link view is shown below. To find these tables simply look up the Open Notebook Solubility Challenge data source in the table.

 

[Figure: the ONS solubility table shown when hovering over the data source link]


I was a generator of analytical data for a number of years. I am an analytical scientist by training…NMR jock to be precise (if you are interested in my work see here). However, during my tenure at Kodak our team ran a walkup laboratory and we generated a lot of data. In those days it was a few 10s of gigabytes per year. We generated NMR, IR, Chrom and LC-MS data…lots of it. At that time we had our own NMR processing software that was written in house. That was replaced later by commercial desktop processing software. We used instrument-based data processing and produced simple reports at the end of the work.

When I left Kodak for ACD/Labs I led the development of the SpecManager product, extending it from 1D NMR processing software only into IR processing, chromatography processing, 2D NMR processing, LC-MS processing and into an even broader set of analytical techniques. By the time I left ACD/Labs, EACH of these areas had its own product manager - NMR, MS, UVIR and Chrom…four in total. My focus with these tools was always on structure identification and qualitative data analysis.

LC-MS is one of the most useful, sensitive, high-throughput technologies available in the majority of laboratories and the data generated can be in the 100s of gigabytes every few months. High-throughput quantitative analysis software tools have generally been the domain of the hardware vendors but one person that I watched with interest as he took on the domain of high throughput quantitative analysis was Joe Simpkins. Joe was a co-founder of the Opans CRO lab and they needed BETTER software than was available from the vendors. So, Joe built it. Over the next few years the software was optimized for their lab, then the labs of their users. Ultimately their lab software was sold to their customers as it offered possibilities that the hardware vendors couldn’t address in terms of usability and specifically vendor neutrality. The software business grew and now Joe Simpkins has split off a company called Virscidian and has already added to the team. Components of his Analytical Studio software are already licensed to Agilent and it only makes sense that the other vendors will be looking at the Virscidian capabilities moving forwards.

Virscidian is one of those small companies I like to watch. They are small, nimble, highly skilled, networked in the industry and focused on providing the optimal solutions for the user. (Also they are my friends so I watch very closely…and help where I can.) It just makes sense to connect ChemSpider to their solutions, as we have with many of the other MS vendors…watch this space.


I’ve been in a discussion with Mat Todd for a while…Mat is involved with The Synaptic Leap and is a supporter of what we are trying to do with ChemSpider. What we have built is a platform where registered users can post their own structures, spectra, images and so on. We have a number of chemists contributing to the database and as a result there are synthesis protocols showing up on ChemSpider, new structures added regularly and NMR spectra specifically being added rather frequently. ChemSpider is growing as a result of community contributions and improving in quality as a result of community curation. But…we want even more participation…if possible.

Now, I rent my DVDs from Netflix and buy my books and electronics from Amazon. I read the reviews of other people who have purchased books, CDs, electronics and rented DVDs. I am influenced by the masses. Doh. Now…I will admit that I DON’T contribute to those reviews. I’ve never written about my favorite movies (but do score them). I don’t review electronics or books but I do leave feedback on eBay. I suspect I am like most other people. We have thousands of users per day that use ChemSpider now…the majority search, use and hopefully get value from our humble offering. Most people DON’T leave tracks…don’t add comments, don’t deposit data, don’t add information. Some of these people however do maintain their own blogs, leave comments on other people’s blogs or happen across useful information. We want to know about interesting articles, blog posts and commentaries that might be of value to associate with structures on ChemSpider.

During a recent email exchange with Mat Todd we exchanged, at a very basic level, ideas about scraping information from sites and making it available on ChemSpider rather than having people deposit the data. The question was how to automatically scrape appropriate data from blogs etc. and maybe use tagging so that some robot could locate the data, scrape it and deposit it. Theoretically yes, this would work. But, honestly, I believe that developing some fancy technologies which might not bring benefit isn’t the way to go. Been there, done that. I realized there is a simpler way to do this AND, if there is a critical mass, then we will develop some fancy technologies (we’re good at that!).

So…welcome Google Alerts.


Google Alerts has been in beta for…hmmm…a while. Every day I get alerts hitting my email about things I care about…it’s how I find out about the latest comments about ChemSpider on blogs that I don’t have listed in my reader. So, a simple alert on “loadtochemspider” has been set up. If anyone tags a blog, wiki, “something” with that tag then we will check it out and if it fits we will add it to ChemSpider. Simple technology…1 minute to set up. No massive technology development…just eyeballs and copy-paste.

So…let’s try it. If you see something of interest add loadtochemspider as a tag, into a comment, into the original post. Leave the rest to us…


There has been encouragement that we look at Freebase as an additional online resource to integrate with. In terms of chemical entities, some of the Wikipedia structure collection has made its way onto Freebase and has been enhanced to include InChIs and SMILES. It’s not clear to me whether the InChIs on Freebase are all obtained FROM Wikipedia or were layered onto Freebase later. So, I approached the Freebase group and asked if they could provide me with a dump of the InChI strings and the SMILES strings together with the associated Freebase IDs and the chemical names. In this way we would be able to generate SDF files for depositions and end up with the structures (converted from InChIs and SMILES) as well as the associated chemical names and Freebase IDs. Simple idea, right?

[Image: freebase11]

So, we converted InChIs and SMILES and generated the depositions. Freebase links now show up in the Data Sources section and, if you put your cursor over the GUID, you see an image of the page and can click through to the record on Freebase. See the image above. The Freebase GUID for Benzene is here: #9202a8c04000641f800000000000ac66

All seems well. I have a question though…I look at a structure like Dapagliflozin on Wikipedia here and see full stereochemistry explicitly defined in the name and in the image. However, on Freebase I note that the stereochemistry is NOT explicitly defined in the InChI. The InChI is:  1/C21H25ClO6/c1-2-27-15-6-3-12(4-7-15)9-14-10-13(5-8-16(14)22)21-20(26)19(25)18(24)17(11-23)28-21/h3-8,10,17-21,23-26H,2,9,11H2,1H3/t17?,18?,19?,20?,21-/m0/s1
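The missing stereochemistry is visible directly in the string: the `/t` layer lists each stereocentre with a parity mark, and `?` means the centre is undefined. A minimal sketch of how such records could be flagged before deposition follows. The function names are hypothetical, it uses plain string handling rather than the InChI library, and it only looks at the simple single-component case shown above.

```python
import re

# The Freebase InChI for Dapagliflozin quoted in the post.
DAPAGLIFLOZIN_FREEBASE = (
    "1/C21H25ClO6/c1-2-27-15-6-3-12(4-7-15)9-14-10-13(5-8-16(14)22)"
    "21-20(26)19(25)18(24)17(11-23)28-21/h3-8,10,17-21,23-26H,2,9,11H2,1H3"
    "/t17?,18?,19?,20?,21-/m0/s1"
)


def stereo_centres(inchi):
    """Map atom number -> parity mark ('+', '-', '?' or 'u') from any /t layer."""
    centres = {}
    for layer in inchi.split("/"):
        # Tetrahedral stereo layers start with a lowercase 't'.
        if layer.startswith("t"):
            for atom, mark in re.findall(r"(\d+)([+\-?u])", layer[1:]):
                centres[int(atom)] = mark
    return centres


def undefined_centres(inchi):
    """Return the atom numbers whose stereo parity is undefined or unknown."""
    return sorted(a for a, m in stereo_centres(inchi).items() if m in "?u")


if __name__ == "__main__":
    print(undefined_centres(DAPAGLIFLOZIN_FREEBASE))
```

For the Freebase string, four of the five stereocentres come back as undefined, which is why it converts to a different structure than the fully defined one.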

So, when we take the InChIs and the chemical names, convert the InChIs and deposit the chemical structures we end up with a “destruction” of the curation work we have done on ChemSpider. We end up with TWO structures for Dapagliflozin, not one (See below)

[Image: freebase2]

And now we need to start the curation efforts AGAIN to clean out misassociations of names and structures. So, what we are going to do is delete the deposition of Freebase structures and redeposit without the chemical names. In this case the outlinks to Freebase will be in place but the structures will not be found by a name search UNLESS the Freebase GUID is associated with an already curated name-structure pair that is coincident with the Freebase name.
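One way to catch this kind of collision before it reaches the database is to check whether any name in a deposition would end up attached to more than one structure. The following is only a sketch under assumptions: the `(name, inchi)` record layout, the function name and the placeholder InChI strings are all hypothetical, and it is not how the ChemSpider deposition pipeline actually works.

```python
from collections import defaultdict


def conflicting_names(records):
    """Given (name, inchi) pairs, return {name: [inchis]} for names on >1 structure."""
    by_name = defaultdict(set)
    for name, inchi in records:
        # Compare names case-insensitively; compare InChIs as exact strings.
        by_name[name.strip().lower()].add(inchi.strip())
    return {n: sorted(s) for n, s in by_name.items() if len(s) > 1}


# Placeholder identifiers, NOT real InChIs: one stands in for the curated,
# fully defined structure and one for the stereo-incomplete Freebase version.
records = [
    ("Dapagliflozin", "INCHI-PLACEHOLDER-curated-full-stereo"),
    ("dapagliflozin", "INCHI-PLACEHOLDER-freebase-partial-stereo"),
    ("Benzene", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),
]

if __name__ == "__main__":
    for name, inchis in conflicting_names(records).items():
        print(name, "->", len(inchis), "structures")
```

Names flagged this way could be held back from the deposition while the structures and their outlinks still go in.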

I can say that the Freebase team were a pleasure to work with and, in theory, once the Wikipedia curation project is finished the SMILES and InChIs on Freebase will be correct and such linkages back to Freebase will be easier, and correct. In the meantime I am interested in where the Freebase SMILES and InChIs are coming from (I think a lot of them are from Wikipedia but am not sure) and we are going to make certain on our side that we remove the chemical names so as not to decrease the quality of our curation efforts.


I’ve posted previously about embedding structure images and spectra into blogs and webpages. One of the side effects of this is that for structure images specifically the ChemSpider record is linked back to the webpage that the structure is embedded into. Structures are embedded in various places now into wikis and blogs. An example of 11 embedded structures is shown here.

When Arvin Moser wrote about Letrozole on his blog and embedded the structure image in the post, a link BACK to his blog post was created in the Data Source table. See the image below.

[Image: letrozole]

With this capability, as more people embed structures from ChemSpider into their online pages/blogs more of the internet will become structure searchable and ultimately linked. It does not require adding InChIs to webpages (though that is encouraged for indexing by search engines).

(Caveat: The system is not yet optimal and we are working on filtering out comments on blogs that presently get added as additional links. All “doubles” will be filtered out later)


I gave my talk yesterday at CShals 2009, the conference on Semantics in Healthcare and Life Sciences. It was a great meeting for me (hindered by dismal access to wireless internet as a result of Marriott’s desire to make more money from the conference organizers. They should be ashamed of themselves in this day and age!) as it was not about Chemistry, not about spectroscopy, not even about Open Data, Open Access and Open Source. It was about Semantics. I learned a lot and got to hear Tim Berners-Lee talk about where the semantic web is, where it can go and how it can be disruptive in a good way while NOT being too disruptive to layer onto what already exists. The best part of the meeting for me was the clear passion for the InChI, as well as a lot of acknowledgement that it is not perfect and cannot presently compete with molfiles, commercial systems, CAS Numbers and so on. But people are optimistic, waiting and supportive.

Overnight I inserted a lot more information about InChIs and how they can be useful, where some of the limitations are presently, and how the StdInChI has now added a new level of complexity on one hand and simplification on the other. There have already been a number of requests for a copy of the talk so it is up on Slideshare for now (and linked below). I’ll do a voice over in the next few days and upload to Scivee. I unveiled the first version of the InChI Resolver at the conference and showed it to a couple of people. The general consensus is that we are heading in the right direction. The timing of this conference was good because the intention is to layer on RDF before we release at the ACS, time allowing.


I’ve previously posted about the work going on regarding the NMR Game…now morphed to the spectral game and described in detail by JC Bradley. We’ve been working hard to increase the number of spectra available as part of the game (now in the 100s of spectra!) and Andy has been working hard to improve the flow of data. The original structure images have been replaced with ChemSpider structure images and we have delivered a web service to allow Andy to continue to update the spectral collection as more data are added to the database.

When users see issues with the spectra they get to leave comments regarding their observations. This can be very valuable for us to curate the spectral data. This will allow us to perform game-based crowdsourcing of the spectral data and the feedback is already of value.

We have about another 30 spectra to add to the present collection of spectral Open Data and then we’ll take a break and I’ll be approaching the spectrometer vendors and a few other friends to see whether they have any data to contribute to the game. We are already considering adding the ability to add a “Company Logo” to be associated with a spectrum so that the vendors/contributors get fair recognition for their contribution to the game. If you are interested in providing data we will upload it for you. Contact us at infoATchemspiderDOTcom.

JC Bradley has now uploaded a short tutorial to YouTube regarding how to play the game and I have embedded it below. JC’s also announced a prize for the best player. Go test your skills…


Jean-Claude Bradley has recently posted about an NMR Game running on Second Life. Read his blog for details but I excerpt some of the comments here:

“Andy and I brainstormed some new chemistry games that we could introduce to Second Life to leverage our recent tools. One of the applications is the NMR game. By combining the orac molecule rezzer, the SL spectral viewing tool and ChemSpider Open Data spectra I think we have a pretty good game.

The idea is simple: click on the molecule that is represented by the spectrum. If it is correct you get 2 points and get another spectrum. You lose a point by clicking on an incorrect molecule. After going through all the spectra your score gets posted on the web to a top10 list. For equal scores the best time takes it.”
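The scoring rules quoted above are simple enough to write down in a few lines. This is only an illustrative sketch of those rules (2 points for a correct click, minus 1 for an incorrect one, ties on the top 10 list broken by the faster time). The function names and the `(player, score, seconds)` record layout are assumptions, not the code actually running in Second Life.

```python
def score_game(answers):
    """answers: iterable of booleans, True for each correct click."""
    return sum(2 if correct else -1 for correct in answers)


def top10(results):
    """results: iterable of (player, score, seconds).

    Highest score first; for equal scores the shorter time wins.
    """
    return sorted(results, key=lambda r: (-r[1], r[2]))[:10]


if __name__ == "__main__":
    results = [
        ("player_a", score_game([True, True, False]), 95.0),
        ("player_b", score_game([True, False, True]), 80.0),
        ("player_c", score_game([True, True, True]), 120.0),
    ]
    for player, score, seconds in top10(results):
        print(player, score, seconds)
```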

So, here at ChemSpider we are delivering spectra as Open Data to help with the game. And we’re happy to do so. It’s always been our intention to have ChemSpider provide value like this. ANY registered user can upload spectra to the ChemSpider website. The details are outlined here (I just noticed the interface has changed since I wrote that but you should still be able to follow the process). We need the spectra to be in JCAMP format and if you want them to be available for the game, and for people to download, they MUST be declared as Open Data.
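If you want to sanity-check a file before uploading, a JCAMP-DX file is plain text made up of labelled records of the form `##LABEL=value`. Here is a rough sketch of reading those labels. The function names, the sample file and the short list of labels to check are illustrative assumptions, and this is not a full JCAMP-DX parser (it ignores the data table entirely).

```python
import re


def jcamp_header(text):
    """Collect the ##LABEL=value records of a JCAMP-DX file into a dict."""
    header = {}
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()  # "$$" starts an inline comment
        if not line.startswith("##") or "=" not in line:
            continue  # skip data rows and anything that is not a labelled record
        label, value = line[2:].split("=", 1)
        # Labels are compared ignoring case, spaces, dashes, slashes, underscores.
        key = re.sub(r"[\s\-/_]", "", label).upper()
        header[key] = value.strip()
    return header


def missing_labels(header, wanted=("TITLE", "JCAMPDX", "DATATYPE")):
    """Return the labels from `wanted` (an assumed minimal set) that are absent."""
    return [k for k in wanted if k not in header]


# A made-up two-point file, just to exercise the functions.
SAMPLE = """##TITLE=Example compound
##JCAMP-DX=4.24
##DATA TYPE=INFRARED SPECTRUM
##XUNITS=1/CM
##YUNITS=TRANSMITTANCE
##NPOINTS=2
##XYDATA=(X++(Y..Y))
4000 0.95
3999 0.94
##END=
"""

if __name__ == "__main__":
    print(jcamp_header(SAMPLE)["DATATYPE"], missing_labels(jcamp_header(SAMPLE)))
```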

Right now we have 100s of spectra. You can find them here. But we need more. Much more! We’d like you to contribute them. If you don’t want to upload them yourself then contact us directly and we will process and upload them for you. We need the data and the name/structure of the associated molecule.

And how will the game be used on these spectra? The game will be used to “curate and validate” the spectra. As the game is being played a score of how many people say it is correct will be kept. And of course what is wrong. Based on these scores our curators will be directed to “problematic spectra” for their attention. This is true crowdsourcing and a great way to do spectral validation.
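To make the idea concrete, here is a small sketch of how per-spectrum tallies could point curators at the spectra players most often get wrong. The record layout, the function name and both thresholds are assumptions for illustration only, not the scoring actually used by the game.

```python
from collections import Counter


def problematic_spectra(attempts, min_attempts=5, max_correct_rate=0.5):
    """attempts: iterable of (spectrum_id, was_correct).

    Returns (spectrum_id, correct_rate) pairs, worst first, for spectra that
    have been played at least `min_attempts` times and matched correctly no
    more than `max_correct_rate` of the time.
    """
    total, correct = Counter(), Counter()
    for spectrum_id, ok in attempts:
        total[spectrum_id] += 1
        correct[spectrum_id] += bool(ok)
    flagged = [
        (sid, correct[sid] / n)
        for sid, n in total.items()
        if n >= min_attempts and correct[sid] / n <= max_correct_rate
    ]
    return sorted(flagged, key=lambda item: item[1])


if __name__ == "__main__":
    attempts = (
        [("spectrum_A", True)] * 5
        + [("spectrum_B", False)] * 4 + [("spectrum_B", True)]
        + [("spectrum_C", False)] * 2  # too few plays to judge yet
    )
    print(problematic_spectra(attempts))
```

A low score does not prove the spectrum is wrong (it may simply be a hard one), so the output is a list for a curator to look at rather than an automatic verdict.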

We would like the spectral collection to grow and welcome contributions from anyone. They do NOT have to be just NMR. They can be IR, MS, Raman etc too. Ultimately a Spectral game will be unveiled. Please consider ChemSpider as a repository for your data as it will benefit the community of chemists and, in particular, the process of teaching students and allowing them to “game their way” through the process. Watch where this goes…it’s VERY interesting to consider how it can improve…there is an NMR game website in development so you won’t have to go just to Second Life.


Following on from my recent post about “Why are structures like YouTube Videos?” I am now asking the same question about spectra.

The answer is simple. When people have deposited data as OPEN DATA on ChemSpider we are now providing the ability to embed the spectral data and display at other sites. This is different in that we are not just showing images but real live spectra in the JSpecView Java Applet so Java must be installed. Thanks to Cameron Neylon for asking the question about whether we could provide the service. Glad to help…

If all is well you should see an IR spectrum associated with the ChemSpider record here. In order to EMBED spectra simply log in to ChemSpider, find an Open Data spectrum of interest (you could browse http://www.chemspider.com/spectra.aspx) and then click on EMBED (left-hand corner below the spectral image). Do a left click to see additional features of JSpecView. We DO have some minor work to do with spectral plot reversal and improving the zoom display but we’re getting there. Enjoy.


A few months ago I met with Adam Azman in Chapel Hill to discuss how the names in our ChemSpider database could be used to expand his Chemical Dictionary. It seemed that we would be sitting on a treasure trove of name fragments that could help him in his efforts. So, we supplied Adam with 1.3 million identifiers and Adam has worked for the last few months to generate his Chemical Dictionary. He extracted over 100,000 name fragments from our collection as he has described in his blogpost here.

Extracted from Adam’s blog are his so-called Administrivia: “The dictionary is licensed under the Creative Commons Attribution 3.0 License. … The dictionary is compatible for Microsoft Office (Windows or Mac), and Open Office (Windows or Linux). The install file includes instructions for upgrading old versions and installing it for the first time. The dictionary should be useful for all chemists. However, I am an organic chemist. Thus, the dictionary was created from an organic chemist’s mindset. It will probably be most useful for organic chemists.”

Adam has explained in detail how he did the work. I encourage you to read his post to fully understand the nature of the work and how much heavy-lifting he actually did.  It’s been a pleasure to help Adam and the community by supplying our own form of a “dictionary” to him for his particular treatment. It took a few hours of work from our side and months of hard work from him. I encourage you to take advantage of his efforts…if you are a chemist this is a real gift for the season. The dictionary can be downloaded from our site here.

Now I want you to consider timing. We are working hard on our ChemMantis project, a system for entity extraction and document markup. Part of this includes the generation of dictionaries for finding chemical names. We’ve already expanded our chemical dictionary using the database of identifiers from ChemSpider but for those of you working with other systems such as OSCAR3 or the other commercial markup systems dependent on chemical dictionaries you will likely find Adam’s contribution significant. Enjoy.
