Archive for the ChemSpider Services Category

ChemSpider has been working on polishing both single structure and SDF file deposition. We are now using these tried and tested approaches to deposit large blocks of data, commonly many thousands of records. For depositions of 100s of thousands we do break the depositions into smaller chunks of 5-10 thousand each.
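For anyone preparing a large file for us, the chunking step can be sketched in a few lines of Python. This is only an illustration and not our deposition code: it assumes each record ends with the standard `$$$$` delimiter line, and the function name and default chunk size of 5,000 are made up for the example.

```python
from typing import Iterable, Iterator, List


def split_sdf_records(lines: Iterable[str], chunk_size: int = 5000) -> Iterator[List[str]]:
    """Yield lists of complete SDF records, at most chunk_size records per list.

    Each record is returned as one string ending with its "$$$$" line.
    Trailing lines that never reach a "$$$$" delimiter are dropped.
    """
    chunk: List[str] = []
    record: List[str] = []
    for line in lines:
        record.append(line.rstrip("\n"))
        if line.strip() == "$$$$":  # end-of-record delimiter in SDF
            chunk.append("\n".join(record) + "\n")
            record = []
            if len(chunk) == chunk_size:
                yield chunk
                chunk = []
    if chunk:  # leftover records form a final, smaller chunk
        yield chunk
```

Each yielded chunk can then be written to its own file with `"".join(chunk)` and deposited separately.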

An example of depositing a couple of large SDF files was given to us when the following publication was released at JCIM.

Global Bayesian Models for the Prioritization of Antitubercular Agents
by Philip Prathipati, Ngai Ling Ma* and Thomas H. Keller
J. Chem. Inf. Model., 2008, 48 (12), pp 2362–2370
DOI: 10.1021/ci800143n

This paper offers us a few thousand SMILES strings in CSV files that we could deposit into ChemSpider and associate with the article. Visit an example here and you will see the article connected via DOI in the supplementary information.

article

It is easy for us to deposit such datasets so if you have publications with such datasets that you would like to see on ChemSpider send us the SDF file and the DOI and they will be deposited.

Egon Willighagen has been growing the Linked Open Chemistry Data with his work on rdf.openmolecules.net. He has now integrated it with the InChI Resolver to enhance the integration, as shown below. We’re looking forward to hearing from users benefiting from this!

OpenMolecules RDF

About http://rdf.openmolecules.net/?InChI=1/CH4/h1H4
Identifier info:inchi/InChI=1/CH4/h1H4
InChI InChI=1/CH4/h1H4
Source Chemical blogspace
Source ChEBI
ChEBI ID CHEBI:16183
owl:sameAs http://bio2rdf.org/chebi:16183
Source Connotea
Tag NewTag
Tag alkanes
Tag Gas
Tag InChI
Source DBPedia
owl:sameAs http://dbpedia.org/resource/Methane
Source NMRShiftDB
owl:sameAs http://pele.farmbio.uu.se/nmrshiftdb/?moleculeId=20029286
NMRShiftDB mol ID 20029286
Source ChemSpider
ChemSpider ID 291

A few years ago I was involved in the development of chemistry databases on both Palm and Pocket PC platforms. I wrote an article about it here: the products were named ChemPalm and ChemPocket…we even had a proof of concept 2D barcode scanner where the structures were encoded into 2D barcodes. The power of the devices we can hold in our hand now has grown in leaps and bounds. Internet access from the phone is expected and web services integrated into phone applications can open up new avenues, especially when it comes to accessing chemistry data.

A few months ago I heard so many people lauding the iPhone. How could it be that big a game changer? Then, on a long drive I plugged my iPod Nano in to charge on my car adapter and fried the Nano on the spot. The heat coming off was enough to bronze the stainless steel and you could smell it in the air. Turns out it was a problem for a very small % of Gen I Nanos. I looked at the price of a new iPod and was in the market for a GPS so thought I’d take a look at the iPhone which would give me both…and would be a phone replacement too. I haven’t regretted it one bit. The iPhone is one of the most useful devices I have ever owned…period. Free apps are installed in abundance and even though I don’t think of myself as a gamer I even have a couple of games to while away the hours when sitting on the runway…the long flight to Salt Lake City from North Carolina gave me a chance to get wet with this one today…watch it on YouTube here. I am the father of two 6.5 year old boys and if I let them see this their addiction to video games will begin (we don’t let them play video games at all yet).

We have been discussing putting together an application to allow browsing ChemSpider from the phone. Presently I use Safari on the iPhone to access ChemSpider as a website but with our web services putting together something a little skinnier is of course possible. Fortunately we collaborate with some very creative people and I was pinged this week by James Jack, a Symyx Consultant. I’ve been working with James to support his integration from Symyx Draw to ChemSpider and to the InChI Resolver. Actually, support is an overstatement…James is highly productive with minimum assistance and I think that gives credence to the quality of our web services too.

The screenshot below is a proof of concept only, at present running on the Visual Studio emulator, and runs on Windows Mobile 4 through 6. iPhone is next. The app is called ChemMobi and allows viewing of the structure, “suppliers” and properties. …ChemSpider will soon be in the hands of chemists…we hope it’s in their hearts too!

mobilephone

My colleague Will originally developed the ChemRefer service. When ChemSpider started up Will brought the ChemRefer technology and joined us to help expand the capabilities of our services. We integrated ChemRefer and released the text searching capabilities. Will indexed more and more journals and grew the index by 100s of thousands of articles. Unfortunately the downside was that the speed of the search decreased dramatically. Also, we kept hearing the comparison with the Google service and that their advantage was in their citations. So, Will has taken a few months off from indexing and has focused his efforts on developing his technologies to dramatically improve the speed of searching as well as implementing a system for recognizing citations. The system has been made available online for beta-testing just in time for the ACS meeting here in Salt Lake City BUT it is not yet integrated into ChemSpider.

I have performed some basic tests focused on searching chemical names initially. The literature search on ChemSpider has a lot more journals indexed but in order to perform the comparison I searched ONLY the RSC and Journal of Biological Chemistry articles since that is all we have indexed so far on the new system. The search results were as follows. The numbers compare number of hits for the old versus new literature search. The new search has indexed the latest RSC and JBC articles also so in theory should provide more hits.

Searching on Taxol: 626 hits found in 22 seconds (OLD) vs 717 hits in 1 second (NEW)

Searching on phenolphthalein: 47 hits found in 5 seconds (OLD) vs 1514 hits in 1 second (NEW)

Searching on benzene: 846 hits found in 75 seconds (OLD) vs 15260 hits in 4 seconds (NEW)

Clearly the searches are MUCH faster with the new system and it is also returning many more results. These are very early results and we will explain more about the system, the results and our future development shortly…

Try out the new system here for now and send us feedback at info@chemspider.com. Thanks

In what seems like an eon since I first blogged about the need for an InChI Resolver, ChemSpider has continued its efforts to provide valuable resources for chemists while benefiting from the advantages of InChI and working through many associated challenges. I will give a presentation tomorrow at the ACS Meeting here in Salt Lake City (and a gorgeous place it is!) in a session dedicated specifically to the InChI identifier and its increasing penetration into the world of Cheminformatics, publishing and internet Chemistry. The talk will be posted to SlideShare here as usual.

 

Following the declaration of the need for an InChI Resolver I discussed the project with a number of groups (five in total) and wrote up project descriptions and hypothetical timelines to deliver a resolver. We finally announced a joint project with the Royal Society of Chemistry on December 1st 2008 and started work on producing a beta release version of the resolver by ACS Spring 2009..that would be TODAY. The alpha release went live about 4 weeks ago and logins were provided to a number of interested parties. From all of the people who tested the system we received a couple of bug reports and small requests for enhancement and all of those changes have been implemented just in time to release the Resolver for general public consumption here at the ACS.

 

We already have a list of things we want to deliver to enhance the system but will be waiting for feedback from the community regarding the value and workflows associated with this system as it functions presently in Beta release. An overview about the system is available here in Powerpoint and shown below. Go try it out at inchis.chemspider.com. It is in BETA release so send us any feedback please to info@chemspider.com. Thanks! 

There are some interesting articles showing up on ChemSpider from across the blogosphere. We have just added to our list of high priorities to generate an RSS feed of structures, short descriptions and ChemSpider IDs so that anyone can access them. When we add new descriptions we will add snippets to the RSS feed.

New Articles include:

Teen Chemist and Splenda

A Discussion about the Synthesis of Spirangien A from the TotallySynthetic Blog by Paul Docherty

A Discussion about the Synthesis of Omaezakianol from the TotallySynthetic Blog by Paul Docherty

We’ve been working with Jean-Claude Bradley and his Open Notebook Solubility Challenge group to assist where we can. This has included enhancing some of our services (though there is more work to be done…), populating data into ChemSpider and, now, linking us up to the Data Tables built by Andy Lang (of The Spectral Game fame…we’re quite a team).

The Open Notebook Solubility Challenge is described here. The present list of compounds for which we have created the integration described below is here. When you open that link you’ll see the first bunch…notice the little icons showing patent links, Wikipedia links and the presence of spectra on those records.

What we have done now is deposit the links into the Data Source tables for these compounds, providing the direct link to the ONS tables. They can be viewed WITHOUT leaving the site simply by hovering over the link…OR you can click on the link to view the data directly. An example of the link view is shown below. To find these tables simply look up the Open Notebook Solubility Challenge data source in the table.

 

ons2

InChIs are a powerful way to communicate chemical structures. They are going to enable internet chemistry and when we roll out the InChI Resolver shortly then the community will have access to a resource to resolve InChIKeys and ultimately navigate chemistry on the web. We commonly receive chemical structures in the form of InChIs and in order to deposit the structures we have to convert the InChIs back to chemical structures, commonly into SDF format for batch deposition. For simple organics this is not a difficult process…the tools we have at our disposal can deal with the layout of simple organics. However, for some of the chemical structures we receive optimizing 2D layout is very challenging. Many of the issues come with fullerenes (See examples below) but not only. Carbohydrates, complex cycles etc are big challenges.

clean

In building the InChI resolver we hope to provide attractive visual depictions of the associated structures. Without AuxInfo data carrying the coordinates,  or without the deposition of SDF files containing the layout coordinates we have a major challenge ahead of us. Auxinfo data are shown below for erythromycin. These data are rarely generated when people generate InChIKeys and the issue of structure layout will dominate the interpretation of complex structures.

auxinfo

Since beauty is in the eye of the beholder my judgement is that automatic layout algorithms should only assist in the appropriate layout and eyeballs will need to make the final decision. That is why it is better to deposit SDF files or InChIs with AuxInfo carrying the coordinates than it is to deposit InChIs only and leave the structure layout to an algorithm. It will fail.

I am interested in seeing what people can do with their structure cleaning algorithms on InChIs like this:

InChI=1/C66H103N17O16S/c1-9-35(6)52(69)66-72-32-48(100-66)63(97)80-43(26-34(4)5)59(93)75-42(22-23-50(85)86)58(92)83-53(36(7)10-2)64(98)76-40-20-15-16-25-71-55(89)46(29-49(68)84)78-62(96)47(30-51(87)88)79-61(95)45(28-39-31-70-33-73-39)77-60(94)44(27-38-18-13-12-14-19-38)81-65(99)54(37(8)11-3)82-57(91)41(21-17-24-67)74-56(40)90/h12-14,18-19,31,33-37,40-48,52-54H,9-11,15-17,20-30,32,67,69H2,1-8H3,(H2,68,84)(H,70,73)(H,71,89)(H,74,90)(H,75,93)(H,76,98)(H,77,94)(H,78,96)(H,79,95)(H,80,97)(H,81,99)(H,82,91)(H,83,92)(H,85,86)(H,87,88)/t35u,36u,37u,40-,41+,42+,43-,44+,45-,46-,47+,48u,52-,53-,54-/m0/s1

The images below show the iterative application of DIFFERENT structure layout algorithms. One caution…your layout algorithm should produce the SAME InChI at the end and NOT flip stereocenters. Interesting challenge. Who says cheminformatics isn’t challenging? And who thought building an InChI Resolver would be easy?
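If you want to check your own layout results against that caution, a rough comparison of the InChI before and after cleaning is easy to script. The sketch below is an assumption-laden illustration, not part of the Resolver: the function names are made up, it treats the InChI purely as text, and it keeps only the first layer it sees for each prefix letter, so it will not handle every multi-layer InChI.

```python
STEREO_LAYERS = ("b", "t", "m", "s")  # double-bond and tetrahedral stereo layers


def inchi_layers(inchi: str) -> dict:
    """Split an InChI into layers keyed by prefix letter.

    The version and formula segments carry no prefix letter, so they are
    stored under the made-up keys "version" and "formula".
    """
    body = inchi[len("InChI="):] if inchi.startswith("InChI=") else inchi
    parts = body.split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:
            layers.setdefault(part[0], part[1:])  # keep first occurrence only
    return layers


def changed_layers(before: str, after: str) -> list:
    """Return the sorted layer keys whose content differs between two InChIs."""
    a, b = inchi_layers(before), inchi_layers(after)
    return sorted(key for key in set(a) | set(b) if a.get(key) != b.get(key))


def stereo_flipped(before: str, after: str) -> bool:
    """True if any stereo layer differs, e.g. after a layout algorithm ran."""
    return any(key in STEREO_LAYERS for key in changed_layers(before, after))
```

An empty list from `changed_layers` means the layout step left the InChI untouched; anything in the `t`, `m` or `s` layers means a stereocenter was altered along the way.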

layout1layout2layout3layout4

There has been encouragement that we look at Freebase as an additional online resource to integrate with. In terms of chemical entities some of the Wikipedia structure collection has made its way onto Freebase and has been enhanced to include InChIs and SMILES. It’s not clear to me whether the InChIs on Freebase are all obtained FROM Wikipedia or were layered onto Freebase later. So, I approached the Freebase group and asked if they could provide me a dump of the InChI strings and the SMILES strings together with the associated Freebase IDs and the chemical names. In this way we would be able to generate SDF files for depositions and end up with the structures (converted from InChIs and SMILES) as well as the associated chemical names and Freebase IDs. Simple idea right?

So, we converted InChIs and SMILES and generated the depositions. Freebase links now show up in the Data Sources section and, if you put your cursor over the GUID, you see an image of the page and can click through to the record on Freebase. See the image above. The Freebase GUID for Benzene is here: #9202a8c04000641f800000000000ac66

All seems well. I have a question though…I look at a structure like Dapagliflozin on Wikipedia here and see full stereochemistry explicitly defined in the name and in the image. However, on Freebase I note that the stereochemistry is NOT explicitly defined in the InChI. The InChI is:  1/C21H25ClO6/c1-2-27-15-6-3-12(4-7-15)9-14-10-13(5-8-16(14)22)21-20(26)19(25)18(24)17(11-23)28-21/h3-8,10,17-21,23-26H,2,9,11H2,1H3/t17?,18?,19?,20?,21-/m0/s1

So, when we take the InChIs and the chemical names, convert the InChIs and deposit the chemical structures we end up with a “destruction” of the curation work we have done on ChemSpider. We end up with TWO structures for Dapagliflozin, not one (See below)
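A quick pre-deposition screen for this kind of problem is to count how many stereocenters in the InChI are marked as undefined. The sketch below is illustrative only: the function name is made up, it reads just the tetrahedral `/t` layer, and it assumes the usual `+`/`-` marks for defined centers and `?`/`u` for undefined or unknown ones.

```python
import re


def stereo_summary(inchi: str) -> dict:
    """Count defined and undefined centers in the /t layer of an InChI string."""
    match = re.search(r"/t([^/]*)", inchi)
    counts = {"defined": 0, "undefined": 0}
    if not match:
        return counts  # no tetrahedral stereo layer at all
    # Components are separated by ";" and centers within a component by ",".
    for center in match.group(1).replace(";", ",").split(","):
        if not center:
            continue
        mark = center[-1]
        if mark in "+-":
            counts["defined"] += 1
        elif mark in "?u":
            counts["undefined"] += 1
    return counts
```

Run on the Freebase InChI for Dapagliflozin quoted above, this reports one defined and four undefined centers, which is exactly the kind of record that should be held back from name association until someone has looked at it.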

freebase2

And now we need to start the curation efforts AGAIN to clean out misassociations of names and structures. So, what we are going to do is delete the deposition of Freebase structures and redeposit without the chemical names. In this case the outlinks to Freebase will be in place but the structures will not be found by a name search UNLESS the Freebase GUID is associated with an already curated name-structure pair that is coincident with the Freebase name.

I can say that the Freebase team were a pleasure to work with and, in theory, once the Wikipedia curation project is finished the SMILES and InChIs on Freebase will be correct and such linkages back to Freebase will be easier, and correct. In the meantime I am interested in where the Freebase SMILES and InChIs are coming from (I think a lot of them are from Wikipedia but am not sure) and we are going to make certain on our side that we remove the chemical names so as to not decrease the quality of our curation efforts.

I’ve posted previously about embedding structure images and spectra into blogs and webpages. One of the side effects of this is that for structure images specifically the ChemSpider record is linked back to the webpage that the structure is embedded into. Structures are embedded in various places now into wikis and blogs. An example of 11 embedded structures is shown here.

When Arvin Moser wrote in his blog about Letrozole and embedded the structure image into his blog post a link BACK to his blog post was created in the Data Source table. See the image below.

letrozole

With this capability, as more people embed structures from ChemSpider into their online pages/blogs more of the internet will become structure searchable and ultimately linked. It does not require adding InChIs to webpages (though that is encouraged for indexing by search engines).

(Caveat: The system is not yet optimal and we are working on filtering out comments on blogs that presently get added as additional links. All “doubles” will be filtered out later)

I’ve previously posted about the work going on regarding the NMR Game…now morphed to the spectral game and described in detail by JC Bradley. We’ve been working hard to increase the number of spectra available as part of the game (now in the 100s of spectra!) and Andy has been working hard to improve the flow of data. The original structure images have been replaced with ChemSpider structure images and we have delivered a web service to allow Andy to continue to update the spectral collection as more data are added to the database.

When users see issues with the spectra they get to leave comments regarding their observations. This can be very valuable for us to curate the spectral data. This will allow us to perform game-based crowdsourcing of the spectral data and the feedback is already of value.

We have about another 30 spectra to add to the present collection of spectral Open Data and then we’ll take a break and I’ll be approaching the spectrometer vendors and a few other friends to see whether they have any data to contribute to the game. We are already considering adding the ability to add a “Company Logo” to be associated with a spectrum so that the vendors/contributors get fair recognition for their contribution to the game. If you are interested in providing data we will upload it for you. Contact us at infoATchemspiderDOTcom.

JC Bradley has now uploaded a short tutorial to YouTube regarding how to play the game and I have embedded it below. JC’s also announced a prize for the best player. Go test your skills…

Jean-Claude Bradley has recently posted about an NMR Game running on Second Life. Read his blog for details but I excerpt some of the comments here:

“Andy and I brainstormed some new chemistry games that we could introduce to Second Life to leverage our recent tools. One of the applications is the NMR game. By combining the orac molecule rezzer, the SL spectral viewing tool and ChemSpider Open Data spectra I think we have a pretty good game.

The idea is simple: click on the molecule that is represented by the spectrum. If it is correct you get 2 points and get another spectrum. You lose a point by clicking on an incorrect molecule. After going through all the spectra your score gets posted on the web to a top10 list. For equal scores the best time takes it.”
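The scoring rules in that excerpt are simple enough to sketch in code. This is my own reading of the description and not the game's actual implementation; the class and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import List

CORRECT_POINTS = 2  # points gained for clicking the right molecule
WRONG_PENALTY = 1   # points lost for clicking a wrong molecule


@dataclass
class PlayerResult:
    name: str
    score: int
    seconds: float  # total time taken to go through all the spectra


def score_clicks(clicks: List[bool]) -> int:
    """Total score for a sequence of clicks (True = correct molecule)."""
    return sum(CORRECT_POINTS if ok else -WRONG_PENALTY for ok in clicks)


def top_ten(results: List[PlayerResult]) -> List[PlayerResult]:
    """Highest score first; for equal scores the best (shortest) time wins."""
    return sorted(results, key=lambda r: (-r.score, r.seconds))[:10]
```

Sorting on the pair (negative score, time) is what gives the tie-break on time for free.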

So, here at ChemSpider we are delivering spectra as Open Data to help with the game. And we’re happy to do so. It’s always been our intention to have ChemSpider provide value like this. ANY registered user can upload spectra to the ChemSpider website. The details are outlined here (I just noticed the interface has changed since I wrote that but you should still be able to follow the process). We need the spectra to be in JCAMP format and if you want them to be available for the game, and for people to download, they MUST be declared as Open Data.

Right now we have 100s of spectra. You can find them here. But we need more. Much more! We’d like you to contribute them. If you don’t want to upload them yourself then contact us directly and we will process and upload them for you. We need the data and the name/structure of the associated molecule.

And how will the game be used on these spectra? The game will be used to “curate and validate” the spectra. As the game is being played a score of how many people say it is correct will be kept. And of course what is wrong. Based on these scores our curators will be directed to “problematic spectra” for their attention. This is true crowdsourcing and a great way to do spectral validation.
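To make the idea concrete, here is a minimal sketch of how game results could be turned into a curation worklist. Everything specific in it is an assumption made for illustration: the function name, the shape of the input (pairs of spectrum ID and whether the player got it right) and both thresholds.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def flag_problem_spectra(
    votes: Iterable[Tuple[str, bool]],
    min_votes: int = 5,
    min_fraction_correct: float = 0.6,
) -> List[str]:
    """Return IDs of spectra that players get right unusually rarely.

    A spectrum is flagged only once it has at least min_votes attempts and
    the fraction of correct attempts falls below min_fraction_correct.
    """
    tally = defaultdict(lambda: [0, 0])  # spectrum_id -> [correct, total]
    for spectrum_id, is_correct in votes:
        tally[spectrum_id][1] += 1
        if is_correct:
            tally[spectrum_id][0] += 1
    flagged = [
        spectrum_id
        for spectrum_id, (correct, total) in tally.items()
        if total >= min_votes and correct / total < min_fraction_correct
    ]
    return sorted(flagged)
```

A low score does not prove the spectrum is wrong (it may just be a hard one), which is why the output is a list for curators to look at and not an automatic removal.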

We would like the spectral collection to grow and welcome contributions from anyone. They do NOT have to be just NMR. They can be IR, MS, Raman etc too. Ultimately a Spectral game will be unveiled. Please consider ChemSpider as a repository for your data as it will benefit the community of chemists and, in particular, the process of teaching students and allowing them to “game their way” through the process. Watch where this goes…it’s VERY interesting to consider how it can improve…there is an NMR game website in development so you won’t have to go just to Second Life.

Following on from my recent post about “Why are structures like YouTube Videos?” I am now asking the same question about spectra.

The answer is simple. When people have deposited data as OPEN DATA on ChemSpider we are now providing the ability to embed the spectral data and display at other sites. This is different in that we are not just showing images but real live spectra in the JSpecView Java Applet so Java must be installed. Thanks to Cameron Neylon for asking the question about whether we could provide the service. Glad to help…

If all is well you should see an IR spectrum associated with the ChemSpider record here. In order to EMBED spectra simply log in to ChemSpider, find an Open Data spectrum of interest (you could browse http://www.chemspider.com/spectra.aspx) and then click on EMBED (left hand corner below the spectral image). Do a left click to see additional features of JSpecView. We DO have some minor work to do with spectral plot reversal and improving the zoom display but we’re getting there. Enjoy.

Many of us using ChemSpider are looking for compounds of interest to us. In some cases those chemical entities are not of fleeting interest but something that we are working on in our research, have a hobbyist interest in or some other driving force encouraging us to track activity in.

With this in mind we have now allowed any user to “monitor an article”. What this means is that when new information is associated with an article (new outlinks, new forms of data, new publications, associated spectra etc) then an email will be sent to you making you aware of the new information. In order to monitor an article simply log in as a registered user and click on the “Monitor This Article” button. If you want to discontinue in the future simply return to the article and click on “Cancel Article Monitor”. We’d like a few people to help test this process for us and provide us with feedback. Keep your eye on those molecules of interest to you with Article Monitoring.

I think the press release here, and copied below, speaks for itself…When I posted the blog about the need for an InChIKey Resolver it resulted in a great discussion and series of comments. Since that time I’ve had many discussions with interested parties about the need. The RSC and ChemSpider share a mutual view regarding the need for the InChI resolver and we are honored to be entrusted to develop a resolver for the community. Will it be “the” resolver…only time will tell. There are various ways to deliver a system to do this so we’ll start here and garner feedback. There are many ways to “hunt a Welshman” (I can say that since I’m Welsh!) so there may be other efforts to deliver a resolver coming too.

“RSC and ChemSpider develop InChI Resolver

01 December 2008

An InChI Resolver, a unique free service for scientists to share chemical structures and data, will be developed by a collaboration between ChemZoo Inc., host of ChemSpider, and the Royal Society of Chemistry. 

Using the InChI – an IUPAC standard identifier for compounds – scientists can share and contribute their own molecular data and search millions of others from many web sources. The RSC/ChemSpider InChI Resolver will give researchers the tools to create standard InChI data for their own compounds, create and use search engine-friendly InChIKeys to search for compounds, and deposit their data for others to use in the future. 

The future of publishing

‘The wider adoption and unambiguous use of the InChI standard will be an important development in the way chemistry is published in the future, and the further development of the semantic web,’ comments Robert Parker, Managing Director of RSC Publishing. 

The InChI Resolver will be based on ChemSpider’s existing database of over 21 million chemical compounds and will provide the first stable environment to promote the use and sharing of compound data. ‘ChemSpider hosts the largest and most diverse online database of chemical structures sourced from over 150 different data sources’ adds Antony Williams of ChemSpider, ‘We have embraced the InChI identifier as a key component of our platform and the basis of our structure searches and integration path to a number of other resources. We have delivered a number of InChI-based web services and, with the introduction of the InChI Resolver, we hope to continue to expand the utility and value of both InChI and the ChemSpider service.’ 

Society support

‘As a learned society publisher it is important that RSC provide support for the standard and contribute to the development of the resolver, which promises to be a valuable service for the chemical science community.’ continues Parker, ‘our collaboration with ChemSpider on this project will enable this to be delivered quickly and sustainably.’ 

The imminent adoption of the InChI generation protocol will be a welcome and necessary step to the wider adoption of the InChI standard.”

Frequent users of ChemSpider might have noticed a change in layout of the record view pages of late. As we layer more information onto a record view page (EPI Suite predictions, SimBioSys LASSO scores, spectral data, MORE predictions to come) the record view pages become increasingly heavy. As a result we have had to navigate the challenge of balancing increasingly heavy pages against user experience. Since we have added the ability to perform structure searching on Pubmed recently and are now in the process of adding a new update for Patent searching we have chosen to hide the Data Source outlinks until you choose to see them.

So, if you are looking for original data sources and a list of potential commercial vendors please click on the button indicated below to fold out the list. Commercial vendors are indicated as discussed previously here.

I’ve been in a number of conversations of late about how Mass Spectrometrists might use ChemSpider and get value from our efforts. I recently gave a short Powerpoint presentation to a group about what ChemSpider is and the types of queries that ChemSpider users can conduct today. I’ve posted the presentation to Slideshare as usual so people can access it there if they are interested.

I’ve started wrapping my head around how we could provide more value to some of our users in regards to MS, HPLC and NMR. One of the things we could do is to use our known text mining skills to look for NMR or MS (LCMS) articles based on the use of the terms in the title or abstract and then using those terms as tags against chemical structures in the abstract/title. So, from titles such as “High-Performance Liquid Chromatographic Method for Determination of Phenytoin in Rabbits Receiving Sildenafil” from our collaborator Libertas Academica we would extract HPLC and Phenytoin and connect the article to the structure as we have done here. In this way the article would be searchable by structure and associated analytical technique and we could even look at extracting the detailed experimental approach from Open Access articles. More work but feasible. Any comments???
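To show what I mean, here is a toy version of that tagging step. It is a sketch under heavy assumptions and not our text-mining pipeline: the technique patterns and the tiny name dictionary are invented for the example, and a real system would use a proper chemical name recognizer.

```python
import re

# Hypothetical patterns: spelled-out forms match in any case,
# abbreviations only in upper case.
TECHNIQUE_PATTERNS = {
    "HPLC": r"(?i:high[- ]performance liquid chromatograph\w*)|\bHPLC\b",
    "NMR": r"(?i:nuclear magnetic resonance)|\bNMR\b",
    "MS": r"(?i:mass spectromet\w*)|\bLC-?MS\b|\bMS\b",
}

# Hypothetical stand-in for a dictionary of chemical names with known structures.
KNOWN_NAMES = {"phenytoin", "sildenafil", "taxol"}


def tag_title(title: str) -> dict:
    """Return the analytical techniques and known chemical names found in a title."""
    techniques = sorted(
        tag for tag, pattern in TECHNIQUE_PATTERNS.items() if re.search(pattern, title)
    )
    words = re.findall(r"[A-Za-z][A-Za-z0-9-]*", title)
    compounds = sorted({word.lower() for word in words if word.lower() in KNOWN_NAMES})
    return {"techniques": techniques, "compounds": compounds}
```

For the Libertas Academica title quoted above this picks up HPLC as the technique and, with this particular dictionary, both phenytoin and sildenafil as compounds; each technique tag would then be attached to the structure record for each compound.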

Readers of this blog will know we have a focus on enabling chemists to source information via both Open AND Closed access publishers with the aim, ultimately, of providing a way to perform structure and substructure searching of these articles. This work is well underway.

If you visit our Literature Search Page you will see that we have recently added the ACS AuthorChoice Free Access articles to the index and we will continue to index on an ongoing basis. There are very few ACS AuthorChoice articles to search but the usual validation search on “Taxol” does turn up one hit.

Herding Nanotransporters: Localized Activation via Release and Sequestration of Control Molecules (Nano Lett. 2007 Volume 8 Issue 1 Page 221) – American Chemical Society

R. Tucker, P. Katira, H. Hess

… 1 mM MgCl, 1 mM EGTA, pH 6 .9) containing 10 micromolar taxol for stabilization and kept at room temperature (20 C). Caged -ATP and “

Users of ChemSpider might have noticed some performance issues in the past 2-3 weeks with our web services, service availability and speed of searches. I put my hand in the air and say “Yup, acknowledged”. Hopefully they have not been too disruptive BUT it is for the overall benefit of the service ultimately. We have been streaming in 8 MILLION links to Pubmed in order to make Pubmed structure and substructure searchable. We are NOT rolling this out with full fanfare yet but I do want to explain the performance issues you might be experiencing. We work on Microsoft technology and while we are advocates for the platforms of .NET, IIS and SQL Server we definitely are putting them under pressure as we keep expanding the database and adding more value. We have thoughts about how to resolve this but want to finish populating the tables first.

The upside….the majority of links are already in place. For an example visit a structure and look for PubMed as a data source and click on one of the links. For example, for Valium here you will see in the datasource table a series of Pubmed IDs next to the PubMed datasource…

  16971504, 17673, 874970, 406430, 17881, 327854, 879884, 577681, 560225, 195649, …

These will link you out to PubMed directly. Try it out…

Now, do we have implementation issues? YES. The lists of external IDs can be long so right now we show only the first 10. We will deal with display of others shortly. We need to provide a way to curate out “junk” entries. For example, “methyl” is on ChemSpider as a fragment and has links to PubMed IDs…you’ll see why if you click them…it was done with text mining. These issues will be resolved but for now we announce that PubMed is structure and substructure searchable via ChemSpider. We will explain how we did it shortly but for now we will acknowledge the massive contribution of our colleagues at SureChem. More to come…

I’ve had a number of questions about the presentation I gave at ACS Philly last week about document markup. The phrase I keep hearing is “very disruptive” followed by the question “will authors do more work and what’s in it for them?”.

The presentation here outlines the general concept that I talked about…

The basic concept I presented is as follows, with a focus on Chemistry Articles.

A lot of effort is being expended in “text-mining” publications, post-publication, to index these articles and make them searchable not only by text but by the specific language of chemistry, chemical structures. We are specifically asking the question “why extract chemical structures from articles using chemical name conversion approaches and chemical image conversion tools when the structures in the article were ORIGINALLY machine readable?”

We are considering a system whereby authors are asked to contribute to the availability of a free online service for performing structure and substructure-based searches of chemistry articles. While the submission of journal articles is already a lot of work (I know from experience of authoring/co-authoring about 10 a year) we hope that authors will support a service whereby they can upload their own articles to a “validation and mark-up service”. The upload capabilities will support upload of the primary document, chemical structures in standard formats and supplementary information of various types (to be defined).

This system will perform the following services:

1) semi-automated markup of a document – title, author(s), abstract and additional dictionary-based terms plus the ability to use the NLM-DTD markup
2) identification of chemical names and conversion to structures in an automated fashion
3) conversion of structure IMAGES to connection tables using optical structure recognition software (either commercial or open source)
4) ask authors to confirm whether the converted structures are appropriate
5) provide a structure validation service for submitted molecules checking for “accurate representation”
6) Deposit all structures associated with an article onto ChemSpider but under embargo. Associate the article Title, authors and “abstract snippet” with all structures.
7) Issue a set of ChemSpider IDs for the author to submit to the publisher with the article
8) When a publication has passed through review the author can release the structures from embargo using a DOI or an article URL (more common for Open Access articles)
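
The embargo part of this workflow (steps 6 to 8) can be sketched in a few lines of Python. This is only an illustration of the idea, not the ChemSpider implementation: the class names, the sequential ID counter and the placeholder DOI are all hypothetical, and a real system would sit behind the validation steps listed above.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Optional


@dataclass
class Deposition:
    """Structures from one article, held under embargo until release."""
    title: str
    authors: List[str]
    abstract_snippet: str
    structure_ids: List[int]
    embargoed: bool = True
    article_link: Optional[str] = None  # DOI or article URL, set on release

    def release(self, article_link: str) -> None:
        # Step 8: the author lifts the embargo by supplying a DOI or URL.
        if not article_link:
            raise ValueError("a DOI or article URL is required to lift the embargo")
        self.article_link = article_link
        self.embargoed = False


class Registry:
    """Hypothetical in-memory stand-in for the structure database."""

    def __init__(self, first_id: int = 1) -> None:
        self._ids = count(first_id)
        self.depositions: List[Deposition] = []

    def deposit(self, title, authors, abstract_snippet, structures) -> Deposition:
        # Steps 6 and 7: store the structures under embargo and issue one ID each.
        ids = [next(self._ids) for _ in structures]
        dep = Deposition(title, authors, abstract_snippet, ids)
        self.depositions.append(dep)
        return dep

    def searchable_ids(self) -> List[int]:
        # Only structures released from embargo are visible to public searches.
        return [i for d in self.depositions if not d.embargoed for i in d.structure_ids]


registry = Registry()
dep = registry.deposit("Example title", ["A. Author"], "First lines...", ["C", "CC"])
hidden = registry.searchable_ids()      # nothing is public while under embargo
dep.release("10.1000/placeholder-doi")  # made-up DOI, for illustration only
public = registry.searchable_ids()
```

The point of the design is that the IDs exist, and can be handed to the publisher, before anything is publicly searchable.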

The result of this project will be a way for publishers to link their articles directly to a free access chemistry database and use a series of web services to enable other capabilities (to be defined). It will also allow articles in Open Access and non-Open Access publications to be searchable by the “language of chemistry”.

This is only a slice of the overall project but I think it may be of interest relative to the comments you have made below.

Parts of this were shown last week at Drexel University and a particular snippet is available online here:

We are also going to provide a Microsoft Word add-on which will allow users to prepare articles for publishing using similar technologies.

We think this IS disruptive..what say you?

A link to the presentation I gave at ACS-Philly yesterday in Rajarshi Guha’s session is provided below. A lot changes between writing an abstract and writing a talk so I had the chance to highlight an increasing number of papers ALREADY using ChemSpider as one of their platforms of choice to source information from.

Can a Free Access Structure-Centric Community for Chemists Benefit Drug Discovery?

ChemSpider is an online database of over 20 million chemical structures assembled from well over a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) In-silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples. QSAR analyses utilizing the ChemModLab environment for assessing quantitative structure-activity relationships will and screening using a molecular surface descriptor model.

Link to presentation

I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) PubMed is structure searchable from ChemSpider… we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction from documents SUBMITTED to our site: Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration to ChemSpider

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, and deposit their structures to ChemSpider under embargo with the article title, author list and “fractional abstract”. When a publication goes live the author can log in and associate a DOI or a URL for the publication (for non-DOI based Open Access publishers); the structures and article details are then lifted from embargo and are immediately available for searching by the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities

6) If we received community support for this it could be game-changing.

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation

The Environmental Protection Agency has provided permission for ChemSpider to utilize their EPI Suite™ software to predict a number of physical properties for the chemicals on the ChemSpider database. The properties include:
KOWWIN™: Estimates the log octanol-water partition coefficient, log KOW, of chemicals using an atom/fragment contribution method.
AOPWIN™: Estimates the gas-phase reaction rate for the reaction between the most prevalent atmospheric oxidant, hydroxyl radicals, and a chemical. Gas-phase ozone radical reaction rates are also estimated for olefins and acetylenes. In addition, AOPWIN™ informs the user if nitrate radical reaction will be important. Atmospheric half-lives for each chemical are automatically calculated using assumed average hydroxyl radical and ozone concentrations.
HENRYWIN™: Calculates the Henry’s Law constant (air/water partition coefficient) using both the group contribution and the bond contribution methods.
MPBPWIN™: Melting point, boiling point, and vapor pressure of organic chemicals are estimated using a combination of techniques. Included is the subcooled liquid vapor pressure, which is the vapor pressure a solid would have if it were liquid at room temperature. It is important in fate modeling.
BIOWIN™: Estimates aerobic and anaerobic biodegradability of organic chemicals using 7 different models; two of these are the original Biodegradation Probability Program (BPP™).  The seventh and newest model estimates anaerobic biodegradation potential.
BioHCWIN: Estimates biodegradation half-life for compounds containing only carbon and hydrogen (i.e. hydrocarbons).
PCKOCWIN™: The ability of a chemical to sorb to soil and sediment, its soil adsorption coefficient (Koc), is estimated by this program. EPI’s Koc estimations are based on the Sabljic molecular connectivity method with improved correction factors.
WSKOWWIN™: Estimates an octanol-water partition coefficient using the algorithms in the KOWWIN™ program and estimates a chemical’s water solubility from this value. This method uses correction factors to modify the water solubility estimate based on regression against log Kow.
WATERNT™: Estimates water solubility directly using a “fragment constant” method similar to that used in the KOWWIN™ model.
HYDROWIN™: Acid- and base-catalyzed hydrolysis constants for specific organic classes are estimated by HYDROWIN™. A chemical’s hydrolytic half-life under typical environmental conditions is also determined. Neutral hydrolysis rates are currently not estimated.
BCFWIN™: This program calculates the BioConcentration Factor and its logarithm from the log Kow. The methodology is analogous to that for WSKOWWIN™. Both are based on log Kow and correction factors.
KOAWIN: KOA is the octanol/air partition coefficient and has multiple uses in chemical assessment. The model estimates KOA using the ratio of the octanol/water partition coefficient (KOW) from KOWWIN™, and the dimensionless Henry’s Law constant (KAW) from HENRYWIN™.
AEROWIN™: Estimates the fraction of airborne substance sorbed to airborne particulates, i.e. the parameter phi (φ), using three different methods. AEROWIN™ results are also displayed with AOPWIN™ output as an aid in interpretation of the latter.
WVOLWIN™: Estimates the rate of volatilization of a chemical from rivers and lakes; calculates the half-life for these two processes from their rates. The model makes certain default assumptions-water body depth; wind velocity; etc.
STPWIN™: Using several outputs from EPI Suite™, this program predicts the removal of a chemical in a Sewage Treatment Plant; values are given for the total removal and three contributing processes (biodegradation, sorption to sludge, and stripping to air) for a standard system and set of operating conditions.
LEV3EPI™: This level III fugacity model predicts partitioning of chemicals between air, soil, sediment, and water under steady state conditions for a default model “environment”; various defaults can be changed by the user.

The values for individual structures are available in the Record View under the EPI Summary.

For example, the information for Xanax is below.

 Log Octanol-Water Partition Coef (SRC):
    Log Kow (KOWWIN v1.67 estimate) =  3.87
    Log Kow (Exper. database match) =  2.12
       Exper. Ref:  BioByte (1995)

 Boiling Pt, Melting Pt, Vapor Pressure Estimations (MPBPWIN v1.42):
    Boiling Pt (deg C):  441.81  (Adapted Stein & Brown method)
    Melting Pt (deg C):  185.42  (Mean or Weighted MP)
    VP(mm Hg,25 deg C):  1.65E-008  (Modified Grain method)
    Subcooled liquid VP: 7.84E-007 mm Hg (25 deg C, Mod-Grain method)

 Water Solubility Estimate from Log Kow (WSKOW v1.41):
    Water Solubility at 25 deg C (mg/L):  13.1
       log Kow used: 2.12 (expkow database)
       no-melting pt equation used

 Water Sol Estimate from Fragments:
    Wat Sol (v1.01 est) =  0.15855 mg/L

 ECOSAR Class Program (ECOSAR v0.99h):
    Class(es) found:
       Aliphatic Amines
Henrys Law Constant (25 deg C) [HENRYWIN v3.10]:
   Bond Method :   9.77E-012  atm-m3/mole
   Group Method:   Incomplete
 Henrys LC [VP/WSol estimate using EPI values]:  5.117E-010 atm-m3/mole

 Log Octanol-Air Partition Coefficient (25 deg C) [KOAWIN v1.10]:
  Log Kow used:  2.12  (exp database)
  Log Kaw used:  -9.399  (HenryWin est)
      Log Koa (KOAWIN v1.10 estimate):  11.519
      Log Koa (experimental database):  None

 Probability of Rapid Biodegradation (BIOWIN v4.10):
   Biowin1 (Linear Model)         :   0.6009
   Biowin2 (Non-Linear Model)     :   0.2660
 Expert Survey Biodegradation Results:
   Biowin3 (Ultimate Survey Model):   2.2574  (weeks-months)
   Biowin4 (Primary Survey Model) :   3.1733  (weeks       )
 MITI Biodegradation Probability:
   Biowin5 (MITI Linear Model)    :  -0.1488
   Biowin6 (MITI Non-Linear Model):   0.0042
 Anaerobic Biodegradation Probability:
   Biowin7 (Anaerobic Linear Model): -0.4906
 Ready Biodegradability Prediction:   NO

Hydrocarbon Biodegradation (BioHCwin v1.01):
    Structure incompatible with current estimation method!

 Sorption to aerosols (25 Dec C)[AEROWIN v1.00]:
  Vapor pressure (liquid/subcooled):  0.000105 Pa (7.84E-007 mm Hg)
  Log Koa (Koawin est  ): 11.519
   Kp (particle/gas partition coef. (m3/ug)):
       Mackay model           :  0.0287
       Octanol/air (Koa) model:  0.0811
   Fraction sorbed to airborne particulates (phi):
       Junge-Pankow model     :  0.509
       Mackay model           :  0.697
       Octanol/air (Koa) model:  0.866 

 Atmospheric Oxidation (25 deg C) [AopWin v1.92]:
   Hydroxyl Radicals Reaction:
      OVERALL OH Rate Constant =   7.6246 E-12 cm3/molecule-sec
      Half-Life =     1.403 Days (12-hr day; 1.5E6 OH/cm3)
      Half-Life =    16.834 Hrs
   Ozone Reaction:
      No Ozone Reaction Estimation
   Fraction sorbed to airborne particulates (phi): 0.603 (Junge,Mackay)
    Note: the sorbed fraction may be resistant to atmospheric oxidation

 Soil Adsorption Coefficient (PCKOCWIN v1.66):
      Koc    :  2.151E+006
      Log Koc:  6.333 

 Aqueous Base/Acid-Catalyzed Hydrolysis (25 deg C) [HYDROWIN v1.67]:
    Rate constants can NOT be estimated for this structure!

 Bioaccumulation Estimates from Log Kow (BCFWIN v2.17):
   Log BCF from regression-based method = 0.932 (BCF = 8.559)
       log Kow used: 2.12 (expkow database)

 Volatilization from Water:
    Henry LC:  9.77E-012 atm-m3/mole  (estimated by Bond SAR Method)
    Half-Life from Model River: 1.053E+008  hours   (4.388E+006 days)
    Half-Life from Model Lake : 1.149E+009  hours   (4.786E+007 days)

 Removal In Wastewater Treatment:
    Total removal:               2.37  percent
    Total biodegradation:        0.10  percent
    Total sludge adsorption:     2.27  percent
    Total to Air:                0.00  percent
      (using 10000 hr Bio P,A,S)

 Level III Fugacity Model:
           Mass Amount    Half-Life    Emissions
            (percent)        (hr)       (kg/hr)
   Air       0.000217        33.7         1000
   Water     21              900          1000
   Soil      78.9            1.8e+003     1000
   Sediment  0.094           8.1e+003     0
     Persistence Time: 1.48e+003 hr
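
Several of the numbers in the Xanax output above can be cross-checked by hand, which is a useful sanity test when reading an EPI Summary. The short Python sketch below does this for four of them. It is not EPI Suite code: the function names are made up for illustration, the half-life uses the standard pseudo-first-order relation t½ = ln 2 / (k·[OH]), and the reading of the combined phi value as the average of the Junge-Pankow and Mackay results is an assumption based on the "(Junge,Mackay)" label.

```python
import math

# Conventional mm Hg to pascal factor (101325 Pa per 760 mm Hg).
MMHG_TO_PA = 101325.0 / 760.0


def log_koa(log_kow: float, log_kaw: float) -> float:
    """KOA is the ratio KOW / KAW, so the logs subtract."""
    return log_kow - log_kaw


def oh_half_life_hours(k_oh: float, oh_conc: float = 1.5e6) -> float:
    """Pseudo-first-order half-life for reaction with OH radicals.

    k_oh    : rate constant in cm3/molecule-sec
    oh_conc : assumed OH concentration in molecules/cm3 (1.5E6 in the output)
    """
    return math.log(2) / (k_oh * oh_conc) / 3600.0


# Values copied from the Xanax output above.
koa_check = log_koa(2.12, -9.399)            # compare with Log Koa 11.519
half_life_hr = oh_half_life_hours(7.6246e-12)  # compare with 16.834 Hrs
half_life_days = half_life_hr / 12.0         # "12-hr day"; compare with 1.403 Days
vp_pa = 7.84e-7 * MMHG_TO_PA                 # compare with 0.000105 Pa
phi_mean = (0.509 + 0.697) / 2               # compare with phi 0.603 (assumed average)

print(f"log Koa   = {koa_check:.3f}")
print(f"half-life = {half_life_hr:.2f} hr, {half_life_days:.3f} days")
print(f"VP        = {vp_pa:.3g} Pa")
print(f"phi mean  = {phi_mean:.3f}")
```

Small differences in the last digit are to be expected because the printed rate constant and logarithms are already rounded.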

We started the calculations a number of weeks ago and are updating our progress on the ChemSpider Forum here. We now have values predicted for 3 million compounds.

It is NOT possible at present to search on these properties in the same way that other properties can be searched on the Search Predicted Properties page as shown below.

After all EPI Suite properties are predicted we will selectively make some of these available for searching. The interest so far appears to be in Henry’s Law values, Water Solubility and Melting Point (something that is very difficult to predict with accuracy!). We welcome your comments.

We will be able to extract experimental values for some properties and display them directly. For example, logP shows an “experimental database match” for Xanax.

Log Octanol-Water Partition Coef (SRC):
Log Kow (KOWWIN v1.67 estimate) = 3.87
Log Kow (Exper. database match) = 2.12

Exper. Ref: BioByte (1995)

It is going to take a number of weeks to generate EPI Suite values for 21.5 million molecules but we are moving in that direction. Our sincere thanks to the EPA for allowing us to use their EPI Suite software on ChemSpider for the benefit of the community.

Since originally incorporating Chemrefer text-based searches of Chemistry literature into ChemSpider we have continued to expand the list of supported publishers as described here. We recently added Royal Society of Chemistry articles (in the past week). This weekend we have indexed IUPAC’s Pure and Applied Chemistry journal and added that to our list of supported publishers also. In keeping with my previous reports of contributions to text-based searching I performed a search on both Taxol and paclitaxel. Performing a search only on the IUPAC index provided 37 hits for Taxol and 11 hits for paclitaxel.

We now have 12 sources feeding our Literature Search. We are looking for others to index and add to the list. If you have any suggestions please let us know.

I have reported previously on rich text editing capabilities that we have been testing. The text-editing capabilities have been rolled out this evening to beta-testers (if you wish to be a beta-tester you must register on ChemSpider here and request beta-tester status).

The text editing capabilities show up when you are logged into ChemSpider. They appear under the “Description” link, which is listed with all of the other capabilities that allow information to be associated with a record. At the top of a record view you will see

Click on Description and it will open up a simple Rich Text Editor, similar to the one you would see on Wikipedia, allowing you to copy-paste rich text or edit standard text and insert hyperlinks etc.

We would like people to test drive the ability to add descriptions to record views. Please go ahead and start working with it and provide us feedback.