A lot of people have been helping to improve the quality of ChemSpider content by depositing new data and “cleaning up” errors in the data over the past few months. It’s been a long climb. Our thanks to all of you who have contributed. I’ll be the first one to put my hand up and acknowledge that in some ways I have not made the act of contributing to the curation process very easy, since I’ve been feeding the data out via the blog in chunks, as it has developed. Following a recent “long flight” I am happy to announce that the first version of the Curators Handbook/Bible is now available online here. This document gives some pretty detailed guidance regarding how to curate the ChemSpider database. As always we welcome feedback. If something is not clear let us know and we will expand/enhance as appropriate.

I also want to thank those people who have commented on how impressed they are with the rate at which we are cleaning the data. In general, most curation requests identified on the site are addressed within 24 hours. There are some issues hanging out there that we don’t have solutions for at present, specifically with regard to organometallic data handling, but we are still thinking about a path forward.

Buy me a Coffee

It is finally time to roll out more attractive structure depictions. We have needed them for a while, but they have become an absolute must-have as we roll out the following new capabilities:

1) The ability to make YOUR chemical blog structure searchable (watch this space…). We suggested one path previously…this is BETTER…

2) Structure balloons for use with our document markup tools, both browser-based and Microsoft Word-based

We all judge visual aesthetics quickly. We know a good structure when we see one. This is an announcement that we will be rolling out new structures across the site in the next few days. You will see better-looking structures showing up across the site: during deposition, during service-based predictions, during searches and, well, everywhere. The depictions are not perfect yet, but with a little more tweaking the entire database will be supported by the new structure depiction algorithms. As it is you should see some examples now on the database…one shown below. We welcome your feedback!

That is not a misspelling in the title…I do mean Word Docmination and not world domination. It’s a rather tongue-in-cheek comment based on some recent discussions where a friend working in the domain of cheminformatics joked that continuing to develop ChemSpider at our present rate and linking up document markup capabilities could lead to “world domination”. Hardly. Anyhow, we’re more for collaboration and integration than domination. However, taking the comment in the spirit intended, I did find it funny considering what we are working on at present…“Word docmination for Chemistry”

Over the past couple of weeks you will have noticed that the blog has gone quite quiet. There have been a number of reasons for this…all positive. Collectively our team is involved in a number of projects and these have timelines and deliverables. My own personal time that can be dedicated to blogging is much diminished, for right now. That said, there are some exciting things to report.

We have progressed a long way with our document markup system since the presentation at ACS-Philadelphia. What we showed at the conference was very much proof of concept. Since then we have been improving our workflows for markup, have been validating the identification of chemical names versus “other text”, have been working on easy ways to build “dictionaries” of good and bad names into the interface, have been improving our structure layout and depiction and have started to compare with other document markup approaches.

Our intention with our document markup approach has been to take advantage of much of the work we have done over the past 18 months. Specifically, we have created a large foundation of chemical entities with associated properties (identifiers, experimental and predicted properties, links to publications, services and other related information) and now we can leverage this database as we perform document markup. Not only can the validated structure-name pairs be used to great effect for this work, but when chemical names are converted to structures they can immediately be used for lookup on the ChemSpider database. When chemical structures are identified in documents (either via chemical name conversion or by extraction of the chemical structure from the OLE container in a Word document) they can be deposited into the ChemSpider database together with a link to the original document or appropriate metadata.

As we have moved through our project we have been focusing on the next phase, which should be integration with the most common desktop word processor, Microsoft Word. At present we are taking the markup capabilities we have been developing in Internet Explorer and building integration with Word. These capabilities will initially be called via our ChemSpider web services but in the future might involve some desktop components for markup remote from ChemSpider. Time will tell. Watch this space for more news as we unveil the new capabilities. We are certainly interested in talking to any publishers who might be interested in looking at our markup capabilities as we add them.

There has been a conversation going on over on Wikipedia about supporting ChemSpider IDs in the ChemBox and DrugBox. ChemSpider IDs have been added to ChemBoxes over the past few weeks by a number of contributors and, based on the blessing of members of the Wikipedia community, they will now be displayed in Drugboxes also. The conclusion of the conversation today stated:

“Done. Thanks everyone - that seems clarification that people would find this helpful and, in particular, thanks for addressing ChemSpiderMan own reservation. I’ve added to {{drugbox}}, eg see Verapamil. David Ruben Talk 13:10, 22 September 2008 (UTC)”

A Drugbox, for Xanax, is shown below. Note the number of outlinks to PubChem, Drugbank and now ChemSpider. 

 

What I am most proud of is some of the statements made in the discussion that validate our efforts to create a high-quality curated source of information. For example:

“I’d just like to add my voice to those that find value in linking Wikipedia articles to ChemSpider. I find this database to be reliable and information-rich in comparison to the other databases we link to already. I support adding a link from drugboxes and chemboxes. – Ed (Edgar181) 11:36, 22 September 2008 (UTC)”

“I think the effect of linking to ChemSpider would be to marry a well curated database (ChemSpider) with monographs (WP). To elaborate, the database contain various intrinsic properties (MW, isotopic composition, structure, stereo), experimentally-determined properties (bp/mp/appearance), experimentally-determined spectra (1H/13C NMR, IR, etc., e.g. [1]), apart from predicted data. Monographs: our articles discussing the synth, applications, chemistry, etc. of various compounds, drug-, drug-like, or otherwise. Seems like everything to gain and not much to lose, except for another entry in the drugbox and perhaps concerns of table creep. –Rifleman 82 (talk) 03:46, 22 September 2008 (UTC)”

“I personally support the addition of ChemSpider not because of the predicted properties—which are included in PubChem—, but because, so far, ChemSpider appears to be highly curated (and transparently so). PubChem has some serious if relatively infrequent reliability issues, which are well known to the WP chemistry/pharm community, and MeSH (to which CAS numbers in the Drugbox link) appears to lack information on many compounds. Fvasconcellos (t·c) 01:53, 22 September 2008 (UTC)”

It is validating to be embraced by the Wikipedia community in this way. What we commit to in return is to continue our efforts to expand the services and quality on ChemSpider. And presently we are working on a “little gift” to help Wikipedia. Watch this space.

Frequent users of ChemSpider might have noticed a change in the layout of the record view pages of late. As we layer more information onto a record view page (EPI Suite predictions, SimBioSys LASSO scores, spectral data, MORE predictions to come) the record view pages become increasingly heavy. As a result we have had to balance increasingly heavy pages against the user experience. Since we have recently added the ability to perform structure searching on PubMed and are now in the process of adding a new update for patent searching, we have chosen to hide the Data Source outlinks until you choose to see them.

So, if you are looking for original data sources and a list of potential commercial vendors please click on the button indicated below to fold out the list. Commercial vendors are indicated as discussed previously here.

I recently started a discussion with the users of ChemSpider about how they use our system. There have already been two responses and I am hoping for more. Having sat in on an IUPAC InChI meeting in Washington last week I can honestly say that it was one of the most functional and on-task meetings I have sat in on in a long time. Decisions were made about how to move forward with the next release of the InChIKey and “standard versions” of both the InChIString and InChIKey.

The meeting has prompted the question: how do you use InChI? For what purpose do you use InChI and do you use only the string? Do you use it for communication purposes and structure exchange? Do you use it in your internal databases? Is it a primary path to deduplication? What settings do you use for the InChIString?

I’m interested in how you are using InChI and how important it has become for you. Comments welcome.

I’ve been in a number of conversations of late about how mass spectrometrists might use ChemSpider and get value from our efforts. I recently gave a short PowerPoint presentation to a group about what ChemSpider is and the types of queries that ChemSpider users can conduct today. I’ve posted the presentation to SlideShare as usual so people can access it there if they are interested.

I’ve started wrapping my head around how we could provide more value to some of our users in regards to MS, HPLC and NMR. One of the things we could do is use our text mining skills to look for NMR or MS (LCMS) articles based on the use of those terms in the title or abstract, and then use those terms as tags against the chemical structures named in the abstract/title. So, from a title such as “High-Performance Liquid Chromatographic Method for Determination of Phenytoin in Rabbits Receiving Sildenafil” from our collaborator Libertas Academica we would extract HPLC and Phenytoin and connect the article to the structure as we have done here. In this way the article would be searchable by structure and associated analytical technique and we could even look at extracting the detailed experimental approach from Open Access articles. More work but feasible. Any comments???
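
To make the idea concrete, here is a minimal sketch of how a title might be tagged with an analytical technique and the compounds it names. It is only an illustration under stated assumptions: the pattern table, the compound name list and the `tag_title` function are hypothetical and are not part of ChemSpider; a real system would use a validated name dictionary rather than a two-entry list.

```python
import re

# Hypothetical lookup tables for illustration only; neither comes from ChemSpider.
# Abbreviations are matched case-sensitively, spelled-out terms case-insensitively.
TECHNIQUE_PATTERNS = {
    "HPLC": re.compile(r"\bHPLC\b|(?i:high[- ]performance liquid chromatograph\w*)"),
    "NMR": re.compile(r"\bNMR\b|(?i:nuclear magnetic resonance)"),
    "MS": re.compile(r"\bLC-?MS\b|\bMS\b|(?i:mass spectrometr\w*)"),
}
KNOWN_COMPOUND_NAMES = ["Phenytoin", "Sildenafil"]


def tag_title(title):
    """Return the technique tags and known compound names found in a title."""
    techniques = [
        tag for tag, pattern in TECHNIQUE_PATTERNS.items() if pattern.search(title)
    ]
    compounds = [
        name
        for name in KNOWN_COMPOUND_NAMES
        if re.search(r"\b" + re.escape(name) + r"\b", title, re.IGNORECASE)
    ]
    return {"techniques": techniques, "compounds": compounds}


title = (
    "High-Performance Liquid Chromatographic Method for Determination "
    "of Phenytoin in Rabbits Receiving Sildenafil"
)
print(tag_title(title))
```

Each (technique, compound) pair found this way could then be stored as a tag linking the article to the matching structure record, which is what would make the article searchable by structure and technique together.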

Readers of this blog will know we have a focus on enabling chemists to source information via both Open AND Closed access publishers with the aim, ultimately, of providing a way to perform structure and substructure searching of these articles. This work is well underway.

If you visit our Literature Search Page you will see that we have recently added the ACS AuthorChoice Free Access articles to the index and we will continue to index on an ongoing basis. There are very few ACS AuthorChoice articles to search, but our usual validation search, “Searching Taxol”, does turn up one hit.

Herding Nanotransporters: Localized Activation via Release and Sequestration of Control Molecules (Nano Lett. 2007 Volume 8 Issue 1 Page 221) - American Chemical Society

R. Tucker, P. Katira, H. Hess

… 1 mM MgCl, 1 mM EGTA, pH 6 .9) containing 10 micromolar taxol for stabilization and kept at room temperature (20 C). Caged -ATP and “

Those of you who have read the ChemSpider blog for a while will know the name of Paul Docherty, who writes the TotallySynthetic blog. I have a great appreciation for Paul’s writings and, with permission, have started associating his blog posts directly with the structures, literally inserting the entire blog post, but NOT the comments, into a description for the molecule. For example, see the structure here. We’ve started to go back through Paul’s postings and make his entire collection structure searchable… see totallysynthetic.chemspider.com. There are a lot of old posts to deposit and time will get us there. Everything is posted under CC licenses and with permission.

Paul also writes for the RSC in Chemistry World as, for example, here. With permission from the RSC we are inserting snippets of the article into ChemSpider and linking out to the Chemistry World article, for example: here and here. If you are a registered user you can link your own articles on your websites using the Add URL functionality or you can deposit your own postings onto our site in the same way…feel free to ask us how.

We will be doing a lot of service work in the next 24-36 hours and there are likely to be service disruptions to ChemSpider. During this period there may be significant periods of downtime as we upgrade a number of services in the background. Our apologies for the disruption but the best time to do this is the weekend.

I posted a request recently for people to share with us how they use ChemSpider. Two comments were submitted today and are given below. Thanks to Chris and Sean.

Sean Ekins commented “Over the past year I have used ChemSpider as a valuable resource for generating molecular properties that are then used to analyze specific enzymes substrate requirements (PMID: 18537573) and also for following up on hits derived from computational pharmacophore database searching (PMID: 18579710). The later searching helped find additional compounds that were purchased from vendors and tested in vitro, ultimately some were found to be active. These uses are in addition to finding structures SMILES for generating QSAR models and general molecule searching. I hope to build on this in the future!”

Chris Singleton commented “As a chromatographer, the predicted properties of the molecule are an invaluable tool.  With values such as the logP and logD, it is much easier to predict and estimate retention times relative to other similar compounds without having to spend the time to do an entire chromatographic run.  Of course, I still do the actual chromatography, but these tools let me ‘home in’ on an appropriate method much quicker.  Some of these predictions are available in chemical structure drawing programs, but Chemspider allows me to do one-stop shopping for most of the properties I’m interested in.  And if I don’t have a structure handy, it’s easy to look it up so I can see what type of MS ionization mode I want to use, based on the moieties and polarity of the molecule.

Secondly, I’m more involved in DMPK from the analytical side as opposed to the pharmacokinetic (PK) side, so I’m trying to learn more about the PK.  The fact that Chemspider has such a wide range of properties and links to outside sources really lets me correlate the PK properties of a drug to its structure and lets me be a better analyst.  I  also know that Chemspider is busy adding links to the infoboxes on Wikipedia (an effort that I am involved in), so I read about the properties of a drug and am then able to click through to look at the chemical properties and how the structure relates to function.  This is more of a professional development exercise for me, but having the wealth of information in one place make it a easier to get a big picture view of a drug, rather than just looking at the pharmacology properties alone.  Definitely a structure centric view of drugs.”

I talk with a lot of people about ChemSpider and how it is being used. It’s being used in many different ways based on the emails we receive from users and I’ve shared a number of the various ways through my public presentations. What I’d like to do is offer some space on this blog for users to share how YOU are using ChemSpider and what you are using the information on ChemSpider to do. Simply send a short story about how you use/have used ChemSpider and I’ll post it for you. We’ll of course put crosslinks back to your website if you have one etc. Thanks!

As ChemSpider has grown into an important part of the online community for providing access to information and data to chemists to assist them in their work there are many subjective criteria by which to be measured. We set some objectives early on in regards to how we would measure our own successes in the first couple of years. These included:

1) A result of >500,000 in a Google search (we have been at this number for over a month I believe)

2) Acknowledgment by our “peers”, another subjective criterion, by comments made in the blogosphere, recognized by invitations to speak, participate in panel discussions etc. No shortage here.

3) Reach 5000 unique users per day in our first year (already achieved)

4) Be reviewed in a mainstream publication (the Nature article written about ChemSpider does that)

5) Have over 150 data sources feed ChemSpider. We are close…145 data sources at present and more in the pipe to feed in shortly

6) Be indexed by Chemical Abstracts Service.

CAS has been indexing a number of web resources for a considerable time. Until today I didn’t know that we were one of these sources. It actually makes a lot of sense that we should be indexed. We have unique chemistry on our site since we host Open Notebook Science from groups such as that of Jean-Claude Bradley at Drexel University. But, we also have spectra and assignments from research compounds being deposited onto the database and are establishing relationships with Open Access publishers to index their chemical compounds connected directly to their articles. So, being indexed makes sense.

There has been a murmuring in the community that what ChemSpider is doing will collide with CAS. I have reiterated many times that I believe CAS offers the crown jewels in terms of quality and curated data. With what amounts to likely 1000s of person years of investment in building the registry we are unlikely to surpass CAS’ breadth of knowledge. Rather we are focused on providing a service to the community so that the community can participate in developing and growing the database. I believe CAS and ChemSpider are synergistic and have much to offer by being connected in this way.

Inserted above is a screen grab of part of a record showing the ChemSpider database as the source of the structure. CAS have rigorous expectations regarding how they select what chemical entities should be inserted into their database. While I don’t know their list of criteria, this structure clearly meets it. The structure above is on ChemSpider here. We’re very happy that we are now being indexed in the CAS registry and will continue to enhance our “unique structure collection”, working with chemical vendors, publishers and scientists to grow our database.

If you noticed performance issues in the past couple of days with ChemSpider we apologize. Unfortunately we were getting hit pretty hard by a university in Europe with over 400,000 hits made on the search page. This definitely had an impact on the performance of our web services and the general search performance. We’ve blocked the user now and are hoping for no repeats of the situation (though we have already had similar issues from other countries). If you are the person(s) who were trying to gather the data from our site please contact us directly at infoATchemspiderDOTcom and we can talk about what you are trying to achieve and whether we can help.

In the past 48 hours we have added six new depositors’ datasets to ChemSpider. Details of all of our data sources are listed here. The list of six new depositors and the number of compounds in each collection is given below. Click on the hyperlinks for more information. The number-of-compounds link will display the compound collection and the link on the title of the compound collection will list some details about the data source provider.

Compounds   Data source                              Date
489         NIH Clinical Collection                  9/7/2008
3080        Shanghai Institute of Organic Chemistry  9/7/2008
12356       HDH Pharma                               9/7/2008
196         OmegaChem                                9/6/2008
2110        Exclusive Chemistry                      9/6/2008
13412       Oakwood                                  9/6/2008

We have put in place a simple way to link a chemical compound in a single record view out to an external data source. We made this a general solution but did it specifically to enable connections to be made quickly between new Wikipedia records and records on ChemSpider. We have become very experienced with the validation of data on both Wikipedia and ChemSpider over the past few months, so when we find new records on Wikipedia that are not already connected to ChemSpider we clean and validate structures on ChemSpider while validating the compounds on Wikipedia. Then, when we are convinced of the validity of the compounds, we connect them. While it may take a long time to validate the data, associating the Wikipedia and ChemSpider records takes just a few seconds.

We have now established “Wikipedia on ChemSpider” to make Wikipedia searchable by structure and substructure. We believe that people may be more likely to use this than WiChempedia but we will see.

The process for linking Data Sources directly to a record view is described in this Technical Note. We welcome feedback on the document in case it is difficult to follow.

Users of ChemSpider might have noticed some performance issues in the past 2-3 weeks with our web services, service availability and speed of searches. I put my hand in the air and say “Yup, acknowledged”. Hopefully they have not been too disruptive BUT it is ultimately for the overall benefit of the service. We have been streaming in 8 MILLION links to PubMed in order to make PubMed structure and substructure searchable. We are NOT rolling this out with full fanfare yet but I do want to explain the performance issues you might be experiencing. We work on Microsoft technology and while we are advocates for the platforms of .NET, IIS and SQL Server we definitely are putting them under pressure as we keep expanding the database and adding more value. We have thoughts about how to resolve this but want to finish populating the tables first.

The upside…the majority of links are already in place. For an example, visit a structure, look for PubMed as a data source and click on one of the links. For Valium here, you will see in the data source table a series of PubMed IDs next to the PubMed data source…

  16971504, 17673, 874970, 406430, 17881, 327854, 879884, 577681, 560225, 195649, …

These will link you out to PubMed directly. Try it out…

Now, do we have implementation issues? YES. The lists of external IDs can be long so right now we show only the first 10. We will deal with display of the others shortly. We need to provide a way to curate out “junk” entries. For example, “methyl” is on ChemSpider as a fragment and has links to PubMed IDs…you’ll see why if you click them…it was done with text mining. These issues will be resolved but for now we announce that PubMed is structure and substructure searchable via ChemSpider. We will explain how we did it shortly but for now we will acknowledge the massive contribution of our colleagues at SureChem. More to come…
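
For readers who want to picture the “first 10 only” display, the following is a rough sketch of truncating a list of PubMed IDs and turning them into outlinks. It is not our actual implementation (which runs on .NET and SQL Server); the function names are hypothetical and the URL pattern is an assumption, since the post does not say which link format the site generates.

```python
MAX_IDS_SHOWN = 10  # only the first 10 external IDs are displayed at present

# Assumption: PubMed's public per-record URL pattern, used here for illustration.
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


def format_id_list(pmids, limit=MAX_IDS_SHOWN):
    """Join the first `limit` IDs, marking the cut with an ellipsis if any are hidden."""
    text = ", ".join(str(pmid) for pmid in pmids[:limit])
    if len(pmids) > limit:
        text += ", ..."
    return text


def pubmed_links(pmids, limit=MAX_IDS_SHOWN):
    """Build one outlink per displayed ID."""
    return [PUBMED_URL.format(pmid=pmid) for pmid in pmids[:limit]]


# The ten IDs shown for Valium in the data source table.
valium_pmids = [16971504, 17673, 874970, 406430, 17881,
                327854, 879884, 577681, 560225, 195649]

# A smaller limit is used here only to demonstrate the truncation.
print(format_id_list(valium_pmids, limit=3))
print(pubmed_links(valium_pmids, limit=1))
```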

There has been an outpouring of offers from the ChemSpider community in terms of helping to examine/clean and enhance information regarding carbohydrates on ChemSpider. Almost 2 dozen users have now made an offer to help. Very exciting really!

I’ve already outlined the necessity to improve the quality of associations between structures and identifiers on the database. However, I am also hoping that users will write articles about carbohydrates using the rich-text formatting capabilities (ADD Description), will add spectra if they have them, will link up articles if they have interesting papers and will add URLs to interesting online content also.

We have now delivered the ability to curate and enhance records on ChemSpider and look forward to having our users help, starting with Carbohydrates…

I previously blogged that ChemSpider and Article 2.0 might be a good match. I commented on some of the things that we might be able to do:

“7500 articles and complete freedom to present the articles as we see fit. Enticing! What do we already have on ChemSpider that we could reuse?

1) Structure deposition

2) Analytical data and image deposition

3) Integration to other data via URLs

4) Add comments/description

5) Text markup with “Chemical enhancements”

6) A dataset of >21 million structures and integration to over 120 data sources

7) Good ideas …

Article 2.0 looks interesting…we hope to be involved” 

The contest site has now been updated. It is a little different than what we had imagined. The contest rules state: “We’re hoping you can develop many different journal article rendering alternatives by leveraging the Elsevier Article 2.0 API. The Elsevier Article 2.0 Contest is not about downloading all of the content in the contest repository and building a new search engine or identifying relationships amongst the individual articles contained in the Elsevier Article 2.0 Contest repository. While this is certainly an interesting exercise, the purpose of the Elsevier Article 2.0 Contest is to provide new alternatives for rendering individual journal articles.” There are also no chemistry journals listed in the FAQ list.

At this point it is unlikely that we will participate in the contest at all.

Last night I was talking about the potential distractions of Google Chrome and tonight I have been using it for four hours and am very impressed. My experience is faster loading and simple navigation, and I love the most-used page views on the startup tab. Focusing on ChemSpider, page loading is MUCH faster for “heavy pages” and I have seen no issues at all except with the applets. The structure drawing applet and the spectral display applets both fail to load. Based on an internet search this appears to be a common problem but some people seem to have it solved. I have tried the reported approach and failed but I have a feeling that in the next couple of days there will be other reported solutions. Overall I am impressed with Google Chrome and, when Java applets are supported (or I can get them working), I think this might just be the next favored browser…but I will be downloading IE 8.0 beta shortly to play with too…

I am very proud of the response from our user base to my request for assistance with curating ChemSpider in regards to carbohydrates. Carbohydrates are complex in nature. They can be represented in linear form and cyclic form, they exist in ChemSpider with a common name but no defined stereochemistry, and there are pentoses, hexoses and many stereoisomers per skeleton. There are MANY common carbohydrates with trivial names: Ribose, Arabinose, Xylose, Lyxose, Allose, Altrose, Mannose, Gulose, Idose, Galactose, Talose.

Carbohydrates have been very challenging for us at ChemSpider…many depositors have not been careful with the association between the chemical structure and the associated identifiers. With a chemical structure as the primary key on a record, we find confusing associations between identifiers and structures. For example, a search on Maltotriose as an identifier turns up 5 structures on ChemSpider. Maltotriose is defined on Wikipedia as “trisaccharide (three-part sugar) consisting of three glucose molecules linked with 1,4 glycosidic bonds.” This should mean that it is not appropriate for the identifier maltotriose to be associated with this structure. Taking Wikipedia as the reference, the registry number associated with this structure should be deleted also. How many of the other identifiers should be deleted? Maybe all???

Looking at this record we see identifiers such as: alpha-D-Glc-(1->4)-alpha-D-Glc-(1->4)-D-Glc; alpha-D-Glc, O-alpha-D-glc; GLC-(4-1)GLC-(4-1)GLC-(4-4)GTE and O-alpha-DD-Glucopyranosyl-(1->4)-O-alpha-D-glucopyranosyl-(1->4)-D-glucose. Are these appropriate for this compound?

The challenge for maltotriose is therefore to identify the CORRECT structure associated with that name. “Maybe” it is the structure on Wikipedia but don’t forget that we have an effort underway to validate the structures on Wikipedia and make sure they are correctly associated with the monograph title. Is Maltotriose an identifier for a unique stereoconfiguration or are there alpha- and beta-maltotriose? I am not sure. What needs to be determined is the correct association between structures and identifiers. Incorrect associations should be removed so that searches on an identifier no longer turn up the wrong structures in ChemSpider.
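
As a rough illustration of where a curator might start, the sketch below flags identifiers that are attached to more than one structure record, which is the situation described for maltotriose. The data are placeholders: the record IDs and pairings are invented for the example and do not correspond to real ChemSpider records, and `ambiguous_identifiers` is a hypothetical helper, not a ChemSpider feature.

```python
from collections import defaultdict

# Placeholder data for illustration only: these record IDs and pairings are
# invented and do not correspond to real ChemSpider records.
associations = [
    ("maltotriose", "record-A"),
    ("Maltotriose", "record-B"),
    ("maltotriose", "record-C"),
    ("ribose", "record-D"),
]


def ambiguous_identifiers(pairs):
    """Map each identifier linked to more than one record to its sorted record IDs."""
    records_by_name = defaultdict(set)
    for name, record_id in pairs:
        # Normalise case and whitespace so trivially different spellings group together.
        records_by_name[name.strip().lower()].add(record_id)
    return {
        name: sorted(record_ids)
        for name, record_ids in records_by_name.items()
        if len(record_ids) > 1
    }


print(ambiguous_identifiers(associations))
```

A list like this only tells a curator where to look. Deciding which of the flagged structures genuinely deserves the name still needs chemical judgement, since a name can legitimately cover more than one stereoisomer.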

This is the start of the validation process for carbohydrates…it’s iterative, complex and hard work. It’s going to begin with giving the group of interested parties curator power on ChemSpider and asking them to work on this challenge. We welcome their assistance. The efforts of contributors like this will be essential.

Slow down, I want to get off…I am writing this at 1:20AM. My life is getting too busy. I am entranced by the things we can do at ChemSpider and am swept up with email, blogs, SlideShare (uploaded 8 old talks tonight at http://www.slideshare.net/AntonyWilliams), Google Reader and on, and on. I find myself fascinated with the pace at which everything is moving. Just when I have finished tweaking Firefox with my latest add-ons, specifically Ubiquity, along comes the announcement (via a comic) that Google will release their own browser, Google Chrome. Ugh….while I’m excited enough already….(okay…can’t wait to play!!!)

A fresh take on the browser

9/01/2008 02:10:00 PM

At Google, we have a saying: “launch early and iterate.” While this approach is usually limited to our engineers, it apparently applies to our mailroom as well! As you may have read in the blogosphere, we hit “send” a bit early on a comic book introducing our new open source browser, Google Chrome. As we believe in access to information for everyone, we’ve now made the comic publicly available — you can find it here. We will be launching the beta version of Google Chrome tomorrow in more than 100 countries.”