Archive for the ChemSpider Services Category

Peter Murray-Rust responded to my recent comments about a Free Lunch. There are a number of comments to be made and an exciting opportunity to use Open Data and linking from ChemSpider.

I’d asked the question about how many records there were on CrystalEye. In our world a unique record is a unique InChI, not so on CrystalEye and appropriately so as the crystal structure itself is presumably the unique record. Makes sense.

PMR> We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) does a good job on disambiguating by cell dimensions but this is not foolproof and indeed no method is.

What we will do with multiple crystal structures for a single chemical structure is link all unique crystal structures from the unique chemical structure. In this way people can query the chemical structure and find all associated analytical data - spectra and crystallographic files. If we were to list the number of unique depositions on ChemSpider I think we would be around 40 million depositions…an estimate though!
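To make the linking idea concrete, here is a minimal sketch of grouping crystal-structure entries under one chemical structure, using the InChI as the key. The entry identifiers and the `link_by_inchi` helper are made up for illustration; this is not how ChemSpider or CrystalEye store data.

```python
from collections import defaultdict

# Hypothetical crystal-structure entries as (entry_id, InChI) pairs.
entries = [
    ("cif-001", "InChI=1S/CH4/h1H4"),
    ("cif-002", "InChI=1S/H2O/h1H2"),
    ("cif-003", "InChI=1S/CH4/h1H4"),  # a second crystal structure of methane
]

def link_by_inchi(entries):
    """Map each unique InChI to every crystal entry that carries it."""
    linked = defaultdict(list)
    for entry_id, inchi in entries:
        linked[inchi].append(entry_id)
    return dict(linked)

links = link_by_inchi(entries)
for inchi, entry_ids in links.items():
    print(inchi, "->", entry_ids)
```

One chemical record then carries a list of crystal entries, so a query on the structure can return every associated CIF.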

PMR> The main duplication comes from the Crystallography Open Database which has about 45,000 structures.

I looked at the Crystallography Open Database this morning. It states on the home page “Updated daily: 68268 entries in the COD”. We may have an opportunity with the COD to link up to their data and reduce the need for us to host CIFs. Excellent…we’re all for reducing workload and providing links into other systems. It’s what we do.

PMR> The only thing stopping us putting them (AJW> The structures from CrystalEye)  in Pubchem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon. When it’s finished it will be in RDF/CML.

This is great news. This means that after the summer we can download the data directly via PubChem and link up to CrystalEye that way. Perfect. We’ll stop working on integrating to CrystalEye now and wait for the integration path via PubChem and focus on other data sources. Thank you Peter, Nick, Andrew and Jim!!! That said, I don’t believe that PubChem will take CML; they will convert using their tools to produce their compatible formats, InChI being one of them. That will break organometallics etc. UNLESS PubChem are going to adopt CML now, and that would be an interesting positive shift in terms of a sign of support for the format. A strong positive. I’ll chat with the PubChem team so that if CML is coming we can consider adopting in some way and be ready.

From my post “AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.”

PMR: Chicken and egg… :-) You won’t adopt it until other people adopt it and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents) as in semantic publishing and the results of text-mining.

It’s nice to know that ChemSpider has that type of influence now. It’s good to see it going into Accelrys’ software and I had heard that from Dan’s blog and had added the CML Blog to my reader. I’m definitely watching and willing to follow. We’re busy leading so many other things right now we’ll wait for adoption and then jump on it like a “hobo on a muffin”.

Buy me a Coffee

One of the blogs I really enjoy reading is Deepak Singh’s Business,Bytes,Genes and Molecules. Today there was a blog post about ChemSpider but something strange happened…I could ONLY read it in Google Reader. When I tried to navigate to the actual website it asked me to Save a file. See below.

It may be harmless but I’ve suffered enough at the hands of “bad files” to not grab it. Anyone else seeing this symptom? It’s in both browsers (IE and FF) and on two computers.

Anyhow, thankfully I can read it in Google Reader. There’s a point Deepak raises and I insert it here..

“On the web, data should be available as an addressable resource. The fact that data is available as RDF is great (and I wish more data was available as such). However, my personal preference is that data, especially open data, needs to be accompanied by APIs and bindings that allow the data to be accessed in a number of formats (not a dump per se). I think over time the acceptable formats will be established, much like XML/JSON/RSS have become the standard transport formats. The key aspect here are the business models. Is the business in providing a service on top of the data? For example for more than X number of API calls, there could be a fee associated.”

Just in case people have missed them we have a whole series of Web Services available already and they are being used. You can find details about them here:

Mass Spec Web Services

Taverna Hooks to ChemSpider Web Services for Metabolomics

Web Services Demo Pages and Example Code

Microsoft Hook Web Services into Infomesa

Waters Deliver Integration Via Web Services

There are more examples. We have thousands of calls a day using the Web Services at present and welcome more feedback on them!


We made our web services available with the intention that third parties might take advantage of the capabilities. Waters recently integrated our MassSpecAPI web service to their MarkerLynx software.

To perform an online search of the ChemSpider database the user can set up a series of specific databases they might want to search. In the figure below the user has selected the Human Metabolome Database, Lipidmaps and the KEGG database.

waters1.png

Searches of the selected series of ChemSpider subset databases can be made based on either mass or elemental composition. The search returns the compound name in the ID column and retains the link to the ChemSpider ID.

waters2.png
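For readers who want a feel for what a mass search restricted to selected subset databases involves, here is a small sketch. It does not call the MassSpecAPI service; the record IDs, the `search_by_mass` helper and the tolerance are all assumptions for illustration, and only the monoisotopic masses are real values.

```python
# Hypothetical records: (record_id, name, monoisotopic_mass, source databases).
# The IDs are placeholders, not real ChemSpider IDs.
RECORDS = [
    (1, "glucose", 180.0634, {"HMDB", "KEGG"}),
    (2, "caffeine", 194.0804, {"HMDB", "KEGG"}),
    (3, "palmitic acid", 256.2402, {"HMDB", "LipidMaps", "KEGG"}),
]

def search_by_mass(records, mass, tolerance, databases):
    """Return (record_id, name) for records within +/- tolerance of the mass
    that belong to at least one of the selected databases."""
    hits = []
    for record_id, name, record_mass, sources in records:
        in_window = abs(record_mass - mass) <= tolerance
        in_selection = bool(sources & databases)
        if in_window and in_selection:
            hits.append((record_id, name))
    return hits

hits = search_by_mass(RECORDS, 256.24, 0.01, {"LipidMaps"})
print(hits)
```

Keeping the record ID alongside the name is what lets a client link each hit back to the full record.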

ChemSpider records can then be viewed by clicking on the View Hit Details. The chemical structure and associated information can then be reviewed online.

waters3.png

Web service integrations of this type have started to expand in number and you can expect to see others appearing very shortly.


Over the past year we have been interested in our website statistics and our growing traffic. I have blogged previously about Alexa and was challenged to review the Compete statistics. After growing in rankings for a few weeks we removed the Alexa widget and saw our rankings plummet. We then installed the Compete widget and saw ourselves go up the rankings quite dramatically before removing that widget and seeing our rankings decrease. Meanwhile, our own website statistics have shown consistent month to month growth with an average of about 4000 unique users per day at present (As shown in the figure below).

stats.png

Bottom line, based on our observations, neither Alexa nor Compete give anywhere near valid statistics. At the SBS conference in St Louis this past week I asked the audience, about 60 people, how many in the room knew of or had heard of ChemSpider. ONE hand went up…and that was someone I had informed many months earlier. As I expressed to the audience…this was not disappointing news to me…it was quite exciting to know what the potential growth is as people are informed of the service. I expect the growth to continue, especially after the visits to the ACS and the SBS.


Over the past year ChemSpider has been working hard to build a functional and stable platform for the hosting, deposition and curation of structure-based data. This is to form the foundation of our mission to build a Structure-Based Community for Chemists. Our deposition system is in place and well-tested. Our indexing of articles is proven, and continues. We have indexed multiple Open Access articles. We support the deposition of analytical data (spectra and CIF files) into ChemSpider.

It is now time to take this to the next level and I would like to extend an invitation to Open Access publishers to work with us to design an interface (preferably a web service) to facilitate direct deposition of data into ChemSpider. We’d like to design an interface where you can feed your articles in with Title, Authors, Journal reference, DOI and Abstract. We would associate the article with the chemical structures in one of two specific ways - 1) extract the chemical names from the title and/or abstract and convert on the fly to deposit and/or associate with structures on ChemSpider and 2) allow the publisher to pass us a series of SMILES strings, InChI Strings, molfiles or chemical names to deposit on ChemSpider. Based on what we have already done it is clear this process is feasible, and will require some manual intervention until we optimize processes. If we do this we can design an interface and input format that can be made public, reusable by other groups for the deposition of information into their systems and, potentially, move away from the need for extracting information out of PDF files (and other formats). The outcome of this work would be a freely accessible structure and substructure searchable index of Open Access articles with links back to the Open Access article. We are already indexing articles so, with permission from even the non-Open Access publishers we could use similar processes to index abstracts and make articles structure/substructure searchable based on titles and abstracts.
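As a starting point for discussion, a deposition record along these lines might look like the sketch below. Nothing here is an agreed format: the class name, field names and the two route labels are assumptions, and the DOI in the example is a placeholder.

```python
from dataclasses import dataclass, field

@dataclass
class ArticleDeposition:
    """Hypothetical article record a publisher could feed in."""
    title: str
    authors: list
    journal_reference: str
    doi: str
    abstract: str
    # Optional structure identifiers supplied by the publisher (route 2).
    smiles: list = field(default_factory=list)
    inchis: list = field(default_factory=list)
    molfiles: list = field(default_factory=list)
    chemical_names: list = field(default_factory=list)

    def missing_fields(self):
        """List the required bibliographic fields that are empty."""
        required = ("title", "authors", "journal_reference", "doi", "abstract")
        return [name for name in required if not getattr(self, name)]

    def route(self):
        """Pick between the two association routes described above."""
        supplied = self.smiles or self.inchis or self.molfiles or self.chemical_names
        return "publisher-supplied" if supplied else "extract-from-text"

record = ArticleDeposition(
    title="An example article about ethanol",
    authors=["A. Author"],
    journal_reference="Example J. 1, 1-2",
    doi="10.0000/placeholder",
    abstract="A placeholder abstract mentioning ethanol.",
    smiles=["CCO"],
)
print(record.route(), record.missing_fields())
```

A record with no structure identifiers falls through to name extraction from the title and abstract, which is where the manual intervention mentioned above would come in.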

So, my question. Are there any Open Access/Free Access publishers willing to discuss the possibilities I have outlined? If any of you will be at the ACS meeting and would like to discuss please post a response here or contact me at the usual email address (antonyDOTwilliamsATchemspiderDOTcom) and let’s talk about building a disruptive and enabling technology for chemists around the world.


Over the past few months we have been working hard to integrate the ChemRefer system into ChemSpider. I have reported on our recent rollout of the first level of integration and Will Griffiths has started a discussion at the Open Chemistry Web blogpage. Will has played a key role in facilitating our relationships with publishers in the Open Access domain. Many have had an appreciation for his ChemRefer website.

When we connected ChemRefer to ChemSpider one of our first commitments was to facilitate direct structure/substructure searching of the IUCr publications. Will has indexed the publications from 1948 to present day. He has extracted chemical names and systematic names from the titles and abstracts of the articles. Since we have access to name to structure conversion capabilities from OpenEye we can convert many of these names to chemical structures in an automated fashion. As a result of my work on the curation of Wikipedia chemical structures I also have access to some of the tools I was involved in developing at ACD/Labs (ACD/ChemFolder and ACD/Name). This has allowed me to curate and convert other extracted names in a manual manner. This is NOT the most efficient way to conduct this process as it requires a lot of eyeballing. However, it is this type of approach, reviewing many hundreds of extracted names, which has allowed us to optimize the process, recognize where potential failures could arise and improve our chances of extracting the best set of names to work with for conversion purposes. Now we are at the whim of name to structure conversion software in terms of the accuracy of the conversion but this can be validated (see later). Since organometallic names are very difficult to convert to structures we can only deal with organic structures at present. However, we do also have many validated synonyms in our hands on ChemSpider and that provides a useful dictionary for conversion.

Ultimately I hope that other commercial providers of batch Name to Structure conversion software modules would see the potential value of collaboration with ChemSpider and give us access to their software tools to assist with our project.

We are in the process of curating and checking the names and chemical structures for all that we have extracted so far but the depositions have already started. Of the structures converted we have found about half of them are already on ChemSpider (example) but another half are unique (example). At present we have connected almost a thousand articles via DOIs from ChemSpider to the IUCr articles. The best estimates at present are that we will be able to connect about 8500 structures to articles.

Since this is the IUCr collection we will validate our approach in the future and, with their permission, we will use the CIFs to validate our extracted structures against the CIF-based structures. This will give us an indication of the errors one could expect from an automated extraction of chemical names from articles and conversion to structures using Name to Structure. We are also going to continue our curation process and allow people to curate structures from IUCr (and other articles) to ChemSpider. There will be cases where some names/structures from articles have not been converted to structures and these will need manual submission. This will be possible through the manual deposition system.
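The flow described in this post (convert extracted names, check which structures are already in the database, and send the failures to manual deposition) can be sketched roughly as follows. A plain dictionary of synonyms stands in for the name to structure software, and the function names and sample data are hypothetical.

```python
# Stand-in synonym dictionary: lower-cased name -> InChI.
SYNONYMS = {
    "ethanol": "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    "benzene": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "water": "InChI=1S/H2O/h1H2",
}

def convert_names(names, synonyms):
    """Split extracted names into converted (name -> InChI) and unconverted."""
    converted, unconverted = {}, []
    for name in names:
        inchi = synonyms.get(name.strip().lower())
        if inchi is None:
            unconverted.append(name)  # candidate for manual deposition
        else:
            converted[name] = inchi
    return converted, unconverted

def split_known_and_new(converted, existing_inchis):
    """Separate structures already in the database from new ones."""
    known = sorted(n for n, inchi in converted.items() if inchi in existing_inchis)
    new = sorted(n for n, inchi in converted.items() if inchi not in existing_inchis)
    return known, new

extracted = ["Ethanol", "benzene", "an unrecognised complex"]
already_in_database = {"InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"}

converted, unconverted = convert_names(extracted, SYNONYMS)
known, new = split_known_and_new(converted, already_in_database)
print(known, new, unconverted)
```

Validation against the CIFs would add one more comparison step: the InChI derived from the name against the InChI derived from the crystal structure for the same article.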

We are looking forward to providing similar services to other publishers should they be desired. This is our proof of concept…and it’s working.


Previously we introduced the ability to submit chemical structures to the database using the Single Structure Deposition process. This allows users to submit single structures to the database and associate with publications, URLs, Pubmed IDs and so on. An example of the result can be seen here for Quesnoin…the structure and associated supplementary info was deposited online using the outlined process.

We have previously unveiled the ability to add publication details to existing structures on the database as outlined here. What we’ve heard is that it would be just as useful, and in the time of Web 2.0 even better, to allow connections to other web pages by allowing URLs to be connected to existing structures on the database. The process is easy.

You need to be logged in to Add URLs, Publications etc. The only action that can be done without logging in is the Posting of Comments. The reason we do this is to help us protect from vandalism, if possible. When logged in then click on Add URL. The example below is for me wanting to form a link between the structure record of Xanax and the article on Wikipedia.

addurl1.png

A dialog box will be displayed. Input the Title that you want displayed in the supplementary info and the URL of the associated link. See below:

addurl2.png

Once the information is filled in, the dialog will look as follows:

addurl3.png

Then Click OK. The submission will be sent to a curator for approval and should be approved very quickly. The reason for this process is to ensure that we don’t get adorned with “inappropriate links”. The information will show as “Supplementary info” at the bottom of the structure record as shown here.


Two particular Open Access resources providing enormous value to Life Sciences nowadays are PubMed and PubChem. I’m sure everyone reading this blog has heard of them and used them both. We previously announced our deeper collaboration with ChemRefer. We have been busily working in the background to integrate ChemRefer into ChemSpider and the very “alpha” version of the integration is now available online. It can be accessed from the Search menu as shown below. Simply click on ChemRefer

chemrefer.png

When selected it will open the ChemRefer search window and a list of Publishers who have allowed us to index them as shown below. All can be searched or the search can be limited by using the Check Boxes.

chemrefer2.png

The search results for searching on taxol are shown here. The figure below, while too small to see detail, shows that the word searched is highlighted in the text.

chemrefer3.png

Notice that the RSC is no longer indexed at their request. We were sad to lose them from our searches.

We have also integrated to Entrez, for searching health sciences databases at the National Center for Biotechnology Information (NCBI) website. This can be searched by choosing NCBI Entrez from the Search drop down menu. It is available here. An example of the results is shown here. For this first integration we limit the results to 100 hits.

entrez1.png

Clicking on any of the titles is a direct hyperlink to the article in PubMed Central.
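For anyone curious what a capped text query against Entrez looks like, the sketch below builds a request URL for the public NCBI E-utilities `esearch` endpoint with the result count limited to 100. This only illustrates the idea; how ChemSpider itself calls Entrez is not described here, and the choice of the `pmc` database and the helper name are assumptions.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_esearch_url(term, db="pmc", max_hits=100):
    """Build an esearch URL; retmax caps the number of IDs returned."""
    query = urlencode({"db": db, "term": term, "retmax": max_hits})
    return ESEARCH + "?" + query

url = build_esearch_url("taxol")
print(url)
```

Fetching that URL returns a list of record IDs, which a client would then resolve to titles and links.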

These integrations to these text searching engines are only the first part of our work. We have already been extracting chemical names and linking them up to structures in the ChemSpider database. Our first efforts in this area will be unveiled shortly. In this case it will be possible to simultaneously perform structure/substructure and text-based searches. This is a very significant undertaking but we are well underway to bringing our vision of structure and text based indexing of the Open Access literature to fruition.


I’ve reported previously on the fact that we are now adding publication details to chemical structures on ChemSpider. We have introduced the ability to do this in a manual fashion where anyone can associate a blog article, wiki article or scientific publication directly with one or more chemical structures but we are also looking to do this in an automated fashion. Project Prospect from the RSC seemed like an ideal opportunity for us to consider using their InChI association to harvest the titles and DOIs to make the association. This would be done after a discussion with the RSC to receive their blessing if possible, based on our previous interactions.

Today I investigated the possibilities of using the information available. I started with this article and clicked on the Enhanced HTML View (Prospect View) and then used the Toolbox to Show the Compounds. The article is about the “Chemistry and biology of resorcylic acid lactones”. A partial screenshot of the type of molecules discussed in the article is shown below.

radicicol.png

For the purpose of this blogpost notice the visually appealing forms of the structures and the stereochemistry on the molecules. Now, each of the marked compounds in the article is linked to details of the molecule. See below…each of the pink highlights is linked to the molecule and pops up a new box.

radicicol3.png

Looking at radicicol on Project Prospect and on ChemSpider we see a difference. In fact, compare with the structure shown above from the article. The difference is one of stereochemistry…there is no stereochemistry in the InChI or in the SMILES string. There are also issues with the structure depiction shown below and this has been discussed before relative to “Cleaning”.

radicicol2.png

Zearalenone on Project Prospect and on ChemSpider

zearalenone.png

As previously discussed with Ginkgolide B there can be many versions of a structure on the ChemSpider database. We recently introduced the ability to search on a “skeleton” as shown below in the new Structure Search options.

search-options.png

When the skeleton for zearalenone is searched (same skeleton excluding H) I found 15 hits. Some are shown below. Notice the differences (highlighted with red boxes) from structure to structure in terms of the presence/absence of the double bond inside the cycle, the difference between the OH and the =O, and the specified stereochemistry. This search can be very useful for finding related structures and more examples will be given in the near future of using such searches. We DO find versions without specified stereochemistry but we are presently working on approaches to relate the stereo/non-stereo versions of structures to each other in a very visual manner. More will follow…

skeletons.png
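To illustrate what “same skeleton excluding H” means in practice, here is a toy sketch that builds a comparison key from heavy-atom connectivity only, ignoring hydrogens, bond orders and stereochemistry. It uses a simple neighbourhood-refinement scheme (in the spirit of Weisfeiler-Lehman labelling), which is not a complete graph-isomorphism test and is not a description of how the ChemSpider search is implemented. The molecule encoding and the `skeleton_key` helper are assumptions.

```python
def skeleton_key(atoms, bonds, rounds=3):
    """Key built from heavy-atom connectivity only.

    atoms: list of element symbols; bonds: list of (i, j, order) tuples.
    Hydrogens are dropped and bond orders are ignored.
    """
    heavy = [i for i, element in enumerate(atoms) if element != "H"]
    neighbours = {i: [] for i in heavy}
    for i, j, _order in bonds:
        if i in neighbours and j in neighbours:
            neighbours[i].append(j)
            neighbours[j].append(i)
    labels = {i: atoms[i] for i in heavy}
    for _ in range(rounds):
        # Each atom's new label combines its own label with its neighbours'.
        labels = {
            i: labels[i] + "(" + ",".join(sorted(labels[j] for j in neighbours[i])) + ")"
            for i in heavy
        }
    return tuple(sorted(labels.values()))

# Propan-2-ol with its hydroxyl hydrogen written explicitly.
propan_2_ol = (["C", "C", "C", "O", "H"], [(0, 1, 1), (1, 2, 1), (1, 3, 1), (3, 4, 1)])
# Acetone: same heavy-atom skeleton, C=O instead of C-OH.
acetone = (["C", "C", "C", "O"], [(0, 1, 1), (1, 2, 1), (1, 3, 2)])
# Propan-1-ol: oxygen on a terminal carbon, so a different skeleton.
propan_1_ol = (["C", "C", "C", "O"], [(0, 1, 1), (1, 2, 1), (2, 3, 1)])

same = skeleton_key(*propan_2_ol) == skeleton_key(*acetone)
different = skeleton_key(*propan_2_ol) != skeleton_key(*propan_1_ol)
print(same, different)
```

The OH versus =O pair collapses to one key, which mirrors the kind of differences highlighted in the red boxes, while moving the oxygen to another carbon gives a different key.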


We have started to introduce new capabilities onto ChemSpider in preparation for our shift from ChemSpider Beta to ChemSpider RELEASE VERSION to celebrate our one year anniversary. You should see a number of incremental improvements happening over the next few weeks. I’ll highlight as many of them as I can as we release them.

Let’s start with new Search Capabilities. On the search screen you will see a drop down menu. You can now select from 5 different types of searches. This post will highlight only the “Structure” type search. Details about the others will follow.

searches.png

It appears that many people believe that ChemSpider is only a text-based search. The reality is that we went live with structure and substructure search in version 1. For sure we have improved things over the months and the searches are better and faster but structure searching has always been present.

To perform structure searching choose the Structure Search and you will see:

structuresearch1.png

Now click on Input Structure and a screen for submitting structures will be opened. There are three ways to do this. The Convert option will allow you to input a SMILES string, InChI string or chemical name and convert it to the structure. If the structure is correct select ACCEPT and perform the search. Alternatively load a molfile by browsing and uploading then click ACCEPT. If you want to draw a structure or edit one you have converted simply select edit and draw into the applet.

structuresearch2.png

The manual for the applet is here. The applet will look like this:

structuresearch3.png

These capabilities have been in place for a while. Now what we have done is enable you to search in different ways from this place. Specifically…

structuresearch4.png

These searches are very valuable, specifically in relation to the issue of tautomers and structure skeletons. As I have shown in a previous post on ginkgolide B, in some cases there are many similar structures on the database. Searching on the skeleton will find them all. A search on Taxol shows 1 exact structure, 1 tautomer of the exact structure and 42 structures with the same skeleton as shown below.

structuresearch5.png

Enjoy the new capabilities. We welcome your feedback.


I use Google Reader to read blog posts. It’s great. The PROBLEM I have with blogs is when I want to receive comments on comments I have made on a blog. I waste a lot of time going back to blog posts to see whether the author of the original post has commented on my comments. It is a real time hog. Maybe I should stop commenting…it would be less time-consuming.

I think ALL blog hosts should allow users to Subscribe for Comments. What is this? I noticed it on David Bradley’s ScienceBase blog and we added it tonight. Look at the screenshot below and notice the red highlight. Now if you post a comment to the blog check this box and if new comments are received on this particular blog post then you will receive an email about the new comment.

subscribe.png


I’ve commented previously about the fact that Microsoft had used our web services to connect to Infomesa (1,2). Last week I had a chance to meet with Sam Batterman face-to-face for an overview of Infomesa and to see the services in action. Sam and I chatted for a couple of hours about his platform, the challenges of managing quality in publicly accessible data (and not proliferating errors) and the directions for ChemSpider in general and whether we could extend the services to support his needs.

Infomesa is a BIG whiteboard. And I mean BIG! During our discussions Sam threw up photos, charts, spreadsheets and videos onto the board and formed relationships between them. He demonstrated using our services to pull back/generate SMILES strings, structure images and InChIKeys. He demonstrated mapping relationships as I would do today in MindManager. I could see immediate utility to the approach of the giant whiteboard. I am tired of arranging relationships over multiple documents and while I like MindManager a lot (!) the utility of the enormous whiteboard approach (I think Sam mentioned “equivalent to 50,000 pixels”) became clear in a couple of minutes.

I look forward to playing with Infomesa myself when it’s available and doing what we can through our web services to help offer additional utility to chemists.


ChemSpider hosts the “electronic catalogs” of many chemical vendors as a public service. When people use us to find chemical suppliers we do not expect to be compensated for the service and do not expect to be acknowledged for the service. If ChemSpider users get connected to companies and buy their chemicals so much the better. What has happened over the past few weeks has been very interesting though…we’ve seen a DRAMATIC increase in the number of requests to help source chemicals for people that are NOT on the ChemSpider database (yet) as well as to assist in identifying organizations (or individuals) who are interested in obtaining custom synthesis support. We’re happy to do it.

I’m also in the middle of assisting in the curation of Wikipedia and showing “willingness” there too. There are a whole host of other projects we are being invited to participate in also and we are showing “willingness”. There is an unfortunate downside to “willingness” though. There is very little time left to do “real work”. There is an increasing number of email requests and phone requests coming to us regarding “Where can I buy this chemical?” or “Can you help identify someone who can synthesize this chemical?”. In the past month we’ve averaged about 20 requests per week but it is increasing fairly quickly. We have about an 80% success rate in finding the chemical of interest on the database or through our depositors as well as introducing the interested party to one or more people interested in doing custom synthesis for them…and then getting out of the way.

With this in mind you are going to see us expend some efforts to facilitate interactions between chemical vendors and interested parties. I consider this as a form of social networking for chemists - facilitating introductions and connections - okay, maybe more of a dating service.

echemistry.png

Maybe it’s eChemistry (the science of attraction…an online dating service)? By the way, there is also the Microsoft eChemistry project and I hope that people don’t confuse the two….


There is a new contributor to the blogosphere…SimBioSys. I recommend adding the blog to your Google Reader. There are some very exciting things going on there right now. I have commented previously about how high performance computing engines such as the Cell Broadband Engine are being brought to bear on scientific problems. SimBioSys appear to be the only group who have chosen the Cell processor to port their virtual high-throughput screening and docking solution to. Their white paper makes for an interesting read.

In their most recent post “Roping in your next scaffold hop with LASSO” they talked about their LASSO publication: LASSO—ligand activity by surface similarity order: a new tool for ligand based virtual screening”. We are presently in the middle of a very exciting project regarding LASSO. We have teamed up to provide the virtual screening results for 40 target families on the full ChemSpider Library, currently containing over 18 million molecules. Using the LASSO similarity search tool, SimBioSys has screened the ChemSpider database against all 40 target families from the Database of Useful Decoys (DUD) dataset.

LASSO descriptors (Ligand Activity by Surface Similarity Order) contain a count of the different Interacting Surface Point Types (ISPT) found on a molecule. LASSO descriptors use 23 different surface point types, ranging from hydrogen bond donors/acceptors, to hydrophobic sites, to pi stacking interactions. Figure 1 shows a “histidine-like” fragment of a molecule. The triangles are the surface point types of this fragment, colored by type. Based on the idea that ligands must have surface properties compatible with the target site in order to bind, LASSO uses a descriptor of Interacting Surface Point Types (ISPT) to find molecules with diverse chemical scaffolds but similar surface properties.

lasso1.png
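A descriptor that is “a count of surface point types” is easy to picture as a small table of counts per molecule. The sketch below builds such counts and compares two molecules with a count-based Tanimoto score. That score is a stand-in chosen for illustration only; this post does not describe how LASSO itself turns descriptors into similarity scores, and the three point-type names are a made-up subset of the 23.

```python
from collections import Counter

def descriptor(surface_points):
    """Count how many surface points of each type a molecule has."""
    return Counter(surface_points)

def count_tanimoto(a, b):
    """Generalised Tanimoto on counts: shared points over combined points."""
    keys = set(a) | set(b)
    shared = sum(min(a[k], b[k]) for k in keys)
    combined = sum(max(a[k], b[k]) for k in keys)
    return shared / combined if combined else 0.0

# Hypothetical surface point lists for two molecules.
mol_a = descriptor(["donor", "acceptor", "acceptor", "hydrophobic"])
mol_b = descriptor(["donor", "acceptor", "hydrophobic", "hydrophobic"])

score = count_tanimoto(mol_a, mol_b)
print(round(score, 2))
```

Because the descriptor records surface properties rather than the bond framework, two molecules with different scaffolds can still score as similar, which is the point of the approach.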

We are presently populating the ChemSpider database with 10s of millions of LASSO descriptors and this will allow screening of the ChemSpider database to:

• Find molecules which have a higher likelihood of binding to targets.
• Find molecules with better selectivity for a target.
• Reduce toxicity issues.

The 40 Target receptor families included in the screening results were chosen to cover a wide range of receptor classes due to their interest in drug discovery. Each target family had 10s to 100s of known active molecules, which were used as the basis for the query files used by LASSO, one query for each family. The similarity screening was performed on the full ChemSpider database across all 40 targets and the similarity scores for each structure/target pair are available via the ChemSpider website. Thus for each structure in the ChemSpider database, you can find its similarity score (based on surface properties) relative to actives of each of the 40 target receptors. In addition to allowing instant ranking results for a particular target of interest (retrieving molecules that are likely to be active for a receptor) this matrix of screening results can be used to find molecules that have predicted affinity for a target but low predicted affinity for all other targets. Performing such searches promises to improve selectivity and can be a guide to reducing toxicity concerns. More detail about this collaborative project will be forthcoming but the overview is provided here.
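The selectivity search described above amounts to filtering a molecule-by-target score matrix. Here is a minimal sketch of that filter. The scores, target names and both thresholds are invented for illustration and say nothing about the real score scale.

```python
# Hypothetical similarity scores: molecule -> {target: score}.
scores = {
    "mol1": {"target_a": 0.90, "target_b": 0.10, "target_c": 0.20},
    "mol2": {"target_a": 0.95, "target_b": 0.70, "target_c": 0.10},
    "mol3": {"target_a": 0.85, "target_b": 0.25, "target_c": 0.05},
    "mol4": {"target_a": 0.40, "target_b": 0.10, "target_c": 0.10},
}

def selective_hits(scores, target, min_on_target=0.8, max_off_target=0.3):
    """Molecules scoring high on one target and low on every other target,
    ranked by their on-target score (highest first)."""
    hits = []
    for mol_id, by_target in scores.items():
        on_target = by_target[target]
        off_target = max(
            (s for t, s in by_target.items() if t != target), default=0.0
        )
        if on_target >= min_on_target and off_target <= max_off_target:
            hits.append((mol_id, on_target))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

ranked = selective_hits(scores, "target_a")
print(ranked)
```

Here mol2 scores highest on target_a but is dropped because it also scores well on target_b, which is exactly the kind of compound a selectivity filter is meant to catch.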

Watch this space for updates and an unveiling date.


I’ve reported previously on how Microsoft have started to use the ChemSpider web services and have hooked them into InfoMesa. Sam has used our services to add in a Search capability to return the InChI, InChIKey, SMILES, and a URL link to ChemSpider based on a search of a trade name, SMILES, etc. He’s also using the ability to return the chemical structure image. I’ll let the image on his blog tell the story more fully.


Over the past few weeks I have had a few discussions with a member of the ChemSpider Advisory group regarding a concept to create WiChempedia. I’ve enjoyed these conversations with Alex Tropsha (professor and Chair in the Division of Medicinal Chemistry and Natural Products in the School of Pharmacy, UNC-Chapel Hill.) We are like-minded in a number of ways but specifically in what can be done to facilitate delivery of quality information to the chemistry community.

As you will notice if you frequent this blog I am rather a stickler for accuracy and quality (1,2,3). I think it’s important (4). Over the past few weeks I’ve spent more time looking at the quality of data on Wikipedia and trying to figure out the best way to bring together our efforts on ChemSpider to enhance the capabilities of integrated information and to support the quality efforts being made by the WP:CHEM team and help them. I also intend to facilitate the development of our own Wiki environment for chemistry and to generally enhance the tools available to chemists not only for Wikipedia type annotation but also to support Open Notebook Science.

Now, I don’t want to reinvent the wheel. Wikipedia has a lot of what is necessary in terms of being a known system, a following of people and committed supporters in the WP:CHEM team. What I have been hoping for was a shift around structure and substructure searching on the MediaWiki platform but I know that is a tough request as the platform is not built for that type of thing. The InChIKey holds some promise for exact structure searching but does not offer an opportunity for substructure searching without a lookup across a larger database. I want to facilitate information and data sharing further. I do want to provide the type of service that Wikipedia does in terms of general information but also layer cheminformatics tools onto that knowledge and information, allow addition of analytical data, analysis tools, real time predictions and analysis ultimately. This platform should certainly be wiki-enabled.

Decision made. Our intention is to deliver wiki capabilities in ChemSpider and to use the Open Content associated with chemicals and drugs on Wikipedia inside the system. We will then provide an environment for people to continue to add to, enhance and curate the Wikipedia content as well as add their own. Last night (and well into the early morning) I spent some time talking to Martin Walker from WP:CHEM about my concern that we might offend the Wikipedians with our efforts. I did not want them to feel that we were ripping off their hard work; rather, I want our efforts to be seen as supportive and enabling. My intention, as we work through downloading the data, is to check, validate and correct what is sitting on Wikipedia directly, for the benefit of the community. Also, we will of course need to leave all Wikipedia content under the appropriate licensing for others to use. Martin commented that there are tens of mirrors of Wikipedia out there, ripped purely for the purpose of exposure and ad revenue. We are not working from that model. Our intention, as usual, is to build a structure-centric community for chemists, and with so much excellent work done on Wikipedia I want to take advantage of it and also give back through the work we will do.

Two domain names have been grabbed for this project: WiChempedia, for compatibility with Wikipedia, and WeChempedia, to emphasize the community aspects of the project.

If you frequent this blog you will recall that we have made a commitment to Microsoft SharePoint as our future platform for wiki’ing ChemSpider. That is where we believe this work will ultimately be done, but we don’t have the platform in our hands yet.

The Xmas vacation is going to be full of holiday movies and manual examination and curation of the Wikipedia data. Wish us luck!


It was only two days ago that I was talking about being green with envy about the throughput, processes and delivery of the PubChem system. And today I was informed that the deposit is done. The PubChem Data Sources page lists us as contributing 16.8 million compounds to the database. Wow. At last check 3 million had found their way to Entrez and were therefore searchable. The rest will be there very soon.

We apologize that ChemSpider has been a little slow over the past day but we have had other groups downloading the dataset. Unfortunately, our pipes aren’t fat enough to allow all of those multi-threaded downloads, plus all the calculations going on for our new depositions, plus hosting ChemSpider searches, Google InChI indexes at >30,000 per day and all the other things a modern server does to support a working team.

Fortunately for us, the ChemSpider source is now on PubChem’s site and can be downloaded from there. They are used to handling high traffic and, over the years, have created a stable, scalable and performant system. We couldn’t be more honored to have our data there. And, if people want our data, they can exercise the fat pipes rather than wrestling with our download speeds.


If you have examined the predicted physchem properties associated with a structure on ChemSpider you will, on occasion, see something of this nature. Notice that there is a logP value but there is also a logD value. In fact there are two values, one at pH 5.5 and one at pH 7.4. These values were chosen as representative of typical physiological pH values of interest. But what IS logD?

Let’s visit the Wikipedia definition of both logP and logD to start:

logP - The partition coefficient is the ratio of concentrations of un-ionized compound between the two solutions. To measure the partition coefficient of ionizable solutes, the pH of the aqueous phase is adjusted such that the predominant form of the compound is un-ionized. The logarithm of the ratio of the concentrations of the un-ionized solute in the solvents is called log P

logD - The distribution coefficient is the ratio of the sum of the concentrations of all forms of the compound (ionized plus un-ionized) in each of the two phases. For measurements of distribution coefficient, the pH of the aqueous phase is buffered to a specific value such that the pH is not significantly perturbed by the introduction of the compound. The logarithm of the ratio of the sum of concentrations of the solute’s various forms in one solvent, to the sum of the concentrations of its forms in the other solvent is called log D. In addition, log D is pH dependent, hence one must specify the pH at which the log D was measured. Of particular interest is the log D at pH = 7.4 (the physiological pH of blood serum). For un-ionizable compounds, log P = log D at any pH.
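As a rough illustration of how these two definitions relate, here is a minimal sketch for the simplest case: a compound with a single ionizable group, assuming that only the un-ionized form partitions into the organic phase. Under that assumption, log D = log P - log10(1 + 10^(pH - pKa)) for an acid and log D = log P - log10(1 + 10^(pKa - pH)) for a base. The function name and the example logP and pKa values are hypothetical. This is a textbook approximation, not the prediction algorithm behind the values shown on ChemSpider, and it would not describe a compound with several ionizable groups without extending it.

```python
import math


def log_d(log_p: float, pka: float, ph: float, kind: str = "acid") -> float:
    """Estimate log D for a compound with one ionizable group.

    Assumes only the un-ionized species enters the organic phase.
    """
    if kind == "acid":
        exponent = ph - pka   # acids ionize as pH rises above the pKa
    elif kind == "base":
        exponent = pka - ph   # bases ionize as pH falls below the pKa
    else:
        raise ValueError("kind must be 'acid' or 'base'")
    return log_p - math.log10(1.0 + 10.0 ** exponent)


if __name__ == "__main__":
    # Hypothetical acid with logP = 3.0 and pKa = 4.5, at the two pH values above.
    for ph in (5.5, 7.4):
        print(f"pH {ph}: logD = {log_d(3.0, 4.5, ph):.2f}")
    # prints logD = 1.96 at pH 5.5 and logD = 0.10 at pH 7.4
```

Far from the pKa on the un-ionized side the correction term vanishes and log D approaches log P, which matches the statement that the two are equal for un-ionizable compounds.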

With these definitions in mind what would we expect for an ionizable compound such as Zyrtec shown below?

The curve below shows the logD as a function of pH for Zyrtec.

At the two pHs of interest what are the structures of the ionized species? See below:

Why would this be important? In fact it’s crucial in the design of drugs in terms of how they act in the body across the physiological profile exhibited by the human body. But, I’m not going to tell you the story…instead I am going to point you to an article “The Rule of Five Revisited: Applying Log D in Place of Log P in Drug Likeness Filters”. This was one of the most accessed articles published in Molecular Pharmaceutics in 2007. It’s featured on the Most Accessed Articles website here.


We’ve just deposited our ChemSpider structure collection onto PubChem. It amounted to just shy of 18 million compounds. We set a target of 20 million compounds in our database by end of year, and we were actually there until we ran an enormous deduplication process. The deduplicated dataset has yet to be loaded into the database; that will be done over a weekend. Of course, quality not quantity is important…

We have already rolled out an alpha version of the structure deposition gateway and had feedback from a few testers. There has been some re-engineering on the back end and some significant improvements to speed up the approval process for publishing the data online. It will be unveiled in the next few days, once the final test cycles are complete.

What we are looking for now are more depositors who are interested in depositing their data onto ChemSpider. These data can be from commercial databases, open access/free access databases or from personal data collections. There are significant benefits to depositors in terms of creating awareness for their data and hopefully driving traffic to their websites (if appropriate). I’ve put together a short overview regarding the benefits for depositors and you can read it here. Contact me at antonyDOTwilliamsATchemspiderDOTcom if you want to suggest a new data source for me to contact. Onwards and upwards.


Yesterday I announced the availability of the MassSpec web services for ChemSpider and, less than 24 hours later, I am happy to announce that they are already integrated. Egon Willighagen, one of the members of our Advisory Group, has already reported on his integration to ChemSpider with the intention of speeding up metabolomics analysis. He has used Taverna, a workflow and pipelining tool, to set up his workflows. What’s good to see is how easy this was for him to do…well, I assume it was easy since he didn’t need to consult with us. We released the MassSpec web service and voila, he was integrated.

This is what is happening with our other web services too. A number of organizations are now integrated to ChemSpider and using the services on a daily basis.