Archive for the ChemSpider Chemistry Category

Peter Murray-Rust responded to my recent comments about a Free Lunch. There are a number of comments to be made and an exciting opportunity to use Open Data and linking from ChemSpider.

I’d asked the question about how many records there were on CrystalEye. In our world a unique record is a unique InChI, not so on CrystalEye and appropriately so as the crystal structure itself is presumably the unique record. Makes sense.

PMR> We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) does a good job on disambiguating by cell dimensions but this is not foolproof and indeed no method is.

What we will do with multiple crystal structures for a single chemical structure is link all unique crystal structures from the unique chemical structure. In this way people can query the chemical structure and find all associated analytical data - spectra and crystallographic files. If we were to list the number of unique depositions on ChemSpider I think we would be around 40 million depositions. That is an estimate, though!
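As a rough illustration of that linking idea (a minimal sketch, not the actual ChemSpider code), the crystal structure entries can be grouped under the InChI of the compound they describe. The function name, the example InChIs and the example.org URLs are all made up for illustration.

```python
from collections import defaultdict

def link_crystal_structures(entries):
    """Group crystal-structure URLs under the InChI of the compound they describe.

    `entries` is an iterable of (inchi, url) pairs. Repeated URLs for the
    same InChI are kept only once, in first-seen order.
    """
    by_inchi = defaultdict(list)
    for inchi, url in entries:
        if url not in by_inchi[inchi]:
            by_inchi[inchi].append(url)
    return dict(by_inchi)

# Hypothetical data: two crystal structures for one compound, one for another.
entries = [
    ("InChI=1S/CH4/h1H4", "https://example.org/crystal/1"),
    ("InChI=1S/CH4/h1H4", "https://example.org/crystal/2"),
    ("InChI=1S/H2O/h1H2", "https://example.org/crystal/3"),
]
links = link_crystal_structures(entries)
```

A query on the chemical structure then only needs its InChI to return every linked crystal structure.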

PMR> The main duplication comes from the Crystallography Open Database which has about 45,000 structures.

I looked at the Crystallography Open Database this morning. It states on the home page “Updated daily: 68268 entries in the COD”. We may have an opportunity with the COD to link up to their data and reduce the need for us to host CIFs. Excellent…we’re all for reducing workload and providing links into other systems. It’s what we do.

PMR> The only thing stopping us putting them (AJW> The structures from CrystalEye)  in Pubchem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon. When it’s finished it will be in RDF/CML.

This is great news. This means that after the summer we can download the data directly via PubChem and link up to CrystalEye that way. Perfect. We’ll stop working on integrating to CrystalEye now and wait for the integration path via PubChem and focus on other data sources. Thank you Peter, Nick, Andrew and Jim!!! That said, I don’t believe that PubChem will take CML; they will convert it using their tools to produce their own compatible formats, InChI being one of them. That will break organometallics etc. UNLESS PubChem are going to adopt CML now, and that would be an interesting positive shift in terms of a sign of support for the format. A strong positive. I’ll chat with the PubChem team so that if CML is coming we can consider adopting it in some way and be ready.

From my post “AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.”

PMR: Chicken and egg… :-) You won’t adopt it until other people adopt it and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents) as in semantic publishing and the results of test-mining.

It’s nice to know that ChemSpider has that type of influence now. It’s good to see it going into Accelrys’ software and I had heard that from Dan’s blog and had added the CML Blog to my reader. I’m definitely watching and willing to follow. We’re busy leading so many other things right now we’ll wait for adoption and then jump on it like a “hobo on a muffin”.


In a recent post about ChemSpider we’ve been accused of wanting a Free Lunch. I copy a segment of the post and comment with insertions.

“Data are normally produced for a particular purpose and the reuse them for another cost money. I’ll exemplify this by taking CrystalEye data - about 120,000 crystal structures and 1 million molecular fragments - which were aggregated, transformed and validated by Nick Day as part of his thesis. (BTW Nick is writing up - it’s a tribute to his work that CrystalEye runs without attention for months on end).

AJW> It is true…it is a tribute to Nick that CrystalEye can run for months without attention. Kudos. I am interested in how much pressure the site is under. How many searches/users in a day etc.? We find that our struggles in uptime (and these are negligible) are primarily based on stress on the servers. For nighttime users tonight things will have been slow…we deposited over 100,000 new molecules from 5 new data sources. That does create some slowness. We will hit about 40,000 transactions today. Our problems are ISP issues and powercuts. But we are also not in a University using thick pipes etc.

One comment…it was 130,000 structures according to a previous blog and has been expanding since then from daily depositions. Right now I would expect it to be 140,000 rather than 120,000. When we did try scraping the data our best estimate was about 90,000. We might have missed something in our scraping and it’s why we asked for a dump of the data.

The primary purpose of CrystalEye was to allow Nick to test the validity of QM calculations in high-throughput mode. It turned out that the collection might be useful so we have posted it as Open Data. To add to its value we have made it browsable by journal and article, searchable by cell dimensions, searchable by chemical substructure and searchable by bond-length. This is a fair range of what the casual visitor might wish to have available. Andrew Walkingshaw has transformed it into RDF and built a SPARQL endpoint with the help of Talis. It has a Jmol applet and 2D diagrams, and links back to the papers. So there is a lot of functionality associated with it.

AJW> The team has done a good job in putting the site together. The JMol applet is an excellent utility for us all to use and thanks to that team for sure! Egon has been challenging us to RDF the site and it’s on our list, but keeps getting pushed down based on other requests. Since he’s the only voice asking it will keep getting pushed down unfortunately.

This has come under some criticism to the effect that we haven’t really made it Openly available. For example Antony Williams(Chemspider blog) writes (Acting as a Community Member to Help Open Access Authors and Publishers):

“This [interaction with MDPI] is contrary to some of my experiences with some other advocates of Open Data and Open Access where trying to get their “Open Data” is like pulling teeth.”

PMR: I assume this relates to CrystalEye - I don’t know of any other case.

AJW> There are other examples and he’s right. He doesn’t know of them and I’d prefer he not rant on my behalf so I’ll not name them.

Antony and I have had several discussions about CrystalEye - basically he would like to import it into his database (which is completely acceptable) but it’s not in the format he wants (multi-entry files in MDL’s SDF format, whereas CrystalEye is in CML and RDF).

AJW> To clarify, again. I DON’T want to import CrystalEye into ChemSpider. I DON’T! All I want is the set of structures and unique associated URLs so that users of ChemSpider can find that there is crystal structure information over on CrystalEye and can click the link and be on CrystalEye and get the benefit of Nick, Andrew and Peter’s work. I don’t want to reproduce their effort. I want to integrate to it. I’ve said it many times on Peter’s blog and on this one.

This type of problem arises everywhere in the data world. For example the problem of converting between map coordinates (especially in 3D) can be enormous. As Rich says, it costs money. There is generally no escape from the cost, but certain approaches such as using standards such as XML and RDF can dramatically lower the costs. Nevertheless there is a cost. Jim Downing made this investment by creating an Atom feed mechanism so that CrystalEye could be systematically downloaded but I don’t think Chemspider has used this.

AJW> If Jim can contact me by email and provide me with detailed instructions to download the entire file of structures ONLY and their associated URLs that would be excellent. I’ll send the request to him tonight.

The real point is that Chemspider wishes to use the data for a different purpose from which it was intended.

AJW> The problem is that stories keep getting made up about what we want. ALL I want to do is drive traffic to CrystalEye so that people who don’t know about it can use it. No more than that. I don’t get how trying to provide an integration path is so difficult. I’ll ask Jim to help.

That’s fine. But as Rich says it costs money. It’s unrealistic to expect we should carry out the conversion for a commercial company for free. We’d be happy to a mutually acceptable business proposition and it could probably be done by hiring a summer student.

AJW> I am interested in what commercial benefit integrating to CrystalEye can have. It’s work on our side. I’m not sure what a mutually acceptable business proposition would look like. It can’t be that much work to send us a set of InChIStrings and URLs for the CrystalEye dataset…they already exist on CrystalEye. So, I’ll assume that this is a last comment on “No thanks to CrystalEye data in ChemSpider”. I have to ask why not put them in PubChem. Since PubChem is held as the standard of Open Data, why not put CrystalEye there?

I continue to stress that CrystalEye is completely Open. If you want it enough and can make the investment then all the mechanism are available. There’s a downloader and converters and they are all Open (though it may cost money to integrate them).

AJW> Just fyi ChemSpider has adopted Creative Commons licenses.

FWIW we are continuing to explore the ways in which CrystalEye is made available. We’re being funded by Microsoft as part of the OREChem project and the result of this could represent some of the way in which the Web technology is influencing scientific disciplines. We’d recommend that those interested in mashups and re-use in chemistry took a close look at RDF/SPARQL/CML/ORE as those are going to be standard in other fields.

AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.

I have spoken previously about the challenges of Scraping CrystalEye Content and staying in relationship with publishers. I have approached CAS and spoken with the Copyright team at ACS. In December of last year I spoke about the 5 month delay to discuss with ACS about whether or not we could scrape CIF files from ACS journals directly. Well, I had a nice chat with two ACS people in New Orleans, one of them from ACS Pubs. We had a nice chat about ChemSpider and I answered a lot of questions about what we were doing, where we were going, how we are “funded” (we are not!) etc. Many pages of notes were taken. At the end of the meeting I asked the question “So, relative to my question about CrystalEye and scraping CIFs. Are Supplementary Data ok to scrape or not?”

The answer? “We haven’t made a decision yet. We need to discuss”.

Are crystal structures really that special? It’s been difficult to get JUST the structures associated with even Open Data. Now I’ve been waiting over 7 months for a question to be answered by ACS…and it’s binary. YES or NO.

At this point I give up. Peter Murray-Rust has had ACS CIFs scraped from their publications for a LONG time. And continues to scrape them. Cambridge University/Unilever School of Informatics didn’t get permission and have been very vocal about what they’ve done and no legal action re. copyright has been taken so I’ll assume it’s not an issue. If it’s not an issue then we can go ahead.

If we can go ahead then why wouldn’t we? We have…we already have scraped the collection of CIFs from ACS, from a broader range of ACS journals than CrystalEye taps into. It’s Supplementary Data, it’s non-copyrightable and now it’s ours to publish. We already support CIF displays on ChemSpider so what we need to do now is to mass convert/handle the data and deposit onto ChemSpider. We also have the IUCr CIFs to deposit. I guess ChemSpider will soon become “CrystalEye 2” as we host the data. That said we are NOT crystallographers so I have an open request to the community for someone with interest/skills in crystallography to join our advisory group and support this effort. Feel free to ping me.


I am in catch up mode tonight reading a week long backlog of blog posts. I’ve caught up tonight with some of Peter’s posts about semantic chemical authoring (1,2). I’ll respond shortly with comments regarding our own efforts in pulling together the web. I agree with Peter that improved semantic chemical authoring tools are necessary but we are focused right now on doing what we can with what is already available online. What it takes is coding, some regular expressions, some visual inspection and work. Lots of work. More later…

One of the things we are working on is connecting blog posts and wiki pages to ChemSpider as evidenced by our work with Molecule of the Day and our integrations to TotallySynthetic posts on an ongoing basis. What we expect of the authors though is that they author with care. We are generally using name to structure conversion capabilities to generate the chemical structures for connecting to on ChemSpider. Paul Doherty at TotallySynthetic used to provide us with InChIStrings and InChIKeys to connect up to but stopped because it was a lot of work I believe. Molecule of the Day generally discusses fairly simple molecules relative to TotallySynthetic’s COMPLEX molecules. Manual inspection is unfortunately necessary even in the simplest of cases. And it IS time-consuming. Robots will gather information and, in my judgment, PROLIFERATE incorrect data unless someone is going to do the work to inspect OR the system provides a curation platform to quickly remove errors.

I blogged tonight on the ChemConnector blog about the importance of dashes and spaces in systematic names. It should be very clear from that post how important it is. It is a major challenge to use name to structure conversion tools on chemical names that are imperfect and do not represent the structure they are meant to represent. There needs to be respect for chemical names and as we move them from system to system, database to database we need to do our best to retain their integrity. This HAS BEEN a major challenge for us as we scrape data from various data sources OR when people provide us data files such as Wikipedia and we need to check name-structure connections. It is not difficult to lose the integrity of a chemical name.

Back to Peter Murray-Rust’s discussions about semantic chemical authoring. Peter is talking about building a site of aggregated information from various websites.

PMR> “We’re in the process of aggregating a repository of common chemicals (somewhere in the range 1000-10000 entries) and we are taking data from various publicly available web sites. Typical sources are Wikipedia, any aggregator with Open Data policies and MSDS sheets (chemical safety information). One such site is INCHEM (Chemical Safety Information from Intergovernmental Organizations) which lists about 1500 materials (most are chemical compounds though some are mixtures).”

Readers of this blog will know we’ve already done this. Both for NIOSH and for the Oxford MSDS set. We took a select subset of information. We integrated this with our Wikipedia set of data on ChemSpider (and, of course, also on WiChempedia).

PMR> “…From this we extract the most important information and turn it into CML - names, formula, connection tables, properties, etc.”

Our process was extraction of the same (but there aren’t any connection tables to grab from NIOSH or Oxford MSDS), then we converted names to structures and ran some “confirmation processes” including visual inspection when necessary.

PMR> “There are a large number of simple but niggly lexical problems, such as the degrees symbol for temperature (totally inconsistent within and between documents) And the semantics - how do you record a boiling point as “between 120 and 130 at 20 mm Hg”? (CML can do this, but it takes work to do the conversion.)”

Oh yes…these are problems. The inconsistencies between records are a pain but can be dealt with by mapping as shown here. Recording a boiling point “between 120 and 130 at 20 mm Hg” is no issue really. See this figure for something just as complex regarding “loss of waters”.
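To give a flavour of how such a value can be captured, here is a minimal sketch using a regular expression. It is not the mapping we actually run; the pattern, the function name and the dictionary keys are assumptions made up for this example, and the source phrase does not state the temperature unit, so none is recorded.

```python
import re

# Assumed phrasing: "between <low> and <high> at <pressure> mm Hg".
BP_PATTERN = re.compile(
    r"between\s+(?P<low>-?\d+(?:\.\d+)?)\s+and\s+(?P<high>-?\d+(?:\.\d+)?)"
    r"\s+at\s+(?P<pressure>\d+(?:\.\d+)?)\s*mm\s*Hg",
    re.IGNORECASE,
)

def parse_boiling_range(text):
    """Return the low/high temperature and the pressure in mm Hg, or None."""
    match = BP_PATTERN.search(text)
    if match is None:
        return None
    return {
        "low": float(match.group("low")),
        "high": float(match.group("high")),
        "pressure_mmHg": float(match.group("pressure")),
    }

record = parse_boiling_range("between 120 and 130 at 20 mm Hg")
```

Anything that does not match the pattern comes back as None, which is the signal to send that record for visual inspection rather than guess.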

PMR> “And the sites have errors. Here’s a rather subtle one which the average human would miss (we needed a machine to find it). You’ll have to go to the page for chloromethylmethylether - I daren’t try to transcribe it into Wordpress. The error is in the displayed page (no need to scroll down).”

There are a couple of issues here. We actually prefer NOT to use either the molecular formula or the molecular weight. In our Wikipedia work we found a lot of errors around these parameters and for the Wikipedia work at least the name, SMILES, InChI etc were more correct while MFs and MW would be wrong.

There may absolutely be value in using both MF and MW to confirm the structure and I definitely see the value. This would definitely help resolve some of the Nomenclature-Structure issues that can arise from converting the names! One of the things that occurred in the blog post was that my earlier comments came to pass regarding removal of a space in the chemical name.
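A check of that kind could look something like the sketch below, which computes a molecular weight from a simple molecular formula and compares it with the reported value. This is only an illustration under stated assumptions: the function names and the tolerance are made up, the atomic weight table is deliberately tiny, and formulas with brackets, charges or isotopes are not handled. The formula C2H5ClO used in the example is that of chloromethyl methyl ether.

```python
import re

# Standard atomic weights for a handful of elements (illustrative subset).
ATOMIC_WEIGHTS = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999, "Cl": 35.45}

def molecular_weight(formula):
    """Sum atomic weights for a flat formula such as 'C2H5ClO'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total

def weights_agree(formula, reported_mw, tolerance=0.05):
    """True when the reported MW matches the formula within the tolerance."""
    return abs(molecular_weight(formula) - reported_mw) <= tolerance

mw = molecular_weight("C2H5ClO")  # roughly 80.51
```

If the structure produced by a name conversion gives a formula whose weight disagrees with the value reported on the source page, that record is a candidate for manual inspection.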

The names on the ORIGINAL INCHEM page were:

CHLOROMETHYL METHYL ETHER, Chloromethoxymethane with a CAS Number of 107-30-2 and an EINECS number of 203-480-1.

There was NOT a name listed as “chloromethylmethylether” which PMR listed in his post. The only difference is dropping one space. It’s only an accidental removal but dramatically changes the meaning of the record. This is where Peter’s use of either MF or MW becomes crucial! That loss of a space CAN cause big problems as described here. Does it cause a problem this time? Check below…look at the name with and without the space and the result of conversion in a commercial Name to Structure software package.

The CORRECT structure is on ChemSpider here and already includes the following Supplemental Information.

User Data

  • experimental physchem properties
    • Boiling Point: 138 °F

    • Freezing Point: -154 °F

    • Specific Gravity: 1.06

    • Solubility: Reacts

    • Ionization Potential: 10.25 eV

  • miscellaneous
    • Appearance: Colorless liquid with an irritating odor.

    • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

    • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

    • Symptoms: Irritation eyes, skin, mucous membrane; pulmonary edema, pulmonary congestion, pneumonitis; skin burns, necrosis; cough, wheezing, pulmonary congestion; blood-stained sputum; weight loss; bronchial secretions; [potential occupational carcinogen]

    • Target Organs: Eyes, skin, respiratory system Cancer Site [in animals: skin & lung cancer]

    • Incompatibilities and Reactivities: Water [Note: Reacts with water to form hydrochloric acid & formaldehyde.]

    • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated/Daily Remove: When wet (flammable) Change: Daily Provide: Eyewash, Quick drench

    • Exposure Limits: NIOSH REL : Ca See Appendix A OSHA PEL : [1910.1006] See Appendix B

Peter IS right. We DO Need Semantic Chemical Authoring Tools. However, we’ve already gone a long way without them and what is already online CAN be dealt with. Incredible care is needed with nomenclature  and just spaces can mess things up! I know we have errors on our database - both structures and names. What is to be expected with 20 million structures and associated data? However, we are cleaning them up, rather quickly. We are scraping and integrating data at an increasing rate having learned a lot of lessons over the past year.

I’ll comment on Peter’s other Semantic Chemical Authoring posts in the next couple of days.


I’ve blogged previously about us adding safety and toxicity data to ChemSpider. We are busily sourcing new information from other data sources to add information and in the past couple of days we have added NIOSH data as it is a rich source of additional safety information. For example, the record for 1,2,3-trichloropropane shows:

  • First Aid: Eye: Irrigate immediately Skin: Soap wash Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, nose, throat; central nervous system depression; in animals: liver, kidney injury; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, central nervous system, liver, kidneys Cancer Site [in animals: forestomach, liver & mammary gland cancer]

  • Incompatibilities and Reactivities: Chemically-active metals, strong caustics & oxidizers

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet or contaminated Change: No recommendation Provide: Eyewash, Quick drench

Some additional examples are here: Temefos, Warfarin and Allyl Alcohol. Note that each of these also has a coincident extract from Wikipedia. We are therefore integrating Wikipedia articles, safety, toxicity, experimental and predicted properties. Our plan for semanticising and integrating the chemistry web is clearly well underway.


Frequent users of ChemSpider who use the identifiers for searching will commonly find a mixture of “names” and “database IDs” as well as “registry numbers”. Since the number of database IDs can sometimes swamp the synonyms and chemical names we chose to separate them. We have run some regular expressions across the database to separate database IDs out. We have left registry numbers (marked by [RN]), EINECS numbers (marked by [EINECS]) and Wiswesser Line Notation (marked by [WLN]) in with the synonyms.

database-ids.png

Unfortunately there are MANY flavors of Database IDs and we might have missed some. If you come across any “potential” DB IDs and think we should segregate them out please use the POST COMMENTS ability to inform us. Simply Post a Comment to a record and suggest we check the identifiers out for potential DBIDs. Thanks!
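For anyone curious what this kind of separation looks like, here is a minimal sketch. It is not the set of expressions we actually ran: the single pattern below (a short uppercase prefix followed by digits) and the function name are assumptions for illustration, and the three database-style IDs in the example are made up. Identifiers tagged [RN], [EINECS] or [WLN] stay with the synonyms.

```python
import re

# Tagged identifiers that are deliberately left in with the synonyms.
KEEP_TAGS = ("[RN]", "[EINECS]", "[WLN]")

# Hypothetical pattern: uppercase prefix, optional separator, three or more digits.
DB_ID_PATTERNS = [
    re.compile(r"^[A-Z]{2,10}[-_ :]?\d{3,}$"),
]

def split_identifiers(identifiers):
    """Return (synonyms, db_ids), preserving the input order within each list."""
    synonyms, db_ids = [], []
    for ident in identifiers:
        text = ident.strip()
        if text.endswith(KEEP_TAGS):
            synonyms.append(text)
        elif any(pattern.match(text) for pattern in DB_ID_PATTERNS):
            db_ids.append(text)
        else:
            synonyms.append(text)
    return synonyms, db_ids

synonyms, db_ids = split_identifiers([
    "Chloromethyl methyl ether",
    "107-30-2 [RN]",
    "203-480-1 [EINECS]",
    "ZINC00000001",   # made-up ID
    "NSC 12345",      # made-up ID
    "CHEBI:12345",    # made-up ID
])
```

A single pattern like this will inevitably miss some flavors of ID and may catch the odd all-caps synonym that ends in digits, which is exactly why feedback on missed cases is useful.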


Over the past few months we have been working hard to integrate the ChemRefer system into ChemSpider. I have reported on our recent rollout of the first level of integration and Will Griffiths has started a discussion at the Open Chemistry Web blogpage. Will has played a key role in facilitating our relationships with publishers in the Open Access domain. Many have had an appreciation for his ChemRefer website.

When we connected ChemRefer to ChemSpider one of our first commitments was to facilitate direct structure/substructure searching of the IUCr publications. Will has indexed the publications from 1948 to present day. He has extracted chemical names and systematic names from the titles and abstracts of the articles. Since we have access to name to structure conversion capabilities from OpenEye we can convert many of these names to chemical structures in an automated fashion. As a result of my work on the curation of Wikipedia chemical structures I also have access to some of the tools I was involved in developing at ACD/Labs (ACD/ChemFolder and ACD/Name). This has allowed me to curate and convert other extracted names in a manual manner. This is NOT the most efficient way to conduct this process as it requires a lot of eyeballing. However, it is this type of approach, reviewing many hundreds of extracted names, which has allowed us to optimize the process, recognize where potential failures could arise and improve the chance to extract the best set of names to work with for conversion purposes. Now we are at the whim of name to structure conversion software in terms of the accuracy of the conversion but this can be validated (see later). Since organometallic names are very difficult to convert to structures we can only deal with organic structures at present. However, we do also have many validated synonyms in our hands too on ChemSpider and that provides a useful dictionary for conversion.

Ultimately I hope that other commercial providers of batch Name to Structure conversion software modules will see the potential value of collaboration with ChemSpider and give us access to their software tools to assist with our project.

We are in the process of curating and checking the names and chemical structures for all that we have extracted so far but the depositions have already started. Of the structures converted we have found about half of them are already on ChemSpider (example) but another half are unique (example). At present we have connected almost a thousand articles via DOIs from ChemSpider to the IUCr articles. The best estimates at present are that we will be able to connect about 8500 structures to articles.

Since this is the IUCr collection we will validate our approach in the future and, with their permission, we will use the CIFs to validate our extracted structures against the CIF-based structures. This will give us an indication of the errors one could expect from an automated extraction of chemical names from articles and conversion to structures using Name to Structure. We are also going to continue our curation process and allow people to curate structures from IUCr (and other articles) to ChemSpider. There will be cases where some names/structures from articles have not been converted to structures and these will need manual submission. This will be possible through the manual deposition system.

We are looking forward to providing similar services to other publishers should they be desired. This is our proof of concept…and it’s working.


Two particular Open Access resources providing enormous value to Life Sciences nowadays are PubMed and PubChem. I’m sure everyone reading this blog has heard of them and used them both. We previously announced our deeper collaboration with ChemRefer. We have been busily working in the background to integrate ChemRefer into ChemSpider and the very “alpha” version of the integration is now available online. It can be accessed from the Search menu as shown below. Simply click on ChemRefer.

chemrefer.png

When selected it will open the ChemRefer search window and a list of Publishers who have allowed us to index them as shown below. All can be searched or the search can be limited by using the Check Boxes.

chemrefer2.png

The search results for searching on taxol are shown here. The figure below, while too small to see detail, shows that the word searched is highlighted in the text.

chemrefer3.png

Notice that the RSC is no longer indexed at their request. We were sad to lose them from our searches.

We have also integrated to Entrez, for searching health sciences databases at the National Center for Biotechnology Information (NCBI) website. This can be searched by choosing NCBI Entrez from the Search drop down menu. It is available here. An example of the results is shown here. For this first integration we limit the results to 100 hits.

entrez1.png

Clicking on any of the titles is a direct hyperlink to the article in PubMed Central.

These integrations to these text searching engines are only the first part of our work. We have already been extracting chemical names and linking them up to structures in the ChemSpider database. Our first efforts in this area will be unveiled shortly. In this case it will be possible to simultaneously perform structure/substructure and text-based searches. This is a very significant undertaking but we are well underway to bringing our vision of structure and text based indexing of the Open Access literature to fruition.


Recently I posted about trying to identify the correct structure of Ginkgolide B and the need for curation of ChemSpider entries. David Barden from the RSC commented on my post:

“Antony - I am an organic chemist working on the RSC journal in which the published structure of ginkgolide B appeared, and am pretty sure that it is correct, having been written by a regular author of ours familiar with the literature on the ginkgolides. I think the problem might lie with the representation (and/or conversion to InChI) of the structures - even in the one structure you indicated as having “full stereochemistry”, it seemed to me that 3 stereocenters were undefined, from a visual inspection of the structure. Apart from these stereocenters, the structure and InChI (generated myself) otherwise seem identical, so I’m not sure why the last part of the string in the ChemSpider entry is “20+” rather than “20-”. The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.”

I have redrawn the structure of Ginkgolide to echo that shown in the RSC journal and it is shown below alongside a cropped image from the article:

compare-the-two.png

I’m pretty sure I have the structure correct. The InChIString is:

InChI=1/C20H24O10/c1-6-12(23)28-11-9(21)18-8-5-7(16(2,3)4)17(18)10(22)13(24)29-15(17)30-20(18,14(25)27-8)19(6,11)26/h6-11,15,21-22,26H,5H2,1-4H3/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1

and the InChIKey is:

SQOJOAFXDQDRGF-MMQTXUMRBS

In the previous post I searched on Ginkgolide B as an identifier to see how many Ginkgolide B’s there are. There are 6 as shown here.

I searched on the entire InChIKey and found no hits. This means that the structure is NOT on ChemSpider.

I then searched on the CONNECTIONs captured within the InChIKey and represented by: SQOJOAFXDQDRGF . I received 18 hits in total varying in completeness in terms of incomplete stereochemistry and DIFFERENT but fully assigned stereochemistry. I searched the entire InChIKey on Google (SQOJOAFXDQDRGF-MMQTXUMRBS) but received no hits. Just to check I then searched the InChIString shown above on Google. Surprisingly, I DID get a hit! It was for this structure. I was puzzled and a comparison of the strings showed a difference in ONE section of the string, the stereo layer.

Searched on Google: /t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1

Found by Google: /t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+/m1/s1

See the difference? ONE stereocenter… 20- versus 20+ . Thank goodness we are moving to InChIKeys rather than InChIStrings since the majority of people would likely miss the detail. I did the first time! So, based on all of my searches the structure of Ginkgolide B as represented in the article published by the RSC is NOT in the ChemSpider database. I agree with David Barden when he comments “The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.” It is very complex and time-consuming and the hope is that comparison of InChIKeys, specifically the second part of the key, will help catch the differences in a more facile manner.
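Spotting that one sign by eye is hard, but it is easy to do mechanically. The sketch below is a minimal illustration, with made-up function names and not ChemSpider code, of two checks: taking the first block of an InChIKey (the part that captures the connections) and listing the stereocenters whose parity differs between two InChI stereo layers. It assumes the /t layer only contains entries of the form number followed by + or -.

```python
def connectivity_block(inchikey):
    """Return the first block of an InChIKey (the part before the first dash)."""
    return inchikey.split("-")[0]

def stereo_layer(inchi):
    """Return {atom_number: parity} parsed from the /t layer, or {} if absent."""
    for layer in inchi.split("/"):
        if layer.startswith("t"):
            centers = {}
            for item in layer[1:].split(","):
                centers[int(item[:-1])] = item[-1]
            return centers
    return {}

def stereo_differences(inchi_a, inchi_b):
    """Return {atom_number: (parity_a, parity_b)} for centers that differ."""
    a, b = stereo_layer(inchi_a), stereo_layer(inchi_b)
    return {
        atom: (a.get(atom), b.get(atom))
        for atom in sorted(set(a) | set(b))
        if a.get(atom) != b.get(atom)
    }

# The two stereo layers compared above.
searched = "/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1"
found = "/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+/m1/s1"
differences = stereo_differences(searched, found)
skeleton = connectivity_block("SQOJOAFXDQDRGF-MMQTXUMRBS")
```

Here the only entry in `differences` is center 20, with parities `-` and `+`, matching what the visual comparison eventually turned up.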

The question, unfortunately, remains. What IS the correct structure of Ginkgolide B? For now I have assumed that the one in the RSC article is correct and have added the structure to the database using the normal deposition process and have associated it with the RSC article and the blog discussions on ChemSpider. If it turns out it is not correct then I will leave the structure and the connection to the article but remove the identifier Ginkgolide B.

suppinfo.png

We have started to introduce new capabilities onto ChemSpider in preparation for our shift from ChemSpider Beta to ChemSpider RELEASE VERSION to celebrate our one year anniversary. You should see a number of incremental improvements happening over the next few weeks. I’ll highlight as many of them as I can as we release them.

Let’s start with new Search Capabilities. On the search screen you will see a drop down menu. You can now select from 5 different types of searches. This post will highlight only the “Structure” type search. Details about the others will follow.

searches.png

It appears that many people believe that ChemSpider is only a text-based search. The reality is that we went live with structure and substructure search in version 1. For sure we have improved things over the months and the searches are better and faster, but structure searching has always been present.

To perform structure searching choose the Structure Search and you will see:

structuresearch1.png

Now click on Input Structure and a screen for submitting structures will be opened. There are three ways to do this. The Convert option will allow you to input a SMILES string, InChI string or chemical name and convert it to the structure. If the structure is correct select ACCEPT and perform the search. Alternatively load a molfile by browsing and uploading, then click ACCEPT. If you want to draw a structure or edit one you have converted, simply select Edit and draw into the applet.

structuresearch2.png

The manual for the applet is here. The applet will look like this:

structuresearch3.png

These capabilities have been in place for a while. Now what we have done is enable you to search in different ways from this place. Specifically…

structuresearch4.png

These searches are very valuable, specifically in relation to the issue of tautomers and structure skeletons. As I have shown in a previous post on ginkgolide B, in some cases there are many similar structures in the database. Searching on the skeleton will find them all. A search on Taxol shows 1 exact structure, 1 tautomer of the exact structure and 42 structures with the same skeleton, as shown below.

structuresearch5.png
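One way to picture a "same skeleton" search is to group records by the connectivity block of the InChIKey, the part before the hyphen. This is only a sketch under that assumption; how ChemSpider actually implements the skeleton search is not described here. The first key below is the Ginkgolide B key from the earlier post, while the other two are made-up placeholders, not real InChIKeys.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def connectivity_block(inchikey: str) -> str:
    """Return the part of an InChIKey before the first hyphen."""
    return inchikey.split("-", 1)[0]


def group_by_skeleton(inchikeys: Iterable[str]) -> Dict[str, List[str]]:
    """Group keys that share the same connectivity block."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in inchikeys:
        groups[connectivity_block(key)].append(key)
    return dict(groups)


keys = [
    "SQOJOAFXDQDRGF-MMQTXUMRBS",  # key quoted in the Ginkgolide B post
    "SQOJOAFXDQDRGF-AAAAAAAAAA",  # hypothetical stereo variant (placeholder)
    "ZZZZZZZZZZZZZZ-BBBBBBBBBB",  # hypothetical unrelated record (placeholder)
]
for block, members in group_by_skeleton(keys).items():
    print(block, len(members))
```

Records that differ only in stereochemistry land in the same group, which is the behaviour the skeleton search relies on.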

Enjoy the new capabilities. We welcome your feedback.

Tonight I finished an article on Public Chemistry Databases. During that article I commented on the size of the Public Chemistry Databases versus the commercial databases. There have been numerous discussions in the blogosphere about the size of databases such as PubChem relative to the CAS Registry. Recently PubChem and ChemSpider headed towards 20 million structures. The CAS Registry is about 33 million.

Now, I don’t know how much duplication there is in the Registry but I can comment on what is in ChemSpider and likely in PubChem. Here’s a basic comment about molecules with complex stereochemistry. They tend to exist MULTIPLE times in the database due to different variants of stereochemistry. Let’s examine Ginkgolide B. The structure below is taken from a recent RSC article. I was interested to see whether we had the “correct” structure of Ginkgolide B on ChemSpider, assuming that the correct structure is the one shown on the RSC webpage.

ginkgolide-b.png

A search on the name Ginkgolide B turned up a total of 6 structures. The connectivities are the same for all structures. The ONLY difference is in the stereochemistry. Take a look at the structures in Table View. There is one structure with full stereochemistry expressed. This one comes from PubChem, Thomson Pharma and xPharm. With full stereochemistry it might be safe to assume it is correct.

However, even for Taxol there are structures with complete stereochemistry and they are different: Structure 1, Structure 2, Structure 3, Structure 4 and Structure 5

I actually gave up looking eventually…here are the different complete stereochemistries. Look carefully…

t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36+,37+,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36-,37+,38+,40-,45-,46+,47-

t31-,32-,33+,35-,36-,37-,38-,40-,45-,46-,47-

t31-,32-,33+,35-,36+,37-,38-,40-,45-,46-,47-
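Rather than looking carefully, the five layers can be lined up center by center to see where the records agree and where they disagree. A small sketch follows, with illustrative function names and the five stereo layers listed above typed in as plain strings.

```python
from typing import Dict, List, Tuple


def parse_centers(layer: str) -> Dict[int, str]:
    """Turn 't31-,32+' into {31: '-', 32: '+'}."""
    tokens = layer.lstrip("/t").split(",")
    return {int(tok[:-1]): tok[-1] for tok in tokens if tok}


def split_by_agreement(layers: List[str]) -> Tuple[List[int], List[int]]:
    """Return (centers every record agrees on, centers that vary)."""
    parsed = [parse_centers(layer) for layer in layers]
    atoms = sorted(set().union(*parsed))
    agreed, contested = [], []
    for atom in atoms:
        signs = {record.get(atom) for record in parsed}
        (agreed if len(signs) == 1 else contested).append(atom)
    return agreed, contested


taxol_layers = [
    "t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+",
    "t31-,32+,33+,35-,36+,37+,38-,40-,45+,46-,47+",
    "t31-,32+,33+,35-,36-,37+,38+,40-,45-,46+,47-",
    "t31-,32-,33+,35-,36-,37-,38-,40-,45-,46-,47-",
    "t31-,32-,33+,35-,36+,37-,38-,40-,45-,46-,47-",
]
agreed, contested = split_by_agreement(taxol_layers)
print("agreed:", agreed)        # agreed: [31, 33, 35, 40]
print("contested:", contested)  # contested: [32, 36, 37, 38, 45, 46, 47]
```

Only four of the eleven centers are assigned the same way in all five records, which is why a name search alone cannot tell you which record to trust.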

Question for ChemSpider Users - there are actually WAY MORE than 10 Taxol skeletons on ChemSpider. Can anyone figure out how many? It actually takes one search to find them all!

We believe this is the correct structure of Taxol.

Back to Ginkgolide B. I redrew the structure shown in the RSC article (and as shown below).

ginkgolide-b_2.png

Generating the InChIKey for this structure and performing a search on ChemSpider gave me no hits. It looks like either the RSC structure is wrong OR all of the six structures from all of the different sources are wrong. As mentioned, there is actually only one Ginkgolide B structure (a structure with the associated identifier) on ChemSpider with full stereochemistry. The stereo for that structure is:

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+ ChemSpider

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20- RSC Stereo

There is ONE stereocenter difference.

This is what curation is all about. The question now is which one is correct? Is it RSC? Is it the structure on ChemSpider? Can anyone validate? Did I mess something up in the comparison? (It happens!)

Now, for the LONG list of Ginkgolide B structures on ChemSpider shown here what do users think we should do? If we simply remove ALL labels for the incorrect structures then we will remove all links into other databases that contain information for “their Ginkgolide B”. If we collapse all links into the correct Ginkgolide B on ChemSpider, and bring 6 records into one, then the structures to which the correct structure links are actually incorrect in the linked databases (but useful information exists there).

Quite the conundrum. I’d appreciate feedback!

I’ve been involved in a number of conversations recently around how monoisotopic masses can be used and the chance of “elucidating a structure” from a molecular formula. There are some shockingly naive views of this possibility. With the availability of accurate mass determinations by mass spectrometry, and the possibility to extract a molecular formula from the data, there are some who believe it is possible to “elucidate structures” using a monoisotopic mass. Let’s clear this naivety up…

Recently I gave a presentation at a local university regarding informatics. During the presentation I asked the students how many structures could be generated “within the rules of basic organic chemistry” for some very short elemental formulae. General rules means no inappropriate valences but no limitations on the nature of the rings (except none based on 2 carbons :-) ) etc. EVERYONE underestimated by many factors.

While working on a structure elucidation software program the issue of how many structures could be generated from some fairly nominal formulae became very clear. Below are some example formulae, the “correct” structure associated with the data under analysis and the number of chemical structures that can be generated from this formula. Notice those numbers….numbers like: 138,136,211,624 structures from a formula of C15H22O2 !

Therefore, the story that a monoisotopic mass, which can give a single molecular formula, can give you an unambiguous chemical structure needs to stop. Now, that said, since we have close to 20 million structures online at present, the question “What is the distribution of molecular formulae across ChemSpider?” was an interesting one. So, we ran a query to determine the highest frequency of formulae. The formula C18H20N2O3 occurred 5110 times in the database, 4804 times when looking at single components only. Some representative structures are shown:

mf-search-1.png
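To make the point concrete: a monoisotopic mass depends only on the elemental counts, so every record that shares a formula shares the mass. The sketch below is an illustration only. It uses standard isotope masses for the most abundant isotopes, covers just the handful of elements needed here, and handles only simple formulae without brackets or charges.

```python
import re
from typing import Dict

# Masses of the most abundant isotopes (12C, 1H, 14N, 16O, 32S), in u.
MONOISOTOPIC: Dict[str, float] = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "S": 31.97207100,
}


def parse_formula(formula: str) -> Dict[str, int]:
    """Count atoms in a simple formula such as 'C18H20N2O3'."""
    counts: Dict[str, int] = {}
    for element, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] = counts.get(element, 0) + (int(number) if number else 1)
    return counts


def monoisotopic_mass(formula: str) -> float:
    """Sum the isotope masses; raises KeyError for elements not in the table."""
    return sum(MONOISOTOPIC[el] * n for el, n in parse_formula(formula).items())


for formula in ("C18H20N2O3", "C6H12O6"):
    print(formula, round(monoisotopic_mass(formula), 4))
# C18H20N2O3 312.1474
# C6H12O6 180.0634
```

Nothing about connectivity or stereochemistry enters the calculation, so a measured mass of 312.1474 is equally consistent with every one of the records that share C18H20N2O3.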

I imported the data into Excel (Office 2003), which has a 65000 row limit. There are formulae that occur only once at the end of the file (viewed in WordPad), but at the 65000th row the frequency was still 45 entries in the database. It’s a long tail…

mf-distribution.png
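The same kind of ranking can be produced without a spreadsheet row limit. This is a minimal sketch with a made-up toy list of formulae standing in for the real database export; the function names are illustrative, not part of any ChemSpider tool.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def formula_frequencies(formulae: Iterable[str]) -> List[Tuple[str, int]]:
    """Rank formulae from most to least frequent."""
    return Counter(formulae).most_common()


def singleton_count(formulae: Iterable[str]) -> int:
    """Number of formulae that occur exactly once (the end of the tail)."""
    return sum(1 for n in Counter(formulae).values() if n == 1)


# Toy data for illustration only; the real counts come from the database.
toy = ["C6H12O6"] * 3 + ["C5H10O5"] * 2 + ["C2H4O2"]
print(formula_frequencies(toy))
# [('C6H12O6', 3), ('C5H10O5', 2), ('C2H4O2', 1)]
print(singleton_count(toy))  # 1
```

Run over a full export, the ranked list gives the head of the distribution and the singleton count measures how long the tail is.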

Now, many people are using monoisotopic masses to examine metabonomics data, so it may be more appropriate to do the analysis on a more restricted dataset. For example, databases of interest to metabonomics people include KEGG and HMDB. Isolating the search to such databases shows that while there is a much shorter list of unique formulae (8590), a similar distribution persists. The most common formula is C6H12O6 with 71 hits. Searching this in the database shows a number of linear and cyclic carbohydrates, some with stereo, some without, as shown below. If you are confused about “linear versus cyclic” see this Wikipedia article.

mf-search-2.png

Monoisotopic mass isn’t going to provide the stereo information anyway, so all you will get is a lot of similar structures…but of course there are MANY carbohydrates with that formula. I’ve listed some of the top formulae here and leave it to you to investigate!

Formula Number

C12H22O11 = 55 hits

C6H8O7 = 52 hits

C5H10O5 = 46 hits

C20H32O5 = 46 hits

C8H8O3 = 40 hits

C20H32O3 = 39 hits

C20H32O4 = 38 hits

C2H4O2 = 38 hits

C24H40O4 = 37 hits

CH4O3S = 36 hits

Bottom line…even after removing stereo issues and restricting the search to a small number of databases, a mass alone is not enough to declare a structure elucidated. Some form of prior knowledge or additional information, such as elution order or time, is necessary.

Now, this observation may not be surprising to many people. The response may be that tandem mass spectrometry would give an unambiguous structure. Unfortunately this is not true either: in general even tandem MS (MS^n) cannot give a conclusive structure. Certainly, if stereochemistry is involved (as with many carbohydrate molecules) you are still stuck. While library look-ups using monoisotopic mass ARE valuable, and tandem MS adds more criteria for structure identification, neither is unambiguous.

There is a new contributor to the blogosphere…SimBioSys. I recommend adding the blog to your Google Reader. There are some very exciting things going on there right now. I have commented previously about how high performance computing engines such as the Cell Broadband Engine are being brought to bear on scientific problems. SimBioSys appear to be the only group who have chosen the Cell processor to port their virtual high-throughput screening and docking solution to. Their white paper makes for an interesting read.

In their most recent post “Roping in your next scaffold hop with LASSO” they talked about their LASSO publication: “LASSO—ligand activity by surface similarity order: a new tool for ligand based virtual screening”. We are presently in the middle of a very exciting project regarding LASSO. We have teamed up to provide the virtual screening results for 40 target families on the full ChemSpider Library, currently containing over 18 million molecules. Using the LASSO similarity search tool, SimBioSys has screened the ChemSpider database against all 40 target families from the Database of Useful Decoys (DUD) dataset.

LASSO descriptors (Ligand Activity by Surface Similarity Order) contain a count of the different Interacting Surface Point Types (ISPT) found on a molecule. LASSO descriptors use 23 different surface point types, ranging from hydrogen bond donors/acceptors, to hydrophobic sites, to pi stacking interactions. Figure 1 shows a “histidine-like” fragment of a molecule. The triangles are the surface point types of this fragment, colored by type. Based on the idea that ligands must have surface properties compatible with the target site in order to bind, LASSO uses a descriptor of Interacting Surface Point Types (ISPT) to find molecules with diverse chemical scaffolds but similar surface properties.

lasso1.png
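To give a feel for what a count-based descriptor looks like, here is a toy sketch. It is NOT the LASSO scoring method, which is not described in this post. The four type names are hypothetical stand-ins for a few of the 23 surface point types, the counts are invented, and the min/max similarity is an assumption swapped in purely for illustration.

```python
from collections import Counter


def count_similarity(a: Counter, b: Counter) -> float:
    """Min/max similarity between two count descriptors (assumed metric).

    Returns 1.0 for identical counts and 0.0 when no point type is shared.
    """
    types = set(a) | set(b)
    shared = sum(min(a[t], b[t]) for t in types)
    total = sum(max(a[t], b[t]) for t in types)
    return shared / total if total else 1.0


# Hypothetical descriptors: counts of surface points per (made-up) type name.
query = Counter(hbond_donor=2, hbond_acceptor=3, hydrophobic=4)
candidate = Counter(hbond_donor=2, hbond_acceptor=2, hydrophobic=5, pi_stacking=1)

print(round(count_similarity(query, candidate), 3))  # 0.727
```

Because the descriptor only counts surface point types, two molecules with quite different scaffolds can still score as similar, which is the idea behind scaffold hopping.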

We are presently populating the ChemSpider database with tens of millions of LASSO descriptors. This will allow screening of the ChemSpider database to:

• Find molecules which have a higher likelihood of binding to targets.
• Find molecules with better selectivity for a target.
• Reduce toxicity issues.

The 40 Target receptor families included in the screening results were chosen to cover a wide range of receptor classes due to their interest in drug discovery. Each target fa