Archive for the ChemSpider Services Category

Over the weekend we added chemicals from two new data sources – Afid therapeutics and Alfa Aesar. Large depositions of over 25,000 chemicals have been slowed down while we improved our batch deposition system, but over the next few days we will be playing catch-up with a large backlog of vendor deposits. When we set up ChemSpider our initial belief was that the ability to source compounds was being sufficiently served by the chemical vendors themselves and by the many commercial software vendors and websites offering access to aggregated datasets of vendor offerings.

What we have noticed, however, is that on a daily basis ChemSpider users are requesting sources of chemicals directly. The majority of these requests are coming via email but the forum is also being used via the Looking for a Chemical topic. We have more and more requests to increase the number of chemical vendors represented on ChemSpider and to make it easier to identify chemical suppliers. Despite our directing many of our users to other sites, they seem to be looking for a “one-stop” shop for their information. We will add some improved navigation to facilitate locating sources of chemicals. We’re hoping that some of the companies focused on sourcing chemicals will also want to help and integrate their services…

Here at ChemSpider we’ve been working for almost a year and a half to build a structure centric community for chemists. During this time we have been dabbling, in the background, with ChemSpider being not so structure-centric, but this has not been exposed yet. Of late we have been attracted to the possibilities around text-mining and mark-up of articles.

We are well underway in terms of providing tools for markup and they will be released incrementally. We have a lot more ideas and are interested in participating in the Article 2.0 contest to see what we can do. What is Article 2.0? Article 2.0 was announced by Elsevier here with the following statement:

“We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset. We will encourage the use of XQuery, but this will not be a mandate. By leveraging these APIs, the contestant becomes the publisher and can render scientific articles to meet their needs including integrating the article into existing applications or combining it with other web service APIs.”

7500 articles and complete freedom to present the articles as we see fit. Enticing! What do we already have on ChemSpider that we could reuse?

1) Structure deposition

2) Analytical data and image deposition

3) Integration to other data via URLs

4) Add comments/description

5) Text markup with “Chemical enhancements”

6) A dataset of >21 million structures and integration to over 120 data sources

7) Good ideas …

Article 2.0 looks interesting…we hope to be involved

I have been in discussion with Christoph Steinbeck and colleagues from the European Bioinformatics Institute. Specifically, we are interested in linking up to AND embedding the text from their ChEBI Entities of the Month. So, as is my preferred manner of not assuming everything is Open Data but rather asking for permission, I approached Christoph. I asked for permission to copy the text for the Entities of the Month onto the appropriate record view in ChemSpider. When I asked the question we were not yet ready to accept rich text format with embedded hyperlinks, a strength of many of the articles on ChEBI’s Entity of the Month.

I am happy to announce that as part of our ongoing effort to Wikify ChemSpider and allow people to add descriptions to the individual record views we have added a rich text editor and are presently testing it. At present we have rolled out the FULL implementation of the editor. This means it has lots of capabilities/buttons and the entire editor is being tested by curators. But, when rolled out to users, there will be a Simple mode and an Advanced mode for the editor.

Click on the thumbnail below to see the Text Editor in action. Don’t forget, it is the “Full-powered” implementation for now. In this case all I did was copy and paste the text from the ChEBI website and insert the ChEBI article link back to the original article on the ChEBI site.

In the Text editor we are in the process of inserting new capabilities that will facilitate mark up of articles. Since we will be hosting a number of Open Access articles shortly we will be experimenting on those articles with our new markup capabilities.

When this is all rolled out we will have the majority of capabilities necessary for people to track their research online if they wish: online submission of structures, text deposition with full editing capabilities, submission and tracking of analytical data and images, and linking to external sites and data. It’s probably an 80% solution for right now since we are missing some capabilities and have some workflow issues. For example, there is poor support for polymers and organometallics and, specifically, the solution is structure-centric and insists on a structure being submitted before data and text can be associated with it. In the future we will allow “sample-submission” where the structure is not known but the data, images and experimental details of synthesis and analysis are available. Clearly the standard workflow for synthetic chemists is to synthesize first and then confirm by analysis what the products are, and this will need to be supported. It’s coming…

Some of you might be asking:

1) will we support versioning of the articles as people modify/edit the article (as is done with Wikipedia)? Yes, we will. Soon.

2) will curators have the ability to lock articles? Yes, in the future we will introduce this if it’s deemed appropriate.

3) will it be possible to allow only one individual (or group) to edit an article? Yes, one of the future directions is to allow an individual or group to perform Open Notebook Science in front of the public but not allow the public to edit the results. They would of course be allowed to comment on the research. Future development…


We are adding our finishing touches to some markup tools for Open Access articles at present and they will be unveiled shortly. In parallel we’ve been manually curating a series of articles about drugs, about 3000 of them, and will roll out these articles with similar markup using the tools we have developed. When rolled out we will have extended our ChemSpider toolkit to facilitate integration between “documents” and ChemSpider. Watch this space…

As we continue to add data sources to ChemSpider…and it’s going on almost weekly at present, it is clear that we have to make it easier for the users of ChemSpider to know what each of the Data Sources is. We’ve been doing some developments in the background for a couple of collaborations that have required the development of certain components and we’re layering one of them on here. We are using a callout balloon to display description details of the data source. Just hover over the name of the Data Source and you will see the description as shown below.

Clicking on the More Details link at the bottom right hand side of the callout balloon takes you to the details page. If any of you readers are DEPOSITORS on the ChemSpider system please note that we would love you to maintain your own page. Contact me and I will guide you through the process. This is the first of many enhancements to help navigate Data Sources.

There has been a response to my post about Chemical Names and Structures here.

PMR>”For certain purposes, it is valuable to collect as many names as possible, for example for location of lookup. But these should be accompanied with metadata. A similar example is from ChemSpiderMan (ed.):

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here. It looks fine in my browser and pasted in here too: N-{2-[({5?-[(dim�th?ylamino)m?�thyl]fur?an-2-yl}m?�thyl)sul?fanyl]�th?yl}-N’-m�?thyl-2-ni?tro�th�ne?-1,1-diam?ine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the language, information is lost. For example what does “pain” mean? If the language is not given there is a tendency to interpret this as English. We have to acknowledge that the language of science is currently English (it wasn’t when I started and we had to read French and German papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful.”

First of all, it’s interesting to note that the French name has been rendered as “junk” in Peter’s blog as shown here.

This probably relates to his original comment that the name is junk in his browser too…but acceptable in mine. On the other hand his blog post may look fine to him and looks bad in mine! Oh those dependencies…I see similar things show up in WordPress regularly.

Peter suggests that there should be metadata giving the language information. Good idea. See my previous blog post about that particular issue and the fact that we allow curators to layer on metadata AND we capture and retain it WHEN it is available.

If you look at this record you will see that there are names labeled as Polish, German and Dutch.

Chloroprene [Wiki]

1,3-Butadiene, 2-chloro-

126-99-8 [RN]

204-818-0 [EINECS]

2-Chloor-1,3-butadieen [Dutch]

2-Chlor-1,3-butadien [German]

2-Chlorbuta-1,3-dien [German]

2-Chloro-1,3-butadiene

2-Chlorobutadiene

Chloropren [Polish]

Most labels were captured during the deposition process. One was added manually. Notice also the direct links to Wikipedia, the Registry Number link to perform a search of PubChem and the link to EINECS.
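One way to picture the labeled names above is as simple name/label pairs. This is only an illustrative sketch and not how ChemSpider stores its data: the `LabeledName` type and the `names_in_language` helper are hypothetical, and the names are copied from the record shown above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LabeledName:
    """A synonym plus an optional label (hypothetical structure for illustration)."""
    name: str
    label: Optional[str] = None  # language or identifier type; None if unlabeled


# Names and labels as listed on the record above.
RECORD_NAMES = [
    LabeledName("Chloroprene", "Wiki"),
    LabeledName("1,3-Butadiene, 2-chloro-"),
    LabeledName("126-99-8", "RN"),
    LabeledName("204-818-0", "EINECS"),
    LabeledName("2-Chloor-1,3-butadieen", "Dutch"),
    LabeledName("2-Chlor-1,3-butadien", "German"),
    LabeledName("2-Chlorbuta-1,3-dien", "German"),
    LabeledName("2-Chloro-1,3-butadiene"),
    LabeledName("2-Chlorobutadiene"),
    LabeledName("Chloropren", "Polish"),
]


def names_in_language(names: List[LabeledName], language: str) -> List[str]:
    """Return only the names carrying the given label."""
    return [n.name for n in names if n.label == language]


print(names_in_language(RECORD_NAMES, "German"))
```

With the label kept alongside each name, a consumer can ask for the German names only, or skip anything unlabeled, rather than guessing the language from the string.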

As I commented in my post on ranitidine, and extracting from Peter’s post “Notice …….. that the name is labeled French on the record.” So, what Peter suggests is already in place on ChemSpider. I display below what is presently available to curators to label the names with. Notice this includes language, EINECS numbers, CAS Registry Numbers, INNs, JANs etc.


The list of languages is easy to expand. Anybody have any requests?

A further comment “PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide). We have to use authorities (provenance) in our information. Thus the statements “Ranitidine is the Z-isomer” and “Ranitidine is the E-isomer” may be seen as contradictory. That’s why people have suggested that RDF should have quads, not triples, such as “Antony_Williams asserts ranitidine hasIsomer Z” and “Wikipedia asserts ranitidine hasIsomer E”. Both these are true. That is the language we should use in the semantic web. PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.”
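The quad idea in Peter’s comment can be pictured as an ordinary triple with the asserting authority attached. The sketch below is plain Python rather than any RDF library, the `Quad` type and `assertions_about` helper are made up for illustration, and the two statements are taken as written in the quote.

```python
from collections import namedtuple

# A triple (subject, predicate, obj) plus the authority making the assertion.
Quad = namedtuple("Quad", ["authority", "subject", "predicate", "obj"])

# The two example statements exactly as they appear in the quoted comment.
QUADS = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]


def assertions_about(quads, subject, predicate):
    """Group asserted values by authority instead of merging them."""
    by_authority = {}
    for q in quads:
        if q.subject == subject and q.predicate == predicate:
            by_authority.setdefault(q.authority, []).append(q.obj)
    return by_authority


print(assertions_about(QUADS, "ranitidine", "hasIsomer"))
```

Because each value stays attached to its source, the two statements sit side by side without contradicting one another; dropping the authority field would leave two conflicting triples.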

This leads us into a deeper discussion about retention of metadata and authorities. We retain metadata when it is deposited or we can harvest it. Let’s consider the information below extracted from the same compound on ChemSpider:

Notice all of the

and note that they all link through to the original source of information, in this case NIOSH.

  • Appearance: Colorless liquid with a pungent, ether-like odor.

  • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, skin, respiratory system; anxiety, irritability; dermatitis; alopecia; reproductive effects; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, reproductive system Cancer Site [lung & skin cancer]

  • Incompatibilities and Reactivities: Peroxides & other oxidizers [Note: Polymerizes at room temperature unless inhibited with antioxidants.]

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet (flammable) Change: No recommendation Provide: Eyewash, Quick drench

  • Exposure Limits: NIOSH REL: Ca C 1 ppm (3.6 mg/m3) [15-minute] See Appendix A; OSHA PEL: TWA 25 ppm (90 mg/m3) [skin]

There are also properties and each piece of data links out to the original source. For this record it is the same source. For some records there are already multiple sources.

Experimental physchem properties

  • Boiling Point: 139 °F

  • Flash Point: -4 °F

  • Freezing Point: -153 °F

  • Specific Gravity: 0.96

  • Solubility: Slight

  • Ionization Potential: 8.79 eV

  • Vapor Pressure: 188 mmHg

This particular structure has been deposited onto the ChemSpider database a total of 18 times from the source databases listed below. Where possible, i.e. when the structure is available online on the supplier’s website and can be hyperlinked to, each external ID links to the depositor. There is an error! The Aldrich depositions are for the polymer forms! Curators can knock this info out.

Data Source                      External ID(s)
ChemDB                           6681768
ChemIDplus                       000126998, 014523898
DiscoveryGate                    31369
DTP/NCI                          18589
EINECS                           N/A
EPA DSSTox                       1084_NTPBSI_v2b, 325_CPDBAS_v5b, 326_CPDBAS_v5b, 724_HPVCSI_v2c
Istituto Superiore di Sanità     601
NIOSH                            EI9625000
NIST                             2143397875
NIST Chemistry WebBook           2143397875
PubChem                          31369
Sigma-Aldrich                    205397_ALDRICH, 205400_ALDRICH
Thomson Pharma                   00243363

Also available to master curators is the ability to see who has been editing the names and synonyms and a full record of depositions, by whom and when.

So, names are labeled with language and links to Wikipedia and other info. The predicted properties and systematic name are generally labeled according to the provider of the algorithm(s). We keep track of every URL and publication deposition and know which user deposited what and when…if the site is “vandalized” then we know which user did so.

Overall I’d say we have a lot of metadata for this record. The same is true for tens of thousands of records on ChemSpider and the amount of such information is growing literally daily. We’re not done yet of course – there is much more to add. We put a lot of thought into the design of this system and associated metadata but we also chose to jump off the cliff and start “doing”. There is a lot to learn from managing 20 million molecules and the complexity that comes with doing so. We continue to morph and extend as necessary and welcome input.

To clarify re. ranitidine…. I am NOT asserting that ranitidine has Z-isomer. I am stating that ranitidine has multiple names on ChemSpider, some with no stereochemistry and some with Z-stereochemistry. I also report that a published crystal structure reports a Z-orientation. I also report that a commercial software package suggests that the three tautomeric structures below are possible for ranitidine.

I also report, just for fun of course, that the InChI algorithm will declare two of these isomers, the bottom two, as equivalent when “mobile protons” are taken into account. Compare the InChIKeys below when mobile proton perception is ON in the InChI algorithm. Need more information?

With the curation capabilities we have in place, with the retained metadata, linkages to depositors and other sites and the revision history available, I would say that we are well equipped to manage the data for chemists and continue to enhance our platform for chemists worldwide.

A lot of text-indexing of publishers and journals has been underway over the past few weeks, with permission. The two latest additions are the Journal of Biological Chemistry (added over 122,000 new articles) and the Proceedings of the National Academy of Sciences (added over 50,000 new articles). Now on the Literature Search page you will see a series of checkboxes for you to choose the resources for text-searching (as shown below).

I have been testing the searches based on one of my adopted molecules, paclitaxel, sometimes referred to as Taxol.

Searching on paclitaxel without JBC and PNAS gives a total of 427 articles in 11 seconds.

Searching on Taxol without JBC and PNAS gives a total of 270 articles in 5 seconds.

Searching on paclitaxel with JBC and PNAS gives a total of 745 articles in 26 seconds.

Searching on Taxol with JBC and PNAS gives a total of 1192 articles in 35 seconds.

Adding JBC and PNAS clearly gives a lot more hits on both names, with over a 4x increase for Taxol hits. The number of hits is also highly dependent on the name used to perform the search. When we integrate chemical structure searching via linked identifiers this dependency should be dramatically reduced. This work is in development.
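For anyone who wants to check the arithmetic, the fold increases follow directly from the four hit counts reported above. A minimal sketch (the `fold_increase` helper is just a name chosen here):

```python
# Hit counts reported above, without and with JBC and PNAS included.
HITS = {
    "paclitaxel": {"without": 427, "with": 745},
    "Taxol": {"without": 270, "with": 1192},
}


def fold_increase(before: int, after: int) -> float:
    """Ratio of hits after adding the new journals to hits before."""
    return after / before


for name, counts in HITS.items():
    ratio = fold_increase(counts["without"], counts["with"])
    print(f"{name}: {ratio:.2f}x")
# paclitaxel: 1.74x
# Taxol: 4.41x
```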


For ease of visualizing large sets of structures resulting from queries such as mass searches or property range searches we have recently introduced the Tile View. So, for example, let’s assume I am doing a Properties Search and looking for how many molecules there are on the database with a monoisotopic mass of 300 +/- 0.001. I find 133 hits in about a second.
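A mass search of this kind is a simple window test on each record’s monoisotopic mass. The sketch below is an assumption about the logic only, not the ChemSpider implementation, and the record identifiers and masses are invented sample values.

```python
def within_mass_window(mass: float, target: float, tolerance: float) -> bool:
    """True if mass lies within target +/- tolerance."""
    return abs(mass - target) <= tolerance


def mass_search(records, target: float, tolerance: float):
    """Return the identifiers of records whose mass falls inside the window."""
    return [rid for rid, mass in records if within_mass_window(mass, target, tolerance)]


# Hypothetical records as (identifier, monoisotopic mass) pairs, for illustration only.
SAMPLE = [
    ("rec-A", 300.0004),
    ("rec-B", 300.0015),
    ("rec-C", 299.9992),
    ("rec-D", 301.1000),
]

print(mass_search(SAMPLE, target=300.0, tolerance=0.001))
```

On a database of millions of records the same test would normally run as an indexed range query (mass between 299.999 and 300.001) rather than a scan, which is consistent with results coming back in about a second.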

Notice there are various modes for viewing: Grid, Tile, Table and Record mode. Click each to see them in action. The new mode for viewing is the Tile Mode. Click the thumbnail below to see the view. Hopefully you find it useful.

I was pinged this weekend by Zsolt Zsoldos of the SimBioSys Blog about us having duplicates of certain amino acids on ChemSpider. He commented that there were a series of structures showing up for a search based on identifier:

Aspartic acid: 411 and 5745
Arginine: 227 , 6082 , 64224 , 1266045
Histidine: 752 , 6038 , 64237 , 4450698

Welcome to our world! So, let’s start with aspartic acid. What IS the structure of aspartic acid? Is it the one on Wikipedia here? The one labeled with the S stereochemistry but showing no stereobonds (will be resolved in the curation process of Wikipedia structures!). Is aspartic acid the deprotonated version? Well, it depends on who you ask and also who is depositing on our system.

The same is true for arginine where there is a non-stereospecific isomer, a D-isomer, an L-isomer and a charged form. Similarly for histidine. All are appropriate.

Why is Zsolt interested in this? Because of a conversation he is engaged in…

For the docking community there is a very valuable resource out there called ZINC. They also have a very interesting email discussion list called Zinc-Fans. Recently a post initiated a discussion about protonation states and asking the question:

“In a certain docking protocol, my concern is primarily of the protonation states of the ligands in the library (subsets with different pH ranges) downloaded from ZINC, as I have recently read an article on “The influence of protonation in protein-ligand docking”

http://www.journal.chemistrycentral.com/content/2/S1/P12

Considering an enzyme that is reported to be optimum at a pH of 7.6-8.0, which we intend to find inhibitors for, which subset of ZINC compounds do I choose for docking against my target of interest?”

Zsolt came searching ChemSpider for the amino acid structures and found the complexities in terms of charged/stereo forms. But his posting regarding the “Correct Protonation State for Docking” was an education in itself. If you are engaged in docking experiments at all this is likely a must read. For the rest of us neophytes it’s education!

Earlier this week we added a new capability to ChemSpider for our users. Using the web service provided via nmrdb.org we embedded the ability to predict an NMR spectrum from any record view on the ChemSpider website. The NMR prediction service is provided by Luc Patiny’s group out of Ecole Polytechnique Fédérale de Lausanne at the Institute of Chemical Sciences and Engineering. Their nmrdb.org webpage offers a series of services, not just NMR prediction and I offer the details below from their website.

NMR Predictor – This page allows to predict the spectrum from the chemical structure

NMR Assigner – Upload and assign NMR spectra on-line. The assignment of NMR spectra may be decomposed in 4 steps:

  1. identification of the signals
  2. integration and multiplicity determination
  3. assignment of each signal to the corresponding atom in the molecule
  4. exportation of the data for publication and/or for database storage

NMR Resurrector - A great amount of NMR information is currently available in the form of scientific publications. However, this information is not readily accessible in the format required for complex searches. The Resurrector enables the user to easily import these in-line spectral descriptions and creates an assigned visual representation that can be seamlessly integrated in the attribution process.

I am an NMR spectroscopist by training and have been involved with NMR, either running NMR labs in academia, gov’t labs or Fortune 500 companies for almost a decade, or developing commercial NMR software tools for prediction, processing and structure elucidation. Doing NMR prediction well is not easy. There are multiple approaches and many have been discussed previously on this blog so I won’t belabor that point. However, a set of free online utilities for prediction and assignment offers a new entry into the domain and the ease of integration allows anybody to connect up via their website in just a few minutes.

I haven’t had time to test the system rigorously on complex molecules but simple molecules look fine (based on a test set of about 5 molecules).

We have produced the integration in order to allow crowdsourced testing of the prediction algorithms. Test it out. Provide the authors feedback as well as post your comments here. It’s easy to run…navigate to a record view of interest and look for the RED words “Predict NMR”. We will shortly provide you a way to predict the spectrum for any molecule via the ChemSpider structure input interface; it won’t have to be a part of our database.

Peter Murray-Rust responded to my recent comments about a Free Lunch. There are a number of comments to be made and an exciting opportunity to use Open Data and linking from ChemSpider.

I’d asked the question about how many records there were on CrystalEye. In our world a unique record is a unique InChI, not so on CrystalEye and appropriately so as the crystal structure itself is presumably the unique record. Makes sense.

PMR> We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) does a good job on disambiguating by cell dimensions but this is not foolproof and indeed no method is.

What we will do with multiple crystal structures for a single chemical structure is link all unique crystal structures from the unique chemical structure. In this way people can query the chemical structure and find all associated analytical data – spectra and crystallographic files. If we were to list the number of unique depositions on ChemSpider I think we would be around 40 million depositions…an estimate though!
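The linking described here amounts to grouping crystal structure entries under the InChI of the chemical structure they belong to. A minimal sketch, assuming each deposition arrives as an (InChI, crystal entry ID) pair; the function name and the placeholder strings are invented for illustration and are not real InChIs or CrystalEye identifiers.

```python
from collections import defaultdict


def link_by_structure(depositions):
    """Map each unique InChI to the list of unique crystal entries deposited for it."""
    linked = defaultdict(list)
    for inchi, crystal_id in depositions:
        if crystal_id not in linked[inchi]:  # skip exact duplicate depositions
            linked[inchi].append(crystal_id)
    return dict(linked)


# Placeholder values only: two crystal entries for one compound, one for another.
DEPOSITIONS = [
    ("InChI-of-compound-A", "crystal-001"),
    ("InChI-of-compound-B", "crystal-002"),
    ("InChI-of-compound-A", "crystal-003"),
    ("InChI-of-compound-A", "crystal-001"),  # repeated deposition of the same entry
]

print(link_by_structure(DEPOSITIONS))
```

A query on the chemical structure then returns every crystal entry in its list, which is the behaviour described above for spectra and crystallographic files.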

PMR> The main duplication comes from the Crystallography Open Database which has about 45,000 structures.

I looked at the Crystallography Open Database this morning. It states on the home page “Updated daily: 68268 entries in the COD”. We may have an opportunity with the COD to link up to their data and reduce the need for us to host CIFs. Excellent…we’re all for reducing workload and providing links into other systems. It’s what we do.

PMR> The only thing stopping us putting them (AJW> The structures from CrystalEye)  in Pubchem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon. When it’s finished it will be in RDF/CML.

This is great news. This means that after the summer we can download the data directly via PubChem and link up to CrystalEye that way. Perfect. We’ll stop working on integrating to CrystalEye now and wait for the integration path via PubChem and focus on other data sources. Thank you Peter, Nick, Andrew and Jim!!! That said, I don’t believe that PubChem will take CML; they will convert using their tools to produce their compatible formats, with InChI being one of them. That will break organometallics etc. UNLESS PubChem are going to adopt CML now, and that would be an interesting positive shift in terms of a sign of support for the format. A strong positive. I’ll chat with the PubChem team so that if CML is coming we can consider adopting it in some way and be ready.

From my post “AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.”

PMR: Chicken and egg… :-) You won’t adopt it until other people adopt it and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents) as in semantic publishing and the results of text-mining.

It’s nice to know that ChemSpider has that type of influence now. It’s good to see it going into Accelrys’ software and I had heard that from Dan’s blog and had added the CML Blog to my reader. I’m definitely watching and willing to follow. We’re busy leading so many other things right now we’ll wait for adoption and then jump on it like a “hobo on a muffin”.

One of the blogs I really enjoy reading is Deepak Singh’s Business,Bytes,Genes and Molecules. Today there was a blog post about ChemSpider but something strange happened…I could ONLY read it in Google Reader. When I tried to navigate to the actual website it asked me to Save a file. See below.

It may be harmless but I’ve suffered enough at the hands of “bad files” to not grab it. Anyone else seeing this symptom? It’s in both browsers (IE and FF) and on two computers.

Anyhow, thankfully I can read it in Google Reader. There’s a point Deepak raises and I insert it here:

“On the web, data should be available as an addressable resource. The fact that data is available as RDF is great (and I wish more data was available as such). However, my personal preference is that data, especially open data, needs to be accompanied by APIs and bindings that allow the data to be accessed in a number of formats (not a dump per se). I think over time the acceptable formats will be established, much like XML/JSON/RSS have become the standard transport formats. The key aspect here are the business models. Is the business in providing a service on top of the data? For example for more than X number of API calls, there could be a fee associated.”

Just in case people have missed them we have a whole series of Web Services available already and they are being used. You can find details about them here:

Mass Spec Web Services

Taverna Hooks to ChemSpider Web Services for Metabolomics

Web Services Demo Pages and Example Code

Microsoft Hook Web Services into Infomesa

Waters Deliver Integration Via Web Services

There are more examples. We have thousands of calls a day using the Web Services at present and welcome more feedback on them!

We made our web services available with the intention that third parties might take advantage of the capabilities. Waters recently integrated our MassSpecAPI web service to their MarkerLynx software.

To perform an online search of the ChemSpider database the user can set up a series of specific databases they might want to search. In the figure below the user has selected the Human Metabolome Database, Lipidmaps and the KEGG database.

Searches of the selected series of ChemSpider subset databases can be made based on either mass or elemental composition. The search returns the compound name in the ID column and retains the link to the ChemSpider ID.


ChemSpider records can then be viewed by clicking on the View Hit Details. The chemical structure and associated information can then be reviewed online.


Web service integrations of this type have started to expand in number and you can expect to see others appearing very shortly.

Over the past year we have been interested in our website statistics and our growing traffic. I have blogged previously about Alexa and was challenged to review the Compete statistics. After growing in rankings for a few weeks we removed the Alexa widget and saw our rankings plummet. We then installed the Compete widget and saw ourselves go up the rankings quite dramatically before removing that widget and seeing our rankings decrease. Meanwhile, our own website statistics have shown consistent month to month growth with an average of about 4000 unique users per day at present (as shown in the figure below).


Bottom line, based on our observations, neither Alexa nor Compete give anywhere near valid statistics. At the SBS conference in St Louis this past week I asked the audience, about 60 people, how many in the room knew of or had heard of ChemSpider. ONE hand went up…and that was someone I had informed many months earlier. As I expressed to the audience…this was not disappointing news to me…it was quite exciting to know what the potential growth is as people are informed of the service. I expect the growth to continue, especially after the visits to the ACS and the SBS.

Over the past year ChemSpider has been working hard to build a functional and stable platform for the hosting, deposition and curation of structure-based data. This is to form the foundation of our mission to build a Structure-Based Community for Chemists. Our deposition system is in place and well-tested. Our indexing of articles is proven, and continues. We have indexed multiple Open Access articles. We support the deposition of analytical data (spectra and CIF files) into ChemSpider.

It is now time to take this to the next level and I would like to extend an invitation to Open Access publishers to work with us to design an interface (preferably a web service) to facilitate direct deposition of data into ChemSpider. We’d like to design an interface where you can feed your articles in with Title, Authors, Journal reference, DOI and Abstract. We would associate the article with the chemical structures in one of two specific ways – 1) extract the chemical names from the title and/or abstract and convert on the fly to deposit and/or associate with structures on ChemSpider and 2) allow the publisher to pass us a series of SMILES strings, InChI Strings, molfiles or chemical names to deposit on ChemSpider. Based on what we have already done it is clear this process is feasible, and will require some manual intervention until we optimize processes. If we do this we can design an interface and input format that can be made public, reusable by other groups for the deposition of information into their systems and, potentially, move away from the need for extracting information out of PDF files (and other formats). The outcome of this work would be a freely accessible structure and substructure searchable index of Open Access articles with links back to the Open Access article. We are already indexing articles so, with permission from even the non-Open Access publishers we could use similar processes to index abstracts and make articles structure/substructure searchable based on titles and abstracts.
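To make the proposal concrete, here is one possible shape for a deposition record carrying the fields listed above. This is purely a discussion sketch: no such interface exists yet, and the class names, field names and JSON layout are all assumptions rather than a ChemSpider format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

# The four ways a publisher could pass a structure, as listed in the text.
ALLOWED_FORMATS = {"smiles", "inchi", "molfile", "name"}


@dataclass
class StructureRef:
    """One structure supplied by the publisher (hypothetical type)."""
    format: str
    value: str

    def __post_init__(self):
        if self.format not in ALLOWED_FORMATS:
            raise ValueError(f"unsupported structure format: {self.format}")


@dataclass
class ArticleDeposition:
    """Article metadata plus optional structures (hypothetical type)."""
    title: str
    authors: List[str]
    journal_reference: str
    doi: str
    abstract: str
    structures: List[StructureRef] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the whole record, including nested structures."""
        return json.dumps(asdict(self), indent=2)


# Placeholder values only; the DOI below is not a real identifier.
example = ArticleDeposition(
    title="Example title",
    authors=["A. Author"],
    journal_reference="Example Journal, volume, pages",
    doi="10.0000/example",
    abstract="Example abstract mentioning 2-chloro-1,3-butadiene.",
    structures=[StructureRef("name", "2-Chloro-1,3-butadiene")],
)
print(example.to_json())
```

If the publisher supplies no structures, the list stays empty and the first route described above (extracting names from the title and abstract) would apply instead.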

So, my question. Are there any Open Access/Free Access publishers willing to discuss the possibilities I have outlined? If any of you will be at the ACS meeting and would like to discuss please post a response here or contact me at the usual email address (antonyDOTwilliamsATchemspiderDOTcom) and let’s talk about building a disruptive and enabling technology for chemists around the world.

Over the past few months we have been working hard to integrate the ChemRefer system into ChemSpider. I have reported on our recent rollout of the first level of integration and Will Griffiths has started a discussion at the Open Chemistry Web blogpage. Will has played a key role in facilitating our relationships with publishers in the Open Access domain. Many have had an appreciation for his ChemRefer website.

When we connected ChemRefer to ChemSpider one of our first commitments was to facilitate direct structure/substructure searching of the IUCr publications. Will has indexed the publications from 1948 to present day. He has extracted chemical names and systematic names from the titles and abstracts of the articles. Since we have access to name to structure conversion capabilities from OpenEye we can convert many of these names to chemical structures in an automated fashion. As a result of my work on the curation of Wikipedia chemical structures I also have access to some of the tools I was involved in developing at ACD/Labs (ACD/ChemFolder and ACD/Name). This has allowed me to curate and convert other extracted names in a manual manner. This is NOT the most efficient way to conduct this process as it requires a lot of eyeballing. However, it is this type of approach, reviewing many hundreds of extracted names, which has allowed us to optimize the process, recognize where potential failures could arise and improve the chance to extract the best set of names to work with for conversion purposes. Now we are at the whim of name to structure conversion software in terms of the accuracy of the conversion but this can be validated (see later). Since organometallic names are very difficult to convert to structures we can only deal with organic structures at present. However, we also have many validated synonyms in our hands on ChemSpider and that provides a useful dictionary for conversion.

Ultimately I hope that other commercial providers of batch Name to Structure conversion software modules would see the potential value of collaboration with ChemSpider and give us access to their software tools to assist with our project.

We are in the process of curating and checking the names and chemical structures for all that we have extracted so far but the depositions have already started. Of the structures converted we have found about half of them are already on ChemSpider (example) but another half are unique (example). At present we have connected almost a thousand articles via DOIs from ChemSpider to the IUCr articles. The best estimates at present are that we will be able to connect about 8500 structures to articles.

Since this is the IUCr collection we will validate our approach in the future and, with their permission, we will use the CIFs to validate our extracted structures against the CIF-based structures. This will give us an indication of the errors one could expect from an automated extraction of chemical names from articles and conversion to structures using Name to Structure. We are also going to continue our curation process and allow people to curate structures from IUCr (and other articles) to ChemSpider. There will be cases where some names/structures from articles have not been converted to structures and these will need manual submission. This will be possible through the manual deposition system.
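The validation step could be sketched as a simple set comparison per article. This is only an illustration under stated assumptions: that both the name-derived structures and the CIF-derived structures have already been reduced to comparable identifier strings (InChI here), and that articles are keyed by DOI. The function name and the toy data are hypothetical, and the DOIs are placeholders.

```python
from typing import Dict, Set, Tuple


def compare_to_reference(
    extracted: Dict[str, Set[str]],
    reference: Dict[str, Set[str]],
) -> Tuple[int, int, float]:
    """Compare name-derived identifiers with reference identifiers per article.

    Returns (confirmed, unconfirmed, unconfirmed_fraction), where "confirmed"
    counts extracted identifiers that also appear in the reference set for the
    same article.
    """
    confirmed = 0
    unconfirmed = 0
    for doi, found in extracted.items():
        ref = reference.get(doi, set())
        confirmed += len(found & ref)
        unconfirmed += len(found - ref)
    total = confirmed + unconfirmed
    fraction = unconfirmed / total if total else 0.0
    return confirmed, unconfirmed, fraction


# Toy data: small molecules stand in for real extracted structures.
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"
METHANOL = "InChI=1S/CH4O/c1-2/h2H,1H3"
WATER = "InChI=1S/H2O/h1H2"

extracted = {"doi-1": {ETHANOL, METHANOL}, "doi-2": {WATER}}
reference = {"doi-1": {ETHANOL}, "doi-2": {WATER}}

result = compare_to_reference(extracted, reference)
print(result)
```

An unconfirmed structure is not necessarily a conversion error, since a title or abstract can name a compound that is not the one in the CIF, so a count like this would be a starting point for manual review rather than a final error rate.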

We are looking forward to providing similar services to other publishers should they be desired. This is our proof of concept…and it’s working.

Previously we introduced the ability to submit chemical structures to the database using the Single Structure Deposition process. This allows users to submit single structures to the database and associate with publications, URLs, Pubmed IDs and so on. An example of the result can be seen here for Quesnoin…the structure and associated supplementary info was deposited online using the outlined process.

We have previously unveiled the ability to add publication details to existing structures on the database as outlined here. What we’ve heard is that it would be just as useful, and in the time of Web 2.0 even better, to allow connections to other web pages by allowing URLs to be connected to existing structures on the database. The process is easy.

You need to be logged in to Add URLs, Publications etc. The only action that can be done without logging in is the Posting of Comments. The reason we do this is to help us protect from vandalism, if possible. When logged in then click on Add URL. The example below is for me wanting to form a link between the structure record of Xanax and the article on Wikipedia.

addurl1.png

A dialog box will be displayed. Input the Title that you want displayed in the supplementary info and the URL of the associated link. See below:

addurl2.png

Filling the information will show as follows:

addurl3.png

Then Click OK. The submission will be sent to a curator for approval and should be approved very quickly. The reason for this process is to ensure that we don’t get adorned with “inappropriate links”. The information will show as “Supplementary info” at the bottom of the structure record as shown here.

Two particular Open Access resources providing enormous value to Life Sciences nowadays are PubMed and PubChem. I’m sure everyone reading this blog has heard of them and used them both. We previously announced our deeper collaboration with ChemRefer. We have been busily working in the background to integrate ChemRefer into ChemSpider and the very “alpha” version of the integration is now available online. It can be accessed from the Search menu as shown below. Simply click on ChemRefer.

chemrefer.png

When selected it will open the ChemRefer search window and a list of Publishers who have allowed us to index them as shown below. All can be searched or the search can be limited by using the Check Boxes.

chemrefer2.png

The search results for searching on taxol are shown here. The figure below, while too small to see detail, shows that the word searched is highlighted in the text.

chemrefer3.png

Notice that the RSC is no longer indexed at their request. We were sad to lose them from our searches.

We have also integrated to Entrez, for searching health sciences databases at the National Center for Biotechnology Information (NCBI) website. This can be searched by choosing NCBI Entrez from the Search drop down menu. It is available here. An example of the results is shown here. For this first integration we limit the results to 100 hits.

entrez1.png

Clicking on any of the titles is a direct hyperlink to the article in PubMed Central.

These integrations to these text searching engines are only the first part of our work. We have already been extracting chemical names and linking them up to structures in the ChemSpider database. Our first efforts in this area will be unveiled shortly. In this case it will be possible to simultaneously perform structure/substructure and text-based searches. This is a very significant undertaking but we are well on the way to bringing our vision of structure and text based indexing of the Open Access literature to fruition.

I’ve reported previously on the fact that we are now adding publication details to chemical structures on ChemSpider. We have introduced the ability to do this in a manual fashion where anyone can associate a blog article, wiki article or scientific publication directly with one or more chemical structures but we are also looking to do this in an automated fashion. Project Prospect from the RSC seemed like an ideal opportunity for us to consider using their InChI association to harvest the titles and DOIs to make the association. This would be done after a discussion with the RSC to receive their blessing if possible, based on our previous interactions.

Today I investigated the possibilities of using the information available. I started with this article and clicked on the Enhanced HTML View (Prospect View) and then used the Toolbox to Show the Compounds. The article is about the “Chemistry and biology of resorcylic acid lactones”. A partial screenshot of the type of molecules discussed in the article is shown below.

radicicol.png

For the purpose of this blogpost notice the visually appealing forms of the structures and the stereochemistry on the molecules. Now, each of the marked compounds in the article is linked to details of the molecule. See below: each of the pink highlights is linked to the molecule and pops up a new box.

radicicol3.png

Looking at radicicol on Project Prospect and on ChemSpider we see a difference. In fact, compare with the structure shown above from the article. The difference is one of stereochemistry: there is no stereochemistry in the InChI or in the SMILES string. There are also issues with the structure depiction shown below and this has been discussed before relative to “Cleaning”.

radicicol2.png

Zearalenone on Project Prospect and on ChemSpider

zearalenone.png

As previously discussed with Ginkgolide B there can be many versions of a structure on the ChemSpider database. We recently introduced the ability to search on a “skeleton” as shown below in the new Structure Search options.

search-options.png

When the skeleton for zearalenone is searched (same skeleton excluding H) I found 15 hits. Some are shown below. Notice the differences (highlighted with red boxes) from structure to structure in terms of the presence/absence of the double bond inside the cycle, the difference between the OH and the =O and the specified stereochemistry. This search can be very useful for finding related structures and more examples of using such searches will be given in the near future. We DO find versions without specified stereochemistry but we are presently working on approaches to relate the stereo/non-stereo versions of structures to each other in a very visual manner. More will follow…

skeletons.png

We have started to introduce new capabilities onto ChemSpider in preparation for our shift from ChemSpider Beta to ChemSpider RELEASE VERSION to celebrate our one year anniversary. You should see a number of incremental improvements happening over the next few weeks. I’ll highlight as many of them as I can as we release them.

Let’s start with new Search Capabilities. On the search screen you will see a drop down menu. You can now select from 5 different types of searches. This post will highlight only the “Structure” type search. Details about the others will follow.

searches.png

It appears that many people believe that ChemSpider is only a text-based search. The reality is that we went live with structure and substructure search in version 1. For sure we have improved things over the months and the searches are better and faster but structure searching has always been present.

To perform structure searching choose the Structure Search and you will see:

structuresearch1.png

Now click on Input Structure and a screen for submitting structures will be opened. There are three ways to do this. The Convert option will allow you to input a SMILES string, InChI string or chemical name and convert it to the structure. If the structure is correct select ACCEPT and perform the search. Alternatively load a molfile by browsing and uploading then click ACCEPT. If you want to draw a structure or edit one you have converted simply select edit and draw into the applet.

structuresearch2.png

The manual for the applet is here. The applet will look like this:

structuresearch3.png

These capabilities have been in place for a while. Now what we have done is enable you to search in different ways from this place. Specifically…

structuresearch4.png

These searches are very valuable, specifically in relation to the issue of tautomers and structure skeletons. As I have shown in a previous post on ginkgolide B, in some cases there are many similar structures on the database. Searching on the skeleton will find them all. A search on Taxol shows 1 exact structure, 1 tautomer of the exact structure and 42 structures with the same skeleton as shown below.

structuresearch5.png

Enjoy the new capabilities. We welcome your feedback.

I use Google Reader to read blog posts. It’s great. The PROBLEM I have with blogs is when I want to receive comments on comments I have made on a blog. I waste a lot of time going back to blog posts to see whether the author of the original post has commented on my comments. It is a real time hog. Maybe I should stop commenting…it would be less time-consuming.

I think ALL blog hosts should allow users to Subscribe for Comments. What is this? I noticed it on David Bradley’s ScienceBase blog and we added it tonight. Look at the screenshot below and notice the red highlight. Now if you post a comment to the blog check this box and if new comments are received on this particular blog post then you will receive an email about the new comment.

subscribe.png

I’ve commented previously about the fact that Microsoft had used our web services to connect to Infomesa (1,2). Last week I had a chance to meet with Sam Batterman face-to-face for an overview of Infomesa and to see the services in action. Sam and I chatted for a couple of hours about his platform, the challenges of managing quality in publicly accessible data (and not proliferating errors) and the directions for ChemSpider in general and whether we could extend the services to support his needs.

Infomesa is a BIG whiteboard. And I mean BIG! During our discussions Sam threw up photos, charts, spreadsheets and videos onto the board and formed relationships between them. He demonstrated using our services to pull back/generate SMILES strings, structure images and InChIKeys. He demonstrated mapping relationships as I would do today in MindManager. I could see immediate utility to the approach of the giant whiteboard. I am tired of arranging relationships over multiple documents and while I like MindManager a lot (!) the utility of the enormous whiteboard approach (I think Sam mentioned “equivalent to 50,000 pixels”) became clear in a couple of minutes.

I look forward to playing with Infomesa myself when it’s available and doing what we can through our web services to help offer additional utility to chemists.

ChemSpider hosts the “electronic catalogs” of many chemical vendors as a public service. When people use us to find chemical suppliers we do not expect to be compensated for the service and do not expect to be acknowledged for the service. If ChemSpider users get connected to companies and buy their chemicals so much the better. What has happened over the past few weeks has been very interesting though…we’ve seen a DRAMATIC increase in the number of requests to help people source chemicals that are NOT on the ChemSpider database (yet) as well as to assist in identifying organizations (or individuals) who are interested in obtaining custom synthesis support. We’re happy to do it.

I’m also in the middle of assisting in the curation of Wikipedia and showing “willingness” there too. There are a whole host of other projects we are being invited to participate in also and we are showing “willingness”. There is an unfortunate downside to “willingness” though. There is very little time left to do “real work”. There is an increasing number of email requests and phone requests coming to us regarding “Where can I buy this chemical?” or “Can you help identify someone who can synthesize this chemical?”. In the past month we’ve averaged about 20 requests per week but it is increasing fairly quickly. We have about an 80% success rate in finding the chemical of interest on the database or through our depositors as well as introducing the interested party to one or more people interested in doing custom synthesis for them…and then getting out of the way.

With this in mind you are going to see us expend some efforts to facilitate interactions between chemical vendors and interested parties. I consider this as a form of social networking for chemists – facilitating introductions and connections – okay, maybe more of a dating service.

echemistry.png

Maybe it’s eChemistry (the science of attraction…an online dating service)? By the way, there is also the Microsoft eChemistry project and I hope that people don’t confuse the two….

There is a new contributor to the blogosphere…SimBioSys. I recommend adding the blog to your Google Reader. There are some very exciting things going on there right now. I have commented previously about how high performance computing engines such as the Cell Broadband Engine are being brought to bear on scientific problems. SimBioSys appear to be the only group who have chosen the Cell processor to port their virtual high-throughput screening and docking solution to. Their white paper makes for an interesting read.

In their most recent post “Roping in your next scaffold hop with LASSO” they talked about their LASSO publication: “LASSO—ligand activity by surface similarity order: a new tool for ligand based virtual screening”. We are presently in the middle of a very exciting project regarding LASSO. We have teamed up to provide the virtual screening results for 40 target families on the full ChemSpider Library, currently containing over 18 million molecules. Using the LASSO similarity search tool, SimBioSys has screened the ChemSpider database against all 40 target families from the Database of Useful Decoys (DUD) dataset.

LASSO descriptors (Ligand Activity by Surface Similarity Order) contain a count of the different Interacting Surface Point Types (ISPT) found on a molecule. LASSO descriptors use 23 different surface point types, ranging from hydrogen bond donors/acceptors to hydrophobic sites to pi stacking interactions. Figure 1 shows a “histidine-like” fragment of a molecule. The triangles are the surface point types of this fragment, colored by type. Based on the idea that ligands must have surface properties compatible with the target site in order to bind, LASSO uses a descriptor of Interacting Surface Point Types (ISPT) to find molecules with diverse chemical scaffolds but similar surface properties.

lasso1.png
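The "count of surface point types" idea can be sketched in a few lines. This is only an illustration of a fixed-length count vector, assuming each surface point on a molecule has already been assigned an integer type index from 0 to 22. How the points are generated and typed, and how LASSO scores similarity, are not covered here, and the function name and sample inputs are hypothetical.

```python
from collections import Counter
from typing import Iterable, List

N_TYPES = 23  # the text above states that 23 surface point types are used


def count_descriptor(point_types: Iterable[int]) -> List[int]:
    """Turn a collection of surface point type indices into a 23-long count vector."""
    counts = Counter(point_types)
    for t in counts:
        if not 0 <= t < N_TYPES:
            raise ValueError(f"surface point type out of range: {t}")
    # Position i holds how many points of type i were found on the molecule.
    return [counts.get(t, 0) for t in range(N_TYPES)]


# Two hypothetical molecules whose surface points are listed in different
# orders (for example, because their scaffolds differ).
mol_a = [0, 0, 1, 5, 5, 5]
mol_b = [5, 0, 5, 1, 0, 5]

desc_a = count_descriptor(mol_a)
desc_b = count_descriptor(mol_b)
print(desc_a == desc_b)
```

Because only counts are kept, the descriptor carries no information about where the points sit on the scaffold, which is consistent with the goal of finding molecules with different scaffolds but similar surface properties.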

We are presently populating the ChemSpider database with 10s of millions of LASSO descriptors and this will allow screening of the ChemSpider database to:

● Find molecules which have a higher likelihood of binding to targets.
● Find molecules with better selectivity for a target.
● Reduce toxicity issues.

The 40 Target receptor families included in the screening results were chosen to cover a wide range of receptor classes due to their interest in drug discovery. Each target family had 10s to 100s of known active molecules, which were used as the basis for the query files used by LASSO, one query for each family. The similarity screening was performed on the full ChemSpider database across all 40 targets and the similarity scores for each structure/target pair are available via the ChemSpider website. Thus for each structure in the ChemSpider database, you can find its similarity score (based on surface properties) relative to actives of each of the 40 target receptors. In addition to allowing instant ranking results for a particular target of interest (retrieving molecules that are likely to be active for a receptor) this matrix of screening results can be used to find molecules that have predicted affinity for a target but low predicted affinity for all other targets. Performing such searches promises to improve selectivity and can be a guide to reducing toxicity concerns. More detail about this collaborative project will be forthcoming but the overview is provided here.
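The selectivity search described above amounts to a filter over the structure/target score matrix. Here is a minimal sketch, assuming scores run from 0 to 1 with higher meaning more similar to known actives. The thresholds, the function name, the structure IDs and the target labels are all hypothetical, and the scores are made-up toy values.

```python
from typing import Dict, List

ScoreMatrix = Dict[str, Dict[str, float]]  # structure ID -> target -> score


def selective_hits(
    scores: ScoreMatrix,
    target: str,
    min_on: float = 0.8,   # assumed cutoff for "predicted active" on the target
    max_off: float = 0.3,  # assumed cutoff for "low" on every other target
) -> List[str]:
    """Return structure IDs that score high on `target` and low on all others."""
    hits = []
    for structure_id, row in scores.items():
        on_target = row[target]
        # Worst case across the remaining targets decides selectivity.
        off_target = max((s for t, s in row.items() if t != target), default=0.0)
        if on_target >= min_on and off_target <= max_off:
            hits.append((on_target, structure_id))
    # Best on-target score first; ties broken by ID for a stable order.
    hits.sort(key=lambda pair: (-pair[0], pair[1]))
    return [structure_id for _, structure_id in hits]


# Toy matrix with three targets instead of forty.
scores = {
    "cs-1": {"target_A": 0.90, "target_B": 0.10, "target_C": 0.20},
    "cs-2": {"target_A": 0.95, "target_B": 0.70, "target_C": 0.10},
    "cs-3": {"target_A": 0.40, "target_B": 0.10, "target_C": 0.10},
    "cs-4": {"target_A": 0.85, "target_B": 0.20, "target_C": 0.30},
}

print(selective_hits(scores, "target_A"))  # → ['cs-1', 'cs-4']
```

In this toy case "cs-2" has the highest on-target score but is dropped because it also scores 0.70 on another target, which is the selectivity argument made above.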

Watch this space for updates and an unveiling date.

I’ve reported previously on how Microsoft have started to use the ChemSpider web services and have hooked them into InfoMesa. Sam has used our services to add in a Search capability to return the InChI, InChIKey, SMILES, and a URL Link to ChemSpider based on a search of a trade name, SMILES, etc. He’s also using the ability to return the chemical structure image. I’ll let the image on his blog tell the story more fully.