Archive for the How ChemSpider Runs Category

As we continue to add data sources to ChemSpider…and it’s going on almost weekly at present, it is clear that we have to make it easier for the users of ChemSpider to know what each of the Data Sources is. We’ve been doing some developments in the background for a couple of collaborations that have required the development of certain components and we’re layering one of them on here. We are using a callout balloon to display description details of the data source. Just hover over the name of the Data Source and you will see the description as shown below.

Clicking on the More Details link at the bottom right-hand side of the callout balloon takes you to the details page. If any of you readers are DEPOSITORS on the ChemSpider system, please note that we would love you to maintain your own page. Contact me and I will guide you through the process. This is the first of many enhancements to help navigate Data Sources.

There has been a response to my post about Chemical Names and Structures here.

PMR>”For certain purposes, it is valuable to collect as many names as possible, for example for location of lookup. But these should be accompanied with metadata. A similar example is from ChemSpiderMan (ed.):

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here: Looks fine in my browser and pasted in here too: N-{2-[({5-[(diméthylamino)méthyl]furan-2-yl}méthyl)sulfanyl]éthyl}-N’-méthyl-2-nitroéthène-1,1-diamine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the language, information is lost. For example what does “pain” mean? If the language is not given there is a tendency to interpret this as English. We have to acknowledge that the language of science is currently English (it wasn’t when I started and we had to read French and German papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful.”

First of all, it’s interesting to note that the French name has been rendered as “junk” in Peter’s blog as shown here.

This probably relates to his original comment that the name is junk in his browser too…but acceptable in mine. On the other hand his blog post may look fine to him and looks bad in mine! Oh those dependencies…I see similar things show up in WordPress regularly.

Peter suggests that there should be metadata giving the language information. Good idea. See my previous blog post about that particular issue and the fact that we allow curators to layer on metadata AND we capture and retain it WHEN it is available.

If you look at this record you will see that there are names labeled as Polish, German and Dutch.

Chloroprene [Wiki]

1,3-Butadiene, 2-chloro-

126-99-8 [RN]

204-818-0 [EINECS]

2-Chloor-1,3-butadieen [Dutch]

2-Chlor-1,3-butadien [German]

2-Chlorbuta-1,3-dien [German]

Chloropren [Polish]

Most labels were captured during the deposition process. One was added manually. Notice also the direct links to Wikipedia, the Registry Number link to perform a search of PubChem, and the link to EINECS.

As I commented in my post on ranitidine, and quoting from Peter’s post: “Notice …….. that the name is labeled French on the record.” So, what Peter suggests is already in place on ChemSpider. I display below what is presently available to curators to label the names with. Notice this includes language, EINECS numbers, CAS Registry Numbers, INNs, JANs etc.

The list of languages is easy to expand. Anybody have any requests?
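
For readers who think in code, here is a rough sketch of how labeled names could be modeled. It is only an illustration: the `Synonym` class and the `names_with_label` helper are hypothetical and say nothing about how ChemSpider actually stores its data. The names are the ones from the chloroprene record listed earlier in this post.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Synonym:
    """One identifier on a record, plus the label attached to it (hypothetical model)."""
    name: str
    label: Optional[str] = None  # e.g. "German", "RN", "EINECS"; None if unlabeled


# Names taken from the chloroprene record shown earlier in this post.
CHLOROPRENE = [
    Synonym("Chloroprene", "Wiki"),
    Synonym("1,3-Butadiene, 2-chloro-"),
    Synonym("126-99-8", "RN"),
    Synonym("204-818-0", "EINECS"),
    Synonym("2-Chloor-1,3-butadieen", "Dutch"),
    Synonym("2-Chlor-1,3-butadien", "German"),
    Synonym("2-Chlorbuta-1,3-dien", "German"),
    Synonym("Chloropren", "Polish"),
]


def names_with_label(synonyms: List[Synonym], label: str) -> List[str]:
    """Return every name carrying the given label, in record order."""
    return [s.name for s in synonyms if s.label == label]


print(names_with_label(CHLOROPRENE, "German"))
```

Asking for "German" returns the two German names; asking for a language nobody has labeled yet simply returns an empty list, which is why expanding the list of languages is cheap.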

A further comment “PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide). We have to use authorities (provenance) in our information. Thus the statements: Ranitidine is the Z-isomer and Ranitidine is the E-isomer may be seen as contradictory. That’s why people have suggested that RDF should have quads, not triples, such as Antony_Williams asserts ranitidine hasIsomer Z Wikipedia asserts ranitidine hasIsomer E Both these are true. That is the language we should use in the semantic web PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.”
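
The "quads, not triples" idea in Peter’s comment can be sketched in a few lines. This is a toy illustration in plain Python, not RDF and not anything running on ChemSpider: the `Quad` tuple and both helper functions are made up for the example, and the two statements are simply the ones Peter wrote in his comment.

```python
from collections import namedtuple

# A "quad" is a triple (subject, predicate, object) plus the source asserting it.
Quad = namedtuple("Quad", ["source", "subject", "predicate", "obj"])

# The two statements from Peter's comment, each kept with its source.
ASSERTIONS = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]


def values_by_source(quads, subject, predicate):
    """Map each source to the set of values it asserts for (subject, predicate)."""
    result = {}
    for q in quads:
        if q.subject == subject and q.predicate == predicate:
            result.setdefault(q.source, set()).add(q.obj)
    return result


def sources_disagree(quads, subject, predicate):
    """True when at least two sources assert different sets of values."""
    distinct = {frozenset(v) for v in values_by_source(quads, subject, predicate).values()}
    return len(distinct) > 1


print(values_by_source(ASSERTIONS, "ranitidine", "hasIsomer"))
print(sources_disagree(ASSERTIONS, "ranitidine", "hasIsomer"))
```

Because each statement keeps its source, the two values sit side by side as "who said what" rather than colliding as a contradiction, and a reader (or a curator) can decide which authority to trust.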

This leads us into a deeper discussion about retention of metadata and authorities. We retain metadata when it is deposited or we can harvest it. Let’s consider the information below extracted from the same compound on ChemSpider:

Notice all of the

and note that they all link through to the original source of information, in this case NIOSH.

  • Appearance: Colorless liquid with a pungent, ether-like odor.

  • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, skin, respiratory system; anxiety, irritability; dermatitis; alopecia; reproductive effects; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, reproductive system Cancer Site [lung & skin cancer]

  • Incompatibilities and Reactivities: Peroxides & other oxidizers [Note: Polymerizes at room temperature unless inhibited with antioxidants.]

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet (flammable) Change: No recommendation Provide: Eyewash, Quick drench

  • Exposure Limits: NIOSH REL: Ca C 1 ppm (3.6 mg/m³) [15-minute] See Appendix A; OSHA PEL: TWA 25 ppm (90 mg/m³) [skin]

There are also properties, and each piece of data links out to the original source. For this record it is the same source. For some records there are already multiple sources.

Experimental physchem properties

  • Boiling Point: 139 °F

  • Flash Point: -4 °F

  • Freezing Point: -153 °F

  • Specific Gravity: 0.96

  • Solubility: Slight

  • Ionization Potential: 8.79 eV

  • Vapor Pressure: 188 mmHg

This particular structure has been deposited onto the ChemSpider database a total of 18 times from the source databases listed below. Where possible, i.e. when the structure is available online on the supplier’s website and can be hyperlinked to, each external ID links to the depositor. There is an error! The Aldrich depositions are for the polymer forms! Curators can knock this info out.

Data Source                     External ID(s)
ChemDB                          6681768
ChemIDplus                      000126998, 014523898
DiscoveryGate                   31369
DTP/NCI                         18589
EPA DSSTox                      1084_NTPBSI_v2b, 325_CPDBAS_v5b, 326_CPDBAS_v5b, 724_HPVCSI_v2c
Istituto Superiore di Sanità    601
NIOSH                           EI9625000
NIST                            2143397875
NIST Chemistry WebBook          2143397875
PubChem                         31369
Sigma-Aldrich                   205397_ALDRICH, 205400_ALDRICH
Thomson Pharma                  00243363

Also available to master curators is the ability to see who has been editing the names and synonyms and a full record of depositions, by whom and when.

So, names are labeled with language and links to Wikipedia and other info. The predicted properties and systematic name are generally labeled according to the provider of the algorithm(s). We keep track of every URL and publication deposition and know which user deposited what and when…if the site is “vandalized” then we know which user did so.

Overall I’d say we have a lot of metadata for this record. The same is true for tens of thousands of records on ChemSpider and the amount of such information is growing literally daily. We’re not done yet of course – there is much more to add. We put a lot of thought into the design of this system and associated metadata but we also chose to jump off the cliff and start “doing”. There is a lot to learn from managing 20 million molecules and the complexity that comes with doing so. We continue to morph and extend as necessary and welcome input.

To clarify re. ranitidine…. I am NOT asserting that ranitidine is the Z-isomer. I am stating that ranitidine has multiple names on ChemSpider, some with no stereochemistry and some with Z-stereochemistry. I also report that a published crystal structure reports a Z-orientation. I also report that a commercial software package suggests that the three tautomeric structures below are possible for ranitidine.

I also report, just for fun of course, that the InChI algorithm will declare two of these isomers, the bottom two, as equivalent when “mobile protons” are taken into account. Compare the InChIKeys below when mobile proton perception is ON in the InChI algorithm. Need more information?

With the curation capabilities we have in place, with the retained metadata, linkages to depositors and other sites and the revision history available, I would say that we are well equipped to manage the data for chemists and continue to enhance our platform for chemists worldwide.

Please note that we have enhanced the display capabilities for molecules on ChemSpider. On a record view you will now see a new tabbed display allowing you to choose 2D or 3D display of the molecule and the usual load/save/zoom capability. Now you don’t have to go into zoom mode to get the 3D display. Just select the tab button. There is also a “Cell mode” that only works if there is crystallographic information available. This is only available for the CrystalEye deposition and more details will come about that shortly.

If you have other suggestions about how to improve the visualization let us know. Thanks.


We have started a trend of acknowledging contributors to ChemSpider. This month I want to acknowledge the contributions of two individuals who are both curating and depositing data to ChemSpider.

Heinz Kolshorn from the University of Mainz is an almost daily contributor to the curation process as well as a depositor of new compounds and NMR spectra.

Chris Singleton from Boston, MA has been busily connecting up web-based information sheets about chromatographic separations (look at the links and references here) and is about to start a wave of depositions of NMR spectral data.

To both of these gentlemen I say Thank You. ChemSpider is growing in terms of quality and content as a result of your efforts. If you want to become a depositor or curator for ChemSpider it’s easy. Ping me on this blogpost and I’ll point you to some materials to help.

I had previously posed the question “How many chemical names are contained in the short paragraph below?” Well, I have highlighted the “chemicals” contained in this paragraph. Click on the link to see what’s what.

“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.” You saw Aspirin immediately, right? Maybe you could have guessed that Advantage and Commando would be drugs? Some of you might have spotted “he” (helium) and “in” (indium). But did you expect “of” and “the”?

What was this challenge all about? It shows the need to do a good job in identifying chemical names when hunting for them in articles. With a dictionary of millions of systematic names, trade names, synonyms and database IDs, even the most general text is full of chemicals. So, the application of a dictionary of chemical names must be done very carefully. And the point is that matching the dictionary of names within ChemSpider at present to text contained within scientific articles will fail without the direct identification of chemical names OR identifying trade names etc. within an appropriate context.
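
To make the problem concrete, here is a minimal sketch of naive dictionary matching. The seven-entry dictionary is a tiny stand-in containing only the entries mentioned in this post, the text is an abridged piece of the paragraph above, and `naive_matches` is a made-up helper rather than anything we run on ChemSpider.

```python
import re

# A tiny stand-in for a dictionary of millions of names and identifiers.
# These seven entries are the ones called out in this post.
NAME_DICTIONARY = {"aspirin", "advantage", "commando", "he", "in", "of", "the"}

# First and last sentences of the challenge paragraph.
TEXT = (
    "She had the drive to derive success in any venture and was well versed "
    "in Karate. He went home and took an aspirin after the beating."
)


def naive_matches(text, dictionary):
    """Return each dictionary entry found as a whole word, ignoring case."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return sorted(set(words) & dictionary)


print(naive_matches(TEXT, NAME_DICTIONARY))  # ['aspirin', 'he', 'in', 'the']
```

Only one of the four hits is a chemical in context. Matching whole words and ignoring case is about the simplest rule possible, and it already shows why some notion of context has to sit on top of the dictionary.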

There are WAY more complexities than this though. A group at Cambridge has been working on SciBorg since 2005. The project description page outlines the project:

“SciBorg is a four-year project, starting October 1 2005, funded by the EPSRC under the programme for Computer Science for e-Science. The project is a collaboration between three groups at the University of Cambridge:

We are cooperating with three major publishers:

The project summary and objectives are below. For further information, please see the detailed project description , which was based on the project proposal, and the developing SciBorg project wiki pages.”

I have been following the project for a while and am getting much more interested in it right now. It makes for great reading about the challenges of text mining data. Peter Murray-Rust has made a couple of blog posts (1,2) over the weekend relative to the challenges of text mining and I reference you there for a good overview of some of the challenges. They are significant but there are ways to deal with some of the issues.

I’ll blog more about text-mining and names in the next few weeks…

We announced WiChempedia previously in terms of using our new dedicated website approach as a subset of ChemSpider. What we did there is show the leading part of the Wikipedia article and then ask people to click over to read the entire article on Wikipedia. We can support the entire article on ChemSpider if that is of interest to people but it’s easier, for now, to keep it as is. Comments?

What we are doing is trying to provide better integration to Wikipedia since I am working with Wikipedia on the curation of Wikipedia chemical structures and deeper integration seems appropriate. So, tonight we added the ability to Edit the Wikipedia article. In this way you can directly edit errors you might see in the lead of the article but you also get to edit the entire article if you are interested.

As an example see Taxol here. You will see the following Taxol lead for the article. Look at the END where you will see: Read more… or Edit at Wikipedia…

Paclitaxel is a mitotic inhibitor used in cancer chemotherapy. It was discovered in a National Cancer Institute program at the Research Triangle Institute in 1967 when Monroe E. Wall and Mansukh C. Wani isolated it from the bark of the Pacific yew tree, Taxus brevifolia, and named it ‘taxol’. When it was developed commercially by Bristol-Myers Squibb (BMS) the generic name was changed to ‘paclitaxel’ and the BMS compound is sold under the trademark ‘Taxol’. In this formulation paclitaxel is dissolved in Cremophor EL, a polyoxyethylated castor oil, as a delivery agent since paclitaxel is not soluble in water. A newer formulation, in which paclitaxel is bound to albumin as the delivery agent (Protein-bound paclitaxel), is sold commercially by Abraxis BioScience under the trademark Abraxane. Paclitaxel is now used to treat patients with lung, ovarian, breast cancer, head and neck cancer, and advanced forms of Kaposi’s sarcoma. Paclitaxel is also used for the prevention of restenosis. Paclitaxel works by interfering with normal microtubule breakdown during cell division. Together with docetaxel, it forms the drug category of the taxanes. It was the subject of a notable total synthesis by Robert A. Holton. As well as offering substantial improvement in patient care, paclitaxel has been a relatively controversial drug. There was originally concern because of the environmental impact of its original sourcing, no longer used, from the Pacific yew. The assignment of rights, and even the name itself, to BMS were the subject of public debate and Congressional hearings. Read more… or Edit at Wikipedia…
There you have it…this type of integration is a joy to do. Literally a couple of minutes to make the connection and a few minutes to set the style et voila. Editing articles in Wikipedia.

One blog I check out a few times per week is that of Derek Lowe, who writes In the Pipeline. What makes Derek’s blog different, in my opinion, is that he has many years under his belt as a synthetic chemist in pharma companies. He watches what is going on in his industry and makes us aware of his opinions and those of others. He’s still active in the lab and makes us aware of the challenges of lab syntheses, often with some historical perspective on what it “was like” then versus now. Always well written and with high feedback, In the Pipeline is likely one of the most frequented blogs out there.

Today I read about Schering-Plough’s thrombin receptor antagonist compound, SCH 530348. I am an SP shareholder and was rather disappointed by the recent news about Vytorin. Let’s hope the retrials provide new results. However, SCH 530348 looks more exciting according to Derek’s comments and talking to other people in the industry.

I searched ChemSpider for the structure since it seemed of interest. SCH 530348 is based on a natural product called himbacine and Derek had linked to it from his blog but, I assumed, he couldn’t find the structure of interest on the database. Neither could I. So, about 5 minutes work and it was on the database, with the link to the ASAP article and tagged etc as shown below. The link is here.


TRA-SCH 530348 is an oral antiplatelet drug under development by Schering-Plough for the treatment and prevention of atherothrombotic events in patients with Acute Coronary Syndrome, previous Myocardial Infarction, stroke, or existing peripheral arterial disease.


SCH 530348
Schering Plough
TRA-SCH 530348

Links & References

Samuel Chackalamannil, Yuguang Wang, William J. Greenlee, Zhiyong Hu, Yan Xia, Ho-Sam Ahn, George Boykow, Yunsheng Hsieh, Jairam Palamanda, Jacqueline Agans-Fantuzzi, Stan Kurowski, Michael Graziano, and Madhu Chintala . Discovery of a Novel, Orally Active Himbacine-Based Thrombin Receptor Antagonist (SCH 530348) with Potent Antiplatelet Activity, ASAP J. Med. Chem., ASAP Article
A potent series of thrombin receptor (PAR-1) antagonists based on the natural product himbacine is described.

Any user of ChemSpider can do this now. ANYONE. If you are interested in how, let me know! I have some documents already prepared to guide you through the process and need to update others, but the database can be expanded by everyone now…not quite Wikipedia but not bad at all! What do you think?

Over the past year ChemSpider has been working hard to build a functional and stable platform for the hosting, deposition and curation of structure-based data. This is to form the foundation of our mission to build a Structure-Based Community for Chemists. Our deposition system is in place and well-tested. Our indexing of articles is proven, and continues. We have indexed multiple Open Access articles. We support the deposition of analytical data (spectra and CIF files) into ChemSpider.

It is now time to take this to the next level and I would like to extend an invitation to Open Access publishers to work with us to design an interface (preferably a web service) to facilitate direct deposition of data into ChemSpider. We’d like to design an interface where you can feed your articles in with Title, Authors, Journal reference, DOI and Abstract. We would associate the article with the chemical structures in one of two specific ways – 1) extract the chemical names from the title and/or abstract and convert on the fly to deposit and/or associate with structures on ChemSpider and 2) allow the publisher to pass us a series of SMILES strings, InChI Strings, molfiles or chemical names to deposit on ChemSpider. Based on what we have already done it is clear this process is feasible, and will require some manual intervention until we optimize processes. If we do this we can design an interface and input format that can be made public, reusable by other groups for the deposition of information into their systems and, potentially, move away from the need for extracting information out of PDF files (and other formats). The outcome of this work would be a freely accessible structure and substructure searchable index of Open Access articles with links back to the Open Access article. We are already indexing articles so, with permission from even the non-Open Access publishers we could use similar processes to index abstracts and make articles structure/substructure searchable based on titles and abstracts.
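
To give publishers something concrete to react to, here is one possible shape for a deposition record, sketched as a small Python class that serializes to JSON. Nothing here is an agreed format: the class name, the field names and the example values (including the placeholder DOI) are all hypothetical, chosen only to mirror the fields listed above.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ArticleDeposition:
    """Hypothetical record a publisher might send for one article."""
    title: str
    authors: List[str]
    journal_reference: str
    doi: str
    abstract: str
    # Optional structure information supplied by the publisher (option 2 above).
    smiles: List[str] = field(default_factory=list)
    inchis: List[str] = field(default_factory=list)
    molfiles: List[str] = field(default_factory=list)
    chemical_names: List[str] = field(default_factory=list)

    def has_structures(self) -> bool:
        """True if the publisher supplied any structure or name directly."""
        return any([self.smiles, self.inchis, self.molfiles, self.chemical_names])

    def to_json(self) -> str:
        """Serialize the record for sending over a web service."""
        return json.dumps(asdict(self), indent=2)


# Placeholder values for illustration only; this is not a real article.
example = ArticleDeposition(
    title="An example article title",
    authors=["A. Author", "B. Author"],
    journal_reference="Example Journal, 2008, 1, 1-10",
    doi="10.0000/example",
    abstract="An example abstract mentioning chloroprene.",
    smiles=["C=CC(Cl)=C"],
)
print(example.to_json())
```

A record with no structures attached would fall back to option 1, name extraction from the title and abstract, which is why `has_structures` is worth checking first.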

So, my question. Are there any Open Access/Free Access publishers willing to discuss the possibilities I have outlined? If any of you will be at the ACS meeting and would like to discuss, please post a response here or contact me at the usual email address (antonyDOTwilliamsATchemspiderDOTcom) and let’s talk about building a disruptive and enabling technology for chemists around the world.

Frequent users of ChemSpider who use the identifiers for searching will commonly find a mixture of “names” and “database IDs” as well as “registry numbers”. Since the number of database IDs can sometimes swamp the synonyms and chemical names we chose to separate them. We have run some regular expressions across the database to separate database IDs out. We have left registry numbers (marked by [RN]), EINECS numbers (marked by [EINECS]) and Wiswesser Line Notation (marked by [WLN]) in with the synonyms.
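
For the curious, the general idea looks something like the sketch below. The two patterns are hypothetical examples written for this post, not the expressions we actually ran, and a real rule set would need many more of them. The sample IDs are ones that appear elsewhere on this blog.

```python
import re

# Markers that stay with the synonyms, as described above.
KEEP_MARKERS = ("[RN]", "[EINECS]", "[WLN]")

# Hypothetical patterns for database-style IDs; real rules would need many more.
DB_ID_PATTERNS = [
    re.compile(r"^[A-Z]{2,10}\d{3,}$"),   # capital-letter prefix then digits, e.g. EI9625000
    re.compile(r"^\d+_[A-Za-z0-9_]+$"),   # digits, underscore, suffix, e.g. 205397_ALDRICH
]


def split_identifiers(identifiers):
    """Split a list of identifiers into (names_and_synonyms, database_ids)."""
    names, db_ids = [], []
    for ident in identifiers:
        text = ident.strip()
        if text.endswith(KEEP_MARKERS):
            names.append(text)      # marked identifiers stay with the synonyms
        elif any(p.match(text) for p in DB_ID_PATTERNS):
            db_ids.append(text)     # looks like a database ID
        else:
            names.append(text)      # default: treat as a name
    return names, db_ids


names, db_ids = split_identifiers([
    "Chloroprene",
    "126-99-8 [RN]",
    "204-818-0 [EINECS]",
    "205397_ALDRICH",
    "EI9625000",
    "2-Chlor-1,3-butadien",
])
print(names)
print(db_ids)  # ['205397_ALDRICH', 'EI9625000']
```

Anything that matches no pattern is kept as a name, so a missed pattern leaves an ID among the synonyms rather than hiding a real name.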


Unfortunately there are MANY flavors of Database IDs and we might have missed some. If you come across any “potential” DB IDs and think we should segregate them out, please use the POST COMMENTS ability to inform us. Simply Post a Comment to a record and suggest we check the identifiers out for potential DB IDs. Thanks!

Previously we introduced the ability to submit chemical structures to the database using the Single Structure Deposition process. This allows users to submit single structures to the database and associate with publications, URLs, Pubmed IDs and so on. An example of the result can be seen here for Quesnoin…the structure and associated supplementary info was deposited online using the outlined process.

We have previously unveiled the ability to add publication details to existing structures on the database as outlined here. What we’ve heard is that it would be just as useful, and in the time of Web 2.0 even better, to allow connections to other web pages by allowing URLs to be connected to existing structures on the database. The process is easy.

You need to be logged in to Add URLs, Publications etc. The only action that can be done without logging in is the Posting of Comments. We do this to help protect against vandalism, if possible. Once logged in, click on Add URL. The example below shows me forming a link between the structure record of Xanax and the article on Wikipedia.


A dialog box will be displayed. Input the Title that you want displayed in the supplementary info and the URL of the associated link. See below:


Once filled in, the dialog will show as follows:


Then Click OK. The submission will be sent to a curator for approval and should be approved very quickly. The reason for this process is to ensure that we don’t get adorned with “inappropriate links”. The information will show as “Supplementary info” at the bottom of the structure record as shown here.

Two particular Open Access resources providing enormous value to Life Sciences nowadays are PubMed and PubChem. I’m sure everyone reading this blog has heard of them and used them both. We previously announced our deeper collaboration with ChemRefer. We have been busily working in the background to integrate ChemRefer into ChemSpider and the very “alpha” version of the integration is now available online. It can be accessed from the Search menu as shown below. Simply click on ChemRefer


When selected it will open the ChemRefer search window and a list of Publishers who have allowed us to index them as shown below. All can be searched or the search can be limited by using the Check Boxes.


The search results for searching on taxol are shown here. The figure below, while too small to see detail, shows that the searched word is highlighted in the text.


Notice that the RSC is no longer indexed at their request. We were sad to lose them from our searches.

We have also integrated to Entrez, for searching health sciences databases at the National Center for Biotechnology Information (NCBI) website. This can be searched by choosing NCBI Entrez from the Search drop down menu. It is available here. An example of the results is shown here. For this first integration we limit the results to 100 hits.


Clicking on any of the titles is a direct hyperlink to the article in PubMed Central.

These integrations with text searching engines are only the first part of our work. We have already been extracting chemical names and linking them up to structures in the ChemSpider database. Our first efforts in this area will be unveiled shortly. It will then be possible to simultaneously perform structure/substructure and text-based searches. This is a very significant undertaking but we are well on the way to bringing our vision of structure and text based indexing of the Open Access literature to fruition.

As part of a collaborative project with Jean-Claude Bradley from Drexel University (and member of our Advisory Group) we are in the process of delivering new capabilities for the upload of “activity data” associated with one or more structures, the display of these data with the associated structures on ChemSpider and the download of these data to the desktop. Our efforts are part of JC’s overall workflow outlined here.

So far we have enabled the ability to upload CSV files containing SMILES strings and associated data, converting the CSV files on the fly to SDF files for deposition onto ChemSpider. We have also enabled a general capability for the download of collections of data using checkbox selection and download as an SDF file with associated properties. Rather than insert images into this blog posting please click here to see the PDF of the Powerpoint overview.
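
For those wondering what the CSV-to-SDF step involves, here is a rough sketch of the file handling. It is not our implementation. The actual SMILES-to-structure conversion needs a cheminformatics toolkit, so the sketch takes it as a caller-supplied function (`smiles_to_molblock`, a hypothetical name), and it assumes the CSV has a header row with a column called SMILES.

```python
import csv
import io


def csv_to_sdf(csv_text, smiles_to_molblock, smiles_column="SMILES"):
    """Convert CSV rows to SDF text.

    smiles_to_molblock is supplied by the caller (for example a wrapper
    around a cheminformatics toolkit) and must turn a SMILES string into
    a molfile block. It is deliberately not implemented here.
    """
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        lines = [smiles_to_molblock(row[smiles_column]).rstrip("\n")]
        for key, value in row.items():
            if key == smiles_column:
                continue
            # Each SDF data item: header line, value line, blank line.
            lines += [f"> <{key}>", value or "", ""]
        lines.append("$$$$")  # SDF record terminator
        records.append("\n".join(lines))
    return "\n".join(records) + "\n"


# Usage with a stand-in converter, just to show the shape of the output.
def fake_molblock(smiles):
    return f"(molfile block for {smiles})\nM  END"


print(csv_to_sdf("SMILES,Activity\nC=CC(Cl)=C,1.5\n", fake_molblock))
```

Every non-SMILES column becomes a named data item on the structure, which is what lets the associated data travel with each structure through deposition and back out again on download.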

While we have the deposition process and downloading process essentially completed (except for testing) we now need to resolve the process for deduplicating the submitted data onto the database (or generating new structure records) as well as defining the format for display on the site. Watch this space.

When we first started the ChemSpider project we made a commitment to “Build a Structure Centric Community for Chemists”. We believe we are well on the way to facilitating that. We have talked about a “wiki” environment for collaboration. In this framework we take wiki to mean a “collaborative environment”, not necessarily adherence to a specific wiki platform. Our intention is to provide the ability for users of ChemSpider to collaborate in the co-management of content on the ChemSpider site. A number of our readers have taken our statements to indicate that we will be using the same wiki platform as that utilized on Wikipedia. We have looked at and considered a number of “wiki” tools, platforms, interfaces and user experiences. At this time we have made a decision to utilize Microsoft SharePoint as the platform on which to construct our wiki environment. With a clear commitment to Web 2.0 already declared and our platform built on SQL Server and ASP.NET, we feel it is the appropriate platform for us to build on. We believe our technology choices have already demonstrated that we can deploy a good solution very quickly.

Now, we realize that this might result in a series of jabs about us not using Open Source solutions and so on but we are more focused on delivering an appropriate scalable solution than building ChemSpider only on Open Source software. We will support anyone who wishes to do the same on Open Source though.

We will keep you informed of our progress. Now we need to migrate ourselves to .NET3 and we hope this will be a short term disruption in the future as we switch over. Watch this space.

Since we went live in March we had a link on the web page for Registration. It is only now enabled so our apologies for the delays. Was it worth the wait? Absolutely. Visit the registration page and sign up to benefit from a number of advantages as well as to become a “contributor” to the site.

I have posted a number of times about the intention for ChemSpider to become Wiki-enabled. While we have not yet layered the full wiki capability onto the system we are about to unveil single structure deposition, multiple-structure deposition and spectrum association with a structure. We already have beta testers testing the spectral association and over 40 spectra have already been added to the database.

The reason we require registration should be fairly obvious. For the purpose of providing access to beta-testers and granting the appropriate rights to submitters and curators we need to have traceability regarding who is making the submissions, the comments and the edits. While we understand this might be slightly invasive it is appropriate to retain a certain level of control over what might show up on the database. We hope we have your support.

Some side benefits of registration include an additional way for us to deliver you updated information regarding new capabilities being introduced to ChemSpider, updates regarding content enhancements and, when the time is right, delivering information to you via a ChemSpider newsletter.

Register today!

If you are a frequent user of ChemSpider you have likely been using the text-based searches to query the system. However, there are definitely those of you who are more adventurous and have initiated structure and substructure searches utilizing input via SMILES, InChI or the Structure Drawing Applet (See Section 2.3.3 of the online ChemSpider manual).

You might have discovered that some of those substructure searches can be a little time-consuming and you might want to let the search continue and “move on”. Alternatively, you might have performed a number of searches during a particular session with ChemSpider or may want to return to some results from searches performed in previous sessions. In any case, what you need in these cases is access to the details of the searches you have performed. The screen grab below speaks volumes.

The History file for Antony Williams

As an example of a saved search in MY personal history of search transactions please visit the link listed here…note its unique nature in terms of identifier.
While this search was very fast in reality, we are about to introduce Structure Similarity Searching onto ChemSpider. This can be a very time-consuming operation and certainly the deferred transaction tracking is likely to be of value when this new feature is rolled out.

For now, give us feedback on access to your personal history of transactions. Simply visit the link after you have performed some transactions on the system and you should see your list of actions. Enjoy…

So, ChemSpider is still in beta even though we are moving fairly quickly in the “back room”, and the development team has grown. There is a lot of infrastructure work going on despite what you see on the site when you come to crawl with the Spider. You should sense performance improvements when you use it though.

I’m the whiner on the ChemSpider team. That means I use the “whine” function in Bugzilla, our Open Source bug-tracking and feature request tracking system. It’s a great system for our needs and if you need a system out-of-the-box which will suffice for most of your needs go grab it. So, what am I whining about? Here’s my list of Five Things I Don’t Like About ChemSpider.

1) Many of our molecules are simply ugly. The connection table is correct but some of our displays of the molecule are far from perfect – take a look at the two structures below. The one on the left does NOT really have a chlorine attached to the oxygen. The one on the right is simply a mess.
Cleaned Structures - with problems!

This is an issue of Structure Cleaning. It IS difficult. Even the drawing package vendors struggle with this. It needs improving…whine…

2) We really need to do something with the curated data. People have been curating data on our site for a few weeks. We need to do something with it to show that their efforts matter….whine, whine…

3) We need to allow people to deposit data. There are people wanting to flood the system with data…in some cases one structure and in many cases thousands of structures. Right now it’s a very manual process…data has to come to us. I want a user to be able to submit their own data…sure we’ll validate and review but let people deposit the data at least….whine, whine, whine…

4) I want people to add content and information to structures. Not everything can be done with robots and process…people need to contribute. Wikipedia is all about community participation. Chemists have information about structures, about reactions…about connected data. When they have info they want to associate/link/dump and connect to a structure I want them to be able to do it….whine, whine, whine, whine…

5) ChemSpider was a rushed release. We’ve never hidden it…it was rolled out in time for the ACS in Chicago…and rushed. It went live with a lot of holes. It’s still beta. It’s working, and it’s moving but it’s time for some of the work flows to be improved and the website to be “prettied up”……whine, whine, whine, whine, whine…

Ok…those are my top five whines….and none of them are from a bottle. So, with these in mind it’s what we’re off to work on. Our short to midterm efforts will be in these areas.

What are the things YOU don’t like about ChemSpider? Whine away…we’ll Bugzilla your comments!

ChemSpider was released with the ability to initiate searching of structures using two drawing tools – an Applet and ACD/ChemSketch. Other ways to submit structures include the copying and pasting of either SMILES strings or of InChI strings. We have just celebrated 2 months online and are averaging about 800 users per day at present. Examination of the usage has shown that users submit structures in the following rank order: SMILES strings, applet, ChemSketch and then InChI.

During the two month beta period we have received numerous suggestions about how to improve the system. A number of these have included new ways to query the database with a chemical structure input. Specifically, some of the statements have been: 1) Use a better structure drawing applet, 2) Provide integration to ChemDraw, 3) Provide integration to ISIS, 4) Allow copying and pasting of a molfile. All are feasible of course…it is all about priorities.
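
To give a feel for what accepting pasted SMILES, InChI and molfile input through one box could involve, here is a rough Python sketch of telling the three apart. It is only a first-pass guess under simple assumptions (InChI strings begin with `InChI=`, molfiles are multi-line and end their connection table with `M  END`, SMILES has no whitespace); the function name `guess_input_format` is hypothetical and this is not a description of ChemSpider's own code.

```python
def guess_input_format(text: str) -> str:
    """Guess whether pasted text is an InChI, a molfile or a SMILES string."""
    stripped = text.strip()
    if stripped.startswith("InChI="):
        return "inchi"
    # Molfiles are multi-line and close the connection table with "M  END".
    if "\n" in stripped and "M  END" in stripped:
        return "molfile"
    # SMILES is a single token with no whitespace in it.
    if stripped and not any(ch.isspace() for ch in stripped):
        return "smiles"
    return "unknown"


sample_molfile = "\n  sketch\n\n  0  0  0  0  0  0  0  0  0  0999 V2000\nM  END\n"
for pasted in ["InChI=1S/CH4/h1H4", "CCO", sample_molfile, "hello world"]:
    print(guess_input_format(pasted))
```

Note that a plain word such as a chemical name would also be guessed as SMILES here, so a real system would still need to hand the string to a chemistry toolkit to check that it actually parses.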

In terms of applets we already have permission to use Peter Ertl’s JME applet and are aware of other potential options including Marvin, the JChemPaint applet and the MCDL applet. We’re very happy with the present applet ourselves but welcome your comments if you believe that other applets should be made available as an option.

Your comments are welcome either as a response to this blog posting or directly at development AT chemspider DOT COM (longhanded to prevent spam).

I was in an exchange with a friend this weekend about his interest in depositing data onto ChemSpider. Due to our travel schedules and family commitments we rarely talk by phone. This gentleman is a retired chemist, though highly active. He is an expert in nomenclature, has an incredible eye for quality and is a master curator of chemistry databases.

So, he is very interested in ChemSpider and the potential of exposing his databases. However, his expressed concern is that he will lose all the efforts he has invested in developing the databases. Again, these are manually curated, with an expert’s eye and, based on my experiences working with him, are of the highest quality. They amount to tens if not hundreds of hours of work and are a source of revenue for this gentleman.

With this in mind, and based on other blog posts I have seen, it appears that we have not clearly defined the intention of ChemSpider. What we are NOT doing is aggregating all data from all publicly available data sources or even supplied databases. Our intention for the immediate future is to form a structure-centric environment linking out to the initial data source providers via the chemical structure. The individual providers continue to provide their content and retain their value proposition.

For example, the NIST WebBook is a container for a lot of information including spectral data. As discussed in another post about the sodium chloride dimer, ChemSpider will provide the link to the WebBook to display relevant data for this gas phase species. A search for diazepam will provide links out to all original data sources as shown here and they include ChemBank, ChEBI, NIST WebBook and many others.

ChemSpider is an aggregator of chemical structures and associated identifiers (enabling connectivity to other sites). We are NOT duplicating all content available at other sites. This removes the burden of updating associated data across multiple data sources as individual providers curate and update their own sources. It also keeps ChemSpider on task of linking together multiple sources of data via chemical structures rather than grabbing the work of other groups and reposting.

So, back to my friend who is worried about depositing data on ChemSpider. All we will be taking delivery of are the structures, the structure IDs (if available) and a link to information about the database. In this way we are directing individuals to rich sources of information for ChemSpider users to pursue as they see fit. Just as many depositions into the public online databases are from chemical vendors intending to potentially sell their materials, the same model applies to database providers. After all, if information content is of value it is up to the user to choose to pay for the right to access.
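
As a rough picture of the model described above, the Python sketch below stores only a structure, an optional depositor ID and a link per deposition, then groups depositions by structure so that one structure links out to every source that holds it. The class and function names, the use of an InChI string as the structure key, and the two example sources with `example.org` addresses are assumptions for illustration, not ChemSpider's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Deposition:
    """What a depositor hands over: structure, optional ID, and a link."""
    structure_key: str                # here an InChI string (an assumption)
    source_name: str
    source_url: str                   # link to information about the database
    source_id: Optional[str] = None   # depositor's own ID, if available


def build_link_index(depositions: List[Deposition]) -> Dict[str, List[Deposition]]:
    """Group depositions by structure so each structure lists its link-outs."""
    index: Dict[str, List[Deposition]] = defaultdict(list)
    for dep in depositions:
        index[dep.structure_key].append(dep)
    return dict(index)


METHANE = "InChI=1S/CH4/h1H4"
ETHANOL = "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"

# Hypothetical sources; none of the associated data is copied, only the link.
depositions = [
    Deposition(METHANE, "Source A", "https://example.org/a", "A-001"),
    Deposition(METHANE, "Source B", "https://example.org/b"),
    Deposition(ETHANOL, "Source A", "https://example.org/a", "A-002"),
]

index = build_link_index(depositions)
for dep in index[METHANE]:
    print(dep.source_name, dep.source_url, dep.source_id)
```

Nothing beyond the structure, the ID and the link sits in the index, so the curated content stays with the provider.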

Taking this one step further one has to consider the following question. For the large database providers (Beilstein – now MDL, Derwent, CAS, Cambridge Crystallographic Databases, DiscoveryGate and others) why not put their structure collections into the public domain for the purpose of searching and connecting back to the actual content of value? The structures themselves, as far as I know, are in all cases in the public domain since they are published (I might be wrong here but I cannot find statements to the contrary). The value comes from the information associated with the structures – one or more publications, reaction details, experimental or predicted properties, connection to a patent, and other such content association.

What’s at risk in providing public access to the structure database(s) for searching and charging the appropriate fees to access the information once identified? There is little value in simply knowing that a structure exists in a database, is there? Isn’t it the information associated that has value? If this wasn’t true then that would suggest that a large database of algorithmically generated structures created with something like MolGen or the structure generator in Structure Elucidator would have value. In fact it does….see the work of Reymond et al in their “Virtual Exploration of the Chemical Universe up to 11 Atoms of C, N, O, F“. The value however comes not from the computer structure itself but rather the virtual screening response.

I judge there are two challenges – a decision at the management level to expose the large structural repositories and the enormous hurdles in migrating certain classes of chemical structures to SDF format to be hosted by general services – specifically polymers, organometallic complexes and inorganics (also all challenges for ChemSpider!). I think the primary challenge is the decision to expose the data…I judge it’s the right decision to make with the increasing availability of Open Access databases such as ChemSpider. It’s a BIG decision …

We’re at 5 weeks since we let people onto ChemSpider to crawl our web. The service is now enabling over 650 visitors per day on average in the past week to search the database and utilize the services. At present the database has grown to 10.6 million structures. However, as stated at the time of release, we would limit the database to around 10 million structures solely for the purpose of testing.

At present we have 800,000 molecules from 4 new contributors waiting to be added to the database after prediction of all associated properties and then de-duplicated with the existing content. We are in the process of converting over 10 million NEW structures from SMILES format. These will also be passed through all prediction algorithms, added to the database and then de-duplicated. There are a number of other databases to be delivered to ChemSpider for preparation in the next few days.

With full disclosure you should be aware that ChemSpider is served up from two Dell servers. They host the transactions, the database, the web server, the webpage and our email system. They are also, in parallel, converting SMILES to connection tables (millions of them) and are predicting a series of properties on every structure (check and you’ll see them all). Some of these properties take many seconds since they are complex calculations.

Bottom line…these servers are in dire need of air conditioning systems. They are running flat out. Our ability to provide fast searches, especially structure and substructure searches, while also being able to perform transaction-based predictions is already starting to fail. People are reporting performance issues so we have moved our predictions to the evening. It is clear we need a new server already (likely two) and much earlier than expected. I guess that’s what you call one of the struggles of success. I spent today reading about how one of the founders of YouTube kept extending his credit card bill to cover their technical costs (Time, January 2007).
Well, we are not YouTube, this is not Silicon Valley and we’re not putting our families at risk. As already discussed on this blog we may need to seek sponsorship. As it is we made the painful decision today that we have to start some form of advertising. We will be judged on this. And we acknowledge it. But our intention is to stay faithful to the community: to keep the service free while offering the best services and throughput that we can. Only more computing power will allow this at this stage. We will stumble along for the next month with what we have but if the dataset grows to the expected 15-20 million structures we will need to expand our plastic boxes. Hopefully our users will understand.