Archive for June, 2008

Here at ChemSpider we’ve been working for almost a year and a half to build a structure-centric community for chemists. During this time we have been dabbling, in the background, with ChemSpider being not so structure-centric, but this has not been exposed yet. Of late we have been attracted to the possibilities around text-mining and mark-up of articles.

We are well underway in terms of providing tools for markup and they will be released incrementally. We have a lot more ideas and are interested in participating in the Article 2.0 contest to see what we can do. What is Article 2.0? Article 2.0 was announced by Elsevier here with the following statement:

“We’ve worked hard to build the Article 2.0 dataset, and now we’re opening it up to developers via a simple, straightforward REST API. We will provide contestants with access to approximately 7,500 full-text XML scientific articles (including images) and challenge each contestant to be the publisher. In other words, each contestant will have complete freedom for how they would like to present the scientific research articles contained in the Article 2.0 dataset. We will encourage the use of XQuery, but this will not be a mandate. By leveraging these APIs, the contestant becomes the publisher and can render scientific articles to meet their needs including integrating the article into existing applications or combining it with other web service APIs.”

7500 articles and complete freedom to present the articles as we see fit. Enticing! What do we already have on ChemSpider that we could reuse?

1) Structure deposition

2) Analytical data and image deposition

3) Integration to other data via URLs

4) Add comments/description

5) Text markup with “Chemical enhancements”

6) A dataset of >21 million structures and integration to over 120 data sources

7) Good ideas …

Article 2.0 looks interesting…we hope to be involved

Buy me a Coffee

I have been in discussion with Christoph Steinbeck and colleagues from the European Bioinformatics Institute. Specifically, we are interested in linking up to AND embedding the text from their ChEBI Entities of the Month. So, as is my preferred manner of not assuming everything is Open Data but rather asking for permission, I approached Christoph. I asked for permission to copy the text for the Entities of the Month onto the appropriate record view in ChemSpider. When I asked the question we were not yet ready to accept rich text format with embedded hyperlinks, a strength of many of the articles on ChEBI’s Entity of the Month.

I am happy to announce that as part of our ongoing effort to Wikify ChemSpider and allow people to add descriptions to the individual record views we have added a rich text editor and are presently testing it. At present we have rolled out the FULL implementation of the editor. This means it has lots of capabilities/buttons and the entire editor is being tested by curators. But, when rolled out to users, there will be a Simple mode and an Advanced mode for the editor.

Click on the thumbnail below to see the Text Editor in action. Don’t forget, it is the “Full-powered” implementation for now. In this case all I did was copy and paste the text from the ChEBI website and insert the ChEBI article link back to the original article on the ChEBI site.

In the Text editor we are in the process of inserting new capabilities that will facilitate mark up of articles. Since we will be hosting a number of Open Access articles shortly we will be experimenting on those articles with our new markup capabilities.

When this is all rolled out we will have the majority of capabilities necessary for people to track their research online if they wish: online submission of structures, text deposition with full editing capabilities, submission and tracking of analytical data and images, and linking to external sites and data. It’s probably an 80% solution for right now since we are missing some capabilities and have some workflow issues. Examples are the poor support for polymers and organometallics and, specifically, the structure-centric nature of the solution, which insists on a submitted structure to associate data and text with. In the future we will allow “sample-submission” where the structure is not known but the data, images and experimental details of synthesis and analysis are available. Clearly the standard workflow for synthetic chemists is to synthesize first and then confirm by analysis what the products are. This is a typical workflow and will need to be supported. It’s coming…

Some of you might be asking:

1) will we support versioning of the articles as people modify/edit the article (as is done with Wikipedia)? Yes, we will. Soon.

2) will curators have the ability to lock articles? Yes, in the future we will introduce this if it’s deemed appropriate.

3) will it be possible to allow only one individual (or group) to edit an article? Yes, one of the future directions is to allow an individual or group to perform Open Notebook Science in front of the public but not allow the public to edit the results. They would of course be allowed to comment on the research. Future development…

We are adding the finishing touches to some markup tools for Open Access articles at present and they will be unveiled shortly. In parallel we’ve been manually curating a series of articles about drugs, about 3000 of them, and will roll out these articles with similar markup using the tools we have developed. When rolled out we will have extended our ChemSpider toolkit to facilitate integration between “documents” and ChemSpider - watch this space…

Most blog readers will likely be aware of the recent article written in Nature about ChemSpider. PMR has recently commented on what he said to the Nature reporter who interviewed him but did not make it into press.

I’ll clarify some of Peter’s statements and differentiate judgments from truths. Some of this is a repeat, again.

1) “Firstly to say that I commented to Geoff before Chemspider’s announcement that it was adopting CC-SA licences. This is a major advance and has enhanced the importance of Chemspider.”

We have now REMOVED these licenses after the rather interesting situation that resulted from adopting them. Peter had already commented on his own blog: “I agree with John. Licences are not appropriate for data (and when I applauded Chemspider it was for the motivation rather than the actual mechanism - CC-SA is conformant to the OK definition, but difficult to operate for re-use). That’s why we use the OKF’s OpenData sticker on CrystalEye.”

2)  “It’s (now) based on Web 2.0 principles in that it uses social computing for some of its content and can and has reacted to external changes.” ChemSpider has been based on Web 2.0 principles since the first rollout and I have commented on this previously.

3)  “It’s not, however, based on semantic web technology such as RDF and XML and this may be a future limitation in managing some of the more complex content.” We use XML in many places on our site and some of this will be exposed in the future. We have discussed RDF’ing our system with Egon Willighagen but it’s not a priority for us at present. It’s on the list though.

4) “Although I’m not party to the internal design I’d guess it has a relational database, most of whose primary keys are the identifiers for chemical compounds. These identifiers map onto canonicalised chemical structures (one serialization of which is the InChI) and this is the primary mechanism for indexing compounds.” Yes, it’s a relational database, on Microsoft SQL server. Primary keys are structures and we do use InChis, a lot.
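To make the idea in point 4 concrete, here is a minimal sketch of a relational layout in which a canonical InChI string is the lookup key for a compound. It uses Python's built-in sqlite3 rather than Microsoft SQL Server, and the table names, column names and the `names_for_inchi` helper are hypothetical illustrations, not ChemSpider's actual schema.

```python
import sqlite3

# Hypothetical layout: one row per unique structure, with the InChI held
# under a UNIQUE constraint so it can serve as the lookup key.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE compound (
        id    INTEGER PRIMARY KEY,
        inchi TEXT NOT NULL UNIQUE
    );
    CREATE TABLE synonym (
        compound_id INTEGER NOT NULL REFERENCES compound(id),
        name        TEXT NOT NULL
    );
""")
conn.executemany("INSERT INTO compound (id, inchi) VALUES (?, ?)", [
    (1, "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),  # benzene
    (2, "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),   # ethanol
])
conn.executemany("INSERT INTO synonym (compound_id, name) VALUES (?, ?)", [
    (1, "benzene"), (1, "benzol"), (2, "ethanol"),
])


def names_for_inchi(inchi):
    """Return all names attached to the structure with this InChI."""
    rows = conn.execute(
        "SELECT s.name FROM synonym s "
        "JOIN compound c ON c.id = s.compound_id "
        "WHERE c.inchi = ? ORDER BY s.name",
        (inchi,),
    )
    return [row[0] for row in rows]


print(names_for_inchi("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"))
```

The point of the sketch is only that every name, property or deposition hangs off one structure row, so two depositors who submit the same structure end up on the same record.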

5) “CS has ca 20 million compounds and the only way to manage these is robotically.” We have a hybrid model of robotic handling and human intervention and interaction with the data. To see human interaction in action visit the feedback page.

6) “there is no guarantee that the computation of properties is free from error - indeed it cannot be. Many physical properties depend on the physical form of the compound and this is often not recorded. I suspect most of the properties are computed by heuristic means (“QSPR”) rather than QM calculations. And many of them fail to take things like chemical stability and reactivity into account. (Examples are boiling points for compounds that decompose, flashpoints for things that could never burn). But how do you tell this robotically - I don’t have a good suggestion. But one can guarantee that in 20 million calculations some will be meaningless.”

I agree with the scientific declarations that properties depend on the physical form of the compound. None of the predictions are QM-based. That is definitely not feasible with 20 million compounds, not only because of lack of access to software but more because of time issues, as discussed previously in regards to QM NMR predictions. I have 15 years’ experience with QSPR-type predictions; they are fast and are generally applied by the majority of chemists at the desktop in Life Science environments (and others) for the prediction of logP, solubility, logD, pKa, NMR etc. I GUARANTEE that in 20 million compounds some will be meaningless. This definitely doesn’t mean the predicted values across the DB are of no value.

Despite some of the previous comments about the properties, in the vast majority of cases property prediction is valid. See such discussions here: Calcium Carbonate is not soluble and can’t have a logP PLUS Lipinski says Calcium Carbonate CAN have a logP

We are presently adding MORE predicted properties. Check out the “EPI Summary” at the bottom of the page for this record and you will see this (scroll inside the box):

Log Octanol-Water Partition Coef (SRC):
    Log Kow (KOWWIN v1.67 estimate) =  0.85

 Boiling Pt, Melting Pt, Vapor Pressure Estimations (MPBPWIN v1.42):
    Boiling Pt (deg C):  290.82  (Adapted Stein & Brown method)
    Melting Pt (deg C):  80.58  (Mean or Weighted MP)
    VP(mm Hg,25 deg C):  3.99E-005  (Modified Grain method)
    Subcooled liquid VP: 0.000135 mm Hg (25 deg C, Mod-Grain method)

 Water Solubility Estimate from Log Kow (WSKOW v1.41):
    Water Solubility at 25 deg C (mg/L):  3.574e+004
       log Kow used: 0.85 (estimated)
       no-melting pt equation used

 Water Sol Estimate from Fragments:
    Wat Sol (v1.01 est) =  1e+006 mg/L

 ECOSAR Class Program (ECOSAR v0.99h):
    Class(es) found:
       Neutral Organics-acid

 Henrys Law Constant (25 deg C) [HENRYWIN v3.10]:
   Bond Method :   9.70E-009  atm-m3/mole
   Group Method:   Incomplete
 Henrys LC [VP/WSol estimate using EPI values]:  2.176E-010 atm-m3/mole

 Log Octanol-Air Partition Coefficient (25 deg C) [KOAWIN v1.10]:
  Log Kow used:  0.85  (KowWin est)
  Log Kaw used:  -6.402  (HenryWin est)
      Log Koa (KOAWIN v1.10 estimate):  7.252
      Log Koa (experimental database):  None

 Probability of Rapid Biodegradation (BIOWIN v4.10):
   Biowin1 (Linear Model)         :   0.7245
   Biowin2 (Non-Linear Model)     :   0.7196
 Expert Survey Biodegradation Results:
   Biowin3 (Ultimate Survey Model):   3.1842  (weeks       )
   Biowin4 (Primary Survey Model) :   3.9956  (days        )
 MITI Biodegradation Probability:
   Biowin5 (MITI Linear Model)    :   0.6808
   Biowin6 (MITI Non-Linear Model):   0.7604
 Anaerobic Biodegradation Probability:
   Biowin7 (Anaerobic Linear Model):  0.5224
 Ready Biodegradability Prediction:   YES

Hydrocarbon Biodegradation (BioHCwin v1.01):
    Structure incompatible with current estimation method!

 Sorption to aerosols (25 Dec C)[AEROWIN v1.00]:
  Vapor pressure (liquid/subcooled):  0.018 Pa (0.000135 mm Hg)
  Log Koa (Koawin est  ): 7.252
   Kp (particle/gas partition coef. (m3/ug)):
       Mackay model           :  0.000167
       Octanol/air (Koa) model:  4.39E-006
   Fraction sorbed to airborne particulates (phi):
       Junge-Pankow model     :  0.00598
       Mackay model           :  0.0132
       Octanol/air (Koa) model:  0.000351 

 Atmospheric Oxidation (25 deg C) [AopWin v1.92]:
   Hydroxyl Radicals Reaction:
      OVERALL OH Rate Constant =  24.3848 E-12 cm3/molecule-sec
      Half-Life =     0.439 Days (12-hr day; 1.5E6 OH/cm3)
      Half-Life =     5.264 Hrs
   Ozone Reaction:
      No Ozone Reaction Estimation
   Fraction sorbed to airborne particulates (phi): 0.00957 (Junge,Mackay)
    Note: the sorbed fraction may be resistant to atmospheric oxidation

 Soil Adsorption Coefficient (PCKOCWIN v1.66):
      Koc    :  1
      Log Koc:  0.000 

 Aqueous Base/Acid-Catalyzed Hydrolysis (25 deg C) [HYDROWIN v1.67]:
    Rate constants can NOT be estimated for this structure!

 Bioaccumulation Estimates from Log Kow (BCFWIN v2.17):
   Log BCF from regression-based method = 0.500 (BCF = 3.162)
       log Kow used: 0.85 (estimated)

 Volatilization from Water:
    Henry LC:  9.7E-009 atm-m3/mole  (estimated by Bond SAR Method)
    Half-Life from Model River: 7.347E+004  hours   (3061 days)
    Half-Life from Model Lake : 8.016E+005  hours   (3.34E+004 days)

 Removal In Wastewater Treatment:
    Total removal:               1.88  percent
    Total biodegradation:        0.09  percent
    Total sludge adsorption:     1.78  percent
    Total to Air:                0.00  percent
      (using 10000 hr Bio P,A,S)

 Level III Fugacity Model:
           Mass Amount    Half-Life    Emissions
            (percent)        (hr)       (kg/hr)
   Air       0.189           10.5         1000
   Water     37              360          1000
   Soil      62.7            720          1000
   Sediment  0.0722          3.24e+003    0
     Persistence Time: 547 hr
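Several of the numbers in this listing are derived from one another, which gives a quick way to sanity-check a record. The sketch below is a worked check, not the EPI Suite code: it assumes the usual relations Kaw = H/(RT) and Koa = Kow/Kaw, and takes the gas constant and temperature as stated in the comments. The input values are copied from the listing above.

```python
import math

R = 8.205736e-5   # gas constant in atm·m3/(mol·K)
T = 298.15        # 25 deg C expressed in kelvin

log_kow = 0.85          # KOWWIN estimate
henry_bond = 9.70e-9    # HENRYWIN bond method, atm-m3/mole

# Dimensionless air-water partition coefficient: Kaw = H / (R * T)
log_kaw = math.log10(henry_bond / (R * T))

# Octanol-air partitioning: Koa = Kow / Kaw, so the logs subtract
log_koa = log_kow - log_kaw

# The bioconcentration factor is reported both as a log and as a number
bcf = 10 ** 0.500

print(round(log_kaw, 3), round(log_koa, 3), round(bcf, 3))
# prints: -6.402 7.252 3.162
```

The three rounded values agree with the “Log Kaw used”, “Log Koa (KOAWIN v1.10 estimate)” and “BCF” entries in the listing, so at least these parts of the summary are internally consistent.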

7) “Chemspider is using social computing (crowdsourcing) to clean up (curate) the information in the database. This works in Wikipedia, although the number of chemicals is in the thousands, not the millions, and there are still many data and chemical problems. Moreover WP shows that there are compounds - e.g. aluminium chloride - where there is no single structure.” Social computing curation is working well. It’s working on Wikipedia too; I am in the middle of that effort. There is no reason that ChemSpider cannot support multiple species for one compound either. For example, see the structure of Thymol Blue on Wikipedia and then look at this search: http://www.chemspider.com/q/thymol%20blue on ChemSpider. 2 of the 3 structures in the scheme are noted on ChemSpider. The third can be added. For aluminium chloride we link to Wikipedia to explain this…at present only the lede of the article, but we could host the entire article. Why not?

8) “What is Chemspider now is and where it may be going? It’s difficult to predict anything on the web but it’s also clear that chemists are one of the most conservative disciplines. Why use a free service when you can get your library to pay (a lot of money) for ACS or Beilstein services? So I wouldn’t predict explosive growth like Flickr or Google” Yup, I’d agree. But it’s not only conservatism. It’s marketing (we don’t do any paid marketing) and ChemSpider is for chemists. Flickr’s for everybody, so is Google. How can it be as explosive? But can it grow, and is it growing? Yup.

9) “Nick found 26 sites displaying staurosporine and there were 19 different structures given. Some were incomplete and several were just crazily wrong. Clearly many chemical suppliers, journal editors, etc. do not care about chemical structures. So there is a huge amount of rubbish out there.” I’ve said the same many times (1,2, and others). But does it mean we should stop? I don’t think so…

and to conclude

10) “PMR: At some stage, therefore, the community will react against this centralisation of information, but it could be a long time. I don’t think anyone should set up to duplicate what ACS does - I think we should use modern thinking to do things quicker, smarter, cheaper and in tune with the modern Web. Chemspider may have to make some choices soon - is it a company or a voluntary activity? does it concentrate on high volume and variable quality, or low volume and high quality - it cannot do both? What is the particular USP of its repository service? - there may well be a role for a specialist chemical repository service but when? Is it different from Pubchem, and how…?”

ChemSpider is not a company. ChemZoo is. We ARE using modern thinking in tune with the modern web. This is probably one of the fastest moving efforts in this area. Are there others moving as fast at depositing? Curating? Integrating? So, we are a company, and the service comes at no cost to the users. Volunteers are helping. We are working on BOTH high volume and high quality. It is work. We are being successful on both. The Wikipedia collection, when finished, will only be a subset of ChemSpider. But structures and associated information (other than predictions!) are validated daily at present, and crowdsourcing can speed it up. And there WILL be disagreements between chemists, just like on Wikipedia! I am in those conversations too. I think there is a role for a free access chemical repository now. We may be surpassed at any time but for now our efforts are valid and valiant, in my opinion…what say you?

Since ChemSpider went live in Spring 2007 we have received a lot of support, feedback and guidance from our Advisory Group. The advisory group was set up as a rotating group of advisors and it is now time for a “changing of the guard”. If you have an interest in becoming a member of the advisory group please send me a note to antonyDOTwilliamsATchemspiderDOTcom.

For the next year we will be focusing our efforts on: supporting Open Access publishers (see later post), text-mining, document mark-up, working with chemical vendors, enhancement of web services, and extending our penetration into the world of Wiki-based chemistry. If any of these areas are of interest let me know!

Will Griffiths has posted at Open Chemistry Web a post entitled “Chemrefer could disappear tomorrow”.

He’s not talking about the fact that ChemRefer is disappearing, quite the contrary. He is talking about how the combination of ChemRefer/ChemSpider is powering ahead with our indexing of Open Access articles, with tens of thousands of new articles added to ChemSpider’s text-searching capabilities so far and many more coming soon.

I guess now we have to consider that “ChemSpider could disappear tomorrow” too. I hope we disappear in the SAME way! By that I mean I hope that some organization sees the value of what we are doing and will want to collaborate with us in order to make an even bigger impact. One thing about what we are doing, as I commented during my presentation at the Whitney Symposium at GE, is “We are upsetting a lot of people – evangelists, cheminformatics system vendors, publishers, data content providers”. This is NOT intentional but what we are doing is disruptive, we understand. We haven’t focused on talking about what’s possible but on getting on with doing it, sometimes with warts and all. Not all players in these areas see us as a threat but, based on direct feedback, some do. It’s a shame. We have a lot of “birthmarks” on us at present …We are upsetting a lot of people

I spent two days in Albany, New York this week at the GE Research Center. The Whitney Symposium was focused on “Networks” and invited speakers from Harvard University, Caltech, MIT, Yahoo research and the like to talk about their views of networks. These included power networks, biological networks, socio-economic networks and so on. I spoke in the Social Networking section and a link to the presentation is below: Crowd Sourcing to Build a Structure Crowd-Centric Community for Chemists

I have not added text to each of the slides but hope it will be rather self-explanatory.

A biweekly update of new blog postings on the ChemConnector Blog that might be of interest to ChemSpider readers.

Books I am reading - The Autoimmune Epidemic

Invited Symposium Speaker at a Fortune 500 Company

New Shower Curtains and Our Health

Petaflops and Cell Processors

As we continue to add data sources to ChemSpider…and it’s going on almost weekly at present, it is clear that we have to make it easier for the users of ChemSpider to know what each of the Data Sources is. We’ve been doing some developments in the background for a couple of collaborations that have required the development of certain components and we’re layering one of them on here. We are using a callout balloon to display description details of the data source. Just hover over the name of the Data Source and you will see the description as shown below.

Clicking on the More Details link at the bottom right hand side of the callout balloon takes you to the details page. If any of you readers are DEPOSITORS on the ChemSpider system please note that we would love you to maintain your own page. Contact me and I will guide you through the process. This is the first of many enhancements to help navigate Data Sources.

Yay..I am on my way to SciFoo in August to hang out with lots of people I know and lots more I don’t. It’s a Science camp with an intention of “encouraging collaboration between scientists who would not typically work together.”

As mentioned in the invite to me: “The Economist said that it ‘capture[s] the essence of innovation’; in a photo essay for Edge, George Dyson wrote of ‘the impossible choice’ when deciding which sessions to attend; another attendee described it simply as ‘The best gathering ever. Period.’”

I am really excited to participate and there are already conversations afoot regarding getting a group of us together to discuss extending ChemSpider to become an ever better platform for “Open Notebook Science”.

This is going to be great!

The past couple of days have seen an interesting exchange going on over on the SimBioSys blog.

Zsolt Zsoldos is someone I respect, not only for his passion for his science but also for his want to educate others in the challenges of what he does in developing software. I believe his blog post entitled “Crystal Structure Errors in CSD too” was an honest attempt to tell people to be “careful” when using data from databases. I don’t care whether the database is ChemSpider, PubChem, the CAS Registry or any of the other databases available via free access or commercial transaction, they ALL have errors. It is inevitable. Zsolt’s attempt to highlight that such errors exist was done, I believe, with pedagogical intent.

“J” then came back and gave some appropriate comments in response to Zsolt’s post and they should be consumed in series. It appears there was some type of backroom conversation, likely with the CCDC,  about how these comments were not prominent enough. Zsolt then posted this:

Update: Since the posting of this blog entry, we have received 2 public comments — displayed in a standard way as all comments by the WordPress blog software, and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment — which explains the problems and directs the blame on my naivity for my wrong expectations about the data — was not displayed as prominently as the original article.”

He then posted the comment into the original article. Huh? Not sure why Zsolt should have felt obliged to do this for anyone. It’s a Wordpress issue re how comments are displayed. He should not have felt obliged to insert the text into the article. Zsolt then went on to comment about the licence agreement and permission to use the CSD. What is more interesting to me is his view here:

“On a personal opinion: such restrictions on the use of scientific facts do not seem to make much sense to me. As the IUCr position paper explains: There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community. Also the FAQ on the CrystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download”. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning for others who have not yet made such mistake to sign the draconian agreement.”

Those of you who have been watching the discussion between myself and ACS over the past few months will know I have been trying to get confirmation that “supplementary data” are Open Data and that we could scrape the CIFs if we chose to…it’s a MANY month conversation at this point. The Unilever School at Cambridge, via Nick Day’s work, has generated CrystalEye and, after many conversations, we were provided the data source and have it on ChemSpider now. We are awaiting constructive feedback from Nick and Peter Murray-Rust regarding our implementation of their data on our site. This is especially important when there are licensing issues as appear to have been enforced on SimBioSys, evidenced by this Public Apology to CCDC. Read the post for details. It is Zsolt’s concluding statement that feeds directly into the value of Open Data in science and the value of CrystalEye to the community.

He comments: “One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), e.g. such that is available from CrystalEye. When even non-profit organizations (registered as a charity) use draconian license agreements protecting data created and published by others, then fully commercial entities (like pharmaceutical companies) must be guarding their own data even stronger. It makes it difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the company who sells services on the data.”

As efforts like CrystalEye prevail, as the copyrightability and position of publishers regarding supplementary data is resolved, and the efforts of groups such as ChemSpider are applied to gathering Open Data and developing algorithms from these data, there is likely to be increasing tension showing up such as we see here.

A few days ago I blogged about the removal of the NMR predictor link from ChemSpider and committed to follow up with the developers of the algorithm. They are clearly my type of people…they have moved quickly and have already fixed a couple of bugs. If you check my original post above you will see my comments about the NMR spectrum of benzene. Check below for the NMR spectrum now after their bug-fix. Looks fine to me. They commented they have some additional work to do but it looks like we might be reconnecting to the service shortly.

The benefit of having a community test a software product/service like this is that the developers get the feedback and can go to work. Everybody wins. I look forward to their further comments on this blog post but I can say I am impressed with how fast they mobilized to fix this!

There has been a response to my post about Chemical Names and Structures here.

PMR> “For certain purposes, it is valuable to collect as many names as possible, for example for location of lookup. But these should be accompanied with metadata. A similar example is from ChemSpiderMan (ed.):

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here: Looks fine in my browser and pasted in here too: N-{2-[({5-[(diméthylamino)méthyl]furan-2-yl}méthyl)sulfanyl]éthyl}-N’-méthyl-2-nitroéthène-1,1-diamine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the language information is lost. For example what does “pain” mean? If the language is not given there is a tendency to interpret this as English. We have to acknowledge that the language of science is currently English (it wasn’t when I started and we had to read French and German papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful.”
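The “pain” example can be made concrete with a small sketch. It models a name as text plus an optional language tag and prints it in the `"text"@lang` style that RDF uses for language-tagged literals. The `TaggedName` class, the `interpret` helper and the two-entry glossary are hypothetical, made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TaggedName:
    text: str
    lang: Optional[str] = None  # e.g. "en" or "fr"; None means not recorded

    def __str__(self):
        # Mimic the RDF literal notation: "pain"@fr
        if self.lang:
            return f'"{self.text}"@{self.lang}'
        return f'"{self.text}"'


# Hypothetical glossary, only to show that the tag changes the reading.
GLOSS = {
    ("pain", "en"): "physical suffering",
    ("pain", "fr"): "bread",
}


def interpret(name):
    """Look the name up by (text, language); refuse to guess without a tag."""
    if name.lang is None:
        return "ambiguous: no language given"
    return GLOSS.get((name.text, name.lang), "unknown term")


for name in (TaggedName("pain"), TaggedName("pain", "en"), TaggedName("pain", "fr")):
    print(name, "->", interpret(name))
```

The untagged name cannot be resolved at all, while the same four letters resolve to two different meanings once the tag is present.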

First of all, it’s interesting to note that the French name has been rendered as “junk” in Peter’s blog as shown here.

This probably relates to his original comment that the name is junk in his browser too…but acceptable in mine. On the other hand his blog post may look fine to him and looks bad in mine! Oh those dependencies…I see similar things show up in Wordpress regularly.

Peter suggests that there should be metadata giving the language information. Good idea. See my previous blog post about that particular issue and the fact that we allow curators to layer on metadata AND we capture and retain it WHEN it is available.

If you look at this record you will see that there are names labeled as Polish, German and Dutch.

Chloroprene [Wiki]

1,3-Butadiene, 2-chloro-

126-99-8 [RN]

204-818-0 [EINECS]

2-Chloor-1,3-butadieen [Dutch]

2-Chlor-1,3-butadien [German]

2-Chlorbuta-1,3-dien [German]

2-Chloro-1,3-butadiene

2-Chlorobutadiene

Chloropren [Polish]

Most labels were captured during the deposition process. One was added manually. Notice also the direct links to Wikipedia, the Registry number link to perform a search of PubChem and the link to EINECS.
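Labels such as [RN] also make simple automated checks possible. As one illustration, a CAS Registry Number carries a check digit: the digits before the final hyphen are weighted 1, 2, 3, … from right to left, summed, and the sum modulo 10 must equal the last digit. The sketch below applies that rule; the function name is hypothetical and this is not a description of how ChemSpider validates names.

```python
def cas_check_digit_ok(rn: str) -> bool:
    """Return True if a CAS Registry Number's last digit matches its checksum."""
    parts = rn.split("-")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return False
    # Format is NNNNNNN-NN-N: two digits in the middle, one check digit.
    if len(parts[1]) != 2 or len(parts[2]) != 1:
        return False
    body = parts[0] + parts[1]
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == int(parts[2])


print(cas_check_digit_ok("126-99-8"))  # the [RN] from the record above
print(cas_check_digit_ok("126-99-7"))  # same number with a wrong check digit
```

For 126-99-8 the weighted sum is 9·1 + 9·2 + 6·3 + 2·4 + 1·5 = 58, and 58 mod 10 = 8, so the number passes. A check like this catches typing errors but says nothing about whether the number belongs to the structure on the record.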

As I commented in my post on ranitidine, and extracting from Peter’s post “Notice …….. that the name is labeled French on the record.” So, what Peter suggests is already in place on ChemSpider. I display below what is presently available to curators to label the names with. Notice this includes language, EINECS numbers, CAS Registry Numbers, INNs, JANs etc.


The list of languages is easy to expand. Anybody have any requests?

A further comment: “PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide). We have to use authorities (provenance) in our information. Thus the statements ‘Ranitidine is the Z-isomer’ and ‘Ranitidine is the E-isomer’ may be seen as contradictory. That’s why people have suggested that RDF should have quads, not triples, such as: ‘Antony_Williams asserts ranitidine hasIsomer Z’; ‘Wikipedia asserts ranitidine hasIsomer E’. Both these are true. That is the language we should use in the semantic web. PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.”
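The quad idea in that comment is easy to sketch: each statement carries the name of whoever asserted it, so two conflicting values can sit side by side without either being overwritten. The example below uses plain Python tuples rather than an RDF library; the `Quad` type and the helper functions are hypothetical, and the two assertions are the ones given in the quote.

```python
from collections import namedtuple

# A triple (subject, predicate, obj) plus the source that asserts it.
Quad = namedtuple("Quad", ["source", "subject", "predicate", "obj"])

assertions = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]


def values_by_source(quads, subject, predicate):
    """Map each asserting source to the value it gives for subject/predicate."""
    return {
        q.source: q.obj
        for q in quads
        if q.subject == subject and q.predicate == predicate
    }


def sources_disagree(quads, subject, predicate):
    """True when the sources give more than one distinct value."""
    return len(set(values_by_source(quads, subject, predicate).values())) > 1


print(values_by_source(assertions, "ranitidine", "hasIsomer"))
print(sources_disagree(assertions, "ranitidine", "hasIsomer"))
```

Stored as bare triples, the second statement would look like a contradiction of the first. Stored with provenance, both remain true as records of who said what, and the disagreement becomes something a curator can query for.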

This leads us into a deeper discussion about retention of metadata and authorities. We retain metadata when it is deposited or we can harvest it. Let’s consider the information below extracted from the same compound on ChemSpider:

Notice all of the

and note that they all link through to the original source of information, in this case NIOSH.

  • Appearance: Colorless liquid with a pungent, ether-like odor.

  • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, skin, respiratory system; anxiety, irritability; dermatitis; alopecia; reproductive effects; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, reproductive system Cancer Site [lung & skin cancer]

  • Incompatibilities and Reactivities: Peroxides & other oxidizers [Note: Polymerizes at room temperature unless inhibited with antioxidants.]

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet (flammable) Change: No recommendation Provide: Eyewash, Quick drench

  • Exposure Limits: NIOSH REL : Ca C 1 ppm (3.6 mg/m 3 ) [15-minute] See Appendix A OSHA PEL ?: TWA 25 ppm (90 mg/m 3 ) [skin]

There are also properties, and each piece of data links out to the original source. For this record it is the same source. For some records there are already multiple sources.

Experimental physchem properties

  • Boiling Point: 139 °F

  • Flash Point: -4 °F

  • Freezing Point: -153 °F

  • Specific Gravity: 0.96

  • Solubility: Slight

  • Ionization Potential: 8.79 eV

  • Vapor Pressure: 188 mmHg

This particular structure has been deposited onto the ChemSpider database a total of 18 times from the source databases listed below. Where possible, i.e. when the structure is available online on the supplier’s website and can be hyperlinked to, each external ID links to the depositor. There is an error! The Aldrich depositions are for the polymer forms! Curators can remove this information.

Data Source                     External ID(s)
ChemDB                          6681768
ChemIDplus                      000126998, 014523898
DiscoveryGate                   31369
DTP/NCI                         18589
EINECS                          N/A
EPA DSSTox                      1084_NTPBSI_v2b, 325_CPDBAS_v5b, 326_CPDBAS_v5b, 724_HPVCSI_v2c
Istituto Superiore di Sanità    601
NIOSH                           EI9625000
NIST                            2143397875
NIST Chemistry WebBook          2143397875
PubChem                         31369
Sigma-Aldrich                   205397_ALDRICH, 205400_ALDRICH
Thomson Pharma                  00243363
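As a rough illustration of how the deposition list above adds up, the sketch below stores the external IDs per source and tallies them. The dictionary layout and function names are hypothetical, not ChemSpider's data model. Counting the EINECS row as one deposition, even though it has no external ID, is my assumption; it is the reading under which the rows sum to the 18 depositions mentioned.

```python
# External IDs per data source, copied from the list above.
# Assumption: the EINECS row counts as one deposition despite "N/A".
DEPOSITIONS = {
    "ChemDB": ["6681768"],
    "ChemIDplus": ["000126998", "014523898"],
    "DiscoveryGate": ["31369"],
    "DTP/NCI": ["18589"],
    "EINECS": ["N/A"],
    "EPA DSSTox": [
        "1084_NTPBSI_v2b",
        "325_CPDBAS_v5b",
        "326_CPDBAS_v5b",
        "724_HPVCSI_v2c",
    ],
    "Istituto Superiore di Sanità": ["601"],
    "NIOSH": ["EI9625000"],
    "NIST": ["2143397875"],
    "NIST Chemistry WebBook": ["2143397875"],
    "PubChem": ["31369"],
    "Sigma-Aldrich": ["205397_ALDRICH", "205400_ALDRICH"],
    "Thomson Pharma": ["00243363"],
}


def total_depositions(depositions):
    """Count every external ID across all sources."""
    return sum(len(ids) for ids in depositions.values())


def without_sources(depositions, flagged):
    """Return a copy with the flagged sources removed (hypothetical curation step)."""
    return {src: ids for src, ids in depositions.items() if src not in flagged}


if __name__ == "__main__":
    print(total_depositions(DEPOSITIONS))
    curated = without_sources(DEPOSITIONS, {"Sigma-Aldrich"})
    print(total_depositions(curated))
```

Removing the two Aldrich entries, which the post identifies as belonging to the polymer forms, is shown here only as an example of what a curation step might do to the tally.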

Also available to master curators is the ability to see who has been editing the names and synonyms, along with a full record of depositions, by whom and when.

So, names are labeled with language and links to Wikipedia and other info. The predicted properties and systematic name are generally labeled according to the provider of the algorithm(s). We keep track of every URL and publication deposition and know which user deposited what and when…if the site is “vandalized” then we know which user did so.

Overall I’d say we have a lot of metadata for this record. The same is true for tens of thousands of records on ChemSpider and the amount of such information is growing literally daily. We’re not done yet of course - there is much more to add. We put a lot of thought into the design of this system and associated metadata but we also chose to jump off the cliff and start “doing”. There is a lot to learn from managing 20 million molecules and the complexity that comes with doing so. We continue to morph and extend as necessary and welcome input.

To clarify re. ranitidine…. I am NOT asserting that ranitidine has the Z-isomer. I am stating that ranitidine has multiple names on ChemSpider, some with no stereochemistry and some with Z-stereochemistry. I also report that a published crystal structure reports a Z-orientation. I also report that a commercial software package suggests that the three tautomeric structures below are possible for ranitidine.

I also report, just for fun of course, that the InChI algorithm will declare two of these isomers, the bottom two, as equivalent when “mobile protons” are taken into account. Compare the ON InChIKeys below when mobile proton perception is detected by the InChI algorithm. Need more information?

With the curation capabilities we have in place, with the retained metadata, linkages to depositors and other sites and the revision history available, I would say that we are well equipped to manage the data for chemists and continue to enhance our platform for chemists worldwide.


The references and abstracts for my two recent articles in Drug Discovery Today are listed below should anyone be interested.

Internet-based tools for communication and collaboration in chemistry

Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 502-506, doi:10.1016/j.drudis.2008.03.015

Web-based technologies, coupled with a drive for improved communication between scientists, have resulted in the proliferation of scientific opinion, data and knowledge at an ever-increasing rate. The availability of tools to host wikis and blogs has provided the necessary building blocks for scientists with only a rudimentary understanding of computer software science to communicate to the masses. This newfound freedom has the ability to speed up research and sharing of results, develop extensive collaborations, conduct science in public, and in near-real time. The technologies supporting chemistry, while immature, are fast developing to support chemical structures and reactions, analytical data support and integration to related data sources via supporting software technologies. Communication in chemistry is already witnessing a new revolution.

A perspective of publicly accessible/open-access chemistry databases

Drug Discovery Today, Volume 13, Numbers 11/12, June 2008, 495-501, doi:10.1016/j.drudis.2008.03.017

The Internet has spawned access to unprecedented levels of information. For chemists the increasing number of resources they can use to access chemistry-related information provides them a valuable path to discovery of information, one which was previously limited to commercial and therefore constrained resources. The diversity of information continues to expand at a dramatic rate and, coupled with an increasing awareness for quality, curation and improved tools for focused searches, chemists are now able to find valuable information within a few seconds using a few keystrokes. This shift to publicly available resources offers great promise to the benefits of science and society yet brings with it increasing concern from commercial entities. This article will discuss the benefits and disruptions associated with an increase in publicly available scientific resources.


We had previously announced the ability to perform NMR prediction via ChemSpider using the NMRDB.org services. Based on a number of comments from people testing the system, as well as some of my own tests, we have chosen, for the time being at least, to remove the connection to this NMR prediction service from the website.

We do not believe that the issue is our integration. A prediction for the proton spectrum of benzene on the nmrdb.org website gives the 1H NMR spectrum below.

GOOD NMR prediction is not easy. I’m an NMR spectroscopist by training. I’ve either been using NMR prediction tools or helping to design and build them for almost 2 decades. Some publications of interest are listed below:

Toward More Reliable 13C and 1H Chemical Shift Prediction: A Systematic Comparison of Neural-Network and Least-Squares Regression Based Approaches. J. Chem. Inf. Model., Web Release Date: 05-Dec-2007, doi:10.1021/ci700256n

The Performance Validation of Neural Network Based 13C NMR Prediction Using a Publicly Available Data Source. J. Chem. Inf. Model., 48 (3), 550-555, 2008, doi:10.1021/ci700363r

Automated structure verification based on 1H NMR prediction. Magn. Reson. Chem., 44, 524 (2006)

Computer-Aided Determination of Relative Stereochemistry and 3D Models of Complex Organic Molecules from 2D NMR Spectra, Tetrahedron, 61, 9980-9989 (2005)

We definitely want to have an NMR prediction capability hooked up to ChemSpider. We would like that to be a free service if possible but are also open to hooking up for-fee services and the users of ChemSpider can choose whether to use them, for free or fee, or not. If anyone has an interest in hooking up services to ChemSpider feel free to connect with me.

We have contacted the nmrdb.org hosts about the issue also.


In May of this year we announced we had adopted Creative Commons licenses for ChemSpider. We thought we were doing the best thing for the community, and some agreed. However, the leaking of an internal memo from a Creative Commons discussion highlighted that our adoption of CC licenses for data was not necessarily appropriate. Over the next few days a spark was lit in the blogosphere regarding our adoption of the licenses, whether it was appropriate or not and what the alternatives were. Discussions have also continued offline, out of the eyes of the blogosphere. I prefer to keep some of those conversations private.

We do not yet have a decision regarding the most appropriate licensing scheme to adopt but have chosen to remove the CC licenses at this time until we make a decision.

I understand that this may result in yet another discussion about our choice. Damned if we do, damned if we don’t. So be it.

If you