Archive for the Quality and Content Category

I’ve started a review of the DailyMed dataset as it is representative of some of the struggles with preparing a curated dataset of chemical structures, chemical names and trade names. In the first comment I pointed to issues with structure representations. I believe one of the worst is shown for qvar to the left. An examination of the qvar record gives the name as beclamethasone propionate. This particular compound has the chemical structure shown below. Not only is the stereochemistry missing from the structure on DailyMed but also half the ring has been lost, maybe during a scanning process? I wonder whether the label circulating out there to the public has this issue? Would the public care? Probably not. But when trying to build a curated dataset it’s rather important.


The past couple of days have seen an interesting exchange over on the SimBioSys blog.

Zsolt Zsoldos is someone I respect, not only for his passion for his science but also for his desire to educate others about the challenges of what he does in developing software. I believe his blog post entitled “Crystal Structure Errors in CSD too” was an honest attempt to tell people to be “careful” when using data from databases. I don’t care whether the database is ChemSpider, PubChem, the CAS Registry or any of the other databases available via free access or commercial transaction, they ALL have errors. It is inevitable. Zsolt’s attempt to highlight that such errors exist was done, I believe, with pedagogical intent.

“J” then came back and gave some appropriate comments in response to Zsolt’s post and they should be read in sequence. It appears there was some type of backroom conversation, likely with the CCDC, about how these comments were not prominent enough. Zsolt then posted this:

“Update: Since the posting of this blog entry, we have received 2 public comments – displayed in a standard way as all comments by the WordPress blog software, and some private emails originating from CCDC. One of the complaints from CCDC was that the second comment – which explains the problems and directs the blame on my naivety for my wrong expectations about the data – was not displayed as prominently as the original article.”

He then posted the comment into the original article. Huh? Not sure why Zsolt should have felt obliged to do this for anyone. It’s a WordPress issue re how comments are displayed. He should not have felt obliged to insert the text into the article. Zsolt then went on to comment about the licence agreement and permission to use the CSD. What is more interesting to me is his view here:

“On a personal opinion: such restrictions on the use of scientific facts do not seem to make much sense to me. As the IUCr position paper explains: There is a long-standing acceptance within crystallography of the principle that such primary data sets should be freely available for sharing and re-use (with appropriate credit) within the structural science community. Also the FAQ on the CrystalEye site explains: “As this supplementary data is a set of facts and is not part of the article full-text it does not fall under the copyright, and it should therefore be free to both view and download”. Nevertheless, CCDC has the legal right to stop us from using the data, since we signed a licensing agreement containing such conditions. That was a mistake on our part, one that we have to live with now. Let this case be a warning for others who have not yet made such mistake to sign the draconian agreement.”

Those of you who have been watching the discussion between myself and ACS over the past few months will know I have been trying to get confirmation that “supplementary data” are Open Data and that we could scrape the CIFs if we chose to…it’s a MANY month conversation at this point. The Unilever School at Cambridge, via Nick Day’s work, has generated CrystalEye and, after many conversations, we were provided the data source and have it on ChemSpider now. We are awaiting constructive feedback from Nick and Peter Murray-Rust regarding our implementation of their data on our site. This is especially important when there are licensing issues such as those that appear to have been enforced on SimBioSys, evidenced by this Public Apology to CCDC. Read the post for details. It is Zsolt’s concluding statement that feeds directly into the value of Open Data in science and the value of CrystalEye to the community.

He comments: “One lesson I learned from this exchange is the importance of Open Data for scientific advancement (some scientists believe that research data must be free), e.g. such that is available from CrystalEye. When even non-profit organizations (registered as a charity) use draconian license agreements protecting data created and published by others, then fully commercial entities (like pharmaceutical companies) must be guarding their own data even stronger. It makes it difficult to make scientific progress if a single blog mention of an error in a data entry invites the wrath of the company who sells services on the data.”

As efforts like CrystalEye prevail, as the copyrightability and position of publishers regarding supplementary data is resolved, and the efforts of groups such as ChemSpider are applied to gathering Open Data and developing algorithms from these data, there is likely to be increasing tension showing up such as we see here.

There has been a response to my post about Chemical Names and Structures here.

PMR>”For certain purposes, it is valuable to collect as many names as possible, for example for location of lookup. But these should be accompanied with metadata. A similar example is from ChemSpiderMan (ed.):

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here: Looks fine in my broswer and pasted in here too: N-{2-[({5?-[(dim�th?ylamino)m?�thyl]fur?an-2-yl}m?�thyl)sul?fanyl]�th?yl}-N’-m�?thyl-2-ni?tro�th�ne?-1,1-diam?ine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the language, information is lost. For example what does “pain” mean? If the language is not given there is a tendency to interpret this as English. We have to acknowledge that the language of science is currently English (it wasn’t when I started and we had to read French and German papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful.”

First of all, it’s interesting to note that the French name has been rendered as “junk” in Peter’s blog as shown here.

This probably relates to his original comment that the name is junk in his browser too…but acceptable in mine. On the other hand his blog post may look fine to him and looks bad in mine! Oh those dependencies…I see similar things show up in WordPress regularly.
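One plausible mechanism for this kind of rendering difference is an encoding mismatch somewhere between the database and the browser. This is an assumption for illustration, not a diagnosis of either blog. A minimal Python sketch, using a fragment of the French name:

```python
# Fragment of the French name; the accented "é" is the character at risk.
fragment = "diméthylamino"

# Case 1: text forced into ASCII. Unencodable characters become "?".
ascii_view = fragment.encode("ascii", errors="replace").decode("ascii")

# Case 2: bytes written as Latin-1 but read back as UTF-8.
# The lone 0xE9 byte is not valid UTF-8, so it is replaced by U+FFFD.
latin1_bytes = fragment.encode("latin-1")
utf8_misread = latin1_bytes.decode("utf-8", errors="replace")

# Case 3: the same encoding on both sides round-trips cleanly.
round_trip = fragment.encode("utf-8").decode("utf-8")

print(ascii_view)
print(utf8_misread)
print(round_trip)
```

The first two outputs resemble the “?” and “�” characters seen in the pasted name, while the third shows that the stored name itself can be perfectly fine.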

Peter suggests that there should be metadata giving the language information. Good idea. See my previous blog post about that particular issue and the fact that we allow curators to layer on metadata AND we capture and retain it WHEN it is available.

If you look at this record you will see that there are names labeled as Polish, German and Dutch.

Chloroprene [Wiki]

1,3-Butadiene, 2-chloro-

126-99-8 [RN]

204-818-0 [EINECS]

2-Chloor-1,3-butadieen [Dutch]

2-Chlor-1,3-butadien [German]

2-Chlorbuta-1,3-dien [German]

2-Chloro-1,3-butadiene

2-Chlorobutadiene

Chloropren [Polish]

Most labels were captured during the deposition process. One was added manually. Notice also the direct links to Wikipedia, the Registry number link to perform a search of PubChem and the link to EINECS.
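As a rough illustration of what a label on a name buys you, the list above can be modelled as name/label pairs. The `LabeledName` layout and the helper function are hypothetical and are not ChemSpider's actual schema:

```python
from collections import namedtuple

# Hypothetical record layout: a name plus an optional label.
LabeledName = namedtuple("LabeledName", ["text", "label"])

chloroprene_names = [
    LabeledName("Chloroprene", "Wiki"),
    LabeledName("1,3-Butadiene, 2-chloro-", None),
    LabeledName("126-99-8", "RN"),
    LabeledName("204-818-0", "EINECS"),
    LabeledName("2-Chloor-1,3-butadieen", "Dutch"),
    LabeledName("2-Chlor-1,3-butadien", "German"),
    LabeledName("2-Chlorbuta-1,3-dien", "German"),
    LabeledName("2-Chloro-1,3-butadiene", None),
    LabeledName("2-Chlorobutadiene", None),
    LabeledName("Chloropren", "Polish"),
]

def names_with_label(names, label):
    """Return the name strings carrying the given label (None = unlabeled)."""
    return [n.text for n in names if n.label == label]

german = names_with_label(chloroprene_names, "German")
unlabeled = names_with_label(chloroprene_names, None)
print(german)
print(unlabeled)
```

With the label attached, a machine can tell a German name from a registry number instead of treating every string as an interchangeable synonym.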

As I commented in my post on ranitidine, and extracting from Peter’s post “Notice …….. that the name is labeled French on the record.” So, what Peter suggests is already in place on ChemSpider. I display below what is presently available to curators to label the names with. Notice this includes language, EINECS numbers, CAS Registry Numbers, INNs, JANs etc.


The list of languages is easy to expand. Anybody have any requests?

A further comment “PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide). We have to use authorities (provenance) in our information. Thus the statements: Ranitidine is the Z-isomer and Ranitidine is the E-isomer may be seen as contradictory. That’s why people have suggested that RDF should have quads, not triples, such as Antony_Williams asserts ranitidine hasIsomer Z Wikipedia asserts ranitidine hasIsomer E Both these are true. That is the language we should use in the semantic web PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.”
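The quad idea in Peter's comment can be sketched in a few lines of Python. The two assertions are copied from his example; the `Quad` tuple and the helper are hypothetical and do not use any particular RDF library:

```python
from collections import namedtuple

# A quad is a triple plus the authority making the assertion.
Quad = namedtuple("Quad", ["source", "subject", "predicate", "obj"])

assertions = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]

def values_by_source(quads, subject, predicate):
    """Map each asserting source to the value it asserts."""
    return {q.source: q.obj for q in quads
            if q.subject == subject and q.predicate == predicate}

by_source = values_by_source(assertions, "ranitidine", "hasIsomer")
# Stripped of their sources, the two values would look like a contradiction.
distinct_values = sorted(set(by_source.values()))
print(by_source)
print(distinct_values)
```

Keeping the source alongside each statement means both can be stored as true records of who said what, leaving the judgement to a curator.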

This leads us into a deeper discussion about retention of metadata and authorities. We retain metadata when it is deposited or we can harvest it. Let’s consider the information below extracted from the same compound on ChemSpider:

Notice all of the

and note that they all link through to the original source of information, in this case NIOSH.

  • Appearance: Colorless liquid with a pungent, ether-like odor.

  • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, skin, respiratory system; anxiety, irritability; dermatitis; alopecia; reproductive effects; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, reproductive system Cancer Site [lung & skin cancer]

  • Incompatibilities and Reactivities: Peroxides & other oxidizers [Note: Polymerizes at room temperature unless inhibited with antioxidants.]

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet (flammable) Change: No recommendation Provide: Eyewash, Quick drench

  • Exposure Limits: NIOSH REL: Ca C 1 ppm (3.6 mg/m³) [15-minute] See Appendix A OSHA PEL ?: TWA 25 ppm (90 mg/m³) [skin]

There are also properties and each piece of data links out to the original source. For this record it is the same source. For some records there are already multiple sources.

Experimental physchem properties

  • Boiling Point: 139 °F

  • Flash Point: -4 °F

  • Freezing Point: -153 °F

  • Specific Gravity: 0.96

  • Solubility: Slight

  • Ionization Potential: 8.79 eV

  • Vapor Pressure: 188 mmHg

This particular structure has been deposited onto the ChemSpider database a total of 18 times from the source databases listed below. Where possible, i.e. when the structure is available online on the supplier’s website and can be hyperlinked to, each external ID links to the depositor. There is an error! The Aldrich depositions are for the polymer forms! Curators can remove this information.

Data Source External ID(s)
ChemDB 6681768
ChemIDplus 000126998, 014523898
DiscoveryGate 31369
DTP/NCI 18589
EINECS N/A
EPA DSSTox 1084_NTPBSI_v2b, 325_CPDBAS_v5b, 326_CPDBAS_v5b, 724_HPVCSI_v2c
Istituto Superiore di Sanità 601
NIOSH EI9625000
NIST 2143397875
NIST Chemistry WebBook 2143397875
PubChem 31369
Sigma-Aldrich 205397_ALDRICH, 205400_ALDRICH
Thomson Pharma 00243363
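To make the provenance bookkeeping concrete, the table above can be held as a mapping from data source to external IDs. This is only a sketch: counting the EINECS row (listed as N/A) as one deposition without an ID is my assumption, and the `flagged` dictionary is a hypothetical stand-in for a curator's note.

```python
# Data source -> external IDs, copied from the table above.
# EINECS is listed as "N/A"; it is assumed here to be one deposition with no ID.
depositions = {
    "ChemDB": ["6681768"],
    "ChemIDplus": ["000126998", "014523898"],
    "DiscoveryGate": ["31369"],
    "DTP/NCI": ["18589"],
    "EINECS": [None],
    "EPA DSSTox": ["1084_NTPBSI_v2b", "325_CPDBAS_v5b",
                   "326_CPDBAS_v5b", "724_HPVCSI_v2c"],
    "Istituto Superiore di Sanità": ["601"],
    "NIOSH": ["EI9625000"],
    "NIST": ["2143397875"],
    "NIST Chemistry WebBook": ["2143397875"],
    "PubChem": ["31369"],
    "Sigma-Aldrich": ["205397_ALDRICH", "205400_ALDRICH"],
    "Thomson Pharma": ["00243363"],
}

total = sum(len(ids) for ids in depositions.values())

# Hypothetical curator flag: the text says the Aldrich entries are polymers.
flagged = {"Sigma-Aldrich": "deposited IDs refer to polymer forms"}
trusted = {src: ids for src, ids in depositions.items() if src not in flagged}

print(total)
print(sorted(trusted))
```

Under that assumption the tally comes to 18, matching the count quoted in the text, and the flagged source can be set aside without losing the record of who deposited it.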

Also available to master curators is the ability to see who has been editing the names and synonyms and a full record of depositions, by whom and when.

So, names are labeled with language and links to Wikipedia and other info. The predicted properties and systematic name are generally labeled according to the provider of the algorithm(s). We keep track of every URL and publication deposition and know which user deposited what and when…if the site is “vandalized” then we know which user did so.

Overall I’d say we have a lot of metadata for this record. The same is true for tens of thousands of records on ChemSpider and the amount of such information is growing literally daily. We’re not done yet of course – there is much more to add. We put a lot of thought into the design of this system and associated metadata but we also chose to jump off the cliff and start “doing”. There is a lot to learn from managing 20 million molecules and the complexity that comes with doing so. We continue to morph and extend as necessary and welcome input.

To clarify re. ranitidine…. I am NOT asserting that ranitidine is the Z-isomer. I am stating that ranitidine has multiple names on ChemSpider, some with no stereochemistry and some with Z-stereochemistry. I also report that a published crystal structure reports a Z-orientation. I also report that a commercial software package suggests that the three tautomeric structures below are possible for ranitidine.

I also report, just for fun of course, that the InChI algorithm will declare two of these isomers, the bottom two, as equivalent when “mobile protons” are taken into account. Compare the ON InChIKeys below when mobile proton perception is detected by the InChI algorithm. Need more information?

With the curation capabilities we have in place, with the retained metadata, linkages to depositors and other sites and the revision history available, I would say that we are well equipped to manage the data for chemists and continue to enhance our platform for chemists worldwide.

Recently I posted on whether or not there is “a right structure for a compound”. I talked about trade names and registered chemical entities and posed the question “whether a Registered Trade Name is absolute? I’m asking the question since I’m actually not sure.”

There were two responses…

1) Rich Apodaca commented: “you’d probably find agreement among chemists that a trade name uniquely identifies one specific chemical entity. Ditto CAS Number.”

2) Peter Murray-Rust, as is his way (does anyone ever get a comment on their blog from PMR?), posted a detailed and thoughtful response on his own blog here.

I, like Rich, am of the opinion that a CAS Number does uniquely identify a specific chemical entity, not necessarily a unique structure. Of course, CAS numbers can be confusing too as I have commented here. Aspirin, for example, has 6 CAS numbers! So Rich and I agree on this…can anyone from CAS confirm or not whether our belief is right?

So, what about Trade Names? There were a number of purposeful errors in my original post to stimulate thought and feedback about my question. There is a LOT of confusion about identifiers and chemicals. The relationships are convoluted and even I struggle with certain aspects. So, let’s examine the confusions!

I commented that “Zantac is a registered trade name for the chemical here. ” Check out the chemical structure there.

Now check out the Wikipedia text on that record view: “Ranitidine (INN) is a histamine H2-receptor antagonist that inhibits stomach acid production. It is commonly used in the treatment of peptic ulcer disease (PUD) and gastroesophageal reflux disease (GERD). It is currently marketed over the counter under the trade name Zinetac and Zantac by GlaxoSmithKline and by many other companies under various other names. ”

One might assume therefore that I am correct in my statement about Zantac. Check out the DailyMed label for Zantac here. This declares: “The active ingredient in ZANTAC Injection and ZANTAC Injection Premixed is ranitidine hydrochloride (HCl)”. Ah-ha…Zantac is a hydrochloride form of Ranitidine then? A search for Zantac gives THREE results on DailyMed…different in formulations but all pointing to the HCl form of ranitidine as the active component. So, based on this statement is it correct to label the structure here with the label Zantac? It doesn’t have the HCl so in theory, no. Is Wikipedia correct in saying that Ranitidine is “marketed over the counter as Zantac”? No. Hmmm. A conundrum? No. It’s clear. Zantac should ONLY be a Ranitidine HCl formulation. A couple of button clicks and the record now says Zantac (as HCl). But there are a LOT of other trade names associated with that Ranitidine record that don’t have such definitions (yet).

There is a Ranitidine Hydrochloride on ChemSpider here. It came as part of the recent CrystalEye deposition and is at this record. The associated publication is here, the title of the article is “Ranitidine hydrochloride, a polymorphic crystal form” and the abstract says:

” In the title compound, dimethyl({5-[2-(1-methylamino-2-nitroethenylamino)ethylthiomethyl]-2-furyl}methyl)ammonium chloride, C13H23N4O3S+·Cl-, protonation occurs at the dimethylamino N atom. The ranitidine molecule adopts an eclipsed conformation. Bond lengths indicate extensive electron delocalization in the N,N‘-dimethyl-2-nitro-1,1-ethenediamine system of the molecule. The nitro and methylamino groups are trans across the side chain C=C double bond, while the ethylamino and nitro groups are cis. The Cl- ions link molecules through hydrogen bonds.”

When I take the orientation information and draw the molecule from the crystal structure then I get:

and when I name this I get: (Z)-N-{2-[({5-[(dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethene-1,1-diamine, a Z-orientation.

Let’s return to Peter’s analysis of the list of identifiers associated with Ranitidine on the ChemSpider record in question. He comments

“PMR: ….. It is clear that

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethen-1,1-diamin

and

N-[2-[[[-5-[(Dimethylamino)methyl]-2-furanyl]methyl]thio]ethyl]-N’-methyl-2-nitro-1,1-ethenediamine

are not identical. One describes a compound whose stereochemistry is asserted, the other describes one where the stereochemistry is not asserted. Butene and 1-butene and 2-butene and (Z)-butene are all different. They all have different InChIs. Some of them may refer to the same concept in some contexts, but they are not synonyms. Fowler (Modern English Usage) says “perfect synonyms are extremely rare”.”

We are in absolute agreement about this issue. The names are not identical. One declares stereo and the other doesn’t. The question then is what synonyms are useful to the user of ChemSpider to locate the structure if they have a systematic name. One might assume that the more the merrier. There is an enormous number of variants of bracket styles and dashes that could give rise to probably dozens of names that are all consistent with the structure and the names shown come from different sources.
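One way to keep the distinction visible to a machine is to record whether a name asserts stereochemistry at all. The sketch below only looks for a leading “(E)-” or “(Z)-”, which is a deliberate simplification: real nomenclature can carry descriptors elsewhere, for example as a trailing “, (Z)-”. The function name is hypothetical.

```python
import re

# Leading stereo descriptor such as "(Z)-" or "(E)-" only (a simplification).
_STEREO_PREFIX = re.compile(r"^\((E|Z)\)-")

def leading_stereo(name):
    """Return 'E', 'Z', or None when no leading descriptor is present."""
    match = _STEREO_PREFIX.match(name)
    return match.group(1) if match else None

with_stereo = ("(Z)-N-{2-[({5-[(dimethylamino)methyl]furan-2-yl}methyl)"
               "sulfanyl]ethyl}-N'-methyl-2-nitroethene-1,1-diamine")
without_stereo = ("N-[2-[[[-5-[(Dimethylamino)methyl]-2-furanyl]methyl]thio]"
                  "ethyl]-N'-methyl-2-nitro-1,1-ethenediamine")

print(leading_stereo(with_stereo))
print(leading_stereo(without_stereo))
```

A search index could then offer both names as routes to the record while still marking one as stereo-asserted and the other as stereo-unspecified.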

Additionally the comment is made “If we are representing something in a machine, and we assert the two are to be used interchangeably then we have to be very sure that they can be. Adding a “(Z)” may appear a reasonable thing to do – in this case it is a disastrous act that corrupts information.” This is the problem with identifiers: they are confounded with complexity, and this supports the concept that there are no absolutes in names associated with compounds.

In discussing Wikipedia Peter has previously pointed to Wikipedia as “Open, re-usable, very highly curated, and the first place that students look. That – or a derivative – is where the world’s chemistry should reside.” I have covered the complexity of Taxol/paclitaxel previously (1,2,3) so where does Wikipedia stand on Ranitidine?

Wikipedia actually shows and names an E-orientation as shown below

So, Wikipedia says E, ChemSpider says Z- and no-specific stereochemistry (in its identifiers). The crystal structure specifies Z-stereo. Oh dear, what can the matter be?

I then searched PubChem and found two E’s and a Z under Zantac. I searched MeSH for ranitidine and found no stereo specified. I searched ChEBI for both ranitidine and zantac and found nothing.

Further down the rabbit hole we go…

PMR> “The robotic aggregation of chemical names and identifiers, if done without metadata and ontology, corrupts information. That’s a strong statement, but we can see it in the current case. First there is junk out there. Robotic name harvesting harvests junk. (Christoph Steinbeck described it in worse terms at the RSC meeting. ) Here’s a snip from page571454

Validated by Experts, Validated by Users, Non-Validated, Removed by Users, Redirected by Users, Redirect Approved by Experts

Ranitidine [Wiki]

(Z)-N-{2-​[({5-[(Di​m?thylami​no)m?thyl​]furan-2-​yl}m?thyl​)sulfanyl​]?thyl}-N​’-m?thyl-​2-nitro?t​h?ne-1,1-​diamine

(Z)-N-{2-​[({5-[(Di​methylami​no)methyl​]furan-2-​yl}methyl​)sulfanyl​]ethyl}-N​’-methyl-​2-nitroet​hen-1,1-d​iamin

The “?” characters show up in my browser – I don’t know what they are, but they are not normal “e”s (ASCII 101). The first name is not a synonym – I’m sorry, but it’s junk. Associating junk with good information degrades the good information rather than increasing the quality of the junk (There is a more formal proof somewhere by Shannon – I believe – that machines cannot act as 100% proofreaders).”

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here:

Looks fine in my browser and pasted in here too: N-{2-[({5-[(diméthylamino)méthyl]furan-2-yl}méthyl)sulfanyl]éthyl}-N’-méthyl-2-nitroéthène-1,1-diamine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

Further

“PMR: A trade name represents a product, not a compound and certainly not a connection table. In some cases it may refer to a pure substance, which itself is describable by a connection table, but these are not synonyms. And aggregating them as synonyms adds error rather than clarity. However there is an even stronger reason why “Zantac” does not describe ranitidine. See the FDA page. Zantac (Ranitidine Hydrochloride) Tablets Zantac contains (not “is”) ranitidine hydrochloride.”

A Trade Name DOES represent a product. It can represent MANY formulations also. The active component is commonly the material of interest that we would like to see as a connection table.

However, if one wants to find the active component in Zantac what would YOU do to find out? Type in Zantac on Wikipedia maybe? Look where it takes you: http://en.wikipedia.org/wiki/Zantac. So, Zantac redirects to Ranitidine. Don’t forget the earlier statement about Wikipedia: “Open, re-usable, very highly curated, and the first place that students look. That – or a derivative – is where the world’s chemistry should reside.” Should the same be true for ChemSpider? I think so. But this is a choice we have to make to provide a service to the users. On MeSH a search on Zantac takes you to Ranitidine. On PubChem Zantac takes you to Ranitidine(s). So, association of Zantac with Ranitidine is appropriate BUT there is a need for ontologies, I agree. ChEBI has a good model for this (more later).
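A very small sketch of how a search could still lead from Zantac to ranitidine without calling them synonyms. The `Contains` relation and the `parent_of` link are hypothetical and are not the ChEBI or ChemSpider model:

```python
from collections import namedtuple

# Hypothetical relation: a product "contains" a substance; it is not a synonym.
Contains = namedtuple("Contains", ["product", "substance", "source"])

relations = [
    Contains("Zantac", "ranitidine hydrochloride", "DailyMed label"),
]

# Hypothetical parent link from a salt form to its free base.
parent_of = {"ranitidine hydrochloride": "ranitidine"}

def resolve(trade_name):
    """Return (substance, parent) pairs a trade-name search could lead to."""
    return [(r.substance, parent_of.get(r.substance))
            for r in relations if r.product == trade_name]

hits = resolve("Zantac")
print(hits)
```

The user who types Zantac still reaches ranitidine, but the route records that the product contains the hydrochloride and that the hydrochloride is a form of the base.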

Interestingly, a search on Ranitidine on ChemSpider provides the following list of names:

PMR comments: “But the current aggregations of chemicals (Chemspider, eMolecules, Chempedia) are designed for use by machines as well as humans. And unless high-quality metadata is given, along with a structured ontology then machine aggregation of chemistry corrupts rather than enhances. For that reason we are building molecular repositories based on metadata and ontologies. In the current era of the web it’s becoming essential. ”

I look forward to seeing how Zantac and Ranitidine are handled in this new world. If it’s a structured ontology then it sounds like an integration of MeSH with structures? Wikipedia is over 5000 organics now and is the culmination of thousands of hours of work by many dedicated individuals. And it is not error-free. Any other efforts will be prone to similar issues so it’s going to be a major undertaking and I look forward to the results. The ChEBI team are already doing a good job in this area. You can see an ontology Tree View here. So, I’m definitely excited to see what will be better! Exciting times.

PMR comments: “Now, I suggested that the “(Z)” should not have been added to “ranitidine” to indicate the stereochemistry. You can find pages out there with “(E)”. What is the “correct structure”? Or is this a meaningless question?”

In my opinion this is NOT a meaningless question; it is a good question. You saw what the crystal structure showed. Should the name include stereochemistry? If so, when?

Please stay engaged in these discussions with both Peter and me. They are important and meaningful.

Following the announcement by JC Bradley that Drexel University now has an eCrystals Repository I connected with Simon Coles. We’ve exchanged a few emails and have the go-ahead to scrape the eCrystals structures and DOIs from their eCrystals repository in Southampton. We will be doing so over the next few days and adding the data to ChemSpider. Watch out for the new collection as it goes online.

Identifiers, synonyms, registry numbers and so on are the primary textual manner by which chemical structures are searched on ChemSpider. There are various “flavors” of these. If you take a look at the record for Xanax you will see that just a FEW of the names have links to Wikipedia or are EINECS numbers, Registry Numbers, International Names, Japanese Names, Latin names, French names etc.

Alprazolam [Wiki]

Xanax [Wiki]

249-349-2 [EINECS]

28981-97-7 [RN]

4H-(1,2,4)Triazolo(4,3-a)(1,4)benzodiazepine, 8-chloro-1-methyl-6-phenyl-

4H-[1,2,4]Triazolo[4,3-a][1,4]benzodiazepine, 8-chloro-1-methyl-6-phenyl-

Alplax

Alprazolam (JP15/USP)

Alprazolam [USAN:BAN:INN:JAN]

Alprazolamum [INN-Latin]

Since there are so many levels of complexity associated with identifiers we have added new tools to allow our curators to label the names with appropriate labels. The “present” list is shown below. The language tab lists a whole series of languages. One more effort to expand our curation…
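Two of the alprazolam names listed above differ only in bracket style. A crude lookup key can collapse such variants for searching. The normalization below is a hypothetical convenience for lookup only, not a nomenclature rule and not how ChemSpider does it:

```python
# Map square and curly brackets onto round ones (for lookup purposes only).
_BRACKETS = str.maketrans("[]{}", "()()")

def lookup_key(name):
    """Build a case-insensitive, bracket-insensitive search key.

    Zero-width spaces (U+200B), which can ride along with names copied
    from web pages, are dropped as well.
    """
    return name.replace("\u200b", "").translate(_BRACKETS).lower()

round_style = "4H-(1,2,4)Triazolo(4,3-a)(1,4)benzodiazepine, 8-chloro-1-methyl-6-phenyl-"
square_style = "4H-[1,2,4]Triazolo[4,3-a][1,4]benzodiazepine, 8-chloro-1-methyl-6-phenyl-"

same_key = lookup_key(round_style) == lookup_key(square_style)
print(same_key)
```

The original strings and their labels would still be stored untouched; only the search key is normalized.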

I refer you back to the original post from which this comment was made as it is taken from a specific context.

“There is no “right structure (sic)” for a compound. There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.”

Is this a true statement? In many cases I would agree, but I have my own opinion in specific cases, so let’s focus on the drug industry and trade names for a moment. First, let’s talk about me..and my identifiers. Depending on who’s talking about me I am Tony, Antony, Dr Williams, Mr Williams, Dad, sweetheart, son, Tone, AJ, Bro’ and so on. However I am registered with a social security number and exist as a legal entity, a “registered” entity.

Now, Zantac is a registered trade name for the chemical here. I am not an expert in the registration process but I believe that somewhere along the line a defined chemical entity is associated with that name. Whether the chemical entity has been appropriately elucidated by analytical technologies or not is a different question. What is registered as a compound, and associated with the name, is what that name defines.

Now, there are a whole series of other names for the same compound – registry numbers, systematic names, organization numbers. See below:

Ranitidine [Wiki]

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethen-1,1-diamin

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethene-1,1-diamine

1,1-Ethenediamine, N-[2-[[[5-[(dimethylamino)methyl]-2-furanyl]methyl]thio]ethyl]-N’-methyl-2-nitro-, (Z)-

128345-62-0 [RN]

266-332-5 [EINECS]

66357-59-3 [RN]

Azantac

GR 122311X

Melfax

N-[2-[[[-5-[(Dimethylamino)methyl]-2-furanyl]methyl]thio]ethyl]-N’-methyl-2-nitro-1,1-ethenediamine

Noctone

Raniben

Ranidil

Raniplex

Ranitidine Base

Sostril

Taural

Terposen

Trigger

Ulcex

Ultidine

ZANTAC [Wiki]

Zantic

I think that the Trade Name for a compound is definitive since it’s registered. Relative to the statement “There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.”…my question is whether a Registered Trade Name is absolute? I’m asking the question since I’m actually not sure. Thoughts anyone?

I was pinged this weekend by Zsolt Zsoldos of the SimBioSys Blog about us having duplicates of certain amino acids on ChemSpider. He commented that there were a series of structures showing up for a search based on identifier:

Aspartic acid: 411 and 5745
Arginine: 227, 6082, 64224, 1266045
Histidine: 752, 6038, 64237, 4450698

Welcome to our world! So, let’s start with aspartic acid. What IS the structure of aspartic acid? Is it the one on Wikipedia here? The one labeled with the S stereochemistry but showing no stereobonds (will be resolved in the curation process of Wikipedia structures!). Is aspartic acid the deprotonated version? Well, it depends on who you ask and also who is depositing on our system.

The same is true for arginine where there is a non-stereospecific isomer, a D-isomer, a L-isomer and a charged form. Similarly for Histidine. All are appropriate.
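Seen from the search side, the situation is simply a one-to-many mapping from a name to structure records. A minimal sketch using the record IDs Zsolt reported; the helper is hypothetical, and which ID corresponds to which form is deliberately not encoded because the text does not say:

```python
# Record IDs returned for each name search, as reported by Zsolt above.
search_hits = {
    "aspartic acid": [411, 5745],
    "arginine": [227, 6082, 64224, 1266045],
    "histidine": [752, 6038, 64237, 4450698],
}

def is_ambiguous(name):
    """True when a name maps to more than one structure record."""
    return len(search_hits.get(name.lower(), [])) > 1

counts = {name: len(ids) for name, ids in search_hits.items()}
print(counts)
print(is_ambiguous("Arginine"))
```

Whether those hits are “duplicates” or legitimate distinct forms (unspecified stereo, D, L, charged) is exactly what the name alone cannot tell you.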

Why is Zsolt interested in this? Because of a conversation he is engaged in…

For the docking community there is a very valuable resource out there called ZINC. They also have a very interesting email discussion list called Zinc-Fans. Recently a post initiated a discussion about protonation states and asking the question:

“In a certain docking protocol, my concern is primarily of the protonation states of the ligands in the library (subsets with different pH ranges) downloaded from ZINC, as I have recently read an article on “The influence of protonation in protein-ligand docking”

http://www.journal.chemistrycentral.com/content/2/S1/P12

Considering an enzyme that is reported to be optimum at a pH of 7.6-8.0, which we intend to find inhibitors for, which subset of ZINC compounds do I choose for docking against my target of interest?”

Zsolt came searching ChemSpider for the amino acid structures and found the complexities in terms of charged/stereo forms. But his posting regarding the “Correct Protonation State for Docking” was an education in itself. If you are engaged in docking experiments at all this is likely a must read. For the rest of us neophytes it’s an education!
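For a feel of why pH matters here, the Henderson-Hasselbalch relation gives the fraction of an ionizable group that is protonated at a given pH. This is a textbook sketch, not the method from Zsolt's post; the pKa of 6.0 is an assumed, approximate value for an imidazole side chain such as histidine's, and real values shift inside a binding site.

```python
def fraction_protonated(ph, pka):
    """Fraction of a group in its protonated form (Henderson-Hasselbalch)."""
    return 1.0 / (1.0 + 10.0 ** (ph - pka))

# Assumed, approximate pKa for an imidazole side chain (illustrative only).
ASSUMED_PKA = 6.0

# The pH range quoted in the Zinc-Fans question.
for ph in (7.6, 8.0):
    print(ph, round(fraction_protonated(ph, ASSUMED_PKA), 3))
```

Under this assumption only a few percent of the group is protonated across that pH range, which is why a single “correct” charged form in a database is not enough for docking.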

I had previously posed the question “How many chemical names are contained in the short paragraph below”? Well, I have highlighted the “chemicals” contained in this paragraph. Click on the link to see what’s what.

“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”

Ok..so you saw Aspirin immediately, right? Maybe you could have guessed that Advantage and Commando would be drugs? Some of you might have spotted “he” (helium) and “in” (Indium). But did you expect “of” and “the”?

What was this all challenge all about? It explains the need to do a good job in identifying chemical names when hunting for them in articles. With a dictionary of millions of systematic names, trade names, synonyms and database IDs even the most general text is full of chemicals. So, the application of a dictionary of chemical names must be done very carefully. And, the point is that matching the dictionary of names within ChemSpider at present to text contained within scientific articles will fail without the direct identification of chemical names OR identifying trade names etc within an appropriate context.
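To make the point concrete, here is a minimal sketch of naive dictionary matching. Everything in it is an assumption for illustration: the tiny dictionary only holds the words named in this post, the stop list is my own guess at what “appropriate context” filtering might start from, and the sentence is an abridged version of the paragraph above. It is not the code we run on ChemSpider.

```python
import re

# Hypothetical mini-dictionary: only the words this post names as hits.
# A real dictionary holds millions of names, synonyms and IDs.
NAME_DICTIONARY = {
    "aspirin", "advantage", "commando", "he", "in", "of", "the",
}

# Assumed stop list: everyday words that should only count as
# chemicals when the surrounding context supports it.
COMMON_WORDS = {"he", "in", "of", "the", "advantage", "commando"}


def match_names(text, dictionary, stop_words=frozenset()):
    """Return the sorted dictionary entries found as whole words in text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted((words & dictionary) - stop_words)


# Abridged from the challenge paragraph.
sentence = (
    "She took the stance of a commando and took advantage of his shock. "
    "He went home and took an aspirin after the beating."
)

# Naive matching reports everyday words as chemicals.
print(match_names(sentence, NAME_DICTIONARY))
# Filtering common words leaves only the name a chemist would accept.
print(match_names(sentence, NAME_DICTIONARY, COMMON_WORDS))
```

A stop list is the crudest possible fix, of course. It throws away “he” even in a paper that really is about helium, which is why context matters.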

There are WAY more complexities than this though. A group at Cambridge has been working on SciBorg since 2005. The project description page outlines the project:

“SciBorg is a four-year project, starting October 1 2005, funded by the EPSRC under the programme for Computer Science for e-Science. The project is a collaboration between three groups at the University of Cambridge:

We are cooperating with three major publishers:

The project summary and objectives are below. For further information, please see the detailed project description , which was based on the project proposal, and the developing SciBorg project wiki pages.”

I have been following the project for a while and am getting much more interested in it right now. It makes for great reading about the challenges of text mining data. Peter Murray-Rust has made a couple of blog posts (1,2) over the weekend relative to the challenges of text mining and I reference you there for a good overview of some of the challenges. They are significant but there are ways to deal with some of the issues.

I’ll blog more about text-mining and names in the next few weeks…

Community curation is cleaning up ChemSpider very efficiently. Especially in regards to identifiers. And we are getting educated too.

Today one of the community curators asked a question for the following record.

He stated “Thumbnail. Structure appears twice. Probably not dimer.” One email later to the depositor and the depositor returns with (and I excerpt):

“This material is commercialised as a mixture. The product is produced as a mixture of isomers, thus two (related) structures are appropriate

I’ve added some data as well, and Wiswesser Line Notations.

The Product name Assert is sold as a 300 g/l formulation, thus Assert 300. It is the same for Dagger and Dagger G where the G just stands for granules, ie it is a granular formulation. ”

Well, who knew? Well, now everyone knows…it’s on the record for all to see. Simply scroll to the bottom of the record and look at the curation record feedback. Click on the thumbnail here.

In a post in March of this year Peter Murray-Rust discussed the Issue of CAS Numbers. I believe the outcome of that post, especially as a result of the insightful comments of Steven Bachrach, was that CAS numbers have their place and provide significant value to the community. Since then I have posted on how confused we are about CAS Numbers. Hopefully these discussions are cleaning things up?

Now onto nomenclature, names, synonyms and identifiers OTHER than CAS Numbers. One of the questions Peter asked in his blogpost was “What is the structure of “snow”? This depends on an authority and cannot be answered without also quoting them.”

I think most of us will think of Ice and Snow as forms of water, so the answer to the question Peter poses might be some statement around ice-like water. However, ice on ChemSpider is the structure shown here while snow is the structure shown here. Both are street names for drugs.

How common is this situation where “common everyday words” are labels for chemical compounds? Well, let’s see. This is not a trick question! In the short paragraph below a number of chemicals are mentioned. How many? The closest guess will get a “ChemSpider Kudos” (which is just bragging rights). Why is this so important? That will come later….

How many chemicals are mentioned in this paragraph?

“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”

Post your best guess and WHAT words are chemical names!

In a recent post about ChemSpider we’ve been accused of wanting a Free Lunch. I copy a segment of the post and comment with insertions.

“Data are normally produced for a particular purpose and to reuse them for another costs money. I’ll exemplify this by taking CrystalEye data – about 120,000 crystal structures and 1 million molecular fragments – which were aggregated, transformed and validated by Nick Day as part of his thesis. (BTW Nick is writing up – it’s a tribute to his work that CrystalEye runs without attention for months on end).

AJW> It is true…it is a tribute to Nick that CrystalEye can run for months without attention. Kudos. I am interested in how much pressure the site is under. How many searches/users in a day etc.? We find that our struggles in uptime (and these are negligible) are primarily based on stress on the servers. For nighttime users tonight things will have been slow…we deposited over 100,000 new molecules from 5 new data sources. That does create some slowness. We will hit about 40,000 transactions today. Our problems are ISP issues and powercuts. But we are also not in a University using thick pipes etc.

One comment…it was 130,000 structures according to a previous blog and has been expanding since then from daily depositions. Right now I would expect it to be 140,000 rather than 120,000. When we did try scraping the data our best estimate was about 90,000. We might have missed something in our scraping and it’s why we asked for a dump of the data.

The primary purpose of CrystalEye was to allow Nick to test the validity of QM calculations in high-throughput mode. It turned out that the collection might be useful so we have posted it as Open Data. To add to its value we have made it browsable by journal and article, searchable by cell dimensions, searchable by chemical substructure and searchable by bond-length. This is a fair range of what the casual visitor might wish to have available. Andrew Walkingshaw has transformed it into RDF and built a SPARQL endpoint with the help of Talis. It has a Jmol applet and 2D diagrams, and links back to the papers. So there is a lot of functionality associated with it.

AJW> The team has done a good job in putting the site together. The JMol applet is an excellent utility for us all to use and thanks to that team for sure! Egon has been challenging us to RDF the site and it’s on our list, but keeps getting pushed down based on other requests. Since he’s the only voice asking it will keep getting pushed down unfortunately.

This has come under some criticism to the effect that we haven’t really made it Openly available. For example Antony Williams (Chemspider blog) writes (Acting as a Community Member to Help Open Access Authors and Publishers):

“This [interaction with MDPI] is contrary to some of my experiences with some other advocates of Open Data and Open Access where trying to get their “Open Data” is like pulling teeth.”

PMR: I assume this relates to CrystalEye – I don’t know of any other case.

AJW> There are other examples and he’s right. He doesn’t know of them and I’d prefer he not rant on my behalf so I’ll not name them.

Antony and I have had several discussions about CrystalEye – basically he would like to import it into his database (which is completely acceptable) but it’s not in the format he wants (multi-entry files in MDL’s SDF format, whereas CrystalEye is in CML and RDF).

AJW> To clarify, again. I DON’T want to import CrystalEye into ChemSpider. I DON’T! All I want is the set of structures and unique associated URLs so that users of ChemSpider can find that there is crystal structure information over on CrystalEye and can click the link and be on CrystalEye and get the benefit of Nick, Andrew and Peter’s work. I don’t want to reproduce their effort. I want to integrate to it. I’ve said it many times on Peter’s blog and on this one.

This type of problem arises everywhere in the data world. For example the problem of converting between map coordinates (especially in 3D) can be enormous. As Rich says, it costs money. There is generally no escape from the cost, but certain approaches such as using standards such as XML and RDF can dramatically lower the costs. Nevertheless there is a cost. Jim Downing made this investment by creating an Atom feed mechanism so that CrystalEye could be systematically downloaded but I don’t think Chemspider has used this.

AJW> If Jim can contact me by email and provide me with detailed instructions to download the entire file of structures ONLY and their associated URLs that would be excellent. I’ll send the request to him tonight.

The real point is that Chemspider wishes to use the data for a different purpose from which it was intended.

AJW> The problem is that stories keep getting made up about what we want. ALL I want to do is drive traffic to CrystalEye so that people who don’t know about it can use it. No more than that. I don’t get how trying to provide an integration path is so difficult. I’ll ask Jim to help.

That’s fine. But as Rich says it costs money. It’s unrealistic to expect we should carry out the conversion for a commercial company for free. We’d be happy to [consider] a mutually acceptable business proposition and it could probably be done by hiring a summer student.

AJW> I am interested in what commercial benefit integrating to CrystalEye can have. It’s work on our side. I’m not sure what a mutually acceptable business proposition would look like. It can’t be that much work to send us a set of InChIStrings and URLs for the CrystalEye dataset..they already exist on CrystalEye. So, I’ll assume that this is a last comment on “No thanks to CrystalEye data in ChemSpider”. I have to ask why not put them in PubChem. Since PubChem is held as the standard of OpenData why not put CrystalEye there?

I continue to stress that CrystalEye is completely Open. If you want it enough and can make the investment then all the mechanisms are available. There’s a downloader and converters and they are all Open (though it may cost money to integrate them).

AJW> Just fyi ChemSpider has adopted Creative Commons licenses.

FWIW we are continuing to explore the ways in which CrystalEye is made available. We’re being funded by Microsoft as part of the OREChem project and the result of this could represent some of the way in which the Web technology is influencing scientific disciplines. We’d recommend that those interested in mashups and re-use in chemistry took a close look at RDF/SPARQL/CML/ORE as those are going to be standard in other fields.

AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.

I have spoken previously about the challenges of Scraping CrystalEye Content and staying in relationship with publishers. I have approached CAS and spoken with the Copyright team at ACS. In December of last year I spoke about the 5 month delay in discussing with ACS whether or not we could scrape CIF files from ACS journals directly. Well, I had a nice chat with two ACS people in New Orleans, one of them from ACS Pubs. We talked about ChemSpider and I answered a lot of questions about what we were doing, where we were going, how we are “funded” (we are not!) etc. Many pages of notes were taken. At the end of the meeting I asked the question “So, relative to my question about CrystalEye and scraping CIFs. Are Supplementary Data ok to scrape or not?”

The answer? “We haven’t made a decision yet. We need to discuss”.

Are crystal structures really that special? It’s been difficult to get JUST the structures associated with even Open Data. Now I’ve been waiting over 7 months for a question to be answered by ACS…and it’s binary. YES or NO.

At this point I give up. Peter Murray-Rust has had ACS CIFs scraped from their publications for a LONG time. And continues to scrape them. Cambridge University/Unilever School of Informatics didn’t get permission and have been very vocal about what they’ve done and no legal action re. copyright has been taken so I’ll assume it’s not an issue. If it’s not an issue then we can go ahead.

If we can go ahead then why wouldn’t we? We have…we already have scraped the collection of CIFs from ACS, from a broader range of ACS journals than CrystalEye taps into. It’s Supplementary Data, it’s non-copyrightable and now it’s ours to publish. We already support CIF displays on ChemSpider so what we need to do now is to mass convert/handle the data and deposit onto ChemSpider. We also have the IUCR CIFs to deposit. I guess ChemSpider will soon become “CrystalEye 2” as we host the data. That said we are NOT crystallographers so I have an open request to the community for someone with interest/skills in crystallography to join our advisory group and support this effort. Feel free to ping me.

Over the past year ChemSpider has been challenged over the nature of our offering in terms of Open Data etc. A small number of people focused a lot of time talking about this while we remained focused on improving the website and having it available for people to use as a Free Access website. I spoke to Peter Suber about Open Access and then John Wilbanks about Creative Commons.

Since ChemSpider is the aggregate of a number of people’s work (including provision of software by collaborators) I had to get into conversation to see what licenses would be acceptable to those groups.

With the redesign of the website we have structured ourselves in a way to add licenses as we see appropriate now. So, as of today we have added the Creative Commons Attribution Share Alike 3.0 United States License and the appropriate logo is on all sections of a Record View except for the predicted properties. Once we get approval from our collaborators for this same license (and discussions are underway) then the whole record view will be Licensed.

At that point, you are free:

  • to Remix — to make derivative works

Under the following conditions:

  • Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
  • Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same, similar or a compatible license.

I am in catch up mode tonight reading a week long backlog of blog posts. I’ve caught up tonight with some of Peter’s posts about semantic chemical authoring (1,2). I’ll respond shortly with comments regarding our own efforts in pulling together the web. I agree with Peter that improved semantic chemical authoring tools are necessary but we are focused right now on doing what we can with what is already available online. What it takes is coding, some regular expressions, some visual inspection and work. Lots of work. More later…

One of the things we are working on is connecting blog posts and wiki pages to ChemSpider as evidenced by our work with Molecule of the Day and our integrations to TotallySynthetic posts on an ongoing basis. What we expect of the authors though is that they author with care. We are generally using name to structure conversion capabilities to generate the chemical structures for connecting to on ChemSpider. Paul Doherty at TotallySynthetic used to provide us with InChIStrings and InChIKeys to connect up to but stopped because it was a lot of work I believe. Molecule of the Day generally discusses fairly simple molecules relative to TotallySynthetic’s COMPLEX molecules. Manual inspection is unfortunately necessary even in the simplest of cases. And it IS time-consuming. Robots will gather information and, in my judgment, PROLIFERATE incorrect data unless someone is going to do the work to inspect OR the system provides a curation platform to quickly remove errors.

I blogged tonight on the ChemConnector blog about the importance of dashes and spaces in systematic names. It should be very clear from that post how important it is. It is a major challenge to use name to structure conversion tools on chemical names that are imperfect and do not represent the structure they are meant to represent. There needs to be respect for chemical names and as we move them from system to system, database to database we need to do our best to retain their integrity. This HAS BEEN a major challenge for us as we scrape data from various data sources OR when people provide us data files such as Wikipedia and we need to check name-structure connections. It is not difficult to lose the integrity of a chemical name.

Back to Peter Murray-Rust’s discussions about semantic chemical authoring. Peter is talking about building a site of aggregated information from various websites.

PMR> “We’re in the process of aggregating a repository of common chemicals (somewhere in the range 1000-10000 entries) and we are taking data from various publicly available web sites. Typical sources are Wikipedia, any aggregator with Open Data policies and MSDS sheets (chemical safety information). One such site is INCHEM (Chemical Safety Information from Intergovernmental Organizations) which lists about 1500 materials (most are chemical compounds though some are mixtures).”

Readers of this blog will know we’ve already done this. Both for NIOSH and for the Oxford MSDS set. We took a select subset of information. We integrated this with our Wikipedia set of data on ChemSpider (and, of course, also on WiChempedia).

PMR> “…From this we extract the most important information and turn it into CML – names, formula, connection tables, properties, etc.”

Our process was extraction of the same (but there aren’t any connection tables to grab from NIOSH or Oxford MSDS) then we converted names to structures and ran some “confirmation processes” including visual inspection when necessary.

PMR> “There are a large number of simple but niggly lexical problems, such as the degrees symbol for temperature (totally inconsistent within and between documents) And the semantics – how do you record a boiling point as “between 120 and 130 at 20 mm Hg”? (CML can do this, but it takes work to do the conversion.)”

Oh yes…these are problems. The inconsistencies between records are a pain but can be dealt with by mapping as shown here. Recording a boiling point “between 120 and 130 at 20 mm Hg” is no issue really. See this figure for something just as complex regarding “loss of waters”.
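For readers wondering what “mapping” such a value might look like, here is a minimal sketch that turns the free-text boiling point into a structured record. The `BoilingPoint` fields and the regular expression are my own assumptions for illustration. They are not the ChemSpider schema and not how CML represents it, and the temperature unit is deliberately left out because the quoted text does not state one.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoilingPoint:
    low: float                       # lower bound of the reported range
    high: float                      # upper bound (equals low for a single value)
    pressure_mm_hg: Optional[float]  # None when no pressure is stated


# Accepts "between 120 and 130 at 20 mm Hg", "120-130" or a single value.
_BP_PATTERN = re.compile(
    r"(?:between\s+)?(?P<low>-?\d+(?:\.\d+)?)"
    r"(?:\s*(?:and|to|-)\s*(?P<high>-?\d+(?:\.\d+)?))?"
    r"(?:\s*at\s*(?P<pressure>\d+(?:\.\d+)?)\s*mm\s*Hg)?",
    re.IGNORECASE,
)


def parse_boiling_point(text):
    """Parse a free-text boiling point; return None if no number is found."""
    match = _BP_PATTERN.search(text)
    if match is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    pressure = match.group("pressure")
    return BoilingPoint(low, high, float(pressure) if pressure else None)


print(parse_boiling_point("between 120 and 130 at 20 mm Hg"))
```

The pattern only covers the phrasings listed in its comment. Real records need many more variants, which is where the niggly work comes in.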

PMR> “And the sites have errors. Here’s a rather subtle one which the average human would miss (we needed a machine to find it). You’ll have to go to the page for chloromethylmethylether – I daren’t try to transcribe it into WordPress. The error is in the displayed page (no need to scroll down).”

There are a couple of issues here. We actually prefer NOT to use either the molecular formula or the molecular weight. In our Wikipedia work we found a lot of errors around these parameters and for the Wikipedia work at least the name, SMILES, InChI etc were more correct while MFs and MW would be wrong.

There may absolutely be value in using both MF and MW to confirm the structure and I definitely see the value. This would definitely help resolve some of the Nomenclature-Structure issues that can arise from converting the names! One of the things that occurred in the blog post was that my earlier comments came to pass regarding removal of a space in the chemical name.

The names on the ORIGINAL INCHEM page were:
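As a rough illustration of the MF/MW cross-check, the sketch below computes a molecular weight from a simple formula and compares it with a reported value. The function names, the tolerance and the rounded atomic weights are my own choices, and the parser only handles plain formulas without brackets, charges or isotopes. The example formula C2H5ClO is that of chloromethyl methyl ether.

```python
import re

# Rounded standard atomic weights for the few elements used here.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "Cl": 35.45}


def molecular_weight(formula):
    """Sum atomic weights for a simple formula such as 'C2H5ClO'.

    Raises KeyError for elements missing from ATOMIC_WEIGHTS.
    """
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total


def weight_matches(formula, reported_mw, tolerance=0.05):
    """True when the reported MW agrees with the formula within tolerance."""
    return abs(molecular_weight(formula) - reported_mw) <= tolerance


# A record whose formula and reported weight agree passes the check.
print(weight_matches("C2H5ClO", 80.51))
# A structure with a different formula is flagged for inspection.
print(weight_matches("C2H6O", 80.51))
```

A check like this only says that something disagrees. It cannot say whether the name, the formula or the weight is the one in error, so a person still has to look.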

CHLOROMETHYL METHYL ETHER, Chloromethoxymethane with a CAS Number of 107-30-2 and an EINECS number of 203-480-1.

There was NOT a name listed as “chloromethylmethylether” which PMR listed in his post. The only difference is dropping one space. It’s only an accidental removal but dramatically changes the meaning of the record. This is where Peter’s use of either MF or MW becomes crucial! That loss of a space CAN cause big problems as described here. Does it cause a problem this time? Check below…look at the name with and without the space and the result of conversion in a commercial Name to Structure software package.

The CORRECT structure is on ChemSpider here and already includes the following Supplemental Information.

User Data

  • experimental physchem properties
    • Boiling Point: 138F

    • Freezing Point: -154F

    • Specific Gravity: 1.06

    • Solubility: Reacts

    • Ionization Potential: 10.25 eV

  • miscellaneous
    • Appearance: Colorless liquid with an irritating odor.

    • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

    • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

    • Symptoms: Irritation eyes, skin, mucous membrane; pulmonary edema, pulmonary congestion, pneumonitis; skin burns, necrosis; cough, wheezing, pulmonary congestion; blood stained-sputum; weight loss; bronchial secretions; [potential occupational carcinogen]

    • Target Organs: Eyes, skin, respiratory system Cancer Site [in animals: skin & lung cancer]

    • Incompatibilities and Reactivities: Water [Note: Reacts with water to form hydrochloric acid & formaldehyde.]

    • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated/Daily Remove: When wet (flammable) Change: Daily Provide: Eyewash, Quick drench

    • Exposure Limits: NIOSH REL : Ca See Appendix A OSHA PEL : [1910.1006] See Appendix B

Peter IS right. We DO Need Semantic Chemical Authoring Tools. However, we’ve already gone a long way without them and what is already online CAN be dealt with. Incredible care is needed with nomenclature and just spaces can mess things up! I know we have errors in our database – both structures and names. What is to be expected with 20 million structures and associated data? However, we are cleaning them up, rather quickly. We are scraping and integrating data at an increasing rate having learned a lot of lessons over the past year.

I’ll comment on Peter’s other Semantic Chemical Authoring posts in the next couple of days.

I’ve blogged previously about us adding safety and toxicity data to ChemSpider. We are busily sourcing new information from other data sources to add information and in the past couple of days we have added NIOSH data as it is a rich source of additional safety information. For example, the record for 1,2,3-trichloropropane shows:

  • First Aid: Eye: Irrigate immediately Skin: Soap wash Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, nose, throat; central nervous system depression; in animals: liver, kidney injury; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, central nervous system, liver, kidneys Cancer Site [in animals: forestomach, liver & mammary gland cancer]

  • Incompatibilities and Reactivities: Chemically-active metals, strong caustics & oxidizers

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet or contaminated Change: No recommendation Provide: Eyewash, Quick drench

Some additional examples are here: Temefos, Warfarin and Allyl Alcohol. Note that each of these also has a coincident extract from Wikipedia. We are therefore integrating Wikipedia articles, safety, toxicity, experimental and predicted properties. Our plan for semanticising and integrating the chemistry web is clearly well underway.

A couple of days ago I blogged about building the first dedicated website for Molecule of the Day. To continue our “proof of concept” demonstrations in this vein we now unveil our first support of a free-access publisher. Molbank is defined to be an Open Access journal on Wikipedia but based on some of the conversations I have seen on Murray-Rust’s blog this is in question. As I have expressed previously I hope to stay in relationship with publishers as we navigate our way through building our structure centric community for chemists. I have exchanged numerous emails with the editorial team at Molbank and have found them very supportive of our integration so away we went.

The data was scraped from the Molbank website, specifically the titles, authors, URL link to the article and the molfile itself. A couple of scripts later and an SDF was constructed from the molfiles and the text. This SDF file was then opened and reviewed visually to remove “errors in the data”. There were a number of different types of errors and some examples are listed below. For example:

http://www.mdpi.org/molbank/molbank2007/m558.htm includes HA and HB annotations

http://www.mdpi.org/molbank/molbank2007/m555.htm includes R groups – should be expanded

http://www.mdpi.org/molbank/molbank2005/m407.htm the mol file is for CH2=CH2

http://www.mdpi.org/molbank/molbank2005/m409.htm the mol file is for ethane

There are other examples and Rich Apodaca has made a number of similar observations previously.
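The scripts themselves are nothing special. As a sketch only (this is not the code we ran), the snippet below joins a molfile and its scraped text fields into an SDF record and flags suspiciously small structures, such as an ethane molfile attached to an article, for visual review. It assumes V2000 molfiles, the field names and the `min_atoms` threshold are arbitrary choices of mine, and the title is a placeholder.

```python
def atom_count(molfile):
    """Read the atom count from the counts line (4th line) of a V2000 molfile."""
    counts_line = molfile.splitlines()[3]
    return int(counts_line[0:3])


def sdf_record(molfile, fields):
    """Join one molfile and its data fields into a single SDF record."""
    lines = [molfile.rstrip("\n")]
    for name, value in fields.items():
        # Each data item is a header line, the value, then a blank line.
        lines += [f">  <{name}>", str(value), ""]
    lines.append("$$$$")  # record terminator
    return "\n".join(lines) + "\n"


def needs_review(molfile, min_atoms=5):
    """Flag structures too small to plausibly be the compound in an article."""
    return atom_count(molfile) < min_atoms


# Hand-written ethane molfile used purely as test input.
ETHANE_MOLFILE = "\n".join([
    "",                        # header line 1: molecule name (blank here)
    "  hypothetical-example",  # header line 2: program/timestamp
    "",                        # header line 3: comment
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5400    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0  0  0  0",
    "M  END",
])

record = sdf_record(ETHANE_MOLFILE, {
    "Title": "placeholder title",
    "URL": "http://www.mdpi.org/molbank/molbank2005/m409.htm",
})
print(record)
print("needs review:", needs_review(ETHANE_MOLFILE))
```

A size check like this would catch the ethane and CH2=CH2 cases but not the HA/HB annotations or unexpanded R groups, so it narrows the visual review rather than replacing it.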

Our belief is that we have created from this dataset a high quality, curated (but likely not perfect) dataset as a subset at molbank.chemspider.com.  The structures show names, identifiers, supplementary info where appropriate and a link to the original article. An example is shown below for the linkages.

molbank21.png

Notice the Link to the article from the data sources, from the supplementary info and the miscellaneous safety and tox data scraped from MSDS sheets online. We will now keep this dataset updated as Molbank expands. With the permission of the editorial staff we would be interested in extracting the analytical data also.

Our proofs of concept have shown that we can host different datasets on ChemSpider and we urge anybody interested in such a service to approach us for discussions.

Over the past year we have been interested in our website statistics and our growing traffic. I have blogged previously about Alexa and was challenged to review the Compete statistics. After growing in rankings for a few weeks we removed the Alexa widget and saw our rankings plummet. We then installed the Compete widget and saw ourselves go up the rankings quite dramatically before removing that widget and seeing our rankings decrease. Meanwhile, our own website statistics have shown consistent month to month growth with an average of about 4000 unique users per day at present (as shown in the figure below).

stats.png

Bottom line, based on our observations, neither Alexa nor Compete give anywhere near valid statistics. At the SBS conference in St Louis this past week I asked the audience, about 60 people, how many in the room knew of or had heard of ChemSpider. ONE hand went up…and that was someone I had informed many months earlier. As I expressed to the audience…this was not disappointing news to me…it was quite exciting to know what the potential growth is as people are informed of the service. I expect the growth to continue, especially after the visits to the ACS and the SBS.

Frequent users to ChemSpider who use the identifiers for searching will commonly find a mixture of “names” and “database IDs” as well as “registry numbers”. Since the number of database IDs can sometimes swamp the synonyms and chemical names we chose to separate them. We have run some regular expressions across the database to separate database IDs out. We have left registry numbers (marked by [RN]), EINECS numbers (marked by [EINECS]) and Wiswesser Line Notation (marked by [WLN]) in with the synonyms.

database-ids.png

Unfortunately there are MANY flavors of Database IDs and we might have missed some. If you come across any “potential” DB IDs and think we should segregate them out please use the POST COMMENTS ability to inform us. Simply Post a Comment to a record and suggest we check the identifiers out for potential DBIDs. Thanks!
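To give a feel for the kind of regular expressions involved, here is a minimal sketch of an identifier classifier. The patterns are assumptions for illustration and not the expressions we actually ran. In particular `DBID_PATTERN` is a made-up guess at what a “database ID” looks like, and `ABC-0012345` is an invented example. The CAS check is the standard check-digit rule: weight the digits from right to left by 1, 2, 3… and compare the sum modulo 10 with the final digit.

```python
import re

CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")
# Format only; the EINECS check digit is not verified in this sketch.
EINECS_PATTERN = re.compile(r"^\d{3}-\d{3}-\d$")
# Hypothetical: an uppercase prefix, optional separator, then digits.
DBID_PATTERN = re.compile(r"^[A-Z]{2,10}[-_ ]?\d{3,}$")


def is_valid_cas(identifier):
    """Check the CAS Registry Number format and its check digit."""
    match = CAS_PATTERN.match(identifier)
    if not match:
        return False
    digits = match.group(1) + match.group(2)
    checksum = sum(
        int(digit) * weight
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return checksum % 10 == int(match.group(3))


def classify(identifier):
    """Sort an identifier into one of four coarse buckets."""
    if is_valid_cas(identifier):
        return "registry number"
    if EINECS_PATTERN.match(identifier):
        return "EINECS number"
    if DBID_PATTERN.match(identifier):
        return "database ID"
    return "name or synonym"


for identifier in ["107-30-2", "203-480-1", "ABC-0012345", "Chloromethoxymethane"]:
    print(identifier, "->", classify(identifier))
```

Anything the patterns miss falls through to “name or synonym”, which is exactly how stray DB IDs end up among the synonyms.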

Recently I posted about trying to identify the correct structure of Ginkgolide B and the need for curation of ChemSpider entries. David Barden from the RSC commented on my post:

“Antony – I am an organic chemist working on the RSC journal in which the published structure of ginkgolide B appeared, and am pretty sure that it is correct, having been written by a regular author of ours familiar with the literature on the ginkgolides. I think the problem might lie with the representation (and/or conversion to InChI) of the structures – even in the one structure you indicated as having “full stereochemistry”, it seemed to me that 3 stereocenters were undefined, from a visual inspection of the structure. Apart from these stereocenters, the structure and InChI (generated myself) otherwise seem identical, so I’m not sure why the last part of the string in the ChemSpider entry is “20+” rather than “20-”. The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.”

I have redrawn the structure of Ginkgolide to echo that shown in the RSC journal and it is shown below alongside a cropped image from the article:

compare-the-two.png

I’m pretty sure I have the structure correct. The InChIString is:

InChI=1/C20H24O10/c1-6-12(23)28-11-9(21)18-8-5-7(16(2,3)4)17(18)10(22)13(24)29-15(17)30-20(18,14(25)27-8)19(6,11)26/h6-11,15,21-22,26H,5H2,1-4H3/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1

and the InChIKey is:

SQOJOAFXDQDRGF-MMQTXUMRBS

In the previous post I searched on Ginkgolide B as an identifier to see how many Ginkgolide B’s there are. There are 6 as shown here.

I searched on the entire InChIkey and found no hits. This means that the structure is NOT on ChemSpider.

I then searched on the CONNECTIONs captured within the InChIKey and represented by: SQOJOAFXDQDRGF . I received 18 hits in total varying in completeness in terms of incomplete stereochemistry and DIFFERENT but fully assigned stereochemistry. I searched the entire InChIKey on Google (SQOJOAFXDQDRGF-MMQTXUMRBS) but received no hits. Just to check I then searched the InChIString shown above on Google. Surprisingly, I DID get a hit! It was for this structure. I was puzzled and a comparison of the strings showed a difference in ONE section of the string, the stereo layer.

Searched on Google: /t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1

Found by Google: /t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+/m1/s1

See the difference? ONE stereocenter… 20- versus 20+ . Thank goodness we are moving to InChIKeys rather than InChIStrings since the majority of people would likely miss the detail. I did the first time! So, based on all of my searches the structure of Ginkgolide B as represented in the article published by the RSC is NOT in the ChemSpider database. I agree with David Barden when he comments “The difficulty of visually comparing structures from different sources (rotation, reflection, etc), especially for complex molecules like this, would make the task of validation much more difficult.” It is very complex and time-consuming and the hope is that comparison of InChIKeys, specifically the second part of the key, will help catch the differences in a more facile manner.
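Rather than comparing these strings by eye, the comparison can be scripted. The sketch below is only an illustration, with function names of my own. It parses the /t stereo layer of two InChI strings (or fragments, as here) and reports the centres that differ, and it pulls out the connectivity block of an InChIKey for “skeleton” comparisons. It assumes a single-component structure. Multi-component stereo layers use extra separators that this does not handle.

```python
def stereo_layer(inchi):
    """Return {atom_number: sign} parsed from the /t layer of an InChI string."""
    for layer in inchi.split("/"):
        if layer.startswith("t"):
            return {int(c[:-1]): c[-1] for c in layer[1:].split(",")}
    return {}  # no stereo layer present


def stereo_differences(inchi_a, inchi_b):
    """Map each stereocentre that differs to its (sign_in_a, sign_in_b)."""
    a, b = stereo_layer(inchi_a), stereo_layer(inchi_b)
    return {
        n: (a.get(n), b.get(n))
        for n in sorted(a.keys() | b.keys())
        if a.get(n) != b.get(n)
    }


def connectivity_block(inchikey):
    """Return the first block of an InChIKey (the part before the first dash)."""
    return inchikey.split("-")[0]


searched = "/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-/m1/s1"
found = "/t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+/m1/s1"

print(stereo_differences(searched, found))
print(connectivity_block("SQOJOAFXDQDRGF-MMQTXUMRBS"))
```

Two records with the same connectivity block but different stereo layers are the ones worth a closer look.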

The question, unfortunately, remains. What IS the correct structure of Ginkgolide B? For now I have assumed that the one in the RSC article is correct and have added the structure to the database using the normal deposition process and have associated it with the RSC article and the blog discussions on ChemSpider. If it turns out it is not correct then I will leave the structure and the connection to the article but remove the identifier Ginkgolide B.

suppinfo.png

I’ve reported previously on the fact that we are now adding publication details to chemical structures on ChemSpider. We have introduced the ability to do this in a manual fashion where anyone can associate a blog article, wiki article or scientific publication directly with one or more chemical structures but we are also looking to do this in an automated fashion. Project Prospect from the RSC seemed like an ideal opportunity for us to consider using their InChI association to harvest the titles and DOIs to make the association. This would be done after a discussion with the RSC to receive their blessing if possible, based on our previous interactions.

Today I investigated the possibilities of using the information available. I started with this article and clicked on the Enhanced HTML View (Prospect View) and then used the Toolbox to Show the Compounds. The article is about the “Chemistry and biology of resorcylic acid lactones”. A partial screenshot of the type of molecules discussed in the article is shown below.

radicicol.png

For the purpose of this blogpost notice the visually appealing forms of the structures and the stereochemistry on the molecules. Now, each of the marked compounds in the article is linked to details of the molecule. See below: each of the pink highlights is linked to the molecule and pops up a new box.

radicicol3.png

Looking at radicicol on Project Prospect and on ChemSpider we see a difference. In fact, compare with the structure shown above from the article. The difference is one of stereochemistry: there is no stereochemistry in the InChI or in the SMILES string. There are also issues with the structure depiction shown below and this has been discussed before relative to “Cleaning”.

radicicol2.png

Zearalenone on Project Prospect and on ChemSpider

zearalenone.png

As previously discussed with Ginkgolide B there can be many versions of a structure on the ChemSpider database. We recently introduced the ability to search on a “skeleton” as shown below in the new Structure Search options.

search-options.png

When the skeleton for zearalenone is searched (same skeleton excluding H) I found 15 hits. Some are shown below. Notice the differences (highlighted with red boxes) from structure to structure in terms of the presence/absence of the double bond inside the cycle, the difference between the OH and the =O, and the specified stereochemistry. This search can be very useful for finding related structures and more examples will be given in the near future of using such searches. We DO find versions without specified stereochemistry but we are presently working on approaches to relate the stereo/non-stereo versions of structures to each other in a very visual manner. More will follow…

skeletons.png

We have started to introduce new capabilities onto ChemSpider in preparation for our shift from ChemSpider Beta to ChemSpider RELEASE VERSION to celebrate our one year anniversary. You should see a number of incremental improvements happening over the next few weeks. I’ll highlight as many of them as I can as we release them.

Let’s start with new Search Capabilities. On the search screen you will see a drop down menu. You can now select from 5 different types of searches. This post will highlight only the “Structure” type search. Details about the others will follow.

searches.png

It appears that many people believe that ChemSpider is only a text-based search. The reality is  we went live with structure and substructure search in version 1. For sure we have improved things over the months and the searches are better and faster but structure searching has always been present.

To perform structure searching choose the Structure Search and you will see:

structuresearch1.png

Now click on Input Structure and a screen for submitting structures will be opened. There are three ways to do this. The Convert option will allow you to input a SMILES string, InChI string or chemical name and convert it to the structure. If the structure is correct select ACCEPT and perform the search. Alternatively load a molfile by browsing and uploading then click ACCEPT. If you want to draw a structure or edit one you have converted simply select edit and draw into the applet.

structuresearch2.png

The manual for the applet is here. The applet will look like this:

structuresearch3.png

These capabilities have been in place for a while. Now what we have done is enable you to search in different ways from this place. Specifically…

structuresearch4.png

These searches are very valuable, specifically in relation to the issue of tautomers and structure skeletons. As I have shown with the previous post on ginkgolide B, in some cases there are many similar structures on the database. Searching on the skeleton will find them all. A search on Taxol shows 1 exact structure, 1 tautomer of the exact structure and 42 structures with the same skeleton as shown below.

structuresearch5.png

Enjoy the new capabilities. We welcome your feedback.

Tonight I finished an article on Public Chemistry Databases. During that article I commented on the size of the Public Chemistry Databases versus the commercial databases. There have been numerous discussions in the blogosphere about the size of databases such as PubChem relative to the CAS Registry. Recently PubChem and ChemSpider headed towards 20 million structures. The CAS Registry is about 33 million.

Now, I don’t know how much duplication there is in the Registry but I can comment on what is in ChemSpider and likely in PubChem. Here’s a basic comment about molecules with complex stereochemistry. They tend to exist MULTIPLE times in the database due to different variants of stereochemistry. Let’s examine Ginkgolide B. The structure below is taken from a recent RSC article. I was interested to see whether we had the “correct” structure of Ginkgolide B on ChemSpider, assuming that the correct structure is the one shown on the RSC webpage.

ginkgolide-b.png

A search on the name Ginkgolide B turned up a total of 6 structures. The connectivities are the same for all structures. The ONLY difference is in the stereochemistry. Take a look at the structures in Table View. There is one structure with full stereochemistry expressed. This one comes from PubChem, Thomson Pharma and xPharm. With full stereochemistry it might be safe to assume it is correct.

However, even for Taxol there are structures with complete stereochemistry and they are different: Structure 1, Structure 2, Structure 3, Structure 4 and Structure 5

I actually gave up looking eventually…here are the different complete stereochemistries. Look carefully…

t31-,32-,33+,35-,36+,37-,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36+,37+,38-,40-,45+,46-,47+

t31-,32+,33+,35-,36-,37+,38+,40-,45-,46+,47-

t31-,32-,33+,35-,36-,37-,38-,40-,45-,46-,47-

t31-,32-,33+,35-,36+,37-,38-,40-,45-,46-,47-

Question for ChemSpider Users – there are actually WAY MORE than 10 Taxol skeletons on ChemSpider. Can anyone figure out how many? It actually takes one search to find them all!

We believe this is the correct structure of Taxol.

Back to Ginkgolide B. I redrew the structure shown in the RSC article (and as shown below).

ginkgolide-b_2.png

Generating the InChIKey for this structure and performing a search on ChemSpider gave me no hits. It looks like either the RSC structure is wrong OR all of the six structures from all of the different sources are wrong. As mentioned, there is actually only one Ginkgolide B structure (a structure with the associated identifier) on ChemSpider with full stereochemistry. The stereo for that structure is:

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+ ChemSpider

t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20- RSC Stereo

There is ONE stereocenter difference.
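Spotting that by eye is error-prone, so here is a small Python sketch that does the comparison mechanically. It assumes the stereo layer is written as above: an optional leading "t", then comma-separated atom numbers each followed by a one-character parity mark. The function names are mine, purely for illustration.

```python
def parse_stereo_layer(layer):
    """Map atom number -> parity mark for a /t layer such as 't6-,7+,20-'."""
    text = layer.strip().lstrip("/")
    if text.startswith("t"):
        text = text[1:]
    text = text.split("/")[0]  # ignore any trailing /m and /s layers
    centers = {}
    for token in text.split(","):
        token = token.strip()
        if token:
            centers[int(token[:-1])] = token[-1]
    return centers


def diff_stereo_layers(layer_a, layer_b):
    """Return {atom: (parity_in_a, parity_in_b)} for every center that differs."""
    a = parse_stereo_layer(layer_a)
    b = parse_stereo_layer(layer_b)
    return {
        atom: (a.get(atom), b.get(atom))
        for atom in sorted(set(a) | set(b))
        if a.get(atom) != b.get(atom)
    }


if __name__ == "__main__":
    chemspider = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20+"
    rsc = "t6-,7+,8-,9+,10+,11+,15+,17+,18+,19-,20-"
    print(diff_stereo_layers(chemspider, rsc))
```

The same function can be pointed at any pair of the Taxol layers listed earlier to count how many centers separate two records.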

This is what curation is all about. The question now is which one is correct? Is it RSC? Is it the structure on ChemSpider? Can anyone validate? Did I mess something up in the comparison? (It happens!)

Now, for the LONG list of Ginkgolide B structures on ChemSpider shown here what do users think we should do? If we simply remove ALL labels for the incorrect structures then we will remove all links into other databases that contain information for “their Ginkgolide B”. If we collapse all links into the correct Ginkgolide B on ChemSpider, and bring 6 records into one, then the structures to which the correct structure links are actually incorrect in the linked databases (but useful information exists there).

Quite the conundrum. I’d appreciate feedback!

Frequent readers of this blog will recall the multiple exchanges which occurred around trying to get access to the “Open Data” on CrystalEye. I commented then that our intention was to : “… scrape the InChIs, the title of the article, the journal name, volume and page details and the DOI number. We will de-duplicate the structures onto the database or create new structure records as appropriate. My concern is whether or not the ACS will allow us to scrape their Open Data so I have issued the direct question to them below. I am hoping for an affirmative response and then I will move on to confirm with the other publishers.”

I have not been able to get an answer out of ACS about whether their data can be accepted as Open Data and keep us out of trouble if we publish it. It really should be a simple answer but Open Data is causing lots of issues nowadays and we are in for a rocky ride. However, NONE of the list are copyrightable anyways… “the InChIs, the title of the article, the journal name, volume and page details and the DOI number“.

And so to work we went. Supposedly there are 130,000 structures on CrystalEye. Since we were scraping we had to source these ourselves. We could only find 93,000. It doesn’t mean they aren’t there but there’s no site map to help us find the rest.

We have scraped InChIs where they exist (and they don’t for many inorganics and organometallics) and have grabbed DOIs etc. We have extracted about 56,000 InChIs total. What we are trying to achieve with this approach is to provide a manner by which to link from a structure through to an article. However, I’ve made a decision NOT to do that. Here’s why:

1) There are many broken URLs associated with the InChI…for example here and here . There is no standard format to the URLs as we have tried to achieve with the standard URL structures for ChemSpider related to InChIKey for example. We are dealing with complex URLs such as

http://wwmm.ch.cam.ac.uk/crystaleye/summary/acs/cgdefu/2006 /1/data/cg050086d/cg050086dsup2_-aku———————-/ cg050086dsup2-aku———————-.cif.summary.html

2)Looking through the data it is clear that there are issues with structures and the accuracy of what a structure is. Just looking at internal consistency within a record we see issues. Look at this record. The associated XML file is here. At the bottom of the file we see:

<identifier convention="daylight:smiles">
[H]C=2C([H])=C(C([H])=C([H])C=2(N=NC(=NN([H])C1=C([H])C([H])=C(C([H])=C1([H]))C([H])([H])[H])[N+](=O)[O-]))C([H])([H])[H]
</identifier>

<identifier convention="iupac:inchi">
InChI=1/C15H15N5O2/c1-11-3-7-13(8-4-11)16-18-15(20(21)22)19-17-14-9-5-12(2)6-10-14/h3-10,16H,1-2H3/b18-15+,19-17+
</identifier>

These are representations of the same structure in SMILES and InChI format. (I question the label Daylight SMILES as the SMILES string would have to be generated by Daylight, and maybe it is, for that to be true. Many software packages generate SMILES, some good, some bad. Daylight generate “theirs”.) I believe the process is that the structure is extracted from the CIF and then converted via CML to SMILES and InChI. The problem is in the internal consistency. The structures below were generated by converting the structures from InChI and SMILES to structures as shown. Notice that they are inconsistent in E/Z stereo.
smiles-and-inchi.png

Looking at the paper here I believe that the InChI is correct but the SMILES is incorrect. This is either “Daylight’s issue” or the tool converting the CML to SMILES. According to PMR it appears to be either the SMILES conversion in Jumbo or OpenBabel. This is not the only example; there are others.

3) Let me clarify before continuing by commenting that I am an NMR spectroscopist, not a crystallographer. And so, my knowledge of CIF files etc. is not at the level of the developers of CrystalEye. With this premise in mind I looked at the list of InChIs scraped from CrystalEye. While there are MANY examples of InChIs not accurately representing complex organometallics this IS to be expected since InChI has not been developed to deal with them yet. However, there are also many examples of what I “think” are strange InChIs. Taking the premise that on ChemSpider we want people to search a chemical structure and find their way to related information on other databases, let me provide some examples.

For this link the InChI is InChI=1/12H2O/h12*1H2. If you look at the image at the link you will see “12 waters”. It’s not a surprise as it links to an article entitled “A Blind Structure Prediction of Ice XIV” in the Journal of the American Chemical Society. I doubt that anyone would “draw” twelve molecules of water to generate an InChI to search for this article, but you never know. However, looking at other examples we find for example “InChI=1/3C6O2/c3*7-5-1-2-6(8)4-3-5”. The number 3 at the beginning of the formula 3C6O2 indicates there are three EQUIVALENT molecules and I assume these are contained within a unit cell (?) as shown in the figure below and at this link. I would expect a person to draw ONE of these molecules in order to perform a search and not have to draw all three to generate the InChI.

3-molecules.png

What is interesting about the article associated with this example is the nature of its commentary “The space groups of point group C3: some corrections, some comments” where the abstract says “A survey of the October 2001 release of the Cambridge Structural Database has uncovered approximately 675 separate apparently reliable entries under space groups P3, P31, P32 and R3; in approximately 100 of these entries, the space-group assignment appears to be incorrect. Other features of these space groups are also discussed.” I wonder whether this observation is related?

There are many more examples. Check out this CIF. Note the InChI, InChI=1/4C7H4ClN/c4*8-7-4-2-1-3-6(7)5-9/h4*1-4H, and the Figure below. 4 equivalent molecules.

4-molecules-in-cell.png

I am assuming that the InChI is being derived from the unit cell. It is certainly being derived from the CIF. I don’t think of this as a “chemical structure” that I want to deposit to ChemSpider.

What I have noticed about all the online databases I visit, other than ChemSpider and Wikipedia, is that there is no easy way to annotate a record with an error. On ChemSpider we allow people to click on “Post Comments” (it used to be called Curate Data) and add comments directly to a record. This means that if there ARE errors that have been spotted by people that they are visible for everyone else to see also. If someone finds an error why not let them tell the world. I don’t believe PubChem, Drugbank, CrystalEye, blah, blah allow such comments (that I can see). If we can add comments to blogs for all to see why not comments to DB records???? As it is I burn up hours of time trying to hunt down email addresses for contacts for the individual databases I find errors on and trying to inform people. The majority of people are grateful and respond. Some don’t respond, don’t make edits and leave errors online to proliferate. Such a simple capability as Post a Comment could really help identify these errors for other people.

So, there are InChIs on CrystalEye. I cannot speak to whether they represent the structure in the article or whether they represent the structure in the CIF en masse but I am concerned that we do not proliferate incorrect data. But the InChIs on CrystalEye are what they are and if we are only linking to them then we are identifying that there is expected to be related info on that database. However, because of the “multiplicative nature” of some of the InChIs I don’t want to index them either right now. I am presently defining some regular expressions to allow us to “refine” the InChIs for connecting to. An example is shown below; it should be clear what needs to be edited to remove the “multiples”.

multiplier1.png
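As a rough illustration of the kind of regular expression involved, here is a Python sketch. It is not the expression we will actually deploy. It assumes the simple case seen in the examples above: a single-component formula with a leading count, and layers that start with a matching "n*" multiplier. InChIs with several different components are deliberately left untouched, and the function name is hypothetical.

```python
import re

# Leading count on the formula layer, e.g. the "4" in "4C7H4ClN".
_FORMULA_COUNT = re.compile(r"^\d+(?=[A-Z])")
# Multiplier right after a layer letter, e.g. "c4*" or "h12*".
_LAYER_COUNT = re.compile(r"^([a-z])\d+\*")


def strip_multiplier(inchi):
    """Reduce 'n identical molecules' InChIs to a single molecule."""
    prefix, sep, rest = inchi.partition("/")
    if not sep:
        return inchi
    layers = rest.split("/")
    formula = layers[0]
    # Only handle the simple case; mixed-component formulas contain a dot.
    if "." in formula or not _FORMULA_COUNT.match(formula):
        return inchi
    layers[0] = _FORMULA_COUNT.sub("", formula)
    layers[1:] = [_LAYER_COUNT.sub(r"\1", layer) for layer in layers[1:]]
    return prefix + "/" + "/".join(layers)


if __name__ == "__main__":
    for raw in (
        "InChI=1/12H2O/h12*1H2",
        "InChI=1/3C6O2/c3*7-5-1-2-6(8)4-3-5",
        "InChI=1/4C7H4ClN/c4*8-7-4-2-1-3-6(7)5-9/h4*1-4H",
    ):
        print(raw, "->", strip_multiplier(raw))
```

A string rewritten this way should still be regenerated and checked with the InChI software before it is used for linking, since editing the text by hand gives no guarantee that the result is a valid identifier.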

For now we will not index InChIs from CrystalEye on ChemSpider. The quality of what’s related to those InChIs from that point on will need to be checked further. This is NOT just a CrystalEye issue. It’s an issue for all databases including ourselves. However you arrive at a database, whether it’s ChemSpider, Wikipedia, or PubChem, or ChEBI, always check, if you can, for the quality of data. We are all contaminated in some way with errors. We hope that ChemSpider isn’t struggling with InChI and SMILES collisions (and we haven’t found any yet) but we might be. We ARE struggling with InChIs and organometallics in the same way all other public online databases are.

With time I will review a few tens of InChIs versus structures in the CIFs on CrystalEye and we might be able to decide whether or not to post the DOIs and titles etc. in the future. What we will NOT be doing is grabbing any “Open Data” for the time being until some more validation work has been done. It’s a great shame as I was hoping to link to crystal structures. I am not aware that anyone has linked to CrystalEye. If you have I welcome your guidance regarding how you got around broken links, InChI vs. SMILES conflicts and InChI complexities. Thanks!

There is a new contributor to the blogosphere…SimBioSys. I recommend adding the blog to your Google Reader. There are some very exciting things going on there right now. I have commented previously about how high performance computing engines such as the Cell Broadband Engine are being brought to bear on scientific problems. SimBioSys appear to be the only group who have chosen the Cell processor to port their virtual high-throughput screening and docking solution to. Their white paper makes for an interesting read.

In their most recent post “Roping in your next scaffold hop with LASSO” they talked about their LASSO publication: LASSO—ligand activity by surface similarity order: a new tool for ligand based virtual screening”. We are presently in the middle of a very exciting project regarding LASSO. We have teamed up to provide the virtual screening results for 40 target families on the full ChemSpider Library, currently containing over 18 million molecules. Using the LASSO similarity search tool, SimBioSys has screened the ChemSpider database against all 40 target families from the Database of Useful Decoys (DUD) dataset.

LASSO descriptors (Ligand Activity by Surface Similarity Order) contain a count of the different Interacting Surface Point Types (ISPT) found on a molecule. LASSO descriptors use 23 different surface point types, ranging from hydrogen bond donors/acceptors, to hydrophobic sites, to pi stacking interactions. Figure 1 shows a “histidine-like” fragment of a molecule. The triangles are the surface point types of this fragment, colored by type. Based on the idea that ligands must have surface properties compatible with the target site in order to bind, LASSO uses a descriptor of Interacting Surface Point Types (ISPT) to find molecules with diverse chemical scaffolds but similar surface properties.

lasso1.png

We are presently populating the ChemSpider database with 10s of millions of LASSO descriptors and this will allow screening of the ChemSpider database to:

● Find molecules which have a higher likelihood of binding to targets.
● Find molecules with better selectivity for a target.
● Reduce toxicity issues.

The 40 Target receptor families included in the screening results were chosen to cover a wide range of receptor classes due to their interest in drug discovery. Each target family had 10s to 100s of known active molecules, which were used as the basis for the query files used by LASSO, one query for each family. The similarity screening was performed on the full ChemSpider database across all 40 targets and the similarity scores for each structure/target pair are available via the ChemSpider website. Thus for each structure in the ChemSpider database, you can find its similarity score (based on surface properties) relative to actives of each of the 40 target receptors. In addition to allowing instant ranking results for a particular target of interest (retrieving molecules that are likely to be active for a receptor) this matrix of screening results can be used to find molecules that have predicted affinity for a target but low predicted affinity for all other targets. Performing such searches promises to improve selectivity and can be a guide to reducing toxicity concerns. More detail about this collaborative project will be forthcoming but the overview is provided here.
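To make the selectivity idea concrete, here is a small Python sketch of a filter over such a score matrix. Everything specific in it is an assumption for illustration: the scores are taken to lie between 0 and 1, the two thresholds are arbitrary, and the molecule and target labels are invented. It is not how the ChemSpider interface or LASSO itself is implemented.

```python
def selective_hits(scores, target, min_on_target=0.8, max_off_target=0.3):
    """Return molecules scoring high on `target` and low on every other target.

    `scores` maps molecule id -> {target name -> similarity score}.
    Both thresholds are hypothetical values chosen for this example.
    """
    hits = []
    for mol_id, row in scores.items():
        on_target = row[target]
        # Highest score against any target other than the one of interest.
        off_target = max((s for t, s in row.items() if t != target), default=0.0)
        if on_target >= min_on_target and off_target <= max_off_target:
            hits.append((mol_id, on_target, off_target))
    # Best on-target score first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


if __name__ == "__main__":
    # Invented scores for three molecules against three targets.
    example = {
        "mol_a": {"T1": 0.90, "T2": 0.10, "T3": 0.20},
        "mol_b": {"T1": 0.95, "T2": 0.70, "T3": 0.10},
        "mol_c": {"T1": 0.40, "T2": 0.05, "T3": 0.05},
    }
    print(selective_hits(example, "T1"))
```

In this made-up data mol_b has the best score for T1 but also scores well against T2, so only mol_a passes the filter.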

Watch this space for updates and an unveiling date.

A few months ago we rolled out the ability to post analytical data onto ChemSpider. The deposition process at this point appears to be seamless. We have had no bugs or failures reported during the depositions of the last 80 spectra. We have had an initial deposition from a publisher as discussed here and believe that ChemSpider does offer an opportunity to many other potential contributors to expose their data to the public. There was an early perception that depositors were transferring copyright of the data to us but that is not the case. We enabled the facility for users to declare their data as Open Data for others to download. Some depositors declare it Open and some don’t; it’s their choice.

I encourage all users to consider the deposition of analytical data to ChemSpider. Instrument Vendors  particularly might wish to expose their data from their latest and greatest instruments (new NMR probes, new algorithm processing techniques etc).

We will soon open the ability to deposit images and CIF crystal structure files also. We will use Jmol to display the CIF file. Image deposition will allow us to support 2D spectral data (since JSpecView does not support them yet) as well as photographs of crystals, surfaces etc.

We welcome any further suggestions for online exposure of data.