Archive for the ChemSpider Chemistry Category

For those of you who have been following the discussions of Stevan Harnad, Peter Suber and others regarding institutional repositories and Open Access, you will already be up to speed regarding OA mandates and what they could mean in terms of access to data. Rather than go into this area in detail myself, I point you specifically to Stevan Harnad’s site to review the ongoing discussions there (there are multiple parties exchanging views).

What I am going to do, though, is point you to this comment on Peter Suber’s blog regarding “Stanford Opens Access to All Its Education Studies”.

Specifically, the following comments are of interest “Under Stanford’s new policy, only the author’s final, peer-reviewed copy of the article would be posted online —in some cases, potentially months before the printed version becomes available….By early fall, the education school plans to have a Web site in place where the articles will be posted and archived in a searchable database. With approximately 50 scholars on Stanford’s education school faculty, the site could accumulate as many as 100 articles a year, by Mr. Willinsky’s estimate.”

Stanford is not alone in this type of shift. What does this mean for indexing of articles and availability for searching in terms of the work we are doing with ChemSpider right now (1,2,3)? Text-indexing of chemistry articles would simply mean turning our spider onto the repository. Using the tools we have available now, and the database of 21 million compounds and associated dictionary, we could also convert the chemical names to structures and make the articles searchable by both text and structure BEFORE publication, in theory months before. With the work that is already underway on Open Access articles on ChemSpider, SOON to be unveiled, we could also provide tools for authors to mark up their own documents. My preference, as for many others, is that authors of chemistry articles use semantic authoring tools, both to allow us to grab the appropriate information from the articles for linking and to provide a path for semantic connectivity.
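To make the dictionary idea concrete, here is a minimal sketch of spotting known chemical names in article text and attaching an identifier to each hit. It is not the ChemSpider implementation: the function name `find_chemical_names` and the tiny dictionary are assumptions for illustration, and the only data used is the chloroprene Registry Number that appears on the record quoted later on this page.

```python
import re

# Tiny illustrative dictionary (an assumption): lower-cased name -> CAS Registry Number.
NAME_TO_RN = {
    "chloroprene": "126-99-8",
    "2-chlorobutadiene": "126-99-8",
    "2-chloro-1,3-butadiene": "126-99-8",
}

def find_chemical_names(text, dictionary):
    """Return (matched text, identifier) pairs for dictionary names found in text."""
    # Longest names first so a long systematic name wins over a shorter overlap.
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(re.escape(n) for n in names) + r")(?![\w-])",
        re.IGNORECASE,
    )
    return [(m.group(1), dictionary[m.group(1).lower()]) for m in pattern.finditer(text)]

hits = find_chemical_names(
    "Polymerization of Chloroprene (2-chloro-1,3-butadiene) was studied.", NAME_TO_RN
)
```

A real pipeline would go from the matched name to a structure rather than to a Registry Number, and would need to cope with names that are not in any dictionary; this sketch only shows the lookup step.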

The question then is whether or not ChemSpider can index institutional repositories or authors’ self-archived collections on their university research group websites. The authors’ self-archived collections will be very valuable but are, of course, the most likely to upset the publishers. We’d like to do both.

I envisage a time when articles are indexed and searchable even before they are published and indexed by others. Why not? If there are changes to the article between pre- and post-publication, both versions can be indexed.

We welcome your comments! Anyone want to introduce me to the host of an institutional repository?

I have blogged previously about ChEBI entities of the month and our work to include the information in ChemSpider. In order to do so we had to introduce rich text support. This work is done and reported here. As of today nearly all ChEBI Entity of the Month information is posted to ChemSpider. During the process we provided feedback to the team with suggested changes to some structure depictions and also noted some differences in stereochemistry between our reference structures and those on ChEBI. This type of interaction keeps us all vigilant about accuracy, and it was great (and fast) to work with the group at ChEBI to cross-validate the limited dataset. Everyone gains.

The rich text editor worked without failure and, we think, is ready to roll out to the general public, but we would still like some beta-testers to help test it.


Over the weekend we added chemicals from two new data sources: Afid therapeutics and Alfa Aesar. Large depositions of over 25,000 chemicals have been slowed down while we improved our batch deposition system, but over the next few days we will be playing catch-up with a large backlog of vendor deposits. When we set up ChemSpider our initial belief was that the need to source compounds was being sufficiently served by the chemical vendors themselves and by the many commercial software vendors and websites offering access to aggregated datasets of vendor offerings.

What we have noticed, however, is that on a daily basis ChemSpider users are requesting sources of chemicals directly. The majority of these requests come via email, but the forum is also being used via the Looking for a Chemical topic. We have more and more requests to increase the number of chemical vendors represented on ChemSpider and to make it easier to identify chemical suppliers. Despite our directing many users to other sites, they seem to be looking for a “one-stop” shop for their information. We will add some improved navigation to facilitate locating sources of chemicals. We’re hoping that some of the companies focused on sourcing chemicals will also want to help and integrate their services…

When we started building ChemSpider we focused initially on building the data model around “organic structures”. We always knew that we would regularly run into limitations such as the inability to handle polymers, organometallic representations, allotropes and so on. Nevertheless, we move forward. Information is aggregated from multiple sources and remains semantically linked back to the originating sources for users to check if they deem it necessary.

Carbon is a challenge. Check out the record and the identifiers will list carbon, graphite, diamond and carbon nanotubes, to name just a few. Clearly the physical properties of these materials will differ. We are capturing this information. Click on the hyperlink at the end of each of the appearance data entries:

  • Appearance: very hard crystals or light green powder

  • Appearance: soft dark grey solid

  • Appearance: grey to black powder

  • Appearance: grey solid

Now, what’s missing is the NAME of the associated form: diamond, graphite, nanotubes etc. If you hover the cursor over the hyperlink pointer you will see the title of the article and this can help identify the form IF it’s in the title. That won’t always be the case. So, we’ll be adding our Wiki capabilities to enable annotation of the properties…

The full User Data for Carbon are listed below…it’s quite extensive AND linked to original sources.

  • experimental physchem properties
    • Melting Point: ca. 3750 (sublimes)

    • Melting Point: 3727 C

    • Melting Point: 3650 C

    • Melting Point: 3652 C

    • Boiling Point: Sublimes

    • Boiling Point: ca. 5000 C

    • Boiling Point: 4200 C

    • Specific Gravity: 1.8-2.1

    • Solubility: Insoluble

    • Vapor Pressure: 0 mmHg (approx)

  • miscellaneous
    • Appearance: Black, odorless solid.

    • Appearance: very hard crystals or light green powder

    • Appearance: soft dark grey solid

    • Appearance: grey to black powder

    • Appearance: grey solid

    • Appearance: finely divided black dust or powder

    • Stability: Stable. In the form of powder reacts vigorously with a wide variety of materials; in the rod form is relatively inert.

    • Stability: Stable. Incompatible with strong oxidizing agents. Combustible. Highly flammable in powdered form.

    • Stability: Stable. Combustible.

    • Toxicity: IVN-MUS LD50 440 mg kg-1

    • Safety: FLAMMABLE

    • Safety: FLAMMABLE

    • Safety: FLAMMABLE / IRRITANT

    • Safety: Minimize exposure.

    • Safety: Avoid exposure to dust. If machining or cutting, do so in an area with good ventilation.

    • Safety: Safety glasses if working with powdered carbon.

    • First Aid: Eye: Irrigate promptly Breathing: Fresh air

    • Exposure Routes: inhalation, skin and/or eye contact

    • Symptoms: Cough; irritation eyes; in presence of polycyclic aromatic hydrocarbons: [potential occupational carcinogen]

    • Target Organs: respiratory system, eyes Cancer Site [lymphatic cancer (in presence of PAHs)]

    • Incompatibilities and Reactivities: Strong oxidizers such as chlorates, bromates & nitrates

    • Personal protection and Sanitation: Skin: No recommendation Eyes: Prevent eye contact Wash skin: Daily Remove: No recommendation Change: No recommendation

    • Exposure Limits: NIOSH REL: TWA 3.5 mg/m3; Ca TWA 0.1 mg PAHs/m3 [Carbon black in presence of polycyclic aromatic hydrocarbons (PAHs)]; see Appendix A, see Appendix C. OSHA PEL: TWA 3.5 mg/m3

We have about a million structures backlogged in our deposition system and are currently feeding in the smaller depositions. As the number of data sources increases, it becomes necessary to be able to see a little more information about each of them. We put callout balloons on the Record View page so that you can hover over the name of the data source provider to see an overview of the depositor. We have just layered the same capability onto the list of Data Sources at this page, so if you want to know a little more about a Data Source just hover over the name of the depositor and a balloon will pop up as shown below. For example, the one below represents our latest addition from the Manchester Center for Integrative Systems Biology.

I blogged previously about the confusion of Talc and DMSO under the name of sclerasol here and pointed to the fact that the PubChem record was a mixture of both DMSO and Talc. What I like about our community is how fast attention is paid to issues like this. The record on PubChem has already been cleaned up and the confusion resolved. All links back to DMSO and MeSH have been resolved and the collective quality for the community is improved. The outcome of eyeballs carefully curating records like this is to the benefit of all…robots cannot do this…they are fast but mostly dumb to such detail, I’m afraid. They can check that the molecular formula and molecular weight associated with a structure are appropriate (actually, they should be derived FROM the structure/compound!). We keep saying it, and Wikipedians around the world would agree: a platform for public curation and annotation is necessary at a time when the amount of data available to the community continues to grow at an astonishing rate.
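The kind of check a robot CAN do is easy to sketch. The snippet below is a minimal illustration, not what PubChem or ChemSpider actually run: it recomputes a molecular weight from a simple formula and compares it with a reported value. The atomic weights are approximate, the element table is a deliberately small subset, and the function names and tolerance are my own assumptions.

```python
import re

# Approximate standard atomic weights, illustrative subset only (an assumption).
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06, "Cl": 35.45}

def formula_weight(formula):
    """Sum atomic weights for a simple formula such as 'C2H6OS' (no brackets or charges)."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHTS[symbol] * (int(count) if count else 1)
    return total

def weight_is_consistent(formula, reported_weight, tolerance=0.05):
    """True when a reported molecular weight agrees with the formula within tolerance."""
    return abs(formula_weight(formula) - reported_weight) <= tolerance

# DMSO is C2H6OS; a reported weight far from the computed one would be flagged.
dmso_ok = weight_is_consistent("C2H6OS", 78.13)
```

A check like this catches a formula and weight that disagree with each other, but it says nothing about whether the NAME attached to the record belongs to that compound, which is exactly the sort of detail human curators caught here.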

I worked at Kodak for over 5 years in the research labs in Rochester. I had a great time and did some of my best scientific research in those labs and saw the scientists there do things with chemistry and light sensitivity that were just fascinating. Today I was catching up with my blog reader backlog as well as reading through C&E News. I came across this fascinating article on both the Wired blog and in C&E News (subscription required) and, MORE importantly, the accompanying video.

What you will see in that video is an “immediate” change in color from a colorless solution to a localized green spot based on light. And I mean IMMEDIATE. Watch the video…check out the molecule here. This will provide a reference/link to the article.

Start thinking... how could this technology be used? As they modify and create different colors, what will the technology lead to? Suggestions welcome!

ChemSpider added the Directory of Useful Decoys over the weekend. This dataset is well known to the community of scientists performing computational docking experiments and is outlined below. The dataset contributed over 128,000 molecules to the collection.

“DUD, a directory of useful decoys for benchmarking virtual screening. DUD is designed to help test docking algorithms by providing challenging decoys. It contains:

  • A total of 2,950 active compounds against a total of 40 targets
  • For each active, 36 “decoys” with similar physical properties (e.g. molecular weight, calculated LogP) but dissimilar topology.

DUD is provided by the Shoichet Laboratory in the Department of Pharmaceutical Chemistry at the University of California, San Francisco (UCSF). To cite DUD, please reference Huang, Shoichet and Irwin, J. Med. Chem., 2006, 49(23), 6789-6801. doi 10.1021/jm0608356. There is a DUD wiki page where you can discuss DUD and an errata page where problems are reported and explained.”

I have been in discussion with Christoph Steinbeck and colleagues from the European Bioinformatics Institute. Specifically, we are interested in linking up to AND embedding the text from their ChEBI Entities of the Month. So, as is my preferred manner of not assuming everything is Open Data but rather asking for permission, I approached Christoph. I asked for permission to copy the text for the Entities of the Month onto the appropriate record view in ChemSpider. When I asked the question we were not yet ready to accept rich text format with embedded hyperlinks, a strength of many of the articles on ChEBI’s Entity of the Month.

I am happy to announce that as part of our ongoing effort to Wikify ChemSpider and allow people to add descriptions to the individual record views we have added a rich text editor and are presently testing it. At present we have rolled out the FULL implementation of the editor. This means it has lots of capabilities/buttons and the entire editor is being tested by curators. But, when rolled out to users, there will be a Simple mode and an Advanced mode for the editor.

Click on the thumbnail below to see the Text Editor in action. Don’t forget, it is the “full-powered” implementation for now. In this case all I did was copy and paste the text from the ChEBI website and insert a link back to the original article on the ChEBI site.

In the Text editor we are in the process of inserting new capabilities that will facilitate mark up of articles. Since we will be hosting a number of Open Access articles shortly we will be experimenting on those articles with our new markup capabilities.

When this is all rolled out we will have the majority of capabilities necessary for people to track their research online if they wish: online submission of structures, text deposition with full editing capabilities, submission and tracking of analytical data and images, and linking to external sites and data. It’s probably an 80% solution for right now since we are missing some capabilities and have workflow issues to resolve. Examples are the poor support for polymers and organometallics and, specifically, the structure-centric nature of the solution, which insists on a submitted structure to associate data and text with. In the future we will allow “sample-submission”, where the structure is not known but the data, images and experimental details of synthesis and analysis are available. Clearly the standard workflow for synthetic chemists is to synthesize first and then confirm by analysis what the products are. This is a typical workflow and will need to be supported. It’s coming…

Some of you might be asking:

1) will we support versioning of the articles as people modify/edit the article (as is done with Wikipedia)? Yes, we will. Soon.

2) will curators have the ability to lock articles? Yes, in the future we will introduce this if it’s deemed appropriate.

3) will it be possible to allow only one individual (or group) to edit an article? Yes, one of the future directions is to allow an individual or group to perform Open Notebook Science in front of the public but not allow the public to edit the results. They would of course be allowed to comment on the research. Future development…


There has been a response to my post about Chemical Names and Structures here.

PMR>”For certain purposes, it is valuable to collect as many names as possible, for example for location of lookup. But these should be accompanied with metadata. A similar example is from ChemSpiderMan (ed.):

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here: Looks fine in my broswer and pasted in here too: N-{2-[({5?-[(dim�th?ylamino)m?�thyl]fur?an-2-yl}m?�thyl)sul?fanyl]�th?yl}-N’-m�?thyl-2-ni?tro�th�ne?-1,1-diam?ine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

PMR: Without the metadata giving the language, information is lost. For example, what does “pain” mean? If the language is not given there is a tendency to interpret this as English. We have to acknowledge that the language of science is currently English (it wasn’t when I started and we had to read French and German papers). So RDF, for example, provides a language qualifier (e.g. @en or @fr). The addition of that qualifier transforms the information from junk to meaningful.”

First of all, it’s interesting to note that the French name has been rendered as “junk” in Peter’s blog as shown here.

This probably relates to his original comment that the name is junk in his browser too…but acceptable in mine. On the other hand his blog post may look fine to him and looks bad in mine! Oh those dependencies…I see similar things show up in WordPress regularly.
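I don’t know exactly what happened between our server, Peter’s browser and his blog software, but the two kinds of damage seen in these names (plain “?” characters and replacement glyphs) are what you get from two common encoding mismatches. The sketch below reproduces both on one fragment of the French name; the function names are mine and this is one plausible mechanism, not a diagnosis.

```python
def as_ascii(text):
    """Force text into ASCII; characters with no ASCII form become '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")

def latin1_bytes_read_as_utf8(text):
    """Encode as Latin-1, then wrongly decode as UTF-8; bad bytes become U+FFFD."""
    return text.encode("latin-1").decode("utf-8", errors="replace")

fragment = "m\u00e9thyl"  # "méthyl", with the accented e written as an escape

question_marks = as_ascii(fragment)                  # "m?thyl"
replacement_glyph = latin1_bytes_read_as_utf8(fragment)  # "m" + U+FFFD + "thyl"
```

Either way the accented character is lost for good once the text is saved in the damaged form, which is why the same name can look fine in one place and like junk in another.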

Peter suggests that there should be metadata giving the language information. Good idea. See my previous blog post about that particular issue and the fact that we allow curators to layer on metadata AND we capture and retain it WHEN it is available.

If you look at this record you will see that there are names labeled as Polish, German and Dutch.

Chloroprene [Wiki]

1,3-Butadiene, 2-chloro-

126-99-8 [RN]

204-818-0 [EINECS]

2-Chloor-1,3-butadieen [Dutch]

2-Chlor-1,3-butadien [German]

2-Chlorbuta-1,3-dien [German]

2-Chloro-1,3-butadiene

2-Chlorobutadiene

Chloropren [Polish]

Most labels were captured during the deposition process. One was added manually. Notice also the direct links to Wikipedia, the Registry number link to perform a search of PubChem and the link to EINECS.
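For readers who think in code, here is a minimal sketch of how labels like those in the list above could be turned into the RDF-style language qualifiers Peter mentions. The mapping from label to two-letter language code and both function names are assumptions for illustration; this is not how the labels are stored on ChemSpider.

```python
import re

# Assumed mapping from the language labels shown above to ISO 639-1 codes.
LANGUAGE_CODES = {"Dutch": "nl", "German": "de", "Polish": "pl", "French": "fr"}

def parse_labeled_name(entry):
    """Split 'name [Label]' into (name, label); label is None when absent."""
    match = re.fullmatch(r"(.*?)\s*\[([^\[\]]+)\]", entry.strip())
    if match:
        return match.group(1), match.group(2)
    return entry.strip(), None

def to_language_literal(entry):
    """Render an RDF-style language-tagged literal, or None for non-language labels."""
    name, label = parse_labeled_name(entry)
    code = LANGUAGE_CODES.get(label)
    return f'"{name}"@{code}' if code else None

dutch = to_language_literal("2-Chloor-1,3-butadieen [Dutch]")
```

Labels such as [RN], [EINECS] and [Wiki] are identifier types rather than languages, so the sketch leaves them untagged. Note the limitation: a systematic name that itself ends in a square bracket would be misread as carrying a label.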

As I commented in my post on ranitidine, and extracting from Peter’s post “Notice …….. that the name is labeled French on the record.” So, what Peter suggests is already in place on ChemSpider. I display below what is presently available to curators to label the names with. Notice this includes language, EINECS numbers, CAS Registry Numbers, INNs, JANs etc.


The list of languages is easy to expand. Anybody have any requests?

A further comment: “PMR: I very much like the idea of regarding chemical names as social identifiers. But, of course, that only works for humans. The machines can aggregate the tags but they cannot make inferences from them. The problem is that when they are put into databases they lose their social context and are managed by hard boolean logic. That fails immediately and often dramatically. A major cause is the loss of metadata and authorities. In this world you cannot use voting (which is why Chempedia cannot be seen as an authority for CAS numbers, only a useful guide). We have to use authorities (provenance) in our information. Thus the statements “Ranitidine is the Z-isomer” and “Ranitidine is the E-isomer” may be seen as contradictory. That’s why people have suggested that RDF should have quads, not triples, such as: “Antony_Williams asserts ranitidine hasIsomer Z”; “Wikipedia asserts ranitidine hasIsomer E”. Both these are true. That is the language we should use in the semantic web. PeterMR still deliberately fails to make an assertion about this isomerism and is waiting to see what others think.”
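The quad idea in that quote is simple to sketch: keep the authority alongside each statement, so that two sources disagreeing is recorded as a dispute rather than as a contradiction inside one database. The example statements are taken verbatim from Peter’s text; the `Quad` type and the two helper functions are hypothetical names of mine, not an RDF library API.

```python
from collections import defaultdict, namedtuple

# A quad is a triple plus the authority making the assertion.
Quad = namedtuple("Quad", ["source", "subject", "predicate", "value"])

def group_assertions(quads):
    """Map (subject, predicate) -> {value: [sources]} so no assertion loses its provenance."""
    grouped = defaultdict(lambda: defaultdict(list))
    for q in quads:
        grouped[(q.subject, q.predicate)][q.value].append(q.source)
    return grouped

def disputed(quads):
    """Return the (subject, predicate) pairs for which sources assert different values."""
    return [key for key, values in group_assertions(quads).items() if len(values) > 1]

quads = [
    Quad("Antony_Williams", "ranitidine", "hasIsomer", "Z"),
    Quad("Wikipedia", "ranitidine", "hasIsomer", "E"),
]
open_questions = disputed(quads)
```

Nothing is voted on and nothing is thrown away: the output simply says that the isomerism of ranitidine is disputed and who says what.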

This leads us into a deeper discussion about retention of metadata and authorities. We retain metadata when it is deposited or we can harvest it. Let’s consider the information below extracted from the same compound on ChemSpider:

Notice all of the hyperlink pointers, and note that they all link through to the original source of information, in this case NIOSH.

  • Appearance: Colorless liquid with a pungent, ether-like odor.

  • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, skin, respiratory system; anxiety, irritability; dermatitis; alopecia; reproductive effects; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, reproductive system Cancer Site [lung & skin cancer]

  • Incompatibilities and Reactivities: Peroxides & other oxidizers [Note: Polymerizes at room temperature unless inhibited with antioxidants.]

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet (flammable) Change: No recommendation Provide: Eyewash, Quick drench

  • Exposure Limits: NIOSH REL: Ca C 1 ppm (3.6 mg/m3) [15-minute]; see Appendix A. OSHA PEL: TWA 25 ppm (90 mg/m3) [skin]

There are also properties, and each piece of data links out to the original source. For this record it is the same source. For some records there are already multiple sources.

Experimental physchem properties

  • Boiling Point: 139F

  • Flash Point: -4F

  • Freezing Point: -153F

  • Specific Gravity: 0.96

  • Solubility: Slight

  • Ionization Potential: 8.79 eV

  • Vapor Pressure: 188 mmHg

This particular structure has been deposited onto the ChemSpider database a total of 18 times from the source databases listed below. Where possible, i.e. when the structure is available online on the supplier’s website and can be hyperlinked to, each external ID links to the depositor. There is an error! The Aldrich depositions are for the polymer forms! Curators can remove this information.

Data Source External ID(s)
ChemDB 6681768
ChemIDplus 000126998, 014523898
DiscoveryGate 31369
DTP/NCI 18589
EINECS N/A
EPA DSSTox 1084_NTPBSI_v2b, 325_CPDBAS_v5b, 326_CPDBAS_v5b, 724_HPVCSI_v2c
Istituto Superiore di Sanità 601
NIOSH EI9625000
NIST 2143397875
NIST Chemistry WebBook 2143397875
PubChem 31369
Sigma-Aldrich 205397_ALDRICH, 205400_ALDRICH
Thomson Pharma 00243363

Also available to master curators is the ability to see who has been editing the names and synonyms and a full record of depositions, by whom and when.

So, names are labeled with language and links to Wikipedia and other info. The predicted properties and systematic name are generally labeled according to the provider of the algorithm(s). We keep track of every URL and publication deposition and know which user deposited what and when…if the site is “vandalized” then we know which user did so.
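As a rough picture of what “who deposited what and when” means in data terms, here is a minimal sketch of an edit log and two queries over it. The field names, the record ID and the user names in any usage are hypothetical; the real ChemSpider schema is not described in this post.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class EditEvent:
    user: str
    record_id: int
    action: str       # e.g. "add_name", "remove_name" (illustrative action labels)
    detail: str
    when: datetime

def events_by_user(log, user):
    """All edits made by one user, oldest first."""
    return sorted((e for e in log if e.user == user), key=lambda e: e.when)

def last_editor(log, record_id):
    """The user who most recently touched a record, or None if it has no edits."""
    events = [e for e in log if e.record_id == record_id]
    return max(events, key=lambda e: e.when).user if events else None

# Hypothetical log entries for a made-up record ID.
log = [
    EditEvent("curator_a", 1, "add_name", "Chloropren [Polish]", datetime(2008, 6, 1, 9, 0)),
    EditEvent("curator_b", 1, "remove_name", "Zantac", datetime(2008, 6, 2, 14, 30)),
    EditEvent("curator_a", 2, "add_name", "Zantac (as HCl)", datetime(2008, 6, 3, 8, 15)),
]
```

With a log like this, tracing a vandalized record back to a user is a lookup rather than an investigation.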

Overall I’d say we have a lot of metadata for this record. The same is true for tens of thousands of records on ChemSpider and the amount of such information is growing literally daily. We’re not done yet of course – there is much more to add. We put a lot of thought into the design of this system and associated metadata but we also chose to jump off the cliff and start “doing”. There is a lot to learn from managing 20 million molecules and the complexity that comes with doing so. We continue to morph and extend as necessary and welcome input.

To clarify re. ranitidine… I am NOT asserting that ranitidine has Z-isomer. I am stating that ranitidine has multiple names on ChemSpider, some with no stereochemistry and some with Z-stereochemistry. I also report that a published crystal structure reports a Z-orientation. I also report that a commercial software package suggests that the three tautomeric structures below are possible for ranitidine.

I also report, just for fun of course, that the InChI algorithm will declare two of these isomers, the bottom two, as equivalent when “mobile protons” are taken into account. Compare the InChIKeys below, generated with mobile proton perception switched ON in the InChI algorithm. Need more information?

With the curation capabilities we have in place, with the retained metadata, linkages to depositors and other sites and the revision history available, I would say that we are well equipped to manage the data for chemists and continue to enhance our platform for chemists worldwide.

Recently I posted on whether or not there is “a right structure for a compound”. I talked about trade names and registered chemical entities and posed the question “whether a Registered Trade Name is absolute? I’m asking the question since I’m actually not sure.”

There were two responses…

1) Rich Apodaca commented:”you’d probably find agreement among chemists that a trade name uniquely identifies one specific chemical entity. Ditto CAS Number.”

2) Peter Murray-Rust, as is his way (does anyone ever get a comment on their blog from PMR?), posted a detailed and thoughtful response on his own blog here.

I, like Rich, am of the opinion that a CAS Number does uniquely identify a specific chemical entity, not necessarily a unique structure. Of course, CAS numbers can be confusing too as I have commented here. Aspirin, for example, has 6 CAS numbers! So Rich and I agree on this…can anyone from CAS confirm or not whether our belief is right?

So, what about Trade Names? There were a number of purposeful errors in my original post to stimulate thought and feedback about my question. There is a LOT of confusion about identifiers and chemicals. The relationships are convoluted and even I struggle with certain aspects. So, let’s examine the confusions!

I commented that “Zantac is a registered trade name for the chemical here. ” Check out the chemical structure there.

Now check out the Wikipedia text on that record view: “Ranitidine (INN) is a histamine H2-receptor antagonist that inhibits stomach acid production. It is commonly used in the treatment of peptic ulcer disease (PUD) and gastroesophageal reflux disease (GERD). It is currently marketed over the counter under the trade name Zinetac and Zantac by GlaxoSmithKline and by many other companies under various other names. ”

One might assume therefore that I am correct in my statement about Zantac. Check out the DailyMed label for Zantac here. This declares: “The active ingredient in ZANTAC Injection and ZANTAC Injection Premixed is ranitidine hydrochloride (HCl)”. Ah-ha…Zantac is a hydrochloride form of Ranitidine then? A search for Zantac gives THREE results on DailyMed…different in formulations but all pointing to the HCl form of ranitidine as the active component. So, based on this statement, is it correct to label the structure here with the label Zantac? It doesn’t have the HCl so, in theory, no. Is Wikipedia correct in saying that Ranitidine is “marketed over the counter as Zantac”? No. Hmmm. A conundrum? No. It’s clear. Zantac should ONLY be a Ranitidine HCl formulation. A couple of button clicks and the record now says Zantac (as HCl). But there are a LOT of other trade names associated with that Ranitidine record that don’t have such definitions (yet).

There is a Ranitidine Hydrochloride on ChemSpider here. It came as part of the recent CrystalEye deposition and is at this record. The associated publication is here, the title of the article is “Ranitidine hydrochloride, a polymorphic crystal form” and the abstract says:

“In the title compound, dimethyl({5-[2-(1-methylamino-2-nitroethenylamino)ethylthiomethyl]-2-furyl}methyl)ammonium chloride, C13H23N4O3S+·Cl-, protonation occurs at the dimethylamino N atom. The ranitidine molecule adopts an eclipsed conformation. Bond lengths indicate extensive electron delocalization in the N,N’-dimethyl-2-nitro-1,1-ethenediamine system of the molecule. The nitro and methylamino groups are trans across the side chain C=C double bond, while the ethylamino and nitro groups are cis. The Cl- ions link molecules through hydrogen bonds.”

When I take the orientation information and draw the molecule from the crystal structure then I get:

and when I name this I get: (Z)-N-{2-[({5-[(dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethene-1,1-diamine, a Z-orientation.

Let’s return to Peter’s analysis of the list of identifiers associated with Ranitidine on the ChemSpider record in question. He comments

“PMR: ….. It is clear that

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethen-1,1-diamin

and

N-[2-[[[-5-[(Dimethylamino)methyl]-2-furanyl]methyl]thio]ethyl]-N’-methyl-2-nitro-1,1-ethenediamine

are not identical. One describes a compound whose stereochemistry is asserted, the other describes one where the stereochemistry is not asserted. Butene and 1-butene and 2-butene and (Z)-butene are all different. They all have different InChIs. Some of them may refer to the same concept in some contexts, but they are not synonyms. Fowler (Modern English Usage) says “perfect synonyms are extremely rare”.”

We are in absolute agreement about this issue. The names are not identical. One declares stereo and the other doesn’t. The question then is what synonyms are useful to the user of ChemSpider to locate the structure if they have a systematic name. One might assume that the more the merrier. There is an enormous number of variants of bracket styles and dashes that could give rise to probably dozens of names that are all consistent with the structure and the names shown come from different sources.
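To show what I mean by variants of bracket styles and dashes, here is a minimal sketch of a lookup key that folds those purely typographic differences together so a user’s typed name can find a record. The function name and the list of characters folded are assumptions for illustration. It deliberately does NOT touch stereo descriptors, so a (Z)- name and a name without stereo still produce different keys.

```python
import re

# Bracket and punctuation variants folded onto one form (an assumed, partial list).
_TRANSLATION = str.maketrans({
    "[": "(", "{": "(", "]": ")", "}": ")",
    "\u2010": "-", "\u2011": "-", "\u2013": "-", "\u2014": "-",  # hyphen and dash variants
    "\u2019": "'", "\u2032": "'",                                # curly apostrophe, prime
})

def name_key(name):
    """Build a lookup key that ignores case, bracket style, dash style and spacing."""
    folded = name.translate(_TRANSLATION).lower()
    # Drop ordinary whitespace and zero-width spaces left over from web pages.
    return re.sub(r"[\s\u200b]+", "", folded)

same = name_key("N-{2-[(Dimethylamino)methyl]}-N\u2019-methyl") == name_key(
    "n-(2-((dimethylamino)methyl))-N'-methyl"
)
```

A key like this is only a search aid. Two names sharing a key are typographic variants of one string; it makes no claim that names with different keys are or are not synonyms.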

Additionally the comment is made: “If we are representing something in a machine, and we assert the two are to be used interchangeably then we have to be very sure that they can be. Adding a “(Z)” may appear a reasonable thing to do – in this case it is a disastrous act that corrupts information.” This is the problem with identifiers: they are confounded by complexity, which supports the idea that there are no absolutes in the names associated with compounds.

In discussing Wikipedia, Peter has previously pointed to it as “Open, re-usable, very highly curated, and the first place that students look. That – or a derivative – is where the world’s chemistry should reside.” I have covered the complexity of Taxol/paclitaxel previously (1,2,3), so where does Wikipedia stand on Ranitidine?

Wikipedia actually shows and names an E-orientation as shown below

So, Wikipedia says E, ChemSpider says Z- and no-specific stereochemistry (in its identifiers). The crystal structure specifies Z-stereo. Oh dear, what can the matter be?

I then searched PubChem and found two E’s and a Z under Zantac. I searched MeSH for ranitidine and found no stereo specified. I searched ChEBI for both ranitidine and zantac and found nothing.

Further down the rabbit hole we go…

PMR> “The robotic aggregation of chemical names and identifiers, if done without metadata and ontology, corrupts information. That’s a strong statement, but we can see it in the current case. First there is junk out there. Robotic name harvesting harvests junk. (Christoph Steinbeck described it in worse terms at the RSC meeting. ) Here’s a snip from page571454

Validated by Experts, Validated by Users, Non-Validated, Removed by Users, Redirected by Users, Redirect Approved by Experts

Ranitidine [Wiki]

(Z)-N-{2-[({5-[(Dim?thylamino)m?thyl]furan-2-yl}m?thyl)sulfanyl]?thyl}-N’-m?thyl-2-nitro?th?ne-1,1-diamine

(Z)-N-{2-[({5-[(Dimethylamino)methyl]furan-2-yl}methyl)sulfanyl]ethyl}-N’-methyl-2-nitroethen-1,1-diamin

The “?” characters show up in my browser – I don’t know what they are, but they are not normal “e”s (ASCII 101). The first name is not a synonym – I’m sorry, but it’s junk. Associating junk with good information degrades the good information rather than increasing the quality of the junk (There is a more formal proof somewhere by Shannon – I believe – that machines cannot act as 100% proofreaders).”

On a record view we list “Names and Synonyms”. The question marks Peter sees are for a French name shown here:

Looks fine in my browser and pasted in here too: N-{2-[({5-[(diméthylamino)méthyl]furan-2-yl}méthyl)sulfanyl]éthyl}-N’-méthyl-2-nitroéthène-1,1-diamine. So, not junk (saying that the French name is junk would offend the Parisians). Notice that the Z- has been removed (for now) and that the name is labeled French on the record. If any of you are seeing issues in your browser let us know and we will investigate at our end.

Further

“PMR: A trade name represents a product, not a compound and certainly not a connection table. In some cases it may refer to a pure substance, which itself is describable by a connection table, but these are not synonyms. And aggregating them as synonyms adds error rather than clarity. However there is an even stronger reason why “Zantac” does not describe ranitidine. See the FDA page. Zantac (Ranitidine Hydrochloride) Tablets Zantac contains (not “is”) ranitidine hydrochloride.”

A Trade Name DOES represent a product. It can represent MANY formulations also. The active component is commonly the material of interest that we would like to see as a connection table.

However, if one wants to find the active component in Zantac what would YOU do to find out? Type in Zantac on Wikipedia maybe? Look where it takes you: http://en.wikipedia.org/wiki/Zantac. So, Zantac redirects to Ranitidine. Don’t forget the earlier statement about Wikipedia: “Open, re-usable, very highly curated, and the first place that students look. That – or a derivative – is where the world’s chemistry should reside.” Should the same be true for ChemSpider? I think so. But this is a choice we have to make to provide a service to the users. On MeSH a search on Zantac takes you to Ranitidine. On PubChem Zantac takes you to Ranitidine(s). So, association of Zantac with Ranitidine is appropriate BUT there is a need for ontologies, I agree. ChEBI has a good model for this (more later).

Interestingly, a search on Ranitidine on ChemSpider provides the following list of names:

PMR comments: “But the current aggregations of chemicals (Chemspider, eMolecules, Chempedia) are designed for use by machines as well as humans. And unless high-quality metadata is given, along with a structured ontology then machine aggregation of chemistry corrupts rather than enhances. For that reason we are building molecular repositories based on metadata and ontologies. In the current era of the web it’s becoming essential. ”

I look forward to seeing how Zantac and Ranitidine are handled in this new world. If it’s a structured ontology then it sounds like an integration of MeSH with structures? Wikipedia is over 5000 organics now and is the culmination of thousands of hours of work by many dedicated individuals, and it is not error-free. Any other efforts will be prone to similar issues so it’s going to be a major undertaking and I look forward to the results. The ChEBI team are already doing a good job in this area. You can see an ontology Tree View here. So, I’m definitely excited to see what will be better! Exciting times.

PMR comments: “Now, I suggested that the “(Z)” should not have been added to “ranitidine” to indicate the stereochemistry. You can find pages out there with “(E)”. What is the “correct structure”? Or is this a meaningless question?”

In my opinion this is NOT a meaningless question; it is a good question. You saw what the crystal structure showed. Should the name include stereochemistry? If so, when?

Please stay engaged in these discussions with both Peter and me. They are important and meaningful.

Following the announcement by JC Bradley that Drexel University now has an eCrystals Repository I connected with Simon Coles. We’ve exchanged a few emails and have the go-ahead to scrape the eCrystals structures and DOIs from their eCrystals repository in Southampton. We will be doing so over the next few days and adding the data to ChemSpider. Watch out for the new collection as it goes online.

A lot of text-indexing of publishers and journals has been underway over the past few weeks, with permission. The two latest additions are the Journal of Biological Chemistry (added over 122,000 new articles) and the Proceedings of the National Academy of Sciences (added over 50,000 new articles). Now on the Literature Search page you will see a series of checkboxes for you to choose the resources for text-searching (as shown below).

I have been testing the searches based on one of my adopted molecules, paclitaxel, sometimes referred to as Taxol.

Searching on paclitaxel without JBC and PNAS gives a total of 427 articles in 11 seconds.

Searching on Taxol without JBC and PNAS gives a total of 270 articles in 5 seconds.

Searching on paclitaxel with JBC and PNAS gives a total of 745 articles in 26 seconds.

Searching on Taxol with JBC and PNAS gives a total of 1192 articles in 35 seconds.

Clearly adding JBC and PNAS is giving a lot more hits on both names with over a 4x increase for Taxol hits. Clearly the number of hits is highly dependent on the name used to perform the searching. Now, when we integrate the chemical structure searching via linked identifiers this dependency should be dramatically reduced. This work is in development.
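For anyone who wants to check the arithmetic, the short Python sketch below simply divides the hit counts quoted above; the variable names are mine and nothing here queries ChemSpider.

```python
# Hit counts as quoted in the four searches above.
hits = {
    "paclitaxel": {"without": 427, "with": 745},
    "Taxol": {"without": 270, "with": 1192},
}

# Ratio of hits with JBC and PNAS included to hits without them.
ratios = {name: c["with"] / c["without"] for name, c in hits.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# paclitaxel: 1.74x
# Taxol: 4.41x
```

So the increase is a little over 4x for Taxol but under 2x for paclitaxel, which is the name dependence in numbers.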


Identifiers, synonyms, registry numbers and so on are the primary textual manner by which chemical structures are searched on ChemSpider. There are various “flavors” of these. If you take a look at the record for Xanax you will see that just a FEW of the names link to Wikipedia or are labeled as EINECS numbers, Registry Numbers, International Names, Japanese Names, Latin Names, French Names etc.

Alprazolam [Wiki]

Xanax [Wiki]

249-349-2 [EINECS]

28981-97-7 [RN]

4H-(1,2,4​)Triazolo​(4,3-a)(1​,4)benzod​iazepine,​ 8-chloro​-1-methyl​-6-phenyl-

4H-[1,2,4​]Triazolo​[4,3-a][1​,4]benzod​iazepine,​ 8-chloro​-1-methyl​-6-phenyl-

Alplax

Alprazola​m (JP15/U​SP)

Alprazola​m [USAN:B​AN:INN:JA​N]

Alprazola​mum [INN-​Latin]

Since there are so many levels of complexity associated with identifiers we have added new tools to allow our curators to label the names with appropriate labels. The “present” list is shown below. The language tab lists a whole series of languages. One more effort to expand our curation…

I refer you back to the original post from which this comment was made as it is taken from a specific context.

“There is no “right structure (sic)” for a compound. There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.”

Is this a true statement? In many cases I would agree, but I have my own opinion in specific cases, so let’s focus on the drug industry for a moment and trade names. First, let’s talk about me and my identifiers. Depending who’s talking about me I am Tony, Antony, Dr Williams, Mr Williams, Dad, sweetheart, son, Tone, AJ, Bro’ and so on. However I am registered with a social security number and exist as a legal entity, a “registered” entity.

Now, Zantac is a registered trade name for the chemical here. I am not an expert in the registration process but I believe that somewhere along the line a defined chemical entity is associated with that name. Whether the chemical entity has been appropriately elucidated by analytical technologies or not is a different question. What is registered as a compound, and associated with the name, is what that name defines.

Now, there are a whole series of other names for the same compound – registry numbers, systematic names, organization numbers. See below:

Ranitidine [Wiki]

(Z)-N-{2-​[({5-[(Di​methylami​no)methyl​]furan-2-​yl}methyl​)sulfanyl​]ethyl}-N​’-methyl-​2-nitroet​hen-1,1-d​iamin

(Z)-N-{2-​[({5-[(Di​methylami​no)methyl​]furan-2-​yl}methyl​)sulfanyl​]ethyl}-N​’-methyl-​2-nitroet​hene-1,1-​diamine

1,1-Ethen​ediamine,​ N-[2-[[[​5-[(dimet​hylamino)​methyl]-2​-furanyl]​methyl]th​io]ethyl]​-N’-methy​l-2-nitro​-, (Z)-

128345-62​-0 [RN]

266-332-5 [EINECS]

66357-59-3 [RN]

Azantac

GR 122311X

Melfax

N-[2-[[[-​5-[(Dimet​hylamino)​methyl]-2​-furanyl]​methyl]th​io]ethyl]​-N’-methy​l-2-nitro​-1,1-ethe​nediamine

Noctone

Raniben

Ranidil

Raniplex

Ranitidin​e Base

Sostril

Taural

Terposen

Trigger

Ulcex

Ultidine

ZANTAC [Wiki]

Zantic

I think that the Trade Name for a compound is definitive since it’s registered. Relative to the statement “There are structures which have a very high probability of being associated with a name. There are names which have a probability of representing a chemical entity.”…my question is whether a Registered Trade Name is absolute? I’m asking the question since I’m actually not sure. Thoughts anyone?

The primary mission behind ChemSpider is to build a community for chemists. Initially this was to be a structure-centric community for chemists but now we are starting to expand beyond those limitations and you will see more information on that in the next few weeks. In order to build a community for chemists we want to engage our users in discussion and provide an environment for discussion about chemistry, chemicals, reactions and all things Chemistry.

With this in mind we have released the ChemSpider Forums. The ChemSpider Forums is the place for users to connect and form a community. There are other Chemistry forums online already. We hope that this one will provide an additional gathering place for discussing chemistry as well as what we can do to enhance our system and serve your needs. With time we hope to integrate the ChemSpider database with the Forum more closely but we are releasing at this time to initiate discussions.

Please visit the ChemSpider Chemistry Forum and help us build a community for chemists. For right now we have set up a number of separate discussion groups. We welcome your input if you have other suggestions.

Earlier this week we added a new capability to ChemSpider for our users. Using the web service provided via nmrdb.org we embedded the ability to predict an NMR spectrum from any record view on the ChemSpider website. The NMR prediction service is provided by Luc Patiny’s group out of Ecole Polytechnique Fédérale de Lausanne at the Institute of Chemical Sciences and Engineering. Their nmrdb.org webpage offers a series of services, not just NMR prediction and I offer the details below from their website.

NMR Predictor – This page allows to predict the spectrum from the chemical structure

NMR Assigner – Upload and assign NMR spectra on-line. The assignment of NMR spectra may be decomposed in 4 steps:

  1. identification of the signals
  2. integration and multiplicity determination
  3. assignment of each signal to the corresponding atom in the molecule
  4. exportation of the data for publication and/or for database storage

NMR Resurrector - A great amount of NMR information is currently available in the form of scientific publications. However, this information is not readily accessible in the format required for complex searches. The Resurrector enables the user to easily import these in-line spectral descriptions and creates an assigned visual representation that can be seamlessly integrated in the attribution process.

I am an NMR spectroscopist by training and have been involved with NMR either running NMR labs in academia, gov’t labs or Fortune 500 companies for almost a decade, or developing commercial NMR software tools for prediction, processing and structure elucidation. Doing NMR prediction well is not easy. There are multiple approaches and many have been discussed previously on this blog so I won’t belabor that point. However, a set of online free utilities for prediction and assignment offers a new entry into the domain and the ease of integration allows anybody to connect up via their website in just a few minutes.

I haven’t had time to test the system rigorously on complex molecules but simple molecules look fine (based on a test set of about 5 molecules).

We have produced the integration in order to allow crowdsourced testing of the prediction algorithms. Test it out. Provide the authors feedback as well as post your comments here. It’s easy to run…navigate to a record view of interest and look for the RED words “Predict NMR”. We will shortly provide you a way to predict the spectrum for any molecule via the ChemSpider structure input interface; it won’t have to be a part of our database.

I had previously posed the question “How many chemical names are contained in the short paragraph below”? Well, I have highlighted the “chemicals” contained in this paragraph. Click on the link to see what’s what.

“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”

OK, so you saw Aspirin immediately, right? Maybe you could have guessed that Advantage and Commando would be drugs? Some of you might have spotted “he” (helium) and “in” (Indium). But did you expect “of” and “the”?

What was this challenge all about? It explains the need to do a good job in identifying chemical names when hunting for them in articles. With a dictionary of millions of systematic names, trade names, synonyms and database IDs even the most general text is full of chemicals. So, the application of a dictionary of chemical names must be done very carefully. And, the point is that matching the dictionary of names within ChemSpider at present to text contained within scientific articles will fail without the direct identification of chemical names OR identifying trade names etc within an appropriate context.
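To make the problem concrete, here is a minimal Python sketch of naive dictionary matching. It is not the ChemSpider code: the tiny dictionary holds only the words named in this post, and the function name is made up for the illustration.

```python
import re

# Hypothetical mini-dictionary: only the "chemical names" mentioned in this post.
NAME_DICTIONARY = {"aspirin", "advantage", "commando", "he", "in", "of", "the"}


def naive_matches(text, dictionary):
    """Return every word of the text that appears in the dictionary, in order."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    return [word for word in words if word in dictionary]


sentence = "He went home and took an aspirin after the beating."
print(naive_matches(sentence, NAME_DICTIONARY))
# ['he', 'aspirin', 'the']
```

Only one of the three hits is a chemical in this context. A matcher with no sense of context cannot tell them apart, and a dictionary of millions of names makes it far worse.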

There are WAY more complexities than this though. A group at Cambridge has been working on SciBorg since 2005. The project description page outlines the project:

“SciBorg is a four-year project, starting October 1 2005, funded by the EPSRC under the programme for Computer Science for e-Science. The project is a collaboration between three groups at the University of Cambridge:

We are cooperating with three major publishers:

The project summary and objectives are below. For further information, please see the detailed project description , which was based on the project proposal, and the developing SciBorg project wiki pages.”

I have been following the project for a while and am getting much more interested in it right now. It makes for great reading about the challenges of text mining data. Peter Murray-Rust has made a couple of blog posts (1,2) over the weekend relative to the challenges of text mining and I reference you there for a good overview of some of the challenges. They are significant but there are ways to deal with some of the issues.

I’ll blog more about text-mining and names in the next few weeks…

In a post in March of this year Peter Murray-Rust discussed the Issue of CAS Numbers. I believe the outcome of that post, especially as a result of the insightful comments of Steven Bachrach, was that CAS numbers have their place and provide significant value to the community. Since then I have posted on how confused we are about CAS Numbers. Hopefully these discussions are cleaning things up?

Now onto nomenclature, names, synonyms and identifiers OTHER than CAS Numbers. One of the questions Peter asked in his blogpost was “What is the structure of “snow”? This depends on an authority and cannot be answered without also quoting them.”

I think most of us will think of Ice and Snow as forms of water so the answer to the question Peter poses might be some statement around ice-like water. However, ice on ChemSpider is the structure shown here while snow is the structure shown here. Both are street names for drugs.

How common is this situation where “common everyday words” are labels for chemical compounds? Well, let’s see. This is not a trick question! In the short paragraph below a number of chemicals are mentioned. How many? The closest guess will get a “ChemSpider Kudos” (which is just bragging rights). Why is this so important? That will come later….

How many chemicals are mentioned in this paragraph?

“She had the drive to derive success in any venture and was well versed in Karate. When the man in the tartan shirt approached her with a dagger in his hand she spat in his face, took the stance of a commando and took advantage of his shock to release the dagger from his grip and causing him to recoil. He went home and took an aspirin after the beating.”

Post your best guess and WHAT words are chemical names!

Over on my other blog I have recently posted some comments that may be of interest to ChemSpider Blog readers

Spaces, Dashes and Issues with Nomenclature Conversion

It appears that members of the text-mining for chemistry community are using one or more of the commercial name to structure software programs to convert chemical names to structures and, prior to feeding the algorithms, they are removing all white spaces from the names. They are also doing the same, in some cases, with dashes. How well is that going to work? Is it safe to remove spaces from chemical names and assume this has no effect? Is consideration being given more to the accuracy of the text-mining than to the nature of systematic nomenclature?

Let’s look at some examples of the result of removing spaces from chemical names. Consider the different results just from moving a space. READ MORE
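As a small illustration of my own (this pair is not necessarily one of the examples in the linked post), the Python sketch below shows how stripping whitespace collapses two different names into one string. Phenyl acetate is the acetate ester of phenol, while phenylacetate is the anion (or an ester) of phenylacetic acid. The function name is hypothetical.

```python
def strip_spaces(name):
    """Remove all whitespace from a chemical name, as some pipelines reportedly do."""
    return "".join(name.split())


ester = "phenyl acetate"      # acetate ester of phenol
acid_anion = "phenylacetate"  # derived from phenylacetic acid

# After stripping, the two distinct names are indistinguishable.
print(strip_spaces(ester) == strip_spaces(acid_anion))  # True
```

Whatever a name to structure program does with the stripped string, it can return at most one of the two structures.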

Hamburger PDFs and Making Them Structure Searchable

This is just an fyi comment for the community really since there is a general assumption that Word Documents and PDFs cannot be made structure-searchable. The truth is that both can be made structure searchable. How? Well, you need to write the correct information into the file to enable it but it’s possible. There are a number of solutions out there allowing structure-based searching of Word document files. I believe the first one was originally from Oxford Molecular before being acquired by Accelrys. I think there are now several, including, I believe, Cambridgesoft, ACD/Labs and probably others. READ MORE

One blog I check out a few times per week is that of Derek Lowe who writes In the Pipeline. What makes Derek’s blog different, in my opinion, is that he has many years under his belt as a synthetic chemist in pharma companies. He watches what is going on in his industry and makes us aware of his opinions and those of others. He’s still active in the lab and makes us aware of the challenges of lab syntheses, with, I find, some historical perspective regarding what it “was like” then versus now. Always well written and with high feedback, In the Pipeline is likely one of the most frequented blogs out there.

Today I read about Schering-Plough’s thrombin receptor antagonist compound, SCH 530348. I am an SP shareholder and was rather disappointed by the recent news about Vytorin. Let’s hope the retrials provide new results. However, SCH 530348 looks more exciting according to Derek’s comments and talking to other people in the industry.

I searched ChemSpider for the structure since it seemed of interest. SCH 530348 is based on a natural product called himbacine and Derek had linked to it from his blog but, I assumed, he couldn’t find the structure of interest on the database. Neither could I. So, about 5 minutes work and it was on the database, with the link to the ASAP article and tagged etc as shown below. The link is here.

Description

TRA-SCH 530348 is an oral antiplatelet drug under development by Schering-Plough for the treatment and prevention of atherothrombotic events in patients with Acute Coronary Syndrome, previous Myocardial Infarction, stroke, or existing peripheral arterial disease.

Tags

SCH 530348
Schering Plough
TRA-SCH 530348

Links & References

Samuel Chackalamannil, Yuguang Wang, William J. Greenlee, Zhiyong Hu, Yan Xia, Ho-Sam Ahn, George Boykow, Yunsheng Hsieh, Jairam Palamanda, Jacqueline Agans-Fantuzzi, Stan Kurowski, Michael Graziano, and Madhu Chintala . Discovery of a Novel, Orally Active Himbacine-Based Thrombin Receptor Antagonist (SCH 530348) with Potent Antiplatelet Activity, ASAP J. Med. Chem., ASAP Article
A potent series of thrombin receptor (PAR-1) antagonists based on the natural product himbacine is described.

Any user of ChemSpider can do this now. ANYONE. If you are interested in how let me know! I have some documents already prepared to guide you through the process and need to update others but the database can be expanded by everyone now…not quite Wikipedia but not bad at all! What do you think?

Peter Murray-Rust responded to my recent comments about a Free Lunch. There are a number of comments to be made and an exciting opportunity to use Open Data and linking from ChemSpider.

I’d asked the question about how many records there were on CrystalEye. In our world a unique record is a unique InChI, not so on CrystalEye and appropriately so as the crystal structure itself is presumably the unique record. Makes sense.

PMR> We don’t know how many unique structures there are. I’m guessing that there are about 130,000+ entries but that many are duplicates. We (or rather Nick) does a good job on disambiguating by cell dimensions but this is not foolproof and indeed no method is.

What we will do with multiple crystal structures for a single chemical structure is link all unique crystal structures from the unique chemical structure. In this way people can query the chemical structure and find all associated analytical data – spectra and crystallographic files. If we were to list the number of unique depositions on ChemSpider I think we would be around 40 million depositions..an estimate though!
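A rough sketch of that linking idea in Python, assuming each crystal entry arrives as an InChI string plus a URL. The records and example.org URLs below are placeholders I made up, not CrystalEye data, and the function name is hypothetical.

```python
from collections import defaultdict

# Placeholder records: (InChI of the chemical structure, URL of one crystal entry).
crystal_entries = [
    ("InChI=1S/H2O/h1H2", "https://example.org/crystal/1"),
    ("InChI=1S/H2O/h1H2", "https://example.org/crystal/2"),
    ("InChI=1S/CH4/h1H4", "https://example.org/crystal/3"),
]


def link_by_inchi(entries):
    """Group crystal-entry URLs under the unique chemical structure (InChI)."""
    links = defaultdict(list)
    for inchi, url in entries:
        links[inchi].append(url)
    return dict(links)


links = link_by_inchi(crystal_entries)
print(len(links))                       # 2 unique chemical structures
print(len(links["InChI=1S/H2O/h1H2"]))  # 2 crystal entries linked to one of them
```

One chemical record, many crystal structures hanging off it, and the count of unique records stays the count of unique InChIs.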

PMR> The main duplication comes from the Crystallography Open Database which has about 45,000 structures.

I looked at the Crystallography Open Database this morning. It states on the home page “Updated daily: 68268 entries in the COD”. We may have an opportunity with the COD to link up to their data and reduce the need for us to host CIFs. Excellent…we’re all for reducing workload and providing links into other systems. It’s what we do.

PMR> The only thing stopping us putting them (AJW> The structures from CrystalEye)  in Pubchem, or anywhere, is work. We need to make sure that we have data integrity and referential integrity. We’re going to do it, but at present Nick is writing his thesis. We have some limited funding earmarked for this and hope to start it soon. When it’s finished it will be in RDF/CML.

This is great news. This means that after the summer we can download the data directly via PubChem and link up to CrystalEye that way. Perfect. We’ll stop working on integrating to CrystalEye now and wait for the integration path via PubChem and focus on other data sources. Thank you Peter, Nick, Andrew and Jim!!! That said I don’t believe that PubChem will take CML; they will convert using their tools to produce their compatible formats, InChI being one of them. That will break organometallics etc. UNLESS PubChem are going to adopt CML now and that would be an interesting positive shift in terms of a sign of support for the format. A strong positive. I’ll chat with the PubChem team so that if CML is coming we can consider adopting in some way and be ready.

From my post “AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.”

PMR: Chicken and egg… :-) You won’t adopt it until other people adopt it and they won’t adopt it till you do. But we make progress. It’s now mainstream in part of Accelrys software (funded by DTI). It’s being put into compchem codes by the COST project, and it’s really the only choice for datuments (combined data and documents) as in semantic publishing and the results of test-mining.

It’s nice to know that ChemSpider has that type of influence now. It’s good to see it going into Accelrys’ software and I had heard that from Dan’s blog and had added the CML Blog to my reader. I’m definitely watching and willing to follow. We’re busy leading so many other things right now we’ll wait for adoption and then jump on it like a “hobo on a muffin”.

In a recent post about ChemSpider we’ve been accused of wanting a Free Lunch. I copy a segment of the post and comment with insertions.

“Data are normally produced for a particular purpose and the reuse them for another cost money. I’ll exemplify this by taking CrystalEye data – about 120,000 crystal structures and 1 million molecular fragments – which were aggregated, transformed and validated by Nick Day as part of his thesis. (BTW Nick is writing up – it’s a tribute to his work that CrystalEye runs without attention for months on end).

AJW> It is true…it is a tribute to Nick that CrystalEye can run for months without attention. Kudos. I am interested in how much pressure the site is under. How many searches/users in a day etc.? We find that our struggles in uptime (and these are negligible) are primarily based on stress on the servers. For nighttime users tonight things will have been slow…we deposited over 100,000 new molecules from 5 new data sources. That does create some slowness. We will hit about 40,000 transactions today. Our problems are ISP issues and powercuts. But we are also not in a University using thick pipes etc.

One comment…it was 130,000 structures according to a previous blog and has been expanding since then from daily depositions. Right now I would expect it to be 140,000 rather than 120,000. When we did try scraping the data our best estimate was about 90,000. We might have missed something in our scraping and it’s why we asked for a dump of the data.

The primary purpose of CrystalEye was to allow Nick to test the validity of QM calculations in high-throughput mode. It turned out that the collection might be useful so we have posted it as Open Data. To add to its value we have made it browsable by journal and article, searchable by cell dimensions, searchable by chemical substructure and searchable by bond-length. This is a fair range of what the casual visitor might wish to have available. Andrew Walkingshaw has transformed it into RDF and built a SPARQL endpoint with the help of Talis. It has a Jmol applet and 2D diagrams, and links back to the papers. So there is a lot of functionality associated with it.

AJW> The team has done a good job in putting the site together. The JMol applet is an excellent utility for us all to use and thanks to that team for sure! Egon has been challenging us to RDF the site and it’s on our list, but keeps getting pushed down based on other requests. Since he’s the only voice asking it will keep getting pushed down unfortunately.

This has come under some criticism to the effect that we haven’t really made it Openly available. For example Antony Williams(Chemspider blog) writes (Acting as a Community Member to Help Open Access Authors and Publishers):

“This [interaction with MDPI] is contrary to some of my experiences with some other advocates of Open Data and Open Access where trying to get their “Open Data” is like pulling teeth.”

PMR: I assume this relates to CrystalEye – I don’t know of any other case.

AJW> There are other examples and he’s right. He doesn’t know of them and I’d prefer he not rant on my behalf so I’ll not name them.

Antony and I have had several discussions about CrystalEye – basically he would like to import it into his database (which is completely acceptable) but it’s not in the format he wants (multi-entry files in MDL’s SDF format, whereas CrystalEye is in CML and RDF).

AJW> To clarify, again. I DON’T want to import CrystalEye into ChemSpider. I DON’T! All I want is the set of structures and unique associated URLs so that users of ChemSpider can find that there is crystal structure information over on CrystalEye and can click the link and be on CrystalEye and get the benefit of Nick, Andrew and Peter’s work. I don’t want to reproduce their effort. I want to integrate to it. I’ve said it many times on Peter’s blog and on this one.

This type of problem arises everywhere in the data world. For example the problem of converting between map coordinates (especially in 3D) can be enormous. As Rich says, it costs money. There is generally no escape from the cost, but certain approaches such as using standards such as XML and RDF can dramatically lower the costs. Nevertheless there is a cost. Jim Downing made this investment by creating an Atom feed mechanism so that CrystalEye could be systematically downloaded but I don’t think Chemspider has used this.

AJW> If Jim can contact me by email and provide me with detailed instructions to download the entire file of structures ONLY and their associated URLs that would be excellent. I’ll send the request to him tonight.

The real point is that Chemspider wishes to use the data for a different purpose from which it was intended.

AJW> The problem is that stories keep getting made up about what we want. ALL I want to do is drive traffic to CrystalEye so that people who don’t know about it can use it. No more than that. I don’t get how trying to provide an integration path is so difficult. I’ll ask Jim to help.

That’s fine. But as Rich says it costs money. It’s unrealistic to expect we should carry out the conversion for a commercial company for free. We’d be happy to a mutually acceptable business proposition and it could probably be done by hiring a summer student.

AJW> I am interested in what commercial benefit integrating to CrystalEye can have. It’s work on our side. I’m not sure what a mutually acceptable business proposition would look like. It can’t be that much work to send us a set of InChIStrings and URLs for the CrystalEye dataset..they already exist on CrystalEye. So, I’ll assume that this is a last comment on “No thanks to CrystalEye data in ChemSpider”. I have to ask why not put them in PubChem. Since PubChem is held as the standard of OpenData why not put CrystalEye there?

I continue to stress that CrystalEye is completely Open. If you want it enough and can make the investment then all the mechanism are available. There’s a downloader and converters and they are all Open (though it may cost money to integrate them).

AJW> Just fyi ChemSpider has adopted Creative Commons licenses.

FWIW we are continuing to explore the ways in which CrystalEye is made available. We’re being funded by Microsoft as part of the OREChem project and the result of this could represent some of the way in which the Web technology is influencing scientific disciplines. We’d recommend that those interested in mashups and re-use in chemistry took a close look at RDF/SPARQL/CML/ORE as those are going to be standard in other fields.

AJW> It would be good to see CML be a standard. I’ve been following it for a decade and when it gets accepted by a larger majority then we might adopt it.

I have spoken previously about the challenges of Scraping CrystalEye Content and staying in relationship with publishers. I have approached CAS and spoken with the Copyright team at ACS. In December of last year I spoke about the 5 month delay to discuss with ACS about whether or not we could scrape CIF files from ACS journals directly. Well, I had a nice chat with two ACS people in New Orleans, one of them from ACS Pubs. We talked about ChemSpider and I answered a lot of questions about what we were doing, where we were going, how we are “funded” (we are not!) etc. Many pages of notes were taken. At the end of the meeting I asked the question “So, relative to my question about CrystalEye and scraping CIFs. Are Supplementary Data ok to scrape or not?”

The answer? “We haven’t made a decision yet. We need to discuss”.

Are crystal structures really that special? It’s been difficult to get JUST the structures associated with even Open Data. Now I’ve been waiting over 7 months for a question to be answered by ACS…and it’s binary. YES or NO.

At this point I give up. Peter Murray-Rust has had ACS CIFs scraped from their publications for a LONG time. And continues to scrape them. Cambridge University/Unilever School of Informatics didn’t get permission and have been very vocal about what they’ve done and no legal action re. copyright has been taken so I’ll assume it’s not an issue. If it’s not an issue then we can go ahead.

If we can go ahead then why wouldn’t we? We have…we already have scraped the collection of CIFs from ACS, from a broader range of ACS journals than CrystalEye taps into. It’s Supplementary Data, it’s non-copyrightable and now it’s ours to publish. We already support CIF displays on ChemSpider so what we need to do now is to mass convert/handle the data and deposit it onto ChemSpider. We also have the IUCR CIFs to deposit. I guess ChemSpider will soon become “CrystalEye 2” as we host the data. That said we are NOT crystallographers so I have an open request to the community for someone with interest/skills in crystallography to join our advisory group and support this effort. Feel free to ping me.

I am in catch up mode tonight reading a week long backlog of blog posts. I’ve caught up tonight with some of Peter’s posts about semantic chemical authoring (1,2). I’ll respond shortly with comments regarding our own efforts in pulling together the web. I agree with Peter that improved semantic chemical authoring tools are necessary but we are focused right now on doing what we can with what is already available online. What it takes is coding, some regular expressions, some visual inspection and work. Lots of work. More later…

One of the things we are working on is connecting blog posts and wiki pages to ChemSpider as evidenced by our work with Molecule of the Day and our integrations to TotallySynthetic posts on an ongoing basis. What we expect of the authors though is that they author with care. We are generally using name to structure conversion capabilities to generate the chemical structures for connecting to on ChemSpider. Paul Doherty at TotallySynthetic used to provide us with InChIStrings and InChIKeys to connect up to but stopped because it was a lot of work I believe. Molecule of the Day generally discusses fairly simple molecules relative to TotallySynthetic’s COMPLEX molecules. Manual inspection is unfortunately necessary even in the simplest of cases. And it IS time-consuming. Robots will gather information and, in my judgment, PROLIFERATE incorrect data unless someone is going to do the work to inspect OR the system provides a curation platform to quickly remove errors.

I blogged tonight on the ChemConnector blog about the importance of dashes and spaces in systematic names. It should be very clear from that post how important it is. It is a major challenge to use name to structure conversion tools on chemical names that are imperfect and do not represent the structure they are meant to represent. There needs to be respect for chemical names and as we move them from system to system, database to database we need to do our best to retain their integrity. This HAS BEEN a major challenge for us as we scrape data from various data sources OR when people provide us with data files, such as those from Wikipedia, and we need to check name-structure connections. It is not difficult to lose the integrity of a chemical name.
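One way to catch this kind of damage early is to flag names that are identical apart from their spaces and dashes, so a person can inspect them before anything is merged. The snippet below is only a minimal sketch of that idea, not the process we run on ChemSpider; the function names `separator_free_key` and `find_separator_variants` are made up for illustration.

```python
import re
from collections import defaultdict


def separator_free_key(name: str) -> str:
    """Build a comparison key by dropping whitespace and dashes and lowercasing.

    The key is ONLY for spotting suspicious pairs; the original name is never
    replaced by it, because the separators carry chemical meaning.
    """
    return re.sub(r"[\s\-\u2010-\u2015]+", "", name).lower()


def find_separator_variants(names):
    """Return groups of names that differ only in spaces/dashes (or case).

    Each group is a candidate for manual inspection, not for automatic merging.
    """
    groups = defaultdict(set)
    for name in names:
        groups[separator_free_key(name)].add(name.strip())
    return [sorted(group) for group in groups.values() if len(group) > 1]


names = [
    "chloromethyl methyl ether",
    "chloromethylmethylether",
    "Chloromethoxymethane",
    "1,2,3-trichloropropane",
]
print(find_separator_variants(names))
```

The design choice here is deliberate: the check never "fixes" a name, it only reports that two spellings collapse to the same key and leaves the decision to a curator.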

Back to Peter Murray-Rust’s discussions about semantic chemical authoring. Peter is talking about building a site of aggregated information from various websites.

PMR> “We’re in the process of aggregating a repository of common chemicals (somewhere in the range 1000-10000 entries) and we are taking data from various publicly available web sites. Typical sources are Wikipedia, any aggregator with Open Data policies and MSDS sheets (chemical safety information). One such site is INCHEM (Chemical Safety Information from Intergovernmental Organizations) which lists about 1500 materials (most are chemical compounds though some are mixtures).”

Readers of this blog will know we’ve already done this. Both for NIOSH and for the Oxford MSDS set. We took a select subset of information. We integrated this with our Wikipedia set of data on ChemSpider (and, of course, also on WiChempedia).

PMR> “…From this we extract the most important information and turn it into CML – names, formula, connection tables, properties, etc.”

Our process was extraction of the same (but there aren’t any connection tables to grab from NIOSH or Oxford MSDS), then we converted names to structures and ran some “confirmation processes” including visual inspection when necessary.

PMR> “There are a large number of simple but niggly lexical problems, such as the degrees symbol for temperature (totally inconsistent within and between documents) And the semantics – how do you record a boiling point as “between 120 and 130 at 20 mm Hg”? (CML can do this, but it takes work to do the conversion.)”

Oh yes…these are problems. The inconsistencies between records are a pain but can be dealt with by mapping as shown here. Recording a boiling point “between 120 and 130 at 20 mm Hg” is no issue really. See this figure for something just as complex regarding “loss of waters”.
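To make the point concrete, here is a rough sketch of how a free-text boiling point could be mapped onto a structured record with a regular expression. It is an illustration only: it is neither the CML representation Peter mentions nor the mapping we use on ChemSpider, and the `BoilingPoint` fields and the `parse_boiling_point` name are assumptions made up for this example. Note that when the text gives no temperature unit the sketch leaves it empty instead of guessing.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoilingPoint:
    low: float
    high: float
    temp_unit: Optional[str]      # "C", "F" or "K"; None if the text does not say
    pressure: Optional[float]
    pressure_unit: Optional[str]  # whitespace removed, e.g. "mmHg"


_NUM = r"-?\d+(?:\.\d+)?"
_PATTERN = re.compile(
    r"\s*(?i:between\s+)?"
    rf"(?P<low>{_NUM})"
    # optional upper bound: "120 and 130", "120 to 130", "120-130"
    rf"(?:\s*(?i:and|to|-|–)\s*(?P<high>{_NUM}))?"
    # optional degree marker (written inconsistently in real data) plus unit
    r"(?:\s*(?:°|º|(?i:deg\.?))?\s*(?P<unit>[CFK])\b)?"
    # optional pressure clause: "at 20 mm Hg"
    r"(?:\s*(?i:at|@)\s*(?P<pressure>\d+(?:\.\d+)?)\s*"
    r"(?P<punit>(?i:mm\s*Hg|torr|kPa|atm)))?"
    r"\s*"
)


def parse_boiling_point(text: str) -> Optional[BoilingPoint]:
    """Parse a boiling point string; return None if it does not fit the pattern."""
    m = _PATTERN.fullmatch(text)
    if m is None:
        return None
    low = float(m["low"])
    high = float(m["high"]) if m["high"] is not None else low
    pressure = float(m["pressure"]) if m["pressure"] is not None else None
    punit = re.sub(r"\s+", "", m["punit"]) if m["punit"] else None
    return BoilingPoint(low, high, m["unit"], pressure, punit)


print(parse_boiling_point("between 120 and 130 at 20 mm Hg"))
print(parse_boiling_point("138F"))
```

Anything the pattern does not recognise (for example a solubility entry of "Reacts") comes back as `None`, which is the cue to send the record for visual inspection.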

PMR> “And the sites have errors. Here’s a rather subtle one which the average human would miss (we needed a machine to find it). You’ll have to go to the page for chloromethylmethylether – I daren’t try to transcribe it into WordPress. The error is in the displayed page (no need to scroll down).”

There are a couple of issues here. We actually prefer NOT to use either the molecular formula or the molecular weight. In our Wikipedia work we found a lot of errors around these parameters and for the Wikipedia work at least the name, SMILES, InChI etc were more correct while MFs and MW would be wrong.

There may absolutely be value in using both MF and MW to confirm the structure and I definitely see the value. This would definitely help resolve some of the Nomenclature-Structure issues that can arise from converting the names! One of the things that occurred in the blog post was that my earlier comments came to pass regarding removal of a space in the chemical name.

The names on the ORIGINAL INCHEM page were:

CHLOROMETHYL METHYL ETHER, Chloromethoxymethane with a CAS Number of 107-30-2 and an EINECS number of 203-480-1.

There was NOT a name listed as “chloromethylmethylether” which PMR listed in his post. The only difference is dropping one space. It’s only an accidental removal but dramatically changes the meaning of the record. This is where Peter’s use of either MF or MW becomes crucial! That loss of a space CAN cause big problems as described here. Does it cause a problem this time? Check below…look at the name with and without the space and the result of conversion in a commercial Name to Structure software package.
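As an aside, the MF/MW cross-check itself is simple to sketch. The snippet below is a hedged illustration and not Peter's code or ours: it handles only flat formulae without brackets, uses a small table of rounded standard atomic weights, and the function names are invented for the example. The formula C2H5ClO is that of chloromethyl methyl ether; the reference weight 80.51 is just that formula's computed weight rounded and is not quoted from the INCHEM page, and C3H7ClO is a purely hypothetical "wrong" conversion result, not the output of any particular Name to Structure package.

```python
import re

# Rounded standard atomic weights; only the elements needed for this sketch.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "Cl": 35.45}


def parse_formula(formula: str) -> dict:
    """Parse a flat molecular formula such as 'C2H5ClO' into element counts."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def molecular_weight(formula: str) -> float:
    """Sum atomic weights for the parsed formula (KeyError for unknown elements)."""
    return sum(ATOMIC_WEIGHTS[s] * n for s, n in parse_formula(formula).items())


def consistent_with(formula: str, reported_mw: float, tolerance: float = 0.05) -> bool:
    """True if the formula's computed weight agrees with the reported weight."""
    return abs(molecular_weight(formula) - reported_mw) <= tolerance


reported_mw = 80.51                             # illustrative reference value
print(consistent_with("C2H5ClO", reported_mw))  # formula of the intended compound
print(consistent_with("C3H7ClO", reported_mw))  # hypothetical mis-conversion
```

A structure generated from a damaged name that fails this check gets flagged; one that passes still deserves a look, since the MF and MW on the source page can themselves be wrong, as we found with Wikipedia.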

The CORRECT structure is on ChemSpider here and already includes the following Supplemental Information.

User Data

  • experimental physchem properties
    • Boiling Point: 138F

    • Freezing Point: -154F

    • Specific Gravity: 1.06

    • Solubility: Reacts

    • Ionization Potential: 10.25 eV

  • miscellaneous
    • Appearance: Colorless liquid with an irritating odor.

    • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

    • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

    • Symptoms: Irritation eyes, skin, mucous membrane; pulmonary edema, pulmonary congestion, pneumonitis; skin burns, necrosis; cough, wheezing, pulmonary congestion; blood-stained sputum; weight loss; bronchial secretions; [potential occupational carcinogen]

    • Target Organs: Eyes, skin, respiratory system Cancer Site [in animals: skin & lung cancer]

    • Incompatibilities and Reactivities: Water [Note: Reacts with water to form hydrochloric acid & formaldehyde.]

    • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated/Daily Remove: When wet (flammable) Change: Daily Provide: Eyewash, Quick drench

    • Exposure Limits: NIOSH REL : Ca See Appendix A OSHA PEL : [1910.1006] See Appendix B

Peter IS right. We DO Need Semantic Chemical Authoring Tools. However, we’ve already gone a long way without them and what is already online CAN be dealt with. Incredible care is needed with nomenclature, and just spaces can mess things up! I know we have errors in our database – both structures and names. What is to be expected with 20 million structures and associated data? However, we are cleaning them up, rather quickly. We are scraping and integrating data at an increasing rate, having learned a lot of lessons over the past year.

I’ll comment on Peter’s other Semantic Chemical Authoring posts in the next couple of days.

I’ve blogged previously about us adding safety and toxicity data to ChemSpider. We are busily sourcing new information from other data sources to add information and in the past couple of days we have added NIOSH data as it is a rich source of additional safety information. For example, the record for 1,2,3-trichloropropane shows:

  • First Aid: Eye: Irrigate immediately Skin: Soap wash Breathing: Respiratory support Swallow: Medical attention immediately

  • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

  • Symptoms: Irritation eyes, nose, throat; central nervous system depression; in animals: liver, kidney injury; [potential occupational carcinogen]

  • Target Organs: Eyes, skin, respiratory system, central nervous system, liver, kidneys Cancer Site [in animals: forestomach, liver & mammary gland cancer]

  • Incompatibilities and Reactivities: Chemically-active metals, strong caustics & oxidizers

  • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated Remove: When wet or contaminated Change: No recommendation Provide: Eyewash, Quick drench

Some additional examples are here: Temefos, Warfarin and Allyl Alcohol. Note that each of these also has a coincident extract from Wikipedia. We are therefore integrating Wikipedia articles, safety, toxicity, experimental and predicted properties. Our plan for semanticising and integrating the chemistry web is clearly well underway.