Copyright©2008 Antony Williams
I am in catch-up mode tonight, reading a week-long backlog of blog posts. I’ve caught up with some of Peter’s posts about semantic chemical authoring (1,2). I’ll respond shortly with comments regarding our own efforts in pulling together the web. I agree with Peter that improved semantic chemical authoring tools are necessary, but we are focused right now on doing what we can with what is already available online. What it takes is coding, some regular expressions, some visual inspection and work. Lots of work. More later…
One of the things we are working on is connecting blog posts and wiki pages to ChemSpider, as evidenced by our work with Molecule of the Day and our ongoing integrations with TotallySynthetic posts. What we expect of the authors, though, is that they author with care. We are generally using name-to-structure conversion to generate the chemical structures that we then connect to on ChemSpider. Paul Doherty at TotallySynthetic used to provide us with InChI strings and InChIKeys to connect to but stopped, I believe because it was a lot of work. Molecule of the Day generally discusses fairly simple molecules relative to TotallySynthetic’s COMPLEX molecules. Manual inspection is unfortunately necessary even in the simplest of cases. And it IS time-consuming. Robots will gather information and, in my judgment, PROLIFERATE incorrect data unless someone is going to do the work to inspect OR the system provides a curation platform to quickly remove errors.
I blogged tonight on the ChemConnector blog about the importance of dashes and spaces in systematic names. It should be very clear from that post how important they are. It is a major challenge to use name-to-structure conversion tools on chemical names that are imperfect and do not represent the structure they are meant to represent. There needs to be respect for chemical names, and as we move them from system to system and database to database we need to do our best to retain their integrity. This HAS BEEN a major challenge for us as we scrape data from various data sources OR when people provide us with data files, such as the Wikipedia data, and we need to check name-structure connections. It is not difficult to lose the integrity of a chemical name.
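To make the point about spaces and dashes concrete, here is a minimal sketch of the kind of check that can flag names whose integrity may have been lost in transit. It is only an illustration, not the code we run on ChemSpider; the function names are made up for this example. The idea is to tidy scraped whitespace and look-alike dashes without ever deleting a separator, and then flag two names that are identical apart from their spaces and hyphens so that a human can inspect them.

```python
import re

# Dash-like characters that often replace the ASCII hyphen in scraped text.
_DASHES = "\u2010\u2011\u2012\u2013\u2014\u2212"


def clean_name(name: str) -> str:
    """Tidy a scraped name WITHOUT removing any space or hyphen."""
    name = re.sub(f"[{_DASHES}]", "-", name)  # look-alike dashes -> "-"
    name = re.sub(r"\s+", " ", name)          # runs of whitespace -> one space
    return name.strip()


def skeleton(name: str) -> str:
    """Comparison key only: lower-case, spaces and hyphens stripped."""
    return re.sub(r"[\s-]", "", clean_name(name)).lower()


def differs_only_in_separators(a: str, b: str) -> bool:
    """True when two names match except for spaces and/or hyphens."""
    a, b = clean_name(a), clean_name(b)
    return a.lower() != b.lower() and skeleton(a) == skeleton(b)


if __name__ == "__main__":
    # Flags the pair for manual inspection: prints True
    print(differs_only_in_separators("chloromethyl methyl ether",
                                     "chloromethylmethylether"))
```

A flagged pair is not necessarily an error. It is simply a pair where a separator has changed and someone should look before the name goes anywhere near a name-to-structure tool.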
Back to Peter Murray-Rust’s discussions about semantic chemical authoring. Peter is talking about building a site of aggregated information from various websites.
PMR> “We’re in the process of aggregating a repository of common chemicals (somewhere in the range 1000-10000 entries) and we are taking data from various publicly available web sites. Typical sources are Wikipedia, any aggregator with Open Data policies and MSDS sheets (chemical safety information). One such site is INCHEM (Chemical Safety Information from Intergovernmental Organizations which lists about 1500 materials (most are chemical compounds though some are mixtures).”
Readers of this blog will know we’ve already done this. Both for NIOSH and for the Oxford MSDS set. We took a select subset of information. We integrated this with our Wikipedia set of data on ChemSpider (and, of course, also on WiChempedia).
PMR> “…From this we extract the most important information and turn it into CML – names, formula, connection tables, properties, etc.”
Our process was extraction of the same information (though there aren’t any connection tables to grab from NIOSH or Oxford MSDS). We then converted names to structures and ran some “confirmation processes”, including visual inspection when necessary.
PMR> “There are a large number of simple but niggly lexical problems, such as the degrees symbol for temperature (totally inconsistent within and between documents) And the semantics – how do you record a boiling point as “between 120 and 130 at 20 mm Hg”? (CML can do this, but it takes work to do the conversion.)”
Oh yes…these are problems. The inconsistencies between records are a pain but can be dealt with by mapping, as shown here. Recording a boiling point “between 120 and 130 at 20 mm Hg” is no issue really. See this figure for something just as complex regarding “loss of waters”.
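As a rough illustration of what “mapping” can mean here, the sketch below uses a regular expression to turn a free-text boiling point into a small record with a range and an optional pressure. The `BoilingPoint` record and its field names are hypothetical; this is neither CML nor our production code, and it covers only a handful of phrasings. Note that it does not guess a temperature unit when the text does not give one.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoilingPoint:
    low: float
    high: float
    temp_unit: Optional[str]      # "C", "F", "K", or None if not stated
    pressure: Optional[float]
    pressure_unit: Optional[str]


_NUM = r"\d+(?:\.\d+)?"
_BP = re.compile(
    rf"(?P<low>{_NUM})"
    rf"(?:\s*(?:and|to|-)\s*(?P<high>{_NUM}))?"       # optional upper bound
    rf"\s*(?:[°º˚]\s*)?(?P<unit>[CFK]\b)?"            # assorted degree marks
    rf"(?:\s*(?:at|@)\s*(?P<p>{_NUM})\s*(?P<punit>mm\s*Hg|torr|kPa|atm))?",
    re.IGNORECASE,
)


def parse_boiling_point(text: str) -> Optional[BoilingPoint]:
    """Parse the first temperature (or range) found in the text."""
    m = _BP.search(text)
    if m is None:
        return None
    low = float(m["low"])
    high = float(m["high"]) if m["high"] else low  # single value: low == high
    return BoilingPoint(
        low=low,
        high=high,
        temp_unit=m["unit"].upper() if m["unit"] else None,
        pressure=float(m["p"]) if m["p"] else None,
        pressure_unit=" ".join(m["punit"].split()) if m["punit"] else None,
    )


if __name__ == "__main__":
    print(parse_boiling_point("between 120 and 130 at 20 mm Hg"))
```

Anything the pattern does not recognise should be routed to a human rather than silently dropped or guessed at.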
PMR> “And the sites have errors. Here’s a rather subtle one which the average human would miss (we needed a machine to find it). You’ll have to go to the page for chloromethylmethylether – I daren’t try to transcribe it into WordPress. The error is in the displayed page (no need to scroll down).”
There are a couple of issues here. We actually prefer NOT to use either the molecular formula (MF) or the molecular weight (MW). In our Wikipedia work we found a lot of errors in these parameters: the name, SMILES, InChI etc. were more often correct, while the MFs and MWs would be wrong.
That said, there is absolutely value in using both MF and MW to confirm the structure. This would definitely help resolve some of the nomenclature-structure issues that can arise from converting the names! One of the things that occurred in Peter’s post was that my earlier comments about the removal of a space in a chemical name came to pass.
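A minimal sketch of the MF/MW cross-check might look like the following. It is an illustration only, not Peter’s code or ours: the function names, the 0.05 tolerance and the abridged table of atomic weights are choices made for this example, and the parser handles only flat formulae without brackets. The formula used, C2H5ClO, is that of chloromethyl methyl ether.

```python
import re

# Abridged standard atomic weights for the few elements needed here.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "Cl": 35.45}


def formula_weight(formula: str) -> float:
    """Molecular weight of a flat formula such as 'C2H5ClO' (no brackets)."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        if symbol not in ATOMIC_WEIGHT:
            raise ValueError(f"element not in table: {symbol}")
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total


def weight_agrees(formula: str, reported_mw: float, tolerance: float = 0.05) -> bool:
    """True when a reported MW matches the formula within the tolerance."""
    return abs(formula_weight(formula) - reported_mw) <= tolerance


if __name__ == "__main__":
    print(round(formula_weight("C2H5ClO"), 3))  # 80.511
    print(weight_agrees("C2H5ClO", 80.51))      # True
```

The same comparison can be run against the formula of whatever structure a name-to-structure tool returns. A mismatch says that the name, the formula or the weight is wrong, but not which one, so it is a trigger for inspection rather than a correction.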
The names on the ORIGINAL INCHEM page were:
CHLOROMETHYL METHYL ETHER, Chloromethoxymethane with a CAS Number of 107-30-2 and an EINECS number of 203-480-1.
There was NOT a name listed as “chloromethylmethylether” which PMR listed in his post. The only difference is dropping one space. It’s only an accidental removal but dramatically changes the meaning of the record. This is where Peter’s use of either MF or MW becomes crucial! That loss of a space CAN cause big problems as described here. Does it cause a problem this time? Check below…look at the name with and without the space and the result of conversion in a commercial Name to Structure software package.
The CORRECT structure is on ChemSpider here and already includes the following Supplemental Information.
- experimental physchem properties
Appearance: Colorless liquid with an irritating odor.
First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately
Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact
Target Organs: Eyes, skin, respiratory system Cancer Site [in animals: skin & lung cancer]
Incompatibilities and Reactivities: Water [Note: Reacts with water to form hydrochloric acid & formaldehyde.]
Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated/Daily Remove: When wet (flammable) Change: Daily Provide: Eyewash, Quick drench
Exposure Limits: NIOSH REL : Ca See Appendix A OSHA PEL : [1910.1006] See Appendix B
Peter IS right. We DO need Semantic Chemical Authoring Tools. However, we’ve already gone a long way without them, and what is already online CAN be dealt with. Incredible care is needed with nomenclature, and just spaces can mess things up! I know we have errors in our database, both structures and names. What is to be expected with 20 million structures and associated data? However, we are cleaning them up, rather quickly. We are scraping and integrating data at an increasing rate, having learned a lot of lessons over the past year.
I’ll comment on Peter’s other Semantic Chemical Authoring posts in the next couple of days.