I am in catch-up mode tonight, reading a week-long backlog of blog posts, and I’ve caught up with some of Peter’s posts about semantic chemical authoring (1,2). I’ll respond shortly with comments regarding our own efforts in pulling together the web. I agree with Peter that improved semantic chemical authoring tools are necessary, but we are focused right now on doing what we can with what is already available online. What it takes is coding, some regular expressions, some visual inspection and work. Lots of work. More later…

One of the things we are working on is connecting blog posts and wiki pages to ChemSpider, as evidenced by our work with Molecule of the Day and our ongoing integrations with TotallySynthetic posts. What we expect of the authors, though, is that they author with care. We generally use name-to-structure conversion to generate the chemical structures that we then connect to on ChemSpider. Paul Doherty at TotallySynthetic used to provide us with InChI Strings and InChIKeys to connect to but stopped, I believe because it was a lot of work. Molecule of the Day generally discusses fairly simple molecules relative to TotallySynthetic’s COMPLEX molecules. Manual inspection is unfortunately necessary even in the simplest of cases. And it IS time-consuming. Robots will gather information and, in my judgment, PROLIFERATE incorrect data unless someone does the work of inspecting it OR the system provides a curation platform to quickly remove errors.

I blogged tonight on the ChemConnector blog about the importance of dashes and spaces in systematic names. It should be very clear from that post how important they are. It is a major challenge to use name-to-structure conversion tools on chemical names that are imperfect and do not represent the structure they are meant to represent. There needs to be respect for chemical names, and as we move them from system to system and database to database we need to do our best to retain their integrity. This HAS BEEN a major challenge for us as we scrape data from various data sources OR when people provide us with data files, such as those from Wikipedia, and we need to check name-structure connections. It is not difficult to lose the integrity of a chemical name.

Back to Peter Murray-Rust’s discussions about semantic chemical authoring. Peter is talking about building a site of aggregated information from various websites.

PMR> “We’re in the process of aggregating a repository of common chemicals (somewhere in the range 1000-10000 entries) and we are taking data from various publicly available web sites. Typical sources are Wikipedia, any aggregator with Open Data policies and MSDS sheets (chemical safety information). One such site is INCHEM (Chemical Safety Information from Intergovernmental Organizations which lists about 1500 materials (most are chemical compounds though some are mixtures).”

Readers of this blog will know we’ve already done this, both for NIOSH and for the Oxford MSDS set. We took a select subset of information and integrated it with our Wikipedia set of data on ChemSpider (and, of course, also on WiChempedia).

PMR> “…From this we extract the most important information and turn it into CML – names, formula, connection tables, properties, etc.”

Our process was to extract the same information (though there aren’t any connection tables to grab from NIOSH or Oxford MSDS). We then converted names to structures and ran some “confirmation processes”, including visual inspection when necessary.

PMR> “There are a large number of simple but niggly lexical problems, such as the degrees symbol for temperature (totally inconsistent within and between documents) And the semantics – how do you record a boiling point as “between 120 and 130 at 20 mm Hg”? (CML can do this, but it takes work to do the conversion.)”

Oh yes…these are problems. The inconsistencies between records are a pain but can be dealt with by mapping as shown here. Recording a boiling point “between 120 and 130 at 20 mm Hg” is no issue really. See this figure for something just as complex regarding “loss of waters”.
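
To make the point concrete, here is a minimal sketch of how a value like “between 120 and 130 at 20 mm Hg” could be mapped into a structured record. It is not our production code. The `BoilingPoint` class, the `parse_boiling_point` function and the handful of degree-sign variants are all hypothetical names and assumptions chosen for illustration, and the pattern only covers the few shapes of string discussed in this post.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BoilingPoint:
    low: float                      # lower bound of the range
    high: float                     # upper bound (equals low for a single value)
    unit: Optional[str]             # "C", "F" or "K" if stated, else None
    pressure_mmhg: Optional[float]  # None when no pressure was stated


# Assumed set of look-alike characters that sources use for the degree sign.
DEGREE_VARIANTS = ("\u00ba", "\u02da")  # masculine ordinal, ring above

_BP = re.compile(
    r"""
    (?:between\s+)?                                     # optional lead-in word
    (?P<low>-?\d+(?:\.\d+)?)                            # first (or only) temperature
    (?:\s*(?:and|to|-)\s*(?P<high>-?\d+(?:\.\d+)?))?    # optional upper bound
    (?:\s*°?\s*(?P<unit>[CFK]))?                        # optional unit, e.g. "°C" or "F"
    (?:\s+at\s+(?P<pressure>\d+(?:\.\d+)?)\s*mm\s*Hg)?  # optional pressure
    """,
    re.IGNORECASE | re.VERBOSE,
)


def parse_boiling_point(text: str) -> Optional[BoilingPoint]:
    """Return a structured record, or None if the text needs a human to look at it."""
    cleaned = text.strip()
    for variant in DEGREE_VARIANTS:
        cleaned = cleaned.replace(variant, "°")
    match = _BP.fullmatch(cleaned)
    if match is None:
        return None
    low = float(match.group("low"))
    high = float(match.group("high")) if match.group("high") else low
    unit = match.group("unit").upper() if match.group("unit") else None
    pressure = float(match.group("pressure")) if match.group("pressure") else None
    return BoilingPoint(low, high, unit, pressure)


print(parse_boiling_point("between 120 and 130 at 20 mm Hg"))
```

Anything the pattern does not recognise comes back as `None` rather than a guess, which is where visual inspection takes over.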

PMR> “And the sites have errors. Here’s a rather subtle one which the average human would miss (we needed a machine to find it). You’ll have to go to the page for chloromethylmethylether – I daren’t try to transcribe it into WordPress. The error is in the displayed page (no need to scroll down).”

There are a couple of issues here. We actually prefer NOT to use either the molecular formula (MF) or the molecular weight (MW). In our Wikipedia work we found a lot of errors around these parameters: the name, SMILES, InChI etc. were more often correct, while the MF and MW would be wrong.

That said, I definitely see the value in using both MF and MW to confirm the structure. This would help resolve some of the nomenclature-structure issues that can arise from converting the names! One of the things that occurred in the blog post was that my earlier comments regarding removal of a space in the chemical name came to pass.
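
As a rough illustration of that kind of cross-check, the sketch below computes a molecular weight from a simple formula and compares it with a reported value. It assumes flat formulae with no brackets, charges or isotopes, uses a deliberately tiny table of rounded atomic masses, and the function names and tolerance are hypothetical. C2H5ClO is the formula for chloromethyl methyl ether.

```python
import re

# Rounded standard atomic masses; a real check would need the full table.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06, "Cl": 35.45}


def formula_to_counts(formula: str) -> dict:
    """Split a flat formula such as 'C2H5ClO' into element counts."""
    if not re.fullmatch(r"(?:[A-Z][a-z]?\d*)+", formula):
        raise ValueError(f"unsupported formula: {formula!r}")
    counts = {}
    for symbol, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(digits) if digits else 1)
    return counts


def molecular_weight(formula: str) -> float:
    """Sum atomic masses; raises KeyError for elements missing from the table."""
    return sum(ATOMIC_MASS[el] * n for el, n in formula_to_counts(formula).items())


def mw_matches(formula: str, reported_mw: float, tolerance: float = 0.05) -> bool:
    """True when the reported MW agrees with the formula within the tolerance."""
    return abs(molecular_weight(formula) - reported_mw) <= tolerance


print(round(molecular_weight("C2H5ClO"), 2))
print(mw_matches("C2H5ClO", 80.51))
```

A mismatch does not say which of the name, formula or weight is wrong, only that the record deserves a second look.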

The names on the ORIGINAL INCHEM page were:

CHLOROMETHYL METHYL ETHER, Chloromethoxymethane with a CAS Number of 107-30-2 and an EINECS number of 203-480-1.

There was NOT a name listed as “chloromethylmethylether” which PMR listed in his post. The only difference is dropping one space. It’s only an accidental removal but dramatically changes the meaning of the record. This is where Peter’s use of either MF or MW becomes crucial! That loss of a space CAN cause big problems as described here. Does it cause a problem this time? Check below…look at the name with and without the space and the result of conversion in a commercial Name to Structure software package.
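
A cheap safeguard when moving names between systems is to flag any name that has changed only in its spaces or hyphens, since that is exactly the kind of edit that slips through unnoticed. The following is a minimal sketch with hypothetical function names. It only detects the change; it does not decide which version of the name is correct.

```python
import re


def collapse_separators(name: str) -> str:
    """Remove all whitespace and hyphens and ignore case."""
    return re.sub(r"[\s\-]+", "", name).casefold()


def separator_only_change(original: str, transferred: str) -> bool:
    """True when two names differ, but only in spaces and/or hyphens."""
    if original.casefold() == transferred.casefold():
        return False  # identical apart from letter case: nothing to flag
    return collapse_separators(original) == collapse_separators(transferred)


# The name on the source page versus the name as it appeared in the post.
print(separator_only_change("CHLOROMETHYL METHYL ETHER", "chloromethylmethylether"))
```

Names flagged this way are good candidates for the MF or MW comparison and, failing that, for a pair of eyes.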

The CORRECT structure is on ChemSpider here and already includes the following Supplemental Information.

User Data

  • experimental physchem properties
    • Boiling Point: 138F

    • Freezing Point: -154F

    • Specific Gravity: 1.06

    • Solubility: Reacts

    • Ionization Potential: 10.25 eV

  • miscellaneous
    • Appearance: Colorless liquid with an irritating odor.

    • First Aid: Eye: Irrigate immediately Skin: Soap wash immediately Breathing: Respiratory support Swallow: Medical attention immediately

    • Exposure Routes: inhalation, skin absorption, ingestion, skin and/or eye contact

    • Symptoms: Irritation eyes, skin, mucous membrane; pulmonary edema, pulmonary congestion, pneumonitis; skin burns, necrosis; cough, wheezing, pulmonary congestion; blood-stained sputum; weight loss; bronchial secretions; [potential occupational carcinogen]

    • Target Organs: Eyes, skin, respiratory system Cancer Site [in animals: skin & lung cancer]

    • Incompatibilities and Reactivities: Water [Note: Reacts with water to form hydrochloric acid & formaldehyde.]

    • Personal protection and Sanitation: Skin: Prevent skin contact Eyes: Prevent eye contact Wash skin: When contaminated/Daily Remove: When wet (flammable) Change: Daily Provide: Eyewash, Quick drench

    • Exposure Limits: NIOSH REL : Ca See Appendix A OSHA PEL : [1910.1006] See Appendix B

Peter IS right. We DO need Semantic Chemical Authoring Tools. However, we’ve already gone a long way without them, and what is already online CAN be dealt with. Incredible care is needed with nomenclature, and just spaces can mess things up! I know we have errors in our database, both structures and names. What is to be expected with 20 million structures and associated data? However, we are cleaning them up rather quickly. We are scraping and integrating data at an increasing rate, having learned a lot of lessons over the past year.

I’ll comment on Peter’s other Semantic Chemical Authoring posts in the next couple of days.


5 Responses to “Care in Nomenclature Handling and Why Visual Inspection Will Remain”

  1. Egon Willighagen says:

    Antony, you make it sound like it is a solved problem. But I do not have the feeling it is. Talking to publishers have learned me that putting semantics in the publishing process is much work. That would argument that most things are not solved yet, at all…

  2. Antony Williams says:

    It is DEFINITELY not a solved problem. As I said above “Peter IS right. We DO Need Semantic Chemical Authoring Tools.” My comment is that even though they have not existed for almost all of the information that has been put online, it is possible to reap a benefit from it when care, time and effort are put into it. This is what Peter’s team and we are up to (and likely you too!). There are many of us going through the struggles at present and dealing with the minutiae to put together high quality information. It takes so much time and effort because it demands visual inspection. Machine-readable formats would dramatically reduce this need. But until they are in place, visual inspection is going to need to be a part of the process. That’s why so much curation work is done at places like CAS and the other database builders. A lot of the curation is now done offshore because of the repetitive nature of the work and the price point per reviewed record.

  3. Rich Apodaca says:

    Wikipedia is the ultimate data curation system, and it works far better than many think it should. In a few short years, this peculiar system has eclipsed all competition.

    How did that happen?

    Understanding the answer to this question may well be the key to building scalable, well-curated, and self-sustaining chemical databases in the future, whether they be created by aggregation or from scratch.

  4. Antony Williams says:

    I DO believe it works very well and it certainly has eclipsed all competition.

    I have my own opinions of why it happened and I think there are three drivers. In order of priority I think these are:

    1) By far the most important – a rather small group of dedicated individuals who are willing to spend their time, their “off hours”, researching, writing, annotating and curating a growing dataset of information for consumption by Wikipedia users. This is done without any compensation, commonly without receiving kudos except within the team and driven by a passion to make a difference. These are special people…in a good way!
    2) MediaWiki is fairly easy to use and it is possible to make edits, updates and leave comments on records thereby ensuring the data is cleaned by oneself or by others. Peer-validation is important in this domain.
    3) Wikipedia NOW has the name as a crowd-sourced encyclopedia, it has brand-recognition, traction and momentum and, I judge, there is more willingness to contribute now that it has persisted. It was not always true…ask the WP’edians about the “early days”.

    I’ll prompt my WP:Chem colleagues for their input too…these are my opinions and I might be way off base…

  5. will says:

    Whatever web authoring tools become/are available, not everyone will use them, or use them properly. The issue of inconsistent formats will never go away.

    If you want to make chemical data searchable, just put in the work of reformatting and gathering the data properly (as happens on Wikipedia) and not hold any publisher/database responsible for this task…. it’s a lot of work.

    and link back too –> dont republish (except factual/with permission) Its nicer ;)
