I posted previously (1,2) about some of the challenges of text-mining to identify chemicals and chemical names. The example showed how many simple English words can also be chemical identifiers.

Now Peter Murray-Rust is asking a similar question in a recent post. He comments

“Here’s a chunk of text from a thesis we – or rather OSCAR-the-journal-eating-robot – is reading. There are no tricks – it’s exactly as is. I asked them to say how many chemical entities there were in this chunk. (Ideally we should ask for where they start and end, but I just asked for a show of hands for the total count.)

To a solution of crude bisynol 85 (1.05 g, assume 2.34 mmol) in dichloromethane (50 cm3) were added 4 Å molecular sieves (1.17 g), 4-methylmorpholine N-oxide (410 mg, 3.52 mmol) and TPAP (82 mg, 0.23 mmol).  The reaction mixture was stirred at ambient temperature for 1 h.  The crude reaction mixture was filtered through a plug of silica, washed with diethyl ether (100 cm3) and concentrated under reduced pressure.  Gradient flash column chromatography (Petroleum ether:diethyl ether, 100:0 → 95:5) afforded 1-(tert-butyldiphenylsilyloxy)-trideca-5,8-diyn-7-one 86 (750 mg, 32% over two steps) as a yellow oil:

PMR: The audience gave answers varying between 4 and 11. To be fair, some were not scientists, and although they’d had an hour and a half of slides from the others, they were not used to reading this sort of stuff.

So how many do YOU think there are? Just a number between 4 and 11, although you can add comments if you wish.”

I’ll try to post this on PMR’s blog, but my comments no longer show up there. PMR’s WordPress spam filter seems to have taken a dislike to me; I am not alone, though, and they are trying to fix it. So, how many chemical ENTITIES? I’ll take a poke at this.

1) The solution of crude bisynol is a chemical entity

2) Crude Bisynol itself is a chemical entity

3) Dichloromethane is a chemical entity

4) 4 Å molecular sieves is a chemical entity

5) It’s also true that CH2Cl2 over molecular sieves is a chemical entity

6) 4-methylmorpholine N-oxide is a chemical entity

7) TPAP is a chemical entity

8) The reaction mixture is a chemical entity

9) Silica is a chemical entity

10) Petroleum ether is a chemical entity

11) Diethyl ether is a chemical entity

12) Petroleum ether:diethyl ether is a chemical entity

13) 1-(tert-butyldiphenylsilyloxy)-trideca-5,8-diyn-7-one is a chemical entity

Now, I was given a choice between 4 and 11, so what would I choose as the OBVIOUS entities that an entity extractor might convert?

1) Bisynol

2) Dichloromethane

3) 4 Å molecular sieves

4) 4-methylmorpholine N-oxide

5) TPAP

6) Silica

7) Petroleum ether

8) Diethyl ether

9) Petroleum ether:diethyl ether

10) 1-(tert-butyldiphenylsilyloxy)-trideca-5,8-diyn-7-one
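To make the counting problem concrete, here is a minimal sketch of a dictionary lookup run over PMR's paragraph. It is my own illustration and makes no claim about how OSCAR works: the name list `NAMES` is hand-written from the lists above, and the helpers `longest_match_mentions` and `names_present` are hypothetical names invented for this example. The first helper takes the longest match at each position, so a mixture swallows its components; the second simply asks which dictionary names occur anywhere in the text.

```python
import re

# The paragraph quoted above, with whitespace normalised.
TEXT = (
    "To a solution of crude bisynol 85 (1.05 g, assume 2.34 mmol) in "
    "dichloromethane (50 cm3) were added 4 Å molecular sieves (1.17 g), "
    "4-methylmorpholine N-oxide (410 mg, 3.52 mmol) and TPAP (82 mg, "
    "0.23 mmol). The reaction mixture was stirred at ambient temperature "
    "for 1 h. The crude reaction mixture was filtered through a plug of "
    "silica, washed with diethyl ether (100 cm3) and concentrated under "
    "reduced pressure. Gradient flash column chromatography (Petroleum "
    "ether:diethyl ether, 100:0 → 95:5) afforded "
    "1-(tert-butyldiphenylsilyloxy)-trideca-5,8-diyn-7-one 86 (750 mg, "
    "32% over two steps) as a yellow oil:"
)

# Hypothetical hand-made dictionary (an assumption for this sketch).
NAMES = [
    "bisynol",
    "dichloromethane",
    "4 Å molecular sieves",
    "4-methylmorpholine N-oxide",
    "TPAP",
    "silica",
    "petroleum ether",
    "diethyl ether",
    "petroleum ether:diethyl ether",
    "1-(tert-butyldiphenylsilyloxy)-trideca-5,8-diyn-7-one",
]


def _pattern(names):
    # Longest names first, so the alternation prefers the most specific name.
    alternatives = sorted(names, key=len, reverse=True)
    body = "|".join(re.escape(name) for name in alternatives)
    # Lookarounds stop a name from matching inside a longer word.
    return re.compile(r"(?<!\w)(?:" + body + r")(?!\w)", re.IGNORECASE)


def longest_match_mentions(text, names):
    """Non-overlapping mentions; a mixture swallows its components."""
    return [m.group(0) for m in _pattern(names).finditer(text)]


def names_present(text, names):
    """Every dictionary name that occurs anywhere, overlaps allowed."""
    return [name for name in names if _pattern([name]).search(text)]


print(len(longest_match_mentions(TEXT, NAMES)))  # 9
print(len(names_present(TEXT, NAMES)))           # 10
```

Same text, same dictionary, two defensible answers: the count moves from 9 to 10 depending on whether petroleum ether is counted separately from the petroleum ether:diethyl ether mixture it appears in.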

What say you?


One Response to “How Many Chemical Entities in a Paragraph of Text”

  1. Chris Singleton says:

    Here’s my list:

    1) Bisynol
    2) Dichloromethane
    3) 4-methylmorpholine
    4) TPAP
    5) Silica
    6) Ethyl ether
    7) Pet ether:ethyl ether
    8) 1-(tert-butyldiphenylsilyloxy)-trideca-5,8-diyn-7-one

    Now I agree that DCM over molecular sieves is something different, but without an exact definition of ‘entity’ it would be difficult to say it is separate. Plus, since pet ether doesn’t occur by itself (only in solution) in this paragraph, I wouldn’t consider it a separate ‘entity’ for the purpose of this exercise.
    For example, if I write ‘brine’, should I include NaCl, H2O, and brine as individual chemical entities? I wouldn’t; counting all three towards the final count of chemicals seems incorrect to me.
