Recently, I was introduced to the “Open Text Mining Interface”, or OTMI. Officially it is an initiative from the Nature Publishing Group, but its aim is to enable all scholarly publishers to make their full texts (or as much text as they deem appropriate) available for text mining without risking a loss of control over the human-readable material.

How is this possible? Well, a search engine basically queries its index using two criteria:

1) Do the searched terms exist in any articles and if so how often?

2) Where they exist, how closely together do they appear in an article?

It is easy for OTMI to meet the first criterion. The second is slightly more difficult, since it relies on the structure of the article (that is, to some extent, on those features that make it human-readable). However, “snippets” (i.e. actual phrases from the original article) can be included in OTMI files and, if used as fully as possible, could compensate for this.
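To make the two criteria concrete, here is a minimal Python sketch. It does not read the real OTMI file format; it assumes the snippets have already been extracted into a plain list of strings, and the function names (`term_counts`, `min_distance`, `best_proximity`) and the sample snippets are made up for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_counts(snippets):
    """Criterion 1: how often does each term occur across all snippets?"""
    counts = Counter()
    for snippet in snippets:
        counts.update(tokenize(snippet))
    return counts


def min_distance(snippet, term_a, term_b):
    """Smallest gap (in tokens) between two terms inside one snippet,
    or None if either term is missing from that snippet."""
    tokens = tokenize(snippet)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


def best_proximity(snippets, term_a, term_b):
    """Criterion 2: closest co-occurrence of two terms in any single snippet.
    Snippets are treated as unordered, so distances across snippets are unknown."""
    distances = []
    for snippet in snippets:
        d = min_distance(snippet, term_a, term_b)
        if d is not None:
            distances.append(d)
    return min(distances) if distances else None


# Hypothetical snippets, standing in for phrases taken from one article.
snippets = [
    "the palladium catalyst was added to the solution",
    "the product was isolated as a white solid",
    "the catalyst was recovered by filtration",
]

print(term_counts(snippets)["catalyst"])
print(best_proximity(snippets, "palladium", "catalyst"))
print(best_proximity(snippets, "palladium", "product"))
```

In this sketch, term frequency survives intact, but proximity is only known for terms that share a snippet: "palladium" and "product" never appear in the same snippet, so their distance cannot be recovered. That is why the length and choice of snippets matter so much for the second criterion.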

I think the use of the term “Open” is justified because anyone can index the OTMI files provided (they exist specifically for the purpose of being indexed). Indexing OTMI files probably can’t totally reproduce the all-round searchability of directly indexing articles (though I am open to being proven wrong on this). However, I would like to see publishers who rule out direct spidering take part in this initiative or something like it.

Some example OTMI files (that is, articles transformed into an indexable, but not human-readable, format) can be found here. The rationale of OTMI has also been well thought through and is best expressed here.

  1. Egon Willighagen says:

    I blogged about OTMI in the past [1], and actually think OTMI resources might work great for linking articles with molecular structures.


  2. will says:

    Yes, and hopefully the “context” of the structures (reactant, product, catalyst?) would not be lost: good use of “snippets” could avoid this. We could (in theory) try to ‘generate’ OTMI files for articles as an alternative to conventional search result output (with publisher permission, of course).

