Recently, I was introduced to the concept of the “Open Text Mining Interface” or OTMI. Officially, this is an initiative from the Nature Publishing Group, but it aims to enable all scholarly publishers to make their full texts (or as much text as they deem appropriate) available for text mining without risking a loss of control over the human-readable material.
How is this possible? Well, a search engine will query its index based on these criteria (basically):
1) Do the searched terms exist in any articles and, if so, how often?
2) Where they exist, how closely together do they appear in an article?
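To make the two criteria concrete, here is a minimal sketch of how they might be computed over an article's plain text. The function names and the sample sentence are my own illustration, not anything taken from a real search engine or from OTMI.

```python
import re

def tokenize(text):
    # Lower-case the text and split it into simple alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())

def term_frequency(tokens, term):
    # Criterion 1: how often does the term occur in the article?
    return tokens.count(term)

def min_distance(tokens, term_a, term_b):
    # Criterion 2: the smallest gap (in words) between any occurrence
    # of term_a and any occurrence of term_b, or None if either is absent.
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)

# Hypothetical sample article text, for illustration only.
article = ("Text mining needs full text. "
           "Publishers control the text that mining tools can read.")
tokens = tokenize(article)

print(term_frequency(tokens, "text"))          # 3
print(min_distance(tokens, "text", "mining"))  # 1
```

Note that the second function needs the words in their original order, which is exactly the information a publisher may not want to give away.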
It is easy for OTMI to meet the first criterion. The second is slightly more difficult, since it relies on the structure of the article (that is, to some extent, the very features that make it human-readable). However, “snippets” (i.e. actual phrases from the original article) can be included in OTMI files and, if used as fully as possible, could compensate for this.
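As a rough illustration of that trade-off, the sketch below builds a hypothetical record holding only word counts and a set of reordered snippets. This is an assumed structure for the sake of the example and does not reproduce the actual OTMI file format; here I simply treat each sentence as a snippet.

```python
import re
from collections import Counter

def tokenize(text):
    # Lower-case the text and split it into simple alphanumeric words.
    return re.findall(r"[a-z0-9]+", text.lower())

def make_record(article):
    # Hypothetical record: word counts plus sentence-level snippets.
    # Sorting the snippets alphabetically discards their original order,
    # so the article cannot be read straight from the record.
    counts = Counter(tokenize(article))
    snippets = [s.strip() for s in re.split(r"[.!?]+", article) if s.strip()]
    return {"counts": dict(counts), "snippets": sorted(snippets)}

def frequency(record, term):
    # Criterion 1 is answered fully by the word counts.
    return record["counts"].get(term, 0)

def co_occurring_snippets(record, term_a, term_b):
    # Criterion 2 is answered only partially: we can tell that two terms
    # appear close together when they share a snippet, but not otherwise.
    return [s for s in record["snippets"]
            if term_a in tokenize(s) and term_b in tokenize(s)]

# Hypothetical sample article text, for illustration only.
article = ("Text mining needs full text. "
           "Publishers control the text that mining tools can read.")
record = make_record(article)

print(frequency(record, "text"))
print(co_occurring_snippets(record, "mining", "tools"))
print(co_occurring_snippets(record, "full", "publishers"))
```

In this sketch the last query returns nothing, even though "full" and "publishers" are only two words apart in the original text. Proximity across snippet boundaries is lost, which is the kind of gap that fuller use of snippets would need to narrow.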
I think the use of the term “Open” is justified because anyone can index the OTMI files provided (they exist specifically for the purpose of being indexed). OTMI probably can’t totally reproduce the all-round searchability of directly indexing articles (though I am open to being proven wrong on this). Even so, I would like to see publishers who rule out direct spidering take part in this initiative or something like it.
Some example OTMI files (that is, articles transformed into an indexable, but not human-readable, format) can be found here. The rationale of OTMI has also been well thought through and is best expressed here.