There are a great many elaborate interfaces for searching databases, repositories and search indexes... that is, for anyone with enough determination to figure out how to use them. Google, of all search engines, surely has the greatest expertise in developing search interfaces, and yet they emphasise, more than anyone else, the power of the ‘single box’ search interface. It is as unscholarly as it gets: totally undefined and uncontrolled. It is an “anything/everything” search; what you type in can be anything, in any category, in any context. I think the joint text/substructure interface should be made with this in mind. By introducing too much “organisation” in the form of, e.g., categories or searches for specific data types, we could lose the dynamic nature that is embodied in the full text search index.

That is not to say there won’t be LOTS of advanced search features. There will; it is just that my feeling is they should be built to filter results from the full text index, as opposed to organising/standardising (i.e. losing) data from the index prior to releasing it for searching. Standardised, defined and organised data in boxes looks wonderful and scholarly, but chaotic full text search indexes are more useful.
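To make the "filter afterwards" idea concrete, here is a minimal sketch in Python (used purely for illustration). It runs a single-box full text query first and only then narrows the hits by field. The `Hit` record, the function names and the two sample records are all invented for this example; this is not how the real index is built.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    # Hypothetical record layout, assumed for illustration only.
    title: str
    authors: str
    journal: str
    year: int
    full_text: str


def full_text_search(index, query):
    """Single-box search: keep records whose text contains every query word."""
    words = query.lower().split()
    return [h for h in index if all(w in h.full_text.lower() for w in words)]


def apply_filters(hits, **filters):
    """Narrow an existing result set; string filters match as substrings."""
    def matches(hit):
        for field, wanted in filters.items():
            value = getattr(hit, field)
            if isinstance(wanted, str):
                if wanted.lower() not in str(value).lower():
                    return False
            elif value != wanted:
                return False
        return True
    return [h for h in hits if matches(h)]


# Invented sample records, not real index entries.
INDEX = [
    Hit("Example article A", "A. Author", "Journal X", 2006, "hydrazone crystal structure"),
    Hit("Example article B", "B. Writer", "Journal Y", 2007, "hydrazone synthesis route"),
    Hit("Example article C", "A. Author", "Journal X", 2007, "zeolite catalysis"),
]

hits = full_text_search(INDEX, "hydrazone")
narrowed = apply_filters(hits, year=2007)
```

Nothing is thrown away before the search: the filters only ever see what the full text query has already returned.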

Building further on what information people might like to see in search results, we currently have a ’summary’ feature planned which will allow people to view, amongst other things:

A list of the number of articles per journal (in that set of search results for the entered keyword(s))

A simple bar-chart presentation of the number of articles matching the search query, by the year the research was published (in that set of search results for the entered keyword(s)).

The idea is that users can easily build a snapshot of where to find research (in terms of which journals) and when that research activity peaked. I’m also working on some analogous features which will perform similar functions but for author names.
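A rough sketch of how such a summary could be computed from a set of results is below, in Python for illustration. The dictionary keys `journal` and `year`, the function names and the sample results are assumptions made for the example, not the real data model.

```python
from collections import Counter


def summarise(results):
    """Count articles per journal and per publication year."""
    by_journal = Counter(r["journal"] for r in results)
    by_year = Counter(r["year"] for r in results)
    return by_journal, by_year


def bar_chart(by_year):
    """Render a plain-text bar chart, one row per year in ascending order."""
    return "\n".join(f"{year} {'#' * by_year[year]}" for year in sorted(by_year))


# Invented sample results for one keyword search.
results = [
    {"journal": "Journal X", "year": 2005},
    {"journal": "Journal X", "year": 2006},
    {"journal": "Journal Y", "year": 2006},
]

by_journal, by_year = summarise(results)
chart = bar_chart(by_year)
```

The same counting would work for author names by swapping the key that is counted.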

It is all PHP based, not for any good reason; I just can’t speak any other language to a computer. I find the manipulation of PDFs a particular challenge, but there is a nice program, “pdftotext”, to be found here, which will solve all your PDF headaches, at least as far as text indexing is concerned.
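For anyone who wants to try pdftotext from a script, here is a small sketch. It is in Python rather than PHP purely for illustration, and it assumes the pdftotext binary is installed and on the PATH. The helper names are made up; passing "-" as the output file tells pdftotext to write the text to standard output.

```python
import subprocess


def pdftotext_command(pdf_path):
    """Build the command line; '-' sends the extracted text to standard output."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_text(pdf_path):
    """Run pdftotext and return the extracted text (raises if the tool fails)."""
    result = subprocess.run(pdftotext_command(pdf_path), capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can then be handed to whatever indexer is in use.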

Any other feature suggestions or other comments welcome.

As the title of this post suggests, we are currently tackling the problem of developing a feature which will provide users with a list of the articles that have cited a paper of interest. This is either not too hard or impossible, depending on how you go about it. There are different ways to cite a paper: put the journal name first (abbreviated or not, or abbreviated in a different way), put the year first, include the issue number, leave the issue number out, and so on. So the search engine fell in on itself the first few times I tried this, but now it is not too bad. ChemRefer also managed to crash ChemSpider last weekend (oops). Anyway, below is the embryonic version of this feature, which will form part of a “detailed view” for displaying article metadata more clearly (only the article titles and the publisher name have hyperlinks in the example; btw the end presentation will probably be very different from this, but the metadata will be similar). Note that the current index wasn’t designed with this kind of feature in mind, so there are likely to be more citations of this article than are shown (I chose this example because both articles are Open Access, so anyone can analyse these citations):

Acetone 3-nitrophenylhydrazone, redetermined at 120 K: sheets built from N-H…O, C-H…O and C-H…N hydrogen bonds

S. M. S. V. Wardell, M. V. N. de Souza, J. L. Wardell, J. N. Low and C. Glidewell
Search Authors

Citation Information
Acta Cryst. Section E 2006;62(7):o2838-o2840

Publisher ©
International Union of Crystallography

Article URL

Find More
Search For Similar Articles

Citations Found: 1

Acetone 2-nitrophenylhydrazone (Acta Cryst. Section E 2007;63(2):o970-o971) – International Union of Crystallography

S. M. S. V. Wardell, J. L. Wardell, J. N. Low and C. Glidewell

2006). Acta Cryst. E62, o2838 – o2840. o971 Acta Cryst. ( Brazil, hydrazone, (II) (Wardell et al., 2006). c M. S. V., de Souza, M. V. N.,
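One way to cope with the different citation styles is to reduce every citation to the same small key (journal, year, volume, first page) and compare keys instead of raw strings. The sketch below is in Python for illustration only and is not the code behind the feature shown above. The alias table, the regular expressions and the function name are all assumptions, and the heuristics would need far more care for real-world citations (for instance, a page number that happens to look like a year).

```python
import re

# Hypothetical alias table mapping journal-name variants onto one spelling.
JOURNAL_ALIASES = {
    "acta cryst section e": "acta cryst e",
    "acta crystallographica section e": "acta cryst e",
}


def citation_key(citation):
    """Reduce a citation string to (journal, year, volume, first page), or None."""
    text = citation.lower()
    year_match = re.search(r"\b(?:19|20)\d{2}\b", text)
    page_match = re.search(r"\b([a-z]?\d+)\s*[-–]\s*[a-z]?\d+\b", text)
    if not year_match or not page_match:
        return None
    # Blank out the year and the page range (rightmost first, so offsets stay valid).
    for m in sorted((year_match, page_match), key=lambda m: m.start(), reverse=True):
        text = text[:m.start()] + " " + text[m.end():]
    # What remains is journal words followed by the volume (and maybe the issue).
    journal_words, volume = [], None
    for token in re.findall(r"[a-z]+|\d+", text):
        if token.isdigit():
            volume = token
            break
        journal_words.append(token)
    journal = " ".join(journal_words)
    journal = JOURNAL_ALIASES.get(journal, journal)
    return (journal, int(year_match.group()), volume, page_match.group(1))


key_a = citation_key("Acta Cryst. Section E 2006;62(7):o2838-o2840")
key_b = citation_key("(2006). Acta Cryst. E62, o2838 – o2840.")
```

Here the metadata-style citation and the reference-list-style citation from the example above produce the same key, although one puts the year first and only one includes the issue number.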

OK, we are now in the process of actually putting the new index together.

We are planning on these new advanced search features which will allow people to restrict their search by:

- Article title

- Authors

- Journal name; Volume; Issue; Page

- Provider

- Text in DOI

- Text in URL

- Full Text

- (Sub)structure << Java/Applet thing (ask the cheminformatics people about that)

We plan to retain title, author, citation and provider information in search result metadata. We also want to introduce a mouseover “detailed view” containing the above information but with additional search options such as:

- More articles from these authors

- Direct link to publisher homepage

- Search for similar articles (based on title keywords)

- Search for similar structures

Anything else you want to see? Comments welcome.

Recently, I was introduced to the concept of the “Open Text Mining Interface” or OTMI. Officially, this is an initiative from the Nature Publishing Group but its aims are to enable all scholarly publishers to make their full texts (or as much text as is deemed appropriate) available for text mining without risking a loss of control over the human-readable material.

How is this possible? Well, a search engine will query its index based on these criteria (basically):

1) Do the searched terms exist in any articles and if so how often?

2) Where they exist, how closely together do they appear in an article?

It is easy for OTMI to meet the first criterion. The second is slightly more difficult, since it relies on the structure of the article (that is, to some extent, those features that make it human-readable), but “snippets” (i.e. actual phrases from the original article) can be included in OTMI files which, if utilised as fully as possible, could compensate for this.
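The difference between the two criteria can be seen in a few lines of Python (illustration only; this says nothing about the actual OTMI file format). Counting how often a term occurs needs only a bag of words, whereas measuring how close two terms are needs the words in their original order, which is what snippets partly preserve. The function names and the sample sentence are made up for the example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def term_frequency(tokens, term):
    """Criterion 1: how often a term occurs; word order is irrelevant."""
    return Counter(tokens)[term]


def min_distance(tokens, term_a, term_b):
    """Criterion 2: smallest gap in word positions between two terms, or None."""
    positions_a = [i for i, t in enumerate(tokens) if t == term_a]
    positions_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not positions_a or not positions_b:
        return None
    return min(abs(a - b) for a in positions_a for b in positions_b)


# Invented sample text standing in for an article or a snippet.
tokens = tokenize("Acetone hydrazone crystals form sheets. "
                  "The sheets are built from hydrogen bonds.")
frequency = term_frequency(tokens, "sheets")
distance = min_distance(tokens, "sheets", "hydrogen")
```

If the tokens were shuffled, `term_frequency` would return the same number but `min_distance` would not, which is the gap the snippets are there to fill.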

I think the use of the term “Open” is justified because anyone can index the OTMI files provided (they exist specifically for the purpose of being indexed). It probably can’t totally reproduce the all-round searchability of directly indexing articles (though I am open to being proven wrong on this). However, I would like to see publishers who rule out direct spidering take part in this initiative or something like it.

Some example OTMI files (that is, articles transformed into an indexable, but not human-readable, format) can be found here. The rationale of OTMI has also been well thought through and is best expressed here.

The last few days we have been exploring:

• Which publishers want to be part of our first index to support substructure and text queries in the same interface.
• What the most effective methods of indexing their publications are.
• What the possibilities are in terms of providing reports on usage/input search terms and ‘third-party’ search interfaces on publisher websites.

All these ideas were put forward by publishers before we had the chance to suggest them or think of them ourselves. We were made conscious of the fact that we needed to consult directly with publisher organisations last week and, having started afresh, this approach has attracted positive responses within a few days of implementation.

Ideally, the interests of authors and of our users should be at the forefront, since that is what is important, and that is what we are determined this search engine will achieve when it is launched.

The Hindawi Publishing Corporation, the International Union of Crystallography and Nature Publishing Group are three of the participating organisations consenting to be mentioned in this particular post.

We have been requested to remove all RSC articles from the ChemRefer Index.
The articles in question, from 1997-2004, are marked as ‘Free access’ and, being indexable according to the robots.txt file, formed the basis of the current indexing. The RSC are unhappy at the way their articles have been presented and linked to in our search results, and consider that the additional intended reuse of the indexed information in ChemSpider without permission violates the terms of use.

RSC will reconsider the indexing policy for ChemRefer if requested changes are made to the search results, and we are presently in discussions with the RSC to identify and make these modifications. All RSC articles will be de-indexed from ChemRefer during the next indexing cycle.

One of my intentions on this blog is to help readers with information about approaches to text indexing of Chemistry articles. For example, we have been asked how we decide which sites to index with ChemRefer. Here goes…

ChemRefer uses spidering technology to index a wide range of articles on many websites. It has spidered websites that permit indexing under the well-known robots.txt protocol, as well as those that have granted permission by email. The robots.txt file contains the rules laid down by the webmaster and lets us know where a spider is not allowed to index.

e.g. the contents of a robots.txt file could look like this:

User-agent: *

Disallow: /info/

Disallow: /login/

This would mean that no robot (the “*” covers all of them) is allowed to index files in the directories “info” and “login” on the website in question.
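For the curious, Python's standard library ships a parser for this protocol, so a spider can check a URL before fetching it. The sketch below is for illustration only and is not the ChemRefer spider. The `allowed` helper, the "ExampleSpider" user agent and the example.org URLs are made up. Note that the rules are given with a `User-agent: *` line, which the parser needs in order to know which robots the Disallow lines apply to.

```python
from urllib.robotparser import RobotFileParser

# The example rules from above, applied to every robot.
ROBOTS_TXT = """\
User-agent: *
Disallow: /info/
Disallow: /login/
"""


def allowed(robots_txt, url, user_agent="ExampleSpider"):
    """Return True if the rules permit this user agent to fetch the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


blocked = allowed(ROBOTS_TXT, "http://example.org/info/about.html")
permitted = allowed(ROBOTS_TXT, "http://example.org/articles/1.html")
```

A check like this only says what the file permits; it cannot tell you whether the site owner is actually happy to be indexed.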

We are considering dropping the use of this protocol as a sole justification for indexing. This is for a reason that will become very clear in my next post (how exciting).

This protocol is also a convention (and a widely accepted one at that) as opposed to a requirement, and many websites do not have a robots.txt file at all, so they are likely to be within their rights not to recognise the convention.

Since all major search engines (including Google and Yahoo, for instance) are happy to use the robots.txt protocol, and ChemRefer utilises similar technology to these search engines (albeit on a much smaller scale), I had felt comfortable with this approach up until now. Even so, we are changing it.

To simplify things and avoid conflict with publishers, we have decided in future to email all websites to confirm explicitly that they are happy (or not) to be indexed, which is more polite anyway. This covers the case where they do not support the use of the robots.txt protocol by ChemRefer (which IS the uncertainty here) or simply wish to withdraw a permission given many months ago. The ChemRefer service, after all, was not set up to index websites that do not wish to be included, though we WELCOME and are grateful to those who have chosen to be.

As the latest addition to ChemSpider’s services, ChemRefer specialises in text indexing and is now focused (and soon to be integrated with the main ChemSpider search) on providing access to chemistry-related information and building a structure-centric community for chemists. I originally created the ChemRefer service to give chemists a search engine for text-based searches of freely accessible chemistry articles. When I saw what ChemSpider was trying to achieve, I joined their advisory group to assist their efforts. With time it was clear that a closer relationship would benefit both parties. Now ChemRefer and ChemSpider are merged, and we have an opportunity to produce a FREE search engine which will allow users to input structural and textual queries into one search interface. Any ideas, comments on any sources you would like us to index, or any features you would wish this service to have are most welcome.

This blog is a parallel blog to the ChemSpider Blog and ChemSpider News so that we can discuss the ins and outs of text indexing of the chemistry literature. At a time when there is a great deal of openly available literature and data in this arena, it is time there was an openly available service with the cheminformatics and text indexing capabilities to search this effectively. We want to play a role in making that happen. We look forward to dialoguing with you. Please add Open Chemistry Web to your Blog Reader…