One of my intentions for this blog is to give readers information about approaches to text indexing of chemistry articles. For example, we have been asked how we decide which sites to index with ChemRefer. Here goes…

ChemRefer uses spidering technology to index a wide range of articles on many websites. It has spidered sites both according to the well-known robots.txt protocol and where permission has been granted by email. A robots.txt file contains the rules laid down by the webmaster and tells a spider which parts of the site it is not allowed to index.

For example, the contents of a robots.txt file could look like this:

User-agent: *
Disallow: /info/
Disallow: /login/

This would mean that no robot is allowed to index files in the “info” and “login” directories on the website in question (the User-agent: * line applies the rules to all crawlers).
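
For illustration, this is roughly how a spider can honour those rules programmatically. The sketch below uses Python’s standard urllib.robotparser module; the crawler name “ChemRefer” and the example URLs are placeholders for illustration only, not our actual code.

from urllib import robotparser

# Load and parse the site's robots.txt (example URL, for illustration only)
rp = robotparser.RobotFileParser()
rp.set_url("http://www.example.com/robots.txt")
rp.read()

# Ask whether our crawler may fetch a given page before indexing it
if rp.can_fetch("ChemRefer", "http://www.example.com/info/page.html"):
    print("Allowed to index")
else:
    print("Disallowed by robots.txt - skip this page")

With the example file above, a page under /info/ would be skipped, while pages outside the disallowed directories would be indexed as normal.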

We are considering dropping the use of this protocol as the sole justification for indexing, for a reason that will become very clear in my next post (how exciting).

This protocol is also a convention (and a widely accepted one at that) rather than a requirement; many websites do not have a robots.txt file at all, and such sites are arguably within their rights not to recognise the convention.

Since all major search engines (Google and Yahoo, for instance) are happy to use the robots.txt protocol, and ChemRefer utilises similar technology to these search engines (albeit on a much smaller scale), I had felt comfortable with this approach up until now. Even so, we are changing it.

Also, to simplify things and avoid conflict with publishers, we have decided that in future we will email all websites to confirm explicitly whether they are happy to be indexed (which is more polite anyway), in case they do not support the use of the robots.txt protocol by ChemRefer (which IS the uncertainty here) or simply wish to withdraw a permission given many months ago. The ChemRefer service, after all, was not set up to index websites that do not wish to be included, though we WELCOME and are grateful to those who have chosen to take part.


2 Responses to “ChemRefer And Robots.txt”

  1. David Bradley says:

    As I understand it, robots.txt is taken by search engines to be nothing more than a set of suggestions. Several people in the SEO community report that search engines sometimes ignore robots.txt altogether, and some spiders cannot even read it in the first place, let alone respond to it. robots.txt is certainly not a legal document. If you post information on the web you must accept that it will be spidered. The only sure way to stop a spider in its tracks is to ban its IP address from your server. But that opens a minefield: some spiders have multiple IP addresses and some are dynamic, so the address belonging to a spider you wish to ban one day could belong to a legitimate visitor the next.

  2. ChemSpider Blog » Blog Archive » Support from the Publishing Community for Text Indexing Scientific Articles says:

    [...] few weeks ago Will Griffiths blogged about robots.txt files for informing on indexing policies for a website and then later on our discussions with RSC. As a result of our experiences in the [...]
