One of my intentions for this blog is to give readers information about approaches to text indexing in chemistry articles. For example, we have been asked how we decide which sites to index with ChemRefer. Here goes…
ChemRefer uses spidering technology to index a wide range of articles across many websites. It has spidered sites both according to the well-known robots.txt protocol and where permission has been granted by email. The robots.txt file contains rules laid down by the webmaster and tells a spider which parts of the site it is not allowed to index.
For example, the contents of a robots.txt file could look like this:
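A minimal sketch, matching the directories described in the explanation that follows:

```
User-agent: *
Disallow: /info/
Disallow: /login/
```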
This would mean that a robot is not allowed to index files in the directories "info" and "login" on the website in question.
We are considering dropping the use of this protocol as the sole justification for indexing, for a reason that will become very clear in my next post (how exciting).
The robots.txt protocol is also a convention (albeit a widely accepted one) rather than a requirement. Many websites do not have a robots.txt file at all, and such sites are arguably within their rights not to recognise the convention.
Since all major search engines (Google and Yahoo, for instance) honour the robots.txt protocol, and ChemRefer uses similar technology (albeit on a much smaller scale), I had felt comfortable with this approach until now. Nevertheless, we are changing it.
Also, to simplify things and avoid conflict with publishers, we have decided that in future we will email all websites to confirm explicitly whether or not they are happy to be indexed (which is more polite anyway), in case they do not support the use of the robots.txt protocol by ChemRefer (which IS the uncertainty here) or simply wish to withdraw a permission given many months ago. The ChemRefer service, after all, was not set up to index websites that do not wish to be included, though we WELCOME and are grateful to those who have chosen to take part.