One of the most untapped, and certainly unsearched, sources of chemical literature on the web is journal articles in image PDF format. I am using ImageMagick and Tesseract to turn the page images into searchable text but, having no experience of 'image indexing', I am discovering how memory-intensive and painfully slow this process is.
A source which would obviously benefit from this is the Acta Chemica Scandinavica archive, put together by the Danish, Swedish, Norwegian and Finnish chemical societies, which has extremely high-quality image PDFs that lend themselves readily to this process. We will then have full-text search for this archive. It will also be interesting to test the quality of the free tools I am using. Could take weeks though!
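For anyone curious, here is a rough Python sketch of the kind of ImageMagick and Tesseract pipeline described above. It is not the exact script I run. It assumes the `convert` and `tesseract` commands are on the PATH (newer ImageMagick releases use `magick` instead, and PDF input needs Ghostscript installed), and the function names, the 300 dpi density and the file naming are purely illustrative. Rasterising one page at a time with the `file.pdf[N]` syntax should help keep memory use down, at the cost of launching the tools once per page.

```python
import subprocess
from pathlib import Path


def convert_cmd(pdf, page, image, density=300):
    """Build an ImageMagick command that rasterises one PDF page.

    "file.pdf[N]" selects page N (zero-based), so only a single page
    is rendered per call. The 300 dpi default is an assumption.
    """
    return ["convert", "-density", str(density), f"{pdf}[{page}]", str(image)]


def tesseract_cmd(image, out_base):
    """Build a Tesseract command; Tesseract adds ".txt" to out_base itself."""
    return ["tesseract", str(image), str(out_base)]


def ocr_page(pdf, page, workdir):
    """OCR one page of an image PDF and return the recognised text."""
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    image = workdir / f"page-{page:04d}.tif"
    out_base = workdir / f"page-{page:04d}"

    # Step 1: PDF page -> TIFF image (hypothetical naming scheme).
    subprocess.run(convert_cmd(pdf, page, image), check=True)
    # Step 2: TIFF image -> plain text file next to it.
    subprocess.run(tesseract_cmd(image, out_base), check=True)

    # The intermediate image is large, so remove it once OCR is done.
    image.unlink()
    return Path(str(out_base) + ".txt").read_text(encoding="utf-8", errors="replace")
```

Calling `ocr_page("article.pdf", 0, "ocr-out")` would then return the text of the first page, ready to be fed into whatever search index is in use.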