Having blogged on this before, I think it is important to emphasise that you CAN spider PubMed Central. They even have their own utilities designed specifically for the mass downloading of articles, in the form of an OAI feed. What you cannot do is spider the article URLs directly (you must use the XML), because this is forbidden in robots.txt and you will be blocked on this basis.

PubMed Central is one of the most innovative and open chemistry resources on the web with fantastic metadata and article retrieval tool sets designed to facilitate (not prevent) the spread of chemical information at no cost.

HighWire-hosted journal texts are to be indexed and linked back to by ChemSpider, and structure records linking to their content will be deposited here as well. HighWire will be indexed in accordance with their robots.txt file (the conventional web publishing standard for stating indexing permissions).

From the website:

“HighWire-hosted publishers have collectively made 1,873,044 articles free” [and with their partner publishers] “produce 71 of the 200 most-frequently-cited journals.”

We would like to thank them for one of the most phenomenal academic publishing indexing/structure deposition permissions we have received and we expect it will greatly enhance the discoverability of their partner publishers’ works through our free cheminformatics and text search.

The first part of the first build of the Open Chemistry Web project is now available for viewing and testing here.

It lacks the advanced and substructure capabilities at the moment but these are well on the way. Currently, it more closely resembles the old text search over at ChemRefer.com, and we have actually been asked to preserve that metadata format, although there are some changes already implemented.

These include an effort to clarify metadata by standardising citation data to (or as closely as possible to) Journal name, Year, Volume, Issue, and page (all explicitly stated).
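As a rough illustration of what "all explicitly stated" could look like in a search result, here is a minimal PHP sketch. The function name, the array keys and the output layout are my own assumptions for the example, not the format the index actually uses, and the journal title is made up.

```php
<?php
// Hypothetical helper: renders a citation with every field explicitly labelled.
// Field names and output layout are assumptions for illustration only.
function formatCitation(array $c): string
{
    $labels = [
        'journal' => 'Journal',
        'year'    => 'Year',
        'volume'  => 'Volume',
        'issue'   => 'Issue',
        'page'    => 'Page',
    ];
    $parts = [];
    foreach ($labels as $key => $label) {
        // Skip fields the source did not supply rather than inventing them.
        if (isset($c[$key]) && $c[$key] !== '') {
            $parts[] = $label . ': ' . $c[$key];
        }
    }
    return implode(', ', $parts);
}

// Made-up citation data, purely for demonstration.
echo formatCitation([
    'journal' => 'Example Journal of Chemistry',
    'year'    => 2007,
    'volume'  => 12,
    'issue'   => 3,
    'page'    => 45,
]), "\n";
```

Missing fields are simply left out, which fits the "as closely as possible" caveat above.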

The idea behind this is that people will take citations from the primary source (not ChemRefer) so citations in search results should serve only to be as clear and easily readable to the user as possible.

Soon to be implemented (for some publishers) is Digital Object Identifier linking – at the request of the publisher so far. Search engines periodically refresh all their links anyway, so the link permanency issues that apply to databases (which DOI solves) do not apply here, and so there is no policy on this at the moment.

There will be a SIMPLE user interface. One text box, one applet on one page (preferably with very little else). We want to be addictively usable and deliver useful search results quickly. We do not want to build some all-singing-all-dancing and yet overly complex system that no-one without a Masters in cheminformatics will ever be able to decipher.

There are around 150,000 articles on the new index in comparison to ~50,000 in ChemRefer’s index of 12 months ago. Around half are open access (meaning you can download the full work in its entirety for free), and the full text of articles has been indexed to maximise the depth of the search (so even if you cannot access the full text for free, you are still searching the full text).

There is an enormous analytical and life sciences bias at the moment but these are often the most searched for chemical topics on the web due to their scope and importance.

For general interest, ChemRefer differs in structure from ChemSpider in that it is a search engine not a database. That means:

- ChemSpider exists as a website: you can link to it, bookmark it etc. Its purpose is to refer you to useful and curated resources but also to provide information on the ChemSpider.com web resource.

- ChemRefer is just a searchable index. You cannot link to ChemRefer (unless you want to link to constantly changing search result pages). Its purpose is to get you off the website and to the useful primary source. Articles and metadata are spidered but this is dynamic so can hardly be described as curation. Systems have been set up to allow the curation of chemical structures from this raw full text index into ChemSpider in an accurate way but also quickly (luckily Tony Williams is a human Xerox). ChemRefer also now serves not just as a full text indexer, but also to mass harvest chemical data from selected web resources and deliver it to ChemSpider.

So, the robot is often used to deliver data for curation so that it can be processed, and not necessarily (as I initially assumed) just to be fed into the name-to-structure conversion software.

Any and all feedback welcome.

PHP is great. This quote which appears, ironically, on ASP.net explains why in a nutshell:

“I think PHP is great if you don’t wanna spent alot of time and ENERGY to become a web developer and still have some power”

Now, the literature indexing is built partly with PHP for this very reason. I am not a programmer but I want to program and I want to do it quickly because it’s just a means to an end, i.e. building an index. Whether that’s an index of data from articles, CIFs, catalogs etc. is immaterial with PHP. So, it seems to me that for librarians or information professionals wherever, this is a great tool and you don’t have to have any extra money (PHP is free, as is the Apache webserver). Just determination and constant access to this resource: php.net.

Of course, ChemSpider is .NET, and so this can create some difficulties whenever something indexed by me has to be implemented at ChemSpider.com; I rarely have any idea what they are talking about when I hear words like SQL Server and so on. On the whole though, I am increasingly using a combination of free and came-with-my-computer Microsoft tools, e.g. the indexing runs on a WAMP server.

Snippets of the indexing code include just basic commands e.g. this for matching all URLs on a page including “/catalog/”.


preg_match_all("@/catalog/[^\"]+@", $get, $outurls);
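To show that call in context, here is a small self-contained sketch. The HTML in $get is invented for the example (in the real indexer it would be the fetched page), and the de-duplication step is my own addition rather than part of the indexing code.

```php
<?php
// $get stands in for a fetched page; this HTML is invented for the example.
$get = '<a href="/catalog/item1.html">One</a> <a href="/about.html">About</a> '
     . '<a href="/catalog/sub/item2.html">Two</a> <a href="/catalog/item1.html">One again</a>';

// Match "/catalog/" followed by everything up to the next double quote.
preg_match_all('@/catalog/[^"]+@', $get, $outurls);

// $outurls[0] holds the full matches; drop duplicates before queueing them.
$catalogUrls = array_values(array_unique($outurls[0]));

print_r($catalogUrls);
```

Note that the pattern only stops at a double quote, so links written with single-quoted attributes would need a slightly different character class.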


So, what does this mean for chemistry libraries? Well, having someone at your library with this know-how is a must. Indexing and organising literature and more effectively complementing it with data indexed from the WWW can help a library to make up for the fact that it cannot afford all the subscriptions it would like. And, in an era where these once complex, labour intensive and expensive activities are now free and dynamic, even the smallest library can use this to its advantage. You’re not getting any more money in your budget and your subscriptions aren’t getting any cheaper? … well the solution is still free.

I read this post on whether DOI is a good identifier or not. My feeling is that it has the following weaknesses:

It cannot (normally) be generated from citation information (a big disadvantage for an identifier) – you have to resolve them at e.g. CrossRef. This kills it as a way to communicate articles effectively.

If you want to resolve lots of them, you have to pay (there is no real value in this.. except that they have the identifiers and you do not).

It does not replace the URL, it is simply a redirect. This makes it hard to bookmark and those unfamiliar with the system who think they have bookmarked it have in fact bookmarked the URL.

Also, publishers have to pay for it too (though it’s possible they may receive money from CrossRef too). Essentially, all they are paying for is an unintuitive link that does not break provided they keep the redirect up to date.

Hence OpenURL.

It creates a persistent link as DOI does except it actually exists as a webpage (it is not a redirect) and can therefore be bookmarked easily and it CAN be generated from citation information without permissions. Here is a useful implementation.

A note on the CrossRef website caught my eye. It states that OpenURL is not competitive with DOI. This, of course, is nonsense (since it addresses link permanency). Apparently:

“An OpenURL link that contains a DOI is similarly persistent.” [as a link]

Why would an OpenURL pointing to a publisher website not be persistent without a DOI? OpenURL can be created with citation data so it is TOTALLY persistent. With DOI, you need to fill in a form at CrossRef or Doi.org which you do not need to do with OpenURL.
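To make the "created with citation data" point concrete, here is a minimal PHP sketch that assembles an OpenURL-style link from nothing but a citation. The key names (genre, issn, date, volume, issue, spage) are the usual OpenURL 0.1 ones; the resolver address and the citation values are placeholders I made up, and a given publisher's resolver may expect a different set of keys.

```php
<?php
// Builds an OpenURL 0.1-style link from citation data alone.
// $resolver is a placeholder: substitute the publisher's or library's resolver.
function buildOpenUrl(string $resolver, array $citation): string
{
    // Keep only the citation keys this sketch knows about.
    $allowed = ['genre', 'issn', 'date', 'volume', 'issue', 'spage'];
    $query = array_intersect_key($citation, array_flip($allowed));

    return $resolver . '?' . http_build_query($query, '', '&');
}

// Made-up citation, purely for demonstration.
echo buildOpenUrl('https://resolver.example.org/openurl', [
    'genre'  => 'article',
    'issn'   => '1234-5678',
    'date'   => '2007',
    'volume' => '12',
    'issue'  => '3',
    'spage'  => '45',
]), "\n";
```

No lookup against a third-party database is involved at any point; everything in the link comes straight from the citation.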

It is DOIs that need third party ‘resolving’, not URLs and especially not OpenURLs which require no link up to a database (a restricted one in the case of CrossRef) for generation.

So, it is a shame that only a few publishers have taken it up. Surely, it is a competitive advantage to use a totally freely available URL structure that anyone can generate? After all, the worst that could happen is that someone might find your articles more easily.

PDFs are fantastic as a format in many ways. They store the position of their elements (unlike HTML) so allowing easy extraction of metadata (like titles and authors etc) for display in search results. There are a variety of free tools available to convert PDF files to text format and so the perception that Adobe rule the world of PDFs is false.

Most of these tools have simple ways to undo the potential damage caused by the double columned PDF, especially with long chemical names. Another common problem with chemical name extraction from PDF is that you often read this: “5-diphenyl” ….. but end up extracting this: “5diphenyl” …. not fantastic (although whether an Adobe tool would produce a better result I don’t know), but easily solvable with things like regex.

So for PDFs, I find these three free things surprisingly useful:

1) PDFtoText (for extraction) 

2) PHP (to generate output)

3) Regex (for name matching/repairing)
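As an example of the repair step, here is a hedged PHP sketch for the “5diphenyl” problem mentioned above. It assumes the text has already been extracted from the PDF, and the list of substituent names is a tiny illustrative sample that I picked, not a real chemical vocabulary; the function name is hypothetical.

```php
<?php
// Re-inserts the hyphen that text extraction can drop between a locant
// digit and a substituent name, e.g. "5diphenyl" -> "5-diphenyl".
// The substituent list is a small illustrative sample, not a full vocabulary.
function repairLocantHyphens(string $text): string
{
    // Zero-width match: just after a digit and just before an (optionally
    // multiplied) substituent name, so only the hyphen is inserted.
    $pattern = '/(?<=\d)(?=(?:di|tri|tetra)?'
             . '(?:methyl|ethyl|phenyl|chloro|bromo|hydroxy|amino|nitro))/i';

    return preg_replace($pattern, '-', $text);
}

echo repairLocantHyphens('2,5diphenyl'), "\n";  // prints 2,5-diphenyl
echo repairLocantHyphens('2,5-diphenyl'), "\n"; // already correct, unchanged
```

Names that already carry the hyphen are left alone, so the repair can be run over the whole extracted text without checking first.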

For those interested, here’s some stats from indexing PMC’s OAI subset:

  • 3,114,818 unique terms
  • 58,807 articles

How to get stuff with PMC’s OAI service.

To get ONE article:


Where the number at the end is the article’s PubMed Central ID (PMCID).

To get lots of articles:


Where metadataPrefix=pmc means harvest the full text (change that to oai_dc to just get the metadata) and the ‘from’ and ‘until’ parameters restrict by the articles’ date. Restrict the set of articles to open access with the set=pmc-open parameter.
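For anyone scripting this, here is a small PHP sketch that puts those parameters together. The verbs and argument names (GetRecord, ListRecords, identifier, metadataPrefix, set, from, until) are standard OAI-PMH; the base URL is a placeholder and the identifier prefix is my assumption, so check both against PMC's current documentation before using them. The article number and dates are made up.

```php
<?php
// ASSUMPTION: placeholder base URL; use the endpoint PMC currently documents.
const OAI_BASE = 'https://oai.example.org/oai.cgi';

// Assemble an OAI-PMH request; empty arguments are left out of the query.
function oaiUrl(string $verb, array $args = []): string
{
    $args = array_filter($args, fn($v) => $v !== null && $v !== '');
    $query = ['verb' => $verb] + $args;

    return OAI_BASE . '?' . http_build_query($query, '', '&');
}

// One article. The identifier prefix is an assumption and 12345 is made up.
echo oaiUrl('GetRecord', [
    'identifier'     => 'oai:pubmedcentral.nih.gov:12345',
    'metadataPrefix' => 'pmc',
]), "\n";

// Lots of articles: open access set, full text, restricted by date.
echo oaiUrl('ListRecords', [
    'metadataPrefix' => 'pmc',
    'set'            => 'pmc-open',
    'from'           => '2007-01-01',
    'until'          => '2007-01-31',
]), "\n";
```

Swapping 'pmc' for 'oai_dc' in either call asks for the metadata only.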

Or you can leave off the ‘from’ and ‘until’ arguments and then OAI will serve you the first subset of matching articles and give a ‘resumptionToken’ for you to harvest the rest afterwards. I won’t provide an example here (because these tokens expire) but the basic construction is:


Happy harvesting. Don’t hack them off by making more than 1 request per 3 secs or more than 100 requests during their peak hours.
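Putting the resumptionToken and the politeness rule together, a harvesting loop might look roughly like the PHP sketch below. The function name and the $fetch callable are my own inventions (plug in whatever HTTP client you use), the token is pulled out with a simple regex that assumes an unprefixed resumptionToken element, and the 100-requests-at-peak-hours limit is not handled here.

```php
<?php
// Follows resumptionTokens until the server stops returning one.
// $fetch is any callable that takes a URL and returns the response XML,
// so a real HTTP client (or a stub for testing) can be plugged in.
function harvestAll(string $base, string $firstQuery, callable $fetch, int $pauseSeconds = 3): array
{
    $pages = [];
    $url = $base . '?' . $firstQuery;

    while ($url !== null) {
        $xml = $fetch($url);
        $pages[] = $xml;

        // Simple extraction: assumes an unprefixed <resumptionToken> element.
        $token = '';
        if (preg_match('~<resumptionToken[^>]*>([^<]+)</resumptionToken>~', $xml, $m)) {
            $token = trim(html_entity_decode($m[1], ENT_QUOTES | ENT_XML1));
        }

        if ($token === '') {
            $url = null; // an absent or empty token marks the last page
        } else {
            // A resumed request carries only the verb and the token.
            $url = $base . '?verb=ListRecords&resumptionToken=' . rawurlencode($token);
            sleep($pauseSeconds); // stay under one request per 3 seconds
        }
    }

    return $pages;
}
```

Each returned page is the raw XML of one response, ready to be handed to whatever does the indexing.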

PMC’s OAI subset is not the largest single source we have indexed (IUCr: ~83,000).

So far in terms of indexing, these are complete:

Hindawi, Electrochemical Science Group, Repositorium (Universidade do Minho Eprints), Medknow, MDPI

The next few to go:

ACBI, IUCr, PubMed Central*, PubMed**, Bentham***, Nature****

* Full text indexing for Open Access list only; ** Bibliographic data and chemical names only; *** Bibliographic data only except Bentham Open (Full Text Indexing); **** OTMI

 All sources are full text indexed unless otherwise stated.

The open access debate has recently focused on the following aspects specifically:

- redistribution of materials

- the scholarly definition of the term “open access” (assuming one is attributable)

- who has the copyright (publisher or author)

- use of terms such as ‘open’ in marketing of services

Now I’m sat here (without access by the way) and I am thinking that these are the concerns of people who ALREADY have access to the articles that they need. Do I ask myself whether I can redistribute an article, or if the use of the term “open” is justified, when I have a page with an abstract and a “buy now” button beneath it? I do not. This is because I cannot get to the thing in the first place. So, all I care about is free full text literature -> the real open access serving the needs of those who cannot get to the literature rather than the people with the Athens password who worry about redistribution and pedantic definitions of terms that were only ever meant to describe a general concept.

So, far from blow torching publishers for providing free full text access services – a bizarre phenomenon in the chemistry blogosphere but a common one – because a copyright policy is not quite to my liking (<violins>) we should applaud the ACS and Nature Sample Issues, the RSC Free Access and the Springer Open Choice (and many more to mention besides) alongside the more generous BioMedCentral’s and Hindawi’s for giving us a chance to get a look at their materials.

Springer has taken over the Indian Journal of Clinical Biochemistry (0970-1915), currently published by the Association of Clinical Biochemists of India, and will start publishing it as of volume 23, issue 1 (Jan. 08). The journal is currently covered in ChemRefer and hopefully we can continue to full text index this journal.

The signs are reasonably good so far, since a Springer Open Access journal, Nanoscale Research Letters (1556-276X), has asked to be indexed in the upcoming release.

The Zhurnal Organicheskoi Khimii (Russian Journal of Organic Chemistry) – another with a Springer connection – has also requested to be indexed, so it looks like it’s all down to what Springer’s World Wide Web of publishing decides to make of our project: Open and Integrated Structure and Text Search of the chemical web.