Archive for January, 2008

I read this post on whether DOI is a good identifier or not. My feeling is that it has the following weaknesses:

It cannot (normally) be generated from citation information, which is a big disadvantage for an identifier: you have to look it up at, e.g., CrossRef. This kills it as a way to communicate articles effectively.

If you want to resolve lots of them, you have to pay (there is no real value in this, except that they have the identifiers and you do not).

It does not replace the URL; it is simply a redirect. This makes it hard to bookmark, and those unfamiliar with the system who think they have bookmarked the DOI have in fact bookmarked the URL it redirected to.

Also, publishers have to pay for it too (though it's possible they may receive money from CrossRef as well). Essentially, all they are paying for is an unintuitive link that does not break, provided they keep the redirect up to date.

Hence OpenURL.

It creates a persistent link as DOI does, except that it actually exists as a webpage (it is not a redirect) and can therefore be bookmarked easily, and it CAN be generated from citation information without permissions. Here is a useful implementation.
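To make the "generated from citation information" point concrete, here is a minimal sketch in Python that builds an OpenURL 1.0 (Z39.88-2004) key/value query string for a journal article. The resolver address and the citation values are made up for illustration; a real link would use whatever base URL the publisher or library resolver exposes.

```python
from urllib.parse import urlencode

# Hypothetical resolver address, used only for illustration.
RESOLVER_BASE = "https://resolver.example.org/openurl"

# Journal-article keys from the OpenURL 1.0 key/encoded-value journal format.
JOURNAL_KEYS = ("jtitle", "atitle", "aulast", "date",
                "volume", "issue", "spage", "issn")


def build_openurl(citation, base=RESOLVER_BASE):
    """Build an OpenURL from plain citation fields; no lookup is needed."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    for key in JOURNAL_KEYS:
        value = citation.get(key)
        if value:  # skip fields the citation does not provide
            params.append(("rft." + key, str(value)))
    return base + "?" + urlencode(params)


# Made-up citation, not a real article.
example = {
    "jtitle": "Journal of Examples",
    "aulast": "Smith",
    "date": "2008",
    "volume": 12,
    "spage": 345,
}
print(build_openurl(example))
```

Everything in the link comes from the citation itself, which is the contrast with a DOI: nothing has to be looked up in a third-party database first.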

A note on the CrossRef website caught my eye. It states that OpenURL is not competitive with DOI. This, of course, is nonsense (since it addresses link permanency). Apparently:

“An OpenURL link that contains a DOI is similarly persistent.” [as a link]

Why would an OpenURL pointing to a publisher website not be persistent without a DOI? An OpenURL can be created from citation data, so it is TOTALLY persistent. With DOI, you need to fill in a form at CrossRef or Doi.org, which you do not need to do with OpenURL.

It is DOIs that need third-party ‘resolving’, not URLs, and especially not OpenURLs, which can be generated without any link to a database (a restricted one in the case of CrossRef).

So, it is a shame that only a few publishers have taken it up. Surely, it is a competitive advantage to use a totally freely available URL structure that anyone can generate? After all, the worst that could happen is that someone might find your articles more easily.

PDFs are fantastic as a format in many ways. They store the position of their elements (unlike HTML), which allows easy extraction of metadata (titles, authors, etc.) for display in search results. There are a variety of free tools available to convert PDF files to text format, so the perception that Adobe rule the world of PDFs is false.

Most of these tools have simple ways to undo the potential damage caused by the double-columned PDF, especially with long chemical names. Another common problem with chemical name extraction from PDF is that you often read this: “5-diphenyl” but end up extracting this: “5diphenyl”. Not fantastic (although whether an Adobe tool would produce a better result I don't know), but easily solvable with things like regex.
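As a rough illustration of the regex repair, the sketch below puts the hyphen back between a locant number and a substituent name. It is written in Python for brevity, and the list of stems is a short, hypothetical sample rather than anything complete; a real list would have to be built from the names you expect to see.

```python
import re

# Hypothetical, deliberately short list of substituent stems.
STEMS = ("diphenyl", "phenyl", "dimethyl", "methyl", "chloro", "bromo")

# A run of digits at a word boundary that is immediately followed by a
# known stem, i.e. a locant whose hyphen was lost during extraction.
MISSING_HYPHEN = re.compile(
    r"\b(\d+)(?=(?:%s))" % "|".join(re.escape(stem) for stem in STEMS)
)


def repair_locants(text):
    """Re-insert the hyphen between a locant and the substituent name."""
    return MISSING_HYPHEN.sub(r"\1-", text)


print(repair_locants("2,5diphenyl-1,3-oxazole"))
```

Names that already carry the hyphen are left alone, because the lookahead only matches when the stem follows the digit directly.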

So for PDFs, I find these three free things surprisingly useful:

1) PDFtoText (for extraction) 

2) PHP (to generate output)

3) Regex (for name matching/repairing)
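For the extraction step, a minimal sketch of driving the pdftotext command-line tool from a script might look like the following. It uses Python rather than PHP purely for illustration, assumes pdftotext is installed and on the PATH, and the function names are hypothetical. The `-layout` option asks pdftotext to keep the physical page layout; whether that or the default mode copes better with a given double-column paper is something to check by eye.

```python
import subprocess


def build_pdftotext_command(pdf_path, txt_path, keep_layout=False):
    """Return the argument list for one pdftotext run."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if keep_layout:
        cmd.append("-layout")  # preserve the physical page layout
    cmd += [pdf_path, txt_path]
    return cmd


def extract_text(pdf_path, txt_path, keep_layout=False):
    """Run pdftotext and return the extracted text as a string."""
    cmd = build_pdftotext_command(pdf_path, txt_path, keep_layout)
    subprocess.run(cmd, check=True)  # raises if pdftotext fails
    with open(txt_path, encoding="utf-8", errors="replace") as handle:
        return handle.read()
```

The returned text can then go through the regex repairs before any name matching is attempted.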