PDFs are fantastic as a format in many ways. Unlike HTML, they store the position of their elements, which makes it easy to extract metadata (titles, authors and so on) for display in search results. There are a variety of free tools available to convert PDF files to text, so the perception that Adobe rules the world of PDFs is false.

Most of these tools have simple ways to undo the potential damage caused by double-column PDFs, especially with long chemical names. Another common problem with chemical name extraction from PDFs is that you often read this: “5-diphenyl” but end up extracting this: “5diphenyl”. Not fantastic (although whether an Adobe tool would produce a better result I don't know), but easily solvable with things like regex.
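As a rough illustration of the kind of regex repair meant here, the sketch below puts a hyphen back between a locant digit and a substituent name. The stem and multiplier lists, and the function name `restore_locant_hyphen`, are my own assumptions for the example; a real list would need to be far longer and checked against real extractions.

```python
import re

# Hypothetical, deliberately tiny vocabularies; extend these for real use.
STEMS = ("phenyl", "methyl", "ethyl", "propyl", "butyl",
         "chloro", "bromo", "nitro", "amino", "hydroxy")
MULTIPLIERS = ("di", "tri", "tetra")

# Zero-width match: a position that follows a digit and is followed by an
# optional multiplier plus a known stem, e.g. the gap in "5diphenyl".
_LOST_HYPHEN = re.compile(
    r"(?<=\d)(?=(?:%s)?(?:%s))" % ("|".join(MULTIPLIERS), "|".join(STEMS)),
    re.IGNORECASE,
)


def restore_locant_hyphen(text):
    """Insert '-' wherever a digit runs straight into a known substituent."""
    return _LOST_HYPHEN.sub("-", text)


if __name__ == "__main__":
    print(restore_locant_hyphen("5diphenyl"))
    print(restore_locant_hyphen("2,5diphenyloxazole"))
```

Names that already carry the hyphen are left alone, because the digit is then followed by "-" rather than by a stem. Anything not covered by the word lists is also left alone, which is the safer failure mode for a repair step like this.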

So for PDFs, I find these free/three things surprisingly useful:

1) PDFtoText (for extraction) 

2) PHP (to generate output)

3) Regex (for name matching/repairing)
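A minimal sketch of how the three pieces might fit together, written in Python rather than PHP purely for illustration. It assumes the `pdftotext` command-line tool is installed and on the PATH; the helper names, the `keep_layout` parameter and the name-matching pattern are hypothetical choices of mine, not part of any of the tools.

```python
import re
import subprocess


def pdftotext_command(pdf_path, keep_layout=False):
    """Build the pdftotext argument list; '-' as output sends text to stdout."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to preserve the physical page layout.
        cmd.append("-layout")
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path, keep_layout=False):
    """Run pdftotext and return the extracted text (raises if the tool fails)."""
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Illustrative pattern only: one or more comma-separated locants, a hyphen,
# then a name-like run ending in a letter or digit, e.g. "2,5-diphenyloxazole".
LOCANT_NAME = re.compile(
    r"\b\d+(?:,\d+)*-[A-Za-z](?:[A-Za-z0-9,\-]*[A-Za-z0-9])?"
)


def find_locant_names(text):
    """Return candidate chemical names that start with numeric locants."""
    return LOCANT_NAME.findall(text)


if __name__ == "__main__":
    sample = "We made 2,5-diphenyloxazole, then 4-nitrophenol."
    for name in find_locant_names(sample):
        print(name)
```

With a real file, `find_locant_names(extract_text("paper.pdf"))` would chain the two steps. Whether `-layout` helps or hurts on two-column pages is worth checking per document, since it keeps the columns side by side instead of following the reading order.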


4 Responses to “Repairing Chemical Names In PDF”

  1. Egon Willighagen says:

    PDFs are pretty painful with respect to newlines, particularly in combination with ‘-’s in compound names. Have you seen OSCAR3 [1] and OSRA [2]? A mix of tools is likely the best way to correctly recover chemistry from PDFs.

    1. http://chem-bla-ics.blogspot.com/2006/09/chemical-archeology-oscar3-to.html
    2. http://chem-bla-ics.blogspot.com/2007/07/osra-gpl-ed-molecule-drawing-to-smiles.html

  2. will says:

    Nice.
    The other open drain in the unlit street (not that that has ever happened to me) with PDFs was the whole Greek character encoding nonsense, but that is not just a PDF issue.

  3. ChemSpiderMan says:

    OSRA’s got a long way to go based on my tests: http://www.chemspider.com/blog/converting-images-of-chemical-structures-to-real-structures.html

    Based on my tests CliDE is way ahead, for the time being. But it’s not open source; it’s a commercial product.
    http://www.chemspider.com/blog/im-hard-to-impressim-impressed-clide-can-convert-palytoxin-without-error.html

    Based on early feedback I have received (but I confess I have not conducted the tests myself) OSCAR is good for nominal complexity but cannot deal with fairly general complex names that show up in manuscripts. I have a good dataset to test in my hands if anyone is interested in comparing OSCAR to other NTS converters in a joint project…I’m looking for someone to drive OSCAR3.

  4. Joerg Kurt Wegner says:

    I agree with Egon, this is suboptimal. How many regex patterns would you need? Anyway, since we cannot change available PDF documents, this might add chemical information, but I would not trust it without a ChemSpider curation step!
