For those interested, here are some stats from indexing PMC’s OAI subset:

  • 3,114,818 unique terms
  • 58,807 articles

How to get stuff with PMC’s OAI service.

To get ONE article:

Where the number at the end is the article’s PubMed Central ID (PMCID), not its PubMed ID (PMID).
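For anyone scripting this, here is a minimal sketch of building a single-article (GetRecord) request in Python. The base URL, the `oai:pubmedcentral.nih.gov:` identifier prefix, the function name and the example ID are all assumptions for illustration, so check them against PMC's current OAI documentation before relying on them.

```python
from urllib.parse import urlencode

# Assumption: address of PMC's OAI-PMH endpoint. It has moved over the years,
# so substitute whatever PMC currently documents.
BASE_URL = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"


def get_record_url(article_id, metadata_prefix="pmc"):
    """Build a GetRecord URL for one article (hypothetical helper)."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        # Assumption: identifier format; kept last so the numeric ID ends the URL.
        "identifier": f"oai:pubmedcentral.nih.gov:{article_id}",
    }
    # safe=":" keeps the colons in the identifier readable instead of %3A.
    return f"{BASE_URL}?{urlencode(params, safe=':')}"


if __name__ == "__main__":
    # 123456 is a made-up ID, not a real article.
    print(get_record_url(123456))
```

Fetching the resulting URL (with `urllib.request.urlopen`, for instance) should return the record as XML.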

To get lots of articles:

Here metadataPrefix=pmc means harvest the full text (change it to oai_dc to get just the metadata), and the ‘from’ and ‘until’ parameters restrict the harvest by the articles’ date. To restrict the set of articles to open access, add the set=pmc-open parameter.
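A bulk (ListRecords) request with these parameters could be put together roughly as follows. This is a sketch only: the base URL, the helper name and the example dates are assumptions, while the parameter names and values (metadataPrefix, from, until, set=pmc-open) are the ones described above.

```python
from urllib.parse import urlencode

# Assumption: address of PMC's OAI-PMH endpoint; substitute the current one.
BASE_URL = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"


def list_records_url(date_from=None, date_until=None,
                     metadata_prefix="pmc", open_access_only=True):
    """Build a ListRecords URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if date_from:
        params["from"] = date_from        # e.g. "2007-01-01"
    if date_until:
        params["until"] = date_until      # e.g. "2007-01-31"
    if open_access_only:
        params["set"] = "pmc-open"        # open access subset only
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    # Example dates are made up for illustration.
    print(list_records_url("2007-01-01", "2007-01-31"))
    # Metadata only, no date restriction:
    print(list_records_url(metadata_prefix="oai_dc"))
```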

Or you can leave off the ‘from’ and ‘until’ arguments and then OAI will serve you the first subset of matching articles and give a ‘resumptionToken’ for you to harvest the rest afterwards. I won’t provide an example here (because these tokens expire) but the basic construction is:
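In code, following a resumptionToken might look something like the sketch below. It relies on the OAI-PMH 2.0 rule that a follow-up request carries only the verb and the resumptionToken; the base URL and both helper names are assumptions, and the token in the usage example is invented (real ones come from the server and expire).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumption: address of PMC's OAI-PMH endpoint; substitute the current one.
BASE_URL = "https://www.ncbi.nlm.nih.gov/pmc/oai/oai.cgi"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace


def extract_resumption_token(xml_text):
    """Return the resumptionToken from a response, or None when the list is done."""
    root = ET.fromstring(xml_text)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        return None  # missing or empty token: nothing left to harvest
    return element.text.strip()


def resume_url(token, verb="ListRecords"):
    """Build the follow-up request: just the verb plus the token."""
    return f"{BASE_URL}?{urlencode({'verb': verb, 'resumptionToken': token})}"


if __name__ == "__main__":
    # Hand-written sample response with a made-up token.
    sample = (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">'
        "<ListRecords><resumptionToken>abc123</resumptionToken></ListRecords>"
        "</OAI-PMH>"
    )
    print(resume_url(extract_resumption_token(sample)))
```

A harvesting loop would keep requesting `resume_url(token)` until `extract_resumption_token` returns None.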

Happy harvesting. Don’t hack them off by making more than 1 request per 3 seconds, or more than 100 requests during their peak hours.
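One way to stay under the one-request-per-3-seconds limit is a small throttle called before every request. This is just a sketch: the `Throttle` class is a made-up helper, not part of any library, and it does not attempt the 100-request peak-hour cap, since the peak hours themselves are not given here.

```python
import time


class Throttle:
    """Enforce a minimum gap between successive calls to wait()."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock               # injectable so it can be tested
        self._sleep = sleep
        self._last = None                 # time of the previous request

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=3.0)
    for i in range(3):
        throttle.wait()  # call this right before each OAI request
        print("request", i, "at", round(time.monotonic(), 1))
```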

PMC is not the largest single source we have indexed (IUCr: ~83,000).

3 Responses to “PubMedCentral OAI Stats”

  1. ChemSpiderMan says:

    Just in regard to indexing crystallography data and info again. You’ve likely seen my post on CrystalEye. Today I was talking to one of the managers at the copyright office of ACS. The conversations continue, and a group has been brought together to discuss this, made up of people from CAS, Ohio and ACS, Washington. I can’t believe there will be any outcome other than allowing the scraping of the data into CrystalEye, because if they didn’t then the public relations nightmare fallout would be amazing.

  2. Egon Willighagen says:

    It would give the ACS some serious credit if they would announce that such use of the ACS literature is permitted!

  3. OPEN CHEMISTRY WEB » Blog Archive » Indexing PubMed Central says:

    [...] blogged on this before I think it important to emphasise that you CAN spider PubMed Central. They even have their own [...]
