For those interested, here are some stats from indexing PMC’s OAI subset:
- 3,114,818 unique terms
- 58,807 articles
How to get stuff with PMC’s OAI service.
To get ONE article:
Where the number at the end is the PubMed ID.
To get lots of articles:
http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open&from=2007-11-17&until=2007-11-18
Where metadataPrefix=pmc means harvest the full text (change that to oai_dc to just get the metadata) and the ‘from’ and ‘until’ parameters restrict by the articles’ date. Restrict the set of articles to open access with the set=pmc-open parameter.
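If you would rather build that URL in code than by hand, here is a minimal Python sketch. The function name `list_records_url` and its argument names are made up for illustration; only the base URL and the query parameters come from the example above.

```python
from urllib.parse import urlencode

# Base URL taken from the example above.
BASE_URL = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"


def list_records_url(metadata_prefix="pmc", oai_set="pmc-open",
                     from_date=None, until_date=None):
    """Build a ListRecords URL (hypothetical helper, not part of any library).

    metadata_prefix: "pmc" for full text, "oai_dc" for metadata only.
    oai_set: "pmc-open" restricts to open access articles.
    from_date / until_date: optional YYYY-MM-DD strings.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set:
        params["set"] = oai_set
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    # urlencode keeps the insertion order of the dict.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(list_records_url(from_date="2007-11-17", until_date="2007-11-18"))
```

Passing `metadata_prefix="oai_dc"` gives the metadata-only variant of the same request.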
Or you can leave off the ‘from’ and ‘until’ arguments and then OAI will serve you the first subset of matching articles and give a ‘resumptionToken’ for you to harvest the rest afterwards. I won’t provide an example here (because these tokens expire) but the basic construction is:
http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=STICK-IT-HERE
Happy harvesting. Don’t hack them off by making more than 1 request per 3 seconds, or more than 100 requests during their peak hours.
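The resumptionToken loop with a polite pause between requests might look roughly like this in Python. It is a sketch under a few assumptions: the responses use the standard OAI-PMH 2.0 XML namespace, an empty or missing resumptionToken element means there is nothing left to fetch, and the helper names (`fetch_url`, `next_token`, `harvest`) are invented for this example. It only handles the 3-second spacing and does not count requests against the peak-hours limit.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
# Standard OAI-PMH 2.0 namespace (assumed to be what the service returns).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def fetch_url(url):
    """Download one response and return the raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def next_token(xml_bytes):
    """Return the resumptionToken in a response, or None if absent or empty."""
    root = ET.fromstring(xml_bytes)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()


def harvest(first_url, fetch=fetch_url, delay=3.0, sleep=time.sleep):
    """Yield each page of records, following resumptionTokens.

    Waits `delay` seconds before every follow-up request.
    `fetch` and `sleep` are injectable so the loop can be tried offline.
    """
    url = first_url
    while url:
        page = fetch(url)
        yield page
        token = next_token(page)
        if token is None:
            return
        sleep(delay)
        url = BASE_URL + "?" + urlencode(
            {"verb": "ListRecords", "resumptionToken": token}
        )
```

To use it, pass the first ListRecords URL (without ‘from’ and ‘until’) to `harvest` and write each yielded page to disk as it arrives, so an expired token only costs you the remainder of the run.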
PMC’s OAI subset is not the largest single source we have indexed (IUCr: ~83,000).
