For those interested, here are some stats from indexing PMC’s OAI subset:
- 3,114,818 unique terms
- 58,807 articles
How to get stuff with PMC’s OAI service.
To get ONE article:
Where the number at the end is the PubMed ID.
To get lots of articles:
http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open&from=2007-11-17&until=2007-11-18
Here metadataPrefix=pmc means harvest the full text (change it to oai_dc to get just the metadata), and the ‘from’ and ‘until’ parameters restrict the results by the articles’ date. The set=pmc-open parameter restricts the harvest to open access articles.
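If you would rather build that URL in code than by hand, something like the following should do it. This is only a sketch: the function name list_records_url and its argument names are mine, not part of PMC's service; the base URL and query parameters are the ones shown above.

```python
from urllib.parse import urlencode

# Base URL taken from the example above.
BASE_URL = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"


def list_records_url(metadata_prefix="pmc", oai_set="pmc-open",
                     from_date=None, until_date=None):
    """Build a ListRecords URL (hypothetical helper, not a PMC API).

    metadata_prefix: "pmc" for full text, "oai_dc" for metadata only.
    from_date / until_date: optional YYYY-MM-DD strings.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": oai_set,
    }
    # Only add the date restrictions when they are given.
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return BASE_URL + "?" + urlencode(params)
```

Calling list_records_url(from_date="2007-11-17", until_date="2007-11-18") reproduces the example URL above.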
Or you can leave off the ‘from’ and ‘until’ arguments and then OAI will serve you the first subset of matching articles and give a ‘resumptionToken’ for you to harvest the rest afterwards. I won’t provide an example here (because these tokens expire) but the basic construction is:
http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=STICK-IT-HERE
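A rough sketch of the follow-the-token loop, in case it helps. The function names (next_token, fetch, harvest) are my own, and I am assuming the responses use the standard OAI-PMH 2.0 XML namespace and that an empty or missing resumptionToken means there is nothing left to fetch. It does no throttling or error handling, so add both before pointing it at the real service.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
# Standard OAI-PMH 2.0 namespace (assumed to be what the service returns).
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def next_token(xml_data):
    """Return the resumptionToken in a response, or None if there is none."""
    root = ET.fromstring(xml_data)
    node = root.find(".//" + OAI_NS + "resumptionToken")
    if node is None or not (node.text or "").strip():
        return None
    return node.text.strip()


def fetch(url):
    """Download one response as raw bytes."""
    with urlopen(url) as response:
        return response.read()


def harvest(first_url, fetch_page=fetch):
    """Yield each response page, following resumption tokens until done."""
    url = first_url
    while url:
        page = fetch_page(url)
        yield page
        token = next_token(page)
        if token:
            query = urlencode({"verb": "ListRecords", "resumptionToken": token})
            url = BASE_URL + "?" + query
        else:
            url = None
```

The fetch_page argument is only there so you can swap in a throttled or cached downloader, or test the loop without touching the network.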
Happy harvesting. Don’t hack them off by making more than 1 request per 3 seconds or more than 100 requests during their peak hours.
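One simple way to stay under the one-request-per-3-seconds limit is to pause before each request. The Throttle class below is a hypothetical helper of my own, not anything PMC provides, and it only covers the per-request spacing; the 100-requests-in-peak-hours limit would need a separate counter.

```python
import time


class Throttle:
    """Enforce a minimum gap between calls to wait() (hypothetical helper)."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        # clock and sleep are injectable so the class can be tested quickly.
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Create one Throttle() and call its wait() method immediately before every request to the service.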
The PMC subset is not the largest single source we have indexed (IUCr: ~83,000).

December 12th, 2007 at 9:02 pm
Just in regard to indexing crystallography data and info again. You’ve likely seen my post on CrystalEye (http://www.chemspider.com/blog/?p=220). Today I was talking to one of the managers at the copyright office of ACS. The conversations continue, and a group has been brought together to discuss this, made up of people from CAS, Ohio and ACS, Washington. I can’t believe there will be any outcome other than allowing the scraping of the data into CrystalEye, because if they didn’t, the public relations fallout would be a nightmare.
December 13th, 2007 at 3:08 am
It would give the ACS some serious credit if they would announce that such use of the ACS literature is permitted!
April 6th, 2008 at 1:33 pm
[...] blogged on this before I think it important to emphasise that you CAN spider PubMed Central. They even have their own [...]