Archive for the Project Category

One of the surprises when indexing the huge array of literature available on the web is that many major names, that is, the ones associated with the traditional closed model, pop up as far and away the biggest contributors of open access works (defined here as those that are downloadable in their entirety free of charge and without other barriers, such as a login that requires giving away substantial personal information).

American Society for Biochemistry and Molecular Biology (100,000+ free articles)

Royal Society of Chemistry (70,000+ free articles) – trawled, but not yet added to lit search.

National Academy of Sciences of the USA (50,000+ free articles)

The observation is that around 99% of the open access works in chemistry indexed by ChemSpider are supported financially by the subscription model, and we can suppose that open access works support subscriptions by attracting unsubscribed readers too.

As we see above, this is not theory. It has been happening for years, and it is a real-world, material contribution to openness in chemistry that, crazily, has not attracted any attention on the blogosphere as far as I can tell.

There is a continued focus on relabelling data produced by others as “open data” – but this data has already been labelled and licensed by the original producer, so this could be misleading. I’ve always thought that building searchable indices that link back, as do the major search engines, is the best way to build a resource through which users can discover works and where data producers are not undercut.

…. and it did. But not quite in the way that Cambridge had imagined. Over the last few weeks around 200,000 articles from contributing publishers have been added to ChemSpider’s literature search (as ChemRefer is now styled), though even this is not in the final form which we imagine.

Another 40,000 articles or so are following next week as this resource grows. The indexer is running hot 24 hours a day, seven days a week. Tens of thousands more articles will follow after that. On top of that, we now have the capability to index text from image PDFs (many journal articles are still in this form), which also opens up the possibility of users sending in scanned images of their data-rich documents as a form of submission of chemical information to ChemSpider.

The main issue now is not having the time/resources to index everything we have permission for. We have still barely scratched the surface of HighWire, for instance, and adding updates from the resources we already index is not yet implemented properly. But these are nice problems to have.

When we do have the critical mass of text journal articles indexed, the “cited in” feature can be implemented and we can open up the chemical names from the indexed content for downloading and curation by the ChemSpider community… and that’s when things get really interesting.

We are still on track, with just scant resources, to create a community curated cheminformatics-text search that we hope will eventually gain unstoppable momentum thanks to our community backing. Mozilla Firefox competes with Microsoft’s Internet Explorer because it has user and developer community backing and that is worth consideration as a role model for ChemSpider and the chemistry world as a whole.

The turnaround that has occurred in terms of the interest in having published materials text indexed is highly significant in the long run, since thousands of references will pour into ChemSpider structure records to enhance the usefulness of the database.

These, of course, will be free for anyone to download, so will make a material contribution to the openness of chemical data (which is what I want Open Chemistry Web to be all about) as opposed to talking about definitions/licenses/copyrights and other such distractions (as I see them) surrounding open access and open data.

PNAS – National Academy of Sciences of the USA on Highwire:

78,379 articles indexed; 2 indices created to be linked into the literature search; linking strategy: pnas.org URL; indexing type: link back to full text (chemical name to structure conversions to follow)

ChemBlink:

18,453 web pages scanned; ChemSpider structure records to be created; linking strategy: chemblink.com URL; indexing type: Chemical name, structure, synonyms and property data and link back to original web page.

HighWire-hosted journal texts are to be indexed and linked back to by ChemSpider, and structure records linking to their content will be deposited here as well. HighWire will be indexed in accordance with their robots.txt file (the conventional web publishing standard for stating indexing permissions).
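As a rough illustration of what honouring robots.txt means for a crawler, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The robots.txt content, the `ExampleIndexer` user agent and the example.org URLs are all made up for the example; none of it is taken from HighWire or from the ChemSpider indexer.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_index(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(EXAMPLE_ROBOTS_TXT)
    for candidate in ("http://example.org/content/1",
                      "http://example.org/private/1"):
        print(candidate, may_index(rules, "ExampleIndexer", candidate))
```

A real crawler would fetch the site's live robots.txt (for instance with `RobotFileParser.set_url` and `read`) and run this check before every request.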

From the website:

“HighWire-hosted publishers have collectively made 1,873,044 articles free” [and with their partner publishers] “produce 71 of the 200 most-frequently-cited journals.”

We would like to thank them for one of the most phenomenal academic publishing indexing/structure deposition permissions we have received and we expect it will greatly enhance the discoverability of their partner publishers’ works through our free cheminformatics and text search.

The first part of the first build of the Open Chemistry Web project is now available for viewing and testing here.

It lacks the advanced and substructure capabilities at the moment, but these are well on the way. Currently, it more closely resembles the old text search over at ChemRefer.com, and we have actually been asked to preserve that metadata format, although there are some changes already implemented.

These include an effort to clarify metadata by standardising citation data to (or as closely as possible to) Journal name, Year, Volume, Issue, and page (all explicitly stated).

The idea behind this is that people will take citations from the primary source (not ChemRefer) so citations in search results should serve only to be as clear and easily readable to the user as possible.
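To make the idea concrete, here is a small Python sketch of a citation line with every field explicitly labelled. The `Citation` class, the field labels and the comma-separated layout are my own illustrative assumptions, not the format actually used in the search results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Citation:
    """Hypothetical container for the five citation fields named above."""
    journal: str
    year: int
    volume: Optional[str] = None
    issue: Optional[str] = None
    page: Optional[str] = None


def format_citation(citation: Citation) -> str:
    """Render a citation with every available field explicitly labelled."""
    parts = [f"Journal: {citation.journal}", f"Year: {citation.year}"]
    optional_fields = (
        ("Volume", citation.volume),
        ("Issue", citation.issue),
        ("Page", citation.page),
    )
    for label, value in optional_fields:
        if value:  # skip fields the source did not provide
            parts.append(f"{label}: {value}")
    return ", ".join(parts)


if __name__ == "__main__":
    print(format_citation(Citation("Example Journal", 2007, "12", "3", "45")))
```

Missing fields are simply left out, which matches the "as closely as possible" caveat: not every source supplies an issue or page number.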

Soon to be implemented (for some publishers) is Digital Object Identifier linking – at the request of the publisher so far. Search engines periodically refresh all their links anyway, so the link permanency issues that apply to databases (which DOI solves) do not apply here, and so there is no policy on this at the moment.

There will be a SIMPLE user interface. One text box, one applet on one page (preferably with very little else). We want to be addictively usable and deliver useful search results quickly. We do not want to build some all-singing-all-dancing and yet overly complex system that no-one without a Masters in cheminformatics will ever be able to decipher.

There are around 150,000 articles in the new index, compared with ~50,000 in ChemRefer’s index of 12 months ago. Around half are open access (meaning you can download the full work in its entirety for free), and the full text of articles has been indexed to maximise the depth of the search (so even if you cannot access the full text for free, you are still searching the full text).

There is an enormous analytical and life sciences bias at the moment but these are often the most searched for chemical topics on the web due to their scope and importance.

For general interest, ChemRefer differs in structure from ChemSpider in that it is a search engine not a database. That means:

- ChemSpider exists as a website: you can link to it, bookmark it etc. Its purpose is to refer you to useful and curated resources but also to provide information on the ChemSpider.com web resource

- ChemRefer is just a searchable index. You cannot link to ChemRefer (unless you want to link to constantly changing search result pages). Its purpose is to get you off the website and to the useful primary source. Articles and metadata are spidered but this is dynamic so can hardly be described as curation. Systems have been set up to allow the curation of chemical structures from this raw full text index into ChemSpider in an accurate way but also quickly (luckily Tony Williams is a human Xerox). ChemRefer also now serves not just as a full text indexer, but also to mass harvest chemical data from selected web resources and deliver it to ChemSpider.

So, the robot is often used to deliver the data for curation so that it can be processed, and not (as I initially assumed) just to be fed into the name-to-structure conversion software.

Any and all feedback welcome.

PDFs are fantastic as a format in many ways. They store the position of their elements (unlike HTML) so allowing easy extraction of metadata (like titles and authors etc) for display in search results. There are a variety of free tools available to convert PDF files to text format and so the perception that Adobe rule the world of PDFs is false.

Most of these tools have simple ways to undo the potential damage caused by the double-columned PDF, especially with long chemical names. Another common problem with chemical name extraction from PDF is that you often read this: “5-diphenyl” … but end up extracting this: “5diphenyl”. Not fantastic (although whether an Adobe tool would produce a better result I don’t know), but easily solvable with things like regex.

 So for PDFs: I find these free/three things surprisingly useful:

1) PDFtoText (for extraction) 

2) PHP (to generate output)

3) Regex (for name matching/repairing)
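The three steps above can be sketched roughly as follows. I do this in PHP; the sketch is in Python purely for illustration. `extract_text` assumes the `pdftotext` command-line tool is installed and on the path, and the list of name fragments in `STEMS` is a tiny made-up sample rather than a real chemical dictionary, so treat the regex as a heuristic and not a complete repair.

```python
import re
import subprocess

# Illustrative (and far from complete) fragments of chemical names that a
# locant such as "5" or "2,4" is normally joined to with a hyphen.
MULTIPLIERS = "di|tri|tetra"
STEMS = ("phenyl|methyl|ethyl|propyl|butyl|chloro|bromo|fluoro|"
         "hydroxy|amino|nitro")

# A run of digits (optionally comma separated) that is glued straight onto
# one of the fragments above, i.e. the hyphen was lost during extraction.
LOST_HYPHEN = re.compile(
    r"(?<![A-Za-z\d])(\d+(?:,\d+)*)(?=(?:" + MULTIPLIERS + r")?(?:" + STEMS + r"))"
)


def extract_text(pdf_path: str) -> str:
    """Run pdftotext on a file and return the text it writes to stdout."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the output to stdout
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def repair_locants(text: str) -> str:
    """Put back the hyphen between a locant and the name fragment after it."""
    return LOST_HYPHEN.sub(r"\1-", text)


if __name__ == "__main__":
    print(repair_locants("2,5diphenyloxazole"))
```

Names that already have their hyphen are left alone, because the pattern only matches digits followed immediately by a listed fragment.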

For those interested, here’s some stats from indexing PMC’s OAI subset:

  • 3,114,818 unique terms
  • 58,807 articles

How to get stuff with PMC’s OAI service.

To get ONE article:

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:1088242

Where the number at the end is the article’s PubMed Central ID (the PMCID, which is not the same as its PubMed ID).

To get lots of articles:

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open&from=2007-11-17&until=2007-11-18

Where metadataPrefix=pmc means harvest the full text (change that to oai_dc to just get the metadata) and the ‘from’ and ‘until’ parameters restrict by the articles’ date. Restrict the set of articles to open access with the set=pmc-open parameter.
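For anyone scripting this, here is a minimal Python sketch that assembles the two request URLs shown above. The function names and defaults are my own; the base address and parameters are exactly those from the examples, and the service address may of course change over time.

```python
from typing import Optional
from urllib.parse import urlencode

# Endpoint exactly as it appears in the example URLs above.
PMC_OAI_BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"


def _build(params: dict) -> str:
    # Keep ":" unescaped so the identifier reads the same as in the examples.
    return PMC_OAI_BASE + "?" + urlencode(params, safe=":")


def get_record_url(record_id: str, metadata_prefix: str = "pmc") -> str:
    """URL for fetching ONE article by the number at the end of its identifier."""
    return _build({
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:pubmedcentral.nih.gov:{record_id}",
    })


def list_records_url(from_date: Optional[str] = None,
                     until_date: Optional[str] = None,
                     oai_set: str = "pmc-open",
                     metadata_prefix: str = "pmc") -> str:
    """URL for fetching many articles, optionally restricted by date."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,  # "oai_dc" for metadata only
        "set": oai_set,
    }
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return _build(params)


if __name__ == "__main__":
    print(get_record_url("1088242"))
    print(list_records_url("2007-11-17", "2007-11-18"))
```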

Or you can leave off the ‘from’ and ‘until’ arguments and then OAI will serve you the first subset of matching articles and give a ‘resumptionToken’ for you to harvest the rest afterwards. I won’t provide an example here (because these tokens expire) but the basic construction is:

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=STICK-IT-HERE

Happy harvesting. Don’t hack them off by making more than 1 request per 3 secs or more than 100 requests during their peak hours.
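A polite harvesting loop might look something like the Python sketch below. It follows each resumptionToken until none is left and waits three seconds between requests. The function names are my own, the sketch does not enforce the peak-hours cap of 100 requests, and it has no error handling or retries, so treat it as a starting point only.

```python
import time
import xml.etree.ElementTree as ET
from typing import Callable, Iterator, Optional
from urllib.parse import urlencode
from urllib.request import urlopen

PMC_OAI_BASE = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # OAI-PMH 2.0 XML namespace
MIN_INTERVAL_SECONDS = 3.0  # no more than 1 request per 3 seconds


def next_token(xml_text: str) -> Optional[str]:
    """Return the resumptionToken in a response, or None if there is none."""
    root = ET.fromstring(xml_text)
    element = root.find(f".//{OAI_NS}resumptionToken")
    if element is None or not (element.text or "").strip():
        return None  # an absent or empty token means the list is complete
    return element.text.strip()


def fetch(url: str) -> str:
    """Download one response as text."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def harvest(first_url: str,
            fetch_page: Callable[[str], str] = fetch,
            pause: float = MIN_INTERVAL_SECONDS) -> Iterator[str]:
    """Yield each response page, following resumption tokens to the end."""
    url = first_url
    while True:
        page = fetch_page(url)
        yield page
        token = next_token(page)
        if token is None:
            return
        url = PMC_OAI_BASE + "?" + urlencode(
            {"verb": "ListRecords", "resumptionToken": token})
        time.sleep(pause)  # throttle before the next request
```

Passing the download function in as `fetch_page` makes it easy to try the loop against saved responses before pointing it at the live service.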

It is not the largest single source we have indexed (IUCr: ~83,000)

Springer has taken over the Indian Journal of Clinical Biochemistry (0970-1915), currently published by the Association of Clinical Biochemists of India, and will start publishing it as of volume 23, issue 1 (Jan. 2008). The journal is currently covered in ChemRefer, and hopefully we can continue to full text index this journal.

The signs are reasonably good so far since a Springer Open Access journal, Nanoscale Research Letters (1556-276X) has asked to be indexed in the upcoming release.

The Zhurnal Organicheskoi Khimii (Russian Journal of Organic Chemistry) – another with a Springer connection – has also requested to be indexed, so it looks like it’s all down to what Springer’s World Wide Web of publishing decides to make of our project: Open and Integrated Structure and Text Search of the chemical web.

There are a great number of elaborate interfaces for searching databases, repositories and search indexes… that is, for anyone with enough determination to figure out how to use them. Google, of all search engines, surely has the greatest expertise around in developing search interfaces, and yet they emphasise, more than anyone else, the power of the ‘single box’ search interface. It is as unscholarly as it gets – totally undefined and uncontrolled. It is an “anything/everything” search: what you type in can be anything in any category in any context. I think the joint text/substructure interface should be made with this in mind. By introducing too much “organisation” in the form of e.g. categories, searches for specific data types, etc., we could lose the dynamic nature that is embodied in the full text search index.

That is not to say there won’t be LOTS of advanced search features. There will; it is just that my feeling is they should be built to filter results from the full text index, as opposed to organising/standardising (i.e. losing) data from the index prior to releasing it for searching. Standardised, defined and organised data in boxes looks wonderful and scholarly, but chaotic full text search indexes are more useful.

To build further on what information people might like to see in search results: currently we have a ‘summary’ feature planned, which will allow people to view, amongst other things:

A list of the number of articles per journal (in that set of search results for the entered keyword(s))

A simple bar chart type presentation outlining the number of articles matching the research query by the year the research was published (in that set of search results for the entered keyword(s)).

The idea is that users can easily build a snapshot of where to find research (in terms of which journals) and when that research activity peaked. I’m also working on some analogous features which will perform similar functions but for author names.
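As a rough sketch of the two summaries described above, here is how the counting could be done in Python. The shape of each search result (a dict with `journal` and `year` keys) is an assumption made for the example, and the text-only bar chart just stands in for whatever the real interface ends up drawing.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarise(results: Iterable[Dict]) -> Tuple[Counter, Counter]:
    """Count the articles in one set of search results by journal and by year."""
    results = list(results)  # allow any iterable to be scanned twice
    per_journal = Counter(r["journal"] for r in results)
    per_year = Counter(r["year"] for r in results)
    return per_journal, per_year


def bar_chart(per_year: Counter) -> str:
    """Draw one text bar per publication year, oldest year first."""
    lines = []
    for year in sorted(per_year):
        count = per_year[year]
        lines.append(f"{year} {'#' * count} ({count})")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up results for a single keyword search.
    sample = [
        {"journal": "Journal A", "year": 2006},
        {"journal": "Journal A", "year": 2007},
        {"journal": "Journal B", "year": 2007},
    ]
    journals, years = summarise(sample)
    for journal, count in journals.most_common():
        print(f"{journal}: {count}")
    print(bar_chart(years))
```

The same counting works for author names by swapping the key, provided each result carries a list of authors.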

It is all PHP based – not for any good reason, I just can’t speak any other language to a computer. The manipulation of PDFs is a particular challenge, I find, but there is a nice program, “pdftotext”, to be found here, which will solve all your PDF headaches, at least as far as text indexing is concerned.

Any other feature suggestions or other comments welcome.