Archive for the Project Category

PNAS - National Academy of Sciences of the USA on Highwire:

78,379 articles indexed; 2 indices created to be linked into the literature search; linking strategy: pnas.org URL; indexing type: link back to full text (chemical name to structure conversions to follow)

ChemBlink:

18,453 web pages scanned; ChemSpider structure records to be created; linking strategy: chemblink.com URL; indexing type: Chemical name, structure, synonyms and property data and link back to original web page.

HighWire-hosted journal texts are to be indexed and linked back to by ChemSpider, and structure records linking to their content are to be deposited here as well. HighWire will be indexed in accordance with its robots.txt protocol (the conventional web publishing standard for stating indexing permissions).
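For anyone curious what respecting robots.txt looks like in practice, here is a minimal sketch in Python (not the language our own robot is written in) using the standard library's robots.txt parser. The rules, the "ExampleIndexer" user agent and the example.org URLs are all made up for illustration; they are not HighWire's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules, standing in for whatever a publisher really serves.
EXAMPLE_RULES = [
    "User-agent: *",
    "Disallow: /private/",
]


def build_parser(robots_lines):
    """Parse robots.txt content supplied as a list of lines."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser


def may_index(parser, url, user_agent="ExampleIndexer"):
    """Return True if the parsed rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_parser(EXAMPLE_RULES)
    print(may_index(rules, "http://example.org/content/1"))
    print(may_index(rules, "http://example.org/private/1"))
```

A real crawler would fetch the live robots.txt from the host before each crawl rather than hard-coding the rules.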

From the website:

“HighWire-hosted publishers have collectively made 1,873,044 articles free” [and with their partner publishers] “produce 71 of the 200 most-frequently-cited journals.”

We would like to thank them for one of the most phenomenal academic publishing indexing/structure deposition permissions we have received and we expect it will greatly enhance the discoverability of their partner publishers’ works through our free cheminformatics and text search.

The first part of the first build of the Open Chemistry Web project is now available for viewing and testing here.

It lacks the advanced and substructure search capabilities at the moment but these are well on the way. Currently, it more closely resembles the old text search over at ChemRefer.com, and we have actually been asked to preserve that metadata format, although there are some changes already implemented.

These include an effort to clarify metadata by standardising citation data to (or as closely as possible to) Journal name, Year, Volume, Issue, and page (all explicitly stated).

The idea behind this is that people will take citations from the primary source (not ChemRefer) so citations in search results should serve only to be as clear and easily readable to the user as possible.
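As a rough illustration of the standardised citation line, here is a small Python sketch. The "Journal Year;Volume(Issue):pages" layout is an assumption borrowed from the Acta Cryst. example shown in a later post on this page, and the Citation class and its field names are hypothetical, not the format the index actually stores.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    """Hypothetical container for the five citation fields."""
    journal: str
    year: int
    volume: str
    issue: str  # empty string when the source gives no issue number
    pages: str


def format_citation(c):
    """Render the fields as 'Journal Year;Volume(Issue):pages'."""
    issue_part = f"({c.issue})" if c.issue else ""
    return f"{c.journal} {c.year};{c.volume}{issue_part}:{c.pages}"


if __name__ == "__main__":
    example = Citation("Acta Cryst. Section E", 2006, "62", "7", "o2838-o2840")
    print(format_citation(example))
```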

Soon to be implemented (for some publishers) is Digital Object Identifier linking, so far only at the request of the publisher. Search engines periodically refresh all their links anyway, so the link permanency issues that apply to databases (which DOI solves) do not apply here, and so there is no policy on this at the moment.

There will be a SIMPLE user interface. One text box, one applet on one page (preferably with very little else). We want to be addictively usable and deliver useful search results quickly. We do not want to build some all-singing-all-dancing and yet overly complex system that no-one without a Masters in cheminformatics will ever be able to decipher.

There are around 150,000 articles on the new index in comparison to ~50,000 in ChemRefer’s index of 12 months ago. Around half are open access (meaning you can download the full work in its entirety for free), and the full text of articles has been indexed to maximise the depth of the search (so even if you cannot access the full text for free, you are still searching the full text).

There is an enormous analytical and life sciences bias at the moment but these are often the most searched for chemical topics on the web due to their scope and importance.

For general interest, ChemRefer differs in structure from ChemSpider in that it is a search engine not a database. That means:

- ChemSpider exists as a website: you can link to it, bookmark it etc. Its purpose is to refer you to useful and curated resources but also to provide information on the ChemSpider.com web resource

- ChemRefer is just a searchable index. You cannot link to ChemRefer (unless you want to link to constantly changing search result pages). Its purpose is to get you off the website and to the useful primary source. Articles and metadata are spidered but this is dynamic so can hardly be described as curation. Systems have been set up to allow the curation of chemical structures from this raw full text index into ChemSpider in an accurate way but also quickly (luckily Tony Williams is a human Xerox). ChemRefer also now serves not just as a full text indexer, but also to mass harvest chemical data from selected web resources and deliver it to ChemSpider.

So, the robot is often used to deliver the data for curation so that it can be processed, and not necessarily (as I initially assumed) just to be fed into the name-to-structure conversion software.

Any and all feedback welcome.

PDFs are fantastic as a format in many ways. They store the position of their elements (unlike HTML) so allowing easy extraction of metadata (like titles and authors etc) for display in search results. There are a variety of free tools available to convert PDF files to text format and so the perception that Adobe rule the world of PDFs is false.

Most of these tools have simple ways to undo the potential damage caused by the double-columned PDF, especially with long chemical names. Another common problem with chemical name extraction from PDF is that you often read this: “5-diphenyl” ….. but end up extracting this: “5diphenyl” …. not fantastic (although whether an Adobe tool would produce a better result I don’t know), but easily solvable with things like regex.

So for PDFs, I find these three free things surprisingly useful:

1) PDFtoText (for extraction) 

2) PHP (to generate output)

3) Regex (for name matching/repairing)
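To give a flavour of the regex repair step, here is a minimal sketch in Python rather than PHP. The rule is my own crude heuristic for this example (put a hyphen back wherever a digit runs straight into three or more lowercase letters), not the pattern actually used in the index, and it will misfire on some text, so treat it as a starting point only.

```python
import re

# A digit followed directly by 3+ lowercase letters, e.g. "5diphenyl".
# Short suffixes such as "2nd" or "5th" are left alone by the {3,} bound.
LOST_HYPHEN = re.compile(r"(\d)(?=[a-z]{3,})")


def repair_locants(text):
    """Re-insert a hyphen between a locant digit and the name that follows it."""
    return LOST_HYPHEN.sub(r"\1-", text)


if __name__ == "__main__":
    print(repair_locants("2,5diphenyl"))
    print(repair_locants("2,5-diphenyl"))
```

Names that were extracted correctly pass through unchanged, because a digit already followed by a hyphen does not match the pattern.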

For those interested, here’s some stats from indexing PMC’s OAI subset:

  • 3,114,818 unique terms
  • 58,807 articles

How to get stuff with PMC’s OAI service.

To get ONE article:

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=GetRecord&metadataPrefix=pmc&identifier=oai:pubmedcentral.nih.gov:1088242

Where the number at the end is the article’s PubMed Central ID (the numeric part of its PMCID), not its PubMed ID.

To get lots of articles:

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&metadataPrefix=pmc&set=pmc-open&from=2007-11-17&until=2007-11-18

Where metadataPrefix=pmc means harvest the full text (change that to oai_dc to just get the metadata) and the ‘from’ and ‘until’ parameters restrict by the articles’ date. Restrict the set of articles to open access with the set=pmc-open parameter.
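If you would rather build these URLs in code than by hand, a minimal Python sketch follows. The base URL is simply the one quoted in this post (it may have moved since), and the function names are my own invention for the example.

```python
from urllib.parse import urlencode

# Base URL as quoted in this post; treat it as an assumption if you reuse this.
BASE_URL = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"


def get_record_url(article_id, metadata_prefix="pmc"):
    """URL for a single article, identified by the number at the end."""
    params = {
        "verb": "GetRecord",
        "metadataPrefix": metadata_prefix,
        "identifier": f"oai:pubmedcentral.nih.gov:{article_id}",
    }
    # safe=":" keeps the colons in the OAI identifier readable.
    return BASE_URL + "?" + urlencode(params, safe=":")


def list_records_url(from_date=None, until_date=None,
                     record_set="pmc-open", metadata_prefix="pmc"):
    """URL for a batch of articles, optionally restricted by date."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": record_set,
    }
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(get_record_url(1088242))
    print(list_records_url("2007-11-17", "2007-11-18"))
```

Passing metadata_prefix="oai_dc" gives the metadata-only variant described above.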

Or you can leave off the ‘from’ and ‘until’ arguments and then OAI will serve you the first subset of matching articles and give a ‘resumptionToken’ for you to harvest the rest afterwards. I won’t provide an example here (because these tokens expire) but the basic construction is:

http://www.pubmedcentral.nih.gov/oai/oai.cgi?verb=ListRecords&resumptionToken=STICK-IT-HERE

Happy harvesting. Don’t hack them off by making more than 1 request per 3 secs or more than 100 requests during their peak hours.
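Here is a rough Python sketch of the resumptionToken loop with a polite pause between requests. The 3 second delay and the cap of 100 requests are taken from the limits mentioned above; the function names are hypothetical, the base URL is the one quoted in this post, and I have not run this against the live service, so check it before relying on it.

```python
import time
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

BASE_URL = "http://www.pubmedcentral.nih.gov/oai/oai.cgi"  # as quoted in this post
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"  # standard OAI-PMH namespace


def extract_resumption_token(xml_data):
    """Return the resumptionToken in an OAI response, or None if there is none."""
    root = ET.fromstring(xml_data)
    node = root.find(f".//{OAI_NS}resumptionToken")
    if node is None or not (node.text or "").strip():
        return None  # absent or empty token means the list is complete
    return node.text.strip()


def resumption_url(token):
    """URL that asks for the next batch of records."""
    return BASE_URL + "?" + urlencode(
        {"verb": "ListRecords", "resumptionToken": token})


def harvest(first_url, delay_seconds=3.0, max_requests=100):
    """Yield each raw response, following tokens and pausing between requests."""
    url = first_url
    for _ in range(max_requests):
        with urllib.request.urlopen(url) as response:
            xml_data = response.read()
        yield xml_data
        token = extract_resumption_token(xml_data)
        if token is None:
            return
        url = resumption_url(token)
        time.sleep(delay_seconds)
```

Start it with a ListRecords URL that has no date arguments and loop over the responses it yields.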

It is not the largest single source we have indexed (IUCr: ~83,000)

Springer has taken over the Indian Journal of Clinical Biochemistry (0970-1915), currently published by the Association of Clinical Biochemists of India, and will start publishing it as of volume 23, issue 1 (Jan. 08). The journal is currently covered in ChemRefer and hopefully we can continue to full text index this journal.

The signs are reasonably good so far since a Springer Open Access journal, Nanoscale Research Letters (1556-276X) has asked to be indexed in the upcoming release.

The Zhurnal Organicheskoi Khimii (Russian Journal of Organic Chemistry), another with a Springer connection, has also requested to be indexed, so it looks like it’s all down to what Springer’s World Wide Web of publishing decides to make of our project: Open and Integrated Structure and Text Search of the chemical web.

There are a great number of elaborate interfaces for searching databases, repositories and search indexes... that is, for anyone with enough determination to figure out how to use them. Google, of all search engines, surely has the greatest expertise around in developing search interfaces and yet they emphasise, more than anyone else, the power of the ‘single box’ search interface. It is as unscholarly as it gets: totally undefined and uncontrolled. It is an “anything/everything” search; what you type in can be anything in any category in any context. I think the joint text/substructure interface should be made with this in mind. By introducing too much “organisation” in the form of e.g. categories, searches for specific data types, etc. we could lose the dynamic nature that is embodied in the full text search index.

That is not to say there won’t be LOTS of advanced search features, there will, it is just that my feeling is they should be built to filter results from the full text index as opposed to organising/standardising (i.e. losing) data from the index prior to releasing it for searching. Standardised, defined and organised data in boxes looks wonderful and scholarly, but chaotic full text search indexes are more useful.
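To make the "filter the results, don't reorganise the index" idea concrete, here is a toy Python sketch. The records, field names and functions are all hypothetical; the real index is obviously not a list of dictionaries. The point is only the order of operations: the full text search runs first, untouched, and the advanced options narrow down whatever it returns.

```python
def full_text_hits(records, keyword):
    """Step 1: plain keyword match against the full text, no categories."""
    needle = keyword.lower()
    return [r for r in records if needle in r["text"].lower()]


def apply_filters(hits, **filters):
    """Step 2: keep only hits whose metadata matches every requested field."""
    return [
        r for r in hits
        if all(str(r.get(field, "")).lower() == str(value).lower()
               for field, value in filters.items())
    ]


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        {"title": "Paper A", "journal": "Journal X", "year": 2006,
         "text": "Synthesis of a hydrazone derivative"},
        {"title": "Paper B", "journal": "Journal Y", "year": 2007,
         "text": "A hydrazone crystal structure"},
        {"title": "Paper C", "journal": "Journal X", "year": 2007,
         "text": "Unrelated analytical method"},
    ]
    hits = full_text_hits(records, "hydrazone")
    print([r["title"] for r in apply_filters(hits, journal="Journal X")])
```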

Building further on what information people might like to see in search results, we currently have a ‘summary’ feature planned which will allow people to view, amongst other things:

A list of the number of articles per journal (in that set of search results for the entered keyword(s))

A simple bar chart type presentation outlining the number of articles matching the research query by the year the research was published (in that set of search results for the entered keyword(s)).

The idea is that users can easily build a snapshot of where to find research (in terms of which journals) and when that research activity peaked. I’m also working on some analogous features which will perform similar functions but for author names.
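A minimal Python sketch of the two summaries, assuming each search result carries a journal name and a publication year (the field names and the sample results are invented for the example, and the real feature will be presented differently):

```python
from collections import Counter


def summarise(results):
    """Count the results per journal and per publication year."""
    by_journal = Counter(r["journal"] for r in results)
    by_year = Counter(r["year"] for r in results)
    return by_journal, by_year


def bar_chart(by_year):
    """Render one text bar per year, oldest first."""
    return "\n".join(
        f"{year} {'#' * by_year[year]} {by_year[year]}"
        for year in sorted(by_year)
    )


if __name__ == "__main__":
    # Made-up results for illustration only.
    results = [
        {"journal": "Journal X", "year": 2006},
        {"journal": "Journal X", "year": 2007},
        {"journal": "Journal Y", "year": 2007},
    ]
    journals, years = summarise(results)
    for name, count in journals.most_common():
        print(name, count)
    print(bar_chart(years))
```

The same counting would work for author names by swapping the field being counted.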

It is all PHP based - not for any good reason just can’t speak any other language to a computer. The manipulation of PDFs is a particular challenge I find, but there is a nice program “pdftotext” to be found here which will solve all your PDF headaches, at least as far as text indexing is concerned.

Any other feature suggestions or other comments welcome.

As the title of this post suggests, we are currently tackling the problem of developing a feature which will provide users with a list of the articles that have cited a paper of interest. This is either not too hard or it is impossible depending on how you go about it. There are different ways to cite a paper: put the journal name first (may be abbreviated or not…. or abbreviated in a different way), put the year first, put the issue number in, don’t put the issue number in etc. So the search engine fell in on itself the first few times I tried this but now it is not too bad. ChemRefer also managed to crash ChemSpider last weekend (oops). Anyway, below is the embryonic version of this feature which will form part of a “detailed view” for displaying article metadata more clearly (only the article titles and the publisher name have hyperlinks in the example - btw the end presentation probably will be very different to this but the metadata will be similar). Note that the current index wasn’t designed with this kind of feature in mind so there are likely to be more citations of this article (chose this example because both articles are Open Access so anyone can analyse these citations): 

Title
Acetone 3-nitrophenylhydrazone, redetermined at 120 K: sheets built from N-H…O, C-H…O and C-H…N hydrogen bonds

Authors
S. M. S. V. Wardell, M. V. N. de Souza, J. L. Wardell, J. N. Low and C. Glidewell
Search Authors

Citation Information
Acta Cryst. Section E 2006;62(7):o2838-o2840

Publisher ©
International Union of Crystallography

Article URL
http://journals.iucr.org/e/issues/2006/07/00/lh2094/lh2094.pdf

Find More
Search For Similar Articles

Citations Found: 1

Acetone 2-nitrophenylhydrazone (Acta Cryst. Section E 2007;63(2):o970-o971) - International Union of Crystallography

S. M. S. V. Wardell, J. L. Wardell, J. N. Low and C. Glidewell

2006). Acta Cryst. E62, o2838 – o2840. o971 Acta Cryst. ( Brazil, hydrazone, (II) (Wardell et al., 2006). c M. S. V., de Souza, M. V. N.,
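For the curious, one way to cope with the different citation styles is to ignore the order of the parts and just ask whether the year, volume and first page turn up close together in the citing article's text. The Python sketch below does that; it is a simplified stand-in I wrote for this post, not the matching actually running in the index, and the 80 character window is an arbitrary guess.

```python
import re


def _standalone(token):
    """Regex for a token that is not embedded in a longer number."""
    return re.compile(r"(?<!\d)" + re.escape(str(token)) + r"(?!\d)")


def cites(full_text, year, volume, first_page, window=80):
    """True if year, volume and first page co-occur within `window` characters."""
    page_re = re.compile(
        r"(?<![A-Za-z0-9])" + re.escape(first_page) + r"(?![A-Za-z0-9])")
    year_re = _standalone(year)
    volume_re = _standalone(volume)
    for match in page_re.finditer(full_text):
        start = max(0, match.start() - window)
        context = full_text[start:match.end() + window]
        if year_re.search(context) and volume_re.search(context):
            return True
    return False


if __name__ == "__main__":
    snippet = "(Wardell et al., 2006). Acta Cryst. E62, o2838 - o2840."
    print(cites(snippet, 2006, 62, "o2838"))
```

Matching on the numbers alone sidesteps the journal abbreviation problem, at the cost of the occasional false positive when unrelated numbers happen to sit near each other.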

OK, we are now in the process of actually putting the new index together.

We are planning on these new advanced search features which will allow people to restrict their search by:

- Article title

- Authors

- Journal name; Volume; Issue; Page

- Provider

- Text in DOI

- Text in URL

- Full Text

- (Sub)structure << Java/Applet thing (ask the cheminformatics people about that)

We plan to retain title, author, citation and provider information in search result metadata. We also want to introduce a mouseover “detailed view” containing the above information but with additional search options such as:

- More articles from these authors

- Direct link to publisher homepage

- Search for similar articles (based on title keywords)

- Search for similar structures

Anything else you want to see? Comments welcome.