Archive for March, 2008

The first part of the first build of the Open Chemistry Web project is now available for viewing and testing here.

It lacks the advanced and substructure search capabilities at the moment, but these are well on the way. Currently, it more closely resembles the old text search over at ChemRefer.com, and we have actually been asked to preserve that metadata format, although some changes are already implemented.

These include an effort to clarify metadata by standardising citation data to (or as closely as possible to) Journal name, Year, Volume, Issue, and page (all explicitly stated).
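As a rough illustration of that standardisation, here is a minimal PHP sketch. The function name `format_citation`, the array keys, the label wording and the sample record are all made up for this example and are not the actual ChemRefer code. It simply writes out whichever fields are available, each one explicitly labelled and always in the same order.

```php
<?php
// Hypothetical helper (not the real indexing code): builds one citation
// string with every available field explicitly labelled, in a fixed order.
function format_citation(array $fields): string
{
    $labels = [
        'journal' => 'Journal',
        'year'    => 'Year',
        'volume'  => 'Volume',
        'issue'   => 'Issue',
        'page'    => 'Page',
    ];

    $parts = [];
    foreach ($labels as $key => $label) {
        // Skip fields the source did not supply ("as closely as possible").
        if (isset($fields[$key]) && $fields[$key] !== '') {
            $parts[] = $label . ': ' . $fields[$key];
        }
    }

    return implode(', ', $parts);
}

// Made-up record with no issue number.
echo format_citation([
    'journal' => 'Example Journal of Chemistry',
    'year'    => 2008,
    'volume'  => 12,
    'page'    => '345-350',
]), "\n";
```

A record with a missing field still yields a readable line; the missing field is left out rather than shown as a blank.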

The idea behind this is that people will take citations from the primary source (not ChemRefer), so citations in search results need only be as clear and readable to the user as possible.

Soon to be implemented (for some publishers) is Digital Object Identifier (DOI) linking, so far only at the request of the publisher. Search engines periodically refresh all their links anyway, so the link permanency issues that apply to databases (which DOI solves) do not apply here, and so there is no policy on this at the moment.

There will be a SIMPLE user interface. One text box, one applet on one page (preferably with very little else). We want to be addictively usable and deliver useful search results quickly. We do not want to build some all-singing-all-dancing and yet overly complex system that no-one without a Masters in cheminformatics will ever be able to decipher.

There are around 150,000 articles in the new index, compared with ~50,000 in ChemRefer’s index of 12 months ago. Around half are open access (meaning you can download the full work in its entirety for free), and the full text of articles has been indexed to maximise the depth of the search (so even if you cannot access the full text for free, you are still searching the full text).

There is an enormous analytical and life sciences bias at the moment but these are often the most searched for chemical topics on the web due to their scope and importance.

For general interest, ChemRefer differs in structure from ChemSpider in that it is a search engine not a database. That means:

- ChemSpider exists as a website: you can link to it, bookmark it etc. Its purpose is to refer you to useful and curated resources but also to provide information on the ChemSpider.com web resource.

- ChemRefer is just a searchable index. You cannot link to ChemRefer (unless you want to link to constantly changing search result pages). Its purpose is to get you off the website and to the useful primary source. Articles and metadata are spidered but this is dynamic so can hardly be described as curation. Systems have been set up to allow the curation of chemical structures from this raw full text index into ChemSpider in an accurate way but also quickly (luckily Tony Williams is a human Xerox). ChemRefer also now serves not just as a full text indexer, but also to mass harvest chemical data from selected web resources and deliver it to ChemSpider.

So, the robot is often used to deliver data for curation so that it can be processed, and not necessarily (as I initially assumed) just to be fed into the name-to-structure conversion software.

Any and all feedback welcome.

PHP is great. This quote, which appears, ironically, on ASP.net, explains why in a nutshell:

“I think PHP is great if you don’t wanna spent alot of time and ENERGY to become a web developer and still have some power”

Now, the literature indexing is built partly with PHP for this very reason. I am not a programmer, but I want to program and I want to do it quickly, because it’s just a means to an end, i.e. building an index. Whether that’s an index of data from articles, CIFs, catalogs etc. is immaterial with PHP. So, it seems to me that for librarians or information professionals anywhere, this is a great tool, and you don’t need any extra money (PHP is free, as is the Apache webserver). Just determination and constant access to this resource: php.net .

Of course, ChemSpider is .NET, so this can create some difficulties whenever something indexed by me has to be implemented at ChemSpider.com, and I rarely have any idea what they are talking about when I hear words like SQL Server and so on. On the whole though, I am increasingly using a combination of free and came-with-my-computer Microsoft tools, e.g. the indexing runs on a WAMP server.

The indexing code uses just basic commands, e.g. this one for matching all URLs on a page that include “/catalog/”:

<?PHP

// $get holds the page content; matches are collected in $outurls[0].
preg_match_all('@/catalog/[^"]+@', $get, $outurls);

?>
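To show what that one-liner actually returns, here is a small self-contained sketch. The HTML string is made up for the example, and in the real indexer `$get` would presumably hold a fetched page (how it is fetched is not shown here). Note the straight quotes around the pattern: curly quotes pasted from a web page will not parse as PHP.

```php
<?php
// Made-up page content standing in for a fetched page.
$get = '<a href="/catalog/item1.html">One</a> '
     . '<a href="/about.html">About</a> '
     . '<a href="http://example.com/catalog/item2.html">Two</a>';

// "@" is the pattern delimiter, so the slashes need no escaping.
// [^"]+ keeps matching until the closing quote of the href attribute.
$count = preg_match_all('@/catalog/[^"]+@', $get, $outurls);

// $outurls[0] holds the full matches, one per link found.
foreach ($outurls[0] as $url) {
    echo $url, "\n";
}
```

This prints `/catalog/item1.html` and `/catalog/item2.html`. Each match starts at “/catalog/”, so for absolute links the host part is not included and would have to be added back if it is needed.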

So, what does this mean for chemistry libraries? Well, having someone at your library with this know-how is a must. Indexing and organising literature, and more effectively complementing it with data indexed from the WWW, can help a library to make up for the fact that it cannot afford all the subscriptions it would like. And, in an era where these once complex, labour-intensive and expensive activities are now free and dynamic, even the smallest library can use this to its advantage. You’re not getting any more money in your budget and your subscriptions aren’t getting any cheaper? … well, the solution is still free.