Jean-Claude Bradley recently blogged about an all too familiar issue which you will need to read about there to understand this post.

They’ve done the work; now they have to keep track of what they’ve done in one place, in an updatable way, without creating a bureaucratic headache. At ChemSpider, organising data cost-effectively has caused headaches every now and then, though it has also been one of the biggest drivers of innovation.

I have only followed it from time to time, but the activity around what is called ‘Open Notebook Science’ is reaching an industrial scale.

The ONSchallenge site, which I looked at most closely, is annotated with links to comprehensive experimental, structural and relevant external data sources but is still reader-friendly.

So anyway, after some discussion over at Usefulchem, I wrote a small script to collect Google spreadsheet URLs (those that link directly to the XLS format) and associate each with the wiki page from which it is linked.

Initially this covers only the first level of connections (starting from this base URL), and no extra metadata such as InChI is collected... yet. But we have to determine the best format for this service before it is scaled.

For example, a web service might be a good idea (but it could face spam requests), or perhaps a locally run script (which assumes that configuration is convenient for every potential user).
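For anyone weighing up the locally run option, here is a minimal sketch of the collection step in Python. It is not the script described above: the class and function names are hypothetical, and the test for what counts as a spreadsheet link (the host name `spreadsheets.google.com` appearing in the href) is an assumption. Fetching the wiki page is left out, so the function simply takes the page HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SpreadsheetLinkCollector(HTMLParser):
    """Collects hrefs on one page that look like Google spreadsheet links."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: a spreadsheet link is any href containing this host name.
        if href and "spreadsheets.google.com" in href:
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.page_url, href))


def collect_spreadsheet_links(page_url, html):
    """Return (wiki page URL, spreadsheet URL) pairs for a single page."""
    collector = SpreadsheetLinkCollector(page_url)
    collector.feed(html)
    return [(page_url, link) for link in collector.links]
```

Each pair could then be written out as one row of a spreadsheet; going beyond the first level of connections would mean feeding the other links on the page back into the same function.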

The generated excel spreadsheet is << here >>.

The literature search now has a (partially enabled) feature which extracts references from documents where the author has cited themselves. It also counts citations of that reference in other documents.

This allows users to follow a chain of research backwards and forwards (as cited in links will always be in the future, with references being in the past).
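To make the backwards/forwards idea concrete, here is a rough Python sketch (not the code behind the search itself). It assumes each indexed document already has a list of extracted references, and inverts those lists to get the "cited in" links. The function names and the document identifiers in the usage example are hypothetical.

```python
from collections import defaultdict


def build_cited_in(references_by_doc):
    """Invert {document: [references]} into {reference: {citing documents}}."""
    cited_in = defaultdict(set)
    for doc_id, refs in references_by_doc.items():
        for ref in refs:
            cited_in[ref].add(doc_id)
    return cited_in


def citation_counts(cited_in):
    """Number of indexed documents citing each reference."""
    return {ref: len(docs) for ref, docs in cited_in.items()}
```

With `refs = {"A": ["B", "C"], "B": ["C"], "C": []}`, the reference list of "A" leads backwards to "B" and "C", while `build_cited_in(refs)["C"]` leads forwards to the documents "A" and "B" that cite it.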

These features are in testing but are also live for users in the text search. An example search:

Suzuki coupling

There is no engineering to make cited papers ‘preferred’ by the search engine in terms of ranking as, whilst the association between keyword and relevance is strong, it is not clear (to me) that this is true for the association between citation count and relevance.

And anyway, as we are only indexing a couple of hundred thousand docs so far, we don’t have a big enough sample for ‘cited in’ counts to be comprehensive. Their main use is for following chains of research.

The selection of sources we can full text search is growing but is still limited by publisher permission.

However, the references within those papers, which obviously come from a much wider set of sources, are available to us, and now you can search for them: see the demo << HERE >>.

Click on the example on the search form for a usage guide. Feedback appreciated.

Again, full integration into ChemSpider /ChemMantis and addition of other sources is pending (and running all this off one laptop for too long is not advisable).

This makes article metadata for over a million articles searchable in a structured way.

The << literature search >> capability has some new features, such as a “cited in” link beneath search results. This represents our efforts to incorporate Google Scholar-like features into ChemSpider using text indices. We have indexed 230K articles into this new framework so far and would appreciate your feedback.

A demo is online << HERE >> and it is currently covering the RSC free access set and the JBC. Other sources will be re-added from the old search over the coming weeks and there are new features planned (time and resources permitting) including:

- ability to search for references within articles

- references extracted from articles displayed in search results and links to which other articles they are also cited in

- filtering results by fields such as author / journal name

Full integration into ChemSpider and ChemMantis is also anticipated.

One of the most untapped and certainly unsearched sources of chemical literature on the web is journal articles in image PDF format. I am using ImageMagick and Tesseract to get round this but (having no experience of ‘image indexing’) I am discovering how memory intensive this process is and it is painfully slow.
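For readers who want to try something similar, here is a minimal Python sketch of the two-step pipeline: rasterise the PDF with ImageMagick, then run Tesseract over each page image. It is not the setup described above. It assumes the `convert` and `tesseract` executables are on the PATH (newer ImageMagick releases use `magick` instead, and PDF input normally needs Ghostscript installed); the function names and the 300 dpi default are hypothetical.

```python
import subprocess
from pathlib import Path


def rasterise_command(pdf_path, out_dir, density=300):
    """ImageMagick command that writes one PNG per PDF page."""
    pattern = str(Path(out_dir) / "page-%03d.png")
    return ["convert", "-density", str(density), str(pdf_path), pattern]


def ocr_command(image_path):
    """Tesseract command; tesseract appends '.txt' to the output base itself."""
    image_path = Path(image_path)
    return ["tesseract", str(image_path), str(image_path.with_suffix(""))]


def ocr_pdf(pdf_path, out_dir):
    """Rasterise a PDF, OCR each page in order, and return the combined text."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(rasterise_command(pdf_path, out_dir), check=True)
    texts = []
    for image in sorted(out_dir.glob("page-*.png")):
        subprocess.run(ocr_command(image), check=True)
        texts.append(image.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

The density setting is the obvious knob to experiment with: a higher value gives Tesseract more pixels per character, but the page images get larger, which is likely where much of the memory and time goes.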

A source which would obviously benefit from this is the Acta Chemica Scandinavica archive put together by the Danish, Swedish, Norwegian and Finnish chemical societies which has extremely high quality image PDFs that lend themselves readily to this process. We will then have full text searchable functionality for this archive – will be interesting to test the quality of the free tools I am using for this as well. Could take weeks though!

One of the surprises when indexing the huge array of literature available on the web is that many major names, that is the ones who are associated with the traditional closed model, pop up as by far and away the biggest contributors to open access works (defined here as those that are downloadable in their entirety free of charge or other barrier such as login giving away substantial personal info).

American Society for Biochemistry and Molecular Biology (100,000+ free articles)

Royal Society of Chemistry (70,000+ free articles) – trawled, but not yet added to lit search.

National Academy of Sciences of the USA (50,000+ free articles)

The observation is that around 99% of the open access works in chemistry indexed by ChemSpider are supported financially by the subscription model, and we can suppose that open access works support subscriptions by attracting unsubscribed readers too.

As we see above, this is not theory: it has been happening for years. It is a real-world, material contribution to openness in chemistry that, crazily, has not attracted any attention on the blogosphere as far as I can tell.

There is a continued focus on relabelling data produced by others as “open data” – but this data has already been labelled and licensed by the original producer, so this could be misleading. I’ve always thought that building searchable indices that link back, as the major search engines do, is the best way to build a resource through which users can discover works and where data producers are not undercut.

In a recent article in Chemistry World, the declining support for PhDs and postdocs was highlighted. The explanation given is that this is a result of “changes to the way research costs are calculated”.

Can’t argue with that, but it is also true that if PhD/Postdocs were all that valuable, industry would fund them regardless. The truth about chemistry degrees, at least in England, is that they are all about sticking to the recipe given out by the lecturer. “Experiments” consist of following a procedure to the letter. Whilst this is a valuable skill to have, these are not “experiments” at all. In fact they are anti-experiments because should you deviate from the plan given to you, your grades will fall. If you do not *think* or question a procedure, you will get straight As.

Given that it is the straight-A students who are likely to go on to do PhDs/postdocs, these qualities will persist at that level, and industry, which is all about the ability to think and innovate, is right to be unimpressed.

Vested interest: I only got a 2.2 so maybe I would say this ;)

…. and it did. But not quite in the way that Cambridge had imagined. Over the last few weeks around 200,000 articles from contributing publishers have been added to ChemSpider’s literature search (as ChemRefer is now styled), though even this is not in the final form which we imagine.

Another 40,000 articles or so are following next week as this resource grows. The indexer is running hot 24 hours a day, seven days a week. Tens of thousands more articles will follow after that, and on top of that we now have the capability to index text from image PDFs (many journal articles are still in this form), which also opens up the possibility of users sending in scanned images of their data-rich documents as a form of submission of chemical information to ChemSpider.

The main issue now is not having the time/resources to index everything we have permission for: we have still barely scratched the surface of Highwire, for instance, and adding updates from the resources we already index is not yet implemented properly. But these are nice problems to have.

When we do have the critical mass of text journal articles indexed, the “cited in” feature can be implemented and we can open up the chemical names from the indexed content for downloading and curation by the ChemSpider community… and that’s when things get really interesting.

We are still on track, with just scant resources, to create a community curated cheminformatics-text search that we hope will eventually gain unstoppable momentum thanks to our community backing. Mozilla Firefox competes with Microsoft’s Internet Explorer because it has user and developer community backing and that is worth consideration as a role model for ChemSpider and the chemistry world as a whole.

The turnaround that has occurred in the level of interest in having published materials text-indexed is highly significant in the long run, since thousands of references will pour into ChemSpider structure records to enhance the usefulness of the database.

These, of course, will be free for anyone to download, so will make a material contribution to the openness of chemical data (which is what I want Open Chemistry Web to be all about) as opposed to talking about definitions/licenses/copyrights and other such distractions (as I see them) surrounding open access and open data.

Some of the richest sources of chemical information are research group websites. Some time ago, I indexed primary literature PDFs from many such websites into the legacy (now non-existent) ChemRefer index.

I then received this correspondence from a major publisher. I submitted it to Chilling Effects to see what the various legal ins and outs of all of this meant.

Please read the letter and then the rest of this post.

It is worth pointing out here that the publisher may well have been right, but there is no way to confirm this since I am not (and should not be) able to access author-publisher contracts.

In any case, the result was that I stopped linking to research group website PDFs (the “just in case” approach). Was that the best course of action?  Comments welcome.

PNAS – National Academy of Sciences of the USA on Highwire:

78,379 articles indexed; 2 indices created to be linked into the literature search; linking strategy: URL; indexing type: link back to full text (chemical name to structure conversions to follow)


18,453 web pages scanned; ChemSpider structure records to be created; linking strategy: URL; indexing type: Chemical name, structure, synonyms and property data and link back to original web page.