Archive for the Uncategorized Category

Jean-Claude Bradley recently blogged about an all-too-familiar issue, which you will need to read about there to understand this post.

They’ve done the work; now they have to keep track of what they’ve done in one place, in an updatable way, without creating a bureaucratic headache. At ChemSpider, organising data in a cost-effective manner has caused headaches every now and then, though it has also been one of the biggest drivers of innovation.

I have only followed it from time to time, but the activity around what is called ‘Open Notebook Science’ is reaching an industrial scale.

The ONSchallenge site, which I looked at most closely, is annotated with links to comprehensive experimental, structural and relevant external data sources, yet it is still reader-friendly.

So anyway, after some discussion over at Usefulchem, I wrote a small script to collect Google Spreadsheet URLs (those that link directly to the XLS format) and associate them with the wiki page from which they are linked.

Initially this covers only the first level of connections (starting from this base URL), and no extra metadata such as InChI is collected... yet. But we have to determine the best format for this service before it is scaled.
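The collection step can be sketched roughly as follows. This is a minimal illustration in Python, not the script itself: the rule for recognising an XLS export link (a Google spreadsheet host plus "xls" in the URL), the function names and the sample wiki URL and key are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def is_xls_export(href):
    # Assumption: an export link names a Google spreadsheet host and an xls format.
    lowered = href.lower()
    return "spreadsheets.google.com" in lowered and "xls" in lowered


class SpreadsheetLinkCollector(HTMLParser):
    """Collect spreadsheet export links found in <a href="..."> tags."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and is_xls_export(href):
            # Resolve relative links against the page they were found on.
            self.links.append(urljoin(self.page_url, href))


def collect(pages):
    """pages maps wiki page URL -> HTML; returns (page URL, spreadsheet URL) pairs."""
    rows = []
    for page_url, html in pages.items():
        parser = SpreadsheetLinkCollector(page_url)
        parser.feed(html)
        rows.extend((page_url, link) for link in parser.links)
    return rows


# Hypothetical first-level page; a real run would fetch the HTML from the wiki.
sample_pages = {
    "http://wiki.example.org/Exp001": (
        '<p><a href="http://spreadsheets.google.com/pub?key=KEY1&amp;output=xls">data</a> '
        '<a href="/Exp002">next experiment</a></p>'
    ),
}
rows = collect(sample_pages)
```

Each pair in `rows` could then be written out as one row of the generated spreadsheet, with further columns (such as InChI) added later.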

For example, a web service might be a good idea (but it could face spam requests), or perhaps a locally run script (which would assume that configuration is convenient for every potential user).

The generated Excel spreadsheet is << here >>.

The literature search now has a (partially enabled) feature which extracts references from documents where the author has cited themselves. It also counts citations of that reference in other documents.

This allows users to follow a chain of research backwards and forwards (as cited in links will always be in the future, with references being in the past).
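The backwards and forwards links are two views of the same data. A minimal sketch in Python, using hypothetical article identifiers and ignoring the self-citation filter mentioned above, shows how a table of references can be inverted to give the “cited in” links and their counts:

```python
from collections import defaultdict


def build_cited_in(references):
    """Invert {article: [references it cites]} into {reference: [articles citing it]}."""
    cited_in = defaultdict(list)
    for article, refs in references.items():
        for ref in refs:
            cited_in[ref].append(article)
    return dict(cited_in)


# Hypothetical identifiers; the year in each name only makes the direction obvious.
references = {
    "paper_2008": ["paper_2005", "paper_2001"],
    "paper_2005": ["paper_2001"],
    "paper_2001": [],
}

cited_in = build_cited_in(references)
# Number of indexed documents citing each reference.
counts = {ref: len(citing) for ref, citing in cited_in.items()}
```

Following `references` leads back in time, while following `cited_in` leads forward; the counts only reflect documents that have actually been indexed.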

These features are still in testing but are already live for users of the text search. For reference, an example search:

Suzuki coupling

There is no engineering to make cited papers ‘preferred’ by the search engine in terms of ranking as, whilst the association between keyword and relevance is strong, it is not clear (to me) that this is true for the association between citation count and relevance.

In any case, as we have only indexed a couple of hundred thousand documents so far, we don’t have a big enough sample for the ‘cited in’ counts to be comprehensive. Their main use is for following chains of research.

The selection of sources we can full text search is growing but is still limited by publisher permission.

However, the references within those papers, which obviously come from a much wider set of sources, are available to us, and now you can search them; see the demo << HERE >>.

Click on the example on the search form for a usage guide. Feedback appreciated.

Again, full integration into ChemSpider/ChemMantis and the addition of other sources are pending (and running all this off one laptop for too long is not advisable).

This makes article metadata for over a million articles searchable in a structured way.

The << literature search >> capability has some new features, such as a “cited in” link beneath search results. This represents our efforts to incorporate Google Scholar-like features into ChemSpider using text indices. We have indexed 230K articles into this new framework so far and would appreciate your feedback.

A demo is online << HERE >> and it is currently covering the RSC free access set and the JBC. Other sources will be re-added from the old search over the coming weeks and there are new features planned (time and resources permitting) including:

- ability to search for references within articles

- references extracted from articles displayed in search results and links to which other articles they are also cited in

- filtering results by fields such as author / journal name

Full integration into ChemSpider and ChemMantis is also anticipated.

One of the most untapped and certainly unsearched sources of chemical literature on the web is journal articles in image PDF format. I am using ImageMagick and Tesseract to get round this but (having no experience of ‘image indexing’) I am discovering how memory intensive this process is and it is painfully slow.
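As a rough sketch of that pipeline (in Python, and not the setup actually in use), the two tools can be chained by rasterising the PDF with ImageMagick and handing the image to Tesseract. The file names, the 300 dpi setting and the helper names are assumptions for illustration; the executable is `convert` in ImageMagick 6 (newer releases use `magick`), and older Tesseract releases may need one uncompressed TIFF per page.

```python
import subprocess
from pathlib import Path


def build_commands(pdf_path, work_dir, dpi=300):
    """Return the ImageMagick and Tesseract command lines for one PDF."""
    pdf = Path(pdf_path)
    tif = Path(work_dir) / (pdf.stem + ".tif")
    out_base = Path(work_dir) / pdf.stem  # tesseract appends ".txt" to this base name
    # -density sets the rasterisation resolution and must come before the input file.
    convert_cmd = ["convert", "-density", str(dpi), str(pdf), "-depth", "8", str(tif)]
    ocr_cmd = ["tesseract", str(tif), str(out_base)]
    return convert_cmd, ocr_cmd


def ocr_pdf(pdf_path, work_dir):
    """Run both steps; requires ImageMagick and Tesseract to be installed."""
    for cmd in build_commands(pdf_path, work_dir):
        subprocess.run(cmd, check=True)
    return Path(work_dir) / (Path(pdf_path).stem + ".txt")


# Only build the commands here; nothing is executed until ocr_pdf() is called.
convert_cmd, ocr_cmd = build_commands("article.pdf", "ocr_out")
```

Rasterising at a high density is a large part of the memory cost, so processing one article (or one page) at a time keeps the peak usage down at the expense of speed.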

A source which would obviously benefit from this is the Acta Chemica Scandinavica archive put together by the Danish, Swedish, Norwegian and Finnish chemical societies which has extremely high quality image PDFs that lend themselves readily to this process. We will then have full text searchable functionality for this archive – will be interesting to test the quality of the free tools I am using for this as well. Could take weeks though!

A recent article in Chemistry World highlighted the declining support for PhDs and postdocs. The explanation given is that this is a result of “changes to the way research costs are calculated”.

Can’t argue with that, but it is also true that if PhD/Postdocs were all that valuable, industry would fund them regardless. The truth about chemistry degrees, at least in England, is that they are all about sticking to the recipe given out by the lecturer. “Experiments” consist of following a procedure to the letter. Whilst this is a valuable skill to have, these are not “experiments” at all. In fact they are anti-experiments because should you deviate from the plan given to you, your grades will fall. If you do not *think* or question a procedure, you will get straight As.

Given that it is the straight A students who are likely to go on to do PhDs/postdocs, these qualities will persist at that level, and industry is right to be unimpressed by this, since what industry cares about is the ability to think and innovate.

Vested interest: I only got a 2.2 so maybe I would say this ;)

PHP is great. This quote which appears, ironically, on explains why in a nutshell:

“I think PHP is great if you don’t wanna spent alot of time and ENERGY to become a web developer and still have some power”

Now, the literature indexing is built partly with PHP for this very reason. I am not a programmer, but I want to program and I want to do it quickly, because it’s just a means to an end, i.e. building an index. Whether that’s an index of data from articles, CIFs, catalogs etc. is immaterial with PHP. So it seems to me that for librarians or information professionals anywhere, this is a great tool, and you don’t need any extra money (PHP is free, as is the Apache web server). Just determination and constant access to this resource: .

Of course, ChemSpider is .NET and so this can create some difficulties whenever something indexed by me has to be implemented at and I rarely have any idea what they are talking about when I hear words like SQL Server and so on. On the whole though, I am increasingly using a combination of free and came-with-my-computer-Microsoft tools. e.g. the indexing runs on a WAMP server.

The indexing code uses only basic commands, e.g. this one, which matches all URLs on a page that include “/catalog/”:


// $get holds the page HTML; the matched URLs end up in $outurls[0]
preg_match_all("@/catalog/[^\"]+@", $get, $outurls);
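For readers who do not use PHP, the same match can be sketched in Python. The pattern is the one above; the function name and the sample HTML are made up for illustration.

```python
import re

# Same pattern as the PHP call: "/catalog/" followed by everything up to the next double quote.
CATALOG_URL = re.compile(r'/catalog/[^"]+')


def catalog_urls(html):
    """Return every substring of html that starts at "/catalog/" and runs to the closing quote."""
    return CATALOG_URL.findall(html)


# Hypothetical page fragment.
page = (
    '<a href="/catalog/item-1">one</a> '
    '<a href="/about">about</a> '
    '<a href="/catalog/item-2?x=1">two</a>'
)
urls = catalog_urls(page)
```

As in the PHP version, each match starts at “/catalog/” rather than at the beginning of the URL, so a host name in front of it is not captured.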


So, what does this mean for chemistry libraries? Well, having someone at your library with this know-how is a must. Indexing and organising literature, and complementing it more effectively with data indexed from the WWW, can help a library make up for the fact that it cannot afford all the subscriptions it would like. And, in an era where these once complex, labour-intensive and expensive activities are now free and dynamic, even the smallest library can use this to its advantage. You’re not getting any more money in your budget and your subscriptions aren’t getting any cheaper? … Well, the solution is still free.

So far in terms of indexing, these are complete:

Hindawi, Electrochemical Science Group, Repositorium (Universidade do Minho Eprints), Medknow, MDPI

The next few to go:

ACBI, IUCr, PubMed Central*, PubMed**, Bentham***, Nature****

* Full text indexing for Open Access list only; ** Bibliographic data and chemical names only; *** Bibliographic data only except Bentham Open (Full Text Indexing); **** OTMI

All sources are full text indexed unless otherwise stated.

The open access debate has recently focused on the following aspects specifically:

- redistribution of materials

- the scholarly definition of the term “open access” (assuming one is attributable)

- who has the copyright (publisher or author)

- use of terms such as ‘open’ in marketing of services

Now I’m sat here (without access, by the way) and I am thinking that these are the concerns of people who ALREADY have access to the articles that they need. Do I ask myself whether I can redistribute an article, or whether use of the term “open” is justified, when I have a page with an abstract and a “buy now” button beneath it? I do not. This is because I cannot get to the thing in the first place. So all I care about is free full text literature: the real open access, serving the needs of those who cannot get to the literature, rather than the people with the Athens password who worry about redistribution and pedantic definitions of terms that were only ever meant to describe a general concept.

So, far from blowtorching publishers for providing free full text access services because a copyright policy is not quite to my liking (<violins>), a bizarre but common phenomenon in the chemistry blogosphere, we should applaud the ACS and Nature Sample Issues, the RSC Free Access and the Springer Open Choice (and many more besides), alongside the more generous BioMed Centrals and Hindawis, for giving us a chance to get a look at their materials.