I gave a presentation on text-mining and document mark-up at ACS Philadelphia today. I’m busy writing my talk for tomorrow now but there have been enough requests for today’s presentation already that it’s now online. I’ll blog later about details but here’s a summary:

1) PubMed is structure-searchable from ChemSpider…we’ve got about 800,000 structures deposited at present and will be streaming more in this week.

2) We are finishing up a project on chemical name extraction from documents SUBMITTED to our site: Word documents, RTF files and web pages. (NOT available to the public quite yet!)

3) We are supporting the NLM-DTD and have extended it to support chemical name markup, conversion to structures and integration with ChemSpider.

4) We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, and deposit their structures to ChemSpider under embargo with the article title, author list and a “fractional abstract”. When a publication goes live the author can log in and associate a DOI or a URL for the publication (for non-DOI based Open Access publishers); the structures and article details are then lifted from embargo and become immediately searchable by the public. This moves the task of structure validation to the shoulders of the author (who wants it right!), provides a platform for structure-identifier validation and enables NLM-DTD markup (with extensions) for reuse by other platforms.

5) We are investigating structure IMAGE conversion capabilities.

6) If we received community support for this it could be game-changing.
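The embargo workflow in point 4 can be sketched roughly as below. This is only an illustration of the idea, not how ChemSpider implements it: the class `EmbargoedDeposit`, the `release` method, the `searchable` helper and the sample values (including the DOI, which is a placeholder) are all hypothetical names made up for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EmbargoedDeposit:
    """Hypothetical record for structures deposited ahead of publication."""
    title: str
    authors: List[str]
    structures: List[str]                    # author-confirmed structures
    publication_link: Optional[str] = None   # DOI or URL, set once the article is live
    embargoed: bool = True                   # hidden from public search until released

    def release(self, publication_link: str) -> None:
        """Associate a DOI or URL with the deposit and lift the embargo."""
        link = publication_link.strip()
        if not link:
            raise ValueError("a DOI or URL is required to lift the embargo")
        self.publication_link = link
        self.embargoed = False


def searchable(deposits: List[EmbargoedDeposit]) -> List[EmbargoedDeposit]:
    """Return only the deposits that are visible to public search."""
    return [d for d in deposits if not d.embargoed]


# Usage: the deposit stays hidden until the author supplies a DOI or URL.
deposit = EmbargoedDeposit(
    title="Example article",
    authors=["A. Author"],
    structures=["CCO"],  # placeholder structure (ethanol as SMILES)
)
hidden_before = searchable([deposit])
deposit.release("10.1234/example")  # placeholder DOI, not a real article
visible_after = searchable([deposit])
```

The point of keeping the embargo as a flag on the deposit is that the structures, title and author list are already validated and stored before publication, so releasing them needs only the DOI or URL.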

Using Text-Mining and Crowdsourced Curation to Build a Structure Centric Community for Chemists

ChemSpider is a free access online structure-based community for chemists to research data and information. The database of over 20 million chemical structures and associated data has been derived from depositions by well over a hundred contributing data sources including chemical vendors, commercial database providers, web-based scraping of data and individual scientists looking to share their information with the community. Text-mining and conversion of chemical names and identifiers to chemical structures has made an enormous contribution to the availability of diverse data on ChemSpider and includes contributions from patents, open access articles and various online resources. This presentation will provide an overview of the present state of development of this important public resource and review the processes and procedures for the harvesting, deposition and curation of large datasets derived via text-mining and conversion.

Link to Presentation

3 Responses to “My Presentation on Text-Mining and Document Mark-up at ACS”

  1. Egon Willighagen says:

    I like point 4: this is the way it should be: integrate chemoinformatics into the publishing process! You might even add checking of spectra for structures, and other experimental evidence etc. Extracting info from experimental sections is certainly more difficult than processing names (but possible; check my blog for a Bioclipse plugin based on work by PMR’s group), but certainly more worthwhile too.

    Cheers!

  2. Tobias Kind says:

    Hi Tony,
    I can only congratulate you and your team and hope everything works well in the future. You and your team have done so much for small molecule sciences. This includes of course also the innovative pushes from eMolecules and PubChem. Many people from metabolomics and also companies are grateful for that. You can see that probably from your log files. I know this doesn’t buy you any coffee, but some spiritual support may not be wrong.

    “We foresee a situation where authors submit an article to our markup system AHEAD of submission to a publisher. We will validate chemical names, allow authors to confirm the structure-name associations, deposit their structures to ChemSpider under embargo”

    That is great news. I guess nobody knows the outcome, but that is a good start. Chemistry journals or authors may opt in at the beginning, or it can become a requirement for every chemistry-related publication. Nobody knows where this is going, but it’s the right direction. It will allow a semantic connection of content and molecules with (hopefully) free access also in the future.

    Kind regards
    Tobias Kind
    fiehnlab.ucdavis.edu

  3. Chem4Word Project from Microsoft and Murray-Rust at The ChemConnector Blog by Antony Williams - Observations and Musings for the Chemistry Community By Antony Williams says:

    [...] on from my presentation regarding text-mining and document mark-up at the ACS meeting in Philly it was interesting to see the announcement about the Chem4Word project [...]
