Archive for the Quality and Content Category
The RSC eScience Team has always been keen to get more links to literature references and we are currently engaged in work to extract much more information from the wealth of articles that are published in our journals (keep your eyes peeled for more information on this in the future).
The RSC now encourages authors for several of our journals to supply extra information, structures and spectra in their original file formats – which are attached to the article as supplementary information. Already we’ve seen several submissions of data that we have incorporated into ChemSpider records, both enriching the ChemSpider database and also showcasing the research of these authors through their publications. In this way, the RSC hopes to encourage the addition of reusable data files to the research paper as the start of its efforts to promote increased data sharing within chemical science research.
In a few short weeks we’ve received a number of submissions from authors that include key chemical structures as mol files and in some cases extra data including 1H and 13C NMR spectra as well as UV and IR spectra.
We’ve selected a few examples that show how this data not only enriches ChemSpider but, we hope, has benefits to researchers as authors and as consumers of chemical data.
Below are four articles for which we received additional supplementary files. The first two entries are from submissions where mol files were provided, which allowed us to deposit the structures and associate the article references with the ChemSpider records. The third and fourth entries are examples of submissions where spectra were also provided.
A closer look at the data
Taking this last example, let us investigate some of the benefits of supplying these files along with the submission:
1. With the mol file that was supplied we were able to create a new ChemSpider record (CSID 28945607) and then use the DOI of the article to insert the literature reference of the source article.
2. With the spectra files that were supplied we were able to add them to the ChemSpider record as interactive components; we hope that making the spectra interactive makes them easier to use. Let’s compare the PDF version of the 1H NMR spectrum with the ChemSpider version – the screenshots below are taken as they appeared on my screen (then scaled to 80% to fit in the blog post). At first glance, they seem similar but. . .
In the ChemSpider record you can use your cursor to easily select an area of interest and instantly see the peaks’ fine structure.
The authors have supplied a very high quality PDF, so you can zoom in on the PDF to get a better view of the splitting of the peaks, but there are always limitations. In the final image (below) we can see a comparison of the peak centred at 4.23 ppm. In the embedded spectrum on ChemSpider you can clearly distinguish the splitting – but looking at the same peak from the PDF we reach the limit of the resolution (this shot was taken when the PDF was scaled to 1200% in the viewer).
So by supplying their NMR data as supplementary information, the authors have made it easier to discover and use.
3. We can provide links to relevant sources and a comment that can contain extra information above any spectra or CIF files that are displayed in ChemSpider. In this way, rather than your article pointing others to useful data within it, you are using your data to showcase and point back to your article.
How can you get involved?
If you are already publishing in the RSC journals ChemComm, OBC, MedChemComm or Toxicology Research, the Author Revision email you receive will contain details about how you can supply extra Supplementary Information. If you have already had your article published (either with the RSC or another publisher) you can email us at chemspider-at-rsc.org and we can add data for you, or alternatively you can register for a ChemSpider account and add your own data at your leisure.
If you have any questions please do leave a comment (or email us directly). We look forward to hearing from you!
(This is not a post about carbohydrates, despite the title!)
Dodgy stereochemistry is a persistent problem. Even if someone knows all of the stereocentres in a particular molecule, they might not necessarily draw them in a way that a machine, or even a person, can interpret. There are rules about whether the pointy end or the blunt end of a bond indicates the stereocentre, and it’s surprising how often you see them done wrongly.
Today I’m going to talk about a particular IUPAC recommendation for drawing stereocentres that might at first glance seem surprising: the rule that you may only have one stereobond at a given stereocentre. If you have a wedged bond attached to an atom, you can’t have a hashed bond attached to the same atom. And vice versa.
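If your drawings are saved as mol files, this rule is easy to check by machine. Here is a minimal sketch, assuming a V2000 mol file, in which the fourth field of each bond line holds the stereo code (1 for a wedge, 6 for a hash) and the narrow end of the bond sits on the first atom listed. The function names and the toy five-atom molecule are made up for illustration; this is not how ChemSpider itself validates depositions.

```python
from collections import defaultdict

WEDGE, HASH = 1, 6  # V2000 bond-stereo codes: 1 = up (wedge), 6 = down (hash)


def parse_bonds(molblock):
    """Return (begin, end, order, stereo) tuples from a V2000 molfile string."""
    lines = molblock.splitlines()
    counts = lines[3]  # the counts line follows three header lines
    n_atoms, n_bonds = int(counts[0:3]), int(counts[3:6])
    first_bond = 4 + n_atoms
    bonds = []
    for line in lines[first_bond:first_bond + n_bonds]:
        begin, end, order = int(line[0:3]), int(line[3:6]), int(line[6:9])
        stereo = int(line[9:12].strip() or 0)
        bonds.append((begin, end, order, stereo))
    return bonds


def crowded_stereocentres(bonds):
    """Map atom number -> stereo codes for atoms carrying more than one stereobond."""
    at_atom = defaultdict(list)
    for begin, _end, _order, stereo in bonds:
        if stereo in (WEDGE, HASH):
            # In a molfile the narrow end of a wedge or hash is the first atom.
            at_atom[begin].append(stereo)
    return {atom: codes for atom, codes in at_atom.items() if len(codes) > 1}


def demo_molblock(stereo_codes):
    """Build a toy five-atom molfile: atom 1 bonded to atoms 2-5 (illustrative only)."""
    coords = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    lines = ["demo", "  hand-written", "",
             f"{5:3d}{4:3d}  0  0  0  0  0  0  0  0999 V2000"]
    for x, y in coords:
        lines.append(f"{x:10.4f}{y:10.4f}{0:10.4f} C   0  0  0  0  0  0  0  0  0  0  0  0")
    for other, stereo in zip(range(2, 6), stereo_codes):
        lines.append(f"{1:3d}{other:3d}{1:3d}{stereo:3d}")
    lines.append("M  END")
    return "\n".join(lines)


# One wedge plus one hash on atom 1 breaks the rule; a single wedge does not.
print(crowded_stereocentres(parse_bonds(demo_molblock([WEDGE, HASH, 0, 0]))))
print(crowded_stereocentres(parse_bonds(demo_molblock([WEDGE, 0, 0, 0]))))
```

The first call reports atom 1 with both codes and the second reports nothing. The check only looks at the narrow end of each stereobond, because that is the end that marks the stereocentre.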
Why is this?
You might think that as you’re supplying more information, you’re making the diagram easier to interpret. However, you’re running directly counter to the normal principles of communication. You’re being more informative than required, and this sets off alarm bells in the reader. What are you trying to say? If you ask a passerby the time and they say “Well, it’s half past six Greenwich Mean Time” you’re entitled to wonder why they’re quoting the timezone. Maybe they’re trying to be funny.
Paul Grice thought about this whole problem in the 1970s and came up with a set of four principles, summarized in maxims, that listeners (or readers) assume that speakers are following. These are they:
- Be Truthful. Do not say what you believe to be false. Do not say that for which you lack adequate evidence.
Let us hope that this one is implicit in any chemical drawing!
- Make your contribution as informative as is required. Do not make your contribution more informative than required.
If you have two methyl groups coming off an atom, do not make one wedgy and one hashy. You are adding no new information!
Do not mark carbons with the letter C unless your target audience is schoolchildren.
- Be relevant:
On the grand scale: do not illustrate an article with any old molecule—make sure the molecule mentioned is actually relevant.
On the scale of the drawing itself, however: If you have three bonds about an ordinary p-block atom, for example, make sure they’re at 120 degrees to each other. If they aren’t, for example if two of them are at right angles, the reader will infer that something odd is going on.
- Be clear:
Make sure all your double bonds actually look like double bonds rather than a single bond parallel to another single bond. I suspect a lot of the success of ChemDraw is down to the fact that it produces attractive, clear chemical drawings.
Do people ever flout the maxims on purpose?
Oh yes. People often flout the maxims when trying to be funny, or in a political interview. Similarly there are all kinds of Gricean violations in the chemical drawings you see in patents: bonds which do not quite extend all the way to atoms, R groups labelled as Y (particularly dangerous as Y is yttrium!) or Q or W (also tungsten) or some other unusual letter and so forth. Exactly why this happens so much more often in patents than in journal articles is left as an exercise for the reader.
You might not think so, but you’re very good at taking a two-dimensional drawing and converting it into a three-dimensional shape in your head. No, really, you are.
Take the drawing of galactose in Fig. 1. Even if you’re not a chemist, you can tell which bits of the ring are at the front and at the back, which bonds point up and which bonds point down. If you actually are a chemist, you’ve been trained to apply this geometrical intuition to work out what’s going on at each of the five stereocentres.
However, if you ask the InChI algorithm about the stereochemistry of this molecule, it’ll say that there is no stereochemistry in there and you’re looking at a stereoless description of which atom is attached to which. Since we use the InChI algorithm to say whether two records describe the same molecule, this puts us in a quandary, and there are thousands of entries in ChemSpider that come from just such a drawing and hence lack stereochemistry.
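You can see whether a record has lost its stereochemistry from the InChI string alone, because tetrahedral stereo lives in a `/t` layer and double-bond stereo in a `/b` layer. Below is a minimal sketch of that check. The function name is my own invention, and the two strings are the standard InChI for L-alanine and the same string with its stereo layers removed.

```python
def has_stereo_layer(inchi):
    """True if the InChI carries a tetrahedral (/t) or double-bond (/b) stereo layer."""
    layers = inchi.split("/")[2:]  # skip the version prefix and the formula layer
    return any(layer[:1] in ("t", "b") for layer in layers)


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
FLAT_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

print(has_stereo_layer(L_ALANINE))     # stereo layers present
print(has_stereo_layer(FLAT_ALANINE))  # connectivity only
```

A check like this only tells you that stereo information is absent. It cannot tell you whether the original drawing meant to convey any.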
We will soon be depositing data from the SORD database (Selected Organic Reactions Database) onto ChemSpider. This will be done as two separate but related datasets under the SORD data source: Reactants and Products. If you don’t know what SORD is then who better to explain than Dick Wife, the “host” of the SORD database. Dick wrote the article below to provide an overview of what SORD is…ENJOY!
The Selected Organic Reactions (SOR) Database: capturing “Lost Chemistry”
A new database is capturing the 80% of Lost Chemistry from theses and dissertations which doesn’t make it into publications, and chemists who contribute their data get access to the entire database for free.
SORD, an independent Dutch company, is carefully selecting the synthetic chemistry focused on Life Science research and making this chemistry available in their Selected Organic Reactions (SOR) Database. For the theses/dissertations which they select, SORD excerpts all of the reactions in the Experimental section. This means there will still be a small overlap of data with full publications. There will also be a larger overlap with publications such as Notes, Letters or Communications but these do not contain the experimental details. The SOR Database brings all this chemistry to the desktop, every last detail written by the author.
Some time back, SORD looked at around 300k interesting drug-like compounds in the literature and which countries they had come from, and the native language. The English-speaking countries accounted for only 37% of the total. German/Swiss dissertations are often written in English but this is new. The theses and dissertations in the other languages represent more than half of the total. SORD routinely translates German and French experimental texts into English. They are about to start on Chinese and Japanese translations and, if anyone can give them access to Russian theses, they will translate these as well!
A thesis or dissertation is the result of several years of hard work by a research student under the constant supervision of the research leader whose reputation is at stake if the work described is wrong or inaccurate. It is also examined by a committee who decide on awarding the degree, or not. They scrutinize closely the Results & Discussion as well as the Experimental sections. The chemistry is reliable.
Advanced Chemistry Development, Inc (ACD/Labs) is partnering SORD in developing this Database. The SOR Database is available for in-house use with ChemFolder Enterprise or on the Internet with ACD/Web Librarian™. This is a screen-shot of a typical SOR Database record in Web Librarian.
The Reaction Scheme shows every atom (there are no abbreviations). The Experimental text is edited to ASCII format and the key parameters (Reagent(s), Solvent(s), yield(s), MP(s) and Optical Rotation(s)) are displayed in separate Fields, as are the full bibliographic data, making data-mining possible. There is also a link which enables the user to bring up the PDF of each reaction containing all of the spectral and other physical data which SORD does not excerpt. The PDF-EX link is a powerful and unique feature of the SOR Database.
Now some explanation about SORD’s excerption rules. What they call the Reaction Scheme (A + B → C, etc.) contains only the reacting and product compound structures. A Reagent is an essential reaction component of which no part ends up in the product – if it does, it becomes a Reactant! When several reactions are performed before the product is isolated (and characterized) the Reagents and Solvents are listed in Steps. Failed reactions are not excerpted but reactions with poor yields are.
The SOR Database currently contains 170k reactions; the target is one million at the end of 2013. Even this number is a lot smaller than what you find today in the major commercial reaction databases. Back in the nineties, SORD researchers looked at one such large commercial database which then contained 9 million compounds. Sifting through the content for drug-like compounds resulted in just 450k or 5% of the records. Size is one database metric; quality is much more important! In the SOR Database, you will only find characterized products – and no polymers, or compounds with no molecular structure.
Users of the SOR Database also have access to the separate databases which contain the Reagents (ca. 3,000) and Solvents (ca. 450) which have been encountered so far. Often a Reagent is a catalyst (organic/organometallic) but they can also be simple entities like bases, acids, ammonium salts, etc. or complex chiral ligands. Authors give Reagents many different names and so each Reagent (and Solvent) in the SOR Database has been assigned a unique name. This enables rapid searches using the assigned names, again a novel feature of the database. Such searches can bring you to really nice chemistry.
As an Example, the second generation Grubbs olefin metathesis catalyst has been given the name Grubbs 2 catalyst. In the current SOR Database, there are more than 500 reactions where it has been used. Some of these are straightforward; some are not and generate novel ring systems like this one from the Martin group at North Carolina at Chapel Hill:
Searches in the Reaction Scheme, or using Reagent/Solvent names and hit refinement, bring you to new chemistry which until now was only found on a dusty shelf in a library. The “Lost Chemistry” is now getting smaller as SORD carefully selects and excerpts the reactions which deserve a new life. The SOR Database is essential for novelty searches and it is a powerful supplement for the other commercial reaction databases.
Finally some more good news for academic research chemists; your data will be readily accessible to the whole chemical world who will cite your work in their publications. The chemistry which you never published may be just what others are looking for. Routinely SORD excerpts the complete collection of theses and dissertations from research supervisors; they will be more than happy to see your work appear in the next SOR Database!
 de Laet, A.; Hehenkamp, J. J.; Wife, R. L. Finding Drug Candidates in Lost/Emerging Chemistry. J. Heterocycl. Chem. 2000, 37, 669–674.
I’m sure that by now everyone has noticed that the ChemSpider homepage design changed just over a month ago. A few features moved around, the Molecules of Interest section was retired and perhaps most significantly the Search box was given a dose of CSID: 5791, becoming bigger and more prominent.
The reason for this wasn’t just to make the site more attractive (though I think it does look ‘prettier’). Our motivation for the change is to deliver a site that is easier for users to interact with and understand and, by doing so, to make it quicker and simpler for you to get your tasks done using ChemSpider. The refresh of the homepage is hopefully illustrative of this: we think that, as most users come to ChemSpider to search for information, it should be easy to get straight into a search, hence the greater emphasis on this feature.
In the next few days we will release another upgrade to the interface which is centered on making it easier to understand the data presented in the compound Record View pages. I’ll post a blog entry dealing with some of the key features in the next few days.
The development of ChemSpider is an ongoing process, and we are aware that even after this upgrade there will be aspects of the compound Record View pages that will need more work (and also other parts of the site that still need development). It’s not going to be easy: ChemSpider brings together a rich and varied set of data from a large number of sources – this poses many challenges. We also realise that there are many different tasks that each of you – as users – want to perform, and it is always going to be difficult to reconcile all of the different opinions/needs.
However, we are trying to make the site better for you. And therefore, we’d really like to know your opinions on the changes (please test new features for a few days first). We welcome your feedback on the redesign either in the form of blog comments or email feedback (chemspider-at-rsc.org).
Over the next week – keep your eyes peeled for the upgrade and my accompanying blog post which will endeavor to give you a good introduction to the new features.
Earlier this month I reported on the integration of Infotherm to ChemSpider but at that time it would have been necessary for non-RSC members to pay for the data on Infotherm despite the fact that a search would have provided the links and you could have clicked through to the Infotherm data pages. Some good news from Fiz-Chemie though…they are waiving the fee for data on pure compounds accessed from ChemSpider and as a result giving access to over 200,000 tables of data. This is a great contribution to the community of ChemSpider users. Thanks Fiz-Chemie!
Last night I gave a presentation at the BAGIM meeting in Boston. The abstract is below together with the embedded presentation from Slideshare.
ChemSpider – Is This The Future of Linked Chemistry on the Internet?
ChemSpider was developed with the intention of aggregating and indexing available sources of chemical structures and their associated information into a single searchable repository and making it available to everybody, at no charge. There are now hundreds of chemical structure databases such as literature data, chemical vendor catalogs, molecular properties, environmental data, toxicity data, analytical data etc. and no single way to search across them. Despite the diversity of databases available online their inherent quality, accuracy and completeness is lacking in many regards. ChemSpider was established to provide a platform whereby the chemistry community could contribute to cleaning up the data, improving the quality of data online and expanding the information available to include data such as reaction syntheses, analytical data and experimental properties. ChemSpider has now grown into a database of almost 25 million chemical substances, grows daily, and is integrated with over 400 sources, many of these directly supporting the Life Sciences. This presentation will provide an overview of our efforts to improve the quality of data online, to provide a foundation for a linked web for chemistry and to provide access to a set of online tools and services to support access to these data.
We deposit a lot of data onto ChemSpider in a month and the database is growing daily. As an example of the ongoing depositions take a look at what has been deposited in a one month timeframe from July-August. This is simply what has been published by me…not all depositions. It’s a pretty good indicator of ongoing efforts to enhance the quantity of content on the site.
PubChem is a very large source of compound structures and data, but the quality and reliability of these can be variable. However, within it, some sets of compounds and substances could be trusted more than most because they’ve been deposited by reliable data sources – for example those deposited by the Nature Publishing Group that correspond to compounds in Nature Chemistry, Nature Communications and Nature Chemical Biology articles.
We have developed an automated method to search PubChem for substances deposited by the Nature Publishing Group, to extract their structures and properties in sdf format and then import them into ChemSpider. The result is a newly imported set of 5525 molecules in ChemSpider. These compounds have been deposited in PubChem since 2005 and originate from over 400 articles. All imported compounds link back to the original article – see below.
The process is automated and can be scheduled to scrape PubChem for newly deposited compounds, and stream these into ChemSpider so this subset will be updated regularly.
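For readers wondering what the import side of such a pipeline involves, here is a minimal sketch of reading an sdf file: each record is a mol block followed by named data items and ends with a `$$$$` line. This is not our production code. The function names, the sample records and the choice of the data fields `PUBCHEM_SUBSTANCE_ID` and `PUBCHEM_EXT_SUBSTANCE_URL` are assumptions made for illustration.

```python
import re

TAG = re.compile(r"<([^>]+)>")  # data header lines look like ">  <FIELD_NAME>"


def split_record(lines):
    """Split one SD record into its molblock and a dict of data fields."""
    end = next((i for i, line in enumerate(lines) if line.strip() == "M  END"),
               len(lines) - 1)
    molblock = "\n".join(lines[:end + 1])
    fields, name = {}, None
    for line in lines[end + 1:]:
        if line.startswith(">"):
            match = TAG.search(line)
            name = match.group(1) if match else None
            if name is not None:
                fields[name] = []
        elif line.strip() and name is not None:
            fields[name].append(line)
        else:
            name = None  # a blank line closes the current data item
    return molblock, {key: "\n".join(value) for key, value in fields.items()}


def read_sdf_records(text):
    """Yield (molblock, fields) for every record terminated by a '$$$$' line."""
    record = []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            if record:
                yield split_record(record)
            record = []
        else:
            record.append(line)


# Two empty placeholder records; the IDs and URL are invented for the example.
SAMPLE = "\n".join([
    "mol-1", "  sketch", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    ">  <PUBCHEM_SUBSTANCE_ID>", "1", "",
    ">  <PUBCHEM_EXT_SUBSTANCE_URL>", "https://example.org/article-1", "",
    "$$$$",
    "mol-2", "  sketch", "", "  0  0  0  0  0  0  0  0  0  0999 V2000", "M  END",
    ">  <PUBCHEM_SUBSTANCE_ID>", "2", "",
    "$$$$",
])

for molblock, fields in read_sdf_records(SAMPLE):
    print(molblock.splitlines()[0], fields)
```

Keeping the mol block and the data fields separate means the structure can be deposited while a field such as the article URL is used to build the link back to the source.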
This initial prototype could pave the way for other high quality, consistently formatted subsets of PubChem to be identified and deposited into ChemSpider in a similar way. To suggest other possible subsets of PubChem which could be used by ChemSpider join the discussion on the ChemSpider forum.
The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010
Following on from the last post regarding integrating to RSC Databases via the RSC Publishing Beta web services layer this post expands on the nature of the integration that we have been able to introduce. The RSC publishing beta gives us access to over 500,000 journal articles, book chapters and database records through one simple search interface. Using a similar approach to that outlined for the RSC database searches, that of using validated synonyms as the basis of the search for chemicals, we are able to search across the entire ePlatform of articles and retrieve hits as shown below. The hits are under the RSC journals tab.
Since the RSC publishing platform segregates the journals from the books the same search will return results from RSC books also. Our tests show that this is incredibly fast and highly accurate. This is our first venture into tapping into the chemical compounds sitting inside the RSC archive. More work is coming…
If you look at the tabs below you will also see that we have integrated to Google Books, Google Scholar and the Microsoft Academic Search. We are truly integrating to available internet resources to bring together the benefits of all of the primary search engines available.
The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010
The Royal Society of Chemistry has a whole series of databases. None of them have been structure searchable…until now. As with our PubMed integration and our Google Patents integration rolling out shortly, just because a database hasn’t had the chemical structures extracted and indexed doesn’t mean that those resources cannot be made “structure searchable”. It’s not a subtle distinction however, as discussed in the Google Patents blog post. These types of integrations depend on the correct association between chemical names and structures, access to an API allowing facile and flexible searching and, something that is purely serendipitous in nature, the absence of overlaps between chemical names and common language.
We have used the recently announced RSC Publishing beta platform and the API made available to us to enable the searching. As my colleague Graham McCann announced recently “(the) platform gives access to over 500,000 journal articles, book chapters and database records through one simple search interface. The new platform delivers faster browsing, intelligent searching and more intuitive navigation and is open for beta testing now.”
Our approach has been to search the title and the abstract for each of the databases for all of the validated identifiers. It works. It is FAST and it provides “structure-related” access to all six RSC databases. An example screen shot is below where a search on chlorobenzene retrieves data from each of the following databases: Mass Spectrometry Bulletin, Laboratory Hazards Bulletin, Methods in Organic Synthesis, Catalysts and Catalysed Reactions, Natural Product Updates and Analytical Abstracts. The screen shot below shows the analytical abstracts linked by the term chlorobenzene in the title or abstract itself: 284 hits in a fraction of a second. The abstract is linked out to the original article via DOI, where possible.
My personal favorites in the set of databases are the Natural Product Updates (NPU) and the Methods in Organic Synthesis (MOS) databases. The NPU database contains tens of thousands of natural product chemical structures, together with chemical names, references and some physical properties. Rich resources for ChemSpider. MOS includes reaction schemes, title and bibliographic details. Rich resources to connect to ChemSpider SyntheticPages in the future.
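To give a flavour of name-based searching, here is a minimal sketch that matches a set of validated names against titles and abstracts. It is not the RSC Publishing API. The record layout, the function names and the sample records are all hypothetical, and the matching rule (whole terms only, so that chlorobenzene does not hit dichlorobenzene) is my own simplification.

```python
import re


def compile_synonyms(validated_names):
    """One case-insensitive pattern matching any validated name as a whole term."""
    longest_first = sorted(validated_names, key=len, reverse=True)
    alternatives = "|".join(re.escape(name) for name in longest_first)
    # Reject matches glued to letters, digits or hyphens on either side.
    return re.compile(r"(?<![\w-])(?:" + alternatives + r")(?![\w-])", re.IGNORECASE)


def search_records(records, validated_names):
    """Return the records whose title or abstract mentions a validated name."""
    if not validated_names:
        return []
    pattern = compile_synonyms(validated_names)
    return [r for r in records
            if pattern.search(r["title"]) or pattern.search(r["abstract"])]


# Invented records standing in for database entries.
RECORDS = [
    {"title": "Determination of chlorobenzene in water", "abstract": "A GC method."},
    {"title": "Dichlorobenzene isomers", "abstract": "Separation of 1,2-dichlorobenzene."},
    {"title": "Solvent study", "abstract": "Reactions in Chlorobenzene were faster."},
]

hits = search_records(RECORDS, ["chlorobenzene", "phenyl chloride"])
print([r["title"] for r in hits])
```

The first and third records match and the dichlorobenzene record does not. A rule like this still cannot protect you from names that double as everyday words, which is why the absence of such overlaps matters so much.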
We have only just started to tap into the riches contained within the RSC archive. It’s like stumbling across a roomful of rubies to pick up diamonds. There is content all around us waiting for us to connect. We will connect this up to ChemSpider and make it available. Access to the databases will be shown at the ACS Meeting in San Francisco.
The functionality discussed below will be released at the ACS Spring Meeting during the week of March 21st 2010
We had previously released NMR prediction on ChemSpider as announced here. Based on community feedback we later removed that connection and had never reconnected, despite reported improvements. I am an NMR spectroscopist by training… if you check out my Mendeley profile you’ll see that the majority of my papers are NMR-based. Because I am an NMR jock and, despite working in cheminformatics, still keep my hands in NMR research (NMR prediction and computer assisted structure elucidation), I really wanted to make sure that we deliver NMR prediction via ChemSpider. I was involved with the development of the ACD/Labs NMR prediction tools for H1, C13, N15, F19 and P31 nuclei. There are a number of other NMR prediction modules on the market including those of Bio-Rad (in the Know-It-All package), Modgraph and certainly the work of Wolfgang Robien, one of the founding fathers of NMR prediction. These are primarily commercial packages.
In the background we have been working on the introduction of NMR prediction to ChemSpider in time for the ACS. We were looking for a platform that we could integrate that involved community deposition of data to ensure there was a growing database to enhance the prediction algorithms. We also wanted to know that the underlying data quality was good. We wanted to integrate to an Open system that had support from both an active community of participants as well as at least one developer who could provide support if we needed it. All of these criteria point to only one resource, NMRShiftDB. There have been some heated discussions, including on this blog, regarding data quality, especially in NMRShiftDB. However, I co-authored a paper with Chris Steinbeck and colleagues from ACD/Labs validating the dataset as well as ACD/Labs’ NMR prediction approaches.
NMRShiftDB is a high quality data set and certainly contains enough data to provide a training set for NMR prediction algorithms. The NMR predictions provided by NMRShiftDB are used by many people and overall feedback seems to be very positive. Based on our previous knowledge of the data in NMRShiftDB, and the availability of a well defined programming interface to connect ChemSpider, we have worked with Stefan Kuhn at the EBI to produce a first level integration.
As a result at the ACS meeting in San Francisco next week we will roll out NMR prediction integration. In keeping with the new layout model we have adopted for ChemSpider using tabbed approaches for display of data, we have bundled together all predictions. The first ACD/Labs tab provides access to ACD/Labs PhysChem properties, the EPI Summary provides access to the EPISuite and the NMRShiftDB provides access to the predicted NMR spectra. The left spectrum shows the Proton NMR spectrum and the right spectrum shows the C13 NMR spectrum.
When the system is fully integrated the process will work as follows. Since NMRShiftDB already contains many thousands of assigned spectra we will retrieve the experimentally assigned spectra directly and display them. When we cannot retrieve the experimental spectra then we will predict the NMR spectra and display them.
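In code terms the planned behaviour is a simple lookup with a fallback. The sketch below is only an illustration of that logic: the function name, the dictionary of assigned spectra and the stand-in predictor are hypothetical and are not part of NMRShiftDB or ChemSpider.

```python
def spectrum_for(structure_key, experimental, predict):
    """Prefer an assigned experimental spectrum; otherwise fall back to prediction."""
    assigned = experimental.get(structure_key)
    if assigned is not None:
        return "experimental", assigned
    return "predicted", predict(structure_key)


# Placeholder data: keys and shift lists are invented for the example.
ASSIGNED = {"structure-A": [1.0, 2.0]}


def fake_predictor(structure_key):
    """Stand-in for a call to a prediction service."""
    return [0.0]


print(spectrum_for("structure-A", ASSIGNED, fake_predictor))
print(spectrum_for("structure-B", ASSIGNED, fake_predictor))
```

Returning the source label alongside the spectrum lets the display make clear whether the user is looking at measured or predicted data.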
In the future we might pre-predict and store the NMR spectra for all structures on the NMR database. I am a little leery of doing this at present as we need to gather some basic feedback from the ChemSpider users regarding the performance of the NMR prediction algorithms and our existing implementation. When predicting NMR spectra across a database of this size, a lot of consideration has to be given to domain applicability, i.e. what subset of structures should be excluded from having NMR predictions performed? For example, organometallic complexes, free radicals etc. CAS likely had to take this type of issue into account when they applied NMR predictions to their CAS registry.
If there are other NMR prediction algorithms or databases that you would be interested in integrating into ChemSpider please contact me. If you are a cheminformatics vendor selling NMR predictions/databases we would be VERY interested in receiving JUST the structures from your NMR databases. We will deposit them and link directly to your product page as an indicator that you have NMR data available.
From the early days of the acquisition of ChemSpider by the RSC we have been focused on accessing the rich content that the RSC has contained in its databases and in its rich archive. We have been working hard for a number of months now to integrate systems, projects and processes into ChemSpider so that RSC chemistry is more discoverable. What we will be unveiling in the next few days we believe is big. We’ll roll it out one piece at a time. The last blog post discussed the deposition of new compounds from RSC prospected articles into ChemSpider. The email below results from the deposition of compounds from one article. One set of 10 structures from one article that are directly deposited into ChemSpider when the article goes live. These are compounds that are deposited and live immediately, not abstracted later. Imagine when we are doing this for all RSC articles, databases and books….
ALL of the compounds below are NEW to the ChemSpider database…every one of them. While not all RSC articles are only about novel compounds, clearly there are new compounds moving into the database from the RSC publications.
Dear RSC Prospect,
This email is to notify that your deposition (#3427) has been published. Below please find a list of links to the structures that belong to your deposition:
The structures link back directly to the RSC article via DOI as shown below.
We’ve taken the first step towards users being able to seamlessly bounce back and forth between finding compounds of interest using the ChemSpider search and selection tools and finding more information about them in RSC journals…
I’m pleased to announce that we’ve just switched on a deposition system which will take compounds from the prospected version of RSC articles as they are published and automatically deposit them into ChemSpider, making a link back to the original article from the new compound page. An example of a new compound is here which was generated when this article was prospected. The same deposition process is used to make links from existing ChemSpider compounds to new RSC articles, for example here was generated when this article was published.
This is basically a way to stick our toe in the water to investigate how much intervention and cleaning is necessary to deposit compounds when all the information that we have been storing for them is the InChI without any 2D layout information (which is an issue that other potential data sources may also face). To do this we’ve been making use of the ChemSpider webservices http://www.chemspider.com/InChI.asmx to download the mol files of InChIs already in ChemSpider, or using the InChItoMol webservice to generate new mol files where they don’t exist already. Tracking and fixing problems as they crop up at this manageable rate will help us when we face the larger task of importing all of the compounds that have been prospected in the past into ChemSpider.
In the recent rollout of functionality we added to the home page statistics regarding the number of various types of spectra that have been added to ChemSpider as well as updates of new data associated with data sources. We will likely optimize these displays further in the future but this is an initial display for the time being. It’s rather impressive how many different types of 2D NMR data are being uploaded to the database.
Following on from my earlier post regarding our interest in aggregating physicochemical data for other groups to use in building their models and algorithms we announce that we are now depositing the data from QSAR World into ChemSpider and pointing back to the original sources on QSAR World. We harvest the SDF files, deposit onto ChemSpider and provide direct links into the original SDF file, with the appropriate titles, so that our users can proceed to gather the data for re-analysis if they find it of interest. An example record is here for Atovaquone where we list the links to data residing on QSAR World for download. The links can be seen under the supplemental information section as shown below where you can see links to seven different types of data. We have chosen, for the time being, to not deposit the values associated with these data onto ChemSpider as the data are very heterogeneous in representation even though they are all delivered as SDF files.
As an active member of the Wikipedia Chemistry team I continue to be impressed with the dedication and commitment that the members have to improving the quality AND quantity of information available on Wikipedia for chemists. The number of hours of lost sleep freely given to the benefit of Wikipedia, and in this specific case to the chemistry community, is immense. The number of “Compound Pages” on Wikipedia dedicated to drugs/chemicals has continued to grow and, despite a sincere effort on our part to keep everything linked up from ChemSpider to Wikipedia, it’s a little like chasing the Road Runner…we’re always behind!
We have been working with the WikiChem team of late to embed links from Wikipedia back to ChemSpider. I am humbled to know that our hard work to establish ChemSpider as a source of quality information has reached a level of trust such that Wikipedia now links from the ChemBoxes out to ChemSpider. Hundreds of new links have already been established and more are being generated on an ongoing basis. Wikipedia User: Beetstra has written a ‘bot that is inserting ChemSpiderIDs across the database (see below) and we ARE doing rigorous checking of all of the links. This was done using a file that we generated on our side showing links to Wikipedia from ChemSpider.
We will then be able to generate a list of all ChemBoxes/DrugBoxes without links from Wikipedia to ChemSpider. We will make the links on our side, manually curating the structures, and then hand back a file to finish all linking. At this point we will have the backfile under control and we can perform ongoing updates as new compound pages are created on ChemSpider. If we curate and find errors on Wikipedia or ChemSpider, making a few manual edits is easy.
There are very dedicated teams on Wikipedia and ChemSpider carefully poring over data with their robots and eyeballs to create a linked data set of quality chemistry. It’s long, tedious AND important work. When it’s done we will have an expanded set of data to semantically link from RSC articles when we do markup.
Today I received an email via the CHMINF list server pointing to the following Press Release. Part of the press release is shown here:
“In collaboration with the German National Library of Science and Technology (TIB) Thieme is the first publisher to make primary chemistry data accessible worldwide. Analytical data, from various experiments, is the foundation of research work and scientific papers. From now on, primary data will be registered and made available online via the Thieme eJournals website (www.thieme-connect.com/ejournals) using digital object recognition in the form of Digital Object Identifiers (DOI). This will enable scientists to easily locate research articles, including accompanying data, and make enhanced use of the scientific content.”
There has been a lot of discussion over the years regarding making “primary data” available. We offered to do this on the ChemSpider Journal of Chemistry: if people wanted to submit analytical data with an article that we published then we would post them as spectra associated with the article. Unfortunately the general consensus, based on a few conversations that I had, is that it is a lot of work to prepare data and deposit it. This is one of the reasons that, until now, publishers have generally made the spectral data available as plots and printouts of the data. These data are generally made available as electronic supplementary data. These data ARE valuable even in that form but, and I believe that the majority of scientists would agree, they would be more valuable if they were available in a format that would allow display in online applets, download for processing, expansion and so on. The RSC would certainly welcome the availability of spectral data associated with publications, especially since they can now be hosted on ChemSpider.
Thieme have actually managed to pull off quite a coup and I commend them for their efforts. The first example datasets are available here. The listing includes “FIDs and associated files for the 1H, 13C and DEPT NMR spectra for compounds 14, (SS)-23, (SS)-25, (RS)-26, 27, (SS)-28, (RS,SS)-29, 30, (RS)-36, (SS)-36, (SS)-37, 38, (RS)-39, (SS)-39, (SS)-44, (RS)-46, (SS)-46, (RS)-48, (SS)-48, (SS)-49, 52, (RS)-53, (RS)-55, (RS)-57, (SS)-57, (SS)-58, (RS)-61, (SS)-61, (RS)-62, (SS)-62, (RS)-65 and (SS)-65 are summarized.” That’s a lot of data.
Since these are primary data they cannot be copyrighted so I chose to download the data, take a look and insert a couple into ChemSpider as an example of what can be done with these data. The associated PDF for the data says “The files can be processed using the following programs: MestReC, Bruker’s WINNMR and XWINNMR.” The files came as binary Bruker files so they needed to be reprocessed and, in order to be deposited, had to be converted to JCAMP-DX format, the format supported by the JSpecView applet used on ChemSpider to display spectra. In order to do this I am fortunate to have access to ACD/NMR Processor, a product I managed for a few years while working at ACD/Labs. This product also supports the Bruker format so I imported the data, processed it, exported it as JCAMP and imported it to ChemSpider. For compound 14 I have attached the H1 and C13 spectra and they can be seen here. I didn’t attach the “DEPT spectrum” yet. For me to download the spectra, redraw the structure, process the spectra, export as JCAMP and deposit to ChemSpider took about 15 minutes. However, there are a lot of spectra and it will take me a while. There are 32 compounds and I assume 3 spectra per compound (HNMR, CNMR and DEPT), so that’s a total of 96 spectra. It’ll take me about 10-12 hours just to deposit this collection so that’s a lot of work to do in my spare time. If anyone wants to help out and can process the spectra to deposit please do!
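As a rough check on that estimate, the arithmetic can be written out as below. The only figures taken from the post are 32 compounds, 3 spectra per compound and about 15 minutes for compound 14 (which covered two spectra); the assumption that the time scales with the number of spectra is mine and is only used to bracket the range.

```python
compounds = 32
spectra_per_compound = 3  # 1H, 13C and DEPT, as assumed in the post
total_spectra = compounds * spectra_per_compound

minutes_for_compound_14 = 15  # covered two spectra (1H and 13C)

# Lower bound: DEPT adds no extra time, so every compound takes 15 minutes.
lower_hours = compounds * minutes_for_compound_14 / 60

# Upper bound (assumption): time scales with the number of spectra,
# i.e. 7.5 minutes per spectrum across all spectra.
minutes_per_spectrum = minutes_for_compound_14 / 2
upper_hours = total_spectra * minutes_per_spectrum / 60

print(total_spectra, lower_hours, upper_hours)
```

The 10-12 hour figure sits inside the range this gives, towards the end where the DEPT spectra take as long as the others.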
One of the spectra is shown below using the Spectral Embed function we introduced previously:
This is a rich collection of data…it can feed the Spectral Game described in this article. I look forward to getting the data onto ChemSpider and will be following up with Thieme to see if we can work together to host the data in a more generic format for the future. It’s a shame that the data are locked into a binary file format that needs reprocessing to view and I believe display through the JSpecView applet is advantageous for all. I encourage Thieme to consider also making the structure collection available in molfile, SMILES, InChI and InChIKey format – the InChIs will make the article discoverable via internet searches and through the InChI Resolver while the download of molfiles will speed up the loading process to ChemSpider and other systems.
I gave a talk today at the ICIC 2009 meeting here in Sitges, Spain. It is an interesting meeting and I will report on some of the presentations later. I’m glad I am here. The presentation is here on Slideshare and is a modified version of a presentation I gave on Saturday at the Microsoft eScience conference in Pittsburgh. One of the questions that followed the presentation was in regards to whether ChemSpider could be used as a measuring stick for quality (I am paraphrasing). My response was that there are millions of errors on ChemSpider; that seemed to raise a giggle, and other people have since seemed surprised.
In my opinion, as shocking as it sounds, it must be true. Why?
There are almost 23 million unique chemical entities on the database. Many of them have multiple names associated, experimental properties, many have 10s of links to external databases. The structural layout has been created using algorithms. Algorithms have been used to generate systematic names. There are spectra submitted by the public and they can be mis-referenced, as an example, or declared to be run in one solvent and ACTUALLY run in another. There are sometimes multiple registry numbers associated with a compound…a CAS number for a salt associated with the neutral compound for example. The multiple links out to external resources number in the 10s of millions and these are changing daily as other websites and databases curate and edit their data. Errors are inevitable and, I judge, there must be millions of errors on ChemSpider. Just as there must be millions on Wikipedia and in the search results you get back from Google. The question is what counts as an error? I’m using a broad brush for an error…a structure with a poor depiction is an error. A misspelling is an error. A dead link to a database is an error. So…definitely millions. But we continue our work to whittle down the number, with the assistance of the community, every day. But we’re doing it while we are depositing new compounds onto the database so it’s an interesting challenge. Millions of errors doesn’t make ChemSpider less useful…we’re just realistic about the magnitude of the challenge!
In the history of developing ChemSpider we have undertaken some fairly demanding curation activities. For example, Vancomycin and Ginkgolide B. Now we are in the middle of trying to resolve the structure of Digitonin. There are 25 (!) skeletons for digitonin on ChemSpider from various sources. There were eleven compounds on ChemSpider called Digitonin. We have been able to clean most of these by removing partial stereochemistry. We are now left with three structures…simply search Digitonin on ChemSpider and you will see three structures with full, but different stereochemistry.
What is a “correct structure” is a matter of assertion. Who says what is correct? What publications, what techniques, what database, who says it’s correct? Structures have timelines…they can change with time as new analytical techniques are applied.
This is a call to the community to help resolve the existing confusions around Digitonin on ChemSpider…but they are out there in all the other databases also and there are discrepancies between Wikipedia, DSSTox, ChEBI, PubChem and so on. So, my call to the community…what is the correct structure of Digitonin and based on what assertions?
With this information in place, and assuming communal agreement on the conclusion, we can go help clean up the other databases. Help!
For those of you who have been using ChemSpider for the past few months you will be aware that historically we had an integration in place to SureChem’s Patent Portal. A few months ago that integration was unfortunately broken as SureChem improved their service. Also, we were un-synchronized with their growing set of chemical structures as they updated their patents. The previous integration was very limited in nature anyway as it simply showed the presence of patents associated with the ChemSpider structure in the SureChem database. Certainly a more ideal solution is the one that we introduced just in time for the ACS meeting in Washington.
The new solution not only lists the number of patents containing the chemical compound shown in the ChemSpider record but also shows the first 10 patents, by title, and provides direct link-throughs to the patents on SureChem. This is a much improved integration and we hope you enjoy it. The next stage is to deposit the latest SureChem structure collection, which has grown significantly since our last deposition. Thanks to our collaborators at SureChem for offering you, our users, access to their service.
A few weeks ago I noticed that PubChem had grown substantially after a deposition from the ZINC group. I had thought, incorrectly, that this was due to the deposition of protonated forms of the ZINC database because they produce such forms as part of their docking procedures. I had discussed this possibility with Evan Bolton from the PubChem team when we were at the InChI meeting in Glasgow. In fact, this was not due to the different protonation states but because ZINC had deposited 12M make-on-demand compounds that they hold in their catalogs. For me these are virtual chemicals. The vendors involved with the deposition of such chemistry into the ZINC database have done research to demonstrate that these chemicals, when ordered, would have a good probability of being synthesized using the chemistries involved, but they are, for the time being, virtual compounds only. In the early days of ChemSpider we went through a discussion internally regarding whether or not we should open ourselves to the deposition of virtual compounds and we did add a dataset from the UsefulChem team from Drexel University. Since then, however, we have steered away from the deposition of such libraries. As explained on the ZINC blog, a decision was made to remove 12 million of the make-on-demand chemicals as “Pubchem’s rules require that compounds have been made somewhere before they be included”. I’m fairly sure that what is left on PubChem does not fully exclude such compounds, as they are deposited by a number of vendors who have the ability to submit such collections, but I appreciate the effort made by ZINC to remove their deposition of this class.
I am interested in community feedback on this matter. Should ChemSpider host collections of virtual chemistry? There is certainly value for people who wish to perform such activities as virtual screening but we don’t allow downloads of our entire database the way that ZINC and PubChem do. We are focused at present on layering on more information associated with a chemical compound: physicochemical properties, spectra, article links, patents etc. We want to make sure that the chemistry represented in the backfile of RSC articles makes it onto ChemSpider in the future. This parallels some of the efforts being made by Fiz Chemie and InfoChem to make available the backfile of Chemisches Zentralblatt. We want to make sure that the compounds in the Natural Product Updates file from RSC make it onto ChemSpider. We have a lot to do but the focus is getting real data, real structures onto the database and removing “junk chemistry” from the deposited data. That said, we are interested in your comments. What are your thoughts regarding “virtual chemistry”? Should we support virtual compounds or not? For sure there will always be some virtual chemistry on there in some form – for example, products that were once thought to be elucidated but were later shown to be something else are virtual chemistry. Compounds that have been deposited with incomplete stereochemistry can be “partial chemistry” if you like. Your thoughts and comments are welcomed.
There have been other comments about Wolfram Alpha and its support for Chemistry (1,2 and others) but I have remained rather quiet until now about my experiences with Alpha for a couple of reasons. First of all I’d rather let the service settle down a bit before poking at it too hard. My experience of going live with ChemSpider was definitely that it takes a while to stabilize the system and address some of the earliest feedback. Also, knowing that I would be at Scifoo and aware that Theodore Gray would be there, I had hoped to see Alpha in action. I wasn’t disappointed. Yesterday Theodore drove the system in front of an audience including a number of interested scientists, members of Google, and Peter Murray-Rust and myself from Chemistry. Theo had no fear…essential for live demos. He was asked questions and he took the plunge, did the search and, with the rest of us, celebrated a successful search, a weird result and one that was just plain wrong. It was ALL good. I am impressed. I am impressed by what they are out to achieve with Wolfram Alpha. I am convinced that what they are doing with Alpha will contribute to science and mathematics in general and that Chemists will be using this system when they have more awareness of it.
For a general intro to Alpha see the presentation here.
So, some examples of interesting searches:
1) A guy in the room had asked the question “What is the largest land mammal?” and had not received an answer a few weeks earlier. Now Theo posed that question and got the answer here. Nice! Now, I took that to mean that they were keeping logs of failed queries and tweaking…confirmed by Theo. VERY nice.
2) Peter Murray Rust had previously blogged about bad results from his searches (searching on dibromoethane for example). When he repeated his searches in the session hosted by Theo he acknowledged that he was pleased that they had fixed the issues he had previously blogged about. This is how modern systems should be …moving quickly.
3) Searching on names…for example, what is the number of people with my name…my spelling is Antony NOT Anthony. See here for the results.
4) What is the return per employee for Google versus IBM. It’s in this query: http://www35.wolframalpha.com/input/?i=GOOG+IBM
5) What are the chemical structures of Taxol? Methamphetamine? Cholesterol? Buckminsterfullerene? You get answers for all. The organic molecules all give images of chemical structures. The connections in all cases are correct but I see no evidence of stereochemistry anywhere across the chemical structures on the database…it doesn’t mean it’s not there but I couldn’t find it.
So, for chemistry, am I impressed? Yes, I am. I’m not worried right now that Alpha is not dealing with stereochemistry…I am sure they will layer that on later. It is clear based on most of the results that I have seen that there is some GOOD curation of the data going on. According to Theo there are chemists on staff and they are curating the data coming in. Hallelujah! If you look in the Source Information for Taxol you see a LONG list of sources of chemical information and the primary source is the Wolfram Alpha Curated Data.
There is much that can be done to help Wolfram Alpha to have better Chemistry. They have a HARD job ahead of them if they are going to sample the Public Databases to grab quality chemistry. It’s in there for sure but it’s hard to find. What could come out of ChemSpider and Wolfram Alpha working together?
1) If we could get the list of “compounds” in Wolfram Alpha then we can provide chemical compound connection tables with all necessary stereochemistry etc.
2) When we pass back the compound list then we can pass back ChemSpider IDs and get them listed as identifiers alongside the PubChem CID. In theory it would be good to get these linked back to ChemSpider so that a user can come and find associated articles, analytical data, the wikipedia article, predicted and experimental properties and so on. This is where ChemSpider’s integration would be of value.
3) There is an opportunity to expand the chemistry in Wolfram Alpha by passing a subset of ChemSpider compounds to be added to Alpha. Certainly I don’t think that Alpha should host all 21.5 million of our compounds for the reasons I have enumerated many times on this blog. See my last post about the 54 versions of the Taxol skeleton…there should be only one Taxol. But, there may be a way to subset “important chemistry” and get it into Alpha. OR, maybe they do want it all?
There are clearly opportunities to help expand the chemistry and I hope we have the chance. I think Alpha is incredibly ambitious. But why not be ambitious? ChemSpider was ambitious too and look what we have done with three servers in a basement…that’s a whole lot fewer resources than Wolfram are throwing at Alpha. I want them to be successful…a computational engine for the public. Why not…so many of us are asking questions using search engines right now and can’t get anywhere near an answer…
STOP COUNTING the Number of Chemical Entities in Public Compound Databases and There are Ghosts in the Closet
Posted by: Antony Williams in Quality and Content
Let’s start off where I intend to finish. Bigger does not necessarily mean better. A large database of unique chemical entities does not necessarily mean a good database and accurate chemical representations of chemical entities can be pretty hard to find.
Few people realize how these simple statements are impacting the quality of what’s available online for chemists to use and how curation of data must occur in order to improve what’s available.
Now…what’s the basis for me to initiate this discussion and WHY would I prefer that ChemSpider was actually a smaller database?
Today on CHMINF Steve Heller posted the following review:
“PubChem is a search tool for chemical information, divided into three areas: Compounds, Substances, and BioAssays. Full entries provide detailed information with the most basic information – a general description, the molecular weight and formula, the structure, plus a Table of Contents (ToC) for the full entry – all easily found above the fold. Use the ToC or scroll down to retrieve more advanced information, such as bioactivity results, synonyms, chemical actions, detailed properties, and more. Each module is fully interlinked with the other sections of PubChem as well as resources in ToxNet and PubMed, providing full access to toxicology resources and the medical literature, and allowing users access to as much or as little of the chemical information as they need.
Author/Publisher: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health
Date reviewed: February 16, 2009
PS. PubChem now has 37,326,949 DIFFERENT structures”.
Bob Buntrock made the following statement “Re the PS below, I find it difficult to believe that PubChem has 37.3 million “different” compounds. The figures from the CAS website show 48 million organic and inorganic compounds which excludes sequences but includes polymers, alloys, coordination compounds, minerals, and mixtures. Since PubChem aims to cover “small molecules”, it would seem that many compounds in these last 5 categories would not be present. Therefore, I assume that a significant number of the 37.3 million PubChem compounds are redundant.” All hell broke loose with lots of posts discussing the uniqueness of chemical entities and the fact that PubChem compounds WERE unique. Okay, I’m not going to argue this for the moment but I am going to agree with Bob that a significant number of the compounds are likely redundant. It is ALSO true of ChemSpider. Why?
I could write a multipage blog post, and I have already discussed this issue many times on this blog, but I am clearly failing to communicate it. I’ll try again, and I refer you to previous posts about Taxol (1,2,3), Vancomycin (4) and Ginkgolide B (5,6). I suggest you read these earlier posts but will try to explain again anyway.
Some general statements. Many complex chemical compounds, especially natural products, have timelines. A compound when initially elucidated can give the connectivity only and get reported. Then stereochemistry might be layered on later, and reported. Then stereochemistry might be adjusted, and reported. Through this whole timeline the compound might be referred to by a particular chemical name….let’s call it Afonwenium. So, based on the timeline for this molecule there can be anywhere between 1-4 “versions” of the structure by that name. They are all unique chemical entities but the “final structure” is the one that people will want. It’s the one that should be represented on Wikipedia, the one that should correctly be drawn in all publications following the final elucidation report and assertion of structure and the one that should be found on many of the “reference” databases such as KEGG, DrugBank etc.
Search Taxol on ChemSpider and Taxol on PubChem and compare the number of structures you get. I judge that there are MANY unique chemical entities on PubChem that are MEANT to be Taxol but are not. And I don’t mean the ones that are named as “Taxol derivative”, I mean the ones that may have the SAME molecular weight, formula and connectivity but have DIFFERENT stereo – no stereo, MULTIPLE partial stereo and MULTIPLE full stereo. These issues exist for compounds like Ginkgolide B and Vancomycin and many more structures. There is of course only one Taxol, a compound registered by Bristol Myers Squibb and asserted to have a specific constitution.
Just out of interest let’s see how many compounds are on ChemSpider with a specific skeleton (ignoring stereo).
There are 54 compounds with the skeleton of Taxol: http://www.chemspider.com/InChIKey/RCINICONZNJXQF. These are all UNIQUE chemical entities but there are C-11 and C-14 labeled, Deuterium and Tritium labeled and so on. But there are over 30 compounds, without isotopically labeled sites, that still have the Taxol skeleton. Maybe some of these are meant to be Taxol with different stereochemistry but I judge that MOST of these are meant to be Taxol and are labeled as such but differ in terms of no, partial and full stereo at least. This is ONE example. To Bob’s question…is this redundancy? I say yes. How does this get solved? Curation will do it but it’s expensive and time consuming and the only way forward in my judgment is to crowdsource it. This problem is not going away anytime soon in PubChem or ChemSpider. We HAVE curated the name associations and removed the name of Taxol from all skeletons that are not the asserted form of Taxol. But the structures do remain on the database and link back to the original sources. We will be working on ways to show on every search that there are associated skeletons, compounds related by isotopic labeling and the status of no, partial and full stereochemistry. All to come…
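The skeleton search above relies on the layout of the standard InChIKey: the first block of 14 characters is a hash of the connectivity, while stereochemistry and isotopic labeling are hashed into the second block. A minimal sketch of grouping records by skeleton is shown below. The function names are my own, and the keys in the example list are illustrative inputs rather than a dump of ChemSpider records.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def skeleton_block(inchikey: str) -> str:
    """Return the first (connectivity) block of an InChIKey."""
    key = inchikey.strip()
    if key.startswith("InChIKey="):
        key = key[len("InChIKey="):]
    return key.split("-")[0]


def group_by_skeleton(inchikeys: Iterable[str]) -> Dict[str, List[str]]:
    """Group InChIKeys that share the same connectivity block."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key in inchikeys:
        groups[skeleton_block(key)].append(key)
    return dict(groups)


# Illustrative keys: two share the Taxol skeleton block quoted above and
# differ only in the second block; the third is the ethanol key.
example_keys = [
    "RCINICONZNJXQF-MZXODVADSA-N",
    "RCINICONZNJXQF-UHFFFAOYSA-N",
    "InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
]
example_groups = group_by_skeleton(example_keys)
```

Any skeleton block that maps to more than one key is a candidate for the kind of curation discussed here, since the entries differ only in stereo or isotope information.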
The ongoing “Bigger is Better” argument for Public Compound Databases is irrelevant at this point in my opinion. We can add 50 million new compounds with a simple enumeration exercise but would it bring any value? I say no. We can add virtual libraries from a number of our collaborators but I judge it to be of very limited value. The value of the Public Compound Databases is in what they connect to and whether there is an answer to a question at the end of the chain. If I search on a chemical and find it on ChemSpider but I cannot find a vendor for it, no analytical data, no properties of value, no manuscripts, no patents linked etc. then I have just done a search, found it on ChemSpider but have derived no value. We are working on increasing the VALUE of our content. Linking compounds to rich data sources, layering on additional properties, links to papers, blog entries and discussions and so on. If the result of a search is a hit but with no value, who cares? If the result of a search is a hit but with links to the wrong information, that’s worse. If I ask the question “What is Taxol” and get one hit I need it to be right. If I ask the question and get tens of hits, now what?
Curation has been underway for 2 years. We’re not finished. It’s a massive task. In reality it will NEVER be finished – new chemistry comes in every day and more information gets associated. We don’t have answers to all of the issues that exist around these diverse datasets but we are not naive in our understanding that our database is polluted with issues inherited from many other sources. We have marked tens of thousands of structures for deprecation. We have likely added information into PubChem that has contributed to the issue of data quality. But we are working on it.
Meanwhile errors that exist in PubChem are proliferating. A simple example is that of methane in PubChem that I have blogged about many times…one example here. Here are some of the names associated with the structure of methane on PubChem: 1,3-DICHLORO-PROPAN-2-ONE, diamond, charcoal and many tens of other incorrect names.
The National Cancer Institute’s Chemical Structure Lookup Service has over 46 million unique chemical entities and they have offered a series of services to search by InChI, name and many other queries. A posting to CHMINF outlined the service:
“Chemical Identifier Resolver (beta):
This service is a resolver for different chemical structure representations and identifiers, including those that do not carry any information about the structure itself. For instance, it can work as a Standard InChIKey Resolver, an NCI/CADD Identifier Resolver or a Chemical Name Resolver. The service also allows one to convert a given structure identifier into another representation or structure identifier.
Representations/identifiers supported are: Standard InChI/InChIKey, NCI/CADD Identifiers (FICuS, FICTS, uuuuu), SMILES, SDF, names, and a few other types of IDs. See the web page for more information.
For those identifiers that require lookup, the underlying database currently contains about 67 million unique structure records, from which the respective Standard InChIKeys and NCI/CADD Identifiers have been calculated. For lookup by chemical names, 68 million names associated with 16 million unique structure records are currently available in the database. The database continues to grow.
Closely related are the new capabilities of resolving/converting chemical structure identifiers by simply using a URL adhering to the following scheme: http://cactus.nci.nih.gov/chemical/structure/”structure identifier”/”representation”[/xml]
We just list a few examples here that should give you an idea of what’s possible with this service. For more detailed explanations, see the above web page.
Example: Standard InChI for chemical name string “aspirin”: http://cactus.nci.nih.gov/chemical/structure/aspirin/stdinchi
Example: Standard InChIKey of “ethanol” specified as SMILES string “CCO”: http://cactus.nci.nih.gov/chemical/structure/CCO/stdinchikey
Example: Unique SMILES string of chemical name string “benzene”: http://cactus.nci.nih.gov/chemical/structure/benzene/smiles
Example: SD File for chemical name string “morphine”: http://cactus.nci.nih.gov/chemical/structure/morphine/sdf
Example: Chemical names for Standard InChIKey “InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N” (Standard InChIKey of “ethanol”): http://cactus.nci.nih.gov/chemical/structure/InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N/names
Example: Synonyms for chemical name string “aspirin”: http://cactus.nci.nih.gov/chemical/structure/aspirin/names”
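The URL scheme quoted in that posting can be sketched as a small helper that assembles the address from an identifier and a representation. The function name resolver_url is my own, and the choice to percent-encode the identifier (so that SMILES characters such as “#” or “/” do not break the path) is an assumption rather than something stated in the posting. The helper only builds the URL and makes no request.

```python
from urllib.parse import quote

BASE_URL = "http://cactus.nci.nih.gov/chemical/structure"


def resolver_url(identifier: str, representation: str, xml: bool = False) -> str:
    """Build a resolver URL of the form BASE/identifier/representation[/xml]."""
    # Keep "=" readable so "InChIKey=..." identifiers match the posted
    # examples; everything else unsafe for a path segment is encoded.
    encoded = quote(identifier, safe="=")
    url = BASE_URL + "/" + encoded + "/" + representation
    if xml:
        url += "/xml"
    return url


# The examples from the posting, rebuilt with the helper.
aspirin_inchi_url = resolver_url("aspirin", "stdinchi")
ethanol_key_url = resolver_url("CCO", "stdinchikey")
ethanol_names_url = resolver_url("InChIKey=LFQSCWFLJHTTHZ-UHFFFAOYSA-N", "names")
```

Fetching any of these addresses (with urllib.request, for example) would return the requested representation as text, provided the service is reachable.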
Unfortunately polluted names are finding their way across all of these databases which is why a lookup on methane gives us: http://cactus.nci.nih.gov/chemical/structure/methane/names including in the list:
1-Chlorobenzylethyl-3,5,7,9,11,13,15-heptaisobutylpentacyclo[9.5.1.1(3,9).1(5,15).1(7,13)]octasiloxane, mixture of isomers
The CAS database is highly curated, not without errors, and built up using robots and eyes. Public Compound Databases are built with the best intent and are useful. But they are not curated and are polluted. Bigger does NOT mean better and care is warranted. ChemSpider will likely stay smaller than many of the other Public Compound Databases moving forward as we remain focused on adding value and addressing the issues of inherited and future quality. It’s a long journey…