Archive for the ChemSpider Chemistry Category

My colleague Will originally developed the ChemRefer service. When ChemSpider started up, Will brought the ChemRefer technology with him and joined us to help expand the capabilities of our services. We integrated ChemRefer and released the text searching capabilities. Will indexed more and more journals and grew the index by hundreds of thousands of articles. Unfortunately, the downside was that the speed of the search decreased dramatically. We also kept hearing the comparison with the Google service, whose advantage was in its citations. So Will has taken a few months off from indexing and has focused on dramatically improving the speed of searching as well as implementing a system for recognizing citations. The system has been made available online for beta-testing just in time for the ACS meeting here in Salt Lake City BUT it is not yet integrated into ChemSpider.

I have performed some basic tests, focused initially on searching chemical names. The literature search on ChemSpider has a lot more journals indexed, but in order to perform the comparison I searched ONLY the RSC and Journal of Biological Chemistry articles since that is all we have indexed so far on the new system. The search results were as follows. The numbers compare the number of hits for the old versus the new literature search. The new search has also indexed the latest RSC and JBC articles so, in theory, it should provide more hits.

Searching on Taxol: 626 hits found in 22 seconds (OLD) vs 717 hits in 1 second (NEW)

Searching on phenolphthalein: 47 hits found in 5 seconds (OLD) vs 1514 hits in 1 second (NEW)

Searching on benzene: 846 hits found in 75 seconds (OLD) vs 15260 hits in 4 seconds (NEW)

Clearly the searches are MUCH faster with the new system but it is also returning many more results. These are very early results and we will explain more about the system, the results and our future development shortly…
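To put the three comparisons side by side, here is a quick back-of-the-envelope calculation in Python. The figures are the ones quoted in this post; the names `RESULTS` and `compare` are just for illustration.

```python
# Figures quoted in the post: (old_hits, old_seconds, new_hits, new_seconds)
RESULTS = {
    "Taxol": (626, 22, 717, 1),
    "phenolphthalein": (47, 5, 1514, 1),
    "benzene": (846, 75, 15260, 4),
}


def compare(old_hits, old_seconds, new_hits, new_seconds):
    """Return (speed-up factor, hit-count ratio) for the new search vs the old."""
    return old_seconds / new_seconds, new_hits / old_hits


for name, row in RESULTS.items():
    speedup, hit_ratio = compare(*row)
    print(f"{name}: {speedup:.1f}x faster, {hit_ratio:.1f}x as many hits")
```

These are single timings per query, so the ratios are only a rough indication.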

Try out the new system here for now and send us feedback at info@chemspider.com. Thanks


We continue to expand the ChemSpider Database with new depositions sourced from various collaborators. We are especially privileged to have received the RSC’s structure collection associated with their Project Prospect articles and have spent a couple of weeks working with the data prior to depositing it onto ChemSpider. During the deposition process we have formed the link between the chemical structures and their articles via a DOI link. We have been able to deposit the title, an associated author and the DOI. In this way we have been able to link thousands of chemical structures to articles on the RSC website. On each record associated with an RSC article you will see both a link from the data source table and a link via DOI from the reference as shown here and in the figure below.

With the RSC depositions came many beautiful structures - highly symmetric, complex and just plain “pretty” to a chemist. But a high level of complexity also arrived with the collection. While many InChIs could be converted to their associated connection tables, the act of converting the InChIs could add additional stereochemistry and structure cleaning could change stereochemistry, so this was a long, tedious and mostly manual process I’m afraid. Nevertheless, it is a wonderful addition to the ChemSpider database and our sincere thanks, on behalf of the community too, go to the Royal Society of Chemistry for sharing their data with us. The InChIs will be deposited into the InChI Resolver shortly.


There are some interesting articles showing up on ChemSpider from across the blogosphere. We have just added to our list of high priorities the generation of an RSS feed of structures, short descriptions and ChemSpider IDs so that anyone can access them. When we add new descriptions we will add snippets to the RSS feed.

New Articles include:

Teen Chemist and Splenda

A Discussion about the Synthesis of Spirangien A from the TotallySynthetic Blog by Paul Docherty

A Discussion about the Synthesis of Omaezakianol from the TotallySynthetic Blog by Paul Docherty


We’ve been working with Jean-Claude Bradley and his Open Notebook Solubility Challenge group to assist where we can. This has included enhancing some of our services (though there is more work to be done…), populating data into ChemSpider and, now, linking us up to the Data Tables built by Andy Lang (of The Spectral Game fame…we’re quite a team).

The Open Notebook Solubility Challenge is described here. The present list of compounds for which we have created the integration to be described below is here. When you open that link you’ll see the first bunch…notice the little icons showing patent links, Wikipedia links and the presence of spectra on those records.

What we have done now is deposit the links into the Data Source tables for these compounds and provide the direct link to the ONS tables. They can be viewed WITHOUT leaving the site simply by hovering over the link…OR you can click on the link to view the data directly. An example of the link view is shown below. To find these tables simply look up the Open Notebook Solubility Challenge data source in the table.

 


Late nights and ailing computers aren’t conducive to the best of work. So, when I posted about the clean chemical structure I obtained using ChemDraw I was genuinely excited about the quality of clean-up that was produced. However, I slept on it and reminded myself to check that the output InChI was equivalent to the input InChI, as my experience with structure cleaning is that it can swap stereocenters.

So, I returned to that particular problem and looked specifically at the InChI string fed to ChemDraw to convert, and then converted the resulting structure to an InChI in ChemDraw. So, to clarify, this was all done inside the package.

Here’s the stereo layer of the input structure:

/t35u,36u,37u,40-,41+,42+,43-,44+,45-,46-,47+,48u,52-,53-,54-/m0/s1

and the stereo layer of the output InChI

/t35-,36-,37-,40+,41-,42-,43+,44-,45+,46+,47-,48-,52+,53+,54+/m1/s1/
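A quick way to see what changed between the two stereo layers is to compare them center by center. The following is a minimal Python sketch: it is a naive string comparison of the /t layer only, it makes no attempt to interpret the /m and /s layers, and the function names are illustrative rather than part of any InChI toolkit.

```python
import re

# The two stereo layers quoted in the post.
INPUT_LAYER = "/t35u,36u,37u,40-,41+,42+,43-,44+,45-,46-,47+,48u,52-,53-,54-/m0/s1"
OUTPUT_LAYER = "/t35-,36-,37-,40+,41-,42-,43+,44-,45+,46+,47-,48-,52+,53+,54+/m1/s1/"


def parse_t_layer(text):
    """Map atom number -> parity mark ('+', '-', 'u' or '?') found in a /t layer."""
    match = re.search(r"/t([^/]*)", text)
    if match is None:
        return {}
    pairs = re.findall(r"(\d+)([+\-u?])", match.group(1))
    return {int(number): mark for number, mark in pairs}


def compare_t_layers(before, after):
    """Return (centers that went from undefined to defined, centers whose mark changed)."""
    old, new = parse_t_layer(before), parse_t_layer(after)
    newly_defined = [n for n in sorted(old)
                     if old[n] in "u?" and new.get(n) in ("+", "-")]
    changed = [n for n in sorted(old)
               if old[n] in "+-" and new.get(n) in ("+", "-") and new[n] != old[n]]
    return newly_defined, changed


newly_defined, changed = compare_t_layers(INPUT_LAYER, OUTPUT_LAYER)
print("undefined -> defined:", newly_defined)
print("mark changed:", changed)
```

For the two layers above this lists centers 35, 36, 37 and 48 as going from undefined to defined, and the marks on the remaining eleven centers as changed. The sketch does not decide whether the absolute configuration is actually different, since that also depends on the /m layer, which went from /m0 to /m1.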

This is the name of the structure generated by converting the original InChI to a structure and generating the name using nomenclature software: (4R,5E)-4-{[(1E,2S)-2-{[(E)-{2-[(1S)-1-amino-2-methylbutyl]-4,5-dihydro-1,3-thiazol-5-yl}(hydroxy)methylidene]amino}-1-hydroxy-4-methylpentylidene]amino}-5-{[(1E,2S)-1-{[(1E,3S,4E,6R,7E,9S,10E,12R,13E,15S,16E,18R,19E,21S)-18-(3-aminopropyl)-12-benzyl-15-(butan-2-yl)-6-(carboxymethyl)-2,5,8,11,14,17,20-heptahydroxy-3-(2-hydroxy-2-iminoethyl)-9-(1H-imidazol-5-ylmethyl)-1,4,7,10,13,16,19-heptaazacyclopentacosa-1,4,7,10,13,16,19-heptaen-21-yl]imino}-1-hydroxy-3-methylpentan-2-yl]imino}-5-hydroxypentanoic acid

This is the name of the structure generated by naming the structure produced by ChemDraw resulting from reversing the original InChI:

(4R,5Z)-4-{[(1Z,2S)-2-{[(Z)-{(5R)-2-[(1S,2R)-1-amino-2-methylbutyl]-4,5-dihydro-1,3-thiazol-5-yl}(hydroxy)methylidene]amino}-1-hydroxy-4-methylpentylidene]amino}-5-{[(1Z,2S,3R)-1-{[(1Z,3S,4Z,6R,7Z,9S,10E,12R,13Z,15S,16Z,18R,19Z,21S)-18-(3-aminopropyl)-12-benzyl-15-[(2R)-butan-2-yl]-6-(carboxymethyl)-2,5,8,11,14,17,20-heptahydroxy-3-(2-hydroxy-2-iminoethyl)-9-(1H-imidazol-5-ylmethyl)-1,4,7,10,13,16,19-heptaazacyclopentacosa-1,4,7,10,13,16,19-heptaen-21-yl]imino}-1-hydroxy-3-methylpentan-2-yl]imino}-5-hydroxypentanoic acid

Check out and compare the names…look at the difference in stereocenters. Maybe there is something I am not doing correctly that is causing this effect. I am presently communicating with CambridgeSoft on this point to see if there is some setting I am missing that retains stereochemistry. This is exactly the issue I see with InChI reversals and CLEANING in other applications unfortunately. I will report back when I determine what the optimal settings are to stop such issues, if indeed they can be prevented.

 


I’ve been fighting with technology today. I opened my computer at 7am and the nightmares started…..40 minutes to boot, 20 minutes to open my Outlook PST file and that’s where we stay. The CPU is pegged at 95% while Outlook is open. I have scanned the PST file to fix it and spent hours defrag’ing and blah, blah, blah. Looks like a reformatting job is coming…fortunately for me blogging and ChemSpider are all web-based so I can do some catching up tonight…

Some fast comments …

We’ve been adding new blog posts into some of our records…we can do this with your material if you want a larger audience and preservation moving forward. Some TotallySynthetic blog posts are here (1,2), along with a fun posting from J on Bromination.

We have agreement from NIST to use a “small slice” of the NIST Webbook data and are adding IR, MS and UV-vis data onto ChemSpider at present. See the spectra for Cholesterol here


InChIs are a powerful way to communicate chemical structures. They are going to enable internet chemistry, and when we roll out the InChI Resolver shortly the community will have access to a resource to resolve InChIKeys and ultimately navigate chemistry on the web. We commonly receive chemical structures in the form of InChIs, and in order to deposit the structures we have to convert the InChIs back to chemical structures, commonly into SDF format for batch deposition. For simple organics this is not a difficult process…the tools we have at our disposal can deal with the layout of simple organics. However, for some of the chemical structures we receive, optimizing the 2D layout is very challenging. Many of the issues come with fullerenes (see examples below) but not only there. Carbohydrates, complex cycles etc. are big challenges.


In building the InChI Resolver we hope to provide attractive visual depictions of the associated structures. Without AuxInfo data carrying the coordinates, or without the deposition of SDF files containing the layout coordinates, we have a major challenge ahead of us. AuxInfo data are shown below for erythromycin. These data are rarely generated when people generate InChIKeys, and the issue of structure layout will dominate the interpretation of complex structures.


Since beauty is in the eye of the beholder my judgement is that automatic layout algorithms should only assist in the appropriate layout and eyeballs will need to make the final decision. That is why it is better to deposit SDF files or InChIs with AuxInfo carrying the coordinates than it is to deposit InChIs only and leave the structure layout to an algorithm. It will fail.

I am interested in seeing what people can do with their structure cleaning algorithms on InChIs like this:

InChI=1/C66H103N17O16S/c1-9-35(6)52(69)66-72-32-48(100-66)63(97)80-43(26-34(4)5)59(93)75-42(22-23-50(85)86)58(92)83-53(36(7)10-2)64(98)76-40-20-15-16-25-71-55(89)46(29-49(68)84)78-62(96)47(30-51(87)88)79-61(95)45(28-39-31-70-33-73-39)77-60(94)44(27-38-18-13-12-14-19-38)81-65(99)54(37(8)11-3)82-57(91)41(21-17-24-67)74-56(40)90/h12-14,18-19,31,33-37,40-48,52-54H,9-11,15-17,20-30,32,67,69H2,1-8H3,(H2,68,84)(H,70,73)(H,71,89)(H,74,90)(H,75,93)(H,76,98)(H,77,94)(H,78,96)(H,79,95)(H,80,97)(H,81,99)(H,82,91)(H,83,92)(H,85,86)(H,87,88)/t35u,36u,37u,40-,41+,42+,43-,44+,45-,46-,47+,48u,52-,53-,54-/m0/s1

The images below show the iterative application of DIFFERENT structure layout algorithms. One caution…your layout algorithm should produce the SAME InChI at the end and NOT flip stereocenters. Interesting challenge. Who says cheminformatics isn’t challenging? And who thought building an InChI Resolver would be easy?
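One way to check the round trip is to compare the InChI before and after layout layer by layer, so that a change in the stereo layers stands out from a change elsewhere. The sketch below is a simplified Python illustration: it assumes each layer prefix occurs only once (which is not true for every InChI), the function names are hypothetical, and the two short test strings are illustrative rather than taken from ChemSpider.

```python
def split_layers(inchi):
    """Split an InChI string into {layer name: content}; one entry per prefix letter."""
    parts = inchi.strip().split("/")
    layers = {"version": parts[0]}
    if len(parts) > 1:
        layers["formula"] = parts[1]
    for part in parts[2:]:
        if part:  # skip empty segments, e.g. from a trailing slash
            layers[part[0]] = part[1:]
    return layers


def differing_layers(before, after):
    """Return the sorted names of layers that differ between two InChI strings."""
    a, b = split_layers(before), split_layers(after)
    return sorted(key for key in a.keys() | b.keys() if a.get(key) != b.get(key))


# Illustrative strings that differ only in the /m layer.
before = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
after = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
print(differing_layers(before, after))
```

An empty list means the layout step left the InChI unchanged, which is the outcome a layout algorithm should be aiming for.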


I gave my talk yesterday at CShals 2009, the conference on Semantics in Healthcare and Life Sciences. It was a great meeting for me (hindered by dismal access to wireless internet as a result of Marriott’s desire to make more money from the conference organizers. They should be ashamed of themselves in this day and age!) as it was not about Chemistry, not about spectroscopy, not even about Open Data, Open Access and Open Source. It was about Semantics. I learned a lot and got to hear Tim Berners-Lee talk about where the semantic web is, where it can go, and how it can be disruptive in a good way while NOT being too disruptive to layer onto what already exists. The best part of the meeting for me was the clear passion for the InChI, as well as a lot of acknowledgement that it is not perfect and cannot presently compete with molfiles, commercial systems, CAS Numbers and so on. But people are optimistic, waiting and supportive. Overnight I inserted a lot more information about InChIs and how they can be useful, where some of the limitations are presently, and how the StdInChI has now added a new level of complexity on one hand and simplification on the other. There have already been a number of requests for a copy of the talk so it is up on Slideshare for now (and linked below). I’ll do a voice over in the next few days and upload to Scivee. I unveiled the first version of the InChI Resolver at the conference and showed it to a couple of people. The general consensus is we are heading in the right direction. The timing of this conference was good because the intention is to layer on RDF before we release at the ACS, time allowing.


Where in the world is Carmen Sandiego and who and where is Katie Crow? We’re still looking for her ever since she put her photo on ChemSpider and took advantage of the new capability we have for depositing images.

Well, a more appropriate use of the function is to actually deposit images of appropriate data. JSpecView does not support 2D NMR data at present but such data can still be of value. Ryan Sasaki from ACD/Labs was kind enough to give me an example 2D COSY spectrum for strychnine so I could use it as a proof of concept. It is available under the spectra tab at this record (see the bottom of the page). This 2D spectrum could also show a structure with correlations etc.


Beauty is in the eye of the beholder. Something I see as stunningly beautiful can just as easily be unattractive to my peers. Such is the nature of Chemistry too. Some might find a particular reaction particularly elegant while others would argue it is mundane. I judge that when it comes to the depiction of chemical structures we would all have fairly consistent views of what are attractive and appropriate chemical structure depictions or “layouts”.

Structure layout is hard to do well and there is still a need for THE optimal layout algorithm. We still find some nightmare organic structure layouts on ChemSpider. When we push them through the layout algorithm we use now they are easily resolved, so we’re not sure why some escape the layout algorithm the first time, but so it is. We have provided the ability to clean these individual records as we find them and it takes just a couple of seconds. The technical note explaining how is here.

Such an operation was applied here. The structure on the left is the “ugly” structure (does anyone think it’s pretty?) and the one on the right is the cleaned version using the online process.

Unfortunately it is NOT so easy to obtain such improved layouts for the MAJORITY of organometallic compounds. This can be seen on PubChem (here) and, similarly, on ChemSpider here. The example is shown below. Are we working on this problem? Not really…the layout for such complex systems has been a challenge for many years and the appropriate way to deal with such situations is to use the CIF file, if it’s available, and display it in JMol as we have enabled here. We are however still working on cleaning up the structures of organic molecules as we see them and still searching for the ultimate layout tool…


For those of you performing curation activities on ChemSpider you will likely have noticed the ability to mark a new type of identifier, a shorthand formula. We have enabled this because it has become clear that this could be a useful part of document markup as part of our ChemMantis system. For example, looking at an article let’s consider the excerpt shown below.

In the excerpt you can see a number of highlighted terms, all being shorthand formulae and not depending on name to structure conversion algorithms but rather on a lookup dictionary. Each of these names is linked to ChemSpider for direct look up of information associated with the chemicals. The list of shorthand formulae extracted from a couple of hundred articles is actually only a couple of hundred formulae at present. It includes the most obvious compounds that we can all interpret: CH3OH, MeOH, CH3CN, MeCN, CH3COOH, NaCl, NaF, NaCN, KBr, KCl and so on. All of these are immediately interpretable by chemists. There are likely a few more to be found over the coming months but in the past week of reviewing articles from various sources we have actually only added a couple of new formulae. We have also seen value in linking up ions and elements as appropriate. We are likely to add filters to display or hide elements and ions since we’re of the opinion that displaying every instance of an element in an article is of little value…just imagine how many times you might see the word carbon or hydrogen in an article… carbon-carbon bonds, hydrogen bonding etc. So, we’re switching them off by default. We’ll keep reporting on how we are improving ChemMantis…based on the review of a stack of articles the system has improved dramatically. We are asking for your articles now…combining shorthand formulae and chemical name markup will highlight a document as shown below.
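The dictionary-lookup idea can be sketched in a few lines of Python. This is only an illustration of the general approach, not the ChemMantis implementation: the dictionary holds just the formulae listed above, the function and variable names are hypothetical, and element words are switched off by default as described.

```python
import re

# A tiny lookup dictionary: shorthand formula -> common name.
SHORTHAND = {
    "CH3OH": "methanol", "MeOH": "methanol",
    "CH3CN": "acetonitrile", "MeCN": "acetonitrile",
    "CH3COOH": "acetic acid",
    "NaCl": "sodium chloride", "NaF": "sodium fluoride",
    "NaCN": "sodium cyanide",
    "KBr": "potassium bromide", "KCl": "potassium chloride",
}
# Element words, only marked up when explicitly requested.
ELEMENTS = {"carbon": "C", "hydrogen": "H"}


def _pattern(terms, flags=0):
    """Build a whole-word regex, longest terms first so longer formulae win."""
    ordered = sorted(terms, key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(map(re.escape, ordered)) + r")\b", flags)


FORMULA_RE = _pattern(SHORTHAND)
ELEMENT_RE = _pattern(ELEMENTS, re.IGNORECASE)


def find_terms(text, include_elements=False):
    """Return (position, matched text, meaning) tuples in order of appearance."""
    hits = [(m.start(), m.group(), SHORTHAND[m.group()])
            for m in FORMULA_RE.finditer(text)]
    if include_elements:
        hits += [(m.start(), m.group(), "element " + ELEMENTS[m.group().lower()])
                 for m in ELEMENT_RE.finditer(text)]
    return sorted(hits)


sample = "Dissolved in MeOH, washed with NaCl; carbon-carbon bonds intact."
print(find_terms(sample))
print(find_terms(sample, include_elements=True))
```

With the default setting only MeOH and NaCl are picked up from the sample sentence; turning elements on adds both occurrences of "carbon", which shows how quickly element markup would clutter a real article.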


When ChemSpider was rolled out to the world as a part of ChemZoo we always knew we would be introducing more “critters”. We are happy to announce our progress with our new development, ChemMantis. Why Mantis? Well…it’s the Markup And Nomenclature Transformation Integrated System. Fits perfectly into our zoo!

We have been working on the markup of chemistry documents for a number of months and I unveiled the first aspects of our work at the ACS meeting in Philadelphia. The presentation is available online on my Slideshare account. What we are trying to do is to use our ChemSpider platform as the foundation of a document markup system whereby chemical names are automatically identified and can either be converted to chemical structures (possibly using algorithms for name to structure conversion) or retrieved from our ChemSpider database. We have invested a lot of effort in curating and validating the ChemSpider database of over 21.5 million unique chemical entities over the past year and are now sitting on a foundation of information allowing us to connect between chemical identifiers, chemical structures and out to rich sources such as Wikipedia and PubChem, and to provide information such as chemical vendors and other online systems. ChemMantis is well and truly woven into the web of ChemSpider now.

We are now in alpha release and are adding some finishing tweaks to the markup system, the visualization elements and the workflow. You can see the immediate effects of our recent work on improving the quality of structure images in the balloon below.

We would like to test the system on YOUR documents if you are willing to participate. What we are looking for are WORD documents for already published papers. They can be Open or Closed access papers. We are not expecting copyright transfer - we want to mark up the documents and return them to you for feedback. In the process we will be testing the quality of our Dictionary, our conversions, our visualizations and our process. We welcome your support. Feel free to connect with us at infoATchemspiderDOTcom. Over the next few weeks you will hear more about ChemMantis and our contributions to text mining and markup of chemistry documents.


Recently a new website connecting chemicals to synthesis references went online. The site is ChemSynthesis and as well as synthesis references the database also contains physical properties for many of the listed substances. There are currently more than 40 000 compounds and more than 45 000 synthesis references in the database and there is an intention to keep the database growing with contributions from the community. Presently ChemSynthesis is indexing information from quite an extensive list of journals given below.

The Journal of the American Chemical Society, Canadian Journal of Chemistry, Chemical and Pharmaceutical Bulletin, Chemistry Letters, Journal of Heterocyclic Chemistry, Journal of Medicinal Chemistry, The Journal of Organic Chemistry, Organic Syntheses, Synthesis, Synthetic Communications, Tetrahedron Letters, Tetrahedron

An example record can be found here and a list of hits from a text search is shown below.

Linking from ChemSpider to ChemSynthesis seemed like a natural way to help our users source potential synthesis details. So, that’s done. Also we have exchanged the appropriate information with ChemSynthesis so that we have completed the loop. Users searching ChemSynthesis can navigate directly to the ChemSpider record with one click.

To review the entire ChemSynthesis dataset on ChemSpider simply follow this link. It is >40,000 molecules so might take a while to load. Another contribution to the community of connected chemists….


We’ve been working on structure depictions on ChemSpider and overall we are very happy with where we have got to. These structure depictions are going to be showing up in various parts of our system now.

However, we should qualify the difference between structure images and structure layout. The depictions and the layout are governed by different algorithms. While a structure image can be attractive the layout may not be perfect. It is possible to improve the layout of the molecule deposited on ChemSpider. Notice for the structure on the left that there is overlap with the methyl group.

For details on how to CLEAN structures on ChemSpider please read the Technical Note here: Interactive Cleaning of Molecules During Curation and Deposition.

The result of performing cleaning is shown below. This layout may also not be the perfect layout but there is no overlap. The user can continue to manually optimize the structure for the preferred layout.


It is finally time to roll out more attractive structure depictions. We have needed more attractive structure depictions for a while but they have become an absolute must-have as we roll out the following new capabilities:

1) The ability to make YOUR chemical blog structure searchable (watch this space…). We suggested one path previously…this is BETTER…

2) Structure balloons for using with our document markup tools, both browser-based and Microsoft Word based

We all judge quality of visual aesthetics quickly. We know a good structure when we see one. This is an announcement that we will be rolling out new structures across the site in the next few days. You will see better looking structures showing up across the site - during deposition, during service-based predictions, during searches and, well, everywhere. While not perfect as yet a little more tweaking and the entire database will be supported by the new structure depiction algorithms. As it is you should see some examples now on the database…one shown below. We welcome your feedback!


I recently started a discussion with the users of ChemSpider about how they use our system. There have already been two responses and I am hoping for more. Having sat in on an IUPAC InChI meeting in Washington last week I can honestly say that it was one of the most functional and on-task meetings I have sat in on in a long time. Decisions were made about how to move forward with the next release of the InChIKey and “standard versions” of both the InChIString and InChIKey.

The meeting has prompted the question: how do you use InChI? For what purpose do you use InChI and do you use only the string? Do you use it for communication purposes and structure exchange? Do you use it in your internal databases? Is it a primary path to deduplication? What settings do you use for the InChIString?

I’m interested in how you are using InChI and how important it has become for you. Comments welcomed.


As ChemSpider has grown into an important part of the online community for providing access to information and data to chemists to assist them in their work there are many subjective criteria by which to be measured. We set some objectives early on in regards to how we would measure our own successes in the first couple of years. These included:

1) A result of >500,000 in a Google search (we have been at this number for over a month I believe)

2) Acknowledgment by our “peers”, another subjective criterion, by comments made in the blogosphere, recognized by invitations to speak, participate in panel discussions etc. No shortage here.

3) Reach 5000 unique users per day in our first year (already achieved)

4) Be reviewed in a mainstream publication (the Nature article written about ChemSpider does that)

5) Have over 150 data sources feed ChemSpider. We are close…145 data sources at present and more in the pipe to feed in shortly

6) Be indexed by Chemical Abstracts Service.

CAS has been indexing a number of web resources for a considerable time. Until today I didn’t know that we were one of these sources. It actually makes a lot of sense that we should be indexed. We have unique chemistry on our site since we host Open Notebook Science from groups such as that of Jean-Claude Bradley at Drexel University. But, we also have spectra and assignments from research compounds being deposited onto the database and are establishing relationships with Open Access publishers to index their chemical compounds connected directly to their articles. So, being indexed makes sense.

There has been a murmuring in the community that what ChemSpider is doing will collide with CAS. I have reiterated many times that I believe CAS offers the crown jewels in terms of quality and curated data. With what likely amounts to 1000s of person years of investment in building the registry we are unlikely to surpass CAS’ breadth of knowledge. Rather we are focused on providing a service to the community so that the community can participate in developing and growing the database. I believe CAS and ChemSpider are synergistic and have much to offer by being connected in this way.

Inserted above is a screen grab of part of a record showing the ChemSpider database as the source of the structure. CAS have rigorous expectations regarding how they select what chemical entities should be inserted into their database. While I don’t know this list of definitions this structure clearly meets it. The structure above is on ChemSpider here. We’re very happy that we are being indexed now in the CAS registry and will continue to enhance our “unique structure collection” working with chemical vendors, publishers and scientists to grow our database.

 


In the past 48 hours we have added six new depositor datasets to ChemSpider. Details of all of our data sources are listed here. The list of six new depositors and the number of compounds in each collection is given below. Click on the hyperlinks for more information. The number-of-compounds link will display the compound collection and the link on the title of the compound collection will list some details about the data source provider.

489 NIH Clinical Collection 9/7/2008
3080 Shanghai Institute of Organic Chemistry 9/7/2008
12356 HDH Pharma 9/7/2008
196 OmegaChem 9/6/2008
2110 Exclusive Chemistry 9/6/2008
13412 Oakwood 9/6/2008


We have put in place a simple way to associate a chemical compound in a single record view with an external data source. We made this a general solution but did it specifically to enable connections to be made quickly between new Wikipedia records and records on ChemSpider. We have become very experienced with the validation of data on both Wikipedia and ChemSpider over the past few months, so when we find new records on Wikipedia that are not already connected to ChemSpider we clean and validate structures on ChemSpider while validating the compounds on Wikipedia. Then, when we are convinced of the validity of the compounds, we connect them. While it may take a long time to validate the data, associating the Wikipedia and ChemSpider records takes just a few seconds.

We have now established “Wikipedia on ChemSpider”, making Wikipedia searchable by structure and substructure. We believe that people may be more likely to use this over WiChempedia but we will see.

The process for linking Data Sources directly to a record view is described in this Technical Note. We welcome feedback on the document in case it is difficult to follow.


Users of ChemSpider might have noticed some performance issues in the past 2-3 weeks with our web services, service availability and speed of searches. I put my hand in the air and say “Yup, acknowledged”. Hopefully they have not been too disruptive BUT it is ultimately for the overall benefit of the service. We have been streaming in 8 MILLION links to PubMed in order to make PubMed structure and substructure searchable. We are NOT rolling this out with full fanfare yet but I do want to explain the performance issues you might be experiencing. We work on Microsoft technology and while we are advocates for the platforms of .NET, IIS and SQL Server we are definitely putting them under pressure as we keep expanding the database and adding more value. We have thoughts about how to resolve this but want to finish populating the tables first.

The upside….the majority of links are already in place. For an example visit a structure and look for PubMed as a data source and click on one of the links. For example, for Valium here you will see in the datasource table a series of Pubmed IDs next to the PubMed datasource…

  16971504, 17673, 874970, 406430, 17881, 327854, 879884, 577681, 560225, 195649, …

These will link you out to PubMed directly. Try it out…

Now, do we have implementation issues? YES. The lists of external IDs can be long so right now we show only the first 10. We will deal with display of the others shortly. We need to provide a way to curate out “junk” entries. For example, “methyl” is on ChemSpider as a fragment and has links to PubMed IDs…you’ll see why if you click them…it was done with text mining. These issues will be resolved but for now we announce that PubMed is structure and substructure searchable via ChemSpider. We will explain how we did it shortly but for now we will acknowledge the massive contribution of our colleagues at SureChem. More to come…
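As an illustration of the "show only the first few IDs" behavior, here is a minimal Python sketch. It is not the ChemSpider code: the function name is hypothetical, and the URL pattern is an assumption about how a PubMed ID maps to a link.

```python
# Assumed URL pattern for turning a PubMed ID into a link.
PUBMED_URL = "https://pubmed.ncbi.nlm.nih.gov/{pmid}/"


def links_for_display(pmids, limit=10):
    """Return (links for the first `limit` IDs, number of IDs left undisplayed)."""
    shown = [PUBMED_URL.format(pmid=pmid) for pmid in pmids[:limit]]
    hidden = max(0, len(pmids) - limit)
    return shown, hidden


# The ten PubMed IDs quoted above for Valium.
valium_pmids = [16971504, 17673, 874970, 406430, 17881,
                327854, 879884, 577681, 560225, 195649]

# A smaller limit is used here just to demonstrate the truncation.
shown, hidden = links_for_display(valium_pmids, limit=3)
print(shown)
print(f"{hidden} more not shown")
```

Returning the count of undisplayed IDs alongside the links leaves room for a "show more" control later.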


I am very proud of the response from our user base to my request for assistance with curating ChemSpider in regards to carbohydrates. Carbohydrates are complex in nature. They can be represented in linear form and cyclic form, they exist in ChemSpider with a common name but no defined stereochemistry, there are pentoses, hexoses and many stereoisomers per skeleton. There are MANY common carbohydrates with trivial names - Ribose, Arabinose, Xylose, Lyxose, Allose, Altrose, Mannose, Gulose, Idose, Galactose, Talose

Carbohydrates have been very challenging for us at ChemSpider…many depositors have not been careful with the  association between the chemical structure and the associated identifiers. With a chemical structure as the primary key on a record we find confusing associations with structures. For example, a search on Maltotriose as an identifier turns up 5 structures on ChemSpider. Maltotriose is defined on Wikipedia as “trisaccharide (three-part sugar) consisting of three glucose molecules linked with 1,4 glycosidic bonds.” This should mean that it is not appropriate for the identifier maltotriose to be associated with this structure. The registry number associated with this structure should be deleted also based on Wikipedia as a resource. How many of the other identifiers should be deleted? Maybe all???

Looking at this record we see identifiers such as: alpha-D-Glc-(1->4)-alpha-D-Glc-(1->4)-D-Glc; alpha-D-Glc, O-alpha-D-glc; GLC-(4-1)GLC-(4-1)GLC-(4-4)GTE and O-alpha-DD-Glucopyranosyl-(1->4)-O-alpha-D-glucopyranosyl-(1->4)-D-glucose. Are these appropriate for this compound?

The challenge for maltotriose is therefore to identify the CORRECT structure associated with that name. “Maybe” it is the structure on Wikipedia, but don’t forget that we have an effort underway to validate the structures on Wikipedia and make sure they are correctly associated with the monograph title. Is Maltotriose an identifier for a unique stereoconfiguration, or are there alpha- and beta-maltotriose forms? I am not sure. What needs to be determined is the correct association between structures and identifiers. Incorrect associations should be removed so that a search on the identifier does not turn up the wrong structures in ChemSpider.
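
To make the curation task a little more concrete, here is a minimal Python sketch of how one might flag identifiers that are attached to more than one structure record. The record IDs and names are invented for illustration and the data layout is an assumption; it does not reflect the actual ChemSpider schema or data.

```python
from collections import defaultdict

# Hypothetical (record_id, identifier) associations; not real ChemSpider data.
associations = [
    (101, "maltotriose"),
    (102, "maltotriose"),
    (103, "Maltotriose"),
    (103, "amylotriose"),
    (104, "benzene"),
]


def ambiguous_identifiers(pairs):
    """Return {identifier: sorted record ids} for names linked to more than one record."""
    by_name = defaultdict(set)
    for record_id, name in pairs:
        # Compare names case-insensitively, since depositors vary in capitalisation.
        by_name[name.strip().lower()].add(record_id)
    return {name: sorted(ids) for name, ids in by_name.items() if len(ids) > 1}


flagged = ambiguous_identifiers(associations)
print(flagged)
```

A list like this only tells a curator where to look. Deciding which of the flagged associations is the correct one still needs a chemist.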

This is the start of the validation process for carbohydrates… it’s iterative, complex and hard work. It’s going to begin with giving the group of interested parties curator powers on ChemSpider and asking them to work on this challenge. We welcome their assistance. The efforts of contributors like this will be essential.


I’ve had a number of questions about the presentation I gave at ACS Philly last week about document markup. The phrase I keep hearing is “very disruptive” followed by the question “will authors do more work and what’s in it for them?”.

The presentation here outlines the general concept that I talked about…

The basic concept I presented is as follows, with a focus on Chemistry Articles.

A lot of effort is being expended in “text-mining” publications, post-publication, to index these articles and make them searchable not only by text but by the specific language of chemistry, chemical structures. We are specifically asking the question “why extract chemical structures from articles using chemical name conversion approaches and chemical image conversion tools when the structures in the article were ORIGINALLY machine readable?”

We are considering a system whereby authors are asked to contribute to the availability of a free online service for performing structure and substructure-based searches of chemistry articles. While the submission of journal articles is already a lot of work (I know from experience of authoring/co-authoring about 10 a year), we hope that authors will support a service whereby they can upload their own articles to a “validation and mark-up service”. The upload capabilities will support upload of the primary document, chemical structures in standard formats and supplementary information of various types (to be defined).

This system will perform the following services:

1) semi-automated markup of a document - title, author(s), abstract and additional dictionary-based terms plus the ability to use the NLM-DTD markup
2) identification of chemical names and conversion to structures in an automated fashion
3) conversion of structure IMAGES to connection tables using optical structure recognition software (either commercial or open source)
4) ask authors to confirm whether the converted structures are appropriate
5) provide a structure validation service for submitted molecules checking for “accurate representation”
6) Deposit all structures associated with an article onto ChemSpider but under embargo. Associate the article Title, authors and “abstract snippet” with all structures.
7) Issue a set of ChemSpider IDs for the author to submit to the publisher with the article
8) When a publication has passed through review the author can release the structures from embargo using a DOI or an article URL (more common for Open Access articles)
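
Steps 6 to 8 above (deposit under embargo, issue IDs, release on publication) can be sketched roughly as follows. This is only an illustrative Python model: the class names, fields and in-memory storage are all assumptions made for the example, not a description of how ChemSpider implements or will implement this.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional


@dataclass
class Deposit:
    """One deposited structure (hypothetical record layout)."""
    csid: int                           # stand-in for a ChemSpider ID
    structure: str                      # e.g. a SMILES string
    title: str                          # article title attached to the structure
    embargoed: bool = True              # hidden from search until released
    article_ref: Optional[str] = None   # DOI or article URL, set on release


class EmbargoStore:
    """Toy in-memory store illustrating the embargo workflow."""

    def __init__(self) -> None:
        self._ids = count(1)
        self._records: Dict[int, Deposit] = {}

    def deposit(self, structures: List[str], title: str) -> List[int]:
        """Step 6 and 7: store structures under embargo and return their IDs."""
        issued = []
        for structure in structures:
            csid = next(self._ids)
            self._records[csid] = Deposit(csid, structure, title)
            issued.append(csid)
        return issued

    def release(self, csids: List[int], article_ref: str) -> None:
        """Step 8: lift the embargo once the author supplies a DOI or URL."""
        if not article_ref:
            raise ValueError("a DOI or article URL is required to release")
        for csid in csids:
            record = self._records[csid]
            record.embargoed = False
            record.article_ref = article_ref

    def search(self, structure: str) -> List[int]:
        """Exact-match lookup that only sees released structures."""
        return [r.csid for r in self._records.values()
                if not r.embargoed and r.structure == structure]


store = EmbargoStore()
ids = store.deposit(["c1ccccc1", "CCO"], "An example article")
hidden = store.search("CCO")            # nothing visible while under embargo
store.release(ids, "doi:10.0000/example")
visible = store.search("CCO")           # now returns the issued ID
```

The point of the sketch is simply that the author holds the IDs from the moment of deposit, while the structures stay invisible to searches until the article reference is supplied.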

The result of this project will be a way for publishers to link their articles directly to a free access chemistry database and use a series of web services to enable other capabilities (to be defined). It will also allow articles in Open Access and non-Open Access publications to be searchable by the “language of chemistry”.

This is only a slice of the overall project but I think it may be of interest relative to the comments you have made below.

Parts of this were shown last week at Drexel University and a particular snippet is available online here:

We are also going to provide a Microsoft Word add-on which will allow users to prepare articles for publishing using similar technologies.

We think this IS disruptive..what say you?


I am looking for someone with a good understanding of carbohydrate chemistry to join the ChemSpider Advisory Group and help us “get carbohydrates right” on the ChemSpider database. It would be sweet if someone could help us clean up the abundance of data on the site and offer us their skills. It might be a good project for a student to work on with us as it will require some research to make sure that we end up with REFERENCE quality data for others to use. Anybody interested?


We are currently testing a new 3D molecule optimizer on ChemSpider. It appears to be more robust than our previous algorithm and, in our hands at least, has not yet failed to produce a 3D representation. On general small organics it appears to handle the optimization very well and is quite fast. We welcome your feedback if you have time to test. To use the optimizer, simply go to a record view, click on Zoom on the Structure tab (see below) and then click on Show 3D as shown in the second image. Since all coordinates are calculated in real time, please note that it can take a few seconds from opening the JMol applet to display of the optimized structure (we need to add a “calculating….” display element when we have time).

Things look good for us, but we have had a couple of reports of failures and are trying to trace whether a browser security setting is the cause. Please let us know if you have any issues. Thanks


A link to the presentation I gave at ACS-Philly yesterday in Rajarshi Guha’s session is provided below. A lot changes between writing an abstract and writing a talk, so I had the chance to highlight the increasing number of papers ALREADY using ChemSpider as one of their platforms of choice for sourcing information.

Can a Free Access Structure-Centric Community for Chemists Benefit Drug Discovery?

ChemSpider is an online database of over 20 million chemical structures assembled from well over a hundred data sources including chemical and screening library vendors, publicly accessible databases and resources, commercial databases and Open Access literature articles. Such a public resource provides a rich source of ligands for the purpose of virtual screening experiments. These can take many forms. This work will present results from two specific types of studies: 1) Quantitative Structure Activity Relationship (QSAR) based analyses and 2) in silico docking into protein receptor sites. We will review results from the application of both approaches to a number of specific examples. QSAR analyses utilizing the ChemModLab environment for assessing quantitative structure-activity relationships will be reviewed, as will screening using a molecular surface descriptor model.

Link to presentation
