<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Does ChemSpider Have Millions of Errors?</title>
	<atom:link href="http://www.chemspider.com/blog/does-chemspider-have-millions-of-errors.html/feed" rel="self" type="application/rss+xml" />
	<link>http://www.chemspider.com/blog/does-chemspider-have-millions-of-errors.html</link>
	<description>Building Community for Chemists</description>
	<lastBuildDate>Fri, 24 May 2013 06:45:34 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5</generator>
	<item>
		<title>By: Joerg Kurt Wegner</title>
		<link>http://www.chemspider.com/blog/does-chemspider-have-millions-of-errors.html/comment-page-1#comment-195006</link>
		<dc:creator>Joerg Kurt Wegner</dc:creator>
		<pubDate>Thu, 22 Oct 2009 18:13:51 +0000</pubDate>
		<guid isPermaLink="false">http://www.chemspider.com/blog/?p=1459#comment-195006</guid>
		<description><![CDATA[I agree with Egon that &#039;upstream&#039; or &#039;remote curation/annotation&#039; would be really helpful. The Reflect service of EMBL/EBI now allows users to tag missed entries (e.g. gene identifiers) and to annotate wrong ones. With a little mashup and right mouse-click actions, this could be done for ChemSpider, too.

Input: URL, text (chemistry or identifier), meta-data (annotation details)
Output in ChemSpider database - Annotation of basically any web-page out there allowing chemistry enrichment.

In the long run, people could &#039;remotely&#039; help curate chemistry and any associated data.]]></description>
		<content:encoded><![CDATA[<p>I agree with Egon that &#8216;upstream&#8217; or &#8216;remote curation/annotation&#8217; would be really helpful. The Reflect service of EMBL/EBI now allows users to tag missed entries (e.g. gene identifiers) and to annotate wrong ones. With a little mashup and right mouse-click actions, this could be done for ChemSpider, too.</p>
<p>Input: URL, text (chemistry or identifier), meta-data (annotation details)<br />
Output in ChemSpider database &#8211; Annotation of basically any web-page out there allowing chemistry enrichment.</p>
<p>In the long run, people could &#8216;remotely&#8217; help curate chemistry and any associated data.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Imants Zudans</title>
		<link>http://www.chemspider.com/blog/does-chemspider-have-millions-of-errors.html/comment-page-1#comment-195005</link>
		<dc:creator>Imants Zudans</dc:creator>
		<pubDate>Tue, 20 Oct 2009 10:38:01 +0000</pubDate>
		<guid isPermaLink="false">http://www.chemspider.com/blog/?p=1459#comment-195005</guid>
		<description><![CDATA[I understand well the magnitude of this problem as we are facing some similar issues. You can generate InChI Keys and IUPAC names from structures. But it is very difficult to ensure that information from the outside is correct. However, sometimes even wrong information is useful so maybe you are too hard on yourself. A CAS number might be incorrectly associated with a particular structure, for example. But it is far better to get 3 structures for one CAS query and then decide which one is the correct one than not to get any results. This is why Google is so useful. A lot of websites in search results are somewhat irrelevant but since they don&#039;t censor the content, most of the time it is possible to find the needed information. Chemical databases are more restrictive about information they let in and that sometimes makes it harder to find the needed compound. That is even true for the structure search as many databases don&#039;t offer a tautomer search option.

I am looking forward to seeing the follow-up post!]]></description>
		<content:encoded><![CDATA[<p>I understand well the magnitude of this problem as we are facing some similar issues. You can generate InChI Keys and IUPAC names from structures. But it is very difficult to ensure that information from the outside is correct. However, sometimes even wrong information is useful so maybe you are too hard on yourself. A CAS number might be incorrectly associated with a particular structure, for example. But it is far better to get 3 structures for one CAS query and then decide which one is the correct one than not to get any results. This is why Google is so useful. A lot of websites in search results are somewhat irrelevant but since they don&#8217;t censor the content, most of the time it is possible to find the needed information. Chemical databases are more restrictive about information they let in and that sometimes makes it harder to find the needed compound. That is even true for the structure search as many databases don&#8217;t offer a tautomer search option.</p>
<p>I am looking forward to seeing the follow-up post!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Kuhn</title>
		<link>http://www.chemspider.com/blog/does-chemspider-have-millions-of-errors.html/comment-page-1#comment-195004</link>
		<dc:creator>Michael Kuhn</dc:creator>
		<pubDate>Tue, 20 Oct 2009 08:14:46 +0000</pubDate>
		<guid isPermaLink="false">http://www.chemspider.com/blog/?p=1459#comment-195004</guid>
		<description><![CDATA[KEGG also seems to be responsive to fixes. However, I think that it is probably better to adjust your downstream system to deal with errors from upstream. When we looked into creating a good set of chemical synonyms for the STITCH database (based on PubChem) it seemed to us that once erroneous synonyms get into the system, they&#039;ll just be re-shared between dbs even if you get some of them to fix the error. We thus prioritized the source databases, KEGG being one of the &quot;good&quot; ones – but even KEGG makes mistakes sometimes.

So perhaps it would make sense to make &quot;diffs&quot; that can be re-applied to source databases when you import them again?]]></description>
		<content:encoded><![CDATA[<p>KEGG also seems to be responsive to fixes. However, I think that it is probably better to adjust your downstream system to deal with errors from upstream. When we looked into creating a good set of chemical synonyms for the STITCH database (based on PubChem) it seemed to us that once erroneous synonyms get into the system, they&#8217;ll just be re-shared between dbs even if you get some of them to fix the error. We thus prioritized the source databases, KEGG being one of the &#8220;good&#8221; ones – but even KEGG makes mistakes sometimes.</p>
<p>So perhaps it would make sense to make &#8220;diffs&#8221; that can be re-applied to source databases when you import them again?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Egon Willighagen</title>
		<link>http://www.chemspider.com/blog/does-chemspider-have-millions-of-errors.html/comment-page-1#comment-195003</link>
		<dc:creator>Egon Willighagen</dc:creator>
		<pubDate>Tue, 20 Oct 2009 08:03:11 +0000</pubDate>
		<guid isPermaLink="false">http://www.chemspider.com/blog/?p=1459#comment-195003</guid>
		<description><![CDATA[&quot;Bottom line though is we cannot clean up everyone else&#039;s data right now and, in many cases, people don’t seem to respond even when errors are pointed out.&quot;

Indeed! That&#039;s why I was so interested in learning if/how ChemSpider approaches this. Ubuntu is an example of an organization that has to deal with the same problem, and often needs to file bug reports &#039;upstream&#039; with Debian too.

What this needs is a simple, clean ticket system that integrates with ChemSpider itself. If you file a ticket against a CS entry (make a modification), this should trigger this upstream reporting. There are two alternatives, but since &#039;upstream&#039; in this case very likely does not have a ticket (or bug tracking) system anyway, I would propose the following:

* for each &#039;upstream&#039; resource, create a web page that lists all errors reported for each entry in that upstream resource.
* possibly create a machine-readable version of it (RDFa perhaps), so that we can write a simple Userscript to make those reports *visible* on the upstream website too

In that way, the manual work needed for upstream to take advantage of the ChemSpider work is minimized and the usefulness of ChemSpider maximized.

This is beyond Open Data, and this goes into Open Projects and interoperability. If ChemSpider would adopt something along these lines, it would actually take chemical databases a step further instead of being yet another repository.]]></description>
		<content:encoded><![CDATA[<p>&#8220;Bottom line though is we cannot clean up everyone else&#8217;s data right now and, in many cases, people don’t seem to respond even when errors are pointed out.&#8221;</p>
<p>Indeed! That&#8217;s why I was so interested in learning if/how ChemSpider approaches this. Ubuntu is an example of an organization that has to deal with the same problem, and often needs to file bug reports &#8216;upstream&#8217; with Debian too.</p>
<p>What this needs is a simple, clean ticket system that integrates with ChemSpider itself. If you file a ticket against a CS entry (make a modification), this should trigger this upstream reporting. There are two alternatives, but since &#8216;upstream&#8217; in this case very likely does not have a ticket (or bug tracking) system anyway, I would propose the following:</p>
<p>* for each &#8216;upstream&#8217; resource, create a web page that lists all errors reported for each entry in that upstream resource.<br />
* possibly create a machine-readable version of it (RDFa perhaps), so that we can write a simple Userscript to make those reports *visible* on the upstream website too</p>
<p>In that way, the manual work needed for upstream to take advantage of the ChemSpider work is minimized and the usefulness of ChemSpider maximized.</p>
<p>This is beyond Open Data, and this goes into Open Projects and interoperability. If ChemSpider would adopt something along these lines, it would actually take chemical databases a step further instead of being yet another repository.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Antony Williams</title>
		<link>http://www.chemspider.com/blog/does-chemspider-have-millions-of-errors.html/comment-page-1#comment-195002</link>
		<dc:creator>Antony Williams</dc:creator>
		<pubDate>Tue, 20 Oct 2009 06:58:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.chemspider.com/blog/?p=1459#comment-195002</guid>
		<description><![CDATA[When I have reported errors to a number of the database hosts where errors have come into ChemSpider I generally get no response. Not always, but generally. In most cases I see no adjustments made to the data in error so I have to assume that either they are not receiving/reading my email or are choosing to not do the work to remove the errors. Some groups do make adjustments - especially, of course, Wikipedia, where I make adjustments myself but also the WP:Chem team are very active and also David Wishart&#039;s group at DrugBank.

When we change name-structure associations they are persisted because we keep the erroneous names on the database and even if they are redeposited they remain flagged as in error. The same is true for database IDs. We are instituting the same policies for experimental data moving forward but not right now as we are adjusting our data model to make such data more discoverable and available via web services. I&#039;ll expand on this discussion in a separate blog post. 

Bottom line though is we cannot clean up everyone else&#039;s data right now and, in many cases, people don&#039;t seem to respond even when errors are pointed out.]]></description>
		<content:encoded><![CDATA[<p>When I have reported errors to a number of the database hosts where errors have come into ChemSpider I generally get no response. Not always, but generally. In most cases I see no adjustments made to the data in error so I have to assume that either they are not receiving/reading my email or are choosing to not do the work to remove the errors. Some groups do make adjustments &#8211; especially, of course, Wikipedia, where I make adjustments myself but also the WP:Chem team are very active and also David Wishart&#8217;s group at DrugBank.</p>
<p>When we change name-structure associations they are persisted because we keep the erroneous names on the database and even if they are redeposited they remain flagged as in error. The same is true for database IDs. We are instituting the same policies for experimental data moving forward but not right now as we are adjusting our data model to make such data more discoverable and available via web services. I&#8217;ll expand on this discussion in a separate blog post. </p>
<p>Bottom line though is we cannot clean up everyone else&#8217;s data right now and, in many cases, people don&#8217;t seem to respond even when errors are pointed out.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Egon Willighagen</title>
		<link>http://www.chemspider.com/blog/does-chemspider-have-millions-of-errors.html/comment-page-1#comment-195001</link>
		<dc:creator>Egon Willighagen</dc:creator>
		<pubDate>Tue, 20 Oct 2009 05:52:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.chemspider.com/blog/?p=1459#comment-195001</guid>
		<description><![CDATA[Hi Tony,

what happens with these &#039;fixes&#039;? Does ChemSpider have a mechanism for reporting them &#039;upstream&#039; (where the error came from)? And, how is this handled when you pull in an updated upstream release, which may still lack the fix, or, and that&#039;s where manual labor comes in again, has a *conflicting* fix?]]></description>
		<content:encoded><![CDATA[<p>Hi Tony,</p>
<p>what happens with these &#8216;fixes&#8217;? Does ChemSpider have a mechanism for reporting them &#8216;upstream&#8217; (where the error came from)? And, how is this handled when you pull in an updated upstream release, which may still lack the fix, or, and that&#8217;s where manual labor comes in again, has a *conflicting* fix?</p>
]]></content:encoded>
	</item>
</channel>
</rss>
