Rich Apodaca continues to expose the benefits of Ruby and how PubChem can be “hacked”. In true Rich style he is offering us a very useful approach to extracting more and more information from PubChem, this time in a post entitled “Hacking Pubchem: Convert CAS Numbers into PubChem CIDs with Ruby”. I’m not going to argue with Rich’s technology approach. I’m sure that’s fine.

The issue comes down to the association between CAS numbers and PubChem IDs. Let’s take an extreme example to start: methane. I posted about the issue of synonyms in response to Peter Corbett’s commentary on BioCyc entries in PubChem this week. Now let’s take a look at registry numbers. Scroll to the bottom of the methane entry. Here’s a SUBSET of the CAS registry numbers listed for methane.

101239-80-9
106907-70-4
109766-76-9
114680-00-1
115344-49-5
116788-82-0
12424-49-6
124760-06-1
12751-41-6
12768-98-8
12789-22-9
130960-03-1
131452-56-7
131640-45-4
133136-50-2
1333-86-4
1343-03-9
137322-21-5
137906-62-8
138464-41-2
1399-57-1
147335-73-7
155660-93-8
156854-02-3
158271-80-8
159251-18-0
208519-32-8
214540-86-0
2465-56-7
26837-67-2
37196-29-5
37265-44-4
37265-48-8
39422-04-3
39434-34-9
47932-00-3
50814-81-8
52623-24-2
53095-52-6
53851-02-8
55353-42-9
55607-95-9
56257-79-5
56257-80-8
56274-59-0
56729-25-0
58517-29-6
61512-59-2
63661-31-4
64427-56-1
64900-31-8
67167-41-3

Here’s the question: are all of these real CAS registry numbers for methane? How can they be validated? While it is possible with Rich’s approach to link CAS numbers and PubChem IDs, the question is whether the connection is appropriate. I say take care.
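One partial check needs no access to CAS at all: the last digit of a CAS registry number is a check digit, computed from the other digits. The sketch below is a minimal illustration in Python, and the function name `cas_checksum_ok` is made up for this post. It only tells you whether a string is a well-formed registry number. It says nothing about whether that number really belongs to methane, which is the harder question raised here.

```python
import re

# Format: 2 to 7 digits, a hyphen, 2 digits, a hyphen, 1 check digit.
CAS_PATTERN = re.compile(r"(\d{2,7})-(\d{2})-(\d)")


def cas_checksum_ok(cas: str) -> bool:
    """Return True if `cas` is well formed and its check digit is consistent.

    This is a syntax check only. It cannot confirm that the number is
    assigned to any particular compound.
    """
    match = CAS_PATTERN.fullmatch(cas.strip())
    if match is None:
        return False
    digits = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost one.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(digits), start=1)
    )
    return total % 10 == check_digit


if __name__ == "__main__":
    for candidate in ["74-82-8", "1333-86-4", "74-82-9"]:
        print(candidate, cas_checksum_ok(candidate))
```

A number can pass this test and still be attached to the wrong structure, so it is a filter for typos and garbage, not a substitute for curation.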

The quality should be questioned, since the identifiers listed for methane include the following:

Special Black 1V & V
Rocol X 7119
Graphite (natural), dust
Shawinigan Acetylene Black
N-METHYL ASPARAGINE
C.I. Pigment Black 10
SKLN 1
Single wall carbon nanotube
HSDB 167
Pelikan C 11/1431a

Now, I acknowledge that this may be a worst-case scenario, and the five CAS numbers listed for aspirin might all be real

11126-35-5
11126-37-7
2349-94-2
26914-13-6
98201-60-6

but how can they be validated? Just as we have recently exposed web services for InChI, can you imagine the value of a web service into the CAS Registry to validate the relationship between a CAS number and a chemical compound? Now that’s a great idea…


8 Responses to “Hacking PubChem – Technology Easy, Quality Difficult”

  1. Rich Apodaca says:

    The question of quality will always be present in any database, and your post raises valid concerns. It _would_ be great if CAS created a free Web service for CAS number lookup. In fact, such a move might just save their franchise.

    I’ve seen many examples of the same substance having multiple CAS numbers – it’s quite common. It might be interesting for a reader with access to SciFinder to get a list of all of the CAS numbers for methane. Clearly, many of those CAS numbers are for “Carbon” and its mixtures/derivatives/allotropes – which could be attributed to the molfile format not having a consistent way to represent implicit hydrogens. (The data format used for compound registration matters – a lot). So the methane example may be the worst case scenario for a molecule unlikely to appear in most databases.

    During the transition away from CAS numbers as a chemical identifier, conversions such as CAS number -> PubChem CID (or CAS number -> IUPAC name) will become very important. PubChem records contain IUPAC names (in several flavors), so it might be possible to use that field as a consistency check – as you’ve done.

    Bottom line: limbo-land is where we’re at. As the Chinese say, may you live in interesting times…

  2. Joerg Kurt Wegner says:

    Agreed! This gives a *big* minus for PubChem. Are they really expecting that others find the duplicates and the wrong structures?

    Anyway, can you not offer something for this? E.g. another web-service called “Check pubchem cid in curated chemspider data”

    And thanks for *doing* this
    http://www.chemspider.com/blog/?p=135

    You could even go a step further by offering a service called “return curated unique inchikey or chemspider csid from non-curated pubchem cid”!

  3. Antony Williams says:

    A friend sent me these comments offline: “Some of those seem to be formulations for CH4 and a non-structured component. This is one of the most common ways that CAS numbers get associated improperly. I seem to recall that Methane is simply 74-82-8, not even in your subset.”

    He is right that 74-82-8 is not in the subset, but it is bolded and is the FIRST of all of the synonyms, showing that it has been validated by a curator.

  4. Egon Willighagen says:

    Yet another reason to use InChI instead. I *know* a lot of people use the CAS registry number as identifiers; some people also can’t stop smoking/drinking/…

    To borrow from Rich’s blog of last week: InChI is to chemoinformatics what the pass forward was to football. And this blog item shows the score :)

    Cheers!

  5. Unilever Centre for Molecular Informatics, Cambridge - petermr’s blog » Blog Archive » Comments on comments and agents and eyeballs says:

    [...] ChemSpiderMan Says: October 1st, 2007 at 3:55 pm ePeter, I’ve given many examples of the issue of Data Quality on the blog. Some links are: http://www.chemspider.com/blog/?p=64 http://www.chemspider.com/blog/?p=164 http://www.chemspider.com/blog/?p=168 http://www.chemspider.com/blog/?p=137 [...]

  6. ChemSpider Blog » Blog Archive » Enforcing Copyright of CAS Numbers says:

    [...] ID” for a compound. Check out my earlier posts about the need for curation (1,2,3,4 and many others). CAS is very highly curated and are the authority for the CAS numbers. PubChem [...]

  7. A naive biochemist wakes up to the closed world of chemical abstracts and such « The Omics world says:

    [...] providers using a suitable lookup id. Naively I assumed this would be the CAS id which is the “unique id” associated with each molecule. An hour of googling later I woke up to the realization that CAS is [...]

  8. ChemSpider Blog » Blog Archive » STOP COUNTING the Number of Chemical Entities in Public Compound Databases and There are Ghosts in the Closet says:

    [...] simple example is that of methane in PubChem that I have blogged about many times…one example here. Here are some of the names associated with the structure of methane on PubChem: [...]
