A few weeks ago I noticed that PubChem had grown substantially after a deposition from the ZINC group. I had thought, incorrectly, that this was due to the deposition of protonated forms of the ZINC database, because they produce such forms as part of their docking procedures. I had discussed this possibility with Evan Bolton from the PubChem team when we were at the InChI meeting in Glasgow. In fact, the growth was not due to different protonation states but to ZINC depositing 12M make-on-demand compounds that they hold in their catalogs.

For me these are virtual chemicals. The vendors involved with the deposition of such chemistry into the ZINC database have done research to demonstrate that the chemistries involved in producing these chemicals, when ordered, have a good probability of success, but the compounds are, for the time being, virtual only. In the early days of ChemSpider we discussed internally whether or not we should open ourselves to the deposition of virtual compounds, and we did add a dataset from the UsefulChem team at Drexel University. Since then, however, we have steered away from the deposition of such libraries.

As explained on the ZINC blog, a decision was made to remove 12 million of the make-on-demand chemicals as “Pubchem’s rules require that compounds have been made somewhere before they be included”. I’m fairly sure that what is left on PubChem does not fully exclude such compounds, as they are deposited by a number of vendors who have the ability to submit such collections, but I appreciate the effort made by ZINC to remove their deposition of this class.

I am interested in community feedback on this matter. Should ChemSpider host collections of virtual chemistry? There is certainly value for people who wish to perform activities such as virtual screening, but we don’t allow downloads of our entire database the way that ZINC and PubChem do. At present we are focused on layering on more information associated with a chemical compound: physicochemical properties, spectra, article links, patents, etc. We want to make sure that the chemistry represented in the backfile of RSC articles makes it onto ChemSpider in the future. This parallels some of the efforts being made by Fiz Chemie and InfoChem to make available the backfile of Chemisches Zentralblatt. We also want to make sure that the compounds in the Natural Product Updates file from RSC make it onto ChemSpider. We have a lot to do, but the focus is on getting real data and real structures onto the database and removing “junk chemistry” from the deposited data.

That said, we are interested in your comments. What are your thoughts regarding “virtual chemistry”? Should we support virtual compounds or not? For sure there will always be some virtual chemistry on there in some form – for example, structures that were once thought to be elucidated but were later shown to be something else are virtual chemistry. Compounds that have been deposited with incomplete stereochemistry can be “partial chemistry” if you like. Your thoughts and comments are welcomed.

10 Responses to “Should ChemSpider Host Virtual Compounds?”

  1. Antony Williams says:

    From Friendfeed

    Certainly there has to be a good reason for including compounds that have not been characterized. We routinely have Ugi products that we attempt to make but fail to isolate and we may discuss them in our notebook – it might be helpful to have these compounds indexed on ChemSpider and linked to the failed attempts. There is useful information there for anyone who might try to make them in the future. – Jean-Claude Bradley

  2. Sargis Dallakyan says:

    This is somewhat similar to what happened to theoretical models in PDB. I think that having only real compounds in PubChem or ChemSpider is a good idea. However, it’s also important to have a database (or software) that would give you virtual chemicals that can be readily synthesized. So, in that sense, having 12M make-on-demand compounds stored in ZINC but not in PubChem or ChemSpider is fine.

  3. Egon Willighagen says:

    I agree the focus of compound databases like PubChem and ChemSpider should be the chemical compounds we know something about. As discussed on Twitter, I agree that this knowledge may very well include a failed experiment. It’s negative knowledge, but knowledge nevertheless. “Don’t go that way, son”.

  4. hko says:

    Please don’t include virtual compounds in ChemSpider. If there is a real need then these virtual compounds should be stored in a separate database.

  5. Chris Singleton says:

    I think including only ‘real’ chemicals in ChemSpider is the way to go. As was mentioned in the last few sentences of the blog, chemicals that were once considered to be one structure and were then corrected are virtual. Taxol is one example that was discussed extensively, and there are several antibiotics that also have different structures for the same compound. If the stereo structure is not known then deposition is fine (and frankly can aid in elucidating the real structure), but to deposit a compound that COULD be made is entirely different. In the first case you are not sure of the exact assignments; in the latter you aren’t sure whether it even exists.
    However, one possible exception is a known series of reactions. For instance, if we talk about nitriles, we can dehydrate a primary amide to get a nitrile. So one could reasonably conclude that we could take most any alkyl primary amide and make a nitrile. Have all possible permutations of nitriles from C2 to C20 skeletons been made? Probably not, but there is no reason to think that they could not be synthesized or could not exist. So for these types of compounds, where adding in virtual compounds would be just ‘filling in the blanks’ in an already well-established reaction series, I would support it, as long as we mention that it is virtual and what our rationale is for including it.

  6. Tobias Kind says:

    Hi Tony,
    the question is not *should* but *could* aka:
    Is Chemspider capable of including all screening compounds?

    The concerns about “validity” of virtual compounds
    can be easily dismissed, nomen est omen, they are
    of course not “real existing” compounds, but all
    virtual. Are they “valid” compounds, hell yeah,
    not sure if cursing is allowed here, of course
    they are.

    When it comes to selecting compounds, it’s easy
    to set a filter, “natural product” or
    existing organic compound or “virtual
    compound”, which can be changed once fully synthesized.
    So that is none of the user’s concern, but
    rather a small technical issue for RSC.

    If Chemspider (similar to PubChem, which actually
    has a different mandate) decides to drop
    virtual compounds, other databases will come up,
    if virtual compound databases are needed.

    The existence of ZINC (http://docking.org/?q=node/131)
    or the GDB-13 database (unibe.ch) which currently
    covers around 1 billion compounds downloadable for free
    http://www.dcb-server.unibe.ch/groups/reymond/gdb/home.html
    speaks volumes.

    Just because people have never heard of data-driven science,
    or can’t imagine what to do with such virtual
    compounds, doesn’t automatically make it “junk chemistry”.
    I am well aware of your quotation marks :-)

    This is a good example where scientific p2p technology
    would work, instead of a single host, people upload and
    download part of such a database via peer-to-peer networks.
    Or maybe Sigma-Aldrich, which just recently bought
    ChemNavigator with its 83 million compound database,
    will be the new kid on the block, similar to Chemspider
    some months/years ago. Once isomer generators and
    combinatorial algorithms and their filters will be
    fast enough to work in real-time, there’s no need to
    put such virtual compounds in a database anyway.

    But such large databases of virtual compounds
    must be downloadable in bulk for HT drug screening,
    as datasets for cheminformatics or other purposes.
    It does not really make sense to scroll online
    through 1 to 10 compounds at a time.

    As you have said, semantic annotation of
    new chemistry publications is much more important.
    I wouldn’t care that much about the old stuff, that is
    very well covered in Beilstein and CAS Scifinder.

    That said, if Chemspider can store one or two billion
    compounds with full download capability – keep it.
    If not – drop it.

    Cheers
    Tobias

  7. Physchim62 says:

    As the ChemSpider team know already, I found a couple of “virtual compounds” lurking on ChemSpider today. They weren’t anything like the stuff in the ZINC database, rather it was more of a “Chinese whisper” effect working its way through InChIs to rewrite several chapters of inorganic chemistry, if these compounds were to exist in the form they were presented!

    It is a trivial matter to generate, say, acyclic alkane isomers, but what is the lightest alkane which has yet to be isolated or synthesized? Herein lies the distinction between chemical information and mathematics. The “virtual chemists” should stay exactly that – virtual, not real.

  8. Markus Sitzmann says:

    Interesting discussion. However, I am not quite sure how a real compound can be separated from a virtual one. Well, probably most people would agree that a “real” compound needs to be available from some chemical substance vendor or to have been described somewhere in the literature. However, I am quite sure that most of the literature compounds are more “virtual” than any compound classified as “virtual” by a chemical substance vendor. Why? In the first case, if a compound that has been described somewhere in the literature cannot be found in some “forgotten” refrigerator, somebody has to re-synthesize it. Any chemist knows what that means – i.e. compounds like this can turn literally virtual :-) . On the other hand, if a chemical substance vendor knows a compound is expected to be easily accessible because all the necessary synthesis steps and building blocks are well known, it can be regarded as the much smaller risk – or, as I said, “less virtual”.

    In my opinion, as soon as there is some secondary information available about a compound – independent of whether it is classified as virtual or real – it is worthwhile to store it in a database. By secondary information I mean any information that is not accessible from the chemical structure itself. It might be as little as a URL to a (trustworthy) source. Or it might be what Egon said: it is confirmed that the compound is “virtual”, i.e. cannot be synthesized.

  9. Joerg Kurt Wegner says:

    Instead of discouraging the upload of not-yet-made molecules, I would rather like to see a confirmation for all “real” compounds, e.g. by explicitly annotating synthesis routes, papers, vendors, and prices (I know, I keep pushing on this)! ;-)
    Based on this, people will be encouraged to upload not only “real” compounds, but also synthesis or purchase details.

    Some companies might offer synthesis on request; strictly speaking those compounds are virtual, because you cannot pick them up. Anyhow, after a certain time such compounds are accessible, so I see no problems with such a scenario.

    Finally, molecule spam should nonetheless be clearly avoided, e.g. by “report this compound as spam or impossible”, which could be an interesting RSS feed to follow, especially if people have to provide arguments for why certain compounds are not accessible ;-)

  10. Chris Southan says:

    For those like myself who do think PubChem and ChemSpider compounds should all be/have been in pots somewhere, I can take a micro-credit for the recent PubChem shrinkage. In the first instance my co-author Sorel, as soon as we picked up the big recent jump in PubChem, immediately guessed someone had let in virtuals. All I did was alert the folk concerned at PubChem and ZINC, and they did the rest because the PubChem policy is to keep them out. Now, I take the point that was made to me that the synthetic success rate for some virtual make-on-demand compounds (MODs) was higher than for other nominal off-the-shelf compounds, and I can also see the utility of enumerating out to novel but synthetically plausible chemical space (pharma does this for their libraries, after all). Notwithstanding, the solution is clear informatic tagging so that they can not only easily be toggled in or out of search space by choice but are also held in repositories outside PubChem and ChemSpider. Cheers, Chris

  11. David Sharpe says:

    I too think this is an interesting debate. In principle I see no problem with hosting virtual compounds. Every ChemSpider record already contains a set of predicted properties, and however good the algorithm is, this is essentially virtual data. That doesn’t make it bad; as with any bit of scientific data, one has to appraise how it was obtained and how much value to place on it. I would suggest that any judgment based on the feasibility of the synthesis of a virtual compound is somewhat of a distraction. I think there are two questions that need to be answered to make the decision.

    1. What does ChemSpider want to be?
    Is it a resource for researching experimental data? If so, what are the criteria for accepting data? Should every record be a compound you can isolate and store in a pot? Or do you also include experimentally observed (but un-isolable) reactive intermediates? At the complete other end of the spectrum is the idea of ChemSpider as a hub for connecting out to information on chemical structures, where the only real rule is that all links and data in a record are consistent with the stated InChI. The advantage of being closer to the latter case is that you can (in theory at least) generate/search more specific subsets of the data from this sort of database.

    2. Would the information on virtual compounds be widely used?
    If the data is used widely by lots of researchers from differing backgrounds then there is justification for incorporating it. But if the principal users of the virtual compounds are likely to be a small group of companies or specific research groups, whose main interest is in screening libraries of virtual compounds, a separate database that is suited to their needs would seem to be a much better idea. Why would they want to get the data from ChemSpider rather than the source library?

    I would also suggest that unless there is a change of attitude in the science community and disclosure of unsuccessful reactions becomes more commonplace, almost all records for virtual compounds are likely to be ‘orphans’ in the sense that there will be nothing to link them to (other than information on other virtual libraries that hold the same structure).

    I don’t know what the answer to question 2 is, but I would say that in my previous life as a bench-based synthetic chemist, I was only ever interested in either synthetic procedures or analytical data, and I would have wanted searching of virtual compounds to be something I opted into (rather than being the default).

    I suppose the short answer is that I personally think virtual compounds are fine so long as the data is being used and is not a significant drain on resources (in terms of curation and hardware). Above all, it must be clear when records relate to virtual compounds, and this creates a restriction: it must be possible to incorporate virtual compounds in a way that is not a hindrance to people who specifically want to search for real compounds, but aids researchers who are interested in virtual compounds.

    Dave
