In a couple of email exchanges this weekend, the “Right to Use” FAQ regarding data provided on ChemSpider was under discussion. The FAQ page hasn’t been updated since we went live in March so, based on almost six months of experience with feedback and commentary, and stimulated by the exchange over the weekend, I’ve updated the FAQ page to state the following:

“May I download the data and use it in my own database(s)?

You have limited rights in this regard. You can only assemble a database of 5000 structures or less, and their associated properties, from our database without our permission. You can download up to 1000 structures per day from the website. Please contact us at feedbackATchemspiderDOTcom to request an extension outside this constraint. We are willing to provide the ENTIRE database of ChemSpider structures at your request – the file will consist of InChI Strings, InChIKeys and ChemSpider IDs. These constraints are under regular review so please feel free to engage us in conversation.”

What we’re trying to do here is to stop the offshore raiding of the database that is going on. Certain groups are attempting to download the whole database and are putting an incredible load on our server(s). So, please stop!

We are presently in the process of downloading the entire database into a series of SDF files so that we can provide the ENTIRE ChemSpider database to interested parties. We will shortly cross the 20 million mark in terms of unique structures in the database. Each structure will be accompanied by the InChI String, the InChIKey and the associated ChemSpider ID. We are INTENT on proliferating the value of InChI across the chemical community and expanding the value of InChI to the semantic web.
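
For anyone wondering what working with such a file might involve, here is a minimal sketch in Python that reads the data fields out of an SD file. It assumes only the standard SD file layout (a "> <TAG>" header line, the value, a blank line, and "$$$$" ending each record). The tag names InChI, InChIKey and CSID, the sample record and the function name iter_sdf_records are assumptions made for illustration; the field names in the actual ChemSpider export are not stated here.

```python
import io
from typing import Dict, Iterator, Optional, TextIO


def iter_sdf_records(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield the data fields of each SD file record as a dict (mol block is skipped)."""
    fields: Dict[str, str] = {}
    tag: Optional[str] = None
    for raw in handle:
        line = raw.rstrip("\r\n")
        if line.strip() == "$$$$":
            # Record terminator: emit what we collected and start afresh.
            yield fields
            fields, tag = {}, None
            continue
        if line.startswith(">"):
            # Data header such as "> <InChIKey>": take the name between < and >.
            start = line.find("<")
            end = line.find(">", start + 1)
            if start != -1 and end != -1:
                tag = line[start + 1:end]
                fields[tag] = ""
                continue
        if tag is not None:
            if line == "":
                tag = None  # a blank line closes the current data item
            else:
                # Multi-line values are joined with newlines.
                fields[tag] = f"{fields[tag]}\n{line}" if fields[tag] else line


# Illustrative record for methane. The tag names are assumed and the CSID
# value is a placeholder, not a real ChemSpider ID.
SAMPLE = """\
methane
  illustrative record

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
> <InChI>
InChI=1S/CH4/h1H4

> <InChIKey>
VNWKTOKETHGBQD-UHFFFAOYSA-N

> <CSID>
0

$$$$
"""

records = list(iter_sdf_records(io.StringIO(SAMPLE)))
for record in records:
    print(record["CSID"], record["InChIKey"], record["InChI"])
```

Because the function is a generator, it handles one record at a time, which matters for a file set covering millions of structures. For a real file, pass an open file handle in place of the io.StringIO wrapper.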

So, a question to you, our readers…is there anyone out there who would like to receive the ChemSpider database when it is ready? Please let me know by responding to this blog post. Thanks.

7 Responses to “Who Would Like to Have the Entire ChemSpider Database?”

  1. Rajarshi Guha says:

    What would your update policy be? Would incremental updates be provided (such as the monthlies that PubChem does)? Do you ever delete any entries?

  2. Egon Willighagen says:

    This is an interesting situation… if all data is (or would be) ‘open data’, then I could imagine a distributed FTP server setup; Linux distributions face the same problem. Or maybe a P2P network would even be better, with 100MB SDF (/CML.gz/whatever) files or so. The latter seems actually the most interesting setup.

    Really, I was waiting for this to happen; this allows you to make a difference; take things forward. ChemSpider, have a look at “debtorrent”. It’s a Debian/Ubuntu package distribution system on top of torrent, one of the P2P systems. It was developed this summer within the Google Summer of Code, but it needs to be worked out how the .deb files are replaced by .cml.gz files (or sdf.gz).

  3. Rajarshi Guha says:

    Egon’s solution is a good idea for bandwidth management, but that is not really a policy problem.

    What gets placed in the update (if there is one) is more important for downstream infrastructure. We face a similar problem with PubChem Bioassay data – they have no update mechanism and basically dump new assays as they arrive. I could impose a monthly schedule on my side, but there are no monthly incrementals.

  4. Antony Williams says:

    Rajarshi…as yet we have not needed to delete any entries. That does not mean it will never happen, but we do not foresee any reason for it at present.

    In terms of incremental updates we are more likely to do it on an as-needed basis (when there is a significant jump in the number of structures). At least once a month is foreseen…maybe more often. It is all about where we apply our limited resources at present.

  5. Egon Willighagen says:

    Antony, with “You have limited rights in this regard”, I assume you really mean: “You have limited options in this regard: while you are entitled (have the rights) to download everything, because of practical reasons we do not provide an easy to download one-contains-all zip file.”

    Is that a correct analysis?

  6. Antony Williams says:

    Yes, you are right…we have no experience of what it will do to our servers to have everyone downloading data in one big file. But, once it’s deposited at PubChem people can take it from there anyways and they will likely be able to support that better than us.

    One thing to clarify is that the ChemSpider database we are donating to PubChem will NOT contain any of the predicted properties etc as we do not have permission to give them away. It is irrelevant anyway…PubChem will remove them from the submission and replace with their own predicted properties for homogeneity purposes.

  7. ChemSpider Blog » Blog Archive » Another Response to Constructive Feedback from Peter Murray-Rust… says:

    [...] ChemSpider contribute to the community…and support PubChem [...]

