As reported recently, we have handed over the entire ChemSpider database to PubChem for deposition. I did receive a number of offline requests asking when it would be deposited. I am so used to having us survive with a minimum number of servers, and in a world of developing processes to support close to 20 million compounds, that I was estimating the end of the year at the earliest for deposition and exposure.

My favorite color is Green. And I am experiencing green now. Pure envy…but in a good way. The data were only delivered to PubChem late last week and I’ve been informed that they should have the data deposited by the end of this weekend or early next week. That is amazing. It’s all about their experience of receiving data from many data sources and “learning lessons”. It’s all about access to “enough hardware”. It’s all about the commitment of the PubChem team to keep things moving forward, to not lose momentum regarding the benefit of the PubChem project to the chemistry community at large, and to stay on task. I’m impressed…and green :-)

Keep your eye on the PubChem data sources page and you will see ChemSpider top out at about 17.8 million structures when all are deposited. We’re proud and happy to have contributed!

We have also had requests from other people to access the ChemSpider database files. Yesterday an organization tried to download the files by FTP, but unfortunately we have had to cut off access: so many files were being downloaded, and so much bandwidth consumed, that we have decided we can only provide the data on DVD. Apologies…


3 Responses to “PubChem Deposition of ChemSpider Data is Well Underway. My Favorite Color is Green.”

  1. Rich Apodaca says:

    The problem of redistributing the massive amount of data available on sites like PubChem and ChemSpider is very real, and will only get worse.

    In that sense, ChemSpider benefits by depositing its compounds with PubChem. This lets PubChem do what it’s good at (data warehousing) and lets ChemSpider do what it’s good at (collecting and curating data).

    The size of the FTP site you’d need to support should shrink significantly if instead of providing a complete connection table for a record, all you include is a PubChem CID. Have you tried this approach?

    A Web API is another approach that bypasses a lot of the problems of the FTP site (while creating some new ones in the process).

  2. Antony Williams says:

    Rich, I’ve been exchanging emails with the PubChem team for the past couple of days. They have already stepped in to distribute the ChemSpider files to one of the other organizations that requested them. For the future they have acknowledged that the best way to help ChemSpider distribute structures is to use the “fat pipe” of NIH (and I mean it in a nice way). I have been informed within the past hour that all of our data are already on PubChem but only the first 3 million are searchable. I am presently downloading the SID to ChemSpider ID mapping to allow us to connect cleanly. Re. the Web API, yes we COULD do that. But relative to all the other things waiting on us, and the limited resources issue, we won’t do that in the near future. PubChem have been very supportive of our efforts and gracious enough to help the distribution through their process. Who am I to say no to that!

  3. ChemSpider Blog » Blog Archive » eMolecules and ChemSpider - A Respectful Comparison of Capabilities says:

    [...] the PubChem collection but the data sources collection is much more diverse now and we actually deposited back to PubChem (which I don’t believe eMolecules has yet?). Our structures are unique..but you MUST be [...]
