How much space do I need to download the full Open Targets data?

Hi Open Targets community,
I am a postdoctoral researcher at Northwestern University. I have recently been exploring the fantastic Open Targets data, and I plan to download the full dataset to start a comprehensive analysis. Before I do, I would like to know how much storage space the full download requires, so that I can request a corresponding allocation from my university.
Thank you very much!

Hello @mengysun! :wave:

Welcome to the Open Targets Community! :tada:

We make multiple datasets available for download via FTP. For a complete list, please visit:

http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/latest/

For our most recent release, the datasets generated by our ETL pipelines totalled 54.09GB in JSON format and 17.28GB in Parquet format.

Please note that these figures do not include our literature index, which is significantly larger. Also, some of the datasets included in those counts may not be relevant to your analyses (e.g. our search index datasets, specific entity annotation datasets, etc.).

Instead, I suggest that you use our BigQuery instance - open-targets-prod - to get familiar with our datasets and their respective schemas. Then you can download the datasets most relevant for your work.

If you want to see the size of specific dataset directories on our FTP server, I would recommend using a command-line tool like lftp and running its du command on the directories of interest.

https://lftp.yar.ru/lftp-man.html

Please feel free to comment below if you have questions about specific datasets.

Thank you! :slight_smile:

~ Andrew

Thanks a lot. I may need to look at the literature index, because I am particularly interested in linking literature to target prediction. Is it possible to download the data related to that? I know the literature mining was done on Europe PMC, and Europe PMC has an API that returns these links (with some limits, as all APIs have). Redoing the mining on Europe PMC from scratch looks very labor-intensive.

Hi @mengysun!

Yes, you can download all of the data for our literature index. This forms the basis for the Bibliography feature found on our target, disease/phenotype, and drug profile pages. The data for our literature index is available in Parquet format at http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/latest/output/literature/parquet/

You may also be interested in the EuropePMC evidence datasets that we use to build millions of target-disease associations. You can find these datasets in Parquet format at http://ftp.ebi.ac.uk/pub/databases/opentargets/platform/latest/output/etl/parquet/evidence/sourceId%3Deuropepmc/

Thanks a lot! These are very helpful!