Open Targets data available as a Google Cloud public dataset!

We’re very happy to announce that both Open Targets Platform and Genetics data are now available as a Google Cloud public dataset! :tada:

:point_right: Open Targets Platform
:point_right: Open Targets Genetics

Using Google BigQuery, you can now process 1 TB of this data per month for free.

BigQuery can be accessed using the Cloud Console, the bq command-line tool, or through the BigQuery REST API.
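For the bq route, here is a minimal sketch of checking how much of the free monthly allowance a query would use before running it. It only prints a `bq query --dry_run` command (a dry run reports the bytes a query would scan without executing it) and converts a byte count into a share of 1 TB. The table name is an assumption for illustration, so please check the dataset page for the real table names, and the 1 TB is taken here as 2^40 bytes, which is also an assumption.

```shell
# Sketch only: the table name is an assumption, not taken from the announcement.
TABLE='bigquery-public-data.open_targets_platform.targets'

# Print (not run) a dry-run query; --dry_run reports the bytes that would be scanned.
build_dry_run_cmd() {
  printf "bq query --use_legacy_sql=false --dry_run 'SELECT * FROM \`%s\` LIMIT 10'\n" "$1"
}

# Percentage of the free 1 TB per month (assumed to be 2^40 bytes) used by N bytes.
free_tier_percent() {
  awk -v b="$1" 'BEGIN { printf "%.2f\n", b / (1024 ^ 4) * 100 }'
}

build_dry_run_cmd "$TABLE"
free_tier_percent 5000000000   # a query that scans 5 GB
```

Copy the printed command into a shell with the Cloud SDK installed, then pass the byte count it reports to `free_tier_percent`.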

New to BigQuery? Take a look at the sample queries on the dataset page, or explore topics here on the Open Targets Community:

If you’re familiar with BigQuery, we’d love to know how you use Open Targets data!


Hi @hcornu, this is fantastic work. I’m interested in downloading the harmonised sumstats (primarily GWAS Catalog, and possibly UKBB/FinnGen as well) so that I have a local-ish copy to use in some pipelines. Is this something that can be done through Google Cloud, and if so, how?

Many thanks again for a fantastic resource!

Marcus


Hi @marcustutert, the instructions in this StackOverflow answer appear to do what you want:

(make sure to substitute with the correct dataset name)

Please let us know if this doesn’t work for you and we’ll be able to help.

Hi @Kirill_Tsukanov, many thanks for the response. This has partly answered my question, but I am wondering how to download the copy from Google Cloud Storage into a local directory, rather than simply “move” it to another Google Cloud bucket. Is that something that can be done?

Marcus

Hi @marcustutert, if you want to download the files locally, then you first need to do the export as described in that answer (note that it doesn’t only move the data, it exports it from the BigQuery database into the Google Storage bucket). Once this is done, you will be able to download the data locally using:

gsutil -m cp -r gs://bucket/filename.ext local_directory

(make sure to substitute the bucket path and local directory relevant to your case). Please let me know if you have any more questions; I’ll be happy to help.
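Putting the two steps together, a rough sketch could look like this. The project, dataset, table and bucket names are all placeholders, not real Open Targets names, and the script only prints the commands by default (`DRY_RUN=1`) so nothing is exported or downloaded by accident.

```shell
# Sketch only: project, dataset, table and bucket names are placeholders to replace.
# DRY_RUN=1 (the default here) prints each command instead of running it.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Step 1: export a BigQuery table into a Cloud Storage bucket you own.
# The wildcard in the destination lets BigQuery split a large table into several files.
export_table() {
  run bq extract --destination_format=CSV --compression=GZIP "$1" "$2"
}

# Step 2: copy the exported files from the bucket to a local directory.
download_files() {
  run gsutil -m cp -r "$1" "$2"
}

export_table 'my-project:my_dataset.my_table' 'gs://my-bucket/export/part-*.csv.gz'
download_files 'gs://my-bucket/export' './local_directory'
```

Once the placeholders are replaced, running the script with `DRY_RUN=0` executes the commands for real.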

Thanks again @Kirill_Tsukanov for the helpful reply. When I previously spoke to members of OTAR, they told me that downloading the data (locally, off the Google Cloud bucket) would cost £ to do. Is this still the case? Also, if I wanted to pull more up-to-date OTAR data from Google Cloud, how would you suggest I do so? Is there a way to automate a sync that only collects the “diff” between the two files, perhaps?

Marcus

Hi @marcustutert. Indeed, when you download data from the Google Cloud platform, Google applies egress charges. The exact amount depends on your location; the complete table can be found here: All networking pricing | Virtual Private Cloud | Google Cloud.

The egress charges don’t apply when you move the data within the cloud (for example, if you export the BigQuery dataset to a cloud bucket using the approach I described earlier).

However, the egress charges will apply whenever you export the data from the cloud bucket to your local setup outside the Google Cloud infrastructure. This would apply in any case, whether you’re downloading from our “open-targets-genetics-releases” bucket, or from your own bucket to which you exported the BigQuery data.

It’s important to note that these are charged by Google, so we have no control over them (and none of that money goes to Open Targets either).

You can always fetch all of the data generated by Open Targets for free from the EMBL-EBI FTP server. Please see these links for further reference:
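As a rough illustration of the FTP route, the sketch below builds a recursive `wget` command for one release. The host and directory layout are assumptions on my part, so please confirm them against the links in this post before relying on them; `RELEASE` is a placeholder for a real release name. The script only prints the command.

```shell
# Sketch only: the host and directory layout are assumptions; confirm them
# against the links in this post before relying on them.
FTP_BASE='ftp://ftp.ebi.ac.uk/pub/databases/opentargets'

# Build the URL for one release of one product ("platform" or "genetics").
release_url() {
  echo "$FTP_BASE/$1/$2/"
}

# Print (not run) a recursive wget command for that release.
build_wget_cmd() {
  echo "wget --recursive --no-parent --no-host-directories $(release_url "$1" "$2")"
}

build_wget_cmd genetics RELEASE   # replace RELEASE with a real release name
```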

Finally, to address your question about automating a sync: unfortunately, there isn’t a straightforward way to do this. Note that the schema of the data may (and frequently does) change between releases, including adding and reorganising certain data fields, so an incremental sync is not an easy task in this situation.
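If a file-level refresh of a single directory is enough for your pipelines, `gsutil rsync` may be worth a look, since it only copies files that are new or have changed. This is a sketch under assumptions, not a recommended workflow: the path inside the bucket is a placeholder, it does not produce a diff between releases or reconcile schema changes, and egress charges still apply to whatever is copied. The script prints the command by default.

```shell
# Sketch only: RELEASE_DIR is a placeholder path; DRY_RUN=1 prints the command.
DRY_RUN="${DRY_RUN:-1}"

# Mirror a bucket directory locally, copying only new or changed files.
# gsutil rsync's own -n flag can be added to preview what it would copy.
sync_release() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "gsutil -m rsync -r $1 $2"
  else
    gsutil -m rsync -r "$1" "$2"
  fi
}

sync_release 'gs://open-targets-genetics-releases/RELEASE_DIR' './otg_local'
```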

Please let me know if you have any further questions; I’ll be happy to help.

Note that Open Targets doesn’t provide harmonised sumstats (though we use them internally). For those you would need to go to the original providers, namely GWAS Catalog, FinnGen, and wherever you want UKB sumstats from. OT fine-mapping data is available, though (v2d_credset).


Does BigQuery always have the dataset from the latest release?

Hi Karan, yes, Google BigQuery always has the datasets from the latest release.
A detailed response to your query can be found here.