My goal is to use Google BigQuery to get the versions of the evidence sources when I obtain the final association score and, e.g., the somatic mutation scores for a certain disease. Could you please give me some sample code to do this? Thank you.
If you want traceability of which Open Targets version you are using in your analysis, you can include the open-targets-prod.platform.ot_release table. This version will be the same for all sources of evidence.
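As a minimal sketch of what that could look like from Python: the query below cross-joins the release table onto association rows so that every score carries the platform release it came from. Only open-targets-prod.platform.ot_release is named in this thread. The association table name, the diseaseId and datatypeId columns, and the "somatic_mutation" value are assumptions for illustration, so please check them against the dataset schema before relying on this.

```python
# Only the ot_release table is named in this thread. The association table
# and its column names below are ASSUMPTIONS made for illustration.
RELEASE_TABLE = "open-targets-prod.platform.ot_release"
ASSOCIATION_TABLE = "open-targets-prod.platform.associationByDatatypeDirect"  # assumed


def build_query(release_table=RELEASE_TABLE, association_table=ASSOCIATION_TABLE):
    """Return SQL that attaches the platform release to each association row."""
    return f"""
    SELECT
      rel AS ot_release,  -- whole release row as a STRUCT, no column names needed
      assoc.*
    FROM `{association_table}` AS assoc
    CROSS JOIN `{release_table}` AS rel
    WHERE assoc.diseaseId = @disease_id      -- assumed column name
      AND assoc.datatypeId = @datatype_id    -- assumed column name
    """


def run_query(disease_id, datatype_id="somatic_mutation"):
    """Run the query; needs google-cloud-bigquery and Google Cloud credentials."""
    from google.cloud import bigquery  # imported lazily so build_query works offline

    client = bigquery.Client()
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("disease_id", "STRING", disease_id),
            bigquery.ScalarQueryParameter("datatype_id", "STRING", datatype_id),
        ]
    )
    rows = client.query(build_query(), job_config=job_config).result()
    return [dict(row.items()) for row in rows]
```

Calling run_query with a disease identifier of your choice would return one dictionary per association row. Query parameters are used instead of string formatting so the identifiers are passed safely.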
If you need more granular provenance of our data, we internally produce a manifest that contains metadata about everything that feeds into each of our releases. This is not available on BigQuery, but you can download it from the FTP.
Thank you so much for your explanation. It is helpful. But I have another question.
I used BigQuery to check the version of the Open Targets Platform (OTP) and found that the current version is 23.09. So OTP reports the version of all evidence sources as 23.09. Beyond that, I want to find out which version of, e.g., UniProt (the latest is Release 2023_04) OTP uses. I cannot find this version information directly on the FTP.
So far, I can only get the “created” date (2023-08-30T00:15:52+00:00) of a database from the FTP link. I can then look through the release history of that database to work out which version it matches. I am curious whether there is a more efficient way to obtain the versions of the databases. Thank you.
Knowing the exact release of each of our sources isn’t straightforward. In summary, the process of generating data for a release looks like this:
Data providers deposit their latest data in designated buckets.
Our system automatically selects the most recent data from each provider.
We then generate a consolidated set of evidence from this data.
For a detailed breakdown of the data feeding into a specific release, you can refer to our manifest at gs://open-targets-data-releases/23.09/input/manifest.json. There you will find paths that tell us about the provenance of our data, though not necessarily the version numbers that our providers use internally.
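If it helps, here is a rough sketch for pulling those paths out of the manifest once you have a local copy (for example via gsutil cp gs://open-targets-data-releases/23.09/input/manifest.json .). It deliberately assumes nothing about the manifest's layout: it walks whatever JSON it is given and keeps every string that starts with a URI-like prefix. The prefix list and the helper names are my own choices for illustration, not part of the manifest format.

```python
import json

# Assumption: provenance entries are strings that begin with one of these prefixes.
URI_PREFIXES = ("gs://", "http://", "https://", "ftp://")


def collect_paths(node, location="$"):
    """Walk parsed JSON and return (json_location, value) for path-like strings."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found.extend(collect_paths(value, f"{location}.{key}"))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found.extend(collect_paths(value, f"{location}[{index}]"))
    elif isinstance(node, str) and node.startswith(URI_PREFIXES):
        found.append((location, node))
    return found


def paths_from_manifest(filename="manifest.json"):
    """Load a locally downloaded manifest and list its path-like entries."""
    with open(filename, encoding="utf-8") as handle:
        return collect_paths(json.load(handle))


if __name__ == "__main__":
    for location, path in paths_from_manifest():
        print(f"{location}\t{path}")
```

Each printed line pairs the position in the JSON with the path found there, which makes it easier to see which input a given file belongs to. File names in those paths sometimes carry dates, but as noted above they are not guaranteed to match the provider's own release numbering.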
However, if you’re interested in time stamping individual evidence instead of the data sources, I want to let you know that we’re actively addressing this. You can stay updated on our progress in this ticket: