I just downloaded the full Open Targets dataset in Parquet format and set up sparklyr with R. The demo script works fine. However, I’m a bit lost on how to run a systematic query with a list of UKBB endophenotypes and get back a data.frame or tibble with the association information. In practice, I want the “Association Information” table that is available, for example here, using “Red blood cell (erythrocyte) distribution width” as the query.
Is there a more detailed description of the data structure somewhere or more detailed example scripts?
Something we are currently working on is the best way to communicate what fields are available (or relevant) for each dataset.
In the meantime, you can see which fields are available for each datasource in our JSON schema. You will be interested in the “ot_genetics_portal” dataset fields. The “projectId” field, for example, tells you which entries come from UKBB GWAS.
Keep in mind that what you see through the Open Targets Platform is every GWAS-significant locus with an L2G score > 0.05. This is the best dataset if your analysis focuses on potentially causal genes. If, instead, your focus is on the signals themselves, you should probably consider accessing the genetics portal data directly.
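As a rough illustration of the filtering described above, here is a minimal Python sketch that works on evidence records already loaded into memory as dictionaries. Only “ot_genetics_portal” and “projectId” come from this thread; the field names `datasourceId` and `diseaseFromSource`, the project value `HYPOTHETICAL_UKBB_PROJECT`, and the toy records are assumptions made for the example, so please check them against the JSON schema before relying on them.

```python
# Assumed field names: only "projectId" is confirmed in this thread.
DATASOURCE_FIELD = "datasourceId"      # assumption
PROJECT_FIELD = "projectId"            # mentioned in the thread
TRAIT_FIELD = "diseaseFromSource"      # assumption


def select_associations(records, traits, project_ids):
    """Keep genetics-portal records from the given projects whose
    reported trait matches one of the requested traits (case-insensitive)."""
    wanted = {t.strip().lower() for t in traits}
    projects = set(project_ids)
    return [
        r for r in records
        if r.get(DATASOURCE_FIELD) == "ot_genetics_portal"
        and r.get(PROJECT_FIELD) in projects
        and str(r.get(TRAIT_FIELD, "")).strip().lower() in wanted
    ]


# Toy records for illustration only; these are not real Open Targets data.
example_records = [
    {"datasourceId": "ot_genetics_portal",
     "projectId": "HYPOTHETICAL_UKBB_PROJECT",
     "diseaseFromSource": "Red blood cell (erythrocyte) distribution width"},
    {"datasourceId": "ot_genetics_portal",
     "projectId": "HYPOTHETICAL_OTHER_PROJECT",
     "diseaseFromSource": "Red blood cell (erythrocyte) distribution width"},
    {"datasourceId": "some_other_datasource",
     "projectId": "HYPOTHETICAL_UKBB_PROJECT",
     "diseaseFromSource": "Red blood cell (erythrocyte) distribution width"},
]

hits = select_associations(
    example_records,
    traits=["Red blood cell (erythrocyte) distribution width"],
    project_ids=["HYPOTHETICAL_UKBB_PROJECT"],
)
```

With sparklyr, the same three conditions could be expressed as a dplyr `filter()` on the Spark table before collecting the result into a tibble.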
Thanks, that was useful. However, I’m still missing some information: in particular, the “Credible Set Size” and “LD Set Size” columns from the study summary page, and the Gene Prioritization details, such as the “Variant Pathogenicity” and “Distance” columns. I guess they are present in some other datasources. Thanks!
The Open Targets Platform does not provide that level of granularity. Instead, you can find this information by accessing the Open Targets Genetics data. More information on how to access this data can be found in the documentation.