Hi,
I’m new to the platform (and to both GraphQL and SQL). I have a list of genetic variants as rsIDs, for which I would like to retrieve the table of “assigned genes” that is shown on the website when I query each variant individually. I understand that to do this, I will first need to map the rsIDs to the Open Targets variant IDs (e.g. 1_154453788_C_T).
I’m having a hard time working out how to do this from your documentation - perhaps I’m looking in the wrong place?! - and so would appreciate some advice on how I can do this, or to be pointed in the direction of some tutorials and a schema. From what I understand, the best way to achieve this would be via the BigQuery instance, but the relevant tables and column names seem like a mystery at the moment.
Hello @new_user,
For your query you would need to use two of our datasets:
The variant index dataset, to map your rsIDs to our variant notation.
The variant to gene (V2G) scored dataset, to annotate your variants with the associated genes and their scores.
As always, if you have a large number of variants you want to annotate, I’d encourage you to use our dataset dumps rather than the API, as they are easier to work with. To get the variant information, you can join your rsIDs with our variant index on rs_id and take chr_id, position, ref_allele and alt_allele (we build our variant ID by concatenating these 4 columns, e.g. 1_154453788_C_T). Then, to get the V2G scores, you can join the harmonised variant IDs with our scored dataset and read the columns gene_id and overall_score.
As explained in the documentation, you should be able to follow this approach either by downloading the data from the FTP or by using BigQuery.
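The two joins described above can be sketched in plain Python as follows. This is only an illustration under assumptions: it supposes both datasets have already been loaded as lists of dicts, that the scored dataset carries the same chr_id, position, ref_allele and alt_allele columns as the variant index, and that the variant ID parts are joined with underscores as in 1_154453788_C_T. The function names, the rsIDs, gene IDs and scores are made-up placeholders, not real Open Targets data.

```python
def variant_id(row):
    """Build a variant ID such as 1_154453788_C_T from the four key columns."""
    keys = ("chr_id", "position", "ref_allele", "alt_allele")
    return "_".join(str(row[k]) for k in keys)


def assigned_genes(rsids, variant_index, v2g_scored):
    """Map rsIDs to variant IDs, then attach gene_id and overall_score.

    `variant_index` and `v2g_scored` are iterables of dicts (hypothetical
    in-memory stand-ins for the downloaded datasets).
    """
    wanted = set(rsids)

    # Step 1: rsID -> variant ID, using the variant index.
    rsid_by_variant = {}
    for row in variant_index:
        if row["rs_id"] in wanted:
            rsid_by_variant[variant_id(row)] = row["rs_id"]

    # Step 2: variant ID -> (gene_id, overall_score), using the scored dataset.
    results = []
    for row in v2g_scored:
        vid = variant_id(row)
        if vid in rsid_by_variant:
            results.append({
                "rs_id": rsid_by_variant[vid],
                "variant_id": vid,
                "gene_id": row["gene_id"],
                "overall_score": row["overall_score"],
            })
    return results


# Toy rows: placeholder values only, chosen to exercise the join.
variant_index = [
    {"rs_id": "rs_placeholder_1", "chr_id": "1", "position": 154453788,
     "ref_allele": "C", "alt_allele": "T"},
    {"rs_id": "rs_placeholder_2", "chr_id": "2", "position": 1000,
     "ref_allele": "A", "alt_allele": "G"},
]
v2g_scored = [
    {"chr_id": "1", "position": 154453788, "ref_allele": "C", "alt_allele": "T",
     "gene_id": "GENE_PLACEHOLDER_A", "overall_score": 0.5},
    {"chr_id": "1", "position": 154453788, "ref_allele": "C", "alt_allele": "T",
     "gene_id": "GENE_PLACEHOLDER_B", "overall_score": 0.25},
    {"chr_id": "3", "position": 42, "ref_allele": "G", "alt_allele": "A",
     "gene_id": "GENE_PLACEHOLDER_C", "overall_score": 0.1},
]

rows = assigned_genes(["rs_placeholder_1"], variant_index, v2g_scored)
for r in rows:
    print(r["rs_id"], r["variant_id"], r["gene_id"], r["overall_score"])
```

For a large variant list the same two-step logic would more naturally be expressed as joins in BigQuery or a dataframe library; the dictionary lookup here is just the simplest way to show which columns are matched at each step.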
Thanks Irene, this is very useful. More generally, where is the best place to look in order to determine which variables are stored in which datasets? I have read the docs but perhaps have missed where a schema is conveniently available (i.e. outside of being stored within the individual datasets).