GWAS lead variants via API

Hi OpenTargets Team,

I am a huge fan of your resource. I am currently working on a script to query GWAS variant associations for a set of genetic variants, and I have managed to do so for PheWAS associations. However, since most of those are likely due to LD confounding, I am most interested in the link between the queried variant and genuine lead signals at published GWAS loci: basically the panel called "GWAS lead variants". It would be great to get a sample script using R.


Hello @pietznerm! :wave:

Welcome to the Open Targets Community! :tada:

If you want to recreate the “GWAS Lead Variants” table for a number of variants, I would recommend that you use our v2d dataset.

What is the Open Targets Genetics v2d dataset?

The v2d dataset is a series of JSON lines files, where each individual JSON line entry includes data on the tag and lead variants for a given study.

{"study_id":"GCST90012110","lead_chrom":"1","lead_pos":1574655,"lead_ref":"GGC","lead_alt":"G","direction":"+","beta":0.00657455,"beta_ci_lower":0.00476262212,"beta_ci_upper":0.00838647788,"pval_mantissa":4.3,"pval_exponent":-13,"pval":4.3E-13,"pmid":"PMID:32042192","pub_date":"2020-02-10","pub_journal":"Nat Med","pub_title":"Using human genetics to understand the disease impacts of testosterone in men and women.","pub_author":"Ruth KS","trait_reported":"Sex hormone-binding globulin levels adjusted for BMI","ancestry_initial":["European=368929"],"ancestry_replication":[],"n_initial":368929,"num_assoc_loci":815,"has_sumstats":true,"source":"GCST","trait_efos":["EFO_0004696"],"trait_category":"measurement","tag_chrom":"1","tag_pos":1537493,"tag_ref":"T","tag_alt":"A","overall_r2":0.868737707721,"AFR_1000G_prop":0.0,"AMR_1000G_prop":0.0,"EAS_1000G_prop":0.0,"EUR_1000G_prop":1.0,"SAS_1000G_prop":0.0,"log10_ABF":19.755002754225544,"posterior_prob":0.016564937297166}

How do I query the v2d dataset?

To query the dataset and find the GWAS lead variant information for a given variant, you will need to write an R script that queries the dataset for the following fields: tag_chrom, tag_pos, tag_ref, and tag_alt.

For example, using variant 1_1313807_G_A, the R script would need to query for:

tag_chrom == "1"
tag_pos == 1313807
tag_ref == "G"
tag_alt == "A"

The script would return 6 JSON objects, including:

{"study_id":"GCST007430","lead_chrom":"1","lead_pos":1384749,"lead_ref":"C","lead_alt":"G","direction":"-","beta":-0.0194,"beta_ci_lower":-0.025672,"beta_ci_upper":-0.013128,"pval_mantissa":1.98,"pval_exponent":-9,"pval":1.98E-9,"pmid":"PMID:30804560","pub_date":"2019-02-25","pub_journal":"Nat Genet","pub_title":"New genetic signals for lung function highlight pathways and chronic obstructive pulmonary disease associations across multiple ancestries.","pub_author":"Shrine N","trait_reported":"Peak expiratory flow","ancestry_initial":["European=321047"],"ancestry_replication":["European=24218"],"n_initial":321047,"n_replication":24218,"num_assoc_loci":265,"has_sumstats":true,"source":"GCST","trait_efos":["EFO_0009718"],"trait_category":"measurement","tag_chrom":"1","tag_pos":1313807,"tag_ref":"G","tag_alt":"A","overall_r2":0.757951324816,"AFR_1000G_prop":0.0,"AMR_1000G_prop":0.0,"EAS_1000G_prop":0.0,"EUR_1000G_prop":1.0,"SAS_1000G_prop":0.0}

Within each of the 6 JSON objects, you will see the published lead variant information in the lead_chrom, lead_pos, lead_ref, and lead_alt fields along with relevant study fields (e.g. study_id).

For example, using the above JSON object, the lead variant is 1_1384749_C_G and it was identified in the GWAS Catalog study GCST007430.
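
The query described above can be sketched as follows. This is only a minimal illustration in Python (the same filter translates directly to R), not the script used by the Open Targets team. The helper names `find_lead_variants` and `lead_variant_id` are made up for this example, and the two sample records are trimmed copies of the JSON objects shown in this thread. It assumes that `tag_chrom` is stored as a string and `tag_pos` as an integer, as in those examples.

```python
import io
import json


def find_lead_variants(lines, variant_id):
    """Yield v2d records whose tag variant matches variant_id (chrom_pos_ref_alt).

    `lines` is any iterable of JSON lines, e.g. an open file handle.
    """
    chrom, pos, ref, alt = variant_id.split("_")
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        if (
            str(rec.get("tag_chrom")) == chrom
            and rec.get("tag_pos") == int(pos)
            and rec.get("tag_ref") == ref
            and rec.get("tag_alt") == alt
        ):
            yield rec


def lead_variant_id(rec):
    """Build the chrom_pos_ref_alt identifier of the published lead variant."""
    return "{lead_chrom}_{lead_pos}_{lead_ref}_{lead_alt}".format(**rec)


# Trimmed copies of the two example records from this thread (illustration only).
sample = io.StringIO(
    '{"study_id":"GCST90012110","lead_chrom":"1","lead_pos":1574655,'
    '"lead_ref":"GGC","lead_alt":"G",'
    '"trait_reported":"Sex hormone-binding globulin levels adjusted for BMI",'
    '"tag_chrom":"1","tag_pos":1537493,"tag_ref":"T","tag_alt":"A",'
    '"overall_r2":0.868737707721}\n'
    '{"study_id":"GCST007430","lead_chrom":"1","lead_pos":1384749,'
    '"lead_ref":"C","lead_alt":"G","trait_reported":"Peak expiratory flow",'
    '"tag_chrom":"1","tag_pos":1313807,"tag_ref":"G","tag_alt":"A",'
    '"overall_r2":0.757951324816}\n'
)

hits = list(find_lead_variants(sample, "1_1313807_G_A"))
for rec in hits:
    print(rec["study_id"], lead_variant_id(rec), rec["trait_reported"])
# prints: GCST007430 1_1384749_C_G Peak expiratory flow
```

With a downloaded v2d file, an open file handle can be passed in place of `sample`. Because the function streams one line at a time, the whole dataset never has to fit in memory.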

How can I access the v2d dataset?

The Open Targets Genetics v2d dataset is available in JSON format from our FTP server:

Alternatively, you can also access the dataset in our BigQuery open-targets-genetics instance and use SQL to access the relevant fields:

Find GWAS lead variant, study, and reported trait information for a given tag variant with BigQuery

I hope this helps – feel free to post any follow-up questions below!


~ Andrew :slight_smile:


Thank you, Andrew!! This was really helpful and easy to implement. However, some overall_r2 entries contain missing values, and I was wondering how the connection between the lead signal and the tag variant was made in those cases.


Hello @pietznerm!

My sincere apologies for not responding to your follow-up question earlier.

Our Variant2Disease pipeline relies on different methods of expanding lead variants to tag variants, including LD expansion and fine mapping expansion. The FinnGen data that we have integrated also has its own method of lead variant to tag variant expansion.

For more information, please see our Variant2Disease pipeline documentation, which details the different methods we use.
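
When working with the rows themselves, one rough way to see which expansion evidence a record carries is to check which fields are populated. The sketch below is a heuristic only. It assumes that `overall_r2` is filled in by LD expansion and `posterior_prob` by fine-mapping expansion, which is an inference from the field names in the example records and is not confirmed in this thread. FinnGen rows may follow a different convention. The function name `expansion_evidence` and the third record are hypothetical.

```python
def expansion_evidence(rec):
    """Label which tag-expansion fields a v2d record carries.

    Heuristic (assumption): overall_r2 -> LD expansion,
    posterior_prob -> fine-mapping expansion.
    A field counts as missing if it is absent or null.
    """
    evidence = []
    if rec.get("overall_r2") is not None:
        evidence.append("ld")
    if rec.get("posterior_prob") is not None:
        evidence.append("fine_mapping")
    return evidence or ["unknown"]


records = [
    # Both fields present, as in the first example record in this thread.
    {"study_id": "GCST90012110", "overall_r2": 0.868737707721,
     "posterior_prob": 0.016564937297166},
    # Only overall_r2 present, as in the second example record.
    {"study_id": "GCST007430", "overall_r2": 0.757951324816},
    # Hypothetical record with a missing overall_r2 (not taken from the dataset).
    {"study_id": "EXAMPLE_STUDY", "overall_r2": None, "posterior_prob": 0.42},
]

for rec in records:
    print(rec["study_id"], expansion_evidence(rec))
```

Rows labeled `unknown` under this heuristic are the ones worth checking against the pipeline documentation.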