GWAS lead variants via API

Hi OpenTargets Team,

I am a huge fan of your resource. I am currently working on a script to query GWAS variant associations for a set of genetic variants. I managed to do so for PheWAS associations. However, since most of those are likely due to LD confounding, I am most interested in the link between the queried variant and genuine lead signals at published GWAS loci, basically the panel called "GWAS lead variants". It would be great to get a sample script in R.


Hello @pietznerm! :wave:

Welcome to the Open Targets Community! :tada:

If you want to recreate the “GWAS Lead Variants” table for a number of variants, I would recommend that you use our v2d dataset.

What is the Open Targets Genetics v2d dataset?

The v2d dataset is a series of JSON lines files, where each individual JSON line entry includes data on the tag and lead variants for a given study.

{"study_id":"GCST90012110","lead_chrom":"1","lead_pos":1574655,"lead_ref":"GGC","lead_alt":"G","direction":"+","beta":0.00657455,"beta_ci_lower":0.00476262212,"beta_ci_upper":0.00838647788,"pval_mantissa":4.3,"pval_exponent":-13,"pval":4.3E-13,"pmid":"PMID:32042192","pub_date":"2020-02-10","pub_journal":"Nat Med","pub_title":"Using human genetics to understand the disease impacts of testosterone in men and women.","pub_author":"Ruth KS","trait_reported":"Sex hormone-binding globulin levels adjusted for BMI","ancestry_initial":["European=368929"],"ancestry_replication":[],"n_initial":368929,"num_assoc_loci":815,"has_sumstats":true,"source":"GCST","trait_efos":["EFO_0004696"],"trait_category":"measurement","tag_chrom":"1","tag_pos":1537493,"tag_ref":"T","tag_alt":"A","overall_r2":0.868737707721,"AFR_1000G_prop":0.0,"AMR_1000G_prop":0.0,"EAS_1000G_prop":0.0,"EUR_1000G_prop":1.0,"SAS_1000G_prop":0.0,"log10_ABF":19.755002754225544,"posterior_prob":0.016564937297166}
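To make the record layout concrete, here is a minimal sketch in Python (the same idea carries over to R) that parses one JSON line and separates the lead and tag variant fields. The record is an abbreviated copy of the entry above, trimmed to a handful of fields for readability.

```python
import json

# Abbreviated copy of the v2d entry shown above (most fields omitted).
line = (
    '{"study_id":"GCST90012110","lead_chrom":"1","lead_pos":1574655,'
    '"lead_ref":"GGC","lead_alt":"G","tag_chrom":"1","tag_pos":1537493,'
    '"tag_ref":"T","tag_alt":"A","overall_r2":0.868737707721}'
)

rec = json.loads(line)

# Fields describing the published lead variant for this study.
lead_fields = {k: v for k, v in rec.items() if k.startswith("lead_")}
# Fields describing the tag variant linked to that lead.
tag_fields = {k: v for k, v in rec.items() if k.startswith("tag_")}

print(rec["study_id"])
print(lead_fields)
print(tag_fields)
```

Note that the chromosome fields are strings while the position fields are integers, which matters when you compare values later.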

How do I query the v2d dataset?

To query the dataset and find the GWAS lead variant information for a given variant, you will need to write an R script that queries the dataset for the following fields: tag_chrom, tag_pos, tag_ref, and tag_alt.

For example, using variant 1_1313807_G_A, the R script would need to query for:

tag_chrom == "1"
tag_pos == 1313807
tag_ref == "G"
tag_alt == "A"
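As a rough illustration of that filter, here is a minimal sketch in Python; the logic translates directly to R. The function name `find_lead_records` and the two sample lines are made up for this example (the lines are abbreviated copies of v2d entries), so treat it as a starting point rather than a finished script.

```python
import json

def find_lead_records(lines, chrom, pos, ref, alt):
    """Yield v2d records whose tag variant matches chrom_pos_ref_alt.

    `lines` is any iterable of JSON-lines strings (e.g. an open file).
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        rec = json.loads(line)
        # tag_chrom is stored as a string, tag_pos as an integer.
        if (rec.get("tag_chrom") == str(chrom)
                and rec.get("tag_pos") == int(pos)
                and rec.get("tag_ref") == ref
                and rec.get("tag_alt") == alt):
            yield rec

# Two abbreviated sample records (most fields omitted for brevity).
sample_lines = [
    '{"study_id":"GCST007430","lead_chrom":"1","lead_pos":1384749,'
    '"lead_ref":"C","lead_alt":"G","tag_chrom":"1","tag_pos":1313807,'
    '"tag_ref":"G","tag_alt":"A","overall_r2":0.757951324816}',
    '{"study_id":"GCST90012110","lead_chrom":"1","lead_pos":1574655,'
    '"lead_ref":"GGC","lead_alt":"G","tag_chrom":"1","tag_pos":1537493,'
    '"tag_ref":"T","tag_alt":"A","overall_r2":0.868737707721}',
]

# Split the variant ID into its four components and run the filter.
hits = list(find_lead_records(sample_lines, *"1_1313807_G_A".split("_")))
print([h["study_id"] for h in hits])  # prints ['GCST007430']
```

For the real dataset you would pass an open file handle for each downloaded file instead of `sample_lines`, and loop over your own list of variants.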

The script would return 6 JSON objects, including:

{"study_id":"GCST007430","lead_chrom":"1","lead_pos":1384749,"lead_ref":"C","lead_alt":"G","direction":"-","beta":-0.0194,"beta_ci_lower":-0.025672,"beta_ci_upper":-0.013128,"pval_mantissa":1.98,"pval_exponent":-9,"pval":1.98E-9,"pmid":"PMID:30804560","pub_date":"2019-02-25","pub_journal":"Nat Genet","pub_title":"New genetic signals for lung function highlight pathways and chronic obstructive pulmonary disease associations across multiple ancestries.","pub_author":"Shrine N","trait_reported":"Peak expiratory flow","ancestry_initial":["European=321047"],"ancestry_replication":["European=24218"],"n_initial":321047,"n_replication":24218,"num_assoc_loci":265,"has_sumstats":true,"source":"GCST","trait_efos":["EFO_0009718"],"trait_category":"measurement","tag_chrom":"1","tag_pos":1313807,"tag_ref":"G","tag_alt":"A","overall_r2":0.757951324816,"AFR_1000G_prop":0.0,"AMR_1000G_prop":0.0,"EAS_1000G_prop":0.0,"EUR_1000G_prop":1.0,"SAS_1000G_prop":0.0}

Within each of the 6 JSON objects, you will see the published lead variant information in the lead_chrom, lead_pos, lead_ref, and lead_alt fields along with relevant study fields (e.g. study_id).

For example, using the above JSON object, the lead variant is 1_1384749_C_G and it was identified in the GWAS Catalog study GCST007430.
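If it helps, here is a small sketch in Python (again easy to port to R) of turning a matched record into the lead variant ID and a short summary. The helper names `lead_variant_id` and `summarise` are hypothetical, and the record is an abbreviated copy of the JSON object above. `overall_r2` is read with `.get()` on the assumption that the field may be absent or null in some records.

```python
def lead_variant_id(rec):
    """Build a chrom_pos_ref_alt ID from the lead_* fields of a v2d record."""
    return f"{rec['lead_chrom']}_{rec['lead_pos']}_{rec['lead_ref']}_{rec['lead_alt']}"

def summarise(rec):
    """Collect the lead variant ID together with the main study fields."""
    return {
        "lead_variant": lead_variant_id(rec),
        "study_id": rec["study_id"],
        "trait_reported": rec.get("trait_reported"),
        # Assumption: overall_r2 can be missing, so default to None.
        "overall_r2": rec.get("overall_r2"),
    }

# Abbreviated copy of the GCST007430 record shown above.
record = {
    "study_id": "GCST007430",
    "lead_chrom": "1", "lead_pos": 1384749, "lead_ref": "C", "lead_alt": "G",
    "trait_reported": "Peak expiratory flow",
    "tag_chrom": "1", "tag_pos": 1313807, "tag_ref": "G", "tag_alt": "A",
    "overall_r2": 0.757951324816,
}

summary = summarise(record)
print(summary["lead_variant"], summary["study_id"])  # prints 1_1384749_C_G GCST007430
```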

How can I access the v2d dataset?

The Open Targets Genetics v2d dataset is available in JSON format from our FTP server:

Alternatively, you can also access the dataset in our BigQuery open-targets-genetics instance and use SQL to access the relevant fields:

Find GWAS lead variant, study, and reported trait information for a given tag variant with BigQuery

I hope this helps – feel free to post any follow up questions below!


~ Andrew :slight_smile:


Thank you, Andrew!! This was really helpful and easy to implement. However, some overall_r2 entries contain missing values, and I was wondering how the connection between the lead signal and the tag variant was made in those cases.