R script for GraphQL query: query multiple rsID and get gene IDs and Annotation information

Hello,
I have a list of rsID from a GWAS dataset like this
https://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs001672/analyses/phs001672.pha004826.txt
I understand that GraphQL API is not working for multiple arguments. The script here is giving me for one rsID the gene ID and symbol and some other information.

query useSearchToConvertRSIDIntoIDFormat {
  search(queryString: "rs114818383") {
    variants {
      id
      rsId
      nearestGene {
        id
        start
        symbol
        tss
        description
        chromosome
        exons
      }
      nearestGeneDistance
    }
  }
}

I was wondering how can i modify this query to get as much as possible annotation information from open targets (enchancer, propoter annotation, disease etc) ?
How can i write this query in R for a list of rsIDs from a CSV file ?

1 Like

Hi Dimitris,

First, to write the same query in R, you can use the following script, then, for a list of rsIDs, you can loop the function through the list to obtain results for each rsID.

library(httr)

# Set rs ID variable
query_rsID <- "rs114818383"

# Build query string
query_string = "
query useSearchToConvertRSIDIntoIDFormat($query_rsID: String!) {
  search(queryString: $query_rsID) {
    variants {
      id
      rsId
      nearestGene {
        id
        start
        symbol
        tss
        description
        chromosome
        exons
      }
      nearestGeneDistance
    }
  }
} 
"

# Set base URL of GraphQL API endpoint
base_url <- "https://api.genetics.opentargets.org/graphql"

# Set variables object of arguments to be passed to endpoint
variables <- list("query_rsID" = query_rsID)

# Construct POST request body object with query string and variables
post_body <- list(query = query_string, variables = variables)

# Perform POST request
r <- POST(url=base_url, body=post_body, encode='json')

# Results:
content(r)

For additional annotation information, you can try querying the study table to see which diseases are associated with your variant, and the v2g table to obtain functional gene assignments based on those annotations.

1 Like

Thank you. I already created the script in r for all my rs_ids. What do you mean with “querying the top loci table”? Do you have the query in R for this and for v2g?
I am also trying to use BigQuery for a list of rs_ids from a txt file. if you have a query to do this in BigQuery i would appreciate to have the code and try it .

Great! Apologies, I meant the study table, the query for the v2g looks something like this:

# Set gene_id variable
variantId <- "1_154453788_C_T"

# Build query string
query_string = "
query v2g($variantId: String!) {
  genesForVariant(variantId: $variantId) {
    gene {
      id
    }
    variant
    overallScore
    functionalPredictions {
      typeId
      sourceId
      aggregatedScore
    }
    intervals {
      typeId
      sourceId
      aggregatedScore
    }
  }
}
"

# Set base URL of GraphQL API endpoint
base_url <- "https://api.genetics.opentargets.org/graphql"

# Set variables object of arguments to be passed to endpoint
variables <- list("variantId" = variantId)

# Construct POST request body object with query string and variables
post_body <- list(query = query_string, variables = variables)

# Perform POST request
r <- POST(url=base_url, body=post_body, encode='json')

# Results:
content(r)

If you want to try to do the same with bigquery, you can try joining your rs_ids to the variant table to obtain the chrom, pos, ref and alt for each rs_id, then concatenate them together (chrom_pos_ref_alt), and use this to join with the “studies” and the “variant_gene” table as you require.

1 Like

Thank you. I am trying to to it in Big Query but its a bit difficult to create the query since i dont have experience with that but i think its the best solution. If you have a code for that or similar from bigQuery would be extremely helpful for me !

Hi Dimitris,

If you have your SNP info loaded into the Bigquery, you can try something like this in the console to join it with the “variant_gene” (v2g) table, replace the YOUR_SNP_TABLE with your table!

WITH
  SNP_info AS (
  SELECT
    CONCAT(CAST(chrom AS string), CAST(pos AS string), CAST(ref AS string), CAST(alt AS string)) AS identifier
  FROM
`YOUR_SNP_TABLE`)
SELECT
  *
FROM
  SNP_info
JOIN (
  SELECT
    CONCAT(CAST(chr_id AS string), CAST(position AS string), CAST(ref_allele AS string), CAST(alt_allele AS string)) AS identifier,
*
 FROM
    `open-targets-genetics.210608.variant_gene` ) variants
ON
  SNP_info.identifier = variants.identifier

Best wishes,
Jack

1 Like

Thank you for your reply . I am very new in BigQuery thats why i am asking something so simple.
I have my SNP information loaded as a table in gwas-386212.gwas_dataset_1.SNP_ids with table name SNP_ids. the table has 4 columns with names rs_id , chr, position, ref and alt. I tried your code, at first i had message like this :

Access Denied: Table open-targets-genetics:210608.variant_gene: User does not have permission to query table open-targets-genetics:210608.variant_gene, or perhaps it does not exist in location US…

I replaced it with bigquery-public-data.open_targets_genetics.variant_gene and it works.

And if i want first to JOIN rs_ids from open targets data set with variants and locus2gene information to create a new file with all annotation information related to rs_ids in OpenTargets database and then JOIN it with my rs_ids table ?
I would like to have a table lets say with columns:
rs_id from my data, chr, position, ref allele, alt_allele, Pvalue, GeneID (from open targets annotation), consequence .
I tried a script similar to yours but its not fully working

WITH
  SNP_info AS (
  SELECT
    CONCAT(CAST(rs_id AS string)) AS identifier
  FROM
  `gwas-386212.gwas_dataset_1.SNP_ids`)
SELECT
  *
FROM
  SNP_info
JOIN (
  SELECT
    CONCAT(CAST(rs_id AS string)) AS identifier,
*
 FROM
    `bigquery-public-data.open_targets_genetics.variants`) variants
ON
  SNP_info.identifier = variants.identifier

Is any other file available that i should use to retrieve the maximun annotation for my SNPs list ?

Hi Dimitris, are you using bigquery through the online console? If so can you see the datasets available on the left side of the console?

Does the access issue become fixed if you replace the variant gene table name with : open-targets-genetics.genetics.variant_gene ?

Just a heads up, if your SNP table column names are " chr, position, ref, alt" , then they would need to be re-named in your query too, the CONCAT commands are concatenating/pasting these columns into one column for the join, as the v2g table does not have an rsID column:

WITH
  SNP_info AS (
  SELECT
    CONCAT(CAST(chr AS string), CAST(position AS string), CAST(ref AS string), CAST(alt AS string)) AS identifier
  FROM
  `gwas-386212.gwas_dataset_1.SNP_ids`)
SELECT
  *
FROM
  SNP_info
JOIN (
  SELECT
    CONCAT(CAST(chr_id AS string), CAST(position AS string), CAST(ref_allele AS string), CAST(alt_allele AS string)) AS identifier,
*
 FROM
    `open-targets-genetics.genetics.variant_gene`) variants
ON
  SNP_info.identifier = variants.identifier

Alternatively, you can try to bulk download the entire v2g table with ftp, and perform the join locally.

Best wishes,
Jack

1 Like

Yes the issue fixed when i replaced the name with bigquery-public-data.open_targets_genetics.variant_gene.
I followed what you suggested and i made two queries: One to join my rs_ids with variant-gene file and get the result with focus to genes and another query to join my rs_ids with variants file. Is the same probably but i wanted retrieve as much as possible annotation information. Idealy i would like to include in the resulted table the rs_ids and not only chr, position, ref, alt and the geneIDs

I was trying to join my rs_ids with locus2gene file but the result i got was “There is no data to display”. Does it make sense or this is an error ? Is is proper to JOIN locus2gene with my rs_ids file?

Furthermore, i tried to join variant_gene with locus2gene but it returns a huge table and the query doesn’t work, so i selected only few columns for that query.
Is anything else that i could do to get the maximum annotation for my input data ?

Thank you very much for the valuable help

Hi Dimitris,

Glad to hear your variant_gene join worked!

Difficult to troubleshoot your locus2gene query without seeing it, but it may be because rs_ID is not a column in the locus2gene table, you can try repeating the CONCAT command prior to the join, bear in mind that the columns you need in locus2gene have changed to chrom, pos, ref, alt .

In addition to the annotation information from these tables, you can also retrieve colocalisation results from the variant_disease_coloc table, this will show you if your SNPs are driving any associations behind other traits or QTLs.

Best wishes,
Jack

Hi i was thinking about colocalisation as well. So the query in that case will be similar to what we tried with variant to gene ? In case of variant_disease_coloc table there are left and right columns so how i will perform the query ? Do you have some code for that ?

In case of locus2gene here is the query :

WITH
  SNP_info AS (
  SELECT
    CONCAT(CAST(chr AS string), CAST(position AS string), CAST(ref AS string), CAST(alt AS string)) AS identifier
  FROM
  `gwas-386212.gwas_dataset_1.SNP_ids`)
SELECT
  *
FROM
  SNP_info
JOIN (
  SELECT
    CONCAT(CAST(chrom AS string), CAST(pos AS string), CAST(ref AS string), CAST(alt AS string)) AS identifier,
chrom AS chrom,
pos AS pos,
ref AS ref,
alt AS alt,
gene_id AS gene_id_locus,
study_id AS study_id,
 FROM
    `bigquery-public-data.open_targets_genetics.locus2gene`) locus
ON
  SNP_info.identifier = locus.identifier

Do you have an alternative for making locus2gene work with my input data?

Thank you very much

Hi Dimitris,

Regarding the colocalisation, yes, you can join your SNPs to the left study, this is because molecular trait QTL studies are always on the “right_study”. So, same as before, except using left_chrom, left_pos, left_ref, left_alt on the variant_disease_coloc table.

The l2g query looks fine. However, the chrom, pos, ref, alt columns in l2g refers to the lead SNP at a given locus, so if your SNPs were not lead SNPs from studies already incorporated in OpenTargets, they are unlikely to match up exactly with the locus2gene table.

One approach you could try is to join your SNPs to the variant_disease_credset table on the tag_chrom, tag_pos, tag_ref, tag_alt columns (CONCAT as before). From this resulting join, you can match your SNPs to their corresponding lead_chrom, lead_pos, lead_ref, lead_alt columns (if they exist). Then you can try to join the lead_chrom, lead_pos, lead_ref, lead_alt columns back to the locus2gene table as you did before.

Best wishes,
Jack

1 Like

Great i tried your approach and it seems to work fine. after that i used a script in r to intersect with my original file and attach the rs_ids. In case of the particular data there are not many genes from locus2gene. I was wondering what is the main difference between variant_gene and locus2gene files? why they contain different genes for the same locus ?

Hi Dimitris,

Great question! I should have mentioned that the locus2gene (l2g) and v2g are very different approaches: v2g is a variant centric method that only considers the functional evidence for each individual SNP, whereas l2g aggregates the evidence from all SNPs within a given locus, and the evidence is weighted based on the fine-mapping posterior probability for each SNP (documented here).

If you check the postprob column from the variant-disease-credset table for your SNPs, it can give you an indication of how much your SNPs are contributing to the gene assignment at its locus, in the event that postprob is low, it is likely that other SNPs in the locus have higher posterior probabilities, therefore the gene assignment is based on a different set of evidence.

Best wishes,
Jack

1 Like

Hi great thank you for the detailed answer. So i suppose that this is the reason we first JOIN our chrom, pos, ref, alt with the lead_chrom, lead_pos, lead_ref, lead_alt. for a number of variants we select just the lead to run in the l2g. In general i would expect to see in l2g file a locus. Which means chr, start, end position. But by reading about l2g pipeline I understand that lead_chrom, lead_pos, lead_ref, lead_alt is the result of a procedure which starts from a GWAS locus and ends with stable lead position.
Is it possible to run a query and get the annotation results we are discussing, if we have as input instead of rs_ids or position a locus with chr, start, end coordinates? Do you have a code example for something like that in Big Query ?

Hi Dimitris, I imagine there must be a way to query tables using intervals in bigquery, but I’m not sure what the syntax is for that! I guess the conditions for the join would be:

  1. Chromosomes are the same.

  2. Position is greater than start.

  3. Position is less than end.

Alternatively, if you would like to get the annotations for all SNPs within a locus of interest, you can use the lead_SNP to get all the tag SNPs for that locus from the variant-disease-credset table, and re-join to the other annotation tables of interest.

Best wishes,
Jack

1 Like

Thank you Jack for the valuable help. I will try to investigate it a bit more especially the l2g, and how to create a query with input a locus with start end .

One more question on the topic. What is the main difference between variants and v2g files?
In variants there are geneIDs retrieved and most_severe_consequence and gene distance. Is it possible to retrieve for those genes also the biotype (for example if its enhancer )?
How those genes are related to the variants ? Is it just the distance from gene or there is another reason of associating each variant to each gene in variant file ?