Get ClinVar + GWAS variant annotation by chromosomal location

Recently, a Twitter user was wondering how to get “interesting” variants by chromosomal location. The user specifically mentioned GWAS and ClinVar as data sources of interest.

The Open Targets Platform contains several sources of potentially “interesting” variants in GRCh38 chromosomal coordinates:

  • OT Genetics Portal - GWAS-significant loci above a certain threshold of gene causality provided by Open Targets Genetics. At the moment, the GWAS studies are obtained from the GWAS Catalog and UK Biobank.
  • PheWAS Catalog variants.
  • ClinVar germline (eva) and somatic (eva_somatic) variants, processed by the European Variation Archive (EVA), at all levels of pathogenicity.

As part of the integration process, all phenotypes are standardised to the same disease/phenotype ontology (EFO) and all variants to the same identifier format (CHR_LOC_REF_ALT). The remaining question is how to get the variants of interest using their chromosomal locations.
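
To make the identifier format concrete, here is a minimal sketch in plain Python that splits a CHR_LOC_REF_ALT identifier into its four parts. The names Variant and parse_variant_id are hypothetical helpers introduced only for illustration; the example identifier is one that appears in the evidence output later in this post.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Hypothetical container for the four parts of a variant identifier."""
    chrom: str
    position: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split a CHR_LOC_REF_ALT identifier into typed fields."""
    chrom, position, ref, alt = variant_id.split("_")
    # the position is the only numeric part; the chromosome stays a string
    # so that values such as "X" are handled as well
    return Variant(chrom, int(position), ref, alt)


variant = parse_variant_id("1_173757077_G_A")
print(variant.chrom, variant.position, variant.ref, variant.alt)
```

Keeping the chromosome as a string and the position as an integer mirrors what a location query needs: an equality check on the chromosome and a numeric comparison on the position.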

Step 1: Download the data

All the required information can be downloaded from the downloads page. The following datasets are required:

  • Target
  • Target-Disease evidence

I use rsync to download both datasets in parquet format from version 21.04, but alternative tools and formats are also available.

# target dataset
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.04/output/etl/parquet/targets .
# evidence dataset
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.04/output/etl/parquet/evidence .

Step 2: Read and query the datasets

The following simple script retrieves all variants on chromosome 1 in the range 173600000-173800000. The script selects a few fields:

  • variantId in chr_position_ref_alt format
  • variantRsId when available
  • studyId containing the ClinVar RCV or GWAS study identifier
  • datasourceId specifying the origin of the evidence
  • diseaseId and diseaseFromSource providing the mapped ontology term and the original trait/phenotype, respectively
  • targetId providing the likely causal gene
  • cs (clinical significances) specifying the clinical significance of the ClinVar variants
  • approvedSymbol providing the gene symbol for targetId, joined from the target dataset

More fields of potential interest can be explored by querying the evidence dataset. The available fields are listed in its schema, which can be printed with evd.printSchema().

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

targetPath = "<localpath>/targets"
evidencePath = "<localpath>/evidence"
target = spark.read.parquet(targetPath)
evd = spark.read.parquet(evidencePath)

# keep only evidence that carries a variant, then derive chromosome and
# position from the variantId (chr_position_ref_alt)
variantEvidencePositioned = (
    evd
    .filter(F.col("variantId").isNotNull())
    .withColumn("chr", F.split("variantId", "_").getItem(0))
    .withColumn("position", F.split("variantId", "_").getItem(1).cast("long"))
)

out = (
    variantEvidencePositioned
    # variants on chromosome 1 inside the region of interest
    .filter((F.col("chr") == "1") &
            (F.col("position") > 173600000) &
            (F.col("position") < 173800000))
    .select("variantId",
            "variantRsId",
            "datasourceId",
            "diseaseId",
            "diseaseFromSource",
            "studyId",
            "targetId",
            # one row per clinical significance; null when none is reported
            F.explode_outer("clinicalSignificances").alias("cs"))
    # add the gene symbol from the target dataset
    .join(target.selectExpr("id AS targetId", "approvedSymbol"),
          how="left_outer", on="targetId"))

The first 3 records, shown in vertical format:

>>> out.show(3, vertical = True, truncate = False)
-RECORD 0--------------------------------------
 targetId          | ENSG00000076321           
 variantId         | 1_173757077_G_A           
 variantRsId       | null                      
 datasourceId      | eva                       
 diseaseId         | HP_0001249                
 diseaseFromSource | Intellectual disability   
 studyId           | RCV001078229              
 cs                | likely pathogenic         
 approvedSymbol    | KLHL20                    
-RECORD 1--------------------------------------
 targetId          | ENSG00000076321           
 variantId         | 1_173757077_G_A           
 variantRsId       | null                      
 datasourceId      | eva                       
 diseaseId         | EFO_0003847               
 diseaseFromSource | Intellectual disability   
 studyId           | RCV001078229              
 cs                | likely pathogenic         
 approvedSymbol    | KLHL20                    
-RECORD 2--------------------------------------
 targetId          | ENSG00000183831           
 variantId         | 1_173656571_T_C           
 variantRsId       | rs7365380                 
 datasourceId      | ot_genetics_portal        
 diseaseId         | EFO_0004337               
 diseaseFromSource | General cognitive ability 
 studyId           | GCST006269                
 cs                | null                      
 approvedSymbol    | ANKRD45                   
only showing top 3 rows

This example illustrates the value of integrating multiple sources, since variants in the same region have been linked to similar traits (Intellectual disability or General cognitive ability) by 2 resources: GWAS Catalog and ClinVar.
