Recently, a Twitter user was wondering how to get “interesting” variants by chromosomal location. The user specifically mentioned GWAS and ClinVar as data sources of interest.
The Open Targets Platform contains several sources of potentially “interesting” variants in GRCh38 chromosomal coordinates:
- OT Genetics Portal - GWAS-significant loci above a certain threshold of gene causality provided by Open Targets Genetics. At the moment, the GWAS studies are obtained from the GWAS Catalog and UK Biobank.
- PheWAS Catalog variants.
- ClinVar germline (eva) and somatic (eva_somatic) variants (processed by the European Variation Archive - EVA) at all levels of pathogenicity.
As part of the integration process, all phenotypes are standardised to the same disease/phenotype ontology (EFO) and all variants to the same identifier format (CHR_LOC_REF_ALT). The remaining question is how to get the variants of interest using their chromosomal locations.
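To make the identifier format concrete, here is a minimal plain-Python sketch, independent of the Platform tooling, that splits a CHR_LOC_REF_ALT identifier into its parts and checks whether it falls inside a region. The names Variant, parse_variant_id and in_region are hypothetical helpers introduced only for illustration. The sketch assumes identifiers always have exactly four underscore-separated fields, and the exclusive region bounds are a choice of the sketch.

```python
from typing import NamedTuple


class Variant(NamedTuple):
    """Parts of a CHR_LOC_REF_ALT identifier (hypothetical helper type)."""
    chrom: str
    position: int
    ref: str
    alt: str


def parse_variant_id(variant_id: str) -> Variant:
    """Split an identifier such as '1_173757077_G_A' into its four fields."""
    parts = variant_id.split("_")
    if len(parts) != 4:
        # Assumption: anything without exactly four fields is malformed.
        raise ValueError(f"unexpected variant identifier: {variant_id!r}")
    chrom, position, ref, alt = parts
    return Variant(chrom, int(position), ref, alt)


def in_region(variant: Variant, chrom: str, start: int, end: int) -> bool:
    """True when the variant lies strictly between start and end on chrom."""
    return variant.chrom == chrom and start < variant.position < end


variant = parse_variant_id("1_173757077_G_A")
print(variant)
print(in_region(variant, "1", 173_600_000, 173_800_000))
```

The chromosome is kept as a string because identifiers for chromosomes such as X or Y would not convert to an integer, while the position is converted so that it can be compared numerically.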
Step 1: Download the data
All the required information can be downloaded from the downloads page. The following datasets are required:
- Target
- Target-Disease evidence
I use rsync to download both datasets in parquet format from version 21.04, but alternative tools and formats are also available.
# target dataset
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.04/output/etl/parquet/targets .
# evidence dataset
rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/21.04/output/etl/parquet/evidence .
Step 2: Read and query the datasets
The following simple script retrieves all variants on chromosome 1 in the range 173600000-173800000. The script selects a few fields:
- variantId in chr_position_ref_alt format
- variantRsId when available
- studyId containing the ClinVar RCV or GWAS study identifier
- datasourceId specifying the origin of the evidence
- diseaseId and diseaseFromSource providing the mapped ontology term and the original trait/phenotype, respectively
- targetId providing the likely causal gene
- cs (clinical significances) specifying the clinical significance of the ClinVar variants
More fields of potential interest can be found by inspecting the schema of the evidence dataset with evd.printSchema().
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

# paths to the downloaded datasets
targetPath = "<localpath>/targets"
evidencePath = "<localpath>/evidence"

target = spark.read.parquet(targetPath)
evd = spark.read.parquet(evidencePath)

# keep evidence that has a variant and split the identifier
# into chromosome and position
variantEvidencePositioned = (
    evd
    .filter(F.col("variantId").isNotNull())
    .withColumn("chr", F.split("variantId", "_").getItem(0))
    .withColumn("position", F.split("variantId", "_").getItem(1))
)

# select the variants inside the region (boundaries excluded),
# one row per clinical significance, and add the gene symbol
out = (
    variantEvidencePositioned
    .filter((F.col("chr") == "1") &
            (F.col("position").cast("long") > 173600000) &
            (F.col("position").cast("long") < 173800000))
    .select("variantId",
            "variantRsId",
            "datasourceId",
            "diseaseId",
            "diseaseFromSource",
            "studyId",
            "targetId",
            F.explode_outer("clinicalSignificances").alias("cs"))
    .join(target.selectExpr("id AS targetId", "approvedSymbol"),
          how="left_outer", on="targetId")
)
The first 3 records of the result, shown in vertical format:
>>> out.show(3, vertical = True, truncate = False)
-RECORD 0--------------------------------------
targetId | ENSG00000076321
variantId | 1_173757077_G_A
variantRsId | null
datasourceId | eva
diseaseId | HP_0001249
diseaseFromSource | Intellectual disability
studyId | RCV001078229
cs | likely pathogenic
approvedSymbol | KLHL20
-RECORD 1--------------------------------------
targetId | ENSG00000076321
variantId | 1_173757077_G_A
variantRsId | null
datasourceId | eva
diseaseId | EFO_0003847
diseaseFromSource | Intellectual disability
studyId | RCV001078229
cs | likely pathogenic
approvedSymbol | KLHL20
-RECORD 2--------------------------------------
targetId | ENSG00000183831
variantId | 1_173656571_T_C
variantRsId | rs7365380
datasourceId | ot_genetics_portal
diseaseId | EFO_0004337
diseaseFromSource | General cognitive ability
studyId | GCST006269
cs | null
approvedSymbol | ANKRD45
only showing top 3 rows
This example illustrates the value of integrating multiple sources, since variants in the same region have been linked to similar traits (Intellectual disability or General cognitive ability) by 2 resources: GWAS Catalog and ClinVar.
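One way to surface this kind of overlap programmatically is to group the traits in a region by the data source that reports them. The plain-Python sketch below uses only the three records shown above, copied by hand as dictionaries and reduced to a few fields. With the real data, a similar grouping could be applied to the full query result. The variable names are illustrative.

```python
from collections import defaultdict

# The three records from the output above, reduced to the fields used here.
records = [
    {"variantId": "1_173757077_G_A", "datasourceId": "eva",
     "diseaseFromSource": "Intellectual disability",
     "approvedSymbol": "KLHL20"},
    {"variantId": "1_173757077_G_A", "datasourceId": "eva",
     "diseaseFromSource": "Intellectual disability",
     "approvedSymbol": "KLHL20"},
    {"variantId": "1_173656571_T_C", "datasourceId": "ot_genetics_portal",
     "diseaseFromSource": "General cognitive ability",
     "approvedSymbol": "ANKRD45"},
]

# Collect the distinct traits reported by each data source in the region.
traits_by_source = defaultdict(set)
for record in records:
    traits_by_source[record["datasourceId"]].add(record["diseaseFromSource"])

for source, traits in sorted(traits_by_source.items()):
    print(source, sorted(traits))
```

Using a set per data source collapses the two ClinVar rows, which differ only in the mapped ontology term, into a single trait entry.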