How to batch-access information for a list of targets in the Open Targets Platform is a recurrent question. Here, I provide an example of how to access all target-disease evidence for a set of IFN-gamma signalling related proteins. I will then reduce the evidence to focus on all the coding or non-coding variants clinically associated with the gene list of interest. At the moment, this includes evidence from the Open Targets Genetics Portal, the PheWAS Catalog, ClinVar (EVA) and UniProt (HumSaVar). I used R and sparklyr, but a Python implementation would be very similar. The Platform documentation and the Community space contain similar examples.
Spark connection and loading datasets
To interact most efficiently with the datasets in parquet format, I will use sparklyr in R. It is not the only way to access the information, as JSON files are also available.
library(dplyr)
library(sparklyr)
library(sparklyr.nested)
Option 1: Work with local datasets
The easiest way to interact with the datasets is to download the latest Platform release parquet data directly onto your computer. Datasets vary in size and they are all partitioned into multiple chunks to optimise I/O. You can go to the documentation to find alternative ways of downloading the information.
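Once downloaded, a quick way to confirm that a dataset is in place is to list its partitioned chunks from R. The path below is just the example location used later in this post; adjust it to wherever you saved the data.
## List the parquet chunks of a downloaded dataset (example path)
list.files("~/datasets/evidence", pattern = "\\.parquet$") %>%
  head()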
The first step is to create a connection to your local Spark instance.
sc <- spark_connect(master = "local")
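If no local Spark installation is available yet, sparklyr can download and install one before connecting; the version below is just an assumption, so pick whichever is compatible with your environment.
## Install a local Spark distribution if needed (version is an assumption)
spark_install(version = "3.1")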
For the purpose of this exercise I will use 3 datasets: target, disease and evidence. See their respective documentation pages for more information. To load a dataset, you just need to provide the top-level directory containing all of its chunks (e.g. ~/datasets/evidence); Spark will then read all the partitioned chunks included in that directory.
evidencePath <- "<local-path-evidence>"
targetPath <- "<local-path-target>"
diseasePath <- "<local-path-disease>"
## read datasets
evd <- spark_read_parquet(sc, path = evidencePath)
target <- spark_read_parquet(sc, path = targetPath)
disease <- spark_read_parquet(sc, path = diseasePath)
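As a quick sanity check that the datasets loaded correctly, you can count their rows; the exact numbers will depend on the release you downloaded.
## Quick sanity check on the loaded datasets
sdf_nrow(evd)
sdf_nrow(target)
sdf_nrow(disease)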
Option 2: Directly in Google Cloud Platform (Advanced)
Alternatively, you can set up a Spark configuration to read directly from Google Cloud Platform. This is a more complex setup, as it requires running in a Dataproc cluster that accesses the datasets directly. It will probably be more useful for power users of the Open Targets Platform.
conf <- spark_config()
conf$`spark.hadoop.fs.gs.requester.pays.mode` <- "AUTO"
conf$`spark.hadoop.fs.gs.requester.pays.project.id` <- "open-targets-eu-dev"
sc <- spark_connect(master = "yarn", config = conf)
## Re-using existing Spark connection to yarn
With a slightly modified Spark configuration you can directly access the Google Cloud datasets. Here is an example of how to tap into the 21.11 data.
evidencePath <- "gs://open-targets-data-releases/21.11/output/etl/parquet/evidence"
targetPath <- "gs://open-targets-data-releases/21.11/output/etl/parquet/targets"
diseasePath <- "gs://open-targets-data-releases/21.11/output/etl/parquet/diseases"
## read datasets
evd <- spark_read_parquet(sc, path = evidencePath)
target <- spark_read_parquet(sc, path = targetPath)
disease <- spark_read_parquet(sc, path = diseasePath)
Browse the schema to find fields of interest
The best way to understand the information available in the datasets is to browse their schemas. Below, you can see the columns available in the evidence dataset. Evidence is a composite of objects from multiple datasources, so the schema is a compromise between all of them. This means that if you are only interested in a subset of the datasources, you might find many of the columns in the dataset useless.
## Browse the evidence schema
columns <- evd %>%
  sdf_schema() %>%
  lapply(function(x) do.call(tibble, x)) %>%
  bind_rows()
name | type |
---|---|
datasourceId | StringType |
targetId | StringType |
alleleOrigins | ArrayType(StringType,true) |
allelicRequirements | ArrayType(StringType,true) |
beta | DoubleType |
betaConfidenceIntervalLower | DoubleType |
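Because the schema is wide, filtering the columns tibble is a quick way to locate fields of interest. For example, the following (not part of the original script) lists the columns whose names mention variants:
## Example: find columns whose name mentions "variant"
columns %>%
  filter(grepl("variant", name, ignore.case = TRUE))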
Batch-searching datasets information (data joining)
In this exercise, I want to look at information related to a list of IFN-gamma signalling related genes/proteins. I will take the list of approved symbols of interest and load it into Spark.
IFNgsignalling <- data.frame(
  approvedSymbol = c(
    "JAK1",
    "JAK2",
    "IFNGR1",
    "IFNGR2",
    "STAT1",
    "B2M",
    "IRF1",
    "SOCS1"
  )
)
IFNgsignalling <- copy_to(sc, IFNgsignalling, "IFNgsignalling")
The next step is to join together the multiple pieces of information that I have. I will sequentially:
- Subset the target dataset, which contains 60k+ genes, to the genes of interest. I will only keep the approved symbol and the Ensembl ID, but there is a large amount of gene metadata in that dataset.
- Join all the evidence related to the targets. This is a one-to-many relationship, as there can be multiple pieces of evidence for the same gene.
- Use the disease dataset to resolve the disease names. In the evidence dataset, diseases are only represented by their EFO IDs, so I make use of this dataset to obtain their names. There is also rich information about each disease that could be leveraged for more complex hypotheses.
IFNgEvd <- target %>%
  inner_join(IFNgsignalling, by = "approvedSymbol") %>%
  select(targetId = id, approvedSymbol) %>%
  # join target-disease evidence
  inner_join(evd, by = "targetId") %>%
  # join disease names
  inner_join(
    disease %>%
      select(
        diseaseId = id,
        diseaseName = name
      ),
    by = c("diseaseId")
  ) %>%
  sdf_persist()
The resulting IFNgEvd
object contains all the Platform target-disease
evidence for the targets of interest.
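As a quick, illustrative overview (not part of the original script), you can tally how much evidence was retrieved per target and datasource:
## Overview of the evidence retrieved per target and datasource
IFNgEvd %>%
  count(approvedSymbol, datasourceId) %>%
  arrange(desc(n))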
Extract variant-related information
For the purpose of this exercise, I want to filter the dataset to restrict it to the evidence that contains variant information. This reduces the set to the few datasources that report causal variants to the Open Targets Platform, which in turn means that a number of columns become irrelevant to the reduced dataset.
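Before filtering, a quick check like the one below (illustrative, not part of the original script) shows which datasources actually contribute variant-level evidence for this gene list:
## Which datasources report variant-level evidence for these targets?
IFNgEvd %>%
  filter(!is.na(variantRsId)) %>%
  distinct(datasourceId)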
Next, I filter the information down to the data of interest and provide an example of how to post-process columns that contain nested information. This is the case, for example, for the ClinVar clinicalSignificances field, which I will explode into multiple rows for this use case. Finally, I order the dataset by the Platform score to prioritise the variants with a more likely causal link to the disease.
# function to drop columns that contain only null values
dropAllNullColumns <- function(sparkdf) {
  # count the non-null values in every column
  null_counts <- sparkdf %>%
    summarise_all(~ sum(as.numeric(!is.na(.)), na.rm = TRUE)) %>%
    collect()
  # keep the names of the columns with at least one non-null value
  notNullColumns <- null_counts %>%
    select_if(~ (. != 0)) %>%
    colnames()
  # return the data restricted to those columns
  out <- sparkdf %>%
    select(one_of(notNullColumns))
  return(out)
}
IFNgVariantInfo <- IFNgEvd %>%
  # keep only variant-based information
  filter(!is.na(variantRsId)) %>%
  # drop all columns without information
  dropAllNullColumns() %>%
  # explode nested field
  sdf_explode(clinicalSignificances) %>%
  # sort by score
  arrange(desc(score))
In some cases, data might need to be moved to the R workspace (outside
Spark) for further analysis.
df <- IFNgVariantInfo %>%
  collect()
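Once collected, the usual dplyr verbs work on the local tibble. For example, a quick tally (illustrative only) of the records per datasource and clinical significance:
## Example of local post-processing on the collected tibble
df %>%
  count(datasourceId, clinicalSignificances, sort = TRUE)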
Careful interpretation of the data is always required. For example, this table contains variants that have been catalogued as benign in ClinVar, so no truly causal link is reported for them. Similarly, Open Targets Genetics Portal evidence with very low L2G scores might have a very weak causal link with the gene. Some of these records might result in false-positive assignments.
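If you wanted to exclude the benign ClinVar assignments before any downstream analysis, a filter along these lines could be applied. The significance labels used here are assumptions, so check the actual values in your data (e.g. with distinct(clinicalSignificances)) first.
## Hypothetical filter: drop records flagged as benign (labels are assumptions)
df %>%
  filter(
    is.na(clinicalSignificances) |
      !clinicalSignificances %in% c("benign", "likely benign")
  )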
The full script is available here: