Objective
I recently received the following request: “I would like to know what are the cancer driver genes that haven’t been explored in the clinic as drug targets and whether there are any indications that they could eventually become valid drug targets.” While the question looks simple, answering it requires a number of considerations and several datasets that need to be combined. This post does not aim to provide a comprehensive response. Instead, it illustrates how to merge multiple Open Targets datasets to answer a more complex query.
Data
This example uses data from the 21.04 Open Targets Platform release, in particular the targets, diseases and evidence datasets.
To download the datasets in parquet format, I use the gsutil tool. Alternative strategies are described in the Data Downloads documentation section.
gsutil -m cp -r \
    gs://open-targets-data-releases/21.04/output/etl/parquet/diseases/ \
    gs://open-targets-data-releases/21.04/output/etl/parquet/targets/ \
    gs://open-targets-data-releases/21.04/output/etl/parquet/evidence/ \
    ~/Datasets/
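Before moving on, it can be worth confirming that the three directories were copied completely. The following is a minimal sketch and not part of the original workflow: the helper name count_parquet_files is hypothetical, and the ~/Datasets location is simply the destination used in the command above, so adapt both to your setup.

```python
from pathlib import Path


def count_parquet_files(base_dir, datasets=("diseases", "targets", "evidence")):
    """Return {dataset: number of .parquet files found under base_dir/dataset}."""
    base = Path(base_dir).expanduser()
    counts = {}
    for name in datasets:
        folder = base / name
        # rglob also covers partitioned layouts with nested sub-directories
        counts[name] = sum(1 for _ in folder.rglob("*.parquet")) if folder.is_dir() else 0
    return counts


if __name__ == "__main__":
    # assumed download location, taken from the gsutil command
    for name, n_files in count_parquet_files("~/Datasets").items():
        status = "ok" if n_files else "missing or empty"
        print(f"{name}: {n_files} parquet files ({status})")
```

A dataset reported as missing or empty usually means the copy was interrupted or the destination path differs from the one assumed here.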
PySpark to the rescue
To read and process the relevant information, I’m using PySpark, which is widely available and easy to install using your preferred package manager. Alternative strategies can also be used. We will first establish the Spark connection.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)
Reading the datasets
The downloaded datasets are directories containing all the partitioned parquet files. Reading them in PySpark is as easy as providing the path to the directory containing all the files.
# read the target, disease and evidence datasets
target = spark.read.parquet("<path>/targets")
disease = spark.read.parquet("<path>/diseases")
evd = spark.read.parquet("<path>/evidence")
Preparing some data
There are a few datasets that need to be prepared in order to make our final query.
Cancer-indicated drugs
We want to know all the cancer-indicated drugs and their targets based on the known mechanism of action. This question requires information from 2 different datasets:
- We want to know all the diseases that belong to the cell proliferation disorder (MONDO_0045024) therapeutic area.
- We want to subset the Open Targets Platform target-disease evidence that is provided by chembl and contains the target-drug-indication triplet for clinical candidates or approved drugs.
By joining these 2 datasets, we will be able to answer this question. We will also get some aggregations by target, including the number of distinct cancer-indicated drugs (cancerIndicatedDrugs) and the maximum clinical phase across the target's cancer-indicated drugs (maxCancerPhase).
# get all diseases in the cell proliferation disorder TA
cancer_diseases = (
    disease
    .select(F.col("id").alias("diseaseId"),
            F.explode("therapeuticAreas").alias("TA"))
    .filter(F.col("TA") == "MONDO_0045024")
)

# Cancer indicated drugs by target and maximum clinical trial phase
chemblByTarget = (
    evd
    .filter(F.col("sourceId") == "chembl")
    .join(cancer_diseases, on="diseaseId", how="inner")
    .groupBy("targetId")
    .agg(F.countDistinct(F.col("drugId")).alias("cancerIndicatedDrugs"),
         F.max(F.col("clinicalPhase")).alias("maxCancerPhase"))
)
Target role in cancer
We are going to use 2 different sources to nominate targets that are believed to be involved in cancer biology.
- Manually curated Cancer Gene Census role in cancer. Genes are proposed as oncogenes and/or tumor suppressor genes (TSG) and the Open Targets Platform presents this data at the target level.
- IntOGen cancer-driver gene predictions based on large-scale sequencing project data. This information is also available as Open Targets Platform evidence.
# get the role in cancer (Cancer Gene Census) for all targets
roleInCancer = (
    target
    .select(F.col("id").alias("targetId"), "hallmarks.*")
    .select("targetId", F.explode("attributes").alias("attributes"))
    .select("targetId", "attributes.attribute_name", "attributes.description")
    .filter(F.col("attribute_name") == "role in cancer")
    .groupBy("targetId")
    .agg(F.concat_ws(", ", F.collect_set(F.col("description")))
         .alias("roleInCancer"))
    .distinct()
)

# number of methods suggesting a cancer driver by IntOGen
intogen = (
    evd
    .filter(F.col("sourceId") == "intogen")
    .select("targetId", F.explode("significantDriverMethods").alias("methods"))
    .groupBy("targetId")
    .agg(F.size(F.collect_set(F.col("methods"))).alias("intogenDriverMethods"))
)
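To make the IntOGen aggregation concrete: each evidence row carries a list of methods, and the query pools these lists per target and counts the distinct method names. The plain-Python sketch below mirrors that logic on invented toy rows. The function name, the target identifiers and the method labels are illustrative assumptions, not Platform data.

```python
from collections import defaultdict


def count_distinct_methods(evidence_rows):
    """Mirror explode + collect_set + size: distinct methods per target."""
    methods_by_target = defaultdict(set)
    for row in evidence_rows:
        # a null or empty list contributes nothing, as with F.explode
        for method in row.get("significantDriverMethods") or []:
            methods_by_target[row["targetId"]].add(method)
    return {target: len(methods) for target, methods in methods_by_target.items()}


# toy evidence rows (invented for illustration)
toy_evidence = [
    {"targetId": "T1", "significantDriverMethods": ["methodA", "methodB"]},
    {"targetId": "T1", "significantDriverMethods": ["methodB", "methodC"]},
    {"targetId": "T2", "significantDriverMethods": ["methodA"]},
    {"targetId": "T3", "significantDriverMethods": None},
]

print(count_distinct_methods(toy_evidence))
# {'T1': 3, 'T2': 1}
```

A target whose rows only hold empty method lists does not appear in the result, which is one reason a later left join on targetId can leave intogenDriverMethods as null.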
Target tractability
Finally, we would like to know if these targets are believed to be tractable based on the Open Targets tractability assessments for small molecule and/or antibody.
# Target tractability assessment for antibody (ab) and small molecule (sm) by target
tractability = (
    target
    .select(
        F.col("id").alias("targetId"),
        F.col("tractability.antibody.categories.predicted_tractable_high_confidence")
        .alias("ab_tractable_highconf"),
        F.col("tractability.antibody.categories.predicted_tractable_med_low_confidence")
        .alias("ab_tractable_medlowconf"),
        F.col("tractability.smallmolecule.categories.discovery_precedence")
        .alias("sm_tractable_discovery_precedence"),
        F.col("tractability.smallmolecule.categories.predicted_tractable")
        .alias("sm_tractable_predicted")))
Some magic to join all prepared data
Finally, we will join all the information that we have been preparing. We will also take the opportunity to include some other information that is currently available in the target object, such as target safety concerns or any available chemical probes targeting the protein.
# resulting dataset with all data integrated
df = (
    evd
    .filter(F.col("sourceId").isin(["cancer_gene_census", "intogen"]))
    .select("targetId")
    .distinct()
    # add role in cancer from CGC
    .join(roleInCancer, on="targetId", how="left")
    # intogen driver methods
    .join(intogen, on="targetId", how="left")
    # chemblByTarget
    .join(chemblByTarget, on="targetId", how="left")
    # chemicalProbes
    .join(target
          .select(F.col("id").alias("targetId"), "chemicalProbes")
          .withColumn("chemicalProbes", F.col("chemicalProbes").isNotNull()),
          on="targetId",
          how="left")
    # add safety data
    .join(target
          .select(F.col("id").alias("targetId"), "safety")
          .withColumn("safety", F.col("safety").isNotNull())
          .withColumnRenamed("safety", "safety_data"),
          on="targetId",
          how="left")
    # add tractability
    .join(tractability, on="targetId", how="left")
    # adding labels
    .join(target.select(F.col("id").alias("targetId"), "approvedSymbol"),
          on="targetId", how="left")
    .sort(F.col("intogenDriverMethods").desc())
)
Exploring the results…
The final dataset (df) contains 886 potential driver genes with all the available metadata in a dataframe format. Below is a sample of 3 genes out of the full dataset, displayed vertically.
>>> df.show(3, vertical = True)
-RECORD 0--------------------------------------------
targetId | ENSG00000146648
roleInCancer | oncogene
intogenDriverMethods | 8
cancerIndicatedDrugs | 52
maxCancerPhase | 4
chemicalProbes | true
safety_data | true
ab_tractable_highconf | 1.0
ab_tractable_medlowconf | 0.75
sm_tractable_discovery_precedence | 1.0
sm_tractable_predicted | 1.0
approvedSymbol | EGFR
-RECORD 1--------------------------------------------
targetId | ENSG00000149311
roleInCancer | TSG
intogenDriverMethods | 8
cancerIndicatedDrugs | null
maxCancerPhase | null
chemicalProbes | true
safety_data | false
ab_tractable_highconf | null
ab_tractable_medlowconf | null
sm_tractable_discovery_precedence | 0.3
sm_tractable_predicted | 0.3
approvedSymbol | ATM
-RECORD 2--------------------------------------------
targetId | ENSG00000122025
roleInCancer | oncogene
intogenDriverMethods | 8
cancerIndicatedDrugs | 33
maxCancerPhase | 4
chemicalProbes | true
safety_data | false
ab_tractable_highconf | 0.3
ab_tractable_medlowconf | 0.25
sm_tractable_discovery_precedence | 1.0
sm_tractable_predicted | 0.6
approvedSymbol | FLT3
only showing top 3 rows
The results can be exported to a file or converted to pandas (df.toPandas()) for further analysis.
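Coming back to the original request, a simple first pass is to keep the driver genes that have no cancer-indicated drug yet and still show some tractability signal or a chemical probe. The sketch below works on plain dictionaries, such as those obtained with [row.asDict() for row in df.collect()]. The function name, the score threshold and the toy rows (two modelled on the sample output above, one fully hypothetical) are assumptions for illustration, not a recommended prioritisation.

```python
TRACTABILITY_COLUMNS = (
    "ab_tractable_highconf",
    "ab_tractable_medlowconf",
    "sm_tractable_discovery_precedence",
    "sm_tractable_predicted",
)


def unexplored_candidates(rows, min_score=0.0):
    """Keep driver genes without cancer-indicated drugs but with supporting signals."""
    selected = []
    for row in rows:
        if row.get("cancerIndicatedDrugs"):
            continue  # already explored in the clinic for a cancer indication
        scores = [row.get(col) for col in TRACTABILITY_COLUMNS]
        tractable = any(s is not None and s > min_score for s in scores)
        if tractable or row.get("chemicalProbes"):
            selected.append(row["approvedSymbol"])
    return selected


# toy rows: EGFR and ATM reuse values from the sample output,
# GENE_X is a hypothetical gene without any supporting signal
rows = [
    {"approvedSymbol": "EGFR", "cancerIndicatedDrugs": 52,
     "chemicalProbes": True, "sm_tractable_predicted": 1.0},
    {"approvedSymbol": "ATM", "cancerIndicatedDrugs": None,
     "chemicalProbes": True, "sm_tractable_predicted": 0.3},
    {"approvedSymbol": "GENE_X", "cancerIndicatedDrugs": None,
     "chemicalProbes": False, "sm_tractable_predicted": None},
]

print(unexplored_candidates(rows))  # ['ATM']
```

Raising min_score tightens the tractability requirement, although genes with a chemical probe are still kept regardless of their scores in this sketch.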
Conclusion
The query displayed here illustrates how joining multiple datasets makes it possible to answer more complex questions. The biological question can also be expanded, for example by including other Platform data:
- CRISPR synthetic-lethality information from Project Score
- Drug action types (agonist, antagonist, modulator, etc.)
- Most recurrent mutation type per target
- Target Enabling Packages (TEPs)
- Tissue or tissues affected
- …
As more Platform datasets become available, we hope to expand the number of therapeutic hypotheses that can be answered using the Platform data.