Gap between cancer driver genes and clinical trials


I recently received the following request: “I would like to know what are the cancer driver genes that haven’t been explored in the clinic as drug targets and whether there are any indications that they could eventually become valid drug targets.” While the question looks simple, answering it requires a number of decisions and several datasets that need to be combined. This post does not aim to provide a comprehensive response. Instead, it illustrates how to merge multiple Open Targets datasets to answer a more complex query.


This example uses data from the 21.04 Open Targets Platform release, in particular the targets, diseases and evidence datasets.

To download the datasets in parquet format, I use the gsutil tool. Alternative strategies are described in the Data Downloads documentation section.

gsutil -m cp -r \
gs://open-targets-data-releases/21.04/output/etl/parquet/diseases/ \
gs://open-targets-data-releases/21.04/output/etl/parquet/targets/ \
gs://open-targets-data-releases/21.04/output/etl/parquet/evidence/ \
<path>/

PySpark to the rescue

To read and process the relevant information, I’m using PySpark, although other tools would also work. PySpark is widely available and easy to install using your preferred package manager. We will first establish the Spark connection.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# establish spark connection (local session using all available cores)
spark = SparkSession.builder.master("local[*]").getOrCreate()

Reading the datasets

The downloaded datasets are directories containing all the partitioned parquet files. Reading them in PySpark is as easy as providing the path to the directory that contains the files.

# read the target, disease and evidence datasets
target = spark.read.parquet("<path>/targets")
disease = spark.read.parquet("<path>/diseases")
evd = spark.read.parquet("<path>/evidence")

Preparing some data

There are a few datasets that need to be prepared in order to make our final query.

Cancer-indicated drugs

We want to find all the cancer-indicated drugs and their targets based on the known mechanism of action. This requires information from 2 different datasets:

  1. We want all the diseases that belong to the cell proliferation disorder (MONDO_0045024) therapeutic area.
  2. We want to subset the Open Targets Platform target-disease evidence that is provided by ChEMBL and contains the target-drug-indication triplet for clinical candidates or approved drugs.

Joining these 2 datasets answers the question. We will also compute some aggregations by target, including the number of distinct cancer-indicated drugs (cancerIndicatedDrugs) and the maximum clinical phase across all the target's drugs (maxCancerPhase).

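# get all diseases in the cell proliferation disorder TA
# (therapeuticAreas is an assumed column name)
cancer_diseases = (
    disease.withColumn("TA", F.explode("therapeuticAreas"))
    .filter(F.col("TA") == "MONDO_0045024")
    .select(F.col("id").alias("diseaseId"))
)

# Cancer indicated drugs by target and maximum clinical trial phase
# (drugId and clinicalPhase are assumed column names)
chemblByTarget = (
    evd.filter(F.col("sourceId") == "chembl")
    .join(cancer_diseases, on="diseaseId", how="inner")
    .groupBy("targetId")
    .agg(F.countDistinct("drugId").alias("cancerIndicatedDrugs"),
         F.max("clinicalPhase").alias("maxCancerPhase"))
)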
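
As a rough illustration of what this join and aggregation compute, the sketch below reproduces the same logic in plain Python on a handful of invented rows. All target, disease and drug IDs are placeholders, and the field names drugId and clinicalPhase are assumptions made for illustration rather than values taken from the Platform.

```python
from collections import defaultdict

# Toy stand-ins for the real datasets; every ID below is invented.
cancer_disease_ids = {"DISEASE_A", "DISEASE_B"}
chembl_evidence = [
    {"targetId": "T1", "diseaseId": "DISEASE_A", "drugId": "D1", "clinicalPhase": 2},
    {"targetId": "T1", "diseaseId": "DISEASE_B", "drugId": "D2", "clinicalPhase": 4},
    {"targetId": "T1", "diseaseId": "DISEASE_B", "drugId": "D1", "clinicalPhase": 3},
    {"targetId": "T2", "diseaseId": "DISEASE_C", "drugId": "D3", "clinicalPhase": 4},
]


def summarise_by_target(evidence, disease_ids):
    """Inner-join evidence to the disease set, then aggregate per target."""
    drugs = defaultdict(set)
    max_phase = {}
    for row in evidence:
        if row["diseaseId"] not in disease_ids:
            continue  # row has no matching cancer disease, so the inner join drops it
        target = row["targetId"]
        drugs[target].add(row["drugId"])
        max_phase[target] = max(max_phase.get(target, 0), row["clinicalPhase"])
    return {
        target: {"cancerIndicatedDrugs": len(ids), "maxCancerPhase": max_phase[target]}
        for target, ids in drugs.items()
    }


chembl_by_target = summarise_by_target(chembl_evidence, cancer_disease_ids)
print(chembl_by_target)
```

In this toy example T2 is absent from the summary because its only indication falls outside the cancer set. Targets like this are the ones that end up with null drug columns once the summary is left-joined onto the list of driver genes.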

Target role in cancer

We are going to use 2 different sources to nominate targets that are believed to be involved in cancer biology.

  1. Manually curated Cancer Gene Census role in cancer. Genes are proposed as oncogenes and/or tumor suppressor genes (TSG), and the Open Targets Platform presents this data at the target level.
  2. IntOGen cancer-driver gene predictions based on data from large-scale sequencing projects. This information is also available as Open Targets Platform evidence.

# get the role in cancer (Cancer Gene Census) for all targets
roleInCancer = (
    target
    .select(F.col("id").alias("targetId"), "hallmarks.*")
    .select("targetId", F.explode("attributes").alias("attributes"))
    .select("targetId", "attributes.attribute_name", "attributes.description")
    .filter(F.col("attribute_name") == "role in cancer")
    .groupBy("targetId")
    .agg(F.concat_ws(", ", F.collect_set(F.col("description"))).alias("roleInCancer"))
)

# number of methods suggesting a cancer driver by IntOGen
intogen = (
    evd
    .filter(F.col("sourceId") == "intogen")
    .select("targetId", F.explode("significantDriverMethods").alias("methods"))
    .groupBy("targetId")
    .agg(F.countDistinct("methods").alias("intogenDriverMethods"))
)

Target tractability

Finally, we would like to know whether these targets are believed to be tractable, based on the Open Targets tractability assessments for small molecules and/or antibodies.

# Target tractability assessment for antibody (ab) and small molecule (sm) by target
tractability = (

Some magic to join all prepared data

Finally, we will join all the information that we have been preparing. We will also take the opportunity to include other information that is currently available in the target object, such as target safety concerns or any available chemical probes targeting the protein.

# resulting dataset with all data integrated
df = (
    evd
    .filter(F.col("sourceId").isin(["cancer_gene_census", "intogen"]))
    # keep one row per target (assumed step, consistent with the per-gene output)
    .select("targetId")
    .distinct()
    # add role in cancer from CGC
    .join(roleInCancer, on="targetId", how="left")
    # intogen driver methods
    .join(intogen, on="targetId", how="left")
    # chemblByTarget
    .join(chemblByTarget, on="targetId", how="left")
    # chemicalProbes
    .join(
        target
        .select(F.col("id").alias("targetId"), "chemicalProbes")
        .withColumn("chemicalProbes", F.col("chemicalProbes").isNotNull()),
        on="targetId", how="left")
    # add safety data
    .join(
        target
        .select(F.col("id").alias("targetId"), "safety")
        .withColumn("safety", F.col("safety").isNotNull())
        .withColumnRenamed("safety", "safety_data"),
        on="targetId", how="left")
    # add tractability
    .join(tractability, on="targetId", how="left")
    # adding labels
    .join(
        target.select(F.col("id").alias("targetId"), "approvedSymbol"),
        on="targetId", how="left")
)
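
Because every join above is a left join, driver genes with no matching row in a given table are kept and simply get nulls in the corresponding columns. A minimal plain-Python sketch of that behaviour, using invented target IDs and values (the left_join helper is hypothetical and only mimics what Spark does):

```python
def left_join(left_rows, right_by_key, key, right_columns):
    """Keep every left row; fill right-hand columns with None when no match exists."""
    joined = []
    for row in left_rows:
        match = right_by_key.get(row[key], {})
        merged = dict(row)
        for column in right_columns:
            merged[column] = match.get(column)
        joined.append(merged)
    return joined


# Invented driver genes and drug summary, for illustration only.
drivers = [{"targetId": "T1"}, {"targetId": "T2"}]
drug_summary = {"T1": {"cancerIndicatedDrugs": 2, "maxCancerPhase": 4}}

result = left_join(
    drivers, drug_summary, "targetId", ["cancerIndicatedDrugs", "maxCancerPhase"]
)
for row in result:
    print(row)
```

An inner join would instead drop T2 entirely, which is exactly the kind of gene the original question is about.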

Exploring the results…

The final dataset (df) contains 886 potential driver genes with all the available metadata in dataframe format. Below is a sample of 3 genes from the full dataset, displayed vertically.

>>> df.show(3, vertical=True)
-RECORD 0--------------------------------------------
 targetId                          | ENSG00000146648 
 roleInCancer                      | oncogene        
 intogenDriverMethods              | 8               
 cancerIndicatedDrugs              | 52              
 maxCancerPhase                    | 4               
 chemicalProbes                    | true            
 safety_data                       | true            
 ab_tractable_highconf             | 1.0             
 ab_tractable_medlowconf           | 0.75            
 sm_tractable_discovery_precedence | 1.0             
 sm_tractable_predicted            | 1.0             
 approvedSymbol                    | EGFR            
-RECORD 1--------------------------------------------
 targetId                          | ENSG00000149311 
 roleInCancer                      | TSG             
 intogenDriverMethods              | 8               
 cancerIndicatedDrugs              | null            
 maxCancerPhase                    | null            
 chemicalProbes                    | true            
 safety_data                       | false           
 ab_tractable_highconf             | null            
 ab_tractable_medlowconf           | null            
 sm_tractable_discovery_precedence | 0.3             
 sm_tractable_predicted            | 0.3             
 approvedSymbol                    | ATM             
-RECORD 2--------------------------------------------
 targetId                          | ENSG00000122025 
 roleInCancer                      | oncogene        
 intogenDriverMethods              | 8               
 cancerIndicatedDrugs              | 33              
 maxCancerPhase                    | 4               
 chemicalProbes                    | true            
 safety_data                       | false           
 ab_tractable_highconf             | 0.3             
 ab_tractable_medlowconf           | 0.25            
 sm_tractable_discovery_precedence | 1.0             
 sm_tractable_predicted            | 0.6             
 approvedSymbol                    | FLT3            
only showing top 3 rows

The results can be exported to a file or into pandas (df.toPandas()) for further analysis.
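
As one example of such further analysis, the original question (driver genes with no cancer-indicated drugs but some tractability signal) can be approximated with a simple filter. The sketch below works on plain dictionaries, such as those obtained with `[row.asDict() for row in df.collect()]`, and uses the three sample records shown above with a subset of their columns. Treating "any non-null tractability value" as the criterion is an assumption made for illustration, not an Open Targets recommendation.

```python
# The three sample records from the output above, restricted to a few columns.
records = [
    {"approvedSymbol": "EGFR", "cancerIndicatedDrugs": 52,
     "ab_tractable_highconf": 1.0, "sm_tractable_predicted": 1.0},
    {"approvedSymbol": "ATM", "cancerIndicatedDrugs": None,
     "ab_tractable_highconf": None, "sm_tractable_predicted": 0.3},
    {"approvedSymbol": "FLT3", "cancerIndicatedDrugs": 33,
     "ab_tractable_highconf": 0.3, "sm_tractable_predicted": 0.6},
]

TRACTABILITY_COLUMNS = ["ab_tractable_highconf", "sm_tractable_predicted"]


def is_unexplored_but_tractable(record):
    """True when no cancer-indicated drug exists but a tractability value is present."""
    no_cancer_drugs = record["cancerIndicatedDrugs"] is None
    has_tractability = any(record[c] is not None for c in TRACTABILITY_COLUMNS)
    return no_cancer_drugs and has_tractability


candidates = [r["approvedSymbol"] for r in records if is_unexplored_but_tractable(r)]
print(candidates)
```

On these three records only ATM passes the filter. On the full dataset, a stricter threshold on the tractability values would probably be a more useful starting point.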


The query shown here illustrates how joining multiple datasets can answer more complex questions. The biological question can also be expanded, for example by including other Platform data:

  • CRISPR synthetic-lethality information from Project Score
  • Drug action types (agonist, antagonist, modulator, etc.)
  • Most recurrent mutation type per target
  • Target Enabling Packages (TEPs)
  • Tissue or tissues affected

As more Platform datasets become available, we hope to expand the number of therapeutic hypotheses that can be answered using the Platform data.