Get marketed drugs for a set of targets with the data downloads

``At Open Targets, we have several ways of accessing our data. If BigQuery falls short for you or you are simply comfortable using Pyspark, in this post, we illustrate what it would be like to reproduce this analysis in BigQuery using our data files downloadable from the Downloads page.

Batch search using our data files

This snippet is an example of how to extract the drugs that are on the market for a restricted list of targets.

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Establish spark connection
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

# Targets of interest
myTargets = [
    "ERBB3",
    "EGFR"
]

# Read and filter datasets
targetPath = "<localpath>/targets"
drugPath = "<localpath>/evidence"

drugs = (spark
            .read.parquet(drugPath)
            .select(
                F.col("id").alias("drugId"),
                "maximumClinicalTrialPhase", "linkedTargets")
            # Filter by marketed drugs
            .filter(
                F.col("maximumClinicalTrialPhase") == 4)
            # Explode linked target
            .withColumn(
                "linkedTarget",
                F.explode("linkedTargets.rows"))
)
targets = (spark
            .read.parquet(targetPath)
            .select(
                F.col("id").alias("targetId"),
                "approvedSymbol")
            # Filter by targets of interest
            .filter(
                F.col("approvedSymbol").isin(myTargets))
)

# Join both tables
out = (targets
        .join(
            drugs.withColumnRenamed("linkedTarget", "targetId"),
            on="targetId",
            how="inner")
        .select("targetId", "approvedSymbol", "drugId")
        .groupby("targetId", "approvedSymbol")
        .agg(F.collect_set("drugId").alias("drugIds"))
)

# Export to CSV
out.toPandas().to_csv("targets_w_drugs.csv", index=False, header=True)

This is an alternative to the batch search functionality of the previous version so that we allow our users to have more control and to elaborate more complex hypotheses with the vast amount of data available in the Platform.

Confused? Let’s break it down:

Step 1: Select the datasets

We use 2 datasets to establish the relationship between the drug and the target:

  • Target, which contains the HGNC Symbols;
  • Molecule, with the annotation of the drug molecules.

Step 2: Select marketed drugs

We want to filter out drugs with a maximum phase observed in clinical trials below < 4 to perform the analysis only on those drugs that are available on the market.

Step 3: Join the data

The molecule dataset comprises in linkedTargets.rows all the targets (in the form of Ensembl gene IDs) on which a drug acts according to its mechanism of action.

When joining the two tables, the only thing we have to take into account is that this field is an array that needs to be flattened to establish the pairwise relationship between each array element and the id from target. For this purpose, this field is exploded so that there is a row for each drug ID and linked target.

Step 4: Aggregate results by target

Grouping first by the symbol and the ENSID of the target, we collect all the resulting drugs in the array drugIds .

Download the data

Once we have extracted the table we can proceed with the analysis either in PySpark or in Pandas with the toPandas() method.

1 Like