``At Open Targets, we have several ways of accessing our data. If BigQuery falls short for you or you are simply comfortable using Pyspark, in this post, we illustrate what it would be like to reproduce this analysis in BigQuery using our data files downloadable from the Downloads page.
This snippet is an example of how to extract the drugs that are on the market for a restricted list of targets.
from pyspark.sql import SparkSession import pyspark.sql.functions as F # Establish spark connection spark = ( SparkSession.builder .master('local[*]') .getOrCreate() ) # Targets of interest myTargets = [ "ERBB3", "EGFR" ] # Read and filter datasets targetPath = "<localpath>/targets" drugPath = "<localpath>/evidence" drugs = (spark .read.parquet(drugPath) .select( F.col("id").alias("drugId"), "maximumClinicalTrialPhase", "linkedTargets") # Filter by marketed drugs .filter( F.col("maximumClinicalTrialPhase") == 4) # Explode linked target .withColumn( "linkedTarget", F.explode("linkedTargets.rows")) ) targets = (spark .read.parquet(targetPath) .select( F.col("id").alias("targetId"), "approvedSymbol") # Filter by targets of interest .filter( F.col("approvedSymbol").isin(myTargets)) ) # Join both tables out = (targets .join( drugs.withColumnRenamed("linkedTarget", "targetId"), on="targetId", how="inner") .select("targetId", "approvedSymbol", "drugId") .groupby("targetId", "approvedSymbol") .agg(F.collect_set("drugId").alias("drugIds")) ) # Export to CSV out.toPandas().to_csv("targets_w_drugs.csv", index=False, header=True)
This is an alternative to the batch search functionality of the previous version so that we allow our users to have more control and to elaborate more complex hypotheses with the vast amount of data available in the Platform.
Confused? Let’s break it down:
We use 2 datasets to establish the relationship between the drug and the target:
- Target, which contains the HGNC Symbols;
- Molecule, with the annotation of the drug molecules.
We want to filter out drugs with a maximum phase observed in clinical trials below < 4 to perform the analysis only on those drugs that are available on the market.
The molecule dataset comprises in
linkedTargets.rows all the targets (in the form of Ensembl gene IDs) on which a drug acts according to its mechanism of action.
When joining the two tables, the only thing we have to take into account is that this field is an array that needs to be flattened to establish the pairwise relationship between each array element and the id from target. For this purpose, this field is exploded so that there is a row for each drug ID and linked target.
Grouping first by the symbol and the ENSID of the target, we collect all the resulting drugs in the array
Once we have extracted the table we can proceed with the analysis either in PySpark or in Pandas with the