Hi!
I’ve been trying to connect the Open Targets CC0 data to Wikidata (with proper references, of course!).
I am looking for a way to download a simple summary of all gene-disease relations, for example as a CSV edge list with EFO IDs in one column and ENSG IDs in the other.
I was thinking of doing it the dumb way and going through all EFO IDs via GraphQL, but that seems redundant and maybe a waste of Open Targets server time.
Does anyone have a better suggestion? Thanks!
P.S.: I want to do the same for drug-disease relations, so any suggestions on that are super welcome too!
Hi @lubianat, and welcome to the Open Targets community!
One way to build a table of all the data you’re interested in is to download the datasets via FTP and then assemble the table in Python. If possible, you could use PySpark for efficiency. For example, you could do:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("OpenTargets Genetic Associations") \
    .getOrCreate()

# Paths to the FTP-downloaded datasets of interest (examples given)
path_to_association_parquet = './data/associationByOverallIndirect/'
path_to_evidence_parquet = './data/associationByDatasourceIndirect/'

# Load the gene-disease associations into one dataframe
gene_assoc_df = spark.read.parquet(path_to_association_parquet)
gene_assoc_df = gene_assoc_df.withColumnRenamed('targetId', 'ensembl_id') \
    .withColumnRenamed('score', 'ot_indirect_association_score')
gene_assoc_df = gene_assoc_df.select('ensembl_id', 'diseaseId', 'ot_indirect_association_score')

# Load the per-datasource evidence into one dataframe
evidence_assoc_df = spark.read.parquet(path_to_evidence_parquet)
evidence_assoc_df = evidence_assoc_df.withColumnRenamed('targetId', 'ensembl_id')
evidence_assoc_df = evidence_assoc_df.select('ensembl_id', 'datatypeId', 'datasourceId')

# Join the dataframes on the gene ID.
# Note: joining on ensembl_id alone pairs each disease row with every
# datasource row for the same gene, so rows can repeat. To match per disease,
# keep 'diseaseId' in evidence_assoc_df and join on both columns.
open_targets_df = gene_assoc_df.join(evidence_assoc_df, on='ensembl_id', how='left')
open_targets_df.write.parquet('ot_genes_indirect_associations.parquet', compression='snappy')

# Optional: coalesce the partitions to write a single CSV file instead
# (can be slow for large data):
# open_targets_df.coalesce(1).write.csv('ot_genes_indirect_associations', header=True)

spark.stop()
As an example, this would output:
+---------------+-------------+-----------------------------+------------+------------+
|     ensembl_id|    diseaseId|ot_indirect_association_score|  datatypeId|datasourceId|
+---------------+-------------+-----------------------------+------------+------------+
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|animal_model|        impc|
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|animal_model|        impc|
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|  literature|   europepmc|
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|  literature|   europepmc|
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|  literature|   europepmc|
+---------------+-------------+-----------------------------+------------+------------+
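If all you need is the plain edge list from the original question (disease ID in one column, ENSG ID in the other), the repeated gene-disease pairs in the table above can be collapsed to unique pairs. Here is a minimal sketch using only the Python standard library. The helper names `edge_list` and `write_edge_list` and the output header `disease_id` are made up for illustration, and the sample rows are just the ones shown above. In PySpark itself, selecting the two columns and calling `.distinct()` before writing should achieve the same thing.

```python
import csv
import io


def edge_list(rows, disease_col="diseaseId", gene_col="ensembl_id"):
    """Return sorted, unique (disease, gene) pairs from an iterable of row dicts."""
    pairs = {(row[disease_col], row[gene_col]) for row in rows}
    return sorted(pairs)


def write_edge_list(pairs, handle):
    """Write the pairs as a two-column CSV (hypothetical header names)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["disease_id", "ensembl_id"])
    writer.writerows(pairs)


# Sample rows copied from the example output above; in practice these could
# come from csv.DictReader over an exported CSV file.
sample_rows = [
    {"ensembl_id": "ENSG00000002822", "diseaseId": "MONDO_0004985", "datasourceId": "impc"},
    {"ensembl_id": "ENSG00000002822", "diseaseId": "MONDO_0004985", "datasourceId": "impc"},
    {"ensembl_id": "ENSG00000002822", "diseaseId": "MONDO_0004985", "datasourceId": "europepmc"},
    {"ensembl_id": "ENSG00000002822", "diseaseId": "MONDO_0004985", "datasourceId": "europepmc"},
    {"ensembl_id": "ENSG00000002822", "diseaseId": "MONDO_0004985", "datasourceId": "europepmc"},
]

pairs = edge_list(sample_rows)

# Write to an in-memory buffer here; pass an open file handle to write to disk.
buffer = io.StringIO()
write_edge_list(pairs, buffer)
print(buffer.getvalue())
```

The five sample rows all describe the same gene-disease pair, so the edge list ends up with a single row under the header.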
Thank you, hannah, that worked great!