Best way to download all Gene-Disease associations from Open Targets Platform

lubianat · 23 August 2023 00:39

Hi!

I’ve been trying to connect the Open Targets CC0 data to Wikidata (with proper references, of course!)

I am looking for a way to download a simple summary of all gene-disease relations, for example as a CSV edge list with EFO IDs in one column and ENSG IDs in the other.

I was thinking of doing it the dumb way and going through all EFO IDs via GraphQL, but that seems redundant and maybe a waste of Open Targets server time.

Does anyone have a better suggestion? Thanks!

P.S.: I also want to do the same for drug-disease relations, so any suggestions on that are super welcome too!

hannah · 23 August 2023 13:16

Hi @lubianat, and welcome to the Open Targets community!

One way to build a table of all the data you’re interested in is to download the datasets via FTP and then assemble the table in Python. In particular, you could use PySpark for efficiency if that is an option. For example:


from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("OpenTargets Genetic Associations") \
    .getOrCreate()

# Paths to the FTP-downloaded datasets of interest (examples given)
path_to_association_parquet = './data/associationByOverallIndirect/'
path_to_evidence_parquet = './data/associationByDatasourceIndirect/'

# Load the gene-disease associations into one dataframe
gene_assoc_df = spark.read.parquet(path_to_association_parquet)
gene_assoc_df = gene_assoc_df.withColumnRenamed('targetId', 'ensembl_id') \
    .withColumnRenamed('score', 'ot_indirect_association_score')
gene_assoc_df = gene_assoc_df.select('ensembl_id', 'diseaseId', 'ot_indirect_association_score')

# Load the per-datasource evidence into one dataframe
evidence_assoc_df = spark.read.parquet(path_to_evidence_parquet)
evidence_assoc_df = evidence_assoc_df.withColumnRenamed('targetId', 'ensembl_id')
evidence_assoc_df = evidence_assoc_df.select('ensembl_id', 'datatypeId', 'datasourceId')

# Join the dataframes on the Ensembl gene ID.
# Note: joining on 'ensembl_id' alone pairs every association row of a gene with
# every datasource row of that gene, so rows repeat across diseases.
open_targets_df = gene_assoc_df.join(evidence_assoc_df, on='ensembl_id', how='left')

open_targets_df.write.parquet('ot_genes_indirect_associations.parquet', compression='snappy')

# Optionally, write a single CSV file by coalescing all partitions into one
# (Spark still writes a directory containing one part file):
# open_targets_df.coalesce(1).write.csv('ot_genes_indirect_associations', header=True)

spark.stop()

As an example, the resulting table would look like this:

+---------------+-------------+-----------------------------+------------+------------+
|     ensembl_id|    diseaseId|ot_indirect_association_score|  datatypeId|datasourceId|
+---------------+-------------+-----------------------------+------------+------------+
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|animal_model|        impc|
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|animal_model|        impc|
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|  literature|   europepmc|
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|  literature|   europepmc|
|ENSG00000002822|MONDO_0004985|          0.29067802423909317|  literature|   europepmc|
+---------------+-------------+-----------------------------+------------+------------+
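The repeated rows in the table above most likely come from joining on `ensembl_id` alone: each association row for a gene is paired with every datasource row for that gene, whatever the disease. If you want one row per gene-disease-datasource combination, a possible fix in PySpark is to keep `diseaseId` in the evidence `select` and join with `on=['ensembl_id', 'diseaseId']`. Here is a minimal sketch of the effect in plain Python, using made-up placeholder IDs (not real Open Targets data) and two hypothetical helpers, `left_join` and `edge_list_csv`, that only illustrate the idea. It also builds the two-column edge list asked for in the original question.

```python
import csv
import io

# Toy rows with made-up placeholder IDs (NOT real Open Targets data).
associations = [
    {"ensembl_id": "ENSG_A", "diseaseId": "DIS_1", "score": 0.9},
    {"ensembl_id": "ENSG_A", "diseaseId": "DIS_2", "score": 0.2},
]
evidence = [
    {"ensembl_id": "ENSG_A", "diseaseId": "DIS_1", "datasourceId": "src_x"},
    {"ensembl_id": "ENSG_A", "diseaseId": "DIS_2", "datasourceId": "src_y"},
    {"ensembl_id": "ENSG_A", "diseaseId": "DIS_2", "datasourceId": "src_z"},
]


def left_join(left, right, keys):
    """Left join two lists of dicts on the given key columns (hypothetical helper)."""
    index = {}
    for row in right:
        index.setdefault(tuple(row[k] for k in keys), []).append(row)
    joined = []
    for row in left:
        matches = index.get(tuple(row[k] for k in keys))
        if not matches:
            joined.append(dict(row))  # keep unmatched left rows
            continue
        for match in matches:
            joined.append({**match, **row})  # left-hand values win on clashes
    return joined


def edge_list_csv(rows):
    """Return a CSV string of distinct (diseaseId, ensembl_id) pairs (hypothetical helper)."""
    edges = sorted({(r["diseaseId"], r["ensembl_id"]) for r in rows})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["diseaseId", "ensembl_id"])
    writer.writerows(edges)
    return buffer.getvalue()


# Joining on the gene only: 2 association rows x 3 evidence rows.
by_gene = left_join(associations, evidence, ["ensembl_id"])
# Joining on gene and disease: one row per gene-disease-datasource.
by_pair = left_join(associations, evidence, ["ensembl_id", "diseaseId"])

print(len(by_gene), len(by_pair))  # 6 3
print(edge_list_csv(by_pair))
```

If all you need is the edge list, the association dataset on its own should already have one row per gene-disease pair, so the join with the evidence dataset may not be necessary.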

lubianat · 30 August 2023 12:38

Thank you, hannah, that worked great!
