Hi @mose_rab!
Welcome to the Open Targets Community!
Unfortunately, our GraphQL API does not allow you to access data about all of the diseases/phenotypes contained in the Platform. Instead, you will need to use BigQuery or our dataset downloads.
Below, I have included instructions on how to use BigQuery and our dataset downloads. If you use our data, please cite our latest publication: Ochoa, D. et al. (2021).
Cheers,
Andrew
Accessing disease/phenotype data with BigQuery
Using our BigQuery instance - open-targets-prod - you can generate an export of disease data by querying our diseases dataset with the following query:
SELECT
  id,
  name,
  description
FROM
  `open-targets-prod.platform.diseases`
After running the query, you can export the results in JSON or CSV format, or import them into another BigQuery instance or a Google Sheets file.
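If you would rather run the query from a script than from the BigQuery console, something along these lines should work. This is only a sketch: it assumes the `google-cloud-bigquery`, `pandas` and `db-dtypes` packages are installed and that default Google Cloud credentials with a billing project are configured on your machine. The function names and the `diseases.csv` output filename are made up for illustration.

```python
# Table name taken from the query above; everything else is illustrative.
DISEASES_TABLE = "open-targets-prod.platform.diseases"


def build_diseases_query(table: str = DISEASES_TABLE) -> str:
    """Return the same SELECT statement as above, as a single string."""
    return f"SELECT id, name, description FROM `{table}`"


def export_diseases_to_csv(output_path: str = "diseases.csv") -> int:
    """Run the query and write the result to a CSV file.

    Returns the number of rows written. Needs Google Cloud credentials
    and the packages listed in the text above.
    """
    # Imported here so the rest of the module loads without the client library.
    from google.cloud import bigquery

    client = bigquery.Client()
    df = client.query(build_diseases_query()).to_dataframe()
    df.to_csv(output_path, index=False)
    return len(df)
```

Calling `export_diseases_to_csv()` would then write the three columns to `diseases.csv` in the current directory.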
Accessing disease/phenotype data with Platform dataset downloads
Using our FTP server, you can download our diseases dataset in either Parquet or JSON format.
Once you have downloaded the files, you can then parse them using the programming language and libraries of your choice.
Please see below for an example using Python, PySpark, and pandas.
# import relevant libraries
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd  # must be installed for toPandas() below
# create Spark session
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)
# set location of diseases dataset downloaded in Parquet format
disease_data_path = "/Users/amh/Downloads/platform-data-analysis/data/diseases"
# read diseases dataset
disease_data = spark.read.parquet(disease_data_path)
# print diseases dataset schema
disease_data.printSchema()
# generate subset of diseases dataset with relevant fields
disease_data_subset = disease_data.select(
    F.col("id").alias("disease_id"),
    "name",
    "description",
)
# convert to Pandas dataframe
disease_df = disease_data_subset.toPandas()
# print first 5 rows of disease dataframe
print(disease_df.head(5))
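If you downloaded the JSON format instead and do not want to set up Spark, the standard library is enough for a dataset of this size. The sketch below assumes the download is a directory of part files with a `.json` extension, each holding one JSON record per line; please check this against the files you actually have. The `iter_diseases` helper is hypothetical, not part of any Open Targets tooling.

```python
import json
from pathlib import Path
from typing import Iterator

# Fields to keep, matching the BigQuery and PySpark examples above.
FIELDS = ("id", "name", "description")


def iter_diseases(json_dir: str) -> Iterator[dict]:
    """Yield one dict per disease record, keeping only FIELDS.

    Assumes every *.json file in json_dir contains one JSON object per line.
    Fields missing from a record are returned as None.
    """
    for part in sorted(Path(json_dir).glob("*.json")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                yield {field: record.get(field) for field in FIELDS}
```

You could then build a pandas dataframe with `pd.DataFrame(iter_diseases(path_to_json_directory))`, where `path_to_json_directory` is wherever you saved the download.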
Our dataset downloads documentation also includes a sample sparklyr script that you can use to access and parse the diseases dataset.