Returning all associations data using the Platform API

Hello @anticancer.org.uk! :wave:

Welcome to the Open Targets Community! :tada:

Thank you for submitting your question about how to access associations data using our GraphQL API. I have responded below and also included two other ways of accessing the data for more systematic and comprehensive queries.

How do I access associations data using the Open Targets Platform GraphQL API?

To access all associations data for a given target or disease/phenotype through our GraphQL API, you will need to iterate through each page of results using the associatedDiseases field and the page argument.

associatedDiseases(page: { size: x, index: y })

Using your example above, diseases associated with CAV1 (ENSG00000105974), your first API query would be:

query targetDiseaseAssociations {
  target(ensemblId: "ENSG00000105974") {
    id
    approvedSymbol
    approvedName
    associatedDiseases(page: { size: 50, index: 0 }) {
      count
      rows {
        score
        disease {
          id
          name
        }
      }
    }
  }
}


This query will return the first 50 results that you see in the web interface associations page.

To access the next 50 results, you will need to update your query and set index to 1.

query targetDiseaseAssociations {
  target(ensemblId: "ENSG00000105974") {
    id
    approvedSymbol
    approvedName
    associatedDiseases(page: { size: 50, index: 1 }) {
      count
      rows {
        score
        disease {
          id
          name
        }
      }
    }
  }
}


You would need to continue to update index until you obtain all results. Because our API returns a maximum of 50 records for a given query, you would need to run the query 19 times to get all 947 results.
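To make the pagination arithmetic concrete, here is a minimal Python sketch that only builds the request bodies for each page; it does not send anything. The variable names ($ensemblId, $size, $index), the helper build_page_payloads, and the use of GraphQL variables in place of hard-coded values are my own illustrative choices, and the total of 947 is taken from the count discussed above.

```python
import json
import math

# Same query as above, with GraphQL variables instead of hard-coded values.
# The variable names are illustrative assumptions, not part of the API.
QUERY = """
query targetDiseaseAssociations($ensemblId: String!, $size: Int!, $index: Int!) {
  target(ensemblId: $ensemblId) {
    id
    approvedSymbol
    approvedName
    associatedDiseases(page: { size: $size, index: $index }) {
      count
      rows {
        score
        disease {
          id
          name
        }
      }
    }
  }
}
"""


def build_page_payloads(ensembl_id, total_count, page_size=50):
    """Return one JSON-serialisable request body per page of results."""
    # Round up so that a final, partially filled page is still requested.
    n_pages = math.ceil(total_count / page_size)
    return [
        {
            "query": QUERY,
            "variables": {"ensemblId": ensembl_id, "size": page_size, "index": index},
        }
        for index in range(n_pages)
    ]


payloads = build_page_payloads("ENSG00000105974", total_count=947)
print(len(payloads))  # 19
print(json.dumps(payloads[-1]["variables"]))
```

Page indices start at 0, so the 19 pages run from index 0 to index 18, and the last page holds the remaining 47 rows.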

As noted in our API documentation, the GraphQL API is optimised for querying a single target, disease/phenotype, drug, or target-disease association. It is not designed for looping through many pages of results.

Instead, for the more programmatic, systematic, and comprehensive use case that you have, I would strongly recommend using our associations datasets: associationByOverallDirect, associationByOverallIndirect, associationByDatasourceDirect and associationByDatasourceIndirect. These datasets can be accessed using our BigQuery instance (open-targets-prod) or our dataset downloads.

How do I access associations data using BigQuery?

In our BigQuery open-targets-prod instance, you will find separate associations tables for direct and indirect associations, each available with overall scores or with scores by datasource. You can query these datasets using SQL.

For example, using our associationByOverallDirect dataset, you can run the following query and return all 947 associations and the overall target-disease association scores:

SELECT
  associations.targetId AS target_id,
  targets.approvedSymbol AS target_approved_symbol,
  associations.diseaseId AS disease_id,
  diseases.name AS disease_name,
  associations.score AS overall_association_score
FROM
  `open-targets-prod.platform.associationByOverallDirect` AS associations
JOIN
  `open-targets-prod.platform.diseases` AS diseases
ON
  associations.diseaseId = diseases.id
JOIN
  `open-targets-prod.platform.targets` AS targets
ON
  associations.targetId = targets.id
WHERE
  associations.targetId = 'ENSG00000105974'
ORDER BY
  associations.score DESC


You can then export these results to CSV, JSON, or Google Sheets formats or import into your own BigQuery table.

How do I access associations data using dataset downloads?

Using our FTP service, you can download our datasets in either JSON or Parquet formats and use your programming language of choice to query and analyse the data.

For example, to generate a CSV with all 947 associations, you can use PySpark, Python, and pandas to process the associationByOverallDirect and diseases datasets.

# import relevant libraries
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
import pandas as pd

# create Spark session
spark = (
    SparkSession.builder
    .master('local[*]')
    .getOrCreate()
)

# set location of associations dataset (Parquet format)
associations_data_path = "path to directory with associations dataset files"

# read associations dataset
associations_data = spark.read.parquet(associations_data_path)

# print associations dataset schema
# associations_data.printSchema()

# create subset with relevant fields
associations_data_subset = (associations_data.select("targetId","diseaseId", F.col("score").alias("overallAssociationScore")))

# set location of diseases dataset (Parquet format)
disease_data_path = "path to directory with diseases dataset files"

# read diseases dataset
disease_data = spark.read.parquet(disease_data_path)

# print diseases dataset schema
# disease_data.printSchema()

# create subset with relevant fields
disease_data_subset = (disease_data.select(F.col("id").alias("diseaseId"), F.col("name").alias("diseaseLabel")))

# merge associations and diseases data
output = (associations_data_subset
              .join(disease_data_subset, on="diseaseId", how="inner")
         )

# show output of merged data
# output.show()

# filter for CAV1 (ENSG00000105974) in Spark first, so that only the
# matching rows are collected into pandas
cav1_output = output.filter(F.col("targetId") == "ENSG00000105974")

# convert filtered output to pandas dataframe
output_df = cav1_output.toPandas()

# print length of filtered dataframe
print(len(output_df))

# export dataframe to CSV (index=False omits the pandas row index column)
output_df.to_csv("CAV1_associated_diseases.csv", index=False)
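If you would rather not set up Spark for a single-target extract, the same filter and join can be sketched with the Python standard library alone. This assumes the JSON downloads are newline-delimited (one JSON object per line) and uses the same field names as above (targetId, diseaseId, score, id, name). The helper names join_associations and write_csv are hypothetical, and the records below are made-up placeholders that only show the call pattern.

```python
import csv
import io
import json

FIELDNAMES = ["targetId", "diseaseId", "diseaseLabel", "overallAssociationScore"]


def join_associations(association_lines, disease_lines, target_id):
    """Filter associations for one target and attach disease names."""
    # Build a diseaseId -> name lookup from the diseases records.
    disease_names = {}
    for line in disease_lines:
        if line.strip():
            record = json.loads(line)
            disease_names[record["id"]] = record["name"]

    rows = []
    for line in association_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        # Inner join: keep rows for the target whose disease has a known name.
        if record["targetId"] == target_id and record["diseaseId"] in disease_names:
            rows.append({
                "targetId": record["targetId"],
                "diseaseId": record["diseaseId"],
                "diseaseLabel": disease_names[record["diseaseId"]],
                "overallAssociationScore": record["score"],
            })
    # Highest-scoring associations first.
    rows.sort(key=lambda row: row["overallAssociationScore"], reverse=True)
    return rows


def write_csv(rows, handle):
    """Write the joined rows to an open text handle as CSV."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Made-up placeholder records, not real Open Targets data.
example_associations = [
    '{"targetId": "ENSG00000105974", "diseaseId": "DISEASE_A", "score": 0.25}',
    '{"targetId": "ENSG00000105974", "diseaseId": "DISEASE_B", "score": 0.75}',
    '{"targetId": "OTHER_TARGET", "diseaseId": "DISEASE_A", "score": 0.5}',
]
example_diseases = [
    '{"id": "DISEASE_A", "name": "placeholder disease A"}',
    '{"id": "DISEASE_B", "name": "placeholder disease B"}',
]

rows = join_associations(example_associations, example_diseases, "ENSG00000105974")
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

With real downloads, you would pass open file handles for each part file in place of the example lists, since a text file handle also yields one line at a time.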

Apologies for such a long reply :sweat_smile: but I wanted to show you that you can also answer your research question using BigQuery or our dataset downloads.

Good luck :crossed_fingers: and feel free to comment below if you have any further questions!

Cheers,

Andrew :slight_smile: