Discrepancy in Evidence Counts Between GraphQL API and BigQuery for enableIndirect: true

Hi,

I’m encountering a discrepancy between the evidence retrieved via the Open Targets GraphQL API and the BigQuery dataset when querying evidence for a specific disease-target pair with enableIndirect: true.

  1. GraphQL API:
  • Using the following query with enableIndirect: true, I retrieve evidence for 28 papers associated with the disease (EFO_0005917) and the target (ENSG00000164588).
query EuropePMCQuery(
  $ensemblId: String!
  $efoId: String!
  $size: Int!
  $cursor: String
) {
  disease(efoId: $efoId) {
    id
    europePmc: evidences(
      ensemblIds: [$ensemblId]
      enableIndirect: true
      size: $size
      datasourceIds: ["europepmc"]
      cursor: $cursor
    ) {
      count
      cursor
      rows {
        disease {
          name
          id
        }
        target {
          approvedSymbol
          id
        }
        literature
        resourceScore
      }
    }
  }
}

This query provides evidence for both direct and indirect associations (through disease ontology descendants).

  1. BigQuery:
    I attempted to replicate this query in BigQuery using the open_targets_platform dataset. The following query was used:
SELECT
    targetId,
    diseaseId,
    ARRAY_AGG(literature_element) AS aggregated_literature,
    COUNT(*) AS evidence_count
FROM
    `bigquery-public-data.open_targets_platform.evidence`,
    UNNEST(literature.list) AS literature_element
WHERE
    targetId = "ENSG00000164588"
    AND diseaseId = "EFO_0005917"
GROUP BY
    targetId, diseaseId

This query only returns 7 pieces of evidence, despite accounting for all literature in the evidence.literature.list field.

Key Observations

  • The GraphQL API retrieves 28 pieces of evidence with enableIndirect: true.
  • BigQuery results are limited to 7 pieces of evidence, even when aggregating all available literature for the disease-target pair.

Questions

  1. How does the GraphQL API handle enableIndirect: true? Does it include evidence from indirect associations (e.g., descendant diseases) not captured in the BigQuery dataset?
  2. Is there a way to replicate the behavior of enableIndirect: true in BigQuery?
  3. Are there known differences between the GraphQL API and BigQuery datasets that might explain this discrepancy?

Thank you for your help in clarifying this issue.

Muhamed

Hi Muhamed,

  1. When the GraphQL API returns evidence data, it either looks direct connection between a given disease and a target or it can retrieve all evidence between all descendants of the disease. Eg. if you have direct evidence between gene X and breast cancer, X, indirectly is also an evidence for cancer (because breast cancer is a descendant of the more general disease term, cancer). Evidence in the evidence dataset is always direct. To retrieve indirect evidence you need look up all descendant disease terms for your disease of interest in the disease table.
  2. Yes, you can modify the query, but that requires an other table describing the relationship between diseases. An example query is here: [1]
  3. The content available via GraphQL is expected to be identical that you can access via BigQuery, however the shape of datasets on GraphQL are more optimised for UI display and better suited for UI based use cases. There can also be delays in refreshing BigQuery datasets in the release process, but that should be marginal.

If there’s anything else to clarify, please let us know!

[1]

WITH exploded_diseases AS (
  SELECT
    id,
    descendant.element AS diseaseId
  FROM
    `bigquery-public-data.open_targets_platform.diseases`,
    UNNEST(descendants.list) AS descendant
  WHERE
    id = 'EFO_0005917'
)

SELECT DISTINCT
  e.targetId,
  e.diseaseId,
 e.literature.list
FROM
  exploded_diseases d,
  `bigquery-public-data.open_targets_platform.evidence` e
WHERE
  (d.diseaseId = e.diseaseId OR 
  d.id = e.diseaseId) AND 
  e.targetId = 'ENSG00000164588' AND
  e.datasourceId = 'europepmc'
1 Like

Gotcha! Thanks for the comprehensive reply Daniel:)