Discrepancy in Evidence Counts Between GraphQL API and BigQuery for enableIndirect: true

Muhamed · 29 January 2025 09:28

Hi,

I’m encountering a discrepancy between the evidence retrieved via the Open Targets GraphQL API and the BigQuery dataset when querying evidence for a specific disease-target pair with enableIndirect: true.

GraphQL API:

Using the following query with enableIndirect: true, I retrieve evidence for 28 papers associated with the disease (EFO_0005917) and the target (ENSG00000164588).

query EuropePMCQuery(
  $ensemblId: String!
  $efoId: String!
  $size: Int!
  $cursor: String
) {
  disease(efoId: $efoId) {
    id
    europePmc: evidences(
      ensemblIds: [$ensemblId]
      enableIndirect: true
      size: $size
      datasourceIds: ["europepmc"]
      cursor: $cursor
    ) {
      count
      cursor
      rows {
        disease {
          name
          id
        }
        target {
          approvedSymbol
          id
        }
        literature
        resourceScore
      }
    }
  }
}

This query provides evidence for both direct and indirect associations (through disease ontology descendants).

BigQuery:
I attempted to replicate this query in BigQuery using the open_targets_platform dataset. The following query was used:

SELECT
    targetId,
    diseaseId,
    ARRAY_AGG(literature_element) AS aggregated_literature,
    COUNT(*) AS evidence_count
FROM
    `bigquery-public-data.open_targets_platform.evidence`,
    UNNEST(literature.list) AS literature_element
WHERE
    targetId = "ENSG00000164588"
    AND diseaseId = "EFO_0005917"
GROUP BY
    targetId, diseaseId

This query only returns 7 pieces of evidence, despite accounting for all literature in the evidence.literature.list field.

Key Observations

The GraphQL API retrieves 28 pieces of evidence with enableIndirect: true.
BigQuery results are limited to 7 pieces of evidence, even when aggregating all available literature for the disease-target pair.

Questions

How does the GraphQL API handle enableIndirect: true? Does it include evidence from indirect associations (e.g., descendant diseases) not captured in the BigQuery dataset?
Is there a way to replicate the behavior of enableIndirect: true in BigQuery?
Are there known differences between the GraphQL API and BigQuery datasets that might explain this discrepancy?

Thank you for your help in clarifying this issue.

Muhamed

dsuveges · 30 January 2025 09:33

Hi Muhamed,

When the GraphQL API returns evidence data, it either looks direct connection between a given disease and a target or it can retrieve all evidence between all descendants of the disease. Eg. if you have direct evidence between gene X and breast cancer, X, indirectly is also an evidence for cancer (because breast cancer is a descendant of the more general disease term, cancer). Evidence in the evidence dataset is always direct. To retrieve indirect evidence you need look up all descendant disease terms for your disease of interest in the disease table.
Yes, you can modify the query, but that requires an other table describing the relationship between diseases. An example query is here: [1]
The content available via GraphQL is expected to be identical that you can access via BigQuery, however the shape of datasets on GraphQL are more optimised for UI display and better suited for UI based use cases. There can also be delays in refreshing BigQuery datasets in the release process, but that should be marginal.

If there’s anything else to clarify, please let us know!

[1]

WITH exploded_diseases AS (
  SELECT
    id,
    descendant.element AS diseaseId
  FROM
    `bigquery-public-data.open_targets_platform.diseases`,
    UNNEST(descendants.list) AS descendant
  WHERE
    id = 'EFO_0005917'
)

SELECT DISTINCT
  e.targetId,
  e.diseaseId,
 e.literature.list
FROM
  exploded_diseases d,
  `bigquery-public-data.open_targets_platform.evidence` e
WHERE
  (d.diseaseId = e.diseaseId OR 
  d.id = e.diseaseId) AND 
  e.targetId = 'ENSG00000164588' AND
  e.datasourceId = 'europepmc'

Muhamed · 1 February 2025 08:47

Gotcha! Thanks for the comprehensive reply Daniel:)

Muhamed · 13 February 2025 19:12

Hey @dsuveges,

I tried to replicate the process also for ChEMBL data source, but it doesn’t works…
The query should return 46 resources, but it return an empty object, do you know why?

WITH DiseaseIds AS (
  SELECT
    id AS parentDiseaseId,
    descendant.element AS diseaseId
  FROM
    `bigquery-public-data.open_targets_platform.diseases`,
    UNNEST(descendants.list) AS descendant
  WHERE
    id = 'EFO_0000729'
)

SELECT
    d.parentDiseaseId,
    e.diseaseId,
    e.targetId,
    e.datatypeId, 
    e.sourceId,
    e.score, 
    e.clinicalPhase, 
    e.clinicalStatus, 
    e.diseaseFromSource, 
    e.diseaseFromSourceMappedId, 
    e.drugId, 
    e.studyId, 
    e.studyStartDate, 
    e.targetFromSource, 
    e.targetFromSourceId, 
    e.directionOnTrait,
    ARRAY_AGG(urls_list.element.url) AS urls
FROM `bigquery-public-data.open_targets_platform.evidence` e
JOIN DiseaseIds d
ON e.diseaseId = d.diseaseId
LEFT JOIN UNNEST(e.urls.list) AS urls_list 
WHERE e.targetId = 'ENSG00000139626'
  AND e.datasourceId = 'chembl'
GROUP BY 
    d.parentDiseaseId, e.diseaseId, e.targetId, e.datasourceId, e.datatypeId, e.sourceId, e.score, 
    e.clinicalPhase, e.clinicalStatus, e.diseaseFromSource, e.diseaseFromSourceMappedId, 
    e.drugId, e.studyId, e.studyStartDate, e.targetFromSource, e.targetFromSourceId, e.directionOnTrait
ORDER BY e.score DESC;

dsuveges · 13 February 2025 20:57

Hi, I’m afraid your query is not right. If you look at the exploded diseases:

WITH DiseaseIds AS (
  SELECT
    id AS parentDiseaseId,
    descendant.element AS diseaseId
  FROM
    `bigquery-public-data.open_targets_platform.diseases`,
    UNNEST(descendants.list) AS descendant
  WHERE
    id = 'EFO_0000729'
)

select * from DiseaseIds

You get a table with two columns and two rows:

[{
  "parentDiseaseId": "EFO_0000729",
  "diseaseId": "EFO_0005623"
}, {
  "parentDiseaseId": "EFO_0000729",
  "diseaseId": "EFO_0005626"
}]

If you want all direct and indirect evidence for the parent disease, it is not enough to join by the child disease identifiers: JOIN DiseaseIds d ON e.diseaseId = d.diseaseId, you also need to take the parent id into account!

ON (e.diseaseId = d.diseaseId) OR (e.diseaseId = d.parentDiseaseId)

The modified query returns 46 rows, which is consistent with the UI. Please give it a go!

Muhamed · 14 February 2025 12:03

You’re right! Thanks for catching this:)

Topic		Replies	Views
Spurious indirect association/evidence via GraphQL API? Bug reports	6	442	19 January 2023
Getting literature evidence using the Open Targets Platform GraphQL API GraphQL API	4	446	24 March 2022
Retrieving >50 pathways from evidences GraphQL API	4	698	22 September 2022
Why are there some inconsistencies in evidence count? Technical Support	3	264	7 September 2023
Querying for associated diseases doesn't return full list GraphQL API	3	46	26 September 2024

Discrepancy in Evidence Counts Between GraphQL API and BigQuery for enableIndirect: true

Key Observations

Questions

Related topics