Hi,
I’m encountering a discrepancy between the evidence retrieved via the Open Targets GraphQL API and the BigQuery dataset when querying evidence for a specific disease-target pair with enableIndirect: true
.
- GraphQL API:
- Using the following query with
enableIndirect: true
, I retrieve evidence for 28 papers associated with the disease (EFO_0005917
) and the target (ENSG00000164588
).
query EuropePMCQuery(
$ensemblId: String!
$efoId: String!
$size: Int!
$cursor: String
) {
disease(efoId: $efoId) {
id
europePmc: evidences(
ensemblIds: [$ensemblId]
enableIndirect: true
size: $size
datasourceIds: ["europepmc"]
cursor: $cursor
) {
count
cursor
rows {
disease {
name
id
}
target {
approvedSymbol
id
}
literature
resourceScore
}
}
}
}
This query provides evidence for both direct and indirect associations (through disease ontology descendants).
- BigQuery:
I attempted to replicate this query in BigQuery using the open_targets_platform
dataset. The following query was used:
SELECT
targetId,
diseaseId,
ARRAY_AGG(literature_element) AS aggregated_literature,
COUNT(*) AS evidence_count
FROM
`bigquery-public-data.open_targets_platform.evidence`,
UNNEST(literature.list) AS literature_element
WHERE
targetId = "ENSG00000164588"
AND diseaseId = "EFO_0005917"
GROUP BY
targetId, diseaseId
This query only returns 7 pieces of evidence, despite accounting for all literature in the evidence.literature.list
field.
Key Observations
- The GraphQL API retrieves 28 pieces of evidence with
enableIndirect: true
.
- BigQuery results are limited to 7 pieces of evidence, even when aggregating all available literature for the disease-target pair.
Questions
- How does the GraphQL API handle
enableIndirect: true
? Does it include evidence from indirect associations (e.g., descendant diseases) not captured in the BigQuery dataset?
- Is there a way to replicate the behavior of
enableIndirect: true
in BigQuery?
- Are there known differences between the GraphQL API and BigQuery datasets that might explain this discrepancy?
Thank you for your help in clarifying this issue.
Muhamed
Hi Muhamed,
- When the GraphQL API returns evidence data, it either looks direct connection between a given disease and a target or it can retrieve all evidence between all descendants of the disease. Eg. if you have direct evidence between gene X and breast cancer, X, indirectly is also an evidence for cancer (because breast cancer is a descendant of the more general disease term, cancer). Evidence in the evidence dataset is always direct. To retrieve indirect evidence you need look up all descendant disease terms for your disease of interest in the disease table.
- Yes, you can modify the query, but that requires an other table describing the relationship between diseases. An example query is here: [1]
- The content available via GraphQL is expected to be identical that you can access via BigQuery, however the shape of datasets on GraphQL are more optimised for UI display and better suited for UI based use cases. There can also be delays in refreshing BigQuery datasets in the release process, but that should be marginal.
If there’s anything else to clarify, please let us know!
[1]
WITH exploded_diseases AS (
SELECT
id,
descendant.element AS diseaseId
FROM
`bigquery-public-data.open_targets_platform.diseases`,
UNNEST(descendants.list) AS descendant
WHERE
id = 'EFO_0005917'
)
SELECT DISTINCT
e.targetId,
e.diseaseId,
e.literature.list
FROM
exploded_diseases d,
`bigquery-public-data.open_targets_platform.evidence` e
WHERE
(d.diseaseId = e.diseaseId OR
d.id = e.diseaseId) AND
e.targetId = 'ENSG00000164588' AND
e.datasourceId = 'europepmc'
1 Like
Gotcha! Thanks for the comprehensive reply Daniel:)
Hey @dsuveges,
I tried to replicate the process also for ChEMBL data source, but it doesn’t works…
The query should return 46 resources, but it return an empty object, do you know why?
WITH DiseaseIds AS (
SELECT
id AS parentDiseaseId,
descendant.element AS diseaseId
FROM
`bigquery-public-data.open_targets_platform.diseases`,
UNNEST(descendants.list) AS descendant
WHERE
id = 'EFO_0000729'
)
SELECT
d.parentDiseaseId,
e.diseaseId,
e.targetId,
e.datatypeId,
e.sourceId,
e.score,
e.clinicalPhase,
e.clinicalStatus,
e.diseaseFromSource,
e.diseaseFromSourceMappedId,
e.drugId,
e.studyId,
e.studyStartDate,
e.targetFromSource,
e.targetFromSourceId,
e.directionOnTrait,
ARRAY_AGG(urls_list.element.url) AS urls
FROM `bigquery-public-data.open_targets_platform.evidence` e
JOIN DiseaseIds d
ON e.diseaseId = d.diseaseId
LEFT JOIN UNNEST(e.urls.list) AS urls_list
WHERE e.targetId = 'ENSG00000139626'
AND e.datasourceId = 'chembl'
GROUP BY
d.parentDiseaseId, e.diseaseId, e.targetId, e.datasourceId, e.datatypeId, e.sourceId, e.score,
e.clinicalPhase, e.clinicalStatus, e.diseaseFromSource, e.diseaseFromSourceMappedId,
e.drugId, e.studyId, e.studyStartDate, e.targetFromSource, e.targetFromSourceId, e.directionOnTrait
ORDER BY e.score DESC;
Hi, I’m afraid your query is not right. If you look at the exploded diseases:
WITH DiseaseIds AS (
SELECT
id AS parentDiseaseId,
descendant.element AS diseaseId
FROM
`bigquery-public-data.open_targets_platform.diseases`,
UNNEST(descendants.list) AS descendant
WHERE
id = 'EFO_0000729'
)
select * from DiseaseIds
You get a table with two columns and two rows:
[{
"parentDiseaseId": "EFO_0000729",
"diseaseId": "EFO_0005623"
}, {
"parentDiseaseId": "EFO_0000729",
"diseaseId": "EFO_0005626"
}]
If you want all direct and indirect evidence for the parent disease, it is not enough to join by the child disease identifiers: JOIN DiseaseIds d ON e.diseaseId = d.diseaseId
, you also need to take the parent id into account!
ON (e.diseaseId = d.diseaseId) OR (e.diseaseId = d.parentDiseaseId)
The modified query returns 46 rows, which is consistent with the UI. Please give it a go!
1 Like
You’re right! Thanks for catching this:)