Difference between direct and indirect links in data downloads

flo · 12 October 2021 15:06

Hi,

I downloaded the 21.09 JSON data for “associationByDatasourceDirect” as well as “associationByDatasourceIndirect”.

I then checked for the associations of CFTR (ENSG00000001626) in cystic fibrosis (Orphanet_586), both via the web (Open Targets Platform) as well as by grepping through the downloaded JSON for direct and indirect associations.

I observe that (a) the grepping gets exactly the same results for both direct and indirect associations, and (b) the webpage states 2409 entries from ClinVar whereas from both JSON downloads the respective number is only 2188.

For (a): should there not be a difference in the data for direct or indirect associations?
For (b): where did the other 221 genetic associations from ClinVar go?

The number of genetic associations from associationByDatatypeDirect/Indirect is 2289 - yet another number. The number of total associations is 6745.

The number of associations from associationByOverallDirect/Indirect is also 6745.

The number of associations on the web page is 6959, which is now 214 more than in the download.

The numbers don’t really add up, and even if they do in some way, it does not seem that this is easily reconstructed from the data downloads. (I have not tried the same from the parquet files, assuming that the parquet contains a copy of the JSON anyway.)

Any insights would be greatly appreciated.

ahercules · 14 October 2021 19:29

Hi @flo!

To answer your questions:

(a) For the CFTR and cystic fibrosis association, there is no difference between the direct and indirect data because cystic fibrosis does not have any child terms in our ontology (EFO). The only association data available is therefore the direct association data for CFTR and cystic fibrosis.
(b) I have opened ticket #1819 and asked our data team to investigate the difference in the evidenceCount field. In the meantime, please use our evidence dataset and filter by a specific datasource (e.g. eva). This will return the exact number of evidence strings seen in the web interface.
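
A minimal sketch of that filter in plain Python, assuming the evidence dataset has been downloaded as JSON-lines files and that each record carries `targetId`, `diseaseId` and `datasourceId` fields (the field names are an assumption here; please check them against the release you downloaded). The `count_evidence` helper and the sample records are made up for illustration:

```python
import json
from typing import Iterable


def count_evidence(lines: Iterable[str], target_id: str,
                   disease_id: str, datasource_id: str) -> int:
    """Count evidence records for one target, disease and datasource."""
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Field names are assumed, not taken from the thread
        if (record.get("targetId") == target_id
                and record.get("diseaseId") == disease_id
                and record.get("datasourceId") == datasource_id):
            count += 1
    return count


# Tiny made-up sample standing in for the real evidence files
sample = [
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "datasourceId": "eva"}',
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "datasourceId": "eva"}',
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "datasourceId": "other_source"}',
    '{"targetId": "ENSG00000001626", "diseaseId": "OTHER_DISEASE", "datasourceId": "eva"}',
]

eva_count = count_evidence(sample, "ENSG00000001626", "Orphanet_586", "eva")
print(eva_count)
```

For the real download, the same function could be called once per part file with an open file handle and the results summed.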

github.com/opentargets/platform

Investigate differences in `evidenceCount` field in associations datasets

opened 07:26PM - 14 Oct 21 UTC

andrewhercules

Kind: Bug Kind: Data

### Describe the bug

Based on a [Community post](https://community.opentarget…s.org/t/difference-between-direct-and-indirect-links-in-data-downloads/376), it appears that the `evidenceCount` values in our `associationByDatasource` and `associationByDatatype` datasets are incorrect. Both the `direct` and `indirect` versions are affected by this issue.

### Observed behaviour

When parsing and querying the `associationByDatasourceDirect` and `associationByDatasourceIndirect` datasets, the association for CFTR (ENSG00000001626) and cystic fibrosis (Orphanet_586) returns an evidence count of `2188` for the `eva` datasource. However, on the [CFTR and cystic fibrosis evidence page](https://platform.opentargets.org/evidence/ENSG00000001626/Orphanet_586), there are `2409` evidence strings from `eva`. And so `221` records are missing from the `evidenceCount` field.

When parsing and querying the `associationByDatatypeDirect` and `associationByDatatypeIndirect` datasets, the same association returns an evidence count of `2289` for the `genetic_associations` datatype. However, this should be `2510`. And so it appears the same 221 `eva` records are also missing from the `evidenceCount` field in the datatype datasets.

### Expected behaviour

The `evidenceCount` values for a given datasource or datatype should match what is displayed in the web interface.

### Additional context

The CFTR and cystic fibrosis association is noteworthy because cystic fibrosis does not have any child terms and so we should expect that the direct and indirect counts are the same.

Please let me know if this has answered your question.

Thank you!

Cheers,

Andrew

flo · 15 October 2021 10:45

Hi @ahercules,

Thanks a lot for the explanation.

In the case of CFTR when the direct and indirect associations are all the same, I could either conclude that all of them are direct or that all of them are indirect - right?

What would be your suggested way then to identify which associations are direct and indirect? At present and with the JSON files given, the only way would be to take a set difference between all associations in the two files (most likely involving calculating some kind of hash of each line in the JSON files).

irene · 15 October 2021 18:51

Hello @flo!

Answering your first question: for this association, the results are identical between the indirect and direct datasets not because of the target, but because of the disease.

The indirect associations are the product of expanding the ontology to the ancestors of a given disease. For instance, you can see that SLAPEnrich lists 10 pieces of evidence for the association between adenocarcinoma and DAXX, although the diseases actually reported are all more granular terms such as lung adenocarcinoma, a child term. Here the direct evidence on the child terms is being expanded to their ancestors.
Coming back to your example, the data in both datasets are identical because:

ClinVar has direct evidence of an association between cystic fibrosis and CFTR;
cystic fibrosis does not have child terms, therefore no evidence from children is being expanded.
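
To make the expansion concrete, here is a toy sketch in plain Python. It is not the Platform's actual pipeline: the `PARENTS` mapping is a hypothetical mini-ontology (the "genetic disease" and "carcinoma" parents are invented for the example), and it only propagates evidence counts, not scores.

```python
from collections import defaultdict

# Hypothetical toy ontology: term -> list of direct parent terms
PARENTS = {
    "lung adenocarcinoma": ["adenocarcinoma"],
    "adenocarcinoma": ["carcinoma"],
    "carcinoma": [],
    "cystic fibrosis": ["genetic disease"],
    "genetic disease": [],
}


def ancestors(term, parents):
    """Return every ancestor of a term (parents, grandparents, ...)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def expand(direct, parents):
    """Propagate direct evidence counts up to all ancestor diseases."""
    indirect = defaultdict(int)
    for (target, disease), count in direct.items():
        indirect[(target, disease)] += count  # direct evidence is kept
        for ancestor in ancestors(disease, parents):
            indirect[(target, ancestor)] += count  # expanded evidence
    return dict(indirect)


# Illustrative counts only
direct = {
    ("DAXX", "lung adenocarcinoma"): 10,
    ("CFTR", "cystic fibrosis"): 2409,
}
indirect = expand(direct, PARENTS)

for key in sorted(indirect):
    print(key, indirect[key], "direct" if key in direct else "expanded only")
```

In this toy, DAXX and adenocarcinoma appear only in the indirect result, while CFTR and cystic fibrosis have the same count in both, because no child term feeds evidence into cystic fibrosis.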

If you want to distinguish the direct associations from those which are the result of expanding the ontology, you are right that the most straightforward way would be to take the difference between the indirect associations (direct + expanded) and the direct ones.
It would be very simple in PySpark:

# direct_assocs and indirect_assocs are DataFrames loaded from the two datasets
# We want to join the datasets using all columns
all_columns = direct_assocs.columns

expanded_assocs = (
    indirect_assocs.join(direct_assocs, on=all_columns, how='left_anti')
    .distinct()
)
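
For readers without a Spark setup, the same anti-join idea can be sketched in plain Python over the JSON-lines files. This is only a sketch under assumptions: the `canonical` and `expanded_only` helpers are made up for illustration, the sample records and their fields are invented, and everything is held in memory, which may not suit the full downloads.

```python
import json
from typing import Iterable, List


def canonical(line: str) -> str:
    """Serialise a JSON record with sorted keys so key order does not matter."""
    return json.dumps(json.loads(line), sort_keys=True)


def expanded_only(indirect_lines: Iterable[str],
                  direct_lines: Iterable[str]) -> List[dict]:
    """Return distinct indirect records with no identical direct record."""
    direct_keys = {canonical(l) for l in direct_lines if l.strip()}
    seen = set()
    result = []
    for line in indirect_lines:
        if not line.strip():
            continue
        key = canonical(line)
        if key not in direct_keys and key not in seen:
            seen.add(key)
            result.append(json.loads(key))
    return result


# Made-up sample records; TARGET_X and DISEASE_PARENT are placeholders
direct_sample = [
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "score": 0.9}',
]
indirect_sample = [
    '{"diseaseId": "Orphanet_586", "targetId": "ENSG00000001626", "score": 0.9}',
    '{"targetId": "TARGET_X", "diseaseId": "DISEASE_PARENT", "score": 0.4}',
    '{"targetId": "TARGET_X", "diseaseId": "DISEASE_PARENT", "score": 0.4}',
]

expanded = expanded_only(indirect_sample, direct_sample)
print(expanded)
```

As with the join on all columns, a record counts as matching only if every field is identical, so an association whose score or count changes after expansion would also be reported here.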

I hope this helps! You can read more about the disease ontology expansion in our documentation.

Best,
Irene
