Difference between direct and indirect links in data downloads

Hi,

I downloaded the 21.09 JSON data for “associationByDatasourceDirect” as well as “associationByDatasourceIndirect”.

I then checked for the associations of CFTR (ENSG00000001626) in cystic fibrosis (Orphanet_586), both via the web (Open Targets Platform) as well as by grepping through the downloaded JSON for direct and indirect associations.

I observe that (a) grep returns exactly the same results for both direct and indirect associations, and (b) the webpage reports 2409 entries from ClinVar, whereas both JSON downloads give only 2188.

For (a): should there not be a difference in the data for direct or indirect associations?
For (b): where did the other 221 genetic associations from ClinVar go?

The number of genetic associations from associationByDatatypeDirect/Indirect is 2289 - yet another number. The number of total associations is 6745.

The number of associations from associationByOverallDirect/Indirect is also 6745.

The number of associations on the web page is 6959, which is now 214 more than in the download.

The numbers don’t really add up, and even if they do in some way, it does not seem that this is easily reconstructed from the data downloads. (I have not tried the same from the parquet files, assuming that the parquet contains a copy of the JSON anyway.)

Any insights would be greatly appreciated.

Hi @flo! :wave:

To answer your questions:

  1. For the CFTR and cystic fibrosis association, there is no difference between the direct and indirect association data because cystic fibrosis does not have any child terms in our ontology (EFO). The only association data available is therefore the direct association data for CFTR and cystic fibrosis.

  2. I have opened a ticket #1819 and asked our data team to investigate the difference in the evidenceCount field. In the meantime, please use our evidence dataset and filter by a specific datasource (e.g. eva). This will return the exact number of evidence strings seen in the web interface.
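The filter suggested in point 2 could be sketched roughly as below. This is a minimal plain-Python sketch, not an official recipe: it assumes the evidence dataset is read as JSON lines and that each record carries `targetId`, `diseaseId` and `datasourceId` fields. The function name `count_evidence` and the toy records are hypothetical.

```python
import json
from typing import Iterable


def count_evidence(
    lines: Iterable[str],
    target_id: str,
    disease_id: str,
    datasource_id: str,
) -> int:
    """Count evidence records for one target, disease and datasource.

    Assumes one JSON object per line with targetId, diseaseId and
    datasourceId keys (an assumption about the evidence schema).
    """
    count = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if (
            record.get("targetId") == target_id
            and record.get("diseaseId") == disease_id
            and record.get("datasourceId") == datasource_id
        ):
            count += 1
    return count


# Toy records for illustration only; real data would be read from the
# downloaded evidence files, e.g. by iterating over an open file handle.
sample_lines = [
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "datasourceId": "eva"}',
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "datasourceId": "eva"}',
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "datasourceId": "other_source"}',
]

eva_count = count_evidence(sample_lines, "ENSG00000001626", "Orphanet_586", "eva")
print(eva_count)
```

With real files, the same function can be fed the lines of each downloaded part file in turn and the counts summed.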

Please let me know if this has answered your question.

Thank you! :slight_smile:

Cheers,

Andrew

Hi @ahercules,

Thanks a lot for the explanation.

In the case of CFTR when the direct and indirect associations are all the same, I could either conclude that all of them are direct or that all of them are indirect - right?

What would be your suggested way then to identify which associations are direct and indirect? At present and with the JSON files given, the only way would be to take a set difference between all associations in the two files (most likely involving calculating some kind of hash of each line in the JSON files).

Hello @flo!

Answering your first question: for this association, the results are identical between the indirect and direct datasets not because of the target, but because of the disease.

The indirect associations are the product of expanding the ontology to the ancestors of a given disease. For instance, you can see that SLAPEnrich lists 10 pieces of evidence for the association between adenocarcinoma and DAXX, although the diseases actually reported are all more granular terms such as lung adenocarcinoma, a child term. Here the direct evidence is being expanded to the ancestors.

Coming back to your example, the data in both datasets are identical because:

  • ClinVar has direct evidence of an association between cystic fibrosis and CFTR;
  • cystic fibrosis does not have child terms, so no evidence from children is being expanded to it.
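To make the expansion concrete, here is a toy sketch in plain Python. It is not the Platform's pipeline: the ontology fragment is hypothetical (only "lung adenocarcinoma is a child of adenocarcinoma" comes from the example above; the `carcinoma` and `genetic disorder` parents are made up), and `ancestors` and `expand` are illustrative helper names.

```python
from collections import defaultdict

# Hypothetical ontology fragment: child term -> list of parent terms.
PARENTS = {
    "lung adenocarcinoma": ["adenocarcinoma"],
    "adenocarcinoma": ["carcinoma"],
    "carcinoma": [],
    "cystic fibrosis": ["genetic disorder"],
    "genetic disorder": [],
}


def ancestors(term, parents):
    """Return every ancestor of `term` by walking the parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def expand(evidence, parents):
    """Build direct and indirect evidence counts per (target, disease).

    Direct counts use the disease exactly as reported; indirect counts
    also credit each piece of evidence to every ancestor of that disease.
    """
    direct = defaultdict(int)
    indirect = defaultdict(int)
    for target, disease in evidence:
        direct[(target, disease)] += 1
        indirect[(target, disease)] += 1
        for ancestor in ancestors(disease, parents):
            indirect[(target, ancestor)] += 1
    return dict(direct), dict(indirect)


# Toy evidence: each tuple is one piece of evidence (target, reported disease).
toy_evidence = [
    ("DAXX", "lung adenocarcinoma"),
    ("DAXX", "lung adenocarcinoma"),
    ("CFTR", "cystic fibrosis"),
]

direct_counts, indirect_counts = expand(toy_evidence, PARENTS)
```

In this toy data, `("DAXX", "adenocarcinoma")` appears only in `indirect_counts`, because its evidence was reported on a child term, whereas `("CFTR", "cystic fibrosis")` has the same count in both dictionaries, because no child term contributes evidence to it.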

If you want to distinguish the direct associations from those that result from expanding the ontology, you are right that the most straightforward way is to take the difference between the indirect associations (direct + expanded) and the direct ones.
This is simple to do in PySpark:

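# direct_assocs and indirect_assocs are DataFrames already loaded from the
# direct and indirect association datasets.
# We want to join the datasets using all columns
all_columns = direct_assocs.columns

# left_anti keeps only the indirect rows with no exact match in direct
expanded_assocs = (
    indirect_assocs.join(direct_assocs, on=all_columns, how='left_anti')
    .distinct()
)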
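One point to keep in mind: a comparison on all columns also returns a pair that is present in both datasets whenever its score or evidence count differs between them. If only the pairs that exist solely through expansion are of interest, comparing on the identifier columns alone is an alternative. Below is a minimal plain-Python sketch of that idea, assuming JSON-lines input with `targetId` and `diseaseId` fields; the helper names and the `TOY_*` identifiers are hypothetical.

```python
import json


def pair_keys(lines, key_fields=("targetId", "diseaseId")):
    """Collect the identifier tuple of every JSON-lines record."""
    keys = set()
    for line in lines:
        line = line.strip()
        if line:
            record = json.loads(line)
            keys.add(tuple(record[field] for field in key_fields))
    return keys


def expansion_only_pairs(direct_lines, indirect_lines,
                         key_fields=("targetId", "diseaseId")):
    """Pairs found in the indirect dataset but absent from the direct one."""
    return (pair_keys(indirect_lines, key_fields)
            - pair_keys(direct_lines, key_fields))


# Toy records for illustration only (scores and TOY_* ids are made up).
toy_direct = [
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "score": 0.9}',
    '{"targetId": "TOY_TARGET", "diseaseId": "TOY_CHILD", "score": 0.4}',
]
toy_indirect = [
    '{"targetId": "ENSG00000001626", "diseaseId": "Orphanet_586", "score": 0.9}',
    '{"targetId": "TOY_TARGET", "diseaseId": "TOY_CHILD", "score": 0.4}',
    '{"targetId": "TOY_TARGET", "diseaseId": "TOY_PARENT", "score": 0.4}',
]

only_expanded = expansion_only_pairs(toy_direct, toy_indirect)
```

For the per-datasource files, `key_fields` would presumably need to include the datasource column as well, e.g. `("targetId", "diseaseId", "datasourceId")`.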

I hope this helps! You can read more about the disease ontology expansion in our documentation.

Best,
Irene