Difference between direct and indirect links in data downloads

Hello @flo!

Answering your first question: for this association, the results are identical between the indirect and direct datasets not because of the target, but because of the disease.

The indirect associations are the product of expanding the ontology: evidence reported for a disease is propagated to that disease's ancestors. For instance, you can see that SLAPEnrich lists 10 pieces of evidence for the association between adenocarcinoma and DAXX, although the diseases actually reported are all more granular terms such as lung adenocarcinoma, a child term. Here the direct evidence is being expanded to the ancestors.
Coming back to your example, the data in both datasets are identical because:

  • ClinVar has direct evidence of an association between cystic fibrosis and CFTR;
  • cystic fibrosis does not have child terms, so there is no evidence from children to expand.
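
In case a toy example helps, here is a minimal sketch of the expansion in plain Python. The ontology fragment, the evidence pairs and the helper names (PARENTS, DIRECT, ancestors, expand) are all made up for illustration; this is not how the pipeline is actually implemented, only the idea behind it.

```python
# Toy child -> parents map; a hypothetical fragment, not the real ontology.
PARENTS = {
    "lung adenocarcinoma": ["adenocarcinoma"],
    "adenocarcinoma": [],
    "cystic fibrosis": [],
}

# Direct evidence as (disease, target) pairs; illustrative only.
DIRECT = {
    ("lung adenocarcinoma", "DAXX"),
    ("cystic fibrosis", "CFTR"),
}


def ancestors(term, parents):
    """Return every ancestor of `term`, excluding the term itself."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def expand(direct, parents):
    """Indirect pairs: the direct pairs plus a copy on every ancestor."""
    indirect = set(direct)
    for disease, target in direct:
        for ancestor in ancestors(disease, parents):
            indirect.add((ancestor, target))
    return indirect


INDIRECT = expand(DIRECT, PARENTS)

# Pairs that exist only because of the expansion.
EXPANDED_ONLY = INDIRECT - DIRECT
```

In this toy setup the adenocarcinoma/DAXX pair only appears after expansion, whereas the cystic fibrosis/CFTR pair is the same in both sets because nothing sits below cystic fibrosis to be propagated upwards.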

If you want to distinguish the direct associations from those that result from expanding the ontology, you are right that the most straightforward way is to take the difference between the indirect associations (direct + expanded) and the direct ones.
This is very simple in PySpark:

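# direct_assocs and indirect_assocs are the two association DataFrames, already loaded.
# We want to join the datasets using all columns
all_columns = direct_assocs.columns

# The left anti join keeps the indirect rows that have no exact match in the direct dataset
expanded_assocs = (
    indirect_assocs.join(direct_assocs, on=all_columns, how='left_anti')
    .distinct()
)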

I hope this helps! You can read more about the disease ontology expansion in our documentation.

Best,
Irene