I downloaded the 21.09 JSON data for “associationByDatasourceDirect” as well as “associationByDatasourceIndirect”.
I then checked the associations of CFTR (ENSG00000001626) with cystic fibrosis (Orphanet_586), both on the web (Open Targets Platform) and by grepping through the downloaded JSON for direct and indirect associations.
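For reference, the grep step amounts to roughly the following. This is only a minimal sketch: it assumes each download is a directory of JSON-lines part files, and that the records carry the fields `targetId`, `diseaseId`, `datasourceId` and `evidenceCount`. Those field names and the helper names are my assumptions and may need adjusting to the actual schema.

```python
import json
from pathlib import Path

TARGET_ID = "ENSG00000001626"  # CFTR
DISEASE_ID = "Orphanet_586"    # cystic fibrosis


def load_pair(dataset_dir, target_id=TARGET_ID, disease_id=DISEASE_ID):
    """Collect all records for one target-disease pair.

    Assumes dataset_dir holds JSON-lines files (one JSON object per line)
    and that each record has 'targetId' and 'diseaseId' keys (assumed names).
    """
    records = []
    for part in sorted(Path(dataset_dir).glob("*.json")):
        with part.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                row = json.loads(line)
                if row.get("targetId") == target_id and row.get("diseaseId") == disease_id:
                    records.append(row)
    return records


def evidence_by_datasource(records):
    """Map datasourceId -> summed evidenceCount (both field names are assumed)."""
    counts = {}
    for row in records:
        key = row.get("datasourceId")
        counts[key] = counts.get(key, 0) + int(row.get("evidenceCount", 0))
    return counts
```

Calling `evidence_by_datasource(load_pair(...))` once on the direct directory and once on the indirect one, and comparing the two dictionaries, is the comparison I mean when I talk about direct versus indirect results.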
I observe that (a) grepping returns exactly the same results for the direct and the indirect associations, and (b) the web page reports 2409 entries from ClinVar, whereas both JSON downloads give only 2188.
For (a): should there not be a difference in the data for direct or indirect associations?
For (b): where did the other 221 genetic associations from ClinVar go?
The number of genetic associations from associationByDatatypeDirect/Indirect is 2289, which is yet another number. The total number of associations is 6745.
The number of associations from associationByOverallDirect/Indirect is also 6745.
The number of associations on the web page is 6959, which is 214 more than in the download.
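To make the mismatch explicit, here is a small tally of the figures quoted above. The dictionary keys are just labels I chose for this illustration; the numbers are the ones reported in this post.

```python
# Counts as read from the web page and from the JSON downloads (figures from this post).
web = {"clinvar": 2409, "total": 6959}
download = {"clinvar": 2188, "total": 6745}
by_datatype_genetic = 2289  # genetic associations in associationByDatatypeDirect/Indirect

# Difference between web and download for each label.
gaps = {key: web[key] - download[key] for key in web}
print(gaps)  # {'clinvar': 221, 'total': 214}

# Genetic-association count by datatype versus the ClinVar count by datasource.
print(by_datatype_genetic - download["clinvar"])  # 101
```

The ClinVar gap (221) and the overall gap (214) are not the same number.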
The numbers don’t really add up, and even if they do in some way, this does not seem easy to reconstruct from the data downloads. (I have not tried the same with the parquet files, on the assumption that the parquet contains a copy of the JSON anyway.)
Any insights would be greatly appreciated.