Hello @njeanray!
Welcome to the Open Targets Community!
We are currently adjusting how we generate our datasets, which is what causes the difference in the scores. As noted in a previous Community post, both scores are correct and valid, but the algorithm harmonisation strategy differs because of the on-the-fly scoring feature available in our API. We have opened a GitHub ticket (#1627) and will update the Community once a new set of datasets is available.
It is also important to note that when you view a page of diseases associated with a target (e.g. SOD2), we show direct association scores. However, when you view a page of targets associated with a disease (e.g. epilepsy), we show indirect association scores. The evidence page shows both direct and indirect evidence for the target-disease association (e.g. SOD2 and epilepsy where the PhenoDigm evidence is for child terms of epilepsy). For more information, please see our web interface documentation.
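To make the direct/indirect distinction concrete, here is a minimal sketch in plain Python. It is a toy illustration only, not how the Platform computes scores: the ontology, the child term names and the evidence records are all made up, and it only shows which evidence is collected in each case.

```python
# Hypothetical toy ontology: each term maps to its direct child terms.
CHILDREN = {
    "epilepsy": ["child_term_a", "child_term_b"],
    "child_term_a": [],
    "child_term_b": [],
}

# Hypothetical evidence records: (target, disease term, data source).
# Both records are annotated to child terms, not to "epilepsy" itself.
EVIDENCE = [
    ("SOD2", "child_term_a", "phenodigm"),
    ("SOD2", "child_term_b", "phenodigm"),
]


def descendants(term):
    """Return the term itself plus every term below it in the toy ontology."""
    found = [term]
    for child in CHILDREN.get(term, []):
        found.extend(descendants(child))
    return found


def direct_evidence(target, term):
    """Evidence annotated to exactly this disease term."""
    return [e for e in EVIDENCE if e[0] == target and e[1] == term]


def indirect_evidence(target, term):
    """Evidence annotated to this disease term or any of its descendants."""
    terms = set(descendants(term))
    return [e for e in EVIDENCE if e[0] == target and e[1] in terms]
```

In this toy setup, `direct_evidence("SOD2", "epilepsy")` finds nothing, while `indirect_evidence("SOD2", "epilepsy")` picks up both records from the child terms. That is the same pattern as the SOD2 and epilepsy example above.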
As for the data file format, it really is a matter of personal preference, as both formats contain the same information. Personally, I prefer the Parquet files because I have PySpark set up to read in and query our datasets. For R users, sparklyr provides a similar interface.
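If it helps, this is roughly what reading the Parquet files with PySpark looks like. Treat it as a sketch under assumptions: the function name is my own, the directory path is whatever you downloaded, and the column names `targetId` and `diseaseId` should be checked against the schema of the dataset you are using.

```python
def load_association_rows(parquet_dir, target_id, disease_id):
    """Read a downloaded Parquet dataset and keep one target-disease pair.

    parquet_dir is a local directory of Parquet files (your own path).
    The column names used in the filter are assumptions.
    """
    # Imported here so the function can be defined without PySpark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("open-targets-example").getOrCreate()

    # Spark reads every Parquet part file found in the directory.
    df = spark.read.parquet(parquet_dir)

    return df.filter(
        (F.col("targetId") == target_id) & (F.col("diseaseId") == disease_id)
    )
```

You would call it with the path to the dataset directory plus the target and disease identifiers you are interested in, then use `.show()` or `.toPandas()` on the result.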
We also have a Google BigQuery instance where you can run SQL queries and export the results in JSON or CSV format or save to your own Google Cloud bucket. All of our datasets can be queried. Below are some sample queries:
- Find all of the individual data source scores for a specific target-disease association
- Find all of the known drugs data for a specific disease/phenotype
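As a rough idea of the shape of these two queries, here they are as SQL strings held in Python. The table names and column names are my assumptions and may not match the current BigQuery schema, so please check them in the BigQuery console first. The `@target_id` and `@disease_id` placeholders are BigQuery named query parameters that you supply when you run the query.

```python
# Assumed table and column names: verify against the BigQuery schema.
DATASOURCE_SCORES_SQL = """
SELECT datasourceId, score
FROM `open-targets-prod.platform.associationByDatasourceDirect`
WHERE targetId = @target_id
  AND diseaseId = @disease_id
ORDER BY score DESC
"""

# Assumed table name: verify against the BigQuery schema.
KNOWN_DRUGS_SQL = """
SELECT *
FROM `open-targets-prod.platform.knownDrugsAggregated`
WHERE diseaseId = @disease_id
"""


def required_parameters(sql):
    """List the named parameters (words starting with @) used in a query."""
    names = []
    for word in sql.replace("\n", " ").split(" "):
        if word.startswith("@") and word not in names:
            names.append(word)
    return names
```

The small helper is only there as a sanity check on which parameters each query expects before you run it.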
I hope this helps. Please check back regularly, as we will update the Community when the new datasets are available!