Hello OpenTargets team,
I’ve downloaded the parquet files to perform requests locally using an R script. However, I’m not able to find where the overall association score is located. There are many scores mentioned in different tables, but unfortunately I can’t find which one is the overall association score as described in the user interface. When I compare them with the ones shown in the application GUI, they are always different. Is this score calculated directly in the front-end application? If it is, where can I find this calculation in the sources?
Also, do you think it’s better to perform all these requests on the parquet files, or is it better to use the JSON format?
Many thanks in advance,
Welcome to the Open Targets Community!
We are currently working on adjusting how we generate our datasets, which causes a difference in the scores. As noted in a previous Community post, both scores are correct and valid but there are differences in the algorithm harmonisation strategy due to the on-the-fly scoring feature available in our API. We have opened a GitHub ticket (#1627) and will update the Community once we have a new set of datasets available.
It is also important to note that when you view a page of diseases associated with a target (e.g. SOD2), we show direct association scores. However, when you view a page of targets associated with a disease (e.g. epilepsy), we show indirect association scores. The evidence page shows both direct and indirect evidence for the target-disease association (e.g. SOD2 and epilepsy where the PhenoDigm evidence is for child terms of epilepsy). For more information, please see our web interface documentation.
As for the data file format, it really is a personal preference as both formats contain the same information. Personally, I prefer to use the Parquet files as I have PySpark set up to read in and query our datasets. There is an R equivalent, sparklyr, that provides a similar interface for R.
We also have a Google BigQuery instance where you can run SQL queries and export the results in JSON or CSV format, or save them to your own Google Cloud bucket. All of our datasets can be queried. Below are some sample queries:
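For instance, a query along these lines retrieves the top-scoring targets for a disease. Note that the dataset and table names below are illustrative assumptions; check the BigQuery console for the tables available in the current release:

```sql
-- Top targets associated with a disease, by association score
-- (dataset/table names are illustrative and vary by release)
SELECT targetId, diseaseId, score
FROM `open-targets-prod.platform.associationByOverallIndirect`
WHERE diseaseId = 'EFO_0000474'  -- example disease ID
ORDER BY score DESC
LIMIT 10;
```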
I hope this helps - and check back regularly as we will update the Community when the new datasets are available!
I just wanted to give you an update that our back-end team have fixed the association score files and they match what is now available in the UI.
You can find the files in our FTP in both JSON and Parquet formats. Alternatively, you can also access the data through our BigQuery instance, open-targets-prod.
Hello @ahercules !
Many thanks for this information, this is really great! I will have a look at these new datasets ASAP.
Also, I was looking for some kind of tutorial explaining how to deploy Open Targets locally. In fact, I’d like to test Open Targets by integrating other data sources, and I’d also like to be able to work with the score calculations. What would you advise in this case? I had a look on GitHub, but there are many modules and I’m not sure which ones to pick.
What do you think ?
We are still working on improvements to our ETL pipelines — specifically the step that aggregates and integrates target annotation data. For more information on our ongoing work in this area, see GitHub issues #1313 and #1641.
While we work on our ETL pipelines and complete the documentation needed to support local instances, please use our infrastructure documentation page to see which GitHub repos are used for the Platform.