We have just released the latest update to the Open Targets Platform — 23.02.
Key highlights for this release:
This release integrates 14,611,717 evidence strings to build 6,960,486 target-disease associations between 22,274 diseases or phenotypes and 62,678 targets from the following 22 public resources:
- 2,177,595 genetic evidence from European Variation Archive (EVA)
- 782,147 genetic evidence from Open Targets Genetics
- 3,031 genetic evidence from Gene2Phenotype
- 31,995 genetic evidence from the Genomics England PanelApp
- 1,971 genetic evidence from ClinGen
- 6,254 genetic evidence from Orphanet
- 27,373 genetic evidence from Gene burden
- 4,151 genetic evidence from UniProt Literature
- 16,965 somatic evidence from European Variation Archive (EVA)
- 3,299 somatic evidence from intOGen
- 76,292 somatic evidence from the Cancer Gene Census
- 26,383 somatic evidence from UniProt
- 612,079 drug evidence from ChEMBL
- 230,903 expression evidence from Expression Atlas
- 10,413 affected pathway evidence from Reactome
- 72,294 affected pathway evidence from SLAPenrich
- 378 affected pathway evidence from PROGENy
- 390 systems biology evidence from SysBio
- 1,298 somatic evidence from the Cancer Genome Interpreter
- 1,838 CRISPR-Cas9 (Cancer Cell Lines) evidence from Behan et al. 2019
- 1,047,024 mouse model evidence from IMPC
- 5,300,042 scientific literature evidence from co-occurrence mining in Europe PMC
Additionally, the Platform now allows users to explore data on 12,854 drugs or compounds.
For more details, read the 23.02 blog post.
The evidence from Europe PMC appears to be about half that of the previous release, despite the addition of patents. What caused this change? It would also be useful to include the percentage change in evidence counts in each of these release notes.
The drop in evidence from Europe PMC is due to a known bug in the pipeline, which means we are not processing all the publications that we should be. We are actively working with Europe PMC to resolve this.
However, the drop in associations is less drastic, which suggests that we are not losing crucial evidence.
Thank you for the feedback about including percentage changes. We provide these metrics to give a sense of the amount of data and its distribution, but we prefer not to place too much emphasis on the raw numbers, since we value the quality of the data over its quantity. Evidence counts will fluctuate over time as our data sources and our processing change, particularly with a source like this one. In fact, we are currently introducing some pipeline changes that will cause the number of evidence strings to drop.
Out of curiosity, would you be willing to share how you use these metrics? Thank you!
Thanks, @hcornu. Can you provide a bit more detail on the number of publications not being processed? Should we continue to use the previous release for the Europe PMC data until this bug is fixed, or a union of the two releases?
I agree with your comment about quality over quantity, but I have found that quantity is important for QC purposes: I use it to check that nothing has changed in my postprocessing that would cause me to lose significant amounts of data.
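For what it's worth, the kind of count-based sanity check described above can be sketched as follows. This is my own illustration, not part of the Platform: the function name, the threshold, and the per-source counts in the demo are all made up for the example.

```python
def check_evidence_counts(previous, current, max_drop=0.5):
    """Flag data sources whose evidence count fell by more than
    max_drop (a fraction, e.g. 0.5 = 50%) between two releases."""
    flagged = {}
    for source, prev_count in previous.items():
        if prev_count == 0:
            continue  # avoid division by zero for empty sources
        curr_count = current.get(source, 0)
        change = (curr_count - prev_count) / prev_count
        if change < -max_drop:
            flagged[source] = round(change, 4)
    return flagged
```

Run against two releases' per-source totals, this flags any source that lost more than the allowed fraction of its evidence, which is exactly the "did my postprocessing silently lose data" signal described above.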
I noticed that one of the input files for evidence in the 23.02 release is BZIP2-compressed (evidence-files/atlas.json.bz2) while all the other files are GZIP-compressed. The platform-etl-backend scripts and reference.conf do not specify that it is BZIP2, so the pipeline fails to process this file: all evidence files are just "globbed up" together, I think, and this one ends up being loaded with the wrong codec.