Hello Open Targets Community,
I’d like to know where to find the harmonic sum calculations and, more generally, where the scores are calculated in the code. In which module can I find this information?
Thanks in advance,
The harmonic sum calculations for the overall, datasource, and datatype association scores can be found in the Association.scala module.
Individual pieces of evidence are scored in the evidence_datasource_parsers module or ETL configuration.
For example, the PhenoDigm evidence is scored in the PhenoDigm.py module. The resulting resource_score field is then used in the ETL configuration to calculate the association scores.
However, for a datasource like ClinGen, the individual pieces of evidence are scored in the ETL configuration file. The confidence string generated by the ClinGen module is mapped to a score there.
Note that the ETL configuration also sets the default weights and the datasource-specific weights.
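To make the two configuration steps above more concrete, here is a minimal Python sketch of mapping a confidence string to a score and then applying a datasource weight. Every name and number in it (CONFIDENCE_TO_SCORE, DATASOURCE_WEIGHTS, DEFAULT_WEIGHT, the example datasource ID, and all values) is a hypothetical placeholder rather than the Platform's actual configuration, so please check the ETL configuration file for the real mappings and weights.

```python
# Hypothetical placeholders only: these are NOT the real Platform values.
CONFIDENCE_TO_SCORE = {
    "Definitive": 1.0,
    "Moderate": 0.5,
    "Limited": 0.1,
}
DEFAULT_WEIGHT = 1.0
DATASOURCE_WEIGHTS = {"example_source_a": 0.5}  # hypothetical datasource ID


def score_from_confidence(confidence, default=0.0):
    """Map a confidence string to an evidence score (unknown strings get `default`)."""
    return CONFIDENCE_TO_SCORE.get(confidence, default)


def weighted_score(datasource_id, score):
    """Apply the datasource-specific weight, falling back to the default weight."""
    return DATASOURCE_WEIGHTS.get(datasource_id, DEFAULT_WEIGHT) * score


if __name__ == "__main__":
    evidence_score = score_from_confidence("Moderate")
    print(weighted_score("example_source_a", evidence_score))
```

The point of the sketch is only the shape of the lookup: a datasource without its own weight falls back to the default.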
Could you say a bit more about how the harmonic sum scores are calculated? I ask because the scores are sensitive to the order of elements in the score vector, and I’m curious where this ordering is defined. I’m having some trouble reproducing the overall association scores and wonder whether this is due to misordering. I’m using the ordering implied by the figure on this page (Target - disease associations - Open Targets Platform Documentation), with genomics_england corresponding to i=1 and phenodigm to i=22, but I’m not sure that is what’s intended.
Thanks in advance.
Unfortunately, I have recently left the Open Targets team, but will try and answer your question below. Feel free to tag the help desk team – @SirTarget – for further assistance.
Details on how associations are scored – overall, by data source, by data type – can be found in the Association.scala file that is part of the Platform ETL pipeline.
To understand how the individual evidence scores are prepared and sorted prior to scoring the association, please look at line 165 of the Association.scala file, where the prepareEvidences function uses Spark’s repartitionByRange and sortWithinPartitions functions to sort the evidence before returning the evidence set. This returned evidence set is then used in other functions to calculate the direct and indirect association scores on an overall, per data source, and per data type basis.
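As a rough illustration of why the sorting step matters, here is a small Python sketch of a harmonic sum. It assumes, as I understand the Platform documentation, that the scores are sorted in descending order and that the i-th score is divided by i squared. If that holds, the position index comes from the sorted score values and not from any fixed ordering of the data sources. The function names and the max_terms parameter are my own, and the normalising constant is an assumption, so please check Association.scala for the exact value the pipeline uses.

```python
def harmonic_sum(scores):
    """Sum of s_i / i^2 over the scores sorted in descending order (i starts at 1)."""
    ordered = sorted(scores, reverse=True)
    return sum(s / (i ** 2) for i, s in enumerate(ordered, start=1))


def normalised_harmonic_sum(scores, max_terms=1000):
    """Divide by the harmonic sum of `max_terms` scores that all equal 1.0.

    `max_terms` is an assumed value for illustration, not the pipeline's setting.
    """
    max_value = sum(1.0 / (i ** 2) for i in range(1, max_terms + 1))
    return harmonic_sum(scores) / max_value


if __name__ == "__main__":
    # The input order does not matter because the scores are sorted first.
    print(harmonic_sum([0.2, 0.9, 0.5]))
    print(harmonic_sum([0.9, 0.5, 0.2]))
    print(normalised_harmonic_sum([0.9, 0.5, 0.2]))
```

Under these assumptions, the same set of scores gives the same result whatever order it is supplied in, so a mismatch when reproducing scores is more likely to come from the weights or the normalisation than from the data source order in the figure.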
When reviewing Platform association scores, it is important to note that the association score is not a confidence assessment of the target-disease association, but rather an assessment of the availability of data for a given target-disease association.
I hope this helps answer your question and as I said, the Open Targets help desk team can provide further assistance – just tag their profile handle @SirTarget.