Hello Open Targets,

My main goal is to understand how the association scores shown on the website are calculated with the harmonic sum, so that I can reproduce them from scratch.

First, I am interested in the association between myeloproliferative disorder (EFO: EFO_0004251) and JAK2 (ENSG00000096968). Their association scores, as shown on the website, are:

Figure1:

Specifically, the datatype score for “Somatic mutations” is 0.9099111064831517, which is the value I want to reproduce myself from the individual evidence scores. I used BigQuery to extract the necessary information.

First, I got the following result with the SQL statement below:

Figure2:

SELECT
  diseaseId,
  targetId,
  datatypeId,
  datasourceId,
  score,
  evidenceCount
FROM
  `open-targets-prod.platform.associationByDatasourceIndirect`
WHERE
  diseaseId = 'EFO_0004251'
  AND targetId = 'ENSG00000096968'
  AND datatypeId = 'somatic_mutation'

Next, I retrieved the evidence records and their scores, shown in the figure and SQL query below:

Figure3:

SELECT
  datasourceId,
  targetId,
  datatypeId,
  studyId,
  diseaseId,
  sourceId,
  score
FROM
  `open-targets-prod.platform.evidence`
WHERE
  diseaseId = 'EFO_0004251'
  AND targetId = 'ENSG00000096968'
  AND datatypeId = 'somatic_mutation'

I found a mismatch between evidenceCount and the number of evidence records:

In Fig. 2, the evidenceCount values for eva_somatic and cancer_gene_census are 7 and 11, but there are only five records in Fig. 3. As a result, I cannot reproduce the somatic_mutation association score shown in Fig. 1 and Fig. 2.

Can you point out what I did wrong when querying the raw data from Google Cloud, or whether there is another reason for the discrepancy? Moreover, if weights need to be taken into account when calculating the harmonic sum, could you tell me the exact equations? Neither the journal paper nor the documentation provides these details.
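For reference, this is roughly the unweighted calculation I am trying to reproduce. It is only a sketch of my current understanding: scores are sorted in descending order and each is divided by the square of its rank. The normalisation by the limit of sum(1/i^2), the function names, and the example scores are my own assumptions (the scores are made up, not the values from Fig. 3), so please correct me if the platform does something different.

```python
import math


def harmonic_sum(scores):
    """Sort scores in descending order and sum score / rank**2 (rank starts at 1)."""
    ranked = sorted(scores, reverse=True)
    return sum(s / (i ** 2) for i, s in enumerate(ranked, start=1))


# Assumption: normalise by the limit of sum(1 / i**2), which is pi**2 / 6 (about 1.645).
MAX_HARMONIC_SUM = math.pi ** 2 / 6


def normalised_harmonic_sum(scores):
    """Harmonic sum scaled by the assumed theoretical maximum."""
    return harmonic_sum(scores) / MAX_HARMONIC_SUM


# Made-up evidence scores, for illustration only.
example_scores = [0.5, 1.0, 0.25]
print(harmonic_sum(example_scores))
print(normalised_harmonic_sum(example_scores))
```

If data source weights are involved, I assume each score would be multiplied by its weight before the sort, but I could not confirm this anywhere.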

Charlie