Hello Open Targets,
My main goal is to understand, from scratch, how the association scores shown on the website are calculated with the harmonic sum.
As an example, I am looking at the association between myeloproliferative disorder (EFO_0004251) and JAK2 (ENSG00000096968). Their association scores, as shown on the website, are:
Figure 1:
Specifically, the score for the "Somatic mutations" datatype is 0.9099111064831517, and this is the value I want to reproduce myself from the individual evidence scores. I used BigQuery to extract the necessary information.
First, I got the following result with the SQL statement below it:
Figure 2:
SELECT
  diseaseId,
  targetId,
  datatypeId,
  datasourceId,
  score,
  evidenceCount
FROM
  `open-targets-prod.platform.associationByDatasourceIndirect`
WHERE
  diseaseId = 'EFO_0004251'
  AND targetId = 'ENSG00000096968'
  AND datatypeId = 'somatic_mutation'
Next, I retrieved the individual evidence records and their scores, shown in the figure and SQL query below:
Figure 3:
SELECT
  datasourceId,
  targetId,
  datatypeId,
  studyId,
  diseaseId,
  sourceId,
  score
FROM
  `open-targets-prod.platform.evidence`
WHERE
  diseaseId = 'EFO_0004251'
  AND targetId = 'ENSG00000096968'
  AND datatypeId = 'somatic_mutation'
I found a mismatch between evidenceCount and the number of evidence records:
In Fig. 2, the evidenceCount values for eva_somatic and cancer_gene_census are 7 and 11 (18 in total), but Fig. 3 contains only five records. As a result, I cannot reproduce the somatic_mutation association score shown in Fig. 1 and Fig. 2.
Can you point out what I did wrong when querying the raw data from Google Cloud, or whether there is another reason for the mismatch? Moreover, if weights need to be taken into account when calculating the harmonic sum, can you tell me the exact equations? Neither the journal paper nor the documentation provides these details.
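To make the question concrete, here is a minimal Python sketch of how I currently understand the unweighted harmonic sum: sort the evidence scores in descending order and divide the i-th score by i squared. The function names are my own, no data source weights are applied, and the normalisation (dividing by the sum of 1/i^2 over the first 1000 terms and capping at 1) is an assumption on my part that I have not confirmed against the platform's pipeline.

```python
def harmonic_sum(scores):
    """Unweighted harmonic sum: i-th largest score divided by i squared."""
    ranked = sorted(scores, reverse=True)
    return sum(s / (i ** 2) for i, s in enumerate(ranked, start=1))


def normalised_harmonic_sum(scores, n_terms=1000):
    """Harmonic sum scaled by its assumed theoretical maximum.

    Assumption: the maximum is the harmonic sum of n_terms scores all
    equal to 1, and the result is capped at 1. n_terms=1000 is a guess.
    """
    max_hs = sum(1.0 / (i ** 2) for i in range(1, n_terms + 1))
    return min(harmonic_sum(scores) / max_hs, 1.0)
```

If this is roughly right, I would expect to pass the evidence scores from Fig. 3 to normalised_harmonic_sum and get the datasource scores in Fig. 2, which is what I cannot do with only five records.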
Charlie