The number of evidences from "open-targets-prod.platform.evidence" does not match "evidenceCount"

Hello Open Targets,

My main goal is to understand, from scratch, how the association scores shown on the website are calculated with the harmonic sum.

First, I am interested in the association between myeloproliferative disorder (EFO: EFO_0004251) and JAK2 (ENSG00000096968). Their association scores, as shown on the website, are:

Figure 1:

Specifically, the “Somatic mutations” data type score is 0.9099111064831517, which is what I want to calculate myself from all the evidence scores. I used BigQuery to extract the necessary information.

First, I got the following result (Figure 2) with the SQL statement below.

Figure 2:

SELECT
  diseaseId,
  targetId,
  datatypeId,
  datasourceId,
  score,
  evidenceCount
FROM
  open-targets-prod.platform.associationByDatasourceIndirect
WHERE
  diseaseId = 'EFO_0004251' AND
  targetId = 'ENSG00000096968' AND
  datatypeId = 'somatic_mutation'

Next, I retrieved the individual evidence records and their scores (Figure 3) with the SQL query below.

Figure 3:

SELECT
  datasourceId,
  targetId,
  datatypeId,
  studyId,
  diseaseId,
  sourceId,
  score
FROM
  open-targets-prod.platform.evidence
WHERE
  diseaseId = 'EFO_0004251' AND
  targetId = 'ENSG00000096968' AND
  datatypeId = 'somatic_mutation'

I found a mismatch between evidenceCount and the number of records:
In Fig. 2, the evidenceCount values for eva_somatic and cancer_gene_census are 7 and 11, but there are only five records in Fig. 3. Thus, I cannot reproduce the somatic_mutation association scores shown in Fig. 1 and Fig. 2.

Can you point out what I did wrong when querying the raw data from Google Cloud, or is there another reason? Moreover, if weights need to be taken into account when calculating the harmonic sum, could you give me the exact equations? Neither the journal paper nor the documentation provides these details.

Charlie

Hi Charlie,

I think the difference arises because your first query used indirect associations, whereas the evidence dataset contains only direct disease/target links. At the association level, this evidence is then propagated upwards through the disease ontology.

If you change associationByDatasourceIndirect to associationByDatasourceDirect, the evidence count returned will be consistent with the number of evidence records returned from the evidence dataset.
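As a rough illustration of that one-word change, the Python sketch below only builds the query text for either table so the two can be compared side by side. The helper name `association_query` is hypothetical, the fully qualified direct table name is assumed to follow the same pattern as the indirect one, and the sketch does not connect to BigQuery. It interpolates fixed IDs into the string for readability, which is not safe for untrusted input.

```python
# Table names as used in this thread; the direct one is assumed to live
# in the same dataset as the indirect one.
INDIRECT = "open-targets-prod.platform.associationByDatasourceIndirect"
DIRECT = "open-targets-prod.platform.associationByDatasourceDirect"


def association_query(table, disease_id, target_id, datatype_id):
    """Build the association query from the first post for a given table."""
    return f"""
SELECT diseaseId, targetId, datatypeId, datasourceId, score, evidenceCount
FROM `{table}`
WHERE diseaseId = '{disease_id}'
  AND targetId = '{target_id}'
  AND datatypeId = '{datatype_id}'
""".strip()


# Same filters as before, pointed at the direct associations table.
print(association_query(DIRECT, "EFO_0004251", "ENSG00000096968", "somatic_mutation"))
```

Pasting the printed text into the BigQuery console should give evidenceCount values that can be compared with the rows returned from the evidence table.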

Let us know if something still seems off.

Thanks for pointing out the direct versus indirect issue. I then checked whether I could reproduce the scores in your figure from my Figure 3 (step 1 below), and I can. However, I cannot reproduce the “Somatic mutation” score of 0.909 in Fig. 1 from the scores listed in your figure (step 2 below). Could you tell me whether my calculation is wrong?

step 1. I use the harmonic sum for “eva_somatic” and “cancer_gene_census”:

a. the eva_somatic evidence scores from my Figure3:

0.9/1 + 0.7/4 + 0.7/9 + 0.7/16 = 1.19652
→ normalize score = 1.19652 / 1.64393 = 0.727 which is exactly the number in your Fig.

b. the cancer_gene_census score from my Figure3:

1/1 = 1
→ normalize score = 1 / 1.64393 = 0.608 which is exactly the number in your Fig.
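The step 1 arithmetic can be checked with a short Python sketch. The function name `harmonic_sum` and the constant `MAX_HS` are illustrative only; `MAX_HS` is simply the 1.64393 normaliser quoted above, and the evidence scores are the rounded values read from Figure 3. Because both the inputs and the normaliser are rounded, the last digits may differ slightly from the platform's figures.

```python
# Normaliser used in step 1 above (value as quoted in this thread).
MAX_HS = 1.64393


def harmonic_sum(scores):
    """Sum score_i / i**2 over the scores sorted from highest to lowest."""
    ordered = sorted(scores, reverse=True)
    return sum(s / i ** 2 for i, s in enumerate(ordered, start=1))


# Rounded evidence scores as read from Figure 3.
eva_somatic = [0.9, 0.7, 0.7, 0.7]
cancer_gene_census = [1.0]

eva_raw = harmonic_sum(eva_somatic)
cgc_raw = harmonic_sum(cancer_gene_census)

print("eva_somatic raw:", eva_raw, "normalised:", eva_raw / MAX_HS)
print("cancer_gene_census raw:", cgc_raw, "normalised:", cgc_raw / MAX_HS)
```

Sorting before summing matters: the highest score is divided by 1, the next by 4, and so on, so each additional piece of evidence contributes less.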

step 2: Calculating Somatic mutation score on the website from your figure

0.727406086/1 + 0.607930798/4 = 0.879388786
→ normalize score = 0.879388786 / 1.64393 = 0.534929311 which is totally wrong

Many thanks if you can help me with this.

Hi Charlie — and welcome to the Open Targets Community! :tada:

There are a couple of things happening here:

  • The “Somatic Mutation” data type score for JAK2/myeloproliferative disorder, using direct evidence only, is 0.70. This is what you can calculate using the evidence listed in your Figure 3. (You can see this on the Open Targets Platform when you search for diseases associated with your target.)

  • When you calculate the data type score from the data source scores (step 2), only two scores enter the harmonic sum, so the maximum theoretical harmonic sum for this calculation is actually 1/1^2 + 1/2^2 = 1.25 (explained in the documentation).

With this new maximum, we have:
0.727406086/1 + 0.607930798/4 = 0.879388786
0.879388786/1.25 = 0.7035…
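The same calculation as a minimal Python sketch. The helper names `harmonic_sum` and `theoretical_max` are illustrative, and the sketch assumes the two data sources are weighted equally, as in the arithmetic above.

```python
def harmonic_sum(scores):
    """Sum score_i / i**2 over the scores sorted from highest to lowest."""
    ordered = sorted(scores, reverse=True)
    return sum(s / i ** 2 for i, s in enumerate(ordered, start=1))


def theoretical_max(n):
    """Harmonic sum of n scores that are all equal to 1.0."""
    return harmonic_sum([1.0] * n)


# Direct data source scores quoted in this thread.
direct_datasource_scores = {
    "eva_somatic": 0.727406086,
    "cancer_gene_census": 0.607930798,
}

raw = harmonic_sum(direct_datasource_scores.values())
datatype_score = raw / theoretical_max(len(direct_datasource_scores))

print(round(raw, 6), theoretical_max(2), round(datatype_score, 4))
# prints: 0.879389 1.25 0.7035
```

Dividing by the maximum for the actual number of contributing scores, rather than by a fixed constant, is what brings the result to 0.70 instead of 0.53.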

I hope this helps! Let me know if you have follow up questions.

We recently updated our scoring documentation to try to clarify how this works. Could you let us know which parts were unclear so that we can improve it? :smiley:

Your explanation for the “Somatic Mutation” data type score by using direct evidence only is very clear. Thank you for this and your rapid reply.

How about the “Somatic Mutation” data type score (=0.9) in my Fig. 1?
The information here obviously comes from both direct and indirect evidence.

I calculated this score from the values in Fig. 2 (0.92549 and 0.847595) and in your figure (0.727406086 and 0.607930798) as
→ (0.92549/1^2)+(0.847595/2^2)+(0.727406086/3^2)+(0.607930798/4^2)=1.25620732331944
→ 1.25620732331944/1.4236=0.88, which does not match

Did I miss certain things in the above calculation?

Hi Charlie,

Calculating the indirect data type score is the same as calculating the direct score. The scores you have in Figure 2 are the indirect scores for the two data sources contributing to this data type.

The indirect scores contain both direct and indirect evidence. In other words, they will be the score derived from the evidence scores in Figure 3, plus any additional indirect evidence.

So for the indirect score:

0.92549… (cancer gene census score from Figure 2) / 1 + 0.84759… (eva somatic score) / 4 = 1.13738…

To normalise, since we have two data source scores, we use the theoretical maximum 1/1^2 + 1/2^2 = 1.25
1.13738 / 1.25 = 0.9099…
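To make the contrast with the earlier four-score attempt concrete, here is a small Python sketch using the numbers quoted in this thread. The helper names `harmonic_sum` and `normalised` are illustrative, and equal weighting of the data sources is assumed.

```python
def harmonic_sum(scores):
    """Sum score_i / i**2 over the scores sorted from highest to lowest."""
    ordered = sorted(scores, reverse=True)
    return sum(s / i ** 2 for i, s in enumerate(ordered, start=1))


def normalised(scores):
    """Harmonic sum divided by its maximum for the same number of scores."""
    scores = list(scores)
    return harmonic_sum(scores) / harmonic_sum([1.0] * len(scores))


indirect = [0.92549, 0.847595]        # indirect data source scores (Figure 2)
direct = [0.727406086, 0.607930798]   # direct data source scores

print(round(normalised(indirect), 4))           # 0.9099, the website value
print(round(normalised(indirect + direct), 4))  # 0.8824, direct evidence counted twice
```

The indirect data source scores already include the direct evidence, so adding the direct scores as extra terms counts that evidence twice and changes the normaliser.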

Thank you for your clear explanation again. It is very helpful.