Generating Text Mining Score

rikard · 9 August 2022 18:46

Hi there!

Is it possible, or is there an easy way, to recalculate the text mining scores for genes associated with a given disease, based on a specific time frame?

For example: given all the genes associated with liver disease, I want to pull their text_mining scores, but computed only on literature published after January 2020.

Thank you!

hcornu · 11 August 2022 14:57

Hi @rikard! Welcome to the Open Targets Community!

It is not possible for users to recalculate the individual text mining scores.

However, our upcoming release will allow you to filter the bibliography for a specific time frame, which might help with your problem.

irene · 11 August 2022 16:08

To expand a bit on @hcornu’s answer: this is not possible at the moment since we don’t report the publication date in the Europe PMC evidence. We plan to add this information in our next release so that queries like yours will be more straightforward.

The evidence scores are aggregated at the association level by calculating the harmonic sum of all the evidence scores. This value is then divided by the theoretical maximum score, so that the final value never exceeds 1. You can read more about how we use the harmonic sum in our docs.

So given that you are interested in calculating the association score but only from a subset of evidence, what I would do is:

1. Filter the evidence dataset to keep only the evidence of interest (again, a publicationYear field will soon be available).
2. To calculate the harmonic sum, first take the vector of all evidence scores.
3. Apply the harmonic sum definition.
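
The first two steps above could be sketched in plain Python as follows. This is only an illustration: the rows and target names are made up, the `publicationYear` field is assumed to exist as described, and sorting each vector from highest to lowest is an assumption made here so that the largest scores receive the largest weights.

```python
from collections import defaultdict

# Hypothetical evidence rows; the values and target names are invented for illustration.
evidence = [
    {"targetId": "TARGET_A", "score": 0.4, "publicationYear": 2019},
    {"targetId": "TARGET_A", "score": 0.9, "publicationYear": 2021},
    {"targetId": "TARGET_A", "score": 0.6, "publicationYear": 2020},
    {"targetId": "TARGET_B", "score": 0.3, "publicationYear": 2022},
    {"targetId": "TARGET_B", "score": 0.8, "publicationYear": 2018},
]

def scores_by_target(rows, min_year):
    """Keep rows published in or after min_year and return, per target,
    the evidence scores sorted from highest to lowest."""
    grouped = defaultdict(list)
    for row in rows:
        if row["publicationYear"] >= min_year:
            grouped[row["targetId"]].append(row["score"])
    return {target: sorted(scores, reverse=True) for target, scores in grouped.items()}

# One score vector per target, restricted to evidence from 2020 onwards.
vectors = scores_by_target(evidence, min_year=2020)
```

Each vector in `vectors` is then ready to be passed to whichever harmonic sum implementation you use.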

There are many ways to achieve this. I did a similar exercise and this is the function that I used:

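# evd is presumably a PySpark DataFrame of evidence with a numeric 'score' column.
# The highest score found in it is used as the normalisation constant.
theoretical_max = evd.agg({'score': 'max'}).collect()[0][0]

def calculate_hsum(scores: list, max_score: float) -> float:
    # Weight each score by 1 / 2**idx; only the first 20 scores are summed.
    return sum(score / 2**idx for idx, score in enumerate(scores[:20])) / max_score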

Here I added a small tweak for when my array of scores has more than 20 elements: only the first 20 are summed. Beyond that point the contribution of each score is very marginal (score/2**20 or less), and summing the whole array is computationally expensive.
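
To get a feel for how little is lost by stopping at 20 scores, here is a small sketch that compares the full and the truncated sums under the same score / 2**idx weighting used in the function above. The helper name `hsum_terms` and the input (1000 evidence items, all scoring 1.0) are hypothetical and chosen as a worst case.

```python
def hsum_terms(scores, limit=None):
    """Sum score / 2**idx over the scores, optionally using only the first `limit`."""
    kept = scores if limit is None else scores[:limit]
    return sum(score / 2**idx for idx, score in enumerate(kept))

# Hypothetical worst case: 1000 evidence items, all with a score of 1.0.
scores = [1.0] * 1000

full = hsum_terms(scores)                 # every score contributes
truncated = hsum_terms(scores, limit=20)  # only the first 20 contribute
difference = full - truncated

print(f"full={full:.8f} truncated={truncated:.8f} difference={difference:.2e}")
```

Even in this extreme case the difference stays below 2e-6, so the truncation should not change the resulting score in any meaningful way.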

I hope this is helpful!
Irene
