Generating Text Mining Score

Hi there!

Is it possible, or is there an easy way, to recalculate the text mining scores for genes associated with a given disease, based on a specific time frame?

For example: given all the genes associated with liver disease, I want to pull their text mining scores, but computed only on literature published after January 2020.

Thank you!

Hi @rikard! Welcome to the Open Targets Community :tada:

It is not possible for users to recalculate the individual text mining scores.

However, our upcoming release will allow you to filter the bibliography for a specific time frame, which might help with your problem.

To expand a bit on @hcornu’s answer: this is not possible at the moment because we don’t report the publication date in the EuropePMC evidence. We plan to add this information in our next release, which will make queries like yours more straightforward.

The evidence scores are aggregated at the association level by taking the harmonic sum of all the evidence scores. This value is then divided by the theoretical maximum, which ensures the association score never exceeds 1. You can read more about how we use the harmonic sum in our docs.
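As a toy numeric illustration of that aggregation (the numbers are made up, and the weighting follows the snippet further down this thread):

```python
# Three evidence scores for one disease-gene association, sorted descending
scores = [0.8, 0.5, 0.2]

# Position-weighted harmonic sum: each score is down-weighted by 2**idx
harmonic_sum = sum(s / 2**i for i, s in enumerate(scores))  # 0.8 + 0.25 + 0.05 = 1.1

# Theoretical maximum: the same sum for a vector of perfect scores (all 1.0).
# Dividing by it caps the association score at 1.
theoretical_max = sum(1 / 2**i for i in range(len(scores)))  # 1.75
association_score = harmonic_sum / theoretical_max
```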

So, given that you want to calculate the association score from only a subset of evidence, this is what I would do:

  1. Filter the evidence dataset to keep only the evidence of interest (again, a publicationYear field will soon be available).
  2. Take the vector of all remaining evidence scores, sorted in descending order.
  3. Apply the harmonic sum definition.
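
The three steps above can be sketched in plain Python (the evidence records and their publicationYear values are made up for illustration; on the real dataset you would apply the same filter to the PySpark DataFrame):

```python
# Step 1: filter the evidence dataset (publicationYear is the upcoming field)
evidence = [
    {"score": 0.9, "publicationYear": 2021},
    {"score": 0.4, "publicationYear": 2019},
    {"score": 0.6, "publicationYear": 2020},
]
recent = [e for e in evidence if e["publicationYear"] >= 2020]

# Step 2: take the vector of evidence scores, sorted in descending order
scores = sorted((e["score"] for e in recent), reverse=True)

# Step 3: harmonic sum (score / 2**idx weighting), normalised by the
# theoretical maximum, i.e. the same sum over a vector of perfect scores
raw_sum = sum(s / 2**i for i, s in enumerate(scores))
theoretical_max = sum(1 / 2**i for i in range(len(scores)))
association_score = raw_sum / theoretical_max  # 0.8 for these numbers
```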

There are many ways to achieve this. I did a similar exercise and this is the function that I used:

```python
# `evd` is the PySpark DataFrame of evidence; this collects the maximum
# evidence score to use for normalisation
theoretical_max = evd.agg({'score': 'max'}).collect()[0][0]

def calculate_hsum(scores: list, max_score: float) -> float:
    # Sort descending so the strongest evidence carries the largest weight,
    # and keep only the first 20 terms
    top_scores = sorted(scores, reverse=True)[:20]
    return sum(score / 2**idx for idx, score in enumerate(top_scores)) / max_score
```

Here I added a small tweak for when the array of scores is longer than 20 elements: the contribution of scores beyond that point is very marginal (score / 2**20), and computing them is unnecessarily expensive.
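As a quick sanity check on that cut-off (my own arithmetic, not from the docs): even a perfect score at position 21 adds less than one millionth to the sum.

```python
# Contribution of a maximal score (1.0) at index 20, i.e. the 21st element,
# under the score / 2**idx weighting
contribution = 1.0 / 2**20
print(contribution)  # about 9.5e-07
```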

I hope this is helpful!