For our study, we need to retrieve the entities similar to a given target, along with their scores, from the Open Targets Platform (Bibliography widget). We could identify the target-disease score (similarity score) from the following dataset.
Let me clarify a little better what our literature pipeline consists of in broad terms:
The main component is an entity recognition algorithm that scans all publications indexed by Europe PMC (EPMC) for mentions of targets, diseases or traits, and drugs. To this result, which we usually refer to as “matches”, we apply a normalisation algorithm that maps the labels to IDs (ENSG IDs, EFO IDs, ChEMBL IDs) so that we can use them in the Platform. What feeds our Bibliography widget is a set of vectors that we create based on how each ID is referenced in the scientific corpus.
Across this whole corpus of scientific literature, we treat a disease/trait and a target mentioned in the same section of the same publication as evidence of an association between the two entities. The score given to this association depends on the section in which they are mentioned. We usually refer to this set as “co-occurrences”.
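To make the co-occurrence idea concrete, here is a minimal Python sketch. It is not the pipeline's actual code: the section weights, the record layout, and the IDs (`ENSG_A`, `EFO_X`, and so on) are hypothetical placeholders chosen only for illustration.

```python
from collections import defaultdict
from itertools import product

# Hypothetical section weights; the real values are not stated in this thread.
SECTION_WEIGHTS = {"title": 1.0, "abstract": 0.8, "results": 0.5}

# Toy "matches": (publication ID, section, entity type, normalised ID).
# All IDs are placeholders, not real identifiers.
matches = [
    ("PMC1", "title", "target", "ENSG_A"),
    ("PMC1", "title", "disease", "EFO_X"),
    ("PMC1", "results", "target", "ENSG_A"),
    ("PMC1", "results", "disease", "EFO_Y"),
    ("PMC2", "abstract", "target", "ENSG_B"),
    ("PMC2", "results", "disease", "EFO_X"),
]


def cooccurrences(matches, weights):
    """Pair every target with every disease found in the same section
    of the same publication, and attach the weight of that section."""
    by_section = defaultdict(lambda: {"target": set(), "disease": set()})
    for pub, section, kind, entity_id in matches:
        by_section[(pub, section)][kind].add(entity_id)

    rows = []
    for (pub, section), found in sorted(by_section.items()):
        pairs = product(sorted(found["target"]), sorted(found["disease"]))
        for target, disease in pairs:
            rows.append((pub, section, target, disease, weights.get(section, 0.0)))
    return rows


for row in cooccurrences(matches, SECTION_WEIGHTS):
    print(row)
```

In this toy data, `ENSG_B` and `EFO_X` both appear in `PMC2` but in different sections, so they produce no co-occurrence.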
In your case, since you are interested in the similarities between entities across the whole literature, I would recommend that you work with the matches datasets and the embeddings that we generate from them. The similarities between vectors are calculated on the fly, so we don’t make them available in a precomputed dataset.
However, we do make the vectors for each entity available, so calculating the similarity scores between the entities you are interested in is easy. The dataset that you want to check out is this one: /pub/databases/opentargets/platform/22.11/output/literature-etl/parquet/vectors
You can compute the similarity score between entities by calculating the dot product between their vectors.
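As a rough illustration of that dot-product step, here is a minimal Python sketch. It works on a few in-memory toy rows rather than the real Parquet files, and the field names (`id`, `vector`), the entity names, and the vector values are all hypothetical. The actual schema of the vectors dataset should be checked before adapting it.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


# Toy rows standing in for the vectors dataset.
# Field names and values are hypothetical, not the real schema.
rows = [
    {"id": "ENTITY_A", "vector": [1.0, 2.0, 0.0]},
    {"id": "ENTITY_B", "vector": [2.0, 1.0, 1.0]},
    {"id": "ENTITY_C", "vector": [0.0, 0.5, 3.0]},
]


def rank_by_dot(query_id, rows):
    """Score every other entity against the query and sort from highest to lowest."""
    vectors = {row["id"]: row["vector"] for row in rows}
    query = vectors[query_id]
    scores = [
        (other_id, dot(query, vec))
        for other_id, vec in vectors.items()
        if other_id != query_id
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


print(rank_by_dot("ENTITY_A", rows))
```

Note that a raw dot product is unbounded, so its scale depends on the lengths of the vectors involved.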
I hope this is what you need! If you want to know more about this pipeline in detail, I’d suggest checking out the blog post we have about it.
Using KRAS as an example, we tried to determine its co-occurring entities (the first ten targets). We were able to retrieve the set of ten genes by applying a literature cross-match filter, and our results matched the UI output. To determine the similarity score between entities, we calculated the dot product between their vectors, using the data provided for each target in the vectors table. Our results ranged from 2 to 5.5, whereas the corresponding values in the UI were less than one. Could you elaborate on how to calculate the similarity score on the fly from the vector data? Is any further computation needed to match the following results obtained from the user interface?
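One possible explanation, offered here as an assumption and not confirmed anywhere in this thread, is that the UI reports cosine similarity. A raw dot product is unbounded, whereas cosine similarity (the dot product divided by the product of the two vector norms) always lies between -1 and 1. The Python sketch below, using made-up two-dimensional vectors, shows how the same pair can have a dot product well above one and a cosine similarity below one.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Dot product normalised by both vector norms; the result lies in [-1, 1]."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(u, v) / (norm_u * norm_v)


# Made-up vectors, purely illustrative (not taken from the vectors dataset).
u = [3.0, 4.0]
v = [4.0, 3.0]

print("dot product:", dot(u, v))
print("cosine similarity:", cosine_similarity(u, v))
```

Whether this normalisation reproduces the UI values would need to be checked against a few known pairs, such as the KRAS neighbours mentioned above.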