Similar entities & score (OT Platform Bibliography)

asathfernando · 14 December 2022 12:28

For our study, we need to retrieve the similar entities for targets, along with their scores, from the Open Targets Platform (Bibliography). We were able to identify the target-disease score (similarity score) in the following dataset:

/pub/databases/opentargets/platform/latest/output/literature-etl/parquet/evidence/

Could you please guide us on how to extract similarity scores for target-target and target-drug pairs?

irene · 14 December 2022 17:05

Hello @asathfernando and welcome to our Community!

Let me clarify a little better what our literature pipeline consists of in broad terms:

The main component is the entity recognition algorithm, which looks at all publications indexed by Europe PMC (EPMC) and finds where targets, diseases or traits, and drugs are mentioned. To this result, which we usually refer to as “matches”, we apply a normalisation algorithm that maps the labels to IDs (ENSG IDs, EFO IDs, CHEMBL IDs) so that we can use them in the Platform. What feeds our Bibliography widget is a set of vectors that we create based on how each ID is referenced in the scientific corpus.

From this whole corpus of scientific literature, we consider that if a disease/trait is mentioned in the same section of the same publication as a target, there is evidence of an association between the two entities. The score given to this association depends on the section where they are mentioned. We usually refer to this set as “co-occurrences”.

In your case, since you are interested in the similarities between entities across the whole literature, I would recommend working with the matches datasets and the embeddings that we generate from them. The similarities between vectors are calculated on the fly, so we don’t make them available as a precomputed dataset.

However, we do make the vectors for each entity available, so calculating the similarity scores between the entities you are interested in is easy. The dataset you want to check out is this one: /pub/databases/opentargets/platform/22.11/output/literature-etl/parquet/vectors
You can compute the similarity score between entities by calculating the dot product between their vectors.
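
As a rough sketch of that dot-product step, assuming the vectors for two entities have already been read from the dataset into plain Python lists. The IDs and values below are invented for illustration, and `dot_similarity` is a hypothetical helper, not part of any Open Targets library:

```python
# Toy 4-dimensional vectors keyed by made-up entity IDs.
# Real embeddings come from the vectors dataset; these values are invented.
vectors = {
    "ENTITY_A": [1.0, 2.0, 0.0, 1.0],
    "ENTITY_B": [0.5, 1.0, 1.0, 0.0],
}


def dot_similarity(a, b):
    """Raw (unnormalised) dot product between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))


score = dot_similarity(vectors["ENTITY_A"], vectors["ENTITY_B"])
print(score)
```

Note that a raw dot product is not bounded, so its values depend on the lengths of the vectors as well as on their directions.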

I hope this is what you need! If you want to know more about this pipeline in detail, I’d suggest checking out the blog post we have about it.

Best,
Irene

asathfernando · 19 December 2022 06:36

Using KRAS as an example, we tried to determine its co-occurring entities (first ten targets). We were able to retrieve the set of ten genes by applying a literature cross-match filter, and our results matched the UI output. To determine the similarity score between entities, we calculated the dot product between their vectors using the data provided for each target in the vectors table. Our results ranged from 2 to 5.5, whereas the corresponding values in the UI were less than one. Could you elaborate on how the similarity score is calculated on the fly from the vector data? Is any further computation needed to match the following results obtained from the user interface?

| Target | Dot product (our calculation) | Similarity score in UI |
|---|---|---|
| ENSG00000099949 - LZTR1 | 4.944294208 | 0.716869366 |
| ENSG00000125731 - SH2D3A | 5.368573759 | 0.664970445 |
| ENSG00000111405 - ENDOU | 2.71252414 | 0.654289812 |
| ENSG00000198732 - SMOC1 | 5.001285831 | 0.650119609 |
| ENSG00000133321 - PLAAT4 | 3.885406255 | 0.64984986 |
| ENSG00000272395 - IFNL4 | 5.482690501 | 0.645822321 |
| ENSG00000089127 - OAS1 | 5.084897459 | 0.622650407 |
| ENSG00000154438 - ASZ1 | 4.1043518 | 0.621260402 |
| ENSG00000108771 - DHX58 | 5.469361946 | 0.604952311 |
| ENSG00000126456 - IRF3 | 4.988631979 | 0.595128282 |

ochoa · 16 January 2023 13:40

The resulting values are indeed normalised:

github.com/opentargets/platform-api/blob/ed86c9d226f428006981b022628ed77674e47b00/app/models/db/QW2V.scala#L32
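// Excerpt only: `label`, `labels`, `tableName` and the query DSL (Q, F, ...)
// are defined elsewhere in QW2V.scala.
val v: Column = column("vector") // vector of each candidate entity
val norm: Column = column("norm") // `norm` column stored alongside each vector
val T: Column = column(tableName)

// Element-wise sum of the vectors of the queried labels.
val vv: Column = Q(
  Select(F.sumForEach(v) :: Nil),
  From(T),
  PreWhere(F.in(label, F.set(labels.map(literal).toSeq)))
).toColumn(None).as(Some("vv"))

// Euclidean norm of the summed query vector: sqrt(sum of x*x).
val vvNorm: Column = F.sqrt(F.arraySum(Some("x -> x*x"), vv)).as(Some("vvnorm"))

// Dot product of vv and v divided by (norm * vvNorm);
// 0 when either norm is 0, to avoid dividing by zero.
val sim: Column = F
  .ifThenElse(
    F.and(F.notEquals(vvNorm.name, literal(0d)), F.notEquals(norm, literal(0d))),
    F.divide(
      F.arraySum(Some("x -> x.1 * x.2"), F.arrayZip(vv.name, v)),
      F.multiply(norm, vvNorm.name)
    ),
    literal(0d)
  )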

As you noticed, this should not alter the gene rankings.
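
For anyone reproducing this offline, the excerpt above amounts to a cosine similarity between the summed query vectors and each candidate vector. Here is a minimal Python sketch of the same arithmetic, assuming the vectors are held in memory as plain lists and that the stored `norm` column equals each vector's Euclidean norm (an assumption, not something confirmed in this thread). The function names and toy values are hypothetical:

```python
import math


def l2_norm(vec):
    """Euclidean norm: square root of the sum of squares."""
    return math.sqrt(sum(x * x for x in vec))


def similarity(query_vectors, candidate):
    """Normalised dot product between the summed query vectors and a candidate."""
    # Element-wise sum of the query vectors (the `vv` column in the excerpt).
    vv = [sum(components) for components in zip(*query_vectors)]
    vv_norm = l2_norm(vv)
    # Assumption: the stored `norm` column is the candidate's Euclidean norm.
    cand_norm = l2_norm(candidate)
    if vv_norm == 0.0 or cand_norm == 0.0:
        return 0.0  # same zero-norm guard as the excerpt
    dot = sum(a * b for a, b in zip(vv, candidate))
    return dot / (cand_norm * vv_norm)


def rank(query_vectors, candidates):
    """Return (label, score) pairs sorted from most to least similar."""
    scored = [(label, similarity(query_vectors, vec)) for label, vec in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 2-dimensional data, invented for illustration.
query = [[3.0, 4.0]]
candidates = {"X": [6.0, 8.0], "Y": [4.0, -3.0], "Z": [0.0, 0.0]}
ranking = rank(query, candidates)
print(ranking)
```

With several query entities, pass all of their vectors in `query_vectors`; they are summed before normalisation, as in the excerpt.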
