Although SOTORASIB targets KRAS, the bibliography pipeline does not seem to be extracting this relationship on either the drug page or the gene page, as far as I can see. There are not many publications yet (fewer than 30 from a quick PubMed search) as this is a new drug, but I wanted to flag it. I wonder if it is due to the way the gene is often represented in the literature, with the genetic variant attached directly to the gene name. Could this be factored into the NER, as the same may be true for other personalised medicines?

Example drug page:

Example target page:

Example publications where the drug and target co-feature:
Europe PMC (Sept 2020) and Europe PMC (Aug 2023) - both however use ‘KRASG12C’

Hi Ellie,

Looking at the matches dataset, I can confirm that both articles are in the ETL literature output and that sotorasib is annotated in both publications. However, KRAS is not recognised in either of them. I don't know exactly why this is happening: KRAS is not even picked up as a candidate entity that then failed normalisation to ENSG00000133703 (I can't see it in the failed matches dataset either).

Also, although both articles are expected to be available as full text, pmid:37686523 seems to have been processed only from the abstract pile, with a total recognised entity count of 3.

This could be a good exercise to improve the entity recognition.
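
As a rough illustration of the fused gene-variant pattern raised above (e.g. 'KRASG12C'), here is a minimal Python sketch of how such a token could be split into a gene symbol and a variant before normalisation. This is not how the current NER works; the function name `split_fused_token`, the one-entry `GENE_LOOKUP` and the restriction to single-letter amino-acid substitutions are all assumptions made for the example.

```python
import re
from typing import Optional, Tuple

# Illustrative lookup only: a real pipeline would use its full gene-symbol
# dictionary. ENSG00000133703 is the KRAS identifier mentioned in this thread.
GENE_LOOKUP = {"KRAS": "ENSG00000133703"}

# Single-letter amino-acid substitution such as G12C (assumed notation;
# other variant styles are not handled in this sketch).
AA = "ACDEFGHIKLMNPQRSTVWY"
VARIANT_RE = re.compile(rf"[{AA}]\d+[{AA}*]")


def split_fused_token(token: str) -> Optional[Tuple[str, str, str]]:
    """Return (gene_symbol, variant, ensembl_id) when the token is a known
    gene symbol followed by a substitution, otherwise None."""
    text = token.strip().upper()
    # Try every split point; accept the first one where the prefix is a
    # known symbol and the remainder looks like a substitution.
    for i in range(2, len(text)):
        gene = text[:i]
        rest = text[i:].lstrip("- ")  # tolerate 'KRAS G12C' and 'KRAS-G12C'
        if gene in GENE_LOOKUP and VARIANT_RE.fullmatch(rest):
            return gene, rest, GENE_LOOKUP[gene]
    return None
```

With this sketch, `split_fused_token("KRASG12C")` gives back the gene symbol, the variant and the Ensembl ID as separate values, while a plain 'KRAS' or 'sotorasib' returns `None`. Driving the split from a symbol lookup rather than a regex alone is a deliberate choice here, since gene symbols can themselves contain digits.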


Thanks Daniel - yes, possibly an interesting case for future development which could also consider pulling out the variant information.