Question/Suggestion: text-mining associations for protein subunits

It is interesting that for many genes with a “drugs”-based association with a disease, there is NO other evidence, including text mining, that associates the gene (i.e. the target) with that disease (for example, TUBB*, POLE*, etc. in Targets associated with T-cell non-Hodgkin lymphoma). This is counterintuitive: a paper should exist that reports the drug-target interaction, and text mining should therefore pick it up; otherwise it is unclear how a drug could mediate the disease-target association.

I checked a random sample of diseases and the corresponding source records at ChEMBL, and found that all of the genes showing this behavior correspond to protein subunits. I believe the cause is that the “drugs” evidence for a given [subunit, disease] pair is inherited from complex-level evidence: the paper only reports the drug interacting with the protein complex, with no resolution at the subunit level, yet ALL subunits are still considered to interact with the drug.

Please correct me if my guess is wrong. If it is indeed the case, would it be possible to refine this procedure so that, say, inherited evidence of this kind carries a reduced weight, and text mining also includes such inherited evidence?
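To make the suggestion concrete, here is a minimal Python sketch of what I mean by weighting inherited evidence. Everything in it is hypothetical: the `Evidence` record, the `expand_complex_evidence` function, the `inheritance_weight` parameter and its default of 0.5, and the example score are illustrations only, not part of the actual platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """Hypothetical evidence record; field names are illustrative only."""
    target: str
    disease: str
    source: str
    score: float
    inherited: bool = False


def expand_complex_evidence(complex_evidence, subunits, inheritance_weight=0.5):
    """Copy complex-level evidence to each subunit with a down-weighted score.

    inheritance_weight is an assumed tuning parameter in [0, 1]; a value of
    1.0 reproduces the behavior of treating every subunit as a direct target.
    """
    if not 0.0 <= inheritance_weight <= 1.0:
        raise ValueError("inheritance_weight must be between 0 and 1")
    return [
        Evidence(
            target=subunit,
            disease=complex_evidence.disease,
            source=complex_evidence.source,
            score=complex_evidence.score * inheritance_weight,
            inherited=True,  # flag the record so it can be told apart later
        )
        for subunit in subunits
    ]


# Illustrative input: one complex-level record and two of its subunits.
complex_level = Evidence("tubulin", "T-cell non-Hodgkin lymphoma", "drugs", 0.8)
per_subunit = expand_complex_evidence(complex_level, ["TUBB6", "TUBB3"])
```

With a flag like `inherited`, the same expansion could in principle be applied to text-mining hits on the complex name, so that subunits receive down-weighted literature evidence as well.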


Hi @jasperhyp and thank you for your question!

What you are reporting is very interesting, and we really should be able to capture these kinds of associations through text mining, unless they are only reported in clinical trials.

Our pipeline has roughly two critical components:

  1. the identification and extraction of entities based on literature analysis - the output will consist of a set of labels.
  2. the mapping of these labels to their respective ontology (EFO, Ensembl, ChEMBL) - for this we use text processing techniques such as lemmatisation.
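
As a rough illustration of the second component, the sketch below maps extracted labels to identifiers after a simple normalisation step. It is not our actual code: the lookup table, the placeholder identifiers, and the function names are all hypothetical, and the normalisation (lowercasing, punctuation removal, naive plural stripping) is only a crude stand-in for real lemmatisation.

```python
import re

# Hypothetical lookup table: normalised label -> ontology identifier.
# The identifiers are placeholders, not real EFO/Ensembl/ChEMBL IDs.
LABEL_TO_ID = {
    "tubulin beta 6 chain": "TARGET:0001",
    "t cell non hodgkin lymphoma": "DISEASE:0001",
}


def normalise(label):
    """Lowercase, replace punctuation with spaces, strip a trailing plural 's'.

    This is a naive stand-in for lemmatisation, used only for illustration.
    """
    cleaned = re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()
    tokens = [
        tok[:-1] if len(tok) > 3 and tok.endswith("s") and not tok.endswith("ss") else tok
        for tok in cleaned.split()
    ]
    return " ".join(tokens)


def map_labels(labels):
    """Return (mapped, unmapped): labels resolved to an ID, and those that were not."""
    mapped, unmapped = {}, []
    for label in labels:
        identifier = LABEL_TO_ID.get(normalise(label))
        if identifier is None:
            unmapped.append(label)
        else:
            mapped[label] = identifier
    return mapped, unmapped


mapped, unmapped = map_labels(["Tubulin beta-6 chains", "microtubule"])
```

In a setup like this, a label that was extracted in the first step but ends up in `unmapped` would point to a problem in the mapping step rather than in entity extraction.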

We need to investigate in which component we are losing annotations such as the TUBB6 case you report. Is the NLP pipeline unable to capture the subunits, as you suggest? Or is the problem in the text processing steps that map the labels?

Thanks for reporting, I’ll get back to you once we investigate further.

