Clinical precedence not capturing entire data

Hello, I’ve come across a large number of drugs with “no data” for Clinical Precedence but with at least one indication with clinical phase data. Can someone clarify the distinction?

Adding an example to illustrate what I mean.


Hi again @Karl_Gemayel,

I guess these are the inconsistencies you commented on in the other thread :rofl:

You are seeing discrepancies because the clinical precedence dataset and the indications dataset draw from two different sources:

  • Mechanism of Action / Indication / Drug Warnings: these are reproductions of the widgets you can find on ChEMBL. We download them directly from their database.
  • Clinical Precedence / Known Drugs: these are datasets that we create with an ad-hoc pipeline that generates the disease/target evidence data from ChEMBL. In some cases the pipeline extends the data present in ChEMBL with extra annotation for us, such as new mechanisms of action.

Clindamycin palmitate is not in the evidence set because its target is not human, so it falls out of the Clinical Precedence dataset as well. You will find similar cases when we are missing the mechanism of action annotation, as for Tozinameran.
Clinical precedence is derived from the evidence mainly for historical reasons (drug annotations were incorporated later). We will discuss in the team how we want to scope this task and I will keep you posted.

Thanks for reporting it!

Thanks @irene and sorry for posting twice.

  1. Perhaps the example I chose wasn’t the best. What about drugs such as: CHEMBL1200680, CHEMBL1200522, CHEMBL203266?
  2. I’ve found 1015 such drugs that have a phase 4 status in the indication data file, but are not present in evidence (unless my download is messed up). All the ones I’ve checked online did not have Clinical precedence data. Am I missing something?

Code to reproduce:

from pyspark.sql import functions as F

# ChEMBL-sourced evidence; keep only the drug IDs
chembl = evidence.filter("sourceId == 'chembl'").select("drugId").toPandas()

# one row per (drug, indication), with the max clinical phase for that indication
indication = indication.withColumnRenamed("id", "drugId")
indication = indication.withColumn("indications", F.explode("indications"))
indication = indication.withColumn("phase", F.col("indications.maxPhaseForIndication"))
indication = indication.select("drugId", "phase").toPandas()

# compare IDs for phase 4 in indication
indication_ids = set(indication[indication.phase == 4].drugId)
chembl_ids = set(chembl.drugId)
leftover = indication_ids - chembl_ids

# len(leftover) == 1015

Thanks again,

First of all, thank you for providing a reproducible example! :slight_smile:

The molecules you mention are further examples of the Tozinameran case that I mentioned yesterday. These are drugs for which we don’t know the mechanism of action, either because it is unknown or because there is a curation gap. If we have no annotation for the target these drugs modulate, we cannot build a target/disease relationship. Consequently, data will be missing in the Clinical Precedence widget.
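If you want to check which of the two explanations applies to each drug in your leftover set, something along these lines might help. It is only a rough sketch on toy data: the function name `classify_missing`, the drug and target IDs, and the shape of the two lookups (a drug-to-targets mapping built from the mechanism of action data, and a set of human target IDs) are all assumptions for illustration, not the actual pipeline logic or dataset schema.

```python
def classify_missing(leftover, moa_targets, human_targets):
    """Group drug IDs by the likely reason they have no clinical precedence evidence.

    leftover:      drug IDs found in the indication data but not in the evidence
    moa_targets:   hypothetical mapping drug ID -> set of annotated target IDs
    human_targets: hypothetical set of target IDs known to be human
    """
    reasons = {
        "no_mechanism_of_action": set(),
        "no_human_target": set(),
        "unexplained": set(),
    }
    for drug_id in leftover:
        targets = moa_targets.get(drug_id, set())
        if not targets:
            # no target annotation, so no target/disease relationship can be built
            reasons["no_mechanism_of_action"].add(drug_id)
        elif not (targets & human_targets):
            # targets are annotated, but none of them is human
            reasons["no_human_target"].add(drug_id)
        else:
            # has a human target yet is still missing: worth a closer look
            reasons["unexplained"].add(drug_id)
    return reasons


# Toy inputs with made-up IDs (not real ChEMBL or Ensembl identifiers)
leftover = {"DRUG_A", "DRUG_B", "DRUG_C"}
moa_targets = {
    "DRUG_B": {"BACTERIAL_TARGET_1"},
    "DRUG_C": {"HUMAN_TARGET_1"},
}
human_targets = {"HUMAN_TARGET_1"}

result = classify_missing(leftover, moa_targets, human_targets)
for reason, drug_ids in result.items():
    print(reason, sorted(drug_ids))
```

Anything that ends up in the "unexplained" group would not be covered by either explanation and would be the most useful to report back.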

I hope that answers your question!