Differences in number of unique drugs per clinical precedence

I’m trying to get a dataframe with all clinical data using the json/parquet files. Specifically, I care about drugId, targetId, diseaseId, and maxPhaseForIndication.

It seems this data is present in ‘molecule’, ‘indication’, and ‘knownDrugsAggregated’ files. However, the numbers I get from each are very different.

Can someone explain why? And perhaps suggest the best way to get this information?

Hi @Karl_Gemayel and welcome to our Community!

In your case I would use the evidence parquet file. It should be straightforward to extract a dataframe with the fields you need directly from there, without performing any joins.

As you know, ChEMBL is a provider of evidence between target and disease. ChEMBL evidence represents any target-disease relationship that can be explained by an approved or clinical candidate drug, targeting the gene product and indicated for the disease.

If you download the ChEMBL evidence, your fields of interest are drugId, clinicalPhase, targetId, and diseaseId. Note that each evidence record represents one study, so to get the max phase for an indication you'd need to group the records by drugId, targetId, and diseaseId and take the maximum clinicalPhase. I hope this is helpful.
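
As a rough sketch of that aggregation in pandas: this assumes the evidence has already been loaded into a dataframe with the four columns above. The sample rows, the IDs in them, and the function name are made up for illustration, and the output column is simply named maxPhaseForIndication to match what you asked for.

```python
import pandas as pd


def max_phase_per_indication(evidence: pd.DataFrame) -> pd.DataFrame:
    """Collapse study-level evidence to one row per drug/target/disease."""
    keys = ["drugId", "targetId", "diseaseId"]
    return (
        evidence.groupby(keys, as_index=False)["clinicalPhase"]
        .max()  # highest phase seen across all studies for the triple
        .rename(columns={"clinicalPhase": "maxPhaseForIndication"})
    )


# Made-up rows standing in for the ChEMBL evidence. With the real download,
# the dataframe would instead come from pd.read_parquet on the evidence files.
evidence = pd.DataFrame(
    {
        "drugId": ["CHEMBL_A", "CHEMBL_A", "CHEMBL_A", "CHEMBL_B"],
        "targetId": ["ENSG_1", "ENSG_1", "ENSG_1", "ENSG_2"],
        "diseaseId": ["EFO_1", "EFO_1", "EFO_2", "EFO_1"],
        "clinicalPhase": [2, 4, 1, 3],
    }
)

result = max_phase_per_indication(evidence)
print(result)
```

Counting unique drugId values in the result (for example with result["drugId"].nunique()) should then give a number you can compare against the other files.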

Could you please elaborate on the differences in numbers you are seeing?

Best,
Irene
