In your case I would use the evidence parquet file. It should be straightforward to extract a dataframe with the fields you need directly from there, without performing any joins.
As you know, ChEMBL is a provider of evidence between target and disease. ChEMBL evidence represents any target-disease relationship that can be explained by an approved or clinical candidate drug, targeting the gene product and indicated for the disease.
If you download the ChEMBL evidence, your fields of interest are: drugId, clinicalPhase, targetId, diseaseId. Note that each evidence record represents one study, so to get the maximum phase for an indication you’d need to aggregate the data and take the maximum clinicalPhase. I hope this is helpful.
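As a rough sketch of that aggregation in pandas: the function name `max_phase_per_indication`, the output column `maxClinicalPhase`, and the toy IDs below are made up for illustration, and grouping by drug, target and disease is my assumption about what "per indication" should mean for you. Drop `targetId` from the grouping keys if you only care about drug-disease pairs.

```python
import pandas as pd


def max_phase_per_indication(evidence: pd.DataFrame) -> pd.DataFrame:
    """Collapse study-level evidence to one row per drug/target/disease.

    Assumes the four columns named in this thread are present:
    drugId, targetId, diseaseId, clinicalPhase.
    """
    keys = ["drugId", "targetId", "diseaseId"]  # assumed grouping keys
    return (
        evidence.groupby(keys, as_index=False)["clinicalPhase"]
        .max()
        .rename(columns={"clinicalPhase": "maxClinicalPhase"})
    )


# Toy rows standing in for the real evidence; all IDs are placeholders.
# With the real data you would load the parquet files instead, e.g.
#   evidence = pd.read_parquet("<your evidence path>", columns=[...])
# where the path depends on where you downloaded the files.
toy = pd.DataFrame(
    {
        "drugId": ["DRUG_A", "DRUG_A", "DRUG_A", "DRUG_B"],
        "targetId": ["TARGET_1", "TARGET_1", "TARGET_1", "TARGET_2"],
        "diseaseId": ["DISEASE_X", "DISEASE_X", "DISEASE_Y", "DISEASE_X"],
        "clinicalPhase": [2, 4, 1, 3],
    }
)

result = max_phase_per_indication(toy)
print(result)
```

In the toy data the two studies for DRUG_A in DISEASE_X collapse into a single row carrying the higher of the two phases.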
Could you please elaborate on the differences in numbers you are seeing?