I have a query about differences in the ‘molecule’ and ‘indication’ datasets available on OpenTargets.
My current understanding of these datasets is based on the OpenTargets GraphQL schema available at https://api.platform.opentargets.org/api/v4/graphql/schema, which describes the relevant fields as follows:
linkedDiseases: “Therapeutic indications for drug based on clinical trial data or post-marketed drugs, when mechanism of action is known”
approvedIndications: “Indications for which there is a phase IV clinical trial”
indications: “Investigational and approved indications curated from clinical trial records and post-marketing package inserts”
I compared the number of ChEMBL IDs in phase IV (molecules with maximumClinicalTrialPhase = 4) in the ‘molecule’ dataset with the number of ChEMBL IDs that have at least one approved indication (approvedIndications > 0) in the ‘indication’ dataset, and found a significant difference between the two totals. The comparison covers versions 21.04 to 23.12 and is illustrated in the attached bar plot.
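For reference, the comparison can be sketched roughly as below. This is a minimal illustration, not my exact script: the rows are made-up toy records with placeholder IDs, the flat row layout is an assumption, and the helper names (phase4_ids, approved_ids, compare) are hypothetical. Only the field names id, maximumClinicalTrialPhase and approvedIndications come from the datasets themselves.

```python
# Toy rows standing in for the 'molecule' and 'indication' datasets.
# The IDs and the flat layout are placeholders/assumptions for illustration only.
molecule_rows = [
    {"id": "CHEMBL_A", "maximumClinicalTrialPhase": 4},
    {"id": "CHEMBL_B", "maximumClinicalTrialPhase": 4},
    {"id": "CHEMBL_C", "maximumClinicalTrialPhase": 2},
]
indication_rows = [
    {"id": "CHEMBL_A", "approvedIndications": ["EFO_1"]},
    {"id": "CHEMBL_B", "approvedIndications": ["EFO_3"]},
    {"id": "CHEMBL_C", "approvedIndications": []},
    {"id": "CHEMBL_D", "approvedIndications": ["EFO_4"]},
]


def phase4_ids(rows):
    """IDs of molecules whose maximum clinical trial phase is 4."""
    return {r["id"] for r in rows if r.get("maximumClinicalTrialPhase") == 4}


def approved_ids(rows):
    """IDs of drugs with a non-empty approvedIndications list."""
    return {r["id"] for r in rows if r.get("approvedIndications")}


def compare(molecule_rows, indication_rows):
    """Count both sets and report the IDs present in only one of them."""
    phase4 = phase4_ids(molecule_rows)
    approved = approved_ids(indication_rows)
    return {
        "phase4_count": len(phase4),
        "approved_count": len(approved),
        "only_phase4": phase4 - approved,
        "only_approved": approved - phase4,
    }


result = compare(molecule_rows, indication_rows)
print(result)
```

Running the same comparison on each release would give the per-version counts shown in the bar plot, and the two set differences identify the IDs behind the gap.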
In manual checks, I noticed that some of the phase IV ChEMBL IDs in the ‘molecule’ dataset have no linkedDiseases values, even though the same IDs have values in the ‘approvedIndications’ column of the ‘indication’ dataset. These cases are listed in the following table.
I would appreciate your help in understanding this difference in annotation between the two datasets, and in finding out whether we could fill in the gaps using the data you have published across versions. As the difference has decreased across versions, I suspect there is something already present in the datasets that we could use for this. I also noticed that a large number of drugs were added from version 22.04 onwards, but I could not find an explanation for this in the release notes. Was a new data source used to add these drugs?
Please let me know if I can provide any further information to help you answer this query.