Level of confidence of L2G gold standards

Dear OpenTargets team,

I’ve a question regarding your dataset of gold standards used to train the L2G model.

I’ve noticed that some gene-trait pairs with evidence class == drug and the same type of evidence (e.g. ChEMBL_III) have different levels of confidence (e.g. “low”, “medium”, “high”). Why is this the case, and how should it be interpreted?

Many thanks,

Best wishes,

Andrea

Hi Andrea,
For our gold standard training set, we used anything that was classified as medium or high confidence from ChEMBL. These are drugs in phase III or phase IV clinical trials, respectively, whose targets overlap with GWAS loci and are matched for the same indication. Anything that is low confidence corresponds to drugs in phase II clinical trials or earlier.
I hope this helps!
Thanks,
Annalisa

Hi Annalisa,

Many thanks for your response. I’m still a bit confused. If I select from this file (https://raw.githubusercontent.com/opentargets/genetics-gold-standards/master/gold_standards/processed/gwas_gold_standards.191108.tsv) the rows where “metadata.set_label” == “ChEMBL_IV”, I get the following counts for each “gold_standard_info.highest_confidence” category:

  • High (288)
  • Low (144)
  • Medium (105)

Based on your explanation it sounded as if all of these should be considered as “high”?
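
A minimal sketch of how such counts could be reproduced, assuming the TSV exposes the two column names quoted above as flat headers. The function name `count_confidence` and the tiny inline sample are illustrative only; the sample rows are made up and stand in for the real file.

```python
import csv
import io
from collections import Counter

# Column names as quoted in the post; assumed to be flat TSV headers.
SET_LABEL = "metadata.set_label"
CONFIDENCE = "gold_standard_info.highest_confidence"


def count_confidence(tsv_file, set_label):
    """Count confidence categories among rows with the given set label."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return Counter(
        row[CONFIDENCE] for row in reader if row[SET_LABEL] == set_label
    )


# Made-up sample rows, not taken from the real gold standards file.
sample = io.StringIO(
    "metadata.set_label\tgold_standard_info.highest_confidence\n"
    "ChEMBL_IV\tHigh\n"
    "ChEMBL_IV\tLow\n"
    "ChEMBL_III\tMedium\n"
    "ChEMBL_IV\tHigh\n"
)
counts = count_confidence(sample, "ChEMBL_IV")
print(dict(counts))
```

To run this on the real data, download the file and pass an open file handle (for example `open(path, newline="")`) in place of `sample`.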

Many thanks

Andrea

Hi Andrea, confidence levels in the gold standard list were adjusted to indicate the distance of the lead variant to the drug target: variant–gene distances of <500, <250 or <100 kb were assigned confidences of low, medium and high, respectively. For example, a phase IV drug target that overlaps with a GWAS locus where the lead SNP is <100 kb from the TSS of the gene would qualify as high confidence.
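
The distance rule can be sketched roughly as below. This is only an illustration of the thresholds described above, not the code used to build the list: the function name `distance_to_confidence` is hypothetical, and the treatment of exact boundary values and of distances of 500 kb or more (returned here as `None`) is an assumption.

```python
def distance_to_confidence(distance_bp):
    """Map a lead-variant-to-gene distance in base pairs to a confidence label.

    Thresholds follow the description in the thread (<100, <250, <500 kb).
    Strict inequalities and the None fallback are assumptions.
    """
    if distance_bp < 0:
        raise ValueError("distance must be non-negative")
    if distance_bp < 100_000:
        return "High"
    if distance_bp < 250_000:
        return "Medium"
    if distance_bp < 500_000:
        return "Low"
    return None  # assumed: no confidence label at 500 kb or beyond


for d in (50_000, 180_000, 400_000, 750_000):
    print(d, distance_to_confidence(d))
```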
Note that in our machine learning work and paper, we only used 37 drug target-disease pairs from ChEMBL III and 88 from ChEMBL IV after duplicates were removed, so that only one locus-to-gene pairing is kept in the training set (usually the one with the best GWAS association p-value). You can find the list of gold standards we used for training in supplementary table 8 at this link (An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci | Nature Genetics).
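
As a rough illustration of that de-duplication step, the sketch below keeps, for each gene-trait pair, the row with the smallest p-value. The field names (`gene_id`, `trait`, `locus`, `pval`), the function name `keep_best_locus` and the example rows are all hypothetical, and since the best p-value was only "usually" the criterion, this will not exactly reproduce the published training set.

```python
def keep_best_locus(rows):
    """Keep one row per (gene, trait) pair: the one with the lowest p-value."""
    best = {}
    for row in rows:
        key = (row["gene_id"], row["trait"])
        if key not in best or row["pval"] < best[key]["pval"]:
            best[key] = row
    return list(best.values())


# Made-up rows with hypothetical field names, for illustration only.
rows = [
    {"gene_id": "GENE_A", "trait": "trait_1", "locus": "locus_1", "pval": 1e-8},
    {"gene_id": "GENE_A", "trait": "trait_1", "locus": "locus_2", "pval": 1e-12},
    {"gene_id": "GENE_B", "trait": "trait_2", "locus": "locus_3", "pval": 1e-9},
]
deduplicated = keep_best_locus(rows)
for row in deduplicated:
    print(row["gene_id"], row["trait"], row["locus"], row["pval"])
```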

Hope this helps,

Maya

Hi Maya,

Many thanks - that makes a lot of sense.

Best,

Andrea