Level of confidence of L2G gold standards

Andrea_RM · 1 November 2022 16:03

Dear OpenTargets team,

I have a question regarding your dataset of gold standards used to train the L2G model.

I’ve noticed that some gene-trait pairs with evidence class == drug and the same type of evidence (e.g. ChEMBL_III) have different levels of confidence (e.g. “low”, “medium”, “high”). Why is this the case, and how should it be interpreted?

Many thanks,

Best wishes,

Andrea

Annalisa_Buniello · 4 November 2022 15:41

Hi Andrea,
for our gold standard training set, we used anything that was classified as medium or high confidence from ChEMBL. These are drugs in phase III or phase IV clinical trials, respectively, whose targets overlap with GWAS loci, matched for the same indication. Anything that is low confidence corresponds to drugs in phase II clinical trials or earlier.
I hope this helps!
Thanks,
Annalisa

Andrea_RM · 4 November 2022 16:18

Hi Annalisa,

Many thanks for your response. I’m still a bit confused. If I select from this file (https://raw.githubusercontent.com/opentargets/genetics-gold-standards/master/gold_standards/processed/gwas_gold_standards.191108.tsv) rows where “metadata.set_label” == “ChEMBL_IV”, I get the following counts for each “gold_standard_info.highest_confidence” category:

High (288)
Low (144)
Medium (105)

Based on your explanation it sounded as if all of these should be considered as “high”?
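
For anyone wanting to run the same tally, here is a minimal sketch using only the Python standard library. It assumes the TSV has a header row containing the two column names quoted above. The `count_confidence` helper and the tiny sample are made up for illustration and are not taken from the real file.

```python
import csv
import io
from collections import Counter


def count_confidence(tsv_text, set_label="ChEMBL_IV"):
    """Count highest_confidence values among rows with the given set_label."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter()
    for row in reader:
        # Column names as quoted in the post; assumed to match the file header.
        if row["metadata.set_label"] == set_label:
            counts[row["gold_standard_info.highest_confidence"]] += 1
    return counts


# Made-up sample, only to show the expected shape of the input.
sample = (
    "metadata.set_label\tgold_standard_info.highest_confidence\n"
    "ChEMBL_IV\tHigh\n"
    "ChEMBL_IV\tLow\n"
    "ChEMBL_III\tMedium\n"
    "ChEMBL_IV\tHigh\n"
)
print(dict(count_confidence(sample)))
```

To use it on the real data, read the downloaded file's contents into a string and pass that as `tsv_text`.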

Many thanks

Andrea

MayaGhoussaini · 8 November 2022 02:09

Hi Andrea, confidence levels in the gold standard list were adjusted to indicate the distance of the lead variant to the drug target: variant–gene distances of <500, <250 or <100 kb were assigned confidences of low, medium and high, respectively. For example, a phase IV drug target that overlaps with a GWAS locus where the lead SNP is <100 kb from the TSS of the gene would qualify as high confidence.
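
The distance thresholds can be sketched as a small function. This is only an illustration of the rule as described here, not the code used to build the gold standards. It assumes the distance is given in base pairs with 1 kb = 1,000 bp and that the cut-offs are strict. What happens at 500 kb or beyond is not stated, so the sketch returns `None` there as a placeholder.

```python
def confidence_from_distance(distance_bp):
    """Map a lead-variant-to-TSS distance (in bp) to a confidence label."""
    if distance_bp < 0:
        raise ValueError("distance must be non-negative")
    if distance_bp < 100_000:
        return "High"
    if distance_bp < 250_000:
        return "Medium"
    if distance_bp < 500_000:
        return "Low"
    # Not covered by the description above; None is an assumption.
    return None


for d in (50_000, 180_000, 400_000, 750_000):
    print(d, confidence_from_distance(d))
```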
Note that in our machine learning work and paper, we only used 37 drug target-disease pairs from ChEMBL III and 88 from ChEMBL IV after duplicates were removed, so that only one locus-to-gene pairing is kept in the training set (usually the one with the best GWAS association p-value). You can find the list of gold standards we used for training in Supplementary Table 8 of the paper (An open approach to systematically prioritize causal variants and genes at all published human GWAS trait-associated loci | Nature Genetics).
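
The de-duplication step could look roughly like the sketch below, which keeps the record with the smallest p-value for each gene-trait pair. The field names (`gene_id`, `trait`, `pval`) and the example records are hypothetical. Since the best p-value was only "usually" the criterion, this will not necessarily reproduce the published training set.

```python
def keep_best_per_pair(records):
    """Keep one record per (gene, trait) pair: the one with the lowest p-value."""
    best = {}
    for rec in records:
        key = (rec["gene_id"], rec["trait"])  # hypothetical field names
        if key not in best or rec["pval"] < best[key]["pval"]:
            best[key] = rec
    return list(best.values())


# Made-up records for illustration only.
records = [
    {"gene_id": "GENE_A", "trait": "trait_1", "locus": "locus_1", "pval": 1e-9},
    {"gene_id": "GENE_A", "trait": "trait_1", "locus": "locus_2", "pval": 1e-20},
    {"gene_id": "GENE_B", "trait": "trait_2", "locus": "locus_3", "pval": 5e-8},
]
for rec in keep_best_per_pair(records):
    print(rec["gene_id"], rec["trait"], rec["locus"], rec["pval"])
```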

Hope this helps,

Maya

Andrea_RM · 8 November 2022 06:31

Hi Maya,

Many thanks - that makes a lot of sense

Best,

Andrea
