How to interpret Variant-to-Gene (V2G) and Locus-to-Gene (L2G) scores in Open Targets Genetics

The V2G score

The V2G score is a disease-agnostic score that we developed early on to assign likely causal genes to any given variant in gnomAD v2.1. The score aggregates evidence from a range of datasets that overlap the variant (chromatin interaction, QTL, in silico functional prediction and distance).

The data aggregation process and the weighting applied to each dataset are described on the Assigning variants to genes (V2G) page of the Open Targets Genetics documentation.

Diagram of the Open Targets Genetics Variant-to-Gene (V2G) pipeline

In the end, the pipeline provides a single aggregated score for each variant-gene pair. The score has important limitations: it treats variants as independent and does not use the information available at a given locus (for example, credible sets or disease specificity) to assign causal genes.
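
As a rough illustration of what a single aggregated score per variant-gene pair means, the sketch below combines per-dataset evidence scores with a weighted mean. The function name, the source names and the weights are all hypothetical placeholders chosen for this example; they are not the values or the exact procedure used by the Open Targets Genetics pipeline, which are described in the V2G documentation.

```python
# Illustrative only: source names and weights are made-up placeholders,
# not the weights used by Open Targets Genetics.
HYPOTHETICAL_WEIGHTS = {
    "functional_prediction": 1.0,
    "qtl": 0.5,
    "chromatin_interaction": 0.25,
    "distance": 0.25,
}


def aggregate_v2g(evidence, weights=HYPOTHETICAL_WEIGHTS):
    """Weighted mean of per-source scores (each in [0, 1]) for one variant-gene pair.

    Sources with no evidence for the pair contribute 0.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(w * evidence.get(source, 0.0) for source, w in weights.items())
    return weighted_sum / total_weight


# One variant scored against two candidate genes (made-up evidence).
gene_a = aggregate_v2g({"qtl": 0.8, "distance": 1.0})
gene_b = aggregate_v2g({"distance": 0.5})
print(gene_a, gene_b)
```

Nothing in this calculation depends on the other variants at the locus or on the trait being studied, which is why a score of this kind cannot capture locus-level or disease-specific information.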

The L2G score

More recently, we developed the Locus-to-Gene (L2G) score, a disease-specific score that uses a machine learning model and is a much better approach to prioritising genes.

For this, we integrate fine-mapping credible set analysis with functional genomics data (including pathogenicity prediction, colocalisation with molecular QTLs, genomic distance and chromatin interaction data) to generate predictive features.

We then train a supervised model to predict the causal genes at each locus, using over 400 gold-standard positive GWAS loci for which we are confident of the implicated gene (see the opentargets/genetics-gold-standards repository on GitHub).

The L2G score is computed by an XGBoost model from ~50 input features for each gene at each locus. The model is trained using “gold standard” genes, taken from a number of GWAS traits, as true positive examples.

Diagram of the Locus-to-Gene (L2G) pipeline in Open Targets Genetics

Interpreting the L2G score

The score is calibrated so that a gene’s score indicates the fraction of genes at or above that score threshold that would be expected to be true positives. For example, we expect that 80% of genes with a score >= 0.8 would be causal genes, assuming that the characteristics of the chosen GWAS locus are similar to those in the training dataset.

In other words, the score can be interpreted as reflecting a false discovery rate (FDR) threshold of 1 - L2G_score. For example, among all genes with L2G >= 0.8, about 20% are expected to be false positives. Note that this definition means that for a gene with a score of exactly 0.8, the probability that it is causal would be slightly less than 80%, just as the last items “discovered” at an FDR threshold of 20% have a greater than 20% chance of being false positives.
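
The distinction between the threshold-level FDR and the probability for an individual gene can be made concrete with a small sketch. The helper names and the per-gene probabilities below are invented for illustration; they are not L2G outputs. The first function applies the 1 - L2G_score reading, and the second shows how the fraction of true positives "at or above" a gene is an average over all higher-ranked genes, so the lowest-ranked gene in the set sits below that average.

```python
def expected_fdr(l2g_threshold):
    """Expected fraction of false positives among genes scoring at or above the threshold."""
    if not 0.0 <= l2g_threshold <= 1.0:
        raise ValueError("L2G scores lie between 0 and 1")
    return 1.0 - l2g_threshold


def running_precision(local_probs):
    """For genes ranked best-first, mean probability of being causal among the top k."""
    ranked = sorted(local_probs, reverse=True)
    precisions = []
    total = 0.0
    for k, prob in enumerate(ranked, start=1):
        total += prob
        precisions.append(total / k)
    return ranked, precisions


# Made-up per-gene probabilities of being causal, for illustration only.
ranked, precisions = running_precision([0.95, 0.90, 0.85, 0.80, 0.70, 0.60])
for prob, precision in zip(ranked, precisions):
    print(f"per-gene probability {prob:.2f} -> fraction causal at or above this gene {precision:.3f}")
```

In this toy list the six genes together have an expected true-positive fraction of about 0.8, yet the lowest-ranked gene has a per-gene probability of only 0.6.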

If you have a disease/trait in mind, it is much better to use the L2G score than the V2G score to make causal gene inferences. For the most robust interpretation, when there are multiple GWAS for a given trait (or related traits), we advise looking at the L2G results for the equivalent locus in each GWAS. You may also observe cases where there are multiple independent signals at a locus, and you can evaluate the results for each of those distinct signals.
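
One way to compare L2G results for the equivalent locus across several GWAS is to tabulate them per gene. The sketch below assumes you have already collected (study, gene, L2G score) triples yourself; the study and gene names, the scores and the 0.5 cut-off are all hypothetical and chosen only to show the bookkeeping.

```python
from collections import defaultdict

# Hypothetical L2G results for the same locus in three GWAS of related traits.
results = [
    ("STUDY_A", "GENE1", 0.85), ("STUDY_A", "GENE2", 0.10),
    ("STUDY_B", "GENE1", 0.72), ("STUDY_B", "GENE2", 0.31),
    ("STUDY_C", "GENE1", 0.64), ("STUDY_C", "GENE3", 0.22),
]


def summarise_by_gene(records, threshold=0.5):
    """Summarise how consistently each gene is prioritised across studies.

    The threshold is an arbitrary illustrative cut-off, not a recommended value.
    """
    scores_by_gene = defaultdict(list)
    for _study, gene, score in records:
        scores_by_gene[gene].append(score)

    summary = []
    for gene, scores in scores_by_gene.items():
        summary.append({
            "gene": gene,
            "n_studies": len(scores),
            "n_above": sum(score >= threshold for score in scores),
            "min": min(scores),
            "max": max(scores),
        })
    # Genes supported in the most studies first, ties broken by best score.
    summary.sort(key=lambda row: (row["n_above"], row["max"]), reverse=True)
    return summary


summary = summarise_by_gene(results)
for row in summary:
    print(row)
```

A gene that scores highly in every study, like the hypothetical GENE1 here, is a more robust candidate than one that scores highly in a single GWAS. The same table can be built per independent signal when a locus has several.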

Further reading

For more information on the V2G score: Assigning variants to genes (V2G) - Open Targets Genetics documentation

For more information on the L2G score: Prioritising causal genes at GWAS loci (L2G) - Open Targets Genetics Documentation