Overrepresentation of MVP cohort PheCodes among diseases with very few colocalized molecular QTLs

shastvx1 · 1 December 2025 19:21

Thanks for creating and maintaining this amazing resource!

In my analysis, I am running a pipeline to extract molQTLs for a range of diseases from OpenTargets. To do this, I first subset to single-phenotype GWAS studies from the study table and only use the study with the largest number of loci with fine-mapping credible sets (by cross-referencing to credible_set). Then I extract colocalized molecular QTLs (ciseQTLs, cispQTLs, cistxQTLs, cisexonQTLs and transpQTLs) via the colocalization_ecaviar and colocalization_coloc tables. Interestingly, I notice a striking pattern: in the top 100 diseases ranked by num_credible_sets, I find that >95% of diseases with very few (<5) molecular QTL colocalization hits are PheCodes and >95% of these are GWASs from the MVP cohort.
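
For readers who want to reproduce the selection logic described above, here is a minimal sketch in plain Python. It is not the actual pipeline: the field names (`studyId`, `diseaseIds`, `studyLocusId`, `leftStudyLocusId`, `rightStudyLocusId`, `rightStudyType`) and the QTL type labels are assumptions for illustration and should be checked against the schema of the release in use. The records are toy data, not Open Targets results.

```python
from collections import Counter, defaultdict


def top_study_per_disease(studies, credible_sets):
    """For each disease, pick the single-phenotype study with the most credible sets.

    Ties are resolved in favour of the study seen first.
    Returns {disease_id: (study_id, n_credible_sets)}.
    """
    n_cs = Counter(cs["studyId"] for cs in credible_sets)
    best = {}
    for s in studies:
        if len(s["diseaseIds"]) != 1:  # keep single-phenotype studies only
            continue
        disease = s["diseaseIds"][0]
        count = n_cs.get(s["studyId"], 0)
        if disease not in best or count > best[disease][1]:
            best[disease] = (s["studyId"], count)
    return best


def count_molqtl_colocs(best, credible_sets, colocs, qtl_types):
    """Count distinct (GWAS locus, QTL locus) colocalization pairs per disease."""
    locus_to_study = {cs["studyLocusId"]: cs["studyId"] for cs in credible_sets}
    study_to_disease = {study_id: d for d, (study_id, _) in best.items()}
    pairs = defaultdict(set)
    for c in colocs:
        if c["rightStudyType"] not in qtl_types:
            continue
        study = locus_to_study.get(c["leftStudyLocusId"])
        disease = study_to_disease.get(study)
        if disease is not None:
            pairs[disease].add((c["leftStudyLocusId"], c["rightStudyLocusId"]))
    return {d: len(pairs[d]) for d in best}


# Toy records (hypothetical IDs) standing in for the study, credible_set
# and colocalization tables.
studies = [
    {"studyId": "S1", "diseaseIds": ["D1"]},
    {"studyId": "S2", "diseaseIds": ["D1"]},
    {"studyId": "S3", "diseaseIds": ["D2"]},
    {"studyId": "S4", "diseaseIds": ["D1", "D2"]},  # multi-phenotype, skipped
]
credible_sets = [
    {"studyLocusId": "L1", "studyId": "S1"},
    {"studyLocusId": "L2", "studyId": "S2"},
    {"studyLocusId": "L3", "studyId": "S2"},
    {"studyLocusId": "L4", "studyId": "S3"},
    {"studyLocusId": "L5", "studyId": "S4"},
]
colocs = [
    {"leftStudyLocusId": "L2", "rightStudyLocusId": "Q1", "rightStudyType": "eqtl"},
    {"leftStudyLocusId": "L3", "rightStudyLocusId": "Q2", "rightStudyType": "pqtl"},
    {"leftStudyLocusId": "L1", "rightStudyLocusId": "Q3", "rightStudyType": "eqtl"},
    {"leftStudyLocusId": "L4", "rightStudyLocusId": "Q4", "rightStudyType": "gwas"},
]

best = top_study_per_disease(studies, credible_sets)
hits = count_molqtl_colocs(best, credible_sets, colocs, {"eqtl", "pqtl"})
```

Diseases whose count in `hits` falls below 5 would form the "very few molecular QTLs" group discussed here.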

I was wondering if others have noticed a similar pattern, and what factors could be contributing to this. I am using the latest release (25_09).

Thanks,

Viv

Daniel_Considine · 2 December 2025 17:30

Hi Viv,

Thanks for the kind words and for showing an interest in Open Targets.

I’ve briefly looked at replicating the analysis you describe above. I think what you’re observing is a consequence of a few things:

1. MVP studies often have the largest sample size and power, and they cover many diseases. As a result, for any given diseaseId it's likely that an MVP study will have the most credible sets from a statistical point of view (more power generally means detecting more signals, including those with smaller effects). So your study/credible set selection criteria, along with statistical power, could be why you are seeing MVP studies in the colocalisation results more often.
2. We're aware that we potentially have an increased false discovery rate for SuSiE-inf fine-mapping of cohorts with mixed ancestries (which includes MVP). This is because we currently use the LD matrix of the ancestry group that makes up the majority of the study sample (in this case non-Finnish European). However, this is not going to reflect the actual in-sample LD very well. In the upcoming December release we plan to use only PICS fine-mapped credible sets for studies with mixed ancestries, to be more conservative. In the future we are going to implement a more sophisticated way of estimating the LD structure for mixed ancestry studies, but there is no ETA for that, I'm afraid.

I hope this is helpful!

shastvx1 · 8 December 2025 16:57

Thanks for your prompt response @Daniel_Considine!

Indeed, that is true. Based on 150 or so diseases I've looked at, 80% of these show MVP studies to have the largest sample size. But within these, I observe an overrepresentation of MVP cohorts in diseases with very few molecular QTLs compared to those with many. More specifically, over 95% of the 75 diseases with no or very few QTL hits come from MVP, whereas only 60% of those with many molecular QTLs come from MVP. So this makes me think your second point could be a potential reason for this.
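
One way to check whether a difference like this could arise by chance is a two-sided Fisher's exact test on the 2x2 table of group (few vs. many molecular QTLs) against cohort (MVP vs. other). The sketch below uses only the standard library. The counts in the usage example are placeholders, not figures from this analysis: the size of the "many QTLs" group is not stated above, so 75 is assumed purely for illustration.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance so ties with the observed table are included.
    total = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Placeholder counts (assumed for illustration only):
# rows = few-QTL group, many-QTL group; columns = MVP, non-MVP.
p_value = fisher_exact_two_sided(72, 3, 45, 30)
```

A small `p_value` would only say the two groups differ in MVP share. It would not by itself separate the statistical power explanation from the fine-mapping explanation.
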
I appreciate this thorough response. Using PICS fine-mapped credible sets would be a good initial pass, but using an estimated LD structure for mixed ancestry studies would be the way to go. So, if there is a board tracking future improvements, you could put a +1 for this effort on my behalf!

Cheers,

Viv

Daniel_Considine · 10 December 2025 11:22

No problem!

After responding to you I did a bit more investigation of our fine-mapping of MVP studies vs what is reported in the manuscript. I don’t have any hard comparison metrics - I just visually inspected a few studies to compare number of credible sets, credible set size, and the variant PIP distribution. It actually seems like our fine-mapping pipeline is doing a pretty good job of replicating their results. The main reason we report so many more credible sets is that MVP only fine-mapped signals with P-value < 5e-11, whereas we fine-map every signal with P-value < 1e-8. If I restrict our MVP credible sets to that threshold for the lead variant, we actually get similar results.
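
The threshold restriction described above could be sketched roughly as follows. This assumes each credible set stores its lead variant P-value as a mantissa and an exponent (the field names `pValueMantissa` and `pValueExponent` are an assumption here and should be checked against the release schema). Comparing on the -log10 scale avoids floating-point underflow for very small P-values. The records are toy data.

```python
from math import log10


def neg_log10_p(mantissa, exponent):
    """-log10 of a P-value stored as mantissa * 10**exponent."""
    return -(log10(mantissa) + exponent)


def filter_by_lead_p(credible_sets, threshold):
    """Keep credible sets whose lead variant P-value is strictly below threshold."""
    cutoff = -log10(threshold)
    return [
        cs for cs in credible_sets
        if neg_log10_p(cs["pValueMantissa"], cs["pValueExponent"]) > cutoff
    ]


# Toy credible sets (hypothetical IDs and P-values).
toy_sets = [
    {"studyLocusId": "A", "pValueMantissa": 1.2, "pValueExponent": -9},
    {"studyLocusId": "B", "pValueMantissa": 3.0, "pValueExponent": -12},
    {"studyLocusId": "C", "pValueMantissa": 4.9, "pValueExponent": -11},
    {"studyLocusId": "D", "pValueMantissa": 2.0, "pValueExponent": -320},
]

lenient = [cs["studyLocusId"] for cs in filter_by_lead_p(toy_sets, 1e-8)]
strict = [cs["studyLocusId"] for cs in filter_by_lead_p(toy_sets, 5e-11)]
```

Credible sets present in `lenient` but not in `strict` correspond to the extra signals fine-mapped under the 1e-8 threshold.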

Regarding the overlap with molecular-QTLs: MVP studies having the largest sample size/power is going to mean they will detect signals with smaller effect sizes, and it’s possible these weaker signals are more likely to overlap with certain subsets of molecular QTLs. It would be interesting to investigate this further though.

Appreciate the feedback - improving the fine-mapping pipeline is definitely high on our list of priorities!
