Searching credibleSets by region overlap

I’m trying to retrieve all credible sets (study-loci) that overlap a genomic interval (e.g., a ±500kb window around a gene). I expected the regions argument of credibleSets to behave like an interval/overlap filter, but it seems to behave as an exact string match on a pre-defined region identifier.

Example:

  • credibleSets(regions:["chr18:59371918-61371918"]) returns 4 rows

  • credibleSets(regions:["chr18:59371917-61371918"]) returns 0 rows

This suggests the API is not doing coordinate overlap but requires an exact “region” key.

Is this exact-match behavior intentional?

And if yes, is there any supported way in v4 to query all credible sets overlapping a given region (“chr:start-end”), without knowing the exact pre-defined region strings in advance?

Hi @Flizzy

You’re correct that the region column is a string. It records which region each SuSiE fine-mapped credible set was derived from, so it can only be matched exactly and is not suited to overlap filtering. This field will also be NULL for credible sets derived from PICS fine-mapping.

Unfortunately I’m not very familiar with using the API or BigQuery for this type of data manipulation, and I’m not certain our data allows for this kind of API query. I will double-check with someone in the team after Christmas, but in the meantime I’ll show you how I would do this locally in PySpark. Credible sets can be downloaded here: Open Targets Platform

  • Find the minimum and maximum values for the variant positions, via variantIds, in the StudyLocus/credible set locus object. VariantIds always have the format 18_59371918_ref_alt so we can split by ‘_’ and take the second element:
import pyspark.sql.functions as f
from gentropy.common.session import Session

session = Session()

# Position (second "_"-separated field) of every variant in the locus array.
positions = f.transform(
    f.col("locus.variantId"),
    lambda v: f.split(v, "_").getItem(1).cast("int"),
)

cs = (
    # Local copy of the downloaded credible_set dataset; adjust the path to yours.
    session.spark.read.parquet("/users/dc16/data/releases/25.12/credible_set")
    # The smallest and largest variant positions give the span of each credible set.
    .withColumn("locusStart", f.array_min(positions))
    .withColumn("locusEnd", f.array_max(positions))
)
  • Then use these locusStart and locusEnd columns, along with the chromosome column, to filter the credible set rows according to your criteria. For an inclusive overlap, locusStart must be less than or equal to the end of your region, and locusEnd must be greater than or equal to the start of your region:
# Inclusive overlap test against chr18:59371918-61371918.
cs_filtered = cs.filter(f.col("chromosome") == "18").filter(
    (f.col("locusStart") <= 61371918) & (f.col("locusEnd") >= 59371918)
)
cs_filtered.count()  # 2161

This returns 2,161 credible sets in total.
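
If it helps to sanity-check the overlap logic without Spark, here is a minimal pure-Python sketch of the same test. The helper names (parse_region, locus_span, overlaps) are made up for illustration and are not part of Gentropy or the API. It assumes region strings look like "chr18:59371918-61371918", that variant IDs follow the 18_59371918_ref_alt pattern described above, and that both ends of the interval are inclusive.

```python
import re
from typing import Iterable, Tuple

# Assumed region format: optional "chr" prefix, then chromosome:start-end.
REGION_RE = re.compile(r"^(?:chr)?(?P<chrom>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(region: str) -> Tuple[str, int, int]:
    """Split a "chr:start-end" string into (chromosome, start, end)."""
    match = REGION_RE.match(region.strip())
    if match is None:
        raise ValueError(f"Unrecognised region string: {region!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"Region start is after region end: {region!r}")
    return match["chrom"], start, end


def locus_span(variant_ids: Iterable[str]) -> Tuple[int, int]:
    """Return the (min, max) position over variant IDs like 18_59371918_ref_alt."""
    positions = [int(variant_id.split("_")[1]) for variant_id in variant_ids]
    return min(positions), max(positions)


def overlaps(variant_ids: Iterable[str], chromosome: str, region: str) -> bool:
    """Inclusive interval overlap between a credible set and a query region."""
    chrom, start, end = parse_region(region)
    if chromosome != chrom:
        return False
    locus_start, locus_end = locus_span(variant_ids)
    # Same condition as the PySpark filter: the two intervals share a position.
    return locus_start <= end and locus_end >= start
```

With this, shifting the query start by one base (59371917 instead of 59371918) still matches the same credible sets, which is the behaviour the original question was expecting from the regions argument.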

Apologies again that I can’t give you a more comprehensive answer for the API query. Our Gentropy Python package documentation might also be useful: Open Targets Gentropy