Associated studies: locus-to-gene pipeline

brcopeland · 16 December 2021 01:36

I’m getting started using opentargets and have downloaded all the of the parquet tables for both the platform and genetics from the EBI FTP. i want to start just doing something simple to start with, recreate the Associated studies:locus-to-gene pipeline table. I chose the gene COL4A3 and when I go the page Open Targets Genetics I find 112 records. On the other hand using pyspark and the l2g table loaded into a dataframe in tables with the following python code:
tables["l2g"].filter(tables["l2g"].gene_id == "ENSG00000169031").count()
I get 114 records. Do you have any suggestion here?

JarrodBaker · 16 December 2021 14:45

The Genetics Platform (UI) isn’t an exact representation of the raw data. When you look at the entry for COL4A3 in the UI, what you are seeing is generated by a database query which looks like:

SELECT study_id,
       chrom,
       pos,
       ref,
       alt,
       ... other fields
FROM ot.l2g_by_gsl l
ANY INNER JOIN (
    SELECT *,
           CAST(lead_chrom, 'String') as stringChrom
    FROM ot.v2d_by_stchr
    ) v on (v.study_id = l.study_id and
            v.stringChrom = l.chrom and
            v.lead_pos = l.pos and
            v.lead_ref = l.ref and
            v.lead_alt = l.alt)
PREWHERE gene_id = 'ENSG00000169031';

This query returns 112 records which are displayed in the UI. If you only look at the l2g table you would indeed find 114 records for COL4A3 (ENSG00000169031).

The discrepancy in the counts is because we only display in the UI studies which have a corresponding entry in the v2d table.

brcopeland · 16 December 2021 18:08

Thank you Jarrod. Is this query in the codebase somewhere? Additionally is there any documentation as to how the tables are used to serve up the data on the site?

brcopeland · 16 December 2021 18:28

One further somewhat related question - for the listed traits is there any simple way to distinguish between diseases vs. non-disease phenotypes? And if so is there a field in a table for this?

JarrodBaker · 22 December 2021 08:31

The query can be found in the code here. As far as I’m aware there is currently no non-code documentation on how different tables are generated.

I’ll have to get a colleague to follow up with you regarding how to distinguish between diseases and non-disease phenotypes. We’re a little light-handed this close to Christmas, but I’ll make sure someone follows up.

brcopeland · 22 December 2021 17:41

Thanks again Jarrod. What I ended up doing was, for each record in diseases, for each element of diseases.parents, follow it up to the root(s) in the table. This gives me sufficiently few overarching categories that it is simple to make a decision for each such category.

I hope you and your team have a nice Christmas.

Topic		Replies	Views
L2G Scoring Pipeline Implementation Data Access genetics-portal	5	398	14 March 2024
How does the Open Targets Platform process data from Open Targets Genetics? Community Feedback genetics-portal , ot-platform	3	490	23 February 2023
Query by GeneID and Phenotype to get L2G scores GraphQL API genetics-portal	1	329	15 March 2023
Accessing locus-to-gene (L2G) and colocalisation data from Open Targets Genetics by querying the API using Python GraphQL API graphql , genetics-portal	0	786	3 June 2021
V2D dataset replication via Google BigQuery Google BigQuery/Cloud bigquery , genetics-portal	1	344	4 May 2023

Associated studies: locus-to-gene pipeline

Related topics