Associated studies: locus-to-gene pipeline

I’m getting started using opentargets and have downloaded all the of the parquet tables for both the platform and genetics from the EBI FTP. i want to start just doing something simple to start with, recreate the Associated studies:locus-to-gene pipeline table. I chose the gene COL4A3 and when I go the page Open Targets Genetics I find 112 records. On the other hand using pyspark and the l2g table loaded into a dataframe in tables with the following python code:
tables["l2g"].filter(tables["l2g"].gene_id == "ENSG00000169031").count()
I get 114 records. Do you have any suggestion here?

The Genetics Platform (UI) isn’t an exact representation of the raw data. When you look at the entry for COL4A3 in the UI, what you are seeing is generated by a database query which looks like:

SELECT study_id,
       chrom,
       pos,
       ref,
       alt,
       ... other fields
FROM ot.l2g_by_gsl l
ANY INNER JOIN (
    SELECT *,
           CAST(lead_chrom, 'String') as stringChrom
    FROM ot.v2d_by_stchr
    ) v on (v.study_id = l.study_id and
            v.stringChrom = l.chrom and
            v.lead_pos = l.pos and
            v.lead_ref = l.ref and
            v.lead_alt = l.alt)
PREWHERE gene_id = 'ENSG00000169031';

This query returns 112 records which are displayed in the UI. If you only look at the l2g table you would indeed find 114 records for COL4A3 (ENSG00000169031).

The discrepancy in the counts is because we only display in the UI studies which have a corresponding entry in the v2d table.

Thank you Jarrod. Is this query in the codebase somewhere? Additionally is there any documentation as to how the tables are used to serve up the data on the site?

One further somewhat related question - for the listed traits is there any simple way to distinguish between diseases vs. non-disease phenotypes? And if so is there a field in a table for this?

The query can be found in the code here. As far as I’m aware there is currently no non-code documentation on how different tables are generated.

I’ll have to get a colleague to follow up with you regarding how to distinguish between diseases and non-disease phenotypes. We’re a little light-handed this close to Christmas, but I’ll make sure someone follows up.

Thanks again Jarrod. What I ended up doing was, for each record in diseases, for each element of diseases.parents, follow it up to the root(s) in the table. This gives me sufficiently few overarching categories that it is simple to make a decision for each such category.

I hope you and your team have a nice Christmas.