ID mapping table

Mathias_Saver · 24 November 2023 10:43

Hi All,

Recently I’ve been interested in having a reference table for mapping ENSEMBL gene ids to HGNC symbols. I thought that I could get that information by downloading the Targets data from OT (Index of /pub/databases/opentargets/platform/23.09/output/etl/parquet/targets). However, after downloading and parsing the parquet files I only find ~1000 unique ENSEMBL gene ids. In contrast, I’ve used a similar approach to download and parse the OT DepMap data (Index of /pub/databases/opentargets/platform/23.09/output/etl/parquet/targetEssentiality), and in that case I do find ~17000 unique ENSEMBL ids.

Is it possible to download from OT a set of id mapping data/parquet files for all genes in the human genome?

Kind regards,

-Mathias

dsuveges · 24 November 2023 11:11

Hi Mathias,

I suspect you need to double check your local files that you have retrieved from ftp. I have just tried the target index from the September release and it has 62.7k unique ids.

Best,
Daniel

ochoa · 24 November 2023 11:32

You need to download the whole directory and read all the partitions at once (no need for for loops). Each individual file in this directory is just a chunk of the data (used for parallelised computing).
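As a minimal sketch of reading every partition in one go, assuming the R `arrow` and `dplyr` packages: `open_dataset()` can be pointed at a directory and treats all parquet files inside it as one table. The snippet below builds a toy two-file directory so that it runs on its own; the file names and the three example rows are illustrative only, and the `id` and `approvedSymbol` column names are assumed to match the targets dataset. For real use, `targets_dir` would point at your local copy of the downloaded directory.

```r
library(arrow)
library(dplyr)

# Toy stand-in for the downloaded directory (hypothetical path and file names).
# For real data, set targets_dir to your local copy and skip the write_parquet() calls.
targets_dir <- file.path(tempdir(), "targets")
dir.create(targets_dir, showWarnings = FALSE)

write_parquet(
  data.frame(id = c("ENSG00000141510", "ENSG00000012048"),
             approvedSymbol = c("TP53", "BRCA1")),
  file.path(targets_dir, "part-00000.parquet")
)
write_parquet(
  data.frame(id = "ENSG00000139618", approvedSymbol = "BRCA2"),
  file.path(targets_dir, "part-00001.parquet")
)

# Point open_dataset() at the directory, not at a single file:
# every part file is scanned as a chunk of the same table.
id_map <- open_dataset(targets_dir) |>
  select(id, approvedSymbol) |>
  collect()

# Number of unique ids across all part files.
n_distinct(id_map$id)
```

Selecting only the columns needed before `collect()` keeps the result small, which helps when the full dataset has many nested columns.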

You might also benefit from looking at the searchTarget dataset (Index of /pub/databases/opentargets/platform/23.09/output/etl/parquet/searchTarget). This dataset feeds the Open Targets Platform search, but it contains a lookup with all the alternative IDs we use for every target. The data is derived from the target dataset, so it’s basically the same information, but it might be in a friendlier format if you want to build mappings beyond the approvedSymbol.

In our upcoming release there will also be a graphQL API endpoint to perform the same task:

github.com/opentargets/issues

API endpoint for verifying entity list for Upload target list

opened 12:47PM - 12 Oct 23 UTC

closed 12:01PM - 09 Nov 23 UTC

prashantuniyal02

Backend API AOTF PPP

Creating an API endpoint for verifying entity list for enabling upload of a target/disease list

For a uploaded list of target, we need to match the uploaded entry to the following set of ids:
- Ensembl
- UniProt
- HGNC

In case an uploaded entry matches to multiple results, we will display all the matched results.

For a uploaded list of diseases, we need to match the uploaded entry to the following set of ids:
- EFO
- (other to be confirmed)

We also need to confirm how to deal with entries that do not yield a match in both the backend and the frontend.

Mathias_Saver · 24 November 2023 15:50

Thanks for your advice. I figured out that the error was happening for the following reason: I used read_parquet() from the ‘arrow’ package to read the parquet files. arrow::read_parquet() reads the parquet files into a tibble. I then used unnest() from tidyr to expand the list-columns. Apparently, some entries in these list-columns might be NULL or NA. If so, when using unnest(), these entries will get dropped. To fix this behavior one must specify ‘keep_empty = TRUE’ inside unnest().
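A small sketch of that behaviour on a toy tibble, assuming the `tibble` and `tidyr` packages. The ids and the `synonyms` column here are made up for illustration and are simpler than the real list-columns in the targets data:

```r
library(tibble)
library(tidyr)

# Toy table: one target with two synonyms, one with NULL, one with an empty vector.
targets <- tibble(
  id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  synonyms = list(c("s1", "s2"), NULL, character(0))
)

# Default: rows whose list element is NULL or has length zero are dropped,
# so the ids ENSG_B and ENSG_C disappear from the result.
dropped <- unnest(targets, synonyms)

# keep_empty = TRUE: those rows are kept, with NA in the unnested column.
kept <- unnest(targets, synonyms, keep_empty = TRUE)

length(unique(dropped$id))
length(unique(kept$id))
```

Counting unique ids before and after `unnest()` is a quick way to check whether rows were lost.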
