ID mapping table

Hi All,

Recently I’ve been interested in having a reference table for mapping ENSEMBL gene ids to HGNC symbols. I thought that I could get that information by downloading the targets data from OT (Index of /pub/databases/opentargets/platform/23.09/output/etl/parquet/targets). However, after downloading and parsing the parquet files I only find ~1000 unique ENSEMBL gene ids. In contrast, I’ve used a similar approach to download and parse the OT DepMap data (Index of /pub/databases/opentargets/platform/23.09/output/etl/parquet/targetEssentiality), and in that case I do find ~17000 unique ENSEMBL ids.

Is it possible to download from OT a set of id mapping data/parquet files for all genes in the human genome?

Kind regards,

-Mathias

Hi Mathias,

I suspect you need to double-check the local files that you retrieved from the FTP. I have just tried the target index from the September release and it has 62.7k unique ids.

Best,
Daniel

You need to download the whole directory and read all the partitions at once (no need for for-loops). Each individual file in this directory is just a chunk of the data (the chunks are used for parallelised computing).
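As a rough sketch of what "read all the partitions at once" can look like in R: arrow::open_dataset() accepts a directory and treats every parquet file in it as one table. The example below writes two tiny stand-in chunks to a temporary folder so it runs on its own; the ids and symbols are made up, and the column names id and approvedSymbol are assumed to match the targets dataset. With the real download, demo_dir would simply point at your local copy of the targets directory.

```r
library(arrow)
library(dplyr)

# Stand-in for the downloaded "targets" directory (hypothetical local path).
demo_dir <- file.path(tempdir(), "targets_demo")
unlink(demo_dir, recursive = TRUE)
dir.create(demo_dir)

# Two made-up chunks, mimicking how the release is split into part files.
write_parquet(
  data.frame(id = c("ENSG_A", "ENSG_B"), approvedSymbol = c("SYM_A", "SYM_B")),
  file.path(demo_dir, "part-00000.parquet")
)
write_parquet(
  data.frame(id = "ENSG_C", approvedSymbol = "SYM_C"),
  file.path(demo_dir, "part-00001.parquet")
)

# Open the whole directory as a single dataset; no loop over files needed.
id_map <- open_dataset(demo_dir) |>
  select(id, approvedSymbol) |>   # assumed column names
  collect()

# Number of unique ids across all chunks.
n_unique_ids <- n_distinct(id_map$id)
```

Selecting only the two columns before collect() keeps the list-columns out of memory entirely, which also sidesteps any unnesting if all you need is the id to symbol mapping.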

You might also benefit from looking at the searchTarget dataset (Index of /pub/databases/opentargets/platform/23.09/output/etl/parquet/searchTarget). This dataset feeds the Open Targets Platform search, but it contains a lookup with all the alternative IDs we use for every target. The data is derived from the target dataset, so it’s basically the same information, but it might be in a friendlier format if you want to build mappings beyond the approvedSymbol.

In our upcoming release there will also be a GraphQL API endpoint to perform the same task.

Thanks for your advice. I figured out that the error was happening for the following reason: I used read_parquet() from the ‘arrow’ package to read the parquet files. arrow::read_parquet() reads the parquet files into a tibble. I then used unnest() from tidyr to expand the list-columns. Apparently, some entries in these list-columns can be NULL or empty. If so, unnest() drops the corresponding rows by default, which is why most ids disappeared. To fix this behavior, one must specify ‘keep_empty = TRUE’ inside unnest().
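For anyone hitting the same thing, here is a minimal reproduction on a toy tibble (the ids and the synonyms column are made up for illustration, not taken from the OT data). It shows rows with a NULL or zero-length list entry vanishing under the default, and being kept as NA with keep_empty = TRUE.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Toy table: one id has two values, one has NULL, one has an empty vector.
toy <- tibble(
  id = c("ENSG_A", "ENSG_B", "ENSG_C"),
  synonyms = list(c("s1", "s2"), NULL, character(0))
)

# Default behaviour: rows whose list entry has size 0 are dropped.
dropped <- unnest(toy, synonyms)

# keep_empty = TRUE: size-0 entries become a single row with NA.
kept <- unnest(toy, synonyms, keep_empty = TRUE)

n_ids_dropped <- n_distinct(dropped$id)
n_ids_kept <- n_distinct(kept$id)
```

Comparing n_distinct(id) before and after unnest() is a quick check for whether rows are being lost this way.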
