I suspect you need to double-check the local files you retrieved from the FTP server. I have just tried the target index from the September (23.09) release and it contains 62.7k unique IDs.
You need to download the whole directory and read all the partitions at once (no need for `for` loops); each individual file in that directory is just a chunk of the data, used for parallelised computing.
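As a minimal sketch, assuming the target parquet directory from the 23.09 release has been downloaded to a local `targets/` folder (an illustrative path), `arrow::open_dataset()` reads all the part-files as a single dataset:

```r
library(arrow)
library(dplyr)

# open_dataset() treats the whole directory of part-files as one dataset,
# so there is no need to loop over the individual partitions.
targets <- open_dataset("targets/") |>
  select(id, approvedSymbol) |>
  collect()

# Should match the ~62.7k unique IDs in the 23.09 release
length(unique(targets$id))
```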
You might also benefit from looking at the searchTarget dataset (`/pub/databases/opentargets/platform/23.09/output/etl/parquet/searchTarget` on the FTP server). This dataset feeds the Open Targets Platform search, but it contains a lookup with all the alternative IDs we use for every target. The data is derived from the target dataset, so it's basically the same information, but it might be in a friendlier format if you want to build mappings beyond the approvedSymbol.
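It can be opened the same way; since the exact columns are not listed in this post, inspecting the schema first is a safe starting point (the local path is again illustrative):

```r
# Hypothetical local copy of the searchTarget partitions
search_target <- open_dataset("searchTarget/")
search_target$schema  # list the available columns before deciding what to unnest
```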
In our upcoming release there will also be a GraphQL API endpoint to perform the same task.
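The upcoming mapping endpoint itself is not shown in this post; as a hedged illustration, the existing `target` query on the current Platform GraphQL API can already resolve an approved symbol for a single Ensembl ID from R:

```r
library(httr)

# Existing `target` query on the current v4 GraphQL API; the new
# bulk-mapping endpoint mentioned above is not shown here.
query <- '
query targetSymbol($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    id
    approvedSymbol
  }
}'

res <- POST(
  "https://api.platform.opentargets.org/api/v4/graphql",
  body = list(query = query,
              variables = list(ensemblId = "ENSG00000157764")),
  encode = "json"
)

content(res)$data$target$approvedSymbol  # "BRAF"
```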
Thanks for your advice. I figured out that the error was happening for the following reason: I used read_parquet() from the 'arrow' package to read the parquet files. arrow::read_parquet() reads a parquet file into a tibble. I then used unnest() from tidyr to expand the list-columns. Apparently, some entries in these list-columns can be NULL or NA, and if so, those entries get silently dropped by unnest(). To fix this behaviour, one must specify `keep_empty = TRUE` inside unnest().
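A toy example of the behaviour described above (illustrative data, not the actual target dataset):

```r
library(tibble)
library(tidyr)

# List-column with a NULL entry, mimicking the problem rows
df <- tibble(
  id      = c("A", "B", "C"),
  symbols = list(c("x", "y"), NULL, "z")
)

unnest(df, symbols)                     # row "B" is silently dropped
unnest(df, symbols, keep_empty = TRUE)  # row "B" is kept, with symbols = NA
```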