Search pairs of DiseaseID and traitFromSource

hanl3 · 23 January 2026 18:29

In the study dataset from the OTG platform (v25.12/output), I only see the traitFromSource column and cannot find diseaseId. How can I obtain the corresponding pairs of traitFromSource and diseaseId?

Thanks!

irene · 23 January 2026 18:33

Hi @hanl3, and welcome to our Community!

The mapped disease IDs are in the diseaseIds field of the study dataset. You can browse our schema documentation here: Open Targets Platform

Best,

Irene

hanl3 · 23 January 2026 19:05

Thanks to Irene for the quick response. In the disease data, there are only four columns: id, code, name, and description. How can I merge this information with traitFromSource to create pairs between them?

irene · 23 January 2026 19:30

The information you need to create pairs of traitFromSource/diseaseId is all available in the study dataset. You only need the disease dataset if you want to get, for example, the therapeutic area or the normalised disease name for your list of disease IDs.

I don’t know exactly what your use case is, but if you’re interested in having a list of how the different diseases are represented at source, you might want to have a look at the evidence datasets too. For example, in the ClinVar germline evidence, you can find the disease labels ClinVar assigns to its evidence.

I hope this is helpful!

hanl3 · 23 January 2026 19:54

Here, I want to identify novel GWAS signals based on known GWAS datasets for our findings. I used credible sets, and I collected all available variant IDs and study IDs. However, I need to know which diseases are associated with these variant IDs. Using the study IDs, I can retrieve all available traitFromSource entries, but I cannot find the disease IDs, so I don’t know which diseases correspond to the traitFromSource traits.

hanl3 · 23 January 2026 20:01

However, I cannot find this entry in the study dataset file (part-00000-2c4d825c-7fce-4f14-ae0e-0c516c8624e1-c000.snappy.parquet). Could you please indicate where it should be located?

Szymon_Szyszkowski · 23 January 2026 20:27

@hanl3

It seems that when you read the dataset, you are missing all the nested columns.

Since the schema contains arrays and structs, it is useful to use an Arrow-based reader for the parquet files. With some engines, you only need to point to the directory containing the parquet files (example using polars):

In [1]: !rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.12/output/study .
receiving incremental file list

sent 29 bytes  received 158 bytes  124,67 bytes/sec
total size is 96.400.296  speedup is 515.509,60

In [2]: import polars as pl
In [3]: data = pl.read_parquet('study')
In [4]: data.select("diseaseIds", "studyId")
Out[4]: 
shape: (2_001_227, 2)
┌───────────────────┬─────────────────────────────────┐
│ diseaseIds        ┆ studyId                         │
│ ---               ┆ ---                             │
│ list[str]         ┆ str                             │
╞═══════════════════╪═════════════════════════════════╡
│ ["EFO_0004237"]   ┆ FINNGEN_R12_AUTOIMMUNE_HYPERTH… │
│ ["MONDO_0002354"] ┆ FINNGEN_R12_CD2_BENIGN_LARYNX   │
│ ["EFO_0005539"]   ┆ FINNGEN_R12_E4_ADRENAS          │
│ ["MONDO_0001330"] ┆ FINNGEN_R12_H7_PRESBYOPIA       │
│ ["EFO_0004611"]   ┆ GCST000337_7                    │
│ …                 ┆ …                               │
│ []                ┆ walker_2019_exon_neocortex_ens… │
│ []                ┆ walker_2019_ge_neocortex_ensg0… │
│ []                ┆ walker_2019_ge_neocortex_ensg0… │
│ []                ┆ walker_2019_ge_neocortex_ensg0… │
│ []                ┆ walker_2019_tx_neocortex_enst0… │
└───────────────────┴─────────────────────────────────┘
In [5]: !ls study/
part-00000-2c4d825c-7fce-4f14-ae0e-0c516c8624e1-c000.snappy.parquet  _SUCCESS

Hope it can help.

With kind regards,
Szymon

ochoa · 23 January 2026 20:36

@hanl3 What are you using to read the data? Can you give a reproducible example?

hanl3 · 23 January 2026 20:36

part-00000-2c4d825c-7fce-4f14-ae0e-0c516c8624e1-c000.snappy.parquet	2025-12-10 13:25	92M

hanl3 · 23 January 2026 20:49

Thank you all. I read the data using R, which may be the reason this happened.

library(arrow)
library(dplyr)
study <- open_dataset("Study/", format = "parquet") %>%
  collect() # now it's a normal data frame
cat("Columns in dataset:\n")
print(names(study))
cat("Number of rows after filtering:", nrow(study), "\n")

ochoa · 23 January 2026 21:45

If you want to stick to R, a workaround is to use sparklyr, like in the example showcased in the documentation: Download datasets | Open Targets Platform Documentation

hanl3 · 23 January 2026 23:22

Thanks ochoa for your help
