In the study dataset from the OTG platform (v25.12/output), I only see the traitFromSource column and cannot find diseaseId. How can I obtain the corresponding pairs of traitFromSource and diseaseId?
Thanks!
Hi @hanl3, and welcome to our Community!
The mapped disease IDs are in the diseaseIds field of the study dataset. You can browse our schema documentation on the Open Targets Platform.
Best,
Irene
Thanks to Irene for the quick response. In the disease data, there are only four columns: id, code, name, and description. How can I merge this information with traitFromSource to create pairs between them?
The information you need to create pairs of traitFromSource/diseaseId is all available in the study dataset. You only need the disease dataset if you want to get, for example, the therapeutic area or the normalised disease name for your list of disease IDs.
I don't know exactly what your use case is, but if you're interested in having a list of how the different diseases are represented at source, you might want to have a look at the evidence datasets too. For example, in the ClinVar germline evidence, you can find the disease labels ClinVar assigns to its evidence.
I hope this is helpful!
Here, I want to identify novel GWAS signals based on known GWAS datasets for our findings. I used credible sets, and I collected all available variant IDs and study IDs. However, I need to know which diseases are associated with these variant IDs. Using the study IDs, I can retrieve all available traitFromSource entries, but I cannot find the disease IDs, so I don't know which diseases correspond to the traitFromSource traits.
However, I cannot find this entry in the study dataset file (part-00000-2c4d825c-7fce-4f14-ae0e-0c516c8624e1-c000.snappy.parquet). Could you please indicate where it should be located?
It seems that when you read the dataset, you are missing all the nested columns.
Since the schema contains arrays and structs, it is useful to use Arrow for reading the Parquet files. With some engines, you just need to point to the directory containing the Parquet files (example using polars):
In [1]: !rsync -rpltvz --delete rsync.ebi.ac.uk::pub/databases/opentargets/platform/25.12/output/study .
receiving incremental file list
sent 29 bytes received 158 bytes 124,67 bytes/sec
total size is 96.400.296 speedup is 515.509,60
In [2]: import polars as pl
In [3]: data = pl.read_parquet('study')
In [4]: data.select("diseaseIds", "studyId")
Out[4]:
shape: (2_001_227, 2)
┌───────────────────┬─────────────────────────────────┐
│ diseaseIds        ┆ studyId                         │
│ ---               ┆ ---                             │
│ list[str]         ┆ str                             │
╞═══════════════════╪═════════════════════════════════╡
│ ["EFO_0004237"]   ┆ FINNGEN_R12_AUTOIMMUNE_HYPERTH… │
│ ["MONDO_0002354"] ┆ FINNGEN_R12_CD2_BENIGN_LARYNX   │
│ ["EFO_0005539"]   ┆ FINNGEN_R12_E4_ADRENAS          │
│ ["MONDO_0001330"] ┆ FINNGEN_R12_H7_PRESBYOPIA       │
│ ["EFO_0004611"]   ┆ GCST000337_7                    │
│ …                 ┆ …                               │
│ []                ┆ walker_2019_exon_neocortex_ens… │
│ []                ┆ walker_2019_ge_neocortex_ensg0… │
│ []                ┆ walker_2019_ge_neocortex_ensg0… │
│ []                ┆ walker_2019_ge_neocortex_ensg0… │
│ []                ┆ walker_2019_tx_neocortex_enst0… │
└───────────────────┴─────────────────────────────────┘
In [5]: !ls study/
part-00000-2c4d825c-7fce-4f14-ae0e-0c516c8624e1-c000.snappy.parquet _SUCCESS
Hope it can help.
With kind regards,
Szymon
@hanl3 What are you using to read the data? Can you give a reproducible example?
| part-00000-2c4d825c-7fce-4f14-ae0e-0c516c8624e1-c000.snappy.parquet | 2025-12-10 13:25 | 92M |
|---|---|---|
Thank you all. I read the data using R, which may be the reason this happened.
library(arrow) # provides open_dataset()
library(dplyr) # provides %>% and collect()
study <- open_dataset("Study/", format = "parquet") %>%
  collect() # now it's a normal data frame
cat("Columns in dataset:\n")
print(names(study))
cat("Number of rows after filtering:", nrow(study), "\n")
If you want to stick to R, a workaround is to use sparklyr, as in the example on the "Download datasets" page of the Open Targets Platform Documentation.
Thanks ochoa for your help