Difference between parquet files and website/API

fru · 7 February 2024 10:21

Hi,

I have downloaded the parquet files for “Associations - direct (by data type)” (Open Targets Platform) and extracted the information for a specific disease (MONDO_0005277). When I compare the parquet files with the data on the website (Open Targets Platform) and the data retrieved using the API, I see some differences. The API and the website agree on the number of associated targets and contain the same information, but the parquet files seem to be missing some data. In particular, they do not contain any information on the “animal_model” data type.

Are the parquet files out of date?

Command line to download the data:
wget --recursive --no-parent --no-host-directories --cut-dirs 8 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.12/output/etl/parquet/associationByDatatypeDirect

R code to extract the information for the specific disease:

library(dplyr)
library(sparklyr)
library(sparklyr.nested)

## establish connection
sc <- spark_connect(master = "local")

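## read dataset
## assumption: `path` points to the directory created by the wget command above
path <- "associationByDatatypeDirect"
evd <- spark_read_parquet(sc, path = path)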

## Browse the schema
columns <- evd %>%
  sdf_schema() %>%
  lapply(function(x) do.call(tibble, x)) %>%
  bind_rows()

## select fields of interest
evdSelect <- evd %>%
  select(diseaseId,
         targetId,
         datatypeId,
         score,
         evidenceCount) 

## filter in Spark first, then collect only the matching rows
evdSelect_disease <- evdSelect %>%
  dplyr::filter(diseaseId == "MONDO_0005277") %>%
  collect()

unique(evdSelect_disease$datatypeId)
# [1] "literature"          "known_drug"          "genetic_association"

In my hands, the “animal_model” data type is missing from the parquet files, but it is shown on the website.

Thanks for your feedback!

Francesco

hcornu · 7 February 2024 10:33

Hi @fru, and welcome to the Open Targets Community!

I think this is probably because you are looking at the “Associations - direct” dataset, whereas the animal model data for Migraine Disorder appears to be indirect evidence, mainly from familial or sporadic hemiplegic migraine (MONDO_0018925), which is a child term of Migraine Disorder.
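
One way to check this in the direct dataset is to list the data types reported for the parent term and the child term separately. The following is only a minimal sketch: it assumes a table with the diseaseId and datatypeId columns used in the code above, the helper name datatypes_by_disease is hypothetical, and the toy rows (including the target IDs) are made up for illustration and are not real Platform data.

```r
library(dplyr)

## Hypothetical helper: list the data types reported for each disease ID.
## Written for a local data frame; it is assumed (not verified here) to also
## work on a Spark table, since it only uses standard dplyr verbs.
datatypes_by_disease <- function(assoc, disease_ids) {
  assoc %>%
    filter(diseaseId %in% disease_ids) %>%
    distinct(diseaseId, datatypeId) %>%
    arrange(diseaseId, datatypeId) %>%
    collect()
}

## Toy rows for illustration only (not real Platform data)
toy_direct <- tibble(
  diseaseId  = c("MONDO_0005277", "MONDO_0005277", "MONDO_0018925"),
  targetId   = c("T1", "T2", "T1"),
  datatypeId = c("literature", "known_drug", "animal_model")
)

res <- datatypes_by_disease(toy_direct, c("MONDO_0005277", "MONDO_0018925"))
print(res)
```

If “animal_model” only shows up against the child term in the direct dataset, that would be consistent with the evidence being indirect for the parent term.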

You can find more information about direct/indirect evidence and how it’s shown in the Platform in our documentation.

Let me know if you have any questions about this!

Best wishes,

Helena

fru · 7 February 2024 11:31

Thanks for the reply! I see, that makes sense. I guess the “Associations - indirect (by data type)” file should contain the indirect evidence for Migraine Disorder itself, as shown on the website, and not only for the child term. Anyway, your reply solves my problem, thanks again!

Francesco

fru · 7 February 2024 13:12

For future reference, the “Associations - indirect (by data type)” file contains the same number of targets and associations as the website and API. It looks like it contains both direct and indirect associations: “literature”, “known_drug”, “genetic_association”, “animal_model”.
The only data types missing for this disease are “somatic_mutations”, “affected_pathway” and “rna_expression”, because they are not available for the disease ID “MONDO_0005277”. These data types are present in the “Associations - indirect (by data type)” file for other diseases.
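
The comparison can be sketched roughly as follows. This is an illustration under assumptions rather than the exact code I ran: the helper name indirect_only_datatypes is hypothetical, and the two toy tables only reproduce the data type lists reported in this thread instead of reading the real parquet files.

```r
library(dplyr)

## Hypothetical helper: data types present for a disease in the indirect
## table but absent from the direct table.
indirect_only_datatypes <- function(direct, indirect, disease_id) {
  get_types <- function(assoc) {
    assoc %>%
      filter(diseaseId == disease_id) %>%
      distinct(datatypeId) %>%
      pull(datatypeId)
  }
  setdiff(get_types(indirect), get_types(direct))
}

## Toy tables mirroring the data types listed in this thread
toy_direct <- tibble(
  diseaseId  = "MONDO_0005277",
  datatypeId = c("literature", "known_drug", "genetic_association")
)
toy_indirect <- tibble(
  diseaseId  = "MONDO_0005277",
  datatypeId = c("literature", "known_drug", "genetic_association", "animal_model")
)

missing_in_direct <- indirect_only_datatypes(toy_direct, toy_indirect, "MONDO_0005277")
print(missing_in_direct)
```

With the real data, the two arguments would be the tables read with spark_read_parquet() from the direct and indirect download directories (the directory names depend on how the files were downloaded).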

Francesco
