Difference between parquet files and website/API

Hi,

I have downloaded the parquet files for “Associations - direct (by data type)” (Open Targets Platform) and extracted the information for a specific disease (MONDO_0005277). When I compare the parquet files with the data on the website (Open Targets Platform) and the data retrieved through the API, I see some differences. The API and the website agree on the number of associated targets and contain the same information, but the parquet files seem to be missing some data. In particular, they contain no information for the “animal_model” data type.

Are the parquet files out of date?

Command line to download the data:
wget --recursive --no-parent --no-host-directories --cut-dirs 8 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.12/output/etl/parquet/associationByDatatypeDirect

R code to extract the information for the specific disease:

library(dplyr)
library(sparklyr)
library(sparklyr.nested)

## establish connection
sc <- spark_connect(master = "local")

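## read dataset
## path to the downloaded parquet directory (assumed: with --cut-dirs 8 the wget
## command above leaves it in ./associationByDatatypeDirect; adjust if yours differs)
path <- "associationByDatatypeDirect"
evd <- spark_read_parquet(sc, path = path)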

## Browse the schema
columns <- evd %>%
  sdf_schema() %>%
  lapply(function(x) do.call(tibble, x)) %>%
  bind_rows()

## select fields of interest
evdSelect <- evd %>%
  select(diseaseId,
         targetId,
         datatypeId,
         score,
         evidenceCount) 

## filter in Spark first, then collect only the matching rows into R
evdSelect_disease <- evdSelect %>%
  dplyr::filter(diseaseId == "MONDO_0005277") %>%
  collect()

unique(evdSelect_disease$datatypeId)
# [1] "literature"          "known_drug"          "genetic_association"

In my results, the “animal_model” data type is missing, although it is shown on the website.

Thanks for your feedback!

Francesco

Hi @fru, and welcome to the Open Targets Community! :tada:

I think this is probably because you are looking at the Associations — direct dataset, whereas the data for animal models for Migraine Disorder appears to be indirect evidence, mainly for familial or sporadic hemiplegic migraine (MONDO_0018925), which is a child term of Migraine Disorder.

You can find more information about direct/indirect evidence and how it’s shown in the Platform in our documentation.
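To make the distinction concrete, here is a minimal toy sketch in base R, not the Platform's actual pipeline. It assumes that an indirect view of a disease simply pools the rows annotated to the disease itself and to its descendant terms. The target IDs and the helper names (`descendants_and_self`, `indirect_rows`) are hypothetical; only the two disease IDs and their parent/child relationship come from this thread.

```r
# Toy illustration only: target IDs are hypothetical placeholders.
# Direct associations: evidence annotated to exactly this disease term.
direct <- data.frame(
  diseaseId  = c("MONDO_0005277", "MONDO_0005277", "MONDO_0018925"),
  targetId   = c("TARGET_A", "TARGET_B", "TARGET_C"),
  datatypeId = c("literature", "known_drug", "animal_model"),
  stringsAsFactors = FALSE
)

# Child -> parent edge from the thread:
# MONDO_0018925 is a child term of MONDO_0005277.
parents <- c(MONDO_0018925 = "MONDO_0005277")

# Hypothetical helper: a term plus all of its descendants,
# found by repeatedly looking up children in the child -> parent map.
descendants_and_self <- function(term, parents) {
  found <- term
  repeat {
    children  <- names(parents)[parents %in% found]
    new_terms <- setdiff(children, found)
    if (length(new_terms) == 0) break
    found <- c(found, new_terms)
  }
  found
}

# Hypothetical helper: the indirect view keeps rows for the term
# itself and for every descendant term.
indirect_rows <- function(direct, term, parents) {
  direct[direct$diseaseId %in% descendants_and_self(term, parents), ]
}

direct_types   <- unique(direct$datatypeId[direct$diseaseId == "MONDO_0005277"])
indirect_types <- unique(indirect_rows(direct, "MONDO_0005277", parents)$datatypeId)

# Data types that appear only once child terms are included
setdiff(indirect_types, direct_types)
# [1] "animal_model"
```

In this toy table, “animal_model” is attached only to the child term, so it is absent from the direct view of MONDO_0005277 and shows up only in the indirect one.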

Let me know if you have any questions about this!

Best wishes,

Helena

Thanks for the reply! I see, that makes sense. I guess the “Associations - indirect (by data type)” file should then contain the indirect evidence for Migraine Disorder itself, as shown on the website, and not only for the child term :slight_smile: Anyway, your reply solves my problem, thanks again!

Francesco


For future reference, the “Associations - indirect (by data type)” file contains the same number of targets and associations as the website and API. It looks like it contains both direct and indirect associations: “literature”, “known_drug”, “genetic_association”, “animal_model”.
The only data types missing for this disease are “somatic_mutations”, “affected_pathway” and “rna_expression”. They are simply not available for the disease ID “MONDO_0005277”, although they do appear elsewhere in the “Associations - indirect (by data type)” file.
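
For anyone who wants to repeat this check, a minimal sketch of the comparison in base R follows. It assumes both datasets have already been filtered and collected into ordinary data frames with the `diseaseId` and `datatypeId` columns used above. The helper `datatypes_for_disease`, the target IDs, and the `OTHER_DISEASE` row are hypothetical stand-ins; only the data type names mirror what is reported in this thread.

```r
# Hypothetical helper: list the data types reported for one disease
# in an already-collected associations table (data frame or tibble).
datatypes_for_disease <- function(assoc, disease_id) {
  sort(unique(assoc$datatypeId[assoc$diseaseId == disease_id]))
}

# Tiny stand-in tables (illustrative only, not real Platform rows).
direct_tbl <- data.frame(
  diseaseId  = rep("MONDO_0005277", 3),
  targetId   = c("TARGET_A", "TARGET_B", "TARGET_C"),
  datatypeId = c("literature", "known_drug", "genetic_association"),
  stringsAsFactors = FALSE
)
indirect_tbl <- data.frame(
  diseaseId  = c(rep("MONDO_0005277", 4), "OTHER_DISEASE"),
  targetId   = c("TARGET_A", "TARGET_B", "TARGET_C", "TARGET_D", "TARGET_E"),
  datatypeId = c("literature", "known_drug", "genetic_association",
                 "animal_model", "rna_expression"),
  stringsAsFactors = FALSE
)

direct_types   <- datatypes_for_disease(direct_tbl, "MONDO_0005277")
indirect_types <- datatypes_for_disease(indirect_tbl, "MONDO_0005277")

# Data types that only show up in the indirect file for this disease
only_indirect <- setdiff(indirect_types, direct_types)

# Data types present somewhere in the file but not for this disease
absent_for_disease <- setdiff(unique(indirect_tbl$datatypeId), indirect_types)
```

With the real files, `direct_tbl` and `indirect_tbl` would be replaced by the collected results of the sparklyr pipeline above, run once on each downloaded dataset.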

Francesco
