Hi,
I have downloaded the parquet files for “Associations - direct (by data type)” (Open Targets Platform) and extracted the rows for a specific disease (MONDO_0005277). When I compare the parquet files with the data on the website and with the data retrieved through the API, I see some differences. The API and the website agree on the number of associated targets and contain the same information, but the parquet files seem to be missing some data. In particular, they contain no rows for the “animal_model” data type.
Are the parquet files out of date?
Command line to download the data:
wget --recursive --no-parent --no-host-directories --cut-dirs 8 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/23.12/output/etl/parquet/associationByDatatypeDirect
R code to extract the information for the specific disease:
library(dplyr)
library(sparklyr)
library(sparklyr.nested)
## establish connection
sc <- spark_connect(master = "local")
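## read dataset
## (assumption: path to the directory created by the wget command above)
path <- "associationByDatatypeDirect"
evd <- spark_read_parquet(sc, path = path)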
## Browse the schema
columns <- evd %>%
  sdf_schema() %>%
  lapply(function(x) do.call(tibble, x)) %>%
  bind_rows()
## select fields of interest
evdSelect <- evd %>%
  select(diseaseId,
         targetId,
         datatypeId,
         score,
         evidenceCount)
## filter in Spark first, then collect only the matching rows into R
evdSelect_disease <- evdSelect %>%
  dplyr::filter(diseaseId == "MONDO_0005277") %>%
  collect()
unique(evdSelect_disease$datatypeId)
# [1] "literature" "known_drug" "genetic_association"
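As a cross-check on the result above, here is a minimal sketch that counts rows per data type for one disease. It assumes the `evd` table from the code above; `datatype_counts` is a hypothetical helper name, not part of sparklyr or of the Open Targets tooling.

```r
library(dplyr)

# Hypothetical helper: count rows per datatypeId for a single disease.
# Uses only generic dplyr verbs, so it should accept a Spark table
# such as `evd` as well as a local tibble.
datatype_counts <- function(tbl, disease_id) {
  tbl %>%
    filter(diseaseId == !!disease_id) %>%  # !! inlines the local value
    count(datatypeId) %>%                  # one row per data type, count in `n`
    collect() %>%                          # bring the small summary into R
    arrange(datatypeId)
}
```

Calling `datatype_counts(evd, "MONDO_0005277")` should list the data types present for this disease, and `evd %>% distinct(datatypeId) %>% collect()` would show whether “animal_model” appears anywhere in the downloaded files at all.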
In my hands, the “animal_model” data type is missing from the parquet files, but it is present on the website.
Thanks for your feedback!
Francesco