Hello!
I have a couple of questions about the newest version of the dataset, specifically the copy hosted on S3 as part of the Registry of Open Data on AWS (RODA).
We have been using the previous version of the dataset for a while and recently started migrating to v22. We’re also moving from the FTP server to the S3 bucket. During these transitions, we’ve noticed changes in the data that are hard to understand. A couple of examples are below.
More disease entries, but more overlap in terms
In the diseases list accessed from s3://aws-roda-hcls-datalake/opentargets_latest/diseases/, there are now more entries than in the previous version, but it seems that the same entry is included multiple times with slightly different data. For example:
- Orphanet_217607 is included in 5 rows. 4 rows appear to be exact duplicates of each other.
- The 5th row for Orphanet_217607 has a different dbXRefs value, and one fewer parent in the parents column.
- Orphanet_217607 is “Familial dilated cardiomyopathy”. There is also a row for “familial dilated cardiomyopathy” (all lower case) with ID MONDO_0016333. We didn’t see this sort of overlap in naming in the previous version.
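For anyone who wants to reproduce these counts, a check along the following lines may help. It is a minimal sketch in plain Python, assuming the rows have been loaded as dictionaries with `id`, `name`, `dbXRefs`, and `parents` keys and that list-valued columns are ordinary Python lists. The function names and the toy rows are made up for illustration and are not real dataset values.

```python
import json
from collections import defaultdict


def summarize_duplicates(rows, key="id"):
    """Map each repeated key to (total rows, number of distinct row variants)."""
    variants = defaultdict(lambda: defaultdict(int))
    for row in rows:
        # Serialize with sorted keys so rows with identical content compare equal.
        # List order inside a column still matters for this comparison.
        fingerprint = json.dumps(row, sort_keys=True)
        variants[row[key]][fingerprint] += 1
    return {
        k: (sum(v.values()), len(v))
        for k, v in variants.items()
        if sum(v.values()) > 1
    }


def name_collisions(rows):
    """Map each case-folded name to the IDs sharing it, when more than one ID does."""
    ids_by_name = defaultdict(set)
    for row in rows:
        ids_by_name[row["name"].casefold()].add(row["id"])
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical toy rows standing in for the diseases table.
rows = [
    {"id": "D1", "name": "Example disease", "dbXRefs": ["X:1"], "parents": ["P1", "P2"]},
    {"id": "D1", "name": "Example disease", "dbXRefs": ["X:1"], "parents": ["P1", "P2"]},
    {"id": "D1", "name": "Example disease", "dbXRefs": ["X:2"], "parents": ["P1"]},
    {"id": "D2", "name": "example disease", "dbXRefs": [], "parents": []},
]

summary = summarize_duplicates(rows)
collisions = name_collisions(rows)
print(summary)     # {'D1': (3, 2)}
print(collisions)  # {'example disease': ['D1', 'D2']}
```

A result such as `(5, 2)` for an ID would match the pattern described above: five rows, of which four are identical and one differs.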
Multiple entries for disease-gene associations in the S3 bucket
In the dataset which associates diseases with genes accessed at s3://aws-roda-hcls-datalake/opentargets_latest/associationbydatatypeindirect/, there are multiple entries for a given tuple of <gene, disease, datatype>. Specifically there are 5 entries for the tuple of:
diseaseId: Orphanet_217607
targetId: ENSG00000160789
datatypeId: genetic_association
The 5 rows all differ in score and evidenceCount, and it is not clear why these differences exist.
In the version of this data accessed using wget there is a single entry per tuple, which is what we expect. The wget command is:
wget --recursive --no-parent --no-host-directories --cut-dirs 8 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.06/output/etl/parquet/associationByDatatypeIndirect
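The per-tuple check described above can be sketched roughly as follows. This is a hedged illustration in plain Python, not the exact procedure used here: it assumes the rows are available as dictionaries whose keys are `diseaseId`, `targetId`, `datatypeId`, `score`, and `evidenceCount`, and the helper name and toy values are hypothetical.

```python
from collections import defaultdict

# Assumed column names identifying one association.
KEY = ("diseaseId", "targetId", "datatypeId")


def repeated_tuples(rows, key=KEY):
    """Group rows by the key columns and keep only groups with more than one row."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[column] for column in key)].append(row)
    return {k: group for k, group in groups.items() if len(group) > 1}


# Hypothetical toy rows, not real scores from either source.
toy_rows = [
    {"diseaseId": "D1", "targetId": "T1", "datatypeId": "genetic_association",
     "score": 0.5, "evidenceCount": 3},
    {"diseaseId": "D1", "targetId": "T1", "datatypeId": "genetic_association",
     "score": 0.7, "evidenceCount": 5},
    {"diseaseId": "D1", "targetId": "T2", "datatypeId": "genetic_association",
     "score": 0.2, "evidenceCount": 1},
]

repeated = repeated_tuples(toy_rows)
for key_values, group in repeated.items():
    # Show how the repeated rows differ in score and evidenceCount.
    print(key_values, [(r["score"], r["evidenceCount"]) for r in group])
```

Running the same function on rows from the FTP download and from the S3 bucket would be expected to return an empty dictionary for the former and the repeated tuples for the latter, if the behaviour is as described.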
Questions
Our main question is how to reconcile these seemingly duplicate entries which have the same ID but different data.
If you don’t have suggestions for that, it would be helpful to understand more about why these changes were made for v22 and/or S3 (which might give us direction on how to reconcile the entries).
Thanks in advance for any advice you can give.