Questions about the 22.06 dataset — overlap in disease terms/multiple entries for disease-gene associations


I have a couple of questions about the newest version of the dataset, specifically the version hosted on S3 as part of the Registry of Open Data on AWS (RODA).

We have been using the previous version of the dataset for a while now, and recently started migrating to v22. We’re also transitioning from using the FTP server to using the S3 bucket. In these transitions, we’ve noticed changes in the data which are hard to understand. A couple of examples are below.

More disease entries, but more overlap in terms

In the diseases list accessed from s3://aws-roda-hcls-datalake/opentargets_latest/diseases/, there are now more entries than in the previous version, but it seems that the same entry is included multiple times with slightly different data. For example:

  • Orphanet_217607 is included in 5 rows. 4 rows appear to be exact duplicates of each other.

  • The 5th row for Orphanet_217607 has a different dbXRefs value, and one fewer parent in the parents column.

  • Orphanet_217607 is “Familial dilated cardiomyopathy”. There is also a row for “familial dilated cardiomyopathy” (all lower case) with ID MONDO_0016333. We didn’t see this sort of overlap in naming in the previous version.
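For reference, here is a minimal sketch of how this kind of duplication can be checked. It assumes the rows are already loaded as Python dicts. The field names mirror the columns mentioned above, but the dbXRefs and parents values are made up for illustration, and `summarise_duplicates` is a hypothetical helper, not part of any Open Targets tooling.

```python
import json
from collections import defaultdict


def summarise_duplicates(rows, key="id"):
    """Group rows by `key` and count distinct row contents per key.

    Returns {key_value: (row_count, distinct_row_count)} for every key
    value that appears in more than one row.
    """
    groups = defaultdict(list)
    for row in rows:
        # Serialise with sorted keys so identical rows compare equal.
        groups[row[key]].append(json.dumps(row, sort_keys=True))
    return {
        k: (len(v), len(set(v)))
        for k, v in groups.items()
        if len(v) > 1
    }


# Made-up sample mimicking the pattern described above: four identical
# rows plus one row with a different dbXRefs value and one fewer parent.
same = {"id": "Orphanet_217607", "name": "Familial dilated cardiomyopathy",
        "dbXRefs": ["X:1", "X:2"], "parents": ["P_1", "P_2"]}
odd = {"id": "Orphanet_217607", "name": "Familial dilated cardiomyopathy",
       "dbXRefs": ["X:1"], "parents": ["P_1"]}
other = {"id": "MONDO_0016333", "name": "familial dilated cardiomyopathy",
         "dbXRefs": [], "parents": ["P_1"]}
rows = [dict(same) for _ in range(4)] + [odd, other]

report = summarise_duplicates(rows)
print(report)
```

A result of (5, 2) for an ID means five rows but only two distinct versions, so exact duplicates can be dropped safely while the remaining conflict still needs a decision.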

Multiple entries for disease-gene associations in the S3 bucket

In the dataset which associates diseases with genes accessed at s3://aws-roda-hcls-datalake/opentargets_latest/associationbydatatypeindirect/, there are multiple entries for a given tuple of <gene, disease, datatype>. Specifically there are 5 entries for the tuple of:

diseaseId: Orphanet_217607

targetId: ENSG00000160789

datatypeID: genetic_association

The 5 rows all differ in score and evidenceCount, and it is not clear why these differences exist.
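The check we ran can be sketched roughly as below. This is an illustration only: the scores and evidence counts are made up, the column name `datatypeId` is an assumption about the exact spelling in the files, and `find_repeated_tuples` is a hypothetical helper.

```python
from collections import defaultdict


def find_repeated_tuples(rows, key_fields):
    """Return {key_tuple: [rows]} for key tuples that occur more than once."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[f] for f in key_fields)].append(row)
    return {k: v for k, v in groups.items() if len(v) > 1}


# Assumed column names for the <disease, gene, datatype> tuple.
KEY = ("diseaseId", "targetId", "datatypeId")

# Made-up scores and counts, only to illustrate the check.
rows = [
    {"diseaseId": "Orphanet_217607", "targetId": "ENSG00000160789",
     "datatypeId": "genetic_association", "score": 0.9, "evidenceCount": 12},
    {"diseaseId": "Orphanet_217607", "targetId": "ENSG00000160789",
     "datatypeId": "genetic_association", "score": 0.4, "evidenceCount": 3},
    {"diseaseId": "Orphanet_217607", "targetId": "ENSG00000160789",
     "datatypeId": "literature", "score": 0.2, "evidenceCount": 5},
]

repeated = find_repeated_tuples(rows, KEY)
for key, group in repeated.items():
    print(key, [(r["score"], r["evidenceCount"]) for r in group])
```

An empty result would mean one row per tuple, which is what we see in the FTP copy.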

In the version of this data accessed using wget there is a single entry per tuple, which is what we expect. The wget command is:

<wget --recursive --no-parent --no-host-directories --cut-dirs 8>


Our main question is how to reconcile these seemingly duplicate entries which have the same ID but different data.

If you don’t have suggestions for that, it would be helpful to understand more about why these changes were made for v22 and/or S3 (which might give us direction on how to reconcile the entries).

Thanks in advance for any advice you can give.

More disease entries, but more overlap in terms

The current version of the Experimental Factor Ontology (EFO) that the Platform uses has a problem affecting Orphanet IDs: a significant number of terms with an Orphanet ID are now also included in MONDO. “Familial dilated cardiomyopathy” is a good example. There is a ticket to follow up on the issue:

Multiple entries for disease-gene associations in the S3 bucket

Based on your observation, it looks like the dataset that S3 is serving is mislabelled. Most likely it does not correspond to “associationByDatatypeIndirect” but to another association aggregation (e.g. associationByDatasourceIndirect). We don’t actively move the data to S3, so we will work with AWS to resolve the problem. Thanks for flagging this.
In the meantime, I would recommend downloading the dataset directly from the EBI FTP, as this is supposed to be the source of truth.

Thank you so much for these replies! It is very helpful to understand more about the underlying issues, and to know that they are both being followed up on. (I am the OP, but I only just created my account and Helpdesk very kindly posted this query for me.) I really appreciate all the work that you and the team do, and I’m grateful for your suggested workarounds, too.

Hi @vcatterson,

We have contacted the AWS team to clarify the duplication issue with the datasets they fetch from the EBI FTP. More details under this ticket. We are waiting for their reply. In the meantime, I would suggest sourcing the data from the EBI FTP.

