Questions about the 22.06 dataset — overlap in disease terms/multiple entries for disease-gene associations

vcatterson · 6 July 2022 15:43

Hello!

I have a couple of questions about the newest version of the dataset, and specifically the version hosted on S3 as part of the RODA.

We have been using the previous version of the dataset for a while now, and recently started migrating to v22. We’re also transitioning from using the FTP server to using the S3 bucket. In these transitions, we’ve noticed changes in the data which are hard to understand. A couple of examples are below.

More disease entries, but more overlap in terms

In the diseases list accessed from s3://aws-roda-hcls-datalake/opentargets_latest/diseases/, there are now more entries than in the previous version, but it seems that the same entry is included multiple times with slightly different data. For example:

Orphanet_217607 is included in 5 rows. 4 rows appear to be exact duplicates of each other.
The 5th row for Orphanet_217607 has a different dbXRefs value, and one fewer parent in the parents column.
Orphanet_217607 is “Familial dilated cardiomyopathy”. There is also a row for “familial dilated cardiomyopathy” (all lower case) with ID MONDO_0016333. We didn’t see this sort of overlap in naming in the previous version.
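
One way to separate the exact duplicates from the rows that genuinely differ is sketched below. It is a minimal illustration, not the Platform's own tooling: it assumes the rows have already been loaded as plain dictionaries, the helper name `distinct_variants_by_id` is hypothetical, and the `dbXRefs` and `parents` values are placeholders rather than real data.

```python
import json
from collections import defaultdict


def distinct_variants_by_id(rows, id_field="id"):
    """Group rows by ID and keep one copy of each distinct row.

    Rows are compared through a canonical JSON string, so list-valued
    columns (e.g. dbXRefs, parents) are compared by content and order.
    """
    variants = defaultdict(dict)
    for row in rows:
        canonical = json.dumps(row, sort_keys=True)
        # setdefault keeps the first row seen for each distinct content
        variants[row[id_field]].setdefault(canonical, row)
    return {key: list(seen.values()) for key, seen in variants.items()}


# Placeholder rows (NOT real Platform data): four identical copies plus one
# row with a different dbXRefs value and one fewer parent.
base = {"id": "Orphanet_217607", "dbXRefs": ["X:1"], "parents": ["P_1", "P_2"]}
odd = {"id": "Orphanet_217607", "dbXRefs": ["X:2"], "parents": ["P_1"]}
rows = [dict(base) for _ in range(4)] + [odd]

variants = distinct_variants_by_id(rows)
for disease_id, distinct_rows in variants.items():
    print(disease_id, "has", len(distinct_rows), "distinct variant(s)")
```

Any ID left with more than one variant after this step has rows that really do differ and need a manual decision on which to keep.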

Multiple entries for disease-gene associations in the S3 bucket

In the dataset which associates diseases with genes accessed at s3://aws-roda-hcls-datalake/opentargets_latest/associationbydatatypeindirect/, there are multiple entries for a given tuple of <gene, disease, datatype>. Specifically there are 5 entries for the tuple of:

diseaseId: Orphanet_217607

targetId: ENSG00000160789

datatypeID: genetic_association

The 5 rows all differ in score and evidenceCount, and it is not clear why these differences exist.
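
The check behind this observation can be sketched as follows. This is a minimal example under stated assumptions: the rows are plain dictionaries, the column names in `KEY_FIELDS` are assumed and should be adjusted to the actual schema, the helper `duplicated_keys` is hypothetical, and the scores, evidence counts and the second target ID are placeholders, not values from the dataset.

```python
from collections import Counter

# Assumed column names; adjust to match the schema of the files you load.
KEY_FIELDS = ("diseaseId", "targetId", "datatypeId")


def duplicated_keys(rows, key_fields=KEY_FIELDS):
    """Return {key tuple: row count} for every key that occurs more than once."""
    counts = Counter(tuple(row[field] for field in key_fields) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Placeholder rows (NOT real scores): five rows sharing one key, one unique row.
rows = [
    {
        "diseaseId": "Orphanet_217607",
        "targetId": "ENSG00000160789",
        "datatypeId": "genetic_association",
        "score": score,
        "evidenceCount": count,
    }
    for score, count in [(0.1, 1), (0.2, 2), (0.3, 3), (0.4, 4), (0.5, 5)]
]
rows.append(
    {
        "diseaseId": "Orphanet_217607",
        "targetId": "TARGET_PLACEHOLDER",  # hypothetical ID for illustration
        "datatypeId": "genetic_association",
        "score": 0.9,
        "evidenceCount": 7,
    }
)

duplicates = duplicated_keys(rows)
for key, n in duplicates.items():
    print(key, "appears in", n, "rows")
```

An empty result means every tuple is unique, which is the behaviour expected of this dataset.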

In the version of this data accessed using wget there is a single entry per tuple, which is what we expect. The wget command is:

<wget --recursive --no-parent --no-host-directories --cut-dirs 8 ftp://ftp.ebi.ac.uk/pub/databases/opentargets/platform/22.06/output/etl/parquet/associationByDatatypeIndirect>

Questions

Our main question is how to reconcile these seemingly duplicate entries which have the same ID but different data.

If you don’t have suggestions for that, it would be helpful to understand more about why these changes were made for v22 and/or S3 (which might give us direction on how to reconcile the entries).

Thanks in advance for any advice you can give.

ochoa · 7 July 2022 09:07

More disease entries, but more overlap in terms

The version of the Experimental Factor Ontology (EFO) that the Platform currently uses has a problem affecting Orphanet IDs: a significant number of terms with an Orphanet ID are now also included under a MONDO ID. “Familial dilated cardiomyopathy” is a good example. There is a ticket to follow up on the issue:

github.com/EBISPOT/efo

Large number of duplicated terms

opened 04:32PM - 26 Jun 22 UTC

closed 12:21PM - 23 Aug 22 UTC

d0choa

Open Targets

At least in 3.42 and 3.43, there are a large number of duplicated terms in EFO, mostly affecting rare diseases. Just by lower-casing the names and looking for exact matches, there are 3036 duplicated terms (v3.42). Some of them are explained by the disease vs phenotype conundrum, but the vast majority correspond to a MONDO vs Orphanet duplication. Some examples:

[Hemophilia](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_448) Orphanet:448 - [hemophilia](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMONDO_0018660) MONDO:0018660
[Fragile X syndrome](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_908) Orphanet:908 - [fragile X syndrome](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMONDO_0010383) MONDO:0010383
[Apert syndrome](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fwww.orpha.net%2FORDO%2FOrphanet_87) Orphanet:87 - [apert syndrome](https://www.ebi.ac.uk/ols/ontologies/efo/terms?iri=http%3A%2F%2Fpurl.obolibrary.org%2Fobo%2FMONDO_0007041) MONDO:0007041
...
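
The lower-casing comparison mentioned in the ticket can be sketched roughly as below. This is only an illustration of the idea, not the script used to obtain the 3036 figure: the helper `ids_sharing_a_name` is hypothetical, the `id` and `name` field names are assumed, and the last entry is a made-up placeholder term.

```python
from collections import defaultdict


def ids_sharing_a_name(terms):
    """Map each lower-cased name to its IDs, keeping names used by 2+ IDs."""
    by_name = defaultdict(set)
    for term in terms:
        by_name[term["name"].lower()].add(term["id"])
    return {name: sorted(ids) for name, ids in by_name.items() if len(ids) > 1}


# IDs and names taken from the examples in this thread, plus one placeholder.
terms = [
    {"id": "Orphanet_217607", "name": "Familial dilated cardiomyopathy"},
    {"id": "MONDO_0016333", "name": "familial dilated cardiomyopathy"},
    {"id": "Orphanet_448", "name": "Hemophilia"},
    {"id": "MONDO_0018660", "name": "hemophilia"},
    {"id": "TERM_PLACEHOLDER", "name": "a term with a unique name"},  # hypothetical
]

shared = ids_sharing_a_name(terms)
for name, ids in shared.items():
    print(name, "->", ids)
```

The resulting groups could serve as a lookup for collapsing each Orphanet/MONDO pair onto a single ID until the ontology itself is fixed; which ID to prefer is a choice left to the user.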

Multiple entries for disease-gene associations in the S3 bucket

Based on your observation, it looks like there is a mismatch in the dataset that S3 is serving. Very likely, this dataset is not actually “associationByDatatypeIndirect” but another association aggregation (e.g. associationByDatasourceIndirect). We don’t actively move the data to S3, so we will work with AWS to resolve the problem. Thanks for flagging this.
In the meantime, I would recommend downloading the dataset directly from the EBI FTP as this is supposed to be the source of truth.

vcatterson · 8 July 2022 19:35

Thank you so much for these replies! It is very helpful to understand more about the underlying issues, and to know that they are both being followed up on. (I am the OP, but I only just created my account and Helpdesk very kindly posted this query for me.) I really appreciate all the work that you and the team do, and I’m grateful for your suggestions of work-arounds, too.

dsuveges · 9 July 2022 17:19

Hi @vcatterson ,

We have contacted the AWS team to clarify the duplication issue with the datasets they fetch from the EBI FTP. More details under this ticket. We are waiting for their reply. In the meantime, I would suggest sourcing the data from the EBI FTP.

Best,
Daniel
