Disease categories redundancy/overlap

I am wondering how the disease/association categories are defined. For example, I ran a list of candidates over the association download using the API, and the top categories I notice are neoplasm (189 IDs) and cancer (157 IDs), which I think are the same thing?

Looking at the overlap between the two sets of IDs, 35 are unique to the cancer category and 67 to the neoplasm category, while 122 are common to both. What could be the reason for such a difference?
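For reference, these numbers come from plain set operations on the two ID lists. A minimal sketch with placeholder IDs (the `gene_...` names are made up for illustration and are not the real candidates):

```python
# Placeholder IDs, sized to match the counts reported above.
shared = {f"gene_shared_{i}" for i in range(122)}
cancer_ids = shared | {f"gene_cancer_{i}" for i in range(35)}
neoplasm_ids = shared | {f"gene_neoplasm_{i}" for i in range(67)}

common = cancer_ids & neoplasm_ids         # IDs present in both categories
only_cancer = cancer_ids - neoplasm_ids    # IDs seen under cancer only
only_neoplasm = neoplasm_ids - cancer_ids  # IDs seen under neoplasm only

print(len(cancer_ids), len(neoplasm_ids))                  # 157 189
print(len(common), len(only_cancer), len(only_neoplasm))   # 122 35 67
```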

Code is available at https://raw.githubusercontent.com/animesh/scripts/master/openTargetsDis.py , hopefully reproducible :see_no_evil:

Hi @animesh ,

First of all, there’s a distinction between neoplasm and cancer: neoplasm is just an excessive growth of tissue, while cancer is a special, malignant type of neoplasm.

Categorizing diseases is a tricky business. We don’t do it ourselves; instead, we use an ontology (the Experimental Factor Ontology, or EFO) developed by the SPOT team at EMBL-EBI. Diseases are organized into a hierarchical structure, where the ancestor <-> descendant relationship reflects the underlying biology or disease mechanism (e.g. breast cancer is a type of cancer, which is a type of neoplasm, which is a type of disease).

When we ingest evidence, we get a single link between a target and a particular disease. For example, we can get one piece of evidence linking target x to breast cancer, and another linking target y to cancer. These are the direct associations of the Platform. You can propagate evidence upwards in the ontology to a more general term. This means you can say that both target x and target y are associated with cancer, where the association between target x and cancer is called indirect, as this link is inferred from the disease ontology. (For obvious reasons, this propagation cannot happen the other way around.)

For this reason, such discrepancies can appear if you look only at the direct associations. You can enable indirect associations by adding the enableIndirect: true parameter to associatedDiseases in the GraphQL query.
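To make the direct vs. indirect distinction concrete, here is a minimal sketch of upward propagation on a toy ontology. The `PARENTS` map, the evidence list and the function names are all hypothetical and only mirror the breast cancer example above; this is not the Platform's actual implementation.

```python
from collections import defaultdict

# Toy ontology: each term lists its direct parents (a made-up slice, not real EFO data).
PARENTS = {
    "breast cancer": ["cancer"],
    "cancer": ["neoplasm"],
    "neoplasm": ["disease"],
    "disease": [],
}

# Direct evidence: one (target, disease) link per ingested record.
DIRECT_EVIDENCE = [("target x", "breast cancer"), ("target y", "cancer")]


def ancestors(term):
    """Return every term above `term` in the toy ontology."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return seen


def associations(evidence, enable_indirect):
    """Map each disease to its targets, optionally propagating links upwards."""
    result = defaultdict(set)
    for target, disease in evidence:
        result[disease].add(target)
        if enable_indirect:
            for ancestor in ancestors(disease):
                result[ancestor].add(target)
    return dict(result)


direct = associations(DIRECT_EVIDENCE, enable_indirect=False)
indirect = associations(DIRECT_EVIDENCE, enable_indirect=True)

print(sorted(direct["cancer"]))      # ['target y']
print(sorted(indirect["cancer"]))    # ['target x', 'target y']
print(sorted(indirect["neoplasm"]))  # ['target x', 'target y']
```

With direct links only, neoplasm has no targets at all in this toy data, while with propagation it inherits everything attached to its descendants.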

I took a look at your script. At this scale, I would dissuade you from using the search endpoint, especially because it returns only 25 associated diseases. It is not an ideal way to get associations for a given gene name. Instead, I recommend using the REST API of Ensembl. You can check out this simple implementation as inspiration.
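As a rough illustration of the Ensembl route, one reading of the advice is to resolve each gene name to an Ensembl gene ID first and then query associations by ID. The sketch below assumes the Ensembl REST `lookup/symbol` endpoint and the `id` field of its JSON response; the helper names are hypothetical and this is not the implementation linked above.

```python
import json
import urllib.parse
import urllib.request

ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_url(symbol, species="homo_sapiens"):
    """Build the symbol lookup URL (endpoint layout assumed from the Ensembl REST docs)."""
    path = f"/lookup/symbol/{urllib.parse.quote(species)}/{urllib.parse.quote(symbol)}"
    return f"{ENSEMBL_REST}{path}?content-type=application/json"


def symbol_to_ensembl_id(symbol, species="homo_sapiens"):
    """Resolve a gene symbol to its Ensembl gene ID (performs a network request)."""
    with urllib.request.urlopen(lookup_url(symbol, species), timeout=30) as response:
        return json.load(response)["id"]


print(lookup_url("BRCA2"))
```

Calling `symbol_to_ensembl_id("BRCA2")` needs network access; for a long candidate list it is worth caching the results and pausing between requests.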


Thanks @dsuveges :+1: I had absolutely no clue that there were genes specific to benign neoplasms! That probably explains the IDs specific to neoplasm, but what about those specific to cancer, given the hierarchy you mention? Is that because of the direct associations, or just retrieving the top hits, or both?

BTW that script of yours is awespiring, to say the least!! Probably therein lies the answer…

> but what about those specific to cancer then given the hierarchy you mention?

Exactly: all genes that are associated with cancer are also indirectly associated with neoplasm, because their associations are propagated up the ontology. However, you might not get these associations unless you explicitly allow indirect evidence in the GraphQL query, as mentioned earlier:

> You can enable indirect associations by adding the enableIndirect: true parameter to associatedDiseases in the GraphQL query.
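A minimal sketch of what such a query could look like from Python. Only `enableIndirect` and `associatedDiseases` come from this thread; the endpoint URL, the `target(ensemblId: ...)` entry point and the selected fields are written from memory of the public schema and should be checked against the API documentation. The helper names are hypothetical.

```python
import json
import urllib.request

# Assumed public GraphQL endpoint of the Open Targets Platform.
OT_GRAPHQL = "https://api.platform.opentargets.org/api/v4/graphql"

# Field names other than associatedDiseases/enableIndirect are assumptions.
QUERY = """
query associated($ensemblId: String!, $enableIndirect: Boolean!) {
  target(ensemblId: $ensemblId) {
    id
    approvedSymbol
    associatedDiseases(enableIndirect: $enableIndirect) {
      count
      rows {
        disease { id name }
        score
      }
    }
  }
}
"""


def build_payload(ensembl_id, enable_indirect=True):
    """Build the JSON body for the GraphQL request."""
    return {
        "query": QUERY,
        "variables": {"ensemblId": ensembl_id, "enableIndirect": enable_indirect},
    }


def fetch_associations(ensembl_id, enable_indirect=True):
    """POST the query and return the associatedDiseases object (network request)."""
    body = json.dumps(build_payload(ensembl_id, enable_indirect)).encode("utf-8")
    request = urllib.request.Request(
        OT_GRAPHQL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["data"]["target"]["associatedDiseases"]
```

Running `fetch_associations("ENSG00000139618")` (BRCA2, used here only as an example ID) once with `enable_indirect=True` and once with `False` should show whether the cancer-only IDs disappear when propagation is switched on.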
