How to group associations by therapeutic area?

On the Open Targets Platform, it is possible to classify target-disease associations by therapeutic area.

After using the API to download all associations for selected genes of interest, is there a file I can download to perform this classification myself? For example, a file that maps each trait/phenotype/disease ID (EFO, HP, Mondo, Orphanet) to its therapeutic area or parent term.

This question was sent to the Open Targets helpdesk and has been posted here so the answer can benefit the whole Community of users.

Hello!

There are two ways that you can find the therapeutic area for a disease of interest.

  • The first would be to query this directly in the API.

The following query would give you the therapeutic areas for diseases and phenotypes associated with IL22:

query associatedDiseases {
  target(ensemblId: "ENSG00000127318") {
    id
    approvedSymbol
    associatedDiseases {
      count
      rows {
        disease {
          id
          name
          therapeuticAreas {
            id
            name
          }
        }
      }
    }
  }
}
  • The other is to use the disease/phenotype dataset in our data downloads.

The disease/phenotype dataset contains a field called therapeuticAreas. If you enrich your associations dataset with disease information by joining on the disease ID, it will give you the information you are looking for.
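
As a rough illustration of that join, here is a minimal sketch in plain Python. It assumes the associations and the disease/phenotype records have already been loaded into lists of dictionaries. The rows are hand-made examples rather than real downloads, and the helper name `add_therapeutic_areas` is made up for illustration.

```python
def add_therapeutic_areas(associations, diseases):
    """Attach therapeutic area IDs to each association by joining on disease ID."""
    # Index the disease records by ID so each lookup is a dictionary access.
    areas_by_disease = {d["id"]: d.get("therapeuticAreas", []) for d in diseases}
    enriched = []
    for row in associations:
        # Left join: an association with no matching disease gets an empty list.
        areas = areas_by_disease.get(row["diseaseId"], [])
        enriched.append({**row, "therapeuticAreas": list(areas)})
    return enriched


# Hand-made example rows (assumed shape, not real downloaded data).
diseases = [
    {"id": "EFO_0004348", "name": "hematocrit", "therapeuticAreas": ["EFO_0001444"]},
    {"id": "MONDO_0005149", "name": "pulmonary hypertension", "therapeuticAreas": ["EFO_0000319"]},
]
associations = [
    {"targetId": "ENSG00000115977", "diseaseId": "EFO_0004348"},
    {"targetId": "ENSG00000115977", "diseaseId": "MONDO_0005149"},
    {"targetId": "ENSG00000115977", "diseaseId": "EFO_0005762"},  # no disease record above
]
enriched = add_therapeutic_areas(associations, diseases)
```

The same idea carries over to pandas or Spark: the disease ID is the join key and `therapeuticAreas` is the column you bring across.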

I hope this helps! Let us know if you have other questions.

Follow up question:

Is it possible to get the mappings of therapeutic areas to EFO, HP, Orphanet and MONDO terms you use in a simple tabular format, that I can integrate with my current analysis pipeline?

Yes, the terms found in the therapeuticAreas field are themselves terms in the disease/phenotype dataset, so you can look them up there to find out more about them.
The field that can be helpful is dbXRefs. It holds a list of cross references between a disease in EFO and the corresponding IDs in other ontologies. For example, to get the cross references for immune system disease in the API, the query would be:

query diseaseAnnotation {
  disease(efoId: "EFO_0000540") {
    id
    dbXRefs
  }
}

Please keep in mind that this list will not always be so extensive, as it is pulled from EFO’s cross references. If you can’t find a cross reference, you can ask EFO to add it via their issue tracker: Issues · EBISPOT/efo · GitHub
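
To get something tabular out of a response like this, a small sketch in Python follows. It assumes each cross reference is a string of the form "PREFIX:identifier". The response below is made up and its cross-reference values are placeholders, not real IDs, and `xrefs_to_rows` is a hypothetical helper name.

```python
def xrefs_to_rows(disease):
    """Turn one disease record into (disease_id, source_ontology, xref) rows."""
    rows = []
    for xref in disease.get("dbXRefs") or []:
        # Assumption: cross references look like "PREFIX:identifier".
        prefix, separator, _ = xref.partition(":")
        source = prefix if separator else "unknown"
        rows.append((disease["id"], source, xref))
    return rows


# Made-up response; the dbXRefs values are placeholders, not real cross references.
response = {
    "data": {
        "disease": {
            "id": "EFO_0000540",
            "dbXRefs": ["MONDO:0000000", "DOID:0000", "no-prefix-value"],
        }
    }
}
rows = xrefs_to_rows(response["data"]["disease"])
# One tab-separated line per cross reference, ready to write to a file.
lines = ["\t".join(row) for row in rows]
```

Filtering `rows` on the second column then gives the MONDO, Orphanet or HP mappings separately.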

Thanks for the question!
Irene

Hi,
Thanks for this. Is there a way to get parent terms for the traits, whether they are EFO, HP, MONDO, etc.?
For diseases I can use therapeutic areas, thank you for letting me know how to get those.
What about traits that are not diseases? For example, hematocrit EFO_0004348 has parent term Hematological measurement EFO_0004503 (hematocrit, hematological measurement).

Thanks!

Hi Violeta,

Welcome to the Open Targets community forum! On the disease profile page you can find both the child and parent terms in the Ontology widget. For hematocrit you’ll find hematological measurement and measurement. If you click on any of the terms, the link will take you to the profile page of that disease.

This information is included in our disease index, so depending on your use case, either the GraphQL API or reading the flat files could be the better choice. Please let us know if this resolves your question.
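
If you go the flat-file route, a parent lookup could look roughly like the sketch below. It assumes the disease records are already loaded as dictionaries with `id`, `name` and `parents` fields. The three records are a hand-made toy version of the hematocrit example from this thread, and `parent_terms` is a hypothetical helper name.

```python
def parent_terms(disease_id, diseases_by_id):
    """Return (parent_id, parent_name) pairs for the direct parents of a term."""
    record = diseases_by_id[disease_id]
    pairs = []
    for parent_id in record.get("parents", []):
        parent = diseases_by_id.get(parent_id)
        # The name is None when the parent is missing from the loaded records.
        pairs.append((parent_id, parent["name"] if parent else None))
    return pairs


# Toy records mirroring the hematocrit example (assumed shape, not a real download).
records = [
    {"id": "EFO_0001444", "name": "measurement", "parents": []},
    {"id": "EFO_0004503", "name": "hematological measurement", "parents": ["EFO_0001444"]},
    {"id": "EFO_0004348", "name": "hematocrit", "parents": ["EFO_0004503"]},
]
diseases_by_id = {r["id"]: r for r in records}
hematocrit_parents = parent_terms("EFO_0004348", diseases_by_id)
```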

Best,
Daniel

Hi Daniel!
Thank you all, this is very helpful.
Can I get additional parent terms?
For example, for gene AAK1, we are getting 77 associations (we use the GraphQL API endpoint as we are querying a large number of genes).
I extract some lines:
ENSG00000115977|AAK1|EFO_0004348|hematocrit|EFO_0001444: measurement
ENSG00000115977|AAK1|EFO_0004305|erythrocyte count|EFO_0001444: measurement
ENSG00000115977|AAK1|EFO_0005762|neuropathic pain|EFO_0000651: phenotype
ENSG00000115977|AAK1|MONDO_0005149|pulmonary hypertension|EFO_0000319: cardiovascular disease

For the cases in which we are getting “measurement” or “phenotype”, is it possible to get these parent terms? That is, for:
EFO:0004348 hematocrit or EFO:0004305 erythrocyte count, can we get EFO:0004503 hematological measurement
EFO:0005762 neuropathic pain, can we get EFO:0003843 pain

Thanks!

Hi Violeta,

OK, I think I get it now. If the therapeutic area is measurement or phenotype, you want a more granular parent term if possible, and you want to get these values for a large number of associations. Is this correct?

For such a use case I would highly recommend using our flat files on FTP (associations, diseases) rather than the frontend or the GraphQL API. (The ontology widget on the frontend uses a JSON file to load all EFO terms.)

So if I understand the problem right, the logic would flow like this:

  1. Get all the EFO terms that lie directly under phenotype and measurement.
  2. Get all associations of interest.
  3. Join diseases with associations, exploding the ancestors.
  4. Join the ancestors with the more granular terms.
  5. Aggregate associations.
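
from pyspark.sql import SparkSession, functions as f

spark = SparkSession.builder.getOrCreate()

# Therapeutic areas that are too broad and need a more granular term:
therapeutic_area_to_resolve = [
    'EFO_0001444',  # Measurement
    'EFO_0000651',  # Phenotype
]

gene_of_interest = 'ENSG00000115977'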

# Loading disease dataset:
diseases = spark.read.parquet('/Users/dsuveges/project_data/diseases/')

# We are looking for the least granular terms under phenotype and measurement,
# i.e. terms that have measurement or phenotype as a direct parent:
resolved_phenotypes_measurements = (
    diseases
    .select(
        f.col('id').alias('resolvedDiseaseId'),
        f.col('name').alias('resolvedDiseaseName'), 
        f.explode('therapeuticAreas').alias('therapeuticArea'),
        'parents', 'ancestors'
    )
    # Get all terms that are directly under phenotypes and measurements:
    .filter(
        f.col('therapeuticArea').isin(therapeutic_area_to_resolve) &
        (
            (f.array_contains(f.col('parents'), therapeutic_area_to_resolve[0])) |
            (f.array_contains(f.col('parents'), therapeutic_area_to_resolve[1]))
        )
    )
    .select('resolvedDiseaseId', 'resolvedDiseaseName')
#     .show(truncate=False)
    .persist()
)


# Load associations:
associations = (
    spark.read.parquet('/Users/dsuveges/project_data/associationByOverallDirect/')
    .select('diseaseId', 'targetId', 'score')
    
    # Filter gene of interest:
    .filter(f.col('targetId') == gene_of_interest)
)

resolved_associations = (
    associations
    
    # Joining associations with diseases where the ancestors column is exploded:
    .join(
        diseases.select(
            f.col('id').alias('diseaseId'),
            f.col('name').alias('diseaseName'),
            f.explode(f.col('ancestors')).alias('resolvedDiseaseId'),
            f.col('therapeuticAreas')
        ),
        on='diseaseId', how='left'
    )
    
    # Joining with the resolved disease dataset:
    .join(resolved_phenotypes_measurements, on='resolvedDiseaseId', how='left')
    .drop('resolvedDiseaseId')
    
    # Aggregating associations:
    .groupBy(['diseaseId', 'targetId'])
    .agg(
        f.first('score').alias('score'),
        f.first('diseaseName').alias('diseaseName'),
        f.first('therapeuticAreas').alias('therapeuticAreas'),
        f.collect_set('resolvedDiseaseName')
    )
    .persist()
)

Which gives rows like this:

 diseaseId                        | EFO_0004736                                     
 targetId                         | ENSG00000115977                                 
 score                            | 0.021152588627945577                            
 diseaseName                      | aspartate aminotransferase measurement          
 therapeuticAreas                 | [EFO_0001444]                                   
 collect_set(resolvedDiseaseName) | [liver enzyme measurement, protein measurement]
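
For readers without a Spark setup, the same five steps can be sketched in plain Python on a toy ontology. This is only an illustration of the logic, not a drop-in replacement. It assumes disease records with `id`, `name`, `parents` and `ancestors` fields, the three records are a hand-made version of the hematocrit example, the score is a made-up number, and the function names are hypothetical.

```python
ROOTS = {"EFO_0001444", "EFO_0000651"}  # measurement, phenotype


def terms_directly_under(diseases, roots=ROOTS):
    """Step 1: terms that have one of the root terms as a direct parent."""
    return {d["id"]: d["name"] for d in diseases if roots & set(d.get("parents", []))}


def resolve_associations(associations, diseases, roots=ROOTS):
    """Steps 3-5: look up each disease's ancestors and keep the granular ones."""
    by_id = {d["id"]: d for d in diseases}
    granular = terms_directly_under(diseases, roots)
    resolved = []
    for row in associations:
        disease = by_id.get(row["diseaseId"], {})
        # Only ancestors are checked, so a term that is itself directly
        # under a root gets an empty list, as in the Spark version.
        names = sorted({granular[a] for a in disease.get("ancestors", []) if a in granular})
        resolved.append({**row, "diseaseName": disease.get("name"), "resolvedNames": names})
    return resolved


# Toy disease records (assumed shape, not a real download).
diseases = [
    {"id": "EFO_0001444", "name": "measurement", "parents": [], "ancestors": []},
    {"id": "EFO_0004503", "name": "hematological measurement",
     "parents": ["EFO_0001444"], "ancestors": ["EFO_0001444"]},
    {"id": "EFO_0004348", "name": "hematocrit",
     "parents": ["EFO_0004503"], "ancestors": ["EFO_0004503", "EFO_0001444"]},
]
# Step 2: associations of interest (the score is a made-up value).
associations = [{"targetId": "ENSG00000115977", "diseaseId": "EFO_0004348", "score": 0.5}]
resolved = resolve_associations(associations, diseases)
```

A term can sit under several granular terms at once, which is why the result is a list rather than a single value.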

Please let me know if I misunderstood anything.

Hi Daniel,
So our script queries the API for 1000+ genes and rows look like this (here the symbol | represents a tab in our output file):
ENSG00000115977|AAK1|EFO_0004348|hematocrit|EFO_0001444: measurement
ENSG00000115977|AAK1|EFO_0004305|erythrocyte count|EFO_0001444: measurement
ENSG00000115977|AAK1|EFO_0005762|neuropathic pain|EFO_0000651: phenotype
ENSG00000115977|AAK1|MONDO_0005149|pulmonary hypertension|EFO_0000319: cardiovascular disease

I would like to get the following, so I can classify the associations (e.g. by physiological system):
ENSG00000115977|AAK1|EFO_0004348|hematocrit|EFO:0004503 hematological measurement
ENSG00000115977|AAK1|EFO_0004305|erythrocyte count|EFO:0004503 hematological measurement
ENSG00000115977|AAK1|EFO_0005762|neuropathic pain|EFO:0003843 pain
ENSG00000115977|AAK1|MONDO_0005149|pulmonary hypertension|EFO_0000319: cardiovascular disease

For “measurements” it seems to be the term directly under it that I need; for “phenotypes” it seems to be the term under “sign or symptom”, based on the ontologies for these examples:

EFO:0004305 erythrocyte count
EFO:0005762 neuropathic pain

Is there a way to do this?
Thanks,
Violeta
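
One possible way to express the rule described above is sketched below in plain Python. It walks the `parents` links upwards until it finds a term sitting directly under a chosen anchor term, and uses a different anchor per therapeutic area. Everything here is an assumption for illustration: the toy records are hand-made, the ID used for “sign or symptom” is a placeholder rather than the real EFO ID, and the function names are hypothetical.

```python
SIGN_OR_SYMPTOM = "PLACEHOLDER_SIGN_OR_SYMPTOM"  # placeholder, not the real EFO ID

# Which anchor term to look under, per therapeutic area (assumed rule).
ANCHORS = {"EFO_0001444": "EFO_0001444", "EFO_0000651": SIGN_OR_SYMPTOM}


def term_under_anchor(disease_id, anchor_id, diseases_by_id):
    """Return (id, name) of the first term found that has anchor_id as a direct parent."""
    to_visit, seen = [disease_id], set()
    while to_visit:
        current = to_visit.pop()
        if current in seen or current not in diseases_by_id:
            continue
        seen.add(current)
        parents = diseases_by_id[current].get("parents", [])
        if anchor_id in parents:
            return current, diseases_by_id[current]["name"]
        to_visit.extend(parents)
    return None


def classify(row, diseases_by_id, anchors=ANCHORS):
    """Format one association row, swapping in a more granular label when one is found."""
    target, symbol, disease_id, disease_name, area_id, area_name = row
    hit = None
    if area_id in anchors:
        hit = term_under_anchor(disease_id, anchors[area_id], diseases_by_id)
    label_id, label_name = hit or (area_id, area_name)
    return "\t".join([target, symbol, disease_id, disease_name, f"{label_id}: {label_name}"])


# Toy records (assumed shape; only the hierarchy needed for the example).
records = [
    {"id": SIGN_OR_SYMPTOM, "name": "sign or symptom", "parents": ["EFO_0000651"]},
    {"id": "EFO_0003843", "name": "pain", "parents": [SIGN_OR_SYMPTOM]},
    {"id": "EFO_0005762", "name": "neuropathic pain", "parents": ["EFO_0003843"]},
]
diseases_by_id = {r["id"]: r for r in records}
pain_line = classify(
    ("ENSG00000115977", "AAK1", "EFO_0005762", "neuropathic pain", "EFO_0000651", "phenotype"),
    diseases_by_id,
)
```

When a term has several paths up to the anchor, this sketch returns whichever one it reaches first, so collecting all matches may suit some analyses better.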