How to group associations by therapeutic area?

On the Open Targets Platform, it is possible to classify target-disease associations by therapeutic area.

After using the API to download all associations for selected genes of interest, is there a file I can download to perform this classification myself? For example, a file that maps each trait/phenotype/disease ID (EFO, HP, Mondo, Orphanet) to its therapeutic area or parent term.

This question was sent to the Open Targets helpdesk and has been posted here so the answer can benefit the whole Community of users.

Hello!

There are two ways that you can find the therapeutic area for a disease of interest.

  • The first would be to query this directly in the API.

The following query would give you the therapeutic areas for diseases and phenotypes associated with IL22:

query associatedDiseases {
  target(ensemblId: "ENSG00000127318") {
    id
    approvedSymbol
    associatedDiseases {
      count
      rows {
        disease {
          id
          name
          therapeuticAreas {
            id
            name
          }
        }
      }
    }
  }
}
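
For a larger set of genes, the same query can be sent from a script. The following is a minimal sketch, not an official client: it assumes the public endpoint https://api.platform.opentargets.org/api/v4/graphql and that `ensemblId` can be passed as a `String!` variable. The helper names `fetch_associated_diseases` and `flatten_rows` are made up for illustration.

```python
import json
import urllib.request

# Public GraphQL endpoint of the Open Targets Platform API (assumed here).
API_URL = "https://api.platform.opentargets.org/api/v4/graphql"

# Same query as above, with the Ensembl ID turned into a variable.
# Assumption: the schema types this argument as String!.
QUERY = """
query associatedDiseases($ensemblId: String!) {
  target(ensemblId: $ensemblId) {
    id
    approvedSymbol
    associatedDiseases {
      count
      rows {
        disease {
          id
          name
          therapeuticAreas {
            id
            name
          }
        }
      }
    }
  }
}
"""


def fetch_associated_diseases(ensembl_id):
    """POST the query to the API and return the decoded JSON response."""
    payload = json.dumps({"query": QUERY, "variables": {"ensemblId": ensembl_id}})
    request = urllib.request.Request(
        API_URL,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def flatten_rows(response):
    """Yield one tuple per associated disease, with its therapeutic areas joined."""
    target = response["data"]["target"]
    for row in target["associatedDiseases"]["rows"]:
        disease = row["disease"]
        areas = "; ".join(
            f"{area['id']}: {area['name']}" for area in disease["therapeuticAreas"]
        )
        yield (target["id"], target["approvedSymbol"], disease["id"], disease["name"], areas)
```

Calling `flatten_rows(fetch_associated_diseases("ENSG00000127318"))` would yield one tuple per returned row. The API may return only part of the associations per request, so check `count` against the number of rows you receive.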
  • The other is to use the disease/phenotype dataset in our data downloads.

The disease/phenotype dataset contains a field called therapeuticAreas. If you enrich your associations dataset with disease information by joining the two on the disease ID, you will get the information you are looking for.
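
As a rough illustration of that join, here is a small sketch in plain Python. It assumes the disease records expose `id` and `therapeuticAreas` and the association records expose `diseaseId` and `targetId`. The records are toy examples built from IDs mentioned in this thread, not a real download, and `add_therapeutic_areas` is a hypothetical helper name.

```python
def add_therapeutic_areas(associations, diseases):
    """Left-join association records to disease records on the disease ID."""
    areas_by_disease = {d["id"]: d["therapeuticAreas"] for d in diseases}
    return [
        # Associations without a matching disease record get an empty list.
        {**assoc, "therapeuticAreas": areas_by_disease.get(assoc["diseaseId"], [])}
        for assoc in associations
    ]


# Toy records for illustration only.
diseases = [
    {"id": "EFO_0004348", "name": "hematocrit", "therapeuticAreas": ["EFO_0001444"]},
    {
        "id": "MONDO_0005149",
        "name": "pulmonary hypertension",
        "therapeuticAreas": ["EFO_0000319"],
    },
]
associations = [
    {"targetId": "ENSG00000115977", "diseaseId": "EFO_0004348"},
    {"targetId": "ENSG00000115977", "diseaseId": "MONDO_0005149"},
]

enriched = add_therapeutic_areas(associations, diseases)
for row in enriched:
    print(row["targetId"], row["diseaseId"], ",".join(row["therapeuticAreas"]), sep="\t")
```

With the real downloads you would do the same join in your tool of choice (Spark, pandas, SQL); the dictionary lookup here only stands in for that step.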

I hope this helps! Let us know if you have other questions

Follow up question:

Is it possible to get the mappings of therapeutic areas to EFO, HP, Orphanet and MONDO terms you use in a simple tabular format, that I can integrate with my current analysis pipeline?

Yes, the terms found in the therapeuticAreas field are themselves terms in the disease/phenotype dataset, so you can look them up there to find more information about them.
The field that can be helpful is dbXRefs. It holds a list of cross-references between a disease in EFO and the corresponding IDs in other ontologies. For example, to get the cross-references for immune system disease in the API, the query would be:

query diseaseAnnotation {
  disease(efoId: "EFO_0000540") {
    id
    dbXRefs
  }
}
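
To turn such a list into a simple table, something like the sketch below could work. It assumes each cross-reference is a string of the form `PREFIX:identifier`; check that against the values you actually get back. The example values are placeholders, not real cross-references, and `xrefs_to_rows` is a hypothetical helper.

```python
def xrefs_to_rows(disease_id, db_xrefs):
    """Split each cross-reference into (disease ID, source prefix, source ID)."""
    rows = []
    for xref in db_xrefs:
        prefix, separator, local_id = xref.partition(":")
        if not separator:
            # No prefix found: keep the raw value and leave the prefix empty.
            prefix, local_id = "", xref
        rows.append((disease_id, prefix, local_id))
    return rows


# Placeholder cross-references, for illustration only.
example_xrefs = ["SOURCE_A:0001", "SOURCE_B:C123", "unprefixed"]

for row in xrefs_to_rows("EFO_0000540", example_xrefs):
    print("\t".join(row))
```

Each printed line is tab-separated, so the output can be written straight to a TSV file.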

Please keep in mind that this list will not always be so extensive, as it is pulled from EFO’s cross-references. If you can’t find a cross-reference, you can ask EFO to add it by opening an issue on their issue tracker: Issues · EBISPOT/efo · GitHub

Thanks for the question!
Irene

Hi,
Thanks for this. Is there a way to get parent terms for the traits, whether they are EFO, HP, MONDO, etc.?
For diseases I can use therapeutic areas, thank you for letting me know how to get those.
What about traits that are not diseases? For example, hematocrit EFO_0004348 has parent term Hematological measurement EFO_0004503 (hematocrit, hematological measurement).

Thanks!

Hi Violeta,

Welcome to the Open Targets community forum! On the disease profile page you can find both the child and parent terms in the Ontology widget. For hematocrit you’ll find hematological measurement and measurement. If you click on any of the terms, the link will take you to the profile page of that term.

This information is included in our disease index, so depending on your use case, either the GraphQL API or the flat files could be the better choice. Please let us know if this resolves your question.
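
If you read the parent terms from the disease index, the more specific parent can be found by walking up the ontology. The sketch below assumes you have already built a mapping from each term ID to its list of parent IDs. The toy mapping only encodes the hematocrit example from this thread, and `terms_directly_under` is a hypothetical helper name.

```python
def terms_directly_under(term_id, parents, roots):
    """Return the term or its ancestors that have one of `roots` as a direct parent."""
    found = set()
    seen = set()
    stack = [term_id]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        current_parents = parents.get(current, [])
        if any(parent in roots for parent in current_parents):
            found.add(current)
        # Keep climbing: a term can have several parents.
        stack.extend(current_parents)
    return found


# Toy parent map (assumed structure, for illustration only).
parents = {
    "EFO_0004348": ["EFO_0004503"],  # hematocrit -> hematological measurement
    "EFO_0004503": ["EFO_0001444"],  # hematological measurement -> measurement
    "EFO_0001444": [],               # measurement has no parent
}
roots = {"EFO_0001444", "EFO_0000651"}  # measurement, phenotype

print(terms_directly_under("EFO_0004348", parents, roots))
```

For hematocrit this returns the single term sitting directly under measurement in the toy map. In the real ontology a term can sit under several such parents, which is why the function returns a set.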

Best,
Daniel

Hi Daniel!
Thank you all, this is very helpful.
Can I get additional parent terms?
For example, for gene AAK1, we are getting 77 associations (we use the GraphQL API endpoint as we are querying a large number of genes).
I extract some lines:
ENSG00000115977|AAK1|EFO_0004348|hematocrit|EFO_0001444: measurement
ENSG00000115977|AAK1|EFO_0004305|erythrocyte count|EFO_0001444: measurement
ENSG00000115977|AAK1|EFO_0005762|neuropathic pain|EFO_0000651: phenotype
ENSG00000115977|AAK1|MONDO_0005149|pulmonary hypertension|EFO_0000319: cardiovascular disease

For the cases in which we are getting “measurement” or “phenotype”, is it possible to get these parent terms? That is:
EFO:0004348 hematocrit or EFO:0004305 erythrocyte count, can we get EFO:0004503 hematological measurement
EFO:0005762 neuropathic pain, can we get EFO:0003843 pain

Thanks!

Hi Violeta,

OK, I think I get it now. If the therapeutic area is measurement or phenotype, you want a more granular parent term if possible, and you want to get these values for a larger number of associations. Is this correct?

For such a use case I would highly recommend using our flat files on FTP (associations, diseases) rather than the frontend or the GraphQL API. (The ontology widget on the frontend uses a JSON file to load all EFO terms.)

So if I understand the problem right, the logic would flow like this:

  1. Get all the EFO terms that lie directly under phenotype and measurement.
  2. Get all associations of interest.
  3. Join diseases with associations, exploding the ancestors.
  4. Join the ancestors with the more granular terms.
  5. Aggregate associations.
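
from pyspark.sql import SparkSession
from pyspark.sql import functions as f

# Start (or reuse) a Spark session:
spark = SparkSession.builder.getOrCreate()

therapeutic_area_to_resolve = [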
    'EFO_0001444',  # Measurement 
    'EFO_0000651'  # Phenotype
]

gene_of_interest = 'ENSG00000115977'

# Loading disease dataset:
diseases = spark.read.parquet('/Users/dsuveges/project_data/diseases/')

# We are looking for the least granular terms for phenotypes and measurements,
# Which means, we want to get a list of terms where the only parent term is measurement or phenotype
resolved_phenotypes_measurements = (
    diseases
    .select(
        f.col('id').alias('resolvedDiseaseId'),
        f.col('name').alias('resolvedDiseaseName'), 
        f.explode('therapeuticAreas').alias('therapeuticArea'),
        'parents', 'ancestors'
    )
    # Get all terms that are directly under phenotypes and measurements:
    .filter(
        f.col('therapeuticArea').isin(therapeutic_area_to_resolve) &
        (
            (f.array_contains(f.col('parents'), therapeutic_area_to_resolve[0])) |
            (f.array_contains(f.col('parents'), therapeutic_area_to_resolve[1]))
        )
    )
    .select('resolvedDiseaseId', 'resolvedDiseaseName')
#     .show(truncate=False)
    .persist()
)


# Load associations:
associations = (
    spark.read.parquet('/Users/dsuveges/project_data/associationByOverallDirect/')
    .select('diseaseId', 'targetId', 'score')
    
    # Filter gene of interest:
    .filter(f.col('targetId') == gene_of_interest)
)

resolved_associations = (
    associations
    
    # Joining associations with diseases where the ancestors column is exploded:
    .join(
        diseases.select(
            f.col('id').alias('diseaseId'),
            f.col('name').alias('diseaseName'),
            f.explode(f.col('ancestors')).alias('resolvedDiseaseId'),
            f.col('therapeuticAreas')
        ),
        on='diseaseId', how='left'
    )
    
    # Joining with the resolved disease dataset:
    .join(resolved_phenotypes_measurements, on='resolvedDiseaseId', how='left')
    .drop('resolvedDiseaseId')
    
    # Aggregating associations:
    .groupBy(['diseaseId', 'targetId'])
    .agg(
        f.first('score').alias('score'),
        f.first('diseaseName').alias('diseaseName'),
        f.first('therapeuticAreas').alias('therapeuticAreas'),
        f.collect_set('resolvedDiseaseName')
    )
    .persist()
)

Which gives rows like this:

 diseaseId                        | EFO_0004736                                     
 targetId                         | ENSG00000115977                                 
 score                            | 0.021152588627945577                            
 diseaseName                      | aspartate aminotransferase measurement          
 therapeuticAreas                 | [EFO_0001444]                                   
 collect_set(resolvedDiseaseName) | [liver enzyme measurement, protein measurement]

Please let me know if I misunderstood anything.

Hi Daniel,
So our script queries the API for 1000+ genes and rows look like this (here the symbol | represents a tab in our output file):
ENSG00000115977|AAK1|EFO_0004348|hematocrit|EFO_0001444: measurement
ENSG00000115977|AAK1|EFO_0004305|erythrocyte count|EFO_0001444: measurement
ENSG00000115977|AAK1|EFO_0005762|neuropathic pain|EFO_0000651: phenotype
ENSG00000115977|AAK1|MONDO_0005149|pulmonary hypertension|EFO_0000319: cardiovascular disease

I would like to get the following, so I can classify the associations (e.g. by physiological system):
ENSG00000115977|AAK1|EFO_0004348|hematocrit|EFO:0004503 hematological measurement
ENSG00000115977|AAK1|EFO_0004305|erythrocyte count|EFO:0004503 hematological measurement
ENSG00000115977|AAK1|EFO_0005762|neuropathic pain|EFO:0003843 pain
ENSG00000115977|AAK1|MONDO_0005149|pulmonary hypertension|EFO_0000319: cardiovascular disease

For “measurements” it seems to be the term under it that I need, for “phenotypes” it seems to be the term under “sign or symptom”, based on the ontologies for these examples:

EFO:0004305 erythrocyte count
EFO:0005762 neuropathic pain

Is there a way to do this?
Thanks,
Violeta

Hello!
Could you elaborate a bit more on the “therapeutic area” field in the disease or phenotype table, please?

I cannot find information in the docs about how the therapeutic area is assigned: Clinical signs and symptoms | Open Targets Platform Documentation.

Thanks to this post I was able to understand a bit more. But, for example, I have a disease ID that is related to a null therapeutic area, and also to a disease_ontology with a field called “is_therapeutic_area” (and another called “leaf”) that is sometimes false and sometimes true.
Where could I find information about this, or about how it is assigned?

Many thanks

Hi @koryclick,

Unfortunately, we don’t currently document each field in the downloads. You can take a look at the docs in the API Playground to get short descriptions of the fields.

We get the therapeutic areas from our ontology. The therapeutic area terms are the highest terms in the ontology. For example, in the case of melanoma, you can see that the therapeutic area is “cancer or benign tumour”.



“Cancer or benign tumour” is itself a disease in our ontology, for which we collect evidence in the Platform. The field “is_therapeutic_area” tells you that this disease is also a high-level term in the ontology; it has no parent term.

Please let me know if you have additional questions!

Thanks again @hcornu for your quick reply.

Then should I assume that:

  1. When therapeuticAreas contains a string, isTherapeuticArea == True and leaf == False? (I cannot find the meaning of “leaf” in the GraphQL API Playground docs: https://api.platform.opentargets.org/api/v4/graphql.)
  2. In the same way, if therapeuticAreas is null, then isTherapeuticArea == False and leaf == True?

Am I right in assuming this? It would help me with my filtering queries.

  3. Another question: should the “ancestors” IDs match the content of “therapeuticAreas” when “isTherapeuticArea” == True?

  4. Is disease_ontology → sources_url ALWAYS the same content as disease_code?

  5. And what are “indirectLocations” and “directLocations”?

Many thanks!

Hi @koryclick,

  1. Here you are describing an ID referring to a therapeutic area like cancer or benign tumor. therapeuticAreas will always be an array of size 1, containing the term itself.
  2. The therapeuticAreas field cannot be null or empty because every term will either have its associated therapeutic area listed in this field or, in the case of a therapeutic area ID, the field will contain that same ID. Also, leaf does not simply mean “any term that is not a therapeutic area”. Leaves are the terms at the bottom of an ontology branch.
  3. No. ancestors will be empty when the ID is a therapeutic area.
  4. Yes.
  5. They are anatomical terms that annotate where the disease is located. They can be UBERON terms, for anatomical structures, or GO terms, for cellular components. The fields are mostly null at the moment because of a bug that we haven’t solved yet: Missing disease locations in latest platform release · Issue #2548 · opentargets/issues · GitHub
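
Points 1 to 3 can be turned into a simple check on a disease record. This is only a sketch of the rules described above, assuming records with `id`, `therapeuticAreas` and `ancestors` fields. The two records are toy examples based on terms from this thread, and `is_therapeutic_area` is a hypothetical helper.

```python
def is_therapeutic_area(record):
    """A therapeutic area lists only itself in therapeuticAreas and has no ancestors."""
    return record["therapeuticAreas"] == [record["id"]] and not record["ancestors"]


# Toy records for illustration only.
measurement = {
    "id": "EFO_0001444",
    "therapeuticAreas": ["EFO_0001444"],
    "ancestors": [],
}
hematocrit = {
    "id": "EFO_0004348",
    "therapeuticAreas": ["EFO_0001444"],
    "ancestors": ["EFO_0004503", "EFO_0001444"],
}

print(is_therapeutic_area(measurement))
print(is_therapeutic_area(hematocrit))
```

The first record satisfies both conditions and the second satisfies neither, so a filter built on this check separates therapeutic area terms from the terms below them.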

I hope that was helpful! Because these questions are about clarifying disease metadata rather than the topic of the initial thread, I’d suggest opening a new thread next time. That will make it easier for users to find relevant answers to their questions.

Thanks!
Irene
