Help understanding the data contained in the Platform data downloads

I would appreciate it if you could tell me whether metadata is available for the data in the data downloads, because I can’t quite understand some column names or the meaning of some rows. For instance, the diseases table has a column named “sko”; I can’t work out what it means, and I couldn’t find anything about it in the documentation. I also have a hard time understanding the values in the dbXRefs column.

So my request is: could you please point me to documentation that explains the data (for example, what each column means), since I couldn’t find it anywhere, and guide me towards a better understanding of the data available in the Open Targets Platform?

This question was sent to the Open Targets helpdesk and has been posted here so that the answers can benefit the whole Community of users.

Hi @imane, and thank you for your question!

I have had a chat with the team and we agree that having metadata would help to make our data more accessible, but we don’t currently have any metadata for specific datasets nor the capacity to create it.

sko is an inherited field name from other datasets, while dbXrefs are database cross references that we use to help us join our data to other input sources.
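
As a rough illustration of how cross references like these can be used for joins, the sketch below splits each value into a source prefix and a local identifier. It assumes that dbXRefs entries look like `PREFIX:identifier`; that format, the helper name `group_xrefs`, and the sample values are assumptions for illustration only.

```python
from collections import defaultdict


def group_xrefs(db_xrefs):
    """Group cross references by their source prefix.

    Assumes each entry looks like 'PREFIX:local_id'. Entries without a
    colon are collected under the key 'unknown'.
    """
    grouped = defaultdict(list)
    for xref in db_xrefs:
        prefix, sep, local_id = xref.partition(":")
        if sep:
            grouped[prefix].append(local_id)
        else:
            grouped["unknown"].append(xref)
    return dict(grouped)


# Hypothetical sample values, not taken from the Platform data.
sample = ["MESH:D003924", "DOID:9352", "MESH:D000001", "no-prefix-value"]
print(group_xrefs(sample))
```

Grouping by prefix makes it easier to pick the one external vocabulary you want to join on and ignore the rest.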

For other fields, we suggest taking a look at the GraphQL API schema endpoint, which includes some documentation. For example, disease in the API schema shows:

"Disease or phenotype entity"
type Disease {
  "Open Targets disease id"
  id: String!

  "Disease name"
  name: String!

  "Disease description"
  description: String

  "List of external cross reference IDs"
  dbXRefs: [String!]

  "List of direct location Disease terms"
  directLocationIds: [String!]

  "List of indirect location Disease terms"
  indirectLocationIds: [String!]

  "List of obsolete diseases"
  obsoleteTerms: [String!]

  "Disease synonyms"
  synonyms: [DiseaseSynonyms!]
  ancestors: [String!]!
  descendants: [String!]!

You can play around with the API in our GraphQL Playground.
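
If you prefer to pull the field descriptions programmatically, a standard GraphQL introspection query can list them. The sketch below is a minimal example under assumptions: the endpoint URL is assumed (check the Platform documentation for the current one), and the helper names `build_payload`, `field_descriptions` and `fetch_type_docs` are made up for this example.

```python
import json
import urllib.request

# Assumed public endpoint; verify it against the Platform documentation.
API_URL = "https://api.platform.opentargets.org/api/v4/graphql"

# Standard GraphQL introspection: ask for a type's fields and descriptions.
INTROSPECTION_QUERY = """
query TypeDocs($name: String!) {
  __type(name: $name) {
    name
    description
    fields { name description }
  }
}
"""


def build_payload(type_name):
    """Build the JSON body for the introspection request."""
    return {"query": INTROSPECTION_QUERY, "variables": {"name": type_name}}


def field_descriptions(response):
    """Map field name -> description (None when the field is not annotated)."""
    type_info = response["data"]["__type"]
    return {field["name"]: field["description"] for field in type_info["fields"]}


def fetch_type_docs(type_name, url=API_URL):
    """POST the introspection query and parse the reply (needs network access)."""
    body = json.dumps(build_payload(type_name)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return field_descriptions(json.load(reply))


# Example usage (performs a network call):
# for name, text in fetch_type_docs("Disease").items():
#     print(f"{name}: {text or '(no description)'}")
```

Fields that come back with no description are the unannotated ones mentioned below.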

However, it is important to note that not all fields are annotated, and the API names aren’t always correlated with the names used in the raw data.

We’re working on documenting our endpoints, but I’m sorry I couldn’t be more helpful at this time!

Could you please help me understand how the “score” in the table “associationByOverallDirect” is calculated and what it means?

I would also appreciate it if you could explain what “evidenceCount” means in the same table.

Hi @imane!

The score in associationByOverallDirect is the overall association score calculated using only direct evidence, and evidenceCount is the number of evidence strings that support that association.

For more information about how the scoring is calculated, we recommend you take a look at the Platform documentation: Target - disease associations - Open Targets Platform Documentation

Helena

Hello!

I have a question about disease > ontology > leaf. What does “leaf” mean?

Leaves are disease terms at the very bottom of a branch, i.e. terms without children. Cowpox is a leaf term; syphilis is not.
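
As a small sketch of that definition, a term can be treated as a leaf when its list of descendants is empty. This assumes each disease record carries a `descendants` list like the one in the API schema; the identifiers and descendant lists below are toy values, not real Platform data.

```python
def is_leaf(disease):
    """Return True when the term has no descendants (no children)."""
    return len(disease.get("descendants", [])) == 0


# Toy records with made-up identifiers, for illustration only.
diseases = [
    {"id": "TOY_0001", "name": "syphilis", "descendants": ["TOY_0002", "TOY_0003"]},
    {"id": "TOY_0004", "name": "cowpox", "descendants": []},
]

leaf_names = [d["name"] for d in diseases if is_leaf(d)]
print(leaf_names)
```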

Hello OT team!
This follows the same line of “clarifying scores of evidence”, but I am writing here because this thread defines “evidenceCount”.
I extracted the “score” and the “evidenceCount” for “diseaseId” = EFO_0000565 and “targetId” = ENSG00000157764 (leukemia and BRAF).
The information comes from the “Associations - direct (overall score)” and “Associations - direct (by data source)” JSON files.
The evidenceCount is the same in both JSON files for this association (4).
The result is just one dictionary for each JSON file.

associationByDatasourceDirect:
"score": 0.16253705352810702,
"datasourceId": "chembl",
"evidenceCount": 4,
"diseaseId": "EFO_0000565",

associationByOverallDirect_view:
"score": 0.09881128059278485,
"targetId": "ENSG00000157764",
"diseaseId": "EFO_0000565",
"evidenceCount": 4

In which part of the documentation is the calculation of evidenceCount explained? I understand it is the ontology evidence? Should I assume that evidenceCount means the ontological association between “targetId” and “diseaseId” appears in ONLY 1 datasource out of the total of 23 (in OT_24.06), which is on the order of 4? Or does the string appear 4 times in the ontology association calculation algorithm (which I guess is somehow related to the data processing)?

Each unique target-disease pair in the Open Targets Platform is defined as an association. For example, while there might be several pieces of evidence referring to CFTR and Cystic fibrosis from multiple sources, one single association contextualises all this information within the Platform. [1]

I just want to be sure I understand the numbers behind this and when to use them.

Many thanks again!

UPDATE:

I think I found part of my answer in this piece of text:
"Target - disease evidence

Every event or set of events pinpointing a target as a potential causal gene or protein for a disease, represents the unit of information, most often referred as evidence. Within Open Targets, a series of pipelines ensure information is retrieved from their sources and standardised in a way that can be immediately applied to answer drug development queries."

I was able to establish the comparison between the score and evidenceCount in the platform and the downloaded JSON files. Many thanks!
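
For anyone repeating this comparison, here is a minimal sketch of the lookup. It assumes the downloaded files are JSON lines (one JSON object per line) and that the by-data-source records also carry a "targetId" field; the helper name `find_association` is made up, and the sample lines reuse the values quoted above plus one toy record.

```python
import json


def find_association(lines, target_id, disease_id):
    """Return every record matching the given target and disease.

    `lines` can be any iterable of JSON strings, e.g. an open file handle.
    """
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if record.get("targetId") == target_id and record.get("diseaseId") == disease_id:
            matches.append(record)
    return matches


# Sample lines standing in for the two downloaded files.
overall_lines = [
    '{"score": 0.09881128059278485, "targetId": "ENSG00000157764", '
    '"diseaseId": "EFO_0000565", "evidenceCount": 4}',
    # Toy record with made-up identifiers.
    '{"score": 0.5, "targetId": "TOY_TARGET", "diseaseId": "TOY_DISEASE", "evidenceCount": 1}',
]
by_datasource_lines = [
    # "targetId" is assumed to be present in the by-data-source records.
    '{"score": 0.16253705352810702, "datasourceId": "chembl", "evidenceCount": 4, '
    '"targetId": "ENSG00000157764", "diseaseId": "EFO_0000565"}',
]

overall = find_association(overall_lines, "ENSG00000157764", "EFO_0000565")
by_source = find_association(by_datasource_lines, "ENSG00000157764", "EFO_0000565")

print("overall score:", overall[0]["score"], "evidenceCount:", overall[0]["evidenceCount"])
for record in by_source:
    print(record["datasourceId"], record["score"], record["evidenceCount"])
```

With real downloads, pass an open file handle instead of the sample lists, for example `with open(path) as handle: find_association(handle, target_id, disease_id)`, where `path` points at one of the downloaded part files.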