How can we assess the validity of protein expression data?

Hi everyone,

After setting up our API, we found several values concerning expression patterns that we can’t interpret. I couldn’t find an explanation in the documentation, so I wanted to ask here: how are the RNA and protein levels calculated, and can we use them to rank targets by likelihood of expression?

'rna': {'zscore': -1, 'value': 0, 'unit': '', 'level': -1},
'protein': {'reliability': True,
'level': 2,

If I missed it in the documentation, I would greatly appreciate a link to the description of these parameters.

Kind regards,
Roman

Hi Roman,

First of all, here is the general explanation as to how the existing baseline expression dataset was produced: Baseline expression - Open Targets Platform Documentation

Now about the specific fields you see in the API. For RNA expression:

  • value is a normalised TPM (transcripts per million) count for all transcripts of a given gene in a given tissue
  • level is a bin number (1 to 10), which is mentioned in the documentation above as “Binned value of expression”. A level of -1 means expression was lower than a threshold and was discarded
  • zscore is a tissue specificity score, which is mentioned in the documentation above as “Tissue specificity”

For protein expression:

  • level is a categorical variable: 0 - Not detected/below threshold, 1 - Low expression, 2 - Medium, 3 - High
  • reliability is a technical flag passed on from the HPA data which reflects whether the value in the level field is reliable enough. You can discard values with “reliability = False”
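
To make the field meanings above concrete, here is a minimal Python sketch that turns one expression record into a readable summary. It assumes the record is a plain dict shaped like the snippet in the question and uses only the fields shown there; the function name describe_expression and the label strings are made up for illustration and are not part of the API.

```python
# Labels follow the protein categories listed above. The dict layout mirrors
# the snippet in the question and is otherwise an assumption.
PROTEIN_LEVELS = {
    0: "not detected / below threshold",
    1: "low",
    2: "medium",
    3: "high",
}


def describe_expression(record):
    """Summarise one record's RNA and protein expression (hypothetical helper)."""
    rna = record.get("rna", {})
    protein = record.get("protein", {})

    # RNA: level is a bin from 1 to 10; -1 means below threshold and discarded.
    rna_level = rna.get("level", -1)
    if rna_level == -1:
        rna_text = "below threshold (discarded)"
    else:
        rna_text = f"bin {rna_level} of 10"

    # Protein: only trust the level when the reliability flag is True.
    if not protein.get("reliability", False):
        protein_text = "unreliable (discard)"
    else:
        protein_text = PROTEIN_LEVELS.get(protein.get("level"), "unknown")

    return {"rna": rna_text, "protein": protein_text}


example = {
    "rna": {"zscore": -1, "value": 0, "unit": "", "level": -1},
    "protein": {"reliability": True, "level": 2},
}
summary = describe_expression(example)
print(summary)
```

For the record from the question, this reports RNA as below threshold and protein as medium.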

So, in conclusion: yes, you could use the “level” field to rank targets by expression in a given tissue, but just keep in mind that this field has different ranges for RNA and protein expression.
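
As a rough sketch of such a ranking, the Python below sorts targets by “level” within a single tissue, keeping RNA and protein rankings separate because their ranges differ. The target IDs, the records layout and the function name rank_targets are hypothetical; only the level and reliability fields come from the API response discussed above.

```python
def rank_targets(records, kind):
    """Rank target ids by 'level', highest first, for kind 'rna' or 'protein'.

    Hypothetical helper: `records` maps a target id to a dict shaped like the
    snippet in the question, all taken from the same tissue.
    """
    usable = []
    for target_id, record in records.items():
        entry = record.get(kind, {})
        level = entry.get("level", -1)
        # RNA: a level of -1 means the value was below threshold and discarded.
        if kind == "rna" and level == -1:
            continue
        # Protein: skip values that are not flagged as reliable.
        if kind == "protein" and not entry.get("reliability", False):
            continue
        usable.append((target_id, level))
    return sorted(usable, key=lambda pair: pair[1], reverse=True)


# Made-up records for illustration only.
records = {
    "TARGET_A": {"rna": {"level": 7}, "protein": {"level": 2, "reliability": True}},
    "TARGET_B": {"rna": {"level": -1}, "protein": {"level": 3, "reliability": False}},
    "TARGET_C": {"rna": {"level": 3}, "protein": {"level": 3, "reliability": True}},
}

rna_ranking = rank_targets(records, "rna")
protein_ranking = rank_targets(records, "protein")
print(rna_ranking)
print(protein_ranking)
```

Here TARGET_B drops out of both rankings: its RNA level is -1 and its protein value is flagged as unreliable.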

Finally, in case you are interested in the fine technical details, here is the source code of the module which produces these datasets: GitHub link

Thank you very much!