Missing 'units' in the baseline expression data?

I actually managed to answer my own question while writing it out…

Here is my summary:

Why is unit often empty?

Look at baselineExpressionMaps in the pipeline code on GitHub:

val rnaTransposed     = transposeDataframe(rnaDF, Seq("ID")).withColumn("unit", lit("TPM"))
val binnedTransposed  = transposeDataframe(binnedDF, Seq("ID")).withColumn("unit", lit(""))
val zscoreTransposed  = transposeDataframe(zscoreDF, Seq("ID")).withColumn("unit", lit(""))
...
.groupBy("Gene", "Tissue")
.agg(
  max(col("rna")).as("rna_val"),
  max(col("binned")).as("binned_val"),
  max(col("zscore")).as("zscore_val"),
  first("unit", ignoreNulls = true).as("unit_val")
)

Key points:

  • Only the raw RNA baseline matrix is tagged with unit = "TPM".

  • The “others” you’re asking about are:

    • binned (discrete expression level; dimensionless)

    • zscore (standardised; also dimensionless)

  • Those correctly get unit = "" (no unit).

Then later:

  • In generateBaselineInfo, unit_val is mapped to unit via coalesce(..., lit("")), so a null becomes the empty string.

  • In the final rna struct:

    struct(
      max(col("rna")).as("value"),
      max(col("zscore")).as("zscore"),
      max(col("binned")).as("level"),
      max(col("unit")).as("unit")
    ).as("rna")
    
    

So what happens in practice:

  1. If a (gene, tissue) has RNA baseline data:

    • There is at least one row with unit = "TPM". Since max picks "TPM" over "" (the empty string sorts first), unit in the final rna struct is "TPM".

  2. If a (gene, tissue) only has HPA protein data and no RNA baseline row:

    • There is no rna/binned/zscore row.

    • unit ends up as "" (empty).

  3. If there are only derived metrics without raw TPM (rare / depends on inputs):

    • unit will also be "", because those metrics are intentionally unitless.
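
The three cases above can be reproduced on plain Scala collections, without Spark. This is only a sketch of the aggregation semantics as I understand them, not the pipeline code: the Row case class and the finalUnit helper are made-up names, and it assumes the final step behaves like a max over the non-null unit strings with a fallback to "".

```scala
// Hypothetical model of one transposed row for a single (gene, tissue) pair.
// unit is None when the row carries no unit value at all (a null in Spark).
final case class Row(metric: String, unit: Option[String])

object UnitSketch {
  // Mimics max(col("unit")) followed by a fallback to "":
  // nulls are ignored, "" sorts before "TPM", and no rows at all gives "".
  def finalUnit(rows: Seq[Row]): String = {
    val units = rows.flatMap(_.unit)
    if (units.isEmpty) "" else units.max
  }

  def main(args: Array[String]): Unit = {
    // Case 1: RNA baseline present alongside the derived metrics.
    val withRna = Seq(
      Row("rna", Some("TPM")),
      Row("binned", Some("")),
      Row("zscore", Some(""))
    )
    // Case 2: only protein data, so no rna/binned/zscore rows exist.
    val proteinOnly = Seq.empty[Row]
    // Case 3: derived metrics only, no raw TPM row.
    val derivedOnly = Seq(Row("binned", Some("")), Row("zscore", Some("")))

    println(s"with RNA:     '${finalUnit(withRna)}'")
    println(s"protein only: '${finalUnit(proteinOnly)}'")
    println(s"derived only: '${finalUnit(derivedOnly)}'")
  }
}
```

Only the first case produces a non-empty unit, which is why an empty unit on its own cannot tell you whether RNA data is missing or merely dimensionless.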

Is the metadata for “others” stored?

  • Not in this dataset.

  • Semantics of binned and zscore are implicit / documented externally (docs + pipeline notes), not encoded as per-field metadata in the JSON.

  • Similarly, protein level (0–3) and reliability are encoded as numeric and boolean values, but their textual legend lives in the documentation, not alongside each record.

So if you’re building your consolidated schema:

  • Treat rna.unit:

    • "TPM" → genuine quantitative RNA.

    • "" → either no RNA baseline or only dimensionless summaries.

  • Treat rna.level (binned) and rna.zscore as unitless derived fields.

  • Treat protein.level 0–3 via the HPA legend (Not detected/Low/Medium/High).
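
As a rough illustration of these rules, a consumer of the JSON could interpret the two fields along the following lines. The object and function names (BaselineInterpretation, describeRnaUnit, proteinLevelLabel) are invented for this sketch, and the 0 to 3 ordering of the HPA labels is assumed to follow the order listed above.

```scala
object BaselineInterpretation {
  // Interprets rna.unit according to the rules above (hypothetical helper).
  def describeRnaUnit(unit: String): String = unit match {
    case "TPM" => "quantitative RNA (TPM)"
    case ""    => "no RNA baseline, or only dimensionless summaries"
    case other => s"unexpected unit: $other"
  }

  // Assumed mapping from protein.level to the HPA legend, in listed order.
  private val hpaLegend = Vector("Not detected", "Low", "Medium", "High")

  // Returns None for any level outside 0 to 3 instead of throwing.
  def proteinLevelLabel(level: Int): Option[String] = hpaLegend.lift(level)

  def main(args: Array[String]): Unit = {
    println(describeRnaUnit("TPM"))
    println(describeRnaUnit(""))
    println(proteinLevelLabel(2))
    println(proteinLevelLabel(7))
  }
}
```

Returning an Option for the protein label keeps out-of-range levels visible to the caller rather than silently mapping them to a default.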