I actually managed to answer my own question while writing it out…
Here is my summary:
Why is unit often empty?
Look at baselineExpressionMaps in the code (GitHub):
val rnaTransposed = transposeDataframe(rnaDF, Seq("ID")).withColumn("unit", lit("TPM"))
val binnedTransposed = transposeDataframe(binnedDF, Seq("ID")).withColumn("unit", lit(""))
val zscoreTransposed = transposeDataframe(zscoreDF, Seq("ID")).withColumn("unit", lit(""))
...
  .groupBy("Gene", "Tissue")
  .agg(
    max(col("rna")).as("rna_val"),
    max(col("binned")).as("binned_val"),
    max(col("zscore")).as("zscore_val"),
    first("unit", ignoreNulls = true).as("unit_val")
  )
Key points:
- Only the raw RNA baseline matrix is tagged with unit = "TPM".
- The "others" you're asking about are:
  - binned (discrete expression level; dimensionless)
  - zscore (standardised; also dimensionless)
- Those correctly get unit = "" (no unit).
Then later:
- In generateBaselineInfo, unit_val → unit with coalesce(..., lit("")).
- In the final rna struct:
    struct(
      max(col("rna")).as("value"),
      max(col("zscore")).as("zscore"),
      max(col("binned")).as("level"),
      max(col("unit")).as("unit")
    ).as("rna")
So what happens in practice:
- If a (gene, tissue) has RNA baseline data:
  - There is at least one row with unit = "TPM", so unit in the final rna struct is "TPM".
- If a (gene, tissue) only has HPA protein data and no RNA baseline row:
  - There is no rna/binned/zscore row.
  - unit ends up as "" (empty).
- If there are only derived metrics without raw TPM (rare / depends on inputs):
  - unit will also be "", because those metrics are intentionally unitless.
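To sanity-check the three cases above, here is a minimal sketch in plain Scala (2.13+, no Spark). It is not the pipeline's code: Row, finalUnit and the sample values are hypothetical, and it assumes the final aggregation behaves like a lexicographic max over the unit strings with a fallback to "" when there are no rows.

```scala
object UnitDemo {
  // Hypothetical, simplified row: one measurement for a (gene, tissue) pair.
  final case class Row(gene: String, tissue: String, metric: String, value: Double, unit: String)

  // Stand-in for max(col("unit")) plus coalesce(..., lit("")):
  // the lexicographic maximum of the units, or "" when there are no rows.
  // "" sorts before "TPM", so a single TPM row is enough to win.
  def finalUnit(rows: Seq[Row]): String =
    rows.map(_.unit).maxOption.getOrElse("")

  // Case 1: raw RNA plus derived metrics (made-up values).
  val withRna: Seq[Row] = Seq(
    Row("G1", "liver", "rna", 12.5, "TPM"),
    Row("G1", "liver", "binned", 3.0, ""),
    Row("G1", "liver", "zscore", 1.0, "")
  )

  // Case 3: only the unitless derived metrics.
  val derivedOnly: Seq[Row] = withRna.filter(_.metric != "rna")

  // Case 2: protein-only pair, so no rna/binned/zscore rows at all.
  val proteinOnly: Seq[Row] = Seq.empty[Row]

  def main(args: Array[String]): Unit = {
    println(s"with RNA:     '${finalUnit(withRna)}'")
    println(s"derived only: '${finalUnit(derivedOnly)}'")
    println(s"protein only: '${finalUnit(proteinOnly)}'")
  }
}
```

Only the first case yields "TPM"; the other two fall through to the empty string, which matches what I see in the output.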
Is the metadata for “others” stored?
- Not in this dataset.
- Semantics of binned and zscore are implicit / documented externally (docs + pipeline notes), not encoded as per-field metadata in the JSON.
- Similarly, protein level (0–3) and reliability are encoded as a number and a boolean respectively, but their textual legend lives in documentation, not alongside each record.
So if you’re building your consolidated schema:
- Treat rna.unit:
  - "TPM" → genuine quantitative RNA.
  - "" → either no RNA baseline or only dimensionless summaries.
- Treat rna.level (binned) and rna.zscore as:
  - unitless derived fields.
- Treat protein.level 0–3 via the HPA legend (Not detected/Low/Medium/High).
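For the consolidated schema, the rules above could be applied on the consumer side roughly like this. It is a sketch in plain Scala under my own assumptions: RnaKind, classifyUnit and proteinLabel are names I made up, and I am assuming the HPA legend maps 0 to 3 in ascending order (Not detected, Low, Medium, High).

```scala
object RnaSchemaDemo {
  // Hypothetical interpretation of rna.unit in the consolidated schema.
  sealed trait RnaKind
  case object QuantitativeTpm extends RnaKind                // unit == "TPM"
  case object NoQuantitativeRna extends RnaKind              // unit == ""
  final case class UnexpectedUnit(unit: String) extends RnaKind

  def classifyUnit(unit: String): RnaKind = unit match {
    case "TPM" => QuantitativeTpm
    case ""    => NoQuantitativeRna   // no RNA baseline, or only unitless summaries
    case other => UnexpectedUnit(other)
  }

  // Assumed ordering of the HPA legend; the legend itself lives in the docs,
  // not in the records.
  val proteinLegend: Map[Int, String] = Map(
    0 -> "Not detected",
    1 -> "Low",
    2 -> "Medium",
    3 -> "High"
  )

  // None for anything outside 0 to 3, so bad values are not silently labelled.
  def proteinLabel(level: Int): Option[String] = proteinLegend.get(level)

  def main(args: Array[String]): Unit = {
    println(classifyUnit("TPM"))
    println(classifyUnit(""))
    println(proteinLabel(2))
    println(proteinLabel(7))
  }
}
```

Keeping an explicit UnexpectedUnit case means a future unit other than "TPM" shows up as something to investigate instead of being lumped in with the empty ones.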