Croissant.json errors in new 26.03 release - and request for consistent column names between releases (if possible)

Hi OpenTargets team,

Thanks and congrats on the 26.03 release!

The croissant file is very useful, allowing for automated parsing pipelines. Those pipelines work if the croissant file is correct, which unfortunately is not the case with the current one. Some of the keys mentioned are not present as fields:

WARNING: Table ‘association_by_datasource_direct’ has invalid primary key field(s): datasourceId
WARNING: Table ‘association_by_datasource_indirect’ has invalid primary key field(s): datasourceId
WARNING: Table ‘association_by_datatype_direct’ has invalid primary key field(s): datatypeId
WARNING: Table ‘association_by_datatype_indirect’ has invalid primary key field(s): datatypeId

On a related note: it would be VERY nice if column names were to stay the same as much as possible between releases. That is not always possible, but many times it would be. Example: previously for association_by_datasource_direct there were columns datatype_id and datasource_id - making it evident that there are multiple sources per data type. This is now gone, and the 26.03 association_by_datasource_direct has aggregation_type (a string constant == “datasourceId”) and aggregation_value, the latter being equal to the previous datasource_id (eg, europepmc).

Hi,

Thank you for letting us know about the discrepancy in the croissant annotation and the datasets. I’ll take a look, and will try to fix it.

Regarding:

On a related note: it would be VERY nice if column names were to stay the same as much as possible between releases. That is not always possible, but many times it would be.

This schema change is part of a larger effort to use the date of evidence to assess novelty (and, more generally, temporal trends) in association data (ref). We will provide more explanation once the product is fully integrated into the platform and released.

Sorry for the inconvenience this update caused, but hopefully you’ll be compensated by the value the new dataset provides.

Best,
Daniel

Thanks for the reply.

For the association scores by data source: is there a mapping somewhere of data source to data type? That was very useful and would be nice to keep.

Hi,

is there a mapping somewhere of data source to data type? That was very useful and would be nice to keep.

Each evidence file has a field with the datasourceId and datatypeId. The collected set of all identifiers:

+-------------------+-------------------+
|datasourceId       |datatypeId         |
+-------------------+-------------------+
|reactome           |affected_pathway   |
|crispr             |affected_pathway   |
|cancer_biomarkers  |affected_pathway   |
|crispr_screen      |affected_pathway   |
|impc               |animal_model       |
|clinical_precedence|clinical           |
|gwas_credible_sets |genetic_association|
|eva                |genetic_association|
|uniprot_variants   |genetic_association|
|gene_burden        |genetic_association|
|orphanet           |genetic_association|
|uniprot_literature |genetic_literature |
|genomics_england   |genetic_literature |
|clingen            |genetic_literature |
|gene2phenotype     |genetic_literature |
|europepmc          |literature         |
|expression_atlas   |rna_expression     |
|cancer_gene_census |somatic_mutation   |
|intogen            |somatic_mutation   |
|eva_somatic        |somatic_mutation   |
+-------------------+-------------------+

The list with the weights can be found in our documentation where the association score calculation is explained.
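For anyone who wants to reattach the data type to the per-datasource associations, the pairing in the table above can be kept as a small lookup. This is only a sketch: the pairs are copied from the table, the column name `aggregation_value` is taken from the description earlier in this thread (the real parquet column may be spelled differently), and `add_datatype` and the sample rows are hypothetical.

```python
# Pairs copied from the table above (datasourceId -> datatypeId).
SOURCE_TO_TYPE = {
    "reactome": "affected_pathway",
    "crispr": "affected_pathway",
    "cancer_biomarkers": "affected_pathway",
    "crispr_screen": "affected_pathway",
    "impc": "animal_model",
    "clinical_precedence": "clinical",
    "gwas_credible_sets": "genetic_association",
    "eva": "genetic_association",
    "uniprot_variants": "genetic_association",
    "gene_burden": "genetic_association",
    "orphanet": "genetic_association",
    "uniprot_literature": "genetic_literature",
    "genomics_england": "genetic_literature",
    "clingen": "genetic_literature",
    "gene2phenotype": "genetic_literature",
    "europepmc": "literature",
    "expression_atlas": "rna_expression",
    "cancer_gene_census": "somatic_mutation",
    "intogen": "somatic_mutation",
    "eva_somatic": "somatic_mutation",
}


def add_datatype(rows, source_column="aggregation_value"):
    """Return copies of the rows with a 'datatypeId' looked up from the source id.

    Unknown sources get None instead of raising, so new sources are easy to spot.
    The default column name is an assumption based on this thread.
    """
    annotated = []
    for row in rows:
        enriched = dict(row)
        enriched["datatypeId"] = SOURCE_TO_TYPE.get(row[source_column])
        annotated.append(enriched)
    return annotated


# Hypothetical rows shaped like the per-datasource association table.
rows = [
    {"aggregation_type": "datasourceId", "aggregation_value": "europepmc"},
    {"aggregation_type": "datasourceId", "aggregation_value": "gwas_credible_sets"},
]
genetic = [r for r in add_datatype(rows) if r["datatypeId"] == "genetic_association"]
```

With the two sample rows, `genetic` keeps only the `gwas_credible_sets` row.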

The croissant file is very useful, allowing for automated parsing pipelines. Those pipelines work if the croissant file is correct, which unfortunately is not the case with the current one. Some of the keys mentioned are not present as fields:

When looking through the datasets for the current release, I can confirm that they are consistent with the current croissant file you can access from the downloads page. For each release, the croissant file is consistent with the released data (actually the croissant file is generated based on the schema of the data itself), however this consistency is not applied across releases. So to interpret the data only the croissant file belonging to that specific release can be used.

Please let us know if anything needs to be clarified.
Best,
Daniel

Thanks Daniel!

I just had a look at the croissant file again. If I were to ignore what is in recordSet.key for the record set with id association_by_datasource_direct then yes, all is good. All actual fields reported are consistent with what is in the parquet file. The datasourceId field, however, is listed as a key but this column is not present in the corresponding parquet file.
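The kind of check that surfaces this can be sketched in a few lines of plain Python. This is not my actual pipeline and not an official Croissant validator: it assumes the metadata has already been loaded into a dict (e.g. with `json.load`), that each record set lists its columns under `field` and its keys under `key`, and that ids may be written either as `recordset/field` or as a bare field name. The helper names and the trimmed-down `metadata` example are made up for illustration.

```python
def as_list(value):
    """Normalise a JSON-LD value that may be missing, a single item, or a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def ref_id(ref):
    """A reference may be a plain string or an object like {"@id": "..."}."""
    return ref["@id"] if isinstance(ref, dict) else ref


def short(name):
    """Drop an optional 'recordset/' prefix from an id."""
    return name.rsplit("/", 1)[-1]


def invalid_keys(metadata):
    """Map record set id -> key names that are not declared as fields."""
    problems = {}
    for record_set in as_list(metadata.get("recordSet")):
        field_names = {short(f.get("@id", "")) for f in as_list(record_set.get("field"))}
        missing = [
            short(ref_id(key))
            for key in as_list(record_set.get("key"))
            if short(ref_id(key)) not in field_names
        ]
        if missing:
            problems[record_set.get("@id", "")] = missing
    return problems


# Hypothetical, heavily trimmed metadata mirroring the situation described above.
metadata = {
    "recordSet": [
        {
            "@id": "association_by_datasource_direct",
            "key": [{"@id": "association_by_datasource_direct/datasourceId"}],
            "field": [
                {"@id": "association_by_datasource_direct/aggregation_type"},
                {"@id": "association_by_datasource_direct/aggregation_value"},
            ],
        }
    ]
}
print(invalid_keys(metadata))
# {'association_by_datasource_direct': ['datasourceId']}
```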

Also thanks for the complete table above listing datasourceIds and datatypeIds. I’ve seen that both fields are in each individual evidence file; yet they are no longer carried through to, for example, association_by_datasource_direct. If somebody wanted to query all genetic evidence from that table, they would need a mapping of type to sources, which in previous releases was included in the associations table. Sure, everyone can generate the mapping you listed above, and everyone can store it somewhere. For the end user it would be easier if it were simply carried through during aggregation, and it would allow for easier direct workflows with the parquet files.

In the 25.12 release this was still present:
     datatype_id     |   datasource_id    | disease_id  |    target_id    |       score        | evidence_count
---------------------+--------------------+-------------+-----------------+--------------------+----------------
 genetic_association | gwas_credible_sets | EFO_0009322 | ENSG00000166415 | 0.5019521445109003 |              1

Best,

F

Ah, ok, now I get it! The inconsistency is within the croissant annotation itself, not between the record set and the actual data. Yes, you are right. We are fixing this. Thank you for bringing this issue to our attention.

We are looking into how the datatype/datasource pairing could be propagated; however, there’s a huge downstream benefit in unifying the schemas.

Best,
Daniel

Correct - the data is fine! (I do have an automatic pipeline that processes everything based on the croissant file, including checks on the croissant file itself - that’s how this came to light).

As for propagating datatype/datasource pairings - if the drawbacks outweigh the benefits, perhaps a simple csv/json file somewhere that provides this pairing to download might be enough. It would avoid everybody having to derive such a mapping individually.
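Such a file could be derived once from the evidence data along these lines. A minimal sketch, assuming the evidence records are already available as dicts carrying the `datasourceId` and `datatypeId` fields mentioned earlier in the thread; the function names and sample records are hypothetical, and reading the parquet files is left out.

```python
import csv
import io


def source_type_pairs(evidence_rows):
    """Collect the distinct (datasourceId, datatypeId) pairs, sorted for stable output."""
    return sorted({(row["datasourceId"], row["datatypeId"]) for row in evidence_rows})


def pairs_to_csv(pairs):
    """Serialise the pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["datasourceId", "datatypeId"])
    writer.writerows(pairs)
    return buffer.getvalue()


# Hypothetical evidence records; real ones carry many more fields.
evidence_rows = [
    {"datasourceId": "europepmc", "datatypeId": "literature"},
    {"datasourceId": "eva", "datatypeId": "genetic_association"},
    {"datasourceId": "europepmc", "datatypeId": "literature"},
]
mapping_csv = pairs_to_csv(source_type_pairs(evidence_rows))
print(mapping_csv)
# datasourceId,datatypeId
# europepmc,literature
# eva,genetic_association
```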

Best regards & thanks again,

F