Ideas for Change Data Capture

Good evening,

I was just curious if anyone has a recommendation for a particular use case that I’ve come across, which sort of relates to change data capture (CDC).

Let’s say I stood up my infrastructure successfully and loaded data into ElasticSearch and ClickHouse from release X.0. When release X.1 is published, I only want to import the data that we can refer to as the delta between release X.1 and X.0:

Delta = X.1 - X.0

This would reduce the compute (and therefore financial) burden on the data consumer and support refreshes as new releases are published.

Are there any recommendations for how to approach this use case? Does BigQuery provide any tooling that can be leveraged to support it? Even something like transaction logs for the source table might be helpful in identifying new/changed rows (which makes me realize that the equation above doesn’t account for existing rows that are updated, if that ever happens).
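For anyone curious, here is a minimal sketch of one way I imagine a snapshot-diff could work in BigQuery, assuming two hypothetical release tables (`my_project.release_x0.targets` and `my_project.release_x1.targets`) that share a stable key column `id`. It fingerprints each row so that both brand-new rows and updated rows come back from a single query; none of these table or column names come from the actual datasets, they are placeholders for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table/column names; the idea is to hash every row of each
# release and keep only rows whose key is new or whose hash changed.
# Note: the hash compares the JSON-serialized row, so a schema column
# reorder between releases would show up as a change.
DELTA_SQL = """
WITH prev AS (
  SELECT t.id, FARM_FINGERPRINT(TO_JSON_STRING(t)) AS row_hash
  FROM `my_project.release_x0.targets` AS t
),
curr AS (
  SELECT t.*, FARM_FINGERPRINT(TO_JSON_STRING(t)) AS row_hash
  FROM `my_project.release_x1.targets` AS t
)
SELECT curr.* EXCEPT (row_hash)
FROM curr
LEFT JOIN prev ON prev.id = curr.id
WHERE prev.id IS NULL                 -- row is new in X.1
   OR prev.row_hash != curr.row_hash  -- row existed in X.0 but changed
"""

def fetch_delta():
    """Return only the rows that are new or modified in release X.1."""
    return list(client.query(DELTA_SQL).result())
```

Rows deleted between releases would need the reverse join (prev LEFT JOIN curr), and the result set could then be upserted into Elasticsearch/ClickHouse instead of reloading everything.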

Thanks for your help!

Cole

Good morning! Just reaching out to see if anyone has any recommendations for addressing this use case. Are others working through similar issues? Anyone have any lessons learned?

Do we know if there are any plans to include something like a “last updated” field for some of the major entities in the datasets?
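If such a field existed, the incremental import would reduce to a simple watermark filter. A minimal sketch, assuming a hypothetical `last_updated` TIMESTAMP column on a hypothetical `my_project.platform.targets` table and a watermark saved from the previous ingest:

```python
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical: the point in time up to which the previous release was ingested.
last_ingest_watermark = datetime(2023, 1, 15, tzinfo=timezone.utc)

# Pull only rows touched since the last ingest; everything else is unchanged.
query = """
SELECT *
FROM `my_project.platform.targets`
WHERE last_updated > @watermark
"""
job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "watermark", "TIMESTAMP", last_ingest_watermark
            )
        ]
    ),
)
changed_rows = list(job.result())
```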

On the core team, we are not actively working on incremental updates. We considered it in the past, but the risks and cost seem to outweigh the benefits, particularly for the Platform, where the data is relatively lightweight and the ETL processing/ingestion is not that heavy. The Genetics Portal is a different story, as its ETL and ES/CH data loads are much heavier, but at the moment we think several streamlining steps would need to precede such an effort.

However, we can support others’ efforts or try to make this easier to achieve. We always welcome external contributions and discussions on this or other topics. cc @JarrodBaker