Ideas for Change Data Capture

Good evening,

I was just curious if anyone has a recommendation for a particular use case that I’ve come across, which sort of relates to change data capture (CDC).

Let’s say I stood up my infrastructure successfully and loaded data into ElasticSearch and ClickHouse from release X.0. When release X.1 is published, I only want to import the data that we can refer to as the delta between release X.1 and X.0:

Delta = X.1 - X.0

This would reduce the compute (and therefore financial) burden on the data consumer and support refreshes as new releases are published.

Are there any recommendations for how to approach this use case? Does BigQuery provide any tooling that can be leveraged to support it? Even something like transaction logs for the source table might be helpful in identifying new/changed rows (which makes me realize that the equation above doesn’t account for existing rows that are updated, if that ever happens).
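For anyone curious, here is a minimal sketch of one way I imagine a snapshot-diff could work in BigQuery, assuming two hypothetical release tables (`my_project.release_x0.targets` and `my_project.release_x1.targets`) that share a stable key column `id`. It fingerprints each row so that both brand-new rows and updated rows come back from a single query; none of these table or column names come from the actual datasets, they are placeholders for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table/column names; the idea is to hash every row of each
# release and keep only rows whose key is new or whose hash changed.
# Note: the hash compares the JSON-serialized row, so a schema column
# reorder between releases would show up as a change.
DELTA_SQL = """
WITH prev AS (
  SELECT t.id, FARM_FINGERPRINT(TO_JSON_STRING(t)) AS row_hash
  FROM `my_project.release_x0.targets` AS t
),
curr AS (
  SELECT t.*, FARM_FINGERPRINT(TO_JSON_STRING(t)) AS row_hash
  FROM `my_project.release_x1.targets` AS t
)
SELECT curr.* EXCEPT (row_hash)
FROM curr
LEFT JOIN prev ON prev.id = curr.id
WHERE prev.id IS NULL                 -- row is new in X.1
   OR prev.row_hash != curr.row_hash  -- row existed in X.0 but changed
"""

def fetch_delta():
    """Return only the rows that are new or modified in release X.1."""
    return list(client.query(DELTA_SQL).result())
```

Rows deleted between releases would need the reverse join (prev LEFT JOIN curr), and the result set could then be upserted into Elasticsearch/ClickHouse instead of reloading everything.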

Thanks for your help!

Cole

Good morning! Just reaching out to see if anyone has any recommendations for addressing this use case. Are others working through similar issues? Anyone have any lessons learned?

Do we know if there are any plans to include something like a “last updated” field for some of the major entities in the datasets?
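If such a field existed, the incremental import would reduce to a simple watermark filter. A minimal sketch, assuming a hypothetical `last_updated` TIMESTAMP column on a hypothetical `my_project.platform.targets` table and a watermark saved from the previous ingest:

```python
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical: the point in time up to which the previous release was ingested.
last_ingest_watermark = datetime(2023, 1, 15, tzinfo=timezone.utc)

# Pull only rows touched since the last ingest; everything else is unchanged.
query = """
SELECT *
FROM `my_project.platform.targets`
WHERE last_updated > @watermark
"""
job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter(
                "watermark", "TIMESTAMP", last_ingest_watermark
            )
        ]
    ),
)
changed_rows = list(job.result())
```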

On the core team, we are not actively working on incremental updates. We considered it in the past, but the risks and cost seem to outweigh the benefits, particularly for the Platform, where the data is relatively lightweight and the ETL processing/ingestion is not that heavy. The Genetics Portal is a different story, as its ETL and ES/CH data loads are much heavier, but at the moment we think several streamlining steps would need to precede such an effort.

However, we can support others’ efforts or try to make this easier to achieve. We always welcome external contributions and discussions on this or other topics. cc @JarrodBaker