A presentation at AI Council in San Francisco, CA, USA by Marta Paes

AI COUNCIL 2026 LIGHTNING TALK
Data lake CDC: Are we there yet?
Marta Paes, Sr. Product Manager
MY FIRST REACTION
Data lakes were built for scale, not speed. For real-time analytics, why not stick to one of the standard patterns?

ALT 01 · Real-time ingestion. Ingest directly from streaming or OLTP CDC data sources. No intermediate storage, no extra hop.
ALT 02 · Batch refresh. Refresh the table on a schedule. Old school, solid, no real-time architecture overhead.
ALT 03 · Kafka as the hub. Centralize in the broker. 1 topic, fan out to n consumers.
ALT 04 · Remote reads. Query data where it lives. No object storage, no pipelines, no copies. Trade off latency for simplicity.
Only sync what changed.
🎣 How do we build this?

[Diagram: writers commit to a catalog-managed data lake; real-time sync of CDC inserts, updates, and deletes into ClickHouse; sub-second OLAP analytics serving agents and apps.]
01 · THE BASICS
The primitives exist. Both formats expose change information, but use different approaches.

                      DELTA LAKE                               ICEBERG
ABSTRACTION           Change Data Feed                         Row lineage
AVAILABLE SINCE       Delta 2.0 (2022)                         V3 spec (Feb 2025)
WHERE CHANGES LIVE    Pre-computed in _change_data/            Implicit, in metadata and sequence numbers
STORAGE COST          2×: main data plus change files          Negligible: extra columns only
CONSUMER EFFORT       Low: read files tagged by change type    High: filter, sort, dedupe as the consumer
02 · HOW THE BASICS WORK
DELTA LAKE: Change Data Feed in practice.

Enable delta.enableChangeDataFeed. Delta materializes change files alongside the data. The consumer queries a version range and gets insert / update / delete rows back, tagged.

delta-cdf.sql

  -- 1. Producer side: enable once.
  ALTER TABLE silver.orders SET TBLPROPERTIES (
    'delta.enableChangeDataFeed' = 'true'
  );

  -- 2. Consumer side: read a version range.
  SELECT * FROM table_changes(
    'silver.orders',
    starting_version => 142,
    ending_version => 156
  ) WHERE _change_type IN ('update_postimage', 'insert');

  -- Returns: <your columns>, _change_type, _commit_version, _commit_timestamp

WINS: Trivial to consume. Versioned. Before/after images for updates.
COSTS: 2× storage on changing tables. Schema changes can invalidate the feed.
Consumer dedupes, snapshots, advances. 08 / 14
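That loop can be sketched in SQL. This is a hedged illustration, not a shipped API: it assumes a query engine that exposes Iceberg V3's row-lineage metadata columns (`_row_id`, `_last_updated_sequence_number`, per the V3 spec); the `silver.orders` table and the `:last_seq` checkpoint parameter are illustrative.

```sql
-- Hypothetical consumer-side dedupe over Iceberg V3 row lineage.
-- :last_seq is the consumer's checkpoint from its previous read.
WITH changed AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY _row_id                        -- one row per identity
           ORDER BY _last_updated_sequence_number DESC  -- newest image first
         ) AS rn
  FROM silver.orders
  WHERE _last_updated_sequence_number > :last_seq      -- only new commits
)
SELECT * FROM changed
WHERE rn = 1;  -- latest image of each changed row
```

After emitting these rows, the consumer advances `:last_seq` to the highest sequence number it observed: dedupe, snapshot, advance.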
03 · THE ICEBERG SPEC
Each version closes one gap. Making CDC primitives first-class citizens.

V1 (2017): No CDC concept.
V2 (2021): Snapshot diffing only.
V3 (Feb 2025, shipped): Row identity. Rows finally know who they are. Updates stop looking like delete and insert pairs. CDC without replaying history.
V4 (ongoing): Compact deltas. Change detection moves to the root manifest. Polling costs drop. Small commits stay small.
What’s missing?
04 · OLTP CDC
What we learned from database CDC.

01 · One globally ordered, durable log. (Postgres WAL, MySQL binlog, Mongo oplog.) Every change flows through a single log with monotonic ordering, durability guarantees, and transaction context.
    LAKE TODAY: Change tracking is per-table. No global ordering across tables.

02 · Durable checkpoints with retention. (Postgres LSN, MySQL GTID, Mongo resume token.) A stable identifier consumers can resume from. The database holds changes until the consumer has caught up (…kind of).
    LAKE TODAY: External clients track offsets. No retention contract tied to consumer progress.

03 · One protocol, pluggable backends. (Debezium, ClickPipes, others.) Each data source exposes its own primitives. The consumer builds the abstraction: one consistent interface, many sources behind it.
    LAKE TODAY: No one is building the consumer layer yet. The user builds the abstraction.
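Postgres shows the first two primitives concretely: a logical replication slot is the durable checkpoint, and the server retains WAL until the slot's consumer has caught up. A minimal sketch using the built-in test_decoding output plugin (slot name is illustrative):

```sql
-- Create a logical replication slot: the durable checkpoint.
-- Postgres retains WAL from this point until the slot consumes it.
SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Pull changes since the slot's position (this call also advances it).
-- Each row carries an LSN: the globally ordered, resumable offset.
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL);
```

The "(…kind of)" caveat applies here too: an abandoned slot pins WAL forever, which is why retention tied to consumer progress needs care.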
05 · THE CATALOG
Is the catalog the missing piece?

01 · GLOBAL ORDERING: A monotonic sequence on every commit. Consumers ask for everything after a given offset, across the whole catalog.
02 · EFFICIENT CHANGE DETECTION: The catalog tells consumers what changed since their last read. No full metadata scan required.
03 · CONSUMER-AWARE RETENTION: A retention contract tied to consumer progress. Today that's the consumer's problem, at every level.
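To make the three properties concrete, here is what a catalog-level change feed could look like. Everything in this sketch is hypothetical: no catalog exposes `catalog.changes`, `commit_seq`, or `commit_offset` today; the names only illustrate the shape of the contract.

```sql
-- Hypothetical: a catalog-level change feed. Names are illustrative only.
SELECT table_name, commit_seq, snapshot_id
FROM catalog.changes                  -- 01: one ordered feed across all tables
WHERE commit_seq > :consumer_offset   -- 02: just the delta since last read
ORDER BY commit_seq;

-- 03: committing progress would let the catalog expire old metadata safely.
-- CALL catalog.commit_offset('my_consumer', :new_offset);
```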
Are we there yet?
Not quite. The primitives exist. The plumbing doesn't. We're building it at ClickHouse.
JOIN US! CAREERS: clickhouse.com/careers
The idea of incremental reads from data lakes has been cooking for years, but few are serving it up. As a user, you must wrangle change feeds, snapshots, time travel, that one corrupted manifest file. Do you need to be a “Big Data Engineer” to get it right? In this lightning talk, we’ll explore what’s broken, what’s just hard, and why making data lake CDC accessible is a problem worth solving.