A presentation at AI Council in San Francisco, CA, USA by Marta Paes

AI COUNCIL 2026 LIGHTNING TALK
Data lake CDC: Are we there yet?
Marta Paes, Sr. Product Manager
MY FIRST REACTION
Data lakes were built for scale, not speed. For real-time analytics, why not stick to one of the standard patterns?

ALT 01 · Real-time ingestion. Ingest directly from streaming or OLTP CDC data sources. No intermediate storage, no extra hop.
ALT 02 · Batch refresh. Refresh the table on a schedule. Old school, solid, no real-time architecture overhead.
ALT 03 · Kafka as the hub. Centralize in the broker. 1 topic, fan out to n consumers.
ALT 04 · Remote reads. Query data where it lives. No object storage, no pipelines, no copies. Trade off latency for simplicity.
Only sync what changed.
🎣 How do we build this?

[Diagram: writers commit to a catalog-managed data lake; real-time sync of CDC inserts, updates, and deletes into ClickHouse; sub-second OLAP analytics serving agents and apps.]
01 · THE BASICS
The primitives exist. Both formats expose change information, but use different approaches.

                      DELTA LAKE                               ICEBERG
ABSTRACTION           Change Data Feed                         Row lineage
AVAILABLE SINCE       Delta 2.0 (2022)                         V3 spec (Feb 2025)
WHERE CHANGES LIVE    Pre-computed in _change_data/            Implicit, in metadata and sequence numbers
STORAGE COST          2×: main data plus change files          Negligible: extra columns only
CONSUMER EFFORT       Low: read files tagged by change type    High: filter, sort, dedupe as the consumer
02 · HOW THE BASICS WORK
DELTA LAKE: Change Data Feed in practice.

Enable delta.enableChangeDataFeed. Delta materializes change files alongside the data. The consumer queries a version range and gets insert / update / delete rows back, tagged.

delta-cdf.sql

  -- 1. Producer side: enable once.
  ALTER TABLE silver.orders SET TBLPROPERTIES (
    'delta.enableChangeDataFeed' = 'true'
  );

  -- 2. Consumer side: read a version range.
  SELECT * FROM table_changes(
    'silver.orders',
    starting_version => 142,
    ending_version => 156
  ) WHERE _change_type IN ('update_postimage', 'insert');

  -- Returns: <your columns>, _change_type, _commit_version, _commit_timestamp

WINS: Trivial to consume. Versioned. Before/after images for updates.
COSTS: 2× storage on changing tables. Schema changes can invalidate the feed.
Consumer dedupes, snapshots, advances. 08 / 14
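That loop can be sketched in SQL. This is a hedged illustration, not a shipped API: it assumes a query engine that exposes Iceberg V3's row-lineage metadata columns (`_row_id`, `_last_updated_sequence_number`, per the V3 spec); the `silver.orders` table and the `:last_seq` checkpoint parameter are illustrative.

```sql
-- Hypothetical consumer-side dedupe over Iceberg V3 row lineage.
-- :last_seq is the consumer's checkpoint from its previous read.
WITH changed AS (
  SELECT *,
         ROW_NUMBER() OVER (
           PARTITION BY _row_id                        -- one row per identity
           ORDER BY _last_updated_sequence_number DESC  -- newest image first
         ) AS rn
  FROM silver.orders
  WHERE _last_updated_sequence_number > :last_seq      -- only new commits
)
SELECT * FROM changed
WHERE rn = 1;  -- latest image of each changed row
```

After emitting these rows, the consumer advances `:last_seq` to the highest sequence number it observed: dedupe, snapshot, advance.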
03 · THE ICEBERG SPEC
Each version closes one gap. Making CDC primitives first-class citizens.

V1 (2017): No CDC concept.
V2 (2021): Snapshot diffing only.
V3 (Feb 2025, shipped): Row identity. Rows finally know who they are. Updates stop looking like delete and insert pairs. CDC without replaying history.
V4 (ongoing): Compact deltas. Change detection moves to the root manifest. Polling costs drop. Small commits stay small.
What’s missing?
04 · OLTP CDC
What we learned from database CDC.

01 · One globally ordered, durable log. (Postgres WAL, MySQL binlog, Mongo oplog.) Every change flows through a single log with monotonic ordering, durability guarantees, and transaction context.
    LAKE TODAY: Change tracking is per-table. No global ordering across tables.

02 · Durable checkpoints with retention. (Postgres LSN, MySQL GTID, Mongo resume token.) A stable identifier consumers can resume from. The database holds changes until the consumer has caught up (…kind of).
    LAKE TODAY: External clients track offsets. No retention contract tied to consumer progress.

03 · One protocol, pluggable backends. (Debezium, ClickPipes, others.) Each data source exposes its own primitives. The consumer builds the abstraction: one consistent interface, many sources behind it.
    LAKE TODAY: No one is building the consumer layer yet. The user builds the abstraction.
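Postgres shows the first two primitives concretely: a logical replication slot is the durable checkpoint, and the server retains WAL until the slot's consumer has caught up. A minimal sketch using the built-in test_decoding output plugin (slot name is illustrative):

```sql
-- Create a logical replication slot: the durable checkpoint.
-- Postgres retains WAL from this point until the slot consumes it.
SELECT pg_create_logical_replication_slot('cdc_demo', 'test_decoding');

-- Pull changes since the slot's position (this call also advances it).
-- Each row carries an LSN: the globally ordered, resumable offset.
SELECT lsn, xid, data
FROM pg_logical_slot_get_changes('cdc_demo', NULL, NULL);
```

The "(…kind of)" caveat applies here too: an abandoned slot pins WAL forever, which is why retention tied to consumer progress needs care.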
05 · THE CATALOG
Is the catalog the missing piece?

01 · GLOBAL ORDERING: A monotonic sequence on every commit. Consumers ask for everything after a given offset, across the whole catalog.
02 · EFFICIENT CHANGE DETECTION: The catalog tells consumers what changed since their last read. No full metadata scan required.
03 · CONSUMER-AWARE RETENTION: A retention contract tied to consumer progress. Today that's the consumer's problem, at every level.
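To make the three properties concrete, here is what a catalog-level change feed could look like. Everything in this sketch is hypothetical: no catalog exposes `catalog.changes`, `commit_seq`, or `commit_offset` today; the names only illustrate the shape of the contract.

```sql
-- Hypothetical: a catalog-level change feed. Names are illustrative only.
SELECT table_name, commit_seq, snapshot_id
FROM catalog.changes                  -- 01: one ordered feed across all tables
WHERE commit_seq > :consumer_offset   -- 02: just the delta since last read
ORDER BY commit_seq;

-- 03: committing progress would let the catalog expire old metadata safely.
-- CALL catalog.commit_offset('my_consumer', :new_offset);
```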
Are we there yet?
Not quite. The primitives exist. The plumbing doesn't. We're building it at ClickHouse.
JOIN US! CAREERS: clickhouse.com/careers
The idea of incremental reads from data lakes has been cooking for years, but few are serving it up. As a user, you must wrangle change feeds, snapshots, time travel, that one corrupted manifest file. Do you need to be a “Big Data Engineer” to get it right? In this lightning talk, we’ll explore what’s broken, what’s just hard, and why making data lake CDC accessible is a problem worth solving.