Data Lake CDC: are we there yet?

WELL, ACTUALLY
Real pain from real customers.

01 · LAKE-FIRST ARCHITECTURE
The lake is the source of truth.
Pipeline lands in Delta or Iceberg.
The analytics database lives downstream.
Multiple consumers read from the lake, not the data sources.

02 · LAKE AS CONTRACT
The lake is the interface.
The producing team owns the data sources.
Consuming teams get access to the lake and build off it.
No control over what lives upstream.

03 · CROSS-CLOUD SYNC
Multi-cloud data sharing.
Data sources and analytics live in different clouds.
Repeated cross-cloud access is expensive.
Incremental sync from the lake is less expensive.

04 · SUB-HOUR FRESHNESS
Batch isn’t fast enough.
Real-time use cases need fresh data and sub-second query performance.
Full table refreshes at short intervals are prohibitively expensive at scale.
Only sync what changed.

04 / 14

02 · HOW THE BASICS WORK
DELTA LAKE
Change Data Feed in practice.

Enable delta.enableChangeDataFeed. Delta materializes change files alongside the data. The consumer queries a version range and gets insert / update / delete rows back, tagged.

WINS
Trivial to consume. Versioned. Before/after images for updates.

COSTS
2× storage on changing tables. Schema changes can invalidate the feed.

delta-cdf.sql

    -- 1. Producer side: enable once.
    ALTER TABLE silver.orders SET TBLPROPERTIES (
      'delta.enableChangeDataFeed' = 'true'
    );

    -- 2. Consumer side: read a version range.
    SELECT *
    FROM table_changes(
      'silver.orders',
      142,  -- starting version (inclusive)
      156   -- ending version (inclusive)
    )
    WHERE _change_type IN ('update_postimage', 'insert');

    -- Returns:
    --   <your columns>, _change_type, _commit_version, _commit_timestamp

07 / 14
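To make the consumer side concrete, here is a minimal Python sketch of how tagged change rows could be folded into a downstream copy of the table. It is an illustration under assumptions, not part of Delta itself: the function name apply_changes, the in-memory dict standing in for the analytics table, and order_id as the primary key are all hypothetical. Only the _change_type values and the _commit_version / _commit_timestamp column names come from the Change Data Feed. The sketch also assumes at most one effective change per key per commit.

```python
from typing import Any, Dict, List

# Metadata columns added by Delta's Change Data Feed.
META_COLUMNS = {"_change_type", "_commit_version", "_commit_timestamp"}


def apply_changes(
    replica: Dict[Any, dict],
    changes: List[dict],
    key: str = "order_id",  # assumed primary key, hypothetical for this sketch
) -> int:
    """Apply change rows to `replica` in commit order.

    Returns the last commit version seen, or -1 if `changes` is empty.
    """
    last_version = -1
    for row in sorted(changes, key=lambda r: r["_commit_version"]):
        kind = row["_change_type"]
        if kind == "update_preimage":
            # The before-image is not needed to maintain current state.
            pass
        elif kind in ("insert", "update_postimage"):
            # Upsert the row, dropping the feed's metadata columns.
            replica[row[key]] = {
                c: v for c, v in row.items() if c not in META_COLUMNS
            }
        elif kind == "delete":
            replica.pop(row[key], None)
        else:
            raise ValueError(f"unexpected _change_type: {kind!r}")
        last_version = row["_commit_version"]
    return last_version
```

Note that the query on the slide filters to 'update_postimage' and 'insert', so it would not return delete rows. A replica that has to mirror deletions would need 'delete' included in the filter, which is the case this sketch handles.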

04 · OLTP CDC
What we learned from database CDC.

01 One globally ordered, durable log.
Postgres WAL · MySQL binlog · Mongo oplog
Every change flows through a single log with monotonic ordering, durability guarantees, and transaction context.
LAKE TODAY
Change tracking is per-table. No global ordering across tables.

02 Durable checkpoints with retention.
Postgres LSN · MySQL GTID · Mongo resume token
A stable identifier consumers can resume from. The database holds changes until the consumer has caught up (…kind of).
LAKE TODAY
External clients track offsets. No retention contract tied to consumer progress.

03 One protocol, pluggable backends.
Debezium · ClickPipes · Others
Each data source exposes its own primitives. The consumer builds the abstraction: one consistent interface, many sources behind it.
LAKE TODAY
No one is building the consumer layer yet. The user builds the abstraction.

11 / 14
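Since the lake leaves offset tracking to external clients, a consumer has to persist its own resume point per table. The Python sketch below shows one way that could look. CheckpointStore, next_range, and the JSON file layout are hypothetical names invented for this illustration, and nothing here provides the retention guarantee a database log gives: if the change files for a version range have already been cleaned up, the computed range may no longer be readable.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict, Optional, Tuple


class CheckpointStore:
    """Hypothetical file-backed map: table name -> last applied commit version."""

    def __init__(self, path: Path) -> None:
        self.path = Path(path)

    def _load(self) -> Dict[str, int]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def last_version(self, table: str) -> Optional[int]:
        return self._load().get(table)

    def commit(self, table: str, version: int) -> None:
        """Record progress; refuses to move a checkpoint backwards."""
        offsets = self._load()
        last = offsets.get(table)
        if last is not None and version <= last:
            raise ValueError(f"{table}: {version} is not after checkpoint {last}")
        offsets[table] = version
        # Write to a temp file in the same directory, then rename atomically,
        # so a crash never leaves a half-written checkpoint file behind.
        with tempfile.NamedTemporaryFile(
            "w", dir=self.path.parent, delete=False
        ) as tmp:
            json.dump(offsets, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp.name, self.path)


def next_range(
    store: CheckpointStore,
    table: str,
    latest_version: int,
    initial_version: int,
) -> Optional[Tuple[int, int]]:
    """Return the inclusive (start, end) version range to read, or None if caught up.

    `initial_version` is where a brand-new consumer starts, for example the
    version at which the change feed was enabled.
    """
    last = store.last_version(table)
    start = initial_version if last is None else last + 1
    if start > latest_version:
        return None
    return (start, latest_version)
```

A consumer loop would call next_range, read and apply that version range, and only then call commit with the end version, so a crash between the two steps replays changes instead of skipping them. Checkpoints stay per-table, which mirrors the gap on the slide: nothing in this sketch orders changes across tables.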