Decoding Data Lakehouse: A Technical Breakdown

A presentation by Dipankar Mazumdar at Big Data Technology Warsaw, March 2024, Warsaw, Poland

Slide 1

Decoding Data Lakehouse: A Technical Breakdown
Big Data Warsaw | Dipankar Mazumdar

Slide 2

Speaker Bio
Dipankar Mazumdar - Staff Developer Advocate
Open Source Contributor: Apache Hudi, Iceberg, XTable, Arrow
Prev Work: BI, ML, Data Architecture
in/dipankarmazumdar/ | @dipankartnt

Slide 3

Agenda
● Evolution of Data Architecture
● Research - Hypothesis
● Data Warehouse
● Requirements of Data Warehousing
● Data Lakehouse & Open Table Formats

Slide 4

Evolution of Data Architecture
● OLTP: Store, Retrieve Data
● Data Warehouse (OLAP): Centralized & Reliable Data Platform
● Data Lakes: Democratize Data
● Lakehouse

Slide 5

Hypothesis (Research Paper)
1. Lakehouse - more than a marketing term
2. Data Warehouse means different things
3. Data Lakehouse = Data Warehouse + Data Lake + (Additional Values)
4. Advantages with an Open Lakehouse

Slide 6

Data Warehouse

Slide 7

Data Warehousing (DWH)
● Data Warehouse is an ‘overloaded’ term
● Usually refers to 2 different things:
  ○ Technology
  ○ Tech-independent Practices
DWH = Technology + Practices

Slide 8

Defining the Requirements
● How do we compare?
● Distilling down into 3 aspects:
  ○ Technical Components
  ○ Technical Capabilities
  ○ Technology-independent Practices

Slide 9

Technical Components (DWH)
● Storage
● File Format
● Table Format
● Storage Engine
● Compute Engine
● Catalog

Slide 10

Technical Capabilities (DWH)
● Governance & Security
● High Concurrency
● Low Query Latency
● Ad hoc Queries
● Workload Management (WLM)
● Schema & Physical Layout Evolution
● ACID-compliant Transactions

Slide 11

Tech-Independent Practices (DWH)
● Data Modeling
● ETL/ELT
● Data Quality
  ○ Master Data Management (MDM)
  ○ Referential Integrity
  ○ Slowly Changing Dimensions (SCD)

Slide 12

Challenges (DWH)
● Structured Workloads
● Vendor Lock-in
● High Costs

Slide 13

Data Lakehouse

Slide 14

Data Lakehouse (DLH)

Slide 15

Data Lakehouse Characteristics
● Transactional Support (ACID)
● Open Data Architecture
● Schema Management
● Scalability
● Less data movement

Slide 16

Technical Components (DLH)
Lakehouse vs. Warehouse

Slide 17

Storage
● Data lands here after ingestion from operational systems
● Data files (such as Parquet) are stored
● Supports storing data of any type (structured, unstructured)
● Cloud object stores: AWS S3, GCS, Azure; On-Prem: HDFS

Slide 18

File Format
● Holds raw data & is physically stored on the Data Lake
● Usually columnar, but can be row-based
● Open file formats allow access from different compute engines
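
A minimal sketch of this point: an open columnar file such as Parquet can be written by one tool and read back by any engine that understands the format. The library choice (pyarrow), file name, and columns below are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small in-memory table (columns are illustrative)
table = pa.table({
    "order_id": [1, 2, 3],
    "amount": [10.5, 20.0, 7.25],
    "country": ["PL", "DE", "PL"],
})

# Persist as Parquet; any engine that understands the format can read it back
pq.write_table(table, "orders.parquet")

# Read it back, projecting only the columns a query actually needs
subset = pq.read_table("orders.parquet", columns=["order_id", "amount"])
print(subset)
```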

Slide 19

Table Format
● Organizes the data files (Parquet) as a single ‘table’ - an abstraction
● File layout, schema, metadata

Slide 20

Table Format: Under the Hood
● Tables with SQL semantics and schema evolution
● ACID transactions
● Updates and deletes (merge/upsert)
● Data layout optimizations for performance tuning
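
A hedged sketch of these capabilities using Spark SQL with Apache Iceberg. It assumes a Spark session with the Iceberg runtime and SQL extensions on the classpath and a catalog named `demo` (all assumptions); the same ideas apply to Hudi and Delta Lake.

```python
from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime + SQL extensions are configured and a
# catalog named "demo" exists (assumptions for this sketch).
spark = SparkSession.builder.appName("table-format-demo").getOrCreate()

# A table format turns a collection of Parquet files into a SQL table
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.orders (
        order_id BIGINT,
        amount   DOUBLE,
        country  STRING
    ) USING iceberg
""")

# Row-level changes become atomic commits on the metadata layer
spark.sql("INSERT INTO demo.db.orders VALUES (1, 10.5, 'PL'), (2, 20.0, 'DE')")
spark.sql("UPDATE demo.db.orders SET amount = 11.0 WHERE order_id = 1")
spark.sql("DELETE FROM demo.db.orders WHERE country = 'DE'")

# Schema evolution without rewriting the existing data files
spark.sql("ALTER TABLE demo.db.orders ADD COLUMN discount DOUBLE")
```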

Slide 21

Table Format: Under the Hood
● The fundamentals of table formats (Hudi, Delta, Iceberg) are not that different
● Each adds a special metadata layer on top of Parquet

Slide 22

Storage Engine
● Keeps the data layout optimized for performance
● Table management tasks - Compaction, Clustering, Cleaning, Indexing
● Enabled by table formats together with a compute engine
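
A sketch of such table-management tasks, assuming the same hypothetical `demo` Iceberg catalog; Iceberg exposes maintenance as Spark stored procedures, and Hudi and Delta Lake offer equivalent services (compaction/clustering, OPTIMIZE/VACUUM).

```python
from pyspark.sql import SparkSession

# Assumes the same hypothetical "demo" Iceberg catalog and SQL extensions.
spark = SparkSession.builder.appName("table-maintenance").getOrCreate()

# Compaction: rewrite many small data files into fewer, larger ones
spark.sql("CALL demo.system.rewrite_data_files(table => 'db.orders')")

# Cleaning: expire old snapshots and drop files no longer referenced
spark.sql("CALL demo.system.expire_snapshots(table => 'db.orders', retain_last => 5)")
```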

Slide 23

Compute Engine
● Responsible for processing data
● Interacts with Open Table Formats’ APIs
● Caters to different types of workloads - ad hoc SQL, distributed ETL, streaming
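
A small illustration of ad hoc SQL from one compute engine (PySpark here); the catalog and table names are assumptions carried over from the earlier sketches, and engines such as Trino, Presto, or Flink could run the same query over the same files through the table format.

```python
from pyspark.sql import SparkSession

# Catalog/table names carried over from the earlier sketches (assumptions).
spark = SparkSession.builder.appName("adhoc-query").getOrCreate()

revenue_by_country = spark.sql("""
    SELECT country, SUM(amount) AS revenue
    FROM demo.db.orders
    GROUP BY country
    ORDER BY revenue DESC
""")
revenue_by_country.show()
```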

Slide 24

Catalog
● Logical separation of the metastore
● Efficient search & data discovery via metadata
● Governance, Security & Data Federation
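
A hedged sketch of registering a catalog with an engine. The property keys follow Apache Iceberg's Spark catalog configuration; the catalog name (`demo`) and warehouse location are assumptions, and the Iceberg Spark runtime is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Property keys follow Apache Iceberg's Spark catalog configuration; the
# catalog name ("demo") and warehouse location are assumptions.
spark = (
    SparkSession.builder.appName("catalog-demo")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "s3://my-bucket/warehouse")
    .getOrCreate()
)

# Tables registered under the catalog become discoverable by name
spark.sql("SHOW TABLES IN demo.db").show()
```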

Slide 25

Technical Capabilities (DLH)
● Governance & Security: Apache Ranger, Lakehouse platforms
● High Concurrency: Concurrency control, scalable engines
● Low Query Latency: Clustering, partitioning, indexing
● Ad hoc Queries: Compute engines (Presto, Trino), BI tools
● Workload Management (WLM): Isolated workloads for different users
● Schema & Physical Layout Evolution: Evolve schema with table formats
● ACID Transactions: Table formats bring consistency
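
As one concrete illustration of the low-query-latency item above, a hedged sketch of controlling the physical layout through partitioning so engines can prune files; the catalog and table names are assumptions, and Hudi and Delta Lake offer equivalent layout options.

```python
from pyspark.sql import SparkSession

# Same assumed "demo" Iceberg catalog as in the earlier sketches.
spark = SparkSession.builder.appName("layout-demo").getOrCreate()

# Partitioning the physical layout lets engines prune irrelevant files
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id BIGINT,
        event_ts TIMESTAMP,
        country  STRING
    ) USING iceberg
    PARTITIONED BY (country)
""")

# A filter on the partition column can skip whole partitions at query time
spark.sql("SELECT COUNT(*) FROM demo.db.events WHERE country = 'PL'").show()
```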

Slide 26

Tech-Independent Practices (DLH)
● Data Modeling: Various modeling techniques; different layers (Bronze, Silver, Gold)
● ETL/ELT: Schema-on-Read
● Data Quality: MDM, SCD, pre-commit checks, WAP (Write-Audit-Publish)
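
A small sketch of the layered (Bronze to Silver) modeling practice with a basic data-quality gate; the table names and quality rules are illustrative assumptions.

```python
from pyspark.sql import SparkSession, functions as F

# Table names and the quality rules below are illustrative assumptions.
spark = SparkSession.builder.appName("medallion-demo").getOrCreate()

bronze = spark.read.table("demo.db.orders_bronze")    # raw ingested data

silver = (
    bronze
    .dropDuplicates(["order_id"])                      # basic data-quality step
    .filter(F.col("amount") > 0)                       # pre-commit style check
    .withColumn("ingest_date", F.current_date())
)

# Publish only validated data to the curated (silver) layer
silver.writeTo("demo.db.orders_silver").createOrReplace()
```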

Slide 27

Additional Values

Slide 28

Open Data Architecture
● Data stored as an open & independent tier
● Open to multiple engines
● Eliminates vendor lock-in
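
A brief illustration of "open to multiple engines": the same Parquet data files can be scanned by a completely different engine, here DuckDB. The path is an assumption, and full table semantics would normally go through the table format's metadata rather than raw files.

```python
import duckdb

# The path is an assumption; DuckDB scans the same Parquet files another
# engine may have written, without copying the data anywhere.
con = duckdb.connect()
rows = con.execute("""
    SELECT country, SUM(amount) AS revenue
    FROM read_parquet('warehouse/db/orders/data/*.parquet')
    GROUP BY country
""").fetchall()
print(rows)
```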

Slide 29

Fewer Data Copies
● Less data movement, more governance (unlike a 2-tier architecture)
● Query directly on the lake using various technologies

Slide 30

Interoperate between Formats
● Choosing a table format may be tough
● Each project has rich features that may fit different use cases
● Newer use cases require formats to be interoperable
● Apache XTable (incubating) for interoperability

Slide 31

Apache XTable
● Omni-directional interoperability across lakehouse table formats
● NOT a new or separate format
● XTable provides abstractions and tools for the translation of metadata

Read your table as any of the formats:
1. Choose your “source” format
2. Choose your “destination” format(s)
3. XTable will translate the metadata layers
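
A heavily hedged sketch of driving an XTable metadata sync from Python. XTable itself is a Java tool; the jar name, config keys, and paths below are assumptions for illustration rather than the exact documented invocation, so check the Apache XTable docs before use.

```python
import subprocess

# Assumed config keys (sourceFormat, targetFormats, datasets) and an assumed
# bundled-utilities jar name; verify both against the Apache XTable docs.
config = """sourceFormat: HUDI
targetFormats:
  - ICEBERG
  - DELTA
datasets:
  - tableBasePath: s3://my-bucket/warehouse/db/orders
    tableName: orders
"""

with open("xtable_config.yaml", "w") as f:
    f.write(config)

# Translate the source table's metadata into the target formats' metadata;
# the underlying data files are not copied or rewritten.
subprocess.run(
    ["java", "-jar", "xtable-utilities-bundled.jar",
     "--datasetConfig", "xtable_config.yaml"],
    check=True,
)
```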

Slide 32

Simple Lakehouse Implementation

Slide 33

Revisiting the Hypothesis
● Going beyond the jargon
● Data Warehousing = Tech + Practices
● Data Lakehouse = DWH + DL + Additional Values
● Advantages (Open Architecture, Interoperability)

Slide 34

Q&A