Uncorking Real-Time Analytics with Pulsar, Pinot and Flink

Uncorking Analytics with Apache Pulsar, Apache Flink, and Apache Pinot f Viktor Gamov, Con luent | @gamussa San Francisco, CA, October 2024

Before We Proceed… https://gamov.dev/uncorking-analytics @gamussa | @confluentinc | #DataStreamingSummit

A Taxonomy of Analytics @gamussa | @confluentinc | #DataStreamingSummit

REAL-TIME OBSERVABILITY/ MONITORING EXTERNAL DASHBOARDS REPORTING BATCH INTERNAL USER-FACING ANALYTICS @gamussa | @confluentinc | #DataStreamingSummit REPORTING FEATURES

REAL-TIME DATADOG PINOT CLICKHOUSE ROCKSET DRUID TRINO, PRESTO, BIGQUERY SNOWFLAKE HADOOP LEGACY DWH @gamussa EXTERNAL BATCH INTERNAL SEE ALSO (BUT WITH CACHING) | @confluentinc | #DataStreamingSummit

Who Does Real-Time Analytics? @gamussa | @confluentinc | #DataStreamingSummit

Who Viewed My Pro ile? Total users 700 Million+ QPS 100,000s Latency SLA < 100 ms p99th Freshness Seunghyun Lee Senior Software Engineer Chinmay Soman Founding Engineer f @gamussa | @confluentinc | #DataStreamingSummit Seconds

Viktor GAMOV Principal Developer Advocate | Con luent Java Champion O’Reilly and Manning Author Twitter X: @gamussa f f THE CLOUD CONNECTIVITY COMPANY Kong Con idential

@gamussa | @testcontainers | #DataStreamingSummit

What is Apache Pinot ? ™ @gamussa | @confluentinc | #DataStreamingSummit

“Apache Pinot is a real-time distributed OLAP database, designed to serve OLAP workloads on streaming data with extreme low latency and high concurrency.” @gamussa | @confluentinc | #DataStreamingSummit

The essence of real-time analytics LATENCY The amount of time it takes to execute a query CONCURRENCY The ability of a system to handle multiple queries simultaneously FRESHNESS The up-to-date nature of data in the system @gamussa | @confluentinc | #DataStreamingSummit

The essence of real-time analytics LATENCY CONCURRENCY FRESHNESS As low as 10ms As many as 100,000 queries per second Seconds from event time till queryable in Pinot @gamussa | @confluentinc | #DataStreamingSummit

OLTP OLTP OLAP • Transaction focused • Write-heavy workloads • Often involves a single record per operation • Aggregation-focused • Read-heavy workloads • Often involves many records in one operation @gamussa | @confluentinc | #DataStreamingSummit

Data Model ● Pinot uses the completely familiar tabular data model ● Table creation and schema definition expressed in JSON ● Queries expressed in SQL

Architecture: Tables and Segments @gamussa | @confluentinc | #DataStreamingSummit

Tables ● ● ● ● ● ● The basic unit of data storage in Pinot Composed of rows and columns Expected to scale to arbitrarily large row counts Defined using a schema and tableConfig JSON file Three varieties: offline, real-time, and hybrid Every column is either a metric, dimension, or date/time @gamussa | @confluentinc | #DataStreamingSummit

Segments ● Tables are split into units of storage called segments ● Similar to shards or partitions but transparent to you, the user ● For offline tables, segments are created outside of Pinot and pushed into the cluster using a REST API ● For real-time tables, segments are created automatically from events sourced by the event streaming system (e.g., Pulsar, Kafka) ● Standard utilities support batch ingest from standard file types (AVRO, JSON, CSV) ● APIs are available to create segments from Spark, Flink, and Hadoop @gamussa | @confluentinc | #DataStreamingSummit

Segments segments 1 2 3 4 5 6 7 8 TABLE @gamussa | @confluentinc | #DataStreamingSummit

Segment Structure ● Pinot is a columnar database ● All of a segment’s values for a single column are stored contiguously ● Dimension columns are typically dictionary-encoded ● Indexes are stored as a part of the segment ● Segments are immutable once written ● Segments have a configurable retention period @gamussa | @confluentinc | #DataStreamingSummit

Segment Structure …, { “ip”: “111.173.165.103”, “userid”: 10, “remote_user”: “-“, “time”: “3271”, “_time”: 3271, “request”: “GET”, “status”: “406”, “bytes”: “1289”, “referrer”: “-“, “agent”: “Mozilla/5.0” }, …, ip userid remote_user _time bytes time request referrer @gamussa | @confluentinc | #DataStreamingSummit status bytes agent

Segment Structure ip @gamussa | @confluentinc | #DataStreamingSummit

Segment Structure @gamussa | @confluentinc | #DataStreamingSummit

@gamussa | @confluentinc | #DataStreamingSummit 132.116.134.205 239.36.131.30 199.191.187.233 13.213.178.183 182.45.66.204 211.235.25.163 165.193.151.176 250.192.178.235 202.43.225.122 166.27.69.94 Segment Structure

Part 1 Batch Ingestion in Pinot @gamussa | @confluentinc | #DataStreamingSummit

@gamussa | @confluentinc | #DataStreamingSummit

Part 2 Streaming Ingestion with Ka ka f @gamussa | @confluentinc | #DataStreamingSummit

@gamussa | @confluentinc | #DataStreamingSummit

BROKER ZOOKEEPER BROKER CONTROLLER SERVER 1 SERVER 2 SERVER 3

BROKER ZOOKEEPER BROKER CONTROLLER CONSUMING SEGMENT SERVER 1 SERVER 2 SERVER 3

Part 3 Stream Join in Flink @gamussa | @confluentinc | #DataStreamingSummit

Flink 101 @gamussa | @confluentinc | #DataStreamingSummit

«Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams.» @gamussa | @confluentinc | #DataStreamingSummit

Real-time services rely on stream processing Files Real-time Stream Processing Ka ka Sinks Sources Apps Databases Key/Value Stores f @gamussa | @confluentinc | #DataStreamingSummit

What is Flink SQL @gamussa | @confluentinc | #DataStreamingSummit

A standards-compliant SQL engine for processing both batch and streaming data with the scalability, performance, and consistency of Apache Flink @gamussa | @confluentinc | #DataStreamingSummit

Is Flink SQL a database? No. Bring your own data. CREATE TABLE MovieRatings ( movieId INT, rating DOUBLE, ratingTimeMillis BIGINT, ratingTime AS TO_TIMESTAMP_LTZ(ratingTimeMillis, 3) ) WITH ( ‘connector’ = ‘pulsar’, ‘topics’ = ‘persistent://public/default/ratings’, ‘service-url’ = ‘pulsar://pulsar:6650’, ‘value.format’ = ‘json’, ‘source.subscription-name’ = ‘flink-ratingssubscription’, ‘source.subscription-type’ = ‘Shared’ );

How does Flink work with Pulsar? @gamussa | @confluentinc | #DataStreamingSummit

@gamussa | @confluentinc | #DataStreamingSummit

Source: Streaming Databases, Hubert Dulay, Ralph Matthias Debusmann @gamussa | @confluentinc | #DataStreamingSummit

Find the code of the demo 👉 https://gamov.dev/uncorking-analytics @gamussa | @confluentinc | #DataStreamingSummit

Uncorking Real-Time Analytics with Pulsar, Pinot and Flink

Slide 1

Slide 2

Slide 3

Slide 4

Slide 5

Slide 6

Slide 7

Slide 8

Slide 9

Slide 10

Slide 11

Slide 12

Slide 13

Slide 14

Slide 15

Slide 16

Slide 17

Slide 18

Slide 19

Slide 20

Slide 21

Slide 22

Slide 23

Slide 24

Slide 25

Slide 26

Slide 27

Slide 28

Slide 29

Slide 30

Slide 31

Slide 32

Slide 33

Slide 34

Slide 35

Slide 36

Slide 37

Slide 38

Slide 39

Slide 40

Slide 41

Slide 42

Slide 43

Slide 44

Slide 45