How Do You Query a Stream?

A presentation at jPrime 2024 in May 2024 in Sofia, Bulgaria by Viktor Gamov

Slide 1

How To Query A Stream? Viktor Gamov, StarTree @gamussa Geecon, Kraków, 2024 @gamussa | @startreedata | @apachepinot

Slide 4

Viktor Gamov, Head of Developer Advocacy | StarTree. THE CLOUD CONNECTIVITY COMPANY. Twitter/X: @gamussa. Kong Confidential

Slide 5

Simpler times: Monolith

Slide 6

Simpler analytics: ETL and CDC

Slide 7

DWH -> Hadoop: Mobile Era

Slide 8

Data Pipelines: Streaming data pipelines and microservices

Slide 9

LOG

Slide 10

OLAP vs. OLTP in Streams: OLTP stream vs OLAP streams

Slide 11

Our Options
• Connect/Relational DB
• Kafka Streams
• Streaming SQL
• Cloud Data Warehouse
• Data Lake
• Real-Time OLAP Database

Slide 12

Kafka Connect

Slide 13

Connect/RDBMS (diagram): Data Source -> Kafka Connect -> Kafka cluster (Broker, Broker, Broker) -> Kafka Connect -> Data Sink

Slide 14

Connect/RDBMS
• Suitable for smaller data
• Transactional
• Familiar to users

Slide 15

Kafka Streams

Slide 16

Kafka Streams (transactional)
• Ingests directly from a topic
• KTable
• Forms an in-memory key/value store suitable for querying by topic key
• Scalable across members of a consumer group
• Readable through Interactive Queries

Slide 17

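Kafka Streams (transactional)
// Read the input topic as a stream of String keys and values.
final KStream<String, String> stream = builder.stream(inputTopic, Consumed.with(stringSerde, stringSerde));
// Convert the stream to a table, materialized under a queryable store name.
final KTable<String, String> convertedTable = stream.toTable(Materialized.as("stream-converted-to-table"));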
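
The stream-to-table conversion on this slide can be pictured as folding a keyed log into a latest-value-per-key map. What follows is a minimal sketch of that idea using only JDK collections; it is not the Kafka Streams API, and the class and method names (TableFromStreamSketch, apply, get) are hypothetical. It assumes that a record with a null value acts as a delete (tombstone).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Conceptual model only: plain JDK collections, not the Kafka Streams API.
class TableFromStreamSketch {
    // Latest value per key, which is roughly what a materialized table holds.
    private final Map<String, String> store = new LinkedHashMap<>();

    // Apply one keyed record: a null value is assumed to mean "delete", otherwise upsert.
    void apply(String key, String value) {
        if (value == null) {
            store.remove(key);
        } else {
            store.put(key, value);
        }
    }

    // Point lookup by key, analogous in spirit to an Interactive Query on the store.
    Optional<String> get(String key) {
        return Optional.ofNullable(store.get(key));
    }

    public static void main(String[] args) {
        TableFromStreamSketch table = new TableFromStreamSketch();
        // Three records arrive in order; the second write to user-1 replaces the first.
        table.apply("user-1", "Sofia");
        table.apply("user-2", "Krakow");
        table.apply("user-1", "Plovdiv");
        System.out.println(table.get("user-1").orElse("<missing>"));
        System.out.println(table.get("user-3").orElse("<missing>"));
    }
}
```

The sketch supports only lookups by key, which mirrors the limitation noted on the previous slide: the table is queryable by topic key and nothing else.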

Slide 18

Kafka Streams (analytical)
• Full-featured Java stream processing API
• Arbitrary streaming computation
• Can emit new streams (not this talk)
• KTables queryable by key
• Every read pattern requires its own topology
• Interactive Queries again

Slide 19

Kafka Streams (analytical)
// Split each line into lowercase words, group by word, and count into a named store.
KTable<String, Long> wordCounts = textLines
    .flatMapValues(textLine -> Arrays.asList(textLine.toLowerCase().split("\\W+")))
    .groupBy((key, word) -> word)
    .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("counts-store"));
// Emit every count update to an output topic.
wordCounts.toStream().to("WordsWithCountsTopic", Produced.with(Serdes.String(), Serdes.Long()));
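
To make the word-count topology on this slide more concrete, here is a rough sketch of how the counts table evolves as lines arrive and which updates flow downstream. It uses only the JDK, not Kafka Streams, and the names (WordCountSketch, process, count) are hypothetical. It assumes every change is emitted individually; a real Kafka Streams application may coalesce updates depending on its caching configuration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual model only: plain JDK collections, not the Kafka Streams API.
class WordCountSketch {
    // Running count per word, playing the role of the "counts-store" table.
    private final Map<String, Long> counts = new TreeMap<>();

    // Process one line of text and return the table updates it causes, in order.
    List<String> process(String textLine) {
        List<String> updates = new ArrayList<>();
        for (String word : textLine.toLowerCase().split("\\W+")) {
            if (word.isEmpty()) {
                continue; // split can yield a leading empty token
            }
            long updated = counts.merge(word, 1L, Long::sum);
            updates.add(word + "=" + updated);
        }
        return updates;
    }

    // Lookup by word, the only read pattern this particular table supports.
    Long count(String word) {
        return counts.get(word);
    }

    public static void main(String[] args) {
        WordCountSketch sketch = new WordCountSketch();
        System.out.println(sketch.process("Hello hello world"));
        System.out.println(sketch.count("hello"));
    }
}
```

A different question, such as "which words have the highest counts", is not answerable from this table by key lookup alone, which is the point behind "every read pattern requires its own topology".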

Slide 20

Streaming SQLs

Slide 21

Streaming SQL
• Materialize
• DeltaStream
• RisingWave
• ksqlDB

Slide 22

Why not Flink? I love you, little squirrel.

Slide 23

Why not Flink? But this talk is not about you.

Slide 24

Materialize
• Replacement data warehouse
• Integrates with Kafka, Postgres, dbt
• The Materialized View is the central abstraction
• Views are persistent and queryable
• Postgres wire-compatible
• Positioned as an analytics solution

Slide 25

DeltaStream
• Cloud-native streaming SQL
• Serverless, BYOC
• Kafka, Kinesis integration
• Materialized views and streaming pipelines
• Streaming database and streaming analytics

Slide 26

RisingWave
• Distributed SQL streaming database
• Cloud and OSS versions
• Implementation of Flink in Rust
• Kafka, Pulsar, Kinesis integrations
• Flink + persistent views
• Postgres wire-compatible

Slide 27

ksqlDB
• «Streaming Database»
• Provides persistent TABLE abstraction
• Pull and Push queries
• Like Kafka Streams, but in SQL
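
The difference between pull and push queries can be sketched as "read the current state once" versus "subscribe to every subsequent change". The model below is plain JDK code, not the ksqlDB client or its SQL syntax, and all names (PullPushSketch, pull, push, ingest) are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

// Conceptual model only: not the ksqlDB client API.
class PullPushSketch {
    // Running total per key, standing in for a persistent TABLE.
    private final Map<String, Long> table = new HashMap<>();
    private final List<BiConsumer<String, Long>> subscribers = new ArrayList<>();

    // Pull query: return the current value for a key once, then finish.
    Long pull(String key) {
        return table.get(key);
    }

    // Push query: register a listener that receives every change from now on.
    void push(BiConsumer<String, Long> listener) {
        subscribers.add(listener);
    }

    // Ingest one event: update the table, then notify push subscribers.
    void ingest(String key, long amount) {
        long total = table.merge(key, amount, Long::sum);
        for (BiConsumer<String, Long> subscriber : subscribers) {
            subscriber.accept(key, total);
        }
    }

    public static void main(String[] args) {
        PullPushSketch db = new PullPushSketch();
        db.ingest("orders", 5);
        db.push((key, total) -> System.out.println("changed: " + key + "=" + total));
        db.ingest("orders", 2);
        System.out.println("current: " + db.pull("orders"));
    }
}
```

In this sketch a subscriber sees only changes made after it registered, while a pull always reflects the full accumulated state.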

Slide 28

Cloud Data Warehouses

Slide 29

Cloud Data Warehouses

Slide 30

Cloud Data Warehouses
• The cloud-based heir of legacy DWH
• Ingest from batch and streaming sources
• Biased towards structured data and batch access

Slide 31

Data Lake

Slide 32

Data Lake: Anything else. We’ll figure this out

Slide 33

Data Lakes
• Started as the HDFS cluster
• Became S3
• That didn’t help…
• ELT vs. ETL
• Iceberg/Hudi/Delta Lake

Slide 34

Data Lakes
• Storage and compute are radically decoupled
• Structure is relatively less important
• Reads are slow
• Streaming is historically difficult

Slide 35

Data Lakes
• Storage and compute are radically decoupled
• Structure is relatively less important
• Reads are slow
• Streaming is historically difficult

Slide 36

Real-Time Analytics Database

Slide 37

Real-Time OLAP
• Designed for high-concurrency, low-latency queries
• Ingests from streaming and batch sources
• Intimate integration with Kafka
• Conventional tables and SQL

Slide 38

Real-Time OLAP
• Analytics shaped like real-time data
• Analytics when users are decision makers

Slide 39

Technology Selection: No solutions, only trade-offs

Slide 40

Sometimes you go with what you know

Slide 41

This is not bad!

Slide 42

Performance

Slide 43

Community/Adoption

Slide 44

Differentiated Application Code Area of Exploration Kafka

Slide 45

Need your feedback

Slide 46

Need your feedback. It’s anonymous

Slide 47

For more resources on Apache Pinot: dev.startree.ai Viktor Gamov, StarTree @gamussa