The Streaming Mindset

A presentation at Bristech Meetup in January 2021 by Marta Paes

Slide 1

The Streaming Mindset … what, why, how?
Marta Paes (@morsapaes), Developer Advocate
© 2020 Ververica

Slide 2

About Ververica
• Original Creators of Apache Flink®
• Enterprise Stream Processing With Ververica Platform
• Part of Alibaba Group

Slide 3

Working in DevRel
J. Doe ● 00:00

Slide 4

Working in DevRel
Me ● 00:01

Slide 5

Working in DevRel
Me ● 00:01

Slide 6

Working in DevRel
Me ● 00:01

Slide 7

Working in DevRel
Me ● 00:01

Slide 8

Where do you start?

Slide 9

1 Go Headfirst
● Stream Processing 101

Slide 10

Analytics… Not that Long Ago
[Diagram: OLTP Database(s), FTP Servers, …; ETL; Data Warehouse (DWH)]

Slide 11

Analytics… Not that Long Ago
The quest for data…
[Diagram: OLTP Databases, FTP Servers, …; ETL; Data Warehouse (DWH); Results. Annotations: long, nightly jobs; a failure (x); someone waking up; re-run long, nightly job; someone complaining]
But in the end…
• Most source data is continuously produced
• Not everyone can wait for yesterday’s data
• Most logic is not changing that frequently

Slide 12

Everything is a Stream

Slide 13

Everything is a Stream
Your static data records become events that are continuously produced and should be continuously processed.
• Event Sources: Applications, Sensors, Databases, Devices, …
• Log / Stream Storage: Kafka, Kinesis, Pulsar, …
• Stream Processing
• Sinks: K/V Store, Database, Log, Application, …
• Long-term Storage: S3, HDFS, …
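
As a rough illustration of "continuously produced, continuously processed" (not taken from the talk), the sketch below treats records as events that arrive one at a time and updates a result per event, instead of waiting for a complete data set. The event shape, `event_source`, and `running_totals` are hypothetical names chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

Event = Tuple[str, float]  # (key, amount); a made-up event shape


def running_totals(events: Iterable[Event]) -> Iterator[Tuple[str, float]]:
    """Emit an updated total for a key every time an event for it arrives."""
    totals: Dict[str, float] = defaultdict(float)
    for key, amount in events:
        totals[key] += amount
        yield key, totals[key]


def event_source() -> Iterator[Event]:
    """Stand-in for an unbounded source such as a sensor or a message log."""
    yield from [("sensor-a", 1.0), ("sensor-b", 2.5), ("sensor-a", 4.0)]


# Each input event produces one updated result, without a "load everything first" step.
updates = list(running_totals(event_source()))
print(updates)
```

A real source would never finish; the finite list here only keeps the example runnable.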

Slide 14

Stream Processing 101
Batch Processing
• query/logic changes fast
• data changes slowly
• E.g.: Ad-hoc queries, data exploration, ML model training
Continuous Streaming
• data changes fast
• query/logic changes slowly
• E.g.: Most business logic nowadays
A good starter: Streaming 101: The World Beyond Batch

Slide 15

Stream Processing 101
Batch Processing
• query/logic changes fast
• data changes slowly
• E.g.: Ad-hoc queries, data exploration, ML model training
Continuous Streaming
• data changes fast
• query/logic changes slowly
• E.g.: Most business logic nowadays
Use cases along a spectrum from more batch-like to more real-time:
• Offline ML Model Training
• Data Warehousing
• OLAP / BI / Reporting
• Real-time Behavior Modeling (e.g. recommenders, pricing)
• Unified Offline/Online Analytics
• Online ML Model Training/Evaluation
• Continuous Monitoring (e.g. position, risk)
• Continuous ETL
• Real-time Alerting (e.g. fraud, security)
• Distributed OLTP-style Apps

Slide 16

Stream Processing Use Cases
Examples
• Large-scale Data Pipelines
• ML-Based Fraud Detection
• Service Monitoring & Anomaly Detection

Slide 17

Stream Processing Use Cases
Examples
• Large-scale Data Pipelines
• ML-Based Fraud Detection
• Service Monitoring & Anomaly Detection
• Unified Online/Offline Model Training
• E2E Streaming Analytics Pipelines
• ML Feature Generation

Slide 18

2 Bridge Concepts
● Bounded vs. Unbounded data
● Event time vs. Processing time
● Fault tolerance

Slide 19

Bounded vs. Unbounded Data
Batch Processing
• Data “at rest”
• Hard boundaries (e.g. process 1 day of data)
Continuous Streaming
• Data “on the fly”
• Ever-growing, infinite data set

Slide 20

Bounded vs. Unbounded Data
Batch Processing
• Data “at rest”
• Hard boundaries (e.g. process 1 day of data)
Continuous Streaming
• Data “on the fly”
• Ever-growing, infinite data set
Window: Windows split the stream into buckets of finite size, over which you can apply computations
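
To make the idea of windows concrete, here is a minimal sketch of fixed-size, non-overlapping (tumbling) windows in plain Python. It is not Flink code; the event shape, the 10-second window size, and the `tumbling_window_counts` helper are assumptions made for illustration. A real stream processor would emit each window's result as the window closes, while this sketch runs over a small finite list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[int, str]  # (timestamp in seconds, user); hypothetical shape


def tumbling_window_counts(events: Iterable[Event], size_s: int) -> Dict[int, int]:
    """Count events per fixed-size, non-overlapping window.

    Each window is identified by its start timestamp.
    """
    counts: Dict[int, int] = defaultdict(int)
    for timestamp, _user in events:
        # Align the timestamp down to the start of the window it falls into.
        window_start = timestamp - (timestamp % size_s)
        counts[window_start] += 1
    return dict(counts)


events: List[Event] = [(1, "ann"), (4, "bob"), (12, "ann"), (19, "cy"), (21, "bob")]
window_counts = tumbling_window_counts(events, size_s=10)
print(window_counts)
```

The stream itself stays unbounded; only each bucket is finite, which is what makes aggregations such as counts or sums well defined.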

Slide 21

Event Time vs. Processing Time
Event time
● Deterministic results
● Handle out-of-order or late events
● Trade-off result completeness/correctness and latency
Processing time
● Non-deterministic results
● Best performance and lowest latency
● Speed > completeness/correctness
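
The difference between the two notions of time can be sketched with a few events that carry both a timestamp of when they happened and a timestamp of when they were seen. This is an illustrative plain-Python example, not Flink's API; the `Event` fields, the sample values, and the `window_counts` helper are assumptions.

```python
from collections import Counter
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    event_time: int    # when the event happened (seconds)
    arrival_time: int  # when the processor saw it (seconds)


def window_counts(timestamps: List[int], size_s: int) -> Dict[int, int]:
    """Count timestamps per tumbling window, keyed by window start."""
    return dict(Counter(t - (t % size_s) for t in timestamps))


# Listed in arrival order. The third event is late: it happened at t=8
# but only arrived at t=13, after an event that happened at t=11.
events = [Event(3, 4), Event(11, 12), Event(8, 13), Event(15, 16)]

by_event_time = window_counts([e.event_time for e in events], size_s=10)
by_processing_time = window_counts([e.arrival_time for e in events], size_s=10)

print(by_event_time)
print(by_processing_time)
```

Grouping by event time gives the same answer however the events are delayed, but a live system has to wait for the late event before it can close the first window. Grouping by arrival time needs no waiting, yet the late event lands in the wrong window and a re-run with different delays could produce different counts.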

Slide 22

Fault Tolerance
Batch Processing
• Pipelines run on a fixed schedule
Continuous Streaming
• Long-running pipelines

Slide 23

Fault Tolerance
Batch Processing
• Pipelines run on a fixed schedule
Continuous Streaming
• Long-running pipelines
[Diagram: State held inside the running pipeline]

Slide 24

Fault Tolerance
Batch Processing
• Pipelines run on a fixed schedule
Continuous Streaming
• Long-running pipelines
[Diagram: Checkpoint. State is written to Persistent Storage as checkpointed state.]

Slide 25

Fault Tolerance
Batch Processing
• Pipelines run on a fixed schedule
Continuous Streaming
• Long-running pipelines
[Diagram: a failure (❌) affects State]

Slide 26

Fault Tolerance
Batch Processing
• Pipelines run on a fixed schedule
Continuous Streaming
• Long-running pipelines
[Diagram: Restore. Reset position in input stream and recover all embedded state from the checkpointed state in Persistent Storage.]
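
The checkpoint and restore steps on the Fault Tolerance slides can be sketched with a toy pipeline that snapshots its input position together with its state. This is a simplified, single-process illustration and not how Flink implements checkpointing; `CountingPipeline` and its methods are hypothetical, and the in-memory snapshot stands in for persistent storage.

```python
import copy
from typing import Dict, List, Tuple

Snapshot = Tuple[int, Dict[str, int]]


class CountingPipeline:
    """Toy stateful pipeline: counts words and tracks its input position."""

    def __init__(self) -> None:
        self.offset = 0                    # position in the input stream
        self.counts: Dict[str, int] = {}   # embedded state

    def process(self, stream: List[str], n: int) -> None:
        """Process up to n records starting at the current offset."""
        for word in stream[self.offset:self.offset + n]:
            self.counts[word] = self.counts.get(word, 0) + 1
            self.offset += 1

    def checkpoint(self) -> Snapshot:
        """Snapshot offset and state together (stand-in for persistent storage)."""
        return self.offset, copy.deepcopy(self.counts)

    def restore(self, snapshot: Snapshot) -> None:
        """Reset the input position and recover the state from a snapshot."""
        self.offset, counts = snapshot
        self.counts = copy.deepcopy(counts)


stream = ["a", "b", "a", "c", "a"]

pipeline = CountingPipeline()
pipeline.process(stream, 2)           # handles "a", "b"
snapshot = pipeline.checkpoint()      # offset 2 plus the counts so far
pipeline.process(stream, 2)           # handles "a", "c", then we pretend it crashes

recovered = CountingPipeline()        # fresh instance after the failure
recovered.restore(snapshot)
recovered.process(stream, len(stream))  # replays the input from offset 2
print(recovered.offset, recovered.counts)
```

Because the offset and the counts are captured in the same snapshot, the replayed records are counted once in the recovered state, matching a run that never failed.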

Slide 27

3 Pick a Flavour & Build

Slide 28

The Flink API Stack
Layered, with different tradeoffs for expressiveness and ease of use. You can mix and match all the APIs!
Streaming Analytics & ML (towards ease of use)
• Flink SQL
• Table API (dynamic tables)
• PyFlink
Stateful Stream Processing (towards expressiveness)
• DataStream API (streams, windows)
• Building Blocks (events, state, (event) time)

Slide 29

How to Get Hands-On?
Start with whatever language and/or abstractions are more familiar to you!
Java/Scala, SQL, Python
● Self-paced Training Course
● Flink SQL Cookbook
● PyFlink Walkthrough
● DataStream API Walkthrough
● Table API Walkthrough
● Zeppelin Notebooks

Slide 30

Starting from the beginning

Slide 31

From being dumbfounded…
J. Doe ● 00:00
Me ● 00:01

Slide 32

…to actually having a plan!
J. Doe ● 00:00
Me ● 00:01
✅ Invest in learning the Stream Processing 101
✅ Take the time to understand how it differs from Batch Processing
✅ Start with something familiar and increase complexity gradually
✅ Ask questions!

  • Where to ask questions: How do I get help from the Apache Flink community?

Slide 33

Thank you, Bristech!
Follow me on Twitter: @morsapaes
Learn more about Flink: https://flink.apache.org/