Stream Processing As You’ve Never Seen Before (Seriously): Apache Flink for Java Developers

A presentation at Geecon in May 2025 in Kraków, Poland by Viktor Gamov

Slide 1

Slide 1

X/Bluesky: @gamussa

Slide 2

Slide 2

Viktor GAMOV Principal Developer Advocate | Con luent co-author of Manning’s Ka ka in Action f f f THE CLOUD CONNECTIVITY COMPANY X/Bluesky: @gamussa Kong Con idential

Slide 3

Slide 3

Slides and Video https://speaking.gamov.io/ X/Bluesky: @gamussa

Slide 4

Slide 4

What is Apache Flink?

Slide 5

Slide 5

What is Flink? Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. X/Bluesky: @gamussa

Slide 6

Slide 6

Streaming unbounded stream bounded stream past now future ● A stream is a sequence of events ● Business data is always a stream: bounded or unbounded ● For Flink, batch processing is just a special case in the runtime X/Bluesky: @gamussa

Slide 7

Slide 7

Key Features of Flink

Slide 8

Slide 8

Stream processing with Flink Files Real-time Stream Processing Kafka Sinks Sources Apps Databases Key/Value Stores X/Bluesky: @gamussa

Slide 9

Slide 9

Stream processing with Flink Real-time Stream Processing Files Kafka Sinks Sources Apps Databases Key/Value Stores X/Bluesky: @gamussa

Slide 10

Slide 10

Introduction to DataStream API

Slide 11

Slide 11

// Set up the execution environment StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); // Create a DataStream from some elements DataStream<String> inputStream = env.fromData(“apple”, “banana”, “cherry”, “date”, “elderberry”); // Perform a transformation DataStream<Tuple2<String, Integer>> resultStream = inputStream .map(value -> new Tuple2<>(value, value.length())) .returns(Types.TUPLE(Types.STRING, Types.INT)); // Print the results to the console resultStream.print(); // Execute the Flink job env.execute(“Simple Flink Job”); @gamussa | @confluentinc | @apacheflink

Slide 12

Slide 12

// Set up the execution environment StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); // Create a DataStream from some elements DataStream<String> inputStream = env.fromData(“apple”, “banana”, “cherry”, “date”, “elderberry”); // Perform a transformation DataStream<Tuple2<String, Integer>> resultStream = inputStream .map(value -> new Tuple2<>(value, value.length())) .returns(Types.TUPLE(Types.STRING, Types.INT)); // Print the results to the console resultStream.print(); // Execute the Flink job env.execute(“Simple Flink Job”); @gamussa | @confluentinc | @apacheflink

Slide 13

Slide 13

// Set up the execution environment StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); // Create a DataStream from some elements DataStream<String> inputStream = env.fromData(“apple”, “banana”, “cherry”, “date”, “elderberry”); // Perform a transformation DataStream<Tuple2<String, Integer>> resultStream = inputStream .map(value -> new Tuple2<>(value, value.length())) .returns(Types.TUPLE(Types.STRING, Types.INT)); // Print the results to the console resultStream.print(); // Execute the Flink job env.execute(“Simple Flink Job”); @gamussa | @confluentinc | @apacheflink

Slide 14

Slide 14

// Set up the execution environment StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(); // Create a DataStream from some elements DataStream<String> inputStream = env.fromData(“apple”, “banana”, “cherry”, “date”, “elderberry”); // Perform a transformation DataStream<Tuple2<String, Integer>> resultStream = inputStream .map(value -> new Tuple2<>(value, value.length())) .returns(Types.TUPLE(Types.STRING, Types.INT)); // Print the results to the console resultStream.print(); // Execute the Flink job env.execute(“Simple Flink Job”); @gamussa | @confluentinc | @apacheflink

Slide 15

Slide 15

The JobGraph (or topology) X/Bluesky: @gamussa

Slide 16

Slide 16

The JobGraph (or topology) CONNECTION OPERATOR X/Bluesky: @gamussa

Slide 17

Slide 17

Stream processing • Parallel • Forward • Repartition SOURCE grouped by shape • Rebalance X/Bluesky: @gamussa

Slide 18

Slide 18

Stream processing • Parallel • Forward • Repartition SOURCE grouped by shape • Rebalance X/Bluesky: @gamussa

Slide 19

Slide 19

Stream processing • Parallel • Forward • Repartition FILTER group by color • Rebalance X/Bluesky: @gamussa

Slide 20

Slide 20

Stream processing 4 3 2 1 • Parallel • Forward COUNT • Repartition 1 • Rebalance 3 X/Bluesky: @gamussa 2

Slide 21

Slide 21

Stream processing with SQL events INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color orange GROUP BY color; COUNT results GROUP BY color WHERE color <> orange

< X/Bluesky: @gamussa

Slide 22

Slide 22

Stream processing with SQL events INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color orange GROUP BY color; COUNT results GROUP BY color WHERE color <> orange

< X/Bluesky: @gamussa

Slide 23

Slide 23

Stream processing with SQL events INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color orange GROUP BY color; COUNT results GROUP BY color WHERE color <> orange

< X/Bluesky: @gamussa

Slide 24

Slide 24

Stream processing with SQL events INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color orange GROUP BY color; COUNT results GROUP BY color WHERE color <> orange

< X/Bluesky: @gamussa

Slide 25

Slide 25

Stream processing with SQL events COUNT results INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color orange GROUP BY color; GROUP BY color events WHERE color <> orange

< X/Bluesky: @gamussa

Slide 26

Slide 26

Flink’s APIs SQL API Table API Optimizer / Planner DataStream API Low-Level Stream Operator API Apache Flink Runtime X/Bluesky: @gamussa

Slide 27

Slide 27

Flink’s APIs: mix & match Easy to use / declarative Flink SQL ● ● ● ● Code Generation Efficient data types Cost-based optimizer Highly efficient operator implementations (joins, aggregations, deduplications, …) → Easy to write efficient code with low effort ● Ready-made operators for Windowing: sliding, tumbling, session. Late event handling. CEP / Async IO operators Sources / Sinks / Flink Connectors Level of abstraction Table API DataStream API Process Functions ● ● ● ● Custom data types Raw, low level access to state, time → A lot of potential to make mistakes :) Low level / expressive X/Bluesky: @gamussa

Slide 28

Slide 28

State Management

Slide 29

Slide 29

X/Bluesky: @gamussa

Slide 30

Slide 30

Stateful stream processing with Flink SQL events INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color orange GROUP BY color; COUNT results GROUP BY color WHERE color <> orange

< X/Bluesky: @gamussa

Slide 31

Slide 31

Stateful stream processing with Flink SQL events INSERT INTO results SELECT color, COUNT(*) FROM events WHERE color orange GROUP BY color; ● Filtering is stateless COUNT results GROUP BY color WHERE color <> orange

< X/Bluesky: @gamussa

Slide 32

Slide 32

Stateful stream processing with FlinkSQL events COUNT results ● Counting requires state GROUP BY color WHERE color <> orange X/Bluesky: @gamussa

Slide 33

Slide 33

State Stored on the heap or on disk using RocksDB (a KV store) • Local • Fast • Fault tolerant Distributed, reliable storage such as S3 or HDFS X/Bluesky: @gamussa

Slide 34

Slide 34

Summary Streaming State Event time and watermarks State snapshots for recovery A sequence of events. Delightfully simple ● local ● key/value ● single-threaded Watermarks indicate how much progress the time in the stream has made. Transparent to application developers, enables correctness and operations. Unfamiliar to many developers, but ultimately straightforward. X/Bluesky: @gamussa

Slide 35

Slide 35

Slide 36

Slide 36

As Always Have a Nice Day… X/Bluesky: @gamussa