Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline!

A presentation at DataOps Poland in August 2020 in by Robin Moffatt

Slide 1

Slide 1

Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! Robin Moffatt @rmoff

Slide 2

Slide 2

$ whoami • Robin Moffatt (@rmoff) • Senior Developer Advocate at Confluent (Apache Kafka, not Wikis 😉) • Working in data & analytics since 2001 • Oracle ACE Director (Alumnus) http://rmoff.dev/talks · http://rmoff.dev/blog · http://rmoff.dev/youtube @rmoff | #ConfluentVUG | @confluentinc

Slide 3

Slide 3

Kafka is an Event Streaming Platform App App App App request-response changelogs App messaging OR stream processing App KAFKA App App DWH Hadoop @rmoff | streaming data pipelines #ConfluentVUG | @confluentinc

Slide 4

Slide 4

What is an Event Streaming Platform? Producer Connectors Consumer The Log Connectors Streaming Engine @rmoff | #ConfluentVUG | @confluentinc

Slide 5

Slide 5

Immutable Event Log Old New Messages are added at the end of the log @rmoff | #ConfluentVUG | @confluentinc

Slide 6

Slide 6

Topics Clicks Orders Customers Topics are similar in concept to tables in a database @rmoff | #ConfluentVUG | @confluentinc

Slide 7

Slide 7

Partitions Clicks p0 P1 P2 Messages are guaranteed to be strictly ordered within a partition @rmoff | #ConfluentVUG | @confluentinc

Slide 8

Slide 8

Messages are just K/V bytes plus headers + timestamp Clicks Header Timestamp Key Value @rmoff | #ConfluentVUG | @confluentinc

Slide 9

Slide 9

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://rmoff.dev/qcon-schemas @rmoff | #ConfluentVUG | @confluentinc

Slide 10

Slide 10

Consumers have a position all of their own Old New Sally Scan is here @rmoff | #ConfluentVUG | @confluentinc

Slide 11

Slide 11

Consumers have a position all of their own Old New Fred Sally Scan is here Scan is here @rmoff | #ConfluentVUG | @confluentinc

Slide 12

Slide 12

Consumers have a position all of their own George Scan is here Old New Fred Sally Scan is here Scan is here @rmoff | #ConfluentVUG | @confluentinc

Slide 13

Slide 13

Free Books! https://rmoff.dev/83x @rmoff | #ConfluentVUG | @confluentinc

Slide 14

Slide 14

@rmoff | #ConfluentVUG | @confluentinc Photo by Victor Garcia on Unsplash Streaming Data Pipelines

Slide 15

Slide 15

Database Offload Amazon S3 RDBMS Kafka Kafka Connect Connect HDFS @rmoff | #ConfluentVUG | @confluentinc

Slide 16

Slide 16

Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing @rmoff | #ConfluentVUG | @confluentinc

Slide 17

Slide 17

Building a streaming data pipeline @rmoff | #ConfluentVUG | @confluentinc

Slide 18

Slide 18

Stream Integration + Processing @rmoff | #ConfluentVUG | @confluentinc

Slide 19

Slide 19

Integration @rmoff | #ConfluentVUG | @confluentinc

Slide 20

Slide 20

Streaming Integration with Kafka Connect syslog Sources Kafka Connect @rmoff | Kafka Brokers #ConfluentVUG | @confluentinc

Slide 21

Slide 21

Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect @rmoff | Kafka Brokers #ConfluentVUG | @confluentinc

Slide 22

Slide 22

Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect @rmoff | Kafka Brokers #ConfluentVUG | @confluentinc

Slide 23

Slide 23

Look Ma, No Code! { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } @rmoff | #ConfluentVUG | @confluentinc

Slide 24

Slide 24

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://rmoff.dev/qcon-schemas @rmoff | #ConfluentVUG | @confluentinc

Slide 25

Slide 25

Extensible Connector Transform(s) @rmoff | Converter #ConfluentVUG | @confluentinc

Slide 26

Slide 26

Confluent Hub hub.confluent.io @rmoff | #ConfluentVUG | @confluentinc

Slide 27

Slide 27

Kafka Connect Standalone Worker S3 Task #1 JDBC Task #1 JDBC Task #2 Offsets Worker @rmoff | #ConfluentVUG | @confluentinc

Slide 28

Slide 28

Kafka Connect Distributed Workers S3 Task #1 JDBC Task #1 Kafka Connect cluster JDBC Task #2 Worker Worker Offsets Config Status @rmoff Yeah! | #ConfluentVUG Fault-tolerant? | @confluentinc

Slide 29

Slide 29

Multiple Distributed Clusters S3 Task #1 JDBC Task #1 Kafka Connect cluster #1 JDBC Task #2 Kafka Connect cluster #2 Offsets Offsets Config Config Status Status @rmoff | #ConfluentVUG | @confluentinc

Slide 30

Slide 30

Stream Integration + Processing @rmoff | #ConfluentVUG | @confluentinc

Slide 31

Slide 31

Stream Processing @rmoff | #ConfluentVUG | @confluentinc

Slide 32

Slide 32

} “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash { @rmoff | #ConfluentVUG | @confluentinc

Slide 33

Slide 33

SELECT * FROM WIDGETS WHERE WEIGHT_G > 120 { SELECT COUNT(*) FROM WIDGETS GROUP BY PRODUCTION_LINE } “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash SELECT AVG(TEMP_CELCIUS) AS TEMP FROM WIDGETS GROUP BY SENSOR_ID HAVING TEMP>20 CREATE SINK CONNECTOR dw WITH ( Object store, ‘connector.class’ = ‘S3Connector’, data warehouse, ‘topics’ = ‘widgets’ RDBMS …); @rmoff | #ConfluentVUG | @confluentinc

Slide 34

Slide 34

Stream Processing with ksqlDB Source stream @rmoff | #ConfluentVUG | @confluentinc

Slide 35

Slide 35

Stream Processing with ksqlDB Source stream @rmoff | #ConfluentVUG | @confluentinc

Slide 36

Slide 36

Stream Processing with ksqlDB Source stream @rmoff | #ConfluentVUG | @confluentinc

Slide 37

Slide 37

Stream Processing with ksqlDB Source stream Analytics @rmoff | #ConfluentVUG | @confluentinc

Slide 38

Slide 38

Stream Processing with ksqlDB Source stream Applications / Microservices @rmoff | #ConfluentVUG | @confluentinc

Slide 39

Slide 39

Stream Processing with ksqlDB …SUM(TXN_AMT) GROUP BY AC_ID AC _I D= 42 BA LA NC AC E= _I 94 D= .0 42 0 Source stream Applications / Microservices @rmoff | #ConfluentVUG | @confluentinc

Slide 40

Slide 40

Scaling ksqlDB Kafka cluster ksqlDB @rmoff | #ConfluentVUG | @confluentinc

Slide 41

Slide 41

Scaling ksqlDB Kafka cluster ksqlDB Work split by partition ksqlDB ksqlDB cluster @rmoff | #ConfluentVUG | @confluentinc

Slide 42

Slide 42

Slide 43

Slide 43

Think Applications, not database instances ksqlDB cluster Kafka cluster ksqlDB cluster Fraud Inventory ksqlDB cluster Orders @rmoff | #ConfluentVUG | @confluentinc

Slide 44

Slide 44

Let’s Build It! Rating events App Pro d uc e rA PI { “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } @rmoff | #ConfluentVUG | @confluentinc

Slide 45

Slide 45

Let’s Build It! Rating events App User data Pro d uc e rA PI a k f a K t c e n n o C RDBMS @rmoff | #ConfluentVUG | @confluentinc

Slide 46

Slide 46

Let’s Build It! Rating events App User data RDBMS Pro d uc e rA PI a k f a K t c e n n o C Kafka Connect Operational Dashboard Elasticsearch ksqlDB Join events to users, and filter @rmoff | #ConfluentVUG | @confluentinc

Slide 47

Slide 47

Let’s Build It! Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n RDBMS Pro d I P A r e m Operational Dashboard Elasticsearch n Co User data Push notification ksqlDB Join events to users, and filter @rmoff | Data Lake SnowflakeDB/ S3/HDFS/etc #ConfluentVUG | @confluentinc

Slide 48

Slide 48

Let’s Build It! Rating events App Push notification Pro d I P A r e m App u s n o C uc e rA PI Demo Time! a fk t Ka ec n RDBMS Operational Dashboard Elasticsearch n Co User data a k f a K t c e n n o C Kafka Connect ksqlDB Join events to users, and filter @rmoff | Data Lake SnowflakeDB/ S3/HDFS/etc #ConfluentVUG | @confluentinc

Slide 49

Slide 49

Push notification Rating events App Kafka Connect a fk t Ka ec n RDBMS u s n o C uc e rA PI a k f a K t c e n n o C App Operational Dashboard Elasticsearch n Co User data Pro d I P A r e m ksqlDB Join events to users, and filter @rmoff | Data Lake SnowflakeDB/ S3/HDFS/etc #ConfluentVUG | @confluentinc

Slide 50

Slide 50

Push notification to Slack Rating events App Pro d I P A r e m App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch n Co a fk t Ka ec n a k f a K t c e n n o C poor_ratings Data ratings ksqlDB User data RDBMS Lake S3/HDFS/ SnowflakeDB Filter events @rmoff | etc #ConfluentVUG | @confluentinc

Slide 51

Slide 51

{ Filter all ratings where STARS<3 “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” POOR_RATINGS } Producer API CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS <3 @rmoff | #ConfluentVUG | @confluentinc

Slide 52

Slide 52

Kafka Connect Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n RDBMS Pro d I P A r e m Operational Dashboard Elasticsearch n Co User data Push notification to Slack Join events to users, and filter @rmoff | Data Lake SnowflakeDB/ S3/HDFS/etc #ConfluentVUG | @confluentinc

Slide 53

Slide 53

Producer API MySQL t c e n n o C a k f Ka m u i z e b e D @rmoff | #ConfluentVUG | @confluentinc

Slide 54

Slide 54

Time The Stream/Table Duality Table Account ID Balance 12345 £50 Stream Account ID Amount 12345 + £50 12345

  • £25 12345
  • £60 Account ID Balance 12345 £75 Account ID Balance 12345 £15 @rmoff | #ConfluentVUG | @confluentinc

Slide 55

Slide 55

The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash

Slide 56

Slide 56

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Join each rating to customer data Producer API RATINGS_WITH_CUSTOMER_DATA t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS SELECT * FROM RATINGS LEFT JOIN CUSTOMERS ON R.ID=C.ID; } @rmoff | #ConfluentVUG | @confluentinc

Slide 57

Slide 57

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Join each rating to customer data Producer API RATINGS_WITH_CUSTOMER_DATA Filter for just PLATINUM customers t c e n n o C a k f a K UNHAPPY_PLATINUM_CUSTOMERS { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3 } @rmoff | #ConfluentVUG | @confluentinc

Slide 58

Slide 58

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) Join each rating to customer data FROM RATINGS_WITH_CUSTOMER_DATA Producer APWINDOW I TUMBLING RATINGS_WITH_CUSTOMER_DATA (SIZE 1 MINUTES) GROUP BY CLUB_STATUS; } t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” } Aggregate per-minute by CLUB_STATUS RATINGS_BY_CLUB_STATUS_1MIN @rmoff | #ConfluentVUG | @confluentinc

Slide 59

Slide 59

Kafka Connect → Elasticsearch @rmoff | #ConfluentVUG | @confluentinc

Slide 60

Slide 60

on Photo by Want to learn more? CTAs, not CATs (sorry, not sorry) @rmoff | #ConfluentVUG | @confluentinc

Slide 61

Slide 61

Free Books! https://rmoff.dev/83x @rmoff | #ConfluentVUG | @confluentinc

Slide 62

Slide 62

60 DE VA DV $50 USD off your bill each calendar month for the first three months when you sign up https://rmoff.dev/ccloud Free money! (additional $60 towards your bill 😄 ) Fully Managed Kafka as a Service * Limited availability. Activate by 11th September 2020. Expires after 90 days of activation. Any unused promo value on the expiration date will be forfeited.

Slide 63

Slide 63

Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io

Slide 64

Slide 64

Confluent Community Slack group cnfl.io/slack @rmoff | #ConfluentVUG | @confluentinc

Slide 65

Slide 65

Further reading / watching • Kafka as a Platform: the Ecosystem from the Ground Up http://rmoff.dev/youtube • https://rmoff.dev/kafka101 • Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! • https://rmoff.dev/ljc-kafka-01 • From Zero to Hero with Kafka Connect • https://rmoff.dev/ljc-kafka-02 • Introduction to ksqlDB • https://rmoff.dev/ljc-kafka-03 • Integrating Oracle and Kafka • https://rmoff.dev/oracle-and-kafka • The Changing Face of ETL: Event-Driven Architectures for Data Engineers • https://rmoff.dev/oredev19-changing-face-of-etl • 🚂On Track with Apache Kafka: Building a Streaming Platform solution with Rail Data • https://rmoff.dev/oredev19-on-track-with-kafka @rmoff | #ConfluentVUG | @confluentinc

Slide 66

Slide 66

Resources #EOF • CDC Spreadsheet • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / partners@confluent.io) can help with introductions on a given sales op @rmoff | #ConfluentVUG | @confluentinc