Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline!

A presentation at DevOps Pro Europe in May 2021 in by Robin Moffatt

Slide 1

Slide 1

Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! Robin Moffatt @rmoff

Slide 2

Slide 2

$ whoami • Robin Moffatt (@rmoff) • Senior Developer Advocate at Confluent (Apache Kafka, not Wikis 😉) • Working in data & analytics since 2001 • Oracle ACE Director (Alumnus) http://rmoff.dev/talks · http://rmoff.dev/blog · http://rmoff.dev/youtube @rmoff | @confluentinc

Slide 3

Slide 3

Kafka is an Event Streaming Platform App App App App request-response changelogs App messaging OR stream processing App KAFKA App App DWH Hadoop streaming data pipelines @rmoff | @confluentinc

Slide 4

Slide 4

What is an Event Streaming Platform? Producer Connectors Consumer The Log Connectors Streaming Engine @rmoff | @confluentinc

Slide 5

Slide 5

Immutable Event Log Old New Messages are added at the end of the log @rmoff | @confluentinc

Slide 6

Slide 6

Topics Clicks Orders Customers Topics are similar in concept to tables in a database @rmoff | @confluentinc

Slide 7

Slide 7

Partitions Clicks p0 P1 P2 Messages are guaranteed to be strictly ordered within a partition @rmoff | @confluentinc

Slide 8

Slide 8

Messages are just K/V bytes plus headers + timestamp Clicks Header Timestamp Key Value @rmoff | @confluentinc

Slide 9

Slide 9

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://rmoff.dev/qcon-schemas @rmoff | @confluentinc

Slide 10

Slide 10

Consumers have a position all of their own Old New Sally Scan is here @rmoff | @confluentinc

Slide 11

Slide 11

Consumers have a position all of their own Old New Fred is here Scan Sally Scan is here @rmoff | @confluentinc

Slide 12

Slide 12

Consumers have a position all of their own George Scan is here Old New Fred is here Scan Sally Scan is here @rmoff | @confluentinc

Slide 13

Slide 13

@rmoff | @confluentinc Photo by Victor Garcia on Unsplash Streaming Data Pipelines

Slide 14

Slide 14

Database Offload Amazon S3 RDBMS Kafka Connect Kafka Connect HDFS @rmoff | @confluentinc

Slide 15

Slide 15

Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing @rmoff | @confluentinc

Slide 16

Slide 16

Building a streaming data pipeline @rmoff | @confluentinc

Slide 17

Slide 17

Stream Integration + Processing @rmoff | @confluentinc

Slide 18

Slide 18

Integration @rmoff | @confluentinc

Slide 19

Slide 19

Streaming Integration with Kafka Connect syslog Sources Kafka Connect Kafka Brokers @rmoff | @confluentinc

Slide 20

Slide 20

Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect Kafka Brokers @rmoff | @confluentinc

Slide 21

Slide 21

Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect Kafka Brokers @rmoff | @confluentinc

Slide 22

Slide 22

Look Ma, No Code! { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } @rmoff | @confluentinc

Slide 23

Slide 23

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://rmoff.dev/qcon-schemas @rmoff | @confluentinc

Slide 24

Slide 24

Extensible Connector Transform(s) Converter @rmoff | @confluentinc

Slide 25

Slide 25

Confluent Hub hub.confluent.io @rmoff | @confluentinc

Slide 26

Slide 26

Kafka Connect Standalone Worker S3 Task #1 JDBC Task #1 JDBC Task #2 Worker Offsets @rmoff | @confluentinc

Slide 27

Slide 27

Kafka Connect Distributed Workers S3 Task #1 JDBC Task #1 Kafka Connect cluster JDBC Task #2 Worker Worker Offsets Config Status Fault-tolerant? Yeah! @rmoff | @confluentinc

Slide 28

Slide 28

Multiple Distributed Clusters S3 Task #1 JDBC Task #1 Kafka Connect cluster #1 JDBC Task #2 Kafka Connect cluster #2 Offsets Offsets Config Config Status Status @rmoff | @confluentinc

Slide 29

Slide 29

Stream Integration + Processing @rmoff | @confluentinc

Slide 30

Slide 30

Stream Processing @rmoff | @confluentinc

Slide 31

Slide 31

} “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash { @rmoff | @confluentinc

Slide 32

Slide 32

SELECT * FROM WIDGETS WHERE WEIGHT_G > 120 { SELECT COUNT(*) FROM WIDGETS GROUP BY PRODUCTION_LINE } “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash SELECT AVG(TEMP_CELCIUS) AS TEMP FROM WIDGETS GROUP BY SENSOR_ID HAVING TEMP>20 CREATE SINK CONNECTOR dw WITH ( Object store, ‘connector.class’ = ‘S3Connector’, data warehouse, ‘topics’ = ‘widgets’ RDBMS …); @rmoff | @confluentinc

Slide 33

Slide 33

Stream Processing with ksqlDB Source stream @rmoff | @confluentinc

Slide 34

Slide 34

Stream Processing with ksqlDB Source stream @rmoff | @confluentinc

Slide 35

Slide 35

Stream Processing with ksqlDB Source stream @rmoff | @confluentinc

Slide 36

Slide 36

Stream Processing with ksqlDB Source stream Analytics @rmoff | @confluentinc

Slide 37

Slide 37

Stream Processing with ksqlDB Source stream Applications / Microservices @rmoff | @confluentinc

Slide 38

Slide 38

Stream Processing with ksqlDB …SUM(TXN_AMT) GROUP BY AC_ID AC _I D= 42 BA LA NC AC E= _I 94 D= .0 42 0 Source stream Applications / Microservices @rmoff | @confluentinc

Slide 39

Slide 39

Scaling ksqlDB Kafka cluster ksqlDB @rmoff | @confluentinc

Slide 40

Slide 40

Scaling ksqlDB Kafka cluster ksqlDB Work split by partition ksqlDB ksqlDB cluster @rmoff | @confluentinc

Slide 41

Slide 41

Slide 42

Slide 42

Think Applications, not database instances ksqlDB cluster Kafka cluster ksqlDB cluster Fraud Inventory ksqlDB cluster Orders @rmoff | @confluentinc

Slide 43

Slide 43

Let’s Build It! Rating events App Pro d uc e rA PI { “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } @rmoff | @confluentinc

Slide 44

Slide 44

Let’s Build It! Rating events App User data Pro d uc e rA PI a k f a K t c e n n o C RDBMS @rmoff | @confluentinc

Slide 45

Slide 45

Let’s Build It! Rating events App User data RDBMS Pro d uc e rA PI a k f a K t c e n n o C Kafka Connect Operational Dashboard Elasticsearch ksqlDB Join events to users, and filter @rmoff | @confluentinc

Slide 46

Slide 46

Let’s Build It! Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n RDBMS Pro d I P A r e m Operational Dashboard Elasticsearch n Co User data Push notification ksqlDB Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc @rmoff | @confluentinc

Slide 47

Slide 47

Let’s Build It! Rating events App Push notification Pro d I P A r e m App u s n o C uc e rA PI Demo Time! a fk t Ka ec n RDBMS Operational Dashboard Elasticsearch n Co User data a k f a K t c e n n o C Kafka Connect ksqlDB Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc @rmoff | @confluentinc

Slide 48

Slide 48

Push notification Rating events App Kafka Connect a fk t Ka ec n RDBMS u s n o C uc e rA PI a k f a K t c e n n o C App Operational Dashboard Elasticsearch n Co User data Pro d I P A r e m ksqlDB Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc @rmoff | @confluentinc

Slide 49

Slide 49

Push notification to Slack Rating events App Pro d I P A r e m App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch n Co a fk t Ka ec n a k f a K t c e n n o C poor_ratings Data ratings ksqlDB User data RDBMS Lake Filter events S3/HDFS/ SnowflakeDB etc @rmoff | @confluentinc

Slide 50

Slide 50

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” Filter all ratings where STARS<3 POOR_RATINGS } Producer API CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS <3 @rmoff | @confluentinc

Slide 51

Slide 51

Kafka Connect Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n RDBMS Pro d I P A r e m Operational Dashboard Elasticsearch n Co User data Push notification to Slack Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc @rmoff | @confluentinc

Slide 52

Slide 52

Producer API MySQL t c e n n o C a k f Ka m u i z e b e D @rmoff | @confluentinc

Slide 53

Slide 53

Time The Stream/Table Duality Stream Account ID Amount 12345 + £50 12345

  • £25 12345
  • £60 Table Account ID Balance 12345 £50 Account ID Balance 12345 £75 Account ID Balance 12345 £15 @rmoff | @confluentinc

Slide 54

Slide 54

The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash

Slide 55

Slide 55

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Producer API Join each rating to customer data RATINGS_WITH_CUSTOMER_DATA t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS SELECT * FROM RATINGS LEFT JOIN CUSTOMERS ON R.ID=C.ID; } @rmoff | @confluentinc

Slide 56

Slide 56

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Producer API t c e n n o C a k f a K Join each rating to customer data RATINGS_WITH_CUSTOMER_DATA Filter for just PLATINUM customers UNHAPPY_PLATINUM_CUSTOMERS { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3 } @rmoff | @confluentinc

Slide 57

Slide 57

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) Join each rating to customer data FROM RATINGS_WITH_CUSTOMER_DATA Producer APWINDOW I TUMBLING RATINGS_WITH_CUSTOMER_DATA (SIZE 1 MINUTES) GROUP BY CLUB_STATUS; } t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” } Aggregate per-minute by CLUB_STATUS RATINGS_BY_CLUB_STATUS_1MIN @rmoff | @confluentinc

Slide 58

Slide 58

Kafka Connect → Elasticsearch @rmoff | @confluentinc

Slide 59

Slide 59

on Photo by Want to learn more? CTAs, not CATs (sorry, not sorry) @rmoff | @confluentinc

Slide 60

Slide 60

Free Books! https://rmoff.dev/kafka-books @rmoff | @confluentinc

Slide 61

Slide 61

60 DE VA DV $200 USD off your bill each calendar month for the first three months when you sign up https://rmoff.dev/ccloud Free money! (additional $60 towards your bill 😄 ) Fully Managed Kafka as a Service * T&C: https://www.confluent.io/confluent-cloud-promo-disclaimer

Slide 62

Slide 62

Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io

Slide 63

Slide 63

Confluent Community Slack group cnfl.io/slack @rmoff | @confluentinc

Slide 64

Slide 64

Further reading / watching • Kafka as a Platform: the Ecosystem from the Ground Up • https://rmoff.dev/kafka101 http://rmoff.dev/youtube • Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! • https://rmoff.dev/ljc-kafka-01 • From Zero to Hero with Kafka Connect • https://rmoff.dev/ljc-kafka-02 • Introduction to ksqlDB • https://rmoff.dev/ljc-kafka-03 • Integrating Oracle and Kafka • https://rmoff.dev/oracle-and-kafka • The Changing Face of ETL: Event-Driven Architectures for Data Engineers • https://rmoff.dev/oredev19-changing-face-of-etl • 🚂On Track with Apache Kafka: Building a Streaming Platform solution with Rail Data • https://rmoff.dev/oredev19-on-track-with-kafka @rmoff | @confluentinc

Slide 65

Slide 65

Resources #EOF • CDC Spreadsheet • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / partners@confluent.io) can help with introductions on a given sales op @rmoff | @confluentinc