Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! Robin Moffatt @rmoff

$ whoami • Robin Moffatt (@rmoff) • Senior Developer Advocate at Confluent (Apache Kafka, not Wikis 😉) • Working in data & analytics since 2001 • Oracle ACE Director (Alumnus) http://rmoff.dev/talks · http://rmoff.dev/blog · http://rmoff.dev/youtube @rmoff | @confluentinc

Kafka is an Event Streaming Platform App App App App request-response changelogs App messaging OR stream processing App KAFKA App App DWH Hadoop streaming data pipelines @rmoff | @confluentinc

What is an Event Streaming Platform? Producer Connectors Consumer The Log Connectors Streaming Engine @rmoff | @confluentinc

Immutable Event Log Old New Messages are added at the end of the log @rmoff | @confluentinc

Topics Clicks Orders Customers Topics are similar in concept to tables in a database @rmoff | @confluentinc

Partitions Clicks p0 P1 P2 Messages are guaranteed to be strictly ordered within a partition @rmoff | @confluentinc

Messages are just K/V bytes plus headers + timestamp Clicks Header Timestamp Key Value @rmoff | @confluentinc

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://rmoff.dev/qcon-schemas @rmoff | @confluentinc

Consumers have a position all of their own Old New Sally Scan is here @rmoff | @confluentinc

Consumers have a position all of their own Old New Fred is here Scan Sally Scan is here @rmoff | @confluentinc

Consumers have a position all of their own George Scan is here Old New Fred is here Scan Sally Scan is here @rmoff | @confluentinc

@rmoff | @confluentinc Photo by Victor Garcia on Unsplash Streaming Data Pipelines

Database Offload Amazon S3 RDBMS Kafka Connect Kafka Connect HDFS @rmoff | @confluentinc

Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing @rmoff | @confluentinc

Building a streaming data pipeline @rmoff | @confluentinc

Stream Integration + Processing @rmoff | @confluentinc

Integration @rmoff | @confluentinc

Streaming Integration with Kafka Connect syslog Sources Kafka Connect Kafka Brokers @rmoff | @confluentinc

Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect Kafka Brokers @rmoff | @confluentinc

Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect Kafka Brokers @rmoff | @confluentinc

Look Ma, No Code! { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } @rmoff | @confluentinc

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://rmoff.dev/qcon-schemas @rmoff | @confluentinc

Extensible Connector Transform(s) Converter @rmoff | @confluentinc

Confluent Hub hub.confluent.io @rmoff | @confluentinc

Kafka Connect Standalone Worker S3 Task #1 JDBC Task #1 JDBC Task #2 Worker Offsets @rmoff | @confluentinc

Kafka Connect Distributed Workers S3 Task #1 JDBC Task #1 Kafka Connect cluster JDBC Task #2 Worker Worker Offsets Config Status Fault-tolerant? Yeah! @rmoff | @confluentinc

Multiple Distributed Clusters S3 Task #1 JDBC Task #1 Kafka Connect cluster #1 JDBC Task #2 Kafka Connect cluster #2 Offsets Offsets Config Config Status Status @rmoff | @confluentinc

Stream Integration + Processing @rmoff | @confluentinc

Stream Processing @rmoff | @confluentinc

} “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash { @rmoff | @confluentinc

SELECT * FROM WIDGETS WHERE WEIGHT_G > 120 { SELECT COUNT(*) FROM WIDGETS GROUP BY PRODUCTION_LINE } “reading_ts”: “2020-02-14T12:19:27Z”, “sensor_id”: “aa-101”, “production_line”: “w01”, “widget_type”: “acme94”, “temp_celcius”: 23, “widget_weight_g”: 100 Photo by Franck V. on Unsplash SELECT AVG(TEMP_CELCIUS) AS TEMP FROM WIDGETS GROUP BY SENSOR_ID HAVING TEMP>20 CREATE SINK CONNECTOR dw WITH ( Object store, ‘connector.class’ = ‘S3Connector’, data warehouse, ‘topics’ = ‘widgets’ RDBMS …); @rmoff | @confluentinc

Stream Processing with ksqlDB Source stream @rmoff | @confluentinc

Stream Processing with ksqlDB Source stream @rmoff | @confluentinc

Stream Processing with ksqlDB Source stream @rmoff | @confluentinc

Stream Processing with ksqlDB Source stream Analytics @rmoff | @confluentinc

Stream Processing with ksqlDB Source stream Applications / Microservices @rmoff | @confluentinc

Stream Processing with ksqlDB …SUM(TXN_AMT) GROUP BY AC_ID AC _I D= 42 BA LA NC AC E= _I 94 D= .0 42 0 Source stream Applications / Microservices @rmoff | @confluentinc

Scaling ksqlDB Kafka cluster ksqlDB @rmoff | @confluentinc

Scaling ksqlDB Kafka cluster ksqlDB Work split by partition ksqlDB ksqlDB cluster @rmoff | @confluentinc

Think Applications, not database instances ksqlDB cluster Kafka cluster ksqlDB cluster Fraud Inventory ksqlDB cluster Orders @rmoff | @confluentinc

Let’s Build It! Rating events App Pro d uc e rA PI { “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } @rmoff | @confluentinc

Let’s Build It! Rating events App User data Pro d uc e rA PI a k f a K t c e n n o C RDBMS @rmoff | @confluentinc

Let’s Build It! Rating events App User data RDBMS Pro d uc e rA PI a k f a K t c e n n o C Kafka Connect Operational Dashboard Elasticsearch ksqlDB Join events to users, and filter @rmoff | @confluentinc

Let’s Build It! Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n RDBMS Pro d I P A r e m Operational Dashboard Elasticsearch n Co User data Push notification ksqlDB Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc @rmoff | @confluentinc

Let’s Build It! Rating events App Push notification Pro d I P A r e m App u s n o C uc e rA PI Demo Time! a fk t Ka ec n RDBMS Operational Dashboard Elasticsearch n Co User data a k f a K t c e n n o C Kafka Connect ksqlDB Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc @rmoff | @confluentinc

Push notification Rating events App Kafka Connect a fk t Ka ec n RDBMS u s n o C uc e rA PI a k f a K t c e n n o C App Operational Dashboard Elasticsearch n Co User data Pro d I P A r e m ksqlDB Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc @rmoff | @confluentinc

Push notification to Slack Rating events App Pro d I P A r e m App u s n o C uc e rA PI Kafka Connect Operational Dashboard Elasticsearch n Co a fk t Ka ec n a k f a K t c e n n o C poor_ratings Data ratings ksqlDB User data RDBMS Lake Filter events S3/HDFS/ SnowflakeDB etc @rmoff | @confluentinc

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” Filter all ratings where STARS<3 POOR_RATINGS } Producer API CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS <3 @rmoff | @confluentinc

Kafka Connect Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n RDBMS Pro d I P A r e m Operational Dashboard Elasticsearch n Co User data Push notification to Slack Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc @rmoff | @confluentinc

Producer API MySQL t c e n n o C a k f Ka m u i z e b e D @rmoff | @confluentinc

Time The Stream/Table Duality Stream Account ID Amount 12345 + £50 12345

  • £25 12345
  • £60 Table Account ID Balance 12345 £50 Account ID Balance 12345 £75 Account ID Balance 12345 £15 @rmoff | @confluentinc

The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Producer API Join each rating to customer data RATINGS_WITH_CUSTOMER_DATA t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS SELECT * FROM RATINGS LEFT JOIN CUSTOMERS ON R.ID=C.ID; } @rmoff | @confluentinc

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Producer API t c e n n o C a k f a K Join each rating to customer data RATINGS_WITH_CUSTOMER_DATA Filter for just PLATINUM customers UNHAPPY_PLATINUM_CUSTOMERS { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3 } @rmoff | @confluentinc

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) Join each rating to customer data FROM RATINGS_WITH_CUSTOMER_DATA Producer APWINDOW I TUMBLING RATINGS_WITH_CUSTOMER_DATA (SIZE 1 MINUTES) GROUP BY CLUB_STATUS; } t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” } Aggregate per-minute by CLUB_STATUS RATINGS_BY_CLUB_STATUS_1MIN @rmoff | @confluentinc

Kafka Connect → Elasticsearch @rmoff | @confluentinc

on Photo by Want to learn more? CTAs, not CATs (sorry, not sorry) @rmoff | @confluentinc

Free Books! https://rmoff.dev/kafka-books @rmoff | @confluentinc

60 DE VA DV $50 USD off your bill each calendar month for the first three months when you sign up https://rmoff.dev/ccloud Free money! (additional $60 towards your bill 😄 ) Fully Managed Kafka as a Service * Limited availability. Activate by 11th September 2020. Expires after 90 days of activation. Any unused promo value on the expiration date will be forfeited.

Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io

Confluent Community Slack group cnfl.io/slack @rmoff | @confluentinc

Further reading / watching • Kafka as a Platform: the Ecosystem from the Ground Up • https://rmoff.dev/kafka101 http://rmoff.dev/youtube • Apache Kafka and ksqlDB in Action: Let’s Build a Streaming Data Pipeline! • https://rmoff.dev/ljc-kafka-01 • From Zero to Hero with Kafka Connect • https://rmoff.dev/ljc-kafka-02 • Introduction to ksqlDB • https://rmoff.dev/ljc-kafka-03 • Integrating Oracle and Kafka • https://rmoff.dev/oracle-and-kafka • The Changing Face of ETL: Event-Driven Architectures for Data Engineers • https://rmoff.dev/oredev19-changing-face-of-etl • 🚂On Track with Apache Kafka: Building a Streaming Platform solution with Rail Data • https://rmoff.dev/oredev19-on-track-with-kafka @rmoff | @confluentinc

Resources #EOF • CDC Spreadsheet • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / partners@confluent.io) can help with introductions on a given sales op @rmoff | @confluentinc