Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

A presentation at Seattle Event Driven Meetup in June 2019 in Seattle, WA, USA by Robin Moffatt

Kafka is a Streaming Platform App App App App @rmoff #DevoxxUK request-response changelogs App App KAFKA App App DWH Hadoop messaging OR stream processing streaming data pipelines Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Database Offload Amazon S3 RDBMS Kafka Connect Kafka Connect HDFS Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Stream Processing with Apache Kafka and KSQL order events CDC RDBMS customer orders customer Stream Processing Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Message Bus @rmoff #DevoxxUK order events App <x> Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Stream Processing @rmoff #DevoxxUK order events Stream Processing App <x> Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Transform Once, Use Many @rmoff #DevoxxUK order events Completed orders Stream Processing App <x> Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Evolve processing from old systems to new Existing App New App <x> C D C RDBMS Stream Processing Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Evolve processing from old systems to new Existing App New App <x> New App <y> C D C RDBMS Stream Processing Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Let’s Build It! Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n RDBMS I P A r e m Operational Dashboard Elasticsearch n Co User data Pro d Push notification KSQL Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Confluent Community Components Apache Kafka with a bunch of cool stuff! For free! Log Events Database Changes loT Data Web Events … Confluent Platform Data Integration Real-time Applications Monitoring & Administration Confluent Control Center | Security Confluent Platform Transformations Hadoop Operations Replicator | Auto Data Balancing Custom Apps Database Data Compatibility Schema Registry SQL Stream Processing KSQL Data Warehouse Development and Connectivity Clients | Connectors | REST Proxy | CLI CRM Monitoring Apache Kafka® Core | Connect API | Streams API … CUSTOMER SELF-MANAGED Datacenter Public Cloud Analytics … CONFLUENT FULLY-MANAGED Confluent Cloud Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Stream processing with KSQL @rmoff #DevoxxUK Push notification to Slack Rating events App Kafka Connect a fk t Ka ec n RDBMS u s n o C uc e rA PI a k f a K t c e n n o C ratings App Operational Dashboard Elasticsearch n Co User data Pro d I P A r e m poor_ratings Data KSQL Filter events Lake S3/HDFS/ SnowflakeDB etc Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” @rmoff #DevoxxUK Filter all ratings where STARS<3 POOR_RATINGS } Producer API CREATE STREAM POOR_RATINGS AS SELECT * FROM ratings WHERE STARS <3 Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK KSQL is the Streaming SQL Engine for Apache Kafka Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Filter messages with KSQL @rmoff #DevoxxUK completedOrders orders → → → → → → → → → → → 01, £10.00, 05, £10.00, 06, £24.00, 02, £12.33, 04, £5.50, → COMPLETE COMPLETE COMPLETE PENDING COMPLETE CREATE STREAM completedOrders AS SELECT * FROM orders WHERE status=’COMPLETE’; → → → → → → → → → → → 01, £10.00, 06, £24.00, 02, £12.33, 04, £5.50, → COMPLETE COMPLETE COMPLETE COMPLETE Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Drop columns with KSQL customer → → → → → → → → → → →→ {“id”:1, {“id”:2, {“id”:3, “name”:”Dana Lidgerton”, “name”:”Milo Wellsman”, “name”:”Dolph Cleeton”, “card”:”5048370182840140} “card”:”3557977885537506} “card”:”3586303633007251} CREATE STREAM customerNoCC AS SELECT ID, NAME customerNoCC FROM customer; → → → → → → → → → → →→ {“id”:1, {“id”:2, {“id”:3, “name”:”Dana Lidgerton”} “name”:”Milo Wellsman”} “name”:”Dolph Cleeton”} Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Stateful aggregation with KSQL @rmoff #DevoxxUK customersByCountry customer → → → → → → → → → → →→ {“id”:1, {“id”:2, {“id”:3, “name”:”Dana Lidgerton”, “name”:”Milo Wellsman”, “name”:”Dolph Cleeton”, “country”:”UK”} “country”:”UK”} “country”:”Germany”} CREATE STREAM customersByCountry AS SELECT country, COUNT(*) AS customerCount FROM customer WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY country; → → → → → → → → → → →→ {“country”:”UK”, {“country”:”Germany”, “customerCount”:2} “customerCount”:1} Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

KSQL for Anomaly Detection @rmoff #DevoxxUK Identifying patterns or anomalies in real-time data, surfaced in milliseconds CREATE TABLE possible_fraud AS SELECT card_number, count(*) FROM authorization_attempts WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY card_number HAVING count(*) > 3; Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

KSQL for Data Transformation @rmoff #DevoxxUK Make simple derivations of existing topics from the command line CREATE STREAM pageviews WITH (PARTITIONS=4, VALUE_FORMAT=’AVRO’) AS SELECT * FROM pageviews_json; Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

KSQL for Streaming ETL @rmoff #DevoxxUK Joining, filtering, and aggregating streams of event data CREATE STREAM vip_actions AS SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = ‘Platinum’; Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK KSQL in Development and Production Interactive KSQL for development and testing Headless KSQL for Production REST Desired KSQL queries have been identified “Hmm, let me try out this idea…” Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Kafka Connect Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n RDBMS I P A r e m Operational Dashboard Elasticsearch n Co User data Pro d Push notification to Slack Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Streaming Integration with Kafka Connect syslog Sources Tasks Workers Kafka Connect Kafka Brokers Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Streaming Integration with Kafka Connect Amazon S3 Google BigQuery Sinks Tasks Workers Kafka Connect Kafka Brokers Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Tasks Workers Kafka Connect Kafka Brokers Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Kafka Connect Reliable and scalable integration of Kafka with other systems – no coding required. { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo”, “table.whitelist”: “sales,orders,customers” } https://docs.confluent.io/current/connect/ Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Extensible • Connector Transform(s) Converter Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Confluent Hub hub.confluent.io Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Demo Time! Producer API MySQL t c e n n o C a k f Ka m u i z e b e D Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Time The Stream/Table Duality Stream Account ID Amount 12345 + €50 12345
€25 12345 -€60 @rmoff #DevoxxUK Account ID Balance Table 12345 €50 Account ID Balance 12345 €75 Account ID Balance 12345 €15 Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Producer API @rmoff #DevoxxUK Join each rating to customer data RATINGS_WITH_CUSTOMER_DATA t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM RATINGS_WITH_CUSTOMER_DATA AS SELECT * FROM RATINGS LEFT JOIN CUSTOMERS ON R.ID=C.ID; } Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Producer API t c e n n o C a k f a K @rmoff #DevoxxUK Join each rating to customer data RATINGS_WITH_CUSTOMER_DATA Filter for just PLATINUM customers UNHAPPY_PLATINUM_CUSTOMERS { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” CREATE STREAM UNHAPPY_PLATINUM_CUSTOMERS AS SELECT * FROM RATINGS_WITH_CUSTOMER_DATA WHERE STARS < 3 } Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” @rmoff #DevoxxUK CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) Join each rating to customer data FROM RATINGS_WITH_CUSTOMER_DATA Producer APWINDOW I TUMBLING RATINGS_WITH_CUSTOMER_DATA (SIZE 1 MINUTES) GROUP BY CLUB_STATUS; } t c e n n o C a k f a K { “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” } Aggregate per-minute by CLUB_STATUS RATINGS_BY_CLUB_STATUS_1MIN Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Kafka Connect → Elasticsearch @rmoff #DevoxxUK Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

https://www.confluent.io/ksql Streaming ETL, powered by Apache Kafka and Confluent Platform @rmoff #DevoxxUK http://cnfl.io/demo-scene http://cnfl.io/book-bundle http://cnfl.io/slack @rmoff Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Related Talks •The Changing Face of ETL: Event-Driven Architectures for Data Engineers •Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline! • 📖 Slides • 📖 Slides • 📽 Recording • 👾 Code • 📽 Recording •ATM Fraud detection with Kafka and KSQL • 📖 Slides •No More Silos: Integrating Databases and Apache Kafka • 👾 Code • 📖 Slides • 📽 Recording • 👾 Code (MySQL) • 👾 Code (Oracle) •Embrace the Anarchy: Apache Kafka’s Role in Modern Data Architectures • 📽 Recording • 📖 Slides • 📽 Recording Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

@rmoff #DevoxxUK Resources • CDC Spreadsheet #EOF • Blog: No More Silos: How to Integrate your Databases with Apache Kafka and CDC • #partner-engineering on Slack for questions • BD team (#partners / partners@confluent.io) can help with introductions on a given sales op Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Robin Moffatt
@rmoff

1 / 42

Have you ever thought that you needed to be a programmer to do stream processing and build streaming data pipelines? Think again! Apache Kafka is a distributed, scalable, and fault-tolerant streaming platform, providing low-latency pub-sub messaging coupled with native storage and stream processing capabilities. Integrating Kafka with RDBMS, NoSQL, and object stores is simple with Kafka Connect, which is part of Apache Kafka. KSQL is the open-source SQL streaming engine for Apache Kafka, and makes it possible to build stream processing applications at scale, written using a familiar SQL interface.

In this talk we’ll explain the architectural reasoning for Apache Kafka and the benefits of real-time integration, and we’ll build a streaming data pipeline using nothing but our bare hands, Kafka Connect, and KSQL.

Gasp as we filter events in real time! Be amazed at how we can enrich streams of data with data from RDBMS! Be astonished at the power of streaming aggregates for anomaly detection!

Video

Resources

The following resources were mentioned during the presentation or are useful additional information.

Buzz and feedback

Here’s what was said about this presentation on social media.

I want KSQL
— juno suárez (they/them) (@junosz) June 26, 2019
I'm getting all sorts of ideas around what I could prospectively do with KSQL... and with prospective schema migrations. 👍🏻

Gonna have to connect #kafka to my #cassandra cluster too for some upcoming features. Will do that on stream upcoming too! 😁
— Λdrøn (@Adron) June 26, 2019
Lot's of interesting concepts @rmoff is bringing up. I'm curious how to setup schema migration via KSQL. Curious if anybody has done this type of thing for #kafka?
— Thrashing Code (@ThrashingCode) June 26, 2019
I couldn't make the Seattle Event-Driven meetup today, but thanks to @Adron streaming it we can all watch it from home.

Currently @rmoff will be talking about Apache Kafka.

🔴 https://t.co/HmCkZ6QSoI
— Lena Hall (@lenadroid) June 26, 2019

Apache Kafka and KSQL in Action : Let’s Build a Streaming Data Pipeline!

Link for this presentation:

HTML code for embedding:

Share on social media:

Video

Resources

👾 Demo code

💬 Confluent Community Slack group

📚Free eBooks (including “Kafka: The Definitive Guide”)

Buzz and feedback