Streaming ETL in Practice with PostgreSQL, Apache Kafka, and KSQL

A presentation at Postgresconf NYC in March 2019 in New York, NY, USA by Viktor Gamov

Slide 1

Slide 1

@gamussa #Postgres @confluentinc Streaming ETL in Practice: Build Data Pipelines without a single line of code!

Slide 2

Slide 2

@gamussa @gamussa #Postgres @confluentinc

Slide 3

Slide 3

Special Thanks! @rmoff @gamussa #Postgres @confluentinc

Slide 4

Slide 4

Raffle, yeah 🚀 must follow @gamussa @confluentinc 📸🖼🐘 Tag @gamussa With #peoplepostgresdata @gamussa #Postgres @confluentinc

Slide 5

Slide 5

@gamussa #Postgres @confluentinc

Slide 6

Slide 6

Apache Kafka is an event streaming platform @gamussa #Postgres @confluentinc

Slide 7

Slide 7

But what is An event streaming platform? @gamussa #Postgres @confluentinc

Slide 8

Slide 8

A bit of a mess… App App App App cache monitoring cache MQ DWH security MQ search Hadoop @gamussa #Postgres @confluentinc

Slide 9

Slide 9

Streams are changing all of this App App App App request-response changelogs App KAFKA App App App DWH messaging OR stream processing Hadoop streaming data pipelines @gamussa #Postgres @confluentinc

Slide 10

Slide 10

Message Bus order events New App <x> @gamussa #Postgres @confluentinc

Slide 11

Slide 11

Stream Processing order events New App <x> Stream Processing @gamussa #Postgres @confluentinc

Slide 12

Slide 12

Data Enrichment order events C D C Postgres customer New App <x> Stream Processing @gamussa #Postgres @confluentinc

Slide 13

Slide 13

Transform Once, Use Many order events customer orders C D C Postgres customer New App <x> Stream Processing @gamussa #Postgres @confluentinc

Slide 14

Slide 14

Evolve processing from old systems to new New App <x> Existing App C D C Postgres Stream Processing @gamussa #Postgres @confluentinc

Slide 15

Slide 15

Evolve processing from old systems to new New App <x> Existing App New App <y> C D C Postgres Stream Processing @gamussa #Postgres @confluentinc

Slide 16

Slide 16

Evolve processing from old systems to new New App <x> Existing App New App <y> C D C Postgres Stream Processing @gamussa #Postgres @confluentinc

Slide 17

Slide 17

Push notification to Slack Rating events Operational Dashboard User data Join events to users, and filter Data Lake @gamussa #Postgres @confluentinc

Slide 18

Slide 18

Rating events App Pro d uc e rA PI User data Postgres Join events to users, and filter I P A r e m u s n o C Push notification to Slack App Operational Elasticsearch Dashboard Data Lake S3/HDFS/ SnowflakeDB etc @gamussa #Postgres @confluentinc

Slide 19

Slide 19

Rating events App uc e rA PI a k f a K t c e n n o C Kafka Connect a fk t Ka ec n Postgres u s n o C Push notification to Slack App Operational Elasticsearch Dashboard n Co User data Pro d I P A r e m Join events to users, and filter Data Lake S3/HDFS/ SnowflakeDB etc @gamussa #Postgres @confluentinc

Slide 20

Slide 20

Streaming Integration with Kafka Connect syslog flat file CSV JSON Sources MQTT Tasks Workers Kafka Connect Kafka Brokers @gamussa #Postgres @confluentinc

Slide 21

Slide 21

Streaming Integration with Kafka Connect Amazon S3 Sinks MQT Tasks Kafka Connect Workers Kafka Brokers @gamussa #Postgres @confluentinc

Slide 22

Slide 22

Streaming Integration with Kafka Connect Amazon S3 syslog flat file CSV JSON Sources Sinks MQT MQTT Tasks Workers Kafka Connect Kafka Brokers @gamussa #Postgres @confluentinc

Slide 23

Slide 23

Integrating Postgres with Kafka Kafka Connect & Debezium Kafka Connect & JDBC Sink @gamussa #Postgres @confluentinc

Slide 24

Slide 24

Confluent Hub •One-stop place to discover and download : •Connectors •Transformations •Converters hub.confluent.io @gamussa #Postgres @confluentinc

Slide 25

Slide 25

Stop! Demo time! Producer API Postgres t c e n n o C a k f Ka m u i z e b e D @gamussa #Postgres @confluentinc

Slide 26

Slide 26

Rating events App uc e rA PI a k f a K t c e n n o C Kafka Connect a fk t Ka ec n RDBMS u s n o C Push notification to Slack App Operational Elasticsearch Dashboard n Co User data Pro d I P A r e m KSQL Join events to users, and filter Data Lake S3/HDFS/ SnowflakeDB etc @gamussa #Postgres @confluentinc

Slide 27

Slide 27

Coding Sophistication Lower the bar to enter the world of streaming Core developers who use Java/Scala streams Core developers who don’t use Java/Scala Data engineers, architects, DevOps/SRE BI analysts User Population @gamussa #Postgres @confluentinc

Slide 28

Slide 28

KSQL #FTW ksql> 1 UI 2 POST /query CLI 3 REST @gamussa 4 #Postgres Headless @confluentinc

Slide 29

Slide 29

Interactive KSQL for development and testing Headless KSQL for Production REST Desired KSQL queries have been identified “Hmm, let me try out this idea…” @gamussa #Postgres @confluentinc

Slide 30

Slide 30

Interaction with Kafka KSQL (processing) Kafka JVM application with Kafka Streams (processing) (data) Does not run on Kafka brokers Does not run on Kafka brokers @gamussa #Postgres @confluentinc

Slide 31

Slide 31

Fault-Tolerance, powered by Kafka @gamussa #Postgres @confluentinc

Slide 32

Slide 32

Standing on the shoulders of Streaming Giants KSQL Ease of use Powered by KSQL UDFs Kafka Streams Powered by Producer, Consumer APIs Flexibility @gamussa #Postgres @confluentinc

Slide 33

Slide 33

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” Demo Time! } Producer API Postgres t c e n n o C a k f Ka { Kafka Connect Kafk a Elasticsearch Con nec t “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” Postgres } @gamussa #Postgres @confluentinc

Slide 34

Slide 34

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” Filter all ratings where STARS<3 POOR_RATINGS } Producer API @gamussa #Postgres @confluentinc

Slide 35

Slide 35

Do you think that’s a table you are querying ?

Slide 36

Slide 36

The Stream-Table Duality Table Alice €50 (balance) Stream (payments) Alice

  • €50 Alice €50 Alice €75 €75 Alice €15 Bob €18 €18 Bob €18 Bob €18 Bob
  • €18 Alice
  • €25 Alice – €60 time @gamussa #Postgres @confluentinc

Slide 37

Slide 37

The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf @gamussa #Postgres @confluentinc

Slide 38

Slide 38

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Producer API Join each rating to customer data RATINGS_WITH_CUSTOMER_DATA t c e n n o C a k f a K { } “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” @gamussa #Postgres @confluentinc

Slide 39

Slide 39

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” } Producer API t c e n n o C a k f a K { } Join each rating to customer data RATINGS_WITH_CUSTOMER_DATA Filter for just PLATINUM customers UNHAPPY_PLATINUM_CUSTOMERS “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” @gamussa #Postgres @confluentinc

Slide 40

Slide 40

{ “rating_id”: 5313, “user_id”: 3, “stars”: 4, “route_id”: 6975, “rating_time”: 1519304105213, “channel”: “web”, “message”: “worst. flight. ever. #neveragain” CREATE TABLE RATINGS_BY_CLUB_STATUS AS SELECT CLUB_STATUS, COUNT(*) Join each rating to customer data FROM ProducerRATINGS_WITH_CUSTOMER_DATA API RATINGS_WITH_CUSTOMER_DATA WINDOW TUMBLING (SIZE 1 MINUTES) GROUP BY CLUB_STATUS; } t c e n n o C a k f a K { } “id”: 3, “first_name”: “Merilyn”, “last_name”: “Doughartie”, “email”: “mdoughartie1@dedecms.com”, “gender”: “Female”, “club_status”: “platinum”, “comments”: “none” Aggregate per-minute by CLUB_STATUS RATINGS_BY_CLUB_STATUS_1MIN @gamussa #Postgres @confluentinc

Slide 41

Slide 41

Resources and Next Steps https://github.com/confluentinc/examples http://confluent.io https://slackpass.io/confluentcommunity #ksql #connect @gamussa #Postgres @confluentinc

Slide 42

Slide 42

Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle @gamussa #Postgres @confluentinc

Slide 43

Slide 43

One last thing… @gamussa #Postgres @confluentinc

Slide 44

Slide 44

https://kafka-summit.org Gamov30 @gamussa #Postgres @confluentinc

Slide 45

Slide 45

Thanks! @gamussa viktor@confluent.io @gamussa #Postgres @confluentinc