No More Silos: Integrating Databases and Apache Kafka

A presentation at Berlin Buzzwords in June 2021 in by Robin Moffatt

Slide 1

Slide 1

No More Silos: Integrating Databases ® and Apache Kafka @rmoff Berlin Buzzwords 2021 #bbuzz

Slide 2

Slide 2

Photo by Emily Morter on Unsplash @rmoff | #bbuzz

Slide 3

Slide 3

Analytics - Database Offload RDBMS HDFS / S3 / BigQuery etc @rmoff | #bbuzz

Slide 4

Slide 4

Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing @rmoff | #bbuzz

Slide 5

Slide 5

Evolve processing from old systems to new Existing App New App <x> RDBMS Stream Processing @rmoff | #bbuzz

Slide 6

Slide 6

“ But streaming…I’ve just got data in a database…right? @rmoff | #bbuzz

Slide 7

Slide 7

@rmoff

Slide 8

Slide 8

“ Bold claim: all your data is event streams @rmoff | #bbuzz

Slide 9

Slide 9

Human generated events A Sale A Stock movement @rmoff | #bbuzz

Slide 10

Slide 10

Machine generated events Networking IoT Applications @rmoff | #bbuzz

Slide 11

Slide 11

Databases @rmoff | #bbuzz

Slide 12

Slide 12

Do you think that’s a table you are querying? @rmoff | #bbuzz

Slide 13

Slide 13

The Stream Table Duality Account ID Balance 12345 €50 @rmoff | #bbuzz

Slide 14

Slide 14

Time The Stream Table Duality Account ID Amount 12345 + €50 Account ID Balance 12345 €50 @rmoff | #bbuzz

Slide 15

Slide 15

Time The Stream Table Duality Account ID Amount 12345 + €50 12345

  • €25 Account ID Balance 12345 €75 @rmoff | #bbuzz

Slide 16

Slide 16

Time The Stream Table Duality Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 Account ID Balance 12345 €15 @rmoff | #bbuzz

Slide 17

Slide 17

Time The Stream Table Duality Stream Table Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 Account ID Balance 12345 €15 @rmoff | #bbuzz

Slide 18

Slide 18

The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash @rmoff | #bbuzz

Slide 19

Slide 19

Photo by Vadim Sherbakov on Unsplash @rmoff | #bbuzz

Slide 20

Slide 20

Streaming Integration with Kafka Connect syslog Sources Kafka Connect Kafka Brokers @rmoff | #bbuzz

Slide 21

Slide 21

Streaming Integration with Kafka Connect Amazon Sinks Google Kafka Connect Kafka Brokers @rmoff | #bbuzz

Slide 22

Slide 22

Streaming Integration with Kafka Connect Amazon syslog Google Kafka Connect Kafka Brokers @rmoff | #bbuzz

Slide 23

Slide 23

Look Ma, No Code! { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } @rmoff | #bbuzz

Slide 24

Slide 24

Kafka Connect basics Source Kafka Connect Kafka @rmoff | #bbuzz

Slide 25

Slide 25

Connectors Connector Source Kafka Connect Kafka @rmoff | #bbuzz

Slide 26

Slide 26

Connectors Connector Native data Connect Record Source Kafka Connect Kafka @rmoff | #bbuzz

Slide 27

Slide 27

Converters Converter Connector Native data Connect bytes[] Record Source Kafka Connect Kafka @rmoff | #bbuzz

Slide 28

Slide 28

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://rmoff.dev/qcon-schemas @rmoff | #bbuzz

Slide 29

Slide 29

The Confluent Schema Registry Avro Schema Schema Registry Target Source Kafka Connect Avro Message Avro Message Kafka Connect @rmoff | #bbuzz

Slide 30

Slide 30

Single Message Transforms Connector Source Transform(s) Converter Kafka Connect Kafka @rmoff | #bbuzz

Slide 31

Slide 31

Extensible Connector Transform(s) Converter @rmoff | #bbuzz

Slide 32

Slide 32

hub.confluent.io @rmoff | #bbuzz

Slide 33

Slide 33

Change-Data-Capture (CDC) Query-based Log-based @rmoff | #bbuzz

Slide 34

Slide 34

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 35

Slide 35

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 36

Slide 36

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 37

Slide 37

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 38

Slide 38

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 39

Slide 39

Log-based CDC #051024 17:24:13 server id 1 # Position Timestamp

00000004 9d fc 5c 43

end_log_pos 98 Type Master ID 0f 01 00 00 00 Size 5e 00 00 00 Master Pos Flags 62 00 00 00 00 00

00000017 04 00 35 2e 30 2e 31 35

2d 64 65 62 75 67 2d 6c |..5.0.15.debug.l|

00000027 6f 67 00 00 00 00 00 00

00 00 00 00 00 00 00 00 |og…………..|

00000037 00 00 00 00 00 00 00 00

00 00 00 00 00 00 00 00 |…………….|

00000047 00 00 00 00 9d fc 5c 43

13 38 0d 00 08 00 12 00 |…….C.8……|

00000057 04 04 04 04 12 00 00 4b

00 04 1a |…….K…| @rmoff | #bbuzz

Slide 40

Slide 40

Log-based CDC Transaction log @rmoff | #bbuzz

Slide 41

Slide 41

Log-based CDC Transaction log @rmoff | #bbuzz

Slide 42

Slide 42

Log-based CDC Transaction log @rmoff | #bbuzz

Slide 43

Slide 43

Log-based CDC Transaction log @rmoff | #bbuzz

Slide 44

Slide 44

Demo Time! https://rmoff.dev/cdc-demo @rmoff | #bbuzz

Slide 45

Slide 45

@rmoff | #bbuzz

Slide 46

Slide 46

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 47

Slide 47

Query-based CDC CREATE TABLE my_table ( ID INT, FOO VARCHAR, BAR VARCHAR, WIBBLE VARCHAR, TS_COL TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ) SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 48

Slide 48

Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 49

Slide 49

Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 50

Slide 50

Query-based CDC UPDATE @rmoff | #bbuzz

Slide 51

Slide 51

Query-based CDC UPDATE SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #bbuzz

Slide 52

Slide 52

Query-based CDC DELETE @rmoff | #bbuzz

Slide 53

Slide 53

Query-based CDC DELETE SELECT * FROM my_table WHERE ts_col > previous ts e p o N @rmoff | #bbuzz

Slide 54

Slide 54

Query-based CDC orderID status address updateTS SELECT * FROM orders WHERE updateTS > previous ts ⏰now = 10:54:00 previous ts = 10:53:30 @rmoff | #bbuzz

Slide 55

Slide 55

Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM orders WHERE updateTS > previous ts ⏰now = 10:54:30 previous ts = 10:54:00 { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” @rmoff | #bbuzz

Slide 56

Slide 56

Query-based CDC orderID status address updateTS 42

10:54:01 PENDING INSERT INTO orders (orderID, status) VALUES (42, ‘PENDING’); @rmoff | #bbuzz

Slide 57

Slide 57

Query-based CDC orderID status address 42 1640 Riverside Drive 10:54:20 PENDING updateTS UPDATE orders SET address = ‘1640 Riverside Drive’ WHERE orderID = 42; @rmoff | #bbuzz

Slide 58

Slide 58

Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:25 PENDING UPDATE orders SET address = ‘29 Acacia Road’ WHERE orderID = 42; @rmoff | #bbuzz

Slide 59

Slide 59

Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED UPDATE orders SET status = ‘SHIPPED’ WHERE orderID = 42; @rmoff | #bbuzz

Slide 60

Slide 60

Query-based CDC orderID status orderID address status 42 SHIPPED 42 address 29 PENDING Acacia Road — updateTS updateTS 10:54:29 10:54:01 42 PENDING 1640 Riverside Drive 10:54:20 42 PENDING 29 Acacia Road 10:54:25 42 SHIPPED 29 Acacia Road 10:54:29 @rmoff | #bbuzz

Slide 61

Slide 61

Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM orders WHERE updateTS > previous ts ⏰now = 10:54:30 previous ts = 10:54:00 { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” @rmoff | #bbuzz

Slide 62

Slide 62

Query-based CDC ✅ Usually easier to setup ✅ Requires fewer permissions 🛑 Needs specific columns in source schema to track changes 🛑 Impact of polling the DB (or higher latencies tradeoff) Photo by Matese Fields on Unsplash 🛑 Can’t track deletes 🛑 Can’t track multiple events between polling interval Read more: http://cnfl.io/kafka-cdc @rmoff | #bbuzz

Slide 63

Slide 63

@rmoff | #bbuzz

Slide 64

Slide 64

Log-based CDC Transaction log @rmoff | #bbuzz

Slide 65

Slide 65

Log-based CDC UPDATE Transaction log @rmoff | #bbuzz

Slide 66

Slide 66

Log-based CDC UPDATE Transaction log @rmoff | #bbuzz

Slide 67

Slide 67

Log-based CDC DELETE Transaction log @rmoff | #bbuzz

Slide 68

Slide 68

Log-based CDC DELETE Transaction log @rmoff | #bbuzz

Slide 69

Slide 69

Log-based CDC DELETE Transaction log @rmoff | #bbuzz

Slide 70

Slide 70

Log-based CDC Commit log @rmoff | #bbuzz

Slide 71

Slide 71

Photo by Sebastian Pociecha on Unsplash Log-based CDC ✅ Greater data fidelity ✅ Lower latency ✅ Lower impact on source 🛑 More setup steps 🛑 Higher system privileges required 🛑 For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc @rmoff | #bbuzz

Slide 72

Slide 72

tl;dr : which tool do I use? Query-based CDC: confluent.io/hub @rmoff | #bbuzz

Slide 73

Slide 73

tl;dr : which tool do I use? Log-based CDC - Oracle confluent.io/hub @rmoff | #bbuzz

Slide 74

Slide 74

tl;dr : which tool do I use? Log-based CDC - MySQL, Postgres, MongoDB, etc confluent.io/hub @rmoff | #bbuzz

Slide 75

Slide 75

on Photo by Want to learn more? CTAs, not CATs (sorry, not sorry) @rmoff | #bbuzz

Slide 76

Slide 76

Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io

Slide 77

Slide 77

#EOF @rmoff rmoff.dev/talks youtube.com/rmoff