No More Silos: Integrating Databases and Apache Kafka

A presentation at ApacheCon in September 2020 in by Robin Moffatt

Slide 1

Slide 1

ApacheCon @ Home 2020 Integrating Databases and Apache Kafka ® @rmoff #acah2020

Slide 2

Slide 2

Photo by Emily Morter on Unsplash @rmoff | #acah2020

Slide 3

Slide 3

Analytics - Database Offload RDBMS HDFS / S3 / BigQuery etc @rmoff | #acah2020

Slide 4

Slide 4

Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing @rmoff | #acah2020

Slide 5

Slide 5

Evolve processing from old systems to new Existing App New App <x> RDBMS Stream Processing @rmoff | #acah2020

Slide 6

Slide 6

“ But streaming…I’ve just got data in a database…right? @rmoff | #acah2020

Slide 7

Slide 7

Slide 8

Slide 8

“ Bold claim: all your data is event streams @rmoff | #acah2020

Slide 9

Slide 9

Human generated events A Sale A Stock movement @rmoff | #acah2020

Slide 10

Slide 10

Machine generated events Networking IoT Applications @rmoff | #acah2020

Slide 11

Slide 11

Databases @rmoff | #acah2020

Slide 12

Slide 12

Do you think that’s a table you are querying? @rmoff | #acah2020

Slide 13

Slide 13

The Stream Table Duality Account ID Balance 12345 €50 @rmoff | #acah2020

Slide 14

Slide 14

Time The Stream Table Duality Account ID Amount 12345 + €50 Account ID Balance 12345 €50 @rmoff | #acah2020

Slide 15

Slide 15

Time The Stream Table Duality Account ID Amount 12345 + €50 12345

  • €25 Account ID Balance 12345 €75 @rmoff | #acah2020

Slide 16

Slide 16

Time The Stream Table Duality Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 Account ID Balance 12345 €15 @rmoff | #acah2020

Slide 17

Slide 17

Time The Stream Table Duality Stream Table Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 Account ID Balance 12345 €15 @rmoff | #acah2020

Slide 18

Slide 18

The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf Photo by Bobby Burch on Unsplash @rmoff | #acah2020

Slide 19

Slide 19

Photo by Vadim Sherbakov on Unsplash @rmoff | #acah2020

Slide 20

Slide 20

Streaming Integration with Kafka Connect syslog Sources Kafka Connect Kafka Brokers @rmoff | #acah2020

Slide 21

Slide 21

Streaming Integration with Kafka Connect Amazon Sinks Google Kafka Connect Kafka Brokers @rmoff | #acah2020

Slide 22

Slide 22

Streaming Integration with Kafka Connect Amazon syslog Google Kafka Connect Kafka Brokers @rmoff | #acah2020

Slide 23

Slide 23

Look Ma, No Code! { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } @rmoff | #acah2020

Slide 24

Slide 24

Kafka Connect basics Source Kafka Connect Kafka @rmoff | #acah2020

Slide 25

Slide 25

Connectors Connector Source Kafka Connect Kafka @rmoff | #acah2020

Slide 26

Slide 26

Connectors Connector Native data Connect Record Source Kafka Connect Kafka @rmoff | #acah2020

Slide 27

Slide 27

Converters Converter Connector Native data Connect bytes[] Record Source Kafka Connect Kafka @rmoff | #acah2020

Slide 28

Slide 28

Serialisation & Schemas JSON Avro Protobuf Schema JSON CSV 👍 👍 👍 😬 https://rmoff.dev/qcon-schemas @rmoff | #acah2020

Slide 29

Slide 29

The Confluent Schema Registry Avro Schema Schema Registry Target Source Kafka Connect Avro Message Avro Message Kafka Connect @rmoff | #acah2020

Slide 30

Slide 30

Single Message Transforms Connector Source Transform(s) Converter Kafka Connect Kafka @rmoff | #acah2020

Slide 31

Slide 31

Extensible Connector Transform(s) Converter @rmoff | #acah2020

Slide 32

Slide 32

hub.confluent.io @rmoff | #acah2020

Slide 33

Slide 33

Change-Data-Capture (CDC) Query-based Log-based @rmoff | #acah2020

Slide 34

Slide 34

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 35

Slide 35

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 36

Slide 36

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 37

Slide 37

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 38

Slide 38

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 39

Slide 39

Log-based CDC #051024 17:24:13 server id 1 # Position Timestamp

00000004 9d fc 5c 43

end_log_pos 98 Type Master ID 0f 01 00 00 00 Size 5e 00 00 00 Master Pos Flags 62 00 00 00 00 00

00000017 04 00 35 2e 30 2e 31 35

2d 64 65 62 75 67 2d 6c |..5.0.15.debug.l|

00000027 6f 67 00 00 00 00 00 00

00 00 00 00 00 00 00 00 |og…………..|

00000037 00 00 00 00 00 00 00 00

00 00 00 00 00 00 00 00 |…………….|

00000047 00 00 00 00 9d fc 5c 43

13 38 0d 00 08 00 12 00 |…….C.8……|

00000057 04 04 04 04 12 00 00 4b

00 04 1a |…….K…| @rmoff | #acah2020

Slide 40

Slide 40

Log-based CDC Transaction log @rmoff | #acah2020

Slide 41

Slide 41

Log-based CDC Transaction log @rmoff | #acah2020

Slide 42

Slide 42

Log-based CDC Transaction log @rmoff | #acah2020

Slide 43

Slide 43

Log-based CDC Transaction log @rmoff | #acah2020

Slide 44

Slide 44

Demo Time! https://rmoff.dev/cdc-demo @rmoff | #acah2020

Slide 45

Slide 45

@rmoff | #acah2020

Slide 46

Slide 46

Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 47

Slide 47

Query-based CDC CREATE TABLE my_table ( ID INT, FOO VARCHAR, BAR VARCHAR, WIBBLE VARCHAR, TS_COL TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ) SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 48

Slide 48

Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 49

Slide 49

Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 50

Slide 50

Query-based CDC UPDATE @rmoff | #acah2020

Slide 51

Slide 51

Query-based CDC UPDATE SELECT * FROM my_table WHERE ts_col > previous ts @rmoff | #acah2020

Slide 52

Slide 52

Query-based CDC DELETE @rmoff | #acah2020

Slide 53

Slide 53

Query-based CDC DELETE e p o SELECT * FROM my_table WHERE ts_col > previous ts @rmoff N | #acah2020

Slide 54

Slide 54

Query-based CDC orderID status address updateTS SELECT * FROM orders WHERE updateTS > previous ts @rmoff | #acah2020

Slide 55

Slide 55

Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM orders WHERE updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” @rmoff | #acah2020

Slide 56

Slide 56

Query-based CDC orderID status address updateTS 42

10:54:01 PENDING INSERT INTO orders (orderID, status) VALUES (42, ‘PENDING’); @rmoff | #acah2020

Slide 57

Slide 57

Query-based CDC orderID status address 42 1640 Riverside Drive 10:54:20 PENDING updateTS UPDATE orders SET address = ‘1640 Riverside Drive’ WHERE orderID = 42; @rmoff | #acah2020

Slide 58

Slide 58

Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:25 PENDING UPDATE orders SET address = ‘29 Acacia Road’ WHERE orderID = 42; @rmoff | #acah2020

Slide 59

Slide 59

Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED UPDATE orders SET status = ‘SHIPPED’ WHERE orderID = 42; @rmoff | #acah2020

Slide 60

Slide 60

Query-based CDC orderID status orderID address status 42 SHIPPED 42 address 29 PENDING Acacia Road — updateTS updateTS 10:54:29 10:54:01 42 PENDING 1640 Riverside Drive 10:54:20 42 PENDING 29 Acacia Road 10:54:25 42 SHIPPED 29 Acacia Road 10:54:29 @rmoff | #acah2020

Slide 61

Slide 61

Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM orders WHERE updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” @rmoff | #acah2020

Slide 62

Slide 62

Query-based CDC ✅ Usually easier to setup ✅ Requires fewer permissions 🛑 Needs specific columns in source schema to track changes 🛑 Impact of polling the DB (or higher latencies tradeoff) Photo by Matese Fields on Unsplash 🛑 Can’t track deletes 🛑 Can’t track multiple events between polling interval Read more: http://cnfl.io/kafka-cdc @rmoff | #acah2020

Slide 63

Slide 63

@rmoff | #acah2020

Slide 64

Slide 64

Slide 65

Slide 65

Log-based CDC Transaction log @rmoff | #acah2020

Slide 66

Slide 66

Log-based CDC UPDATE Transaction log @rmoff | #acah2020

Slide 67

Slide 67

Log-based CDC UPDATE Transaction log @rmoff | #acah2020

Slide 68

Slide 68

Log-based CDC DELETE Transaction log @rmoff | #acah2020

Slide 69

Slide 69

Log-based CDC DELETE Transaction log @rmoff | #acah2020

Slide 70

Slide 70

Log-based CDC Commit log @rmoff | #acah2020

Slide 71

Slide 71

Photo by Sebastian Pociecha on Unsplash Log-based CDC ✅ Greater data fidelity ✅ Lower latency ✅ Lower impact on source 🛑 More setup steps 🛑 Higher system privileges required 🛑 For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc @rmoff | #acah2020

Slide 72

Slide 72

tl;dr : which tool do I use? • If you’re wanting to do query-based CDC, the choice is simple: confluent.io/hub @rmoff | #acah2020

Slide 73

Slide 73

Which Log-Based CDC Tool? • Open Source RDBMS, e.g. MySQL, PostgreSQL • Debezium • (+ paid options) • Proprietory RDBMS, e.g. Oracle, MS SQL, DB2 • Oracle GoldenGate • Debezium + XStream • Qlik (Attunity) • Mainframe e.g. VSAM, IMS • IBM InfoSphere Data Replication • SQData • Attunity • HVR • SQData • tcVISION • tcVISION • Etc See also: https://rmoff.dev/oracle @rmoff | #acah2020

Slide 74

Slide 74

Real-time Event Stream Enrichment ratings Customer ratings C D C RDBMS <y> customer Stream Processing @rmoff | #acah2020

Slide 75

Slide 75

Demo Time! https://rmoff.dev/cdc-demo @rmoff | #acah2020

Slide 76

Slide 76

on Photo by Want to learn more? CTAs, not CATs (sorry, not sorry) @rmoff | #acah2020

Slide 77

Slide 77

Free Books! https://rmoff.dev/acah20 @rmoff | #acah2020

Slide 78

Slide 78

60 DE VA DV $200 USD off your bill each calendar month for the first three months when you sign up https://rmoff.dev/ccloud Free money! (additional $60 towards your bill 😄 ) Fully Managed Apache Kafka as a Service T&C: https://www.confluent.io/confluent-cloud-promo-disclaimer

Slide 79

Slide 79

Learn Kafka. Start building with Apache Kafka at Confluent Developer. developer.confluent.io

Slide 80

Slide 80

Confluent Community Slack group cnfl.io/slack @rmoff | #acah2020

Slide 81

Slide 81

Further reading / watching https://rmoff.dev/kafka-talks @rmoff | #acah2020

Slide 82

Slide 82

#EOF @rmoff rmoff.dev/talks youtube.com/rmoff