@rmoff #KScope19 Integrating Databases and Apache Kafka ®
A presentation at KScope 19 in June 2019 in Seattle, WA, USA by Robin Moffatt
@rmoff #KScope19 Integrating Databases and Apache Kafka ®
@rmoff #KScope19 Photo by Emily Morter on Unsplash No More Silos: Integrating Databases and Apache Kafka
Analytics - Database Offload RDBMS @rmoff #KScope19 HDFS / S3 / BigQuery etc No More Silos: Integrating Databases and Apache Kafka
Real-time Event Stream Enrichment @rmoff #KScope19 order events customer orders C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Evolve processing from old systems to new Existing App New App <x> RDBMS Stream Processing No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 “ But streaming…I’ve just got data in a database…right? @rmoff / No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 “ Bold claim: all your data is event streams @rmoff / No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 A Customer Experience No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 A Sale No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 A Sensor Reading No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 An Application Log Entry No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Databases No More Silos: Integrating Databases and Apache Kafka
The Stream Table Duality @rmoff #KScope19 Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka
Time The Stream Table Duality Account ID Amount 12345 + €50 @rmoff #KScope19 Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka
Time The Stream Table Duality Account ID Amount 12345 + €50 12345
Time The Stream Table Duality @rmoff #KScope19 Account ID Amount 12345 + €50 12345
Time The Stream Table Duality Stream @rmoff #KScope19 Table Account ID Amount 12345 + €50 12345
@rmoff #KScope19 The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf No More Silos: Integrating Databases and Apache Kafka Photo by Bobby Burch on Unsplash
@rmoff #KScope19 Photo by Vadim Sherbakov on Unsplash No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Streaming Integration with Kafka Connect syslog Sources Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Streaming Integration with Kafka Connect Amazon S3 Sinks Google BigQuery Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Streaming Integration with Kafka Connect Amazon S3 syslog Google BigQuery Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka
Kafka Connect basics Source Kafka Connect @rmoff #KScope19 Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Connectors Connector Source Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka
Easy to configure @rmoff #KScope19 { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “jdbc:mysql://asgard:3306/demo”, “table.whitelist”: “sales,orders,customers” } https://docs.confluent.io/current/connect/ “connection.url”: No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Converters Connector Source Converter Kafka Connect https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Serialisation & Schemas Avro -> Confluent Schema Registry Protobuf JSON CSV https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf No More Silos: Integrating Databases and Apache Kafka
The Confluent Schema Registry Avro Schema @rmoff #KScope19 Schema Registry Target Source Kafka Connect Avro Message Avro Message Kafka Connect No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Single Message Transforms Connector Source Transform(s) Kafka Connect Converter Kafka No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Extensible Connector Transform(s) Converter hub.confluent.io No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Change-Data-Capture (CDC) Query-based Log-based No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #KScope19 SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #KScope19 SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #KScope19 SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Log-based CDC #051024 17:24:13 server id 1 # Position Timestamp
end_log_pos 98 Type Master ID 0f 01 00 00 00 Size 5e 00 00 00 Master Pos Flags 62 00 00 00 00 00
2d 64 65 62 75 67 2d 6c |..5.0.15.debug.l|
00 00 00 00 00 00 00 00 |og…………..|
00 00 00 00 00 00 00 00 |…………….|
13 38 0d 00 08 00 12 00 |…….C.8……|
00 04 1a |…….K…| No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka
Demo Time! @rmoff #KScope19 No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Try it yourself: https://github.com/confluentinc/demo-scene/tree/master/no-more-silos No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 should I use?” Photo by Tyler Nix on Unsplash “Which one No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 It Depends! No More Silos: Integrating and on Apache Kafka Photo by Databases Trevor Cole Unsplash
Query-based CDC @rmoff @rmoff #kafkasummit #KScope19 ✅ Usually easier to setup ✅ Requires fewer permissions 🛑 Needs specific columns in source schema to track changes 🛑 Can’t track deletes 🛑 Can’t track multiple events between polling interval Read more: http://cnfl.io/kafka-cdc Photo by Matese Fields on Unsplash 🛑 Impact of polling the DB (or higher latencies tradeoff) No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #KScope19 SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #KScope19 SELECT * CREATE TABLE my_table ( ID INT, FOO VARCHAR, BAR VARCHAR, WIBBLE VARCHAR, TS_COL TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP ) FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC INSERT @rmoff #KScope19 SELECT * FROM WHERE my_table ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC INSERT SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #KScope19 UPDATE No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC UPDATE SELECT * FROM my_table WHERE ts_col > previous ts No More Silos: Integrating Databases and Apache Kafka
Query-based CDC @rmoff #KScope19 DELETE No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC DELETE SELECT * FROM my_table WHERE ts_col > previous ts e p o N No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC orderID status address updateTS SELECT * FROM WHERE orders updateTS > previous ts No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka
10:54:00 PENDING INSERT INTO orders (orderID, status) VALUES (42, ‘PENDING’); No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC orderID status address 42 1640 Riverside Drive 10:54:20 PENDING updateTS UPDATE orders SET address = ‘1640 Riverside Drive’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:25 PENDING UPDATE orders SET address = ‘29 Acacia Road’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED UPDATE orders SET status = ‘SHIPPED’ WHERE orderID = 42; No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC orderID status orderID address status 42 SHIPPED 42 address 29 PENDING Acacia Road — updateTS updateTS 10:54:29 10:54:00 42 PENDING 1640 Riverside Drive 10:54:20 42 PENDING 29 Acacia Road 10:54:25 42 SHIPPED 29 Acacia Road 10:54:29 No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Query-based CDC orderID status address updateTS 42 29 Acacia Road 10:54:29 SHIPPED SELECT * FROM WHERE orders updateTS > previous ts { } “orderID”: 42, “status”: “SHIPPED”, “address”: “29 Acacia Road”, “updateTS”: “10:54:29” No More Silos: Integrating Databases and Apache Kafka
@rmoff #kafkasummit Event-driven app Query-based CDC Log-based CDC
Log-based CDC @rmoff #KScope19 Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 UPDATE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
Log-based CDC @rmoff #KScope19 DELETE Transaction log No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Log-based CDC Immutable event log No More Silos: Integrating Databases and Apache Kafka
Photo by Sebastian Pociecha on Unsplash @rmoff #KScope19 Log-based CDC ✅ Greater data fidelity ✅ Lower latency ✅ Lower impact on source 🛑 More setup steps 🛑 Higher system privileges required 🛑 For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Change-Data-Capture (CDC) Query-based Log-based No More Silos: Integrating Databases and Apache Kafka
tl;dr : which tool do I use? @rmoff #KScope19 • Query-based CDC confluent.io/connector/kafka-connect-jdbc No More Silos: Integrating Databases and Apache Kafka
Which Log-Based CDC Tool? • Open Source RDBMS, e.g. MySQL, PostgreSQL • Debezium • (+ paid options) @rmoff #KScope19 • Proprietory RDBMS, e.g. Oracle, MS SQL, DB2 • Oracle GoldenGate • Debezium + XStream • Attunity • Mainframe e.g. VSAM, IMS • IBM InfoSphere Data Replication • SQData • Attunity • HVR • SQData • tcVISION • tcVISION • Etc See also: https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka-december-2018/ No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Oracle and Kafka integration •Oracle GoldenGate for Big Data—Requires the OGGBD licence, not just OGG •Debezium—Open source, Oracle support in Beta • currently uses XStream— which requires OGG licence •Attunity, IBM IIDR, HVR, SQData, tcVision, StreamSets—all offer commerical CDC integration into Kafka with support for Schema Registry • DBVisit Replicate—no longer under development •JDBC Connector—Open source, but not “true” CDC https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka-december-2018/ No More Silos: Integrating Databases and Apache Kafka
https://twitter.com/gunnarmorling/status/1143534042832392192 Debezium and Oracle @rmoff #KScope19 No More Silos: Integrating Databases and Apache Kafka
Real-time Event Stream Enrichment @rmoff #KScope19 ratings Customer ratings C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka
Demo Time! @rmoff #KScope19 No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 Try it yourself: https://github.com/confluentinc/demo-scene/tree/master/no-more-silos No More Silos: Integrating Databases and Apache Kafka
Confluent Community Components @rmoff #KScope19 Apache Kafka with a bunch of cool stuff! For free! Log Events Database Changes loT Data Web Events … Confluent Platform Data Integration Real-time Applications Monitoring & Administration Confluent Control Center | Security Confluent Platform Transformations Hadoop Operations Replicator | Auto Data Balancing Custom Apps Database Data Compatibility Schema Registry SQL Stream Processing KSQL Analytics Data Warehouse Development and Connectivity Clients | Connectors | REST Proxy | CLI CRM Monitoring Apache Kafka® Core | Connect API | Streams API … CUSTOMER SELF-MANAGED Datacenter Public Cloud … CONFLUENT FULLY-MANAGED Confluent Cloud No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 http://cnfl.io/book-bundle No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 CONFLUENT COMMUNITY DISCOUNT CODE KS19Meetup. 25% OFF* *Standard Priced Conference pass No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 https://www.confluent.io/download/ http://cnfl.io/kafka-cdc http://cnfl.io/slack @rmoff robin@confluent.io No More Silos: Integrating Databases and Apache Kafka
@rmoff #KScope19 #EOF No More Silos: Integrating Databases and Apache Kafka