No More Silos: Integrating Databases and Apache Kafka

A presentation at UKOUG 2018 in December 2018 in Liverpool, UK by Robin Moffatt

Slide 1

Slide 1

@rmoff #ukoug_tech18 Integrating Databases and Apache Kafka Robin Moffatt, Developer Advocate @ Confluent

Slide 2

Slide 2

@rmoff #ukoug_tech18 Photo by Emily Morter on Unsplash No More Silos: Integrating Databases and Apache Kafka

Slide 3

Slide 3

Kafka is a Streaming Platform App App App App @rmoff #ukoug_tech18 request-response changelogs App App KAFKA App App DWH Hadoop messaging OR stream processing streaming data pipelines No More Silos: Integrating Databases and Apache Kafka

Slide 4

Slide 4

Analytics - Database Offload RDBMS @rmoff #ukoug_tech18 HDFS / S3 / BigQuery etc No More Silos: Integrating Databases and Apache Kafka

Slide 5

Slide 5

@rmoff #ukoug_tech18 Real-time Event Stream Enrichment order events customer orders C D C RDBMS <y> customer Stream Processing No More Silos: Integrating Databases and Apache Kafka

Slide 6

Slide 6

@rmoff #ukoug_tech18 Evolve processing from old systems to new Existing App New App <x> RDBMS Stream Processing No More Silos: Integrating Databases and Apache Kafka

Slide 7

Slide 7

“ But streaming…I’ve just got data in a database…right? @rmoff / No More Silos: Integrating Databases and Apache Kafka

Slide 8

Slide 8

“ Bold claim: all your data is event streams @rmoff / No More Silos: Integrating Databases and Apache Kafka

Slide 9

Slide 9

@rmoff #ukoug_tech18 A Customer Experience No More Silos: Integrating Databases and Apache Kafka

Slide 10

Slide 10

@rmoff #ukoug_tech18 A Sale No More Silos: Integrating Databases and Apache Kafka

Slide 11

Slide 11

@rmoff #ukoug_tech18 A Sensor Reading No More Silos: Integrating Databases and Apache Kafka

Slide 12

Slide 12

@rmoff #ukoug_tech18 An Application Log Entry No More Silos: Integrating Databases and Apache Kafka

Slide 13

Slide 13

@rmoff #ukoug_tech18 Databases No More Silos: Integrating Databases and Apache Kafka

Slide 14

Slide 14

@rmoff #ukoug_tech18 Do you think that’s a table you are querying? No More Silos: Integrating Databases and Apache Kafka

Slide 15

Slide 15

The Stream Table Duality @rmoff #ukoug_tech18 Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka

Slide 16

Slide 16

Time The Stream Table Duality Account ID Amount 12345 + €50 @rmoff #ukoug_tech18 Account ID Balance 12345 €50 No More Silos: Integrating Databases and Apache Kafka

Slide 17

Slide 17

Time The Stream Table Duality Account ID Amount 12345 + €50 12345

  • €25 @rmoff #ukoug_tech18 Account ID Balance 12345 €75 No More Silos: Integrating Databases and Apache Kafka

Slide 18

Slide 18

Time The Stream Table Duality @rmoff #ukoug_tech18 Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 Account ID Balance 12345 €15 No More Silos: Integrating Databases and Apache Kafka

Slide 19

Slide 19

@rmoff #ukoug_tech18 Time The Stream Table Duality Stream Table Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 Account ID Balance 12345 €15 No More Silos: Integrating Databases and Apache Kafka

Slide 20

Slide 20

@rmoff #ukoug_tech18 The truth is the log. The database is a cache of a subset of the log. —Pat Helland Immutability Changes Everything http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf No More Silos: Integrating Databases and Apache Kafka Photo by Bobby Burch on Unsplash

Slide 21

Slide 21

@rmoff #ukoug_tech18 KSQL is the Streaming SQL Engine for Apache Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 22

Slide 22

@rmoff #ukoug_tech18 KSQL for Real-Time Monitoring • Log data monitoring, tracking and alerting • syslog data • Sensor / IoT data CREATE STREAM SYSLOG_INVALID_USERS AS SELECT HOST, MESSAGE FROM SYSLOG WHERE MESSAGE LIKE ‘%Invalid user%’; http://cnfl.io/syslogs-filtering / http://cnfl.io/syslog-alerting No More Silos: Integrating Databases and Apache Kafka

Slide 23

Slide 23

KSQL for Streaming ETL @rmoff #ukoug_tech18 Joining, filtering, and aggregating streams of event data CREATE STREAM vip_actions AS SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = ‘Platinum’; No More Silos: Integrating Databases and Apache Kafka

Slide 24

Slide 24

@rmoff #ukoug_tech18 Streaming ETL with Apache Kafka and KSQL Producer API Oracle t c e n n o C a k f Ka m u i z e b e D Kafka Connect Elasticsearch No More Silos: Integrating Databases and Apache Kafka

Slide 25

Slide 25

@rmoff #ukoug_tech18 KSQL for Anomaly Detection Identifying patterns or anomalies in real-time data, surfaced in milliseconds CREATE TABLE possible_fraud AS SELECT card_number, count() FROM authorization_attempts WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY card_number HAVING count() > 3; No More Silos: Integrating Databases and Apache Kafka

Slide 26

Slide 26

@rmoff #ukoug_tech18 Photo by Vadim Sherbakov on Unsplash No More Silos: Integrating Databases and Apache Kafka

Slide 27

Slide 27

@rmoff #ukoug_tech18 Streaming Integration with Kafka Connect syslog flat file CSV JSON Sources MQTT Tasks Workers Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka

Slide 28

Slide 28

@rmoff #ukoug_tech18 Streaming Integration with Kafka Connect Amazon S3 Sinks MQTT Tasks Workers Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka

Slide 29

Slide 29

@rmoff #ukoug_tech18 Streaming Integration with Kafka Connect Amazon S3 syslog flat file CSV JSON Sources Sinks MQTT MQTT Tasks Workers Kafka Connect Kafka Brokers No More Silos: Integrating Databases and Apache Kafka

Slide 30

Slide 30

Kafka Connect basics Source Kafka Connect @rmoff #ukoug_tech18 Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 31

Slide 31

@rmoff #ukoug_tech18 Connectors Connector Source Kafka Connect Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 32

Slide 32

@rmoff #ukoug_tech18 Converters Connector Source Converter Kafka Connect https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 33

Slide 33

Single Message Transforms Connector Source Transform(s) Kafka Connect @rmoff #ukoug_tech18 Converter Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 34

Slide 34

Extensible Connector Source @rmoff #ukoug_tech18 Browse & download at hub.confluent.io Transform(s) Kafka Connect https://docs.confluent.io/current/connect/javadocs/ Converter Kafka No More Silos: Integrating Databases and Apache Kafka

Slide 35

Slide 35

@rmoff #ukoug_tech18 Kafka Connect + Schema Registry = WIN Avro Schema Schema Registry Elasticsearch RDBMS Kafka Connect Avro Message Kafka Connect No More Silos: Integrating Databases and Apache Kafka

Slide 36

Slide 36

@rmoff #ukoug_tech18 Kafka Connect + Schema Registry = WIN Avro Schema Schema Registry Elasticsearch RDBMS Kafka Connect Avro Message Kafka Connect No More Silos: Integrating Databases and Apache Kafka

Slide 37

Slide 37

@rmoff #ukoug_tech18 Change-Data-Capture (CDC) • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC • Log-based CDC There are other options including hacks with Triggers, Flashback etc but these are system and/or technology-specific. No More Silos: Integrating Databases and Apache Kafka

Slide 38

Slide 38

Query-based CDC @rmoff #ukoug_tech18 • Use a database query to try and identify new & changed rows SELECT * FROM my_table WHERE col > <value of col last time we polled> • Implemented with the open source Kafka Connect JDBC connector • Can import based on table names, schema, or bespoke SQL query • Incremental ingest driven through incrementing ID column and/or timestamp column No More Silos: Integrating Databases and Apache Kafka

Slide 39

Slide 39

Log-based CDC @rmoff #ukoug_tech18 • Use the database’s transaction log to identify every single change event • Various CDC tools available that integrate with Apache Kafka (more of this later…) No More Silos: Integrating Databases and Apache Kafka

Slide 40

Slide 40

Demo Time! @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

Slide 41

Slide 41

“Which one should I use?” Photo by Tyler Nix on Unsplash @rmoff #ukoug_tech18 No More Silos: Integrating Databases and Apache Kafka

Slide 42

Slide 42

@rmoff #ukoug_tech18 It Depends! No More Silos: Integrating and on Apache Kafka Photo by Databases Trevor Cole Unsplash

Slide 43

Slide 43

Photo by Matese Fields on Unsplash @rmoff #ukoug_tech18 Query-based vs Log-based CDC • Query-based +Usually easier to setup, and requires fewer permissions - Needs specific columns in source schema - Impact of polling the DB (or higher latencies tradeoff) - Can’t track deletes, or multiple events between polling interval Read more: http://cnfl.io/kafka-cdc No More Silos: Integrating Databases and Apache Kafka

Slide 44

Slide 44

Photo by Sebastian Pociecha on Unsplash @rmoff #ukoug_tech18 Query-based vs Log-based CDC • Log-based +Greater data fidelity +Lower latency +Lower impact on source - More setup steps - Higher system privileges required - For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc No More Silos: Integrating Databases and Apache Kafka

Slide 45

Slide 45

@rmoff #ukoug_tech18 Considerations for Integration into Apache Kafka • Chucking data over the fence into a Kafka topic is not enough • CDC tools should integrate with standard ways of building data pipelines in Kafka • Schema handling • Serialisation formats Photo by Matthew Smith on Unsplash No More Silos: Integrating Databases and Apache Kafka

Slide 46

Slide 46

Oracle and Kafka integration •Oracle GoldenGate for Big Data—Requires the OGGBD licence, not just OGG •Debezium—Open source,Oracle support in Beta • currently uses XStream— which requires OGG licence @rmoff #ukoug_tech18 •Attunity, IBM IIDR, HVR, SQData, StreamSets—all offer commerical CDC integration into Kafka with support for Schema Registry •DBVisit Replicate—no longer under development •JDBC Connector—Open source, but not “true” CDC No More Silos: Integrating Databases and Apache Kafka

Slide 47

Slide 47

Which Log-Based CDC Tool? • Open Source RDBMS, e.g. MySQL, PostgreSQL • Debezium • (+ paid options) @rmoff #ukoug_tech18 ⓘ For query-based CDC, use the Confluent Kafka Connect JDBC connector • Proprietory RDBMS, e.g. Oracle, MS SQL • Oracle GoldenGate • Debezium + XStream • Attunity • Mainframe e.g. VSAM, IMS • IBM InfoSphere Data Replication (IIDR) • Attunity • SQData • SQData • HVR All these options integrate with Apache Kafka and Confluent Platform, including support for the Schema Registry No More Silos: Integrating Databases and Apache Kafka

Slide 48

Slide 48

Confluent Open Source : Apache Kafka with a bunch of cool stuff! For free! @rmoff #ukoug_tech18 Log Events Database Changes loT Data Web Events … Confluent Platform Data Integration Real-time Applications Monitoring & Administration Confluent Control Center | Security Confluent Platform Transformations Hadoop Operations Replicator | Auto Data Balancing Custom Apps Database Data Compatibility Schema Registry SQL Stream Processing KSQL Analytics Data Warehouse Development and Connectivity Clients | Connectors | REST Proxy | CLI CRM Monitoring Apache Kafka® Core | Connect API | Streams API … CUSTOMER SELF-MANAGED Datacenter Public Cloud … CONFLUENT FULLY-MANAGED Confluent Cloud No More Silos: Integrating Databases and Apache Kafka

Slide 49

Slide 49

@rmoff #ukoug_tech18 Free Books! https://www.confluent.io/apache-kafka-stream-processing-book-bundle No More Silos: Integrating Databases and Apache Kafka

Slide 50

Slide 50

@rmoff #ukoug_tech18 https://www.confluent.io/download/ http://cnfl.io/kafka-cdc http://cnfl.io/slack @rmoff robin@confluent.io No More Silos: Integrating Databases and Apache Kafka

Slide 51

Slide 51

@rmoff #ukoug_tech18 #EOF No More Silos: Integrating Databases and Apache Kafka