Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

A presentation at KScope 19 in June 2019 in Seattle, WA, USA by Robin Moffatt

Slide 1

Slide 1

Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL @rmoff #KScope19

Slide 2

Slide 2

@rmoff #KScope19 Analytics—In the beginning… Sales DWH Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 3

Slide 3

And then there were more data sources… Sales @rmoff #KScope19 DWH Inventory Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 4

Slide 4

Batch Transformations … (ETL / ELT) Sales @rmoff #KScope19 DWH Inventory Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 5

Slide 5

@rmoff #KScope19 Add a Data Lake… Sales Inventory DWH Data Lake Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 6

Slide 6

@rmoff #KScope19 …or Replace the Data Warehouse Sales Inventory Data Lake Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 7

Slide 7

Still need to do Batch transformations… @rmoff #KScope19 Sales Inventory Data Lake Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 8

Slide 8

@rmoff #KScope19 Want your data anytime 🔜 ? Batch is Latency built in by Design Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 9

Slide 9

@rmoff #KScope19 The World has Changed Internet of Things Microservices Mobile Machine Learning Photo by Denys Nevozhai on Unsplash Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 10

Slide 10

Lots of new technologies @rmoff #KScope19 (whether you like it or not) Photo by Rosie Fraser on Unsplash Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 11

Slide 11

@rmoff #KScope19 App App App App cache monitoring cache MQ DWH security MQ search Hadoop Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 12

Slide 12

@rmoff #KScope19 App App App App request-response changelogs App App KAFKA App App DWH Hadoop messaging OR stream processing streaming data pipelines Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 13

Slide 13

“ But streaming…I’ve just got data in a database…right?

Slide 14

Slide 14

“ Bold claim: all your data is event streams

Slide 15

Slide 15

@rmoff #KScope19 A Customer Experience Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 16

Slide 16

@rmoff #KScope19 A Sale Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 17

Slide 17

@rmoff #KScope19 A Sensor Reading Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 18

Slide 18

@rmoff #KScope19 An Application Log Entry Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 19

Slide 19

@rmoff #KScope19 Databases Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 20

Slide 20

Time The Stream/Table Duality Stream Account ID Amount 12345 + €50 12345

  • €25 12345 -€60 @rmoff #KScope19 Account ID Balance Table 12345 €50 Account ID Balance 12345 €75 Account ID Balance 12345 €15 Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 21

Slide 21

Streaming Platform Vision @rmoff #KScope19 Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 22

Slide 22

Streaming Platform Vision @rmoff #KScope19 Event-Driven Scalable Decoupled Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 23

Slide 23

@rmoff #KScope19 Database offload→Hadoop/Object Storage/Cloud DW for Analytics RDBMS CDC HDFS / S3 / BigQuery etc Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 24

Slide 24

Streaming ETL with Apache Kafka and KSQL @rmoff #KScope19 order items CDC RDBMS customer orders customer Stream Processing Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 25

Slide 25

@rmoff #KScope19 Real-time Event Stream Enrichment with Apache Kafka and KSQL order events customer orders C D C RDBMS <y> customer Stream Processing Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 26

Slide 26

Transform Once, Use Many @rmoff #KScope19 order events customer orders C D C RDBMS <y> customer Stream Processing New App <x> Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 27

Slide 27

Transform Once, Use Many @rmoff #KScope19 order events customer orders C D C RDBMS <y> HDFS / S3 / etc customer Stream Processing New App <x> Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 28

Slide 28

Evolve processing from old systems to new Existing App @rmoff #KScope19 New App <x> C D C RDBMS Stream Processing Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 29

Slide 29

Evolve processing from old systems to new Existing App @rmoff #KScope19 New App <x> New App <y> C D C RDBMS Stream Processing Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 30

Slide 30

@rmoff #KScope19 Streaming ETL, powered by Apache Kafka and Confluent Platform KSQL Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 31

Slide 31

@rmoff #KScope19 Let’s Build It! Push notification Rating events Operational Dashboard User data Join events to users, and filter Data Lake Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 32

Slide 32

@rmoff #KScope19 Let’s Build It! Push notification Rating events App App Operational Dashboard Elasticsearch User data Oracle KSQL Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 33

Slide 33

Kafka Connect Rating events App a k f a K t c e n n o C App u s n o C uc e rA PI Kafka Connect a fk t Ka ec n Oracle Pro d I P A r e m Operational Dashboard Elasticsearch n Co User data Push notification to Slack @rmoff #KScope19 Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 34

Slide 34

Streaming Integration with Kafka Connect @rmoff #KScope19 Amazon S3 syslog Google BigQuery Tasks Workers Kafka Connect Kafka Brokers Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 35

Slide 35

@rmoff #KScope19 Kafka Connect Reliable and scalable integration of Kafka with other systems – no coding required. { “connector.class”: “io.confluent.connect.jdbc.JdbcSourceConnector”, “connection.url”: “jdbc:mysql://localhost:3306/demo?user=rmoff&password=foo”, “table.whitelist”: “sales,orders,customers” } https://docs.confluent.io/current/connect/ Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 36

Slide 36

@rmoff #KScope19 Serialisation & Schemas Avro -> Confluent Schema Registry Protobuf JSON CSV https://qconnewyork.com/system/files/presentation-slides/qcon_17_-_schemas_and_apis.pdf Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 37

Slide 37

@rmoff #KScope19 The Confluent Schema Registry Avro Schema Schema Registry Target Source Kafka Connect Avro Message Avro Message Kafka Connect Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 38

Slide 38

@rmoff #KScope19 Confluent Hub hub.confluent.io Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 39

Slide 39

Change-Data-Capture (CDC) @rmoff #KScope19 • CDC is a generic term referring to capturing changing data typically from a RDBMS. • Two general approaches: • Query-based CDC • Log-based CDC There are other options including hacks with Triggers, Flashback etc but these are system and/or technology-specific. Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 40

Slide 40

Query-based CDC @rmoff #KScope19 • Use a database query to try and identify new & changed rows SELECT * FROM my_table WHERE col > <value of col last time we polled> • Implemented with the open source Kafka Connect JDBC connector • Can import based on table names, schema, or bespoke SQL query • Incremental ingest driven through incrementing ID column and/or timestamp column Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 41

Slide 41

Log-based CDC @rmoff #KScope19 • Use the database’s transaction log to identify every single change event • Various CDC tools available that integrate with Apache Kafka (more of this later…) Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 42

Slide 42

Query-based vs Log-based CDC Photo by Matese Fields on Unsplash @rmoff #KScope19 • Query-based +Usually easier to setup, and requires fewer permissions - Needs specific columns in source schema - Impact of polling the DB (or higher latencies tradeoff) - Can’t track deletes, or multiple events between polling interval Read more: http://cnfl.io/kafka-cdc Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 43

Slide 43

@rmoff #KScope19 Query-based vs Log-based CDC • Log-based +Greater data fidelity +Lower latency +Lower impact on source - More setup steps - Higher system privileges required - For propriatory databases, usually $$$ Read more: http://cnfl.io/kafka-cdc Photo by Sebastian Pociecha on Unsplash Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 44

Slide 44

@rmoff #KScope19 Oracle and Kafka integration •Oracle GoldenGate for Big Data—Requires the OGGBD licence, not just OGG •Debezium—Open source, Oracle support in Beta • currently uses XStream— which requires OGG licence •Attunity, IBM IIDR, HVR, SQData, tcVision, StreamSets—all offer commerical CDC integration into Kafka with support for Schema Registry • DBVisit Replicate—no longer under development •JDBC Connector—Open source, but not “true” CDC https://rmoff.net/2018/12/12/streaming-data-from-oracle-into-kafka-december-2018/ Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 45

Slide 45

@rmoff #KScope19 Streaming ETL, powered by Apache Kafka and Confluent Platform KSQL Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 46

Slide 46

Stream Processing with KSQL Push notification to Slack Rating events @rmoff #KScope19 App App Operational Dashboard Elasticsearch User data Oracle KSQL Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 47

Slide 47

@rmoff #KScope19 KSQL is the Streaming SQL Engine for Apache Kafka Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 48

Slide 48

@rmoff #KScope19 Filter messages with KSQL completedOrders orders → → → → → → → → → → → 01, £10.00, 05, £10.00, 06, £24.00, 02, £12.33, 04, £5.50, → COMPLETE COMPLETE COMPLETE PENDING COMPLETE CREATE STREAM completedOrders AS SELECT * FROM orders WHERE status=’COMPLETE’; → → → → → → → → → → → 01, £10.00, 06, £24.00, 02, £12.33, 04, £5.50, → COMPLETE COMPLETE COMPLETE COMPLETE Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 49

Slide 49

@rmoff #KScope19 Drop columns with KSQL customer → → → → → → → → → → →→ {“id”:1, {“id”:2, {“id”:3, “name”:”Dana Lidgerton”, “name”:”Milo Wellsman”, “name”:”Dolph Cleeton”, “card”:”5048370182840140} “card”:”3557977885537506} “card”:”3586303633007251} CREATE STREAM customerNoCC AS SELECT ID, NAME customerNoCC FROM customer; → → → → → → → → → → →→ {“id”:1, {“id”:2, {“id”:3, “name”:”Dana Lidgerton”} “name”:”Milo Wellsman”} “name”:”Dolph Cleeton”} Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 50

Slide 50

@rmoff #KScope19 Stateful aggregation with KSQL customersByCountry customer → → → → → → → → → → →→ {“id”:1, {“id”:2, {“id”:3, “name”:”Dana Lidgerton”, “name”:”Milo Wellsman”, “name”:”Dolph Cleeton”, “country”:”UK”} “country”:”UK”} “country”:”Germany”} CREATE STREAM customersByCountry AS SELECT country, COUNT(*) AS customerCount FROM customer WINDOW TUMBLING (SIZE 1 HOUR) GROUP BY country; → → → → → → → → → → →→ {“country”:”UK”, {“country”:”Germany”, “customerCount”:2} “customerCount”:1} Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 51

Slide 51

@rmoff #KScope19 KSQL for Anomaly Detection Identifying patterns or anomalies in real-time data, surfaced in milliseconds CREATE TABLE possible_fraud AS SELECT card_number, count() FROM authorization_attempts WINDOW TUMBLING (SIZE 5 SECONDS) GROUP BY card_number HAVING count() > 3; Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 52

Slide 52

@rmoff #KScope19 KSQL for Data Transformation Make simple derivations of existing topics from the command line CREATE STREAM pageviews WITH (PARTITIONS=4, VALUE_FORMAT=’AVRO’) AS SELECT * FROM pageviews_json; Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 53

Slide 53

@rmoff #KScope19 KSQL for Streaming ETL Joining, filtering, and aggregating streams of event data CREATE STREAM vip_actions AS SELECT userid, page, action FROM clickstream c LEFT JOIN users u ON c.userid = u.user_id WHERE u.level = ‘Platinum’; Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 54

Slide 54

KSQL in Development and Production Interactive KSQL for development and testing @rmoff #KScope19 Headless KSQL for Production REST Desired KSQL queries have been identified “Hmm, let me try out this idea…” Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 55

Slide 55

@rmoff #KScope19 Streaming ETL, powered by Apache Kafka and Confluent Platform KSQL Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 56

Slide 56

Demo Time!

Slide 57

Slide 57

@rmoff #KScope19 Push notification Rating events App App Kafka Connect a fk t Ka ec n Oracle Elasticsearch n Co User data a k f a K t c e n n o C Operational Dashboard KSQL Join events to users, and filter Data Lake SnowflakeDB/ S3/HDFS/etc Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 58

Slide 58

@rmoff #KScope19 http://cnfl.io/book-bundle Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 59

Slide 59

@rmoff #KScope19 CONFLUENT COMMUNITY DISCOUNT CODE KS19Meetup. 25% OFF* *Standard Priced Conference pass Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL

Slide 60

Slide 60

@rmoff #KScope19 https://www.confluent.io/download/ http://cnfl.io/kafka-cdc http://cnfl.io/slack @rmoff robin@confluent.io No More Silos: Integrating Databases and Apache Kafka

Slide 61

Slide 61

@rmoff #KScope19 #EOF Streaming ETL in Practice with Oracle, Apache Kafka, and KSQL