Data Engineering and the Death of Visual ETL

A presentation at Atlanta Snowflake, Big Data and Spark Joint Meetup in January 2020 in Atlanta, GA, USA by Stewart Bryson

Slide 1

Slide 1

@redpilla @stewartbryson

Slide 2

Slide 2

Founder & CEO Red Pill Analytics twitter: @stewartbryson medium: @stewartbryson linkedin: stewartbryson

Slide 3

Slide 3

Slide 4

Slide 4

Slide 5

Slide 5

Slide 6

Slide 6

Technologies We Use Data Warehouse Analytics ANALYTICS @redpilla Data-engineering & ETL

Slide 7

Slide 7

@redpilla We’re hiring. @stewartbryson

Slide 8

Slide 8

https://medium.com/free-code-camp/the-rise-of-the-data-engineer-91be18f1e603 @redpilla We’re hiring. @stewartbryson

Slide 9

Slide 9

@redpilla @stewartbryson “There’s a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it’s that ultimately code is the best abstraction there is for software.” Maxime Beauchemin, The Rise of the Data Engineer, Jan 20, 2017

Slide 10

Slide 10

@redpilla @stewartbryson

Slide 11

Slide 11

What’s so different about ETL? @redpilla @stewartbryson

Slide 12

Slide 12

Code versus Clicks Source control CI/CD Collaboration Code anywhere @redpilla Faster Easier to start Any discipline Business friendly @stewartbryson

Slide 13

Slide 13

Assumption: Code is better, but harder. @redpilla @stewartbryson

Slide 14

Slide 14

Data Warehouses SQL >> GUI (ETL) >> GUI (ELT) Modern Data Platforms SQL >> GUI (ETL) >> GUI (ELT) >> Distributed >> SQL @redpilla @stewartbryson

Slide 15

Slide 15

@redpilla @stewartbryson

Slide 16

Slide 16

@redpilla @stewartbryson

Slide 17

Slide 17

@redpilla @stewartbryson

Slide 18

Slide 18

CREATE OR REPLACE VIEW stage.promotions_vw AS SELECT promo_id, promo_name, promo_subcategory, promo_begin_date, promo_end_date, time_id status_date, status, weekday_flag FROM ( SELECT time_id, CASE WHEN day_name IN (‘Saturday’,’Sunday’) THEN ‘N’ ELSE ‘Y’ END weekday_flag, promo_id, MAX(promo_name) OVER (partition BY promo_id) promo_name, MAX(promo_subcategory) OVER (partition BY promo_id) promo_subcategory, MAX(promo_begin_date) OVER (partition BY promo_id) promo_begin_date, MAX(promo_end_date) OVER (partition BY promo_id) promo_end_date, nvl(status, ‘ongoing’) status FROM stage.promotions_stg partition BY (promo_id) right outer JOIN sh.times ON time_id=status_date) WHERE time_id BETWEEN promo_begin_date AND promo_end_date @redpilla https://www.rittmanmead.com/blog/2009/04/using-a-partition-by-join-to-fill-in-the-gaps

Slide 19

Slide 19

@redpilla @stewartbryson https://www.rittmanmead.com/blog/2009/04/using-a-partition-by-join-to-fill-in-the-gaps

Slide 20

Slide 20

@redpilla @stewartbryson

Slide 21

Slide 21

@redpilla @stewartbryson

Slide 22

Slide 22

Code >> GUI >> Code @redpilla @stewartbryson

Slide 23

Slide 23

I would rather write this. @redpilla SELECT first_name, last_name, email, phone_number, hire_date, job_id, manager_id, department_id, salary, commission_pct, eff_from_date, eff_to_date, current_flag FROM EMPLOYEES JOIN DEPARTMENTS using (DEPARTMENT_ID) JOIN JOBS using (JOB_ID) @stewartbryson

Slide 24

Slide 24

@redpilla @stewartbryson We aren’t starting with a blank slate. @stewartbryson

Slide 25

Slide 25

An embarrassment of riches. @redpilla @stewartbryson

Slide 26

Slide 26

Shouldn’t my pipelines just live in my data warehouse? @redpilla @stewartbryson

Slide 27

Slide 27

Data Build Tool (dbt) @redpilla @stewartbryson

Slide 28

Slide 28

Am I feeding other downstream systems? Data warehouses typically aren’t very good at that. @redpilla @stewartbryson

Slide 29

Slide 29

TABLE STREAMS & TASKS Web IoT External Stage (S3, Azure Blobs, GCS) Auto-Ingest Table Streams & Tasks Mobile Enterprise Apps © 2019 Snowflake Inc. All Rights Reserved Continuous Bulk Load (Snowpipe) OR (COPY) Storage Integration Staging Transformations Table involving any DML Target Table 1 Target Table 2 Unload External Stage (S3, Azure Blobs, GCS)

Slide 30

Slide 30

TABLE STREAMS & TASKS Table Streams & Tasks Target Table 1 Staging Table © 2019 Snowflake Inc. All Rights Reserved Transformations involving any DML Target Table 2 Unload External Stage (S3, Azure Blobs, GCS)

Slide 31

Slide 31

What are some other use cases we might want to consider? @redpilla @stewartbryson

Slide 32

Slide 32

Realtime Events © 2016 RED PILL Analytics

Slide 33

Slide 33

© 2016 RED PILL Analytics Mobile Analytics

Slide 34

Slide 34

Search © 2016 RED PILL Analytics

Slide 35

Slide 35

© 2016 RED PILL Analytics Machine Learning

Slide 36

Slide 36

Why Spark? @redpilla @stewartbryson

Slide 37

Slide 37

Why Databricks? @redpilla @stewartbryson

Slide 38

Slide 38

@redpilla @stewartbryson

Slide 39

Slide 39

@redpilla @stewartbryson

Slide 40

Slide 40

Demonstration @redpilla @stewartbryson

Slide 41

Slide 41