Data Engineering and the Death of Visual ETL

A presentation at UKOUG Techfest 2019 in December 2019 in Brighton, UK by Stewart Bryson

Slide 1

Slide 1

@redpilla @stewartbryson

Slide 2

Slide 2

Founder & CEO Red Pill Analytics twitter: @stewartbryson medium: @stewartbryson linkedin: stewartbryson

Slide 3

Slide 3

Slide 4

Slide 4

Slide 5

Slide 5

Slide 6

Slide 6

Partnerships Data Warehouse Analytics ANALYTICS @redpilla Data-engineering & ETL

Slide 7

Slide 7

https://medium.com/free-code-camp/the-rise-of-the-data-engineer-91be18f1e603 @redpilla We’re hiring. @stewartbryson

Slide 8

Slide 8

@redpilla @stewartbryson “There’s a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it’s that ultimately code is the best abstraction there is for software.” Maxime Beauchemin, The Rise of the Data Engineer, Jan 20, 2017

Slide 9

Slide 9

@redpilla @stewartbryson

Slide 10

Slide 10

What’s so different about ETL? @redpilla @stewartbryson

Slide 11

Slide 11

Code versus Clicks Source control CI/CD Collaboration Code anywhere @redpilla Faster Easier to start Any discipline Business friendly @stewartbryson

Slide 12

Slide 12

Assumption: Code is better, but harder. @redpilla @stewartbryson

Slide 13

Slide 13

@redpilla @stewartbryson

Slide 14

Slide 14

@redpilla @stewartbryson

Slide 15

Slide 15

@redpilla @stewartbryson

Slide 16

Slide 16

CREATE OR REPLACE VIEW stage.promotions_vw AS SELECT promo_id, promo_name, promo_subcategory, promo_begin_date, promo_end_date, time_id status_date, status, weekday_flag FROM ( SELECT time_id, CASE WHEN day_name IN (‘Saturday’,’Sunday’) THEN ‘N’ ELSE ‘Y’ END weekday_flag, promo_id, MAX(promo_name) OVER (partition BY promo_id) promo_name, MAX(promo_subcategory) OVER (partition BY promo_id) promo_subcategory, MAX(promo_begin_date) OVER (partition BY promo_id) promo_begin_date, MAX(promo_end_date) OVER (partition BY promo_id) promo_end_date, nvl(status, ‘ongoing’) status FROM stage.promotions_stg partition BY (promo_id) right outer JOIN sh.times ON time_id=status_date) WHERE time_id BETWEEN promo_begin_date AND promo_end_date @redpilla https://www.rittmanmead.com/blog/2009/04/using-a-partition-by-join-to-fill-in-the-gaps

Slide 17

Slide 17

@redpilla @stewartbryson https://www.rittmanmead.com/blog/2009/04/using-a-partition-by-join-to-fill-in-the-gaps

Slide 18

Slide 18

@redpilla @stewartbryson mapping.create(‘MY_PROJECT’, ‘DEMO_FOLDER’, ‘EMPLOYEE_DIM_LOAD’) .datastores([ [name: “HR.EMPLOYEES”], [name: “HR.DEPARTMENTS”], [name: “HR.JOBS”], [name: “PERF.D_EMPLOYEE”, integration_type: “SCD”], ]) .select(“EMPLOYEES”) .filter(‘NAME_FILTER’, [filter_condition: “EMPLOYEES.FIRST_NAME LIKE ‘D%’” ]) .join(‘EMP_DEPT’, [‘DEPARTMENTS’], [join_condition: “EMPLOYEES.DEPARTMENT_ID = DEPARTMENTS.DEPARTMENT_ID” ]) .join(‘DEPT_JOBS’, [‘JOBS’], [join_condition: “EMPLOYEES.JOB_ID = JOBS.JOB_ID” ]) .connect(“D_EMPLOYEE”, [[ attr: “employee_id”, key_indicator: true ], [ attr: “eff_from_date”, expression: “sysdate”, execute_on_hint: “TARGET”], [ attr: “eff_to_date”, expression: “sysdate”, execute_on_hint: “TARGET”], [ attr: “current_flag”, expression: 1, execute_on_hint: “TARGET”], [ attr: “surr_key”, expression: “:RM_PROJECT.D_EMPLOYEE_SEQ_NEXTVAL”, execute_on_hint: “TARGET”],]) .commit() .validate() https://www.rittmanmead.com/blog/2017/03/accelerating-your-odi-implementation-rittmanmead-style/ https://www.rittmanmead.com/blog/2009/04/using-a-partition-by-join-to-fill-in-the-gaps

Slide 19

Slide 19

@redpilla @stewartbryson https://www.rittmanmead.com/blog/2017/03/accelerating-your-odi-implementation-rittmanmead-style/

Slide 20

Slide 20

@redpilla @stewartbryson

Slide 21

Slide 21

Code >> GUI >> Code @redpilla @stewartbryson

Slide 22

Slide 22

I would rather write this. @redpilla SELECT first_name, last_name, email, phone_number, hire_date, job_id, manager_id, department_id, salary, commission_pct, eff_from_date, eff_to_date, current_flag FROM EMPLOYEES JOIN DEPARTMENTS using (DEPARTMENT_ID) JOIN JOBS using (JOB_ID) @stewartbryson

Slide 23

Slide 23

@redpilla @stewartbryson

Slide 24

Slide 24

@redpilla @stewartbryson

Slide 25

Slide 25

@redpilla @stewartbryson We aren’t starting with a blank slate. @stewartbryson

Slide 26

Slide 26

An embarrassment of riches. @redpilla @stewartbryson

Slide 27

Slide 27

Why Spark? @redpilla @stewartbryson

Slide 28

Slide 28

Why Databricks? @redpilla @stewartbryson

Slide 29

Slide 29

@redpilla @stewartbryson

Slide 30

Slide 30

@redpilla @stewartbryson

Slide 31

Slide 31

Demonstration @redpilla @stewartbryson

Slide 32

Slide 32