Become a Data Scientist with Oracle Analytics Cloud

A presentation at Oracle OpenWorld 2019 in September 2019 in San Francisco, CA, USA by Francesco Tisiot

Slide 1

Become a Data Scientist Francesco Tisiot Analytics Tech Lead

Slide 2

Francesco Tisiot Analytics Tech Lead Verona, Italy http://ritt.md/ftisiot Over 10 Years in Analytics ft@rittmanmead.com @FTisiot Oracle ACE Director ITOUG Board President

Slide 3

info@rittmanmead.com www.rittmanmead.com @rittmanmead Data Engineering Analytics Data Science

Slide 4

Agenda
• OAC
• Data Scientist
• Become a Data Scientist

Slide 5

Oracle Analytics Cloud
• Platform Services (PaaS)
• Delivered entirely in the cloud:
  • No infrastructure footprint
  • Flexibility
  • Simplified, metered licensing
• Several options to suit your needs:
  • BYOL
• Functionality bundled into 2 editions
  • Professional
  • Enterprise

Slide 6

Functions OAC supports Every type of analytics Classic Modern

Slide 7

Classic Enterprise BI
• Similar to OBIEE 12c
• Centrally maintained & governed
• Semantic model
• Interactive Dashboards
• KPI measurement & monitoring
• Guided navigation paths
• BI Publisher
  • Highly formatted, burst outputs
• Action Framework
  • Navigation actions
  • Scheduled agents

Slide 8

Modern Data Discovery
• Data Preparation
  • Acquire data
  • Clean/Enrich
  • Transform
  • Repeatable Flows
• Data Visualisation
  • Create visual insights rapidly
  • Construct narrated storyboards
  • Share findings

Slide 9

Unified Analytics Free Discovery Centralised Reporting Unique Source of Truth Specific Access Control Raw Data To Insights Data Enrichment and Cleaning https://speakerdeck.com/ftisiot/become-an-equilibrista-find-the-right-balance-in-the-analytics-tech-ecosystem

Slide 10

Augmented Analytics Data Enrichment Suggestions Natural Language Processing Explain One-Click Advanced Analytics Advanced Machine Learning

Slide 11

Data Scientist

Slide 12

Data Scientist Is a person who has the knowledge and skills to conduct sophisticated and systematic analyses of data. A data scientist extracts insights from data sets, and evaluates and identifies strategic opportunities. https://bigdata-madesimple.com/what-is-a-data-scientist-14-definitions-of-a-data-scientist/

Slide 13

Data Scientist Is a Data Analyst who lives in California! https://bigdata-madesimple.com/what-is-a-data-scientist-14-definitions-of-a-data-scientist/

Slide 14

Data Scientist Skills

Slide 15

Brendan Tierney Oracle ACE Director https://www.oralytics.com/2012/06/data-science-is-multidisciplinary.html

Slide 16

Data Scientist …Company Missing a Data Scientist

Slide 17

Low Hanging Fruit Theory Democratise Data Science

Slide 18

Basic Operations Based on my Experience I can Guess…. What are the Drivers for My Sales? Statistically Significant Drivers for Sales Are … Augmented Analytics

Slide 19

Basic Operations YES/NO Is this Client going to accept the Offer? 50% Basic ML Model 70%

Slide 20

Become a Data Scientist with OAC

Slide 21

Before Starting…. Define the Problem!

Slide 22

Problem Definition: Predicting Wine Quality

Slide 23

TEP
Task: Classify Good/Bad Wine
Experience: Corpus of Wine Descriptions with Rating
Performance: Accuracy

Slide 24

Become a Data Scientist with OAC Connect

Slide 25

Connection Options in OAC Pre-Defined Data Models External Data Sources

Slide 26

Select Relevant Columns and Apply Filters

Slide 27

Become a Data Scientist with OAC Connect Clean

Slide 28

What Everybody Thinks a Data Scientist Does What He Really Does

Slide 29

https://www.infoworld.com/article/3228245/data-science/the-80-20-data-science-dilemma.html

Slide 30

Cleaning What?
• Missing Values: N/A
• Wrong Values: Mark <> MArk
• Irrelevant Observations: City “Rome”
• Labelling Columns: Col1 -> Name
• Handling Outliers: Role: CIO, Salary: 500 K$
• Feature Scaling: 0-200k -> 0-1
• Aggregation: # of Clicks
• Train/Test Set Split: Train: 80%, Test: 20%

Slide 31

Cleaning How? Data Flows - Filter - Aggregate - Join

Slide 32

Cleaning What?
• Missing Values: N/A -> CASE … WHEN… / Automated
• Wrong Values: Mark <> MArk -> UPPER
• Irrelevant Observations: City “Rome” -> FILTER
• Labelling Columns: Col1 -> Name -> COLUMN RENAME
• Handling Outliers: Role: CIO, Salary: 500 K$ -> FILTER?
• Feature Scaling: 0-200k -> 0-1 -> KPI/(MAX-MIN) / Automated
• Aggregation: # of Clicks -> COUNT
• Train/Test Set Split: Train: 80%, Test: 20% -> FILTER / Automated
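Outside OAC, several of the cleaning steps listed on this slide can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows, column names and helper functions below are invented for the example, and min-max scaling is used as one common way to map a range such as 0-200k onto 0-1.

```python
import random

# Hypothetical raw rows; names and values are invented for illustration.
rows = [
    {"name": "Mark", "city": "Rome", "salary": 30_000},
    {"name": "MArk", "city": "Rome", "salary": 50_000},
    {"name": "Anna", "city": "Milan", "salary": None},
    {"name": "Luca", "city": "Rome", "salary": 200_000},
]

def clean(rows, city="Rome"):
    """Drop missing values and irrelevant observations, fix wrong values."""
    cleaned = []
    for row in rows:
        if row["salary"] is None:   # missing value
            continue
        if row["city"] != city:     # irrelevant observation
            continue
        # Wrong values: "Mark" and "MArk" become the same "MARK".
        cleaned.append({**row, "name": row["name"].upper()})
    return cleaned

def min_max_scale(values):
    """Rescale values to the 0-1 range: (x - MIN) / (MAX - MIN)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def train_test_split(items, train_share=0.8, seed=42):
    """Shuffle a copy of items and split it into train and test parts."""
    shuffled = items[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_share)
    return shuffled[:cut], shuffled[cut:]

cleaned = clean(rows)
scaled = min_max_scale([r["salary"] for r in cleaned])
train, test = train_test_split(list(range(10)))
```

In OAC the same steps are data flow operations rather than code; the sketch only makes the logic behind each step explicit.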

Slide 33

Why Remove an Outlier?
Years Experience    Salary
1                   30.000
2                   32.000
3                   35.000
4                   35.500
5                   36.000
6                   40.000
7                   50.000
8                   70.000
9                   90.000
10                  500.000
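A quick way to see why the last row matters is to compare summary statistics and a fitted trend with and without it. The sketch below uses the table's figures, assuming the "." is a thousands separator (so 30.000 means 30000); the `slope` helper is a hypothetical name for an ordinary least-squares slope.

```python
import statistics

years = list(range(1, 11))
salaries = [30_000, 32_000, 35_000, 35_500, 36_000,
            40_000, 50_000, 70_000, 90_000, 500_000]

def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

mean_all = statistics.mean(salaries)           # 91850
mean_trimmed = statistics.mean(salaries[:-1])  # 46500
slope_all = slope(years, salaries)
slope_trimmed = slope(years[:-1], salaries[:-1])
```

A single observation nearly doubles the average salary and steepens the fitted salary-per-year trend, which is why a model trained on the full table would generalise poorly to typical employees.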

Slide 34

How To Find Outliers? One Dimension
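For a single column, one common rule of thumb is the interquartile-range (IQR) fence that underlies a box plot. The slide itself does not say which method is used, so treat the following as an assumed illustration; `iqr_outliers` and the multiplier `k=1.5` are choices made for this sketch, applied to the salary figures from the previous slide.

```python
import statistics

salaries = [30_000, 32_000, 35_000, 35_500, 36_000,
            40_000, 50_000, 70_000, 90_000, 500_000]

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

outliers = iqr_outliers(salaries)  # [500000]
```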

Slide 35

How To Find Outliers? Two Dimensions

Slide 36

Become a Data Scientist with OAC Connect Clean Transform & Enrich

Slide 37

Feature Engineering
• Location -> ZIP Code
• Name -> Sex
• 2 Locations -> Distance
• Day/Month/Year -> Date
• Additional Data Sources?
• Data Flow
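Two of these derived features are easy to illustrate in code: turning two locations into a distance and turning separate day, month and year columns into a date. The sketch below is an assumption-laden example, not what OAC does internally: it uses the haversine great-circle formula with a mean Earth radius of 6371 km, and the function names are hypothetical.

```python
import math
from datetime import date

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def to_date(day, month, year):
    """Combine separate day, month and year columns into one date."""
    return date(year, month, day)

quarter_circle = haversine_km(0.0, 0.0, 0.0, 90.0)
example_date = to_date(16, 9, 2019)  # hypothetical input values
```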

Slide 38

Data Preparation Recommendations

Slide 39

Spatial Enrichment Oracle Spatial Studio https://www.rittmanmead.com/blog/2019/07/oracle-spatial-studio/

Slide 40

Become a Data Scientist with OAC Connect Analyse Clean Transform & Enrich

Slide 41

Data Overview

Slide 42

Explain

Slide 43

Explain - Key Drivers

Slide 44

Become a Data Scientist with OAC Connect Clean Analyse Train & Evaluate Transform & Enrich

Slide 45

What Problem are we Trying to Solve?
• Supervised: “I want to predict the value of Y, here are some examples” (Regression, Classification)
• Unsupervised: “Here is a dataset, make sense out of it!” (Clustering)
https://towardsdatascience.com/supervised-vs-unsupervised-learning-14f68e32ea8d

Slide 46

Easy Models

Slide 47

NLP

Slide 48

DataFlow Train Model

Slide 49

Which Model - Parameters?

Slide 50

Select, Try, Save, Change, Try, Save …..
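The select, try, save loop can be pictured as a small parameter search. The sketch below is a toy stand-in, not an OAC model: the data, the single `threshold` parameter and the `accuracy` helper are all invented for illustration, and a real evaluation would score each candidate on a held-out test set rather than on the data used to choose it.

```python
# Hypothetical examples: (feature value, true label).
data = [(0.1, 0), (0.3, 0), (0.4, 1), (0.6, 0), (0.7, 1), (0.9, 1)]

def accuracy(threshold, examples):
    """Share of examples where 'x >= threshold' matches the true label."""
    hits = sum((x >= threshold) == bool(y) for x, y in examples)
    return hits / len(examples)

results = {}
for threshold in (0.2, 0.35, 0.5, 0.65, 0.8):       # select a parameter value
    results[threshold] = accuracy(threshold, data)  # try it, save the score

# Keep the best-scoring setting (the first one wins on ties).
best = max(results, key=results.get)
```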

Slide 51

Compare - Classification: Real Value (Good/Bad) vs Predicted Value (Good/Bad)

Slide 52

There is No Single Truth… Precision: 866/(471+866) = 64.77%; 896/(502+896) = 64.09%
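The two percentages on this slide are consistent with reading 866 and 896 as correctly predicted cases and 471 and 502 as incorrectly predicted ones. That mapping of the counts to confusion-matrix cells is an assumption, but under it the arithmetic can be checked with the usual precision formula, TP / (TP + FP):

```python
def precision(true_positives, false_positives):
    """Precision = TP / (TP + FP)."""
    return true_positives / (true_positives + false_positives)

# Counts taken from the slide; their role as TP/FP is assumed.
first = precision(866, 471)   # about 0.6477
second = precision(896, 502)  # about 0.6409

print(f"{first:.2%} vs {second:.2%}")
```

The two results differ by less than one percentage point, which fits the slide's message that no single metric or model gives a definitive answer.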

Slide 53

Compare - Regression

Slide 54

Become a Data Scientist with OAC Connect Clean Transform & Enrich Analyse Train & Evaluate Predict

Slide 55

Use On the Fly

Slide 56

Step of a Data Flow

Slide 57

Congratulations! …You are now a Data Scientist!

Slide 58

Nearly There

Slide 59

Required Knowledge 50% 60% 80% 90% 95% 97%

Slide 60

…But Data Cleaning Feature Engineering Model Creation & Evaluation Feature Selection 80% > 50%

Slide 61

ML Production Deployment Data Scientist ML -> Data Oracle Advanced Analytics

Slide 62

Become a Data Scientist with OAC http://ritt.md/OAC-datascience

Slide 63

ML in Action with OAC http://ritt.md/OAC-ML-Video

Slide 64

Insights Lab https://www.rittmanmead.com/insight-lab/

Slide 65

OAC Data Science

Slide 66

Become a Data Scientist Francesco Tisiot Analytics Tech Lead