Mastering Data Governance & Security In Modern Cloud Data Platforms using Databricks And Snowflake

A presentation at DataTune 2024 in March 2024 in Nashville, TN, USA by Philip Goldman

Slide 1

Mastering Data Governance & Security In Modern Cloud Data Platforms using Databricks And Snowflake

Slide 2

The Cambridge Analytica Scandal

Cambridge Analytica and Facebook: The Scandal and the Fallout So Far

Slide 3

Marriott Data Breach

Marriott’s Data Breach One of the Biggest in History

Slide 4

Equifax Data Breach

Case Study: Equifax Data Breach - Seven Pillars Institute

Slide 5

Facebook Data Leak

Facebook Data Breach: How to Tell If Your Account Was Exposed | Fortune

Slide 6

Why Data Governance & Data Security Matter

Personas: Data analyst, Data engineer, Data scientist
Systems: Data Lake, Metadata, Data Warehouse, ML models and dashboards
Pain points:
• No row- and column-level permissions
• Inflexible when policies change
• Can be out of sync with data
• Different governance model
• Yet another governance model

Slide 7

Philip Goldman, Sr. Data Solution Architect, EPAM

Slide 8

Navigating Today’s Discussion

  1. Introduction: Introduction to Data Governance & Security.
  2. Principles: Exploring Data Security & Governance Principles.
  3. Best Practices: Best Practices for Security and Data Governance in Databricks and Snowflake.
  4. Conclusion: Empowering Your Data Journey.

Slide 9

Choosing Databricks & Snowflake

Some of the benefits and reasons businesses choose Databricks or Snowflake

Slide 10

Magic Quadrant for Cloud Database Management Systems

Slide 11

Principles of Security and Data Governance

Slide 12

Principles of data governance

Data Governance
• Unify data management

Slide 13

Principles of data governance

Data Governance
• Unify data management
• Unify Data Security

Slide 14

Principles of data governance

Data Governance
• Unify data management
• Unify Data Security
• Manage Data Quality

Slide 15

Principles of data governance

Data Governance
• Unify data management
• Unify Data Security
• Manage Data Quality
• Data Sharing

Slide 16

Principles of security, compliance, and privacy

Data Governance
• Unify data management
• Unify Data Security
• Manage Data Quality
• Data Sharing

Security, compliance, and privacy
• Identity & Privileges

Slide 17

Principles of security, compliance, and privacy

Data Governance
• Unify data management
• Unify Data Security
• Manage Data Quality
• Data Sharing

Security, compliance, and privacy
• Identity & Privileges
• Data security

Slide 18

Principles of security, compliance, and privacy

Data Governance
• Unify data management
• Unify Data Security
• Manage Data Quality
• Data Sharing

Security, compliance, and privacy
• Identity & Privileges
• Data security
• Network security

Slide 19

Principles of security, compliance, and privacy

Data Governance
• Unify data management
• Unify Data Security
• Manage Data Quality
• Data Sharing

Security, compliance, and privacy
• Identity & Privileges
• Data security
• Network security
• Compliance & Privacy

Slide 20

Principles of security, compliance, and privacy

Data Governance
• Unify data management
• Unify Data Security
• Manage Data Quality
• Data Sharing

Security, compliance, and privacy
• Identity & Privileges
• Data security
• Network security
• Compliance & Privacy
• Security Monitoring

Slide 21

How do we implement these principles in real-world cloud platforms like Databricks and Snowflake?

Bring Principles into Practice

Slide 22

Security & Data Governance Best Practices — In Snowflake

Slide 23

SNOWFLAKE SECURITY AT A GLANCE

Access
• All communication secured & encrypted
• TLS 1.2 encryption in both trusted and untrusted networks

Snowflake Operational Controls
• NIST 800-53
• SOC2 Type 2
• HIPAA
• PCI
• FedRAMP

Authentication
• Password Policy enforcement
• IP Whitelisting
• Multifactor Authentication
• Private Link
• SAML 2.0 support for Federated Authentication

Authorization
• Flexible user management
• Role-based access control for granular control
• RBAC for data and actions

Data
• Encrypted at rest
• Hierarchical key model rooted in Cloud HSM
• Automatic key rotation
• Time Travel 1-90 days
• Tri-Secret Secure
• Query statement encryption

Infrastructure
• AWS, Azure Physical Security
• AWS, Azure Redundancy
• Regional Data Centers: US, EU, AP

Slide 24

AWS/Azure/GCP Private Link

Slide 25

Snowflake Network Policies

Network policies can be applied at three levels:

  1. Snowflake Account
  2. Outside Integration
  3. User Specific

The most specific policy always wins.
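The "most specific policy wins" rule can be sketched in plain Python. This is an illustration of the precedence logic only, not Snowflake's API: the `NetworkPolicy` class, the policy names, and the CIDR ranges are hypothetical, and the sketch assumes the order user, then integration, then account, with a blocked range overriding an allowed one.

```python
from dataclasses import dataclass, field
from ipaddress import ip_address, ip_network


@dataclass
class NetworkPolicy:
    """Hypothetical stand-in for a network policy: allowed and blocked CIDR ranges."""
    name: str
    allowed: list = field(default_factory=list)
    blocked: list = field(default_factory=list)

    def permits(self, ip: str) -> bool:
        addr = ip_address(ip)
        # Assumption: a blocked range overrides an allowed range.
        if any(addr in ip_network(cidr) for cidr in self.blocked):
            return False
        return any(addr in ip_network(cidr) for cidr in self.allowed)


def effective_policy(account=None, integration=None, user=None):
    """Return the most specific policy that is set: user, then integration, then account."""
    for policy in (user, integration, account):
        if policy is not None:
            return policy
    return None


account = NetworkPolicy("corp_wide", allowed=["10.0.0.0/8"])
user = NetworkPolicy("svc_etl_only", allowed=["10.20.30.0/24"], blocked=["10.20.30.99/32"])

chosen = effective_policy(account=account, user=user)
print(chosen.name)                    # svc_etl_only
print(chosen.permits("10.20.30.5"))   # True
print(chosen.permits("10.1.1.1"))     # False: the account policy is not consulted
```

The last line shows the practical consequence: once a user-level policy exists, an address that the account-level policy would allow is still rejected, because only the most specific policy is evaluated.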

Slide 26

Snowflake Network rule

• ACCOUNT LEVEL: Network Policy 1
• INT LEVEL: Network Policy 2
• USER LEVEL: Network Policy 3

Network rules: … VPCE ID … IPv4 … Azure Link ID … Internal Storage

Slide 27

Snowflake Authentication

[Diagram: authentication paths into Snowflake]
• Human launches: interactive sessions and attended server sessions (Server / Desktop, SaaS)
• Service/Process/Machine-to-Machine launches: unattended sessions (Server / Desktop, SaaS)
• Methods shown: native username and password, SAML 2.0 through an IdP with MFA (including external browser), Snowflake OAuth, External OAuth (OAuth auth server, Okta, MFA, trust relationship), key pair
• Controls shown: Session Policies, Authentication Policies

Slide 28

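User Provisioning with SCIM

[Diagram: a SCIM provider synchronizes identities to Snowflake over SCIM]
• New and changed entries in the SCIM provider are created and changed in Snowflake
• Groups map to roles: Group 1 to Role 1, Group 2 to Role 2, a new group to a new role
• A new group member is assigned the matching role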

Slide 29

RBAC: Role-based access control
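As a rough illustration of role-based access control, the sketch below grants privileges to roles, grants roles to other roles and to users, and resolves a user's effective privileges by walking the role hierarchy. It is plain Python rather than Snowflake SQL, and every role, user, and object name in it is hypothetical.

```python
# Privileges granted directly to each role (hypothetical names).
ROLE_PRIVILEGES = {
    "ANALYST": {("SELECT", "SALES.ORDERS")},
    "ENGINEER": {("INSERT", "SALES.ORDERS")},
    "DATA_ADMIN": set(),
}
# Parent role -> roles granted to it; a parent inherits the privileges of those roles.
ROLE_GRANTS = {"DATA_ADMIN": {"ANALYST", "ENGINEER"}, "ENGINEER": {"ANALYST"}}
# Roles granted to each user.
USER_ROLES = {"taylor": {"ANALYST"}, "morgan": {"DATA_ADMIN"}}


def effective_privileges(role, seen=None):
    """Collect a role's own privileges plus everything inherited from roles granted to it."""
    seen = set() if seen is None else seen
    if role in seen:  # guard against accidental cycles in the hierarchy
        return set()
    seen.add(role)
    privileges = set(ROLE_PRIVILEGES.get(role, set()))
    for child in ROLE_GRANTS.get(role, set()):
        privileges |= effective_privileges(child, seen)
    return privileges


def can(user, action, obj):
    """True when any of the user's roles carries the privilege, directly or by inheritance."""
    return any((action, obj) in effective_privileges(role) for role in USER_ROLES.get(user, set()))


print(can("taylor", "SELECT", "SALES.ORDERS"))  # True
print(can("taylor", "INSERT", "SALES.ORDERS"))  # False
print(can("morgan", "INSERT", "SALES.ORDERS"))  # True, inherited through DATA_ADMIN
```

Users never receive privileges directly in this model, so changing what a team can do means changing one role rather than every member.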

Slide 30

Tri-Secret Secure

Encrypt using customer managed key (CMK)
● Bring your own key (BYOK)
● Revoke at any time → no access to data
● Keys are rotated every 30 days
● Customers set up re-keying

[Key hierarchy diagram: Customer Key, SFMK, HSM, AMKs, TMKs, files (F)]

Slide 31

Unified Governance

Know Your Data
• What: Classification
• Where: Object Tagging, Object Dependencies
• Who: Access History, Account Usage

Protect Your Data
• Row Access Policies
• Dynamic Data Masking
• External Tokenization
• Tag-Based Policies
• Conditional Masking
• Anonymization (PrPr)

Connect Your Ecosystem
• Direct Secure Sharing
• Data Cleanrooms
• Data Marketplace
• Pre-built Partner Integrations to Manage Entire Data Estate

Slide 32

Object Tagging

Track sensitive data and compute objects
• Track sensitive and PII data
• Track resource usage for cost visibility and attribution
• Flexible privilege management options

Objects: ACCOUNT, DATABASE, SCHEMA, WAREHOUSE, TABLE, VIEW, STAGE

Slide 33

Data Classification

Classify sensitive personal data
• Analyze table columns for sensitive personal information
• Get recommended tags using built-in machine learning
• Apply tags to track and audit sensitive data

Example table, tagged NAME, GENDER, AGE, PHONE_NUMBER:

fname            gen   age   phone
Jane Doe         F     50    333-555-1236
Michael Gaines   M     75    666-666-1357
Ann Marshall     F     48    555-555-1234
John Smith       M     39    123-555-1234

Slide 34

Row Access Policies

Row-level Security
• Filter unauthorized rows at query time
• Use mapping tables for authorization
• Apply one policy to many tables

ROLE: EU_NA
Customer   Spend        Region
ACME       $820,000     North America
Koko       $2,100,000   North America

ROLE: EU_RL
Customer   Spend        Region
AGM        $5,757,000   Europe
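The idea of filtering rows at query time through a mapping table can be sketched as follows. This is plain Python standing in for the policy logic, not Snowflake syntax. The rows and role names come from the slide, while the contents of the mapping table and the function names are assumptions made for illustration.

```python
# Mapping table: which role may see which region (assumed contents).
ROLE_REGION_MAP = {
    "EU_NA": {"North America"},
    "EU_RL": {"Europe"},
}

SALES = [
    {"customer": "ACME", "spend": 820_000, "region": "North America"},
    {"customer": "Koko", "spend": 2_100_000, "region": "North America"},
    {"customer": "AGM", "spend": 5_757_000, "region": "Europe"},
]


def row_is_visible(role: str, region: str) -> bool:
    """Policy predicate: true when the mapping table authorizes the role for the region."""
    return region in ROLE_REGION_MAP.get(role, set())


def query(rows, role):
    """Apply the predicate to every row at query time; unauthorized rows are dropped."""
    return [row for row in rows if row_is_visible(role, row["region"])]


print([r["customer"] for r in query(SALES, "EU_NA")])  # ['ACME', 'Koko']
print([r["customer"] for r in query(SALES, "EU_RL")])  # ['AGM']
print(query(SALES, "UNKNOWN_ROLE"))                    # []
```

Because the predicate only takes a role and a region, the same function could be attached to any table that has a region column, which is the "one policy to many tables" point on the slide.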

Slide 35

Access History (Reads)

Satisfy regulatory compliance, understand usage with column-level access visibility

Queries:
• Taylor (Privileged User): Select * from Info;
• Morgan (Admin): Select * from Sensitive;

View: INFO (Directly Accessed)
ID    Phone          Unique_ID
101   248-222-3333   333-78-9999
102   800-778-9904   779-66-8908
103   415-887-8888   111-00-8888

Table: SENSITIVE
ID    SSN
101   333-78-9999
102   779-66-8908
103   111-00-8888

Table: CONTACT
ID    Mobile
101   248-222-3333
102   800-778-9904
103   415-887-8888

Access History Log
User     Tables                     Columns
Taylor   Info, Sensitive, Contact   ID, Phone, Unique_ID, SSN, Mobile
Morgan   Sensitive                  ID, SSN

Slide 36

Access History (Writes)

Know the lineage of data

Objects: stage_S1, table_T1, table_T2, table_T3, stage_S2
Operations shown: copy into, create table as select, insert into select from

PATH                                       TARGET_NAME   TARGET_DOMAIN   TARGET_COLUMNS
stage_S1->table_T1                         table_T1      TABLE           ["CONTENT"]
stage_S1->table_T1->table_T2               table_T2      TABLE           ["ID","NAME"]
stage_S1->table_T1->table_T3               table_T3      TABLE           ["NAME","ID"]
stage_S1->table_T1->table_T3->stage_S2     stage_S2      STAGE           []
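The PATH column on this slide can be derived from individual write events. The sketch below is a minimal illustration in plain Python: the source-to-target pairs are read off the slide, and the function name and traversal are assumptions, not how Snowflake computes lineage internally.

```python
from collections import defaultdict

# Write events read off the slide: (source object, target object).
WRITES = [
    ("stage_S1", "table_T1"),
    ("table_T1", "table_T2"),
    ("table_T1", "table_T3"),
    ("table_T3", "stage_S2"),
]


def lineage_paths(writes, root):
    """Enumerate every path that starts at `root`, one per reachable target, depth first."""
    children = defaultdict(list)
    for source, target in writes:
        children[source].append(target)

    paths = []

    def walk(node, trail):
        for child in children[node]:
            if child in trail:  # guard against cycles
                continue
            path = trail + [child]
            paths.append("->".join(path))
            walk(child, path)

    walk(root, [root])
    return paths


for path in lineage_paths(WRITES, "stage_S1"):
    print(path)
# stage_S1->table_T1
# stage_S1->table_T1->table_T2
# stage_S1->table_T1->table_T3
# stage_S1->table_T1->table_T3->stage_S2
```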

Slide 37

Object Dependencies

Identify dependencies and downstream impact

Objects: external_stage_S1, external_table_ET1, materialized_view_MV1
Operations shown: create external table with location, create table as select

REFERENCED_OBJECT_NAME   REFERENCED_OBJECT_DOMAIN   REFERENCING_OBJECT_NAME   REFERENCING_OBJECT_DOMAIN
external_stage_S1        STAGE                      external_table_ET1        EXTERNAL TABLE
external_table_ET1       EXTERNAL_TABLE             materialized_view_MV1     MATERIALIZED_VIEW

Slide 38

Dynamic Data Masking

Column-level Security
• Dynamically mask data at query time
• Centralized policy management
• Apply one policy to many columns

Authorized Access: Restricted Data
ID    Phone          SSN
101   ***-***-5534
102   ***-***-3564
103   ***-***-9787

Authorized Access: Unrestricted Data
ID    Phone          SSN
101   408-123-5534   111-22-3333
102   510-335-3564   222-33-4444
103   214-553-9787   333-44-5555
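A masking policy of this kind boils down to a function of the column value and the caller's role, evaluated when the data is read. The sketch below is plain Python, not Snowflake's policy syntax; the role name `PII_FULL_ACCESS` and the choice to keep the last four digits of both columns are assumptions for illustration.

```python
import re

UNRESTRICTED_ROLES = {"PII_FULL_ACCESS"}  # hypothetical role name


def mask_keep_last4(value: str, role: str) -> str:
    """Masking policy: unrestricted roles see the value, everyone else sees only the last four digits."""
    if role in UNRESTRICTED_ROLES:
        return value
    # Replace each digit that still has at least four digits after it.
    return re.sub(r"\d(?=(?:\D*\d){4})", "*", value)


# One policy attached to many columns (column names follow the slide).
COLUMN_POLICIES = {"phone": mask_keep_last4, "ssn": mask_keep_last4}


def read_row(row: dict, role: str) -> dict:
    """Apply column policies at query time; the stored row is left untouched."""
    return {
        col: COLUMN_POLICIES[col](val, role) if col in COLUMN_POLICIES else val
        for col, val in row.items()
    }


row = {"id": 101, "phone": "408-123-5534", "ssn": "111-22-3333"}
print(read_row(row, "ANALYST"))          # {'id': 101, 'phone': '***-***-5534', 'ssn': '***-**-3333'}
print(read_row(row, "PII_FULL_ACCESS"))  # {'id': 101, 'phone': '408-123-5534', 'ssn': '111-22-3333'}
```

Keeping the policies in one dictionary mirrors the centralized management point: changing `mask_keep_last4` changes what every attached column returns.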

Slide 39

External Tokenization

Dynamically de-tokenize data for authorized users

Externally tokenize protected data at ingest
● Using tokenization provider agents on ETL tools
● Ingest tokenized data

Dynamically de-tokenize at query time
● Call third-party service using external functions to de-tokenize data
● For unauthorized users, third-party service is not called

Policy Based Control
● Table/View owners and privileged users unauthorized by default
● Centralized policy management

[Diagram: Customer VPC / VNet, POLICIES, EXTERNAL FUNCTION, REST API, Tokenization Provider]

Alex (Restricted Access), tokenized:
ID    Phone          SSN
101   111-222-3333   000-78-9999
102   002-778-9904   779-66-8908
103   100-887-8888   111-00-8888

Morgan (Unrestricted Access), de-tokenized:
ID    Phone          SSN
101   408-123-5534   387-78-3456
102   510-334-3564   226-44-8908
103   214-553-9787   359-9987-0098
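The flow described here (tokenize before ingest, de-tokenize at query time, and skip the third-party call for unauthorized users) can be sketched with an in-memory stand-in for the provider. Everything named below is hypothetical: `TokenVault` plays the tokenization provider, `read_column` plays the policy plus external function, and the tokens are random strings rather than the format-preserving values shown on the slide.

```python
import secrets


class TokenVault:
    """Stand-in for a third-party tokenization provider (hypothetical, in-memory)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}
        self.detokenize_calls = 0

    def tokenize(self, value: str) -> str:
        """Return a stable random token for the value; used before data is ingested."""
        if value not in self._value_to_token:
            token = "tok_" + secrets.token_hex(8)
            self._value_to_token[value] = token
            self._token_to_value[token] = value
        return self._value_to_token[value]

    def detokenize(self, token: str) -> str:
        """Look up the original value and count the call."""
        self.detokenize_calls += 1
        return self._token_to_value[token]


AUTHORIZED_ROLES = {"UNRESTRICTED"}  # hypothetical role name


def read_column(token: str, role: str, vault: TokenVault) -> str:
    """Policy check happens first; the provider is called only for authorized roles."""
    if role in AUTHORIZED_ROLES:
        return vault.detokenize(token)
    return token


vault = TokenVault()
stored = vault.tokenize("408-123-5534")              # what lands in the table at ingest

print(read_column(stored, "RESTRICTED", vault) == stored)  # True: the token is returned as stored
print(vault.detokenize_calls)                               # 0: the provider was never called
print(read_column(stored, "UNRESTRICTED", vault))           # 408-123-5534
print(vault.detokenize_calls)                               # 1
```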

Slide 40

Benefits of Snowflake Collaboration

Across Cloud & Region with Snowgrid
Delivers direct access to live, ready-to-query data across clouds and regions with auto-fulfillment and no ETL.

More than Data
Snowflake enables customers to collaborate with data, data services and applications, including built-in usage-based monetization.

Robust Data Governance
Achieve privacy-preserving collaboration with targeted discovery, revocable access and custom event logging.

Snowflake Regions: AWS, Azure, GCP

Slide 41

Benefits of Snowflake Collaboration

Across Cloud & Region with Snowgrid
Delivers direct access to live, ready-to-query data across clouds and regions with auto-fulfillment and no ETL.

More than Data
Snowflake enables customers to collaborate with data, data services and applications, including built-in usage-based monetization.
• Data: Enable direct access to live, ready-to-query data without ETL
• Data Services: Deliver insights without exposing underlying data
• Applications: Discover, build and distribute apps that run natively within your Snowflake account

Robust Data Governance
Achieve privacy-preserving collaboration with targeted discovery, revocable access and custom event logging.

Slide 42

Benefits of Snowflake Collaboration

Across Cloud & Region with Snowgrid
Delivers direct access to live, ready-to-query data across clouds and regions with auto-fulfillment and no ETL.

More than Data
Snowflake enables customers to collaborate with data, data services and applications, including built-in usage-based monetization.

Robust Data Governance
Achieve privacy-preserving collaboration with targeted discovery, revocable access and custom event logging.

Discovery
● Single Account
● Account Group
● Cloud Region(s)
● Public Marketplace

Access
● Row Access Policies
● Dynamic Data Masking
● Conditional Masking
● Query Constraints*

Audit
● Listing Views & Events
● Jobs Run by Consumer
● Object & Columns Accessed
● Custom Event Logging

Slide 43

SHARE YOUR LAKE!

  1. DATA MARKETPLACE: Share with Snowflake Customers via Data Exchange (List, Request)
  2. Reader Account: Share with Non-Snowflake Customers

[Diagram: DATA LAKE, External Tables, Materialized View, SHARE, Data Consumer (View, Query, Join), BI and Query Tools]

Slide 44

Security & Data Governance Best Practices — In Databricks

Slide 45

Security Overview

[Diagram: Databricks account]
• Databricks control plane: web application, compute orchestration, Unity Catalog, queries and code
• Databricks serverless workloads: compute on the cloud provider’s networks (default private link enabled)
• Customer side: compute and customer storages
• Users & applications connect over the cloud provider’s networks with encryption in transit; stored data is encrypted at rest

Network security: 6+ layers of authentication & network controls
Process security: 4 layers of computing isolation
Access / data security: 5+ options to secure your data

Setup items listed on the slide:
• SSO built in for Azure / GCP
• Set up SSO for AWS
• Set up data access policies via Unity Catalog
• Built in through Unity Catalog
• Set up IP Access Lists / Front-End Private Link
• Built in through Serverless
• Set up allow list for Storage
• Set up Private Link for Storage
• Enable Unity Catalog for Audit Logs
• Set up CMK services
• Set up Audit Logs for Storage
• Enable Storage Audit Logs

Slide 46

What is Databricks Unity Catalog?

Databricks Unity Catalog
• Capabilities: Discovery, Access Controls, Monitoring, Lineage, Auditing, Data Sharing
• Assets: Tables, Files, Models, Notebooks, Dashboards

Slide 47

Unity Catalog - Key Capabilities

Unified governance for all data assets
• Centralized metadata and user management
• Centralized access controls
• Data lineage
• Data access auditing
• Data search and discovery
• Secure data sharing with Delta Sharing

Slide 48

Map, secure and audit data across clouds

• Catalog all your data, analytics and AI assets and create a unified view of your entire data estate
• Centrally manage access permissions and audit controls for files and tables across all workspaces and workloads, using a familiar interface based on ANSI SQL
• Enable fine-grained access controls on rows and columns

Slide 49

Open data sharing and collaboration

Slide 50

Data Discovery

Slide 51

Automated lineage for all workloads

End-to-end visibility into how data flows and is consumed in your organization

Slide 52

Unity Catalog - Architecture

Slide 53

Life of a query with Unity Catalog

With Azure Managed Identities

Admin:
● Creates an IAM role (AWS) / Managed Identity (Azure) / Service Account (GCP)
● Creates storage credentials / external locations in Unity Catalog
● Defines access policies in Unity Catalog

Query flow:
  1. User sends query (SQL, Python, R, Scala) to the cluster or SQL warehouse
  2. Unity Catalog checks namespace, metadata and grants, and writes the audit log
  3. Unity Catalog assumes the IAM Role / Managed Identity / Service Account
  4. Unity Catalog returns the list of paths/data files and scoped-down temporary tokens
  5. Cluster or SQL warehouse requests/ingests data from paths/data files with temporary tokens
  6. Cloud Storage (S3, ADLS) returns data
  7. Cluster or SQL warehouse enforces policies
  8. Result is sent to the user

Slide 54

Centralized Metadata and User Management

Create a unified view of your data estate

Without Unity Catalog
• Databricks Workspace 1: User Management, Metastore, Access Controls, Clusters, SQL Warehouses
• Databricks Workspace 2: User Management, Metastore, Access Controls, Clusters, SQL Warehouses

With Unity Catalog
• Unity Catalog: User Management, Metastore, Foreign Databases, Access Controls
• Databricks Workspace: Clusters, SQL Warehouses
• Databricks Workspace: Clusters, SQL Warehouses

Slide 55

Databricks Account and the Cloud Provider Hierarchy

Create a unified view of your data estate
• Azure: https://accounts.azuredatabricks.net/
• AWS: https://accounts.cloud.databricks.com/
• Global Admin used initially for security

Slide 56

Dynamic View

• Limit access to columns: omit column values from output
• Limit access to rows: omit rows from output
• Data masking: obscure data
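The three behaviours of a dynamic view can be shown with one small function that decides, per caller, which rows and values come back. In Databricks this logic would normally live in the view's SQL and test group membership with a function such as `is_account_group_member()`; the sketch below is plain Python, and the group names, columns, and rules are all hypothetical.

```python
ROWS = [
    {"user_id": 1, "email": "a@example.com", "country": "US", "total": 120.0},
    {"user_id": 2, "email": "b@example.com", "country": "DE", "total": 75.5},
]


def dynamic_view(rows, groups):
    """Return what a caller belonging to `groups` would see (group names are hypothetical)."""
    visible = []
    for row in rows:
        # Limit access to rows: callers outside "admins" only see US rows.
        if "admins" not in groups and row["country"] != "US":
            continue
        out = dict(row)
        # Data masking: obscure the local part of the email for callers outside "auditors".
        if "auditors" not in groups:
            out["email"] = "REDACTED@" + row["email"].split("@", 1)[1]
        # Limit access to columns: omit the value for callers outside "finance".
        if "finance" not in groups:
            out["total"] = None
        visible.append(out)
    return visible


print(dynamic_view(ROWS, set()))
# [{'user_id': 1, 'email': 'REDACTED@example.com', 'country': 'US', 'total': None}]
print(dynamic_view(ROWS, {"admins", "auditors", "finance"}) == ROWS)  # True
```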

Slide 57

Centralized Access Controls

Centrally grant and manage access permissions across workloads

Using ANSI SQL DCL:

    GRANT <privilege> ON <securable_type> <securable_name> TO <principal>
    GRANT SELECT ON iot.events TO engineers

• SELECT: choose permission level
• iot.events: ‘Table’ = collection of files in S3/ADLS
• engineers: sync groups from your identity provider

Using UI
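When grants are managed from scripts, the GRANT template on this slide is usually filled in programmatically. The helper below is a minimal sketch under stated assumptions: the function names are hypothetical, the privilege set is an illustrative subset, and the backtick-quoting rule for names with special characters is an assumption to verify against the Databricks SQL documentation.

```python
import re

_IDENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def quote(name: str) -> str:
    """Backtick-quote a name unless it is a plain identifier (quoting rule is an assumption)."""
    return name if _IDENT.match(name) else "`" + name.replace("`", "``") + "`"


def grant_statement(privilege, securable_type, securable_name, principal):
    """Fill in GRANT <privilege> ON <securable_type> <securable_name> TO <principal>."""
    allowed = {"SELECT", "MODIFY", "USE CATALOG", "USE SCHEMA", "ALL PRIVILEGES"}  # illustrative subset
    if privilege.upper() not in allowed:
        raise ValueError(f"unexpected privilege: {privilege}")
    qualified = ".".join(quote(part) for part in securable_name.split("."))
    return f"GRANT {privilege.upper()} ON {securable_type.upper()} {qualified} TO {quote(principal)}"


print(grant_statement("select", "table", "iot.events", "engineers"))
# GRANT SELECT ON TABLE iot.events TO engineers
print(grant_statement("SELECT", "TABLE", "iot.events", "data engineers"))
# GRANT SELECT ON TABLE iot.events TO `data engineers`
```

Validating the privilege against a fixed set keeps free-form input out of the generated statement, which matters when group names are synced from an identity provider.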

Slide 58

Row and Column Filtering

Fine-grained governance for the Lakehouse

Slide 59

Manage Data Sources & External Locations

Simplify data access management across clouds

[Diagram: User, Cluster or SQL warehouse, Unity Catalog (Access Control, Audit log, External Locations & Credentials), Cloud Storage (S3, ADLS, GCS)]
• Managed Data Sources: managed tables in a managed container / bucket
• External Locations: external tables and files in cloud storage, in external containers / buckets

Slide 60

Data Lineage – How it works in Databricks

End-to-end visibility into how data flows and is consumed in your organization

• Code (any language) is submitted to a cluster or SQL warehouse, or DLT* executes data flow
• Lineage service analyzes logs emitted from the cluster, and pulls metadata from DLT
• Assembles column and table level lineage
• Presented to the end user graphically in Databricks
• Lineage can be exported via API and imported into other tools

[Diagram: ETL / Job and Ad-hoc workloads, Workspace cluster / SQL Warehouse, Lineage service, Table and column lineage, Explore lineage in UI, Alation, Collibra, Microsoft Purview]

Slide 61

Delta Sharing

Slide 62

Delta Sharing Ecosystem

Endorsed by many Databricks partners, with integrations and connectors being developed

Slide 63

Conclusion

Slide 64

Security, Compliance, and Privacy

Columns: Capability, Databricks, Snowflake

Capability
• Automated Account Management
• Role-Based Access Control (RBAC)
• Multi-Factor Authentication (MFA)
• Identity Federation
• Data Classification and Encryption
• Tokenization and Access Control for Sensitive Data (with partner integrations)
• Advanced Firewall and Intrusion Detection Systems (IDS)

Slide 65

Security, Compliance, and Privacy (continued)

Capability
• Endpoint Protection
• Clarity in Roles and Responsibilities
• Compliance Documentation and Assurance
• Automated Compliance Tools
• Data Localization and PII Management
• Automated Security Scanning
• Security Information and Event Management (SIEM)

Slide 66

Data Governance

Capability
• Centralized Data Catalog
• Data Lifecycle Management
• Data Integration and Quality Tools
• Centralized Access Control
• Audit and Monitoring System
• Data Masking and Encryption

Slide 67

Data Governance (continued)

Capability
• Data Quality Framework
• Automated Data Quality Tools
• Open and Secure Exchange Technology
• Real-time Data Sharing Platforms
• Data Usage Agreements and Compliance

Slide 68

Thank you! Q&A

Slide 69

Learn More

• Security and Trust Center – Databricks
• Data governance guide | Databricks
• Security and compliance guide | Databricks
• Snowflake Security Overview and Best Practices
• Securing Snowflake | Snowflake Documentation
• Snowflake Security 101