Mastering Data Governance & Security In Modern Cloud Data Platforms using Databricks And Snowflake
A presentation at DataTune 2024 in March 2024 in Nashville, TN, USA by Philip Goldman
The Cambridge Analytica Scandal: "Cambridge Analytica and Facebook: The Scandal and the Fallout So Far"
Marriott Data Breach: "Marriott’s Data Breach One of the Biggest in History"
Equifax Data Breach: "Case Study: Equifax Data Breach" (Seven Pillars Institute)
Facebook Data Leak: "Facebook Data Breach: How to Tell If Your Account Was Exposed" (Fortune)
Why Data Governance & Data Security Matter
Data analysts, data engineers, and data scientists work across several stores, each governed differently:
• Data lake: no row- and column-level permissions; inflexible when policies change
• Metadata: can be out of sync with data
• Data warehouse: different governance model
• ML models, dashboards: yet another governance model
Philip Goldman, Sr. Data Solution Architect, EPAM
Navigating Today’s Discussion
1. Introduction: Introduction to Data Governance & Security
2. Principles: Exploring Data Security & Governance Principles
3. Best Practices: Best Practices for Security and Data Governance in Databricks and Snowflake
4. Conclusion: Empowering Your Data Journey
Choosing Databricks & Snowflake Some of the benefits and reasons businesses choose Databricks or Snowflake
Magic Quadrant for Cloud Database Management Systems
Principles of Security and Data Governance
Principles of data governance
• Unify data management
• Unify data security
• Manage data quality
• Data sharing
Principles of security, compliance, and privacy
• Identity & privileges
• Data security
• Network security
• Compliance & privacy
• Security monitoring
Bringing Principles into Practice: How do we implement these principles in real-world cloud platforms like Databricks and Snowflake?
Security & Data Governance Best Practices in Snowflake
Snowflake Security at a Glance
Access
• All communication secured & encrypted
• TLS 1.2 encryption in both trusted and untrusted networks
Authentication
• Password policy enforcement
• IP whitelisting
• Multi-factor authentication
• Private Link
• SAML 2.0 support for federated authentication
Authorization
• Flexible user management
• Role-based access control (RBAC) for granular control
• RBAC for data and actions
Data
• Encrypted at rest
• Hierarchical key model rooted in cloud HSM
• Automatic key rotation
• Time Travel, 1-90 days
• Tri-Secret Secure
• Query statement encryption
Infrastructure
• AWS, Azure physical security
• AWS, Azure redundancy
• Regional data centers: US, EU, AP
Operational controls
• NIST 800-53
• SOC 2 Type 2
• HIPAA
• PCI
• FedRAMP
AWS/Azure/GCP Private Link
Snowflake Network Policies: network policies can be applied at three levels
Snowflake network policies and network rules (diagram)
• Account level: Network Policy 1
• Integration level: Network Policy 2
• User level: Network Policy 3
• Network rule types shown: … VPCE ID, … IPv4, … Azure Link ID, … Internal Storage
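A minimal Snowflake SQL sketch of attaching network policies at the account level and at the user level. The policy names, the user `etl_svc`, and the IP ranges (taken from address blocks reserved for documentation) are illustrative assumptions, not values from the talk.

```sql
-- Hypothetical policy names and documentation-only IP ranges; adjust before use.
CREATE NETWORK POLICY corp_account_policy
  ALLOWED_IP_LIST = ('192.0.2.0/24')
  BLOCKED_IP_LIST = ('192.0.2.99');

-- Account level: the default for every user in the account.
ALTER ACCOUNT SET NETWORK_POLICY = corp_account_policy;

-- User level: a narrower policy for one (hypothetical) service user.
-- A user-level policy takes precedence over the account-level one.
CREATE NETWORK POLICY etl_user_policy
  ALLOWED_IP_LIST = ('198.51.100.10');

ALTER USER etl_svc SET NETWORK_POLICY = etl_user_policy;
```

Keeping the account-level policy broad and narrowing only specific users or integrations keeps the number of policies to maintain small.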
Snowflake Authentication (diagram)
• Human-launched interactive sessions (server / desktop, SaaS): SAML 2.0 through an IdP with MFA (including external browser), or OAuth through an OAuth authorization server, each with a trust relationship to Snowflake
• Service / process / machine-to-machine sessions, attended or unattended (server / desktop, SaaS): native username and password, key pair, Snowflake OAuth, or External OAuth (OAuth authorization server, Okta, MFA, trust relationship)
• In Snowflake: session policies and authentication policies
User Provisioning with SCIM (diagram)
• The SCIM provider pushes new and changed users to Snowflake over SCIM
• Groups map to roles: Group 1 → Role 1, Group 2 → Role 2; a new group becomes a new role
• A new group member is assigned the matching role
RBAC Role-based access control
Tri-Secret Secure: encrypt using a customer-managed key (CMK)
● Bring your own key (BYOK)
● Revoke at any time → no access to data
● Keys are rotated every 30 days
● Customers set up re-keying
Key hierarchy (diagram): customer key + SFMK (HSM) → AMKs → TMKs → files (F)
Unified Governance
• Know your data (what, where, who): Classification, Object Tagging, Object Dependencies, Access History, Account Usage
• Protect your data: Row Access Policies, Dynamic Data Masking, Conditional Masking, External Tokenization, Tag-Based Policies, Anonymization (PrPr)
• Connect your ecosystem: Direct Secure Sharing, Data Cleanrooms, Data Marketplace, pre-built partner integrations to manage the entire data estate
Object Tagging: track sensitive data and compute objects
• Taggable objects shown: ACCOUNT, DATABASE, SCHEMA, WAREHOUSE, TABLE, VIEW, STAGE
• Track sensitive and PII data
• Track resource usage for cost visibility and attribution
• Flexible privilege management options
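As a rough Snowflake SQL illustration of both uses of tags (sensitivity tracking and cost attribution): the tag names, allowed values, table `customers`, and warehouse `reporting_wh` are hypothetical and only serve the example.

```sql
-- Hypothetical tag, table, and warehouse names.
CREATE TAG IF NOT EXISTS pii_level
  ALLOWED_VALUES 'none', 'sensitive', 'highly_sensitive';
CREATE TAG IF NOT EXISTS cost_center;

-- Track sensitive data at table level and at column level.
ALTER TABLE customers SET TAG pii_level = 'sensitive';
ALTER TABLE customers MODIFY COLUMN phone SET TAG pii_level = 'highly_sensitive';

-- Track resource usage for cost visibility and attribution.
ALTER WAREHOUSE reporting_wh SET TAG cost_center = 'finance';

-- Read back the tag value assigned to a column.
SELECT SYSTEM$GET_TAG('pii_level', 'customers.phone', 'COLUMN');
```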
Data Classification: classify sensitive personal data
• Analyze table columns for sensitive personal information
• Get recommended tags using built-in machine learning
• Apply tags to track and audit sensitive data
Example (recommended tag shown above each column):
NAME           | GENDER | AGE | PHONE_NUMBER
fname          | gen    | age | phone
Jane Doe       | F      | 50  | 333-555-1236
Michael Gaines | M      | 75  | 666-666-1357
Ann Marshall   | F      | 48  | 555-555-1234
John Smith     | M      | 39  | 123-555-1234
Row Access Policies: row-level security
• Filter unauthorized rows at query time
• Use mapping tables for authorization
• Apply one policy to many tables
ROLE: EU_NA
Customer | Spend      | Region
ACME     | $820,000   | North America
Koko     | $2,100,000 | North America
ROLE: EU_RL
Customer | Spend      | Region
AGM      | $5,757,000 | Europe
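The mapping-table approach can be sketched in Snowflake SQL roughly as follows. The role names `EU_NA` and `EU_RL` come from the slide; the table names `sales` and `region_role_map` and the policy name are assumptions made for the example.

```sql
-- Hypothetical tables: the protected data and a role-to-region mapping table.
CREATE TABLE IF NOT EXISTS sales (customer VARCHAR, spend NUMBER, region VARCHAR);
CREATE TABLE IF NOT EXISTS region_role_map (role_name VARCHAR, region VARCHAR);

INSERT INTO region_role_map VALUES
  ('EU_NA', 'North America'),
  ('EU_RL', 'Europe');

-- The policy returns TRUE only for rows whose region is mapped to the current role,
-- so unauthorized rows are filtered out at query time.
CREATE OR REPLACE ROW ACCESS POLICY region_policy
  AS (row_region VARCHAR) RETURNS BOOLEAN ->
    EXISTS (
      SELECT 1
      FROM region_role_map m
      WHERE m.role_name = CURRENT_ROLE()
        AND m.region = row_region
    );

-- The same policy can be attached to any table that has a region column.
ALTER TABLE sales ADD ROW ACCESS POLICY region_policy ON (region);
```

Changing who sees what then becomes an insert or delete on the mapping table rather than a change to the policy itself.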
Access History (Reads): satisfy regulatory compliance and understand usage with column-level access visibility
Table: SENSITIVE
ID  | SSN
101 | 333-78-9999
102 | 779-66-8908
103 | 111-00-8888
Table: CONTACT
ID  | Mobile
101 | 248-222-3333
102 | 800-778-9904
103 | 415-887-8888
View: INFO (directly accessed)
ID  | Phone        | Unique_ID
101 | 248-222-3333 | 333-78-9999
102 | 800-778-9904 | 779-66-8908
103 | 415-887-8888 | 111-00-8888
Queries: Taylor (privileged user) runs "Select * from Info;"; Morgan (admin) runs "Select * from Sensitive;"
Access History log
User   | Tables                   | Columns
Taylor | Info, Sensitive, Contact | ID, Phone, Unique_ID, SSN, Mobile
Morgan | Sensitive                | ID, SSN
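A sketch of how a log like this could be produced from the ACCOUNT_USAGE.ACCESS_HISTORY view in Snowflake. The seven-day window is an arbitrary assumption, and the query presumes a role that can read the shared SNOWFLAKE database.

```sql
-- One row per user, underlying object, and column read in the last 7 days.
-- base_objects_accessed lists the underlying tables even when a view was queried.
SELECT DISTINCT
  ah.user_name,
  obj.value:"objectName"::STRING AS object_name,
  col.value:"columnName"::STRING AS column_name
FROM snowflake.account_usage.access_history ah,
     LATERAL FLATTEN(input => ah.base_objects_accessed) obj,
     LATERAL FLATTEN(input => obj.value:"columns") col
WHERE ah.query_start_time >= DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY ah.user_name, object_name, column_name;
```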
Access History (Writes): know the lineage of data
Flow (diagram): stage_S1 → table_T1 (copy into); table_T1 → table_T2 (create table as select); table_T1 → table_T3 (insert into / select from); table_T3 → stage_S2 (copy into)
PATH                                     | TARGET_NAME | TARGET_DOMAIN | TARGET_COLUMNS
stage_S1->table_T1                       | table_T1    | TABLE         | ["CONTENT"]
stage_S1->table_T1->table_T2             | table_T2    | TABLE         | ["ID","NAME"]
stage_S1->table_T1->table_T3             | table_T3    | TABLE         | ["NAME","ID"]
stage_S1->table_T1->table_T3->stage_S2   | stage_S2    | STAGE         | []
Object Dependencies: identify dependencies and downstream impact
Flow (diagram): external_stage_S1 → external_table_ET1 → materialized_view_MV1 (operations shown: create external table with location, create table as select)
REFERENCED_OBJECT_NAME | REFERENCED_OBJECT_DOMAIN | REFERENCING_OBJECT_NAME | REFERENCING_OBJECT_DOMAIN
external_stage_S1      | STAGE                    | external_table_ET1      | EXTERNAL TABLE
external_table_ET1     | EXTERNAL_TABLE           | materialized_view_MV1   | MATERIALIZED_VIEW
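For a downstream-impact check, a query along these lines against the ACCOUNT_USAGE.OBJECT_DEPENDENCIES view should list everything that references a given object. The object name is the one from the slide; the case-insensitive match is a convenience assumed for the example.

```sql
-- What would be affected if external_table_ET1 changed?
SELECT
  referencing_object_name,
  referencing_object_domain,
  referenced_object_name,
  referenced_object_domain
FROM snowflake.account_usage.object_dependencies
WHERE referenced_object_name ILIKE 'external_table_et1';
```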
Dynamic Data Masking: column-level security
• Dynamically mask data at query time
• Centralized policy management
• Apply one policy to many columns
Authorized access, restricted data:
ID  | Phone        | SSN
101 | ***-***-5534 | (masked)
102 | ***-***-3564 | (masked)
103 | ***-***-9787 | (masked)
Authorized access, unrestricted data:
ID  | Phone        | SSN
101 | 408-123-5534 | 111-22-3333
102 | 510-335-3564 | 222-33-4444
103 | 214-553-9787 | 333-44-5555
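A minimal Snowflake SQL sketch of a masking policy that yields the partially masked phone numbers shown on the slide. The table `customers`, the role `PII_READER`, and the policy name are hypothetical.

```sql
-- Hypothetical table and role names.
CREATE TABLE IF NOT EXISTS customers (id NUMBER, phone VARCHAR, ssn VARCHAR);

-- Authorized roles see the full value; all other roles see only the last four digits.
CREATE OR REPLACE MASKING POLICY phone_mask AS (val VARCHAR) RETURNS VARCHAR ->
  CASE
    WHEN CURRENT_ROLE() IN ('PII_READER') THEN val
    ELSE '***-***-' || RIGHT(val, 4)
  END;

-- Attach the policy; the same policy can be set on other phone columns too.
ALTER TABLE customers MODIFY COLUMN phone SET MASKING POLICY phone_mask;
```

Because the policy is evaluated at query time, the stored data is unchanged and only the query result differs by role.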
External Tokenization: dynamically de-tokenize data for authorized users
Externally tokenize protected data at ingest
● Using tokenization provider agents on ETL tools
● Ingest tokenized data
Dynamically de-tokenize at query time
● Call third-party service using external functions to de-tokenize data
● For unauthorized users, third-party service is not called
Policy-based control
● Table/view owners and privileged users unauthorized by default
● Centralized policy management
Flow (diagram): policies → external function → REST API → tokenization provider in the customer VPC / VNet
Alex (restricted access), tokenized:
ID  | Phone        | SSN
101 | 111-222-3333 | 000-78-9999
102 | 002-778-9904 | 779-66-8908
103 | 100-887-8888 | 111-00-8888
Morgan (unrestricted access), de-tokenized:
ID  | Phone        | SSN
101 | 408-123-5534 | 387-78-3456
102 | 510-334-3564 | 226-44-8908
103 | 214-553-9787 | 359-9987-0098
Benefits of Snowflake Collaboration
Across Cloud & Region with Snowgrid: delivers direct access to live, ready-to-query data across clouds and regions (Snowflake regions on AWS, Azure, GCP) with auto-fulfillment and no ETL.
More than Data: Snowflake enables customers to collaborate with data, data services, and applications, including built-in usage-based monetization.
● Data: enable direct access to live, ready-to-query data without ETL
● Data services: deliver insights without exposing underlying data
● Applications: discover, build, and distribute apps that run natively within your Snowflake account
Robust Data Governance: achieve privacy-preserving collaboration with targeted discovery, revocable access, and custom event logging.
● Discovery: single account, account group, cloud region(s), public marketplace
● Access: row access policies, dynamic data masking, conditional masking, query constraints*
● Audit: listing views & events, jobs run by consumer, objects & columns accessed, custom event logging
Share Your Lake!
Data lake → external tables (optionally a materialized view) → share; consumers view, query, and join the shared data with BI and query tools.
1. Data Marketplace: list the share for Snowflake customers via Data Exchange; data consumers request it
2. Reader account: share with non-Snowflake customers, who use BI and query tools
Security & Data Governance Best Practices in Databricks
Security Overview (Databricks account diagram)
• Components: users & applications; Databricks control plane (web application, compute orchestration, Unity Catalog, queries and code); Databricks serverless workloads; compute; customer storage
• Encryption in transit over the cloud provider’s networks between components (Private Link enabled by default for serverless); encryption at rest for queries and code and for customer storage
• 6+ layers of authentication & network controls
• 5+ options to secure your data
• 4 layers of compute isolation
Access / data security, network security, and process security: what is built in and what you set up
• SSO built in for Azure / GCP
• Set up SSO for AWS
• Set up data access policies via Unity Catalog
• Built in through Unity Catalog
• Set up IP access lists / front-end Private Link
• Built in through serverless
• Set up allow list for storage
• Set up Private Link for storage
• Enable Unity Catalog for audit logs
• Set up CMK services
• Built in through serverless
• Set up audit logs for storage
• Enable storage audit logs
What is Databricks Unity Catalog?
• Capabilities: discovery, access controls, monitoring, lineage, auditing, data sharing
• Assets: tables, files, models, notebooks, dashboards
Unity Catalog - Key Capabilities: unified governance for all data assets
• Centralized metadata and user management
• Centralized access controls
• Data lineage
• Data access auditing
• Data search and discovery
• Secure data sharing with Delta Sharing
Map, secure and audit data across clouds
• Catalog all your data, analytics and AI assets and create a unified view of your entire data estate
• Centrally manage access permissions and audit controls for files and tables across all workspaces and workloads, using a familiar interface based on ANSI SQL
• Enable fine-grained access controls on rows and columns
Open data sharing and collaboration
Data Discovery
Automated lineage for all workloads: end-to-end visibility into how data flows and is consumed in your organization
Unity Catalog - Architecture
Life of a query with Unity Catalog (with Azure Managed Identities)
Admin setup
● Creates an IAM role (AWS) / managed identity (Azure) / service account (GCP)
● Creates storage credentials / external locations in Unity Catalog
● Defines access policies in Unity Catalog
Query flow
1. The user sends a query (SQL, Python, R, Scala) to a cluster or SQL warehouse
2. Unity Catalog checks namespace, metadata and grants, and writes the audit log
3. Unity Catalog assumes the IAM role / managed identity / service account
4. Unity Catalog returns the list of paths/data files and scoped-down temporary tokens
5. The cluster or SQL warehouse requests/ingests data from the paths/data files with the temporary tokens
6. Cloud storage (S3, ADLS) returns the data
7. The cluster or SQL warehouse enforces policies
8. The result is sent to the user
Centralized Metadata and User Management: create a unified view of your data estate
• Without Unity Catalog: each Databricks workspace (Workspace 1, Workspace 2) has its own user management, metastore, and access controls next to its clusters and SQL warehouses
• With Unity Catalog: user management, metastore (including foreign databases), and access controls sit in Unity Catalog; each Databricks workspace keeps its clusters and SQL warehouses
Databricks Account and the Cloud Provider Hierarchy: create a unified view of your data estate
• Azure: https://accounts.azuredatabricks.net/
• AWS: https://accounts.cloud.databricks.com/
• Global Admin used initially for security
Dynamic View
• Limit access to columns: omit column values from output
• Limit access to rows: omit rows from output
• Data masking: obscure data
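A Databricks SQL sketch of a dynamic view that combines the three techniques. The catalog, schema, table, column, and group names are all hypothetical; `is_account_group_member` is the built-in group-membership function.

```sql
-- Hypothetical catalog, schema, table, and group names.
CREATE OR REPLACE VIEW main.sales.orders_redacted AS
SELECT
  order_id,
  region,
  -- Data masking: obscure the value for users outside the auditors group.
  CASE
    WHEN is_account_group_member('auditors') THEN customer_email
    ELSE 'REDACTED'
  END AS customer_email,
  amount
  -- Limit access to columns: any column not selected here is simply not exposed.
FROM main.sales.orders
-- Limit access to rows: members of managers see everything, others only US orders.
WHERE is_account_group_member('managers') OR region = 'US';
```

Users are then granted SELECT on the view rather than on the underlying table.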
Centralized Access Controls: centrally grant and manage access permissions across workloads
Using ANSI SQL DCL
GRANT <privilege> ON <securable_type> <securable_name> TO <principal>
GRANT SELECT ON iot.events TO engineers
• <privilege>: choose permission level
• <securable_name>: ‘Table’ = collection of files in S3/ADLS
• <principal>: sync groups from your identity provider
Using UI
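In Unity Catalog's three-level namespace, reading a table also requires usage privileges on the parent catalog and schema. A sketch of the full chain, assuming a hypothetical catalog `main` that contains the `iot` schema and the `engineers` group from the example above:

```sql
-- Hypothetical catalog name (main); iot.events and engineers come from the slide.
GRANT USE CATALOG ON CATALOG main TO `engineers`;
GRANT USE SCHEMA ON SCHEMA main.iot TO `engineers`;
GRANT SELECT ON TABLE main.iot.events TO `engineers`;

-- Review what has been granted on the table.
SHOW GRANTS ON TABLE main.iot.events;

-- Access can be withdrawn the same way.
REVOKE SELECT ON TABLE main.iot.events FROM `engineers`;
```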
Row and Column Filtering: fine-grained governance for the Lakehouse
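A sketch of row filters and column masks in Databricks SQL, where the filtering logic lives in SQL functions that are attached directly to the table. All catalog, schema, table, function, and group names are hypothetical.

```sql
-- Row filter: members of admins see all rows, everyone else only US rows.
CREATE OR REPLACE FUNCTION main.sales.us_only(region STRING)
  RETURN is_account_group_member('admins') OR region = 'US';

ALTER TABLE main.sales.orders SET ROW FILTER main.sales.us_only ON (region);

-- Column mask: members of auditors see the real value, everyone else a placeholder.
CREATE OR REPLACE FUNCTION main.sales.mask_email(email STRING)
  RETURN CASE
    WHEN is_account_group_member('auditors') THEN email
    ELSE 'REDACTED'
  END;

ALTER TABLE main.sales.orders
  ALTER COLUMN customer_email SET MASK main.sales.mask_email;
```

Unlike a dynamic view, this keeps a single table name for all users, so existing queries do not need to be pointed at a separate view.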
Manage Data Sources & External Locations: simplify data access management across clouds
• A user works through a cluster or SQL warehouse; Unity Catalog applies access control, holds external locations & credentials, and writes the audit log
• Managed data sources: managed tables in the managed container / bucket
• External locations: external tables and files in cloud storage (S3, ADLS, GCS) in external containers / buckets
Data Lineage: How it works in Databricks
End-to-end visibility into how data flows and is consumed in your organization
• Code (any language) from an ETL job or ad-hoc work is submitted to a workspace cluster or SQL warehouse, or DLT* executes the data flow
• The lineage service analyzes logs emitted from the cluster and pulls metadata from DLT
• It assembles column- and table-level lineage
• Lineage is presented to the end user graphically in Databricks (explore lineage in the UI)
• Lineage can be exported via API and imported into other tools (the diagram shows Alation, Collibra, and Microsoft Purview)
Delta Sharing
Delta Sharing Ecosystem: endorsed by many Databricks partners, with integrations and connectors being developed
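On the provider side, publishing a table through Delta Sharing in Databricks SQL can look roughly like this. The share, table, and recipient names are hypothetical.

```sql
-- Hypothetical share, table, and recipient names.
CREATE SHARE IF NOT EXISTS iot_share COMMENT 'Curated IoT events';
ALTER SHARE iot_share ADD TABLE main.iot.events;

-- A recipient represents the consuming organization.
CREATE RECIPIENT IF NOT EXISTS partner_co;

-- Recipients read the shared table in place; no copy of the data is created.
GRANT SELECT ON SHARE iot_share TO RECIPIENT partner_co;
```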
Conclusion
Security, Compliance, and Privacy: capabilities compared across Databricks and Snowflake
• Automated Account Management
• Role-Based Access Control (RBAC)
• Multi-Factor Authentication (MFA)
• Identity Federation
• Data Classification and Encryption
• Tokenization and Access Control for Sensitive Data (with partner integrations)
• Advanced Firewall and Intrusion Detection Systems (IDS)
• Endpoint Protection
• Clarity in Roles and Responsibilities
• Compliance Documentation and Assurance
• Automated Compliance Tools
• Data Localization and PII Management
• Automated Security Scanning
• Security Information and Event Management (SIEM)
Data Governance: capabilities compared across Databricks and Snowflake
• Centralized Data Catalog
• Data Lifecycle Management
• Data Integration and Quality Tools
• Centralized Access Control
• Audit and Monitoring System
• Data Masking and Encryption
• Data Quality Framework
• Automated Data Quality Tools
• Open and Secure Exchange Technology
• Real-time Data Sharing Platforms
• Data Usage Agreements and Compliance
Thank you! Q&A
Learn More
Databricks
• Security and Trust Center – Databricks
• Data governance guide | Databricks
• Security and compliance guide | Databricks
Snowflake
• Snowflake Security Overview and Best Practices
• Securing Snowflake | Snowflake Documentation
• Snowflake Security 101