Leveraging Microsoft Fabric for Advanced Data Engineering Solutions

A presentation at Nashville Data Engineering Group Meetup in April 2024 in Nashville, TN, USA by Philip Goldman

Slide 1

Leveraging Microsoft Fabric for Advanced Data Engineering Solutions

Slide 2

Philip Goldman, Field Sr. Data Solution Architect, EPAM

Slide 3

• Introducing Microsoft Fabric
• Deep Dive into Data Engineering
• DEMO

Slide 4

Introducing Microsoft Fabric for Data Engineering

Slide 5

Microsoft Fabric
• Complete Analytics Platform: Everything, unified; Unified SaaS Solution; Low Code Plus Pro Dev
• Lake-centric and Open: OneLake; One Copy; Always Synced
• Empower Every Office User: Familiar and Intuitive; Built Into Office 365; Insight to Action
• Persistent Security and Governance: End-to-End Visibility; Always Governed; Secure by Default

Slide 6

Introducing Microsoft Fabric for Data Engineering
• Complete analytics platform

Slide 7

Scalable analytics are complex and fragmented
• Every analytics project has many subsystems
• Every subsystem needs a different class of product
• Products often come from multiple vendors
• Integration at scale across products is complex, fragile, and expensive

Slide 8

Scalable analytics are complex and fragmented
• Every analytics project has many subsystems
• Every subsystem needs a different class of product
• Products often come from multiple vendors
• Integration at scale across products is complex, fragile, and expensive
“Simplify, I am the Chief Data Officer and don’t want to be the Chief Integration Officer.”
Every CDO, Every Enterprise

Slide 9

A silver lining?
• Analytics systems have very predictable patterns
• Microsoft has all the products with the right scale needed to build a complete analytics system
Data Integration, Data Engineering, Data Warehousing, Real Time Analytics, Data Science, Data Lake, Governance and Administration, Business Intelligence

Slide 10

A silver lining?
• Analytics systems have very predictable patterns
• Microsoft has all the products with the right scale needed to build a complete analytics system
Data Integration, Data Engineering, Data Warehousing, Real Time Analytics, Data Science, Data Lake, Governance and Administration, Business Intelligence

Slide 11

Still far too complex
Power BI, Synapse, Kusto, Azure AI, Data Factory, Spark
• Many Products
• Different Experiences
• Proprietary and Open
• Dedicated and Serverless
• PaaS and SaaS
• Different Business Models
• Steep Learning Curves
• Deep Expertise Needed
• High Integration Effort

Slide 12

Microsoft Fabric
Data Integration, Data Lake, Spark Engines, Data Warehouse, Real Time Analytics, Data Science, Business Intelligence, Governance
Unified analytics fabric
End-to-end analytics data fabric
From the data lake to the business user

Slide 13

Microsoft Fabric: The Intelligent Data Fabric
Data Integration, Data Engineering, Data Warehousing, Real Time Analytics, Data Science, Business Intelligence
AI Assisted, Shared Workspaces, Universal Compute Capacities, OneSecurity, OneLake
Single…
• Onboarding and trials
• Sign-on
• Navigation model
• UX model
• Workspace organization
• Collaboration experience
• Data Lake
• Storage format
• Data copy for all engines
• Security model
• CI/CD
• Monitoring hub
• Data Hub
• Governance & compliance

Slide 14

Persona Centric Experiences

Slide 15

Data Integration

Slide 16

Data Engineering

Slide 17

Data Warehouse

Slide 18

Real-Time Analytics

Slide 19

Data Science

Slide 20

Power BI

Slide 21

AI Assisted Creation in Microsoft Fabric
Data Integration, Data Engineering, Data Warehousing, Real Time Analytics, Data Science, Business Intelligence
AI Assisted, Shared Workspaces, Universal Compute Capacities, OneSecurity, OneLake
The Intelligent Data Fabric
• The Fabric platform will include a built-in Azure OpenAI based assistant that will serve all the workloads
• The first GPT-based feature is already shipping in Power BI: NL2DAX (DAX calculation creation based on natural language prompts)
• Ongoing major ramp-up for pervasive AOAI based product-wide AI assistance

Slide 22

Introducing Microsoft Fabric for Data Engineering
• Complete analytics platform
• Lake-centric and open architecture

Slide 23

OneLake for all Data
“The OneDrive for Data”
Data Warehousing, Data Engineering, Data Integration, Data Science, Real Time Analytics, Business Intelligence
Intelligent data fabric
OneLake Storage
• A single SaaS lake for the whole organization
• Provisioned automatically with the tenant
• All workloads automatically store their data in the OneLake workspace folders
• All the data is organized in an intuitive hierarchical namespace
• The data in OneLake is automatically indexed for discovery, MIP labels, lineage, PII scans, sharing, governance and compliance
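
Because OneLake organizes data in a hierarchical namespace (workspace, then item, then folders and files), a path to any table or file can be assembled from those parts. The sketch below builds an ABFS-style URI following the pattern Microsoft documents for OneLake, `abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<item>.<itemtype>/<path>`. The helper `onelake_uri` is illustrative and not part of any Fabric SDK, and the workspace, lakehouse, and table names are hypothetical.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_uri(workspace: str, item: str, item_type: str, *parts: str) -> str:
    """Build an ABFS-style URI for a folder or file inside a OneLake item.

    Illustrative helper only: names are passed through as-is and are not
    validated against a real tenant.
    """
    if not (workspace and item and item_type):
        raise ValueError("workspace, item, and item_type are required")
    # Drop empty segments and stray slashes so the path stays well-formed.
    segments = [f"{item}.{item_type}"]
    segments += [p.strip("/") for p in parts if p.strip("/")]
    return f"abfss://{workspace}@{ONELAKE_HOST}/" + "/".join(segments)


# Hypothetical workspace "SalesWorkspace" with a lakehouse named "WWI".
uri = onelake_uri("SalesWorkspace", "WWI", "Lakehouse", "Tables", "fact_sale")
print(uri)
```

The same workspace/item/path hierarchy is what shows up when browsing OneLake as folders, so one naming scheme serves every workload.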

Slide 24

One Copy for all computes
Real separation of compute and storage
Serverless Compute: Data Warehousing (T-SQL), Data Engineering (Spark), Data Integration, Data Science, Real Time Analytics (KQL), Business Intelligence (Analysis Services)
OneLake Storage: Finance, Customers 360, Service Telemetry, Business KPIs (each in Delta – Parquet Format)
• All the compute engines store their data automatically in OneLake
• The data is stored in a single common format
• Delta – Parquet, an open standards format, is the storage format for all tabular data in Fabric
• Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export
• All the compute engines have been fully optimized to work with Delta Parquet as their native format
• Shared universal security model is enforced across all the engines

Slide 25

One Copy for all computes
Universal security makes it real
Serverless Compute: Data Warehousing (T-SQL), Data Engineering (Spark), Data Integration, Data Science, Real Time Analytics (KQL), Business Intelligence (Analysis Services)
OneSecurity
OneLake Storage: Finance, Customers 360, Service Telemetry, Business KPIs (each in Delta – Parquet Format)
• All the compute engines store their data automatically in OneLake
• The data is stored in a single common format
• Delta – Parquet, an open standards format, is the storage format for all tabular data in Fabric
• Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export
• All the compute engines have been fully optimized to work with Delta Parquet as their native format
• Shared universal security model is enforced across all the engines

Slide 26

Taking One Copy to the next level
Shortcuts
Data Integration, Data Engineering, Data Warehousing, Real Time Analytics, Data Science, Business Intelligence
OneLake Storage: Finance, Customers 360, Service Telemetry, Business KPIs
Azure, Amazon, Google
• Sharing data in OneLake is as easy as sharing files in OneDrive, removing the need for data duplication
• With shortcuts, data throughout OneLake can be composed together without any data movement
• Shortcuts also allow instant linking of data already existing in Azure and in other clouds, without any data duplication or movement, making OneLake the first multi-cloud data lake
• With support for industry standard APIs, OneLake data can be directly accessed by any application or service
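
The idea behind shortcuts, one namespace that references data where it already lives, can be illustrated with a small toy model. This is a conceptual sketch in plain Python and not the Fabric or OneLake API: the `Lake` class, its methods, and all paths are hypothetical. A shortcut stores only a pointer to its target, so reading through it returns the original data without adding a copy.

```python
class Lake:
    """Toy namespace: a path maps either to stored bytes or to a shortcut."""

    def __init__(self, name):
        self.name = name
        self._files = {}      # path -> bytes physically stored in this lake
        self._shortcuts = {}  # path -> (target lake, target path)

    def write(self, path, data):
        self._files[path] = data

    def add_shortcut(self, path, target_lake, target_path):
        # Only the reference is recorded; no data is moved or duplicated.
        self._shortcuts[path] = (target_lake, target_path)

    def read(self, path, _depth=0):
        if path in self._files:
            return self._files[path]
        if path in self._shortcuts:
            if _depth > 8:  # guard against shortcut cycles in this toy model
                raise RecursionError("shortcut chain too deep")
            target_lake, target_path = self._shortcuts[path]
            return target_lake.read(target_path, _depth + 1)
        raise FileNotFoundError(path)

    def stored_bytes(self):
        return sum(len(data) for data in self._files.values())


# Hypothetical external store holding telemetry, and a OneLake-like namespace.
external = Lake("external-bucket")
external.write("telemetry/2024-04.parquet", b"service telemetry rows")

onelake = Lake("OneLake")
onelake.write("Finance/ledger.parquet", b"finance rows")
onelake.add_shortcut(
    "ServiceTelemetry/2024-04.parquet", external, "telemetry/2024-04.parquet"
)

data = onelake.read("ServiceTelemetry/2024-04.parquet")
print(data, onelake.stored_bytes())
```

In this model the lake's physical size reflects only the finance file, while the telemetry data is still readable through the lake's own path.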

Slide 27

Introducing Microsoft Fabric for Data Engineering
• Complete analytics platform
• Lake-centric and open architecture
• Empower every office user

Slide 28

Office Integration

Slide 29

Introducing Microsoft Fabric for Data Engineering
• Complete analytics platform
• Lake-centric and open architecture
• Empower every office user
• Persistent security and governance

Slide 30

Regulatory compliance
• Fabric will be available in every Azure region
• Data at rest: compliant with EUDB and other single-geo data residency regulations
• Multi-geo capacities allow control over content storage location in most Azure data centers worldwide
Microsoft pledges support for EU Data Boundary

Slide 31

Access control
• Fabric workspace roles define default permissions on workload items on the Control Plane
• Workload item permissions can be modified and managed via sharing
• On the Data Plane, Universal Security defines access policies on the delta tables directly, and all workload compute engines will respect such policies
Defining user access via workspace roles and sharing
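
The interplay between workspace roles and item-level sharing can be sketched as a small lookup. The four role names (Admin, Member, Contributor, Viewer) are the Fabric workspace roles; everything else here is an assumption made for illustration: the permission sets attached to each role are simplified, the function `effective_permissions` is hypothetical, and the users and item names are invented. This is not how Fabric evaluates access internally.

```python
# Simplified, assumed default permissions per workspace role.
ROLE_DEFAULTS = {
    "Admin": {"read", "write", "reshare", "manage"},
    "Member": {"read", "write", "reshare"},
    "Contributor": {"read", "write"},
    "Viewer": {"read"},
}


def effective_permissions(user, item, workspace_roles, item_shares):
    """Combine a user's workspace-role defaults with permissions granted
    on one specific item through sharing."""
    role = workspace_roles.get(user)  # None if the user has no workspace role
    permissions = set(ROLE_DEFAULTS.get(role, set()))
    permissions |= item_shares.get((user, item), set())
    return permissions


# Hypothetical users: bob is a Viewer, carol has no workspace role at all.
workspace_roles = {"alice": "Admin", "bob": "Viewer"}
item_shares = {
    ("bob", "SalesLakehouse"): {"write"},   # extra permission via sharing
    ("carol", "SalesLakehouse"): {"read"},  # access to a single item only
}

for user in ("alice", "bob", "carol"):
    print(user, sorted(effective_permissions(
        user, "SalesLakehouse", workspace_roles, item_shares)))
```

Sharing widens access for one item without changing the user's role, which is why carol can read the shared lakehouse but nothing else in the workspace.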

Slide 32

Microsoft Fabric
• Complete Analytics Platform: Everything, unified; SaaS Solution; Low Code Plus Pro Dev
• Lake-centric and Open: OneLake; One Copy; Always Synced
• Empower Every Office User: Familiar and Intuitive; Built Into Office 365; Insight to Action
• Persistent Security and Governance: End-to-End Visibility; Always Governed; Secure by Default

Slide 33

Introducing Microsoft Fabric for Data Engineering
• Complete analytics platform
• Lake-centric and open architecture
• Empower every office user
• Persistent security and governance
Deep Dive into Data Engineering
• Data pipelines and movement

Slide 34

The choice of tools for your data transformations
• Code-based pipelines: Notebooks, Spark Job Definitions
• UI-based pipelines: Data pipelines, Data Flows

Slide 35

No-code data pipelines

Slide 36

Run data flows on the cloud

Slide 37

Write and run Notebooks from web & VS Code

Slide 38

Write and run Notebooks from web & VS Code

Slide 39

Schedule your data transformations with ease

Slide 40

Introducing Microsoft Fabric for Data Engineering
• Complete analytics platform
• Lake-centric and open architecture
• Empower every office user
• Persistent security and governance
Deep Dive into Data Engineering
• Data pipelines and movement
• Data storage and architecture

Slide 41

OneLake
• Unified lakehouse
• Based on open standards
• Accessible from any workload

Slide 42

OneLake working with both files and tables

Slide 43

Browse your OneLake from Windows Explorer

Slide 44

Slide 45

Copy data from a wide range of services…

Slide 46

… and even more with Data Flows

Slide 47

Introducing Microsoft Fabric for Data Engineering
• Complete analytics platform
• Lake-centric and open architecture
• Empower every office user
• Persistent security and governance
Deep Dive into Data Engineering
• Data pipelines and movement
• Data storage and architecture
• Delivering data to data analysts and data scientists

Slide 48

Access lakehouse data from Notebooks and Spark Jobs

Slide 49

Expose Datasets ready for Power BI

Slide 50

Use Lakehouse data from Excel

Slide 51

Use Lakehouse data from Excel

Slide 52

Access your Lakehouse data from any SQL Client

Slide 53

Organize your data through workspaces & shortcuts

Slide 54

From raw data to certified datasets

Slide 55

Introducing Microsoft Fabric for Data Engineering
• Complete analytics platform
• Lake-centric and open architecture
• Empower every office user
• Persistent security and governance
Deep Dive into Data Engineering
• Data pipelines and movement
• Data storage and architecture
• Delivering data to data analysts and data scientists
DEMO

Slide 56

Architecture

Slide 57

Data model
Wide World Importers (WWI) data model
See Wide World Importers sample databases for Microsoft SQL.

Slide 58

Data and Transformation flow

Slide 59

References
• Introduction: End-to-End Analytics in Microsoft Fabric
• Lakehouse: Get Started Lakehouses
• Spark on Lakehouse: Use Apache Spark in Lakehouse
• Work with Delta: Delta Lake Tables in Microsoft Fabric
• Data Factory Pipelines: Pipelines, Activities, Templates
• Data Warehouses: Get started with Data Warehouses
• Real-Time Analytics: Analyze real-time data
• Data Science: Get started with data science in Microsoft Fabric
• Administer: Administration, Security, and Govern data in Microsoft Fabric
• Medallion Architecture: Design Fabric Medallion Architecture with Bronze, Silver and Gold layers of Lakehouse
• DataFlow Gen2: Ingest with Dataflows in Microsoft Fabric
• Data Analysis with Kusto Query Language: Explore the fundamentals of data analysis
• Azure Data Engineer: free online training from Azure

Slide 60

Slides Q&A AND SOCIAL ADS
X: @philg0ld
#NashvilleDataEngineering
http://linkedin.com/in/philg0ld

Slide 61

APPENDIX

Slide 62

Sign up for a free Fabric trial
See: Fabric trial - Microsoft Fabric | Microsoft Learn

Slide 63

Microsoft Fabric concepts and licenses

Slide 64

DAX overview
FILTER CONTEXT, CALCULATED COLUMNS, SQL SERVER ANALYSIS SERVICES, ROW CONTEXT, MEASURE, DAX, Power BI DESKTOP

1. DAX is a formula language, like the formula language in Excel, that uses functions
2. Create new information: columns, measures, calculations, formulas
3. DAX usage may depend on the design pattern:
   • Work could be pushed to Power Query
   • Work could be pushed to the database
   • Work could also be pushed to an ETL tool
4. Great for time intelligence, for example YTD, MTD, and parallel period comparisons
5. Enhances the data model
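
To make the time-intelligence point concrete, the sketch below computes year-to-date and month-to-date totals in plain Python. It is an analogue of what a DAX measure built on a function such as `TOTALYTD` returns for a given date, not DAX itself; the function names `total_ytd` and `total_mtd` and the sample sales rows are invented for illustration.

```python
from datetime import date

# Hypothetical fact rows: (order date, sales amount).
sales = [
    (date(2023, 12, 15), 100.0),
    (date(2024, 1, 10), 40.0),
    (date(2024, 2, 5), 60.0),
    (date(2024, 4, 2), 25.0),
]


def total_ytd(rows, as_of):
    """Sum amounts from January 1 of as_of's year through as_of."""
    return sum(
        amount for day, amount in rows
        if day.year == as_of.year and day <= as_of
    )


def total_mtd(rows, as_of):
    """Sum amounts from the first day of as_of's month through as_of."""
    return sum(
        amount for day, amount in rows
        if (day.year, day.month) == (as_of.year, as_of.month) and day <= as_of
    )


print(total_ytd(sales, date(2024, 2, 29)))  # 40.0 + 60.0
print(total_ytd(sales, date(2024, 4, 30)))  # all three 2024 rows
print(total_mtd(sales, date(2024, 4, 30)))  # April only
```

In a Power BI model the `as_of` date would come from the filter context of the visual rather than from a function argument, which is the main thing a measure adds over this hand-rolled version.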

Slide 65

Power BI Architecture - Fabric

Slide 66

What is Delta Lake?
• Open-source storage framework that enables building a Lakehouse architecture with compute engines, including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python.
• ACID-compliant storage layer that runs on top of cloud object stores such as MinIO, Hadoop HDFS, Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
• Provides scalable metadata handling for petabyte-scale tables with billions of partitions and files.
• Provides time travel: access or revert to earlier versions of data for audits, rollbacks, or reproducibility.
• Production-ready and battle-tested in over 10,000 production environments.
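
Atomic commits and time travel both follow from keeping an ordered log of committed versions, each recording which data files were added or removed. The toy class below illustrates that idea in plain Python. It is a conceptual sketch only, not the Delta Lake library or its on-disk protocol: `ToyDeltaTable`, its methods, and the file names are all hypothetical.

```python
class ToyDeltaTable:
    """Toy model of a versioned table backed by an append-only commit log."""

    def __init__(self):
        self._log = []  # entry i describes the changes made by version i

    def commit(self, add=(), remove=()):
        """Record additions and removals as one new version, or fail whole."""
        add, remove = frozenset(add), frozenset(remove)
        missing = remove - self.snapshot()
        if missing:
            # Reject the whole commit so the log never holds a partial change.
            raise ValueError(f"cannot remove unknown files: {sorted(missing)}")
        self._log.append({"add": add, "remove": remove})
        return len(self._log) - 1  # the new version number

    def snapshot(self, version=None):
        """Return the set of live files at a version (latest by default)."""
        latest = len(self._log) - 1
        if version is None:
            version = latest
        if version < -1 or version > latest:
            raise IndexError(f"unknown version {version}")
        files = set()
        for entry in self._log[: version + 1]:  # replay the log in order
            files -= entry["remove"]
            files |= entry["add"]
        return files


table = ToyDeltaTable()
v0 = table.commit(add=["part-0.parquet"])
v1 = table.commit(add=["part-1.parquet"])
v2 = table.commit(add=["part-2.parquet"], remove=["part-0.parquet"])

print(sorted(table.snapshot()))    # latest version
print(sorted(table.snapshot(v0)))  # time travel to the first version
```

Because old log entries are never rewritten, reading an earlier version is just a shorter replay, which is what makes audits and rollbacks cheap in this model.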

Slide 67

From raw data to certified datasets
Delta Lake ACID Implementation