Leveraging Microsoft Fabric for Advanced Data Engineering Solutions
A presentation at Nashville Data Engineering Group Meetup in April 2024 in Nashville, TN, USA by Philip Goldman
Philip Goldman, Field Sr. Data Solution Architect, EPAM
Introducing Microsoft Fabric • Deep Dive into Data Engineering • DEMO
Introducing Microsoft Fabric for Data Engineering
Microsoft Fabric
• Complete Analytics Platform: Everything, unified; Unified SaaS Solution; Low Code Plus Pro Dev
• Lake-centric and Open: OneLake; One Copy; Always Synced
• Empower Every Office User: Familiar and Intuitive; Built Into Office 365; Insight to Action
• Persistent Security and Governance: End-to-End Visibility; Always Governed; Secure by Default
Introducing Microsoft Fabric for Data Engineering • Complete analytics platform
Scalable analytics are complex and fragmented
• Every analytics project has many subsystems
• Every subsystem needs a different class of product
• Products often come from multiple vendors
• Integration at scale across products is complex, fragile, and expensive
“Simplify, I am the Chief Data Officer and don’t want to be the Chief Integration Officer.” Every CDO, Every Enterprise
A silver lining?
• Analytics systems have very predictable patterns
• Microsoft has all the products with the right scale needed to build a complete analytics system: Data Integration, Data Engineering, Data Warehousing, Real-Time Analytics, Data Science, Data Lake, Governance and Administration, Business Intelligence
Still far too complex • Many products (Power BI, Synapse, Kusto, Azure AI, Data Factory, Spark) • Different experiences • Proprietary and open • Dedicated and serverless • PaaS and SaaS • Different business models • Steep learning curves • Deep expertise needed • High integration effort
Microsoft Fabric: unified analytics fabric • End-to-end analytics data fabric • From the data lake to the business user • Data Integration, Data Lake, Spark Engines, Data Warehouse, Real-Time Analytics, Data Science, Business Intelligence, Governance
Microsoft Fabric: The Intelligent Data Fabric
• Workloads: Data Integration, Data Engineering, Data Warehousing, Real-Time Analytics, Data Science, Business Intelligence
• Platform: AI Assisted Shared Workspaces, Universal Compute Capacities, OneSecurity, OneLake
• Single… Onboarding and trials; Sign-on; Navigation model; UX model; Workspace organization; Collaboration experience; Data Lake; Storage format; Data copy for all engines; Security model; CI/CD; Monitoring hub; Data Hub; Governance & compliance
Persona Centric Experiences
Data Integration
Data Engineering
Data Warehouse
Real-Time Analytics
Data Science
Power BI
AI Assisted Creation in Microsoft Fabric
• The Fabric platform will include a built-in Azure OpenAI-based assistant that will serve all the workloads
• The first GPT-based feature is already shipping in Power BI: NL2DAX, which creates DAX calculations from natural language prompts
• A major ramp-up toward pervasive, product-wide Azure OpenAI-based assistance is ongoing
Diagram: The Intelligent Data Fabric (Data Integration, Data Engineering, Data Warehousing, Real-Time Analytics, Data Science, Business Intelligence; AI Assisted Shared Workspaces, Universal Compute Capacities, OneSecurity, OneLake)
Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture
OneLake for all Data: “The OneDrive for Data”
• A single SaaS lake for the whole organization
• Provisioned automatically with the tenant
• All workloads automatically store their data in the OneLake workspace folders
• All the data is organized in an intuitive hierarchical namespace
• The data in OneLake is automatically indexed for discovery, MIP labels, lineage, PII scans, sharing, governance, and compliance
Diagram: Data Warehousing, Data Engineering, Data Integration, Data Science, Real-Time Analytics, and Business Intelligence on top of OneLake Storage (intelligent data fabric)
One Copy for all computes: real separation of compute and storage
• All the compute engines store their data automatically in OneLake
• The data is stored in a single common format
• Delta – Parquet, an open standards format, is the storage format for all tabular data in Fabric
• Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export
• All the compute engines have been fully optimized to work with Delta Parquet as their native format
• A shared universal security model is enforced across all the engines
Diagram: serverless compute (T-SQL, Spark, KQL, Analysis Services) for Data Warehousing, Data Engineering, Data Integration, Data Science, Real-Time Analytics, and Business Intelligence over OneLake Storage, where Finance, Customers 360, Service Telemetry, and Business KPIs are each stored in Delta – Parquet format
One Copy for all computes: universal security makes it real
• All the compute engines store their data automatically in OneLake
• The data is stored in a single common format
• Delta – Parquet, an open standards format, is the storage format for all tabular data in Fabric
• Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export
• All the compute engines have been fully optimized to work with Delta Parquet as their native format
• A shared universal security model is enforced across all the engines
Diagram: serverless compute (T-SQL, Spark, KQL, Analysis Services) reaches OneLake Storage through OneSecurity; Finance, Customers 360, Service Telemetry, and Business KPIs are each stored in Delta – Parquet format
Taking One Copy to the next level: Shortcuts
• Sharing data in OneLake is as easy as sharing files in OneDrive, removing the need for data duplication
• With shortcuts, data throughout OneLake can be composed together without any data movement
• Shortcuts also allow instant linking of data already existing in Azure and in other clouds, without any data duplication or movement, making OneLake the first multi-cloud data lake
• With support for industry standard APIs, OneLake data can be directly accessed by any application or service
Diagram: Finance, Customers 360, Service Telemetry, and Business KPIs in OneLake Storage, with shortcuts to Azure, Amazon, and Google
Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user
Office Integration
Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance
Regulatory compliance
• Fabric will be available in every Azure region
• Data at rest: compliant with the EU Data Boundary (EUDB) and other single-geo data residency regulations
• Multi-geo capacities allow control over content storage location in most Azure data centers worldwide
Microsoft pledges support for EU Data Boundary
Access control: defining user access via workspace roles and sharing
• Fabric workspace roles define default permissions on workload items on the Control Plane
• Workload item permissions can be modified and managed via sharing
• On the Data Plane, Universal Security defines access policies on the Delta tables directly, and all workload compute engines will respect such policies
Microsoft Fabric
• Complete Analytics Platform: Everything, unified; SaaS Solution; Low Code Plus Pro Dev
• Lake-centric and Open: OneLake; One Copy; Always Synced
• Empower Every Office User: Familiar and Intuitive; Built Into Office 365; Insight to Action
• Persistent Security and Governance: End-to-End Visibility; Always Governed; Secure by Default
Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance Deep Dive into Data Engineering • Data pipelines and movement
The choice of tools for your data transformations • Code-based pipelines: Notebooks, Spark Job Definitions • UI-based pipelines: Data pipelines, Data Flows
No-code data pipelines
Run data flows on the cloud
Write and run Notebooks from web & VS Code
Schedule your data transformations with ease
Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance Deep Dive into Data Engineering • Data pipelines and movement • Data storage and architecture
OneLake • Unified lakehouse • Based on open standards • Accessible from any workload
OneLake working with both files and tables
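A lakehouse keeps unstructured files and Delta tables side by side, and both are addressed through ADLS Gen2 style URIs. The sketch below only assembles such URIs. The workspace name "Sales", the lakehouse name "WWI", and the helper `onelake_uri` are hypothetical. The URI layout follows the OneLake convention documented on Microsoft Learn and should be checked against the current docs before use.

```python
ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"  # ADLS Gen2 compatible endpoint


def onelake_uri(workspace: str, lakehouse: str, area: str, path: str = "") -> str:
    """Build an abfss:// URI for a location inside a lakehouse.

    `area` selects the lakehouse section: "Files" for raw files,
    "Tables" for Delta tables. Names are used as-is; workspaces with
    spaces or special characters may need GUIDs instead (assumption).
    """
    if area not in ("Files", "Tables"):
        raise ValueError("area must be 'Files' or 'Tables'")
    # Drop empty segments so leading or trailing slashes in `path` are harmless.
    segments = [f"{lakehouse}.Lakehouse", area] + [p for p in path.split("/") if p]
    return f"abfss://{workspace}@{ONELAKE_HOST}/" + "/".join(segments)


# Hypothetical names, for illustration only.
print(onelake_uri("Sales", "WWI", "Tables", "dim_customer"))
print(onelake_uri("Sales", "WWI", "Files", "raw/2024/orders.csv"))
```

In a Fabric notebook, a URI of this shape is what a Spark reader would typically be pointed at.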
Browse your OneLake from Windows Explorer
Copy data from a wide range of services…
… and even more with Data Flows
Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance Deep Dive into Data Engineering • Data pipelines and movement • Data storage and architecture • Delivering data to data analysts and data scientists
Access lakehouse data from Notebooks and Spark Jobs
Expose Datasets ready for Power BI
Use Lakehouse data from Excel
Access your Lakehouse data from any SQL Client
Organize your data through workspaces & shortcuts
From raw data to certified datasets
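One common way to move from raw data to certified datasets is the Medallion layering (Bronze, Silver, Gold) listed in the references. The sketch below imitates that flow in plain Python rather than Spark. The sample rows, column names, and the `to_silver` and `to_gold` helpers are all hypothetical and only illustrate the kind of work each layer does.

```python
from collections import defaultdict
from typing import Dict, List

# Bronze: rows as they landed, with typical defects (hypothetical sample data).
bronze: List[Dict[str, str]] = [
    {"order_id": "1", "customer": "Tailspin Toys", "amount": "120.50"},
    {"order_id": "2", "customer": " tailspin toys ", "amount": "79.50"},
    {"order_id": "3", "customer": "Wingtip Toys", "amount": ""},          # missing amount
    {"order_id": "2", "customer": " tailspin toys ", "amount": "79.50"},  # duplicate row
]


def to_silver(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Clean and conform: drop incomplete rows, dedupe by order_id, fix types."""
    seen = set()
    cleaned = []
    for row in rows:
        if not row["amount"] or row["order_id"] in seen:
            continue
        seen.add(row["order_id"])
        cleaned.append({
            "order_id": int(row["order_id"]),
            "customer": row["customer"].strip().title(),
            "amount": float(row["amount"]),
        })
    return cleaned


def to_gold(rows: List[Dict[str, object]]) -> Dict[str, float]:
    """Aggregate to a business-ready view: total sales per customer."""
    totals: Dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

In Fabric, each layer would usually be a set of Delta tables in a lakehouse, with the Gold layer exposed to Power BI.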
Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance Deep Dive into Data Engineering • Data pipelines and movement • Data storage and architecture • Delivering data to data analysts and data scientists DEMO
Architecture
Data model: Wide World Importers (WWI). See the Wide World Importers sample databases for Microsoft SQL.
Data and Transformation flow
References
• Introduction: End-to-End Analytics in Microsoft Fabric
• Lakehouse: Get Started Lakehouses
• Spark on Lakehouse: Use Apache Spark in Lakehouse
• Work with Delta: Delta Lake Tables in Microsoft Fabric
• Data Factory Pipelines: Pipelines, Activities, Templates
• Data Warehouses: Get started with Data Warehouses
• Real-Time Analytics: Analyze real-time data
• Data Science: Get started with data science in Microsoft Fabric
• Administer: Administration, Security, and Govern data in Microsoft Fabric
• Medallion Architecture: Design Fabric Medallion Architecture with Bronze, Silver and Gold layers of Lakehouse
• DataFlow Gen2: Ingest with Dataflows in Microsoft Fabric
• Data Analysis with Kusto Query Language: Explore the fundamentals of data analysis
• Azure Data Engineer: free online training from Azure
Slides Q&A AND SOCIAL ADS X @philg0ld #NashvilleDataEngineering http://linkedin.com/in/philg0ld
APPENDIX
Sign up for a free Fabric trial: see "Fabric trial - Microsoft Fabric" on Microsoft Learn
Microsoft Fabric concepts and licenses
DAX overview • Filter context • Row context • Calculated columns • Measure • Power BI Desktop • SQL Server Analysis Services
Power BI Architecture - Fabric
What is Delta Lake?
• Open-source storage framework that enables building a Lakehouse architecture with compute engines, including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python.
• ACID-compliant storage layer that runs on top of cloud object stores such as MinIO, Hadoop HDFS, Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
• Provides scalable metadata handling for petabyte-scale tables with billions of partitions and files.
• Provides time travel: access or revert to earlier versions of data for audits, rollbacks, or reproducing results.
• Production-ready and battle-tested in over 10,000 production environments.
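Delta Lake gets its ACID behavior and time travel from an ordered transaction log: each commit records which data files were added or removed, and replaying the log up to a given version reconstructs the table at that version. The toy model below imitates only that idea in plain Python. `ToyDeltaLog` and the file names are hypothetical, this is not the Delta Lake API, and real logs also carry metadata, protocol, and checkpoint information.

```python
from typing import Dict, Iterable, List, Set


class ToyDeltaLog:
    """Toy stand-in for a Delta transaction log (not the real Delta Lake API)."""

    def __init__(self) -> None:
        # commits[i] holds the actions written by version i.
        self.commits: List[List[Dict[str, str]]] = []

    def commit(self, add: Iterable[str] = (), remove: Iterable[str] = ()) -> int:
        """Append one all-or-nothing commit and return its version number."""
        actions = [{"op": "remove", "file": f} for f in remove]
        actions += [{"op": "add", "file": f} for f in add]
        self.commits.append(actions)
        return len(self.commits) - 1

    def files_as_of(self, version: int) -> Set[str]:
        """Replay commits 0..version to find the live data files at that version."""
        if not 0 <= version < len(self.commits):
            raise ValueError(f"unknown version {version}")
        live: Set[str] = set()
        for actions in self.commits[: version + 1]:
            for action in actions:
                if action["op"] == "add":
                    live.add(action["file"])
                else:
                    live.discard(action["file"])
        return live


log = ToyDeltaLog()
v0 = log.commit(add=["part-0000.parquet"])
v1 = log.commit(add=["part-0001.parquet"])
# A rewrite removes the old files and adds the new one in a single commit.
v2 = log.commit(add=["part-0002.parquet"],
                remove=["part-0000.parquet", "part-0001.parquet"])

print(sorted(log.files_as_of(v1)))  # table state before the rewrite
print(sorted(log.files_as_of(v2)))  # current table state
```

Because old files are only marked as removed rather than rewritten in place, earlier versions stay readable. In Spark, the Delta reader's `versionAsOf` option serves the same purpose as `files_as_of` here.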
Delta Lake ACID Implementation: From raw data to certified datasets