Leveraging Microsoft Fabric for Advanced Data Engineering Solutions

Philip Goldman, Field Sr. Data Solution Architect, EPAM

• Introducing Microsoft Fabric
• Deep Dive into Data Engineering
• DEMO

Introducing Microsoft Fabric for Data Engineering

Microsoft Fabric
• Complete Analytics Platform: Everything, unified; Unified SaaS Solution; Low Code Plus Pro Dev
• Lake-centric and Open: OneLake; One Copy; Always Synced
• Empower Every Office User: Familiar and Intuitive; Built Into Office 365; Insight to Action
• Persistent Security and Governance: End-to-End Visibility; Always Governed; Secure by Default

Introducing Microsoft Fabric for Data Engineering • Complete analytics platform


Scalable analytics are complex and fragmented
• Every analytics project has many subsystems
• Every subsystem needs a different class of product
• Products often come from multiple vendors
• Integration at scale across products is complex, fragile, and expensive
“Simplify, I am the Chief Data Officer and don’t want to be the Chief Integration Officer.”
Every CDO, Every Enterprise

A silver lining?
• Analytics systems have very predictable patterns
• Microsoft has all the products with the right scale needed to build a complete analytics system
Data Integration, Data Engineering, Data Warehousing, Real-Time Analytics, Data Science, Data Lake, Governance and Administration, Business Intelligence


Still far too complex
• Many products
• Different experiences
• Proprietary and open
• Dedicated and serverless
• PaaS and SaaS
• Different business models
• Steep learning curves
• Deep expertise needed
• High integration effort
Products shown: Power BI, Synapse, Kusto, Azure AI, Data Factory, Spark

Microsoft Fabric: Unified analytics fabric
• End-to-end analytics data fabric
• From the data lake to the business user
Data Integration, Data Lake, Spark Engines, Data Warehouse, Real-Time Analytics, Data Science, Business Intelligence, Governance

Microsoft Fabric: The Intelligent Data Fabric
Data Integration, Data Engineering, Data Warehousing, Real-Time Analytics, Data Science, Business Intelligence
AI Assisted, Shared Workspaces, Universal Compute Capacities, OneSecurity, OneLake
Single…
• Onboarding and trials
• Sign-on
• Navigation model
• UX model
• Workspace organization
• Collaboration experience
• Data Lake
• Storage format
• Data copy for all engines
• Security model
• CI/CD
• Monitoring hub
• Data Hub
• Governance & compliance

Persona Centric Experiences

Data Integration

Data Engineering

Data Warehouse

Real-Time Analytics

Data Science

Power BI

AI Assisted Creation in Microsoft Fabric
• The Fabric platform will include a built-in Azure OpenAI-based assistant that will serve all the workloads
• The first GPT-based feature is already shipping in Power BI: NL2DAX, which creates DAX calculations from natural language prompts
• Ongoing major ramp-up for pervasive, product-wide AI assistance based on Azure OpenAI (AOAI)

Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture

OneLake for all Data: “The OneDrive for Data”
• A single SaaS lake for the whole organization
• Provisioned automatically with the tenant
• All workloads automatically store their data in the OneLake workspace folders
• All the data is organized in an intuitive hierarchical namespace
• The data in OneLake is automatically indexed for discovery, MIP labels, lineage, PII scans, sharing, governance, and compliance
Diagram labels: Data Warehousing, Data Engineering, Data Integration, Data Science, Real-Time Analytics, Business Intelligence; OneLake Storage; Intelligent data fabric

One Copy for all computes: Real separation of compute and storage
• All the compute engines store their data automatically in OneLake
• The data is stored in a single common format
• Delta Parquet, an open standards format, is the storage format for all tabular data in Fabric
• Once data is stored in the lake, it is directly accessible by all the engines without needing any import/export
• All the compute engines have been fully optimized to work with Delta Parquet as their native format
• A shared universal security model is enforced across all the engines
Diagram labels: Data Warehousing (T-SQL), Data Engineering (Spark), Data Integration, Data Science, Real-Time Analytics (KQL), Business Intelligence (Analysis Services); Serverless Compute; OneLake Storage holding Finance, Customers 360, Service Telemetry, and Business KPIs, each in Delta Parquet format

One Copy for all computes: Universal security makes it real
• OneSecurity: a shared universal security model is enforced across all the engines

Taking One Copy to the next level: Shortcuts
• Sharing data in OneLake is as easy as sharing files in OneDrive, removing the need for data duplication
• With shortcuts, data throughout OneLake can be composed together without any data movement
• Shortcuts also allow instant linking of data already existing in Azure and in other clouds, without any data duplication or movement, making OneLake the first multi-cloud data lake
• With support for industry-standard APIs, OneLake data can be directly accessed by any application or service
Diagram labels: Data Integration, Data Engineering, Data Warehousing, Real-Time Analytics, Data Science, Business Intelligence; OneLake Storage (Finance, Customers 360, Service Telemetry, Business KPIs); Azure, Amazon, Google
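
The "industry-standard APIs" point can be illustrated with addressing. OneLake accepts Azure Data Lake Storage Gen2-style URIs, so a tool that can read ADLS Gen2 can generally be pointed at a OneLake path. The sketch below only builds such URIs. The workspace name `SalesWorkspace`, the lakehouse name `WWI`, and the table name `fact_sale` are hypothetical. Names containing spaces or special characters may need encoding or GUIDs, which this sketch does not handle.

```python
# Minimal sketch: building ADLS Gen2-style addresses for OneLake items.
# Workspace, item, and table names used below are hypothetical examples.

ONELAKE_HOST = "onelake.dfs.fabric.microsoft.com"


def onelake_abfss_uri(workspace: str, item: str, item_type: str, path: str) -> str:
    """Return an abfss:// URI, the form typically used from Spark."""
    return f"abfss://{workspace}@{ONELAKE_HOST}/{item}.{item_type}/{path.lstrip('/')}"


def onelake_https_url(workspace: str, item: str, item_type: str, path: str) -> str:
    """Return the https:// form of the same address, for REST-style clients."""
    return f"https://{ONELAKE_HOST}/{workspace}/{item}.{item_type}/{path.lstrip('/')}"


if __name__ == "__main__":
    print(onelake_abfss_uri("SalesWorkspace", "WWI", "Lakehouse", "Tables/fact_sale"))
    print(onelake_https_url("SalesWorkspace", "WWI", "Lakehouse", "Files/raw/orders.csv"))
```

Both forms address the same single copy of the data; only the client protocol differs.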

Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user

Office Integration

Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance

Regulatory compliance
• Fabric will be available in every Azure region
• Data at rest: compliant with EUDB and other single-geo data residency regulations
• Multi-geo capacities allow control over content storage location in most Azure data centers worldwide
Microsoft pledges support for EU Data Boundary

Access control: Defining user access via workspace roles and sharing
• Fabric workspace roles define default permissions on workload items on the Control Plane
• Workload item permissions can be modified and managed via sharing
• On the Data Plane, Universal Security defines access policies on the Delta tables directly, and all workload compute engines will respect such policies


Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance
Deep Dive into Data Engineering • Data pipelines and movement

The choice of tools for your data transformations
• Code-based pipelines: Notebooks, Spark Job Definitions
• UI-based pipelines: Data pipelines, Data Flows

No-code data pipelines

Run data flows on the cloud

Write and run Notebooks from web & VS Code


Schedule your data transformations with ease

Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance
Deep Dive into Data Engineering • Data pipelines and movement • Data storage and architecture

OneLake
• Unified lakehouse
• Based on open standards
• Accessible from any workload

OneLake working with both files and tables

Browse your OneLake from Windows Explorer

Copy data from a wide range of services…

… and even more with Data Flows

Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance
Deep Dive into Data Engineering • Data pipelines and movement • Data storage and architecture • Delivering data to data analysts and data scientists

Access lakehouse data from Notebooks and Spark Jobs
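
A minimal sketch of what reading lakehouse data from a notebook can look like. It assumes a Fabric notebook with a default lakehouse attached, where the runtime provides the `spark` session and relative paths under `Tables/` resolve to that lakehouse's managed Delta tables. The table name `fact_sale` is a hypothetical example, and the helper names are illustrative, not part of any Fabric API.

```python
# Minimal sketch, assuming a Fabric notebook with a default lakehouse attached.
# The `spark` session is supplied by the notebook runtime; it is passed in
# explicitly here so the helpers stay easy to test outside a notebook.
from typing import Any


def table_path(table_name: str) -> str:
    """Relative path of a managed Delta table in the attached lakehouse."""
    return f"Tables/{table_name}"


def load_lakehouse_table(spark: Any, table_name: str) -> Any:
    """Read a lakehouse Delta table and return it as a Spark DataFrame."""
    return spark.read.format("delta").load(table_path(table_name))


# In a notebook cell (hypothetical table name):
#   df = load_lakehouse_table(spark, "fact_sale")
#   df.show(5)
```

Because the table is stored as Delta Parquet in OneLake, the same data is available to the other engines without an export step.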

Expose Datasets ready for Power BI

Use Lakehouse data from Excel


Access your Lakehouse data from any SQL Client

Organize your data through workspaces & shortcuts

From raw data to certified datasets
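
One common way to get from raw data to certified datasets is the medallion layering (bronze, silver, gold) named in the References. The sketch below shows the idea in plain Python rather than Spark, so it runs anywhere. The column names, cleaning rules, and sample rows are all assumptions made for illustration, not the demo's actual logic.

```python
# Minimal sketch of bronze -> silver -> gold refinement in plain Python.
# Column names, cleaning rules, and sample rows are illustrative assumptions.
from collections import defaultdict
from typing import Dict, List


def to_silver(bronze_rows: List[dict]) -> List[dict]:
    """Clean raw rows: drop rows without a key or a numeric amount, de-duplicate."""
    seen = set()
    silver = []
    for row in bronze_rows:
        order_id = row.get("order_id")
        if not order_id or order_id in seen:
            continue
        try:
            amount = float(row["amount"])
        except (KeyError, TypeError, ValueError):
            continue
        seen.add(order_id)
        customer = str(row.get("customer") or "").strip().title()
        silver.append({"order_id": order_id, "customer": customer, "amount": amount})
    return silver


def to_gold(silver_rows: List[dict]) -> Dict[str, float]:
    """Aggregate cleaned rows into a business-level total per customer."""
    totals: Dict[str, float] = defaultdict(float)
    for row in silver_rows:
        totals[row["customer"]] += row["amount"]
    return dict(totals)


bronze = [
    {"order_id": "A1", "customer": " alice ", "amount": "10.5"},
    {"order_id": "A1", "customer": " alice ", "amount": "10.5"},  # duplicate
    {"order_id": "A2", "customer": "BOB", "amount": "n/a"},       # bad amount
    {"order_id": "A3", "customer": "alice", "amount": "4.5"},
    {"order_id": None, "customer": "carol", "amount": "1"},       # missing key
]

if __name__ == "__main__":
    print(to_gold(to_silver(bronze)))
```

In a lakehouse, each layer would typically be its own set of Delta tables, with only the gold layer promoted or certified for analysts.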

Introducing Microsoft Fabric for Data Engineering • Complete analytics platform • Lake-centric and open architecture • Empower every office user • Persistent security and governance
Deep Dive into Data Engineering • Data pipelines and movement • Data storage and architecture • Delivering data to data analysts and data scientists
DEMO

Architecture

Data model: Wide World Importers (WWI) data model. See Wide World Importers sample databases for Microsoft SQL.

Data and Transformation flow

References
• Introduction: End-to-End Analytics in Microsoft Fabric
• Lakehouse: Get Started Lakehouses
• Spark on Lakehouse: Use Apache Spark in Lakehouse
• Work with Delta: Delta Lake Tables in Microsoft Fabric
• Data Factory Pipelines: Pipelines, Activities, Templates
• Data Warehouses: Get started with Data Warehouses
• Real-Time Analytics: Analyze real-time data
• Data Science: Get started with data science in Microsoft Fabric
• Administer: Administration, Security, and Govern data in Microsoft Fabric
• Medallion Architecture: Design Fabric Medallion Architecture with Bronze, Silver and Gold layers of Lakehouse
• DataFlow Gen2: Ingest with Dataflows in Microsoft Fabric
• Data Analysis with Kusto Query Language: Explore the fundamentals of data analysis
• Azure Data Engineer: free online training from Azure

Slides, Q&A, and Social Ads
X: @philg0ld
#NashvilleDataEngineering
http://linkedin.com/in/philg0ld

APPENDIX

Sign up for a free Fabric trial: Fabric trial - Microsoft Fabric | Microsoft Learn

Microsoft Fabric concepts and licenses

DAX overview
FILTER CONTEXT, CALCULATED COLUMNS, SQL SERVER ANALYSIS SERVICES, ROW CONTEXT, MEASURE, DAX, Power BI DESKTOP

1. DAX is a formula language, like the formula language in Excel, that uses functions
2. Create new information: columns, measures, calculations, formulas
3. DAX usage may depend on the design pattern:
• Work could be pushed to Power Query
• Work could be pushed to the database
• Work could also be pushed to an ETL tool
4. Great for time intelligence, for example, YTD, MTD, parallel period comparisons
5. Enhances the data model
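
To make the time-intelligence point concrete, here is what YTD and MTD totals compute, written as a plain Python sketch rather than DAX. In DAX these would typically be measures built with functions such as `TOTALYTD` and `TOTALMTD`, evaluated in the current filter context. The sketch ignores filter context and simply sums dated amounts. The function names and sample figures are made up for illustration.

```python
# Minimal sketch of year-to-date and month-to-date totals in plain Python.
# This mirrors the arithmetic only; it does not model DAX filter context.
from datetime import date
from typing import List, Tuple

Sale = Tuple[date, float]


def total_ytd(sales: List[Sale], as_of: date) -> float:
    """Sum amounts from January 1 of as_of's year through as_of, inclusive."""
    return sum(a for d, a in sales if d.year == as_of.year and d <= as_of)


def total_mtd(sales: List[Sale], as_of: date) -> float:
    """Sum amounts from the first day of as_of's month through as_of, inclusive."""
    return sum(
        a for d, a in sales
        if d.year == as_of.year and d.month == as_of.month and d <= as_of
    )


# Made-up sample data.
sales = [
    (date(2023, 12, 15), 100.0),
    (date(2024, 1, 10), 50.0),
    (date(2024, 2, 5), 70.0),
    (date(2024, 2, 20), 30.0),
]

if __name__ == "__main__":
    print(total_ytd(sales, date(2024, 2, 10)))
    print(total_mtd(sales, date(2024, 2, 10)))
```

The December 2023 sale is excluded from both totals because it falls in the prior year, and the February 20 sale is excluded because it is after the as-of date.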

Power BI Architecture - Fabric

What is Delta Lake?
• Open-source storage framework that enables building a Lakehouse architecture with compute engines, including Spark, PrestoDB, Flink, Trino, and Hive, and APIs for Scala, Java, Rust, Ruby, and Python.
• ACID-compliant storage layer that runs on top of storage systems such as MinIO, Hadoop HDFS, Amazon S3, Azure Data Lake Storage, and Google Cloud Storage.
• Provides scalable metadata handling for petabyte-scale tables with billions of partitions and files.
• Provides time travel: access or revert to earlier versions of data for audits, rollbacks, or reproducibility.
• Production-ready and battle-tested in more than 10,000 production environments.
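
Time travel works because a Delta table is a set of Parquet data files plus an ordered transaction log (`_delta_log/`) of JSON commits, each recording which files were added or removed. Replaying the log up to a given version yields the files that made up the table at that point. The sketch below is a simplified model of that replay, not the Delta Lake library: it handles only `add` and `remove` actions and ignores checkpoints, metadata, and protocol actions. The file names and history are hypothetical.

```python
# Simplified model of Delta Lake log replay (add/remove actions only).
# Not the real library; checkpoints, metadata, and protocol actions are ignored.
from typing import Dict, List, Set


def commit_file_name(version: int) -> str:
    """Commit files in _delta_log/ are named by a zero-padded 20-digit version."""
    return f"{version:020d}.json"


def files_at_version(commits: Dict[int, List[dict]], version: int) -> Set[str]:
    """Replay commits 0..version and return the data files in that snapshot."""
    if version not in commits:
        raise ValueError(f"unknown table version: {version}")
    live: Set[str] = set()
    for v in range(version + 1):
        for action in commits.get(v, []):
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


# Hypothetical history: two appends, then a rewrite of the first file.
history = {
    0: [{"add": {"path": "part-0000.parquet"}}],
    1: [{"add": {"path": "part-0001.parquet"}}],
    2: [
        {"remove": {"path": "part-0000.parquet"}},
        {"add": {"path": "part-0002.parquet"}},
    ],
}

if __name__ == "__main__":
    print(commit_file_name(2))
    print(sorted(files_at_version(history, 1)))
    print(sorted(files_at_version(history, 2)))
```

With Spark, the equivalent read of an earlier version is usually expressed as `spark.read.format("delta").option("versionAsOf", 1).load(path)`.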

Delta Lake Implementation (ACID): From raw data to certified datasets