
Microsoft Fabric as a Data Engineering platform

Written by CloudNation | Jun 17, 2024 12:30:59 PM

Microsoft states that Fabric is an all-in-one analytics solution for enterprises that covers everything from data movement to data science, Real-Time Analytics, and business intelligence. It can be seen as a kind of operating system on which different data tools reside. This blog will evaluate from a platform perspective how Microsoft Fabric can be used to implement an end-to-end data engineering solution.

Architecture

The biggest advantage of Fabric is that it is a Software-as-a-Service (SaaS) solution. This means that with Fabric, the end-user does not need to deploy or maintain any cloud infrastructure. This is a huge advantage, but it comes with some drawbacks, as you will read in the following sections.

Figure 1: Microsoft Fabric SaaS Foundation

A unique selling point of Fabric is that all Fabric services are built on the same data lake, named OneLake. One of the advantages of using OneLake is that Power BI can directly use the data that ETL pipelines have ingested into a Lakehouse or Warehouse. This eliminates the need to use DirectQuery to retrieve data in real time, or to use Import to store snapshots of the data in a Power BI dataset. The feature that enables users to use real-time data from OneLake directly in Power BI reports is called Direct Lake.
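To make the shared data lake idea concrete, the sketch below builds the OneLake address of a Lakehouse table. It assumes the ABFS-style addressing that OneLake exposes (workspace, then item, then path); the workspace, Lakehouse and table names are hypothetical and only serve as an illustration.

```python
def onelake_table_uri(workspace: str, lakehouse: str, table: str) -> str:
    """Build the ABFS URI of a table stored in a Lakehouse on OneLake.

    Assumes the layout abfss://<workspace>@<onelake host>/<item>.Lakehouse/Tables/<table>.
    """
    return (
        f"abfss://{workspace}@onelake.dfs.fabric.microsoft.com/"
        f"{lakehouse}.Lakehouse/Tables/{table}"
    )


# Hypothetical names, used for illustration only.
uri = onelake_table_uri("sales-dev", "bronze", "orders")
print(uri)
```

Because every Fabric service resolves the same address, a table written by an ETL pipeline is the same physical data that a Power BI report reads.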


Figure 2: DirectQuery, Import & DirectLake

Networking

As mentioned before, Microsoft Fabric is a SaaS service, meaning that all infrastructure is managed by Microsoft. This also means that your data will live outside of the company's internal network. A question that many companies have is: Where does my data get stored? The short answer is: Somewhere in Azure. Fabric workspaces can be connected to Fabric capacities. These capacities are Azure resources that are hosted within a certain region, but not inside an Azure VNET in the customer's subscription.

Another possible concern might be that the data is available over the public internet. Luckily, Microsoft just released Azure Private Link support for Fabric. With Azure Private Link, customers can restrict network traffic to Fabric. This means that Fabric can be cut off from the public internet and that only traffic from the internal network is allowed. Note that Azure Private Link support is still in public preview.

Besides Azure Private Link, Microsoft also just released two methods to connect to data that is hosted in a self-managed storage service, namely the VNET Data Gateway and managed private endpoints. The VNET Data Gateway does incur some additional costs and currently only supports Dataflows Gen2, Power BI Semantic Models and Power BI Paginated Reports. The managed private endpoints are included in the price of the capacity; however, they are only supported on F64 or higher SKUs. Note that the capacity can be paused and resumed to save costs, for example when the data only needs to be processed and accessible during office hours.
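The office-hours idea can be sketched as a small scheduling check. The snippet below is a minimal illustration that assumes office hours of 08:00 to 18:00 on weekdays; it only decides whether the capacity should be running, and the actual pause or resume action (for example through the Azure portal or an automation job) is deliberately left out.

```python
from datetime import datetime, time

# Assumed office hours, adjust to your own situation.
OFFICE_START = time(8, 0)
OFFICE_END = time(18, 0)
OFFICE_DAYS = range(0, 5)  # Monday=0 .. Friday=4


def capacity_should_run(now: datetime) -> bool:
    """Return True when 'now' falls inside the assumed office hours."""
    return now.weekday() in OFFICE_DAYS and OFFICE_START <= now.time() < OFFICE_END


def weekly_running_fraction() -> float:
    """Fraction of the week the capacity runs under this schedule."""
    hours_per_day = OFFICE_END.hour - OFFICE_START.hour
    return hours_per_day * len(OFFICE_DAYS) / (24 * 7)


print(capacity_should_run(datetime(2024, 6, 17, 12, 30)))  # a Monday at midday
print(round(weekly_running_fraction(), 2))
```

With these assumed hours the capacity runs 50 of the 168 hours in a week, which is roughly 30% of the time. How much this saves depends on how the capacity is billed.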

DTAP cycle

Developers generally want a lot of flexibility in how they can access data, to minimize dependencies and to avoid being obstructed while developing new features. Allowing that level of flexibility in a production environment is risky, and the constant risk of breaking production would take away the carefree workflow of developers. On the other hand, when building and maintaining reliable production workloads, it is of utmost importance that components are rigorously tested before being released to production. It is therefore a best practice to separate the development, test, acceptance and production environments. This allows developers to freely edit or create new items without the risk of interfering with production data, while still being able to discover potential bugs before releasing the items to production.

In Fabric, one can create different workspaces to separate environments. The workspaces host Fabric items like Lakehouses and Warehouses, which in turn host the data. However, this does not mean that items from one workspace cannot access data from another workspace. For example, a Dataflow in the development workspace can write data to a Lakehouse in the production workspace. There are no network restrictions that can separate the data of different workspaces or environments. The only way to ensure users cannot write to items in a production workspace is by managing the permissions of the item or workspace.
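Since nothing at the network level stops a development item from targeting production, a team could add a defensive check in its own code in addition to managing permissions. The sketch below is such a check under stated assumptions: it assumes OneLake ABFS URIs of the form abfss://<workspace>@<host>/..., and the function and workspace names are hypothetical. It is a convention, not a security boundary, and does not replace correct permissions.

```python
from urllib.parse import urlsplit


def workspace_of(onelake_uri: str) -> str:
    """Return the workspace part of an abfss://<workspace>@<host>/... URI."""
    parts = urlsplit(onelake_uri)
    if parts.scheme != "abfss" or not parts.username:
        raise ValueError(f"not a OneLake ABFS URI: {onelake_uri}")
    return parts.username


def assert_same_environment(target_uri: str, current_workspace: str) -> None:
    """Refuse to continue when the write target lives in another workspace."""
    target = workspace_of(target_uri)
    if target != current_workspace:
        raise PermissionError(
            f"item in '{current_workspace}' may not write to workspace '{target}'"
        )


# Hypothetical workspace and Lakehouse names.
prod_table = "abfss://sales-prod@onelake.dfs.fabric.microsoft.com/gold.Lakehouse/Tables/orders"
try:
    assert_same_environment(prod_table, current_workspace="sales-dev")
except PermissionError as err:
    print(err)
```

A check like this only catches accidental cross-environment writes made through code that calls it. A user with write permission on the production workspace can still bypass it.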

When working with separate environments, it is important to make sure that the differences between environments are kept to a minimum. Microsoft Fabric has integrated Deployment pipelines to propagate new or edited items from one workspace to the next in the DTAP cycle. This means that there is no need to write YAML pipelines, as is needed for the deployment of Synapse artifacts. Deployment pipeline rules can be configured in each consecutive stage of the Deployment pipeline, to ensure that each item is linked to the corresponding items in that workspace, for example by connecting a Notebook to the default Lakehouse of that workspace. At the time of writing, Dataflows Gen2 are not supported in Deployment pipelines. Furthermore, Deployment pipeline support for Lakehouses, Warehouses, Data pipelines, Notebooks and Datamarts is still in preview.
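Conceptually, a deployment rule is a per-stage override that is applied while an item is copied to the next workspace. The sketch below models that idea in plain Python; it is not the Fabric API, and the Notebook class, stage names and Lakehouse names are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Notebook:
    name: str
    default_lakehouse: str


# Hypothetical rule set: the Lakehouse a Notebook should attach to in each stage.
LAKEHOUSE_RULES = {
    "development": "lh_dev",
    "test": "lh_test",
    "acceptance": "lh_acc",
    "production": "lh_prod",
}


def deploy(notebook: Notebook, target_stage: str) -> Notebook:
    """Copy the notebook to the target stage and apply that stage's rule."""
    return replace(notebook, default_lakehouse=LAKEHOUSE_RULES[target_stage])


dev_notebook = Notebook(name="clean_orders", default_lakehouse="lh_dev")
prod_notebook = deploy(dev_notebook, "production")
print(prod_notebook)
```

The item itself stays identical across stages; only the binding to the environment-specific Lakehouse changes, which keeps the differences between environments to a minimum.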

 

Git integration

Having proper version control is another essential capability for maintaining production workflows. Git enables developers to collaborate using branches and to control what goes to production by using pull requests. Furthermore, Git allows for quick debugging and rollback in case of an incident. Microsoft Fabric integrates with Azure DevOps repos to store the definition files of various items. At the time of writing, Git integration does not support Warehouses, Dataflows Gen2 and Dashboards. Microsoft mentions that the estimated release timeline for dataflow support is Q2 2024. However, the documentation does not clearly state whether this includes Dataflows Gen2. There are currently no announced investment areas for Warehouse and Dashboard Git support.

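Item type            Git integration
Lakehouses           Supported
Warehouses           Not supported
Data pipelines       Supported
Notebooks            Supported
Dataflows Gen2       Not supported
Semantic models      Supported
Reports              Supported
Paginated reports    Supported
Dashboards           Not supported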
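To see what this means for a given workspace, one could list the items that fall outside version control. The sketch below assumes the three unsupported item types named above (Warehouses, Dataflows Gen2 and Dashboards) and a hypothetical inventory of (name, type) pairs; it does not call any Fabric API.

```python
# Item types without Git integration at the time of writing, per the text above.
UNSUPPORTED_IN_GIT = {"Warehouse", "Dataflow Gen2", "Dashboard"}


def untracked_items(items: list[tuple[str, str]]) -> list[str]:
    """Return the names of items whose type cannot be committed to Git."""
    return [name for name, item_type in items if item_type in UNSUPPORTED_IN_GIT]


# Hypothetical workspace inventory.
workspace_items = [
    ("bronze", "Lakehouse"),
    ("ingest_orders", "Dataflow Gen2"),
    ("clean_orders", "Notebook"),
    ("sales_dwh", "Warehouse"),
    ("sales_overview", "Report"),
]
print(untracked_items(workspace_items))
```

Items on this list need another form of backup or documentation, since their definitions cannot be reviewed through pull requests or rolled back through Git.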

 

Conclusion

So, from a data engineering platform perspective, when should one use Microsoft Fabric and when should one use more established data engineering platforms such as Synapse Analytics or Azure Databricks? 

In general, Microsoft Fabric is an amazing platform that accelerates development and helps you gain quick insights into your data. It is incredibly powerful when different personas, like data engineers, data scientists and Power BI developers, collaborate on the same platform to build end-to-end data solutions leveraging the OneLake architecture.

On the other hand, many features of the platform are still unsupported, in preview or only partly generally available. Furthermore, the interface is clearly still under development. Most things work very intuitively, but some interfaces seem buggy and error messages do not always give clear directions on where to look. However, in my opinion, the biggest disadvantage of Fabric is the limited separation between environments. Having only one level of security, namely permissions, between development and production significantly increases the risk of production incidents.

When you start developing from scratch and do not have strict production deadlines, you could assume that most Fabric items will have both Git integration support and Deployment pipeline support in the not-so-distant future. The lack of proper separation between environments, however, is a topic that probably will not be solved. So, if having network-separated environments is important to you, rather than relying only on setting the right permissions to prevent users from interfering with production data, Microsoft Fabric is not the platform for your data engineering practices.