Realizing a Data Mesh
Delta Lake and the Lakehouse architecture
As interest in Data Mesh grows, designing and implementing its qualities into existing data landscapes becomes a crucial step. We are creating a series in which we explore technologies that can be used to achieve Data Mesh's sociotechnical qualities; Databricks Delta Lake and the Lakehouse architecture is our first example.
By Jarvin Mutatiina
Data Mesh emphasizes the principles and benefits of a decentralized way of managing data across cross-functional business domains within an organization. Its key benefits are agility, the ability to move from idea to production within local business domains with fewer central bottlenecks, and scalability of the concept. Business domains can independently operationalize their own data via a central, self-service infrastructure that follows standardized governance guidelines. The resulting quality-conformed data, also known as a data product, is published and treated as a discoverable and self-explanatory product. Data Mesh is a sociotechnical approach that mitigates the operational and organizational challenges of centralized data landscapes. For more in-depth foundational information on what Data Mesh is, plus its benefits and barriers, please refer to our previous article.
Multiple technologies can be brought together to achieve a Data Mesh setup, and a number of vendors have explored how the principles of this popular concept can be realized with their software and cloud offerings. In this blog, we look at an example of how Databricks Delta Lake and its Lakehouse architecture can be utilized to achieve Data Mesh qualities.
Delta Lake is a data management and transactional storage layer that extends a data lake with reliability, quality, consistency, and improved performance. The Lakehouse architecture, in turn, extends conventional data management capabilities with a metadata and governance layer defined on top of an existing data lake. In addition to management features like data versioning and performance optimizations, Delta Lake as a metadata and governance layer acts as a single source of truth, enforcing data quality compliance. These aspects already address multiple business requirements around data governance and lineage, compliance, metadata management, and more. The combination of Delta Lake as a metadata layer and the general Lakehouse architecture is therefore a well-suited setup for organizations to extend their existing cloud data lakes with Data Mesh qualities.
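The "single source of truth" idea behind Delta Lake is an append-only transaction log: every change to a table is a numbered commit, and the current (or any historical) state is derived by replaying that log. The sketch below illustrates this concept in plain Python; it is not the real Delta Lake implementation (which stores commits as files under `_delta_log/`), and all file names are hypothetical.

```python
class DeltaLogSketch:
    """Toy append-only transaction log: each committed action is one numbered entry."""

    def __init__(self):
        self.entries = []

    def commit(self, action, data):
        """Append an atomic action; the new version number is the entry's index."""
        self.entries.append({"action": action, "data": data})
        return len(self.entries) - 1

    def state_at(self, version=None):
        """Replay the log up to `version` (inclusive) to reconstruct table state."""
        upto = len(self.entries) if version is None else version + 1
        state = {}
        for entry in self.entries[:upto]:
            if entry["action"] == "add":
                state.update(entry["data"])
            elif entry["action"] == "remove":
                for key in entry["data"]:
                    state.pop(key, None)
        return state


log = DeltaLogSketch()
log.commit("add", {"orders_2023.parquet": 120})      # version 0
log.commit("add", {"orders_2024.parquet": 80})       # version 1
log.commit("remove", {"orders_2023.parquet": None})  # version 2

print(log.state_at())   # current state: only orders_2024.parquet remains
print(log.state_at(0))  # historical state as of version 0
```

Because every reader and writer derives state from the same ordered log, multiple users see one consistent version of the table, which is the property the governance layer builds on.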
Designing and implementing Data Mesh, especially integrated with an existing data landscape, has been of interest to multiple clients, and Deloitte is taking an active role in multiple engagements to support on that front. Realizing a Data Mesh is a long-term, iterative process that requires change and innovation on the organizational, operational, architectural, and technical levels. Consequently, we are creating a series of articles to expound on the technical aspects and considerations of Data Mesh. In this series, we highlight a number of technologies and how their features can be utilized for the Data Mesh architectural concept.
Figure 1: Delta Lake overview. Source: delta.io
Key Features of Delta Lake and the Lakehouse architecture that can be utilized for a Data Mesh architecture
Delta Lake supports multiple Data Mesh qualities: data governance, quality and compliance, trustworthiness, data product lineage, and more, which demonstrates its suitability for Data Mesh. Below is a list of highlight features and how they enable Data Mesh:
- Schema enforcement – definition of the expected data structure and types to ensure standardized, consistent data quality during insertion. This feature supports Data Mesh's quality and governance aspects
- Scalable metadata handling – built on Spark's distributed data processing, so even the large metadata stemming from very large tables can be handled with ease
- Performance optimization – via data skipping, which reads only relevant data, and compaction, which merges small files into larger ones for improved read speeds
- ACID (Atomicity, Consistency, Isolation, Durability) transactions – data integrity through one transaction log that serves as the single source of truth, tracking all changes from multiple users. This feature supports the trustworthiness requirement for a data product
- Data versioning via time travel – the ability to reproduce, audit, and/or roll back historical data changes. This is an important factor in supporting data lineage
Figure 2: Architectural design of a Data Mesh implementation with Delta Lake & the Lakehouse architecture
Implementing Data Mesh qualities with Delta Lake
We implemented a proof of concept exploring some of the key features of Delta Lake and the Lakehouse architecture to demonstrate, on a practical level, how they can be utilized for a Data Mesh. The use case centers on a company with multiple cross-functional domains/departments that work with the same data, which currently resides in a cloud-based, general-purpose data lake. The key architectural components, as illustrated in figure 2, are:
- An existing data lake - a repository of raw data from diverse sources in different formats, with no defined governance on the data. This is similar to what most clients' existing data landscapes look like
- Domain teams - Ownership of the data is decentralized back to the respective domains that process and standardize it into data products
- Metadata and governance layer - Standardization of the data products in each domain team is supported by Delta Lake and the Lakehouse architecture where data products are developed under certain guidelines within three groups:
a. Bronze - raw data that may arrive in multiple formats and structures
b. Silver - filtered and cleaned data
c. Gold - aggregated data that is ready for business use cases
Within this layer, the Delta Lake features of schema enforcement, data versioning, and performance optimization are implemented, and additional governance guidelines can also be enforced
- Data catalog - the data products are then made available and discoverable in Azure Data Catalog, where users can access them in a self-service manner. The catalog also shows the associated data expert, the schema, a data sample, and a direct URL for using the data
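The bronze/silver/gold refinement described above can be sketched as a small pipeline: raw records are filtered and cleaned into a silver set, which is then aggregated into a business-ready gold view. The sketch below uses plain Python rather than Spark, and the table fields and values are hypothetical.

```python
# Bronze: raw, mixed-quality records as they might arrive in the lake.
bronze = [
    {"domain": "sales", "amount": "120.0", "region": "EU"},
    {"domain": "sales", "amount": "bad-value", "region": "EU"},  # malformed row
    {"domain": "sales", "amount": "80.5", "region": "US"},
]


def to_silver(records):
    """Filter and clean: drop records whose amount cannot be parsed."""
    silver = []
    for rec in records:
        try:
            silver.append({**rec, "amount": float(rec["amount"])})
        except ValueError:
            continue  # skip (or quarantine) malformed rows
    return silver


def to_gold(records):
    """Aggregate cleaned data into a business-ready view: revenue per region."""
    totals = {}
    for rec in records:
        totals[rec["region"]] = totals.get(rec["region"], 0.0) + rec["amount"]
    return totals


silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'EU': 120.0, 'US': 80.5}
```

In the proof of concept, each domain team owns this progression for its own data, and only the gold (and, where useful, silver) outputs are published as data products in the catalog.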
We continuously support our clients across industries in conceptualizing and adopting various aspects of Data Mesh to strengthen and future-proof their existing data landscapes. Data management is an ever-evolving field, and staying ahead of the curve is crucial for an operational and innovative advantage.
This proof of concept implements several Data Mesh concepts, with a focus on scalability, data quality and governance, decentralized data responsibility, and trustworthy data products. It also builds on the evolving data landscape, which is moving away from centralized solutions such as data warehouses and data lakes toward more agile, decentralized architectural concepts like Data Mesh. The organizational change aspects of Data Mesh are not highlighted here, but they are crucial to the success of such a paradigm shift. We are also exploring other technologies that satisfy Data Mesh qualities, e.g. Snowflake, AWS, Azure, and many more.