Posted: 08 Feb. 2022 7 min. read

Data Lakehouse

Navigating the dichotomy of Data Warehouse and Data Lake

With the ever-increasing data needs of multiple stakeholders and consumers in an enterprise, data architectures are evolving significantly to meet the rising demand. In this regard, specialists from the Data Architect teams at Deloitte explore how a storage architecture, namely the Data Lakehouse, enables an integrated, agile, and cost-effective approach to data management for enterprises.

Data Warehouses have been the answer to enterprise business intelligence needs for decades. They provide cleansed, standardized data enabled for specifically targeted analytics. The demand for end-user self-sufficiency, lower latency, and relief from soaring storage costs, along with the advent of streaming data, created the need for a storage layer that could accept data in a variety of formats at lower cost, and with it the opportunity for Data Lake architecture. The Data Lake put a massive storage layer at the forefront, with no schema enforcement and minimal standardization. While it addressed many challenges related to data ingestion and storage, it often caused serious bottlenecks for consumption, including limited transactional query capabilities, the inability to aggregate without heavy data workloads, and difficulties in establishing relationships between datasets. As a result, enterprises tend to maintain both: Data Lakes and Data Warehouses. They consolidate all sources of data in a Data Lake and later develop heavy pipelines to transform only the required data into their respective Data Warehouses or Data Marts.

In the current Age of With, where enterprises are focused on monetizing data and insights to their best advantage—the speed and agility of provisioning new data for decision-making are paramount. It is imperative to swiftly eliminate delays and noise in the data early in the architecture, avoiding cost overheads and inefficiencies.

The key constraints to note when using a combination of Data Lake and Data Warehouse are:

  • Time to realization, as the Data Lake accepts datasets in raw formats that must undergo a series of transformations before any meaningful, structured relation in the data can be established
  • Complex data pipelines and manual stewardship for curating data and loading it into the respective warehouses and consumption layers
  • Data management, including metadata management, developing processes to identify relevant versions, and resolving data quality issues, all of which require considerable effort and investment

 

Data Lakehouse: Embracing a cohesive approach

Rapid developments in Compute and Cloud have provided an opportunity for a new data architecture paradigm, one that allows transactional data processing capabilities (including structured query languages) directly on large volumes of raw data in their native and diverse formats, that is, in the sourcing layer rather than in curated or consumption layers, and that limits noise (unwanted) data without heavy data workloads such as Extraction, Transformation, and Loading (ETL). Technologies adopting this paradigm enable capabilities including, but not limited to, governance, time travel, lineage, and support for ACID (Atomicity, Consistency, Isolation, and Durability) properties. They allow enterprises to cater simultaneously to business intelligence users and data scientists, offering dynamic data frame capabilities with unconstrained access to data.
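To make this concrete, the following is a minimal sketch of the pattern, assuming Delta Lake on Apache Spark as one representative table format adopting this paradigm (the delta-spark package is assumed to be installed); the storage paths, table name, and columns are illustrative assumptions rather than a prescribed implementation.

    # Minimal Lakehouse sketch: land raw data in an open, ACID-enabled table
    # format and query it with standard SQL, with no separate warehouse load.
    # Assumes Delta Lake on Apache Spark; paths and names are illustrative.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("lakehouse-sketch")
        # Enable Delta Lake's ACID table format (delta-spark assumed installed).
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Ingest raw claims files in their native format into the sourcing layer.
    raw_claims = spark.read.json("s3://example-bucket/raw/claims/")  # assumed path
    (raw_claims.write
        .format("delta")
        .mode("append")
        .save("s3://example-bucket/lakehouse/claims"))

    # Register the same raw, governed table and run SQL directly on it.
    spark.sql("CREATE TABLE IF NOT EXISTS claims USING DELTA "
              "LOCATION 's3://example-bucket/lakehouse/claims'")
    spark.sql("SELECT claim_status, COUNT(*) AS n FROM claims "
              "GROUP BY claim_status").show()

The same table serves as both the storage and the query layer, which is what removes the separate warehouse-loading step described above.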

Data Lakehouse architecture drastically reduces the need for large-scale complex data pipelines to curate and standardize data—allowing a single centralized layer for all reporting, analytical and Artificial Intelligence/Machine Learning (AI/ML) needs.

 

Real-world application of Lakehouse architecture

Generally, health care and insurance companies process large volumes and a wide variety of data from internal and external applications to effectively predict risk and optimize costs. End users require access to governed and cataloged, yet not highly standardized, data, along with historical lineage and minimal data latency. The data architecture is expected to be centralized, agile, and cost-efficient for heavy workloads and data volumes, so that it can empower a broad group of stakeholders: operational, regulatory, and compliance analytics (profit and loss, claims, International Financial Reporting Standards, etc.), and data science and power users (back-office analytics, customer churn, segmentation, underwriting risk management, etc.). Data Lakehouse architecture offers an effective solution to these diversified data and aggregation requirements through a spectrum of inbuilt functionalities and highly optimized query engines that operate directly on open data formats, enabling flexibility and agility. Limiting the enterprise to Data Lake or Data Warehouse architectures would mean heavy data pipelines, longer time to realization, and a trade-off between constrained and standardized data, to name a few drawbacks.
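As a hedged illustration of serving these different stakeholders from one layer, the sketch below continues the earlier example (the Delta-enabled Spark session and the claims table); the report logic, column names, and feature preparation are illustrative assumptions.

    # One governed, open-format table serving both BI reporting and data science.
    # Continues the earlier sketch; the Delta-enabled session, the claims table,
    # and column names (line_of_business, paid_amount, member_id, claim_id) are assumed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # reuses the Delta-enabled session

    # BI-style aggregation (e.g., profit-and-loss or claims reporting), executed
    # by the optimized SQL engine directly on the Lakehouse table.
    spark.sql("""
        SELECT line_of_business, SUM(paid_amount) AS total_paid
        FROM claims
        GROUP BY line_of_business
    """).show()

    # Data-science-style access to the same table through the DataFrame API,
    # e.g., preparing per-member features for an underwriting-risk or churn model.
    features = (
        spark.table("claims")
             .groupBy("member_id")
             .agg({"paid_amount": "sum", "claim_id": "count"})
    )
    features.show(5)

Both workloads read the same governed table, so there is no separate mart to build or keep in sync for either audience.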

 

Is Lakehouse the way forward for all enterprises?

Shifting to Lakehouse would be an ideal approach for organizations that are looking for any or all of the features below:

  • Faster time to market: New datasets can be adopted immediately, eliminating the need for complex data pipeline development.
  • Ease of access: As the data is available in open formats, it can be easily accessed by a variety of analytical applications.
  • Hard-to-append data: ACID properties applied at the raw layer eliminate the issue of incorrect reads.
  • Being compliant: Regulations such as the General Data Protection Regulation (GDPR) require making fine-grained changes, such as deleting an individual's records, to an existing Data Lake.
  • Maintaining historical versions of data: With transaction logs maintained instead of in-place updates, users can query the required version of the data, providing fallback options and satisfying audit requirements (illustrated in the sketch after this list).
  • Eliminating partial data issues: Ensures that a load either completes fully or fails entirely, rather than stopping halfway.
  • Eliminating data swamps: Allows faster resolution of sync issues without any dependencies, made possible by entrusting responsibility to the public cloud provider.
  • Reduced movement of data: Data is unified and easily accessible, eliminating the need to migrate it into different layers and thus reducing the transfer and duplication of large amounts of data.
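The sketch below illustrates, in a hedged way, several of the features above using the Delta Lake table from the earlier examples: a fine-grained GDPR-style delete, a time-travel query against the transaction log, and an atomic merge. The table path, column names, and conditions are illustrative assumptions.

    # Compliance, versioning, and atomic-load features on the claims table.
    # Continues the earlier sketches; paths, columns, and conditions are assumed.
    from delta.tables import DeltaTable
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()  # reuses the Delta-enabled session
    claims = DeltaTable.forPath(spark, "s3://example-bucket/lakehouse/claims")

    # "Being compliant": a fine-grained, ACID delete for a right-to-erasure
    # request, without rewriting the lake by hand.
    claims.delete("member_id = 'M-12345'")

    # "Maintaining historical versions": the transaction log lets users query
    # an earlier version of the table for fallback or audit purposes.
    claims_v0 = (spark.read.format("delta")
                 .option("versionAsOf", 0)
                 .load("s3://example-bucket/lakehouse/claims"))

    # "Eliminating partial data issues": a merge commits fully or not at all,
    # so readers never observe a half-loaded batch.
    updates = spark.read.json("s3://example-bucket/raw/claims_updates/")
    (claims.alias("t")
           .merge(updates.alias("u"), "t.claim_id = u.claim_id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())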

 

The bottom line

Lakehouse is a paradigm shift in Data Architecture, leveraging technology advancements in infrastructure provisioning and software services. It is not just a replacement for Data Warehouses or Data Lakes, but an integrated solution for both within permissible volumes and latency. It is ideal for organizations that bank heavily on data-driven decisions and require agility in adopting new sources. Lakehouse enables a unified architecture for business intelligence and exploratory analytics with low data latency, making it a favorable strategic investment that offers an enterprise a more comprehensive and sustainable solution at reduced cost.

At Deloitte, we not only train our professionals to uncover newer, innovative ways of managing data, but also give them implementation experience in applying these more efficient approaches to some of our clients' biggest challenges. Just as Lakehouse sets Data Architecture apart for various reasons, we have other areas of specialization that we encourage our professionals to explore, supporting one of our key pillars: pursuing innovation in technology.

 

About the authors:

Ashakiran Vemulapalli is a Specialist Master in the Analytics & Cognitive (A&C) group at Deloitte Consulting India Private Limited. He has vast experience in the health care industry, coupled with Enterprise Architect skills and a specialization in providing solutions for large-scale enterprises in the areas of Cloud, Analytics, and Big Data.

Tulasi Rapeti is a Manager in the Analytics & Cognitive (A&C) group at Deloitte Consulting India Private Limited. He specializes in building data and analytical solutions in the health care domain. He has immense experience in resolving the complexities involved in building Enterprise Data Warehouses and Data Lakes, and has also built cost-efficient solutions in the cloud.

Avinab Chakraborty is a Data Engineer in the Analytics & Cognitive (A&C) group at Deloitte Consulting India Private Limited. He has extensive experience in the design and implementation of cloud-based Data Warehouses and in platform modernization. He has led multiple large-scale data migration and data analytics engagements and has gained immense experience in handling big data problems both on-premises and in the Cloud.

Key Contacts

Arunabha Mookerjea

Specialist Leader

Arunabha Mookerjea is a specialist leader and distinguished cloud architect in the Strategy and Analytics practice at Deloitte Consulting India Private Limited. He specializes in technology advisory, solution architectures, and directing large-scale delivery in next-generation areas of cloud and big data platforms, IoT, microservices, and digital core solutions. Arunabha is a member of the Next Gen Architecture Program.

Chandra Narra

Leader | Analytics & Cognitive, Consulting

Chandra Narra is the Leader for the Analytics & Cognitive (A&C) group at Deloitte Consulting India Private Limited. He has 18+ years of extensive global consulting and technology experience. He is a certified data scientist and subject matter specialist in Artificial Intelligence and advanced data management technologies. He delivers advanced analytics solutions to help organizations unlock and monetize the full potential of their data through innovation and exponential technologies.