

Simply put: Data Lake + Data Warehouse = Data Lakehouse.

Traditional data warehouses are designed to provide a platform for storing historical data that has been transformed/aggregated for specific use cases/data domains, to be used in conjunction with BI tools to derive insights. Typically, data warehouses contain only structured data, are not cost-effective, and are loaded using batch ETL jobs.

Data lakes were introduced to overcome some of these limitations by supporting structured, semi-structured, and unstructured data with low-cost storage, as well as enabling both batch and streaming pipelines. In comparison to data warehouses, data lakes contain raw data in multiple storage formats, which can be used for current and future use cases. However, data lakes still have limitations, including the lack of transaction support (which makes it difficult to keep data lakes up to date) and the lack of ACID compliance (which prevents concurrent reads and writes).

Data lakehouses reap the low-cost storage benefits of data lakes, such as S3, GCS, and Azure Blob Storage, along with the data structures and data management capabilities of a data warehouse. A lakehouse overcomes the limitations of data lakes by supporting ACID transactions and ensuring consistency of the data as it is concurrently read from and updated. Additionally, lakehouses enable data consumption at lower latency and higher velocity than a traditional data warehouse because the data can be queried directly from the data lakehouse. Key features of a data lakehouse (referenced from What is a Data Lakehouse?) include support for diverse data types, ranging from unstructured to structured data.

To build a data lakehouse, an incremental data processing framework, such as Apache Hudi, is required. Apache Hudi, which stands for Hadoop Upserts Deletes Incrementals, is an open-source framework developed by Uber in 2016 that manages the storage of large datasets on distributed file systems, such as cloud stores, HDFS, or any other Hadoop FileSystem compatible storage. It enables atomicity, consistency, isolation, and durability (ACID) transactions in a data lake.
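To make the upsert workflow concrete, below is a minimal PySpark sketch of writing new and changed records into a Hudi table on lake storage. It is illustrative only: it assumes Spark is running with the Hudi Spark bundle on the classpath, and the table name, columns, bucket path, and record values are hypothetical examples rather than anything from this article.

```python
# Minimal sketch of a Hudi upsert with PySpark.
# Assumes the Hudi Spark bundle is on the classpath; exact session
# configuration depends on the Hudi and Spark versions in use.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Hypothetical storage location; could also be an HDFS path.
base_path = "s3a://my-bucket/lakehouse/trips"

# New and changed records; Hudi inserts new keys and updates existing ones.
updates = spark.createDataFrame(
    [
        ("trip-001", "2024-01-01", 9.50, 1),
        ("trip-002", "2024-01-01", 14.25, 1),
    ],
    ["trip_id", "trip_date", "fare", "ts"],
)

hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",        # unique key per record
    "hoodie.datasource.write.partitionpath.field": "trip_date",  # physical partitioning
    "hoodie.datasource.write.precombine.field": "ts",            # tie-breaker for duplicate keys
    "hoodie.datasource.write.operation": "upsert",
}

(updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save(base_path))
```

The record key determines whether an incoming row updates an existing record or inserts a new one, while the precombine field decides which version wins if the same key appears more than once in a batch.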

Efforts from Uber, Databricks, and Netflix have resulted in solutions aiming to address the challenges that data engineers face. Apache Hudi (Uber), Delta Lake (Databricks), and Apache Iceberg (Netflix) are incremental data processing frameworks meant to perform upserts and deletes in the data lake on a distributed file system, such as S3 or HDFS. The culmination of these efforts is the next generation of the data lake, meant to deliver up-to-date data in a scalable, adaptable, and reliable manner: the Data Lakehouse.
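Since these frameworks also handle deletes and allow querying the latest table state directly from lake storage, here is a continuation of the hypothetical sketch above: issuing a hard delete against the same Hudi table and then running a snapshot query from the storage path. Again, this is a sketch under the same assumptions (the Spark session, table options, and base_path defined earlier), not a prescribed workflow.

```python
# Hard delete of specific records from the hypothetical "trips" Hudi table,
# followed by a snapshot read straight from lake storage.
to_delete = spark.createDataFrame(
    [("trip-002", "2024-01-01", 1)],
    ["trip_id", "trip_date", "ts"],
)

delete_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "trip_id",
    "hoodie.datasource.write.partitionpath.field": "trip_date",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "delete",  # removes records matching these keys
}

(to_delete.write.format("hudi")
    .options(**delete_options)
    .mode("append")
    .save(base_path))

# Snapshot query: reads the latest committed state of the table from storage.
spark.read.format("hudi").load(base_path).createOrReplaceTempView("trips_snapshot")
spark.sql("SELECT trip_id, fare FROM trips_snapshot WHERE trip_date = '2024-01-01'").show()
```

Because the latest committed state lives in the lake itself, downstream consumers can query it in place rather than first loading it into a separate warehouse.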


Lack of support for ACID transactions: the inability to enforce ACID compliance can lead to inconsistent results when there are concurrent readers and writers.

The challenges above are further complicated by the increase in data volume and the frequency at which this data needs to be kept up to date.
